Variance of a vector: motivation and visualization
I always show my students the definition of the variance of a vector, and they usually don't pay attention. Yet you need to know what it is already at the level of simple regression (to understand the derivation of the variance of the slope estimator), and even more so when you deal with time series. Since I know exactly where students usually stumble, this post is structured as a series of questions and answers.
Question 1. Think about it: let
$$X=\begin{pmatrix}X_1\\ \vdots\\ X_n\end{pmatrix}$$
be a vector with $n$ components, each of which is a random variable. How would you define its variance?

The answer is not straightforward because we don't know how to square a vector. Let $X^T=(X_1,\dots,X_n)$ denote the transposed vector. There are two ways to multiply a vector by itself:
$$X^TX\quad\text{and}\quad XX^T.\qquad(1)$$
Question 2. Find the dimensions of $X^TX$ and $XX^T$ and their expressions in terms of the coordinates of $X$.
Answer 2. For a product $AB$ of matrices there is a compatibility rule that I write in the form
$$A_{n\times m}B_{m\times k}=(AB)_{n\times k}.\qquad(2)$$
Recall that in this notation $A_{n\times m}$ means that the matrix $A$ has $n$ rows and $m$ columns. For example, $X$ is of size $n\times 1$. Verbally, the above rule says that the number of columns of $A$ should be equal to the number of rows of $B$. In the product that common number ($m$) disappears and the unique numbers ($n$ and $k$) give, respectively, the number of rows and columns of $AB$. Isn't the formula (2) easier to remember than the verbal statement? From (2) we see that $X^TX$ is of dimension $1\times 1$ (it is a scalar) and $XX^T$ is an $n\times n$ matrix.
For actual multiplication of matrices I use the following visualization.

Short formulation. Multiply rows from the first matrix by columns from the second one.

Long formulation. To find the element $(AB)_{ij}$ of $AB$, we find the scalar product of the $i$th row of $A$ and the $j$th column of $B$. To find all elements in the $i$th row of $AB$, we fix the $i$th row in $A$ and move right along the columns in $B$. Alternatively, to find all elements in the $j$th column of $AB$, we fix the $j$th column in $B$ and move down along the rows in $A$. Using this rule, we have
$$X^TX=X_1^2+\dots+X_n^2,\qquad(3)$$
$$XX^T=\begin{pmatrix}X_1^2&X_1X_2&\dots&X_1X_n\\ X_2X_1&X_2^2&\dots&X_2X_n\\ \vdots&\vdots&\ddots&\vdots\\ X_nX_1&X_nX_2&\dots&X_n^2\end{pmatrix}.\qquad(4)$$
Usually students have problems with the second equation.
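If you like to verify such things on a computer, here is a quick check in Python with numpy. This is only an illustration with an arbitrary numerical vector; the post does not depend on it.

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])    # a column vector, n = 3

inner = X.T @ X                        # 1 x 1: the scalar X_1^2 + ... + X_n^2
outer = X @ X.T                        # n x n: element (i, j) equals X_i X_j

print(inner.shape, outer.shape)        # (1, 1) (3, 3), as the compatibility rule predicts
print(float(inner[0, 0]))              # 14.0 = 1 + 4 + 9
print(outer)                           # symmetric matrix of the products X_i X_j
```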
Based on the two products in (1), written out in (3) and (4), we have two candidates to define variance:
$$V(X)=E(X-EX)^T(X-EX)\qquad(5)$$
and
$$V(X)=E(X-EX)(X-EX)^T.\qquad(6)$$

Answer 1. The second definition contains more information, in the sense to be explained below, so we define the variance of a vector by (6).
Question 3. Find the elements of this matrix.
Answer 3. The variance of a vector has the variances of its components on the main diagonal and the covariances outside it:
$$V(X)=\begin{pmatrix}V(X_1)&\operatorname{Cov}(X_1,X_2)&\dots&\operatorname{Cov}(X_1,X_n)\\ \operatorname{Cov}(X_2,X_1)&V(X_2)&\dots&\operatorname{Cov}(X_2,X_n)\\ \vdots&\vdots&\ddots&\vdots\\ \operatorname{Cov}(X_n,X_1)&\operatorname{Cov}(X_n,X_2)&\dots&V(X_n)\end{pmatrix}.\qquad(7)$$
If you can't get this on your own, go back to Answer 2.
There is a matrix operation called trace and denoted $\operatorname{tr}$. It is defined only for square matrices and gives the sum of the diagonal elements of a matrix.

Exercise 1. Show that $\operatorname{tr}(V(X))=\sum_{i=1}^nV(X_i)=E(X-EX)^T(X-EX)$. In this sense definition (6) is more informative than (5): the scalar (5) is recovered from the matrix (6) by taking the trace, while (6) cannot be recovered from (5).
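To make Answer 3 and Exercise 1 tangible, here is a small simulation sketch in Python. The mixing matrix below is an arbitrary choice of mine, used only to create correlated components.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000                            # number of simulated draws of the vector
A = np.array([[1.0, 0.0, 0.0],         # arbitrary mixing matrix to correlate components
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
X = rng.standard_normal((N, 3)) @ A.T  # each row is one draw of the random vector

V = np.cov(X, rowvar=False)            # sample analog of E(X - EX)(X - EX)^T
print(np.diag(V))                      # variances of the components (main diagonal)
print(V)                               # covariances off the diagonal; V is symmetric

# Exercise 1: the scalar E(X - EX)^T (X - EX) equals tr(V(X))
centered = X - X.mean(axis=0)
scalar_candidate = (centered ** 2).sum(axis=1).mean()
print(np.isclose(scalar_candidate, np.trace(V), rtol=1e-3))   # True
```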
Question 4. Note that the matrix (7) is symmetric: the elements above the main diagonal equal their mirror siblings below that diagonal, because $\operatorname{Cov}(X_i,X_j)=\operatorname{Cov}(X_j,X_i)$. Now expand the variance of the sum of the components:
$$V(X_1+\dots+X_n)=\sum_{i=1}^nV(X_i)+\sum_{i\neq j}\operatorname{Cov}(X_i,X_j).\qquad(8)$$
By symmetry, the terms in the second sum on the right of (8) are repeated twice. If you group equal terms in (8), what do you get?

Answer 4. The idea is to write
$$\sum_{i\neq j}\operatorname{Cov}(X_i,X_j)=2\sum_{i<j}\operatorname{Cov}(X_i,X_j),$$
that is, to join equal elements above and below the main diagonal in (7). For this, you need to figure out how to write the sum of the elements that are above the main diagonal. Make a bigger version of (7) (with more off-diagonal elements) to see that the elements above the main diagonal are listed in the sum $\sum_{i<j}\operatorname{Cov}(X_i,X_j)$, which can also be written as $\sum_{i=1}^{n-1}\sum_{j=i+1}^n\operatorname{Cov}(X_i,X_j)$. Hence, (8) is the same as
$$V(X_1+\dots+X_n)=\sum_{i=1}^nV(X_i)+2\sum_{i=1}^{n-1}\sum_{j=i+1}^n\operatorname{Cov}(X_i,X_j).\qquad(9)$$
Unlike the simpler rule $V(X_1+\dots+X_n)=\sum_{i=1}^nV(X_i)$, which requires uncorrelated components, equation (9) is applicable when there is autocorrelation.
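A numerical sanity check of the grouping, again a sketch of mine with an arbitrary correlation structure:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 100_000, 4
M = rng.uniform(0.2, 1.0, (n, n))      # arbitrary mixing to create correlated components
X = rng.standard_normal((N, n)) @ M

V = np.cov(X, rowvar=False)
var_of_sum = X.sum(axis=1).var(ddof=1) # sample V(X_1 + ... + X_n)

sum_of_variances = np.diag(V).sum()
upper = np.triu(V, k=1).sum()          # sum of the covariances above the main diagonal
print(np.isclose(var_of_sum, sum_of_variances + 2 * upper))   # True
```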
Suppose we are observing two stocks and their respective returns are $R_t^{(1)}$ and $R_t^{(2)}$. A vector autoregression for the pair $(R_t^{(1)},R_t^{(2)})$ is one way to take into account their interdependence. This theory is undeservedly omitted from the Guide by A. Patton.
Matrix multiplication is a little more complex. Make sure to read Global idea 2 and the compatibility rule.
The general approach to studying matrices is to compare them to numbers. Here you see the first big No: matrices do not commute, that is, in general $AB\neq BA$.
The idea behind matrix inversion is pretty simple: we want an analog of the property $aa^{-1}=1$ that holds for nonzero numbers.
Some facts about determinants have very complicated proofs and it is best to stay away from them. But a couple of ideas should be clear from the very beginning. Determinants are defined only for square matrices. The relationship of determinants to matrix invertibility explains the role of determinants. If $A$ is square, it is invertible if and only if $\det A\neq 0$ (this is an equivalent of the condition $a\neq 0$ for numbers).
Here is an illustration of how determinants are used. Suppose we need to solve the equation $AX=B$ for $X$, where $A$ and $B$ are known. Assuming that $\det A\neq 0$, we can premultiply the equation by $A^{-1}$ to obtain $A^{-1}AX=A^{-1}B$. (Because of the lack of commutativity, we need to keep the order of the factors.) Using the intuitive properties $A^{-1}A=I$ and $IX=X$, we obtain the solution $X=A^{-1}B$. In particular, we see that if $\det A\neq 0$, then the equation $AX=B$ has a unique solution $X=A^{-1}B$.
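In Python this illustration looks as follows. A sketch with arbitrary random matrices; as the comment notes, in numerical work one prefers a solver to an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 2))

print(np.linalg.det(A))               # nonzero, so A is invertible
X = np.linalg.inv(A) @ B              # X = A^{-1} B; mind the order of the factors
print(np.allclose(A @ X, B))          # True: X indeed solves AX = B

# In numerical work one prefers solve() to forming the inverse explicitly
print(np.allclose(np.linalg.solve(A, B), X))   # True
```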
Let $A$ be a square matrix and let $X,Y$ be two vectors. $A$ and $Y$ are assumed to be known and $X$ is unknown. We want to check that $X=\sum_{j=0}^\infty A^jY$ solves the equation $X=AX+Y$, assuming the series converges. (Note that for this equation the trick used to solve $AX=B$ does not work.) Just plug the candidate into the equation:
$$AX+Y=A\sum_{j=0}^\infty A^jY+Y=\sum_{j=1}^\infty A^jY+Y=\sum_{j=0}^\infty A^jY=X$$
(write out a couple of the first terms in the sums if summation signs frighten you).
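Here is a sketch of this verification in Python. I scale the matrix so that the series converges; for the infinite sum to make sense the eigenvalues of $A$ must lie inside the unit circle.

```python
import numpy as np

rng = np.random.default_rng(3)
A = 0.2 * rng.uniform(-1, 1, (3, 3))   # small entries keep eigenvalues inside the unit circle
Y = rng.standard_normal((3, 1))

X = np.zeros((3, 1))                   # accumulate Y + AY + A^2 Y + ...
term = Y.copy()
for _ in range(200):
    X += term
    term = A @ term

print(np.allclose(X, A @ X + Y))                          # True: X solves X = AX + Y
print(np.allclose(X, np.linalg.solve(np.eye(3) - A, Y)))  # same as (I - A)^{-1} Y
```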
Transposition is a geometrically simple operation. We need only the property $(AB)^T=B^TA^T$.
Variance and covariance
Property 1. The variance of a random vector $X$ and the covariance of two random vectors $X$ and $Y$ are defined by
$$V(X)=E(X-EX)(X-EX)^T,\qquad \operatorname{Cov}(X,Y)=E(X-EX)(Y-EY)^T.$$
Note that when $Y=X$, covariance becomes variance: $\operatorname{Cov}(X,X)=V(X)$.
Property 2. Let $X,Y$ be random vectors and suppose $A,B$ are constant matrices of compatible dimensions. We want an analog of the scalar property $\operatorname{Cov}(aX,bY)=ab\operatorname{Cov}(X,Y)$. In the next calculation we have to remember that the multiplication order cannot be changed; by the transposition property, $(B(Y-EY))^T=(Y-EY)^TB^T$:
$$\operatorname{Cov}(AX,BY)=E\big(A(X-EX)\big)\big(B(Y-EY)\big)^T=A\,E(X-EX)(Y-EY)^T\,B^T=A\operatorname{Cov}(X,Y)B^T.$$
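A simulation check of Property 2; the matrices and the way $X$ and $Y$ are made correlated below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200_000
Z = rng.standard_normal((N, 4))
X, Y = Z[:, :3], Z[:, 2:]              # X is 3-dim, Y is 2-dim; they share one coordinate

A = rng.standard_normal((2, 3))
B = rng.standard_normal((4, 2))

def cov(U, W):
    """Sample analog of Cov(U, W) = E(U - EU)(W - EW)^T; rows are observations."""
    Uc, Wc = U - U.mean(axis=0), W - W.mean(axis=0)
    return Uc.T @ Wc / (len(U) - 1)

lhs = cov(X @ A.T, Y @ B.T)            # sample Cov(AX, BY)
rhs = A @ cov(X, Y) @ B.T
print(np.allclose(lhs, rhs))           # True: the identity is exact for sample covariances too
```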
Distribution of the estimator of the error variance
If you are reading the book by Dougherty: this post is about the distribution of the estimator $s^2$ defined in Chapter 3.
Consider the linear model
$$y=X\beta+e,\qquad(1)$$
where the deterministic matrix $X$ is of size $n\times k$ and satisfies $\det(X^TX)\neq 0$ (regressors are not collinear), and the error $e$ satisfies
$$Ee=0,\qquad V(e)=\sigma^2I.\qquad(2)$$

$\beta$ is estimated by $\hat\beta=(X^TX)^{-1}X^Ty$. Denote $P=X(X^TX)^{-1}X^T$ and $Q=I-P$. Using (1) and $QX=0$ we see that the residual $r=y-X\hat\beta$ equals $Qe$, and $\sigma^2$ is estimated by
$$s^2=\frac{\|r\|^2}{n-k}=\frac{\|Qe\|^2}{n-k}.\qquad(3)$$
$Q$ is a projector: it has the properties $Q^T=Q$ and $Q^2=Q$, which are derived from those of $P$ ($P^T=P$, $P^2=P$).

If $\lambda$ is an eigenvalue of $Q$, $Qx=\lambda x$ with $x\neq 0$, then multiplying by $Q$ and using the fact that $Q^2=Q$ we get $\lambda x=Qx=Q^2x=\lambda^2x$. Hence the eigenvalues of $Q$ can be only $0$ or $1$. Since the trace equals the sum of the eigenvalues, the equation
$$\operatorname{tr}(Q)=\operatorname{tr}(I)-\operatorname{tr}(P)=n-k$$
(by the trace-commuting property below, $\operatorname{tr}(P)=\operatorname{tr}\big((X^TX)^{-1}X^TX\big)=\operatorname{tr}(I_k)=k$) tells us that the number of eigenvalues equal to 1 is $n-k$ and the remaining $k$ are zeros. Let
$$Q=UDU^T\qquad(4)$$
be the diagonal representation of $Q$. Here $U$ is an orthogonal matrix,
$$UU^T=U^TU=I,\qquad(5)$$
and $D$ is a diagonal matrix with the eigenvalues of $Q$ on the main diagonal. We can assume that the first $n-k$ numbers on the diagonal of $D$ are ones and the others are zeros.
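These properties are easy to verify numerically; below is a sketch of mine with an arbitrary design matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 20, 3
X = rng.standard_normal((n, k))                    # arbitrary design matrix, full column rank

P = X @ np.linalg.inv(X.T @ X) @ X.T
Q = np.eye(n) - P

print(np.allclose(Q, Q.T), np.allclose(Q @ Q, Q))  # symmetric and idempotent
eig = np.linalg.eigvalsh(Q)                        # sorted in ascending order
print(np.allclose(eig[:k], 0), np.allclose(eig[k:], 1))   # k zeros and n - k ones
print(np.isclose(np.trace(Q), n - k))              # the trace counts the unit eigenvalues
```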
Theorem. Let $e$ be normal. 1) $s^2(n-k)/\sigma^2$ is distributed as $\chi^2_{n-k}$. 2) The estimators $\hat\beta$ and $s^2$ are independent.
Proof. 1) Using the projector properties and (4) we have
$$\|Qe\|^2=e^TQ^TQe=e^TQe=e^TUDU^Te.\qquad(6)$$

Denote $\delta=U^Te$. From (2) and (5), $E\delta=0$ and $V(\delta)=U^TV(e)U=\sigma^2U^TU=\sigma^2I$, and $\delta$ is normal as a linear transformation of a normal vector. It follows that $\delta=\sigma z$, where $z$ is a standard normal vector with independent standard normal coordinates $z_1,\dots,z_n$. Hence, (6) implies
$$\|Qe\|^2=\delta^TD\delta=\sigma^2z^TDz=\sigma^2(z_1^2+\dots+z_{n-k}^2)\sim\sigma^2\chi^2_{n-k}.\qquad(7)$$

Now (3) and (7) prove the first statement.
2) First we note that the vectors $Pe$ and $Qe$ are independent. Since they are normal, their independence follows from uncorrelatedness:
$$\operatorname{Cov}(Pe,Qe)=P\,V(e)\,Q^T=\sigma^2PQ=\sigma^2P(I-P)=0.$$

It's easy to see that $X^TP=X^T$. This allows us to show that $\hat\beta$ is a function of $Pe$:
$$\hat\beta=(X^TX)^{-1}X^Ty=\beta+(X^TX)^{-1}X^Te=\beta+(X^TX)^{-1}X^TPe.$$

Independence of $Pe$ and $Qe$ leads to independence of their functions $\hat\beta$ and $s^2$.
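The whole theorem can be checked by simulation. The sketch below is my illustration, with arbitrary sample sizes and parameter values; it compares the simulated statistic with the moments of $\chi^2_{n-k}$ and looks at the correlation between a coordinate of $\hat\beta$ and $s^2$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, sigma = 30, 3, 2.0
X = rng.standard_normal((n, k))                # fixed design matrix
beta = np.array([1.0, -2.0, 0.5])

reps = 20_000
stat = np.empty(reps)                          # values of (n - k) s^2 / sigma^2
b1 = np.empty(reps)                            # first coordinate of beta-hat
for r in range(reps):
    e = sigma * rng.standard_normal(n)         # normal errors, V(e) = sigma^2 I
    y = X @ beta + e
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    stat[r] = resid @ resid / sigma**2         # equals (n - k) s^2 / sigma^2
    b1[r] = beta_hat[0]

# chi^2_{n-k} has mean n - k and variance 2(n - k)
print(stat.mean(), n - k)                      # close to each other
print(stat.var(), 2 * (n - k))                 # close to each other
# independence implies zero correlation between beta-hat and s^2
print(abs(np.corrcoef(b1, stat)[0, 1]) < 0.03) # True up to sampling error
```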
Since a projector $P$ maps to zero all elements of its null space $N(P)$, $\lambda=0$ is another eigenvalue. If $x_1,\dots,x_m$ is any orthonormal system in $N(P)$, each of the $x_i$ is an eigenvector of $P$ corresponding to the eigenvalue $0$.

A projector cannot have eigenvalues other than $0$ and $1$. This is proved as follows. Suppose $Px=\lambda x$ with some nonzero $x$. Applying $P$ to both sides of this equation, we get $\lambda x=Px=P^2x=\lambda Px=\lambda^2x$. It follows that $\lambda x=\lambda^2x$ and (because $x\neq 0$) $\lambda=\lambda^2$. The last equation has only two roots: $0$ and $1$.
Recall that for a square matrix, its trace is defined as the sum of its diagonal elements.
Exercise 2. Prove that $\operatorname{tr}(AB)=\operatorname{tr}(BA)$ if both products $AB$ and $BA$ are square. It is convenient to call this property trace-commuting (we know that in general matrices do not commute).

Proof. Assume that $A$ is of size $n\times m$ and $B$ is of size $m\times n$. For both products we need only to find the diagonal elements:
$$(AB)_{ii}=\sum_{j=1}^ma_{ij}b_{ji},\qquad (BA)_{jj}=\sum_{i=1}^nb_{ji}a_{ij}.$$

All we have to do is change the order of summation:
$$\operatorname{tr}(AB)=\sum_{i=1}^n\sum_{j=1}^ma_{ij}b_{ji}=\sum_{j=1}^m\sum_{i=1}^nb_{ji}a_{ij}=\operatorname{tr}(BA).$$
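The check fits in a couple of lines of numpy; the einsum call below spells out exactly the double sum $\sum_i\sum_j a_{ij}b_{ji}$ from the proof (an illustration of mine with arbitrary matrices).

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))

# AB is 3x3 and BA is 5x5, yet their traces coincide
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))             # True
# einsum computes the double sum of a_ij * b_ji from the proof
print(np.isclose(np.trace(A @ B), np.einsum("ij,ji->", A, B)))  # True
```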
Exercise 3. Find the trace of a projector.
Solution. We established above that a projector has only the eigenvalues $0$ and $1$. A projector $P$ is symmetric, so in its diagonal representation $P=UDU^T$ there are unities and zeros on the main diagonal of the diagonal matrix $D$. By Exercise 2,
$$\operatorname{tr}(P)=\operatorname{tr}(UDU^T)=\operatorname{tr}(DU^TU)=\operatorname{tr}(D),$$
which equals the number of unit eigenvalues.
In the case of many dimensions, we follow the same idea. Before doing that, we state without proof two useful facts about independence of random variables (real-valued, not vectors).

Theorem 1. Suppose the variables $X_1,\dots,X_n$ have densities $p_1,\dots,p_n$. Then they are independent if and only if their joint density $p$ is a product of the individual densities:
$$p(x_1,\dots,x_n)=p_1(x_1)\cdots p_n(x_n).$$

Theorem 2. If the variables $X$ and $Y$ are normal, then they are independent if and only if they are uncorrelated: $\operatorname{Cov}(X,Y)=0$.
The necessity part (independence implies uncorrelatedness) is trivial.
Let $z_1,\dots,z_n$ be independent standard normal variables. A standard normal variable is defined by its density $\frac{1}{\sqrt{2\pi}}e^{-t^2/2}$, so all of the $z_i$ have the same density. We achieve independence, according to Theorem 1, by defining their joint density to be a product of the individual densities.

Definition 1. A standard normal vector of dimension $n$ is defined by $z=(z_1,\dots,z_n)^T$, where the $z_i$ are independent standard normal variables. It satisfies $Ez=0$ and $V(z)=I$.

Definition 2. For a matrix $A$ and a vector $\mu$ of compatible dimensions, a normal vector is defined by $X=Az+\mu$. Then $EX=\mu$ and $V(X)=AV(z)A^T=AA^T$, which is symmetric and nonnegative definite (recall that the variance of a vector is always nonnegative: $a^TV(X)a=V(a^TX)\ge0$ for any constant vector $a$).
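A quick simulation of Definitions 1 and 2; the matrix $A$ and the vector $\mu$ below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(8)
n, N = 3, 200_000
A = np.array([[1.0, 0.0, 0.0],         # arbitrary matrix for Definition 2
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 0.4]])
mu = np.array([1.0, -1.0, 2.0])        # arbitrary mean vector

z = rng.standard_normal((N, n))        # rows: draws of a standard normal vector
X = z @ A.T + mu                       # draws of the normal vector X = Az + mu

print(np.allclose(X.mean(axis=0), mu, atol=0.02))                 # EX = mu
print(np.allclose(np.cov(X, rowvar=False), A @ A.T, atol=0.02))   # V(X) = A A^T
```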
Distributions derived from normal variables
In the definitions of the standard distributions (chi-square, t distribution and F distribution) there is no reference to any sample data. Unlike statistics, which by definition are functions of sample data, these and other standard distributions are theoretical constructs. Statistics are developed in such a way as to have a distribution equal or asymptotically equal to one of the standard distributions. This allows practitioners to use the tables developed for the standard distributions.
Exercise 1. Prove that $\chi^2_n/n$ converges to 1 in probability as $n\to\infty$.

Proof. For a standard normal $z$ we have $Ez^2=1$ and $Ez^4=3$ (both properties can be verified in Mathematica). Hence, writing $\chi^2_n=z_1^2+\dots+z_n^2$ with independent standard normal $z_i$,
$$E\frac{\chi^2_n}{n}=1,\qquad V\left(\frac{\chi^2_n}{n}\right)=\frac{nV(z_1^2)}{n^2}=\frac{Ez^4-(Ez^2)^2}{n}=\frac{2}{n}\to0,$$
and convergence in probability follows from the Chebyshev inequality.
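A simulation makes the convergence visible; the sample sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)
N = 5_000                                  # Monte Carlo draws per value of n
for n in (10, 100, 1000):
    chi2 = (rng.standard_normal((N, n)) ** 2).sum(axis=1)   # draws of chi^2_n
    ratio = chi2 / n
    # E(chi^2_n / n) = 1 and V(chi^2_n / n) = 2/n, which shrinks to 0
    print(n, round(ratio.mean(), 3), round(ratio.var(), 4), 2 / n)
```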
Students of FN3142 often think that they can get by with a few technical tricks. The questions below are mostly about the intuition that helps to understand and apply those tricks.
Everywhere we assume that $y_t$ is a time series and $I_t$ is a sequence of corresponding information sets. It is natural to assume that $I_t\subset I_{t+1}$ for all $t$. We use the short conditional expectation notation: $E_tX=E(X|I_t)$.
Question 1. How do you calculate conditional expectation in practice?
Question 2. How do you explain the projector property $E_t(E_tX)=E_tX$?
Question 3. Simplify each of $E_tE_{t+1}X$ and $E_{t+1}E_tX$ and explain the results intuitively.
Question 4. is a shock at time . Positive and negative shocks are equally likely. What is your best prediction now for tomorrow's shock? What is your best prediction now for the shock that will happen the day after tomorrow?
Question 5. How and why do you predict $y_{t+1}$ at time $t$? What is the conditional mean of your prediction?
Question 6. What is the error of such a prediction? What is its conditional mean?
Question 7. Answer the previous two questions replacing $y_{t+1}$ by $y_{t+2}$.
Question 8. What is the mean-plus-deviation-from-mean representation (conditional version)?
Question 9. How is the representation from Q.8 reflected in variance decomposition?
Question 10. What is a canonical form? State and prove all properties of its parts.
Question 11. Define the conditional variance for a white noise process and establish its link with the unconditional one.
Question 12. How do you define the conditional density in case of two variables, when one of them serves as the condition? Use it to prove the LIE.
Question 13. Write down the joint distribution function for a) independent observations and b) for serially dependent observations.
Question 14. If one variable is a linear function of another, what is the relationship between their densities?
Question 15. What is the relationship between a distribution function and its quasi-inverse? Explain geometrically the definition of the quasi-inverse function.
Answer 1. Conditional expectation is a complex notion. There are several definitions of differing levels of generality and complexity. See one of them here and another in Answer 12.
The point of this exercise is that any definition requires a lot of information, and in practice there is no way to apply any of them to actually calculate conditional expectation. Then why do they juggle conditional expectation in theory? The efficient market hypothesis comes to the rescue: it is posited that all observed market data incorporate all available information, and, in particular, stock prices are already conditioned on $I_t$.
Answer 4. Since positive and negative shocks are equally likely, the best prediction is $E_t\varepsilon_{t+1}=0$ (I call this equation a martingale condition). Similarly, $E_t\varepsilon_{t+2}=0$, but in this case I prefer to see an application of the LIE:
$$E_t\varepsilon_{t+2}=E_t(E_{t+1}\varepsilon_{t+2})=E_t(0)=0.$$
Answer 5. The best prediction is $\hat y_{t+1}=E_ty_{t+1}$ because it minimizes $E_t(y_{t+1}-f)^2$ among all functions $f$ of the current information $I_t$. Formally, you can use the first order condition
$$\frac{d}{df}E_t(y_{t+1}-f)^2=-2E_t(y_{t+1}-f)=0,$$
which gives $f=E_ty_{t+1}$. Since the prediction is a function of $I_t$, its conditional mean is itself: $E_t\hat y_{t+1}=\hat y_{t+1}$.
Answer 6. It is natural to define the prediction error by
$$\varepsilon_{t+1}=y_{t+1}-\hat y_{t+1}=y_{t+1}-E_ty_{t+1}.$$
By the projector property, $E_t\varepsilon_{t+1}=E_ty_{t+1}-E_t(E_ty_{t+1})=E_ty_{t+1}-E_ty_{t+1}=0$.
Answer 7. To generalize, just change the subscripts. For the prediction we have to use two subscripts: the notation $\hat y_{t,t+2}$ means that we are trying to predict what happens at the future date $t+2$ based on the info set $I_t$ (time $t$ is like today). Then by definition $\hat y_{t,t+2}=E_ty_{t+2}$, the prediction error is $\varepsilon_{t,t+2}=y_{t+2}-E_ty_{t+2}$, and, as before, $E_t\varepsilon_{t,t+2}=0$.
Answer 8. Answer 7, obviously, implies $y_{t+2}=E_ty_{t+2}+\varepsilon_{t,t+2}$: the variable equals its conditional mean plus the deviation from that mean. The simple case is here.
Answer 12. The conditional density is defined similarly to the conditional probability. Let $X,Y$ be two random variables. Denote by $p_X$ the density of $X$ and by $p_{X,Y}$ the joint density. Then the conditional density of $Y$ conditional on $X$ is defined as
$$p_{Y|X}(y|x)=\frac{p_{X,Y}(x,y)}{p_X(x)}.$$
After this we can define the conditional expectation $E(Y|X=x)=\int y\,p_{Y|X}(y|x)\,dy$. With these definitions one can prove the Law of Iterated Expectations:
$$E[E(Y|X)]=EY.$$
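The LIE is easy to check by simulation in a case where the conditional expectation is known explicitly. Here I assume the toy model $Y=2X+\text{noise}$, so that $E(Y|X)=2X$; this model is my illustration, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(10)
N = 500_000
X = rng.standard_normal(N)
Y = 2 * X + rng.standard_normal(N)     # a toy model in which E(Y | X) = 2X exactly

cond_exp = 2 * X                       # E(Y | X) as a function of X
print(np.isclose(cond_exp.mean(), Y.mean(), atol=0.01))   # LIE: E[E(Y|X)] = E[Y]
```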
Answer 12 is an illustration of Answer 1 and a prelude to Answer 13.
Answer 13. Understanding this answer is essential for Section 8.6 (on maximum likelihood) of Patton's guide.
a) In case of independent observations the joint density of the vector $X=(X_1,\dots,X_n)$ is a product of the individual densities:
$$p_{X_1,\dots,X_n}=p_{X_1}\cdots p_{X_n}.$$
b) In the time series context it is natural to assume that the next observation depends on the previous ones, that is, for each $t$, $X_t$ depends on $X_1,\dots,X_{t-1}$ (serially dependent observations). Therefore we should work with the conditional densities $p_{X_t|X_1,\dots,X_{t-1}}$. From Answer 12 we can guess how to make conditional densities appear:
$$p_{X_1,\dots,X_n}=\frac{p_{X_1,\dots,X_n}}{p_{X_1,\dots,X_{n-1}}}\cdot\frac{p_{X_1,\dots,X_{n-1}}}{p_{X_1,\dots,X_{n-2}}}\cdots\frac{p_{X_1,X_2}}{p_{X_1}}\cdot p_{X_1}.$$
The fractions on the right are recognized as conditional densities (see Answer 12). The resulting expression is pretty awkward:
$$p_{X_1,\dots,X_n}=p_{X_n|X_1,\dots,X_{n-1}}\cdot p_{X_{n-1}|X_1,\dots,X_{n-2}}\cdots p_{X_2|X_1}\cdot p_{X_1}.$$
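As an illustration of how this factorization is used in maximum likelihood (a sketch of mine, not taken from the guide): for a Gaussian AR(1) model the dependence is Markov, each conditional density depends only on the previous observation, and the log-likelihood is the sum of the logs of the factors above.

```python
import numpy as np

# Log-likelihood of a Gaussian AR(1): x_t = rho * x_{t-1} + eps_t, eps_t ~ N(0, 1).
# The joint density factorizes into p(x_1) * prod over t of p(x_t | x_{t-1}).
def ar1_loglik(x, rho):
    var1 = 1.0 / (1.0 - rho**2)        # stationary variance of x_1
    ll = -0.5 * (np.log(2 * np.pi * var1) + x[0]**2 / var1)
    resid = x[1:] - rho * x[:-1]       # conditional means are rho * x_{t-1}
    ll += -0.5 * np.sum(np.log(2 * np.pi) + resid**2)
    return ll

rng = np.random.default_rng(11)
x = np.empty(200)                      # simulate a stationary path with rho = 0.6
x[0] = rng.standard_normal() / np.sqrt(1 - 0.6**2)
for t in range(1, 200):
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()

# The log-likelihood should peak near the true rho = 0.6
rhos = np.linspace(-0.9, 0.9, 181)
print(rhos[np.argmax([ar1_loglik(x, r) for r in rhos])])
```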
Answer 14. The answer given here helps one understand how to pass from the density of the standard normal to that of the general normal.