10 Dec 18

Distributions derived from normal variables

Useful facts about independence

In the one-dimensional case the economical way to define normal variables is this: define a standard normal variable and then define a general normal variable as its linear transformation.

In the case of many dimensions, we follow the same idea. Before doing that, we state without proof two useful facts about independence of random variables (real-valued, not vectors).

Theorem 1. Suppose variables X_1,...,X_n have densities p_1(x_1),...,p_n(x_n). Then they are independent if and only if their joint density p(x_1,...,x_n) is a product of individual densities: p(x_1,...,x_n)=p_1(x_1)...p_n(x_n).

Theorem 2. If variables X,Y are jointly normal, then they are independent if and only if they are uncorrelated: cov(X,Y)=0.

The necessity part (independence implies uncorrelatedness) is trivial.

Normal vectors

Let z_1,...,z_n be independent standard normal variables. A standard normal variable is defined by its density, so all of z_i have the same density. We achieve independence, according to Theorem 1, by defining their joint density to be a product of individual densities.

Definition 1. A standard normal vector of dimension n is defined by

z=\left(\begin{array}{c}z_1\\ \vdots\\ z_n\end{array}\right)

Properties. Ez=0 because all of the z_i have mean zero. Further, cov(z_i,z_j)=0 for i\neq j by Theorem 2 (the trivial direction), and the variance of a standard normal is 1. Therefore, from the expression for the variance of a vector we see that Var(z)=I.

Definition 2. For a matrix A and a vector \mu of compatible dimensions, a normal vector is defined by X=Az+\mu.

Properties. EX=AEz+\mu=\mu and

Var(X)=Var(Az)=E(Az)(Az)^T=AEzz^TA^T=AIA^T=AA^T

(recall that the variance matrix of a vector is always non-negative, and indeed AA^T is non-negative).
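
For readers who like numerical checks, here is a minimal Python sketch of Definition 2 (the matrix A and the vector \mu below are arbitrary illustrative choices): the sample mean of simulated draws of X=Az+\mu is close to \mu and their sample covariance is close to AA^T.

```python
# Minimal simulation sketch: for X = A z + mu with z standard normal,
# the sample mean approaches mu and the sample covariance approaches A A^T.
# A and mu are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n_dim, n_draws = 3, 200_000
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 2.0, 0.3],
              [0.0, 0.0, 1.5]])
mu = np.array([1.0, -2.0, 0.5])

z = rng.standard_normal((n_draws, n_dim))    # each row is a draw of the standard normal vector z
X = z @ A.T + mu                             # each row is a draw of X = A z + mu

print(np.round(X.mean(axis=0), 3))           # close to mu
print(np.round(np.cov(X, rowvar=False), 3))  # close to A A^T
print(A @ A.T)
```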

Distributions derived from normal variables

In the definitions of the standard distributions (chi-square, t and F distributions) there is no reference to any sample data. Unlike statistics, which by definition are functions of sample data, these and other standard distributions are theoretical constructs. Statistics are developed in such a way as to have a distribution equal, or asymptotically equal, to one of the standard distributions. This allows practitioners to use the tables developed for the standard distributions.

Exercise 1. Prove that \chi_n^2/n converges to 1 in probability.

Proof. For a standard normal z we have Ez^2=1 and Var(z^2)=2 (both properties can be verified in Mathematica). Hence, E\chi_n^2/n=1 and

Var(\chi_n^2/n)=\sum_iVar(z_i^2)/n^2=2/n\rightarrow 0.

Now the statement follows from the simple form of the law of large numbers.

Exercise 1 implies that for large n the t distribution with n degrees of freedom is close to a standard normal.
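
A quick Monte Carlo illustration of Exercise 1 (a sketch; the sample sizes and the number of draws are arbitrary): the simulated values of \chi_n^2/n have mean close to 1 and variance close to 2/n.

```python
# Monte Carlo sketch: chi^2_n / n concentrates around 1 as n grows.
import numpy as np

rng = np.random.default_rng(0)
for n in (5, 50, 500, 5000):
    draws = rng.chisquare(df=n, size=100_000) / n
    print(n, round(draws.mean(), 3), round(draws.var(), 4))   # mean ~ 1, variance ~ 2/n
```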

30 Nov 18

Application: estimating sigma squared

Consider multiple regression

(1) y=X\beta +e

where

(a) the regressors are assumed deterministic, (b) the number of regressors k is smaller than the number of observations n, (c) the regressors are linearly independent, \det (X^TX)\neq 0, and (d) the errors are homoscedastic and uncorrelated,

(2) Var(e)=\sigma^2I.

Usually students remember that \beta should be estimated and don't pay attention to estimation of \sigma^2. This is partly because \sigma^2 does not appear in the regression equation (1) and partly because the result on estimation of the error variance is more complex than the result on the OLS estimator of \beta.

Definition 1. Let \hat{\beta}=(X^TX)^{-1}X^Ty be the OLS estimator of \beta. \hat{y}=X\hat{\beta} is called the fitted value and r=y-\hat{y} is called the residual.

Exercise 1. Using the projectors P=X(X^TX)^{-1}X^T and Q=I-P show that \hat{y}=Py and r=Qe.

Proof. The first equation is obvious. From the model we have r=X\beta+e-P(X\beta +e). Since PX\beta=X\beta, we have further r=e-Pe=Qe.

Definition 2. The OLS estimator of \sigma^2 is defined by s^2=\Vert r\Vert^2/(n-k).

Exercise 2. Prove that s^2 is unbiased: Es^2=\sigma^2.

Proof. Using projector properties we have

\Vert r\Vert^2=(Qe)^TQe=e^TQ^TQe=e^TQe.

Expectations of type Ee^Te and Eee^T would be easy to find from (2). However, we need to find Ee^TQe where there is an obstructing Q. See how this difficulty is overcome in the next calculation.

E\Vert r\Vert^2=Ee^TQe (by the previous display)

=Etr(e^TQe) (e^TQe is a scalar, so it equals its trace)

=Etr(Qee^T) (applying trace-commuting)

=tr(QEee^T) (the regressors and hence Q are deterministic, so we can use linearity of E)

=\sigma^2tr(Q) (applying (2)).

tr(P)=k because the trace of a projector equals the dimension of its image, and \dim\text{Img}(P)=k by linear independence of the regressors. Therefore tr(Q)=n-k. Thus, E\Vert r\Vert^2=\sigma^2(n-k) and Es^2=\sigma^2.
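
A simulation sketch of Exercise 2 (the design matrix X, the vector \beta and \sigma below are arbitrary illustrative choices): the average of s^2 over many simulated samples is close to \sigma^2.

```python
# Simulation sketch: s^2 = ||r||^2 / (n - k) is an unbiased estimator of sigma^2.
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 30, 4, 2.0
X = rng.standard_normal((n, k))              # fixed ("deterministic") regressors
beta = np.arange(1, k + 1, dtype=float)

s2_values = []
for _ in range(20_000):
    e = sigma * rng.standard_normal(n)       # homoscedastic uncorrelated errors
    y = X @ beta + e
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimator
    r = y - X @ beta_hat                           # residual
    s2_values.append(r @ r / (n - k))              # s^2

print(np.mean(s2_values), sigma**2)          # the two numbers are close
```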

25 Nov 18

Eigenvalues and eigenvectors of a projector

Exercise 1. Find eigenvalues of a projector.

Solution. We know that a projector doesn't change elements from its image: Px=x for all x\in\text{Img}(P). This means that \lambda =1 is an eigenvalue of P. Moreover, if \{x_i:i=1,...,\dim\text{Img}(P)\} is any orthonormal system in \text{Img}(P), each of x_i is an eigenvector of P corresponding to the eigenvalue \lambda =1.

Since P maps to zero all elements from the null space N(P), \lambda =0 is another eigenvalue. If \{y_i:i=1,...,\dim N(P)\} is any orthonormal system in N(P), each of y_i is an eigenvector of P corresponding to the eigenvalue \lambda =0.

A projector cannot have eigenvalues other than 0 and 1. This is proved as follows. Suppose Px=\lambda x with some nonzero x. Applying P to both sides of this equation, we get Px=P^2x=\lambda Px=\lambda ^2x. It follows that \lambda x=\lambda^2x and (because x\neq 0) \lambda =\lambda^2. The last equation has only two roots: 0 and 1.

We have \dim\text{Img}(P)+\dim N(P)=n because R^n is an orthogonal sum of N(P) and \text{Img}(P).  Combining the systems \{x_i\}, \{y_i\} we get an orthonormal basis in R^{n} consisting of eigenvectors of P.
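
A small numerical check of Exercise 1 (a sketch; the matrix X below is an arbitrary full-rank choice, and P=X(X^TX)^{-1}X^T is the projector constructed in an earlier post): the eigenvalues of P are only zeros and ones.

```python
# Numerical sketch: a projector has only the eigenvalues 0 and 1.
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
X = rng.standard_normal((n, k))                 # arbitrary full-column-rank matrix
P = X @ np.linalg.inv(X.T @ X) @ X.T            # projector onto Img(X)
print(np.round(np.linalg.eigvalsh(P), 10))      # k ones and n - k zeros (P is symmetric)
```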

Trace of a projector

Recall that for a square matrix, its trace is defined as the sum of its diagonal elements.

Exercise 2. Prove that tr(AB)=tr(BA) if both products AB and BA are square. It is convenient to call this property trace-commuting (we know that in general matrices do not commute).

Proof. Assume that A is of size n\times m and B is of size m\times n. For both products we need only to find the diagonal elements:

AB=\left(\begin{array}{ccc}  a_{11}&...&a_{1m}\\...&...&...\\a_{n1}&...&a_{nm}\end{array}  \right)\left(\begin{array}{ccc}  b_{11}&...&b_{1n}\\...&...&...\\b_{m1}&...&b_{mn}\end{array}  \right)=\left(\begin{array}{ccc}  \sum_ia_{1i}b_{i1}&...&...\\...&...&...\\...&...&\sum_ia_{ni}b_{in}\end{array}  \right),

BA=\left(\begin{array}{ccc}  b_{11}&...&b_{1n}\\...&...&...\\b_{m1}&...&b_{mn}\end{array}  \right)\left(\begin{array}{ccc}  a_{11}&...&a_{1m}\\...&...&...\\a_{n1}&...&a_{nm}\end{array}  \right)=\left(\begin{array}{ccc}  \sum_ja_{j1}b_{1j}&...&...\\...&...&...\\...&...&\sum_ja_{jm}b_{mj}  \end{array}\right).

All we have to do is change the order of summation:

tr(AB)=\sum_j\sum_ia_{ji}b_{ij}=\sum_i\sum_ja_{ji}b_{ij}=tr(BA).
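
A one-line numerical check of trace-commuting (A and B below are arbitrary rectangular matrices of compatible sizes):

```python
# Numerical check: tr(AB) = tr(BA) for compatible rectangular matrices.
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 5)), rng.standard_normal((5, 3))
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))   # True
```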

Exercise 3. Find the trace of a projector.

Solution. In Exercise 1 we established that the projector P has p=\dim\text{Img}(P) eigenvalues \lambda =1 and n-p eigenvalues \lambda =0. P is symmetric, so in its diagonal representation P=UDU^{-1} there are p ones and n-p zeros on the diagonal of the diagonal matrix D. By Exercise 2

tr(P)=tr(UDU^{-1})=tr(DU^{-1}U)=tr(D)=p.
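
And a numerical check of Exercise 3 (again with an arbitrary n\times k matrix X of full column rank): tr(P)=\dim\text{Img}(P)=k.

```python
# Numerical sketch: the trace of a projector equals the dimension of its image.
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
X = rng.standard_normal((n, k))
P = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.isclose(np.trace(P), k))               # True: tr(P) = k
```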

18 Nov 18

Application: Ordinary Least Squares estimator

Generalized Pythagoras theorem

Exercise 1. Let P be a projector and denote Q=I-P. Then \Vert x\Vert^2=\Vert Px\Vert^2+\Vert Qx\Vert^2.

Proof. By the scalar product properties

\Vert x\Vert^2=\Vert Px+Qx\Vert^2=\Vert Px\Vert^2+2(Px)\cdot (Qx)+\Vert Qx\Vert^2.

P is symmetric and idempotent, so

(Px)\cdot (Qx)=(Px)\cdot[(I-P)x]=x\cdot[(P-P^2)x]=0.

This proves the statement.
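
A numerical illustration of Exercise 1 (a sketch; P is built from an arbitrary matrix X and x is an arbitrary vector):

```python
# Numerical check of the generalized Pythagoras theorem: ||x||^2 = ||Px||^2 + ||Qx||^2.
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
X = rng.standard_normal((n, k))
P = X @ np.linalg.inv(X.T @ X) @ X.T            # a projector
Q = np.eye(n) - P

x = rng.standard_normal(n)
lhs = np.linalg.norm(x) ** 2
rhs = np.linalg.norm(P @ x) ** 2 + np.linalg.norm(Q @ x) ** 2
print(np.isclose(lhs, rhs))                     # True
```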

Ordinary Least Squares (OLS) estimator derivation

Problem statement. A vector y\in R^n (the dependent vector) and vectors x^{(1)},...,x^{(k)}\in R^n (independent vectors or regressors) are given. The OLS estimator is defined as the vector \beta\in R^k which minimizes the total sum of squares TSS=\sum_{i=1}^n(y_i-x_i^{(1)}\beta_1-...-x_i^{(k)}\beta_k)^2.

Denoting X=(x^{(1)},...,x^{(k)}), we see that TSS=\Vert y-X\beta\Vert^2 and that finding the OLS estimator means approximating y with vectors from the image \text{Img}(X). The vectors x^{(1)},...,x^{(k)} should be linearly independent, otherwise the solution will not be unique.

Assumption. x^{(1)},...,x^{(k)} are linearly independent. This, in particular, implies that k\leq n.

Exercise 2. Show that the OLS estimator is

(2) \beta=(X^TX)^{-1}X^Ty.

Proof. Since the regressors are linearly independent, P=X(X^TX)^{-1}X^T is a projector onto \text{Img}(X) (see the post on constructing a projector onto a given subspace). Since X\beta belongs to the image of P, P doesn't change it: X\beta=PX\beta. Denoting also Q=I-P we have

\Vert y-X\beta\Vert^2=\Vert y-Py+Py-X\beta\Vert^2

=\Vert Qy+P(y-X\beta)\Vert^2 (using X\beta=PX\beta)

=\Vert Qy\Vert^2+\Vert P(y-X\beta)\Vert^2 (by Exercise 1).

This shows that \Vert Qy\Vert^2 is a lower bound for \Vert y-X\beta\Vert^2. This lower bound is achieved when the second term is made zero. From

P(y-X\beta)=Py-X\beta =X(X^TX)^{-1}X^Ty-X\beta=X[(X^TX)^{-1}X^Ty-\beta]

we see that the second term is zero if \beta satisfies (2).

Usually the above derivation is applied to the dependent vector of the form y=X\beta+e where e is a random vector with mean zero. But it holds without this assumption. See also simplified derivation of the OLS estimator.
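
As a sanity check, formula (2) can be compared with a generic least-squares solver on simulated data (the numbers below are illustrative only; the solver minimizes the same sum of squares numerically):

```python
# Sketch: the closed-form OLS estimator (2) coincides with a numerical least-squares solution.
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.standard_normal((n, k))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

beta_formula = np.linalg.inv(X.T @ X) @ X.T @ y        # formula (2)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)     # numerical minimizer of ||y - X beta||^2
print(np.allclose(beta_formula, beta_lstsq))           # True
```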

14 Nov 18

Constructing a projector onto a given subspace

Let L be a subspace of R^n. Let k=\dim L\ (\leq n) and fix some basis x^{(1)},...,x^{(k)} in L. Define the matrix X=(x^{(1)},...,x^{(k)}) of size n\times k (the vectors are written as column vectors).

Exercise 1. a) With the above notation, the matrix (X^TX)^{-1} exists. b) The matrix P=X(X^TX)^{-1}X^T exists. c) P is a projector.

Proof. a) The determinant of A=X^TX is not zero by linear independence of the basis vectors (if Av=0, then \Vert Xv\Vert^2=v^TAv=0, so Xv=0 and hence v=0), so its inverse A^{-1} exists. We also know that A and its inverse are symmetric:

(1) A^T=A, (A^{-1})^T=A^{-1}.

b) To see that P exists, just count the dimensions: the product of matrices of sizes n\times k, k\times k and k\times n is well defined.

c) Let's prove that P is a projector. (1) allows us to make the proof compact. P is idempotent:

P^2=(XA^{-1}X^T)(XA^{-1}X^T)=XA^{-1}(X^TX)A^{-1}X^T

=X(A^{-1}A)A^{-1}X^T=XA^{-1}X^T=P.

P is symmetric:

P^T=[XA^{-1}X^T]^T=(X^T)^T(A^{-1})^TX^T=XA^{-1}X^T=P.

Exercise 2. P projects onto L: \text{Img}(P)=L.

Proof. First we show that \text{Img}(P)\subseteq L. Put

(2) y=A^{-1}X^Tx,

for any x\in R^n. Then

Px=XA^{-1}X^Tx=Xy=\sum x^{(j)}y_j\in L.

This shows that \text{Img}(P)\subseteq L.

Let's prove the opposite inclusion. Any element of L is of the form \sum x^{(j)}y_j with some y\in R^k. N(X)=\{0\} because we are dealing with a basis. This fact and the general equation N(X)\oplus \text{Img}(X^T)=R^k imply \text{Img}(X^T)=R^k. Hence for any given y there exists x such that Ay=X^Tx. Then (2) is true and, as above, Px=\sum x^{(j)}y_j. We have proved \text{Img}(P)\supseteq L.
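
A small numerical sketch of Exercises 1 and 2 (the basis vectors placed in X are arbitrary): P is idempotent and symmetric, it leaves elements of L unchanged, and it maps any vector into L.

```python
# Numerical sketch: P = X (X^T X)^{-1} X^T is a projector and Img(P) = L = Img(X).
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
X = rng.standard_normal((n, k))                      # columns form a basis of L
P = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(P @ P, P), np.allclose(P.T, P))    # idempotent and symmetric

v = X @ rng.standard_normal(k)                       # an element of L
print(np.allclose(P @ v, v))                         # P leaves elements of L unchanged

x = rng.standard_normal(n)
coeffs = np.linalg.lstsq(X, P @ x, rcond=None)[0]    # expand Px in the basis (columns of X)
print(np.allclose(X @ coeffs, P @ x))                # Px lies in L
```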

9 Nov 18

Geometry and algebra of projectors

Projectors are geometrically so simple that they should have been discussed somewhere in the beginning of this course. I am giving them now because the applications are more advanced.

Motivating example

Let L be the x-axis and L^\perp the y-axis on the plane. Let P be the projector onto L along L^\perp and let Q be the projector onto L^\perp along L. This geometry translates into the following definitions:

L=\{(x,0):x\in R\}, L^\perp=\{(0,y):y\in R\}, P(x,y)=(x,0), Q(x,y)=(0,y).

The theory is modeled on the following observations.

a) P leaves the elements of L unchanged and sends to zero all elements of L^\perp.

b) L is the image of P and L^\perp is the null space of P.

c) Any element of the image of P is orthogonal to any element of the image of Q.

d) Any x can be represented as x=(x_1,0)+(0,x_2)=Px+Qx. It follows that I=P+Q.

For more simple examples, see my post on conditional expectations.

Formal approach

Definition 1. A square matrix P is called a projector if it satisfies two conditions: 1) P^2=P (P is idempotent; for some reason, students remember this term better than others) and 2) P^T=P (P is symmetric).

Exercise 1. Denote L_P=\{x:Px=x\} the set of points x that are left unchanged by P. Then L_P is the image of P (and therefore a subspace).

Proof. Indeed, the image of P consists of points y=Px. For any such y, we have Py=P^2x=Px=y, so y belongs to L_P. Conversely, any element y of L_P satisfies y=Py and therefore belongs to the image of P.

Exercise 2. a) The null space and image of P are orthogonal. b) We have an orthogonal decomposition R^n=N(P)\oplus \text{Img}(P).

Proof. a) If x\in \text{Img}(P) and y\in N(P), then Py=0 and by Exercise 1 Px=x. Therefore x\cdot y=(Px)\cdot y=x\cdot (Py)=0. This shows that \text{Img}(P)\perp N(P).

b) For any x write x=Px+(I-P)x. Here Px\in \text{Img}(P) and (I-P)x\in N(P) because P(I-P)x=(P-P^2)x=0.

Exercise 3. a) Along with P, the matrix Q=I-P is also a projector. b) \text{Img}(Q)=N(P) and N(Q)=\text{Img}(P).

Proof. a) Q is idempotent: Q^2=(I-P)^2=I-2P+P^2=I-P=Q. Q is symmetric: Q^T=I^T-P^T=Q.

b) By Exercise 1, applied to the projector Q,

\text{Img}(Q)=\{x:Qx=x\}=\{x:(I-P)x=x\}=\{x:Px=0\}=N(P).

Since P=I-Q, this equation implies N(Q)=\text{Img}(P).

It follows that, as with P, the set L_Q=\{x:Qx=x\} is the image of Q and it consists of points that are not changed by Q.
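
The motivating example and Exercise 3 can be verified with a couple of lines of Python (a sketch on the plane, where P projects onto the x-axis):

```python
# Sketch: on the plane, P projects onto the x-axis, Q = I - P onto the y-axis, P + Q = I.
import numpy as np

P = np.array([[1.0, 0.0],
              [0.0, 0.0]])                            # P(x, y) = (x, 0)
Q = np.eye(2) - P                                     # Q(x, y) = (0, y)

x = np.array([3.0, 4.0])
print(P @ x, Q @ x, P @ x + Q @ x)                    # [3. 0.] [0. 4.] [3. 4.]
print(np.allclose(Q @ Q, Q), np.allclose(Q.T, Q))     # Q is also a projector
```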

29 Oct 18

Questions for repetition

  1. Let \lambda_i be the eigenvalues of a symmetric matrix A. Prove that a) it is positive if and only if its eigenvalues are positive and b) it is non-negative if and only if its eigenvalues are non-negative. Reproduce my proof, which gives only sufficiency (\min_i\lambda_i>0 implies positivity of A and \min_i\lambda_i\geq 0 implies non-negativity of A). For the necessity part, plug eigenvectors of A in (Ax)\cdot x.

  2. Use the Cauchy-Hadamard theorem to relate the radius of convergence of the power series used to define a function of a matrix f(A) to the radius of convergence of the function f(t) of a numerical argument.

  3. What is the matrix solution of the initial value problem x^\prime(t)=Ax(t), x(t_0)=x_0?

  4. How does the knowledge of the diagonal representation of A simplify finding f(A)?

  5. Describe how the knowledge of the diagonal representation of A allows one to split the system x^\prime(t)=Ax(t) into a collection of one-dimensional equations. How does this lead to the solution of the initial value problem in Exercise 3?

  6. Define a square root of a non-negative symmetric matrix and relate it to the definition of the same using the power series. What are the properties of the square root?

  7. Show that the variance matrix \Omega of an arbitrary random vector e (with real random components) is symmetric and non-negative.

  8. When the matrix \Omega is positive, what are the properties of \Omega^{-1/2}?

  9. Find the variance of \Omega ^{-1/2}e.

  10. How does the previous result lead to the Aitken estimator?

  11. Define the absolute value of A and show that this definition is correct.

  12. If A is diagonalized, what is the expression of its determinant in terms of its eigenvalues?

  13. (This is strictly about ideas) Describe the elements of the polar form c=\rho e^{i\theta} of a complex number. How do the definitions of \rho and e^{i\theta } help you define their analogs for matrices?

  14. Derive the polar form for a square matrix.

  15. More on similarity between complex numbers and matrices. For a square matrix with possibly complex entries, define the real part \text{Re}(A)=(A+A^\prime)/2 and the imaginary part \text{Im}(A)=(A-A^\prime)/(2i). Here A^\prime is the adjoint of A. Show that both \text{Re}(A) and \text{Im}(A) are symmetric and that A=\text{Re}(A)+i\,\text{Im}(A).

  16. Using population characteristics, describe the idea of Principal Component Analysis.

  17. Show how this idea is realized in the sampling context.

  18. This is a research problem to support Exercise 17. How do you change places of rows in a matrix? We want to find an orthogonal matrix P such that premultiplication of W by P yields a matrix W_1 where W_1 has the same elements as W except that some rows have changed their places. a) Consider a matrix W of size 2\times 2 and let W_1 be the transformed matrix (with the first row of W as the second row of W_1 and vice versa). Find P from the equation PW=W_1. b) Do the same for a matrix W of size 3\times 3. c) Generalize to the case of an n\times n matrix W, first considering matrices P that change places of only two rows. Let's call such a matrix an elementary matrix. Note that it is orthogonal. d) The matrix that changes any number of rows is a product of elementary ones. It is orthogonal as a product of orthogonal matrices.
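
For question 18, here is a small numerical sketch (the matrix W is a hypothetical example): swapping two rows of the identity matrix gives an elementary matrix P, it is orthogonal, and PW is W with the corresponding rows exchanged.

```python
# Sketch for question 18: an elementary row-swapping matrix is orthogonal.
import numpy as np

W = np.arange(9, dtype=float).reshape(3, 3)           # an arbitrary 3 x 3 matrix
P = np.eye(3)[[1, 0, 2]]                              # identity with rows 1 and 2 swapped
print(P @ W)                                          # W with its first two rows exchanged
print(np.allclose(P.T @ P, np.eye(3)))                # P is orthogonal
```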

29 Sep 18

Questions for repetition

  1. Describe the steps leading to a full definition of the set of complex numbers (imaginary unit, complex numbers, operations with them, absolute value, conjugate number).
  2. What are matrix analogs of a) real numbers, b) conjugation and c) an absolute value of a complex number (without proof)?
  3. Write out and prove the properties of the scalar product in C^n.
  4. Assuming the Euler formula known, derive the polar form of a complex number. What can you say about the angle \theta in the polar form?
  5. Prove that a quadratic form of a matrix is homogeneous of degree 2.
  6. Prove that a quadratic form takes values in the set of real numbers.
  7. Define positive and non-negative matrices. State Sylvester's criterion.
  8. Illustrate the Sylvester criterion geometrically.
  9. Show that A^TA is symmetric and non-negative.
  10. Give the definition of a basis and derive the formula for the coefficients \xi in the decomposition x=\sum\xi_iu_i for an arbitrary vector x.
  11. How are the decomposition coefficients transformed when one passes from one basis to another?
  12. Give the full picture behind the similarity definition.
  13. Prove that for an orthogonal matrix, a) the inverse and transpose are the same, b) the transpose and inverse are orthogonal.
  14. An orthogonal matrix preserves scalar products, norms and angles.
  15. If you put elements of an orthonormal basis side by side, the resulting matrix will be orthogonal.
  16. The transition matrix from one orthonormal basis to another is orthogonal.
  17. Show that the product of two diagonal matrices is diagonal.
  18. Define eigenvalues and eigenvectors. Why are we interested in them?
  19. What is the link between eigenvalues and a characteristic equation of a matrix?
  20. Prove that the characteristic equation is given by a polynomial of degree n, if A is of size n\times n.
  21. Prove that in C^n any matrix has at least one eigenvector.
  22. A symmetric matrix in C^n has only real eigenvalues.
  23. If A is symmetric, then it has at least one real eigenvector.
  24. A symmetric matrix has at least one eigenvector in any nontrivial invariant subspace.
  25. What is the relationship between the spectra of A_C and A_R?
  26. A matrix diagonalizable by an orthogonal matrix must be symmetric.
  27. For a symmetric matrix, an orthogonal complement of an invariant subspace is invariant.
  28. Main theorem. A symmetric matrix is diagonalizable by an orthogonal matrix.

22 Sep 18

Applications of the diagonal representation IV

Principal component analysis is a general method based on diagonalization of the variance matrix. We consider it in a financial context. The variance matrix measures riskiness of the portfolio.  We want to see which stocks contribute most to the portfolio risk. The surprise is that the answer is given not in terms of the vector of returns but in terms of its linear transformation.

8. Principal component analysis (PCA)

Let R be a column-vector of returns on p stocks with the variance matrix V(R)=E(R-ER)(R-ER)^{T}. The idea is to find an orthogonal matrix W such that W^{-1}V(R)W=D is a diagonal matrix D=diag[\lambda_1,...,\lambda_p] with \lambda_1\geq...\geq\lambda_p.

With such a matrix, instead of R we can consider its transformation Y=W^{-1}R for which

V(Y)=W^{-1}V(R)(W^{-1})^T=W^{-1}V(R)W=D.

We know that V(Y) has variances V(Y_1),...,V(Y_p) on the main diagonal. It follows that V(Y_i)=\lambda_i for all i. Variance is a measure of riskiness. Thus, the transformed variables Y_1,...,Y_p are put in the order of declining risk. What follows is the realization of this idea using sample data.

In a sampling context, all population means should be replaced by their sample counterparts. Let R^{(t)} be a p\times 1 vector of observations on R at time t. These observations are put side by side into a matrix \mathbb{R}=(R^{(1)},...,R^{(n)}) where n is the number of moments in time. The population mean ER is estimated by the sample mean

\bar{\mathbb{R}}=\frac{1}{n}\sum_{t=1}^nR^{(t)}.

The variance matrix V(R) is estimated by

\hat{V}=\frac{1}{n-1}(\mathbb{R}-\bar{\mathbb{R}}l)(\mathbb{R}-\bar{\mathbb{R}}l)^T

where l is a 1\times n vector of ones. It is this matrix that is diagonalized: W^{-1}\hat{V}W=D.

In general, the eigenvalues in D are not ordered. Ordering them and at the same time changing places of the rows of W^{-1} correspondingly we get a new orthogonal matrix W_1 (this requires a small proof) such that the eigenvalues in W_1^{-1}\hat{V}W_1=D_1 will be ordered. There is a lot more to say about the method and its applications.
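
Here is a minimal sketch of the sampling procedure just described (the returns are simulated, so the numbers themselves mean nothing; the rows of the array `returns` below play the role of the transposed observations R^{(t)}):

```python
# PCA sketch: diagonalize the sample variance matrix and order the components by variance.
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 500
returns = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))   # simulated returns, n x p

V_hat = np.cov(returns, rowvar=False)            # sample variance matrix
eigenvalues, W = np.linalg.eigh(V_hat)           # V_hat = W diag(eigenvalues) W^T, W orthogonal

order = np.argsort(eigenvalues)[::-1]            # reorder so that lambda_1 >= ... >= lambda_p
eigenvalues, W = eigenvalues[order], W[:, order]

Y = returns @ W                                  # row-wise version of Y = W^{-1} R
print(np.allclose(np.var(Y, axis=0, ddof=1), eigenvalues))   # True: V(Y_i) = lambda_i
print(np.round(eigenvalues, 3))                  # components ordered by declining risk
```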

9 Sep 18

Applications of the diagonal representation III

6. Absolute value of a matrix

Exercise 1. For a square matrix A the matrix A^TA is non-negative.

Proof. x^TA^TAx=(Ax)^TAx=\|Ax\|^2\geq 0 for any x.

For a complex number c the absolute value is defined by |c|=(\bar{c}c)^{1/2}. Since transposition of matrices is similar to conjugation of complex numbers, this leads us to the following definition.

Definition 1. By Exercise 1 above and Exercise 1 from the previous post, the matrix A^TA is non-negative and hence has non-negative eigenvalues. Hence we can define the absolute value of A as the square root of A^TA, |A|=(A^TA)^{1/2}.

7. Polar form

A complex nonzero number c in polar form is c=\rho e^{i\theta} where \rho >0 is the absolute value of c and \theta is a real angle, so that |e^{i\theta}|=1. The matrix analog of this form is obtained when the condition \rho >0 is replaced by \det A\neq 0, the absolute value of A from Definition 1 is used, and an orthogonal matrix plays the role of e^{i\theta}.

Exercise 2. For a symmetric matrix A, its determinant equals the product of its eigenvalues.

Proof. \det A=\det(Udiag[\lambda_1,...,\lambda_n]U^{-1})=(\det U)(\det diag[\lambda_1,...,\lambda_n])(\det(U^{-1}))

=(\det U)^2\det diag[\lambda_1,...,\lambda_n]=\lambda_1...\lambda_n

because for an orthogonal U we have \det(U^{-1})=\det U and (\det U)^2=1.

Exercise 3. Let \det A\neq 0. Put U=A|A|^{-1}. Then U is orthogonal and the polar form of A is A=U|A|.

Proof. Let \lambda_1,...,\lambda_n be the eigenvalues of A^TA. They are real and non-negative. In fact they are all positive because by Exercise 2 their product equals \det(A^TA)=(\det A)^2>0.

Hence, |A|^{-1} exists and it is symmetric. U also exists and is orthogonal: U^TU=(|A|^{-1})^TA^TA|A|^{-1}=|A|^{-1}|A|^2|A|^{-1}=I. Finally, from the definition of U, A=U|A|.
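
A numerical sketch of Exercise 3 (A below is an arbitrary invertible matrix): |A| is computed through the eigendecomposition of A^TA, U=A|A|^{-1} is orthogonal, and A=U|A|.

```python
# Sketch of the polar form A = U |A| with |A| = (A^T A)^{1/2}.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))                       # almost surely invertible

eigenvalues, V = np.linalg.eigh(A.T @ A)              # A^T A is symmetric and positive
abs_A = V @ np.diag(np.sqrt(eigenvalues)) @ V.T       # |A| = (A^T A)^{1/2}
U = A @ np.linalg.inv(abs_A)

print(np.allclose(U.T @ U, np.eye(4)))                # U is orthogonal
print(np.allclose(U @ abs_A, A))                      # A = U |A|
```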