7 May 18

## Variance of a vector: motivation and visualization

I always show my students the definition of the variance of a vector, and they usually don't pay attention. Yet you need to know it already at the level of simple regression (to understand the derivation of the variance of the slope estimator), and even more so when you deal with time series. Since I know exactly where students usually stumble, this post is structured as a series of questions and answers.

## Think about ideas: how would you define variance of a vector?

Question 1. We know that for a random variable $X$, its variance is defined by

(1) $V(X)=E(X-EX)^{2}.$

Now let

$X=\left(\begin{array}{c}X_{1}\\...\\X_{n}\end{array}\right)$

be a vector with $n$ components, each of which is a random variable. How would you define its variance?

The answer is not straightforward because we don't know how to square a vector. Let $X^T=(\begin{array}{ccc}X_1& ...&X_n\end{array})$ denote the transposed vector. There are two ways to multiply a vector by itself: $X^TX$ and $XX^T.$

Question 2. Find the dimensions of $X^TX$ and $XX^T$ and their expressions in terms of coordinates of $X.$

Answer 2. For a product of matrices there is a compatibility rule that I write in the form

(2) $A_{n\times m}B_{m\times k}=C_{n\times k}.$

Recall that $n\times m$ in the notation $A_{n\times m}$ means that the matrix $A$ has $n$ rows and $m$ columns. For example, $X$ is of size $n\times 1.$ Verbally, the above rule says that the number of columns of $A$ should equal the number of rows of $B.$ In the product that common number $m$ disappears, and the remaining numbers $n$ and $k$ give, respectively, the numbers of rows and columns of $C.$ Isn't the formula easier to remember than the verbal statement? From (2) we see that $X_{1\times n}^TX_{n\times 1}$ is of dimension $1\times 1$ (it is a scalar) and $X_{n\times 1}X_{1\times n}^T$ is an $n\times n$ matrix.

For actual multiplication of matrices I use the visualization

(3) $\left(\begin{array}{ccccc}&&&&\\&&&&\\a_{i1}&a_{i2}&...&a_{i,m-1}&a_{im}\\&&&&\\&&&&\end{array}\right) \left(\begin{array}{ccccc}&&b_{1j}&&\\&&b_{2j}&&\\&&...&&\\&&b_{m-1,j}&&\\&&b_{mj}&&\end{array}\right) =\left( \begin{array}{ccccc}&&&&\\&&&&\\&&c_{ij}&&\\&&&&\\&&&&\end{array}\right)$

Short formulation. Multiply rows from the first matrix by columns from the second one.

Long formulation. To find the element $c_{ij}$ of $C,$ we take the scalar product of the $i$th row of $A$ and the $j$th column of $B:$ $c_{ij}=a_{i1}b_{1j}+a_{i2}b_{2j}+...+a_{im}b_{mj}.$ To find all elements in the $i$th row of $C,$ we fix the $i$th row in $A$ and move rightward across the columns of $B.$ Alternatively, to find all elements in the $j$th column of $C,$ we fix the $j$th column in $B$ and move down the rows of $A$. Using this rule, we have

(4) $X^TX=X_1^2+...+X_n^2,$ $XX^T=\left(\begin{array}{ccc}X_1^2&...&X_1X_n\\...&...&...\\X_nX_1&...&X_n^2 \end{array}\right).$
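To make (4) concrete, here is a small numpy sketch (the numeric values are arbitrary; `x` plays the role of one realization of $X$):

```python
import numpy as np

# One arbitrary realization of X with n = 3 components, as a column vector.
x = np.array([1.0, 2.0, 3.0]).reshape(-1, 1)   # shape (3, 1)

inner = (x.T @ x).item()   # X^T X: a scalar, the sum of squares
outer = x @ x.T            # X X^T: an n x n matrix with entries X_i X_j

print(inner)          # 14.0 = 1 + 4 + 9
print(outer.shape)    # (3, 3)
print(outer[0, 2])    # 3.0 = X_1 * X_3
```

The second product is the one students stumble on: the $i$th row of $X$ times the $j$th column of $X^T$ leaves the single product $X_iX_j.$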

Usually students have problems with the second equation.

Based on (1) and (4), we have two candidates to define variance:

(5) $V(X)=E(X-EX)^T(X-EX)$

and

(6) $V(X)=E(X-EX)(X-EX)^T.$

Answer 1. The second definition contains more information, in the sense to be explained below, so we define variance of a vector by (6).

Question 3. Find the elements of this matrix.

Answer 3. Variance of a vector has variances of its components on the main diagonal and covariances outside it:

(7) $V(X)=\left(\begin{array}{cccc}V(X_1)&Cov(X_1,X_2)&...&Cov(X_1,X_n)\\Cov(X_2,X_1)&V(X_2)&...&Cov(X_2,X_n)\\...&...&...&...\\Cov(X_n,X_1)&Cov(X_n,X_2)&...&V(X_n)\end{array}\right).$

If you can't get this on your own, go back to Answer 2.

There is a matrix operation called trace and denoted $tr$. It is defined only for square matrices and gives the sum of diagonal elements of a matrix.

Exercise 1. Show that $tr(V(X))=E(X-EX)^T(X-EX).$ In this sense definition (6) is more informative than (5).
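Exercise 1 can also be checked numerically. The sketch below (simulated data, arbitrary mixing matrix) compares the trace of the sample analog of (6) with the sample analog of (5); the two agree up to rounding because the identity holds term by term:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n = 100_000, 3
# Each row is one simulated realization of the 3-vector X (arbitrary dependence).
mix = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
sample = rng.normal(size=(n_obs, n)) @ mix

centered = sample - sample.mean(axis=0)
V = centered.T @ centered / n_obs           # sample analog of E(X-EX)(X-EX)^T
lhs = np.trace(V)                           # tr(V(X))
rhs = (centered ** 2).sum(axis=1).mean()    # sample analog of E(X-EX)^T(X-EX)
print(abs(lhs - rhs))   # zero up to rounding
```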

Exercise 2. Show that if $EX_1=...=EX_n=0$, then (7) becomes

$V(X)=\left(\begin{array}{cccc}EX^2_1&EX_1X_2&...&EX_1X_n\\EX_2X_1&EX^2_2&...&EX_2X_n\\...&...&...&...\\EX_nX_1&EX_nX_2&...&EX^2_n\end{array}\right).$

8
May 18

## Different faces of vector variance: again visualization helps

In the previous post we defined variance of a column vector $X$ with $n$ components by

$V(X)=E(X-EX)(X-EX)^T.$

In terms of elements this is the same as:

(1) $V(X)=\left(\begin{array}{cccc}V(X_1)&Cov(X_1,X_2)&...&Cov(X_1,X_n)\\Cov(X_2,X_1)&V(X_2)&...&Cov(X_2,X_n)\\...&...&...&...\\Cov(X_n,X_1)&Cov(X_n,X_2)&...&V(X_n)\end{array}\right).$

## So why is knowing the structure of this matrix so important?

Let $X_1,...,X_n$ be random variables and let $a_1,...,a_n$ be numbers. In the derivation of the variance of the slope estimator for simple regression we have to deal with expressions of the type

(2) $V\left(\sum_{i=1}^na_iX_i\right).$

Question 1. How do you multiply a sum by a sum? I mean, how do you use summation signs to find the product $\left(\sum_{i=1}^na_i\right)\left(\sum_{i=1}^nb_i\right)$?

Answer 1. Whenever you have problems with summation signs, try to do without them. The product

$\left(a_1+...+a_n\right)\left(b_1+...+b_n\right)=a_1b_1+...+a_1b_n+...+a_nb_1+...+a_nb_n$

should contain ALL products $a_ib_j.$ Again, a matrix visualization will help:

$\left(\begin{array}{ccc}a_1b_1&...&a_1b_n\\...&...&...\\a_nb_1&...&a_nb_n\end{array}\right).$

The product we are looking for should contain all elements of this matrix. So the answer is

(3) $\left(\sum_{i=1}^na_i\right)\left(\sum_{i=1}^nb_i\right)=\sum_{i=1}^n\sum_{j=1}^na_ib_j.$
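Whoever distrusts summation signs can let the computer expand the product; a tiny check of (3) with arbitrary numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=5)
b = rng.normal(size=5)

lhs = a.sum() * b.sum()
# Double sum over ALL products a_i * b_j, as in (3).
rhs = sum(a[i] * b[j] for i in range(5) for j in range(5))
print(np.isclose(lhs, rhs))   # True
```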

Formally, we can write $\sum_{i=1}^nb_i=\sum_{j=1}^nb_j$ (the sum does not depend on the index of summation, this is another point many students don't understand) and then perform the multiplication in (3).

Question 2. What is the expression for (2) in terms of covariances of components?

Answer 2. If you understand Answer 1 and know the relationship between variances and covariances, it should be clear that

(4) $V\left(\sum_{i=1}^na_iX_i\right)=Cov(\sum_{i=1}^na_iX_i,\sum_{i=1}^na_iX_i)$

$=Cov(\sum_{i=1}^na_iX_i,\sum_{j=1}^na_jX_j)=\sum_{i=1}^n\sum_{j=1}^na_ia_jCov(X_i,X_j).$

Question 3. In light of (1), separate variances from covariances in (4).

Answer 3. When $i=j,$ we have $Cov(X_i,X_j)=V(X_i),$ which are diagonal elements of (1). Otherwise, for $i\neq j$ we get off-diagonal elements of (1). So the answer is

(5) $V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i)+\sum_{i\neq j}a_ia_jCov(X_i,X_j).$

Once again, in the first sum on the right we have only variances. In the second sum, the indices $i,j$ are assumed to run from $1$ to $n$, excluding the diagonal $i=j.$

Corollary. If $X_{i}$ are uncorrelated, then the second sum in (5) disappears:

(6) $V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i).$

This fact has been used (with a slightly different explanation) in the derivation of the variance of the slope estimator for simple regression.

Question 4. Note that the matrix (1) is symmetric (elements above the main diagonal equal their mirror siblings below that diagonal). This means that some terms in the second sum on the right of (5) are repeated twice. If you group equal terms in (5), what do you get?

Answer 4. The idea is to write

$a_ia_jCov(X_i,X_j)+a_ia_jCov(X_j,X_i)=2a_ia_jCov(X_i,X_j),$

that is, to join equal elements above and below the main diagonal in (1). For this, you need to figure out how to write a sum of the elements that are above the main diagonal. Make a bigger version of (1) (with more off-diagonal elements) to see that the elements that are above the main diagonal are listed in the sum $\sum_{i=1}^{n-1}\sum_{j=i+1}^n.$ This sum can also be written as $\sum_{1\leq i Hence, (5) is the same as

(7) $V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i)+2\sum_{i=1}^{n-1}\sum_{j=i+1}^na_ia_jCov(X_i,X_j)$

$=\sum_{i=1}^na_i^2V(X_i)+2\sum_{1\leq i<j\leq n}a_ia_jCov(X_i,X_j).$

Unlike (6), this equation is applicable when there is autocorrelation.
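Formulas (5) and (7) can be verified against the compact matrix form $V\left(\sum_{i=1}^na_iX_i\right)=a^TV(X)a$ (a standard identity, used here only as an independent check, not derived in this post). A numeric sketch with an arbitrary valid variance matrix:

```python
import numpy as np

n = 4
rng = np.random.default_rng(2)
a = rng.normal(size=n)
M = rng.normal(size=(n, n))
V = M @ M.T   # arbitrary symmetric non-negative matrix, playing the role of V(X)

# Compact matrix form of the variance of the weighted sum.
direct = a @ V @ a

# Formula (5): variances on the diagonal plus all off-diagonal covariances.
via_5 = sum(a[i] ** 2 * V[i, i] for i in range(n)) + \
        sum(a[i] * a[j] * V[i, j] for i in range(n) for j in range(n) if i != j)

# Formula (7): the off-diagonal terms grouped in pairs, hence the factor 2.
via_7 = sum(a[i] ** 2 * V[i, i] for i in range(n)) + \
        2 * sum(a[i] * a[j] * V[i, j] for i in range(n) for j in range(i + 1, n))

print(np.isclose(direct, via_5), np.isclose(via_5, via_7))   # True True
```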

5 May 22

## Vector autoregressions: preliminaries

Suppose we are observing two stocks and their respective returns are $x_{t},y_{t}.$ A vector autoregression for the pair $x_{t},y_{t}$ is one way to take into account their interdependence. This theory is undeservedly omitted from the Guide by A. Patton.

### Required minimum in matrix algebra

Matrix notation and summation are very simple.

Matrix multiplication is a little more complex. Make sure to read Global idea 2 and the compatibility rule.

The general approach to study matrices is to compare them to numbers. Here you see the first big No: matrices do not commute, that is, in general $AB\neq BA.$

The idea behind matrix inversion is pretty simple: we want an analog of the property $a\times \frac{1}{a}=1$ that holds for numbers.

Some facts about determinants have very complicated proofs and it is best to stay away from them. But a couple of ideas should be clear from the very beginning. Determinants are defined only for square matrices. The relationship of determinants to matrix invertibility explains the role of determinants. If $A$ is square, it is invertible if and only if $\det A\neq 0$ (this is an equivalent of the condition $a\neq 0$ for numbers).

Here is an illustration of how determinants are used. Suppose we need to solve the equation $AX=Y$ for $X,$ where $A$ and $Y$ are known. Assuming that $\det A\neq 0$ we can premultiply the equation by $A^{-1}$ to obtain $A^{-1}AX=A^{-1}Y.$ (Because of lack of commutativity, we need to keep the order of the factors). Using intuitive properties $A^{-1}A=I$ and $IX=X$ we obtain the solution: $X=A^{-1}Y.$ In particular, we see that if $\det A\neq 0,$ then the equation $AX=0$ has a unique solution $X=0.$

Let $A$ be a square matrix and let $X,Y$ be two vectors. $A,Y$ are assumed to be known and $X$ is unknown. We want to check that $X=\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}$ solves the equation $X-AXA^{T}=Y.$ (Note that for this equation the trick used to solve $AX=Y$ does not work.) Just plug in $X$:

$\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}-A\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}A^{T}$ $=Y+\sum_{s=1}^{\infty }A^{s}Y\left(A^{T}\right) ^{s}-\sum_{s=1}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}=Y$

(write out a couple of first terms in the sums if summation signs frighten you).
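Here is the same check done numerically, truncating the infinite sum. I assume (an extra condition, not stated above) that powers of $A$ die out quickly enough for the series to converge, which holds for the small matrix below:

```python
import numpy as np

A = np.array([[0.5, 0.1],
              [0.0, 0.3]])   # powers of A shrink, so the series converges
Y = np.array([[1.0, 2.0],
              [2.0, 5.0]])

# Partial sum of X = sum_{s=0}^{inf} A^s Y (A^T)^s.
X = np.zeros_like(Y)
term = Y.copy()
for _ in range(200):
    X += term
    term = A @ term @ A.T

residual = X - A @ X @ A.T - Y
print(np.abs(residual).max())   # practically zero
```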

Transposition is a geometrically simple operation. We need only the property $\left( AB\right) ^{T}=B^{T}A^{T}.$

### Variance and covariance

Property 1. Variance of a random vector $X$ and covariance of two random vectors $X,Y$ are defined by

$V\left( X\right) =E\left( X-EX\right) \left( X-EX\right) ^{T},$ $Cov\left( X,Y\right) =E\left( X-EX\right) \left( Y-EY\right) ^{T},$

respectively.

Note that when $EX=0,$ variance becomes

$V\left( X\right) =EXX^{T}=\left( \begin{array}{ccc}EX_{1}^{2} & ... & EX_{1}X_{n} \\ ... & ... & ... \\ EX_{1}X_{n} & ... & EX_{n}^{2}\end{array}\right) .$

Property 2. Let $X,Y$ be random vectors and suppose $A,B$ are constant matrices. We want an analog of $V\left( aX+bY\right) =a^{2}V\left( X\right) +2abCov\left( X,Y\right) +b^{2}V\left( Y\right) .$ In the next calculation we have to remember that the multiplication order cannot be changed.

$V\left( AX+BY\right) =E\left[ AX+BY-E\left( AX+BY\right) \right] \left[ AX+BY-E\left( AX+BY\right) \right] ^{T}$

$=E\left[ A\left( X-EX\right) +B\left( Y-EY\right) \right] \left[ A\left( X-EX\right) +B\left( Y-EY\right) \right] ^{T}$

$=E\left[ A\left( X-EX\right) \right] \left[ A\left( X-EX\right) \right] ^{T}+E\left[ B\left( Y-EY\right) \right] \left[ A\left( X-EX\right) \right] ^{T}$

$+E\left[ A\left( X-EX\right) \right] \left[ B\left( Y-EY\right) \right] ^{T}+E\left[ B\left( Y-EY\right) \right] \left[ B\left( Y-EY\right) \right] ^{T}$

(applying $\left( AB\right) ^{T}=B^{T}A^{T}$)

$=AE\left( X-EX\right) \left( X-EX\right) ^{T}A^{T}+BE\left( Y-EY\right) \left( X-EX\right) ^{T}A^{T}$

$+AE\left( X-EX\right) \left( Y-EY\right) ^{T}B^{T}+BE\left( Y-EY\right) \left( Y-EY\right) ^{T}B^{T}$

$=AV\left( X\right) A^{T}+BCov\left( Y,X\right) A^{T}+ACov(X,Y)B^{T}+BV\left( Y\right) B^{T}.$
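The four-term formula is easy to confirm numerically. In the sketch below I stack $X$ on top of $Y$, pick an arbitrary joint variance matrix $S$ and use the block identity $V(AX+BY)=[A\ B]\,S\,[A\ B]^T$ (an assumption of the sketch, itself a block form of Property 2):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2
M = rng.normal(size=(2 * n, 2 * n))
S = M @ M.T                       # variance matrix of the stacked vector (X, Y)
Vx, Vy = S[:n, :n], S[n:, n:]     # V(X), V(Y)
Cxy, Cyx = S[:n, n:], S[n:, :n]   # Cov(X,Y), Cov(Y,X)

A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))

# Direct block computation of V(AX + BY).
AB = np.hstack([A, B])
direct = AB @ S @ AB.T

# The four-term formula derived above.
formula = A @ Vx @ A.T + B @ Cyx @ A.T + A @ Cxy @ B.T + B @ Vy @ B.T
print(np.allclose(direct, formula))   # True
```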

19 Feb 22

## Distribution of the estimator of the error variance

If you are reading the book by Dougherty: this post is about the distribution of the estimator  $s^2$ defined in Chapter 3.

Consider regression

(1) $y=X\beta +e$

where the deterministic matrix $X$ is of size $n\times k,$ satisfies $\det \left( X^{T}X\right) \neq 0$ (regressors are not collinear) and the error $e$ satisfies

(2) $Ee=0,Var(e)=\sigma ^{2}I$

$\beta$ is estimated by $\hat{\beta}=(X^{T}X)^{-1}X^{T}y.$ Denote $P=X(X^{T}X)^{-1}X^{T},$ $Q=I-P.$ Using (1) we see that $\hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e$ and the residual $r\equiv y-X\hat{\beta}=Qe.$ $\sigma^{2}$ is estimated by

(3) $s^{2}=\left\Vert r\right\Vert ^{2}/\left( n-k\right) =\left\Vert Qe\right\Vert ^{2}/\left( n-k\right) .$

$Q$ is a projector and has properties which are derived from those of $P$:

(4) $Q^{T}=Q,$ $Q^{2}=Q.$

If $\lambda$ is an eigenvalue of $Q,$ then multiplying $Qx=\lambda x$ by $Q$ and using the fact that $x\neq 0$ we get $\lambda ^{2}=\lambda .$ Hence eigenvalues of $Q$ can be only $0$ or $1.$ Since the trace equals the sum of the eigenvalues, the equation $tr\left( Q\right) =n-k$ tells us that $n-k$ of the eigenvalues equal 1 and the remaining $k$ are zeros. Let $Q=U\Lambda U^{T}$ be the diagonal representation of $Q.$ Here $U$ is an orthogonal matrix,

(5) $U^{T}U=I,$

and $\Lambda$ is a diagonal matrix with eigenvalues of $Q$ on the main diagonal. We can assume that the first $n-k$ numbers on the diagonal of $\Lambda$ are ones and the others are zeros.

Theorem. Let $e$ be normal. 1) $s^{2}\left( n-k\right) /\sigma ^{2}$ is distributed as $\chi _{n-k}^{2}.$ 2) The estimators $\hat{\beta}$ and $s^{2}$ are independent.

Proof. 1) We have by (4)

(6) $\left\Vert Qe\right\Vert ^{2}=\left( Qe\right) ^{T}Qe=\left( Q^{T}Qe\right) ^{T}e=\left( Qe\right) ^{T}e=\left( U\Lambda U^{T}e\right) ^{T}e=\left( \Lambda U^{T}e\right) ^{T}U^{T}e.$

Denote $S=U^{T}e.$ From (2) and (5)

$ES=0,$ $Var\left( S\right) =EU^{T}ee^{T}U=\sigma ^{2}U^{T}U=\sigma ^{2}I$

and $S$ is normal as a linear transformation of a normal vector. It follows that $S=\sigma z$ where $z$ is a standard normal vector with independent standard normal coordinates $z_{1},...,z_{n}.$ Hence, (6) implies

(7) $\left\Vert Qe\right\Vert ^{2}=\sigma ^{2}\left( \Lambda z\right) ^{T}z=\sigma ^{2}\left( z_{1}^{2}+...+z_{n-k}^{2}\right) =\sigma ^{2}\chi _{n-k}^{2}.$

(3) and (7) prove the first statement.

2) First we note that the vectors $Pe,Qe$ are independent. Since they are normal, their independence follows from

$cov(Pe,Qe)=EPee^{T}Q^{T}=\sigma ^{2}PQ=0.$

It's easy to see that $X^{T}P=X^{T}.$ This allows us to show that $\hat{\beta}$ is a function of $Pe$:

$\hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e=\beta +(X^{T}X)^{-1}X^{T}Pe.$

Independence of $Pe,Qe$ leads to independence of their functions $\hat{\beta}$ and $s^{2}.$
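A simulation sketch of statement 1) of the theorem (design matrix, sizes and seed are arbitrary): the scaled draws of $s^2$ should have the mean $n-k$ and variance $2(n-k)$ of a $\chi^2_{n-k}$ distribution.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma = 20, 3, 2.0
X = rng.normal(size=(n, k))                 # a fixed (deterministic) design
P = X @ np.linalg.inv(X.T @ X) @ X.T
Q = np.eye(n) - P

draws = []
for _ in range(20_000):
    e = sigma * rng.normal(size=n)          # normal errors, Var(e) = sigma^2 I
    r = Q @ e                               # the residual vector
    s2 = r @ r / (n - k)                    # the estimator (3)
    draws.append(s2 * (n - k) / sigma ** 2)
draws = np.array(draws)

print(draws.mean())   # close to n - k = 17
print(draws.var())    # close to 2(n - k) = 34
```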

25 Nov 18

## Eigenvalues and eigenvectors of a projector


Exercise 1. Find eigenvalues and eigenvectors of a projector.

Solution. We know that a projector doesn't change elements from its image: $Px=x$ for all $x\in\text{Img}(P).$ This means that $\lambda =1$ is an eigenvalue of $P.$ Moreover, if $\{x_i:i=1,...,\dim\text{Img}(P)\}$ is any orthonormal system in $\text{Img}(P),$ each of $x_i$ is an eigenvector of $P$ corresponding to the eigenvalue $\lambda =1.$

Since $P$ maps to zero all elements from the null space $N(P),$ $\lambda =0$ is another eigenvalue. If $\{y_i:i=1,...,\dim N(P)\}$ is any orthonormal system in $N(P),$ each of $y_i$ is an eigenvector of $P$ corresponding to the eigenvalue $\lambda =0.$

A projector cannot have eigenvalues other than $0$ and $1.$ This is proved as follows. Suppose $Px=\lambda x$ with some nonzero $x.$ Applying $P$ to both sides of this equation, we get $Px=P^2x=\lambda Px=\lambda ^2x.$ It follows that $\lambda x=\lambda^2x$ and (because $x\neq 0$) $\lambda =\lambda^2.$ The last equation has only two roots: $0$ and $1.$

We have $\dim\text{Img}(P)+\dim N(P)=n$ because $R^n$ is an orthogonal sum of $N(P)$ and $\text{Img}(P)$.  Combining the systems $\{x_i\},$ $\{y_i\}$ we get an orthonormal basis in $R^{n}$ consisting of eigenvectors of $P$.

### Trace of a projector

Recall that for a square matrix, its trace is defined as the sum of its diagonal elements.

Exercise 2. Prove that $tr(AB)=tr(BA)$ if both products $AB$ and $BA$ are square. It is convenient to call this property trace-commuting (we know that in general matrices do not commute).

Proof. Assume that $A$ is of size $n\times m$ and $B$ is of size $m\times n.$ For both products we need only to find the diagonal elements:

$AB=\left(\begin{array}{ccc} a_{11}&...&a_{1m}\\...&...&...\\a_{n1}&...&a_{nm}\end{array} \right)\left(\begin{array}{ccc} b_{11}&...&b_{1n}\\...&...&...\\b_{m1}&...&b_{mn}\end{array} \right)=\left(\begin{array}{ccc} \sum_ia_{1i}b_{i1}&...&...\\...&...&...\\...&...&\sum_ia_{ni}b_{in}\end{array} \right)$

$BA=\left(\begin{array}{ccc} b_{11}&...&b_{1n}\\...&...&...\\b_{m1}&...&b_{mn}\end{array} \right)\left(\begin{array}{ccc} a_{11}&...&a_{1m}\\...&...&...\\a_{n1}&...&a_{nm}\end{array} \right)=\left(\begin{array}{ccc} \sum_ja_{j1}b_{1j}&...&...\\...&...&...\\...&...&\sum_ja_{jm}b_{mj} \end{array}\right)$

All we have to do is change the order of summation:

$tr(AB)=\sum_j\sum_ia_{ji}b_{ij}=\sum_i\sum_ja_{ji}b_{ij}=tr(BA).$

Exercise 3. Find the trace of a projector.

Solution. In Exercise 1 we established that the projector $P$ has $p=\dim\text{Img}(P)$ eigenvalues $\lambda =1$ and $n-p$ eigenvalues $\lambda =0.$ $P$ is symmetric, so in its diagonal representation $P=UDU^{-1}$ there are $p$ unities and $n-p$ zeros on the diagonal of the diagonal matrix $D.$ By Exercise 2

$tr(P)=tr(UDU^{-1})=tr(DU^{-1}U)=tr(D)=p$.
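Both exercises are easy to see numerically for the regression projector $P=X(X^TX)^{-1}X^T$ (sizes below are arbitrary): its eigenvalues are $p$ ones and $n-p$ zeros, and its trace recovers $p=\dim\text{Img}(P)$, here the number of columns of $X$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 10, 4
X = rng.normal(size=(n, k))                 # full column rank with probability 1
P = X @ np.linalg.inv(X.T @ X) @ X.T        # projector onto the image of X

eigvals = np.linalg.eigvalsh(P)             # ascending order
print(np.round(eigvals, 6))                 # n - k zeros followed by k ones
print(round(np.trace(P)))                   # 4, the dimension of Img(P)
```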

10 Dec 18

## Distributions derived from normal variables

In the one-dimensional case the economic way to define normal variables is this: define a standard normal variable and then a general normal variable as its linear transformation.

In case of many dimensions, we follow the same idea. Before doing that we state without proofs two useful facts about independence of random variables (real-valued, not vectors).

Theorem 1. Suppose variables $X_1,...,X_n$ have densities $p_1(x_1),...,p_n(x_n).$ Then they are independent if and only if their joint density $p(x_1,...,x_n)$ is a product of individual densities: $p(x_1,...,x_n)=p_1(x_1)...p_n(x_n).$

Theorem 2. If variables $X,Y$ are normal, then they are independent if and only if they are uncorrelated: $cov(X,Y)=0.$

The necessity part (independence implies uncorrelatedness) is trivial.

### Normal vectors

Let $z_1,...,z_n$ be independent standard normal variables. A standard normal variable is defined by its density, so all of $z_i$ have the same density. We achieve independence, according to Theorem 1, by defining their joint density to be a product of individual densities.

Definition 1. A standard normal vector of dimension $n$ is defined by

$z=\left(\begin{array}{c}z_1\\...\\z_n\\ \end{array}\right)$

Properties. $Ez=0$ because all of $z_i$ have zero means. Further, $cov(z_i,z_j)=0$ for $i\neq j$ by Theorem 2, and the variance of a standard normal is 1. Therefore, from the expression for the variance of a vector we see that $Var(z)=I.$

Definition 2. For a matrix $A$ and vector $\mu$ of compatible dimensions a normal vector is defined by $X=Az+\mu.$

Properties. $EX=AEz+\mu=\mu$ and

$Var(X)=Var(Az)=E(Az)(Az)^T=AEzz^TA^T=AIA^T=AA^T$

(recall that variance of a vector is always nonnegative).
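A simulation sketch of both properties (the matrix $A$ and vector $\mu$ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.2, 0.0, 1.0]])
mu = np.array([1.0, -2.0, 0.5])

z = rng.normal(size=(200_000, 3))    # rows are draws of a standard normal vector
X = z @ A.T + mu                     # X = Az + mu, draw by draw

print(np.allclose(X.mean(axis=0), mu, atol=0.02))      # EX = mu
print(np.allclose(np.cov(X.T), A @ A.T, atol=0.05))    # Var(X) = A A^T
```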

### Distributions derived from normal variables

In the definitions of standard distributions (chi square, t distribution and F distribution) there is no reference to any sample data. Unlike statistics, which by definition are functions of sample data, these and other standard distributions are theoretical constructs. Statistics are developed in such a way as to have a distribution equal or asymptotically equal to one of standard distributions. This allows practitioners to use tables developed for standard distributions.

Exercise 1. Prove that $\chi_n^2/n$ converges to 1 in probability.

Proof. For a standard normal $z$ we have $Ez^2=1$ and $Var(z^2)=2$ (both properties can be verified in Mathematica). Hence, $E\chi_n^2/n=1$ and

$Var(\chi_n^2/n)=\sum_iVar(z_i^2)/n^2=2/n\rightarrow 0.$

Now the statement follows from the simple form of the law of large numbers.

Exercise 1 implies that for large $n$ the t distribution is close to a standard normal.
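A sketch of Exercise 1 by simulation: the sample variance of $\chi^2_n/n$ draws behaves like $2/n$ and vanishes as $n$ grows (the draw counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
variances = []
for n in (10, 100, 1000):
    # 2000 draws of chi-square_n / n: each is an average of n squared standard normals.
    draws = (rng.normal(size=(2000, n)) ** 2).mean(axis=1)
    variances.append(draws.var())
    print(n, round(draws.mean(), 3), round(draws.var(), 4))  # mean ~ 1, variance ~ 2/n
```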

9 Sep 18

## Applications of the diagonal representation II

### 4. Square root of a matrix

Definition 1. For a symmetric matrix with non-negative eigenvalues the square root is defined by

(1) $A^{1/2}=Udiag[\sqrt{\lambda_1},...,\sqrt{\lambda_n}]U^{-1}.$

Exercise 1. (1) is symmetric and satisfies $(A^{1/2})^2=A.$

Proof. By properties of orthogonal matrices

$(A^{1/2})^T=(U^{-1})^Tdiag[\sqrt{\lambda_1},...,\sqrt{\lambda_n}]U^T=A^{1/2},$ $(A^{1/2})^2=Udiag[\sqrt{\lambda_1},...,\sqrt{\lambda_n}]U^{-1}Udiag[\sqrt{\lambda_1},...,\sqrt{\lambda_n}]U^{-1}$ $=Udiag[\lambda_1,...,\lambda_n]U^{-1}=A.$
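Definition 1 and Exercise 1 translate directly into code (the matrix below is an arbitrary symmetric matrix with positive eigenvalues):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # symmetric, eigenvalues 1 and 3

lam, U = np.linalg.eigh(A)              # A = U diag(lam) U^T with orthogonal U
root = U @ np.diag(np.sqrt(lam)) @ U.T  # definition (1), using U^{-1} = U^T

print(np.allclose(root, root.T))        # symmetric
print(np.allclose(root @ root, A))      # squares back to A
```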

### 5. Generalized least squares estimator

The error term $e$ in the multiple regression $y=X\beta +e$ under homoscedasticity and in absence of autocorrelation satisfies

(2) $V(e)=\sigma^2I,$ where $\sigma^2$ is some positive number.

The OLS estimator in this situation is given by

(3) $\hat{\beta}=(X^TX)^{-1}X^Ty.$

Now consider a more general case $V(e)=\Omega.$

Exercise 2. The variance matrix $V(e)=\Omega$ is always symmetric and non-negative.

Proof. $V(e)^T=[E(e-Ee)(e-Ee)^T]^T=V(e),$

$x^TV(e)x=Ex^T(e-Ee)(e-Ee)^Tx=E\|(e-Ee)^Tx\|^2\geq 0.$

Exercise 3. Assuming that $\Omega$ is positive definite, show that $\Omega^{-1/2}$ is symmetric and satisfies $(\Omega^{-1/2})^2=\Omega^{-1}.$

Proof. Since $\Omega$ is positive definite, its eigenvalues are positive. Hence its inverse $\Omega^{-1}$ exists and is given by $\Omega^{-1}=U\Omega_U^{-1}U^T$ where $\Omega_U^{-1}=diag[\lambda_1^{-1},...,\lambda_n^{-1}].$ It is symmetric as an inverse of a symmetric matrix. It remains to apply Exercise 1 to $A=\Omega^{-1}.$

Exercise 4. Find the variance of $u=\Omega^{-1/2}e$.

Solution. Using the definition of variance of a vector

$V(u)=E(u-Eu)(u-Eu)^T=\Omega^{-1/2}V(e)(\Omega^{-1/2})^T=\Omega^{-1/2}\Omega\Omega^{-1/2}=I.$

Exercise 4 suggests how to transform $y=X\beta +e$ to satisfy (2). In the equation

$\Omega^{-1/2}y=\Omega^{-1/2}X\beta +\Omega^{-1/2}e$

the error $u=\Omega^{-1/2}e$ satisfies the assumption under which (2) is applicable. Let $\tilde{y}=\Omega^{-1/2}y,$ $\tilde{X}=\Omega^{-1/2}X.$ Then we have $\tilde{y}=\tilde{X}\beta +u$ and from (3) $\hat{\beta}=(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{y}.$ Since $\tilde{X}^T=X^T\Omega^{-1/2},$ this can be written as

$\hat{\beta}=(X^T\Omega^{-1/2}\Omega^{-1/2}X)^{-1}X^T\Omega^{-1/2}\Omega^{-1/2}y=(X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}y.$
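A numeric sketch of the last identity (the data and $\Omega$ below are arbitrary, with $\Omega$ made positive definite by construction): OLS on the transformed regression coincides with the GLS formula.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 50, 2
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
M = rng.normal(size=(n, n))
Omega = M @ M.T + np.eye(n)                  # a positive definite variance matrix

# GLS directly: (X^T Omega^{-1} X)^{-1} X^T Omega^{-1} y.
Oi = np.linalg.inv(Omega)
gls = np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)

# The same estimator via the transformed regression with Omega^{-1/2}.
lam, U = np.linalg.eigh(Omega)
O_inv_half = U @ np.diag(lam ** -0.5) @ U.T  # Omega^{-1/2} by the square root recipe
Xt, yt = O_inv_half @ X, O_inv_half @ y
ols_transformed = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)

print(np.allclose(gls, ols_transformed))     # True
```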

2 Aug 18

# Basic statistics

AP Statistics the Genghis Khan way 1

AP Statistics the Genghis Khan way 2

Descriptive statistics and inferential statistics

Numerical versus categorical variable

Uniform distribution definition, with examples

### Using graphs to describe data

What should you hate about AP Statistics? The TI-83+ and TI-84 graphing calculators are terrible

How to prevent cheating with TI-83+ and TI-84

Minitab is overpriced. Use Excel instead

What is a Pareto chart and how is it different from a histogram?

The stem-and-leaf plot is an archaism - it's time to leave it behind

Histogram versus time series plot, with video

Comparing histogram, Pareto chart and time series plot

Using statistical tables for normal distribution

### Probability

Little tricks for AP Statistics

What is probability. Includes sample space; elementary, impossible, sure events; completeness axiom, de Morgan's laws, link between logic and geometry

Independence of events. Includes conditional probability, multiplication rule and visual illustration of independence

Law of total probability - you could have invented this

Significance level and power of test

Reevaluating probabilities based on a piece of evidence

p value definition

### Using numerical measures to describe data

What is a median, with an exercise

Using financial examples to explain properties of sample means

Properties of means

What is a mean value. All means in one place: population mean, sample mean, grouped data formula, mean of a continuous random variable

Unbiasedness definition, with intuition

All properties of variance in one place

Variance of a vector: motivation and visualization

Different faces of vector variance: again visualization helps

Inductive introduction to Chebyshev inequality

Properties of covariance

Properties of standard deviation

Correlation coefficient: the last block of statistical foundation

Statistical measures and their geometric roots

Population mean versus sample mean: summary comparison

Mean plus deviation-from-mean decomposition

Scaling a distribution

What is a z score: the scientific explanation

What is a binomial random variable - analogy with market demand

Active learning - away from boredom of lectures, with Excel file and video. How to simulate several random variables at the same time.

From independence of events to independence of random variables. Includes multiplicativity of means and additivity of variance

Normal distributions. Includes standard normal distribution, (general) normal variable, linear transformation and their properties, video and Mathematica file

Definitions of chi-square, t statistic and F statistic

Student's t distribution: one-line explanation of its origin

Confidence interval and margin of error derivation using z-score. Includes confidence and significance levels, critical value

Confidence interval using t statistic: attach probability or not attach?

### Distribution function

Distribution function properties

Density function properties

Examples of distribution functions

Distribution and density functions of a linear transformation

Binary choice models

Binary choice models: theoretical obstacles

### Maximum likelihood

Maximum likelihood: idea and life of a bulb

Maximum likelihood: application to linear model

### Conditioning

Properties of conditional expectation

Conditional expectation generalized to continuous random variables

Conditional variance properties

### Simulation of random variables

Importance of simulation in Excel for elementary stats courses

Generating the Bernoulli random variable (coin), with Excel file

Creating frequency table and histogram and using Excel macros, with Excel file

Modeling a sample from a normal distribution, with Excel file

### Sampling distributions

Demystifying sampling distributions: too much talking about nothing

### Law of large numbers and central limit theorem

Law of large numbers explained

Law of large numbers illustrated

Law of large numbers: the mega delusion of AP Statistics, with Excel file

All about the law of large numbers. Includes convergence in probability, preservation of arithmetic operations and application to simple regression

Central Limit Theorem versus Law of Large Numbers. Includes convergence in distribution and Excel file

Law of large numbers proved

19 Feb 22

## Estimation of parameters of a normal distribution

Here we show that the knowledge of the distribution of $s^{2}$ for linear regression allows one to do without long calculations contained in the guide ST 2134 by J. Abdey.

Theorem. Let $y_{1},...,y_{n}$ be independent observations from $N\left( \mu,\sigma ^{2}\right)$. 1) $s^{2}\left( n-1\right) /\sigma ^{2}$ is distributed as $\chi _{n-1}^{2}.$ 2) The estimators $\bar{y}$ and $s^{2}$ are independent. 3) $Es^{2}=\sigma ^{2},$ 4) $Var\left( s^{2}\right) =\frac{2\sigma ^{4}}{n-1},$ 5) $\frac{s^{2}-\sigma ^{2}}{\sqrt{2\sigma ^{4}/\left(n-1\right) }}$ converges in distribution to $N\left( 0,1\right) .$

Proof. We can write $y_{i}=\mu +e_{i}$ where $e_{i}$ is distributed as $N\left( 0,\sigma ^{2}\right) .$ Putting $\beta =\mu ,\ y=\left(y_{1},...,y_{n}\right) ^{T},$ $e=\left( e_{1},...,e_{n}\right) ^{T}$ and $X=\left( 1,...,1\right) ^{T}$ (a vector of ones) we satisfy (1) and (2). Since $X^{T}X=n,$ we have $\hat{\beta}=\bar{y}.$ Further,

$r\equiv y-X\hat{ \beta}=\left( y_{1}-\bar{y},...,y_{n}-\bar{y}\right) ^{T}$

and

$s^{2}=\left\Vert r\right\Vert ^{2}/\left( n-1\right) =\sum_{i=1}^{n}\left( y_{i}-\bar{y}\right) ^{2}/\left( n-1\right) .$

Thus 1) and 2) follow from results for linear regression.

3) For a normal variable $X$ its moment generating function is $M_{X}\left( t\right) =\exp \left(\mu t+\frac{1}{2}\sigma ^{2}t^{2}\right)$ (see Guide ST2133, 2021, p.88). For the standard normal we get

$M_{z}^{\prime }\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) t,$ $M_{z}^{\prime \prime }\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) (t^{2}+1),$

$M_{z}^{\prime \prime \prime}\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) (t^{3}+3t),$ $M_{z}^{(4)}\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) (t^{4}+6t^{2}+3).$

Applying the general property $EX^{r}=M_{X}^{\left( r\right) }\left( 0\right)$ (same guide, p.84) we see that

$Ez=0,$ $Ez^{2}=1,$ $Ez^{3}=0,$ $Ez^{4}=3,$

$Var(z)=1,$ $Var\left( z^{2}\right) =Ez^{4}-\left( Ez^{2}\right) ^{2}=3-1=2.$

Therefore

$Es^{2}=\frac{\sigma ^{2}}{n-1}E\left( z_{1}^{2}+...+z_{n-1}^{2}\right) =\frac{\sigma ^{2}}{n-1}\left( n-1\right) =\sigma ^{2}.$

4) By independence of standard normals

$Var\left( s^{2}\right) =\left(\frac{\sigma ^{2}}{n-1}\right) ^{2}\left[ Var\left( z_{1}^{2}\right) +...+Var\left( z_{n-1}^{2}\right) \right] =\frac{\sigma ^{4}}{\left( n-1\right) ^{2}}2\left( n-1\right) =\frac{2\sigma ^{4}}{n-1}.$

5) By standardizing $s^{2}$ we have $\frac{s^{2}-Es^{2}}{\sigma \left(s^{2}\right) }=\frac{s^{2}-\sigma ^{2}}{\sqrt{2\sigma ^{4}/\left( n-1\right) }}$ and this converges in distribution to $N\left( 0,1\right)$ by the central limit theorem.
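Statements 3) and 4) are easy to confirm by simulation (the sample size and parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
mu, sigma, n = 5.0, 2.0, 10
reps = 100_000

y = mu + sigma * rng.normal(size=(reps, n))   # each row: one sample of size n
s2 = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / (n - 1)

print(s2.mean())   # close to sigma^2 = 4
print(s2.var())    # close to 2 sigma^4 / (n-1) = 32/9
```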

13 Apr 19

## Checklist for Quantitative Finance FN3142

Students of FN3142 often think that they can get by with a few technical tricks. The questions below are mostly about the intuition that helps to understand and apply those tricks.

Everywhere we assume that $...,Y_{t-1},Y_t,Y_{t+1},...$ is a time series and $...,I_{t-1},I_t,I_{t+1},...$ is a sequence of corresponding information sets. It is natural to assume that $I_t\subset I_{t+1}$ for all $t.$ We use the short conditional expectation notation: $E_tX=E(X|I_t)$.

### Questions

Question 1. How do you calculate conditional expectation in practice?

Question 2. How do you explain $E_t(E_tX)=E_tX$?

Question 3. Simplify each of $E_tE_{t+1}X$ and $E_{t+1}E_tX$ and explain intuitively.

Question 4. $\varepsilon _t$ is a shock at time $t$. Positive and negative shocks are equally likely. What is your best prediction now for tomorrow's shock? What is your best prediction now for the shock that will happen the day after tomorrow?

Question 5. How and why do you predict $Y_{t+1}$ at time $t$? What is the conditional mean of your prediction?

Question 6. What is the error of such a prediction? What is its conditional mean?

Question 7. Answer the previous two questions replacing $Y_{t+1}$ by $Y_{t+p}$.

Question 8. What is the mean-plus-deviation-from-mean representation (conditional version)?

Question 9. How is the representation from Q.8 reflected in variance decomposition?

Question 10. What is a canonical form? State and prove all properties of its parts.

Question 11. Define conditional variance for white noise process and establish its link with the unconditional one.

Question 12. How do you define the conditional density in case of two variables, when one of them serves as the condition? Use it to prove the LIE.

Question 13. Write down the joint distribution function for a) independent observations and b) for serially dependent observations.

Question 14. If one variable is a linear function of another, what is the relationship between their densities?

Question 15. What can you say about the relationship between $a,b$ if $f(a)=f(b)$? Explain geometrically the definition of the quasi-inverse function.

Answer 1. Conditional expectation is a complex notion. There are several definitions of differing levels of generality and complexity. See one of them here and another in Answer 12.

The point of this exercise is that any definition requires a lot of information, and in practice there is no way to apply any of them to actually calculate a conditional expectation. Then why does the theory juggle conditional expectations? The efficient market hypothesis comes to the rescue: it is posited that all observed market data incorporate all available information, and, in particular, stock prices are already conditioned on $I_t.$

Answers 2 and 3. This is the best explanation I have.

Answer 4. Since positive and negative shocks are equally likely, the best prediction is $E_t\varepsilon _{t+1}=0$ (I call this equation a martingale condition). Similarly, $E_t\varepsilon _{t+2}=0$ but in this case I prefer to see an application of the LIE: $E_{t}\varepsilon _{t+2}=E_t(E_{t+1}\varepsilon _{t+2})=E_t0=0.$

Answer 5. The best prediction is $\hat{Y}_{t+1}=E_tY_{t+1}$ because it minimizes $E_t(Y_{t+1}-f(I_t))^2$ among all functions $f$ of current information $I_t.$ Formally, you can use the first order condition

$\frac{d}{df(I_t)}E_t(Y_{t+1}-f(I_t))^2=-2E_t(Y_{t+1}-f(I_t))=0$

to find that $f(I_t)=E_tf(I_t)=E_tY_{t+1}$ is the minimizing function. By the projector property
$E_t\hat{Y}_{t+1}=E_tE_tY_{t+1}=E_tY_{t+1}=\hat{Y}_{t+1}.$

Answer 6. It is natural to define the prediction error by

$\hat{\varepsilon}_{t+1}=Y_{t+1}-\hat{Y}_{t+1}=Y_{t+1}-E_tY_{t+1}.$

By the projector property $E_t\hat{\varepsilon}_{t+1}=E_tY_{t+1}-E_tY_{t+1}=0$.
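Answers 5 and 6 can be illustrated numerically. The sketch below simulates a hypothetical AR(1) process, for which $E_tY_{t+1}=\phi Y_t$: among linear predictors $cY_t$, the mean squared error is smallest near $c=\phi$, and the prediction error averages to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100_000
phi = 0.8  # assumed AR(1) coefficient, for illustration only

# simulate Y_{t+1} = phi * Y_t + eps_{t+1}
eps = rng.normal(size=T)
y = np.zeros(T)
for t in range(T - 1):
    y[t + 1] = phi * y[t] + eps[t + 1]

# among predictors c * Y_t, the MSE is minimized at c = phi,
# i.e. at the conditional expectation E_t Y_{t+1} = phi * Y_t
grid = np.linspace(0.0, 1.6, 17)
mse = [np.mean((y[1:] - c * y[:-1]) ** 2) for c in grid]
best = grid[int(np.argmin(mse))]
print(best)  # ≈ 0.8

# the prediction error averages to zero, consistent with E_t(error) = 0
err = y[1:] - phi * y[:-1]
print(round(err.mean(), 2))  # ≈ 0.0
```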

Answer 7. To generalize, just change the subscripts. For the prediction we have to use two subscripts: the notation $\hat{Y}_{t,t+p}$ means that we are trying to predict what happens at a future date $t+p$ based on info set $I_t$ (time $t$ is like today). Then by definition $\hat{Y} _{t,t+p}=E_tY_{t+p},$ $\hat{\varepsilon}_{t,t+p}=Y_{t+p}-E_tY_{t+p}.$

Answer 8. Answer 7, obviously, implies $Y_{t+p}=\hat{Y}_{t,t+p}+\hat{\varepsilon}_{t,t+p}.$ The simple case is here.

Answer 9. See the law of total variance and change it to reflect conditioning on $I_t.$
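The law of total variance itself is easy to verify by simulation. Below is a minimal sketch using a made-up two-regime example, where the conditional mean and variance of $Y$ depend on a discrete regime variable $X$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300_000

# X picks one of two regimes; Y is normal with regime-dependent mean and sd
x = rng.integers(0, 2, size=n)               # regime indicator
mu, sd = np.array([0.0, 2.0]), np.array([1.0, 3.0])
y = rng.normal(mu[x], sd[x])

# law of total variance: Var(Y) = E[Var(Y|X)] + Var(E[Y|X])
within = np.mean(sd[x] ** 2)                 # E[Var(Y|X)], here ≈ 5
between = np.var(mu[x])                      # Var(E[Y|X]), here ≈ 1
print(round(np.var(y), 1), round(within + between, 1))  # both ≈ 6.0
```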

Answer 11. Combine conditional variance definition with white noise definition.

Answer 12. The conditional density is defined similarly to the conditional probability. Let $X,Y$ be two random variables. Denote $p_X$ the density of $X$ and $p_{X,Y}$ the joint density. Then the conditional density of $Y$ conditional on $X$ is defined as $p_{Y|X}(y|x)=\frac{p_{X,Y}(x,y)}{p_X(x)}.$ After this we can define the conditional expectation $E(Y|X)=\int yp_{Y|X}(y|x)dy.$ With these definitions one can prove the Law of Iterated Expectations:

$E[E(Y|X)]=\int E(Y|x)p_X(x)dx=\int \left( \int yp_{Y|X}(y|x)dy\right) p_X(x)dx$

$=\int \int y\frac{p_{X,Y}(x,y)}{p_X(x)}p_X(x)dxdy=\int \int yp_{X,Y}(x,y)dxdy=EY.$

This is an illustration to Answer 1 and a prelim to Answer 13.
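The LIE can also be checked numerically. In the hypothetical example below, $E(Y|X)=X^2$ by construction (the noise has zero conditional mean), so the sample analog of $E[E(Y|X)]$ should match the sample mean of $Y$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# a made-up pair: X ~ U(0,1), Y = X**2 + noise with E(noise|X) = 0
x = rng.uniform(size=n)
y = x**2 + rng.normal(scale=0.5, size=n)

# since E(Y|X) = X**2 exactly, the LIE says E[E(Y|X)] = E[Y]
lhs = np.mean(x**2)   # sample analog of E[E(Y|X)]
rhs = np.mean(y)      # sample analog of E[Y]
print(round(lhs, 2), round(rhs, 2))  # both ≈ 1/3
```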

Answer 13. Understanding this answer is essential for Section 8.6 on maximum likelihood of Patton's guide.

a) In case of independent observations $X_1,...,X_n$ the joint density of the vector $X=(X_1,...,X_n)$ is a product of individual densities:

$p_X(x_1,...,x_n)=p_{X_1}(x_1)...p_{X_n}(x_n).$

b) In the time series context it is natural to assume that the next observation depends on the previous ones, that is, for each $t,$ $X_t$ depends on $X_1,...,X_{t-1}$ (serially dependent observations). Therefore we should work with conditional densities $p_{X_t|X_1,...,X_{t-1}}.$ From Answer 12 we can guess how to make conditional densities appear:

$p_{X_1,...,X_n}(x_1,...,x_n)=\frac{p_{X_1,...,X_n}(x_1,...,x_n)}{ p_{X_1,...,X_{n-1}}(x_1,...,x_{n-1})}\frac{p_{X_1,...,X_{n-1}}(x_1,...,x_{n-1})}{ p_{X_1,...,X_{n-2}}(x_1,...,x_{n-2})}...\frac{p_{X_1,X_2}(x_1,x_2)}{p_{X_1}(x_1)}p_{X_1}(x_1).$

The fractions on the right are recognized as conditional densities. The resulting expression is long, but each factor has a clear meaning:

$p_{X_1,...,X_n}(x_1,...,x_n)=p_{X_n|X_1,...,X_{n-1}}(x_n|x_1,...,x_{n-1})\,p_{X_{n-1}|X_1,...,X_{n-2}}(x_{n-1}|x_1,...,x_{n-2})\cdots p_{X_2|X_1}(x_2|x_1)\,p_{X_1}(x_1).$
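For a concrete check of this factorization, take a stationary Gaussian AR(1) (assumed coefficient $\phi=0.6$, unit-variance shocks), where by the Markov property the chain rule collapses to $p(y_1)p(y_2|y_1)p(y_3|y_2)$. The trivariate normal density should match the product of the one-dimensional factors:

```python
import numpy as np

phi = 0.6  # assumed AR(1) coefficient, shocks N(0,1)

def npdf(x, mean, var):
    # univariate normal density
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# stationary covariance matrix of (Y1, Y2, Y3): Var = 1/(1-phi^2), Cov decays as phi^k
v = 1 / (1 - phi**2)
S = v * np.array([[1, phi, phi**2],
                  [phi, 1, phi],
                  [phi**2, phi, 1]])

y = np.array([0.3, -0.5, 1.1])  # an arbitrary evaluation point

# joint density from the trivariate normal formula
q = y @ np.linalg.inv(S) @ y
joint = np.exp(-q / 2) / np.sqrt((2 * np.pi) ** 3 * np.linalg.det(S))

# chain-rule factorization: p(y1) p(y2|y1) p(y3|y2)
chain = npdf(y[0], 0, v) * npdf(y[1], phi * y[0], 1) * npdf(y[2], phi * y[1], 1)

print(np.isclose(joint, chain))  # True
```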

Answer 14. The answer given here helps one understand how to pass from the density of the standard normal to that of the general normal.

Answer 15. This elementary explanation of the function definition can be used in the fifth grade. Note that the conditions sufficient for existence of the inverse are not satisfied in a case as simple as the distribution function of a Bernoulli variable (whose graph has flat pieces and is not continuous). Therefore we need a more general definition of an inverse. Those who think that this question is too abstract can check out UoL exams, where examinees are required to find Value at Risk when the distribution function is a step function. To understand the idea, do the following:

a) Draw a graph of a good function $f$ (continuous and increasing).

b) Fix some value $y_0$ in the range of this function and identify the region $\{y:y\ge y_0\}$.

c) Find the solution $x_0$ of the equation $f(x)=y_0$. By definition, $x_0=f^{-1}(y_0).$ Identify the region $\{x:f(x)\ge y_0\}$.

d) Note that $x_0=\min\{x:f(x)\ge y_0\}$. In general, for bad functions the minimum here may not exist. Therefore the minimum is replaced by the infimum, which gives us the definition of the quasi-inverse:

$x_0=\inf\{x:f(x)\ge y_0\}$.
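Here is a small sketch of the quasi-inverse for a step-function CDF of the kind mentioned above (the Bernoulli parameter and the grid are illustrative choices):

```python
import numpy as np

def quasi_inverse(F, y0, grid):
    # approximate inf{x : F(x) >= y0} on a fine grid
    xs = grid[F(grid) >= y0]
    return xs[0] if len(xs) else np.inf

# step-function CDF of a Bernoulli(p) variable (a hypothetical p = 0.3):
# F(x) = 0 for x < 0, 1 - p on [0, 1), and 1 for x >= 1
p = 0.3
F = lambda x: np.where(x < 0, 0.0, np.where(x < 1, 1 - p, 1.0))

grid = np.round(np.linspace(-1, 2, 3001), 3)
print(quasi_inverse(F, 0.5, grid))  # 0.0, since F(0) = 0.7 >= 0.5
print(quasi_inverse(F, 0.9, grid))  # 1.0, the first x with F(x) >= 0.9
```

Note that for $y_0=0.5$ the equation $F(x)=y_0$ has no solution at all, yet the quasi-inverse is well defined; this is exactly the situation in the UoL Value at Risk questions.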