Search Results for variance of a vector

May 18

Variance of a vector: motivation and visualization

I always show my students the definition of the variance of a vector, and they usually don't pay attention. You need to know what it is, already at the level of simple regression (to understand the derivation of the slope estimator variance), and even more so when you deal with time series. Since I know exactly where students usually stumble, this post is structured as a series of questions and answers.

Think about ideas: how would you define variance of a vector?

Question 1. We know that for a random variable X, its variance is defined by

(1) V(X)=E(X-EX)^{2}.

Now let

X=\left(\begin{array}{c}X_1\\...\\X_n\end{array}\right)

be a vector with n components, each of which is a random variable. How would you define its variance?

The answer is not straightforward because we don't know how to square a vector. Let X^T=(\begin{array}{ccc}X_1& ...&X_n\end{array}) denote the transposed vector. There are two ways to multiply a vector by itself: X^TX and XX^T.

Question 2. Find the dimensions of X^TX and XX^T and their expressions in terms of coordinates of X.

Answer 2. For a product of matrices there is a compatibility rule that I write in the form

(2) A_{n\times m}B_{m\times k}=C_{n\times k}.

Recall that n\times m in the notation A_{n\times m} means that the matrix A has n rows and m columns. For example, X is of size n\times 1. Verbally, the above rule says that the number of columns of A should be equal to the number of rows of B. In the product that common number m disappears and the unique numbers (n and k) give, respectively, the number of rows and columns of C. Isn't the formula easier to remember than the verbal statement? From (2) we see that X_{1\times n}^TX_{n\times 1} is of dimension 1 (it is a scalar) and X_{n\times 1}X_{1\times n}^T is an n\times n matrix.

For actual multiplication of matrices I use the visualization

(3) \left(\begin{array}{ccccc}&&&&\\&&&&\\a_{i1}&a_{i2}&...&a_{i,m-1}&a_{im}\\&&&&\\&&&&\end{array}\right) \left(\begin{array}{ccccc}&&b_{1j}&&\\&&b_{2j}&&\\&&...&&\\&&b_{m-1,j}&&\\&&b_{mj}&&\end{array}\right) =\left(  \begin{array}{ccccc}&&&&\\&&&&\\&&c_{ij}&&\\&&&&\\&&&&\end{array}\right)

Short formulation. Multiply rows from the first matrix by columns from the second one.

Long Formulation. To find the element c_{ij} of C, we find a scalar product of the ith row of A and jth column of B: c_{ij}=a_{i1}b_{1j}+a_{i2}b_{2j}+... To find all elements in the ith row of C, we fix the ith row in A and move right the columns in B. Alternatively, to find all elements in the jth column of C, we fix the jth column in B and move down the rows in A. Using this rule, we have

(4) X^TX=X_1^2+...+X_n^2, XX^T=\left(\begin{array}{ccc}X_1^2&...&X_1X_n\\...&...&...\\X_nX_1&...&X_n^2  \end{array}\right).

Usually students have problems with the second equation.
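A quick numerical check of (4) may help with the second equation. This is only a sketch in plain Python; the numbers in x are made up and stand for one realization of X_1,...,X_n.

```python
# Illustrative values standing in for one realization of X_1,...,X_n (assumption).
x = [1, 2, 3]

# X^T X: a (1 x n) row times an (n x 1) column gives a scalar.
inner = sum(xi * xi for xi in x)

# X X^T: an (n x 1) column times a (1 x n) row gives an n x n matrix
# whose (i, j) element is X_i * X_j.
outer = [[xi * xj for xj in x] for xi in x]

print(inner)  # 14
for row in outer:
    print(row)
```

The squares X_i^2 sit on the main diagonal of XX^T, and their sum is exactly X^TX.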

Based on (1) and (4), we have two candidates to define variance:

(5) V(X)=E(X-EX)^T(X-EX)

and

(6) V(X)=E(X-EX)(X-EX)^T.

Answer 1. The second definition contains more information, in the sense to be explained below, so we define variance of a vector by (6).

Question 3. Find the elements of this matrix.

Answer 3. Variance of a vector has variances of its components on the main diagonal and covariances outside it:

(7) V(X)=\left(\begin{array}{cccc}V(X_1)&Cov(X_1,X_2)&...&Cov(X_1,X_n)\\Cov(X_2,X_1)&V(X_2)&...&Cov(X_2,X_n)\\...&...&...&...\\Cov(X_n,X_1)&Cov(X_n,X_2)&...&V(X_n)\end{array}\right).

If you can't get this on your own, go back to Answer 2.

There is a matrix operation called trace and denoted tr. It is defined only for square matrices and gives the sum of diagonal elements of a matrix.

Exercise 1. Show that tr(V(X))=E(X-EX)^T(X-EX). In this sense definition (6) is more informative than (5).
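Before proving Exercise 1, it can be reassuring to see (6), (7) and the trace identity on numbers. The sketch below assumes a hypothetical vector X=(X_1,X_2) with four equally likely outcomes, so that expectations are plain averages; exact fractions avoid rounding.

```python
from fractions import Fraction

# Hypothetical example: four equally likely outcomes of a vector X = (X_1, X_2).
outcomes = [(1, 2), (3, 1), (2, 5), (6, 4)]
n = len(outcomes[0])
p = Fraction(1, len(outcomes))

# EX, computed component by component.
mean = [sum(p * w[i] for w in outcomes) for i in range(n)]

# V(X) = E (X - EX)(X - EX)^T: average the outer products of the deviations.
V = [[sum(p * (w[i] - mean[i]) * (w[j] - mean[j]) for w in outcomes)
      for j in range(n)] for i in range(n)]

# Left and right sides of Exercise 1.
trace_V = sum(V[i][i] for i in range(n))
expected_sq_norm = sum(p * sum((w[i] - mean[i]) ** 2 for i in range(n))
                       for w in outcomes)

print(V[0][1] == V[1][0])         # True: Cov(X_1,X_2) = Cov(X_2,X_1)
print(trace_V, expected_sq_norm)  # 6 6
```

The off-diagonal elements (covariances) are lost when you pass to the trace, which is the sense in which (6) carries more information than (5).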

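Exercise 2. Show that if EX_1=...=EX_n=0, then (7) becomes

V(X)=EXX^T=\left(\begin{array}{ccc}EX_1^2&...&EX_1X_n\\...&...&...\\EX_nX_1&...&EX_n^2\end{array}\right).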


May 18

Different faces of vector variance: again visualization helps

In the previous post we defined variance of a column vector X with n components by

V(X)=E(X-EX)(X-EX)^T.

In terms of elements this is the same as:

(1) V(X)=\left(\begin{array}{cccc}V(X_1)&Cov(X_1,X_2)&...&Cov(X_1,X_n)\\Cov(X_2,X_1)&V(X_2)&...&Cov(X_2,X_n)\\...&...&...&...\\Cov(X_n,X_1)&Cov(X_n,X_2)&...&V(X_n)\end{array}\right).

So why is knowing the structure of this matrix so important?

Let X_1,...,X_n be random variables and let a_1,...,a_n be numbers. In the derivation of the variance of the slope estimator for simple regression we have to deal with the expression of type

(2) V\left(\sum_{i=1}^na_iX_i\right).

Question 1. How do you multiply a sum by a sum? I mean, how do you use summation signs to find the product \left(\sum_{i=1}^na_i\right)\left(\sum_{i=1}^nb_i\right)?

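Answer 1. Whenever you have problems with summation signs, try to do without them. The product

\left(a_1+...+a_n\right)\left(b_1+...+b_n\right)

should contain ALL products a_ib_j. Again, a matrix visualization will help:

\left(\begin{array}{ccc}a_1b_1&...&a_1b_n\\...&...&...\\a_nb_1&...&a_nb_n\end{array}\right).

The product we are looking for should contain all elements of this matrix. So the answer is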

(3) \left(\sum_{i=1}^na_i\right)\left(\sum_{i=1}^nb_i\right)=\sum_{i=1}^n\sum_{j=1}^na_ib_j.

Formally, we can write \sum_{i=1}^nb_i=\sum_{j=1}^nb_j (the sum does not depend on the index of summation, this is another point many students don't understand) and then perform the multiplication in (3).
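If (3) still looks suspicious, try it on numbers. The following is a small sketch with made-up values of a_i and b_j; it builds the matrix of all products and adds up its elements.

```python
a = [1, 2, 3]
b = [4, 5, 6]
n = len(a)

product_of_sums = sum(a) * sum(b)

# The matrix of all products a_i * b_j; (3) says we add up every element.
table = [[a[i] * b[j] for j in range(n)] for i in range(n)]
double_sum = sum(table[i][j] for i in range(n) for j in range(n))

print(product_of_sums, double_sum)  # 90 90
```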

Question 2. What is the expression for (2) in terms of covariances of components?

Answer 2. If you understand Answer 1 and know the relationship between variances and covariances, it should be clear that

(4) V\left(\sum_{i=1}^na_iX_i\right)=Cov(\sum_{i=1}^na_iX_i,\sum_{i=1}^na_iX_i)

=\sum_{i=1}^n\sum_{j=1}^na_ia_jCov(X_i,X_j).

Question 3. In light of (1), separate variances from covariances in (4).

Answer 3. When i=j, we have Cov(X_i,X_j)=V(X_i), which are diagonal elements of (1). Otherwise, for i\neq j we get off-diagonal elements of (1). So the answer is

(5) V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i)+\sum_{i\neq j}a_ia_jCov(X_i,X_j).

Once again, in the first sum on the right we have only variances. In the second sum, the indices i,j are assumed to run from 1 to n, excluding the diagonal i=j.

Corollary. If X_{i} are uncorrelated, then the second sum in (5) disappears:

(6) V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i).

This fact has been used (with a slightly different explanation) in the derivation of the variance of the slope estimator for simple regression.

Question 4. Note that the matrix (1) is symmetric (elements above the main diagonal equal their mirror siblings below that diagonal). This means that some terms in the second sum on the right of (5) are repeated twice. If you group equal terms in (5), what do you get?

Answer 4. The idea is to write

a_ia_jCov(X_i,X_j)+a_ja_iCov(X_j,X_i)=2a_ia_jCov(X_i,X_j),


that is, to join equal elements above and below the main diagonal in (1). For this, you need to figure out how to write a sum of the elements that are above the main diagonal. Make a bigger version of (1) (with more off-diagonal elements) to see that the elements that are above the main diagonal are listed in the sum \sum_{i=1}^{n-1}\sum_{j=i+1}^n. This sum can also be written as \sum_{1\leq i<j\leq n}. Hence, (5) is the same as

(7) V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i)+2\sum_{i=1}^{n-1}\sum_{j=i+1}^na_ia_jCov(X_i,X_j)

=\sum_{i=1}^na_i^2V(X_i)+2\sum_{1\leq i<j\leq n}a_ia_jCov(X_i,X_j).

Unlike (6), this equation is applicable when there is autocorrelation.
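The three ways of writing the variance of a linear combination (the full double sum, the split (5) and the grouped form (7)) can be compared numerically. The sketch below assumes a hypothetical symmetric variance matrix V and coefficients a; it is an illustration, not a proof.

```python
# Hypothetical symmetric variance matrix V(X) and coefficients a (assumptions).
V = [[4, 1, 2],
     [1, 9, 3],
     [2, 3, 16]]
a = [1, 2, 3]
n = len(a)

# Full double sum over all pairs (i, j).
full = sum(a[i] * a[j] * V[i][j] for i in range(n) for j in range(n))

# (5): variances plus all off-diagonal covariances.
variances = sum(a[i] ** 2 * V[i][i] for i in range(n))
off_diag = sum(a[i] * a[j] * V[i][j]
               for i in range(n) for j in range(n) if i != j)

# (7): variances plus twice the sum above the main diagonal.
upper = sum(a[i] * a[j] * V[i][j]
            for i in range(n - 1) for j in range(i + 1, n))

print(full, variances + off_diag, variances + 2 * upper)  # 236 236 236
```

Note how the index ranges in `upper` copy the limits of \sum_{i=1}^{n-1}\sum_{j=i+1}^n, shifted by one because Python counts from zero.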


May 22

Vector autoregressions: preliminaries

Suppose we are observing two stocks and their respective returns are x_{t},y_{t}. A vector autoregression for the pair x_{t},y_{t} is one way to take into account their interdependence. This theory is undeservedly omitted from the Guide by A. Patton.

Required minimum in matrix algebra

Matrix notation and summation are very simple.

Matrix multiplication is a little more complex. Make sure to read Global idea 2 and the compatibility rule.

The general approach to study matrices is to compare them to numbers. Here you see the first big No: matrices do not commute, that is, in general AB\neq BA.

The idea behind matrix inversion is pretty simple: we want an analog of the property a\times \frac{1}{a}=1 that holds for numbers.

Some facts about determinants have very complicated proofs and it is best to stay away from them. But a couple of ideas should be clear from the very beginning. Determinants are defined only for square matrices. The relationship of determinants to matrix invertibility explains the role of determinants. If A is square, it is invertible if and only if \det A\neq 0 (this is an equivalent of the condition a\neq 0 for numbers).

Here is an illustration of how determinants are used. Suppose we need to solve the equation AX=Y for X, where A and Y are known. Assuming that \det A\neq 0 we can premultiply the equation by A^{-1} to obtain A^{-1}AX=A^{-1}Y. (Because of lack of commutativity, we need to keep the order of the factors). Using intuitive properties A^{-1}A=I and IX=X we obtain the solution: X=A^{-1}Y. In particular, we see that if \det A\neq 0, then the equation AX=0 has a unique solution X=0.

Let A be a square matrix and let X,Y be two square matrices of the same size as A. A,Y are assumed to be known and X is unknown. We want to check that X=\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s} solves the equation X-AXA^{T}=Y, provided that the series converges. (Note that for this equation the trick used to solve AX=Y does not work.) Just plug X:

\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}-A\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}A^{T} =Y+\sum_{s=1}^{\infty }A^{s}Y\left(A^{T}\right) ^{s}-\sum_{s=1}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}=Y

(write out a couple of first terms in the sums if summation signs frighten you).
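The same check can be done numerically with a truncated series. The sketch below is an illustration under assumptions: the 2 by 2 matrices A and Y are made up, and A is chosen with eigenvalues below 1 in absolute value so that the terms of the series die out.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical inputs; the eigenvalues of A (0.5 and 0.3) are below 1 in
# absolute value, so the terms A^s Y (A^T)^s shrink geometrically.
A = [[0.5, 0.1], [0.0, 0.3]]
Y = [[1.0, 0.2], [0.2, 2.0]]
At = transpose(A)

# Partial sum of X = sum over s of A^s Y (A^T)^s.
X = [[0.0, 0.0], [0.0, 0.0]]
term = Y
for _ in range(200):
    X = [[X[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    term = matmul(matmul(A, term), At)

# Residual of the equation X - A X A^T = Y; it should be numerically zero.
AXAt = matmul(matmul(A, X), At)
residual = max(abs(X[i][j] - AXAt[i][j] - Y[i][j])
               for i in range(2) for j in range(2))
print(residual)
```

The update `term = A * term * A^T` is the code version of moving from the s-th to the (s+1)-th term of the series.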

Transposition is a geometrically simple operation. We need only the property \left( AB\right) ^{T}=B^{T}A^{T}.

Variance and covariance

Property 1. Variance of a random vector X and covariance of two random vectors X,Y are defined by

V\left( X\right) =E\left( X-EX\right) \left( X-EX\right) ^{T}, Cov\left(  X,Y\right) =E\left( X-EX\right) \left( Y-EY\right) ^{T},


Note that when EX=0, variance becomes

V\left( X\right) =EXX^{T}=\left(  \begin{array}{ccc}EX_{1}^{2} & ... & EX_{1}X_{n} \\  ... & ... & ... \\  EX_{n}X_{1} & ... & EX_{n}^{2}\end{array}\right) .

Property 2. Let X,Y be random vectors and suppose A,B are constant matrices. We want an analog of V\left( aX+bY\right) =a^{2}V\left( X\right) +2abCov\left( X,Y\right) +b^{2}V\left( Y\right) . In the next calculation we have to remember that the multiplication order cannot be changed.

V\left( AX+BY\right) =E\left[ AX+BY-E\left( AX+BY\right) \right] \left[  AX+BY-E\left( AX+BY\right) \right] ^{T}

=E\left[ A\left( X-EX\right) +B\left( Y-EY\right) \right] \left[ A\left(  X-EX\right) +B\left( Y-EY\right) \right] ^{T}

=E\left[ A\left( X-EX\right) \right] \left[ A\left( X-EX\right) \right]  ^{T}+E\left[ B\left( Y-EY\right) \right] \left[ A\left( X-EX\right) \right]  ^{T}

+E\left[ A\left( X-EX\right) \right] \left[ B\left( Y-EY\right) \right]  ^{T}+E\left[ B\left( Y-EY\right) \right] \left[ B\left( Y-EY\right) \right]  ^{T}

(applying \left( AB\right) ^{T}=B^{T}A^{T})

=AE\left( X-EX\right) \left( X-EX\right) ^{T}A^{T}+BE\left( Y-EY\right)  \left( X-EX\right) ^{T}A^{T}

+AE\left( X-EX\right) \left( Y-EY\right) ^{T}B^{T}+BE\left( Y-EY\right)  \left( Y-EY\right) ^{T}B^{T}

=AV\left( X\right) A^{T}+BCov\left( Y,X\right)  A^{T}+ACov(X,Y)B^{T}+BV\left( Y\right) B^{T}.
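The final formula can be verified exactly on a toy example. The sketch below assumes a hypothetical sample space with four equally likely outcomes of the pair (X,Y) and made-up constant matrices A,B; all names are illustrative. Exact fractions are used, so the two sides must coincide without rounding.

```python
from fractions import Fraction

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(*matrices):
    return [[sum(M[i][j] for M in matrices) for j in range(len(matrices[0][0]))]
            for i in range(len(matrices[0]))]

def matvec(M, v):
    return [sum(M[i][k] * v[k] for k in range(len(v))) for i in range(len(M))]

def cov(U, W):
    """Cov(U, W) = E(U - EU)(W - EW)^T for equally likely paired outcomes."""
    m = len(U)
    EU = [Fraction(sum(u[i] for u in U), m) for i in range(len(U[0]))]
    EW = [Fraction(sum(w[j] for w in W), m) for j in range(len(W[0]))]
    return [[sum((u[i] - EU[i]) * (w[j] - EW[j]) for u, w in zip(U, W)) / m
             for j in range(len(EW))] for i in range(len(EU))]

# Hypothetical sample space: four equally likely outcomes of the pair (X, Y).
Xs = [(1, 0), (0, 2), (3, 1), (2, 2)]
Ys = [(1, 1), (2, 0), (0, 3), (1, 2)]
A = [[1, 2], [0, 1]]
B = [[2, 0], [1, 1]]

# Outcomes of AX + BY.
Zs = [[s + t for s, t in zip(matvec(A, x), matvec(B, y))]
      for x, y in zip(Xs, Ys)]

lhs = cov(Zs, Zs)  # V(AX + BY) computed directly from the definition
rhs = add(matmul(matmul(A, cov(Xs, Xs)), transpose(A)),
          matmul(matmul(B, cov(Ys, Xs)), transpose(A)),
          matmul(matmul(A, cov(Xs, Ys)), transpose(B)),
          matmul(matmul(B, cov(Ys, Ys)), transpose(B)))
print(lhs == rhs)  # True
```

Swapping the order of factors in any of the four terms of `rhs` generally breaks the equality, which is the practical meaning of the warning about multiplication order.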



Feb 22

Distribution of the estimator of the error variance

If you are reading the book by Dougherty: this post is about the distribution of the estimator  s^2 defined in Chapter 3.

Consider regression

(1) y=X\beta +e

where the deterministic matrix X is of size n\times k, satisfies \det  \left( X^{T}X\right) \neq 0 (regressors are not collinear) and the error e satisfies

(2) Ee=0,Var(e)=\sigma ^{2}I

\beta is estimated by \hat{\beta}=(X^{T}X)^{-1}X^{T}y. Denote P=X(X^{T}X)^{-1}X^{T}, Q=I-P. Using (1) we see that \hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e and the residual r\equiv y-X\hat{\beta}=Qe. \sigma^{2} is estimated by

(3) s^{2}=\left\Vert r\right\Vert ^{2}/\left( n-k\right) =\left\Vert  Qe\right\Vert ^{2}/\left( n-k\right) .

Q is a projector and has properties which are derived from those of P

(4) Q^{T}=Q, Q^{2}=Q.

If \lambda is an eigenvalue of Q, then multiplying Qx=\lambda x by Q and using the fact that x\neq 0 we get \lambda ^{2}=\lambda . Hence eigenvalues of Q can be only 0 or 1. The equation tr\left( Q\right) =n-k tells us that the number of eigenvalues equal to 1 is n-k and the remaining k are zeros. Let Q=U\Lambda U^{T} be the diagonal representation of Q. Here U is an orthogonal matrix,

(5) U^{T}U=I,

and \Lambda is a diagonal matrix with eigenvalues of Q on the main diagonal. We can assume that the first n-k numbers on the diagonal of \Lambda are ones and the others are zeros.
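Properties (4) and the trace equation are easy to see in the simplest special case. The sketch below assumes, purely for illustration, that X is a single column of ones (so k=1), in which case every element of P equals 1/n.

```python
from fractions import Fraction

n = 4
# Simplest case (assumption for illustration): X is a column of ones, so k = 1,
# X^T X = n and P = X (X^T X)^{-1} X^T has every element equal to 1/n.
P = [[Fraction(1, n)] * n for _ in range(n)]
Q = [[(1 if i == j else 0) - P[i][j] for j in range(n)] for i in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

Q_squared = matmul(Q, Q)
Q_transposed = [list(col) for col in zip(*Q)]
trace_Q = sum(Q[i][i] for i in range(n))

print(Q_transposed == Q, Q_squared == Q, trace_Q)  # True True 3
```

Here tr(Q)=n-1=3, in agreement with tr(Q)=n-k for k=1.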

Theorem. Let e be normal. 1) s^{2}\left( n-k\right) /\sigma ^{2} is distributed as \chi _{n-k}^{2}. 2) The estimators \hat{\beta} and s^{2} are independent.

Proof. 1) We have by (4)

(6) \left\Vert Qe\right\Vert ^{2}=\left( Qe\right) ^{T}Qe=\left(  Q^{T}Qe\right) ^{T}e=\left( Qe\right) ^{T}e=\left( U\Lambda U^{T}e\right)  ^{T}e=\left( \Lambda U^{T}e\right) ^{T}U^{T}e.

Denote S=U^{T}e. From (2) and (5)

ES=0, Var\left( S\right) =EU^{T}ee^{T}U=\sigma ^{2}U^{T}U=\sigma ^{2}I

and S is normal as a linear transformation of a normal vector. It follows that S=\sigma z where z is a standard normal vector with independent standard normal coordinates z_{1},...,z_{n}. Hence, (6) implies

(7) \left\Vert Qe\right\Vert ^{2}=\sigma ^{2}\left( \Lambda z\right)  ^{T}z=\sigma ^{2}\left( z_{1}^{2}+...+z_{n-k}^{2}\right) =\sigma ^{2}\chi  _{n-k}^{2}.

(3) and (7) prove the first statement.

2) First we note that the vectors Pe,Qe are independent. Since they are normal, their independence follows from

cov(Pe,Qe)=EPee^{T}Q^{T}=\sigma ^{2}PQ=0.

It's easy to see that X^{T}P=X^{T}. This allows us to show that \hat{\beta} is a function of Pe:

\hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e=\beta +(X^{T}X)^{-1}X^{T}Pe.

Independence of Pe,Qe leads to independence of their functions \hat{\beta} and s^{2}.



Nov 18

Eigenvalues and eigenvectors of a projector

Exercise 1. Find eigenvalues and eigenvectors of a projector.

Solution. We know that a projector doesn't change elements from its image: Px=x for all x\in\text{Img}(P). This means that \lambda =1 is an eigenvalue of P. Moreover, if \{x_i:i=1,...,\dim\text{Img}(P)\} is any orthonormal system in \text{Img}(P), each of x_i is an eigenvector of P corresponding to the eigenvalue \lambda =1.

Since P maps to zero all elements from the null space N(P), \lambda =0 is another eigenvalue. If \{y_i:i=1,...,\dim N(P)\} is any orthonormal system in N(P), each of y_i is an eigenvector of P corresponding to the eigenvalue \lambda =0.

A projector cannot have eigenvalues other than 0 and 1. This is proved as follows. Suppose Px=\lambda x with some nonzero x. Applying P to both sides of this equation, we get Px=P^2x=\lambda Px=\lambda ^2x. It follows that \lambda x=\lambda^2x and (because x\neq 0) \lambda =\lambda^2. The last equation has only two roots: 0 and 1.

We have \dim\text{Img}(P)+\dim N(P)=n because R^n is an orthogonal sum of N(P) and \text{Img}(P).  Combining the systems \{x_i\}, \{y_i\} we get an orthonormal basis in R^{n} consisting of eigenvectors of P.

Trace of a projector

Recall that for a square matrix, its trace is defined as the sum of its diagonal elements.

Exercise 2. Prove that tr(AB)=tr(BA) if both products AB and BA are square. It is convenient to call this property trace-commuting (we know that in general matrices do not commute).

Proof. Assume that A is of size n\times m and B is of size m\times n. For both products we need only to find the diagonal elements:

AB=\left(\begin{array}{ccc}  a_{11}&...&a_{1m}\\...&...&...\\a_{n1}&...&a_{nm}\end{array}  \right)\left(\begin{array}{ccc}  b_{11}&...&b_{1n}\\...&...&...\\b_{m1}&...&b_{mn}\end{array}  \right)=\left(\begin{array}{ccc}  \sum_ia_{1i}b_{i1}&...&...\\...&...&...\\...&...&\sum_ia_{ni}b_{in}\end{array}  \right)

BA=\left(\begin{array}{ccc}  b_{11}&...&b_{1n}\\...&...&...\\b_{m1}&...&b_{mn}\end{array}  \right)\left(\begin{array}{ccc}  a_{11}&...&a_{1m}\\...&...&...\\a_{n1}&...&a_{nm}\end{array}  \right)=\left(\begin{array}{ccc}  \sum_ja_{j1}b_{1j}&...&...\\...&...&...\\...&...&\sum_ja_{jm}b_{mj}  \end{array}\right)

All we have to do is change the order of summation:

tr(AB)=\sum_{j=1}^n\sum_{i=1}^ma_{ji}b_{ij}=\sum_{i=1}^m\sum_{j=1}^na_{ji}b_{ij}=tr(BA).
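A numerical illustration of trace-commuting, as a sketch with made-up matrices: A is 2 by 3 and B is 3 by 2, so AB and BA are square but of different sizes, and still their traces agree.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]           # 3 x 2

AB = matmul(A, B)        # 2 x 2
BA = matmul(B, A)        # 3 x 3

print(trace(AB), trace(BA))  # 212 212
```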


Exercise 3. Find the trace of a projector.

Solution. In Exercise 1 we established that the projector P has p=\dim\text{Img}(P) eigenvalues \lambda =1 and n-p eigenvalues \lambda =0. P is symmetric, so in its diagonal representation P=UDU^{-1} there are p unities and n-p zeros on the diagonal of the diagonal matrix D. By Exercise 2

tr(P)=tr(UDU^{-1})=tr(DU^{-1}U)=tr(D)=p.



Dec 18

Distributions derived from normal variables

Useful facts about independence

In the one-dimensional case the economical way to define normal variables is this: define a standard normal variable and then a general normal variable as its linear transformation.

In case of many dimensions, we follow the same idea. Before doing that we state without proofs two useful facts about independence of random variables (real-valued, not vectors).

Theorem 1. Suppose variables X_1,...,X_n have densities p_1(x_1),...,p_n(x_n). Then they are independent if and only if their joint density p(x_1,...,x_n) is a product of individual densities: p(x_1,...,x_n)=p_1(x_1)...p_n(x_n).

Theorem 2. If variables X,Y are normal, then they are independent if and only if they are uncorrelated: cov(X,Y)=0.

The necessity part (independence implies uncorrelatedness) is trivial.

Normal vectors

Let z_1,...,z_n be independent standard normal variables. A standard normal variable is defined by its density, so all of z_i have the same density. We achieve independence, according to Theorem 1, by defining their joint density to be a product of individual densities.

Definition 1. A standard normal vector of dimension n is defined by

z=\left(\begin{array}{c}z_1\\...\\z_n\\ \end{array}\right)

Properties. Ez=0 because all of z_i have means zero. Further, cov(z_i,z_j)=0 for i\neq j by Theorem 2 and variance of a standard normal is 1. Therefore, from the expression for variance of a vector we see that Var(z)=I.

Definition 2. For a matrix A and vector \mu of compatible dimensions a normal vector is defined by X=Az+\mu.

Properties. EX=AEz+\mu=\mu and

Var(X)=E(X-EX)(X-EX)^T=EAzz^TA^T=AVar(z)A^T=AA^T

(recall that variance of a vector is always nonnegative).

Distributions derived from normal variables

In the definitions of standard distributions (chi square, t distribution and F distribution) there is no reference to any sample data. Unlike statistics, which by definition are functions of sample data, these and other standard distributions are theoretical constructs. Statistics are developed in such a way as to have a distribution equal or asymptotically equal to one of standard distributions. This allows practitioners to use tables developed for standard distributions.

Exercise 1. Prove that \chi_n^2/n converges to 1 in probability.

Proof. For a standard normal z we have Ez^2=1 and Var(z^2)=2 (both properties can be verified in Mathematica). Hence, E\chi_n^2/n=1 and

Var(\chi_n^2/n)=\sum_iVar(z_i^2)/n^2=2/n\rightarrow 0.

Now the statement follows from the simple form of the law of large numbers.

Exercise 1 implies that for large n the t distribution is close to a standard normal.
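The convergence in Exercise 1 can be watched in a simulation. The sketch below is an illustration, not a proof: it builds \chi_n^2 as a sum of n squared standard normals using Python's `random.gauss`, and the seed and sample sizes are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def chi_square_over_n(n):
    """One draw of chi^2_n / n, built as the average of n squared standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(n)) / n

# The draws should cluster more and more tightly around 1 as n grows,
# because Var(chi^2_n / n) = 2/n.
for n in (10, 1000, 100000):
    print(n, chi_square_over_n(n))
```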


Sep 18

Applications of the diagonal representation II

4. Square root of a matrix

Definition 1. For a symmetric matrix with non-negative eigenvalues the square root is defined by

(1) A^{1/2}=Udiag[\sqrt{\lambda_1},...,\sqrt{\lambda_n}]U^{-1}.

Exercise 1. (1) is symmetric and satisfies (A^{1/2})^2=A.

Proof. By properties of orthogonal matrices

(A^{1/2})^2=Udiag[\sqrt{\lambda_1},...,\sqrt{\lambda_n}]U^{-1}Udiag[\sqrt{\lambda_1},...,\sqrt{\lambda_n}]U^{-1} =Udiag[\lambda_1,...,\lambda_n]U^{-1}=A.
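Definition 1 can be tried on a small example. The sketch below assumes a particular 2 by 2 symmetric matrix whose diagonal representation is known in closed form (eigenvalues 3 and 1), so no eigenvalue routine is needed; all names are illustrative.

```python
import math

A = [[2.0, 1.0], [1.0, 2.0]]
# For this A the eigenvalues are 3 and 1, with orthonormal eigenvectors
# (1, 1)/sqrt(2) and (1, -1)/sqrt(2) written as the columns of U.
r = 1 / math.sqrt(2)
U = [[r, r], [r, -r]]
eigenvalues = [3.0, 1.0]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def from_eigen(values):
    """U diag[values] U^{-1}, using U^{-1} = U^T for an orthogonal U."""
    D = [[values[i] if i == j else 0.0 for j in range(2)] for i in range(2)]
    return matmul(matmul(U, D), transpose(U))

root = from_eigen([math.sqrt(v) for v in eigenvalues])  # A^{1/2} as in (1)
square = matmul(root, root)

print(root)
print(square)  # equals A up to rounding
```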

5. Generalized least squares estimator

The error term e in the multiple regression y=X\beta +e under homoscedasticity and in absence of autocorrelation satisfies

(2) V(e)=\sigma^2I, where \sigma^2 is some positive number.

The OLS estimator in this situation is given by

(3) \hat{\beta}=(X^TX)^{-1}X^Ty.

Now consider a more general case V(e)=\Omega.

Exercise 2. The variance matrix V(e)=\Omega is always symmetric and non-negative.

Proof. Symmetry follows from the definition of the variance of a vector. For any constant vector x

x^TV(e)x=Ex^T(e-Ee)(e-Ee)^Tx=E\|(e-Ee)^Tx\|^2\geq 0.

Exercise 3. Let's assume that \Omega is positive. Show that \Omega^{-1/2} is symmetric and satisfies (\Omega^{-1/2})^2=\Omega^{-1}.

Proof. By Exercise 1 the eigenvalues of \Omega are positive. Hence its inverse \Omega^{-1} exists and is given by \Omega^{-1}=U\Omega_U^{-1}U^T where \Omega_U^{-1}=diag[\lambda_1^{-1},...,\lambda_n^{-1}]. It is symmetric as an inverse of a symmetric matrix. It remains to apply Exercise 1 to A=\Omega^{-1}.

Exercise 4. Find the variance of u=\Omega^{-1/2}e.

Solution. Using the definition of variance of a vector

V(u)=E(u-Eu)(u-Eu)^T=\Omega^{-1/2}V(e)(\Omega^{-1/2})^T=\Omega^{-1/2}\Omega\Omega^{-1/2}=I.


Exercise 4 suggests how to transform y=X\beta +e to satisfy (2). In the equation

\Omega^{-1/2}y=\Omega^{-1/2}X\beta +\Omega^{-1/2}e

the error u=\Omega^{-1/2}e satisfies (2) with \sigma^2=1, so (3) is applicable. Let \tilde{y}=\Omega^{-1/2}y, \tilde{X}=\Omega^{-1/2}X. Then we have \tilde{y}=\tilde{X}\beta +u and from (3) \hat{\beta}=(\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{y}. Since \tilde{X}^T=X^T\Omega^{-1/2}, this can be written as

\hat{\beta}=(X^T\Omega^{-1/2}\Omega^{-1/2}X)^{-1}X^T\Omega^{-1/2}\Omega^{-1/2}y=(X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}y.

This is Aitken's Generalized least squares estimator.
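To see that transforming the data and applying OLS gives the same number as the closed form (X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}y, here is a sketch for the simplest case. It assumes one regressor and a diagonal \Omega (uncorrelated errors with different variances); the data are made up.

```python
import math

# Hypothetical data: one regressor (k = 1) and a diagonal Omega, i.e.
# heteroscedastic but uncorrelated errors with variances omega_i (assumptions).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]
omega = [1.0, 4.0, 9.0, 16.0]

# Route 1: transform by Omega^{-1/2} (divide by sqrt(omega_i)), then apply OLS.
x_t = [xi / math.sqrt(w) for xi, w in zip(x, omega)]
y_t = [yi / math.sqrt(w) for yi, w in zip(y, omega)]
beta_transformed = sum(s * t for s, t in zip(x_t, y_t)) / sum(s * s for s in x_t)

# Route 2: the closed form (X^T Omega^{-1} X)^{-1} X^T Omega^{-1} y.
beta_gls = (sum(xi * yi / w for xi, yi, w in zip(x, y, omega))
            / sum(xi * xi / w for xi, w in zip(x, omega)))

print(beta_transformed, beta_gls)
```

With a diagonal \Omega the estimator reduces to weighted least squares: observations with larger error variance get smaller weight.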


Aug 18

Basic statistics

AP Statistics the Genghis Khan way 1

AP Statistics the Genghis Khan way 2

Descriptive statistics and inferential statistics

Numerical versus categorical variable

Uniform distribution definition, with examples

Using graphs to describe data

What should you hate about AP Statistics? The TI-83+ and TI-84 graphing calculators are terrible

How to prevent cheating with TI-83+ and TI-84

Minitab is overpriced. Use Excel instead

What is a Pareto chart and how is it different from a histogram?

The stem-and-leaf plot is an archaism - it's time to leave it behind

Histogram versus time series plot, with video

Comparing histogram, Pareto chart and times series plot

Using statistical tables for normal distribution


Little tricks for AP Statistics

What is probability. Includes sample space; elementary, impossible, sure events; completeness axiom,  de Morgan’s laws, link between logic and geometry

Independence of events. Includes conditional probability, multiplication rule and visual illustration of independence

Law of total probability - you could have invented this

Significance level and power of test

Reevaluating probabilities based on piece of evidence

p value definition

Using numerical measures to describe data

What is a median, with an exercise

Using financial examples to explain properties of sample means

Properties of means

What is a mean value. All means in one place: population mean, sample mean, grouped data formula, mean of a continuous random variable

Unbiasedness definition, with intuition

All properties of variance in one place

Variance of a vector: motivation and visualization

Different faces of vector variance: again visualization helps

Inductive introduction to Chebyshev inequality

Properties of covariance

Properties of standard deviation

Correlation coefficient: the last block of statistical foundation

Statistical measures and their geometric roots

Population mean versus sample mean: summary comparison

Mean plus deviation-from-mean decomposition

Scaling a distribution

What is a z score: the scientific explanation

What is a binomial random variable - analogy with market demand

Active learning - away from boredom of lectures, with Excel file and video. How to simulate several random variables at the same time.

From independence of events to independence of random variables. Includes multiplicativity of means and additivity of variance

Normal distributions. Includes standard normal distribution, (general) normal variable, linear transformation and their properties, video and Mathematica file

Definitions of chi-square, t statistic and F statistic

Student's t distribution: one-line explanation of its origin

Confidence interval and margin of error derivation using z-score. Includes confidence and significance levels, critical value

Confidence interval using t statistic: attach probability or not attach?

Distribution function

Distribution function properties

Density function properties

Examples of distribution functions

Distribution and density functions of a linear transformation

Binary choice models

Binary choice models: theoretical obstacles

Maximum likelihood

Maximum likelihood: idea and life of a bulb

Maximum likelihood: application to linear model


Properties of conditional expectation

Conditional expectation generalized to continuous random variables

Conditional variance properties

Simulation of random variables

Importance of simulation in Excel for elementary stats courses

Generating the Bernoulli random variable (coin), with Excel file

Simulating the binomial variable in Excel and deriving its distribution, with Excel file

Creating frequency table and histogram and using Excel macros, with Excel file

Modeling a sample from a normal distribution, with Excel file

Modeling a pair of random variables and scatterplot definition, with video

Sampling distributions

Demystifying sampling distributions: too much talking about nothing

Law of large numbers and central limit theorem

Law of large numbers explained

Law of large numbers illustrated

Law of large numbers: the mega delusion of AP Statistics, with Excel file

All about the law of large numbers. Includes convergence in probability, preservation of arithmetic operations and application to simple regression

Central Limit Theorem versus Law of Large Numbers. Includes convergence in distribution and Excel file

Law of large numbers proved



Feb 22

Estimation of parameters of a normal distribution

Here we show that the knowledge of the distribution of s^{2} for linear regression allows one to do without long calculations contained in the guide ST 2134 by J. Abdey.

Theorem. Let y_{1},...,y_{n} be independent observations from N\left( \mu,\sigma ^{2}\right) . 1) s^{2}\left( n-1\right) /\sigma ^{2} is distributed as \chi _{n-1}^{2}. 2) The estimators \bar{y} and s^{2} are independent. 3) Es^{2}=\sigma ^{2}, 4) Var\left( s^{2}\right) =\frac{2\sigma ^{4}}{n-1}, 5) \frac{s^{2}-\sigma ^{2}}{\sqrt{2\sigma ^{4}/\left(n-1\right) }} converges in distribution to N\left( 0,1\right) .

Proof. We can write y_{i}=\mu +e_{i} where e_{i} is distributed as N\left( 0,\sigma ^{2}\right) . Putting \beta =\mu ,\ y=\left(y_{1},...,y_{n}\right) ^{T}, e=\left( e_{1},...,e_{n}\right) ^{T} and X=\left( 1,...,1\right) ^{T} (a vector of ones) we satisfy (1) and (2). Since X^{T}X=n, we have \hat{\beta}=\bar{y}. Further,

r\equiv y-X\hat{\beta}=\left( y_{1}-\bar{y},...,y_{n}-\bar{y}\right) ^{T}

and

s^{2}=\left\Vert r\right\Vert ^{2}/\left( n-1\right) =\sum_{i=1}^{n}\left(  y_{i}-\bar{y}\right) ^{2}/\left( n-1\right) .

Thus 1) and 2) follow from results for linear regression.

3) For a normal variable X its moment generating function is M_{X}\left( t\right) =\exp \left(\mu t+\frac{1}{2}\sigma ^{2}t^{2}\right) (see Guide ST2133, 2021, p.88). For the standard normal we get

M_{z}^{\prime }\left( t\right) =\exp \left(  \frac{1}{2}t^{2}\right) t, M_{z}^{\prime \prime }\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) (t^{2}+1),

M_{z}^{\prime \prime \prime}\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) (t^{3}+2t+t), M_{z}^{(4)}\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right)  (t^{4}+6t^{2}+3).

Applying the general property EX^{r}=M_{X}^{\left(  r\right) }\left( 0\right) (same guide, p.84) we see that

Ez=0, Ez^{2}=1, Ez^{3}=0, Ez^{4}=3,

Var(z)=1, Var\left( z^{2}\right) =Ez^{4}-\left( Ez^{2}\right)  ^{2}=3-1=2.
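The derivatives of the moment generating function can be generated mechanically, which is a convenient way to double-check the algebra. The sketch below relies on the observation that each derivative of \exp(t^2/2) has the form \exp(t^2/2)p(t) with a polynomial p, and that differentiation replaces p(t) by p'(t)+tp(t). The function name is illustrative.

```python
def next_poly(p):
    """Given M^(r)(t) = exp(t^2/2) * p(t), return the polynomial for M^(r+1).

    Differentiating gives exp(t^2/2) * (p'(t) + t * p(t)).
    Polynomials are coefficient lists: p[i] multiplies t**i.
    """
    derivative = [i * p[i] for i in range(1, len(p))]
    times_t = [0] + p
    derivative += [0] * (len(times_t) - len(derivative))
    return [d + s for d, s in zip(derivative, times_t)]

poly = [1]  # M_z(t) = exp(t^2/2) * 1
moments = []
for r in range(1, 5):
    poly = next_poly(poly)
    moments.append(poly[0])  # E z^r = M^(r)(0) is the constant term

print(moments)  # [0, 1, 0, 3]
print(poly)     # [3, 0, 6, 0, 1], that is t^4 + 6t^2 + 3
print(moments[3] - moments[1] ** 2)  # Var(z^2) = 2
```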


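By 1), s^{2} is distributed as \frac{\sigma ^{2}}{n-1}\left( z_{1}^{2}+...+z_{n-1}^{2}\right) with independent standard normal z_{1},...,z_{n-1}, so

Es^{2}=\frac{\sigma ^{2}}{n-1}E\left( z_{1}^{2}+...+z_{n-1}^{2}\right) =\frac{\sigma ^{2}}{n-1}\left( n-1\right) =\sigma ^{2}.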

4) By independence of standard normals

Var\left( s^{2}\right) = \left(\frac{\sigma ^{2}}{n-1}\right) ^{2}\left[ Var\left( z_{1}^{2}\right)  +...+Var\left( z_{n-1}^{2}\right) \right] =\frac{\sigma ^{4}}{\left(  n-1\right) ^{2}}2\left( n-1\right) =\frac{2\sigma ^{4}}{n-1}.

5) By standardizing s^{2} we have \frac{s^{2}-Es^{2}}{\sigma \left(s^{2}\right) }=\frac{s^{2}-\sigma ^{2}}{\sqrt{2\sigma ^{4}/\left( n-1\right) }} and this converges in distribution to N\left( 0,1\right) by the central limit theorem.



Apr 19

Checklist for Quantitative Finance FN3142

Students of FN3142 often think that they can get by with a few technical tricks. The questions below are mostly about intuition that helps to understand and apply those tricks.

Everywhere we assume that ...,Y_{t-1},Y_t,Y_{t+1},... is a time series and ...,I_{t-1},I_t,I_{t+1},... is a sequence of corresponding information sets. It is natural to assume that I_t\subset I_{t+1} for all t. We use the short conditional expectation notation: E_tX=E(X|I_t).


Question 1. How do you calculate conditional expectation in practice?

Question 2. How do you explain E_t(E_tX)=E_tX?

Question 3. Simplify each of E_tE_{t+1}X and E_{t+1}E_tX and explain intuitively.

Question 4. \varepsilon _t is a shock at time t. Positive and negative shocks are equally likely. What is your best prediction now for tomorrow's shock? What is your best prediction now for the shock that will happen the day after tomorrow?

Question 5. How and why do you predict Y_{t+1} at time t? What is the conditional mean of your prediction?

Question 6. What is the error of such a prediction? What is its conditional mean?

Question 7. Answer the previous two questions replacing Y_{t+1} by Y_{t+p} .

Question 8. What is the mean-plus-deviation-from-mean representation (conditional version)?

Question 9. How is the representation from Q.8 reflected in variance decomposition?

Question 10. What is a canonical form? State and prove all properties of its parts.

Question 11. Define conditional variance for white noise process and establish its link with the unconditional one.

Question 12. How do you define the conditional density in case of two variables, when one of them serves as the condition? Use it to prove the LIE.

Question 13. Write down the joint distribution function for a) independent observations and b) for serially dependent observations.

Question 14. If one variable is a linear function of another, what is the relationship between their densities?

Question 15. What can you say about the relationship between a,b if f(a)=f(b)? Explain geometrically the definition of the quasi-inverse function.


Answer 1. Conditional expectation is a complex notion. There are several definitions of differing levels of generality and complexity. See one of them here and another in Answer 12.

The point of this exercise is that any definition requires a lot of information and in practice there is no way to apply any of them to actually calculate conditional expectation. Then why do they juggle conditional expectation in theory? The efficient market hypothesis comes to rescue: it is posited that all observed market data incorporate all available information, and, in particular, stock prices are already conditioned on I_t.

Answers 2 and 3. This is the best explanation I have.

Answer 4. Since positive and negative shocks are equally likely, the best prediction is E_t\varepsilon _{t+1}=0 (I call this equation a martingale condition). Similarly, E_t\varepsilon _{t+2}=0 but in this case I prefer to see an application of the LIE: E_{t}\varepsilon _{t+2}=E_t(E_{t+1}\varepsilon _{t+2})=E_t0=0.

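Answer 5. The best prediction is \hat{Y}_{t+1}=E_tY_{t+1} because it minimizes E_t(Y_{t+1}-f(I_t))^2 among all functions f of current information I_t. Formally, you can use the first order condition

\frac{\partial}{\partial f(I_t)}E_t(Y_{t+1}-f(I_t))^2=-2E_t(Y_{t+1}-f(I_t))=0

to find that f(I_t)=E_tf(I_t)=E_tY_{t+1} is the minimizing function. By the projector property, the conditional mean of the prediction is E_t\hat{Y}_{t+1}=E_t(E_tY_{t+1})=E_tY_{t+1}.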

Answer 6. It is natural to define the prediction error by

\hat{\varepsilon}_{t+1}=Y_{t+1}-\hat{Y}_{t+1}=Y_{t+1}-E_tY_{t+1}.

By the projector property E_t\hat{\varepsilon}_{t+1}=E_tY_{t+1}-E_tY_{t+1}=0.
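
Answers 5 and 6 can be checked numerically. The sketch below is only an illustration under assumptions that are not in the post: a toy model Y_{t+1}=\phi Y_t+\varepsilon_{t+1} with a hypothetical \phi=0.6, so that E_tY_{t+1}=\phi Y_t, and a search restricted to linear predictors cY_t rather than all functions of I_t. It looks for the slope with the smallest mean squared error and then averages the prediction error of the conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible
phi, n = 0.6, 200_000            # hypothetical AR(1) coefficient and sample size

# Toy model (an assumption, not from the post): Y_{t+1} = phi*Y_t + eps_{t+1},
# so that E_t Y_{t+1} = phi*Y_t.
y_t = rng.normal(size=n)         # today's values
eps = rng.normal(size=n)         # tomorrow's shocks, independent of y_t
y_next = phi * y_t + eps

def mse(c):
    """Sample mean squared error of the linear predictor f(I_t) = c * Y_t."""
    return np.mean((y_next - c * y_t) ** 2)

grid = np.linspace(0.0, 1.2, 25)             # candidate slopes, step 0.05
best_c = grid[np.argmin([mse(c) for c in grid])]

error = y_next - phi * y_t                   # prediction error of the conditional mean
print("best slope on the grid:", best_c)
print("average prediction error:", error.mean())
```

The slope that wins on the grid should be the one closest to \phi, and the average error should be close to zero, in line with E_t\hat{\varepsilon}_{t+1}=0.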

Answer 7. To generalize, just change the subscripts. For the prediction we have to use two subscripts: the notation \hat{Y}_{t,t+p} means that we are trying to predict what happens at a future date t+p based on info set I_t (time t is like today). Then by definition \hat{Y} _{t,t+p}=E_tY_{t+p}, \hat{\varepsilon}_{t,t+p}=Y_{t+p}-E_tY_{t+p}.

Answer 8. Answer 7, obviously, implies Y_{t+p}=\hat{Y}_{t,t+p}+\hat{\varepsilon}_{t,t+p}. The simple case is here.

Answer 9. See the law of total variance and change it to reflect conditioning on I_t.

Answer 10. See canonical form.

Answer 11. Combine conditional variance definition with white noise definition.

Answer 12. The conditional density is defined similarly to the conditional probability. Let X,Y be two random variables. Denote by p_X the density of X and by p_{X,Y} the joint density. Then the conditional density of Y conditional on X is defined as p_{Y|X}(y|x)=\frac{p_{X,Y}(x,y)}{p_X(x)}. After this we can define the conditional expectation of Y given X=x as E(Y|x)=\int yp_{Y|X}(y|x)dy. With these definitions one can prove the Law of Iterated Expectations:

E[E(Y|X)]=\int E(Y|x)p_X(x)dx=\int \left( \int yp_{Y|X}(y|x)dy\right)  p_X(x)dx

=\int \int y\frac{p_{X,Y}(x,y)}{p_X(x)}p_X(x)dxdy=\int \int  yp_{X,Y}(x,y)dxdy=EY.

This is an illustration to Answer 1 and a prelim to Answer 13.
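
The same chain of equalities can be followed on a small discrete example, where sums replace integrals. This is a sketch with made-up joint probabilities (the table `p_xy` and the values of X and Y are hypothetical), not part of the proof.

```python
import numpy as np

# Hypothetical joint probabilities p_{X,Y}(x, y): rows are x-values, columns are y-values.
x_vals = np.array([0.0, 1.0])
y_vals = np.array([-1.0, 0.0, 2.0])
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.30, 0.10, 0.20]])

p_x = p_xy.sum(axis=1)                 # marginal of X
p_y_given_x = p_xy / p_x[:, None]      # conditional p_{Y|X}(y|x), one row per x
cond_mean = p_y_given_x @ y_vals       # E(Y|x) for each x

lhs = cond_mean @ p_x                  # E[E(Y|X)]
rhs = p_xy.sum(axis=0) @ y_vals        # EY computed from the marginal of Y
print(lhs, rhs)
```

Dividing by p_X(x) and then weighting by p_X(x) cancels, exactly as in the second line of the proof, so the two printed numbers coincide up to rounding.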

Answer 13. Understanding this answer is essential for Section 8.6 on maximum likelihood of Patton's guide.

a) In case of independent observations X_1,...,X_n the joint density of the vector X=(X_1,...,X_n) is a product of individual densities:

p_{X_1,...,X_n}(x_1,...,x_n)=p_{X_1}(x_1)...p_{X_n}(x_n).

b) In the time series context it is natural to assume that the next observation depends on the previous ones, that is, for each t, X_t depends on X_1,...,X_{t-1} (serially dependent observations). Therefore we should work with conditional densities p_{X_1,...,X_t|X_1,...,X_{t-1}}. From Answer 12 we can guess how to make conditional densities appear:

p_{X_1,...,X_n}(x_1,...,x_n)=\frac{p_{X_1,...,X_n}(x_1,...,x_n)}{  p_{X_1,...,X_{n-1}}(x_1,...,x_{n-1})}\frac{p_{X_1,...,X_{n-1}}(x_1,...,x_{n-1})}{  p_{X_1,...,X_{n-2}}(x_1,...,x_{n-2})}...\frac{p_{X_1,X_2}(x_1,x_2)}{p_{X_1}(x_1)}p_{X_1}(x_1).

The fractions on the right are recognized as conditional densities. The resulting expression is pretty awkward:

p_{X_1,...,X_n}(x_1,...,x_n)=p_{X_1,...,X_n|X_1,...,X_{n-1}}(x_1,...,x_n|x_1,...,x_{n-1})\times p_{X_1,...,X_{n-1}|X_1,...,X_{n-2}}(x_1,...,x_{n-1}|x_1,...,x_{n-2})...\times p_{X_1,X_2|X_1}(x_1,x_2|x_1)p_{X_1}(x_1).
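
To see the factorization at work, here is a minimal sketch that assumes a stationary Gaussian AR(1) process X_t=\phi X_{t-1}+e_t with hypothetical parameters; this model is my choice for illustration and is not taken from the guide. For this process the density of X_t conditional on the whole past depends only on x_{t-1}, so each factor is a scalar normal density. The code compares the logarithm of the joint density, computed directly from the n-dimensional normal, with the sum of the logarithms of the factors.

```python
import numpy as np

def normal_logpdf(x, mean, var):
    """Log-density of a scalar normal N(mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

phi, sigma2, n = 0.5, 1.0, 6          # hypothetical AR(1) parameters and sample length
rng = np.random.default_rng(1)
x = rng.normal(size=n)                # any point works: the identity holds for every x

# Left side: joint log-density of N(0, Sigma), where for a stationary AR(1)
# Sigma[i, j] = sigma2 / (1 - phi^2) * phi^|i - j|.
idx = np.arange(n)
Sigma = sigma2 / (1 - phi**2) * phi ** np.abs(idx[:, None] - idx[None, :])
_, logdet = np.linalg.slogdet(Sigma)
joint = -0.5 * (n * np.log(2 * np.pi) + logdet + x @ np.linalg.solve(Sigma, x))

# Right side: log p(x_1) plus the sum of conditional log-densities of x_t given the past.
chain = normal_logpdf(x[0], 0.0, sigma2 / (1 - phi**2))
chain += sum(normal_logpdf(x[t], phi * x[t - 1], sigma2) for t in range(1, n))

print(joint, chain)
```

The two numbers agree up to rounding. The second form is the one that is convenient for maximum likelihood, because every term is a one-dimensional density.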

Answer 14. The answer given here helps one understand how to pass from the density of the standard normal to that of the general normal.
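
As a quick check of the passage from the standard normal to the general normal, the sketch below uses the standard change-of-variable fact that if Y=a+bX with b\neq 0, then p_Y(y)=p_X((y-a)/b)/|b|. The function names and the values of \mu and \sigma are hypothetical.

```python
import math

def std_normal_pdf(z):
    """Density of the standard normal."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def general_normal_pdf(y, mu, sigma):
    """Textbook density of N(mu, sigma^2)."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def linear_transform_pdf(p_x, a, b):
    """Density of Y = a + b*X built from the density p_x of X (requires b != 0)."""
    return lambda y: p_x((y - a) / b) / abs(b)

mu, sigma = 1.5, 2.0                                   # hypothetical parameters
p_y = linear_transform_pdf(std_normal_pdf, mu, sigma)  # Y = mu + sigma*Z

for y in (-1.0, 0.0, 1.5, 4.0):
    print(y, p_y(y), general_normal_pdf(y, mu, sigma))
```

The last two columns coincide, which is the relationship between densities that Question 14 asks about.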

Answer 15. This elementary explanation of the function definition can be used in the fifth grade. Note that conditions sufficient for existence of the inverse are not satisfied in a case as simple as the distribution function of the Bernoulli variable (when the graph of the function has flat pieces and is not continuous). Therefore we need a more general definition of an inverse. Those who think that this question is too abstract can check out UoL exams, where examinees are required to find Value at Risk when the distribution function is a step function. To understand the idea, do the following:

a) Draw a graph of a good function f (continuous and increasing).

b) Fix some value y_0 in the range of this function and identify the region \{y:y\ge y_0\}.

c) Find the solution x_0 of the equation f(x)=y_0. By definition, x_0=f^{-1}(y_0). Identify the region \{x:f(x)\ge y_0\}.

d) Note that x_0=\min\{x:f(x)\ge y_0\}. In general, for bad functions the minimum here may not exist. Therefore the minimum is replaced by the infimum, which gives us the definition of the quasi-inverse:

x_0=\inf\{x:f(x)\ge y_0\}.
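
For a step distribution function the definition is easy to turn into a few lines of code. The sketch below assumes a discrete variable with finitely many values and 0<y_0\le 1, and the function name `quasi_inverse` and the Bernoulli parameter are hypothetical. In this case the infimum is attained at one of the jump points, so it is enough to accumulate probabilities until the level y_0 is reached.

```python
def quasi_inverse(support, probs, y0):
    """Return inf{x : F(x) >= y0} for a discrete distribution, with 0 < y0 <= 1.

    F is the step distribution function with jumps probs[i] at the points
    support[i]; support must be sorted in increasing order.
    """
    if not 0 < y0 <= 1:
        raise ValueError("y0 must lie in (0, 1]")
    cumulative = 0.0
    for x, p in zip(support, probs):
        cumulative += p                 # value of F at the jump point x
        if cumulative >= y0 - 1e-12:    # small tolerance for floating-point sums
            return x
    raise ValueError("probabilities must sum to 1")

# Bernoulli variable with P(X=1) = 0.3 (a hypothetical value):
# F(x) = 0 for x < 0, 0.7 for 0 <= x < 1, and 1 for x >= 1.
support, probs = [0, 1], [0.7, 0.3]
for y0 in (0.2, 0.7, 0.71, 1.0):
    print(y0, quasi_inverse(support, probs, y0))
```

Levels up to 0.7 are mapped to 0 and levels above 0.7 are mapped to 1, even though the equation F(x)=y_0 has no solution for most of these levels.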
