Oct 16

Properties of variance

All properties of variance in one place

Certainty is the mother of quiet and repose, and uncertainty the cause of variance and contentions. Edward Coke

Preliminaries: study properties of means with proofs.

Definition. Yes, uncertainty leads to variance, and we measure it by $Var(X)=E(X-EX)^2$. It is useful to use the name deviation from mean for $X-EX$ and realize that $E(X-EX)=0$, so that the mean of the deviation from mean cannot serve as a measure of variation of $X$ around $EX$.

Property 1. Variance of a linear combination. For any random variables $X,Y$ and numbers $a,b$ one has
(1) $Var(aX + bY)=a^2Var(X)+2abCov(X,Y)+b^2Var(Y).$
The term $2abCov(X,Y)$ in (1) is called an interaction term. See this post for the definition and properties of covariance.
Proof.
$Var(aX + bY)=E[aX + bY -E(aX + bY)]^2$

(using linearity of means)
$=E(aX + bY-aEX -bEY)^2$

(grouping by variable)
$=E[a(X-EX)+b(Y-EY)]^2$

(squaring out)
$=E[a^2(X-EX)^2+2ab(X-EX)(Y-EY)+b^2(Y-EY)^2]$

(using linearity of means and definitions of variance and covariance)
$=a^2Var(X) + 2abCov(X,Y) +b^2Var(Y).$
Property 2. Variance of a sum. Letting in (1) $a=b=1$ we obtain
$Var(X + Y) = Var(X) + 2Cov(X,Y)+Var(Y).$

Property 3. Homogeneity of degree 2. Choose $b=0$ in (1) to get
$Var(aX)=a^2Var(X).$
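Formula (1) and its special cases are easy to check on a computer. The sketch below (with made-up numbers) treats a small data set as a discrete population with equal probabilities $1/n$, so that $E$ is a plain average:

```python
# Numerical check of (1): Var(aX+bY) = a^2 Var(X) + 2ab Cov(X,Y) + b^2 Var(Y).
# The data values and coefficients are arbitrary.
X = [1.0, 2.0, 4.0, 7.0]
Y = [2.0, 1.0, 5.0, 3.0]
a, b = 2.0, -3.0

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return mean([(x - m) ** 2 for x in v])

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(x - mu) * (y - mv) for x, y in zip(u, v)])

lhs = var([a * x + b * y for x, y in zip(X, Y)])
rhs = a ** 2 * var(X) + 2 * a * b * cov(X, Y) + b ** 2 * var(Y)
print(abs(lhs - rhs) < 1e-12)  # True
```

Setting $a=b=1$ or $b=0$ in the same code checks Properties 2 and 3.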
Exercise. What do you think is larger: $Var(X+Y)$ or $Var(X-Y)$?
Property 4. If we add a constant to a variable, its variance does not change: $Var(X+c)=E[X+c-E(X+c)]^2=E(X+c-EX-c)^2=E(X-EX)^2=Var(X)$
Property 5. Variance of a constant is zero: $Var(c)=E(c-Ec)^2=0$.

Property 6. Nonnegativity. Since the squared deviation from mean $(X-EX)^2$ is nonnegative, its expectation is nonnegative: $Var(X)=E(X-EX)^2\ge 0$.

Property 7. Only a constant can have variance equal to zero: If $Var(X)=0$, then $E(X-EX)^2 =(x_1-EX)^2p_1 +...+(x_n-EX)^2p_n=0$, see the definition of the expected value. Since all probabilities are positive, we conclude that $x_i=EX$ for all $i$, which means that $X$ is identically constant.

Property 8. Shortcut for variance. We have an identity $E(X-EX)^2=EX^2-(EX)^2$. Indeed, squaring out gives

$E(X-EX)^2 =E(X^2-2XEX+(EX)^2)$

(distributing expectation)

$=EX^2-2E(XEX)+E(EX)^2$

(expectation of a constant is constant)

$=EX^2-2(EX)^2+(EX)^2=EX^2-(EX)^2$.
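The shortcut can also be verified directly from a values+probabilities table (the numbers below are made up):

```python
# Check of the shortcut Var(X) = E(X^2) - (EX)^2 for a discrete random
# variable given by a values+probabilities table.
xs = [0.0, 1.0, 3.0, 6.0]
ps = [0.1, 0.4, 0.3, 0.2]                 # probabilities sum to 1
EX  = sum(x * p for x, p in zip(xs, ps))
EX2 = sum(x * x * p for x, p in zip(xs, ps))
var_def      = sum((x - EX) ** 2 * p for x, p in zip(xs, ps))  # E(X-EX)^2
var_shortcut = EX2 - EX ** 2
print(abs(var_def - var_shortcut) < 1e-12)  # True
```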

All of the above properties apply to any random variables. The next one is an exception in the sense that it applies only to uncorrelated variables.

Property 9. If variables are uncorrelated, that is $Cov(X,Y)=0$, then from (1) we have $Var(aX + bY)=a^2Var(X)+b^2Var(Y).$ In particular, letting $a=b=1$, we get additivity: $Var(X+Y)=Var(X)+Var(Y).$ Recall that the expected value is always additive.

Generalizations. If all $X_i$ are uncorrelated, then $Var(\sum a_iX_i)=\sum a_i^2Var(X_i)$ and $Var(\sum X_i)=\sum Var(X_i).$

So far I have counted 12 of my posts that use properties of variance.

Jan 17

Conditional variance properties

Preliminaries

Review Properties of conditional expectation, especially the summary, where I introduce a new notation for conditional expectation. Everywhere I use the notation $E_Y\pi$ for expectation of $\pi$ conditional on $Y$, instead of $E(\pi|Y)$.

This post and the previous one on conditional expectation show that conditioning is a pretty advanced notion. Many introductory books use the condition $E_xu=0$ (the expected value of the error term $u$ conditional on the regressor $x$ is zero). Because of the complexity of conditioning, I think it's better to avoid this kind of assumption as much as possible.

Conditional variance properties

Replacing usual expectations by their conditional counterparts in the definition of variance, we obtain the definition of conditional variance:

(1) $Var_Y(X)=E_Y(X-E_YX)^2.$

Property 1. If $X,Y$ are independent, then $X-EX$ and $Y$ are also independent and conditioning doesn't change variance:

$Var_Y(X)=E_Y(X-EX)^2=E(X-EX)^2=Var(X).$

Property 2. Generalized homogeneity of degree 2: if $a$ is a deterministic function, then $a^2(Y)$ can be pulled out:

$Var_Y(a(Y)X)=E_Y[a(Y)X-E_Y(a(Y)X)]^2=E_Y[a(Y)X-a(Y)E_YX]^2$

$=E_Y[a^2(Y)(X-E_YX)^2]=a^2(Y)E_Y(X-E_YX)^2=a^2(Y)Var_Y(X).$

Property 3. Shortcut for conditional variance:

(2) $Var_Y(X)=E_Y(X^2)-(E_YX)^2.$

Proof.

$Var_Y(X)=E_Y(X-E_YX)^2=E_Y[X^2-2XE_YX+(E_YX)^2]$

(distributing conditional expectation)

$=E_YX^2-2E_Y(XE_YX)+E_Y(E_YX)^2$

(applying Properties 2 and 6 from this Summary with $a(Y)=E_YX$)

$=E_YX^2-2(E_YX)^2+(E_YX)^2=E_YX^2-(E_YX)^2.$

Property 4. The law of total variance:

(3) $Var(X)=Var(E_YX)+E[Var_Y(X)].$

Proof. By the shortcut for usual variance and the law of iterated expectations

$Var(X)=EX^2-(EX)^2=E[E_Y(X^2)]-[E(E_YX)]^2$

(replacing $E_Y(X^2)$ from (2))

$=E[Var_Y(X)]+E(E_YX)^2-[E(E_YX)]^2$

(the last two terms give the shortcut for variance of $E_YX$)

$=E[Var_Y(X)]+Var(E_YX).$
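For grouped data, the law of total variance is exactly the familiar within/between decomposition of the total sum of squares, so it can be illustrated with a quick sketch (the group means and spreads below are made up):

```python
import random
random.seed(0)

# Illustration of the law of total variance (3):
# Var(X) = Var(E_Y X) + E[Var_Y(X)].
# Y picks one of two groups; X has a group-specific mean and spread.
# For data grouped by Y the decomposition holds exactly.
N = 10_000
data = [(y, (1.0 if y == 0 else 4.0) + random.gauss(0, 1 + y))
        for y in (random.choice([0, 1]) for _ in range(N))]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

xs = [x for _, x in data]
groups = {y: [x for yy, x in data if yy == y] for y in (0, 1)}
w = {y: len(g) / N for y, g in groups.items()}           # group frequencies

between = sum(w[y] * (mean(g) - mean(xs)) ** 2 for y, g in groups.items())  # Var(E_Y X)
within  = sum(w[y] * var(g) for y, g in groups.items())                     # E[Var_Y(X)]
print(abs(var(xs) - (between + within)) < 1e-9)  # True: exact decomposition
```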

Before we move further we need to define conditional covariance by

$Cov_Y(S,T) = E_Y(S - E_YS)(T - E_YT)$

(everywhere usual expectations are replaced by conditional ones). We say that random variables $S,T$ are conditionally uncorrelated if $Cov_Y(S,T) = 0$.

Property 5. Conditional variance of a linear combination. For any random variables $S,T$ and functions $a(Y),b(Y)$ one has

$Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+2a(Y)b(Y)Cov_Y(S,T)+b^2(Y)Var_Y(T).$

The proof is quite similar to that in the case of usual variances, so we leave it to the reader. In particular, if $S,T$ are conditionally uncorrelated, then the interaction term disappears:

$Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+b^2(Y)Var_Y(T).$

Nov 16

Properties of covariance

Wikipedia says: The magnitude of the covariance is not easy to interpret. I add: We keep the covariance around mainly for its algebraic properties. It deserves studying because it appears in two important formulas: correlation coefficient and slope estimator in simple regression (see derivation, simplified derivation and proof of unbiasedness).

Definition. For two random variables $X,Y$ their covariance is defined by

$Cov (X,Y) = E(X - EX)(Y - EY)$

(it's the mean value of the product of the deviations of two variables from their respective means).

Properties of covariance

Property 1. Linearity. Covariance is linear in the first argument when the second argument is fixed: for any random variables $X,Y,Z$ and numbers $a,b$ one has
(1) $Cov (aX + bY,Z) = aCov(X,Z) + bCov (Y,Z).$
Proof. We start by writing out the left side of Equation (1):
$Cov(aX + bY,Z)=E[(aX + bY)-E(aX + bY)](Z-EZ)$
(using linearity of means)
$= E(aX + bY - aEX - bEY)(Z - EZ)$
(collecting similar terms)
$= E[a(X - EX) + b(Y - EY)](Z - EZ)$
(distributing $(Z - EZ)$)
$= E[a(X - EX)(Z - EZ) + b(Y - EY)(Z - EZ)]$
(using linearity of means)
$= aE(X - EX)(Z - EZ) + bE(Y - EY)(Z - EZ)$
$= aCov(X,Z) + bCov(Y,Z).$

Exercise. Covariance is also linear in the second argument when the first argument is fixed. Write out and prove this property. You can notice the importance of using parentheses and brackets.

Property 2. Shortcut for covariance: $Cov(X,Y) = EXY - (EX)(EY)$.
Proof. $Cov(X,Y)= E(X - EX)(Y - EY)$
(multiplying out)
$= E[XY - X(EY) - (EX)Y + (EX)(EY)]$
($EX,EY$ are constants; use linearity)
$=EXY-(EX)(EY)-(EX)(EY)+(EX)(EY)=EXY-(EX)(EY).$
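The shortcut can be checked on a small made-up joint distribution of $(X,Y)$:

```python
# Check of the shortcut Cov(X,Y) = E(XY) - (EX)(EY) on a discrete joint
# distribution; the joint probabilities below are made up and sum to 1.
pairs = [((0, 1), 0.2), ((1, 0), 0.3), ((1, 2), 0.1), ((2, 1), 0.4)]
EX  = sum(x * p for (x, y), p in pairs)
EY  = sum(y * p for (x, y), p in pairs)
EXY = sum(x * y * p for (x, y), p in pairs)
cov_def = sum((x - EX) * (y - EY) * p for (x, y), p in pairs)  # E(X-EX)(Y-EY)
print(abs(cov_def - (EXY - EX * EY)) < 1e-12)  # True
```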

Definition. Random variables $X,Y$ are called uncorrelated if $Cov(X,Y) = 0$.

Uncorrelatedness is close to independence, so the intuition is the same: one variable does not influence the other. More precisely, uncorrelatedness means there is no linear statistical relationship between the variables. The mathematical side is not the same: uncorrelatedness is a weaker (more general) property than independence.

Property 3. Independent variables are uncorrelated: if $X,Y$ are independent, then $Cov(X,Y) = 0$.
Proof. By the shortcut for covariance and multiplicativity of means for independent variables we have $Cov(X,Y) = EXY - (EX)(EY) = 0$.

Property 4. Correlation with a constant. Any random variable is uncorrelated with any constant: $Cov(X,c) = E(X - EX)(c - Ec) = 0.$

Property 5. Symmetry. Covariance is a symmetric function of its arguments: $Cov(X,Y)=Cov(Y,X)$. This is obvious.

Property 6. Relationship between covariance and variance:

$Cov(X,X)=E(X-EX)(X-EX)=Var(X)$.

Feb 22

Distribution of the estimator of the error variance

If you are reading the book by Dougherty: this post is about the distribution of the estimator  $s^2$ defined in Chapter 3.

Consider regression

(1) $y=X\beta +e$

where the deterministic matrix $X$ is of size $n\times k,$ satisfies $\det \left( X^{T}X\right) \neq 0$ (regressors are not collinear) and the error $e$ satisfies

(2) $Ee=0,Var(e)=\sigma ^{2}I$

$\beta$ is estimated by $\hat{\beta}=(X^{T}X)^{-1}X^{T}y.$ Denote $P=X(X^{T}X)^{-1}X^{T},$ $Q=I-P.$ Using (1) we see that $\hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e$ and the residual $r\equiv y-X\hat{\beta}=Qe.$ $\sigma^{2}$ is estimated by

(3) $s^{2}=\left\Vert r\right\Vert ^{2}/\left( n-k\right) =\left\Vert Qe\right\Vert ^{2}/\left( n-k\right) .$

$Q$ is a projector and has properties which are derived from those of $P$:

(4) $Q^{T}=Q,$ $Q^{2}=Q.$

If $\lambda$ is an eigenvalue of $Q,$ then multiplying $Qx=\lambda x$ by $Q$ and using the fact that $x\neq 0$ we get $\lambda ^{2}=\lambda .$ Hence eigenvalues of $Q$ can be only $0$ or $1.$ Since $tr(P)=tr(X(X^{T}X)^{-1}X^{T})=tr((X^{T}X)^{-1}X^{T}X)=tr(I_{k})=k,$ we have $tr\left( Q\right) =tr(I)-tr(P)=n-k.$ This
tells us that the number of eigenvalues equal to 1 is $n-k$ and the remaining $k$ are zeros. Let $Q=U\Lambda U^{T}$ be the diagonal representation of $Q.$ Here $U$ is an orthogonal matrix,

(5) $U^{T}U=I,$

and $\Lambda$ is a diagonal matrix with eigenvalues of $Q$ on the main diagonal. We can assume that the first $n-k$ numbers on the diagonal of $\Lambda$ are ones and the others are zeros.

Theorem. Let $e$ be normal. 1) $s^{2}\left( n-k\right) /\sigma ^{2}$ is distributed as $\chi _{n-k}^{2}.$ 2) The estimators $\hat{\beta}$ and $s^{2}$ are independent.

Proof. 1) We have by (4)

(6) $\left\Vert Qe\right\Vert ^{2}=\left( Qe\right) ^{T}Qe=\left( Q^{T}Qe\right) ^{T}e=\left( Qe\right) ^{T}e=\left( U\Lambda U^{T}e\right) ^{T}e=\left( \Lambda U^{T}e\right) ^{T}U^{T}e.$

Denote $S=U^{T}e.$ From (2) and (5)

$ES=0,$ $Var\left( S\right) =EU^{T}ee^{T}U=\sigma ^{2}U^{T}U=\sigma ^{2}I$

and $S$ is normal as a linear transformation of a normal vector. It follows that $S=\sigma z$ where $z$ is a standard normal vector with independent standard normal coordinates $z_{1},...,z_{n}.$ Hence, (6) implies

(7) $\left\Vert Qe\right\Vert ^{2}=\sigma ^{2}\left( \Lambda z\right) ^{T}z=\sigma ^{2}\left( z_{1}^{2}+...+z_{n-k}^{2}\right) =\sigma ^{2}\chi _{n-k}^{2}.$

(3) and (7) prove the first statement.

2) First we note that the vectors $Pe,Qe$ are independent. Since they are normal, their independence follows from

$cov(Pe,Qe)=EPee^{T}Q^{T}=\sigma ^{2}PQ=0.$

It's easy to see that $X^{T}P=X^{T}.$ This allows us to show that $\hat{\beta}$ is a function of $Pe$:

$\hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e=\beta +(X^{T}X)^{-1}X^{T}Pe.$

Independence of $Pe,Qe$ leads to independence of their functions $\hat{\beta}$ and $s^{2}.$
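The theorem lends itself to a Monte Carlo check. Below is a sketch with an arbitrary (non-collinear) design matrix: with normal errors, $\left\Vert Qe\right\Vert ^{2}/\sigma ^{2}$ should have mean $n-k$ and variance $2(n-k)$, the moments of $\chi _{n-k}^{2}$:

```python
import numpy as np
rng = np.random.default_rng(0)

# Monte Carlo sketch: with normal errors e, s^2 (n-k)/sigma^2 = ||Qe||^2/sigma^2
# should behave like chi-squared with n-k degrees of freedom,
# i.e. have mean n-k and variance 2(n-k).
# The design matrix below is arbitrary, chosen only to be non-collinear.
n, k, sigma = 30, 3, 2.0
X = np.column_stack([np.ones(n), np.arange(n), np.arange(n) ** 2])
Q = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # residual-maker matrix

E = rng.normal(0.0, sigma, size=(20_000, n))       # 20000 error vectors
stats = ((E @ Q) ** 2).sum(axis=1) / sigma ** 2    # rows of E@Q are (Qe)^T since Q is symmetric

print(stats.mean())   # close to n - k = 27
print(stats.var())    # close to 2(n - k) = 54
```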

May 18

Different faces of vector variance: again visualization helps

In the previous post we defined variance of a column vector $X$ with $n$ components by

$V(X)=E(X-EX)(X-EX)^T.$

In terms of elements this is the same as:

(1) $V(X)=\left(\begin{array}{cccc}V(X_1)&Cov(X_1,X_2)&...&Cov(X_1,X_n) \\Cov(X_2,X_1)&V(X_2)&...&Cov(X_2,X_n) \\...&...&...&... \\Cov(X_n,X_1)&Cov(X_n,X_2)&...&V(X_n) \end{array}\right).$

So why is knowing the structure of this matrix so important?

Let $X_1,...,X_n$ be random variables and let $a_1,...,a_n$ be numbers. In the derivation of the variance of the slope estimator for simple regression we have to deal with the expression of type

(2) $V\left(\sum_{i=1}^na_iX_i\right).$

Question 1. How do you multiply a sum by a sum? I mean, how do you use summation signs to find the product $\left(\sum_{i=1}^na_i\right)\left(\sum_{i=1}^nb_i\right)$?

Answer 1. Whenever you have problems with summation signs, try to do without them. The product

$\left(a_1+...+a_n\right)\left(b_1+...+b_n\right)=a_1b_1+...+a_1b_n+...+a_nb_1+...+a_nb_n$

should contain ALL products $a_ib_j.$ Again, a matrix visualization will help:

$\left(\begin{array}{ccc}a_1b_1&...&a_1b_n \\...&...&... \\a_nb_1&...&a_nb_n \end{array}\right).$

The product we are looking for should contain all elements of this matrix. So the answer is

(3) $\left(\sum_{i=1}^na_i\right)\left(\sum_{i=1}^nb_i\right)=\sum_{i=1}^n\sum_{j=1}^na_ib_j.$

Formally, we can write $\sum_{i=1}^nb_i=\sum_{j=1}^nb_j$ (the sum does not depend on the index of summation, this is another point many students don't understand) and then perform the multiplication in (3).

Question 2. What is the expression for (2) in terms of covariances of components?

Answer 2. If you understand Answer 1 and know the relationship between variances and covariances, it should be clear that

(4) $V\left(\sum_{i=1}^na_iX_i\right)=Cov(\sum_{i=1}^na_iX_i,\sum_{i=1}^na_iX_i)$

$=Cov(\sum_{i=1}^na_iX_i,\sum_{j=1}^na_jX_j)=\sum_{i=1}^n\sum_{j=1}^na_ia_jCov(X_i,X_j).$

Question 3. In light of (1), separate variances from covariances in (4).

Answer 3. When $i=j,$ we have $Cov(X_i,X_j)=V(X_i),$ which are diagonal elements of (1). Otherwise, for $i\neq j$ we get off-diagonal elements of (1). So the answer is

(5) $V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i)+\sum_{i\neq j}a_ia_jCov(X_i,X_j).$

Once again, in the first sum on the right we have only variances. In the second sum, the indices $i,j$ are assumed to run from $1$ to $n$, excluding the diagonal $i=j.$

Corollary. If $X_{i}$ are uncorrelated, then the second sum in (5) disappears:

(6) $V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i).$

This fact has been used (with a slightly different explanation) in the derivation of the variance of the slope estimator for simple regression.

Question 4. Note that the matrix (1) is symmetric (elements above the main diagonal equal their mirror siblings below that diagonal). This means that some terms in the second sum on the right of (5) are repeated twice. If you group equal terms in (5), what do you get?

Answer 4. The idea is to write

$a_ia_jCov(X_i,X_j)+a_ia_jCov(X_j,X_i)=2a_ia_jCov(X_i,X_j),$

that is, to join equal elements above and below the main diagonal in (1). For this, you need to figure out how to write a sum of the elements that are above the main diagonal. Make a bigger version of (1) (with more off-diagonal elements) to see that the elements above the main diagonal are listed in the sum $\sum_{i=1}^{n-1}\sum_{j=i+1}^n.$ This sum can also be written as $\sum_{1\leq i<j\leq n}.$ Hence, (5) is the same as

(7) $V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i)+2\sum_{i=1}^{n-1}\sum_{j=i+1}^na_ia_jCov(X_i,X_j)$

$=\sum_{i=1}^na_i^2V(X_i)+2\sum_{1\leq i<j\leq n}a_ia_jCov(X_i,X_j).$

Unlike (6), this equation is applicable when there is autocorrelation.
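Since (4) says that $V\left(\sum a_iX_i\right)$ is the quadratic form $a^TV(X)a$, the split into variances and covariances in (5) can be checked with a few lines of numpy (random made-up data, treated as the population):

```python
import numpy as np
rng = np.random.default_rng(1)

# Check of (5): V(sum a_i X_i) equals the diagonal (variance) part plus the
# off-diagonal (covariance) part of the quadratic form a' V(X) a.
a = np.array([1.0, -2.0, 0.5])
data = rng.normal(size=(1000, 3))            # 1000 draws of (X_1, X_2, X_3)
V = np.cov(data, rowvar=False, bias=True)    # matrix (1) for this population

lhs  = np.var(data @ a)                      # V(sum a_i X_i), computed directly
diag = sum(a[i] ** 2 * V[i, i] for i in range(3))
off  = sum(a[i] * a[j] * V[i, j] for i in range(3) for j in range(3) if i != j)
print(abs(lhs - (diag + off)) < 1e-10)       # True
```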

May 18

Variance of a vector: motivation and visualization

I always show my students the definition of the variance of a vector, and they usually don't pay attention. You need to know what it is, already at the level of simple regression (to understand the derivation of the slope estimator variance), and even more so when you deal with time series. Since I know exactly where students usually stumble, this post is structured as a series of questions and answers.

Think about ideas: how would you define variance of a vector?

Question 1. We know that for a random variable $X$, its variance is defined by

(1) $V(X)=E(X-EX)^{2}.$

Now let

$X=\left(\begin{array}{c}X_{1} \\... \\X_{n}\end{array}\right)$

be a vector with $n$ components, each of which is a random variable. How would you define its variance?

The answer is not straightforward because we don't know how to square a vector. Let $X^T=(\begin{array}{ccc}X_1& ...&X_n\end{array})$ denote the transposed vector. There are two ways to multiply a vector by itself: $X^TX$ and $XX^T.$

Question 2. Find the dimensions of $X^TX$ and $XX^T$ and their expressions in terms of coordinates of $X.$

Answer 2. For a product of matrices there is a compatibility rule that I write in the form

(2) $A_{n\times m}B_{m\times k}=C_{n\times k}.$

Recall that $n\times m$ in the notation $A_{n\times m}$ means that the matrix $A$ has $n$ rows and $m$ columns. For example, $X$ is of size $n\times 1.$ Verbally, the above rule says that the number of columns of $A$ should be equal to the number of rows of $B.$ In the product that common number $m$ disappears and the unique numbers ($n$ and $k$) give, respectively, the number of rows and columns of $C.$ Isn't the formula easier to remember than the verbal statement? From (2) we see that $X_{1\times n}^TX_{n\times 1}$ is of dimension 1 (it is a scalar) and $X_{n\times 1}X_{1\times n}^T$ is an $n\times n$ matrix.

For actual multiplication of matrices I use the visualization

(3) $\left(\begin{array}{ccccc}&&&&\\&&&&\\a_{i1}&a_{i2}&...&a_{i,m-1}&a_{im}\\&&&&\\&&&&\end{array}\right) \left(\begin{array}{ccccc}&&b_{1j}&&\\&&b_{2j}&&\\&&...&&\\&&b_{m-1,j}&&\\&&b_{mj}&&\end{array}\right) =\left( \begin{array}{ccccc}&&&&\\&&&&\\&&c_{ij}&&\\&&&&\\&&&&\end{array}\right)$

Short formulation. Multiply rows from the first matrix by columns from the second one.

Long Formulation. To find the element $c_{ij}$ of $C,$ we find a scalar product of the $i$th row of $A$ and $j$th column of $B:$ $c_{ij}=a_{i1}b_{1j}+a_{i2}b_{2j}+...$ To find all elements in the $i$th row of $C,$ we fix the $i$th row in $A$ and move right the columns in $B.$ Alternatively, to find all elements in the $j$th column of $C,$ we fix the $j$th column in $B$ and move down the rows in $A$. Using this rule, we have

(4) $X^TX=X_1^2+...+X_n^2,$ $XX^T=\left(\begin{array}{ccc}X_1^2&...&X_1X_n \\...&...&... \\X_nX_1&...&X_n^2 \end{array}\right).$

Usually students have problems with the second equation.
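The two products in (4) are easy to see in numpy for a concrete (made-up) $3\times 1$ vector:

```python
import numpy as np

# The two products of a column vector with itself, as in (4):
# X^T X is a scalar (the sum of squares), while X X^T is an n x n matrix
# whose (i,j) element is X_i X_j. The numbers are arbitrary.
X = np.array([[1.0], [2.0], [3.0]])   # a 3x1 column vector

inner = X.T @ X                       # shape (1, 1): 1^2 + 2^2 + 3^2 = 14
outer = X @ X.T                       # shape (3, 3): all pairwise products

print(inner.shape, inner[0, 0])       # (1, 1) 14.0
print(outer.shape)                    # (3, 3)
print(outer[0, 2], outer[2, 0])       # 3.0 3.0 -- the matrix is symmetric
```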

Based on (1) and (4), we have two candidates to define variance:

(5) $V(X)=E(X-EX)^T(X-EX)$

and

(6) $V(X)=E(X-EX)(X-EX)^T.$

Answer 1. The second definition contains more information, in the sense to be explained below, so we define variance of a vector by (6).

Question 3. Find the elements of this matrix.

Answer 3. Variance of a vector has variances of its components on the main diagonal and covariances outside it:

(7) $V(X)=\left(\begin{array}{cccc}V(X_1)&Cov(X_1,X_2)&...&Cov(X_1,X_n) \\Cov(X_2,X_1)&V(X_2)&...&Cov(X_2,X_n) \\...&...&...&... \\Cov(X_n,X_1)&Cov(X_n,X_2)&...&V(X_n) \end{array}\right).$

If you can't get this on your own, go back to Answer 2.

There is a matrix operation called trace and denoted $tr$. It is defined only for square matrices and gives the sum of diagonal elements of a matrix.

Exercise 1. Show that $tr(V(X))=E(X-EX)^T(X-EX).$ In this sense definition (6) is more informative than (5).

Exercise 2. Show that if $EX_1=...=EX_n=0$, then (7) becomes

$V(X)=\left(\begin{array}{cccc}EX^2_1&EX_1X_2&...&EX_1X_n \\EX_2X_1&EX^2_2&...&EX_2X_n \\...&...&...&... \\EX_nX_1&EX_nX_2&...&EX^2_n \end{array}\right).$

Jan 17

OLS estimator variance

We consider the simple regression

(1) $y_i=a+bx_i+e_i$

Here we derived the OLS estimators of the intercept and slope:

(2) $\hat{b}=\frac{Cov_u(x,y)}{Var_u(x)}$,

(3) $\hat{a}=\bar{y}-\hat{b}\bar{x}$.

A1. Existence condition. Since division by zero is not allowed, for (2) to exist we require $Var_u(x)\ne 0$. If this condition is not satisfied, then there is no variation in $x$ and all observed points lie on a vertical line.

A2. Convenience condition. The regressor $x$ is deterministic. This condition is imposed to be able to apply the properties of expectation, see equation (7) in  this post. The time trend and dummy variables are examples of deterministic regressors. However, most real-life regressors are stochastic. Modifying the theory in order to cover stochastic regressors is the subject of two posts: finite-sample theory and large-sample theory.

A3. Unbiasedness condition: $Ee_i=0$. This is the main assumption that makes sure that OLS estimators are unbiased, see equation (7) in this post.

Unbiasedness is not enough

Unbiasedness characterizes the quality of an estimator, see the intuitive explanation. Unfortunately, unbiasedness is not enough to choose the best estimator because of nonuniqueness: usually, if there is one unbiased estimator of a parameter, then there are infinitely many unbiased estimators of the same parameter. For example, we know that the sample mean $\bar{X}$ unbiasedly estimates the population mean $E\bar{X}=EX$. Since $EX_1=EX$ ($X_1$ is the first observation), we can easily construct an infinite family of unbiased estimators $Y=(\bar{X}+aX_1)/(1+a)$, assuming $a\ne -1$. Indeed, using linearity of expectation $EY=(E\bar{X}+aEX_1)/(1+a)=EX$.

Variance is another measure of an estimator quality: to have a lower spread of estimator values, among competing estimators we choose the one which has the lowest variance. Knowing the estimator variance allows us to find the z-score and use statistical tables.

Slope estimator variance

It is not difficult to find the variance of the slope estimator using representation (6) derived here:

$\hat{b}=b+\frac{1}{n}\sum a_ie_i$

where $a_i=(x_i-\bar{x})/Var_u(x).$

Don't try to apply directly the definition of variance at this point, because there will be a square of a sum, which leads to a double sum upon squaring. We need two new assumptions.

A4. Uncorrelatedness of errors. Assume that $Cov(e_i,e_j)=0$ for all $i\ne j$ (errors from different equations (1) are uncorrelated). Note that because of the unbiasedness condition, this assumption is equivalent to $Ee_ie_j=0$ for all $i\ne j$. This assumption is likely to be satisfied if we observe consumption patterns of unrelated individuals.

A5. Homoscedasticity. All errors have the same variance: $Var(e_i)=\sigma^2$ for all $i$. Again, because of the unbiasedness condition, this assumption is equivalent to $Ee_i^2=\sigma^2$ for all $i$.

Now we can derive the variance expression, using properties from this post:

$Var(\hat{b})=Var(b+\frac{1}{n}\sum_i a_ie_i)$ (dropping a constant doesn't affect variance)

$=Var(\frac{1}{n}\sum_i a_ie_i)$ (for uncorrelated variables, variance is additive)

$=\sum_i Var(\frac{1}{n}a_ie_i)$ (variance is homogeneous of degree 2)

$=\frac{1}{n^2}\sum_i a_i^2Var(e_i)$ (applying homoscedasticity)

$=\frac{1}{n^2}\sum_i a_i^2\sigma^2$ (plugging $a_i$)

$=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x)$ (using the notation of sample variance)

$=\frac{1}{n}Var_u(x)\sigma^2/Var^2_u(x)=\sigma^2/(nVar_u(x)).$

Canceling the two variances in the last line is obvious with the short notation; it is much less obvious when everything is written with summation signs. The case of the intercept variance is left as an exercise.
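The whole derivation can be confirmed by simulation. The sketch below (with made-up values of $a,b,\sigma$ and of the regressor) compares the empirical variance of $\hat{b}$ with $\sigma^2/(nVar_u(x))$; note that $Var_u(x)$ here divides by $n$, the convention that makes the last line of the derivation work:

```python
import numpy as np
rng = np.random.default_rng(2)

# Monte Carlo sketch of Var(b-hat) = sigma^2 / (n Var_u(x)).
# All parameter values and the regressor are made up for the illustration.
n, a, b, sigma = 50, 1.0, 2.0, 3.0
x = np.linspace(0, 10, n)                    # deterministic regressor (A2)
var_u = ((x - x.mean()) ** 2).mean()         # Var_u(x), dividing by n

reps = 100_000
e = rng.normal(0, sigma, size=(reps, n))     # uncorrelated homoscedastic errors (A4, A5)
y = a + b * x + e                            # reps samples from model (1)
# slope estimator (2): Cov_u(x,y)/Var_u(x), one estimate per sample
b_hat = ((y - y.mean(axis=1, keepdims=True)) * (x - x.mean())).mean(axis=1) / var_u

print(b_hat.var())                           # close to the theoretical value
print(sigma ** 2 / (n * var_u))
```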

Conclusion

The above assumptions A1-A5 are called classical. It is necessary to remember their role in derivations because a considerable part of Econometrics is devoted to deviations from classical assumptions. Once you have a certain assumption violated, you should expect the corresponding estimator property invalidated. For example, if $Ee_i\ne 0$, you should expect the estimators to be biased. If any of A4-A5 is not true, the formula we have derived

$Var(\hat{b})=\sigma^2/(nVar_u(x))$

will not hold. Besides, the Gauss-Markov theorem that the OLS estimators are efficient will not hold (this will be discussed later). The pair A4-A5 can be called an efficiency condition.

Nov 16

Properties of correlation

Correlation coefficient: the last block of statistical foundation

Correlation has already been mentioned in

Statistical measures and their geometric roots

Properties of standard deviation

The pearls of AP Statistics 35

Properties of covariance

The pearls of AP Statistics 33

The hierarchy of definitions

Suppose random variables $X,Y$ are not constant. Then their standard deviations are not zero and we can define their correlation by $\rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}$, as in Chart 1.

Chart 1. Correlation definition

Properties of correlation

Property 1. Range of the correlation coefficient: for any $X,Y$ one has $- 1 \le \rho (X,Y) \le 1$.
This follows from the Cauchy-Schwarz inequality, as explained here.

Recall from this post that correlation is cosine of the angle between $X-EX$ and $Y-EY$.
Property 2. Interpretation of extreme cases. (Part 1) If $\rho (X,Y) = 1$, then $Y = aX + b$ with $a > 0.$

(Part 2) If $\rho (X,Y) = - 1$, then $Y = aX + b$ with $a < 0$.

Proof. (Part 1) $\rho (X,Y) = 1$ implies
(1) $Cov (X,Y) = \sigma (X)\sigma (Y)$
which, in turn, implies that $Y$ is a linear function of $X$: $Y = aX + b$ (this is the second part of the Cauchy-Schwarz inequality). Further, we can establish the sign of the number $a$. By the properties of variance and covariance
$Cov(X,Y)=Cov(X,aX+b)=aCov(X,X)+Cov(X,b)=aVar(X)$,

$\sigma (Y)=\sigma(aX + b)=\sigma (aX)=|a|\sigma (X)$.
Plugging this into Eq. (1) we get $aVar(X) = |a|\sigma^2(X)$ and see that $a$ is positive.

The proof of Part 2 is left as an exercise.

Property 3. Suppose we want to measure correlation between weight $W$ and height $H$ of people. The measurements are either in kilos and centimeters ${W_k},{H_c}$ or in pounds and feet ${W_p},{H_f}$. The correlation coefficient is unit-free in the sense that it does not depend on the units used: $\rho (W_k,H_c)=\rho (W_p,H_f)$. Mathematically speaking, correlation is homogeneous of degree $0$ in both arguments.
Proof. One measurement is proportional to another, $W_k=aW_p,\ H_c=bH_f$ with some positive constants $a,b$. By homogeneity
$\rho (W_k,H_c)=\frac{Cov(W_k,H_c)}{\sigma(W_k)\sigma(H_c)}=\frac{Cov(aW_p,bH_f)}{\sigma(aW_p)\sigma(bH_f)}=\frac{abCov(W_p,H_f)}{ab\sigma(W_p)\sigma (H_f)}=\rho (W_p,H_f).$
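Property 3 can be spot-checked numerically. The sketch below uses simulated (made-up) weight and height data and approximate conversion factors:

```python
import random
random.seed(3)

# Check that the correlation coefficient does not change under a change of
# units (kilos to pounds, centimeters to feet).
n = 1000
H_cm = [160 + random.gauss(0, 10) for _ in range(n)]         # height, cm
W_kg = [0.5 * h - 20 + random.gauss(0, 8) for h in H_cm]     # weight, kg

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    su = (sum((a - mu) ** 2 for a in u) / len(u)) ** 0.5
    sv = (sum((b - mv) ** 2 for b in v) / len(v)) ** 0.5
    return cov / (su * sv)

W_lb = [w * 2.2046 for w in W_kg]      # kilos -> pounds
H_ft = [h / 30.48 for h in H_cm]       # centimeters -> feet
print(abs(corr(W_kg, H_cm) - corr(W_lb, H_ft)) < 1e-9)  # True
```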

Nov 16

Properties of standard deviation

Properties of standard deviation are divided into two parts. The definitions and consequences are given here. Both variance and standard deviation are used to measure variability of values of a random variable around its mean. Then why use both of them? The why will be explained in another post.

Properties of standard deviation: definitions and consequences

Definition. For a random variable $X$, the quantity $\sigma (X) = \sqrt {Var(X)}$ is called its standard deviation.

Digression about square roots and absolute values

In general, there are two square roots of a positive number, one positive and the other negative. The positive one is called an arithmetic square root. The arithmetic root is applied here to $Var(X) \ge 0$ (see properties of variance), so standard deviation is always nonnegative.
Definition. An absolute value of a real number $a$ is defined by
(1) $|a| =a$ if $a$ is nonnegative and $|a| =-a$ if $a$ is negative.
This two-part definition is a stumbling block for many students, so making them plug in a few numbers is a must. It is introduced to measure the distance from point $a$ to the origin. For example, $dist(3,0) = |3| = 3$ and $dist(-3,0) = |-3| = 3$. More generally, for any points $a,b$ on the real line the distance between them is given by $dist(a,b) = |a - b|$.

By squaring both sides in Eq. (1) we obtain $|a|^2={a^2}$. Application of the arithmetic square root gives

(2) $|a|=\sqrt {a^2}.$

This is the equation we need right now.

Back to standard deviation

Property 1. Standard deviation is homogeneous of degree 1. Indeed, using homogeneity of variance and equation (2), we have

$\sigma (aX) =\sqrt{Var(aX)}=\sqrt{{a^2}Var(X)}=|a|\sigma(X).$

Unlike homogeneity of expected values, here we have an absolute value of the scaling coefficient $a$.

Property 2. Cauchy-Schwarz inequality. (Part 1) For any random variables $X,Y$ one has

(3) $|Cov(X,Y)|\le\sigma(X)\sigma(Y)$.

(Part 2) If the inequality sign in (3) turns into equality, $|Cov(X,Y)|=\sigma (X)\sigma (Y)$, then $Y$ is a linear function of $X$: $Y = aX + b$, with some constants $a,b$.
Proof. (Part 1) If at least one of the variables is constant, both sides of the inequality are $0$ and there is nothing to prove. To exclude the trivial case, let $X,Y$ be non-constant and, therefore, $Var(X),\ Var(Y)$ are positive. Consider a real-valued function of a real number $t$ defined by $f(t) = Var(tX + Y)$. Here we have variance of a linear combination

$f(t)=t^2Var(X)+2tCov(X,Y)+Var(Y)$.

We see that $f(t)$ is a parabola with branches looking upward (because the senior coefficient $Var(X)$ is positive). By nonnegativity of variance, $f(t)\ge 0$ and the parabola lies above the horizontal axis in the $(f,t)$ plane. Hence, the quadratic equation $f(t) = 0$ may have at most one real root. This means that the discriminant of the equation is non-positive:

$D=Cov(X,Y)^2-Var(X)Var(Y)\le 0.$

Applying square roots to both sides of $Cov(X,Y)^2\le Var(X)Var(Y)$ we finish the proof of the first part.

(Part 2) In case of the equality sign the discriminant is $0$. Therefore the parabola touches the horizontal axis where $f(t)=Var(tX + Y)=0$. But we know that this implies $tX + Y = constant$ which is just another way of writing $Y = aX + b$.
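Both parts of the Cauchy-Schwarz inequality can be spot-checked on the computer (random made-up data):

```python
import random
random.seed(4)

# Spot check of |Cov(X,Y)| <= sigma(X) sigma(Y) on random data, plus the
# equality case Y = aX + b. All numbers are arbitrary.
def moments(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    su = (sum((a - mu) ** 2 for a in u) / len(u)) ** 0.5
    sv = (sum((b - mv) ** 2 for b in v) / len(v)) ** 0.5
    return cov, su, sv

X = [random.gauss(0, 1) for _ in range(500)]
Y = [random.gauss(0, 2) for _ in range(500)]
cov, sx, sy = moments(X, Y)
print(abs(cov) <= sx * sy)             # True (strict inequality generically)

Y_lin = [3 * x - 1 for x in X]         # Y = aX + b: the equality case
cov, sx, sy = moments(X, Y_lin)
print(abs(abs(cov) - sx * sy) < 1e-9)  # True: |Cov| = sigma(X) sigma(Y)
```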

Comment. (3) explains one of the main properties of the correlation:

$-1\le\rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}\le 1$.

Oct 16

Properties of means

Properties of means, covariances and variances are the bread and butter of professionals. Here we consider the bread: the means.

Properties of means: as simple as playing with tables

Definition of a random variable. When my Brazilian students asked for an intuitive definition of a random variable, I said: It is a function whose values are unpredictable. Therefore it is prohibited to work with their values and allowed to work only with their various means. For proofs we need a more technical definition: it is a table values+probabilities of type Table 1.

Table 1.

Values of $X$ | Probabilities
$x_1$ | $p_1$
... | ...
$x_n$ | $p_n$

Note: The complete form of writing ${p_i}$ is $P(X = {x_i})$.

Definition of the mean (or expected value): $EX = x_1p_1 + ... + x_np_n = \sum\limits_{i = 1}^nx_ip_i.$ In words, this is a weighted sum of values, where the weights $p_i$ reflect the importance of the corresponding $x_i$.

Note: The expected value is a function whose argument is a complex object (it is described by Table 1) and the value is simple: $EX$ is just a number. And it is not a product of $E$ and $X$! See how different means fit this definition.

Definition of a linear combination. See here the financial motivation. Suppose that $X,Y$ are two discrete random variables with the same probability distribution ${p_1},...,{p_n}$. Let $a,b$ be real numbers. The random variable $aX + bY$ is called a linear combination of $X,Y$ with coefficients $a,b$. Its special cases are $aX$ ($X$ scaled by $a$) and $X + Y$ (a sum of $X$ and $Y$). The detailed definition is given by Table 2.

Table 2.

Values of $X$ | Values of $Y$ | Probabilities | $aX$ | $X + Y$ | $aX + bY$
$x_1$ | $y_1$ | $p_1$ | $ax_1$ | $x_1 + y_1$ | $ax_1 + by_1$
... | ... | ... | ... | ... | ...
$x_n$ | $y_n$ | $p_n$ | $ax_n$ | $x_n + y_n$ | $ax_n + by_n$

Note: The situation when the probability distributions are different is reduced to the case when they are the same, see my book.

Property 1. Linearity of means. For any random variables $X,Y$ and any numbers $a,b$ one has

(1) $E(aX + bY) = aEX + bEY$.

Proof. This is one of those straightforward proofs when knowing the definitions and starting with the left-hand side is enough to arrive at the result. Using the definitions in Table 2, the mean of the linear combination is
$E(aX + bY)= (a{x_1} + b{y_1}){p_1} + ... + (a{x_n} + b{y_n}){p_n}$

(distributing probabilities)
$= a{x_1}{p_1} + b{y_1}{p_1} + ... + a{x_n}{p_n} + b{y_n}{p_n}$

(grouping by variables)
$= (a{x_1}{p_1} + ... + a{x_n}{p_n}) + (b{y_1}{p_1} + ... + b{y_n}{p_n})$

(pulling out constants)
$= a({x_1}{p_1} + ... + {x_n}{p_n}) + b({y_1}{p_1} + ... + {y_n}{p_n})=aEX+bEY.$
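The proof above can be replayed on a computer with a small made-up table of type Table 2:

```python
# Check of linearity E(aX + bY) = a EX + b EY.  X and Y share one
# probability column, as in Table 2; the numbers are made up.
xs = [1.0, 2.0, 5.0]
ys = [0.0, 3.0, 1.0]
ps = [0.2, 0.5, 0.3]     # common probabilities, summing to 1
a, b = 4.0, -2.0

lhs = sum((a * x + b * y) * p for x, y, p in zip(xs, ys, ps))     # E(aX+bY)
rhs = (a * sum(x * p for x, p in zip(xs, ps))                     # a EX
       + b * sum(y * p for y, p in zip(ys, ps)))                  # + b EY
print(abs(lhs - rhs) < 1e-12)  # True
```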

See applications: one, and two, and three.

Generalization to the case of a linear combination of $n$ variables:

$E({a_1}{X_1} + ... + {a_n}{X_n}) = {a_1}E{X_1} + ... + {a_n}E{X_n}$.

Special cases. a) Letting $a = b = 1$ in (1) we get $E(X + Y) = EX + EY$. This is called additivity. See an application. b) Letting in (1) $b = 0$ we get $E(aX) = aEX$. This property is called homogeneity of degree 1 (you can pull the constant out of the expected value sign). Ask your students to deduce linearity from homogeneity and additivity.

Property 2. Expected value of a constant. Everybody knows what a constant is. Ask your students what is a constant in terms of Table 1. The mean of a constant is that constant, because a constant doesn't change, rain or shine: $Ec = c{p_1} + ... + c{p_n} = c({p_1} + ... + {p_n}) = c$ (we have used the completeness axiom $p_1+...+p_n=1$). In particular, it follows that $E(EX)=EX$.

Property 3. The expectation operator preserves order: if $x_i\ge y_i$ for all $i$, then $EX\ge EY$. In particular, the mean of a nonnegative random variable is nonnegative: if $x_i\ge 0$ for all $i$, then $EX\ge 0$.

Indeed, using the fact that all probabilities are nonnegative, we get $EX = x_1p_1 + ... + x_np_n\ge y_1p_1 + ... + y_np_n=EY$.

Property 4. For independent variables, we have $EXY=(EX)(EY)$ (multiplicativity), which has important implications on its own.

The best thing about the above properties is that, although we proved them under simplified assumptions, they are always true. We keep in mind that the expectation operator $E$ is the device used by Mother Nature to measure the average, and most of the time she keeps hidden from us both the probabilities and the average $EX$.