8
Jan 17

OLS estimator variance

Assumptions about simple regression

We consider the simple regression

(1) y_i=a+bx_i+e_i

Here we derived the OLS estimators of the intercept and slope:

(2) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(3) \hat{a}=\bar{y}-\hat{b}\bar{x}.

A1. Existence condition. Since division by zero is not allowed, for (2) to exist we require Var_u(x)\ne 0. If this condition is not satisfied, then there is no variance in x and all observed points are on the vertical line.

A2. Convenience condition. The regressor x is deterministic. This condition is imposed to be able to apply the properties of expectation, see equation (7) in  this post. The time trend and dummy variables are examples of deterministic regressors. However, most real-life regressors are stochastic. Modifying the theory in order to cover stochastic regressors is the subject of two posts: finite-sample theory and large-sample theory.

A3. Unbiasedness conditionEe_i=0. This is the main assumption that makes sure that OLS estimators are unbiased, see equation (7) in  this post.

Unbiasedness is not enough

Unbiasedness characterizes the quality of an estimator, see the intuitive explanation. Unfortunately, unbiasedness is not enough to choose the best estimator because of nonuniqueness: usually, if there is one unbiased estimator of a parameter, then there are infinitely many unbiased estimators of the same parameter. For example, we know that the sample mean \bar{X} unbiasedly estimates the population mean E\bar{X}=EX. Since EX_1=EX (X_1 is the first observation), we can easily construct an infinite family of unbiased estimators Y=(\bar{X}+aX_1)/(1+a), assuming a\ne -1. Indeed, using linearity of expectation EY=(E\bar{X}+aEX_1)/(1+a)=EX.

Variance is another measure of an estimator quality: to have a lower spread of estimator values, among competing estimators we choose the one which has the lowest variance. Knowing the estimator variance allows us to find the z-score and use statistical tables.

Slope estimator variance

It is not difficult to find the variance of the slope estimator using representation (6) derived here:

\hat{b}=b+\frac{1}{n}\sum a_ie_i

where a_i=(x_i-\bar{x})/Var_u(x).

Don't try to apply directly the definition of variance at this point, because there will be a square of a sum, which leads to a double sum upon squaring. We need two new assumptions.

A4. Uncorrelatedness of errors. Assume that Cov(e_i,e_j)=0 for all i\ne j (errors from different equations (1) are uncorrelated). Note that because of the unbiasedness condition, this assumption is equivalent to Ee_ie_j=0 for all i\ne j. This assumption is likely to be satisfied if we observe consumption patterns of unrelated individuals.

A5. Homoscedasticity. All errors have the same variancesVar(e_i)=\sigma^2 for all i. Again, because of the unbiasedness condition, this assumption is equivalent to Ee_i^2=\sigma^2 for all i.

Now we can derive the variance expression, using properties from this post:

Var(\hat{b})=Var(b+\frac{1}{n}\sum_i a_ie_i) (dropping a constant doesn't affect variance)

=Var(\frac{1}{n}\sum_i a_ie_i) (for uncorrelated variables, variance is additive)

=\sum_i Var(\frac{1}{n}a_ie_i) (variance is homogeneous of degree 2)

=\frac{1}{n^2}\sum_i a_i^2Var(e_i) (applying homoscedasticity)

=\frac{1}{n^2}\sum_i a_i^2\sigma^2 (plugging a_i)

=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x) (using the notation of sample variance)

=\frac{1}{n}Var_u(x)\sigma^2/Var^2_u(x)=\sigma^2/(nVar_u(x)).

 

Note that canceling out two variances in the last line is obvious. It is not so obvious for some if instead of the short notation for variances you use summation signs. The case of the intercept variance is left as an exercise.

Conclusion

The above assumptions A1-A5 are called classical. It is necessary to remember their role in derivations because a considerable part of Econometrics is devoted to deviations from classical assumptions. Once you have a certain assumption violated, you should expect the corresponding estimator property invalidated. For example, if Ee_i\ne 0, you should expect the estimators to be biased. If any of A4-A5 is not true, the formula we have derived

Var(\hat{b})=\sigma^2/(nVar_u(x))

will not hold. Besides, the Gauss-Markov theorem that the OLS estimators are efficient will not hold (this will be discussed later). The pair A4-A5 can be called an efficiency condition.

2
Jan 17

Conditional variance properties

Preliminaries

Review Properties of conditional expectation, especially the summary, where I introduce a new notation for conditional expectation. Everywhere I use the notation E_Y\pi for expectation of \pi conditional on Y, instead of E(\pi|Y).

This post and the previous one on conditional expectation show that conditioning is a pretty advanced notion. Many introductory books use the condition E_xu=0 (the expected value of the error term u=0 conditional on the regressor x is zero). Because of the complexity of conditioning, I think it's better to avoid this kind of assumption as much as possible.

Conditional variance properties

Replacing usual expectations by their conditional counterparts in the definition of variance, we obtain the definition of conditional variance:

(1) Var_Y(X)=E_Y(X-E_YX)^2.

Property 1. If X,Y are independent, then X-EX and Y are also independent and conditioning doesn't change variance:

Var_Y(X)=E_Y(X-EX)^2=E(X-EX)^2=Var(X),

see Conditioning in case of independence.

Property 2. Generalized homogeneity of degree 2: if a is a deterministic function, then a^2(Y) can be pulled out:

Var_Y(a(Y)X)=E_Y[a(Y)X-E_Y(a(Y)X)]^2=E_Y[a(Y)X-a(Y)E_YX]^2

=E_Y[a^2(Y)(X-E_YX)^2]=a^2(Y)E_Y(X-E_YX)^2=a^2(Y)Var_Y(X).

Property 3. Shortcut for conditional variance:

(2) Var_Y(X)=E_Y(X^2)-(E_YX)^2.

Proof.

Var_Y(X)=E_Y(X-E_YX)^2=E_Y[X^2-2XE_YX+(E_YX)^2]

(distributing conditional expectation)

=E_YX^2-2E_Y(XE_YX)+E_Y(E_YX)^2

(applying Properties 2 and 6 from this Summary with a(Y)=E_YX)

=E_YX^2-2(E_YX)^2+(E_YX)^2=E_YX^2-(E_YX)^2.

Property 4The law of total variance:

(3) Var(X)=Var(E_YX)+E[Var_Y(X)].

Proof. By the shortcut for usual variance and the law of iterated expectations

Var(X)=EX^2-(EX)^2=E[E_Y(X^2)]-[E(E_YX)]^2

(replacing E_Y(X^2) from (2))

=E[Var_Y(X)]+E(E_YX)^2-[E(E_YX)]^2

(the last two terms give the shortcut for variance of E_YX)

=E[Var_Y(X)]+Var(E_YX).

Before we move further we need to define conditional covariance by

Cov_Y(S,T) = E_Y(S - E_YS)(T - E_YT)

(everywhere usual expectations are replaced by conditional ones). We say that random variables S,T are conditionally uncorrelated if Cov_Y(S,T) = 0.

Property 5. Conditional variance of a linear combination. For any random variables S,T and functions a(Y),b(Y) one has

Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+2a(Y)b(Y)Cov_Y(S,T)+b^2(Y)Var_Y(T).

The proof is quite similar to that in case of usual variances, so we leave it to the reader. In particular, if S,T are conditionally uncorrelated, then the interaction terms disappears:

Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+b^2(Y)Var_Y(T).

26
Nov 16

Properties of correlation

Correlation coefficient: the last block of statistical foundation

Correlation has already been mentioned in

Statistical measures and their geometric roots

Properties of standard deviation

The pearls of AP Statistics 35

Properties of covariance

The pearls of AP Statistics 33

The hierarchy of definitions

Suppose random variables X,Y are not constant. Then their standard deviations are not zero and we can define their correlation as in Chart 1.

correlation-definition

Chart 1. Correlation definition

Properties of correlation

Property 1. Range of the correlation coefficient: for any X,Y one has - 1 \le \rho (X,Y) \le 1.
This follows from the Cauchy-Schwarz inequality, as explained here.

Recall from this post that correlation is cosine of the angle between X-EX and Y-EY.
Property 2. Interpretation of extreme cases. (Part 1) If \rho (X,Y) = 1, then Y = aX + b with a > 0.

(Part 2) If \rho (X,Y) = - 1, then Y = aX + b with a < 0.

Proof. (Part 1) \rho (X,Y) = 1 implies
(1) Cov (X,Y) = \sigma (X)\sigma (Y)
which, in turn, implies that Y is a linear function of X: Y = aX + b (this is the second part of the Cauchy-Schwarz inequality). Further, we can establish the sign of the number a. By the properties of variance and covariance
Cov(X,Y)=Cov(X,aX+b)=aCov(X,X)+Cov(X,b)=aVar(X),

\sigma (Y)=\sigma(aX + b)=\sigma (aX)=|a|\sigma (X).
Plugging this in Eq. (1) we get aVar(X) = |a|\sigma^2(X) and see that a is positive.

The proof of Part 2 is left as an exercise.

Property 3. Suppose we want to measure correlation between weight W and height H of people. The measurements are either in kilos and centimeters {W_k},{H_c} or in pounds and feet {W_p},{H_f}. The correlation coefficient is unit-free in the sense that it does not depend on the units used: \rho (W_k,H_c)=\rho (W_p,H_f). Mathematically speaking, correlation is homogeneous of degree 0 in both arguments.
Proof. One measurement is proportional to another, W_k=aW_p,\ H_c=bH_f with some positive constants a,b. By homogeneity
\rho (W_k,H_c)=\frac{Cov(W_k,H_c)}{\sigma(W_k)\sigma(H_c)}=\frac{Cov(aW_p,bH_f)}{\sigma(aW_p)\sigma(bH_f)}=\frac{abCov(W_p,H_f)}{ab\sigma(W_p)\sigma (H_f)}=\rho (W_p,H_f).

 

12
Nov 16

Properties of standard deviation

Properties of standard deviation are divided in two parts. The definitions and consequences are given here. Both variance and standard deviation are used to measure variability of values of a random variable around its mean. Then why use both of them? The why will be explained in another post.

Properties of standard deviation: definitions and consequences

Definition. For a random variable X, the quantity \sigma (X) = \sqrt {Var(X)} is called its standard deviation.

    Digression about square roots and absolute values

In general, there are two square roots of a positive number, one positive and the other negative. The positive one is called an arithmetic square root. The arithmetic root is applied here to Var(X) \ge 0 (see properties of variance), so standard deviation is always nonnegative.
Definition. An absolute value of a real number a is defined by
(1) |a| =a if a is nonnegative and |a| =-a if a is negative.
This two-part definition is a stumbling block for many students, so making them plug in a few numbers is a must. It is introduced to measure the distance from point a to the origin. For example, dist(3,0) = |3| = 3 and dist(-3,0) = |-3| = 3. More generally, for any points a,b on the real line the distance between them is given by dist(a,b) = |a - b|.

By squaring both sides in Eq. (1) we obtain |a|^2={a^2}. Application of the arithmetic square root gives

(2) |a|=\sqrt {a^2}.

This is the equation we need right now.

Back to standard deviation

Property 1. Standard deviation is homogeneous of degree 1. Indeed, using homogeneity of variance and equation (2), we have

\sigma (aX) =\sqrt{Var(aX)}=\sqrt{{a^2}Var(X)}=|a|\sigma(X).

Unlike homogeneity of expected values, here we have an absolute value of the scaling coefficient a.

Property 2. Cauchy-Schwarz inequality. (Part 1) For any random variables X,Y one has

(3) |Cov(X,Y)|\le\sigma(X)\sigma(Y).

(Part 2) If the inequality sign in (3) turns into equality, |Cov(X,Y)|=\sigma (X)\sigma (Y), then Y is a linear function of X: Y = aX + b, with some constants a,b.
Proof. (Part 1) If at least one of the variables is constant, both sides of the inequality are 0 and there is nothing to prove. To exclude the trivial case, let X,Y be non-constant and, therefore, Var(X),\ Var(Y) are positive. Consider a real-valued function of a real number t defined by f(t) = Var(tX + Y). Here we have variance of a linear combination

f(t)=t^2Var(X)+2tCov(X,Y)+Var(Y).

We see that f(t) is a parabola with branches looking upward (because the senior coefficient Var(X) is positive). By nonnegativity of variance, f(t)\ge 0 and the parabola lies above the horizontal axis in the (f,t) plane. Hence, the quadratic equation f(t) = 0 may have at most one real root. This means that the discriminant of the equation is non-positive:

D=Cov(X,Y)^2-Var(X)Var(Y)\le 0.

Applying square roots to both sides of Cov(X,Y)^2\le Var(X)Var(Y) we finish the proof of the first part.

(Part 2) In case of the equality sign the discriminant is 0. Therefore the parabola touches the horizontal axis where f(t)=Var(tX + Y)=0. But we know that this implies tX + Y = constant which is just another way of writing Y = aX + b.

Comment. (3) explains one of the main properties of the correlation:

-1\le\rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}\le 1.

 

 

 

25
Oct 16

Properties of variance

All properties of variance in one place

Certainty is the mother of quiet and repose, and uncertainty the cause of variance and contentions. Edward Coke

Preliminaries: study properties of means with proofs.

Definition. Yes, uncertainty leads to variance, and we measure it by Var(X)=E(X-EX)^2. It is useful to use the name deviation from mean for X-EX and realize that E(X-EX)=0, so that the mean of the deviation from mean cannot serve as a measure of variation of X around EX.

Property 1. Variance of a linear combination. For any random variables X,Y and numbers a,b one has
(1) Var(aX + bY)=a^2Var(X)+2abCov(X,Y)+b^2Var(Y).
The term 2abCov(X,Y) in (1) is called an interaction term. See this post for the definition and properties of covariance.
Proof.
Var(aX + bY)=E[aX + bY -E(aX + bY)]^2

(using linearity of means)
=E(aX + bY-aEX -bEY)^2

(grouping by variable)
=E[a(X-EX)+b(Y-EY)]^2

(squaring out)
=E[a^2(X-EX)^2+2ab(X-EX)(Y-EY)+(Y-EY)^2]

(using linearity of means and definitions of variance and covariance)
=a^2Var(X) + 2abCov(X,Y) +b^2Var(Y).
Property 2. Variance of a sum. Letting in (1) a=b=1 we obtain
Var(X + Y) = Var(X) + 2Cov(X,Y)+Var(Y).

Property 3. Homogeneity of degree 2. Choose b=0 in (1) to get
Var(aX)=a^2Var(X).
Exercise. What do you think is larger: Var(X+Y) or Var(X-Y)?
Property 4. If we add a constant to a variable, its variance does not change: Var(X+c)=E[X+c-E(X+c)]^2=E(X+c-EX-c)^2=E(X-EX)^2=Var(X)
Property 5. Variance of a constant is zero: Var(c)=E(c-Ec)^2=0.

Property 6. Nonnegativity. Since the squared deviation from mean (X-EX)^2 is nonnegative, its expectation is nonnegativeE(X-EX)^2\ge 0.

Property 7. Only a constant can have variance equal to zero: If Var(X)=0, then E(X-EX)^2 =(x_1-EX)^2p_1 +...+(x_n-EX)^2p_n=0, see the definition of the expected value. Since all probabilities are positive, we conclude that x_i=EX for all i, which means that X is identically constant.

Property 8. Shortcut for variance. We have an identity E(X-EX)^2=EX^2-(EX)^2. Indeed, squaring out gives

E(X-EX)^2 =E(X^2-2XEX+(EX)^2)

(distributing expectation)

=EX^2-2E(XEX)+E(EX)^2

(expectation of a constant is constant)

=EX^2-2(EX)^2+(EX)^2=EX^2-(EX)^2.

All of the above properties apply to any random variables. The next one is an exception in the sense that it applies only to uncorrelated variables.

Property 9. If variables are uncorrelated, that is Cov(X,Y)=0, then from (1) we have Var(aX + bY)=a^2Var(X)+b^2Var(Y). In particular, letting a=b=1, we get additivityVar(X+Y)=Var(X)+Var(Y). Recall that the expected value is always additive.

GeneralizationsVar(\sum a_iX_i)=\sum a_i^2Var(X_i) and Var(\sum X_i)=\sum Var(X_i) if all X_i are uncorrelated.

Among my posts, where properties of variance are used, I counted 12 so far.

20
Sep 16

The pearls of AP Statistics 29

Normal distributions: sometimes it is useful to breast the current

The usual way of defining normal variables is to introduce the whole family of normal distributions and then to say that the standard normal is a special member of this family. Here I show that, for didactic purposes, it is better to do the opposite.

Standard normal distribution

The standard normal distribution z is defined by its probability density

p(x)=\frac{1}{\sqrt{2\pi}}\exp(-\frac{x^2}{2}).

Usually students don't remember this equation, and they don't need to. The point is to emphasize that this is a specific density, not a generic "bell shape".

standard-normal

Figure 1. Standard normal density

From the plot of the density (Figure 1) they can guess that the mean of this variable is zero.

 

 

 

 

 

area-under-xpx

Figure 2. Plot of xp(x)

Alternatively, they can look at the definition of the mean of a continuous random variable Ez=\int_\infty^\infty xp(x)dx. Here the function f(x)=xp(x) has the shape given in Figure 2, where the positive area to the right of the origin exactly cancels out with the negative area to the left of the origin. Since an integral means the area under the function curve, it follows that

(1) Ez=0.

 

To find variance, we use the shortcut:

Var(z)=Ez^2-(Ez)^2=Ez^2=\int_{-\infty}^\infty x^2p(x)dx=2\int_0^\infty x^2p(x)dx=1.

plot-for-variance

Figure 3. Plot of x^2p(x)

 

The total area under the curve is twice the area to the right of the origin, see Figure 3. Here the last integral has been found using Mathematica. It follows that

(2) \sigma(z)=\sqrt{Var(z)}=1.

General normal distribution

linear-transformation

Figure $. Visualization of linear transformation - click to view video

Fix some positive \sigma and real \mu. A (general) normal variable \mu is defined as a linear transformation of z:

(3) X=\sigma z+\mu.

Changing \mu moves the density plot to the left (if \mu is negative) and to the right (if \mu is positive). Changing \sigma makes the density peaked or flat. See video. Enjoy the Mathematica file.

 

 

 

Properties follow like from the horn of plenty:

A) Using (1) and (3) we easily find the mean of X:

EX=\sigma Ez+\mu=\mu.

B) From (2) and (3) we have

Var(X)=Var(\sigma z)=\sigma^2Var(z)=\sigma^2

(the constant \mu does not affect variance and variance is homogeneous of degree 2).

C) Solving (3) for z gives us the z-score:

z=\frac{X-\mu}{\sigma}.

D) Moreover, we can prove that a linear transformation of a normal variable is normal. Indeed, let X be defined by (3) and let Y be its linear transformation: Y=\delta X+\nu. Then

Y=\delta (\sigma z+\mu)+\nu=\delta\sigma z+(\delta\mu+\nu)

is a linear transformation of the standard normal and is therefore normal.

Remarks. 1) In all of the above, no derivation is longer than one line. 2) Reliance on geometry improves understanding. 3) Only basic properties of means and variances are used. 4) With the traditional way of defining the normal distribution using the equation

p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(x-\mu)^2}{2\sigma^2})

there are two problems. Nobody understands this formula and it is difficult to extract properties of the normal variable from it.

Compare the above exposition with that of Agresti and Franklin: a) The normal distribution is symmetric, bell-shaped, and characterized by its mean μ and standard deviation σ (p.277) and b) The Standard Normal Distribution has Mean = 0 and Standard Deviation = 1 (p.285). It is the same old routine: remember this, remember that.

10
Jan 16

What is a z score: the scientific explanation

You know what is a z score when you know why people invented it.

As usual, we start with a theoretical motivation. There is a myriad of distributions. Even if we stay within the set of normal distributions, there is an infinite number of them, indexed by their means \mu(X)=EX and standard deviations \sigma(X)=\sqrt{Var(X)}. When computers did not exist, people had to use statistical tables. It was impossible to produce statistical tables for an infinite number of distributions, so the problem was to reduce the case of general \mu(X) and \sigma(X) to that of \mu(X)=0 and \sigma(X)=1.

But we know that that can be achieved by centering and scaling. Combining these two transformations, we obtain the definition of the z score:

z=\frac{X-\mu(X)}{\sigma(X)}.

Using the properties of means and variances we see that

Ez=\frac{E(X-\mu(X))}{\sigma(X)}=0,

Var(z)=\frac{Var(X-\mu(X))}{\sigma^2(X)}=\frac{Var(X)}{\sigma^2(X)}=1.

The transformation leading from X to its z score sometimes is called standardization.

This site promises to tell you the truth about undergraduate statistics. The truth about the z score is that:

(1) Standardization can be applied to any variable with finite variance, not only to normal variables. The z score is a standard normal variable only when the original variable X is normal, contrary to what some sites say.

(2) With modern computers, standardization is not necessary to find critical values for X, see Chapter 14 of my book.

9
Jan 16

Scaling a distribution

Scaling a distribution is as important as centering or demeaning considered here. The question we want to find an answer for is this: What can you do to a random variable X to obtain another random variable, say, Y, whose variance is one? Like in case of centering, geometric considerations can be used but I want to follow the algebraic approach, which is more powerful.

Hint: in case of centering, we subtract the mean, Y=X-EX. For the problem at hand the suggestion is to use scaling: Y=aX, where a is a number to be determined.

Using the fact that variance is homogeneous of degree 2, we have

Var(Y)=Var(aX)=a^2Var(X).

We want Var(Y) to be 1, so solving for a gives a=1/\sqrt{Var(X)}=1/\sigma(X). Thus, division by the standard deviation answers our question: the variable Y=X/\sigma(X) has variance and standard deviation equal to 1.

Note. Always use the notation for standard deviation \sigma with its argument X.