11 Feb 17

Gauss-Markov theorem

The Gauss-Markov theorem states that the OLS estimator is the most efficient one. Without algebra, you cannot take a single step further, whether toward the precise theoretical statement or an application.

Why do we care about linearity?

The concept of linearity has come up many times in my posts. Here we start from scratch and apply it to estimators.

The slope in simple regression

(1) y_i=a+bx_i+e_i

can be estimated by

\hat{b}(y,x)=\frac{Cov_u(y,x)}{Var_u(x)}.

Note that the notation makes explicit the dependence of the estimator on x,y. Imagine that we have two sets of observations: (y_1^{(1)},x_1),...,(y_n^{(1)},x_n) and (y_1^{(2)},x_1),...,(y_n^{(2)},x_n) (the x coordinates are the same but the y coordinates are different). In addition, the regressor is deterministic. The x's could be spatial units and the y's temperature measurements at these units at two different moments.

Definition. We say that \hat{b}(y,x) is linear with respect to y if for any two vectors y^{(i)}= (y_1^{(i)},...,y_n^{(i)}), i=1,2, and numbers c,d we have

\hat{b}(cy^{(1)}+dy^{(2)},x)=c\hat{b}(y^{(1)},x)+d\hat{b}(y^{(2)},x).

This definition is quite similar to that of linearity of means. Linearity of the estimator with respect to y follows easily from linearity of covariance in its first argument:

\hat{b}(cy^{(1)}+dy^{(2)},x)=\frac{Cov_u(cy^{(1)}+dy^{(2)},x)}{Var_u(x)}=\frac{cCov_u(y^{(1)},x)+dCov_u(y^{(2)},x)}{Var_u(x)}=c\hat{b}(y^{(1)},x)+d\hat{b}(y^{(2)},x).

In addition to knowing how to establish linearity, it's a good idea to be able to see when something is not linear. Recall that linearity implies homogeneity of degree 1. Hence, if something is not homogeneous of degree 1, it cannot be linear. The OLS estimator is not linear in x because it is homogeneous of degree -1 in x:

\hat{b}(y,cx)=\frac{Cov_u(y,cx)}{Var_u(cx)}=\frac{c}{c^2}\frac{Cov_u(y,x)}{Var_u(x)}=\frac{1}{c}\hat{b}(y,x).
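
If you like to check such identities numerically, here is a minimal sketch in Python (NumPy assumed; the function name b_hat, the data and the constants c, d are illustrative choices, not part of the post) that verifies linearity in y and homogeneity of degree -1 in x:

  import numpy as np

  def b_hat(y, x):
      # \hat{b}(y,x) = Cov_u(y,x)/Var_u(x); the 1/n factors cancel in the ratio
      xc = x - x.mean()
      return np.sum((y - y.mean()) * xc) / np.sum(xc ** 2)

  rng = np.random.default_rng(0)
  x = np.arange(10.0)                        # deterministic regressor
  y1 = rng.normal(size=10)                   # first set of y's
  y2 = rng.normal(size=10)                   # second set of y's on the same x's
  c, d = 2.0, -3.0

  # linearity with respect to y
  print(np.isclose(b_hat(c * y1 + d * y2, x), c * b_hat(y1, x) + d * b_hat(y2, x)))
  # homogeneity of degree -1 with respect to x
  print(np.isclose(b_hat(y1, c * x), b_hat(y1, x) / c))

Both checks print True.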

Gauss-Markov theorem

Students have no problem remembering the acronym BLUE: the OLS estimator is the Best Linear Unbiased Estimator. Decoding this acronym starts from the end.

  1. An estimator, by definition, is a function of sample data.
  2. Unbiasedness of OLS estimators is thoroughly discussed here.
  3. Linearity of the slope estimator with respect to y has been proved above. Linearity with respect to x is not required.
  4. Now we look at the class of all slope estimators that are linear with respect to y. As an exercise, show that the instrumental variables estimator belongs to this class.

Gauss-Markov Theorem. Under the classical assumptions, the OLS estimator of the slope has the smallest variance in the class of all unbiased slope estimators that are linear with respect to y.

In particular, the OLS estimator of the slope is more efficient than the IV estimator. The beauty of this result is that you don't need explicit expressions for their variances (even though they can be derived).
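
The claim can also be illustrated by simulation. Below is a rough sketch (Python with NumPy assumed; the deterministic instrument z, the parameter values and the number of replications are hypothetical choices for illustration) comparing the sampling variances of the OLS slope and the standard simple-IV slope Cov_u(z,y)/Cov_u(z,x) under the classical assumptions:

  import numpy as np

  rng = np.random.default_rng(0)
  n, a, b, sigma = 50, 1.0, 2.0, 1.0
  x = np.linspace(0.0, 10.0, n)              # deterministic regressor
  z = x + 3 * np.sin(x)                      # deterministic instrument, correlated with x

  def ols_slope(y):
      xc = x - x.mean()
      return np.sum((y - y.mean()) * xc) / np.sum(xc ** 2)

  def iv_slope(y):
      zc = z - z.mean()
      return np.sum((y - y.mean()) * zc) / np.sum((x - x.mean()) * zc)

  ols, iv = [], []
  for _ in range(20000):
      e = rng.normal(scale=sigma, size=n)    # uncorrelated, homoscedastic errors
      y = a + b * x + e
      ols.append(ols_slope(y))
      iv.append(iv_slope(y))

  print(np.var(ols), np.var(iv))             # OLS variance should be the smaller of the two

Both estimators are linear in y and unbiased here, and the simulated OLS variance should come out smaller, in line with the theorem.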

Remark. Even the above formulation is incomplete. In fact, it is the pair (intercept estimator, slope estimator) that is efficient, and proving this requires matrix algebra.

 

8 Jan 17

OLS estimator variance

Assumptions about simple regression

We consider the simple regression

(1) y_i=a+bx_i+e_i

Here we derived the OLS estimators of the intercept and slope:

(2) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(3) \hat{a}=\bar{y}-\hat{b}\bar{x}.

A1. Existence condition. Since division by zero is not allowed, for (2) to exist we require Var_u(x)\ne 0. If this condition fails, there is no variation in x and all observed points lie on a vertical line.

A2. Convenience condition. The regressor x is deterministic. This condition is imposed to be able to apply the properties of expectation, see equation (7) in this post. The time trend and dummy variables are examples of deterministic regressors. However, most real-life regressors are stochastic. Modifying the theory in order to cover stochastic regressors is the subject of two posts: finite-sample theory and large-sample theory.

A3. Unbiasedness condition. Ee_i=0 for all i. This is the main assumption that makes sure the OLS estimators are unbiased, see equation (7) in this post.

Unbiasedness is not enough

Unbiasedness characterizes the quality of an estimator; see the intuitive explanation. Unfortunately, unbiasedness is not enough to choose the best estimator, because of nonuniqueness: usually, if there is one unbiased estimator of a parameter, then there are infinitely many unbiased estimators of the same parameter. For example, we know that the sample mean \bar{X} unbiasedly estimates the population mean: E\bar{X}=EX. Since EX_1=EX (X_1 is the first observation), we can easily construct an infinite family of unbiased estimators Y=(\bar{X}+aX_1)/(1+a), assuming a\ne -1. Indeed, using linearity of expectation, EY=(E\bar{X}+aEX_1)/(1+a)=EX.
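
A quick simulation sketch (Python with NumPy assumed; the sample size, the normal distribution and the chosen values of a are arbitrary illustrations) shows that every member of this family is indeed unbiased, while their spreads differ:

  import numpy as np

  rng = np.random.default_rng(0)
  n, mu, reps = 20, 5.0, 100000
  for a in [0.0, 0.5, 2.0]:
      X = rng.normal(loc=mu, scale=1.0, size=(reps, n))   # reps independent samples of size n
      Y = (X.mean(axis=1) + a * X[:, 0]) / (1 + a)        # the estimator Y for this value of a
      print(a, Y.mean(), Y.var())                         # mean is close to mu for every a

Each Y has mean close to the true mu, but the variance grows with a, which is exactly why unbiasedness alone cannot single out the best estimator.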

Variance is another measure of estimator quality: to have a lower spread of estimator values, among competing estimators we choose the one with the lowest variance. Knowing the estimator's variance also allows us to find the z-score and use statistical tables.

Slope estimator variance

It is not difficult to find the variance of the slope estimator using representation (6) derived here:

\hat{b}=b+\frac{1}{n}\sum a_ie_i

where a_i=(x_i-\bar{x})/Var_u(x).
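
Here is a short numerical check (Python with NumPy assumed; the data-generating values are arbitrary illustrations) that this representation agrees with formula (2):

  import numpy as np

  rng = np.random.default_rng(0)
  n, a, b = 15, 1.0, 2.0
  x = np.linspace(0.0, 5.0, n)                               # deterministic regressor
  e = rng.normal(size=n)
  y = a + b * x + e

  xc = x - x.mean()
  var_u_x = np.mean(xc ** 2)                                 # Var_u(x)
  b_hat_direct = np.mean((y - y.mean()) * xc) / var_u_x      # formula (2)
  a_i = xc / var_u_x
  b_hat_repr = b + np.mean(a_i * e)                          # the representation above
  print(np.isclose(b_hat_direct, b_hat_repr))                # True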

Don't try to apply the definition of variance directly at this point: the square of the sum turns into a double sum. We need two new assumptions.

A4. Uncorrelatedness of errors. Assume that Cov(e_i,e_j)=0 for all i\ne j (errors from different equations (1) are uncorrelated). Note that because of the unbiasedness condition, this assumption is equivalent to Ee_ie_j=0 for all i\ne j. This assumption is likely to be satisfied if we observe consumption patterns of unrelated individuals.

A5. Homoscedasticity. All errors have the same variance: Var(e_i)=\sigma^2 for all i. Again, because of the unbiasedness condition, this assumption is equivalent to Ee_i^2=\sigma^2 for all i.

Now we can derive the variance expression, using properties from this post:

Var(\hat{b})=Var(b+\frac{1}{n}\sum_i a_ie_i) (using the representation above)

=Var(\frac{1}{n}\sum_i a_ie_i) (dropping a constant doesn't affect variance)

=\sum_i Var(\frac{1}{n}a_ie_i) (for uncorrelated variables, variance is additive)

=\frac{1}{n^2}\sum_i a_i^2Var(e_i) (variance is homogeneous of degree 2)

=\frac{1}{n^2}\sum_i a_i^2\sigma^2 (applying homoscedasticity)

=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x) (plugging in a_i)

=\frac{1}{n}Var_u(x)\sigma^2/Var^2_u(x)=\sigma^2/(nVar_u(x)) (using the notation of sample variance).

 

Note that with the short notation for variances, canceling the two variances in the last line is obvious. It is much less obvious if you write everything out with summation signs instead. The case of the intercept variance is left as an exercise.
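
A Monte Carlo sketch (Python with NumPy assumed; the parameter values and the number of replications are arbitrary illustrations) comparing the simulated variance of the slope estimator with the formula just derived:

  import numpy as np

  rng = np.random.default_rng(0)
  n, a, b, sigma = 30, 1.0, 2.0, 0.5
  x = np.linspace(1.0, 10.0, n)                   # deterministic regressor
  xc = x - x.mean()
  var_u_x = np.mean(xc ** 2)                      # Var_u(x)

  slopes = []
  for _ in range(50000):
      e = rng.normal(scale=sigma, size=n)         # uncorrelated, homoscedastic errors (A4-A5)
      y = a + b * x + e
      slopes.append(np.mean((y - y.mean()) * xc) / var_u_x)

  print(np.var(slopes))                           # simulated Var(b_hat)
  print(sigma ** 2 / (n * var_u_x))               # theoretical sigma^2/(n Var_u(x))

The two printed numbers should be close.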

Conclusion

The above assumptions A1-A5 are called classical. It is necessary to remember their role in the derivations, because a considerable part of Econometrics is devoted to deviations from the classical assumptions. Once a certain assumption is violated, you should expect the corresponding property of the estimator to fail. For example, if Ee_i\ne 0, you should expect the estimators to be biased. If any of A4-A5 is not true, the formula we have derived

Var(\hat{b})=\sigma^2/(nVar_u(x))

will not hold. Moreover, the Gauss-Markov theorem, which says that the OLS estimators are efficient, will not hold either (this will be discussed later). The pair A4-A5 can be called an efficiency condition.