Nov 18

Application: estimating sigma squared

Consider multiple regression

(1) y=X\beta +e


We assume that (a) the regressors are deterministic, (b) the number of regressors k is smaller than the number of observations n, (c) the regressors are linearly independent, that is, \det (X^TX)\neq 0, and (d) the errors have zero mean, Ee=0, and are homoscedastic and uncorrelated,

(2) Var(e)=\sigma^2I.

Students usually remember that \beta should be estimated but pay little attention to the estimation of \sigma^2. Partly this is because \sigma^2 does not appear explicitly in the regression equation, and partly because the result on estimation of the error variance is more complex than the result on the OLS estimator of \beta.

Definition 1. Let \hat{\beta}=(X^TX)^{-1}X^Ty be the OLS estimator of \beta. \hat{y}=X\hat{\beta} is called the fitted value and r=y-\hat{y} is called the residual.

Exercise 1. Using the projectors P=X(X^TX)^{-1}X^T and Q=I-P show that \hat{y}=Py and r=Qe.

Proof. The first equation follows by substituting \hat{\beta}: \hat{y}=X(X^TX)^{-1}X^Ty=Py. Hence r=y-Py, and from the model we have r=X\beta+e-P(X\beta +e). Since PX\beta=X(X^TX)^{-1}X^TX\beta=X\beta, we have further r=e-Pe=Qe.
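
The two identities of Exercise 1 can be checked numerically. The following is only a sketch, assuming NumPy; the sizes n and k, the coefficient vector, the error scale and the seed are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3                            # hypothetical sizes with k < n
X = rng.normal(size=(n, k))             # design matrix, full column rank with probability one
beta = np.array([1.0, -2.0, 0.5])       # hypothetical true coefficients
e = rng.normal(scale=2.0, size=n)       # hypothetical errors
y = X @ beta + e

P = X @ np.linalg.inv(X.T @ X) @ X.T    # projector onto the image of X
Q = np.eye(n) - P                       # projector onto the orthogonal complement

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimator
y_hat = X @ beta_hat                    # fitted value
r = y - y_hat                           # residual

fitted_ok = np.allclose(y_hat, P @ y)   # checks y_hat = Py
resid_ok = np.allclose(r, Q @ e)        # checks r = Qe
print(fitted_ok, resid_ok)
```

Note that r=Qe cannot be used to compute the residual in practice, because e is unobserved; it is a tool for proofs.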

Definition 2. The OLS estimator of \sigma^2 is defined by s^2=\Vert r\Vert^2/(n-k).

Exercise 2. Prove that s^2 is unbiased: Es^2=\sigma^2.

Proof. Using the projector properties Q^T=Q and Q^2=Q we have

\Vert r\Vert^2=(Qe)^TQe=e^TQ^TQe=e^TQe.

Expectations of type Ee^Te and Eee^T would be easy to find from (2). However, we need to find Ee^TQe where there is an obstructing Q. See how this difficulty is overcome in the next calculation.

E\Vert r\Vert^2=Ee^TQe (e^TQe is a scalar, so its trace is equal to itself)

=Etr(e^TQe) (applying trace-commuting)

=Etr(Qee^T) (the regressors and hence Q are deterministic, so we can use linearity of E)

=tr(QEee^T) (the errors have zero mean, so (2) gives Eee^T=Var(e)=\sigma^2I) =\sigma^2tr(Q).

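The trace of a projector equals the dimension of its image, so tr(P)=k. (This can also be seen by trace-commuting: tr(P)=tr((X^TX)^{-1}X^TX)=tr(I_k)=k.) Therefore tr(Q)=tr(I)-tr(P)=n-k. Thus, E\Vert r\Vert^2=\sigma^2(n-k) and Es^2=\sigma^2.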
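
A Monte Carlo sketch of Exercise 2, assuming NumPy. The sizes, the value of \sigma^2, the number of replications and the normality of the errors are assumptions made for illustration (the proof needs only zero mean and (2)). Since r=Qe, the residuals can be simulated directly from the errors, with the design matrix held fixed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma2 = 30, 4, 2.5               # hypothetical sizes and error variance
X = rng.normal(size=(n, k))             # drawn once, then held fixed (deterministic regressors)

P = X @ np.linalg.inv(X.T @ X) @ X.T
Q = np.eye(n) - P
trace_Q = np.trace(Q)                   # should equal n - k up to rounding

reps = 20000
E = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))  # each row is one draw of e
R = E @ Q                               # each row is a residual (Qe)^T, since Q is symmetric
rss = np.sum(R**2, axis=1)              # ||r||^2 for each replication

s2_mean = np.mean(rss / (n - k))        # Monte Carlo estimate of E s^2
naive_mean = np.mean(rss / n)           # dividing by n instead of n - k
print(trace_Q, s2_mean, naive_mean)
```

The average of s^2 should land close to \sigma^2, while dividing by n gives an average close to \sigma^2(n-k)/n, which is too small.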

Oct 16

The pearls of AP Statistics 34

Coefficient of determination: an inductive introduction to R squared

I know a person who did not understand this topic, even though he had a PhD in Math. That was me more than twenty years ago, and the reason was that the topic was presented formally, without explaining the leading idea.

Leading idea

Step 1. We want to describe the relationship between observed y's and x's using the simple regression

y_i=a+bx_i+e_i.

Let us start with the simple case when there is no variability in y's, that is, the slope and the errors are zero. Since y_i=a for all i, we have y_i=\bar{y} and, of course,

(1) \sum(y_i-\bar{y})^2=0.

In the general case, we start with the decomposition

(2) y_i=\hat{y}_i+e_i

where \hat{y}_i is the fitted value and e_i is the residual, see this post. We still want to see how far y_i is from \bar{y}. For this purpose, we subtract \bar{y} from both sides of equation (2), obtaining y_i-\bar{y}=(\hat{y}_i-\bar{y})+e_i. Squaring this equation and summing over i, for the sum in (1) we get

(3) \sum(y_i-\bar{y})^2=\sum(\hat{y}_i-\bar{y})^2+2\sum(\hat{y}_i-\bar{y})e_i+\sum e^2_i.

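Whoever was the first to do this discovered that the cross product is zero (when the regression includes an intercept, the OLS residuals sum to zero and are orthogonal to the x's, hence to the fitted values), so (3) simplifies to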

(4) \sum(y_i-\bar{y})^2=\sum(\hat{y}_i-\bar{y})^2+\sum e^2_i.
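
The step from (3) to (4) can be checked numerically. This is a sketch assuming NumPy and synthetic data; the sample size, the coefficients and the seed are made up for illustration. The fit includes an intercept, which is what makes the cross product vanish.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.7 * x + rng.normal(size=n)   # hypothetical data

b, a = np.polyfit(x, y, 1)               # degree-1 fit; returns slope first, then intercept
y_hat = a + b * x                        # fitted values
e = y - y_hat                            # residuals

y_bar = y.mean()
cross = np.sum((y_hat - y_bar) * e)      # cross product from (3)
lhs = np.sum((y - y_bar) ** 2)           # left side of (4)
rhs = np.sum((y_hat - y_bar) ** 2) + np.sum(e ** 2)   # right side of (4)
print(cross, lhs, rhs)
```

Up to rounding error, the cross product is zero and the two sides of (4) coincide. Without an intercept in the regression the decomposition generally fails.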

The rest is a matter of definitions:

Total Sum of Squares TSS=\sum(y_i-\bar{y})^2 (I prefer to call this the total variation around \bar{y})

Explained Sum of Squares ESS=\sum(\hat{y}_i-\bar{y})^2 (to me this is the explained variation around \bar{y})

Residual Sum of Squares RSS=\sum e^2_i (unexplained variation around \bar{y}, caused by the error term)

Thus from (4) we have

(5) TSS=ESS+RSS.

Step 2. It is desirable to have RSS close to zero and ESS close to TSS. Therefore we can use the ratio ESS/TSS as a measure of how well the regression describes the relationship between y's and x's. From (5) it follows that this ratio takes values between zero and 1. Hence, the coefficient of determination

R^2=ESS/TSS

can be interpreted as the proportion of total variation of y's around \bar{y} explained by regression. From (5) an equivalent definition is

R^2=1-RSS/TSS.

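The two ways of writing the coefficient of determination, ESS/TSS and 1-RSS/TSS, can be compared on data. The sketch below assumes NumPy and uses synthetic data; all numbers in it are hypothetical. It also compares the result with the squared sample correlation between x and y, which coincides with R^2 in simple regression with an intercept.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.uniform(0, 5, size=n)
y = 2.0 - 1.5 * x + rng.normal(size=n)   # hypothetical data

b, a = np.polyfit(x, y, 1)               # slope first, then intercept
y_hat = a + b * x
e = y - y_hat

tss = np.sum((y - y.mean()) ** 2)        # total variation around the mean
ess = np.sum((y_hat - y.mean()) ** 2)    # explained variation
rss = np.sum(e ** 2)                     # unexplained variation

r2_from_ess = ess / tss
r2_from_rss = 1 - rss / tss
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2   # squared sample correlation
print(r2_from_ess, r2_from_rss, r2_from_corr)
```

All three numbers agree up to rounding and lie between zero and 1.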
Back to the pearls of AP Statistics

How much of the above can be explained without algebra? Stats without algebra is a crippled creature. I am afraid that any concept requiring substantial algebra should be dropped from the AP Stats curriculum. Compare this post with the explanation on p. 592 of Agresti and Franklin.