Dec 18

Application: distribution of sigma squared estimator

For the formulation of multiple regression and classical conditions on its elements see Application: estimating sigma squared. There we proved unbiasedness of the OLS estimator of \sigma^2. Here we do more: we characterize its distribution and obtain unbiasedness as a corollary.


We need a summary of what we know about the residual r=y-\hat{y} and the projector Q=I-P, where P=X(X^TX)^{-1}X^T:

(1) \Vert r\Vert^2=e^TQe.

P has k unities and n-k zeros on the diagonal of its diagonal representation, where k is the number of regressors. With Q it's the opposite: it has n-k unities and k zeros on the diagonal of its diagonal representation. We can always assume that the unities come first, so in the diagonal representation

(2) Q=UDU^{-1}

the matrix U is orthogonal and D can be written as

(3) D=\left(\begin{array}{cc}I_{n-k}&0\\0&0\end{array}\right)

where I_{n-k} is an identity matrix and the zeros are zero matrices of compatible dimensions.
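These counts of unities and zeros are easy to verify numerically. Below is a minimal sketch in Python with numpy (the tool choice is mine, not the post's; the dimensions are arbitrary): for a random X with n=10 rows and k=3 columns, the eigenvalues of P sum to k and those of Q sum to n-k.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 3
X = rng.standard_normal((n, k))          # n observations, k regressors

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projector onto the image of X
Q = np.eye(n) - P

# Eigenvalues of a projector are 0 or 1, so their sum counts the unities.
eig_P = np.linalg.eigvalsh(P)
eig_Q = np.linalg.eigvalsh(Q)
print(round(eig_P.sum()), round(eig_Q.sum()))   # prints 3 7, i.e. k and n-k
```

The sum of eigenvalues equals the trace, which is why the counts come out exactly k and n-k.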

Characterization of the distribution of s^2

Exercise 1. Suppose the error vector e is normal: e\sim N(0,\sigma^2I). Prove that the vector \delta =U^{-1}e/\sigma is standard normal.

Proof. As a linear transformation of a normal vector, \delta is normal, and E\delta =U^{-1}Ee/\sigma =0. By the properties of orthogonal matrices U^{-1}=U^T, so (U^{-1})^T=U and

V(\delta )=\frac{1}{\sigma^2}U^{-1}V(e)(U^{-1})^T=\frac{1}{\sigma^2}U^{-1}(\sigma^2I)U=U^{-1}U=I.

This, together with the equation E\delta =0, proves that \delta is standard normal.

Exercise 2. Prove that \Vert r\Vert^2/\sigma^2 is distributed as \chi _{n-k}^2.

Proof. From (1) and (2) we have

\Vert r\Vert^2/\sigma^2=e^TUDU^{-1}e/\sigma^2=(U^{-1}e)^TD(U^{-1}e)/\sigma^2=\delta^TD\delta.

Now (3) shows that \Vert r\Vert^2/\sigma^2=\sum_{i=1}^{n-k}\delta_i^2, which is the definition of \chi_{n-k}^2.

Exercise 3. Find the mean and variance of s^2=\Vert r\Vert^2/(n-k), which by Exercise 2 is distributed as \sigma^2\chi_{n-k}^2/(n-k).

Solution. From Exercise 2 we obtain the result proved earlier in a different way:

Es^2=\sigma^2E\chi _{n-k}^2/(n-k)=\sigma^2.

Further, since the variance of \chi_{n-k}^2 is 2(n-k) (a consequence of the fourth moment of a standard normal, E\delta_i^4=3),

V(s^2)=\frac{\sigma^4}{(n-k)^2}V(\chi_{n-k}^2)=\frac{2\sigma^4}{n-k}.

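Both moments can be checked by simulation. Here is a sketch in Python with numpy (the tool, the seed and the sample sizes are my choices): since r=Qe, we can generate many error vectors at once and compare the sample mean and variance of s^2 with \sigma^2 and 2\sigma^4/(n-k).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma = 20, 4, 2.0
X = rng.standard_normal((n, k))

# Residual maker Q = I - P; the residual is r = Qe, so s^2 = e'Qe/(n-k).
P = X @ np.linalg.inv(X.T @ X) @ X.T
Q = np.eye(n) - P

reps = 100_000
E = sigma * rng.standard_normal((reps, n))   # each row is one error vector
R = E @ Q                                    # residuals for all samples at once
s2 = (R * R).sum(axis=1) / (n - k)

print(s2.mean())   # should be close to sigma^2 = 4
print(s2.var())    # should be close to 2*sigma^4/(n-k) = 2
```

With 100,000 replications both sample moments land within a couple of decimal places of the theoretical values.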

Nov 18

Application: Ordinary Least Squares estimator

Generalized Pythagoras theorem

Exercise 1. Let P be a projector (a symmetric idempotent matrix) and denote Q=I-P. Then \Vert x\Vert^2=\Vert Px\Vert^2+\Vert Qx\Vert^2.

Proof. By the scalar product properties

\Vert x\Vert^2=\Vert Px+Qx\Vert^2=\Vert Px\Vert^2+2(Px)\cdot (Qx)+\Vert Qx\Vert^2.

P is symmetric and idempotent, so

(Px)\cdot (Qx)=(Px)\cdot[(I-P)x]=x\cdot[(P-P^2)x]=0.

This proves the statement.
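A quick numerical illustration (a sketch in Python with numpy, which is my addition): take P=X(X^TX)^{-1}X^T for a random X, which is symmetric and idempotent, and check the identity on a random x.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 3
X = rng.standard_normal((n, k))
P = X @ np.linalg.inv(X.T @ X) @ X.T   # symmetric and idempotent => projector
Q = np.eye(n) - P

x = rng.standard_normal(n)
lhs = x @ x                                      # ||x||^2
rhs = (P @ x) @ (P @ x) + (Q @ x) @ (Q @ x)      # ||Px||^2 + ||Qx||^2
print(np.isclose(lhs, rhs))   # True
```

The cross term (Px)\cdot(Qx) vanishes numerically as well, which is the heart of the proof above.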

Ordinary Least Squares (OLS) estimator derivation

Problem statement. A vector y\in R^n (the dependent vector) and vectors x^{(1)},...,x^{(k)}\in R^n (independent vectors or regressors) are given. The OLS estimator is defined as that vector \beta \in R^k which minimizes the total sum of squares TSS=\sum_{i=1}^n(y_i-x_i^{(1)}\beta_1-...-x_i^{(k)}\beta_k)^2.

Denoting X=(x^{(1)},...,x^{(k)}), we see that TSS=\Vert y-X\beta\Vert^2 and that finding the OLS estimator means approximating y with vectors from the image \text{Img}X. The regressors x^{(1)},...,x^{(k)} should be linearly independent, otherwise the solution will not be unique.

Assumption. x^{(1)},...,x^{(k)} are linearly independent. This, in particular, implies that k\leq n.

Exercise 2. Show that the OLS estimator is

(2) \hat{\beta}=(X^TX)^{-1}X^Ty.

Proof. The matrix P=X(X^TX)^{-1}X^T is symmetric and idempotent, so Exercise 1 applies to it. Since X\beta belongs to the image of P, P doesn't change it: X\beta=PX\beta. Denoting also Q=I-P we have

\Vert y-X\beta\Vert^2=\Vert y-Py+Py-X\beta\Vert^2

=\Vert Qy+P(y-X\beta)\Vert^2 (by Exercise 1)

=\Vert Qy\Vert^2+\Vert P(y-X\beta)\Vert^2.

This shows that \Vert Qy\Vert^2 is a lower bound for \Vert y-X\beta\Vert^2. This lower bound is achieved when the second term is made zero. From

P(y-X\beta)=Py-X\beta =X(X^TX)^{-1}X^Ty-X\beta=X[(X^TX)^{-1}X^Ty-\beta]

we see that the second term is zero if \beta satisfies (2).

Usually the above derivation is applied to the dependent vector of the form y=X\beta+e where e is a random vector with mean zero. But it holds without this assumption. See also simplified derivation of the OLS estimator.
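As a sanity check of (2), one can compare the explicit formula with a library least-squares routine. Here is a sketch in Python with numpy (the tool choice is mine); note that, consistent with the remark above, no model for y is assumed.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 3
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)          # arbitrary y: formula (2) needs no model

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y          # formula (2)
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]       # library least squares
print(np.allclose(beta_hat, beta_ls))                # True
```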

Dec 16

Multiple regression through the prism of dummy variables

Agresti and Franklin on p.658 say: The indicator variable for a particular category is binary. It equals 1 if the observation falls into that category and it equals 0 otherwise. I say: For most students, this is not clear enough.

Problem statement

Figure 1. Residential power consumption in 2014 and 2015. Source: http://www.eia.gov/electricity/data.cfm

Residential power consumption in the US has a seasonal pattern. Heating in winter and cooling in summer cause the differences. We want to capture the dependence of residential power consumption PowerC on the season.

Visual approach to dummy variables

Seasons of the year are categorical variables. We have to replace them with quantitative variables to be able to use them in any mathematical procedure that involves arithmetic operations. To this end, we define a dummy variable (indicator) D_{win} for winter such that it equals 1 in winter and 0 in any other period of the year. The dummies D_{spr},\ D_{sum},\ D_{aut} for spring, summer and autumn are defined similarly. We provide two visualizations assuming monthly observations.

Table 1. Tabular visualization of dummies
Month D_{win} D_{spr} D_{sum} D_{aut} D_{win}+D_{spr}+D_{sum}+D_{aut}
December 1 0 0 0 1
January 1 0 0 0 1
February 1 0 0 0 1
March 0 1 0 0 1
April 0 1 0 0 1
May 0 1 0 0 1
June 0 0 1 0 1
July 0 0 1 0 1
August 0 0 1 0 1
September 0 0 0 1 1
October 0 0 0 1 1
November 0 0 0 1 1

Figure 2. Graphical visualization of D_{spr}

The first idea may be wrong

The first thing that comes to mind is to regress PowerC on dummies as in

(1) PowerC=a+bD_{win}+cD_{spr}+dD_{sum}+eD_{aut}+error.

Not so fast. To see the problem, let us rewrite (1) as

(2) PowerC=a\times 1+bD_{win}+cD_{spr}+dD_{sum}+eD_{aut}+error.

This shows that, in addition to the four dummies, there is a fifth variable, which equals 1 across all observations. Let us denote it T (for Trivial). Table 1 shows that

(3) T=D_{win}+D_{spr}+ D_{sum}+D_{aut}.

This makes the next definition relevant. Regressors x_1,...,x_k are called linearly dependent if one of them, say, x_1, can be expressed as a linear combination of the others: x_1=a_2x_2+...+a_kx_k.  In case (3), all coefficients a_i are unities, so we have linear dependence. Using (3), let us replace T in (2). The resulting equation is rearranged as

(4) PowerC=(a+b)D_{win}+(a+c)D_{spr}+(a+d)D_{sum}+(a+e)D_{aut}+error.

Now we see what the problem is. When regressors are linearly dependent, the model is not uniquely specified. (1) and (4) are two different representations of the same model.
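The linear dependence is easy to exhibit numerically. In the following sketch (Python with numpy, my tool choice; the post itself works in Excel), the design matrix containing T and all four dummies has rank 4, not 5:

```python
import numpy as np

# Monthly dummies for two years (24 observations), December-first as in Table 1.
months = np.tile(np.arange(12), 2)
D_win = (months <= 2).astype(float)                   # Dec, Jan, Feb
D_spr = ((months >= 3) & (months <= 5)).astype(float)
D_sum = ((months >= 6) & (months <= 8)).astype(float)
D_aut = (months >= 9).astype(float)
T = np.ones(24)                                       # the trivial regressor

X_trap = np.column_stack([T, D_win, D_spr, D_sum, D_aut])
print(np.linalg.matrix_rank(X_trap))   # 4, not 5: columns linearly dependent

X_ok = np.column_stack([T, D_spr, D_sum, D_aut])      # drop the winter dummy
print(np.linalg.matrix_rank(X_ok))     # 4: full column rank
```

The rank deficiency of X_trap is exactly relation (3): the trivial regressor equals the sum of the four dummies.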

What is the way out?

If regressors are linearly dependent, drop them one after another until you get linearly independent ones. For example, dropping the winter dummy, we get

(5) PowerC=a+cD_{spr}+dD_{sum}+eD_{aut}+error.

Here is the estimation result for the two-year data in Figure 1:


This means that:

PowerC=128176 in winter, PowerC=128176-27380 in spring,

PowerC=128176+5450 in summer, and PowerC=128176-22225 in autumn.

It is revealing that cooling requires more power than heating. However, the summer coefficient is not significant. Here is the Excel file with the data and estimation result.

The category that has been dropped is called a base (or reference) category. Thus, the intercept in (5) measures power consumption in winter. The dummy coefficients in (5) measure deviations of power consumption in respective seasons from that in winter.
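The interpretation of the intercept and the dummy coefficients can be illustrated on synthetic data (a Python/numpy sketch; the seasonal means below are hypothetical, chosen only to mimic the winter/summer pattern, and are not the EIA figures). With an intercept and dummies only, OLS reproduces the seasonal group means:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical seasonal means plus small noise, monthly for two years.
season = np.tile(np.repeat([0, 1, 2, 3], 3), 2)   # 0=winter,...,3=autumn
true_mean = np.array([130.0, 100.0, 135.0, 105.0])
power = true_mean[season] + 0.01 * rng.standard_normal(24)

X = np.column_stack([
    np.ones(24),                   # intercept (winter is the base category)
    (season == 1).astype(float),   # spring dummy
    (season == 2).astype(float),   # summer dummy
    (season == 3).astype(float),   # autumn dummy
])
a, c, d, e = np.linalg.lstsq(X, power, rcond=None)[0]

# Intercept = winter mean; each dummy coefficient = deviation from winter.
print(round(a), round(c), round(d), round(e))   # prints 130 -30 5 -25
```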

Here is the question I ask my students

We want to see how beer consumption BeerC depends on gender and income Inc. Let M and F denote the dummies for males and females, resp. Correct the following model and interpret the resulting coefficients:


Final remark

When a researcher includes all categories plus the trivial regressor, he/she falls into what is called the dummy trap. The problem of linear dependence among regressors is usually discussed under the heading of multiple regression. But since the trivial regressor is present in simple regression too, it might be a good idea to discuss it earlier.

Linear independence of regressors is an exact condition for existence of the OLS estimator. That is, if regressors are linearly dependent, then the OLS estimator doesn't exist, in which case the question about its further properties doesn't make sense. If, on the other hand, regressors are linearly independent, then the OLS estimator exists, and further properties can be studied, such as unbiasedness, variance and efficiency.

Feb 16

OLS estimator for multiple regression - simplified derivation

Here I try to explain a couple of ideas to folks not familiar with (or afraid of?) matrix algebra.

A matrix is a rectangular table of numbers. Most of the time, operations with matrices are performed like operations with numbers, except that matrices often are denoted with capital letters: A+B=B+A. It is easier to describe the differences than the similarities.

(1) One of the differences is that for matrices we can define a new operation called transposition: the columns of the original matrix A are put into the rows of a new matrix, which is called the transpose of A and is denoted A^T. Visualize it like this: if A has more rows than columns, then for the transpose the opposite will be true:

A=\left(\begin{array}{cc}1&2\\3&4\\5&6\end{array}\right),\qquad A^T=\left(\begin{array}{ccc}1&3&5\\2&4&6\end{array}\right).
(2) We know that the number 1 has the property that 1\times a=a. For matrices, the analog is I\times A=A where I is a special matrix called identity.

(3) The property \frac{1}{a}a=1 that we have for nonzero numbers generalizes to matrices, except that instead of \frac{1}{A} we write A^{-1}. Thus, the inverse matrix has the property that A^{-1}A=I.

(4) You don't need to worry about how these operations are performed when you are given specific numerical matrices, because they can be easily done in Excel. All you have to do is watch that theoretical requirements are not violated. One of them is that, in general, matrices in a product cannot change places: AB\ne BA.
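These rules are easy to see with concrete matrices. A small sketch in Python with numpy (the post recommends Excel; numpy is my substitute here):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])

print(np.array_equal(A + B, B + A))                    # True: A + B = B + A
print(np.array_equal(A @ B, B @ A))                    # False: AB != BA in general
print(np.allclose(np.linalg.inv(A) @ A, np.eye(2)))    # True: A^{-1} A = I
print(np.allclose(np.eye(2) @ A, A))                   # True: I A = A
```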

Here is an example that continues my previous post about simplified derivation of the OLS estimator. Consider multiple regression

(5) y=X\beta+u

where y is the dependent vector, X is the matrix of regressors, \beta is the parameter vector to estimate and u is the error. Multiplying equation (5) from the left by X^T we get X^Ty=X^TX\beta+X^Tu. As in my previous post, we get rid of the term containing the error by formally putting X^Tu=0. The resulting equation X^Ty=X^TX\beta we solve for \beta by multiplying by (X^TX)^{-1} from the left:

(X^TX)^{-1}X^Ty=(X^TX)^{-1}(X^TX)\beta=(using\ (3))=I\beta=(using\ (2))=\beta.

Putting the hat on \beta, we arrive at the OLS estimator for multiple regression: \hat{\beta}=(X^TX)^{-1}X^Ty. Like in the previous post, the whole derivation takes just one paragraph!
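The one-paragraph derivation translates directly into code. A sketch in Python with numpy (the tool and the numbers are mine): instead of forming the inverse explicitly, one can solve the normal equations X^Ty=X^TX\beta directly, which gives the same \hat{\beta}.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 2
X = rng.standard_normal((n, k))
y = X @ np.array([1.5, -0.5]) + rng.standard_normal(n)

# Solve the normal equations X'y = X'X beta without an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
beta_formula = np.linalg.inv(X.T @ X) @ X.T @ y     # the textbook formula
print(np.allclose(beta_hat, beta_formula))          # True
```

Numerically, `solve` is preferable to forming the inverse, but both implement the same estimator.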

Caveat. See the rigorous derivation here. My objective is not rigor but to give you something easy to do and remember.