Feb 22

Distribution of the estimator of the error variance

Distribution of the estimator of the error variance

If you are reading the book by Dougherty: this post is about the distribution of the estimator  s^2 defined in Chapter 3.

Consider regression

(1) y=X\beta +e

where the deterministic matrix X is of size n\times k, satisfies \det  \left( X^{T}X\right) \neq 0 (regressors are not collinear) and the error e satisfies

(2) Ee=0,Var(e)=\sigma ^{2}I

\beta is estimated by \hat{\beta}=(X^{T}X)^{-1}X^{T}y. Denote P=X(X^{T}X)^{-1}X^{T}, Q=I-P. Using (1) we see that \hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e and the residual r\equiv y-X\hat{\beta}=Qe. \sigma^{2} is estimated by

(3) s^{2}=\left\Vert r\right\Vert ^{2}/\left( n-k\right) =\left\Vert  Qe\right\Vert ^{2}/\left( n-k\right) .

Q is a projector and has properties which are derived from those of P

(4) Q^{T}=Q, Q^{2}=Q.

If \lambda is an eigenvalue of Q, then multiplying Qx=\lambda x by Q and using the fact that x\neq 0 we get \lambda ^{2}=\lambda . Hence eigenvalues of Q can be only 0 or 1. The equation tr\left( Q\right) =n-k
tells us that the number of eigenvalues equal to 1 is n-k and the remaining k are zeros. Let Q=U\Lambda U^{T} be the diagonal representation of Q. Here U is an orthogonal matrix,

(5) U^{T}U=I,

and \Lambda is a diagonal matrix with eigenvalues of Q on the main diagonal. We can assume that the first n-k numbers on the diagonal of Q are ones and the others are zeros.

Theorem. Let e be normal. 1) s^{2}\left( n-k\right) /\sigma ^{2} is distributed as \chi _{n-k}^{2}. 2) The estimators \hat{\beta} and s^{2} are independent.

Proof. 1) We have by (4)

(6) \left\Vert Qe\right\Vert ^{2}=\left( Qe\right) ^{T}Qe=\left(  Q^{T}Qe\right) ^{T}e=\left( Qe\right) ^{T}e=\left( U\Lambda U^{T}e\right)  ^{T}e=\left( \Lambda U^{T}e\right) ^{T}U^{T}e.

Denote S=U^{T}e. From (2) and (5)

ES=0, Var\left( S\right) =EU^{T}ee^{T}U=\sigma ^{2}U^{T}U=\sigma ^{2}I

and S is normal as a linear transformation of a normal vector. It follows that S=\sigma z where z is a standard normal vector with independent standard normal coordinates z_{1},...,z_{n}. Hence, (6) implies

(7) \left\Vert Qe\right\Vert ^{2}=\sigma ^{2}\left( \Lambda z\right)  ^{T}z=\sigma ^{2}\left( z_{1}^{2}+...+z_{n-k}^{2}\right) =\sigma ^{2}\chi  _{n-k}^{2}.

(3) and (7) prove the first statement.

2) First we note that the vectors Pe,Qe are independent. Since they are normal, their independence follows from

cov(Pe,Qe)=EPee^{T}Q^{T}=\sigma ^{2}PQ=0.

It's easy to see that X^{T}P=X^{T}. This allows us to show that \hat{\beta} is a function of Pe:

\hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e=\beta +(X^{T}X)^{-1}X^{T}Pe.

Independence of Pe,Qe leads to independence of their functions \hat{\beta} and s^{2}.


Nov 18

Application: estimating sigma squared

Application: estimating sigma squared

Consider multiple regression

(1) y=X\beta +e


(a) the regressors are assumed deterministic, (b) the number of regressors k is smaller than the number of observations n, (c) the regressors are linearly independent, \det (X^TX)\neq 0, and (d) the errors are homoscedastic and uncorrelated,

(2) Var(e)=\sigma^2I.

Usually students remember that \beta should be estimated and don't pay attention to estimation of \sigma^2. Partly this is because \sigma^2 does not appear in the regression and partly because the result on estimation of error variance is more complex than the result on the OLS estimator of \beta .

Definition 1. Let \hat{\beta}=(X^TX)^{-1}X^Ty be the OLS estimator of \beta. \hat{y}=X\hat{\beta} is called the fitted value and r=y-\hat{y} is called the residual.

Exercise 1. Using the projectors P=X(X^TX)^{-1}X^T and Q=I-P show that \hat{y}=Py and r=Qe.

Proof. The first equation is obvious. From the model we have r=X\beta+e-P(X\beta +e). Since PX\beta=X\beta, we have further r=e-Pe=Qe.

Definition 2. The OLS estimator of \sigma^2 is defined by s^2=\Vert r\Vert^2/(n-k).

Exercise 2. Prove that s^2 is unbiased: Es^2=\sigma^2.

Proof. Using projector properties we have

\Vert r\Vert^2=(Qe)^TQe=e^TQ^TQe=e^TQe.

Expectations of type Ee^Te and Eee^T would be easy to find from (2). However, we need to find Ee^TQe where there is an obstructing Q. See how this difficulty is overcome in the next calculation.

E\Vert r\Vert^2=Ee^TQe (e^TQe is a scalar, so its trace is equal to itself)

=Etr(e^TQe) (applying trace-commuting)

=Etr(Qee^T) (the regressors and hence Q are deterministic, so we can use linearity of E)

=tr(QEee^T) (applying (2)) =\sigma^2tr(Q).

tr(P)=k because this is the dimension of the image of P. Therefore tr(Q)=n-k. Thus, E\Vert r\Vert^2=\sigma^2(n-k) and Es^2=\sigma^2.

Nov 18

Application: Ordinary Least Squares estimator

Application: Ordinary Least Squares estimator

Generalized Pythagoras theorem

Exercise 1. Let P be a projector and denote Q=I-P. Then \Vert x\Vert^2=\Vert Px\Vert^2+\Vert Qx\Vert^2.

Proof. By the scalar product properties

\Vert x\Vert^2=\Vert Px+Qx\Vert^2=\Vert Px\Vert^2+2(Px)\cdot (Qx)+\Vert Qx\Vert^2.

P is symmetric and idempotent, so

(Px)\cdot (Qx)=(Px)\cdot[(I-P)x]=x\cdot[(P-P^2)x]=0.

This proves the statement.

Ordinary Least Squares (OLS) estimator derivation

Problem statement. A vector y\in R^n (the dependent vector) and vectors x^{(1)},...,x^{(k)}\in R^n (independent vectors or regressors) are given. The OLS estimator is defined as that vector \beta \in R^k which minimizes the total sum of squares TSS=\sum_{i=1}^n(y_i-x^{(1)}\beta_1-...-x^{(k)}\beta_k)^2.

Denoting X=(x^{(1)},...,x^{(k)}), we see that TSS=\Vert y-X\beta\Vert^2 and that finding the OLS estimator means approximating y with vectors from the image \text{Img}X. x^{(1)},...,x^{(k)} should be linearly independent, otherwise the solution will not be unique.

Assumption. x^{(1)},...,x^{(k)} are linearly independent. This, in particular, implies that k\leq n.

Exercise 2. Show that the OLS estimator is

(2) \hat{\beta}=(X^TX)^{-1}X^Ty.

Proof. By Exercise 1 we can use P=X(X^TX)^{-1}X^T. Since X\beta belongs to the image of P, P doesn't change it: X\beta=PX\beta. Denoting also Q=I-P we have

\Vert y-X\beta\Vert^2=\Vert y-Py+Py-X\beta\Vert^2

=\Vert Qy+P(y-X\beta)\Vert^2 (by Exercise 1)

=\Vert Qy\Vert^2+\Vert P(y-X\beta)\Vert^2.

This shows that \Vert Qy\Vert^2 is a lower bound for \Vert y-X\beta\Vert^2. This lower bound is achieved when the second term is made zero. From

P(y-X\beta)=Py-X\beta =X(X^TX)^{-1}X^Ty-X\beta=X[(X^TX)^{-1}X^Ty-\beta]

we see that the second term is zero if \beta satisfies (2).

Usually the above derivation is applied to the dependent vector of the form y=X\beta+e where e is a random vector with mean zero. But it holds without this assumption. See also simplified derivation of the OLS estimator.

Jan 17

Regressions with stochastic regressors 2

Regressions with stochastic regressors 2: two approaches

We consider the slope estimator for the simple regression


assuming that x_i is stochastic.

First approach: the sample size is fixed. The unbiasedness and efficiency conditions are replaced by their analogs conditioned on x. The outcome is that the slope estimator is unbiased and its variance is the average of the variance that we have in case of a deterministic regressor. See the details.

Second approach: the sample size goes to infinity. The main tools used are the properties of probability limits and laws of large numbers. The outcome is that, in the limit, the sample characteristics are replaced by their population cousins and the slope estimator is consistent. This is what we focus on here.

A brush-up on convergence in probability

Review the intuition and formal definition. This is the summary:

Fact 1. Convergence in probability (which applies to sequences of random variables) is a generalization of the notion of convergence of number sequences. In particular, if \{a_n\} is a numerical sequence that converges to a number a\lim_{n\rightarrow\infty}a_n=a, then, treating a_n as a random variable, we have convergence in probability {\text{plim}}_{n\rightarrow\infty}a_n=a.

Fact 2. For those who are familiar with the theory of limits of numerical sequences, from the previous fact it should be clear that convergence in probability preserves arithmetic operations. That is, for any sequences of random variables \{X_n\},\{Y_n\} such that limits {\text{plim}}X_n and {\text{plim}}Y_n exist, we have

\text{plim}(X_n\pm Y_n)=\text{plim}X_n\pm\text{plim}Y_n, \text{plim}(X_n\times Y_n)=\text{plim}X_n\times\text{plim}Y_n,

and if \text{plim}Y_n\ne 0 then

\text{plim}(X_n/ Y_n)=\text{plim}X_n/\text{plim}Y_n.

This makes convergence in probability very handy. Convergence in distribution doesn't have such properties.

A brush-up on laws of large numbers

See the site map for several posts about this. Here we apply the Chebyshev inequality to prove the law of large numbers for sample means. A generalization is given in the Theorem in the end of that post. Here is a further intuitive generalization:

Normally, unbiased sample characteristics converge in probability to their population counterparts.

Example 1. We know that the sample variance s^2=\frac{1}{n-1}\sum(X_i-\bar{X})^2 unbiasedly estimates the population variance \sigma^2Es^2=\sigma^2. The intuitive generalization says that then

(1) \text{plim}s^2=\sigma^2.

Here I argue that, for the purposes of obtaining some identities from the general properties of means, instead of the sample variance it's better to use the variance defined by Var_u(X)=\frac{1}{n}\sum(X_i-\bar{X})^2 (with division by n instead of n-1). Using Facts 1 and 2 we get from (1) that

(2) \text{plim}Var_u(X)=\text{plim}\frac{n-1}{n}\frac{1}{n-1}\sum(X_i-\bar{X})^2


(sample variance converges in probability to population variance). Here we use \lim(1-\frac{1}{n})=1.

Example 2. Similarly, sample covariance converges in probability to population covariance:

(3) \text{plim}Cov_u(X,Y)=Cov(X,Y)

where by definition Cov_u(X,Y)=\frac{1}{n}\sum(X_i-\bar{X})(Y_i-\bar{Y}).

Proving consistency of the slope estimator

Here (see equation (5)) I derived the representation of the OLS estimator of the slope


Using preservation of arithmetic operations for convergence in probability, we get

(4) \text{plim}\hat{b}=\text{plim}\left[b+\frac{Cov_u(X,e)}{Var_u(X)}\right]=\text{plim}b+\text{plim}\frac{Cov_u(X,e)}{Var_u(X)}


In the last line we used (2) and (3). From (4) we see what conditions should be imposed for the slope estimator to converge to a spike at the true slope:

Var(X)\neq 0 (existence condition)


Cov(X,e)=0 (consistency condition).

Under these conditions, we have \text{plim}\hat{b}=b (this is called consistency).

Conclusion. In a way, the second approach is technically simpler than the first.

Jan 17

Regressions with stochastic regressors 1

Regressions with stochastic regressors 1: applying conditioning

The convenience condition states that the regressor in simple regression is deterministic. Here we look at how this assumption can be avoided using conditional expectation and variance. General idea: you check which parts of the proofs don't go through with stochastic regressors and modify the assumptions accordingly. It happens that only assumptions concerning the error term should be replaced by their conditional counterparts.

Unbiasedness in case of stochastic regressors

We consider the slope estimator for the simple regression

(1) y_i=a+bx_i+e_i

assuming that x_i is stochastic.

First grab the critical representation (6) derived here:

(1) \hat{b}=b+\frac{1}{n}\sum a_i(x)e_i, where a_i(x)=(x_i-\bar{x})/Var_u(x).

The usual linearity of means E(aX + bY) = aEX + bEY applied to prove unbiasedness doesn't work because now the coefficients are stochastic (in other words, they are not constant). But we have generalized linearity which for the purposes of this proof can be written as

(2) E(a(x)S+b(x)T|x)=a(x)E(S|x)+b(x)E(T|x).

Let us replace the unbiasedness condition by its conditional version:

A3'. Unbiasedness conditionE(e_i|x)=0.

Then (1) and (2) give

(3) E(\hat{b}|x)=b+\frac{1}{n}\sum a_i(x)E(e_i|x)=b,

which can be called conditional unbiasedness. Next applying the law of iterated expectations E[E(S|x)]=ES we obtain unconditional unbiasedness:


Variance in case of stochastic regressors

As one can guess, we have to replace efficiency conditions by their conditional versions:

A4'. Conditional uncorrelatedness of errors. Assume that E(e_ie_j|x)=0 for all i\ne j.

A5'. Conditional homoscedasticity. All errors have the same conditional variances: E(e_i^2|x)=\sigma^2 for all i (\sigma^2 is a constant).

Now we can derive the conditional variance expression, using properties from this post:

Var(\hat{b}|x)=Var(b+\frac{1}{n}\sum_i a_i(x)e_i|x) (dropping a constant doesn't affect variance)

=Var(\frac{1}{n}\sum_i a_i(x)e_i|x) (for conditionally uncorrelated variables, conditional variance is additive)

=\sum_i Var(\frac{1}{n}a_i(x)e_i|x) (conditional variance is homogeneous of degree 2)

=\frac{1}{n^2}\sum_i a_i^2(x)Var(e_i|x) (applying conditional homoscedasticity)

=\frac{1}{n^2}\sum_i a_i^2(x)\sigma^2 (plugging a_i(x))

=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x) (using the notation of sample variance)

(4) =\frac{1}{n}Var_u(x)\sigma^2/Var^2_u(x)=\sigma^2/(nVar_u(x)).

Finally, using the law of total variance Var(S)=Var(E(S|x))+E[Var(S|x)] and equations (3) and (4) we obtain

(5) Var(\hat{b})=Var(b)+E[\sigma^2/(nVar_u(x))]=\frac{\sigma^2}{n}E[\frac{1}{Var_u(x)}].


Replacing the three assumptions about the error by their conditional counterparts allows us to obtain almost perfect analogs of the usual properties of OLS estimators: the usual (unconditional) unbiasedness plus the estimator variance, in which the part containing the regressor should be averaged, to account for its randomness. If you think that solving the problem of stochastic regressors requires nothing more but application of a couple of mathematical tricks, I agree with you.

Jan 17

OLS estimator variance

Assumptions about simple regression

We consider the simple regression

(1) y_i=a+bx_i+e_i

Here we derived the OLS estimators of the intercept and slope:

(2) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(3) \hat{a}=\bar{y}-\hat{b}\bar{x}.

A1. Existence condition. Since division by zero is not allowed, for (2) to exist we require Var_u(x)\ne 0. If this condition is not satisfied, then there is no variance in x and all observed points are on the vertical line.

A2. Convenience condition. The regressor x is deterministic. This condition is imposed to be able to apply the properties of expectation, see equation (7) in  this post. The time trend and dummy variables are examples of deterministic regressors. However, most real-life regressors are stochastic. Modifying the theory in order to cover stochastic regressors is the subject of two posts: finite-sample theory and large-sample theory.

A3. Unbiasedness conditionEe_i=0. This is the main assumption that makes sure that OLS estimators are unbiased, see equation (7) in  this post.

Unbiasedness is not enough

Unbiasedness characterizes the quality of an estimator, see the intuitive explanation. Unfortunately, unbiasedness is not enough to choose the best estimator because of nonuniqueness: usually, if there is one unbiased estimator of a parameter, then there are infinitely many unbiased estimators of the same parameter. For example, we know that the sample mean \bar{X} unbiasedly estimates the population mean E\bar{X}=EX. Since EX_1=EX (X_1 is the first observation), we can easily construct an infinite family of unbiased estimators Y=(\bar{X}+aX_1)/(1+a), assuming a\ne -1. Indeed, using linearity of expectation EY=(E\bar{X}+aEX_1)/(1+a)=EX.

Variance is another measure of an estimator quality: to have a lower spread of estimator values, among competing estimators we choose the one which has the lowest variance. Knowing the estimator variance allows us to find the z-score and use statistical tables.

Slope estimator variance

It is not difficult to find the variance of the slope estimator using representation (6) derived here:

\hat{b}=b+\frac{1}{n}\sum a_ie_i

where a_i=(x_i-\bar{x})/Var_u(x).

Don't try to apply directly the definition of variance at this point, because there will be a square of a sum, which leads to a double sum upon squaring. We need two new assumptions.

A4. Uncorrelatedness of errors. Assume that Cov(e_i,e_j)=0 for all i\ne j (errors from different equations (1) are uncorrelated). Note that because of the unbiasedness condition, this assumption is equivalent to Ee_ie_j=0 for all i\ne j. This assumption is likely to be satisfied if we observe consumption patterns of unrelated individuals.

A5. Homoscedasticity. All errors have the same variancesVar(e_i)=\sigma^2 for all i. Again, because of the unbiasedness condition, this assumption is equivalent to Ee_i^2=\sigma^2 for all i.

Now we can derive the variance expression, using properties from this post:

Var(\hat{b})=Var(b+\frac{1}{n}\sum_i a_ie_i) (dropping a constant doesn't affect variance)

=Var(\frac{1}{n}\sum_i a_ie_i) (for uncorrelated variables, variance is additive)

=\sum_i Var(\frac{1}{n}a_ie_i) (variance is homogeneous of degree 2)

=\frac{1}{n^2}\sum_i a_i^2Var(e_i) (applying homoscedasticity)

=\frac{1}{n^2}\sum_i a_i^2\sigma^2 (plugging a_i)

=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x) (using the notation of sample variance)


Note that canceling out two variances in the last line is obvious. It is not so obvious for some if instead of the short notation for variances you use summation signs. The case of the intercept variance is left as an exercise.


The above assumptions A1-A5 are called classical. It is necessary to remember their role in derivations because a considerable part of Econometrics is devoted to deviations from classical assumptions. Once you have a certain assumption violated, you should expect the corresponding estimator property invalidated. For example, if Ee_i\ne 0, you should expect the estimators to be biased. If any of A4-A5 is not true, the formula we have derived


will not hold. Besides, the Gauss-Markov theorem that the OLS estimators are efficient will not hold (this will be discussed later). The pair A4-A5 can be called an efficiency condition.

Dec 16

Multiple regression through the prism of dummy variables

Agresti and Franklin on p.658 say: The indicator variable for a particular category is binary. It equals 1 if the observation falls into that category and it equals 0 otherwise. I say: For most students, this is not clear enough.

Problem statement

Figure 1. Residential power consumption in 2014 and 2015. Source: http://www.eia.gov/electricity/data.cfm

Residential power consumption in the US has a seasonal pattern. Heating in winter and cooling in summer cause the differences. We want to capture the dependence of residential power consumption PowerC on the season.

 Visual approach to dummy variables

Seasons of the year are categorical variables. We have to replace them with quantitative variables, to be able to use in any mathematical procedure that involves arithmetic operations. To this end, we define a dummy variable (indicator) D_{win} for winter such that it equals 1 in winter and 0 in any other period of the year. The dummies D_{spr},\ D_{sum},\ D_{aut} for spring, summer and autumn are defined similarly. We provide two visualizations assuming monthly observations.

Table 1. Tabular visualization of dummies
Month D_{win} D_{spr} D_{sum} D_{aut} D_{win}+D_{spr}+ D_{sum}+D_{aut}
December 1 0 0 0 1
January 1 0 0 0 1
February 1 0 0 0 1
March 0 1 0 0 1
April 0 1 0 0 1
May 0 1 0 0 1
June 0 0 1 0 1
July 0 0 1 0 1
August 0 0 1 0 1
September 0 0 0 1 1
October 0 0 0 1 1
November 0 0 0 1 1

Figure 2. Graphical visualization of D_spr

The first idea may be wrong

The first thing that comes to mind is to regress PowerC on dummies as in

(1) PowerC=a+bD_{win}+cD_{spr}+dD_{sum}+eD_{aut}+error.

Not so fast. To see the problem, let us rewrite (1) as

(2) PowerC=a\times 1+bD_{win}+cD_{spr}+dD_{sum}+eD_{aut}+error.

This shows that, in addition to the four dummies, there is a fifth variable, which equals 1 across all observations. Let us denote it T (for Trivial). Table 1 shows that

(3) T=D_{win}+D_{spr}+ D_{sum}+D_{aut}.

This makes the next definition relevant. Regressors x_1,...,x_k are called linearly dependent if one of them, say, x_1, can be expressed as a linear combination of the others: x_1=a_2x_2+...+a_kx_k.  In case (3), all coefficients a_i are unities, so we have linear dependence. Using (3), let us replace T in (2). The resulting equation is rearranged as

(4) PowerC=(a+b)D_{win}+(a+c)D_{spr}+(a+d)D_{sum}+(a+e)D_{aut}+error.

Now we see what the problem is. When regressors are linearly dependent, the model is not uniquely specified. (1) and (4) are two different representations of the same model.

What is the way out?

If regressors are linearly dependent, drop them one after another until you get linearly independent ones. For example, dropping the winter dummy, we get

(5) PowerC=a+cD_{spr}+dD_{sum}+eD_{aut}+error.

Here is the estimation result for the two-year data in Figure 1:


This means that:

PowerC=128176 in winter, PowerC=128176-27380 in spring,

PowerC=128176+5450 in summer, and PowerC=128176-22225 in autumn.

It is revealing that cooling requires more power than heating. However, the summer coefficient is not significant. Here is the Excel file with the data and estimation result.

The category that has been dropped is called a base (or reference) category. Thus, the intercept in (5) measures power consumption in winter. The dummy coefficients in (5) measure deviations of power consumption in respective seasons from that in winter.

Here is the question I ask my students

We want to see how beer consumption BeerC depends on gender and income Inc. Let M and F denote the dummies for males and females, resp. Correct the following model and interpret the resulting coefficients:


Final remark

When a researcher includes all categories plus the trivial regressor, he/she falls into what is called a dummy trap. The problem of linear dependence among regressors is usually discussed under the heading of multiple regression. But since the trivial regressor is present in simple regression too, it might be a good idea to discuss it earlier.

Linear dependence/independence of regressors is an exact condition for existence of the OLS estimator. That is, if regressors are linearly dependent, then the OLS estimator doesn't exist, in which case the question about its further properties doesn't make sense. If, on the other hand, regressors are linearly independent, then the OLS estimator exists, and further properties can be studied, such as unbiasedness, variance and efficiency.

Dec 16

Testing for structural changes: a topic suitable for AP Stats

Testing for structural changes: a topic suitable for AP Stats

Problem statement

Economic data are volatile but sometimes changes in them look more permanent than transitory.

Figure 1. US GDP from agriculture. Source: http://www.tradingeconomics.com/united-states/gdp-from-agriculture

Figure 1 shows fluctuations of US GDP from agriculture. There have been ups and downs throughout the period of 2005-2016 but overall the trend has been up until 2013 and down since then. We want an objective, statistical confirmation of the fact that in 2013 the change was structural, substantial rather than a random fluctuation.

Chow test steps

  1. Divide the observed sample in two parts, A and B, at the point where you suspect the structural change (or break) has occurred. Run three regressions: one for A, another for B and the third one for the whole sample (pooled regression). Get residual sums of squares from each of them, denoted RSS_ARSS_B and RSS_p, respectively.
  2. Let n_A and n_B be the numbers of observations in the two subsamples and suppose there are k coefficients in your regression (for Figure 1, we would regress GDP on a time variable, so the number of coefficients would be 2, including the intercept). The Chow test statistic is defined by


This statistic is distributed as F with k,n_A+n_B-2k degrees of freedom. The null hypothesis is that the coefficients are the same for the two subsamples and the alternative is that they are not. If the statistic is larger than the critical value at your chosen level of significance, splitting the sample in two is beneficial (better describes the data). If the statistic is not larger than the critical value, the pooled regression better describes the data.

Figure 2. Splitting is better (there is a structural change)

In Figure 2, the gray lines are the fitted lines for the two subsamples. They fit the data much better than the orange line (the fitted line for the whole sample).

Figure 3. Pooling is better

In Figure 3, pooling is better because the intercept and slope are about the same and pooling amounts to increasing the sample size.

Download the Excel file used for simulation. See the video.

Nov 16

The pearls of AP Statistics 35

The disturbance term: To hide or not to hide? In an introductory Stats course, some part of the theory should be hidden. Where to draw the line is an interesting question. Here I discuss the ideas that look definitely bad to me.

How disturbing is the disturbance term?

In the main text, Agresti and Franklin never mention the disturbance term u_i in the regression model

(1) y_i=a+bx_i+u_i

(it is hidden in Exercise 12.105). Instead, they write the equation for the mean \mu_y=a+bx that follows from (1) under the standard assumption Eu_i=0. This would be fine if the exposition stopped right there. However, one has to explain the random source of variability in y_i. On p. 583 the authors say: "The probability distribution of y values at a fixed value of x is a conditional distribution. At each value of x, there is a conditional distribution of y values. A regression model also describes these distributions. An additional parameter σ describes the standard deviation of each conditional distribution."

Further, Figure 12.4 illustrates distributions of errors at different points and asks: "What do the bell-shaped curves around the line at x = 12 and at x = 16 represent?"


Figure 12.4. Illustration of error distributions

Besides, explanations of heteroscedasticity and of the residual sum of squares are impossible without explicitly referring to the disturbance term.

Attributing a regression property to the correlation is not good

On p.589 I encountered a statement that puzzled me: "An important property of the correlation is that at any particular x value, the predicted value of y is relatively closer to its mean than x is to its mean. If an x value is a certain number of standard deviations from its mean, then the predicted y is r times that many standard deviations from its mean."

Firstly, this is a verbal interpretation of some formula, so why not give the formula itself? How good must be a student to guess what is behind the verbal formulation?

Secondly, as I stressed in this post, the correlation coefficient does not entail any prediction about the magnitude of a change in one variable caused by a change in another. The above statement about the predicted value of y must be a property of regression. Attributing a regression property to the correlation is not in the best interests of those who want to study Stats at a more advanced level.

Thirdly, I felt challenged to see something new in the area I thought I knew everything about. So here is the derivation. By definition, the fitted value is

(2) \hat{y_i}=\hat{a}+\hat{b}x_i

where the hats stand for estimators. The fitted line passes through the point (\bar{x},\bar{y}):

(3) \bar{y}=\hat{a}+\hat{b}\bar{x}

(this will be proved elsewhere). Subtracting (3) from (2) we get

(4) \hat{y_i}-\bar{y}=\hat{b}(x_i-\bar{x})

(using equation (4) from this post)


It is helpful to rewrite (4) in a more symmetric form:

(5) \frac{\hat{y_i}-\bar{y}}{\sigma(y)}=\rho\frac{x_i-\bar{x}}{\sigma(x)}.

This is the equation we need. Suppose an x value is a certain number of standard deviations from its mean: x_i-\bar{x}=k\sigma(x). Plug this into (5) to get \hat{y_i}-\bar{y}=\rho k\sigma(y), that is, the predicted y is \rho times that many standard deviations from its mean.

Oct 16

The pearls of AP Statistics 34

Coefficient of determination: an inductive introduction to R squared

I know a person, who did not understand this topic, even though he had a PhD in Math. That was me more than twenty years ago, and the reason was that the topic was given formally, without explaining the leading idea.

Leading idea

Step 1. We want to describe the relationship between observed y's and x's using the simple regression


Let us start with the simple case when there is no variability in y's, that is the slope and the errors are zero.  Since y_i=a for all i, we have y_i=\bar{y} and, of course,

(1) \sum(y_i-\bar{y})^2=0.

In the general case, we start with the decomposition

(2) y_i=\hat{y}_i+e_i

where \hat{y}_i is the fitted value and e_i is the residual, see this post. We still want to see how y_i is far from \bar{y}. With this purpose, from both sides of equation (2) we subtract \bar{y}, obtaining y_i-\bar{y}=(\hat{y}_i-\bar{y})+e_i. Squaring this equation, for the sum in (1) we get

(3) \sum(y_i-\bar{y})^2=\sum(\hat{y}_i-\bar{y})^2+2\sum(\hat{y}_i-\bar{y})e_i+\sum e^2_i.

Whoever was the first to do this, discovered that the cross product is zero and (3) simplifies to

(4) \sum(y_i-\bar{y})^2=\sum(\hat{y}_i-\bar{y})^2+\sum e^2_i.

The rest is a matter of definitions

Total Sum of Squares TSS=\sum(y_i-\bar{y})^2. (I prefer to call this a total variation around \bar{y})

Explained Sum of Squares TSS=\sum(\hat{y}_i-\bar{y})^2 (to me this is explained variation around \bar{y})

Residual Sum of Squares TSS=\sum e^2_i (unexplained variation around \bar{y}, caused by the error term)

Thus from (4) we have


Step 2. It is desirable to have RSS close to zero and ESS close to TSS. Therefore we can use the ratio ESS/TSS as a measure of how good the regression describes the relationship between y's and x's. From (5) it follows that this ratio takes values between zero and 1. Hence, the coefficient of determination


can be interpreted as the percentage of total variation of y's around \bar{y} explained by regression. From (5) an equivalent definition is


Back to the pearls of AP Statistics

How much of the above can be explained without algebra? Stats without algebra is a crippled creature. I am afraid, any concepts requiring substantial algebra should be dropped from AP Stats curriculum. Compare this post with the explanation on p. 592 of Agresti and Franklin.