Distribution of the estimator of the error variance
If you are reading the book by Dougherty: this post is about the distribution of the error variance estimator $s^2$ defined in Chapter 3.
Consider the regression

(1) $y=X\beta+e,$

where the deterministic matrix $X$ is of size $n\times k$, satisfies $\det(X^TX)\neq 0$ (the regressors are not collinear) and the error $e$ satisfies

(2) $Ee=0,\quad Var(e)=\sigma^2I.$

$\beta$ is estimated by $\hat{\beta}=(X^TX)^{-1}X^Ty.$ Denote $P=X(X^TX)^{-1}X^T,$ $Q=I-P.$ Using (1) we see that $\hat{\beta}=\beta+(X^TX)^{-1}X^Te$ and that the residual $\hat{e}=y-X\hat{\beta}$ equals $Qe.$ $\sigma^2$ is estimated by

(3) $s^2=\dfrac{\|\hat{e}\|^2}{n-k}=\dfrac{\|Qe\|^2}{n-k}.$

$Q$ is a projector and has properties

(4) $Q^2=Q,\quad Q^T=Q,$

which are derived from those of $P$ ($P^2=P,$ $P^T=P$). If $\lambda$ is an eigenvalue of $Q,$ $Qx=\lambda x$ with $x\neq 0,$ then multiplying by $Q$ and using the fact that $Q^2=Q$ we get $\lambda x=\lambda^2x.$ Hence the eigenvalues of $Q$ can be only $0$ or $1.$ Since the trace equals the sum of the eigenvalues, the equation

$\text{tr}(Q)=\text{tr}(I)-\text{tr}(P)=n-\text{tr}\big((X^TX)^{-1}X^TX\big)=n-k$

tells us that the number of eigenvalues equal to 1 is $n-k$ and the remaining $k$ are zeros. Let $Q=U\Lambda U^T$ be the diagonal representation of $Q.$ Here $U$ is an orthogonal matrix,

(5) $UU^T=I,$

and $\Lambda$ is a diagonal matrix with the eigenvalues of $Q$ on the main diagonal. We can assume that the first $n-k$ numbers on the diagonal of $\Lambda$ are ones and the others are zeros.
Theorem. Let $e$ be normal: $e\sim N(0,\sigma^2I)$. 1) $\dfrac{(n-k)s^2}{\sigma^2}=\dfrac{\|\hat{e}\|^2}{\sigma^2}$ is distributed as $\chi^2_{n-k}.$ 2) The estimators $\hat{\beta}$ and $s^2$ are independent.
Proof. 1) We have by (4)

(6) $\|\hat{e}\|^2=\|Qe\|^2=e^TQ^TQe=e^TQe=e^TU\Lambda U^Te.$

Denote $z=U^Te.$ From (2) and (5)

$Ez=0,\quad Var(z)=U^TVar(e)U=\sigma^2U^TU=\sigma^2I,$

and $z$ is normal as a linear transformation of a normal vector. It follows that $z=\sigma u,$ where $u$ is a standard normal vector with independent standard normal coordinates $u_1,\dots,u_n.$ Hence, (6) implies

(7) $\|\hat{e}\|^2=z^T\Lambda z=\sigma^2u^T\Lambda u=\sigma^2\sum_{i=1}^{n-k}u_i^2=\sigma^2\chi^2_{n-k}.$

(3) and (7) prove the first statement.
2) First we note that the vectors $Pe,\ Qe$ are independent. Since they are normal, their independence follows from uncorrelatedness:

$Cov(Pe,Qe)=PVar(e)Q^T=\sigma^2PQ=\sigma^2P(I-P)=0.$

It's easy to see that $X^TP=X^T.$ This allows us to show that $\hat{\beta}$ is a function of $Pe$:

$\hat{\beta}=\beta+(X^TX)^{-1}X^Te=\beta+(X^TX)^{-1}X^TPe.$

Independence of $Pe,\ Qe$ leads to independence of their functions $\hat{\beta}$ and $s^2=\|Qe\|^2/(n-k).$
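For readers who like to see things numerically, here is a minimal simulation sketch (my own illustration, with arbitrary parameter values) that checks both statements of the theorem: the simulated mean and variance of $(n-k)s^2/\sigma^2$ match those of $\chi^2_{n-k}$, and the correlation between a coordinate of $\hat{\beta}$ and $s^2$ is near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 20, 3, 2.0
X = rng.normal(size=(n, k))                 # deterministic regressors, fixed across replications
beta = np.array([1.0, -0.5, 0.3])

stat_vals, b1_vals, s2_vals = [], [], []
for _ in range(20_000):
    e = rng.normal(scale=sigma, size=n)     # normal errors, Var(e) = sigma^2 I
    y = X @ beta + e
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimator
    resid = y - X @ b_hat
    s2 = resid @ resid / (n - k)            # estimator of the error variance
    stat_vals.append((n - k) * s2 / sigma**2)
    b1_vals.append(b_hat[0])
    s2_vals.append(s2)

# (n-k)s^2/sigma^2 should be chi-square with n-k degrees of freedom:
print(np.mean(stat_vals), n - k)            # mean of chi^2_(n-k) is n-k
print(np.var(stat_vals), 2 * (n - k))       # variance of chi^2_(n-k) is 2(n-k)
# independence of beta_hat and s^2 implies zero correlation:
print(np.corrcoef(b1_vals, s2_vals)[0, 1])  # close to 0
```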
(a) the regressors are assumed deterministic, (b) the number of regressors $k$ is smaller than the number of observations $n$, (c) the regressors are linearly independent, and (d) the errors are homoscedastic and uncorrelated:

(2) $Ee=0,\quad Var(e)=\sigma^2I.$

Usually students remember that $\beta$ should be estimated and don't pay attention to estimation of $\sigma^2$. Partly this is because $\sigma^2$ does not appear in the regression equation and partly because the result on estimation of the error variance is more complex than the result on the OLS estimator of $\beta$.
Definition 1. Let $\hat{\beta}=(X^TX)^{-1}X^Ty$ be the OLS estimator of $\beta$. $\hat{y}=X\hat{\beta}$ is called the fitted value and $\hat{e}=y-\hat{y}$ is called the residual.

Exercise 1. Using the projectors $P=X(X^TX)^{-1}X^T$ and $Q=I-P$, show that $\hat{y}=Py$ and $\hat{e}=Qy=Qe.$

Proof. The first equation is obvious: $\hat{y}=X(X^TX)^{-1}X^Ty=Py.$ From the model we have $\hat{e}=y-Py=Qy=Q(X\beta+e).$ Since $QX=X-PX=X-X=0,$ we have further $\hat{e}=QX\beta+Qe=Qe.$

Expectations of the type $Ee_ie_j$ and $Ee_i^2$ would be easy to find from (2). However, we need to find $E\|\hat{e}\|^2=E\|Qe\|^2,$ where there is an obstructing $Q.$ See how this difficulty is overcome in the next calculation.
Problem statement. A vector $y\in R^n$ (the dependent vector) and vectors $x^{(1)},\dots,x^{(k)}\in R^n$ (independent vectors or regressors) are given. The OLS estimator is defined as that vector $\beta\in R^k$ which minimizes the total sum of squares

$S(\beta)=\|y-X\beta\|^2=\sum_{i=1}^n\big(y_i-(X\beta)_i\big)^2.$

Denoting $X=(x^{(1)},\dots,x^{(k)})$ (the matrix with the regressors as columns), we see that $X\beta=\beta_1x^{(1)}+\dots+\beta_kx^{(k)}$ and that finding the OLS estimator means approximating $y$ with vectors from the image $\text{Img}(X)=\{X\beta:\beta\in R^k\}.$ The regressors should be linearly independent, otherwise the solution will not be unique.

Assumption. The regressors $x^{(1)},\dots,x^{(k)}$ are linearly independent. This, in particular, implies that $\det(X^TX)\neq 0.$

Exercise 2. Show that the OLS estimator is

(2) $\hat{\beta}=(X^TX)^{-1}X^Ty.$
Proof. By Exercise 1 we can use the projectors $P=X(X^TX)^{-1}X^T$ and $Q=I-P$ and their properties ($P+Q=I,$ $P^T=P,$ $Q^T=Q,$ $PQ=0$). Since $X\beta$ belongs to the image of $X,$ $P$ doesn't change it: $PX\beta=X\beta.$ Denoting also $f(\beta)=\|y-X\beta\|^2,$ we have

$f(\beta)=\|Py+Qy-X\beta\|^2=\|Qy+P(y-X\beta)\|^2=\|Qy\|^2+\|P(y-X\beta)\|^2$ (by Exercise 1, the cross term $2(Qy)^TP(y-X\beta)=2y^TQP(y-X\beta)$ vanishes)

$\geq\|Qy\|^2.$

This shows that $\|Qy\|^2$ is a lower bound for $f(\beta).$ This lower bound is achieved when the second term is made zero. From

$P(y-X\beta)=Py-PX\beta=X(X^TX)^{-1}X^Ty-X\beta$

we see that the second term is zero if $\beta$ satisfies (2).
Usually the above derivation is applied to the dependent vector of the form $y=X\beta+e,$ where $e$ is a random vector with mean zero. But it holds without this assumption. See also the simplified derivation of the OLS estimator.
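As a quick numerical illustration (not part of the derivation; the data are simulated and no random model is assumed), formula (2) can be checked against a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = rng.normal(size=(n, k))                    # linearly independent regressors (with probability 1)
y = rng.normal(size=n)                         # any dependent vector works

# OLS by formula (2): beta_hat = (X'X)^{-1} X'y
beta_formula = np.linalg.solve(X.T @ X, X.T @ y)

# OLS by a generic least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_formula, beta_lstsq))   # True: both minimize ||y - X*beta||^2
```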
Regressions with stochastic regressors 2: two approaches
We consider the slope estimator for the simple regression

$y_i=a+bx_i+e_i,$

assuming that $x_i$ is stochastic.
First approach: the sample size is fixed. The unbiasedness and efficiency conditions are replaced by their analogs conditioned on the regressor $x$. The outcome is that the slope estimator is unbiased and its variance is the average, over the distribution of the regressor, of the variance that we have in the case of a deterministic regressor. See the details.
Second approach: the sample size goes to infinity. The main tools used are the properties of probability limits and laws of large numbers. The outcome is that, in the limit, the sample characteristics are replaced by their population cousins and the slope estimator is consistent. This is what we focus on here.
Fact 1. Convergence in probability (which applies to sequences of random variables) is a generalization of the notion of convergence of number sequences. In particular, if $\{a_n\}$ is a numerical sequence that converges to a number $a$, $\lim_{n\to\infty}a_n=a$, then, treating $a_n$ as a (degenerate) random variable, we have convergence in probability $\text{plim}\,a_n=a$.

Fact 2. For those who are familiar with the theory of limits of numerical sequences, from the previous fact it should be clear that convergence in probability preserves arithmetic operations. That is, for any sequences of random variables $\{X_n\},\{Y_n\}$ such that the limits $\text{plim}\,X_n$ and $\text{plim}\,Y_n$ exist, we have

$\text{plim}(X_n\pm Y_n)=\text{plim}\,X_n\pm\text{plim}\,Y_n,\qquad \text{plim}(X_nY_n)=\text{plim}\,X_n\cdot\text{plim}\,Y_n,$

$\text{plim}\,\dfrac{X_n}{Y_n}=\dfrac{\text{plim}\,X_n}{\text{plim}\,Y_n}\quad(\text{provided }\text{plim}\,Y_n\neq 0).$
See the site map for several posts about this. In one of them we apply the Chebyshev inequality to prove the law of large numbers for sample means; a generalization is given in the theorem at the end of that post. Here is a further intuitive generalization:
Normally, unbiased sample characteristics converge in probability to their population counterparts.
Example 1. We know that the sample variance $s^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2$ unbiasedly estimates the population variance $\sigma^2$: $Es^2=\sigma^2$. The intuitive generalization says that then

(1) $\text{plim}\,s^2=\sigma^2$.

Here I argue that, for the purposes of obtaining some identities from the general properties of means, instead of the sample variance it's better to use the variance defined by $Var_s(X)=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2$ (with division by $n$ instead of $n-1$). Using Facts 1 and 2 we get from (1) that

(2) $\text{plim}\,Var_s(X)=\sigma^2=Var(X)$

(sample variance converges in probability to population variance). Here we use $Var_s(X)=\frac{n-1}{n}s^2$ and $\frac{n-1}{n}\to 1$.

Example 2. Similarly, the sample covariance converges in probability to the population covariance:

(3) $\text{plim}\,Cov_s(X,Y)=Cov(X,Y),$

where by definition $Cov_s(X,Y)=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})$.
Proving consistency of the slope estimator
Here (see equation (5)) I derived the representation of the OLS estimator of the slope:

$\hat{b}=b+\dfrac{Cov_s(x,e)}{Var_s(x)}.$

Using preservation of arithmetic operations for convergence in probability, we get

(4) $\text{plim}\,\hat{b}=b+\dfrac{\text{plim}\,Cov_s(x,e)}{\text{plim}\,Var_s(x)}=b+\dfrac{Cov(x,e)}{Var(x)}.$
In the last line we used (2) and (3). From (4) we see what conditions should be imposed for the slope estimator to converge to a spike at the true slope:
$Var(x)\neq 0$ (existence condition)

and

$Cov(x,e)=0$ (consistency condition).

Under these conditions, we have $\text{plim}\,\hat{b}=b$ (this is called consistency).
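Here is a minimal simulation sketch (my own, with made-up parameter values) showing the slope estimator collapsing to a spike at the true slope as the sample size grows, under the existence and consistency conditions:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 1.0, 2.0                                   # true intercept and slope

def slope_estimate(n):
    x = rng.normal(loc=3.0, scale=1.5, size=n)    # stochastic regressor, Var(x) > 0
    e = rng.normal(scale=1.0, size=n)             # error with Cov(x, e) = 0
    y = a + b * x + e
    # OLS slope: sample covariance over sample variance (both with division by n)
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

for n in [10, 100, 1000, 100_000]:
    print(n, slope_estimate(n))                   # estimates approach b = 2
```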
Conclusion. In a way, the second approach is technically simpler than the first.
Regressions with stochastic regressors 1: applying conditioning
The convenience condition states that the regressor in simple regression is deterministic. Here we look at how this assumption can be avoided using conditional expectation and variance. General idea: you check which parts of the proofs don't go through with stochastic regressors and modify the assumptions accordingly. It happens that only assumptions concerning the error term should be replaced by their conditional counterparts.
Unbiasedness in case of stochastic regressors
We consider the slope estimator for the simple regression

$y_i=a+bx_i+e_i,$

assuming that $x_i$ is stochastic.

First grab the critical representation (6) derived here:

(1) $\hat{b}=b+\sum_{i=1}^n a_ie_i$, where $a_i=\dfrac{x_i-\bar{x}}{\sum_{j=1}^n(x_j-\bar{x})^2}$.
The usual linearity of means applied to prove unbiasedness doesn't work because now the coefficients $a_i$ are stochastic (in other words, they are not constant). But we have generalized linearity (linearity of conditional means), which for the purposes of this proof can be written as

(2) $E\!\left(\sum_{i=1}^n a_ie_i\,\Big|\,x\right)=\sum_{i=1}^n a_iE(e_i|x)$

(conditional on the vector $x=(x_1,\dots,x_n)$ the coefficients $a_i$ are known constants and can be pulled out).
Let us replace the unbiasedness condition by its conditional version:
A3'. Unbiasedness condition. $E(e_i|x)=0$ for all $i$.
Then (1) and (2) give
(3) $E(\hat{b}|x)=b+\sum_{i=1}^n a_iE(e_i|x)=b,$

which can be called conditional unbiasedness. Next, applying the law of iterated expectations we obtain unconditional unbiasedness:

$E\hat{b}=E\big[E(\hat{b}|x)\big]=Eb=b.$
Variance in case of stochastic regressors
As one can guess, we have to replace efficiency conditions by their conditional versions:
A4'. Conditional uncorrelatedness of errors. Assume that $Cov(e_i,e_j|x)=0$ for all $i\neq j$.

A5'. Conditional homoscedasticity. All errors have the same conditional variances: $Var(e_i|x)=\sigma^2$ for all $i$ ($\sigma^2$ is a constant).
Now we can derive the conditional variance expression, using properties from this post:
$Var(\hat{b}|x)=Var\!\left(b+\sum_{i=1}^n a_ie_i\,\Big|\,x\right)=Var\!\left(\sum_{i=1}^n a_ie_i\,\Big|\,x\right)$ (dropping a constant doesn't affect variance)

$=\sum_{i=1}^n a_i^2Var(e_i|x)=\sigma^2\sum_{i=1}^n a_i^2=\dfrac{\sigma^2}{n\,Var_s(x)}$ (for conditionally uncorrelated variables, conditional variance is additive; then apply conditional homoscedasticity and $\sum_i a_i^2=1/\sum_i(x_i-\bar{x})^2$).
Replacing the three assumptions about the error by their conditional counterparts allows us to obtain almost perfect analogs of the usual properties of OLS estimators: the usual (unconditional) unbiasedness plus the estimator variance, in which the part containing the regressor should be averaged to account for its randomness: $Var(\hat{b})=E\!\left[\dfrac{\sigma^2}{n\,Var_s(x)}\right]$. If you think that solving the problem of stochastic regressors requires nothing more than the application of a couple of mathematical tricks, I agree with you.
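The following simulation sketch (my own illustration, not from the post) checks both conclusions: the average of $\hat{b}$ over replications is close to $b$, and the variance of $\hat{b}$ is close to the average of the conditional variance $\sigma^2/(n\,Var_s(x))$ over draws of the regressor.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b, sigma, n = 1.0, 2.0, 1.5, 30
reps = 20_000

b_hats, cond_vars = [], []
for _ in range(reps):
    x = rng.uniform(0.0, 10.0, size=n)            # stochastic regressor, drawn anew each time
    e = rng.normal(scale=sigma, size=n)           # errors satisfy the conditional assumptions
    y = a + b * x + e
    b_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    b_hats.append(b_hat)
    cond_vars.append(sigma**2 / (n * np.var(x)))  # conditional variance given this x sample

print(np.mean(b_hats), b)                          # unconditional unbiasedness
print(np.var(b_hats), np.mean(cond_vars))          # Var(b_hat) = E[ sigma^2 / (n Var_s(x)) ]
```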
Here we derived the OLS estimators of the slope and intercept:

(2) $\hat{b}=\dfrac{Cov_s(x,y)}{Var_s(x)},$

(3) $\hat{a}=\bar{y}-\hat{b}\bar{x}.$

A1. Existence condition. Since division by zero is not allowed, for (2) to exist we require $Var_s(x)\neq 0$. If this condition is not satisfied, then there is no variance in $x$ and all observed points lie on the vertical line $x=\bar{x}$.

A2. Convenience condition. The regressor $x$ is deterministic. This condition is imposed to be able to apply the properties of expectation, see equation (7) in this post. The time trend and dummy variables are examples of deterministic regressors. However, most real-life regressors are stochastic. Modifying the theory in order to cover stochastic regressors is the subject of two posts: finite-sample theory and large-sample theory.

A3. Unbiasedness condition. $Ee_i=0$ for all $i$. This is the main assumption that makes sure that the OLS estimators are unbiased, see equation (7) in this post.
Unbiasedness is not enough
Unbiasedness characterizes the quality of an estimator, see the intuitive explanation. Unfortunately, unbiasedness is not enough to choose the best estimator, because of nonuniqueness: usually, if there is one unbiased estimator of a parameter, then there are infinitely many unbiased estimators of the same parameter. For example, we know that the sample mean $\bar{X}$ unbiasedly estimates the population mean $\mu$: $E\bar{X}=\mu$. Since $EX_1=\mu$ ($X_1$ is the first observation), we can easily construct an infinite family of unbiased estimators $a\bar{X}+bX_1$, assuming $a+b=1$. Indeed, using linearity of expectation, $E(a\bar{X}+bX_1)=aE\bar{X}+bEX_1=(a+b)\mu=\mu$.
Variance is another measure of the quality of an estimator: to have a lower spread of estimator values, among competing estimators we choose the one which has the lowest variance. Knowing the estimator variance allows us to find the z-score and use statistical tables.
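A small simulation (my own illustration) makes both points concrete: with $a=b=0.5$ the estimator $a\bar{X}+bX_1$ is also unbiased for $\mu$, but its variance is noticeably larger than that of $\bar{X}$.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, n, reps = 5.0, 25, 50_000

xbar_vals, alt_vals = [], []
for _ in range(reps):
    X = rng.normal(loc=mu, scale=2.0, size=n)
    xbar_vals.append(X.mean())                     # the sample mean
    alt_vals.append(0.5 * X.mean() + 0.5 * X[0])   # a + b = 1, so also unbiased

print(np.mean(xbar_vals), np.mean(alt_vals), mu)   # both averages are close to mu
print(np.var(xbar_vals), np.var(alt_vals))         # the second variance is much larger
```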
Slope estimator variance
It is not difficult to find the variance of the slope estimator using representation (6) derived here:

$\hat{b}=b+\sum_{i=1}^n a_ie_i$, where $a_i=\dfrac{x_i-\bar{x}}{\sum_{j=1}^n(x_j-\bar{x})^2}$.
Don't try to apply the definition of variance directly at this point, because there will be a square of a sum, which leads to a double sum upon squaring. We need two new assumptions.
A4. Uncorrelatedness of errors. Assume that $Cov(e_i,e_j)=0$ for all $i\neq j$ (errors from different equations (1) are uncorrelated). Note that because of the unbiasedness condition, this assumption is equivalent to $Ee_ie_j=0$ for all $i\neq j$. This assumption is likely to be satisfied if we observe consumption patterns of unrelated individuals.

A5. Homoscedasticity. All errors have the same variances: $Var(e_i)=\sigma^2$ for all $i$. Again, because of the unbiasedness condition, this assumption is equivalent to $Ee_i^2=\sigma^2$ for all $i$.
Now we can derive the variance expression, using properties from this post:
$Var(\hat{b})=Var\!\left(b+\sum_{i=1}^n a_ie_i\right)=Var\!\left(\sum_{i=1}^n a_ie_i\right)$ (dropping a constant doesn't affect variance)

$=\sum_{i=1}^n Var(a_ie_i)$ (for uncorrelated variables, variance is additive)

$=\sum_{i=1}^n a_i^2Var(e_i)$ (variance is homogeneous of degree 2)

$=\sigma^2\sum_{i=1}^n a_i^2$ (applying homoscedasticity)

$=\dfrac{\sigma^2}{\sum_{i=1}^n(x_i-\bar{x})^2}$ (plugging $a_i=\dfrac{x_i-\bar{x}}{\sum_{j=1}^n(x_j-\bar{x})^2}$)

$=\dfrac{\sigma^2}{n\,Var_s(x)}$ (using the notation of sample variance).
Note that canceling out the two variances in the last steps is obvious in the short notation: $\sum_i a_i^2=\dfrac{n\,Var_s(x)}{(n\,Var_s(x))^2}=\dfrac{1}{n\,Var_s(x)}$. It is not so obvious for some if instead of the short notation for variances you use summation signs. The case of the intercept variance is left as an exercise.
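As a numerical check (my own sketch, with an arbitrary fixed regressor and made-up parameter values), the simulated variance of the slope estimator matches $\sigma^2/(n\,Var_s(x))$:

```python
import numpy as np

rng = np.random.default_rng(5)
a, b, sigma, n = 1.0, 2.0, 1.5, 40
x = np.linspace(0.0, 10.0, n)                      # deterministic regressor, fixed across replications

b_hats = []
for _ in range(50_000):
    e = rng.normal(scale=sigma, size=n)            # uncorrelated homoscedastic errors
    y = a + b * x + e
    b_hats.append(np.cov(x, y, bias=True)[0, 1] / np.var(x))

print(np.var(b_hats))                              # simulated variance of the slope estimator
print(sigma**2 / (n * np.var(x)))                  # theoretical value sigma^2 / (n Var_s(x))
```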
Conclusion
The above assumptions A1-A5 are called classical. It is necessary to remember their role in derivations because a considerable part of Econometrics is devoted to deviations from the classical assumptions. Once you have a certain assumption violated, you should expect the corresponding estimator property to be invalidated. For example, if $Ee_i\neq 0$, you should expect the estimators to be biased. If any of A4-A5 is not true, the formula we have derived,

$Var(\hat{b})=\dfrac{\sigma^2}{n\,Var_s(x)},$
will not hold. Besides, the Gauss-Markov theorem that the OLS estimators are efficient will not hold (this will be discussed later). The pair A4-A5 can be called an efficiency condition.
Agresti and Franklin on p. 658 say: "The indicator variable for a particular category is binary. It equals 1 if the observation falls into that category and it equals 0 otherwise." I say: for most students, this is not clear enough.
Problem statement
Figure 1. Residential power consumption in 2014 and 2015. Source: http://www.eia.gov/electricity/data.cfm
Residential power consumption in the US has a seasonal pattern. Heating in winter and cooling in summer cause the differences. We want to capture the dependence of residential power consumption on the season.
Visual approach to dummy variables
Seasons of the year are categorical variables. We have to replace them with quantitative variables, to be able to use them in any mathematical procedure that involves arithmetic operations. To this end, we define a dummy variable (indicator) $D_{win}$ for winter such that it equals 1 in winter and 0 in any other period of the year. The dummies $D_{spr}$, $D_{sum}$ and $D_{aut}$ for spring, summer and autumn are defined similarly. We provide two visualizations assuming monthly observations.
Table 1. Tabular visualization of dummies

Month       D_win  D_spr  D_sum  D_aut  T
December      1      0      0      0    1
January       1      0      0      0    1
February      1      0      0      0    1
March         0      1      0      0    1
April         0      1      0      0    1
May           0      1      0      0    1
June          0      0      1      0    1
July          0      0      1      0    1
August        0      0      1      0    1
September     0      0      0      1    1
October       0      0      0      1    1
November      0      0      0      1    1
Figure 2. Graphical visualization of D_spr
The first idea may be wrong
The first thing that comes to mind is to regress power consumption $y$ on the dummies as in

(1) $y_t=a+b_1D_{win,t}+b_2D_{spr,t}+b_3D_{sum,t}+b_4D_{aut,t}+e_t.$

Not so fast. To see the problem, let us rewrite (1) as

(2) $y_t=a\cdot 1+b_1D_{win,t}+b_2D_{spr,t}+b_3D_{sum,t}+b_4D_{aut,t}+e_t.$

This shows that, in addition to the four dummies, there is a fifth variable, which equals 1 across all observations. Let us denote it $T$ (for Trivial). Table 1 shows that

(3) $T=D_{win}+D_{spr}+D_{sum}+D_{aut}.$

This makes the next definition relevant. Regressors $x^{(1)},\dots,x^{(k)}$ are called linearly dependent if one of them, say $x^{(1)}$, can be expressed as a linear combination of the others: $x^{(1)}=c_2x^{(2)}+\dots+c_kx^{(k)}$. In case (3), all coefficients are unities, so we have linear dependence. Using (3), let us replace $T$ in (2). The resulting equation is rearranged as

(4) $y_t=(a+b_1)D_{win,t}+(a+b_2)D_{spr,t}+(a+b_3)D_{sum,t}+(a+b_4)D_{aut,t}+e_t.$
Now we see what the problem is. When regressors are linearly dependent, the model is not uniquely specified. (1) and (4) are two different representations of the same model.
What is the way out?
If regressors are linearly dependent, drop them one after another until you get linearly independent ones. For example, dropping the winter dummy, we get

(5) $y_t=a+b_2D_{spr,t}+b_3D_{sum,t}+b_4D_{aut,t}+e_t.$
Here is the estimation result for the two-year data in Figure 1:

$\hat{y}_t=\hat{a}+\hat{b}_2D_{spr,t}+\hat{b}_3D_{sum,t}+\hat{b}_4D_{aut,t}$

(the numerical estimates are in the Excel file linked below). This means that the estimated average power consumption equals $\hat{a}$ in winter, $\hat{a}+\hat{b}_2$ in spring, $\hat{a}+\hat{b}_3$ in summer, and $\hat{a}+\hat{b}_4$ in autumn.
It is revealing that cooling requires more power than heating. However, the summer coefficient is not significant. Here is the Excel file with the data and estimation result.
The category that has been dropped is called a base (or reference) category. Thus, the intercept in (5) measures power consumption in winter. The dummy coefficients in (5) measure deviations of power consumption in respective seasons from that in winter.
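To make the procedure concrete, here is a sketch with simulated monthly data (my own example, not the Excel file from the post) that builds the three seasonal dummies with winter as the base category and estimates model (5):

```python
import numpy as np

rng = np.random.default_rng(6)
month_no = (np.arange(24) % 12) + 1                 # two years of monthly data, 1 = January
D_spr = np.isin(month_no, [3, 4, 5]).astype(float)
D_sum = np.isin(month_no, [6, 7, 8]).astype(float)
D_aut = np.isin(month_no, [9, 10, 11]).astype(float)
# winter (December, January, February) is the base category, so it gets no dummy

# simulated consumption: winter level 110, spring -20, summer +10, autumn -15, plus noise
y = 110 - 20 * D_spr + 10 * D_sum - 15 * D_aut + rng.normal(0, 3, size=24)

X = np.column_stack([np.ones(24), D_spr, D_sum, D_aut])   # model (5): intercept = winter level
coef = np.linalg.solve(X.T @ X, X.T @ y)
print(coef)   # approximately [110, -20, 10, -15]: winter level and deviations from it
```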
Here is the question I ask my students
We want to see how beer consumption $y$ depends on gender and income $x$. Let $M$ and $F$ denote the dummies for males and females, respectively. Correct the following model and interpret the resulting coefficients:

$y_i=a+b_1M_i+b_2F_i+cx_i+e_i.$
Final remark
When a researcher includes all categories plus the trivial regressor, he/she falls into what is called a dummy trap. The problem of linear dependence among regressors is usually discussed under the heading of multiple regression. But since the trivial regressor is present in simple regression too, it might be a good idea to discuss it earlier.
Linear dependence/independence of regressors is an exact condition for existence of the OLS estimator. That is, if regressors are linearly dependent, then the OLS estimator doesn't exist, in which case the question about its further properties doesn't make sense. If, on the other hand, regressors are linearly independent, then the OLS estimator exists, and further properties can be studied, such as unbiasedness, variance and efficiency.
Testing for structural changes: a topic suitable for AP Stats
Problem statement
Economic data are volatile but sometimes changes in them look more permanent than transitory.
Figure 1. US GDP from agriculture. Source: http://www.tradingeconomics.com/united-states/gdp-from-agriculture
Figure 1 shows fluctuations of US GDP from agriculture. There have been ups and downs throughout the period of 2005-2016 but overall the trend has been up until 2013 and down since then. We want an objective, statistical confirmation of the fact that in 2013 the change was structural, substantial rather than a random fluctuation.
Chow test steps
Divide the observed sample in two parts, A and B, at the point where you suspect the structural change (or break) has occurred. Run three regressions: one for A, another for B and a third one for the whole sample (the pooled regression). Get the residual sums of squares from each of them, denoted $RSS_A$, $RSS_B$ and $RSS_p$, respectively.

Let $n_A$ and $n_B$ be the numbers of observations in the two subsamples and suppose there are $k$ coefficients in your regression (for Figure 1, we would regress GDP on a time variable, so the number of coefficients would be $k=2$, including the intercept). The Chow test statistic is defined by

$F=\dfrac{(RSS_p-RSS_A-RSS_B)/k}{(RSS_A+RSS_B)/(n_A+n_B-2k)}.$

This statistic is distributed as $F(k,\,n_A+n_B-2k)$ (the F distribution with $k$ and $n_A+n_B-2k$ degrees of freedom). The null hypothesis is that the coefficients are the same for the two subsamples and the alternative is that they are not. If the statistic is larger than the critical value at your chosen level of significance, splitting the sample in two is beneficial (the split better describes the data). If the statistic is not larger than the critical value, the pooled regression better describes the data.
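Here is a minimal computational sketch of the test (my own example with a made-up break in a simple time-trend regression; the helper function rss is mine, not from the post):

```python
import numpy as np
from scipy import stats

def rss(y, X):
    """Residual sum of squares of an OLS regression of y on X."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    return resid @ resid

rng = np.random.default_rng(7)
t = np.arange(40, dtype=float)
# trend up for the first 30 observations, down afterwards (a structural break)
y = np.where(t < 30, 10 + 0.5 * t, 25 - 0.8 * (t - 30)) + rng.normal(0, 1, 40)

X = np.column_stack([np.ones_like(t), t])          # k = 2 coefficients: intercept and trend
A, B = t < 30, t >= 30
n_A, n_B, k = A.sum(), B.sum(), X.shape[1]

RSS_A, RSS_B, RSS_p = rss(y[A], X[A]), rss(y[B], X[B]), rss(y, X)
F = ((RSS_p - RSS_A - RSS_B) / k) / ((RSS_A + RSS_B) / (n_A + n_B - 2 * k))
crit = stats.f.ppf(0.95, k, n_A + n_B - 2 * k)     # 5% critical value
print(F, crit, F > crit)                           # F > crit suggests a structural change
```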
Figure 2. Splitting is better (there is a structural change)
In Figure 2, the gray lines are the fitted lines for the two subsamples. They fit the data much better than the orange line (the fitted line for the whole sample).
Figure 3. Pooling is better
In Figure 3, pooling is better because the intercept and slope are about the same and pooling amounts to increasing the sample size.
The disturbance term: To hide or not to hide?
In an introductory Stats course, some part of the theory should be hidden. Where to draw the line is an interesting question. Here I discuss the ideas that look definitely bad to me.
How disturbing is the disturbance term?
In the main text, Agresti and Franklin never mention the disturbance term $e$ in the regression model

(1) $y=a+bx+e$

(it is hidden in Exercise 12.105). Instead, they write the equation for the mean, $Ey=a+bx$, that follows from (1) under the standard assumption $Ee=0$. This would be fine if the exposition stopped right there. However, one has to explain the random source of variability in $y$. On p. 583 the authors say: "The probability distribution of y values at a fixed value of x is a conditional distribution. At each value of x, there is a conditional distribution of y values. A regression model also describes these distributions. An additional parameter σ describes the standard deviation of each conditional distribution."
Further, Figure 12.4 illustrates distributions of errors at different points and asks: "What do the bell-shaped curves around the line at x = 12 and at x = 16 represent?"
Figure 12.4. Illustration of error distributions
Besides, explanations of heteroscedasticity and of the residual sum of squares are impossible without explicitly referring to the disturbance term.
Attributing a regression property to the correlation is not good
On p.589 I encountered a statement that puzzled me: "An important property of the correlation is that at any particular x value, the predicted value of y is relatively closer to its mean than x is to its mean. If an x value is a certain number of standard deviations from its mean, then the predicted y is r times that many standard deviations from its mean."
Firstly, this is a verbal interpretation of some formula, so why not give the formula itself? How good must a student be to guess what is behind the verbal formulation?
Secondly, as I stressed in this post, the correlation coefficient does not entail any prediction about the magnitude of a change in one variable caused by a change in another. The above statement about the predicted value of y must be a property of regression. Attributing a regression property to the correlation is not in the best interests of those who want to study Stats at a more advanced level.
Thirdly, I felt challenged to see something new in the area I thought I knew everything about. So here is the derivation. By definition, the fitted value is
(2) $\hat{y}_i=\hat{a}+\hat{b}x_i,$

where the hats stand for estimators. The fitted line passes through the point $(\bar{x},\bar{y})$:

(3) $\bar{y}=\hat{a}+\hat{b}\bar{x}$

(this will be proved elsewhere). Subtracting (3) from (2) we get

(4) $\hat{y}_i-\bar{y}=\hat{b}(x_i-\bar{x}).$
It is helpful to rewrite (4) in a more symmetric form:
(5) $\dfrac{\hat{y}_i-\bar{y}}{\sigma_y}=r\,\dfrac{x_i-\bar{x}}{\sigma_x},$

where $\sigma_x,\sigma_y$ are the sample standard deviations, $r$ is the sample correlation, and we used $\hat{b}=r\,\sigma_y/\sigma_x$.

This is the equation we need. Suppose an $x$ value is a certain number $c$ of standard deviations from its mean: $x_i-\bar{x}=c\,\sigma_x$. Plugging this into (5) we get $\hat{y}_i-\bar{y}=rc\,\sigma_y$, that is, the predicted $y$ is $r$ times that many standard deviations from its mean.
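A quick numerical check of (5) on simulated data (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=200)
y = 1.0 + 0.7 * x + rng.normal(scale=2.0, size=200)

r = np.corrcoef(x, y)[0, 1]
b_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a_hat = y.mean() - b_hat * x.mean()

i = 0                                              # pick any observation
c = (x[i] - x.mean()) / np.std(x)                  # x is c standard deviations from its mean
y_hat = a_hat + b_hat * x[i]
print((y_hat - y.mean()) / np.std(y), r * c)       # equal: predicted y is r*c std devs from its mean
```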
Coefficient of determination: an inductive introduction to R squared
I know a person who did not understand this topic, even though he had a PhD in Math. That was me more than twenty years ago, and the reason was that the topic was given formally, without explaining the leading idea.
Leading idea
Step 1. We want to describe the relationship between the observed y's and x's using the simple regression

$y_i=a+bx_i+e_i.$

Let us start with the simple case when there is no variability in the y's, that is, the slope and the errors are zero. Since $y_i=a$ for all $i$, we have $y_i=\bar{y}$ and, of course,

(1) $\sum_{i=1}^n(y_i-\bar{y})^2=0.$
In the general case, we start with the decomposition
(2) $y_i=\hat{y}_i+\hat{e}_i,$

where $\hat{y}_i$ is the fitted value and $\hat{e}_i$ is the residual, see this post. We still want to see how far $y_i$ is from $\bar{y}$. With this purpose, from both sides of equation (2) we subtract $\bar{y}$, obtaining $y_i-\bar{y}=(\hat{y}_i-\bar{y})+\hat{e}_i$. Squaring this equation and summing over $i$, for the sum in (1) we get

(3) $\sum_{i=1}^n(y_i-\bar{y})^2=\sum_{i=1}^n(\hat{y}_i-\bar{y})^2+2\sum_{i=1}^n(\hat{y}_i-\bar{y})\hat{e}_i+\sum_{i=1}^n\hat{e}_i^2.$

Whoever was the first to do this discovered that the cross product $\sum_{i=1}^n(\hat{y}_i-\bar{y})\hat{e}_i$ is zero and (3) simplifies to

(4) $\sum_{i=1}^n(y_i-\bar{y})^2=\sum_{i=1}^n(\hat{y}_i-\bar{y})^2+\sum_{i=1}^n\hat{e}_i^2.$
The rest is a matter of definitions
$TSS=\sum_{i=1}^n(y_i-\bar{y})^2$ — Total Sum of Squares (I prefer to call this the total variation around $\bar{y}$),

$ESS=\sum_{i=1}^n(\hat{y}_i-\bar{y})^2$ — Explained Sum of Squares (to me this is the explained variation around $\bar{y}$),

$RSS=\sum_{i=1}^n\hat{e}_i^2$ — Residual Sum of Squares (the unexplained variation around $\bar{y}$, caused by the error term).
Thus from (4) we have
(5) $TSS=ESS+RSS$

and

$1=\dfrac{ESS}{TSS}+\dfrac{RSS}{TSS}.$
Step 2. It is desirable to have RSS close to zero and ESS close to TSS. Therefore we can use the ratio ESS/TSS as a measure of how well the regression describes the relationship between the y's and x's. From (5) it follows that this ratio takes values between zero and 1. Hence, the coefficient of determination

$R^2=\dfrac{ESS}{TSS}$

can be interpreted as the percentage of the total variation of the y's around $\bar{y}$ explained by the regression. From (5) an equivalent definition is

$R^2=1-\dfrac{RSS}{TSS}.$
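A short computational sketch (my own, on simulated data) of the decomposition (4) and of both definitions of $R^2$:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 1.2 * x + rng.normal(scale=2.0, size=100)

b_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a_hat = y.mean() - b_hat * x.mean()
y_fit = a_hat + b_hat * x
resid = y - y_fit

TSS = np.sum((y - y.mean()) ** 2)
ESS = np.sum((y_fit - y.mean()) ** 2)
RSS = np.sum(resid ** 2)

print(np.isclose(TSS, ESS + RSS))                  # the cross product indeed vanishes
print(ESS / TSS, 1 - RSS / TSS)                    # two equivalent definitions of R squared
```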
Back to the pearls of AP Statistics
How much of the above can be explained without algebra? Stats without algebra is a crippled creature. I am afraid that any concepts requiring substantial algebra should be dropped from the AP Stats curriculum. Compare this post with the explanation on p. 592 of Agresti and Franklin.