Agresti and Franklin on p. 658 say: "The indicator variable for a particular category is binary. It equals 1 if the observation falls into that category and it equals 0 otherwise." I say: For most students, this is not clear enough.
Problem statement

Figure 1. Residential power consumption in 2014 and 2015. Source: http://www.eia.gov/electricity/data.cfm
Residential power consumption in the US has a seasonal pattern. Heating in winter and cooling in summer cause the differences. We want to capture the dependence of residential power consumption on the season.
Visual approach to dummy variables
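Seasons of the year are values of a categorical variable. To use them in any mathematical procedure that involves arithmetic operations, we have to replace them with quantitative variables. To this end, we define a dummy variable (indicator) $D_{win}$ for winter that equals 1 in the winter months and 0 in all other months. The dummies $D_{spr}$, $D_{sum}$ and $D_{aut}$ for spring, summer and autumn are defined similarly. With monthly observations, the dummies and their sum look as follows:
Month | $D_{win}$ | $D_{spr}$ | $D_{sum}$ | $D_{aut}$ | $D_{win}+D_{spr}+D_{sum}+D_{aut}$ |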
December | 1 | 0 | 0 | 0 | 1 |
January | 1 | 0 | 0 | 0 | 1 |
February | 1 | 0 | 0 | 0 | 1 |
March | 0 | 1 | 0 | 0 | 1 |
April | 0 | 1 | 0 | 0 | 1 |
May | 0 | 1 | 0 | 0 | 1 |
June | 0 | 0 | 1 | 0 | 1 |
July | 0 | 0 | 1 | 0 | 1 |
August | 0 | 0 | 1 | 0 | 1 |
September | 0 | 0 | 0 | 1 | 1 |
October | 0 | 0 | 0 | 1 | 1 |
November | 0 | 0 | 0 | 1 | 1 |
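The table can also be generated mechanically, which may help students see that a dummy is nothing more than a yes/no answer coded as 1/0. Below is a minimal sketch in Python; the names `SEASON_OF_MONTH`, `SEASONS` and `season_dummies` are my own illustrative choices, and the month-to-season assignment is simply the one used in the table.

```python
# Month-to-season assignment, copied from the table above.
SEASON_OF_MONTH = {
    "December": "winter", "January": "winter", "February": "winter",
    "March": "spring", "April": "spring", "May": "spring",
    "June": "summer", "July": "summer", "August": "summer",
    "September": "autumn", "October": "autumn", "November": "autumn",
}
SEASONS = ["winter", "spring", "summer", "autumn"]


def season_dummies(month):
    """Return the four indicator values (winter, spring, summer, autumn)."""
    season = SEASON_OF_MONTH[month]
    return [1 if season == s else 0 for s in SEASONS]


rows = {month: season_dummies(month) for month in SEASON_OF_MONTH}

for month, dummies in rows.items():
    # Last printed value is the sum of the four dummies in that month.
    print(month, dummies, sum(dummies))
```

Each month falls into exactly one season, so every row contains a single 1 and the row sum is always 1.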


The first idea may be wrong
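The first thing that comes to mind is to regress power consumption $PowerC$ on all four dummies:
(1) $PowerC = a + bD_{win} + cD_{spr} + dD_{sum} + eD_{aut} + error$
Not so fast. To see the problem, let us rewrite (1) as
(2) $PowerC = a\times 1 + bD_{win} + cD_{spr} + dD_{sum} + eD_{aut} + error$
This shows that, in addition to the four dummies, there is a fifth variable, which equals 1 across all observations. Let us denote it $T$ (for trivial). Since every month belongs to exactly one season, the four dummies sum to 1 in every observation:
(3) $T = D_{win} + D_{spr} + D_{sum} + D_{aut}$
This makes the next definition relevant. Regressors $x_1, ..., x_k$ are called linearly dependent if one of them can be expressed as a linear combination of the others. In (3), $T$ is a linear combination of the four dummies with all coefficients equal to 1, so the regressors in (2) are linearly dependent. Substituting (3) into (2) and collecting terms gives
(4) $PowerC = (a+b)D_{win} + (a+c)D_{spr} + (a+d)D_{sum} + (a+e)D_{aut} + error$
Now we see what the problem is. When regressors are linearly dependent, the model is not uniquely specified. (1) and (4) are two different representations of the same model.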
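To make the non-uniqueness tangible, here is a small sketch in Python. It takes a regression with an intercept and all four seasonal dummies and moves an arbitrary amount `k` from the dummy coefficients to the intercept. The coefficient values are hypothetical numbers chosen only for illustration, and `fitted` and `shift` are my own helper names.

```python
# Dummy values (winter, spring, summer, autumn) for one month of each season.
OBSERVATIONS = {
    "January": (1, 0, 0, 0),
    "April": (0, 1, 0, 0),
    "July": (0, 0, 1, 0),
    "October": (0, 0, 0, 1),
}


def fitted(coefs, dummies):
    """Fitted value: intercept times 1 plus each slope times its dummy."""
    intercept, *slopes = coefs
    return intercept + sum(s * x for s, x in zip(slopes, dummies))


def shift(coefs, k):
    """Add k to the intercept and subtract k from every dummy coefficient."""
    intercept, *slopes = coefs
    return [intercept + k] + [s - k for s in slopes]


coefs = [10, 3, 1, 4, 2]      # hypothetical intercept and four dummy coefficients
other = shift(coefs, -10)     # intercept becomes 0, slopes absorb it

for month, dummies in OBSERVATIONS.items():
    # Both coefficient vectors give the same fitted value in every month.
    print(month, fitted(coefs, dummies), fitted(other, dummies))
```

Because exactly one dummy equals 1 in each observation, any `k` works, so there are infinitely many coefficient vectors describing the same fitted values. The data cannot tell them apart.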
What is the way out?
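If regressors are linearly dependent, drop them one after another until you get linearly independent ones. For example, dropping the winter dummy, we regress power consumption $PowerC$ on a constant and the spring, summer and autumn dummies:
(5) $PowerC = a + cD_{spr} + dD_{sum} + eD_{aut} + error$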
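The dropping procedure can be sketched in code. The Python snippet below is only an illustration under my own assumptions: linear independence is checked by computing the rank with exact Gaussian elimination, and the helper names `rank` and `drop_until_independent`, as well as the `protected` argument that keeps the constant in the model, are hypothetical choices rather than a standard routine.

```python
from fractions import Fraction


def rank(columns):
    """Rank of the matrix with the given columns (exact Gaussian elimination)."""
    rows = [[Fraction(v) for v in row] for row in zip(*columns)]
    r = 0
    for c in range(len(columns)):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c] / rows[r][c]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


def drop_until_independent(regressors, protected=("T",)):
    """Drop redundant regressors one at a time until the rest are independent."""
    kept = dict(regressors)
    while rank(list(kept.values())) < len(kept):
        full_rank = rank(list(kept.values()))
        for name in list(kept):
            if name in protected:
                continue
            rest = [col for n, col in kept.items() if n != name]
            if rank(rest) == full_rank:  # dropping this one loses no information
                del kept[name]
                break
        else:
            raise ValueError("no unprotected regressor can be dropped")
    return list(kept)


# Twelve monthly observations, December through November.
regressors = {
    "T": [1] * 12,
    "D_win": [1] * 3 + [0] * 9,
    "D_spr": [0] * 3 + [1] * 3 + [0] * 6,
    "D_sum": [0] * 6 + [1] * 3 + [0] * 3,
    "D_aut": [0] * 9 + [1] * 3,
}

print(rank(list(regressors.values())))     # 4, although there are 5 regressors
print(drop_until_independent(regressors))  # ['T', 'D_spr', 'D_sum', 'D_aut']
```

Which dummy gets dropped is a matter of choice; the sketch happens to drop the first redundant one it finds, which here is the winter dummy.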
Here is the estimation result for the two-year data in Figure 1:
This means that:
It is revealing that cooling requires more power than heating. However, the summer coefficient is not statistically significant, so two years of data are not enough to establish this difference firmly. Here is the Excel file with the data and estimation result.
The category that has been dropped is called a base (or reference) category. Thus, the intercept in (5) measures average power consumption in winter. The dummy coefficients in (5) measure deviations of average power consumption in the respective seasons from that in winter.
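This interpretation can be checked numerically. The sketch below, in Python, estimates a regression on a constant and the spring, summer and autumn dummies by solving the normal equations exactly. The consumption figures are made up for illustration and are not the EIA data from Figure 1; `solve` and `ols` are my own helper names.

```python
from fractions import Fraction


def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        # Raises StopIteration if the matrix is singular.
        pivot = next(i for i in range(c, n) if M[i][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for i in range(n):
            if i != c:
                factor = M[i][c]
                M[i] = [v - factor * w for v, w in zip(M[i], M[c])]
    return [row[-1] for row in M]


def ols(X, y):
    """OLS coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


# Made-up monthly consumption, three months per season (arbitrary units).
seasons = ["winter"] * 3 + ["spring"] * 3 + ["summer"] * 3 + ["autumn"] * 3
power = [130, 140, 120, 100, 95, 105, 150, 140, 130, 110, 100, 105]

# Winter is the base category: constant plus spring, summer, autumn dummies.
X = [[1, int(s == "spring"), int(s == "summer"), int(s == "autumn")]
     for s in seasons]
coefs = ols(X, power)


def season_mean(name):
    values = [p for p, s in zip(power, seasons) if s == name]
    return Fraction(sum(values), len(values))


print("intercept:", coefs[0], "winter mean:", season_mean("winter"))
for name, coef in zip(["spring", "summer", "autumn"], coefs[1:]):
    print(name, "coefficient:", coef,
          "mean minus winter mean:", season_mean(name) - season_mean("winter"))
```

In this made-up sample the intercept coincides with the winter average, and each dummy coefficient coincides with the difference between that season's average and the winter average.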
Here is the question I ask my students
We want to see how beer consumption
Final remark
When a researcher includes dummies for all categories plus the trivial regressor (the constant that equals 1 in every observation), they fall into what is called the dummy trap. The problem of linear dependence among regressors is usually discussed under the heading of multiple regression. But since the trivial regressor is present in simple regression too, it might be a good idea to discuss it earlier.
Linear independence of regressors is the exact condition for the OLS estimator to be uniquely defined. That is, if regressors are linearly dependent, then the least squares problem has infinitely many solutions, so the OLS estimator doesn't exist as a single vector of coefficients, in which case the question about its further properties doesn't make sense. If, on the other hand, regressors are linearly independent, then the OLS estimator exists and is unique, and further properties can be studied, such as unbiasedness, variance and efficiency.
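One way to see this condition at work is through the matrix $X'X$ of the normal equations $X'Xb = X'y$: a unique solution requires $X'X$ to be invertible, that is, to have a nonzero determinant. The Python sketch below compares the dummy trap design with the design that drops the winter dummy, using twelve monthly observations. The helpers `gram` and `det` are my own illustrative names.

```python
from fractions import Fraction


def gram(X):
    """Return X'X for a design matrix given as a list of rows."""
    k = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]


def det(A):
    """Determinant by exact Gaussian elimination."""
    M = [[Fraction(v) for v in row] for row in A]
    n = len(M)
    result = Fraction(1)
    for c in range(n):
        pivot = next((i for i in range(c, n) if M[i][c] != 0), None)
        if pivot is None:
            return Fraction(0)  # a column without a pivot: singular matrix
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            result = -result    # a row swap flips the sign
        result *= M[c][c]
        for i in range(c + 1, n):
            factor = M[i][c] / M[c][c]
            M[i] = [v - factor * w for v, w in zip(M[i], M[c])]
    return result


seasons = ["winter"] * 3 + ["spring"] * 3 + ["summer"] * 3 + ["autumn"] * 3
names = ["winter", "spring", "summer", "autumn"]

# Dummy trap: constant plus all four dummies.
X_trap = [[1] + [int(s == n) for n in names] for s in seasons]
# Winter as the base category: constant plus three dummies.
X_base = [[1] + [int(s == n) for n in names[1:]] for s in seasons]

print(det(gram(X_trap)))  # zero: the normal equations have no unique solution
print(det(gram(X_base)))  # nonzero: a unique OLS solution exists
```

Many statistical packages react to the first situation by reporting an error or by silently dropping one of the regressors, so it is worth knowing which category has become the base.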