Agresti and Franklin on p. 658 say: "The indicator variable for a particular category is binary. It equals 1 if the observation falls into that category and it equals 0 otherwise." I say: for most students, this is not clear enough.
Residential power consumption in the US has a seasonal pattern. Heating in winter and cooling in summer cause the differences. We want to capture the dependence of residential power consumption on the season.
Visual approach to dummy variables
The season of the year is a categorical variable. We have to replace it with quantitative variables to be able to use it in any mathematical procedure that involves arithmetic operations. To this end, we define a dummy variable (indicator) for winter that equals 1 in winter and 0 in any other period of the year. The dummies for spring, summer and autumn are defined similarly. We provide two visualizations assuming monthly observations.
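To make the definition concrete, here is a minimal sketch of how the four dummies could be built for two years of monthly observations. The month-to-season mapping (December to February counted as winter) and the names SEASON_OF_MONTH, SEASONS and season_dummies are assumptions made for illustration, not something taken from the data used in this post.

```python
import numpy as np

# Assumption: meteorological seasons, with December-February counted as winter.
SEASON_OF_MONTH = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}
SEASONS = ["winter", "spring", "summer", "autumn"]


def season_dummies(months):
    """Return an (n, 4) array of 0/1 values, one column per season in SEASONS order."""
    labels = [SEASON_OF_MONTH[m] for m in months]
    return np.array([[int(label == s) for s in SEASONS] for label in labels])


months = list(range(1, 13)) * 2  # two years of monthly observations
D = season_dummies(months)
print(D[:12])  # first year: each row has a single 1, in the column of its season
```

Each row contains exactly one 1, which is the property that matters in what follows.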
The first idea may be wrong
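The first thing that comes to mind is to regress on dummies as in
(1) $y = a + b_W W + b_{Sp} Sp + b_{Su} Su + b_A A + e$
where $y$ is power consumption, $W$, $Sp$, $Su$, $A$ are the winter, spring, summer and autumn dummies, and $e$ is the error term.
Not so fast. To see the problem, let us rewrite (1) as
(2) $y = a \cdot 1 + b_W W + b_{Sp} Sp + b_{Su} Su + b_A A + e$
This shows that, in addition to the four dummies, there is a fifth variable, which equals 1 across all observations. Let us denote it $T$ (for Trivial). Table 1 shows that
(3) $T = W + Sp + Su + A$
This makes the next definition relevant. Regressors $x_1, \dots, x_k$ are called linearly dependent if one of them, say, $x_1$, can be expressed as a linear combination of the others: $x_1 = c_2 x_2 + \dots + c_k x_k$. In case (3), all coefficients are unities, so we have linear dependence. Using (3), let us replace $T$ (the unit regressor) in (2). The resulting equation is rearranged as
(4) $y = (a + b_W) W + (a + b_{Sp}) Sp + (a + b_{Su}) Su + (a + b_A) A + e$
Now we see what the problem is. When regressors are linearly dependent, the model is not uniquely specified. (1) and (4) are two different representations of the same model.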
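The non-uniqueness can be checked numerically. The sketch below assumes two years of monthly data with December to February counted as winter, and uses made-up coefficient values. It builds the trivial regressor together with the four dummies and shows that two different coefficient vectors give identical fitted values.

```python
import numpy as np

# Season codes for January..December, repeated for two years:
# 0 = winter, 1 = spring, 2 = summer, 3 = autumn (assumed mapping).
season = np.array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0] * 2)
D = np.eye(4, dtype=int)[season]          # the four season dummies
T = np.ones((season.size, 1), dtype=int)  # trivial regressor, equal to 1 everywhere
X = np.hstack([T, D])                     # trivial regressor followed by the dummies

# The trivial regressor is the sum of the four dummies.
print(np.array_equal(T[:, 0], D.sum(axis=1)))

# Two different coefficient vectors (hypothetical values):
beta_1 = np.array([10.0, 5.0, 1.0, 7.0, 2.0])     # intercept 10 plus dummy coefficients
beta_2 = np.array([0.0, 15.0, 11.0, 17.0, 12.0])  # intercept folded into the dummies

# Both give the same fitted values, so the data cannot tell them apart.
print(np.allclose(X @ beta_1, X @ beta_2))
```

The second vector is obtained from the first by moving the intercept into each dummy coefficient, which is exactly the kind of rearrangement described above.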
What is the way out?
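If regressors are linearly dependent, drop them one at a time until the remaining ones are linearly independent. For example, dropping the winter dummy, we get
(5) $y = a + b_{Sp} Sp + b_{Su} Su + b_A A + e$
with $Sp$, $Su$, $A$ being the spring, summer and autumn dummies.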
Here is the estimation result for the two-year data in Figure 1:
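This means that the fitted consumption equals the estimated intercept in winter, the intercept plus the spring coefficient in spring, the intercept plus the summer coefficient in summer, and the intercept plus the autumn coefficient in autumn.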
It is revealing that cooling requires more power than heating. However, the summer coefficient is not significant. Here is the Excel file with the data and estimation result.
The category that has been dropped is called a base (or reference) category. Thus, the intercept in (5) measures power consumption in winter. The dummy coefficients in (5) measure deviations of power consumption in respective seasons from that in winter.
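The base-category interpretation can be verified on synthetic data. The numbers below are made up for illustration and are not the data from the Excel file. With winter dropped, the OLS intercept should coincide with the winter sample mean, and each dummy coefficient with the difference between that season's mean and the winter mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Season codes for January..December, two years:
# 0 = winter, 1 = spring, 2 = summer, 3 = autumn (assumed mapping).
season = np.array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0] * 2)

# Synthetic consumption: made-up seasonal levels plus noise (not the author's data).
levels = np.array([30.0, 22.0, 33.0, 21.0])
y = levels[season] + rng.normal(0.0, 1.0, size=season.size)

D = np.eye(4)[season]
# Intercept plus spring, summer and autumn dummies; winter is the base category.
X = np.column_stack([np.ones(season.size), D[:, 1:]])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

means = np.array([y[season == s].mean() for s in range(4)])
print(coef[0], means[0])                # intercept vs. winter mean
print(coef[1:], means[1:] - means[0])   # coefficients vs. deviations from winter
```

Choosing a different base category changes the intercept and the coefficients but not the fitted values.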
Here is the question I ask my students
We want to see how beer consumption $B$ depends on gender and income $I$. Let $M$ and $F$ denote the dummies for males and females, respectively. Correct the following model and interpret the resulting coefficients:
$B = a + bI + cM + dF + e$
When a researcher includes all categories plus the trivial regressor, he/she falls into what is called a dummy trap. The problem of linear dependence among regressors is usually discussed under the heading of multiple regression. But since the trivial regressor is present in simple regression too, it might be a good idea to discuss it earlier.
Linear independence of regressors is a necessary and sufficient condition for the OLS estimator to exist as a unique solution. That is, if regressors are linearly dependent, then the least-squares problem has infinitely many solutions and the OLS estimator is not uniquely defined, in which case the question about its further properties doesn't make sense. If, on the other hand, regressors are linearly independent, then the OLS estimator exists and is unique, and further properties can be studied, such as unbiasedness, variance and efficiency.
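In practice, the condition can be checked through the rank of the regressor matrix before solving the normal equations. The following is a minimal sketch, assuming monthly seasonal dummies as above. The helper name ols_if_identified and the placeholder dependent variable are hypothetical.

```python
import numpy as np


def ols_if_identified(X, y):
    """Return OLS coefficients, or None when the columns of X are linearly dependent."""
    X = np.asarray(X, dtype=float)
    if np.linalg.matrix_rank(X) < X.shape[1]:
        return None  # X'X is singular, so there is no unique solution
    # Solve the normal equations X'X b = X'y.
    return np.linalg.solve(X.T @ X, X.T @ y)


# Season codes for January..December, two years (0 = winter, assumed mapping).
season = np.array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0] * 2)
D = np.eye(4)[season]
ones = np.ones((season.size, 1))
y = np.arange(season.size, dtype=float)  # placeholder dependent variable

trap = np.hstack([ones, D])          # all four dummies plus the trivial regressor
fixed = np.hstack([ones, D[:, 1:]])  # winter dummy dropped

print(ols_if_identified(trap, y))    # prints None: the dummy trap
print(ols_if_identified(fixed, y))   # four coefficients
```

A rank check is used here rather than a determinant because it is less sensitive to rounding. Statistical packages typically either report an error or silently drop a column in this situation.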