Aug 17

Violations of classical assumptions 1

Violations of classical assumptions

This is a large topic which requires several posts or several book chapters. During a conference in Sweden in 2010, a Swedish statistician asked me: "What is Econometrics, anyway? What tools does it use?" I said: "Among others, it uses linear regression." He said: "But linear regression is a general statistical tool, why do they say it's a part of Econometrics?" My answer was: "Yes, it's a general tool but the name Econometrics emphasizes that the motivation for its applications lies in Economics".

Both classical assumptions and their violations should be studied with this point in mind: What is the Economics and Math behind each assumption?

Violations of the first three assumptions

We consider the simple regression

(1) y_i=a+bx_i+e_i

Make sure to review the assumptions. Their numbering and names sometimes are different from what Dougherty's book has. In particular, most of the time I omit the following assumption:

A6. The model is linear in parameters and correctly specified.

When it is not linear in parameters, you can think of nonlinear alternatives. Instead of saying "correctly specified" I say "true model" when a "wrong model" is available.

A1. What if the existence condition is violated? If variance of the regressor is zero, the OLS estimator does not exist. The fitted line is supposed to be vertical, and you can regress x on y. Violation of the existence condition in case of multiple regression leads to multicollinearity, and that's where economic considerations are important.

A2. The convenience condition is called so because when it is violated, that is, the regressor is stochastic, there are ways to deal with this problem: finite-sample theory and large-sample theory.

A3. What if the errors in (1) have means different from zero? This question can be divided in two: 1) the means of the errors are the same: Ee_i=c\ne 0 for all i and 2) the means are different. Read the post about centering and see if you can come up with the answer for the first question. The means may be different because of omission of a relevant variable (can you do the math?). In the absence of data on such a variable, there is nothing you can do.
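If you want to check your answer to the first question numerically, here is a minimal sketch. The data, the true values a = 1, b = 2 and the constant c = 5 are all hypothetical, chosen only for illustration. It fits the same regression twice, once with the original errors and once with every error shifted by c, and prints how the two estimates change.

```python
import numpy as np

def ols_simple(x, y):
    """Return (intercept, slope) of the OLS fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Hypothetical data: true a = 1, b = 2, errors that sum to zero in the sample.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
e = np.array([0.3, -0.1, -0.4, 0.1, 0.1])
c = 5.0  # hypothetical common nonzero mean added to every error

a0, b0 = ols_simple(x, 1.0 + 2.0 * x + e)
a1, b1 = ols_simple(x, 1.0 + 2.0 * x + (e + c))

print("change in slope:", b1 - b0)
print("change in intercept:", a1 - a0)
```

Compare the two printed numbers with what the centering argument predicts.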

Violations of A4 and A5 will be treated later.

Jul 17

Running simple regression in Stata

Running simple regression in Stata is, well, simple. It's just a matter of a couple of clicks. Try to turn it into a small research project.

  1. Obtain descriptive statistics for your data (Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics). Look at all that stuff you studied in introductory statistics: units of measurement, means, minimums, maximums, and correlations. Knowing the units of measurement will be important for interpreting regression results; correlations will predict signs of coefficients, etc. In your report, don't just mechanically repeat all those measures; try to find and discuss something interesting.
  2. Visualize your data (Graphics > Twoway graph). On the graph you can observe outliers and discern possible nonlinearity.
  3. After running regression, report the estimated equation. It is called a fitted line and in our case looks like this: Earnings = -13.93+2.45*S (use descriptive names and not abstract X,Y). To see if the coefficient of S is significant, look at its p-value, which is smaller than 0.001. This tells us that at all levels of significance larger than or equal to 0.001 the null that the coefficient of S is zero is rejected, that is, the coefficient is significant. This follows from the definition of p-value. Nobody cares about significance of the intercept. Report also the p-value of the F statistic. It characterizes joint significance of all nontrivial regressors and is important in case of multiple regression. The last statistic to report is R squared.
  4. Think about possible variations of the model. Could the dependence of Earnings on S be nonlinear? What other determinants of Earnings would you suggest from among the variables in Dougherty's file?

Figure 1. Looking at data. For data, we use a scatterplot.



Figure 2. Running regression (Statistics > Linear models and related > Linear regression)

Jan 17

Regressions with stochastic regressors 2

Regressions with stochastic regressors 2: two approaches

We consider the slope estimator for the simple regression

y_i=a+bx_i+e_i

assuming that x_i is stochastic.

First approach: the sample size is fixed. The unbiasedness and efficiency conditions are replaced by their analogs conditioned on x. The outcome is that the slope estimator is unbiased and its variance is the average of the variance that we have in case of a deterministic regressor. See the details.

Second approach: the sample size goes to infinity. The main tools used are the properties of probability limits and laws of large numbers. The outcome is that, in the limit, the sample characteristics are replaced by their population cousins and the slope estimator is consistent. This is what we focus on here.

A brush-up on convergence in probability

Review the intuition and formal definition. This is the summary:

Fact 1. Convergence in probability (which applies to sequences of random variables) is a generalization of the notion of convergence of number sequences. In particular, if \{a_n\} is a numerical sequence that converges to a number a, \lim_{n\rightarrow\infty}a_n=a, then, treating a_n as a random variable, we have convergence in probability {\text{plim}}_{n\rightarrow\infty}a_n=a.

Fact 2. For those who are familiar with the theory of limits of numerical sequences, from the previous fact it should be clear that convergence in probability preserves arithmetic operations. That is, for any sequences of random variables \{X_n\},\{Y_n\} such that limits {\text{plim}}X_n and {\text{plim}}Y_n exist, we have

\text{plim}(X_n\pm Y_n)=\text{plim}X_n\pm\text{plim}Y_n, \text{plim}(X_n\times Y_n)=\text{plim}X_n\times\text{plim}Y_n,

and if \text{plim}Y_n\ne 0 then

\text{plim}(X_n/ Y_n)=\text{plim}X_n/\text{plim}Y_n.

This makes convergence in probability very handy. Convergence in distribution doesn't have such properties.

A brush-up on laws of large numbers

See the site map for several posts about this. Here we apply the Chebyshev inequality to prove the law of large numbers for sample means. A generalization is given in the Theorem at the end of that post. Here is a further intuitive generalization:

Normally, unbiased sample characteristics converge in probability to their population counterparts.

Example 1. We know that the sample variance s^2=\frac{1}{n-1}\sum(X_i-\bar{X})^2 unbiasedly estimates the population variance \sigma^2: Es^2=\sigma^2. The intuitive generalization says that then

(1) \text{plim}s^2=\sigma^2.

Here I argue that, for the purposes of obtaining some identities from the general properties of means, instead of the sample variance it's better to use the variance defined by Var_u(X)=\frac{1}{n}\sum(X_i-\bar{X})^2 (with division by n instead of n-1). Using Facts 1 and 2 we get from (1) that

(2) \text{plim}Var_u(X)=\text{plim}\frac{n-1}{n}\frac{1}{n-1}\sum(X_i-\bar{X})^2

=\lim\frac{n-1}{n}\text{plim}s^2=\sigma^2=Var(X)

(sample variance converges in probability to population variance). Here we use \lim(1-\frac{1}{n})=1.
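The relation between the two variance definitions can be checked numerically. The following is a small sketch on a made-up sample (the numbers are hypothetical); it relies on NumPy's `ddof` argument, which switches between division by n and by n-1.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])  # hypothetical sample
n = len(x)

s2 = np.var(x, ddof=1)     # sample variance s^2, division by n-1
var_u = np.var(x, ddof=0)  # Var_u(X), division by n

# Var_u(X) = ((n-1)/n) * s^2 holds exactly for any sample.
print(var_u, (n - 1) / n * s2)

# The factor (n-1)/n tends to 1, which is why both have the same plim.
for m in (10, 1000, 100000):
    print(m, (m - 1) / m)
```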

Example 2. Similarly, sample covariance converges in probability to population covariance:

(3) \text{plim}Cov_u(X,Y)=Cov(X,Y)

where by definition Cov_u(X,Y)=\frac{1}{n}\sum(X_i-\bar{X})(Y_i-\bar{Y}).

Proving consistency of the slope estimator

Here (see equation (5)) I derived the representation of the OLS estimator of the slope

\hat{b}=b+\frac{Cov_u(X,e)}{Var_u(X)}.

Using preservation of arithmetic operations for convergence in probability, we get

(4) \text{plim}\hat{b}=\text{plim}\left[b+\frac{Cov_u(X,e)}{Var_u(X)}\right]=\text{plim}b+\text{plim}\frac{Cov_u(X,e)}{Var_u(X)}

=b+\frac{\text{plim}Cov_u(X,e)}{\text{plim}Var_u(X)}=b+\frac{Cov(X,e)}{Var(X)}.

In the last line we used (2) and (3). From (4) we see what conditions should be imposed for the slope estimator to converge to a spike at the true slope:

Var(X)\neq 0 (existence condition)


Cov(X,e)=0 (consistency condition).

Under these conditions, we have \text{plim}\hat{b}=b (this is called consistency).
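Consistency can be illustrated by a simulation. This is only a sketch under assumptions of my choosing: the true values a = 1, b = 2, standard normal regressor and errors, and the seed are all hypothetical. The errors are drawn independently of the regressor, so Cov(X,e)=0 holds, and Var(X)=1 is nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
a, b = 1.0, 2.0                 # hypothetical true intercept and slope

def slope_hat(x, y):
    """OLS slope Cov_u(x, y) / Var_u(x)."""
    return np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)

estimates = {}
for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)  # stochastic regressor with Var(X) = 1
    e = rng.normal(size=n)  # errors independent of x, so Cov(X, e) = 0
    estimates[n] = slope_hat(x, a + b * x + e)
    print(n, estimates[n])
```

As n grows, the printed estimates should settle near b. Replacing e by something correlated with x (for example e + 0.5 * x) is a quick way to see the consistency condition fail.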

Conclusion. In a way, the second approach is technically simpler than the first.

Jan 17

Regressions with stochastic regressors 1

Regressions with stochastic regressors 1: applying conditioning

The convenience condition states that the regressor in simple regression is deterministic. Here we look at how this assumption can be avoided using conditional expectation and variance. General idea: you check which parts of the proofs don't go through with stochastic regressors and modify the assumptions accordingly. It happens that only assumptions concerning the error term should be replaced by their conditional counterparts.

Unbiasedness in case of stochastic regressors

We consider the slope estimator for the simple regression

y_i=a+bx_i+e_i

assuming that x_i is stochastic.

First grab the critical representation (6) derived here:

(1) \hat{b}=b+\frac{1}{n}\sum a_i(x)e_i, where a_i(x)=(x_i-\bar{x})/Var_u(x).

The usual linearity of means E(aX + bY) = aEX + bEY applied to prove unbiasedness doesn't work because now the coefficients are stochastic (in other words, they are not constant). But we have generalized linearity which for the purposes of this proof can be written as

(2) E(a(x)S+b(x)T|x)=a(x)E(S|x)+b(x)E(T|x).

Let us replace the unbiasedness condition by its conditional version:

A3'. Unbiasedness condition: E(e_i|x)=0.

Then (1) and (2) give

(3) E(\hat{b}|x)=b+\frac{1}{n}\sum a_i(x)E(e_i|x)=b,

which can be called conditional unbiasedness. Next, applying the law of iterated expectations E[E(S|x)]=ES, we obtain unconditional unbiasedness:

E\hat{b}=E[E(\hat{b}|x)]=Eb=b.

Variance in case of stochastic regressors

As one can guess, we have to replace efficiency conditions by their conditional versions:

A4'. Conditional uncorrelatedness of errors. Assume that E(e_ie_j|x)=0 for all i\ne j.

A5'. Conditional homoscedasticity. All errors have the same conditional variances: E(e_i^2|x)=\sigma^2 for all i (\sigma^2 is a constant).

Now we can derive the conditional variance expression, using properties from this post:

Var(\hat{b}|x)=Var(b+\frac{1}{n}\sum_i a_i(x)e_i|x) (dropping a constant doesn't affect variance)

=Var(\frac{1}{n}\sum_i a_i(x)e_i|x) (for conditionally uncorrelated variables, conditional variance is additive)

=\sum_i Var(\frac{1}{n}a_i(x)e_i|x) (conditional variance is homogeneous of degree 2)

=\frac{1}{n^2}\sum_i a_i^2(x)Var(e_i|x) (applying conditional homoscedasticity)

=\frac{1}{n^2}\sum_i a_i^2(x)\sigma^2 (plugging a_i(x))

=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x) (using the notation of sample variance)

(4) =\frac{1}{n}Var_u(x)\sigma^2/Var^2_u(x)=\sigma^2/(nVar_u(x)).

Finally, using the law of total variance Var(S)=Var(E(S|x))+E[Var(S|x)] and equations (3) and (4) we obtain

(5) Var(\hat{b})=Var(b)+E[\sigma^2/(nVar_u(x))]=\frac{\sigma^2}{n}E[\frac{1}{Var_u(x)}].


Replacing the three assumptions about the error by their conditional counterparts allows us to obtain almost perfect analogs of the usual properties of OLS estimators: the usual (unconditional) unbiasedness plus the estimator variance, in which the part containing the regressor should be averaged, to account for its randomness. If you think that solving the problem of stochastic regressors requires nothing more than application of a couple of mathematical tricks, I agree with you.

Jan 17

OLS estimator variance

Assumptions about simple regression

We consider the simple regression

(1) y_i=a+bx_i+e_i

Here we derived the OLS estimators of the intercept and slope:

(2) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(3) \hat{a}=\bar{y}-\hat{b}\bar{x}.
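Formulas (2) and (3) translate directly into code. Below is a minimal sketch on hypothetical data (the numbers are invented for illustration); as a cross-check it compares the result with NumPy's own least-squares line from `np.polyfit`.

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([8.0, 10.0, 12.0, 12.0, 14.0, 16.0, 18.0])
y = np.array([7.5, 9.0, 14.0, 16.5, 19.0, 27.0, 30.5])

cov_u = np.mean((x - x.mean()) * (y - y.mean()))  # Cov_u(x, y), division by n
var_u = np.mean((x - x.mean()) ** 2)              # Var_u(x), division by n

b_hat = cov_u / var_u                # formula (2)
a_hat = y.mean() - b_hat * x.mean()  # formula (3)

# np.polyfit with degree 1 returns (slope, intercept) of the least-squares line.
slope_np, intercept_np = np.polyfit(x, y, 1)
print(b_hat, slope_np)
print(a_hat, intercept_np)
```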

A1. Existence condition. Since division by zero is not allowed, for (2) to exist we require Var_u(x)\ne 0. If this condition is not satisfied, then there is no variance in x and all observed points are on a vertical line.

A2. Convenience condition. The regressor x is deterministic. This condition is imposed to be able to apply the properties of expectation, see equation (7) in this post. The time trend and dummy variables are examples of deterministic regressors. However, most real-life regressors are stochastic. Modifying the theory in order to cover stochastic regressors is the subject of two posts: finite-sample theory and large-sample theory.

A3. Unbiasedness condition: Ee_i=0. This is the main assumption that makes sure that OLS estimators are unbiased, see equation (7) in this post.

Unbiasedness is not enough

Unbiasedness characterizes the quality of an estimator, see the intuitive explanation. Unfortunately, unbiasedness is not enough to choose the best estimator because of nonuniqueness: usually, if there is one unbiased estimator of a parameter, then there are infinitely many unbiased estimators of the same parameter. For example, we know that the sample mean \bar{X} unbiasedly estimates the population mean: E\bar{X}=EX. Since EX_1=EX (X_1 is the first observation), we can easily construct an infinite family of unbiased estimators Y=(\bar{X}+aX_1)/(1+a), assuming a\ne -1. Indeed, using linearity of expectation EY=(E\bar{X}+aEX_1)/(1+a)=EX.
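The family Y=(\bar{X}+aX_1)/(1+a) can be looked at through its weights: Y is a weighted sum of the observations. The sketch below (the helper `weights` and the values of n and a are hypothetical) prints the sum of the weights and the sum of their squares. The second number equals Var(Y)/\sigma^2 under the additional assumption, not made in the text above, that the observations are uncorrelated with common variance \sigma^2.

```python
import numpy as np

def weights(n, a):
    """Weights w_i such that Y = sum_i w_i X_i for Y = (Xbar + a*X_1)/(1 + a)."""
    w = np.full(n, 1.0 / n)  # weights of the sample mean
    w[0] += a                # the first observation gets extra weight a
    return w / (1.0 + a)

n = 5
for a in (0.0, 0.5, 2.0, -0.5):
    w = weights(n, a)
    # Sum of weights is 1 for every a, which is the unbiasedness of Y.
    # Sum of squared weights is smallest at a = 0, i.e. for the sample mean.
    print(a, w.sum(), np.sum(w ** 2))
```

This is one way to see why unbiasedness alone does not single out an estimator, while variance does discriminate among the members of the family.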

Variance is another measure of an estimator quality: to have a lower spread of estimator values, among competing estimators we choose the one which has the lowest variance. Knowing the estimator variance allows us to find the z-score and use statistical tables.

Slope estimator variance

It is not difficult to find the variance of the slope estimator using representation (6) derived here:

\hat{b}=b+\frac{1}{n}\sum a_ie_i

where a_i=(x_i-\bar{x})/Var_u(x).

Don't try to apply directly the definition of variance at this point, because there will be a square of a sum, which leads to a double sum upon squaring. We need two new assumptions.

A4. Uncorrelatedness of errors. Assume that Cov(e_i,e_j)=0 for all i\ne j (errors from different equations (1) are uncorrelated). Note that because of the unbiasedness condition, this assumption is equivalent to Ee_ie_j=0 for all i\ne j. This assumption is likely to be satisfied if we observe consumption patterns of unrelated individuals.

A5. Homoscedasticity. All errors have the same variance: Var(e_i)=\sigma^2 for all i. Again, because of the unbiasedness condition, this assumption is equivalent to Ee_i^2=\sigma^2 for all i.

Now we can derive the variance expression, using properties from this post:

Var(\hat{b})=Var(b+\frac{1}{n}\sum_i a_ie_i) (dropping a constant doesn't affect variance)

=Var(\frac{1}{n}\sum_i a_ie_i) (for uncorrelated variables, variance is additive)

=\sum_i Var(\frac{1}{n}a_ie_i) (variance is homogeneous of degree 2)

=\frac{1}{n^2}\sum_i a_i^2Var(e_i) (applying homoscedasticity)

=\frac{1}{n^2}\sum_i a_i^2\sigma^2 (plugging a_i)

=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x) (using the notation of sample variance)

=\frac{1}{n}Var_u(x)\sigma^2/Var^2_u(x)=\sigma^2/(nVar_u(x)).

Note that canceling out two variances in the last line is obvious. It is not so obvious for some if instead of the short notation for variances you use summation signs. The case of the intercept variance is left as an exercise.
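For those who prefer to see the cancellation in numbers, here is a short check that \frac{1}{n^2}\sum_i a_i^2\sigma^2 coincides with \sigma^2/(nVar_u(x)). The regressor values and \sigma^2 are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])  # hypothetical deterministic regressor
sigma2 = 3.0                               # hypothetical error variance
n = len(x)

var_u = np.mean((x - x.mean()) ** 2)  # Var_u(x)
a_i = (x - x.mean()) / var_u          # coefficients a_i from the representation

long_form = np.sum(a_i ** 2) * sigma2 / n ** 2  # (1/n^2) * sum_i a_i^2 * sigma^2
short_form = sigma2 / (n * var_u)               # sigma^2 / (n * Var_u(x))
print(long_form, short_form)
```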


The above assumptions A1-A5 are called classical. It is necessary to remember their role in derivations because a considerable part of Econometrics is devoted to deviations from classical assumptions. Once you have a certain assumption violated, you should expect the corresponding estimator property invalidated. For example, if Ee_i\ne 0, you should expect the estimators to be biased. If any of A4-A5 is not true, the formula we have derived

Var(\hat{b})=\sigma^2/(nVar_u(x))

will not hold. Besides, the Gauss-Markov theorem that the OLS estimators are efficient will not hold (this will be discussed later). The pair A4-A5 can be called an efficiency condition.

Jan 17

Review of Agresti and Franklin

Review of Agresti and Franklin "Statistics: The Art and Science of Learning from Data", 3rd edition

Who is this book for?

On the Internet you can find both positive and negative reviews. The ones that I saw on Goodreads.com and Amazon.com do not say much about the pros and cons. Here I try to be more specific.

The main limitation of the book is that it adheres to the College Board statement that "it is a one semester, introductory, non-calculus-based, college course in statistics". Hence, there are no derivations and no links between formulas. You will not find explanations of why Statistics works. As a result, there is too much emphasis on memorization. After reading the book, you will likely not have an integral view of statistical methods.

I have seen students who understand such texts well. Generally, they have an excellent memory and better-than-average imagination. But such students are better off reading more advanced books. A potential reader has to lower his/her expectations. I imagine a person who is not interested in taking a more advanced Stats course later. The motivation of that person would be: a) to understand the ways Statistics is applied and/or b) to pass AP Stats just because it is a required course. The review is written on the premise that this is the intended readership.

What I like

  1. The number and variety of exercises. This is good for an instructor who teaches large classes. Having authored several books, I can assure you that inventing many exercises is the most time-consuming part of this business.
  2. The authors have come up with good visual embellishments of graphs and tables summarized in "A Guide to Learning From the Art in This Text" at the end of the book.
  3. The book has generous left margins. Sometimes they contain reminders about the past material. Otherwise, the reader can use them for notes.
  4. MINITAB is prohibitively expensive, but the Student Edition of MINITAB is provided on the accompanying CD.

What I don't like

  1. I counted about 140 high-resolution photos that have nothing to do with the subject matter. They hardly add to the educational value of the book but certainly add to its cost. This bad trend in introductory textbooks is fueled to a considerable extent by Pearson Education.
  2. 800+ pages, even after slashing all appendices and unnecessary illustrations, is a lot of reading for one semester. Even if you memorize all of them, during the AP test it will be difficult for you to pull out of your memory exactly that page you need to answer exactly this particular question.
  3. In an introductory text, one has to refrain from giving too much theory. Still, I don't like some choices made by the authors. The learning curve is flat. As a way of gentle introduction to algebra, verbal descriptions of formulas are normal. But sticking to verbal descriptions until p. 589 is too much. This reminds me of a train trip in Kazakhstan. You enter the steppe through the western border and two days later you see the same endless steppe, just the train station is different.
  4. At the theoretical level, many topics are treated superficially. You can find a lot of additional information in my posts named "The pearls of AP Statistics". Here is the list of most important additions: regression and correlation should be decoupled; the importance of sampling distributions is overstated; probability is better explained without reference to the long run; the difference between the law of large numbers and central limit theorem should be made clear; the rate of convergence in the law of large numbers is not that fast; the law of large numbers is intuitively simple; the uniform distribution can also be made simple; to understand different charts, put them side by side; the Pareto chart is better understood as a special type of a histogram; instead of using the software on the provided CD, try to simulate in Excel yourself.
  5. Using outdated Texas Instruments calculators contradicts the American Statistical Association recommendation to "Use technology for developing concepts and analyzing data".


If I wanted to save time and didn't intend to delve into theory, I would prefer to read a concise book that directly addresses questions given on the AP test. However, to decide for yourself, read the Preface to see how much imagination has been put into the book, and you may want to read it.

Dec 16

Multiple regression through the prism of dummy variables

Agresti and Franklin on p.658 say: The indicator variable for a particular category is binary. It equals 1 if the observation falls into that category and it equals 0 otherwise. I say: For most students, this is not clear enough.

Problem statement

Figure 1. Residential power consumption in 2014 and 2015. Source: http://www.eia.gov/electricity/data.cfm

Residential power consumption in the US has a seasonal pattern. Heating in winter and cooling in summer cause the differences. We want to capture the dependence of residential power consumption PowerC on the season.

Visual approach to dummy variables

Seasons of the year are categorical variables. We have to replace them with quantitative variables, to be able to use them in any mathematical procedure that involves arithmetic operations. To this end, we define a dummy variable (indicator) D_{win} for winter such that it equals 1 in winter and 0 in any other period of the year. The dummies D_{spr},\ D_{sum},\ D_{aut} for spring, summer and autumn are defined similarly. We provide two visualizations assuming monthly observations.

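Table 1. Tabular visualization of dummies

| Month | D_{win} | D_{spr} | D_{sum} | D_{aut} | D_{win}+D_{spr}+D_{sum}+D_{aut} |
|---|---|---|---|---|---|
| December | 1 | 0 | 0 | 0 | 1 |
| January | 1 | 0 | 0 | 0 | 1 |
| February | 1 | 0 | 0 | 0 | 1 |
| March | 0 | 1 | 0 | 0 | 1 |
| April | 0 | 1 | 0 | 0 | 1 |
| May | 0 | 1 | 0 | 0 | 1 |
| June | 0 | 0 | 1 | 0 | 1 |
| July | 0 | 0 | 1 | 0 | 1 |
| August | 0 | 0 | 1 | 0 | 1 |
| September | 0 | 0 | 0 | 1 | 1 |
| October | 0 | 0 | 0 | 1 | 1 |
| November | 0 | 0 | 0 | 1 | 1 |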

Figure 2. Graphical visualization of D_{spr}

The first idea may be wrong

The first thing that comes to mind is to regress PowerC on dummies as in

(1) PowerC=a+bD_{win}+cD_{spr}+dD_{sum}+eD_{aut}+error.

Not so fast. To see the problem, let us rewrite (1) as

(2) PowerC=a\times 1+bD_{win}+cD_{spr}+dD_{sum}+eD_{aut}+error.

This shows that, in addition to the four dummies, there is a fifth variable, which equals 1 across all observations. Let us denote it T (for Trivial). Table 1 shows that

(3) T=D_{win}+D_{spr}+ D_{sum}+D_{aut}.

This makes the next definition relevant. Regressors x_1,...,x_k are called linearly dependent if one of them, say, x_1, can be expressed as a linear combination of the others: x_1=a_2x_2+...+a_kx_k. In case (3), all coefficients a_i equal one, so we have linear dependence. Using (3), let us replace T in (2). The resulting equation is rearranged as

(4) PowerC=(a+b)D_{win}+(a+c)D_{spr}+(a+d)D_{sum}+(a+e)D_{aut}+error.

Now we see what the problem is. When regressors are linearly dependent, the model is not uniquely specified. (1) and (4) are two different representations of the same model.
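Linear dependence shows up as a rank deficiency of the matrix of regressors. The sketch below builds the twelve monthly rows of Table 1 (the array names are hypothetical) and compares the rank with the number of columns, first with all four dummies plus the trivial regressor and then with one dummy (winter, say) dropped.

```python
import numpy as np

# Season index for December..November: 0 = winter, 1 = spring, 2 = summer, 3 = autumn.
season = np.repeat(np.arange(4), 3)
# Columns D_win, D_spr, D_sum, D_aut, one row per month.
dummies = (season[:, None] == np.arange(4)).astype(float)
T = np.ones((12, 1))  # trivial regressor

X_trap = np.hstack([T, dummies])       # five regressors, as in (2)
X_ok = np.hstack([T, dummies[:, 1:]])  # winter dummy dropped, four regressors

print(np.allclose(dummies.sum(axis=1), 1.0))  # identity (3) holds row by row
print(np.linalg.matrix_rank(X_trap))          # 4, although there are 5 columns
print(np.linalg.matrix_rank(X_ok))            # 4, equal to the number of columns
```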

What is the way out?

If regressors are linearly dependent, drop them one after another until you get linearly independent ones. For example, dropping the winter dummy, we get

(5) PowerC=a+cD_{spr}+dD_{sum}+eD_{aut}+error.

Here is the estimation result for the two-year data in Figure 1:

PowerC=128176-27380D_{spr}+5450D_{sum}-22225D_{aut}.

This means that:

PowerC=128176 in winter, PowerC=128176-27380 in spring,

PowerC=128176+5450 in summer, and PowerC=128176-22225 in autumn.

It is revealing that cooling requires more power than heating. However, the summer coefficient is not significant. Here is the Excel file with the data and estimation result.

The category that has been dropped is called a base (or reference) category. Thus, the intercept in (5) measures power consumption in winter. The dummy coefficients in (5) measure deviations of power consumption in respective seasons from that in winter.
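The interpretation of the intercept and the dummy coefficients can be verified on any data set. The following sketch uses invented monthly figures (not the EIA data) and NumPy's least-squares solver: with winter as the base category, the intercept should equal the winter mean and each dummy coefficient the difference between that season's mean and the winter mean.

```python
import numpy as np

# Hypothetical monthly consumption figures (December..November), not the EIA data.
power = np.array([130.0, 135.0, 125.0, 100.0, 98.0, 105.0,
                  128.0, 140.0, 137.0, 104.0, 101.0, 110.0])
season = np.repeat(np.arange(4), 3)  # 0 = winter, 1 = spring, 2 = summer, 3 = autumn

# Regressors of model (5): constant plus spring, summer and autumn dummies.
X = np.column_stack([np.ones(12)] + [(season == s).astype(float) for s in (1, 2, 3)])
coef, *_ = np.linalg.lstsq(X, power, rcond=None)

means = np.array([power[season == s].mean() for s in range(4)])
print(coef[0], means[0])                # intercept vs. winter mean
print(coef[1:], means[1:] - means[0])   # dummy coefficients vs. deviations from winter
```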

Here is the question I ask my students

We want to see how beer consumption BeerC depends on gender and income Inc. Let M and F denote the dummies for males and females, resp. Correct the following model and interpret the resulting coefficients:


Final remark

When a researcher includes all categories plus the trivial regressor, he/she falls into what is called a dummy trap. The problem of linear dependence among regressors is usually discussed under the heading of multiple regression. But since the trivial regressor is present in simple regression too, it might be a good idea to discuss it earlier.

Linear dependence/independence of regressors is an exact condition for existence of the OLS estimator. That is, if regressors are linearly dependent, then the OLS estimator doesn't exist, in which case the question about its further properties doesn't make sense. If, on the other hand, regressors are linearly independent, then the OLS estimator exists, and further properties can be studied, such as unbiasedness, variance and efficiency.

Dec 16

Nonparametric estimation for AP Stats

Nonparametric estimation is the right topic for expanding the Stats agenda


Figure 1. Dependence of income on age

For the last several years I have been doing research in nonparametric estimation. It is intellectually rewarding and it is the best tool to show Stats students the usefulness of Statistics. Agresti and Franklin have a chapter on nonparametric estimation. However, the choice of the topics (Wilcoxon test and Kruskal-Wallis test) is unfortunate. These two tests are about comparing means from two samples. They provide just numbers (corresponding statistics), which is not very appealing because the students see just another solution to the familiar problem.

Nonparametric technique is best in nonlinear curve fitting, and this is its selling point because it is VISUAL. The following examples explain the difference between parametric and nonparametric estimation.

Example 1

Suppose we want to use simple regression to estimate dependence of consumption on income. This is a parametric model, with two parameters (intercept and slope). Suppose the fitted line is Consumption=0.1+0.9\times Income (I just put plausible numbers). The slope 0.9 is interpreted as the marginal propensity to consume and can be used in economic modeling to find the budget multiplier. The advantage of parametric estimation is that often estimated parameters have economic meaning.

Example 2

This example has been taken from Lecture Notes by John Fox. Now let us look at dependence of income on age. It is clear that income is low for young people, then rises with age until middle age and declines after retirement. The dependence is obviously nonlinear and, a priori, no guesses can be made about the shape of the curve.

Figure 1 shows the median and quartiles of the distribution of income from wages and salaries as a function of single years of age. The data are taken from the 1990 U.S. Census one-percent Public Use Microdata Sample, and represent 1.24 million observations. Income starts increasing at around 18 years, tops out at 48 and declines till the age of 65. The fitted line is approximately linear until the age of 24, so young people enjoy the highest and constant income growth rate.

Example 3


Figure 2. Density of return on Apple stock


Figure 3. Density of return on MA stock

What would have been the better 5-year investment: Apple or MasterCard? Figure 2 shows that the density of return on Apple stock has a negative mode. The density of return on MasterCard has the mode close to zero. This tells us that MasterCard would be better. Indeed, the annual return on Apple is 18%, while on MasterCard it is 29% (over the last 5 years). Nonparametric estimates of densities (kernel density estimates) are used by financial analysts to simulate stock prices to predict their future movements.

Remark. For simple statistical tasks I recommend Eviews student version for two reasons. 1) It has excellent Help. When I want my students to understand just the essence and avoid proofs, I tell them to read Eviews Help. 2) The student version is just $39.95. Figures 2 and 3 have been produced using Eviews.

Nov 16

The pearls of AP Statistics 35

The disturbance term: To hide or not to hide? In an introductory Stats course, some part of the theory should be hidden. Where to draw the line is an interesting question. Here I discuss the ideas that look definitely bad to me.

How disturbing is the disturbance term?

In the main text, Agresti and Franklin never mention the disturbance term u_i in the regression model

(1) y_i=a+bx_i+u_i

(it is hidden in Exercise 12.105). Instead, they write the equation for the mean \mu_y=a+bx that follows from (1) under the standard assumption Eu_i=0. This would be fine if the exposition stopped right there. However, one has to explain the random source of variability in y_i. On p. 583 the authors say: "The probability distribution of y values at a fixed value of x is a conditional distribution. At each value of x, there is a conditional distribution of y values. A regression model also describes these distributions. An additional parameter σ describes the standard deviation of each conditional distribution."

Further, Figure 12.4 illustrates distributions of errors at different points and asks: "What do the bell-shaped curves around the line at x = 12 and at x = 16 represent?"


Figure 12.4. Illustration of error distributions

Besides, explanations of heteroscedasticity and of the residual sum of squares are impossible without explicitly referring to the disturbance term.

Attributing a regression property to the correlation is not good

On p.589 I encountered a statement that puzzled me: "An important property of the correlation is that at any particular x value, the predicted value of y is relatively closer to its mean than x is to its mean. If an x value is a certain number of standard deviations from its mean, then the predicted y is r times that many standard deviations from its mean."

Firstly, this is a verbal interpretation of some formula, so why not give the formula itself? How good must a student be to guess what is behind the verbal formulation?

Secondly, as I stressed in this post, the correlation coefficient does not entail any prediction about the magnitude of a change in one variable caused by a change in another. The above statement about the predicted value of y must be a property of regression. Attributing a regression property to the correlation is not in the best interests of those who want to study Stats at a more advanced level.

Thirdly, I felt challenged to see something new in the area I thought I knew everything about. So here is the derivation. By definition, the fitted value is

(2) \hat{y_i}=\hat{a}+\hat{b}x_i

where the hats stand for estimators. The fitted line passes through the point (\bar{x},\bar{y}):

(3) \bar{y}=\hat{a}+\hat{b}\bar{x}

(this will be proved elsewhere). Subtracting (3) from (2) we get

(4) \hat{y_i}-\bar{y}=\hat{b}(x_i-\bar{x})

(using equation (4) from this post)

=\rho\frac{\sigma(y)}{\sigma(x)}(x_i-\bar{x}).

It is helpful to rewrite (4) in a more symmetric form:

(5) \frac{\hat{y_i}-\bar{y}}{\sigma(y)}=\rho\frac{x_i-\bar{x}}{\sigma(x)}.

This is the equation we need. Suppose an x value is a certain number of standard deviations from its mean: x_i-\bar{x}=k\sigma(x). Plug this into (5) to get \hat{y_i}-\bar{y}=\rho k\sigma(y), that is, the predicted y is \rho times that many standard deviations from its mean.
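Identity (5) is easy to confirm numerically. Here is a sketch on hypothetical data; it assumes that \sigma(x), \sigma(y) are the standard deviations with division by n (NumPy's default), matching Var_u, and takes \rho to be the sample correlation coefficient.

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([8.0, 10.0, 12.0, 12.0, 14.0, 16.0, 18.0])
y = np.array([7.5, 9.0, 14.0, 16.5, 19.0, 27.0, 30.5])

b_hat = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)  # OLS slope
y_fit = y.mean() + b_hat * (x - x.mean())                     # fitted values, from (4)

rho = np.corrcoef(x, y)[0, 1]            # sample correlation coefficient
lhs = (y_fit - y.mean()) / np.std(y)     # left side of (5)
rhs = rho * (x - x.mean()) / np.std(x)   # right side of (5)
print(np.allclose(lhs, rhs))
```

Since the absolute value of \rho does not exceed 1, the standardized predicted y is never farther from zero than the standardized x, which is the content of the statement on p.589.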