Aug 17

Violations of classical assumptions 1

Violations of classical assumptions

This is a large topic which requires several posts or several book chapters. During a conference in Sweden in 2010, a Swedish statistician asked me: "What is Econometrics, anyway? What tools does it use?" I said: "Among others, it uses linear regression." He said: "But linear regression is a general statistical tool, why do they say it's a part of Econometrics?" My answer was: "Yes, it's a general tool but the name Econometrics emphasizes that the motivation for its applications lies in Economics".

Both classical assumptions and their violations should be studied with this point in mind: What is the Economics and Math behind each assumption?

Violations of the first three assumptions

We consider the simple regression

(1) y_i=a+bx_i+e_i

Make sure to review the assumptions. Their numbering and names sometimes are different from what Dougherty's book has. In particular, most of the time I omit the following assumption:

A6. The model is linear in parameters and correctly specified.

When it is not linear in parameters, you can think of nonlinear alternatives. Instead of saying "correctly specified" I say "true model" when a "wrong model" is available.

A1. What if the existence condition is violated? If the variance of the regressor is zero, the OLS estimator does not exist. All observations then lie on a vertical line, so the fitted line is supposed to be vertical, and you can regress x on y instead. Violation of the existence condition in case of multiple regression leads to multicollinearity, and that's where economic considerations are important.
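A small numerical illustration of the simple-regression case may help. The sketch below uses made-up data and a hypothetical helper `ols_slope` (neither comes from the post); it only shows that the slope formula breaks down when the regressor has zero variance, while the reverse regression is still computable.

```python
def ols_slope(x, y):
    """Slope estimator Cov_u(x, y) / Var_u(x), both computed with 1/n weights."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n
    if var_x == 0:
        # Existence condition violated: division by Var_u(x) is impossible.
        raise ValueError("Var_u(x) = 0: the OLS slope does not exist")
    return cov_xy / var_x


# Illustrative data (an assumption, not taken from the post).
y = [1.0, 2.5, 2.0, 4.0]
x_ok = [1.0, 2.0, 3.0, 4.0]
x_constant = [3.0, 3.0, 3.0, 3.0]  # no variation in the regressor

slope_ok = ols_slope(x_ok, y)

try:
    ols_slope(x_constant, y)
    failed = False
except ValueError:
    failed = True

# Swapping the roles works because y does vary.
slope_x_on_y = ols_slope(y, x_constant)

print(slope_ok, failed, slope_x_on_y)
```

The reverse regression returns a zero slope here, which is just another way of saying that x does not move with y.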

A2. The convenience condition is so called because when it is violated, that is, when the regressor is stochastic, there are ways to deal with the problem: finite-sample theory and large-sample theory.

A3. What if the errors in (1) have means different from zero? This question can be divided in two: 1) the means of the errors are the same: Ee_i=c\ne 0 for all i and 2) the means are different. Read the post about centering and see if you can come up with the answer for the first question. The means may be different because of omission of a relevant variable (can you do the math?). In the absence of data on such a variable, there is nothing you can do.
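For readers who have already worked out the first case on paper, here is one way to check the answer numerically. This is only a simulation sketch: the parameter values, the normal errors, the number of replications and the helper `ols` are all assumptions made for illustration.

```python
import random


def ols(x, y):
    """Return (a_hat, b_hat) for simple regression, using 1/n weights."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n
    b_hat = cov_xy / var_x
    return y_bar - b_hat * x_bar, b_hat


rng = random.Random(0)
a, b, c = 1.0, 2.0, 3.0               # c is the common nonzero mean of the errors
x = [float(i) for i in range(1, 21)]  # deterministic regressor

reps = 5000
a_total = b_total = 0.0
for _ in range(reps):
    # Errors are drawn with mean c instead of zero.
    y = [a + b * xi + rng.gauss(c, 1.0) for xi in x]
    a_hat, b_hat = ols(x, y)
    a_total += a_hat
    b_total += b_hat

mean_a_hat = a_total / reps
mean_b_hat = b_total / reps
print(mean_a_hat, mean_b_hat)
```

In this setup the average slope estimate stays close to b, while the average intercept estimate settles near a + c: the constant mean of the errors cannot be told apart from the intercept.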

Violations of A4 and A5 will be treated later.

Sep 16

Proving unbiasedness of OLS estimators

Proving unbiasedness of OLS estimators - the do's and don'ts


Here we derived the OLS estimators. To distinguish between sample and population means, the variance and covariance in the slope estimator will be provided with the subscript u (for "uniform", see the rationale here).

(1) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(2) \hat{a}=\bar{y}-\hat{b}\bar{x}.

These equations are used in conjunction with the model

(3) y_i=a+bx_i+e_i

where we remember that

(4) Ee_i=0 for all i.

Since (2) depends on (1), we have to start with unbiasedness of the slope estimator.

Using the right representation is critical

We have to show that E\hat{b}=b.

Step 1. Don't apply the expectation directly to (1). Do separate in (1) what is supposed to be E\hat{b}. To reveal the role of errors in (1), plug (3) in (1) and use linearity of covariance with respect to each argument when the other argument is fixed:

\hat{b}=\frac{Cov_u(x,a+bx+e)}{Var_u(x)}=\frac{Cov_u(x,a)+bCov_u(x,x)+Cov_u(x,e)}{Var_u(x)}.

Here Cov_u(x,a)=0 (a constant is uncorrelated with any variable), Cov_u(x,x)=Var_u(x) (covariance of x with itself is its variance), so

(5) \hat{b}=\frac{bVar_u(x)+Cov_u(x,e)}{Var_u(x)}=b+\frac{Cov_u(x,e)}{Var_u(x)}.

Equation (5) is the mean-plus-deviation-from-the-mean decomposition. Many students think that Cov_u(x,e)=0 because of (4). No! The covariance here does not involve the population mean.

Step 2. It pays to make one more step to develop (5). Write out the numerator in (5) using summation:

Cov_u(x,e)=\frac{1}{n}\sum(x_i-\bar{x})(e_i-\bar{e}).

Don't write out Var_u(x)! Presence of two summations confuses many students.

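Multiplying parentheses and using the fact that \sum(x_i-\bar{x})=n\bar{x}-n\bar{x}=0 we have

Cov_u(x,e)=\frac{1}{n}\sum(x_i-\bar{x})e_i-\bar{e}\frac{1}{n}\sum(x_i-\bar{x})=\frac{1}{n}\sum(x_i-\bar{x})e_i.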

To simplify calculations, denote a_i=(x_i-\bar{x})/Var_u(x). Then the slope estimator becomes

(6) \hat{b}=b+\frac{1}{n}\sum a_ie_i.

This is the critical representation.
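Representation (6) is an exact algebraic identity, so it can be checked on any single sample. The sketch below does this with simulated data; the parameter values and the regressor values are arbitrary assumptions, and the list `weights` plays the role of the coefficients a_i (renamed to avoid a clash with the intercept a).

```python
import random

rng = random.Random(1)
a, b = 1.0, 2.0                       # true intercept and slope (assumed values)
x = [0.5, 1.0, 2.0, 3.5, 4.0, 6.0]    # deterministic regressor
e = [rng.gauss(0.0, 1.0) for _ in x]  # one realization of the errors
y = [a + b * xi + ei for xi, ei in zip(x, e)]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
e_bar = sum(e) / n

var_x = sum((xi - x_bar) ** 2 for xi in x) / n
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
cov_xe = sum((xi - x_bar) * (ei - e_bar) for xi, ei in zip(x, e)) / n

# Formula (1): the estimator computed from observable data.
b_hat = cov_xy / var_x

# Formula (6): the same number written as b plus a weighted sum of errors.
weights = [(xi - x_bar) / var_x for xi in x]
b_repr = b + sum(w * ei for w, ei in zip(weights, e)) / n

print(b_hat, b_repr, cov_xe)
```

The two numbers agree up to rounding, and the sample covariance Cov_u(x,e) is not zero in a given sample, which is exactly the point made after (5).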

Unbiasedness of the slope estimator

Convenience condition. The regressor x is deterministic. I call it a convenience condition because it's just a matter of mathematical expedience, and later on we'll study ways to bypass it.

From (6), linearity of means and remembering that the deterministic coefficients a_i behave like constants,

(7) E\hat{b}=E[b+\frac{1}{n}\sum a_ie_i]=b+\frac{1}{n}\sum a_iEe_i=b

by (4). This proves unbiasedness.

You don't know the difference between the population and sample means until you see them working in the same formula.

Unbiasedness of the intercept estimator

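As above we plug (3) in (2): \hat{a}=\overline{a+bx+e}-\hat{b}\bar{x}=a+b\bar{x}+\bar{e}-\hat{b}\bar{x}. Applying expectation and using E\bar{e}=\frac{1}{n}\sum Ee_i=0 (from (4)) and E\hat{b}=b (from (7)):

E\hat{a}=a+b\bar{x}+E\bar{e}-\bar{x}E\hat{b}=a+b\bar{x}-b\bar{x}=a.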

Since in (1) there is division by Var_u(x), the condition Var_u(x)\ne 0 is the main condition for existence of OLS estimators. From the above proof we see that (4) is the main condition for unbiasedness.
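Unbiasedness is a statement about the average of the estimators over many samples, so a simulation can make it visible. The following is a sketch under assumptions of my own choosing: the true parameters, the uniform errors with observation-specific spread, and the number of replications are illustrative. The errors are deliberately non-normal and have unequal variances, because the proof above uses only (4) and the deterministic regressor.

```python
import random


def ols(x, y):
    """Return (a_hat, b_hat) from formulas (1)-(2), using 1/n weights."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n
    b_hat = cov_xy / var_x
    return y_bar - b_hat * x_bar, b_hat


rng = random.Random(4)
a, b = 1.0, 2.0
x = [float(i) for i in range(1, 21)]        # deterministic, fixed across samples
half_widths = [1.0 + 0.1 * xi for xi in x]  # error spread differs by observation

reps = 5000
a_total = b_total = 0.0
for _ in range(reps):
    # Uniform on (-h, h): mean zero for every i, but not identically distributed.
    e = [rng.uniform(-h, h) for h in half_widths]
    y = [a + b * xi + ei for xi, ei in zip(x, e)]
    a_hat, b_hat = ols(x, y)
    a_total += a_hat
    b_total += b_hat

mean_a_hat = a_total / reps
mean_b_hat = b_total / reps
print(mean_a_hat, mean_b_hat)
```

Individual estimates vary from sample to sample; it is their averages that settle near a and b.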

May 16

What is cointegration?

What is cointegration? The discussions here and here are bad because they link the definition to differencing a time series. In fact, to understand cointegration, you need two notions: stationary processes (please read before continuing) and linear dependence.

Definition. We say that vectors X_1,...,X_n are linearly dependent if there exist numbers a_1,...,a_n, not all of which are zero, such that the linear combination a_1X_1+...+a_nX_n is a zero vector.

Recall from this post that stationary processes play the role of zero in the set of all processes. Replace in the above definition "vectors" with "processes" and "a zero vector" with "a stationary process" and - voilà - you have the definition of cointegration:

Definition. We say that processes X_1,...,X_n are cointegrated if there exist numbers a_1,...,a_n, not all of which are zero, such that the linear combination a_1X_1+...+a_nX_n is a stationary process. Remembering that each process is a collection of random variables indexed with time moments t, we obtain a definition that explicitly involves time: processes \{X_{1,t}\},...,\{X_{n,t}\} are cointegrated if there exist numbers a_1,...,a_n, not all of which are zero, such that a_1X_{1,t}+...+a_nX_{n,t}=u_t where \{u_t\} is a stationary process.

To fully understand the implications, you need to know all the intricacies of linear dependence. I do not want to plunge into this lengthy discussion here. Instead, I want to explain how this definition leads to a regression in case of two processes.

If \{X_{1,t}\},\{X_{2,t}\} are cointegrated, then there exist numbers a_1,a_2, at least one of which is not zero, such that a_1X_{1,t}+a_2X_{2,t}=u_t where \{u_t\} is a stationary process. If a_1\ne 0, we can solve for X_{1,t} obtaining X_{1,t}=\beta X_{2,t}+v_t with \beta=-a_2/a_1 and v_t=u_t/a_1. This is almost a regression, except that the mean of v_t may not be zero. We can represent v_t=(v_t-Ev_t)+Ev_t=w_t+\alpha, where \alpha=Ev_t, w_t=v_t-Ev_t. Then the above equation becomes X_{1,t}=\alpha+\beta X_{2,t}+w_t, which is simple regression. The case a_2\ne 0 leads to a similar result.

Practical recommendation. To see if \{X_{1,t}\},\{X_{2,t}\} are cointegrated, regress one of them on the other and test the residuals for stationarity.
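The recommendation can be sketched on simulated data. Everything specific below is an assumption for illustration: the random-walk data, the parameter values, and the choice of a Dickey-Fuller-type t-ratio as the stationarity check (the post does not name a test). Critical values for such residual-based tests differ from those in standard Dickey-Fuller tables and are not computed here, so the sketch only compares a cointegrated pair with a pair of independent random walks.

```python
import random


def ols(x, y):
    """Return (intercept, slope) of the regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n
    slope = cov_xy / var_x
    return y_bar - slope * x_bar, slope


def residuals(dep, reg):
    """Residuals from regressing dep on reg (with a constant)."""
    alpha, beta = ols(reg, dep)
    return [d - alpha - beta * r for d, r in zip(dep, reg)]


def df_stat(u):
    """t-ratio of rho in the regression (u_t - u_{t-1}) = rho * u_{t-1} + error."""
    lagged = u[:-1]
    diffs = [u[t] - u[t - 1] for t in range(1, len(u))]
    sxx = sum(v * v for v in lagged)
    rho = sum(v * d for v, d in zip(lagged, diffs)) / sxx
    resid = [d - rho * v for v, d in zip(lagged, diffs)]
    s2 = sum(r * r for r in resid) / (len(diffs) - 1)
    return rho / (s2 / sxx) ** 0.5


def random_walk(n, rng):
    """Cumulative sum of independent standard normal shocks (nonstationary)."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


rng = random.Random(2)
n = 500
x2 = random_walk(n, rng)
x1 = [0.5 + 2.0 * v + rng.gauss(0.0, 1.0) for v in x2]  # cointegrated with x2
x3 = random_walk(n, rng)                                 # unrelated random walk

stat_coint = df_stat(residuals(x1, x2))
stat_spurious = df_stat(residuals(x3, x2))
print(stat_coint, stat_spurious)
```

A strongly negative statistic points towards stationary residuals. For actual testing one would compare it with critical values appropriate for residual-based tests, or rely on an established statistical package.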

Jan 16

What is a z score: the scientific explanation

You know what a z score is when you know why people invented it.

As usual, we start with a theoretical motivation. There is a myriad of distributions. Even if we stay within the set of normal distributions, there is an infinite number of them, indexed by their means \mu(X)=EX and standard deviations \sigma(X)=\sqrt{Var(X)}. When computers did not exist, people had to use statistical tables. It was impossible to produce statistical tables for an infinite number of distributions, so the problem was to reduce the case of general \mu(X) and \sigma(X) to that of \mu(X)=0 and \sigma(X)=1.

But we know that this can be achieved by centering and scaling. Combining these two transformations, we obtain the definition of the z score:

z=\frac{X-\mu(X)}{\sigma(X)}.

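Using the properties of means and variances we see that the z score z=(X-\mu(X))/\sigma(X) satisfies

Ez=\frac{EX-\mu(X)}{\sigma(X)}=0,\quad Var(z)=\frac{Var(X-\mu(X))}{\sigma^2(X)}=\frac{Var(X)}{\sigma^2(X)}=1.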

The transformation leading from X to its z score sometimes is called standardization.
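Here is a quick numerical look at standardization. It is a sketch with assumptions: the sample is simulated from an exponential distribution, the sample mean and standard deviation stand in for the population quantities \mu(X) and \sigma(X), and `skewness` is a helper written for this example.

```python
import random
import statistics

rng = random.Random(3)
x = [rng.expovariate(1.0) for _ in range(10000)]  # clearly non-normal data

mu = statistics.fmean(x)
sigma = statistics.pstdev(x)  # standard deviation with 1/n weights

# Centering followed by scaling.
z = [(xi - mu) / sigma for xi in x]


def skewness(values):
    """Third standardized moment, used here as a rough measure of asymmetry."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


z_mean = statistics.fmean(z)
z_sd = statistics.pstdev(z)
skew_x = skewness(x)
skew_z = skewness(z)
print(z_mean, z_sd, skew_x, skew_z)
```

The standardized sample has mean zero and standard deviation one, but its skewness is the same as before: centering and scaling change location and spread, not the shape of the distribution.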

This site promises to tell you the truth about undergraduate statistics. The truth about the z score is that:

(1) Standardization can be applied to any variable with finite variance, not only to normal variables. The z score is a standard normal variable only when the original variable X is normal, contrary to what some sites say.

(2) With modern computers, standardization is not necessary to find critical values for X, see Chapter 14 of my book.
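To illustrate point (2), here is a minimal sketch using `NormalDist` from Python's standard `statistics` module. The values of the mean, standard deviation and probability level are made up, and this is not necessarily how the book does it.

```python
from statistics import NormalDist

# Hypothetical normal variable X with mean 10 and standard deviation 2.
mu, sigma, p = 10.0, 2.0, 0.975

# Direct route: ask for the quantile of X itself.
direct = NormalDist(mu, sigma).inv_cdf(p)

# Table-era route: find the quantile of the z score, then undo standardization.
z_crit = NormalDist().inv_cdf(p)
via_table = mu + sigma * z_crit

print(direct, via_table)
```

Both routes give the same critical value; the second one is only needed when all you have is a table for the standard normal distribution.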

Jan 16

Mean plus deviation-from-mean decomposition

This is about separating the deterministic and random parts of a variable. This topic can be difficult or easy, depending on how you look at it. The right way to think about it is theoretical.

Everything starts with a simple question: What can you do to a random variable X to obtain a new variable, say, Y, whose mean is equal to zero? Intuitively, when you subtract the mean from X, the distribution moves to the left or right, depending on the sign of EX, so that the distribution of Y is centered on zero. One of my students used this intuition to guess that you should subtract the mean: Y=X-EX. The guess should be confirmed by algebra: from this definition

EY=E(X-EX)=EX-E(EX)=EX-EX=0

(here we distributed the expectation operator and used the property that the mean of a constant (EX) is that constant). By the way, subtracting the mean from a variable is called centering or demeaning.

If you understand the above, you can represent X as

X = EX+(X-EX).

Here \mu=EX is the mean and u=X-EX is the deviation from the mean. As was shown above, Eu=0. Thus, we obtain the mean plus deviation-from-mean decomposition X=\mu+u. Simple, isn't it? It is so simple that students don't pay attention to it. In fact, it is omnipresent in Statistics because Var(X)=Var(u). The analysis of Var(X) is reduced to that of Var(u)!
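The decomposition can be verified exactly for a small discrete variable. In the sketch below the values and probabilities are invented for illustration, and exact fractions are used so that no rounding blurs the equalities.

```python
from fractions import Fraction

# A hypothetical discrete random variable X: its values and their probabilities.
values = [Fraction(v) for v in (-1, 0, 2, 5)]
probs = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]


def mean(vals):
    """Population mean: probability-weighted sum of the values."""
    return sum(p * v for p, v in zip(probs, vals))


def variance(vals):
    """Population variance: mean squared deviation from the population mean."""
    m = mean(vals)
    return sum(p * (v - m) ** 2 for p, v in zip(probs, vals))


mu = mean(values)             # deterministic part
u = [v - mu for v in values]  # deviation from the mean, value by value

mean_u = mean(u)
var_x = variance(values)
var_u = variance(u)
print(mu, mean_u, var_x, var_u)
```

The mean of u is exactly zero and the two variances coincide, as the decomposition X=\mu+u predicts.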