Jan 17

Regressions with stochastic regressors 2

Regressions with stochastic regressors 2: two approaches

We consider the slope estimator for the simple regression

y_i=a+bx_i+e_i

assuming that x_i is stochastic.

First approach: the sample size is fixed. The unbiasedness and efficiency conditions are replaced by their analogs conditioned on x. The outcome is that the slope estimator is unbiased and its variance is the average of the variance that we have in the case of a deterministic regressor. See the details.

Second approach: the sample size goes to infinity. The main tools used are the properties of probability limits and laws of large numbers. The outcome is that, in the limit, the sample characteristics are replaced by their population cousins and the slope estimator is consistent. This is what we focus on here.

A brush-up on convergence in probability

Review the intuition and formal definition. This is the summary:

Fact 1. Convergence in probability (which applies to sequences of random variables) is a generalization of the notion of convergence of number sequences. In particular, if \{a_n\} is a numerical sequence that converges to a number a, that is, \lim_{n\rightarrow\infty}a_n=a, then, treating a_n as a random variable, we have convergence in probability {\text{plim}}_{n\rightarrow\infty}a_n=a.

Fact 2. For those who are familiar with the theory of limits of numerical sequences, from the previous fact it should be clear that convergence in probability preserves arithmetic operations. That is, for any sequences of random variables \{X_n\},\{Y_n\} such that limits {\text{plim}}X_n and {\text{plim}}Y_n exist, we have

\text{plim}(X_n\pm Y_n)=\text{plim}X_n\pm\text{plim}Y_n,

\text{plim}(X_n\times Y_n)=\text{plim}X_n\times\text{plim}Y_n,

and if \text{plim}Y_n\ne 0 then

\text{plim}(X_n/ Y_n)=\text{plim}X_n/\text{plim}Y_n.

This makes convergence in probability very handy. Convergence in distribution doesn't have such properties.
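To see the division rule at work, here is a small simulation sketch. Everything in it is my own illustrative choice, not something from the theory above: X_n is the sample mean of n draws from a normal distribution with mean 2, Y_n is the sample mean of n draws from a normal distribution with mean 4, so the ratio X_n/Y_n should settle near 2/4=0.5 as n grows.

```python
import random

def sample_mean(draw, n):
    """Average of n independent draws produced by the callable `draw`."""
    return sum(draw() for _ in range(n)) / n

def ratio_of_means(n, rng):
    # Illustrative assumption: plim X_n = 2 and plim Y_n = 4.
    x_bar = sample_mean(lambda: rng.gauss(2.0, 1.0), n)
    y_bar = sample_mean(lambda: rng.gauss(4.0, 1.0), n)
    return x_bar / y_bar

rng = random.Random(0)  # fixed seed so the run is reproducible
ratios = {n: ratio_of_means(n, rng) for n in (10, 1000, 100000)}
for n, r in ratios.items():
    print(f"n={n:>6}: X_n/Y_n = {r:.4f}")
```

The ratio for small n can be noticeably off; it is only for large n that it stays close to the ratio of the two limits.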

A brush-up on laws of large numbers

See the site map for several posts about this. Here we apply the Chebyshev inequality to prove the law of large numbers for sample means. A generalization is given in the Theorem at the end of that post. Here is a further intuitive generalization:

Normally, unbiased sample characteristics converge in probability to their population counterparts.

Example 1. We know that the sample variance s^2=\frac{1}{n-1}\sum(X_i-\bar{X})^2 unbiasedly estimates the population variance \sigma^2: Es^2=\sigma^2. The intuitive generalization says that then

(1) \text{plim}s^2=\sigma^2.

Here I argue that, for the purposes of obtaining some identities from the general properties of means, instead of the sample variance it's better to use the variance defined by Var_u(X)=\frac{1}{n}\sum(X_i-\bar{X})^2 (with division by n instead of n-1). Using Facts 1 and 2 we get from (1) that

(2) \text{plim}Var_u(X)=\text{plim}\frac{n-1}{n}\frac{1}{n-1}\sum(X_i-\bar{X})^2

=\lim\frac{n-1}{n}\text{plim}s^2=\sigma^2=Var(X)

(sample variance converges in probability to population variance). Here we use \lim(1-\frac{1}{n})=1.
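A quick numerical check of the relation between Var_u(X) and s^2 may help. The sketch below is only an illustration under my own assumptions (normal data with \sigma^2=4, three arbitrary sample sizes). It verifies that Var_u(X)=\frac{n-1}{n}s^2 in every sample and that both are close to \sigma^2 when n is large.

```python
import math
import random
import statistics

def var_u(xs):
    """Variance with division by n (Var_u in the text)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

rng = random.Random(1)  # fixed seed for reproducibility
sigma2 = 4.0            # assumed population variance for the illustration
results = {}
for n in (5, 50, 50000):
    xs = [rng.gauss(0.0, math.sqrt(sigma2)) for _ in range(n)]
    s2 = statistics.variance(xs)  # sample variance, division by n-1
    results[n] = (s2, var_u(xs))
    print(f"n={n:>5}: s^2={s2:.4f}, Var_u={results[n][1]:.4f}")
```

For n=5 the two numbers differ by the factor 4/5, while for n=50000 the factor (n-1)/n is practically 1.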

Example 2. Similarly, sample covariance converges in probability to population covariance:

(3) \text{plim}Cov_u(X,Y)=Cov(X,Y)

where by definition Cov_u(X,Y)=\frac{1}{n}\sum(X_i-\bar{X})(Y_i-\bar{Y}).

Proving consistency of the slope estimator

Here (see equation (5)) I derived the representation of the OLS estimator of the slope

\hat{b}=b+\frac{Cov_u(X,e)}{Var_u(X)}.

Using preservation of arithmetic operations for convergence in probability, we get

(4) \text{plim}\hat{b}=\text{plim}\left[b+\frac{Cov_u(X,e)}{Var_u(X)}\right]=\text{plim}b+\text{plim}\frac{Cov_u(X,e)}{Var_u(X)}

=b+\frac{\text{plim}Cov_u(X,e)}{\text{plim}Var_u(X)}=b+\frac{Cov(X,e)}{Var(X)}.

In the last line we used (2) and (3). From (4) we see what conditions should be imposed for the slope estimator to converge to a spike at the true slope:

Var(X)\neq 0 (existence condition)


Cov(X,e)=0 (consistency condition).

Under these conditions, we have \text{plim}\hat{b}=b (this is called consistency).
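The consistency result can be illustrated by simulation. The sketch below rests on assumptions of my own: the regressor and the error are independent standard normal (so Var(X)\neq 0 and Cov(X,e)=0), the true coefficients are a=1 and b=2, and the slope estimator is computed by the standard formula \hat{b}=Cov_u(X,Y)/Var_u(X). It also checks numerically that this coincides with the representation b+Cov_u(X,e)/Var_u(X).

```python
import random

def cov_u(xs, ys):
    """Covariance with division by n (Cov_u in the text)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def var_u(xs):
    """Variance with division by n, as a special case of cov_u."""
    return cov_u(xs, xs)

def simulate_slope(n, a, b, rng):
    # Assumption: x and e are independent, hence Cov(X, e) = 0.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    es = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [a + b * x + e for x, e in zip(xs, es)]
    b_hat = cov_u(xs, ys) / var_u(xs)        # OLS slope estimator
    b_repr = b + cov_u(xs, es) / var_u(xs)   # representation used in (4)
    return b_hat, b_repr

rng = random.Random(3)  # fixed seed for reproducibility
a_true, b_true = 1.0, 2.0
slopes = {n: simulate_slope(n, a_true, b_true, rng) for n in (20, 2000, 100000)}
for n, (b_hat, b_repr) in slopes.items():
    print(f"n={n:>6}: b_hat={b_hat:.4f}, representation={b_repr:.4f}")
```

In every sample the two columns agree up to rounding, and both approach the true slope as n grows. If x and e were generated with a nonzero covariance, the estimates would instead settle at b+Cov(X,e)/Var(X).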

Conclusion. In a way, the second approach is technically simpler than the first.

Jan 17

The law of large numbers proved

The law of large numbers overview

I already have several posts about the law of large numbers:

  1. start with the intuition, which is illustrated using Excel;
  2. simulations in Excel show that convergence is not as fast as some textbooks claim;
  3. to distinguish the law of large numbers from the central limit theorem read this;
  4. the ultimate purpose is the application to simple regression with a stochastic regressor.

Here we busy ourselves with the proof.

Measuring deviation of a random variable from a constant

Let X be a random variable and c some constant. We want a measure of X differing from the constant by a given number \varepsilon or more. The set where X differs from c by \varepsilon>0 or more is the outside of the segment [c-\varepsilon,c+\varepsilon], that is, \{|X-c|\ge\varepsilon\}=\{X\le c-\varepsilon\}\cup\{X\ge c+\varepsilon\}.

Figure 1. Measuring the outside of interval

Now suppose X has a density p(t). It is natural to measure the set \{|X-c|\ge\varepsilon\} by the probability P(|X-c|\ge\varepsilon). This is illustrated in Figure 1.
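When the density is known, P(|X-c|\ge\varepsilon) is the area under it outside the segment, and it can also be approximated by the share of simulated values falling outside. The following sketch assumes, purely for illustration, that X is standard normal, c=0 and \varepsilon=1.96. For this choice the exact value can be written through the complementary error function as \text{erfc}(\varepsilon/\sqrt{2}).

```python
import math
import random

def tail_probability(draws, c, eps):
    """Share of draws with |x - c| >= eps: a Monte Carlo estimate of P(|X-c| >= eps)."""
    return sum(1 for x in draws if abs(x - c) >= eps) / len(draws)

rng = random.Random(42)  # fixed seed for reproducibility
draws = [rng.gauss(0.0, 1.0) for _ in range(200_000)]  # assumed: X is standard normal
c, eps = 0.0, 1.96
estimate = tail_probability(draws, c, eps)
exact = math.erfc(eps / math.sqrt(2.0))  # two-sided normal tail probability
print(f"simulated: {estimate:.4f}, exact: {exact:.4f}")
```

The two numbers should be close to each other, both near the familiar 5% two-sided tail of the standard normal.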

Convergence to a spike formalized

Figure 2. Convergence to a spike

Once again, check out the idea. Consider a sequence of random variables \{T_n\} and a parameter \tau. Fix some \varepsilon>0 and consider a corridor [\tau-\varepsilon,\tau+\varepsilon] of width 2\varepsilon around \tau. For \{T_n\} to converge to a spike at \tau we want the area P(|T_n-\tau|\ge\varepsilon) to go to zero as we move along the sequence to infinity. This is illustrated in Figure 2, where, say, T_1 has a flat density and the density of T_{1000} is chisel-shaped. In the latter case the area P(|T_n-\tau|\ge\varepsilon) is much smaller than in the former. The math of this phenomenon is such that P(|T_n-\tau|\ge\varepsilon) should go to zero for any \varepsilon>0 (the narrower the corridor, the further to infinity we should move along the sequence).

Definition. Let \tau be some parameter and let \{T_n\} be a sequence of its estimators. We say that \{T_n\} converges to \tau in probability or, alternatively, \{T_n\} consistently estimates \tau if P(|T_n-\tau|\ge\varepsilon)\rightarrow 0 as n\rightarrow\infty for any \varepsilon>0.

The law of large numbers in its simplest form

Let \{X_n\} be an i.i.d. sample from a population with mean \mu and variance \sigma^2. This is the situation from the standard Stats course. We need two facts about the sample mean \bar{X}: it is unbiased,

(1) E\bar{X}=\mu,

and its variance tends to zero

(2) Var(\bar{X})=\sigma^2/n\rightarrow 0 as n\rightarrow\infty.


P(|\bar{X}-\mu|\ge \varepsilon) (by (1))

=P(|\bar{X}-E\bar{X}|\ge \varepsilon) (by the Chebyshev inequality, see Extension 3))

\le\frac{1}{\varepsilon^2}Var(\bar{X}) (by (2))

=\frac{\sigma^2}{n\varepsilon^2}\rightarrow 0 as n\rightarrow\infty.

Since this is true for any \varepsilon>0, the sample mean is a consistent estimator of the population mean. This proves Example 1.
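The Chebyshev bound \frac{\sigma^2}{n\varepsilon^2} can be compared with what actually happens in simulations. The sketch below is a rough illustration under my own assumptions: the population is uniform on (0,1), so \mu=1/2 and \sigma^2=1/12, the corridor half-width is \varepsilon=0.1, and the probability P(|\bar{X}-\mu|\ge\varepsilon) is estimated by its frequency over 1000 repeated samples.

```python
import random

def exceed_frequency(n, eps, reps, rng):
    """Share of repeated samples of size n whose mean misses mu by eps or more."""
    mu = 0.5  # mean of the uniform(0, 1) population assumed here
    hits = 0
    for _ in range(reps):
        x_bar = sum(rng.random() for _ in range(n)) / n
        if abs(x_bar - mu) >= eps:
            hits += 1
    return hits / reps

rng = random.Random(7)  # fixed seed for reproducibility
eps = 0.1
sigma2 = 1.0 / 12.0     # variance of the uniform(0, 1) population
table = {}
for n in (10, 100, 1000):
    freq = exceed_frequency(n, eps, reps=1000, rng=rng)
    bound = sigma2 / (n * eps ** 2)  # Chebyshev bound from the proof
    table[n] = (freq, bound)
    print(f"n={n:>4}: frequency={freq:.4f}, Chebyshev bound={bound:.4f}")
```

The simulated frequency stays below the bound, usually by a wide margin: the Chebyshev inequality is crude, but it is enough to push the probability to zero.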

Final remarks

The above proof applies in the following more general situation.

Theorem. Let \tau be some parameter and let \{T_n\} be a sequence of its estimators such that: a) ET_n=\tau for any n and b) Var(T_n)\rightarrow 0. Then \{T_n\} converges in probability to \tau.

This statement is often used on the Econometrics exams of the University of London.

In the unbiasedness definition the sample size is fixed. In the consistency definition it tends to infinity. The above theorem says that unbiasedness for all n plus Var(T_n)\rightarrow 0 are sufficient for consistency.
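For completeness, here is how the chain used for the sample mean looks when written for a general estimator. This is just my sketch of the argument, with a) and b) referring to the conditions of the Theorem and \varepsilon>0 arbitrary:

```latex
\begin{align*}
P(|T_n-\tau|\ge\varepsilon)
  &= P(|T_n-ET_n|\ge\varepsilon) && \text{(by a)} \\
  &\le \frac{1}{\varepsilon^2}Var(T_n) && \text{(by the Chebyshev inequality)} \\
  &\rightarrow 0 \text{ as } n\rightarrow\infty && \text{(by b)}.
\end{align*}
```

The only properties of the sample mean used in the proof were unbiasedness and vanishing variance, which is why the same three steps work for any T_n satisfying a) and b).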

Sep 16

All you need to know about the law of large numbers

All about the law of large numbers: properties and applications

Level 1: estimation of population parameters

The law of large numbers is a statement about convergence which is called convergence in probability and denoted \text{plim}. The precise definition is rather complex but the intuition is simple: it is convergence to a spike at the parameter being estimated. Usually, any unbiasedness statement has its analog in terms of the corresponding law of large numbers.

Example 1. The sample mean unbiasedly estimates the population mean: E\bar{X}=EX. Its analog: the sample mean converges to a spike at the population mean: \text{plim}\bar{X}=EX. See the proof based on the Chebyshev inequality.

Example 2. The sample variance unbiasedly estimates the population variance: E\overline{s^2}=Var(X) where s^2=\frac{\sum(X_i-\bar{X})^2}{n-1}. Its analog: the sample variance converges to a spike at the population variance:

(1) \text{plim}\overline{s^2}=Var(X).

Example 3. The sample covariance s_{X,Y}=\frac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{n-1} unbiasedly estimates the population covariance: E\overline{s_{X,Y}}=Cov(X,Y). Its analog: the sample covariance converges to a spike at the population covariance:

(2) \text{plim}\overline{s_{X,Y}}=Cov(X,Y).

Up one level: convergence in probability is just convenient

Whether or not to use convergence in probability is a matter of expedience. For usual limits of sequences we know the properties which I call preservation of arithmetic operations:

\lim(a_n\pm b_n)=\lim a_n\pm \lim b_n,

\lim(a_n\times b_n)=\lim a_n\times\lim b_n,

and, if \lim b_n\ne 0,

\lim(a_n/ b_n)=\lim a_n/\lim b_n.

Convergence in probability has exactly the same properties; just replace \lim with \text{plim}.

Next level: making regression estimation more plausible

Using convergence in probability allows us to handle stochastic regressors and avoid the unrealistic assumption that regressors are deterministic.

Convergence in probability and in distribution are two types of convergence of random variables that are widely used in the Econometrics course of the University of London.