23 Mar 17

Binary choice models: theoretical obstacles

What's wrong with the linear probability model

Recall the problem statement: the dependent variable y_i can take only two values, 0 or 1, and the independent variables are combined into the index Index_i=\beta_0+\beta_1x_{1i}+...+\beta_kx_{ki}. The linear probability model

(1) P(y_i=1|x_i)=Index_i

is equivalently written in linear regression form as

(2) y_i=Index_i+u_i with E(u_i|x_i)=0.

Let's study the error term. If y_i=1, then from (2) the error is u_i=y_i-Index_i=1-Index_i, and by (1) the probability of this event is Index_i. If y_i=0, then u_i=y_i-Index_i=-Index_i, and by (1) the probability of this event is 1-P(y_i=1|x_i)=1-Index_i. We can summarize this information in a table:

Table 1. Error properties
Value of u_i    Probability
1-Index_i       Index_i
-Index_i        1-Index_i

For each observation, the error is a binary variable. In particular, it's not continuous, much less normal. Since the index changes with the observation, the errors are not identically distributed.

It's easy to find the mean and variance of u_i. The mean is

Eu_i=(1-Index_i)Index_i-Index_i(1-Index_i)=0

(this is good). The variance is

Var(u_i)=Eu_i^2-(Eu_i)^2=(1-Index_i)^2Index_i+Index_i^2(1-Index_i)=(1-Index_i)Index_i[(1-Index_i)+Index_i]=(1-Index_i)Index_i,

which is bad: the variance depends on the observation through Index_i, so the errors are heteroscedastic. Besides, for this variance to be nonnegative, the index must stay between 0 and 1.
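
These formulas are easy to check numerically. A minimal sketch in Python (the index values below are made up; any values strictly between 0 and 1 work):

import numpy as np

# u_i equals 1-Index_i with probability Index_i and -Index_i with
# probability 1-Index_i (Table 1); p below plays the role of Index_i
for p in [0.2, 0.5, 0.8]:
    values = np.array([1 - p, -p])
    probs = np.array([p, 1 - p])
    mean = values @ probs                    # E(u_i) = 0
    var = (values ** 2) @ probs - mean ** 2  # Var(u_i)
    print(p, mean, var, p * (1 - p))         # var coincides with Index_i(1-Index_i)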

Why binary choice models have no error term

We know the general specification of a binary choice model:

P(y_i=1|x_i)=F(Index_i).

Here F is the distribution function of some random variable, say X. Let's see what happens if we include an error term, as in

(3) P(y_i=1|x_i)=F(Index_i+u_i).

It is natural, as a first approximation, to consider identically distributed errors that are independent of X. By definition,

(4) F(Index_i+u_i)=P(X\le Index_i+u_i)=P(X-u_i\le Index_i).

The variables Z_i=X-u_i are identically distributed. Denoting by Z a variable with their common distribution and by F_Z its distribution function, from (3) and (4) we have

P(y_i=1|x_i)=P(X-u_i\le Index_i)=P(Z\le Index_i)=F_Z(Index_i).

Thus, including the error term in (3) only leads to a change of the distribution function in the model specification. In probit and logit we fix well-behaved distribution functions from the very beginning and don't want to change them by introducing (possibly badly behaved) errors.
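
This equivalence is easy to see in a simulation. A sketch under one concrete assumption (not from the text above): X is standard normal and the errors are iid N(0, sigma^2), independent of X, so that Z=X-u_i is N(0, 1+sigma^2):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, index = 0.5, 0.7                 # illustrative error scale and index value
n = 1_000_000
x = rng.standard_normal(n)              # X ~ N(0,1)
u = sigma * rng.standard_normal(n)      # iid errors, independent of X

lhs = np.mean(x - u <= index)           # P(X-u_i <= Index_i), by Monte Carlo
rhs = norm.cdf(index / np.sqrt(1 + sigma ** 2))  # F_Z(Index_i), Z ~ N(0, 1+sigma^2)
print(lhs, rhs)                         # agree up to simulation error: only F changed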

 

20 Mar 17

Binary choice models

Problem statement

Consider the classical problem of why people choose to drive or use public transportation to go to their jobs. For a given individual i, we observe the decision variable y_i, which takes values either 0 (drive) or 1 (use public transportation), and various variables that impact the decision, like costs, convenience, availability of parking, etc. We denote the independent variables x_1,...,x_k and their values for a given individual x_i=(x_{1i},...,x_{ki}). For convenience, the usual expression \beta_0+\beta_1x_{1i}+...+\beta_kx_{ki} that arises on the right-hand side of a multiple regression is called an index and denoted Index_i.

We want to study how people's decisions y_i depend on the index Index_i.
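
To fix the notation, here is the index for one hypothetical individual with two regressors (the coefficient values are invented for illustration):

import numpy as np

beta = np.array([0.1, 0.4, -0.3])   # hypothetical (beta_0, beta_1, beta_2)
x_i = np.array([1.0, 2.0, 0.5])     # (1, x_{1i}, x_{2i}): constant plus regressors
index_i = beta @ x_i                # Index_i = beta_0 + beta_1 x_{1i} + beta_2 x_{2i}
print(index_i)                      # 0.75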

Linear probability model

If you are familiar with multiple regression, the first idea that comes to mind is

(1) y_i=Index_i+u_i.

This turns out to be a bad idea because the range of the variable on the left is bounded while the range of the index is not. Whenever there is a discrepancy between the bounded decision variable and the unbounded index, it has to be made up by the error term, so the error term is certain to be badly behaved. (A detailed analysis shows that it is heteroscedastic, but this fact is less important than the problem with the bounded range.)
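
The range problem is easy to demonstrate by simulation. A sketch (the data-generating process below is invented): generate binary decisions, fit (1) by ordinary least squares, and count fitted "probabilities" that escape [0, 1]:

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 500
x = rng.normal(scale=2.0, size=n)          # one regressor with a wide range
y = rng.binomial(1, norm.cdf(-0.5 + x))    # binary decisions (probit truth, made up)

ols = sm.OLS(y, sm.add_constant(x)).fit()  # the linear probability model
fitted = ols.fittedvalues
print((fitted < 0).sum(), (fitted > 1).sum())  # some fitted values lie outside [0, 1]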

The next statement helps us understand the right approach. We need the unbiasedness condition from the first approach to stochastic regressors:

(2) E(u_i|x_i)=0.

Statement. A combination of equations (1)+(2) is equivalent to just one equation

(3) P(y_i=1|x_i)=Index_i.

Proof. Step 1. Since y_i is a Bernoulli variable, by the definition of conditional expectation we have the identity

(4) E(y_i|x_i)=P(y_i=1|x_i)\times 1+P(y_i=0|x_i)\times 0=P(y_i=1|x_i).

Step 2. If (1)+(2) is true, then by (4)

P(y_i=1|x_i)=E(y_i|x_i)=E(Index_i+u_i|x_i)=Index_i,

so (3) holds (see Property 7). Conversely, suppose that (3) is true. Let us write

(5) y_i=P(y_i=1|x_i)+[y_i-P(y_i=1|x_i)]

and denote u_i=y_i-P(y_i=1|x_i). Then using (4) we see that (2) is satisfied:

E(u_i|x_i)=E(y_i|x_i)-E[P(y_i=1|x_i)|x_i]=E(y_i|x_i)-P(y_i=1|x_i)=0

(we use Property 7 again). Now (5) and (3) give (1). This completes the proof.

This little exercise shows that the linear model (1)+(2) is the same as (3), which is called the linear probability model. (3) has the same problem as (1): the variable on the left is bounded while the index on the right is not. Note also a conceptual problem not seen before: while the decisions y_i are observed, the probabilities P(y_i=1|x_i) are not. This is why one has to use the maximum likelihood method.
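
Identity (4) also suggests how to check the model on data: with a discrete regressor, group averages of y_i estimate the conditional probabilities. A small sketch with invented probabilities:

import numpy as np

rng = np.random.default_rng(2)
true_p = {0: 0.2, 1: 0.6, 2: 0.9}           # invented values of P(y_i=1|x_i=v)
x = rng.integers(0, 3, size=100_000)
y = rng.binomial(1, np.array([true_p[v] for v in x]))

for v, p in true_p.items():
    # the group mean estimates E(y_i|x_i=v), which equals P(y_i=1|x_i=v) by (4)
    print(v, round(y[x == v].mean(), 3), p)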

Binary choice models

Since we know what a distribution function is, we can guess how to correct (3). A distribution function has the same range as the probability on the left of (3), so the right model should look like this:

(6) P(y_i=1|x_i)=F(Index_i)

where F is some distribution function. Two choices of F are common:

(a) F is the distribution function of the standard normal distribution; in this case (6) is called a probit model.

(b) F is the logistic distribution function; then (6) is called a logit model.
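
Both models are readily estimated by maximum likelihood. A sketch with simulated data (the true coefficients -0.5 and 1.0 are invented), using the statsmodels package:

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)
y = rng.binomial(1, norm.cdf(-0.5 + x))  # decisions generated from a probit truth
X = sm.add_constant(x)

probit = sm.Probit(y, X).fit(disp=0)     # F = standard normal distribution function
logit = sm.Logit(y, X).fit(disp=0)       # F = logistic function
print(probit.params)                     # close to the true (-0.5, 1.0)
print(logit.params)                      # larger in absolute value: different scale of F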

Measuring marginal effects

For the linear model (1) the marginal effect of variable x_{ji} on y_i is measured by the derivative \partial y_i/\partial x_{ji}=\beta_j and is constant. If we apply the same idea to (6), we see that the marginal effect is not constant:

(7) \frac{\partial P(y_i=1|x_i)}{\partial x_{ji}}=\frac{\partial F(Index_i)}{\partial Index_i}\frac{\partial Index_i}{\partial x_{ji}}=f(Index_i)\beta_j

where f is the density of F (we use the fact that the derivative of a distribution function is its density, together with the chain rule). In statistical software, the value of f(Index_i) is usually reported at the mean value of the index.

For the probit model, equation (7) gives

\frac{\partial P(y_i=1|x_i)}{\partial x_{ji}}=\frac{1}{\sqrt{2\pi}}\exp(-\frac{Index_i^2}{2})\beta_j,

and for the logit model,

\frac{\partial P(y_i=1|x_i)}{\partial x_{ji}}=\frac{e^{-Index_i}}{(1+e^{-Index_i})^2}\beta_j.

There is no need to remember these equations if you know the algebra.
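
A sketch of equation (7) at the mean index, with simulated data as before (the coefficients 0.3 and 0.8 are invented); statsmodels reports the same quantity through get_margeff:

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm, logistic

rng = np.random.default_rng(4)
n = 2000
x = rng.normal(size=n)
y = rng.binomial(1, norm.cdf(0.3 + 0.8 * x))
X = sm.add_constant(x)

probit = sm.Probit(y, X).fit(disp=0)
logit = sm.Logit(y, X).fit(disp=0)

idx_p = X.mean(axis=0) @ probit.params          # mean value of the probit index
idx_l = X.mean(axis=0) @ logit.params           # mean value of the logit index
print(norm.pdf(idx_p) * probit.params[1])       # (1/sqrt(2 pi)) exp(-Index^2/2) * beta_1
print(logistic.pdf(idx_l) * logit.params[1])    # e^{-Index}/(1+e^{-Index})^2 * beta_1

# statsmodels computes the same marginal effects at the mean index:
print(probit.get_margeff(at='mean').summary())
print(logit.get_margeff(at='mean').summary())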