23 Mar 17

## Binary choice models: theoretical obstacles


### What's wrong with the linear probability model

Recall the problem statement: the dependent variable $y_i$ can take only two values, 0 or 1, and the independent variables are joined into the index $Index_i=\beta_0+\beta_1x_{1i}+...+\beta_kx_{ki}$. The linear probability model

(1) $P(y_i=1|x_i)=Index_i$

is equivalently written in linear regression form as

(2) $y_i=Index_i+u_i$ with $E(u_i|x_i)=0$.

Let's study the error term. If $y_i=1$, then from (2) the value of $u_i$ is $u_i=y_i-Index_i=1-Index_i$, and by (1) the probability of this event is $Index_i$. If $y_i=0$, then $u_i=y_i-Index_i=-Index_i$, and by (1) the probability of this event is $1-P(y_i=1|x_i)=1-Index_i$. We can summarize this information in a table:

| Values of $u_i$ | Corresponding probabilities |
|---|---|
| $1-Index_i$ | $Index_i$ |
| $-Index_i$ | $1-Index_i$ |

For each observation, the error is a binary variable. In particular, it's not continuous, much less normal. Since the index changes with the observation, the errors are not identically distributed.

It's easy to find the mean and variance of $u_i$. The mean is $Eu_i=(1-Index_i)Index_i-Index_i(1-Index_i)=0$

(this is good). The variance is $Var(u_i)=Eu_i^2-(Eu_i)^2=(1-Index_i)^2Index_i+(Index_i)^2(1-Index_i)= (1-Index_i)Index_i[(1-Index_i)+(Index_i)]=(1-Index_i)Index_i,$

which is bad (heteroscedasticity). Besides, for this variance to be positive, the index should stay between 0 and 1.
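The heteroscedasticity is easy to see in a simulation. Below is a stdlib-only Python sketch (the index values 0.2 and 0.5 are hypothetical): for each value $p$ of the index, the sample variance of $u_i=y_i-p$ comes out close to the theoretical $p(1-p)$, so the error variance changes with the observation.

```python
import random

random.seed(0)

def lpm_error_variance(p, n=100_000):
    """Simulate u_i = y_i - p for y_i ~ Bernoulli(p) and return the sample variance."""
    draws = [(1 if random.random() < p else 0) - p for _ in range(n)]
    mean = sum(draws) / n
    return sum((d - mean) ** 2 for d in draws) / n

# Two hypothetical index values: the error variance p*(1-p) differs across them,
# which is exactly the heteroscedasticity derived above.
for p in (0.2, 0.5):
    print(round(lpm_error_variance(p), 3), round(p * (1 - p), 3))
```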

### Why there is no error term in binary choice models

We know the general specification of a binary choice model: $P(y_i=1|x_i)=F(Index_i).$

Here $F$ is a distribution function of some variable, say $X$. Let's see what happens if we include the error term, as in

(3) $P(y_i=1|x_i)=F(Index_i+u_i).$

It is natural, as a first approximation, to consider identically distributed errors. By definition,

(4) $F(Index_i+u_i)=P(X\le Index_i+u_i)=P(X-u_i\le Index_i)$.

The variables $Z_i=X-u_i$ are identically distributed. Denoting by $Z$ a variable with their common distribution, from (3) and (4) we have $P(y_i=1|x_i)=P(X-u_i\le Index_i)=P(Z\le Index_i)=F_Z(Index_i)$.

Thus, including the error term as in (3) merely replaces one distribution function by another in the model specification. In probit and logit, we fix good distribution functions from the very beginning and don't want to change them by introducing (possibly bad) errors.
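This replacement can be checked numerically. In the stdlib-only Python sketch below (the index value 0.7 and error scale 0.8 are made up), we take $X\sim N(0,1)$ and an independent error $u_i\sim N(0,\sigma^2)$; then $Z=X-u_i\sim N(0,1+\sigma^2)$, so the "probit with errors" is just another probit with a rescaled index.

```python
import math
import random

random.seed(1)

def phi(x, sd=1.0):
    """Normal distribution function with mean 0 and standard deviation sd."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))

index, sigma, n = 0.7, 0.8, 200_000  # hypothetical index value and error scale

# Monte Carlo estimate of P(X <= index + u) with X ~ N(0,1), u ~ N(0, sigma^2)
hits = sum(random.gauss(0, 1) <= index + random.gauss(0, sigma) for _ in range(n))
print(round(hits / n, 3))

# The same probability from the "new" distribution function F_Z,
# where Z = X - u ~ N(0, 1 + sigma^2)
print(round(phi(index, sd=math.sqrt(1 + sigma ** 2)), 3))
```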

20 Mar 17

## Binary choice models

### Problem statement

Consider the classical problem of why people choose to drive or use public transportation to go to their jobs. For a given individual $i$, we observe the decision variable $y_i$, which takes values either 0 (drive) or 1 (use public transportation), and various variables that impact the decision, like costs, convenience, availability of parking, etc. We denote the independent variables $x_1,...,x_k$ and their values for a given individual $x_i=(x_{1i},...,x_{ki})$. For convenience, the usual expression $\beta_0+\beta_1x_{1i}+...+\beta_kx_{ki}$ that arises on the right-hand side of a multiple regression is called an index and denoted $Index_i$.

We want to study how people's decisions $y_i$ depend on the index $Index_i$.

### Linear probability model

If you are familiar with multiple regression, the first idea that comes to mind is

(1) $y_i=Index_i+u_i$.

This turns out to be a bad idea because the range of the variable on the left is bounded while the range of the index is not. Whenever there is a discrepancy between the bounded decision variable and the unbounded index, it has to be made up for by the error term. Thus the error term will certainly be bad. (A detailed analysis shows that it will be heteroscedastic, but this fact is less important than the problem with boundedness of the range.)

The next statement helps to understand the right approach. We need the unbiasedness condition from the first approach to stochastic regressors:

(2) $E(u_i|x_i)=0$.

Statement. A combination of equations (1)+(2) is equivalent to just one equation

(3) $P(y_i=1|x_i)=Index_i$.

Proof. Step 1. Since $y_i$ is a Bernoulli variable, by the definition of conditional expectation we have the identity

(4) $E(y_i|x_i)=P(y_i=1|x_i)\times 1+P(y_i=0|x_i)\times 0=P(y_i=1|x_i)$.

Step 2. If (1)+(2) is true, then by (4) $P(y_i=1|x_i)=E(y_i|x_i)=E(Index_i+u_i|x_i)=Index_i$,

so (3) holds (see Property 7). Conversely, suppose that (3) is true. Let us write

(5) $y_i=P(y_i=1|x_i)+[y_i-P(y_i=1|x_i)]$

and denote $u_i=y_i-P(y_i=1|x_i)$. Then using (4) we see that (2) is satisfied: $E(u_i|x_i)=E(y_i|x_i)-E[P(y_i=1|x_i)|x_i]=E(y_i|x_i)-P(y_i=1|x_i)=0$

(we use Property 7 again). (5) and (3) give (1). The proof is over.

This little exercise shows that the linear model (1)+(2) is the same as (3), which is called the linear probability model. (3) has the same problem as (1): the variable on the left is bounded, while the index on the right is not. Note also a conceptual problem unseen before: while the decisions $y_i$ are observed, the probabilities $P(y_i=1|x_i)$ are not. This is why one has to use the maximum likelihood method.
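The boundedness problem shows up immediately in a simulation. Below is a stdlib-only Python sketch (all data are made up): the true probabilities come from a standard normal distribution function, yet the OLS fit of the linear probability model produces fitted "probabilities" outside $[0,1]$.

```python
import math
import random

random.seed(2)

# Hypothetical data: one regressor, true probabilities from the standard normal CDF
n = 1000
x = [random.gauss(0, 2) for _ in range(n)]
p = [0.5 * (1 + math.erf(v / math.sqrt(2))) for v in x]
y = [1 if random.random() < pv else 0 for pv in p]

# OLS fit of y on x (closed-form slope and intercept for one regressor)
mx, my = sum(x) / n, sum(y) / n
beta1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
beta0 = my - beta1 * mx

# Fitted "probabilities" from the linear probability model
fitted = [beta0 + beta1 * xi for xi in x]
print(min(fitted) < 0, max(fitted) > 1)  # the unbounded index escapes [0, 1]
```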

### Binary choice models

Since we know what a distribution function is, we can guess how to correct (3). A distribution function has the same range as the probability on the left of (3), so the right model should look like this:

(6) $P(y_i=1|x_i)=F(Index_i)$

where $F$ is some distribution function. Two choices of $F$ are common

(a) $F$ is a distribution function of the standard normal; in this case (6) is called a probit model.

(b) $F$ is a logistic function; then (6) is called a logit model.
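Both choices can be coded in a few lines of stdlib Python (the evaluation points are hypothetical): each function maps any index into $(0,1)$, and the two differ mainly in the tails.

```python
import math

def probit_F(z):
    """Standard normal distribution function (probit model)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_F(z):
    """Logistic distribution function (logit model)."""
    return 1.0 / (1.0 + math.exp(-z))

# Compare the two link functions at a few hypothetical index values
for z in (-2.0, 0.0, 2.0):
    print(z, round(probit_F(z), 3), round(logit_F(z), 3))
```

Both functions equal 0.5 at a zero index; away from zero the logistic has heavier tails than the normal.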

### Measuring marginal effects

For the linear model (1) the marginal effect of variable $x_{ji}$ on $y_i$ is measured by the derivative $\partial y_i/\partial x_{ji}=\beta_j$ and is constant. If we apply the same idea to (6), we see that the marginal effect is not constant:

(7) $\frac{\partial P(y_i=1|x_i)}{\partial x_{ji}}=\frac{\partial F(Index_i)}{\partial Index_i}\frac{\partial Index_i}{\partial x_{ji}}=f(Index_i)\beta_j$

where $f$ is the density of $F$ (we use the distribution function differentiation equation and the chain rule). In statistical software, the value of $f(Index_i)$ is usually reported at the mean value of the index.

For the probit model equation (7) gives $\frac{\partial P(y_i=1|x_i)}{\partial x_{ji}}=\frac{1}{\sqrt{2\pi}}\exp(-\frac{Index_i^2}{2})\beta_j$ and for the logit $\frac{\partial P(y_i=1|x_i)}{\partial x_{ji}}=\frac{e^{-Index_i}}{(1+e^{-Index_i})^2}\beta_j$. There is no need to remember these equations if you know the algebra.
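The algebra of (7) is a one-liner in code. Below is a stdlib-only Python sketch with a hypothetical coefficient $\beta_j=0.5$ and made-up index values: the marginal effect is the density $f$ from the two formulas above times $\beta_j$, evaluated at the mean index, as software usually reports it.

```python
import math

def probit_density(z):
    """Density of the standard normal: the f in (7) for the probit model."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def logit_density(z):
    """Density of the logistic distribution: the f in (7) for the logit model."""
    return math.exp(-z) / (1.0 + math.exp(-z)) ** 2

beta_j = 0.5                       # hypothetical coefficient
indexes = [-1.0, 0.0, 1.5, 2.0]    # hypothetical index values across observations
mean_index = sum(indexes) / len(indexes)

# Marginal effect f(Index)*beta_j evaluated at the mean index
print(round(probit_density(mean_index) * beta_j, 4))
print(round(logit_density(mean_index) * beta_j, 4))
```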