8 Jan 17

## OLS estimator variance

We consider the simple regression

(1) $y_i=a+bx_i+e_i$

Here we derived the OLS estimators of the intercept and slope:

(2) $\hat{b}=\frac{Cov_u(x,y)}{Var_u(x)}$,

(3) $\hat{a}=\bar{y}-\hat{b}\bar{x}$.

A1. Existence condition. Since division by zero is not allowed, for (2) to exist we require $Var_u(x)\ne 0$. If this condition is not satisfied, then there is no variance in $x$, and all observed points lie on a vertical line.

A2. Convenience condition. The regressor $x$ is deterministic. This condition is imposed to be able to apply the properties of expectation, see equation (7) in this post. The time trend and dummy variables are examples of deterministic regressors. However, most real-life regressors are stochastic. Modifying the theory to cover stochastic regressors is the subject of two posts: finite-sample theory and large-sample theory.

A3. Unbiasedness condition: $Ee_i=0$. This is the main assumption that ensures that the OLS estimators are unbiased, see equation (7) in this post.

### Unbiasedness is not enough

Unbiasedness characterizes the quality of an estimator, see the intuitive explanation. Unfortunately, unbiasedness is not enough to choose the best estimator because of nonuniqueness: usually, if there is one unbiased estimator of a parameter, then there are infinitely many unbiased estimators of the same parameter. For example, we know that the sample mean $\bar{X}$ unbiasedly estimates the population mean $E\bar{X}=EX$. Since $EX_1=EX$ ($X_1$ is the first observation), we can easily construct an infinite family of unbiased estimators $Y=(\bar{X}+aX_1)/(1+a)$, assuming $a\ne -1$. Indeed, using linearity of expectation $EY=(E\bar{X}+aEX_1)/(1+a)=EX$.
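The nonuniqueness is easy to see in a simulation. The sketch below (Python; the normal population with $\mu=5$ and the sample size are my arbitrary assumptions, not part of the argument) averages several members of the family $Y=(\bar{X}+aX_1)/(1+a)$ over many samples:

```python
import random
import statistics

random.seed(0)
mu, n, reps = 5.0, 50, 20000   # population mean, sample size, replications (arbitrary)

def family_estimator(sample, a):
    # Y = (sample mean + a * first observation) / (1 + a), defined for a != -1
    return (statistics.fmean(sample) + a * sample[0]) / (1 + a)

for a in (0.0, 0.5, 2.0):
    estimates = [family_estimator([random.gauss(mu, 1.0) for _ in range(n)], a)
                 for _ in range(reps)]
    avg = statistics.fmean(estimates)
    # every member of the family is centered at the population mean
    assert abs(avg - mu) < 0.05, (a, avg)
```

All three estimators are centered at $\mu$, yet they differ in spread, which is exactly why the next criterion, variance, is needed.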

Variance is another measure of an estimator quality: to have a lower spread of estimator values, among competing estimators we choose the one which has the lowest variance. Knowing the estimator variance allows us to find the z-score and use statistical tables.

### Slope estimator variance

It is not difficult to find the variance of the slope estimator using representation (6) derived here:

$\hat{b}=b+\frac{1}{n}\sum a_ie_i$

where $a_i=(x_i-\bar{x})/Var_u(x).$

Don't try to apply the definition of variance directly at this point, because squaring the sum would produce a double sum. We need two new assumptions.

A4. Uncorrelatedness of errors. Assume that $Cov(e_i,e_j)=0$ for all $i\ne j$ (errors from different equations (1) are uncorrelated). Note that because of the unbiasedness condition, this assumption is equivalent to $Ee_ie_j=0$ for all $i\ne j$. This assumption is likely to be satisfied if we observe consumption patterns of unrelated individuals.

A5. Homoscedasticity. All errors have the same variance: $Var(e_i)=\sigma^2$ for all $i$. Again, because of the unbiasedness condition, this assumption is equivalent to $Ee_i^2=\sigma^2$ for all $i$.

Now we can derive the variance expression, using properties from this post:

$Var(\hat{b})=Var(b+\frac{1}{n}\sum_i a_ie_i)$ (dropping a constant doesn't affect variance)

$=Var(\frac{1}{n}\sum_i a_ie_i)$ (for uncorrelated variables, variance is additive)

$=\sum_i Var(\frac{1}{n}a_ie_i)$ (variance is homogeneous of degree 2)

$=\frac{1}{n^2}\sum_i a_i^2Var(e_i)$ (applying homoscedasticity)

$=\frac{1}{n^2}\sum_i a_i^2\sigma^2$ (plugging $a_i$)

$=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x)$ (using the notation of sample variance)

$=\frac{1}{n}Var_u(x)\sigma^2/Var^2_u(x)=\sigma^2/(nVar_u(x)).$

Note that canceling the two variances in the last line is obvious with the short notation for variances; it is much less obvious when they are written out with summation signs. The case of the intercept variance is left as an exercise.
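The derived formula can be checked by Monte Carlo. In this sketch (Python; the particular regressor, parameters and error distribution are arbitrary choices) the empirical variance of the slope estimator is compared with $\sigma^2/(nVar_u(x))$:

```python
import random
import statistics

random.seed(1)
a_true, b_true, sigma = 1.0, 2.0, 0.5      # arbitrary parameters
x = [float(i) for i in range(1, 21)]       # deterministic regressor (A2), n = 20
n = len(x)
var_u_x = statistics.pvariance(x)          # Var_u(x): variance with uniform 1/n weights

slopes = []
for _ in range(20000):
    e = [random.gauss(0, sigma) for _ in range(n)]           # A3-A5 hold by construction
    y = [a_true + b_true * xi + ei for xi, ei in zip(x, e)]
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    cov_u = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
    slopes.append(cov_u / var_u_x)          # OLS slope, equation (2)

theory = sigma**2 / (n * var_u_x)           # the variance formula just derived
empirical = statistics.pvariance(slopes)
assert abs(empirical - theory) < 0.2 * theory
```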

### Conclusion

The above assumptions A1-A5 are called classical. It is necessary to remember their role in derivations because a considerable part of Econometrics is devoted to deviations from classical assumptions. Once you have a certain assumption violated, you should expect the corresponding estimator property invalidated. For example, if $Ee_i\ne 0$, you should expect the estimators to be biased. If any of A4-A5 is not true, the formula we have derived

$Var(\hat{b})=\sigma^2/(nVar_u(x))$

will not hold. Besides, the Gauss-Markov theorem that the OLS estimators are efficient will not hold (this will be discussed later). The pair A4-A5 can be called an efficiency condition.

8 Nov 16

## The pearls of AP Statistics 35

The disturbance term: To hide or not to hide? In an introductory Stats course, some part of the theory should be hidden. Where to draw the line is an interesting question. Here I discuss the ideas that look definitely bad to me.

### How disturbing is the disturbance term?

In the main text, Agresti and Franklin never mention the disturbance term $u_i$ in the regression model

(1) $y_i=a+bx_i+u_i$

(it is hidden in Exercise 12.105). Instead, they write the equation for the mean $\mu_y=a+bx$ that follows from (1) under the standard assumption $Eu_i=0$. This would be fine if the exposition stopped right there. However, one has to explain the random source of variability in $y_i$. On p. 583 the authors say: "The probability distribution of y values at a fixed value of x is a conditional distribution. At each value of x, there is a conditional distribution of y values. A regression model also describes these distributions. An additional parameter σ describes the standard deviation of each conditional distribution."

Further, Figure 12.4 illustrates distributions of errors at different points and asks: "What do the bell-shaped curves around the line at x = 12 and at x = 16 represent?"

Figure 12.4. Illustration of error distributions

Besides, explanations of heteroscedasticity and of the residual sum of squares are impossible without explicitly referring to the disturbance term.

### Attributing a regression property to the correlation is not good

On p.589 I encountered a statement that puzzled me: "An important property of the correlation is that at any particular x value, the predicted value of y is relatively closer to its mean than x is to its mean. If an x value is a certain number of standard deviations from its mean, then the predicted y is r times that many standard deviations from its mean."

Firstly, this is a verbal interpretation of some formula, so why not give the formula itself? How good must a student be to guess what is behind the verbal formulation?

Secondly, as I stressed in this post, the correlation coefficient does not entail any prediction about the magnitude of a change in one variable caused by a change in another. The above statement about the predicted value of y must be a property of regression. Attributing a regression property to the correlation is not in the best interests of those who want to study Stats at a more advanced level.

Thirdly, I felt challenged to see something new in the area I thought I knew everything about. So here is the derivation. By definition, the fitted value is

(2) $\hat{y_i}=\hat{a}+\hat{b}x_i$

where the hats stand for estimators. The fitted line passes through the point $(\bar{x},\bar{y})$:

(3) $\bar{y}=\hat{a}+\hat{b}\bar{x}$

(this will be proved elsewhere). Subtracting (3) from (2) we get

(4) $\hat{y_i}-\bar{y}=\hat{b}(x_i-\bar{x})$

(using equation (4) from this post)

$=\rho\frac{\sigma(y)}{\sigma(x)}(x_i-\bar{x}).$

It is helpful to rewrite (4) in a more symmetric form:

(5) $\frac{\hat{y_i}-\bar{y}}{\sigma(y)}=\rho\frac{x_i-\bar{x}}{\sigma(x)}.$

This is the equation we need. Suppose an x value is a certain number of standard deviations from its mean: $x_i-\bar{x}=k\sigma(x)$. Plug this into (5) to get $\hat{y_i}-\bar{y}=\rho k\sigma(y)$, that is, the predicted y is $\rho$ times that many standard deviations from its mean.
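Equation (5) is easy to verify numerically. In the sketch below (Python; the simulated data set is an arbitrary assumption), $\rho$ is the sample correlation and $\sigma(x),\sigma(y)$ are sample standard deviations, as in the derivation:

```python
import math
import random
import statistics

random.seed(2)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 + 3 * xi + random.gauss(0, 1) for xi in x]    # arbitrary linear model

xbar, ybar = statistics.fmean(x), statistics.fmean(y)
sx, sy = statistics.pstdev(x), statistics.pstdev(y)
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
rho = cov / (sx * sy)                  # sample correlation
b_hat = cov / statistics.pvariance(x)  # OLS slope

k = 1.5                                # x0 is k standard deviations from the mean of x
x0 = xbar + k * sx
y_hat = ybar + b_hat * (x0 - xbar)     # fitted value at x0, equations (2)-(4)

# equation (5): the predicted y is rho * k standard deviations from its mean
assert abs((y_hat - ybar) / sy - rho * k) < 1e-9
```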

8 Oct 16

## The pearls of AP Statistics 32

Student's t distribution: one-line explanation of its origin

They say: We’ll now learn about a confidence interval that applies even for small sample sizes… Suppose we knew the standard deviation, $\sigma/\sqrt{n}$, of the sample mean. Then, with the additional assumption that the population is normal, with small $n$ we could use the formula $\bar{x}\pm z\sigma/\sqrt{n}$, for instance with $z = 1.96$ for 95% confidence. In practice, we don’t know the population standard deviation σ. Substituting the sample standard deviation s for σ to get $se=s/\sqrt{n}$ then introduces extra error. This error can be sizeable when $n$ is small. To account for this increased error, we must replace the z-score by a slightly larger score, called a t-score. The confidence interval is then a bit wider. (Agresti and Franklin, p.369)

I say: The opening statement in italic (We’ll now learn about...) creates the wrong impression that the task at hand is to address small sample sizes. The next part in italic (To account for this increased error...) confuses the reader further by implying that

1) using the sample standard deviation instead of the population standard deviation and

2) replacing the z score by the t score

are two separate acts. They are not: see equation (3) below. The last proposition in italic (The confidence interval is then a bit wider) is true. It confused me to the extent that I made a wrong remark in the first version of this post, see Remark 4 below.

### Preliminaries

William Gosset published his result under a pseudonym ("Student"), and that result was modified by Ronald Fisher to what we know now as Student's t distribution. Gosset with his statistic wanted to address small sample sizes. The modern explanation is different: the t statistic arises from replacing the unknown population variance by its estimator, the sample variance, and it works regardless of the sample size. If we take a couple of facts on trust, the explanation will be just a one-line formula.

Let $X_1,...,X_n$ be a sample of independent observations from a normal population.

Fact 1. The z-score of the sample mean

(1) $z_0=\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$

is a standard normal variable.

Fact 2. The sample variance $s^2=\frac{1}{n-1}\sum(X_i-\bar{X})^2$ upon scaling becomes a chi-square variable. More precisely, the variable

(2) $\chi^2_{n-1}=\frac{(n-1)s^2}{\sigma^2}$

is a chi-square with $n-1$ degrees of freedom.

Fact 3. The variables in (1) and (2) are independent.

### Intuitive introduction to t distribution

When a population parameter is unknown, replace it by its estimator. Following this general statistical idea, in the situation when $\sigma$ is unknown, instead of (1) consider

(3) $t=\frac{\bar{X}-\mu}{s/\sqrt{n}}$ (dividing and multiplying by $\sigma$) $=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\frac{1}{\sqrt{s^2/\sigma^2}}$ (using (1), (2)) $=\frac{z_0}{\sqrt{\chi^2_{n-1}/(n-1)}}.$

By definition and because the numerator and denominator are independent, the last expression is a t distribution with $n-1$ degrees of freedom. This is all there is to it.
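Equation (3) is an algebraic identity, so for any concrete sample the t statistic computed directly coincides with the one assembled from the z-score and the chi-square variable. A sketch (Python; the normal population parameters and the sample size are arbitrary assumptions):

```python
import math
import random
import statistics

random.seed(3)
mu, sigma, n = 10.0, 2.0, 8            # arbitrary normal population and sample size
sample = [random.gauss(mu, sigma) for _ in range(n)]

xbar = statistics.fmean(sample)
s = statistics.stdev(sample)           # sample standard deviation (divisor n - 1)

t_direct = (xbar - mu) / (s / math.sqrt(n))          # equation (3), left-hand side

z0 = (xbar - mu) / (sigma / math.sqrt(n))            # Fact 1
chi2 = (n - 1) * s**2 / sigma**2                     # Fact 2
t_assembled = z0 / math.sqrt(chi2 / (n - 1))         # right-hand side of (3)

assert abs(t_direct - t_assembled) < 1e-9
```

This also shows why substituting s for σ and replacing the z score by the t score are not two separate acts: the same substitution produces both.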

### Concluding remarks

Remark 1. When I give the definitions of the chi-square, t statistic and F statistic, my students are often surprised, because there is no reference to samples. To be precise, the way I define them, they are random variables rather than statistics, so "distribution" or "variable" would be more appropriate than "statistic". A statistic, by definition, is a function of observations. The variable we start with in (3) is, obviously, a statistic. Equation (3) means that this statistic is distributed as t with $n-1$ degrees of freedom.

Remark 2. Many AP Stats books claim that a sum of normal variables is normal. In fact, for this to be true we need independence of the summands. Under our assumption of independent observations and normality, the sum $X_1+...+X_n$ is normal. The variable in (1) is normal as a linear transformation of this sum. Since its mean is zero and its variance is 1, it is a standard normal. We have proved Fact 1. The proofs of Facts 2 and 3 are much more complex.

Remark 3. The t statistic is not used for large samples not because it does not work for large n but because for large n it is close to the z score.

Remark 4. Taking the t distribution defined here as a standard t, we can define a general t as its linear transformation, $GeneralT=\sigma*StandardT+\mu$ (similarly to general normals). Since the standard deviation of the standard t is not 1, the standard deviation of the general t we have defined will not be $\sigma$. The general t is necessary to use the Mathematica function StudentTCI (confidence interval for Student's t). The t score that arises in estimation is the standard t. In this case, confidence intervals based on t are indeed wider than those based on z. I apologize for my previous wrong comment and am posting this video. See an updated Mathematica file.

2 Oct 16

## The pearls of AP Statistics 31

Demystifying sampling distributions: too much talking about nothing

### What we know about sample means

Let $X_1,...,X_n$ be an independent identically distributed sample and consider its sample mean $\bar{X}$.

Fact 1. The sample mean is an unbiased estimator of the population mean:

(1) $E\bar{X}=\frac{1}{n}(EX_1+...+EX_n)=\frac{1}{n}(\mu+...+\mu)=\mu$

(use linearity of means).

Fact 2. Variance of the sample mean is

(2) $Var(\bar{X})=\frac{1}{n^2}(Var(X_1)+...+Var(X_n))=\frac{1}{n^2}(\sigma^2(X)+...+\sigma^2(X))=\frac{\sigma^2(X)}{n}$

(use homogeneity of variance of degree 2 and additivity of variance for independent variables). Hence $\sigma(\bar{X})=\frac{\sigma(X)}{\sqrt{n}}$.

Fact 3. The implication of these two properties is that the sample mean becomes more concentrated around the population mean as the sample size increases (see at least the law of large numbers; I have a couple more posts about this).

Fact 4. Finally, the z scores of sample means stabilize to a standard normal distribution (the central limit theorem).
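Facts 1 and 2 (equations (1) and (2)) can be illustrated by simulation. A sketch in Python (the population and the sizes are arbitrary assumptions):

```python
import random
import statistics

random.seed(4)
sigma, n, reps = 3.0, 25, 20000        # arbitrary choices
means = []
for _ in range(reps):
    sample = [random.gauss(0, sigma) for _ in range(n)]
    means.append(statistics.fmean(sample))

# Fact 1: the sample mean is centered at the population mean (here 0)
assert abs(statistics.fmean(means)) < 0.05
# Fact 2: Var(sample mean) = sigma^2 / n
theory = sigma**2 / n
empirical = statistics.pvariance(means)
assert abs(empirical - theory) < 0.1 * theory
```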

### What is a sampling distribution?

The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take (Agresti and Franklin, p.308). After this definition, the authors go ahead and discuss the above four facts. Note that none of them requires the knowledge of what the sampling distribution is. The ONLY sampling distribution that appears explicitly in AP Statistics is the binomial. However, in the book the binomial is given in Section 6.3, before sampling distributions, which are the subject of Chapter 7. Section 7.3 explains that the binomial is a sampling distribution but that section is optional. Thus the whole Chapter 7 (almost 40 pages) is redundant.

### Then what are sampling distributions for?

Here is a simple example that explains their role. Consider the binomial $X_1+X_2$ of two observations on an unfair coin. It involves two random variables and therefore is described by a joint distribution with the sample space consisting of pairs of values

Table 1. Sample space for pair $(X_1,X_2)$

| | Coin 1: 0 | Coin 1: 1 |
|---|---|---|
| **Coin 2: 0** | (0,0) | (0,1) |
| **Coin 2: 1** | (1,0) | (1,1) |

Each coin independently takes values 0 and 1 (shown in the margins); the sample space contains four pairs of these values (shown in the main body of the table). The corresponding probability distribution is given by the table

Table 2. Joint probabilities for pair $(X_1,X_2)$

| | Coin 1: p | Coin 1: q |
|---|---|---|
| **Coin 2: p** | $p^2$ | $pq$ |
| **Coin 2: q** | $pq$ | $q^2$ |

Since we are counting only the number of successes, the outcomes (0,1) and (1,0) for the purposes of our experiment are the same. Hence, joining indistinguishable outcomes, we obtain a smaller sample space

Table 3. Sampling distribution for binomial $X_1+X_2$

| # of successes | Corresponding probabilities |
|---|---|
| 0 | $p^2$ |
| 1 | $2pq$ |
| 2 | $q^2$ |

The last table is the sampling distribution for the binomial with sample size 2. All the sampling distribution does is replace a large joint distribution Table 1+Table 2 by a smaller distribution Table 3. The beauty of proofs of equations (1) and (2) is that they do not depend on which distribution is used (the distribution is hidden in the expected value operator).
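The collapse from Tables 1-2 to Table 3 can be spelled out in a few lines of Python (the value of p is an arbitrary assumption; following the tables, p is the probability attached to the value 0):

```python
from collections import defaultdict
from itertools import product

p = 0.3                      # P(coin shows 0); arbitrary assumption
q = 1 - p                    # P(coin shows 1)

# joint distribution of the pair (X1, X2): four outcomes with product probabilities (Table 2)
joint = {(x1, x2): (p if x1 == 0 else q) * (p if x2 == 0 else q)
         for x1, x2 in product((0, 1), repeat=2)}

# collapse indistinguishable outcomes by the number of successes (Table 3)
sampling = defaultdict(float)
for (x1, x2), prob in joint.items():
    sampling[x1 + x2] += prob

assert abs(sampling[0] - p**2) < 1e-12
assert abs(sampling[1] - 2 * p * q) < 1e-12
assert abs(sampling[2] - q**2) < 1e-12
```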

Unless you want your students to appreciate the reduction in the sample space brought about by sampling distributions, it is not worth discussing them. See Wikipedia for examples other than the binomial.

4 Sep 16

## The pearls of AP Statistics 25

### Central Limit Theorem versus Law of Large Numbers

They say: The Central Limit Theorem (CLT). Describes the Expected Shape of the Sampling Distribution for Sample Mean $\bar{X}$. For a random sample of size $n$ from a population having mean μ and standard deviation σ, then as the sample size $n$ increases, the sampling distribution of the sample mean $\bar{X}$ approaches an approximately normal distribution. (Agresti and Franklin, p.321)

I say: There are at least three problems with this statement.

Problem 1. With any notion or statement, I would like to know its purpose in the first place. The primary purpose of the law of large numbers is to estimate population parameters. The Central Limit Theorem may be a nice theoretical result, but why do I need it? The motivation is similar to the one we use for introducing the z score. There is a myriad of distributions. Only some standard distributions have been tabulated. Suppose we have a sequence of variables $X_n$, none of which have been tabulated. Suppose also that, as $n$ increases, those variables become close to a normal variable in the sense that the cumulative probabilities (areas under their respective densities) become close:

(1) $P(X_n\le a)\rightarrow P(normal\le a)$ for all $a$.

Then we can use tables developed for normal variables to approximate $P(X_n\le a)$. This justifies using (1) as the definition of a new convergence type called convergence in distribution.

Problem 2. Having introduced convergence (1), we need to understand what it means in terms of densities (distributions). As illustrated in Excel, the law of large numbers means convergence to a spike. In particular, the sample mean converges to a mass concentrated at μ (densities contract to one point). Referring to the sample mean in the context of the CLT is misleading, because the CLT is about the stabilization of densities.

Figure 1. Law of large numbers with n=100, 1000, 10000

Figure 1 appeared in my posts before, I just added n=10,000, to show that densities do not stabilize.

Figure 2. Central limit theorem with n=100, 1000, 10000

In Figure 2, for clarity I use line plots instead of histograms. The density for n=100 is very rugged. The blue line (for n=1000) is more rugged than the orange (for n=10,000). Convergence to a normal shape is visible, although slow.

Main problem. It is not the sample means that converge to a normal distribution. It is their z scores

$z=\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}$

that do. Specifically,

$P(\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}\le a)\rightarrow P(z\le a)$ for all $a$

where $z$ is a standard normal variable.

In my simulations I used sample means for Figure 1 and z scores of sample means for Figure 2. In particular, z scores always have means equal to zero, and that can be seen in Figure 2. In your class, you can use the Excel file. As usual, you have to enable macros.
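The same point can be made without Excel. The Python sketch below (a uniform population is my arbitrary choice, precisely because it is far from normal) computes the z scores of sample means: their mean is near 0 and their standard deviation near 1, and by the CLT their histogram approaches the standard normal shape:

```python
import math
import random
import statistics

random.seed(5)
n, reps = 100, 20000
zscores = []
for _ in range(reps):
    sample = [random.random() for _ in range(n)]    # uniform population on [0, 1]
    xbar = statistics.fmean(sample)
    # E(xbar) = 0.5 and sigma(xbar) = sqrt(1/12)/sqrt(n) for this population
    zscores.append((xbar - 0.5) / (math.sqrt(1 / 12) / math.sqrt(n)))

# z scores are centered at 0 with unit standard deviation
assert abs(statistics.fmean(zscores)) < 0.05
assert abs(statistics.pstdev(zscores) - 1) < 0.05
```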

3 Sep 16

## All you need to know about the law of large numbers

All about the law of large numbers: properties and applications

### Level 1: estimation of population parameters

The law of large numbers is a statement about convergence which is called convergence in probability and denoted $\text{plim}$. The precise definition is rather complex but the intuition is simple: it is convergence to a spike at the parameter being estimated. Usually, any unbiasedness statement has its analog in terms of the corresponding law of large numbers.

Example 1. The sample mean unbiasedly estimates the population mean: $E\bar{X}=EX$. Its analog: the sample mean converges to a spike at the population mean: $\text{plim}\bar{X}=EX$. See the proof based on the Chebyshev inequality.

Example 2. The sample variance unbiasedly estimates the population variance: $E\overline{s^2}=Var(X)$ where $s^2=\frac{\sum(X_i-\bar{X})^2}{n-1}$. Its analog: the sample variance converges to a spike at the population variance:

(1) $\text{plim}\overline{s^2}=Var(X)$.

Example 3. The sample covariance $s_{X,Y}=\frac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}$ unbiasedly estimates the population covariance: $E\overline{s_{X,Y}}=Cov(X,Y)$. Its analog: the sample covariance converges to a spike at the population covariance:

(2) $\text{plim}\overline{s_{X,Y}}=Cov(X,Y)$.

### Up one level: convergence in probability is just convenient

Whether or not to use convergence in probability is a matter of expedience. For the usual limits of sequences we know the properties which I call preservation of arithmetic operations:

$\lim(a_n\pm b_n)=\lim a_n\pm \lim b_n,$

$\lim(a_n\times b_n)=\lim a_n\times\lim b_n,$

$\lim(a_n/ b_n)=\lim a_n/\lim b_n.$

Convergence in probability has exactly the same properties: just replace $\lim$ with $\text{plim}$.
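A quick numerical illustration (not a proof) of these preservation properties, under arbitrarily assumed populations with known means:

```python
import random
import statistics

random.seed(6)
n = 200000                                  # large n: sample means are close to their plims
x = [random.gauss(2, 1) for _ in range(n)]  # plim of the sample mean of x is 2
y = [random.gauss(5, 1) for _ in range(n)]  # plim of the sample mean of y is 5

xbar, ybar = statistics.fmean(x), statistics.fmean(y)
# arithmetic operations are preserved: sums, products and ratios of the
# sample means are close to the same operations applied to the limits
assert abs((xbar + ybar) - 7) < 0.05
assert abs((xbar * ybar) - 10) < 0.1
assert abs((xbar / ybar) - 0.4) < 0.01
```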

### Next level: making regression estimation more plausible

Using convergence in probability allows us to handle stochastic regressors and avoid the unrealistic assumption that regressors are deterministic.

Convergence in probability and in distribution are two types of convergence of random variables that are widely used in the Econometrics course of the University of London.

2 Sep 16

## Proving unbiasedness of OLS estimators

Proving unbiasedness of OLS estimators - the do's and don'ts

### Groundwork

Here we derived the OLS estimators. To distinguish between sample and population means, the variance and covariance in the slope estimator will be provided with the subscript u (for "uniform", see the rationale here).

(1) $\hat{b}=\frac{Cov_u(x,y)}{Var_u(x)}$,

(2) $\hat{a}=\bar{y}-\hat{b}\bar{x}$.

These equations are used in conjunction with the model

(3) $y_i=a+bx_i+e_i$

where we remember that

(4) $Ee_i=0$ for all $i$.

Since (2) depends on (1), we have to start with unbiasedness of the slope estimator.

### Using the right representation is critical

We have to show that $E\hat{b}=b$.

Step 1. Don't apply the expectation directly to (1). Do separate in (1) what is supposed to be $E\hat{b}$. To reveal the role of errors in (1), plug (3) in (1) and use linearity of covariance with respect to each argument when the other argument is fixed:

$\hat{b}=\frac{Cov_u(x,a+bx+e)}{Var_u(x)}=\frac{Cov_u(x,a)+bCov_u(x,x)+Cov_u(x,e)}{Var_u(x)}$.

Here $Cov_u(x,a)=0$ (a constant is uncorrelated with any variable), $Cov_u(x,x)=Var_u(x)$ (covariance of $x$ with itself is its variance), so

(5) $\hat{b}=\frac{bVar_u(x)+Cov_u(x,e)}{Var_u(x)}=b+\frac{Cov_u(x,e)}{Var_u(x)}$.

Equation (5) is the mean-plus-deviation-from-the-mean decomposition. Many students think that $Cov_u(x,e)=0$ because of (4). No! The covariance here does not involve the population mean.

Step 2. It pays to make one more step to develop (5). Write out the numerator in (5) using summation:

$\hat{b}=b+\frac{1}{n}\sum(x_i-\bar{x})(e_i-\bar{e})/Var_u(x).$

Don't write out $Var_u(x)$! Presence of two summations confuses many students.

Multiplying parentheses and using the fact that $\sum(x_i-\bar{x})=n\bar{x}-n\bar{x}=0$ we have

$\hat{b}=b+\frac{1}{n}[\sum(x_i-\bar{x})e_i-\bar{e}\sum(x_i-\bar{x})]/Var_u(x)$

$=b+\frac{1}{n}\sum\frac{(x_i-\bar{x})}{Var_u(x)}e_i.$

To simplify calculations, denote $a_i=(x_i-\bar{x})/Var_u(x).$ Then the slope estimator becomes

(6) $\hat{b}=b+\frac{1}{n}\sum a_ie_i.$

This is the critical representation.
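Representation (6) is an exact identity, not an approximation, so it can be verified on any simulated sample. A sketch in Python (the model parameters and the regressor are arbitrary assumptions):

```python
import random
import statistics

random.seed(7)
a_true, b_true, n = 1.0, 2.0, 30
x = [float(i) for i in range(n)]                       # deterministic regressor
e = [random.gauss(0, 1) for _ in range(n)]
y = [a_true + b_true * xi + ei for xi, ei in zip(x, e)]

xbar, ybar = statistics.fmean(x), statistics.fmean(y)
var_u = statistics.pvariance(x)                        # Var_u(x): uniform 1/n weights
cov_u = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
b_hat = cov_u / var_u                                  # OLS slope, equation (1)

# representation (6): b_hat = b + (1/n) * sum of a_i * e_i
coeffs = [(xi - xbar) / var_u for xi in x]
b_repr = b_true + sum(ai * ei for ai, ei in zip(coeffs, e)) / n

assert abs(b_hat - b_repr) < 1e-9
```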

### Unbiasedness of the slope estimator

Convenience condition: The regressor $x$ is deterministic. I call it a convenience condition because it is just a matter of mathematical expedience; later on we'll study ways to bypass it.

From (6), linearity of means and remembering that the deterministic coefficients $a_i$ behave like constants,

(7) $E\hat{b}=E[b+\frac{1}{n}\sum a_ie_i]=b+\frac{1}{n}\sum a_iEe_i=b$

by (4). This proves unbiasedness.

You don't know the difference between the population and sample means until you see them working in the same formula.

### Unbiasedness of the intercept estimator

As above we plug (3) in (2): $\hat{a}=\overline{a+bx+e}-\hat{b}\bar{x}=a+b\bar{x}+\bar{e}-\hat{b}\bar{x}$. Applying expectation:

$E\hat{a}=a+b\bar{x}+E\bar{e}-E\hat{b}\bar{x}=a+b\bar{x}-b\bar{x}=a.$

### Conclusion

Since in (1)  there is division by $Var_u(x)$, the condition $Var_u(x)\ne 0$ is the main condition for existence of OLS estimators. From the above proof we see that (4) is the main condition for unbiasedness.

30 Aug 16

## The pearls of AP Statistics 24

Unbiasedness: the stumbling block of a Statistics course

### God is in the detail

They say: A good estimator has a sampling distribution that is centered at the parameter. We define center in this case as the mean of that sampling distribution. An estimator with this property is said to be unbiased. From Section 7.2, we know that for random sampling the mean of the sampling distribution of the sample mean x equals the population mean μ. So, the sample mean x is an unbiased estimator of μ. (Agresti and Franklin, p.351).

I say: This is a classic case of turning everything upside down, and this happens when the logical chain is broken. Unbiasedness is one of the pillars of Statistics. It can and should be given right after the notion of population mean is introduced. The authors make the definition dependent on random sampling, sampling distribution and a whole Section 7.2. Therefore I highly doubt that any student can grasp the above definition. My explanation below may not be the best; I just want to prompt the reader to think about alternatives to the above "definition".

### Population mean versus sample mean

By definition, in the discrete case, a random variable is a table values+probabilities:

| Values | Probabilities |
|---|---|
| $X_1$ | $p_1$ |
| ... | ... |
| $X_n$ | $p_n$ |

If we know this table, we can define the population mean $\mu=EX=p_1X_1+...+p_nX_n$. This is a weighted average of the variable values because the probabilities are percentages: $0<p_i\le 1$ for all $i$ and $p_1+...+p_n=1$. The expectation operator $E$ is the device used by Mother Nature to measure the average, and most of the time she keeps hidden from us both the probabilities and the average $EX.$

Now suppose that $X_1,...,X_n$ represent a sample from the given population (and not the values in the above table). We can define the sample mean $\bar{X}=\frac{X_1+...+X_n}{n}$. Being a little smarter than monkeys, instead of the unknown probabilities we use the uniform distribution $p_i=1/n$. Unlike the population average, the sample average can always be calculated, as long as the sample is available.

Consider a good shooter shooting at three targets using a good rifle.

The black dots represent points hit by bullets on three targets. In Figure 1, there was only one shot. What is your best guess about where the bull's eye is? Regarding Figure 2, everybody says that probably the bull's eye is midway (red point) between points A and B. In Figure 3, the sample mean is represented by the red point. Going back to unbiasedness: 1) the bull's eye is the unknown population parameter that needs to be estimated, 2) points hit by bullets are sample observations, 3) their sample mean is represented by red points, 4) the red points estimate the location of the bull's eye. The sample mean is said to be an unbiased estimator of population mean because

(1) $E\bar{X}=\mu$.

In words, Mother Nature says that, in her opinion, on average our bullets hit the bull's eye.

This explanation is an alternative to the one you can see in many books: in the long run, the sample mean correctly estimates the population mean. That explanation in fact replaces equation (1) by the corresponding law of large numbers. My explanation just underlines the fact that there is an abstract average, that we cannot use, and the sample average, that we invent to circumvent that problem.

See related theoretical facts here.

24 Aug 16

## The pearls of AP Statistics 22

The law of large numbers - a bird's view

They say: In 1689, the Swiss mathematician Jacob Bernoulli proved that as the number of trials increases, the proportion of occurrences of any given outcome approaches a particular number (such as 1/6) in the long run. (Agresti and Franklin, p.213).

I say: The expression “law of large numbers” appears in the book 13 times, yet its meaning is never clearly explained. The closest approximation to the truth is the above sentence about Jacob Bernoulli. To see if this explanation works, tell it to your students and ask what they understood. To me, this is a clear case when withholding theory harms understanding.

Intuition comes first. I ask my students: if you flip a fair coin 100 times, what do you expect the proportion of ones to be? Absolutely everybody replies correctly, just the form of the answer may be different (50-50 or 0.5 or 50 out of 100). Then I ask: probably it will not be exactly 0.5 but if you flip the coin 1000 times, do you expect the proportion to be closer to 0.5? Everybody says: Yes. Next I ask: Suppose the coin is unfair and the probability of 1 appearing is 0.7. What would you expect the proportion to be close to in large samples? Most students come up with the right answer: 0.7. Congratulations, you have discovered what is called a law of large numbers!

Then we give a theoretical format to our little discovery. $p=0.7$ is a population parameter. Flipping a coin $n$ times we obtain observations $X_1,...,X_n$. The proportion of ones is the sample mean $\bar{X}=\frac{X_1+...+X_n}{n}$. The law of large numbers says two things: 1) as the sample size increases, the sample mean approaches the population mean. 2) At the same time, its variation about the population mean becomes smaller and smaller.

Part 1) is clear to everybody. To corroborate statement 2), I give two facts. Firstly, we know that the standard deviation of the sample mean is $\frac{\sigma}{\sqrt{n}}$. From this we see that as $n$ increases, the standard deviation of the sample mean decreases and the values of the sample mean become more and more concentrated around the population mean. We express this by saying that the sample mean converges to a spike. Secondly, I produce two histograms. With the sample size $n=100$, there are two modes of the histogram (just 10% each) at 0.69 and 0.72, while 0.7 was used as the population mean in my simulations. Besides, the spread of the values is large. With $n=1000$, the mode (27%) is at the true value 0.7, and the spread is low.

Finally, we relate our little exercise to practical needs. In practice, the true mean is never known. But we can obtain a sample and calculate its mean. With a large sample size, the sample mean will be close to the truth. More generally, take any other population parameter, such as its standard deviation, and calculate the sample statistic that estimates it, such as the sample standard deviation. Again, the law of large numbers applies and the sample statistic will be close to the population parameter. The histograms have been obtained as explained here and here. Download the Excel file.
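The classroom discovery above takes only a few lines to replay in code. A Python sketch (the coin probability 0.7 is the one from the discussion; the sample sizes are my choices):

```python
import random

random.seed(8)
p = 0.7                                    # unfair coin: P(1) = 0.7

def proportion_of_ones(n):
    # flip the coin n times and return the sample mean of the 0/1 outcomes
    return sum(random.random() < p for _ in range(n)) / n

small = proportion_of_ones(100)            # typically noticeably off 0.7
large = proportion_of_ones(100000)
# with more flips the proportion settles near the population parameter
assert abs(large - p) < 0.01
```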

5 Aug 16

## The pearls of AP Statistics 12

Are you sure you are massaging the right muscle in the brain?

They say: The median is the middle value of the observations when the observations are ordered from the smallest to the largest (or from the largest to the smallest) (Agresti and Franklin, p.47) Half the observations are smaller than it, and half are larger.

Generally, if the shape is: a) perfectly symmetric, the mean equals the median, b) skewed to the right, the mean is larger than the median, c) skewed to the left, the mean is smaller than the median (same source, p.51)

I say: those who don't think will swallow this without demur. I have two problems with it. Firstly, the way the definition of the median is given makes me think that the median takes one of the observed values, which it does not, in general. Of course, you can provide a caveat. But why not just say: the median is a point such that half of the observations lie to the left of it and the other half to the right? Secondly, the part about the relationship between the mean and median is just terrible because it appeals to memorization. I went to great lengths to explain what internal vision is. Here are questions that are aimed at developing it:

1. Suppose you have a sample $x_1,..., x_n$. If all values move to the right by a constant c, what happens to the mean, median, mode, sample variance, range and IQR?
2. Suppose you have a sample $x_1, ..., x_n, x_{n+1}, ..., x_{2n}$ with an even sample size $2n$. If the first $n$ values move to the left by a constant c and the last $n$ values move to the right by the same constant c, what happens to the mean, median, mode, sample variance, range and IQR?

To feel how a formula works, try to change its elements.
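For question 1, the answer can be felt by direct computation. A Python sketch (the sample and the constant c are arbitrary):

```python
import statistics

x = [2.0, 3.0, 5.0, 7.0, 11.0]      # arbitrary sample
c = 4.0
shifted = [xi + c for xi in x]

# the location measures move by c ...
assert abs(statistics.fmean(shifted) - (statistics.fmean(x) + c)) < 1e-9
assert abs(statistics.median(shifted) - (statistics.median(x) + c)) < 1e-9
# ... while the spread measures are unchanged
assert abs(statistics.variance(shifted) - statistics.variance(x)) < 1e-9
assert (max(shifted) - min(shifted)) == (max(x) - min(x))
```

Question 2 can be explored with the same technique.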