Nov 16

The pearls of AP Statistics 35

The disturbance term: To hide or not to hide? In an introductory Stats course, some part of the theory should be hidden. Where to draw the line is an interesting question. Here I discuss ideas that look definitely bad to me.

How disturbing is the disturbance term?

In the main text, Agresti and Franklin never mention the disturbance term u_i in the regression model

(1) y_i=a+bx_i+u_i

(it is hidden in Exercise 12.105). Instead, they write the equation for the mean \mu_y=a+bx that follows from (1) under the standard assumption Eu_i=0. This would be fine if the exposition stopped right there. However, one has to explain the random source of variability in y_i. On p. 583 the authors say: "The probability distribution of y values at a fixed value of x is a conditional distribution. At each value of x, there is a conditional distribution of y values. A regression model also describes these distributions. An additional parameter σ describes the standard deviation of each conditional distribution."

Further, Figure 12.4 illustrates distributions of errors at different points and asks: "What do the bell-shaped curves around the line at x = 12 and at x = 16 represent?"


Figure 12.4. Illustration of error distributions

Besides, explanations of heteroscedasticity and of the residual sum of squares are impossible without explicitly referring to the disturbance term.
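To make the role of the disturbance term concrete, here is a minimal numerical sketch (Python with numpy; the coefficients and sigma are made-up values). It simulates model (1) at a fixed x and checks that the conditional distribution of y is centered at \mu_y=a+bx with standard deviation \sigma, exactly the parameter the authors describe on p. 583.

```python
import numpy as np

rng = np.random.default_rng(8)
a, b, sigma = 1.0, 2.0, 0.5       # made-up model parameters

# model (1): y_i = a + b*x_i + u_i with E(u_i) = 0
x = np.full(10_000, 12.0)         # fix x at one value, as in Figure 12.4
u = rng.normal(0, sigma, size=x.size)
y = a + b * x + u

# the conditional distribution of y at x = 12 is centered at mu_y = a + b*x
assert abs(y.mean() - (a + b * 12.0)) < 0.05
# ...and its standard deviation is sigma, the "additional parameter"
assert abs(y.std() - sigma) < 0.05
```

All the randomness in y comes from u; without mentioning u, the spread of the bell curves in Figure 12.4 has no visible source.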

Attributing a regression property to the correlation is not good

On p.589 I encountered a statement that puzzled me: "An important property of the correlation is that at any particular x value, the predicted value of y is relatively closer to its mean than x is to its mean. If an x value is a certain number of standard deviations from its mean, then the predicted y is r times that many standard deviations from its mean."

Firstly, this is a verbal interpretation of some formula, so why not give the formula itself? How good must a student be to guess what is behind the verbal formulation?

Secondly, as I stressed in this post, the correlation coefficient does not entail any prediction about the magnitude of a change in one variable caused by a change in another. The above statement about the predicted value of y must be a property of regression. Attributing a regression property to the correlation is not in the best interests of those who want to study Stats at a more advanced level.

Thirdly, I felt challenged to see something new in the area I thought I knew everything about. So here is the derivation. By definition, the fitted value is

(2) \hat{y_i}=\hat{a}+\hat{b}x_i

where the hats stand for estimators. The fitted line passes through the point (\bar{x},\bar{y}):

(3) \bar{y}=\hat{a}+\hat{b}\bar{x}

(this will be proved elsewhere). Subtracting (3) from (2) we get

(4) \hat{y_i}-\bar{y}=\hat{b}(x_i-\bar{x})

It is helpful to rewrite (4) in a more symmetric form, using the expression \hat{b}=\rho\,\sigma(y)/\sigma(x) for the slope (equation (4) from this post):

(5) \frac{\hat{y_i}-\bar{y}}{\sigma(y)}=\rho\frac{x_i-\bar{x}}{\sigma(x)}.

This is the equation we need. Suppose an x value is a certain number of standard deviations from its mean: x_i-\bar{x}=k\sigma(x). Plug this into (5) to get \hat{y_i}-\bar{y}=\rho k\sigma(y), that is, the predicted y is \rho times that many standard deviations from its mean.
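Because (5) is an exact algebraic identity of least squares, it can be verified numerically on any data set. A sketch in Python with numpy, on synthetic data (the data-generating process is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)   # made-up noisy linear relationship

# OLS slope and intercept using 1/n moments (bias=True matches np.var's default)
b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()
r = np.corrcoef(x, y)[0, 1]

# pick a point k standard deviations from the mean of x
k = 1.5
x0 = x.mean() + k * np.std(x)
y_hat = a + b * x0

# equation (5): the predicted y is r*k standard deviations from its mean
assert np.isclose((y_hat - y.mean()) / np.std(y), r * k)
```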

Oct 16

The pearls of AP Statistics 34

Coefficient of determination: an inductive introduction to R squared

I know a person who did not understand this topic even though he had a PhD in Math. That was me more than twenty years ago, and the reason was that the topic was presented formally, without explaining the leading idea.

Leading idea

Step 1. We want to describe the relationship between observed y's and x's using the simple regression

y_i=a+bx_i+e_i.

Let us start with the simple case when there is no variability in y's, that is, the slope and the errors are zero. Since y_i=a for all i, we have y_i=\bar{y} and, of course,

(1) \sum(y_i-\bar{y})^2=0.

In the general case, we start with the decomposition

(2) y_i=\hat{y}_i+e_i

where \hat{y}_i is the fitted value and e_i is the residual, see this post. We still want to see how y_i is far from \bar{y}. With this purpose, from both sides of equation (2) we subtract \bar{y}, obtaining y_i-\bar{y}=(\hat{y}_i-\bar{y})+e_i. Squaring this equation, for the sum in (1) we get

(3) \sum(y_i-\bar{y})^2=\sum(\hat{y}_i-\bar{y})^2+2\sum(\hat{y}_i-\bar{y})e_i+\sum e^2_i.

Whoever was the first to do this, discovered that the cross product is zero and (3) simplifies to

(4) \sum(y_i-\bar{y})^2=\sum(\hat{y}_i-\bar{y})^2+\sum e^2_i.

The rest is a matter of definitions

Total Sum of Squares TSS=\sum(y_i-\bar{y})^2 (I prefer to call this the total variation around \bar{y}).

Explained Sum of Squares ESS=\sum(\hat{y}_i-\bar{y})^2 (to me, this is the explained variation around \bar{y}).

Residual Sum of Squares RSS=\sum e^2_i (the unexplained variation around \bar{y}, caused by the error term).

Thus from (4) we have

(5) TSS=ESS+RSS.
Step 2. It is desirable to have RSS close to zero and ESS close to TSS. Therefore we can use the ratio ESS/TSS as a measure of how good the regression describes the relationship between y's and x's. From (5) it follows that this ratio takes values between zero and 1. Hence, the coefficient of determination
R^2=\frac{ESS}{TSS}
can be interpreted as the percentage of total variation of y's around \bar{y} explained by regression. From (5) an equivalent definition is
R^2=1-\frac{RSS}{TSS}.
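The decomposition (4) and the two equivalent definitions of the coefficient of determination can be checked numerically. A sketch in Python with numpy on synthetic data (the coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=100)  # made-up model

# least squares fit with 1/n moment conventions
b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()
y_hat = a + b * x
e = y - y_hat                       # residuals, as in decomposition (2)

TSS = np.sum((y - y.mean()) ** 2)
ESS = np.sum((y_hat - y.mean()) ** 2)
RSS = np.sum(e ** 2)

assert np.isclose(TSS, ESS + RSS)             # the decomposition (4)/(5)
assert np.isclose(ESS / TSS, 1 - RSS / TSS)   # the two R-squared formulas agree
# in simple regression R^2 equals the squared correlation
assert np.isclose(ESS / TSS, np.corrcoef(x, y)[0, 1] ** 2)
```

The last assertion is the algebraic link between this post and the correlation coefficient.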
Back to the pearls of AP Statistics

How much of the above can be explained without algebra? Stats without algebra is a crippled creature. I am afraid any concepts requiring substantial algebra should be dropped from the AP Stats curriculum. Compare this post with the explanation on p. 592 of Agresti and Franklin.

Oct 16

Properties of variance

All properties of variance in one place

Certainty is the mother of quiet and repose, and uncertainty the cause of variance and contentions. Edward Coke

Preliminaries: study properties of means with proofs.

Definition. Yes, uncertainty leads to variance, and we measure it by Var(X)=E(X-EX)^2. It is useful to use the name deviation from mean for X-EX and realize that E(X-EX)=0, so that the mean of the deviation from mean cannot serve as a measure of variation of X around EX.

Property 1. Variance of a linear combination. For any random variables X,Y and numbers a,b one has
(1) Var(aX + bY)=a^2Var(X)+2abCov(X,Y)+b^2Var(Y).
The term 2abCov(X,Y) in (1) is called an interaction term. See this post for the definition and properties of covariance.
Var(aX + bY)=E[aX + bY -E(aX + bY)]^2

(using linearity of means)

=E(aX + bY-aEX -bEY)^2

(grouping by variable)

=E[a(X-EX)+b(Y-EY)]^2

(squaring out)

=E[a^2(X-EX)^2+2ab(X-EX)(Y-EY)+b^2(Y-EY)^2]

(using linearity of means and the definitions of variance and covariance)

=a^2Var(X) + 2abCov(X,Y) +b^2Var(Y).
Property 2. Variance of a sum. Letting in (1) a=b=1 we obtain
Var(X + Y) = Var(X) + 2Cov(X,Y)+Var(Y).

Property 3. Homogeneity of degree 2. Choose b=0 in (1) to get Var(aX)=a^2Var(X).
Exercise. What do you think is larger: Var(X+Y) or Var(X-Y)?
Property 4. If we add a constant to a variable, its variance does not change: Var(X+c)=E[X+c-E(X+c)]^2=E(X+c-EX-c)^2=E(X-EX)^2=Var(X)
Property 5. Variance of a constant is zero: Var(c)=E(c-Ec)^2=0.

Property 6. Nonnegativity. Since the squared deviation from mean (X-EX)^2 is nonnegative, its expectation is nonnegative: E(X-EX)^2\ge 0.

Property 7. Only a constant can have variance equal to zero: If Var(X)=0, then E(X-EX)^2 =(x_1-EX)^2p_1 +...+(x_n-EX)^2p_n=0, see the definition of the expected value. Since all probabilities are positive, we conclude that x_i=EX for all i, which means that X is identically constant.

Property 8. Shortcut for variance. We have the identity E(X-EX)^2=EX^2-(EX)^2. Indeed, squaring out gives

E(X-EX)^2 =E(X^2-2XEX+(EX)^2)

(distributing expectation)

=EX^2-2E(X\cdot EX)+E(EX)^2

(EX is a constant, and the expectation of a constant is that constant)

=EX^2-2(EX)^2+(EX)^2=EX^2-(EX)^2.
All of the above properties apply to any random variables. The next one is an exception in the sense that it applies only to uncorrelated variables.

Property 9. If variables are uncorrelated, that is, Cov(X,Y)=0, then from (1) we have Var(aX + bY)=a^2Var(X)+b^2Var(Y). In particular, letting a=b=1, we get additivity: Var(X+Y)=Var(X)+Var(Y). Recall that the expected value is always additive.

Generalizations. Var(\sum a_iX_i)=\sum a_i^2Var(X_i) and Var(\sum X_i)=\sum Var(X_i) if all X_i are uncorrelated.
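Properties 1, 4 and 8 are exact identities for any empirical distribution, so they can be verified on arbitrary data. A sketch in Python with numpy (the data and coefficients are synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=1000)
Y = rng.normal(size=1000)
a, b = 2.0, -3.0                     # made-up coefficients

var = lambda v: np.var(v)            # population variance, 1/n weights
cov = lambda u, v: np.cov(u, v, bias=True)[0, 1]

# Property 1: Var(aX + bY) = a^2 Var(X) + 2ab Cov(X,Y) + b^2 Var(Y)
lhs = var(a * X + b * Y)
rhs = a**2 * var(X) + 2 * a * b * cov(X, Y) + b**2 * var(Y)
assert np.isclose(lhs, rhs)

# Property 4: adding a constant does not change the variance
assert np.isclose(var(X + 10.0), var(X))

# Property 8: shortcut Var(X) = E(X^2) - (EX)^2
assert np.isclose(var(X), np.mean(X**2) - np.mean(X) ** 2)
```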

Among my posts, where properties of variance are used, I counted 12 so far.

Oct 16

The pearls of AP Statistics 33

Correlation and regression are two separate entities

They say: The correlation summarizes the direction of the association between two quantitative variables and the strength of its linear (straight-line) trend (Agresti and Franklin, p.105). Later, at a level that is supposed to be more advanced, they repeat: The correlation, denoted by r, describes linear association between two variables (p.586).

I say: This is a common misconception about correlation, even Wikipedia says so. Once I was consulting specialists from the Oncology Institute in Almaty. Until then, all of them were using correlation to study their data. When I suggested using simple regression, they asked what was the difference and how regression was better. I said: correlation is a measure of statistical relationship. When two variables are positively correlated and one of them goes up, the other also goes up (on average) but you never know by how much. On the other hand, regression gives a specific algebraic dependence between two variables, so that you can quantify your predictions about changes in one of them caused by changes in another.

Because of algebra of least squares estimation, you can conclude something about correlation if you know the estimated slope, and vice versa, but conceptually correlation and regression are different and there is no need to delay the study of correlation until after regression. The correlation coefficient is defined as

(1) \rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}.

See this post for the definition and properties of covariance. As one can see, it can be studied right after the covariance and standard deviation. The slope of the regression line is a result of least squares fitting, which is a more advanced concept, and is given by

(2) b=\frac{Cov(X,Y)}{Var(X)},

see a simplified derivation or a full derivation. I am using the notations

(3) Cov(X,Y)=\frac{1}{n}\sum(X_i-\bar{X})(Y_i-\bar{Y}),\ Var(X)=Cov(X,X),\ \sigma(X)=\sqrt{Var(X)}

which arise from the corresponding population characteristics as explained in this post. Directly from (1) and (2) we see that

(4) b=\rho(X,Y)\frac{\sigma(Y)}{\sigma(X)},\ \rho(X,Y)=b\frac{\sigma(X)}{\sigma(Y)}.

Using these equations, we can go from the correlation to the slope and back if we know the sigmas. In particular, they are positive or negative simultaneously. The second equation in (4) gives rise to the interpretation of the correlation as "a standardized version of the slope" (p.588). To me, this "intuition" is far-fetched.
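The conversion formulas (4) hold exactly on any data set under the conventions in (3). A sketch in Python with numpy, using the 1/n moment convention (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=500)
Y = 0.7 * X + rng.normal(size=500)   # made-up related variables

cov = np.cov(X, Y, bias=True)[0, 1]  # 1/n convention, as in (3)
sx, sy = np.std(X), np.std(Y)

rho = cov / (sx * sy)                # correlation, equation (1)
b = cov / np.var(X)                  # regression slope, equation (2)

# the conversion formulas (4): slope from correlation and back
assert np.isclose(b, rho * sy / sx)
assert np.isclose(rho, b * sx / sy)
```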

Notice how economical the sequence of definitions in (3) is: one follows from another, which makes them easier to remember, and summation signs are reduced to a minimum. Under the "non-algebraic" approach, the covariance, variance and standard deviation are given separately, increasing the burden on one's memory.

Oct 16

Properties of means

Properties of means, covariances and variances are bread and butter of professionals. Here we consider the bread - the means

Properties of means: as simple as playing with tables

Definition of a random variable. When my Brazilian students asked for an intuitive definition of a random variable, I said: It is a function whose values are unpredictable. Therefore it is prohibited to work with their values and allowed to work only with their various means. For proofs we need a more technical definition: it is a table values+probabilities of type Table 1.

Table 1.  Random variable definition
Values of X Probabilities
x_1 p_1
... ...
x_n p_n

Note: The complete form of writing p_i is P(X = x_i).

Definition of the mean (or expected value): EX = x_1p_1 + ... + x_np_n = \sum\limits_{i = 1}^n x_ip_i. In words, this is a weighted sum of values, where the weights p_i reflect the importance of the corresponding x_i.

Note: The expected value is a function whose argument is a complex object (it is described by Table 1) and the value is simple: EX is just a number. And it is not a product of E and X! See how different means fit this definition.

Definition of a linear combination. See here the financial motivation. Suppose that X,Y are two discrete random variables with the same probability distribution p_1,...,p_n. Let a,b be real numbers. The random variable aX + bY is called a linear combination of X,Y with coefficients a,b. Its special cases are aX (X scaled by a) and X + Y (the sum of X and Y). The detailed definition is given by Table 2.

Table 2.  Linear operations definition
Values of X Values of Y Probabilities aX X + Y aX + bY
x_1 y_1 p_1 ax_1 x_1 + y_1 ax_1 + by_1
...  ... ...  ...  ...  ...
x_n y_n p_n ax_n x_n + y_n ax_n + by_n

Note: The situation when the probability distributions are different is reduced to the case when they are the same, see my book.

Property 1. Linearity of means. For any random variables X,Y and any numbers a,b one has

(1) E(aX + bY) = aEX + bEY.

Proof. This is one of those straightforward proofs when knowing the definitions and starting with the left-hand side is enough to arrive at the result. Using the definitions in Table 2, the mean of the linear combination is
E(aX + bY)= (ax_1 + by_1)p_1 + ... + (ax_n + by_n)p_n

(distributing probabilities)

= ax_1p_1 + by_1p_1 + ... + ax_np_n + by_np_n

(grouping by variables)

= (ax_1p_1 + ... + ax_np_n) + (by_1p_1 + ... + by_np_n)

(pulling out constants)

= a(x_1p_1 + ... + x_np_n) + b(y_1p_1 + ... + y_np_n)=aEX+bEY.
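The proof can be replayed numerically on a concrete version of Table 2. A sketch in Python with numpy (the values, probabilities and coefficients are made up):

```python
import numpy as np

# Table 2 made concrete: values of X and Y sharing one probability column
x = np.array([1.0, 2.0, 3.0])
y = np.array([5.0, -1.0, 0.5])
p = np.array([0.2, 0.5, 0.3])    # probabilities sum to 1 (completeness axiom)
a, b = 4.0, -2.0

E = lambda v: np.sum(v * p)      # the expected value as a weighted sum

# Property 1, linearity: E(aX + bY) = aEX + bEY
assert np.isclose(E(a * x + b * y), a * E(x) + b * E(y))
```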

See applications: one, and two, and three.

Generalization to the case of a linear combination of n variables:

E(a_1X_1 + ... + a_nX_n) = a_1EX_1 + ... + a_nEX_n.

Special cases. a) Letting a = b = 1 in (1) we get E(X + Y) = EX + EY. This is called additivity. See an application. b) Letting b = 0 in (1) we get E(aX) = aEX. This property is called homogeneity of degree 1 (you can pull a constant out of the expected value sign). Ask your students to deduce linearity from homogeneity and additivity.

Property 2. Expected value of a constant. Everybody knows what a constant is. Ask your students what a constant is in terms of Table 1. The mean of a constant is that constant, because a constant doesn't change, rain or shine: Ec = cp_1 + ... + cp_n = c(p_1 + ... + p_n) = c (we have used the completeness axiom p_1 + ... + p_n = 1). In particular, it follows that E(EX)=EX.

Property 3. The expectation operator preserves order: if x_i\ge y_i for all i, then EX\ge EY. In particular, the mean of a nonnegative random variable is nonnegative: if x_i\ge 0 for all i, then EX\ge 0.

Indeed, using the fact that all probabilities are nonnegative, we get EX = x_1p_1 + ... + x_np_n\ge y_1p_1 + ... + y_np_n=EY.

Property 4. For independent variables, we have EXY=(EX)(EY) (multiplicativity), which has important implications on its own.

The best thing about the above properties is that, although we proved them under simplified assumptions, they are always true. We keep in mind that the expectation operator E is the device used by Mother Nature to measure the average, and most of the time she keeps hidden from us both the probabilities and the average EX.

Oct 16

The pearls of AP Statistics 32

Student's t distribution: one-line explanation of its origin

They say: We’ll now learn about a confidence interval that applies even for small sample sizes… Suppose we knew the standard deviation, \sigma/\sqrt{n}, of the sample mean. Then, with the additional assumption that the population is normal, with small n we could use the formula \bar{x}\pm z\sigma/\sqrt{n}, for instance with z = 1.96 for 95% confidence. In practice, we don’t know the population standard deviation σ. Substituting the sample standard deviation s for σ to get se=s/\sqrt{n} then introduces extra error. This error can be sizeable when n is small. To account for this increased error, we must replace the z-score by a slightly larger score, called a t-score. The confidence interval is then a bit wider. (Agresti and Franklin, p.369)

I say: The opening statement in italic (We’ll now learn about...) creates the wrong impression that the task at hand is to address small sample sizes. The next part in italic (To account for this increased error...) confuses the reader further by implying that

1) using the sample standard deviation instead of the population standard deviation and

2) replacing the z score by the t score

are two separate acts. They are not: see equation (3) below. The last proposition in italic (The confidence interval is then a bit wider) is true. It confused me to the extent that I made a wrong remark in the first version of this post, see Remark 4 below.


William Gosset published his result under a pseudonym ("Student"), and that result was modified by Ronald Fisher into what we now know as Student's t distribution. With his statistic, Gosset wanted to address small sample sizes. The modern explanation is different: the t statistic arises from replacing the unknown population variance by its estimator, the sample variance, and it works regardless of the sample size. If we take a couple of facts on trust, the explanation is just a one-line formula.

Let X_1,...,X_n be a sample of independent observations from a normal population.

Fact 1. The z-score of the sample mean

(1) z_0=\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}

is a standard normal variable.

Fact 2. The sample variance s^2=\frac{1}{n-1}\sum(X_i-\bar{X})^2 upon scaling becomes a chi-square variable. More precisely, the variable

(2) \chi^2_{n-1}=\frac{(n-1)s^2}{\sigma^2}

is a chi-square with n-1 degrees of freedom.

Fact 3. The variables in (1) and (2) are independent.

Intuitive introduction to t distribution

When a population parameter is unknown, replace it by its estimator. Following this general statistical idea, in the situation when \sigma is unknown, instead of (1) consider

(3) t=\frac{\bar{X}-\mu}{s/\sqrt{n}} (dividing and multiplying by \sigma) =\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\frac{1}{\sqrt{s^2/\sigma^2}} (using (1), (2)) =\frac{z_0}{\sqrt{\chi^2_{n-1}/(n-1)}}.

By definition and because the numerator and denominator are independent, the last expression is a t distribution with n-1 degrees of freedom. This is all there is to it.
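Equation (3) is an exact algebraic identity, so it can be checked on a single simulated sample. A sketch in Python with numpy (\mu, \sigma and n are made-up values):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n = 10.0, 2.0, 8          # made-up population parameters
X = rng.normal(mu, sigma, size=n)

xbar = X.mean()
s2 = X.var(ddof=1)                   # sample variance with 1/(n-1)

t = (xbar - mu) / np.sqrt(s2 / n)    # the t statistic, left side of (3)

z0 = (xbar - mu) / (sigma / np.sqrt(n))  # Fact 1: standard normal z-score
chi2 = (n - 1) * s2 / sigma**2           # Fact 2: chi-square with n-1 df

# equation (3): the same number, rebuilt from z0 and the chi-square
assert np.isclose(t, z0 / np.sqrt(chi2 / (n - 1)))
```

The unknown \sigma cancels out, which is exactly why replacing \sigma by s and replacing z by t are one act, not two.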

Concluding remarks

Remark 1. When I give the definitions of the chi-square, t statistic and F statistic, my students are often surprised because there is no reference to samples. To be precise, the way I define them here, they are random variables rather than statistics, so "distribution" or "variable" is more appropriate than "statistic". A statistic, by definition, is a function of observations. The variable we start with in (3) is, obviously, a statistic. Equation (3) means that this statistic is distributed as t with n-1 degrees of freedom.

Remark 2. Many AP Stats books claim that a sum of normal variables is normal. In fact, for this to be true we need independence of the summands. Under our assumption of independent observations and normality, the sum X_1+...+X_n is normal. The variable in (1) is normal as a linear transformation of this sum. Since its mean is zero and its variance is 1, it is a standard normal. We have proved Fact 1. The proofs of Facts 2 and 3 are much more complex.

Remark 3. The t statistic is not used for large samples not because it does not work for large n but because for large n it is close to the z score.

Remark 4. Taking the t distribution defined here as a standard t, we can define a general t as its linear transformation, GeneralT=\sigma\cdot StandardT+\mu (similarly to general normals). Since the standard deviation of the standard t is not 1, the standard deviation of the general t we have defined will not be \sigma. The general t is necessary to use the Mathematica function StudentTCI (confidence interval for Student's t). The t score that arises in estimation is the standard t. In this case, confidence intervals based on t are indeed wider than those based on z. I apologize for my previous wrong comment and am posting this video. See an updated Mathematica file.

Oct 16

The pearls of AP Statistics 31

Demystifying sampling distributions: too much talking about nothing

What we know about sample means

Let X_1,...,X_n be an independent identically distributed sample and consider its sample mean \bar{X}.

Fact 1. The sample mean is an unbiased estimator of the population mean:

(1) E\bar{X}=\frac{1}{n}(EX_1+...+EX_n)=\frac{1}{n}(\mu+...+\mu)=\mu

(use linearity of means).

Fact 2. Variance of the sample mean is

(2) Var(\bar{X})=\frac{1}{n^2}(Var(X_1)+...+Var(X_n))=\frac{1}{n^2}(\sigma^2(X)+...+\sigma^2(X))=\frac{\sigma^2(X)}{n}

(use homogeneity of variance of degree 2 and additivity of variance for independent variables). Hence \sigma(\bar{X})=\frac{\sigma(X)}{\sqrt{n}}.
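Facts 1 and 2 are easy to illustrate by simulation. A sketch in Python with numpy (the parameters are made up; the tolerances are loose because the check is statistical):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 3.0, 2.0, 25, 100_000   # made-up parameters

# many samples of size n; one sample mean per row
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# Fact 1: E(xbar) = mu
assert abs(means.mean() - mu) < 0.01
# Fact 2: Var(xbar) = sigma^2 / n
assert abs(means.var() - sigma**2 / n) < 0.01
```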

Fact 3. The implication of these two properties is that the sample mean becomes more concentrated around the population mean as the sample size increases (see at least the law of large numbers; I have a couple more posts about this).

Fact 4. Finally, the z scores of sample means stabilize to a standard normal distribution (the central limit theorem).

What is a sampling distribution?

The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take (Agresti and Franklin, p.308). After this definition, the authors go ahead and discuss the above four facts. Note that none of them requires the knowledge of what the sampling distribution is. The ONLY sampling distribution that appears explicitly in AP Statistics is the binomial. However, in the book the binomial is given in Section 6.3, before sampling distributions, which are the subject of Chapter 7. Section 7.3 explains that the binomial is a sampling distribution but that section is optional. Thus the whole Chapter 7 (almost 40 pages) is redundant.

Then what are sampling distributions for?

Here is a simple example that explains their role. Consider the binomial X_1+X_2 of two observations on an unfair coin. It involves two random variables and therefore is described by a joint distribution with the sample space consisting of pairs of values

Table 1. Sample space for pair (X_1,X_2)

Coin 1
0 1
Coin 2 0 (0,0) (0,1)
1 (1,0) (1,1)

Each coin independently takes values 0 and 1 (shown in the margins); the sample space contains four pairs of these values (shown in the main body of the table). The corresponding probability distribution is given by the table

Table 2. Joint probabilities for pair (X_1,X_2)

Coin 1
p q
Coin 2 p p^2 pq
q pq q^2

Since we are counting only the number of successes, the outcomes (0,1) and (1,0) for the purposes of our experiment are the same. Hence, joining indistinguishable outcomes, we obtain a smaller sample space

Table 3. Sampling distribution for binomial X_1+X_2

# of successes Corresponding probabilities
0 p^2
1 2pq
2 q^2

The last table is the sampling distribution for the binomial with sample size 2. All the sampling distribution does is replace the large joint distribution (Table 1 plus Table 2) by the smaller distribution in Table 3. The beauty of the proofs of equations (1) and (2) is that they do not depend on which distribution is used (the distribution is hidden in the expected value operator).
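The collapse from Tables 1-2 to Table 3 can be spelled out in a few lines. A sketch in Python (p is an arbitrary made-up probability; as in Table 2, p is the probability of the value 0):

```python
from itertools import product

p = 0.3            # made-up: P(one coin shows 0), as in Table 2
q = 1 - p          # P(one coin shows 1)

# the joint distribution of the pair (Tables 1 and 2)
joint = {(c1, c2): (p if c1 == 0 else q) * (p if c2 == 0 else q)
         for c1, c2 in product([0, 1], repeat=2)}

# join indistinguishable outcomes to get the sampling distribution (Table 3)
sampling = {}
for (c1, c2), prob in joint.items():
    successes = c1 + c2
    sampling[successes] = sampling.get(successes, 0) + prob

assert abs(sampling[0] - p**2) < 1e-12
assert abs(sampling[1] - 2 * p * q) < 1e-12   # (0,1) and (1,0) merged
assert abs(sampling[2] - q**2) < 1e-12
```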

Unless you want your students to appreciate the reduction in the sample space brought about by sampling distributions, it is not worth discussing them. See Wikipedia for examples other than the binomial.

Sep 16

Definitions of chi-square, t statistic and F statistic

Definitions of the standard normal distribution and independence can be combined to produce definitions of chi-square, t statistic and F statistic. The similarity of the definitions makes them easier to study.

Independence of continuous random variables

Definition of independent discrete random variables easily modifies for the continuous case. Let X,Y be two continuous random variables with densities p_X,\ p_Y, respectively. We say that these variables are independent if the density p_{X,Y} of the pair (X,Y) is a product of individual densities:

(1) p_{X,Y}(s,t)=p_X(s)p_Y(t) for all s,t.

As in this post, equation (1) can be understood in two ways. If (1) is given, then X,Y are independent. Conversely, if we want them to be independent, we can define the density of the pair by equation (1). This definition readily generalizes to the case of many variables. In particular, if we want variables z_1,...,z_n to be standard normal and independent, we say that each of them has the density defined here and the joint density p_{z_1,...,z_n} is a product of the individual densities.

Definition of chi-square variable


Figure 1. chi-square with 1 degree of freedom

Let z_1,...,z_n be standard normal and independent. Then the variable \chi^2_n=z_1^2+...+z_n^2 is called a chi-square variable with n degrees of freedom. Obviously, \chi^2_n\ge 0, which means that its density is zero to the left of the origin. For low values of degrees of freedom, the density is not bounded near the origin, see Figure 1.

Definition of t distribution


Figure 2. t distribution and standard normal compared

Let z_0,z_1,...,z_n be standard normal and independent. Then the variable t_n=\frac{z_0}{\sqrt{(z_1^2+...+z_n^2)/n}} is called a t distribution with n degrees of freedom. The density of the t distribution is bell-shaped and for low n has fatter tails than the standard normal. For high n, it approaches that of the standard normal, see Figure 2.

Definition of F distribution


Figure 3. F distribution with (1,m) degrees of freedom

Let u_1,...,u_n,v_1,...,v_m be standard normal and independent. Then the variable F_{n,m}=\frac{(u_1^2+...+u_n^2)/n}{(v_1^2+...+v_m^2)/m} is called an F distribution with (n,m) degrees of freedom. It is nonnegative and its density is zero to the left of the origin. When n is low, the density is not bounded in the neighborhood of zero, see Figure 3.
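All three definitions can be replayed by simulation. A sketch in Python with numpy; it checks that E\chi^2_n=n (each z_i^2 has mean 1) and that t_n^2=F_{1,n} holds exactly by construction:

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 5, 100_000

z = rng.normal(size=(reps, n))
chi2 = (z**2).sum(axis=1)         # chi-square with n degrees of freedom

# E(chi2_n) = n, since each z_i^2 has mean 1
assert abs(chi2.mean() - n) < 0.1

# t_n^2 = F_{1,n}: the same construction, written two ways
z0 = rng.normal(size=reps)
t = z0 / np.sqrt(chi2 / n)        # t with n degrees of freedom
F = (z0**2 / 1) / (chi2 / n)      # F with (1, n) degrees of freedom
assert np.allclose(t**2, F)
```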

The Mathematica file and video better illustrate the densities of these three variables.

Consequences of the definitions:

  1. If \chi^2_n and \chi^2_m are independent, then \chi^2_n+\chi^2_m is \chi^2_{n+m} (addition rule). This rule is applied in the theory of ANOVA models.
  2. t_n^2=F_{1,n}. This is an easy proof of equation (2.71) from Introduction to Econometrics by Christopher Dougherty (Oxford University Press, 2016).
Sep 16

The pearls of AP Statistics 30

Where do the confidence interval and margin of error come from?

They say: A confidence interval is an interval containing the most believable values for a parameter.
The probability that this method produces an interval that contains the parameter is called the confidence level. This is a number chosen to be close to 1, most commonly 0.95... The key is the sampling distribution of the point estimate. This distribution tells us the probability that the point estimate will fall within any certain distance of the parameter (Agresti and Franklin, p.352)... The margin of error measures how accurate the point estimate is likely to be in estimating a parameter. It is a multiple of the standard deviation of the sampling distribution of the estimate, such as 1.96 x (standard deviation) when the sampling distribution is a normal distribution (p.353)

I say: Confidence intervals, invented by Jerzy Neyman, were an important contribution to statistical science. The logic behind them is substantial. Some math is better hidden from students, but not in this case. The authors keep in mind complex notions involving math and try to deliver them verbally. Instead of hoping that students will recreate those notions mentally, why not give them directly?


I ask my students what kind of information they would prefer:

a) I predict the price S of Apple stock to be $114 tomorrow or

b) Tomorrow the price of Apple stock is expected to stay within $1 distance from $114 with probability 95%, that is P(113<S<115)=0.95.

Everybody says statement b) is better. A follow-up question: Do you want the probability in statement b) to be high or low? Unanimous answer: High. A series of definitions follows.

An interval (a,b) containing the values of a random variable S with high probability

(1) P(a<S<b)=p

is called a confidence interval. The value p, which controls the probability, is called the confidence level, and the number \alpha=1-p is called the level of significance. The interpretation of \alpha is that P(S\ falls\ outside\ of\ (a,b))=\alpha, as follows from (1).

How to find a confidence interval

We want the confidence level to be close to 1 and the significance level to be close to zero. In applications, we choose them and we need to find the interval (a,b) from equation (1).

Step 1. Consider the standard normal. (1) becomes P(a<z<b)=p. Note that usually it is impossible to find two unknowns from one equation. Therefore we look for a symmetric interval, in which case we have to solve

(2) P(-a<z<a)=p

for a. The solution a=z_{cr} is called a critical value corresponding to the confidence level p or significance level \alpha=1-p. It is impossible to find it by hand, that's why people use statistical tables. In Mathematica, the critical value is given by

z_{cr}=Max[NormalCI[0, 1, ConfidenceLevel -> p]].

Geometrically, it is obvious that, as the confidence level approaches 1, the critical value goes to infinity, see the video or download the Mathematica file.

Step 2. In case of a general normal variable, plug its z-score in (2):

(3) P(-z_{cr}<\frac{X-\mu}{\sigma}<z_{cr})=p.

The event -z_{cr}<\frac{X-\mu}{\sigma}<z_{cr} is the same as -z_{cr}\sigma<X-\mu<z_{cr}\sigma which is the same as \mu-z_{cr}\sigma<X<\mu+z_{cr}\sigma. Hence, their probabilities are the same:

(4) P(-z_{cr}<\frac{X-\mu}{\sigma}<z_{cr})=P(\mu-z_{cr}\sigma<X<\mu+z_{cr}\sigma)=p.

We have found the confidence interval (\mu-z_{cr}\sigma,\mu+z_{cr}\sigma) for a normal variable. This explains where the margin of error z_{cr}\sigma comes from.
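The same computation can be done with scipy instead of the Mathematica call above (norm.ppf plays the role of NormalCI; \mu and \sigma below are made-up values):

```python
from scipy.stats import norm

p = 0.95
alpha = 1 - p

# critical value: solve P(-z_cr < z < z_cr) = p for the standard normal
z_cr = norm.ppf(1 - alpha / 2)
assert abs(z_cr - 1.96) < 0.005   # the familiar 1.96 for 95% confidence

# confidence interval for a normal variable with made-up mu, sigma
mu, sigma = 100.0, 15.0
lo, hi = mu - z_cr * sigma, mu + z_cr * sigma

# the interval covers the variable with probability p, as in (4)
assert abs((norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)) - p) < 1e-9
```

The product z_cr * sigma is the margin of error from equation (4).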

Step 3. In case of a random variable which is not necessarily normal we can use the central limit theorem. z-scores of sample means z=\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}, for example, approach the standard normal. Instead of (3) we have an approximation

P(-z_{cr}<\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}<z_{cr})\approx p.

Then instead of (4) we get

P(E\bar{X}-z_{cr}\sigma(\bar{X})<\bar{X}<E\bar{X}+z_{cr}\sigma(\bar{X}))\approx p.


In my classes, I insist that logically interconnected facts should be given in one place. To consolidate this information, I give my students the case of one-sided intervals as an exercise.

Sep 16

The pearls of AP Statistics 29

Normal distributions: sometimes it is useful to breast the current

The usual way of defining normal variables is to introduce the whole family of normal distributions and then to say that the standard normal is a special member of this family. Here I show that, for didactic purposes, it is better to do the opposite.

Standard normal distribution

The standard normal distribution z is defined by its probability density

p(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2}.
Usually students don't remember this equation, and they don't need to. The point is to emphasize that this is a specific density, not a generic "bell shape".


Figure 1. Standard normal density

From the plot of the density (Figure 1) they can guess that the mean of this variable is zero.

Figure 2. Plot of xp(x)

Alternatively, they can look at the definition of the mean of a continuous random variable Ez=\int_{-\infty}^\infty xp(x)dx. Here the function f(x)=xp(x) has the shape given in Figure 2, where the positive area to the right of the origin exactly cancels with the negative area to the left of the origin. Since an integral means the area under the function curve, it follows that

(1) Ez=0.


To find variance, we use the shortcut:

Var(z)=Ez^2-(Ez)^2=Ez^2=\int_{-\infty}^\infty x^2p(x)dx=2\int_0^\infty x^2p(x)dx=1.


Figure 3. Plot of x^2p(x)


The total area under the curve is twice the area to the right of the origin, see Figure 3. Here the last integral has been found using Mathematica. It follows that

(2) \sigma(z)=\sqrt{Var(z)}=1.
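Equations (1) and (2) can be confirmed by numerical integration instead of Mathematica. A sketch in Python with scipy.integrate.quad:

```python
import numpy as np
from scipy.integrate import quad

# the standard normal density
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

total, _ = quad(p, -np.inf, np.inf)
mean, _ = quad(lambda x: x * p(x), -np.inf, np.inf)
var, _ = quad(lambda x: x**2 * p(x), -np.inf, np.inf)

assert abs(total - 1) < 1e-6   # the density integrates to 1
assert abs(mean - 0) < 1e-6    # (1): Ez = 0
assert abs(var - 1) < 1e-6     # (2): Var(z) = 1, so sigma(z) = 1
```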

General normal distribution


Figure 4. Visualization of linear transformation - click to view video

Fix some positive \sigma and real \mu. A (general) normal variable X is defined as a linear transformation of z:

(3) X=\sigma z+\mu.

Changing \mu moves the density plot to the left (if \mu is negative) and to the right (if \mu is positive). Changing \sigma makes the density peaked or flat. See video. Enjoy the Mathematica file.

Properties follow as if from the horn of plenty:

A) Using (1) and (3) we easily find the mean of X:

EX=\sigma Ez+\mu=\mu.

B) From (2) and (3) we have

Var(X)=Var(\sigma z)=\sigma^2Var(z)=\sigma^2

(the constant \mu does not affect variance and variance is homogeneous of degree 2).

C) Solving (3) for z gives us the z-score:

z=\frac{X-\mu}{\sigma}.
D) Moreover, we can prove that a linear transformation of a normal variable is normal. Indeed, let X be defined by (3) and let Y be its linear transformation: Y=\delta X+\nu. Then

Y=\delta (\sigma z+\mu)+\nu=\delta\sigma z+(\delta\mu+\nu)

is a linear transformation of the standard normal and is therefore normal.
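Properties A)-C) can be verified by simulation. A sketch in Python with numpy (\sigma and \mu are made-up values):

```python
import numpy as np

rng = np.random.default_rng(7)
z = rng.normal(size=1_000_000)         # standard normal draws
sigma, mu = 3.0, -2.0                  # made-up parameters
X = sigma * z + mu                     # definition (3)

assert abs(X.mean() - mu) < 0.01       # A): EX = mu
assert abs(X.var() - sigma**2) < 0.1   # B): Var(X) = sigma^2

# C): the z-score recovers z exactly
assert np.allclose((X - mu) / sigma, z)
```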

Remarks. 1) In all of the above, no derivation is longer than one line. 2) Reliance on geometry improves understanding. 3) Only basic properties of means and variances are used. 4) With the traditional way of defining the normal distribution using the equation

p(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}

there are two problems: nobody understands this formula, and it is difficult to extract the properties of the normal variable from it.

Compare the above exposition with that of Agresti and Franklin: a) The normal distribution is symmetric, bell-shaped, and characterized by its mean μ and standard deviation σ (p.277) and b) The Standard Normal Distribution has Mean = 0 and Standard Deviation = 1 (p.285). It is the same old routine: remember this, remember that.