Nov 19

My presentation at Kazakh National University

Today's talk: “Analysis of variance in the central limit theorem”
The talk presents results that combine methods from function theory, functional analysis and probability theory. The intuition underlying the central limit theorem will be described, and the history of the author's results and their place in modern theory will be highlighted.

Jan 17

Review of Agresti and Franklin

Review of Agresti and Franklin "Statistics: The Art and Science of Learning from Data", 3rd edition

Who is this book for?

On the Internet you can find both positive and negative reviews. The ones that I saw on Goodreads.com and Amazon.com do not say much about the pros and cons. Here I try to be more specific.

The main limitation of the book is that it adheres to the College Board statement that "it is a one semester, introductory, non-calculus-based, college course in statistics". Hence, there are no derivations and no links between formulas. You will not find explanations of why Statistics works. As a result, there is too much emphasis on memorization. After reading the book, you will likely not have an integral view of statistical methods.

I have seen students who understand such texts well. Generally, they have an excellent memory and better-than-average imagination. But such students are better off reading more advanced books. A potential reader has to lower his/her expectations. I imagine a person who is not interested in taking a more advanced Stats course later. The motivation of that person would be: a) to understand the ways Statistics is applied and/or b) to pass AP Stats just because it is a required course. The review is written on the premise that this is the intended readership.

What I like

  1. The number and variety of exercises. This is good for an instructor who teaches large classes. Having authored several books, I can assure you that inventing many exercises is the most time-consuming part of this business.
  2. The authors have come up with good visual embellishments of graphs and tables, summarized in "A Guide to Learning From the Art in This Text" at the end of the book.
  3. The book has generous left margins. Sometimes they contain reminders about the past material. Otherwise, the reader can use them for notes.
  4. MINITAB is prohibitively expensive, but the Student Edition of MINITAB is provided on the accompanying CD.

What I don't like

  1. I counted about 140 high-resolution photos that have nothing to do with the subject matter. They hardly add to the educational value of the book but certainly add to its cost. This bad trend in introductory textbooks is fueled to a considerable extent by Pearson Education.
  2. 800+ pages, even after slashing all appendices and unnecessary illustrations, is a lot of reading for one semester. Even if you memorize all of them, during the AP test it will be difficult for you to pull out of your memory the exact page you need to answer a particular question.
  3. In an introductory text, one has to refrain from giving too much theory. Still, I don't like some choices made by the authors. The learning curve is flat. As a way of gentle introduction to algebra, verbal descriptions of formulas are normal. But sticking to verbal descriptions until p. 589 is too much. This reminds me of a train trip in Kazakhstan. You enter the steppe through the western border and two days later you see the same endless steppe; only the train station is different.
  4. At the theoretical level, many topics are treated superficially. You can find a lot of additional information in my posts named "The pearls of AP Statistics". Here is the list of the most important additions: regression and correlation should be decoupled; the importance of sampling distributions is overstated; probability is better explained without reference to the long run; the difference between the law of large numbers and the central limit theorem should be made clear; the rate of convergence in the law of large numbers is not that fast; the law of large numbers is intuitively simple; the uniform distribution can also be made simple; to understand different charts, put them side by side; the Pareto chart is better understood as a special type of histogram; instead of using the software on the provided CD, try to simulate in Excel yourself.
  5. Using outdated Texas Instruments calculators contradicts the American Statistical Association recommendation to "Use technology for developing concepts and analyzing data".


If I wanted to save time and did not intend to delve into theory, I would prefer a concise book that directly addresses the questions given on the AP test. However, to decide for yourself, read the Preface to see how much imagination has been put into the book, and you may want to read it.

Oct 16

The pearls of AP Statistics 32

Student's t distribution: one-line explanation of its origin

They say: We’ll now learn about a confidence interval that applies even for small sample sizes… Suppose we knew the standard deviation, \sigma/\sqrt{n}, of the sample mean. Then, with the additional assumption that the population is normal, with small n we could use the formula \bar{x}\pm z\sigma/\sqrt{n}, for instance with z = 1.96 for 95% confidence. In practice, we don’t know the population standard deviation σ. Substituting the sample standard deviation s for σ to get se=s/\sqrt{n} then introduces extra error. This error can be sizeable when n is small. To account for this increased error, we must replace the z-score by a slightly larger score, called a t-score. The confidence interval is then a bit wider. (Agresti and Franklin, p.369)

I say: The opening statement in italic (We’ll now learn about...) creates the wrong impression that the task at hand is to address small sample sizes. The next part in italic (To account for this increased error...) confuses the reader further by implying that

1) using the sample standard deviation instead of the population standard deviation and

2) replacing the z score by the t score

are two separate acts. They are not: see equation (3) below. The last proposition in italic (The confidence interval is then a bit wider) is true. It confused me to the extent that I made a wrong remark in the first version of this post, see Remark 4 below.


William Gosset published his result under a pseudonym ("Student"), and that result was modified by Ronald Fisher to what we know now as Student's t distribution. Gosset with his statistic wanted to address small sample sizes. The modern explanation is different: the t statistic arises from replacing the unknown population variance by its estimator, the sample variance, and it works regardless of the sample size. If we take a couple of facts on trust, the explanation will be just a one-line formula.

Let X_1,...,X_n be a sample of independent observations from a normal population.

Fact 1. The z-score of the sample mean

(1) z_0=\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}

is a standard normal variable.

Fact 2. The sample variance s^2=\frac{1}{n-1}\sum(X_i-\bar{X})^2 upon scaling becomes a chi-square variable. More precisely, the variable

(2) \chi^2_{n-1}=\frac{(n-1)s^2}{\sigma^2}

is a chi-square with n-1 degrees of freedom.

Fact 3. The variables in (1) and (2) are independent.

Intuitive introduction to t distribution

When a population parameter is unknown, replace it by its estimator. Following this general statistical idea, in the situation when \sigma is unknown, instead of (1) consider

(3) t=\frac{\bar{X}-\mu}{s/\sqrt{n}} (dividing and multiplying by \sigma) =\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\frac{1}{\sqrt{s^2/\sigma^2}} (using (1), (2)) =\frac{z_0}{\sqrt{\chi^2_{n-1}/(n-1)}}.

By definition, and because the numerator and denominator are independent (Fact 3), the last expression has a t distribution with n-1 degrees of freedom. This is all there is to it.
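The algebra in (3) can be checked numerically. The following is a minimal Python sketch, not taken from my Mathematica files; the population parameters, the sample size and the function name t_two_ways are hypothetical choices made only for illustration. It computes the t statistic directly and as the ratio on the right of (3) for one simulated normal sample.

```python
import math
import random

def t_two_ways(sample, mu, sigma):
    """Return the t statistic computed directly and as z0 / sqrt(chi2/(n-1))."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)  # sample variance
    t_direct = (xbar - mu) / math.sqrt(s2 / n)           # left side of (3)
    z0 = (xbar - mu) / (sigma / math.sqrt(n))            # equation (1)
    chi2 = (n - 1) * s2 / sigma ** 2                     # equation (2)
    t_ratio = z0 / math.sqrt(chi2 / (n - 1))             # right side of (3)
    return t_direct, t_ratio

rng = random.Random(0)        # fixed seed, only for reproducibility
mu, sigma, n = 10.0, 3.0, 5   # hypothetical population parameters and sample size
sample = [rng.gauss(mu, sigma) for _ in range(n)]
t_direct, t_ratio = t_two_ways(sample, mu, sigma)
print(t_direct, t_ratio)
```

The two numbers agree up to rounding error for any sample, because (3) is an identity; the unknown \sigma cancels out, which is why the t statistic can be computed from the data alone.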

Concluding remarks

Remark 1. When I give the definitions of chi-square, t statistic and F statistic, my students are often surprised. This is because there is no reference to samples. To be precise, the way I define them, they are random variables and not statistics, so "distribution" or "variable" would be more appropriate than "statistic". A statistic, by definition, is a function of observations. The variable we start with in (3) is, obviously, a statistic. Equation (3) means that this statistic is distributed as t with n-1 degrees of freedom.

Remark 2. Many AP Stats books claim that a sum of normal variables is normal. In fact, this requires an additional condition, and independence of the summands is sufficient. Under our assumption of independent observations and normality, the sum X_1+...+X_n is normal. The variable in (1) is normal as a linear transformation of this sum. Since its mean is zero and its variance is 1, it is a standard normal. We have proved Fact 1. The proofs of Facts 2 and 3 are much more complex.

Remark 3. The reason the t statistic is not used for large samples is not that it fails for large n but that for large n it is close to the z score.

Remark 4. Taking the t distribution defined here as a standard t, we can define a general t as its linear transformation, GeneralT=\sigma*StandardT+\mu (similarly to general normals). Since the standard deviation of the standard t is not 1, the standard deviation of the general t we have defined will not be \sigma. The general t is necessary to use the Mathematica function StudentTCI (confidence interval for Student's t). The t score that arises in estimation is the standard t. In this case, confidence intervals based on t are indeed wider than those based on z. I apologize for my previous wrong comment and am posting this video. See an updated Mathematica file.
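Remarks 3 and 4 can be illustrated with numbers. The sketch below is in Python rather than Mathematica and is only an illustration. Since the Python standard library has no t quantile function, it integrates the t density with Simpson's rule and solves for the critical value by bisection; the helper names t_pdf, t_cdf and t_critical and the chosen degrees of freedom are my assumptions for this example. The standard deviation of the standard t, \sqrt{df/(df-2)} for df>2, is printed alongside.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the standard t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x] plus symmetry about 0."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(p, df):
    """Solve P(-a < T < a) = p for a by bisection."""
    target = (1 + p) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z_cr = NormalDist().inv_cdf(0.975)   # z critical value for 95% confidence
t_values = {df: t_critical(0.95, df) for df in (5, 30, 1000)}
for df, t_cr in t_values.items():
    sd_t = math.sqrt(df / (df - 2))  # standard deviation of the standard t
    print(df, round(t_cr, 4), round(sd_t, 4))
print("z", round(z_cr, 4))
```

The t critical values exceed the z critical value for every df, so the t-based intervals are wider, and they approach the z value as df grows.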

Oct 16

The pearls of AP Statistics 31

Demystifying sampling distributions: too much talking about nothing

What we know about sample means

Let X_1,...,X_n be an independent identically distributed sample and consider its sample mean \bar{X}.

Fact 1. The sample mean is an unbiased estimator of the population mean:

(1) E\bar{X}=\frac{1}{n}(EX_1+...+EX_n)=\frac{1}{n}(\mu+...+\mu)=\mu

(use linearity of means).

Fact 2. Variance of the sample mean is

(2) Var(\bar{X})=\frac{1}{n^2}(Var(X_1)+...+Var(X_n))=\frac{1}{n^2}(\sigma^2(X)+...+\sigma^2(X))=\frac{\sigma^2(X)}{n}

(use homogeneity of variance of degree 2 and additivity of variance for independent variables). Hence \sigma(\bar{X})=\frac{\sigma(X)}{\sqrt{n}}.

Fact 3. The implication of these two properties is that the sample mean becomes more concentrated around the population mean as the sample size increases (see at least the law of large numbers; I have a couple more posts about this).

Fact 4. Finally, the z scores of sample means stabilize to a standard normal distribution (the central limit theorem).
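Facts 1 and 2 can be verified exactly on a small example. This is a sketch under an assumption of my own: the population is a fair six-sided die, chosen only because all samples of a small size can be enumerated. The function name is hypothetical. Exact fractions are used, so there is no simulation error.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)

mu = sum(v * prob for v in values)                  # population mean
sigma2 = sum((v - mu) ** 2 * prob for v in values)  # population variance

def mean_and_var_of_sample_mean(n):
    """Enumerate all samples of size n; return (E Xbar, Var Xbar) exactly."""
    e_xbar = Fraction(0)
    e_xbar_sq = Fraction(0)
    weight = prob ** n                  # independence: probabilities multiply
    for sample in product(values, repeat=n):
        xbar = Fraction(sum(sample), n)
        e_xbar += weight * xbar
        e_xbar_sq += weight * xbar ** 2
    return e_xbar, e_xbar_sq - e_xbar ** 2

for n in (1, 2, 3, 4):
    e_xbar, var_xbar = mean_and_var_of_sample_mean(n)
    print(n, e_xbar, var_xbar, sigma2 / n)
```

For every n the first number equals \mu and the last two coincide, in agreement with (1) and (2).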

What is a sampling distribution?

The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take (Agresti and Franklin, p.308). After this definition, the authors go ahead and discuss the above four facts. Note that none of them requires the knowledge of what the sampling distribution is. The ONLY sampling distribution that appears explicitly in AP Statistics is the binomial. However, in the book the binomial is given in Section 6.3, before sampling distributions, which are the subject of Chapter 7. Section 7.3 explains that the binomial is a sampling distribution but that section is optional. Thus the whole Chapter 7 (almost 40 pages) is redundant.

Then what are sampling distributions for?

Here is a simple example that explains their role. Consider the binomial X_1+X_2 of two observations on an unfair coin. It involves two random variables and therefore is described by a joint distribution with the sample space consisting of pairs of values

Table 1. Sample space for pair (X_1,X_2)

                 Coin 1
                 0        1
Coin 2    0      (0,0)    (0,1)
          1      (1,0)    (1,1)

Each coin independently takes values 0 and 1 (shown in the margins); the sample space contains four pairs of these values (shown in the main body of the table). The corresponding probability distribution is given by the table

Table 2. Joint probabilities for pair (X_1,X_2)

                 Coin 1
                 p        q
Coin 2    p      p^2      pq
          q      pq       q^2

Since we are counting only the number of successes, the outcomes (0,1) and (1,0) for the purposes of our experiment are the same. Hence, joining indistinguishable outcomes, we obtain a smaller sample space

Table 3. Sampling distribution for binomial X_1+X_2

# of successes    Corresponding probabilities
0                 p^2
1                 2pq
2                 q^2

The last table is the sampling distribution for the binomial with sample size 2. All the sampling distribution does is replace a large joint distribution Table 1+Table 2 by a smaller distribution Table 3. The beauty of proofs of equations (1) and (2) is that they do not depend on which distribution is used (the distribution is hidden in the expected value operator).
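The passage from Tables 1 and 2 to Table 3 is a mechanical operation and can be sketched in a few lines of Python. The numerical value of p is a hypothetical choice. Following the margins of Table 2, p is the probability of the value 0 and q is the probability of the value 1.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

p = Fraction(3, 10)   # hypothetical P(X_i = 0), as in the margins of Table 2
q = 1 - p             # P(X_i = 1)
marginal = {0: p, 1: q}

# Tables 1 and 2: joint distribution of the pair (X_1, X_2) under independence.
joint = {(x1, x2): marginal[x1] * marginal[x2]
         for x1, x2 in product((0, 1), repeat=2)}

# Table 3: merge the outcomes that give the same number of successes.
sampling = defaultdict(Fraction)
for (x1, x2), prob_pair in joint.items():
    sampling[x1 + x2] += prob_pair

for successes in sorted(sampling):
    print(successes, sampling[successes])
```

Four outcomes of the joint distribution collapse into three values of the sum, with probabilities p^2, 2pq and q^2.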

Unless you want your students to appreciate the reduction in the sample space brought about by sampling distributions, it is not worth discussing them. See Wikipedia for examples other than the binomial.

Sep 16

The pearls of AP Statistics 30

Where do the confidence interval and margin of error come from?

They say: A confidence interval is an interval containing the most believable values for a parameter.
The probability that this method produces an interval that contains the parameter is called the confidence level. This is a number chosen to be close to 1, most commonly 0.95... The key is the sampling distribution of the point estimate. This distribution tells us the probability that the point estimate will fall within any certain distance of the parameter (Agresti and Franklin, p.352)... The margin of error measures how accurate the point estimate is likely to be in estimating a parameter. It is a multiple of the standard deviation of the sampling distribution of the estimate, such as 1.96 x (standard deviation) when the sampling distribution is a normal distribution (p.353)

I say: Confidence intervals, invented by Jerzy Neyman, were an important contribution to statistical science. The logic behind them is substantial. Some math is better hidden from students, but not in this case. The authors keep in mind complex notions involving math and try to deliver them verbally. Instead of hoping that students will mentally recreate those notions, why not give them directly?


I ask my students what kind of information they would prefer:

a) I predict the price S of Apple stock to be $114 tomorrow or

b) Tomorrow the price of Apple stock is expected to stay within $1 distance from $114 with probability 95%, that is P(113<S<115)=0.95.

Everybody says statement b) is better. A follow-up question: Do you want the probability in statement b) to be high or low? Unanimous answer: High. A series of definitions follows.

An interval (a,b) containing the values of a random variable S with high probability

(1) P(a<S<b)=p

is called a confidence interval. The value p which controls probability is called a confidence level and the number \alpha=1-p is called a level of significance. The interpretation of \alpha is that P(S\ falls\ outside\ of\ (a,b))=\alpha as follows from (1).

How to find a confidence interval

We want the confidence level to be close to 1 and the significance level to be close to zero. In applications, we choose them and we need to find the interval (a,b) from equation (1).

Step 1. Consider the standard normal. (1) becomes P(a<z<b)=p. Note that usually it is impossible to find two unknowns from one equation. Therefore we look for a symmetric interval, in which case we have to solve

(2) P(-a<z<a)=p

for a. The solution a=z_{cr} is called a critical value corresponding to the confidence level p or significance level \alpha=1-p. It cannot be found by hand, which is why people use statistical tables. In Mathematica, the critical value is given by

z_{cr}=Max[NormalCI[0, 1, ConfidenceLevel -> p]].

Geometrically, it is obvious that, as the confidence level approaches 1, the critical value goes to infinity, see the video or download the Mathematica file.
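For readers without Mathematica, here is a Python sketch of the same computation using only the standard library. The function name z_critical and the list of confidence levels are my choices for illustration. By symmetry, (2) is equivalent to P(z<a)=(1+p)/2, so the critical value is a quantile of the standard normal.

```python
from statistics import NormalDist

def z_critical(p):
    """Solve P(-a < z < a) = p for a, where z is standard normal."""
    # By symmetry P(z < a) = (1 + p) / 2, so a is that quantile.
    return NormalDist().inv_cdf((1 + p) / 2)

levels = (0.90, 0.95, 0.99, 0.999)
critical_values = [z_critical(p) for p in levels]
for p, z_cr in zip(levels, critical_values):
    print(p, round(z_cr, 4))
```

The printed critical values grow as the confidence level approaches 1.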

Step 2. In case of a general normal variable, plug its z-score in (2):

(3) P(-z_{cr}<\frac{X-\mu}{\sigma}<z_{cr})=p.

The event -z_{cr}<\frac{X-\mu}{\sigma}<z_{cr} is the same as -z_{cr}\sigma<X-\mu<z_{cr}\sigma which is the same as \mu-z_{cr}\sigma<X<\mu+z_{cr}\sigma. Hence, their probabilities are the same:

(4) P(-z_{cr}<\frac{X-\mu}{\sigma}<z_{cr})=P(\mu-z_{cr}\sigma<X<\mu+z_{cr}\sigma)=p.

We have found the confidence interval (\mu-z_{cr}\sigma,\mu+z_{cr}\sigma) for a normal variable. This explains where the margin of error z_{cr}\sigma comes from.
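As a sketch of Step 2 in Python, assume hypothetical values \mu=114 and \sigma=0.5 (they are not estimates of anything, just numbers in the spirit of the stock price example). The code builds the interval and then checks with the normal cumulative distribution function that its probability is indeed p, as (4) states.

```python
from statistics import NormalDist

def normal_interval(mu, sigma, p):
    """Interval (mu - z_cr*sigma, mu + z_cr*sigma) from equation (4)."""
    z_cr = NormalDist().inv_cdf((1 + p) / 2)
    margin = z_cr * sigma            # margin of error
    return mu - margin, mu + margin

mu, sigma, p = 114.0, 0.5, 0.95      # hypothetical parameters
low, high = normal_interval(mu, sigma, p)

X = NormalDist(mu, sigma)
coverage = X.cdf(high) - X.cdf(low)  # P(low < X < high)
print(low, high, coverage)
```

Doubling \sigma doubles the margin of error, while raising p increases it through the critical value.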

Step 3. In case of a random variable which is not necessarily normal we can use the central limit theorem. z-scores of sample means z=\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}, for example, approach the standard normal. Instead of (3) we have an approximation

P(-z_{cr}<\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}<z_{cr})\approx p.

Then instead of (4) we get

P(E\bar{X}-z_{cr}\sigma(\bar{X})<\bar{X}<E\bar{X}+z_{cr}\sigma(\bar{X}))\approx p.
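The quality of this approximation can be examined by simulation. The sketch below rests on assumptions of my own: the population is exponential with mean 1 (so it is clearly not normal), the sample size is 50, and the function name is hypothetical. It estimates how often the sample mean falls in the interval from the last display.

```python
import math
import random
from statistics import NormalDist

def coverage_of_sample_mean(n, p, reps, rng):
    """Share of simulated sample means inside E Xbar +/- z_cr * sigma(Xbar)."""
    z_cr = NormalDist().inv_cdf((1 + p) / 2)
    mu, sigma = 1.0, 1.0                  # mean and sd of the exponential(1) population
    margin = z_cr * sigma / math.sqrt(n)  # z_cr * sigma(Xbar)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if mu - margin < xbar < mu + margin:
            hits += 1
    return hits / reps

rng = random.Random(1)                    # fixed seed, only for reproducibility
coverage = coverage_of_sample_mean(n=50, p=0.95, reps=10000, rng=rng)
print(coverage)
```

The estimated share should be close to, but not exactly, 0.95, which is what the approximate equality sign means.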


In my classes, I insist that logically interconnected facts should be given in one place. To consolidate this information, I give my students the case of one-sided intervals as an exercise.

Sep 16

The pearls of AP Statistics 25

Central Limit Theorem versus Law of Large Numbers

They say: The Central Limit Theorem (CLT). Describes the Expected Shape of the Sampling Distribution for Sample Mean \bar{X}. For a random sample of size n from a population having mean μ and standard deviation σ, then as the sample size n increases, the sampling distribution of the sample mean \bar{X} approaches an approximately normal distribution. (Agresti and Franklin, p.321)

I say: There are at least three problems with this statement.

Problem 1. With any notion or statement, I would like to know its purpose in the first place. The primary purpose of the law of large numbers is to estimate population parameters. The Central Limit Theorem may be a nice theoretical result, but why do I need it? The motivation is similar to the one we use for introducing the z score. There is a myriad of distributions. Only some standard distributions have been tabulated. Suppose we have a sequence of variables X_n, none of which have been tabulated. Suppose also that, as n increases, those variables become close to a normal variable in the sense that the cumulative probabilities (areas under their respective densities) become close:

(1) P(X_n\le a)\rightarrow P(normal\le a) for all a.

Then we can use tables developed for normal variables to approximate P(X_n\le a). This justifies using (1) as the definition of a new convergence type called convergence in distribution.
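To see convergence (1) at work, here is a Python sketch for one concrete sequence, which is my choice and not part of the book's statement: X_n is the z-score of a binomial count with n trials and success probability 0.5. Its cumulative probabilities are computed exactly and compared with the standard normal ones. The function name max_cdf_gap is hypothetical.

```python
import math
from statistics import NormalDist

def max_cdf_gap(n, prob=0.5):
    """Largest |P(X_n <= a) - P(normal <= a)| over a, where X_n is the
    z-score of a binomial(n, prob) count; checked at the jump points."""
    mean = n * prob
    sd = math.sqrt(n * prob * (1 - prob))
    std_normal = NormalDist()
    cdf, gap = 0.0, 0.0
    for k in range(n + 1):
        a = (k - mean) / sd
        normal_cdf = std_normal.cdf(a)
        gap = max(gap, abs(cdf - normal_cdf))  # just below the jump at a
        cdf += math.comb(n, k) * prob ** k * (1 - prob) ** (n - k)
        gap = max(gap, abs(cdf - normal_cdf))  # at the jump
    return gap

gaps = {n: max_cdf_gap(n) for n in (10, 100, 1000)}
for n, gap in gaps.items():
    print(n, round(gap, 4))
```

The gap shrinks as n grows, so for large n the normal table can stand in for the untabulated distribution of X_n.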

Problem 2. Having introduced convergence (1), we need to understand what it means in terms of densities (distributions). As illustrated in Excel, the law of large numbers means convergence to a spike. In particular, the sample mean converges to a mass concentrated at μ (densities contract to one point). Referring to the sample mean in the context of the CLT is misleading, because the CLT is about stabilization of densities.


Figure 1. Law of large numbers with n=100, 1000, 10000

Figure 1 appeared in my posts before. I just added n=10,000 to show that densities do not stabilize.

Figure 2. Central limit theorem with n=100, 1000, 10000


In Figure 2, for clarity I use line plots instead of histograms. The density for n=100 is very rugged. The blue line (for n=1000) is more rugged than the orange (for n=10,000). Convergence to a normal shape is visible, although slow.

Main problem. It is not the sample means that converge to a normal distribution. It is their z scores

z=\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}

that do. Specifically,

P(\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}\le a)\rightarrow P(z\le a) for all a

where z is a standard normal variable.

In my simulations I used sample means for Figure 1 and z scores of sample means for Figure 2. In particular, z scores always have means equal to zero, and that can be seen in Figure 2. In your class, you can use the Excel file. As usual, you have to enable macros.
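If you prefer code to Excel, the contrast between the two figures can be sketched in Python. This is not a translation of my Excel macros: the uniform population on (0,1), the number of replications and the function name simulate are assumptions made for this illustration. Instead of plotting densities, it prints the spread of the sample means and of their z scores.

```python
import math
import random
import statistics

def simulate(n, reps, rng):
    """Return (sample means, their z scores) for reps samples of size n
    drawn from the uniform distribution on (0, 1)."""
    mu, sigma = 0.5, math.sqrt(1 / 12)   # mean and sd of uniform(0, 1)
    sd_of_mean = sigma / math.sqrt(n)    # sigma(Xbar)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(reps)]
    z_scores = [(m - mu) / sd_of_mean for m in means]
    return means, z_scores

rng = random.Random(2)                   # fixed seed, only for reproducibility
spread = {}
for n in (100, 1000, 10000):
    means, z_scores = simulate(n, reps=300, rng=rng)
    spread[n] = (statistics.pstdev(means), statistics.pstdev(z_scores))
    print(n, spread[n])
```

The first number in each pair shrinks with n (the law of large numbers: contraction to a spike), while the second stays near 1 (the central limit theorem: stabilization).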