Nov 16

Properties of correlation

Correlation coefficient: the last block of statistical foundation

Correlation has already been mentioned in

Statistical measures and their geometric roots

Properties of standard deviation

The pearls of AP Statistics 35

Properties of covariance

The pearls of AP Statistics 33

The hierarchy of definitions

Suppose random variables X,Y are not constant. Then their standard deviations are not zero and we can define their correlation as in Chart 1.


Chart 1. Correlation definition

Properties of correlation

Property 1. Range of the correlation coefficient: for any X,Y one has - 1 \le \rho (X,Y) \le 1.
This follows from the Cauchy-Schwarz inequality, as explained here.

Recall from this post that correlation is cosine of the angle between X-EX and Y-EY.
Property 2. Interpretation of extreme cases. (Part 1) If \rho (X,Y) = 1, then Y = aX + b with a > 0.

(Part 2) If \rho (X,Y) = - 1, then Y = aX + b with a < 0.

Proof. (Part 1) \rho (X,Y) = 1 implies
(1) Cov (X,Y) = \sigma (X)\sigma (Y)
which, in turn, implies that Y is a linear function of X: Y = aX + b (this is the second part of the Cauchy-Schwarz inequality). Further, we can establish the sign of the number a. By the properties of variance and covariance

\sigma (Y)=\sigma(aX + b)=\sigma (aX)=|a|\sigma (X).
Plugging this in Eq. (1) we get aVar(X) = |a|\sigma^2(X) and see that a is positive.

The proof of Part 2 is left as an exercise.

Property 3. Suppose we want to measure correlation between weight W and height H of people. The measurements are either in kilos and centimeters {W_k},{H_c} or in pounds and feet {W_p},{H_f}. The correlation coefficient is unit-free in the sense that it does not depend on the units used: \rho (W_k,H_c)=\rho (W_p,H_f). Mathematically speaking, correlation is homogeneous of degree 0 in both arguments.
Proof. One measurement is proportional to another, W_k=aW_p,\ H_c=bH_f with some positive constants a,b. By homogeneity
\rho (W_k,H_c)=\frac{Cov(W_k,H_c)}{\sigma(W_k)\sigma(H_c)}=\frac{Cov(aW_p,bH_f)}{\sigma(aW_p)\sigma(bH_f)}=\frac{abCov(W_p,H_f)}{ab\sigma(W_p)\sigma (H_f)}=\rho (W_p,H_f).


Nov 16

Statistical measures and their geometric roots

Variance, covariancestandard deviation and correlation: their definitions and properties are deeply rooted in the Euclidean geometry.

Here is the why: analogy with Euclidean geometry

euclidEuclid axiomatically described the space we live in. What we have known about the geometry of this space since the ancient times has never failed us. Therefore, statistical definitions based on the Euclidean geometry are sure to work.

   1. Analogy between scalar product and covariance

Geometry. See Table 2 here for operations with vectors. The scalar product of two vectors X=(X_1,...,X_n),\ Y=(Y_1,...,Y_n) is defined by

(X,Y)=\sum X_iY_i.

Statistical analog: Covariance of two random variables is defined by


Both the scalar product and covariance are linear in one argument when the other argument is fixed.

   2. Analogy between orthogonality and uncorrelatedness

Geometry. Two vectors X,Y are called orthogonal (or perpendicular) if

(1) (X,Y)=\sum X_iY_i=0.

Exercise. How do you draw on the plane the vectors X=(1,0),\ Y=(0,1)? Check that they are orthogonal.

Statistical analog: Two random variables are called uncorrelated if Cov(X,Y)=0.

   3. Measuring lengths


Figure 1. Length of a vector

Geometry: the length of a vector X=(X_1,...,X_n) is \sqrt{\sum X_i^2}, see Figure 1.

Statistical analog: the standard deviation of a random variable X is


This explains the square root in the definition of the standard deviation.

   4. Cauchy-Schwarz inequality

Geometry|(X,Y)|\le\sqrt{\sum X_i^2}\sqrt{\sum Y_i^2}.

Statistical analog|Cov(X,Y)|\le\sigma(X)\sigma(Y). See the proof here. The proof of its geometric counterpart is similar.

   5. Triangle inequality


Figure 2. Triangle inequality

Geometry\sqrt{\sum (X_i+Y_i)^2}\le\sqrt{\sum X_i^2}+\sqrt{\sum X_i^2}, see Figure 2 where the length of X+Y does not exceed the sum of lengths of X and Y.

Statistical analog: using the Cauchy-Schwarz inequality we have





   4. The Pythagorean theorem

Geometry: In a right triangle, the squared hypotenuse is equal to the sum of the squares of the two legs. The illustration is similar to Figure 2, except that the angle between X and Y should be right.

Proof. Taking two orthogonal vectors X,Y as legs, we have

Squared hypotenuse = \sum(X_i+Y_i)^2

(squaring out and using orthogonality (1))

=\sum X_i^2+2\sum X_iY_i+\sum Y_i^2=\sum X_i^2+\sum Y_i^2 = Sum of squared legs

Statistical analog: If two random variables are uncorrelated, then variance of their sum is a sum of variances Var(X+Y)=Var(X)+Var(Y).

   5. The most important analogy: measuring angles

Geometry: the cosine of the angle between two vectors X,Y is defined by

Cosine between X,Y = \frac{\sum X_iY_i}{\sqrt{\sum X_i^2\sum Y_i^2}}.

Statistical analog: the correlation coefficient between two random variables is defined by


This intuitively explains why the correlation coefficient takes values between -1 and +1.

Remark. My colleague Alisher Aldashev noticed that the correlation coefficient is the cosine of the angle between the deviations X-EX and Y-EY and not between X,Y themselves.

Nov 16

The pearls of AP Statistics 35

The disturbance term: To hide or not to hide? In an introductory Stats course, some part of the theory should be hidden. Where to draw the line is an interesting question. Here I discuss the ideas that look definitely bad to me.

How disturbing is the disturbance term?

In the main text, Agresti and Franklin never mention the disturbance term u_i in the regression model

(1) y_i=a+bx_i+u_i

(it is hidden in Exercise 12.105). Instead, they write the equation for the mean \mu_y=a+bx that follows from (1) under the standard assumption Eu_i=0. This would be fine if the exposition stopped right there. However, one has to explain the random source of variability in y_i. On p. 583 the authors say: "The probability distribution of y values at a fixed value of x is a conditional distribution. At each value of x, there is a conditional distribution of y values. A regression model also describes these distributions. An additional parameter σ describes the standard deviation of each conditional distribution."

Further, Figure 12.4 illustrates distributions of errors at different points and asks: "What do the bell-shaped curves around the line at x = 12 and at x = 16 represent?"


Figure 12.4. Illustration of error distributions

Besides, explanations of heteroscedasticity and of the residual sum of squares are impossible without explicitly referring to the disturbance term.

Attributing a regression property to the correlation is not good

On p.589 I encountered a statement that puzzled me: "An important property of the correlation is that at any particular x value, the predicted value of y is relatively closer to its mean than x is to its mean. If an x value is a certain number of standard deviations from its mean, then the predicted y is r times that many standard deviations from its mean."

Firstly, this is a verbal interpretation of some formula, so why not give the formula itself? How good must be a student to guess what is behind the verbal formulation?

Secondly, as I stressed in this post, the correlation coefficient does not entail any prediction about the magnitude of a change in one variable caused by a change in another. The above statement about the predicted value of y must be a property of regression. Attributing a regression property to the correlation is not in the best interests of those who want to study Stats at a more advanced level.

Thirdly, I felt challenged to see something new in the area I thought I knew everything about. So here is the derivation. By definition, the fitted value is

(2) \hat{y_i}=\hat{a}+\hat{b}x_i

where the hats stand for estimators. The fitted line passes through the point (\bar{x},\bar{y}):

(3) \bar{y}=\hat{a}+\hat{b}\bar{x}

(this will be proved elsewhere). Subtracting (3) from (2) we get

(4) \hat{y_i}-\bar{y}=\hat{b}(x_i-\bar{x})

(using equation (4) from this post)


It is helpful to rewrite (4) in a more symmetric form:

(5) \frac{\hat{y_i}-\bar{y}}{\sigma(y)}=\rho\frac{x_i-\bar{x}}{\sigma(x)}.

This is the equation we need. Suppose an x value is a certain number of standard deviations from its mean: x_i-\bar{x}=k\sigma(x). Plug this into (5) to get \hat{y_i}-\bar{y}=\rho k\sigma(y), that is, the predicted y is \rho times that many standard deviations from its mean.

Oct 16

The pearls of AP Statistics 33

Correlation and regression are two separate entities

They say: The correlation summarizes the direction of the association between two quantitative variables and the strength of its linear (straight-line) trend (Agresti and Franklin, p.105). Later, at a level that is supposed to be more advanced, they repeat: The correlation, denoted by r, describes linear association between two variables (p.586).

I say: This is a common misconception about correlation, even Wikipedia says so. Once I was consulting specialists from the Oncology Institute in Almaty. Until then, all of them were using correlation to study their data. When I suggested using simple regression, they asked what was the difference and how regression was better. I said: correlation is a measure of statistical relationship. When two variables are positively correlated and one of them goes up, the other also goes up (on average) but you never know by how much. On the other hand, regression gives a specific algebraic dependence between two variables, so that you can quantify your predictions about changes in one of them caused by changes in another.

Because of algebra of least squares estimation, you can conclude something about correlation if you know the estimated slope, and vice versa, but conceptually correlation and regression are different and there is no need to delay the study of correlation until after regression. The correlation coefficient is defined as

(1) \rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}.

See this post for the definition and properties of covariance. As one can see, it can be studied right after the covariance and standard deviation. The slope of the regression line is a result of least squares fitting, which is a more advanced concept, and is given by

(2) b=\frac{Cov(X,Y)}{Var(X)},

see a simplified derivation or a full derivation. I am using the notations

(3) Cov(X,Y)=\frac{1}{n}\sum(X_i-\bar{X})(Y_i-\bar{Y}),\ Var(X)=Cov(X,X),\ \sigma(X)=\sqrt{Var(X)}

which arise from the corresponding population characteristics as explained in this post. Directly from (1) and (2) we see that

(4) b=\rho(X,Y)\frac{\sigma(Y)}{\sigma(X)},\ \rho(X,Y)=b\frac{\sigma(X)}{\sigma(Y)}.

Using these equations, we can go from the correlation to the slope and back if we know the sigmas. In particular, they are positive or negative simultaneously. The second equation in (4) gives rise to the interpretation of the correlation as "a standardized version of the slope" (p.588). To me, this "intuition" is far-fetched.

Notice how economical is the sequence of definitions in (3): one follows from another, which makes remembering them easier, and summation signs are reduced to a minimum. Under the "non-algebraic" approach, the covariance, variance and standard deviation are given separately, increasing the burden on one's memory.

Oct 16

The pearls of AP Statistics 31

Demystifying sampling distributions: too much talking about nothing

What we know about sample means

Let X_1,...,X_n be an independent identically distributed sample and consider its sample mean \bar{X}.

Fact 1. The sample mean is an unbiased estimator of the population mean:

(1) E\bar{X}=\frac{1}{n}(EX_1+...+EX_n)=\frac{1}{n}(\mu+...+\mu)=\mu

(use linearity of means).

Fact 2. Variance of the sample mean is

(2) Var(\bar{X})=\frac{1}{n^2}(Var(X_1)+...+Var(X_n)=\frac{1}{n^2}(\sigma^2(X)+...+\sigma^2(X))=\frac{\sigma^2(X)}{n}

(use homogeneity of variance of degree 2 and additivity of variance for independent variables). Hence \sigma(\bar{X})=\frac{\sigma(X)}{\sqrt{n}}

Fact 3. The implication of these two properties is that the sample mean becomes more concentrated around the population mean as the sample size increases (see at least the law of large numbers; I have a couple more posts about this).

Fact 4. Finally, the z scores of sample means stabilize to a standard normal distribution (the central limit theorem).

What is a sampling distribution?

The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take (Agresti and Franklin, p.308). After this definition, the authors go ahead and discuss the above four facts. Note that none of them requires the knowledge of what the sampling distribution is. The ONLY sampling distribution that appears explicitly in AP Statistics is the binomial. However, in the book the binomial is given in Section 6.3, before sampling distributions, which are the subject of Chapter 7. Section 7.3 explains that the binomial is a sampling distribution but that section is optional. Thus the whole Chapter 7 (almost 40 pages) is redundant.

Then what are sampling distributions for?

Here is a simple example that explains their role. Consider the binomial X_1+X_2 of two observations on an unfair coin. It involves two random variables and therefore is described by a joint distribution with the sample space consisting of pairs of values

Table 1. Sample space for pair (X_1,X_2)

Coin 1
0 1
Coin 2 0 (0,0) (0,1)
1 (1,0) (1,1)

Each coin independently takes values 0 and 1 (shown in the margins); the sample space contains four pairs of these values (shown in the main body of the table). The corresponding probability distribution is given by the table

Table 2. Joint probabilities for pair (X_1,X_2)

Coin 1
p q
Coin 2 p p^2 pq
q pq q^2

Since we are counting only the number of successes, the outcomes (0,1) and (1,0) for the purposes of our experiment are the same. Hence, joining indistinguishable outcomes, we obtain a smaller sample space

Table 3. Sampling distribution for binomial X_1+X_2

# of successes Corresponding probabilities
0 p^2
1 2pq
2 q^2

The last table is the sampling distribution for the binomial with sample size 2. All the sampling distribution does is replace a large joint distribution Table 1+Table 2 by a smaller distribution Table 3. The beauty of proofs of equations (1) and (2) is that they do not depend on which distribution is used (the distribution is hidden in the expected value operator).

Unless you want your students to appreciate the reduction in the sample space brought about by sampling distributions, it is not worth discussing them. See Wikipedia for examples other than the binomial.

Sep 16

The pearls of AP Statistics 29

Normal distributions: sometimes it is useful to breast the current

The usual way of defining normal variables is to introduce the whole family of normal distributions and then to say that the standard normal is a special member of this family. Here I show that, for didactic purposes, it is better to do the opposite.

Standard normal distribution

The standard normal distribution z is defined by its probability density


Usually students don't remember this equation, and they don't need to. The point is to emphasize that this is a specific density, not a generic "bell shape".


Figure 1. Standard normal density

From the plot of the density (Figure 1) they can guess that the mean of this variable is zero.







Figure 2. Plot of xp(x)

Alternatively, they can look at the definition of the mean of a continuous random variable Ez=\int_\infty^\infty xp(x)dx. Here the function f(x)=xp(x) has the shape given in Figure 2, where the positive area to the right of the origin exactly cancels out with the negative area to the left of the origin. Since an integral means the area under the function curve, it follows that

(1) Ez=0.


To find variance, we use the shortcut:

Var(z)=Ez^2-(Ez)^2=Ez^2=\int_{-\infty}^\infty x^2p(x)dx=2\int_0^\infty x^2p(x)dx=1.


Figure 3. Plot of x^2p(x)


The total area under the curve is twice the area to the right of the origin, see Figure 3. Here the last integral has been found using Mathematica. It follows that

(2) \sigma(z)=\sqrt{Var(z)}=1.

General normal distribution


Figure $. Visualization of linear transformation - click to view video

Fix some positive \sigma and real \mu. A (general) normal variable \mu is defined as a linear transformation of z:

(3) X=\sigma z+\mu.

Changing \mu moves the density plot to the left (if \mu is negative) and to the right (if \mu is positive). Changing \sigma makes the density peaked or flat. See video. Enjoy the Mathematica file.




Properties follow like from the horn of plenty:

A) Using (1) and (3) we easily find the mean of X:

EX=\sigma Ez+\mu=\mu.

B) From (2) and (3) we have

Var(X)=Var(\sigma z)=\sigma^2Var(z)=\sigma^2

(the constant \mu does not affect variance and variance is homogeneous of degree 2).

C) Solving (3) for z gives us the z-score:


D) Moreover, we can prove that a linear transformation of a normal variable is normal. Indeed, let X be defined by (3) and let Y be its linear transformation: Y=\delta X+\nu. Then

Y=\delta (\sigma z+\mu)+\nu=\delta\sigma z+(\delta\mu+\nu)

is a linear transformation of the standard normal and is therefore normal.

Remarks. 1) In all of the above, no derivation is longer than one line. 2) Reliance on geometry improves understanding. 3) Only basic properties of means and variances are used. 4) With the traditional way of defining the normal distribution using the equation


there are two problems. Nobody understands this formula and it is difficult to extract properties of the normal variable from it.

Compare the above exposition with that of Agresti and Franklin: a) The normal distribution is symmetric, bell-shaped, and characterized by its mean μ and standard deviation σ (p.277) and b) The Standard Normal Distribution has Mean = 0 and Standard Deviation = 1 (p.285). It is the same old routine: remember this, remember that.

Aug 16

The pearls of AP Statistics 22

The law of large numbers - a bird's view

They say: In 1689, the Swiss mathematician Jacob Bernoulli proved that as the number of trials increases, the proportion of occurrences of any given outcome approaches a particular number (such as 1/6) in the long run. (Agresti and Franklin, p.213).

I say: The expression “law of large numbers” appears in the book 13 times, yet its meaning is never clearly explained. The closest approximation to the truth is the above sentence about Jacob Bernoulli. To see if this explanation works, tell it to your students and ask what they understood. To me, this is a clear case when withholding theory harms understanding.

Intuition comes first. I ask my students: if you flip a fair coin 100 times, what do you expect the proportion of ones to be? Absolutely everybody replies correctly, just the form of the answer may be different (50-50 or 0.5 or 50 out of 100). Then I ask: probably it will not be exactly 0.5 but if you flip the coin 1000 times, do you expect the proportion to be closer to 0.5? Everybody says: Yes. Next I ask: Suppose the coin is unfair and the probability of 1 appearing is 0.7. What would you expect the proportion to be close to in large samples? Most students come up with the right answer: 0.7. Congratulations, you have discovered what is called a law of large numbers!

Then we give a theoretical format to our little discovery. p=0.7 is a population parameter. Flipping a coin n times we obtain observations X_1,...,X_n. The proportion of ones is the sample mean \bar{X}=\frac{X_1+...+X_n}{n}. The law of large numbers says two things: 1) as the sample size increases, the sample mean approaches the population mean. 2) At the same time, its variation about the population mean becomes smaller and smaller.

Part 1) is clear to everybody. To corroborate statement 2), I give two facts. Firstly, we know that the standard deviation of the sample mean is \frac{\sigma}{\sqrt{n}}. From this we see that as n increases, the standard deviation of the sample mean decreases and the values of the sample mean become more and more concentrated around the population mean. We express this by saying that the sample mean converges to a spike. Secondly, I produce two histograms. With the sample size n=100, there are two modes (just 1o%) of the histogram at 0.69 and 0.72, while 0.7 was used as the population mean in my simulations. Besides, the spread of the values is large. With n=1000, the mode (27%) is at the true value 0.7, and the spread is low.

Histogram of proportions with n=100


Histogram of proportions with n=1000

Finally, we relate our little exercise to practical needs. In practice, the true mean is never known. But we can obtain a sample and calculate its mean. With a large sample size, the sample mean will be close to the truth. More generally, take any other population parameter, such as its standard deviation, and calculate the sample statistic that estimates it, such as the sample standard deviation. Again, the law of large numbers applies and the sample statistic will be close to the population parameter. The histograms have been obtained as explained here and here. Download the Excel file.

Aug 16

The pearls of AP Statistics 14

Reasons to increase Math content in AP Statistics course

The definition of the standard deviation, to those who see it for the first time, looks complex and scary. Agresti and Franklin on p.57 have done an excellent job explaining it. They do it step by step: introduce deviations, the sum of squares, variance and give the formula in the end. The names introduced here will be useful later, in other contexts. Being a rotten theorist, I don't like the "small technical point" on p. 57 (the true reason why there is division by n-1 and not by n is unbiasedness: Es^2=\sigma^2) but this is a minor point.

AP Stats teachers cannot discuss advanced facts because many students are not good in algebra. However, there are good methodological reasons to increase Math content of an AP Stats course. When students saw algebra for the first time, their cognitive skills may have been underdeveloped, which may have prevented them from leaping from numbers to algebraic notation. On the other hand, by the time they take AP Stats they mature. Their logic, power of observation, motivation etc. are better. The crucial fact is that in Statistics numbers meet algebra again, and this can be usefully employed.

Ask your students two questions. 1) You have observations on two stocks, X and Y. How is the sample mean of their sum related to their individual sample means? 2) You have s shares of stock X (s is a number, X is a random variable). How is the sample mean of your portfolio sX related to the sample mean of X? This smells money and motivates well.

The first answer, \overline{X+Y}=\bar{X}+\bar{Y}, tells us that if we know the individual means, we can avoid calculating \overline{X+Y} by simply adding two numbers. Similarly, the second formula, \overline{sX}=s\bar{X}, simplifies calculation of \overline{sX}. Methodologically, this is an excellent opportunity to dive into theory. Firstly, there is good motivation. Secondly, it's easy to see the link between numbers and algebra (see tabular representations of random variables in Chapters 4 and 5 of my book (you are welcome to download the free version). Thirdly, even though this is theory, many things here are done by analogy, which students love. Fourthly, this topic paves the road to properties of the variance and covariance (recall that the slope in simple regression is covariance over variance).

Agresti and Franklin don't have any theoretical properties of the mean, so without them the definition of the mean is kind of left hanging in the air. FYI: properties of the mean, variance, covariance and standard deviation are omnipresent in theory. The mode, median, range and IQR are not used at all because they have bad theoretical properties.

Jan 16

What is a z score: the scientific explanation

You know what is a z score when you know why people invented it.

As usual, we start with a theoretical motivation. There is a myriad of distributions. Even if we stay within the set of normal distributions, there is an infinite number of them, indexed by their means \mu(X)=EX and standard deviations \sigma(X)=\sqrt{Var(X)}. When computers did not exist, people had to use statistical tables. It was impossible to produce statistical tables for an infinite number of distributions, so the problem was to reduce the case of general \mu(X) and \sigma(X) to that of \mu(X)=0 and \sigma(X)=1.

But we know that that can be achieved by centering and scaling. Combining these two transformations, we obtain the definition of the z score:


Using the properties of means and variances we see that



The transformation leading from X to its z score sometimes is called standardization.

This site promises to tell you the truth about undergraduate statistics. The truth about the z score is that:

(1) Standardization can be applied to any variable with finite variance, not only to normal variables. The z score is a standard normal variable only when the original variable X is normal, contrary to what some sites say.

(2) With modern computers, standardization is not necessary to find critical values for X, see Chapter 14 of my book.

Jan 16

Scaling a distribution

Scaling a distribution is as important as centering or demeaning considered here. The question we want to find an answer for is this: What can you do to a random variable X to obtain another random variable, say, Y, whose variance is one? Like in case of centering, geometric considerations can be used but I want to follow the algebraic approach, which is more powerful.

Hint: in case of centering, we subtract the mean, Y=X-EX. For the problem at hand the suggestion is to use scaling: Y=aX, where a is a number to be determined.

Using the fact that variance is homogeneous of degree 2, we have


We want Var(Y) to be 1, so solving for a gives a=1/\sqrt{Var(X)}=1/\sigma(X). Thus, division by the standard deviation answers our question: the variable Y=X/\sigma(X) has variance and standard deviation equal to 1.

Note. Always use the notation for standard deviation \sigma with its argument X.