14 Jan 17

Inductive introduction to Chebyshev inequality

Chebyshev inequality - enigma or simplicity itself?

Let's go back to the very basics. The true probability distribution is usually unknown. This is why we cannot work with individual values and their probabilities directly and instead rely on various averages. However, as you will see below, the Chebyshev inequality does tell us something about the behavior of certain probabilities.

Motivation

Table 1. Income distribution

Income   Percentage   P(Income>=c)   Chebyshev bound   Bound/true
10       0.027        1              5.05              5.05
20       0.066        0.973          2.525             2.595
30       0.123        0.907          1.683             1.856
40       0.179        0.784          1.263             1.611
50       0.202        0.606          1.01              1.667
60       0.179        0.403          0.842             2.089
70       0.123        0.225          0.721             3.204
80       0.066        0.102          0.631             6.186
90       0.027        0.036          0.561             15.583
100      0.009        0.009          0.505             56.111
In Table 1, in the first two columns we have information about income distribution (income values and their probabilities or percentages of the population) in a hypothetical country H. For example, 0.027 is the proportion of the population that has income of 10 H-dollars.

Figure 1. Income distribution

I simulated this distribution using the normal distribution function of Excel in such a way as to get relatively low percentages of poor and rich people, see Figure 1.
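If you want to reproduce the second column without Excel, here is a minimal Python sketch of the same idea; the parameters (a normal density with mean 50 and standard deviation 20, normalized over the ten income levels) are my guesses rather than the exact Excel settings:

```python
# Sketch: generate percentages like those in Table 1 from a normal density evaluated
# at the ten income levels and normalized. Mean 50 and standard deviation 20 are guesses.
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

incomes = range(10, 101, 10)
weights = {i: normal_pdf(i, 50, 20) for i in incomes}
total = sum(weights.values())
percentages = {i: round(w / total, 3) for i, w in weights.items()}
print(percentages)  # close to the Percentage column of Table 1
```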

Suppose the government wants to increase the income tax on wealthy people and use the resulting tax revenue to support the low income population. The question is what part of the population will be impacted. The percentage of the population with income higher than or equal to a given cut-off level c is given by

(1) P(Income\ge c).

For example, the government may decide to impose a higher tax on wealthy people with Income\ge 90, in which case the proportion of affected people will be 0.027+0.009=0.036. If the tax revenue is to be used only to support the poorest people with income of 10 H-dollars, the proportion of people who benefit from this decision can also be expressed using probability (1) because P(Income=10)=1-P(Income\ge 20). It's easy to see that probability (1) is a cumulative probability: to find it, we sum all probabilities, starting from the last row up to the row in which Income=c. Denoting I_j income levels and p_j the corresponding probabilities, we have

(2) P(Income\ge c)=\sum_{I_j\ge c}p_j.

The third column of Table 1 contains these probabilities for all cut-off values.

Question. The true probabilities are usually unknown but the mean is normally available (to obtain the GDP per capita, just divide the GDP by the head count). In our case the mean income is 50.5. What can be said about (1) if the cut-off value and the mean are known?

Chebyshev's answer

Chebyshev noticed that for those j over which we sum in (2) we have c\le I_j or, equivalently, 1\le I_j/c. His answer is obtained in two steps:

P(Income\ge c)=\sum_{I_j\ge c}p_j\times 1 (replacing 1 by I_j/c can only increase the right-hand side)

\le\sum_{I_j\ge c}p_jI_j/c (increase the sum further by including all j)

\le\sum_jp_jI_j/c=\frac{1}{c}EIncome.

Thus, we cannot find the exact value of (1) but we can give an upper bound P(Income\ge c)\le\frac{1}{c}EIncome. The fourth column of Table 1 contains Chebyshev bounds. The fifth column, which contains the ratios of the bounds to the true values from column 3, shows that the bounds are reasonably good for middle incomes and badly miss the mark for low and high incomes.
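For readers who want to verify the numbers, here is a short Python sketch that recomputes columns 3-5 of Table 1 from the first two columns (the income levels and percentages are copied from the table as printed, so they sum to one only up to rounding):

```python
# Sketch: recompute P(Income >= c), the Chebyshev bound EIncome/c and their ratio
# for every cut-off in Table 1. The data are taken from the first two columns.
incomes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
probs = [0.027, 0.066, 0.123, 0.179, 0.202, 0.179, 0.123, 0.066, 0.027, 0.009]

mean_income = sum(I * p for I, p in zip(incomes, probs))  # about 50.5

for c in incomes:
    tail = sum(p for I, p in zip(incomes, probs) if I >= c)  # formula (2)
    bound = mean_income / c                                  # Chebyshev bound
    print(c, round(tail, 3), round(bound, 3), round(bound / tail, 3))
```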

The above proof applies to any nonnegative random variable X and positive c and we state the result as the simplest form of the Chebyshev inequality:

(3) P(X\ge c)\le\frac{1}{c}EX.

Extensions

  1. If X changes sign, its absolute value is nonetheless nonnegative, so P(|X|\ge c)\le\frac{1}{c}E|X|.
  2. It is more interesting to bound the probability of deviation of X from its mean EX. For this, just plug |X-EX| in (3): P(|X-EX|\ge c)\le\frac{1}{c}E|X-EX|.
  3. One more step allows us to obtain Var(X) instead of E|X-EX| on the right. Note that the events |X-EX|\ge c and |X-EX|^2\ge c^2 are equivalent. Therefore

P(|X-EX|\ge c)=P(|X-EX|^2\ge c^2)\le\frac{1}{c^2}E|X-EX|^2=\frac{1}{c^2}Var(X).

The result we have obtained P(|X-EX|\ge c)\le \frac{1}{c^2}Var(X) will be referred to as the Chebyshev inequality.
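As a quick numerical illustration (using the income distribution of Table 1 again; the cut-offs below are chosen arbitrarily), the last inequality can be checked directly:

```python
# Sketch: check P(|X - EX| >= c) <= Var(X)/c^2 on the income distribution of Table 1.
incomes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
probs = [0.027, 0.066, 0.123, 0.179, 0.202, 0.179, 0.123, 0.066, 0.027, 0.009]

mean = sum(I * p for I, p in zip(incomes, probs))
var = sum(p * (I - mean) ** 2 for I, p in zip(incomes, probs))

for c in (10, 20, 30, 40):  # arbitrary cut-offs
    lhs = sum(p for I, p in zip(incomes, probs) if abs(I - mean) >= c)
    print(c, round(lhs, 3), "<=", round(var / c ** 2, 3))
```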

Digression

A long time ago I read a joke about P.L. Chebyshev. He traveled to Paris to give a talk titled "On the optimal fabric cutout". The best Paris fashion designers gathered to listen to his presentation. They left the room after he said: For simplicity, let us imagine that the human body is ball-shaped.

 

8 Jan 17

OLS estimator variance

Assumptions about simple regression

We consider the simple regression

(1) y_i=a+bx_i+e_i

Here we derived the OLS estimators of the intercept and slope:

(2) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(3) \hat{a}=\bar{y}-\hat{b}\bar{x}.

A1. Existence condition. Since division by zero is not allowed, for (2) to exist we require Var_u(x)\ne 0. If this condition is not satisfied, then there is no variation in x and all observed points lie on one vertical line.

A2. Convenience condition. The regressor x is deterministic. This condition is imposed to be able to apply the properties of expectation, see equation (7) in  this post. The time trend and dummy variables are examples of deterministic regressors. However, most real-life regressors are stochastic. Modifying the theory in order to cover stochastic regressors is the subject of two posts: finite-sample theory and large-sample theory.

A3. Unbiasedness condition: Ee_i=0. This is the main assumption that makes sure that the OLS estimators are unbiased, see equation (7) in this post.

Unbiasedness is not enough

Unbiasedness characterizes the quality of an estimator, see the intuitive explanation. Unfortunately, unbiasedness is not enough to choose the best estimator because of nonuniqueness: usually, if there is one unbiased estimator of a parameter, then there are infinitely many unbiased estimators of the same parameter. For example, we know that the sample mean \bar{X} is an unbiased estimator of the population mean: E\bar{X}=EX. Since EX_1=EX (X_1 is the first observation), we can easily construct an infinite family of unbiased estimators Y=(\bar{X}+aX_1)/(1+a), assuming a\ne -1. Indeed, using linearity of expectation, EY=(E\bar{X}+aEX_1)/(1+a)=EX.

Variance is another measure of the quality of an estimator: to have a lower spread of estimator values, among competing estimators we choose the one with the lowest variance. Knowing the estimator variance also allows us to find the z-score and use statistical tables.

Slope estimator variance

It is not difficult to find the variance of the slope estimator using representation (6) derived here:

\hat{b}=b+\frac{1}{n}\sum a_ie_i

where a_i=(x_i-\bar{x})/Var_u(x).

Don't try to apply the definition of variance directly at this point: squaring the sum produces a double sum. We need two new assumptions.

A4. Uncorrelatedness of errors. Assume that Cov(e_i,e_j)=0 for all i\ne j (errors from different equations (1) are uncorrelated). Note that because of the unbiasedness condition, this assumption is equivalent to Ee_ie_j=0 for all i\ne j. This assumption is likely to be satisfied if we observe consumption patterns of unrelated individuals.

A5. Homoscedasticity. All errors have the same variance: Var(e_i)=\sigma^2 for all i. Again, because of the unbiasedness condition, this assumption is equivalent to Ee_i^2=\sigma^2 for all i.

Now we can derive the variance expression, using properties from this post:

Var(\hat{b})=Var(b+\frac{1}{n}\sum_i a_ie_i) (dropping a constant doesn't affect variance)

=Var(\frac{1}{n}\sum_i a_ie_i) (for uncorrelated variables, variance is additive)

=\sum_i Var(\frac{1}{n}a_ie_i) (variance is homogeneous of degree 2)

=\frac{1}{n^2}\sum_i a_i^2Var(e_i) (applying homoscedasticity)

=\frac{1}{n^2}\sum_i a_i^2\sigma^2 (plugging a_i)

=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x) (using the notation of sample variance)

=\frac{1}{n}Var_u(x)\sigma^2/Var^2_u(x)=\sigma^2/(nVar_u(x)).

 

With the short notation for variances, canceling out the two variances in the last line is obvious. It is much less obvious when everything is written with summation signs. The case of the intercept variance is left as an exercise.
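To see the formula at work, here is a small Monte Carlo sketch; the regressor, the coefficients, the error variance and the number of replications are all made up for illustration, and the errors are drawn as i.i.d. normals so that A3-A5 hold:

```python
# Sketch: Monte Carlo check of Var(b_hat) = sigma^2 / (n * Var_u(x)).
# All parameters below are illustrative; errors are i.i.d. normal, so A3-A5 hold.
import random

n, a, b, sigma = 50, 1.0, 2.0, 3.0
x = [i / n for i in range(n)]                      # deterministic regressor (A2)
x_bar = sum(x) / n
var_u_x = sum((xi - x_bar) ** 2 for xi in x) / n   # uniformly weighted variance of x

slopes = []
for _ in range(20000):
    y = [a + b * xi + random.gauss(0, sigma) for xi in x]
    y_bar = sum(y) / n
    cov_u = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    slopes.append(cov_u / var_u_x)                 # OLS slope, formula (2)

mean_slope = sum(slopes) / len(slopes)
mc_var = sum((s - mean_slope) ** 2 for s in slopes) / len(slopes)
print(mc_var, sigma ** 2 / (n * var_u_x))          # the two numbers should be close
```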

Conclusion

The above assumptions A1-A5 are called classical. It is necessary to remember their role in the derivations because a considerable part of Econometrics is devoted to deviations from the classical assumptions. Once a certain assumption is violated, you should expect the corresponding property of the estimator to fail. For example, if Ee_i\ne 0, you should expect the estimators to be biased. If either A4 or A5 is not true, the formula we have derived

Var(\hat{b})=\sigma^2/(nVar_u(x))

will not hold. Besides, the Gauss-Markov theorem that the OLS estimators are efficient will not hold (this will be discussed later). The pair A4-A5 can be called an efficiency condition.

2 Jan 17

Conditional variance properties

Preliminaries

Review Properties of conditional expectation, especially the summary, where I introduce a new notation for conditional expectation. Everywhere I use the notation E_Y\pi for expectation of \pi conditional on Y, instead of E(\pi|Y).

This post and the previous one on conditional expectation show that conditioning is a pretty advanced notion. Many introductory books use the condition E_xu=0 (the expected value of the error term u conditional on the regressor x is zero). Because of the complexity of conditioning, I think it's better to avoid this kind of assumption as much as possible.

Conditional variance properties

Replacing usual expectations by their conditional counterparts in the definition of variance, we obtain the definition of conditional variance:

(1) Var_Y(X)=E_Y(X-E_YX)^2.

Property 1. If X,Y are independent, then X-EX and Y are also independent and conditioning doesn't change variance:

Var_Y(X)=E_Y(X-EX)^2=E(X-EX)^2=Var(X),

see Conditioning in case of independence.

Property 2. Generalized homogeneity of degree 2: if a is a deterministic function, then a^2(Y) can be pulled out:

Var_Y(a(Y)X)=E_Y[a(Y)X-E_Y(a(Y)X)]^2=E_Y[a(Y)X-a(Y)E_YX]^2

=E_Y[a^2(Y)(X-E_YX)^2]=a^2(Y)E_Y(X-E_YX)^2=a^2(Y)Var_Y(X).

Property 3. Shortcut for conditional variance:

(2) Var_Y(X)=E_Y(X^2)-(E_YX)^2.

Proof.

Var_Y(X)=E_Y(X-E_YX)^2=E_Y[X^2-2XE_YX+(E_YX)^2]

(distributing conditional expectation)

=E_YX^2-2E_Y(XE_YX)+E_Y(E_YX)^2

(applying Properties 2 and 6 from this Summary with a(Y)=E_YX)

=E_YX^2-2(E_YX)^2+(E_YX)^2=E_YX^2-(E_YX)^2.

Property 4. The law of total variance:

(3) Var(X)=Var(E_YX)+E[Var_Y(X)].

Proof. By the shortcut for usual variance and the law of iterated expectations

Var(X)=EX^2-(EX)^2=E[E_Y(X^2)]-[E(E_YX)]^2

(replacing E_Y(X^2) from (2))

=E[Var_Y(X)]+E(E_YX)^2-[E(E_YX)]^2

(the last two terms give the shortcut for variance of E_YX)

=E[Var_Y(X)]+Var(E_YX).
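A small numerical check may help digest (3); the discrete joint distribution below is made up, and both sides of the identity are computed directly from it:

```python
# Sketch: verify Var(X) = Var(E_Y X) + E[Var_Y(X)] on an arbitrary discrete joint distribution.
joint = {(0, 0): 0.10, (1, 0): 0.25, (2, 0): 0.15,   # keys are (x, y), values are P(X=x, Y=y)
         (0, 1): 0.20, (1, 1): 0.10, (2, 1): 0.20}

p_y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

def cond_moment(y, power):
    """E_Y(X^power) evaluated at Y = y."""
    return sum(p * x ** power for (x, yy), p in joint.items() if yy == y) / p_y[y]

# Left-hand side: Var(X) by the shortcut
ex = sum(p * x for (x, _), p in joint.items())
ex2 = sum(p * x * x for (x, _), p in joint.items())
var_x = ex2 - ex ** 2

# Right-hand side: Var(E_Y X) + E[Var_Y(X)]
e_cm = sum(p_y[y] * cond_moment(y, 1) for y in p_y)
var_cm = sum(p_y[y] * cond_moment(y, 1) ** 2 for y in p_y) - e_cm ** 2
e_cv = sum(p_y[y] * (cond_moment(y, 2) - cond_moment(y, 1) ** 2) for y in p_y)

print(var_x, var_cm + e_cv)   # the two numbers coincide
```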

Before we move further we need to define conditional covariance by

Cov_Y(S,T) = E_Y(S - E_YS)(T - E_YT)

(everywhere usual expectations are replaced by conditional ones). We say that random variables S,T are conditionally uncorrelated if Cov_Y(S,T) = 0.

Property 5. Conditional variance of a linear combination. For any random variables S,T and functions a(Y),b(Y) one has

Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+2a(Y)b(Y)Cov_Y(S,T)+b^2(Y)Var_Y(T).

The proof is quite similar to that in the case of usual variances, so we leave it to the reader. In particular, if S,T are conditionally uncorrelated, then the interaction term disappears:

Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+b^2(Y)Var_Y(T).

26 Nov 16

Properties of correlation

Correlation coefficient: the last block of statistical foundation

Correlation has already been mentioned in

Statistical measures and their geometric roots

Properties of standard deviation

The pearls of AP Statistics 35

Properties of covariance

The pearls of AP Statistics 33

The hierarchy of definitions

Suppose random variables X,Y are not constant. Then their standard deviations are not zero and we can define their correlation as in Chart 1.


Chart 1. Correlation definition

Properties of correlation

Property 1. Range of the correlation coefficient: for any X,Y one has - 1 \le \rho (X,Y) \le 1.
This follows from the Cauchy-Schwarz inequality, as explained here.

Recall from this post that correlation is the cosine of the angle between X-EX and Y-EY.
Property 2. Interpretation of extreme cases. (Part 1) If \rho (X,Y) = 1, then Y = aX + b with a > 0.

(Part 2) If \rho (X,Y) = - 1, then Y = aX + b with a < 0.

Proof. (Part 1) \rho (X,Y) = 1 implies
(1) Cov (X,Y) = \sigma (X)\sigma (Y)
which, in turn, implies that Y is a linear function of X: Y = aX + b (this is the second part of the Cauchy-Schwarz inequality). Further, we can establish the sign of the number a. By the properties of variance and covariance
Cov(X,Y)=Cov(X,aX+b)=aCov(X,X)+Cov(X,b)=aVar(X),

\sigma (Y)=\sigma(aX + b)=\sigma (aX)=|a|\sigma (X).
Plugging this into Eq. (1) we get aVar(X) = |a|\sigma^2(X)=|a|Var(X), so a=|a|, and since a=0 is impossible (then Y would be constant and \sigma(Y)=0), we see that a is positive.

The proof of Part 2 is left as an exercise.

Property 3. Suppose we want to measure correlation between weight W and height H of people. The measurements are either in kilos and centimeters {W_k},{H_c} or in pounds and feet {W_p},{H_f}. The correlation coefficient is unit-free in the sense that it does not depend on the units used: \rho (W_k,H_c)=\rho (W_p,H_f). Mathematically speaking, correlation is homogeneous of degree 0 in both arguments.
Proof. One measurement is proportional to another, W_k=aW_p,\ H_c=bH_f with some positive constants a,b. By homogeneity
\rho (W_k,H_c)=\frac{Cov(W_k,H_c)}{\sigma(W_k)\sigma(H_c)}=\frac{Cov(aW_p,bH_f)}{\sigma(aW_p)\sigma(bH_f)}=\frac{abCov(W_p,H_f)}{ab\sigma(W_p)\sigma (H_f)}=\rho (W_p,H_f).
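The property is easy to confirm numerically; in the sketch below the weight/height data and the conversion factors are invented for illustration:

```python
# Sketch: the correlation coefficient is unchanged by a change of units.
# Data are invented; 1 kg ~ 2.2046 lb, 1 cm ~ 0.0328 ft.
W_kg = [60.0, 72.5, 81.0, 55.0, 90.0]
H_cm = [165.0, 178.0, 182.0, 160.0, 190.0]
W_lb = [w * 2.2046 for w in W_kg]
H_ft = [h * 0.0328 for h in H_cm]

def rho(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

print(rho(W_kg, H_cm), rho(W_lb, H_ft))   # identical up to floating-point error
```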

 

13 Nov 16

Statistical measures and their geometric roots

Variance, covariance, standard deviation and correlation: their definitions and properties are deeply rooted in Euclidean geometry.

Here is the why: analogy with Euclidean geometry

Euclid axiomatically described the space we live in. What we have known about the geometry of this space since ancient times has never failed us. Therefore, statistical definitions based on Euclidean geometry are sure to work.

   1. Analogy between scalar product and covariance

Geometry. See Table 2 here for operations with vectors. The scalar product of two vectors X=(X_1,...,X_n),\ Y=(Y_1,...,Y_n) is defined by

(X,Y)=\sum X_iY_i.

Statistical analog: Covariance of two random variables is defined by

Cov(X,Y)=E(X-EX)(Y-EY).

Both the scalar product and covariance are linear in one argument when the other argument is fixed.

   2. Analogy between orthogonality and uncorrelatedness

Geometry. Two vectors X,Y are called orthogonal (or perpendicular) if

(1) (X,Y)=\sum X_iY_i=0.

Exercise. How do you draw on the plane the vectors X=(1,0),\ Y=(0,1)? Check that they are orthogonal.

Statistical analog: Two random variables are called uncorrelated if Cov(X,Y)=0.

   3. Measuring lengths


Figure 1. Length of a vector

Geometry: the length of a vector X=(X_1,...,X_n) is \sqrt{\sum X_i^2}, see Figure 1.

Statistical analog: the standard deviation of a random variable X is

\sigma(X)=\sqrt{Var(X)}=\sqrt{E(X-EX)^2}.

This explains the square root in the definition of the standard deviation.

   4. Cauchy-Schwarz inequality

Geometry: |(X,Y)|\le\sqrt{\sum X_i^2}\sqrt{\sum Y_i^2}.

Statistical analog: |Cov(X,Y)|\le\sigma(X)\sigma(Y). See the proof here. The proof of its geometric counterpart is similar.

   5. Triangle inequality


Figure 2. Triangle inequality

Geometry: \sqrt{\sum (X_i+Y_i)^2}\le\sqrt{\sum X_i^2}+\sqrt{\sum Y_i^2}, see Figure 2 where the length of X+Y does not exceed the sum of the lengths of X and Y.

Statistical analog: using the Cauchy-Schwarz inequality we have

\sigma(X+Y)=\sqrt{Var(X+Y)}

=\sqrt{Var(X)+2Cov(X,Y)+Var(Y)}

\le\sqrt{\sigma^2(X)+2\sigma(X)\sigma(Y)+\sigma^2(Y)}

=\sigma(X)+\sigma(Y).

   6. The Pythagorean theorem

Geometry: In a right triangle, the squared hypotenuse is equal to the sum of the squares of the two legs. The illustration is similar to Figure 2, except that the angle between X and Y should be right.

Proof. Taking two orthogonal vectors X,Y as legs, we have

Squared hypotenuse = \sum(X_i+Y_i)^2

(squaring out and using orthogonality (1))

=\sum X_i^2+2\sum X_iY_i+\sum Y_i^2=\sum X_i^2+\sum Y_i^2 = Sum of squared legs

Statistical analog: If two random variables are uncorrelated, then variance of their sum is a sum of variances Var(X+Y)=Var(X)+Var(Y).

   7. The most important analogy: measuring angles

Geometry: the cosine of the angle between two vectors X,Y is defined by

Cosine between X,Y = \frac{\sum X_iY_i}{\sqrt{\sum X_i^2\sum Y_i^2}}.

Statistical analog: the correlation coefficient between two random variables is defined by

\rho(X,Y)=\frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}.

This intuitively explains why the correlation coefficient takes values between -1 and +1.

Remark. My colleague Alisher Aldashev noticed that the correlation coefficient is the cosine of the angle between the deviations X-EX and Y-EY and not between X,Y themselves.
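To make the analogy concrete, here is a short sketch with invented data: the cosine of the angle between the two deviation vectors is exactly the correlation coefficient computed from the sample analogs.

```python
# Sketch: cosine of the angle between the deviation vectors equals the correlation coefficient.
# The data are made up.
X = [1.0, 2.0, 4.0, 7.0, 11.0]
Y = [2.0, 1.0, 5.0, 8.0, 9.0]

mx, my = sum(X) / len(X), sum(Y) / len(Y)
dx = [x - mx for x in X]                          # deviations from the mean
dy = [y - my for y in Y]

dot = sum(a * b for a, b in zip(dx, dy))          # scalar product
len_x = sum(a * a for a in dx) ** 0.5             # vector lengths
len_y = sum(b * b for b in dy) ** 0.5

cosine = dot / (len_x * len_y)
print(cosine)   # same number as Cov(X,Y)/(sigma(X)*sigma(Y)) with 1/n weights
```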

12 Nov 16

Properties of standard deviation

Properties of standard deviation are divided into two parts. The definitions and consequences are given here. Both variance and standard deviation are used to measure the variability of values of a random variable around its mean. Then why use both of them? The why will be explained in another post.

Properties of standard deviation: definitions and consequences

Definition. For a random variable X, the quantity \sigma (X) = \sqrt {Var(X)} is called its standard deviation.

    Digression about square roots and absolute values

In general, there are two square roots of a positive number, one positive and the other negative. The positive one is called an arithmetic square root. The arithmetic root is applied here to Var(X) \ge 0 (see properties of variance), so standard deviation is always nonnegative.
Definition. An absolute value of a real number a is defined by
(1) |a| =a if a is nonnegative and |a| =-a if a is negative.
This two-part definition is a stumbling block for many students, so making them plug in a few numbers is a must. It is introduced to measure the distance from point a to the origin. For example, dist(3,0) = |3| = 3 and dist(-3,0) = |-3| = 3. More generally, for any points a,b on the real line the distance between them is given by dist(a,b) = |a - b|.

By squaring both sides in Eq. (1) we obtain |a|^2={a^2}. Application of the arithmetic square root gives

(2) |a|=\sqrt {a^2}.

This is the equation we need right now.

Back to standard deviation

Property 1. Standard deviation is homogeneous of degree 1. Indeed, using homogeneity of variance and equation (2), we have

\sigma (aX) =\sqrt{Var(aX)}=\sqrt{{a^2}Var(X)}=|a|\sigma(X).

Unlike homogeneity of expected values, here we have an absolute value of the scaling coefficient a.

Property 2. Cauchy-Schwarz inequality. (Part 1) For any random variables X,Y one has

(3) |Cov(X,Y)|\le\sigma(X)\sigma(Y).

(Part 2) If the inequality sign in (3) turns into equality, |Cov(X,Y)|=\sigma (X)\sigma (Y), then Y is a linear function of X: Y = aX + b, with some constants a,b.
Proof. (Part 1) If at least one of the variables is constant, both sides of the inequality are 0 and there is nothing to prove. To exclude the trivial case, let X,Y be non-constant and, therefore, Var(X),\ Var(Y) are positive. Consider a real-valued function of a real number t defined by f(t) = Var(tX + Y). Here we have variance of a linear combination

f(t)=t^2Var(X)+2tCov(X,Y)+Var(Y).

We see that f(t) is a parabola with branches looking upward (because the leading coefficient Var(X) is positive). By nonnegativity of variance, f(t)\ge 0, so the parabola lies on or above the horizontal axis in the (t,f) plane. Hence, the quadratic equation f(t) = 0 may have at most one real root. This means that the discriminant of the equation is non-positive:

D=Cov(X,Y)^2-Var(X)Var(Y)\le 0.

Applying square roots to both sides of Cov(X,Y)^2\le Var(X)Var(Y) we finish the proof of the first part.

(Part 2) In case of the equality sign the discriminant is 0. Therefore the parabola touches the horizontal axis at a single point t_0, where f(t_0)=Var(t_0X + Y)=0. But we know from the properties of variance that zero variance implies t_0X + Y = constant, which is just another way of writing Y = aX + b (with a=-t_0).

Comment. (3) explains one of the main properties of the correlation:

-1\le\rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}\le 1.
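A quick numerical illustration of (3) and this comment (the simulated data and their distributions are chosen arbitrarily):

```python
# Sketch: |Cov(X,Y)| <= sigma(X)*sigma(Y) on simulated data, hence the ratio lies in [-1, 1].
import random

n = 10000
X = [random.gauss(0, 1) for _ in range(n)]
Y = [x + 0.5 * random.gauss(0, 1) for x in X]     # Y partly depends on X

mx, my = sum(X) / n, sum(Y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(X, Y)) / n
sx = (sum((a - mx) ** 2 for a in X) / n) ** 0.5
sy = (sum((b - my) ** 2 for b in Y) / n) ** 0.5

print(abs(cov), "<=", sx * sy)    # the Cauchy-Schwarz inequality
print(cov / (sx * sy))            # the correlation coefficient, a number in [-1, 1]
```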

 

 

 

25 Oct 16

Properties of variance

All properties of variance in one place

Certainty is the mother of quiet and repose, and uncertainty the cause of variance and contentions. Edward Coke

Preliminaries: study properties of means with proofs.

Definition. Yes, uncertainty leads to variance, and we measure it by Var(X)=E(X-EX)^2. It is useful to use the name deviation from mean for X-EX and realize that E(X-EX)=0, so that the mean of the deviation from mean cannot serve as a measure of variation of X around EX.

Property 1. Variance of a linear combination. For any random variables X,Y and numbers a,b one has
(1) Var(aX + bY)=a^2Var(X)+2abCov(X,Y)+b^2Var(Y).
The term 2abCov(X,Y) in (1) is called an interaction term. See this post for the definition and properties of covariance.
Proof.
Var(aX + bY)=E[aX + bY -E(aX + bY)]^2

(using linearity of means)
=E(aX + bY-aEX -bEY)^2

(grouping by variable)
=E[a(X-EX)+b(Y-EY)]^2

(squaring out)
=E[a^2(X-EX)^2+2ab(X-EX)(Y-EY)+b^2(Y-EY)^2]

(using linearity of means and definitions of variance and covariance)
=a^2Var(X) + 2abCov(X,Y) +b^2Var(Y).
Property 2. Variance of a sum. Letting in (1) a=b=1 we obtain
Var(X + Y) = Var(X) + 2Cov(X,Y)+Var(Y).

Property 3. Homogeneity of degree 2. Choose b=0 in (1) to get
Var(aX)=a^2Var(X).
Exercise. What do you think is larger: Var(X+Y) or Var(X-Y)?
Property 4. If we add a constant to a variable, its variance does not change: Var(X+c)=E[X+c-E(X+c)]^2=E(X+c-EX-c)^2=E(X-EX)^2=Var(X)
Property 5. Variance of a constant is zero: Var(c)=E(c-Ec)^2=0.

Property 6. Nonnegativity. Since the squared deviation from mean (X-EX)^2 is nonnegative, its expectation is nonnegative: E(X-EX)^2\ge 0.

Property 7. Only a constant can have variance equal to zero: If Var(X)=0, then E(X-EX)^2 =(x_1-EX)^2p_1 +...+(x_n-EX)^2p_n=0, see the definition of the expected value. Since all probabilities are positive, we conclude that x_i=EX for all i, which means that X is identically constant.

Property 8. Shortcut for variance. We have an identity E(X-EX)^2=EX^2-(EX)^2. Indeed, squaring out gives

E(X-EX)^2 =E(X^2-2XEX+(EX)^2)

(distributing expectation)

=EX^2-2E(XEX)+E(EX)^2

(expectation of a constant is constant)

=EX^2-2(EX)^2+(EX)^2=EX^2-(EX)^2.

All of the above properties apply to any random variables. The next one is an exception in the sense that it applies only to uncorrelated variables.

Property 9. If variables are uncorrelated, that is Cov(X,Y)=0, then from (1) we have Var(aX + bY)=a^2Var(X)+b^2Var(Y). In particular, letting a=b=1, we get additivity: Var(X+Y)=Var(X)+Var(Y). Recall that the expected value is always additive.

Generalizations. Var(\sum a_iX_i)=\sum a_i^2Var(X_i) and Var(\sum X_i)=\sum Var(X_i) if all X_i are uncorrelated.
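Here is a short sketch checking Property 1 on simulated data; sample analogs with 1/n weights are used, and the distributions and coefficients are arbitrary:

```python
# Sketch: Var(aX + bY) = a^2 Var(X) + 2ab Cov(X,Y) + b^2 Var(Y) on simulated data.
import random

n = 100000
X = [random.uniform(0, 1) for _ in range(n)]
Y = [random.gauss(0, 2) + x for x in X]            # correlated with X
a, b = 3.0, -2.0                                   # arbitrary coefficients

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return mean([(u - m) ** 2 for u in v])

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(s - mu) * (t - mv) for s, t in zip(u, v)])

Z = [a * x + b * y for x, y in zip(X, Y)]
print(var(Z), a ** 2 * var(X) + 2 * a * b * cov(X, Y) + b ** 2 * var(Y))  # equal up to rounding
```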

So far I have counted 12 of my posts where properties of variance are used.

21 Oct 16

The pearls of AP Statistics 33

Correlation and regression are two separate entities

They say: The correlation summarizes the direction of the association between two quantitative variables and the strength of its linear (straight-line) trend (Agresti and Franklin, p.105). Later, at a level that is supposed to be more advanced, they repeat: The correlation, denoted by r, describes linear association between two variables (p.586).

I say: This is a common misconception about correlation, even Wikipedia says so. Once I was consulting specialists from the Oncology Institute in Almaty. Until then, all of them were using correlation to study their data. When I suggested using simple regression, they asked what was the difference and how regression was better. I said: correlation is a measure of statistical relationship. When two variables are positively correlated and one of them goes up, the other also goes up (on average) but you never know by how much. On the other hand, regression gives a specific algebraic dependence between two variables, so that you can quantify your predictions about changes in one of them caused by changes in another.

Because of the algebra of least squares estimation, you can conclude something about correlation if you know the estimated slope, and vice versa, but conceptually correlation and regression are different and there is no need to delay the study of correlation until after regression. The correlation coefficient is defined as

(1) \rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}.

See this post for the definition and properties of covariance. As one can see, it can be studied right after the covariance and standard deviation. The slope of the regression line is a result of least squares fitting, which is a more advanced concept, and is given by

(2) b=\frac{Cov(X,Y)}{Var(X)},

see a simplified derivation or a full derivation. I am using the notations

(3) Cov(X,Y)=\frac{1}{n}\sum(X_i-\bar{X})(Y_i-\bar{Y}),\ Var(X)=Cov(X,X),\ \sigma(X)=\sqrt{Var(X)}

which arise from the corresponding population characteristics as explained in this post. Directly from (1) and (2) we see that

(4) b=\rho(X,Y)\frac{\sigma(Y)}{\sigma(X)},\ \rho(X,Y)=b\frac{\sigma(X)}{\sigma(Y)}.

Using these equations, we can go from the correlation to the slope and back if we know the sigmas. In particular, they are positive or negative simultaneously. The second equation in (4) gives rise to the interpretation of the correlation as "a standardized version of the slope" (p.588). To me, this "intuition" is far-fetched.

Notice how economical the sequence of definitions in (3) is: one follows from another, which makes them easier to remember, and summation signs are kept to a minimum. Under the "non-algebraic" approach, the covariance, variance and standard deviation are given separately, increasing the burden on one's memory.
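The link (4) between the slope and the correlation is easy to verify numerically; in the sketch below the data are made up and the sample analogs (3) are used:

```python
# Sketch: the slope from (2) coincides with rho * sigma(Y)/sigma(X) from (4).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
var_x = sum((x - mx) ** 2 for x in X) / n
var_y = sum((y - my) ** 2 for y in Y) / n
sx, sy = var_x ** 0.5, var_y ** 0.5

b = cov / var_x               # slope, formula (2)
rho = cov / (sx * sy)         # correlation, formula (1)
print(b, rho * sy / sx)       # formula (4): the two numbers coincide
```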

2 Oct 16

The pearls of AP Statistics 31

Demystifying sampling distributions: too much talking about nothing

What we know about sample means

Let X_1,...,X_n be an independent identically distributed sample and consider its sample mean \bar{X}.

Fact 1. The sample mean is an unbiased estimator of the population mean:

(1) E\bar{X}=\frac{1}{n}(EX_1+...+EX_n)=\frac{1}{n}(\mu+...+\mu)=\mu

(use linearity of means).

Fact 2. Variance of the sample mean is

(2) Var(\bar{X})=\frac{1}{n^2}(Var(X_1)+...+Var(X_n))=\frac{1}{n^2}(\sigma^2(X)+...+\sigma^2(X))=\frac{\sigma^2(X)}{n}

(use homogeneity of variance of degree 2 and additivity of variance for independent variables). Hence \sigma(\bar{X})=\frac{\sigma(X)}{\sqrt{n}}

Fact 3. The implication of these two properties is that the sample mean becomes more concentrated around the population mean as the sample size increases (see at least the law of large numbers; I have a couple more posts about this).

Fact 4. Finally, the z scores of sample means stabilize to a standard normal distribution (the central limit theorem).

What is a sampling distribution?

The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take (Agresti and Franklin, p.308). After this definition, the authors go ahead and discuss the above four facts. Note that none of them requires the knowledge of what the sampling distribution is. The ONLY sampling distribution that appears explicitly in AP Statistics is the binomial. However, in the book the binomial is given in Section 6.3, before sampling distributions, which are the subject of Chapter 7. Section 7.3 explains that the binomial is a sampling distribution but that section is optional. Thus the whole Chapter 7 (almost 40 pages) is redundant.

Then what are sampling distributions for?

Here is a simple example that explains their role. Consider the binomial X_1+X_2 of two observations on an unfair coin. It involves two random variables and therefore is described by a joint distribution with the sample space consisting of pairs of values

Table 1. Sample space for pair (X_1,X_2)

            Coin 1
            0        1
Coin 2   0  (0,0)    (0,1)
         1  (1,0)    (1,1)

Each coin independently takes values 0 and 1 (shown in the margins); the sample space contains four pairs of these values (shown in the main body of the table). The corresponding probability distribution is given by the table

Table 2. Joint probabilities for pair (X_1,X_2)

            Coin 1
            p        q
Coin 2   p  p^2      pq
         q  pq       q^2

Since we are counting only the number of successes, the outcomes (0,1) and (1,0) for the purposes of our experiment are the same. Hence, joining indistinguishable outcomes, we obtain a smaller sample space

Table 3. Sampling distribution for binomial X_1+X_2

# of successes Corresponding probabilities
0 p^2
1 2pq
2 q^2

The last table is the sampling distribution for the binomial with sample size 2. All the sampling distribution does is replace a large joint distribution Table 1+Table 2 by a smaller distribution Table 3. The beauty of proofs of equations (1) and (2) is that they do not depend on which distribution is used (the distribution is hidden in the expected value operator).
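The reduction is easy to code. In the sketch below p and q are arbitrary numbers with p+q=1, with p attached to the value 0 and q to the value 1, as in Tables 2 and 3:

```python
# Sketch: collapse the joint distribution (Table 2) into the sampling distribution of
# the number of successes (Table 3). p and q are arbitrary, p + q = 1.
p, q = 0.3, 0.7

joint = {(0, 0): p * p, (0, 1): p * q,      # Table 2: P(coin = 0) = p, P(coin = 1) = q
         (1, 0): q * p, (1, 1): q * q}

sampling = {}
for (x1, x2), prob in joint.items():
    successes = x1 + x2                     # (0,1) and (1,0) are merged at this step
    sampling[successes] = sampling.get(successes, 0) + prob

print(sampling)   # {0: p^2, 1: 2pq, 2: q^2}, which is Table 3
```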

Unless you want your students to appreciate the reduction in the sample space brought about by sampling distributions, it is not worth discussing them. See Wikipedia for examples other than the binomial.

15 Sep 16

The pearls of AP Statistics 28

From independence of events to independence of random variables

One way to avoid complex math is to show the students simplified, plausible derivations which create an appearance of rigor and provide enough ground for intuition. This is what I try to do here.

Independence of random variables

Let X,Y be two random variables. Suppose X takes values x_1,x_2 with probabilities P(X=x_i)=p_i. Similarly, Y takes values y_1,y_2 with probabilities P(Y=y_i)=q_i. Now we want to consider a pair (X,Y). The pair can take values (x_i,y_j) where i,j take values 1,2. These are joint events with probabilities denoted P(X=x_i,Y=y_j)=p_{i,j}.

Definition. X,Y are called independent if for all i,j one has

(1) p_{i,j}=p_iq_j.

Thus, in case of two-valued variables, their independence means independence of 4 events. Independence of variables is a more complex condition than independence of events.

Properties of independent variables

Property 1. For independent variables, we have EXY=EXEY (multiplicativity). Indeed, by definition of the expected value and equation (1)

EXY=x_1y_1p_{1,1}+x_1y_2p_{1,2}+x_2y_1p_{2,1}+x_2y_2p_{2,2}

=x_1y_1p_1q_1+x_1y_2p_1q_2+x_2y_1p_2q_1+x_2y_2p_2q_2

=(x_1p_1+x_2p_2)(y_1q_1+y_2q_2)=EXEY.

Remark. This proof is a good exercise to check how well students understand the definitions of the product XY and of the expectation operator. Note also that, unlike linearity E(aX+bY)=aEX+bEY, which is always true, multiplicativity is not guaranteed without independence.
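A numerical check of multiplicativity (the values and probabilities below are invented; independence is imposed through p_{i,j}=p_iq_j):

```python
# Sketch: E(XY) = EX * EY for two-valued independent variables; the numbers are made up.
x_vals, p = [1.0, 4.0], [0.3, 0.7]
y_vals, q = [2.0, 5.0], [0.6, 0.4]

EX = sum(x * px for x, px in zip(x_vals, p))
EY = sum(y * qy for y, qy in zip(y_vals, q))

# E(XY) from the joint probabilities p_ij = p_i * q_j, i.e. definition (1)
EXY = sum(x * y * px * qy for x, px in zip(x_vals, p) for y, qy in zip(y_vals, q))

print(EXY, EX * EY)   # equal up to rounding
```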

Property 2. Independent variables are uncorrelated: Cov(X,Y)=0. This follows immediately from multiplicativity and the shortcut for covariance:

(2) Cov(X,Y)=E(XY)-(EX)(EY)=0.

Remark. Independence is stronger than uncorrelatedness: variables can be uncorrelated but not independent.

Property 3. For independent variables, variance is additive: Var(X+Y)=Var(X)+Var(Y). This easily follows from the general formula for Var(X+Y) and equation (2):

Var(X+Y)=Var(X)+2Cov(X,Y)+Var(Y)=Var(X)+Var(Y).

Property 4. Independence is such a strong property that it is preserved under nonlinear transformations. This means the following. Take two deterministic functions f,g; apply one to X and the other to Y. The resulting random variables f(X),g(Y) will be independent. Instead of the proof, I provide an application. If z_1,z_2 are two independent standard normals, then z^2_1,z^2_2 are two independent chi-square variables with 1 degree of freedom.

Remark. Normality is preserved only under linear transformations.

This post is an antithesis of the following definition from (Agresti and Franklin, p.540): Two categorical variables are independent if the population conditional distributions for one of them are identical at each category of the other. The variables are dependent (or associated) if the conditional distributions are not identical.