29
Jun 17

Introduction to Stata

Introduction to Stata: the Stata interface, how to use Stata Help, how to use the Data Editor and how to graph data. Important details to remember:

1. In any program, the first thing to use is Help. I learned everything from Help and never took any programming courses.
2. The number of observations for all variables in one data file must be the same. This can be a problem if, for example, you want to see out-of-sample predictions.
3. In Data Editor, numeric variables are displayed in black and strings are displayed in red.
4. The name of the hidden variable that counts observations is _n.
5. If you have several graph definitions in the two-way graphs menu, they will be graphed together or separately, depending on which definitions are enabled or disabled.

See details in videos. Sorry about the background noise!

Video 1. Stata interface. The windows introduced: Results, Command, Variables, Properties, Review and Viewer.

Video 2. Using Stata Help. Help can be used through the Review window or in a separate PDF viewer. EViews Help is much easier to understand.

Video 3. Using Data Editor. How to open and view variables, the visual difference between numeric variables and string variables. The lengths of all variables in the same file must be the same.

Video 4. Graphing data. To graph a variable, you need to define its graph and then display it. It is possible to display more than one variable on the same chart.

21
Feb 17

The pearls of AP Statistics 37

Confidence interval: attach probability or not attach?

I am reading "5 Steps to a 5 AP Statistics, 2010-2011 Edition" by Duane Hinders (sorry, I don't have the latest edition). The tip at the bottom of p.200 says:

For the exam, be VERY, VERY clear on the discussion above. Many students
seem to think that we can attach a probability to our interpretation of a confidence
interval. We cannot.

This is one of those misconceptions that travel from book to book. Below I show how it may have arisen.

Confidence interval derivation

The intuition behind the confidence interval and the confidence interval derivation using the z score have been given here. To make the discussion close to Duane Hinders, I show the confidence interval derivation using the t statistic. Let $X_1,...,X_n$ be a sample of independent observations from a normal population, $\mu$ the population mean and $s$ the sample standard deviation (so that $s/\sqrt{n}$ is the standard error). Skipping the intuition, let's go directly to the t statistic

(1) $t=\frac{\bar{X}-\mu}{s/\sqrt{n}}$.

At the 95% confidence level, from statistical tables find the critical value $t_{cr,0.95}$ of the t statistic such that

$P(-t_{cr,0.95}<t<t_{cr,0.95})=0.95.$

Plug here (1) to get

(2) $P(-t_{cr,0.95}<\frac{\bar{X}-\mu}{s/\sqrt{n}}<t_{cr,0.95})=0.95.$

Using equivalent transformations of inequalities (multiplying them by $s/\sqrt{n}$ and adding $\mu$ to all sides) we rewrite (2) as

(3) $P(\mu-t_{cr,0.95}\frac{s}{\sqrt{n}}<\bar{X}<\mu+t_{cr,0.95}\frac{s}{\sqrt{n}})=0.95.$

Thus, we have proved

Statement 1. The interval $\mu\pm t_{cr,0.95}\frac{s}{\sqrt{n}}$ contains the values of the sample mean with probability 95%.

The left-side inequality in (3) is equivalent to $\mu<\bar{X}+t_{cr,0.95}\frac{s}{\sqrt{n}}$ and the right-side one is equivalent to $\bar{X}-t_{cr,0.95}\frac{s}{\sqrt{n}}<\mu$. Combining these two inequalities, we see that (3) can be equivalently written as

(4) $P(\bar{X}-t_{cr,0.95}\frac{s}{\sqrt{n}}<\mu<\bar{X}+t_{cr,0.95}\frac{s}{\sqrt{n}})=0.95.$

So, we have

Statement 2. The interval $\bar{X}\pm t_{cr,0.95}\frac{s}{\sqrt{n}}$ contains the population mean with probability 95%.
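As a sanity check (my own addition, not in the original text), here is a small Python simulation of Statement 2: over repeated samples from a normal population, the interval $\bar{X}\pm t_{cr,0.95}\frac{s}{\sqrt{n}}$ should contain the population mean about 95% of the time. The critical value 2.064 for 24 degrees of freedom is taken from standard tables.

```python
# Simulation sketch: empirical coverage of the t-based confidence interval.
import math
import random

random.seed(0)

mu, sigma, n, reps = 10.0, 2.0, 25, 2000
t_cr = 2.064  # t_{cr,0.95} for n - 1 = 24 degrees of freedom (from tables)

hits = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    half = t_cr * s / math.sqrt(n)  # half-width of the interval
    if xbar - half < mu < xbar + half:
        hits += 1

coverage = hits / reps
print(coverage)  # close to 0.95
```

Ex-ante, each interval is random and the 95% applies; ex-post, each computed interval either contains $\mu$ or it doesn't, in line with the discussion below.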

Source of the misconception

In (3), the variable in the middle ($\bar{X}$) is random, and the statement that it belongs to some interval is naturally probabilistic. People not familiar with the above derivation don't understand how a statement that the population mean (which is a constant) belongs to some interval can be probabilistic. It's the interval ends that are random in (4) (the sample mean and standard error are both random), that's why there is probability! Statements 1 and 2 are equivalent!

My colleague Aidan Islyami mentioned that we should distinguish estimates from estimators.

In all statistical derivations random variables are ex-ante (before the event). No book says that but that's the way it is. An estimate is an ex-post (after the event) value of an estimator. An estimate is, of course, a number and not a random variable. Ex-ante, a confidence interval always has a probability. Ex-post, the fact that an estimate belongs to some interval is deterministic (has probability either 0 or 1) and it doesn't make sense to talk about 95%.

Since confidence levels are always strictly between 0 and 100%, students should keep in mind that we deal with ex-ante variables.
11
Feb 17

Gauss-Markov theorem

The Gauss-Markov theorem states that the OLS estimator is the most efficient in a certain class. Without algebra, you cannot take a single step further, whether toward the precise theoretical statement or an application.

Why do we care about linearity?

The concept of linearity has been repeated many times in my posts. Here we have to start from scratch, to apply it to estimators.

The slope in simple regression

(1) $y_i=a+bx_i+e_i$

can be estimated by

$\hat{b}(y,x)=\frac{Cov_u(y,x)}{Var_u(x)}$.

Note that the notation makes explicit the dependence of the estimator on $x,y$. Imagine that we have two sets of observations: $(y_1^{(1)},x_1),...,(y_n^{(1)},x_n)$ and $(y_1^{(2)},x_1),...,(y_n^{(2)},x_n)$ (the x coordinates are the same but the y coordinates are different). In addition, the regressor is deterministic. The x's could be spatial units and the y's temperature measurements at these units at two different moments.

Definition. We say that $\hat{b}(y,x)$ is linear with respect to $y$ if for any two vectors $y^{(i)}= (y_1^{(i)},...,y_n^{(i)}),$ $i=1,2,$ and numbers $c,d$ we have

$\hat{b}(cy^{(1)}+dy^{(2)},x)=c\hat{b}(y^{(1)},x)+d\hat{b}(y^{(2)},x)$.

This definition is quite similar to that of linearity of means. Linearity of the estimator with respect to $y$ easily follows from linearity of covariance

$\hat{b}(cy^{(1)}+dy^{(2)},x)=\frac{Cov_u(cy^{(1)}+dy^{(2)},x)}{Var_u(x)}=c\hat{b}(y^{(1)},x)+d\hat{b}(y^{(2)},x)$.

In addition to knowing how to establish linearity, it's a good idea to be able to see when something is not linear. Recall that linearity implies homogeneity of degree 1. Hence, if something is not homogeneous of degree 1, it cannot be linear. The OLS estimator is not linear in x because it is homogeneous of degree -1 in x:

$\hat{b}(y,cx)=\frac{Cov_u(y,cx)}{Var_u(cx)}=\frac{c}{c^2}\frac{Cov_u(y,x)}{Var_u(x)}=\frac{1}{c}\hat{b}(y,x)$.
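Both facts are easy to verify numerically. Below is a short Python sketch (my own illustration, with made-up data) that checks linearity in $y$ and homogeneity of degree $-1$ in $x$; the helper functions mimic the uniform-weight covariance $Cov_u$ and variance $Var_u$.

```python
# Numeric check: the OLS slope estimator is linear in y, homogeneous of degree -1 in x.
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    # uniform-weight sample covariance, Cov_u
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def slope(y, x):
    # OLS slope estimator b_hat(y, x) = Cov_u(y, x) / Var_u(x)
    return cov(y, x) / cov(x, x)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y1 = [2.1, 3.9, 6.2, 8.1, 9.8]
y2 = [1.0, 0.5, 2.0, 1.5, 3.0]
c, d = 2.0, -3.0

# Linearity in y: b_hat(c*y1 + d*y2, x) = c*b_hat(y1, x) + d*b_hat(y2, x)
lhs = slope([c * a + d * b for a, b in zip(y1, y2)], x)
rhs = c * slope(y1, x) + d * slope(y2, x)
assert abs(lhs - rhs) < 1e-9

# Homogeneity of degree -1 in x: b_hat(y, 5x) = b_hat(y, x) / 5
assert abs(slope(y1, [5 * xi for xi in x]) - slope(y1, x) / 5) < 1e-9
```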

Gauss-Markov theorem

Students don't have problems remembering the acronym BLUE: the OLS estimator is the Best Linear Unbiased Estimator. Decoding this acronym starts from the end.

1. An estimator, by definition, is a function of sample data.
2. Unbiasedness of OLS estimators is thoroughly discussed here.
3. Linearity of the slope estimator with respect to $y$ has been proved above. Linearity with respect to $x$ is not required.
4. Now we look at the class of all slope estimators that are linear with respect to $y$. As an exercise, show that the instrumental variables estimator belongs to this class.

Gauss-Markov Theorem. Under the classical assumptions, the OLS estimator of the slope has the smallest variance in the class of all slope estimators that are linear with respect to $y$.

In particular, the OLS estimator of the slope is more efficient than the IV estimator. The beauty of this result is that you don't need expressions of their variances (even though they can be derived).

Remark. Even the above formulation is incomplete. In fact, the pair intercept estimator plus slope estimator is efficient. This requires matrix algebra.
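To see the theorem at work, here is a simulation sketch (my own addition, not from the text): under classical assumptions with a fixed regressor, estimate the slope many times by OLS and by IV with a made-up instrument, then compare the empirical variances of the two estimators. The instrument construction is purely illustrative.

```python
# Simulation: variance of OLS slope vs variance of the IV slope estimator.
import random

random.seed(1)

n, reps, b_true = 50, 3000, 1.5
x = [i / 10 for i in range(n)]                 # deterministic regressor
z = [xi + random.gauss(0, 1) for xi in x]      # illustrative instrument, fixed across reps

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

ols, iv = [], []
for _ in range(reps):
    y = [2.0 + b_true * xi + random.gauss(0, 1) for xi in x]
    ols.append(cov(y, x) / cov(x, x))          # OLS slope
    iv.append(cov(y, z) / cov(z, x))           # IV slope (linear in y as well)

var_ols = cov(ols, ols)                        # empirical variance of OLS estimates
var_iv = cov(iv, iv)                           # empirical variance of IV estimates
print(var_ols, var_iv)                         # OLS variance is the smaller one
```

Both estimators are unbiased and linear in $y$, so the Gauss-Markov theorem predicts the ordering of the two variances without any explicit variance formulas.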

9
Dec 16

Ditch statistical tables if you have a computer

You don't need statistical tables if you have Excel or Mathematica. Here I give the relevant Excel and Mathematica functions described in Chapter 14 of my book. You can save all the formulas in one spreadsheet or notebook and use it multiple times.

Cumulative Distribution Function of the Standard Normal Distribution

For a given real $z$, the value of the distribution function of the standard normal is
$F(z)=\frac{1}{\sqrt{2\pi }}\int_{-\infty }^{z}\exp (-t^{2}/2)dt.$

In Excel, use the formula =NORM.S.DIST(z,TRUE).

In Mathematica, enter CDF[NormalDistribution[0,1],z]

Probability Function of the Binomial Distribution

For given number of successes $x,$ number of trials $n$ and probability $p$ the probability is

$P(Binomial=x)=C_{x}^{n}p^{x}(1-p)^{n-x}$.

In Excel, use the formula =BINOM.DIST(x,n,p,FALSE)

In Mathematica, enter PDF[BinomialDistribution[n,p],x]

Cumulative Binomial Probabilities

For a given cut-off value $x,$ number of trials $n$ and probability $p$ the cumulative probability is

$P(Binomial\leq x)=\sum_{t=0}^{x}C_{t}^{n}p^{t}(1-p)^{n-t}.$

In Excel, use the formula =BINOM.DIST(x,n,p,TRUE).

In Mathematica, enter CDF[BinomialDistribution[n,p],x]

Values of the Exponential Function $e^{-\lambda}$

In Excel, use the formula =EXP(-lambda)

In Mathematica, enter Exp[-lambda]

Individual Poisson Probabilities

For given number of successes $x$ and arrival rate $\lambda$ the probability is

$P(Poisson=x)=\frac{e^{-\lambda }\lambda^{x}}{x!}.$

In Excel, use the formula =POISSON.DIST(x,lambda,FALSE)

In Mathematica, enter PDF[PoissonDistribution[lambda],x]

Cumulative Poisson Probabilities

For given cut-off $x$ and arrival rate $\lambda$ the cumulative probability is

$P(Poisson\leq x)=\sum_{t=0}^{x}\frac{e^{-\lambda }\lambda ^{t}}{t!}.$

In Excel, use the formula =POISSON.DIST(x,lambda,TRUE)

In Mathematica, enter CDF[PoissonDistribution[lambda],x]

Cutoff Points of the Chi-Square Distribution Function

For given probability of the right tail $\alpha$ and degrees of freedom $\nu$, the cut-off value (critical value) $\chi _{\nu,\alpha }^{2}$ is a solution of the equation
$P(\chi _{\nu}^{2}>\chi _{\nu,\alpha }^{2})=\alpha .$

In Excel, use the formula =CHISQ.INV.RT(alpha,v)

In Mathematica, enter InverseCDF[ChiSquareDistribution[v],1-alpha]

Cutoff Points for the Student’s t Distribution

For given probability of the right tail $\alpha$ and degrees of freedom $\nu$, the cut-off value $t_{\nu,\alpha }$ is a solution of the equation $P(t_{\nu}>t_{\nu,\alpha })=\alpha$.

In Excel, use the formula =T.INV(1-alpha,v)

In Mathematica, enter InverseCDF[StudentTDistribution[v],1-alpha]

Cutoff Points for the F Distribution

For given probability of the right tail $\alpha$, degrees of freedom $v_1$ (numerator) and $v_2$ (denominator), the cut-off value $F_{v_1,v_2,\alpha }$ is a solution of the equation $P(F_{v_1,v_2}>F_{v_1,v_2,\alpha })=\alpha$.

In Excel, use the formula =F.INV.RT(alpha,v1,v2)

In Mathematica, enter InverseCDF[FRatioDistribution[v1,v2],1-alpha]
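If you prefer Python, the first few of these functions are available in the standard library. The sketch below is my own addition; the cut-off points of the chi-square, t and F distributions are not in the standard library and require a third-party package such as SciPy (e.g. scipy.stats.chi2.isf(alpha, v)).

```python
# Standard-library Python equivalents of the Excel/Mathematica formulas above.
from math import comb, exp, factorial
from statistics import NormalDist

# Cumulative distribution function of the standard normal, F(z)
F = NormalDist(0, 1).cdf

# Individual binomial probability P(Binomial = x)
def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Cumulative binomial probability P(Binomial <= x)
def binom_cdf(x, n, p):
    return sum(binom_pmf(t, n, p) for t in range(x + 1))

# Individual Poisson probability P(Poisson = x)
def poisson_pmf(x, lam):
    return exp(-lam) * lam ** x / factorial(x)

# Cumulative Poisson probability P(Poisson <= x)
def poisson_cdf(x, lam):
    return sum(poisson_pmf(t, lam) for t in range(x + 1))

print(F(1.96))              # about 0.975
print(binom_cdf(3, 10, 0.5))  # 0.171875
```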

26
Nov 16

Properties of correlation

Correlation coefficient: the last block of statistical foundation

Correlation has already been mentioned in

Statistical measures and their geometric roots

Properties of standard deviation

The pearls of AP Statistics 35

Properties of covariance

The pearls of AP Statistics 33

The hierarchy of definitions

Suppose random variables $X,Y$ are not constant. Then their standard deviations are not zero and we can define their correlation as in Chart 1.

Chart 1. Correlation definition

Properties of correlation

Property 1. Range of the correlation coefficient: for any $X,Y$ one has $- 1 \le \rho (X,Y) \le 1$.
This follows from the Cauchy-Schwarz inequality, as explained here.

Recall from this post that correlation is cosine of the angle between $X-EX$ and $Y-EY$.
Property 2. Interpretation of extreme cases. (Part 1) If $\rho (X,Y) = 1$, then $Y = aX + b$ with $a > 0.$

(Part 2) If $\rho (X,Y) = - 1$, then $Y = aX + b$ with $a < 0$.

Proof. (Part 1) $\rho (X,Y) = 1$ implies
(1) $Cov (X,Y) = \sigma (X)\sigma (Y)$
which, in turn, implies that $Y$ is a linear function of $X$: $Y = aX + b$ (this is the second part of the Cauchy-Schwarz inequality). Further, we can establish the sign of the number $a$. By the properties of variance and covariance
$Cov(X,Y)=Cov(X,aX+b)=aCov(X,X)+Cov(X,b)=aVar(X)$,

$\sigma (Y)=\sigma(aX + b)=\sigma (aX)=|a|\sigma (X)$.
Plugging this in Eq. (1) we get $aVar(X) = |a|\sigma^2(X)$ and see that $a$ is positive.

The proof of Part 2 is left as an exercise.

Property 3. Suppose we want to measure correlation between weight $W$ and height $H$ of people. The measurements are either in kilos and centimeters ${W_k},{H_c}$ or in pounds and feet ${W_p},{H_f}$. The correlation coefficient is unit-free in the sense that it does not depend on the units used: $\rho (W_k,H_c)=\rho (W_p,H_f)$. Mathematically speaking, correlation is homogeneous of degree $0$ in both arguments.
Proof. One measurement is proportional to another, $W_k=aW_p,\ H_c=bH_f$ with some positive constants $a,b$. By homogeneity
$\rho (W_k,H_c)=\frac{Cov(W_k,H_c)}{\sigma(W_k)\sigma(H_c)}=\frac{Cov(aW_p,bH_f)}{\sigma(aW_p)\sigma(bH_f)}=\frac{abCov(W_p,H_f)}{ab\sigma(W_p)\sigma (H_f)}=\rho (W_p,H_f).$
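Here is a quick numeric illustration of Property 3 in Python (my own check, with made-up measurements; the conversion factors are the usual kilogram/pound and centimeter/foot constants):

```python
# Check that the correlation coefficient is unit-free.
weights_kg = [60.0, 72.5, 81.0, 55.3, 90.2]
heights_cm = [165.0, 178.0, 182.5, 160.0, 190.0]

kg_per_lb, cm_per_ft = 0.45359237, 30.48
weights_lb = [w / kg_per_lb for w in weights_kg]   # same weights in pounds
heights_ft = [h / cm_per_ft for h in heights_cm]   # same heights in feet

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def corr(a, b):
    return cov(a, b) / (cov(a, a) ** 0.5 * cov(b, b) ** 0.5)

# rho(W_k, H_c) = rho(W_p, H_f): the scaling constants cancel
assert abs(corr(weights_kg, heights_cm) - corr(weights_lb, heights_ft)) < 1e-12
```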

20
Nov 16

The pearls of AP Statistics 36

ANOVA: the artefact that survives because of the College Board

Why ANOVA should be dropped from AP Statistics

1. The common argument in favor of using ANOVA is that "The methods introduced in this chapter [Comparing Groups: Analysis of Variance Methods] apply when a quantitative response variable has a categorical explanatory variable" (Agresti and Franklin, p. 680). However, categorical explanatory variables can be replaced by indicator (dummy) variables, and then regression methods can be used to study dependences involving categorical variables. On p. 695, the authors admit that "ANOVA can be presented as a special case of multiple regression".
2. In terms of knowledge of basic statistical ideas (hypothesis testing, the F statistic, significance level), ANOVA doesn't add any value. Those who have mastered these basic ideas will not have problems learning ANOVA at their workplace if they have to. There is no need to burden everybody with this stuff "just in case".
3. The explanation of ANOVA is accompanied with definitions of the within-groups variance estimate and between-groups variance estimate (Agresti and Franklin, p. 686). Even in my courses, where I give a lot of algebra, the students don't get them unless they do a couple of theoretical exercises. At the AP Stats level, the usefulness of these definitions is nil.
4. The requirement to remember how the F statistic and degrees of freedom are calculated, for the purpose of being able to interpret just one table of output from a statistical package, doesn't make sense. In my book, I have a whole chapter on ANOVA, with most derivations, and I don't remember a thing. Why torture the students?
5. In the 90 years since R. Fisher invented ANOVA, many other, more precise and versatile statistical methods have been developed.

Conclusion

There are two suggestions:

1) Explain just the intuition and then jump to the interpretation of output, indicating the statistic to look at, as in Table 14.14.

2) The theory of ANOVA is useful for two reasons: there is a lot of manipulation with summation signs and there is a link to regressions. Learning all this may be the only justification to study ANOVA with definitions. In my classes, this takes 6 hours.

13
Nov 16

Statistical measures and their geometric roots

Variance, covariance, standard deviation and correlation: their definitions and properties are deeply rooted in Euclidean geometry.

Here is the why: analogy with Euclidean geometry

Euclid axiomatically described the space we live in. What we have known about the geometry of this space since ancient times has never failed us. Therefore, statistical definitions based on Euclidean geometry are sure to work.

1. Analogy between scalar product and covariance

Geometry. See Table 2 here for operations with vectors. The scalar product of two vectors $X=(X_1,...,X_n),\ Y=(Y_1,...,Y_n)$ is defined by

$(X,Y)=\sum X_iY_i.$

Statistical analog: Covariance of two random variables is defined by

$Cov(X,Y)=E(X-EX)(Y-EY).$

Both the scalar product and covariance are linear in one argument when the other argument is fixed.

2. Analogy between orthogonality and uncorrelatedness

Geometry. Two vectors $X,Y$ are called orthogonal (or perpendicular) if

(1) $(X,Y)=\sum X_iY_i=0.$

Exercise. How do you draw on the plane the vectors $X=(1,0),\ Y=(0,1)$? Check that they are orthogonal.

Statistical analog: Two random variables are called uncorrelated if $Cov(X,Y)=0$.

3. Measuring lengths

Figure 1. Length of a vector

Geometry: the length of a vector $X=(X_1,...,X_n)$ is $\sqrt{\sum X_i^2}$, see Figure 1.

Statistical analog: the standard deviation of a random variable $X$ is

$\sigma(X)=\sqrt{Var(X)}=\sqrt{E(X-EX)^2}.$

This explains the square root in the definition of the standard deviation.

4. Cauchy-Schwarz inequality

Geometry: $|(X,Y)|\le\sqrt{\sum X_i^2}\sqrt{\sum Y_i^2}$.

Statistical analog: $|Cov(X,Y)|\le\sigma(X)\sigma(Y)$. See the proof here. The proof of its geometric counterpart is similar.

5. Triangle inequality

Figure 2. Triangle inequality

Geometry: $\sqrt{\sum (X_i+Y_i)^2}\le\sqrt{\sum X_i^2}+\sqrt{\sum Y_i^2}$, see Figure 2 where the length of X+Y does not exceed the sum of lengths of X and Y.

Statistical analog: using the Cauchy-Schwarz inequality we have

$\sigma(X+Y)=\sqrt{Var(X+Y)}$ $=\sqrt{Var(X)+2Cov(X,Y)+Var(Y)}$ $\le\sqrt{\sigma^2(X)+2\sigma(X)\sigma(Y)+\sigma^2(Y)}$ $=\sigma(X)+\sigma(Y).$

6. The Pythagorean theorem

Geometry: In a right triangle, the squared hypotenuse is equal to the sum of the squares of the two legs. The illustration is similar to Figure 2, except that the angle between X and Y should be right.

Proof. Taking two orthogonal vectors $X,Y$ as legs, we have

Squared hypotenuse = $\sum(X_i+Y_i)^2$

(squaring out and using orthogonality (1))

$=\sum X_i^2+2\sum X_iY_i+\sum Y_i^2=\sum X_i^2+\sum Y_i^2$ = Sum of squared legs

Statistical analog: If two random variables are uncorrelated, then variance of their sum is a sum of variances $Var(X+Y)=Var(X)+Var(Y).$

7. The most important analogy: measuring angles

Geometry: the cosine of the angle between two vectors $X,Y$ is defined by

Cosine between X,Y = $\frac{\sum X_iY_i}{\sqrt{\sum X_i^2\sum Y_i^2}}.$

Statistical analog: the correlation coefficient between two random variables is defined by

$\rho(X,Y)=\frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}.$

This intuitively explains why the correlation coefficient takes values between -1 and +1.

Remark. My colleague Alisher Aldashev noticed that the correlation coefficient is the cosine of the angle between the deviations $X-EX$ and $Y-EY$ and not between $X,Y$ themselves.
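This remark is easy to verify numerically. The Python sketch below (my own addition) computes the cosine of the angle between the deviation vectors and compares it with the correlation coefficient computed from the covariance formula:

```python
# Correlation as the cosine of the angle between deviation vectors.
from math import sqrt

X = [1.0, 2.0, 4.0, 7.0]
Y = [0.5, 1.0, 3.5, 6.0]
n = len(X)

def mean(v):
    return sum(v) / len(v)

xc = [xi - mean(X) for xi in X]   # deviations X - EX
yc = [yi - mean(Y) for yi in Y]   # deviations Y - EY

# Geometric side: cosine of the angle between the deviation vectors
dot = sum(a * b for a, b in zip(xc, yc))
cos_angle = dot / (sqrt(sum(a * a for a in xc)) * sqrt(sum(b * b for b in yc)))

# Statistical side: rho = Cov(X,Y) / (sigma(X) * sigma(Y))
cov_xy = dot / n
rho = cov_xy / (sqrt(sum(a * a for a in xc) / n) * sqrt(sum(b * b for b in yc) / n))

assert abs(rho - cos_angle) < 1e-12  # the 1/n factors cancel
```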

12
Nov 16

Properties of standard deviation

Properties of standard deviation are divided into two parts. The definitions and consequences are given here. Both variance and standard deviation are used to measure variability of values of a random variable around its mean. Then why use both of them? The why will be explained in another post.

Properties of standard deviation: definitions and consequences

Definition. For a random variable $X$, the quantity $\sigma (X) = \sqrt {Var(X)}$ is called its standard deviation.

Digression about square roots and absolute values

In general, there are two square roots of a positive number, one positive and the other negative. The positive one is called an arithmetic square root. The arithmetic root is applied here to $Var(X) \ge 0$ (see properties of variance), so standard deviation is always nonnegative.
Definition. An absolute value of a real number $a$ is defined by
(1) $|a| =a$ if $a$ is nonnegative and $|a| =-a$ if $a$ is negative.
This two-part definition is a stumbling block for many students, so making them plug in a few numbers is a must. It is introduced to measure the distance from point $a$ to the origin. For example, $dist(3,0) = |3| = 3$ and $dist(-3,0) = |-3| = 3$. More generally, for any points $a,b$ on the real line the distance between them is given by $dist(a,b) = |a - b|$.

By squaring both sides in Eq. (1) we obtain $|a|^2={a^2}$. Application of the arithmetic square root gives

(2) $|a|=\sqrt {a^2}.$

This is the equation we need right now.

Back to standard deviation

Property 1. Standard deviation is homogeneous of degree 1. Indeed, using homogeneity of variance and equation (2), we have

$\sigma (aX) =\sqrt{Var(aX)}=\sqrt{{a^2}Var(X)}=|a|\sigma(X).$

Unlike homogeneity of expected values, here we have an absolute value of the scaling coefficient $a$.

Property 2. Cauchy-Schwarz inequality. (Part 1) For any random variables $X,Y$ one has

(3) $|Cov(X,Y)|\le\sigma(X)\sigma(Y)$.

(Part 2) If the inequality sign in (3) turns into equality, $|Cov(X,Y)|=\sigma (X)\sigma (Y)$, then $Y$ is a linear function of $X$: $Y = aX + b$, with some constants $a,b$.
Proof. (Part 1) If at least one of the variables is constant, both sides of the inequality are $0$ and there is nothing to prove. To exclude the trivial case, let $X,Y$ be non-constant and, therefore, $Var(X),\ Var(Y)$ are positive. Consider a real-valued function of a real number $t$ defined by $f(t) = Var(tX + Y)$. Here we have variance of a linear combination

$f(t)=t^2Var(X)+2tCov(X,Y)+Var(Y)$.

We see that $f(t)$ is a parabola with branches looking upward (because the leading coefficient $Var(X)$ is positive). By nonnegativity of variance, $f(t)\ge 0$ and the parabola lies above the horizontal axis in the $(t,f)$ plane. Hence, the quadratic equation $f(t) = 0$ may have at most one real root. This means that the discriminant of the equation is non-positive:

$D=Cov(X,Y)^2-Var(X)Var(Y)\le 0.$

Applying square roots to both sides of $Cov(X,Y)^2\le Var(X)Var(Y)$ we finish the proof of the first part.

(Part 2) In case of the equality sign the discriminant is $0$. Therefore the parabola touches the horizontal axis where $f(t)=Var(tX + Y)=0$. But we know that this implies $tX + Y = constant$ which is just another way of writing $Y = aX + b$.

Comment. (3) explains one of the main properties of the correlation:

$-1\le\rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}\le 1$.
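A numeric sanity check of the Cauchy-Schwarz inequality and its equality case (my own sketch on simulated data, not from the text):

```python
# Check |Cov(X,Y)| <= sigma(X)*sigma(Y), with equality when Y is linear in X.
import random

random.seed(2)

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

X = [random.gauss(0, 1) for _ in range(100)]
Y = [random.gauss(0, 1) for _ in range(100)]

# Part 1: the inequality holds for arbitrary data
assert abs(cov(X, Y)) <= (cov(X, X) * cov(Y, Y)) ** 0.5 + 1e-12

# Part 2: equality when Y = aX + b (here a = 3, b = -1)
Y_lin = [3 * x - 1 for x in X]
lhs = abs(cov(X, Y_lin))
rhs = (cov(X, X) * cov(Y_lin, Y_lin)) ** 0.5
assert abs(lhs - rhs) < 1e-9
```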

8
Nov 16

The pearls of AP Statistics 35

The disturbance term: To hide or not to hide? In an introductory Stats course, some part of the theory should be hidden. Where to draw the line is an interesting question. Here I discuss the ideas that look definitely bad to me.

How disturbing is the disturbance term?

In the main text, Agresti and Franklin never mention the disturbance term $u_i$ in the regression model

(1) $y_i=a+bx_i+u_i$

(it is hidden in Exercise 12.105). Instead, they write the equation for the mean $\mu_y=a+bx$ that follows from (1) under the standard assumption $Eu_i=0$. This would be fine if the exposition stopped right there. However, one has to explain the random source of variability in $y_i$. On p. 583 the authors say: "The probability distribution of y values at a fixed value of x is a conditional distribution. At each value of x, there is a conditional distribution of y values. A regression model also describes these distributions. An additional parameter σ describes the standard deviation of each conditional distribution."

Further, Figure 12.4 illustrates distributions of errors at different points and asks: "What do the bell-shaped curves around the line at x = 12 and at x = 16 represent?"

Figure 12.4. Illustration of error distributions

Besides, explanations of heteroscedasticity and of the residual sum of squares are impossible without explicitly referring to the disturbance term.

Attributing a regression property to the correlation is not good

On p.589 I encountered a statement that puzzled me: "An important property of the correlation is that at any particular x value, the predicted value of y is relatively closer to its mean than x is to its mean. If an x value is a certain number of standard deviations from its mean, then the predicted y is r times that many standard deviations from its mean."

Firstly, this is a verbal interpretation of some formula, so why not give the formula itself? How good must a student be to guess what is behind the verbal formulation?

Secondly, as I stressed in this post, the correlation coefficient does not entail any prediction about the magnitude of a change in one variable caused by a change in another. The above statement about the predicted value of y must be a property of regression. Attributing a regression property to the correlation is not in the best interests of those who want to study Stats at a more advanced level.

Thirdly, I felt challenged to see something new in the area I thought I knew everything about. So here is the derivation. By definition, the fitted value is

(2) $\hat{y_i}=\hat{a}+\hat{b}x_i$

where the hats stand for estimators. The fitted line passes through the point $(\bar{x},\bar{y})$:

(3) $\bar{y}=\hat{a}+\hat{b}\bar{x}$

(this will be proved elsewhere). Subtracting (3) from (2) we get

(4) $\hat{y_i}-\bar{y}=\hat{b}(x_i-\bar{x})$

(using equation (4) from this post)

$=\rho\frac{\sigma(y)}{\sigma(x)}(x_i-\bar{x}).$

It is helpful to rewrite (4) in a more symmetric form:

(5) $\frac{\hat{y_i}-\bar{y}}{\sigma(y)}=\rho\frac{x_i-\bar{x}}{\sigma(x)}.$

This is the equation we need. Suppose an x value is a certain number of standard deviations from its mean: $x_i-\bar{x}=k\sigma(x)$. Plug this into (5) to get $\hat{y_i}-\bar{y}=\rho k\sigma(y)$, that is, the predicted y is $\rho$ times that many standard deviations from its mean.
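Equation (5) can be checked numerically. In the Python sketch below (my own addition, with simulated data), the fitted value at a point $k$ standard deviations from $\bar{x}$ is verified to be $\rho k$ standard deviations from $\bar{y}$:

```python
# Check: predicted y is rho times as many standard deviations from its mean as x.
import random

random.seed(3)

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

x = [random.uniform(0, 10) for _ in range(50)]
y = [1.0 + 0.7 * xi + random.gauss(0, 2) for xi in x]

sx, sy = cov(x, x) ** 0.5, cov(y, y) ** 0.5
rho = cov(x, y) / (sx * sy)
b_hat = cov(x, y) / cov(x, x)       # OLS slope
a_hat = mean(y) - b_hat * mean(x)   # fitted line passes through (xbar, ybar)

k = 1.7                             # x value k standard deviations from its mean
xi = mean(x) + k * sx
y_fit = a_hat + b_hat * xi

# Equation (5): (y_fit - ybar)/sigma(y) = rho * (xi - xbar)/sigma(x) = rho * k
assert abs((y_fit - mean(y)) / sy - rho * k) < 1e-9
```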

3
Nov 16

Properties of covariance

Wikipedia says: The magnitude of the covariance is not easy to interpret. I add: We keep the covariance around mainly for its algebraic properties. It deserves studying because it appears in two important formulas: correlation coefficient and slope estimator in simple regression (see derivation, simplified derivation and proof of unbiasedness).

Definition. For two random variables $X,Y$ their covariance is defined by

$Cov (X,Y) = E(X - EX)(Y - EY)$

(it's the mean value of the product of the deviations of two variables from their respective means).

Properties of covariance

Property 1. Linearity. Covariance is linear in the first argument when the second argument is fixed: for any random variables $X,Y,Z$ and numbers $a,b$ one has
(1) $Cov (aX + bY,Z) = aCov(X,Z) + bCov (Y,Z).$
Proof. We start by writing out the left side of Equation (1):
$Cov(aX + bY,Z)=E[(aX + bY)-E(aX + bY)](Z-EZ)$
(using linearity of means)
$= E(aX + bY - aEX - bEY)(Z - EZ)$
(collecting similar terms)
$= E[a(X - EX) + b(Y - EY)](Z - EZ)$
(distributing $(Z - EZ)$)
$= E[a(X - EX)(Z - EZ) + b(Y - EY)(Z - EZ)]$
(using linearity of means)
$= aE(X - EX)(Z - EZ) + bE(Y - EY)(Z - EZ)$
$= aCov(X,Z) + bCov(Y,Z).$

Exercise. Covariance is also linear in the second argument when the first argument is fixed. Write out and prove this property. You can notice the importance of using parentheses and brackets.

Property 2. Shortcut for covariance: $Cov(X,Y) = EXY - (EX)(EY)$.
Proof. $Cov(X,Y)= E(X - EX)(Y - EY)$
(multiplying out)
$= E[XY - X(EY) - (EX)Y + (EX)(EY)]$
($EX,EY$ are constants; use linearity)
$=EXY-(EX)(EY)-(EX)(EY)+(EX)(EY)=EXY-(EX)(EY).$

Definition. Random variables $X,Y$ are called uncorrelated if $Cov(X,Y) = 0$.

Uncorrelatedness is close to independence, so the intuition is the same: one variable does not influence the other. You can also say that there is no linear statistical relationship between uncorrelated variables. The mathematical side is not the same: uncorrelatedness is a more general property than independence.

Property 3. Independent variables are uncorrelated: if $X,Y$ are independent, then $Cov(X,Y) = 0$.
Proof. By the shortcut for covariance and multiplicativity of means for independent variables we have $Cov(X,Y) = EXY - (EX)(EY) = 0$.

Property 4. Correlation with a constant. Any random variable is uncorrelated with any constant: $Cov(X,c) = E(X - EX)(c - Ec) = 0.$

Property 5. Symmetry. Covariance is a symmetric function of its arguments: $Cov(X,Y)=Cov(Y,X)$. This is obvious.

Property 6. Relationship between covariance and variance:

$Cov(X,X)=E(X-EX)(X-EX)=Var(X)$.
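The properties above can be verified on simulated data. Here is a Python sketch (my own addition) checking linearity, the shortcut, symmetry and the covariance-variance relationship:

```python
# Numeric checks of Properties 1, 2, 5 and 6 of covariance.
import random

random.seed(4)

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

X = [random.gauss(0, 1) for _ in range(200)]
Y = [random.gauss(0, 1) for _ in range(200)]
Z = [random.gauss(0, 1) for _ in range(200)]
a, b = 2.0, -0.5

# Property 1: linearity in the first argument
lin_lhs = cov([a * x + b * y for x, y in zip(X, Y)], Z)
lin_rhs = a * cov(X, Z) + b * cov(Y, Z)
assert abs(lin_lhs - lin_rhs) < 1e-9

# Property 2: shortcut Cov(X,Y) = EXY - (EX)(EY)
shortcut = mean([x * y for x, y in zip(X, Y)]) - mean(X) * mean(Y)
assert abs(cov(X, Y) - shortcut) < 1e-9

# Property 5: symmetry
assert abs(cov(X, Y) - cov(Y, X)) < 1e-12

# Property 6: Cov(X,X) = Var(X)
assert abs(cov(X, X) - mean([(x - mean(X)) ** 2 for x in X])) < 1e-12
```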