21 Oct 16

## The pearls of AP Statistics 33

### Correlation and regression are two separate entities

They say: The correlation summarizes the direction of the association between two quantitative variables and the strength of its linear (straight-line) trend (Agresti and Franklin, p.105). Later, at a level that is supposed to be more advanced, they repeat: The correlation, denoted by r, describes linear association between two variables (p.586).

I say: This is a common misconception about correlation; even Wikipedia repeats it. Once I was consulting specialists from the Oncology Institute in Almaty. Until then, all of them had been using correlation to study their data. When I suggested using simple regression, they asked what the difference was and how regression was better. I said: correlation is a measure of statistical relationship. When two variables are positively correlated and one of them goes up, the other also goes up (on average), but you never know by how much. Regression, on the other hand, gives a specific algebraic dependence between two variables, so that you can quantify your predictions about changes in one of them caused by changes in the other.

Because of the algebra of least squares estimation, you can conclude something about the correlation if you know the estimated slope, and vice versa, but conceptually correlation and regression are different, and there is no need to delay the study of correlation until after regression. The correlation coefficient is defined as

(1) $\rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}.$

See this post for the definition and properties of covariance. As one can see, the correlation coefficient can be studied right after the covariance and standard deviation. The slope of the regression line is a result of least squares fitting, which is a more advanced concept, and is given by

(2) $b=\frac{Cov(X,Y)}{Var(X)},$

see a simplified derivation or a full derivation. I am using the notation

(3) $Cov(X,Y)=\frac{1}{n}\sum(X_i-\bar{X})(Y_i-\bar{Y}),\ Var(X)=Cov(X,X),\ \sigma(X)=\sqrt{Var(X)}$

which arise from the corresponding population characteristics as explained in this post. Directly from (1) and (2) we see that

(4) $b=\rho(X,Y)\frac{\sigma(Y)}{\sigma(X)},\ \rho(X,Y)=b\frac{\sigma(X)}{\sigma(Y)}.$
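The step from (1) and (2) to (4) is short. Spelled out, assuming $\sigma(X)>0$ and $\sigma(Y)>0$ so that the divisions make sense, it reads:

$b=\frac{Cov(X,Y)}{Var(X)}=\frac{Cov(X,Y)}{\sigma(X)^2}=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}\cdot\frac{\sigma(Y)}{\sigma(X)}=\rho(X,Y)\frac{\sigma(Y)}{\sigma(X)}.$

Multiplying both sides by $\sigma(X)/\sigma(Y)$ gives the second equation in (4).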

Using (4), we can go from the correlation to the slope and back if we know the sigmas. In particular, since the sigmas are positive, the correlation and the slope always have the same sign. The second equation in (4) gives rise to the interpretation of the correlation as "a standardized version of the slope" (p.588). To me, this "intuition" is far-fetched.
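Equations (1) through (4) can be checked numerically. The following is a minimal sketch in plain Python; the function names `cov`, `var` and `sigma` and the data are my own choices for illustration. The definitions follow the chain in (3), with the $1/n$ convention, so each function is built from the previous one.

```python
from math import sqrt, isclose

def cov(x, y):
    """Covariance with the 1/n convention, as in (3)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

def var(x):
    """Variance defined as Cov(X, X), as in (3)."""
    return cov(x, x)

def sigma(x):
    """Standard deviation defined as the square root of the variance."""
    return sqrt(var(x))

# Made-up illustrative data with a downward trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.8, 8.1, 6.2, 3.9, 2.0]

rho = cov(x, y) / (sigma(x) * sigma(y))  # equation (1)
b = cov(x, y) / var(x)                   # equation (2)

# Both equations in (4) hold up to floating point rounding.
assert isclose(b, rho * sigma(y) / sigma(x))
assert isclose(rho, b * sigma(x) / sigma(y))

# The slope and the correlation have the same sign (negative here).
assert (b < 0) == (rho < 0)

print(f"rho = {rho:.4f}, b = {b:.4f}")
```

Note that `var` and `sigma` need no summation of their own: all the arithmetic lives in `cov`.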

Notice how economical the sequence of definitions in (3) is: each one follows from the previous one, which makes them easier to remember, and summation signs are reduced to a minimum. Under the "non-algebraic" approach, the covariance, variance and standard deviation are given separately, increasing the burden on one's memory.