30 Oct 16

## The pearls of AP Statistics 34

### Coefficient of determination: an inductive introduction to R squared

I know a person who did not understand this topic even though he had a PhD in Math. That was me more than twenty years ago, and the reason was that the topic was presented formally, without explaining the leading idea.

Step 1. We want to describe the relationship between observed y's and x's using the simple regression

$y_i=a+bx_i+u_i$.

Let us start with the simple case when there is no variability in y's, that is, the slope and the errors are zero. Since $y_i=a$ for all $i$, we have $y_i=\bar{y}$ and, of course,

(1) $\sum(y_i-\bar{y})^2=0.$

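In the general case the y's do vary, and the sum in (1) is no longer zero. We start with the decomposition

(2) $y_i=\hat{y}_i+e_i$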

where $\hat{y}_i$ is the fitted value and $e_i$ is the residual, see this post. We still want to see how far $y_i$ is from $\bar{y}.$ For this purpose, we subtract $\bar{y}$ from both sides of equation (2), obtaining $y_i-\bar{y}=(\hat{y}_i-\bar{y})+e_i.$ Squaring this equation and summing over $i$, for the sum in (1) we get

(3) $\sum(y_i-\bar{y})^2=\sum(\hat{y}_i-\bar{y})^2+2\sum(\hat{y}_i-\bar{y})e_i+\sum e^2_i.$

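Whoever was the first to do this discovered that the cross product is zero (for least squares with an intercept, the residuals sum to zero and are orthogonal to the fitted values), so (3) simplifies to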

(4) $\sum(y_i-\bar{y})^2=\sum(\hat{y}_i-\bar{y})^2+\sum e^2_i.$
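Identity (4) is easy to check numerically. The sketch below is only an illustration: the data are made up, and the helper name `ols_fit` is hypothetical, not something from this post. It fits the simple regression by least squares and confirms that the cross product in (3) vanishes up to rounding error.

```python
# Hypothetical toy data (not from the post); any sample with varying x's works.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def ols_fit(xs, ys):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


a, b = ols_fit(xs, ys)
y_bar = sum(ys) / len(ys)
fitted = [a + b * x for x in xs]             # fitted values from (2)
resid = [y - f for y, f in zip(ys, fitted)]  # residuals e_i from (2)

# The three sums in (3) plus the cross product.
total = sum((y - y_bar) ** 2 for y in ys)
explained = sum((f - y_bar) ** 2 for f in fitted)
unexplained = sum(e ** 2 for e in resid)
cross = sum((f - y_bar) * e for f, e in zip(fitted, resid))

# Compare with a tolerance because of floating-point rounding.
print(abs(cross) < 1e-9)                            # cross product is zero
print(abs(total - (explained + unexplained)) < 1e-9)  # identity (4) holds
```

The check relies on the fit being least squares with an intercept; for an arbitrary line through the data the cross product need not vanish.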

### The rest is a matter of definitions

Total Sum of Squares $TSS=\sum(y_i-\bar{y})^2$ (I prefer to call this the total variation around $\bar{y}$)

Explained Sum of Squares $ESS=\sum(\hat{y}_i-\bar{y})^2$ (to me this is the explained variation around $\bar{y}$)

Residual Sum of Squares $RSS=\sum e^2_i$ (unexplained variation around $\bar{y}$, caused by the error term)

Thus from (4) we have

(5) $TSS=ESS+RSS$ and $1=ESS/TSS+RSS/TSS.$

Step 2. It is desirable to have RSS close to zero and ESS close to TSS. Therefore we can use the ratio ESS/TSS as a measure of how well the regression describes the relationship between y's and x's. Both ratios in (5) are nonnegative and add up to 1, so this ratio takes values between zero and 1. Hence, the coefficient of determination

$R^2=ESS/TSS$

can be interpreted as the proportion (or, multiplied by 100, the percentage) of total variation of y's around $\bar{y}$ explained by the regression. From (5) an equivalent definition is

$R^2=1-RSS/TSS.$
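The two definitions can be compared side by side. This is a minimal sketch under stated assumptions: the function name `r_squared` and the sample data are hypothetical, and the regression is the simple least-squares fit with an intercept. When the y's do not vary, TSS is zero and the ratio is undefined, so the function refuses that case.

```python
def r_squared(xs, ys):
    """Return (ESS/TSS, 1 - RSS/TSS) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]

    tss = sum((y - y_bar) ** 2 for y in ys)
    if tss == 0:
        # No variability in the y's: the ratio ESS/TSS is undefined.
        raise ValueError("TSS is zero, R squared is undefined")
    ess = sum((f - y_bar) ** 2 for f in fitted)
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return ess / tss, 1 - rss / tss


# Hypothetical noisy data: both definitions agree up to rounding.
r2_from_ess, r2_from_rss = r_squared([1.0, 2.0, 3.0, 4.0, 5.0],
                                     [2.1, 3.9, 6.2, 7.8, 10.1])
print(abs(r2_from_ess - r2_from_rss) < 1e-9)

# Points lying exactly on a line: RSS = 0, so the ratio reaches its upper bound.
perfect = r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(abs(perfect[0] - 1.0) < 1e-9)
```

Returning both values is a deliberate choice here: they coincide only because of (5), so a noticeable gap between them would signal that the fit is not least squares with an intercept.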

### Back to the pearls of AP Statistics

How much of the above can be explained without algebra? Stats without algebra is a crippled creature. I am afraid any concepts requiring substantial algebra should be dropped from the AP Stats curriculum. Compare this post with the explanation on p. 592 of Agresti and Franklin.