Nov 16

Properties of correlation

Correlation coefficient: the last block of statistical foundation

Correlation has already been mentioned in

Statistical measures and their geometric roots

Properties of standard deviation

The pearls of AP Statistics 35

Properties of covariance

The pearls of AP Statistics 33

The hierarchy of definitions

Suppose random variables X,Y are not constant. Then their standard deviations are not zero and we can define their correlation as in Chart 1.


Chart 1. Correlation definition

Properties of correlation

Property 1. Range of the correlation coefficient: for any X,Y one has - 1 \le \rho (X,Y) \le 1.
This follows from the Cauchy-Schwarz inequality, as explained here.

Recall from this post that correlation is cosine of the angle between X-EX and Y-EY.
Property 2. Interpretation of extreme cases. (Part 1) If \rho (X,Y) = 1, then Y = aX + b with a > 0.

(Part 2) If \rho (X,Y) = - 1, then Y = aX + b with a < 0.

Proof. (Part 1) \rho (X,Y) = 1 implies
(1) Cov (X,Y) = \sigma (X)\sigma (Y)
which, in turn, implies that Y is a linear function of X: Y = aX + b (this is the second part of the Cauchy-Schwarz inequality). Further, we can establish the sign of the number a. By the properties of variance and covariance

\sigma (Y)=\sigma(aX + b)=\sigma (aX)=|a|\sigma (X).
Plugging this in Eq. (1) we get aVar(X) = |a|\sigma^2(X) and see that a is positive.

The proof of Part 2 is left as an exercise.

Property 3. Suppose we want to measure correlation between weight W and height H of people. The measurements are either in kilos and centimeters {W_k},{H_c} or in pounds and feet {W_p},{H_f}. The correlation coefficient is unit-free in the sense that it does not depend on the units used: \rho (W_k,H_c)=\rho (W_p,H_f). Mathematically speaking, correlation is homogeneous of degree 0 in both arguments.
Proof. One measurement is proportional to another, W_k=aW_p,\ H_c=bH_f with some positive constants a,b. By homogeneity
\rho (W_k,H_c)=\frac{Cov(W_k,H_c)}{\sigma(W_k)\sigma(H_c)}=\frac{Cov(aW_p,bH_f)}{\sigma(aW_p)\sigma(bH_f)}=\frac{abCov(W_p,H_f)}{ab\sigma(W_p)\sigma (H_f)}=\rho (W_p,H_f).


Nov 16

Statistical measures and their geometric roots

Variance, covariancestandard deviation and correlation: their definitions and properties are deeply rooted in the Euclidean geometry.

Here is the why: analogy with Euclidean geometry

euclidEuclid axiomatically described the space we live in. What we have known about the geometry of this space since the ancient times has never failed us. Therefore, statistical definitions based on the Euclidean geometry are sure to work.

   1. Analogy between scalar product and covariance

Geometry. See Table 2 here for operations with vectors. The scalar product of two vectors X=(X_1,...,X_n),\ Y=(Y_1,...,Y_n) is defined by

(X,Y)=\sum X_iY_i.

Statistical analog: Covariance of two random variables is defined by


Both the scalar product and covariance are linear in one argument when the other argument is fixed.

   2. Analogy between orthogonality and uncorrelatedness

Geometry. Two vectors X,Y are called orthogonal (or perpendicular) if

(1) (X,Y)=\sum X_iY_i=0.

Exercise. How do you draw on the plane the vectors X=(1,0),\ Y=(0,1)? Check that they are orthogonal.

Statistical analog: Two random variables are called uncorrelated if Cov(X,Y)=0.

   3. Measuring lengths


Figure 1. Length of a vector

Geometry: the length of a vector X=(X_1,...,X_n) is \sqrt{\sum X_i^2}, see Figure 1.

Statistical analog: the standard deviation of a random variable X is


This explains the square root in the definition of the standard deviation.

   4. Cauchy-Schwarz inequality

Geometry|(X,Y)|\le\sqrt{\sum X_i^2}\sqrt{\sum Y_i^2}.

Statistical analog|Cov(X,Y)|\le\sigma(X)\sigma(Y). See the proof here. The proof of its geometric counterpart is similar.

   5. Triangle inequality


Figure 2. Triangle inequality

Geometry\sqrt{\sum (X_i+Y_i)^2}\le\sqrt{\sum X_i^2}+\sqrt{\sum X_i^2}, see Figure 2 where the length of X+Y does not exceed the sum of lengths of X and Y.

Statistical analog: using the Cauchy-Schwarz inequality we have

\sigma(X+Y)=\sqrt{Var(X+Y)} =\sqrt{Var(X)+2Cov(X,Y)+Var(Y)} \le\sqrt{\sigma^2(X)+2\sigma(X)\sigma(Y)+\sigma^2(Y)} =\sigma(X)+\sigma(Y).

   4. The Pythagorean theorem

Geometry: In a right triangle, the squared hypotenuse is equal to the sum of the squares of the two legs. The illustration is similar to Figure 2, except that the angle between X and Y should be right.

Proof. Taking two orthogonal vectors X,Y as legs, we have

Squared hypotenuse = \sum(X_i+Y_i)^2

(squaring out and using orthogonality (1))

=\sum X_i^2+2\sum X_iY_i+\sum Y_i^2=\sum X_i^2+\sum Y_i^2 = Sum of squared legs

Statistical analog: If two random variables are uncorrelated, then variance of their sum is a sum of variances Var(X+Y)=Var(X)+Var(Y).

   5. The most important analogy: measuring angles

Geometry: the cosine of the angle between two vectors X,Y is defined by

Cosine between X,Y = \frac{\sum X_iY_i}{\sqrt{\sum X_i^2\sum Y_i^2}}.

Statistical analog: the correlation coefficient between two random variables is defined by


This intuitively explains why the correlation coefficient takes values between -1 and +1.

Remark. My colleague Alisher Aldashev noticed that the correlation coefficient is the cosine of the angle between the deviations X-EX and Y-EY and not between X,Y themselves.

Nov 16

Properties of standard deviation

Properties of standard deviation are divided in two parts. The definitions and consequences are given here. Both variance and standard deviation are used to measure variability of values of a random variable around its mean. Then why use both of them? The why will be explained in another post.

Properties of standard deviation: definitions and consequences

Definition. For a random variable X, the quantity \sigma (X) = \sqrt {Var(X)} is called its standard deviation.

    Digression about square roots and absolute values

In general, there are two square roots of a positive number, one positive and the other negative. The positive one is called an arithmetic square root. The arithmetic root is applied here to Var(X) \ge 0 (see properties of variance), so standard deviation is always nonnegative.
Definition. An absolute value of a real number a is defined by
(1) |a| =a if a is nonnegative and |a| =-a if a is negative.
This two-part definition is a stumbling block for many students, so making them plug in a few numbers is a must. It is introduced to measure the distance from point a to the origin. For example, dist(3,0) = |3| = 3 and dist(-3,0) = |-3| = 3. More generally, for any points a,b on the real line the distance between them is given by dist(a,b) = |a - b|.

By squaring both sides in Eq. (1) we obtain |a|^2={a^2}. Application of the arithmetic square root gives

(2) |a|=\sqrt {a^2}.

This is the equation we need right now.

Back to standard deviation

Property 1. Standard deviation is homogeneous of degree 1. Indeed, using homogeneity of variance and equation (2), we have

\sigma (aX) =\sqrt{Var(aX)}=\sqrt{{a^2}Var(X)}=|a|\sigma(X).

Unlike homogeneity of expected values, here we have an absolute value of the scaling coefficient a.

Property 2. Cauchy-Schwarz inequality. (Part 1) For any random variables X,Y one has

(3) |Cov(X,Y)|\le\sigma(X)\sigma(Y).

(Part 2) If the inequality sign in (3) turns into equality, |Cov(X,Y)|=\sigma (X)\sigma (Y), then Y is a linear function of X: Y = aX + b, with some constants a,b.
Proof. (Part 1) If at least one of the variables is constant, both sides of the inequality are 0 and there is nothing to prove. To exclude the trivial case, let X,Y be non-constant and, therefore, Var(X),\ Var(Y) are positive. Consider a real-valued function of a real number t defined by f(t) = Var(tX + Y). Here we have variance of a linear combination


We see that f(t) is a parabola with branches looking upward (because the senior coefficient Var(X) is positive). By nonnegativity of variance, f(t)\ge 0 and the parabola lies above the horizontal axis in the (f,t) plane. Hence, the quadratic equation f(t) = 0 may have at most one real root. This means that the discriminant of the equation is non-positive:

D=Cov(X,Y)^2-Var(X)Var(Y)\le 0.

Applying square roots to both sides of Cov(X,Y)^2\le Var(X)Var(Y) we finish the proof of the first part.

(Part 2) In case of the equality sign the discriminant is 0. Therefore the parabola touches the horizontal axis where f(t)=Var(tX + Y)=0. But we know that this implies tX + Y = constant which is just another way of writing Y = aX + b.

Comment. (3) explains one of the main properties of the correlation:

-1\le\rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}\le 1.




Nov 16

Properties of covariance

Wikipedia says: The magnitude of the covariance is not easy to interpret. I add: We keep the covariance around mainly for its algebraic properties. It deserves studying because it appears in two important formulas: correlation coefficient and slope estimator in simple regression (see derivation, simplified derivation and proof of unbiasedness).

Definition. For two random variables X,Y their covariance is defined by

Cov (X,Y) = E(X - EX)(Y - EY)

(it's the mean value of the product of the deviations of two variables from their respective means).

Properties of covariance

Property 1. Linearity. Covariance is linear in the first argument when the second argument is fixed: for any random variables X,Y,Z and numbers a,b one has
(1) Cov (aX + bY,Z) = aCov(X,Z) + bCov (Y,Z).
Proof. We start by writing out the left side of Equation (1):
Cov(aX + bY,Z)=E[(aX + bY)-E(aX + bY)](Z-EZ)
(using linearity of means)
= E(aX + bY - aEX - bEY)(Z - EZ)
(collecting similar terms)
= E[a(X - EX) + b(Y - EY)](Z - EZ)
(distributing (Z - EZ))
= E[a(X - EX)(Z - EZ) + b(Y - EY)(Z - EZ)]
(using linearity of means)
= aE(X - EX)(Z - EZ) + bE(Y - EY)(Z - EZ)
= aCov(X,Z) + bCov(Y,Z).

Exercise. Covariance is also linear in the second argument when the first argument is fixed. Write out and prove this property. You can notice the importance of using parentheses and brackets.

Property 2. Shortcut for covariance: Cov(X,Y) = EXY - (EX)(EY).
ProofCov(X,Y)= E(X - EX)(Y - EY)
(multiplying out)
= E[XY - X(EY) - (EX)Y + (EX)(EY)]
(EX,EY are constants; use linearity)

Definition. Random variables X,Y are called uncorrelated if Cov(X,Y) = 0.

Uncorrelatedness is close to independence, so the intuition is the same: one variable does not influence the other. You can also say that there is no statistical relationship between uncorrelated variables. The mathematical side is not the same: uncorrelatedness is a more general property than independence.

Property 3. Independent variables are uncorrelated: if X,Y are independent, then Cov(X,Y) = 0.
Proof. By the shortcut for covariance and multiplicativity of means for independent variables we have Cov(X,Y) = EXY - (EX)(EY) = 0.

Property 4. Correlation with a constant. Any random variable is uncorrelated with any constant: Cov(X,c) = E(X - EX)(c - Ec) = 0.

Property 5. Symmetry. Covariance is a symmetric function of its arguments: Cov(X,Y)=Cov(Y,X). This is obvious.

Property 6. Relationship between covariance and variance:


Oct 16

Properties of variance

All properties of variance in one place

Certainty is the mother of quiet and repose, and uncertainty the cause of variance and contentions. Edward Coke

Preliminaries: study properties of means with proofs.

Definition. Yes, uncertainty leads to variance, and we measure it by Var(X)=E(X-EX)^2. It is useful to use the name deviation from mean for X-EX and realize that E(X-EX)=0, so that the mean of the deviation from mean cannot serve as a measure of variation of X around EX.

Property 1. Variance of a linear combination. For any random variables X,Y and numbers a,b one has
(1) Var(aX + bY)=a^2Var(X)+2abCov(X,Y)+b^2Var(Y).
The term 2abCov(X,Y) in (1) is called an interaction term. See this post for the definition and properties of covariance.
Var(aX + bY)=E[aX + bY -E(aX + bY)]^2

(using linearity of means)
=E(aX + bY-aEX -bEY)^2

(grouping by variable)

(squaring out)

(using linearity of means and definitions of variance and covariance)
=a^2Var(X) + 2abCov(X,Y) +b^2Var(Y).
Property 2. Variance of a sum. Letting in (1) a=b=1 we obtain
Var(X + Y) = Var(X) + 2Cov(X,Y)+Var(Y).

Property 3. Homogeneity of degree 2. Choose b=0 in (1) to get
Exercise. What do you think is larger: Var(X+Y) or Var(X-Y)?
Property 4. If we add a constant to a variable, its variance does not change: Var(X+c)=E[X+c-E(X+c)]^2=E(X+c-EX-c)^2=E(X-EX)^2=Var(X)
Property 5. Variance of a constant is zero: Var(c)=E(c-Ec)^2=0.

Property 6. Nonnegativity. Since the squared deviation from mean (X-EX)^2 is nonnegative, its expectation is nonnegativeE(X-EX)^2\ge 0.

Property 7. Only a constant can have variance equal to zero: If Var(X)=0, then E(X-EX)^2 =(x_1-EX)^2p_1 +...+(x_n-EX)^2p_n=0, see the definition of the expected value. Since all probabilities are positive, we conclude that x_i=EX for all i, which means that X is identically constant.

Property 8. Shortcut for variance. We have an identity E(X-EX)^2=EX^2-(EX)^2. Indeed, squaring out gives

E(X-EX)^2 =E(X^2-2XEX+(EX)^2)

(distributing expectation)


(expectation of a constant is constant)


All of the above properties apply to any random variables. The next one is an exception in the sense that it applies only to uncorrelated variables.

Property 9. If variables are uncorrelated, that is Cov(X,Y)=0, then from (1) we have Var(aX + bY)=a^2Var(X)+b^2Var(Y). In particular, letting a=b=1, we get additivityVar(X+Y)=Var(X)+Var(Y). Recall that the expected value is always additive.

GeneralizationsVar(\sum a_iX_i)=\sum a_i^2Var(X_i) and Var(\sum X_i)=\sum Var(X_i) if all X_i are uncorrelated.

Among my posts, where properties of variance are used, I counted 12 so far.

Sep 16

All you need to know about the law of large numbers

All about the law of large numbers: properties and applications

Level 1: estimation of population parameters

The law of large numbers is a statement about convergence which is called convergence in probability and denoted \text{plim}. The precise definition is rather complex but the intuition is simple: it is convergence to a spike at the parameter being estimated. Usually, any unbiasedness statement has its analog in terms of the corresponding law of large numbers.

Example 1. The sample mean unbiasedly estimates the population mean: E\bar{X}=EX. Its analog: the sample mean converges to a spike at the population mean: \text{plim}\bar{X}=EX. See the proof based on the Chebyshev inequality.

Example 2. The sample variance unbiasedly estimates the population variance: E\overline{s^2}=Var(X) where s^2=\frac{\sum(X_i-\bar{X})^2}{n-1}. Its analog: the sample variance converges to a spike at the population variance:

(1) \text{plim}\overline{s^2}=Var(X).

Example 3. The sample covariance s_{X,Y}=\frac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{n-1} unbiasedly estimates the population covariance: E\overline{s_{X,Y}}=Cov(X,Y). Its analog: the sample covariance converges to a spike at the population covariance:

(2) \text{plim}\overline{s_{X,Y}}=Cov(X,Y).

Up one level: convergence in probability is just convenient

Using or not convergence in probability is a matter of expedience. For usual limits of sequences we know the properties which I call preservation of arithmetic operations:

\lim(a_n\pm b_n)=\lim a_n\pm \lim b_n, \lim(a_n\times b_n)=\lim a_n\times\lim b_n, \lim(a_n/ b_n)=\lim a_n/\lim b_n.

Convergence in probability has exact same properties, just replace \lim with \text{plim}.

Next level: making regression estimation more plausible

Using convergence in probability allows us to handle stochastic regressors and avoid the unrealistic assumption that regressors are deterministic.

Convergence in probability and in distribution are two types of convergence of random variables that are widely used in the Econometrics course of the University of London.

Sep 16

Proving unbiasedness of OLS estimators

Proving unbiasedness of OLS estimators - the do's and don'ts


Here we derived the OLS estimators. To distinguish between sample and population means, the variance and covariance in the slope estimator will be provided with the subscript u (for "uniform", see the rationale here).

(1) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(2) \hat{a}=\bar{y}-\hat{b}\bar{x}.

These equations are used in conjunction with the model

(3) y_i=a+bx_i+e_i

where we remember that

(4) Ee_i=0 for all i.

Since (2) depends on (1), we have to start with unbiasedness of the slope estimator.

Using the right representation is critical

We have to show that E\hat{b}=b.

Step 1. Don't apply the expectation directly to (1). Do separate in (1) what is supposed to be E\hat{b}. To reveal the role of errors in (1), plug (3) in (1) and use linearity of covariance with respect to each argument when the other argument is fixed:


Here Cov_u(x,a)=0 (a constant is uncorrelated with any variable), Cov_u(x,x)=Var_u(x) (covariance of x with itself is its variance), so

(5) \hat{b}=\frac{bVar_u(x)+Cov_u(x,e)}{Var_u(x)}=b+\frac{Cov_u(x,e)}{Var_u(x)}.

Equation (5) is the mean-plus-deviation-from-the-mean decomposition. Many students think that Cov_u(x,e)=0 because of (4). No! The covariance here does not involve the population mean.

Step 2. It pays to make one more step to develop (5). Write out the numerator in (5) using summation:


Don't write out Var_u(x)! Presence of two summations confuses many students.

Multiplying parentheses and using the fact that \sum(x_i-\bar{x})=n\bar{x}-n\bar{x}=0 we have

\hat{b}=b+\frac{1}{n}[\sum(x_i-\bar{x})e_i-\bar{e}\sum(x_i-\bar{x})]/Var_u(x) =b+\frac{1}{n}\sum\frac{(x_i-\bar{x})}{Var_u(x)}e_i.

To simplify calculations, denote a_i=(x_i-\bar{x})/Var_u(x). Then the slope estimator becomes

(6) \hat{b}=b+\frac{1}{n}\sum a_ie_i.

This is the critical representation.

Unbiasedness of the slope estimator

Convenience conditionThe regressor x is deterministic. I call it a convenience condition because it's just a matter of mathematical expedience, and later on we'll study ways to bypass it.

From (6), linearity of means and remembering that the deterministic coefficients a_i behave like constants,

(7) E\hat{b}=E[b+\frac{1}{n}\sum a_ie_i]=b+\frac{1}{n}\sum a_iEe_i=b

by (4). This proves unbiasedness.

You don't know the difference between the population and sample means until you see them working in the same formula.

Unbiasedness of the intercept estimator

As above we plug (3) in (2): \hat{a}=\overline{a+bx+e}-\hat{b}\bar{x}=a+b\bar{x}+\bar{e}-\hat{b}\bar{x}. Applying expectation:



Since in (1)  there is division by Var_u(x), the condition Var_u(x)\ne 0 is the main condition for existence of OLS estimators. From the above proof we see that (4) is the main condition for unbiasedness.

Aug 16

The pearls of AP Statistics 14

Reasons to increase Math content in AP Statistics course

The definition of the standard deviation, to those who see it for the first time, looks complex and scary. Agresti and Franklin on p.57 have done an excellent job explaining it. They do it step by step: introduce deviations, the sum of squares, variance and give the formula in the end. The names introduced here will be useful later, in other contexts. Being a rotten theorist, I don't like the "small technical point" on p. 57 (the true reason why there is division by n-1 and not by n is unbiasedness: Es^2=\sigma^2) but this is a minor point.

AP Stats teachers cannot discuss advanced facts because many students are not good in algebra. However, there are good methodological reasons to increase Math content of an AP Stats course. When students saw algebra for the first time, their cognitive skills may have been underdeveloped, which may have prevented them from leaping from numbers to algebraic notation. On the other hand, by the time they take AP Stats they mature. Their logic, power of observation, motivation etc. are better. The crucial fact is that in Statistics numbers meet algebra again, and this can be usefully employed.

Ask your students two questions. 1) You have observations on two stocks, X and Y. How is the sample mean of their sum related to their individual sample means? 2) You have s shares of stock X (s is a number, X is a random variable). How is the sample mean of your portfolio sX related to the sample mean of X? This smells money and motivates well.

The first answer, \overline{X+Y}=\bar{X}+\bar{Y}, tells us that if we know the individual means, we can avoid calculating \overline{X+Y} by simply adding two numbers. Similarly, the second formula, \overline{sX}=s\bar{X}, simplifies calculation of \overline{sX}. Methodologically, this is an excellent opportunity to dive into theory. Firstly, there is good motivation. Secondly, it's easy to see the link between numbers and algebra (see tabular representations of random variables in Chapters 4 and 5 of my book (you are welcome to download the free version). Thirdly, even though this is theory, many things here are done by analogy, which students love. Fourthly, this topic paves the road to properties of the variance and covariance (recall that the slope in simple regression is covariance over variance).

Agresti and Franklin don't have any theoretical properties of the mean, so without them the definition of the mean is kind of left hanging in the air. FYI: properties of the mean, variance, covariance and standard deviation are omnipresent in theory. The mode, median, range and IQR are not used at all because they have bad theoretical properties.

May 16

What is a stationary process?

What is a stationary process? More precisely, this post is about a discrete weakly stationary process. This topic is not exactly a beginners Stats, I am posting this to help those who study Econometrics using Introduction to Econometrics, by Christopher Dougherty, published by Oxford University Press, UK, in 2016.

Point of view. At discrete moments in time t, we observe some random variables X_tX_t can be, for example, periodical temperature measurements in a certain location. You can imagine a straight line, with moments t labeled on it, and for each t, some variable X_t attached to it. In general, X_t may have different distributions and in theory time moments may extend infinitely to the left and right.

Definition. We say that the collection \{X_t\} is (weakly) stationary if it satisfies three conditions:

  1. The means EX_t are constant (that is, do not depend on t),
  2. The variances Var(X_t) are also constant (same thing, they do not depend on t), and
  3. The covariances Cov(X_t,X_s)=f(|t-s|) depend only on the distance in time between two moments t,s.

Regarding the last condition, recall the visualization of the process, with random variables sticking out of points in time, and the fact that the distance between two moments t,s is given by the absolute value |t-s|. The condition Cov(X_t,X_s)=f(|t-s|) says that the covariance between X_t,X_s is some (unspecified) function of this distance. It should not depend on any of the moments t,s themselves.

If you want a complex definition to stay in your memory, you have to chew and digest it. The best thing to do is to prove a couple of properties.

Main property. A sum of two independent stationary processes is also stationary.

Proof. The assumption is that each variable in the collection \{X_t\} is independent of each variable in the collection \{Y_t\}. We need to check that \{X_t+Y_t\} satisfies the definition of a stationary process.

Obviously, E(X_t+Y_t)=EX_t+EY_t is constant.

Similarly, by independence we have Var(X_t+Y_t)=Var(X_t)+Var(Y_t), so variance of the sum is constant.

Finally, using properties of covariance,


(two terms disappear by independence)


(each covariance depends only on |t-s|, so their sum depends only on |t-s|).

Conclusion. You certainly know that 0+0=0. The above property is similar to this:

stationary process + stationary process = stationary process

(under independence). Now you can understand the role of stationary processes in the set of all processes: they play the role of zero. That is to say, the process \{X_t\} is not very different from the process \{Y_t\} if their difference is stationary.

Generalization. Any linear combination of independent stationary processes is stationary.

Feb 16

Summation sign rules: identities for simple regression

Summation sign rules: identities for simple regression

There are many sources on the Internet. This and this are relatively simple, while this one is pretty advanced. They cover the basics. My purpose is more specific: to show how to obtain a couple of identities in terms of summation signs from general properties of variance and covariance.

Shortcut for covariance. This is a name of the following identity

(1) E(X-EX)(Y-EY)=E(XY)-(EX)(EY)

where on the left we have the definition of Cov(X,Y) and on the right we have an alternative expression (a shortcut) for the same thing. Letting X=Y in (1) we get a shortcut for variance:

(2) E(X-EX)^2=E(X^2)-(EX)^2,

see the direct proof here. Again, on the left we have the definition of Var(X) and on the right a shortcut for the same.

In this post I mentioned that

for a discrete uniformly distributed variable with a finite number of elements, the population mean equals the sample mean if the sample is the whole population.

This is what it means. The most useful definition of a discrete random variable is this: it is a table values+probabilities of type

Table 1. Discrete random variable with n values 
Values X_1 ... X_n
Probabilities p_1 ... p_n

Here X_1,...,X_n are the values and p_1,...,p_n are the probabilities (they sum to one). With this table, it is easy to define the mean of X:

(3) EX=\sum_{i=1}^nX_ip_i.

A variable like this is called uniformly distributed if all probabilities are the same:

Table 2. Uniformly distributed discrete random variable with n values
Values X_1 ... X_n
Probabilities 1/n ... 1/n

In this case (3) becomes

(4) EX=\bar{X}.

This explains the statement from my post. Using (4), equations (1) and (2) rewrite as

(5) \overline{(X-\bar{X})(Y-\bar{Y})}=\overline{XY}-\bar{X}\bar{Y},\ \overline{(X-\bar{X})^2}=\overline{X^2}-(\bar{X})^2.

Try to write this using summation signs. For example, the first identity in (5) becomes

\frac{1}{n}\sum_{i=1}^n\big(X_i-\frac{1}{n}\sum_{i=1}^nX_i\big)\big(Y_i-\frac{1}{n}\sum_{i=1}^nY_i\big) =\frac{1}{n}\sum_{i=1}^nX_iY_i-\big(\frac{1}{n}\sum_{i=1}^nX_i\big)\big(\frac{1}{n}\sum_{i=1}^nY_i\big).

This is crazy and trying to prove this directly would be even crazier.

Remark. Let X_1,...,X_n be a sample from an arbitrary distribution. Regardless of the parent distribution, the artificial uniform distribution from Table 2 can still be applied to the sample. To avoid confusion with the expected value E with respect to the parent distribution, instead of (4) we can write

(6) E_uX=\bar{X}

where the subscript u stands for "uniform". With that understanding, equations (5) are still true. The power of this approach is that all expressions in (5) are random variables which allows for further application of the expected value E with respect to the parent distribution.