Aug 17

Violations of classical assumptions 1

Violations of classical assumptions

This is a large topic which requires several posts or several book chapters. During a conference in Sweden in 2010, a Swedish statistician asked me: "What is Econometrics, anyway? What tools does it use?" I said: "Among others, it uses linear regression." He said: "But linear regression is a general statistical tool, why do they say it's a part of Econometrics?" My answer was: "Yes, it's a general tool but the name Econometrics emphasizes that the motivation for its applications lies in Economics".

Both classical assumptions and their violations should be studied with this point in mind: What is the Economics and Math behind each assumption?

Violations of the first three assumptions

We consider the simple regression

(1) y_i=a+bx_i+e_i

Make sure to review the assumptions. Their numbering and names sometimes are different from what Dougherty's book has. In particular, most of the time I omit the following assumption:

A6. The model is linear in parameters and correctly specified.

When it is not linear in parameters, you can think of nonlinear alternatives. Instead of saying "correctly specified" I say "true model" when a "wrong model" is available.

A1. What if the existence condition is violated? If variance of the regressor is zero, the OLS estimator does not exist. The fitted line is supposed to be vertical, and you can regress x on y. Violation of the existence condition in case of multiple regression leads to multicollinearity, and that's where economic considerations are important.

A2. The convenience condition is called so because when it is violated, that is, the regressor is stochastic, there are ways to deal with this problem:  finite-sample theory and large-sample theory.

A3. What if the errors in (1) have means different from zero? This question can be divided in two: 1) the means of the errors are the same: Ee_i=c\ne 0 for all i and 2) the means are different. Read the post about centering and see if you can come up with the answer for the first question. The means may be different because of omission of a relevant variable (can you do the math?). In the absence of data on such a variable, there is nothing you can do.

Violations of A4 and A5 will be treated later.

Jan 17

The law of large numbers proved

The law of large numbers overview

I already have several posts about the law of large numbers:

  1. start with the intuition, which is illustrated using Excel;
  2. simulations in Excel show that convergence is not as fast as some textbooks claim;
  3. to distinguish the law of large numbers from the central limit theorem read this;
  4. the ultimate purpose is the application to simple regression with a stochastic regressor.

Here we busy ourselves with the proof.

Measuring deviation of a random variable from a constant

Let X be a random variable and c some constant. We want a measure of X differing from the constant by a given number \varepsilon or more. The set where X differs from c by \varepsilon>0 or more is the outside of the segment [c-\varepsilon,c+\varepsilon], that is, \{|X-c|\ge\varepsilon\}=\{X\le c-\varepsilon\}\cup\{X\ge c+\varepsilon\}.

Figure 1. Measuring the outside of interval

Now suppose X has a density p(t). It is natural to measure the set \{|X-c|\ge\varepsilon\} by the probability P(|X-c|\ge\varepsilon). This is illustrated in Figure 1.

Convergence to a spike formalized

Figure 2. Convergence to a spike

Once again, check out the idea. Consider a sequence of random variables \{T_n\} and a parameter \tau. Fix some \varepsilon>0 and consider a corridor [\tau-\varepsilon,\tau+\varepsilon] of width 2\varepsilon around \tau. For \{T_n\} to converge to a spike at \tau we want the area P(|T_n-\tau|\ge\varepsilon) to go to zero as we move along the sequence to infinity. This is illustrated in Figure 2, where, say, T_1 has a flat density and the density of T_{1000} is chisel-shaped. In the latter case the area P(|T_n-\tau|\ge\varepsilon) is much smaller than in the former. The math of this phenomenon is such that P(|T_n-\tau|\ge\varepsilon) should go to zero for any \varepsilon>0 (the narrower the corridor, the further to infinity we should move along the sequence).

Definition. Let \tau be some parameter and let \{T_n\} be a sequence of its estimators. We say that \{T_n\} converges to \tau in probability or, alternatively, \{T_n\} consistently estimates \tau if P(|T_n-\tau|\ge\varepsilon)\rightarrow 0 as n\rightarrow\infty for any \varepsilon>0.

The law of large numbers in its simplest form

Let \{X_n\} be an i.i.d. sample from a population with mean \mu and variance \sigma^2. This is the situation from the standard Stats course. We need two facts about the sample mean \bar{X}: it is unbiased,

(1) E\bar{X}=\mu,

and its variance tends to zero

(2) Var(\bar{X})=\sigma^2/n\rightarrow 0 as n\rightarrow\infty.


P(|\bar{X}-\mu|\ge \varepsilon) (by (1))

=P(|\bar{X}-E\bar{X}|\ge \varepsilon) (by the Chebyshev inequality, see Extension 3)

\le\frac{1}{\varepsilon^2}Var(\bar{X}) (by (2))

=\frac{\sigma^2}{n\varepsilon^2}\rightarrow 0 as n\rightarrow\infty.

Since this is true for any \varepsilon>0, the sample mean is a consistent estimator of the population mean. This proves Example 1.
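The chain of inequalities can be checked numerically. The following is only a sketch under assumptions that are not in the post: the sample is normal, \varepsilon=0.5, and the function names and the number of replications are illustrative. It estimates P(|\bar{X}-\mu|\ge\varepsilon) by simulation and compares it with the bound \sigma^2/(n\varepsilon^2).

```python
import numpy as np

def tail_probability(n, eps, mu=0.0, sigma=1.0, reps=20000, seed=0):
    """Monte Carlo estimate of P(|X_bar - mu| >= eps) for an i.i.d. normal sample of size n."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(mu, sigma, size=(reps, n))
    x_bar = samples.mean(axis=1)          # one sample mean per replication
    return np.mean(np.abs(x_bar - mu) >= eps)

def chebyshev_bound(n, eps, sigma=1.0):
    """Upper bound sigma^2 / (n * eps^2) from the last line of the proof."""
    return sigma**2 / (n * eps**2)

eps = 0.5
for n in (10, 40, 160):
    # The estimated probability stays below the bound, and both shrink as n grows.
    print(n, tail_probability(n, eps), chebyshev_bound(n, eps))
```

The bound is crude (the simulated probability is usually far below it), but for the proof it only matters that the bound goes to zero.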

Final remarks

The above proof applies in the following more general situation.

Theorem. Let \tau be some parameter and let \{T_n\} be a sequence of its estimators such that: a) ET_n=\tau for any n and b) Var(T_n)\rightarrow 0. Then \{T_n\} converges in probability to \tau.

This statement is often used on the Econometrics exams of the University of London.

In the unbiasedness definition the sample size is fixed. In the consistency definition it tends to infinity. The above theorem says that unbiasedness for all n plus Var(T_n)\rightarrow 0 are sufficient for consistency.

Jan 17

OLS estimator variance

Assumptions about simple regression

We consider the simple regression

(1) y_i=a+bx_i+e_i

Here we derived the OLS estimators of the intercept and slope:

(2) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(3) \hat{a}=\bar{y}-\hat{b}\bar{x}.

A1. Existence condition. Since division by zero is not allowed, for (2) to exist we require Var_u(x)\ne 0. If this condition is not satisfied, then there is no variance in x and all observed points are on the vertical line.

A2. Convenience condition. The regressor x is deterministic. This condition is imposed to be able to apply the properties of expectation, see equation (7) in this post. The time trend and dummy variables are examples of deterministic regressors. However, most real-life regressors are stochastic. Modifying the theory in order to cover stochastic regressors is the subject of two posts: finite-sample theory and large-sample theory.

A3. Unbiasedness condition. Ee_i=0. This is the main assumption that makes sure that OLS estimators are unbiased, see equation (7) in this post.

Unbiasedness is not enough

Unbiasedness characterizes the quality of an estimator, see the intuitive explanation. Unfortunately, unbiasedness is not enough to choose the best estimator because of nonuniqueness: usually, if there is one unbiased estimator of a parameter, then there are infinitely many unbiased estimators of the same parameter. For example, we know that the sample mean \bar{X} unbiasedly estimates the population mean E\bar{X}=EX. Since EX_1=EX (X_1 is the first observation), we can easily construct an infinite family of unbiased estimators Y=(\bar{X}+aX_1)/(1+a), assuming a\ne -1. Indeed, using linearity of expectation EY=(E\bar{X}+aEX_1)/(1+a)=EX.

Variance is another measure of an estimator quality: to have a lower spread of estimator values, among competing estimators we choose the one which has the lowest variance. Knowing the estimator variance allows us to find the z-score and use statistical tables.
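Here is a small simulation sketch of the family Y=(\bar{X}+aX_1)/(1+a). The normal population, the parameter values and the chosen values of a are assumptions made only for illustration. All members should have a mean close to EX, while their variances differ.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 3.0, 2.0, 25, 100000   # illustrative values, not from the post
samples = rng.normal(mu, sigma, size=(reps, n))

x_bar = samples.mean(axis=1)   # sample mean in each replication
x_1 = samples[:, 0]            # first observation in each replication

def family_member(a):
    """Estimator Y = (X_bar + a*X_1) / (1 + a), defined for a != -1."""
    return (x_bar + a * x_1) / (1 + a)

for a in (0.0, 0.5, 2.0):
    y = family_member(a)
    # Means agree with mu for every a; the variance is smallest at a = 0 (the sample mean).
    print(f"a={a}: mean={y.mean():.3f}, variance={y.var():.3f}")
```

With these values of a, the more weight is put on the single observation X_1, the larger the spread, which is why variance is used to choose among unbiased estimators.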

Slope estimator variance

It is not difficult to find the variance of the slope estimator using representation (6) derived here:

\hat{b}=b+\frac{1}{n}\sum a_ie_i

where a_i=(x_i-\bar{x})/Var_u(x).

Don't try to apply directly the definition of variance at this point, because there will be a square of a sum, which leads to a double sum upon squaring. We need two new assumptions.

A4. Uncorrelatedness of errors. Assume that Cov(e_i,e_j)=0 for all i\ne j (errors from different equations (1) are uncorrelated). Note that because of the unbiasedness condition, this assumption is equivalent to Ee_ie_j=0 for all i\ne j. This assumption is likely to be satisfied if we observe consumption patterns of unrelated individuals.

A5. Homoscedasticity. All errors have the same variance: Var(e_i)=\sigma^2 for all i. Again, because of the unbiasedness condition, this assumption is equivalent to Ee_i^2=\sigma^2 for all i.

Now we can derive the variance expression, using properties from this post:

Var(\hat{b})=Var(b+\frac{1}{n}\sum_i a_ie_i) (dropping a constant doesn't affect variance)

=Var(\frac{1}{n}\sum_i a_ie_i) (for uncorrelated variables, variance is additive)

=\sum_i Var(\frac{1}{n}a_ie_i) (variance is homogeneous of degree 2)

=\frac{1}{n^2}\sum_i a_i^2Var(e_i) (applying homoscedasticity)

=\frac{1}{n^2}\sum_i a_i^2\sigma^2 (plugging a_i)

=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x) (using the notation of sample variance)

=\frac{1}{n}Var_u(x)\sigma^2/Var^2_u(x)=\frac{\sigma^2}{nVar_u(x)}.

Note that canceling out two variances in the last line is obvious. It is not so obvious for some if instead of the short notation for variances you use summation signs. The case of the intercept variance is left as an exercise.
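After the cancellation the derivation gives Var(\hat{b})=\sigma^2/(nVar_u(x)). The sketch below checks this by simulation. The true parameters, the regressor grid and the normal errors are assumptions chosen for illustration; Var_u(x) is computed with division by n, as in the post.

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, sigma = 1.0, 2.0, 1.5             # hypothetical true parameters
x = np.linspace(0.0, 10.0, 20)          # deterministic regressor (A2)
n = x.size
var_u_x = np.mean((x - x.mean())**2)    # Var_u(x): division by n

def slope_estimate(y):
    """OLS slope as Cov_u(x, y) / Var_u(x)."""
    cov_u_xy = np.mean((x - x.mean()) * (y - y.mean()))
    return cov_u_xy / var_u_x

reps = 50000
errors = rng.normal(0.0, sigma, size=(reps, n))   # independent, mean zero, equal variances (A3-A5)
slopes = np.array([slope_estimate(a + b * x + e) for e in errors])

theoretical = sigma**2 / (n * var_u_x)
# The simulated variance of the slope estimates should be close to the formula.
print(slopes.var(), theoretical)
```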


The above assumptions A1-A5 are called classical. It is necessary to remember their role in derivations because a considerable part of Econometrics is devoted to deviations from classical assumptions. Once you have a certain assumption violated, you should expect the corresponding estimator property invalidated. For example, if Ee_i\ne 0, you should expect the estimators to be biased. If any of A4-A5 is not true, the formula we have derived

Var(\hat{b})=\frac{\sigma^2}{nVar_u(x)}

will not hold. Besides, the Gauss-Markov theorem that the OLS estimators are efficient will not hold (this will be discussed later). The pair A4-A5 can be called an efficiency condition.

Dec 16

Multiple regression through the prism of dummy variables

Agresti and Franklin on p.658 say: The indicator variable for a particular category is binary. It equals 1 if the observation falls into that category and it equals 0 otherwise. I say: For most students, this is not clear enough.

Problem statement

Figure 1. Residential power consumption in 2014 and 2015. Source: http://www.eia.gov/electricity/data.cfm

Residential power consumption in the US has a seasonal pattern. Heating in winter and cooling in summer cause the differences. We want to capture the dependence of residential power consumption PowerC on the season.

 Visual approach to dummy variables

Seasons of the year are categorical variables. We have to replace them with quantitative variables, to be able to use them in any mathematical procedure that involves arithmetic operations. To this end, we define a dummy variable (indicator) D_{win} for winter such that it equals 1 in winter and 0 in any other period of the year. The dummies D_{spr},\ D_{sum},\ D_{aut} for spring, summer and autumn are defined similarly. We provide two visualizations assuming monthly observations.

Table 1. Tabular visualization of dummies
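| Month | D_{win} | D_{spr} | D_{sum} | D_{aut} | D_{win}+D_{spr}+D_{sum}+D_{aut} |
|---|---|---|---|---|---|
| December | 1 | 0 | 0 | 0 | 1 |
| January | 1 | 0 | 0 | 0 | 1 |
| February | 1 | 0 | 0 | 0 | 1 |
| March | 0 | 1 | 0 | 0 | 1 |
| April | 0 | 1 | 0 | 0 | 1 |
| May | 0 | 1 | 0 | 0 | 1 |
| June | 0 | 0 | 1 | 0 | 1 |
| July | 0 | 0 | 1 | 0 | 1 |
| August | 0 | 0 | 1 | 0 | 1 |
| September | 0 | 0 | 0 | 1 | 1 |
| October | 0 | 0 | 0 | 1 | 1 |
| November | 0 | 0 | 0 | 1 | 1 |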

Figure 2. Graphical visualization of D_spr

The first idea may be wrong

The first thing that comes to mind is to regress PowerC on dummies as in

(1) PowerC=a+bD_{win}+cD_{spr}+dD_{sum}+eD_{aut}+error.

Not so fast. To see the problem, let us rewrite (1) as

(2) PowerC=a\times 1+bD_{win}+cD_{spr}+dD_{sum}+eD_{aut}+error.

This shows that, in addition to the four dummies, there is a fifth variable, which equals 1 across all observations. Let us denote it T (for Trivial). Table 1 shows that

(3) T=D_{win}+D_{spr}+ D_{sum}+D_{aut}.

This makes the next definition relevant. Regressors x_1,...,x_k are called linearly dependent if one of them, say, x_1, can be expressed as a linear combination of the others: x_1=a_2x_2+...+a_kx_k.  In case (3), all coefficients a_i are unities, so we have linear dependence. Using (3), let us replace T in (2). The resulting equation is rearranged as

(4) PowerC=(a+b)D_{win}+(a+c)D_{spr}+(a+d)D_{sum}+(a+e)D_{aut}+error.

Now we see what the problem is. When regressors are linearly dependent, the model is not uniquely specified. (1) and (4) are two different representations of the same model.
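The linear dependence (3) and the non-uniqueness it causes can be seen numerically. This is a sketch: the twelve monthly observations follow Table 1, while the coefficient values are hypothetical and serve only to show that (1) and (4) produce the same fitted values.

```python
import numpy as np

# Season of each month, December through November, as in Table 1.
seasons = np.repeat(["win", "spr", "sum", "aut"], 3)

# One dummy column per season, plus the trivial regressor T of ones.
dummies = np.column_stack([(seasons == s).astype(float)
                           for s in ("win", "spr", "sum", "aut")])
trivial = np.ones(len(seasons))
X_all = np.column_stack([trivial, dummies])      # regressors of model (1)

# Equation (3): the dummies add up to the trivial regressor.
print(np.allclose(trivial, dummies.sum(axis=1)))

# Five columns but rank four: the columns are linearly dependent.
print(X_all.shape[1], np.linalg.matrix_rank(X_all))

# Hypothetical coefficients (a, b, c, d, e) of (1) and their counterpart in (4).
coef_1 = np.array([10.0, 1.0, 2.0, 3.0, 4.0])
coef_4 = np.concatenate([[0.0], coef_1[1:] + coef_1[0]])
# Different coefficient vectors, identical fitted values.
print(np.allclose(X_all @ coef_1, X_all @ coef_4))
```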

What is the way out?

If regressors are linearly dependent, drop them one after another until you get linearly independent ones. For example, dropping the winter dummy, we get

(5) PowerC=a+cD_{spr}+dD_{sum}+eD_{aut}+error.

Here is the estimation result for the two-year data in Figure 1:

PowerC=128176-27380D_{spr}+5450D_{sum}-22225D_{aut}.

This means that:

PowerC=128176 in winter, PowerC=128176-27380 in spring,

PowerC=128176+5450 in summer, and PowerC=128176-22225 in autumn.

It is revealing that cooling requires more power than heating. However, the summer coefficient is not significant. Here is the Excel file with the data and estimation result.

The category that has been dropped is called a base (or reference) category. Thus, the intercept in (5) measures power consumption in winter. The dummy coefficients in (5) measure deviations of power consumption in respective seasons from that in winter.
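The interpretation of the intercept and the dummy coefficients can be verified on made-up numbers. The data below are synthetic (they are NOT the EIA data behind Figure 1), and the seasonal levels and noise are assumptions. With winter as the base category, the OLS intercept should equal the winter sample mean, and each dummy coefficient should equal the difference between that season's sample mean and the winter one.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two years of monthly observations, December through November.
seasons = np.tile(np.repeat(["win", "spr", "sum", "aut"], 3), 2)

# Synthetic consumption levels, for illustration only.
level = {"win": 130.0, "spr": 100.0, "sum": 135.0, "aut": 105.0}
power = np.array([level[s] for s in seasons]) + rng.normal(0.0, 5.0, seasons.size)

# Model (5): intercept plus spring, summer and autumn dummies (winter dropped).
X = np.column_stack([np.ones(seasons.size)] +
                    [(seasons == s).astype(float) for s in ("spr", "sum", "aut")])
coef, *_ = np.linalg.lstsq(X, power, rcond=None)

winter_mean = power[seasons == "win"].mean()
print("intercept vs winter mean:", coef[0], winter_mean)
for k, s in enumerate(("spr", "sum", "aut"), start=1):
    # Each dummy coefficient is the deviation of the season's mean from the winter mean.
    print(s, coef[k], power[seasons == s].mean() - winter_mean)
```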

Here is the question I ask my students

We want to see how beer consumption BeerC depends on gender and income Inc. Let M and F denote the dummies for males and females, resp. Correct the following model and interpret the resulting coefficients:


Final remark

When a researcher includes all categories plus the trivial regressor, he/she falls into what is called a dummy trap. The problem of linear dependence among regressors is usually discussed under the heading of multiple regression. But since the trivial regressor is present in simple regression too, it might be a good idea to discuss it earlier.

Linear dependence/independence of regressors is an exact condition for existence of the OLS estimator. That is, if regressors are linearly dependent, then the OLS estimator doesn't exist, in which case the question about its further properties doesn't make sense. If, on the other hand, regressors are linearly independent, then the OLS estimator exists, and further properties can be studied, such as unbiasedness, variance and efficiency.

Oct 16

The pearls of AP Statistics 31

Demystifying sampling distributions: too much talking about nothing

What we know about sample means

Let X_1,...,X_n be an independent identically distributed sample and consider its sample mean \bar{X}.

Fact 1. The sample mean is an unbiased estimator of the population mean:

(1) E\bar{X}=\frac{1}{n}(EX_1+...+EX_n)=\frac{1}{n}(\mu+...+\mu)=\mu

(use linearity of means).

Fact 2. Variance of the sample mean is

(2) Var(\bar{X})=\frac{1}{n^2}(Var(X_1)+...+Var(X_n))=\frac{1}{n^2}(\sigma^2(X)+...+\sigma^2(X))=\frac{\sigma^2(X)}{n}

(use homogeneity of variance of degree 2 and additivity of variance for independent variables). Hence \sigma(\bar{X})=\frac{\sigma(X)}{\sqrt{n}}.

Fact 3. The implication of these two properties is that the sample mean becomes more concentrated around the population mean as the sample size increases (see at least the law of large numbers; I have a couple more posts about this).

Fact 4. Finally, the z scores of sample means stabilize to a standard normal distribution (the central limit theorem).

What is a sampling distribution?

The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take (Agresti and Franklin, p.308). After this definition, the authors go ahead and discuss the above four facts. Note that none of them requires the knowledge of what the sampling distribution is. The ONLY sampling distribution that appears explicitly in AP Statistics is the binomial. However, in the book the binomial is given in Section 6.3, before sampling distributions, which are the subject of Chapter 7. Section 7.3 explains that the binomial is a sampling distribution but that section is optional. Thus the whole Chapter 7 (almost 40 pages) is redundant.

Then what are sampling distributions for?

Here is a simple example that explains their role. Consider the binomial X_1+X_2 of two observations on an unfair coin. It involves two random variables and therefore is described by a joint distribution with the sample space consisting of pairs of values

Table 1. Sample space for pair (X_1,X_2)

| | Coin 1 = 0 | Coin 1 = 1 |
|---|---|---|
| Coin 2 = 0 | (0,0) | (0,1) |
| Coin 2 = 1 | (1,0) | (1,1) |

Each coin independently takes values 0 and 1 (shown in the margins); the sample space contains four pairs of these values (shown in the main body of the table). The corresponding probability distribution is given by the table

Table 2. Joint probabilities for pair (X_1,X_2)

| | Coin 1: p | Coin 1: q |
|---|---|---|
| Coin 2: p | p^2 | pq |
| Coin 2: q | pq | q^2 |

Since we are counting only the number of successes, the outcomes (0,1) and (1,0) for the purposes of our experiment are the same. Hence, joining indistinguishable outcomes, we obtain a smaller sample space

Table 3. Sampling distribution for binomial X_1+X_2

| # of successes | Corresponding probabilities |
|---|---|
| 0 | p^2 |
| 1 | 2pq |
| 2 | q^2 |

The last table is the sampling distribution for the binomial with sample size 2. All the sampling distribution does is replace a large joint distribution Table 1+Table 2 by a smaller distribution Table 3. The beauty of proofs of equations (1) and (2) is that they do not depend on which distribution is used (the distribution is hidden in the expected value operator).
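The collapse of the joint distribution into the sampling distribution can be written out as a short sketch. The function name and the numerical values of p and q are assumptions for illustration; as in the tables, the value 0 has probability p and the value 1 has probability q.

```python
from itertools import product
from collections import defaultdict
from fractions import Fraction

def sampling_distribution(prob, n=2):
    """Collapse the joint distribution of n independent tosses into the distribution of their sum.

    prob maps each coin value (0 or 1) to its probability.
    """
    dist = defaultdict(Fraction)
    for outcome in product(prob, repeat=n):     # all tuples of values: the joint sample space
        p_outcome = Fraction(1)
        for value in outcome:
            p_outcome *= prob[value]            # independence: probabilities multiply
        dist[sum(outcome)] += p_outcome         # join outcomes with the same number of successes
    return dict(dist)

p, q = Fraction(1, 3), Fraction(2, 3)           # hypothetical unfair coin
print(sampling_distribution({0: p, 1: q}))
```

For n=2 the four pairs of the joint distribution reduce to three values of the sum; for larger n the reduction is from 2^n outcomes to n+1.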

Unless you want your students to appreciate the reduction in the sample space brought about by sampling distributions, it is not worth discussing them. See Wikipedia for examples other than the binomial.

Sep 16

Proving unbiasedness of OLS estimators

Proving unbiasedness of OLS estimators - the do's and don'ts


Here we derived the OLS estimators. To distinguish between sample and population means, the variance and covariance in the slope estimator will be provided with the subscript u (for "uniform", see the rationale here).

(1) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(2) \hat{a}=\bar{y}-\hat{b}\bar{x}.

These equations are used in conjunction with the model

(3) y_i=a+bx_i+e_i

where we remember that

(4) Ee_i=0 for all i.

Since (2) depends on (1), we have to start with unbiasedness of the slope estimator.

Using the right representation is critical

We have to show that E\hat{b}=b.

Step 1. Don't apply the expectation directly to (1). Do separate in (1) what is supposed to be E\hat{b}. To reveal the role of errors in (1), plug (3) in (1) and use linearity of covariance with respect to each argument when the other argument is fixed:

\hat{b}=\frac{Cov_u(x,a+bx+e)}{Var_u(x)}=\frac{Cov_u(x,a)+bCov_u(x,x)+Cov_u(x,e)}{Var_u(x)}.

Here Cov_u(x,a)=0 (a constant is uncorrelated with any variable), Cov_u(x,x)=Var_u(x) (covariance of x with itself is its variance), so

(5) \hat{b}=\frac{bVar_u(x)+Cov_u(x,e)}{Var_u(x)}=b+\frac{Cov_u(x,e)}{Var_u(x)}.

Equation (5) is the mean-plus-deviation-from-the-mean decomposition. Many students think that Cov_u(x,e)=0 because of (4). No! The covariance here does not involve the population mean.

Step 2. It pays to make one more step to develop (5). Write out the numerator in (5) using summation:

\hat{b}=b+\frac{1}{n}\sum(x_i-\bar{x})(e_i-\bar{e})/Var_u(x).

Don't write out Var_u(x)! Presence of two summations confuses many students.

Multiplying parentheses and using the fact that \sum(x_i-\bar{x})=n\bar{x}-n\bar{x}=0 we have

\hat{b}=b+\frac{1}{n}[\sum(x_i-\bar{x})e_i-\bar{e}\sum(x_i-\bar{x})]/Var_u(x) =b+\frac{1}{n}\sum\frac{(x_i-\bar{x})}{Var_u(x)}e_i.

To simplify calculations, denote a_i=(x_i-\bar{x})/Var_u(x). Then the slope estimator becomes

(6) \hat{b}=b+\frac{1}{n}\sum a_ie_i.

This is the critical representation.

Unbiasedness of the slope estimator

Convenience condition. The regressor x is deterministic. I call it a convenience condition because it's just a matter of mathematical expedience, and later on we'll study ways to bypass it.

From (6), linearity of means and remembering that the deterministic coefficients a_i behave like constants,

(7) E\hat{b}=E[b+\frac{1}{n}\sum a_ie_i]=b+\frac{1}{n}\sum a_iEe_i=b

by (4). This proves unbiasedness.
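Both the representation (6) and the unbiasedness result (7) can be checked numerically. The sketch below assumes hypothetical true parameters, a small fixed regressor and normal errors with zero mean. Representation (6) is an algebraic identity, so it should hold for every single draw of errors; (7) is a statement about the population mean, so it is only approximated by averaging over many draws.

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = 1.0, 2.0                             # hypothetical true intercept and slope
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])    # deterministic regressor
n = x.size
var_u_x = np.mean((x - x.mean())**2)        # Var_u(x): division by n
a_i = (x - x.mean()) / var_u_x              # coefficients a_i from (6)

def ols_slope(y):
    """Slope estimator (1): Cov_u(x, y) / Var_u(x)."""
    return np.mean((x - x.mean()) * (y - y.mean())) / var_u_x

# Representation (6) holds exactly for one draw of errors.
e = rng.normal(0.0, 1.0, n)
y = a + b * x + e
print(np.isclose(ols_slope(y), b + np.mean(a_i * e)))

# Unbiasedness (7): the average slope over many error draws is close to b.
errors = rng.normal(0.0, 1.0, size=(100000, n))
slopes = b + (errors * a_i).mean(axis=1)    # uses (6) for each draw
print(slopes.mean())
```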

You don't know the difference between the population and sample means until you see them working in the same formula.

Unbiasedness of the intercept estimator

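As above we plug (3) in (2): \hat{a}=\overline{a+bx+e}-\hat{b}\bar{x}=a+b\bar{x}+\bar{e}-\hat{b}\bar{x}. Applying expectation:

E\hat{a}=a+b\bar{x}+E\bar{e}-E\hat{b}\bar{x}=a+b\bar{x}-b\bar{x}=a

(E\bar{e}=0 by (4) and E\hat{b}=b by (7)).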


Since in (1) there is division by Var_u(x), the condition Var_u(x)\ne 0 is the main condition for existence of OLS estimators. From the above proof we see that (4) is the main condition for unbiasedness.

Aug 16

The pearls of AP Statistics 24

Unbiasedness: the stumbling block of a Statistics course

God is in the detail

They say: A good estimator has a sampling distribution that is centered at the parameter. We define center in this case as the mean of that sampling distribution. An estimator with this property is said to be unbiased. From Section 7.2, we know that for random sampling the mean of the sampling distribution of the sample mean \bar{x} equals the population mean μ. So, the sample mean \bar{x} is an unbiased estimator of μ. (Agresti and Franklin, p.351).

I say: This is a classic case of turning everything upside down, and this happens when the logical chain is broken. Unbiasedness is one of the pillars of Statistics. It can and should be given right after the notion of population mean is introduced. The authors make the definition dependent on random sampling, sampling distribution and a whole Section 7.2. Therefore I highly doubt that any student can grasp the above definition. My explanation below may not be the best; I just want to prompt the reader to think about alternatives to the above "definition".

Population mean versus sample mean

By definition, in the discrete case, a random variable is a table values+probabilities:

| Values | Probabilities |
|---|---|
| X_1 | p_1 |
| ... | ... |
| X_n | p_n |

If we know this table, we can define the population mean \mu=EX=p_1X_1+...+p_nX_n. This is a weighted average of the variable values because the probabilities are percentages: 0<p_i<1 for all i and p_1+...+p_n=1. The expectation operator E is the device used by Mother Nature to measure the average, and most of the time she keeps hidden from us both the probabilities and the average EX.

Now suppose that X_1,...,X_n represent a sample from the given population (and not the values in the above table). We can define the sample mean \bar{X}=\frac{X_1+...+X_n}{n}. Being a little smarter than monkeys, we use the uniform distribution p_i=1/n instead of the unknown probabilities. Unlike the population average, the sample average is always possible to calculate, as long as the sample is available.
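The two averages can be put side by side in a small sketch. The table of values and probabilities is hypothetical; in practice it is exactly what we do not know. The population mean weights the values by their probabilities, while the sample mean weights the observations by 1/n.

```python
import numpy as np

# Hypothetical table values+probabilities (normally hidden from us).
values = np.array([1.0, 2.0, 5.0])
probs = np.array([0.2, 0.5, 0.3])              # probabilities sum to 1
population_mean = np.sum(probs * values)       # EX = p_1*X_1 + ... + p_n*X_n

# A sample drawn from that population; only this is available to the researcher.
rng = np.random.default_rng(5)
sample = rng.choice(values, size=50, p=probs)
sample_mean = sample.mean()                    # uniform weights 1/n

# The two numbers are generally different.
print(population_mean, sample_mean)
```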

Consider a good shooter shooting at three targets using a good rifle.

Unbiasedness intuition

The black dots represent points hit by bullets on three targets. In Figure 1, there was only one shot. What is your best guess about where the bull's eye is? Regarding Figure 2, everybody says that probably the bull's eye is midway (red point) between points A and B. In Figure 3, the sample mean is represented by the red point. Going back to unbiasedness: 1) the bull's eye is the unknown population parameter that needs to be estimated, 2) points hit by bullets are sample observations, 3) their sample mean is represented by red points, 4) the red points estimate the location of the bull's eye. The sample mean is said to be an unbiased estimator of population mean because

(1) E\bar{X}=\mu.

In words, Mother Nature says that, in her opinion, on average our bullets hit the bull's eye.

This explanation is an alternative to the one you can see in many books: in the long run, the sample mean correctly estimates the population mean. That explanation in fact replaces equation (1) by the corresponding law of large numbers. My explanation just underlines the fact that there is an abstract average, that we cannot use, and the sample average, that we invent to circumvent that problem.

See related theoretical facts here.

Aug 16

The pearls of AP Statistics 14

Reasons to increase Math content in AP Statistics course

The definition of the standard deviation, to those who see it for the first time, looks complex and scary. Agresti and Franklin on p.57 have done an excellent job explaining it. They do it step by step: introduce deviations, the sum of squares, variance and give the formula in the end. The names introduced here will be useful later, in other contexts. Being a rotten theorist, I don't like the "small technical point" on p. 57 (the true reason why there is division by n-1 and not by n is unbiasedness: Es^2=\sigma^2) but this is a minor point.

AP Stats teachers cannot discuss advanced facts because many students are not good at algebra. However, there are good methodological reasons to increase Math content of an AP Stats course. When students saw algebra for the first time, their cognitive skills may have been underdeveloped, which may have prevented them from leaping from numbers to algebraic notation. On the other hand, by the time they take AP Stats they have matured. Their logic, power of observation, motivation etc. are better. The crucial fact is that in Statistics numbers meet algebra again, and this can be usefully employed.

Ask your students two questions. 1) You have observations on two stocks, X and Y. How is the sample mean of their sum related to their individual sample means? 2) You have s shares of stock X (s is a number, X is a random variable). How is the sample mean of your portfolio sX related to the sample mean of X? This smells of money and motivates well.

The first answer, \overline{X+Y}=\bar{X}+\bar{Y}, tells us that if we know the individual means, we can avoid calculating \overline{X+Y} by simply adding two numbers. Similarly, the second formula, \overline{sX}=s\bar{X}, simplifies calculation of \overline{sX}. Methodologically, this is an excellent opportunity to dive into theory. Firstly, there is good motivation. Secondly, it's easy to see the link between numbers and algebra (see tabular representations of random variables in Chapters 4 and 5 of my book; you are welcome to download the free version). Thirdly, even though this is theory, many things here are done by analogy, which students love. Fourthly, this topic paves the road to properties of the variance and covariance (recall that the slope in simple regression is covariance over variance).
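Students can verify both formulas on numbers before seeing the algebra. A minimal sketch with simulated prices (the price series and the number of shares are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(100.0, 10.0, 250)   # hypothetical observations on stock X
Y = rng.normal(50.0, 5.0, 250)     # hypothetical observations on stock Y
s = 30                             # number of shares of X

# Sample mean of the sum equals the sum of the sample means.
print(np.isclose(np.mean(X + Y), np.mean(X) + np.mean(Y)))

# Sample mean of the portfolio sX equals s times the sample mean of X.
print(np.isclose(np.mean(s * X), s * np.mean(X)))
```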

Agresti and Franklin don't have any theoretical properties of the mean, so without them the definition of the mean is kind of left hanging in the air. FYI: properties of the mean, variance, covariance and standard deviation are omnipresent in theory. The mode, median, range and IQR are not used at all because they have bad theoretical properties.

Dec 15

Population mean versus sample mean

Population mean versus sample mean.

Equations involving both population and sample means are especially confusing for students. One of them is unbiasedness of the sample mean E\bar{X}=EX. In the Econometrics context there are many relations of this type. They need to be emphasized and explained many times until everybody understands the difference.

On the practical side, the first thing to understand is that the population mean uses all population elements and the population distribution, which are usually unknown. On the other hand, the sample mean uses only the sample and is known, as long as the sample is known.

On the theoretical side, we know that 1) as the sample size increases, the sample mean tends to the population mean (law of large numbers), 2) the population mean of the sample mean equals the population mean (unbiasedness), 3) for a discrete uniformly distributed variable with a finite number of elements, the population mean equals the sample mean (see equation (4) in that post) if the sample is the whole population, 4) if the population mean equals \mu, that does not mean that any sample from that population has the same sample mean.

For the preliminary material on properties of means see this post.