Feb 16

What is a mean value - all means in one place

In introductory Stats texts, various means are scattered all over the place, and there is no indication of links between them. This is what we address here.

The population mean of a discrete random variable is the starting point. Such a variable, by definition, is a table of values+probabilities (see this post), and its mean is EX=\sum_{i=1}^nX_ip_i. If that random variable is uniformly distributed, the same post explains that EX=\bar{X}, so the sample mean is a special case of the population mean.

The next point is the link between the grouped data formula and sample mean. Recall the procedure for finding absolute frequencies. Let Y_1,...,Y_n be the values in the sample (it is convenient to assume that they are arranged in an ascending order). Equal values are joined in groups. Let X_1,...,X_m denote the distinct values in the sample and n_1,...,n_m their absolute frequencies. Their total is, clearly, n. The sample mean is
\bar{Y}=(Y_1+...+Y_n)/n
(sorting out Y's into groups with equal values)
=\left(\overbrace {X_1+...+X_1}^{n_1{\rm{\ times}}}+...+\overbrace{X_m+...+X_m}^{n_m{\rm{\ times}}}\right)/n
=(n_1X_1 + ... + n_mX_m)/n,

which is the grouped data formula. We have shown that the grouped data formula is obtained as a special case of the sample mean when equal values are joined into groups.
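A quick numeric sketch of this equality (the sample values below are made up for illustration):

```python
# Sketch: the grouped data formula agrees with the plain sample mean.
from collections import Counter

Y = [3, 1, 3, 2, 1, 3, 2, 2, 2, 1]   # sample Y_1,...,Y_n
n = len(Y)

sample_mean = sum(Y) / n

# Group equal values: distinct values X_j with absolute frequencies n_j
freq = Counter(Y)                     # {value: absolute frequency}
grouped_mean = sum(n_j * X_j for X_j, n_j in freq.items()) / n

print(sample_mean, grouped_mean)      # both equal 2.0
```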

Next, denoting r_i=n_i/n the relative frequencies, we get

(n_1X_1 + ... + n_mX_m)/n

(dividing through by n)

=r_1X_1 + ... + r_mX_m.
If we accept the relative frequencies as probabilities, then this becomes the population mean. Thus, with this convention, the grouped data formula and population mean are the same.

Finally, the mean of a continuous random variable X which has a density p_X is defined by EX=\int_{-\infty}^\infty tp_X(t)dt. In Section 6.3 of my book it is shown that the mean of a continuous random variable is a limit of grouped means.
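A numeric sketch of that limit (my own illustration, not the book's construction; the standard exponential density and the truncation point are assumptions):

```python
# Sketch: the mean E X = ∫ t p(t) dt of a continuous variable is
# approximated by a grouped mean over small cells [t_i, t_i + dt),
# with "probabilities" p(t_i)*dt.
import math

def density(t):
    return math.exp(-t)   # standard exponential density, exact mean is 1

dt = 0.001
grid = [i * dt for i in range(int(20 / dt))]   # truncate the tail at t = 20
approx_mean = sum(t * density(t) * dt for t in grid)

print(approx_mean)        # close to the exact value 1
```

Shrinking dt (and pushing the truncation point out) drives the grouped mean to the integral.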


Properties of means apply equally to all mean types.

Feb 16

Summation sign rules: identities for simple regression

There are many sources on the Internet. This and this are relatively simple, while this one is pretty advanced. They cover the basics. My purpose is more specific: to show how to obtain a couple of identities in terms of summation signs from general properties of variance and covariance.

Shortcut for covariance. This is a name of the following identity

(1) E(X-EX)(Y-EY)=E(XY)-(EX)(EY)

where on the left we have the definition of Cov(X,Y) and on the right we have an alternative expression (a shortcut) for the same thing. Letting X=Y in (1) we get a shortcut for variance:

(2) E(X-EX)^2=E(X^2)-(EX)^2,

see the direct proof here. Again, on the left we have the definition of Var(X) and on the right a shortcut for the same.
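A small numeric check of the variance shortcut (2) on a made-up discrete distribution:

```python
# Sketch: verify E(X - EX)^2 = E(X^2) - (EX)^2 on a toy distribution.
values = [1, 2, 5]
probs  = [0.2, 0.5, 0.3]   # sum to one

EX  = sum(x * p for x, p in zip(values, probs))
EX2 = sum(x**2 * p for x, p in zip(values, probs))

lhs = sum((x - EX)**2 * p for x, p in zip(values, probs))   # definition of Var(X)
rhs = EX2 - EX**2                                           # shortcut

print(lhs, rhs)            # both equal 2.41
```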

In this post I mentioned that

for a discrete uniformly distributed variable with a finite number of elements, the population mean equals the sample mean if the sample is the whole population.

This is what it means. The most useful definition of a discrete random variable is this: it is a table values+probabilities of type

Table 1. Discrete random variable with n values 
Values X_1 ... X_n
Probabilities p_1 ... p_n

Here X_1,...,X_n are the values and p_1,...,p_n are the probabilities (they sum to one). With this table, it is easy to define the mean of X:

(3) EX=\sum_{i=1}^nX_ip_i.

A variable like this is called uniformly distributed if all probabilities are the same:

Table 2. Uniformly distributed discrete random variable with n values
Values X_1 ... X_n
Probabilities 1/n ... 1/n

In this case (3) becomes

(4) EX=\sum_{i=1}^nX_i\frac{1}{n}=\frac{X_1+...+X_n}{n}=\bar{X}.
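A minimal sketch of how (3) collapses to (4) under the uniform table (the values are an arbitrary illustration):

```python
# Sketch: formula (3) with p_i = 1/n equals the sample mean (4).
X = [4.0, 7.0, 1.0, 8.0]
n = len(X)

EX = sum(x * (1 / n) for x in X)   # formula (3) with uniform probabilities
sample_mean = sum(X) / n           # \bar{X}

print(EX, sample_mean)             # identical: 5.0
```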

This explains the statement from my post. Using (4), equations (1) and (2) can be rewritten as

(5) \overline{(X-\bar{X})(Y-\bar{Y})}=\overline{XY}-\bar{X}\bar{Y},\ \overline{(X-\bar{X})^2}=\overline{X^2}-(\bar{X})^2.

Try to write this using summation signs. For example, the first identity in (5) becomes

\frac{1}{n}\sum_{i=1}^n\left(X_i-\frac{1}{n}\sum_{j=1}^nX_j\right)\left(Y_i-\frac{1}{n}\sum_{j=1}^nY_j\right)=\frac{1}{n}\sum_{i=1}^nX_iY_i-\left(\frac{1}{n}\sum_{i=1}^nX_i\right)\left(\frac{1}{n}\sum_{i=1}^nY_i\right).

This is crazy and trying to prove this directly would be even crazier.
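Still, both identities in (5) are easy to confirm numerically, with the bar implemented as the sample average E_u (the data below are made up):

```python
# Sketch: numeric check of the covariance and variance identities in (5).
X = [2.0, 3.0, 5.0, 10.0]
Y = [1.0, 4.0, 4.0, 7.0]
n = len(X)

def bar(Z):                        # \bar{Z}: average over the sample
    return sum(Z) / n

lhs_cov = bar([(x - bar(X)) * (y - bar(Y)) for x, y in zip(X, Y)])
rhs_cov = bar([x * y for x, y in zip(X, Y)]) - bar(X) * bar(Y)

lhs_var = bar([(x - bar(X)) ** 2 for x in X])
rhs_var = bar([x * x for x in X]) - bar(X) ** 2

print(lhs_cov, rhs_cov)            # equal: 6.0
print(lhs_var, rhs_var)            # equal: 9.5
```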

Remark. Let X_1,...,X_n be a sample from an arbitrary distribution. Regardless of the parent distribution, the artificial uniform distribution from Table 2 can still be applied to the sample. To avoid confusion with the expected value E with respect to the parent distribution, instead of (4) we can write

(6) E_uX=\bar{X}

where the subscript u stands for "uniform". With that understanding, equations (5) are still true. The power of this approach is that all expressions in (5) are random variables which allows for further application of the expected value E with respect to the parent distribution.

Dec 15

Population mean versus sample mean

Equations involving both population and sample means are especially confusing for students. One of them is unbiasedness of the sample mean E\bar{X}=EX. In the Econometrics context there are many relations of this type. They need to be emphasized and explained many times until everybody understands the difference.

On the practical side, the first thing to understand is that the population mean uses all population elements and the population distribution, which are usually unknown. On the other hand, the sample mean uses only the sample and is known, as long as the sample is known.

On the theoretical side, we know that:

1) as the sample size increases, the sample mean tends to the population mean (law of large numbers);

2) the population mean of the sample mean equals the population mean (unbiasedness);

3) for a discrete uniformly distributed variable with a finite number of elements, the population mean equals the sample mean if the sample is the whole population (see equation (4) in that post);

4) if the population mean equals \mu, that does not mean that every sample from that population has sample mean \mu.
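Point 1) can be illustrated with a short simulation (a fair six-sided die, with population mean 3.5, is an assumption made for illustration):

```python
# Sketch: sample means fluctuate around the population mean and settle
# near it as the sample grows (law of large numbers).
import random

random.seed(0)
# Fair die: population mean is (1+2+3+4+5+6)/6 = 3.5
for n in (10, 1000, 100_000):
    sample = [random.randint(1, 6) for _ in range(n)]
    print(n, sum(sample) / n)     # approaches 3.5 as n grows
```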

For the preliminary material on properties of means see this post.