Jan 17

OLS estimator variance

Assumptions about simple regression

We consider the simple regression

(1) y_i=a+bx_i+e_i

Here we derived the OLS estimators of the intercept and slope:

(2) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(3) \hat{a}=\bar{y}-\hat{b}\bar{x}.

A1. Existence condition. Since division by zero is not allowed, for (2) to exist we require Var_u(x)\ne 0. If this condition is not satisfied, then there is no variation in x and all observed points lie on a vertical line.

A2. Convenience condition. The regressor x is deterministic. This condition is imposed to be able to apply the properties of expectation, see equation (7) in  this post. The time trend and dummy variables are examples of deterministic regressors. However, most real-life regressors are stochastic. Modifying the theory in order to cover stochastic regressors is the subject of two posts: finite-sample theory and large-sample theory.

A3. Unbiasedness condition: Ee_i=0 for all i. This is the main assumption that makes sure that OLS estimators are unbiased, see equation (7) in this post.

Unbiasedness is not enough

Unbiasedness characterizes the quality of an estimator, see the intuitive explanation. Unfortunately, unbiasedness is not enough to choose the best estimator because of nonuniqueness: usually, if there is one unbiased estimator of a parameter, then there are infinitely many unbiased estimators of the same parameter. For example, we know that the sample mean \bar{X} unbiasedly estimates the population mean: E\bar{X}=EX. Since EX_1=EX (X_1 is the first observation), we can easily construct an infinite family of unbiased estimators Y=(\bar{X}+aX_1)/(1+a), assuming a\ne -1. Indeed, using linearity of expectation, EY=(E\bar{X}+aEX_1)/(1+a)=(EX+aEX)/(1+a)=EX.

Variance is another measure of estimator quality: to have a lower spread of estimator values, among competing estimators we choose the one which has the lowest variance. Knowing the estimator variance also allows us to find the z-score and use statistical tables.
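The two points above can be checked exactly on a toy example. The sketch below is only an illustration under assumptions that are not in the post: the population is a fair die, the sample size is n=3, and the helper name exact_mean_var is made up. It enumerates all possible samples and computes the exact mean and variance of Y=(\bar{X}+aX_1)/(1+a) for a few values of a.

```python
from fractions import Fraction
from itertools import product

values = [1, 2, 3, 4, 5, 6]  # hypothetical population: a fair die
n = 3                        # sample size, small enough to enumerate

def exact_mean_var(a):
    """Exact mean and variance of Y = (Xbar + a*X_1)/(1 + a) over all samples."""
    a = Fraction(a)
    prob = Fraction(1, len(values) ** n)  # every sample is equally likely
    ys = []
    for sample in product(values, repeat=n):
        xbar = Fraction(sum(sample), n)
        ys.append((xbar + a * sample[0]) / (1 + a))
    mean = sum(y * prob for y in ys)
    var = sum((y - mean) ** 2 * prob for y in ys)
    return mean, var

# a = 0 gives the sample mean itself
results = {a: exact_mean_var(a) for a in (0, Fraction(1, 2), 1, 5)}
for a, (mean, var) in results.items():
    print(f"a={a}: EY={mean}, Var(Y)={var}")
```

In this example every member of the family has the same mean, while among the values of a tried the variance is smallest at a=0, that is, for the sample mean itself. This is the sense in which variance helps to choose among unbiased estimators.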

Slope estimator variance

It is not difficult to find the variance of the slope estimator using representation (6) derived here:

\hat{b}=b+\frac{1}{n}\sum a_ie_i

where a_i=(x_i-\bar{x})/Var_u(x).

Don't try to apply the definition of variance directly at this point, because there will be a square of a sum, which leads to a double sum upon squaring. We need two new assumptions.

A4. Uncorrelatedness of errors. Assume that Cov(e_i,e_j)=0 for all i\ne j (errors from different equations (1) are uncorrelated). Note that because of the unbiasedness condition, this assumption is equivalent to Ee_ie_j=0 for all i\ne j. This assumption is likely to be satisfied if we observe consumption patterns of unrelated individuals.

A5. Homoscedasticity. All errors have the same variance: Var(e_i)=\sigma^2 for all i. Again, because of the unbiasedness condition, this assumption is equivalent to Ee_i^2=\sigma^2 for all i.

Now we can derive the variance expression, using properties from this post:

Var(\hat{b})=Var(b+\frac{1}{n}\sum_i a_ie_i) (dropping a constant doesn't affect variance)

=Var(\frac{1}{n}\sum_i a_ie_i) (for uncorrelated variables, variance is additive)

=\sum_i Var(\frac{1}{n}a_ie_i) (variance is homogeneous of degree 2)

=\frac{1}{n^2}\sum_i a_i^2Var(e_i) (applying homoscedasticity)

=\frac{1}{n^2}\sum_i a_i^2\sigma^2 (plugging a_i)

=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x) (using the notation of sample variance)

=\frac{1}{n}Var_u(x)\sigma^2/Var^2_u(x)=\frac{\sigma^2}{nVar_u(x)}.



Note that canceling out two variances in the last line is obvious. It is not so obvious for some if instead of the short notation for variances you use summation signs. The case of the intercept variance is left as an exercise.
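The result Var(\hat{b})=\sigma^2/(nVar_u(x)) can be verified exactly on a small example. The sketch below rests on assumptions that are not in the post: four made-up regressor values, made-up parameters a=2, b=3, and errors that independently take the values +\sigma and -\sigma with probability 1/2 each, so that A3-A5 hold. It enumerates all error patterns and compares the exact variance of the slope estimator with the formula.

```python
from fractions import Fraction
from itertools import product

x = [Fraction(v) for v in (1, 2, 4, 7)]    # hypothetical deterministic regressor
a_true, b_true = Fraction(2), Fraction(3)  # hypothetical intercept and slope
sigma = Fraction(1)                        # hypothetical error standard deviation
n = len(x)

def mean(v):
    return sum(v) / len(v)

def cov_u(u, v):
    """Sample covariance with the uniform weights 1/n."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

def slope(y):
    return cov_u(x, y) / cov_u(x, x)

# Each e_i is +sigma or -sigma with probability 1/2, independently across i,
# so Ee_i = 0, Var(e_i) = sigma^2 and Cov(e_i, e_j) = 0 for i != j.
slopes = []
for signs in product((-1, 1), repeat=n):
    y = [a_true + b_true * xi + s * sigma for xi, s in zip(x, signs)]
    slopes.append(slope(y))

e_slope = mean(slopes)                                  # exact E(b_hat)
var_slope = mean([(s - e_slope) ** 2 for s in slopes])  # exact Var(b_hat)
formula = sigma ** 2 / (n * cov_u(x, x))
print(e_slope, var_slope, formula)
```

Because all 16 error patterns are equally likely, the plain averages over the list are exact expectations. The enumeration agrees with the formula and, along the way, with unbiasedness of the slope estimator.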


The above assumptions A1-A5 are called classical. It is necessary to remember their role in derivations because a considerable part of Econometrics is devoted to deviations from classical assumptions. Once you have a certain assumption violated, you should expect the corresponding estimator property invalidated. For example, if Ee_i\ne 0, you should expect the estimators to be biased. If any of A4-A5 is not true, the formula we have derived

Var(\hat{b})=\frac{\sigma^2}{nVar_u(x)}

will not hold. Besides, the Gauss-Markov theorem that the OLS estimators are efficient will not hold (this will be discussed later). The pair A4-A5 can be called an efficiency condition.

Nov 16

Properties of correlation

Correlation coefficient: the last block of statistical foundation

Correlation has already been mentioned in

Statistical measures and their geometric roots

Properties of standard deviation

The pearls of AP Statistics 35

Properties of covariance

The pearls of AP Statistics 33

The hierarchy of definitions

Suppose random variables X,Y are not constant. Then their standard deviations are not zero and we can define their correlation \rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}, as in Chart 1.


Chart 1. Correlation definition

Properties of correlation

Property 1. Range of the correlation coefficient: for any X,Y one has -1 \le \rho (X,Y) \le 1.
This follows from the Cauchy-Schwarz inequality, as explained here.

Recall from this post that correlation is the cosine of the angle between X-EX and Y-EY.
Property 2. Interpretation of extreme cases. (Part 1) If \rho (X,Y) = 1, then Y = aX + b with a > 0.

(Part 2) If \rho (X,Y) = - 1, then Y = aX + b with a < 0.

Proof. (Part 1) \rho (X,Y) = 1 implies
(1) Cov (X,Y) = \sigma (X)\sigma (Y)
which, in turn, implies that Y is a linear function of X: Y = aX + b (this is the second part of the Cauchy-Schwarz inequality). Further, we can establish the sign of the number a. By the properties of variance and covariance

\sigma (Y)=\sigma(aX + b)=\sigma (aX)=|a|\sigma (X).
Since Cov(X,Y)=Cov(X,aX+b)=aVar(X), plugging this in Eq. (1) we get aVar(X) = |a|\sigma^2(X), that is, a=|a|. Hence a\ge 0, and a\ne 0 because Y is not constant, so a is positive.

The proof of Part 2 is left as an exercise.

Property 3. Suppose we want to measure correlation between weight W and height H of people. The measurements are either in kilos and centimeters {W_k},{H_c} or in pounds and feet {W_p},{H_f}. The correlation coefficient is unit-free in the sense that it does not depend on the units used: \rho (W_k,H_c)=\rho (W_p,H_f). Mathematically speaking, correlation is homogeneous of degree 0 in both arguments.
Proof. One measurement is proportional to another, W_k=aW_p,\ H_c=bH_f with some positive constants a,b. By homogeneity
\rho (W_k,H_c)=\frac{Cov(W_k,H_c)}{\sigma(W_k)\sigma(H_c)}=\frac{Cov(aW_p,bH_f)}{\sigma(aW_p)\sigma(bH_f)}=\frac{abCov(W_p,H_f)}{ab\sigma(W_p)\sigma (H_f)}=\rho (W_p,H_f).
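Property 3 is easy to try out numerically. The sketch below uses made-up weights and heights (they are not data from the post) and a hand-written helper corr; it converts kilos to pounds and centimeters to feet and compares the two correlation coefficients. The last line scales one variable by a negative constant to show why the proof asks for positive a,b.

```python
import math

def corr(u, v):
    """Correlation coefficient computed with the uniform weights 1/n."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sv = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov / (su * sv)

# Hypothetical measurements of five people
w_kg = [55.0, 62.5, 70.0, 81.0, 90.5]
h_cm = [160.0, 168.0, 171.0, 180.0, 185.0]

# Change of units: multiplication by positive constants
w_lb = [w / 0.45359237 for w in w_kg]  # 1 pound = 0.45359237 kg
h_ft = [h / 30.48 for h in h_cm]       # 1 foot = 30.48 cm

rho_metric = corr(w_kg, h_cm)
rho_imperial = corr(w_lb, h_ft)
rho_flipped = corr([-w for w in w_kg], h_cm)  # negative scaling flips the sign

print(rho_metric, rho_imperial, rho_flipped)
```

Up to rounding, the first two numbers coincide, while the third has the opposite sign: in the proof the factor ab in the numerator cancels against |a||b| in the denominator only when ab>0.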


Oct 16

Properties of means

Properties of means, covariances and variances are bread and butter of professionals. Here we consider the bread - the means.

Properties of means: as simple as playing with tables

Definition of a random variable. When my Brazilian students asked for an intuitive definition of a random variable, I said: It is a function whose values are unpredictable. Therefore it is prohibited to work with its values and allowed to work only with its various means. For proofs we need a more technical definition: it is a table values+probabilities of type Table 1.

Table 1.  Random variable definition
Values of X    Probabilities
x_1            p_1
...            ...
x_n            p_n

Note: The complete form of writing {p_i} is P(X = {x_i}).

Mean (or expected value) definition: EX = x_1p_1 + ... + x_np_n = \sum\limits_{i = 1}^nx_ip_i. In words, this is a weighted sum of values, where the weights p_i reflect the importance of corresponding x_i.

Note: The expected value is a function whose argument is a complex object (it is described by Table 1) and the value is simple: EX is just a number. And it is not a product of E and X! See how different means fit this definition.

Definition of a linear combination. See here the financial motivation. Suppose that X,Y are two discrete random variables with the same probability distribution {p_1},...,{p_n}. Let a,b be real numbers. The random variable aX + bY is called a linear combination of X,Y with coefficients a,b. Its special cases are aX (X scaled by a) and X + Y (a sum of X and Y). The detailed definition is given by Table 2.

Table 2.  Linear operations definition
Values of X    Values of Y    Probabilities    aX      X + Y        aX + bY
x_1            y_1            p_1              ax_1    x_1 + y_1    ax_1 + by_1
...            ...            ...              ...     ...          ...
x_n            y_n            p_n              ax_n    x_n + y_n    ax_n + by_n

Note: The situation when the probability distributions are different is reduced to the case when they are the same, see my book.

Property 1. Linearity of means. For any random variables X,Y and any numbers a,b one has

(1) E(aX + bY) = aEX + bEY.

Proof. This is one of those straightforward proofs when knowing the definitions and starting with the left-hand side is enough to arrive at the result. Using the definitions in Table 2, the mean of the linear combination is
E(aX + bY)= (a{x_1} + b{y_1}){p_1} + ... + (a{x_n} + b{y_n}){p_n}

(distributing probabilities)
= a{x_1}{p_1} + b{y_1}{p_1} + ... + a{x_n}{p_n} + b{y_n}{p_n}

(grouping by variables)
= (a{x_1}{p_1} + ... + a{x_n}{p_n}) + (b{y_1}{p_1} + ... + b{y_n}{p_n})

(pulling out constants)
= a({x_1}{p_1} + ... + {x_n}{p_n}) + b({y_1}{p_1} + ... + {y_n}{p_n})=aEX+bEY.
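The proof can be replayed with actual tables. In the sketch below the values, the probabilities and the coefficients are made up for illustration; the function E implements the definition of the mean from Table 1, and the linear combination is built row by row as in Table 2.

```python
from fractions import Fraction

# Hypothetical tables: X and Y share the probabilities p_1, p_2, p_3
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
x = [1, 4, -2]
y = [0, 3, 6]
a, b = 2, -5

def E(values):
    """Mean of a discrete variable: the sum of values times probabilities."""
    return sum(v * p for v, p in zip(values, probs))

# Values of aX + bY, computed row by row as in Table 2
combo = [a * xi + b * yi for xi, yi in zip(x, y)]

lhs = E(combo)            # E(aX + bY)
rhs = a * E(x) + b * E(y)  # aEX + bEY
print(lhs, rhs)
```

Exact fractions are used instead of floating-point numbers so that the two sides can be compared for equality without rounding issues.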

See applications: one, and two, and three.

Generalization to the case of a linear combination of n variables:

E({a_1}{X_1} + ... + {a_n}{X_n}) = {a_1}E{X_1} + ... + {a_n}E{X_n}.

Special cases. a) Letting a = b = 1 in (1) we get E(X + Y) = EX + EY. This is called additivity. See an application. b) Letting in (1) b = 0 we get E(aX) = aEX. This property is called homogeneity of degree 1 (you can pull the constant out of the expected value sign). Ask your students to deduce linearity from homogeneity and additivity.

Property 2. Expected value of a constant. Everybody knows what a constant is. Ask your students what is a constant in terms of Table 1. The mean of a constant is that constant, because a constant doesn't change, rain or shine: Ec = c{p_1} + ... + c{p_n} = c({p_1} + ... + {p_n}) = c\cdot 1 = c (we have used the completeness axiom {p_1} + ... + {p_n} = 1). In particular, since EX is a number, it follows that E(EX)=EX.

Property 3. The expectation operator preserves order: if x_i\ge y_i for all i, then EX\ge EY. In particular, the mean of a nonnegative random variable is nonnegative: if x_i\ge 0 for all i, then EX\ge 0.

Indeed, using the fact that all probabilities are nonnegative, we get EX = x_1p_1 + ... + x_np_n\ge y_1p_1 + ... + y_np_n=EY.

Property 4. For independent variables, we have EXY=(EX)(EY) (multiplicativity), which has important implications on its own.

The best thing about the above properties is that, although we proved them under simplified assumptions, they are always true. We keep in mind that the expectation operator E is the device used by Mother Nature to measure the average, and most of the time she keeps hidden from us both the probabilities and the average EX.

Oct 16

The pearls of AP Statistics 31

Demystifying sampling distributions: too much talking about nothing

What we know about sample means

Let X_1,...,X_n be an independent identically distributed sample and consider its sample mean \bar{X}.

Fact 1. The sample mean is an unbiased estimator of the population mean:

(1) E\bar{X}=\frac{1}{n}(EX_1+...+EX_n)=\frac{1}{n}(\mu+...+\mu)=\mu

(use linearity of means).

Fact 2. Variance of the sample mean is

(2) Var(\bar{X})=\frac{1}{n^2}(Var(X_1)+...+Var(X_n))=\frac{1}{n^2}(\sigma^2(X)+...+\sigma^2(X))=\frac{\sigma^2(X)}{n}

(use homogeneity of variance of degree 2 and additivity of variance for independent variables). Hence \sigma(\bar{X})=\frac{\sigma(X)}{\sqrt{n}}.

Fact 3. The implication of these two properties is that the sample mean becomes more concentrated around the population mean as the sample size increases (see at least the law of large numbers; I have a couple more posts about this).

Fact 4. Finally, the z scores of sample means stabilize to a standard normal distribution (the central limit theorem).

What is a sampling distribution?

The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take (Agresti and Franklin, p.308). After this definition, the authors go ahead and discuss the above four facts. Note that none of them requires the knowledge of what the sampling distribution is. The ONLY sampling distribution that appears explicitly in AP Statistics is the binomial. However, in the book the binomial is given in Section 6.3, before sampling distributions, which are the subject of Chapter 7. Section 7.3 explains that the binomial is a sampling distribution but that section is optional. Thus the whole Chapter 7 (almost 40 pages) is redundant.

Then what are sampling distributions for?

Here is a simple example that explains their role. Consider the binomial X_1+X_2 of two observations on an unfair coin. It involves two random variables and therefore is described by a joint distribution with the sample space consisting of pairs of values

Table 1. Sample space for pair (X_1,X_2)

                Coin 1
                0        1
Coin 2    0     (0,0)    (0,1)
          1     (1,0)    (1,1)

Each coin independently takes values 0 and 1 (shown in the margins); the sample space contains four pairs of these values (shown in the main body of the table). The corresponding probability distribution is given by the table

Table 2. Joint probabilities for pair (X_1,X_2)

                Coin 1
                p        q
Coin 2    p     p^2      pq
          q     pq       q^2

Since we are counting only the number of successes, the outcomes (0,1) and (1,0) for the purposes of our experiment are the same. Hence, joining indistinguishable outcomes, we obtain a smaller sample space

Table 3. Sampling distribution for binomial X_1+X_2

# of successes    Corresponding probabilities
0                 p^2
1                 2pq
2                 q^2

The last table is the sampling distribution for the binomial with sample size 2. All the sampling distribution does is replace a large joint distribution Table 1+Table 2 by a smaller distribution Table 3. The beauty of proofs of equations (1) and (2) is that they do not depend on which distribution is used (the distribution is hidden in the expected value operator).
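The passage from Tables 1 and 2 to Table 3 can be mimicked in a few lines. The sketch below is an illustration with an assumed value p=3/10; following the tables above, p is taken to be the probability of the value 0 and q=1-p the probability of the value 1. It first builds the joint distribution of (X_1,X_2) and then joins the outcomes with the same number of successes.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

p = Fraction(3, 10)  # assumed P(X_i = 0), as in the tables above
q = 1 - p            # P(X_i = 1)
n = 2                # number of coins

# Joint distribution: one probability per pair of values (Tables 1 and 2)
joint = {}
for outcome in product((0, 1), repeat=n):
    prob = Fraction(1)
    for v in outcome:
        prob *= q if v == 1 else p  # independence: probabilities multiply
    joint[outcome] = prob

# Sampling distribution of X_1 + X_2: join indistinguishable outcomes (Table 3)
sampling = defaultdict(Fraction)
for outcome, prob in joint.items():
    sampling[sum(outcome)] += prob

for successes in sorted(sampling):
    print(successes, sampling[successes])
```

With n=2 the reduction is from four outcomes to three; setting n to a larger number in the same code shows a much bigger reduction, from 2^n outcomes to n+1.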

Unless you want your students to appreciate the reduction in the sample space brought about by sampling distributions, it is not worth discussing them. See Wikipedia for examples other than the binomial.

Sep 16

All you need to know about the law of large numbers

All about the law of large numbers: properties and applications

Level 1: estimation of population parameters

The law of large numbers is a statement about convergence which is called convergence in probability and denoted \text{plim}. The precise definition is rather complex but the intuition is simple: it is convergence to a spike at the parameter being estimated. Usually, any unbiasedness statement has its analog in terms of the corresponding law of large numbers.

Example 1. The sample mean unbiasedly estimates the population mean: E\bar{X}=EX. Its analog: the sample mean converges to a spike at the population mean: \text{plim}\bar{X}=EX. See the proof based on the Chebyshev inequality.

Example 2. The sample variance unbiasedly estimates the population variance: E\overline{s^2}=Var(X) where s^2=\frac{\sum(X_i-\bar{X})^2}{n-1}. Its analog: the sample variance converges to a spike at the population variance:

(1) \text{plim}\overline{s^2}=Var(X).

Example 3. The sample covariance s_{X,Y}=\frac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{n-1} unbiasedly estimates the population covariance: E\overline{s_{X,Y}}=Cov(X,Y). Its analog: the sample covariance converges to a spike at the population covariance:

(2) \text{plim}\overline{s_{X,Y}}=Cov(X,Y).
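The unbiasedness half of Example 2 can be checked exactly on a small population. The sketch below assumes a made-up population with three values and a sample of size n=3 (neither is from the post); it enumerates all samples and computes the exact expected value of the sample variance with the divisor n-1 and, for comparison, with the divisor n. It does not say anything about the plim statement (1), which concerns large samples.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population table: values and their probabilities
values = [0, 1, 5]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
n = 3  # sample size

pop_mean = sum(v * p for v, p in zip(values, probs))
pop_var = sum((v - pop_mean) ** 2 * p for v, p in zip(values, probs))

def expected_sample_variance(divisor):
    """Exact mean of sum((X_i - Xbar)^2)/divisor over all i.i.d. samples."""
    total = Fraction(0)
    for idx in product(range(len(values)), repeat=n):
        sample = [values[i] for i in idx]
        prob = Fraction(1)
        for i in idx:
            prob *= probs[i]  # independence of observations
        xbar = Fraction(sum(sample), n)
        ss = sum((s - xbar) ** 2 for s in sample)
        total += prob * ss / divisor
    return total

e_s2_unbiased = expected_sample_variance(n - 1)
e_s2_biased = expected_sample_variance(n)
print(pop_var, e_s2_unbiased, e_s2_biased)
```

Dividing by n-1 reproduces the population variance exactly, while dividing by n gives the population variance multiplied by (n-1)/n.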

Up one level: convergence in probability is just convenient

Whether to use convergence in probability is a matter of expedience. For usual limits of sequences we know the properties which I call preservation of arithmetic operations:

\lim(a_n\pm b_n)=\lim a_n\pm \lim b_n,

\lim(a_n\times b_n)=\lim a_n\times\lim b_n,

\lim(a_n/ b_n)=\lim a_n/\lim b_n (provided that \lim b_n\ne 0).

Convergence in probability has exactly the same properties; just replace \lim with \text{plim}.

Next level: making regression estimation more plausible

Using convergence in probability allows us to handle stochastic regressors and avoid the unrealistic assumption that regressors are deterministic.

Convergence in probability and in distribution are two types of convergence of random variables that are widely used in the Econometrics course of the University of London.

Sep 16

Proving unbiasedness of OLS estimators

Proving unbiasedness of OLS estimators - the do's and don'ts


Here we derived the OLS estimators. To distinguish between sample and population means, the variance and covariance in the slope estimator will be provided with the subscript u (for "uniform", see the rationale here).

(1) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(2) \hat{a}=\bar{y}-\hat{b}\bar{x}.

These equations are used in conjunction with the model

(3) y_i=a+bx_i+e_i

where we remember that

(4) Ee_i=0 for all i.

Since (2) depends on (1), we have to start with unbiasedness of the slope estimator.

Using the right representation is critical

We have to show that E\hat{b}=b.

Step 1. Don't apply the expectation directly to (1). Do separate in (1) what is supposed to be E\hat{b}. To reveal the role of errors in (1), plug (3) in (1) and use linearity of covariance with respect to each argument when the other argument is fixed:

\hat{b}=\frac{Cov_u(x,a+bx+e)}{Var_u(x)}=\frac{Cov_u(x,a)+bCov_u(x,x)+Cov_u(x,e)}{Var_u(x)}.


Here Cov_u(x,a)=0 (a constant is uncorrelated with any variable), Cov_u(x,x)=Var_u(x) (covariance of x with itself is its variance), so

(5) \hat{b}=\frac{bVar_u(x)+Cov_u(x,e)}{Var_u(x)}=b+\frac{Cov_u(x,e)}{Var_u(x)}.

Equation (5) is the mean-plus-deviation-from-the-mean decomposition. Many students think that Cov_u(x,e)=0 because of (4). No! The covariance here does not involve the population mean.

Step 2. It pays to make one more step to develop (5). Write out the numerator in (5) using summation:

Cov_u(x,e)=\frac{1}{n}\sum(x_i-\bar{x})(e_i-\bar{e}).


Don't write out Var_u(x)! Presence of two summations confuses many students.

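Multiplying parentheses and using the fact that \sum(x_i-\bar{x})=n\bar{x}-n\bar{x}=0 we have

Cov_u(x,e)=\frac{1}{n}\sum(x_i-\bar{x})e_i-\bar{e}\frac{1}{n}\sum(x_i-\bar{x})=\frac{1}{n}\sum(x_i-\bar{x})e_i.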



To simplify calculations, denote a_i=(x_i-\bar{x})/Var_u(x). Then the slope estimator becomes

(6) \hat{b}=b+\frac{1}{n}\sum a_ie_i.

This is the critical representation.

Unbiasedness of the slope estimator

Convenience condition. The regressor x is deterministic. I call it a convenience condition because it's just a matter of mathematical expedience, and later on we'll study ways to bypass it.

From (6), linearity of means and remembering that the deterministic coefficients a_i behave like constants,

(7) E\hat{b}=E[b+\frac{1}{n}\sum a_ie_i]=b+\frac{1}{n}\sum a_iEe_i=b

by (4). This proves unbiasedness.
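Representation (6) is an algebraic identity, so it can be checked on any numbers. The sketch below uses a made-up regressor, made-up parameters and one made-up realization of the errors (none of them from the post). It computes the slope estimator from its definition (1), compares it with the right-hand side of (6), and also shows that Cov_u(x,e) is not zero for a particular realization, even though Ee_i=0 is assumed in the model.

```python
from fractions import Fraction as F

x = [F(1), F(2), F(4), F(7)]             # hypothetical deterministic regressor
a_true, b_true = F(2), F(3)              # hypothetical parameters
e = [F(1, 2), F(-1), F(1, 4), F(3, 4)]   # one hypothetical draw of the errors
n = len(x)

def mean(v):
    return sum(v) / len(v)

def cov_u(u, v):
    """Sample covariance with the uniform weights 1/n."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

y = [a_true + b_true * xi + ei for xi, ei in zip(x, e)]  # model (3)

b_hat = cov_u(x, y) / cov_u(x, x)      # definition (1)
a_hat = mean(y) - b_hat * mean(x)      # definition (2)

# Representation (6): b_hat = b + (1/n) * sum(a_i * e_i)
a_coef = [(xi - mean(x)) / cov_u(x, x) for xi in x]
b_repr = b_true + sum(ai * ei for ai, ei in zip(a_coef, e)) / n

print(b_hat, b_repr, cov_u(x, e))
```

The estimate differs from the true slope because of the particular errors drawn; unbiasedness is a statement about the average over all possible draws, which is what the expectation in (7) expresses.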

You don't know the difference between the population and sample means until you see them working in the same formula.

Unbiasedness of the intercept estimator

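As above we plug (3) in (2): \hat{a}=\overline{a+bx+e}-\hat{b}\bar{x}=a+b\bar{x}+\bar{e}-\hat{b}\bar{x}. Applying expectation and using E\bar{e}=0 (from (4)) and E\hat{b}=b (from (7)):

E\hat{a}=a+b\bar{x}+E\bar{e}-(E\hat{b})\bar{x}=a+b\bar{x}-b\bar{x}=a.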



Since in (1)  there is division by Var_u(x), the condition Var_u(x)\ne 0 is the main condition for existence of OLS estimators. From the above proof we see that (4) is the main condition for unbiasedness.

Aug 16

The pearls of AP Statistics 24

Unbiasedness: the stumbling block of a Statistics course

God is in the detail

They say: A good estimator has a sampling distribution that is centered at the parameter. We define center in this case as the mean of that sampling distribution. An estimator with this property is said to be unbiased. From Section 7.2, we know that for random sampling the mean of the sampling distribution of the sample mean \bar{x} equals the population mean μ. So, the sample mean \bar{x} is an unbiased estimator of μ. (Agresti and Franklin, p.351).

I say: This is a classic case of turning everything upside down, and this happens when the logical chain is broken. Unbiasedness is one of the pillars of Statistics. It can and should be given right after the notion of population mean is introduced. The authors make the definition dependent on random sampling, sampling distribution and a whole Section 7.2. Therefore I highly doubt that any student can grasp the above definition. My explanation below may not be the best; I just want to prompt the reader to think about alternatives to the above "definition".

Population mean versus sample mean

By definition, in the discrete case, a random variable is a table values+probabilities:

Values    Probabilities
X_1       p_1
...       ...
X_n       p_n

If we know this table, we can define the population mean \mu=EX=p_1X_1+...+p_nX_n. This is a weighted average of the variable values because the probabilities are percentages: 0<p_i<1 for all i and p_1+...+p_n=1. The expectation operator E is the device used by Mother Nature to measure the average, and most of the time she keeps hidden from us both the probabilities and the average EX.

Now suppose that X_1,...,X_n represent a sample from the given population (and not the values in the above table). We can define the sample mean \bar{X}=\frac{X_1+...+X_n}{n}. Being a little smarter than monkeys, we use the uniform distribution p_i=1/n instead of the unknown probabilities. Unlike the population average, the sample average is always possible to calculate, as long as the sample is available.

Consider a good shooter shooting at three targets using a good rifle.

Unbiasedness intuition

The black dots represent points hit by bullets on three targets. In Figure 1, there was only one shot. What is your best guess about where the bull's eye is? Regarding Figure 2, everybody says that probably the bull's eye is midway (red point) between points A and B. In Figure 3, the sample mean is represented by the red point. Going back to unbiasedness: 1) the bull's eye is the unknown population parameter that needs to be estimated, 2) points hit by bullets are sample observations, 3) their sample mean is represented by red points, 4) the red points estimate the location of the bull's eye. The sample mean is said to be an unbiased estimator of population mean because

(1) E\bar{X}=\mu.

In words, Mother Nature says that, in her opinion, on average our bullets hit the bull's eye.

This explanation is an alternative to the one you can see in many books: in the long run, the sample mean correctly estimates the population mean. That explanation in fact replaces equation (1) by the corresponding law of large numbers. My explanation just underlines the fact that there is an abstract average, that we cannot use, and the sample average, that we invent to circumvent that problem.

See related theoretical facts here.

Aug 16

The pearls of AP Statistics 22

The law of large numbers - a bird's view

They say: In 1689, the Swiss mathematician Jacob Bernoulli proved that as the number of trials increases, the proportion of occurrences of any given outcome approaches a particular number (such as 1/6) in the long run. (Agresti and Franklin, p.213).

I say: The expression “law of large numbers” appears in the book 13 times, yet its meaning is never clearly explained. The closest approximation to the truth is the above sentence about Jacob Bernoulli. To see if this explanation works, tell it to your students and ask what they understood. To me, this is a clear case when withholding theory harms understanding.

Intuition comes first. I ask my students: if you flip a fair coin 100 times, what do you expect the proportion of ones to be? Absolutely everybody replies correctly, just the form of the answer may be different (50-50 or 0.5 or 50 out of 100). Then I ask: probably it will not be exactly 0.5 but if you flip the coin 1000 times, do you expect the proportion to be closer to 0.5? Everybody says: Yes. Next I ask: Suppose the coin is unfair and the probability of 1 appearing is 0.7. What would you expect the proportion to be close to in large samples? Most students come up with the right answer: 0.7. Congratulations, you have discovered what is called a law of large numbers!

Then we give a theoretical format to our little discovery. p=0.7 is a population parameter. Flipping a coin n times we obtain observations X_1,...,X_n. The proportion of ones is the sample mean \bar{X}=\frac{X_1+...+X_n}{n}. The law of large numbers says two things: 1) as the sample size increases, the sample mean approaches the population mean. 2) At the same time, its variation about the population mean becomes smaller and smaller.

Part 1) is clear to everybody. To corroborate statement 2), I give two facts. Firstly, we know that the standard deviation of the sample mean is \frac{\sigma}{\sqrt{n}}. From this we see that as n increases, the standard deviation of the sample mean decreases and the values of the sample mean become more and more concentrated around the population mean. We express this by saying that the sample mean converges to a spike. Secondly, I produce two histograms. With the sample size n=100, there are two modes (just 10%) of the histogram at 0.69 and 0.72, while 0.7 was used as the population mean in my simulations. Besides, the spread of the values is large. With n=1000, the mode (27%) is at the true value 0.7, and the spread is low.

Histogram of proportions with n=100


Histogram of proportions with n=1000
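A similar experiment can be sketched in a few lines of Python instead of Excel. This is not the code behind the histograms above; the number of repetitions (500) and the seed are arbitrary assumptions. For each sample size the sketch draws many samples from a coin with p=0.7, records the proportion of ones in each sample, and compares the spread of the proportions with the theoretical value \sqrt{p(1-p)/n}.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed, only to make the run reproducible
p = 0.7         # probability of 1, as in the text
reps = 500      # assumed number of simulated samples per sample size

def sample_proportions(n):
    """Proportion of ones in each of `reps` samples of size n."""
    return [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

props_100 = sample_proportions(100)
props_1000 = sample_proportions(1000)

for n, props in ((100, props_100), (1000, props_1000)):
    print(
        f"n={n}: mean of proportions={statistics.mean(props):.4f}, "
        f"spread={statistics.pstdev(props):.4f}, "
        f"theoretical spread={math.sqrt(p * (1 - p) / n):.4f}"
    )

sd_100 = statistics.pstdev(props_100)
sd_1000 = statistics.pstdev(props_1000)
```

In both cases the proportions are centered near 0.7, and the spread for n=1000 is roughly \sqrt{10} times smaller than for n=100, in line with the formula \frac{\sigma}{\sqrt{n}}.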

Finally, we relate our little exercise to practical needs. In practice, the true mean is never known. But we can obtain a sample and calculate its mean. With a large sample size, the sample mean will be close to the truth. More generally, take any other population parameter, such as its standard deviation, and calculate the sample statistic that estimates it, such as the sample standard deviation. Again, the law of large numbers applies and the sample statistic will be close to the population parameter. The histograms have been obtained as explained here and here. Download the Excel file.

Feb 16

What is a mean value - all means in one place


In introductory Stats texts, various means are scattered all over the place, and there is no indication of links between them. This is what we address here.

The population mean of a discrete random variable is the starting point. Such a variable, by definition, is a table values+probabilities, see this post, and its mean is EX=\sum_{i=1}^nX_ip_i. If that random variable is uniformly distributed, in the same post we explain that EX=\bar{X}, so the sample mean is a special case of a population mean.

The next point is the link between the grouped data formula and sample mean. Recall the procedure for finding absolute frequencies. Let Y_1,...,Y_n be the values in the sample (it is convenient to assume that they are arranged in an ascending order). Equal values are joined in groups. Let X_1,...,X_m denote the distinct values in the sample and n_1,...,n_m their absolute frequencies. Their total is, clearly, n. The sample mean is
\bar{Y}=(Y_1+...+Y_n)/n
(sorting out Y's into groups with equal values)
=\left(\overbrace {X_1+...+X_1}^{n_1{\rm{\ times}}}+...+\overbrace{X_m+...+X_m}^{n_m{\rm{\ times}}}\right)/n
=(n_1X_1 + ... + n_mX_m)/n,

which is the grouped data formula. We have shown that the grouped data formula obtains as a special case of the sample mean when equal values are joined into groups.

Next, denoting r_i=n_i/n the relative frequencies, we get

(n_1X_1 + ... + n_mX_m)/n=

(dividing through by n)

=\frac{n_1}{n}X_1+...+\frac{n_m}{n}X_m=r_1X_1+...+r_mX_m.


If we accept the relative frequencies as probabilities, then this becomes the population mean. Thus, with this convention, the grouped data formula and population mean are the same.
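The chain sample mean - grouped data formula - mean with relative frequencies can be traced on a small sample. The numbers below are made up for illustration; Counter from the standard library plays the role of the procedure for finding absolute frequencies.

```python
from collections import Counter
from fractions import Fraction

Y = [3, 5, 3, 8, 5, 3, 8, 8, 8, 5]  # hypothetical sample
n = len(Y)

# Plain sample mean
sample_mean = Fraction(sum(Y), n)

# Absolute frequencies n_j of the distinct values X_j
counts = Counter(Y)

# Grouped data formula: (n_1 X_1 + ... + n_m X_m) / n
grouped_mean = Fraction(sum(n_j * x_j for x_j, n_j in counts.items()), n)

# Relative frequencies r_j = n_j / n used as probabilities
rel_freq = {x_j: Fraction(n_j, n) for x_j, n_j in counts.items()}
weighted_mean = sum(x_j * r_j for x_j, r_j in rel_freq.items())

print(sample_mean, grouped_mean, weighted_mean)
```

All three numbers coincide, and the relative frequencies sum to one, which is what allows us to treat them as probabilities.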

Finally, the mean of a continuous random variable X which has a density p_X is defined by EX=\int_{-\infty}^\infty tp_X(t)dt. In Section 6.3 of my book it is shown that the mean of a continuous random variable is a limit of grouped means.


Properties of means apply equally to all mean types.

Feb 16

Summation sign rules: identities for simple regression


There are many sources on the Internet. This and this are relatively simple, while this one is pretty advanced. They cover the basics. My purpose is more specific: to show how to obtain a couple of identities in terms of summation signs from general properties of variance and covariance.

Shortcut for covariance. This is the name of the following identity:

(1) E(X-EX)(Y-EY)=E(XY)-(EX)(EY)

where on the left we have the definition of Cov(X,Y) and on the right we have an alternative expression (a shortcut) for the same thing. Letting X=Y in (1) we get a shortcut for variance:

(2) E(X-EX)^2=E(X^2)-(EX)^2,

see the direct proof here. Again, on the left we have the definition of Var(X) and on the right a shortcut for the same.

In this post I mentioned that

for a discrete uniformly distributed variable with a finite number of elements, the population mean equals the sample mean if the sample is the whole population.

This is what it means. The most useful definition of a discrete random variable is this: it is a table values+probabilities of type

Table 1. Discrete random variable with n values 
Values           X_1    ...    X_n
Probabilities    p_1    ...    p_n

Here X_1,...,X_n are the values and p_1,...,p_n are the probabilities (they sum to one). With this table, it is easy to define the mean of X:

(3) EX=\sum_{i=1}^nX_ip_i.

A variable like this is called uniformly distributed if all probabilities are the same:

Table 2. Uniformly distributed discrete random variable with n values
Values           X_1    ...    X_n
Probabilities    1/n    ...    1/n

In this case (3) becomes

(4) EX=\bar{X}.

This explains the statement from my post. Using (4), equations (1) and (2) rewrite as

(5) \overline{(X-\bar{X})(Y-\bar{Y})}=\overline{XY}-\bar{X}\bar{Y},\ \overline{(X-\bar{X})^2}=\overline{X^2}-(\bar{X})^2.

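Try to write this using summation signs. For example, the first identity in (5) becomes

\frac{1}{n}\sum_{i=1}^n\left(X_i-\frac{1}{n}\sum_{j=1}^nX_j\right)\left(Y_i-\frac{1}{n}\sum_{j=1}^nY_j\right)=\frac{1}{n}\sum_{i=1}^nX_iY_i-\left(\frac{1}{n}\sum_{i=1}^nX_i\right)\left(\frac{1}{n}\sum_{i=1}^nY_i\right).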



This is crazy and trying to prove this directly would be even crazier.
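While a direct proof with summation signs is unpleasant, a numerical check of identities (5) is painless. The sketch below uses two made-up samples and a helper bar that implements the sample mean, that is, the mean with respect to the uniform distribution from Table 2.

```python
from fractions import Fraction

# Hypothetical paired observations
X = [Fraction(v) for v in (2, -1, 4, 7, 3)]
Y = [Fraction(v) for v in (1, 0, 5, 2, 2)]

def bar(values):
    """Sample mean: the mean with uniform probabilities 1/n."""
    return sum(values) / len(values)

# First identity in (5): shortcut for the sample covariance
lhs_cov = bar([(x - bar(X)) * (y - bar(Y)) for x, y in zip(X, Y)])
rhs_cov = bar([x * y for x, y in zip(X, Y)]) - bar(X) * bar(Y)

# Second identity in (5): shortcut for the sample variance
lhs_var = bar([(x - bar(X)) ** 2 for x in X])
rhs_var = bar([x ** 2 for x in X]) - bar(X) ** 2

print(lhs_cov, rhs_cov, lhs_var, rhs_var)
```

A check on one pair of samples is of course not a proof; the proof is the observation that (5) is just (1) and (2) applied to the uniform distribution.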

Remark. Let X_1,...,X_n be a sample from an arbitrary distribution. Regardless of the parent distribution, the artificial uniform distribution from Table 2 can still be applied to the sample. To avoid confusion with the expected value E with respect to the parent distribution, instead of (4) we can write

(6) E_uX=\bar{X}

where the subscript u stands for "uniform". With that understanding, equations (5) are still true. The power of this approach is that all expressions in (5) are random variables which allows for further application of the expected value E with respect to the parent distribution.