
13 Oct 16

Properties of means

Properties of means, covariances and variances are the bread and butter of professionals. Here we consider the bread: the means.

Properties of means: as simple as playing with tables

Definition of a random variable. When my Brazilian students asked for an intuitive definition of a random variable, I said: it is a function whose values are unpredictable. Therefore we are not allowed to work with its values, only with its various means. For proofs we need a more technical definition: a random variable is a table of values plus probabilities, of the type of Table 1.

Table 1. Random variable definition
Values of X    Probabilities
x_1            p_1
...            ...
x_n            p_n

Note: The complete form of writing p_i is P(X = x_i).

Definition of the mean (expected value): EX = x_1p_1 + ... + x_np_n = \sum\limits_{i = 1}^nx_ip_i. In words, this is a weighted sum of values, where the weights p_i reflect the importance of the corresponding x_i.

Note: The expected value is a function whose argument is a complex object (described by Table 1) and whose value is simple: EX is just a number. And it is not a product of E and X! See how different means fit this definition.

Definition of a linear combination. See here for the financial motivation. Suppose that X,Y are two discrete random variables with the same probability distribution p_1,...,p_n. Let a,b be real numbers. The random variable aX + bY is called a linear combination of X,Y with coefficients a,b. Its special cases are aX (X scaled by a) and X + Y (the sum of X and Y). The detailed definition is given by Table 2.

Table 2. Linear operations definition
Values of X   Values of Y   Probabilities   aX      X + Y        aX + bY
x_1           y_1           p_1             ax_1    x_1 + y_1    ax_1 + by_1
...           ...           ...             ...     ...          ...
x_n           y_n           p_n             ax_n    x_n + y_n    ax_n + by_n

Note: The situation when the probability distributions are different is reduced to the case when they are the same, see my book.

Property 1. Linearity of means. For any random variables X,Y and any numbers a,b one has

(1) E(aX + bY) = aEX + bEY.

Proof. This is one of those straightforward proofs in which knowing the definitions and starting with the left-hand side is enough to arrive at the result. Using the definitions in Table 2, the mean of the linear combination is
E(aX + bY)= (ax_1 + by_1)p_1 + ... + (ax_n + by_n)p_n

(distributing probabilities)
= ax_1p_1 + by_1p_1 + ... + ax_np_n + by_np_n

(grouping by variables)
= (ax_1p_1 + ... + ax_np_n) + (by_1p_1 + ... + by_np_n)

(pulling out constants)
= a(x_1p_1 + ... + x_np_n) + b(y_1p_1 + ... + y_np_n)=aEX+bEY.
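To see the table definitions at work, here is a minimal numerical check in Python with numpy (the values, probabilities and coefficients are made up for illustration):

import numpy as np

# A discrete random variable is a table: values plus probabilities (Table 1).
p = np.array([0.2, 0.5, 0.3])          # probabilities, sum to 1
x = np.array([1.0, 2.0, 4.0])          # values of X
y = np.array([-1.0, 0.5, 3.0])         # values of Y (same probability column)
a, b = 2.0, -3.0

E = lambda v: np.sum(v * p)            # EX = weighted sum of values

left = E(a * x + b * y)                # E(aX + bY) computed from the table of aX + bY
right = a * E(x) + b * E(y)            # aEX + bEY
print(left, right)                     # the two numbers coincide, illustrating (1)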

See applications: one, and two, and three.

Generalization to the case of a linear combination of n variables:

E(a_1X_1 + ... + a_nX_n) = a_1EX_1 + ... + a_nEX_n.

Special cases. a) Letting a = b = 1 in (1) we get E(X + Y) = EX + EY. This is called additivity. See an application. b) Letting b = 0 in (1) we get E(aX) = aEX. This property is called homogeneity of degree 1 (you can pull a constant out of the expected value sign). Ask your students to deduce linearity from homogeneity and additivity.

Property 2. Expected value of a constant. Everybody knows what a constant is. Ask your students what a constant is in terms of Table 1. The mean of a constant is that constant, because a constant doesn't change, rain or shine: Ec = cp_1 + ... + cp_n = c(p_1 + ... + p_n) = c (we have used the completeness axiom p_1 + ... + p_n = 1). In particular, it follows that E(EX)=EX.

Property 3. The expectation operator preserves order: if x_i\ge y_i for all i, then EX\ge EY. In particular, the mean of a nonnegative random variable is nonnegative: if x_i\ge 0 for all i, then EX\ge 0.

Indeed, using the fact that all probabilities are nonnegative, we get EX = x_1p_1 + ... + x_np_n\ge y_1p_1 + ... + y_np_n=EY.

Property 4. For independent variables, we have EXY=(EX)(EY) (multiplicativity), which has important implications of its own.
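Multiplicativity can also be checked numerically. A minimal sketch in Python (the values and marginal probabilities are made up): for independent X,Y the joint probabilities are products of the marginals, and EXY is computed from the joint table.

import numpy as np

x = np.array([0.0, 1.0, 5.0]); px = np.array([0.3, 0.3, 0.4])   # distribution of X
y = np.array([-2.0, 4.0]);     py = np.array([0.6, 0.4])        # distribution of Y

joint = np.outer(px, py)                 # independence: P(X=x_i, Y=y_j) = p_i q_j
EXY = np.sum(np.outer(x, y) * joint)     # E(XY) over the joint table
print(EXY, np.sum(x*px) * np.sum(y*py))  # multiplicativity: EXY = (EX)(EY)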

The best thing about the above properties is that, although we proved them under simplified assumptions, they are always true. We keep in mind that the expectation operator E is the device used by Mother Nature to measure the average, and most of the time she keeps hidden from us both the probabilities and the average EX.



17 Mar 19

AP Statistics the Genghis Khan way 2


Last semester I tried to explain theory through numerical examples. The results were terrible. Even the best students didn't live up to my expectations. The midterm grades were so low that I did something I had never done before: I allowed my students to write an analysis of the midterm at home. Those who were able to verbally articulate the answers to me received a bonus that allowed them to pass the semester.

This semester I made a U-turn. I announced that in the first half of the semester we would concentrate on theory, and we followed this plan. Out of 35 students, 20 significantly improved their performance and 15 remained where they were.

Midterm exam, version 1

1. General density definition (6 points)

a. Define the density p_X of a random variable X. Draw the density of heights of adults, making simplifying assumptions if necessary. Don't forget to label the axes.

b. According to your plot, how much is the integral \int_{-\infty}^0p_X(t)dt? Explain.

c. Why can't the density be negative?

d. Why should the total area under the density curve be 1?

e. Where are basketball players on your graph? Write down the corresponding expression for probability.

f. Where are dwarfs on your graph? Write down the corresponding expression for probability.

This question is about the interval formula. In each case students have to write the expression for the probability and the corresponding integral of the density. At this level, I don't talk about the distribution function and introduce the density through the interval formula.

2. Properties of means (8 points)

a. Define a discrete random variable and its mean.

b. Define linear operations with random variables.

c. Prove linearity of means.

d. Prove additivity and homogeneity of means.

e. How much is the mean of a constant?

f. Using induction, derive the linearity of means for the case of n variables from the case of two variables (3 points).

3. Covariance properties (6 points)

a. Derive linearity of covariance in the first argument when the second is fixed.

b. How much is covariance if one of its arguments is a constant?

c. What is the link between variance and covariance? If you know one of these functions, can you find the other (there should be two answers)? (4 points)

4. Standard normal variable (6 points)

a. Define the density p_z(t) of a standard normal.

b. Why is the function p_z(t) even? Illustrate this fact on the plot.

c. Why is the function f(t)=tp_z(t) odd? Illustrate this fact on the plot.

d. Justify the equation Ez=0.

e. Why is V(z)=1?

f. Let t>0. Show on the same plot areas corresponding to the probabilities A_1=P(0<z<t), A_2=P(z>t), A_3=P(z<-t), A_4=P(-t<z<0). Write down the relationships between A_1,...,A_4.

5. General normal variable (3 points)

a. Define a general normal variable X.

b. Use this definition to find the mean and variance of X.

c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters \sigma =2, \mu =3.

Midterm exam, version 2

1. General density definition (6 points)

a. Define the density p_X of a random variable X. Draw the density of work experience of adults, making simplifying assumptions if necessary. Don't forget to label the axes.

b. According to your plot, how much is the integral \int_{-\infty}^0p_X(t)dt? Explain.

c. Why can't the density be negative?

d. Why should the total area under the density curve be 1?

e. Where are retired people on your graph? Write down the corresponding expression for probability.

f. Where are young people (up to 25 years old) on your graph? Write down the corresponding expression for probability.

2. Variance properties (8 points)

a. Define variance of a random variable. Why is it non-negative?

b. State the formula for the variance of a linear combination of two variables.

c. How much is variance of a constant?

d. What is the formula for variance of a sum? What do we call homogeneity of variance?

e. What is larger: V(X+Y) or V(X-Y)? (2 points)

f. One investor has 100 shares of Apple, another - 200 shares. Which investor's portfolio has larger variability? (2 points)

3. Poisson distribution (6 points)

a. Write down the Taylor expansion and explain the idea. How are the Taylor coefficients found?

b. Use the Taylor series for the exponential function to define the Poisson distribution.

c. Find the mean of the Poisson distribution. What is the interpretation of the parameter \lambda in practice?

4. Standard normal variable (6 points)

a. Define the density p_z(t) of a standard normal.

b. Why is the function p_z(t) even? Illustrate this fact on the plot.

c. Why is the function f(t)=tp_z(t) odd? Illustrate this fact on the plot.

d. Justify the equation Ez=0.

e. Why is V(z)=1?

f. Let t>0. Show on the same plot areas corresponding to the probabilities A_1=P(0<z<t), A_2=P(z>t), A_{3}=P(z<-t), A_4=P(-t<z<0). Write down the relationships between A_{1},...,A_{4}.

5. General normal variable (3 points)

a. Define a general normal variable X.

b. Use this definition to find the mean and variance of X.

c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters \sigma =2, \mu =3.



11 Feb 17

Gauss-Markov theorem

The Gauss-Markov theorem states that the OLS estimator is the most efficient one. Without algebra, you cannot take a single step further, whether towards the precise theoretical statement or an application.

Why do we care about linearity?

The concept of linearity has come up many times in my posts. Here we have to start from scratch in order to apply it to estimators.

The slope in simple regression

(1) y_i=a+bx_i+e_i

can be estimated by

\hat{b}(y,x)=\frac{Cov_u(y,x)}{Var_u(x)}.

Note that the notation makes explicit the dependence of the estimator on x,y. Imagine that we have two sets of observations: (y_1^{(1)},x_1),...,(y_n^{(1)},x_n) and (y_1^{(2)},x_1),...,(y_n^{(2)},x_n) (the x coordinates are the same but the y coordinates are different). In addition, the regressor is deterministic. The x's could be spatial units and the y's temperature measurements at these units at two different moments.

Definition. We say that \hat{b}(y,x) is linear with respect to y if for any two vectors y^{(i)}= (y_1^{(i)},...,y_n^{(i)}), i=1,2, and numbers c,d we have

\hat{b}(cy^{(1)}+dy^{(2)},x)=c\hat{b}(y^{(1)},x)+d\hat{b}(y^{(2)},x).

This definition is quite similar to that of linearity of means. Linearity of the estimator with respect to y easily follows from linearity of covariance

\hat{b}(cy^{(1)}+dy^{(2)},x)=\frac{Cov_u(cy^{(1)}+dy^{(2)},x)}{Var_u(x)}=\frac{cCov_u(y^{(1)},x)+dCov_u(y^{(2)},x)}{Var_u(x)}=c\hat{b}(y^{(1)},x)+d\hat{b}(y^{(2)},x).

In addition to knowing how to establish linearity, it's a good idea to be able to see when something is not linear. Recall that linearity implies homogeneity of degree 1. Hence, if something is not homogeneous of degree 1, it cannot be linear. The OLS estimator is not linear in x because it is homogeneous of degree -1 in x:

\hat{b}(y,cx)=\frac{Cov_u(y,cx)}{Var_u(cx)}=\frac{c}{c^2}\frac{Cov_u(y,x)}{Var_u(x)}=\frac{1}{c}\hat{b}(y,x).
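A quick numerical illustration of both facts (a sketch on simulated data; Cov_u and Var_u are the uniform-weights sample covariance and variance, computed here with numpy's population versions):

import numpy as np

rng = np.random.default_rng(0)
n = 50
x  = rng.normal(size=n)
y1 = 1 + 2*x + rng.normal(size=n)
y2 = -3 + 0.5*x + rng.normal(size=n)

def b_hat(y, x):
    # slope estimator Cov_u(y, x) / Var_u(x) with uniform weights 1/n
    return np.cov(y, x, bias=True)[0, 1] / np.var(x)

c, d = 2.0, -1.5
print(b_hat(c*y1 + d*y2, x), c*b_hat(y1, x) + d*b_hat(y2, x))  # linearity in y
print(b_hat(y1, 3*x), b_hat(y1, x) / 3)                        # homogeneity of degree -1 in x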

Gauss-Markov theorem

Students don't have problems remembering the acronym BLUE: the OLS estimator is the Best Linear Unbiased Estimator. Decoding this acronym starts from the end.

  1. An estimator, by definition, is a function of sample data.
  2. Unbiasedness of OLS estimators is thoroughly discussed here.
  3. Linearity of the slope estimator with respect to y has been proved above. Linearity with respect to x is not required.
  4. Now we look at the class of all slope estimators that are linear with respect to y. As an exercise, show that the instrumental variables estimator belongs to this class.

Gauss-Markov Theorem. Under the classical assumptions, the OLS estimator of the slope has the smallest variance in the class of all unbiased slope estimators that are linear with respect to y.

In particular, the OLS estimator of the slope is more efficient than the IV estimator. The beauty of this result is that you don't need expressions of their variances (even though they can be derived).

Remark. Even the above formulation is incomplete. In fact, the pair intercept estimator plus slope estimator is efficient. This requires matrix algebra.

 



11 Jan 17

Regressions with stochastic regressors 1

Regressions with stochastic regressors 1: applying conditioning

The convenience condition states that the regressor in simple regression is deterministic. Here we look at how this assumption can be avoided using conditional expectation and variance. General idea: you check which parts of the proofs don't go through with stochastic regressors and modify the assumptions accordingly. It turns out that only the assumptions concerning the error term need to be replaced by their conditional counterparts.

Unbiasedness in case of stochastic regressors

We consider the slope estimator for the simple regression

y_i=a+bx_i+e_i

assuming that x_i is stochastic.

First grab the critical representation (6) derived here:

(1) \hat{b}=b+\frac{1}{n}\sum a_i(x)e_i, where a_i(x)=(x_i-\bar{x})/Var_u(x).

The usual linearity of means E(aX + bY) = aEX + bEY applied to prove unbiasedness doesn't work because now the coefficients are stochastic (in other words, they are not constant). But we have generalized linearity which for the purposes of this proof can be written as

(2) E(a(x)S+b(x)T|x)=a(x)E(S|x)+b(x)E(T|x).

Let us replace the unbiasedness condition by its conditional version:

A3'. Unbiasedness condition: E(e_i|x)=0.

Then (1) and (2) give

(3) E(\hat{b}|x)=b+\frac{1}{n}\sum a_i(x)E(e_i|x)=b,

which can be called conditional unbiasedness. Next applying the law of iterated expectations E[E(S|x)]=ES we obtain unconditional unbiasedness:

E\hat{b}=E[E(\hat{b}|x)]=Eb=b.

Variance in case of stochastic regressors

As one can guess, we have to replace efficiency conditions by their conditional versions:

A4'. Conditional uncorrelatedness of errors. Assume that E(e_ie_j|x)=0 for all i\ne j.

A5'. Conditional homoscedasticity. All errors have the same conditional variances: E(e_i^2|x)=\sigma^2 for all i (\sigma^2 is a constant).

Now we can derive the conditional variance expression, using properties from this post:

Var(\hat{b}|x)=Var(b+\frac{1}{n}\sum_i a_i(x)e_i|x) (dropping a constant doesn't affect variance)

=Var(\frac{1}{n}\sum_i a_i(x)e_i|x) (for conditionally uncorrelated variables, conditional variance is additive)

=\sum_i Var(\frac{1}{n}a_i(x)e_i|x) (conditional variance is homogeneous of degree 2)

=\frac{1}{n^2}\sum_i a_i^2(x)Var(e_i|x) (applying conditional homoscedasticity)

=\frac{1}{n^2}\sum_i a_i^2(x)\sigma^2 (plugging a_i(x))

=\frac{1}{n^2}\sum_i(x_i-\bar{x})^2\sigma^2/Var^2_u(x) (using the notation of sample variance)

(4) =\frac{1}{n}Var_u(x)\sigma^2/Var^2_u(x)=\sigma^2/(nVar_u(x)).

Finally, using the law of total variance Var(S)=Var(E(S|x))+E[Var(S|x)] and equations (3) and (4) we obtain

(5) Var(\hat{b})=Var(b)+E[\sigma^2/(nVar_u(x))]=\frac{\sigma^2}{n}E[\frac{1}{Var_u(x)}].
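A Monte Carlo sketch of unconditional unbiasedness and of (5) (the design with a normal stochastic regressor and normal errors is made up for illustration):

import numpy as np

rng = np.random.default_rng(1)
a, b, sigma, n, R = 1.0, 2.0, 1.5, 30, 20000

b_hats = np.empty(R)
inv_var_x = np.empty(R)
for r in range(R):
    x = rng.normal(size=n)                      # stochastic regressor, redrawn each replication
    e = sigma * rng.normal(size=n)              # errors satisfying A3'-A5' conditionally on x
    y = a + b*x + e
    vx = np.var(x)                              # Var_u(x), uniform weights 1/n
    b_hats[r] = np.cov(y, x, bias=True)[0, 1] / vx
    inv_var_x[r] = 1.0 / vx

print(b_hats.mean(), b)                                   # unbiasedness: E b_hat = b
print(b_hats.var(), sigma**2 / n * inv_var_x.mean())      # variance formula (5), approximately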

Conclusion

Replacing the three assumptions about the error by their conditional counterparts allows us to obtain almost perfect analogs of the usual properties of OLS estimators: the usual (unconditional) unbiasedness plus the estimator variance, in which the part containing the regressor should be averaged, to account for its randomness. If you think that solving the problem of stochastic regressors requires nothing more than the application of a couple of mathematical tricks, I agree with you.



3 Nov 16

Properties of covariance

Wikipedia says: The magnitude of the covariance is not easy to interpret. I add: We keep the covariance around mainly for its algebraic properties. It deserves studying because it appears in two important formulas: correlation coefficient and slope estimator in simple regression (see derivation, simplified derivation and proof of unbiasedness).

Definition. For two random variables X,Y their covariance is defined by

Cov (X,Y) = E(X - EX)(Y - EY)

(it's the mean value of the product of the deviations of two variables from their respective means).

Properties of covariance

Property 1. Linearity. Covariance is linear in the first argument when the second argument is fixed: for any random variables X,Y,Z and numbers a,b one has
(1) Cov (aX + bY,Z) = aCov(X,Z) + bCov (Y,Z).
Proof. We start by writing out the left side of Equation (1):
Cov(aX + bY,Z)=E[(aX + bY)-E(aX + bY)](Z-EZ)
(using linearity of means)
= E(aX + bY - aEX - bEY)(Z - EZ)
(collecting similar terms)
= E[a(X - EX) + b(Y - EY)](Z - EZ)
(distributing (Z - EZ))
= E[a(X - EX)(Z - EZ) + b(Y - EY)(Z - EZ)]
(using linearity of means)
= aE(X - EX)(Z - EZ) + bE(Y - EY)(Z - EZ)
= aCov(X,Z) + bCov(Y,Z).

Exercise. Covariance is also linear in the second argument when the first argument is fixed. Write out and prove this property. You can notice the importance of using parentheses and brackets.

Property 2. Shortcut for covariance: Cov(X,Y) = EXY - (EX)(EY).
Proof. Cov(X,Y)= E(X - EX)(Y - EY)
(multiplying out)
= E[XY - X(EY) - (EX)Y + (EX)(EY)]
(EX,EY are constants; use linearity)
=EXY-(EX)(EY)-(EX)(EY)+(EX)(EY)=EXY-(EX)(EY).
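Both properties are easy to verify numerically. A minimal sketch, with X,Y,Z sharing one probability column as in Table 2 of the post on means (all numbers are made up):

import numpy as np

p = np.array([0.1, 0.4, 0.3, 0.2])       # probabilities
X = np.array([1.0, -2.0, 0.5, 3.0])
Y = np.array([2.0, 2.0, -1.0, 0.0])
Z = np.array([-1.0, 4.0, 1.0, 2.0])
a, b = 3.0, -2.0

E   = lambda V: np.sum(V * p)
Cov = lambda U, V: E((U - E(U)) * (V - E(V)))

print(Cov(a*X + b*Y, Z), a*Cov(X, Z) + b*Cov(Y, Z))   # linearity in the first argument
print(Cov(X, Y), E(X*Y) - E(X)*E(Y))                  # shortcut for covariance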

Definition. Random variables X,Y are called uncorrelated if Cov(X,Y) = 0.

Uncorrelatedness is close to independence, so the intuition is the same: one variable does not influence the other. You can also say that there is no linear relationship between uncorrelated variables. The mathematical side is not the same: uncorrelatedness is a more general property than independence.

Property 3. Independent variables are uncorrelated: if X,Y are independent, then Cov(X,Y) = 0.
Proof. By the shortcut for covariance and multiplicativity of means for independent variables we have Cov(X,Y) = EXY - (EX)(EY) = 0.

Property 4. Covariance with a constant. Any random variable is uncorrelated with any constant: Cov(X,c) = E(X - EX)(c - Ec) = 0.

Property 5. Symmetry. Covariance is a symmetric function of its arguments: Cov(X,Y)=Cov(Y,X). This is obvious.

Property 6. Relationship between covariance and variance:

Cov(X,X)=E(X-EX)(X-EX)=Var(X).



25 Oct 16

Properties of variance

All properties of variance in one place

Certainty is the mother of quiet and repose, and uncertainty the cause of variance and contentions. Edward Coke

Preliminaries: study properties of means with proofs.

Definition. Yes, uncertainty leads to variance, and we measure it by Var(X)=E(X-EX)^2. It is useful to use the name deviation from mean for X-EX and realize that E(X-EX)=0, so that the mean of the deviation from mean cannot serve as a measure of variation of X around EX.

Property 1. Variance of a linear combination. For any random variables X,Y and numbers a,b one has
(1) Var(aX + bY)=a^2Var(X)+2abCov(X,Y)+b^2Var(Y).
The term 2abCov(X,Y) in (1) is called an interaction term. See this post for the definition and properties of covariance.
Proof.
Var(aX + bY)=E[aX + bY -E(aX + bY)]^2

(using linearity of means)
=E(aX + bY-aEX -bEY)^2

(grouping by variable)
=E[a(X-EX)+b(Y-EY)]^2

(squaring out)
=E[a^2(X-EX)^2+2ab(X-EX)(Y-EY)+b^2(Y-EY)^2]

(using linearity of means and definitions of variance and covariance)
=a^2Var(X) + 2abCov(X,Y) +b^2Var(Y).
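A quick numerical check of (1) on a discrete table (a sketch with made-up numbers; E, Var and Cov are computed from the table as in the posts on means and covariance):

import numpy as np

p = np.array([0.25, 0.25, 0.5])
X = np.array([0.0, 2.0, -1.0])
Y = np.array([1.0, -3.0, 2.0])
a, b = 2.0, 5.0

E   = lambda V: np.sum(V * p)
Var = lambda V: E((V - E(V))**2)
Cov = lambda U, V: E((U - E(U)) * (V - E(V)))

lhs = Var(a*X + b*Y)
rhs = a**2*Var(X) + 2*a*b*Cov(X, Y) + b**2*Var(Y)
print(lhs, rhs)   # the two sides of (1) coincide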
Property 2. Variance of a sum. Letting in (1) a=b=1 we obtain
Var(X + Y) = Var(X) + 2Cov(X,Y)+Var(Y).

Property 3. Homogeneity of degree 2. Choose b=0 in (1) to get
Var(aX)=a^2Var(X).
Exercise. What do you think is larger: Var(X+Y) or Var(X-Y)?
Property 4. If we add a constant to a variable, its variance does not change: Var(X+c)=E[X+c-E(X+c)]^2=E(X+c-EX-c)^2=E(X-EX)^2=Var(X)
Property 5. Variance of a constant is zero: Var(c)=E(c-Ec)^2=0.

Property 6. Nonnegativity. Since the squared deviation from mean (X-EX)^2 is nonnegative, its expectation is nonnegative: E(X-EX)^2\ge 0.

Property 7. Only a constant can have variance equal to zero: If Var(X)=0, then E(X-EX)^2 =(x_1-EX)^2p_1 +...+(x_n-EX)^2p_n=0, see the definition of the expected value. Since all probabilities are positive, we conclude that x_i=EX for all i, which means that X is identically constant.

Property 8. Shortcut for variance. We have an identity E(X-EX)^2=EX^2-(EX)^2. Indeed, squaring out gives

E(X-EX)^2 =E(X^2-2XEX+(EX)^2)

(distributing expectation)

=EX^2-2E(XEX)+E(EX)^2

(expectation of a constant is constant)

=EX^2-2(EX)^2+(EX)^2=EX^2-(EX)^2.

All of the above properties apply to any random variables. The next one is an exception in the sense that it applies only to uncorrelated variables.

Property 9. If variables are uncorrelated, that is Cov(X,Y)=0, then from (1) we have Var(aX + bY)=a^2Var(X)+b^2Var(Y). In particular, letting a=b=1, we get additivityVar(X+Y)=Var(X)+Var(Y). Recall that the expected value is always additive.

Generalizations. Var(\sum a_iX_i)=\sum a_i^2Var(X_i) and Var(\sum X_i)=\sum Var(X_i) if all X_i are uncorrelated.

Among my posts that use properties of variance, I have counted 12 so far.



2 Oct 16

The pearls of AP Statistics 31

Demystifying sampling distributions: too much talking about nothing

What we know about sample means

Let X_1,...,X_n be an independent identically distributed sample and consider its sample mean \bar{X}.

Fact 1. The sample mean is an unbiased estimator of the population mean:

(1) E\bar{X}=\frac{1}{n}(EX_1+...+EX_n)=\frac{1}{n}(\mu+...+\mu)=\mu

(use linearity of means).

Fact 2. Variance of the sample mean is

(2) Var(\bar{X})=\frac{1}{n^2}(Var(X_1)+...+Var(X_n))=\frac{1}{n^2}(\sigma^2(X)+...+\sigma^2(X))=\frac{\sigma^2(X)}{n}

(use homogeneity of variance of degree 2 and additivity of variance for independent variables). Hence \sigma(\bar{X})=\frac{\sigma(X)}{\sqrt{n}}.
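Facts 1 and 2 are easy to see in a simulation (a sketch; the exponential population, which has mean 1 and variance 1, is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(2)
n, R = 25, 100000
mu, sigma2 = 1.0, 1.0                      # mean and variance of the Exponential(1) population

samples = rng.exponential(scale=1.0, size=(R, n))
xbar = samples.mean(axis=1)                # R independent sample means

print(xbar.mean(), mu)                     # Fact 1: E(xbar) = mu
print(xbar.var(), sigma2 / n)              # Fact 2: Var(xbar) = sigma^2 / n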

Fact 3. The implication of these two properties is that the sample mean becomes more concentrated around the population mean as the sample size increases (see at least the law of large numbers; I have a couple more posts about this).

Fact 4. Finally, the z scores of sample means stabilize to a standard normal distribution (the central limit theorem).

What is a sampling distribution?

The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take (Agresti and Franklin, p.308). After this definition, the authors go ahead and discuss the above four facts. Note that none of them requires the knowledge of what the sampling distribution is. The ONLY sampling distribution that appears explicitly in AP Statistics is the binomial. However, in the book the binomial is given in Section 6.3, before sampling distributions, which are the subject of Chapter 7. Section 7.3 explains that the binomial is a sampling distribution but that section is optional. Thus the whole Chapter 7 (almost 40 pages) is redundant.

Then what are sampling distributions for?

Here is a simple example that explains their role. Consider the binomial X_1+X_2 of two observations on an unfair coin. It involves two random variables and therefore is described by a joint distribution with the sample space consisting of pairs of values

Table 1. Sample space for pair (X_1,X_2)

                 Coin 1
                 0        1
Coin 2   0       (0,0)    (0,1)
         1       (1,0)    (1,1)

Each coin independently takes values 0 and 1 (shown in the margins); the sample space contains four pairs of these values (shown in the main body of the table). The corresponding probability distribution, where p = P(coin = 0) and q = P(coin = 1), is given by the table

Table 2. Joint probabilities for pair (X_1,X_2)

                 Coin 1
                 p        q
Coin 2   p       p^2      pq
         q       pq       q^2

Since we are counting only the number of successes, the outcomes (0,1) and (1,0) for the purposes of our experiment are the same. Hence, joining indistinguishable outcomes, we obtain a smaller sample space

Table 3. Sampling distribution for binomial X_1+X_2

# of successes Corresponding probabilities
0 p^2
1 2pq
2 q^2

The last table is the sampling distribution for the binomial with sample size 2. All the sampling distribution does is replace a large joint distribution Table 1+Table 2 by a smaller distribution Table 3. The beauty of proofs of equations (1) and (2) is that they do not depend on which distribution is used (the distribution is hidden in the expected value operator).
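The passage from Tables 1-2 to Table 3 can be automated (a minimal sketch; here, as in the tables above, p = P(coin = 0) and q = P(coin = 1)):

from itertools import product

p, q = 0.3, 0.7                     # P(coin = 0) and P(coin = 1)
prob = {0: p, 1: q}

# run over the joint distribution of (X_1, X_2) (Tables 1 and 2)
sampling = {}                       # sampling distribution of X_1 + X_2 (Table 3)
for x1, x2 in product([0, 1], repeat=2):
    sampling[x1 + x2] = sampling.get(x1 + x2, 0) + prob[x1] * prob[x2]

print(sampling)                     # {0: p^2, 1: 2pq, 2: q^2}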

Unless you want your students to appreciate the reduction in the sample space brought about by sampling distributions, it is not worth discussing them. See Wikipedia for examples other than the binomial.



2 Sep 16

Proving unbiasedness of OLS estimators

Proving unbiasedness of OLS estimators - the do's and don'ts

Groundwork

Here we derived the OLS estimators. To distinguish between sample and population means, the variance and covariance in the slope estimator will be provided with the subscript u (for "uniform", see the rationale here).

(1) \hat{b}=\frac{Cov_u(x,y)}{Var_u(x)},

(2) \hat{a}=\bar{y}-\hat{b}\bar{x}.

These equations are used in conjunction with the model

(3) y_i=a+bx_i+e_i

where we remember that

(4) Ee_i=0 for all i.

Since (2) depends on (1), we have to start with unbiasedness of the slope estimator.

Using the right representation is critical

We have to show that E\hat{b}=b.

Step 1. Don't apply the expectation directly to (1). Instead, separate out in (1) the part that is supposed to be E\hat{b}. To reveal the role of errors in (1), plug (3) into (1) and use linearity of covariance with respect to each argument when the other argument is fixed:

\hat{b}=\frac{Cov_u(x,a+bx+e)}{Var_u(x)}=\frac{Cov_u(x,a)+bCov_u(x,x)+Cov_u(x,e)}{Var_u(x)}.

Here Cov_u(x,a)=0 (a constant is uncorrelated with any variable), Cov_u(x,x)=Var_u(x) (covariance of x with itself is its variance), so

(5) \hat{b}=\frac{bVar_u(x)+Cov_u(x,e)}{Var_u(x)}=b+\frac{Cov_u(x,e)}{Var_u(x)}.

Equation (5) is the mean-plus-deviation-from-the-mean decomposition. Many students think that Cov_u(x,e)=0 because of (4). No! The covariance here does not involve the population mean.

Step 2. It pays to make one more step to develop (5). Write out the numerator in (5) using summation:

\hat{b}=b+\frac{1}{n}\sum(x_i-\bar{x})(e_i-\bar{e})/Var_u(x).

Don't write out Var_u(x)! The presence of two summations confuses many students.

Multiplying parentheses and using the fact that \sum(x_i-\bar{x})=n\bar{x}-n\bar{x}=0 we have

\hat{b}=b+\frac{1}{n}[\sum(x_i-\bar{x})e_i-\bar{e}\sum(x_i-\bar{x})]/Var_u(x)

=b+\frac{1}{n}\sum\frac{(x_i-\bar{x})}{Var_u(x)}e_i.

To simplify calculations, denote a_i=(x_i-\bar{x})/Var_u(x). Then the slope estimator becomes

(6) \hat{b}=b+\frac{1}{n}\sum a_ie_i.

This is the critical representation.

Unbiasedness of the slope estimator

Convenience condition. The regressor x is deterministic. I call it a convenience condition because it's just a matter of mathematical expedience, and later on we'll study ways to bypass it.

From (6), linearity of means and remembering that the deterministic coefficients a_i behave like constants,

(7) E\hat{b}=E[b+\frac{1}{n}\sum a_ie_i]=b+\frac{1}{n}\sum a_iEe_i=b

by (4). This proves unbiasedness.

You don't know the difference between the population and sample means until you see them working in the same formula.

Unbiasedness of the intercept estimator

As above we plug (3) in (2): \hat{a}=\overline{a+bx+e}-\hat{b}\bar{x}=a+b\bar{x}+\bar{e}-\hat{b}\bar{x}. Applying expectation:

E\hat{a}=a+b\bar{x}+E\bar{e}-E\hat{b}\bar{x}=a+b\bar{x}-b\bar{x}=a (here we use E\bar{e}=0, which follows from (4), unbiasedness of \hat{b} and the fact that \bar{x} is deterministic).
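Both proofs can be illustrated by a small Monte Carlo experiment (a sketch; x is held fixed across replications, in line with the convenience condition, and the error distribution is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(3)
n, R = 20, 50000
a, b, sigma = 1.0, 2.0, 1.0
x = np.linspace(0, 1, n)                       # deterministic regressor, the same in every replication

a_hats, b_hats = np.empty(R), np.empty(R)
for r in range(R):
    e = sigma * rng.normal(size=n)             # errors with Ee_i = 0, condition (4)
    y = a + b*x + e
    b_hats[r] = np.cov(x, y, bias=True)[0, 1] / np.var(x)    # slope formula (1)
    a_hats[r] = y.mean() - b_hats[r] * x.mean()              # intercept formula (2)

print(b_hats.mean(), a_hats.mean())            # close to b = 2 and a = 1: unbiasedness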

Conclusion

Since in (1)  there is division by Var_u(x), the condition Var_u(x)\ne 0 is the main condition for existence of OLS estimators. From the above proof we see that (4) is the main condition for unbiasedness.



16 Jun 21

Solution to Question 1 from UoL exam 2020


The assessment was an open-book take-home online assessment with a 24-hour window. No attempt was made to prevent cheating, except a warning, which was pretty realistic. Before an exam it's a good idea to see my checklist.

Question 1. Consider the following ARMA(1,1) process:

(1) z_{t}=\gamma +\alpha z_{t-1}+\varepsilon _{t}+\theta \varepsilon _{t-1}

where \varepsilon _{t} is a zero-mean white noise process with variance \sigma ^{2}, and assume |\alpha |,|\theta |<1 and \alpha+\theta \neq 0, which together make sure z_{t} is covariance stationary.

(a) [20 marks] Calculate the conditional and unconditional means of z_{t}, that is, E_{t-1}[z_{t}] and E[z_{t}].

(b) [20 marks] Set \alpha =0. Derive the autocovariance and autocorrelation function of this process for all lags as functions of the parameters \theta and \sigma .

(c) [30 marks] Assume now \alpha \neq 0. Calculate the conditional and unconditional variances of z_{t}, that is, Var_{t-1}[z_{t}] and Var[z_{t}].

Hint: for the unconditional variance, you might want to start by deriving the unconditional covariance between the variable and the innovation term, i.e., Cov[z_{t},\varepsilon _{t}].

(d) [30 marks] Derive the autocovariance and autocorrelation for lags of 1 and 2 as functions of the parameters of the model.

Hint: use the hint of part (c).

Solution

Part (a)

Reminder: The definition of a zero-mean white noise process is

(2) E\varepsilon _{t}=0, Var(\varepsilon _{t})=E\varepsilon_{t}^{2}=\sigma ^{2} for all t and Cov(\varepsilon _{j},\varepsilon_{i})=E\varepsilon _{j}\varepsilon _{i}=0 for all i\neq j.

A variable indexed t-1 is known at moment t-1 and at all later moments and behaves like a constant for conditioning at such moments.

Moment t is future relative to t-1.  The future is unpredictable and the best guess about the future error is zero.

The recurrent relationship in (1) shows that

(3) z_{t-1}=\gamma +\alpha z_{t-2}+... does not depend on the information that arrives at time t and later.

Hence, using also linearity of conditional means,

(4) E_{t-1}z_{t}=E_{t-1}\gamma +\alpha E_{t-1}z_{t-1}+E_{t-1}\varepsilon _{t}+\theta E_{t-1}\varepsilon _{t-1}=\gamma +\alpha z_{t-1}+\theta\varepsilon _{t-1}.

The law of iterated expectations (LIE): application of E_{t-1}, based on information available at time t-1, and subsequent application of E, based on no information, gives the same result as application of E.

Ez_{t}=E[E_{t-1}z_{t}]=E\gamma +\alpha Ez_{t-1}+\theta E\varepsilon _{t-1}=\gamma +\alpha Ez_{t-1}.

Since z_{t} is covariance stationary, its means across times are the same, so Ez_{t}=\gamma +\alpha Ez_{t} and Ez_{t}=\frac{\gamma }{1-\alpha }.

Part (b)

With \alpha =0 we get z_{t}=\gamma +\varepsilon _{t}+\theta\varepsilon _{t-1} and from part (a) Ez_{t}=\gamma . Using (2), we find variance

Var(z_{t})=E(z_{t}-Ez_{t})^{2}=E(\varepsilon _{t}^{2}+2\theta \varepsilon_{t}\varepsilon _{t-1}+\theta ^{2}\varepsilon _{t-1}^{2})=(1+\theta^{2})\sigma ^{2}

and first autocovariance

(5) \gamma_{1}=Cov(z_{t},z_{t-1})=E(z_{t}-Ez_{t})(z_{t-1}-Ez_{t-1})=E(\varepsilon_{t}+\theta \varepsilon _{t-1})(\varepsilon _{t-1}+\theta \varepsilon_{t-2})=\theta E\varepsilon _{t-1}^{2}=\theta \sigma ^{2}.

Second and higher autocovariances are zero because the subscripts of epsilons don't overlap.

Autocorrelation function: \rho _{0}=\frac{Cov(z_{t},z_{t})}{\sqrt{Var(z_{t})Var(z_{t})}}=1 (this is always true),

\rho _{1}=\frac{Cov(z_{t},z_{t-1})}{\sqrt{Var(z_{t})Var(z_{t-1})}}=\frac{\theta \sigma ^{2}}{(1+\theta ^{2})\sigma ^{2}}=\frac{\theta }{1+\theta ^{2}}, \rho _{j}=0 for j>1.

This is characteristic of MA processes: their autocorrelations are zero starting from some point.
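These formulas are easy to confirm by simulation (a sketch with arbitrary parameter values):

import numpy as np

rng = np.random.default_rng(4)
gamma, theta, sigma, T = 1.0, 0.6, 2.0, 200000

eps = sigma * rng.normal(size=T + 1)
z = gamma + eps[1:] + theta * eps[:-1]          # MA(1): z_t = gamma + eps_t + theta*eps_{t-1}

zc = z - z.mean()
rho1 = np.mean(zc[1:] * zc[:-1]) / np.mean(zc**2)
rho2 = np.mean(zc[2:] * zc[:-2]) / np.mean(zc**2)
print(rho1, theta / (1 + theta**2))             # matches the formula for rho_1
print(rho2, 0.0)                                # higher autocorrelations are close to zero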

Part (c)

If we replace all expectations in the definition of variance, we obtain the definition of conditional variance. From (1) and (4)

Var_{t-1}(z_{t})=E_{t-1}(z_{t}-E_{t-1}z_{t})^{2}=E_{t-1}\varepsilon_{t}^{2}=\sigma ^{2}.

By the law of total variance

(6) Var(z_{t})=EVar_{t-1}(z_{t})+Var(E_{t-1}z_{t})=\sigma ^{2}+Var(\gamma+\alpha z_{t-1}+\theta \varepsilon _{t-1})=

(an additive constant does not affect variance)

=\sigma ^{2}+Var(\alpha z_{t-1}+\theta \varepsilon _{t-1})=\sigma^{2}+\alpha ^{2}Var(z_{t})+2\alpha \theta Cov(z_{t-1},\varepsilon_{t-1})+\theta ^{2}Var(\varepsilon _{t-1}).

By the LIE and (3)

Cov(z_{t-1},\varepsilon _{t-1})=Cov(\gamma +\alpha z_{t-2}+\varepsilon _{t-1}+\theta \varepsilon _{t-2},\varepsilon _{t-1})=\alpha Cov(z_{t-2},\varepsilon _{t-1})+E\varepsilon _{t-1}^{2}+\theta E(\varepsilon _{t-2}E_{t-2}\varepsilon _{t-1}).

Here Cov(z_{t-2},\varepsilon _{t-1})=0 by (3) and E_{t-2}\varepsilon _{t-1}=0, so

(7) Cov(z_{t-1},\varepsilon _{t-1})=\sigma ^{2}.

This equation leads to

Var(z_{t})=Var(\gamma +\alpha z_{t-1}+\varepsilon _{t}+\theta \varepsilon  _{t-1})=\alpha ^{2}Var(z_{t-1})+Var(\varepsilon _{t})+\theta  ^{2}Var(\varepsilon _{t-1})+

+2\alpha Cov(z_{t-1},\varepsilon _{t})+2\alpha \theta  Cov(z_{t-1},\varepsilon _{t-1})+2\theta Cov(\varepsilon _{t},\varepsilon  _{t-1})=\alpha ^{2}Var(z_{t})+\sigma ^{2}+\theta ^{2}\sigma ^{2}+2\alpha  \theta \sigma ^{2}

and, finally,

(8) Var(z_{t})=\frac{(1+2\alpha \theta +\theta ^{2})\sigma ^{2}}{1-\alpha  ^{2}}.

Part (d)

From (7)

(9) Cov(z_{t-1},\varepsilon _{t-2})=Cov(\gamma +\alpha z_{t-2}+\varepsilon  _{t-1}+\theta \varepsilon _{t-2},\varepsilon _{t-2})=\alpha  Cov(z_{t-2},\varepsilon _{t-2})+\theta Var(\varepsilon _{t-2})=(\alpha  +\theta )\sigma ^{2}.

It follows that

Cov(z_{t},z_{t-1})=Cov(\gamma +\alpha z_{t-1}+\varepsilon _{t}+\theta  \varepsilon _{t-1},\gamma +\alpha z_{t-2}+\varepsilon _{t-1}+\theta  \varepsilon _{t-2})=

(a constant is not correlated with anything)

=\alpha ^{2}Cov(z_{t-1},z_{t-2})+\alpha Cov(z_{t-1},\varepsilon  _{t-1})+\alpha \theta Cov(z_{t-1},\varepsilon _{t-2})+

+\alpha Cov(\varepsilon _{t},z_{t-2})+Cov(\varepsilon _{t},\varepsilon  _{t-1})+\theta Cov(\varepsilon _{t},\varepsilon _{t-2})+

+\theta \alpha Cov(\varepsilon _{t-1},z_{t-2})+\theta Var(\varepsilon  _{t-1})+\theta ^{2}Cov(\varepsilon _{t-1},\varepsilon _{t-2}).

From (7) Cov(z_{t-2},\varepsilon _{t-2})=\sigma ^{2} and from (9) Cov(z_{t-1},\varepsilon _{t-2})=(\alpha +\theta )\sigma ^{2}.

From (3) Cov(\varepsilon _{t},z_{t-2})=Cov(\varepsilon _{t-1},z_{t-2})=0.

Using also the white noise properties and stationarity of z_{t}

Cov(z_{t},z_{t-1})=Cov(z_{t-1},z_{t-2})=\gamma _{1},

we are left with

\gamma _{1}=\alpha ^{2}\gamma _{1}+\alpha \sigma  ^{2}+\alpha \theta (\alpha +\theta )\sigma ^{2}+\theta \sigma ^{2}=\alpha  ^{2}\gamma _{1}+(1+\alpha \theta )(\alpha +\theta )\sigma ^{2}.

Hence,

\gamma _{1}=\frac{(1+\alpha \theta )(\alpha +\theta )\sigma ^{2}}{1-\alpha  ^{2}}

and using (8)

\rho _{0}=1, \rho _{1}=\frac{(1+\alpha \theta )(\alpha +\theta )}{  1+2\alpha \theta +\theta ^{2}}.

The finish is close.

Cov(z_{t},z_{t-2})=Cov(\gamma +\alpha z_{t-1}+\varepsilon _{t}+\theta  \varepsilon _{t-1},\gamma +\alpha z_{t-3}+\varepsilon _{t-2}+\theta  \varepsilon _{t-3})=

=\alpha ^{2}Cov(z_{t-1},z_{t-3})+\alpha Cov(z_{t-1},\varepsilon  _{t-2})+\alpha \theta Cov(z_{t-1},\varepsilon _{t-3})+

+\alpha Cov(\varepsilon _{t},z_{t-3})+Cov(\varepsilon _{t},\varepsilon  _{t-2})+\theta Cov(\varepsilon _{t},\varepsilon _{t-3})+

+\theta \alpha Cov(\varepsilon _{t-1},z_{t-3})+\theta Cov(\varepsilon  _{t-1},\varepsilon _{t-2})+\theta ^{2}Cov(\varepsilon _{t-1},\varepsilon  _{t-3}).

This simplifies to

(10) Cov(z_{t},z_{t-2})=\alpha ^{2}Cov(z_{t-1},z_{t-3})+\alpha (\alpha  +\theta )\sigma ^{2}+\alpha \theta Cov(z_{t-1},\varepsilon _{t-3}).

By (9) and stationarity, Cov(z_{t-2},\varepsilon _{t-3})=(\alpha +\theta )\sigma ^{2}, so

Cov(z_{t-1},\varepsilon _{t-3})=Cov(\gamma +\alpha z_{t-2}+\varepsilon _{t-1}+\theta \varepsilon _{t-2},\varepsilon _{t-3})=\alpha Cov(z_{t-2},\varepsilon _{t-3})=\alpha (\alpha +\theta )\sigma ^{2}.

Finally, using (10) and stationarity (Cov(z_{t},z_{t-2})=Cov(z_{t-1},z_{t-3})=\gamma _{2}),

\gamma _{2}=\alpha ^{2}\gamma _{2}+\alpha (\alpha +\theta )\sigma ^{2}+\alpha ^{2}\theta (\alpha +\theta )\sigma ^{2}=\alpha ^{2}\gamma _{2}+\alpha (1+\alpha \theta )(\alpha +\theta )\sigma ^{2},

\gamma _{2}=\frac{\alpha (1+\alpha \theta )(\alpha +\theta )\sigma ^{2}}{1-\alpha ^{2}}=\alpha \gamma _{1},

\rho _{2}=\frac{\alpha (1+\alpha \theta )(\alpha +\theta )}{1+2\alpha \theta +\theta ^{2}}=\alpha \rho _{1}.
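A simulation sketch confirming (8) and the autocorrelations (arbitrary parameter values; the process is started at its unconditional mean, so the small transient is ignored):

import numpy as np

rng = np.random.default_rng(5)
gamma, alpha, theta, sigma, T = 1.0, 0.5, 0.3, 1.0, 200000

eps = sigma * rng.normal(size=T)
z = np.empty(T)
z[0] = gamma / (1 - alpha)                                   # unconditional mean
for t in range(1, T):
    z[t] = gamma + alpha*z[t-1] + eps[t] + theta*eps[t-1]    # ARMA(1,1), equation (1)

zc = z - z.mean()
var  = np.mean(zc**2)
rho1 = np.mean(zc[1:] * zc[:-1]) / var
rho2 = np.mean(zc[2:] * zc[:-2]) / var

print(var,  (1 + 2*alpha*theta + theta**2) * sigma**2 / (1 - alpha**2))            # (8)
print(rho1, (1 + alpha*theta) * (alpha + theta) / (1 + 2*alpha*theta + theta**2))  # rho_1
print(rho2, alpha * rho1)                                                          # rho_2 = alpha * rho_1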

A couple of errors have been corrected on June 22, 2021. Hope this is final.



22 Feb 19

Cramer's rule and invertibility criterion


Consequences of multilinearity

For a fixed j, \det A is a linear function of column A^{(j)}. Such a linear function generates a row-vector L_{j} by way of a formula (see Exercise 3)

(1) \det A=L_jA^{(j)}.

Exercise 1. In addition to (1), we have

(2) L_jA^{(k)}=0 for any k\neq j.

Proof. Here and in the future it is useful to introduce the coordinate representation for L_j=(l_{j1},...,l_{jn}) and put

(3) L=\left(\begin{array}{c}L_1 \\... \\L_n\end{array}\right).

Then we can write (1) as \det A=\sum_{i=1}^nl_{ji}a_{ij}. Here the element l_{ji} does not involve a_{ij} and therefore by the different-columns-different-rows rule it does not involve elements of the entire column A^{(j)}. Hence, the vector L_j does not involve elements of the column A^{(j)}.

Let A^{\prime } denote the matrix obtained from A by replacing column A^{(j)} with column A^{(k)}. The vector L_j for the matrix A^{\prime } is the same as for A because both vectors depend on the elements from columns other than the column numbered j. Since A^\prime contains linearly dependent (actually two identical) columns, \det A^\prime=0. Using in (1) A^\prime instead of A we get 0=\det A^\prime=L_jA^{(k)}, as required.

After reading the next two sections, come back and read this statement again to appreciate its power and originality.

Cramer's rule

Exercise 2. Suppose \det A\neq 0. For any y\in R^n denote B_j the matrix formed by replacing the j-th column of A by the column vector y. Then the solution of the system Ax=y exists and the components of x are given by

x_j=\frac{\det B_j}{\det A},\ j=1,...,n.

Proof. Premultiply Ax=y by L_{j}:

(4) L_jy=L_jAx=\left(L_jA^{(1)},...,L_jA^{(n)}\right)x=(0,...,\det A,...,0)x.

Here we applied (1) and (2) (the j-th component of the vector (0,...,\det  A,...,0) is \det A and all others are zeros). From (4) it follows that (\det  A)x_{j}=L_{j}y. On the other hand, from (1) we have \det B_j=L_jy (the vector L_j for B_j is the same as for A, see the proof of Exercise 1). The last two equations prove the statement.
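A numerical illustration of Cramer's rule (a sketch with a random matrix, which has a nonzero determinant almost surely; numpy's solve is used only for comparison):

import numpy as np

rng = np.random.default_rng(6)
n = 4
A = rng.normal(size=(n, n))            # almost surely det A != 0
y = rng.normal(size=n)

x = np.empty(n)
for j in range(n):
    B = A.copy()
    B[:, j] = y                        # B_j: replace the j-th column of A by y
    x[j] = np.linalg.det(B) / np.linalg.det(A)

print(np.allclose(x, np.linalg.solve(A, y)))   # True: Cramer's rule solves Ax = y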

Invertibility criterion

Exercise 3. A is invertible if and only if \det A\neq 0.

Proof. If A is invertible, then AA^{-1}=I. By multiplicativity of determinant and Axiom 3 this implies \det A\det (A^{-1})=1. Thus, \det  A\neq 0.

Conversely, suppose \det A\neq 0. (1), (2) and (3) imply

(5) LA=\left(\begin{array}{c}L_1 \\... \\L_n\end{array}\right) (A^{(1)},...,A^{(n)})=\left(\begin{array}{ccc}L_1A^{(1)}&...&L_1A^{(n)} \\...&...&... \\L_nA^{(1)}&...&L_nA^{(n)}\end{array}  \right)

=\left(\begin{array}{ccc}\det A&...&0 \\...&...&... \\0&...&\det A\end{array}  \right)=\det A\times I.

This means that the matrix \frac{1}{\det A}L is the inverse of A. Recall that existence of the left inverse implies that of the right inverse, so A is invertible.

Definition 1. The matrix L is more than a transient technical twist; it is called an adjugate matrix and property (5), correspondingly, is called an adjugate identity.
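The adjugate identity (5) can be verified directly (a sketch; the rows L_j are assembled from cofactors, in line with the coordinate representation \det A=\sum_i l_{ji}a_{ij}):

import numpy as np

def adjugate(A):
    # L[j, i] is the cofactor of a_{ij}: row L_j collects the coefficients of column A^{(j)} in det A
    n = A.shape[0]
    L = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            L[j, i] = (-1)**(i + j) * np.linalg.det(minor)
    return L

rng = np.random.default_rng(7)
A = rng.normal(size=(3, 3))
L = adjugate(A)
print(np.allclose(L @ A, np.linalg.det(A) * np.eye(3)))       # adjugate identity (5): LA = (det A) I
print(np.allclose(np.linalg.inv(A), L / np.linalg.det(A)))    # hence A^{-1} = L / det A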
