13
Oct 16

## Properties of means

Properties of means, covariances and variances are the bread and butter of professionals. Here we consider the bread: the means.

### Properties of means: as simple as playing with tables

Definition of a random variable. When my Brazilian students asked for an intuitive definition of a random variable, I said: It is a function whose values are unpredictable. Therefore it is prohibited to work with their values and allowed to work only with their various means. For proofs we need a more technical definition: it is a table values+probabilities of type Table 1.

Table 1. Values and probabilities of $X$

| Values of $X$ | Probabilities |
|---|---|
| $x_1$ | $p_1$ |
| ... | ... |
| $x_n$ | $p_n$ |

Note: The complete form of writing ${p_i}$ is $P(X = {x_i})$.

Definition of the mean (or expected value): $EX = x_1p_1 + ... + x_np_n = \sum\limits_{i = 1}^nx_ip_i.$ In words, this is a weighted sum of values, where the weights $p_i$ reflect the importance of the corresponding $x_i$.

Note: The expected value is a function whose argument is a complex object (it is described by Table 1) and the value is simple: $EX$ is just a number. And it is not a product of $E$ and $X$! See how different means fit this definition.
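
To see the definition at work, here is a minimal Python sketch. The function name `mean` and the three-row table are my own illustrative assumptions, not part of the definition; exact fractions are used so that no rounding gets in the way.

```python
from fractions import Fraction

def mean(values, probs):
    """Expected value of a random variable given as a values+probabilities table."""
    # Completeness axiom: the probabilities must add up to 1.
    assert sum(probs) == 1
    return sum(x * p for x, p in zip(values, probs))

# Hypothetical table: values 1, 2, 5 with probabilities 1/2, 1/4, 1/4
values = [1, 2, 5]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

EX = mean(values, probs)
print(EX)  # 1*(1/2) + 2*(1/4) + 5*(1/4) = 9/4
```

The input is the whole table and the output is a single number, which is exactly the point of the note above.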

Definition of a linear combination. See here the financial motivation. Suppose that $X,Y$ are two discrete random variables with the same probability distribution ${p_1},...,{p_n}$. Let $a,b$ be real numbers. The random variable $aX + bY$ is called a linear combination of $X,Y$ with coefficients $a,b$. Its special cases are $aX$ ($X$ scaled by $a$) and $X + Y$ (a sum of $X$ and $Y$). The detailed definition is given by Table 2.

Table 2. Linear combination of $X$ and $Y$

| Values of $X$ | Values of $Y$ | Probabilities | $aX$ | $X + Y$ | $aX + bY$ |
|---|---|---|---|---|---|
| $x_1$ | $y_1$ | $p_1$ | $a{x_1}$ | ${x_1} + {y_1}$ | $a{x_1} + b{y_1}$ |
| ... | ... | ... | ... | ... | ... |
| $x_n$ | $y_n$ | $p_n$ | $a{x_n}$ | ${x_n} + {y_n}$ | $a{x_n} + b{y_n}$ |

Note: The situation when the probability distributions are different is reduced to the case when they are the same, see my book.

Property 1. Linearity of means. For any random variables $X,Y$ and any numbers $a,b$ one has

(1) $E(aX + bY) = aEX + bEY$.

Proof. This is one of those straightforward proofs when knowing the definitions and starting with the left-hand side is enough to arrive at the result. Using the definitions in Table 2, the mean of the linear combination is
$E(aX + bY)= (a{x_1} + b{y_1}){p_1} + ... + (a{x_n} + b{y_n}){p_n}$

(distributing probabilities)
$= a{x_1}{p_1} + b{y_1}{p_1} + ... + a{x_n}{p_n} + b{y_n}{p_n}$

(grouping by variables)
$= (a{x_1}{p_1} + ... + a{x_n}{p_n}) + (b{y_1}{p_1} + ... + b{y_n}{p_n})$

(pulling out constants)
$= a({x_1}{p_1} + ... + {x_n}{p_n}) + b({y_1}{p_1} + ... + {y_n}{p_n})=aEX+bEY.$
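
The proof can be checked numerically. The sketch below builds the last column of Table 2 for made-up numbers (the values, probabilities and coefficients are assumptions chosen only for illustration) and compares both sides of (1).

```python
from fractions import Fraction

def mean(values, probs):
    """Weighted sum of values with probabilities as weights."""
    return sum(v * p for v, p in zip(values, probs))

# Hypothetical common probability distribution and values of X and Y
probs = [Fraction(1, 5), Fraction(3, 10), Fraction(1, 2)]
x = [1, 4, -2]
y = [3, 0, 6]
a, b = 2, -3

# Left side of (1): mean of the column a*x_i + b*y_i
lhs = mean([a * xi + b * yi for xi, yi in zip(x, y)], probs)
# Right side of (1): the same combination of the two means
rhs = a * mean(x, probs) + b * mean(y, probs)

print(lhs, rhs)  # both are -10
```

A numerical check is not a proof, but it is a quick way to catch a mistake in the algebra.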

See applications: one, and two, and three.

Generalization to the case of a linear combination of $n$ variables:

$E({a_1}{X_1} + ... + {a_n}{X_n}) = {a_1}E{X_1} + ... + {a_n}E{X_n}$.

Special cases. a) Letting $a = b = 1$ in (1) we get $E(X + Y) = EX + EY$. This is called additivity. See an application. b) Letting in (1) $b = 0$ we get $E(aX) = aEX$. This property is called homogeneity of degree 1 (you can pull the constant out of the expected value sign). Ask your students to deduce linearity from homogeneity and additivity.

Property 2. Expected value of a constant. Everybody knows what a constant is. Ask your students what a constant is in terms of Table 1. The mean of a constant is that constant, because a constant doesn't change, rain or shine: $Ec = c{p_1} + ... + c{p_n} = c({p_1} + ... + {p_n}) = c$ (we have used the completeness axiom ${p_1} + ... + {p_n}=1$). In particular, it follows that $E(EX)=EX$.

Property 3. The expectation operator preserves order: if $x_i\ge y_i$ for all $i$, then $EX\ge EY$. In particular, the mean of a nonnegative random variable is nonnegative: if $x_i\ge 0$ for all $i$, then $EX\ge 0$.

Indeed, using the fact that all probabilities are nonnegative, we get $EX = x_1p_1 + ... + x_np_n\ge y_1p_1 + ... + y_np_n=EY$.

Property 4. For independent variables, we have $EXY=(EX)(EY)$ (multiplicativity), which has important implications on its own.

The best thing about the above properties is that, although we proved them under simplified assumptions, they are always true. We keep in mind that the expectation operator $E$ is the device used by Mother Nature to measure the average, and most of the time she keeps hidden from us both the probabilities and the average $EX$.

30
Jul 19

## Properties of root subspaces

Let $A$ be a square matrix and let $\lambda \in \sigma (A)$ be its eigenvalue. As we know, the nonzero elements of the null space $N(A-\lambda I)=\{x:(A-\lambda I)x=0\}$ are the corresponding eigenvectors. This definition is generalized as follows.

Definition 1. The subspaces $N_{\lambda }^{(k)}=N((A-\lambda I)^{k}),$ $k=1,2,...$ are called root subspaces of $A$ corresponding to $\lambda .$

Exercise 1. a) Root subspaces are increasing:

(1) $N_{\lambda }^{(k)}\subset N_{\lambda }^{(k+1)}$ for all $k\geq 1$

and b) there is a $p\leq n$ such that all inclusions (1) are strict for $k<p$ and

(2) $N_{\lambda }^{(p)}=N_{\lambda }^{(p+1)}=...$

Proof. a) If $x\in N_{\lambda }^{(k)}$ for some $k,$ then $(A-\lambda I)^{k+1}x=(A-\lambda I)(A-\lambda I)^{k}x=0,$ which shows that $x\in N_{\lambda }^{(k+1)}.$

b) (1) implies $\dim N_{\lambda }^{(k)}\leq \dim N_{\lambda }^{(k+1)}.$ Since all root subspaces are contained in $C^{n},$ their dimensions do not exceed $n,$ so the inclusions cannot all be strict and there is a $k$ such that $N_{\lambda }^{(k)}=N_{\lambda }^{(k+1)}.$ Let $p$ be the smallest such $k.$ Then all inclusions (1) are strict for $k<p.$

Suppose $N_{\lambda}^{(k+1)}\setminus N_{\lambda }^{(k)}\neq \varnothing$ for some $k\ge p.$ Then there exists $x\in N_{\lambda }^{(k+1)}$ such that $x\notin N_{\lambda}^{(k)}$, that is, $(A-\lambda I)^{k+1}x=0,$ $(A-\lambda I)^{k}x\neq 0.$ Put $y=(A-\lambda I)^{k-p}x.$ Then $(A-\lambda I)^{p+1}y=(A-\lambda I)^{k+1}x=0,$ $(A-\lambda I)^{p}y=(A-\lambda I)^{k}x\neq 0.$ This means that $y\in N_{\lambda }^{(p+1)}\setminus N_{\lambda }^{(p)},$ which contradicts the definition of $p.$

Definition 2. Property (2) can be called stabilization. The number $p$ from (2) is called the height of the eigenvalue $\lambda$.
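
Stabilization is easy to watch on a small example. The sketch below is an illustration under my own assumptions: the $4\times 4$ matrix is made up (a $3\times 3$ Jordan block for $\lambda=2$ plus a separate eigenvalue $5$), and the helper names `mat_mul`, `rank` and `root_subspace_dims` are hypothetical. The dimensions are found by the rank-nullity theorem, with exact rational arithmetic.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def rank(M):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    M = [[Fraction(v) for v in row] for row in M]
    n_rows, n_cols = len(M), len(M[0])
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, n_rows):
            factor = M[i][c] / M[r][c]
            M[i] = [vi - factor * vr for vi, vr in zip(M[i], M[r])]
        r += 1
        if r == n_rows:
            break
    return r

def root_subspace_dims(A, lam, k_max):
    """dim N((A - lam*I)^k) for k = 1..k_max, via rank-nullity."""
    n = len(A)
    B = [[A[i][j] - (lam if i == j else 0) for j in range(n)] for i in range(n)]
    power = B
    dims = []
    for _ in range(k_max):
        dims.append(n - rank(power))
        power = mat_mul(power, B)
    return dims

# Hypothetical matrix: 3x3 Jordan block for eigenvalue 2, plus eigenvalue 5
A = [[2, 1, 0, 0],
     [0, 2, 1, 0],
     [0, 0, 2, 0],
     [0, 0, 0, 5]]

dims = root_subspace_dims(A, 2, 5)
# Height: the smallest k with dim N^(k) == dim N^(k+1); dims[k-1] is dim N^(k)
p = next(k for k in range(1, len(dims)) if dims[k - 1] == dims[k])
print(dims, p)  # [1, 2, 3, 3, 3] 3
```

The dimensions grow strictly up to $k=p=3$ and then stay constant, as Exercise 1 predicts.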

Exercise 2. Let $\lambda \in \sigma (A)$ and let $p$ be the number from Exercise 1. Then

(3) $C^{n}=N_{\lambda }^{(p)}\dotplus \text{Img}[(A-\lambda I)^{p}].$

Proof. By the rank-nullity theorem applied to $(A-\lambda I)^{p}$ we have $n=\dim N_{\lambda }^{(p)}+\dim \text{Img}[(A-\lambda I)^{p}].$ By Exercise 3, to prove (3) it is sufficient to establish that $L\equiv N_{\lambda}^{(p)}\cap \text{Img}[(A-\lambda I)^{p}]=\{0\}.$ Let's assume that $L$ contains a nonzero vector $x.$ Then we have $x=(A-\lambda I)^{p}y$ for some $y.$ We obtain two facts:

$(A-\lambda I)^{p}y\neq 0$ $\Longrightarrow y\notin N_{\lambda }^{(p)},$

$(A-\lambda I)^{2p}y=(A-\lambda I)^{p}(A-\lambda I)^{p}y=(A-\lambda I)^{p}x=0\Longrightarrow y\in N_{\lambda }^{(2p)}.$

It follows that $y$ is a nonzero element of $N_{\lambda }^{(2p)}\setminus N_{\lambda }^{(p)}.$ This contradicts (2). Hence, the assumption $L\neq \{0\}$ is wrong, and (3) follows.

Exercise 3. Both subspaces at the right of (3) are invariant with respect to $A.$

Proof. If $x\in N_{\lambda }^{(p)},$ then by commutativity of $A$ and $A-\lambda I$ we have $(A-\lambda I)^{p}Ax=A(A-\lambda I)^{p}x=0,$ so $Ax\in N_{\lambda }^{(p)}.$

Suppose $x\in \text{Img}[(A-\lambda I)^{p}],$ so that $x=(A-\lambda I)^{p}y$ for some $y.$ Then $Ax=(A-\lambda I)^{p}Ay\in \text{Img}[(A-\lambda I)^{p}].$

Exercise 3 means that, for the purpose of further analyzing $A,$ we can consider its restrictions onto $N_{\lambda }^{(p)}$ and $\text{Img}[(A-\lambda I)^{p}].$

Exercise 4. The restriction of $A$ onto $N_{\lambda }^{(p)}$ does not have eigenvalues other than $\lambda .$

Proof. Suppose $Ax=\mu x,$ $x\neq 0,$ for some $\mu .$ Since $x\in N_{\lambda }^{(p)},$ we have $(A-\lambda I)^{p}x=0.$ Then $(A-\lambda I)x=(\mu -\lambda )x$ and $0=(A-\lambda I)^{p}x=(\mu -\lambda )^{p}x$. This implies $\mu =\lambda .$

Exercise 5. The restriction of $A$ onto $\text{Img}[(A-\lambda I)^{p}]$ does not have $\lambda$ as an eigenvalue (so that $A-\lambda I$ is invertible).

Proof. Suppose $x\in \text{Img}[(A-\lambda I)^{p}]$ and $Ax=\lambda x,$ $x\neq 0.$ Then $x=(A-\lambda I)^{p}y$ for some $y\neq 0$ and $0=(A-\lambda I)x=(A-\lambda I)^{p+1}y.$ By Exercise 1 $y\in N_{\lambda }^{(p+1)}=N_{\lambda }^{(p)}$ and $x=(A-\lambda I)^{p}y=0.$ This contradicts the choice of $x.$

12
Nov 16

## Properties of standard deviation

Properties of standard deviation are divided into two parts. The definitions and consequences are given here. Both variance and standard deviation are used to measure variability of values of a random variable around its mean. Then why use both of them? The why will be explained in another post.

### Properties of standard deviation: definitions and consequences

Definition. For a random variable $X$, the quantity $\sigma (X) = \sqrt {Var(X)}$ is called its standard deviation.

#### Digression about square roots and absolute values

In general, there are two square roots of a positive number, one positive and the other negative. The positive one is called an arithmetic square root. The arithmetic root is applied here to $Var(X) \ge 0$ (see properties of variance), so standard deviation is always nonnegative.
Definition. An absolute value of a real number $a$ is defined by
(1) $|a| =a$ if $a$ is nonnegative and $|a| =-a$ if $a$ is negative.
This two-part definition is a stumbling block for many students, so making them plug in a few numbers is a must. It is introduced to measure the distance from point $a$ to the origin. For example, $dist(3,0) = |3| = 3$ and $dist(-3,0) = |-3| = 3$. More generally, for any points $a,b$ on the real line the distance between them is given by $dist(a,b) = |a - b|$.

By squaring both sides in Eq. (1) we obtain $|a|^2={a^2}$. Application of the arithmetic square root gives

(2) $|a|=\sqrt {a^2}.$

This is the equation we need right now.

### Back to standard deviation

Property 1. Standard deviation is homogeneous of degree 1. Indeed, using homogeneity of variance and equation (2), we have

$\sigma (aX) =\sqrt{Var(aX)}=\sqrt{{a^2}Var(X)}=|a|\sigma(X).$

Unlike homogeneity of expected values, here we have an absolute value of the scaling coefficient $a$.
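
A quick numerical check shows why the absolute value is needed. The numbers below are made up and the helper names `mean`, `var`, `sd` are my own; floating point is used because of the square root, so the comparison is done with `math.isclose`.

```python
import math

def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def var(values, probs):
    """Variance as the mean of squared deviations from the mean."""
    m = mean(values, probs)
    return mean([(v - m) ** 2 for v in values], probs)

def sd(values, probs):
    return math.sqrt(var(values, probs))

# Hypothetical table and a negative scaling coefficient
probs = [0.25, 0.25, 0.5]
x = [1.0, 3.0, 6.0]
a = -2.0

lhs = sd([a * v for v in x], probs)   # sigma(aX)
rhs = abs(a) * sd(x, probs)           # |a| sigma(X)
print(math.isclose(lhs, rhs))         # True
print(math.isclose(lhs, a * sd(x, probs)))  # False: a*sigma(X) is negative
```

With a negative $a$, writing $a\sigma(X)$ instead of $|a|\sigma(X)$ would give a negative standard deviation, which is impossible.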

Property 2. Cauchy-Schwarz inequality. (Part 1) For any random variables $X,Y$ one has

(3) $|Cov(X,Y)|\le\sigma(X)\sigma(Y)$.

(Part 2) If the inequality sign in (3) turns into equality, $|Cov(X,Y)|=\sigma (X)\sigma (Y)$, then $Y$ is a linear function of $X$: $Y = aX + b$, with some constants $a,b$.
Proof. (Part 1) If at least one of the variables is constant, both sides of the inequality are $0$ and there is nothing to prove. To exclude the trivial case, let $X,Y$ be non-constant and, therefore, $Var(X),\ Var(Y)$ are positive. Consider a real-valued function of a real number $t$ defined by $f(t) = Var(tX + Y)$. Here we have variance of a linear combination

$f(t)=t^2Var(X)+2tCov(X,Y)+Var(Y)$.

We see that the graph of $f(t)$ is a parabola with branches looking upward (because the leading coefficient $Var(X)$ is positive). By nonnegativity of variance, $f(t)\ge 0$, so the parabola lies on or above the horizontal axis in the $(t,f)$ plane. Hence, the quadratic equation $f(t) = 0$ may have at most one real root. This means that the discriminant of the equation is non-positive. Dividing it by $4$, we get

$D=Cov(X,Y)^2-Var(X)Var(Y)\le 0.$

Applying square roots to both sides of $Cov(X,Y)^2\le Var(X)Var(Y)$ we finish the proof of the first part.

(Part 2) In case of the equality sign the discriminant is $0$. Therefore the parabola touches the horizontal axis where $f(t)=Var(tX + Y)=0$. But we know that this implies $tX + Y = constant$ which is just another way of writing $Y = aX + b$.

Comment. (3) explains one of the main properties of the correlation:

$-1\le\rho(X,Y)=\frac{Cov(X,Y)}{\sigma(X)\sigma(Y)}\le 1$.
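
Both parts of the Cauchy-Schwarz inequality can be illustrated with a few lines of Python. To stay in exact arithmetic, the sketch compares squares, that is, it computes $Var(X)Var(Y)-Cov(X,Y)^2$. The data and the helper names are assumptions made for illustration only.

```python
from fractions import Fraction

def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def cov(x, y, probs):
    """Covariance as the mean of products of deviations from means."""
    mx, my = mean(x, probs), mean(y, probs)
    return mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)], probs)

# Hypothetical values with equal probabilities
probs = [Fraction(1, 4)] * 4
x = [1, 2, 3, 4]
y = [2, 1, 4, 3]

# Part 1: Var(X)Var(Y) - Cov(X,Y)^2 is nonnegative (Var is Cov of a variable with itself)
gap = cov(x, x, probs) * cov(y, y, probs) - cov(x, y, probs) ** 2

# Part 2 read in the converse direction: for Z = -3X + 7 the gap is zero
z = [-3 * xi + 7 for xi in x]
gap_linear = cov(x, x, probs) * cov(z, z, probs) - cov(x, z, probs) ** 2

print(gap, gap_linear)  # 1 0
```

In this example $Cov(X,Y)=3/4$ and $Var(X)=Var(Y)=5/4$, so $\rho(X,Y)=3/5$, which indeed lies between $-1$ and $1$.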

3
Nov 16

## Properties of covariance

Wikipedia says: The magnitude of the covariance is not easy to interpret. I add: We keep the covariance around mainly for its algebraic properties. It deserves studying because it appears in two important formulas: correlation coefficient and slope estimator in simple regression (see derivation, simplified derivation and proof of unbiasedness).

Definition. For two random variables $X,Y$ their covariance is defined by

$Cov (X,Y) = E(X - EX)(Y - EY)$

(it's the mean value of the product of the deviations of two variables from their respective means).

### Properties of covariance

Property 1. Linearity. Covariance is linear in the first argument when the second argument is fixed: for any random variables $X,Y,Z$ and numbers $a,b$ one has
(1) $Cov (aX + bY,Z) = aCov(X,Z) + bCov (Y,Z).$
Proof. We start by writing out the left side of Equation (1):
$Cov(aX + bY,Z)=E[(aX + bY)-E(aX + bY)](Z-EZ)$
(using linearity of means)
$= E(aX + bY - aEX - bEY)(Z - EZ)$
(collecting similar terms)
$= E[a(X - EX) + b(Y - EY)](Z - EZ)$
(distributing $(Z - EZ)$)
$= E[a(X - EX)(Z - EZ) + b(Y - EY)(Z - EZ)]$
(using linearity of means)
$= aE(X - EX)(Z - EZ) + bE(Y - EY)(Z - EZ)$
$= aCov(X,Z) + bCov(Y,Z).$

Exercise. Covariance is also linear in the second argument when the first argument is fixed. Write out and prove this property. You can notice the importance of using parentheses and brackets.

Property 2. Shortcut for covariance: $Cov(X,Y) = EXY - (EX)(EY)$.
Proof. $Cov(X,Y)= E(X - EX)(Y - EY)$
(multiplying out)
$= E[XY - X(EY) - (EX)Y + (EX)(EY)]$
($EX,EY$ are constants; use linearity)
$=EXY-(EX)(EY)-(EX)(EY)+(EX)(EY)=EXY-(EX)(EY).$
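
Here is a small check that the definition and the shortcut give the same number. The values and probabilities are hypothetical, and $X,Y$ are assumed to share one probability column, as in the post on properties of means.

```python
from fractions import Fraction

def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

# Hypothetical values with equal probabilities
probs = [Fraction(1, 4)] * 4
x = [1, 2, 3, 4]
y = [2, 1, 4, 3]

EX, EY = mean(x, probs), mean(y, probs)

# Definition: mean of the product of deviations from the means
cov_definition = mean([(xi - EX) * (yi - EY) for xi, yi in zip(x, y)], probs)
# Shortcut: EXY - (EX)(EY)
cov_shortcut = mean([xi * yi for xi, yi in zip(x, y)], probs) - EX * EY

print(cov_definition, cov_shortcut)  # 3/4 3/4
```

The shortcut needs only three means, which is why it is usually the faster route in hand calculations.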

Definition. Random variables $X,Y$ are called uncorrelated if $Cov(X,Y) = 0$.

Uncorrelatedness is close to independence, so the intuition is the same: one variable does not influence the other. You can also say that there is no statistical relationship between uncorrelated variables. The mathematical side is not the same: uncorrelatedness is a more general property than independence.

Property 3. Independent variables are uncorrelated: if $X,Y$ are independent, then $Cov(X,Y) = 0$.
Proof. By the shortcut for covariance and multiplicativity of means for independent variables we have $Cov(X,Y) = EXY - (EX)(EY) = 0$.

Property 4. Correlation with a constant. Any random variable is uncorrelated with any constant: $Cov(X,c) = E(X - EX)(c - Ec) = 0.$

Property 5. Symmetry. Covariance is a symmetric function of its arguments: $Cov(X,Y)=Cov(Y,X)$. This is obvious.

Property 6. Relationship between covariance and variance:

$Cov(X,X)=E(X-EX)(X-EX)=Var(X)$.

25
Oct 16

## Properties of variance

### All properties of variance in one place

Certainty is the mother of quiet and repose, and uncertainty the cause of variance and contentions. Edward Coke

Preliminaries: study properties of means with proofs.

Definition. Yes, uncertainty leads to variance, and we measure it by $Var(X)=E(X-EX)^2$. It is useful to use the name deviation from mean for $X-EX$ and realize that $E(X-EX)=0$, so that the mean of the deviation from mean cannot serve as a measure of variation of $X$ around $EX$.

Property 1. Variance of a linear combination. For any random variables $X,Y$ and numbers $a,b$ one has
(1) $Var(aX + bY)=a^2Var(X)+2abCov(X,Y)+b^2Var(Y).$
The term $2abCov(X,Y)$ in (1) is called an interaction term. See this post for the definition and properties of covariance.
Proof.
$Var(aX + bY)=E[aX + bY -E(aX + bY)]^2$

(using linearity of means)
$=E(aX + bY-aEX -bEY)^2$

(grouping by variable)
$=E[a(X-EX)+b(Y-EY)]^2$

(squaring out)
$=E[a^2(X-EX)^2+2ab(X-EX)(Y-EY)+b^2(Y-EY)^2]$

(using linearity of means and definitions of variance and covariance)
$=a^2Var(X) + 2abCov(X,Y) +b^2Var(Y).$
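
Formula (1) can be verified on numbers. The sketch below uses made-up values and coefficients (all of them assumptions) and computes the variance of $aX+bY$ twice: directly from the definition and through the right side of (1).

```python
from fractions import Fraction

def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def cov(x, y, probs):
    mx, my = mean(x, probs), mean(y, probs)
    return mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)], probs)

def var(x, probs):
    return cov(x, x, probs)

# Hypothetical values with equal probabilities, and coefficients a, b
probs = [Fraction(1, 4)] * 4
x = [1, 2, 3, 4]
y = [2, 1, 4, 3]
a, b = 2, -1

direct = var([a * xi + b * yi for xi, yi in zip(x, y)], probs)
by_formula = (a**2 * var(x, probs)
              + 2 * a * b * cov(x, y, probs)
              + b**2 * var(y, probs))

print(direct, by_formula)  # 13/4 13/4
```

Dropping the $b^2$ factor or the interaction term changes the second number, so the check is sensitive to every term of (1).
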
Property 2. Variance of a sum. Letting in (1) $a=b=1$ we obtain
$Var(X + Y) = Var(X) + 2Cov(X,Y)+Var(Y).$

Property 3. Homogeneity of degree 2. Choose $b=0$ in (1) to get
$Var(aX)=a^2Var(X).$
Exercise. What do you think is larger: $Var(X+Y)$ or $Var(X-Y)$?
Property 4. If we add a constant to a variable, its variance does not change: $Var(X+c)=E[X+c-E(X+c)]^2=E(X+c-EX-c)^2=E(X-EX)^2=Var(X)$
Property 5. Variance of a constant is zero: $Var(c)=E(c-Ec)^2=0$.

Property 6. Nonnegativity. Since the squared deviation from mean $(X-EX)^2$ is nonnegative, its expectation is nonnegative: $E(X-EX)^2\ge 0$.

Property 7. Only a constant can have variance equal to zero: If $Var(X)=0$, then $E(X-EX)^2 =(x_1-EX)^2p_1 +...+(x_n-EX)^2p_n=0$, see the definition of the expected value. Since all probabilities are positive, we conclude that $x_i=EX$ for all $i$, which means that $X$ is identically constant.

Property 8. Shortcut for variance. We have an identity $E(X-EX)^2=EX^2-(EX)^2$. Indeed, squaring out gives

$E(X-EX)^2 =E(X^2-2XEX+(EX)^2)$

(distributing expectation)

$=EX^2-2E(XEX)+E(EX)^2$

(expectation of a constant is constant)

$=EX^2-2(EX)^2+(EX)^2=EX^2-(EX)^2$.

All of the above properties apply to any random variables. The next one is an exception in the sense that it applies only to uncorrelated variables.

Property 9. If variables are uncorrelated, that is $Cov(X,Y)=0$, then from (1) we have $Var(aX + bY)=a^2Var(X)+b^2Var(Y).$ In particular, letting $a=b=1$, we get additivity: $Var(X+Y)=Var(X)+Var(Y).$ Recall that the expected value is always additive.

Generalizations: $Var(\sum a_iX_i)=\sum a_i^2Var(X_i)$ and $Var(\sum X_i)=\sum Var(X_i)$ if all $X_i$ are uncorrelated.

Among my posts, where properties of variance are used, I counted 12 so far.

18
Jul 16

## Properties of conditional expectation

### Background

A company sells a product and may offer a discount. We denote by $X$ the sales volume and by $Y$ the discount amount (per unit). For simplicity, both variables take only two values. They depend on each other. If the sales are high, the discount may be larger. A higher discount, in its turn, may attract more buyers. At the same level of sales, the discount may vary depending on the vendor's costs. With the same discount, the sales vary with consumer preferences. Along with the sales and discount, we consider a third variable that depends on both of them. It can be the profit $\pi$.

### Formalization

The sales volume $X$ takes values $x_1,x_2$ with probabilities $p_i^X=P(X=x_i)$, $i=1,2$. Similarly, the discount $Y$ takes values $y_1,y_2$ with probabilities $p_i^Y=P(Y=y_i)$, $i=1,2$. The joint events have joint probabilities denoted $P(X=x_i,Y=y_j)=p_{i,j}$. The profit in the event $X=x_i,Y=y_j$ is denoted $\pi_{i,j}$. This information is summarized in Table 1.

Table 1. Profit values with joint and marginal probabilities

| | $y_1$ | $y_2$ | |
|---|---|---|---|
| $x_1$ | $\pi_{1,1},\ p_{1,1}$ | $\pi_{1,2},\ p_{1,2}$ | $p_1^X$ |
| $x_2$ | $\pi_{2,1},\ p_{2,1}$ | $\pi_{2,2},\ p_{2,2}$ | $p_2^X$ |
| | $p_1^Y$ | $p_2^Y$ | |

Comments. In the left-most column and upper-most row we have values of the sales and discount. In the "margins" (last row and last column) we put probabilities of those values. In the main body of the table we have profit values and their probabilities. It follows that the expected profit is

(1) $E\pi=\pi_{1,1}p_{1,1}+\pi_{1,2}p_{1,2}+\pi_{2,1}p_{2,1}+\pi_{2,2}p_{2,2}.$

### Conditioning

Suppose that the vendor fixes the discount at $y_1$. Then only the column containing this value is relevant. To get numbers that satisfy the completeness axiom, we define conditional probabilities

$P(X=x_1|Y=y_1)=\frac{p_{11}}{p_1^Y},\ P(X=x_2|Y=y_1)=\frac{p_{21}}{p_1^Y}.$

This allows us to define conditional expectation

(2) $E(\pi|Y=y_1)=\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}.$

Similarly, if the discount is fixed at $y_2$,

(3) $E(\pi|Y=y_2)=\pi_{12}\frac{p_{12}}{p_2^Y}+\pi_{22}\frac{p_{22}}{p_2^Y}.$

Equations (2) and (3) are joined in the notation $E(\pi|Y)$.

Property 1. While the usual expectation (1) is a number, the conditional expectation $E(\pi|Y)$ is a function of the value of $Y$ on which the conditioning is being done. Since it is a function of $Y$, it is natural to consider it a random variable defined by the next table

Table 2. Conditional expectation as a random variable

| Values | Probabilities |
|---|---|
| $E(\pi\mid Y=y_1)$ | $p_1^Y$ |
| $E(\pi\mid Y=y_2)$ | $p_2^Y$ |

Property 2. Law of iterated expectations: the mean of the conditional expectation equals the usual mean. Indeed, using Table 2, we have

$E[E(\pi|Y)]=E(\pi|Y=y_1)p_1^Y+E(\pi|Y=y_2)p_2^Y$ (applying (2) and (3))

$=\left[\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}\right]p_1^Y+\left[\pi_{12}\frac{p_{12}}{p_2^Y}+\pi_{22}\frac{p_{22}}{p_2^Y}\right]p_2^Y$ $=\pi_{1,1}p_{1,1}+\pi_{1,2}p_{1,2}+\pi_{2,1}p_{2,1}+\pi_{2,2}p_{2,2}=E\pi.$
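
The law of iterated expectations can be replayed on a concrete table. In the sketch below the joint probabilities and profit values are invented for illustration (they are assumptions, not data from the post); the marginal probabilities of $Y$ are obtained as column sums, as in Table 1.

```python
from fractions import Fraction as F

# Hypothetical joint probabilities p[i][j] = P(X=x_i, Y=y_j); they sum to 1
p = [[F(1, 10), F(3, 10)],
     [F(2, 10), F(4, 10)]]
# Hypothetical profits pi_{i,j} in the event X=x_i, Y=y_j
profit = [[10, 4],
          [20, 12]]

# Marginal probabilities of Y: column sums of the joint table
pY = [p[0][j] + p[1][j] for j in range(2)]

def cond_exp(j):
    """E(profit | Y = y_j), as in equations (2) and (3)."""
    return sum(profit[i][j] * p[i][j] / pY[j] for i in range(2))

# Usual expectation (1)
E_profit = sum(profit[i][j] * p[i][j] for i in range(2) for j in range(2))
# Mean of the conditional expectation, using Table 2
E_of_cond = sum(cond_exp(j) * pY[j] for j in range(2))

print(cond_exp(0), cond_exp(1))  # 50/3 60/7
print(E_profit, E_of_cond)       # 11 11
```

The two conditional expectations differ from each other and from $E\pi$, yet their weighted average with weights $p_1^Y,p_2^Y$ returns exactly $E\pi$.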

Property 3. Generalized homogeneity. In the usual homogeneity $E(aX)=aEX$, $a$ is a number. In the generalized homogeneity

(4) $E(a(Y)\pi|Y)=a(Y)E(\pi|Y),$

$a(Y)$ is allowed to be a  function of the variable on which we are conditioning. See for yourself: using (2), for instance,

$E(a(y_1)\pi|Y=y_1)=a(y_1)\pi_{11}\frac{p_{11}}{p_1^Y}+a(y_1)\pi_{21}\frac{p_{21}}{p_1^Y}$ $=a(y_1)\left[\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}\right]=a(y_1)E(\pi|Y=y_1).$

Property 4. Additivity. For any random variables $S,T$ we have

(5) $E(S+T|Y)=E(S|Y)+E(T|Y).$

The proof is left as an exercise.

Property 5. Generalized linearity. For any random variables $S,T$ and functions $a(Y),b(Y)$ equations (4) and (5) imply

$E(a(Y)S+b(Y)T|Y)=a(Y)E(S|Y)+b(Y)E(T|Y).$

Property 6. Conditioning in case of independence. This property has to do with the informational aspect of conditioning. The usual expectation (1) takes into account all contingencies. (2) and (3) are based on the assumption that one contingency for $Y$ has been realized, so that the other one becomes irrelevant. Therefore $E(\pi|Y)$ is considered  an updated version of (1) that takes into account the arrival of new information that the value of $Y$ has been fixed. Now we can state the property itself: if $X,Y$ are independent, then $E(X|Y)=EX$, that is, conditioning on $Y$ does not improve our knowledge of $EX$.

Proof. In case of independence we have $p_{i,j}=p_i^Xp_j^Y$ for all $i,j$, so that

$E(X|Y=y_j)=x_1\frac{p_{1j}}{p_j^Y}+x_2\frac{p_{2j}}{p_j^Y}=x_1p_1^X+x_2p_2^X=EX.$
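
The same computation on numbers: the sketch below builds a joint table from hypothetical marginal probabilities by the rule $p_{i,j}=p_i^Xp_j^Y$ and checks that conditioning on either value of $Y$ returns $EX$. All numbers and names are assumptions for illustration.

```python
from fractions import Fraction as F

# Hypothetical sales values and marginal probabilities
x = [100, 250]
pX = [F(1, 3), F(2, 3)]
pY = [F(1, 4), F(3, 4)]

# Independence: joint probabilities are products of the marginal ones
p = [[pX[i] * pY[j] for j in range(2)] for i in range(2)]

EX = sum(x[i] * pX[i] for i in range(2))

def cond_exp_X(j):
    """E(X | Y = y_j) computed from the joint table."""
    return sum(x[i] * p[i][j] / pY[j] for i in range(2))

print(EX, cond_exp_X(0), cond_exp_X(1))  # 200 200 200
```

If the joint table is not of product form, the two conditional expectations generally differ, as in the discount example above.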

Property 7. Conditioning in case of complete dependence. Conditioning of $Y$ on $Y$ gives the most precise information: $E(Y|Y)=Y$ (if we condition $Y$ on $Y$, we know about it everything and there is no averaging). More generally, $E(f(Y)|Y)=f(Y)$ for any deterministic function $f$.

Proof. If we condition $Y$ on $Y$, the conditional probabilities become

$p_{11}=P(Y=y_1|Y=y_1)=1,\ p_{21}=P(Y=y_2|Y=y_1)=0.$

Hence, (2) gives

$E(f(Y)|Y=y_1)=f(y_1)\times 1+f(y_2)\times 0=f(y_1).$

Conditioning on $Y=y_2$ is treated similarly.

### Summary

Not many people know that using the notation $E_Y\pi$ for conditional expectation instead of $E(\pi|Y)$ makes everything much clearer. I rewrite the above properties using this notation:

1. Law of iterated expectations: $E(E_Y\pi)=E\pi$.
2. Generalized homogeneity: $E_Y(a(Y)\pi)=a(Y)E_Y\pi$.
3. Additivity: For any random variables $S,T$ we have $E_Y(S+T)=E_YS+E_YT$.
4. Generalized linearity: For any random variables $S,T$ and functions $a(Y),b(Y)$ one has $E_Y(a(Y)S+b(Y)T)=a(Y)E_YS+b(Y)E_YT$.
5. Conditioning in case of independence: if $X,Y$ are independent, then $E_YX=EX$.
6. Conditioning in case of complete dependence: $E_Yf(Y)=f(Y)$ for any deterministic function $f$.

28
Feb 16

## What is a mean value - all means in one place

In introductory Stats texts, various means are scattered all over the place, and there is no indication of links between them. This is what we address here.

The population mean of a discrete random variable is the starting point. Such a variable, by definition, is a table values+probabilities, see this post, and its mean is $EX=\sum_{i=1}^nX_ip_i$. If that random variable is uniformly distributed, in the same post we explain that $EX=\bar{X}$, so the sample mean is a special case of a population mean.

The next point is the link between the grouped data formula and sample mean. Recall the procedure for finding absolute frequencies. Let $Y_1,...,Y_n$ be the values in the sample (it is convenient to assume that they are arranged in an ascending order). Equal values are joined in groups. Let $X_1,...,X_m$ denote the distinct values in the sample and $n_1,...,n_m$ their absolute frequencies. Their total is, clearly, $n$. The sample mean is
$\bar{Y}=(Y_1+...+Y_n)/n$
(sorting out $Y$'s into groups with equal values)
$=\left(\overbrace {X_1+...+X_1}^{n_1{\rm{\ times}}}+...+\overbrace{X_m+...+X_m}^{n_m{\rm{\ times}}}\right)/n$
$=(n_1X_1 + ... + n_mX_m)/n,$

which is the grouped data formula. We have shown that the grouped data formula obtains as a special case of the sample mean when equal values are joined into groups.

Next, denoting $r_i=n_i/n$ the relative frequencies, we get

$(n_1X_1 + ... + n_mX_m)/n=$

(dividing through by $n$)

$=r_1X_1+...+r_mX_m.$

If we accept the relative frequencies as probabilities, then this becomes the population mean. Thus, with this convention, the grouped data formula and population mean are the same.
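
The chain sample mean, grouped data formula, mean with relative frequencies can be traced on a small sample. The sample below is made up; `Counter` from the standard library does the grouping of equal values.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical sample Y_1, ..., Y_n in ascending order
sample = [2, 2, 3, 5, 5, 5, 7, 7]
n = len(sample)

# Sample mean: (Y_1 + ... + Y_n) / n
sample_mean = Fraction(sum(sample), n)

# Grouped data formula: (n_1 X_1 + ... + n_m X_m) / n
counts = Counter(sample)  # distinct value X_i -> absolute frequency n_i
grouped_mean = Fraction(sum(n_i * x_i for x_i, n_i in counts.items()), n)

# Relative frequencies r_i = n_i / n used as probabilities
rel_freq = {x_i: Fraction(n_i, n) for x_i, n_i in counts.items()}
mean_with_rel_freq = sum(r_i * x_i for x_i, r_i in rel_freq.items())

print(sample_mean, grouped_mean, mean_with_rel_freq)  # 9/2 9/2 9/2
```

The three numbers coincide because they are the same sum written in three ways.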

Finally, the mean of a continuous random variable $X$ which has a density $p_X$ is defined by $EX=\int_{-\infty}^\infty tp_X(t)dt$. In Section 6.3 of my book it is shown that the mean of a continuous random variable is a limit of grouped means.

### Conclusion

Properties of means apply equally to all mean types.

17
Mar 19

## AP Statistics the Genghis Khan way 2

Last semester I tried to explain theory through numerical examples. The results were terrible. Even the best students didn't live up to my expectations. The midterm grades were so low that I did something I had never done before: I allowed my students to write an analysis of the midterm at home. Those who were able to verbally articulate the answers to me received a bonus that allowed them to pass the semester.

This semester I made a U-turn. I announced that in the first half of the semester we would concentrate on theory, and we followed this methodology. Out of 35 students, 20 significantly improved their performance and 15 remained where they were.

### Midterm exam, version 1

#### 1. General density definition (6 points)

a. Define the density $p_X$ of a random variable $X.$ Draw the density of heights of adults, making simplifying assumptions if necessary. Don't forget to label the axes.

b. According to your plot, how much is the integral $\int_{-\infty}^0p_X(t)dt?$ Explain.

c. Why can't the density be negative?

d. Why should the total area under the density curve be 1?

e. Where are basketball players on your graph? Write down the corresponding expression for probability.

f. Where are dwarfs on your graph? Write down the corresponding expression for probability.

This question is about the interval formula. In each case students have to write the equation for the probability and the corresponding integral of the density. At this level, I don't talk about the distribution function and introduce the density by the interval formula.

#### 2. Properties of means (8 points)

a. Define a discrete random variable and its mean.

b. Define linear operations with random variables.

c. Prove linearity of means.

d. Prove additivity and homogeneity of means.

e. How much is the mean of a constant?

f. Using induction, derive the linearity of means for the case of $n$ variables from the case of two variables (3 points).

#### 3. Covariance properties (6 points)

a. Derive linearity of covariance in the first argument when the second is fixed.

b. How much is covariance if one of its arguments is a constant?

c. What is the link between variance and covariance? If you know one of these functions, can you find the other (there should be two answers)? (4 points)

#### 4. Standard normal variable (6 points)

a. Define the density $p_z(t)$ of a standard normal.

b. Why is the function $p_z(t)$ even? Illustrate this fact on the plot.

c. Why is the function $f(t)=tp_z(t)$ odd? Illustrate this fact on the plot.

d. Justify the equation $Ez=0.$

e. Why is $V(z)=1?$

f. Let $t>0.$ Show on the same plot areas corresponding to the probabilities $A_1=P(0<z<t),$ $A_2=P(z>t),$ $A_3=P(z<-t),$ $A_4=P(-t<z<0).$ Write down the relationships between $A_1,...,A_4.$

#### 5. General normal variable (3 points)

a. Define a general normal variable $X.$

b. Use this definition to find the mean and variance of $X.$

c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters $\sigma =2,$ $\mu =3.$

### Midterm exam, version 2

#### 1. General density definition (6 points)

a. Define the density $p_X$ of a random variable $X.$ Draw the density of work experience of adults, making simplifying assumptions if necessary. Don't forget to label the axes.

b. According to your plot, how much is the integral $\int_{-\infty}^0p_X(t)dt?$ Explain.

c. Why can't the density be negative?

d. Why should the total area under the density curve be 1?

e. Where are retired people on your graph? Write down the corresponding expression for probability.

f. Where are young people (up to 25 years old) on your graph? Write down the corresponding expression for probability.

#### 2. Variance properties (8 points)

a. Define variance of a random variable. Why is it non-negative?

b. State the formula for variance of a linear combination of two variables.

c. How much is variance of a constant?

d. What is the formula for variance of a sum? What do we call homogeneity of variance?

e. What is larger: $V(X+Y)$ or $V(X-Y)$? (2 points)

f. One investor has 100 shares of Apple, another - 200 shares. Which investor's portfolio has larger variability? (2 points)

#### 3. Poisson distribution (6 points)

a. Write down the Taylor expansion and explain the idea. How are the Taylor coefficients found?

b. Use the Taylor series for the exponential function to define the Poisson distribution.

c. Find the mean of the Poisson distribution. What is the interpretation of the parameter $\lambda$ in practice?

#### 4. Standard normal variable (6 points)

a. Define the density $p_z(t)$ of a standard normal.

b. Why is the function $p_z(t)$ even? Illustrate this fact on the plot.

c. Why is the function $f(t)=tp_z(t)$ odd? Illustrate this fact on the plot.

d. Justify the equation $Ez=0.$

e. Why is $V(z)=1?$

f. Let $t>0.$ Show on the same plot areas corresponding to the probabilities $A_1=P(0<z<t),$ $A_2=P(z>t),$ $A_{3}=P(z<-t),$ $A_4=P(-t<z<0).$ Write down the relationships between $A_{1},...,A_{4}.$

#### 5. General normal variable (3 points)

a. Define a general normal variable $X.$

b. Use this definition to find the mean and variance of $X.$

c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters $\sigma =2,$ $\mu =3.$

2
Aug 18

# Basic statistics

AP Statistics the Genghis Khan way 1

AP Statistics the Genghis Khan way 2

Descriptive statistics and inferential statistics

Numerical versus categorical variable

Uniform distribution definition, with examples

### Using graphs to describe data

What should you hate about AP Statistics? The TI-83+ and TI-84 graphing calculators are terrible

How to prevent cheating with TI-83+ and TI-84

Minitab is overpriced. Use Excel instead

What is a Pareto chart and how is it different from a histogram?

The stem-and-leaf plot is an archaism - it's time to leave it behind

Histogram versus time series plot, with video

Comparing histogram, Pareto chart and times series plot

Using statistical tables for normal distribution

### Probability

Little tricks for AP Statistics

What is probability. Includes sample space; elementary, impossible, sure events; completeness axiom,  de Morgan’s laws, link between logic and geometry

Independence of events. Includes conditional probability, multiplication rule and visual illustration of independence

Law of total probability - you could have invented this

Significance level and power of test

Reevaluating probabilities based on piece of evidence

p value definition

### Using numerical measures to describe data

What is a median, with an exercise

Using financial examples to explain properties of sample means

Properties of means

What is a mean value. All means in one place: population mean, sample mean, grouped data formula, mean of a continuous random variable

Unbiasedness definition, with intuition

Marginal probabilities and densities

All properties of variance in one place

Variance of a vector: motivation and visualization

Different faces of vector variance: again visualization helps

Inductive introduction to Chebyshev inequality

Properties of covariance

Properties of standard deviation

Correlation coefficient: the last block of statistical foundation

Statistical measures and their geometric roots

Population mean versus sample mean: summary comparison

Mean plus deviation-from-mean decomposition

Scaling a distribution

What is a z score: the scientific explanation

What is a binomial random variable - analogy with market demand

Active learning - away from boredom of lectures, with Excel file and video. How to simulate several random variables at the same time.

From independence of events to independence of random variables. Includes multiplicativity of means and additivity of variance

Normal distributions. Includes standard normal distribution, (general) normal variable, linear transformation and their properties, video and Mathematica file

Definitions of chi-square, t statistic and F statistic

Student's t distribution: one-line explanation of its origin

Confidence interval and margin of error derivation using z-score. Includes confidence and significance levels, critical value

Confidence interval using t statistic: attach probability or not attach?

### Distribution function

Distribution function properties

Density function properties

Examples of distribution functions

Distribution and density functions of a linear transformation

Binary choice models

Binary choice models: theoretical obstacles

### Maximum likelihood

Maximum likelihood: idea and life of a bulb

Maximum likelihood: application to linear model

### Conditioning

Properties of conditional expectation

Conditional expectation generalized to continuous random variables

Conditional variance properties

### Simulation of random variables

Importance of simulation in Excel for elementary stats courses

Generating the Bernoulli random variable (coin), with Excel file

Creating frequency table and histogram and using Excel macros, with Excel file

Modeling a sample from a normal distribution, with Excel file

### Sampling distributions

Demystifying sampling distributions: too much talking about nothing

### Law of large numbers and central limit theorem

Law of large numbers explained

Law of large numbers illustrated

Law of large numbers: the mega delusion of AP Statistics, with Excel file

All about the law of large numbers. Includes convergence in probability, preservation of arithmetic operations and application to simple regression

Central Limit Theorem versus Law of Large Numbers. Includes convergence in distribution and Excel file

Law of large numbers proved

24
Jan 17

## Regressions with stochastic regressors 2

### Regressions with stochastic regressors 2: two approaches

We consider the slope estimator for the simple regression

$y_i=a+bx_i+e_i$

assuming that $x_i$ is stochastic.

First approach: the sample size is fixed. The unbiasedness and efficiency conditions are replaced by their analogs conditioned on $x$. The outcome is that the slope estimator is unbiased and its variance is the average of the variance that we have in the case of a deterministic regressor. See the details.

Second approach: the sample size goes to infinity. The main tools used are the properties of probability limits and laws of large numbers. The outcome is that, in the limit, the sample characteristics are replaced by their population cousins and the slope estimator is consistent. This is what we focus on here.

### A brush-up on convergence in probability

Review the intuition and formal definition. This is the summary:

Fact 1. Convergence in probability (which applies to sequences of random variables) is a generalization of the notion of convergence of number sequences. In particular, if $\{a_n\}$ is a numerical sequence that converges to a number $a$, that is, $\lim_{n\rightarrow\infty}a_n=a$, then, treating $a_n$ as a random variable, we have convergence in probability ${\text{plim}}_{n\rightarrow\infty}a_n=a$.
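A short justification of Fact 1, given as a sketch (the symbols $\varepsilon$ and $N$ are introduced here for illustration only). Fix any $\varepsilon>0$. Since $a_n$ converges to $a$, there is an $N$ such that $|a_n-a|<\varepsilon$ for all $n\ge N$. A number treated as a random variable takes its single value with probability 1, so

$P(|a_n-a|\ge\varepsilon)=0$ for all $n\ge N.$

Hence $\lim_{n\rightarrow\infty}P(|a_n-a|\ge\varepsilon)=0$ for every $\varepsilon>0$, which is exactly the statement ${\text{plim}}_{n\rightarrow\infty}a_n=a$.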

Fact 2. For those who are familiar with the theory of limits of numerical sequences, from the previous fact it should be clear that convergence in probability preserves arithmetic operations. That is, for any sequences of random variables $\{X_n\},\{Y_n\}$ such that the limits ${\text{plim}}X_n$ and ${\text{plim}}Y_n$ exist, we have

$\text{plim}(X_n\pm Y_n)=\text{plim}X_n\pm\text{plim}Y_n,$ $\text{plim}(X_n\times Y_n)=\text{plim}X_n\times\text{plim}Y_n,$

and if $\text{plim}Y_n\ne 0$ then

$\text{plim}(X_n/ Y_n)=\text{plim}X_n/\text{plim}Y_n.$

This makes convergence in probability very handy. Convergence in distribution doesn't have such properties.
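Fact 2 can be checked numerically. The following is a minimal sketch, not part of the original argument: the choice of distributions (uniform on $(0,1)$ for $X_n$, exponential with mean 2 for $Y_n$), the helper names `sample_mean`, `x_bar`, `y_bar` and the sample sizes are all assumptions made for illustration. Here $X_n$ and $Y_n$ are sample means, so $\text{plim}X_n=0.5$ and $\text{plim}Y_n=2$, and the sum, product and ratio should approach $2.5$, $1$ and $0.25$.

```python
import random

def sample_mean(draw, n):
    """Average of n independent draws produced by the zero-argument callable `draw`."""
    return sum(draw() for _ in range(n)) / n

rng = random.Random(0)  # fixed seed, so reruns give the same numbers

def x_bar(n):
    # X_n: mean of n Uniform(0, 1) draws; plim X_n = 0.5
    return sample_mean(rng.random, n)

def y_bar(n):
    # Y_n: mean of n exponential draws with rate 0.5 (mean 2); plim Y_n = 2
    return sample_mean(lambda: rng.expovariate(0.5), n)

for n in (100, 10_000, 200_000):
    xn, yn = x_bar(n), y_bar(n)
    # The columns should approach 2.5, 1 and 0.25 as n grows
    print(n, round(xn + yn, 3), round(xn * yn, 3), round(xn / yn, 3))
```

For small $n$ the three columns fluctuate noticeably; for large $n$ they settle near the values obtained by applying the arithmetic operations to the limits.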

### A brush-up on laws of large numbers

See the site map for several posts about this. Here we apply the Chebyshev inequality to prove the law of large numbers for sample means. A generalization is given in the Theorem at the end of that post. Here is a further intuitive generalization:

Normally, unbiased sample characteristics converge in probability to their population counterparts.

Example 1. We know that the sample variance $s^2=\frac{1}{n-1}\sum(X_i-\bar{X})^2$ unbiasedly estimates the population variance $\sigma^2$: $Es^2=\sigma^2$. The intuitive generalization says that then

(1) $\text{plim}s^2=\sigma^2$.

Here I argue that, for the purposes of obtaining some identities from the general properties of means, instead of the sample variance it's better to use the variance defined by $Var_u(X)=\frac{1}{n}\sum(X_i-\bar{X})^2$ (with division by $n$ instead of $n-1$). Using Facts 1 and 2 we get from (1) that

(2) $\text{plim}Var_u(X)=\text{plim}\frac{n-1}{n}\frac{1}{n-1}\sum(X_i-\bar{X})^2$

$=\text{plim}(1-\frac{1}{n})s^2=\text{plim}s^2=\sigma^2=Var(X)$

(sample variance converges in probability to population variance). Here we use $\lim(1-\frac{1}{n})=1$.
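Equation (2) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the sample is drawn from a normal distribution with $\mu=3$, $\sigma=2$ (so $\sigma^2=4$), and the function names `var_u` and `s2` are chosen here to mirror $Var_u(X)$ and $s^2$.

```python
import random

def var_u(xs):
    """Var_u(X): sum of squared deviations from the sample mean, divided by n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def s2(xs):
    """Sample variance s^2: the same sum divided by n - 1."""
    n = len(xs)
    return var_u(xs) * n / (n - 1)

rng = random.Random(1)  # fixed seed for reproducibility
mu, sigma = 3.0, 2.0    # assumed parameters; population variance is sigma**2 = 4

for n in (10, 1_000, 100_000):
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    # Both estimates should approach 4, and the gap between them should vanish
    print(n, round(s2(xs), 4), round(var_u(xs), 4))
```

The two columns differ by the factor $1-\frac{1}{n}$, which is visible for $n=10$ and negligible for large $n$.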

Example 2. Similarly, sample covariance converges in probability to population covariance:

(3) $\text{plim}Cov_u(X,Y)=Cov(X,Y)$

where by definition $Cov_u(X,Y)=\frac{1}{n}\sum(X_i-\bar{X})(Y_i-\bar{Y})$.

### Proving consistency of the slope estimator

Here (see equation (5)) I derived the representation of the OLS estimator of the slope

$\hat{b}=b+\frac{Cov_u(X,e)}{Var_u(X)}$

Using preservation of arithmetic operations for convergence in probability, we get

(4) $\text{plim}\hat{b}=\text{plim}\left[b+\frac{Cov_u(X,e)}{Var_u(X)}\right]=\text{plim}b+\text{plim}\frac{Cov_u(X,e)}{Var_u(X)}$

$=b+\frac{\text{plim}Cov_u(X,e)}{\text{plim}Var_u(X)}=b+\frac{Cov(X,e)}{Var(X)}.$

In the last line we used (2) and (3). From (4) we see what conditions should be imposed for the slope estimator to converge to a spike at the true slope:

$Var(X)\neq 0$ (existence condition)

and

$Cov(X,e)=0$ (consistency condition).

Under these conditions, we have $\text{plim}\hat{b}=b$ (this is called consistency).
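The role of the consistency condition can be seen in a simulation. The sketch below rests on assumptions that are not in the original derivation: $X$ is standard normal (so $Var(X)=1$), the error is generated as $e=\rho X+\text{noise}$ with independent standard normal noise (so $Cov(X,e)=\rho$), and the names `cov_u`, `slope_hat`, `simulate` and the parameter `rho` are hypothetical. By (4), the slope estimator should approach $b$ when $\rho=0$ and $b+\rho$ otherwise.

```python
import random

def cov_u(xs, ys):
    """Cov_u(X, Y): mean product of deviations from the sample means (division by n)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def slope_hat(xs, ys):
    """OLS slope estimator written as Cov_u(X, y) / Var_u(X)."""
    return cov_u(xs, ys) / cov_u(xs, xs)

def simulate(n, a, b, rho, rng):
    """Draw n observations of y = a + b*x + e with e = rho*x + noise.

    X and the noise are independent N(0, 1), so Var(X) = 1 and Cov(X, e) = rho.
    """
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    es = [rho * x + rng.gauss(0.0, 1.0) for x in xs]
    ys = [a + b * x + e for x, e in zip(xs, es)]
    return xs, ys

rng = random.Random(2)  # fixed seed for reproducibility
a, b = 1.0, 2.0         # assumed true intercept and slope

for n in (100, 10_000, 200_000):
    consistent = slope_hat(*simulate(n, a, b, 0.0, rng))    # Cov(X, e) = 0
    inconsistent = slope_hat(*simulate(n, a, b, 0.5, rng))  # Cov(X, e) = 0.5
    # First estimate should approach b = 2, second b + 0.5 = 2.5
    print(n, round(consistent, 3), round(inconsistent, 3))
```

When the consistency condition fails, increasing the sample size does not help: the estimator converges, but to $b+\frac{Cov(X,e)}{Var(X)}$ rather than to $b$.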

Conclusion. In a way, the second approach is technically simpler than the first.