2 Jan 17

## Conditional variance properties

### Preliminaries

Review Properties of conditional expectation, especially the summary, where I introduce a new notation for conditional expectation. Everywhere I use the notation $E_Y\pi$ for the expectation of $\pi$ conditional on $Y$, instead of $E(\pi|Y)$.

This post and the previous one on conditional expectation show that conditioning is a pretty advanced notion. Many introductory books use the condition $E_xu=0$ (the expected value of the error term $u$ conditional on the regressor $x$ is zero). Because of the complexity of conditioning, I think it's better to avoid this kind of assumption as much as possible.

### Conditional variance properties

Replacing usual expectations by their conditional counterparts in the definition of variance, we obtain the definition of conditional variance:

(1) $Var_Y(X)=E_Y(X-E_YX)^2.$

Property 1. If $X,Y$ are independent, then $E_YX=EX$ and $X-EX$ is independent of $Y$, so conditioning doesn't change the variance:

$Var_Y(X)=E_Y(X-E_YX)^2=E_Y(X-EX)^2=E(X-EX)^2=Var(X).$

Property 2. Generalized homogeneity of degree 2: if $a$ is a deterministic function, then $a^2(Y)$ can be pulled out:

$Var_Y(a(Y)X)=E_Y[a(Y)X-E_Y(a(Y)X)]^2=E_Y[a(Y)X-a(Y)E_YX]^2$

$=E_Y[a^2(Y)(X-E_YX)^2]=a^2(Y)E_Y(X-E_YX)^2=a^2(Y)Var_Y(X).$

Property 3. Shortcut for conditional variance:

(2) $Var_Y(X)=E_Y(X^2)-(E_YX)^2.$

Proof.

$Var_Y(X)=E_Y(X-E_YX)^2=E_Y[X^2-2XE_YX+(E_YX)^2]$

(distributing conditional expectation)

$=E_YX^2-2E_Y(XE_YX)+E_Y(E_YX)^2$

(applying Properties 2 and 6 from this Summary with $a(Y)=E_YX$)

$=E_YX^2-2(E_YX)^2+(E_YX)^2=E_YX^2-(E_YX)^2.$

Property 4. The law of total variance:

(3) $Var(X)=Var(E_YX)+E[Var_Y(X)].$

Proof. By the shortcut for usual variance and the law of iterated expectations,

$Var(X)=EX^2-(EX)^2=E[E_Y(X^2)]-[E(E_YX)]^2$

(replacing $E_Y(X^2)$ from (2))

$=E[Var_Y(X)]+E(E_YX)^2-[E(E_YX)]^2$

(the last two terms give the shortcut for variance of $E_YX$)

$=E[Var_Y(X)]+Var(E_YX).$

Before we move further we need to define conditional covariance by

$Cov_Y(S,T) = E_Y(S - E_YS)(T - E_YT)$

(everywhere usual expectations are replaced by conditional ones). We say that random variables $S,T$ are conditionally uncorrelated if $Cov_Y(S,T) = 0$.

Property 5. Conditional variance of a linear combination. For any random variables $S,T$ and functions $a(Y),b(Y)$ one has

$Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+2a(Y)b(Y)Cov_Y(S,T)+b^2(Y)Var_Y(T).$

The proof is quite similar to that for usual variances, so we leave it to the reader. In particular, if $S,T$ are conditionally uncorrelated, then the interaction term disappears:

$Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+b^2(Y)Var_Y(T).$
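The law of total variance from Property 4 is easy to verify numerically. Below is a minimal sketch in Python on a hypothetical 2x2 joint distribution of $X,Y$ (the numbers are made up for illustration); it computes both sides of (3) directly from the table.

```python
# Numerical check of the law of total variance
# Var(X) = Var(E_Y X) + E[Var_Y(X)]
# on a small hypothetical joint distribution of X and Y.

# joint probabilities p[(x, y)] = P(X = x, Y = y)
p = {(1, 0): 0.1, (2, 0): 0.3, (1, 1): 0.4, (2, 1): 0.2}

xs = sorted({x for x, _ in p})
ys = sorted({y for _, y in p})

# marginal distribution of Y
pY = {y: sum(p[(x, y)] for x in xs) for y in ys}

# unconditional moments of X
EX = sum(x * p[(x, y)] for x, y in p)
EX2 = sum(x**2 * p[(x, y)] for x, y in p)
VarX = EX2 - EX**2                      # shortcut for usual variance

# conditional moments: E_Y X and Var_Y(X) as functions of the value of Y
E_Y_X = {y: sum(x * p[(x, y)] for x in xs) / pY[y] for y in ys}
E_Y_X2 = {y: sum(x**2 * p[(x, y)] for x in xs) / pY[y] for y in ys}
Var_Y_X = {y: E_Y_X2[y] - E_Y_X[y]**2 for y in ys}   # shortcut (2)

# Var(E_Y X): variance of the variable equal to E_Y_X[y] with probability pY[y]
mean_cond = sum(E_Y_X[y] * pY[y] for y in ys)        # law of iterated expectations
Var_EYX = sum(E_Y_X[y]**2 * pY[y] for y in ys) - mean_cond**2

# E[Var_Y(X)]
E_VarYX = sum(Var_Y_X[y] * pY[y] for y in ys)

assert abs(mean_cond - EX) < 1e-12                   # E[E_Y X] = EX
assert abs(VarX - (Var_EYX + E_VarYX)) < 1e-12       # law of total variance
```

Changing the four joint probabilities (keeping their sum equal to one) does not break the asserts, which is the point: the identity holds for any distribution.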

13 Oct 16

## Properties of means

Properties of means, covariances and variances are the bread and butter of professionals. Here we consider the bread: the means.

### Properties of means: as simple as playing with tables

Definition of a random variable. When my Brazilian students asked for an intuitive definition of a random variable, I said: it is a function whose values are unpredictable. Therefore it is prohibited to work with its values, and we are allowed to work only with its various means. For proofs we need a more technical definition: it is a table of values plus probabilities, of the type of Table 1.

Table 1.

| Values of $X$ | Probabilities |
|---|---|
| $x_1$ | $p_1$ |
| ... | ... |
| $x_n$ | $p_n$ |

Note: The complete form of writing ${p_i}$ is $P(X = {x_i})$.

Definition of the mean (or expected value): $EX = x_1p_1 + ... + x_np_n = \sum\limits_{i = 1}^nx_ip_i.$ In words, this is a weighted sum of values, where the weights $p_i$ reflect the importance of the corresponding $x_i$.

Note: The expected value is a function whose argument is a complex object (it is described by Table 1) and the value is simple: $EX$ is just a number. And it is not a product of $E$ and $X$! See how different means fit this definition.

Definition of a linear combination. See here the financial motivation. Suppose that $X,Y$ are two discrete random variables with the same probability distribution ${p_1},...,{p_n}$. Let $a,b$ be real numbers. The random variable $aX + bY$ is called a linear combination of $X,Y$ with coefficients $a,b$. Its special cases are $aX$ ($X$ scaled by $a$) and $X + Y$ (a sum of $X$ and $Y$). The detailed definition is given by Table 2.

Table 2.

| Values of $X$ | Values of $Y$ | Probabilities | $aX$ | $X + Y$ | $aX + bY$ |
|---|---|---|---|---|---|
| $x_1$ | $y_1$ | $p_1$ | $ax_1$ | $x_1 + y_1$ | $ax_1 + by_1$ |
| ... | ... | ... | ... | ... | ... |
| $x_n$ | $y_n$ | $p_n$ | $ax_n$ | $x_n + y_n$ | $ax_n + by_n$ |

Note: The situation when the probability distributions are different is reduced to the case when they are the same, see my book.

Property 1. Linearity of means. For any random variables $X,Y$ and any numbers $a,b$ one has

(1) $E(aX + bY) = aEX + bEY$.

Proof. This is one of those straightforward proofs when knowing the definitions and starting with the left-hand side is enough to arrive at the result. Using the definitions in Table 2, the mean of the linear combination is
$E(aX + bY)= (a{x_1} + b{y_1}){p_1} + ... + (a{x_n} + b{y_n}){p_n}$

(distributing probabilities)
$= a{x_1}{p_1} + b{y_1}{p_1} + ... + a{x_n}{p_n} + b{y_n}{p_n}$

(grouping by variables)
$= (a{x_1}{p_1} + ... + a{x_n}{p_n}) + (b{y_1}{p_1} + ... + b{y_n}{p_n})$

(pulling out constants)
$= a({x_1}{p_1} + ... + {x_n}{p_n}) + b({y_1}{p_1} + ... + {y_n}{p_n})=aEX+bEY.$

See applications: one, and two, and three.

Generalization to the case of a linear combination of $n$ variables:

$E({a_1}{X_1} + ... + {a_n}{X_n}) = {a_1}E{X_1} + ... + {a_n}E{X_n}$.

Special cases. a) Letting $a = b = 1$ in (1) we get $E(X + Y) = EX + EY$. This is called additivity. See an application. b) Letting in (1) $b = 0$ we get $E(aX) = aEX$. This property is called homogeneity of degree 1 (you can pull the constant out of the expected value sign). Ask your students to deduce linearity from homogeneity and additivity.
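Linearity is easy to verify on a small table of type Table 2. Here is a sketch with made-up values, probabilities and coefficients; it repeats the proof above row by row.

```python
# Check E(aX + bY) = aEX + bEY on a hypothetical table of type Table 2:
# X and Y are listed against the same probabilities p_i.
x = [1.0, 2.0, 5.0]
y = [0.5, -1.0, 3.0]
p = [0.2, 0.5, 0.3]          # probabilities, sum to 1
a, b = 2.0, -3.0

EX = sum(xi * pi for xi, pi in zip(x, p))
EY = sum(yi * pi for yi, pi in zip(y, p))

# left-hand side: mean of the linear combination aX + bY, row by row
lhs = sum((a * xi + b * yi) * pi for xi, yi, pi in zip(x, y, p))

assert abs(lhs - (a * EX + b * EY)) < 1e-12   # linearity of means
```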

Property 2. Expected value of a constant. Everybody knows what a constant is. Ask your students what a constant is in terms of Table 1. The mean of a constant is that constant, because a constant doesn't change, rain or shine: $Ec = c{p_1} + ... + c{p_n} = c({p_1} + ... + {p_n}) = c$ (we have used the completeness axiom $p_1+...+p_n=1$). In particular, it follows that $E(EX)=EX$.

Property 3. The expectation operator preserves order: if $x_i\ge y_i$ for all $i$, then $EX\ge EY$. In particular, the mean of a nonnegative random variable is nonnegative: if $x_i\ge 0$ for all $i$, then $EX\ge 0$.

Indeed, using the fact that all probabilities are nonnegative, we get $EX = x_1p_1 + ... + x_np_n\ge y_1p_1 + ... + y_np_n=EY$.

Property 4. For independent variables, we have $EXY=(EX)(EY)$ (multiplicativity), which has important implications on its own.

The best thing about the above properties is that, although we proved them under simplified assumptions, they are always true. We keep in mind that the expectation operator $E$ is the device used by Mother Nature to measure the average, and most of the time she keeps hidden from us both the probabilities and the average $EX$.

29 Sep 16

## Definitions of chi-square, t statistic and F statistic

Definitions of the standard normal distribution and independence can be combined to produce definitions of chi-square, t statistic and F statistic. The similarity of the definitions makes them easier to study.

### Independence of continuous random variables

Definition of independent discrete random variables easily modifies for the continuous case. Let $X,Y$ be two continuous random variables with densities $p_X,\ p_Y$, respectively. We say that these variables are independent if the density $p_{X,Y}$ of the pair $(X,Y)$ is a product of individual densities:

(1) $p_{X,Y}(s,t)=p_X(s)p_Y(t)$ for all $s,t.$

As in this post, equation (1) can be understood in two ways. If (1) is given, then $X,Y$ are independent. Conversely, if we want them to be independent, we can define the density of the pair by equation (1). This definition readily generalizes for the case of many variables. In particular, if we want variables $z_1,...,z_n$ to be standard normal and independent, we say that each of them has density defined here and the joint density $p_{z_1,...,z_n}$ is a product of individual densities.
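A simulation sketch of equation (1) in action: for independent $X,Y$, the probability of a rectangle factorizes, $P(X\in I,\ Y\in J)=P(X\in I)P(Y\in J)$. The intervals below are chosen arbitrarily, and the check is approximate by construction.

```python
# Empirical check of the factorization (1) for two independent variables:
# estimate P(X in I, Y in J) and P(X in I) * P(Y in J) by simulation.
import random

random.seed(2)
reps = 200_000
in_I = in_J = in_both = 0
for _ in range(reps):
    x = random.gauss(0.0, 1.0)          # X standard normal
    y = random.gauss(0.0, 1.0)          # Y standard normal, drawn independently
    hit_I = -1.0 <= x <= 0.5            # I = [-1, 0.5], arbitrary interval
    hit_J = 0.0 <= y <= 2.0             # J = [0, 2], arbitrary interval
    in_I += hit_I
    in_J += hit_J
    in_both += hit_I and hit_J

PI, PJ, PIJ = in_I / reps, in_J / reps, in_both / reps
assert abs(PIJ - PI * PJ) < 0.01        # product rule, up to simulation error
```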

### Definition of chi-square variable

Figure 1. chi-square with 1 degree of freedom

Let $z_1,...,z_n$ be standard normal and independent. Then the variable $\chi^2_n=z_1^2+...+z_n^2$ is called a chi-square variable with $n$ degrees of freedom. Obviously, $\chi^2_n\ge 0$, which means that its density is zero to the left of the origin. For low values of degrees of freedom, the density is not bounded near the origin, see Figure 1.

### Definition of t distribution

Figure 2. t distribution and standard normal compared

Let $z_0,z_1,...,z_n$ be standard normal and independent. Then the variable $t_n=\frac{z_0}{\sqrt{(z_1^2+...+z_n^2)/n}}$ is called a t statistic with $n$ degrees of freedom. The density of the t distribution is bell-shaped and for low $n$ has fatter tails than the standard normal. For high $n$, it approaches that of the standard normal, see Figure 2.

### Definition of F distribution

Figure 3. F distribution with (1,m) degrees of freedom

Let $u_1,...,u_n,v_1,...,v_m$ be standard normal and independent. Then the variable $F_{n,m}=\frac{(u_1^2+...+u_n^2)/n}{(v_1^2+...+v_m^2)/m}$ is called an F statistic with $(n,m)$ degrees of freedom. It is nonnegative and its density is zero to the left of the origin. When $n$ is low, the density is not bounded in the neighborhood of zero, see Figure 3.

The Mathematica file and video better illustrate the densities of these three variables.

### Consequences

1. If $\chi^2_n$ and $\chi^2_m$ are independent, then $\chi^2_n+\chi^2_m$ is $\chi^2_{n+m}$ (addition rule). This rule is applied in the theory of ANOVA models.
2. $t_n^2=F_{1,n}$. This is an easy proof of equation (2.71) from Introduction to Econometrics, by Christopher Dougherty, published by Oxford University Press, UK, in 2016.
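These definitions are easy to play with in code. The sketch below (standard library only, parameters chosen arbitrarily) builds chi-square, t and F variables from the same independent standard normals; it checks consequence 2, $t_n^2=F_{1,n}$, draw by draw, and the known fact that $E(\chi^2_n)=n$, approximately, by simulation.

```python
# Build chi-square, t and F from independent standard normals, as in the
# definitions above, and check t_n^2 = F_{1,n} exactly for every draw.
import math
import random

random.seed(0)
n, reps = 5, 200_000

chi2_draws = []
for _ in range(reps):
    z = [random.gauss(0.0, 1.0) for _ in range(n + 1)]  # z_0, z_1, ..., z_n
    chi2 = sum(zi**2 for zi in z[1:])                   # chi-square with n d.f.
    t = z[0] / math.sqrt(chi2 / n)                      # t with n d.f.
    F = z[0]**2 / (chi2 / n)                            # F with (1, n) d.f.
    assert math.isclose(t**2, F, rel_tol=1e-9)          # t_n^2 = F_{1,n}
    chi2_draws.append(chi2)

# E(chi^2_n) = n, up to simulation error
mean_chi2 = sum(chi2_draws) / reps
assert abs(mean_chi2 - n) < 0.05
```

Note that the equality $t_n^2=F_{1,n}$ holds realization by realization, not just in distribution, because both sides are built from the same normals.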
15 Sep 16

## The pearls of AP Statistics 28

### From independence of events to independence of random variables

One way to avoid complex math is to show the students simplified, plausible derivations which create an appearance of rigor and provide enough ground for intuition. This is what I try to do here.

### Independence of random variables

Let $X,Y$ be two random variables. Suppose $X$ takes values $x_1,x_2$ with probabilities $P(X=x_i)=p_i$. Similarly, $Y$ takes values $y_1,y_2$ with probabilities $P(Y=y_i)=q_i$. Now we want to consider a pair $(X,Y)$. The pair can take values $(x_i,y_j)$ where $i,j$ take values $1,2$. These are joint events with probabilities denoted $P(X=x_i,Y=y_j)=p_{i,j}$.

Definition. $X,Y$ are called independent if for all $i,j$ one has

(1) $p_{i,j}=p_iq_j$.

Thus, in case of two-valued variables, their independence means independence of 4 events. Independence of variables is a more complex condition than independence of events.

### Properties of independent variables

Property 1. For independent variables, we have $EXY=EXEY$ (multiplicativity). Indeed, by definition of the expected value and equation (1)

$EXY=x_1y_1p_{1,1}+x_1y_2p_{1,2}+x_2y_1p_{2,1}+x_2y_2p_{2,2}$

$=x_1y_1p_1q_1+x_1y_2p_1q_2+x_2y_1p_2q_1+x_2y_2p_2q_2$

$=(x_1p_1+x_2p_2)(y_1q_1+y_2q_2)=EXEY$.

Remark. This proof is a good exercise to check how well students understand the definitions of the product $XY$ and of the expectation operator. Note also that multiplicativity is guaranteed only under independence, unlike linearity $E(aX+bY)=aEX+bEY$, which is always true.

Property 2. Independent variables are uncorrelated: $Cov(X,Y)=0$. This follows immediately from multiplicativity and the shortcut for covariance:

(2) $Cov(X,Y)=E(XY)-(EX)(EY)=0.$

Remark. Independence is stronger than uncorrelatedness: variables can be uncorrelated but not independent.
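The Remark can be backed by the standard counterexample: $X$ uniform on $\{-1,0,1\}$ and $Y=X^2$ are uncorrelated but clearly dependent. A quick check with exact arithmetic:

```python
# Uncorrelated but not independent: X uniform on {-1, 0, 1}, Y = X^2.
from fractions import Fraction

xs = [-1, 0, 1]
p = Fraction(1, 3)                      # P(X = x) = 1/3 for each value

EX = sum(x * p for x in xs)             # 0 by symmetry
EY = sum(x**2 * p for x in xs)          # E(X^2) = 2/3
EXY = sum(x * x**2 * p for x in xs)     # E(X^3) = 0 by symmetry
cov = EXY - EX * EY
assert cov == 0                         # uncorrelated

# ...but not independent: a joint probability differs from the product
P_X0_Y0 = p                             # the event {X = 0, Y = 0} is {X = 0}
P_X0 = p
P_Y0 = p                                # Y = 0 happens only when X = 0
assert P_X0_Y0 != P_X0 * P_Y0           # 1/3 is not equal to 1/9
```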

Property 3. For independent variables, variance is additive: $Var(X+Y)=Var(X)+Var(Y).$ This easily follows from the general formula for $Var(X+Y)$ and equation (2):

$Var(X+Y)=Var(X)+2Cov(X,Y)+Var(Y)=Var(X)+Var(Y).$

Property 4. Independence is such a strong property that it is preserved under nonlinear transformations. This means the following. Take two deterministic functions $f,g$; apply one to $X$ and the other to $Y$. The resulting random variables $f(X),g(Y)$ will be independent. Instead of the proof, I provide an application. If $z_1,z_2$ are two independent standard normals, then $z^2_1,z^2_2$ are two independent chi-square variables with 1 degree of freedom.

Remark. Normality is preserved only under linear transformations.

This post is an antithesis of the following definition from (Agresti and Franklin, p.540): Two categorical variables are independent if the population conditional distributions for one of them are identical at each category of the other. The variables are dependent (or associated) if the conditional distributions are not identical.

11 Sep 16

## The pearls of AP Statistics 27

### Independence of events: intuitive definitions matter

First and foremost: independence of an AP Statistics course from Math is nonsense. Most of Stats is based on mathematical intuition.

### Independent events

The usual definition says: events $A,B$ are called independent if

(1) $P(A\cap B)=P(A)P(B).$

Figure 1. Independence illustrated - click to view the video

You can use it formally or you can try to find a tangible interpretation of this definition, which I did. In Figure 1, the sample space is the unit square. Let $A$ be the rectangle delimited by red lines, of width $a$ and height $1$. Since in this illustration probability of an event is its area, we have $P(A)=a\times 1=a$. Similarly, let $B$ be the rectangle delimited by blue lines, of width $1$ and height $b$, so that $P(B)=b\times 1=b$. Obviously, the intersection $A\cap B$ has area $ab$ which equals $P(A)P(B).$ Equation (1) is satisfied and $A,B$ are independent. When the rectangle $A$ moves left and right and/or the rectangle $B$ moves up and down, the independence condition is preserved. We have a visual illustration of the common explanation that "what happens to one event, does not affect the probability of the other".

In Mathematica, enter the command

```mathematica
Animate[ParametricPlot[{{0.2 + a, t}, {0.4 + a, t}, {t, 0.3 + b},
   {t, 0.6 + b}}, {t, 0, 1}, PlotRange -> {{0, 1}, {0, 1}},
  PlotRangeClipping -> True, Frame -> True,
  PlotStyle -> {Red, Red, Blue, Blue}, Mesh -> False],
 {a, -0.15, 0.55}, {b, -0.25, 0.35}, AnimationRunning -> False]
```

Choose "Forward and Backward" and then press both Play buttons. Those who don't have Mathematica, can view my video.
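Those who prefer code to animation can check the area argument by Monte Carlo. The rectangle positions and sizes below are hypothetical; the point is that the empirical frequency of $A\cap B$ stays close to the product of the individual frequencies wherever the rectangles sit.

```python
# Monte Carlo version of the Figure 1 setup: A and B are the two rectangles
# in the unit square, and area plays the role of probability.
import random

random.seed(1)
a, b = 0.4, 0.3          # width of A, height of B (hypothetical values)
xa = 0.2                 # left edge of A; A = [xa, xa + a] x [0, 1]
yb = 0.5                 # bottom edge of B; B = [0, 1] x [yb, yb + b]

reps = 200_000
hits_A = hits_B = hits_AB = 0
for _ in range(reps):
    x, y = random.random(), random.random()   # uniform point in the square
    in_A = xa <= x <= xa + a
    in_B = yb <= y <= yb + b
    hits_A += in_A
    hits_B += in_B
    hits_AB += in_A and in_B

PA, PB, PAB = hits_A / reps, hits_B / reps, hits_AB / reps
# P(A and B) should be close to P(A) * P(B) = a * b = 0.12
assert abs(PAB - PA * PB) < 0.01
```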

The statement "If A and B are dependent events, then so are A and the complement of B" (Agresti and Franklin, p.237) is not so simple. Here is the formal proof of the complementary statement ("dependent" is replaced with "independent"; $S$ denotes the sample space): if (1) is true, then

$P(A\cap B^c)=P(A\cap (S\setminus B))=P(A)-P(A\cap B)=P(A)-P(A)P(B)=P(A)(1-P(B))=P(A)P(B^c).$

Reading equation (1) from left to right: in practice, if we know that events are independent, we can find the probability of the joint event $A\cap B$ by multiplying individual probabilities $P(A),P(B).$

Reading equation (1) from right to left: in theory, if we want our events to be independent, we can define the probability of the joint event $P(A\cap B)$ by multiplying individual probabilities $P(A),P(B).$

### Why is there division in the definition of conditional probability?

Figure 2. Conditional probability

Golovkin crushed Brook, and I am happy. Let $A$ be the event that the fight did not end in the first round. Suppose we know that the fight did not end in the first round, we just don't know the score for the round. Let $B,C,D$ be the events that Golovkin scored more, Brook scored more and there was a tie, respectively. Our sample space, based on the information we have, is limited to $A$ but the probabilities we are interested in do not sum to one:

$P(A\cap B)+P(A\cap C)+P(A\cap D)=P(A).$

To satisfy the completeness axiom, we divide both sides by $P(A)$:

$P(A\cap B)/P(A)+P(A\cap C)/P(A)+P(A\cap D)/P(A)=1.$

This explains why conditional probabilities are defined by

(2) $P(B|A)=P(A\cap B)/P(A),$

$P(C|A)=P(A\cap C)/P(A),$

$P(D|A)=P(A\cap D)/P(A).$

If $A,B$ are independent, from (1) we see that $P(B|A)=P(B)$. The multiplication rule $P(A\cap B)=P(B|A)P(A)$ is a consequence of (2) and not an independent property.
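The division step can be illustrated with made-up joint probabilities for the three round outcomes (the numbers below are hypothetical):

```python
# Dividing by P(A) makes the conditional probabilities sum to 1.
from fractions import Fraction as F

# hypothetical P(A and B), P(A and C), P(A and D):
# Golovkin ahead, Brook ahead, tie, given the fight went past round one
PAB, PAC, PAD = F(5, 10), F(2, 10), F(1, 10)
PA = PAB + PAC + PAD                    # the three events partition A

P_B_given_A = PAB / PA                  # definition (2)
P_C_given_A = PAC / PA
P_D_given_A = PAD / PA

# the completeness axiom is restored after division
assert P_B_given_A + P_C_given_A + P_D_given_A == 1
# the multiplication rule is a consequence of the definition
assert P_B_given_A * PA == PAB
```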

18 Jul 16

## Properties of conditional expectation

### Background

A company sells a product and may offer a discount. We denote by $X$ the sales volume and by $Y$ the discount amount (per unit). For simplicity, both variables take only two values. They depend on each other. If the sales are high, the discount may be larger. A higher discount, in its turn, may attract more buyers. At the same level of sales, the discount may vary depending on the vendor's costs. With the same discount, the sales vary with consumer preferences. Along with the sales and discount, we consider a third variable that depends on both of them. It can be the profit $\pi$.

### Formalization

The sales volume $X$ takes values $x_1,x_2$ with probabilities $p_i^X=P(X=x_i)$, $i=1,2$. Similarly, the discount $Y$ takes values $y_1,y_2$ with probabilities $p_i^Y=P(Y=y_i)$, $i=1,2$. The joint events have joint probabilities denoted $P(X=x_i,Y=y_j)=p_{i,j}$. The profit in the event $X=x_i,Y=y_j$ is denoted $\pi_{i,j}$. This information is summarized in Table 1.

Table 1.

|  | $y_1$ | $y_2$ |  |
|---|---|---|---|
| $x_1$ | $\pi_{1,1},\ p_{1,1}$ | $\pi_{1,2},\ p_{1,2}$ | $p_1^X$ |
| $x_2$ | $\pi_{2,1},\ p_{2,1}$ | $\pi_{2,2},\ p_{2,2}$ | $p_2^X$ |
|  | $p_1^Y$ | $p_2^Y$ |  |

Comments. In the left-most column and upper-most row we have values of the sales and discount. In the "margins" (last row and last column) we put probabilities of those values. In the main body of the table we have profit values and their probabilities. It follows that the expected profit is

(1) $E\pi=\pi_{1,1}p_{1,1}+\pi_{1,2}p_{1,2}+\pi_{2,1}p_{2,1}+\pi_{2,2}p_{2,2}.$

### Conditioning

Suppose that the vendor fixes the discount at $y_1$. Then only the column containing this value is relevant. To get numbers that satisfy the completeness axiom, we define conditional probabilities

$P(X=x_1|Y=y_1)=\frac{p_{11}}{p_1^Y},\ P(X=x_2|Y=y_1)=\frac{p_{21}}{p_1^Y}.$

This allows us to define conditional expectation

(2) $E(\pi|Y=y_1)=\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}.$

Similarly, if the discount is fixed at $y_2$,

(3) $E(\pi|Y=y_2)=\pi_{12}\frac{p_{12}}{p_2^Y}+\pi_{22}\frac{p_{22}}{p_2^Y}.$

Equations (2) and (3) are joined in the notation $E(\pi|Y)$.

Property 1. While the usual expectation (1) is a number, the conditional expectation $E(\pi|Y)$ is a function of the value of $Y$ on which the conditioning is being done. Since it is a function of $Y$, it is natural to consider it a random variable defined by the next table

Table 2.

| Values | Probabilities |
|---|---|
| $E(\pi\vert Y=y_1)$ | $p_1^Y$ |
| $E(\pi\vert Y=y_2)$ | $p_2^Y$ |

Property 2. Law of iterated expectations: the mean of the conditional expectation equals the usual mean. Indeed, using Table 2, we have

$E[E(\pi|Y)]=E(\pi|Y=y_1)p_1^Y+E(\pi|Y=y_2)p_2^Y$ (applying (2) and (3))

$=\left[\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}\right]p_1^Y+\left[\pi_{12}\frac{p_{12}}{p_2^Y}+\pi_{22}\frac{p_{22}}{p_2^Y}\right]p_2^Y$

$=\pi_{1,1}p_{1,1}+\pi_{1,2}p_{1,2}+\pi_{2,1}p_{2,1}+\pi_{2,2}p_{2,2}=E\pi.$
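The calculation in the proof of Property 2 can be checked numerically. Here is a sketch with hypothetical profit values and joint probabilities for Table 1:

```python
# Law of iterated expectations E[E(pi|Y)] = E(pi) on hypothetical numbers.
pi = [[10.0, 7.0],
      [4.0, 12.0]]                     # profit in the event X = x_i, Y = y_j
p = [[0.1, 0.2],
     [0.3, 0.4]]                       # joint probabilities p_{i,j}, sum to 1

pY = [p[0][0] + p[1][0], p[0][1] + p[1][1]]   # marginal distribution of Y

# E(pi | Y = y_j) as in equations (2) and (3)
E_pi_given_Y = [
    (pi[0][j] * p[0][j] + pi[1][j] * p[1][j]) / pY[j] for j in (0, 1)
]

# usual expectation (1) and the mean of the conditional expectation (Table 2)
E_pi = sum(pi[i][j] * p[i][j] for i in (0, 1) for j in (0, 1))
E_E = sum(E_pi_given_Y[j] * pY[j] for j in (0, 1))

assert abs(E_E - E_pi) < 1e-12         # law of iterated expectations
```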

Property 3. Generalized homogeneity. In the usual homogeneity $E(aX)=aEX$, $a$ is a number. In the generalized homogeneity

(4) $E(a(Y)\pi|Y)=a(Y)E(\pi|Y),$

$a(Y)$ is allowed to be a function of the variable on which we are conditioning. See for yourself: using (2), for instance,

$E(a(y_1)\pi|Y=y_1)=a(y_1)\pi_{11}\frac{p_{11}}{p_1^Y}+a(y_1)\pi_{21}\frac{p_{21}}{p_1^Y}$

$=a(y_1)\left[\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}\right]=a(y_1)E(\pi|Y=y_1).$

Property 4. Additivity. For any random variables $S,T$ we have

(5) $E(S+T|Y)=E(S|Y)+E(T|Y).$

The proof is left as an exercise.

Property 5. Generalized linearity. For any random variables $S,T$ and functions $a(Y),b(Y)$ equations (4) and (5) imply

$E(a(Y)S+b(Y)T|Y)=a(Y)E(S|Y)+b(Y)E(T|Y).$

Property 6. Conditioning in case of independence. This property has to do with the informational aspect of conditioning. The usual expectation (1) takes into account all contingencies. Equations (2) and (3) are based on the assumption that one contingency for $Y$ has been realized, so that the other one becomes irrelevant. Therefore $E(\pi|Y)$ is considered an updated version of (1) that takes into account the arrival of new information that the value of $Y$ has been fixed. Now we can state the property itself: if $X,Y$ are independent, then $E(X|Y)=EX$, that is, conditioning on $Y$ does not improve our knowledge of $EX$.

Proof. In case of independence we have $p_{i,j}=p_i^Xp_j^Y$ for all $i,j$, so that

$E(X|Y=y_j)=x_1\frac{p_{1j}}{p_j^Y}+x_2\frac{p_{2j}}{p_j^Y}=x_1p_1^X+x_2p_2^X=EX.$

Property 7. Conditioning in case of complete dependence. Conditioning of $Y$ on $Y$ gives the most precise information: $E(Y|Y)=Y$ (if we condition $Y$ on $Y$, we know about it everything and there is no averaging). More generally, $E(f(Y)|Y)=f(Y)$ for any deterministic function $f$.

Proof. If we condition $Y$ on $Y$, the conditional probabilities become

$P(Y=y_1|Y=y_1)=1,\ P(Y=y_2|Y=y_1)=0.$

Hence, (2) gives

$E(f(Y)|Y=y_1)=f(y_1)\times 1+f(y_2)\times 0=f(y_1).$

Conditioning on $Y=y_2$ is treated similarly.

### Summary

Not many people know that using the notation $E_Y\pi$ for conditional expectation instead of $E(\pi|Y)$ makes everything much clearer. I rewrite the above properties using this notation:

1. Law of iterated expectations: $E(E_Y\pi)=E\pi$
2. Generalized homogeneity: $E_Y(a(Y)\pi)=a(Y)E_Y\pi$
3. Additivity: For any random variables $S,T$ we have $E_Y(S+T)=E_YS+E_YT$
4. Generalized linearity: For any random variables $S,T$ and functions $a(Y),b(Y)$ one has $E_Y(a(Y)S+b(Y)T)=a(Y)E_YS+b(Y)E_YT$
5. Conditioning in case of independence: if $X,Y$ are independent, then $E_YX=EX$
6. Conditioning in case of complete dependence: $E_Yf(Y)=f(Y)$ for any deterministic function $f$.