Jan 17

Conditional variance properties


Review Properties of conditional expectation, especially the summary, where I introduce a new notation for conditional expectation. Everywhere I use the notation E_Y\pi for expectation of \pi conditional on Y, instead of E(\pi|Y).

This post and the previous one on conditional expectation show that conditioning is a pretty advanced notion. Many introductory books use the condition E_xu=0 (the expected value of the error term u conditional on the regressor x is zero). Because of the complexity of conditioning, I think it's better to avoid this kind of assumption as much as possible.

Conditional variance properties

Replacing usual expectations by their conditional counterparts in the definition of variance, we obtain the definition of conditional variance:

(1) Var_Y(X)=E_Y(X-E_YX)^2.

Property 1. If X,Y are independent, then X-EX and Y are also independent and conditioning doesn't change variance:

Var_Y(X)=E_Y(X-EX)^2=E(X-EX)^2=Var(X),

see Conditioning in case of independence.

Property 2. Generalized homogeneity of degree 2: if a is a deterministic function, then a^2(Y) can be pulled out:

Var_Y(a(Y)X)=a^2(Y)Var_Y(X).



Property 3. Shortcut for conditional variance:

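(2) Var_Y(X)=E_Y(X^2)-(E_YX)^2.

Proof. By definition (1),

Var_Y(X)=E_Y[X^2-2XE_YX+(E_YX)^2]

(distributing conditional expectation)

=E_Y(X^2)-2E_Y(XE_YX)+E_Y[(E_YX)^2]

(applying Properties 2 and 6 from this Summary with a(Y)=E_YX)

=E_Y(X^2)-2(E_YX)^2+(E_YX)^2=E_Y(X^2)-(E_YX)^2.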

Property 4. The law of total variance:

(3) Var(X)=Var(E_YX)+E[Var_Y(X)].

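Proof. By the shortcut for usual variance and the law of iterated expectations

Var(X)=E(X^2)-(EX)^2=E[E_Y(X^2)]-[E(E_YX)]^2

(replacing E_Y(X^2) from (2))

=E[Var_Y(X)]+E[(E_YX)^2]-[E(E_YX)]^2

(the last two terms give the shortcut for variance of E_YX)

=E[Var_Y(X)]+Var(E_YX).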
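The law of total variance can be checked numerically on a small table. The sketch below uses a made-up joint distribution of (X,Y) (the numbers and the helper names mean, variance and conditional_pairs are assumptions for illustration, not part of the original post) and exact fractions, so both sides of (3) can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint distribution of (X, Y): keys are (x, y), values are probabilities.
joint = {
    (1, 0): F(1, 10), (2, 0): F(3, 10),
    (1, 1): F(4, 10), (2, 1): F(2, 10),
}

def mean(pairs):
    """Mean of a list of (value, probability) pairs."""
    return sum(v * p for v, p in pairs)

def variance(pairs):
    """Variance by the shortcut E(X^2) - (EX)^2."""
    return mean([(v * v, p) for v, p in pairs]) - mean(pairs) ** 2

y_values = sorted({y for _, y in joint})
# Marginal probabilities of Y.
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in y_values}

def conditional_pairs(y):
    """Pairs (x, P(X=x | Y=y)) for a fixed value y."""
    return [(x, p / p_y[y]) for (x, yy), p in joint.items() if yy == y]

cond_mean = {y: mean(conditional_pairs(y)) for y in y_values}     # E_Y X
cond_var = {y: variance(conditional_pairs(y)) for y in y_values}  # Var_Y(X), shortcut (2)

var_x = variance([(x, p) for (x, _), p in joint.items()])                  # Var(X)
var_of_cond_mean = variance([(cond_mean[y], p_y[y]) for y in y_values])    # Var(E_Y X)
mean_of_cond_var = mean([(cond_var[y], p_y[y]) for y in y_values])         # E[Var_Y(X)]

# Equation (3): Var(X) = Var(E_Y X) + E[Var_Y(X)].
assert var_x == var_of_cond_mean + mean_of_cond_var
```

Here E_YX and Var_Y(X) are stored as dictionaries indexed by the values of Y, which mirrors the idea that both are functions of Y.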


Before we move further we need to define conditional covariance by

Cov_Y(S,T) = E_Y(S - E_YS)(T - E_YT)

(everywhere usual expectations are replaced by conditional ones). We say that random variables S,T are conditionally uncorrelated if Cov_Y(S,T) = 0.

Property 5. Conditional variance of a linear combination. For any random variables S,T and functions a(Y),b(Y) one has

Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+2a(Y)b(Y)Cov_Y(S,T)+b^2(Y)Var_Y(T).

The proof is quite similar to that in case of usual variances, so we leave it to the reader. In particular, if S,T are conditionally uncorrelated, then the interaction term disappears:

Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+b^2(Y)Var_Y(T).

Oct 16

Properties of means

Properties of means, covariances and variances are the bread and butter of professionals. Here we consider the bread: the means.

Properties of means: as simple as playing with tables

Definition of a random variable. When my Brazilian students asked for an intuitive definition of a random variable, I said: It is a function whose values are unpredictable. Therefore it is prohibited to work with their values and allowed to work only with their various means. For proofs we need a more technical definition: it is a table values+probabilities of type Table 1.

Table 1.  Random variable definition
Values of X Probabilities
x_1 p_1
... ...
x_n p_n

Note: The complete form of writing {p_i} is P(X = {x_i}).

Mean (or expected value) definition. EX = x_1p_1 + ... + x_np_n = \sum\limits_{i = 1}^nx_ip_i. In words, this is a weighted sum of values, where the weights p_i reflect the importance of corresponding x_i.

Note: The expected value is a function whose argument is a complex object (it is described by Table 1) and the value is simple: EX is just a number. And it is not a product of E and X! See how different means fit this definition.

Definition of a linear combination. See here the financial motivation. Suppose that X,Y are two discrete random variables with the same probability distribution {p_1},...,{p_n}. Let a,b be real numbers. The random variable aX + bY is called a linear combination of X,Y with coefficients a,b. Its special cases are aX (X scaled by a) and X + Y (a sum of X and Y). The detailed definition is given by Table 2.

Table 2.  Linear operations definition
Values of X   Values of Y   Probabilities   aX      X + Y       aX + bY
x_1           y_1           p_1             ax_1    x_1 + y_1   ax_1 + by_1
...           ...           ...             ...     ...         ...
x_n           y_n           p_n             ax_n    x_n + y_n   ax_n + by_n

Note: The situation when the probability distributions are different is reduced to the case when they are the same, see my book.

Property 1. Linearity of means. For any random variables X,Y and any numbers a,b one has

(1) E(aX + bY) = aEX + bEY.

Proof. This is one of those straightforward proofs when knowing the definitions and starting with the left-hand side is enough to arrive at the result. Using the definitions in Table 2, the mean of the linear combination is
E(aX + bY)= (a{x_1} + b{y_1}){p_1} + ... + (a{x_n} + b{y_n}){p_n}

(distributing probabilities)
= a{x_1}{p_1} + b{y_1}{p_1} + ... + a{x_n}{p_n} + b{y_n}{p_n}

(grouping by variables)
= (a{x_1}{p_1} + ... + a{x_n}{p_n}) + (b{y_1}{p_1} + ... + b{y_n}{p_n})

(pulling out constants)
= a({x_1}{p_1} + ... + {x_n}{p_n}) + b({y_1}{p_1} + ... + {y_n}{p_n})=aEX+bEY.
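The same computation can be replayed on a concrete instance of Table 2. In the sketch below the rows, the coefficients a, b and the helper mean are all made-up illustrations; exact fractions are used so that the two sides of (1) agree exactly.

```python
from fractions import Fraction as F

# Hypothetical Table 2: each row is (x_i, y_i, p_i) with a common probability column.
rows = [(1, 4, F(1, 5)), (2, 0, F(1, 2)), (5, -3, F(3, 10))]
a, b = 3, -2  # arbitrary coefficients of the linear combination

def mean(values, probs):
    """Weighted sum of values with the probabilities as weights."""
    return sum(v * p for v, p in zip(values, probs))

xs = [x for x, _, _ in rows]
ys = [y for _, y, _ in rows]
ps = [p for _, _, p in rows]

assert sum(ps) == 1  # completeness axiom

# Left side of (1): mean of the column aX + bY.
lhs = mean([a * x + b * y for x, y in zip(xs, ys)], ps)
# Right side of (1): the same combination of the two means.
rhs = a * mean(xs, ps) + b * mean(ys, ps)

assert lhs == rhs
```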

See applications: one, and two, and three.

Generalization to the case of a linear combination of n variables:

E({a_1}{X_1} + ... + {a_n}{X_n}) = {a_1}E{X_1} + ... + {a_n}E{X_n}.

Special cases. a) Letting a = b = 1 in (1) we get E(X + Y) = EX + EY. This is called additivity. See an application. b) Letting in (1) b = 0 we get E(aX) = aEX. This property is called homogeneity of degree 1 (you can pull the constant out of the expected value sign). Ask your students to deduce linearity from homogeneity and additivity.

Property 2. Expected value of a constant. Everybody knows what a constant is. Ask your students what is a constant in terms of Table 1. The mean of a constant is that constant, because a constant doesn't change, rain or shine: Ec = c{p_1} + ... + c{p_n} = c({p_1} + ... + {p_n}) = c (we have used the completeness axiom {p_1} + ... + {p_n} = 1). In particular, it follows that E(EX)=EX.

Property 3. The expectation operator preserves order: if x_i\ge y_i for all i, then EX\ge EY. In particular, the mean of a nonnegative random variable is nonnegative: if x_i\ge 0 for all i, then EX\ge 0.

Indeed, using the fact that all probabilities are nonnegative, we get EX = x_1p_1 + ... + x_np_n\ge y_1p_1 + ... + y_np_n=EY.

Property 4. For independent variables, we have EXY=(EX)(EY) (multiplicativity), which has important implications on its own.

The best thing about the above properties is that, although we proved them under simplified assumptions, they are always true. We keep in mind that the expectation operator E is the device used by Mother Nature to measure the average, and most of the time she keeps hidden from us both the probabilities and the average EX.

Sep 16

Definitions of chi-square, t statistic and F statistic

Definitions of the standard normal distribution and independence can be combined to produce definitions of chi-square, t statistic and F statistic. The similarity of the definitions makes them easier to study.

Independence of continuous random variables

Definition of independent discrete random variables easily modifies for the continuous case. Let X,Y be two continuous random variables with densities p_X,\ p_Y, respectively. We say that these variables are independent if the density p_{X,Y} of the pair (X,Y) is a product of individual densities:

(1) p_{X,Y}(s,t)=p_X(s)p_Y(t) for all s,t.

As in this post, equation (1) can be understood in two ways. If (1) is given, then X,Y are independent. Conversely, if we want them to be independent, we can define the density of the pair by equation (1). This definition readily generalizes to the case of many variables. In particular, if we want variables z_1,...,z_n to be standard normal and independent, we say that each of them has density defined here and the joint density p_{z_1,...,z_n} is a product of individual densities.

Definition of chi-square variable


Figure 1. chi-square with 1 degree of freedom

Let z_1,...,z_n be standard normal and independent. Then the variable \chi^2_n=z_1^2+...+z_n^2 is called a chi-square variable with n degrees of freedom. Obviously, \chi^2_n\ge 0, which means that its density is zero to the left of the origin. For low values of degrees of freedom, the density is not bounded near the origin, see Figure 1.

Definition of t distribution


Figure 2. t distribution and standard normal compared

Let z_0,z_1,...,z_n be standard normal and independent. Then the variable t_n=\frac{z_0}{\sqrt{(z_1^2+...+z_n^2)/n}} is said to have a t distribution with n degrees of freedom. The density of the t distribution is bell-shaped and for low n has fatter tails than the standard normal. For high n, it approaches that of the standard normal, see Figure 2.

Definition of F distribution


Figure 3. F distribution with (1,m) degrees of freedom

Let u_1,...,u_n,v_1,...,v_m be standard normal and independent. Then the variable F_{n,m}=\frac{(u_1^2+...+u_n^2)/n}{(v_1^2+...+v_m^2)/m} is said to have an F distribution with (n,m) degrees of freedom. It is nonnegative and its density is zero to the left of the origin. When n is low, the density is not bounded in the neighborhood of zero, see Figure 3.

The Mathematica file and video illustrate better the densities of these three variables.


  1. If \chi^2_n and \chi^2_m are independent, then \chi^2_n+\chi^2_m is \chi^2_{n+m} (addition rule). This rule is applied in the theory of ANOVA models.
  2. t_n^2=F_{1,n}. This gives an easy proof of equation (2.71) from Introduction to Econometrics, by Christopher Dougherty, published by Oxford University Press, UK, in 2016.
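The second rule is an algebraic identity once the definitions are written side by side: squaring t_n puts z_0^2 (a chi-square with 1 degree of freedom) in the numerator. The sketch below builds both quantities from the same draws; the function name t_and_f, the seed and the sample size are assumptions made for illustration only.

```python
import math
import random

def t_and_f(z0, zs):
    """Return (t_n, F_{1,n}) built from the same standard normal draws."""
    n = len(zs)
    denom = sum(z * z for z in zs) / n   # (z_1^2 + ... + z_n^2)/n
    t = z0 / math.sqrt(denom)            # t_n from its definition
    f = (z0 * z0 / 1) / denom            # F_{1,n}: numerator is z_0^2 divided by 1
    return t, f

random.seed(0)  # fixed seed only to make the run repeatable
n = 5
for _ in range(1000):
    z0 = random.gauss(0.0, 1.0)
    zs = [random.gauss(0.0, 1.0) for _ in range(n)]
    t, f = t_and_f(z0, zs)
    # t_n^2 and F_{1,n} coincide draw by draw, up to floating-point rounding.
    assert math.isclose(t * t, f, rel_tol=1e-9)
```

Since the equality holds for every draw, the two variables have the same distribution; no statistical test is needed.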
Sep 16

The pearls of AP Statistics 28

From independence of events to independence of random variables

One way to avoid complex Math is by showing the students simplified, plausible derivations which create appearance of rigor and provide enough ground for intuition. This is what I try to do here.

Independence of random variables

Let X,Y be two random variables. Suppose X takes values x_1,x_2 with probabilities P(X=x_i)=p_i. Similarly, Y takes values y_1,y_2 with probabilities P(Y=y_i)=q_i. Now we want to consider a pair (X,Y). The pair can take values (x_i,y_j) where i,j take values 1,2. These are joint events with probabilities denoted P(X=x_i,Y=y_j)=p_{i,j}.

Definition. X,Y are called independent if for all i,j one has

(1) p_{i,j}=p_iq_j.

Thus, in case of two-valued variables, their independence means independence of 4 events. Independence of variables is a more complex condition than independence of events.

Properties of independent variables

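Property 1. For independent variables, we have EXY=EXEY (multiplicativity). Indeed, by definition of the expected value and equation (1)

EXY=x_1y_1p_{1,1}+x_1y_2p_{1,2}+x_2y_1p_{2,1}+x_2y_2p_{2,2}

=x_1y_1p_1q_1+x_1y_2p_1q_2+x_2y_1p_2q_1+x_2y_2p_2q_2

=(x_1p_1+x_2p_2)(y_1q_1+y_2q_2)=EXEY.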




Remark. This proof is a good exercise to check how well students understand the definitions of the product XY and of the expectation operator. Note also that multiplicativity holds only under independence, unlike linearity E(aX+bY)=aEX+bEY, which is always true.

Property 2. Independent variables are uncorrelated: Cov(X,Y)=0. This follows immediately from multiplicativity and the shortcut for covariance:

(2) Cov(X,Y)=E(XY)-(EX)(EY)=0.

Remark. Independence is stronger than uncorrelatedness: variables can be uncorrelated but not independent.
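A standard example behind this remark, sketched in code: let X take the values -1, 0, 1 with equal probabilities and put Y=X^2. The example and the variable names are my choice for illustration and are not taken from the post.

```python
from fractions import Fraction as F

# Each outcome is (x, y, probability) with y = x^2, so Y is a function of X.
outcomes = [(x, x * x, F(1, 3)) for x in (-1, 0, 1)]

EX = sum(x * p for x, _, p in outcomes)
EY = sum(y * p for _, y, p in outcomes)
EXY = sum(x * y * p for x, y, p in outcomes)
cov = EXY - EX * EY  # shortcut for covariance

# Independence would require P(X=0, Y=0) = P(X=0) P(Y=0).
p_joint = sum(p for x, y, p in outcomes if x == 0 and y == 0)
p_x0 = sum(p for x, _, p in outcomes if x == 0)
p_y0 = sum(p for _, y, p in outcomes if y == 0)

assert cov == 0                 # uncorrelated
assert p_joint != p_x0 * p_y0   # but not independent
```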

Property 3. For independent variables, variance is additive: Var(X+Y)=Var(X)+Var(Y). This easily follows from the general formula for Var(X+Y) and equation (2):

Var(X+Y)=Var(X)+2Cov(X,Y)+Var(Y)=Var(X)+Var(Y).
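Properties 1 to 3 can be checked together on a pair of two-valued variables. In the sketch below the values and probabilities are made up, the joint probabilities are defined by equation (1), and the helpers E and var are illustrative names rather than anything from the post.

```python
from fractions import Fraction as F
from itertools import product

xs, ps = [1, 3], [F(1, 4), F(3, 4)]    # hypothetical values and probabilities of X
ys, qs = [-2, 5], [F(2, 5), F(3, 5)]   # hypothetical values and probabilities of Y

# Joint probabilities defined by independence: p_{i,j} = p_i q_j.
joint = {(x, y): p * q for (x, p), (y, q) in product(zip(xs, ps), zip(ys, qs))}

def E(g):
    """Mean of g(X, Y) under the joint distribution."""
    return sum(g(x, y) * pr for (x, y), pr in joint.items())

def var(g):
    """Variance of g(X, Y) by the shortcut."""
    return E(lambda x, y: g(x, y) ** 2) - E(g) ** 2

EX = E(lambda x, y: x)
EY = E(lambda x, y: y)
EXY = E(lambda x, y: x * y)
cov = EXY - EX * EY

var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
var_sum = var(lambda x, y: x + y)

assert EXY == EX * EY            # Property 1: multiplicativity
assert cov == 0                  # Property 2: uncorrelatedness
assert var_sum == var_x + var_y  # Property 3: additivity of variance
```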


Property 4. Independence is such a strong property that it is preserved under nonlinear transformations. This means the following. Take two deterministic functions f,g; apply one to X and the other to Y. The resulting random variables f(X),g(Y) will be independent. Instead of the proof, I provide an application. If z_1,z_2 are two independent standard normals, then z^2_1,z^2_2 are two independent chi-square variables with 1 degree of freedom.

Remark. Normality is preserved only under linear transformations.

This post is an antithesis of the following definition from (Agresti and Franklin, p.540): Two categorical variables are independent if the population conditional distributions for one of them are identical at each category of the other. The variables are dependent (or associated) if the conditional distributions are not identical.

Sep 16

The pearls of AP Statistics 27

Independence of events: intuitive definitions matter

First and foremost: independence of an AP Statistics course from Math is nonsense. Most of Stats is based on mathematical intuition.

Independent events

The usual definition says: events A,B are called independent if

(1) P(A\cap B)=P(A)P(B).


Figure 1. Independence illustrated - click to view the video

You can use it formally or you can try to find a tangible interpretation of this definition, which I did. In Figure 1, the sample space is the unit square. Let A be the rectangle delimited by red lines, of width a and height 1. Since in this illustration probability of an event is its area, we have P(A)=a\times 1=a. Similarly, let B be the rectangle delimited by blue lines, of width 1 and height b, so that P(B)=b\times 1=b. Obviously, the intersection A\cap B has area ab which equals P(A)P(B). Equation (1) is satisfied and A,B are independent. When the rectangle A moves left and right and/or the rectangle B moves up and down, the independence condition is preserved. We have a visual illustration of the common explanation that "what happens to one event, does not affect the probability of the other".

In Mathematica, enter the command

Animate[ParametricPlot[{{0.2 + a, t}, {0.4 + a, t}, {t, 0.3 + b}, {t,
0.6 + b}}, {t, 0, 1}, PlotRange -> {{0, 1}, {0, 1}},
PlotRangeClipping -> True, Frame -> True,
PlotStyle -> {Red, Red, Blue, Blue}, Mesh -> False], {a, -0.15,
0.55}, {b, -0.25, 0.35}, AnimationRunning -> False]

Choose "Forward and Backward" and then press both Play buttons. Those who don't have Mathematica, can view my video.

The statement "If A and B are dependent events, then so are A and the complement of B" (Agresti and Franklin, p.237) is not so simple. Here is the formal proof of the complementary statement ("dependent" is replaced with "independent"; S denotes the sample space): if (1) is true, then

P(A\cap B^c)=P(A\cap (S\setminus B))=P(A)-P(A\cap B)=P(A)-P(A)P(B)=P(A)P(B^c).

Reading equation (1) from left to right: in practice, if we know that events are independent, we can find the probability of the joint event A\cap B by multiplying individual probabilities P(A),P(B).

Reading equation (1) from right to left: in theory, if we want our events to be independent, we can define the probability of the joint event P(A\cap B) by multiplying individual probabilities P(A),P(B).

Why is there division in the definition of conditional probability?


Figure 2. Conditional probability

Golovkin crushed Brook, and I am happy. Let A be the event that the fight did not end in the first round. Suppose we know that the fight did not end in the first round, we just don't know the score for the round. Let B,C,D be the events that Golovkin scored more, Brook scored more and there was a tie, respectively. Our sample space, based on the information we have, is limited to A but the probabilities we are interested in do not sum to one:

P(A\cap B)+P(A\cap C)+P(A\cap D)=P(A)

To satisfy the completeness axiom, we divide both sides by P(A):

P(A\cap B)/P(A)+P(A\cap C)/P(A)+P(A\cap D)/P(A)=1.

This explains why conditional probabilities are defined by

(2) P(B|A)=P(A\cap B)/P(A),

P(C|A)=P(A\cap C)/P(A),

P(D|A)=P(A\cap D)/P(A).

If A,B are independent, from (1) we see that P(B|A)=P(B). The multiplication rule P(A\cap B)=P(B|A)P(A) is a consequence of (2) and not an independent property.
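A small numerical sketch of the normalization argument above. The probabilities are invented for illustration (they are not estimates for the actual fight); the point is only that dividing by P(A) restores the completeness axiom and that the multiplication rule is definition (2) read backwards.

```python
from fractions import Fraction as F

# Hypothetical probabilities: P(A) and the joint probabilities P(A∩B), P(A∩C), P(A∩D).
p_A = F(4, 5)
joint = {"B": F(1, 2), "C": F(1, 5), "D": F(1, 10)}

# Within A the joint probabilities sum to P(A), not to one.
assert sum(joint.values()) == p_A

# Definition (2): divide each joint probability by P(A).
conditional = {event: p / p_A for event, p in joint.items()}

# Now the completeness axiom holds on the reduced sample space A.
assert sum(conditional.values()) == 1

# Multiplication rule P(A∩B) = P(B|A)P(A) follows from (2).
assert conditional["B"] * p_A == joint["B"]
```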

Jul 16

Properties of conditional expectation

Properties of conditional expectation


A company sells a product and may offer a discount. We denote by X the sales volume and by Y the discount amount (per unit). For simplicity, both variables take only two values. They depend on each other. If the sales are high, the discount may be larger. A higher discount, in its turn, may attract more buyers. At the same level of sales, the discount may vary depending on the vendor's costs. With the same discount, the sales vary with consumer preferences. Along with the sales and discount, we consider a third variable that depends on both of them. It can be the profit \pi.


The sales volume X takes values x_1,x_2 with probabilities p_i^X=P(X=x_i), i=1,2. Similarly, the discount Y takes values y_1,y_2 with probabilities p_i^Y=P(Y=y_i), i=1,2. The joint events have joint probabilities denoted P(X=x_i,Y=y_j)=p_{i,j}. The profit in the event X=x_i,Y=y_j is denoted \pi_{i,j}. This information is summarized in Table 1.

Table 1. Values and probabilities of the profit function
        y_1                     y_2
x_1     \pi_{1,1},\ p_{1,1}     \pi_{1,2},\ p_{1,2}     p_1^X
x_2     \pi_{2,1},\ p_{2,1}     \pi_{2,2},\ p_{2,2}     p_2^X
        p_1^Y                   p_2^Y

Comments. In the left-most column and upper-most row we have values of the sales and discount. In the "margins" (last row and last column) we put probabilities of those values. In the main body of the table we have profit values and their probabilities. It follows that the expected profit is

(1) E\pi=\pi_{1,1}p_{1,1}+\pi_{1,2}p_{1,2}+\pi_{2,1}p_{2,1}+\pi_{2,2}p_{2,2}.


Suppose that the vendor fixes the discount at y_1. Then only the column containing this value is relevant. To get numbers that satisfy the completeness axiom, we define conditional probabilities

P(X=x_1|Y=y_1)=\frac{p_{11}}{p_1^Y},\ P(X=x_2|Y=y_1)=\frac{p_{21}}{p_1^Y}.

This allows us to define conditional expectation

(2) E(\pi|Y=y_1)=\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}.

Similarly, if the discount is fixed at y_2,

(3) E(\pi|Y=y_2)=\pi_{12}\frac{p_{12}}{p_2^Y}+\pi_{22}\frac{p_{22}}{p_2^Y}.

Equations (2) and (3) are joined in the notation E(\pi|Y).

Property 1. While the usual expectation (1) is a number, the conditional expectation E(\pi|Y) is a function of the value of Y on which the conditioning is being done. Since it is a function of Y, it is natural to consider it a random variable defined by the next table

Table 2. Conditional expectation is a random variable
Values Probabilities
E(\pi|Y=y_1) p_1^Y
E(\pi|Y=y_2) p_2^Y

Property 2. Law of iterated expectations: the mean of the conditional expectation equals the usual mean. Indeed, using Table 2, we have

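E[E(\pi|Y)]=E(\pi|Y=y_1)p_1^Y+E(\pi|Y=y_2)p_2^Y

(applying (2) and (3))

=\left(\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}\right)p_1^Y+\left(\pi_{12}\frac{p_{12}}{p_2^Y}+\pi_{22}\frac{p_{22}}{p_2^Y}\right)p_2^Y

(canceling the marginal probabilities and using (1))

=\pi_{11}p_{11}+\pi_{21}p_{21}+\pi_{12}p_{12}+\pi_{22}p_{22}=E\pi.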
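The law of iterated expectations can be replayed on a filled-in version of Table 1. The profit values and joint probabilities below are invented for illustration, and cond_exp is a hypothetical helper name; the computation follows equations (1), (2), (3) and Table 2.

```python
from fractions import Fraction as F

# Hypothetical Table 1: profit[i][j] and prob[i][j] for X = x_i (rows), Y = y_j (columns).
profit = [[10, 6], [25, 14]]
prob = [[F(1, 10), F(3, 10)], [F(2, 5), F(1, 5)]]

# Column margins p_1^Y, p_2^Y.
p_Y = [prob[0][j] + prob[1][j] for j in range(2)]

def cond_exp(j):
    """E(pi | Y = y_j) as in equations (2) and (3)."""
    return sum(profit[i][j] * prob[i][j] / p_Y[j] for i in range(2))

# Equation (1): the usual expectation.
E_pi = sum(profit[i][j] * prob[i][j] for i in range(2) for j in range(2))

# Mean of the random variable from Table 2.
E_of_cond = sum(cond_exp(j) * p_Y[j] for j in range(2))

assert E_of_cond == E_pi
```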



Property 3. Generalized homogeneity. In the usual homogeneity E(aX)=aEX, a is a number. In the generalized homogeneity

(4) E(a(Y)\pi|Y)=a(Y)E(\pi|Y),

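a(Y) is allowed to be a function of the variable on which we are conditioning. See for yourself: using (2), for instance,

E(a(Y)\pi|Y=y_1)=a(y_1)\pi_{11}\frac{p_{11}}{p_1^Y}+a(y_1)\pi_{21}\frac{p_{21}}{p_1^Y}=a(y_1)\left(\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}\right)=a(y_1)E(\pi|Y=y_1).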



Property 4. Additivity. For any random variables S,T we have

(5) E(S+T|Y)=E(S|Y)+E(T|Y).

The proof is left as an exercise.

Property 5. Generalized linearity. For any random variables S,T and functions a(Y),b(Y) equations (4) and (5) imply

E(a(Y)S+b(Y)T|Y)=a(Y)E(S|Y)+b(Y)E(T|Y).


Property 6. Conditioning in case of independence. This property has to do with the informational aspect of conditioning. The usual expectation (1) takes into account all contingencies. (2) and (3) are based on the assumption that one contingency for Y has been realized, so that the other one becomes irrelevant. Therefore E(\pi|Y) is considered  an updated version of (1) that takes into account the arrival of new information that the value of Y has been fixed. Now we can state the property itself: if X,Y are independent, then E(X|Y)=EX, that is, conditioning on Y does not improve our knowledge of EX.

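Proof. In case of independence we have p_{i,j}=p_i^Xp_j^Y for all i,j, so that

E(X|Y=y_1)=x_1\frac{p_{11}}{p_1^Y}+x_2\frac{p_{21}}{p_1^Y}=x_1\frac{p_1^Xp_1^Y}{p_1^Y}+x_2\frac{p_2^Xp_1^Y}{p_1^Y}=x_1p_1^X+x_2p_2^X=EX.

Conditioning on Y=y_2 is treated similarly.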


Property 7. Conditioning in case of complete dependence. Conditioning of Y on Y gives the most precise information: E(Y|Y)=Y (if we condition Y on Y, we know about it everything and there is no averaging). More generally, E(f(Y)|Y)=f(Y) for any deterministic function f.

Proof. If we condition Y on Y, the conditional probabilities become

p_{11}=P(Y=y_1|Y=y_1)=1,\ p_{21}=P(Y=y_2|Y=y_1)=0.

Hence, (2) gives

E(f(Y)|Y=y_1)=f(y_1)\times 1+f(y_2)\times 0=f(y_1).

Conditioning on Y=y_2 is treated similarly.


Not many people know that using the notation E_Y\pi for conditional expectation instead of E(\pi|Y) makes everything much clearer. I rewrite the above properties using this notation:

  1. Law of iterated expectations: E(E_Y\pi)=E\pi
  2. Generalized homogeneity: E_Y(a(Y)\pi)=a(Y)E_Y\pi
  3. Additivity: For any random variables S,T we have E_Y(S+T)=E_YS+E_YT
  4. Generalized linearity: For any random variables S,T and functions a(Y),b(Y) one has E_Y(a(Y)S+b(Y)T)=a(Y)E_YS+b(Y)E_YT
  5. Conditioning in case of independence: if X,Y are independent, then E_YX=EX
  6. Conditioning in case of complete dependence: E_Yf(Y)=f(Y) for any deterministic function f.
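Two items of this summary (generalized homogeneity and conditioning in case of complete dependence) are easy to verify numerically. In the sketch below the table entries, the values of Y and the functions a and f are all made up, and E_Y is an illustrative helper that returns the two values of the conditional expectation, one per value of Y.

```python
from fractions import Fraction as F

ys = [1, 2]                                        # hypothetical values y_1, y_2 of Y
prob = [[F(1, 10), F(3, 10)], [F(2, 5), F(1, 5)]]  # joint probabilities p_{i,j}
pi = [[10, 6], [25, 14]]                           # profit values pi_{i,j}
p_Y = [prob[0][j] + prob[1][j] for j in range(2)]  # margins of Y

def E_Y(values):
    """Conditional expectation given Y of a variable with table values[i][j].

    Returns the list [E(. | Y=y_1), E(. | Y=y_2)].
    """
    return [sum(values[i][j] * prob[i][j] for i in range(2)) / p_Y[j]
            for j in range(2)]

def a(y):
    return 3 * y + 1   # an arbitrary function a(Y)

def f(y):
    return y ** 2      # an arbitrary deterministic function f

# Generalized homogeneity: E_Y(a(Y) pi) = a(Y) E_Y pi.
lhs = E_Y([[a(ys[j]) * pi[i][j] for j in range(2)] for i in range(2)])
rhs = [a(ys[j]) * e for j, e in enumerate(E_Y(pi))]
assert lhs == rhs

# Complete dependence: E_Y f(Y) = f(Y).
f_of_Y = [[f(ys[j]) for j in range(2)] for i in range(2)]
assert E_Y(f_of_Y) == [f(y) for y in ys]
```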