21
Apr 18

## Conditional expectation generalized to continuous random variables

The definition of conditional expectation needs to be generalized to be applicable to continuous random variables. The generalization is accompanied by an example and will later be applied to expected shortfall.

## Generalizing conditional expectation definition

Suppose $X$ can take values $x_1,...,x_I$ with probabilities $p_i^X=P(X=x_i)$ and $Y$ can take values $y_1,...,y_J$ with probabilities

(1) $p_j^Y=P(Y=y_j)$.

Denote the joint probabilities $p_{ij}=P(X=x_i,Y=y_j)$. The definition from this post gives

(2) $E(X|Y=\bar{y}_j)=\frac{\sum_{i=1}^Ip_{ij}x_i}{p_j^Y}$,

where $\bar{y}_j$ is a fixed value of $Y$. The drawback of this definition is its dependence on indexation of values $x_i,y_j$. Our purpose is to show how from this definition one can obtain a definition that does not use indexation and can therefore be applied to continuous random variables. Denote

$1_{\{Y=\bar{y}_j\}}=\left\{\begin{array}{ll}1,&Y=\bar{y}_j;\\0,&\textrm{otherwise}.\end{array} \right.$

Then the sum $\sum_{i=1}^Ip_{ij}x_i$ can be expanded by including zero terms:

(3) $\sum_{i=1}^Ip_{ij}x_i=\sum_{i=1}^I\sum_{j=1}^Jp_{ij}x_i1_{\{Y=\bar{y}_j\}}=E(X1_{\{Y=\bar{y}_j\}})$

(the sum in the middle includes all points in the sample space). Using (1) and (3) we can rewrite (2) as

$E(X|Y=\bar{y}_j)=\frac{E(X1_{\{Y=\bar{y}_j\}})}{P(Y=\bar{y}_j)}$.

Replacing here the conditioning on $Y=\bar{y}_j$ by conditioning on a general set $A$ whose probability is not zero we obtain the definition of conditional expectation:

(4) $E(X|A)=\frac{E(X1_A)}{P(A)}$.
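As a quick check of (4) on a finite sample space, the sketch below computes $E(X|A)$ for a fair die, with $A$ the event that the outcome is even. The die and the helper name `cond_exp` are illustrative assumptions, not part of the original argument.

```python
from fractions import Fraction

def cond_exp(values, probs, in_A):
    """E(X|A) = E(X 1_A) / P(A) for a discrete variable X, as in (4)."""
    e_x_1A = sum(x * p for x, p in zip(values, probs) if in_A(x))  # E(X 1_A)
    p_A = sum(p for x, p in zip(values, probs) if in_A(x))         # P(A)
    if p_A == 0:
        raise ValueError("P(A) must be positive")
    return e_x_1A / p_A

# Fair die (hypothetical example): X takes 1..6 with probability 1/6 each.
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6
result = cond_exp(values, probs, lambda x: x % 2 == 0)
print(result)  # 4
```

Here $E(X1_A)=(2+4+6)/6=2$ and $P(A)=1/2$, so the ratio is 4, the average of the even faces.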

Example. If $z$ is standard normal and $\Phi$ is its distribution function, then for any number $a$ one has

(5) $E(z|z\le a)=-p_z(a)/\Phi(a)$

where $p_z$ is the density.

Proof. From the expression of the density

$\frac{dp_z(t)}{dt}=\frac{1}{\sqrt{2\pi}}e^{-t^2/2}(-t)=-tp_z(t)$.

Applying this equation and (4) we get

$E(z|z\le a)=\frac{E(z1_{\{z\le a\}})}{P(z\le a)}=\frac{\int_{-\infty}^atp_z(t)dt}{\Phi(a)}=\frac{-\int_{-\infty}^a\frac{dp_z(t)}{dt}dt}{\Phi(a)}=\frac{p_z(-\infty)-p_z(a)}{\Phi(a)}=-\frac{p_z(a)}{\Phi(a)}$.
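Formula (5) can be checked numerically. The sketch below compares the closed form $-p_z(a)/\Phi(a)$ with a midpoint-rule approximation of $\int_{-\infty}^a tp_z(t)dt/\Phi(a)$. The cutoff `a`, the truncation point `lower` and the grid size `n` are arbitrary choices made for illustration.

```python
import math

def phi(t):
    """Standard normal density p_z(t)."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def Phi(a):
    """Standard normal distribution function, expressed through the error function."""
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

def truncated_mean_numeric(a, lower=-12.0, n=200_000):
    """Midpoint-rule approximation of E(z 1_{z<=a}) / P(z<=a).

    The integral from -infinity is cut off at `lower`, where the density is negligible.
    """
    h = (a - lower) / n
    integral = 0.0
    for k in range(n):
        t = lower + (k + 0.5) * h
        integral += t * phi(t)
    return integral * h / Phi(a)

a = -1.645  # hypothetical cutoff
closed_form = -phi(a) / Phi(a)
numeric = truncated_mean_numeric(a)
print(closed_form, numeric)
```

The two numbers should agree to several decimal places; the remaining gap comes only from the quadrature and the truncation of the lower limit.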

18
Jul 16

## Properties of conditional expectation

### Background

A company sells a product and may offer a discount. We denote by $X$ the sales volume and by $Y$ the discount amount (per unit). For simplicity, both variables take only two values. They depend on each other. If the sales are high, the discount may be larger. A higher discount, in its turn, may attract more buyers. At the same level of sales, the discount may vary depending on the vendor's costs. With the same discount, the sales vary with consumer preferences. Along with the sales and discount, we consider a third variable that depends on both of them. It can be the profit $\pi$.

### Formalization

The sales volume $X$ takes values $x_1,x_2$ with probabilities $p_i^X=P(X=x_i)$, $i=1,2$. Similarly, the discount $Y$ takes values $y_1,y_2$ with probabilities $p_i^Y=P(Y=y_i)$, $i=1,2$. The joint events have joint probabilities denoted $P(X=x_i,Y=y_j)=p_{i,j}$. The profit in the event $X=x_i,Y=y_j$ is denoted $\pi_{i,j}$. This information is summarized in Table 1.

Table 1

| | $y_1$ | $y_2$ | |
|---|---|---|---|
| $x_1$ | $\pi_{1,1},\ p_{1,1}$ | $\pi_{1,2},\ p_{1,2}$ | $p_1^X$ |
| $x_2$ | $\pi_{2,1},\ p_{2,1}$ | $\pi_{2,2},\ p_{2,2}$ | $p_2^X$ |
| | $p_1^Y$ | $p_2^Y$ | |

Comments. In the left-most column and upper-most row we have values of the sales and discount. In the "margins" (last row and last column) we put probabilities of those values. In the main body of the table we have profit values and their probabilities. It follows that the expected profit is

(1) $E\pi=\pi_{1,1}p_{1,1}+\pi_{1,2}p_{1,2}+\pi_{2,1}p_{2,1}+\pi_{2,2}p_{2,2}.$

### Conditioning

Suppose that the vendor fixes the discount at $y_1$. Then only the column containing this value is relevant. To get numbers that satisfy the completeness axiom, we define conditional probabilities

$P(X=x_1|Y=y_1)=\frac{p_{11}}{p_1^Y},\ P(X=x_2|Y=y_1)=\frac{p_{21}}{p_1^Y}.$

This allows us to define conditional expectation

(2) $E(\pi|Y=y_1)=\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}.$

Similarly, if the discount is fixed at $y_2$,

(3) $E(\pi|Y=y_2)=\pi_{12}\frac{p_{12}}{p_2^Y}+\pi_{22}\frac{p_{22}}{p_2^Y}.$

Equations (2) and (3) are joined in the notation $E(\pi|Y)$.

Property 1. While the usual expectation (1) is a number, the conditional expectation $E(\pi|Y)$ is a function of the value of $Y$ on which the conditioning is being done. Since it is a function of $Y$, it is natural to consider it a random variable defined by the next table

Table 2

| Values | Probabilities |
|---|---|
| $E(\pi\mid Y=y_1)$ | $p_1^Y$ |
| $E(\pi\mid Y=y_2)$ | $p_2^Y$ |

Property 2. Law of iterated expectations: the mean of the conditional expectation equals the usual mean. Indeed, using Table 2, we have

$E[E(\pi|Y)]=E(\pi|Y=y_1)p_1^Y+E(\pi|Y=y_2)p_2^Y$ (applying (2) and (3))

$=\left[\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}\right]p_1^Y+\left[\pi_{12}\frac{p_{12}}{p_2^Y}+\pi_{22}\frac{p_{22}}{p_2^Y}\right]p_2^Y$ $=\pi_{1,1}p_{1,1}+\pi_{1,2}p_{1,2}+\pi_{2,1}p_{2,1}+\pi_{2,2}p_{2,2}=E\pi.$
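The same computation can be carried out with numbers. The sketch below uses a made-up $2\times 2$ table of joint probabilities and profits (all values are hypothetical) and checks that the mean of the conditional expectation equals the usual mean (1).

```python
from fractions import Fraction as F

# Hypothetical joint probabilities p[i][j] = P(X=x_i, Y=y_j) and profits pi[i][j].
p = [[F(1, 10), F(2, 10)],
     [F(3, 10), F(4, 10)]]
pi = [[5, 2],
      [8, 3]]

# Marginal probabilities of Y (column sums of the joint table).
p_Y = [p[0][j] + p[1][j] for j in range(2)]

# Conditional expectations E(pi | Y = y_j), as in (2) and (3).
cond = [(pi[0][j] * p[0][j] + pi[1][j] * p[1][j]) / p_Y[j] for j in range(2)]

# Mean of the conditional expectation versus the usual mean (1).
lhs = sum(cond[j] * p_Y[j] for j in range(2))
rhs = sum(pi[i][j] * p[i][j] for i in range(2) for j in range(2))
print(lhs, rhs)
```

Exact fractions are used so that the two sides can be compared with `==` instead of a floating-point tolerance.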

Property 3. Generalized homogeneity. In the usual homogeneity $E(aX)=aEX$, $a$ is a number. In the generalized homogeneity

(4) $E(a(Y)\pi|Y)=a(Y)E(\pi|Y),$

$a(Y)$ is allowed to be a  function of the variable on which we are conditioning. See for yourself: using (2), for instance,

$E(a(y_1)\pi|Y=y_1)=a(y_1)\pi_{11}\frac{p_{11}}{p_1^Y}+a(y_1)\pi_{21}\frac{p_{21}}{p_1^Y}$ $=a(y_1)\left[\pi_{11}\frac{p_{11}}{p_1^Y}+\pi_{21}\frac{p_{21}}{p_1^Y}\right]=a(y_1)E(\pi|Y=y_1).$

Property 4. Additivity. For any random variables $S,T$ we have

(5) $E(S+T|Y)=E(S|Y)+E(T|Y).$

The proof is left as an exercise.

Property 5. Generalized linearity. For any random variables $S,T$ and functions $a(Y),b(Y)$ equations (4) and (5) imply

$E(a(Y)S+b(Y)T|Y)=a(Y)E(S|Y)+b(Y)E(T|Y).$

Property 6. Conditioning in case of independence. This property has to do with the informational aspect of conditioning. The usual expectation (1) takes into account all contingencies. (2) and (3) are based on the assumption that one contingency for $Y$ has been realized, so that the other one becomes irrelevant. Therefore $E(\pi|Y)$ is considered  an updated version of (1) that takes into account the arrival of new information that the value of $Y$ has been fixed. Now we can state the property itself: if $X,Y$ are independent, then $E(X|Y)=EX$, that is, conditioning on $Y$ does not improve our knowledge of $EX$.

Proof. In case of independence we have $p_{i,j}=p_i^Xp_j^Y$ for all $i,j$, so that

$E(X|Y=y_j)=x_1\frac{p_{1j}}{p_j^Y}+x_2\frac{p_{2j}}{p_j^Y}=x_1p_1^X+x_2p_2^X=EX.$

Property 7. Conditioning in case of complete dependence. Conditioning of $Y$ on $Y$ gives the most precise information: $E(Y|Y)=Y$ (if we condition $Y$ on $Y$, we know about it everything and there is no averaging). More generally, $E(f(Y)|Y)=f(Y)$ for any deterministic function $f$.

Proof. If we condition $Y$ on $Y$, the conditional probabilities become

$p_{11}=P(Y=y_1|Y=y_1)=1,\ p_{21}=P(Y=y_2|Y=y_1)=0.$

Hence, (2) gives

$E(f(Y)|Y=y_1)=f(y_1)\times 1+f(y_2)\times 0=f(y_1).$

Conditioning on $Y=y_2$ is treated similarly.

### Summary

Not many people know that using the notation $E_Y\pi$ for conditional expectation instead of $E(\pi|Y)$ makes everything much clearer. I rewrite the above properties using this notation:

1. Law of iterated expectations: $E(E_Y\pi)=E\pi$
2. Generalized homogeneity: $E_Y(a(Y)\pi)=a(Y)E_Y\pi$
3. Additivity: For any random variables $S,T$ we have $E_Y(S+T)=E_YS+E_YT$
4. Generalized linearity: For any random variables $S,T$ and functions $a(Y),b(Y)$ one has $E_Y(a(Y)S+b(Y)T)=a(Y)E_YS+b(Y)E_YT$
5. Conditioning in case of independence: if $X,Y$ are independent, then $E_YX=EX$
6. Conditioning in case of complete dependence: $E_Yf(Y)=f(Y)$ for any deterministic function $f$.

18
Oct 18

## Law of iterated expectations: geometric aspect

There will be a separate post on projectors. In the meantime, we'll have a look at simple examples that explain a lot about conditional expectations.

### Examples of projectors

The name "projector" is almost self-explanatory. Imagine a point and a plane in the three-dimensional space. Draw a perpendicular from the point to the plane. The intersection of the perpendicular with the plane is the point's projection onto that plane. Note that if the point already belongs to the plane, its projection equals the point itself. Besides, instead of projecting onto a plane we can project onto a straight line.

The above description translates into the following equations. For any $x\in R^3$ define

(1) $P_2x=(x_1,x_2,0)$ and $P_1x=(x_1,0,0).$

$P_2$ projects $R^3$ onto the plane $L_2=\{(x_1,x_2,0):x_1,x_2\in R\}$ (which is two-dimensional) and $P_1$ projects $R^3$ onto the straight line $L_1=\{(x_1,0,0):x_1\in R\}$ (which is one-dimensional).

Property 1. Double application of a projector amounts to single application.

Proof. We do this just for one of the projectors. Using (1) three times we get

(1) $P_2[P_2x]=P_2(x_1,x_2,0)=(x_1,x_2,0)=P_2x.$

Property 2. A successive application of two projectors yields the projection onto a subspace of a smaller dimension.

Proof. If we apply first $P_2$ and then $P_1$, the result is

(2) $P_1[P_2x]=P_1(x_1,x_2,0)=(x_1,0,0)=P_1x.$

If we change the order of projectors, we have

(3) $P_2[P_1x]=P_2(x_1,0,0)=(x_1,0,0)=P_1x.$
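The identities for $P_1,P_2$ are easy to try out on a concrete point. The sketch below codes the two projectors as plain functions on triples; the test point is arbitrary.

```python
def P2(x):
    """Project (x1, x2, x3) onto the plane L2 = {(x1, x2, 0)}."""
    return (x[0], x[1], 0.0)

def P1(x):
    """Project (x1, x2, x3) onto the line L1 = {(x1, 0, 0)}."""
    return (x[0], 0.0, 0.0)

x = (3.0, -2.0, 7.0)  # arbitrary point of R^3

print(P2(P2(x)) == P2(x))  # double application equals single application
print(P1(P2(x)) == P1(x))  # projecting onto the plane, then onto the line
print(P2(P1(x)) == P1(x))  # projecting onto the line, then onto the plane
```

In both orders the result is the projection onto the smaller subspace $L_1$.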

Exercise 1. Show that both projectors are linear.

Exercise 2. Like any other linear operator in a Euclidean space, these projectors are given by some matrices. What are they?

### The simple truth about conditional expectation

In the time series setup, we have a sequence of information sets $...\subset I_t\subset I_{t+1}\subset...$ (it's natural to assume that with time the amount of available information increases). Denote

$E_tX=E(X|I_t)$

the expectation of $X$ conditional on $I_t$. For each $t$,

$E_t$ is a projector onto the space of random functions that depend only on the information set $I_t$.

Property 1. Double application of conditional expectation gives the same result as single application:

(4) $E_t(E_tX)=E_tX$

($E_tX$ is already a function of $I_t$, so conditioning it on $I_t$ doesn't change it).

Property 2. A successive conditioning on two different information sets is the same as conditioning on the smaller one:

(5) $E_tE_{t+1}X=E_tX,$

(6) $E_{t+1}E_tX=E_tX.$

Property 3. Conditional expectation is a linear operator: for any variables $X,Y$ and numbers $a,b$

$E_t(aX+bY)=aE_tX+bE_tY.$

It's easy to see that (4)-(6) are similar to (1)-(3), respectively, but I prefer to use different names for (4)-(6). I call (4) a projector property. (5) is known as the Law of Iterated Expectations, see my post on the informational aspect for more intuition. (6) holds simply because at time $t+1$ the expectation $E_tX$ is known and behaves like a constant.

Summary. (4)-(6) are easy to remember as one property. The smaller information set wins: $E_sE_tX=E_{\min\{s,t\}}X.$
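Properties (4)-(6) can be verified exactly on a small sample space. In the sketch below the sample space consists of three fair coin flips, and the information set $I_t$ is taken to be the knowledge of the first $t$ flips. This setup, the variable `X` and the function name `E` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: three fair coin flips, all eight outcomes equally likely.
OMEGA = list(product([0, 1], repeat=3))

def E(t, X):
    """E_t X: average X over the outcomes that share the first t flips.

    A random variable is stored as a dict mapping each outcome to its value.
    """
    result = {}
    for w in OMEGA:
        same_info = [v for v in OMEGA if v[:t] == w[:t]]
        result[w] = sum(Fraction(X[v]) for v in same_info) / len(same_info)
    return result

# Hypothetical random variable depending on all three flips.
X = {w: 4 * w[0] + 2 * w[1] * w[2] + w[2] for w in OMEGA}

E1X = E(1, X)
print(E(1, E(1, X)) == E1X)  # (4): projector property
print(E(1, E(2, X)) == E1X)  # (5): law of iterated expectations
print(E(2, E(1, X)) == E1X)  # (6): E_1 X is already known at time 2
```

With `t = 0` the function returns the usual expectation, which corresponds to the empty information set.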

13
Oct 18

## Law of iterated expectations: informational aspect

The notion of Brownian motion will help us. Suppose we observe a particle that moves back and forth randomly along a straight line. The particle starts at zero at time zero. The movement can be visualized by plotting time on the horizontal axis and the position of the particle on the vertical axis. $W(t)$ denotes the random position of the particle at time $t$.

Figure 1. Unconditional expectation

In Figure 1, various paths starting at the origin are shown in different colors. The intersections of the paths with vertical lines at times 0.5, 1 and 1.5 show the positions of the particle at these times. The deviations of those positions from $y=0$ to the upside and downside are assumed to be equally likely (more precisely, they are normal variables with mean zero and variance $t$).

### Unconditional expectation

“In the beginning there was nothing, which exploded.” ― Terry Pratchett, Lords and Ladies

If we are at the origin (like the Big Bang), nothing has happened yet and $EW(t)=0$ is the best prediction for any moment $t>0$ we can make (shown by the blue horizontal line in Figure 1). The usual, unconditional expectation $EX$ corresponds to the empty information set.

### Conditional expectation

Figure 2. Conditional expectation

In Figure 2, suppose we are at $t=2.$ The dark blue path between $t=0$ and $t=2$ has been realized. We know that the particle has reached the point $W(2)$ at that time. With this knowledge, we see that the paths starting at this point will have the average

(1) $E(W(t)|W(2))=W(2),$ $t>2.$

This is because the particle will continue moving randomly, with the up and down moves being equally likely. Prediction (1) is shown by the horizontal light blue line between $t=2$ and $t=4.$ In general, this prediction is better than $EW(t)=0$.

Note that for different realized paths, $W(2)$ takes different values. Therefore $E(W(t)|W(2))$, for fixed $t>2$, is a function of $W(2)$ and hence a random variable. It is a function of the value on which we condition the expectation.

### Law of iterated expectations

Figure 3. Law of iterated expectations

Suppose you are at time $t=2$ (see Figure 3). You send many agents to the future $t=3$ to fetch the information about what will happen. They bring you the data on the means $E(W(t)|W(3))$ they see (shown by horizontal lines between $t=3$ and $t=4).$ Since there are many possible future realizations, you have to average the future means. For this, you will use the distributional belief you have at time $t=2.$ The result is $E[E(W(t)|W(3))|W(2)].$ Since the up and down moves are equally likely, your distribution at time $t=2$ is symmetric around $W(2).$ Therefore the above average will be equal to $E(W(t)|W(2)).$ This is the Law of Iterated Expectations, also called the tower property:

(2) $E[E(W(t)|W(3))|W(2)]=E(W(t)|W(2)).$

The knowledge of all of the future predictions $E(W(t)|W(3))$, upon averaging, does not improve or change our current prediction $E(W(t)|W(2))$.

For a full mathematical treatment of conditional expectation see Lecture 10 by Gordan Zitkovic.

4
Oct 17

## Conditional-mean-plus-remainder representation

Conditional-mean-plus-remainder representation: we separate the main part from the remainder and find out the remainder properties. My post on properties of conditional expectation is an elementary introduction to conditioning. This is my first post in Quantitative Finance.

## A brush-up on conditional expectations

1. Notation. Let $X$ be a random variable and let $I$ be an information set. Instead of the usual notation $E(X|I)$ for conditional expectation, in large expressions it's better to use the notation with $I$ in the subscript: $E_IX=E(X|I).$

2. Generalized homogeneity. If $f(I)$ depends only on information $I,$ then $E_I(f(I)X)=f(I)E_I(X)$ (a function of known information is known and behaves like a constant). A special case is $E_I(f(I))=f(I)E_I(1)=f(I).$ With $f(I)=E_I(X)$ we get $E_I(E_I(X))=E_I(X).$ This shows that conditioning is a projector: if you project a point in a 3D space onto a 2D plane and then project the image of the point onto the same plane, the result will be the same image as from single projecting.

3. Additivity. $E_I(X+Y)=E_IX+E_IY.$

4. Law of iterated expectations (LIE). If we know about two information sets that $I_1\subset I_2,$ then $E_{I_1}E_{I_2}X=E_{I_1}X.$ I like the geometric explanation in terms of projectors. Projecting a point onto a plane and then projecting the result onto a straight line is the same as projecting the point directly onto the straight line.

## Conditional-mean-plus-remainder representation

This is a direct generalization of the mean-plus-deviation-from-mean decomposition. There we wrote $X=EX+(X-EX)$ and denoted $\mu=EX,~\varepsilon=X-EX$ to obtain $X=\mu+\varepsilon$ with the property $E\varepsilon=0.$

Here we write $X=E_IX+(X-E_IX)$ and denote $\varepsilon=X-E_IX$ the remainder. Then the representation is

(1) $X=E_IX+\varepsilon.$

Properties. 1) $E_I\varepsilon=E_IX-E_IX=0$ (remember, this is a random variable identically equal to zero, not a number zero).

2) Conditional covariance is obtained from the usual covariance by replacing all usual expectations by conditional. Thus, by definition,

$Cov_I(X,Y)=E_I(X-E_IX)(Y-E_IY).$

For the components in (1) we have

$Cov_I(E_IX,\varepsilon)=E_I(E_IX-E_IE_IX)(\varepsilon-E_I\varepsilon)=E_I(E_IX-E_IX)\varepsilon=0.$

3) $Var_I(\varepsilon)=E_I(\varepsilon-E_I\varepsilon)^{2}=E_I(X-E_IX)^2=Var_I(X).$
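The three properties can be checked on a toy example. In the sketch below the information set $I$ is the value of a second variable $Y$, and the joint distribution of $(X,Y)$ is made up for illustration. All conditional quantities are stored as dictionaries indexed by the value of $Y$.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X=x, Y=y); the information I is the value of Y.
joint = {(1, 0): F(1, 8), (3, 0): F(3, 8), (2, 1): F(1, 4), (6, 1): F(1, 4)}
ys = sorted({y for (_, y) in joint})

def cond_mean(g, y):
    """E_I g(X, Y) evaluated on the event Y = y."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return sum(g(x, yy) * p for (x, yy), p in joint.items() if yy == y) / p_y

m = {y: cond_mean(lambda x, yy: x, y) for y in ys}  # E_I X

def eps(x, y):
    """Remainder in representation (1)."""
    return x - m[y]

# Property 1: conditional mean of the remainder.
mean_eps = {y: cond_mean(eps, y) for y in ys}
# Property 2: conditional covariance between E_I X and the remainder.
cov = {y: cond_mean(lambda x, yy: (m[yy] - m[y]) * (eps(x, yy) - mean_eps[yy]), y)
       for y in ys}
# Property 3: conditional variances of the remainder and of X.
var_eps = {y: cond_mean(lambda x, yy: (eps(x, yy) - mean_eps[yy]) ** 2, y) for y in ys}
var_X = {y: cond_mean(lambda x, yy: (x - m[yy]) ** 2, y) for y in ys}

print(mean_eps, cov, var_eps == var_X)
```

Both `mean_eps` and `cov` are zero for every value of $Y$, which is what "a random variable identically equal to zero" means here.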

2
Jan 17

## Conditional variance properties

### Preliminaries

Review Properties of conditional expectation, especially the summary, where I introduce a new notation for conditional expectation. Everywhere I use the notation $E_Y\pi$ for expectation of $\pi$ conditional on $Y$, instead of $E(\pi|Y)$.

This post and the previous one on conditional expectation show that conditioning is a pretty advanced notion. Many introductory books use the condition $E_xu=0$ (the expected value of the error term $u$ conditional on the regressor $x$ is zero). Because of the complexity of conditioning, I think it's better to avoid this kind of assumption as much as possible.

### Conditional variance properties

Replacing usual expectations by their conditional counterparts in the definition of variance, we obtain the definition of conditional variance:

(1) $Var_Y(X)=E_Y(X-E_YX)^2.$

Property 1. If $X,Y$ are independent, then $X-EX$ and $Y$ are also independent and conditioning doesn't change variance:

$Var_Y(X)=E_Y(X-EX)^2=E(X-EX)^2=Var(X).$

Property 2. Generalized homogeneity of degree 2: if $a$ is a deterministic function, then $a^2(Y)$ can be pulled out:

$Var_Y(a(Y)X)=E_Y[a(Y)X-E_Y(a(Y)X)]^2=E_Y[a(Y)X-a(Y)E_YX]^2$ $=E_Y[a^2(Y)(X-E_YX)^2]=a^2(Y)E_Y(X-E_YX)^2=a^2(Y)Var_Y(X).$

Property 3. Shortcut for conditional variance:

(2) $Var_Y(X)=E_Y(X^2)-(E_YX)^2.$

Proof.

$Var_Y(X)=E_Y(X-E_YX)^2=E_Y[X^2-2XE_YX+(E_YX)^2]$

(distributing conditional expectation)

$=E_YX^2-2E_Y(XE_YX)+E_Y(E_YX)^2$

(applying Properties 2 and 6 from this Summary with $a(Y)=E_YX$)

$=E_YX^2-2(E_YX)^2+(E_YX)^2=E_YX^2-(E_YX)^2.$

Property 4. The law of total variance:

(3) $Var(X)=Var(E_YX)+E[Var_Y(X)].$

Proof. By the shortcut for usual variance and the law of iterated expectations

$Var(X)=EX^2-(EX)^2=E[E_Y(X^2)]-[E(E_YX)]^2$

(replacing $E_Y(X^2)$ from (2))

$=E[Var_Y(X)]+E(E_YX)^2-[E(E_YX)]^2$

(the last two terms give the shortcut for variance of $E_YX$)

$=E[Var_Y(X)]+Var(E_YX).$
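A numerical illustration of (3) may help. The sketch below takes a small made-up joint distribution of $(X,Y)$, computes the conditional mean and the conditional variance for each value of $Y$ using the shortcut (2), and compares the two sides of the law of total variance.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X=x, Y=y).
joint = {(1, 0): F(1, 8), (3, 0): F(3, 8), (2, 1): F(1, 4), (6, 1): F(1, 4)}

# Marginal distribution of Y.
p_Y = {}
for (x, y), p in joint.items():
    p_Y[y] = p_Y.get(y, 0) + p

def E_Y(g, y):
    """E_Y g(X) evaluated on the event Y = y."""
    return sum(g(x) * p for (x, yy), p in joint.items() if yy == y) / p_Y[y]

cond_mean = {y: E_Y(lambda x: x, y) for y in p_Y}
cond_var = {y: E_Y(lambda x: x * x, y) - cond_mean[y] ** 2 for y in p_Y}  # shortcut (2)

# Unconditional mean and variance of X.
EX = sum(x * p for (x, _), p in joint.items())
var_X = sum(x * x * p for (x, _), p in joint.items()) - EX ** 2

# The two terms on the right of (3); the first uses E(E_Y X) = EX.
var_of_cond_mean = sum(cond_mean[y] ** 2 * p_Y[y] for y in p_Y) - EX ** 2
mean_of_cond_var = sum(cond_var[y] * p_Y[y] for y in p_Y)

print(var_X, var_of_cond_mean + mean_of_cond_var)
```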

Before we move further we need to define conditional covariance by

$Cov_Y(S,T) = E_Y(S - E_YS)(T - E_YT)$

(everywhere usual expectations are replaced by conditional ones). We say that random variables $S,T$ are conditionally uncorrelated if $Cov_Y(S,T) = 0$.

Property 5. Conditional variance of a linear combination. For any random variables $S,T$ and functions $a(Y),b(Y)$ one has

$Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+2a(Y)b(Y)Cov_Y(S,T)+b^2(Y)Var_Y(T).$

The proof is quite similar to that in case of usual variances, so we leave it to the reader. In particular, if $S,T$ are conditionally uncorrelated, then the interaction term disappears:

$Var_Y(a(Y)S + b(Y)T)=a^2(Y)Var_Y(S)+b^2(Y)Var_Y(T).$

24
Oct 22

## A problem to do once and never come back

There is a problem I gave on the midterm that does not require much imagination. Just know the definitions and do the technical work, so I was hoping we could put this behind us. Turned out we could not and thus you see this post.

Problem. Suppose the joint density of variables $X,Y$ is given by

$f_{X,Y}(x,y)=\left\{ \begin{array}{c}k\left( e^{x}+e^{y}\right) \text{ for }0<y<x<1, \\ 0\text{ otherwise.}\end{array}\right.$

I. Find $k$.

II. Find marginal densities of $X,Y$. Are $X,Y$ independent?

III. Find conditional densities $f_{X|Y},\ f_{Y|X}$.

IV. Find $EX,\ EY$.

When solving a problem like this, the first thing to do is to give the theory. You may not be able to finish without errors the long calculations but your grade will be determined by the beginning theoretical remarks.

### I. Finding the normalizing constant

Any density should satisfy the completeness axiom: the area under the density curve (or in this case the volume under the density surface) must be equal to one: $\int \int f_{X,Y}(x,y)dxdy=1.$ The constant $k$ chosen to satisfy this condition is called a normalizing constant. The integration in general is over the whole plane $R^{2}$ and the first task is to express the above integral as an iterated integral. This is where the domain where the density is not zero should be taken into account. There is little you can do without geometry. One example of how to do this is here.

The shape of the area $A=\left\{ (x,y):0<y<x<1\right\}$ is determined by a) the extreme values of $x,y$ and b) the relationship between them. The extreme values are 0 and 1 for both $x$ and $y$, meaning that $A$ is contained in the square $\left\{ (x,y):0<x,y<1\right\}.$ The inequality $y<x$ means that we cut out of this square the triangle below the line $y=x$ (it is really the lower triangle because if from a point on the line $y=x$ we move down vertically, $x$ will stay the same and $y$ will become smaller than $x$).

In the iterated integral:

a) the lower and upper limits of integration for the inner integral are the boundaries for the inner variable; they may depend on the outer variable but not on the inner variable.

b) the lower and upper limits of integration for the outer integral are the extreme values for the outer variable; they must be constant.

This is illustrated in Pane A of Figure 1.

Figure 1. Integration order

Always take the inner integral in parentheses to show that you are dealing with an iterated integral.

a) In the inner integral integrating over $x$ means moving along blue arrows from the boundary $x=y$ to the boundary $x=1.$ The boundaries may depend on $y$ but not on $x$ because the outer integral is over $y.$

b) In the outer integral put the extreme values for the outer variable. Thus,

$\underset{A}{\int \int }f_{X,Y}(x,y)dxdy=\int_{0}^{1}\left(\int_{y}^{1}f_{X,Y}(x,y)dx\right) dy.$

Check that if we first integrate over $y$ (vertically along red arrows, see Pane B in Figure 1) then the equation

$\underset{A}{\int \int }f_{X,Y}(x,y)dxdy=\int_{0}^{1}\left(\int_{0}^{x}f_{X,Y}(x,y)dy\right) dx$

results.

In fact, from the definition $A=\left\{ (x,y):0<y<x<1\right\}$ one can see that the inner interval for $x$ is $\left[ y,1\right]$ and for $y$ it is $\left[ 0,x\right] .$

### II. Marginal densities

The condition for independence of $X,Y$ is $f_{X,Y}\left( x,y\right) =f_{X}\left( x\right) f_{Y}\left( y\right)$ (this is a direct analog of the independence condition for events $P\left( A\cap B\right) =P\left( A\right) P\left( B\right)$). In words: the joint density decomposes into a product of individual densities.

### III. Conditional densities

In this case the easiest is to recall the definition of conditional probability $P\left( A|B\right) =\frac{P\left( A\cap B\right) }{P\left(B\right) }.$ The definition of conditional densities $f_{X|Y},\ f_{Y|X}$ is quite similar:

(2) $f_{X|Y}\left( x|y\right) =\frac{f_{X,Y}\left( x,y\right) }{f_{Y}\left( y\right) },\ f_{Y|X}\left( y|x\right) =\frac{f_{X,Y}\left( x,y\right) }{f_{X}\left( x\right) }$.

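Of course, the marginal densities $f_{Y}\left( y\right) ,f_{X}\left( x\right)$ here can be replaced by their integral expressions $f_{Y}\left( y\right) =\int f_{X,Y}\left( x,y\right) dx$ and $f_{X}\left( x\right) =\int f_{X,Y}\left( x,y\right) dy.$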

### IV. Finding expected values of $X,Y$

The usual definition $EX=\int xf_{X}\left( x\right) dx$ takes an equivalent form using the marginal density:

$EX=\int x\left( \int f_{X,Y}\left( x,y\right) dy\right) dx=\int \int xf_{X,Y}\left( x,y\right) dydx.$

Which equation to use is a matter of convenience.

Another replacement in the usual definition gives the definition of conditional expectations:

$E\left( X|Y\right) =\int xf_{X|Y}\left( x|y\right) dx,$ $E\left( Y|X\right) =\int yf_{Y|X}\left( y|x\right) dy.$

Note that these are random variables: $E\left( X|Y=y\right)$ depends on $y$ and $E\left( Y|X=x\right)$ depends on $x.$

### Solution to the problem

Being a lazy guy, for the problem this post is about I provide answers found in Mathematica:

I. $k=0.581977$

II. $f_{X}\left( x\right) =k\left( -1+e^{x}\left( 1+x\right) \right) ,$ for $x\in[ 0,1],$ $f_{Y}\left( y\right) =k\left( e-e^{y}y\right) ,$ for $y\in \left[ 0,1\right] .$

It is readily seen that the independence condition is not satisfied.

III. $f_{X|Y}\left( x|y\right) =\frac{e^{x}+e^{y}}{e-e^{y}y}$ for $0<y<x<1.$

$f_{Y|X}\left(y|x\right) =\frac{e^x+e^y}{-1+e^x\left( 1+x\right) }$ for $0<y<x<1.$

IV. $EX=0.709012,$ $EY=0.372965.$
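The numerical answers in I and IV can be reproduced without Mathematica. The sketch below evaluates the iterated integrals with a composite Simpson rule, assuming the density is nonzero on $0<y<x<1$ (the region implied by the integration limits in Part I). The function names and the grid size are illustrative choices.

```python
import math

def simpson(f, a, b, n=200):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def integrate_over_A(g):
    """Iterated integral over A: outer variable y in [0, 1], inner variable x in [y, 1]."""
    return simpson(lambda y: simpson(lambda x: g(x, y), y, 1.0), 0.0, 1.0)

def shape(x, y):
    """Joint density without the normalizing constant k."""
    return math.exp(x) + math.exp(y)

k = 1.0 / integrate_over_A(shape)                        # completeness axiom
EX = k * integrate_over_A(lambda x, y: x * shape(x, y))  # double-integral form of EX
EY = k * integrate_over_A(lambda x, y: y * shape(x, y))
print(round(k, 6), round(EX, 6), round(EY, 6))
```

The inner limits depend on the outer variable and the outer limits are constants, exactly as prescribed in Part I.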

24
Jun 20

## Solution to Question 2 from UoL exam 2018, Zone B

There are three companies, called A, B, and C, and each has a 4% chance of going bankrupt. The event that one of the three companies will go bankrupt is independent of the event that any other company will go bankrupt.

Company A has outstanding bonds, and a bond will have a net return of $r = 0\%$ if the corporation does not go bankrupt, but it will have a net return of $r = -100\%$, i.e., losing everything invested, if it goes bankrupt. Suppose an investor buys $1000 worth of bonds of company A, which we will refer to as portfolio ${P_1}$. Suppose also that there exists a security whose payout depends on the bankruptcy of companies B and C in a joint fashion. In particular, if neither B nor C go bankrupt, this derivative will have a net return of $r = 0\%$. If exactly one of B or C go bankrupt, it will have a net return of $r = -50\%$, i.e., losing half of the investment. If both B and C go bankrupt, it will have a net return of $r = -100\%$, i.e., losing the whole investment. Suppose an investor buys $1000 worth of this derivative, which is then called portfolio ${P_2}$.

(a) Calculate the VaR at the $\alpha = 10\%$ critical level for portfolios $P_1$ and ${P_2}$. [30 marks]

Independence of events. Denote $A,{A^c}$ the events that company A goes bankrupt and does not go bankrupt, resp. A similar notation will be used for the other two companies. The simple definition of independence of bankruptcy events $P(A \cap B) = P(A)P(B)$ would be too difficult to apply to prove independence of all events that we need. A general definition of independence of variables is that their sigma-fields are independent (it will not be explained here). This general definition implies that in all cases below we can use multiplicativity of probability such as

$P(B \cap C) = P(B)P(C) = {0.04^2} = 0.0016,\,\,P({B^c} \cap {C^c}) = {0.96^2} = 0.9216,$ $P((B \cap {C^c}) \cup ({B^c} \cap C)) = P(B \cap {C^c}) + P({B^c} \cap C) = 2 \times 0.04 \times 0.96 = 0.0768.$

The events here have a simple interpretation: the first is that “both B and C fail”, the second is that “neither B nor C fails”, and the third is that “either (B fails and C does not) or (B does not fail and C does)” (they do not intersect and additivity of probability applies).

Let ${r_A},{r_S}$ be returns on A and the security S, resp. From the problem statement it follows that these returns are described by the tables
Table 1

| ${r_A}$ | Prob |
|---|---|
| 0 | 0.96 |
| -100 | 0.04 |

Table 2

| ${r_S}$ | Prob |
|---|---|
| 0 | 0.9216 |
| -50 | 0.0768 |
| -100 | 0.0016 |

Everywhere we will be working with percentages, so the dollar values don’t matter.

From Table 1 we conclude that the distribution function of return on A looks as follows:

Figure 1. Distribution function of portfolio A

At $x=-100$ the function jumps up by 0.04, at $x=0$ by another 0.96. The dashed line at $y=0.1$ is used in the definition of the VaR using the generalized inverse:

$VaR_A^{0.1} = \inf \{ {x:{F_A}(x) \ge 0.1}\} = 0.$

From Table 2 we see that the distribution function of return on S looks like this:

The first jump is at $x=-100$, the second at $x=-50$ and third one at $x=0$. As above, it follows that

$VaR_S^{0.1} = \inf\{ {x:{F_S}(x) \ge 0.1}\} = 0.$
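The generalized inverse is easy to compute for a discrete distribution: sort the values, accumulate the probabilities and stop at the first value where the cumulative probability reaches the level. The sketch below does this for Tables 1 and 2; the function name is an illustrative choice and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction as F

def var_generalized_inverse(dist, alpha):
    """inf{x : F(x) >= alpha} for a discrete distribution given as {value: probability}."""
    cumulative = 0
    for x in sorted(dist):          # from the largest loss upwards
        cumulative += dist[x]       # value of the distribution function at x
        if cumulative >= alpha:
            return x
    raise ValueError("alpha exceeds the total probability")

table_1 = {0: F(96, 100), -100: F(4, 100)}
table_2 = {0: F(9216, 10000), -50: F(768, 10000), -100: F(16, 10000)}

var_A = var_generalized_inverse(table_1, F(1, 10))
var_S = var_generalized_inverse(table_2, F(1, 10))
print(var_A, var_S)  # 0 0
```

For the security S the cumulative probabilities are 0.0016 at -100 and 0.0784 at -50, both below 0.1, so the level is first reached at 0.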

(b) Calculate the VaR at the $\alpha=10\%$ critical level for the joint portfolio ${P_1} + {P_2}$. [20 marks]

To find the return distribution for $P_1 + P_2$, we have to consider all pairs of events from Tables 1 and 2 using independence.

1. $P({r_A}=0,{r_S}=0)=0.96\times 0.9216=0.884736$

2. $P({r_A}=-100,{r_S}=0)=0.04\times 0.9216=0.036864$

3. $P({r_A}=0,{r_S}=-50)=0.96\times 0.0768=0.073728$

4. $P({r_A}=-100,{r_S}=-50)=0.04\times 0.0768=0.003072$

5. $P({r_A}=0,{r_S}=-100)=0.96\times 0.0016=0.001536$

6. $P({r_A}=-100,{r_S}=-100)=0.04\times 0.0016=0.000064$

Since we deal with a joint portfolio, percentages for separate portfolios should be translated into ones for the whole portfolio. For example, the loss of 100% on one portfolio and 0% on the other means 50% on the joint portfolio (investments are equal). There are two such losses, in lines 2 and 5, so the probabilities should be added. Thus, we obtain the table for the return $r$ on the joint portfolio:

Table 3

| $r$ | Prob |
|---|---|
| 0 | 0.884736 |
| -25 | 0.073728 |
| -50 | 0.0384 |
| -75 | 0.003072 |
| -100 | 0.000064 |
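Table 3 can be rebuilt mechanically from Tables 1 and 2. The sketch below multiplies the probabilities of all pairs of outcomes (independence), averages the two returns (the investments are equal) and adds up the probabilities of pairs that give the same joint return. Variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction as F

table_1 = {0: F(96, 100), -100: F(4, 100)}
table_2 = {0: F(9216, 10000), -50: F(768, 10000), -100: F(16, 10000)}

table_3 = defaultdict(F)  # missing keys start at Fraction(0)
for r_A, p_A in table_1.items():
    for r_S, p_S in table_2.items():
        r = F(r_A + r_S, 2)        # equal investments: joint return is the average
        table_3[r] += p_A * p_S    # independence: joint probability is the product

for r in sorted(table_3, reverse=True):
    print(int(r), float(table_3[r]))
```

The return of -50 collects two pairs, a total loss on A with no loss on S and no loss on A with a total loss on S, which is why its probability is a sum.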

Here the probabilities have to be accumulated starting from the largest loss: ${F_r}(-100)=0.000064$, ${F_r}(-75)=0.003136$, ${F_r}(-50)=0.041536$, ${F_r}(-25)=0.115264$. The level 0.1 is first reached at $x=-25$, so the definition of the generalized inverse gives

$VaR_r^{0.1} = \inf \{ {x:{F_r}(x) \ge 0.1}\} = -25.$

(c) Is VaR sub-additive in this example? Explain why the absence of sub-additivity may be a concern for risk managers. [20 marks]

To check sub-additivity, we need to pass to positive numbers, as explained in other posts. In positive numbers the VaR of the joint portfolio is 25, while the VaRs of $P_1$ and $P_2$ are both zero. The inequality $25 \le 0 + 0$ is false, so sub-additivity fails in this example. Lack of sub-additivity is an undesirable property for risk managers, because for them keeping the VaR at low levels for portfolio parts doesn’t mean having low VaR for the whole portfolio.

(d) The expected shortfall $E{S^\alpha }$ at the $\alpha$ critical level can be defined as

$ES^\alpha= - E_t[R|R < - VaR_{t + 1}^\alpha],$

where $R$ is a return or dollar amount. Calculate the expected shortfall at the $\alpha = 10\%$ critical level for portfolio $P_2$. Is this risk measure sub-additive? [30 marks]

Using the definition of conditional expectation and Table 2, which describes portfolio $P_2$, and recalling that $VaR_S^{0.1}=0$, we have (the time subscript can be omitted because the problem is static)

$ES^{0.1}=-E[r_S|r_S<0]=-\frac{-50\times 0.0768-100\times 0.0016}{0.0768+0.0016}=\frac{4}{0.0784}=51.0204.$

There is a theoretical property that the expected shortfall is sub-additive.

13
Apr 19

## Checklist for Quantitative Finance FN3142

Students of FN3142 often think that they can get by by picking up a few technical tricks. The questions below are mostly about intuition that helps to understand and apply those tricks.

Everywhere we assume that $...,Y_{t-1},Y_t,Y_{t+1},...$ is a time series and $...,I_{t-1},I_t,I_{t+1},...$ is a sequence of corresponding information sets. It is natural to assume that $I_t\subset I_{t+1}$ for all $t.$ We use the short conditional expectation notation: $E_tX=E(X|I_t)$.

### Questions

Question 1. How do you calculate conditional expectation in practice?

Question 2. How do you explain $E_t(E_tX)=E_tX$?

Question 3. Simplify each of $E_tE_{t+1}X$ and $E_{t+1}E_tX$ and explain intuitively.

Question 4. $\varepsilon _t$ is a shock at time $t$. Positive and negative shocks are equally likely. What is your best prediction now for tomorrow's shock? What is your best prediction now for the shock that will happen the day after tomorrow?

Question 5. How and why do you predict $Y_{t+1}$ at time $t$? What is the conditional mean of your prediction?

Question 6. What is the error of such a prediction? What is its conditional mean?

Question 7. Answer the previous two questions replacing $Y_{t+1}$ by $Y_{t+p}$.

Question 8. What is the mean-plus-deviation-from-mean representation (conditional version)?

Question 9. How is the representation from Q.8 reflected in variance decomposition?

Question 10. What is a canonical form? State and prove all properties of its parts.

Question 11. Define conditional variance for white noise process and establish its link with the unconditional one.

Question 12. How do you define the conditional density in case of two variables, when one of them serves as the condition? Use it to prove the LIE.

Question 13. Write down the joint distribution function for a) independent observations and b) for serially dependent observations.

Question 14. If one variable is a linear function of another, what is the relationship between their densities?

Question 15. What can you say about the relationship between $a,b$ if $f(a)=f(b)$? Explain geometrically the definition of the quasi-inverse function.

Answer 1. Conditional expectation is a complex notion. There are several definitions of differing levels of generality and complexity. See one of them here and another in Answer 12.

The point of this exercise is that any definition requires a lot of information, and in practice there is no way to apply any of them to actually calculate a conditional expectation. Then why do theorists juggle conditional expectations? The efficient market hypothesis comes to the rescue: it is posited that all observed market data incorporate all available information and, in particular, that stock prices are already conditioned on $I_t.$

Answers 2 and 3. This is the best explanation I have.

Answer 4. Since positive and negative shocks are equally likely, the best prediction is $E_t\varepsilon _{t+1}=0$ (I call this equation a martingale condition). Similarly, $E_t\varepsilon _{t+2}=0$ but in this case I prefer to see an application of the LIE: $E_{t}\varepsilon _{t+2}=E_t(E_{t+1}\varepsilon _{t+2})=E_t0=0.$

Answer 5. The best prediction is $\hat{Y}_{t+1}=E_tY_{t+1}$ because it minimizes $E_t(Y_{t+1}-f(I_t))^2$ among all functions $f$ of current information $I_t.$ Formally, you can use the first order condition

$\frac{d}{df(I_t)}E_t(Y_{t+1}-f(I_t))^2=-2E_t(Y_{t+1}-f(I_t))=0$

to find that $f(I_t)=E_tf(I_t)=E_tY_{t+1}$ is the minimizing function. By the projector property
$E_t\hat{Y}_{t+1}=E_tE_tY_{t+1}=E_tY_{t+1}=\hat{Y}_{t+1}.$
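To see the minimization at work, here is a small numerical sketch. With the information set fixed, $f(I_t)$ is just a number, so we can search over candidate numbers. The conditional distribution of $Y_{t+1}$ below is a made-up assumption, not data from the course.

```python
# Hypothetical conditional distribution of Y_{t+1} given I_t (an assumption for illustration).
values = [-2.0, 0.0, 1.0, 4.0]
probs = [0.1, 0.4, 0.3, 0.2]

def cond_mse(f):
    """E_t (Y_{t+1} - f)^2 for a candidate prediction f known at time t."""
    return sum(p * (y - f) ** 2 for y, p in zip(values, probs))

# The conditional mean E_t Y_{t+1}.
cond_mean = sum(p * y for y, p in zip(values, probs))

# Search a grid of candidate predictions and pick the one with the smallest error.
grid = [i / 100 for i in range(-300, 501)]
best = min(grid, key=cond_mse)

# First order condition: the deviation from the conditional mean averages to zero.
mean_error = sum(p * (y - cond_mean) for y, p in zip(values, probs))

print(cond_mean, best, mean_error)
```

The grid search lands on the conditional mean, and the average deviation from it is zero up to rounding error, which is what the first order condition says.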

Answer 6. It is natural to define the prediction error by

$\hat{\varepsilon}_{t+1}=Y_{t+1}-\hat{Y}_{t+1}=Y_{t+1}-E_tY_{t+1}.$

By the projector property $E_t\hat{\varepsilon}_{t+1}=E_tY_{t+1}-E_tY_{t+1}=0$.

Answer 7. To generalize, just change the subscripts. For the prediction we have to use two subscripts: the notation $\hat{Y}_{t,t+p}$ means that we are trying to predict what happens at a future date $t+p$ based on info set $I_t$ (time $t$ is like today). Then by definition $\hat{Y} _{t,t+p}=E_tY_{t+p},$ $\hat{\varepsilon}_{t,t+p}=Y_{t+p}-E_tY_{t+p}.$

Answer 8. Answer 7, obviously, implies $Y_{t+p}=\hat{Y}_{t,t+p}+\hat{\varepsilon}_{t,t+p}.$ The simple case is here.

Answer 9. See the law of total variance and change it to reflect conditioning on $I_t.$

Answer 11. Combine conditional variance definition with white noise definition.

Answer 12. The conditional density is defined similarly to the conditional probability. Let $X,Y$ be two random variables. Denote $p_X$ the density of $X$ and $p_{X,Y}$ the joint density. Then the conditional density of $Y$ conditional on $X$ is defined as $p_{Y|X}(y|x)=\frac{p_{X,Y}(x,y)}{p_X(x)}$ (for $x$ with $p_X(x)>0$). After this we can define the conditional expectation $E(Y|x)=\int yp_{Y|X}(y|x)dy$ for each such value $x$ of $X$; $E(Y|X)$ is this function evaluated at $X.$ With these definitions one can prove the Law of Iterated Expectations:

$E[E(Y|X)]=\int E(Y|x)p_X(x)dx=\int \left( \int yp_{Y|X}(y|x)dy\right) p_X(x)dx$

$=\int \int y\frac{p_{X,Y}(x,y)}{p_X(x)}p_X(x)dxdy=\int \int yp_{X,Y}(x,y)dxdy=EY.$
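The same chain of equalities can be checked numerically in the discrete case, where sums replace integrals. The sketch below is only an illustration: the joint probabilities are invented, and the names `joint`, `p_x` and `cond_mean_y` are not from any library.

```python
# Hypothetical joint probabilities P(X=x, Y=y) (an assumption for illustration).
joint = {
    (0, 1): 0.10, (0, 2): 0.30,
    (1, 1): 0.25, (1, 2): 0.05,
    (2, 1): 0.20, (2, 2): 0.10,
}

# Marginal distribution of X: p_X(x) is the sum over y of p_{X,Y}(x, y).
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

def cond_mean_y(x):
    """E(Y|X=x): sum over y of y * p_{X,Y}(x, y) / p_X(x)."""
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / p_x[x]

# Left side of the LIE: average E(Y|X=x) with weights p_X(x).
lhs = sum(cond_mean_y(x) * px for x, px in p_x.items())

# Right side: the unconditional mean of Y.
rhs = sum(y * p for (_, y), p in joint.items())

print(lhs, rhs)
```

The division by `p_x[x]` inside `cond_mean_y` is cancelled by the weight `px` in the outer sum, exactly as $p_X(x)$ cancels in the proof.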

This is an illustration to Answer 1 and a prelim to Answer 13.

Answer 13. Understanding this answer is essential for Section 8.6 of Patton's guide, which deals with maximum likelihood.

a) In case of independent observations $X_1,...,X_n$ the joint density of the vector $X=(X_1,...,X_n)$ is a product of individual densities:

$p_X(x_1,...,x_n)=p_{X_1}(x_1)...p_{X_n}(x_n).$

b) In the time series context it is natural to assume that the next observation depends on the previous ones, that is, for each $t,$ $X_t$ depends on $X_1,...,X_{t-1}$ (serially dependent observations). Therefore we should work with conditional densities $p_{X_1,...,X_t|X_1,...,X_{t-1}}.$ From Answer 12 we can guess how to make conditional densities appear:

$p_{X_1,...,X_n}(x_1,...,x_n)=\frac{p_{X_1,...,X_n}(x_1,...,x_n)}{ p_{X_1,...,X_{n-1}}(x_1,...,x_{n-1})}\frac{p_{X_1,...,X_{n-1}}(x_1,...,x_{n-1})}{ p_{X_1,...,X_{n-2}}(x_1,...,x_{n-2})}...\frac{p_{X_1,X_2}(x_1,x_2)}{p_{X_1}(x_1)}p_{X_1}(x_1).$

The fractions on the right are recognized as conditional densities. The resulting expression is pretty awkward:

$p_{X_1,...,X_n}(x_1,...,x_n)=p_{X_1,...,X_n|X_1,...,X_{n-1}}(x_1,...,x_n|x_1,...,x_{n-1})\times$ $p_{X_1,...,X_{n-1}|X_1,...,X_{n-2}}(x_1,...,x_{n-1}|x_1,...,x_{n-2})\times ...\times$ $p_{X_1,X_2|X_1}(x_1,x_2|x_1)p_{X_1}(x_1).$
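A discrete sketch may make the factorization less awkward. It takes three serially dependent binary observations with invented joint probabilities (an assumption for illustration) and checks that the product of the conditional factors gives back the joint probability at every point.

```python
from itertools import product

# Hypothetical joint probabilities of three serially dependent binary observations
# (the weights are an assumption for illustration; they sum to 1).
weights = [0.20, 0.05, 0.10, 0.15, 0.05, 0.10, 0.05, 0.30]
joint = dict(zip(product([0, 1], repeat=3), weights))

def marginal(prefix):
    """Probability that the first len(prefix) observations equal prefix."""
    k = len(prefix)
    return sum(p for x, p in joint.items() if x[:k] == prefix)

def conditional(x, t):
    """Probability of the t-th observation given the previous ones (t = 1, 2, 3)."""
    if t == 1:
        return marginal(x[:1])
    return marginal(x[:t]) / marginal(x[:t - 1])

# The product of the conditional factors recovers the joint probability at every point.
for x, p in joint.items():
    chain = conditional(x, 1) * conditional(x, 2) * conditional(x, 3)
    assert abs(chain - p) < 1e-12
```

Taking logarithms turns the product of conditional factors into a sum, which is the form in which it enters a log-likelihood.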

Answer 14. The answer given here helps one understand how to pass from the density of the standard normal to that of the general normal.

Answer 15. This elementary explanation of the function definition can be used in the fifth grade. Note that conditions sufficient for existence of the inverse are not satisfied in a case as simple as the distribution function of the Bernoulli variable (when the graph of the function has flat pieces and is not continuous). Therefore we need a more general definition of an inverse. Those who think that this question is too abstract can check out UoL exams, where examinees are required to find Value at Risk when the distribution function is a step function. To understand the idea, do the following:

a) Draw a graph of a good function $f$ (continuous and increasing).

b) Fix some value $y_0$ in the range of this function and identify the region $\{y:y\ge y_0\}$.

c) Find the solution $x_0$ of the equation $f(x)=y_0$. By definition, $x_0=f^{-1}(y_0).$ Identify the region $\{x:f(x)\ge y_0\}$.

d) Note that $x_0=\min\{x:f(x)\ge y_0\}$. In general, for bad functions the minimum here may not exist. Therefore the minimum is replaced by the infimum, which gives us the definition of the quasi-inverse:

$x_0=\inf\{x:f(x)\ge y_0\}$.
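For a step distribution function the quasi-inverse is easy to compute by hand, and the following sketch does the same thing in Python. The function name `quasi_inverse`, its arguments and the Bernoulli probabilities are assumptions chosen for illustration. The sketch assumes a right-continuous step function and $0<y_0\le 1$.

```python
def quasi_inverse(points, cdf_values, y0):
    """Return inf{x : F(x) >= y0} for a step function F that jumps at `points`.

    points     -- jump locations in increasing order
    cdf_values -- value of F at each jump location (and up to the next one)
    For a right-continuous step function the infimum is the first jump
    location where F reaches y0.
    """
    for x, f in zip(points, cdf_values):
        if f >= y0:
            return x
    raise ValueError("y0 exceeds the largest value of F")

# Bernoulli variable with P(X=0)=0.3, P(X=1)=0.7 (the numbers are an assumption).
points = [0, 1]
cdf_values = [0.3, 1.0]

print(quasi_inverse(points, cdf_values, 0.2))  # F(0)=0.3 >= 0.2, so the answer is 0
print(quasi_inverse(points, cdf_values, 0.3))  # still 0: F(0) equals 0.3 exactly
print(quasi_inverse(points, cdf_values, 0.5))  # F(0) < 0.5 <= F(1), so the answer is 1
```

Note that the equation $F(x)=0.5$ has no solution here, yet the quasi-inverse is well defined.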

9
Nov 18

## Geometry and algebra of projectors

Projectors are geometrically so simple that they should have been discussed somewhere in the beginning of this course. I am giving them now because the applications are more advanced.

### Motivating example

Let $L$ be the $x$-axis and $L^\perp$ the $y$-axis on the plane. Let $P$ be the projector onto $L$ along $L^\perp$ and let $Q$ be the projector onto $L^\perp$ along $L.$ This geometry translates into the following definitions:

$L=\{(x,0):x\in R\},$ $L^\perp=\{(0,y):y\in R\},$ $P(x,y)=(x,0),$ $Q(x,y)=(0,y).$

The theory is modeled on the following observations.

a) $P$ leaves the elements of $L$ unchanged and sends to zero all elements of $L^\perp.$

b) $L$ is the image of $P$ and $L^\perp$ is the null space of $P.$

c) Any element of the image of $P$ is orthogonal to any element of the image of $Q.$

d) Any $x$ can be represented as $x=(x_1,0)+(0,x_2)=Px+Qx.$ It follows that $I=P+Q.$

For more simple examples, see my post on conditional expectations.

### Formal approach

Definition 1. A square matrix $P$ is called a projector if it satisfies two conditions: 1) $P^2=P$ ($P$ is idempotent; for some reason, students remember this term better than others) and 2) $P^T=P$ ($P$ is symmetric).

Exercise 1. Denote $L_P=\{x:Px=x\}$ the set of points $x$ that are left unchanged by $P.$ Then $L_P$ is the image of $P$ (and therefore a subspace).

Proof. Indeed, the image of $P$ consists of points $y=Px.$ For any such $y,$ we have $Py=P^2x=Px=y,$ so $y$ belongs to $L_P.$ Conversely, any element of $L_P$ is seen to belong to the image of $P.$

Exercise 2. a) The null space and image of $P$ are orthogonal. b) We have an orthogonal decomposition $R^n=N(P)\oplus \text{Img}(P).$

Proof. a) If $x\in \text{Img}(P)$ and $y\in N(P),$ then $Py=0$ and by Exercise 1 $Px=x.$ Therefore, using the symmetry of $P,$ we get $x\cdot y=(Px)\cdot y=x\cdot (Py)=0.$ This shows that $\text{Img}(P)\perp N(P).$

b) For any $x$ write $x=Px+(I-P)x.$ Here $Px\in \text{Img}(P)$ and $(I-P)x\in N(P)$ because $P(I-P)x=(P-P^2)x=0.$

Exercise 3. a) Along with $P,$ the matrix $Q=I-P$ is also a projector. b) $\text{Img}(Q)=N(P)$ and $N(Q)=\text{Img}(P).$

Proof. a) $Q$ is idempotent: $Q^2=(I-P)^2=I-2P+P^2=I-P=Q.$ $Q$ is also symmetric: $Q^T=I^T-P^T=I-P=Q.$

b) By Exercise 1 applied to $Q$

$\text{Img}(Q)=\{x:Qx=x\}=\{x:(I-P)x=x\}=\{x:Px=0\}=N(P).$

Since $P=I-Q,$ the same argument with the roles of $P$ and $Q$ interchanged gives $N(Q)=\text{Img}(P).$

It follows that, as with $P,$ the set $L_Q=\{x:Qx=x\}$ is the image of $Q$ and it consists of points that are not changed by $Q.$
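The properties in Definition 1 and Exercise 3 can be verified numerically. The sketch below uses the projector onto the line spanned by a vector $v$, namely $P=vv^T/(v\cdot v)$. This is a standard construction that is not discussed above, and the choice of $v$ is an assumption for illustration. The helper functions are written out in plain Python to keep the example self-contained.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

def close(a, b, tol=1e-12):
    """Entrywise comparison of two matrices up to rounding error."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Projector onto the line spanned by v: P = v v^T / (v . v). The vector is an assumption.
v = [1.0, 2.0, 2.0]
norm2 = sum(c * c for c in v)
P = [[vi * vj / norm2 for vj in v] for vi in v]
I = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
Q = [[I[i][j] - P[i][j] for j in range(3)] for i in range(3)]
zero = [[0.0] * 3 for _ in range(3)]

checks = {
    "P idempotent": close(matmul(P, P), P),
    "P symmetric": close(transpose(P), P),
    "Q idempotent": close(matmul(Q, Q), Q),
    "Q symmetric": close(transpose(Q), Q),
    "PQ = 0": close(matmul(P, Q), zero),
}
print(checks)
```

The last check, $PQ=0$, says that $P$ annihilates the image of $Q$, in line with $\text{Img}(Q)=N(P).$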