Oct 18

Law of iterated expectations: geometric aspect

There will be a separate post on projectors. In the meantime, we'll have a look at simple examples that explain a lot about conditional expectations.

Examples of projectors

The name "projector" is almost self-explanatory. Imagine a point and a plane in three-dimensional space. Draw a perpendicular from the point to the plane. The intersection of the perpendicular with the plane is the point's projection onto that plane. Note that if the point already belongs to the plane, its projection equals the point itself. Besides, instead of projecting onto a plane we can project onto a straight line.

The above description translates into the following equations. For any x\in R^3 define

(1) P_2x=(x_1,x_2,0) and P_1x=(x_1,0,0).

P_2 projects R^3 onto the plane L_2=\{(x_1,x_2,0):x_1,x_2\in R\} (which is two-dimensional) and P_1 projects R^3 onto the straight line L_1=\{(x_1,0,0):x_1\in R\} (which is one-dimensional).

Property 1. Double application of a projector amounts to single application.

Proof. We do this just for one of the projectors. Using (1) three times we get

(1) P_2[P_2x]=P_2(x_1,x_2,0)=(x_1,x_2,0)=P_2x.

Property 2. A successive application of two projectors yields the projection onto a subspace of a smaller dimension.

Proof. If we apply first P_2 and then P_1, the result is

(2) P_1[P_2x]=P_1(x_1,x_2,0)=(x_1,0,0)=P_1x.

If we change the order of projectors, we have

(3) P_2[P_1x]=P_2(x_1,0,0)=(x_1,0,0)=P_1x.
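
The two properties are easy to check numerically. Below is a minimal Python sketch; the function names p1 and p2 (standing for P_1 and P_2) and the test point are my illustrative choices, not part of the theory.

```python
# Coordinate projectors written as plain functions on 3-tuples.
# The names p1, p2 and the test point x are illustrative choices.

def p2(x):
    """Project onto the plane L_2 = {(x1, x2, 0)}."""
    return (x[0], x[1], 0.0)

def p1(x):
    """Project onto the line L_1 = {(x1, 0, 0)}."""
    return (x[0], 0.0, 0.0)

x = (3.0, -2.0, 5.0)

idempotent = p2(p2(x)) == p2(x)          # Property 1: P_2[P_2 x] = P_2 x
first_p2_then_p1 = p1(p2(x)) == p1(x)    # equation (2)
first_p1_then_p2 = p2(p1(x)) == p1(x)    # equation (3)
print(idempotent, first_p2_then_p1, first_p1_then_p2)
```

In both orders of application the result is the projection onto the smaller subspace L_1.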

Exercise 1. Show that both projectors are linear.

Exercise 2. Like any other linear operator in a Euclidean space, these projectors are given by some matrices. What are they?

The simple truth about conditional expectation

In the time series setup, we have a sequence of information sets ...\subset I_t\subset I_{t+1}\subset... (it's natural to assume that with time the amount of available information increases). Denote

E_tX=E(X|I_t)

the expectation of X conditional on I_t. For each t,

E_t is a projector onto the space of random functions that depend only on the information set I_t.

Property 1. Double application of conditional expectation gives the same result as single application:

(4) E_t(E_tX)=E_tX

(E_tX is already a function of I_t, so conditioning it on I_t doesn't change it).

Property 2. A successive conditioning on two different information sets is the same as conditioning on the smaller one:

(5) E_tE_{t+1}X=E_tX,

(6) E_{t+1}E_tX=E_tX.

Property 3. Conditional expectation is a linear operator: for any variables X,Y and numbers a,b

E_t(aX+bY)=aE_tX+bE_tY.

It's easy to see that (4)-(6) are similar to (1)-(3), respectively, but I prefer to use different names for (4)-(6). I call (4) a projector property. (5) is known as the Law of Iterated Expectations, see my post on the informational aspect for more intuition. (6) holds simply because at time t+1 the expectation E_tX is known and behaves like a constant.

Summary. (4)-(6) are easy to remember as one property. The smaller information set wins: E_sE_tX=E_{\min\{s,t\}}X.
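
To see the summary property at work, here is a small Python sketch on a finite sample space. Everything in it is an assumption made for illustration: the sample space is three fair coin flips, the information set I_t is "the first t flips are known", and the random variable X is an arbitrary payoff. Conditional expectation then reduces to averaging over outcomes that share the first t flips.

```python
from itertools import product
from fractions import Fraction

# Sample space: all sequences of three fair coin flips (8 equally likely outcomes).
# Assumed information sets: I_t = "the first t flips are known", t = 0, 1, 2, 3.
OUTCOMES = list(product((0, 1), repeat=3))

def cond_exp(X, t):
    """Return E_t X as a dict outcome -> value.

    For each outcome w, average X over all outcomes sharing the first t flips with w.
    """
    result = {}
    for w in OUTCOMES:
        cell = [v for v in OUTCOMES if v[:t] == w[:t]]
        result[w] = sum(Fraction(X[v]) for v in cell) / len(cell)
    return result

# An arbitrary random variable (hypothetical payoff depending on all three flips).
X = {w: 4 * w[0] + 2 * w[1] * w[2] - w[2] for w in OUTCOMES}

for s in range(4):
    for t in range(4):
        lhs = cond_exp(cond_exp(X, t), s)    # E_s E_t X
        rhs = cond_exp(X, min(s, t))         # E_{min(s,t)} X
        assert lhs == rhs
print("E_s E_t X = E_min(s,t) X holds for all s, t in 0..3")
```

Exact fractions are used instead of floats so that the equality is checked without rounding errors. The case t=0 corresponds to the unconditional expectation.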

Oct 18

Law of iterated expectations: informational aspect

The notion of Brownian motion will help us. Suppose we observe a particle that moves back and forth randomly along a straight line. The particle starts at zero at time zero. The movement can be visualized by plotting time on the horizontal axis and the position of the particle on the vertical axis. W(t) denotes the random position of the particle at time t.

Figure 1. Unconditional expectation

In Figure 1, various paths starting at the origin are shown in different colors. The intersections of the paths with vertical lines at times 0.5, 1 and 1.5 show the positions of the particle at these times. The deviations of those positions from y=0 to the upside and downside are assumed to be equally likely (more precisely, they are normal variables with mean zero and variance t).

Unconditional expectation

“In the beginning there was nothing, which exploded.” ― Terry Pratchett, Lords and Ladies

If we are at the origin (like the Big Bang), nothing has happened yet and EW(t)=0 is the best prediction for any moment t>0 we can make (shown by the blue horizontal line in Figure 1). The usual, unconditional expectation EX corresponds to the empty information set.

Conditional expectation

Figure 2. Conditional expectation

In Figure 2, suppose we are at t=2. The green path between t=0 and t=2 has been realized. We know that the particle has reached the point W(2) at that time. With this knowledge, we see that the paths starting at this point will have the average

(1) E(W(t)|W(2))=W(2), t>2.

This is because the particle will continue moving randomly, with the up and down moves being equally likely. Prediction (1) is shown by the horizontal blue line between t=2 and t=4. In general, this prediction is better than EW(t)=0.

Note that for different realized paths, W(2) takes different values. Therefore E(W(t)|W(2)), for t>2, is a random variable. It is a function of the event we condition the expectation on.

Law of iterated expectations

Figure 3. Law of iterated expectations

Suppose you are at time t=2 (see Figure 3). You send many agents to the future t=3 to fetch the information about what will happen. They bring you the data on the means E(W(t)|W(3)) they see (shown by horizontal lines between t=3 and t=4). Since there are many possible future realizations, you have to average the future means. For this, you will use the distributional belief you have at time t=2. The result is E[E(W(t)|W(3))|W(2)]. Since the up and down moves are equally likely, your distribution at time t=2 is symmetric around W(2). Therefore the above average will be equal to E(W(t)|W(2)). This is the Law of Iterated Expectations, also called the tower property:

(2) E[E(W(t)|W(3))|W(2)]=E(W(t)|W(2)).

The knowledge of all of the future predictions E(W(t)|W(3)), upon averaging, does not improve or change our current prediction E(W(t)|W(2)).
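
The "agents sent to the future" argument can be replayed exactly on a discrete model. The sketch below is not Brownian motion: as a stand-in I assume a symmetric random walk with steps +1 and -1 starting at zero, and I enumerate all paths instead of simulating them. The function names are hypothetical.

```python
from itertools import product
from fractions import Fraction

# Discrete stand-in for Brownian motion (an assumption of this sketch):
# a symmetric random walk with steps +1/-1, W(0) = 0, four steps.
PATHS = list(product((-1, 1), repeat=4))    # 16 equally likely paths

def position(path, t):
    """W(t) along the given path."""
    return sum(path[:t])

def cond_mean(f, t, level):
    """E[f(path) | W(t) = level], by averaging over the paths with W(t) = level."""
    values = [Fraction(f(p)) for p in PATHS if position(p, t) == level]
    return sum(values) / len(values)

def w4(p):
    return position(p, 4)

def e_w4_given_w3(p):
    # E(W(4) | W(3)) evaluated on the path p: a random variable, a function of W(3)
    return cond_mean(w4, 3, position(p, 3))

for level in (-2, 0, 2):                              # possible values of W(2)
    iterated = cond_mean(e_w4_given_w3, 2, level)     # E[E(W(4)|W(3)) | W(2)]
    direct = cond_mean(w4, 2, level)                  # E(W(4) | W(2))
    assert iterated == direct == level
    print(level, iterated, direct)
```

For every value of W(2), averaging the future predictions gives back the current prediction, which is W(2) itself.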

For a full mathematical treatment of conditional expectation see Lecture 10 by Gordan Zitkovic.

Sep 18

Portfolio analysis: return on portfolio

Exercise 1. Suppose a portfolio contains n_1 shares of stock 1 whose price is S_1 and n_2 shares of stock 2 whose price is S_2. Stock prices fluctuate and are random variables. Numbers of shares are assumed fixed and are deterministic. What is the expected value of the portfolio?

Solution. The portfolio value is its market price V=n_1S_1+n_2S_2. Since this is a linear combination, the expected value is EV=n_1ES_1+n_2ES_2.

In fact, the portfolio analysis is a little bit different than suggested by Exercise 1. To explain the difference, we start with fixing two points of view.

  1. I hold a portfolio of stocks. I may have inherited it, and it does not matter how much it cost at the moment it was formed. If I want to sell it, I am interested in knowing its market value. In this situation the numbers of shares in my portfolio, which are constant, and the market prices of stocks, which are random, determine the market value of the portfolio, defined in Exercise 1. The value of the portfolio is a linear combination of stock prices.
  2. I have a certain amount of money M^0 to invest. Being a gambler, I am not interested in holding a portfolio forever. I am thinking about buying a portfolio of stocks now and selling it, say, in a year at price M^1. In this case I am interested in the rate of return defined by r=\frac{M^1-M^0}{M^0}. M^0 is considered deterministic (current prices are certain) and M^1 is random (future prices are unpredictable). Thus the rate of return is random.

We pursue the second view (prevalent in finance). As often happens in economics and finance, the result depends on how one looks at things. Suppose the initial amount M^0 is invested in n assets. Denoting by M_i^0 the amount invested in asset i, we have M^0=\sum\limits_{i = 1}^nM_i^0. Denoting by s_i=M_i^0/{M^0} the share (percentage) of M_i^0 in the total investment M^0, we have

(1) M_i^0=s_iM^0,\ M^0=\sum\limits_{i = 1}^ns_iM^0.

The initial shares s_i are deterministic.

Let M_i^1 be what becomes of M_i^0 in one year and let M^1=\sum\limits_{i = 1}^nM_i^1 be the total value of the investment at the end of the year. Since different assets grow at different rates, generally it is not true that M_i^1 =s_iM^1. Denote r_i=\frac{M_i^1-M_i^0}{M_i^0} the rate of return on asset i. Then

(2) M_i^1=(1+r_i)M_i^0, M^1=\sum\limits_{i = 1}^n(1+r_i)M_i^0.

Exercise 2. The rate of return on the portfolio is a linear combination of the rates of return on separate assets, the coefficients being the initial shares of investment.

Solution. Using Equations (1) and (2) we get

(3) r=\frac{M^1-M^0}{M^0}=\frac{\sum(1+r_i)M_i^0-\sum M_i^0}{M^0}=\frac{\sum r_iM_i^0}{M^0}=\frac{\sum r_is_iM^0}{M^0}=\sum s_ir_i .

Once you know this equation you can find the mean and variance of the rate of return on the portfolio in terms of investment shares and rates of return on assets.
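
As a sketch of that calculation: by linearity Er=\sum s_iEr_i, and the variance of a linear combination gives V(r)=\sum_i\sum_js_is_jCov(r_i,r_j). In the Python lines below all numbers (shares, expected returns, covariances) are made up for illustration.

```python
# Hypothetical inputs: initial shares s_i, expected asset returns E r_i, and the
# covariance matrix of asset returns. None of these numbers come from the text.
shares = [0.5, 0.3, 0.2]
mean_returns = [0.08, 0.05, 0.12]
cov = [
    [0.040, 0.006, 0.010],
    [0.006, 0.010, 0.002],
    [0.010, 0.002, 0.090],
]

assert abs(sum(shares) - 1.0) < 1e-12   # the shares add up to one, see (1)

# Er = sum_i s_i E r_i, by linearity of expectation applied to (3)
portfolio_mean = sum(s * m for s, m in zip(shares, mean_returns))

# V(r) = sum_i sum_j s_i s_j Cov(r_i, r_j)
n = len(shares)
portfolio_var = sum(shares[i] * shares[j] * cov[i][j]
                    for i in range(n) for j in range(n))

print(portfolio_mean, portfolio_var)
```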

Sep 18

Applications of the diagonal representation IV

Principal component analysis is a general method based on diagonalization of the variance matrix. We consider it in a financial context. The variance matrix measures riskiness of the portfolio.  We want to see which stocks contribute most to the portfolio risk. The surprise is that the answer is given not in terms of the vector of returns but in terms of its linear transformation.

8. Principal component analysis (PCA)

Let R be a column-vector of returns on p stocks with the variance matrix V(R)=E(R-ER)(R-ER)^{T}. The idea is to find an orthogonal matrix W such that W^{-1}V(R)W=D is a diagonal matrix D=diag[\lambda_1,...,\lambda_p] with \lambda_1\geq...\geq\lambda_p.

With such a matrix, instead of R we can consider its transformation Y=W^{-1}R for which

V(Y)=W^{-1}V(R)(W^{-1})^T=W^{-1}V(R)W=D.

We know that V(Y) has variances V(Y_1),...,V(Y_p) on the main diagonal. It follows that V(Y_i)=\lambda_i for all i. Variance is a measure of riskiness. Thus, the transformed variables Y_1,...,Y_p are put in the order of declining risk. What follows is the realization of this idea using sample data.

In a sampling context, all population means should be replaced by their sample counterparts. Let R^{(t)} be a p\times 1 vector of observations on R at time t. These observations are put side by side into a matrix \mathbb{R}=(R^{(1)},...,R^{(n)}) where n is the number of moments in time. The population mean ER is estimated by the sample mean

\bar{R}=\frac{1}{n}\sum_{t=1}^nR^{(t)}.

The variance matrix V(R) is estimated by

\hat{V}=\frac{1}{n}(\mathbb{R}-\bar{R}l)(\mathbb{R}-\bar{R}l)^T

where l is a 1\times n vector of ones. It is this matrix that is diagonalized: W^{-1}\hat{V}W=D.

In general, the eigenvalues in D are not ordered. Ordering them and at the same time changing places of the rows of W^{-1} correspondingly we get a new orthogonal matrix W_1 (this requires a small proof) such that the eigenvalues in W_1^{-1}\hat{V}W_1=D_1 will be ordered. There is a lot more to say about the method and its applications.
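
Here is a minimal Python sketch of the whole procedure for the special case p=2, where the diagonalization has a closed form (a rotation). For larger p one would rely on a numerical eigenvalue routine instead. The data are invented, and I assume the divisor n in the sample covariance.

```python
import math

# Hypothetical sample: returns on p = 2 stocks at n = 5 moments in time.
# Row i of `returns` holds the observations on stock i (the matrix \mathbb{R}).
returns = [
    [0.02, -0.01, 0.03, 0.00, 0.01],
    [0.01, 0.04, -0.02, 0.03, -0.01],
]
n = len(returns[0])

means = [sum(row) / n for row in returns]
centered = [[r - m for r in row] for row, m in zip(returns, means)]

def sample_cov(u, v):
    # divisor n (an assumption): population means replaced by sample means
    return sum(x * y for x, y in zip(u, v)) / n

a = sample_cov(centered[0], centered[0])
b = sample_cov(centered[0], centered[1])
c = sample_cov(centered[1], centered[1])

# Closed-form diagonalization of the symmetric 2x2 matrix [[a, b], [b, c]].
theta = 0.5 * math.atan2(2 * b, a - c)
W = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]       # orthogonal, so W^{-1} = W^T
half_gap = math.hypot((a - c) / 2, b)
lam1 = (a + c) / 2 + half_gap                  # larger eigenvalue
lam2 = (a + c) / 2 - half_gap                  # smaller eigenvalue

# Transformed (centered) variables Y = W^T R, observation by observation.
Y = [[W[0][k] * centered[0][t] + W[1][k] * centered[1][t] for t in range(n)]
     for k in range(2)]
var_y = [sample_cov(Y[0], Y[0]), sample_cov(Y[1], Y[1])]
cov_y = sample_cov(Y[0], Y[1])
print(lam1, lam2, var_y, cov_y)
```

The variances of Y_1 and Y_2 coincide with the ordered eigenvalues and their covariance vanishes up to rounding, which is what the diagonal form promises.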

May 18

Efficient market hypothesis is subject to interpretation

The formulation on Investopedia seems to me the best:

The efficient market hypothesis (EMH) is an investment theory that states it is impossible to "beat the market" because stock market efficiency causes existing share prices to always incorporate and reflect all relevant information. According to the EMH, stocks always trade at their fair value on stock exchanges, making it impossible for investors to either purchase undervalued stocks or sell stocks for inflated prices. As such, it should be impossible to outperform the overall market through expert stock selection or market timing, and the only way an investor can possibly obtain higher returns is by purchasing riskier investments.

This is not Math, and the EMH interpretation is subjective. My purpose is not to discuss the advantages and drawbacks of various versions of the EMH but to indicate some errors students make on exams.

Best(?) way to answer questions related to EMH

Since there is a lot of talking, the best approach is to use the appropriate key words.

Start with "The EMH states that it is impossible to make economic profit".

Then explain why: The stock market is efficient in the sense that stocks trade at their fair value, so that undervalued or overvalued stocks don't exist.

Then specify that "to obtain economic profits, from the revenues we subtract opportunity (hidden) costs, in addition to direct costs, such as transaction fees". What on the surface seems to be a profitable activity may in fact be balancing at break-even.

Next is to address the specification by Malkiel that the EMH depends on the information set \Omega_t available at time t.

Weak form of EMH. The information set \Omega_t^1 contains only historical values of asset prices, dividends (and possibly volume) up until time t. This is basically what an investor sees on a stock price chart. Many students say "historical information" but fail to mention that it is about prices of financial assets. The birthdays of celebrities are also historical information but they are not in this info set.

Semi-strong form of EMH. The info set \Omega_t^2 is all publicly available information. Some students don't realize that it includes \Omega_t^1. The risk-free rate is in \Omega_t^2 but not in \Omega_t^1 because 1) it is publicly known and 2) it is not traded (it is fixed by the central bank for extended periods of time).

Strong form of EMH. The info set \Omega_t^3 includes all publicly available info plus private company information. Firstly, this info set includes the previous two: \Omega_t^1\subset\Omega_t^2\subset\Omega_t^3. Secondly, whether a certain piece of information belongs to \Omega_t^2 or \Omega_t^3 depends on time. For example, the number of shares of a stock that Warren Buffett purchased today is in \Omega_t^3 but over time it becomes a part of \Omega_t^2 because large holdings must be reported within 45 days of the end of a calendar quarter. If there are nuances like this you have to explain them.

Implications for time series analysis

Conditional expectation is a relatively complex mathematical construct. The simplest definition is accessible to basic statistics students. The mid-level definition in case of conditioning on a set of positive probability already raises questions about practical calculation. The most general definition is based on Radon-Nikodym  derivatives. Moreover, nobody knows exactly any of those \Omega_t. So how do you apply time series models which depend so heavily on conditioning? The answer is simple: since by the EMH the stock price "reflects all relevant information", that price is already conditioned on that information, and you don't need to worry about theoretical complexities of conditioning in applications.

May 18

The Newey-West estimator: uncorrelated and correlated data

I hate long posts but here we by necessity have to go through all ideas and calculations, to understand what is going on. One page of formulas in A. Patton's guide to FN3142 Quantitative Finance in my rendition becomes three posts.

Preliminaries and autocovariance function

Let X_1,...,X_n be random variables. We need to recall that the variance of the vector X=(X_1,...,X_n)^T is

(1) V(X)=\left(\begin{array}{cccc}V(X_1)&Cov(X_1,X_2)&...&Cov(X_1,X_n)\\Cov(X_2,X_1)&V(X_2)&...&Cov(X_2,X_n)\\...&...&...&...\\Cov(X_n,X_1)&Cov(X_n,X_2)&...&V(X_n)\end{array}\right).

With the help of this matrix we derived two expressions for variance of a linear combination:

(2) V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i)

for uncorrelated variables and

(3) V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i)+2\sum_{i=1}^{n-1}\sum_{j=i+1}^na_ia_jCov(X_i,X_j)

when there is autocorrelation.

In a time series context X_1,...,X_n are observations along time. 1,...,n stand for moments in time and the sequence X_1,...,X_n is called a time series. We need to recall the definition of a stationary process. Of that definition, we will use only the part about covariances: Cov(X_i,X_j) depends only on the distance |i-j| between the time moments i,j. For example, in the top right corner of (1) we have Cov(X_1,X_n), which depends only on n-1.

Preamble. Let X_1,...,X_n be a stationary time series. Firstly, Cov(X_i,X_{i+k}) depends only on k. Secondly, for all integer k=0,\pm 1,\pm 2,... denoting j=i+k we have

(4) Cov(X_i,X_{i+k})=Cov(X_{j-k},X_j)=Cov(X_j,X_{j-k}).

Definition. The autocovariance function is defined by

(5) \gamma_k=Cov(X_i,X_{k+i}) for all integer k=0,\pm 1,\pm 2,...

In particular,

(6) \gamma_0=Cov(X_i,X_i)=V(X_i) for all i.

The preamble shows that definition (5) is correct (the right side in (5) depends only on k and not on i). Because of (4) we have symmetry \gamma_{-k}=\gamma _k, so negative k can be excluded from consideration.

With (5) and (6) for a stationary series (1) becomes

(7) V(X)=\left(\begin{array}{cccc}\gamma_0&\gamma_1&...&\gamma_{n-1}\\ \gamma_1&\gamma_0&...&\gamma_{n-2}\\...&...&...&...\\  \gamma_{n-1}&\gamma_{n-2}&...&\gamma_0\end{array}\right).

Estimating variance of a sample mean

Uncorrelated observations. Suppose X_1,...,X_n are uncorrelated observations from the same population with variance \sigma^2. From (2) we get

(8) V\left(\frac{1}{n}\sum_{i=1}^nX_i\right) =\frac{1}{n^2}\sum_{i=1}^nV(X_i)=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}.

This is a theoretical relationship. To actually obtain an estimator of the variance of the sample mean, we need to replace \sigma^2 by some estimator. It is known that

(9) s^2=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2

consistently estimates \sigma^2. Plugging it in (8) we see that variance of the sample mean is consistently estimated by

\hat{V}(\bar{X})=\frac{s^2}{n}=\frac{1}{n^2}\sum_{i=1}^n(X_i-\bar{X})^2.

This is the estimator derived on p.151 of Patton's guide.

Correlated observations. In this case we use (3):

V\left( \frac{1}{n}\sum_{i=1}^nX_i\right) =\frac{1}{n^{2}}\left[\sum_{i=1}^nV(X_i)+2\sum_{i=1}^{n-1}\sum_{j=i+1}^nCov(X_i,X_j)\right].

Here visualization comes in handy. The sums in the square brackets include all terms on the main diagonal of (7) and above it. That is, we have n copies of \gamma_0, n-1 copies of \gamma_{1},..., 2 copies of \gamma _{n-2} and 1 copy of \gamma _{n-1}. The sum in the brackets is

\sum_{i=1}^nV(X_i)+2\sum_{i=1}^{n-1}\sum_{j=i+1}^nCov(X_i,X_j)=n\gamma_0+2[(n-1)\gamma_1+...+2\gamma_{n-2}+\gamma _{n-1}]=n\gamma  _0+2\sum_{k=1}^{n-1}(n-k)\gamma_k.

Thus we obtain the first equation on p.152 of Patton's guide (it's up to you to match the notation):

(10) V(\bar{X})=\frac{1}{n}\gamma_0+\frac{2}{n}\sum_{k=1}^{n-1}(1-\frac{k}{n})\gamma_k.

As above, this is just a theoretical relationship. \gamma_0=V(X_i)=\sigma^2 is estimated by (9). Ideally, the estimator of \gamma_k=Cov(X_i,X_{k+i}) is obtained by replacing all population means by sample means:

(11) \hat{\gamma}_k=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})(X_{k+i}-\bar{X}).

There are two problems with this estimator, though. The first problem is that when i runs from 1 to n, k+i runs from k+1 to k+n. To exclude out-of-sample values, the summation in (11) is reduced:

(12) \hat{\gamma}_k=\frac{1}{n-k}\sum_{i=1}^{n-k}(X_i-\bar{X})(X_{k+i}-\bar{X}).

The second problem is that the sum in (12) becomes too small when k is close to n. For example, for k=n-1 (12) contains just one term (there is no averaging). Therefore the upper limit of summation n-1 in (10) is replaced by some function M(n) that tends to infinity slower than n. The result is the estimator

\hat{V}(\bar{X})=\frac{1}{n}\hat{\gamma}_0+\frac{2}{n}\sum_{k=1}^{M(n)}\left(1-\frac{k}{M(n)}\right)\hat{\gamma}_k,

where \hat{\gamma}_0 is given by (9) and \hat{\gamma}_k is given by (12). This is almost the Newey-West estimator from p.152. The only difference is that instead of \frac{k}{M(n)} they use \frac{k}{M(n)+1}, and I have no idea why. One explanation is that for low n, M(n) can be zero, so they just wanted to avoid division by zero.
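
The estimator is short enough to code directly. The Python sketch below is my own illustration, not code from the guide: the function names autocov and var_of_mean, the flag plus_one and the data are all hypothetical, and the truncation lag M (playing the role of M(n)) is passed by hand. The flag switches between the weights 1-k/(M+1) and 1-k/M.

```python
def autocov(x, k):
    """Estimator (12): sample autocovariance at lag k, averaged over n - k terms.

    For k = 0 this is s^2 from (9).
    """
    n = len(x)
    xbar = sum(x) / n
    return sum((x[i] - xbar) * (x[i + k] - xbar) for i in range(n - k)) / (n - k)

def var_of_mean(x, M, plus_one=True):
    """Estimate V(X-bar) using autocovariances up to the truncation lag M.

    plus_one=True uses the weights 1 - k/(M+1); plus_one=False uses 1 - k/M.
    """
    n = len(x)
    denom = M + 1 if plus_one else M
    total = autocov(x, 0)
    for k in range(1, M + 1):
        total += 2 * (1 - k / denom) * autocov(x, k)
    return total / n

# Hypothetical data; M = 2 is an arbitrary choice for illustration.
x = [1.2, 0.8, 1.5, 1.1, 0.7, 1.3, 1.6, 0.9]
print(var_of_mean(x, M=0))    # M = 0 reduces to s^2 / n from the uncorrelated case
print(var_of_mean(x, M=2))
```

Note that with the weights 1-k/M the last autocovariance \hat{\gamma}_M enters with weight zero, while with 1-k/(M+1) all M autocovariances contribute.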

May 18

Different faces of vector variance: again visualization helps

In the previous post we defined variance of a column vector X with n components by

V(X)=E(X-EX)(X-EX)^T.

In terms of elements this is the same as:

(1) V(X)=\left(\begin{array}{cccc}V(X_1)&Cov(X_1,X_2)&...&Cov(X_1,X_n)\\Cov(X_2,X_1)&V(X_2)&...&Cov(X_2,X_n)\\...&...&...&...\\Cov(X_n,X_1)&Cov(X_n,X_2)&...&V(X_n)\end{array}\right).

So why is knowing the structure of this matrix so important?

Let X_1,...,X_n be random variables and let a_1,...,a_n be numbers. In the derivation of the variance of the slope estimator for simple regression we have to deal with the expression of type

(2) V\left(\sum_{i=1}^na_iX_i\right).

Question 1. How do you multiply a sum by a sum? I mean, how do you use summation signs to find the product \left(\sum_{i=1}^na_i\right)\left(\sum_{i=1}^nb_i\right)?

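Answer 1. Whenever you have problems with summation signs, try to do without them. The product

(a_1+...+a_n)(b_1+...+b_n)

should contain ALL products a_ib_j. Again, a matrix visualization will help:

\left(\begin{array}{ccc}a_1b_1&...&a_1b_n\\...&...&...\\a_nb_1&...&a_nb_n\end{array}\right).
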
The product we are looking for should contain all elements of this matrix. So the answer is

(3) \left(\sum_{i=1}^na_i\right)\left(\sum_{i=1}^nb_i\right)=\sum_{i=1}^n\sum_{j=1}^na_ib_j.

Formally, we can write \sum_{i=1}^nb_i=\sum_{j=1}^nb_j (the sum does not depend on the index of summation, this is another point many students don't understand) and then perform the multiplication in (3).

Question 2. What is the expression for (2) in terms of covariances of components?

Answer 2. If you understand Answer 1 and know the relationship between variances and covariances, it should be clear that

(4) V\left(\sum_{i=1}^na_iX_i\right)=Cov(\sum_{i=1}^na_iX_i,\sum_{i=1}^na_iX_i)

=\sum_{i=1}^n\sum_{j=1}^na_ia_jCov(X_i,X_j).

Question 3. In light of (1), separate variances from covariances in (4).

Answer 3. When i=j, we have Cov(X_i,X_j)=V(X_i), which are diagonal elements of (1). Otherwise, for i\neq j we get off-diagonal elements of (1). So the answer is

(5) V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i)+\sum_{i\neq j}a_ia_jCov(X_i,X_j).

Once again, in the first sum on the right we have only variances. In the second sum, the indices i,j are assumed to run from 1 to n, excluding the diagonal i=j.

Corollary. If X_{i} are uncorrelated, then the second sum in (5) disappears:

(6) V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i).

This fact has been used (with a slightly different explanation) in the derivation of the variance of the slope estimator for simple regression.

Question 4. Note that the matrix (1) is symmetric (elements above the main diagonal equal their mirror siblings below that diagonal). This means that some terms in the second sum on the right of (5) are repeated twice. If you group equal terms in (5), what do you get?

Answer 4. The idea is to write

a_ia_jCov(X_i,X_j)+a_ja_iCov(X_j,X_i)=2a_ia_jCov(X_i,X_j),

that is, to join equal elements above and below the main diagonal in (1). For this, you need to figure out how to write a sum of the elements that are above the main diagonal. Make a bigger version of (1) (with more off-diagonal elements) to see that the elements that are above the main diagonal are listed in the sum \sum_{i=1}^{n-1}\sum_{j=i+1}^n. This sum can also be written as \sum_{1\leq i<j\leq n}. Hence, (5) is the same as

(7) V\left(\sum_{i=1}^na_iX_i\right)=\sum_{i=1}^na_i^2V(X_i)+2\sum_{i=1}^{n-1}\sum_{j=i+1}^na_ia_jCov(X_i,X_j)

=\sum_{i=1}^na_i^2V(X_i)+2\sum_{1\leq i<j\leq n}a_ia_jCov(X_i,X_j).

Unlike (6), this equation is applicable when there is autocorrelation.
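
If the index gymnastics still feels shaky, check it on numbers. In the Python sketch below the covariance matrix V and the coefficients a are invented; the point is only that the full double sum over all pairs (i,j) and the grouped formula (7) give the same value.

```python
# Hypothetical covariance matrix of (X_1, X_2, X_3) and coefficients a_i.
V = [
    [4.0, 1.0, -0.5],
    [1.0, 2.0, 0.3],
    [-0.5, 0.3, 1.0],
]
a = [2.0, -1.0, 3.0]
n = len(a)

# Double sum over all pairs (i, j): every element of the matrix is used once.
double_sum = sum(a[i] * a[j] * V[i][j] for i in range(n) for j in range(n))

# Formula (7): variances plus twice the terms above the main diagonal.
variances = sum(a[i] ** 2 * V[i][i] for i in range(n))
upper = sum(a[i] * a[j] * V[i][j]
            for i in range(n - 1) for j in range(i + 1, n))
grouped = variances + 2 * upper

print(double_sum, grouped)
```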

May 18

Variance of a vector: motivation and visualization

I always show my students the definition of the variance of a vector, and they usually don't pay attention. You need to know what it is, already at the level of simple regression (to understand the derivation of the slope estimator variance), and even more so when you deal with time series. Since I know exactly where students usually stumble, this post is structured as a series of questions and answers.

Think about ideas: how would you define variance of a vector?

Question 1. We know that for a random variable X, its variance is defined by

(1) V(X)=E(X-EX)^{2}.

Now let

X=\left(\begin{array}{c}X_1\\...\\X_n\end{array}\right)

be a vector with n components, each of which is a random variable. How would you define its variance?

The answer is not straightforward because we don't know how to square a vector. Let X^T=(\begin{array}{ccc}X_1& ...&X_n\end{array}) denote the transposed vector. There are two ways to multiply a vector by itself: X^TX and XX^T.

Question 2. Find the dimensions of X^TX and XX^T and their expressions in terms of coordinates of X.

Answer 2. For a product of matrices there is a compatibility rule that I write in the form

(2) A_{n\times m}B_{m\times k}=C_{n\times k}.

Recall that n\times m in the notation A_{n\times m} means that the matrix A has n rows and m columns. For example, X is of size n\times 1. Verbally, the above rule says that the number of columns of A should be equal to the number of rows of B. In the product that common number m disappears and the unique numbers (n and k) give, respectively, the number of rows and columns of C. Isn't the formula easier to remember than the verbal statement? From (2) we see that X_{1\times n}^TX_{n\times 1} is of dimension 1 (it is a scalar) and X_{n\times 1}X_{1\times n}^T is an n\times n matrix.

For actual multiplication of matrices I use the visualization

(3) \left(\begin{array}{ccccc}&&&&\\&&&&\\a_{i1}&a_{i2}&...&a_{i,m-1}&a_{im}\\&&&&\\&&&&\end{array}\right) \left(\begin{array}{ccccc}&&b_{1j}&&\\&&b_{2j}&&\\&&...&&\\&&b_{m-1,j}&&\\&&b_{mj}&&\end{array}\right) =\left(  \begin{array}{ccccc}&&&&\\&&&&\\&&c_{ij}&&\\&&&&\\&&&&\end{array}\right)

Short formulation. Multiply rows from the first matrix by columns from the second one.

Long Formulation. To find the element c_{ij} of C, we find a scalar product of the ith row of A and jth column of B: c_{ij}=a_{i1}b_{1j}+a_{i2}b_{2j}+... To find all elements in the ith row of C, we fix the ith row in A and move right the columns in B. Alternatively, to find all elements in the jth column of C, we fix the jth column in B and move down the rows in A. Using this rule, we have

(4) X^TX=X_1^2+...+X_n^2, XX^T=\left(\begin{array}{ccc}X_1^2&...&X_1X_n\\...&...&...\\X_nX_1&...&X_n^2  \end{array}\right).

Usually students have problems with the second equation.
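
A tiny Python illustration of (4) may help with the second equation. The numbers stand in for realized values of X_1, X_2, X_3 and are arbitrary.

```python
# A vector with n = 3 components (numbers stand in for realized values of X_1, X_2, X_3).
x = [1.0, 2.0, 3.0]
n = len(x)

# X^T X: (1 x n) times (n x 1) gives a scalar, the sum of squares.
inner = sum(xi * xi for xi in x)

# X X^T: (n x 1) times (1 x n) gives an n x n matrix with elements X_i X_j.
outer = [[x[i] * x[j] for j in range(n)] for i in range(n)]

# The sum of the diagonal elements of X X^T equals X^T X.
diagonal_sum = sum(outer[i][i] for i in range(n))

print(inner, diagonal_sum)
for row in outer:
    print(row)
```

The matrix XX^T keeps all cross products X_iX_j, while X^TX keeps only the squares on its diagonal, summed.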

Based on (1) and (4), we have two candidates to define variance:

(5) V(X)=E(X-EX)^T(X-EX)

and

(6) V(X)=E(X-EX)(X-EX)^T.

Answer 1. The second definition contains more information, in the sense to be explained below, so we define variance of a vector by (6).

Question 3. Find the elements of this matrix.

Answer 3. Variance of a vector has variances of its components on the main diagonal and covariances outside it:

(7) V(X)=\left(\begin{array}{cccc}V(X_1)&Cov(X_1,X_2)&...&Cov(X_1,X_n)\\Cov(X_2,X_1)&V(X_2)&...&Cov(X_2,X_n)\\...&...&...&...\\Cov(X_n,X_1)&Cov(X_n,X_2)&...&V(X_n)\end{array}\right).

If you can't get this on your own, go back to Answer 2.

There is a matrix operation called trace and denoted tr. It is defined only for square matrices and gives the sum of diagonal elements of a matrix.

Exercise 1. Show that tr(V(X))=E(X-EX)^T(X-EX). In this sense definition (6) is more informative than (5).

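Exercise 2. Show that if EX_1=...=EX_n=0, then (7) becomes

V(X)=\left(\begin{array}{cccc}EX_1^2&EX_1X_2&...&EX_1X_n\\EX_2X_1&EX_2^2&...&EX_2X_n\\...&...&...&...\\EX_nX_1&EX_nX_2&...&EX_n^2\end{array}\right)=EXX^T.
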

May 18

Solution to Question 2 from UoL exam 2016, zone A

Solution to Question 2 from UoL exam 2016, Zone A (FN3142)

This is another difficult question, and I don't think it will appear again in its entirety. However, the ideas applied here and here are worth repeating and will surely be on future UoL exams in some form.

Problem statement

You hold two different corporate bonds (bond A and bond B), each with a face value of $1 mln. The issuing firms have a 2% probability of defaulting on the bonds, and both the default events and the recovery values are independent of each other. Without default, the notional value is repaid, while in case of default, the recovery value is uniformly distributed between 0 and the notional value.
(a1) (20 points) Find the 1% VaR for bond A or bond B and report your calculations. (HINT: you will need the definition and properties of a uniform distribution U[a,b] on [a,b].)
(a2) (60 points) Taking into account the formula for the distribution function of the sum of two independent uniformly distributed random variables on [a,b] explain how you would find the 1% VaR for a portfolio combining the two bonds (A + B). Report your calculations without finding the actual value.
(b) (20 points) The 1% expected shortfall is defined as the expected loss given that the loss exceeds the 1% VaR. Find the 1% expected shortfall for bond A and report your calculations.

I made the statement shorter by referencing my post on the uniform distribution. The formulas for the triangular distribution given in the exam and examiner's comments are both wrong.


As with previous solutions related to VaR, I emphasize general methods that reduce the amount of guessing to a minimum. Besides, I work with the definition that gives a negative VaR in percents (that is, I work with returns) and then translate that value to positive dollar amounts. Translating the problem statement to returns: without default, the return on each bond is zero, while in case of default the return is uniformly distributed between -1 and 0 (-1 corresponds to -100%). The recovery value is the notional value minus the dollar loss.

The method consists of two steps. Step 1. Derive the distribution function. Step 2. Use the usual or generalized inverse to find the VaR.

(a1) R_A denotes the return on A, D_A is the event that A defaults and \bar{D}_A is the complement event of no default. Similar notation is employed for B. By the law of total probability for any real x

(1) F_{R_A}(x)=P(R_A\le x)=P(R_A\le x|D_A)P(D_A)+P(R_A\le x|\bar{D}_A)P(\bar{D}_A).

We want to show that for our purposes the second term can be made zero. We know that without default, return is zero. This means that conditionally we deal with a random variable that takes value 0 with probability 1. Hence, conditional on no default, by additivity of conditional probability for x\ge 0

(2) P(R_A\le x|\bar{D}_A)=P(R_A<0|\bar{D}_A)+P(R_A=0|\bar{D}_A)+P(0<R_A\le x|\bar{D}_A)=1

because the middle term on the right is 1 and the other two are zero. It follows that for x\ge 0 the second term in (1) is

(3) P(R_A\le x|\bar{D}_A)P(\bar{D}_A)=1\times 0.98>0.01.

To find the VaR, we need the whole expression in (1) to be 0.01, so we have to consider only x<0, in which case the second term on the right of (1) is zero.

In the first term on the right of (1), conditional on default R_A is uniformly distributed on [-1,0], so P(R_A\le x|D_A)=x+1 for -1<x<0. From this distribution function it's clear that we need to consider only -1<x<0:

F_{R_A}(x)=P(R_A\le x|D_A)P(D_A)=(x+1)0.02.

As this should be equal to 0.01, we obtain x=-0.5. The loss is $1 mln times 50% and the recovery value is half a million.
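
The two steps of the method (derive the distribution function, then invert it) can be mirrored in a few lines of Python. This is only a sketch: the function names are hypothetical and the inversion is done by bisection on (-1,0), where the distribution function is increasing.

```python
def cdf_return_a(x, p_default=0.02):
    """Distribution function F_{R_A}(x) of the return on bond A, from (1)."""
    # Conditional on default the return is U[-1, 0].
    if x < -1:
        given_default = 0.0
    elif x < 0:
        given_default = x + 1
    else:
        given_default = 1.0
    # Without default the return is 0 with probability 1.
    given_no_default = 1.0 if x >= 0 else 0.0
    return given_default * p_default + given_no_default * (1 - p_default)

def var_level(alpha, lo=-1.0, hi=0.0, tol=1e-12):
    """Solve F(x) = alpha on (lo, hi) by bisection (F is increasing there)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf_return_a(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

var_1pct = var_level(0.01)
print(var_1pct)                  # return at the 1% level
print(-var_1pct * 1_000_000)     # dollar loss on a $1 mln face value
```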

(a2) Denote R=R_A+R_B. This allows us to use the formula for the convolution of two independent uniform distributions. We should remember that this is not a return on the total portfolio because it takes values between -2 and 0. By the  law of total probability for any real x

(4) F_R(x)=P(R\le x|D_A\cap D_B)P(D_A\cap D_B)+P(R\le x|D_A\cap \bar{D}_B)P(D_A\cap \bar{D}_B)

+P(R\le x|\bar{D}_A\cap D_B)P(\bar{D}_A\cap D_B)+P(R\le x|\bar{D}_A\cap \bar{D}_B)P(\bar{D}_A\cap \bar{D}_B).

As above, we can show that the last term in (4) is large if x\ge 0. Indeed, if no company defaults, the return is zero and, as in (2), for such x

P(R\le x|\bar{D}_A\cap \bar{D}_B)=1.

Therefore, as in (3)

P(R\le x|\bar{D}_A\cap \bar{D}_B)P(\bar{D}_A\cap \bar{D}_B)=1\times 0.98^2=0.9604>0.01.

It follows that we can restrict our attention to x<0, in which case the last term on the right of (4) is zero, the second and third terms are the same and we get

(5) F_R(x)=P(R\le x|D_A\cap D_B)0.02^2+2P(R\le x|D_A\cap \bar{D}_B)0.02\cdot 0.98.

In case of two defaults, R_A,\ R_B are independent and distributed as U[-1,0]. Hence, R has a triangular distribution. Denoting its density f_T, we have

(6) P(R\le x|D_A\cap D_B)0.02^2=0.0004\int_{-2}^xf_T(t)dt,\ -2<x<0.

The integral here is

(7) \int_{-2}^xf_T(t)dt=\left\{\begin{array}{ll}x^2/2+2x+2, &-2<x<-1\\1-x^2/2, &-1\le x<0\end{array}\right.

In case of one default, we use the uniform distribution to find

(8) 2P(R\le x|D_A\cap \bar{D}_B)0.02\cdot 0.98=\left\{\begin{array}{ll}0.0392(x+1),&-1\le x<0\\0,&x<-1\end{array}\right..

Plugging (6), (7), (8) in (5) and equating the result to 0.01, we get

(9) 0.0004(x^2/2+2x+2)=0.01, -2<x<-1

and

(10) 0.0004(1-x^2/2)+0.0392(x+1)=0.01, -1\le x<0.

Mathematica gives two solutions for (9), neither of which is in the interval (-2,-1). Of the two solutions of (10), the one which belongs to (-1,0) is approximately -0.75. This is 37.5% of -2 and the loss is about $750,000 out of $2 mln. The exam question doesn't require you to give numerical values.
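
If Mathematica is not at hand, both equations are ordinary quadratics and can be solved with a few lines of Python. The helper real_roots is a hypothetical name; the coefficients are obtained by expanding (9) and (10).

```python
import math

def real_roots(a, b, c):
    """Real roots of a x^2 + b x + c = 0 (a != 0), in increasing order."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])

# (9): 0.0004 (x^2/2 + 2x + 2) = 0.01, to be solved on -2 < x < -1
roots_9 = real_roots(0.0002, 0.0008, 0.0008 - 0.01)
# (10): 0.0004 (1 - x^2/2) + 0.0392 (x + 1) = 0.01, to be solved on -1 <= x < 0
roots_10 = real_roots(-0.0002, 0.0392, 0.0004 + 0.0392 - 0.01)

in_range_9 = [x for x in roots_9 if -2 < x < -1]
in_range_10 = [x for x in roots_10 if -1 <= x < 0]
print(roots_9, in_range_9)
print(roots_10, in_range_10)
```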

(b) From the general definition of expected shortfall

ES^{0.01}=\frac{ER_A1_{R_A\le -0.5}}{P(R_A\le -0.5)}=\frac{\int_{-1}^{-0.5}tdt}{\int_{-1}^{-0.5}dt}=\frac{-3/8}{1/2}=-0.75

which means an expected loss of $750,000.
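
The ratio of integrals can be double-checked numerically. The Python sketch below uses the midpoint rule, which is my choice of method; the default probability 0.02 multiplies both integrals and cancels, so it does not appear.

```python
# Midpoint-rule approximation of the two integrals in the ratio above.
# The default probability 0.02 multiplies both of them and cancels.
N = 100_000
a, b = -1.0, -0.5
h = (b - a) / N
midpoints = [a + (i + 0.5) * h for i in range(N)]

numerator = sum(t * h for t in midpoints)      # integral of t dt over [-1, -0.5]
denominator = sum(h for _ in midpoints)        # integral of dt over [-1, -0.5]
expected_shortfall = numerator / denominator
print(expected_shortfall)
```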

May 18

Law of total probability - you could have invented this

A knight wants to kill (event K) a dragon. There are two ways to do this: by fighting (event F) the dragon or by outwitting (O) it. The choice of the way (F or O) is random, and in each case the outcome (K or not K) is also random. For the probability of killing there is a simple, intuitive formula:

P(K)=P(K|F)P(F)+P(K|O)P(O).

Its derivation is straightforward from the definition of conditional probability: since F and O cover the whole sample space and are disjoint, we have by additivity of probability

P(K)=P(K\cap(F\cup O))=P(K\cap F)+P(K\cap O)=\frac{P(K\cap F)}{P(F)}P(F)+\frac{P(K\cap O)}{P(O)}P(O)

=P(K|F)P(F)+P(K|O)P(O).

This is easy to generalize to the case of many conditioning events. Suppose A_1,...,A_n are mutually exclusive (that is, disjoint) and collectively exhaustive (that is, cover the whole sample space). Then for any event B one has

P(B)=P(B|A_1)P(A_1)+...+P(B|A_n)P(A_n).

This equation is called the law of total probability.

Application to a sum of continuous and discrete random variables

Let X,Y be independent random variables. Suppose that X is continuous, with a distribution function F_X, and suppose Y is discrete, with values y_1,...,y_n. Then for the distribution function of the sum F_{X+Y} we have

F_{X+Y}(t)=P(X+Y\le t)=\sum_{j=1}^nP(X+Y\le t|Y=y_j)P(Y=y_j)

(by independence conditioning on Y=y_j can be omitted)

=\sum_{j=1}^nP(X\le t-y_j)P(Y=y_j)=\sum_{j=1}^nF_X(t-y_j)P(Y=y_j).

Compare this to the much more complex derivation in case of two continuous variables.
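
The last formula is ready for computation. In the Python sketch below I assume, purely for illustration, that X is standard normal and that Y takes three values with made-up probabilities; the function names are hypothetical.

```python
import math

def normal_cdf(t):
    """Distribution function of a standard normal variable (the assumed X)."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

# Hypothetical discrete Y: values y_j with probabilities P(Y = y_j).
y_values = [-1.0, 0.0, 2.0]
y_probs = [0.2, 0.5, 0.3]

def cdf_sum(t):
    """F_{X+Y}(t) = sum_j F_X(t - y_j) P(Y = y_j)."""
    return sum(normal_cdf(t - y) * p for y, p in zip(y_values, y_probs))

for t in (-10.0, 0.0, 1.0, 10.0):
    print(t, cdf_sum(t))
```

The result is a mixture of shifted copies of F_X, with the probabilities of Y as weights.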