Let denote the return on From Table 1 we can derive the probabilities table for this return:

Table 2. Joint table of returns on separate portfolios

From Table 2 we conclude that the return on the combined portfolio looks as follows:

Table 3. Total return

Table 3 shows that

for

for

for and

for

Try to follow the procedure used in Post 1 and you will see that

for

for

for

for and

for

This implies In statistics, we always have to watch if the numbers we get make sense. The last number doesn't and in fact leads to a contradiction: This is because the quasi-inverse notion has nothing to do with probabilities. With a more realistic return, the VaR should be negative for small values of

(b) The subadditivity definition requires amounts opposite in sign to ours. That is, we define from and then say that VaR thus defined is sub-additive if We have been using the definition It's easy to see that Thus, in our case we have which is not smaller than Sub-additivity does not hold in this example. Absence of sub-additivity means that riskiness of the whole portfolio, as measured by VaR, may exceed riskiness of the sum of the portfolio parts.

(c) The problem uses the definition of the expected shortfall that yields positive values. I use everywhere the definition that gives negative values: Since the setup is static, this is the same as By definition, so .

In Post 1 we found that for each of The condition places no restriction on so from Table 1

As a result,

Since from Table 3

Therefore Converting everything to positive values, we have so that sub-additivity holds.

The returns in percentages can be easily converted to those in dollars.

There is a hidden mine in this question, and it is caused by discreteness of the distribution function. We had a lively discussion of this oddity in my class. The answer will be given in two posts.

Question. Two corporations each have a 4% chance of going bankrupt and the event that one of the two companies will go bankrupt is independent of the event that the other company will go bankrupt. Each company has outstanding bonds. A bond from any of the two companies will return if the corporation does not go bankrupt, and if it goes bankrupt you lose the face value of the investment, i.e., . Suppose an investor buys $1000 worth of bonds of the first corporation, which is then called portfolio , and similarly, an investor buys $1000 worth of bonds of the second corporation, which is then called portfolio .

(a) [40 marks] Calculate the VaR at critical level for each portfolio and for the joint portfolio .

(b) [30 marks] Is VaR sub-additive in this example? Explain why the absence of sub-additivity may be a concern for risk managers.

(c) [30 marks] The expected shortfall at the critical level can be defined as . Calculate the expected shortfall for the portfolios , and . Is this risk measure sub-additive?

Solution. a) The return on each portfolio is a binary variable described by Table 1:

Table 1. Return on separate portfolios

Return (%)    Prob
0             0.96
-100          0.04

Therefore the distribution function of the return is a piece-wise constant function equal to $0$ for $r<-100,$ to $0.04$ for $-100\le r<0$ and to $1$ for $r\ge 0,$ see Example 3. For instance, if $-100\le r<0,$ we can write $F_R(r)=P(R\le r)=P(R=-100)=0.04.$

Diagram 1. Return distribution function

Since this function is not one-to-one, its usual inverse does not exist and we have to use the quasi-inverse, see Answer 15. As with the distribution function, we need to look at different cases.

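If $\alpha>1,$ drawing a horizontal line at height $\alpha$ we see that the set $\{r:F_R(r)\ge\alpha\}$ is empty and the infimum of an empty set is by definition $+\infty$ (can you guess why?)

For any $0.04<\alpha\le 1$ we have $\{r:F_R(r)\ge\alpha\}=[0,\infty),$ so $F_R^{-1}(\alpha)=0.$

Next, for $0<\alpha\le 0.04$ we have $\{r:F_R(r)\ge\alpha\}=[-100,\infty)$ and $F_R^{-1}(\alpha)=-100.$

Finally, if $\alpha\le 0,$ we get $\{r:F_R(r)\ge\alpha\}=(-\infty,\infty)$ and $F_R^{-1}(\alpha)=-\infty.$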

Diagram 2. Quasi-inverse function

The resulting function is bad in two ways. Firstly, it takes infinite values. In applications to VaR this should not concern us because the bad values occur in the ranges $\alpha\le 0$ and $\alpha>1.$

Secondly, for practically interesting values of the graph of has flat pieces, which may be problematic. By definition, is the solution to the equation This means that we should have When we plug this value in we are supposed to get However, here we don't, in general. For example, while

This happens because the usual inverse does not exist. When the usual inverse exists, we have two identities and Here both are violated.
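To see the quasi-inverse at work, here is a minimal Python sketch for the step distribution function of Table 1. The names `JUMPS`, `cdf` and `quasi_inverse` are made up for this illustration, and the quasi-inverse is taken to be the infimum of the set of points where the distribution function reaches the given level, as in Answer 15. The loop at the end shows that plugging the quasi-inverse back into the distribution function does not, in general, return the level we started with.

```python
import math

# Step distribution function of the single-bond return from Table 1:
# the return is -100 with probability 0.04 and 0 with probability 0.96.
# Each pair is (jump point, value of F from that point on).
JUMPS = [(-100.0, 0.04), (0.0, 1.0)]


def cdf(r):
    """Right-continuous step function F(r) = P(R <= r)."""
    value = 0.0
    for point, level in JUMPS:
        if r >= point:
            value = level
    return value


def quasi_inverse(alpha):
    """inf{r : F(r) >= alpha}; the infimum of the empty set is +infinity."""
    if alpha <= 0:
        return -math.inf      # every r satisfies F(r) >= alpha
    for point, level in JUMPS:
        if level >= alpha:
            return point      # the smallest point where F reaches alpha
    return math.inf           # alpha > 1: no r satisfies F(r) >= alpha


for alpha in (0.01, 0.04, 0.05, 0.5, 1.0):
    q = quasi_inverse(alpha)
    # The third column equals alpha only when alpha is one of the levels of F.
    print(alpha, q, cdf(q))
```

The flat pieces of the quasi-inverse are visible in the output: all levels up to 0.04 are mapped to -100 and all levels above 0.04 (up to 1) are mapped to 0.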

Now the definition of VaR gives for each portfolio.

Solution to Question 3b) from UoL exam 2018, Zone A

I thought that after all the work we've done with my students the answer to this question would be obvious. It was not, so I am sharing it.

Question. Consider a position consisting of a $20,000 investment in asset $A$ and a $20,000 investment in asset $B$. Assume that returns on these two assets are i.i.d. normal with mean zero, that the daily volatilities of both assets are 3%, and that the correlation coefficient between their returns is 0.4. What is the 10-day VaR at the 1% critical level for the portfolio?

Solution. First we have to work with returns and then translate the result into dollars.

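Let $R_A,R_B$ be the daily returns on the two assets. We are given that $ER_A=ER_B=0,$ $\sigma(R_A)=\sigma(R_B)=0.03,$ $\rho(R_A,R_B)=0.4.$

Since the total investment is $40,000, the shares of the investment are $s_A=s_B=0.5.$ Therefore the daily return on the portfolio is $R=0.5R_A+0.5R_B,$ see Exercise 2.

It follows that $ER=0$ and $Var(R)=0.25\cdot 0.03^2+0.25\cdot 0.03^2+2\cdot 0.25\cdot 0.4\cdot 0.03\cdot 0.03=0.00063.$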

These figures are for daily returns. We need to make sure that is normally distributed. The sufficient condition for this is that the returns are jointly normally distributed. It is not mentioned in the problem statement, and we have to assume that it is satisfied.

Let $R_t$ denote the return on day $t.$ Under continuous compounding the daily returns are summed: if we invest $M_0$ initially, after the first day we have $M_0e^{R_1},$ after the second day we have $M_0e^{R_1}e^{R_2}=M_0e^{R_1+R_2}$ and so on. So the 10-day return is $R_1+...+R_{10}.$

Since the daily returns are independent and identically distributed, by additivity of variance we have $Var(R_1+...+R_{10})=10\,Var(R_1)=10\cdot 0.00063=0.0063.$

The 10-day return is normally distributed because the daily returns are independent and normal. It remains to apply the VaR formula

.

for normal distributions. From the table of the distribution function of the standard normal Thus, This translates to the minimum loss of Thus, with probability 1% the loss can be $7362 or more.
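The whole calculation can be checked with a short Python sketch. It assumes the usual formula for a normal return (VaR equals the mean plus the standard deviation times the standard normal quantile, here negative) and uses the exact 1% quantile from the standard library instead of a table value. All variable names are made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

position = 40_000.0       # total investment in dollars
weights = (0.5, 0.5)      # $20,000 in each asset
daily_vol = (0.03, 0.03)  # daily volatilities
corr = 0.4                # correlation between the two returns
horizon_days = 10
alpha = 0.01              # critical level

# Variance of the daily portfolio return, a linear combination of two returns
daily_var = (weights[0] ** 2 * daily_vol[0] ** 2
             + weights[1] ** 2 * daily_vol[1] ** 2
             + 2 * weights[0] * weights[1] * corr * daily_vol[0] * daily_vol[1])

# Independent daily returns: variances add up over the horizon
horizon_sd = sqrt(horizon_days * daily_var)

# Normal VaR in return terms (the mean is zero) and in dollars
quantile = NormalDist().inv_cdf(alpha)
var_return = 0.0 + horizon_sd * quantile
var_dollars = position * var_return

print(f"daily variance  = {daily_var:.5f}")
print(f"10-day st. dev. = {horizon_sd:.4f}")
print(f"1% quantile     = {quantile:.4f}")
print(f"10-day VaR      = {var_dollars:.0f} dollars")
```

A quantile rounded to two decimals, as read off a table, moves the dollar figure by a few tens of dollars, so small differences between hand calculations are to be expected.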

Students of FN3142 often think that they can get by with a few technical tricks. The questions below are mostly about the intuition that helps to understand and apply those tricks.

Everywhere we assume that is a time series and is a sequence of corresponding information sets. It is natural to assume that for all We use the short conditional expectation notation: .

Questions

Question 1. How do you calculate conditional expectation in practice?

Question 2. How do you explain ?

Question 3. Simplify each of and and explain intuitively.

Question 4. is a shock at time . Positive and negative shocks are equally likely. What is your best prediction now for tomorrow's shock? What is your best prediction now for the shock that will happen the day after tomorrow?

Question 5. How and why do you predict at time ? What is the conditional mean of your prediction?

Question 6. What is the error of such a prediction? What is its conditional mean?

Question 7. Answer the previous two questions replacing by .

Question 8. What is the mean-plus-deviation-from-mean representation (conditional version)?

Question 9. How is the representation from Q.8 reflected in variance decomposition?

Question 10. What is a canonical form? State and prove all properties of its parts.

Question 11. Define conditional variance for white noise process and establish its link with the unconditional one.

Question 12. How do you define the conditional density in case of two variables, when one of them serves as the condition? Use it to prove the LIE.

Question 13. Write down the joint distribution function for a) independent observations and b) for serially dependent observations.

Question 14. If one variable is a linear function of another, what is the relationship between their densities?

Question 15. What can you say about the relationship between if ? Explain geometrically the definition of the quasi-inverse function.

Answers

Answer 1. Conditional expectation is a complex notion. There are several definitions of differing levels of generality and complexity. See one of them here and another in Answer 12.

The point of this exercise is that any definition requires a lot of information and in practice there is no way to apply any of them to actually calculate conditional expectation. Then why do they juggle conditional expectation in theory? The efficient market hypothesis comes to the rescue: it is posited that all observed market data incorporate all available information, and, in particular, stock prices are already conditioned on

Answer 4. Since positive and negative shocks are equally likely, the best prediction is (I call this equation a martingale condition). Similarly, but in this case I prefer to see an application of the LIE:

Answer 5. The best prediction is because it minimizes among all functions of current information Formally, you can use the first order condition

Answer 6. It is natural to define the prediction error by

By the projector property .

Answer 7. To generalize, just change the subscripts. For the prediction we have to use two subscripts: the notation means that we are trying to predict what happens at a future date based on info set (time is like today). Then by definition

Answer 8. Answer 7, obviously, implies The simple case is here.

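Answer 12. The conditional density is defined similarly to the conditional probability. Let $X,Y$ be two random variables. Denote $f_X$ the density of $X$ and $f_{X,Y}$ the joint density. Then the conditional density of $Y$ conditional on $X$ is defined as $f_{Y|X}(y|x)=\frac{f_{X,Y}(x,y)}{f_X(x)}.$ After this we can define the conditional expectation $E(Y|X=x)=\int yf_{Y|X}(y|x)dy.$ With these definitions one can prove the Law of Iterated Expectations:

$E[E(Y|X)]=\int E(Y|X=x)f_X(x)dx=\int\int y\frac{f_{X,Y}(x,y)}{f_X(x)}dy\,f_X(x)dx=\int\int yf_{X,Y}(x,y)dxdy=EY.$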

This is an illustration to Answer 1 and a prelim to Answer 13.
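The same argument can be checked numerically in the discrete case, where sums replace integrals and probabilities replace densities. The sketch below uses a made-up joint probability table (the dictionary `joint` and the helper names are assumptions for this illustration only) and exact fractions, so the two sides of the LIE can be compared without rounding.

```python
from fractions import Fraction as Fr

# Hypothetical joint probabilities p(x, y) of two discrete variables X and Y
joint = {
    (0, 1): Fr(1, 8),
    (0, 3): Fr(3, 8),
    (1, 1): Fr(1, 4),
    (1, 3): Fr(1, 4),
}


def marginal_x(x):
    """P(X = x), obtained by summing the joint probabilities over y."""
    return sum(p for (xx, _), p in joint.items() if xx == x)


def cond_exp_y(x):
    """E(Y | X = x): sum of y times the conditional probability p(x, y) / P(X = x)."""
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / marginal_x(x)


x_values = sorted({x for x, _ in joint})

# Left side of the LIE: average the conditional means with the weights P(X = x)
lhs = sum(cond_exp_y(x) * marginal_x(x) for x in x_values)

# Right side: the unconditional mean of Y
rhs = sum(y * p for (_, y), p in joint.items())

print(lhs, rhs)
```

The division by the marginal probability in the conditional mean cancels with the weight used in the outer average, which is exactly the cancellation of the marginal density in the continuous proof.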

Answer 13. Understanding this answer is essential for Section 8.6 on maximum likelihood of Patton's guide.

a) In case of independent observations the joint density of the vector is a product of individual densities:

b) In the time series context it is natural to assume that the next observation depends on the previous ones, that is, for each depends on (serially dependent observations). Therefore we should work with conditional densities From Answer 12 we can guess how to make conditional densities appear:

The fractions on the right are recognized as conditional densities. The resulting expression is pretty awkward:

Answer 14. The answer given here helps one understand how to pass from the density of the standard normal to that of the general normal.

Answer 15. This elementary explanation of the function definition can be used in the fifth grade. Note that conditions sufficient for existence of the inverse are not satisfied in a case as simple as the distribution function of the Bernoulli variable (when the graph of the function has flat pieces and is not continuous). Therefore we need a more general definition of an inverse. Those who think that this question is too abstract can check out UoL exams, where examinees are required to find Value at Risk when the distribution function is a step function. To understand the idea, do the following:

a) Draw a graph of a good function (continuous and increasing).

b) Fix some value in the range of this function and identify the region .

c) Find the solution of the equation . By definition, Identify the region .

d) Note that . In general, for bad functions the minimum here may not exist. Therefore minimum is replaced by infimum, which gives us the definition of the quasi-inverse:

There will be a separate post on projectors. In the meantime, we'll have a look at simple examples that explain a lot about conditional expectations.

Examples of projectors

The name "projector" is almost self-explanatory. Imagine a point and a plane in the three-dimensional space. Draw a perpendicular from the point to the plane. The intersection of the perpendicular with the plane is the point's projection onto that plane. Note that if the point already belongs to the plane, its projection equals the point itself. Besides, instead of projecting onto a plane we can project onto a straight line.

The above description translates into the following equations. For any define

(1) and

projects onto the plane (which is two-dimensional) and projects onto the straight line (which is one-dimensional).

Property 1. Double application of a projector amounts to single application.

Proof. We do this just for one of the projectors. Using (1) three times we get

(1)

Property 2. A successive application of two projectors yields the projection onto a subspace of a smaller dimension.

Proof. If we apply first and then , the result is

(2)

If we change the order of projectors, we have

(3)
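Both properties are easy to verify on a computer. The sketch below assumes the simplest coordinate projectors: `P` zeroes the third coordinate (projection onto a coordinate plane) and `Q` zeroes the second and third coordinates (projection onto a coordinate axis). These particular definitions are an assumption made for illustration; any pair of orthogonal projectors onto a plane and a line contained in it behaves the same way.

```python
def P(x):
    """Project a point of three-dimensional space onto the plane x3 = 0."""
    return (x[0], x[1], 0.0)


def Q(x):
    """Project a point of three-dimensional space onto the line x2 = x3 = 0."""
    return (x[0], 0.0, 0.0)


x = (3.0, -2.0, 5.0)

# Property 1: applying a projector twice is the same as applying it once
print(P(P(x)) == P(x), Q(Q(x)) == Q(x))

# Property 2: in either order, the two projectors give the projection
# onto the smaller subspace (the line)
print(Q(P(x)) == Q(x), P(Q(x)) == Q(x))
```

A point that already lies in the plane is left unchanged by `P`, which is the geometric content of Property 1.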

Exercise 1. Show that both projectors are linear.

Exercise 2. Like any other linear operator in a Euclidean space, these projectors are given by some matrices. What are they?

The simple truth about conditional expectation

In the time series setup, we have a sequence of information sets (it's natural to assume that with time the amount of available information increases). Denote

the expectation of conditional on . For each ,

is a projector onto the space of random functions that depend only on the information set .

Property 1. Double application of conditional expectation gives the same result as single application:

(4)

( is already a function of , so conditioning it on doesn't change it).

Property 2. A successive conditioning on two different information sets is the same as conditioning on the smaller one:

(5)

(6)

Property 3. Conditional expectation is a linear operator: for any variables and numbers

It's easy to see that (4)-(6) are similar to (1)-(3), respectively, but I prefer to use different names for (4)-(6). I call (4) a projector property. (5) is known as the Law of Iterated Expectations, see my post on the informational aspect for more intuition. (6) holds simply because at time the expectation is known and behaves like a constant.

Summary. (4)-(6) are easy to remember as one property. The smaller information set wins:
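The rule that the smaller information set wins can be verified by brute force in a toy model. In the sketch below (all names and the choice of the variable `X` are assumptions for illustration) three fair coins are tossed, the information set at time t consists of the first t tosses, and conditional expectation at time t is computed by averaging over the tosses that have not been observed yet.

```python
from fractions import Fraction as Fr
from itertools import product

N = 3                                        # number of fair coin tosses
OUTCOMES = list(product((-1, 1), repeat=N))  # all 8 equally likely outcomes


def cond_exp(f, t):
    """Return E_t f: a function of the outcome that depends only on the first t tosses."""
    tails = list(product((-1, 1), repeat=N - t))

    def g(omega):
        # Keep the observed tosses, average over the unobserved ones
        return sum(Fr(f(omega[:t] + tail)) for tail in tails) / len(tails)

    return g


def X(omega):
    """A hypothetical random variable that depends on all three tosses."""
    return (omega[0] + omega[1] + omega[2]) ** 3


def same(f, g):
    """Two random variables are equal if they agree on every outcome."""
    return all(f(omega) == g(omega) for omega in OUTCOMES)


E1X = cond_exp(X, 1)
E2X = cond_exp(X, 2)

print(same(cond_exp(E1X, 1), E1X))  # projector property
print(same(cond_exp(E2X, 1), E1X))  # Law of Iterated Expectations
print(same(cond_exp(E1X, 2), E1X))  # a known quantity behaves like a constant
```

In all three cases the result is the expectation conditional on the first toss only, the smaller of the two information sets involved.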

Law of iterated expectations: informational aspect

The notion of Brownian motion will help us. Suppose we observe a particle that moves back and forth randomly along a straight line. The particle starts at zero at time zero. The movement can be visualized by plotting on the horizontal axis time and on the vertical axis - the position of the particle. denotes the random position of the particle at time .

Figure 1. Unconditional expectation

In Figure 1, various paths starting at the origin are shown in different colors. The intersections of the paths with vertical lines at times 0.5, 1 and 1.5 show the positions of the particle at these times. The deviations of those positions from to the upside and downside are assumed to be equally likely (more precisely, they are normal variables with mean zero and variance ).

Unconditional expectation

“In the beginning there was nothing, which exploded.” ― Terry Pratchett, Lords and Ladies

If we are at the origin (like the Big Bang), nothing has happened yet and is the best prediction for any moment we can make (shown by the blue horizontal line in Figure 1). The usual, unconditional expectation corresponds to the empty information set.

Conditional expectation

Figure 2. Conditional expectation

In Figure 2, suppose we are at The dark blue path between and has been realized. We know that the particle has reached the point at that time. With this knowledge, we see that the paths starting at this point will have the average

(1)

This is because the particle will continue moving randomly, with the up and down moves being equally likely. Prediction (1) is shown by the horizontal light blue line between and In general, this prediction is better than .

Note that for different realized paths, takes different values. Therefore , for , is a random variable. It is a function of the event we condition the expectation on.

Law of iterated expectations

Figure 3. Law of iterated expectations

Suppose you are at time (see Figure 3). You send many agents to the future to fetch the information about what will happen. They bring you the data on the means they see (shown by horizontal lines between and Since there are many possible future realizations, you have to average the future means. For this, you will use the distributional belief you have at time The result is Since the up and down moves are equally likely, your distribution at time is symmetric around Therefore the above average will be equal to This is the Law of Iterated Expectations, also called the tower property:

(2)

The knowledge of all of the future predictions , upon averaging, does not improve or change our current prediction .

For a full mathematical treatment of conditional expectation see Lecture 10 by Gordan Zitkovic.

Exercise 1. Suppose a portfolio contains shares of stock 1 whose price is and shares of stock 2 whose price is . Stock prices fluctuate and are random variables. Numbers of shares are assumed fixed and are deterministic. What is the expected value of the portfolio?

Solution. The portfolio value is its market price . Since this is a linear combination, the expected value is .

In fact, the portfolio analysis is a little bit different than suggested by Exercise 1. To explain the difference, we start with fixing two points of view.

View 1. I hold a portfolio of stocks. I may have inherited it, and it does not matter how much it cost at the moment it was formed. If I want to sell it, I am interested in knowing its market value. In this situation the numbers of shares in my portfolio, which are constant, and the market prices of stocks, which are random, determine the market value of the portfolio, defined in Exercise 1. The value of the portfolio is a linear combination of stock prices.

View 2. I have a certain amount of money to invest. Being a gambler, I am not interested in holding a portfolio forever. I am thinking about buying a portfolio of stocks now and selling it, say, in a year at price . In this case I am interested in the rate of return defined by is considered deterministic (current prices are certain) and is random (future prices are unpredictable). Thus the rate of return is random.

We pursue the second view (prevalent in finance). As it often happens in economics and finance, the result depends on how one understands the things. Suppose the initial amount is invested in assets. Denoting the amount invested in asset , we have . Denoting the share (percentage) of in the total investment , we have

(1)

The initial shares are deterministic.

Let be what becomes of in one year and let be the total value of the investment at the end of the year. Since different assets grow at different rates, generally it is not true that . Denote the rate of return on asset . Then

(2)

Exercise 2. The rate of return on the portfolio is a linear combination of the rates of return on separate assets, the coefficients being the initial shares of investment.

Solution. Using Equations (1) and (2) we get

(3)

Once you know this equation you can find the mean and variance of the rate of return on the portfolio in terms of investment shares and rates of return on assets.
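A quick numerical check of Exercise 2, with made-up amounts and rates of return (all numbers and names in this sketch are hypothetical): the rate of return on the whole portfolio, computed from the initial and final values, coincides with the combination of the asset rates weighted by the initial shares.

```python
# Hypothetical initial amounts invested in three assets and their rates of return
initial = [2_000.0, 3_000.0, 5_000.0]
rates = [0.10, -0.04, 0.06]

total_initial = sum(initial)
shares = [m / total_initial for m in initial]          # initial shares, they sum to 1

final = [m * (1 + r) for m, r in zip(initial, rates)]  # what each amount becomes
total_final = sum(final)

# Rate of return on the portfolio, straight from the definition
portfolio_rate = (total_final - total_initial) / total_initial

# The same rate as a linear combination of the asset rates
weighted_rate = sum(s * r for s, r in zip(shares, rates))

print(portfolio_rate, weighted_rate)
```

Note that the shares at the end of the year are generally different from the initial ones because the assets grow at different rates; only the initial shares enter the formula.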

Principal component analysis is a general method based on diagonalization of the variance matrix. We consider it in a financial context. The variance matrix measures riskiness of the portfolio. We want to see which stocks contribute most to the portfolio risk. The surprise is that the answer is given not in terms of the vector of returns but in terms of its linear transformation.

With such a matrix, instead of we can consider its transformation for which

We know that has variances on the main diagonal. It follows that for all Variance is a measure of riskiness. Thus, the transformed variables are put in the order of declining risk. What follows is the realization of this idea using sample data.

In a sampling context, all population means should be replaced by their sample counterparts. Let be a vector of observations on at time These observations are put side by side into a matrix where is the number of moments in time. The population mean is estimated by the sample mean

The variance matrix is estimated by

where is a vector of ones. It is this matrix that is diagonalized:

In general, the eigenvalues in are not ordered. Ordering them and at the same time changing places of the rows of correspondingly we get a new orthogonal matrix (this requires a small proof) such that the eigenvalues in will be ordered. There is a lot more to say about the method and its applications.
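The sample version of the procedure can be sketched in a few lines with NumPy. The data are simulated, the divisor n - 1 in the variance estimator is an assumption, and the eigenvectors are stored as columns (NumPy's convention), so the transformed data are obtained by multiplying with the transposed matrix. None of the names below come from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_assets, n_obs = 3, 500

# Hypothetical returns: rows are assets, columns are moments in time
mixing = np.array([[1.0, 0.0, 0.0],
                   [0.5, 0.3, 0.0],
                   [0.2, 0.1, 0.05]])
X = mixing @ rng.standard_normal((n_assets, n_obs))

# Sample mean and sample variance matrix
sample_mean = X.mean(axis=1, keepdims=True)
centered = X - sample_mean
V_hat = centered @ centered.T / (n_obs - 1)

# Diagonalization of a symmetric matrix: eigh returns eigenvalues in
# ascending order and orthonormal eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(V_hat)

# Reorder so that the eigenvalues decline; the reordered matrix stays orthogonal
order = np.argsort(eigenvalues)[::-1]
lam = eigenvalues[order]
U = eigenvectors[:, order]

# Transformed variables: uncorrelated in the sample, variances in declining order
Y = U.T @ centered
V_Y = Y @ Y.T / (n_obs - 1)

print(np.round(lam, 4))
print(np.round(V_Y, 4))
```

The printed variance matrix of the transformed data is diagonal, with the ordered eigenvalues on the main diagonal.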

Efficient market hypothesis is subject to interpretation

The formulation on Investopedia seems to me the best:

The efficient market hypothesis (EMH) is an investment theory that states it is impossible to "beat the market" because stock market efficiency causes existing share prices to always incorporate and reflect all relevant information. According to the EMH, stocks always trade at their fair value on stock exchanges, making it impossible for investors to either purchase undervalued stocks or sell stocks for inflated prices. As such, it should be impossible to outperform the overall market through expert stock selection or market timing, and the only way an investor can possibly obtain higher returns is by purchasing riskier investments.

This is not Math, and the EMH interpretation is subjective. My purpose is not to discuss the advantages and drawbacks of various versions of the EMH but to indicate some errors students make on exams.

Best(?) way to answer questions related to EMH

Since such answers involve a lot of talking, it is best to use the appropriate key words.

Start with "The EMH states that it is impossible to make economic profit".

Then explain why: The stock market is efficient in the sense that stocks trade at their fair value, so that undervalued or overvalued stocks don't exist.

Then specify that "to obtain economic profits, from the revenues we subtract opportunity (hidden) costs, in addition to direct costs, such as transaction fees". What on the surface seems to be a profitable activity may in fact be balancing at break-even.

Next is to address the specification by Malkiel that the EMH depends on the information set available at time .

Weak form of EMH. The information set contains only historical values of asset prices, dividends (and possibly volume) up until time . This is basically what an investor sees on a stock price chart. Many students say "historical information" but fail to mention that it is about prices of financial assets. The birthdays of celebrities are also historical information but they are not in this info set.

Semi-strong form of EMH. The info set is all publicly available information. Some students don't realize that it includes . The risk-free rate is in but not in because 1) it is publicly known and 2) it is not traded (it is fixed by the central bank for extended periods of time).

Strong form of EMH. The info set includes all publicly available info plus private company information. Firstly, this info set includes the previous two: . Secondly, whether a certain piece of information belongs to or depends on time. For example, the number of shares Warren Buffett purchased today of the stock is in but over time it becomes a part of because large holdings must be reported within 45 days of the end of a calendar quarter. If there are nuances like this you have to explain them.

Implications for time series analysis

Conditional expectation is a relatively complex mathematical construct. The simplest definition is accessible to basic statistics students. The mid-level definition in case of conditioning on a set of positive probability already raises questions about practical calculation. The most general definition is based on Radon-Nikodym derivatives. Moreover, nobody knows exactly any of those . So how do you apply time series models which depend so heavily on conditioning? The answer is simple: since by the EMH the stock price "reflects all relevant information", that price is already conditioned on that information, and you don't need to worry about theoretical complexities of conditioning in applications.

The Newey-West estimator: uncorrelated and correlated data

I hate long posts but here we by necessity have to go through all ideas and calculations, to understand what is going on. One page of formulas in A. Patton's guide to FN3142 Quantitative Finance in my rendition becomes three posts.

With the help of this matrix we derived two expressions for variance of a linear combination:

(2)

for uncorrelated variables and

(3)

when there is autocorrelation.

In a time series context are observations along time. stand for moments in time and the sequence is called a time series. We need to recall the definition of a stationary process. Of that definition, we will use only the part about covariances: depends only on the distance between the time moments For example, in the top right corner of (1) we have which depends only on

Preamble. Let be a stationary time series. Firstly, depends only on Secondly, for all integer denoting we have

(4)

Definition. The autocovariance function is defined by

(5) for all integer

In particular,

(6) for all

The preamble shows that definition (5) is correct (the right side in (5) depends only on and not on ). Because of (4) we have symmetry so negative can be excluded from consideration.

With (5) and (6) for a stationary series (1) becomes

(7)

Estimating variance of a sample mean

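Uncorrelated observations. Suppose $X_1,...,X_n$ are uncorrelated observations from the same population with variance $\sigma^2.$ From (2) we get

(8) $Var(\bar{X})=Var\left(\frac{1}{n}\sum_{t=1}^nX_t\right)=\frac{1}{n^2}\sum_{t=1}^nVar(X_t)=\frac{\sigma^2}{n}.$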

This is a theoretical relationship. To actually obtain an estimator of the variance of the sample mean, we need to replace $\sigma^2$ by some estimator. It is known that

(9)

consistently estimates Plugging it in (8) we see that variance of the sample mean is consistently estimated by

This is the estimator derived on p.151 of Patton's guide.
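A small simulation shows what the theoretical relationship and the estimator mean in practice. In the sketch below the population is normal with standard deviation 2, the sample variance uses the divisor n - 1 (an assumption, the other common choice is n), and all names are made up for this illustration. The variance of many simulated sample means is compared with the population variance divided by n, and then the estimator is computed from a single sample, which is all we have in applications.

```python
import random
from statistics import mean, variance

random.seed(1)
n, sigma, replications = 50, 2.0, 2000

# Many independent samples of size n, one sample mean from each
sample_means = []
for _ in range(replications):
    sample = [random.gauss(0.0, sigma) for _ in range(n)]
    sample_means.append(mean(sample))

theoretical = sigma ** 2 / n          # variance of the sample mean in theory
simulated = variance(sample_means)    # the same quantity from the simulation

# In practice there is only one sample: plug the sample variance in
one_sample = [random.gauss(0.0, sigma) for _ in range(n)]
estimate = variance(one_sample) / n

print(theoretical, simulated, estimate)
```

The three numbers are close to each other; the last one fluctuates from sample to sample because it is itself random.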

Correlated observations. In this case we use (3):

.

Here visualization comes in handy. The sums in the square brackets include all terms on the main diagonal of (7) and above it. That is, we have $n$ copies of $\gamma_0,$ $n-1$ copies of $\gamma_1$,..., 2 copies of $\gamma_{n-2}$ and 1 copy of $\gamma_{n-1}.$ The sum in the brackets is

Thus we obtain the first equation on p.152 of Patton's guide (it's up to you to match the notation):

(10)

As above, this is just a theoretical relationship. is estimated by (9). Ideally, the estimator of is obtained by replacing all population means by sample means:

(11)

There are two problems with this estimator, though. The first problem is that when runs from to runs from to To exclude out-of-sample values, the summation in (11) is reduced:

(12)

The second problem is that the sum in (12) becomes too small when $j$ is close to $n.$ For example, for $j=n-1$ (12) contains just one term (there is no averaging). Therefore the upper limit of summation in (10) is replaced by some function of $n$ that tends to infinity slower than $n.$ The result is the estimator

where is given by (9) and is given by (12). This is almost the Newey-West estimator from p.152. The only difference is that instead of they use , and I have no idea why. One explanation is that for low , can be zero, so they just wanted to avoid division by zero.
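Here is a minimal Python sketch of an estimator of this type, under explicitly stated assumptions: autocovariances are estimated from in-sample pairs only and divided by the number of pairs, the lag-zero term uses the divisor n, and the truncation lag `max_lag` is chosen by hand. With `bartlett=False` the weights are 1 - j/n; with `bartlett=True` they are the Bartlett weights 1 - j/(max_lag + 1), which is the standard Newey-West choice. The function names and the simulated AR(1) series are hypothetical.

```python
import random


def autocov_hat(x, j):
    """Sample autocovariance at lag j from in-sample pairs only (divisor n - j)."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((x[t] - x_bar) * (x[t - j] - x_bar) for t in range(j, n)) / (n - j)


def var_of_mean_hat(x, max_lag, bartlett=False):
    """Estimate the variance of the sample mean using lags 1, ..., max_lag."""
    n = len(x)
    total = autocov_hat(x, 0)
    for j in range(1, max_lag + 1):
        weight = 1 - j / (max_lag + 1) if bartlett else 1 - j / n
        total += 2 * weight * autocov_hat(x, j)
    return total / n


# Simulated positively autocorrelated data: an AR(1) process with coefficient 0.7
random.seed(2)
phi, n = 0.7, 2000
x = [0.0]
for _ in range(n - 1):
    x.append(phi * x[-1] + random.gauss(0.0, 1.0))

naive = var_of_mean_hat(x, 0)                      # ignores autocorrelation
corrected = var_of_mean_hat(x, 20)                 # weights 1 - j/n
newey_west = var_of_mean_hat(x, 20, bartlett=True)

print(naive, corrected, newey_west)
```

With positive autocorrelation the estimate that ignores the autocovariances is several times smaller than the corrected ones, which is why the formula for uncorrelated data understates the uncertainty of the sample mean for such series.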