2
May 18

## Law of total probability - you could have invented this

A knight wants to kill (event $K$) a dragon. There are two ways to do this: by fighting (event $F$) the dragon or by outwitting ($O$) it. The choice of the way ($F$ or $O$) is random, and in each case the outcome ($K$ or not $K$) is also random. For the probability of killing there is a simple, intuitive formula:

$P(K)=P(K|F)P(F)+P(K|O)P(O)$.

Its derivation is straightforward from the definition of conditional probability: since $F$ and $O$ cover the whole sample space and are disjoint, we have by additivity of probability

$P(K)=P(K\cap(F\cup O))=P(K\cap F)+P(K\cap O)=\frac{P(K\cap F)}{P(F)}P(F)+\frac{P(K\cap O)}{P(O)}P(O)$

$=P(K|F)P(F)+P(K|O)P(O)$.

This is easy to generalize to the case of many conditioning events. Suppose $A_1,...,A_n$ are mutually exclusive (that is, disjoint) and collectively exhaustive (that is, cover the whole sample space). Then for any event $B$ one has

$P(B)=P(B|A_1)P(A_1)+...+P(B|A_n)P(A_n)$.

This equation is called the law of total probability.
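To see the formula at work, here is a minimal Python sketch. The helper name `total_probability` and all the numbers for the knight are my own assumptions, chosen only for illustration.

```python
def total_probability(cond_probs, part_probs):
    """P(B) = sum_i P(B|A_i) P(A_i) for a partition A_1, ..., A_n."""
    # The conditioning events must be collectively exhaustive and disjoint,
    # so their probabilities have to add up to one.
    if abs(sum(part_probs) - 1.0) > 1e-12:
        raise ValueError("partition probabilities must sum to 1")
    return sum(c * p for c, p in zip(cond_probs, part_probs))

# Hypothetical numbers: P(K|F) = 0.2, P(K|O) = 0.6, P(F) = 0.3, P(O) = 0.7
p_kill = total_probability([0.2, 0.6], [0.3, 0.7])
print(p_kill)  # 0.2*0.3 + 0.6*0.7, up to floating-point rounding
```

With these made-up inputs the knight kills the dragon with probability $0.2\times 0.3+0.6\times 0.7=0.48$.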

## Application to a sum of continuous and discrete random variables

Let $X,Y$ be independent random variables. Suppose that $X$ is continuous, with a distribution function $F_X$, and suppose $Y$ is discrete, with values $y_1,...,y_n$. Then for the distribution function of the sum $F_{X+Y}$ we have

$F_{X+Y}(t)=P(X+Y\le t)=\sum_{j=1}^nP(X+Y\le t|Y=y_j)P(Y=y_j)$

(by independence, conditioning on $Y=y_j$ can be omitted)

$=\sum_{j=1}^nP(X\le t-y_j)P(Y=y_j)=\sum_{j=1}^nF_X(t-y_j)P(Y=y_j)$.

Compare this to the much more complex derivation in case of two continuous variables.
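The formula can be checked numerically. The sketch below assumes, purely for illustration, that $X$ is standard normal and that $Y$ takes three made-up values; it compares the formula with a Monte Carlo estimate of $P(X+Y\le t)$.

```python
import math
import random

def norm_cdf(x):
    """Standard normal distribution function, playing the role of F_X here."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cdf_sum(t, values, probs, cdf_x=norm_cdf):
    """F_{X+Y}(t) = sum_j F_X(t - y_j) P(Y = y_j)."""
    return sum(cdf_x(t - y) * p for y, p in zip(values, probs))

values = [-1.0, 0.0, 2.0]  # hypothetical y_1, y_2, y_3
probs = [0.2, 0.5, 0.3]    # hypothetical P(Y = y_j)
t = 0.5

exact = cdf_sum(t, values, probs)

# Monte Carlo check: draw X and Y independently and count how often X + Y <= t.
rng = random.Random(0)
n = 100_000
hits = sum(rng.gauss(0.0, 1.0) + rng.choices(values, probs)[0] <= t
           for _ in range(n))
mc_estimate = hits / n

print(exact, mc_estimate)
```

The two numbers should agree up to the simulation error, which is of order $1/\sqrt{n}$.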

4
Apr 18

## Distribution function estimation

The relativity theory says that what initially looks absolutely difficult, on closer examination turns out to be relatively simple. Here is one such topic. We start with a motivating example.

Large cloud service providers have huge data centers. A data center, being a large group of computer servers, typically requires extensive air conditioning. The intensity and cost of air conditioning depend on the temperature of the surrounding environment. If, as in our motivating example, we denote by $T$ the temperature outside and by $t$ a cut-off value, then a cloud service provider is interested in knowing the probability $P(T\le t)$ for different values of $t$. This is exactly the distribution function of temperature: $F_T(t)=P(T\le t)$. So how do you estimate it?

It comes down to usual sampling. Fix some cut-off, for example, $t=20$ and see for how many days in a year the temperature does not exceed 20. If the number of such days is, say, 200, then 200/365 will be the estimate of the probability $P(T\le 20)$.

It remains to dress this idea in mathematical clothes.

## Empirical distribution function

If an observation $T_i$ belongs to the event $\{T\le 20\}$, we count it as 1, otherwise we count it as zero. That is, we are dealing with a dummy variable

(1) $1_{\{T\le 20\}}=\left\{\begin{array}{ll}1,&T\le 20;\\0,&T>20.\end{array}\right.$

The total count is $\sum 1_{\{T_i\le 20\}}$ and this is divided by the total number of observations, which is 365, to get 200/365.

It is important to realize that the variable in (1) is a coin (Bernoulli variable). For an unfair coin $C$ with probability of 1 equal to $p$ and probability of zero equal to $1-p$ the mean is

$EC=p\times 1+(1-p)\times 0=p$

and the variance is

$Var(C)=EC^2-(EC)^2=p-p^2=p(1-p)$.

For the variable in (1) $p=P\{1_{\{T\le 20\}}=1\}=P(T\le 20)=F_T(20)$, so the mean and variance are

(2) $E1_{\{T\le 20\}}=F_T(20),\ Var(1_{\{T\le 20\}})=F_T(20)(1-F_T(20))$.

Generalizing, the probability $F_T(t)=P(T\le t)$ is estimated by

(3) $\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}$

where $n$ is the number of observations. (3) is called an empirical distribution function because it is a direct empirical analog of $P(T\le t)$.
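In code, (3) is just a share of observations. The following Python sketch uses a short list of made-up temperatures; the function name `ecdf` is my own choice.

```python
def ecdf(sample, t):
    """Empirical distribution function (3): share of observations T_i <= t."""
    # Each observation contributes the dummy 1_{T_i <= t}; we average them.
    return sum(1 for x in sample if x <= t) / len(sample)

# Hypothetical daily temperatures
temps = [12.5, 18.0, 20.0, 23.1, 25.4, 19.9, 30.2, 15.0]
share = ecdf(temps, 20)
print(share)  # 5 of the 8 observations do not exceed 20
```

Here five of the eight observations satisfy $T_i\le 20$, so the estimate of $P(T\le 20)$ is $5/8=0.625$.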

Applying expectation to (3) and using an equation similar to (2), we prove unbiasedness of our estimator:

(4) $E\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}=P(T\le t)=F_T(t)$.

Further, assuming independent observations, we can find the variance of (3):

(5) $Var\left(\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}\right)$ (using homogeneity of degree 2)

$=\frac{1}{n^2}Var\left(\sum_{i=1}^n 1_{\{T_i\le t\}}\right)$ (using independence)

$=\frac{1}{n^2}\sum_{i=1}^nVar(1_{\{T_i\le t\}})$ (applying an equation similar to (2))

$=\frac{1}{n^2}\sum_{i=1}^nF_T(t)(1-F_T(t))=\frac{1}{n}F_T(t)(1-F_T(t)).$

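Corollary. (4) and (5) can be used to prove that (3) is a consistent estimator of the distribution function, i.e., (3) converges to $F_T(t)$ in probability. Indeed, by (5) the variance of (3) tends to zero as $n\rightarrow\infty$, and then the Chebyshev inequality gives convergence in probability to the mean (4).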
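Equations (4) and (5) are easy to check by simulation. In the sketch below I assume, for illustration only, that the observations are uniform on $(0,1)$, so that $F_T(t)=t$; the sample size, the number of repetitions and the cut-off are arbitrary.

```python
import random
import statistics

def ecdf(sample, t):
    """Empirical distribution function (3)."""
    return sum(1 for x in sample if x <= t) / len(sample)

rng = random.Random(42)
n, reps, t = 50, 20_000, 0.3  # hypothetical sample size, repetitions, cut-off

# T_i ~ Uniform(0, 1), so F_T(t) = t for 0 <= t <= 1.
estimates = [ecdf([rng.random() for _ in range(n)], t) for _ in range(reps)]

mean_hat = statistics.fmean(estimates)     # should be close to F_T(t), see (4)
var_hat = statistics.pvariance(estimates)  # should be close to (5)
var_theory = t * (1 - t) / n

print(mean_hat, var_hat, var_theory)
```

The average of the estimates should be close to $0.3$ and their variance close to $0.3\times 0.7/50=0.0042$. Increasing `n` shrinks the variance, which is the consistency statement in action.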

14
Mar 18

## Intro to option greeks: delta and its determinants

I started trading stocks in 2010. I didn't expect to make big profits and wasn't actively trading. That's until 2015, when I met a guy who turned \$10,000 into \$140,000 in four years. And then I thought: why am I fooling around when it's possible to make good money? Experienced traders say: trading is a journey. That's how my journey started. Stocks move too slowly for my taste, so I had to look for other avenues.

Two things were clear to me. I didn't want to be glued to the monitor the whole day and didn't want to study a lot of theory. Therefore I decided to concentrate on the futures market. To trade futures, you don't even need to know the definition of a futures contract. The price moves very quickly, and if you know what you are doing, you can make a couple of hundred dollars in a matter of minutes. It turned out that the futures markets are the best approximation to the efficient market hypothesis. Trend is your friend (until the end), as they say. In the futures markets, trends are rare and short-lived. Trading futures is like driving a race car. The psychological stress is enormous and it may excite your worst instincts. After trying for almost two years and losing \$8,000 I gave up. Don't trade futures unless you can predict a big move.

Many people start their trading careers in the forex market because the volumes there are large and transaction fees are low. I never traded forex and think that it is as risky as the futures market. If you want to try it, I would suggest trading not the exchange rates themselves but indexes or ETFs (exchange traded funds) that track them. Again, look for large movements.

One more market I don't want to trade is bonds. Actions of central banks and macroeconomic events are among the strong movers of this market. Otherwise, it's the same as futures.

Futures, forex and bonds have one feature in common. In all of them institutional (large) traders dominate. My impression is that in the absence of market-moving events they select a range within which to trade. Having deep pockets, they can buy at the top of the range and sell at the bottom without worrying about the associated loss. Trading in a range like that will kill a retail (small) investor. Changes in fundamentals force the big guys to shift the range, and that's when small investors have a chance to profit.

I tried to avoid options because they require learning some theory. After a prolonged resistance, I started trading options and immediately fell in love with them. I think that anybody with \$25,000 in savings can and should be trading options.

Definition. In Math, the Greek letter $\Delta$ (delta) is usually used to denote change or rate of change. In case of options, it's the rate of change of the option price when the stock price changes. Mathematically, it's a derivative $\Delta=\frac{\partial c}{\partial S}$ where $c$ is the call price and $S$ is the stock price. In layman's terms, when the stock price changes by \$1, the call price changes (moves in the same direction) by $\Delta$ dollars.

The basic features of delta can be understood by looking at how it depends on the strike price, when time is fixed, and how it changes with time, when the strike price is fixed. As before, we concentrate on probabilistic intuition.

## How delta depends on strike price

Figure 1. AAPL option chain with 26 days to expiration

Look at the option chain in Figure 1. For the strikes that are deep in the money, delta is close to one. This is because if a call option is deep in the money, the probability that it will end up in the money by expiration is high (see how the call price depends on the strike price). Hence, stock price changes are followed by call price changes almost one to one. On the other hand, if a strike is far out of the money, it is likely to remain out of the money by expiration. The stock price changes have little effect on the call price. That's why delta is close to zero.

## How delta depends on time to expiration

Figure 2. AAPL option chain with 5 days to expiration

Now let us compare that option chain to the one with a shorter time to expiration (see Figure 2). If an option is to expire soon, the probability of a drastic stock movement before expiration is low, see the comparison of areas of influence with different times to expiration. Only a few options with strikes lower than at the money strike have deltas different from one. The deeper in the money calls have deltas equal to one: their prices exactly repeat the stock price.
Similarly, only a few out of the money options have deltas different from zero. If the strike is very far out of the money, the call delta is 0 because the call is very likely to expire worthless and its dependence on the stock price is negligible.

4
Mar 18

## Interest rate - the puppetmaster behind option prices

Figure 1. Call as a function of interest rate

The interest rate is the last variable we need to discuss. The dependence of the call price on the interest rate that emerges from the Black-Scholes formula is depicted in Figure 1. The dependence is positive, right? Not so fast. This is the case when common sense should be used instead of mathematical models. One economic factor can influence another through many channels, often leading to contradicting results.

John Hull offers two explanations.

1. As interest rates in the economy increase, the expected return required by investors from the stock tends to increase. This suggests a positive dependence of the call price on the interest rate.
2. On the other hand, when the interest rate rises, the present value of any future cash flow received by the long call holder decreases. In particular, this reduces the payoff if at expiration the option is in the money.

The combined impact of these two effects embedded in the Black-Scholes formula is to increase the value of the call options.

## However, experience tells the opposite

The Fed changes the interest rate at discrete times, not continually. Two moments matter: when the rumor about the upcoming interest rate change hits the market and when the actual change takes place. The market reaction to the rumor depends on the investors' mood - bullish or bearish. A bullish market tends to shrug off most bad news. In a bearish market, even a slight threat may have drastic consequences. By the time the actual change occurs, it is usually priced in.
In 2017, the Fed raised the rate three times: on March 15 (no reaction, judging by S&P 500 SPDR SPY), June 14 (again no reaction) and December 13 (a slight fall). The huge fall in the beginning of February 2018 was not caused by any actual change. It was an accumulated result of cautiousness ("This market has been bullish for too long!") and fears that the Fed would increase the rates in 2018 by more than had been anticipated. Many investors started selling stocks and buying bonds and other less risky assets. The total value of US bonds is about \$20 trillion, while the total market capitalization of US stocks is around \$30 trillion. The two markets are comparable in size, which means there is enough room to move from one to another and the total portfolio reshuffling can be considerable. Thus far, the mere expectation that the interest rate will increase has been able to substantially reduce stock prices and, consequently, call prices.

All this I summarized to my students as follows. When interest rates rise, bonds become more attractive. This is a substitution effect: investors switch from one asset to another all the time. Therefore stock prices and call prices fall. Thus the dependence of call prices on interest rates is negative.

The first explanation suggested by Hull neglects the substitution effect. The second explanation is not credible either, for the following reason. As I explained, stock volatility has a very strong influence on options. Options themselves have an even higher volatility. A change in interest rates by a couple percent is nothing in comparison with this volatility. Most investors would not care about the resulting reduction in the present value of future cash flows.

27
Dec 17

## How to study mt3042 Optimisation: a guide to a guide

Section and examples numbering follows that of mt3042 Optimization Guide by M. Baltovic.
## Main idea: look at geometry in the two-dimensional case

Here is an example. The norm of a vector $x\in R^n$ is defined by $\left\Vert x\right\Vert =\sqrt{\sum_{i=1}^nx_i^2}.$ The combination of squaring and extracting a square root often makes it difficult to understand how this construction works. Here is a simple inequality that allows one to do without this norm (or, to put it differently, replace it with another norm). Take $n=2.$

$\max \{|x_1|,|x_2|\}=\max \{\sqrt{x_1^2},\sqrt{x_2^2}\}\leq\max\{\sqrt{x_1^2+x_2^2},\sqrt{x_1^2+x_2^2}\}=\left\Vert x\right\Vert =\sqrt{x_1^2+x_2^2}$

$\leq\sqrt{\max\{|x_1|,|x_2|\}^2+\max \{|x_1|,|x_2|\}^2}=\sqrt{2}\max\{|x_1|,|x_2|\}.$

We have proved that $\max \{|x_1|,|x_2|\}\leq\left\Vert x\right\Vert\leq\sqrt{2}\max\{|x_1|,|x_2|\}.$ This easily generalizes to $R^{n}$:

(1) $\max \{|x_1|,...,|x_n|\}\leq\left\Vert x\right\Vert\leq\sqrt{n}\max\{|x_1|,...,|x_n|\}.$

Application. The set $A\subset R^n$ is called bounded if there is a constant $C$ such that $\left\Vert x\right\Vert \leq C$ for all $x\in A.$ (1) implies an equivalent definition: the set $A\subset R^n$ is called bounded if there is a constant $C$ such that $\max\{|x_1|,...,|x_n|\}\leq C$ for all $x\in A.$ See p.35 of Baltovic's guide, where the inequality $y_{i}\leq \frac{f(\hat{x})}{p_{i}}$ is sufficient for proving boundedness of the set $Y.$

Theorem 2.2 (The Cauchy-Schwarz Inequality). This inequality does not have serious applications in the guide. For a nontrivial application of the Cauchy-Schwarz inequality see my post.

2.1.8. Avoid using the definition of continuity in terms of $\varepsilon-\delta$ (Definition 2.18). Use Definition 2.19 in terms of sequences instead.

2.6.2. Definition 2.21 for many students is indigestible.
Just say that the matrix $A$ consists of partial derivatives of components of $f=(f_1,...,f_m):$

$A=\left(\begin{array}{ccc} \frac{\partial f_1}{\partial x_1}&...&\frac{\partial f_m}{\partial x_1} \\...&...&...\\ \frac{\partial f_1}{\partial x_n}&...&\frac{\partial f_m}{\partial x_n}\end{array}\right) .$

Theorem 2.11. The proof is really simple in the one-dimensional case. By the definition of the derivative, $\frac{f(x_n)-f(x)}{x_n-x}\rightarrow f^{\prime }(x)$ for any sequence $x_n\rightarrow x.$ Multiplying this ratio by $x_n-x\rightarrow 0$ we get $f(x_{n})-f(x)=\frac{f(x_n)-f(x)}{x_n-x}(x_n-x)\rightarrow f^{\prime }(x)\cdot 0=0$, which proves continuity of $f$ at $x.$

3.3.1. There is Math that happens on paper (formulas) and the one that happens in the head (logic). Many students see the formulas and miss the logic. Carefully read this section and see if the logic happens in your head.

3.4. The solution to Example 3.2 is overblown. A professional mathematician never thinks like that. A pro would explain the idea as follows: because of Condition 2, the function is close to zero in some neighborhood of infinity $\{x:|x|>N\}$. Therefore, a maximum should be looked for in the set $\{x:|x|\leq N\}$. Since this set is compact, the Weierstrass theorem applies. With a proper graphical illustration, the students don't need anything else.

4.2 First-order conditions for optima. See the proof.

4.4 Second-order conditions for optima. See explanation using the Taylor decomposition.

5.3 The Theorem of Lagrange. For the Lagrange method see necessary conditions, sufficient conditions, and case of many constraints.

5.4 Proof of Lagrange's Theorem. See a simple explanation of the constraint qualification condition. The explanation on pp.58-60 is hard to understand because of dimensionality.

5.6 The Lagrangian multipliers. See simpler derivation.

6.4 Proof of the Kuhn-Tucker Theorem.
In case of the Kuhn-Tucker theorem, the most important point is that, once the binding constraints have been determined, the nonbinding ones can be omitted from the analysis. The proof of nonnegativity of the Lagrange multiplier for binding constraints is less than one page.

Example 6.4. In solutions that rely on the Kuhn-Tucker theorem, the author suggests checking the constraint qualification condition for all possible combinations of constraints. Not only is this time consuming, but it is also misleading, given that it is often possible to determine the binding constraints and use the Lagrange method instead of the Kuhn-Tucker theorem or, alternatively, to use the Kuhn-Tucker theorem for eliminating simple cases. The same problem can be solved using the convexity theory.

Example 6.5. In this case Baltovic makes a controversial experiment: what happens if we go the wrong way (expectedly, bad things happen), without providing the correct solution.

Solution to Exercise 6.1. In this exercise, the revenue is homogeneous of degree 2 and the cost is homogeneous of degree 1, which indicates that the profit is infinite. No need to do a three-page analysis!

7.6 The Bellman equations. There are many optimization methods not covered in Sundaram's book. One of them, Pontryagin's maximum principle, is more general than the Bellman approach.

p. 172. The bound $\sum \delta ^{t}|r(s_{t},a_{t})|\leq K\sum \delta ^{t}$ is obvious and does not require the Cauchy-Schwarz inequality.

Example 8.1. See the solution of this example using the Cauchy-Schwarz inequality.

5
Nov 17

## Finite Horizon Dynamic Programming

This is the title of the theory we start studying here. We use the Traveller's problem to explain the main definitions:

1. Set of states, for each time
2. Set of actions, for each time and state
3. Reward, for each action
4. Strategy = set of actions that takes us from A to B (geometrically, it is a path from A to B).
5. Definition of a Markovian strategy: at each time, the action depends only on the state at that time and not on how we got there.
6. Value function of a strategy = sum of all rewards

Video 1. Traveller's problem-Definitions

## The method of backwards induction

By definition, an optimal strategy maximizes the value function.

Statement. Piece of optimal whole PLUS optimal remainder = optimal whole. That is, if we have an optimal strategy leading from A to B, we can take any piece of it, leading from A to an intermediate point, say, C. Then if we find a partial optimal strategy leading from C to B, the combination of the piece A to C of the initial optimal strategy and the partial optimal strategy C to B will be optimal from A to B.

Idea: starting from the end, find optimal remainders for all states.

Video 2. Traveller's problem-Solution

8
Oct 17

## Reevaluating probabilities based on piece of evidence

This actually has to do with Bayes' theorem. However, in simple problems one can use a dead simple approach: just find probabilities of all elementary events. This post builds upon the post on Significance level and power of test, including the notation. Be sure to review that post. Here is an example from the guide for Quantitative Finance by A. Patton (University of London course code FN3142).

Activity 7.2 Consider a test that has a Type I error rate of 5%, and power of 50%. Suppose that, before running the test, the researcher thinks that both the null and the alternative are equally likely.

1. If the test indicates a rejection of the null hypothesis, what is the probability that the null is false?
2. If the test indicates a failure to reject the null hypothesis, what is the probability that the null is true?

Denote events R = {Reject null}, A = {fAil to reject null}; T = {null is True}; F = {null is False}.
Then we are given:

(1) $P(F)=0.5;\ P(T)=0.5;$

(2) $P(R|T)=\frac{P(R\cap T)}{P(T)}=0.05;\ P(R|F)=\frac{P(R\cap F)}{P(F)}=0.5.$

(1) and (2) show that we can find $P(R\cap T)$ and $P(R\cap F)$ and therefore also $P(A\cap T)$ and $P(A\cap F).$ Indeed, $P(R\cap T)=0.05\times 0.5=0.025$, $P(R\cap F)=0.5\times 0.5=0.25$, $P(A\cap T)=0.5-0.025=0.475$ and $P(A\cap F)=0.5-0.25=0.25.$ Once we know probabilities of elementary events, we can find everything about everything.

Figure 1. Elementary events

Answering the first question: just plug probabilities in $P(F|R)=\frac{P(R\cap F)}{P(R)}=\frac{P(R\cap F)}{P(R\cap T)+P(R\cap F)}=\frac{0.25}{0.275}\approx 0.91.$

Answering the second question: just plug probabilities in $P(T|A)=\frac{P(A\cap T)}{P(A)}=\frac{P(A\cap T)}{P(A\cap T)+P(A\cap F)}=\frac{0.475}{0.725}\approx 0.66.$

Patton uses Bayes' theorem and the law of total probability. The solution suggested above uses only additivity of probability.

6
Oct 17

## Significance level and power of test

In this post we discuss several interrelated concepts: null and alternative hypotheses, type I and type II errors and their probabilities. Review the definitions of a sample space and elementary events and that of a conditional probability.

## Type I and Type II errors

Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (like the suspect is guilty) and alternative hypothesis (the suspect is innocent). It's up to us what to call the null and what to call the alternative. However, the statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than other involved probabilities. This is why what is desirable to prove is usually designated as the alternative.

Usually in books you can see the following table.

| State of nature | Decision taken: fail to reject null | Decision taken: reject null |
|---|---|---|
| Null is true | Correct decision | Type I error |
| Null is false | Type II error | Correct decision |

This table is not good enough because there is no link to probabilities. The next video does fill in the blanks.

Video. Significance level and power of test

## Significance level and power of test

The conclusion from the video is that

$\frac{P(T\bigcap R)}{P(T)}=P(R|T)=P\text{(Type I error)=significance level}$

$\frac{P(F\bigcap R)}{P(F)}=P(R|F)=P\text{(Correctly rejecting false null)=Power}$

11
Aug 17

## Violations of classical assumptions 2

This will be a simple post explaining the common observation that "in Economics, variability of many variables is proportional to those variables". Make sure to review the assumptions; they tend to slip from memory. We consider the simple regression

(1) $y_i=a+bx_i+e_i.$

One of the classical assumptions is

Homoscedasticity. All errors have the same variance: $Var(e_i)=\sigma^2$ for all $i$.

We discuss its opposite, which is

Heteroscedasticity. Not all errors have the same variance. It would be wrong to write it as $Var(e_i)\ne\sigma^2$ for all $i$ (which means that all errors have variance different from $\sigma^2$). You can write that not all $Var(e_i)$ are the same but it's better to use the verbal definition.

Remark about Video 1. The dashed lines can represent mean consumption. Then the fact that variation of a variable grows with its level becomes more obvious.

Video 1. Case for heteroscedasticity

Figure 1. Illustration from Dougherty: as x increases, variance of the error term increases

Homoscedasticity was used in the derivation of the OLS estimator variance; under heteroscedasticity that expression is no longer valid. There are other implications, which will be discussed later.

Companies example. The Samsung Galaxy Note 7 battery fires and explosions that caused two recalls cost the smartphone maker at least \$5 billion. There is no way a small company could have such losses.

GDP example. The error in measuring US GDP is on the order of \$200 bln, which is comparable to Kazakhstan's GDP. However, the standard deviation of the ratio error/GDP seems to be about the same across countries, if the underground economy is not too big. Often the assumption that the standard deviation of the regression error is proportional to one of the regressors is plausible.

To see if the regression error is heteroscedastic, you can look at the graph of the residuals or use statistical tests.
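Here is a rough Python sketch of such a check on simulated data. Everything in it is an assumption made for illustration: the coefficients, the error standard deviation proportional to $x$, and the crude comparison of residual variances over the lower and upper halves of $x$ (in the spirit of the Goldfeld-Quandt test, but not a formal test).

```python
import random
import statistics

def ols_simple(x, y):
    """OLS estimates (a, b) for the simple regression y = a + b x + e."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

rng = random.Random(1)
x = [rng.uniform(1.0, 10.0) for _ in range(2000)]
# Hypothetical model: the standard deviation of the error is proportional to x.
y = [2.0 + 0.5 * xi + rng.gauss(0.0, 0.3 * xi) for xi in x]

a_hat, b_hat = ols_simple(x, y)

# Sort residuals by x and compare their variance in the two halves.
pairs = sorted((xi, yi - a_hat - b_hat * xi) for xi, yi in zip(x, y))
half = len(pairs) // 2
var_low = statistics.pvariance([r for _, r in pairs[:half]])
var_high = statistics.pvariance([r for _, r in pairs[half:]])
ratio = var_high / var_low

print(a_hat, b_hat, ratio)
```

A ratio well above one says that the residuals are more spread out for large $x$, which is what the residual plot would show as a widening band.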

10
Jul 17

## Alternatives to simple regression in Stata

In an earlier post we looked at the dependence of EARNINGS on S (years of schooling). At the end I suggested thinking about possible variations of the model. Specifically, could the dependence be nonlinear? We consider two answers to this question.

## Quadratic regression

This name is used for the quadratic dependence of the dependent variable on the independent variable. For our variables the dependence is

$EARNINGS=a+bS+cS^2+u$.

Note that the dependence on S is quadratic but the right-hand side is linear in the parameters, so we still are in the realm of linear regression. Video 1 shows how to run this regression.

Video 1. Running quadratic regression in Stata
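Outside Stata, the point that the model is linear in the parameters can be seen from the following Python sketch: we simply add $S^2$ as a second regressor and solve the usual normal equations. The data are made up and noise-free (so the coefficients are recovered almost exactly), and the helper names are mine; this is not a reproduction of the Stata run in the video.

```python
def solve_linear(a, b):
    """Solve the system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_quadratic(s, earnings):
    """OLS for EARNINGS = a + b S + c S^2 + u via the normal equations."""
    rows = [[1.0, si, si * si] for si in s]  # regressors: constant, S, S^2
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, earnings)) for i in range(3)]
    return solve_linear(xtx, xty)

# Hypothetical data generated from a = 5, b = -0.8, c = 0.1 without noise
schooling = [float(v) for v in range(8, 21)]
earnings = [5.0 - 0.8 * v + 0.1 * v * v for v in schooling]

a_hat, b_hat, c_hat = fit_quadratic(schooling, earnings)
print(a_hat, b_hat, c_hat)
```

With real data one would also want standard errors, which any regression package provides.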

## Nonparametric regression

The general way to write this model is

$y=m(x)+u.$

The beauty and power of nonparametric regression consists in the fact that we don't need to specify the functional form of dependence of $y$ on $x$. Therefore there are no parameters to interpret, there is only the fitted curve. There is also the estimated equation of the nonlinear dependence, which is too complex to consider here. I already illustrated the difference between parametric and nonparametric regression. See in Video 2 how to run nonparametric regression in Stata.

Video 2. Nonparametric dependence
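To give an idea of what happens under the hood, here is a minimal Python sketch of one common nonparametric estimator, the Nadaraya-Watson kernel regression with a Gaussian kernel. I am not claiming this is what the Stata command in the video computes; the data, the function $m(x)=\sin x$ and the bandwidth are assumptions for illustration.

```python
import math

def kernel_regression(x_train, y_train, x0, bandwidth):
    """Nadaraya-Watson estimate of m(x0): a weighted average of the y's,
    with weights that decay as x_i moves away from x0 (Gaussian kernel)."""
    weights = [math.exp(-0.5 * ((x0 - xi) / bandwidth) ** 2) for xi in x_train]
    return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)

# Hypothetical noise-free data on a grid over [0, 6]
x_train = [i / 20 for i in range(121)]
y_train = [math.sin(xi) for xi in x_train]

m_hat = kernel_regression(x_train, y_train, 3.0, bandwidth=0.2)
print(m_hat, math.sin(3.0))
```

No functional form is specified anywhere: the fitted value at each point is a local average of the observed $y$, and the bandwidth controls how local it is.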