Mar 19

AP Statistics the Genghis Khan way 2


Last semester I tried to explain theory through numerical examples. The results were terrible. Even the best students didn't live up to my expectations. The midterm grades were so low that I did something I had never done before: I allowed my students to write an analysis of the midterm at home. Those who were able to verbally articulate the answers to me received a bonus that allowed them to pass the semester.

This semester I made a U-turn. I announced that in the first half of the semester we would concentrate on theory, and we followed this methodology. Out of 35 students, 20 significantly improved their performance and 15 remained where they were.

Midterm exam, version 1

1. General density definition (6 points)

a. Define the density p_X of a random variable X. Draw the density of heights of adults, making simplifying assumptions if necessary. Don't forget to label the axes.

b. According to your plot, how much is the integral \int_{-\infty}^0p_X(t)dt? Explain.

c. Why can't the density be negative?

d. Why should the total area under the density curve be 1?

e. Where are basketball players on your graph? Write down the corresponding expression for probability.

f. Where are dwarfs on your graph? Write down the corresponding expression for probability.

This question is about the interval formula. In each case students have to write the equation for the probability and the corresponding integral of the density. At this level, I don't talk about the distribution function and introduce the density by the interval formula.
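As a sketch of what is expected here (the cut-offs a and b are placeholders chosen by the student, not values fixed by the exam), the interval formula and the expressions for the two tails of the plot take the form:

```latex
P(a<X<b)=\int_a^b p_X(t)\,dt,\qquad
P(X>b)=\int_b^{\infty}p_X(t)\,dt,\qquad
P(X<a)=\int_{-\infty}^a p_X(t)\,dt.
```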

2. Properties of means (8 points)

a. Define a discrete random variable and its mean.

b. Define linear operations with random variables.

c. Prove linearity of means.

d. Prove additivity and homogeneity of means.

e. How much is the mean of a constant?

f. Using induction, derive the linearity of means for the case of n variables from the case of two variables (3 points).

3. Covariance properties (6 points)

a. Derive linearity of covariance in the first argument when the second is fixed.

b. How much is covariance if one of its arguments is a constant?

c. What is the link between variance and covariance? If you know one of these functions, can you find the other (there should be two answers)? (4 points)

4. Standard normal variable (6 points)

a. Define the density p_z(t) of a standard normal.

b. Why is the function p_z(t) even? Illustrate this fact on the plot.

c. Why is the function f(t)=tp_z(t) odd? Illustrate this fact on the plot.

d. Justify the equation Ez=0.

e. Why is V(z)=1?

f. Let t>0. Show on the same plot areas corresponding to the probabilities A_1=P(0<z<t), A_2=P(z>t), A_3=P(z<-t), A_4=P(-t<z<0). Write down the relationships between A_1,...,A_4.

5. General normal variable (3 points)

a. Define a general normal variable X.

b. Use this definition to find the mean and variance of X.

c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters \sigma =2, \mu =3.

Midterm exam, version 2

1. General density definition (6 points)

a. Define the density p_X of a random variable X. Draw the density of work experience of adults, making simplifying assumptions if necessary. Don't forget to label the axes.

b. According to your plot, how much is the integral \int_{-\infty}^0p_X(t)dt? Explain.

c. Why can't the density be negative?

d. Why should the total area under the density curve be 1?

e. Where are retired people on your graph? Write down the corresponding expression for probability.

f. Where are young people (up to 25 years old) on your graph? Write down the corresponding expression for probability.

2. Variance properties (8 points)

a. Define variance of a random variable. Why is it non-negative?

b. Define the formula for variance of a linear combination of two variables.

c. How much is variance of a constant?

d. What is the formula for variance of a sum? What do we call homogeneity of variance?

e. What is larger: V(X+Y) or V(X-Y)? (2 points)

f. One investor has 100 shares of Apple, another - 200 shares. Which investor's portfolio has larger variability? (2 points)

3. Poisson distribution (6 points)

a. Write down the Taylor expansion and explain the idea. How are the Taylor coefficients found?

b. Use the Taylor series for the exponential function to define the Poisson distribution.

c. Find the mean of the Poisson distribution. What is the interpretation of the parameter \lambda in practice?

4. Standard normal variable (6 points)

a. Define the density p_z(t) of a standard normal.

b. Why is the function p_z(t) even? Illustrate this fact on the plot.

c. Why is the function f(t)=tp_z(t) odd? Illustrate this fact on the plot.

d. Justify the equation Ez=0.

e. Why is V(z)=1?

f. Let t>0. Show on the same plot areas corresponding to the probabilities A_1=P(0<z<t), A_2=P(z>t), A_{3}=P(z<-t), A_4=P(-t<z<0). Write down the relationships between A_{1},...,A_{4}.

5. General normal variable (3 points)

a. Define a general normal variable X.

b. Use this definition to find the mean and variance of X.

c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters \sigma =2, \mu =3.

Mar 19

AP Statistics the Genghis Khan way 1


Recently I enjoyed reading Jack Weatherford's "Genghis Khan and the Making of the Modern World" (2004). I was reading the book with a specific question in mind: what were the main reasons for the success of the Mongols? Here you can see the list of their innovations, some of which were in fact adapted from the nations they subjugated. But what was the main driving force behind those innovations? The conclusion I came to is that Genghis Khan was a brilliant psychologist. He used what he knew about individual and social psychology to constantly improve the government of his empire.

I am no Genghis Khan but I try to base my teaching methods on my knowledge of student psychology.

Problems and suggested solutions

Steven Krantz in his book (How to Teach Mathematics, second edition, 1998; I don't remember the page) says something like this: if you want your students to do something, arrange your classes so that they do it in class.

Problem 1. Students mechanically write down what the teacher says and writes.

Solution. I don't allow my students to write while I am explaining the material. When I explain, their task is to listen and try to understand. I invite them to ask questions and prompt me to write more explanations and comments. After they all say "We understand", I clean the board and then they write down whatever they understood and remembered.

Problem 2. Students are not used to analyzing what they read or write.

Solution. After students finish their writing, I ask them to exchange notebooks and check each other's writings. It's easier for them to do this while everything is fresh in their memory. I bought and distributed red pens. When they see that something is missing or wrong, they have to write in red. Errors or omissions must stand out. Thus, right there in the class students repeat the material twice.

Problem 3. Students don't study at home.

Solution. I let my students know in advance what the next quiz will be about. Even with this knowledge, most of them don't prepare at home. Before the quiz I give them about half an hour to repeat and discuss the material (this is at least the third repetition). We start the quiz when they say they are ready.

Problem 4. Students don't understand that active repetition (writing without looking at one's notes) is much more productive than passive repetition (just reading the notes).

Solution. Each time before discussion sessions I distribute scratch paper and urge students to write, not just read or talk. About half of them follow my recommendation. Their desire to keep their notebooks neat is not their last consideration. The solution to Problem 1 also hinges upon active repetition.

Problem 5. If students work and are evaluated individually, usually there is no or little interaction between them.

Solution. My class is divided into teams (currently I have teams of two to six people). I randomly select one person from each team to write the quiz. That person's grade is the team's grade. This forces better students to coach others and weaker students to seek help.

Problem 6. Some students don't want to work in teams. They are usually either good students, who don't want to suffer because of weak team members, or weak students, who don't want their low grades to harm other team members.

Solution. The good students usually argue that it's not fair if their grade becomes lower because of somebody else's fault. My answer to them is that the meaning of fairness depends on the definition. In my grading scheme, 30 points out of 100 are allocated for team work and the rest for individual achievements. Therefore I never allow good students to work individually. I want them to be my teaching assistants and help other students. While doing so, I tell them that I may reward good students with a bonus at the end of the semester. In some cases I allow weak students to write quizzes individually but only if the team so requests. The request of the weak student doesn't matter. The weak student still has to participate in team discussions.

Problem 7. There is no accumulation of theoretical knowledge (flat learning curve).

Solution. a) Most students come from high school with little experience in algebra. I raise the level gradually and emphasize understanding. Students never see multiple choice questions in my classes. They also know that right answers without explanations will be discarded.

b) Normally, during my explanations I fill out the board. The amount of the information the students have to remember is substantial and increases over time. If you know a better way to develop one's internal vision, let me know.

c) I don't believe in learning the theory by doing applied exercises. After explaining the theory I formulate it as a series of theoretical exercises. I give the theory in large, logically consistent blocks for students to see the system. Half of exam questions are theoretical (students have to provide proofs and derivations) and the other half - applied.

d) The right motivation can be of two types: theoretical or applied, and I never substitute one for another.

Problem 8. In low-level courses you need to conduct frequent evaluations to keep your students in working shape. Multiply that by the number of students, and you get a serious teaching overload.

Solution. Once at a teaching conference in Prague my colleague from New York boasted that he grades 160 papers per week. Evaluating one paper per team saves you from that hell.


At the beginning of the academic year I had 47 students. In the second semester 12 students dropped the course entirely or enrolled in Stats classes taught by other teachers. Based on current grades, I expect 15 more students to fail. Thus, after the first year I'll have about 20 students in my course (if they don't fail other courses). These students will master statistics at the level of my book.

Nov 18

Little tricks for AP Statistics


This year I am teaching AP Statistics. If things continue the way they are, about half of the class will fail. Here is my diagnosis and how I am handling the problem.

On the surface, the students lack algebra training but I think the problem is deeper: many of them have underdeveloped cognitive abilities. Their perception is slow, memory is limited, analytical abilities are rudimentary and they are not used to working at home. Limited resources require careful allocation.


Short and intuitive names are better than two-word professional names.

Instead of "sample space" or "probability space" say "universe". The universe is the widest possible event, and nothing exists outside it.

Instead of "elementary event" say "atom". Simplest possible events are called atoms. This corresponds to the theoretical notion of an atom in measure theory (an atom is a measurable set which has positive measure and contains no set of smaller positive measure).

Then the formulation of classical probability becomes short. Let n denote the number of atoms in the universe and let n_A be the number of atoms in event A. If all atoms are equally likely (have equal probabilities), then P(A)=n_A/n.
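The formula P(A)=n_A/n can be tried out with a few lines of Python. This is only a minimal sketch: the function name and the die example are made up for illustration, and exact fractions are used so that the answer is not blurred by rounding.

```python
from fractions import Fraction


def classical_probability(event, universe):
    """P(A) = n_A / n when all atoms of the universe are equally likely."""
    event = set(event)
    universe = set(universe)
    if not event <= universe:
        raise ValueError("the event must consist of atoms of the universe")
    return Fraction(len(event), len(universe))


# Hypothetical example: one roll of a fair die.
universe = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
print(classical_probability(even, universe))  # 1/2
```

The universe itself gets probability 1 and the empty event gets probability 0, as expected.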

The clumsy "mutually exclusive events" are better replaced by more visual "disjoint sets". Likewise, instead of "collectively exhaustive events" say "events that cover the universe".

The combination "mutually exclusive" and "collectively exhaustive" events is beyond comprehension for many. I say: if events are disjoint and cover the universe, we call them tiles. To support this definition, play one of the jigsaw puzzles onscreen (Video 1) and produce the picture from Figure 1.

Video 1. Tiles (disjoint events that cover the universe)


Figure 1. Tiles (disjoint events that cover the universe)

The philosophy of team work

We are in the same boat. I mean the big boat. Not the class. Not the university. It's the whole country. We depend on each other. Failure of one may jeopardize the well-being of everybody else.

You work in teams. You help each other to learn. My lectures and your presentations are just the beginning of the journey of knowledge into your heads. I cannot control how it settles there. Be my teaching assistants, share your big and little discoveries with your classmates.

I don't just preach about you helping each other. I force you to work in teams. 30% of the final grade is allocated to team work. Team work means joint responsibility. You work on assignments together. I randomly select a team member for reporting. His or her grade is what each team member gets.

This kind of team work is incompatible with the Western obsession with grades privacy. If I say my grade is nobody's business, by extension I consider the level of my knowledge a private issue. This will prevent me from asking for help and admitting my errors. The situation when students hide their errors and weaknesses from others also goes against the ethics of many workplaces. In my class all grades are public knowledge.

In some situations, keeping the grade private is technically impossible. Conducting a competition without announcing the points won is impossible. If I catch a student cheating, I announce the failing grade immediately, as a warning to others.

To those of you who think team-based learning is unfair to better students I repeat: 30% of the final grade is given for team work, not for personal achievements. The other 70% is where you can shine personally.

Breaking the wall of silence

Team work serves several purposes.

Firstly, joint responsibility helps break communication barriers. Video 2 shows my students working in teams on classroom assignments. The situation when a weaker student is too proud to ask for help and a stronger student doesn't want to offend by offering help is not acceptable. One can ask for help or offer help without losing respect for each other.

Video 2. Teams working on assignments

Secondly, it turns on resources that are otherwise idle. Explaining something to somebody is the best way to improve your own understanding. The better students master a kind of leadership that is especially valuable in a modern society. For the weaker students, feeling responsible for a team improves motivation.

Thirdly, I save time by having to grade fewer student papers.

On exams and quizzes I mercilessly punish the students for Yes/No answers without explanations. There are no half-points for half-understanding. This, in combination with the team work and open grades policy, allows me to achieve my main objective: students are eager to talk to me about their problems.

Set operations and probability

After studying the basics of set operations and probabilities we had a midterm exam. It revealed that about one-third of students didn't understand this material and some of that misunderstanding came from high school. During the review session I wanted to see if they were ready for a frank discussion and told them: "Those who don't understand probabilities, please raise your hands", and about one-third raised their hands. I invited two of them to work at the board.

Video 3. Translating verbal statements to sets, with accompanying probabilities

Many teachers think that the Venn diagrams explain everything about sets because they are visual. No, for some students they are not visual enough. That's why I prepared a simple teaching aid (see Video 3) and explained the task to the two students as follows:

I am shooting at the target. The target is a square with two circles on it, one red and the other blue. The target is the universe (the bullet cannot hit points outside it). The probability of a set is its area. I am going to tell you one statement after another. You write that statement in the first column of the table. In the second column write the mathematical expression for the set. In the third column write the probability of that set, together with any accompanying formulas that you can come up with. The formulas should reflect the relationships between relevant areas.

Table 1. Set operations and probabilities

| Statement | Set | Probability |
|---|---|---|
| 1. The bullet hit the universe | S | P(S)=1 |
| 2. The bullet didn't hit the universe | \emptyset | P(\emptyset )=0 |
| 3. The bullet hit the red circle | A | P(A) |
| 4. The bullet didn't hit the red circle | \bar{A}=S\backslash A | P(\bar{A})=P(S)-P(A)=1-P(A) |
| 5. The bullet hit both the red and blue circles | A\cap B | P(A\cap B) (in general, this is not equal to P(A)P(B)) |
| 6. The bullet hit A or B (or both) | A\cup B | P(A\cup B)=P(A)+P(B)-P(A\cap B) (additivity rule) |
| 7. The bullet hit A but not B | A\backslash B | P(A\backslash B)=P(A)-P(A\cap B) |
| 8. The bullet hit B but not A | B\backslash A | P(B\backslash A)=P(B)-P(A\cap B) |
| 9. The bullet hit either A or B (but not both) | (A\backslash B)\cup(B\backslash A) | P\left((A\backslash B)\cup(B\backslash A)\right)=P(A)+P(B)-2P(A\cap B) |

During the process, I was illustrating everything on my teaching aid. This exercise allows the students to relate verbal statements to sets and further to their areas. The main point is that people need to see the logic, and that logic should be repeated several times through similar exercises.
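The same relationships can be checked on a computer. The sketch below is an illustration only: the target is replaced by a hypothetical 10-by-10 grid of equally likely atoms, and two overlapping squares stand in for the red and blue circles, so that every probability is an exact fraction.

```python
from fractions import Fraction

# Hypothetical discretized target: a 10-by-10 grid of equally likely atoms.
S = {(i, j) for i in range(10) for j in range(10)}
A = {(i, j) for (i, j) in S if i < 6 and j < 6}    # stand-in for the red circle
B = {(i, j) for (i, j) in S if i >= 4 and j >= 4}  # stand-in for the blue circle


def P(event):
    """Probability as relative area: atoms in the event over atoms in S."""
    return Fraction(len(event), len(S))


complement_ok = P(S - A) == 1 - P(A)                              # row 4
union_ok = P(A | B) == P(A) + P(B) - P(A & B)                     # row 6
difference_ok = P(A - B) == P(A) - P(A & B)                       # row 7
either_ok = P((A - B) | (B - A)) == P(A) + P(B) - 2 * P(A & B)    # row 9
not_product = P(A & B) != P(A) * P(B)                             # remark in row 5
print(complement_ok, union_ok, difference_ok, either_ok, not_product)
```

Changing the shapes of A and B changes the individual areas but not the identities in rows 4, 6, 7 and 9.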

May 18

Law of total probability - you could have invented this


A knight wants to kill (event K) a dragon. There are two ways to do this: by fighting (event F) the dragon or by outwitting (O) it. The choice of the way (F or O) is random, and in each case the outcome (K or not K) is also random. For the probability of killing there is a simple, intuitive formula:

P(K)=P(K|F)P(F)+P(K|O)P(O).

Its derivation is straightforward from the definition of conditional probability: since F and O cover the whole sample space and are disjoint, we have by additivity of probability

P(K)=P(K\cap(F\cup O))=P(K\cap F)+P(K\cap O)=\frac{P(K\cap F)}{P(F)}P(F)+\frac{P(K\cap O)}{P(O)}P(O)=P(K|F)P(F)+P(K|O)P(O).


This is easy to generalize to the case of many conditioning events. Suppose A_1,...,A_n are mutually exclusive (that is, disjoint) and collectively exhaustive (that is, cover the whole sample space). Then for any event B one has

P(B)=\sum_{i=1}^nP(B|A_i)P(A_i).

This equation is called the law of total probability.
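Here is a small numerical illustration of the law of total probability. The probabilities assigned to the knight are invented for the example, and the function name is hypothetical.

```python
def total_probability(conditionals, priors):
    """P(B) = sum of P(B|A_i) * P(A_i) over disjoint events A_i covering the sample space."""
    if len(conditionals) != len(priors):
        raise ValueError("need one conditional probability per conditioning event")
    if abs(sum(priors) - 1.0) > 1e-12:
        raise ValueError("the conditioning events must cover the sample space")
    return sum(c * p for c, p in zip(conditionals, priors))


# Hypothetical numbers: the knight fights with probability 0.3 and outwits
# with probability 0.7; he kills the dragon with probability 0.2 when
# fighting and 0.5 when outwitting.
p_kill = total_probability([0.2, 0.5], [0.3, 0.7])
print(round(p_kill, 2))  # 0.41
```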

Application to a sum of continuous and discrete random variables

Let X,Y be independent random variables. Suppose that X is continuous, with a distribution function F_X, and suppose Y is discrete, with values y_1,...,y_n. Then for the distribution function of the sum F_{X+Y} we have

F_{X+Y}(t)=P(X+Y\le t)=\sum_{j=1}^nP(X+Y\le t|Y=y_j)P(Y=y_j)

(by independence conditioning on Y=y_j can be omitted)

=\sum_{j=1}^nP(X\le t-y_j)P(Y=y_j)=\sum_{j=1}^nF_X(t-y_j)P(Y=y_j).

Compare this to the much more complex derivation in case of two continuous variables.
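The resulting formula is short enough to compute directly. In the sketch below the choice of X (standard normal) and Y (a fair coin with values 0 and 1) is an assumption made only for illustration; the function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Distribution function of a standard normal variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf_of_sum(t, cdf_x, values, probs):
    """F_{X+Y}(t) = sum_j F_X(t - y_j) P(Y = y_j) for independent X and Y."""
    return sum(cdf_x(t - y) * p for y, p in zip(values, probs))


# Hypothetical example: X is standard normal, Y takes values 0 and 1
# with probability 1/2 each.
print(cdf_of_sum(0.5, normal_cdf, [0, 1], [0.5, 0.5]))
```

In this example X+Y is symmetric around 0.5, so the printed value should be 0.5 up to rounding error.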


Apr 18

Distribution function estimation


The relativity theory says that what initially looks absolutely difficult, on closer examination turns out to be relatively simple. Here is one such topic. We start with a motivating example.

Large cloud service providers have huge data centers. A data center, being a large group of computer servers, typically requires extensive air conditioning. The intensity and cost of air conditioning depend on the temperature of the surrounding environment. If we denote by T the temperature outside and by t a cut-off value, then a cloud service provider is interested in knowing the probability P(T\le t) for different values of t. This is exactly the distribution function of temperature: F_T(t)=P(T\le t). So how do you estimate it?

It comes down to usual sampling. Fix some cut-off, for example, t=20 and see for how many days in a year the temperature does not exceed 20. If the number of such days is, say, 200, then 200/365 will be the estimate of the probability P(T\le 20).

It remains to dress this idea in mathematical clothes.

Empirical distribution function

If an observation T_i belongs to the event \{T\le 20\}, we count it as 1, otherwise we count it as zero. That is, we are dealing with a dummy variable

(1) 1_{\{T\le 20\}}=\left\{\begin{array}{ll}1,&T\le 20;\\0,&T>20.\end{array}\right.

The total count is \sum 1_{\{T_i\le 20\}} and this is divided by the total number of observations, which is 365, to get 200/365.

It is important to realize that the variable in (1) is a coin (Bernoulli variable). For an unfair coin C with probability of 1 equal to p and probability of zero equal to 1-p the mean is

EC=p\times 1+(1-p)\times 0=p

and the variance is

Var(C)=E(C^2)-(EC)^2=p-p^2=p(1-p).

For the variable in (1) p=P\{1_{\{T\le 20\}}=1\}=P(T\le 20)=F_T(20), so the mean and variance are

(2) E1_{\{T\le 20\}}=F_T(20),\ Var(1_{\{T\le 20\}})=F_T(20)(1-F_T(20)).

Generalizing, the probability F_T(t)=P(T\le t) is estimated by

(3) \frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}

where n is the number of observations. (3) is called an empirical distribution function because it is a direct empirical analog of P(T\le t).
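Formula (3) translates into code almost word for word. The following is a minimal sketch; the function name and the temperature readings are invented, and a real sample would contain 365 daily values.

```python
def empirical_cdf(observations, t):
    """Share of observations not exceeding t, as in formula (3)."""
    if not observations:
        raise ValueError("need at least one observation")
    return sum(1 for x in observations if x <= t) / len(observations)


# Hypothetical daily temperatures.
temperatures = [12.5, 18.0, 20.0, 23.1, 25.4, 19.9, 30.2, 15.0]
print(empirical_cdf(temperatures, 20))  # 0.625
```

Five of the eight readings do not exceed 20, hence the estimate 5/8.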

Applying expectation to (3) and using an equation similar to (2), we prove unbiasedness of our estimator:

(4) E\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}=P(T\le t)=F_T(t).

Further, assuming independent observations we can find variance of (3):

(5) Var\left(\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}\right) (using homogeneity of degree 2)

=\frac{1}{n^2}Var\left(\sum_{i=1}^n 1_{\{T_i\le t\}}\right) (using independence)

=\frac{1}{n^2}\sum_{i=1}^nVar(1_{\{T_i\le t\}}) (applying an equation similar to (2))

=\frac{1}{n^2}\sum_{i=1}^nF_T(t)(1-F_T(t))=\frac{1}{n}F_T(t)(1-F_T(t)).

Corollary. (4) and (5) can be used to prove that (3) is a consistent estimator of the distribution function, i.e., (3) converges to F_T(t) in probability.
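One possible way to fill in the details is Chebyshev's inequality. In this sketch \hat{F}_n(t) denotes the estimator (3) (the hat notation is introduced here only for brevity), and the observations are assumed independent and identically distributed. By (4) the mean of \hat{F}_n(t) is F_T(t), and by (2) and independence its variance is F_T(t)(1-F_T(t))/n, so for any \varepsilon>0

```latex
P\left(|\hat{F}_n(t)-F_T(t)|\geq\varepsilon\right)
\leq\frac{Var(\hat{F}_n(t))}{\varepsilon^2}
=\frac{F_T(t)(1-F_T(t))}{n\varepsilon^2}
\leq\frac{1}{4n\varepsilon^2}\rightarrow 0\quad\text{as } n\rightarrow\infty .
```

The last bound uses the fact that p(1-p)\le 1/4 for any p between 0 and 1.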




Mar 18

Intro to option greeks: delta and its determinants


I started trading stocks in 2010. I didn't expect to make big profits and wasn't actively trading. That's until 2015, when I met a guy who turned $10,000 into $140,000 in four years. And then I thought: why am I fooling around when it's possible to make good money? Experienced traders say: trading is a journey. That's how my journey started. Stocks move too slowly, to my taste, so I had to look for other avenues.

Two things were clear to me. I didn't want to be glued to the monitor the whole day and didn't want to study a lot of theory. Therefore I decided to concentrate on the futures market. To trade futures, you don't even need to know the definition of a futures contract. The price moves very quickly, and if you know what you are doing, you can make a couple of hundred dollars in a matter of minutes. It turned out that the futures markets are the best approximation to the efficient market hypothesis. Trend is your friend (until the end), as they say. In the futures markets, trends are rare and short-lived. Trading futures is like driving a race car. The psychological stress is enormous and it may excite your worst instincts. After trying for almost two years and losing $8,000 I gave up. Don't trade futures unless you can predict a big move.

Many people start their trading careers in the forex market because the volumes there are large and transaction fees are low. I never traded forex and think that it is as risky as the futures market. If you want to try it, I would suggest trading not the exchange rates themselves but indexes or ETFs (exchange-traded funds) that track them. Again, look for large movements.

One more market I don't want to trade is bonds. Actions of central banks and macroeconomic events are among strong movers of this market. Otherwise, it's the same as futures. Futures, forex and bonds have one feature in common. In all of them institutional (large) traders dominate. My impression is that in the absence of market-moving events they select a range within which to trade. Having deep pockets, they can buy at the top of the range and sell at the bottom without worrying about the associated loss. Trading in a range like that will kill a retail (small) investor. Changes in fundamentals force the big guys to shift the range, and that's when small investors have a chance to profit.

I tried to avoid options because they require learning some theory. After a prolonged resistance, I started trading options and immediately fell in love with them. I think that anybody with $25,000 in savings can and should be trading options.

Definition. In Math, the Greek letter \Delta (delta) is usually used to denote change or rate of change. In case of options, it's the rate of change of the option price when the stock price changes. Mathematically, it's a derivative \Delta=\frac{\partial c}{\partial S} where c is the call price and S is the stock price. In layman's terms, when the stock price changes by $1, the call price changes (moves in the same direction) by \Delta dollars. The basic features of delta can be understood by looking at how it depends on the strike price, when time is fixed, and how it changes with time, when the strike price is fixed. As before, we concentrate on probabilistic intuition.

How delta depends on strike price


Figure 1. AAPL option chain with 26 days to expiration

Look at the option chain in Figure 1. For the strikes that are deep in the money, delta is close to one. This is because if a call option is deep in the money, the probability that it will end up in the money by expiration is high (see how the call price depends on the strike price). Hence, stock price changes are followed by call price changes almost one to one. On the other hand, if a strike is far out of the money, it is likely to remain out of the money by expiration. The stock price changes have little effect on the call price. That's why delta is close to zero.

How delta depends on time to expiration


Figure 2. AAPL option chain with 5 days to expiration

Now let us compare that option chain to the one with a shorter time to expiration (see Figure 2). If an option is to expire soon, the probability of a drastic stock movement before expiration is low (see the comparison of areas of influence with different times to expiration). Only a few options with strikes lower than the at-the-money strike have deltas different from one. The deeper in the money calls have deltas equal to one: their prices exactly repeat the stock price. Similarly, only a few out of the money options have deltas different from zero. If the strike is very far out of the money, the call delta is 0 because the call is very likely to expire worthless and its dependence on the stock price is negligible.
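Both patterns can be reproduced with a model. The sketch below assumes the Black-Scholes model for a European call on a stock that pays no dividends, in which delta equals N(d_1); the option chains in the figures need not have been computed this way, and the stock price, interest rate and volatility are made-up inputs.

```python
import math


def normal_cdf(x):
    """Distribution function of a standard normal variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def call_delta(S, K, r, sigma, T):
    """Black-Scholes delta of a European call (no dividends), T in years."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    return normal_cdf(d1)


# Hypothetical inputs: stock at 100, 2% interest rate, 25% volatility.
S, r, sigma = 100.0, 0.02, 0.25
strikes = (80, 95, 100, 105, 120)
for days in (26, 5):
    T = days / 365
    deltas = [round(call_delta(S, K, r, sigma, T), 3) for K in strikes]
    print(days, "days:", deltas)
```

With these inputs the deltas fall from nearly one to nearly zero as the strike rises, and with 5 days left the transition around the at-the-money strike is sharper than with 26 days.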

Mar 18

Interest rate - the puppetmaster behind option prices



Figure 1. Call as a function of interest rate

The interest rate is the last variable we need to discuss. The dependence of the call price on the interest rate that emerges from the Black-Scholes formula is depicted in Figure 1. The dependence is positive, right? Not so fast. This is the case when common sense should be used instead of mathematical models. One economic factor can influence another through many channels, often leading to contradicting results.

John Hull offers two explanations.

  1. As interest rates in the economy increase, the expected return required by investors from the stock tends to increase. This suggests a positive dependence of the call price on the interest rate.
  2. On the other hand, when the interest rate rises, the present value of any future cash flow received by the long call holder decreases. In particular, this reduces the payoff if at expiration the option is in the money.

The combined impact of these two effects embedded in the Black-Scholes formula is to increase the value of the call options.
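The positive dependence built into the formula is easy to see numerically. This is a sketch under the standard Black-Scholes assumptions (European call, no dividends); the inputs are invented, and the stock price and all other inputs are held fixed while only the interest rate varies.

```python
import math


def normal_cdf(x):
    """Distribution function of a standard normal variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def call_price(S, K, r, sigma, T):
    """Black-Scholes price of a European call (no dividends), T in years."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * normal_cdf(d1) - K * math.exp(-r * T) * normal_cdf(d2)


# Hypothetical inputs: at-the-money call, one year to expiration,
# 25% volatility; only the interest rate changes.
rates = (0.00, 0.02, 0.04, 0.06)
prices = [call_price(100.0, 100.0, r, 0.25, 1.0) for r in rates]
for r, c in zip(rates, prices):
    print(r, round(c, 2))
```

The printed prices increase with the rate, which is the pattern in Figure 1.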

However, experience tells the opposite

The Fed changes the interest rate at discrete times, not continually. Two moments matter: when the rumor about the upcoming interest rate change hits the market and when the actual change takes place. The market reaction to the rumor depends on the investors' mood - bullish or bearish. A bullish market tends to shrug off most bad news. In a bearish market, even a slight threat may have drastic consequences. By the time the actual change occurs, it is usually priced in.

In 2017, the Fed raised the rate three times: on March 15 (no reaction, judging by S&P 500 SPDR SPY), June 14 (again no reaction) and December 13 (a slight fall). The huge fall at the beginning of February 2018 was not caused by any actual change. It was an accumulated result of cautiousness ("This market has been bullish for too long!") and fears that the Fed would increase the rates in 2018 by more than had been anticipated. Many investors started selling stocks and buying bonds and other less risky assets. The total value of US bonds is about $20 trillion, while the total market capitalization of US stocks is around $30 trillion. The two markets are comparable in size, which means there is enough room to move from one to another and the total portfolio reshuffling can be considerable. Thus far, the mere expectation that the interest rate will increase has been able to substantially reduce stock prices and, consequently, call prices.

All this I summarized to my students as follows. When interest rates rise, bonds become more attractive. This is a substitution effect: investors switch from one asset to another all the time. Therefore stock prices and call prices fall. Thus the dependence of call prices on interest rates is negative.

The first explanation suggested by Hull neglects the substitution effect. The second explanation is not credible either, for the following reason. As I explained, stock volatility has a very strong influence on options. Options themselves have an even higher volatility. A change in interest rates by a couple percent is nothing in comparison with this volatility. Most investors would not care about the resulting reduction in the present value of future cash flows.

Dec 17

How to study mt3042 Optimisation: a guide to a guide


Section and example numbering follows that of the mt3042 Optimization Guide by M. Baltovic.

Main idea: look at geometry in the two-dimensional case

Here is an example. The norm of a vector x\in R^n is defined by \left\Vert x\right\Vert =\sqrt{\sum_{i=1}^nx_i^2}. The combination of squaring and extracting a square root often makes it difficult to understand how this construction works. Here is a simple inequality that allows one to do without this norm (or, to put it differently, replace it with another norm). Take n=2.

\max \{|x_1|,|x_2|\}=\max \{\sqrt{x_1^2},\sqrt{x_2^2}\}\leq\max\{\sqrt{x_1^2+x_2^2},\sqrt{x_1^2+x_2^2}\}=\left\Vert  x\right\Vert =\sqrt{x_1^2+x_2^2}\leq\sqrt{\max\{|x_1|,|x_2|\}^2+\max \{|x_1|,|x_2|\}^2}=\sqrt{2}\max\{|x_1|,|x_2|\}.

We have proved that \max \{|x_1|,|x_2|\}\leq\left\Vert x\right\Vert\leq\sqrt{2}\max\{|x_1|,|x_2|\}. This easily generalizes to R^{n}:

(1) \max \{|x_1|,...,|x_n|\}\leq\left\Vert x\right\Vert\leq\sqrt{n}\max\{|x_1|,...,|x_n|\}.
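
Inequality (1) is easy to check numerically. The snippet below is only a sanity check on random vectors, not a proof; the function names, the sample sizes and the small floating-point tolerance are my own choices for illustration.

```python
import math
import random


def euclidean_norm(x):
    """Square root of the sum of squared coordinates."""
    return math.sqrt(sum(xi * xi for xi in x))


def max_norm(x):
    """Largest absolute value among the coordinates."""
    return max(abs(xi) for xi in x)


def check_inequality(x, tol=1e-9):
    """Check max|x_i| <= ||x|| <= sqrt(n) * max|x_i| up to rounding error."""
    n = len(x)
    lower = max_norm(x)
    middle = euclidean_norm(x)
    upper = math.sqrt(n) * lower
    return lower <= middle + tol and middle <= upper + tol


random.seed(0)  # fixed seed so that the run is reproducible
samples = [
    [random.uniform(-10, 10) for _ in range(n)]
    for n in range(1, 8)
    for _ in range(100)
]
all_ok = all(check_inequality(x) for x in samples)
print("inequality (1) holds on all samples:", all_ok)
```

The left inequality becomes an equality when only one coordinate is nonzero, and the right one when all coordinates have the same absolute value, so neither constant in (1) can be improved.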

Application. The set A\subset R^n is called bounded if there is a constant C such that \left\Vert x\right\Vert \leq C for all x\in A. (1) implies an equivalent definition: the set A\subset R^n is called bounded if there is a constant C such that \max\{|x_1|,...,|x_n|\}\leq C for all x\in A. See p.35 of Baltovic's guide, where the inequality y_{i}\leq \frac{f(\hat{x})}{p_{i}} is sufficient for proving boundedness of the set Y.

Theorem 2.2 (The Cauchy-Schwarz Inequality). This inequality does not have serious applications in the guide. For a nontrivial application of the Cauchy-Schwarz inequality see my post.

2.1.8. Avoid using the definition of continuity in terms of \varepsilon-\delta (Definition 2.18). Use Definition 2.19 in terms of sequences instead.

2.6.2. Definition 2.21 for many students is indigestible. Just say that the matrix A consists of partial derivatives of components of f=(f_1,...,f_m):

A=\left(\begin{array}{ccc}  \frac{\partial f_1}{\partial x_1}&...&\frac{\partial f_m}{\partial x_1} \\...&...&...\\  \frac{\partial f_1}{\partial x_n}&...&\frac{\partial f_m}{\partial x_n}\end{array}\right) .

Theorem 2.11. The proof is really simple in the one-dimensional case. By the definition of the derivative, \frac{f(x_n)-f(x)}{x_n-x}\rightarrow  f^{\prime }(x) for any sequence x_n\rightarrow x with x_n\neq x. Multiplying this ratio by x_n-x\rightarrow 0 we get f(x_{n})-f(x)=\frac{f(x_n)-f(x)}{x_n-x}(x_n-x)\rightarrow f^{\prime }(x)\cdot 0=0, which proves continuity of f at x.

3.3.1. There is math that happens on paper (formulas) and math that happens in the head (logic). Many students see the formulas and miss the logic. Carefully read this section and see if the logic happens in your head.

3.4. The solution to Example 3.2 is overblown. A professional mathematician never thinks like that. A pro would explain the idea as follows: because of Condition 2, the function is close to zero in some neighborhood of infinity \{x:|x|>N\}. Therefore, a maximum should be looked for in the set \{x:|x|\leq N\}. Since this set is compact, the Weierstrass theorem applies. With a proper graphical illustration, the students don't need anything else.

4.2 First-order conditions for optima. See the proof.

4.4 Second-order conditions for optima. See explanation using the Taylor decomposition.

5.3 The Theorem of Lagrange. For the Lagrange method see necessary conditions, sufficient conditions, and case of many constraints.

5.4 Proof of Lagrange's Theorem. See a simple explanation of the constraint qualification condition. The explanation on pp.58-60 is hard to understand because of dimensionality.

5.6 The Lagrangian multipliers. See simpler derivation.

6.4 Proof of the Kuhn-Tucker Theorem. In the case of the Kuhn-Tucker theorem, the most important point is that, once the binding constraints have been determined, the nonbinding ones can be omitted from the analysis. The proof of nonnegativity of the Lagrange multiplier for binding constraints is less than one page.

Example 6.4. In solutions that rely on the Kuhn-Tucker theorem, the author suggests checking the constraint qualification condition for all possible combinations of constraints. Not only is this time-consuming, it is also misleading, because it is often possible to determine the binding constraints and use the Lagrange method instead of the Kuhn-Tucker theorem or, alternatively, to use the Kuhn-Tucker theorem for eliminating simple cases. The same problem can be solved using the convexity theory.

Example 6.5. In this case Baltovic makes a controversial experiment: what happens if we go the wrong way (expectedly, bad things happen), without providing the correct solution.

Solution to Exercise 6.1. In this exercise, the revenue is homogeneous of degree 2 and the cost is homogeneous of degree 1, which indicates that the profit is infinite. No need to do a three-page analysis!
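
The homogeneity argument can be illustrated numerically. The revenue and cost functions below are hypothetical stand-ins (they are not the functions from Exercise 6.1); they only share the degrees of homogeneity, 2 and 1. The sketch also assumes that inputs can be scaled up without limit and that revenue is positive at the starting point. Along the ray (tx, ty) the profit equals t^2 R(x,y) - t C(x,y), so it grows without bound as t increases.

```python
def revenue(x, y):
    # Hypothetical revenue, homogeneous of degree 2: R(tx, ty) = t^2 R(x, y)
    return x * y


def cost(x, y):
    # Hypothetical cost, homogeneous of degree 1: C(tx, ty) = t C(x, y)
    return 2 * x + 3 * y


def profit(x, y):
    return revenue(x, y) - cost(x, y)


def scaled_profits(x, y, scales):
    """Profit along the ray (t*x, t*y) for each scale factor t."""
    return [profit(t * x, t * y) for t in scales]


scales = [1, 10, 100, 1000]
values = scaled_profits(1.0, 1.0, scales)
for t, v in zip(scales, values):
    print(f"t = {t:>4}: profit = {v}")
```

At the starting point the profit is negative, yet scaling the inputs eventually makes the quadratic revenue term dominate the linear cost term.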

7.6 The Bellman equations. There are many optimization methods not covered in Sundaram's book. One of them, Pontryagin's maximum principle, is more general than the Bellman approach.

p. 172. The bound \sum \delta ^{t}|r(s_{t},a_{t})|\leq K\sum \delta ^{t} is obvious and does not require the Cauchy-Schwarz inequality.

Example 8.1. See the solution of this example using the Cauchy-Schwarz inequality.

Nov 17

Finite Horizon Dynamic Programming

This is the title of the theory we start studying here. We use the Traveller's problem to explain the main definitions:

  1. Set of states, for each time
  2. Set of actions, for each time and state
  3. Reward, for each action
  4. Strategy = set of actions that takes us from A to B (geometrically, it is a path from A to B).
  5. Definition of a Markovian strategy: at each time, the action depends only on the state at that time and not on how we got there.
  6. Value function of a strategy = sum of all rewards

Video 1. Traveller's problem-Definitions

The method of backwards induction

By definition, an optimal strategy maximizes the value function.

Statement. Piece of optimal whole PLUS optimal remainder = optimal whole. That is, if we have an optimal strategy leading from A to B, we can take any piece of it, leading from A to an intermediate point, say, C. If we then find an optimal partial strategy leading from C to B, the combination of the piece from A to C of the initial optimal strategy and this optimal partial strategy from C to B will be optimal from A to B.

Idea: starting from the end, find optimal remainders for all states.
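
Here is a minimal sketch of backwards induction in Python. The network of states and the rewards are made up for illustration (they are not the ones from the videos), and the function names are mine. States are grouped by time, an action is a move to a state of the next period, and the value of a state is the largest total reward collectable from it to the end.

```python
def backward_induction(stages, rewards):
    """Find optimal remainders for all states, starting from the end.

    stages  -- list of lists: stages[t] holds the states available at time t
    rewards -- dict mapping (state, next_state) to the reward of that move
    Assumes every non-terminal state has at least one available move.
    """
    value = {s: 0 for s in stages[-1]}  # nothing left to earn at the end
    best_action = {}
    for t in range(len(stages) - 2, -1, -1):  # go backwards in time
        for s in stages[t]:
            candidates = [
                (rewards[(s, nxt)] + value[nxt], nxt)
                for nxt in stages[t + 1]
                if (s, nxt) in rewards
            ]
            value[s], best_action[s] = max(candidates)
    return value, best_action


def optimal_path(start, best_action):
    """Follow the Markovian strategy: the move depends only on the current state."""
    path = [start]
    while path[-1] in best_action:
        path.append(best_action[path[-1]])
    return path


# Hypothetical Traveller's problem: go from A to B through two intermediate stages
stages = [["A"], ["C1", "C2"], ["D1", "D2"], ["B"]]
rewards = {
    ("A", "C1"): 2, ("A", "C2"): 5,
    ("C1", "D1"): 7, ("C1", "D2"): 1,
    ("C2", "D1"): 3, ("C2", "D2"): 4,
    ("D1", "B"): 2, ("D2", "B"): 6,
}
value, best_action = backward_induction(stages, rewards)
path = optimal_path("A", best_action)
print(path, value["A"])
```

Note that best_action is computed for every state, not only for those on the optimal path. This is exactly what "find optimal remainders for all states" means.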


Video 2. Traveller's problem-Solution


Oct 17

Reevaluating probabilities based on piece of evidence

This actually has to do with Bayes' theorem. However, in simple problems one can use a dead simple approach: just find probabilities of all elementary events. This post builds upon the post on Significance level and power of test, including the notation. Be sure to review that post.

Here is an example from the guide for Quantitative Finance by A. Patton (University of London course code FN3142).

Activity 7.2 Consider a test that has a Type I error rate of 5%, and power of 50%.

Suppose that, before running the test, the researcher thinks that both the null and the alternative are equally likely.

  1. If the test indicates a rejection of the null hypothesis, what is the probability that the null is false?

  2. If the test indicates a failure to reject the null hypothesis, what is the probability that the null is true?

Denote events R = {Reject null}, A = {fAil to reject null}; T = {null is True}; F = {null is False}. Then we are given:

(1) P(F)=0.5;\ P(T)=0.5;

(2) P(R|T)=\frac{P(R\cap T)}{P(T)}=0.05;\ P(R|F)=\frac{P(R\cap F)}{P(F)}=0.5;

(1) and (2) show that we can find P(R\cap T) and P(R\cap F) and therefore also P(A\cap T) and P(A\cap F). Once we know probabilities of elementary events, we can find everything about everything.

Figure 1. Elementary events

Answering the first question: just plug probabilities in P(F|R)=\frac{P(R\cap F)}{P(R)}=\frac{P(R\cap F)}{P(R\cap T)+P(R\cap F)}.

Answering the second question: just plug probabilities in P(T|A)=\frac{P(A\cap T)}{P(A)}=\frac{P(A\cap T)}{P(A\cap T)+P(A\cap F)}.

Patton uses Bayes' theorem and the law of total probability. The solution suggested above uses only additivity of probability.
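
For those who want to see the numbers, here is a short Python sketch of the elementary-events approach applied to Activity 7.2. The function and variable names are mine. The only inputs are the prior P(T)=0.5, the Type I error rate P(R|T)=0.05 and the power P(R|F)=0.5.

```python
def elementary_probabilities(p_true, type1_rate, power):
    """Probabilities of the four elementary events R&T, R&F, A&T, A&F."""
    p_false = 1 - p_true
    p_rt = type1_rate * p_true  # P(R and T) = P(R|T) P(T)
    p_rf = power * p_false      # P(R and F) = P(R|F) P(F)
    p_at = p_true - p_rt        # additivity: P(T) = P(R and T) + P(A and T)
    p_af = p_false - p_rf       # additivity: P(F) = P(R and F) + P(A and F)
    return {"RT": p_rt, "RF": p_rf, "AT": p_at, "AF": p_af}


p = elementary_probabilities(p_true=0.5, type1_rate=0.05, power=0.5)

# Question 1: P(F|R) = P(R and F) / P(R), where P(R) = P(R and T) + P(R and F)
p_false_given_reject = p["RF"] / (p["RT"] + p["RF"])

# Question 2: P(T|A) = P(A and T) / P(A), where P(A) = P(A and T) + P(A and F)
p_true_given_accept = p["AT"] / (p["AT"] + p["AF"])

print(p)
print(round(p_false_given_reject, 3), round(p_true_given_accept, 3))
```

The elementary probabilities are 0.025, 0.25, 0.475 and 0.25, so P(F|R)=10/11, approximately 0.909, and P(T|A)=19/29, approximately 0.655.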