Nov 18

Little tricks for AP Statistics

This year I am teaching AP Statistics. If things continue the way they are, about half of the class will fail. Here is my diagnosis and how I am handling the problem.

On the surface, the students lack algebra training, but I think the problem is deeper: many of them have underdeveloped cognitive abilities. Their perception is slow, memory is limited, analytical abilities are rudimentary, and they are not used to working at home. Limited resources require careful allocation.


Short and intuitive names are better than two-word professional names.

Instead of "sample space" or "probability space" say "universe". The universe is the widest possible event, and nothing exists outside it.

Instead of "elementary event" say "atom". Simplest possible events are called atoms. This corresponds to the theoretical notion of an atom in measure theory (an atom is a measurable set which has positive measure and contains no set of smaller positive measure).

Then the formulation of classical probability becomes short. Let n denote the number of atoms in the universe and let n_A be the number of atoms in event A. If all atoms are equally likely (have equal probabilities), then P(A)=n_A/n.
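The counting formula can be tried on a small universe. Below is a minimal sketch, assuming a fair six-sided die as the universe; the function name classical_probability and the die example are illustrative and not part of the course materials.

```python
from fractions import Fraction

# Universe: the six faces of a fair die; each face is an atom.
universe = {1, 2, 3, 4, 5, 6}

def classical_probability(event, universe):
    """P(A) = n_A / n, valid only when all atoms are equally likely."""
    atoms_in_event = event & universe   # an event is a set of atoms
    return Fraction(len(atoms_in_event), len(universe))

even = {2, 4, 6}
print(classical_probability(even, universe))      # 1/2
print(classical_probability(universe, universe))  # 1
print(classical_probability(set(), universe))     # 0
```

Using exact fractions instead of decimals keeps the link between counting atoms and the resulting probability visible.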

The clumsy "mutually exclusive events" are better replaced by more visual "disjoint sets". Likewise, instead of "collectively exhaustive events" say "events that cover the universe".

The combination of "mutually exclusive" and "collectively exhaustive" events is beyond comprehension for many. I say: if events are disjoint and cover the universe, we call them tiles. To support this definition, play one of the jigsaw puzzle videos onscreen (Video 1) and produce the picture from Figure 1.

Video 1. Tiles (disjoint events that cover the universe)

Figure 1. Tiles (disjoint events that cover the universe)

The philosophy of team work

We are in the same boat. I mean the big boat. Not the class. Not the university. It's the whole country. We depend on each other. Failure of one may jeopardize the well-being of everybody else.

You work in teams. You help each other to learn. My lectures and your presentations are just the beginning of the journey of knowledge into your heads. I cannot control how it settles there. Be my teaching assistants, share your big and little discoveries with your classmates.

I don't just preach about you helping each other. I force you to work in teams. 30% of the final grade is allocated to team work. Team work means joint responsibility. You work on assignments together. I randomly select a team member for reporting. His or her grade is what each team member gets.

This kind of team work is incompatible with the Western obsession with grade privacy. If I say my grade is nobody's business, by extension I consider the level of my knowledge a private issue. This will prevent me from asking for help and admitting my errors. The situation when students hide their errors and weaknesses from others also goes against the ethics of many workplaces. In my class all grades are public knowledge.

In some situations, keeping the grade private is technically impossible. Conducting a competition without announcing the points won is impossible. If I catch a student cheating, I announce the failing grade immediately, as a warning to others.

To those of you who think team-based learning is unfair to better students I repeat: 30% of the final grade is given for team work, not for personal achievements. The other 70% is where you can shine personally.

Breaking the wall of silence

Team work serves several purposes.

Firstly, joint responsibility helps break communication barriers. See in Video 2 my students working in teams on classroom assignments. The situation when a weaker student is too proud to ask for help and a stronger student doesn't want to offend by offering help is not acceptable. One can ask for help or offer help without losing respect for each other.

Video 2. Teams working on assignments

Secondly, it turns on resources that are otherwise idle. Explaining something to somebody is the best way to improve your own understanding. The better students master a kind of leadership that is especially valuable in a modern society. For the weaker students, feeling responsible for a team improves motivation.

Thirdly, I save time by having to grade fewer student papers.

On exams and quizzes I mercilessly punish the students for Yes/No answers without explanations. There are no half-points for half-understanding. This, in combination with the team work and open grades policy, allows me to achieve my main objective: students are eager to talk to me about their problems.

Set operations and probability

After studying the basics of set operations and probabilities we had a midterm exam. It revealed that about one-third of students didn't understand this material and some of that misunderstanding came from high school. During the review session I wanted to see if they were ready for a frank discussion and told them: "Those who don't understand probabilities, please raise your hands", and about one-third raised their hands. I invited two of them to work at the board.

Video 3. Translating verbal statements to sets, with accompanying probabilities

Many teachers think that Venn diagrams explain everything about sets because they are visual. No, for some students they are not visual enough. That's why I prepared a simple teaching aid (see Video 3) and explained the task to the two students as follows:

I am shooting at the target. The target is a square with two circles on it, one red and the other blue. The target is the universe (the bullet cannot hit points outside it). The probability of a set is its area. I am going to tell you one statement after another. You write that statement in the first column of the table. In the second column write the mathematical expression for the set. In the third column write the probability of that set, together with any accompanying formulas that you can come up with. The formulas should reflect the relationships between relevant areas.

Table 1. Set operations and probabilities

| Statement | Set | Probability |
|---|---|---|
| 1. The bullet hit the universe | S | P(S)=1 |
| 2. The bullet didn't hit the universe | \emptyset | P(\emptyset )=0 |
| 3. The bullet hit the red circle | A | P(A) |
| 4. The bullet didn't hit the red circle | \bar{A}=S\backslash A | P(\bar{A})=P(S)-P(A)=1-P(A) |
| 5. The bullet hit both the red and blue circles | A\cap B | P(A\cap B) (in general, this is not equal to P(A)P(B)) |
| 6. The bullet hit A or B (or both) | A\cup B | P(A\cup B)=P(A)+P(B)-P(A\cap B) (additivity rule) |
| 7. The bullet hit A but not B | A\backslash B | P(A\backslash B)=P(A)-P(A\cap B) |
| 8. The bullet hit B but not A | B\backslash A | P(B\backslash A)=P(B)-P(A\cap B) |
| 9. The bullet hit either A or B (but not both) | (A\backslash B)\cup(B\backslash A) | P\left( (A\backslash B)\cup (B\backslash A)\right)=P(A)+P(B)-2P(A\cap B) |

During the process, I was illustrating everything on my teaching aid. This exercise allows the students to relate verbal statements to sets and further to their areas. The main point is that people need to see the logic, and that logic should be repeated several times through similar exercises.
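The same exercise can be repeated on a computer. The sketch below is an illustration under assumptions of my choosing: the target is replaced by a 40 by 40 grid of equally likely atoms, and the circle centers and radii are made up. It checks the formulas from Table 1 exactly by counting atoms.

```python
from fractions import Fraction

# Hypothetical target: a 40 x 40 grid of equally likely atoms (possible bullet holes).
N = 40
S = {(i, j) for i in range(N) for j in range(N)}

def disc(cx, cy, r):
    """Atoms of the universe inside a circle (center and radius are illustrative)."""
    return {(i, j) for (i, j) in S if (i - cx) ** 2 + (j - cy) ** 2 <= r ** 2}

A = disc(15, 20, 10)   # red circle
B = disc(25, 20, 10)   # blue circle

def P(event):
    """Probability of a set = its share of the atoms (its 'area')."""
    return Fraction(len(event), len(S))

assert P(S) == 1 and P(set()) == 0                          # rows 1-2
assert P(S - A) == 1 - P(A)                                 # row 4
assert P(A | B) == P(A) + P(B) - P(A & B)                   # row 6
assert P(A - B) == P(A) - P(A & B)                          # row 7
assert P(B - A) == P(B) - P(A & B)                          # row 8
assert P((A - B) | (B - A)) == P(A) + P(B) - 2 * P(A & B)   # row 9

# Row 5: the intersection is generally not the product of the probabilities.
print(float(P(A & B)), float(P(A) * P(B)))
```

Python's set operators map directly onto the set column of the table: & is intersection, | is union, and - is set difference.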

May 18

Law of total probability - you could have invented this

A knight wants to kill (event K) a dragon. There are two ways to do this: by fighting (event F) the dragon or by outwitting (O) it. The choice of the way (F or O) is random, and in each case the outcome (K or not K) is also random. For the probability of killing there is a simple, intuitive formula:

P(K)=P(K|F)P(F)+P(K|O)P(O).

Its derivation is straightforward from the definition of conditional probability: since F and O cover the whole sample space and are disjoint, we have by additivity of probability

P(K)=P(K\cap(F\cup O))=P(K\cap F)+P(K\cap O)=\frac{P(K\cap F)}{P(F)}P(F)+\frac{P(K\cap O)}{P(O)}P(O)

=P(K|F)P(F)+P(K|O)P(O).

This is easy to generalize to the case of many conditioning events. Suppose A_1,...,A_n are mutually exclusive (that is, disjoint) and collectively exhaustive (that is, cover the whole sample space). Then for any event B one has

P(B)=P(B|A_1)P(A_1)+...+P(B|A_n)P(A_n).

This equation is called the law of total probability.
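To see the law at work, here is a minimal sketch for the knight. All numbers are hypothetical and chosen only for illustration, and the helper name total_probability is mine.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration.
P_F = Fraction(3, 10)           # the knight chooses to fight
P_O = 1 - P_F                   # the knight chooses to outwit
P_K_given_F = Fraction(2, 10)   # chance of killing when fighting
P_K_given_O = Fraction(6, 10)   # chance of killing when outwitting

def total_probability(conditionals, weights):
    """P(B) = sum of P(B|A_i) P(A_i) over disjoint A_i that cover the sample space."""
    assert sum(weights) == 1, "the conditioning events must cover the sample space"
    return sum(c * w for c, w in zip(conditionals, weights))

P_K = total_probability([P_K_given_F, P_K_given_O], [P_F, P_O])
print(P_K)  # 12/25
```

The same function handles any number of conditioning events: pass longer lists.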

Application to a sum of continuous and discrete random variables

Let X,Y be independent random variables. Suppose that X is continuous, with a distribution function F_X, and suppose Y is discrete, with values y_1,...,y_n. Then for the distribution function of the sum F_{X+Y} we have

F_{X+Y}(t)=P(X+Y\le t)=\sum_{j=1}^nP(X+Y\le t|Y=y_j)P(Y=y_j)

(by independence conditioning on Y=y_j can be omitted)

=\sum_{j=1}^nP(X\le t-y_j)P(Y=y_j)=\sum_{j=1}^nF_X(t-y_j)P(Y=y_j).

Compare this to the much more complex derivation in case of two continuous variables.
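The final formula is easy to compute. The sketch below assumes, purely for illustration, that X is standard normal and that Y takes the values 0 and 1 with equal probabilities; the function names are mine.

```python
import math

def normal_cdf(x):
    """Distribution function of the standard normal (the assumed F_X)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cdf_of_sum(t, F_X, y_values, y_probs):
    """F_{X+Y}(t) = sum over j of F_X(t - y_j) P(Y = y_j), for independent X and Y."""
    return sum(F_X(t - y) * p for y, p in zip(y_values, y_probs))

y_values = [0.0, 1.0]   # hypothetical values of Y
y_probs = [0.5, 0.5]    # hypothetical probabilities of those values

# By the symmetry of the normal distribution this is approximately 0.5.
print(cdf_of_sum(0.5, normal_cdf, y_values, y_probs))
```

Any other continuous distribution function can be passed in place of normal_cdf.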


Apr 18

Distribution function estimation

The relativity theory says that what initially looks absolutely difficult, on closer examination turns out to be relatively simple. Here is one such topic. We start with a motivating example.

Large cloud service providers have huge data centers. A data center, being a large group of computer servers, typically requires extensive air conditioning. The intensity and cost of air conditioning depend on the temperature of the surrounding environment. If, as in our motivating example, we denote by T the temperature outside and by t a cut-off value, then a cloud service provider is interested in knowing the probability P(T\le t) for different values of t. This is exactly the distribution function of temperature: F_T(t)=P(T\le t). So how do you estimate it?

It comes down to usual sampling. Fix some cut-off, for example, t=20 and see for how many days in a year the temperature does not exceed 20. If the number of such days is, say, 200, then 200/365 will be the estimate of the probability P(T\le 20).

It remains to dress this idea in mathematical clothes.

Empirical distribution function

If an observation T_i belongs to the event \{T\le 20\}, we count it as 1, otherwise we count it as zero. That is, we are dealing with a dummy variable

(1) 1_{\{T\le 20\}}=\left\{\begin{array}{ll}1,&T\le 20;\\0,&T>20.\end{array}\right.

The total count is \sum 1_{\{T_i\le 20\}} and this is divided by the total number of observations, which is 365, to get 200/365.

It is important to realize that the variable in (1) is a coin (Bernoulli variable). For an unfair coin C with probability of 1 equal to p and probability of zero equal to 1-p, the mean is

EC=p\times 1+(1-p)\times 0=p

and the variance is

Var(C)=EC^2-(EC)^2=p-p^2=p(1-p).

For the variable in (1) p=P\{1_{\{T\le 20\}}=1\}=P(T\le 20)=F_T(20), so the mean and variance are

(2) E1_{\{T\le 20\}}=F_T(20),\ Var(1_{\{T\le 20\}})=F_T(20)(1-F_T(20)).

Generalizing, the probability F_T(t)=P(T\le t) is estimated by

(3) \frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}

where n is the number of observations. (3) is called an empirical distribution function because it is a direct empirical analog of P(T\le t).
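Estimator (3) takes one line of code. Here is a minimal sketch; the temperature readings are made up, and the function name ecdf is mine.

```python
def ecdf(sample, t):
    """Empirical distribution function: the share of observations not exceeding t."""
    return sum(1 for x in sample if x <= t) / len(sample)

# Made-up temperature readings, for illustration only.
temps = [12.5, 18.0, 20.0, 23.1, 25.4, 19.9, 30.2, 15.0]

print(ecdf(temps, 20))  # 0.625, because 5 of the 8 readings do not exceed 20
```

Evaluating ecdf on a range of cut-off values t gives the whole estimated distribution function.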

Applying expectation to (3) and using an equation similar to (2), we prove unbiasedness of our estimator:

(4) E\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}=P(T\le t)=F_T(t).

Further, assuming independent observations we can find variance of (3):

(5) Var\left(\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}\right) (using homogeneity of degree 2)

=\frac{1}{n^2}Var\left(\sum_{i=1}^n 1_{\{T_i\le t\}}\right) (using independence)

=\frac{1}{n^2}\sum_{i=1}^nVar(1_{\{T_i\le t\}}) (applying an equation similar to (2))

=\frac{1}{n^2}nF_T(t)(1-F_T(t))=\frac{1}{n}F_T(t)(1-F_T(t)).

Corollary. (4) and (5) can be used to prove that (3) is a consistent estimator of the distribution function, i.e., (3) converges to F_T(t) in probability.
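One possible route to the corollary is through the Chebyshev inequality. The sketch below is my way of filling in the argument, not necessarily the intended one; the shorthand \hat{F}_n(t) for the estimator (3) is introduced here for brevity, and the observations are assumed independent and identically distributed.

```latex
% Mean and variance of the estimator (3); p(1-p) <= 1/4 for any p in [0,1]
E\hat{F}_n(t)=F_T(t),\qquad
Var(\hat{F}_n(t))=\frac{F_T(t)(1-F_T(t))}{n}\le\frac{1}{4n}.

% Chebyshev inequality: for any \varepsilon>0
P\left(|\hat{F}_n(t)-F_T(t)|\ge\varepsilon\right)
\le\frac{Var(\hat{F}_n(t))}{\varepsilon^2}
\le\frac{1}{4n\varepsilon^2}\rightarrow 0\quad\text{as } n\rightarrow\infty.
```

The bound does not depend on t, so the same sample size guarantees the same accuracy at every cut-off.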




Mar 18

Intro to option greeks: delta and its determinants

I started trading stocks in 2010. I didn't expect to make big profits and wasn't actively trading. That's until 2015, when I met a guy who turned $10,000 into $140,000 in four years. And then I thought: why am I fooling around when it's possible to make good money? Experienced traders say: trading is a journey. That's how my journey started. Stocks move too slowly, to my taste, so I had to look for other avenues.

Two things were clear to me. I didn't want to be glued to the monitor the whole day and didn't want to study a lot of theory. Therefore I decided to concentrate on the futures market. To trade futures, you don't even need to know the definition of a futures contract. The price moves very quickly, and if you know what you are doing, you can make a couple of hundred dollars in a matter of minutes. It turned out that the futures markets are the best approximation to the efficient market hypothesis. Trend is your friend (until the end), as they say. In the futures markets, trends are rare and short-lived. Trading futures is like driving a race car. The psychological stress is enormous and it may excite your worst instincts. After trying for almost two years and losing $8,000 I gave up. Don't trade futures unless you can predict a big move.

Many people start their trading careers in the forex market because the volumes there are large and transaction fees are low. I have never traded forex and think that it is as risky as the futures market. If you want to try it, I would suggest trading not the exchange rates themselves but indexes or ETFs (exchange traded funds) that track them. Again, look for large movements.

One more market I don't want to trade is bonds. Actions of central banks and macroeconomic events are among the strong movers of this market. Otherwise, it's the same as futures. Futures, forex and bonds have one feature in common. In all of them institutional (large) traders dominate. My impression is that in the absence of market-moving events they select a range within which to trade. Having deep pockets, they can buy at the top of the range and sell at the bottom without worrying about the associated loss. Trading in a range like that will kill a retail (small) investor. Changes in fundamentals force the big guys to shift the range, and that's when small investors have a chance to profit.

I tried to avoid options because they require learning some theory. After a prolonged resistance, I started trading options and immediately fell in love with them. I think that anybody with $25,000 in savings can and should be trading options.

Definition. In math, the Greek letter \Delta (delta) is usually used to denote change or rate of change. In the case of options, it's the rate of change of the option price when the stock price changes. Mathematically, it's the derivative \Delta=\frac{\partial c}{\partial S} where c is the call price and S is the stock price. In layman's terms, when the stock price changes by $1, the call price changes (moves in the same direction) by approximately \Delta dollars. The basic features of delta can be understood by looking at how it depends on the strike price, when time is fixed, and how it changes with time, when the strike price is fixed. As before, we concentrate on probabilistic intuition.

How delta depends on strike price

Figure 1. AAPL option chain with 26 days to expiration

Look at the option chain in Figure 1. For the strikes that are deep in the money, delta is close to one. This is because if a call option is deep in the money, the probability that it will end up in the money by expiration is high (see how the call price depends on the strike price). Hence, stock price changes are followed by call price changes almost one to one. On the other hand, if a strike is far out of the money, it is likely to remain out of the money by expiration. The stock price changes have little effect on the call price. That's why delta is close to zero.

How delta depends on time to expiration

Figure 2. AAPL option chain with 5 days to expiration

Now let us compare that option chain to the one with a shorter time to expiration (see Figure 2). If an option is to expire soon, the probability of a drastic stock movement before expiration is low, see the comparison of areas of influence with different times to expiration. Only a few options with strikes lower than the at-the-money strike have deltas different from one. The deeper in the money calls have deltas equal to one: their prices exactly repeat the stock price. Similarly, only a few out of the money options have deltas different from zero. If the strike is very far out of the money, the call delta is 0 because the call is very likely to expire worthless and its dependence on the stock price is negligible.
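Both patterns can be mimicked with a model. The sketch below assumes the Black-Scholes model without dividends, which this post does not rely on; under that model the call delta equals N(d_1). The stock price, volatility, interest rate and strikes are made-up numbers, not the AAPL values from the figures.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function N(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def call_delta(S, K, r, sigma, T):
    """Black-Scholes delta of a European call on a non-dividend stock: N(d1)."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    return normal_cdf(d1)

# Hypothetical inputs: stock price, annual interest rate, annual volatility.
S, r, sigma = 180.0, 0.02, 0.25
strikes = (150, 170, 180, 190, 210)

for days in (26, 5):
    T = days / 365   # time to expiration in years
    deltas = [round(call_delta(S, K, r, sigma, T), 3) for K in strikes]
    print(days, "days:", dict(zip(strikes, deltas)))
```

With these inputs, deltas fall from near one to near zero as the strike rises, and shortening the time to expiration pushes the in-the-money deltas toward one and the out-of-the-money deltas toward zero.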

Mar 18

Interest rate - the puppetmaster behind option prices

Figure 1. Call as a function of interest rate

The interest rate is the last variable we need to discuss. The dependence of the call price on the interest rate that emerges from the Black-Scholes formula is depicted in Figure 1. The dependence is positive, right? Not so fast. This is the case when common sense should be used instead of mathematical models. One economic factor can influence another through many channels, often leading to contradictory results.

John Hull offers two explanations.

  1. As interest rates in the economy increase, the expected return required by investors from the stock tends to increase. This suggests a positive dependence of the call price on the interest rate.
  2. On the other hand, when the interest rate rises, the present value of any future cash flow received by the long call holder decreases. In particular, this reduces the payoff if at expiration the option is in the money.

The combined impact of these two effects embedded in the Black-Scholes formula is to increase the value of the call options.
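The positive dependence built into the formula can be checked numerically. Below is a minimal sketch of the Black-Scholes price of a European call on a non-dividend stock; the inputs (stock price 100, strike 100, volatility 20%, one year to expiration) are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function N(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S, K, r, sigma, T):
    """Black-Scholes price of a European call on a non-dividend stock."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * normal_cdf(d1) - K * math.exp(-r * T) * normal_cdf(d2)

# Hypothetical inputs; only the interest rate varies.
for r in (0.00, 0.02, 0.04, 0.06):
    print(r, round(bs_call(100.0, 100.0, r, 0.2, 1.0), 2))
```

With everything else held fixed, the model price rises with the interest rate. The stock price is held fixed here too, which is exactly the assumption questioned in the rest of this post.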

However, experience tells the opposite

The Fed changes the interest rate at discrete times, not continually. Two moments matter: when the rumor about the upcoming interest rate change hits the market and when the actual change takes place. The market reaction to the rumor depends on the investors' mood - bullish or bearish. A bullish market tends to shrug off most bad news. In a bearish market, even a slight threat may have drastic consequences. By the time the actual change occurs, it is usually priced in.

In 2017, the Fed raised the rate three times: on March 15 (no reaction, judging by S&P 500 SPDR SPY), June 14 (again no reaction) and December 13 (a slight fall). The huge fall in the beginning of February 2018 was not caused by any actual change. It was an accumulated result of cautiousness ("This market has been bullish for too long!") and fears that the Fed would increase the rates in 2018 by more than had been anticipated. Many investors started selling stocks and buying bonds and other less risky assets. The total value of US bonds is about $20 trillion, while the total market capitalization of US stocks is around $30 trillion. The two markets are comparable in size, which means there is enough room to move from one to another and the total portfolio reshuffling can be considerable. Thus far, the mere expectation that the interest rate will increase has been able to substantially reduce stock prices and, consequently, call prices.

All this I summarized to my students as follows. When interest rates rise, bonds become more attractive. This is a substitution effect: investors switch from one asset to another all the time. Therefore stock prices and call prices fall. Thus the dependence of call prices on interest rates is negative.

The first explanation suggested by Hull neglects the substitution effect. The second explanation is not credible either, for the following reason. As I explained, stock volatility has a very strong influence on options. Options themselves have an even higher volatility. A change in interest rates by a couple percent is nothing in comparison with this volatility. Most investors would not care about the resulting reduction in the present value of future cash flows.

Dec 17

How to study mt3042 Optimisation: a guide to a guide

Section and example numbering follows that of the mt3042 Optimization Guide by M. Baltovic.

Main idea: look at geometry in the two-dimensional case

Here is an example. The norm of a vector x\in R^n is defined by \left\Vert x\right\Vert =\sqrt{\sum_{i=1}^nx_i^2}. The combination of squaring and extracting a square root often makes it difficult to understand how this construction works. Here is a simple inequality that allows one to do without this norm (or, to put it differently, replace it with another norm). Take n=2.

\max \{|x_1|,|x_2|\}=\max \{\sqrt{x_1^2},\sqrt{x_2^2}\}\leq\max\{\sqrt{x_1^2+x_2^2},\sqrt{x_1^2+x_2^2}\}=\left\Vert  x\right\Vert =\sqrt{x_1^2+x_2^2}\leq\sqrt{\max\{|x_1|,|x_2|\}^2+\max \{|x_1|,|x_2|\}^2}=\sqrt{2}\max\{|x_1|,|x_2|\}.

We have proved that \max \{|x_1|,|x_2|\}\leq\left\Vert x\right\Vert\leq\sqrt{2}\max\{|x_1|,|x_2|\}. This easily generalizes to R^{n}:

(1) \max \{|x_1|,...,|x_n|\}\leq\left\Vert x\right\Vert\leq\sqrt{n}\max\{|x_1|,...,|x_n|\}.
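A numerical spot check of inequality (1) can make it more tangible. This is only a sanity check on random vectors, not a proof; the function names and the tiny tolerance for floating-point rounding are my choices.

```python
import math
import random

def euclidean_norm(x):
    """The norm from the definition: square root of the sum of squares."""
    return math.sqrt(sum(c * c for c in x))

def max_norm(x):
    """The largest absolute coordinate."""
    return max(abs(c) for c in x)

random.seed(0)
tol = 1 + 1e-12   # allowance for floating-point rounding
for n in (2, 5, 50):
    for trial in range(1000):
        x = [random.uniform(-10, 10) for k in range(n)]
        assert max_norm(x) <= euclidean_norm(x) * tol
        assert euclidean_norm(x) <= math.sqrt(n) * max_norm(x) * tol

# The right-hand bound is attained when all coordinates are equal in absolute value.
print(euclidean_norm([1, 1]), math.sqrt(2) * max_norm([1, 1]))
```

The left-hand bound is attained on vectors with a single nonzero coordinate, so neither constant in (1) can be improved.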

Application. The set A\subset R^n is called bounded if there is a constant C such that \left\Vert x\right\Vert \leq C for all x\in A. (1) implies an equivalent definition: the set A\subset R^n is called bounded if there is a constant C such that \max\{|x_1|,...,|x_n|\}\leq C for all x\in A. See p.35 of Baltovic's guide, where the inequality y_{i}\leq \frac{f(\hat{x})}{p_{i}} is sufficient for proving boundedness of the set Y.

Theorem 2.2 (The Cauchy-Schwarz Inequality). This inequality does not have serious applications in the guide. For a nontrivial application of the Cauchy-Schwarz inequality see my post.

2.1.8. Avoid using the definition of continuity in terms of \varepsilon-\delta (Definition 2.18). Use Definition 2.19 in terms of sequences instead.

2.6.2. Definition 2.21 is indigestible for many students. Just say that the matrix A consists of partial derivatives of the components of f=(f_1,...,f_m), one row per component:

A=\left(\begin{array}{ccc}  \frac{\partial f_1}{\partial x_1}&...&\frac{\partial f_1}{\partial x_n} \\...&...&...\\  \frac{\partial f_m}{\partial x_1}&...&\frac{\partial f_m}{\partial x_n}\end{array}\right) .

Theorem 2.11. The proof is really simple in the one-dimensional case. By the definition of the derivative, \frac{f(x_n)-f(x)}{x_n-x}\rightarrow  f^{\prime }(x) for any sequence x_n\rightarrow x with x_n\neq x. Multiplying this ratio by x_n-x\rightarrow 0 we get f(x_{n})-f(x)=\frac{f(x_n)-f(x)}{x_n-x}(x_n-x)\rightarrow f^{\prime }(x)\cdot 0=0, which proves continuity of f at x.

3.3.1. There is Math that happens on paper (formulas) and the one that happens in the head (logic). Many students see the formulas and miss the logic. Carefully read this section and see if the logic happens in your head.

3.4. The solution to Example 3.2 is overblown. A professional mathematician never thinks like that. A pro would explain the idea as follows: because of Condition 2, the function is close to zero in some neighborhood of infinity \{x:|x|>N\}. Therefore, a maximum should be looked for in the set \{x:|x|\leq N\}. Since this set is compact, the Weierstrass theorem applies. With a proper graphical illustration, the students don't need anything else.

4.2 First-order conditions for optima. See the proof.

4.4 Second-order conditions for optima. See explanation using the Taylor decomposition.

5.3 The Theorem of Lagrange. For the Lagrange method see necessary conditions, sufficient conditions, and the case of many constraints.

5.4 Proof of Lagrange's Theorem. See a simple explanation of the constraint qualification condition. The explanation on pp.58-60 is hard to understand because of dimensionality.

5.6 The Lagrangian multipliers. See simpler derivation.

6.4 Proof of the Kuhn-Tucker Theorem. In case of the Kuhn-Tucker theorem, the most important point is that, once the binding constraints have been determined, the nonbinding ones can be omitted from the analysis. The proof of nonnegativity of the Lagrange multiplier for binding constraints is less than one page.

Example 6.4. In solutions that rely on the Kuhn-Tucker theorem, the author suggests checking the constraint qualification condition for all possible combinations of constraints. Not only is this time consuming, but it is also misleading, given that it is often possible to determine the binding constraints and use the Lagrange method instead of the Kuhn-Tucker theorem or, alternatively, to use the Kuhn-Tucker theorem for eliminating simple cases. The same problem can be solved using convexity theory.

Example 6.5. In this case Baltovic conducts a controversial experiment: what happens if we go the wrong way (as expected, bad things happen), without providing the correct solution.

Solution to Exercise 6.1. In this exercise, the revenue is homogeneous of degree 2 and the cost is homogeneous of degree 1, which indicates that the profit is infinite. No need to do a three-page analysis!

7.6 The Bellman equations. There are many optimization methods not covered in Sundaram's book. One of them, Pontryagin's maximum principle, is more general than the Bellman approach.

p. 172. The bound \sum \delta ^{t}|r(s_{t},a_{t})|\leq K\sum \delta ^{t} is obvious and does not require the Cauchy-Schwarz inequality.

Example 8.1. See the solution of this example using the Cauchy-Schwarz inequality.

Nov 17

Finite Horizon Dynamic Programming

This is the title of the theory we start studying here. We use the Traveller's problem to explain the main definitions:

  1. Set of states, for each time
  2. Set of actions, for each time and state
  3. Reward, for each action
  4. Strategy = set of actions that takes us from A to B (geometrically, it is a path from A to B).
  5. Definition of a Markovian strategy: at each time, the action depends only on the state at that time and not on how we got there.
  6. Value function of a strategy = sum of all rewards

Video 1. Traveller's problem-Definitions

The method of backwards induction

By definition, an optimal strategy maximizes the value function.

Statement. Piece of optimal whole PLUS optimal remainder = optimal whole. That is, if we have an optimal strategy leading from A to B, we can take any piece of it, leading from A to an intermediate point, say, C. Then if we find a partial optimal strategy leading from C to B, then the combination of the piece A to C of the initial optimal strategy and the partial optimal strategy C to B will be optimal from A to B.

Idea: starting from the end, find optimal remainders for all states.
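The idea can be sketched in code. The network below is hypothetical and is not the one from the videos: the traveller starts at A, passes through one of C, D, then one of E, F, and ends at B. The rewards on the arcs are made up.

```python
# rewards[state][next_state] = reward of the action "go from state to next_state"
rewards = {
    "A": {"C": 2, "D": 5},
    "C": {"E": 8, "F": 3},
    "D": {"E": 1, "F": 4},
    "E": {"B": 2},
    "F": {"B": 6},
}
# Sets of states for each time, from the start to the end.
stages = [["A"], ["C", "D"], ["E", "F"], ["B"]]

def backward_induction(rewards, stages):
    """Starting from the end, find the optimal remainder for every state."""
    value = {s: 0 for s in stages[-1]}   # nothing left to earn at the end
    best_action = {}
    for stage in reversed(stages[:-1]):
        for s in stage:
            # Reward of each action plus the value of the optimal remainder after it.
            nxt, val = max(
                ((n, r + value[n]) for n, r in rewards[s].items()),
                key=lambda pair: pair[1],
            )
            value[s], best_action[s] = val, nxt
    return value, best_action

value, best_action = backward_induction(rewards, stages)

# Follow the optimal actions forward from A to recover the optimal path.
path = ["A"]
while path[-1] in best_action:
    path.append(best_action[path[-1]])
print(value["A"], path)  # 15 ['A', 'D', 'F', 'B']
```

The resulting strategy is Markovian: best_action depends only on the current state, not on how the traveller got there.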


Video 2. Traveller's problem-Solution


Oct 17

Reevaluating probabilities based on piece of evidence

This actually has to do with Bayes' theorem. However, in simple problems one can use a dead simple approach: just find probabilities of all elementary events. This post builds upon the post on Significance level and power of test, including the notation. Be sure to review that post.

Here is an example from the guide for Quantitative Finance by A. Patton (University of London course code FN3142).

Activity 7.2 Consider a test that has a Type I error rate of 5%, and power of 50%.

Suppose that, before running the test, the researcher thinks that both the null and the alternative are equally likely.

  1. If the test indicates a rejection of the null hypothesis, what is the probability that the null is false?

  2. If the test indicates a failure to reject the null hypothesis, what is the probability that the null is true?

Denote events R = {Reject null}, A = {fAil to reject null}; T = {null is True}; F = {null is False}. Then we are given:

(1) P(F)=0.5;\ P(T)=0.5;

(2) P(R|T)=\frac{P(R\cap T)}{P(T)}=0.05;\ P(R|F)=\frac{P(R\cap F)}{P(F)}=0.5;

(1) and (2) show that we can find P(R\cap T) and P(R\cap F) and therefore also P(A\cap T) and P(A\cap F). Once we know probabilities of elementary events, we can find everything about everything.

Figure 1. Elementary events

Answering the first question: just plug probabilities in P(F|R)=\frac{P(R\cap F)}{P(R)}=\frac{P(R\cap F)}{P(R\cap T)+P(R\cap F)}.

Answering the second question: just plug probabilities in P(T|A)=\frac{P(A\cap T)}{P(A)}=\frac{P(A\cap T)}{P(A\cap T)+P(A\cap F)}.

Patton uses Bayes' theorem and the law of total probability. The solution suggested above uses only additivity of probability.
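The plugging-in can be done by hand or in a few lines of code. Below is a minimal sketch that follows the elementary-events approach with the numbers from Activity 7.2; the variable names are mine.

```python
from fractions import Fraction

# Given in Activity 7.2.
P_T = Fraction(1, 2)             # null is true
P_F = Fraction(1, 2)             # null is false
P_R_given_T = Fraction(5, 100)   # Type I error rate
P_R_given_F = Fraction(50, 100)  # power

# Probabilities of the four elementary events.
P_RT = P_R_given_T * P_T   # reject, null true
P_RF = P_R_given_F * P_F   # reject, null false
P_AT = P_T - P_RT          # fail to reject, null true
P_AF = P_F - P_RF          # fail to reject, null false
assert P_RT + P_RF + P_AT + P_AF == 1

# Question 1: P(F|R) = P(R and F) / P(R), where P(R) = P(R and T) + P(R and F).
P_F_given_R = P_RF / (P_RT + P_RF)
# Question 2: P(T|A) = P(A and T) / P(A), where P(A) = P(A and T) + P(A and F).
P_T_given_A = P_AT / (P_AT + P_AF)

print(P_F_given_R)  # 10/11
print(P_T_given_A)  # 19/29
```

In decimals, the answers are about 0.91 and 0.66.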


Oct 17

Significance level and power of test

In this post we discuss several interrelated concepts: null and alternative hypotheses, type I and type II errors and their probabilities. Review the definitions of a sample space and elementary events and that of a conditional probability.

Type I and Type II errors

Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (like the suspect is guilty) and alternative hypothesis (the suspect is innocent). It's up to us what to call the null and what to call the alternative. However, the statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than other involved probabilities. This is why what is desirable to prove is usually designated as the alternative.

Usually in books you can see the following table.

| State of nature | Decision taken: fail to reject null | Decision taken: reject null |
|---|---|---|
| Null is true | Correct decision | Type I error |
| Null is false | Type II error | Correct decision |

This table is not good enough because there is no link to probabilities. The next video fills in the blanks.

Video. Significance level and power of test

The conclusion from the video is that

\frac{P(T\cap R)}{P(T)}=P(R|T)=P(\text{Type I error})=\text{significance level},

\frac{P(F\cap R)}{P(F)}=P(R|F)=P(\text{Correctly rejecting false null})=\text{Power}.

Aug 17

Violations of classical assumptions 2

This will be a simple post explaining the common observation that "in Economics, variability of many variables is proportional to those variables". Make sure to review the assumptions; they tend to slip from memory. We consider the simple regression

(1) y_i=a+bx_i+e_i.

One of the classical assumptions is

Homoscedasticity. All errors have the same variance: Var(e_i)=\sigma^2 for all i.

We discuss its opposite, which is

Heteroscedasticity. Not all errors have the same variance. It would be wrong to write it as Var(e_i)\ne\sigma^2 for all i (which means that all errors have variance different from \sigma^2). You can write that not all Var(e_i) are the same but it's better to use the verbal definition.

Remark about Video 1. The dashed lines can represent mean consumption. Then the fact that variation of a variable grows with its level becomes more obvious.

Video 1. Case for heteroscedasticity

Figure 1. Illustration from Dougherty: as x increases, variance of the error term increases

Homoscedasticity was used in the derivation of the OLS estimator variance; under heteroscedasticity that expression is no longer valid. There are other implications, which will be discussed later.

Companies example. The Samsung Galaxy Note 7 battery fires and explosions that caused two recalls cost the smartphone maker at least $5 billion. There is no way a small company could have such losses.

GDP example. The error in measuring US GDP is on the order of $200 bln, which is comparable to the Kazakhstan GDP. However, the standard deviation of the ratio error/GDP seems to be about the same across countries, if the underground economy is not too big. Often the assumption that the standard deviation of the regression error is proportional to one of the regressors is plausible.

To see if the regression error is heteroscedastic, you can look at the graph of the residuals or use statistical tests.
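Here is a minimal simulation sketch of what such a check might look like. It assumes the error standard deviation is proportional to the regressor, as in the GDP example; the coefficients, sample size and scale are made up. Comparing residual variances in the two halves of the sample is a crude informal check, not one of the formal tests.

```python
import random
import statistics

random.seed(1)
a, b = 10.0, 2.0                                # hypothetical true coefficients
xs = [1 + 99 * i / 399 for i in range(400)]     # regressor values from 1 to 100, sorted
# Error standard deviation proportional to x: the heteroscedastic case.
ys = [a + b * x + random.gauss(0, 0.5 * x) for x in xs]

def ols(xs, ys):
    """OLS estimates of the intercept and slope in y = a + b x + e."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

a_hat, b_hat = ols(xs, ys)
residuals = [y - a_hat - b_hat * x for x, y in zip(xs, ys)]

# Compare residual variability for small x and for large x.
half = len(xs) // 2
low = statistics.pvariance(residuals[:half])
high = statistics.pvariance(residuals[half:])
ratio = high / low
print(round(ratio, 1))   # a ratio well above 1 points to heteroscedasticity
```

Plotting the residuals against xs would show the same thing visually: a fan that widens as x grows.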