26
May 20

17
Mar 19

## AP Statistics the Genghis Khan way 2

Last semester I tried to explain theory through numerical examples. The results were terrible. Even the best students didn't live up to my expectations. The midterm grades were so low that I did something I had never done before: I allowed my students to write an analysis of the midterm at home. Those who were able to verbally articulate the answers to me received a bonus that allowed them to pass the semester.

This semester I made a U-turn. I announced that in the first half of the semester we would concentrate on theory, and we followed this methodology. Out of 35 students, 20 significantly improved their performance and 15 remained where they were.

### Midterm exam, version 1

#### 1. General density definition (6 points)

a. Define the density $p_X$ of a random variable $X.$ Draw the density of heights of adults, making simplifying assumptions if necessary. Don't forget to label the axes.

b. According to your plot, how much is the integral $\int_{-\infty}^0p_X(t)dt?$ Explain.

c. Why can't the density be negative?

d. Why should the total area under the density curve be 1?

e. Where are basketball players on your graph? Write down the corresponding expression for probability.

f. Where are dwarfs on your graph? Write down the corresponding expression for probability.

This question is about the interval formula. In each case students have to write the equation for the probability and the corresponding integral of the density. At this level, I don't talk about the distribution function and introduce the density by the interval formula.

#### 2. Properties of means (8 points)

a. Define a discrete random variable and its mean.

b. Define linear operations with random variables.

c. Prove linearity of means.

d. Prove additivity and homogeneity of means.

e. How much is the mean of a constant?

f. Using induction, derive the linearity of means for the case of $n$ variables from the case of two variables (3 points).

#### 3. Covariance properties (6 points)

a. Derive linearity of covariance in the first argument when the second is fixed.

b. How much is covariance if one of its arguments is a constant?

c. What is the link between variance and covariance? If you know one of these functions, can you find the other (there should be two answers)? (4 points)

#### 4. Standard normal variable (6 points)

a. Define the density $p_z(t)$ of a standard normal.

b. Why is the function $p_z(t)$ even? Illustrate this fact on the plot.

c. Why is the function $f(t)=tp_z(t)$ odd? Illustrate this fact on the plot.

d. Justify the equation $Ez=0.$

e. Why is $V(z)=1?$

f. Let $t>0.$ Show on the same plot areas corresponding to the probabilities $A_1=P(0<z<t),$ $A_2=P(z>t),$ $A_3=P(z<-t),$ $A_4=P(-t<z<0).$ Write down the relationships between $A_1,...,A_4.$

#### 5. General normal variable (3 points)

a. Define a general normal variable $X.$

b. Use this definition to find the mean and variance of $X.$

c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters $\sigma =2,$ $\mu =3.$

### Midterm exam, version 2

#### 1. General density definition (6 points)

a. Define the density $p_X$ of a random variable $X.$ Draw the density of work experience of adults, making simplifying assumptions if necessary. Don't forget to label the axes.

b. According to your plot, how much is the integral $\int_{-\infty}^0p_X(t)dt?$ Explain.

c. Why can't the density be negative?

d. Why should the total area under the density curve be 1?

e. Where are retired people on your graph? Write down the corresponding expression for probability.

f. Where are young people (up to 25 years old) on your graph? Write down the corresponding expression for probability.

#### 2. Variance properties (8 points)

a. Define variance of a random variable. Why is it non-negative?

b. State the formula for the variance of a linear combination of two variables.

c. How much is variance of a constant?

d. What is the formula for variance of a sum? What do we call homogeneity of variance?

e. What is larger: $V(X+Y)$ or $V(X-Y)$? (2 points)

f. One investor has 100 shares of Apple, another has 200 shares. Which investor's portfolio has larger variability? (2 points)

#### 3. Poisson distribution (6 points)

a. Write down the Taylor expansion and explain the idea. How are the Taylor coefficients found?

b. Use the Taylor series for the exponential function to define the Poisson distribution.

c. Find the mean of the Poisson distribution. What is the interpretation of the parameter $\lambda$ in practice?

#### 4. Standard normal variable (6 points)

a. Define the density $p_z(t)$ of a standard normal.

b. Why is the function $p_z(t)$ even? Illustrate this fact on the plot.

c. Why is the function $f(t)=tp_z(t)$ odd? Illustrate this fact on the plot.

d. Justify the equation $Ez=0.$

e. Why is $V(z)=1?$

f. Let $t>0.$ Show on the same plot areas corresponding to the probabilities $A_{1}=P(0<z<t),$ $A_{2}=P(z>t),$ $A_{3}=P(z<-t),$ $A_{4}=P(-t<z<0).$ Write down the relationships between $A_{1},...,A_{4}.$

#### 5. General normal variable (3 points)

a. Define a general normal variable $X.$

b. Use this definition to find the mean and variance of $X.$

c. Using part b, on the same plot graph the density of the standard normal and of a general normal with parameters $\sigma =2,$ $\mu =3.$

16
Mar 19

## AP Statistics the Genghis Khan way 1

Recently I enjoyed reading Jack Weatherford's "Genghis Khan and the Making of the Modern World" (2004). I was reading the book with a specific question in mind: what were the main reasons for the success of the Mongols? Here you can see the list of their innovations, some of which were in fact adapted from the nations they subjugated. But what was the main driving force behind those innovations? The conclusion I came to is that Genghis Khan was a psychologist of genius. He used what he knew about individual and social psychology to constantly improve the government of his empire.

I am no Genghis Khan but I try to base my teaching methods on my knowledge of student psychology.

### Problems and suggested solutions

Steven Krantz in his book (How to Teach Mathematics, Second Edition, 1998; I don't remember the page) says something like this: if you want your students to do something, arrange your classes so that they do it in class.

Problem 1. Students mechanically write down what the teacher says and writes.

Solution. I don't allow my students to write while I am explaining the material. When I explain, their task is to listen and try to understand. I invite them to ask questions and prompt me to write more explanations and comments. After they all say "We understand", I erase the board and then they write down whatever they understood and remembered.

Problem 2. Students are not used to analyzing what they read or write.

Solution. After students finish their writing, I ask them to exchange notebooks and check each other's writing. It's easier for them to do this while everything is fresh in their memory. I bought and distributed red pens. When they see that something is missing or wrong, they have to write in red. Errors or omissions must stand out. Thus, right there in class, students go over the material twice.

Problem 3. Students don't study at home.

Solution. I let my students know in advance what the next quiz will be about. Even with this knowledge, most of them don't prepare at home. Before the quiz I give them about half an hour to review and discuss the material (this is at least the third repetition). We start the quiz when they say they are ready.

Problem 4. Students don't understand that active repetition (writing without looking at one's notes) is much more productive than passive repetition (just reading the notes).

Solution. Each time before discussion sessions I distribute scratch paper and urge students to write, not just read or talk. About half of them follow my recommendation. Their desire to keep their notebooks neat is not the least of their considerations. The solution to Problem 1 also hinges upon active repetition.

Problem 5. If students work and are evaluated individually, usually there is no or little interaction between them.

Solution. My class is divided into teams (currently I have teams of two to six people). I randomly select one person from each team to write the quiz. That person's grade is the team's grade. This forces better students to coach others and weaker students to seek help.

Problem 6. Some students don't want to work in teams. They are usually either good students, who don't want to suffer because of weak team members, or weak students, who don't want their low grades to harm other team members.

Solution. The good students usually argue that it's not fair if their grade becomes lower because of somebody else's fault. My answer to them is that the meaning of fairness depends on the definition. In my grading scheme, 30 points out of 100 are allocated for team work and the rest for individual achievements. Therefore I never allow good students to work individually. I want them to be my teaching assistants and help other students. While doing so, I tell them that I may reward good students with a bonus at the end of the semester. In some cases I allow weak students to write quizzes individually but only if the team so requests. The request of the weak student doesn't matter. The weak student still has to participate in team discussions.

Problem 7. There is no accumulation of theoretical knowledge (flat learning curve).

Solution. a) Most students come from high school with little experience in algebra. I raise the level gradually and emphasize understanding. Students never see multiple choice questions in my classes. They also know that right answers without explanations will be discarded.

b) Normally, during my explanations I fill out the board. The amount of the information the students have to remember is substantial and increases over time. If you know a better way to develop one's internal vision, let me know.

c) I don't believe in learning the theory by doing applied exercises. After explaining the theory I formulate it as a series of theoretical exercises. I give the theory in large, logically consistent blocks for students to see the system. Half of the exam questions are theoretical (students have to provide proofs and derivations) and the other half are applied.

d) The right motivation can be of two types: theoretical or applied, and I never substitute one for another.

Problem 8. In low-level courses you need to conduct frequent evaluations to keep your students in working shape. Multiply that by the number of students, and you get a serious teaching overload.

Solution. Once at a teaching conference in Prague my colleague from New York boasted that he grades 160 papers per week. Evaluating one paper per team saves you from that hell.

### Outcome

At the beginning of the academic year I had 47 students. In the second semester 12 students dropped the course entirely or enrolled in Stats classes taught by other teachers. Based on current grades, I expect 15 more students to fail. Thus, after the first year I'll have about 20 students in my course (if they don't fail other courses). These students will master statistics at the level of my book.

8
Oct 17

## Reevaluating probabilities based on a piece of evidence

This actually has to do with Bayes' theorem. However, in simple problems one can use a dead simple approach: just find the probabilities of all elementary events. This post builds upon the post on Significance level and power of test, including the notation. Be sure to review that post.

Here is an example from the guide for Quantitative Finance by A. Patton (University of London course code FN3142).

Activity 7.2 Consider a test that has a Type I error rate of 5%, and power of 50%.

Suppose that, before running the test, the researcher thinks that both the null and the alternative are equally likely.

1. If the test indicates a rejection of the null hypothesis, what is the probability that the null is false?

2. If the test indicates a failure to reject the null hypothesis, what is the probability that the null is true?

Denote events R = {Reject null}, A = {fAil to reject null}; T = {null is True}; F = {null is False}. Then we are given:

(1) $P(F)=0.5;\ P(T)=0.5;$

(2) $P(R|T)=\frac{P(R\cap T)}{P(T)}=0.05;\ P(R|F)=\frac{P(R\cap F)}{P(F)}=0.5;$

(1) and (2) show that we can find $P(R\cap T)$ and $P(R\cap F)$ and therefore also $P(A\cap T)$ and $P(A\cap F).$ Once we know probabilities of elementary events, we can find everything about everything.

Figure 1. Elementary events

Answering the first question: just plug probabilities in $P(F|R)=\frac{P(R\cap F)}{P(R)}=\frac{P(R\cap F)}{P(R\cap T)+P(R\cap F)}.$

Answering the second question: just plug probabilities in $P(T|A)=\frac{P(A\cap T)}{P(A)}=\frac{P(A\cap T)}{P(A\cap T)+P(A\cap F)}.$

Patton uses Bayes' theorem and the law of total probability. The solution suggested above uses only additivity of probability.
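
To check the numbers, here is a minimal Python sketch of the same calculation. The variable names are mine (assumptions, not Patton's notation); the inputs are the ones given in the activity.

```python
# Inputs given in Activity 7.2
p_T = 0.5            # P(T): null is true
p_F = 0.5            # P(F): null is false
p_R_given_T = 0.05   # P(R|T): Type I error rate
p_R_given_F = 0.50   # P(R|F): power

# Probabilities of the four elementary events
p_RT = p_R_given_T * p_T   # reject, null true
p_RF = p_R_given_F * p_F   # reject, null false
p_AT = p_T - p_RT          # fail to reject, null true (additivity)
p_AF = p_F - p_RF          # fail to reject, null false (additivity)

# Question 1: P(F|R); Question 2: P(T|A)
p_F_given_R = p_RF / (p_RT + p_RF)
p_T_given_A = p_AT / (p_AT + p_AF)

print(round(p_F_given_R, 3), round(p_T_given_A, 3))
```

With these inputs the first answer comes out at about 0.909 and the second at about 0.655.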

6
Oct 17

## Significance level and power of test

In this post we discuss several interrelated concepts: null and alternative hypotheses, type I and type II errors and their probabilities. Review the definitions of a sample space and elementary events and that of a conditional probability.

## Type I and Type II errors

Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (like the suspect is guilty) and alternative hypothesis (the suspect is innocent). It's up to us what to call the null and what to call the alternative. However, the statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than other involved probabilities. This is why what is desirable to prove is usually designated as the alternative.

Usually in books you can see the following table.

| State of nature | Decision taken: fail to reject null | Decision taken: reject null |
| --- | --- | --- |
| Null is true | Correct decision | Type I error |
| Null is false | Type II error | Correct decision |

This table is not good enough because there is no link to probabilities. The next video does fill in the blanks.

Video. Significance level and power of test

## Significance level and power of test

The conclusion from the video is that

$\frac{P(T\cap R)}{P(T)}=P(R|T)=P(\text{Type I error})=\text{significance level},$

$\frac{P(F\cap R)}{P(F)}=P(R|F)=P(\text{Correctly rejecting false null})=\text{Power}.$
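
The other two cells of the table follow by taking complements. Writing $A$ for the event of failing to reject the null (a label I am assuming here; it does not appear in the formulas above), a short sketch is:

$$P(A|T)=1-P(R|T)=1-\text{significance level},\qquad P(A|F)=1-P(R|F)=1-\text{Power}=P(\text{Type II error}).$$
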
4
Sep 17

# Geometry related to derivatives

In a sequence of videos I explain the main ideas pursued by the fathers of Calculus - Isaac Newton and Gottfried Wilhelm Leibniz.

## Derivative equals speed

Here we assume that a point moves along a straight line and try to find its speed as usual. We divide the distance traveled by the time it takes to travel it. The ratio is an average speed over a time interval. As we reduce the length of the time interval, we get a better and better approximation to the exact (instantaneous) speed at a point in time.

Video 1. Derivative is speed

## Position of point as a function of time

Working with the visualization of the point movement on a straight line is inconvenient because it is difficult to correlate the point position to time. It is much better to visualize the movement on the space-time plane where the horizontal axis is for time and the vertical axis is for the point position.

Video 2. Position of point as function of time

## Measuring the slope of a straight line

A little digression: how do you measure the slope of a straight line, if you know the values of the function at different points?

Video 3. Measuring the slope of a straight line

## Derivative as the slope of a tangent line

This is like putting two and two together: we apply the previous definition to the slope of a secant drawn through two points on a graph. Then it remains to notice that the secant approaches the tangent line, as the second point approaches the first.

Video 4. Derivative as tangent slope

## From function to its derivative

This is a very useful exercise that later allows one to come up with the optimization conditions, called first order and second order conditions.

Video 5. From function to its derivative

## Conclusion

Let $P(t)$ be some function and fix an initial point $t_1$. The derivative $P^\prime(t_1)$ is defined as the limit

$P^\prime(t_1)=\lim_{t_2\rightarrow t_1}\frac{P(t_2)-P(t_1)}{t_2-t_1}.$

When $P(t)$ describes the movement of a point along a straight line, the derivative gives the speed of that point. When $P(t)$ is drawn on a plane, the derivative gives the slope of the tangent line to the graph.
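
The limit can also be watched numerically. Below is a minimal Python sketch; the position function $P(t)=t^2$ and the point $t_1=1$ are my choices for illustration only. The average speeds over shrinking intervals should approach the exact derivative, which for this function is $2t_1=2$.

```python
def P(t):
    """Hypothetical position function chosen for illustration: P(t) = t**2."""
    return t ** 2


def difference_quotient(P, t1, t2):
    """Average speed over [t1, t2], i.e. the slope of the secant line."""
    return (P(t2) - P(t1)) / (t2 - t1)


t1 = 1.0
steps = (1.0, 0.1, 0.01, 0.001)  # lengths of the time interval t2 - t1
quotients = [difference_quotient(P, t1, t1 + h) for h in steps]

for h, q in zip(steps, quotients):
    print(f"t2 - t1 = {h:<6} average speed = {q:.6f}")
```

The printed average speeds get closer to 2 as the interval shrinks, which is the secant approaching the tangent.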

11
Aug 17

## Violations of classical assumptions

This will be a simple post explaining the common observation that "in Economics, variability of many variables is proportional to those variables". Make sure to review the assumptions; they tend to slip from memory. We consider the simple regression

(1) $y_i=a+bx_i+e_i.$

One of the classical assumptions is

Homoscedasticity. All errors have the same variance: $Var(e_i)=\sigma^2$ for all $i$.

We discuss its opposite, which is

Heteroscedasticity. Not all errors have the same variance. It would be wrong to write it as $Var(e_i)\ne\sigma^2$ for all $i$ (which means that all errors have variance different from $\sigma^2$). You can write that not all $Var(e_i)$ are the same but it's better to use the verbal definition.

Remark about Video 1. The dashed lines can represent mean consumption. Then the fact that variation of a variable grows with its level becomes more obvious.

Video 1. Case for heteroscedasticity

Figure 1. Illustration from Dougherty: as x increases, variance of the error term increases

Homoscedasticity was used in the derivation of the OLS estimator variance; under heteroscedasticity that expression is no longer valid. There are other implications, which will be discussed later.

Companies example. The Samsung Galaxy Note 7 battery fires and explosions that caused two recalls cost the smartphone maker at least \$5 billion. There is no way a small company could have such losses.

GDP example. The error in measuring US GDP is on the order of \$200 billion, which is comparable to the GDP of Kazakhstan. However, the standard deviation of the ratio error/GDP seems to be about the same across countries, if the underground economy is not too big. Often the assumption that the standard deviation of the regression error is proportional to one of the regressors is plausible.

To see if the regression error is heteroscedastic, you can look at the graph of the residuals or use statistical tests.
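
As a rough illustration of looking at residuals, the following Python sketch simulates regression (1) with an error whose standard deviation is proportional to $x_i$, fits OLS, and compares the mean squared residual in the lower and upper halves of the sample. All numbers (sample size, parameters, the factor 0.5) are made up for the illustration, and the comparison is an informal check, not one of the formal statistical tests.

```python
import random

random.seed(0)  # fixed seed so that the run is reproducible

n = 400
a, b = 1.0, 2.0  # hypothetical true parameters of y = a + b*x + e
x = [1 + 9 * i / (n - 1) for i in range(n)]  # regressor from 1 to 10, sorted
# Heteroscedastic errors: sd(e_i) = 0.5 * x_i (an assumption of this sketch)
y = [a + b * xi + random.gauss(0, 0.5 * xi) for xi in x]

# OLS estimates for the simple regression
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b_hat = s_xy / s_xx
a_hat = y_bar - b_hat * x_bar
residuals = [yi - a_hat - b_hat * xi for xi, yi in zip(x, y)]

# Mean squared residual for small x versus large x
half = n // 2
ms_low = sum(r ** 2 for r in residuals[:half]) / half
ms_high = sum(r ** 2 for r in residuals[half:]) / half
print(ms_low, ms_high, ms_high / ms_low)
```

A ratio well above 1 suggests that the spread of the residuals grows with $x$; under homoscedasticity the two numbers would be of similar size.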

7
Aug 17

## Violations of classical assumptions

This is a large topic which requires several posts or several book chapters. During a conference in Sweden in 2010, a Swedish statistician asked me: "What is Econometrics, anyway? What tools does it use?" I said: "Among others, it uses linear regression." He said: "But linear regression is a general statistical tool, why do they say it's a part of Econometrics?" My answer was: "Yes, it's a general tool but the name Econometrics emphasizes that the motivation for its applications lies in Economics".

Both classical assumptions and their violations should be studied with this point in mind: What is the Economics and Math behind each assumption?

## Violations of the first three assumptions

We consider the simple regression

(1) $y_i=a+bx_i+e_i$

Make sure to review the assumptions. Their numbering and names sometimes are different from what Dougherty's book has. In particular, most of the time I omit the following assumption:

A6. The model is linear in parameters and correctly specified.

When it is not linear in parameters, you can think of nonlinear alternatives. Instead of saying "correctly specified" I say "true model" when a "wrong model" is available.

A1. What if the existence condition is violated? If variance of the regressor is zero, the OLS estimator does not exist. The fitted line is supposed to be vertical, and you can regress $x$ on $y$. Violation of the existence condition in case of multiple regression leads to multicollinearity, and that's where economic considerations are important.

A2. The convenience condition is so called because when it is violated, that is, when the regressor is stochastic, there are ways to deal with this problem: finite-sample theory and large-sample theory.

A3. What if the errors in (1) have means different from zero? This question can be divided in two: 1) the means of the errors are the same: $Ee_i=c\ne 0$ for all $i$ and 2) the means are different. Read the post about centering and see if you can come up with the answer for the first question. The means may be different because of omission of a relevant variable (can you do the math?). In the absence of data on such a variable, there is nothing you can do.
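
To make the A1 case concrete, recall the usual formula for the OLS slope in (1) (the notation $\hat{b},\bar{x},\bar{y}$ is assumed here, with bars denoting sample means):

$$\hat{b}=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}.$$

If all $x_i$ coincide, the denominator, which is proportional to the sample variance of the regressor, is zero and the ratio is undefined.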

Violations of A4 and A5 will be treated later.

26
Jul 17

## Nonlinear least squares

Here we explain the idea, illustrate the possible problems in Mathematica and, finally, show the implementation in Stata.

## Idea: minimize RSS, as in ordinary least squares

Observations come in pairs $(x_1,y_1),...,(x_n,y_n)$. In case of ordinary least squares, we approximated the y's with linear functions of the parameters, possibly nonlinear in x's. Now we use a function $f(a,b,x_i)$ which may be nonlinear in $a,b$. We still minimize RSS which takes the form $RSS=\sum r_i^2=\sum(y_i-f(a,b,x_i))^2$. Nonlinear least squares estimators are the values $a,b$ that minimize RSS. In general, it is difficult to find the formula (closed-form solution), so in practice software, such as Stata, is used for RSS minimization.
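
The following Python sketch shows what "minimizing RSS" means in the simplest terms. The exponential function $f(a,b,x)=ae^{bx}$, the noise-free data, and the crude grid search are all assumptions made for illustration; real software uses an iterative algorithm instead of a grid.

```python
import math


def f(a, b, x):
    """Hypothetical regression function, nonlinear in b: a * exp(b * x)."""
    return a * math.exp(b * x)


def rss(a, b, xs, ys):
    """Residual sum of squares for the parameter pair (a, b)."""
    return sum((y - f(a, b, x)) ** 2 for x, y in zip(xs, ys))


# Made-up data generated exactly from a = 2, b = 0.5 (no noise)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [f(2.0, 0.5, x) for x in xs]

# Crude grid search standing in for the software's minimization routine
grid_a = [1.0 + 0.1 * i for i in range(21)]  # 1.0, 1.1, ..., 3.0
grid_b = [0.1 * j for j in range(11)]        # 0.0, 0.1, ..., 1.0
best = min((rss(a, b, xs, ys), a, b) for a in grid_a for b in grid_b)

print("smallest RSS on the grid:", best[0], "at a =", best[1], "b =", best[2])
```

Because the data contain no noise, the grid point closest to the generating parameters gives an RSS of essentially zero. With noisy data the minimum would be positive and the minimizing pair would only approximate the true parameters.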

## Simplified idea and problems in one-dimensional case

Suppose we want to minimize $f(x)$. The Newton algorithm (default in Stata) is an iterative procedure consisting of the following steps:

1. Select the initial value $x_0$.
2. Find the derivative (the slope of the tangent) of the objective function $f$ at $x_0$. Make a small step in the descent direction (indicated by the derivative), to obtain the next value $x_1$.
3. Repeat Step 2, using $x_1$ as the starting point, until the difference between the values of the objective function at two successive points becomes small. The last point $x_n$ will approximate the minimizing point.

Problems:

1. The minimizing point may not exist.
2. When it exists, it may not be unique. In general, there is no way to find out how many local minimums there are and which ones are global.
3. The minimizing point found by the algorithm depends on the initial point.
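
The steps and the last two problems can be sketched in a few lines of Python. The sketch implements the steps as listed (a fixed small step against the sign of the derivative), which is a plain descent scheme; it is not a reproduction of Stata's routine. The objective function with two local minima, the step size, and the tolerance are my assumptions for illustration.

```python
def f(x):
    """Hypothetical objective with two local minima."""
    return x ** 4 - 2 * x ** 2 + 0.5 * x


def df(x):
    """Derivative of f."""
    return 4 * x ** 3 - 4 * x + 0.5


def descend(x0, step=0.01, tol=1e-12, max_iter=100_000):
    """Repeat small steps in the descent direction until f stops changing."""
    x = x0
    for _ in range(max_iter):
        x_next = x - step * df(x)          # Step 2: move against the derivative
        if abs(f(x_next) - f(x)) < tol:    # Step 3: stopping rule
            return x_next
        x = x_next
    return x


x_left = descend(-2.0)   # initial point to the left
x_right = descend(2.0)   # initial point to the right

print("start -2.0 ->", round(x_left, 4), "f =", round(f(x_left), 4))
print("start  2.0 ->", round(x_right, 4), "f =", round(f(x_right), 4))
```

The two starting points lead to two different local minima, and only the left one is the global minimum of this particular function. The algorithm itself gives no indication of that.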

See Video 1 for illustration in the one-dimensional case.

Video 1. NLS geometry

## Problems illustrated in Mathematica

Here we look at three examples of nonlinear functions, two of which are considered in Dougherty. The first one is a power function (it can be linearized by applying logs) and the second is an exponential function (it cannot be linearized). The third function gives rise to two minimums. The possibilities are illustrated in Mathematica.

Video 2. NLS illustrated in Mathematica

## Finally, implementation in Stata

Here we show how to 1) generate a random vector, 2) create a vector of initial values, and 3) program a nonlinear dependence.

Video 3. NLS implemented in Stata

10
Jul 17

## Alternatives to simple regression in Stata

In this post we looked at dependence of EARNINGS on S (years of schooling). In the end I suggested thinking about possible variations of the model. Specifically, could the dependence be nonlinear? We consider two answers to this question.

## Quadratic regression

This name is used for the quadratic dependence of the dependent variable on the independent variable. For our variables the dependence is

$EARNINGS=a+bS+cS^2+u$.

Note that the dependence on S is quadratic but the right-hand side is linear in the parameters, so we still are in the realm of linear regression. Video 1 shows how to run this regression.

Video 1. Running quadratic regression in Stata
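
To see why this is still linear regression, one can (as a sketch, with $Z$ a name introduced only for this illustration) define a new regressor $Z=S^2$ and rewrite the model as

$$EARNINGS=a+bS+cZ+u,$$

which is a linear regression of EARNINGS on two regressors, $S$ and $Z$.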

## Nonparametric regression

The general way to write this model is

$y=m(x)+u.$

The beauty and power of nonparametric regression consists in the fact that we don't need to specify the functional form of dependence of $y$ on $x$. Therefore there are no parameters to interpret, there is only the fitted curve. There is also the estimated equation of the nonlinear dependence, which is too complex to consider here. I already illustrated the difference between parametric and nonparametric regression. See in Video 2 how to run nonparametric regression in Stata.

Video 2. Nonparametric dependence