4 Nov 18

## Little tricks for AP Statistics

This year I am teaching AP Statistics. If things continue the way they are going, about half of the class will fail. Here is my diagnosis and how I am handling the problem.

On the surface, the students lack algebra training, but I think the problem is deeper: many of them have underdeveloped cognitive abilities. Their perception is slow, their memory is limited, their analytical abilities are rudimentary, and they are not used to working at home. Limited resources require careful allocation.

### Terminology

Short and intuitive names are better than two-word professional names.

Instead of "sample space" or "probability space" say "universe". The universe is the widest possible event, and nothing exists outside it.

Instead of "elementary event" say "atom". The simplest possible events are called atoms. This corresponds to the theoretical notion of an atom in measure theory (an atom is a measurable set which has positive measure and contains no measurable subset of smaller positive measure).

Then the formulation of classical probability becomes short. Let $n$ denote the number of atoms in the universe and let $n_A$ be the number of atoms in event $A.$ If all atoms are equally likely (have equal probabilities), then $P(A)=n_A/n.$
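
The formula $P(A)=n_A/n$ can be made concrete with a few lines of Python. This is an illustration of my own choosing (a fair die), not from the course materials:

```python
from fractions import Fraction

# Universe: the six equally likely atoms of a fair die roll.
universe = {1, 2, 3, 4, 5, 6}

def prob(event, universe):
    """Classical probability: (number of atoms in the event) / (number of atoms in the universe)."""
    return Fraction(len(event & universe), len(universe))

A = {2, 4, 6}             # the event "the roll is even"
print(prob(A, universe))  # 1/2
```

The same function returns 1 for the universe itself and 0 for the empty event, matching the first two rows of Table 1 below.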

The clumsy "mutually exclusive events" is better replaced by the more visual "disjoint sets". Likewise, instead of "collectively exhaustive events" say "events that cover the universe".

The combination of "mutually exclusive" and "collectively exhaustive" events is beyond comprehension for many. I say: if events are disjoint and cover the universe, we call them tiles. To support this definition, play one of the jigsaw puzzles onscreen (Video 1) and produce the picture in Figure 1.

Video 1. Tiles (disjoint events that cover the universe)

Figure 1. Tiles (disjoint events that cover the universe)

### The philosophy of team work

We are in the same boat. I mean the big boat. Not the class. Not the university. It's the whole country. We depend on each other. Failure of one may jeopardize the well-being of everybody else.

You work in teams. You help each other to learn. My lectures and your presentations are just the beginning of the journey of knowledge into your heads. I cannot control how it settles there. Be my teaching assistants, share your big and little discoveries with your classmates.

I don't just preach about you helping each other. I force you to work in teams. 30% of the final grade is allocated to team work. Team work means joint responsibility. You work on assignments together. I randomly select a team member for reporting. His or her grade is what each team member gets.

This kind of team work is incompatible with the Western obsession with grades privacy. If I say my grade is nobody's business, by extension I consider the level of my knowledge a private issue. This will prevent me from asking for help and admitting my errors. The situation when students hide their errors and weaknesses from others also goes against the ethics of many workplaces. In my class all grades are public knowledge.

In some situations, keeping the grade private is technically impossible. For example, you cannot conduct a competition without announcing the points won. If I catch a student cheating, I announce the failing grade immediately, as a warning to others.

To those of you who think team-based learning is unfair to better students I repeat: 30% of the final grade is given for team work, not for personal achievements. The other 70% is where you can shine personally.

### Breaking the wall of silence

Team work serves several purposes.

Firstly, joint responsibility helps break communication barriers. See my students working in teams on classroom assignments in Video 2. The situation where a weaker student is too proud to ask for help and a stronger student doesn't want to offend by offering help is not acceptable. One can ask for help or offer help without losing the other's respect.

Video 2. Teams working on assignments

Secondly, it activates resources that are otherwise idle. Explaining something to somebody is the best way to improve your own understanding. The better students master a kind of leadership that is especially valuable in modern society. For the weaker students, feeling responsible for a team improves motivation.

Thirdly, I save time by having fewer student papers to grade.

On exams and quizzes I mercilessly punish students for Yes/No answers without explanations. There are no half-points for half-understanding. This, in combination with the team work and the open grades policy, allows me to achieve my main objective: students are eager to talk to me about their problems.

### Set operations and probability

After studying the basics of set operations and probabilities we had a midterm exam. It revealed that about one-third of the students didn't understand this material, and some of that misunderstanding came from high school. During the review session I wanted to see if they were ready for a frank discussion and told them: "Those who don't understand probabilities, please raise your hands." About one-third raised their hands. I invited two of them to work at the board.

Video 3. Translating verbal statements to sets, with accompanying probabilities

Many teachers think that Venn diagrams explain everything about sets because they are visual. No, for some students they are not visual enough. That's why I prepared a simple teaching aid (see Video 3) and explained the task to the two students as follows:

I am shooting at the target. The target is a square with two circles on it, one red and the other blue. The target is the universe (the bullet cannot hit points outside it). The probability of a set is its area. I am going to tell you one statement after another. You write that statement in the first column of the table. In the second column write the mathematical expression for the set. In the third column write the probability of that set, together with any accompanying formulas that you can come up with. The formulas should reflect the relationships between relevant areas.

Table 1. Set operations and probabilities

| | Statement | Set | Probability |
| --- | --- | --- | --- |
| 1 | The bullet hit the universe | $S$ | $P(S)=1$ |
| 2 | The bullet didn't hit the universe | $\emptyset$ | $P(\emptyset)=0$ |
| 3 | The bullet hit the red circle | $A$ | $P(A)$ |
| 4 | The bullet didn't hit the red circle | $\bar{A}=S\backslash A$ | $P(\bar{A})=P(S)-P(A)=1-P(A)$ |
| 5 | The bullet hit both the red and blue circles | $A\cap B$ | $P(A\cap B)$ (in general, this is not equal to $P(A)P(B)$) |
| 6 | The bullet hit $A$ or $B$ (or both) | $A\cup B$ | $P(A\cup B)=P(A)+P(B)-P(A\cap B)$ (additivity rule) |
| 7 | The bullet hit $A$ but not $B$ | $A\backslash B$ | $P(A\backslash B)=P(A)-P(A\cap B)$ |
| 8 | The bullet hit $B$ but not $A$ | $B\backslash A$ | $P(B\backslash A)=P(B)-P(A\cap B)$ |
| 9 | The bullet hit either $A$ or $B$ (but not both) | $(A\backslash B)\cup(B\backslash A)$ | $P\left((A\backslash B)\cup(B\backslash A)\right)=P(A)+P(B)-2P(A\cap B)$ |

During the process, I was illustrating everything on my teaching aid. This exercise allows the students to relate verbal statements to sets and further to their areas. The main point is that people need to see the logic, and that logic should be reinforced several times through similar exercises.
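
The target metaphor also lends itself to a quick simulation. Below is a minimal Python sketch of my own (the circle centers and radii are arbitrary choices, not from the lesson): random "bullets" hit a unit square, and the counts verify the additivity rule from row 6 of the table.

```python
import random

random.seed(0)

# The universe is the unit square; A (red) and B (blue) are two overlapping circles.
# Centers and radii below are illustrative assumptions.
def in_circle(x, y, cx, cy, r):
    return (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2

n = 100_000
hits_A = hits_B = hits_AB = hits_union = 0
for _ in range(n):
    x, y = random.random(), random.random()   # a uniformly random bullet hit
    a = in_circle(x, y, 0.35, 0.5, 0.3)       # did it land in the red circle?
    b = in_circle(x, y, 0.65, 0.5, 0.3)       # did it land in the blue circle?
    hits_A += a
    hits_B += b
    hits_AB += a and b
    hits_union += a or b

# Additivity holds exactly on the counts: |A u B| = |A| + |B| - |A n B|.
print(hits_union == hits_A + hits_B - hits_AB)
# hits_A / n is a Monte Carlo estimate of P(A) = area of A = pi * 0.3^2, about 0.283.
print(hits_A / n)
```

Since probability here is area, the relative frequencies approximate the areas of the corresponding sets, which is exactly the logic of the table.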

8 Oct 17

## Reevaluating probabilities based on a piece of evidence

This actually has to do with Bayes' theorem. However, in simple problems one can use a dead simple approach: just find the probabilities of all elementary events. This post builds upon the post on Significance level and power of test, including the notation. Be sure to review that post.

Here is an example from the guide for Quantitative Finance by A. Patton (University of London course code FN3142).

Activity 7.2 Consider a test that has a Type I error rate of 5%, and power of 50%.

Suppose that, before running the test, the researcher thinks that both the null and the alternative are equally likely.

1. If the test indicates a rejection of the null hypothesis, what is the probability that the null is false?

2. If the test indicates a failure to reject the null hypothesis, what is the probability that the null is true?

Denote the events R = {Reject null}, A = {fAil to reject null}, T = {null is True}, F = {null is False}. Then we are given:

(1) $P(F)=0.5;\ P(T)=0.5;$

(2) $P(R|T)=\frac{P(R\cap T)}{P(T)}=0.05;\ P(R|F)=\frac{P(R\cap F)}{P(F)}=0.5;$

(1) and (2) show that we can find $P(R\cap T)$ and $P(R\cap F)$ and therefore also $P(A\cap T)$ and $P(A\cap F).$ Once we know probabilities of elementary events, we can find everything about everything.

Figure 1. Elementary events

Answering the first question: just plug the probabilities into $P(F|R)=\frac{P(R\cap F)}{P(R)}=\frac{P(R\cap F)}{P(R\cap T)+P(R\cap F)}.$

Answering the second question: just plug the probabilities into $P(T|A)=\frac{P(A\cap T)}{P(A)}=\frac{P(A\cap T)}{P(A\cap T)+P(A\cap F)}.$

Patton uses Bayes' theorem and the law of total probability. The solution suggested above uses only additivity of probability.
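
The whole "find all elementary events" approach fits in a few lines of code. This sketch just carries out the arithmetic described above with the numbers from Activity 7.2:

```python
# Given data from the activity:
P_T, P_F = 0.5, 0.5
P_R_given_T = 0.05   # Type I error rate (significance level)
P_R_given_F = 0.50   # power

# Probabilities of the four elementary events:
P_RT = P_R_given_T * P_T   # P(R n T) = 0.025
P_RF = P_R_given_F * P_F   # P(R n F) = 0.25
P_AT = P_T - P_RT          # P(A n T) = 0.475
P_AF = P_F - P_RF          # P(A n F) = 0.25

# Question 1: probability the null is false given a rejection.
P_F_given_R = P_RF / (P_RT + P_RF)
# Question 2: probability the null is true given a failure to reject.
P_T_given_A = P_AT / (P_AT + P_AF)

print(round(P_F_given_R, 4))  # 0.9091
print(round(P_T_given_A, 4))  # 0.6552
```

Note that the four elementary-event probabilities sum to 1, as they must, since R/A and T/F each tile the universe.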

6 Oct 17

## Significance level and power of test

In this post we discuss several interrelated concepts: null and alternative hypotheses, Type I and Type II errors and their probabilities. Review the definitions of a sample space, elementary events, and conditional probability.

## Type I and Type II errors

Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (say, the suspect is guilty) and the alternative hypothesis (the suspect is innocent). It's up to us what to call the null and what to call the alternative. However, statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than the other probabilities involved. This is why what is desirable to prove is usually designated as the alternative.

Usually in books you can see the following table.

| State of nature \ Decision taken | Fail to reject null | Reject null |
| --- | --- | --- |
| Null is true | Correct decision | Type I error |
| Null is false | Type II error | Correct decision |

This table is not good enough because there is no link to probabilities. The next video fills in the blanks.

Video. Significance level and power of test

## Significance level and power of test

The conclusion from the video is that

$\frac{P(T\cap R)}{P(T)}=P(R|T)=P(\text{Type I error})=\text{significance level}$

$\frac{P(F\cap R)}{P(F)}=P(R|F)=P(\text{correctly rejecting false null})=\text{power}$
7 Aug 17

## Violations of classical assumptions

This is a large topic which requires several posts or several book chapters. During a conference in Sweden in 2010, a Swedish statistician asked me: "What is Econometrics, anyway? What tools does it use?" I said: "Among others, it uses linear regression." He said: "But linear regression is a general statistical tool, why do they say it's a part of Econometrics?" My answer was: "Yes, it's a general tool but the name Econometrics emphasizes that the motivation for its applications lies in Economics".

Both classical assumptions and their violations should be studied with this point in mind: What is the Economics and Math behind each assumption?

## Violations of the first three assumptions

We consider the simple regression

(1) $y_i=a+bx_i+e_i$

Make sure to review the assumptions. Their numbering and names are sometimes different from those in Dougherty's book. In particular, most of the time I omit the following assumption:

A6. The model is linear in parameters and correctly specified.

When it is not linear in parameters, you can think of nonlinear alternatives. Instead of saying "correctly specified" I say "true model" when a "wrong model" is available.

A1. What if the existence condition is violated? If the variance of the regressor is zero, the OLS estimator does not exist: the fitted line would have to be vertical, and you can regress $x$ on $y$ instead. Violation of the existence condition in the case of multiple regression leads to multicollinearity, and that's where economic considerations are important.

A2. The convenience condition is called that because when it is violated, that is, when the regressor is stochastic, there are still ways to deal with the problem: finite-sample theory and large-sample theory.

A3. What if the errors in (1) have means different from zero? This question can be divided into two: 1) the means of the errors are the same, $Ee_i=c\ne 0$ for all $i$, and 2) the means are different. Read the post about centering and see if you can come up with the answer to the first question. The means may be different because of omission of a relevant variable (can you do the math?). In the absence of data on such a variable, there is nothing you can do.

Violations of A4 and A5 will be treated later.

10 Jul 17

## Alternatives to simple regression in Stata

In this post we looked at the dependence of EARNINGS on S (years of schooling). In the end I suggested thinking about possible variations of the model. Specifically, could the dependence be nonlinear? We consider two answers to this question.

## Quadratic regression

This name is used for the quadratic dependence of the dependent variable on the independent variable. For our variables the dependence is

$EARNINGS=a+bS+cS^2+u$.

Note that the dependence on S is quadratic, but the right-hand side is linear in the parameters, so we are still in the realm of linear regression. Video 1 shows how to run this regression.

Video 1. Running quadratic regression in Stata
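
For readers who want to replicate the idea outside Stata, here is a minimal Python sketch. The data are synthetic (the true coefficients below are my own illustrative choices, not estimates from Dougherty's file); the point is that the quadratic term is just another regressor, so ordinary least squares still applies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: earnings quadratic in schooling, with noise.
S = rng.uniform(8, 20, size=200)
true_a, true_b, true_c = 5.0, -1.0, 0.15
EARNINGS = true_a + true_b * S + true_c * S**2 + rng.normal(0, 2, size=200)

# The model is quadratic in S but linear in (a, b, c), so plain OLS works:
# regress EARNINGS on the columns [1, S, S^2].
X = np.column_stack([np.ones_like(S), S, S**2])
coef, *_ = np.linalg.lstsq(X, EARNINGS, rcond=None)
a_hat, b_hat, c_hat = coef
print(a_hat, b_hat, c_hat)
```

The estimated coefficient on $S^2$ comes out close to the true curvature used to generate the data, which is exactly what Stata's regression of EARNINGS on S and S squared would report.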

## Nonparametric regression

The general way to write this model is

$y=m(x)+u.$

The beauty and power of nonparametric regression lie in the fact that we don't need to specify the functional form of the dependence of $y$ on $x$. Therefore there are no parameters to interpret; there is only the fitted curve. There is also an estimated equation of the nonlinear dependence, which is too complex to consider here. I already illustrated the difference between parametric and nonparametric regression. See in Video 2 how to run nonparametric regression in Stata.

Video 2. Nonparametric dependence
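
Stata does the estimation for us; to see what a nonparametric estimator can look like under the hood, here is a sketch of the Nadaraya-Watson kernel estimator in Python. This is one common choice of nonparametric regression (Stata's procedure may use a different kernel and bandwidth); the data, the true function $m(x)=\sin x$, and the bandwidth are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: y = m(x) + u with m(x) = sin(x), unknown to the estimator.
x = rng.uniform(0, 4, 300)
y = np.sin(x) + rng.normal(0, 0.2, 300)

def nw(x0, x, y, h=0.3):
    """Nadaraya-Watson: estimate m(x0) as a kernel-weighted average of observed y's."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)

grid = np.linspace(0.5, 3.5, 7)
m_hat = np.array([nw(x0, x, y) for x0 in grid])
print(np.round(m_hat, 2))   # tracks sin(x) without any functional form being specified
```

No functional form was supplied, yet the fitted values follow the sine curve; that is the sense in which "there are no parameters to interpret, there is only the fitted curve."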

6 Jul 17

## Running simple regression in Stata

Running simple regression in Stata is, well, simple. It's just a matter of a couple of clicks. Still, try to turn it into a small research project.

1. Obtain descriptive statistics for your data (Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics). Look at all that stuff you studied in introductory statistics: units of measurement, means, minimums, maximums, and correlations. Knowing the units of measurement will be important for interpreting regression results; correlations will predict signs of coefficients, etc. In your report, don't just mechanically repeat all those measures; try to find and discuss something interesting.
2. Visualize your data (Graphics > Twoway graph). On the graph you can observe outliers and discern possible nonlinearity.
3. After running the regression, report the estimated equation. It is called a fitted line and in our case looks like this: Earnings = -13.93+2.45*S (use descriptive names and not abstract X, Y). To see if the coefficient of S is significant, look at its p-value, which is smaller than 0.001. This tells us that at all significance levels larger than or equal to 0.001 the null that the coefficient of S is zero is rejected. This follows from the definition of the p-value. Nobody cares about significance of the intercept. Report also the p-value of the F statistic. It characterizes the joint significance of all nontrivial regressors and is important in case of multiple regression. The last statistic to report is R squared.
4. Think about possible variations of the model. Could the dependence of Earnings on S be nonlinear? What other determinants of Earnings would you suggest from among the variables in Dougherty's file?

Figure 1. Looking at the data with a scatterplot

Figure 2. Running regression (Statistics > Linear models and related > Linear regression)
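
As a check on step 3, the fitted line and R squared can also be computed outside Stata. The sketch below uses synthetic data generated around the reported fitted line (it is not the actual data file), just to show what the regression output consists of:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative data in the spirit of the exercise: earnings depend on schooling S plus noise.
S = rng.integers(8, 21, size=100).astype(float)
EARNINGS = -13.93 + 2.45 * S + rng.normal(0, 5, size=100)

# The fitted line by OLS (what Stata's "Linear regression" menu computes).
slope, intercept = np.polyfit(S, EARNINGS, deg=1)
print(f"Earnings = {intercept:.2f} + {slope:.2f}*S")

# R squared: the share of the variance of EARNINGS explained by the fitted line.
fitted = intercept + slope * S
r2 = 1 - np.var(EARNINGS - fitted) / np.var(EARNINGS)
print(round(r2, 2))
```

The estimated slope lands near the coefficient used to generate the data, and R squared measures how much of the scatter the line accounts for.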

29 Jun 17

## Introduction to Stata

Introduction to Stata: Stata interface, how to use Stata Help, how to use Data Editor and how to graph data. Important details to remember:

1. In any program, the first thing to use is Help. I learned everything from Help and never took any programming courses.
2. The number of observations for all variables in one data file must be the same. This can be a problem if, for example, you want to see out-of-sample predictions.
3. In Data Editor, numeric variables are displayed in black and strings are displayed in red.
4. The name of the hidden variable that counts observations is _n.
5. If you have several definitions of graphs in two-way graphs menu, they will be graphed together or separately, depending on what is enabled/disabled.

See details in videos. Sorry about the background noise!

Video 1. Stata interface. The windows introduced: Results, Command, Variables, Properties, Review and Viewer.

Video 2. Using Stata Help. Help can be used through the Review window or in a separate pdf viewer. Eviews Help is much easier to understand.

Video 3. Using Data Editor. How to open and view variables, the visual difference between numeric variables and string variables. The lengths of all variables in the same file must be the same.

Video 4. Graphing data. To graph a variable, you need to define its graph and then display it. It is possible to display more than one variable on the same chart.

22 Jun 17

## Autoregressive–moving-average (ARMA) models

Autoregressive–moving-average (ARMA) models were suggested in 1951 by Peter Whittle in his PhD thesis. Do you think he played with data and then came up with his model? No, he was guided by theory. The same model may describe visually very different data sets, and visualization rarely leads to model formulation.

Recall that the main idea behind autoregressive processes is to regress the variable on its own past values. In case of moving averages, we form linear combinations of elements of white noise. Combining the two ideas, we obtain the definition of the autoregressive–moving-average process:

(1) $y_t=\mu+\beta_1y_{t-1}+...+\beta_py_{t-p}+ \theta_1u_{t-1}+...+\theta_qu_{t-q}+u_t$

$=\mu+\sum_{i=1}^p\beta_iy_{t-i}+\sum_{i=1}^q\theta_iu_{t-i}+u_t.$

It is denoted ARMA(p,q), where p is the number of included past values and q is the number of included past errors (AKA shocks to the system). We should expect a couple of facts to hold for this process.
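
To make the definition concrete, here is a minimal simulation sketch of an ARMA(1,1) process in Python. The coefficients are illustrative choices of mine; since $Eu_t=0$, the stationary mean is $\mu/(1-\beta_1)$, which the sample mean should approach.

```python
import random

random.seed(3)

# ARMA(1,1): y_t = mu + beta*y_{t-1} + theta*u_{t-1} + u_t.
# Illustrative coefficients; |beta| < 1 keeps the process stable.
mu, beta, theta = 1.0, 0.6, 0.4
n = 5000

u_prev, y_prev = 0.0, mu / (1 - beta)   # start at the theoretical mean
ys = []
for _ in range(n):
    u = random.gauss(0, 1)              # white noise: the current shock has coefficient 1
    y = mu + beta * y_prev + theta * u_prev + u
    ys.append(y)
    u_prev, y_prev = u, y

# For these numbers the mean mu/(1-beta) equals 2.5; the sample mean should be close.
print(round(sum(ys) / n, 2))
```

Note that theta does not enter the mean, in line with the claim below that the moving average coefficients and $\mu$ have no effect on stationarity.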

If its characteristic polynomial $p(x)=1-\beta_1x-...-\beta_px^p$ has all roots outside the unit circle, the process is stable and should be stationary.

A stable process can be represented as an infinite moving average. Such a representation is in fact used to analyze its properties.

The coefficients of the moving average part (the thetas) and the constant $\mu$ have no effect on stationarity.

The quantity $\partial y_t/\partial y_{t-1}=\beta_1$ can be called an instantaneous effect of $y_{t-1}$ on $y_t$. This effect accumulates over time (the value at $t$ is influenced by the value at $t-1$, which, in turn, is influenced by the value at $t-2$ and so on). Therefore the long-run interpretation of the coefficients is complicated. Comparison of Figures 1 and 2 illustrates this point.

Exercise. For the model $y_t=\mu+\beta_1y_{t-1}+u_t$ find the mean $Ey_t$ (just modify this argument).

Figure 1. Simulated AR process

Figure 2. Simulated MA

Question. Why does the current error $u_t$ in (1) have coefficient 1? Choose the answer you like:

1) We never used the current error with a nontrivial coefficient.

2) It is logical to assume that past shocks may have an aftereffect (measured by thetas) on the current $y_t$ different from 1 but the effect of the current shock should be 1.

3) Mathematically, the case when instead of $u_t$ we have $\theta_0u_t$ with some nonzero $\theta_0$ can be reduced to the case when the current error has coefficient 1. Just introduce a new white noise $v_t=\theta_0u_t$ and rewrite the model using it.

16 Jun 17

## Moving average processes

Moving average processes: this time the intuition is mathematical and even geometric.

## Review and generalize

Science is a vertical structure, as I say in my book, and we have long passed the point after which looking back is as important as looking forward. So here are a couple of questions for the reader to review the past material.

Q1. What is a stochastic process? (Answer: imagine a real line with a random variable attached to each integer point.)

Q2. There are good (stationary) processes and bad (all other) processes. How do you define the good ones? (Hint: properties of means, covariances and variances are the bread and butter of professionals.)

Q3. White noise is the simplest (after a constant) type of stationary process. Give the definition, and don't hope for a hint. Do you realize that the elements of white noise don't interact with one another, in the sense that the covariance between any two of them is zero?

Idea. Define a class of stationary processes by forming linear combinations of elements of white noise.

Q4. A simple realization of this idea is given here. How do you generalize it?

(1) $y_t=u_t+\theta_1u_{t-1}+...+\theta_qu_{t-q}=u_t+\sum_{i=1}^q\theta_iu_{t-i},$

where $u_t$ is white noise, is called a moving average process of order $q$ and denoted MA(q).

Remarks. 1) The "moving average" name may be misleading. In Finance we use that name when the coefficients sum to one and are positive. Here the thetas do not necessarily sum to one and may change sign.

2) It would be better to say a "moving linear combination". The coefficients of the linear combination do not change but are applied to a moving segment of the white noise, starting from the element dated $t$ and going back to the element dated $t-q$. In this sense we say that (1) involves the segment $[t-q,t]$.

3) In Economics and Finance, the errors $u_t$ are treated as shocks. (1) tells us that the process is a result of the current shock and previous $q$ shocks.

## Moving average properties

First stationarity condition. The mean does not depend on time: $Ey_t=0$, which should be absolutely obvious by now.

Second stationarity condition. Variance does not depend on time:

$Var(y_t)=Ey_t^2=E(u_t+\sum_{i=1}^q\theta_iu_{t-i})(u_t+\sum_{i=1}^q\theta_iu_{t-i})=(1+\sum_{i=1}^q\theta_i^2)\sigma^2$

because only products $u_t^2, u_{t-1}^2,...$ have nonzero expectations.

Third stationarity condition. Here is where geometry is useful. If one linear combination involves the segment $[t-q,t]$ and the other the segment $[s-q,s]$, under what condition do these segments not overlap? Answer: when the distance $|s-t|$ between the points $s,t$ is larger than $q$. In this case the linear combinations have no common elements and $Cov(y_t,y_s)$ is zero.

Exercise. (I leave the tedious part to you.) Calculate $Cov(y_t,y_s)$ for $|s-t|\le q$.

Conclusion. MA(q) is stationary, for any thetas.
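
Both the variance formula and the vanishing covariance can be checked by simulation. The sketch below uses an MA(2) with illustrative thetas of my own choosing:

```python
import random

random.seed(5)

# MA(2): y_t = u_t + theta_1*u_{t-1} + theta_2*u_{t-2}, with illustrative thetas.
thetas = [0.5, -0.3]
q = len(thetas)
n = 200_000

# Generate white noise, then form the moving linear combinations.
u = [random.gauss(0, 1) for _ in range(n + q)]
y = [u[t + q] + sum(thetas[i] * u[t + q - 1 - i] for i in range(q)) for t in range(n)]

mean = sum(y) / n
def cov(lag):
    """Sample covariance of y_t and y_{t+lag}."""
    return sum((y[t] - mean) * (y[t + lag] - mean) for t in range(n - lag)) / (n - lag)

# Theoretical variance: (1 + sum of theta_i^2) * sigma^2 = 1 + 0.25 + 0.09 = 1.34.
print(round(cov(0), 2))   # should be near 1.34
print(round(cov(3), 2))   # lag 3 > q = 2: the segments don't overlap, so near 0
```

The sample variance matches $(1+\sum\theta_i^2)\sigma^2$, and the covariance at lags beyond $q$ is indistinguishable from zero, just as the geometric argument predicts.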

8 Jun 17

## Autoregressive processes

Autoregressive processes: going from the particular to the general is the safest option. Simple observations are the foundation of any theory.

## Intuition

Figure 1. Electricity load in France and Great Britain for 2001 to 2006

If you have only one variable, what can you regress it on? Only on its own past values (future values are not available at any given moment). Figure 1, on electricity demand, from a paper by J.W. Taylor illustrates this. A low value of electricity demand, say, in summer last year will drive down its value in summer this year. Overall, we would expect the electricity demand now to depend on its values over the past 12 months. Another important observation from this example is that this time series is probably stationary.

## AR(p) model

We want a definition of a class of stationary models. From this example we see that excluding a time trend increases the chances of obtaining a stationary process. The idea of regressing the process on its own past values is realized in

(1) $y_t=\mu+\beta_1y_{t-1}+...+\beta_py_{t-p}+u_t.$

Here $p$ is some positive integer. However, both this example and the one about random walk show that some condition on the coefficients $\mu,\beta_1,...,\beta_p$ will be required for (1) to be stationary. (1) is called an autoregressive process of order $p$ and denoted AR(p).

Exercise 1. Repeat the calculations for the AR(1) process to see that in the case $p=1$ the stability condition $|\beta_1|<1$ is sufficient for stationarity of (1) (that is, the coefficient $\mu$ has no impact on stationarity).

Question. How does this stability condition generalize to AR(p)?
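
Exercise 1 can also be checked numerically. The Python sketch below (with illustrative numbers of my own) simulates a stable AR(1): even when started far from its mean, the process forgets the starting value and settles around $\mu/(1-\beta_1)$, regardless of the size of $\mu$.

```python
import random

random.seed(1)

# AR(1): y_t = mu + beta*y_{t-1} + u_t, with |beta| < 1 (stability).
mu, beta, n = 2.0, 0.8, 10_000

y, ys = 50.0, []          # deliberately start far from the stationary mean
for _ in range(n):
    y = mu + beta * y + random.gauss(0, 1)
    ys.append(y)

# Theoretical stationary mean: mu / (1 - beta) = 10.
# Discard a burn-in so the starting value has been forgotten, then compare.
tail = ys[1000:]
print(round(sum(tail) / len(tail), 1))
```

Rerunning with a different $\mu$ only shifts the level the process settles at; it does not affect whether it settles, which is the point of the exercise.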

## Characteristic polynomial

Denote by $L$ the lag operator defined by $Ly_t=y_{t-1}$. More generally, its powers are defined by $L^ky_t=y_{t-k}$. Then (1) can be rewritten as

$y_t=\mu+\beta_1Ly_t+...+\beta_pL^py_t+u_t.$

Whoever first did this wanted to solve the equation for $y_t$. Moving all terms containing $y_t$ to the left, we have

$y_t-(\beta_1Ly_t+...+\beta_pL^py_t)=\mu+u_t.$

The identity operator is defined by $Iy_t=y_t$, so $y_t=Iy_t$. Factoring out $y_t$ we get

(2) $(I-\beta_1L-...-\beta_pL^p)y_t=\mu+u_t.$

Finally, formally solving for $y_t$ we have

(3) $y_t=(I-\beta_1L-...-\beta_pL^p)^{-1}(\mu+u_t).$

Definition 1. In $I-\beta_1L-...-\beta_pL^p$ replace the identity by 1 and the powers of the lag operator by powers of a real number $x$ to obtain the definition of the characteristic polynomial:

(4) $p(x)=1-\beta_1x-...-\beta_px^p.$

$p(x)$ is a polynomial of degree $p$ and by the fundamental theorem of algebra has $p$ roots.

Definition 2. We say that model (1) is stable if its characteristic polynomial (4) has all of its roots outside the unit circle, that is, the roots are larger than 1 in absolute value.

Under this stability condition the passage from (2) to (3) can be justified. For the AR(1) process this has actually been done.

Example 1. In case of a first-order process, $p(x)=1-\beta_1x$ has one root $x=1/\beta_1$ which lies outside the unit circle exactly when $|\beta_1|<1.$

Example 2. In case of a second-order process, $p(x)$ has two roots. If both of them are larger than 1 in absolute value, then the process is stable. The formula for the roots of a quadratic equation is well-known but stating it here wouldn't add much to what we know. Most statistical packages, including Stata, have procedures for checking stability.
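
As a sketch of such a stability check, the roots of the characteristic polynomial can be computed numerically. The AR(2) coefficients below are illustrative; note that numpy's root finder expects the highest-degree coefficient first.

```python
import numpy as np

# Stability check for an illustrative AR(2): p(x) = 1 - beta1*x - beta2*x^2.
beta1, beta2 = 0.5, 0.3
roots = np.roots([-beta2, -beta1, 1])   # coefficients in order x^2, x^1, x^0

# The model is stable if every root lies outside the unit circle.
stable = all(abs(r) > 1 for r in roots)
print(np.round(np.abs(roots), 3), stable)
```

For these coefficients both roots have absolute value above 1, so the process is stable; flipping, say, $\beta_1$ to a value near 1 would pull a root inside the unit circle.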

Remark. Hamilton uses a different definition of the characteristic polynomial (linked to vector autoregressions); that's why in his definition the roots of the characteristic equation should lie inside the unit circle.