May 18

Law of total probability - you could have invented this

A knight wants to kill (event K) a dragon. There are two ways to do this: by fighting (event F) the dragon or by outwitting (O) it. The choice of the way (F or O) is random, and in each case the outcome (K or not K) is also random. For the probability of killing there is a simple, intuitive formula:

P(K)=P(K|F)P(F)+P(K|O)P(O).

Its derivation is straightforward from the definition of conditional probability: since F and O cover the whole sample space and are disjoint, we have by additivity of probability

P(K)=P(K\cap(F\cup O))=P(K\cap F)+P(K\cap O)=\frac{P(K\cap F)}{P(F)}P(F)+\frac{P(K\cap O)}{P(O)}P(O)


This is easy to generalize to the case of many conditioning events. Suppose A_1,...,A_n are mutually exclusive (that is, disjoint) and collectively exhaustive (that is, cover the whole sample space). Then for any event B one has

P(B)=\sum_{i=1}^nP(B|A_i)P(A_i).

This equation is called the law of total probability.
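As a quick numerical illustration, here is a minimal Python sketch of the law of total probability. The function name `total_probability` and the knight's probabilities are made up for the example; they do not come from the post.

```python
def total_probability(cond_probs, partition_probs):
    """Return P(B) = sum_i P(B|A_i) P(A_i) for a partition A_1, ..., A_n."""
    if len(cond_probs) != len(partition_probs):
        raise ValueError("need one conditional probability per partition event")
    if abs(sum(partition_probs) - 1.0) > 1e-12:
        raise ValueError("partition probabilities must sum to 1")
    return sum(c * p for c, p in zip(cond_probs, partition_probs))


# Hypothetical numbers for the knight: P(F)=0.3, P(O)=0.7, P(K|F)=0.2, P(K|O)=0.6
p_kill = total_probability([0.2, 0.6], [0.3, 0.7])
print(round(p_kill, 4))  # prints 0.48
```

The check on the partition probabilities mirrors the requirement that the conditioning events be collectively exhaustive.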

Application to a sum of continuous and discrete random variables

Let X,Y be independent random variables. Suppose that X is continuous, with a distribution function F_X, and suppose Y is discrete, with values y_1,...,y_n. Then for the distribution function of the sum F_{X+Y} we have

F_{X+Y}(t)=P(X+Y\le t)=\sum_{j=1}^nP(X+Y\le t|Y=y_j)P(Y=y_j)

(by independence conditioning on Y=y_j can be omitted)

=\sum_{j=1}^nP(X\le t-y_j)P(Y=y_j)=\sum_{j=1}^nF_X(t-y_j)P(Y=y_j).

Compare this to the much more complex derivation in case of two continuous variables.
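The formula for F_{X+Y} is easy to evaluate numerically. The sketch below assumes, purely for illustration, that X is standard normal and Y takes the values 0 and 1 with probability 1/2 each; `cdf_sum` is a hypothetical helper name.

```python
from statistics import NormalDist


def cdf_sum(t, cdf_x, y_values, y_probs):
    """F_{X+Y}(t) = sum_j F_X(t - y_j) P(Y = y_j) for independent X and Y."""
    return sum(cdf_x(t - y) * p for y, p in zip(y_values, y_probs))


# Assumed example: X is standard normal, Y is 0 or 1 with equal probabilities
F_X = NormalDist(mu=0.0, sigma=1.0).cdf
value = cdf_sum(0.5, F_X, [0.0, 1.0], [0.5, 0.5])
print(value)  # equals 0.5 by symmetry, up to rounding
```

Any other continuous distribution can be plugged in by passing its distribution function as `cdf_x`.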


Oct 17

Reevaluating probabilities based on piece of evidence

This actually has to do with Bayes' theorem. However, in simple problems one can use a dead simple approach: just find the probabilities of all elementary events. This post builds upon the post on Significance level and power of test, including the notation. Be sure to review that post.

Here is an example from the guide for Quantitative Finance by A. Patton (University of London course code FN3142).

Activity 7.2 Consider a test that has a Type I error rate of 5%, and power of 50%.

Suppose that, before running the test, the researcher thinks that both the null and the alternative are equally likely.

  1. If the test indicates a rejection of the null hypothesis, what is the probability that the null is false?

  2. If the test indicates a failure to reject the null hypothesis, what is the probability that the null is true?

Denote events R = {Reject null}, A = {fAil to reject null}; T = {null is True}; F = {null is False}. Then we are given:

(1) P(F)=0.5;\ P(T)=0.5;

(2) P(R|T)=\frac{P(R\cap T)}{P(T)}=0.05;\ P(R|F)=\frac{P(R\cap F)}{P(F)}=0.5;

(1) and (2) show that we can find P(R\cap T) and P(R\cap F) and therefore also P(A\cap T) and P(A\cap F). Once we know probabilities of elementary events, we can find everything about everything.

Figure 1. Elementary events

Answering the first question: just plug probabilities in P(F|R)=\frac{P(R\cap F)}{P(R)}=\frac{P(R\cap F)}{P(R\cap T)+P(R\cap F)}.

Answering the second question: just plug probabilities in P(T|A)=\frac{P(A\cap T)}{P(A)}=\frac{P(A\cap T)}{P(A\cap T)+P(A\cap F)}.

Patton uses Bayes' theorem and the law of total probability. The solution suggested above uses only additivity of probability.
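For readers who want to see the numbers, here is a short Python sketch of the elementary-events approach for Activity 7.2. The variable names are mine; the inputs are the ones given in (1) and (2).

```python
p_T, p_F = 0.5, 0.5          # prior probabilities from (1)
p_R_given_T = 0.05           # Type I error rate from (2)
p_R_given_F = 0.5            # power from (2)

# Probabilities of the four elementary events
p_RT = p_R_given_T * p_T     # reject, null is true
p_RF = p_R_given_F * p_F     # reject, null is false
p_AT = p_T - p_RT            # fail to reject, null is true (additivity)
p_AF = p_F - p_RF            # fail to reject, null is false (additivity)

p_F_given_R = p_RF / (p_RT + p_RF)   # question 1
p_T_given_A = p_AT / (p_AT + p_AF)   # question 2
print(round(p_F_given_R, 4), round(p_T_given_A, 4))  # prints 0.9091 0.6552
```

So a rejection raises the probability that the null is false from 50% to about 91%, while a failure to reject raises the probability that the null is true only to about 66%.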


Oct 17

Significance level and power of test

In this post we discuss several interrelated concepts: null and alternative hypotheses, type I and type II errors and their probabilities. Review the definitions of a sample space and elementary events and that of a conditional probability.

Type I and Type II errors

Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (for example, the suspect is guilty) and the alternative hypothesis (the suspect is innocent). It's up to us what to call the null and what to call the alternative. However, the statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than the other probabilities involved. This is why what is desirable to prove is usually designated as the alternative.

Usually in books you can see the following table.

                                 Decision taken
State of nature      Fail to reject null      Reject null
Null is true         Correct decision         Type I error
Null is false        Type II error            Correct decision

This table is not good enough because there is no link to probabilities. The next video does fill in the blanks.

Video. Significance level and power of test

The conclusion from the video is that

\frac{P(T\cap R)}{P(T)}=P(R|T)=P(\text{Type I error})=\text{significance level},

\frac{P(F\cap R)}{P(F)}=P(R|F)=P(\text{Correctly rejecting false null})=\text{Power}.
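
To make the two conditional probabilities concrete, here is a Python sketch for one particular test. It is an assumption of mine, not something from the video: a one-sided z test whose statistic is standard normal under the null and normal with mean `shift` and unit variance under the alternative. The names `one_sided_z_test` and `shift` are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution


def one_sided_z_test(alpha, shift):
    """Return (critical value, P(R|T), P(R|F)) when we reject for z > critical value."""
    z_crit = std_normal.inv_cdf(1.0 - alpha)
    significance = 1.0 - std_normal.cdf(z_crit)      # P(R|T): reject a true null
    power = 1.0 - std_normal.cdf(z_crit - shift)     # P(R|F): reject a false null
    return z_crit, significance, power


z_crit, significance, power = one_sided_z_test(alpha=0.05, shift=1.0)
print(round(z_crit, 3), round(significance, 3), round(power, 3))
```

In this setup the power equals exactly 50% when `shift` coincides with the critical value, because then half of the alternative distribution lies in the rejection region.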
Jul 17

Alternatives to simple regression in Stata

In this post we looked at the dependence of EARNINGS on S (years of schooling). At the end I suggested thinking about possible variations of the model. Specifically, could the dependence be nonlinear? We consider two answers to this question.

Quadratic regression

This name is used for the quadratic dependence of the dependent variable on the independent variable. For our variables the dependence is

EARNINGS=\beta_1+\beta_2S+\beta_3S^2+u.

Note that the dependence on S is quadratic but the right-hand side is linear in the parameters, so we still are in the realm of linear regression. Video 1 shows how to run this regression.

Video 1. Running quadratic regression in Stata
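
Outside Stata, the point that quadratic regression is still linear regression can be seen in a few lines of Python. This is only a sketch under my own assumptions: the data are made up and noise-free, and `solve_linear_system` and `fit_quadratic` are hypothetical helper names. The idea is to treat 1, S and S^2 as three regressors and solve the usual normal equations.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_quadratic(s, y):
    """OLS for y = b1 + b2*s + b3*s^2 + u, using regressors 1, s, s^2."""
    rows = [[1.0, si, si * si] for si in s]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


schooling = [8, 10, 12, 14, 16, 18]
earnings = [1.0 + 0.5 * s + 0.1 * s * s for s in schooling]   # made-up, no noise
b1, b2, b3 = fit_quadratic(schooling, earnings)
print(round(b1, 4), round(b2, 4), round(b3, 4))
```

Because the made-up data lie exactly on a parabola, the estimates reproduce the coefficients used to generate them (up to rounding error).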

Nonparametric regression

The general way to write this model is

y=m(x)+u.

The beauty and power of nonparametric regression consists in the fact that we don't need to specify the functional form m of the dependence of y on x. Therefore there are no parameters to interpret, there is only the fitted curve. There is also the estimated equation of the nonlinear dependence, which is too complex to consider here. I already illustrated the difference between parametric and nonparametric regression. See in Video 2 how to run nonparametric regression in Stata.

Video 2. Nonparametric dependence
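
To give a feeling for how a fitted curve can be obtained without specifying a functional form, here is a Python sketch of one simple nonparametric estimator, the Nadaraya-Watson kernel regression with a Gaussian kernel. This is my choice for illustration and not necessarily the method Stata applies in the video; the data and the bandwidth are made up.

```python
import math


def kernel_regression(x_data, y_data, x0, bandwidth):
    """Nadaraya-Watson estimate at x0: a weighted average of the observed y."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in x_data]
    return sum(w * y for w, y in zip(weights, y_data)) / sum(weights)


x_data = [0.0, 1.0, 2.0, 3.0, 4.0]
y_data = [0.0, 1.0, 4.0, 9.0, 16.0]   # made-up data lying on y = x^2
fitted = [kernel_regression(x_data, y_data, x0, bandwidth=0.5) for x0 in x_data]
print([round(v, 2) for v in fitted])
```

Observations close to x0 get large weights and distant ones get small weights, so the fitted curve follows the data locally. The only tuning choice is the bandwidth: a larger value gives a smoother curve.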

Jul 17

Running simple regression in Stata

Running simple regression in Stata is, well, simple. It's just a matter of a couple of clicks. Try to turn it into a small research project.

  1. Obtain descriptive statistics for your data (Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics). Look at all that stuff you studied in introductory statistics: units of measurement, means, minimums, maximums, and correlations. Knowing the units of measurement will be important for interpreting regression results; correlations will predict signs of coefficients, etc. In your report, don't just mechanically repeat all those measures; try to find and discuss something interesting.
  2. Visualize your data (Graphics > Twoway graph). On the graph you can observe outliers and discern possible nonlinearity.
  3. After running regression, report the estimated equation. It is called a fitted line and in our case looks like this: Earnings = -13.93+2.45*S (use descriptive names and not abstract X,Y). To see if the coefficient of S is significant, look at its p-value, which is smaller than 0.001. This tells us that at all levels of significance larger than or equal to 0.001 the null that the coefficient of S is zero (that is, S is insignificant) is rejected. This follows from the definition of p-value. Nobody cares about significance of the intercept. Report also the p-value of the F statistic. It characterizes significance of all nontrivial regressors and is important in case of multiple regression. The last statistic to report is R squared.
  4. Think about possible variations of the model. Could the dependence of Earnings on S be nonlinear? What other determinants of Earnings would you suggest from among the variables in Dougherty's file?

Figure 1. Looking at data. For data, we use a scatterplot.


Figure 2. Running regression (Statistics > Linear models and related > Linear regression)
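
The fitted line and R squared from step 3 can also be reproduced by hand. Below is a Python sketch of the textbook formulas for simple regression. The function name `simple_ols` and the data are made up (they are not Dougherty's data), and p-values are not computed here.

```python
def simple_ols(x, y):
    """Return (intercept, slope, R squared) of the fitted line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


schooling = [10, 12, 12, 14, 16, 18]
earnings = [9.0, 15.0, 17.0, 21.0, 24.0, 31.0]   # made-up hourly earnings
intercept, slope, r_squared = simple_ols(schooling, earnings)
print(f"Earnings = {intercept:.2f} + {slope:.2f}*S, R squared = {r_squared:.3f}")
```

Printing the equation with descriptive variable names follows the advice in step 3.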

Jun 17

Introduction to Stata

Introduction to Stata: Stata interface, how to use Stata Help, how to use Data Editor and how to graph data. Important details to remember:

  1. In any program, the first thing to use is Help. I learned everything from Help and never took any programming courses.
  2. The number of observations for all variables in one data file must be the same. This can be a problem if, for example, you want to see out-of-sample predictions.
  3. In Data Editor, numeric variables are displayed in black and strings are displayed in red.
  4. The name of the hidden variable that counts observations is _n
  5. If you have several definitions of graphs in two-way graphs menu, they will be graphed together or separately, depending on what is enabled/disabled.

See details in videos. Sorry about the background noise!

Video 1. Stata interface. The windows introduced: Results, Command, Variables, Properties, Review and Viewer.

Video 2. Using Stata Help. Help can be used through the Viewer window or in a separate pdf viewer. Eviews Help is much easier to understand.

Video 3. Using Data Editor. How to open and view variables, the visual difference between numeric variables and string variables. The lengths of all variables in the same file must be the same.

Video 4. Graphing data. To graph a variable, you need to define its graph and then display it. It is possible to display more than one variable on the same chart.

Feb 17

Review of Hinders "5 steps"

Review of Hinders "5 steps, 2010-2011 Edition"

This is a review of "5 Steps to a 5 AP Statistics, 2010-2011 Edition" by Duane Hinders. The latest edition is "5 Steps to a 5 AP Statistics, 2017 Edition" by C. Andreasen, D. Hinders, and D. McDonald, which I, unfortunately, don't have.

The main part of the book has 14 chapters. These chapters plus the preface, introduction, practice exams, appendices etc. are organized in 5 units, hence the "5 steps" in the name of the book. The book is concise and explanations are as clear as they can be without derivations. See for yourself: of the 14 chapters, the first four contain mainly methodological recommendations and the real study starts from Chapter 5. There are just 385 pages and the latest edition is about the same size.

When one skips the theory and uses book formulas blindly, there is no guarantee that the result will be correct, because the book may contain errors and typos. Hinders does not avoid a common misconception about confidence intervals, but I hope that's the only lapse.

Figure 1. Cognitive ability

Unlike other books I reviewed before (Agresti and Franklin; Albert and Rossman; Newbold, Carlson and Thorne; and Bock, Velleman, De Veaux), this one has real exam questions and better uses the reader's time. If I wanted to pass the AP exam without spending too much time on preparation, I would read the "5 steps".

Perhaps the exposition wouldn't look so clear to me if I didn't have prior knowledge of Statistics, but younger readers may have an advantage I don't: a better memory. I try to illustrate this in Figure 1. I have stronger analytical skills than most 20-year-olds but their memory is better. Because of the age trade-off, overall my cognitive abilities are about the same as those of young people. Read and think about every word, and you may succeed.

Feb 17

The pearls of AP Statistics 37

Confidence interval: attach probability or not attach?

I am reading "5 Steps to a 5 AP Statistics, 2010-2011 Edition" by Duane Hinders (sorry, I don't have the latest edition). The tip at the bottom of p.200 says:

For the exam, be VERY, VERY clear on the discussion above. Many students
seem to think that we can attach a probability to our interpretation of a confidence
interval. We cannot.

This is one of those misconceptions that travel from book to book. Below I show how it may have arisen.

Confidence interval derivation

The intuition behind the confidence interval and the confidence interval derivation using z score have been given here. To make the discussion close to Duane Hinders, I show the confidence interval derivation using the t statistic. Let X_1,...,X_n be a sample of independent observations from a normal population, \mu the population mean and s the sample standard deviation (so that s/\sqrt{n} is the standard error of the sample mean). Skipping the intuition, let's go directly to the t statistic

(1) t=\frac{\bar{X}-\mu}{s/\sqrt{n}}.

At the 95% confidence level, from statistical tables find the critical value t_{cr,0.95} of the t statistic such that

P(-t_{cr,0.95}<t<t_{cr,0.95})=0.95.

Plug here (1) to get

(2) P(-t_{cr,0.95}<\frac{\bar{X}-\mu}{s/\sqrt{n}}<t_{cr,0.95})=0.95.

Using equivalent transformations of inequalities (multiplying them by s/\sqrt{n} and adding \mu to all sides) we rewrite (2) as

(3) P(\mu-t_{cr,0.95}\frac{s}{\sqrt{n}}<\bar{X}<\mu+t_{cr,0.95}\frac{s}{\sqrt{n}})=0.95.

Thus, we have proved

Statement 1. The interval \mu\pm t_{cr,0.95}\frac{s}{\sqrt{n}} contains the values of the sample mean with probability 95%.

The left-side inequality in (3) is equivalent to \mu<\bar{X}+t_{cr,0.95}\frac{s}{\sqrt{n}} and the right-side one is equivalent to \bar{X}-t_{cr,0.95}\frac{s}{\sqrt{n}}<\mu. Combining these two inequalities, we see that (3) can be equivalently written as

(4) P(\bar{X}-t_{cr,0.95}\frac{s}{\sqrt{n}}<\mu<\bar{X}+t_{cr,0.95}\frac{s}{\sqrt{n}})=0.95.

So, we have

Statement 2. The interval \bar{X}\pm t_{cr,0.95}\frac{s}{\sqrt{n}} contains the population mean with probability 95%.

Source of the misconception

In (3), the variable in the middle (\bar{X}) is random, and the statement that it belongs to some interval is naturally probabilistic. People not familiar with the above derivation don't understand how a statement that the population mean (which is a constant) belongs to some interval can be probabilistic. It's the interval ends that are random in (4) (the sample mean and standard error are both random), that's why there is probability! Statements 1 and 2 are equivalent!

My colleague Aidan Islyami mentioned that we should distinguish estimates from estimators.

In all statistical derivations random variables are ex-ante (before the event). No book says that but that's the way it is. An estimate is an ex-post (after the event) value of an estimator. An estimate is, of course, a number and not a random variable. Ex-ante, a confidence interval always has a probability. Ex-post, the fact that an estimate belongs to some interval is deterministic (has probability either 0 or 1) and it doesn't make sense to talk about 95%.

Since confidence levels are always strictly between 0 and 100%, students should keep in mind that we deal with ex-ante variables.
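
The ex-ante meaning of Statement 2 can be checked by simulation. The Python sketch below repeatedly draws samples from a normal population, builds the interval \bar{X}\pm t_{cr,0.95}\frac{s}{\sqrt{n}} each time and counts how often it contains \mu. The population parameters, the sample size n=10 and the function name `coverage` are my assumptions; 2.262 is the tabulated critical value for 9 degrees of freedom.

```python
import random
import statistics


def coverage(mu, sigma, n, t_crit, replications, seed=0):
    """Share of simulated samples whose confidence interval contains mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        s = statistics.stdev(sample)              # sample standard deviation
        half_width = t_crit * s / n ** 0.5
        if x_bar - half_width < mu < x_bar + half_width:
            hits += 1                             # this realized interval covers mu
    return hits / replications


# t critical value for 95% confidence and n - 1 = 9 degrees of freedom (from tables)
share = coverage(mu=5.0, sigma=2.0, n=10, t_crit=2.262, replications=10000)
print(share)  # should be close to 0.95
```

Each realized interval either contains \mu or it doesn't (the ex-post view), while the share over many replications approximates the ex-ante probability of 95%.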
Feb 17

Review of Bock, Velleman, De Veaux

Review of Bock, Velleman, De Veaux "Stats: Modeling the World", Addison-Wesley, 3rd ed. 2010

Who is this book for

Once I asked my students to apply regression. They had to choose whatever problem they liked, find the data on the Internet, run the software and interpret the result. So, one student finds data with an unusual layout. Usually, data labels run across the top, while along columns you have observations. In his case, data labels are at the top and on the left. He arbitrarily selects the columns of data, runs the regression, comes to me and asks: Could you tell me what my variables are? He may have read this book which on p.15 says what a data table is: An arrangement of data in which each row represents a case and each column represents a variable.

There is a category of students I call open-minded. Their brains are unencumbered by prejudices. Their minds are open to whatever they are taught, provided that it is fun. To their tastes, a book is good if it can be productively read while lying on the couch. Most importantly, they prefer verbal explanations to equations. A real-life application of every piece of theory is a must.

If you are this type, go ahead and read this book. The authors did their best to get to you. If, after reading the book, you think "Statistics is an impressive science", you will be right, except that you will know little about it. Or, perhaps, you will know a lot, depending on the definition of intuition and how much of it you absorb. If you are more mathematically oriented, you can even find occasional food for thought, like the derivation of the standard error for a predicted mean value on p.668. By the way, that's where the authors say that the Central Limit Theorem tells us that the standard deviation of \bar{y} is \frac{\sigma}{\sqrt{n}}. No, the CLT is a little bit more complex than that, and I'm sure the authors are aware of that; they just want you to understand and be happy.


Most of the praise and criticism I expressed about Agresti and Franklin applies to this book too, including the remark about photos that have nothing to do with the subject matter. However, this book seems to me a little more rigorous, although at the same level. You have more choice in terms of using statistical software. Appendix B (Guide to Statistical Software) contains directions about using Data Desk, Excel, JMP, MINITAB, SPSS, TI-89 and TI-NSPIRE.

Word of caution

The exposition is highly informal. For example, normal distributions (called normal models in the book) are never formally defined. All you learn about them is manipulations with the z-score and visualization of the distribution shape. If you move to a higher level, you will be surprised by the fact that you need to study everything anew. For those who are serious about their studies, I want to share my opinion about the progression from introductory through intermediate to advanced courses. The introductory level is a game for kids. Since they don't know much about science, they usually believe that it is the true science. The intermediate level is not much better, as I came to know when I studied Economics at Oregon State University. It is at the advanced level of a curriculum that students become experts in the field.

To me, the introductory and intermediate levels are artificial barriers on the way to professionalism, invented in order to extract more money from students. When I was teaching intermediate Econometrics in the US, the deputy head of the department expressly asked me to be lenient with bachelor students because they provided funds for master's and PhD programs. Since most university courses are intermediate at best, a university graduate is usually not considered a specialist and is forced to take a master's or PhD program. From a societal point of view, this is a huge waste of time and money.

Dec 16

It’s time to modernize the AP Stats curriculum

The suggestions below are based on the College Board AP Statistics Course Description, Effective Fall 2010. Citing this description, “AP teachers are encouraged to develop or maintain their own curriculum that either includes or exceeds each of these expectations; such courses will be authorized to use the “AP” designation.” However, AP teachers are constrained by the statement that “The Advanced Placement Program offers a course description and exam in statistics to secondary school students who wish to complete studies equivalent to a one semester, introductory, non-calculus-based, college course in statistics.”

Too much material for a one-semester course

I tried to teach AP Stats in one semester following the College Board description and methodology. That is, with no derivations, giving only recipes and concentrating on applications. The students were really stretched, didn’t remember anything after completing the course, and the usefulness of the course for the subsequent calculus-based course was minimal.

Suggestion. Reduce the number of topics and concentrate on those that require going all the way from (again citing the description) Exploring Data to Sampling and Experimentation to Anticipating Patterns to Statistical Inference. Simple regression is such a topic.

I would drop the stem-and-leaf plot, because it is stupid, and the chi-square tests for goodness of fit, homogeneity of proportions and independence, as well as ANOVA, because they are too advanced and look too vague without the right explanation. Instead of going wide, it is better to go deeper, building upon what students already know. I’ll post a couple of regression applications.

“Introductory” should not mean stupefying

Statistics has its specifics. Even I, with my extensive experience in Math, made quite a few discoveries for myself while studying Stats. Textbook authors, in their attempts to make exposition accessible, often replace the true statistical ideas by after-the-fact intuition or formulas by their verbal description. See, for example, the z score.

Using TI-83+ and TI-84 graphing calculators is like using a Tesla electric car in conjunction with candles for generating electricity. The sole purpose of these calculators is to prevent cheating. The inclination for cheating is a sign of low understanding and the best proof that the College Board strategy is wrong.

Once you say “this course is non-calculus-based”, you close many doors

When we format a document in Word, we don’t care how formatting is implemented technically and we don’t need to know anything about programming. It looks like the same attitude is imparted to students of Stats. Few people notice a big difference. When we format a document, we have an idea of what we want and test the result against that idea. In Stats, the idea has to be translated to a formula, and the software output has to be translated into a formula for interpretation.

I understand that, for the majority of Stats students, the amount of algebra I use in some of my posts is not accessible. However, the opposite tendency of telling students that they don’t need to remember any formulas is unproductive. It’s only by memorizing and reproducing equations that they can augment their algebraic proficiency. Stats is largely a mental science. To improve mental activity, you have to engage in it.

Suggestion. Instead of “this course is non-calculus-based”, say: the course develops the ability to interpret equations and translate ideas to formulas.

Follow a logical sequence

The way most AP Stats books are written does not give any idea as to what comes from where. When I was a bachelor student, I was looking for explanations, and I would hate reading one of today’s AP Stats textbooks. For those who think, memorizing a bunch of recipes, without seeing the logical links, is a nightmare. In some cases, the absence of logic leads to statements that are plain wrong. Just following the logical sequence will put the pieces of the puzzle together.