2
May 18

## Law of total probability - you could have invented this

A knight wants to kill (event $K$) a dragon. There are two ways to do this: by fighting (event $F$) the dragon or by outwitting ($O$) it. The choice of the way ($F$ or $O$) is random, and in each case the outcome ($K$ or not $K$) is also random. For the probability of killing there is a simple, intuitive formula:

$P(K)=P(K|F)P(F)+P(K|O)P(O)$.

Its derivation is straightforward from the definition of conditional probability: since $F$ and $O$ cover the whole sample space and are disjoint, we have by additivity of probability

$P(K)=P(K\cap(F\cup O))=P(K\cap F)+P(K\cap O)=\frac{P(K\cap F)}{P(F)}P(F)+\frac{P(K\cap O)}{P(O)}P(O)$

$=P(K|F)P(F)+P(K|O)P(O)$.

This is easy to generalize to the case of many conditioning events. Suppose $A_1,...,A_n$ are mutually exclusive (that is, disjoint) and collectively exhaustive (that is, cover the whole sample space). Then for any event $B$ one has

$P(B)=P(B|A_1)P(A_1)+...+P(B|A_n)P(A_n)$.

This equation is called the law of total probability.
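To see the formula at work, here is a minimal Python sketch. The probabilities for the knight are made up for illustration, and the helper name `total_probability` is my own, not part of any library.

```python
def total_probability(conditionals, priors):
    """P(B) = sum_i P(B|A_i) P(A_i) for a partition A_1, ..., A_n."""
    if abs(sum(priors) - 1.0) > 1e-12:
        raise ValueError("P(A_i) must sum to 1 (the events must be exhaustive)")
    return sum(c * p for c, p in zip(conditionals, priors))

# Hypothetical numbers: P(F)=0.7, P(O)=0.3, P(K|F)=0.2, P(K|O)=0.6
p_kill = total_probability([0.2, 0.6], [0.7, 0.3])
print(p_kill)
```

With these assumed inputs the result is $0.2\times 0.7+0.6\times 0.3=0.32$, up to floating-point rounding.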

## Application to a sum of continuous and discrete random variables

Let $X,Y$ be independent random variables. Suppose that $X$ is continuous, with a distribution function $F_X$, and suppose $Y$ is discrete, with values $y_1,...,y_n$. Then for the distribution function of the sum $F_{X+Y}$ we have

$F_{X+Y}(t)=P(X+Y\le t)=\sum_{j=1}^nP(X+Y\le t|Y=y_j)P(Y=y_j)$

(by independence, conditioning on $Y=y_j$ can be omitted once $Y$ is replaced by $y_j$)

$=\sum_{j=1}^nP(X\le t-y_j)P(Y=y_j)=\sum_{j=1}^nF_X(t-y_j)P(Y=y_j)$.

Compare this to the much more complex derivation in case of two continuous variables.
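The last formula is easy to evaluate numerically. The sketch below assumes, purely for illustration, that $X$ is standard normal and $Y$ takes the values $-1$ and $1$ with equal probabilities; the function names are hypothetical.

```python
import math

def normal_cdf(x):
    """Distribution function of a standard normal variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cdf_of_sum(t, cdf_x, values, probs):
    """F_{X+Y}(t) = sum_j F_X(t - y_j) P(Y = y_j)."""
    return sum(cdf_x(t - y) * p for y, p in zip(values, probs))

# Assumed example: X ~ N(0, 1), Y = -1 or 1 with probability 1/2 each
values, probs = [-1.0, 1.0], [0.5, 0.5]
for t in (-3.0, 0.0, 3.0):
    print(t, cdf_of_sum(t, normal_cdf, values, probs))
```

In this symmetric example the value at $t=0$ is $1/2$, which is a handy sanity check.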

4
Apr 18

## Distribution function estimation

The relativity theory says that what initially looks absolutely difficult, on closer examination turns out to be relatively simple. Here is one such topic. We start with a motivating example.

Large cloud service providers have huge data centers. A data center, being a large group of computer servers, typically requires extensive air conditioning. The intensity and cost of air conditioning depend on the temperature of the surrounding environment. If, as in our motivating example, we denote by $T$ the temperature outside and by $t$ a cut-off value, then a cloud service provider is interested in knowing the probability $P(T\le t)$ for different values of $t$. This is exactly the distribution function of temperature: $F_T(t)=P(T\le t)$. So how do you estimate it?

It comes down to usual sampling. Fix some cut-off, for example, $t=20$ and see for how many days in a year the temperature does not exceed 20. If the number of such days is, say, 200, then 200/365 will be the estimate of the probability $P(T\le 20)$.

It remains to dress this idea in mathematical clothes.

## Empirical distribution function

If an observation $T_i$ belongs to the event $\{T\le 20\}$, we count it as 1, otherwise we count it as zero. That is, we are dealing with a dummy variable

(1) $1_{\{T\le 20\}}=\left\{\begin{array}{ll}1,&T\le 20;\\0,&T>20.\end{array}\right.$

The total count is $\sum 1_{\{T_i\le 20\}}$ and this is divided by the total number of observations, which is 365, to get 200/365.

It is important to realize that the variable in (1) is a coin (Bernoulli variable). For an unfair coin $C$ with probability of 1 equal to $p$ and probability of zero equal to $1-p$ the mean is

$EC=p\times 1+(1-p)\times 0=p$

and the variance is

$Var(C)=EC^2-(EC)^2=p-p^2=p(1-p)$.

For the variable in (1) $p=P\{1_{\{T\le 20\}}=1\}=P(T\le 20)=F_T(20)$, so the mean and variance are

(2) $E1_{\{T\le 20\}}=F_T(20),\ Var(1_{\{T\le 20\}})=F_T(20)(1-F_T(20))$.

Generalizing, the probability $F_T(t)=P(T\le t)$ is estimated by

(3) $\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}$

where $n$ is the number of observations. (3) is called an empirical distribution function because it is a direct empirical analog of $P(T\le t)$.
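Formula (3) translates into code almost word for word. Here is a minimal Python sketch; the temperature readings are invented for illustration and `ecdf` is a hypothetical helper name.

```python
def ecdf(sample, t):
    """Share of observations not exceeding t: (1/n) * sum of indicators 1{T_i <= t}."""
    if not sample:
        raise ValueError("sample must be non-empty")
    return sum(1 for x in sample if x <= t) / len(sample)

# Hypothetical temperature readings
temps = [12.5, 18.0, 20.0, 23.1, 25.4, 19.9, 30.2, 15.0]
print(ecdf(temps, 20))
```

Five of the eight made-up readings do not exceed 20, so the estimate of $P(T\le 20)$ here is 5/8.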

Applying expectation to (3) and using an equation similar to (2), we prove unbiasedness of our estimator:

(4) $E\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}=P(T\le t)=F_T(t)$.

Further, assuming independent observations we can find variance of (3):

(5) $Var\left(\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}\right)$ (using homogeneity of degree 2)

$=\frac{1}{n^2}Var\left(\sum_{i=1}^n 1_{\{T_i\le t\}}\right)$ (using independence)

$=\frac{1}{n^2}\sum_{i=1}^nVar(1_{\{T_i\le t\}})$ (applying an equation similar to (2))

$=\frac{1}{n^2}\sum_{i=1}^nF_T(t)(1-F_T(t))=\frac{1}{n}F_T(t)(1-F_T(t)).$

Corollary. (4) and (5) can be used to prove that (3) is a consistent estimator of the distribution function, i.e., (3) converges to $F_T(t)$ in probability.
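A small simulation can illustrate (4) and (5). The sketch below assumes observations from the uniform distribution on $(0,1)$, for which $F_T(t)=t$; the sample size, number of repetitions and seed are arbitrary choices, so treat the output as an illustration rather than a proof.

```python
import random
import statistics

def ecdf_at(sample, t):
    """Empirical distribution function (3) evaluated at t."""
    return sum(1 for x in sample if x <= t) / len(sample)

def simulate_ecdf(t, n, reps, seed=0):
    """Draw `reps` samples of size n from Uniform(0, 1); return the ECDF values at t."""
    rng = random.Random(seed)
    return [ecdf_at([rng.random() for _ in range(n)], t) for _ in range(reps)]

t, n, reps = 0.3, 50, 4000
estimates = simulate_ecdf(t, n, reps)
sim_mean = statistics.mean(estimates)       # should be close to F_T(t) = t
sim_var = statistics.pvariance(estimates)   # should be close to t(1-t)/n
theory_var = t * (1 - t) / n
print(sim_mean, sim_var, theory_var)
```

Increasing `n` shrinks the theoretical variance $F_T(t)(1-F_T(t))/n$, which is the mechanism behind consistency.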

8
Oct 17

## Reevaluating probabilities based on a piece of evidence

This actually has to do with Bayes' theorem. However, in simple problems one can use a dead simple approach: just find probabilities of all elementary events. This post builds upon the post on Significance level and power of test, including the notation. Be sure to review that post.

Here is an example from the guide for Quantitative Finance by A. Patton (University of London course code FN3142).

Activity 7.2 Consider a test that has a Type I error rate of 5%, and power of 50%.

Suppose that, before running the test, the researcher thinks that both the null and the alternative are equally likely.

1. If the test indicates a rejection of the null hypothesis, what is the probability that the null is false?

2. If the test indicates a failure to reject the null hypothesis, what is the probability that the null is true?

Denote events R = {Reject null}, A = {fAil to reject null}; T = {null is True}; F = {null is False}. Then we are given:

(1) $P(F)=0.5;\ P(T)=0.5;$

(2) $P(R|T)=\frac{P(R\cap T)}{P(T)}=0.05;\ P(R|F)=\frac{P(R\cap F)}{P(F)}=0.5;$

(1) and (2) show that we can find $P(R\cap T)$ and $P(R\cap F)$ and therefore also $P(A\cap T)$ and $P(A\cap F).$ Once we know probabilities of elementary events, we can find everything about everything.

Figure 1. Elementary events

Answering the first question: just plug probabilities in $P(F|R)=\frac{P(R\cap F)}{P(R)}=\frac{P(R\cap F)}{P(R\cap T)+P(R\cap F)}.$

Answering the second question: just plug probabilities in $P(T|A)=\frac{P(A\cap T)}{P(A)}=\frac{P(A\cap T)}{P(A\cap T)+P(A\cap F)}.$

Patton uses Bayes' theorem and the law of total probability. The solution suggested above uses only additivity of probability.
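The arithmetic can be checked with a few lines of Python. This is only a sketch of the elementary-events approach with the numbers given in the activity; the variable names are mine.

```python
# Given in the activity
p_T, p_F = 0.5, 0.5        # prior probabilities that the null is true / false
p_R_given_T = 0.05         # Type I error rate
p_R_given_F = 0.5          # power

# Probabilities of the four elementary events
p_RT = p_R_given_T * p_T   # reject, null true
p_RF = p_R_given_F * p_F   # reject, null false
p_AT = p_T - p_RT          # fail to reject, null true
p_AF = p_F - p_RF          # fail to reject, null false

p_F_given_R = p_RF / (p_RT + p_RF)   # question 1
p_T_given_A = p_AT / (p_AT + p_AF)   # question 2
print(round(p_F_given_R, 4), round(p_T_given_A, 4))
```

The two answers come out at about 0.9091 and 0.6552.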

6
Oct 17

## Significance level and power of test

In this post we discuss several interrelated concepts: null and alternative hypotheses, type I and type II errors and their probabilities. Review the definitions of a sample space and elementary events and that of a conditional probability.

## Type I and Type II errors

Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (like the suspect is guilty) and alternative hypothesis (the suspect is innocent). It's up to us what to call the null and what to call the alternative. However, the statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than other involved probabilities. This is why what is desirable to prove is usually designated as the alternative.

Usually in books you can see the following table.

| State of nature | Decision taken: fail to reject null | Decision taken: reject null |
|---|---|---|
| Null is true | Correct decision | Type I error |
| Null is false | Type II error | Correct decision |

This table is not good enough because there is no link to probabilities. The next video does fill in the blanks.

Video. Significance level and power of test

## Significance level and power of test

The conclusion from the video is that

$\frac{P(T\cap R)}{P(T)}=P(R|T)=P(\text{Type I error})=\text{significance level}$,

$\frac{P(F\cap R)}{P(F)}=P(R|F)=P(\text{correctly rejecting a false null})=\text{power}$.

26
Jul 17

## Nonlinear least squares

Here we explain the idea, illustrate the possible problems in Mathematica and, finally, show the implementation in Stata.

## Idea: minimize RSS, as in ordinary least squares

Observations come in pairs $(x_1,y_1),...,(x_n,y_n)$. In case of ordinary least squares, we approximated the y's with linear functions of the parameters, possibly nonlinear in x's. Now we use a function $f(a,b,x_i)$ which may be nonlinear in $a,b$. We still minimize RSS which takes the form $RSS=\sum r_i^2=\sum(y_i-f(a,b,x_i))^2$. Nonlinear least squares estimators are the values $a,b$ that minimize RSS. In general, it is difficult to find the formula (closed-form solution), so in practice software, such as Stata, is used for RSS minimization.
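To make the objective concrete, here is a minimal Python sketch of RSS for a nonlinear model. The exponential form $f(a,b,x)=ae^{bx}$ and the data are assumptions chosen for illustration only.

```python
import math

def rss(a, b, xs, ys, f):
    """Residual sum of squares for the model y = f(a, b, x)."""
    return sum((y - f(a, b, x)) ** 2 for x, y in zip(xs, ys))

def exp_model(a, b, x):
    # Hypothetical nonlinear model, nonlinear in the parameter b
    return a * math.exp(b * x)

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [exp_model(2.0, 0.7, x) for x in xs]   # noise-free data generated with a=2, b=0.7

rss_true = rss(2.0, 0.7, xs, ys, exp_model)    # zero at the generating parameters
rss_other = rss(1.5, 0.9, xs, ys, exp_model)   # positive at other parameter values
print(rss_true, rss_other)
```

A nonlinear least squares routine searches over $(a,b)$ for the pair that makes this number as small as possible.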

## Simplified idea and problems in one-dimensional case

Suppose we want to minimize $f(x)$. The Newton algorithm (default in Stata) is an iterative procedure that consists of steps:

1. Select the initial value $x_0$.
2. Find the derivative (or tangent) of $f$ at $x_0$. Make a small step in the descent direction (indicated by the derivative), to obtain the next value $x_1$.
3. Repeat Step 2, using $x_1$ as the starting point, until the difference between the values of the objective function at two successive points becomes small. The last point $x_n$ will approximate the minimizing point.

Problems:

1. The minimizing point may not exist.
2. When it exists, it may not be unique. In general, there is no way to find out how many local minimums there are and which ones are global.
3. The point to which the algorithm converges depends on the initial point.
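The steps and the third problem can be sketched in Python. Note that the steps as listed describe plain descent along the derivative with a fixed small step; a full Newton step would also use the second derivative. The objective $f(x)=(x^2-1)^2$, the step size and the tolerance are my assumptions for illustration.

```python
def f(x):
    # Assumed objective with two minimizing points, x = -1 and x = 1
    return (x ** 2 - 1) ** 2

def df(x):
    # Derivative of f
    return 4 * x * (x ** 2 - 1)

def descend(x0, step=0.05, tol=1e-12, max_iter=10_000):
    """Take small steps against the derivative until f stops changing."""
    x = x0
    for _ in range(max_iter):
        x_next = x - step * df(x)
        if abs(f(x) - f(x_next)) < tol:
            return x_next
        x = x_next
    return x

right = descend(0.5)    # starts to the right of zero
left = descend(-0.5)    # starts to the left of zero
print(right, left)
```

Starting from 0.5 the iterations approach 1, while starting from -0.5 they approach -1: same function, same algorithm, different answers.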

See Video 1 for illustration in the one-dimensional case.

Video 1. NLS geometry

## Problems illustrated in Mathematica

Here we look at three examples of nonlinear functions, two of which are considered in Dougherty. The first one is a power function (it can be linearized by applying logs) and the second is an exponential function (it cannot be linearized). The third function gives rise to two minimums. The possibilities are illustrated in Mathematica.

Video 2. NLS illustrated in Mathematica

## Finally, implementation in Stata

Here we show how to 1) generate a random vector, 2) create a vector of initial values, and 3) program a nonlinear dependence.

Video 3. NLS implemented in Stata

10
Jul 17

## Alternatives to simple regression in Stata

In the post on running simple regression in Stata we looked at the dependence of EARNINGS on S (years of schooling). At the end I suggested thinking about possible variations of the model. Specifically, could the dependence be nonlinear? We consider two answers to this question.

## Quadratic regression

This name is used for the quadratic dependence of the dependent variable on the independent variable. For our variables the dependence is

$EARNINGS=a+bS+cS^2+u$.

Note that the dependence on S is quadratic but the right-hand side is linear in the parameters, so we still are in the realm of linear regression. Video 1 shows how to run this regression.
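Because the model is linear in $a,b,c$, ordinary least squares applies with regressors $1,S,S^2$. The Python sketch below solves the normal equations directly; it is not what Stata does internally, and the data are made up and noise-free so that the known coefficients can be recovered.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def quadratic_ols(s, y):
    """OLS for y = a + b*S + c*S^2 via the normal equations X'X beta = X'y."""
    rows = [[1.0, v, v * v] for v in s]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)

# Made-up data generated from a=1, b=2, c=0.5 without an error term
schooling = [8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
earnings = [1 + 2 * v + 0.5 * v * v for v in schooling]
coefs = quadratic_ols(schooling, earnings)
print(coefs)
```

The estimates should reproduce the generating coefficients up to rounding, which confirms that a quadratic dependence is handled by linear regression machinery.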

Video 1. Running quadratic regression in Stata

## Nonparametric regression

The general way to write this model is

$y=m(x)+u.$

The beauty and power of nonparametric regression consists in the fact that we don't need to specify the functional form of dependence of $y$ on $x$. Therefore there are no parameters to interpret, there is only the fitted curve. There is also the estimated equation of the nonlinear dependence, which is too complex to consider here. I already illustrated the difference between parametric and nonparametric regression. See in Video 2 how to run nonparametric regression in Stata.

Video 2. Nonparametric dependence

6
Jul 17

## Running simple regression in Stata

Running simple regression in Stata is, well, simple. It's just a matter of a couple of clicks. Try to make it a small research project.

1. Obtain descriptive statistics for your data (Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics). Look at all that stuff you studied in introductory statistics: units of measurement, means, minimums, maximums, and correlations. Knowing the units of measurement will be important for interpreting regression results; correlations will predict signs of coefficients, etc. In your report, don't just mechanically repeat all those measures; try to find and discuss something interesting.
2. Visualize your data (Graphics > Twoway graph). On the graph you can observe outliers and discern possible nonlinearity.
3. After running regression, report the estimated equation. It is called a fitted line and in our case looks like this: Earnings = -13.93+2.45*S (use descriptive names and not abstract X,Y). To see if the coefficient of S is significant, look at its p-value, which is smaller than 0.001. This tells us that at all levels of significance larger than or equal to 0.001 the null that the coefficient of S is zero (that is, insignificant) is rejected. This follows from the definition of p-value. Nobody cares about significance of the intercept. Report also the p-value of the F statistic. It characterizes significance of all nontrivial regressors and is important in case of multiple regression. The last statistic to report is R squared.
4. Think about possible variations of the model. Could the dependence of Earnings on S be nonlinear? What other determinants of Earnings would you suggest from among the variables in Dougherty's file?

Figure 1. Looking at data. For data, we use a scatterplot.

Figure 2. Running regression (Statistics > Linear models and related > Linear regression)

29
Jun 17

## Introduction to Stata

Introduction to Stata: Stata interface, how to use Stata Help, how to use Data Editor and how to graph data. Important details to remember:

1. In any program, the first thing to use is Help. I learned everything from Help and never took any programming courses.
2. The number of observations for all variables in one data file must be the same. This can be a problem if, for example, you want to see out-of-sample predictions.
3. In Data Editor, numeric variables are displayed in black and strings are displayed in red.
4. The name of the hidden variable that counts observations is _n.
5. If you have several definitions of graphs in two-way graphs menu, they will be graphed together or separately, depending on what is enabled/disabled.

See details in videos. Sorry about the background noise!

Video 1. Stata interface. The windows introduced: Results, Command, Variables, Properties, Review and Viewer.

Video 2. Using Stata Help. Help can be used through the Viewer window or in a separate pdf viewer. Eviews Help is much easier to understand.

Video 3. Using Data Editor. How to open and view variables, the visual difference between numeric variables and string variables. The lengths of all variables in the same file must be the same.

Video 4. Graphing data. To graph a variable, you need to define its graph and then display it. It is possible to display more than one variable on the same chart.

21
Feb 17

## The pearls of AP Statistics 37

### Confidence interval: attach probability or not attach?

I am reading "5 Steps to a 5 AP Statistics, 2010-2011 Edition" by Duane Hinders (sorry, I don't have the latest edition). The tip at the bottom of p.200 says:

For the exam, be VERY, VERY clear on the discussion above. Many students
seem to think that we can attach a probability to our interpretation of a confidence
interval. We cannot.

This is one of those misconceptions that travel from book to book. Below I show how it may have arisen.

### Confidence interval derivation

The intuition behind the confidence interval and the confidence interval derivation using z score have been given here. To make the discussion close to Duane Hinders, I show the confidence interval derivation using the t statistic. Let $X_1,...,X_n$ be a sample of independent observations from a normal population, $\mu$ the population mean and $s$ the sample standard deviation, so that $s/\sqrt{n}$ is the standard error of the sample mean. Skipping the intuition, let's go directly to the t statistic

(1) $t=\frac{\bar{X}-\mu}{s/\sqrt{n}}$.

At the 95% confidence level, from statistical tables find the critical value $t_{cr,0.95}$ of the t statistic such that

$P(-t_{cr,0.95}<t<t_{cr,0.95})=0.95.$

Plug here (1) to get

(2) $P(-t_{cr,0.95}<\frac{\bar{X}-\mu}{s/\sqrt{n}}<t_{cr,0.95})=0.95.$

Using equivalent transformations of inequalities (multiplying them by $s/\sqrt{n}$ and adding $\mu$ to all sides) we rewrite (2) as

(3) $P(\mu-t_{cr,0.95}\frac{s}{\sqrt{n}}<\bar{X}<\mu+t_{cr,0.95}\frac{s}{\sqrt{n}})=0.95.$

Thus, we have proved

Statement 1. The interval $\mu\pm t_{cr,0.95}\frac{s}{\sqrt{n}}$ contains the values of the sample mean with probability 95%.

The left-side inequality in (3) is equivalent to $\mu<\bar{X}+t_{cr,0.95}\frac{s}{\sqrt{n}}$ and the right-side one is equivalent to $\bar{X}-t_{cr,0.95}\frac{s}{\sqrt{n}}<\mu$. Combining these two inequalities, we see that (3) can be equivalently written as

(4) $P(\bar{X}-t_{cr,0.95}\frac{s}{\sqrt{n}}<\mu<\bar{X}+t_{cr,0.95}\frac{s}{\sqrt{n}})=0.95.$

So, we have

Statement 2. The interval $\bar{X}\pm t_{cr,0.95}\frac{s}{\sqrt{n}}$ contains the population mean with probability 95%.
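Statement 2 can be checked by simulation: draw many samples, build the interval from each, and count how often it covers $\mu$. The Python sketch below assumes a normal population with $\mu=5$, $\sigma=2$ and $n=10$ (arbitrary choices); 2.262 is the table value of $t_{cr,0.95}$ for 9 degrees of freedom.

```python
import random
import statistics

T_CRIT = 2.262   # table value of t_{cr,0.95} for n - 1 = 9 degrees of freedom

def coverage(mu=5.0, sigma=2.0, n=10, reps=5000, seed=1):
    """Share of simulated samples whose interval xbar +/- t_cr * s / sqrt(n) contains mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.mean(sample)
        half_width = T_CRIT * statistics.stdev(sample) / n ** 0.5
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / reps

rate = coverage()
print(rate)
```

The share of intervals covering $\mu$ should be close to 0.95. The interval ends change from sample to sample while $\mu$ stays fixed, which is the sense in which the statement is probabilistic.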

### Source of the misconception

In (3), the variable in the middle ($\bar{X}$) is random, and the statement that it belongs to some interval is naturally probabilistic. People not familiar with the above derivation don't understand how a statement that the population mean (which is a constant) belongs to some interval can be probabilistic. It's the interval ends that are random in (4) (the sample mean and standard error are both random), that's why there is probability! Statements 1 and 2 are equivalent!

My colleague Aidan Islyami mentioned that we should distinguish estimates from estimators.

In all statistical derivations random variables are ex-ante (before the event). No book says that but that's the way it is. An estimate is an ex-post (after the event) value of an estimator. An estimate is, of course, a number and not a random variable. Ex-ante, a confidence interval always has a probability. Ex-post, the fact that an estimate belongs to some interval is deterministic (has probability either 0 or 1) and it doesn't make sense to talk about 95%.

Since confidence levels are always strictly between 0 and 100%, students should keep in mind that we deal with ex-ante variables.
11
Feb 17

## Gauss-Markov theorem

The Gauss-Markov theorem states that the OLS estimator is the most efficient. Without algebra, you cannot make a single step further, whether it is the precise theoretical statement or an application.

### Why do we care about linearity?

The concept of linearity has been repeated many times in my posts. Here we have to start from scratch, to apply it to estimators.

The slope in simple regression

(1) $y_i=a+bx_i+e_i$

can be estimated by

$\hat{b}(y,x)=\frac{Cov_u(y,x)}{Var_u(x)}$.

Note that the notation makes explicit the dependence of the estimator on $x,y$. Imagine that we have two sets of observations: $(y_1^{(1)},x_1),...,(y_n^{(1)},x_n)$ and $(y_1^{(2)},x_1),...,(y_n^{(2)},x_n)$ (the x coordinates are the same but the y coordinates are different). In addition, the regressor is deterministic. The x's could be spatial units and the y's temperature measurements at these units at two different moments.

Definition. We say that $\hat{b}(y,x)$ is linear with respect to $y$ if for any two vectors $y^{(i)}= (y_1^{(i)},...,y_n^{(i)}),$ $i=1,2,$ and numbers $c,d$ we have

$\hat{b}(cy^{(1)}+dy^{(2)},x)=c\hat{b}(y^{(1)},x)+d\hat{b}(y^{(2)},x)$.

This definition is quite similar to that of linearity of means. Linearity of the estimator with respect to $y$ easily follows from linearity of covariance

$\hat{b}(cy^{(1)}+dy^{(2)},x)=\frac{Cov_u(cy^{(1)}+dy^{(2)},x)}{Var_u(x)}=c\hat{b}(y^{(1)},x)+d\hat{b}(y^{(2)},x)$.

In addition to knowing how to establish linearity, it's a good idea to be able to see when something is not linear. Recall that linearity implies homogeneity of degree 1. Hence, if something is not homogeneous of degree 1, it cannot be linear. The OLS estimator is not linear in x because it is homogeneous of degree -1 in x:

$\hat{b}(y,cx)=\frac{Cov_u(y,cx)}{Var_u(cx)}=\frac{c}{c^2}\frac{Cov_u(y,x)}{Var_u(x)}=\frac{1}{c}\hat{b}(y,x)$.
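Both properties are easy to verify numerically. In the Python sketch below the observations and the numbers $c,d$ are made up, and the covariance uses weights $1/n$ (the choice of weights cancels out in the ratio).

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance with weights 1/n."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def slope(y, x):
    """Slope estimator: Cov(y, x) / Var(x)."""
    return cov(y, x) / cov(x, x)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y1 = [2.1, 3.9, 6.2, 8.1, 9.8]   # made-up observations
y2 = [1.0, 0.5, 2.5, 2.0, 4.0]
c, d = 3.0, -2.0

# Linearity with respect to y
combo = [c * a + d * b for a, b in zip(y1, y2)]
lhs = slope(combo, x)
rhs = c * slope(y1, x) + d * slope(y2, x)
print(lhs, rhs)

# Homogeneity of degree -1 with respect to x
scaled = slope(y1, [c * v for v in x])
print(scaled, slope(y1, x) / c)
```

In each printed pair the two numbers agree up to rounding.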

### Gauss-Markov theorem

Students don't have problems remembering the acronym BLUE: the OLS estimator is Best Linear Unbiased Estimator. Decoding this acronym starts from the end.

1. An estimator, by definition, is a function of sample data.
2. Unbiasedness of OLS estimators is thoroughly discussed here.
3. Linearity of the slope estimator with respect to $y$ has been proved above. Linearity with respect to $x$ is not required.
4. Now we look at the class of all slope estimators that are linear with respect to $y$. As an exercise, show that the instrumental variables estimator belongs to this class.

Gauss-Markov Theorem. Under the classical assumptions, the OLS estimator of the slope has the smallest variance in the class of all slope estimators that are linear with respect to $y$ and unbiased.

In particular, the OLS estimator of the slope is more efficient than the IV estimator. The beauty of this result is that you don't need expressions of their variances (even though they can be derived).
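For a concrete comparison, any slope estimator linear in $y$ can be written as $\sum w_iy_i$ with deterministic weights, and with uncorrelated errors of common variance $\sigma^2$ its variance is $\sigma^2\sum w_i^2$. The Python sketch below compares $\sum w_i^2$ for OLS and for an IV estimator; the regressor values and the instrument `z` are hypothetical, and the instrument is assumed deterministic.

```python
def centered(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def ols_weights(x):
    """Weights of the OLS slope: (x_i - xbar) / sum_j (x_j - xbar)^2."""
    xc = centered(x)
    denom = sum(a * a for a in xc)
    return [a / denom for a in xc]

def iv_weights(x, z):
    """Weights of the IV slope: (z_i - zbar) / sum_j (z_j - zbar)(x_j - xbar)."""
    xc, zc = centered(x), centered(z)
    denom = sum(a * b for a, b in zip(zc, xc))
    return [a / denom for a in zc]

def variance_factor(w):
    """Var(sum w_i y_i) / sigma^2 for uncorrelated errors with common variance."""
    return sum(a * a for a in w)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
z = [1.0, 3.0, 2.0, 5.0, 4.0]   # hypothetical deterministic instrument
ols_factor = variance_factor(ols_weights(x))
iv_factor = variance_factor(iv_weights(x, z))
print(ols_factor, iv_factor)
```

With these numbers the factors are 0.1 for OLS and 0.15625 for IV, in line with the theorem. Both weight vectors satisfy $\sum w_i=0$ and $\sum w_ix_i=1$, which is what unbiasedness requires.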

Remark. Even the above formulation is incomplete. In fact, the pair intercept estimator plus slope estimator is efficient. This requires matrix algebra.