## Number of reads of my book on October 29, 2020

For more information see my site

26 May 20

Tags: ANOVA, AP Statistics, Business Statistics, discrete and continuous variables, estimation, graphical and numerical description of data, hypothesis testing, Maximum likelihood method, sampling and sampling distribution, simple regression, University of London International Programmes, uol affiliate centre, UoL International Programmes

7 Aug 17

This is a large topic which requires several posts or several book chapters. During a conference in Sweden in 2010, a Swedish statistician asked me: "What is Econometrics, anyway? What tools does it use?" I said: "Among others, it uses linear regression." He said: "But linear regression is a general statistical tool, why do they say it's a part of Econometrics?" My answer was: "Yes, it's a general tool but the name Econometrics emphasizes that the motivation for its applications lies in Economics".

Both classical assumptions and their violations should be studied with this point in mind: What is the Economics and Math behind each assumption?

We consider the simple regression

(1) $y_i=a+bx_i+e_i.$

Make sure to review the assumptions. Their numbering and names sometimes differ from those in Dougherty's book. In particular, most of the time I omit the following assumption:

**A6**. The model is linear in parameters and correctly specified.

When it is not linear in parameters, you can think of nonlinear alternatives. Instead of saying "correctly specified," I speak of the "true model" as opposed to a "wrong model."

**A1**. What if the existence condition is violated? If the variance of the regressor is zero, the OLS estimator does not exist: the fitted line would have to be vertical, and you cannot regress $y$ on $x$. Violation of the existence condition in case of multiple regression leads to multicollinearity, and that's where economic considerations are important.

**A2**. The convenience condition is called so because when it is violated, that is, the regressor is stochastic, there are ways to deal with this problem: finite-sample theory and large-sample theory.

**A3**. What if the errors in (1) have means different from zero? This question can be divided into two: 1) the means of the errors are the same, $Ee_i=c$ for all $i$, and 2) the means are different. Read the post about centering and see if you can come up with the answer to the first question. The means may be different because of omission of a relevant variable (can you do the math?). In the absence of data on such a variable, there is nothing you can do.
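To "do the math" numerically, here is a quick simulation sketch of omitted-variable bias (all numbers are hypothetical; when the omitted variable $z$ is correlated with $x$, the short-regression slope picks up the extra term $c\,\text{Cov}(x,z)/\text{Var}(x)$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# True model: y = 1 + 2*x + 3*z + e, with z correlated with x
x = rng.normal(size=n)
z = 0.5 * x + rng.normal(size=n)   # Cov(x, z) = 0.5, Var(x) = 1
e = rng.normal(size=n)
y = 1 + 2 * x + 3 * z + e

# OLS slope from regressing y on x alone (z omitted)
b_short = np.cov(x, y, ddof=0)[0, 1] / np.var(x)

# Omitted-variable bias formula: 2 + 3 * Cov(x, z)/Var(x) = 2 + 3*0.5 = 3.5
print(b_short)  # close to 3.5, not to the true slope 2
```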

Violations of A4 and A5 will be treated later.

Tags: AP Statistics, Business Statistics, Convenience condition, EC2020 Elements of econometrics, error with nonzero mean, Existence condition, mean plus deviation-from-the-mean decomposition, simple regression, unbiasedness, University of London International Programmes, uol affiliate centre, Violations of classical assumptions

6 Jul 17

Running simple regression in Stata is, well, simple. It's just a matter of a couple of clicks. Try to turn it into a small research project.

- Obtain descriptive statistics for your data (Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics). Look at all that stuff you studied in introductory statistics: units of measurement, means, minimums, maximums, and correlations. Knowing the units of measurement will be important for interpreting regression results; correlations will predict signs of coefficients, etc. In your report, don't just mechanically repeat all those measures; try to find and discuss something interesting.
- Visualize your data (Graphics > Twoway graph). On the graph you can observe outliers and discern possible nonlinearity.
- After running the regression, report the estimated equation. It is called a fitted line and in our case looks like this: Earnings = -13.93+2.45*S (use descriptive names and not abstract X, Y). To see if the coefficient of S is significant, look at its p-value, which is smaller than 0.001. This tells us that at all significance levels greater than or equal to 0.001 the null hypothesis that the coefficient of S is zero is rejected. This follows from the definition of the p-value. Nobody cares about significance of the intercept. Report also the p-value of the F statistic. It characterizes the joint significance of all nontrivial regressors and is important in case of multiple regression. The last statistic to report is R squared.
- Think about possible variations of the model. Could the dependence of Earnings on S be nonlinear? What other determinants of Earnings would you suggest from among the variables in Dougherty's file?
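The same mini-research can be scripted outside Stata. A sketch in Python (the data are a simulated stand-in for Dougherty's earnings file, so the numbers, not the method, are hypothetical):

```python
import numpy as np

# Simulated stand-in for Dougherty's earnings file: S = years of schooling
rng = np.random.default_rng(1)
S = rng.integers(8, 21, size=500).astype(float)
earnings = -13.93 + 2.45 * S + rng.normal(scale=10.0, size=500)

# Step 1: descriptive statistics (means, minimums, maximums, correlations)
print("mean:", earnings.mean(), "min:", earnings.min(), "max:", earnings.max())
print("corr(S, earnings):", np.corrcoef(S, earnings)[0, 1])

# Step 2: the fitted line: slope = Cov(S, y)/Var(S), intercept = ybar - slope*Sbar
slope = np.cov(S, earnings, ddof=0)[0, 1] / np.var(S)
intercept = earnings.mean() - slope * S.mean()
print(f"Earnings = {intercept:.2f} + {slope:.2f}*S")

# Step 3: in simple regression, R squared is the squared correlation
r2 = np.corrcoef(S, earnings)[0, 1] ** 2
print("R^2:", r2)
```

As promised by the correlation, the slope comes out positive; Stata's regression output adds the p-values and the F statistic on top of these numbers.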

24 Jan 17

We consider the slope estimator for the simple regression

$y_i=a+bx_i+e_i,$

assuming that the regressor $x$ is stochastic.

**First approach**: the sample size is fixed. The unbiasedness and efficiency conditions are replaced by their analogs conditioned on $x$. The outcome is that the slope estimator is unbiased and its variance is the average of the variance that we have in the case of a deterministic regressor. See the details.

**Second approach**: the sample size goes to infinity. The main tools used are the properties of probability limits and laws of large numbers. The outcome is that, in the limit, the sample characteristics are replaced by their population cousins and the slope estimator is consistent. This is what we focus on here.

Review the intuition and formal definition. This is the summary:

**Fact 1**. Convergence in probability (which applies to sequences of random variables) is a generalization of the notion of convergence of number sequences. In particular, if $\{a_n\}$ is a numerical sequence that converges to a number $a$, $\lim_{n\to\infty}a_n=a$, then, treating $a_n$ as a random variable, we have convergence in probability $\text{plim}\,a_n=a$.

**Fact 2**. For those who are familiar with the theory of limits of numerical sequences, from the previous fact it should be clear that convergence in probability *preserves arithmetic operations*. That is, for any sequences of random variables $\{X_n\},\{Y_n\}$ such that the limits $\text{plim}\,X_n$ and $\text{plim}\,Y_n$ exist, we have

$$\text{plim}(X_n\pm Y_n)=\text{plim}\,X_n\pm\text{plim}\,Y_n,\quad \text{plim}(X_nY_n)=(\text{plim}\,X_n)(\text{plim}\,Y_n),$$

and if $\text{plim}\,Y_n\neq 0$ then

$$\text{plim}\,\frac{X_n}{Y_n}=\frac{\text{plim}\,X_n}{\text{plim}\,Y_n}.$$
This makes convergence in probability very handy. Convergence in distribution doesn't have such properties.
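A quick simulation illustrating preservation of arithmetic operations: the ratio of two sample means converges in probability to the ratio of the population means (the distributions here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_mean_ratio(n):
    # X_n = mean of Uniform(0, 2) draws (plim = 1), Y_n = mean of Exp(1) draws (plim = 1)
    x = rng.uniform(0, 2, size=n)
    y = rng.exponential(1.0, size=n)
    return x.mean() / y.mean()   # by Fact 2, plim of the ratio is 1/1 = 1

# The ratio gets closer and closer to 1 as n grows
for n in (100, 10_000, 1_000_000):
    print(n, sample_mean_ratio(n))
```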

See the site map for several posts about this. Here we apply the Chebyshev inequality to prove the law of large numbers for sample means. A generalization is given in the Theorem at the end of that post. Here is a further intuitive generalization:

*Normally, unbiased sample characteristics converge in probability to their population counterparts*.

**Example 1**. We know that the sample variance $s^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2$ unbiasedly estimates the population variance $\sigma^2$: $Es^2=\sigma^2$. The intuitive generalization says that then

(1) $\text{plim}\,s^2=\sigma^2$.

Here I argue that, for the purposes of obtaining some identities from the general properties of means, instead of the sample variance it's better to use the variance defined by $\text{Var}_u(X)=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2$ (with division by $n$ instead of $n-1$). Using Facts 1 and 2 we get from (1) that

(2) $\text{plim}\,\text{Var}_u(X)=\text{plim}\,\frac{n-1}{n}\,s^2=\sigma^2$

(sample variance converges in probability to population variance). Here we use $\frac{n-1}{n}\to 1$.

**Example 2**. Similarly, sample covariance converges in probability to population covariance:

(3) $\text{plim}\,\text{Cov}_u(X,Y)=\text{Cov}(X,Y),$

where by definition $\text{Cov}_u(X,Y)=\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})$.

Here (see equation (5)) I derived the representation of the OLS estimator of the slope:

$$\hat{b}=\frac{\text{Cov}_u(x,y)}{\text{Var}_u(x)}=b+\frac{\text{Cov}_u(x,e)}{\text{Var}_u(x)}.$$
Using preservation of arithmetic operations for convergence in probability, we get

(4) $\text{plim}\,\hat{b}=b+\frac{\text{plim}\,\text{Cov}_u(x,e)}{\text{plim}\,\text{Var}_u(x)}=b+\frac{\text{Cov}(x,e)}{\text{Var}(x)}.$

In the last line we used (2) and (3). From (4) we see what conditions should be imposed for the slope estimator to converge to a spike at the true slope:

$\text{Var}(x)\neq 0$ (**existence condition**)

and

$\text{Cov}(x,e)=0$ (**consistency condition**).

Under these conditions, we have $\text{plim}\,\hat{b}=b$ (this is called **consistency**).
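A simulation sketch of this consistency result (the true coefficients are hypothetical; the regressor is redrawn with every sample, so it is stochastic, and the error is independent of it, so the consistency condition holds):

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = 1.0, 2.0   # hypothetical true intercept and slope

def ols_slope(n):
    x = rng.normal(size=n)   # stochastic regressor
    e = rng.normal(size=n)   # error with Cov(x, e) = 0 (consistency condition)
    y = a + b * x + e
    # slope estimator = sample covariance / sample variance
    return np.cov(x, y, ddof=0)[0, 1] / np.var(x)

# As n grows, the estimates collapse to a spike at the true slope b = 2
for n in (50, 5_000, 500_000):
    print(n, ols_slope(n))
```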

**Conclusion**. In a way, the second approach is technically simpler than the first.

Tags: AP Statistics, Business Statistics, consistency of the slope estimator, convergence in probability, EC2020 Elements of econometrics, large sample theory, OLS estimator, preservation of arithmetic operations, Regressions with stochastic regressors, simple regression, University of London International Programmes, uol affiliate centre

11 Jan 17

The convenience condition states that the regressor in simple regression is deterministic. Here we look at how this assumption can be avoided using conditional expectation and variance. The general idea: check which parts of the proofs don't go through with stochastic regressors and modify the assumptions accordingly. It turns out that only the assumptions concerning the error term need to be replaced by their conditional counterparts.

We consider the slope estimator for the simple regression

$y_i=a+bx_i+e_i,$

assuming that the regressor $x$ is stochastic.

First grab the critical representation (6) derived here:

(1) $\hat{b}=b+\sum_{i=1}^n a_ie_i$, where $a_i=\dfrac{x_i-\bar{x}}{n\,\text{Var}_u(x)}$.

The usual linearity of means applied to prove unbiasedness doesn't work because now the coefficients $a_i$ are stochastic (in other words, they are not constant). But we have generalized linearity which for the purposes of this proof can be written as

(2) $E\Big(\sum_{i=1}^n a_ie_i\,\Big|\,x\Big)=\sum_{i=1}^n a_iE(e_i|x).$

Let us replace the unbiasedness condition by its conditional version:

**A3'. Unbiasedness condition**. $E(e_i|x)=0$ for all $i$.

Then (1) and (2) give

(3) $E(\hat{b}|x)=b+\sum_{i=1}^n a_iE(e_i|x)=b,$

which can be called **conditional unbiasedness**. Next, applying the law of iterated expectations, we obtain **unconditional unbiasedness**:

$$E\hat{b}=E\big[E(\hat{b}|x)\big]=b.$$

As one can guess, we have to replace efficiency conditions by their conditional versions:

**A4'. Conditional uncorrelatedness of errors**. Assume that $\text{Cov}(e_i,e_j|x)=0$ for all $i\neq j$.

**A5'. Conditional homoscedasticity**. *All errors have the same conditional variances*: $\text{Var}(e_i|x)=\sigma^2$ for all $i$ ($\sigma^2$ is a constant).

Now we can derive the **conditional variance** expression, using properties from this post:

$\text{Var}(\hat{b}|x)=\text{Var}\Big(b+\sum_{i=1}^n a_ie_i\,\Big|\,x\Big)=\text{Var}\Big(\sum_{i=1}^n a_ie_i\,\Big|\,x\Big)$ (dropping a constant doesn't affect variance)

$=\sum_{i=1}^n\text{Var}(a_ie_i|x)$ (for conditionally uncorrelated variables, conditional variance is additive)

$=\sum_{i=1}^n a_i^2\,\text{Var}(e_i|x)$ (conditional variance is homogeneous of degree 2)

$=\sigma^2\sum_{i=1}^n a_i^2$ (applying conditional homoscedasticity)

$=\sigma^2\sum_{i=1}^n\frac{(x_i-\bar{x})^2}{n^2\,\text{Var}_u(x)^2}$ (plugging $a_i=\frac{x_i-\bar{x}}{n\,\text{Var}_u(x)}$)

$=\frac{\sigma^2}{n\,\text{Var}_u(x)}$ (using the notation of sample variance), so that

(4) $\text{Var}(\hat{b}|x)=\dfrac{\sigma^2}{n\,\text{Var}_u(x)}.$

Finally, using the law of total variance and equations (3) and (4) we obtain

(5) $\text{Var}(\hat{b})=E\big[\text{Var}(\hat{b}|x)\big]+\text{Var}\big(E(\hat{b}|x)\big)=\sigma^2E\Big[\frac{1}{n\,\text{Var}_u(x)}\Big].$

Replacing the three assumptions about the error by their conditional counterparts allows us to obtain almost perfect analogs of the usual properties of OLS estimators: the usual (unconditional) unbiasedness plus the estimator variance, in which the part containing the regressor should be averaged, to account for its randomness. If you think that solving the problem of stochastic regressors requires nothing more than a couple of mathematical tricks, I agree with you.
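A simulation sketch checking both conclusions, unconditional unbiasedness and the averaged variance formula (all parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, sigma = 1.0, 2.0, 1.0   # hypothetical true parameters
n, reps = 30, 20_000

slopes = np.empty(reps)
avg_inv = np.empty(reps)      # stores 1/(n * Var_u(x)) for each replication
for r in range(reps):
    x = rng.normal(size=n)                         # stochastic regressor, redrawn each time
    y = a + b * x + sigma * rng.normal(size=n)
    vx = np.var(x)                                 # Var_u(x), division by n
    slopes[r] = np.cov(x, y, ddof=0)[0, 1] / vx
    avg_inv[r] = 1 / (n * vx)

print("mean of slope estimates:", slopes.mean())   # close to b = 2 (unbiasedness)
print("variance of estimates:  ", slopes.var())
print("sigma^2 * E[1/(n Var_u(x))]:", sigma**2 * avg_inv.mean())
```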

Tags: AP Statistics, Business Statistics, Conditional homoscedasticity, conditional unbiasedness, Conditional uncorrelatedness, EC2020 Elements of econometrics, OLS estimator, simple regression, stochastic regressors, Unbiasedness in case of stochastic regressors, unconditional unbiasedness, University of London International Programmes, uol affiliate centre, Variance in case of stochastic regressors

8 Jan 17

We consider the simple regression

(1) $y_i=a+bx_i+e_i.$

Here we derived the OLS estimators of the slope and intercept:

(2) $\hat{b}=\dfrac{\text{Cov}_u(x,y)}{\text{Var}_u(x)},$

(3) $\hat{a}=\bar{y}-\hat{b}\bar{x}.$

**A1. Existence condition**. Since division by zero is not allowed, for (2) to exist *we require* $\text{Var}_u(x)\neq 0$. If this condition is not satisfied, then there is no variance in $x$ and all observed points are on a vertical line.

**A2. Convenience condition**. *The regressor* $x$ *is deterministic*. This condition is imposed to be able to apply the properties of expectation, see equation (7) in this post. The time trend and dummy variables are examples of deterministic regressors. However, most real-life regressors are stochastic. Modifying the theory in order to cover stochastic regressors is the subject of two posts: finite-sample theory and large-sample theory.

**A3. Unbiasedness condition**. $Ee_i=0$ for all $i$. This is the main assumption that ensures OLS estimators are unbiased, see equation (7) in this post.

Unbiasedness characterizes the quality of an estimator, see the intuitive explanation. Unfortunately, unbiasedness is not enough to choose the best estimator because of *nonuniqueness*: usually, if there is one unbiased estimator of a parameter, then there are infinitely many unbiased estimators of the same parameter. For example, we know that the sample mean $\bar{X}$ unbiasedly estimates the population mean $\mu$: $E\bar{X}=\mu$. Since $EX_1=\mu$ ($X_1$ is the first observation), we can easily construct an infinite family of unbiased estimators $\hat{\mu}_\lambda=\lambda X_1+(1-\lambda)\bar{X}$, assuming $0\le\lambda\le 1$. Indeed, using linearity of expectation, $E\hat{\mu}_\lambda=\lambda EX_1+(1-\lambda)E\bar{X}=\lambda\mu+(1-\lambda)\mu=\mu$.
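A short simulation of this nonuniqueness (the population mean and the weights are hypothetical): every member of the family is unbiased, but the variances differ, which is exactly why variance is needed as a second quality measure.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n, reps = 10.0, 20, 50_000

# Family of unbiased estimators: lam*X_1 + (1 - lam)*Xbar, 0 <= lam <= 1
results = {}
for lam in (0.0, 0.3, 0.7):
    estimates = np.empty(reps)
    for r in range(reps):
        sample = rng.normal(mu, 2.0, size=n)
        estimates[r] = lam * sample[0] + (1 - lam) * sample.mean()
    results[lam] = (estimates.mean(), estimates.var())
    print(lam, results[lam])

# Every mean is close to mu (unbiasedness), but the variance grows with lam,
# so unbiasedness alone cannot single out the best estimator.
```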

Variance is another measure of estimator quality: to have a lower spread of estimator values, among competing estimators we choose the one with the lowest variance. Knowing the estimator variance allows us to find the z-score and use statistical tables.

It is not difficult to find the variance of the slope estimator using representation (6) derived here:

$$\hat{b}=b+\sum_{i=1}^n a_ie_i,\quad\text{where } a_i=\frac{x_i-\bar{x}}{n\,\text{Var}_u(x)}.$$

**Don't** try to apply directly the definition of variance at this point, because there will be a square of a sum, which leads to a double sum upon squaring. We need two new assumptions.

**A4. Uncorrelatedness of errors**. Assume that $\text{Cov}(e_i,e_j)=0$ for all $i\neq j$ (*errors from different equations (1) are uncorrelated*). Note that because of the unbiasedness condition, this assumption is equivalent to $Ee_ie_j=0$ for all $i\neq j$. This assumption is likely to be satisfied if we observe consumption patterns of unrelated individuals.

**A5. Homoscedasticity**. *All errors have the same variances*: $\text{Var}(e_i)=\sigma^2$ for all $i$. Again, because of the unbiasedness condition, this assumption is equivalent to $Ee_i^2=\sigma^2$ for all $i$.

Now we can derive the variance expression, using properties from this post:

$\text{Var}(\hat{b})=\text{Var}\Big(b+\sum_{i=1}^n a_ie_i\Big)=\text{Var}\Big(\sum_{i=1}^n a_ie_i\Big)$ (dropping a constant doesn't affect variance)

$=\sum_{i=1}^n\text{Var}(a_ie_i)$ (for uncorrelated variables, variance is additive)

$=\sum_{i=1}^n a_i^2\,\text{Var}(e_i)$ (variance is homogeneous of degree 2)

$=\sigma^2\sum_{i=1}^n a_i^2$ (applying homoscedasticity)

$=\sigma^2\sum_{i=1}^n\frac{(x_i-\bar{x})^2}{n^2\,\text{Var}_u(x)^2}$ (plugging $a_i=\frac{x_i-\bar{x}}{n\,\text{Var}_u(x)}$)

$=\frac{\sigma^2}{n\,\text{Var}_u(x)}$ (using the notation of sample variance)

Note that canceling out the two variances in the last line is obvious in the short notation; it is less obvious if you use summation signs instead. The case of the intercept variance is left as an exercise.
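A numerical check of the derived formula with a deterministic regressor (a time trend; $\sigma$ and the true coefficients are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, n, reps = 1.5, 40, 40_000
x = np.linspace(0, 10, n)    # deterministic regressor (a time trend), fixed across replications
vx = np.var(x)               # Var_u(x), division by n

slopes = np.empty(reps)
for r in range(reps):
    y = 1.0 + 2.0 * x + sigma * rng.normal(size=n)   # errors satisfy A3-A5
    slopes[r] = np.cov(x, y, ddof=0)[0, 1] / vx

print("simulated variance of the slope:", slopes.var())
print("formula sigma^2/(n*Var_u(x)):  ", sigma**2 / (n * vx))
```

The two printed numbers should agree up to simulation noise.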

The above assumptions A1-A5 are called **classical**. It is necessary to remember their role in derivations because a considerable part of Econometrics is devoted to deviations from classical assumptions. Once a certain assumption is violated, you should expect the corresponding estimator property to be invalidated. For example, if $Ee_i\neq 0$, you should expect the estimators to be biased. If any of A4-A5 is not true, the formula we have derived,

$$\text{Var}(\hat{b})=\frac{\sigma^2}{n\,\text{Var}_u(x)},$$

will not hold. Besides, the Gauss-Markov theorem that the OLS estimators are efficient will not hold (this will be discussed later). The pair A4-A5 can be called an **efficiency condition**.

Tags: AP Statistics, Business Statistics, classical assumptions, Convenience condition, dummy variables, EC2020 Elements of econometrics, efficiency condition, estimator nonuniqueness, Existence condition, Gauss-Markov theorem, homogeneity of variance, Homoscedasticity, OLS estimator, Population mean, sample mean, simple regression, simple regression assumptions, Slope estimator variance, time trend, unbiasedness, Unbiasedness condition, Uncorrelatedness of errors, University of London International Programmes, uol affiliate centre, variance

5 Jan 17

On the Internet you can find both positive and negative reviews. The ones that I saw on Goodreads.com and Amazon.com do not say much about the pros and cons. Here I try to be more specific.

The main limitation of the book is that it adheres to the College Board statement that "*it is a one semester*, *introductory*, *non-calculus-based*, college course in statistics". Hence, there are no derivations and no links between formulas. You will not find explanations of why Statistics works. As a result, there is too much emphasis on memorization. After reading the book, you will likely not have an integral view of statistical methods.

I have seen students who understand such texts well. Generally, they have an excellent memory and better-than-average imagination. But such students are better off reading more advanced books. A potential reader has to lower his/her expectations. I imagine a person who is not interested in taking a more advanced Stats course later. The motivation of that person would be: a) to understand the ways Statistics is applied and/or b) to pass AP Stats just because it is a required course. The review is written on the premise that this is the intended readership.

- The number and variety of exercises. This is good for an instructor who teaches large classes. Having authored several books, I can assure you that inventing many exercises is the most time-consuming part of this business.
- The authors have come up with good visual embellishments of graphs and tables summarized in "A Guide to Learning From the Art in This Text" at the end of the book.
- The book has generous left margins. Sometimes they contain reminders about the past material. Otherwise, the reader can use them for notes.
- MINITAB is prohibitively expensive, but the Student Edition of MINITAB is provided on the accompanying CD.

- I counted about 140 high-resolution photos that have nothing to do with the subject matter. They hardly add to the educational value of the book but certainly add to its cost. This bad trend in introductory textbooks is fueled to a considerable extent by Pearson Education.
- 800+ pages, even after slashing all appendices and unnecessary illustrations, is a lot of reading for one semester. Even if you memorize all those pages, during the AP test it will be difficult to pull out of your memory exactly the page you need to answer a particular question.
- In an introductory text, one has to refrain from giving too much theory. Still, I don't like some choices made by the authors. The learning curve is flat. As a way of gentle introduction to algebra, verbal descriptions of formulas are fine. But sticking to verbal descriptions until p. 589 is too much. This reminds me of a train trip in Kazakhstan. You enter the steppe through the western border and two days later you see the same endless steppe; just the train station is different.
- At the theoretical level, many topics are treated superficially. You can find a lot of additional information in my posts named "The pearls of AP Statistics". Here is the list of most important additions: regression and correlation should be decoupled; the importance of sampling distributions is overstated; probability is better explained without reference to the long run; the difference between the law of large numbers and central limit theorem should be made clear; the rate of convergence in the law of large numbers is not that fast; the law of large numbers is intuitively simple; the uniform distribution can also be made simple; to understand different charts, put them side by side; the Pareto chart is better understood as a special type of a histogram; instead of using the software on the provided CD, try to simulate in Excel yourself.
- Using outdated Texas Instruments calculators contradicts the American Statistical Association recommendation to "Use technology for developing concepts and analyzing data".

If I wanted to save time and didn't intend to delve into theory, I would prefer a concise book that directly addresses the questions given on the AP test. However, to decide for yourself, read the Preface and see how much fantasy has been put into the book; you may then want to read it.

Tags: AP Statistics, Business Statistics, central limit theorem, law of large numbers, Pareto chart, Pareto diagram, Pearson Education, Review of Agresti and Franklin, sampling distribution, simple regression, simulation in Excel, Statistics: The Art and Science of Learning from Data, uniform distribution, University of London International Programmes, uol affiliate centre

26 Dec 16

Agresti and Franklin on p. 658 say: The *indicator variable* for a particular category is binary. It equals 1 if the observation falls into that category and it equals 0 otherwise. I say: for most students, this is not clear enough.

Residential power consumption in the US has a seasonal pattern. Heating in winter and cooling in summer cause the differences. We want to capture the dependence of residential power consumption on the season.

Seasons of the year are categorical variables. We have to replace them with quantitative variables to be able to use them in any mathematical procedure that involves arithmetic operations. To this end, we define a *dummy variable* (*indicator*) $W$ for winter such that it equals 1 in winter and 0 in any other period of the year. The dummies $Sp$, $Su$, $Au$ for spring, summer and autumn are defined similarly. We provide two visualizations assuming monthly observations.

Table 1. Seasonal dummies and the trivial regressor

| Month | $W$ | $Sp$ | $Su$ | $Au$ | $T$ |
|---|---|---|---|---|---|
| December | 1 | 0 | 0 | 0 | 1 |
| January | 1 | 0 | 0 | 0 | 1 |
| February | 1 | 0 | 0 | 0 | 1 |
| March | 0 | 1 | 0 | 0 | 1 |
| April | 0 | 1 | 0 | 0 | 1 |
| May | 0 | 1 | 0 | 0 | 1 |
| June | 0 | 0 | 1 | 0 | 1 |
| July | 0 | 0 | 1 | 0 | 1 |
| August | 0 | 0 | 1 | 0 | 1 |
| September | 0 | 0 | 0 | 1 | 1 |
| October | 0 | 0 | 0 | 1 | 1 |
| November | 0 | 0 | 0 | 1 | 1 |

The first thing that comes to mind is to regress power consumption $y$ on the dummies as in

(1) $y_t=a+b_1W_t+b_2Sp_t+b_3Su_t+b_4Au_t+e_t.$

Not so fast. To see the problem, let us rewrite (1) as

(2) $y_t=aT_t+b_1W_t+b_2Sp_t+b_3Su_t+b_4Au_t+e_t.$

This shows that, in addition to the four dummies, there is a fifth variable, which equals 1 across all observations. Let us denote it $T$ (for Trivial). Table 1 shows that

(3) $T=W+Sp+Su+Au.$

This makes the next definition relevant. Regressors $x_1,\dots,x_k$ are called *linearly dependent* if one of them, say $x_1$, can be expressed as a linear combination of the others: $x_1=c_2x_2+\dots+c_kx_k$. In case (3), all coefficients are unities, so we have linear dependence. Using (3), let us replace $T$ in (2). The resulting equation is rearranged as

(4) $y_t=(a+b_1)W_t+(a+b_2)Sp_t+(a+b_3)Su_t+(a+b_4)Au_t+e_t.$

Now we see what the problem is. When regressors are linearly dependent, the model is not uniquely specified. (1) and (4) are two different representations of the same model.
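Numerically, linear dependence shows up as a rank-deficient regressor matrix and a singular $X'X$. A sketch with the monthly dummies from Table 1 (NumPy):

```python
import numpy as np

# Monthly dummies for the four seasons plus the trivial regressor T = 1
months = np.arange(12)                            # 0 = January, ..., 11 = December
W  = np.isin(months, [11, 0, 1]).astype(float)    # Dec, Jan, Feb
Sp = np.isin(months, [2, 3, 4]).astype(float)
Su = np.isin(months, [5, 6, 7]).astype(float)
Au = np.isin(months, [8, 9, 10]).astype(float)
T  = np.ones(12)

X = np.column_stack([T, W, Sp, Su, Au])
# The linear dependence T = W + Sp + Su + Au makes X'X singular:
print("rank of X:", np.linalg.matrix_rank(X))     # 4, not 5
print("det of X'X:", np.linalg.det(X.T @ X))      # 0 (up to rounding)

# Dropping the winter dummy restores full column rank, so the OLS estimator exists
X5 = np.column_stack([T, Sp, Su, Au])
print("rank after dropping W:", np.linalg.matrix_rank(X5))
```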

If regressors are linearly dependent, drop them one after another until you get linearly independent ones. For example, dropping the winter dummy, we get

(5) $y_t=a+b_2Sp_t+b_3Su_t+b_4Au_t+e_t.$

Here is the estimation result for the two-year data in Figure 1:

.

This means that:

in winter, in spring,

in summer, and in autumn.

It is revealing that cooling requires more power than heating. However, the summer coefficient is not significant. Here is the Excel file with the data and estimation result.

The category that has been dropped is called a *base* (or *reference*) *category*. Thus, the intercept in (5) measures power consumption in winter. The dummy coefficients in (5) measure deviations of power consumption in respective seasons from that in winter.

We want to see how beer consumption $B$ depends on gender and income $I$. Let $M$ and $F$ denote the dummies for males and females, respectively. Correct the following model and interpret the resulting coefficients: $B=a+b_1M+b_2F+cI+e$.

When a researcher includes all categories plus the trivial regressor, he/she falls into what is called a *dummy trap*. The problem of linear dependence among regressors is usually discussed under the heading of multiple regression. But since the trivial regressor is present in simple regression too, it might be a good idea to discuss it earlier.

Linear dependence/independence of regressors is an exact condition for existence of the OLS estimator. That is, if regressors are linearly dependent, then the OLS estimator doesn't exist, in which case the question about its further properties doesn't make sense. If, on the other hand, regressors are linearly independent, then the OLS estimator exists, and further properties can be studied, such as unbiasedness, variance and efficiency.
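A quick numeric illustration of the base-category interpretation on synthetic data (the seasonal means are hypothetical, not the Figure 1 data): the intercept estimates the winter level and the dummy coefficients estimate deviations from it.

```python
import numpy as np

rng = np.random.default_rng(7)
# Two years of synthetic monthly consumption, seasons coded 0=winter, ..., 3=autumn
season = np.tile(np.repeat([0, 1, 2, 3], 3), 2)
means = np.array([90.0, 70.0, 100.0, 75.0])   # hypothetical seasonal means
y = means[season] + rng.normal(scale=2.0, size=24)

Sp = (season == 1).astype(float)
Su = (season == 2).astype(float)
Au = (season == 3).astype(float)
X = np.column_stack([np.ones(24), Sp, Su, Au])   # winter is the base category

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
# intercept ≈ winter mean (90); dummy coefficients ≈ deviations from winter:
# Sp ≈ 70-90 = -20, Su ≈ 100-90 = +10, Au ≈ 75-90 = -15
```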

Tags: AP Statistics, base category, Business Statistics, categorical variables, cooling in summer, dummy trap, dummy variable, EC2020 Elements of econometrics, existence of the OLS estimator, Heating in winter, indicator variable, linear combination, linear dependence, linear independence, multiple regression, OLS estimator, power consumption, quantitative variables, reference category, residential power consumption, seasonal pattern, simple regression, unbiasedness, University of London International Programmes, uol affiliate centre, visualization of dummies

1 Dec 16

For the last several years I have been doing research in nonparametric estimation. It is intellectually rewarding and it is the best tool to show Stats students the usefulness of Statistics. Agresti and Franklin have a chapter on nonparametric estimation. However, the choice of topics (the Wilcoxon test and the Kruskal-Wallis test) is unfortunate. These tests compare locations across two or more samples. They produce just numbers (the corresponding statistics), which is not very appealing, because the students see just another solution to a familiar problem.

The nonparametric technique is at its best in nonlinear curve fitting, and this is its selling point because it is VISUAL. The following examples explain the difference between parametric and nonparametric estimation.

Suppose we want to use simple regression to estimate dependence of consumption on income. This is a parametric model, with two parameters (intercept and slope). Suppose the fitted line is (I just put plausible numbers). The slope is interpreted as the marginal propensity to consume and can be used in economic modeling to find the budget multiplier. The advantage of parametric estimation is that often estimated parameters have economic meaning.

This example has been taken from Lecture Notes by John Fox. Now let us look at dependence of income on age. It is clear that income is low for young people, then rises with age until middle age and declines after retirement. The dependence is obviously nonlinear and, a priori, no guesses can be made about the shape of the curve.

Figure 1 shows the median and quartiles of the distribution of income from wages and salaries as a function of single years of age. The data are taken from the 1990 U.S. Census one-percent Public Use Microdata Sample, and represent 1.24 million observations. Income starts increasing at around 18 years, tops out at 48 and declines till the age of 65. The fitted line is approximately linear until the age of 24, so young people enjoy the highest and roughly constant income growth rate.

What would have been the better 5-year investment: Apple or MasterCard? Figure 2 shows that the density of return on Apple stock has a negative mode. The density of return on MasterCard has the mode close to zero. This tells us that MasterCard would be better. Indeed, the annual return on Apple is 18%, while on MasterCard it is 29% (over the last 5 years). Nonparametric estimates of densities (kernel density estimates) are used by financial analysts to simulate stock prices to predict their future movements.
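A minimal hand-rolled kernel density sketch (synthetic returns, not the actual Apple/MasterCard data; Gaussian kernel, bandwidth by Silverman's rule of thumb):

```python
import numpy as np

rng = np.random.default_rng(8)
# Synthetic daily returns: one sample with a negative mode, one centered near zero
returns_a = rng.normal(-0.01, 0.02, size=1250)    # stand-in for the "Apple" sample
returns_b = rng.normal(0.001, 0.015, size=1250)   # stand-in for the "MasterCard" sample

def kde(data, grid, h=None):
    """Gaussian kernel density estimate on a grid (Silverman's rule for bandwidth)."""
    if h is None:
        h = 1.06 * data.std() * len(data) ** (-1 / 5)
    # Average of Gaussian bumps centered at each observation
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

grid = np.linspace(-0.08, 0.08, 401)
dens_a, dens_b = kde(returns_a, grid), kde(returns_b, grid)
print("mode of A:", grid[dens_a.argmax()])   # negative
print("mode of B:", grid[dens_b.argmax()])   # near zero
```

Plotting the two density curves side by side makes the comparison of the modes immediate, which is exactly the visual appeal mentioned above.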

**Remark**. For simple statistical tasks I recommend Eviews student version for two reasons. 1) It has excellent Help. When I want my students to understand just the essence and avoid proofs, I tell them to read Eviews Help. 2) The student version is just $39.95. Figures 2 and 3 have been produced using Eviews.

Tags: AP Statistics, budget multiplier, Business Statistics, comparing means from two samples, density of return, distribution of income, economic modeling, kernel density estimation, Kruskal-Wallis test, marginal propensity to consume, nonlinear curve fitting, nonparametric estimation, parametric estimation, parametric model, simple regression, University of London International Programmes, uol affiliate centre, Wilcoxon test

8 Nov 16

The disturbance term: To hide or not to hide? In an introductory Stats course, some part of the theory should be hidden. Where to draw the line is an interesting question. Here I discuss the ideas that look definitely bad to me.

In the main text, Agresti and Franklin never mention the disturbance term in the regression model

(1) $y=\alpha+\beta x+\varepsilon$

(it is hidden in Exercise 12.105). Instead, they write the equation for the mean, $\mu_y=\alpha+\beta x$, that follows from (1) under the standard assumption $E\varepsilon=0$. This would be fine if the exposition stopped right there. However, one has to explain the random source of variability in $y$. On p. 583 the authors say: "The probability distribution of y values at a fixed value of x is a conditional distribution. At each value of x, there is a conditional distribution of y values. A regression model also describes these distributions. An additional parameter σ describes the standard deviation of each conditional distribution."

Further, Figure 12.4 illustrates distributions of errors at different points and asks: "What do the bell-shaped curves around the line at x = 12 and at x = 16 represent?"

Besides, explanations of heteroscedasticity and of the residual sum of squares are impossible without explicitly referring to the disturbance term.

On p.589 I encountered a statement that puzzled me: "An important property of the correlation is that at any particular x value, the predicted value of y is relatively closer to its mean than x is to its mean. If an x value is a certain number of standard deviations from its mean, then the predicted y is r times that many standard deviations from its mean."

Firstly, this is a verbal interpretation of some formula, so why not give the formula itself? How good must a student be to guess what is behind the verbal formulation?

Secondly, as I stressed in this post, the correlation coefficient does not entail any prediction about the magnitude of a change in one variable caused by a change in another. The above statement about the predicted value of y must be a property of regression. Attributing a regression property to the correlation is not in the best interests of those who want to study Stats at a more advanced level.

Thirdly, I felt challenged to see something new in an area I thought I knew everything about. So here is the derivation. By definition, the fitted value is

(2) $\hat{y}=\hat{a}+\hat{b}x,$

where the hats stand for estimators. The fitted line passes through the point $(\bar{x},\bar{y})$:

(3) $\bar{y}=\hat{a}+\hat{b}\bar{x}$

(this will be proved elsewhere). Subtracting (3) from (2) we get

(4) $\hat{y}-\bar{y}=\hat{b}(x-\bar{x})=r\,\dfrac{s_y}{s_x}(x-\bar{x})$

(using equation (4) from this post, $\hat{b}=r\,s_y/s_x$).

It is helpful to rewrite (4) in a more symmetric form:

(5) $\dfrac{\hat{y}-\bar{y}}{s_y}=r\,\dfrac{x-\bar{x}}{s_x}.$

This is the equation we need. Suppose an x value is a certain number $c$ of standard deviations from its mean: $x-\bar{x}=c\,s_x$. Plug this into (5) to get $\hat{y}-\bar{y}=rc\,s_y$, that is, the predicted y is $r$ times that many standard deviations from its mean.
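A numerical check of this identity (simulated data; the coefficients are arbitrary). Since $\hat{b}=r\,s_y/s_x$ holds exactly in OLS, the check should agree to machine precision:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.normal(size=1000)
y = 3 + 0.5 * x + rng.normal(size=1000)

sx, sy = x.std(), y.std()
r = np.corrcoef(x, y)[0, 1]
b_hat = np.cov(x, y, ddof=0)[0, 1] / x.var()
a_hat = y.mean() - b_hat * x.mean()

# Pick an x that is c = 2 standard deviations above its mean
c = 2.0
x0 = x.mean() + c * sx
y0 = a_hat + b_hat * x0

# Equation (5): (y0 - ybar)/sy should equal r * c
print((y0 - y.mean()) / sy, "vs", r * c)
```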

Tags: AP Statistics, Business Statistics, conditional distribution, disturbance term, fitted line, fitted value, heteroscedasticity, OLS estimator, regression, Residual Sum of Squares, sample mean, simple regression, standard deviation, University of London International Programmes, uol affiliate centre