Jul 17

Running simple regression in Stata

Running a simple regression in Stata is, well, simple: it takes just a couple of clicks. Try to treat it as a small research project.

  1. Obtain descriptive statistics for your data (Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics). Look at all that stuff you studied in introductory statistics: units of measurement, means, minimums, maximums, and correlations. Knowing the units of measurement will be important for interpreting regression results; the sign of a correlation tells you what sign to expect for the corresponding slope coefficient, and so on. In your report, don't just mechanically repeat all those measures; try to find and discuss something interesting.
  2. Visualize your data (Graphics > Twoway graph). On the graph you can observe outliers and discern possible nonlinearity.
  3. After running the regression, report the estimated equation. It is called the fitted line and in our case looks like this: Earnings = -13.93 + 2.45*S (use descriptive names, not abstract X and Y). To see whether the coefficient of S is significant, look at its p-value, which is smaller than 0.001. This tells us that at every significance level larger than or equal to 0.001, the null hypothesis that the coefficient of S is zero is rejected; in other words, the coefficient is significant. This follows from the definition of the p-value. Nobody cares about the significance of the intercept. Report also the p-value of the F statistic. It characterizes the joint significance of all nontrivial regressors and is important in the case of multiple regression. The last statistic to report is R squared.
  4. Think about possible variations of the model. Could the dependence of Earnings on S be nonlinear? What other determinants of Earnings would you suggest from among the variables in Dougherty's file?
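For readers who want to see what the menu commands compute behind the scenes, here is a minimal sketch in plain Python. It is not Stata and not Dougherty's data: the ten observations below are invented for illustration, and the names `schooling`, `earnings` and `simple_ols` are hypothetical. The sketch computes the correlation, the fitted line, R squared, the t statistic of the slope and the F statistic for a regression with one regressor.

```python
import math

# Hypothetical data, invented for illustration (NOT Dougherty's file):
# years of schooling (S) and hourly earnings.
schooling = [8, 10, 12, 12, 13, 14, 16, 16, 18, 20]
earnings = [7.5, 9.0, 14.0, 11.5, 16.0, 19.5, 24.0, 21.0, 31.0, 33.5]


def mean(values):
    return sum(values) / len(values)


def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; return summary statistics."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual sum of squares and the share of variation explained.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - rss / syy

    # The standard error of the slope uses n - 2 degrees of freedom.
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    t_slope = slope / se_slope

    return {
        "n": n,
        "correlation": sxy / math.sqrt(sxx * syy),
        "intercept": intercept,
        "slope": slope,
        "r_squared": r_squared,
        "t_slope": t_slope,
        # With a single regressor, the F statistic equals the squared t statistic.
        "f_stat": t_slope ** 2,
    }


fit = simple_ols(schooling, earnings)
print(f"Earnings = {fit['intercept']:.2f} + {fit['slope']:.2f}*S")
print(f"correlation = {fit['correlation']:.3f}, R squared = {fit['r_squared']:.3f}")
print(f"t(slope) = {fit['t_slope']:.2f}, F = {fit['f_stat']:.2f}")
```

Two things are worth noticing when you run it. The slope has the same sign as the correlation, which is why looking at correlations first is useful. And with one regressor R squared is just the squared correlation, while F is the squared t statistic of the slope; F carries information of its own only in multiple regression. The p-values themselves require the t and F distributions, which the sketch leaves to Stata.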

Figure 1. Looking at data with a scatterplot.



Figure 2. Running regression (Statistics > Linear models and related > Linear regression)

Jul 16

The pearls of AP Statistics 1

I am starting a new series of posts about AP Statistics. In my view, AP Statistics is not just bad, it is harmful, and somebody has to say this out loud. AP Statistics textbooks target a huge market (I have eight textbooks, and there are probably more). Their level and teaching methodology address the needs of that market, so in this sense the choices made by their authors are justified. However, it is the College Board that sets the standards, and those standards are so low that I would not advise anybody to follow them.

In my posts I hope to cover all eight textbooks that I have. The first batch of posts will be based on the book A. Agresti, A. Franklin. Statistics: The Art and Science of Learning from Data, 3rd Edition. Pearson, 2013. At the end of each batch, I am going to give an overall evaluation of the book. Whatever criticism I have, it is more about the College Board's requirements than about a particular book.

They say: Descriptive statistics refers to methods for summarizing the collected data (where the data constitutes either a sample or a population). The summaries usually consist of graphs and numbers such as averages and percentages (p.9). Inferential statistics refers to methods of making decisions or predictions about a population, based on data obtained from a sample of that population (p.10).

I say: I am a professional statistician, and the distinction between descriptive statistics and inferential statistics has never played a role in my research or teaching. If you care about this distinction, introduce it later, when your students know what you are talking about. This will allow you to avoid asking unprofessionally trivial questions at the beginning of the course.