6
Oct 17

## Significance level and power of test

In this post we discuss several interrelated concepts: null and alternative hypotheses, type I and type II errors and their probabilities. Review the definitions of a sample space and elementary events and that of a conditional probability.

## Type I and Type II errors

Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (for example, the suspect is guilty) and the alternative hypothesis (the suspect is innocent). It's up to us what to call the null and what to call the alternative. However, the statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than the other probabilities involved. This is why what we want to prove is usually designated as the alternative.

Usually in books you can see the following table.

| State of nature | Decision taken: fail to reject null | Decision taken: reject null |
|---|---|---|
| Null is true | Correct decision | Type I error |
| Null is false | Type II error | Correct decision |

This table is not good enough because there is no link to probabilities. The next video does fill in the blanks.

Video. Significance level and power of test

## Significance level and power of test

The conclusion from the video is that

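$\frac{P(T\cap R)}{P(T)}=P(R|T)=P(\text{Type I error})=\text{significance level},$

$\frac{P(F\cap R)}{P(F)}=P(R|F)=P(\text{correctly rejecting a false null})=\text{power},$

where $T$ and $F$ are the events that the null is true or false, respectively, and $R$ is the event that the null is rejected.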
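The two conditional probabilities $P(R|T)$ and $P(R|F)$ can be computed explicitly in a textbook case. The sketch below assumes a one-sided z-test for a normal mean with known standard deviation. The function name and all numbers (`mu0`, `mu1`, `sigma`, `n`, `alpha`) are illustrative assumptions and are not taken from the video.

```python
from statistics import NormalDist

def z_test_error_rates(mu0, mu1, sigma, n, alpha):
    """One-sided z-test of H0: mean = mu0 against H1: mean = mu1 > mu0.

    Returns (significance level, power), i.e. P(R|T) and P(R|F).
    """
    se = sigma / n ** 0.5  # standard deviation of the sample mean
    # Reject the null when the sample mean exceeds this critical value.
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # P(R|T): the sample mean is N(mu0, se^2) when the null is true.
    p_reject_given_true = 1 - NormalDist(mu0, se).cdf(critical)
    # P(R|F): the sample mean is N(mu1, se^2) when the null is false.
    p_reject_given_false = 1 - NormalDist(mu1, se).cdf(critical)
    return p_reject_given_true, p_reject_given_false

level, power = z_test_error_rates(mu0=0.0, mu1=0.5, sigma=1.0, n=25, alpha=0.05)
print(f"significance level = {level:.3f}, power = {power:.3f}")
```

By construction the first number equals `alpha`; the second depends on how far `mu1` is from `mu0` and on the sample size.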
11
Aug 17

## Violations of classical assumptions

This will be a simple post explaining the common observation that "in Economics, variability of many variables is proportional to those variables". Make sure to review the assumptions; they tend to slip from memory. We consider the simple regression

(1) $y_i=a+bx_i+e_i.$

One of the classical assumptions is

Homoscedasticity. All errors have the same variance: $Var(e_i)=\sigma^2$ for all $i$.

We discuss its opposite, which is

Heteroscedasticity. Not all errors have the same variance. It would be wrong to write it as $Var(e_i)\ne\sigma^2$ for all $i$ (which means that all errors have variance different from $\sigma^2$). You can write that not all $Var(e_i)$ are the same but it's better to use the verbal definition.

Remark about Video 1. The dashed lines can represent mean consumption. Then the fact that variation of a variable grows with its level becomes more obvious.

Video 1. Case for heteroscedasticity

Figure 1. Illustration from Dougherty: as x increases, variance of the error term increases

Homoscedasticity was used in the derivation of the OLS estimator variance; under heteroscedasticity that expression is no longer valid. There are other implications, which will be discussed later.

Companies example. The Samsung Galaxy Note 7 battery fires and explosions that caused two recalls cost the smartphone maker at least \$5 billion. There is no way a small company could have such losses.

GDP example. The error in measuring US GDP is on the order of \$200 bln, which is comparable to the Kazakhstan GDP. However, the standard deviation of the ratio error/GDP seems to be about the same across countries, if the underground economy is not too big.

Often the assumption that the standard deviation of the regression error is proportional to one of the regressors is plausible.

To see if the regression error is heteroscedastic, you can look at the graph of the residuals or use statistical tests.
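To see what "standard deviation proportional to a regressor" looks like in numbers, here is a minimal simulation sketch. The proportionality constant 0.5, the range of the regressor and the seed are arbitrary assumptions. The simulated errors are used directly; with real data you would look at the residuals instead.

```python
import random
from statistics import pstdev

random.seed(0)  # fixed seed so that the run is reproducible

# Regressor values from 1 to 10 (illustrative).
xs = [1 + 9 * i / 999 for i in range(1000)]
# Heteroscedastic errors: the standard deviation of e_i is 0.5 * x_i.
es = [random.gauss(0, 0.5 * x) for x in xs]

# Compare the spread of the errors for small and for large x.
half = len(xs) // 2
sd_low = pstdev(es[:half])
sd_high = pstdev(es[half:])

# Dividing each error by x_i removes the proportionality.
scaled = [e / x for x, e in zip(xs, es)]
sd_scaled_low = pstdev(scaled[:half])
sd_scaled_high = pstdev(scaled[half:])

print(f"raw errors:    sd for small x = {sd_low:.2f}, for large x = {sd_high:.2f}")
print(f"scaled errors: sd for small x = {sd_scaled_low:.2f}, for large x = {sd_scaled_high:.2f}")
```

The raw errors are visibly more spread out for large x, while the ratios e/x have roughly the same spread in both halves, which mirrors the error/GDP remark above.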

7
Aug 17

## Violations of classical assumptions

This is a large topic which requires several posts or several book chapters. During a conference in Sweden in 2010, a Swedish statistician asked me: "What is Econometrics, anyway? What tools does it use?" I said: "Among others, it uses linear regression." He said: "But linear regression is a general statistical tool, why do they say it's a part of Econometrics?" My answer was: "Yes, it's a general tool but the name Econometrics emphasizes that the motivation for its applications lies in Economics".

Both classical assumptions and their violations should be studied with this point in mind: What is the Economics and Math behind each assumption?

## Violations of the first three assumptions

We consider the simple regression

(1) $y_i=a+bx_i+e_i$

Make sure to review the assumptions. Their numbering and names sometimes are different from what Dougherty's book has. In particular, most of the time I omit the following assumption:

A6. The model is linear in parameters and correctly specified.

When it is not linear in parameters, you can think of nonlinear alternatives. Instead of saying "correctly specified" I say "true model" when a "wrong model" is available.

A1. What if the existence condition is violated? If variance of the regressor is zero, the OLS estimator does not exist. The fitted line is supposed to be vertical, and you can regress $x$ on $y$. Violation of the existence condition in case of multiple regression leads to multicollinearity, and that's where economic considerations are important.

A2. The convenience condition is called so because when it is violated, that is, when the regressor is stochastic, there are ways to deal with the problem: finite-sample theory and large-sample theory.

A3. What if the errors in (1) have means different from zero? This question can be divided in two: 1) the means of the errors are the same: $Ee_i=c\ne 0$ for all $i$ and 2) the means are different. Read the post about centering and see if you can come up with the answer for the first question. The means may be different because of omission of a relevant variable (can you do the math?). In the absence of data on such a variable, there is nothing you can do.

Violations of A4 and A5 will be treated later.

26
Jul 17

## Nonlinear least squares

Here we explain the idea, illustrate the possible problems in Mathematica and, finally, show the implementation in Stata.

## Idea: minimize RSS, as in ordinary least squares

Observations come in pairs $(x_1,y_1),...,(x_n,y_n)$. In case of ordinary least squares, we approximated the y's with linear functions of the parameters, possibly nonlinear in x's. Now we use a function $f(a,b,x_i)$ which may be nonlinear in $a,b$. We still minimize RSS which takes the form $RSS=\sum r_i^2=\sum(y_i-f(a,b,x_i))^2$. Nonlinear least squares estimators are the values $a,b$ that minimize RSS. In general, it is difficult to find the formula (closed-form solution), so in practice software, such as Stata, is used for RSS minimization.
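To make the definition concrete, here is a small sketch that computes RSS for a nonlinear function and picks the minimizing pair $(a,b)$. The exponential form of `f`, the data and the grid are assumptions chosen for illustration, and a brute-force grid search stands in for the iterative routine that real software would use.

```python
import math

def f(a, b, x):
    """Hypothetical model, nonlinear in the parameter b."""
    return a * math.exp(b * x)

def rss(a, b, xs, ys):
    """Residual sum of squares for the parameter pair (a, b)."""
    return sum((y - f(a, b, x)) ** 2 for x, y in zip(xs, ys))

# Noise-free illustrative data generated with a = 2, b = 0.5.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [f(2.0, 0.5, x) for x in xs]

# Brute-force search over a grid of candidate values 0.1, 0.2, ..., 4.0.
grid = [i / 10 for i in range(1, 41)]
a_hat, b_hat = min(
    ((a, b) for a in grid for b in grid),
    key=lambda p: rss(p[0], p[1], xs, ys),
)
print(a_hat, b_hat)  # prints 2.0 0.5
```

The search recovers the generating values only because the data are noise-free and the true values lie on the grid; with real data the minimizer is an estimate.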

## Simplified idea and problems in one-dimensional case

Suppose we want to minimize $f(x)$. The Newton algorithm (default in Stata) is an iterative procedure that consists of steps:

1. Select the initial value $x_0$.
2. Find the derivative (or tangent) of $f$ at $x_0$. Make a small step in the descent direction (indicated by the derivative), to obtain the next value $x_1$.
3. Repeat Step 2, using $x_1$ as the starting point, until the difference between the values of the objective function at two successive points becomes small. The last point $x_n$ will approximate the minimizing point.

Problems:

1. The minimizing point may not exist.
2. When it exists, it may not be unique. In general, there is no way to find out how many local minimums there are and which ones are global.
3. The point to which the algorithm converges depends on the initial point.
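The steps and the third problem can be sketched in a few lines. This is plain descent with a fixed step size, a simplification of the procedure described above and not Stata's actual routine. The objective `f`, its derivative `df`, the step size and the tolerance are all assumptions chosen for illustration.

```python
def f(x):
    # Illustrative objective with two local minima.
    return (x * x - 1) ** 2 + 0.3 * x

def df(x):
    # Derivative of f.
    return 4 * x ** 3 - 4 * x + 0.3

def descend(x0, step=0.01, tol=1e-12, max_iter=100_000):
    """Step against the derivative until the objective stops changing."""
    x = x0
    for _ in range(max_iter):
        x_next = x - step * df(x)          # small step in the descent direction
        if abs(f(x_next) - f(x)) < tol:    # stopping rule from Step 3
            return x_next
        x = x_next
    return x

x_left = descend(-0.5)   # starts to the left of the hump between the minima
x_right = descend(0.5)   # starts to the right of it
print(f"from -0.5: x = {x_left:.3f}, f = {f(x_left):.3f}")
print(f"from  0.5: x = {x_right:.3f}, f = {f(x_right):.3f}")
```

The two starting points lead to two different limits, and only the left one is the global minimum. Nothing in the algorithm itself signals which case you are in.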

See Video 1 for illustration in the one-dimensional case.

Video 1. NLS geometry

## Problems illustrated in Mathematica

Here we look at three examples of nonlinear functions, two of which are considered in Dougherty. The first one is a power function (it can be linearized by applying logs) and the second is an exponential function (it cannot be linearized). The third function gives rise to two minimums. The possibilities are illustrated in Mathematica.

Video 2. NLS illustrated in Mathematica

## Finally, implementation in Stata

Here we show how to 1) generate a random vector, 2) create a vector of initial values, and 3) program a nonlinear dependence.

Video 3. NLS implemented in Stata

10
Jul 17

## Alternatives to simple regression in Stata

In this post we looked at the dependence of EARNINGS on S (years of schooling). At the end I suggested thinking about possible variations of the model. Specifically, could the dependence be nonlinear? We consider two answers to this question.

## Quadratic regression

This name is used for the quadratic dependence of the dependent variable on the independent variable. For our variables the dependence is

$EARNINGS=a+bS+cS^2+u$.

Note that the dependence on S is quadratic but the right-hand side is linear in the parameters, so we still are in the realm of linear regression. Video 1 shows how to run this regression.
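Outside Stata, the same point can be seen in a short sketch: the regressors are 1, S and S squared, and the coefficients come from the usual normal equations of linear regression. The data below are hypothetical and noise-free (generated with a = 5, b = -1, c = 0.2), and the helper functions `solve` and `ols` are written here only for illustration.

```python
def solve(A, rhs):
    """Solve the square system A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def ols(rows, y):
    """OLS coefficients from the normal equations X'X beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical noise-free data: EARNINGS = 5 - 1*S + 0.2*S^2.
S = [8, 10, 12, 14, 16, 18, 20]
earnings = [5 - s + 0.2 * s ** 2 for s in S]

# Each row holds the regressors 1, S, S^2: quadratic in S, linear in a, b, c.
X = [[1.0, s, s ** 2] for s in S]
a, b, c = ols(X, earnings)
print(f"a = {a:.4f}, b = {b:.4f}, c = {c:.4f}")
```

No nonlinear optimization is needed: adding the squared variable as one more regressor is all it takes.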

Video 1. Running quadratic regression in Stata

## Nonparametric regression

The general way to write this model is

$y=m(x)+u.$

The beauty and power of nonparametric regression consists in the fact that we don't need to specify the functional form of dependence of $y$ on $x$. Therefore there are no parameters to interpret, there is only the fitted curve. There is also the estimated equation of the nonlinear dependence, which is too complex to consider here. I already illustrated the difference between parametric and nonparametric regression. See in Video 2 how to run nonparametric regression in Stata.

Video 2. Nonparametric dependence

6
Jul 17

## Running simple regression in Stata

Running simple regression in Stata is, well, simple. It's just a matter of a couple of clicks. Try to turn it into a small research project.

1. Obtain descriptive statistics for your data (Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics). Look at all that stuff you studied in introductory statistics: units of measurement, means, minimums, maximums, and correlations. Knowing the units of measurement will be important for interpreting regression results; correlations will predict signs of coefficients, etc. In your report, don't just mechanically repeat all those measures; try to find and discuss something interesting.
2. Visualize your data (Graphics > Twoway graph). On the graph you can observe outliers and discern possible nonlinearity.
3. After running regression, report the estimated equation. It is called a fitted line and in our case looks like this: Earnings = -13.93+2.45*S (use descriptive names and not abstract X,Y). To see if the coefficient of S is significant, look at its p-value, which is smaller than 0.001. This tells us that at all levels of significance larger than or equal to 0.001 the null that the coefficient of S is zero is rejected, that is, the coefficient is significant. This follows from the definition of p-value. Nobody cares about significance of the intercept. Report also the p-value of the F statistic. It characterizes the joint significance of all nontrivial regressors and is important in case of multiple regression. The last statistic to report is R squared.
4. Think about possible variations of the model. Could the dependence of Earnings on S be nonlinear? What other determinants of Earnings would you suggest from among the variables in Dougherty's file?
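For readers who want to see what the clicks compute, here is a minimal sketch of the fitted line and R squared obtained from the standard formulas. The function name `simple_ols` and the numbers are hypothetical; they are not Dougherty's data and will not reproduce the equation reported above.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope, R squared) of the simple regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)   # zero if x does not vary
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1 - rss / tss

# Hypothetical observations: years of schooling and hourly earnings.
S = [10, 12, 12, 14, 16, 16, 18]
earnings = [9.0, 14.5, 16.0, 21.0, 24.0, 27.5, 30.0]

intercept, slope, r2 = simple_ols(S, earnings)
print(f"Earnings = {intercept:.2f} + {slope:.2f}*S, R squared = {r2:.3f}")
```

The p-values require the standard errors as well, which is where the software earns its keep.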

Figure 1. Looking at data. For data, we use a scatterplot.

Figure 2. Running regression (Statistics > Linear models and related > Linear regression)

29
Jun 17

## Introduction to Stata

Introduction to Stata: Stata interface, how to use Stata Help, how to use Data Editor and how to graph data. Important details to remember:

1. In any program, the first thing to use is Help. I learned everything from Help and never took any programming courses.
2. The number of observations for all variables in one data file must be the same. This can be a problem if, for example, you want to see out-of-sample predictions.
3. In Data Editor, numeric variables are displayed in black and strings are displayed in red.
4. The name of the hidden variable that counts observations is _n.
5. If you have several definitions of graphs in two-way graphs menu, they will be graphed together or separately, depending on what is enabled/disabled.

See details in videos. Sorry about the background noise!

Video 1. Stata interface. The windows introduced: Results, Command, Variables, Properties, Review and Viewer.

Video 2. Using Stata Help. Help can be used through the Viewer window or in a separate PDF viewer. EViews Help is much easier to understand.

Video 3. Using Data Editor. How to open and view variables, the visual difference between numeric variables and string variables. The lengths of all variables in the same file must be the same.

Video 4. Graphing data. To graph a variable, you need to define its graph and then display it. It is possible to display more than one variable on the same chart.

22
Jun 17

## Autoregressive–moving-average (ARMA) models

Autoregressive–moving-average (ARMA) models were suggested in 1951 by Peter Whittle in his PhD thesis. Do you think he played with data and then came up with his model? No, he was guided by theory. The same model may describe visually very different data sets, and visualization rarely leads to model formulation.

Recall that the main idea behind autoregressive processes is to regress the variable on its own past values. In case of moving averages, we form linear combinations of elements of white noise. Combining the two ideas, we obtain the definition of the autoregressive–moving-average process:

(1) $y_t=\mu+\beta_1y_{t-1}+...+\beta_py_{t-p}+ \theta_1u_{t-1}+...+\theta_qu_{t-q}+u_t$

$=\mu+\sum_{i=1}^p\beta_iy_{t-i}+\sum_{i=1}^q\theta_iu_{t-i}+u_t.$

It is denoted ARMA(p,q), where p is the number of included past values and q is the number of included past errors (AKA shocks to the system). We should expect a couple of facts to hold for this process.

If all roots of its characteristic polynomial $p(x)=1-\beta_1x-...-\beta_px^p$ lie outside the unit circle, the process is stable and should be stationary.

A stable process can be represented as an infinite moving average. Such a representation is in fact used to analyze its properties.

The coefficients of the moving average part (the thetas) and the constant $\mu$ have no effect on stationarity.

The quantity $\partial y_t/\partial y_{t-1}=\beta_1$ can be called an instantaneous effect of $y_{t-1}$ on $y_t$. This effect accumulates over time (the value at $t$ is influenced by the value at $t-1$, which, in turn, is influenced by the value at $t-2$ and so on). Therefore the long-run interpretation of the coefficients is complicated. Comparison of Figures 1 and 2 illustrates this point.
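The accumulation of effects can be traced numerically. The sketch below generates $y_t$ from equation (1) for a given list of shocks; the function name `simulate_arma`, the coefficient values and the zero pre-sample values are assumptions made for illustration.

```python
def simulate_arma(mu, betas, thetas, shocks):
    """Generate y_t from equation (1) for a given list of shocks u_t.

    Values of y and u before the start of the sample are taken to be zero
    (an assumption made only to start the recursion).
    """
    y = []
    for t, u_t in enumerate(shocks):
        # Autoregressive part: beta_1*y_{t-1} + ... + beta_p*y_{t-p}
        ar_part = sum(b * y[t - i] for i, b in enumerate(betas, start=1) if t - i >= 0)
        # Moving average part: theta_1*u_{t-1} + ... + theta_q*u_{t-q}
        ma_part = sum(th * shocks[t - i] for i, th in enumerate(thetas, start=1) if t - i >= 0)
        y.append(mu + ar_part + ma_part + u_t)
    return y

# ARMA(1,1) with no constant and a single unit shock at t = 0.
path = simulate_arma(mu=0.0, betas=[0.5], thetas=[0.4], shocks=[1.0, 0.0, 0.0, 0.0])
print(path)
```

The path is approximately 1, 0.9, 0.45, 0.225: the shock at $t=0$ enters $y_1$ both through $\theta_1$ and through $\beta_1y_0$, and afterwards it keeps propagating through the autoregressive part alone.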

Exercise. For the model $y_t=\mu+\beta_1y_{t-1}+u_t$ find the mean $Ey_t$ (just modify this argument).

Figure 1. Simulated AR process

Figure 2. Simulated MA

Question. Why does the current error $u_t$ in (1) have coefficient 1? Choose the answer you like:

1) We never used the current error with a nontrivial coefficient.

2) It is logical to assume that past shocks may have an aftereffect (measured by thetas) on the current $y_t$ different from 1 but the effect of the current shock should be 1.

3) Mathematically, the case when instead of $u_t$ we have $\theta_0u_t$ with some nonzero $\theta_0$ can be reduced to the case when the current error has coefficient 1. Just introduce a new white noise $v_t=\theta_0u_t$ and rewrite the model using it.

5
Apr 17

## Maximum likelihood: application to linear model

We have to remember that a model and a method are not the same. Application of the least squares method to the linear model gives OLS estimators. Here we apply the Maximum Likelihood (ML) method to the same model.

### Assumptions and first order conditions for maximizing likelihood

We assume that the observations satisfy

(1) $y_i=\beta _1+\beta _2x_i+u_i,\ i=1,...,n.$
Our task is to find ML estimators of $\beta _1,\beta _2,\sigma^2$. To be able to apply the ML algorithm, we assume that the regressor $x$ is deterministic. Then on the right side of (1) the error is the only random term.

Step 1. Suppose that $u_1,...,u_n$ are independent normal with mean $0$ and variance $\sigma ^2$. (This implies that the errors are uncorrelated and identically distributed.) The density of $u_i$ is

(2) $p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{x^2}{2\sigma^2}).$
From (1) we see that $y_i$ is normal, as a linear transformation of $u_i$. By equation (2) in that post, the density of observation $(x_i,y_i)$ is
$f(x_i,y_i|\beta_1,\beta_2,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i-\beta_1-\beta_2x_i)^2}{2\sigma^2}}.$

Step 2. The likelihood function, by definition, is the joint density, considered a function of parameters. Because of the independence of observations, it can be obtained as a product of these densities
$L(\beta_1,\beta_2,\sigma^2|y,x)=\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i-\beta_1-\beta_2x_i)^2}{2\sigma^2}}$

$=(2\pi\sigma^2)^{-n/2}e^{-\sum_{i=1}^n\frac{(y_i-\beta_1-\beta_2x_i)^2}{2\sigma^2}}=(2\pi\sigma^2)^{-n/2}e^{-\frac{RSS}{2\sigma^2}}$

(see this post for the definition of RSS).

Step 3. The log-likelihood is
$\lambda(\beta_1,\beta_2,\sigma^2|y,x)=-\frac{n}{2}\log(2\pi)-\frac{n}{2}\log(\sigma^2)-\frac{RSS}{2\sigma^2}.$

The first-order conditions are
$\frac{\partial\lambda}{\partial\beta_1}=\frac{\partial\lambda}{\partial\beta_2}=\frac{\partial\lambda}{\partial\sigma^2}=0$
(technically, it is easier to differentiate with respect to $\sigma^2$ than to $\sigma$). We obtain a system of three equations for determining the parameters:
$\frac{\partial\lambda}{\partial\beta_1}=-\frac{1}{2\sigma^2}\frac{\partial RSS}{\partial\beta_1}=0,$ $\frac{\partial\lambda}{\partial\beta_2}=-\frac{1}{2\sigma^2}\frac{\partial RSS}{\partial\beta_2}=0,$

$\frac{\partial\lambda}{\partial\sigma^2}=-\frac{n}{2\sigma^2}+\frac{RSS}{2\sigma^4}=0.$

### ML estimators and discussion

From the first two equations we see that the ML estimators of $\beta_1,\beta_2$ are the same as OLS estimators:

$\hat{\beta_1}^{ML}=\hat{\beta_1}^{OLS},$ $\hat{\beta_2}^{ML}=\hat{\beta_2}^{OLS}.$

We know by the Gauss-Markov theorem that these estimators are most efficient in the set of linear unbiased estimators. The third equation gives

$\hat{\sigma}^2_{ML}=\frac{RSS}{n},$

which is different from $\hat{\sigma}^2_{OLS}=\frac{RSS}{n-2}.$ The ML estimator is biased in finite samples (its mean is $\frac{n-2}{n}\sigma^2$), but it is asymptotically efficient: as $n\to\infty$, its variance approaches the Cramér-Rao lower bound.
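A quick numerical check of these conclusions, as a sketch on made-up data: compute the estimators from the formulas, evaluate the log-likelihood from Step 3, and confirm that moving away from the ML values lowers it. The data and the function names are hypothetical.

```python
import math
from statistics import mean

def rss(beta1, beta2, x, y):
    """Residual sum of squares for the line beta1 + beta2*x."""
    return sum((yi - beta1 - beta2 * xi) ** 2 for xi, yi in zip(x, y))

def log_likelihood(beta1, beta2, sigma2, x, y):
    """Log-likelihood of the linear model with independent normal errors (Step 3)."""
    n = len(x)
    return (-n / 2 * math.log(2 * math.pi) - n / 2 * math.log(sigma2)
            - rss(beta1, beta2, x, y) / (2 * sigma2))

def ml_estimates(x, y):
    """ML estimators: OLS formulas for the betas, RSS/n for the variance."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta2 = sxy / sxx
    beta1 = y_bar - beta2 * x_bar
    return beta1, beta2, rss(beta1, beta2, x, y) / n

# Made-up observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, b2, s2_ml = ml_estimates(x, y)
s2_ols = rss(b1, b2, x, y) / (len(x) - 2)
print(f"beta1 = {b1:.3f}, beta2 = {b2:.3f}")
print(f"sigma^2: ML = {s2_ml:.4f}, OLS = {s2_ols:.4f}")
print(f"log-likelihood at ML variance  = {log_likelihood(b1, b2, s2_ml, x, y):.4f}")
print(f"log-likelihood at OLS variance = {log_likelihood(b1, b2, s2_ols, x, y):.4f}")
```

The log-likelihood is higher at $RSS/n$ than at $RSS/(n-2)$, as the third first-order condition requires.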

24
Mar 17

## Distribution and density functions of a linear transformation

Distribution and density functions of a linear transformation - two short derivations to read during breakfast.

### Distribution function of a linear transformation

Let $X$ be a random variable and let $Y=aX+b$ be its linear transformation (here $a,b$ are some real numbers and $a\ne0$, otherwise $Y$ is not random). If the distribution function $F_X$ is known, what will be the distribution function of $Y$?

The answer is obtained in one line if you know the definition of the distribution function:

(1) $F_Y(y)=P(Y\le y)=P(aX+b\le y)=P(X\le\frac{y-b}{a})=F_X(\frac{y-b}{a})$.

For the inequalities $aX+b\le y$ and $X\le\frac{y-b}{a}$ to be equivalent, we have to assume that $a>0$ (for applications this is enough). The case $a<0$ is left as an exercise.

### Density function of a linear transformation

As above, $Y$ is a linear transformation of $X$. Suppose they have densities $p_X,p_Y$. What is the relationship between the densities?

Recall formula (1) that links distribution and density functions. Equation (1) in terms of densities becomes

$\int_{-\infty}^yp_Y(t)dt=\int_{-\infty}^{\frac{y-b}{a}}p_X(t)dt$.

Let's differentiate this equation. The Newton-Leibniz formula applied to the integral on the left gives $p_Y$ evaluated at $y$. On the right, additionally, we have to use the chain rule. The result is

(2) $p_Y(y)=p_X(\frac{y-b}{a})\frac{d}{dy}\frac{y-b}{a}=\frac{1}{a}p_X(\frac{y-b}{a})$.
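Both formulas are easy to check numerically in a case where the distribution of $Y$ is known independently. The sketch below takes $X$ standard normal, so that $Y=aX+b$ is normal with mean $b$ and standard deviation $a$ (a standard fact, used here as the benchmark). The values of `a`, `b` and the evaluation points are arbitrary assumptions.

```python
from statistics import NormalDist

a, b = 2.0, 3.0               # illustrative constants with a > 0
X = NormalDist(0.0, 1.0)      # X ~ N(0, 1)
Y = NormalDist(b, a)          # Y = aX + b ~ N(b, a^2), standard deviation a

def cdf_y_via_formula(y):
    # Equation (1): F_Y(y) = F_X((y - b) / a)
    return X.cdf((y - b) / a)

def pdf_y_via_formula(y):
    # Equation (2): p_Y(y) = (1 / a) * p_X((y - b) / a)
    return X.pdf((y - b) / a) / a

for y in (-1.0, 3.0, 4.5):
    print(y, Y.cdf(y), cdf_y_via_formula(y), Y.pdf(y), pdf_y_via_formula(y))
```

In each row the second and third numbers agree, and so do the fourth and fifth, up to rounding.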

Equation (2) will be used to derive ML estimators for the linear model.