5 May 22

## Vector autoregression (VAR)

Suppose we are observing two stocks and their respective returns are $x_{t},y_{t}.$ To take into account their interdependence, we consider a vector autoregression

(1) $\left\{\begin{array}{c}x_{t}=a_{1}x_{t-1}+b_{1}y_{t-1}+u_{t} \\ y_{t}=a_{2}x_{t-1}+b_{2}y_{t-1}+v_{t}\end{array}\right.$

Try to repeat for this system the analysis from Section 3.5 (Application to an AR(1) process) of the Guide by A. Patton and you will see that the difficulties are insurmountable. However, matrix algebra, with proper adjustments, allows one to overcome them.

### Problem

A) Write this system in a vector format

(2) $Y_{t}=\Phi Y_{t-1}+U_{t}.$

What should be $Y_{t},\Phi ,U_{t}$ in this representation?

B) Assume that the error $U_{t}$ in (1) satisfies

(3) $E_{t-1}U_{t}=0,\ EU_{t}U_{t}^{T}=\Sigma ,~EU_{t}U_{s}^{T}=0$ for $t\neq s$ with some symmetric matrix $\Sigma =\left(\begin{array}{cc}\sigma _{11} & \sigma _{12} \\ \sigma _{12} & \sigma _{22}\end{array}\right) .$

What does this assumption mean in terms of the components of $U_{t}$ from (2)? What is $\Sigma$ if the errors in (1) satisfy

(4) $E_{t-1}u_{t}=E_{t-1}v_{t}=0,~Eu_{t}^{2}=Ev_{t}^{2}=\sigma ^{2},$ $Eu_{s}u_{t}=Ev_{s}v_{t}=0$ for $t\neq s,$ $Eu_{s}v_{t}=0$ for all $s,t?$

C) Suppose (1) is stationary. The stationarity condition is expressed in terms of eigenvalues of $\Phi$ but we don't need it. However, we need its implication:

(5) $\det \left( I-\Phi \right) \neq 0$.

Find $\mu =EY_{t}.$

D) Find $Cov(Y_{t-1},U_{t}).$

E) Find $\gamma _{0}\equiv V\left( Y_{t}\right) .$

F) Find $\gamma _{1}=Cov(Y_{t},Y_{t-1}).$

G) Find $\gamma _{2}.$

### Solution

A) It takes some practice to see that with the notation

$Y_{t}=\left(\begin{array}{c}x_{t} \\y_{t}\end{array}\right) ,$ $\Phi =\left(\begin{array}{cc}a_{1} & b_{1} \\a_{2} & b_{2}\end{array}\right) ,$ $U_{t}=\left(\begin{array}{c}u_{t} \\v_{t}\end{array}\right)$

the system (1) becomes (2).

B) The equations in (3) look like this:

$E_{t-1}U_{t}=\left(\begin{array}{c}E_{t-1}u_{t} \\ E_{t-1}v_{t}\end{array}\right) =0,$ $EU_{t}U_{t}^{T}=\left(\begin{array}{cc}Eu_{t}^{2} & Eu_{t}v_{t} \\ Eu_{t}v_{t} & Ev_{t}^{2}\end{array}\right) =\left(\begin{array}{cc}\sigma _{11} & \sigma _{12} \\ \sigma _{12} & \sigma _{22}\end{array}\right) ,$ $EU_{t}U_{s}^{T}=\left(\begin{array}{cc}Eu_{t}u_{s} & Eu_{t}v_{s} \\Ev_{t}u_{s} & Ev_{t}v_{s}\end{array}\right) =0.$

Equalities of matrices are understood element-wise, so we get a series of scalar equations $E_{t-1}u_{t}=0,...,Ev_{t}v_{s}=0$ for $t\neq s.$

Conversely, the scalar equations from (4) give

$E_{t-1}U_{t}=0,\ EU_{t}U_{t}^{T}=\left(\begin{array}{cc}\sigma ^{2} & 0 \\0 & \sigma ^{2}\end{array}\right) ,~EU_{t}U_{s}^{T}=0$ for $t\neq s$.

C) (2) implies $EY_{t}=\Phi EY_{t-1}+EU_{t}=\Phi EY_{t-1}$ or by stationarity $\mu =\Phi \mu$ or $\left( I-\Phi \right) \mu =0.$ Hence (5) implies $\mu =0.$

D) From (2) we see that $Y_{t-1}$ depends only on information available at time $t-1$ (the information set $I_{t-1}$). Therefore by the LIE

$Cov(Y_{t-1},U_{t})=E\left( Y_{t-1}-EY_{t-1}\right) U_{t}^{T}=E\left[ \left(Y_{t-1}-EY_{t-1}\right) E_{t-1}U_{t}^{T}\right] =0,$ $Cov\left( U_{t},Y_{t-1}\right) =\left[ Cov(Y_{t-1},U_{t})\right] ^{T}=0.$

E) Using the previous post

$\gamma _{0}\equiv V\left( \Phi Y_{t-1}+U_{t}\right) =\Phi V\left(Y_{t-1}\right) \Phi ^{T}+Cov\left( U_{t},Y_{t-1}\right) \Phi ^{T}+\Phi Cov(Y_{t-1},U_{t})+V\left( U_{t}\right)$ $=\Phi \gamma _{0}\Phi ^{T}+\Sigma$

(by stationarity and (3)). Thus, $\gamma _{0}-\Phi \gamma _{0}\Phi^{T}=\Sigma$ and $\gamma _{0}=\sum_{s=0}^{\infty }\Phi ^{s}\Sigma\left( \Phi^{T}\right) ^{s}$ (see previous post).

F) Using the previous result we have

$\gamma _{1}=Cov(Y_{t},Y_{t-1})=Cov(\Phi Y_{t-1}+U_{t},Y_{t-1})=\Phi Cov(Y_{t-1},Y_{t-1})+Cov(U_{t},Y_{t-1})$ $=\Phi Cov(Y_{t-1},Y_{t-1})=\Phi \gamma _{0}=\Phi \sum_{s=0}^{\infty }\Phi^{s}\Sigma\left( \Phi ^{T}\right) ^{s}.$

G) Similarly,

$\gamma _{2}=Cov(Y_{t},Y_{t-2})=Cov(\Phi Y_{t-1}+U_{t},Y_{t-2})=\Phi Cov(Y_{t-1},Y_{t-2})+Cov(U_{t},Y_{t-2})$ $=\Phi Cov(Y_{t-1},Y_{t-2})=\Phi \gamma _{1}=\Phi ^{2}\sum_{s=0}^{\infty}\Phi ^{s}\Sigma\left( \Phi ^{T}\right) ^{s}.$

Autocorrelations require a little more effort and I leave them out.
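As a numerical sanity check of parts E) and F), one can compute $\gamma _{0}$ from the series, verify the fixed-point equation $\gamma _{0}-\Phi \gamma _{0}\Phi ^{T}=\Sigma ,$ and compare with sample autocovariances from a long simulated path. This is only a sketch: NumPy is assumed, and $\Phi ,\Sigma$ below are made up for illustration (eigenvalues of $\Phi$ inside the unit circle, so the VAR(1) is stationary).

```python
import numpy as np

# Made-up illustrative coefficients; eigenvalues of Phi lie inside the unit circle.
Phi = np.array([[0.5, 0.1],
                [0.2, 0.3]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

# gamma_0 from the series sum_{s>=0} Phi^s Sigma (Phi^T)^s (truncated).
gamma0 = np.zeros((2, 2))
term = Sigma.copy()
for _ in range(200):
    gamma0 += term
    term = Phi @ term @ Phi.T
gamma1 = Phi @ gamma0          # part F)

# Simulate a long path and compare sample autocovariances with theory.
rng = np.random.default_rng(0)
T = 200_000
U = rng.standard_normal((T, 2)) @ np.linalg.cholesky(Sigma).T  # Var(U_t) = Sigma
Y = np.zeros((T, 2))
for t in range(1, T):
    Y[t] = Phi @ Y[t - 1] + U[t]

Yc = Y - Y.mean(axis=0)
g0_hat = Yc.T @ Yc / T
g1_hat = Yc[1:].T @ Yc[:-1] / (T - 1)
print(np.round(gamma0, 2), np.round(g0_hat, 2))  # theory vs simulation
print(np.round(gamma1, 2), np.round(g1_hat, 2))
```

With this sample size the sample autocovariances should agree with the theoretical ones to roughly two decimal places.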

5 May 22

## Vector autoregressions: preliminaries

Suppose we are observing two stocks and their respective returns are $x_{t},y_{t}.$ A vector autoregression for the pair $x_{t},y_{t}$ is one way to take into account their interdependence. This theory is undeservedly omitted from the Guide by A. Patton.

### Required minimum in matrix algebra

Matrix notation and summation are very simple.

Matrix multiplication is a little more complex. Make sure to read Global idea 2 and the compatibility rule.

The general approach to study matrices is to compare them to numbers. Here you see the first big No: matrices do not commute, that is, in general $AB\neq BA.$

The idea behind matrix inversion is pretty simple: we want an analog of the property $a\times \frac{1}{a}=1$ that holds for numbers.

Some facts about determinants have very complicated proofs and it is best to stay away from them. But a couple of ideas should be clear from the very beginning. Determinants are defined only for square matrices. The relationship of determinants to matrix invertibility explains the role of determinants. If $A$ is square, it is invertible if and only if $\det A\neq 0$ (this is an equivalent of the condition $a\neq 0$ for numbers).

Here is an illustration of how determinants are used. Suppose we need to solve the equation $AX=Y$ for $X,$ where $A$ and $Y$ are known. Assuming that $\det A\neq 0$ we can premultiply the equation by $A^{-1}$ to obtain $A^{-1}AX=A^{-1}Y.$ (Because of lack of commutativity, we need to keep the order of the factors). Using intuitive properties $A^{-1}A=I$ and $IX=X$ we obtain the solution: $X=A^{-1}Y.$ In particular, we see that if $\det A\neq 0,$ then the equation $AX=0$ has a unique solution $X=0.$
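A minimal numerical illustration of this recipe (NumPy assumed; the numbers are made up):

```python
import numpy as np

# Made-up numbers: A is invertible, so AX = Y has the unique solution X = A^{-1} Y.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
Y = np.array([[5.0],
              [10.0]])

assert np.linalg.det(A) != 0          # invertibility check, det A = 5
X = np.linalg.inv(A) @ Y              # X = A^{-1} Y; here x1 = 1, x2 = 3
print(X)

# Numerically it is better to solve without forming the inverse explicitly.
X2 = np.linalg.solve(A, Y)
```

In practice `np.linalg.solve` is preferred to explicit inversion for accuracy and speed; both give the same answer here.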

Let $A$ be a square matrix and let $X,Y$ be two square matrices of the same size. $A,Y$ are assumed to be known and $X$ is unknown. We want to check that $X=\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}$ solves the equation $X-AXA^{T}=Y,$ assuming that the series converges (this is the case, for example, when all eigenvalues of $A$ are less than 1 in absolute value). (Note that for this equation the trick used to solve $AX=Y$ does not work.) Just plug $X:$

$\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}-A\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}A^{T}$ $=Y+\sum_{s=1}^{\infty }A^{s}Y\left(A^{T}\right) ^{s}-\sum_{s=1}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}=Y$

(write out a couple of first terms in the sums if summation signs frighten you).
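A quick numerical check of this claim, with a made-up $A$ whose eigenvalues are less than 1 in absolute value so that the series converges (NumPy assumed):

```python
import numpy as np

# Check that X = sum_{s=0}^inf A^s Y (A^T)^s solves X - A X A^T = Y.
# A is made up with spectral radius 0.5, so the series converges fast.
A = np.array([[0.4, 0.2],
              [0.1, 0.3]])
Y = np.array([[1.0, 0.5],
              [0.5, 2.0]])

X = np.zeros_like(Y)
term = Y.copy()
for _ in range(200):          # truncate once the terms are negligible
    X += term
    term = A @ term @ A.T

# The residual is a tail term of the series, numerically zero.
print(np.round(X - A @ X @ A.T - Y, 12))
```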

Transposition is a geometrically simple operation. We need only the property $\left( AB\right) ^{T}=B^{T}A^{T}.$

### Variance and covariance

Property 1. Variance of a random vector $X$ and covariance of two random vectors $X,Y$ are defined by

$V\left( X\right) =E\left( X-EX\right) \left( X-EX\right) ^{T},$ $Cov\left(X,Y\right) =E\left( X-EX\right) \left( Y-EY\right) ^{T},$

respectively.

Note that when $EX=0,$ variance becomes

$V\left( X\right) =EXX^{T}=\left(\begin{array}{ccc}EX_{1}^{2} & ... & EX_{1}X_{n} \\ ... & ... & ... \\ EX_{1}X_{n} & ... & EX_{n}^{2}\end{array}\right) .$

Property 2. Let $X,Y$ be random vectors and suppose $A,B$ are constant matrices. We want an analog of $V\left( aX+bY\right) =a^{2}V\left( X\right) +2abCov\left( X,Y\right) +b^{2}V\left( Y\right) .$ In the next calculation we have to remember that the multiplication order cannot be changed.

$V\left( AX+BY\right) =E\left[ AX+BY-E\left( AX+BY\right) \right] \left[AX+BY-E\left( AX+BY\right) \right] ^{T}$ $=E\left[ A\left( X-EX\right) +B\left( Y-EY\right) \right] \left[ A\left(X-EX\right) +B\left( Y-EY\right) \right] ^{T}$ $=E\left[ A\left( X-EX\right) \right] \left[ A\left( X-EX\right) \right]^{T}+E\left[ B\left( Y-EY\right) \right] \left[ A\left( X-EX\right) \right]^{T}$ $+E\left[ A\left( X-EX\right) \right] \left[ B\left( Y-EY\right) \right]^{T}+E\left[ B\left( Y-EY\right) \right] \left[ B\left( Y-EY\right) \right]^{T}$

(applying $\left( AB\right) ^{T}=B^{T}A^{T}$)

$=AE\left( X-EX\right) \left( X-EX\right) ^{T}A^{T}+BE\left( Y-EY\right)\left( X-EX\right) ^{T}A^{T}$ $+AE\left( X-EX\right) \left( Y-EY\right) ^{T}B^{T}+BE\left( Y-EY\right)\left( Y-EY\right) ^{T}B^{T}$ $=AV\left( X\right) A^{T}+BCov\left( Y,X\right)A^{T}+ACov(X,Y)B^{T}+BV\left( Y\right) B^{T}.$
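This identity can be verified numerically by stacking $X,Y$ into one vector: $AX+BY=\left( A\ B\right) \binom{X}{Y},$ so its variance is $\left( A\ B\right)$ times the variance of the stacked vector times $\left( A\ B\right) ^{T}.$ A sketch with made-up matrices (NumPy assumed):

```python
import numpy as np

# Exact check of the four-term formula above. S is a made-up covariance matrix
# of the stacked vector (X, Y); its blocks are V(X), Cov(X,Y), Cov(Y,X), V(Y).
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
S = M @ M.T                       # nonnegative definite 4x4
Vx, Cxy = S[:2, :2], S[:2, 2:]
Cyx, Vy = S[2:, :2], S[2:, 2:]

A = np.array([[1.0, 2.0], [0.0, 1.0]])
B = np.array([[0.5, -1.0], [1.5, 0.3]])

# Direct computation: AX + BY = (A B)(X; Y), so V(AX+BY) = (A B) S (A B)^T.
AB = np.hstack([A, B])
lhs = AB @ S @ AB.T
rhs = A @ Vx @ A.T + B @ Cyx @ A.T + A @ Cxy @ B.T + B @ Vy @ B.T
print(np.allclose(lhs, rhs))      # True
```

The agreement is exact (up to floating point), since both sides are the same algebraic expression in the blocks of $S.$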

19 Feb 22

## Distribution of the estimator of the error variance

If you are reading the book by Dougherty: this post is about the distribution of the estimator  $s^2$ defined in Chapter 3.

Consider regression

(1) $y=X\beta +e$

where the deterministic matrix $X$ is of size $n\times k,$ satisfies $\det \left( X^{T}X\right) \neq 0$ (regressors are not collinear) and the error $e$ satisfies

(2) $Ee=0,Var(e)=\sigma ^{2}I$

$\beta$ is estimated by $\hat{\beta}=(X^{T}X)^{-1}X^{T}y.$ Denote $P=X(X^{T}X)^{-1}X^{T},$ $Q=I-P.$ Using (1) we see that $\hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e$ and the residual $r\equiv y-X\hat{\beta}=Qe.$ $\sigma^{2}$ is estimated by

(3) $s^{2}=\left\Vert r\right\Vert ^{2}/\left( n-k\right) =\left\Vert Qe\right\Vert ^{2}/\left( n-k\right) .$

$Q$ is a projector and has properties which are derived from those of $P$

(4) $Q^{T}=Q,$ $Q^{2}=Q.$

If $\lambda$ is an eigenvalue of $Q,$ then multiplying $Qx=\lambda x$ by $Q$ and using the fact that $x\neq 0$ we get $\lambda ^{2}=\lambda .$ Hence eigenvalues of $Q$ can be only $0$ or $1.$ The equation $tr\left( Q\right) =n-k$
tells us that the number of eigenvalues equal to 1 is $n-k$ and the remaining $k$ are zeros. Let $Q=U\Lambda U^{T}$ be the diagonal representation of $Q.$ Here $U$ is an orthogonal matrix,

(5) $U^{T}U=I,$

and $\Lambda$ is a diagonal matrix with the eigenvalues of $Q$ on the main diagonal. We can assume that the first $n-k$ numbers on the diagonal of $\Lambda$ are ones and the others are zeros.

Theorem. Let $e$ be normal. 1) $s^{2}\left( n-k\right) /\sigma ^{2}$ is distributed as $\chi _{n-k}^{2}.$ 2) The estimators $\hat{\beta}$ and $s^{2}$ are independent.

Proof. 1) We have by (4)

(6) $\left\Vert Qe\right\Vert ^{2}=\left( Qe\right) ^{T}Qe=\left( Q^{T}Qe\right) ^{T}e=\left( Qe\right) ^{T}e=\left( U\Lambda U^{T}e\right)^{T}e=\left( \Lambda U^{T}e\right) ^{T}U^{T}e.$

Denote $S=U^{T}e.$ From (2) and (5)

$ES=0,$ $Var\left( S\right) =EU^{T}ee^{T}U=\sigma ^{2}U^{T}U=\sigma ^{2}I$

and $S$ is normal as a linear transformation of a normal vector. It follows that $S=\sigma z$ where $z$ is a standard normal vector with independent standard normal coordinates $z_{1},...,z_{n}.$ Hence, (6) implies

(7) $\left\Vert Qe\right\Vert ^{2}=\sigma ^{2}\left( \Lambda z\right)^{T}z=\sigma ^{2}\left( z_{1}^{2}+...+z_{n-k}^{2}\right) =\sigma ^{2}\chi _{n-k}^{2}.$

(3) and (7) prove the first statement.

2) First we note that the vectors $Pe,Qe$ are independent. Since they are normal, their independence follows from

$cov(Pe,Qe)=EPee^{T}Q^{T}=\sigma ^{2}PQ=0.$

It's easy to see that $X^{T}P=X^{T}.$ This allows us to show that $\hat{\beta}$ is a function of $Pe$:

$\hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e=\beta +(X^{T}X)^{-1}X^{T}Pe.$

Independence of $Pe,Qe$ leads to independence of their functions $\hat{\beta}$ and $s^{2}.$
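A simulation sketch of statement 1) of the theorem (NumPy assumed; the design matrix $X$ below is made up): with normal errors, $(n-k)s^{2}/\sigma ^{2}$ should have mean $n-k$ and variance $2(n-k),$ the moments of $\chi _{n-k}^{2}.$

```python
import numpy as np

# Monte Carlo check that (n-k) s^2 / sigma^2 behaves like chi-squared
# with n-k degrees of freedom (mean n-k, variance 2(n-k)).
rng = np.random.default_rng(2)
n, k, sigma = 20, 3, 1.5
X = rng.standard_normal((n, k))               # arbitrary full-rank design
P = X @ np.linalg.inv(X.T @ X) @ X.T
Q = np.eye(n) - P

reps = 100_000
E = sigma * rng.standard_normal((reps, n))    # each row is one error vector e
R = E @ Q.T                                   # rows are residuals Qe
stats = (R ** 2).sum(axis=1) / sigma ** 2     # (n-k) s^2 / sigma^2
print(round(stats.mean(), 2), round(stats.var(), 2))  # ~ n-k = 17 and 2(n-k) = 34
```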

28 Dec 21

## Chi-squared distribution

This post is intended to close a gap in J. Abdey's guide ST2133, which is the absence of distributions widely used in Econometrics.

### Chi-squared with one degree of freedom

Let $X$ be a random variable and let $Y=X^{2}.$

Question 1. What is the link between the distribution functions of $Y$ and $X?$

Chart 1. Inverting a square function

The start is simple: just follow the definitions. $F_{Y}\left( y\right)=P\left( Y\leq y\right) =P\left( X^{2}\leq y\right) .$ Assuming that $y>0$, on Chart 1 we see that $\left\{ x:x^{2}\leq y\right\} =\left\{x: -\sqrt{y}\leq x\leq \sqrt{y}\right\} .$ Hence, using additivity of probability,

(1) $F_{Y}\left( y\right) =P\left( -\sqrt{y}\leq X\leq \sqrt{y}\right) =P\left( X\leq \sqrt{y}\right) -P\left( X<-\sqrt{y}\right)$

$=F_{X}\left( \sqrt{y}\right) -F_{X}\left( -\sqrt{y}\right) .$

The last transition is based on the assumption that $P\left( X=x\right) =0$ for all $x$ (so that $P\left( X<-\sqrt{y}\right) =P\left( X\leq -\sqrt{y}\right) =F_{X}\left( -\sqrt{y}\right)$), which is maintained for continuous random variables throughout the guide by Abdey.

Question 2. What is the link between the densities of $X$ and $Y=X^{2}?$ By the Leibniz integral rule (1) implies

(2) $f_{Y}\left( y\right) =f_{X}\left( \sqrt{y}\right) \frac{1}{2\sqrt{y}} +f_{X}\left( -\sqrt{y}\right) \frac{1}{2\sqrt{y}}.$

Exercise. Assuming that $g$ is an increasing differentiable function with the inverse $h$ and $Y=g(X)$ answer questions similar to 1 and 2.

See the definition of $\chi _{1}^{2}.$ Applying (2) to a standard normal $X=z$ and to $Y=z^{2}=\chi _{1}^{2}$ we get

$f_{\chi _{1}^{2}}\left( y\right) =\frac{1}{\sqrt{2\pi }}e^{-y/2}\frac{1}{2 \sqrt{y}}+\frac{1}{\sqrt{2\pi }}e^{-y/2}\frac{1}{2\sqrt{y}}=\frac{1}{\sqrt{ 2\pi }}y^{1/2-1}e^{-y/2},\ y>0.$

Since $\Gamma \left( 1/2\right) =\sqrt{\pi },$ the procedure for identifying the gamma distribution gives

$f_{\chi _{1}^{2}}\left( x\right) =\frac{1}{\Gamma \left( 1/2\right) }\left( 1/2\right) ^{1/2}x^{1/2-1}e^{-x/2}=f_{1/2,1/2}\left( x\right) .$

We have derived the density of the chi-squared variable with one degree of freedom, see also Example 3.52, J. Abdey, Guide ST2133.
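Formula (1) can also be checked by simulation: for standard normal $X=z$ it gives $F_{Y}\left( y\right) =\Phi \left( \sqrt{y}\right) -\Phi \left( -\sqrt{y}\right) =\mathrm{erf}\left( \sqrt{y/2}\right) ,$ which should match the empirical frequencies of $z^{2}.$ A sketch (NumPy assumed):

```python
import math
import numpy as np

# For X = z standard normal, (1) gives F_Y(y) = Phi(sqrt(y)) - Phi(-sqrt(y))
# = erf(sqrt(y/2)). Compare with empirical frequencies of z^2.
rng = np.random.default_rng(3)
z2 = rng.standard_normal(1_000_000) ** 2

for y in [0.5, 1.0, 2.0, 4.0]:
    theoretical = math.erf(math.sqrt(y / 2))
    empirical = (z2 <= y).mean()
    print(y, round(theoretical, 4), round(empirical, 4))
```

With a million draws, the theoretical and empirical values should agree to about three decimal places.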

### General chi-squared

For $\chi _{n}^{2}=z_{1}^{2}+...+z_{n}^{2}$ with independent standard normals $z_{1},...,z_{n}$ we can write $\chi _{n}^{2}=\chi _{1}^{2}+...+\chi _{1}^{2}$ where the chi-squared variables on the right are independent and all have one degree of freedom. This is because deterministic (here quadratic) functions of independent variables are independent.

Recall that the gamma density is closed under convolutions with the same $\alpha .$ Then by the convolution theorem we get

$f_{\chi _n^2}=f_{\chi _1^2}\ast ... \ast f_{\chi_1^2}=f_{1/2,1/2} \ast ... \ast f_{1/2,1/2}$ $=f_{1/2,n/2}=\frac{1}{\Gamma \left( n/2\right) 2^{n/2}}x^{n/2-1}e^{-x/2}.$
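The gamma form implies $E\chi _{n}^{2}=n$ and $Var\left( \chi _{n}^{2}\right) =2n;$ a quick simulation of $\chi _{n}^{2}$ as a sum of squared independent standard normals agrees (NumPy assumed; $n=5$ is arbitrary):

```python
import numpy as np

# chi^2_n as a sum of n squared independent standard normals; the gamma form
# f_{1/2, n/2} has mean n and variance 2n, and the simulation agrees.
rng = np.random.default_rng(4)
n, reps = 5, 1_000_000
chi2 = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)
print(round(chi2.mean(), 2), round(chi2.var(), 2))  # approximately n = 5 and 2n = 10
```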
18 Oct 20

## People need real knowledge

### Traffic analysis

The number of visits to my website has exceeded 206,000. This number depends on what counts as a visit. An external counter, visible to everyone, writes cookies to the reader's computer and counts many visits from one reader as one. The number of individual readers has reached 23,000. The external counter does not give any more statistics. I will give all the numbers from the internal counter, which is visible only to the site owner.

I have a high percentage of complex content. After reading one post, the reader finds that the answer he is looking for depends on preliminary material. He starts digging into it and then has to go deeper and deeper. Hence the number 206,000: one reader visits the site on average 9 times, on different days. Sometimes a visitor goes from one post to another via a link on the same day. Hence another figure: 310,000 readings.

I originally wrote simple things about basic statistics. Then I began to write accompanying materials for each advanced course that I taught at Kazakh-British Technical University (KBTU). The shift in the number and level of readership shows that people need deep knowledge, not bait for one-day moths.

For example, my simple post on basic statistics was read 2,300 times. In comparison, the more complex post on the Cobb-Douglas function has been read 7,100 times. This function is widely used in economics to model consumer preferences (utility function) and producer capabilities (production function). In all textbooks it is taught using two-dimensional graphs, as P. Samuelson proposed 85 years ago. In fact, two-dimensional graphs are obtained by projection of a three-dimensional graph, which I show, making everything clear and obvious.

The answer to one of the University of London (UoL) exam problems attracted 14,300 readers. It is so complicated that I split the answer into two parts, and there are links to additional material. On the UoL exam, students have to solve this problem in 20-30 minutes, which even I would not be able to do.

### Why my site is unique

My site is unique in several ways. Firstly, I tell the truth about the AP Statistics books. This is a basic statistics course for those who need to interpret tables, graphs and simple statistics. If you have a head on your shoulders, and not a Google search engine, all you need to do is read a small book and look at the solutions. I praise one such book in my reviews. You don't need to attend a two-semester course and read an 800-page book. Moreover, one doesn't need 140 high-quality color photographs that have nothing to do with science and double the price of a book.

Many AP Statistics consumers (that's right, consumers, not students) believe that learning should be fun. Such people are attracted by a book with anecdotes that have no relation to statistics or the lives of scientists. In the West, everyone depends on each other, and therefore all reviews are written in superlatives and are streamlined. Thank God, I do not depend on the Western labor market, and therefore I tell the truth. Part of my criticism, including of the statistics textbook selected for the program "100 Textbooks" of the Ministry of Education and Science of Kazakhstan (MES), is on Facebook.

Secondly, I have the world's only online, free, complete matrix algebra tutorial with all the proofs. Free courses on Udemy, Coursera and edX are not far from AP Statistics in terms of level. Courses at MIT and Khan Academy are also simpler than mine, but have the advantage of being given in video format.

The third distinctive feature is that I help UoL students. It is a huge organization spanning 17 universities and colleges in the UK and with many branches in other parts of the world. The Economics program was developed by the London School of Economics (LSE), one of the world's leading universities.

The problem with LSE courses is that they are very difficult. After the exams, LSE puts out short recommendations on the Internet for solving problems like: here you need to use such and such a theory and such and such an idea. Complete solutions are not given for two reasons: they do not want to help future examinees and sometimes their problems or solutions contain errors (who does not make errors?). But they also delete short recommendations after a year. My site is the only place in the world where there are complete solutions to the most difficult problems of the last few years. It is not for nothing that the solution to one problem noted above attracted 14,000 visits.

Fourthly, my site is unique in terms of the variety of material: statistics, econometrics, algebra, optimization, and finance.

The average number of visits is about 100 per day. When it's time for students to take exams, it jumps to 1-2 thousand. The total amount of material created in 5 years is equivalent to 5 textbooks. It takes from 2 hours to one day to create one post, depending on the level. After I published this analysis of the site traffic on Facebook, my colleague Nurlan Abiev decided to write posts for the site. I pay for the domain myself, \$186 per year. It would be nice to make the site accessible to students and schoolchildren of Kazakhstan, but I don't have time to translate from English. Once I was looking at the requirements of the MES for approval of electronic textbooks. They want several copies of printouts of all (!) materials and a solid payment for the examination of the site. As a result, all my efforts to create and maintain the site have so far been a personal initiative without any support from the MES and its Committee on Science.

10 Dec 18

## Distributions derived from normal variables

### Useful facts about independence

In the one-dimensional case the economical way to define normal variables is this: define a standard normal variable and then define a general normal variable as its linear transformation. In the case of many dimensions we follow the same idea. Before doing that, we state without proof two useful facts about independence of random variables (real-valued, not vectors).

Theorem 1. Suppose variables $X_1,...,X_n$ have densities $p_1(x_1),...,p_n(x_n).$ Then they are independent if and only if their joint density $p(x_1,...,x_n)$ is a product of the individual densities: $p(x_1,...,x_n)=p_1(x_1)...p_n(x_n).$

Theorem 2. If variables $X,Y$ are normal, then they are independent if and only if they are uncorrelated: $cov(X,Y)=0.$ The necessity part (independence implies uncorrelatedness) is trivial.

### Normal vectors

Let $z_1,...,z_n$ be independent standard normal variables.
A standard normal variable is defined by its density, so all of the $z_i$ have the same density. We achieve independence, according to Theorem 1, by defining their joint density to be the product of the individual densities.

Definition 1. A standard normal vector of dimension $n$ is defined by

$z=\left(\begin{array}{c}z_1 \\ ... \\ z_n\end{array}\right) .$

Properties. $Ez=0$ because all of the $z_i$ have zero means. Further, $cov(z_i,z_j)=0$ for $i\neq j$ by Theorem 2, and the variance of a standard normal is 1. Therefore, from the expression for the variance of a vector we see that $Var(z)=I.$

Definition 2. For a matrix $A$ and a vector $\mu$ of compatible dimensions a normal vector is defined by $X=Az+\mu .$

Properties. $EX=AEz+\mu =\mu$ and

$Var(X)=Var(Az)=E(Az)(Az)^T=AEzz^TA^T=AIA^T=AA^T$

(recall that the variance matrix of a vector is always nonnegative definite).

### Distributions derived from normal variables

In the definitions of the standard distributions (chi-squared, t and F distributions) there is no reference to any sample data. Unlike statistics, which by definition are functions of sample data, these and other standard distributions are theoretical constructs. Statistics are developed in such a way as to have a distribution equal or asymptotically equal to one of the standard distributions. This allows practitioners to use tables developed for the standard distributions.

Exercise 1. Prove that $\chi_n^2/n$ converges to 1 in probability.

Proof. For a standard normal $z$ we have $Ez^2=1$ and $Var(z^2)=2$ (both properties can be verified in Mathematica). Hence, $E\chi_n^2/n=1$ and $Var(\chi_n^2/n)=\sum_iVar(z_i^2)/n^2=2/n\rightarrow 0.$ Now the statement follows from the simple form of the law of large numbers. Exercise 1 implies that for large $n$ the t distribution is close to a standard normal.
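Exercise 1 can be illustrated numerically: the sample variance of $\chi _{n}^{2}/n$ should be close to $2/n$ and shrink as $n$ grows, so $\chi _{n}^{2}/n$ concentrates around its mean 1 (a sketch, NumPy assumed):

```python
import numpy as np

# Sample mean and variance of chi^2_n / n for growing n: the mean stays near 1
# while the variance, close to 2/n, shrinks to zero.
rng = np.random.default_rng(5)
variances = {}
for n in [10, 100, 1000]:
    samples = (rng.standard_normal((10_000, n)) ** 2).sum(axis=1) / n
    variances[n] = samples.var()
    print(n, round(samples.mean(), 3), round(variances[n], 5))
```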
30 Nov 18

## Application: estimating sigma squared

Consider multiple regression

(1) $y=X\beta +e$

where (a) the regressors are assumed deterministic, (b) the number of regressors $k$ is smaller than the number of observations $n,$ (c) the regressors are linearly independent, $\det (X^TX)\neq 0,$ and (d) the errors are homoscedastic and uncorrelated,

(2) $Var(e)=\sigma^2I.$

Usually students remember that $\beta$ should be estimated and don't pay attention to estimation of $\sigma^2.$ Partly this is because $\sigma^2$ does not appear in the regression and partly because the result on estimation of the error variance is more complex than the result on the OLS estimator of $\beta .$

Definition 1. Let $\hat{\beta}=(X^TX)^{-1}X^Ty$ be the OLS estimator of $\beta .$ $\hat{y}=X\hat{\beta}$ is called the fitted value and $r=y-\hat{y}$ is called the residual.

Exercise 1. Using the projectors $P=X(X^TX)^{-1}X^T$ and $Q=I-P,$ show that $\hat{y}=Py$ and $r=Qe.$

Proof. The first equation is obvious. From the model we have $r=X\beta +e-P(X\beta +e).$ Since $PX\beta =X\beta ,$ we have further $r=e-Pe=Qe.$

Definition 2. The OLS estimator of $\sigma^2$ is defined by $s^2=\Vert r\Vert^2/(n-k).$

Exercise 2. Prove that $s^2$ is unbiased: $Es^2=\sigma^2.$

Proof. Using projector properties we have $\Vert r\Vert^2=(Qe)^TQe=e^TQ^TQe=e^TQe.$ Expectations of the type $Ee^Te$ and $Eee^T$ would be easy to find from (2). However, we need to find $Ee^TQe$ where there is an obstructing $Q.$ See how this difficulty is overcome in the next calculation.
$E\Vert r\Vert^2=Ee^TQe$ ($e^TQe$ is a scalar, so its trace is equal to itself)

$=Etr(e^TQe)$ (applying trace-commuting)

$=Etr(Qee^T)$ (the regressors and hence $Q$ are deterministic, so we can use linearity of $E$)

$=tr(QEee^T)$ (applying (2))

$=\sigma^2tr(Q).$

$tr(P)=k$ because this is the dimension of the image of $P.$ Therefore $tr(Q)=n-k.$ Thus, $E\Vert r\Vert^2=\sigma^2(n-k)$ and $Es^2=\sigma^2.$

18 Nov 18

## Application: Ordinary Least Squares estimator

### Generalized Pythagoras theorem

Exercise 1. Let $P$ be a projector and denote $Q=I-P.$ Then $\Vert x\Vert^2=\Vert Px\Vert^2+\Vert Qx\Vert^2.$

Proof. By the scalar product properties

$\Vert x\Vert^2=\Vert Px+Qx\Vert^2=\Vert Px\Vert^2+2(Px)\cdot (Qx)+\Vert Qx\Vert^2.$

$P$ is symmetric and idempotent, so

$(Px)\cdot (Qx)=(Px)\cdot \left[ (I-P)x\right] =x\cdot \left[ (P-P^2)x\right] =0.$

This proves the statement.

### Ordinary Least Squares (OLS) estimator derivation

Problem statement. A vector $y\in R^n$ (the dependent vector) and vectors $x^{(1)},...,x^{(k)}\in R^n$ (independent vectors or regressors) are given. The OLS estimator is defined as the vector $\beta \in R^k$ which minimizes the total sum of squares

$TSS=\sum_{i=1}^n(y_i-x_i^{(1)}\beta_1-...-x_i^{(k)}\beta_k)^2.$

Denoting $X=(x^{(1)},...,x^{(k)}),$ we see that $TSS=\Vert y-X\beta\Vert^2$ and that finding the OLS estimator means approximating $y$ with vectors from the image $\text{Img}X.$ The regressors $x^{(1)},...,x^{(k)}$ should be linearly independent, otherwise the solution will not be unique.

Assumption. $x^{(1)},...,x^{(k)}$ are linearly independent. This, in particular, implies that $k\leq n.$

Exercise 2. Show that the OLS estimator is

(2) $\hat{\beta}=(X^TX)^{-1}X^Ty.$

Proof.
By Exercise 1 we can use $P=X(X^TX)^{-1}X^T.$ Since $X\beta$ belongs to the image of $P,$ $P$ doesn't change it: $X\beta =PX\beta .$ Denoting also $Q=I-P$ we have

$\Vert y-X\beta\Vert^2=\Vert y-Py+Py-X\beta\Vert^2=\Vert Qy+P(y-X\beta )\Vert^2$ (by Exercise 1) $=\Vert Qy\Vert^2+\Vert P(y-X\beta )\Vert^2.$

This shows that $\Vert Qy\Vert^2$ is a lower bound for $\Vert y-X\beta\Vert^2.$ The lower bound is achieved when the second term is made zero. From

$P(y-X\beta )=Py-X\beta =X(X^TX)^{-1}X^Ty-X\beta =X\left[ (X^TX)^{-1}X^Ty-\beta \right]$

we see that the second term is zero if $\beta$ satisfies (2).

Usually the above derivation is applied to the dependent vector of the form $y=X\beta +e$ where $e$ is a random vector with mean zero. But it holds without this assumption. See also the simplified derivation of the OLS estimator.

6 Oct 17

## Significance level and power of test

In this post we discuss several interrelated concepts: null and alternative hypotheses, type I and type II errors and their probabilities. Review the definitions of a sample space and elementary events and that of a conditional probability.

### Type I and Type II errors

Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (like "the suspect is guilty") and the alternative hypothesis ("the suspect is innocent"). It's up to us what to call the null and what to call the alternative. However, the statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than the other probabilities involved. This is why what is desirable to prove is usually designated as the alternative. Usually in books you can see the following table.

| State of nature \ Decision taken | Fail to reject null | Reject null |
|---|---|---|
| Null is true | Correct decision | Type I error |
| Null is false | Type II error | Correct decision |

This table is not good enough because there is no link to probabilities. The next video fills in the blanks.

Video.
Significance level and power of test

The conclusion from the video is that (with $T$ = the null is true, $F$ = the null is false, $R$ = the null is rejected)

$\frac{P(T\cap R)}{P(T)}=P(R|T)=P(\text{Type I error})=\text{significance level},$

$\frac{P(F\cap R)}{P(F)}=P(R|F)=P(\text{Correctly rejecting false null})=\text{Power}.$

11 Aug 17

## Violations of classical assumptions 2

This will be a simple post explaining the common observation that "in Economics, variability of many variables is proportional to those variables". Make sure to review the assumptions; they tend to slip from memory. We consider the simple regression

(1) $y_i=a+bx_i+e_i.$

One of the classical assumptions is

Homoscedasticity. All errors have the same variance: $Var(e_i)=\sigma^2$ for all $i.$

We discuss its opposite, which is

Heteroscedasticity. Not all errors have the same variance.

It would be wrong to write it as $Var(e_i)\neq \sigma^2$ for all $i$ (which would mean that all errors have variance different from $\sigma^2$). You can write that not all $Var(e_i)$ are the same, but it's better to use the verbal definition.

Remark about Video 1. The dashed lines can represent mean consumption. Then the fact that variation of a variable grows with its level becomes more obvious.

Video 1. Case for heteroscedasticity

Figure 1. Illustration from Dougherty: as x increases, variance of the error term increases

Homoscedasticity was used in the derivation of the OLS estimator variance; under heteroscedasticity that expression is no longer valid. There are other implications, which will be discussed later.

Companies example. The Samsung Galaxy Note 7 battery fires and explosions that caused two recalls cost the smartphone maker at least \$5 billion. There is no way a small company could have such losses.

GDP example. The error in measuring US GDP is on the order of \$200 bln, which is comparable to the Kazakhstan GDP. However, the standard deviation of the ratio error/GDP seems to be about the same across countries, if the underground economy is not too big. Often the assumption that the standard deviation of the regression error is proportional to one of regressors is plausible.

To see if the regression error is heteroscedastic, you can look at the graph of the residuals or use statistical tests.
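Here is a minimal sketch of such a residual-based check on simulated data (NumPy assumed; all numbers are made up). It is a simplified version of the idea behind the Breusch-Pagan test: regress the squared residuals on the regressor and look at the auxiliary $R^2$ (a real test would compare a statistic based on this $R^2$ with a chi-squared critical value).

```python
import numpy as np

# Simulated data where the error standard deviation is proportional to the
# regressor (all numbers made up).
rng = np.random.default_rng(8)
n = 500
x = rng.uniform(1, 10, n)
e = x * rng.standard_normal(n)        # Var(e_i) = x_i^2: heteroscedasticity
y = 2 + 3 * x + e

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
res2 = (y - X @ beta_hat) ** 2        # squared OLS residuals

# Auxiliary regression of squared residuals on the regressor; a clearly
# positive slope and a nonnegligible R^2 point to heteroscedasticity.
g = np.linalg.lstsq(X, res2, rcond=None)[0]
fitted = X @ g
R2_aux = 1 - ((res2 - fitted) ** 2).sum() / ((res2 - res2.mean()) ** 2).sum()
print(round(g[1], 2), round(R2_aux, 3))
```

Under homoscedastic errors the auxiliary $R^2$ would be close to zero; here the simulated design makes it clearly positive.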