Today's talk: “Analysis of variance in the central limit theorem”
The talk is about results that combine methods from function theory, functional analysis, and probability theory. I will describe the intuition underlying the central limit theorem and highlight the history and the place of the author's results in the modern theory.
My presentation at Kazakh National University
Application: distribution of sigma squared estimator
For the formulation of multiple regression and the classical conditions on its elements see Application: estimating sigma squared. There we proved unbiasedness of the OLS estimator of $\sigma^2$. Here we do more: we characterize its distribution and obtain unbiasedness as a corollary.
Preliminaries
We need a summary of what we know about the residual $e$ and the projector $Q$:

(1) $e = Q\varepsilon$, where $Q = I - P$ and $P = X(X^TX)^{-1}X^T$.

$P$ has $k$ unities and $n-k$ zeros on the diagonal of its diagonal representation, where $k$ is the number of regressors. With $Q$ it's the opposite: it has $n-k$ unities and $k$ zeros on the diagonal of its diagonal representation. We can always assume that the unities come first, so in the diagonal representation

(2) $Q = U\Lambda U^T$

the matrix $U$ is orthogonal and $\Lambda$ can be written as

(3) $\Lambda = \begin{pmatrix} I_{n-k} & 0 \\ 0 & 0 \end{pmatrix},$

where $I_{n-k}$ is an identity matrix and the zeros are zero matrices of compatible dimensions.
Characterization of the distribution of $s^2$
Exercise 1. Suppose the error vector is normal: $\varepsilon \sim N(0, \sigma^2 I)$. Prove that the vector $z = U^T\varepsilon/\sigma$ is standard normal.

Proof. By the properties of orthogonal matrices $V(z) = \frac{1}{\sigma^2}U^TV(\varepsilon)U = U^TU = I.$ This, together with the equation $Ez = U^TE\varepsilon/\sigma = 0$, proves that $z$ is standard normal.
Exercise 2. Prove that $\|e\|^2/\sigma^2$ is distributed as $\chi^2_{n-k}$.

Proof. From (1) and (2) we have $\frac{\|e\|^2}{\sigma^2} = \frac{\varepsilon^TQ\varepsilon}{\sigma^2} = \left(\frac{U^T\varepsilon}{\sigma}\right)^T\Lambda\left(\frac{U^T\varepsilon}{\sigma}\right) = z^T\Lambda z.$ Now (3) shows that $z^T\Lambda z = \sum_{i=1}^{n-k}z_i^2$, which is the definition of $\chi^2_{n-k}$.
Exercise 3. Find the mean and variance of $s^2 = \|e\|^2/(n-k)$.

Solution. From Exercise 2 we obtain the result proved earlier in a different way: $Es^2 = \frac{\sigma^2}{n-k}E\chi^2_{n-k} = \frac{\sigma^2}{n-k}(n-k) = \sigma^2.$ Further, using the variance of a squared standard normal, $V(z_i^2) = Ez_i^4 - (Ez_i^2)^2 = 3 - 1 = 2$, we get $V(s^2) = \frac{\sigma^4}{(n-k)^2}V(\chi^2_{n-k}) = \frac{\sigma^4}{(n-k)^2}\,2(n-k) = \frac{2\sigma^4}{n-k}.$
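To make the characterization tangible, here is a minimal simulation sketch (the values of $n$, $k$, $\sigma$ and the randomly drawn regressors are my own choices, not from the post): it forms the residual $e = Q\varepsilon$ and checks that $\|e\|^2/\sigma^2$ has mean $n-k$ and variance $2(n-k)$, as a $\chi^2_{n-k}$ variable should.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 50, 3, 2.0
X = rng.standard_normal((n, k))          # regressors, kept fixed throughout
P = X @ np.linalg.inv(X.T @ X) @ X.T     # projector on the image of X
Q = np.eye(n) - P                        # Q = I - P, as in (1)

stats = np.empty(10_000)
for r in range(10_000):
    eps = sigma * rng.standard_normal(n) # normal errors with variance sigma^2
    e = Q @ eps                          # residual e = Q eps
    stats[r] = e @ e / sigma**2          # ||e||^2 / sigma^2

print(stats.mean(), n - k)               # both close to n - k = 47
print(stats.var(), 2 * (n - k))          # both close to 2(n - k) = 94
```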
Distributions derived from normal variables
Useful facts about independence
In the one-dimensional case the economical way to define normal variables is this: define a standard normal variable and then a general normal variable as its linear transformation.
In the case of many dimensions, we follow the same idea. Before doing that, we state without proof two useful facts about independence of random variables (real-valued, not vectors).
Theorem 1. Suppose the variables $X_1, \dots, X_n$ have densities $p_1, \dots, p_n$. Then they are independent if and only if their joint density $p$ is a product of individual densities: $p(t_1, \dots, t_n) = p_1(t_1)\cdots p_n(t_n).$

Theorem 2. If the variables $X_1, \dots, X_n$ are normal, then they are independent if and only if they are uncorrelated: $\operatorname{cov}(X_i, X_j) = 0$ for all $i \neq j$.
The necessity part (independence implies uncorrelatedness) is trivial.
Normal vectors
Let $z_1, \dots, z_n$ be independent standard normal variables. A standard normal variable is defined by its density $\frac{1}{\sqrt{2\pi}}e^{-t^2/2}$, so all of $z_1, \dots, z_n$ have the same density. We achieve independence, according to Theorem 1, by defining their joint density to be a product of individual densities.

Definition 1. A standard normal vector $z = (z_1, \dots, z_n)^T$ of dimension $n$ is defined by the joint density $p(t_1, \dots, t_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}e^{-t_i^2/2}.$
Properties. $Ez = 0$ because all of $z_1, \dots, z_n$ have means zero. Further, $\operatorname{cov}(z_i, z_j) = 0$ for $i \neq j$ by Theorem 2, and the variance of a standard normal is 1. Therefore, from the expression for the variance of a vector we see that $V(z) = I$.
Definition 2. For a matrix $A$ and vector $\mu$ of compatible dimensions a normal vector is defined by $X = Az + \mu.$

Properties. $EX = AEz + \mu = \mu$ and $V(X) = AV(z)A^T = AA^T$ (recall that the variance of a vector is always nonnegative).
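A quick simulation sketch (the matrix $A$ and vector $\mu$ below are assumed for illustration) shows both properties: the sample mean of $X = Az + \mu$ approaches $\mu$ and the sample variance matrix approaches $AA^T$.

```python
import numpy as np

rng = np.random.default_rng(8)
A = np.array([[1.0, 0.0], [0.5, 2.0]])   # assumed matrix A
mu = np.array([3.0, -1.0])               # assumed vector mu
z = rng.standard_normal((200_000, 2))    # draws of a standard normal vector
X = z @ A.T + mu                         # X = A z + mu, row by row
print(X.mean(axis=0))                    # close to mu
print(np.cov(X.T))                       # close to A @ A.T
print(A @ A.T)
```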
Distributions derived from normal variables
In the definitions of the standard distributions (chi-square, t distribution and F distribution) there is no reference to any sample data. Unlike statistics, which by definition are functions of sample data, these and other standard distributions are theoretical constructs. Statistics are developed in such a way as to have a distribution equal, or asymptotically equal, to one of the standard distributions. This allows practitioners to use tables developed for the standard distributions.
Exercise 1. Prove that $\chi^2_n/n$ converges to 1 in probability.

Proof. For a standard normal $z$ we have $Ez^2 = 1$ and $Ez^4 = 3$ (both properties can be verified in Mathematica). Hence, $E\frac{\chi^2_n}{n} = \frac{n}{n} = 1$ and $V\left(\frac{\chi^2_n}{n}\right) = \frac{n(Ez^4 - (Ez^2)^2)}{n^2} = \frac{2}{n} \to 0.$ Now the statement follows from the simple form of the law of large numbers.
Exercise 1 implies that for large $n$ the t distribution is close to a standard normal.
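A short simulation sketch (my own illustration, with arbitrary choices of $n$) shows both claims: the standard deviation of $\chi^2_n/n$ shrinks like $\sqrt{2/n}$, and the 97.5% quantile of $t_n = z/\sqrt{\chi^2_n/n}$ approaches the standard normal quantile 1.96.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 100_000
for n in (5, 50, 500):
    ratio = rng.chisquare(n, size=trials) / n   # chi^2_n / n
    t = rng.standard_normal(trials) / np.sqrt(rng.chisquare(n, size=trials) / n)
    print(n, ratio.std(), (2 / n) ** 0.5, np.quantile(t, 0.975))
# ratio.std() matches sqrt(2/n), and the t quantile approaches 1.96
```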
Application: estimating sigma squared
Consider the multiple regression

(1) $y = X\beta + \varepsilon,$

where (a) the regressors are assumed deterministic, (b) the number of regressors $k$ is smaller than the number of observations $n$, (c) the regressors are linearly independent, and (d) the errors are homoscedastic and uncorrelated:

(2) $V(\varepsilon) = \sigma^2 I.$
Usually students remember that $\beta$ should be estimated and don't pay attention to estimation of $\sigma^2$. Partly this is because $\sigma^2$ does not appear in the regression equation and partly because the result on estimation of the error variance is more complex than the result on the OLS estimator of $\beta$.
Definition 1. Let $\hat{\beta} = (X^TX)^{-1}X^Ty$ be the OLS estimator of $\beta$. Then $\hat{y} = X\hat{\beta}$ is called the fitted value and $e = y - \hat{y}$ is called the residual.
Exercise 1. Using the projectors $P = X(X^TX)^{-1}X^T$ and $Q = I - P$, show that $\hat{y} = Py$ and $e = Q\varepsilon$.

Proof. The first equation is obvious: $\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty = Py$. From the model we have $e = y - \hat{y} = Qy = Q(X\beta + \varepsilon)$. Since $QX = X - PX = 0$, we have further $e = Q\varepsilon$.
Definition 2. The OLS estimator of $\sigma^2$ is defined by $s^2 = \frac{\|e\|^2}{n-k}.$
Exercise 2. Prove that $s^2$ is unbiased: $Es^2 = \sigma^2$.
Proof. Using the projector properties we have $\|e\|^2 = (Q\varepsilon)^TQ\varepsilon = \varepsilon^TQ^TQ\varepsilon = \varepsilon^TQ\varepsilon.$

Expectations of the type $E\varepsilon_i^2$ and $E\varepsilon_i\varepsilon_j$ would be easy to find from (2). However, we need to find $E\varepsilon^TQ\varepsilon$, where there is an obstructing $Q$. See how this difficulty is overcome in the next calculation:

$E\varepsilon^TQ\varepsilon = E\operatorname{tr}(\varepsilon^TQ\varepsilon)$ ($\varepsilon^TQ\varepsilon$ is a scalar, so its trace is equal to itself)

$= E\operatorname{tr}(Q\varepsilon\varepsilon^T)$ (applying trace-commuting: $\operatorname{tr}(AB) = \operatorname{tr}(BA)$)

$= \operatorname{tr}(QE\varepsilon\varepsilon^T)$ (the regressors and hence $Q$ are deterministic, so we can use linearity of $E$)

$= \operatorname{tr}(Q\sigma^2I) = \sigma^2\operatorname{tr}(Q)$ (applying (2))

$= \sigma^2(n-k),$ because $n-k$ is the dimension of the image of $Q$.

Therefore $E\|e\|^2 = \sigma^2(n-k)$. Thus, $Es^2 = \frac{E\|e\|^2}{n-k} = \sigma^2$ and $s^2$ is unbiased.
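Here is a quick numerical sanity check (my own sketch, with an arbitrary matrix of regressors): it verifies $\operatorname{tr}(Q) = n - k$ and the trace-commuting property used in the proof above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 20, 4
X = rng.standard_normal((n, k))
P = X @ np.linalg.inv(X.T @ X) @ X.T     # projector on the image of X
Q = np.eye(n) - P
print(np.trace(Q))                       # equals n - k = 16 up to rounding

A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # True: tr(AB) = tr(BA)
```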
Application: Ordinary Least Squares estimator
Generalized Pythagoras theorem
Exercise 1. Let $P$ be a projector and denote $Q = I - P$. Then $\|y\|^2 = \|Py\|^2 + \|Qy\|^2$ for any $y$.

Proof. By the scalar product properties $\|y\|^2 = \|Py + Qy\|^2 = \|Py\|^2 + 2(Py)^T(Qy) + \|Qy\|^2.$ $P$ is symmetric and idempotent, so $(Py)^T(Qy) = y^TP(I - P)y = y^T(P - P^2)y = 0.$ This proves the statement.
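A numerical check of Exercise 1 (my own sketch; the projector is built from an arbitrary matrix):

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 10, 3
X = rng.standard_normal((n, k))
P = X @ np.linalg.inv(X.T @ X) @ X.T      # a symmetric idempotent projector
Q = np.eye(n) - P
y = rng.standard_normal(n)
print(y @ y, (P @ y) @ (P @ y) + (Q @ y) @ (Q @ y))  # the two numbers coincide
```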
Ordinary Least Squares (OLS) estimator derivation
Problem statement. A vector $y \in \mathbb{R}^n$ (the dependent vector) and vectors $x^{(1)}, \dots, x^{(k)} \in \mathbb{R}^n$ (independent vectors or regressors) are given. The OLS estimator is defined as that vector $\beta \in \mathbb{R}^k$ which minimizes the total sum of squares $\|y - \beta_1x^{(1)} - \dots - \beta_kx^{(k)}\|^2.$ Denoting $X = (x^{(1)}, \dots, x^{(k)})$, we see that $\beta_1x^{(1)} + \dots + \beta_kx^{(k)} = X\beta$ and that finding the OLS estimator means approximating $y$ with vectors from the image $\operatorname{Img}(X)$. The regressors should be linearly independent, otherwise the solution will not be unique.

Assumption. The regressors $x^{(1)}, \dots, x^{(k)}$ are linearly independent. This, in particular, implies that the matrix $X^TX$ is invertible.
Exercise 2. Show that the OLS estimator is

(2) $\hat{\beta} = (X^TX)^{-1}X^Ty.$
Proof. By Exercise 1 we can use the projectors $P = X(X^TX)^{-1}X^T$ and $Q = I - P$. Since $X\beta$ belongs to the image of $X$, $P$ doesn't change it: $PX\beta = X\beta$, and hence $QX\beta = 0$. Using also the decomposition $y = Py + Qy$, we have

$\|y - X\beta\|^2 = \|Q(y - X\beta)\|^2 + \|P(y - X\beta)\|^2$ (by Exercise 1)

$= \|Qy\|^2 + \|Py - X\beta\|^2 \ge \|Qy\|^2.$

This shows that $\|Qy\|^2$ is a lower bound for $\|y - X\beta\|^2$. This lower bound is achieved when the second term is made zero. From $Py = X(X^TX)^{-1}X^Ty$ we see that the second term is zero if $\beta$ satisfies (2).
Usually the above derivation is applied to a dependent vector of the form $y = X\beta + \varepsilon$, where $\varepsilon$ is a random vector with mean zero. But the derivation holds without this assumption. See also the simplified derivation of the OLS estimator.
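For completeness, a short computational illustration (the data below are simulated, my own choice): formula (2) agrees with a standard least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 3
X = rng.standard_normal((n, k))
beta = np.array([1.0, -2.0, 0.5])              # assumed true coefficients
y = X @ beta + 0.1 * rng.standard_normal(n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y    # formula (2)
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat, beta_lstsq)                    # the two estimates agree
```

The explicit inverse makes the theory visible; numerically, a solver such as `lstsq` is preferable in practice.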
Different faces of vector variance: again visualization helps
In the previous post we defined the variance of a column vector $X$ with $n$ components by $V(X) = E(X - EX)(X - EX)^T.$ In terms of elements this is the same as:

(1) $V(X) = \begin{pmatrix} V(X_1) & \operatorname{cov}(X_1, X_2) & \cdots & \operatorname{cov}(X_1, X_n) \\ \operatorname{cov}(X_2, X_1) & V(X_2) & \cdots & \operatorname{cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{cov}(X_n, X_1) & \operatorname{cov}(X_n, X_2) & \cdots & V(X_n) \end{pmatrix}.$
So why is knowing the structure of this matrix so important?
Let $X_1, \dots, X_n$ be random variables and let $a_1, \dots, a_n$ be numbers. In the derivation of the variance of the slope estimator for simple regression we have to deal with an expression of the type

(2) $V\left(\sum_{i=1}^n a_iX_i\right).$
Question 1. How do you multiply a sum by a sum? I mean, how do you use summation signs to find the product $\left(\sum_{i=1}^n a_i\right)\left(\sum_{j=1}^n b_j\right)$?

Answer 1. Whenever you have problems with summation signs, try to do without them. The product

$(a_1 + \dots + a_n)(b_1 + \dots + b_n)$

should contain ALL products $a_ib_j$. Again, a matrix visualization will help:

$\begin{pmatrix} a_1b_1 & a_1b_2 & \cdots & a_1b_n \\ a_2b_1 & a_2b_2 & \cdots & a_2b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_nb_1 & a_nb_2 & \cdots & a_nb_n \end{pmatrix}.$

The product we are looking for should contain all elements of this matrix. So the answer is

(3) $\left(\sum_{i=1}^n a_i\right)\left(\sum_{j=1}^n b_j\right) = \sum_{i,j=1}^n a_ib_j.$

Formally, we can write $\sum_{i=1}^n b_i = \sum_{j=1}^n b_j$ (the sum does not depend on the index of summation; this is another point many students don't understand) and then perform the multiplication in (3).
Question 2. What is the expression for (2) in terms of covariances of components?

Answer 2. If you understand Answer 1 and know the relationship between variances and covariances, it should be clear that

(4) $V\left(\sum_{i=1}^n a_iX_i\right) = \sum_{i,j=1}^n a_ia_j\operatorname{cov}(X_i, X_j).$
Question 3. In light of (1), separate variances from covariances in (4).

Answer 3. When $i = j$ we have $\operatorname{cov}(X_i, X_j) = V(X_i)$, which are the diagonal elements of (1). Otherwise, for $i \neq j$, we get the off-diagonal elements of (1). So the answer is

(5) $V\left(\sum_{i=1}^n a_iX_i\right) = \sum_{i=1}^n a_i^2V(X_i) + \sum_{i \neq j} a_ia_j\operatorname{cov}(X_i, X_j).$

Once again, in the first sum on the right we have only variances. In the second sum, the indices $i, j$ are assumed to run from 1 to $n$, excluding the diagonal $i = j$.
Corollary. If $X_1, \dots, X_n$ are uncorrelated, then the second sum in (5) disappears:

(6) $V\left(\sum_{i=1}^n a_iX_i\right) = \sum_{i=1}^n a_i^2V(X_i).$
This fact has been used (with a slightly different explanation) in the derivation of the variance of the slope estimator for simple regression.
Question 4. Note that the matrix (1) is symmetric (elements above the main diagonal equal their mirror siblings below that diagonal). This means that each term in the second sum on the right of (5) appears twice. If you group equal terms in (5), what do you get?
Answer 4. The idea is to write

$\sum_{i \neq j} a_ia_j\operatorname{cov}(X_i, X_j) = 2\sum_{i < j} a_ia_j\operatorname{cov}(X_i, X_j),$

that is, to join equal elements above and below the main diagonal in (1). For this, you need to figure out how to write the sum of the elements that are above the main diagonal. Make a bigger version of (1) (with more off-diagonal elements) to see that the elements above the main diagonal are listed in the sum $\sum_{i=1}^{n-1}\sum_{j=i+1}^n a_ia_j\operatorname{cov}(X_i, X_j)$. This sum can also be written as $\sum_{1 \le i < j \le n} a_ia_j\operatorname{cov}(X_i, X_j)$. Hence, (5) is the same as

(7) $V\left(\sum_{i=1}^n a_iX_i\right) = \sum_{i=1}^n a_i^2V(X_i) + 2\sum_{1 \le i < j \le n} a_ia_j\operatorname{cov}(X_i, X_j).$
Unlike (6), this equation is applicable when there is autocorrelation.
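A numerical sketch (the covariance matrix and weights below are assumed, not from the post) confirms (4) and (7): the simulated variance of $\sum a_iX_i$, the quadratic form $a^TV(X)a$, and the variances-plus-doubled-covariances decomposition all agree.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
a = np.array([1.0, 2.0, -1.0])            # assumed weights
L = rng.standard_normal((n, n))
V = L @ L.T                               # a valid covariance matrix
X = rng.multivariate_normal(np.zeros(n), V, size=200_000)

direct = (X @ a).var()                    # simulated variance of sum a_i X_i
quadratic = a @ V @ a                     # matrix form of (4)
diag = np.sum(a**2 * np.diag(V))          # first sum in (7)
off = 2 * sum(a[i] * a[j] * V[i, j] for i in range(n) for j in range(i + 1, n))
print(direct, quadratic, diag + off)      # all three approximately equal
```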
Variance of a vector: motivation and visualization
I always show my students the definition of the variance of a vector, and they usually don't pay attention. You need to know what it is, already at the level of simple regression (to understand the derivation of the slope estimator variance), and even more so when you deal with time series. Since I know exactly where students usually stumble, this post is structured as a series of questions and answers.
Think about ideas: how would you define variance of a vector?
Question 1. We know that for a random variable $X$, its variance is defined by

(1) $V(X) = E(X - EX)^2.$

Now let $X$ be a vector with $n$ components, each of which is a random variable. How would you define its variance?

The answer is not straightforward because we don't know how to square a vector. Let $X^T$ denote the transposed vector. There are two ways to multiply a vector by itself: $X^TX$ and $XX^T$.
Question 2. Find the dimensions of $X^TX$ and $XX^T$ and their expressions in terms of the coordinates of $X$.
Answer 2. For a product of matrices there is a compatibility rule that I write in the form

(2) $A_{n \times m}B_{m \times k} = C_{n \times k}.$

Recall that the notation $A_{n \times m}$ means that the matrix $A$ has $n$ rows and $m$ columns. For example, $X$ is of size $n \times 1$. Verbally, the above rule says that the number of columns of $A$ should be equal to the number of rows of $B$. In the product that common number $m$ disappears and the unique numbers ($n$ and $k$) give, respectively, the number of rows and columns of $C$. Isn't the formula $A_{n \times m}B_{m \times k} = C_{n \times k}$ easier to remember than the verbal statement? From (2) we see that $X^TX = X^T_{1 \times n}X_{n \times 1}$ is of dimension 1 (it is a scalar) and $XX^T = X_{n \times 1}X^T_{1 \times n}$ is an $n \times n$ matrix.
For actual multiplication of matrices I use the visualization

(3) $c_{ij} = (\text{$i$th row of } A) \cdot (\text{$j$th column of } B) = \sum_{l=1}^m a_{il}b_{lj}.$
Short formulation. Multiply rows from the first matrix by columns from the second one.
Long formulation. To find the element $c_{ij}$ of $C = AB$, we find the scalar product of the $i$th row of $A$ and the $j$th column of $B$. To find all elements in the $i$th row of $C$, we fix the $i$th row in $A$ and move right along the columns in $B$. Alternatively, to find all elements in the $j$th column of $C$, we fix the $j$th column in $B$ and move down along the rows in $A$. Using this rule, we have

(4) $X^TX = \sum_{i=1}^n X_i^2, \qquad XX^T = \begin{pmatrix} X_1^2 & X_1X_2 & \cdots & X_1X_n \\ X_2X_1 & X_2^2 & \cdots & X_2X_n \\ \vdots & \vdots & \ddots & \vdots \\ X_nX_1 & X_nX_2 & \cdots & X_n^2 \end{pmatrix}.$
Usually students have problems with the second equation.
Based on (1) and (4), we have two candidates to define variance:

(5) $V(X) = E(X - EX)^T(X - EX)$

and

(6) $V(X) = E(X - EX)(X - EX)^T.$
Answer 1. The second definition contains more information, in the sense to be explained below, so we define variance of a vector by (6).
Question 3. Find the elements of this matrix.
Answer 3. The variance of a vector has the variances of its components on the main diagonal and covariances outside it:

(7) $V(X) = \begin{pmatrix} V(X_1) & \operatorname{cov}(X_1, X_2) & \cdots & \operatorname{cov}(X_1, X_n) \\ \operatorname{cov}(X_2, X_1) & V(X_2) & \cdots & \operatorname{cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{cov}(X_n, X_1) & \operatorname{cov}(X_n, X_2) & \cdots & V(X_n) \end{pmatrix}.$
If you can't get this on your own, go back to Answer 2.
There is a matrix operation called trace and denoted $\operatorname{tr}$. It is defined only for square matrices and gives the sum of the diagonal elements of a matrix.
Exercise 1. Show that $\operatorname{tr}(V(X)) = E(X - EX)^T(X - EX)$. In this sense definition (6) is more informative than (5).
Exercise 2. Show that if $EX = 0$, then (7) becomes $V(X) = EXX^T$.
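An illustrative sketch (the covariance matrix below is assumed): it estimates the matrix definition (6) from simulated draws, computes the scalar candidate (5), and confirms the trace link of Exercise 1.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0, 0],
                            [[2.0, 0.5, 0.0],
                             [0.5, 1.0, 0.3],
                             [0.0, 0.3, 1.5]], size=500_000)
D = X - X.mean(axis=0)                    # demeaned draws
V6 = D.T @ D / len(X)                     # matrix definition (6), n x n
v5 = (D * D).sum(axis=1).mean()           # scalar definition (5)
print(V6)                                 # approximates the covariance matrix
print(v5, np.trace(V6))                   # Exercise 1: the scalar is tr V(X)
```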
Significance level and power of test
In this post we discuss several interrelated concepts: null and alternative hypotheses, type I and type II errors and their probabilities. Review the definitions of a sample space and elementary events and that of a conditional probability.
Type I and Type II errors
Regarding the true state of nature we assume two mutually exclusive possibilities: the null hypothesis (say, the suspect is guilty) and the alternative hypothesis (the suspect is innocent). It's up to us what to call the null and what to call the alternative. However, statistical procedures are not symmetric: it's easier to measure the probability of rejecting the null when it is true than the other probabilities involved. This is why what is desirable to prove is usually designated as the alternative.
Usually in books you can see the following table.
| State of nature \ Decision taken | Fail to reject null | Reject null |
|---|---|---|
| Null is true | Correct decision | Type I error |
| Null is false | Type II error | Correct decision |
This table is not good enough because there is no link to probabilities. The next video fills in the blanks.
Significance level and power of test
The conclusion from the video is that the significance level of a test is the probability of a type I error, $P(\text{reject the null} \mid \text{the null is true})$, and the power of the test is the probability of rejecting the null when it is false, $P(\text{reject the null} \mid \text{the null is false}) = 1 - P(\text{type II error}).$
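As a hedged illustration (the test and parameter values below are my own, not from the video), both probabilities can be estimated by simulation for a one-sided z-test of $H_0: \mu = 0$ against $H_1: \mu > 0$ with known unit variance:

```python
import numpy as np

rng = np.random.default_rng(6)
n, crit = 25, 1.645                        # 5% one-sided critical value

def rejection_rate(mu, trials=100_000):
    x = rng.normal(mu, 1.0, size=(trials, n))
    z = x.mean(axis=1) * np.sqrt(n)        # z-statistic when sigma = 1 is known
    return (z > crit).mean()

print(rejection_rate(0.0))                 # significance level: about 0.05
print(rejection_rate(0.5))                 # power against mu = 0.5: about 0.80
```

Under the null the rejection frequency estimates the significance level; under $\mu = 0.5$ it estimates the power.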
Violations of classical assumptions 2
This will be a simple post explaining the common observation that "in Economics, variability of many variables is proportional to those variables". Make sure to review the assumptions; they tend to slip from memory. We consider the simple regression
(1) $y_i = a + bx_i + e_i, \quad i = 1, \dots, n.$
One of the classical assumptions is

Homoscedasticity. All errors have the same variance: $V(e_i) = \sigma^2$ for all $i$.
We discuss its opposite, which is

Heteroscedasticity. Not all errors have the same variance. It would be wrong to write this as $V(e_i) \neq \sigma^2$ for all $i$ (which would mean that all errors have variance different from $\sigma^2$). You can write that not all $V(e_i)$ are the same, but it's better to use the verbal definition.
Remark about Video 1. The dashed lines can represent mean consumption. Then the fact that variation of a variable grows with its level becomes more obvious.

Figure 1. Illustration from Dougherty: as x increases, variance of the error term increases
Homoscedasticity was used in the derivation of the OLS estimator variance; under heteroscedasticity that expression is no longer valid. There are other implications, which will be discussed later.
Companies example. The Samsung Galaxy Note 7 battery fires and explosions that caused two recalls cost the smartphone maker at least $5 billion. There is no way a small company could have such losses.
GDP example. The error in measuring US GDP is on the order of $200 billion, which is comparable to the GDP of Kazakhstan. However, the standard deviation of the ratio error/GDP seems to be about the same across countries, provided the underground economy is not too big. The assumption that the standard deviation of the regression error is proportional to one of the regressors is often plausible.
To see if the regression error is heteroscedastic, you can look at the graph of the residuals or use statistical tests.
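A small sketch with simulated data (my own illustration, not Dougherty's) reproduces the pattern of Figure 1: when the error standard deviation is proportional to the regressor, the residual spread grows with $x$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = np.linspace(1, 10, n)
e = rng.standard_normal(n) * 0.5 * x       # error sd proportional to x
y = 2.0 + 3.0 * x + e

b, a = np.polyfit(x, y, 1)                 # OLS fit of y on x
resid = y - (a + b * x)
# residual spread in the lower vs upper half of the x range:
print(resid[:n // 2].std(), resid[n // 2:].std())
```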
Violations of classical assumptions 1
This is a large topic which requires several posts or several book chapters. During a conference in Sweden in 2010, a Swedish statistician asked me: "What is Econometrics, anyway? What tools does it use?" I said: "Among others, it uses linear regression." He said: "But linear regression is a general statistical tool, why do they say it's a part of Econometrics?" My answer was: "Yes, it's a general tool but the name Econometrics emphasizes that the motivation for its applications lies in Economics".
Both classical assumptions and their violations should be studied with this point in mind: What is the Economics and Math behind each assumption?
Violations of the first three assumptions
We consider the simple regression
(1) $y_i = a + bx_i + e_i, \quad i = 1, \dots, n.$
Make sure to review the assumptions. Their numbering and names sometimes differ from those in Dougherty's book. In particular, most of the time I omit the following assumption:
A6. The model is linear in parameters and correctly specified.
When it is not linear in parameters, you can think of nonlinear alternatives. Instead of saying "correctly specified" I say "true model" when a "wrong model" is available.
A1. What if the existence condition is violated? If the variance of the regressor is zero, the OLS estimator does not exist. The fitted line would have to be vertical, and you can regress $x$ on $y$ instead. Violation of the existence condition in the case of multiple regression leads to multicollinearity, and that's where economic considerations are important.
A2. The convenience condition is called that because, when it is violated, that is, when the regressor is stochastic, there are ways to deal with the problem: finite-sample theory and large-sample theory.
A3. What if the errors in (1) have means different from zero? This question can be divided in two: 1) the means of the errors are the same, $Ee_i = c$ for all $i$, and 2) the means are different. Read the post about centering and see if you can come up with the answer for the first question. The means may be different because of omission of a relevant variable (can you do the math?). In the absence of data on such a variable, there is nothing you can do.
Violations of A4 and A5 will be treated later.