Suppose we are observing two stocks and their respective returns are To take into account their interdependence, we consider a vector autoregression
Try to repeat for this system the analysis from Section 3.5 (Application to an AR(1) process) of the Guide by A. Patton and you will see that the difficulties are insurmountable. However, matrix algebra allows one to overcome them, with proper adjustment.
A) Write this system in a vector format
What should be in this representation?
B) Assume that the error in (1) satisfies
(3) for with some symmetric matrix
What does this assumption mean in terms of the components of from (2)? What is if the errors in (1) satisfy
(4) for for all
C) Suppose (1) is stationary. The stationarity condition is expressed in terms of eigenvalues of but we don't need it. However, we need its implication:
A) It takes some practice to see that with the notation
the system (1) becomes (2).
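A minimal sketch of what A) asks for, under assumed notation (none of the symbols below come from the original equations): the scalar system is taken to be $x_t=a_1+b_{11}x_{t-1}+b_{12}y_{t-1}+u_t$, $y_t=a_2+b_{21}x_{t-1}+b_{22}y_{t-1}+v_t$, and the vector format is taken to be $Y_t=A+BY_{t-1}+U_t$. The numbers are made up. The code only checks that one step of the scalar system and one step of the vector form give the same result.

```python
# Assumed (hypothetical) notation: Y_t = A + B Y_{t-1} + U_t with
# Y_t = (x_t, y_t)', A = (a1, a2)', B = [[b11, b12], [b21, b22]], U_t = (u_t, v_t)'.

def scalar_step(x_prev, y_prev, a, b, u, v):
    """One step of the two scalar equations, written out component by component."""
    x = a[0] + b[0][0] * x_prev + b[0][1] * y_prev + u
    y = a[1] + b[1][0] * x_prev + b[1][1] * y_prev + v
    return x, y

def vector_step(Y_prev, A, B, U):
    """The same step written as Y_t = A + B Y_{t-1} + U_t."""
    return [A[i] + sum(B[i][j] * Y_prev[j] for j in range(2)) + U[i]
            for i in range(2)]

# Made-up illustrative numbers.
A = [0.1, -0.2]
B = [[0.5, 0.1], [0.2, 0.3]]
Y_prev = [1.0, 2.0]
U = [0.05, -0.03]

scalar_result = scalar_step(Y_prev[0], Y_prev[1], A, B, U[0], U[1])
vector_result = vector_step(Y_prev, A, B, U)
```

The point of the vector form is that everything done for a scalar AR(1) can be repeated with $B$ in place of the scalar slope, as long as matrix multiplication order is respected.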
B) The equations in (3) look like this:
Equalities of matrices are understood element-wise, so we get a series of scalar equations for
Conversely, the scalar equations from (4) give
C) (2) implies or by stationarity or Hence (5) implies
D) From (2) we see that depends only on (information set at time ). Therefore by the LIE
This is the exam I administered in my class in Spring 2022. By replacing the Poisson distribution with other random variables the UoL examiners can obtain a large variety of versions with which to torture Advanced Statistics students. On the other hand, for the students the answers below can be a blueprint to fend off any assaults.
During the semester my students were encouraged to analyze and collect information in documents typed in Scientific Word or LyX. The exam was an open-book online assessment. Papers typed in Scientific Word or LyX were preferred and copying from previous analysis was welcomed. This policy would be my preference if I were to study a subject as complex as Advanced Statistics. The students were given just two hours on the assumption that they had done the preparations diligently. Below I give the model answers right after the questions.
Midterm Spring 2022
You have to clearly state all required theoretical facts. Number all equations that you need to use in later calculations and reference them as necessary. Answer the questions in the order they are asked. When you don't know the answer, leave some space. For each unexplained fact I subtract one point. Put your name in the file name.
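In questions 1-9, $X$ is the Poisson variable.
Define $X$ and derive the population mean and population variance of the sum $S_n=\sum_{i=1}^n X_i$, where $X_1,\dots,X_n$ is an i.i.d. sample from $X$.
Answer. $X$ is defined by $P(X=x)=e^{-\lambda}\frac{\lambda^x}{x!},\ x=0,1,2,\dots$ Using $EX=\lambda$ and $Var(X)=\lambda$ (ST2133 p.80) we have
$$ES_n=\sum_{i=1}^n EX_i=n\lambda,\qquad Var(S_n)=\sum_{i=1}^n Var(X_i)=n\lambda$$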
(by independence and identical distribution). [Some students derived instead of respective equations for sample means].
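As a numerical cross-check (a sketch, not part of the model answer), the pmf of a sum of $n$ independent Poisson($\lambda$) variables can be built by repeated discrete convolution, and its mean and variance compared with $n\lambda$. The values of `lam` and `n` and the truncation point are arbitrary choices for illustration.

```python
import math

def poisson_pmf(lam, kmax):
    """Poisson probabilities P(X = k) for k = 0..kmax."""
    return [math.exp(-lam) * lam**k / math.factorial(k) for k in range(kmax + 1)]

def convolve(p, q):
    """pmf of the sum of two independent variables with pmfs p and q."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r

lam, n = 1.5, 4
single = poisson_pmf(lam, 40)   # truncation at 40 leaves a negligible tail
total = single
for _ in range(n - 1):
    total = convolve(total, single)

mean = sum(k * p for k, p in enumerate(total))
var = sum((k - mean) ** 2 * p for k, p in enumerate(total))
```

Both `mean` and `var` should agree with `n * lam` up to truncation and rounding error.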
Derive the MGF of the standardized sample mean.
Answer. Knowing this derivation is a must because it is a combination of three important facts.
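a) Let $S_n=X_1+\dots+X_n$ and $\bar{X}=S_n/n.$ Then $\frac{\bar{X}-E\bar{X}}{\sigma(\bar{X})}=\frac{S_n-ES_n}{\sigma(S_n)},$ so standardizing $\bar{X}$ and $S_n$ gives the same result.
b) The MGF of $S_n$ is expressed through the MGF of $X$:
$$M_{S_n}(t)=Ee^{tS_n}=\prod_{i=1}^n Ee^{tX_i}\ \text{(independence)}\ =[M_X(t)]^n\ \text{(identical distribution)}.$$
c) If $Y=aX+b$ is a linear transformation of $X,$ then $M_Y(t)=Ee^{t(aX+b)}=e^{bt}M_X(at).$
When answering the question we assume any i.i.d. sample from a population with mean $\mu$ and population variance $\sigma^2$. Denote by $z_n$ the standardized sample mean; by a), $z_n=\frac{S_n-n\mu}{\sigma\sqrt{n}}.$
Putting $a=\frac{1}{\sigma\sqrt{n}},\ b=-\frac{\sqrt{n}\mu}{\sigma}$ in c) and using a) we get
$$M_{z_n}(t)=e^{-\sqrt{n}\mu t/\sigma}M_{S_n}\left(\frac{t}{\sigma\sqrt{n}}\right)=e^{-\sqrt{n}\mu t/\sigma}\left[M_X\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$$
(using b) with $t$ replaced by $t/(\sigma\sqrt{n})$).
This is a general result which for the Poisson distribution can be specified as follows. From ST2133, example 3.38 we know that $M_X(t)=\exp(\lambda(e^t-1))$. Therefore, with $\mu=\lambda$ and $\sigma=\sqrt{\lambda},$ we obtain
$$M_{z_n}(t)=\exp\left(-t\sqrt{n\lambda}+n\lambda\left(e^{t/\sqrt{n\lambda}}-1\right)\right).$$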
[Instead of some students gave .]
Derive the cumulant generating function of the standardized sample mean.
Answer. Again, there are a couple of useful general facts.
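I) Decomposition of MGF around zero. The series $e^x=\sum_{j=0}^\infty \frac{x^j}{j!}$ leads to
$$M_X(t)=Ee^{tX}=\sum_{j=0}^\infty \frac{\mu_j}{j!}t^j,$$
where $\mu_j=EX^j$ are moments of $X$ and $\mu_0=1.$ Differentiating this equation $k$ times yields
$$M_X^{(k)}(t)=\sum_{j=k}^\infty \frac{\mu_j}{(j-k)!}t^{j-k},$$
and setting $t=0$ gives the rule for finding moments from MGF: $\mu_k=M_X^{(k)}(0).$
II) Decomposition of the cumulant generating function around zero. $K_X(t)=\log M_X(t)$ can also be decomposed into its Taylor series:
$$K_X(t)=\sum_{j=0}^\infty \frac{\kappa_j}{j!}t^j,$$
where the coefficients $\kappa_j$ are called cumulants and can be found using $\kappa_j=K_X^{(j)}(0)$. Since
$$K_X'(t)=\frac{M_X'(t)}{M_X(t)},\qquad K_X''(t)=\frac{M_X''(t)M_X(t)-[M_X'(t)]^2}{[M_X(t)]^2},$$
we have $\kappa_0=\log M_X(0)=0,\ \kappa_1=\mu_1=\mu,\ \kappa_2=\mu_2-\mu_1^2=\sigma^2.$
Thus, for any random variable $X$ with mean $\mu$ and variance $\sigma^2$ we have
$$K_X(t)=\mu t+\frac{\sigma^2t^2}{2}+\text{terms of higher order for } t \text{ small.}$$
III) If $Y=aX+b,$ then by c) $K_Y(t)=\log\left(e^{bt}M_X(at)\right)=bt+K_X(at).$
IV) By b), for the sum $S_n=X_1+\dots+X_n$ of an i.i.d. sample, $K_{S_n}(t)=\log[M_X(t)]^n=nK_X(t).$
Using III) for the standardized sample mean $z_n=\frac{S_n-n\mu}{\sigma\sqrt n}$, and then IV), we have
$$K_{z_n}(t)=-\frac{\sqrt n\mu}{\sigma}t+K_{S_n}\left(\frac{t}{\sigma\sqrt n}\right)=-\frac{\sqrt n\mu}{\sigma}t+nK_X\left(\frac{t}{\sigma\sqrt n}\right).$$
For the last term on the right we use the approximation around zero from II):
$$K_{z_n}(t)=-\frac{\sqrt n\mu}{\sigma}t+n\left[\frac{\mu t}{\sigma\sqrt n}+\frac{\sigma^2}{2}\frac{t^2}{\sigma^2 n}+O\left(\frac{t^3}{n^{3/2}}\right)\right]=\frac{t^2}{2}+O\left(\frac{t^3}{\sqrt n}\right).$$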
[Important. Why are the above steps necessary? Passing from the series for $M_X$ to the series for $K_X$ is not straightforward and can easily lead to errors. It is not advisable in case of the Poisson to derive $K_{z_n}$ from $M_{z_n}$.]
Prove the central limit theorem using the cumulant generating function you obtained.
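Answer. In the previous question we proved that around zero the cumulant generating function of the standardized sample mean $z_n$ satisfies
$$K_{z_n}(t)=\frac{t^2}{2}+O\left(\frac{t^3}{\sqrt n}\right).$$
This implies that
(1) $K_{z_n}(t)\to \frac{t^2}{2}$ for each $t$ around zero.
But we know that for a standard normal $z$ its MGF is $M_z(t)=e^{t^2/2}$ (ST2133 example 3.42) and hence for the standard normal
(2) $K_z(t)=\frac{t^2}{2}.$
Theorem (link between pointwise convergence of MGFs of $X_n$ and convergence in distribution of $X_n$). Let $\{X_n\}$ be a sequence of random variables and let $X$ be some random variable. If $M_{X_n}(t)$ converges for each $t$ from a neighborhood of zero to $M_X(t)$, then $X_n$ converges in distribution to $X.$
By (1), (2) and continuity of the exponential, $M_{z_n}(t)=e^{K_{z_n}(t)}\to e^{K_z(t)}=M_z(t)$ for each $t$ around zero. Using this theorem we finish the proof that $z_n$ converges in distribution to the standard normal, which is the central limit theorem.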
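The convergence can also be seen numerically. The sketch below assumes the Poisson MGF $M_X(t)=\exp(\lambda(e^t-1))$, for which the cumulant generating function of the standardized sample mean is $K_{z_n}(t)=-t\sqrt{n\lambda}+n\lambda(e^{t/\sqrt{n\lambda}}-1)$. It evaluates the gap to $t^2/2$ for growing $n$. The values of `t` and `lam` are arbitrary.

```python
import math

def cgf_standardized_mean(t, lam, n):
    """K_{z_n}(t) for an i.i.d. Poisson(lam) sample:
    -t*sqrt(n*lam) + n*lam*(exp(t/sqrt(n*lam)) - 1)."""
    s = math.sqrt(n * lam)
    # expm1(x) = exp(x) - 1, computed accurately for small x
    return -t * s + n * lam * math.expm1(t / s)

t, lam = 0.7, 2.0
sample_sizes = (10, 1000, 100000)
errors = [abs(cgf_standardized_mean(t, lam, n) - t**2 / 2) for n in sample_sizes]
```

The entries of `errors` shrink roughly like $1/\sqrt{n}$, in line with the leading remainder term $t^3/(6\sqrt{n\lambda})$.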
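State the factorization theorem and apply it to show that $T=\sum_{i=1}^n X_i$ is a sufficient statistic.
Answer. The solution is given on p.180 of ST2134. For $x_i=0,1,2,\dots,\ i=1,\dots,n,$ the joint density is
$$f_X(x,\lambda)=\prod_{i=1}^n e^{-\lambda}\frac{\lambda^{x_i}}{x_i!}=e^{-n\lambda}\lambda^{\sum x_i}\frac{1}{\prod x_i!}.$$
This is a factorization $g(T(x),\lambda)k(x)$ with $g(T,\lambda)=e^{-n\lambda}\lambda^T$ and $k(x)=1/\prod_{i=1}^n x_i!,$ so $T$ is sufficient. Moreover, for two samples $x,y$
$$\frac{f_X(x,\lambda)}{f_X(y,\lambda)}=\lambda^{\sum x_i-\sum y_i}\frac{\prod y_i!}{\prod x_i!}.$$
The expression on the right does not depend on $\lambda$ if and only if $\sum x_i=\sum y_i.$ The last condition describes level sets of $T.$ Thus it is minimal sufficient.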
Find the Method of Moments estimator of the population mean.
Answer. The idea of the method is to take some population property (for example, $EX=\lambda$) and replace the population characteristic (in this case $EX$) by its sample analog ($\bar{X}$) to obtain a MM estimator. In our case $\hat{\lambda}_{MM}=\bar{X}.$ [Try to do this for the Gamma distribution].
Find the Fisher information.
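Answer. From Problem 5 the log-likelihood is
$$l(\lambda)=-n\lambda+\left(\sum_{i=1}^n x_i\right)\log\lambda-\sum_{i=1}^n\log(x_i!).$$
Hence the score function is (see Example 2.30 in ST2134)
$$s(\lambda)=\frac{\partial l}{\partial\lambda}=-n+\frac{1}{\lambda}\sum_{i=1}^n x_i$$
and the Fisher information is
$$I(\lambda)=Var(s(\lambda))=\frac{1}{\lambda^2}Var\left(\sum_{i=1}^n X_i\right)=\frac{n\lambda}{\lambda^2}=\frac{n}{\lambda}.$$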
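Derive the Cramer-Rao lower bound for $Var(\bar{X})$ for a random sample.
Answer. (See Example 3.17 in ST2134) Since $\bar{X}$ is an unbiased estimator of $\lambda$ by Problem 1, from the Cramer-Rao theorem we know that
$$Var(\bar{X})\ge\frac{1}{I(\lambda)}=\frac{\lambda}{n},$$
where $I(\lambda)=n/\lambda$ is the Fisher information of the sample,
and in fact by Problem 1 this lower bound is attained, because $Var(\bar{X})=Var(X)/n=\lambda/n.$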
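A numerical sanity check of the last two answers (a sketch under the standard Poisson facts, with arbitrary `lam` and `n`): the Fisher information of one observation is computed as the expected squared score, where the score of one Poisson observation is $x/\lambda-1$, and then scaled by $n$. Its reciprocal is compared with $Var(\bar{X})=\lambda/n$.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def fisher_information_one_obs(lam, kmax=100):
    """E[(d/d lam log p(X; lam))^2] for one Poisson observation.
    The score of one observation is x/lam - 1; the sum is truncated at kmax."""
    return sum((k / lam - 1.0) ** 2 * poisson_pmf(k, lam) for k in range(kmax + 1))

lam, n = 2.5, 20
info_n = n * fisher_information_one_obs(lam)   # information adds up over an i.i.d. sample
cramer_rao_bound = 1.0 / info_n
var_sample_mean = lam / n                      # Var(X-bar) = Var(X)/n
```

Here `info_n` should be close to `n / lam`, and `cramer_rao_bound` should coincide with `var_sample_mean`, which is what "the bound is attained" means.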
Distribution of the estimator of the error variance
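If you are reading the book by Dougherty: this post is about the distribution of the estimator $s^2$ defined in Chapter 3.
Consider the regression
(1) $y=X\beta+e$
where the deterministic matrix $X$ is of size $n\times k,$ satisfies $\det X^TX\ne 0$ (regressors are not collinear) and the error $e$ satisfies
(2) $Ee=0,\ Var(e)=\sigma^2I.$
$\beta$ is estimated by $\hat{\beta}=(X^TX)^{-1}X^Ty.$ Denote $P=X(X^TX)^{-1}X^T,\ Q=I-P.$ Using (1) we see that $\hat{\beta}=\beta+(X^TX)^{-1}X^Te$ and the residual is $r=y-X\hat{\beta}=Qe.$ $\sigma^2$ is estimated by
(3) $s^2=\|r\|^2/(n-k)=\|Qe\|^2/(n-k).$
$Q$ is a projector and has properties which are derived from those of $P$:
(4) $Q^T=Q,\ Q^2=Q.$
If $\lambda$ is an eigenvalue of $Q$ with eigenvector $x\ne 0,$ then multiplying $Qx=\lambda x$ by $Q$ and using the fact that $Q^2=Q$ we get $\lambda x=\lambda^2x.$ Hence eigenvalues of $Q$ can be only $0$ or $1.$ The equation
$$tr(Q)=n-tr(P)=n-k$$
tells us that the number of eigenvalues equal to 1 is $n-k$ and the remaining $k$ are zeros. Let $Q=U\Lambda U^T$ be the diagonal representation of $Q.$ Here $U$ is an orthogonal matrix,
(5) $U^TU=I,$
and $\Lambda$ is a diagonal matrix with eigenvalues of $Q$ on the main diagonal. We can assume that the first $n-k$ numbers on the diagonal of $\Lambda$ are ones and the others are zeros.
Theorem. Let $e$ be normal. 1) $s^2(n-k)/\sigma^2$ is distributed as $\chi^2_{n-k}.$ 2) The estimators $\hat{\beta}$ and $s^2$ are independent.
Proof. 1) We have by (4)
(6) $\|Qe\|^2=(Qe)^TQe=e^TQ^TQe=e^TQe=e^TU\Lambda U^Te=(U^Te)^T\Lambda U^Te.$
Denote $S=U^Te.$ From (2) and (5)
$$ES=0,\ Var(S)=EU^Tee^TU=\sigma^2U^TU=\sigma^2I$$
and $S$ is normal as a linear transformation of a normal vector. It follows that $S=\sigma z$ where $z$ is a standard normal vector with independent standard normal coordinates $z_1,\dots,z_n.$ Hence, (6) implies
(7) $\|Qe\|^2=\sigma^2z^T\Lambda z=\sigma^2(z_1^2+\dots+z_{n-k}^2)=\sigma^2\chi^2_{n-k}.$
(3) and (7) prove the first statement.
2) First we note that the vectors $Pe,Qe$ are independent. Since they are normal, their independence follows from
$$Cov(Pe,Qe)=EPee^TQ^T=\sigma^2PQ=0.$$
It's easy to see that $X^TP=X^T.$ This allows us to show that $\hat{\beta}$ is a function of $Pe$:
$$\hat{\beta}=\beta+(X^TX)^{-1}X^Te=\beta+(X^TX)^{-1}X^TPe.$$
Independence of $Pe,Qe$ leads to independence of their functions $\hat{\beta}$ and $s^2.$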
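The algebraic facts used in the proof can be checked on a small example. The sketch below uses the standard residual-maker matrix $Q=I-X(X^TX)^{-1}X^T$ for a hypothetical design matrix with a constant and one regressor ($n=5,\ k=2$). It verifies numerically that $Q$ is symmetric, idempotent, and has trace $n-k$. Plain Python lists are used to keep the example dependency-free.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def inverse_2x2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical design matrix: a constant and one regressor, n = 5, k = 2.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 5.0]]
n, k = len(X), len(X[0])
Xt = transpose(X)
P = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)
Q = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(n)] for i in range(n)]

QQ = matmul(Q, Q)
trace_Q = sum(Q[i][i] for i in range(n))
max_idempotency_error = max(abs(QQ[i][j] - Q[i][j]) for i in range(n) for j in range(n))
max_symmetry_error = max(abs(Q[i][j] - Q[j][i]) for i in range(n) for j in range(n))
```

Since the eigenvalues of an idempotent matrix are zeros and ones, the trace equal to $n-k=3$ counts the unit eigenvalues, which is the number of degrees of freedom in the theorem.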
Let $f_X(x,\theta)$ be the joint density of the vector $X=(X_1,\dots,X_n)$, where $\theta$ is a parameter (possibly a vector). The ML estimator is obtained by maximizing over $\theta$ the function $f_X(x,\theta)$ with $x$ fixed at the observed data. The estimator depends on the data and can be denoted $\hat{\theta}_{ML}(x).$
Fisher-Neyman theorem. $T=T(X)$ is sufficient for $\theta$ if and only if the joint density can be represented as
(1) $f_X(x,\theta)=g(T(x),\theta)k(x)$
where, as the notation suggests, $g$ depends on $x$ only through $T(x)$ and $k$ does not depend on $\theta.$
Maximizing the left side of (1) is the same thing as maximizing $g(T(x),\theta)$ because $k$ does not depend on $\theta.$ But this means that $\hat{\theta}_{ML}(x)$ depends on $x$ only through $T(x).$ A sufficient statistic is all you need to find the ML estimator. This interpretation is easier to understand than the definition of sufficiency.
Minimal sufficient statistic
Definition 2. A sufficient statistic $T$ is called minimal sufficient if for any other sufficient statistic $S$ there exists a function $g$ such that $T=g(S).$
A level set is a set of type $\{x:T(x)=c\}$ for a constant $c$ (which in general can be a constant vector). See the visualization of level sets. A level set is also called a preimage and denoted $T^{-1}(c)=\{x:T(x)=c\}.$ When $T$ is one-to-one the preimage contains just one point. When $T$ is not one-to-one the preimage contains more than one point. The wider it is, the less information about the sample the statistic carries (because many data sets are mapped to a single point and you cannot tell one data set from another by looking at the statistic value). In the definition of the minimal sufficient statistic we have
$$\{x:T(x)=c\}=\{x:g(S(x))=c\}=\{x:S(x)\in g^{-1}(c)\}.$$
Since $g^{-1}(c)$ generally contains more than one point, this shows that the level sets of $T$ are generally wider than those of $S.$ Since this is true for any $S,$ $T$ carries less information about $X$ than any other sufficient statistic.
Definition 2 is an existence statement and is difficult to verify directly as there are words "for any" and "exists". Again it's better to relate it to ML estimation.
Suppose for two sets of data $x,y$ there is a positive number $k(x,y)$ such that
(2) $f_X(x,\theta)=k(x,y)f_X(y,\theta)$ for all $\theta.$
Maximizing the left side we get the estimator $\hat{\theta}_{ML}(x).$ Maximizing $f_X(y,\theta)$ we get $\hat{\theta}_{ML}(y).$ Since $k(x,y)$ does not depend on $\theta,$ (2) tells us that
$$\hat{\theta}_{ML}(x)=\hat{\theta}_{ML}(y).$$
Thus, if two sets of data $x,y$ satisfy (2), the ML method cannot distinguish between $x$ and $y$ and supplies the same estimator. Let us call $x,y$ indistinguishable if there is a positive number $k(x,y)$ such that (2) is true.
An equation $T(x)=T(y)$ means that $x,y$ belong to the same level set.
Characterization of minimal sufficiency. A statistic $T$ is minimal sufficient if and only if its level sets coincide with sets of indistinguishable $x,y.$
The advantage of this formulation is that it relates a geometric notion of level sets to the ML estimator properties. The formulation in the guide by J. Abdey is:
A statistic $T$ is minimal sufficient if and only if the equality $T(x)=T(y)$ is equivalent to (2).
Rewriting (2) as
(3) $\frac{f_X(x,\theta)}{f_X(y,\theta)}=k(x,y)$
we get a practical way of finding a minimal sufficient statistic: form the ratio on the left of (3) and find the sets along which the ratio does not depend on $\theta.$ Those sets will be level sets of $T.$
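The ratio recipe can be tried on the Poisson case. This is only a numerical illustration on a few parameter values, not a proof, and the samples are made up. For an i.i.d. Poisson sample the ratio of joint pmfs at two data sets is constant in $\lambda$ exactly when the two samples have the same sum, so the level sets of the sum are the sets of indistinguishable samples.

```python
import math

def poisson_joint(sample, lam):
    """Joint pmf of an i.i.d. Poisson(lam) sample."""
    return math.prod(math.exp(-lam) * lam**x / math.factorial(x) for x in sample)

def ratio_is_constant(x, y, lams=(0.5, 1.0, 2.0, 4.0)):
    """True if f(x; lam) / f(y; lam) takes the same value at every lam tried.
    A finite grid of lam values is an illustration, not a proof."""
    ratios = [poisson_joint(x, lam) / poisson_joint(y, lam) for lam in lams]
    return all(math.isclose(r, ratios[0], rel_tol=1e-9) for r in ratios)

same_sum = ratio_is_constant([1, 2, 3], [0, 0, 6])       # both samples sum to 6
different_sum = ratio_is_constant([1, 2, 3], [1, 1, 1])  # sums are 6 and 3
```

With equal sums the powers of $\lambda$ cancel in the ratio; with different sums a factor $\lambda^3$ survives, so the ratio changes with $\lambda$.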
We have derived the density of the chi-squared variable with one degree of freedom, see also Example 3.52, J. Abdey, Guide ST2133.
For $\chi^2_n=z_1^2+\dots+z_n^2$ with independent standard normals $z_1,\dots,z_n$ we can write $\chi^2_n=\chi^2_1+\dots+\chi^2_1,$ where the chi-squared variables on the right are independent and all have one degree of freedom. This is because deterministic (here quadratic) functions of independent variables are independent.
Definition. The gamma distribution is a two-parametric family of densities. For $\alpha>0,\ \nu>0$ the density is defined by
$$f_{\alpha,\nu}(x)=\frac{1}{\Gamma(\nu)}\alpha^\nu x^{\nu-1}e^{-\alpha x},\ x>0;\qquad f_{\alpha,\nu}(x)=0,\ x\le 0.$$
Obviously, you need to know what a gamma function is. My notation of the parameters follows Feller, W. An Introduction to Probability Theory and its Applications, Volume II, 2nd edition (1971). It is different from the one used by J. Abdey in his guide ST2133.
It is really a density because
$$\int_0^\infty\frac{1}{\Gamma(\nu)}\alpha^\nu x^{\nu-1}e^{-\alpha x}dx=\ (\text{replace }\alpha x=t)\ =\frac{1}{\Gamma(\nu)}\int_0^\infty t^{\nu-1}e^{-t}dt=1.$$
Suppose you see an expression $x^ae^{-bx}$ and need to determine which gamma density this is. The power of the exponent gives you $\alpha=b$ and the power of $x$ gives you $\nu=a+1.$ It follows that the normalizing constant should be $\frac{1}{\Gamma(a+1)}b^{a+1}$ and the density is $\frac{1}{\Gamma(a+1)}b^{a+1}x^ae^{-bx},\ x>0.$
The most important property is that the family of gamma densities with the same $\alpha$ is closed under convolutions: $f_{\alpha,\mu}*f_{\alpha,\nu}=f_{\alpha,\mu+\nu}.$ Because of the associativity property $f*(g*h)=(f*g)*h$ it is enough to prove this for the case of two gamma densities.
Alternative proof. The moment generating function of a sum of two independent gamma distributions with the same $\alpha$ shows that this sum is again a gamma distribution with the same $\alpha$, see pp. 141, 209 in the guide ST2133.
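The closure property can be checked numerically at a point. The sketch below uses Feller's parametrization $f_{\alpha,\nu}(x)=\alpha^\nu x^{\nu-1}e^{-\alpha x}/\Gamma(\nu)$ for $x>0$ and a simple midpoint rule. The parameter values and the number of steps are arbitrary choices. It compares the convolution of $f_{\alpha,\mu}$ and $f_{\alpha,\nu}$ at $z$ with $f_{\alpha,\mu+\nu}(z)$.

```python
import math

def gamma_density(x, alpha, nu):
    """f_{alpha,nu}(x) = alpha^nu x^(nu-1) exp(-alpha x) / Gamma(nu) for x > 0, else 0."""
    if x <= 0:
        return 0.0
    return alpha**nu * x ** (nu - 1) * math.exp(-alpha * x) / math.gamma(nu)

def convolution_at(z, alpha, mu, nu, steps=20000):
    """Midpoint-rule approximation of int_0^z f_{alpha,mu}(x) f_{alpha,nu}(z - x) dx."""
    h = z / steps
    return h * sum(gamma_density((i + 0.5) * h, alpha, mu) *
                   gamma_density(z - (i + 0.5) * h, alpha, nu)
                   for i in range(steps))

alpha, mu, nu, z = 1.5, 2.0, 3.0, 2.0
numeric = convolution_at(z, alpha, mu, nu)
exact = gamma_density(z, alpha, mu + nu)
```

The integration runs over $(0,z)$ only because both densities vanish on the negative half-axis.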
The gamma function and gamma distribution are two different things. This post is about the former and is a preparatory step to study the latter.
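Definition. The gamma function is defined by
$$\Gamma(t)=\int_0^\infty x^{t-1}e^{-x}dx,\quad t>0.$$
The integrand is smooth on $(0,\infty),$ so its integrability is determined by its behavior at $0$ and $\infty$. Because of the exponent, it is integrable in the neighborhood of $\infty.$ The singularity at $0$ is integrable if $t>0.$ In all calculations involving the gamma function one should remember that its argument should be positive.
1) Factorial-like property. Integration by parts shows that
$$\Gamma(t+1)=\int_0^\infty x^te^{-x}dx=-x^te^{-x}\Big|_0^\infty+t\int_0^\infty x^{t-1}e^{-x}dx=t\Gamma(t).$$
2) $\Gamma(1)=\int_0^\infty e^{-x}dx=1.$
3) Combining the first two properties we see that for a natural $n$
$$\Gamma(n+1)=n\Gamma(n)=\dots=n(n-1)\cdots 1\cdot\Gamma(1)=n!$$
Thus the gamma function extends the factorial to non-integer $t>0.$
4) $\Gamma(1/2)=\sqrt{\pi}.$ Indeed, using the density of the standard normal we see that
$$\Gamma(1/2)=\int_0^\infty x^{-1/2}e^{-x}dx=\ (\text{replace }x=z^2/2)\ =\sqrt{2}\int_0^\infty e^{-z^2/2}dz=\sqrt{2}\cdot\frac{\sqrt{2\pi}}{2}=\sqrt{\pi}.$$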
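These properties are easy to spot-check with Python's standard `math.gamma`. The sketch below tests the recursion $\Gamma(t+1)=t\Gamma(t)$ at an arbitrary non-integer point, the factorial identity for small natural $n$, and the well-known value $\Gamma(1/2)=\sqrt{\pi}$.

```python
import math

# Factorial-like property Gamma(t + 1) = t * Gamma(t) at a non-integer point.
t = 2.7
recursion_gap = abs(math.gamma(t + 1) - t * math.gamma(t))

# Gamma(n + 1) = n! for natural n.
factorial_gaps = [abs(math.gamma(n + 1) - math.factorial(n)) for n in range(1, 10)]

# Gamma(1/2) = sqrt(pi).
half_gap = abs(math.gamma(0.5) - math.sqrt(math.pi))
```

All three gaps should be at the level of floating-point rounding.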
These problems are among the most difficult. It's important to work out a general approach to such problems. All references are to J. Abdey, Advanced statistics: distribution theory, ST2133, University of London, 2021.
Step 1. Conditioning is usually suggested by the problem statement: $Y$ is conditioned on $X$.
Your life will be easier if you follow the notation used in the guide: use $p$ for probability mass functions (discrete variables) and $f$ for (probability) density functions (continuous variables).
a) If $X$ and $Y$ both are discrete (Example 5.1, Example 5.13, Example 5.18): $p_Y(y)=\sum_x p_{Y|X}(y|x)p_X(x).$
b) If $X$ and $Y$ both are continuous (Activity 5.6): $f_Y(y)=\int f_{Y|X}(y|x)f_X(x)dx.$
c) If $X$ is discrete, $Y$ is continuous (Example 5.2, Activity 5.5): $f_Y(y)=\sum_x f_{Y|X}(y|x)p_X(x).$
d) If $X$ is continuous, $Y$ is discrete (Activity 5.12): $p_Y(y)=\int p_{Y|X}(y|x)f_X(x)dx.$
In all cases you need to figure out over which $x$ to sum or integrate.
Step 2. Write out the conditional densities/probabilities with the same arguments as in your conditional equation.
Step 3. Reduce the result to one of known distributions using the completeness axiom.
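Let $X$ denote the number of hurricanes which form in a given year, and let $Y$ denote the number of these which make landfall. Suppose each hurricane has a probability $\pi$ of making landfall independent of other hurricanes. Given the number of hurricanes $x$, then $Y$ can be thought of as the number of successes in $x$ independent and identically distributed Bernoulli trials. We can write this as $Y|X=x\sim Bin(x,\pi)$. Suppose we also have that $X\sim Pois(\lambda)$. Find the distribution of $Y$ (noting that $X\ge Y$).
Step 1. The number of hurricanes $X$ takes values $0,1,2,\dots$ and is distributed as Poisson. The number of landfalls for a given $X=x$ is binomial with values $y=0,\dots,x$. It follows that $x\ge y$.
Write the general formula for conditional probability:
$$p_Y(y)=\sum_{x=y}^\infty p_{Y|X}(y|x)p_X(x).$$
Step 2. Specifying the distributions:
$$p_X(x)=e^{-\lambda}\frac{\lambda^x}{x!},\ x=0,1,2,\dots;\qquad p_{Y|X}(y|x)=\binom{x}{y}\pi^y(1-\pi)^{x-y},\ y=0,\dots,x.$$
Step 3. Reduce the result to one of known distributions:
$$p_Y(y)=\sum_{x=y}^\infty\frac{x!}{y!(x-y)!}\pi^y(1-\pi)^{x-y}e^{-\lambda}\frac{\lambda^x}{x!}$$
(pull out of summation everything that does not depend on the summation variable $x$)
$$=\frac{e^{-\lambda}\pi^y}{y!}\sum_{x=y}^\infty\frac{(1-\pi)^{x-y}\lambda^x}{(x-y)!}$$
(replace $x-y=z$ to better see the structure)
$$=\frac{e^{-\lambda}(\lambda\pi)^y}{y!}\sum_{z=0}^\infty\frac{[\lambda(1-\pi)]^z}{z!}$$
(using the completeness axiom for the Poisson variable)
$$=\frac{e^{-\lambda}(\lambda\pi)^y}{y!}e^{\lambda(1-\pi)}=e^{-\lambda\pi}\frac{(\lambda\pi)^y}{y!},\ y=0,1,2,\dots,$$
so $Y\sim Pois(\lambda\pi).$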
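The three steps can be mirrored numerically. The sketch below takes hypothetical values for the Poisson rate and the landfall probability. It sums the binomial conditional pmf against the Poisson pmf over the number of hurricanes (truncated at a large value) and compares the result with the Poisson pmf whose rate is the product of the two parameters.

```python
import math

def binom_pmf(y, x, p):
    """P(Y = y | X = x) for Y | X = x ~ Bin(x, p)."""
    return math.comb(x, y) * p**y * (1 - p) ** (x - y)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def marginal_landfalls(y, lam, p, xmax=150):
    """p_Y(y) = sum over x >= y of p_{Y|X}(y|x) p_X(x), truncated at xmax."""
    return sum(binom_pmf(y, x, p) * poisson_pmf(x, lam) for x in range(y, xmax + 1))

lam, p = 6.0, 0.3   # hypothetical Poisson rate and landfall probability
gaps = [abs(marginal_landfalls(y, lam, p) - poisson_pmf(y, lam * p)) for y in range(10)]
```

The summation starts at `x = y` because there cannot be more landfalls than hurricanes, which is the "figure out over which values to sum" part of Step 1.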
Why do we need this link? For simplicity consider the rectangle $R=\{(x,y):a\le x\le b,\ c\le y\le d\}.$ The integrals
$$I_1=\iint_R f(x,y)\,dydx\quad\text{and}\quad I_2=\int_a^b\left(\int_c^d f(x,y)dy\right)dx$$
both are taken over the rectangle $R$ but they are not the same. $I_1$ is a double (two-dimensional) integral, meaning that its definition uses elementary areas, while $I_2$ is an iterated integral, where each of the one-dimensional integrals uses elementary segments. To make sense of this, you need to consult an advanced text in calculus. The difference notwithstanding, in good cases their values are the same. Putting aside the question of what is a "good case", we concentrate on geometry: how a double integral can be expressed as an iterated integral.
It is enough to understand the idea in case of an oval $A$ on the plane. Let $y=l(x)$ be the function that describes the lower boundary of the oval and let $y=u(x)$ be the function that describes the upper part. Further, let the vertical lines $x=m$ and $x=M$ correspond to the minimum and maximum values of $x$ in the oval (see Chart 1).
Chart 1. The boundary of the oval above the green line is described by u(x) and below it by l(x)
We can paint the oval with strokes along red lines from $l(x)$ to $u(x).$ If we do this for all $x\in[m,M],$ we'll have painted the whole oval. This corresponds to the representation of $A$ as the union of segments $\{(x,y):l(x)\le y\le u(x)\}$ with $x\in[m,M]$:
$$A=\bigcup_{m\le x\le M}\{(x,y):l(x)\le y\le u(x)\}$$
and to the equality of integrals
$$\iint_A f(x,y)\,dydx\ \text{(double integral)}\ =\int_m^M\left(\int_{l(x)}^{u(x)}f(x,y)dy\right)dx\ \text{(iterated integral)}.$$
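The painting idea translates directly into a nested loop: for each $x$ between the extreme values, integrate along the vertical stroke from the lower to the upper boundary. In the sketch below the oval is assumed, purely for illustration, to be the unit disk, so the boundaries are $\pm\sqrt{1-x^2}$. The midpoint rule and the number of steps are arbitrary choices.

```python
import math

def iterated_integral(f, lower, upper, x_min, x_max, steps=400):
    """Midpoint-rule version of int_{x_min}^{x_max} ( int_{l(x)}^{u(x)} f(x, y) dy ) dx."""
    hx = (x_max - x_min) / steps
    total = 0.0
    for i in range(steps):
        x = x_min + (i + 0.5) * hx          # position of the vertical stroke
        lo, hi = lower(x), upper(x)          # the stroke runs from l(x) to u(x)
        hy = (hi - lo) / steps
        inner = hy * sum(f(x, lo + (j + 0.5) * hy) for j in range(steps))
        total += hx * inner
    return total

# Hypothetical oval: the unit disk, l(x) = -sqrt(1 - x^2), u(x) = sqrt(1 - x^2).
upper = lambda x: math.sqrt(1.0 - x * x)
lower = lambda x: -math.sqrt(1.0 - x * x)

area = iterated_integral(lambda x, y: 1.0, lower, upper, -1.0, 1.0)
second_moment = iterated_integral(lambda x, y: x * x + y * y, lower, upper, -1.0, 1.0)
```

For the unit disk the two values should be close to $\pi$ and $\pi/2$ respectively; the accuracy is limited by the square-root behavior of the boundary near $x=\pm1$.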
Density of a sum of two variables
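Assumption 1 Suppose the random vector $(X,Y)$ has a density $f_{X,Y}$ and define $Z=X+Y$ (unlike the convolution theorem below, here $X,Y$ don't have to be independent).
From the definitions of the distribution function and probability
$$F_Z(z)=P(X+Y\le z)=\iint_{x+y\le z}f_{X,Y}(x,y)\,dxdy.$$
The integral on the right is a double integral. The painting analogy (see Chart 2)
Chart 2. Integration for sum of two variables
suggests that
$$F_Z(z)=\int_{-\infty}^\infty\left(\int_{-\infty}^{z-x}f_{X,Y}(x,y)dy\right)dx.$$
Differentiating both sides with respect to $z$ we get
$$f_Z(z)=\int_{-\infty}^\infty f_{X,Y}(x,z-x)dx.$$
If we start with the inner integral that is with respect to $x$ and the outer integral with respect to $y,$ then similarly
$$f_Z(z)=\int_{-\infty}^\infty f_{X,Y}(z-y,y)dy.$$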
Exercise. Suppose the random vector has a density and define Find Hint: review my post on Leibniz integral rule.
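In addition to Assumption 1, let $X,Y$ be independent. Then $f_{X,Y}(x,y)=f_X(x)f_Y(y)$ and the above formula gives
$$f_Z(z)=\int_{-\infty}^\infty f_X(x)f_Y(z-x)dx.$$
This is denoted as $(f_X*f_Y)(z)$ and called a convolution.
The following may help to understand this formula. The function $g(x)=f_Y(-x)$ is a density (it is non-negative and integrates to 1). Its graph is a mirror image of that of $f_Y$ with respect to the vertical axis. The function $f_Y(z-x)=g(x-z)$ is a shift of $g$ by $z$ along the horizontal axis. For fixed $z,$ it is also a density. Thus in the definition of convolution we integrate the product of two densities $f_X(x)f_Y(z-x).$ Further, to understand the asymptotic behavior of $(f_X*f_Y)(z)$ when $|z|\to\infty,$ imagine two bell-shaped densities $f_X(x)$ and $f_Y(z-x).$ When $z$ goes to, say, infinity, the humps of those densities are spread apart more and more. The hump of one of them gets multiplied by small values of the other. That's why $(f_X*f_Y)(z)$ goes to zero, in a certain sense.
The convolution of two densities is always a density because it is non-negative and integrates to one:
$$\int_{-\infty}^\infty(f_X*f_Y)(z)dz=\int_{-\infty}^\infty\left(\int_{-\infty}^\infty f_X(x)f_Y(z-x)dx\right)dz=\int_{-\infty}^\infty f_X(x)\left(\int_{-\infty}^\infty f_Y(z-x)dz\right)dx.$$
Replacing $z-x=y$ in the inner integral we see that this is
$$\int_{-\infty}^\infty f_X(x)dx\int_{-\infty}^\infty f_Y(y)dy=1.$$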
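A small numerical illustration of the convolution formula, using an example that is not in the text above: two independent uniform variables on $(0,1)$. Their sum is known to have the triangular density on $(0,2)$. The sketch approximates the convolution integral by a midpoint rule on a grid of points and also checks that the result integrates to one. Grid sizes are arbitrary.

```python
def uniform_density(x):
    """Density of the uniform distribution on (0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def convolution(f, g, z, a=0.0, b=1.0, steps=2000):
    """Midpoint approximation of (f*g)(z) = int f(x) g(z - x) dx, with f supported on (a, b)."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) * g(z - (a + (i + 0.5) * h)) for i in range(steps))

def triangular_density(z):
    """Known density of the sum of two independent uniform(0, 1) variables."""
    if 0.0 <= z <= 1.0:
        return z
    if 1.0 < z <= 2.0:
        return 2.0 - z
    return 0.0

grid = [0.005 + 0.01 * i for i in range(200)]          # midpoints of a grid on (0, 2)
values = [convolution(uniform_density, uniform_density, z) for z in grid]
max_gap = max(abs(v - triangular_density(z)) for v, z in zip(values, grid))
total_mass = 0.01 * sum(values)
```

The product inside the integral is nonzero only where the two unit intervals, one of them mirrored and shifted by $z$, overlap, which is the picture described for bell-shaped densities in the paragraph on shifting and mirroring.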