5
May 22

## Vector autoregressions: preliminaries

Suppose we are observing two stocks and their respective returns are $x_{t},y_{t}.$ A vector autoregression for the pair $x_{t},y_{t}$ is one way to take into account their interdependence. This theory is undeservedly omitted from the Guide by A. Patton.

### Required minimum in matrix algebra

Matrix notation and summation are very simple.

Matrix multiplication is a little more complex. Make sure to read Global idea 2 and the compatibility rule.

The general approach to studying matrices is to compare them to numbers. Here you see the first big No: matrices do not commute, that is, in general $AB\neq BA.$

The idea behind matrix inversion is pretty simple: we want an analog of the property $a\times \frac{1}{a}=1$ that holds for numbers.

Some facts about determinants have very complicated proofs and it is best to stay away from them. But a couple of ideas should be clear from the very beginning. Determinants are defined only for square matrices. The relationship of determinants to matrix invertibility explains the role of determinants. If $A$ is square, it is invertible if and only if $\det A\neq 0$ (this is an equivalent of the condition $a\neq 0$ for numbers).

Here is an illustration of how determinants are used. Suppose we need to solve the equation $AX=Y$ for $X,$ where $A$ and $Y$ are known. Assuming that $\det A\neq 0$ we can premultiply the equation by $A^{-1}$ to obtain $A^{-1}AX=A^{-1}Y.$ (Because of lack of commutativity, we need to keep the order of the factors). Using intuitive properties $A^{-1}A=I$ and $IX=X$ we obtain the solution: $X=A^{-1}Y.$ In particular, we see that if $\det A\neq 0,$ then the equation $AX=0$ has a unique solution $X=0.$
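The steps above can be sketched numerically. This is a minimal illustration, assuming a $2\times 2$ matrix $A$ and a vector $Y$ with made-up numbers; the helper names `det2`, `inv2` and `matvec` are hypothetical and written only for this example.

```python
# Solve A X = Y for a 2x2 matrix A and a 2-vector Y via X = A^{-1} Y.
# The numbers are illustrative assumptions, not taken from the text.

def det2(a):
    """Determinant of a 2x2 matrix given as nested lists."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def inv2(a):
    """Inverse of a 2x2 matrix; it exists only when det A != 0."""
    d = det2(a)
    if d == 0:
        raise ValueError("matrix is not invertible (det = 0)")
    return [[a[1][1] / d, -a[0][1] / d],
            [-a[1][0] / d, a[0][0] / d]]

def matvec(a, v):
    """Product of a 2x2 matrix and a 2-vector (order of factors matters)."""
    return [a[0][0] * v[0] + a[0][1] * v[1],
            a[1][0] * v[0] + a[1][1] * v[1]]

A = [[2.0, 1.0], [1.0, 3.0]]
Y = [5.0, 10.0]

X = matvec(inv2(A), Y)   # premultiply by the inverse: X = A^{-1} Y
check = matvec(A, X)     # A X should reproduce Y
print(X, check)
```

Here $\det A=5\neq 0$, so the inverse exists and `check` reproduces `Y` up to rounding.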

Let $A$ be a square matrix and let $X,Y$ be two square matrices of the same size as $A$. $A,Y$ are assumed to be known and $X$ is unknown. We want to check that $X=\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}$ solves the equation $X-AXA^{T}=Y$, assuming that the series converges. (Note that for this equation the trick used to solve $AX=Y$ does not work.) Just plug in $X:$

$\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}-A\sum_{s=0}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}A^{T}$ $=Y+\sum_{s=1}^{\infty }A^{s}Y\left(A^{T}\right) ^{s}-\sum_{s=1}^{\infty }A^{s}Y\left( A^{T}\right) ^{s}=Y$

(write out a couple of first terms in the sums if summation signs frighten you).
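The same check can be done numerically with a truncated series. The sketch below assumes $2\times 2$ matrices with made-up entries, chosen so that the eigenvalues of $A$ are less than one in absolute value (an assumption that makes the series converge); the helper functions are hypothetical.

```python
# Check numerically that X = sum_s A^s Y (A^T)^s solves X - A X A^T = Y.
# Assumption: the eigenvalues of A (0.5 and 0.3 here) lie inside the unit circle.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

A = [[0.5, 0.1], [0.0, 0.3]]
Y = [[1.0, 0.2], [0.2, 2.0]]
At = transpose(A)

X = [[0.0, 0.0], [0.0, 0.0]]
term = Y                                   # term for s = 0 is Y itself
for _ in range(200):                       # truncate the infinite sum
    X = add(X, term)
    term = matmul(matmul(A, term), At)     # next term A^{s+1} Y (A^T)^{s+1}

residual = sub(sub(X, matmul(matmul(A, X), At)), Y)
max_abs = max(abs(v) for row in residual for v in row)
print(max_abs)                             # close to zero
```

The residual $X-AXA^{T}-Y$ is zero up to rounding, which mirrors the cancellation of the two sums in the derivation.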

Transposition is a geometrically simple operation. We need only the property $\left( AB\right) ^{T}=B^{T}A^{T}.$

### Variance and covariance

Property 1. Variance of a random vector $X$ and covariance of two random vectors $X,Y$ are defined by

$V\left( X\right) =E\left( X-EX\right) \left( X-EX\right) ^{T},$ $Cov\left( X,Y\right) =E\left( X-EX\right) \left( Y-EY\right) ^{T},$

respectively.

Note that when $EX=0,$ variance becomes

$V\left( X\right) =EXX^{T}=\left( \begin{array}{ccc}EX_{1}^{2} & ... & EX_{1}X_{n} \\ ... & ... & ... \\ EX_{1}X_{n} & ... & EX_{n}^{2}\end{array}\right) .$

Property 2. Let $X,Y$ be random vectors and suppose $A,B$ are constant matrices. We want an analog of $V\left( aX+bY\right) =a^{2}V\left( X\right) +2abcov\left( X,Y\right) +b^{2}V\left( Y\right) .$ In the next calculation we have to remember that the multiplication order cannot be changed.

$V\left( AX+BY\right) =E\left[ AX+BY-E\left( AX+BY\right) \right] \left[ AX+BY-E\left( AX+BY\right) \right] ^{T}$

$=E\left[ A\left( X-EX\right) +B\left( Y-EY\right) \right] \left[ A\left( X-EX\right) +B\left( Y-EY\right) \right] ^{T}$

$=E\left[ A\left( X-EX\right) \right] \left[ A\left( X-EX\right) \right] ^{T}+E\left[ B\left( Y-EY\right) \right] \left[ A\left( X-EX\right) \right] ^{T}$

$+E\left[ A\left( X-EX\right) \right] \left[ B\left( Y-EY\right) \right] ^{T}+E\left[ B\left( Y-EY\right) \right] \left[ B\left( Y-EY\right) \right] ^{T}$

(applying $\left( AB\right) ^{T}=B^{T}A^{T}$)

$=AE\left( X-EX\right) \left( X-EX\right) ^{T}A^{T}+BE\left( Y-EY\right) \left( X-EX\right) ^{T}A^{T}$

$+AE\left( X-EX\right) \left( Y-EY\right) ^{T}B^{T}+BE\left( Y-EY\right) \left( Y-EY\right) ^{T}B^{T}$

$=AV\left( X\right) A^{T}+BCov\left( Y,X\right) A^{T}+ACov(X,Y)B^{T}+BV\left( Y\right) B^{T}.$
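Because the derivation uses only linearity of means and the transposition rule, the same identity holds exactly for empirical (sample) moments. The sketch below checks it on simulated data; the matrices `A`, `B`, the way `ys` is generated and all helper names are assumptions made for illustration.

```python
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]

def cov(us, vs):
    """Empirical Cov(U, V): average of (u - ubar)(v - vbar)^T over paired draws."""
    m = len(us)
    ubar = [sum(u[i] for u in us) / m for i in range(len(us[0]))]
    vbar = [sum(v[j] for v in vs) / m for j in range(len(vs[0]))]
    return [[sum((u[i] - ubar[i]) * (v[j] - vbar[j]) for u, v in zip(us, vs)) / m
             for j in range(len(vbar))] for i in range(len(ubar))]

def add(*ms):
    return [[sum(m[i][j] for m in ms) for j in range(len(ms[0][0]))]
            for i in range(len(ms[0]))]

random.seed(1)
xs = [[random.gauss(0, 1), random.gauss(0, 2)] for _ in range(500)]
ys = [[x[0] + random.gauss(0, 1), random.gauss(0, 1)] for x in xs]  # correlated with xs
A = [[1.0, 2.0], [0.0, 1.0]]
B = [[0.5, 0.0], [1.0, -1.0]]

zs = [[p + q for p, q in zip(matvec(A, x), matvec(B, y))] for x, y in zip(xs, ys)]
lhs = cov(zs, zs)                                    # V(AX + BY)
At, Bt = transpose(A), transpose(B)
rhs = add(matmul(matmul(A, cov(xs, xs)), At),        # A V(X) A^T
          matmul(matmul(B, cov(ys, xs)), At),        # B Cov(Y,X) A^T
          matmul(matmul(A, cov(xs, ys)), Bt),        # A Cov(X,Y) B^T
          matmul(matmul(B, cov(ys, ys)), Bt))        # B V(Y) B^T
gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(gap)                                           # zero up to rounding
```

Note that the two cross terms are kept separate: $BCov\left( Y,X\right) A^{T}$ and $ACov(X,Y)B^{T}$ are transposes of each other and in general cannot be merged into a single term.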

22
Mar 22

## Blueprint for exam versions

This is the exam I administered in my class in Spring 2022. By replacing the Poisson distribution with other random variables the UoL examiners can obtain a large variety of versions with which to torture Advanced Statistics students. On the other hand, for the students the answers below can be a blueprint to fend off any assaults.

During the semester my students were encouraged to analyze and collect information in documents typed in Scientific Word or LyX. The exam was an open-book online assessment. Papers typed in Scientific Word or LyX were preferred and copying from previous analysis was welcomed. This policy would be my preference if I were to study a subject as complex as Advanced Statistics. The students were given just two hours on the assumption that they had done the preparations diligently. Below I give the model answers right after the questions.

## Midterm Spring 2022

You have to clearly state all required theoretical facts. Number all equations that you need to use in later calculations and reference them as necessary. Answer the questions in the order they are asked. When you don't know the answer, leave some space. For each unexplained fact I subtract one point. Put your name in the file name.

In questions 1-9 $X$ is the Poisson variable.

### Question 1

Define $X$ and derive the population mean and population variance of the sum $S_{n}=\sum_{i=1}^{n}X_{i}$ where $X_{i}$ is an i.i.d. sample from $X$.

Answer. $X$ is defined by $P\left( X=x\right) =e^{-\lambda }\frac{\lambda ^{x}}{x!},\ x=0,1,...$ Using $EX=\lambda$ and $Var\left( X\right) =\lambda$ (ST2133 p.80) we have

$ES_{n}=\sum EX_{i}=n\lambda ,$ $Var\left( S_{n}\right) =\sum V\left( X_{i}\right) =n\lambda$

(by independence and identical distribution). [Some students derived $EX=\lambda ,$ $Var\left( X\right) =\lambda$ instead of the respective equations for the sum $S_{n}$].
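The two facts used in the answer can be checked directly from the Poisson probabilities. This is a minimal sketch, assuming $\lambda =2.5$, $n=10$ and a truncation of the support at 100 terms (all three are illustrative choices).

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = e^{-lam} lam^x / x! for x = 0, 1, ..."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam, n = 2.5, 10
support = range(0, 100)          # truncation; the neglected tail is negligible here

mean_x = sum(x * poisson_pmf(x, lam) for x in support)
var_x = sum((x - mean_x) ** 2 * poisson_pmf(x, lam) for x in support)

mean_sum = n * mean_x            # E S_n = n * lambda
var_sum = n * var_x              # Var(S_n) = n * lambda, using independence
print(mean_x, var_x, mean_sum, var_sum)
```

Both `mean_x` and `var_x` come out equal to $\lambda$ up to rounding, so the mean and variance of the sum are both $n\lambda$.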

### Question 2

Derive the MGF of the standardized sample mean.

Answer. Knowing this derivation is a must because it is a combination of three important facts.

a) Let $z_{n}=\frac{\bar{X}-E\bar{X}}{\sigma \left( \bar{X}\right) }.$ Since $\bar{X}=S_{n}/n,$ we have $z_{n}=\frac{S_{n}/n-E\left( S_{n}/n\right) }{\sigma \left( S_{n}/n\right) }=\frac{S_{n}-ES_{n} }{\sigma \left( S_{n}\right) },$ so standardizing $\bar{X}$ and $S_{n}$ gives the same result.

b) The MGF of $S_{n}$ is expressed through the MGF of $X$:

$M_{S_{n}}\left( t\right) =Ee^{S_{n}t}=Ee^{X_{1}t+...+X_{n}t}=Ee^{X_{1}t}...e^{X_{n}t}=$

(independence) $=Ee^{X_{1}t}...Ee^{X_{n}t}=$ (identical distribution) $=\left[ M_{X}\left( t\right) \right] ^{n}.$

c) If $X$ is a linear transformation of $Y,$ $X=a+bY,$ then

$M_{X}\left( t\right) =Ee^{Xt}=Ee^{\left( a+bY\right) t}=e^{at}Ee^{Y\left( bt\right) }=e^{at}M_{Y}\left( bt\right) .$

When answering the question we assume any i.i.d. sample from a population with mean $\mu$ and population variance $\sigma ^{2}$:

Putting in c) $a=-\frac{ES_{n}}{\sigma \left( S_{n}\right) },$ $b=\frac{1}{\sigma \left( S_{n}\right) }$ and using a) we get

$M_{z_{n}}\left( t\right) =E\exp \left( \frac{S_{n}-ES_{n}}{\sigma \left( S_{n}\right) }t\right) =e^{-ES_{n}t/\sigma \left( S_{n}\right) }M_{S_{n}}\left( t/\sigma \left( S_{n}\right) \right)$

(using b) and $ES_{n}=n\mu ,$ $Var\left( S_{n}\right) =n\sigma ^{2}$)

$=e^{-ES_{n}t/\sigma \left( S_{n}\right) }\left[ M_{X}\left( t/\sigma \left( S_{n}\right) \right) \right] ^{n}=e^{-n\mu t/\left( \sqrt{n}\sigma \right) }\left[ M_{X}\left( t/\left( \sqrt{n}\sigma \right) \right) \right] ^{n}.$

This is a general result which for the Poisson distribution can be specified as follows. From ST2133, example 3.38 we know that $M_{X}\left( t\right)=\exp \left( \lambda \left( e^{t}-1\right) \right)$. Therefore, with $\mu =\lambda$ and $\sigma =\sqrt{\lambda }$ we obtain

$M_{z_{n}}\left( t\right) =e^{-\sqrt{n\lambda }t}\left[ \exp \left( \lambda \left( e^{t/\sqrt{n\lambda }}-1\right) \right) \right] ^{n}= e^{-t\sqrt{n\lambda }+n\lambda \left( e^{t/\sqrt{n\lambda }}-1\right) }.$

[Instead of $M_{z_n}$ some students gave $M_X$.]

### Question 3

Derive the cumulant generating function of the standardized sample mean.

Answer. Again, there are a couple of useful general facts.

I) Decomposition of MGF around zero. The series $e^{x}=\sum_{i=0}^{\infty } \frac{x^{i}}{i!}$ leads to

$M_{X}\left( t\right) =Ee^{tX}=E\left( \sum_{i=0}^{\infty }\frac{t^{i}X^{i}}{ i!}\right) =\sum_{i=0}^{\infty }\frac{t^{i}}{i!}E\left( X^{i}\right) =\sum_{i=0}^{\infty }\frac{t^{i}}{i!}\mu _{i}$

where $\mu _{i}=E\left( X^{i}\right)$ are moments of $X$ and $\mu _{0}=EX^{0}=1.$ Differentiating this equation yields

$M_{X}^{(k)}\left( t\right) =\sum_{i=k}^{\infty }\frac{t^{i-k}}{\left( i-k\right) !}\mu _{i}$

and setting $t=0$ gives the rule for finding moments from MGF: $\mu _{k}=M_{X}^{(k)}\left( 0\right) .$

II) Decomposition of the cumulant generating function around zero. $K_{X}\left( t\right) =\log M_{X}\left( t\right)$ can also be decomposed into its Taylor series:

$K_{X}\left( t\right) =\sum_{i=0}^{\infty }\frac{t^{i}}{i!}\kappa _{i}$

where the coefficients $\kappa _{i}$ are called cumulants and can be found using $\kappa _{k}=K_{X}^{(k)}\left( 0\right)$. Since

$K_{X}^{\prime }\left( t\right) =\frac{M_{X}^{\prime }\left( t\right) }{ M_{X}\left( t\right) }$ and $K_{X}^{\prime \prime }\left( t\right) =\frac{ M_{X}^{\prime \prime }\left( t\right) M_{X}\left( t\right) -\left( M_{X}^{\prime }\left( t\right) \right) ^{2}}{M_{X}^{2}\left( t\right) }$

we have

$\kappa _{0}=\log M_{X}\left( 0\right) =0,$ $\kappa _{1}=\frac{M_{X}^{\prime }\left( 0\right) }{M_{X}\left( 0\right) }=\mu _{1},$

$\kappa _{2}=\mu _{2}-\mu _{1}^{2}=EX^{2}-\left( EX\right) ^{2}=Var\left( X\right) .$

Thus, for any random variable $X$ with mean $\mu$ and variance $\sigma ^{2}$ we have

$K_{X}\left( t\right) =\mu t+\frac{\sigma ^{2}t^{2}}{2}+$ terms of higher order for $t$ small.

III) If $X=a+bY$ then by c)

$K_{X}\left( t\right) =K_{a+bY}\left( t\right) =\log \left[ e^{at}M_{Y}\left( bt\right) \right] =at+K_{Y}\left( bt\right) .$

IV) By b)

$K_{S_{n}}\left( t\right) =\log \left[ M_{X}\left( t\right) \right] ^{n}=nK_{X}\left( t\right) .$

Using III), $z_{n}=\frac{S_{n}-ES_{n}}{\sigma \left( S_{n}\right) }$ and then IV) we have

$K_{z_{n}}\left( t\right) =\frac{-ES_{n}}{\sigma \left( S_{n}\right) } t+K_{S_{n}}\left( \frac{t}{\sigma \left( S_{n}\right) }\right) =\frac{-ES_{n} }{\sigma \left( S_{n}\right) }t+nK_{X}\left( \frac{t}{\sigma \left( S_{n}\right) }\right) .$

For the last term on the right we use the approximation around zero from II):

$K_{z_{n}}\left( t\right) =\frac{-ES_{n}}{\sigma \left( S_{n}\right) } t+nK_{X}\left( \frac{t}{\sigma \left( S_{n}\right) }\right) \approx \frac{ -ES_{n}}{\sigma \left( S_{n}\right) }t+n\mu \frac{t}{\sigma \left( S_{n}\right) }+n\frac{\sigma ^{2}}{2}\left( \frac{t}{\sigma \left( S_{n}\right) }\right) ^{2}$

$=-\frac{n\mu }{\sqrt{n}\sigma }t+n\mu \frac{t}{\sqrt{n}\sigma }+n\frac{ \sigma ^{2}}{2}\left( \frac{t}{\sigma \left( S_{n}\right) }\right) ^{2}=t^{2}/2.$

[Important. Why are the above steps necessary? Passing from the series $M_{X}\left( t\right) =\sum_{i=0}^{\infty }\frac{t^{i}}{i!}\mu _{i}$ to the series for $K_{X}\left( t\right) =\log M_{X}\left( t\right)$ is not straightforward and can easily lead to errors. It is not advisable in case of the Poisson to derive $K_{z_{n}}$ from $M_{z_{n}}\left( t\right) =$ $e^{-t \sqrt{n\lambda }+n\lambda \left( e^{t/\sqrt{n\lambda }}-1\right) }$.]
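Fact II) can be illustrated numerically for the Poisson case, where $K_{X}\left( t\right) =\log M_{X}\left( t\right) =\lambda \left( e^{t}-1\right)$. The sketch below approximates $\kappa _{1}=K_{X}^{\prime }(0)$ and $\kappa _{2}=K_{X}^{\prime \prime }(0)$ by central finite differences; the value $\lambda =3$ and the step `h` are illustrative assumptions.

```python
import math

lam = 3.0

def K(t):
    """Cumulant generating function of Poisson(lam): log M_X(t) = lam * (e^t - 1)."""
    return lam * (math.exp(t) - 1.0)

h = 1e-4
kappa0 = K(0.0)                                   # kappa_0 = log M_X(0) = 0
kappa1 = (K(h) - K(-h)) / (2 * h)                 # central difference for K'(0)
kappa2 = (K(h) - 2 * K(0.0) + K(-h)) / h ** 2     # central difference for K''(0)
print(kappa0, kappa1, kappa2)
```

Both `kappa1` and `kappa2` are close to $\lambda$, in agreement with $\kappa _{1}=EX$ and $\kappa _{2}=Var\left( X\right)$ for the Poisson variable.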

### Question 4

Prove the central limit theorem using the cumulant generating function you obtained.

Answer. In the previous question we proved that around zero

$K_{z_{n}}\left( t\right) \rightarrow \frac{t^{2}}{2}.$

This implies that

(1) $M_{z_{n}}\left( t\right) \rightarrow e^{t^{2}/2}$ for each $t$ around zero.

But we know that for a normal $X$ its MGF is $M_{X}\left( t\right) =\exp \left( \mu t+\frac{\sigma ^{2}t^{2}}{2}\right)$ (ST2133 example 3.42) and hence for the standard normal

(2) $M_{z}\left( t\right) =e^{t^{2}/2}.$

Theorem (link between pointwise convergence of MGFs of $\left\{ X_{n}\right\}$ and convergence in distribution of $\left\{ X_{n}\right\}$) Let $\left\{ X_{n}\right\}$ be a sequence of random variables and let $X$ be some random variable. If $M_{X_{n}}\left( t\right)$ converges for each $t$ from a neighborhood of zero to $M_{X}\left( t\right)$, then $X_{n}$ converges in distribution to $X.$

Using (1), (2) and this theorem we finish the proof that $z_{n}$ converges in distribution to the standard normal, which is the central limit theorem.
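The convergence $K_{z_{n}}\left( t\right) \rightarrow t^{2}/2$ can be watched numerically in the Poisson case. This is a sketch based on the exact expression $K_{z_{n}}\left( t\right) =\frac{-ES_{n}}{\sigma \left( S_{n}\right) }t+nK_{X}\left( \frac{t}{\sigma \left( S_{n}\right) }\right)$ from Question 3 with $K_{X}\left( t\right) =\lambda \left( e^{t}-1\right)$; the values of `t`, `lam` and the sample sizes are illustrative assumptions.

```python
import math

def K_zn(t, n, lam):
    """Exact cumulant generating function of the standardized Poisson sum."""
    mean_sum = n * lam                 # E S_n = n * mu with mu = lam
    sd_sum = math.sqrt(n * lam)        # sigma(S_n) = sqrt(n) * sigma with sigma^2 = lam
    u = t / sd_sum
    # K_X(u) = lam * (e^u - 1); expm1 keeps precision for small u
    return -mean_sum * u + n * lam * math.expm1(u)

t, lam = 0.7, 2.0
limit = t ** 2 / 2                     # CGF of the standard normal at t
errors = [abs(K_zn(t, n, lam) - limit) for n in (10, 1000, 100000)]
print(errors)                          # decreasing towards zero
```

The gap to $t^{2}/2$ shrinks as $n$ grows, which is what the terms of higher order in fact II) predict.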

### Question 5

State the factorization theorem and apply it to show that $U=\sum_{i=1}^{n}X_{i}$ is a sufficient statistic.

Answer. The solution is given on p.180 of ST2134. For $x_{i}=0,1,2,...$ ($i=1,...,n$) the joint density is

(3) $f_{X}\left( x,\lambda \right) =\prod\limits_{i=1}^{n}e^{-\lambda } \frac{\lambda ^{x_{i}}}{x_{i}!}=\frac{\lambda ^{\Sigma x_{i}}e^{-n\lambda }}{\Pi _{i=1}^{n}x_{i}!}.$

To satisfy the Fisher-Neyman factorization theorem set

$g\left( \sum x_{i},\lambda \right) =\lambda ^{\Sigma x_{i}}e^{-n\lambda },\ h\left( x\right) =\frac{1}{\Pi _{i=1}^{n}x_{i}!}$

and then we see that $\sum x_{i}$ is a sufficient statistic for $\lambda .$

### Question 6

Find a minimal sufficient statistic for $\lambda$ stating all necessary theoretical facts.

Answer. Characterization of minimal sufficiency. A statistic $T\left( X\right)$ is minimal sufficient if and only if level sets of $T$ coincide with sets on which the ratio $f_{X}\left( x,\theta \right) /f_{X}\left( y,\theta \right)$ does not depend on $\theta .$

From (3)

$f_{X}\left( x,\lambda \right) /f_{X}\left( y,\lambda \right) =\frac{\lambda ^{\Sigma x_{i}}e^{-n\lambda }}{\Pi _{i=1}^{n}x_{i}!}\left[ \frac{\lambda ^{\Sigma y_{i}}e^{-n\lambda }}{\Pi _{i=1}^{n}y_{i}!}\right] ^{-1}=\lambda ^{ \left[ \Sigma x_{i}-\Sigma y_{i}\right] }\frac{\Pi _{i=1}^{n}y_{i}!}{\Pi _{i=1}^{n}x_{i}!}.$

The expression on the right does not depend on $\lambda$ if and only if $\Sigma x_{i}=\Sigma y_{i}.$ The last condition describes level sets of $T\left( X\right) =\sum X_{i}.$ Thus it is minimal sufficient.
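The ratio criterion can be tried out on concrete samples. The sketch below uses three hypothetical samples of size 3: `x` and `y` have the same sum, while `w` has a different one; the grid of $\lambda$ values is also an illustrative assumption.

```python
import math

def joint_pmf(xs, lam):
    """Joint Poisson density (3): product of e^{-lam} lam^{x_i} / x_i!."""
    p = 1.0
    for x in xs:
        p *= math.exp(-lam) * lam ** x / math.factorial(x)
    return p

def ratio(xs, ys, lam):
    return joint_pmf(xs, lam) / joint_pmf(ys, lam)

x = [0, 2, 4]        # sum 6
y = [1, 2, 3]        # sum 6, same level set of T as x
w = [1, 1, 1]        # sum 3, different level set

lams = [0.5, 1.0, 2.0, 5.0]
same_sum = [ratio(x, y, lam) for lam in lams]   # constant in lam
diff_sum = [ratio(x, w, lam) for lam in lams]   # varies with lam
print(same_sum)
print(diff_sum)
```

For `x` and `y` the ratio equals $\Pi y_{i}!/\Pi x_{i}!=12/48$ for every $\lambda$, while for `x` and `w` it behaves like $\lambda ^{3}/48$ and therefore changes with $\lambda$.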

### Question 7

Find the Method of Moments estimator of the population mean.

Answer. The idea of the method is to take some populational property (for example, $EX=\lambda$) and replace the population characteristic (in this case $EX$) by its sample analog ($\bar{X}$) to obtain a MM estimator. In our case $\hat{\lambda}_{MM}= \bar{X}.$ [Try to do this for the Gamma distribution].

### Question 8

Find the Fisher information.

Answer. From Problem 5 the log-likelihood is

$l_{X}\left( \lambda ,x\right) =-n\lambda +\sum x_{i}\log \lambda -\sum \log \left( x_{i}!\right) .$

Hence the score function is (see Example 2.30 in ST2134)

$s_{X}\left( \lambda ,x\right) =\frac{\partial }{\partial \lambda } l_{X}\left( \lambda ,x\right) =-n+\frac{1}{\lambda }\sum x_{i}.$

Then

$\frac{\partial ^{2}}{\partial \lambda ^{2}}l_{X}\left( \lambda ,x\right) =- \frac{1}{\lambda ^{2}}\sum x_{i}$

and the Fisher information is

$I_{X}\left( \lambda \right) =-E\left( \frac{\partial ^{2}}{\partial \lambda ^{2}}l_{X}\left( \lambda ,x\right) \right) =\frac{1}{\lambda ^{2}}E\sum X_{i}=\frac{n\lambda }{\lambda ^{2}}=\frac{n}{\lambda }.$
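The result $I_{X}\left( \lambda \right) =n/\lambda$ can be checked numerically for one observation and then scaled by $n$. The sketch below also uses the standard identity that the Fisher information equals the expected squared score, which is not derived in this answer; $\lambda =2.5$, $n=4$ and the truncation of the support are illustrative assumptions.

```python
import math

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam, n = 2.5, 4
support = range(0, 100)     # truncated support; the tail is negligible here

# One observation: score is -1 + x/lam, second derivative is -x/lam^2.
info_from_hessian = sum(poisson_pmf(x, lam) * (x / lam ** 2) for x in support)
info_from_score = sum(poisson_pmf(x, lam) * (-1 + x / lam) ** 2 for x in support)

info_sample = n * info_from_hessian     # information adds over i.i.d. observations
cramer_rao_bound = 1 / info_sample      # lower bound for unbiased estimators of lam
print(info_from_hessian, info_from_score, info_sample, cramer_rao_bound)
```

Both single-observation values agree with $1/\lambda$, and the resulting bound equals $\lambda /n$.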

### Question 9

Derive the Cramer-Rao lower bound for $V\left( \bar{X}\right)$ for a random sample.

Answer. (See Example 3.17 in ST2134) Since $\bar{X}$ is an unbiased estimator of $\lambda$ by Problem 1, from the Cramer-Rao theorem we know that

$V\left( \bar{X}\right) \geq \frac{1}{I_{X}\left( \lambda \right) }=\frac{ \lambda }{n}$

and in fact by Problem 1 this lower bound is attained.

19
Feb 22

## Estimation of parameters of a normal distribution

Here we show that the knowledge of the distribution of $s^{2}$ for linear regression allows one to do without long calculations contained in the guide ST 2134 by J. Abdey.

Theorem. Let $y_{1},...,y_{n}$ be independent observations from $N\left( \mu,\sigma ^{2}\right)$. 1) $s^{2}\left( n-1\right) /\sigma ^{2}$ is distributed as $\chi _{n-1}^{2}.$ 2) The estimators $\bar{y}$ and $s^{2}$ are independent. 3) $Es^{2}=\sigma ^{2},$ 4) $Var\left( s^{2}\right) =\frac{2\sigma ^{4}}{n-1},$ 5) $\frac{s^{2}-\sigma ^{2}}{\sqrt{2\sigma ^{4}/\left(n-1\right) }}$ converges in distribution to $N\left( 0,1\right) .$

Proof. We can write $y_{i}=\mu +e_{i}$ where $e_{i}$ is distributed as $N\left( 0,\sigma ^{2}\right) .$ Putting $\beta =\mu ,\ y=\left(y_{1},...,y_{n}\right) ^{T},$ $e=\left( e_{1},...,e_{n}\right) ^{T}$ and $X=\left( 1,...,1\right) ^{T}$ (a vector of ones) we satisfy (1) and (2). Since $X^{T}X=n,$ we have $\hat{\beta}=\bar{y}.$ Further,

$r\equiv y-X\hat{ \beta}=\left( y_{1}-\bar{y},...,y_{n}-\bar{y}\right) ^{T}$

and

$s^{2}=\left\Vert r\right\Vert ^{2}/\left( n-1\right) =\sum_{i=1}^{n}\left( y_{i}-\bar{y}\right) ^{2}/\left( n-1\right) .$

Thus 1) and 2) follow from results for linear regression.

3) For a normal variable $X$ its moment generating function is $M_{X}\left( t\right) =\exp \left(\mu t+\frac{1}{2}\sigma ^{2}t^{2}\right)$ (see Guide ST2133, 2021, p.88). For the standard normal we get

$M_{z}^{\prime }\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) t,$ $M_{z}^{\prime \prime }\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) (t^{2}+1),$

$M_{z}^{\prime \prime \prime}\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) (t^{3}+2t+t),$ $M_{z}^{(4)}\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) (t^{4}+6t^{2}+3).$

Applying the general property $EX^{r}=M_{X}^{\left( r\right) }\left( 0\right)$ (same guide, p.84) we see that

$Ez=0,$ $Ez^{2}=1,$ $Ez^{3}=0,$ $Ez^{4}=3,$

$Var(z)=1,$ $Var\left( z^{2}\right) =Ez^{4}-\left( Ez^{2}\right) ^{2}=3-1=2.$

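Therefore, since by 1) $s^{2}\left( n-1\right) /\sigma ^{2}$ is distributed as $z_{1}^{2}+...+z_{n-1}^{2}$ with independent standard normals $z_{i},$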

$Es^{2}=\frac{\sigma ^{2}}{n-1}E\left( z_{1}^{2}+...+z_{n-1}^{2}\right) =\frac{\sigma ^{2}}{n-1}\left( n-1\right) =\sigma ^{2}.$

4) By independence of standard normals

$Var\left( s^{2}\right) =$ $\left(\frac{\sigma ^{2}}{n-1}\right) ^{2}\left[ Var\left( z_{1}^{2}\right) +...+Var\left( z_{n-1}^{2}\right) \right] =\frac{\sigma ^{4}}{\left( n-1\right) ^{2}}2\left( n-1\right) =\frac{2\sigma ^{4}}{n-1}.$

5) By standardizing $s^{2}$ we have $\frac{s^{2}-Es^{2}}{\sigma \left(s^{2}\right) }=\frac{s^{2}-\sigma ^{2}}{\sqrt{2\sigma ^{4}/\left( n-1\right) }}$ and this converges in distribution to $N\left( 0,1\right)$ by the central limit theorem.
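Statements 3) and 4) can be illustrated by a small Monte Carlo experiment. This is only a sketch: the values of $\mu$, $\sigma ^{2}$, $n$, the number of replications and the seed are arbitrary assumptions, and the simulated moments match the theoretical ones only approximately.

```python
import random

def sample_variance(ys):
    """s^2 = sum (y_i - ybar)^2 / (n - 1)."""
    ybar = sum(ys) / len(ys)
    return sum((y - ybar) ** 2 for y in ys) / (len(ys) - 1)

random.seed(0)
mu, sigma2, n, reps = 1.0, 4.0, 5, 20000
sigma = sigma2 ** 0.5

s2_values = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    s2_values.append(sample_variance(sample))

mean_s2 = sum(s2_values) / reps                              # compare with sigma^2
var_s2 = sum((v - mean_s2) ** 2 for v in s2_values) / reps   # compare with 2 sigma^4 / (n - 1)
print(mean_s2, sigma2)
print(var_s2, 2 * sigma2 ** 2 / (n - 1))
```

With these settings the theoretical values are $Es^{2}=4$ and $Var\left( s^{2}\right) =8$.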

19
Feb 22

## Distribution of the estimator of the error variance

If you are reading the book by Dougherty: this post is about the distribution of the estimator  $s^2$ defined in Chapter 3.

Consider regression

(1) $y=X\beta +e$

where the deterministic matrix $X$ is of size $n\times k,$ satisfies $\det \left( X^{T}X\right) \neq 0$ (regressors are not collinear) and the error $e$ satisfies

(2) $Ee=0,Var(e)=\sigma ^{2}I$

$\beta$ is estimated by $\hat{\beta}=(X^{T}X)^{-1}X^{T}y.$ Denote $P=X(X^{T}X)^{-1}X^{T},$ $Q=I-P.$ Using (1) we see that $\hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e$ and the residual $r\equiv y-X\hat{\beta}=Qe.$ $\sigma^{2}$ is estimated by

(3) $s^{2}=\left\Vert r\right\Vert ^{2}/\left( n-k\right) =\left\Vert Qe\right\Vert ^{2}/\left( n-k\right) .$

$Q$ is a projector and has properties which are derived from those of $P$

(4) $Q^{T}=Q,$ $Q^{2}=Q.$

If $\lambda$ is an eigenvalue of $Q,$ then multiplying $Qx=\lambda x$ by $Q$ and using the fact that $x\neq 0$ we get $\lambda ^{2}=\lambda .$ Hence eigenvalues of $Q$ can be only $0$ or $1.$ The equation $tr\left( Q\right) =n-k$ tells us that the number of eigenvalues equal to 1 is $n-k$ and the remaining $k$ are zeros. Let $Q=U\Lambda U^{T}$ be the diagonal representation of $Q.$ Here $U$ is an orthogonal matrix,

(5) $U^{T}U=I,$

and $\Lambda$ is a diagonal matrix with eigenvalues of $Q$ on the main diagonal. We can assume that the first $n-k$ numbers on the diagonal of $\Lambda$ are ones and the others are zeros.

Theorem. Let $e$ be normal. 1) $s^{2}\left( n-k\right) /\sigma ^{2}$ is distributed as $\chi _{n-k}^{2}.$ 2) The estimators $\hat{\beta}$ and $s^{2}$ are independent.

Proof. 1) We have by (4)

(6) $\left\Vert Qe\right\Vert ^{2}=\left( Qe\right) ^{T}Qe=\left( Q^{T}Qe\right) ^{T}e=\left( Qe\right) ^{T}e=\left( U\Lambda U^{T}e\right) ^{T}e=\left( \Lambda U^{T}e\right) ^{T}U^{T}e.$

Denote $S=U^{T}e.$ From (2) and (5)

$ES=0,$ $Var\left( S\right) =EU^{T}ee^{T}U=\sigma ^{2}U^{T}U=\sigma ^{2}I$

and $S$ is normal as a linear transformation of a normal vector. It follows that $S=\sigma z$ where $z$ is a standard normal vector with independent standard normal coordinates $z_{1},...,z_{n}.$ Hence, (6) implies

(7) $\left\Vert Qe\right\Vert ^{2}=\sigma ^{2}\left( \Lambda z\right) ^{T}z=\sigma ^{2}\left( z_{1}^{2}+...+z_{n-k}^{2}\right) =\sigma ^{2}\chi _{n-k}^{2}.$

(3) and (7) prove the first statement.

2) First we note that the vectors $Pe,Qe$ are independent. Since they are normal, their independence follows from

$cov(Pe,Qe)=EPee^{T}Q^{T}=\sigma ^{2}PQ=0.$

It's easy to see that $X^{T}P=X^{T}.$ This allows us to show that $\hat{\beta}$ is a function of $Pe$:

$\hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e=\beta +(X^{T}X)^{-1}X^{T}Pe.$

Independence of $Pe,Qe$ leads to independence of their functions $\hat{\beta}$ and $s^{2}.$
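The projector properties used in the proof can be verified on a small example. The sketch below assumes a hypothetical design matrix with $n=4$ observations and $k=2$ regressors (a constant and a trend); the helper functions are written only for this illustration.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inv2(m):
    """Inverse of a 2x2 matrix; requires det != 0 (regressors not collinear)."""
    d = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]   # n = 4, k = 2
n, k = len(X), len(X[0])
Xt = transpose(X)

P = matmul(matmul(X, inv2(matmul(Xt, X))), Xt)          # P = X (X^T X)^{-1} X^T
Q = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(n)] for i in range(n)]

QQ = matmul(Q, Q)
PQ = matmul(P, Q)
idempotent_gap = max(abs(QQ[i][j] - Q[i][j]) for i in range(n) for j in range(n))
symmetry_gap = max(abs(Q[i][j] - Q[j][i]) for i in range(n) for j in range(n))
pq_gap = max(abs(v) for row in PQ for v in row)          # PQ = 0 gives cov(Pe, Qe) = 0
trace_Q = sum(Q[i][i] for i in range(n))                 # equals n - k
print(idempotent_gap, symmetry_gap, pq_gap, trace_Q)
```

All three gaps are zero up to rounding and the trace equals $n-k=2$, matching (4), the orthogonality $PQ=0$ and the count of unit eigenvalues.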

5
Feb 22

## Sufficiency and minimal sufficiency

### Sufficient statistic

I find that in the notation of a statistic it is better to reflect the dependence on the argument. So I write $T\left( X\right)$ for a statistic, where $X$ is a sample, instead of a faceless $U$ or $V.$

Definition 1. The statistic $T\left( X\right)$ is called sufficient for the parameter $\theta$ if the distribution of $X$ conditional on $T\left( X\right)$ does not depend on $\theta .$

The main results on sufficiency and minimal sufficiency become transparent if we look at them from the point of view of Maximum Likelihood (ML) estimation.

Let $f_{X}\left( x,\theta \right)$ be the joint density of the vector $X=\left( X_{1},...,X_{n}\right)$, where $\theta$ is a parameter (possibly a vector). The ML estimator is obtained by maximizing over $\theta$ the function $f_{X}\left( x,\theta \right)$ with $x=\left(x_{1},...,x_{n}\right)$ fixed at the observed data. The estimator depends on the data and can be denoted $\hat{\theta}_{ML}\left( x\right) .$

Fisher-Neyman theorem. $T\left( X\right)$ is sufficient for $\theta$ if and only if the joint density can be represented as

(1) $f_{X}\left( x,\theta \right) =g\left( T\left( x\right) ,\theta \right) k\left( x\right)$

where, as the notation suggests, $g$ depends on $x$ only through $T\left(x\right)$ and $k$ does not depend on $\theta .$

Maximizing the left side of (1) is the same thing as maximizing $g\left(T\left( x\right) ,\theta \right)$ because $k$ does not depend on $\theta .$ But this means that $\hat{\theta}_{ML}\left( x\right)$ depends on $x$ only through $T\left( x\right) .$ A sufficient statistic is all you need to find the ML estimator. This interpretation is easier to understand than the definition of sufficiency.

### Minimal sufficient statistic

Definition 2. A sufficient statistic $T\left( X\right)$ is called minimal sufficient if for any other sufficient statistic $S\left( X\right)$ there exists a function $g$ such that $T\left( X\right) =g\left( S\left( X\right) \right) .$

A level set is a set of type $\left\{ x:T\left( x\right) =c\right\} ,$ for a constant $c$ (which in general can be a constant vector). See the visualization of level sets. A level set is also called a preimage and denoted $T^{-1}\left( c\right) =\left\{ x:T\left(x\right) =c\right\} .$ When $T$ is one-to-one the preimage contains just one point. When $T$ is not one-to-one the preimage contains more than one point. The wider it is, the less information about the sample the statistic carries (because many data sets are mapped to a single point and you cannot tell one data set from another by looking at the statistic value). In the definition of the minimal sufficient statistic we have

$\left\{x:T\left( x\right) =c\right\} =\left\{ x:g\left( S\left( x\right) \right)=c\right\} =\left\{ x:S\left( x\right) \in g^{-1}\left( c\right) \right\} .$

Since $g^{-1}\left( c\right)$ generally contains more than one point, this shows that the level sets of $T\left( X\right)$ are generally wider than those of $S\left( X\right) .$ Since this is true for any sufficient $S\left( X\right) ,$ $T\left( X\right)$ carries less information about $X$ than any other sufficient statistic.

Definition 2 is an existence statement and is difficult to verify directly as there are words "for any" and "exists". Again it's better to relate it to ML estimation.

Suppose for two sets of data $x,y$ there is a positive number $k\left(x,y\right)$ such that

(2) $f_{X}\left( x,\theta \right) =k\left( x,y\right) f_{X}\left( y,\theta\right) .$

Maximizing the left side we get the estimator $\hat{\theta}_{ML}\left(x\right) .$ Maximizing $f_{X}\left( y,\theta \right)$ we get $\hat{\theta}_{ML}\left( y\right) .$ Since $k\left( x,y\right)$ does not depend on $\theta ,$ (2) tells us that

$\hat{\theta}_{ML}\left( x\right) =\hat{\theta}_{ML}\left( y\right) .$

Thus, if two sets of data $x,y$ satisfy (2), the ML method cannot distinguish between $x$ and $y$ and supplies the same estimator. Let us call $x,y$ indistinguishable if there is a positive number $k\left( x,y\right)$ such that (2) is true.

An equation $T\left( x\right) =T\left( y\right)$ means that $x,y$ belong to the same level set.

Characterization of minimal sufficiency. A statistic $T\left( X\right)$ is minimal sufficient if and only if its level sets coincide with sets of indistinguishable $x,y.$

The advantage of this formulation is that it relates a geometric notion of level sets to the ML estimator properties. The formulation in the guide by J. Abdey is:

A statistic $T\left( X\right)$ is minimal sufficient if and only if the equality $T\left( x\right) =T\left( y\right)$ is equivalent to (2).

Rewriting (2) as

(3) $f_{X}\left( x,\theta \right) /f_{X}\left( y,\theta \right) =k\left(x,y\right)$

we get a practical way of finding a minimal sufficient statistic: form the ratio on the left of (3) and find the sets along which the ratio does not depend on $\theta .$ Those sets will be level sets of $T\left( X\right) .$

28
Dec 21

## Chi-squared distribution

This post is intended to close a gap in J. Abdey's guide ST2133, which is absence of distributions widely used in Econometrics.

### Chi-squared with one degree of freedom

Let $X$ be a random variable and let $Y=X^{2}.$

Question 1. What is the link between the distribution functions of $Y$ and $X?$

Chart 1. Inverting a square function

The start is simple: just follow the definitions. $F_{Y}\left( y\right)=P\left( Y\leq y\right) =P\left( X^{2}\leq y\right) .$ Assuming that $y>0$, on Chart 1 we see that $\left\{ x:x^{2}\leq y\right\} =\left\{x: -\sqrt{y}\leq x\leq \sqrt{y}\right\} .$ Hence, using additivity of probability,

(1) $F_{Y}\left( y\right) =P\left( -\sqrt{y}\leq X\leq \sqrt{y}\right) =P\left( X\leq \sqrt{y}\right) -P\left( X<-\sqrt{y}\right)$

$=F_{X}\left( \sqrt{y}\right) -F_{X}\left( -\sqrt{y}\right) .$

The last transition is based on the assumption that $P\left( X<x\right) =P\left( X\leq x\right)$ for all $x$, which is maintained for continuous random variables throughout the guide by Abdey.

Question 2. What is the link between the densities of $X$ and $Y=X^{2}?$ By the Leibniz integral rule (1) implies

(2) $f_{Y}\left( y\right) =f_{X}\left( \sqrt{y}\right) \frac{1}{2\sqrt{y}} +f_{X}\left( -\sqrt{y}\right) \frac{1}{2\sqrt{y}}.$

Exercise. Assuming that $g$ is an increasing differentiable function with the inverse $h$ and $Y=g(X)$ answer questions similar to 1 and 2.

See the definition of $\chi _{1}^{2}.$ Just applying (2) to $X=z$ and $Y=z^{2}=\chi _{1}^{2}$ we get

$f_{\chi _{1}^{2}}\left( y\right) =\frac{1}{\sqrt{2\pi }}e^{-y/2}\frac{1}{2 \sqrt{y}}+\frac{1}{\sqrt{2\pi }}e^{-y/2}\frac{1}{2\sqrt{y}}=\frac{1}{\sqrt{ 2\pi }}y^{1/2-1}e^{-y/2},\ y>0.$

Since $\Gamma \left( 1/2\right) =\sqrt{\pi },$ the procedure for identifying the gamma distribution gives

$f_{\chi _{1}^{2}}\left( x\right) =\frac{1}{\Gamma \left( 1/2\right) }\left( 1/2\right) ^{1/2}x^{1/2-1}e^{-x/2}=f_{1/2,1/2}\left( x\right) .$

We have derived the density of the chi-squared variable with one degree of freedom, see also Example 3.52, J. Abdey, Guide ST2133.
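The derived density can be compared with a numerical derivative of the distribution function from (1). This is a sketch for a standard normal $X$; the evaluation points and the step `h` are illustrative assumptions.

```python
import math

def normal_cdf(x):
    """Distribution function of the standard normal via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def F_Y(y):
    """Formula (1): F_Y(y) = F_X(sqrt(y)) - F_X(-sqrt(y)) for y > 0."""
    r = math.sqrt(y)
    return normal_cdf(r) - normal_cdf(-r)

def chi2_1_density(y):
    """Derived density: y^{1/2 - 1} e^{-y/2} / sqrt(2 pi) for y > 0."""
    return y ** (-0.5) * math.exp(-y / 2.0) / math.sqrt(2.0 * math.pi)

h = 1e-5
gaps = []
for y in (0.5, 1.0, 3.0):
    numeric = (F_Y(y + h) - F_Y(y - h)) / (2 * h)   # central difference for F_Y'(y)
    gaps.append(abs(numeric - chi2_1_density(y)))
print(gaps)                                          # all close to zero
```

The numerical derivative of $F_{Y}$ agrees with the closed-form density at each test point, as (2) requires.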

### General chi-squared

For $\chi _{n}^{2}=z_{1}^{2}+...+z_{n}^{2}$ with independent standard normals $z_{1},...,z_{n}$ we can write $\chi _{n}^{2}=\chi _{1}^{2}+...+\chi _{1}^{2}$ where the chi-squared variables on the right are independent and all have one degree of freedom. This is because deterministic (here quadratic) functions of independent variables are independent.

Recall that the gamma density is closed under convolutions with the same $\alpha .$ Then by the convolution theorem we get

$f_{\chi _{n}^{2}}=f_{\chi _{1}^{2}}\ast ...\ast f_{\chi _{1}^{2}}=f_{1/2,1/2}\ast ...\ast f_{1/2,1/2}$ $=f_{1/2,n/2}=\frac{1}{\Gamma \left( n/2\right) 2^{n/2}}x^{n/2-1}e^{-x/2}.$
27
Dec 21

## Gamma distribution

Definition. The gamma distribution $Gamma\left( \alpha ,\nu \right)$ is a two-parametric family of densities. For $\alpha >0,\nu >0$ the density is defined by

$f_{\alpha ,\nu }\left( x\right) =\frac{1}{\Gamma \left( \nu \right) }\alpha ^{\nu }x^{\nu -1}e^{-\alpha x},\ x>0;$ $f_{\alpha ,\nu }\left( x\right) =0,\ x<0.$

Obviously, you need to know what the gamma function is. My notation of the parameters follows Feller, W. An Introduction to Probability Theory and its Applications, Volume II, 2nd edition (1971). It is different from the one used by J. Abdey in his guide ST2133.

### Property 1

It is really a density because

$\frac{1}{\Gamma \left( \nu \right) }\alpha ^{\nu }\int_{0}^{\infty }x^{\nu -1}e^{-\alpha x}dx=$ (replace $\alpha x=t$)

$=\frac{1}{\Gamma \left( \nu \right) }\alpha ^{\nu }\int_{0}^{\infty }t^{\nu -1}\alpha ^{1-\nu -1}e^{-t}dt=1.$

Suppose you see an expression $x^{a}e^{-bx}$ and need to determine which gamma density this is. The power of the exponent gives you $\alpha =b$ and the power of $x$ gives you $\nu =a+1.$ It follows that the normalizing constant should be $\frac{1}{\Gamma \left( a+1\right) }b^{a+1}$ and the density is $\frac{1}{\Gamma \left( a+1\right) }b^{a+1}x^{a}e^{-bx},$ $x>0.$
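The identification rule can be written as a small function. The sketch below builds the density from the kernel $x^{a}e^{-bx}$ and checks by a midpoint rule that it integrates to one; the function name, the values $a=2$, $b=3$ and the integration settings are assumptions made for illustration.

```python
import math

def gamma_density_from_kernel(a, b):
    """Gamma density proportional to x^a e^{-b x}: alpha = b, nu = a + 1."""
    alpha, nu = b, a + 1.0
    const = alpha ** nu / math.gamma(nu)      # normalizing constant b^{a+1} / Gamma(a+1)

    def density(x):
        return const * x ** a * math.exp(-b * x) if x > 0 else 0.0

    return density

f = gamma_density_from_kernel(a=2.0, b=3.0)

# Midpoint rule on (0, 20); the tail beyond 20 is negligible for these parameters.
steps, upper = 200000, 20.0
width = upper / steps
total = sum(f((i + 0.5) * width) for i in range(steps)) * width
print(total)                                   # close to 1
```

The integral is one up to numerical error, which confirms that the constant $\frac{1}{\Gamma \left( a+1\right) }b^{a+1}$ is the right normalization.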

### Property 2

The most important property is that the family of gamma densities with the same $\alpha$ is closed under convolutions. Because of the associativity property $f_{X}\ast f_{Y}\ast f_{Z}=\left( f_{X}\ast f_{Y}\right) \ast f_{Z}$ it is enough to prove this for the case of two gamma densities.

First we want to prove

(1) $\left( f_{\alpha ,\mu }\ast f_{\alpha ,\nu }\right) \left( x\right) = \frac{\Gamma \left( \mu +\nu \right) }{\Gamma \left( \mu \right) \Gamma \left( \nu \right) }\int_{0}^{1}\left( 1-t\right) ^{\mu -1}t^{\nu -1}dt\times f_{\alpha ,\mu +\nu }(x).$

Start with the general definition of convolution and recall where the density vanishes:

$\left( f_{\alpha ,\mu }\ast f_{\alpha ,\nu }\right) \left( x\right) =\int_{-\infty }^{\infty }f_{\alpha ,\mu }\left( x-y\right) f_{\alpha ,\nu }\left( y\right) dy=\int_{0}^{x}f_{\alpha ,\mu }\left( x-y\right) f_{\alpha ,\nu }\left( y\right) dy$

(plug the densities and take out the constants)

$=\int_{0}^{x}\left[ \frac{1}{\Gamma \left( \mu \right) }\alpha ^{\mu }\left( x-y\right) ^{\mu -1}e^{-\alpha \left( x-y\right) }\right] \left[ \frac{1}{\Gamma \left( \nu \right) }\alpha ^{\nu }y^{\nu -1}e^{-\alpha y} \right] dy$ $=\frac{\alpha ^{\mu +\nu }e^{-\alpha x}}{\Gamma \left( \mu \right) \Gamma \left( \nu \right) }\int_{0}^{x}\left( x-y\right) ^{\mu -1}y^{\nu -1}dy$

(replace $y=xt$)

$=\frac{\Gamma \left( \mu +\nu \right) }{\Gamma \left( \mu \right) \Gamma \left( \nu \right) }\frac{\alpha ^{\mu +\nu }x^{\mu +\nu -1}e^{-\alpha x}}{ \Gamma \left( \mu +\nu \right) }\int_{0}^{1}\left( 1-t\right) ^{\mu -1}t^{\nu -1}dt$ $=\frac{\Gamma \left( \mu +\nu \right) }{\Gamma \left( \mu \right) \Gamma \left( \nu \right) }\int_{0}^{1}\left( 1-t\right) ^{\mu -1}t^{\nu -1}dt\times f_{\alpha ,\mu +\nu }\left( x\right).$

Thus (1) is true. Integrating it we have

$\int_{R}\left( f_{\alpha ,\mu }\ast f_{\alpha ,\nu }\right) \left( x\right) dx=\frac{\Gamma \left( \mu +\nu \right) }{\Gamma \left( \mu \right) \Gamma \left( \nu \right) }\int_{0}^{1}\left( 1-t\right) ^{\mu -1}t^{\nu -1}dt\times \int_{R}f_{\alpha ,\mu +\nu }\left( x\right) dx.$

We know that the convolution of two densities is a density. Therefore the last equation implies

$\frac{\Gamma \left( \mu +\nu \right) }{\Gamma \left( \mu \right) \Gamma \left( \nu \right) }\int_{0}^{1}\left( 1-t\right) ^{\mu -1}t^{\nu -1}dt=1$

and

$f_{\alpha ,\mu }\ast f_{\alpha ,\nu }=f_{\alpha ,\mu +\nu },\ \mu ,\nu >0.$
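A numerical sanity check of the result may be helpful. The sketch below is an illustration under assumed parameter values ($\alpha =2,$ $\mu =1.5,$ $\nu =2.5$ are arbitrary, chosen larger than 1 so that the integrands stay bounded); the function names are hypothetical and the midpoint rule is only one of many possible quadrature choices.

```python
import math

def gamma_density(x, alpha, nu):
    """Gamma density with parameters alpha (rate) and nu (shape); zero for x <= 0."""
    if x <= 0:
        return 0.0
    return alpha ** nu * x ** (nu - 1) * math.exp(-alpha * x) / math.gamma(nu)

def midpoint(f, lo, hi, n=20_000):
    """Midpoint rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

alpha, mu, nu = 2.0, 1.5, 2.5  # illustrative values

def convolution(x):
    # Integrate over [0, x] only: both densities vanish on the negative half-axis
    return midpoint(
        lambda y: gamma_density(x - y, alpha, mu) * gamma_density(y, alpha, nu),
        0.0, x)

# The constant multiplying f_{alpha, mu+nu} in (1); the proof shows it equals 1
beta_const = (math.gamma(mu + nu) / (math.gamma(mu) * math.gamma(nu))
              * midpoint(lambda t: (1 - t) ** (mu - 1) * t ** (nu - 1), 0.0, 1.0))

for x in (0.5, 1.0, 3.0):
    print(x, convolution(x), gamma_density(x, alpha, mu + nu))
print(beta_const)
```

The two columns printed for each $x$ should agree up to quadrature error, and `beta_const` should be close to 1.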

Alternative proof. The moment generating function of a sum of two independent gamma distributions with the same $\alpha$ shows that this sum is again a gamma distribution with the same $\alpha$, see pp. 141, 209 in the guide ST2133.

26
Dec 21

## Gamma function

### Gamma function

The gamma function and gamma distribution are two different things. This post is about the former and is a preparatory step to study the latter.

Definition. The gamma function is defined by

$\Gamma \left( t\right) =\int_{0}^{\infty }x^{t-1}e^{-x}dx,\ t> 0.$

The integrand $f(x)=x^{t-1}e^{-x}$ is smooth on $\left( 0,\infty \right) ,$ so its integrability is determined by its behavior at $\infty$ and $0$. Because of the exponential factor, it is integrable in the neighborhood of $\infty .$ The singularity at $0$ is integrable if $t>0.$ In all calculations involving the gamma function one should remember that its argument should be positive.

### Properties

1) Factorial-like property. Integration by parts shows that

$\Gamma \left( t\right) =-\int_{0}^{\infty }x^{t-1}\left( e^{-x}\right) ^{\prime }dx=-x^{t-1}e^{-x}|_{0}^{\infty }+\left( t-1\right) \int_{0}^{\infty }x^{t-2}e^{-x}dx$

$=\left( t-1\right) \Gamma \left( t-1\right)$ if $t>1.$

2) $\Gamma \left( 1\right) =1$ because $\int_{0}^{\infty }e^{-x}dx=1.$

3) Combining the first two properties we see that for a natural $n$

$\Gamma \left( n+1\right) =n\Gamma ( n) =...=n\times \left( n-1\right) ...\times 1\times \Gamma \left( 1\right) =n!$

Thus the gamma function extends the factorial to non-integer $t>0.$

4) $\Gamma \left( 1/2\right) =\sqrt{\pi }.$

Indeed, using the density $f_{z}$ of the standard normal $z$ we see that

$\Gamma \left( 1/2\right) =\int_{0}^{\infty }x^{-1/2}e^{-x}dx=$

(replacing $x^{1/2}=u$)

$=\int_{0}^{\infty }\frac{1}{u}e^{-u^{2}}2udu=2\int_{0}^{\infty }e^{-u^{2}}du=\int_{-\infty }^{\infty }e^{-u^{2}}du=$

(replacing $u=z/\sqrt{2}$)

$=\frac{\sqrt{\pi }}{\sqrt{2\pi }}\int_{-\infty }^{\infty }e^{-z^{2}/2}dz= \sqrt{\pi }\int_{R}f_{z}\left( t\right) dt=\sqrt{\pi }.$
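Properties 1, 3 and 4 are easy to confirm numerically. The sketch below relies on `math.gamma` from the Python standard library; the argument `t = 3.7` and the range of `n` are arbitrary illustrative choices.

```python
import math

# Property 1: Gamma(t) = (t - 1) * Gamma(t - 1) for t > 1
t = 3.7  # illustrative non-integer argument
recursion_gap = math.gamma(t) - (t - 1) * math.gamma(t - 1)

# Property 3: Gamma(n + 1) = n! for natural n
factorial_ok = all(
    math.isclose(math.gamma(n + 1), math.factorial(n)) for n in range(1, 15))

# Property 4: Gamma(1/2) = sqrt(pi)
half_gap = math.gamma(0.5) - math.sqrt(math.pi)

print(recursion_gap, factorial_ok, half_gap)
```

Both gaps should be zero up to floating-point rounding, and `factorial_ok` should be `True`.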

Many other properties are not required in this course.

25
Dec 21

## Analysis of problems with conditioning

These problems are among the most difficult. It's important to work out a general approach to such problems. All references are to J. Abdey, Advanced statistics: distribution theory, ST2133, University of London, 2021.

### General scheme

Step 1. Conditioning is usually suggested by the problem statement: $Y$ is conditioned on $X$.

Your life will be easier if you follow the notation used in the guide: use $p$ for probability mass functions (discrete variables) and $f$ for (probability) density functions (continuous variables).

a) If $Y|X$ and $X$ both are discrete (Example 5.1, Example 5.13, Example 5.18):

$p_{Y}\left( y\right) =\sum_{Set}p_{Y\vert X}\left( y\vert x\right) p_{X}\left( x\right) .$

b) If $Y|X$ and $X$ both are continuous (Activity 5.6):

$f_{Y}\left( y\right) =\int_{Set}f_{Y\vert X}\left( y\vert x\right) f_{X}\left( x\right) dx.$

c) If $Y|X$ is discrete, $X$ is continuous (Example 5.2, Activity 5.5):

$p_{Y}\left( y\right) =\int_{Set}p_{Y\vert X}\left( y\vert x\right) f_{X}\left( x\right) dx$

d) If $Y|X$ is continuous, $X$ is discrete (Activity 5.12):

$f_{Y}\left( y\right) =\sum_{Set}f_{Y\vert X}\left( y\vert x\right) p_{X}\left( x\right) .$

In all cases you need to figure out $Set$ over which to sum or integrate.

Step 2. Write out the conditional densities/probabilities with the same arguments.

Step 3. Reduce the result to one of the known distributions using the completeness axiom.

### Example 5.1

Let $X$ denote the number of hurricanes which form in a given year, and let $Y$ denote the number of these which make landfall. Suppose each hurricane has a probability $\pi$ of making landfall independent of other hurricanes. Given the number of hurricanes $x$, $Y$ can be thought of as the number of successes in $x$ independent and identically distributed Bernoulli trials. We can write this as $Y|X=x\sim Bin(x,\pi )$. Suppose we also have that $X\sim Pois(\mu )$. Find the distribution of $Y$ (noting that $X\geq Y$ ).

### Solution

Step 1. The number of hurricanes $X$ takes values $0,1,2,...$ and is distributed as Poisson. The number of landfalls for a given $X=x$ is binomial with values $y=0,...,x$. It follows that $Set=\{x:x\ge y\}$.

Write the general formula (the law of total probability) for case a):

$p_{Y}\left( y\right) =\sum_{x=y}^{\infty }p_{Y\vert X}\left( y\vert x\right) p_{X}\left( x\right) .$

Step 2. Specifying the distributions:

$p_{X}\left( x\right) =e^{-\mu }\frac{\mu ^{x}}{x!},$ where $x=0,1,2,...,$

and

$P\left( Bin\left( x,\pi \right) =y\right) =p_{Y\vert X}\left( y\vert x\right) =C_{x}^{y}\pi ^{y}\left( 1-\pi \right) ^{x-y}$ where $y\leq x.$

Step 3. Reduce the result to one of the known distributions:

$p_{Y}\left( y\right) =\sum_{x=y}^{\infty }C_{x}^{y}\pi ^{y}\left( 1-\pi \right) ^{x-y}e^{-\mu }\frac{\mu ^{x}}{x!}$

(pull out of the summation everything that does not depend on the summation variable $x$)

$=\frac{e^{-\mu }\mu ^{y}}{y!}\pi ^{y}\sum_{x=y}^{\infty }\frac{1}{\left( x-y\right) !}\left( \mu \left( 1-\pi \right) \right) ^{x-y}$

(replace $x-y=z$ to better see the structure)

$=\frac{e^{-\mu }\mu ^{y}}{y!}\pi ^{y}\sum_{z=0}^{\infty }\frac{1}{z!}\left( \mu \left( 1-\pi \right) \right) ^{z}$

(using the completeness axiom $\sum_{x=0}^{\infty }\frac{\mu ^{x}}{x!}=e^{\mu }$ for the Poisson variable)

$=\frac{e^{-\mu }}{y!}\left( \mu \pi \right) ^{y}e^{\mu \left( 1-\pi \right) }=\frac{e^{-\mu \pi }}{y!}\left( \mu \pi \right) ^{y}=p_{Pois(\mu \pi )}\left( y\right) .$
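The conclusion $Y\sim Pois(\mu \pi )$ can be verified by summing the series directly. This is a minimal sketch with assumed values $\mu =4,$ $\pi =0.3$; the infinite sum over $x$ is truncated at a hypothetical cutoff `x_max`, beyond which the Poisson probabilities are negligible for these values.

```python
import math

def pois_pmf(k, rate):
    """Poisson probability mass function."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def binom_pmf(y, x, p):
    """Binomial probability of y successes in x trials."""
    return math.comb(x, y) * p ** y * (1 - p) ** (x - y)

mu, pi_ = 4.0, 0.3  # illustrative values of the Poisson mean and landfall probability

def marginal_y(y, x_max=150):
    # Law of total probability: sum over x >= y, truncated at x_max
    return sum(binom_pmf(y, x, pi_) * pois_pmf(x, mu) for x in range(y, x_max + 1))

# Largest discrepancy between the summed series and the Pois(mu * pi) probabilities
max_gap = max(abs(marginal_y(y) - pois_pmf(y, mu * pi_)) for y in range(0, 11))
print(max_gap)
```

The printed discrepancy should be at the level of floating-point rounding.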

14
Dec 21

## Sum of random variables and convolution

### Link between double and iterated integrals

Why do we need this link? For simplicity consider the rectangle $A=\left\{ a\leq x\leq b,c\leq y\leq d\right\} .$ The integrals

$I_{1}=\underset{A}{\int \int }f(x,y)dydx$

and

$I_{2}=\int_{a}^{b}\left( \int_{c}^{d}f(x,y)dy\right) dx$

are both taken over the rectangle $A$ but they are not the same. $I_{1}$ is a double (two-dimensional) integral, meaning that its definition uses elementary areas, while $I_{2}$ is an iterated integral, where each of the one-dimensional integrals uses elementary segments. To make sense of this, you need to consult an advanced text in calculus. The difference notwithstanding, in good cases their values are the same. Putting aside the question of what a "good case" is, we concentrate on geometry: how a double integral can be expressed as an iterated integral.

It is enough to understand the idea in the case of an oval $A$ on the plane. Let $y=l\left( x\right)$ be the function that describes the lower boundary of the oval and let $y=u\left( x\right)$ be the function that describes the upper boundary. Further, let $m$ and $M$ be the minimum and maximum values of $x$ in the oval, so that the oval lies between the vertical lines $x=m$ and $x=M$ (see Chart 1).

Chart 1. The boundary of the oval above the green line is described by u(x) and below - by l(x)

We can paint the oval with strokes along red lines from $y=l\left( x\right)$ to $y=u\left(x\right) .$ If we do this for all $x\in \left[ m,M\right] ,$ we'll have painted the whole oval. This corresponds to the representation of $A$ as the union of segments $\left\{ y:l\left( x\right) \leq y\leq u\left( x\right) \right\}$ with $x\in \left[ m,M\right] :$

$A=\bigcup\limits_{m\leq x\leq M}\left\{ y:l\left( x\right) \leq y\leq u\left( x\right) \right\}$

and to the equality of integrals

(double integral)$\underset{A}{\int \int }f(s,t)dsdt=\int_{m}^{M}\left( \int_{l\left( x\right) }^{u(x)}f(x,y)dy\right) dx$ (iterated integral)
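The painting idea translates directly into a nested loop. The sketch below is an illustration under assumptions: the oval is taken to be the unit disk, so $l(x)=-\sqrt{1-x^{2}},$ $u(x)=\sqrt{1-x^{2}},$ $m=-1,$ $M=1,$ and both integrals are approximated by the midpoint rule. The function `iterated_integral` is a hypothetical helper, not a library routine.

```python
import math

def iterated_integral(f, m, M, lower, upper, n=400):
    """Approximate int_m^M ( int_{l(x)}^{u(x)} f(x, y) dy ) dx by nested midpoint rules."""
    hx = (M - m) / n
    total = 0.0
    for i in range(n):
        x = m + (i + 0.5) * hx            # position of the vertical stroke
        lo, hi = lower(x), upper(x)       # the stroke runs from l(x) to u(x)
        hy = (hi - lo) / n
        inner = hy * sum(f(x, lo + (j + 0.5) * hy) for j in range(n))
        total += inner * hx
    return total

def lower_boundary(x):
    return -math.sqrt(1.0 - x * x)

def upper_boundary(x):
    return math.sqrt(1.0 - x * x)

# f = 1 gives the area of the disk; f = x^2 + y^2 gives its polar moment
area = iterated_integral(lambda x, y: 1.0, -1.0, 1.0, lower_boundary, upper_boundary)
moment = iterated_integral(lambda x, y: x * x + y * y, -1.0, 1.0,
                           lower_boundary, upper_boundary)
print(area, moment)
```

For the unit disk the exact values are $\pi$ and $\pi /2,$ so the two printed numbers should be close to them; the accuracy is limited by the square-root behavior of the boundary near $x=\pm 1.$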

### Density of a sum of two variables

Assumption 1 Suppose the random vector $\left( X,Y\right)$ has a density $f_{X,Y}$ and define $Z=X+Y$ (unlike the convolution theorem below, here $X,Y$ don't have to be independent).

From the definitions of the distribution function $F_{Z}\left( z\right)=P\left( Z\leq z\right)$ and probability

$P\left( A\right) =\underset{A}{\int \int }f_{X,Y}(x,y)dxdy$

we have

$F_{Z}\left( z\right) =P\left( Z\leq z\right) =P\left( X+Y\leq z\right) = \underset{x+y\leq z}{\int \int }f_{X,Y}(x,y)dxdy.$

The integral on the right is a double integral. The painting analogy (see Chart 2)

Chart 2. Integration for sum of two variables

suggests that

$\left\{ (x,y)\in R^{2}:x+y\leq z\right\} =\bigcup\limits_{-\infty <x<\infty }\left\{ y:-\infty <y\leq z-x\right\} .$

Hence,

$\int_{-\infty }^{z}f_{Z}\left( t\right) dt=F_{Z}\left( z\right) =\int_{R}\left( \int_{-\infty }^{z-x}f_{X,Y}(x,y)dy\right) dx.$

Differentiating both sides with respect to $z$ we get

$f_{Z}\left( z\right) =\int_{R}f_{X,Y}(x,z-x)dx.$

If instead we take the inner integral with respect to $x$ and the outer integral with respect to $y,$ then similarly

$f_{Z}\left( z\right) =\int_{R}f_{X,Y}(z-y,y)dy.$
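To see the formula at work without independence, consider a hypothetical joint density $f_{X,Y}(x,y)=x+y$ on the unit square (zero elsewhere), which is not part of the original text. A hand calculation under this assumption gives $f_{Z}(z)=z^{2}$ for $0\leq z\leq 1$ and $f_{Z}(z)=z(2-z)$ for $1<z\leq 2.$ The sketch below compares this with a numerical evaluation of $\int_{R}f_{X,Y}(x,z-x)dx.$

```python
def joint_density(x, y):
    """Illustrative joint density x + y on the unit square; X and Y are dependent."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def midpoint(f, lo, hi, n=2_000):
    """Midpoint rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def density_of_sum(z):
    # f_Z(z) = int f_{X,Y}(x, z - x) dx; the support restricts x to [0, 1]
    return midpoint(lambda x: joint_density(x, z - x), 0.0, 1.0)

def closed_form(z):
    # Hand-computed density of Z = X + Y for this particular joint density
    if 0.0 <= z <= 1.0:
        return z * z
    if 1.0 < z <= 2.0:
        return z * (2.0 - z)
    return 0.0

for z in (0.3, 1.0, 1.6):
    print(z, density_of_sum(z), closed_form(z))

# A density must integrate to 1 over its support [0, 2]
total_mass = midpoint(density_of_sum, 0.0, 2.0, n=400)
print(total_mass)
```

The numerical and hand-computed values should agree up to quadrature error, and `total_mass` should be close to 1.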

Exercise. Suppose the random vector $\left( X,Y\right)$ has a density $f_{X,Y}$ and define $Z=X-Y.$ Find $f_{Z}.$ Hint: review my post on Leibniz integral rule.

### Convolution theorem

In addition to Assumption 1, let $X,Y$ be independent. Then $f_{X,Y}(x,y)=f_{X}(x)f_{Y}\left( y\right)$ and the above formula gives

$f_{Z}\left( z\right) =\int_{R}f_{X}(x)f_{Y}\left( z-x\right) dx.$

This is denoted as $\left( f_{X}\ast f_{Y}\right) \left( z\right)$ and called a convolution.

The following may help to understand this formula. The function $g(x)=f_{Y}\left( -x\right)$ is a density (it is non-negative and integrates to 1). Its graph is a mirror image of that of $f_{Y}$ with respect to the vertical axis. The function $h_{z}(x)=f_{Y}\left( z-x\right)$ is a shift of $g$ by $z$ along the horizontal axis. For fixed $z,$ it is also a density. Thus in the definition of convolution we integrate the product of two densities $f_{X}(x)f_{Y}\left( z-x\right) .$ Further, to understand the asymptotic behavior of $\left( f_{X}\ast f_{Y}\right) \left( z\right)$ when $\left\vert z\right\vert \rightarrow \infty$ imagine two bell-shaped densities $f_{X}(x)$ and $f_{Y}\left( z-x\right) .$ When $z$ goes to, say, infinity, the humps of those densities are spread apart more and more. The hump of one of them gets multiplied by small values of the other. That's why $\left(f_{X}\ast f_{Y}\right) \left( z\right)$ goes to zero, in a certain sense.

The convolution of two densities is always a density because it is non-negative and integrates to one:

$\int_{R}f_{Z}\left( z\right) dz=\int_{R}\left( \int_{R}f_{X}(x)f_{Y}\left( z-x\right) dx\right) dz=\int_{R}f_{X}(x)\left( \int_{R}f_{Y}\left( z-x\right) dz\right) dx$

Replacing $z-x=y$ in the inner integral we see that this is

$\int_{R}f_{X}(x)dx\int_{R}f_{Y}\left( y\right) dy=1.$
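The bell-shaped picture can be made concrete with two standard normal densities. The sketch below is an illustration, not part of the guide: it uses the well-known fact that the sum of two independent standard normals is $N(0,2),$ truncates the integration range to intervals outside which the integrands are negligible, and uses hypothetical helper names.

```python
import math

def normal_density(x, mean=0.0, var=1.0):
    """Density of the normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def midpoint(f, lo, hi, n=2_000):
    """Midpoint rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def convolution(f_x, f_y, z, lo=-12.0, hi=12.0):
    """(f_X * f_Y)(z) = int f_X(x) f_Y(z - x) dx, truncated to [lo, hi]."""
    return midpoint(lambda x: f_x(x) * f_y(z - x), lo, hi)

# As z grows the humps of f_X(x) and f_Y(z - x) move apart and the product shrinks
values = {z: convolution(normal_density, normal_density, z) for z in (0.0, 1.0, 3.0, 6.0)}
for z, v in values.items():
    print(z, v, normal_density(z, 0.0, 2.0))

# The convolution is itself a density: it integrates to 1
total_mass = midpoint(
    lambda z: convolution(normal_density, normal_density, z), -10.0, 10.0, n=200)
print(total_mass)
```

The convolution values should match the $N(0,2)$ density, decrease as $z$ moves away from zero, and `total_mass` should be close to 1.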