Mar 23

Full solution to Example 2.15 from the guide ST2134

Students tend to miss the ideas needed for this example for two reasons: in the guide there is a reference to ST104b Statistics 2, which nobody consults, and the short notation U of the statistic conceals the essence.

Recommendation: for the statistic U=T\left( X\right) always use T\left(X\right) rather than U. Similarly, if x=\left( x_{1},...,x_{n}\right) is the observed sample, use T\left( x\right) rather than u.

Example 2.15. Let X=\left( X_{1},X_{2},...,X_{n}\right) be a random sample (meaning i.i.d. variables) from a Pois(\lambda ) distribution, and let U=T(X)=\sum_{i=1}^{n}X_{i}. Show that T\left( X\right) is a sufficient statistic.

Solution. There are two ways to solve this problem. One is to use the definition of a sufficient statistic, and the other is to apply the factorization theorem. It is a good idea to announce at the very beginning which way you go. We apply the definition, so we have to show that the density of X conditional on T\left( X\right) does not depend on \lambda .

Step 1. The density of X_{i} is p_{X_{i}}\left( x_{i},\lambda \right)=e^{-\lambda }\frac{\lambda ^{x_{i}}}{x_{i}!}, x_{i}=0,1,2,... and by independence
p_{X_{1},...,X_{n}}\left( x_{1},...,x_{n},\lambda \right) =e^{-\lambda }  \frac{\lambda ^{x_{1}}}{x_{1}!}...e^{-\lambda }\frac{\lambda ^{x_{n}}}{x_{n}!  }=e^{-n\lambda }\frac{\lambda ^{\Sigma x_{i}}}{x_{1}!...x_{n}!}.

Step 2. We need to characterize the distribution of S_{n}=\sum_{i=1}^{n}X_{i}, and this is accomplished with the MGF. For one Poisson variable X we have
M_{X}\left( t\right) =Ee^{tX}=\sum_{x=0}^{\infty }e^{tx}e^{-\lambda }  \frac{\lambda ^{x}}{x!}=e^{-\lambda }\sum_{x=0}^{\infty }\frac{\left(  e^{t}\lambda \right) ^{x}}{x!}
=e^{-\lambda }e^{e^{t}\lambda }\sum_{x=0}^{\infty }e^{-e^{t}\lambda }\frac{  \left( e^{t}\lambda \right) ^{x}}{x!}=e^{\lambda \left( e^{t}-1\right) }.
Here we used the completeness axiom \sum_{x=0}^{\infty }e^{-\lambda _{1}}  \frac{\left( \lambda _{1}\right) ^{x}}{x!}=1 with \lambda_{1}=e^{t}\lambda .

Step 3. This result for the sum implies
M_{S_{n}}\left( t\right) =Ee^{t\Sigma X_{i}}=E\left(  e^{tX_{1}}...e^{tX_{n}}\right)=Ee^{tX_{1}}...Ee^{tX_{n}} (by independence)
=\left( Ee^{tX}\right) ^{n} (since X_{1},...,X_{n} have identical distribution)
=e^{n\lambda \left( e^{t}-1\right) } (by Step 2).
By Step 2 we know that X\sim Pois\left( n\lambda \right) implies M_{X}\left( t\right) =e^{n\lambda \left( e^{t}-1\right) }. As we just showed, the MGF of S_{n} is the same. The uniqueness theorem says:

if random variables X and Y have the same MGF, then their distributions are the same.

It follows that S_{n}\sim Pois\left( n\lambda \right) and that
p_{\Sigma X_{i}}\left( \Sigma x_{i},\lambda \right) =e^{-n\lambda }\frac{  \left( n\lambda \right) ^{\Sigma x_{i}}}{\left( \Sigma x_{i}\right) !},\ \Sigma x_{i}=0,1,2,...
(which is written as e^{-n\lambda }\frac{\left( n\lambda \right) ^{u}}{u!} in the guide, and to me this is not transparent).
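
The claim of Step 3 can also be checked numerically. The sketch below is not from the guide: it is a small Python illustration with arbitrarily chosen \lambda and n. It convolves the Poisson probabilities n times and compares the result with the Pois(n\lambda) probabilities on a truncated range of values.

```python
import math

def pois_pmf(k, lam):
    """Poisson probability P(X = k) for rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def convolve(p, q):
    """Probabilities of the sum of two independent variables, values 0..len(p)-1.

    For an index k only p[0..k] and q[0..k] are needed, so truncating the
    support does not distort the entries that are kept.
    """
    return [sum(p[j] * q[k - j] for j in range(k + 1)) for k in range(len(p))]

lam, n, K = 0.7, 4, 30   # illustrative values; K is the truncation point
single = [pois_pmf(k, lam) for k in range(K + 1)]

pmf_sum = single
for _ in range(n - 1):   # distribution of X_1 + ... + X_n
    pmf_sum = convolve(pmf_sum, single)

target = [pois_pmf(k, n * lam) for k in range(K + 1)]
max_gap = max(abs(a - b) for a, b in zip(pmf_sum, target))
print(max_gap)           # should be at the level of rounding error
```

This is only a check for particular numbers; the proof is the MGF argument above.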

Step 4. To check that the conditional density does not depend on the parameter, we recall that the conditional density along the level set simplifies to (see Guide, p.30)
P\left( X=x|U=u\right) =\frac{P\left( X=x\right) }{P\left( U=u\right) }
(no joint density in the numerator). In our situation the full expression for the ratio on the right is
\frac{P( X=x) }{P( U=u) } =\frac{p_{X_1,...,X_n}  ( x_1,...,x_n,\lambda ) }{p_{\Sigma X_i}( \Sigma  x_i,\lambda ) }=e^{-n\lambda }\frac{\lambda ^{\Sigma x_i}}{  x_1!...x_n!}\left( e^{-n\lambda }\frac{( n\lambda ) ^{\Sigma  x_i}}{( \Sigma x_i) !}\right) ^{-1}
=\frac{\left( \Sigma x_{i}\right) !}{n^{\Sigma  x_{i}}\prod\nolimits_{i=1}^{n}x_{i}!}.
As there is no \lambda in the result, \sum X_{i} is sufficient for \lambda .
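
To see the absence of \lambda at work, one can evaluate the ratio from Step 4 for several values of \lambda. The following Python sketch is my illustration, not part of the guide; the sample x and the values of \lambda are arbitrary. It compares the ratio with the closed-form expression obtained above.

```python
import math

def joint_pmf(x, lam):
    """Joint probability of an i.i.d. Poisson sample x (Step 1)."""
    p = 1.0
    for xi in x:
        p *= math.exp(-lam) * lam**xi / math.factorial(xi)
    return p

def sum_pmf(u, n, lam):
    """Probability that the sum of n Poisson variables equals u (Step 3)."""
    return math.exp(-n * lam) * (n * lam)**u / math.factorial(u)

def conditional(x, lam):
    """P(X = x | T(X) = T(x)) computed as the ratio from Step 4."""
    return joint_pmf(x, lam) / sum_pmf(sum(x), len(x), lam)

def closed_form(x):
    """(sum x_i)! / (n^{sum x_i} * prod x_i!), which contains no lambda."""
    n, u = len(x), sum(x)
    denom = n**u
    for xi in x:
        denom *= math.factorial(xi)
    return math.factorial(u) / denom

x = (2, 0, 3, 1)   # an arbitrary observed sample
values = [conditional(x, lam) for lam in (0.3, 1.0, 4.5)]
print(values, closed_form(x))
```

All three values agree with the closed form up to rounding, whatever \lambda is used.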

Exercise. Do Example 2.16 from the Guide following this format.

Jan 23

Excel for mass education

Problem statement

Covid, with its lockdowns, has posed a difficult question: how do you teach online and preclude cheating by students? How do you do that efficiently with a large number of students and without lowering the teaching standards? I think the answer depends on what you teach. Using Excel made my course very attractive because many students adore learning Excel functions.

Suggested solution

Last year I taught Financial Econometrics. The topic was Portfolio Optimization using the Sharpe ratio. The idea was to give the students Excel files with individual data sizes so that they have to do the calculations themselves. Those who tried to obtain a file from another student and send it to me under their own names were easily identified. I punished both the giver and receiver of the file. Some steps for assignment preparation and report checking may be very time consuming if you don’t automate them. In the following list, the starred steps are the ones that may take a lot of time with large groups of students.

Step 1. I download data for several stocks from Yahoo Finance and put them in one Excel file where I have the students’ list (Video 1).

Step 2. For each student I randomly choose the sample size for the chunk of data to be selected from the data I downloaded. The students are required to use the whole sample in their calculations.

Step 3*. For creating individual student files with assignments, I use a Visual Basic macro. It reads a student name, his or her sample size, creates an Excel file, pastes there the appropriate sample and saves the file under that student’s name (Video 2).

Step 4*. In Gmail I prepare messages with individual Excel files. Gmail has an option for scheduling emails (Video 3). Outlook.com also has this feature but it requires too many clicks.

Step 5. The test is administered using MS Teams. In the beginning of the test, I give the necessary oral instructions and post the assignment description (which is common for all students). The emails are scheduled to be sent 10 minutes after the session start. The time for the test is just enough to do calculations in Excel. I cannot control how the students do it nor can I see if they share screens to help each other. But I know that the task is difficult enough, so one needs to be familiar with the material in order to accomplish the task, even when one sees on the screen how somebody else is doing it.

Step 6*. Upon completion of the test, the students email me their files. The arrival times of the messages are recorded by Gmail. I have to check the files and post the grades (Video 4).

Skills to test

Portfolio Optimization involves the following steps.

a) For each stock one has to find daily rates of return.

b) Using arbitrary initial portfolio shares, the daily rates of return on the portfolio are calculated. I require the students to use matrix multiplication for this, which makes checking their work easier.

c) The daily rates of return on the portfolio are used to find the average return, standard deviation and Sharpe ratio for the portfolio. The fact that after all these calculations the students have to obtain a single number also simplifies verification.

d) Finally, the students have to optimize the portfolio shares using the Solver add-in.

The list above is just an example. The task can be expanded to check the knowledge of other elements of matrix algebra, Econometrics and/or Finance. In one of my assignments, I required my students to run a multiple regression. The Excel add-in called Data Analysis allows one to do that easily but my students were required to do everything using the matrix expression for the OLS estimator and also to report the results using Excel string functions.
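
For readers who want to see steps a) to c) outside Excel, here is a minimal Python sketch. It is not the assignment itself: the prices and the portfolio shares are made up, the risk-free rate is assumed to be zero, and the sample standard deviation is used. Step d), the optimization done with Solver, is not reproduced.

```python
import statistics

# Hypothetical closing prices: one row per day, one column per stock.
prices = [
    [100.0, 50.0, 20.0],
    [101.0, 49.5, 20.4],
    [102.5, 50.2, 20.1],
    [101.8, 50.9, 20.6],
    [103.0, 51.3, 20.9],
]
shares = [0.5, 0.3, 0.2]   # arbitrary initial portfolio shares, summing to one
risk_free = 0.0            # assumption: zero daily risk-free rate

def daily_returns(prices):
    """a) Simple daily rates of return p_t / p_{t-1} - 1 for each stock."""
    return [[today / yesterday - 1 for today, yesterday in zip(row, prev)]
            for prev, row in zip(prices, prices[1:])]

def portfolio_returns(returns, shares):
    """b) Matrix of returns times the vector of shares."""
    return [sum(r * w for r, w in zip(row, shares)) for row in returns]

def sharpe_ratio(series, risk_free=0.0):
    """c) (average return - risk-free rate) / sample standard deviation."""
    return (statistics.mean(series) - risk_free) / statistics.stdev(series)

returns = daily_returns(prices)
port = portfolio_returns(returns, shares)
sharpe = sharpe_ratio(port, risk_free)
print(round(sharpe, 4))
```

As in the Excel version, the whole calculation ends in a single number, which is what makes checking easy.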

To make my job easier, I partially or completely automate time-consuming operations. Arguably, everything can be completely automated using Power Automate promoted by Microsoft. Except for the macro, my home-made solutions are simpler.

Detailed explanations

How to make Gmail your mailto protocol handler

Video 1. Initial file

Video 2. Creating Excel individual files

Video 3. Scheduling emails

Video 4. How to quickly check students' work

Macro for creating files

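Sub CreateDataFiles()
' This needs a file with student names (column A), block sizes (column C)
' and data to choose data blocks from (columns F through M). All on sheet "block finec"
' It creates files with student names and individual data blocks
' If necessary, edit whatever you want, in particular the range address:
' R1C5 - upper left corner of the copied block
' "R" & blockSize & "C13" - lower right corner of the copied block
' blockSize is read off column C

' First select the cells with block sizes and then run the macro

' Files will be created and saved with student names
' Keyboard Shortcut: Ctrl+i
' Note: the copy, new-workbook, paste and close steps and the closing "Next"
' are reconstructed from the description in Step 3 above
Dim sizeCells As Range, cell As Range
Dim blockSize As Long
Dim studentName As String

Set sizeCells = Selection ' remember the selected block sizes before the selection moves
Application.ScreenUpdating = False
For Each cell In sizeCells.Cells

blockSize = cell.Value
studentName = cell.Offset(0, -2).Value

' Select and copy this student's block of data
Application.Goto Reference:="R1C5:R" & blockSize & "C13"
Selection.Copy

' Paste the block into a new workbook and save it under the student's name
Workbooks.Add
ActiveSheet.Paste
Application.CutCopyMode = False
ChDir "C:\Users\Student files"
ActiveWorkbook.SaveAs Filename:= _
"C:\Users\Student files\" & studentName & ".xlsx", _
FileFormat:=xlOpenXMLWorkbook, CreateBackup:=False
ActiveWorkbook.Close

' Return to the workbook with the student list
Workbooks("Stat 2 Spring 2022 list with emails.xlsm").Activate

Next cell
Application.ScreenUpdating = True
End Sub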

Oct 22

A problem to do once and never come back

There is a problem I gave on the midterm that does not require much imagination. Just know the definitions and do the technical work, so I was hoping we could put this behind us. Turned out we could not and thus you see this post.

Problem. Suppose the joint density of variables X,Y is given by

f_{X,Y}(x,y)=\left\{  \begin{array}{c}k\left( e^{x}+e^{y}\right) \text{ for }0<y<x<1, \\  0\text{ otherwise.}\end{array}\right.

I. Find k.

II. Find marginal densities of X,Y. Are X,Y independent?

III. Find conditional densities f_{X|Y},\ f_{Y|X}.

IV. Find EX,\ EY.

When solving a problem like this, the first thing to do is to give the theory. You may not be able to finish the long calculations without errors, but your grade will be determined by the opening theoretical remarks.

I. Finding the normalizing constant

Any density should satisfy the completeness axiom: the area under the density curve (or in this case the volume under the density surface) must be equal to one: \int \int f_{X,Y}(x,y)dxdy=1. The constant k chosen to satisfy this condition is called a normalizing constant. The integration in general is over the whole plane R^{2} and the first task is to express the above integral as an iterated integral. This is where the domain where the density is not zero should be taken into account. There is little you can do without geometry. One example of how to do this is here.

The shape of the area A=\left\{ (x,y):0<y<x<1\right\} is determined by a) the extreme values of x,y and b) the relationship between them. The extreme values are 0 and 1 for both x and y, meaning that A is contained in the square \left\{ (x,y):0<x,y\text{ and}\ x,y<1\right\} . The inequality y<x means that we cut out of this square the triangle below the line y=x (it is really the lower triangle because if from a point on the line y=x we move down vertically, x will stay the same and y will become smaller than x).

In the iterated integral:

a) the lower and upper limits of integration for the inner integral are the boundaries for the inner variable; they may depend on the outer variable but not on the inner variable.

b) the lower and upper limits of integration for the outer integral are the extreme values for the outer variable; they must be constant.

This is illustrated in Pane A of Figure 1.

Figure 1. Integration order

Always take the inner integral in parentheses to show that you are dealing with an iterated integral.

a) In the inner integral integrating over x means moving along blue arrows from the boundary x=y to the boundary x=1. The boundaries may depend on y but not on x because the outer integral is over y.

b) In the outer integral put the extreme values for the outer variable. Thus,

\underset{A}{\int \int }f_{X,Y}(x,y)dxdy=\int_{0}^{1}\left(\int_{y}^{1}f_{X,Y}(x,y)dx\right) dy.

Check that if we first integrate over y (vertically along red arrows, see Pane B in Figure 1) then the equation

\underset{A}{\int \int }f_{X,Y}(x,y)dxdy=\int_{0}^{1}\left(\int_{0}^{x}f_{X,Y}(x,y)dy\right) dx

holds.

In fact, from the definition A=\left\{ (x,y):0<y<x<1\right\} one can see that the inner interval for x is \left[ y,1\right] and for y it is \left[ 0,x\right] .
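
As an illustration of how the iterated integral is used, here is the calculation of the normalizing constant for the density of this problem, written out by hand (a worked sketch, integrating first over y and then over x; it uses the fact that xe^{x} is an antiderivative of xe^{x}+e^{x}):

```latex
\int_{0}^{1}\left( \int_{0}^{x}\left( e^{x}+e^{y}\right) dy\right) dx
=\int_{0}^{1}\left( xe^{x}+e^{x}-1\right) dx
=\left[ xe^{x}\right] _{0}^{1}-1
=e-1,
\qquad
k=\frac{1}{e-1}\approx 0.581977.
```

The condition k\left( e-1\right) =1 is just the completeness axiom applied to f_{X,Y}.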

II. Marginal densities

I can't say more about this than I said here.

The condition for independence of X,Y is f_{X,Y}\left( x,y\right)  =f_{X}\left( x\right) f_{Y}\left( y\right) (this is a direct analog of the independence condition for events P\left( A\cap B\right) =P\left( A\right) P\left( B\right) ). In words: the joint density decomposes into a product of individual densities.

III. Conditional densities

In this case the easiest is to recall the definition of conditional probability P\left( A|B\right) =\frac{P\left( A\cap B\right) }{P\left(B\right) }. The definition of conditional densities f_{X|Y},\ f_{Y|X} is quite similar:

(2) f_{X|Y}\left( x|y\right) =\frac{f_{X,Y}\left( x,y\right) }{f_{Y}\left(  y\right) },\ f_{Y|X}\left( y|x\right) =\frac{f_{X,Y}\left( x,y\right) }{f_{X}\left( x\right) }.

Of course, f_{Y}\left( y\right) ,f_{X}\left( x\right) here can be replaced by their expressions through the joint density, \int f_{X,Y}\left( x,y\right) dx and \int f_{X,Y}\left( x,y\right) dy.

IV. Finding expected values of X,Y

The usual definition EX=\int xf_{X}\left( x\right) dx takes an equivalent form using the marginal density:

EX=\int x\left( \int f_{X,Y}\left( x,y\right) dy\right) dx=\int \int  xf_{X,Y}\left( x,y\right) dydx.

Which equation to use is a matter of convenience.

Another replacement in the usual definition gives the definition of conditional expectations:

E\left( X|Y\right) =\int xf_{X|Y}\left( x|y\right) dx, E\left( Y|X\right)  =\int yf_{Y|X}\left( y|x\right) dy.

Note that these are random variables: E\left( X|Y=y\right) depends on y and E\left( Y|X=x\right) depends on x.

Solution to the problem

Being a lazy guy, for the problem this post is about I provide answers found in Mathematica:

I. k=0.581977

II. f_{X}\left( x\right) =k\left[ -1+e^{x}\left( 1+x\right) \right] , for x\in \left[ 0,1\right] , f_{Y}\left( y\right) =k\left( e-e^{y}y\right) , for y\in \left[ 0,1\right] .

It is readily seen that the independence condition is not satisfied.

III. The constant k cancels in the ratios, so f_{X|Y}\left( x|y\right) =\frac{e^{x}+e^{y}}{e-e^{y}y} for 0<y<x<1,

f_{Y|X}\left( y|x\right) =\frac{e^{x}+e^{y}}{-1+e^{x}\left( 1+x\right) } for 0<y<x<1.

IV. EX=0.709012, EY=0.372965.
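
If Mathematica is not at hand, the numbers in I and IV can be reproduced approximately with a few lines of Python. This is only a rough sketch using the midpoint rule on the iterated integral (inner integral over y from 0 to x, outer over x from 0 to 1); the function names and the number of steps are my choices.

```python
import math

def midpoint(f, a, b, steps=400):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def g(x, y):
    """Joint density without the constant k, valid on 0 < y < x < 1."""
    return math.exp(x) + math.exp(y)

def over_triangle(weight):
    """Iterated integral of weight(x, y) * g(x, y) over the triangle.

    The inner integral runs over y in (0, x), the outer one over x in (0, 1).
    """
    def inner(x):
        return midpoint(lambda y: weight(x, y) * g(x, y), 0.0, x)
    return midpoint(inner, 0.0, 1.0)

k = 1.0 / over_triangle(lambda x, y: 1.0)   # normalizing constant
EX = k * over_triangle(lambda x, y: x)      # expected value of X
EY = k * over_triangle(lambda x, y: y)      # expected value of Y
print(round(k, 4), round(EX, 4), round(EY, 4))
```

The printed values should agree with the Mathematica answers to about four decimal places.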

Oct 22

Marginal probabilities and densities

This is to help everybody, from those who study Basic Statistics up to Advanced Statistics ST2133.

Discrete case

Suppose in a box we have coins and banknotes of only two denominations: $1 and $5 (see Figure 1).

Box with cash

Figure 1. Illustration of two variables

We pull one out randomly. The division of cash by type (coin or banknote) divides the sample space (shown as a square, lower left picture) with probabilities p_{c} and p_{b} (they sum to one). The division by denomination ($1 or $5) divides the same sample space differently, see the lower right picture, with the probabilities to pull out $1 and $5 equal to p_{1} and p_{5}, resp. (they also sum to one). This is summarized in the tables

Variable 1: Cash type       Prob
coin                        p_{c}
banknote                    p_{b}

Variable 2: Denomination    Prob
$1                          p_{1}
$5                          p_{5}

Now we can consider joint events and probabilities (see Figure 2, where the two divisions are combined).

Box with cash

Figure 2. Joint probabilities

For example, if we pull out a random item it can be a coin and $1 and the corresponding probability is P\left(item=coin,\ item\ value=\$1\right) =p_{c1}. The two divisions of the sample space generate a new division into four parts. Then geometrically it is obvious that we have four identities:

Adding over denominations: p_{c1}+p_{c5}=p_{c}, p_{b1}+p_{b5}=p_{b},

Adding over cash types: p_{c1}+p_{b1}=p_{1}, p_{c5}+p_{b5}=p_{5}.

Formally, here we use additivity of probability for disjoint events

P\left( A\cup B\right) =P\left( A\right) +P\left( B\right) .

In words: we can recover the own (marginal) probabilities of variables 1 and 2 from the joint probabilities.


Suppose we have two discrete random variables X,Y taking values x_{1},...,x_{n} and y_{1},...,y_{m}, resp., and their own probabilities are P\left( X=x_{i}\right) =p_{i}^{X}, P\left(Y=y_{j}\right) =p_{j}^{Y}. Denote the joint probabilities P\left(X=x_{i},Y=y_{j}\right) =p_{ij}. Then we have the identities

(1) \sum_{j=1}^mp_{ij}=p_{i}^{X}, \sum_{i=1}^np_{ij}=p_{j}^{Y} (n+m equations).

In words: to obtain the marginal probability of one variable (say, Y) sum over the values of the other variable (in this case, X).

The name marginal probabilities is used for p_{i}^{X},p_{j}^{Y} because in the two-dimensional table they arise as a result of summing table entries along columns or rows and are displayed in the margins.
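
The summation in (1) is easy to try out on the cash example. In the Python sketch below the four joint probabilities are invented for illustration; the function sums them over the values of the other variable, exactly as one sums a two-dimensional table along rows or columns.

```python
# Hypothetical joint probabilities p_{c1}, p_{c5}, p_{b1}, p_{b5} (they sum to one).
joint = {
    ("coin", 1): 0.30,
    ("coin", 5): 0.10,
    ("banknote", 1): 0.20,
    ("banknote", 5): 0.40,
}

def marginal(joint, axis):
    """Sum joint probabilities over the other variable.

    axis=0 keeps the cash type, axis=1 keeps the denomination.
    """
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

by_type = marginal(joint, 0)     # p_c, p_b
by_denom = marginal(joint, 1)    # p_1, p_5
print(by_type, by_denom)
```

Each set of marginal probabilities again sums to one, as it should.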

Analogs for continuous variables with densities

Suppose we have two continuous random variables X,Y and their own densities are f_{X} and f_{Y}. Denote the joint density f_{X,Y}. Then replacing in (1) sums by integrals and probabilities by densities we get

(2) \int_R f_{X,Y}\left( x,y\right) dy=f_{X}\left( x\right) ,\ \int_R f_{X,Y}\left( x,y\right) dx=f_{Y}\left( y\right) .

In words: to obtain one marginal density (say, f_{Y}) integrate out the other variable (in this case, x).


May 22

Vector autoregression (VAR)

Suppose we are observing two stocks and their respective returns are x_{t},y_{t}. To take into account their interdependence, we consider a vector autoregression

(1) \left\{\begin{array}{c}  x_{t}=a_{1}x_{t-1}+b_{1}y_{t-1}+u_{t} \\  y_{t}=a_{2}x_{t-1}+b_{2}y_{t-1}+v_{t}\end{array}\right.

Try to repeat for this system the analysis from Section 3.5 (Application to an AR(1) process) of the Guide by A. Patton and you will see that the difficulties are insurmountable. However, matrix algebra allows one to overcome them, with proper adjustment.


A) Write this system in a vector format

(2) Y_{t}=\Phi Y_{t-1}+U_{t}.

What should be Y_{t},\Phi ,U_{t} in this representation?

B) Assume that the error U_{t} in (2) satisfies

(3) E_{t-1}U_{t}=0,\ EU_{t}U_{t}^{T}=\Sigma ,~EU_{t}U_{s}^{T}=0 for t\neq  s with some symmetric matrix \Sigma =\left(\begin{array}{cc}\sigma _{11} & \sigma _{12} \\\sigma _{12} & \sigma _{22}  \end{array}\right) .

What does this assumption mean in terms of the components of U_{t} from (2)? What is \Sigma if the errors in (1) satisfy

(4) E_{t-1}u_{t}=E_{t-1}v_{t}=0,~Eu_{t}^{2}=Ev_{t}^{2}=\sigma ^{2}, Eu_{s}u_{t}=Ev_{s}v_{t}=0 for t\neq s, Eu_{s}v_{t}=0 for all s,t?

C) Suppose (1) is stationary. The stationarity condition is expressed in terms of eigenvalues of \Phi but we don't need it. However, we need its implication:

(5) \det \left( I-\Phi \right) \neq 0.

Find \mu =EY_{t}.

D) Find Cov(Y_{t-1},U_{t}).

E) Find \gamma _{0}\equiv V\left( Y_{t}\right) .

F) Find \gamma _{1}=Cov(Y_{t},Y_{t-1}).

G) Find \gamma _{2}.


A) It takes some practice to see that with the notation

Y_{t}=\left(\begin{array}{c}x_{t} \\y_{t}\end{array}\right) , \Phi =\left(\begin{array}{cc}  a_{1} & b_{1} \\a_{2} & b_{2}\end{array}\right) , U_{t}=\left(  \begin{array}{c}u_{t} \\v_{t}\end{array}\right)

the system (1) becomes (2).

B) The equations in (3) look like this:

E_{t-1}U_{t}=\left(\begin{array}{c}E_{t-1}u_{t} \\  E_{t-1}v_{t}\end{array}\right) =0, EU_{t}U_{t}^{T}=\left(  \begin{array}{cc}Eu_{t}^{2} & Eu_{t}v_{t} \\Eu_{t}v_{t} & Ev_{t}^{2}  \end{array}\right) =\left(\begin{array}{cc}  \sigma _{11} & \sigma _{12} \\  \sigma _{12} & \sigma _{22}\end{array}  \right) ,

EU_{t}U_{s}^{T}=\left(\begin{array}{cc}  Eu_{t}u_{s} & Eu_{t}v_{s} \\Ev_{t}u_{s} & Ev_{t}v_{s}  \end{array}\right) =0.

Equalities of matrices are understood element-wise, so we get a series of scalar equations E_{t-1}u_{t}=0,...,Ev_{t}v_{s}=0 for t\neq s.

Conversely, the scalar equations from (4) give

E_{t-1}U_{t}=0,\ EU_{t}U_{t}^{T}=\left(\begin{array}{cc}  \sigma ^{2} & 0 \\0 & \sigma ^{2}\end{array}  \right) ,~EU_{t}U_{s}^{T}=0 for t\neq s.

C) (2) implies EY_{t}=\Phi EY_{t-1}+EU_{t}=\Phi EY_{t-1} or by stationarity \mu =\Phi \mu or \left( I-\Phi \right) \mu =0. Hence (5) implies \mu =0.

D) From (2) we see that Y_{t-1} depends only on I_{t-1} (information set at time t-1). Therefore by the LIE

Cov(Y_{t-1},U_{t})=E\left( Y_{t-1}-EY_{t-1}\right) U_{t}^{T}=E\left[ \left(  Y_{t-1}-EY_{t-1}\right) E_{t-1}U_{t}^{T}\right] =0,

Cov\left( U_{t},Y_{t-1}\right) =\left[ Cov(Y_{t-1},U_{t})\right] ^{T}=0.

E) Using the previous post

\gamma _{0}\equiv V\left( \Phi Y_{t-1}+U_{t}\right) =\Phi V\left(  Y_{t-1}\right) \Phi ^{T}+Cov\left( U_{t},Y_{t-1}\right) \Phi ^{T}+\Phi  Cov(Y_{t-1},U_{t})+V\left( U_{t}\right)

=\Phi \gamma _{0}\Phi ^{T}+\Sigma

(by stationarity and (3)). Thus, \gamma _{0}-\Phi \gamma _{0}\Phi  ^{T}=\Sigma and \gamma _{0}=\sum_{s=0}^{\infty }\Phi ^{s}\Sigma\left( \Phi  ^{T}\right) ^{s} (see previous post).

F) Using the previous result we have

\gamma _{1}=Cov(Y_{t},Y_{t-1})=Cov(\Phi Y_{t-1}+U_{t},Y_{t-1})=\Phi  Cov(Y_{t-1},Y_{t-1})+Cov(U_{t},Y_{t-1})

=\Phi Cov(Y_{t-1},Y_{t-1})=\Phi \gamma _{0}=\Phi \sum_{s=0}^{\infty }\Phi  ^{s}\Sigma\left( \Phi ^{T}\right) ^{s}.

G) Similarly,

\gamma _{2}=Cov(Y_{t},Y_{t-2})=Cov(\Phi Y_{t-1}+U_{t},Y_{t-2})=\Phi  Cov(Y_{t-1},Y_{t-2})+Cov(U_{t},Y_{t-2})

=\Phi Cov(Y_{t-1},Y_{t-2})=\Phi \gamma _{1}=\Phi ^{2}\sum_{s=0}^{\infty  }\Phi ^{s}\Sigma\left( \Phi ^{T}\right) ^{s}.
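
The formulas in E) to G) can be tried on numbers. The Python sketch below is an illustration under my own assumptions: \Phi and \Sigma are arbitrary 2x2 matrices (with \Phi having eigenvalues inside the unit circle, so that the series converges), and the infinite series is truncated after 200 terms. It checks that the truncated series satisfies \gamma _{0}=\Phi \gamma _{0}\Phi ^{T}+\Sigma and then computes \gamma _{1},\gamma _{2}.

```python
def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

# Illustrative parameter values, not taken from any data.
Phi = [[0.5, 0.1], [0.2, 0.3]]
Sigma = [[1.0, 0.3], [0.3, 2.0]]

def gamma0_series(Phi, Sigma, terms=200):
    """Truncated series: sum over s of Phi^s Sigma (Phi^T)^s."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    term = Sigma
    for _ in range(terms):
        total = add(total, term)
        term = matmul(matmul(Phi, term), transpose(Phi))
    return total

gamma0 = gamma0_series(Phi, Sigma)
rhs = add(matmul(matmul(Phi, gamma0), transpose(Phi)), Sigma)
residual = max(abs(gamma0[i][j] - rhs[i][j])
               for i in range(2) for j in range(2))

gamma1 = matmul(Phi, gamma0)   # F): gamma_1 = Phi gamma_0
gamma2 = matmul(Phi, gamma1)   # G): gamma_2 = Phi gamma_1
print(residual, gamma1, gamma2)
```

A residual at the level of rounding error indicates that the series indeed solves the equation from E).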

Autocorrelations require a little more effort and I leave them out.


Mar 22

Midterm Spring 2022

Blueprint for exam versions

This is the exam I administered in my class in Spring 2022. By replacing the Poisson distribution with other random variables the UoL examiners can obtain a large variety of versions with which to torture Advanced Statistics students. On the other hand, for the students the answers below can be a blueprint to fend off any assaults.

During the semester my students were encouraged to analyze and collect information in documents typed in Scientific Word or LyX. The exam was an open-book online assessment. Papers typed in Scientific Word or LyX were preferred and copying from previous analysis was welcomed. This policy would be my preference if I were to study a subject as complex as Advanced Statistics. The students were given just two hours on the assumption that they had done the preparations diligently. Below I give the model answers right after the questions.

Midterm Spring 2022

You have to clearly state all required theoretical facts. Number all equations that you need to use in later calculations and reference them as necessary. Answer the questions in the order they are asked. When you don't know the answer, leave some space. For each unexplained fact I subtract one point. Put your name in the file name.

In questions 1-9 X is the Poisson variable.

Question 1

Define X and derive the population mean and population variance of the sum S_{n}=\sum_{i=1}^{n}X_{i} where X_{i} is an i.i.d. sample from X.

Answer. X is defined by P\left( X=x\right) =e^{-\lambda }\frac{\lambda ^{x}}{x!},\  x=0,1,... Using EX=\lambda and Var\left( X\right) =\lambda (ST2133 p.80) we have

ES_{n}=\sum EX_{i}=n\lambda , Var\left( S_{n}\right) =\sum V\left(  X_{i}\right) =n\lambda

(by independence and identical distribution). [Some students derived EX=\lambda , Var\left( X\right) =\lambda instead of the respective equations for the sum.]

Question 2

Derive the MGF of the standardized sample mean.

Answer. Knowing this derivation is a must because it is a combination of three important facts.

a) Let z_{n}=\frac{\bar{X}-E\bar{X}}{\sigma \left( \bar{X}\right) }. Since \bar{X}=S_{n}/n, we have z_{n}=\frac{S_{n}/n-E\left( S_{n}/n\right) }{\sigma \left( S_{n}/n\right) }=\frac{S_{n}-ES_{n}}{\sigma \left( S_{n}\right) }, so standardizing \bar{X} and S_{n} gives the same result.

b) The MGF of S_{n} is expressed through the MGF of X:

M_{S_{n}}\left( t\right)  =Ee^{S_{n}t}=Ee^{X_{1}t+...+X_{n}t}=Ee^{X_{1}t}...e^{X_{n}t}=

(independence) =Ee^{X_{1}t}...Ee^{X_{n}t}= (identical distribution) =\left[ M_{X}\left( t\right) \right] ^{n}.

c) If X is a linear transformation of Y, X=a+bY, then

M_{X}\left( t\right) =Ee^{Xt}=Ee^{\left( a+bY\right) t}=e^{at}Ee^{Y\left( bt\right) }=e^{at}M_{Y}\left( bt\right) .

When answering the question we assume any i.i.d. sample from a population with mean \mu and population variance \sigma ^{2}:

Putting in c) a=-\frac{ES_{n}}{\sigma \left( S_{n}\right) }, b=\frac{1}{\sigma \left( S_{n}\right) } and using a) we get

M_{z_{n}}\left( t\right) =E\exp \left( \frac{S_{n}-ES_{n}}{\sigma \left(  S_{n}\right) }t\right) =e^{-ES_{n}t/\sigma \left( S_{n}\right)  }M_{S_{n}}\left( t/\sigma \left( S_{n}\right) \right)

(using b) and ES_{n}=n\mu , Var\left( S_{n}\right) =n\sigma ^{2})

=e^{-ES_{n}t/\sigma \left( S_{n}\right) }\left[ M_{X}\left( t/\sigma \left( S_{n}\right) \right) \right] ^{n}=e^{-n\mu t/\left( \sqrt{n}\sigma \right) }\left[ M_{X}\left( t/\left( \sqrt{n}\sigma \right) \right) \right] ^{n}.

This is a general result which for the Poisson distribution can be specified as follows. From ST2133, example 3.38 we know that M_{X}\left( t\right)=\exp \left( \lambda \left( e^{t}-1\right) \right) . Therefore, we obtain

M_{z_{n}}\left( t\right) =e^{-\sqrt{n\lambda }t}\left[ \exp \left( \lambda \left( e^{t/\sqrt{n\lambda }}-1\right) \right) \right] ^{n}=e^{-t\sqrt{n\lambda }+n\lambda \left( e^{t/\sqrt{n\lambda }}-1\right) }

(here \mu =\lambda and \sigma =\sqrt{\lambda }, so that \sqrt{n}\sigma =\sqrt{n\lambda }).

[Instead of M_{z_n} some students gave M_X.]

Question 3

Derive the cumulant generating function of the standardized sample mean.

Answer. Again, there are a couple of useful general facts.

I) Decomposition of MGF around zero. The series e^{x}=\sum_{i=0}^{\infty }  \frac{x^{i}}{i!} leads to

M_{X}\left( t\right) =Ee^{tX}=E\left( \sum_{i=0}^{\infty }\frac{t^{i}X^{i}}{  i!}\right) =\sum_{i=0}^{\infty }\frac{t^{i}}{i!}E\left( X^{i}\right)  =\sum_{i=0}^{\infty }\frac{t^{i}}{i!}\mu _{i}

where \mu _{i}=E\left( X^{i}\right) are moments of X and \mu  _{0}=EX^{0}=1. Differentiating this equation yields

M_{X}^{(k)}\left( t\right) =\sum_{i=k}^{\infty }\frac{t^{i-k}}{\left(  i-k\right) !}\mu _{i}

and setting t=0 gives the rule for finding moments from MGF: \mu  _{k}=M_{X}^{(k)}\left( 0\right) .

II) Decomposition of the cumulant generating function around zero. K_{X}\left( t\right) =\log M_{X}\left( t\right) can also be decomposed into its Taylor series:

K_{X}\left( t\right) =\sum_{i=0}^{\infty }\frac{t^{i}}{i!}\kappa _{i}

where the coefficients \kappa _{i} are called cumulants and can be found using \kappa _{k}=K_{X}^{(k)}\left( 0\right) . Since

K_{X}^{\prime }\left( t\right) =\frac{M_{X}^{\prime }\left( t\right) }{  M_{X}\left( t\right) } and K_{X}^{\prime \prime }\left( t\right) =\frac{  M_{X}^{\prime \prime }\left( t\right) M_{X}\left( t\right) -\left(  M_{X}^{\prime }\left( t\right) \right) ^{2}}{M_{X}^{2}\left( t\right) }

we have

\kappa _{0}=\log M_{X}\left( 0\right) =0, \kappa _{1}=\frac{M_{X}^{\prime  }\left( 0\right) }{M_{X}\left( 0\right) }=\mu _{1},

\kappa _{2}=\mu  _{2}-\mu _{1}^{2}=EX^{2}-\left( EX\right) ^{2}=Var\left( X\right) .

Thus, for any random variable X with mean \mu and variance \sigma ^{2} we have

K_{X}\left( t\right) =\mu t+\frac{\sigma ^{2}t^{2}}{2}+ terms of higher order for t small.
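
The rule \kappa _{k}=K_{X}^{(k)}\left( 0\right) can be tried numerically on the Poisson case, where K_{X}\left( t\right) =\lambda \left( e^{t}-1\right) is the logarithm of the MGF quoted in Question 2. The Python sketch below uses central finite differences; the value of \lambda and the step h are arbitrary choices of mine.

```python
import math

lam = 2.5   # illustrative Poisson parameter

def K(t):
    """Cumulant generating function of Pois(lam): log of exp(lam*(e^t - 1))."""
    return lam * math.expm1(t)

h = 1e-3    # step of the finite differences
kappa1 = (K(h) - K(-h)) / (2 * h)             # approximates K'(0), the mean
kappa2 = (K(h) - 2 * K(0.0) + K(-h)) / h**2   # approximates K''(0), the variance
print(kappa1, kappa2)
```

Both numbers come out close to \lambda, in agreement with EX=Var\left( X\right) =\lambda for the Poisson distribution.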

III) If X=a+bY then by c)

K_{X}\left( t\right) =K_{a+bY}\left( t\right) =\log \left[ e^{at}M_{Y}\left( bt\right) \right] =at+K_{Y}\left( bt\right) .

IV) By b)

K_{S_{n}}\left( t\right) =\log \left[ M_{X}\left( t\right) \right]  ^{n}=nK_{X}\left( t\right) .

Using III), z_{n}=\frac{S_{n}-ES_{n}}{\sigma \left( S_{n}\right) } and then IV) we have

K_{z_{n}}\left( t\right) =\frac{-ES_{n}}{\sigma \left( S_{n}\right) }  t+K_{S_{n}}\left( \frac{t}{\sigma \left( S_{n}\right) }\right) =\frac{-ES_{n}  }{\sigma \left( S_{n}\right) }t+nK_{X}\left( \frac{t}{\sigma \left(  S_{n}\right) }\right) .

For the last term on the right we use the approximation around zero from II):

K_{z_{n}}\left( t\right) =\frac{-ES_{n}}{\sigma \left( S_{n}\right) }  t+nK_{X}\left( \frac{t}{\sigma \left( S_{n}\right) }\right) \approx \frac{  -ES_{n}}{\sigma \left( S_{n}\right) }t+n\mu \frac{t}{\sigma \left(  S_{n}\right) }+n\frac{\sigma ^{2}}{2}\left( \frac{t}{\sigma \left(  S_{n}\right) }\right) ^{2}

=-\frac{n\mu }{\sqrt{n}\sigma }t+n\mu \frac{t}{\sqrt{n}\sigma }+n\frac{  \sigma ^{2}}{2}\left( \frac{t}{\sigma \left( S_{n}\right) }\right)  ^{2}=t^{2}/2.

[Important. Why are the above steps necessary? Passing from the series M_{X}\left( t\right) =\sum_{i=0}^{\infty }\frac{t^{i}}{i!}\mu _{i} to the series for K_{X}\left( t\right) =\log M_{X}\left( t\right) is not straightforward and can easily lead to errors. It is not advisable in the case of the Poisson to derive K_{z_{n}} from M_{z_{n}}\left( t\right) =e^{-t\sqrt{n\lambda }+n\lambda \left( e^{t/\sqrt{n\lambda }}-1\right) }.]

Question 4

Prove the central limit theorem using the cumulant generating function you obtained.

Answer. In the previous question we proved that around zero

K_{z_{n}}\left( t\right) \rightarrow \frac{t^{2}}{2}.

This implies that

(1) M_{z_{n}}\left( t\right) \rightarrow e^{t^{2}/2} for each t around zero.

But we know that for a normal X with mean \mu and variance \sigma ^{2} its MGF is M_{X}\left( t\right) =\exp \left( \mu t+\frac{\sigma ^{2}t^{2}}{2}\right) (ST2133 example 3.42) and hence for the standard normal

(2) M_{z}\left( t\right) =e^{t^{2}/2}.

Theorem (link between pointwise convergence of MGFs of \left\{  X_{n}\right\} and convergence in distribution of \left\{ X_{n}\right\} ) Let \left\{ X_{n}\right\} be a sequence of random variables and let X be some random variable. If M_{X_{n}}\left( t\right) converges for each t from a neighborhood of zero to M_{X}\left( t\right), then X_{n} converges in distribution to X.

Using (1), (2) and this theorem we finish the proof that z_{n} converges in distribution to the standard normal, which is the central limit theorem.
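
The convergence K_{z_{n}}\left( t\right) \rightarrow t^{2}/2 can be watched numerically for the Poisson distribution. The Python sketch below is an illustration with arbitrary t and \lambda. It uses the exact expression K_{z_{n}}\left( t\right) =\frac{-ES_{n}}{\sigma \left( S_{n}\right) }t+nK_{X}\left( \frac{t}{\sigma \left( S_{n}\right) }\right) from Question 3 with ES_{n}=n\lambda , \sigma \left( S_{n}\right) =\sqrt{n\lambda } and K_{X}\left( t\right) =\lambda \left( e^{t}-1\right) .

```python
import math

def K_X(t, lam):
    """Cumulant generating function of Pois(lam)."""
    return lam * math.expm1(t)

def K_zn(t, n, lam):
    """Exact cumulant generating function of the standardized sum."""
    mean_sum = n * lam              # E S_n
    sd_sum = math.sqrt(n * lam)     # sigma(S_n)
    return -mean_sum / sd_sum * t + n * K_X(t / sd_sum, lam)

t, lam = 0.8, 2.0                   # illustrative values
sizes = (10, 100, 1000, 10000)
gaps = [abs(K_zn(t, n, lam) - t * t / 2) for n in sizes]
print(gaps)
```

The gaps shrink as n grows, roughly like 1/\sqrt{n}, which is the numerical face of statement (1).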

Question 5

State the factorization theorem and apply it to show that U=\sum_{i=1}^{n}X_{i} is a sufficient statistic.

Answer. The solution is given on p.180 of ST2134. For x_{i}=0,1,2,... (i=1,...,n) the joint density is

(3) f_{X}\left( x,\lambda \right) =\prod\limits_{i=1}^{n}e^{-\lambda }  \frac{\lambda ^{x_{i}}}{x_{i}!}=\frac{\lambda ^{\Sigma x_{i}}e^{-n\lambda }}{\Pi _{i=1}^{n}x_{i}!}.

To satisfy the Fisher-Neyman factorization theorem set

g\left( \sum x_{i},\lambda \right) =\lambda ^{\Sigma x_{i}}e^{-n\lambda },\ h\left( x\right) =\frac{1}{\Pi _{i=1}^{n}x_{i}!}

and then we see that \sum x_{i} is a sufficient statistic for \lambda .

Question 6

Find a minimal sufficient statistic for \lambda stating all necessary theoretical facts.

Answer. Characterization of minimal sufficiency. A statistic T\left( X\right) is minimal sufficient if and only if the level sets of T coincide with the sets on which the ratio f_{X}\left( x,\theta \right) /f_{X}\left( y,\theta \right) does not depend on \theta .

From (3)

f_{X}\left( x,\lambda \right) /f_{X}\left( y,\lambda \right) =\frac{\lambda  ^{\Sigma x_{i}}e^{-n\lambda }}{\Pi _{i=1}^{n}x_{i}!}\left[ \frac{\lambda  ^{\Sigma y_{i}}e^{-n\lambda }}{\Pi _{i=1}^{n}y_{i}!}\right] ^{-1}=\lambda ^{  \left[ \Sigma x_{i}-\Sigma y_{i}\right] }\frac{\Pi _{i=1}^{n}y_{i}!}{\Pi  _{i=1}^{n}x_{i}!}.

The expression on the right does not depend on \lambda if and only if \Sigma x_{i}=\Sigma y_{i}. The last condition describes the level sets of T\left( X\right) =\sum X_{i}. Thus it is minimal sufficient.
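
A small numerical illustration of this criterion, with samples and values of \lambda chosen arbitrarily by me: the Python sketch computes the ratio of joint densities from (3) for two samples with equal sums and for two samples with different sums.

```python
import math

def joint_pmf(x, lam):
    """Joint probability (3) of an i.i.d. Poisson sample x."""
    p = 1.0
    for xi in x:
        p *= math.exp(-lam) * lam**xi / math.factorial(xi)
    return p

def ratio(x, y, lam):
    return joint_pmf(x, lam) / joint_pmf(y, lam)

x = (2, 0, 3)
y_same = (1, 1, 3)    # same sum as x
y_diff = (1, 1, 1)    # different sum
lams = (0.5, 1.0, 3.0)

same = [ratio(x, y_same, lam) for lam in lams]
diff = [ratio(x, y_diff, lam) for lam in lams]
print(same, diff)
```

In the first list the ratio is the same for every \lambda (it reduces to the ratio of the products of factorials); in the second it changes with \lambda.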

Question 7

Find the Method of Moments estimator of the population mean.

Answer. The idea of the method is to take some population property (for example, EX=\lambda ) and replace the population characteristic (in this case EX) by its sample analog (\bar{X}) to obtain a MM estimator. In our case \hat{\lambda}_{MM}=\bar{X}. [Try to do this for the Gamma distribution].

Question 8

Find the Fisher information.

Answer. From (3) in Question 5 the log-likelihood is

l_{X}\left( \lambda ,x\right) =-n\lambda +\sum x_{i}\log \lambda -\sum \log  \left( x_{i}!\right) .

Hence the score function is (see Example 2.30 in ST2134)

s_{X}\left( \lambda ,x\right) =\frac{\partial }{\partial \lambda }l_{X}\left( \lambda ,x\right) =-n+\frac{1}{\lambda }\sum x_{i}.

Differentiating once more,

\frac{\partial ^{2}}{\partial \lambda ^{2}}l_{X}\left( \lambda ,x\right) =-\frac{1}{\lambda ^{2}}\sum x_{i}

and the Fisher information is

I_{X}\left( \lambda \right) =-E\left( \frac{\partial ^{2}}{\partial \lambda ^{2}}l_{X}\left( \lambda ,X\right) \right) =\frac{1}{\lambda ^{2}}E\sum X_{i}=\frac{n\lambda }{\lambda ^{2}}=\frac{n}{\lambda }.
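The value n/\lambda can be confirmed by direct summation over the Poisson probabilities. This is a sketch under two assumptions that are not part of the answer above: the infinite sum is truncated at a cutoff, and the cross-check uses the standard identity that the information also equals the expected squared score (valid under the usual regularity conditions). The values of \lambda and n are arbitrary.

```python
import math

def poisson_pmf(x, lam):
    """Poisson pmf, computed on the log scale to avoid overflow in x!."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def expect(func, lam, cutoff=200):
    """E[func(X)] for X ~ Pois(lam); the sum is truncated at `cutoff`
    (assumption: the tail is negligible for moderate lam)."""
    return sum(func(x) * poisson_pmf(x, lam) for x in range(cutoff))

lam, n = 2.5, 10  # arbitrary illustration values

# Minus the expected second derivative: (1 / lam^2) * E sum X_i = n * E[X_1] / lam^2.
info_second_derivative = n * expect(lambda x: x / lam**2, lam)

# Cross-check: n times the expected squared score of a single observation.
info_squared_score = n * expect(lambda x: (-1 + x / lam) ** 2, lam)

print(info_second_derivative, info_squared_score, n / lam)
```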

Question 9

Derive the Cramer-Rao lower bound for V\left( \bar{X}\right) for a random sample.

Answer. (See Example 3.17 in ST2134) Since \bar{X} is an unbiased estimator of \lambda by Question 1, from the Cramer-Rao theorem we know that

V\left( \bar{X}\right) \geq \frac{1}{I_{X}\left( \lambda \right) }=\frac{\lambda }{n}

and in fact by Question 1 this lower bound is attained.

Feb 22

Estimation of parameters of a normal distribution

Here we show that knowing the distribution of s^{2} for linear regression lets us avoid the long calculations contained in the guide ST2134 by J. Abdey.

Theorem. Let y_{1},...,y_{n} be independent observations from N\left( \mu,\sigma ^{2}\right) . 1) s^{2}\left( n-1\right) /\sigma ^{2} is distributed as \chi _{n-1}^{2}. 2) The estimators \bar{y} and s^{2} are independent. 3) Es^{2}=\sigma ^{2}, 4) Var\left( s^{2}\right) =\frac{2\sigma ^{4}}{n-1}, 5) \frac{s^{2}-\sigma ^{2}}{\sqrt{2\sigma ^{4}/\left(n-1\right) }} converges in distribution to N\left( 0,1\right) .

Proof. We can write y_{i}=\mu +e_{i} where e_{i} is distributed as N\left( 0,\sigma ^{2}\right) . Putting \beta =\mu ,\ y=\left(y_{1},...,y_{n}\right) ^{T}, e=\left( e_{1},...,e_{n}\right) ^{T} and X=\left( 1,...,1\right) ^{T} (a vector of ones) we satisfy conditions (1) and (2) of the post on the distribution of the estimator of the error variance, with k=1. Since X^{T}X=n, we have \hat{\beta}=\bar{y}. Further,

r\equiv y-X\hat{\beta}=\left( y_{1}-\bar{y},...,y_{n}-\bar{y}\right) ^{T}

and, with k=1 regressor,

s^{2}=\left\Vert r\right\Vert ^{2}/\left( n-1\right) =\sum_{i=1}^{n}\left( y_{i}-\bar{y}\right) ^{2}/\left( n-1\right) .

Thus 1) and 2) follow from results for linear regression.

3) For a normal variable X its moment generating function is M_{X}\left( t\right) =\exp \left(\mu t+\frac{1}{2}\sigma ^{2}t^{2}\right) (see Guide ST2133, 2021, p.88). For the standard normal we get

M_{z}^{\prime }\left( t\right) =\exp \left(  \frac{1}{2}t^{2}\right) t, M_{z}^{\prime \prime }\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) (t^{2}+1),

M_{z}^{\prime \prime \prime}\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) (t^{3}+3t), M_{z}^{(4)}\left( t\right) =\exp \left( \frac{1}{2}t^{2}\right) (t^{4}+6t^{2}+3).

Applying the general property EX^{r}=M_{X}^{\left(  r\right) }\left( 0\right) (same guide, p.84) we see that

Ez=0, Ez^{2}=1, Ez^{3}=0, Ez^{4}=3,

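Var(z)=1, Var\left( z^{2}\right) =Ez^{4}-\left( Ez^{2}\right) ^{2}=3-1=2.

By 1), s^{2}\left( n-1\right) /\sigma ^{2} is distributed as z_{1}^{2}+...+z_{n-1}^{2} with independent standard normals z_{1},...,z_{n-1}. Therefore

Es^{2}=\frac{\sigma ^{2}}{n-1}E\left( z_{1}^{2}+...+z_{n-1}^{2}\right) =\frac{\sigma ^{2}}{n-1}\left( n-1\right) =\sigma ^{2}.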

4) By independence of standard normals

Var\left( s^{2}\right) = \left(\frac{\sigma ^{2}}{n-1}\right) ^{2}\left[ Var\left( z_{1}^{2}\right)  +...+Var\left( z_{n-1}^{2}\right) \right] =\frac{\sigma ^{4}}{\left(  n-1\right) ^{2}}2\left( n-1\right) =\frac{2\sigma ^{4}}{n-1}.

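5) By standardizing s^{2} we have \frac{s^{2}-Es^{2}}{\sigma \left(s^{2}\right) }=\frac{s^{2}-\sigma ^{2}}{\sqrt{2\sigma ^{4}/\left( n-1\right) }} and this converges in distribution to N\left( 0,1\right) by the central limit theorem, because s^{2} has the same distribution as the sum of n-1 independent identically distributed variables z_{i}^{2} (with finite variance) multiplied by \sigma ^{2}/\left( n-1\right) .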
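Statements 3) and 4) are easy to illustrate by simulation. The sketch below draws many normal samples, computes s^{2} for each and compares the empirical mean and variance of s^{2} with \sigma ^{2} and 2\sigma ^{4}/(n-1). The values of \mu , \sigma , n, the number of replications and the seed are arbitrary choices made here; a simulation only approximates the theoretical values.

```python
import random

def sample_variance(v):
    """Unbiased sample variance: divides by len(v) - 1."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

random.seed(0)               # fixed seed so that the run is reproducible
mu, sigma, n = 1.0, 2.0, 5   # arbitrary illustration values
reps = 20000                 # number of simulated samples

s2_values = []
for _ in range(reps):
    y = [random.gauss(mu, sigma) for _ in range(n)]
    s2_values.append(sample_variance(y))

mean_s2 = sum(s2_values) / reps
var_s2 = sample_variance(s2_values)

print(mean_s2, sigma**2)               # empirical vs theoretical mean of s^2
print(var_s2, 2 * sigma**4 / (n - 1))  # empirical vs theoretical variance of s^2
```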


Feb 22

Distribution of the estimator of the error variance

If you are reading the book by Dougherty: this post is about the distribution of the estimator  s^2 defined in Chapter 3.

Consider regression

(1) y=X\beta +e

where the deterministic matrix X is of size n\times k, satisfies \det  \left( X^{T}X\right) \neq 0 (regressors are not collinear) and the error e satisfies

(2) Ee=0,Var(e)=\sigma ^{2}I

\beta is estimated by \hat{\beta}=(X^{T}X)^{-1}X^{T}y. Denote P=X(X^{T}X)^{-1}X^{T}, Q=I-P. Using (1) we see that \hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e and the residual r\equiv y-X\hat{\beta}=Qe. \sigma^{2} is estimated by

(3) s^{2}=\left\Vert r\right\Vert ^{2}/\left( n-k\right) =\left\Vert  Qe\right\Vert ^{2}/\left( n-k\right) .

Q is a projector and has properties which are derived from those of P

(4) Q^{T}=Q, Q^{2}=Q.

If \lambda is an eigenvalue of Q, then multiplying Qx=\lambda x by Q and using the fact that x\neq 0 we get \lambda ^{2}=\lambda . Hence eigenvalues of Q can be only 0 or 1. The equation tr\left( Q\right) =n-k tells us that the number of eigenvalues equal to 1 is n-k and the remaining k are zeros. Let Q=U\Lambda U^{T} be the diagonal representation of Q. Here U is an orthogonal matrix,

(5) U^{T}U=I,

and \Lambda is a diagonal matrix with eigenvalues of Q on the main diagonal. We can assume that the first n-k numbers on the diagonal of \Lambda are ones and the others are zeros.
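The properties of P and Q used here can be verified on a small example. The following sketch assumes a made-up design matrix with an intercept and one regressor (n=4, k=2) and uses plain Python lists; the helper names matmul and transpose are introduced only for this illustration.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical design matrix: intercept plus one regressor, n = 4, k = 2.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 4.0]]
n, k = len(X), len(X[0])

# Invert the 2x2 matrix X^T X by the explicit formula.
XtX = matmul(transpose(X), X)
(a, b), (c, d) = XtX
det = a * d - b * c  # nonzero because the regressors are not collinear
XtX_inv = [[d / det, -b / det], [-c / det, a / det]]

P = matmul(matmul(X, XtX_inv), transpose(X))
Q = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(n)] for i in range(n)]

QQ = matmul(Q, Q)                         # should equal Q (idempotent)
PQ = matmul(P, Q)                         # should be the zero matrix
trace_Q = sum(Q[i][i] for i in range(n))  # should equal n - k

print(trace_Q, n - k)
```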

Theorem. Let e be normal. 1) s^{2}\left( n-k\right) /\sigma ^{2} is distributed as \chi _{n-k}^{2}. 2) The estimators \hat{\beta} and s^{2} are independent.

Proof. 1) We have by (4)

(6) \left\Vert Qe\right\Vert ^{2}=\left( Qe\right) ^{T}Qe=\left(  Q^{T}Qe\right) ^{T}e=\left( Qe\right) ^{T}e=\left( U\Lambda U^{T}e\right)  ^{T}e=\left( \Lambda U^{T}e\right) ^{T}U^{T}e.

Denote S=U^{T}e. From (2) and (5)

ES=0, Var\left( S\right) =EU^{T}ee^{T}U=\sigma ^{2}U^{T}U=\sigma ^{2}I

and S is normal as a linear transformation of a normal vector. It follows that S=\sigma z where z is a standard normal vector with independent standard normal coordinates z_{1},...,z_{n}. Hence, (6) implies

(7) \left\Vert Qe\right\Vert ^{2}=\sigma ^{2}\left( \Lambda z\right)  ^{T}z=\sigma ^{2}\left( z_{1}^{2}+...+z_{n-k}^{2}\right) =\sigma ^{2}\chi  _{n-k}^{2}.

(3) and (7) prove the first statement.

2) First we note that the vectors Pe,Qe are independent. Since they are normal, their independence follows from

cov(Pe,Qe)=EPee^{T}Q^{T}=\sigma ^{2}PQ=0.

It's easy to see that X^{T}P=X^{T}. This allows us to show that \hat{\beta} is a function of Pe:

\hat{\beta}=\beta +(X^{T}X)^{-1}X^{T}e=\beta +(X^{T}X)^{-1}X^{T}Pe.

Independence of Pe,Qe leads to independence of their functions \hat{\beta} and s^{2}.


Feb 22

Sufficiency and minimal sufficiency

Sufficient statistic

I find that in the notation of a statistic it is better to reflect the dependence on the argument. So I write T\left( X\right) for a statistic, where X is a sample, instead of a faceless U or V.

Definition 1. The statistic T\left( X\right) is called sufficient for the parameter \theta if the distribution of X conditional on T\left( X\right) does not depend on \theta .

The main results on sufficiency and minimal sufficiency become transparent if we look at them from the point of view of Maximum Likelihood (ML) estimation.

Let f_{X}\left( x,\theta \right) be the joint density of the vector X=\left( X_{1},...,X_{n}\right) , where \theta is a parameter (possibly a vector). The ML estimator is obtained by maximizing over \theta the function f_{X}\left( x,\theta \right) with x=\left(x_{1},...,x_{n}\right) fixed at the observed data. The estimator depends on the data and can be denoted \hat{\theta}_{ML}\left( x\right) .

Fisher-Neyman theorem. T\left( X\right) is sufficient for \theta if and only if the joint density can be represented as

(1) f_{X}\left( x,\theta \right) =g\left( T\left( x\right) ,\theta \right) k\left( x\right)

where, as the notation suggests, g depends on x only through T\left(x\right) and k does not depend on \theta .

Maximizing the left side of (1) is the same thing as maximizing g\left(T\left( x\right) ,\theta \right) because k does not depend on \theta . But this means that \hat{\theta}_{ML}\left( x\right) depends on x only through T\left( x\right) . A sufficient statistic is all you need to find the ML estimator. This interpretation is easier to understand than the definition of sufficiency.
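For a concrete illustration, take the Poisson model, where T\left( x\right) =\sum x_{i} is sufficient. The sketch below maximizes the log-likelihood over a grid of \lambda values (a crude stand-in for the ML estimator, chosen here for simplicity) for two made-up samples with the same sum. Both give the same maximizer.

```python
import math

def log_lik(x, lam):
    """Poisson log-likelihood of the sample x at lam."""
    return sum(-lam + xi * math.log(lam) - math.lgamma(xi + 1) for xi in x)

def ml_on_grid(x, grid):
    """Grid point with the largest log-likelihood."""
    return max(grid, key=lambda lam: log_lik(x, lam))

grid = [i / 100 for i in range(1, 1001)]  # 0.01, 0.02, ..., 10.00
x = [2, 0, 3, 1]  # hypothetical sample, T(x) = 6
y = [1, 1, 1, 3]  # different data, same T(y) = 6

print(ml_on_grid(x, grid), ml_on_grid(y, grid), sum(x) / len(x))
```

The two log-likelihoods differ only by the constant \log k\left( x\right) -\log k\left( y\right) , which is why the maximizers coincide.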

Minimal sufficient statistic

Definition 2. A sufficient statistic T\left( X\right) is called minimal sufficient if for any other sufficient statistic S\left( X\right) there exists a function g such that T\left( X\right) =g\left( S\left( X\right) \right) .

A level set is a set of type \left\{ x:T\left( x\right) =c\right\} , for a constant c (which in general can be a constant vector). See the visualization of level sets. A level set is also called a preimage and denoted T^{-1}\left( c\right) =\left\{ x:T\left(x\right) =c\right\} . When T is one-to-one the preimage contains just one point. When T is not one-to-one the preimage contains more than one point. The wider the preimage, the less information about the sample the statistic carries (because many data sets are mapped to a single point and you cannot tell one data set from another by looking at the value of the statistic). This is illustrated by the following example.

Example 1. On the plane R^2 define two statistics: U_1(X)=(X_1,X_2) and  U_2(X)=(X_1+X_2)/2 . For  U_1 the level set  \{U_1(x)=c=(c_1,c_2)\},\ c\in R^2 , consists of just one point and knowing the statistic value is equivalent to knowing the whole sample. For  U_2 the level set  \{U_2(x)=c\},\ c\in R is a straight line. If we know  U_2 , we know the sample mean but not the separate observations.

In the definition of the minimal sufficient statistic we have

\left\{x:T\left( x\right) =c\right\} =\left\{ x:g\left( S\left( x\right) \right)=c\right\} =\left\{ x:S\left( x\right) \in g^{-1}\left( c\right) \right\} .

Since g^{-1}\left( c\right) generally contains more than one point, this shows that the level sets of T\left( X\right) are generally wider than those of S\left( X\right) . Since this is true for any sufficient S\left( X\right) , T\left( X\right) carries less information about X than any other sufficient statistic.

Definition 2 is an existence statement and is difficult to verify directly because it involves the quantifiers "for any" and "there exists". Again it's better to relate it to ML estimation.

Suppose for two sets of data x,y there is a positive number k\left(x,y\right) such that

(2) f_{X}\left( x,\theta \right) =k\left( x,y\right) f_{X}\left( y,\theta\right) .

Maximizing the left side we get the estimator \hat{\theta}_{ML}\left(x\right) . Maximizing f_{X}\left( y,\theta \right) we get \hat{\theta}_{ML}\left( y\right) . Since k\left( x,y\right) does not depend on \theta , (2) tells us that

\hat{\theta}_{ML}\left( x\right) =\hat{\theta}_{ML}\left( y\right) .

Thus, if two sets of data x,y satisfy (2), the ML method cannot distinguish between x and y and supplies the same estimator. Let us call x,y indistinguishable if there is a positive number k\left( x,y\right) such that (2) is true.

An equation T\left( x\right) =T\left( y\right) means that x,y belong to the same level set.

Characterization of minimal sufficiency. A statistic T\left( X\right) is minimal sufficient if and only if its level sets coincide with sets of indistinguishable x,y.

The advantage of this formulation is that it relates a geometric notion of level sets to the ML estimator properties. The formulation in the guide by J. Abdey is:

A statistic T\left( X\right) is minimal sufficient if and only if the equality T\left( x\right) =T\left( y\right) is equivalent to (2).

Rewriting (2) as

(3) f_{X}\left( x,\theta \right) /f_{X}\left( y,\theta \right) =k\left(x,y\right)

we get a practical way of finding a minimal sufficient statistic: form the ratio on the left of (3) and find the sets along which the ratio does not depend on \theta . Those sets will be level sets of T\left( X\right) .

Dec 21

Chi-squared distribution

This post is intended to close a gap in J. Abdey's guide ST2133, namely the absence of distributions widely used in Econometrics.

Chi-squared with one degree of freedom

Let X be a random variable and let Y=X^{2}.

Question 1. What is the link between the distribution functions of Y and X?

Chart 1. Inverting a square function

The start is simple: just follow the definitions. F_{Y}\left( y\right)=P\left( Y\leq y\right) =P\left( X^{2}\leq y\right) . Assuming that y>0, on Chart 1 we see that \left\{ x:x^{2}\leq y\right\} =\left\{x: -\sqrt{y}\leq x\leq \sqrt{y}\right\} . Hence, using additivity of probability,

(1) F_{Y}\left( y\right) =P\left( -\sqrt{y}\leq X\leq \sqrt{y}\right)  =P\left( X\leq \sqrt{y}\right) -P\left( X<-\sqrt{y}\right)

=F_{X}\left( \sqrt{y}\right) -F_{X}\left( -\sqrt{y}\right) .

The last transition is based on the assumption that P\left( X<x  \right) =P\left( X\leq x\right) , for all x, which is maintained for continuous random variables throughout the guide by Abdey.

Question 2. What is the link between the densities of X and Y=X^{2}? By the Leibniz integral rule (1) implies

(2) f_{Y}\left( y\right) =f_{X}\left( \sqrt{y}\right) \frac{1}{2\sqrt{y}}  +f_{X}\left( -\sqrt{y}\right) \frac{1}{2\sqrt{y}}.

Exercise. Assuming that g is an increasing differentiable function with the inverse h and Y=g(X) answer questions similar to 1 and 2.

See the definition of \chi _{1}^{2}. Just applying (2) to X=z and   Y=z^{2}=\chi _{1}^{2} we get

f_{\chi _{1}^{2}}\left( y\right) =\frac{1}{\sqrt{2\pi }}e^{-y/2}\frac{1}{2  \sqrt{y}}+\frac{1}{\sqrt{2\pi }}e^{-y/2}\frac{1}{2\sqrt{y}}=\frac{1}{\sqrt{  2\pi }}y^{1/2-1}e^{-y/2},\ y>0.

Since \Gamma \left( 1/2\right) =\sqrt{\pi }, the procedure for identifying the gamma distribution gives

f_{\chi _{1}^{2}}\left( x\right) =\frac{1}{\Gamma \left( 1/2\right) }\left(  1/2\right) ^{1/2}x^{1/2-1}e^{-x/2}=f_{1/2,1/2}\left( x\right) .

We have derived the density of the chi-squared variable with one degree of freedom, see also Example 3.52, J. Abdey, Guide ST2133.
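As a numerical sanity check, the sketch below applies formula (2) to the standard normal density and compares the result with the gamma density f_{1/2,1/2}. The parametrization f_{\alpha ,\nu }\left( x\right) =\frac{\alpha ^{\nu }}{\Gamma \left( \nu \right) }x^{\nu -1}e^{-\alpha x} is inferred from the formula above; the function names and the evaluation points are chosen here for illustration.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def density_of_square(f_x, y):
    """Formula (2): density of Y = X^2 at y > 0, given the density f_x of X."""
    r = math.sqrt(y)
    return (f_x(r) + f_x(-r)) / (2 * r)

def gamma_pdf(x, alpha, nu):
    """Gamma density alpha^nu / Gamma(nu) * x^(nu - 1) * exp(-alpha * x), x > 0."""
    return alpha**nu / math.gamma(nu) * x ** (nu - 1) * math.exp(-alpha * x)

for y in (0.1, 1.0, 3.7):
    print(y, density_of_square(normal_pdf, y), gamma_pdf(y, 0.5, 0.5))
```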

General chi-squared

For \chi _{n}^{2}=z_{1}^{2}+...+z_{n}^{2} with independent standard normals z_{1},...,z_{n} we can write \chi _{n}^{2}=\chi _{1}^{2}+...+\chi _{1}^{2} where the chi-squared variables on the right are independent and all have one degree of freedom. This is because deterministic (here quadratic) functions of independent variables are independent.

Recall that the family of gamma densities with the same \alpha is closed under convolution: f_{\alpha ,\nu _{1}}\ast f_{\alpha ,\nu _{2}}=f_{\alpha ,\nu _{1}+\nu _{2}}. Then by the convolution theorem we get

f_{\chi _{n}^{2}}=f_{\chi _{1}^{2}}\ast ...\ast f_{\chi  _{1}^{2}}=f_{1/2,1/2}\ast ...\ast f_{1/2,1/2} =f_{1/2,n/2}=\frac{1}{\Gamma \left( n/2\right) 2^{n/2}}x^{n/2-1}e^{-x/2}.
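The convolution property can be illustrated numerically. The sketch below convolves two \chi _{2}^{2} densities by a midpoint rule and compares the result with the formula for \chi _{4}^{2}. Two degrees of freedom are used instead of one only for numerical convenience, since the \chi _{1}^{2} density is unbounded near zero; the evaluation points and the number of integration steps are arbitrary.

```python
import math

def chi2_pdf(x, n):
    """Chi-squared density with n degrees of freedom, x > 0 (formula above)."""
    return x ** (n / 2 - 1) * math.exp(-x / 2) / (math.gamma(n / 2) * 2 ** (n / 2))

def convolve_at(f, g, x, steps=2000):
    """Midpoint-rule approximation of the convolution of f and g at x,
    i.e. the integral of f(t) * g(x - t) over 0 < t < x."""
    h = x / steps
    return h * sum(f((i + 0.5) * h) * g(x - (i + 0.5) * h) for i in range(steps))

def f2(t):
    return chi2_pdf(t, 2)

for x in (0.5, 2.0, 6.0):
    print(x, convolve_at(f2, f2, x), chi2_pdf(x, 4))
```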