Feb 22

Sufficiency and minimal sufficiency

Sufficient statistic

I find that in the notation of a statistic it is better to reflect the dependence on the argument. So I write T\left( X\right) for a statistic, where X is a sample, instead of a faceless U or V.

Definition 1. The statistic T\left( X\right) is called sufficient for the parameter \theta if the distribution of X conditional on T\left( X\right) does not depend on \theta .

The main results on sufficiency and minimal sufficiency become transparent if we look at them from the point of view of Maximum Likelihood (ML) estimation.

Let f_{X}\left( x,\theta \right) be the joint density of the vector X=\left( X_{1},...,X_{n}\right) , where \theta is a parameter (possibly a vector). The ML estimator is obtained by maximizing over \theta the function f_{X}\left( x,\theta \right) with x=\left(x_{1},...,x_{n}\right) fixed at the observed data. The estimator depends on the data and can be denoted \hat{\theta}_{ML}\left( x\right) .

Fisher-Neyman theorem. T\left( X\right) is sufficient for \theta if and only if the joint density can be represented as

(1) f_{X}\left( x,\theta \right) =g\left( T\left( x\right) ,\theta \right) k\left( x\right)

where, as the notation suggests, g depends on x only through T\left(x\right) and k does not depend on \theta .

Maximizing the left side of (1) is the same thing as maximizing g\left(T\left( x\right) ,\theta \right) because k does not depend on \theta . But this means that \hat{\theta}_{ML}\left( x\right) depends on x only through T\left( x\right) . A sufficient statistic is all you need to find the ML estimator. This interpretation is easier to understand than the definition of sufficiency.
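The factorization (1) can be checked numerically. The sketch below is an illustration, not part of the original argument: it assumes an i.i.d. normal sample with unknown mean \theta and known variance 1, for which T\left( x\right) =x_{1}+...+x_{n}. The sample values and the function names (joint_density, g, k) are hypothetical, and the sample size n is treated as a known constant inside g.

```python
import math

def joint_density(x, theta):
    """Joint density of an i.i.d. N(theta, 1) sample x."""
    value = 1.0
    for xi in x:
        value *= math.exp(-0.5 * (xi - theta) ** 2) / math.sqrt(2 * math.pi)
    return value

def T(x):
    """Candidate sufficient statistic: the sample sum."""
    return sum(x)

def g(t, theta, n):
    """Factor that depends on the data only through t = T(x)."""
    return math.exp(theta * t - 0.5 * n * theta ** 2)

def k(x):
    """Factor that does not depend on theta."""
    n = len(x)
    return (2 * math.pi) ** (-n / 2) * math.exp(-0.5 * sum(xi ** 2 for xi in x))

# Two hypothetical samples with the same value of T.
x = [0.5, 1.5, 2.0]
y = [1.0, 1.0, 2.0]
thetas = [-1.0, 0.0, 0.7, 2.0]

# Check f(x, theta) = g(T(x), theta) * k(x) at several parameter values.
factorization_ok = all(
    math.isclose(joint_density(x, th), g(T(x), th, len(x)) * k(x))
    for th in thetas
)

# Because T(x) = T(y), the theta-dependent factor is the same for x and y,
# so both samples lead to the same maximizer over theta.
same_g = all(g(T(x), th, len(x)) == g(T(y), th, len(y)) for th in thetas)

print(factorization_ok, same_g)
```

In this model the maximizer of g over \theta is T\left( x\right) /n, so the two samples give the same ML estimate even though they differ.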

Minimal sufficient statistic

Definition 2. A sufficient statistic T\left( X\right) is called minimal sufficient if for any other sufficient statistic S\left( X\right) there exists a function g such that T\left( X\right) =g\left( S\left( X\right) \right) .

A level set is a set of the form \left\{ x:T\left( x\right) =c\right\} , for a constant c (which in general can be a constant vector). See the visualization of level sets. A level set is also called a preimage and denoted T^{-1}\left( c\right) =\left\{ x:T\left(x\right) =c\right\} . When T is one-to-one the preimage contains just one point. When T is not one-to-one some preimages contain more than one point. The wider the level set, the less information about the sample the statistic carries (because many data sets are mapped to a single point and you cannot tell one data set from another by looking at the value of the statistic). In the definition of the minimal sufficient statistic we have

\left\{ x:T\left( x\right) =c\right\} =\left\{ x:g\left( S\left( x\right) \right)=c\right\} =\left\{ x:S\left( x\right) \in g^{-1}\left( c\right) \right\} .

Since g^{-1}\left( c\right) generally contains more than one point, this shows that each level set of T\left( X\right) is a union of level sets of S\left( X\right) , so the level sets of T\left( X\right) are at least as wide as those of S\left( X\right) . Since this is true for any sufficient S\left( X\right) , T\left( X\right) carries no more information about X than any other sufficient statistic.

Definition 2 is an existence statement and is difficult to verify directly because it contains the quantifiers "for any" and "there exists". Again, it's better to relate it to ML estimation.

Suppose for two sets of data x,y there is a positive number k\left(x,y\right) , not depending on \theta , such that for all \theta

(2) f_{X}\left( x,\theta \right) =k\left( x,y\right) f_{X}\left( y,\theta\right) .

Maximizing the left side we get the estimator \hat{\theta}_{ML}\left(x\right) . Maximizing f_{X}\left( y,\theta \right) we get \hat{\theta}_{ML}\left( y\right) . Since k\left( x,y\right) does not depend on \theta , (2) tells us that

\hat{\theta}_{ML}\left( x\right) =\hat{\theta}_{ML}\left( y\right) .

Thus, if two sets of data x,y satisfy (2), the ML method cannot distinguish between x and y and supplies the same estimator. Let us call x,y indistinguishable if there is a positive number k\left( x,y\right) such that (2) is true.

An equation T\left( x\right) =T\left( y\right) means that x,y belong to the same level set.

Characterization of minimal sufficiency. A statistic T\left( X\right) is minimal sufficient if and only if its level sets coincide with sets of indistinguishable x,y.

The advantage of this formulation is that it relates a geometric notion of level sets to the ML estimator properties. The formulation in the guide by J. Abdey is:

A statistic T\left( X\right) is minimal sufficient if and only if the equality T\left( x\right) =T\left( y\right) is equivalent to (2).

Rewriting (2) as

(3) f_{X}\left( x,\theta \right) /f_{X}\left( y,\theta \right) =k\left(x,y\right)

we get a practical way of finding a minimal sufficient statistic: form the ratio on the left of (3) and find the sets along which the ratio does not depend on \theta . Those sets will be level sets of T\left( X\right) .
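As a worked illustration of this recipe (not taken from the guide), assume an i.i.d. sample from the exponential density \mu e^{-\mu t}, t>0, with all observations positive. The ratio in (3) then becomes:

```latex
\begin{align*}
\frac{f_{X}(x,\mu)}{f_{X}(y,\mu)}
  &= \frac{\mu^{n} e^{-\mu (x_{1}+\dots+x_{n})}}{\mu^{n} e^{-\mu (y_{1}+\dots+y_{n})}} \\
  &= \exp\left( -\mu \left[ (x_{1}+\dots+x_{n}) - (y_{1}+\dots+y_{n}) \right] \right).
\end{align*}
```

The right side does not depend on \mu if and only if x_{1}+...+x_{n}=y_{1}+...+y_{n}, in which case k\left( x,y\right) =1. Under these assumptions the sets of indistinguishable samples are the level sets of T\left( x\right) =x_{1}+...+x_{n}, so the sample sum (equivalently, the sample mean) is minimal sufficient.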

Mar 17

Maximum likelihood: idea and life of a bulb

Maximum likelihood: idea of the method and application to life of a bulb. Sometimes I plagiarize from my book.

Maximum likelihood idea

Figure 1. Maximum likelihood idea

The main idea of the maximum likelihood (ML) method is illustrated in Figure 1. We start with the sample depicted with points on the horizontal axis. Then we ask which of the densities shown in the figure is most likely to have generated that sample. Of course, it's the one on the left, filled with grey.

This density takes higher values at observed points than the other two. Note also that the position of the density is regulated by its parameters. This explains the main idea: choose the parameters so as to maximize the density at the observed points.


Step 1. A statistical model usually contains a random term. To describe that term, choose a density from some parametric family. Denote it f(x|\theta) where \theta is a parameter or a set of parameters. f(x_i|\theta) is the value of the density at the ith observation.

Step 2. Assume that observations are independent. Then the joint density is the product of the individual densities: f(x_1,...,x_n|\theta)=f(x_1|\theta)...f(x_n|\theta). Since the observations are fixed, the joint density is a function of just the parameters.

Definition. The joint density as a function of just parameters is called a likelihood function and denoted L(\theta|x_1,...,x_n)=f(x_1,...,x_n|\theta), to reflect the fact that the parameters are the main argument. The parameters that maximize the likelihood function, if they exist, are the maximum likelihood estimators, and thus the name Maximum Likelihood (ML) method.

Step 3. Since \log x is a strictly increasing function, the likelihood L(\theta |x_1,...,x_n) and the log-likelihood function \lambda(\theta)=\log L(\theta|x_1,...,x_n) are maximized at the same values of \theta. The likelihood is often a product of many factors, in which case maximizing the log-likelihood, which turns the product into a sum, is technically easier.
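The claim in Step 3 can be checked on a grid. The sketch below is only an illustration under stated assumptions: the four observations are hypothetical, the model is a normal density with unknown mean \theta and variance 1, and the maximization is a crude grid search rather than a proper optimizer.

```python
import math

data = [1.2, 0.7, 2.1, 1.6]            # hypothetical observations
grid = [i / 100 for i in range(301)]   # candidate values of theta in [0, 3]

def likelihood(theta):
    """Product of N(theta, 1) densities evaluated at the observations."""
    value = 1.0
    for x in data:
        value *= math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)
    return value

def log_likelihood(theta):
    """Sum of the log-densities: the log of the product above."""
    return sum(
        -0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi) for x in data
    )

# Grid points at which each function attains its largest value.
argmax_L = max(grid, key=likelihood)
argmax_logL = max(grid, key=log_likelihood)

print(argmax_L, argmax_logL)
```

Both searches pick the same grid point, which here is the sample mean of the hypothetical data.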

Comments. (1) Most of the time the likelihood function is difficult to maximize analytically. Then maximization is done on the computer. A numerical algorithm can give the solution only approximately. Moreover, the likelihood function may have no maximum at all or may have many maxima; in the former case the numerical procedure does not converge and in the latter the computer gives only one solution.

(2) One should distinguish models and estimation methods. The OLS method applied to the linear model gives OLS estimators. The ML method applied to the same linear model gives ML estimators. Most linear models are dealt with using the least squares method. All exercises for maximum likelihood require some algebra, as is seen from the algorithm.
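Comment (1) about multiple maxima can be illustrated with a small experiment. The sketch below makes assumptions that are not in the text: a Cauchy location model, a hypothetical two-point sample chosen so that the log-likelihood has two maxima, and plain gradient ascent with a fixed step as the numerical procedure.

```python
import math

data = [-3.0, 3.0]  # hypothetical sample chosen to produce two maxima

def log_lik(theta):
    """Cauchy location log-likelihood, up to an additive constant."""
    return -sum(math.log(1.0 + (x - theta) ** 2) for x in data)

def score(theta):
    """Derivative of log_lik with respect to theta."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in data)

def gradient_ascent(start, step=0.1, iterations=5000):
    """Move uphill along the derivative from the given starting value."""
    theta = start
    for _ in range(iterations):
        theta += step * score(theta)
    return theta

# The same algorithm started from two different points.
from_right = gradient_ascent(1.0)
from_left = gradient_ascent(-1.0)

print(from_right, from_left)
print(log_lik(from_right), log_lik(from_left))
```

For this sample the log-likelihood is symmetric about zero and has maxima at plus and minus the square root of 8. Each run reports only the maximum nearest its starting value, so in practice it helps to try several starting points.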

Example: life of a bulb

The life of a bulb is described by the exponential distribution

p(t)=0, if t\le0, and p(t)=\mu{e^{-\mu t}} if t>0

where \mu is a positive parameter. The life of a bulb cannot be negative, so the density is zero on the left half-axis. The density takes high values in the right neighborhood of the origin and quickly declines afterwards. That means that the probability that the bulb will burn out right after it's produced is the highest, but if it survives the first minutes (hours, days), it will serve for a while. Most electronic products behave like this.

Exercise. Derive the ML estimator of \mu.

Solution. Step 1. f(x_i|\mu)=\mu e^{-\mu x_i}.

Step 2. Assuming independent observations, the joint density is the product of these densities: \mu^{n}e^{-\mu x_1}...e^{-\mu x_n}.

Step 3. The log-likelihood function is \lambda=n\log\mu-\mu(x_1+...+x_n). The first order condition is \frac{\partial\lambda}{\partial\mu}=\frac{n}{\mu}-(x_1+...+x_n)=0 and its solution is \mu=\frac{n}{x_1+...+x_n}=\frac{1}{\bar{x}}. To make sure that this is a maximum, we need to check the second order condition. Since \frac{\partial^{2}\lambda }{\partial\mu^{2}}=-\frac{n}{\mu^2} is negative for every \mu>0, the log-likelihood is concave and we have indeed found the maximum.

Conclusion: \hat{\mu}_{ML}=\frac{1}{\bar{x}} is the ML estimator for \mu.
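The result can be sanity-checked by simulation. The sketch below is an illustration under assumptions of my own choosing: the true value \mu=2, the sample size, the seed and the function names are all hypothetical. It uses only the log-likelihood from Step 3 and the closed-form estimator from the conclusion.

```python
import math
import random

def exp_log_lik(mu, xs):
    """Log-likelihood n*log(mu) - mu*(x_1 + ... + x_n) from Step 3."""
    return len(xs) * math.log(mu) - mu * sum(xs)

def exp_mle(xs):
    """Closed-form ML estimator: one over the sample mean."""
    return len(xs) / sum(xs)

random.seed(0)
true_mu = 2.0                      # hypothetical true parameter
lifetimes = [random.expovariate(true_mu) for _ in range(10_000)]

mu_hat = exp_mle(lifetimes)

# The log-likelihood at mu_hat should be at least as large as at nearby values.
beats_neighbors = all(
    exp_log_lik(mu_hat, lifetimes) >= exp_log_lik(mu_hat * factor, lifetimes)
    for factor in (0.9, 0.99, 1.01, 1.1)
)

print(mu_hat, beats_neighbors)
```

With a sample of this size the estimate should land close to the assumed true value, and the comparison with neighboring values agrees with the second order condition.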