Apr 18

## Distribution function estimation

The theory of relativity says that what initially looks absolutely difficult may, on closer examination, turn out to be relatively simple. Here is one such topic. We start with a motivating example.

Large cloud service providers have huge data centers. A data center, being a large group of computer servers, typically requires extensive air conditioning. The intensity and cost of air conditioning depend on the temperature of the surrounding environment. If, as in our motivating example, we denote by $T$ the temperature outside and by $t$ a cut-off value, then a cloud service provider is interested in knowing the probability $P(T\le t)$ for different values of $t$. This is exactly the distribution function of temperature: $F_T(t)=P(T\le t)$. So how do you estimate it?

It comes down to the usual sampling. Fix some cut-off, for example $t=20$, and see for how many days in a year the temperature does not exceed 20. If the number of such days is, say, 200, then 200/365 will be the estimate of the probability $P(T\le 20)$.

It remains to dress this idea in mathematical clothes.

## Empirical distribution function

If an observation $T_i$ belongs to the event $\{T\le 20\}$, we count it as 1, otherwise we count it as zero. That is, we are dealing with a dummy variable

(1) $1_{\{T\le 20\}}=\left\{\begin{array}{ll}1,&T\le 20;\\0,&T>20.\end{array}\right.$

The total count is $\sum 1_{\{T_i\le 20\}}$ and this is divided by the total number of observations, which is 365, to get 200/365.
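This counting procedure is easy to sketch in code. The temperatures below are made-up numbers for illustration, not real data:

```python
# Hypothetical daily temperatures (made-up numbers for illustration).
temperatures = [14.2, 23.5, 19.9, 25.1, 8.7, 20.0, 31.4, 17.3]

t = 20  # cut-off value

# Indicator of the event {T_i <= t}: 1 if it occurred, 0 otherwise.
indicators = [1 if T <= t else 0 for T in temperatures]

# The estimate of P(T <= t) is the share of observations with T_i <= t.
estimate = sum(indicators) / len(temperatures)
print(estimate)  # 5 of 8 observations satisfy T <= 20, so 0.625
```

With a full year of data the list would have 365 entries and the division by `len(temperatures)` would give the 200/365 from the example above.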

It is important to realize that the variable in (1) is a coin (Bernoulli variable). For an unfair coin $C$ that takes the value 1 with probability $p$ and the value 0 with probability $1-p$, the mean is $EC=p\times 1+(1-p)\times 0=p$

and the variance is $Var(C)=EC^2-(EC)^2=p-p^2=p(1-p)$.

For the variable in (1) $p=P\{1_{\{T\le 20\}}=1\}=P(T\le 20)=F_T(20)$, so the mean and variance are

(2) $E1_{\{T\le 20\}}=F_T(20),\ Var(1_{\{T\le 20\}})=F_T(20)(1-F_T(20))$.
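The coin formulas can be checked by simulation. The sketch below, with an arbitrary choice of $p$, compares the sample mean and sample variance of many simulated tosses with $p$ and $p(1-p)$:

```python
import random

random.seed(0)
p = 0.55       # an arbitrary probability of 1, chosen for illustration
n = 100_000    # number of simulated tosses

# Simulate n tosses of an unfair coin C with P(C = 1) = p.
tosses = [1 if random.random() < p else 0 for _ in range(n)]

sample_mean = sum(tosses) / n
sample_var = sum((c - sample_mean) ** 2 for c in tosses) / n

print(sample_mean)  # close to p
print(sample_var)   # close to p * (1 - p) = 0.2475
```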

Generalizing, the probability $F_T(t)=P(T\le t)$ is estimated by

(3) $\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}$

where $n$ is the number of observations. Expression (3) is called the empirical distribution function because it is the direct empirical analog of $P(T\le t)$.
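Formula (3) translates directly into a short function. Here is a minimal sketch, with a made-up sample:

```python
def ecdf(sample, t):
    """Empirical distribution function (3): share of observations <= t."""
    return sum(1 for x in sample if x <= t) / len(sample)

# A made-up sample for illustration.
data = [3.1, 1.4, 2.7, 5.0, 4.2]
print(ecdf(data, 3.0))  # 2 of 5 observations are <= 3.0, so 0.4
```

Evaluating `ecdf` at every point $t$ traces out a step function that jumps by $1/n$ at each observation.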

Applying expectation to (3) and using the analog of (2) with 20 replaced by $t$, we prove unbiasedness of our estimator:

(4) $E\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}=P(T\le t)=F_T(t)$.

Further, assuming independent observations, we can find the variance of (3):

(5) $Var\left(\frac{1}{n}\sum_{i=1}^n 1_{\{T_i\le t\}}\right)$ (using homogeneity of degree 2) $=\frac{1}{n^2}Var\left(\sum_{i=1}^n 1_{\{T_i\le t\}}\right)$ (using independence) $=\frac{1}{n^2}\sum_{i=1}^nVar(1_{\{T_i\le t\}})$ (applying an equation similar to (2)) $=\frac{1}{n^2}\sum_{i=1}^nF_T(t)(1-F_T(t))=\frac{1}{n}F_T(t)(1-F_T(t)).$
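Formula (5) can also be verified by simulation. In the sketch below we assume, purely for illustration, that $T$ is uniform on $[0,40]$, so that $F_T(20)=0.5$; we compute the estimator (3) many times and compare the variance across replications with $F_T(t)(1-F_T(t))/n$:

```python
import random

random.seed(1)

n = 50          # sample size per replication
reps = 20_000   # number of replications
t = 20.0
F_t = 0.5       # F_T(20) under the illustrative uniform-on-[0, 40] assumption

# Compute the estimator (3) in each replication.
estimates = []
for _ in range(reps):
    sample = [random.uniform(0, 40) for _ in range(n)]
    estimates.append(sum(1 for x in sample if x <= t) / n)

# Variance of the estimator across replications.
mean_est = sum(estimates) / reps
var_est = sum((e - mean_est) ** 2 for e in estimates) / reps

theoretical = F_t * (1 - F_t) / n  # formula (5)
print(var_est, theoretical)        # the two numbers should be close
```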

Corollary. (4) and (5) can be used to prove that (3) is a consistent estimator of the distribution function, i.e., (3) converges to $F_T(t)$ in probability. Indeed, by (4) the estimator is unbiased, and by (5) its variance tends to zero as $n\to\infty$; Chebyshev's inequality then yields convergence in probability.
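Consistency is easy to see numerically. Keeping the illustrative assumption that $T$ is uniform on $[0,40]$ (so $F_T(20)=0.5$), the estimate tightens around the true value as the sample size grows:

```python
import random

random.seed(2)
t, F_t = 20.0, 0.5  # F_T(20) = 0.5 under the uniform-on-[0, 40] assumption

# The estimator (3) for increasing sample sizes.
for n in (10, 1_000, 100_000):
    sample = [random.uniform(0, 40) for _ in range(n)]
    est = sum(1 for x in sample if x <= t) / n
    print(n, est)  # the estimates approach F_T(20) = 0.5 as n grows
```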