Distribution function estimation
The relativity theory says that what initially looks absolutely difficult, on closer examination turns out to be relatively simple. Here is one such topic. We start with a motivating example.
Large cloud service providers have huge data centers. A data center, being a large group of computer servers, typically requires extensive air conditioning. The intensity and cost of air conditioning depend on the temperature of the surrounding environment. If, as in our motivating example, we denote by $T$ the temperature outside and by $t$ a cut-off value, then a cloud service provider is interested in knowing the probability $P(T\le t)$ for different values of $t$. This is exactly the distribution function of temperature: $F_T(t)=P(T\le t)$. So how do you estimate it?
It comes down to ordinary sampling. Fix some cut-off, for example $t=20$, and see for how many days in a year the temperature does not exceed 20. If the number of such days is, say, 200, then 200/365 will be the estimate of the probability $P(T\le 20)$.
It remains to dress this idea in mathematical clothes.
Empirical distribution function
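If an observation belongs to the event $\{T\le 20\}$, we count it as 1, otherwise we count it as zero. That is, we are dealing with a dummy variable
(1) $1_{\{T\le 20\}}=\begin{cases}1 & \text{if } T\le 20,\\ 0 & \text{if } T>20.\end{cases}$
The total count is $\sum_{i=1}^{365}1_{\{T_i\le 20\}}$, where $T_i$ is the temperature on day $i$, and this is divided by the total number of observations, which is 365, to get 200/365.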
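The counting procedure can be sketched in a few lines of Python. This is only an illustration: the temperatures are simulated from a normal distribution with an arbitrarily chosen mean and spread, so the resulting count will not be 200, and the names `temperatures`, `cutoff` and `indicators` are made up for this sketch.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: 365 simulated daily temperatures (not real measurements).
temperatures = [random.gauss(18.0, 8.0) for _ in range(365)]

cutoff = 20.0

# Dummy variable from (1): 1 if the temperature does not exceed the cut-off, else 0.
indicators = [1 if temp <= cutoff else 0 for temp in temperatures]

count = sum(indicators)               # number of days with temperature <= cutoff
estimate = count / len(temperatures)  # share of such days, the estimate of P(T <= 20)

print(count, estimate)
```

Summing the dummy variable and dividing by the number of days is all the estimate amounts to.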
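It is important to realize that the variable in (1) is a coin (Bernoulli variable). For an unfair coin $C$ with probability of 1 equal to $p$ and probability of zero equal to $1-p$ the mean is
$EC=1\cdot p+0\cdot(1-p)=p$
and the variance is
$\operatorname{Var}(C)=EC^2-(EC)^2=p-p^2=p(1-p).$
For the variable in (1) $p=P(T\le 20)=F_T(20)$, so the mean and variance are
(2) $E1_{\{T\le 20\}}=F_T(20),\quad \operatorname{Var}(1_{\{T\le 20\}})=F_T(20)(1-F_T(20)).$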
Generalizing, the probability $F_T(t)=P(T\le t)$ is estimated by
(3) $\hat{F}_T(t)=\frac{1}{n}\sum_{i=1}^{n}1_{\{T_i\le t\}}$
where $n$ is the number of observations. (3) is called an empirical distribution function because it is a direct empirical analog of $P(T\le t)$.
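A minimal Python sketch of the empirical distribution function follows. The function names `ecdf` and `make_ecdf` and the five-point toy sample are assumptions made for illustration. The second function is an optional variant that sorts the sample once, which is convenient when the estimate is needed at many cut-off values.

```python
from bisect import bisect_right

def ecdf(sample, t):
    """Share of observations in `sample` that do not exceed `t`."""
    if not sample:
        raise ValueError("sample must be non-empty")
    return sum(1 for x in sample if x <= t) / len(sample)

def make_ecdf(sample):
    """Return a function t -> ecdf(sample, t) that reuses one sorted copy."""
    if not sample:
        raise ValueError("sample must be non-empty")
    ordered = sorted(sample)
    n = len(ordered)

    def f(t):
        # bisect_right gives the number of sorted observations that are <= t
        return bisect_right(ordered, t) / n

    return f

# Toy sample (made-up temperatures)
sample = [12.5, 19.0, 20.0, 23.1, 27.4]
print(ecdf(sample, 20.0))  # 0.6, since three of the five values do not exceed 20

F_hat = make_ecdf(sample)
print([F_hat(t) for t in (10.0, 19.0, 25.0, 30.0)])
```

As a function of the cut-off, the estimate is a step function that jumps by 1/n at each observation (by more where observations coincide).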
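Applying expectation to (3) and using an equation similar to (2) (each observation $T_i$ is assumed to have the same distribution as $T$), we prove unbiasedness of our estimator:
(4) $E\hat{F}_T(t)=\frac{1}{n}\sum_{i=1}^{n}E1_{\{T_i\le t\}}=\frac{1}{n}\cdot nF_T(t)=F_T(t).$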
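Further, assuming independent observations we can find the variance of (3):
(5) $\operatorname{Var}(\hat{F}_T(t))=\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n}1_{\{T_i\le t\}}\right)$
(using homogeneity of degree 2) $=\frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^{n}1_{\{T_i\le t\}}\right)$
(using independence of observations) $=\frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(1_{\{T_i\le t\}})$
(applying an equation similar to (2)) $=\frac{1}{n^2}\cdot nF_T(t)(1-F_T(t))=\frac{1}{n}F_T(t)(1-F_T(t)).$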
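The mean and variance results can be checked by simulation. The sketch below is an illustration under assumptions that are not part of the text: observations are drawn independently from the uniform distribution on [0, 1], whose distribution function at a cut-off t in [0, 1] equals t, and the sample size, cut-off and number of replications are chosen arbitrarily.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

n = 50                # observations per sample
t = 0.3               # cut-off
replications = 20000  # number of independent samples
true_f = t            # for Uniform(0, 1), F(t) = t when 0 <= t <= 1

estimates = []
for _ in range(replications):
    sample = [random.random() for _ in range(n)]
    # empirical distribution function of this sample, evaluated at t
    estimates.append(sum(1 for x in sample if x <= t) / n)

mc_mean = statistics.fmean(estimates)     # should be close to F(t)
mc_var = statistics.pvariance(estimates)  # should be close to F(t)(1 - F(t))/n
theory_var = true_f * (1 - true_f) / n

print("mean of estimates:", mc_mean, "theory:", true_f)
print("variance of estimates:", mc_var, "theory:", theory_var)
```

Across replications the average of the estimates should sit near the true value, and their spread should sit near the true value times one minus the true value, divided by the sample size.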
Corollary. (4) and (5) can be used to prove that (3) is a consistent estimator of the distribution function, i.e., (3) converges to $F_T(t)$ in probability.
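One standard route to the corollary, offered here as a sketch and not necessarily the proof the author has in mind, is Chebyshev's inequality. Since the estimator is unbiased, the probability that it deviates from the true value by at least eps is at most its variance divided by eps squared. The variance in (5) is p(1-p)/n with p between 0 and 1, and p(1-p) never exceeds 1/4, so the bound is at most 1/(4 n eps^2), which tends to zero as n grows. The helper name `chebyshev_bound` is made up for this illustration.

```python
def chebyshev_bound(n, eps):
    """Upper bound on P(|estimate - true value| >= eps), valid at any cut-off.

    Uses variance <= 1/(4n), which follows from p(1 - p) <= 1/4.
    """
    if n <= 0 or eps <= 0:
        raise ValueError("n and eps must be positive")
    return min(1.0, 1.0 / (4.0 * n * eps ** 2))

# The bound shrinks toward zero as the number of observations grows.
for n in (100, 1000, 10000, 100000):
    print(n, chebyshev_bound(n, 0.05))
```

The bound is crude, but it is enough for convergence in probability: for any fixed eps it goes to zero as n goes to infinity.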