Large Deviations 4 – Sanov’s Theorem

Although we could have defined things for a more general topological space, most of our discussion of Cramér's theorem, and of the Gärtner-Ellis theorem which generalises it, has concerned means of real-valued random variables. For Cramér's theorem, we genuinely are interested only in means of i.i.d. random variables. In Gärtner-Ellis, one might say that we are able to relax the independence condition, and perhaps the identical distribution too, in a controlled way. But this somewhat undersells the theorem: using G-E, we can deal with a much broader class of measures than just means of collections of variables. The key is that convergence of the log moment generating functions is exactly enough to give an LDP with some rate, and we have a general method for finding the rate function.

So Gärtner-Ellis provides a fairly substantial generalisation of Cramér's theorem, but is still similar in flavour. But what if we look for additional properties of a collection of i.i.d. random variables (X_n)? After all, the mean is not the only interesting property. One thing we could look at is the actual values taken by the X_i. If the underlying distribution is continuous, this is not going to give much more information than what we started with. With probability 1, \{X_1,\ldots,X_n\} is a set of size n, with distribution given by the product of the underlying measure. However, if the random variables take values in a discrete set, or better still a finite set, then (X_1,\ldots,X_n) gives a so-called empirical distribution.

As n grows towards infinity, we expect this empirical distribution to approximate the true underlying distribution fairly well. This isn't necessarily as easy as it sounds. By the strong law of large numbers applied to the indicator functions 1(X_i\leq t), the empirical cdf at t converges almost surely to the true cdf at t. Guaranteeing that this convergence is uniform in t is tricky in general (for reference, see the Glivenko-Cantelli theorem), but it is clear for random variables defined on finite sets, and it seems reasonable that an extension to discrete sets should be possible.
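As a quick numerical illustration of this uniform convergence (a minimal sketch, assuming numpy and scipy are available), we can track the sup-distance between the empirical cdf of a standard normal sample and the true cdf; the supremum over t is attained at a sample point, where we compare the left and right limits of the empirical cdf.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)

    def sup_cdf_distance(sample):
        # Sup over t of |empirical cdf - true cdf| for a standard normal
        # sample; attained at a sample point, from the left or the right.
        x = np.sort(sample)
        n = len(x)
        ecdf_right = np.arange(1, n + 1) / n  # value just after each point
        ecdf_left = np.arange(0, n) / n       # value just before each point
        true_cdf = norm.cdf(x)
        return max(np.abs(ecdf_right - true_cdf).max(),
                   np.abs(ecdf_left - true_cdf).max())

    for n in [10, 100, 1000, 10000]:
        print(n, sup_cdf_distance(rng.standard_normal(n)))

The printed distances shrink roughly like n^{-1/2}, consistent with the Glivenko-Cantelli theorem.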

So such empirical distributions might well admit an LDP. Note that in the case of Bernoulli random variables, the empirical distribution is in fact exactly equivalent to the empirical mean, so Cramér's theorem applies. But in fact we have a general LDP for empirical distributions. I claim that the main point of interest here is the nature of the rate function – I will discuss why the existence of an LDP is not too surprising at the end.

The rate function is going to be interesting whatever form it ends up taking. After all, it is effectively going to act as some sort of distance on measures, as it records how far a possible empirical measure is from the true distribution. Apart from total variation distance, we don't currently have many standard examples of metrics on a space of measures. Anyway, the rate function is the main content of Sanov's theorem, which comes in various forms, depending on how fiddly you are prepared for the proof to be.

Define L_n:=\frac{1}{n}\sum_{i=1}^n \delta_{X_i}\in\mathcal{M}_1(E) to be the empirical measure generated by X_1,\ldots,X_n. Then L_n satisfies an LDP on \mathcal{M}_1(E) with rate n and rate function given by H(\cdot|\mu), where \mu is the underlying distribution.
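Concretely, on a finite set the empirical measure is just the vector of sample proportions. A minimal sketch (on a hypothetical three-point space, assuming numpy is available):

    from collections import Counter
    import numpy as np

    rng = np.random.default_rng(0)

    # A hypothetical underlying distribution mu on E = {0, 1, 2}.
    E = [0, 1, 2]
    mu = [0.5, 0.3, 0.2]

    n = 1000
    sample = rng.choice(E, size=n, p=mu)

    # The empirical measure L_n gives each point its sample proportion.
    counts = Counter(sample)
    L_n = {x: counts[x] / n for x in E}
    print(L_n)  # close to mu for large n, by the law of large numbers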

The function H is the relative entropy, defined by:

H(\nu|\mu):=\int_E \log\frac{d\nu}{d\mu}(x)\,d\nu(x),

whenever \nu\ll\mu, and \infty otherwise. We can see from the statement of the LDP why this absolute continuity condition is required. If the underlying distribution \mu has measure zero on some set A, then with probability 1 none of the observed values lie in A, and so the empirical measure is zero on A also.

Note that an alternative form is:

H(\nu|\mu)=\int_E \frac{d\nu}{d\mu}(x)\log\frac{d\nu}{d\mu}(x)\,d\mu(x)=\mathbb{E}_\mu\left[\frac{d\nu}{d\mu}\log\frac{d\nu}{d\mu}\right].

In this form it is perhaps clearer why this is a sensible quantity to minimise: the function t\mapsto t\log t is convex and vanishes at t=1, so by Jensen's inequality H(\nu|\mu)\geq 0, with equality precisely when \nu=\mu.
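For example, returning to the Bernoulli setup mentioned above: if \mu is the Bernoulli(p) distribution and \nu is Bernoulli(a), then

H(\nu|\mu)=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p},

which is non-negative and vanishes exactly when a=p.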

In particular, if we want to know the most likely asymptotic empirical distribution inducing a large deviation of the empirical mean (as in Cramér), then we look for the distribution with the required mean and smallest entropy relative to the true underlying distribution.
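Continuing the Bernoulli example: an empirical distribution on \{0,1\} is determined by its mean, so the most likely route to an empirical mean a is the Bernoulli(a) distribution, and the rate H(\mathrm{Ber}(a)|\mathrm{Ber}(p)) computed above coincides with the Legendre transform of \lambda\mapsto\log(pe^\lambda+1-p), that is, with the rate function from Cramér's theorem.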

A remark on the proof. If the underlying set of values is finite, then a proof of this result is essentially combinatorial. The vector of counts (nL_n(x))_{x\in E} has a multinomial distribution, so we can obtain exact forms for everything and then proceed with asymptotic approximations.
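To sketch the key computation: if \nu is a possible empirical measure for sample size n (so each n\nu(x) is an integer), then

\mathbb{P}(L_n=\nu)=\frac{n!}{\prod_{x\in E}(n\nu(x))!}\prod_{x\in E}\mu(x)^{n\nu(x)},

and Stirling's approximation \log n!=n\log n-n+O(\log n) gives

\frac{1}{n}\log\mathbb{P}(L_n=\nu)=-\sum_{x\in E}\nu(x)\log\frac{\nu(x)}{\mu(x)}+O\left(\tfrac{\log n}{n}\right)=-H(\nu|\mu)+O\left(\tfrac{\log n}{n}\right).

Since the number of possible empirical measures grows only polynomially in n, this pointwise estimate can be upgraded to the full LDP.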

I said earlier that I would comment on why the LDP is not too surprising even in general, once we know Gärtner-Ellis. Instead of letting X_i take values in whatever space we were considering previously, say the reals, consider instead the point mass \delta_{X_i}, which is effectively exactly the same random variable, only now taking values in the space of probability measures. The empirical measure is then exactly:

\frac{1}{n}\sum_{i=1}^n \delta_{X_i}.

If the support K of the (X_i) is finite, then in fact this space of measures is a convex subset of \mathbb{R}^K, and so the multi-dimensional version of Cramér's theorem applies. In general, we can work in the possibly infinite-dimensional space [0,1]^K, and our relevant subset is compact, as a closed subset of a compact space (by Tychonoff). So the LDP in this case follows from our previous work.
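And, at least formally, this approach returns the right rate function in the finite case. The log moment generating function of \delta_{X_1}, evaluated at \lambda\in\mathbb{R}^K, is

\Lambda(\lambda)=\log\mathbb{E}\,e^{\langle\lambda,\delta_{X_1}\rangle}=\log\sum_{x\in K}\mu(x)e^{\lambda(x)},

and for a probability measure \nu\ll\mu, the choice \lambda(x)=\log\frac{\nu(x)}{\mu(x)} attains the supremum in the Legendre transform \Lambda^*(\nu)=\sup_\lambda\left[\langle\lambda,\nu\rangle-\Lambda(\lambda)\right], yielding \Lambda^*(\nu)=H(\nu|\mu).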