I’ve been doing a lot of thinking about Large Deviations recently, in particular how to apply the theory to random graphs and related models. I’ve just writing an article about some of the more interesting aspects, so thought it was probably worth turning it into a few posts.
Given i.i.d. real-valued random variables with finite expectation, and , the Weak Law of Large Numbers asserts that the empirical mean converges in distribution to . So . In fact, if , we have the Central Limit Theorem, and a consequence is that whenever .
In a concrete example, if we toss a coin some suitably large number of times, the probability that the proportion of heads will be substantially greater or smaller than tends to zero. So the probability that at least of the results are heads tends to zero. But how fast? Consider first four tosses, then eight. A quick addition of the relevant terms in the binomial distribution gives:
There are two observations to be made. The first is that the second is substantially smaller than the first – the decay appears to be relatively fast. The second observation is that is substantially larger than the rest of the sum. So by far the most likely way for at least out of twelve tosses to be heads is if exactly are heads. Cramer’s theorem applies to a general i.i.d. sequence of RVs, provided the tail is not too heavy. It show that the probability of any such large deviation event decays exponentially with n, and identifies the exponent.
Theorem (Cramer): Let be i.i.d. real-valued random variables which satisfy for every . Then for any ,
- So, informally, .
- I(z) is called the Fenchel-Legendre transform (or convex conjugate) of .
- Considering t=0 confirms that .
- In their extremely useful book, Dembo and Zeitouni present this theorem in greater generality, allowing to be supported on , considering a more general set of large deviation events, and relaxing the requirement for finite mean, and thus also the finite moment generating function condition. All of this will still be a special case of the Gartner-Ellis theorem, which will be examined in a subsequent post, so we make do with this form of Cramer’s result for now.
The proof of Cramer’s theorem splits into an upper bound and a lower bound. The former is relatively straightforward, applying Markov’s inequality to , then optimising over the choice of t. This idea is referred to by various sources as the exponential Chebyshev inequality or a Chernoff bound. The lower bound is more challenging. We reweight the distribution function F(x) of by a factor , then choose t so that the large deviation event is in fact now within the treatment of the CLT, from which suitable bounds are obtained.
To avoid overcomplicating this initial presentation, some details have been omitted. It is not clear, for example, whether I(x) should be finite whenever x is in the support of . (It certainly must be infinite outside – consider the probability that 150% or -40% of coin tosses come up heads!) In order to call this a Large Deviation Principle, we also want some extra regularity on I(x), not least to ensure it is unique. This will be discussed in the next posts.