We want to drop the i.i.d. assumption from Cramer’s theorem, to get a criterion for a general LDP as defined in the previous post to hold.
For general random variables $(Z_n)$ on $\mathbb{R}^d$ with laws $(\mu_n)$, we will continue to have an upper bound like in Cramer’s theorem, provided the moment generating functions of $Z_n$ converge as required. For analogy with Cramer, take $Z_n = \frac{1}{n}\sum_{i=1}^n X_i$, the empirical mean in the i.i.d. setting. The Gartner-Ellis theorem gives conditions for the existence of a suitable lower bound and, in particular, for when this is the same as the upper bound.
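We define the logarithmic moment generating function
$$\Lambda_n(\lambda) := \log \mathbb{E}\left[e^{\langle \lambda, Z_n \rangle}\right], \quad \lambda \in \mathbb{R}^d,$$
and assume that the limit
$$\Lambda(\lambda) := \lim_{n\to\infty} \frac{1}{n}\Lambda_n(n\lambda)$$
exists for all $\lambda \in \mathbb{R}^d$. We also assume that $0 \in \mathcal{D}_\Lambda^\circ$, where $\mathcal{D}_\Lambda := \{\lambda \in \mathbb{R}^d : \Lambda(\lambda) < \infty\}$ and $\mathcal{D}_\Lambda^\circ$ denotes its interior. We also define the Fenchel-Legendre transform as before:
$$\Lambda^*(x) := \sup_{\lambda \in \mathbb{R}^d}\left[\langle \lambda, x \rangle - \Lambda(\lambda)\right].$$
We say $y \in \mathbb{R}^d$ is an exposed point of $\Lambda^*$ if for some $\lambda \in \mathbb{R}^d$,
$$\langle \lambda, y \rangle - \Lambda^*(y) > \langle \lambda, x \rangle - \Lambda^*(x), \quad \forall x \neq y.$$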
Such a $\lambda$ is then called an exposing hyperplane. One way of thinking about this definition is that $\Lambda^*$ is convex, but is strictly convex in any direction at an exposed point. Alternatively, at an exposed point $y$, there is a vector $\lambda$ such that $\pi_\lambda - \Lambda^*$ has a strict global maximum at $y$ (equivalently, $\Lambda^* - \pi_\lambda$ has a strict global minimum there), where $\pi_\lambda(x) = \langle \lambda, x \rangle$ is the projection onto the direction of $\lambda$. Roughly speaking, this vector is what we will use to take the Cramer transform for the lower bound at this point. Recall that the Cramer transform is an exponential reweighting of the probability density, which makes a previously unlikely event into a normal one. We may now state the theorem.
With the assumptions above:
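- $\limsup_{n\to\infty} \frac{1}{n}\log \mu_n(F) \leq -\inf_{x \in F} \Lambda^*(x)$, $\forall F \subset \mathbb{R}^d$ closed.
- $\liminf_{n\to\infty} \frac{1}{n}\log \mu_n(G) \geq -\inf_{x \in G \cap E} \Lambda^*(x)$, $\forall G \subset \mathbb{R}^d$ open, where $E$ is the set of exposed points of $\Lambda^*$ whose exposing hyperplane is in $\mathcal{D}_\Lambda^\circ$.
- If $\Lambda$ is also lower semi-continuous, and is differentiable on $\mathcal{D}_\Lambda^\circ$ (which is non-empty by the previous assumption), and is steep, that is, for any sequence $(\lambda_n)$ in $\mathcal{D}_\Lambda^\circ$ converging to a point of $\partial \mathcal{D}_\Lambda$, $|\nabla \Lambda(\lambda_n)| \to \infty$, then we may replace $G \cap E$ by $G$ in the second statement. Then $(\mu_n)$ satisfies the LDP on $\mathbb{R}^d$ with rate $n$ and rate function $\Lambda^*$.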
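To see the objects in the theorem concretely, here is a minimal numerical sketch in the simplest case we can think of. It assumes $d = 1$ and that $Z_n$ is the mean of $n$ i.i.d. standard Gaussians, so that $\frac{1}{n}\Lambda_n(n\lambda) = \lambda^2/2$ for every $n$ and $\Lambda^*(x) = x^2/2$. The function names and the grid used to approximate the supremum are our own choices for illustration, not part of the theorem.

```python
import math
import numpy as np

def Lambda_gauss(lam):
    # Limiting log-MGF when Z_n is the mean of n i.i.d. N(0,1) variables
    # (an assumption made for this example only).
    return 0.5 * lam ** 2

def legendre_transform(Lambda, x, lambdas):
    # Approximate sup_lambda [lambda * x - Lambda(lambda)] over a finite grid.
    return float(np.max(lambdas * x - Lambda(lambdas)))

# Grid of lambda values; wide enough for the x values used below.
lambdas = np.linspace(-10.0, 10.0, 20001)

def rate(x):
    return legendre_transform(Lambda_gauss, x, lambdas)

def scaled_log_tail(a, n):
    # (1/n) log P(Z_n >= a), using Z_n ~ N(0, 1/n) exactly.
    p = 0.5 * math.erfc(a * math.sqrt(n / 2.0))
    return math.log(p) / n

if __name__ == "__main__":
    a = 1.0
    print("rate function at a:", rate(a))
    for n in (10, 100, 1000):
        print(n, scaled_log_tail(a, n))
```

Here the closed set is $F = [a, \infty)$ with $a > 0$, so the infimum of $\Lambda^*$ over $F$ is attained at $a$, and the scaled log-probabilities should approach $-a^2/2$ as $n$ grows. The values of $n$ are kept moderate because the tail probability underflows in floating point for much larger $n$.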
Where do all the terms come from?
As ever, because everything is on an exponential scale, the infimum in the statements affirms the intuitive notion that in the limit, “an unlikely event will happen in the most likely of the possible (unlikely) ways”. The reason why the first statement does not hold for open sets in general is that the infimum may not be attained for open sets. For the proof, we need an exposing hyperplane at $x$ so we can find an exponential tilt (or Cramer transform) that makes $x$ the standard outcome. Crucially, in order to apply probabilistic ideas to the resulting distribution, everything must be normalisable. So we need an exposing hyperplane so as to isolate the point $x$ on an exponential scale in the transform. And the exposing hyperplane must be in $\mathcal{D}_\Lambda$ if we are to have a chance of getting any useful information out of the transform. By convexity, this is equivalent to the exposing hyperplane being in $\mathcal{D}_\Lambda^\circ$.
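The exponential tilt can be sketched numerically on a toy example. The code below assumes a single roll of a fair die (our choice, purely for illustration) and looks for the $\lambda$ whose tilted law $\tilde{\mu}(z) \propto e^{\lambda z}\mu(z)$ has mean equal to a target point $x$. The helper names are hypothetical, and the bisection relies on the tilted mean being increasing in $\lambda$.

```python
import numpy as np

# Toy base law: one roll of a fair die (an assumption for this sketch).
values = np.arange(1, 7, dtype=float)
probs = np.full(6, 1.0 / 6.0)

def log_mgf(lam):
    # Lambda(lam) = log E[exp(lam * X)] for the die.
    return float(np.log(np.sum(probs * np.exp(lam * values))))

def tilt(lam):
    # Tilted weights exp(lam * z - Lambda(lam)) * mu(z); subtracting
    # Lambda(lam) is exactly the normalisation, so the weights sum to 1.
    return probs * np.exp(lam * values - log_mgf(lam))

def tilted_mean(lam):
    return float(np.sum(values * tilt(lam)))

def find_tilt(target, lo=-20.0, hi=20.0, tol=1e-12):
    # Bisection on lam: the tilted mean is increasing in lam because its
    # derivative is the variance of the tilted law.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    x = 5.0
    lam = find_tilt(x)
    print("tilt parameter:", lam)
    print("tilted law:", tilt(lam))
    print("tilted mean:", tilted_mean(lam))
    print("lam * x - Lambda(lam):", lam * x - log_mgf(lam))
```

Under the tilted law, the previously unlikely value $x = 5$ becomes the mean, and the final printed quantity is the value of $\lambda x - \Lambda(\lambda)$ at the tilting parameter. The normalisation is only possible because $\Lambda(\lambda)$ is finite, which is the role played by the requirement that the exposing hyperplane lies in $\mathcal{D}_\Lambda$.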