Large Deviations 3 – Gartner-Ellis Theorem: Where do the all terms come from?

We want to drop the i.i.d. assumption from Cramer’s theorem, to get a criterion for a general LDP as defined in the previous post to hold.

Preliminaries

For general random variables (Z_n) on \mathbb{R}^d with laws (\mu_n), we will continue to have an upper bound like in Cramer’s theorem, provided the moment generating functions of Z_n converge as required. For analogy with Cramer, take Z_n=\frac{S_n}{n}. The Gartner-Ellis theorem gives conditions for the existence of a suitable lower bound and, in particular, when this is the same as the upper bound.

We define the logarithmic moment generating function

\Lambda_n(\lambda):=\log\mathbb{E}e^{\langle \lambda,Z_n\rangle},

and assume that the limit

\Lambda(\lambda)=\lim_{n\rightarrow\infty}\frac{1}{n}\Lambda_n(n\lambda)\in[-\infty,\infty],

exists for all \lambda\in\mathbb{R}^d. We also assume that 0\in\text{int}(\mathcal{D}_\Lambda), where \mathcal{D}_\Lambda:=\{\lambda\in\mathbb{R}^d:\Lambda(\lambda)<\infty\}. We also define the Fenchel-Legendre transform as before:

\Lambda^*(x)=\sup_{\lambda\in\mathbb{R}^d}\left[\langle x,\lambda\rangle - \Lambda(\lambda)\right],\quad x\in\mathbb{R}^d.

We say y\in\mathbb{R}^d is an exposed point of \Lambda^* if for some \lambda,

\langle \lambda,y\rangle - \Lambda^*(y)>\langle\lambda,x\rangle - \Lambda^*(x),\quad \forall x\in\mathbb{R}^d.

Such a \lambda is then called an exposing hyperplane. One way of thinking about this definition is that \Lambda^*(x) is convex, but is strictly convex in any direction at an exposed point. Alternatively, at an exposed point y, there is a vector \lambda such that \Lambda^*\circ \pi_\lambda has a global minimum or maximum at y, where \pi_\lambda is the projection into \langle \lambda\rangle. Roughly speaking, this vector is what we will to take the Cramer transform for the lower bound at x. Recall that the Cramer transform is an exponential reweighting of the probability density, which makes a previously unlikely event into a normal one. We may now state the theorem.

Gartner-Ellis Theorem

With the assumptions above:

  1. \limsup_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(F)\leq -\inf_{x\in F}\Lambda^*(x), \forall F\subset\mathbb{R}^d closed.
  2. \liminf_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(G)\geq -\inf_{x\in G\cap E}\Lambda^*(x), \forall G\subset\mathbb{R}^d open, where E is the set of exposed points of \Lambda^* whose exposing hyperplane is in \text{int}(\mathcal{D}_\Lambda).
  3. If \Lambda is also lower semi-continuous, and is differentiable on \text{int}(\mathcal{D}_\Lambda) (which is non-empty by the previous assumption), and is steep, that is, for any \lambda\in\partial\mathcal{D}_\Lambda, \lim_{\nu\rightarrow\lambda}|\nabla \Lambda(\nu)|=\infty, then we may replace G\cap E by G in the second statement. Then (\mu_n) satisfies the LDP on \mathbb{R}^d with rate n and rate function \Lambda^*.

Where do all the terms come from?

As ever, because everything is on an exponential scale, the infimum in the statements affirms the intuitive notion that in the limit, “an unlikely event will happen in the most likely of the possible (unlikely) ways”. The reason why the first statement does not hold for open sets in general is that the infimum may not be attained for open sets. For the proof, we need an exposing hyperplane at x so we can find an exponential tilt (or Cramer transform) that makes x the standard outcome. Crucially, in order to apply probabilistic ideas to the resulting distribution, everything must be normalisable. So we need an exposing hyperplane so as to isolate the point x on an exponential scale in the transform. And the exposing hyperplane must be in \mathcal{D}_\Lambda if we are to have a chance of getting any useful information out of the transform. By convexity, this is equivalent to the exposing hyperplane being in \text{int}(\mathcal{D}_\Lambda).

Advertisements

Large Deviations 1 – Motivation and Cramer’s Theorem

I’ve been doing a lot of thinking about Large Deviations recently, in particular how to apply the theory to random graphs and related models. I’ve just writing an article about some of the more interesting aspects, so thought it was probably worth turning it into a few posts.

Motivation

Given X_1,X_2,\ldots i.i.d. real-valued random variables with finite expectation, and S_n:=X_1+\ldots+X_n, the Weak Law of Large Numbers asserts that the empirical mean \frac{S_n}{n} converges in distribution to \mathbb{E}X_1. So \mathbb{P}(S_n\geq n(\mathbb{E}X_1+\epsilon))\rightarrow 0. In fact, if \mathbb{E}X_1^2<\infty, we have the Central Limit Theorem, and a consequence is that \mathbb{P}(S_n\geq n\mathbb{E}X_1+n^\alpha)\rightarrow 0 whenever \alpha>\frac12.

In a concrete example, if we toss a coin some suitably large number of times, the probability that the proportion of heads will be substantially greater or smaller than \frac12 tends to zero. So the probability that at least \frac34 of the results are heads tends to zero. But how fast? Consider first four tosses, then eight. A quick addition of the relevant terms in the binomial distribution gives:

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of four tosses are heads}\right)=\frac{1}{16}+\frac{4}{16}=\frac{5}{16},

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of twelve tosses are heads}\right)=\frac{1}{2^{12}}+\frac{12}{2^{12}}+\frac{66}{2^{12}}+\frac{220}{2^{12}}=\frac{299}{2^{12}}.

There are two observations to be made. The first is that the second is substantially smaller than the first – the decay appears to be relatively fast. The second observation is that \frac{220}{2^{12}} is substantially larger than the rest of the sum. So by far the most likely way for at least \tfrac34 out of twelve tosses to be heads is if exactly \tfrac34 are heads. Cramer’s theorem applies to a general i.i.d. sequence of RVs, provided the tail is not too heavy. It show that the probability of any such large deviation event decays exponentially with n, and identifies the exponent.

Theorem (Cramer): Let (X_i) be i.i.d. real-valued random variables which satisfy \mathbb{E}e^{tX_1}<\infty for every t\in\mathbb{R}. Then for any a>\mathbb{E}X_1,

\lim_{n\rightarrow \infty}\frac{1}{n}\log\mathbb{P}(S_n\geq an)=-I(a),

\text{where}\quad I(z):=\sup_{t\in\mathbb{R}}\left[zt-\log\mathbb{E}e^{tX_1}\right].

Remarks

  • So, informally, \mathbb{P}(S_n\geq an)\sim e^{-nI(a)}.
  • I(z) is called the Fenchel-Legendre transform (or convex conjugate) of \log\mathbb{E}e^{tX_1}.
  • Considering t=0 confirms that I(z)\in[0,\infty].
  • In their extremely useful book, Dembo and Zeitouni present this theorem in greater generality, allowing X_i to be supported on \mathbb{R}^d, considering a more general set of large deviation events, and relaxing the requirement for finite mean, and thus also the finite moment generating function condition. All of this will still be a special case of the Gartner-Ellis theorem, which will be examined in a subsequent post, so we make do with this form of Cramer’s result for now.

The proof of Cramer’s theorem splits into an upper bound and a lower bound. The former is relatively straightforward, applying Markov’s inequality to e^{tS_n}, then optimising over the choice of t. This idea is referred to by various sources as the exponential Chebyshev inequality or a Chernoff bound. The lower bound is more challenging. We reweight the distribution function F(x) of X_1 by a factor e^{tx}, then choose t so that the large deviation event is in fact now within the treatment of the CLT, from which suitable bounds are obtained.

To avoid overcomplicating this initial presentation, some details have been omitted. It is not clear, for example, whether I(x) should be finite whenever x is in the support of X_1. (It certainly must be infinite outside – consider the probability that 150% or -40% of coin tosses come up heads!) In order to call this a Large Deviation Principle, we also want some extra regularity on I(x), not least to ensure it is unique. This will be discussed in the next posts.