Large Deviations 5 – Stochastic Processes and Mogulskii’s Theorem

Motivation

In the previous posts about Large Deviations, most of the emphasis has been on the theory. To summarise briefly, we have a natural idea that for a family of measures supported on the same metric space, increasingly concentrated as some index grows, we might expect the probability of seeing values in a set not containing the limit in distribution to decay exponentially. The canonical example is the sample mean of a family of IID random variables, as treated by Cramer’s theorem.

It becomes apparent that it will not be enough to specify the exponent for a given large deviation event just by taking the infimum of the rate function, so we have to define an LDP topologically, with different behaviour on open and closed sets. Now we want to find some LDPs for more complicated measures, but which will have genuinely non-trivial applications. The key idea in all of this is that the infimum present in the definition of an LDP doesn’t just specify the rate function, it also might well give us some information about the configurations or events that lead to the LDP.

The slogan for the LDP as in Frank den Hollander’s excellent book is: “A large deviation event will happen in the least unlikely of all the unlikely ways.” This will be useful when our underlying space is a bit more complicated.

Setup

As a starting point, consider the set-up for Cramer’s theorem, with IID $X_1,\ldots,X_n$. But instead of investigating LD behaviour for the sample mean, we investigate LD behaviour for the whole set of RVs. There is a bijection between sequences and the partial sums process, so we investigate the partial sums process, rescaled appropriately. For the moment this is a sequence not a function or path (continuous or otherwise), but in the limit it will be, and furthermore it won’t make too much difference whether we interpolate linearly or step-wise.

Concretely, we consider the rescaled random walk:

$Z_n(t):=\tfrac{1}{n}\sum_{i=1}^{[nt]}X_i,\quad t\in[0,1],$

with laws $\mu_n$ supported on $L_\infty([0,1])$. Note that the expected behaviour is a straight line from (0,0) to (1,$\mathbb{E}X_1$). In fact we can say more than that. By Donsker’s theorem we have a functional version of a central limit theorem, which says that deviations from this expected behaviour are given by suitably scaled Brownian motion:

$\sqrt{n}\left(\frac{Z_n(t)-t\mathbb{E}X_1}{\sqrt{\text{Var}(X_1)}}\right)\quad\stackrel{d}{\rightarrow}\quad B(t),\quad t\in[0,1].$

This is what we expect ‘standard’ behaviour to look like:

The deviations from a straight line are on a scale of $\sqrt{n}$ for the unrescaled partial sums, that is, $1/\sqrt{n}$ for $Z_n$ itself. Here are two examples of potential large deviation behaviour:

Or this:

Note that these are qualitatively different. In the first case, the first half of the random variables are in general much larger than the second half, which appear to have empirical mean roughly 0. In the second case, a large deviation in overall mean is driven by a single very large value. It is obviously of interest to find out what the probabilities of each of these possibilities are.
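To get a feel for typical behaviour before turning to the LDP, the rescaled walk $Z_n$ can be simulated directly. The following is a minimal sketch, assuming standard normal increments (so $\mathbb{E}X_1=0$); the function names `rescaled_walk` and `sup_deviation` are illustrative and not from any library.

```python
import random

def rescaled_walk(n, sampler, rng):
    """Return [Z_n(k/n) for k = 0..n], where Z_n(t) = (1/n) * sum_{i <= floor(nt)} X_i."""
    path = [0.0]
    total = 0.0
    for _ in range(n):
        total += sampler(rng)
        path.append(total / n)
    return path

def sup_deviation(path, mean):
    """Largest distance between the path and the straight line t -> t * mean."""
    n = len(path) - 1
    return max(abs(z - (k / n) * mean) for k, z in enumerate(path))

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 10_000
# Assumption for illustration: X_i ~ N(0, 1).
path = rescaled_walk(n, lambda r: r.gauss(0.0, 1.0), rng)

# For a typical path this is of order 1/sqrt(n), far smaller than any fixed constant.
print(sup_deviation(path, 0.0))
```

A large deviation event corresponds to this supremum being of order 1 rather than order $1/\sqrt{n}$, which is why naive simulation essentially never produces one.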

We can do this via an LDP for $(\mu_n)$. Now it is really useful to be working in a topological context with open and closed sets. It will turn out that the rate function is supported on absolutely continuous functions, whereas obviously for finite n, none of the sample paths are continuous!

We assume that $\Lambda(\lambda)$ is the logarithmic moment generating function of $X_1$ as before, with $\Lambda^*(x)$ the Fenchel-Legendre transform. Then the key result is:

Theorem (Mogulskii): The measures $(\mu_n)$ satisfy an LDP on $L_\infty([0,1])$ with good rate function:

$I(\phi)=\begin{cases}\int_0^1 \Lambda^*(\phi'(t))dt,&\quad \text{if }\phi\in\mathcal{AC}, \phi(0)=0,\\ \infty&\quad\text{otherwise,}\end{cases}$

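where AC is the space of absolutely continuous functions on [0,1]. Note that AC is dense in the continuous functions on [0,1] under the supremum norm (though not in all of $L_\infty([0,1])$, since uniform limits of continuous functions are continuous), so any open set containing a continuous function started from 0 also contains a $\phi$ for which $I(\phi)$ is at least in principle finite. (Obviously, if $\Lambda^*$ is not finite everywhere, then extra restrictions on $\phi'$ are required.)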
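For piecewise-linear paths the rate function reduces to a finite sum, which makes it easy to compare candidate paths. The sketch below assumes standard normal increments, for which $\Lambda^*(x)=x^2/2$; the helper names and the two example paths are chosen for illustration only.

```python
def gaussian_rate(x):
    """Cramer rate function for N(0, 1) increments: Lambda*(x) = x^2 / 2."""
    return 0.5 * x * x

def path_rate(knots, rate=gaussian_rate):
    """I(phi) for a piecewise-linear phi given by knots [(t0, y0), ..., (tk, yk)].

    On each linear piece phi' is constant, so the integral of Lambda*(phi')
    is just (length of piece) * Lambda*(slope).
    """
    if knots[0] != (0.0, 0.0):
        return float("inf")  # the theorem requires phi(0) = 0
    total = 0.0
    for (t0, y0), (t1, y1) in zip(knots, knots[1:]):
        slope = (y1 - y0) / (t1 - t0)
        total += (t1 - t0) * rate(slope)
    return total

# Two paths from (0, 0) to (1, 1): a straight line, and one that does all
# its climbing in the first half and is flat afterwards.
straight = [(0.0, 0.0), (1.0, 1.0)]
front_loaded = [(0.0, 0.0), (0.5, 1.0), (1.0, 1.0)]

print(path_rate(straight))      # 0.5
print(path_rate(front_loaded))  # 1.0
```

Under these assumptions the front-loaded path costs twice as much in the exponent as the straight line. A path with a genuine jump is not absolutely continuous, so it has infinite rate and cannot be represented by finitely many knots with finite slopes.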

The following picture may be helpful in providing some motivation:

So what is going on is that if we take a path and zoom in on some small interval around a point, note first that behaviour on this interval is independent of behaviour everywhere else. Then the gradient at the point is the local empirical mean of the random variables around this point in time. The probability that this differs from the actual mean is given by Cramer’s rate function applied to the empirical mean, so we obtain the rate function for the whole path by integrating.

More concretely, but still very informally, suppose there is some $\phi'(t)\neq \mathbb{E}X$, then this says that:

$Z_n(t+\delta t)-Z_n(t)=\phi'(t)\delta t+o(\delta t),$

$\Rightarrow\quad \mu_n\Big(\phi'(t)\delta t+o(\delta t)=\frac{1}{n}\sum_{i=nt+1}^{n(t+\delta t)}X_i\Big),$

$= \mu_n\Big( \phi'(t)+o(1)=\frac{1}{n\delta t}\sum_{i=1}^{n\delta t}X_i\Big)\sim e^{-n\delta t\Lambda^*(\phi'(t))},$

by Cramer. Now we can use independence:

$\mu_n(Z_n\approx \phi)=\prod_{\delta t}e^{-n\delta t \Lambda^*(\phi'(t))}=e^{-\sum_{\delta t}n\delta t \Lambda^*(\phi'(t))}\approx e^{-n\int_0^1 \Lambda^*(\phi'(t))dt},$

as in fact is given by Mogulskii.

Remarks

1) The absolutely continuous requirement is useful. We really wouldn’t want to be examining carefully the tail of the underlying distribution to see whether it is possible on an exponential scale that o(n) consecutive RVs would have sum O(n).

2) In general $\Lambda^*(x)$ will be convex, which has applications as well as playing a useful role in the proof. Recalling den Hollander’s mantra, we are interested to see where infima hold for LD sets in the host space. So for the event that the empirical mean is greater than some threshold larger than the expectation, Cramer’s theorem told us that this is exponentially the same as the event that the empirical mean is roughly equal to the threshold. Now Mogulskii’s theorem says more. By convexity, we know that the integral functional for the rate function is minimised by straight lines. So we learn that the contributions to the large deviation are spread roughly equally through the sample. Note that this is NOT saying that all the random variables will have the same higher than expected value. The LDP takes no account of fluctuations in the path on a scale smaller than n. It does however rule out both of the situations pictured a long way up the page. We should expect to see roughly a straight line, with unexpectedly steep gradient.

3) The proof as given in Dembo and Zeitouni is quite involved. There are a few stages, the first and simplest of which is to show that it doesn’t matter on an exponential scale whether we interpolate linearly or step-wise. Later in the proof we will switch back and forth at will. The next step is to show the LDP for the finite-dimensional problem given by evaluating the path at finitely many points in [0,1]. A careful argument via the Dawson-Gartner theorem allows lifting of the finite-dimensional projections back to the space of general functions with the topology of pointwise convergence. It remains to prove that the rate function is indeed the supremum of the rate functions achieved on projections. Convexity of $\Lambda^*(x)$ is very useful here for the upper bound, and this is where it comes through that the rate function is infinite when the comparison path is not absolutely continuous. To lift to the finer topology of $L_\infty([0,1])$ requires only a check of exponential tightness in the finer space, which follows from Arzela-Ascoli after some work.
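The convexity claim in remark 2 can be made explicit. As a short worked step (a sketch, assuming only that $\Lambda^*$ is convex), take any absolutely continuous $\phi$ with $\phi(0)=0$ and $\phi(1)=a$. Jensen’s inequality gives:

$I(\phi)=\int_0^1 \Lambda^*(\phi'(t))dt\geq \Lambda^*\Big(\int_0^1 \phi'(t)dt\Big)=\Lambda^*(a),$

with equality for the straight line $\phi(t)=at$. So among all paths ending at height $a$, the straight line achieves the smallest rate, and that rate is exactly the one given by Cramer’s theorem for the sample mean.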

In conclusion, it is fairly tricky to prove even this most straightforward case, so unsurprisingly it is hard to extend to the natural case where the distributions of the underlying RVs (X) change continuously in time, as we will want for the analysis of more combinatorial objects. Next time I will consider why it is hard but potentially interesting to consider with adaptations of these techniques an LDP for the size of the largest component in a sparse random graph near criticality.

Large Deviations 4 – Sanov’s Theorem

Although we could have defined things for a more general topological space, most of our thoughts about Cramer’s theorem, and the Gartner-Ellis theorem which generalises it, are based on means of real-valued random variables. For Cramer’s theorem, we genuinely are interested only in means of i.i.d. random variables. In Gartner-Ellis, one might say that we are able to relax the condition on independence and perhaps identical distribution too, in a controlled way. But this is somewhat underselling the theorem: using G-E, we can deal with a much broader category of measures than just means of collections of variables. The key is that convergence of the log moment generating function is exactly enough to give an LDP with some rate, and we have a general method for finding the rate function.

So, Gartner-Ellis provides a fairly substantial generalisation to Cramer’s theorem, but is still similar in flavour. But what if we look for additional properties of a collection of i.i.d. random variables $(X_n)$? After all, the mean is not the only interesting property. One thing we could look at is the actual values taken by the $X_n$s. If the underlying distribution is continuous, this is not going to give much more information than what we started with. With probability 1, $\{X_1,\ldots,X_n\}$ is a set of size n, with distribution given by the product of the underlying measure. However, if the random variables take values in a discrete set, or better still a finite set, then $(X_1,\ldots,X_n)$ gives a so-called empirical distribution.

As n grows towards infinity, we expect this empirical distribution to approximate the real underlying distribution fairly well. This isn’t necessarily quite as easy as it sounds. By the strong law of large numbers applied to indicator functions $1(X_i\leq t)$, the empirical cdf at t converges almost surely to the true cdf at t. To guarantee that this convergence is uniform in t is tricky in general (for reference, see the Glivenko-Cantelli theorem), but is clear for random variables defined on finite sets, and it seems reasonable that an extension to discrete sets should be possible.

So such empirical distributions might well admit an LDP. Note that in the case of Bernoulli random variables, the empirical distribution is in fact exactly equivalent to the empirical mean, so Cramer’s theorem applies. But, in fact we have a general LDP for empirical distributions. I claim that the main point of interest here is the nature of the rate function – I will discuss why the existence of an LDP is not too surprising at the end.

The rate function is going to be interesting whatever form it ends up taking. After all, it is effectively going to be some sort of metric on measures, as it records how far a possible empirical measure is from the true distribution. Apart from total variation distance, we don’t currently have many standard examples for metrics on a space of measures. Anyway, the rate function is the main content of Sanov’s theorem. This has various forms, depending on how fiddly you are prepared for the proof to be.

Define $L_n:=\frac{1}{n}\sum_{i=1}^n \delta_{X_i}\in\mathcal{M}_1(E)$ to be the empirical measure generated by $X_1,\ldots,X_n$. Then $L_n$ satisfies an LDP on $\mathcal{M}_1(E)$ with rate n and rate function given by $H(\cdot|\mu)$, where $\mu$ is the underlying distribution.

The function H is the relative entropy, defined by:

$H(\nu|\mu):=\int_E \log\frac{\nu(x)}{\mu(x)}d\nu(x),$

whenever $\nu\ll\mu$, and $\infty$ otherwise. We can see why this absolute continuity condition is required from the statement of the LDP. If the underlying distribution $\mu$ has measure zero on some set A, then the observed values will not be in A with probability 1, and so the empirical measure will be zero on A also.

Note that an alternative form is:

$H(\nu|\mu)=\int_E \frac{\nu(x)}{\mu(x)}\log\frac{\nu(x)}{\mu(x)}d\mu(x)=\mathbb{E}_\mu\frac{\nu(x)}{\mu(x)}\log\frac{\nu(x)}{\mu(x)}.$

Perhaps it is more clear why this expectation is something we would want to minimise.

In particular, if we want to know the most likely asymptotic empirical distribution inducing a large deviation empirical mean (as in Cramer), then we find the distribution with suitable mean, and smallest entropy relative to the true underlying distribution.
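On a finite set the relative entropy is a finite sum, and the Bernoulli case mentioned earlier can be checked by hand. The sketch below is illustrative only: distributions are represented as dictionaries from values to probabilities, and the particular choices of $\mu$ and $\nu$ are assumptions made for the example.

```python
import math

def relative_entropy(nu, mu):
    """H(nu|mu) = sum_x nu(x) * log(nu(x) / mu(x)) on a finite set.

    Returns infinity unless nu is absolutely continuous with respect to mu.
    """
    total = 0.0
    for x, p in nu.items():
        if p == 0.0:
            continue  # convention: 0 * log 0 = 0
        q = mu.get(x, 0.0)
        if q == 0.0:
            return math.inf  # nu charges a point that mu does not
        total += p * math.log(p / q)
    return total

def bernoulli_cramer_rate(x, p):
    """Cramer rate function Lambda*(x) for Bernoulli(p), valid for 0 < x < 1."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

mu = {0: 0.5, 1: 0.5}  # true distribution: fair coin
nu = {0: 0.2, 1: 0.8}  # candidate empirical distribution with mean 0.8

# For Bernoulli variables the empirical measure is determined by the
# empirical mean, so the two rate functions should agree.
print(relative_entropy(nu, mu), bernoulli_cramer_rate(0.8, 0.5))
```

The two printed values agree up to floating-point error, which is the statement that Sanov’s theorem reduces to Cramer’s theorem in the Bernoulli case.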

A remark on the proof. If the underlying set of values is finite, then a proof of this result is essentially combinatorial. The empirical distribution is some multinomial distribution, and we can obtain exact forms for everything and then proceed with asymptotic approximations.
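The combinatorial argument can be checked numerically. The following sketch computes the exact multinomial probability that the empirical counts match a given distribution $\nu$, and compares $-\frac{1}{n}\log$ of that probability with $H(\nu|\mu)$. The three-point distributions are assumptions chosen for illustration, and `math.lgamma` is used for the log-factorials.

```python
import math

def log_prob_of_type(counts, mu):
    """Exact log-probability that n i.i.d. draws from mu produce these counts."""
    n = sum(counts.values())
    log_p = math.lgamma(n + 1)  # log n!
    for x, k in counts.items():
        log_p += k * math.log(mu[x]) - math.lgamma(k + 1)  # k log mu(x) - log k!
    return log_p

def relative_entropy(nu, mu):
    """H(nu|mu) on a finite set, assuming mu(x) > 0 wherever nu(x) > 0."""
    return sum(p * math.log(p / mu[x]) for x, p in nu.items() if p > 0)

def empirical_exponent(n, nu, mu):
    """-(1/n) log P(L_n = nu), with counts n * nu(x) rounded to integers."""
    counts = {x: round(n * p) for x, p in nu.items()}
    return -log_prob_of_type(counts, mu) / n

mu = {"a": 0.5, "b": 0.3, "c": 0.2}  # true distribution
nu = {"a": 0.2, "b": 0.3, "c": 0.5}  # atypical empirical distribution

for n in (10, 100, 1000, 10000):
    print(n, empirical_exponent(n, nu, mu), relative_entropy(nu, mu))
```

The exponent approaches the relative entropy from above, with a gap of order $\log n/n$ coming from the polynomial prefactor in Stirling’s approximation.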

I said earlier that I would comment on why the LDP is not too surprising even in general, once we know Gartner-Ellis. Instead of letting $X_i$ take values in whatever space we were considering previously, say the reals, consider instead the point mass function $\delta_{X_i}$ which is effectively exactly the same random variable, only now defined on the space of probability measures. The empirical measure is then exactly:

$\frac{1}{n}\sum_{i=1}^n \delta_{X_i}.$

If the support K of the $(X_i)$s is finite, then in fact this space of measures is a convex subset of $\mathbb{R}^K$, and so the multi-dimensional version of Cramer’s theorem applies. In general, we can work in the possibly infinite-dimensional space $[0,1]^K$, and our relevant subset is compact, as a closed subset of a compact space (by Tychonoff). So the LDP in this case follows from our previous work.

Large Deviations 3 – Gartner-Ellis Theorem: Where do all the terms come from?

We want to drop the i.i.d. assumption from Cramer’s theorem, to get a criterion for a general LDP as defined in the previous post to hold.

Preliminaries

For general random variables $(Z_n)$ on $\mathbb{R}^d$ with laws $(\mu_n)$, we will continue to have an upper bound like in Cramer’s theorem, provided the moment generating functions of $Z_n$ converge as required. For analogy with Cramer, take $Z_n=\frac{S_n}{n}$. The Gartner-Ellis theorem gives conditions for the existence of a suitable lower bound and, in particular, when this is the same as the upper bound.

We define the logarithmic moment generating function

$\Lambda_n(\lambda):=\log\mathbb{E}e^{\langle \lambda,Z_n\rangle},$

and assume that the limit

$\Lambda(\lambda)=\lim_{n\rightarrow\infty}\frac{1}{n}\Lambda_n(n\lambda)\in[-\infty,\infty],$

exists for all $\lambda\in\mathbb{R}^d$. We also assume that $0\in\text{int}(\mathcal{D}_\Lambda)$, where $\mathcal{D}_\Lambda:=\{\lambda\in\mathbb{R}^d:\Lambda(\lambda)<\infty\}$. We also define the Fenchel-Legendre transform as before:

$\Lambda^*(x)=\sup_{\lambda\in\mathbb{R}^d}\left[\langle x,\lambda\rangle - \Lambda(\lambda)\right],\quad x\in\mathbb{R}^d.$
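When no closed form is available, the Fenchel-Legendre transform can be approximated numerically. The sketch below handles only $d=1$ and searches over a bounded interval of $\lambda$, both of which are simplifying assumptions; it uses ternary search, which is valid here because $\lambda\mapsto x\lambda-\Lambda(\lambda)$ is concave. The function names are illustrative.

```python
import math

def legendre_transform(log_mgf, x, lo=-20.0, hi=20.0, iters=200):
    """Approximate sup over lambda in [lo, hi] of x * lambda - log_mgf(lambda).

    Ternary search on a concave objective. If the true supremum is only
    approached as |lambda| grows, this returns a lower bound.
    """
    def objective(lam):
        return x * lam - log_mgf(lam)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            lo = m1  # the maximiser lies to the right of m1
        else:
            hi = m2  # the maximiser lies to the left of m2
    return objective((lo + hi) / 2)

def gaussian_log_mgf(lam):
    """Lambda for N(0, 1)."""
    return 0.5 * lam * lam

def bernoulli_log_mgf(lam):
    """Lambda for Bernoulli(1/2)."""
    return math.log(0.5 + 0.5 * math.exp(lam))

print(legendre_transform(gaussian_log_mgf, 1.5))   # close to 1.5**2 / 2
print(legendre_transform(bernoulli_log_mgf, 0.8))  # close to 0.8*log(1.6) + 0.2*log(0.4)
```

Both outputs can be compared against the known closed forms, $x^2/2$ for the standard normal and $x\log\frac{x}{p}+(1-x)\log\frac{1-x}{1-p}$ for Bernoulli($p$).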

We say $y\in\mathbb{R}^d$ is an exposed point of $\Lambda^*$ if for some $\lambda$,

$\langle \lambda,y\rangle - \Lambda^*(y)>\langle\lambda,x\rangle - \Lambda^*(x),\quad \forall x\in\mathbb{R}^d,\ x\neq y.$

Such a $\lambda$ is then called an exposing hyperplane. One way of thinking about this definition is that $\Lambda^*(x)$ is convex, but is strictly convex in any direction at an exposed point. Alternatively, at an exposed point y, there is a vector $\lambda$ such that $\Lambda^*\circ \pi_\lambda$ has a global minimum or maximum at y, where $\pi_\lambda$ is the projection into $\langle \lambda\rangle$. Roughly speaking, this vector is what we will use to take the Cramer transform for the lower bound at y. Recall that the Cramer transform is an exponential reweighting of the probability density, which makes a previously unlikely event into a normal one. We may now state the theorem.

Gartner-Ellis Theorem

With the assumptions above:

1. $\limsup_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(F)\leq -\inf_{x\in F}\Lambda^*(x)$, $\forall F\subset\mathbb{R}^d$ closed.
2. $\liminf_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(G)\geq -\inf_{x\in G\cap E}\Lambda^*(x)$, $\forall G\subset\mathbb{R}^d$ open, where E is the set of exposed points of $\Lambda^*$ whose exposing hyperplane is in $\text{int}(\mathcal{D}_\Lambda)$.
3. If $\Lambda$ is also lower semi-continuous, and is differentiable on $\text{int}(\mathcal{D}_\Lambda)$ (which is non-empty by the previous assumption), and is steep, that is, for any $\lambda\in\partial\mathcal{D}_\Lambda$, $\lim_{\nu\rightarrow\lambda}|\nabla \Lambda(\nu)|=\infty$, then we may replace $G\cap E$ by G in the second statement. Then $(\mu_n)$ satisfies the LDP on $\mathbb{R}^d$ with rate n and rate function $\Lambda^*$.

Where do all the terms come from?

As ever, because everything is on an exponential scale, the infimum in the statements affirms the intuitive notion that in the limit, “an unlikely event will happen in the most likely of the possible (unlikely) ways”. The reason why the first statement does not hold for open sets in general is that the infimum may not be attained for open sets. For the proof, we need an exposing hyperplane at x so we can find an exponential tilt (or Cramer transform) that makes x the standard outcome. Crucially, in order to apply probabilistic ideas to the resulting distribution, everything must be normalisable. So we need an exposing hyperplane so as to isolate the point x on an exponential scale in the transform. And the exposing hyperplane must be in $\mathcal{D}_\Lambda$ if we are to have a chance of getting any useful information out of the transform. By convexity, this is equivalent to the exposing hyperplane being in $\text{int}(\mathcal{D}_\Lambda)$.
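The exponential tilt can be made concrete in the simplest i.i.d. setting. The sketch below takes a fair die as the underlying distribution (an assumption for illustration), finds by bisection the $\lambda$ whose tilted law has mean 5, and checks that the cost $\lambda x-\Lambda(\lambda)$ coincides with the relative entropy of the tilted law with respect to the original. The helper names are hypothetical.

```python
import math

def tilt(mu, lam):
    """Exponentially tilted law: nu(x) proportional to exp(lam * x) * mu(x)."""
    weights = {x: p * math.exp(lam * x) for x, p in mu.items()}
    z = sum(weights.values())  # normaliser; finite exactly when lam is in D_Lambda
    return {x: w / z for x, w in weights.items()}

def mean(nu):
    return sum(x * p for x, p in nu.items())

def tilt_to_mean(mu, target, lo=-50.0, hi=50.0, iters=200):
    """Bisect on lam; the tilted mean is increasing in lam."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(tilt(mu, mid)) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu = {k: 1 / 6 for k in range(1, 7)}  # fair die, mean 3.5
target = 5.0                          # an atypical value for the sample mean

lam = tilt_to_mean(mu, target)
nu = tilt(mu, lam)

log_mgf = math.log(sum(p * math.exp(lam * x) for x, p in mu.items()))
rate = lam * target - log_mgf  # value of lam * x - Lambda(lam) at the chosen lam
entropy = sum(q * math.log(q / mu[x]) for x, q in nu.items())

print(mean(nu), rate, entropy)
```

Under the tilted law the target value is the typical outcome, which is what makes the law of large numbers available in the lower bound. The agreement of `rate` and `entropy` is the finite-set link between the Fenchel-Legendre transform and the relative entropy from Sanov’s theorem.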