Large Deviations 5 – Stochastic Processes and Mogulskii’s Theorem

Motivation

In the previous posts about Large Deviations, most of the emphasis has been on the theory. To summarise briefly, we have a natural idea that for a family of measures supported on the same metric space, increasingly concentrated as some index grows, we might expect the probability of seeing values in a set not containing the limit in distribution to decay exponentially. The canonical example is the sample mean of a family of IID random variables, as treated by Cramer’s theorem.

It becomes apparent that it will not be enough to specify the exponent for a given large deviation event just by taking the infimum of the rate function, so we have to define an LDP topologically, with different behaviour on open and closed sets. Now we want to find LDPs for more complicated measures, ones which will have genuinely non-trivial applications. The key idea in all of this is that the infimum present in the definition of an LDP doesn’t just specify the exponent, it also might well give us some information about the configurations or events that lead to the large deviation.

The slogan for the LDP as in Frank den Hollander’s excellent book is: “A large deviation event will happen in the least unlikely of all the unlikely ways.” This will be useful when our underlying space is a bit more complicated.

Setup

As a starting point, consider the set-up for Cramer’s theorem, with IID $X_1,\ldots,X_n$ . But instead of investigating LD behaviour for the sample mean, we investigate LD behaviour for the whole set of RVs. There is a bijection between sequences and the partial sums process, so we investigate the partial sums process, rescaled appropriately. For the moment this is a sequence not a function or path (continuous or otherwise), but in the limit it will be, and furthermore it won’t make too much difference whether we interpolate linearly or step-wise.

Concretely, we consider the rescaled random walk:

$Z_n(t):=\tfrac{1}{n}\sum_{i=1}^{[nt]}X_i,\quad t\in[0,1],$

with laws $\mu_n$ supported on $L_\infty([0,1])$ . Note that the expected behaviour is a straight line from (0,0) to (1, $\mathbb{E}X_1$ ). In fact we can say more than that. By Donsker’s theorem we have a functional version of a central limit theorem, which says that deviations from this expected behaviour are given by suitably scaled Brownian motion:

$\sqrt{n}\left(\frac{Z_n(t)-t\mathbb{E}X_1}{\sqrt{\text{Var}(X_1)}}\right)\quad\stackrel{d}{\rightarrow}\quad B(t),\quad t\in[0,1].$
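To make the definition of $Z_n$ concrete, here is a minimal sketch in Python using the step-wise interpolation. The function names `rescaled_walk` and `sample_path` are made up for illustration, and the choice of standard Gaussian increments is an assumption, not something the set-up requires.

```python
import math
import random


def rescaled_walk(xs, t):
    """Z_n(t) = (1/n) * (sum of the first floor(n*t) increments), for t in [0, 1]."""
    n = len(xs)
    k = math.floor(n * t)
    return sum(xs[:k]) / n


def sample_path(n, grid=10, seed=0):
    """Evaluate one realisation of Z_n at the points 0, 1/grid, ..., 1.

    Assumption for illustration only: the X_i are IID standard Gaussians.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [rescaled_walk(xs, j / grid) for j in range(grid + 1)]
```

Calling `sample_path(10000)` should typically give values close to the straight line $t\mathbb{E}X_1$, here identically zero, with fluctuations that shrink as n grows.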

This is what we expect ‘standard’ behaviour to look like:

For the unrescaled walk, the deviations from a straight line are on a scale of $\sqrt{n}$ , so for $Z_n$ they are on a scale of $1/\sqrt{n}$ . Here are two examples of potential large deviation behaviour:

Or this:

Note that these are qualitatively different. In the first case, the first half of the random variables are in general much larger than the second half, which appear to have empirical mean roughly 0. In the second case, a large deviation in overall mean is driven by a single very large value. It is obviously of interest to find out what the probabilities of each of these possibilities are.

We can do this via an LDP for $(\mu_n)$ . Now it is really useful to be working in a topological context with open and closed sets. It will turn out that the rate function is supported on absolutely continuous functions, whereas obviously for finite n, none of the sample paths are continuous!

We assume that $\Lambda(\lambda)$ is the logarithmic moment generating function of $X_1$ as before, with $\Lambda^*(x)$ the Fenchel-Legendre transform. Then the key result is:

Theorem (Mogulskii): The measures $(\mu_n)$ satisfy an LDP on $L_\infty([0,1])$ with good rate function:

$I(\phi)=\begin{cases}\int_0^1 \Lambda^*(\phi'(t))dt,&\quad \text{if }\phi\in\mathcal{AC}, \phi(0)=0,\\ \infty&\quad\text{otherwise,}\end{cases}$

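where $\mathcal{AC}$ is the space of absolutely continuous functions on [0,1]. Note that $\mathcal{AC}$ is dense in the continuous functions under the supremum norm (though not in all of $L_\infty([0,1])$ , where the continuous functions form a closed subspace), so any open set meeting the continuous paths started from 0 contains a $\phi$ for which $I(\phi)$ is at least in principle finite. (Obviously, if $\Lambda^*$ is not finite everywhere, then extra restrictions on $\phi'$ are required.)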
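Since the rate function is built from $\Lambda^*$ , it may help to see how the Fenchel-Legendre transform $\Lambda^*(x)=\sup_\lambda[\lambda x-\Lambda(\lambda)]$ can be approximated numerically. The following is only a sketch: the name `legendre_transform`, the search window [-50, 50] and the two example distributions are assumptions made for illustration. It relies on $\lambda\mapsto\lambda x-\Lambda(\lambda)$ being concave, so a ternary search finds the supremum, provided the maximiser lies inside the window.

```python
import math


def legendre_transform(log_mgf, x, lo=-50.0, hi=50.0, iters=200):
    """Approximate sup over lam in [lo, hi] of lam * x - log_mgf(lam).

    Ternary search is valid because the objective is concave in lam.
    The window [lo, hi] is an assumption; widen it if the maximiser escapes.
    """
    def objective(lam):
        return lam * x - log_mgf(lam)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2)


def gaussian_log_mgf(lam):
    """Log MGF of a standard Gaussian; its transform is x^2 / 2."""
    return 0.5 * lam * lam


def rademacher_log_mgf(lam):
    """Log MGF of a uniform +/-1 variable: log cosh(lam)."""
    return math.log(math.cosh(lam))
```

For the Gaussian case the result can be compared with the closed form $x^2/2$ , and for the $\pm 1$ case with $\tfrac{1+x}{2}\log(1+x)+\tfrac{1-x}{2}\log(1-x)$ for $|x|<1$ .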

The following picture may be helpful at providing some motivation:

So what is going on is that if we take a path and zoom in on some small interval around a point, note first that behaviour on this interval is independent of behaviour everywhere else. Then the gradient at the point is the local empirical mean of the random variables around this point in time. The probability that this differs from the actual mean is given by Cramer’s rate function applied to the empirical mean, so we obtain the rate function for the whole path by integrating.

More concretely, but still very informally, suppose there is some $\phi'(t)\neq \mathbb{E}X_1$ , then this says that:

$Z_n(t+\delta t)-Z_n(t)=\phi'(t)\delta t+o(\delta t),$

$\Rightarrow\quad \mu_n\Big(\phi'(t)\delta t+o(\delta t)=\frac{1}{n}\sum_{i=nt+1}^{n(t+\delta t)}X_i\Big),$

$= \mu_n\Big( \phi'(t)+o(1)=\frac{1}{n\delta t}\sum_{i=1}^{n\delta t}X_i\Big)\sim e^{-n\delta t\Lambda^*(\phi'(t))},$

by Cramer. Now we can use independence:

$\mu_n(Z_n\approx \phi)=\prod_{\delta t}e^{-n\delta t \Lambda^*(\phi'(t))}=e^{-\sum_{\delta t}n\delta t \Lambda^*(\phi'(t))}\approx e^{-n\int_0^1 \Lambda^*(\phi'(t))dt},$

as in fact is given by Mogulskii.
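For a piecewise-linear $\phi$ the sum over small intervals in the heuristic above is exact, which makes the rate function easy to evaluate. The sketch below does this; the function names, the representation of a path as a list of (time, value) knots, and the Gaussian rate $\Lambda^*(x)=x^2/2$ (i.e. standard Gaussian increments) are all assumptions for illustration.

```python
def rate_piecewise_linear(knots, rate):
    """I(phi) for a piecewise-linear phi given as [(t_0, y_0), ..., (t_k, y_k)].

    Assumes t_0 = 0, t_k = 1 and strictly increasing times.
    Returns infinity if the path does not start at (0, 0).
    """
    if knots[0] != (0.0, 0.0):
        return float("inf")
    total = 0.0
    for (t0, y0), (t1, y1) in zip(knots, knots[1:]):
        slope = (y1 - y0) / (t1 - t0)
        # Each linear piece contributes (length of interval) * Lambda^*(slope).
        total += (t1 - t0) * rate(slope)
    return total


def gaussian_rate(x):
    """Cramer rate function for standard Gaussian increments (an assumption)."""
    return 0.5 * x * x


# Two paths with the same endpoint phi(1) = 1.
straight = [(0.0, 0.0), (1.0, 1.0)]
front_loaded = [(0.0, 0.0), (0.5, 1.0), (1.0, 1.0)]
```

Under these assumptions the straight path has rate 0.5 while the front-loaded path, which climbs at slope 2 and then stays flat, has rate 1.0, so the latter is exponentially less likely.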

Remarks

1) The absolutely continuous requirement is useful. We really wouldn’t want to be examining carefully the tail of the underlying distribution to see whether it is possible on an exponential scale that o(n) consecutive RVs would have sum O(n).

2) In general $\Lambda^*(x)$ will be convex, which has applications as well as playing a useful role in the proof. Recalling den Hollander’s mantra, we are interested to see where infima hold for LD sets in the host space. So for the event that the empirical mean is greater than some threshold larger than the expectation, Cramer’s theorem told us that this is exponentially the same as the event that the empirical mean is roughly equal to the threshold. Now Mogulskii’s theorem says more. By convexity, we know that the integral functional for the rate function is minimised by straight lines. So we learn that the contributions to the large deviation are spread roughly equally through the sample. Note that this is NOT saying that all the random variables will have the same higher than expected value. The LDP takes no account of fluctuations in the path on a scale smaller than n. It does however rule out both of the situations pictured a long way up the page. We should expect to see roughly a straight line, with unexpectedly steep gradient.

3) The proof as given in Dembo and Zeitouni is quite involved. There are a few stages, the first and simplest of which is to show that it doesn’t matter on an exponential scale whether we interpolate linearly or step-wise. Later in the proof we will switch back and forth at will. The next step is to show the LDP for the finite-dimensional problem given by evaluating the path at finitely many points in [0,1]. A careful argument via the Dawson-Gartner theorem allows lifting of the finite-dimensional projections back to the space of general functions with the topology of pointwise convergence. It remains to prove that the rate function is indeed the supremum of the rate functions achieved on projections. Convexity of $\Lambda^*(x)$ is very useful here for the upper bound, and this is where it comes through that the rate function is infinite when the comparison path is not absolutely continuous. To lift to the finer topology of $L_\infty([0,1])$ requires only a check of exponential tightness in the finer space, which follows from Arzela-Ascoli after some work.
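For the claim in remark 2 that straight lines minimise the rate function, here is a short sketch via Jensen’s inequality, assuming $\Lambda^*$ is convex and $\phi\in\mathcal{AC}$ with $\phi(0)=0$ and a prescribed endpoint $\phi(1)=a$ (the symbol $a$ is introduced here for illustration):

$\int_0^1 \Lambda^*(\phi'(t))dt\;\geq\;\Lambda^*\Big(\int_0^1 \phi'(t)dt\Big)=\Lambda^*(\phi(1)-\phi(0))=\Lambda^*(a),$

with equality when $\phi(t)=at$ . So among all paths ending at $a$ , the straight line attains the infimum of $I$ , and its rate agrees with Cramer’s rate for the sample mean being roughly $a$ .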

In conclusion, it is fairly tricky to prove even this most straightforward case, so unsurprisingly it is hard to extend to the natural case where the distributions of the underlying RVs (X) change continuously in time, as we will want for the analysis of more combinatorial objects. Next time I will consider why it is hard but potentially interesting to consider with adaptations of these techniques an LDP for the size of the largest component in a sparse random graph near criticality.

Large Deviations 1 – Motivation and Cramer’s Theorem
Large Deviations 2 – LDPs, Rate Functions and Lower Semi-Continuity
Large Deviations 3 – Gartner-Ellis Theorem
Large Deviations 4 – Sanov’s Theorem

Eventually Almost Everywhere

A blog about probability and olympiads by Dominic Yeo
