Large Deviations 5 – Stochastic Processes and Mogulskii’s Theorem

Motivation

In the previous posts about Large Deviations, most of the emphasis has been on the theory. To summarise briefly, we have a natural idea that for a family of measures supported on the same metric space, increasingly concentrated as some index grows, we might expect the probability of seeing values in a set not containing the limit in distribution to decay exponentially. The canonical example is the sample mean of a family of IID random variables, as treated by Cramer’s theorem.

It becomes apparent that it will not be enough to specify the exponent for a given large deviation event just by taking the infimum of the rate function, so we have to define an LDP topologically, with different behaviour on open and closed sets. Now we want to find some LDPs for more complicated measures, but which will have genuinely non-trivial applications. The key idea in all of this is that the infimum present in the definition of an LDP doesn’t just specify the rate function, it also might well give us some information about the configurations or events that lead to the LDP.

The slogan for the LDP as in Frank den Hollander’s excellent book is: “A large deviation event will happen in the least unlikely of all the unlikely ways.” This will be useful when our underlying space is a bit more complicated.

Setup

As a starting point, consider the set-up for Cramer’s theorem, with IID X_1,\ldots,X_n. But instead of investigating LD behaviour for the sample mean, we investigate LD behaviour for the whole set of RVs. There is a bijection between sequences and the partial sums process, so we investigate the partial sums process, rescaled appropriately. For the moment this is a sequence not a function or path (continuous or otherwise), but in the limit it will be, and furthermore it won’t make too much difference whether we interpolate linearly or step-wise.

Concretely, we consider the rescaled random walk:

Z_n(t):=\tfrac{1}{n}\sum_{i=1}^{[nt]}X_i,\quad t\in[0,1],

with laws \mu_n supported on L_\infty([0,1]). Note that the expected behaviour is a straight line from (0,0) to (1,\mathbb{E}X_1). In fact we can say more than that. By Donsker’s theorem we have a functional version of a central limit theorem, which says that deviations from this expected behaviour are given by suitably scaled Brownian motion:

\sqrt{n}\left(\frac{Z_n(t)-t\mathbb{E}X_1}{\sqrt{\text{Var}(X_1)}}\right)\quad\stackrel{d}{\rightarrow}\quad B(t),\quad t\in[0,1].

This is what we expect ‘standard’ behaviour to look like:

[Figure: mog1]
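To get a feel for this scaling, here is a minimal simulation sketch. The Gaussian increments, the seed and the helper names `rescaled_walk` and `max_deviation_from_line` are my own choices for illustration, not anything forced by the theory. It builds Z_n on the grid k/n and measures how far the path strays from the straight line t\mapsto t\mathbb{E}X_1.

```python
import math
import random


def rescaled_walk(increments):
    """Return [Z_n(k/n) for k = 0..n], where Z_n(t) = (1/n) * sum_{i <= nt} X_i."""
    n = len(increments)
    path = [0.0]
    total = 0.0
    for x in increments:
        total += x
        path.append(total / n)
    return path


def max_deviation_from_line(path, mean):
    """Largest distance between the path and the straight line t -> t * mean."""
    n = len(path) - 1
    return max(abs(z - (k / n) * mean) for k, z in enumerate(path))


rng = random.Random(0)  # fixed seed, purely for reproducibility
mean, sd = 1.0, 1.0     # assumed Gaussian increments; any finite-variance law would do
for n in (100, 10_000, 100_000):
    xs = [rng.gauss(mean, sd) for _ in range(n)]
    dev = max_deviation_from_line(rescaled_walk(xs), mean)
    # Donsker suggests that dev * sqrt(n) stays of order 1 as n grows.
    print(n, dev, dev * math.sqrt(n))
```

The last printed column should stay of order 1 while the middle column shrinks, which is exactly the statement that typical fluctuations of Z_n live on the 1/\sqrt{n} scale.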

The deviations of Z_n from a straight line are on a scale of 1/\sqrt{n} (equivalently, \sqrt{n} for the unscaled partial sums). Here are two examples of potential large deviation behaviour:

[Figure: mog2]

Or this:

[Figure: mog3]

Note that these are qualitatively different. In the first case, the first half of the random variables are in general much larger than the second half, which appear to have empirical mean roughly 0. In the second case, a large deviation in overall mean is driven by a single very large value. It is obviously of interest to find out what the probabilities of each of these possibilities are.

We can do this via an LDP for (\mu_n). Now it is really useful to be working in a topological context with open and closed sets. It will turn out that the rate function is supported on absolutely continuous functions, whereas obviously for finite n, none of the sample paths are continuous!

We assume that \Lambda(\lambda) is the logarithmic moment generating function of X_1 as before, with \Lambda^*(x) the Fenchel-Legendre transform. Then the key result is:

Theorem (Mogulskii): The measures (\mu_n) satisfy an LDP on L_\infty([0,1]) with good rate function:

I(\phi)=\begin{cases}\int_0^1 \Lambda^*(\phi'(t))dt,&\quad \text{if }\phi\in\mathcal{AC}, \phi(0)=0,\\ \infty&\quad\text{otherwise,}\end{cases}

where AC is the space of absolutely continuous functions on [0,1]. Note that AC is dense in the space of continuous functions on [0,1] under the uniform norm, so any open set containing a continuous function with \phi(0)=0 also contains a \phi for which I(\phi) is at least in principle finite. (Obviously, if \Lambda^* is not finite everywhere, then extra restrictions on \phi' are required.)

The following picture may be helpful at providing some motivation:

[Figure: CptPath2]

So what is going on is that if we take a path and zoom in on some small interval around a point, note first that behaviour on this interval is independent of behaviour everywhere else. Then the gradient at the point is the local empirical mean of the random variables around this point in time. The probability that this differs from the actual mean is given by Cramer’s rate function applied to the empirical mean, so we obtain the rate function for the whole path by integrating.

More concretely, but still very informally, suppose there is some \phi'(t)\neq \mathbb{E}X, then this says that:

Z_n(t+\delta t)-Z_n(t)=\phi'(t)\delta t+o(\delta t),

\Rightarrow\quad \mu_n\Big(\phi'(t)\delta t+o(\delta t)=\frac{1}{n}\sum_{i=nt+1}^{n(t+\delta t)}X_i\Big),

= \mu_n\Big( \phi'(t)+o(1)=\frac{1}{n\delta t}\sum_{i=1}^{n\delta t}X_i\Big)\sim e^{-n\delta t\Lambda^*(\phi'(t))},

by Cramer. Now we can use independence:

\mu_n(Z_n\approx \phi)=\prod_{\delta t}e^{-n\delta t \Lambda^*(\phi'(t))}=e^{-\sum_{\delta t}n\delta t \Lambda^*(\phi'(t))}\approx e^{-n\int_0^1 \Lambda^*(\phi'(t))dt},

as in fact is given by Mogulskii.
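For a piecewise linear path the integral in the rate function collapses to a finite sum, which makes it easy to play with. The sketch below assumes standard normal increments, for which \Lambda^*(x)=x^2/2; that choice, and the names `gaussian_rate` and `path_rate`, are purely illustrative. It compares a straight line from (0,0) to (1,1) with a path that does all of its climbing in the first half.

```python
def gaussian_rate(x):
    """Lambda^*(x) = x^2 / 2 for standard normal increments (an assumed example)."""
    return 0.5 * x * x


def path_rate(knots, rate=gaussian_rate):
    """I(phi) for a piecewise linear phi given as [(t_0, y_0), ..., (t_m, y_m)].

    On each linear piece phi' is constant, so the integral of Lambda^*(phi')
    reduces to a finite sum of (length of piece) * Lambda^*(slope).
    """
    if knots[0] != (0.0, 0.0):
        return float("inf")  # the rate function is infinite unless phi(0) = 0
    total = 0.0
    for (t0, y0), (t1, y1) in zip(knots, knots[1:]):
        dt = t1 - t0
        total += dt * rate((y1 - y0) / dt)
    return total


straight = [(0.0, 0.0), (1.0, 1.0)]
front_loaded = [(0.0, 0.0), (0.5, 1.0), (1.0, 1.0)]
print(path_rate(straight))      # 0.5
print(path_rate(front_loaded))  # 1.0
```

So under this Gaussian assumption the front-loaded path costs twice as much on the exponential scale as the straight line with the same endpoint.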

Remarks

1) The absolutely continuous requirement is useful. We really wouldn’t want to be examining carefully the tail of the underlying distribution to see whether it is possible on an exponential scale that o(n) consecutive RVs would have sum O(n).

2) In general \Lambda^*(x) will be convex, which has applications as well as playing a useful role in the proof. Recalling den Hollander’s mantra, we are interested to see where infima hold for LD sets in the host space. So for the event that the empirical mean is greater than some threshold larger than the expectation, Cramer’s theorem told us that this is exponentially the same as the event that the empirical mean is roughly equal to the threshold. Now Mogulskii’s theorem says more. By convexity (via Jensen’s inequality), we know that among paths with given endpoints the integral functional for the rate function is minimised by straight lines. So we learn that the contributions to the large deviation are spread roughly equally through the sample. Note that this is NOT saying that all the random variables will have the same higher than expected value. The LDP takes no account of fluctuations in the path on a scale smaller than n. It does however rule out both of the situations pictured a long way up the page. We should expect to see roughly a straight line, with unexpectedly steep gradient.

3) The proof as given in Dembo and Zeitouni is quite involved. There are a few stages, the first and simplest of which is to show that it doesn’t matter on an exponential scale whether we interpolate linearly or step-wise. Later in the proof we will switch back and forth at will. The next step is to show the LDP for the finite-dimensional problem given by evaluating the path at finitely many points in [0,1]. A careful argument via the Dawson-Gartner theorem allows lifting of the finite-dimensional projections back to the space of general functions with the topology of pointwise convergence. It remains to prove that the rate function is indeed the supremum of the rate functions achieved on projections. Convexity of \Lambda^*(x) is very useful here for the upper bound, and this is where it comes through that the rate function is infinite when the comparison path is not absolutely continuous. To lift to the finer topology of L_\infty([0,1]) requires only a check of exponential tightness in the finer space, which follows from Arzela-Ascoli after some work.

In conclusion, it is fairly tricky to prove even this most straightforward case, so unsurprisingly it is hard to extend to the natural case where the distributions of the underlying RVs (X) change continuously in time, as we will want for the analysis of more combinatorial objects. Next time I will consider why it is hard but potentially interesting to consider with adaptations of these techniques an LDP for the size of the largest component in a sparse random graph near criticality.

Poisson Tails

I’ve had plenty of ideas for potential probability posts recently, but have been a bit too busy to write any of them up. I guess that’s a good thing in some sense. Anyway, this is a quick remark based on an argument I was thinking about yesterday. It combines Large Deviation theory, which I have spent a lot of time learning about this year, and the Poisson process, which I have spent a bit of time teaching.

Question

Does the Poisson distribution have an exponential tail? I ended up asking this question for two completely independent reasons yesterday. Firstly, I’ve been reading up about some more complex models of random networks. Specifically, the Erdos-Renyi random graph is an interesting mathematical structure in its own right, but the independent edge condition results in certain regularity properties which are not seen in many real-world networks. In particular, the degree sequence of real-world networks typically follows an approximate power law. That is, the tail is heavy. This corresponds to our intuition that most networks contain ‘hubs’ which are connected to a large region of the network. Think about key servers or websites like Wikipedia and Google which are linked to by millions of other pages, or the social butterfly who will introduce friends from completely different circles. In any case, this property is not observed in an Erdos-Renyi graph, where the degrees are binomial, and in the sparse situation, rescale in the limit to a Poisson distribution. So, to finalise this observation, we want to be able to prove formally that the Poisson distribution has an exponential (so faster than power-law) tail.

The second occurrence of this question concerns large deviations for the exploration process of a random graph. This is a topic I’ve mentioned elsewhere (here for the exploration process, here for LDs) so I won’t recap extensively now. Anyway, the results we are interested in give estimates for the rate of decay in probability for the event that the path defined by the exploration process differs substantially from the expected path as n grows. A major annoyance in this analysis is the possibility of jumps. A jump occurs if a set of o(n) adjacent underlying random variables (here, the increments in the exploration process) have O(n) sum. A starting point might be to consider whether O(1) adjacent RVs can have O(n) sum, or indeed whether a single Poisson random variable can have size of order n. In practice, this asks whether the probability \mathbb{P}(X>\alpha n) decays faster than exponentially in n. If it does, then such jumps are negligible on a large deviations scale. If it decays exactly exponentially in n, then we have to consider such jumps in the analysis.

Approach

We can give a precise statement of the probabilities that a Po(\lambda) random variable X returns a given integer value:

\mathbb{P}(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}.

Note that these are the terms in the Taylor expansion of e^{\lambda} appropriately normalised. So, while it looks like it should be possible to evaluate

\mathbb{P}(X>\alpha n)=e^{-\lambda}\sum_{k>\alpha n} \frac{\lambda^k}{k!},

this seems impossible to do directly, and it isn’t even especially obvious what a sensible bounding strategy might be.

The problem of estimating the form of the limit in probability of increasingly unlikely deviations from expected behaviour surely reminds us of Cramer’s theorem. But this and other LD theory is generally formulated in terms of n random variables displaying some collective deviation, rather than a single random variable, with the size of the deviation growing. But we can transform our problem into that form by appealing to the equivalent definitions of the Poisson process.

Recall that the Poisson process is the canonical description of, say, an arrivals process, where events in disjoint intervals are independent, and the expected number of arrivals in a fixed interval is proportional to the width of the interval, giving a well-defined notion of ‘rate’ as we would want. The two main ways to define the process are: 1) the times between arrivals are given by i.i.d. Exponential RVs with parameter \lambda equal to the rate; and 2) the number of arrivals in interval [s,t] is independent of all other times, and has distribution given by Po(\lambda(t-s)). The fact that these definitions give the same well-defined process is not necessarily obvious, but let’s not discuss that further here.

So the key equivalence to be exploited is that the event X\geq n for X\sim \text{Po}(\lambda) is a statement that there are at least n arrivals by time 1. If we move to the exponential inter-arrival times definition, we can write the probability of this as:

\mathbb{P}(Z_1+\ldots+Z_n<1),

where the Z’s are the i.i.d. exponential random variables. But this is exactly what we are able to specify through Cramer’s theorem. Recall that the moment generating function of an exponential distribution is not finite everywhere, but that doesn’t matter as we construct our rate function by taking the supremum over some index t of:

I(x)=\sup_t (xt-\log \mathbb{E}e^{tZ_1})=\sup_t(xt-\log(\frac{\lambda}{\lambda-t})).

A simple calculation then gives

I(x)=\lambda x-1 - \log (\lambda x).

\Rightarrow I(x)\uparrow \infty\text{ as }x\downarrow 0.
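As a sanity check on the "simple calculation", here is a small numerical sketch comparing the closed form with a brute-force supremum over t<\lambda. The grid search, its cut-off and the function names are my own illustrative assumptions; the maximiser is at t=\lambda-1/x, which the grid is chosen to cover.

```python
import math


def exp_rate_closed_form(x, lam):
    """I(x) = lam*x - 1 - log(lam*x), the closed form quoted in the text."""
    return lam * x - 1.0 - math.log(lam * x)


def exp_rate_numeric(x, lam, steps=200_000):
    """Approximate sup_t [x*t - log(lam / (lam - t))] by a grid search over t < lam."""
    best = -math.inf
    lo = lam - 50.0 / x  # hypothetical lower cut-off, well below the maximiser lam - 1/x
    for k in range(steps):
        t = lo + (lam - lo) * k / steps  # the grid never reaches t = lam
        best = max(best, x * t - math.log(lam / (lam - t)))
    return best


lam = 2.0
for x in (0.01, 0.1, 0.5, 1.0):
    print(x, exp_rate_closed_form(x, lam), exp_rate_numeric(x, lam))
```

The two columns should agree to several decimal places, and the values blow up as x gets close to 0, in line with the displayed limit.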

Note that I(1) is the same for both Exp(\lambda) and Po(\lambda), because of the PP equality of events:

\{Z_1+\ldots+Z_n\leq n\}=\{\text{Po}(\lambda n)=\text{Po}(\lambda)_1+\ldots+\text{Po}(\lambda)_n\geq n\},

similar to the previous argument. In particular, for all \epsilon>0,

\mathbb{P}(\text{Po}(\lambda)>n)\leq\mathbb{P}\Big(\frac{Z_1+\ldots+Z_n}{n}\leq\frac{1}{n}\Big)\leq\mathbb{P}\Big(\frac{Z_1+\ldots+Z_n}{n}<\epsilon\Big),\text{ for large }n.

\mathbb{P}(\text{Po}(\lambda)>n)=O(e^{-nI(\epsilon)}),\text{ for all }\epsilon.

Since we can take I(\epsilon) as large as we want, we conclude that the probability decays faster than exponentially in n.
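We can also see this numerically. The sketch below evaluates \frac{1}{n}\log\mathbb{P}(\text{Po}(\lambda)>n) by summing the probability mass function in log space. The truncation after 200 terms is an assumption of mine, harmless when n is at least \lambda since the terms then decay at least geometrically, and the function name is hypothetical.

```python
import math


def log_poisson_tail(n, lam, extra_terms=200):
    """log P(X > n) for X ~ Po(lam), summing pmf terms k = n+1, ..., n+extra_terms."""
    logs = [
        -lam + k * math.log(lam) - math.lgamma(k + 1)
        for k in range(n + 1, n + 1 + extra_terms)
    ]
    m = max(logs)  # log-sum-exp trick to avoid underflow
    return m + math.log(sum(math.exp(v - m) for v in logs))


lam = 1.0
for n in (10, 100, 1_000, 10_000):
    # If the tail were exactly exponential, this ratio would settle down to a constant.
    print(n, log_poisson_tail(n, lam) / n)
```

The printed ratio keeps decreasing rather than converging, which is what super-exponential decay looks like on this scale.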

Large Deviations 4 – Sanov’s Theorem

Although we could have defined things for a more general topological space, most of our thoughts about Cramer’s theorem, and the Gartner-Ellis theorem which generalises it, are based on means of real-valued random variables. For Cramer’s theorem, we genuinely are interested only in means of i.i.d. random variables. In Gartner-Ellis, one might say that we are able to relax the condition on independence and perhaps identical distribution too, in a controlled way. But this is somewhat underselling the theorem: using G-E, we can deal with a much broader category of measures than just means of collections of variables. The key is that convergence of the log moment generating function is exactly enough to give an LDP with some rate, and we have a general method for finding the rate function.

So, Gartner-Ellis provides a fairly substantial generalisation to Cramer’s theorem, but is still similar in flavour. But what if we look for additional properties of a collection of i.i.d. random variables (X_n)? After all, the mean is not the only interesting property. One thing we could look at is the actual values taken by the X_ns. If the underlying distribution is continuous, this is not going to give much more information than what we started with. With probability 1, \{X_1,\ldots,X_n\} is a set of size n, with distribution given by the product of the underlying measure. However, if the random variables take values in a discrete set, or better still a finite set, then (X_1,\ldots,X_n) gives a so-called empirical distribution.

As n grows towards infinity, we expect this empirical distribution to approximate the real underlying distribution fairly well. This isn’t necessarily quite as easy as it sounds. By the strong law of large numbers applied to indicator functions 1(X_i\leq t), the empirical cdf at t converges almost surely to the true cdf at t. To guarantee that this convergence is uniform in t is tricky in general (for reference, see the Glivenko-Cantelli theorem), but is clear for random variables defined on finite sets, and it seems reasonable that an extension to discrete sets should be possible.

So such empirical distributions might well admit an LDP. Note that in the case of Bernoulli random variables, the empirical distribution is in fact exactly equivalent to the empirical mean, so Cramer’s theorem applies. But, in fact we have a general LDP for empirical distributions. I claim that the main point of interest here is the nature of the rate function – I will discuss why the existence of an LDP is not too surprising at the end.

The rate function is going to be interesting whatever form it ends up taking. After all, it is effectively going to be some sort of metric on measures, as it records how far a possible empirical measure is from the true distribution. Apart from total variation distance, we don’t currently have many standard examples for metrics on a space of measures. Anyway, the rate function is the main content of Sanov’s theorem. This has various forms, depending on how fiddly you are prepared for the proof to be.

Define L_n:=\frac{1}{n}\sum_{i=1}^n \delta_{X_i}\in\mathcal{M}_1(E) to be the empirical measure generated by X_1,\ldots,X_n. Then L_n satisfies an LDP on \mathcal{M}_1(E) with rate n and rate function given by H(\cdot|\mu), where \mu is the underlying distribution.

The function H is the relative entropy, defined by:

H(\nu|\mu):=\int_E \log\frac{\nu(x)}{\mu(x)}d\nu(x),

whenever \nu<<\mu, and \infty otherwise. We can see why this absolute continuity condition is required from the statement of the LDP. If the underlying distribution \mu has measure zero on some set A, then the observed values will not be in A with probability 1, and so the empirical measure will be zero on A also.

Note that an alternative form is:

H(\nu|\mu)=\int_E \frac{\nu(x)}{\mu(x)}\log\frac{\nu(x)}{\mu(x)}d\mu(x)=\mathbb{E}_\mu\frac{\nu(x)}{\mu(x)}\log\frac{\nu(x)}{\mu(x)}.

Perhaps it is more clear why this expectation is something we would want to minimise.

In particular, if we want to know the most likely asymptotic empirical distribution inducing a large deviation empirical mean (as in Cramer), then we find the distribution with suitable mean, and smallest entropy relative to the true underlying distribution.

A remark on the proof. If the underlying set of values is finite, then a proof of this result is essentially combinatorial. The empirical distribution is some multinomial distribution, and we can obtain exact forms for everything and then proceed with asymptotic approximations.
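In the finite case this is easy to try out by hand. The following sketch computes the exact multinomial probability that the empirical measure equals a given \nu and compares -\frac{1}{n}\log of that probability with H(\nu|\mu). The three-point distributions and the function names are made up for illustration.

```python
import math


def relative_entropy(nu, mu):
    """H(nu | mu) = sum_x nu(x) log(nu(x) / mu(x)) on a finite set; infinite unless nu << mu."""
    total = 0.0
    for p, q in zip(nu, mu):
        if p == 0.0:
            continue  # convention 0 log 0 = 0
        if q == 0.0:
            return math.inf
        total += p * math.log(p / q)
    return total


def log_prob_empirical(counts, mu):
    """log P(the sample has exactly these counts) under i.i.d. draws from mu (multinomial)."""
    n = sum(counts)
    out = math.lgamma(n + 1)
    for c, q in zip(counts, mu):
        if c > 0 and q == 0.0:
            return -math.inf  # impossible outcome
        out -= math.lgamma(c + 1)
        if c > 0:
            out += c * math.log(q)
    return out


mu = (0.5, 0.3, 0.2)  # assumed true distribution
nu = (0.2, 0.3, 0.5)  # assumed atypical empirical distribution
for n in (10, 100, 1_000, 10_000):
    counts = [round(n * p) for p in nu]
    print(n, -log_prob_empirical(counts, mu) / n, relative_entropy(nu, mu))
```

The middle column should approach the right-hand one as n grows, with a correction of order (\log n)/n coming from Stirling's formula.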

I said earlier that I would comment on why the LDP is not too surprising even in general, once we know Gartner-Ellis. Instead of letting X_i take values in whatever space we were considering previously, say the reals, consider instead the point mass function \delta_{X_i} which is effectively exactly the same random variable, only now defined on the space of probability measures. The empirical measure is then exactly:

\frac{1}{n}\sum_{i=1}^n \delta_{X_i}.

If the support K of the (X_i)s is finite, then in fact this space of measures is a convex subset of \mathbb{R}^K, and so the multi-dimensional version of Cramer’s theorem applies. In general, we can work in the possibly infinite-dimensional space [0,1]^K, and our relevant subset is compact, as a closed subset of a compact space (by Tychonoff). So the LDP in this case follows from our previous work.

Large Deviations 3 – Gartner-Ellis Theorem: Where do all the terms come from?

We want to drop the i.i.d. assumption from Cramer’s theorem, to get a criterion for a general LDP as defined in the previous post to hold.

Preliminaries

For general random variables (Z_n) on \mathbb{R}^d with laws (\mu_n), we will continue to have an upper bound like in Cramer’s theorem, provided the moment generating functions of Z_n converge as required. For analogy with Cramer, take Z_n=\frac{S_n}{n}. The Gartner-Ellis theorem gives conditions for the existence of a suitable lower bound and, in particular, when this is the same as the upper bound.

We define the logarithmic moment generating function

\Lambda_n(\lambda):=\log\mathbb{E}e^{\langle \lambda,Z_n\rangle},

and assume that the limit

\Lambda(\lambda)=\lim_{n\rightarrow\infty}\frac{1}{n}\Lambda_n(n\lambda)\in[-\infty,\infty],

exists for all \lambda\in\mathbb{R}^d. We also assume that 0\in\text{int}(\mathcal{D}_\Lambda), where \mathcal{D}_\Lambda:=\{\lambda\in\mathbb{R}^d:\Lambda(\lambda)<\infty\}. We also define the Fenchel-Legendre transform as before:

\Lambda^*(x)=\sup_{\lambda\in\mathbb{R}^d}\left[\langle x,\lambda\rangle - \Lambda(\lambda)\right],\quad x\in\mathbb{R}^d.

We say y\in\mathbb{R}^d is an exposed point of \Lambda^* if for some \lambda,

\langle \lambda,y\rangle - \Lambda^*(y)>\langle\lambda,x\rangle - \Lambda^*(x),\quad \forall x\in\mathbb{R}^d,\ x\neq y.

Such a \lambda is then called an exposing hyperplane. One way of thinking about this definition is that \Lambda^*(x) is convex, but is strictly convex in any direction at an exposed point. Alternatively, at an exposed point y, there is a vector \lambda such that \Lambda^*\circ \pi_\lambda has a global minimum or maximum at y, where \pi_\lambda is the projection into \langle \lambda\rangle. Roughly speaking, this vector is what we will use to take the Cramer transform for the lower bound at y. Recall that the Cramer transform is an exponential reweighting of the probability density, which makes a previously unlikely event into a normal one. We may now state the theorem.

Gartner-Ellis Theorem

With the assumptions above:

  1. \limsup_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(F)\leq -\inf_{x\in F}\Lambda^*(x), \forall F\subset\mathbb{R}^d closed.
  2. \liminf_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(G)\geq -\inf_{x\in G\cap E}\Lambda^*(x), \forall G\subset\mathbb{R}^d open, where E is the set of exposed points of \Lambda^* whose exposing hyperplane is in \text{int}(\mathcal{D}_\Lambda).
  3. If \Lambda is also lower semi-continuous, and is differentiable on \text{int}(\mathcal{D}_\Lambda) (which is non-empty by the previous assumption), and is steep, that is, for any \lambda\in\partial\mathcal{D}_\Lambda, \lim_{\nu\rightarrow\lambda}|\nabla \Lambda(\nu)|=\infty, then we may replace G\cap E by G in the second statement. Then (\mu_n) satisfies the LDP on \mathbb{R}^d with rate n and rate function \Lambda^*.

Where do all the terms come from?

As ever, because everything is on an exponential scale, the infimum in the statements affirms the intuitive notion that in the limit, “an unlikely event will happen in the most likely of the possible (unlikely) ways”. The reason why the first statement does not hold for open sets in general is that the infimum may not be attained for open sets. For the proof, we need an exposing hyperplane at x so we can find an exponential tilt (or Cramer transform) that makes x the standard outcome. Crucially, in order to apply probabilistic ideas to the resulting distribution, everything must be normalisable. So we need an exposing hyperplane so as to isolate the point x on an exponential scale in the transform. And the exposing hyperplane must be in \mathcal{D}_\Lambda if we are to have a chance of getting any useful information out of the transform. By convexity, this is equivalent to the exposing hyperplane being in \text{int}(\mathcal{D}_\Lambda).

Large Deviations 2 – LDPs, Rate Functions and Lower Semi-Continuity

Remarks from Cramer’s Theorem

So in the previous post we discussed Cramer’s theorem on large deviations for means of i.i.d. random variables. It’s worth stepping back and thinking more abstractly about what we showed. Each S_n has some law, which we think of as a measure on \mathbb{R}, though this could equally well be some other space, depending on where the random variables are supported. The law of large numbers asserts that as n\rightarrow\infty, these measures are increasingly concentrated at a single point in \mathbb{R}, which in this case is \mathbb{E}X_1. Cramer’s theorem then asserts that the measure of certain sets not containing this point of concentration decays exponentially in n, and quantifies the exponent, a so-called rate function, via a Legendre transform of the log moment generating function of the underlying distribution.

One key point is that we considered only certain sets [a,\infty),\,a>\mathbb{E}X_1, though we could equally well have considered (-\infty,a],\,a<\mathbb{E}X_1. What would happen if we wanted to consider an interval, say [a,b],\,\mathbb{E}X_1<a<b? Well, \mu_n([a,b])=\mu_n([a,\infty))-\mu_n((b,\infty)), and we might as well assume that \mu_n is sufficiently continuous, at least in the limit, that we can replace the open interval bound with a closed one. Then Cramer’s theorem asserts, written in a more informal style, that \mu_n([a,\infty))\sim e^{-nI(a)} and similarly for [b,\infty). So provided I(a)<I(b), we have

\mu_n([a,b])\sim e^{-nI(a)}-e^{-nI(b)}\sim e^{-nI(a)}.
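To see how quickly the second term becomes irrelevant, here is a tiny numerical sketch. The two rate values are arbitrary placeholders rather than values of any particular rate function, and the function name is mine.

```python
import math


def log_difference_rate(n, rate_a, rate_b):
    """(1/n) * log(exp(-n*rate_a) - exp(-n*rate_b)), assuming rate_a < rate_b."""
    # Factor out exp(-n*rate_a) so that nothing underflows for large n.
    return -rate_a + math.log1p(-math.exp(-n * (rate_b - rate_a))) / n


rate_a, rate_b = 0.1, 0.2  # placeholder values with rate_a < rate_b
for n in (1, 10, 100, 1_000):
    print(n, log_difference_rate(n, rate_a, rate_b))
```

The printed values approach -0.1, that is -I(a) in the notation above, so on the exponential scale only the smaller rate survives.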

In order to accord with our intuition, we would like I(x) to be increasing for x>\mathbb{E}X_1, and decreasing for x<\mathbb{E}X_1. Also, we want I(\mathbb{E}X_1)=0, to account for the fact that \mu_n([\mathbb{E}X_1,\infty))=O(1). For example, consider a sequence of coin tosses. The probability that the observed proportion of heads is in [\frac12,1] should be roughly 1/2 for all n.

Note that in the previous displayed equation for \mu_n([a,b]) the right hand side has no dependence on b. Informally, this means that any event which is at least as unlikely as the event of a deviation to a, will in the limit happen in the most likely of the unlikely ways, which will in this case be a deviation to a, because of relative domination of exponential functions. So if, rather than just half-lines and intervals, we wanted to consider more general sets, we might conjecture a result of the form:

\mu_n(\Gamma)\sim e^{-n\inf_{z\in\Gamma}I(z)},

with the approximation defined formally as in the statement of Cramer’s theorem. What can go wrong?

Large Deviations Principles

Well, if the set \Gamma=\{\gamma\} is a single point, and the underlying distribution is continuous, then we would expect \mu_n(\{\gamma\})=0 for all n. Similarly, we would expect \mu_n((\mathbb{E}X_1,\infty))\sim O(1), but there is no a priori reason why I(z) should be continuous at \mathbb{E}X_1 (in fact, this is false), so taking \Gamma=(\mathbb{E}X_1,\infty) again gives a contradiction.

So we need something a bit more precise. The problem here is that measure (in this case, measure of likeliness on an exponential scale) can leak into open sets through the boundary in the limit. Also, the rate function requires some sort of neighbourhood to make sense for continuous RVs, so boundaries of closed sets may give an overestimate. This is reminiscent of weak convergence, and motivated by this, the appropriate general definition for a Large Deviation Principle is:

A sequence of measures (\mu_n) on some space E satisfies an LDP with rate function I and speed n if \forall \Gamma\in \mathcal{B}(E):

-\inf_{x\in\Gamma^\circ}I(x)\leq \liminf \frac{1}{n}\log\mu_n(\Gamma)\leq \limsup\frac{1}{n}\log\mu_n(\Gamma)\leq -\inf_{x\in \bar{\Gamma}}I(x).

Although this might look very technical, you might as well think of it as nothing more than the previous conjecture for general sets, with the two problems that we mentioned now taken care of.

So, we need to define a rate function. I: E\rightarrow[0,\infty] is a rate function if it is not identically infinite. We also demand that it is lower semi-continuous, or equivalently that it has closed level sets \Psi_I^\alpha:=\{x\in E: I(x)\leq\alpha\}. I will say what lower semi-continuity is in a moment. Some authors also demand that the level sets be compact. Others call this a good rate function, or similar. The advantage of this is that infima on closed sets are attained.

It is possible to specify a different rate. The rate gives the speed of convergence. \frac 1 n can be replaced with any function converging to 0, including continuously.

Lower Semi-Continuity

A function f is lower semi-continuous if

f(x)\leq \liminf f(x_n),\text{ for all sequences }x_n\rightarrow x.

One way of thinking about this definition is to say that the function cannot jump upwards as it reaches a boundary, it can only jump downwards (or not jump at all). The article on Wikipedia for semi-continuity has this picture explaining how a lower semi-continuous function must behave at discontinuities. Note that the value of f at the discontinuity could be the blue dot, or anything less than the blue dot. It is reasonably clear why this definition is equivalent to having closed level sets.

So the question to ask is: why should rate functions be lower semi-continuous? Rather than proceeding directly, we argue by uniqueness. Given a function on \mathbb{R} with discontinuities, we can turn it into a cadlag function, or a caglad function by fiddling with the values taken at points of discontinuity. We can do a similar thing to turn any function into a lower semi-continuous function. Given f, we define

f_*(x):=\liminf_{x_n\rightarrow x}f(x_n)=\sup\{\inf_G f: x\in G, G \text{ open}\}.

The notes I borrowed this idea from described this as the maximal lower semi-continuous regularisation, which I think is quite a good explanation despite the long words.

Anyway, the claim is that if I(x) satisfies an LDP then so does I_*(x). This needs to be checked, but it explains why we demand that the rate function be lower semi-continuous. We really want the rate function to be unique, and this is a good way to prevent an obvious cause of non-uniqueness. It needs to be checked that it is actually unique once we have this assumption, but that is relatively straightforward.

So, to check that the lower semi-continuous regularisation of I satisfies the LDP if I does, we observe that the upper bound is trivial, since I_*\leq I everywhere. Then, for every open set G, note that for x\in G, I_*(x)=\liminf_{x_n\rightarrow x}I(x_n), so we might as well consider sequences within G, and so I_*(x)\geq \inf_G I. So, since I_*(x)\leq I(x), it follows that

\inf_G I_*=\inf_G I,

and thus we get the lower bound for the LDP.

References

The motivation for this particular post was my own, but the set of notes here, as cited in the previous post were very useful. Also the Wikipedia page on semi-continuity, and Frank den Hollander’s book ‘Large Deviations’.

Large Deviations 1 – Motivation and Cramer’s Theorem

I’ve been doing a lot of thinking about Large Deviations recently, in particular how to apply the theory to random graphs and related models. I’ve just been writing an article about some of the more interesting aspects, so thought it was probably worth turning it into a few posts.

Motivation

Given X_1,X_2,\ldots i.i.d. real-valued random variables with finite expectation, and S_n:=X_1+\ldots+X_n, the Weak Law of Large Numbers asserts that the empirical mean \frac{S_n}{n} converges in probability to \mathbb{E}X_1. So \mathbb{P}(S_n\geq n(\mathbb{E}X_1+\epsilon))\rightarrow 0 for any \epsilon>0. In fact, if \mathbb{E}X_1^2<\infty, we have the Central Limit Theorem, and a consequence is that \mathbb{P}(S_n\geq n\mathbb{E}X_1+n^\alpha)\rightarrow 0 whenever \alpha>\frac12.

In a concrete example, if we toss a coin some suitably large number of times, the probability that the proportion of heads will be substantially greater or smaller than \frac12 tends to zero. So the probability that at least \frac34 of the results are heads tends to zero. But how fast? Consider first four tosses, then twelve. A quick addition of the relevant terms in the binomial distribution gives:

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of four tosses are heads}\right)=\frac{1}{16}+\frac{4}{16}=\frac{5}{16},

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of twelve tosses are heads}\right)=\frac{1}{2^{12}}+\frac{12}{2^{12}}+\frac{66}{2^{12}}+\frac{220}{2^{12}}=\frac{299}{2^{12}}.

There are two observations to be made. The first is that the second probability is substantially smaller than the first – the decay appears to be relatively fast. The second observation is that \frac{220}{2^{12}} is substantially larger than the rest of the sum. So by far the most likely way for at least \tfrac34 out of twelve tosses to be heads is if exactly \tfrac34 are heads. Cramer’s theorem applies to a general i.i.d. sequence of RVs, provided the tail is not too heavy. It shows that the probability of any such large deviation event decays exponentially with n, and identifies the exponent.
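The two binomial sums above are easy to reproduce exactly. The following is a minimal sketch using exact rational arithmetic; the function names are my own, and it assumes n is a multiple of 4 so that \tfrac34 n is an integer.

```python
from fractions import Fraction
from math import comb

def tail_at_least_three_quarters(n):
    """Exact P(at least 3n/4 heads in n fair tosses); n is assumed to be a multiple of 4."""
    k_min = 3 * n // 4
    return Fraction(sum(comb(n, k) for k in range(k_min, n + 1)), 2 ** n)

def share_of_boundary_term(n):
    """Proportion of the tail probability contributed by exactly 3n/4 heads."""
    k_min = 3 * n // 4
    return Fraction(comb(n, k_min), 2 ** n) / tail_at_least_three_quarters(n)

# Tail probability and the share carried by the boundary term, for growing n.
for n in (4, 12, 40, 100):
    print(n, float(tail_at_least_three_quarters(n)), float(share_of_boundary_term(n)))
```

For n=4 and n=12 this recovers \frac{5}{16} and \frac{299}{2^{12}}, and the second column makes the dominance of the boundary term visible for larger n.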

Theorem (Cramer): Let (X_i) be i.i.d. real-valued random variables which satisfy \mathbb{E}e^{tX_1}<\infty for every t\in\mathbb{R}. Then for any a>\mathbb{E}X_1,

\lim_{n\rightarrow \infty}\frac{1}{n}\log\mathbb{P}(S_n\geq an)=-I(a),

\text{where}\quad I(z):=\sup_{t\in\mathbb{R}}\left[zt-\log\mathbb{E}e^{tX_1}\right].

Remarks

  • So, informally, \mathbb{P}(S_n\geq an)\sim e^{-nI(a)}.
  • I(z) is called the Fenchel-Legendre transform (or convex conjugate) of \log\mathbb{E}e^{tX_1}.
  • Considering t=0 confirms that I(z)\in[0,\infty].
  • In their extremely useful book, Dembo and Zeitouni present this theorem in greater generality, allowing X_i to be supported on \mathbb{R}^d, considering a more general set of large deviation events, and relaxing the requirement for finite mean, and thus also the finite moment generating function condition. All of this will still be a special case of the Gartner-Ellis theorem, which will be examined in a subsequent post, so we make do with this form of Cramer’s result for now.
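As a sanity check on the theorem in the coin-tossing example, one can approximate the supremum defining I numerically and compare it with the exact binomial tail. This is only a sketch: the ternary search and its bracket [-40,40] are my own choices, and it relies on the objective zt-\log\mathbb{E}e^{tX_1} being concave in t. For a fair coin the closed form is I(z)=z\log z+(1-z)\log(1-z)+\log 2 for z\in(0,1).

```python
from math import comb, exp, log

def log_mgf(t):
    """log E[exp(t X)] for X ~ Bernoulli(1/2), i.e. a single fair coin toss."""
    return log((1.0 + exp(t)) / 2.0)

def rate(z, lo=-40.0, hi=40.0, iters=200):
    """Approximate sup_t [z t - log_mgf(t)] by ternary search on a concave objective."""
    def objective(t):
        return z * t - log_mgf(t)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2.0)

def empirical_exponent(n):
    """-(1/n) log P(S_n >= 3n/4), computed exactly; n is assumed to be a multiple of 4."""
    tail_count = sum(comb(n, k) for k in range(3 * n // 4, n + 1))
    return -(log(tail_count) - n * log(2)) / n

closed_form = 0.75 * log(0.75) + 0.25 * log(0.25) + log(2)
print("I(3/4) numerically:", rate(0.75), " closed form:", closed_form)
for n in (40, 400, 4000):
    print(n, empirical_exponent(n))
```

The empirical exponents approach I(\tfrac34) from above, with a correction of order \frac{\log n}{n}, which is invisible at the exponential scale of the theorem.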

The proof of Cramer’s theorem splits into an upper bound and a lower bound. The former is relatively straightforward, applying Markov’s inequality to e^{tS_n}, then optimising over the choice of t. This idea is referred to by various sources as the exponential Chebyshev inequality or a Chernoff bound. The lower bound is more challenging. We reweight the distribution function F(x) of X_1 by a factor e^{tx}, then choose t so that the large deviation event is in fact now within the treatment of the CLT, from which suitable bounds are obtained.

To avoid overcomplicating this initial presentation, some details have been omitted. It is not clear, for example, whether I(x) should be finite whenever x is in the support of X_1. (It certainly must be infinite outside – consider the probability that 150% or -40% of coin tosses come up heads!) In order to call this a Large Deviation Principle, we also want some extra regularity on I(x), not least to ensure it is unique. This will be discussed in the next posts.

Analytic vs Probabilistic Arguments for a Supercritical BP

This follows on directly from the previous post. I was originally going to talk only about what follows, but I got rather carried away with the branching process account. I was stuck on a particular exercise, and we ended up coming up with two arguments: one analytic and one probabilistic. Since the typical flavour of this blog is to present problems which show the advantage of the probabilistic approach, it seems only fair to remark on this case, where the analytic method was less interesting, but much simpler.

Recall that we have a supercritical random graph G(n,\frac{\lambda}{n}), \lambda>1, and we are considering the rescaled exploration process S_{nt}, which has asymptotic mean \mu_t=1-t-e^{-\lambda t}. We can calculate similarly an expression for the asymptotic variance

\frac{\text{Var}(S_{nt})}{n}\rightarrow v_t=e^{-\lambda t}(1-e^{-\lambda t}).

To use this to verify the result about the size of the giant component, we verify that \mu_{\zeta_\lambda+x/\sqrt{n}} is negative, and has small variance, which would confirm that the giant component has size bounded above by \zeta_\lambda almost surely. A similar argument is required for the lower bound. The variance is a separate matter, but it is therefore necessary that \mu_t should be decreasing at t=\zeta_\lambda, that is \mu_{\zeta_\lambda}'=\lambda e^{-\lambda \zeta_\lambda}-1<0. This is what we try to prove in the remainder of this post. Recall that in the previous post we checked that \mu_t itself is equal to zero at t=\zeta_\lambda.

Heuristic Explanation

\mu_t has been rescaled from the original definition of the exploration process in both size and time-scale so some care is needed to see why this should hold in the limit. Remember that all components apart from the giant component are of size O(log n). So immediately after exhausting the giant component, you are likely to be visiting components of size roughly log n. A time interval of dt for \mu corresponds to ndt for S, during which S will visit some components of size log n and some of O(1) and some in between. In particular, some fixed proportion of vertices are isolated, that is, in a component of size 1.

There is then a complicated size-biasing train of thought. A component of size log n is more likely to come up than an isolated vertex, but there are not as many of them. The log n components push the derivative \mu_t' towards zero, because S_t decreases by 1 over a time-interval of length log n, which gives a gradient of zero in the limit. However, the isolated vertices give a gradient of -1, because S_t decreases by 1 over a time interval of 1. Despite the fact that log n intervals are likely to appear earlier, it still remains the case that after exhausting a component (in particular, at time t=\zeta_\lambda, after exhausting the giant component), with some bounded below positive probability you will choose an isolated vertex next. The component size only affects that time-scale if it is O(n), which none of the remaining components are, so the derivative \mu_{\zeta_\lambda}' consists of some complicated weighted mean of 0 and -1. In particular, it is negative.

Analytic solution

Obviously, that won’t do in practice. Suppressing lambdas for ease of notation, the key fact is: e^{-\lambda \zeta}=1-\zeta. We want to show that \lambda e^{-\lambda \zeta}<1. Substituting

\lambda=-\frac{\log(1-\zeta)}{\zeta},

means that it is required to show:

-\frac{1-\zeta}{\zeta}\log(1-\zeta)<1.

Differentiating the left hand side gives:

\frac{\log(1-\zeta)+\zeta}{\zeta^2}<0,

since of course \log(1-\zeta)=-\left(\zeta+\frac{\zeta^2}{2}+\frac{\zeta^3}{3}+\dots\right). So the left hand side is decreasing in \zeta, and it suffices to check the result for small \zeta. But, again using a Taylor series:

-\frac{1-\zeta}{\zeta}\log(1-\zeta)=1-\frac12\zeta+O(\zeta^2)<1,

for small \zeta. This gives the required result.
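The inequality is also easy to check numerically. Here is a rough sketch, assuming we find \zeta_\lambda by iterating \zeta\mapsto 1-e^{-\lambda\zeta} from \zeta=1 (my own choice of method; the iterates decrease to the positive root when \lambda>1, though slowly when \lambda is close to 1).

```python
from math import exp, log

def giant_fraction(lam, iters=2000):
    """Positive root of exp(-lam * zeta) = 1 - zeta for lam > 1, by fixed-point iteration."""
    zeta = 1.0
    for _ in range(iters):
        zeta = 1.0 - exp(-lam * zeta)
    return zeta

def lhs(zeta):
    """-(1 - zeta) / zeta * log(1 - zeta), which equals lam * (1 - zeta) at the root."""
    return -(1.0 - zeta) / zeta * log(1.0 - zeta)

# Columns: lambda, zeta, lambda * (1 - zeta), and the same quantity written in terms of zeta.
for lam in (1.1, 1.5, 2.0, 5.0):
    zeta = giant_fraction(lam)
    print(lam, zeta, lam * (1.0 - zeta), lhs(zeta))
```

In each case the third and fourth columns agree and are strictly less than 1, as the analytic argument predicts.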

Probabilistic Interpretation and Solution

First, we observe that \lambda e^{-\lambda\zeta}=\lambda(1-\zeta) is the expected number of vertices in the first generation of a \text{Po}(\lambda) branching process whose progeny become extinct. This motivates considering the canonical decomposition of a supercritical branching process Z into the skeleton process and the dual process. The skeleton Z^+ consists of all vertices which have infinitely many successors. It is relatively easy to show that this is a branching process with offspring distribution \text{Po}(\lambda\zeta) conditioned on being positive. The dual process Z^* is a G-W branching process with offspring distribution \text{Po}(\lambda) conditioned on dying. This is the same as a branching process with offspring distribution \text{Po}(\lambda(1-\zeta)), by a sprinkling argument, which says that if we begin with a Poisson number of things, then remove each one independently with some fixed probability, the remaining number of things is Poisson also.
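The Poisson removal fact used here can be checked directly from the probability mass functions. This is a sketch only: the parameter values and the truncation point n_max are arbitrary choices of mine, with n_max large enough that the neglected Poisson tail is negligible for these values.

```python
from math import comb, exp, factorial

def poisson_pmf(mean, k):
    """P(N = k) for N ~ Po(mean)."""
    return exp(-mean) * mean ** k / factorial(k)

def thinned_pmf(lam, keep, k, n_max=150):
    """P(k items remain) when a Po(lam) number of items are each kept independently w.p. keep."""
    return sum(
        poisson_pmf(lam, n) * comb(n, k) * keep ** k * (1.0 - keep) ** (n - k)
        for n in range(k, n_max + 1)
    )

# Compare the thinned distribution with Po(lam * keep), term by term.
lam, keep = 2.0, 0.3
for k in range(6):
    print(k, thinned_pmf(lam, keep, k), poisson_pmf(lam * keep, k))
```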

We can construct the original branching process by

  • With probability \zeta, take the skeleton, and affix independent copies of Z^* at every vertex in the skeleton.
  • With probability 1-\zeta, just take a copy of Z^*.

It is immediately clear that \lambda(1-\zeta)\leq 1. After all, the dual process is almost surely finite, so the offspring distribution cannot have expectation greater than 1. Checking that this inequality is strict is more fiddly. The best way I have come up with is to examine the tail of the distribution of total population size of the original branching process.

The total population size T of a branching process has an exponential tail if the offspring distribution is subcritical. It isn’t hugely surprising that this behaves like a large deviation for iid RVs, since in the limit such an event requires a lot of the offspring counts to deviate substantially from the mean. The same holds in the supercritical case, with the additional complication that though the finite tail decays exponentially, there is positive probability that the total size will be infinite. In the critical case, however, there is a power-law decay. This is not hugely surprising as it marks the threshold for the appearance of the infinite population, just as in a multiplicative coalescent at time 1, we have a load of very large components just about to form a giant component. The tool for all of these results is Dwass’s Theorem, which says:

\mathbb{P}(T=n)=\frac{1}{n}\mathbb{P}(X_1+\ldots+X_n=n-1),

where X_1,\ldots,X_n are iid with the offspring distribution. When \mathbb{E}X_1\neq 1, this is a large deviation event, for which Cramer’s theorem applies (assuming, as is the case for the Poisson distribution, that the offspring distribution has a finite moment generating function). When \mathbb{E}X_1=1, the Central Limit Theorem says that with high probability,

X_1+\ldots+X_n\in [n-n^{3/4},n+n^{3/4}],

so, skating over the details of whether everything is exactly uniform within this CLT scaling window,

\mathbb{P}(T=n)\geq \frac{1}{n}\cdot\frac{1}{2n^{3/4}}.

The true power-law decay is substantially slower than this, but the above argument works as a back-of-the-envelope bound.

In particular, if the dual process has mean 1, then the population size of the original branching process is given by taking a distribution with exponential tail with some probability and a distribution with power-law tail with some probability. Obviously the power-law will dominate, which contradicts the assumption that the original branching process was supercritical, and so has an exponential tail.
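For Poisson offspring, Dwass’s Theorem can be evaluated explicitly, since X_1+\ldots+X_n is then \text{Po}(n\mu). The sketch below (function name and parameter values are my own) checks three things numerically: the probabilities sum to 1 in a subcritical case, they sum to the extinction probability in a supercritical case, and in the critical case n^{3/2}\mathbb{P}(T=n) approaches \frac{1}{\sqrt{2\pi}}, which is what Stirling’s formula gives.

```python
from math import exp, lgamma, log, pi, sqrt

def total_progeny_pmf(mu, n):
    """P(T = n) for Po(mu) offspring via Dwass: (1/n) P(Po(n mu) = n - 1), in log space."""
    log_p = -n * mu + (n - 1) * log(n * mu) - lgamma(n) - log(n)
    return exp(log_p)

# Subcritical: T is almost surely finite, so the masses should sum to 1.
print(sum(total_progeny_pmf(0.5, n) for n in range(1, 2001)))

# Supercritical: the masses sum to the extinction probability 1 - zeta < 1.
print(sum(total_progeny_pmf(2.0, n) for n in range(1, 2001)))

# Critical: power-law decay, with n^{3/2} P(T = n) close to 1 / sqrt(2 pi).
for n in (10, 100, 10000):
    print(n, n ** 1.5 * total_progeny_pmf(1.0, n), 1.0 / sqrt(2.0 * pi))
```

So the critical tail decays like n^{-3/2}, slower than the n^{-7/4} in the back-of-the-envelope bound, while in the other two cases the terms vanish exponentially fast.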

Branching Processes and Dwass’s Theorem

This is something I had to think about when writing my Part III essay, and it turns out to be relevant to some of the literature I’ve been reading this week. The main result is hugely helpful for reducing a potentially complicated combinatorial object to a finite sum of i.i.d. random variables, which in general we do know quite a lot about. I was very pleased with the proof I came up with while writing the essay, even if in the end it turned out to have appeared elsewhere before. (Citation at end)

Galton-Watson processes

A Galton-Watson process is a stochastic process describing a simple model for evolution of a population. At each stage of the evolution, a new generation is created as every member of the current generation produces some number of ‘offspring’ with identical and independent (both across all generations and within generations) distributions. Such processes were introduced by Galton and Watson to examine the evolution of surnames through history.

More precisely, we specify an offspring distribution, a probability distribution supported on \mathbb{N}_0. Then define a sequence of random variables (Z_n,n\in\mathbb{N}) by:

Z_{n+1}=Y_1^n+\ldots+Y_{Z_n}^n,

where (Y_k^n,k\geq 1,n\geq 0) is a family of i.i.d. random variables with the offspring distribution Y. We say Z_n is the size of the nth generation. From now on, assume Z_0=1 and then we call (Z_n,n\geq 0) a Galton-Watson process. We also define the total population size to be

X:=Z_0+Z_1+Z_2+\ldots,

noting that this might be infinite. We refer to the situation where X<\infty as extinction, and can show that extinction occurs almost surely when \mathbb{E}Y\leq 1, excepting the trivial case Y=\delta_1. The strict inequality parts are as you would expect. We say the process is critical if \mathbb{E}Y=1, and this is less obvious to visualise, but works equally well in the proof, which is usually driven using generating functions.
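The generating function argument can be illustrated numerically: the extinction probability is the smallest root in [0,1] of q=G(q), where G is the generating function of Y, and iterating G from 0 converges to it. The following is a sketch with three example offspring distributions on \{0,1,2\}, which are my own choices.

```python
def extinction_probability(offspring_pmf, iters=5000):
    """Smallest root in [0, 1] of q = G(q); offspring_pmf[k] is P(Y = k)."""
    q = 0.0
    for _ in range(iters):
        # One application of the generating function G(q) = sum_k P(Y = k) q^k.
        q = sum(p * q ** k for k, p in enumerate(offspring_pmf))
    return q

print("supercritical, mean 1.25:", extinction_probability([0.25, 0.25, 0.5]))
print("subcritical, mean 0.75:  ", extinction_probability([0.5, 0.25, 0.25]))
print("critical, mean 1:        ", extinction_probability([0.25, 0.5, 0.25]))
```

In the critical case the iterates do converge to 1, but only polynomially fast, which is one way to see that this case is more delicate.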

Total Population Size and Dwass’s Theorem

Of particular interest is X, the total population size, and its distribution. The following result gives a precise and useful link between the probability of the population having size n and the distribution of the sum of n RVs with the relevant offspring distribution. Among the consequences are that we can conclude immediately, by CLT and Cramer’s Large Deviations Theorem, that the total population size distribution has power-law decay in the critical case, and exponential decay otherwise.

Theorem (Dwass (1)): For a general branching process with a single time-0 ancestor and offspring distribution Y and total population size X:

\mathbb{P}(X=k)=\frac{1}{k}\mathbb{P}(Y^1+\ldots+ Y^k=k-1),\quad k\geq 1

where Y^1,\ldots,Y^k are independent copies of Y.
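Before the proof, it may be reassuring to check the identity exactly in a small case. The sketch below computes \mathbb{P}(X=k) in two ways: by the first-generation decomposition X=1+X_1+\ldots+X_Y, where the X_i are independent copies of X, and by the formula in the theorem. The offspring distribution on \{0,1,2\} and all function names are my own choices for illustration.

```python
from fractions import Fraction

def convolve(a, b, max_len):
    """Convolution of two mass functions given as lists, truncated to indices < max_len."""
    out = [Fraction(0)] * max_len
    for i, x in enumerate(a):
        if x == 0:
            continue
        for j, y in enumerate(b):
            if i + j >= max_len:
                break
            out[i + j] += x * y
    return out

def sum_pmf(pmf, copies, max_len):
    """Mass function of a sum of `copies` independent draws, truncated to indices < max_len."""
    result = [Fraction(1)] + [Fraction(0)] * (max_len - 1)  # point mass at 0
    for _ in range(copies):
        result = convolve(result, pmf, max_len)
    return result

def total_size_by_recursion(offspring, k_max):
    """P(X = k) for k <= k_max, using X = 1 + X_1 + ... + X_Y."""
    q = [Fraction(0)] * (k_max + 1)
    for k in range(1, k_max + 1):
        # Subtrees have size >= 1, so only the already-computed q[1..k-1] contribute here.
        q[k] = sum(p * sum_pmf(q, y, k_max + 1)[k - 1] for y, p in enumerate(offspring))
    return q

def total_size_by_dwass(offspring, k):
    """(1/k) P(Y^1 + ... + Y^k = k - 1)."""
    return sum_pmf(offspring, k, k)[k - 1] / k

offspring = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
q = total_size_by_recursion(offspring, 8)
for k in range(1, 9):
    print(k, q[k], total_size_by_dwass(offspring, k))
```

Working with exact fractions means the two columns can be compared for equality rather than up to rounding.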

We now give a proof via a combinatorial argument. The approach is similar to that given in (2). Much of the literature gives a proof using generating functions.

Proof: For motivation, consider the following. It is natural to consider a branching process as a tree, with the time-0 ancestor as the root. Suppose the event \{X=k\} holds, which means that the tree has k vertices. Now consider the numbers of offspring of each vertex in the tree. Since every vertex except the root has exactly one parent, and there are no vertices outside the tree, we must have Y^1+\ldots+Y^k=k-1 where Y^1,\ldots,Y^k are the offspring numbers in some order. However, observe that this is not sufficient. For example, if Y^1 is the number of offspring of the root, and k\geq 2, then we must have Y^1\geq 1.

Effective Bandwidth

Here, devices have fixed capacity, but packet sizes are random. So, we still have a capacity constraint for the links, but we accept that it won’t be possible to ensure that we stay within those limits all the time, and seek instead to minimise the probability that the limits are exceeded, while keeping throughput as high as possible.

An important result is Chernoff’s Bound: \mathbb{P}(Y\geq 0)\leq \inf_{s\geq 0}\mathbb{E}e^{sY}. The proof is very straightforward: apply Markov’s inequality to the non-negative random variable e^{sY}. So in particular \frac{1}{n}\log\mathbb{P}(X_1+\ldots+X_n\geq 0)\leq \inf_{s\geq 0} M(s), where M(s)=\log\mathbb{E}e^{sX}, and Cramer’s Theorem asserts that after taking a limit in n on the LHS, equality holds, provided \mathbb{E}X<0,\mathbb{P}(X>0)>0.
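To see the bound and the limiting equality in a concrete case, here is a sketch for steps X=+1 with probability p and X=-1 otherwise, with p<\frac12 so that \mathbb{E}X<0. The example distribution is my own choice. In this case the infimum is attained at s^*=\frac12\log\frac{1-p}{p} and equals \log(2\sqrt{p(1-p)}).

```python
from math import comb, exp, log, sqrt

def log_mgf(s, p):
    """M(s) = log E[exp(s X)] for X = +1 w.p. p and X = -1 w.p. 1 - p."""
    return log(p * exp(s) + (1.0 - p) * exp(-s))

def chernoff_exponent(p):
    """inf over s >= 0 of M(s), attained at s* = (1/2) log((1 - p) / p) when p < 1/2."""
    s_star = 0.5 * log((1.0 - p) / p)
    return log_mgf(s_star, p)

def exact_log_tail_per_step(n, p):
    """(1/n) log P(X_1 + ... + X_n >= 0), via the binomial number of +1 steps."""
    k_min = (n + 1) // 2  # the sum is >= 0 exactly when at least n/2 steps are +1
    tail = sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(k_min, n + 1))
    return log(tail) / n

p = 0.3
print("Chernoff exponent:", chernoff_exponent(p), " closed form:", log(2.0 * sqrt(p * (1.0 - p))))
for n in (10, 100, 400):
    print(n, exact_log_tail_per_step(n, p))
```

The exact values stay below the Chernoff exponent and creep up towards it as n grows, in line with Cramer’s Theorem.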

We assume that the traffic has the form S=\sum_{j=1}^J\sum_{i=1}^{n_j}X_{ji}, where the summands are independent, and for each j the X_{ji} are iid, interpreted as the n_j loads using source type j. We have

\log\mathbb{P}(S\geq C)\leq\log \mathbb{E}[e^{s(S-C)}]=\sum_{j=1}^Jn_jM_j(s)-sC

so \inf_{s\geq 0}(\sum n_jM_j(s)-sC)\leq -\gamma\quad\Rightarrow\quad \mathbb{P}(S\geq C)\leq e^{-\gamma}

so we want this to hold for large \gamma.

We might then choose to restrict attention to

A=\{n:\sum n_jM_j(s)-sC\leq-\gamma,\text{ some }s\geq 0\}

So, suppose we are operating near capacity, say with call profile n^* on (i.e. near) the boundary of A, with s^* the argmin of the above. Then the tangent plane is \sum n_jM_j(s^*)-s^*C=-\gamma, and since A’s complement is convex, it suffices to stay on the ‘correct’ side (i.e. halfspace) of this tangent plane.

We can rewrite this as \sum n_j\frac{M_j(s^*)}{s^*}\leq C-\frac{\gamma}{s^*}. Note that this is reasonable since s^* is fixed, and we call \frac{M_j(s)}{s}=:\alpha_j(s) the effective bandwidth. It is with respect to this average that we are bounding probabilities, hence ‘effective’.

Observe that \alpha_j(s) is increasing in s by Jensen: (\mathbb{E}e^X)^t\leq \mathbb{E}e^{tX} for t>1 implies that for t>s>0, (\mathbb{E}e^{sX})^t\leq(\mathbb{E}e^{tX})^s, and taking logarithms gives \frac{M_j(s)}{s}\leq\frac{M_j(t)}{t}.

In particular,

\mathbb{E}X\leq \alpha_j(s)\leq \text{ess sup}X
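As an illustration, consider a hypothetical on-off source, which sends a packet of size equal to its peak with probability p and nothing otherwise. This example is mine rather than part of the set-up above. The sketch evaluates \alpha(s)=\frac{M(s)}{s} on a grid of s, so that the monotonicity and the two bounds can be seen numerically.

```python
from math import expm1, log1p

def effective_bandwidth(s, p=0.2, peak=1.0):
    """alpha(s) = M(s) / s for a hypothetical on-off source: X = peak w.p. p, else 0.

    Here M(s) = log(1 - p + p * exp(s * peak)), written with log1p and expm1
    so that small values of s are handled accurately.
    """
    return log1p(p * expm1(s * peak)) / s

mean, peak = 0.2 * 1.0, 1.0
for s in (0.01, 0.1, 0.5, 1.0, 5.0, 20.0, 50.0):
    print(s, effective_bandwidth(s), "(mean", mean, ", peak", peak, ")")
```

Small s corresponds to charging the source roughly its mean rate, while large s charges it close to its peak rate.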