Large Deviations 6 – Random Graphs

As a final instalment in this sequence of posts on Large Deviations, I’m going to try and explain how one might be able to apply some of the theory to a problem about random graphs. I should explain in advance that much of what follows will be a heuristic argument only. In a way, I’m more interested in explaining what the technical challenges are than trying to solve them. Not least because at the moment I don’t know exactly how to solve most of them. At the very end I will present a rate function, and reference properly the authors who have proved this. Their methods are related but not identical to what I will present.

Problem

Recall the two standard definitions of random graphs. As in many previous posts, we are interested in the sparse case where the average degree of a vertex is o(1). Anyway, we start with n vertices, and in one description we add an edge between any pair of vertices independently and with fixed probability \frac{\lambda}{n}. In the second model, we choose uniformly at random from the set of graphs with n vertices and \frac{\lambda n}{2} edges. Note that if we take the first model and condition on the number of edges, we get the second model, since the probability of a given configuration appearing in G(n,p) is a function only of the number of edges present. Furthermore, the number of edges in G(n,p) is binomial with parameters \binom{n}{2} and p. For all purposes here it will make no difference to approximate the former by \frac{n^2}{2}.

Of particular interest in the study of sparse random graphs is the phase transition in the size of the largest component observed as \lambda passes 1. Below 1, the largest component has size on a scale of log n, and with high probability all components are trees. Above 1, there is a unique giant component containing \alpha_\lambda n vertices, and all other components are small. For \lambda\approx 1, where I don’t want to discuss what ‘approximately’ means right now, we have a critical window, for which there are infinitely many components with sizes on a scale of n^{2/3}.

A key observation is that this holds irrespective of which model we are using. In particular, this is consistent. By the central limit theorem, we have that:

|E(G(n,\frac{\lambda}{n}))|\sim \text{Bin}\left(\binom{n}{2},\frac{\lambda}{n}\right)\approx \frac{n\lambda}{2}\pm\alpha,

where \alpha is the error due to CLT-scale fluctuations. In particular, these fluctuations are on a scale smaller than n, so in the limit have no effect on which value of \lambda in the edge-specified model is appropriate.

However, it is still a random model, so we can condition on any event which happens with positive probability, so we might ask: what does a supercritical random graph look like if we condition it to have no giant component? Assume for now that we are considering G(n,\frac{\lambda}{n}),\lambda>1.

This deviation from standard behaviour might be achieved in at least two ways. Firstly, we might just have insufficient edges. If we have a large deviation towards too few edges, then this would correspond to a subcritical G(n,\frac{\mu n}{2}), so would have no giant components. However, it is also possible that the lack of a giant component is due to ‘clustering’. We might in fact have the correct number of edges, but they might have arranged themselves into a configuration that keeps the number of components small. For example, we might have a complete graph on Kn^{1/2} vertices plus a whole load of isolated vertices. This has the correct number of edges, but certainly no giant component (that is an O(n) component).

We might suspect that having too few edges would be the primary cause of having no giant component, but it would be interesting if clustering played a role. In a previous post, I talked about more realistic models of complex networks, for which clustering beyond the levels of Erdos-Renyi is one of the properties we seek. There I described a few models which might produce some of these properties. Obviously another model is to take Erdos-Renyi and condition it to have lots of clustering but that isn’t hugely helpful as it is not obvious what the resulting graphs will in general look like. It would certainly be interesting if conditioning on having no giant component were enough to get lots of clustering.

To do this, we need to find a rate function for the size of the giant component in a supercritical random graph. Then we will assume that evaluating this near 0 gives the LD probability of having ‘no giant component’. We will then compare this to the straightforward rate function for the number of edges; in particular, evaluated at criticality, so the probability that we have a subcritical number of edges in our supercritical random graph. If they are the same, then this says that the surfeit of edges dominates clustering effects. If the former is smaller, then clustering may play a non-trivial role. If the former is larger, then we will probably have made a mistake, as we expect on a LD scale that having too few edges will almost surely lead to a subcritical component.

Methods

The starting point is the exploration process for components of the random graph. Recall we start at some vertex v and explore the component containing v depth-first, tracking the number of vertices which have been seen but not yet explored. We can extend this to all components by defining:

S(0)=0, \quad S(t)=S(t-1)+(X(t)-1),

where X(t) is the number of children of the t’th vertex. For a single component, S(t) is precisely the number of seen but unexplored vertices. It is more complicated in general. Note that when we exhaust the first component S(t)=-1, and then when we exhaust the second component S(t)=-2 and so on. So in fact

S_t-\min_{0\leq s\leq t}S_s

is the number of seen but unexplored vertices, with \min_{0\leq s\leq t}S_s equal to (-1) times the number of components already explored up to time t.

Once we know the structure of the first t vertices, we expect the distribution of X(t) – 1 to be

\text{Bin}\Big(n-t-[S_t-\min_{0\leq s\leq t}S_s],\tfrac{\lambda}{n}\Big)-1.

We aren’t interested in all the edges of the random graph, only in some tree skeleton of each component. So we don’t need to consider the possibility of edges connecting our current location to anywhere we’ve previously visited (as such an edge would have been consider then – it’s a depth-first exploration), hence the -t. But we also don’t want to consider edges connecting our current location to anywhere we’ve seen, since that would be a surplus edge creating a cycle, hence the -S_s. It is binomial because by independence even after all this conditioning, the probability that there’s an edge from my current location to any other vertex apart from those discounted is equal to \frac{\lambda}{n} and independent.

For Mogulskii’s theorem in the previous post, we had an LDP for the rescaled paths of a random walk with independent stationary increments. In this situation we have a random walk where the increments do not have this property. They are not stationary because the pre-limit distribution depends on time. They are also not independent, because the distribution depends on behaviour up to time t, but only through the value of the walk at the present time.

Nonetheless, at least by following through the heuristic of having an instantaneous exponential cost for a LD event, then products of sums becoming integrals within the exponent, we would expect to have a similar result for this case. We can find the rate function \Lambda_\lambda^*(x)of latex \text{Po}(\lambda)-1$ and thus get a rate function for paths of the exploration process

I_\lambda(f)=\int_0^1 \Lambda_{(1-t-\bar{f}(t))\lambda}^*(f')dt,

where \bar{f}(t) is the height of f above its previous minimum.

Technicalities and Challenges

1) First we need to prove that it is actually possible to extend Mogulskii to this more general setting. Even though we are varying the distribution continuously, so we have some sort of ‘local almost convexity’, the proof is going to be fairly fiddly.

2) Having to consider excursions above the local minima is a massive hassle. We would ideally like to replace \bar{f} with f. This doesn’t seem unreasonable. After all, if we pick a giant component within o(n) steps, then everything considered before the giant component won’t show up in the O(n) rescaling, so we will have a series of macroscopic excursions above 0 with widths giving the actual sizes of the giant components. The problem is that even though with high probability we will pick a giant component after O(1) components, then probability that we do not do this decays only exponentially fast, so will show up as a term in the LD analysis. We would hope that this would not be important – after all later we are going to take an infimum, and since the order we choose the vertices to explore is random and in particular independent of the actual structure, it ought not to make a huge difference to any result.

3) A key lemma in the proof of Mogulskii in Dembo and Zeitouni was the result that it doesn’t matter from an LDP point of view whether we consider the linear (continuous) interpolation or the step-wise interpolation to get a process that actually lives in L_\infty([0,1]). In this generalised case, we will also need to check that approximating the Binomial distribution by its Poisson limit is valid on an exponential scale. Note that because errors in the approximation for small values of t affect the parameter of the distribution at larger times, this will be more complicated to check than for the IID case.

4) Once we have a rate function, if we actually want to know about the structure of the ‘typical’ graph displaying some LD property, we will need to find the infimum of the integrated rate function with some constraints. This is likely to be quite nasty unless we can directly use Euler-Lagrange or some other variational tool.

Answer

Papers by O’Connell and Puhalskii have found the rate function. Among other interesting things, we learn that:

I_{(1+\epsilon)}(0)\approx \frac{\epsilon^3}{6},

while the rate function for the number of edges:

-\lim\tfrac{1}{n}\log\mathbb{P}\Big(\text{Bin}(\tfrac{n^2}{2},\tfrac{1+\epsilon}{n})\leq\tfrac{n}{2}\Big)\approx \frac{\epsilon^2}{4}.

So in fact it looks as if there might be a significant contribution from clustering after all.

Large Deviations 5 – Stochastic Processes and Mogulskii’s Theorem

Motivation

In the previous posts about Large Deviations, most of the emphasis has been on the theory. To summarise briefly, we have a natural idea that for a family of measures supported on the same metric space, increasingly concentrated as some index grows, we might expect the probability of seeing values in a set not containing the limit in distribution to grow exponentially. The canonical example is the sample mean of a family of IID random variables, as treated by Cramer’s theorem.

It becomes apparent that it will not be enough to specify the exponent for a given large deviation event just by taking the infimum of the rate function, so we have to define an LDP topologically, with different behaviour on open and closed sets. Now we want to find some LDPs for more complicated measures, but which will have genuinely non-trivial applications. The key idea in all of this is that the infimum present in the definition of an LDP doesn’t just specify the rate function, it also might well give us some information about the configurations or events that lead to the LDP.

The slogan for the LDP as in Frank den Hollander’s excellent book is: “A large deviation event will happen in the least unlikely of all the unlikely ways.” This will be useful when our underlying space is a bit more complicated.

Setup

As a starting point, consider the set-up for Cramer’s theorem, with IID X_1,\ldots,X_n. But instead of investigating LD behaviour for the sample mean, we investigate LD behaviour for the whole set of RVs. There is a bijection between sequences and the partial sums process, so we investigate the partial sums process, rescaled appropriately. For the moment this is a sequence not a function or path (continuous or otherwise), but in the limit it will be, and furthermore it won’t make too much difference whether we interpolate linearly or step-wise.

Concretely, we consider the rescaled random walk:

Z_n(t):=\tfrac{1}{n}\sum_{i=1}^{[nt]}X_i,\quad t\in[0,1],

with laws \mu_n supported on L_\infty([0,1]). Note that the expected behaviour is a straight line from (0,0) to (1,\mathbb{E}X_1). In fact we can say more than that. By Donsker’s theorem we have a functional version of a central limit theorem, which says that deviations from this expected behaviour are given by suitably scaled Brownian motion:

\sqrt{n}\left(\frac{Z_n(t)-t\mathbb{E}X}{\sqrt{\text{Var}(X_1)}}\right)\quad\stackrel{d}{\rightarrow}\quad B(t),\quad t\in[0,1].

This is what we expect ‘standard’ behaviour to look like:

mog1 - Copy

The deviations from a straight line are on a scale of \sqrt{n}. Here are two examples of potential large deviation behaviour:

mog2 - Copy

Or this:

mog3 - Copy

Note that these are qualitatively different. In the first case, the first half of the random variables are in general much larger than the second half, which appear to have empirical mean roughly 0. In the second case, a large deviation in overall mean is driven by a single very large value. It is obviously of interest to find out what the probabilities of each of these possibilities are.

We can do this via an LDP for (\mu_n). Now it is really useful to be working in a topological context with open and closed sets. It will turn out that the rate function is supported on absolutely continuous functions, whereas obviously for finite n, none of the sample paths are continuous!

We assume that \Lambda(\lambda) is the logarithmic moment generating function of X_1 as before, with \Lambda^*(x) the Fenchel-Legendre transform. Then the key result is:

Theorem (Mogulskii): The measures (\mu_n) satisfy an LDP on L_\infty([0,1]) with good rate function:

I(\phi)=\begin{cases}\int_0^1 \Lambda^*(\phi'(t))dt,&\quad \text{if }\phi\in\mathcal{AC}, \phi(0)=0,\\ \infty&\quad\text{otherwise,}\end{cases}

where AC is the space of absolutely continuous functions on [0,1]. Note that AC is dense in L_\infty([0,1]), so any open set contains a \phi for which I(\phi) is at least in principle finite. (Obviously, if \Lambda^* is not finite everywhere, then extra restrictions of \phi' are required.)

The following picture may be helpful at providing some motivation:

CptPath2

So what is going on is that if we take a path and zoom in on some small interval around a point, note first that behaviour on this interval is independent of behaviour everywhere else. Then the gradient at the point is the local empirical mean of the random variables around this point in time. The probability that this differs from the actual mean is given by Cramer’s rate function applied to the empirical mean, so we obtain the rate function for the whole path by integrating.

More concretely, but still very informally, suppose there is some \phi'(t)\neq \mathbb{E}X, then this says that:

Z_n(t+\delta t)-Z_n(t)=\phi'(t)\delta t+o(\delta t),

\Rightarrow\quad \mu_n\Big(\phi'(t)\delta t+o(\delta t)=\frac{1}{n}\sum_{i=nt+1}^{n(t+\delta t)}X_i\Big),

= \mu_n\Big( \phi'(t)+o(1)=\frac{1}{n\delta t}\sum_{i=1}^{n\delta t}X_i\Big)\sim e^{-n\delta t\Lambda^*(\phi'(t))},

by Cramer. Now we can use independence:

\mu_n(Z_n\approx \phi)=\prod_{\delta t}e^{-n\delta t \Lambda^*(\phi'(t))}=e^{-\sum_{\delta t}n\delta t \Lambda^*(\phi'(t))}\approx e^{-n\int_0^1 \Lambda^*(\phi'(t))dt},

as in fact is given by Mogulskii.

Remarks

1) The absolutely continuous requirement is useful. We really wouldn’t want to be examining carefully the tail of the underlying distribution to see whether it is possible on an exponential scale that o(n) consecutive RVs would have sum O(n).

2) In general \Lambda^*(x) will be convex, which has applications as well as playing a useful role in the proof. Recalling den Hollander’s mantra, we are interested to see where infima hold for LD sets in the host space. So for the event that the empirical mean is greater than some threshold larger than the expectation, Cramer’s theorem told us that this is exponentially the same as same the empirical mean is roughly equal to the threshold. Now Mogulskii’s theorem says more. By convexity, we know that the integral functional for the rate function is minimised by straight lines. So we learn that the contributions to the large deviation are spread roughly equally through the sample. Note that this is NOT saying that all the random variables will have the same higher than expected value. The LDP takes no account of fluctuations in the path on a scale smaller than n. It does however rule out both of the situations pictured a long way up the page. We should expect to see roughly a straight line, with unexpectedly steep gradient.

3) The proof as given in Dembo and Zeitouni is quite involved. There are a few stages, the first and simplest of which is to show that it doesn’t matter on an exponential scale whether we interpolate linearly or step-wise. Later in the proof we will switch back and forth at will. The next step is to show the LDP for the finite-dimensional problem given by evaluating the path at finitely many points in [0,1]. A careful argument via the Dawson-Gartner theorem allows lifting of the finite-dimensional projections back to the space of general functions with the topology of pointwise convergence. It remains to prove that the rate function is indeed the supremum of the rate functions achieved on projections. Convexity of \Lambda^*(x) is very useful here for the upper bound, and this is where it comes through that the rate function is infinite when the comparison path is not absolutely continuous. To lift to the finer topology of L_\infty([0,1]) requires only a check of exponential tightness in the finer space, which follows from Arzela-Ascoli after some work.

In conclusion, it is fairly tricky to prove even this most straightforward case, so unsurprisingly it is hard to extend to the natural case where the distributions of the underlying RVs (X) change continuously in time, as we will want for the analysis of more combinatorial objects. Next time I will consider why it is hard but potentially interesting to consider with adaptations of these techniques an LDP for the size of the largest component in a sparse random graph near criticality.

Large Deviations 4 – Sanov’s Theorem

Although we could have defined things for a more general topological space, most of our thoughts about Cramer’s theorem, and the Gartner-Ellis theorem which generalises it, are based on means of real-valued random variables. For Cramer’s theorem, we genuinely are interested only in means of i.i.d. random variables. In Gartner-Ellis, one might say that we are able to relax the condition on independence and perhaps identical distribution too, in a controlled way. But this is somewhat underselling the theorem: using G-E, we can deal with a much broader category of measures than just means of collections of variables. The key is that convergence of the log moment generating function is exactly enough to give a LDP with some rate, and we have a general method for finding the rate function.

So, Gartner-Ellis provides a fairly substantial generalisation to Cramer’s theorem, but is still similar in flavour. But what about if we look for additional properties of a collection of i.i.d. random variables (X_n). After all, the mean is not the only interesting property. One thing we could look at is the actual values taken by the X_ns. If the underlying distribution is continuous, this is not going to give much more information than what we started with. With probability, \{X_1,\ldots,X_n\} is a set of size n, with distribution given by the product of the underlying measure. However, if the random variables take values in a discrete set, or better still a finite set, then (X_1,\ldots,X_n) gives a so-called empirical distribution.

As n grows towards infinity, we expect this empirical distribution to approximate the real underlying distribution fairly well. This isn’t necessarily quite as easy as it sounds. By the strong law of large numbers applied to indicator functions 1(X_i\leq t), the empirical cdf at t converges almost surely to the true cdf at t. To guarantee that this convergence is uniform in t is tricky in general (for reference, see the Glivenko-Cantelli theorem), but is clear for random variables defined on finite sets, and it seems reasonable that an extension to discrete sets should be possible.

So such empirical distributions might well admit an LDP. Note that in the case of Bernoulli random variables, the empirical distribution is in fact exactly equivalent to the empirical mean, so Cramer’s theorem applies. But, in fact we have a general LDP for empirical distributions. I claim that the main point of interest here is the nature of the rate function – I will discuss why the existence of an LDP is not too surprising at the end.

The rate function is going to be interesting whatever form it ends up taking. After all, it is effectively going to some sort of metric on measures, as it records how far a possible empirical measure is from the true distribution. Apart from total variation distance, we don’t currently have many standard examples for metrics on a space of measures. Anyway, the rate function is the main content of Sanov’s theorem. This has various forms, depending on how fiddly you are prepared for the proof to be.

Define L_n:=\sum_{i=1}^n \delta_{X_i}\in\mathcal{M}_1(E) to be the empirical measure generated by X_1,\ldots,X_n. Then L_n satisfies an LDP on \mathcal{M}_1(E) with rate n and rate function given by H(\cdot,\mu), where \mu is the underlying distribution.

The function H is the relative entropy, defined by:

H(\nu|\mu):=\int_E \log\frac{\nu(x)}{\mu(x)}d\nu(v),

whenever \nu<<\mu, and \infty otherwise. We can see why this absolute continuity condition is required from the statement of the LDP. If the underlying distribution \mu has measure zero on some set A, then the observed values will not be in A with probability 1, and so the empirical measure will be zero on A also.

Note that an alternative form is:

H(\nu|\mu)=\int_E \frac{\nu(x)}{\mu(x)}\log\frac{\nu(x)}{\mu(x)}d\mu(v)=\mathbb{E}_\nu\frac{\nu(x)}{\mu(x)}\log\frac{\nu(x)}{\mu(x)}.

Perhaps it is more clear why this expectation is something we would want to minimise.

In particular, if we want to know the most likely asymptotic empirical distribution inducing a large deviation empirical mean (as in Cramer), then we find the distribution with suitable mean, and smallest entropy relative to the true underlying distribution.

A remark on the proof. If the underlying set of values is finite, then a proof of this result is essentially combinatorial. The empirical distribution is some multinomial distribution, and we can obtain exact forms for everything and then proceed with asymptotic approximations.

I said earlier that I would comment on why the LDP is not too surprising even in general, once we know Gartner-Ellis. Instead of letting X_i take values in whatever space we were considering previously, say the reals, consider instead the point mass function \delta_{X_i} which is effectively exactly the same random variable, only now defined on the space of probability measures. The empirical measure is then exactly:

\frac{1}{n}\sum_{i=1}^n \delta_{X_i}.

If the support K of the (X_i)s is finite, then in fact this space of measures is a convex subspace of \mathbb{R}^K, and so the multi-dimensional version of Cramer’s theorem applies. In general, we can work in the possibly infinite-dimensional space [0,1]^K, and our relevant subset is compact, as a closed subset of a compact space (by Tychonoff). So the LDP in this case follows from our previous work.

Large Deviations 3 – Gartner-Ellis Theorem: Where do the all terms come from?

We want to drop the i.i.d. assumption from Cramer’s theorem, to get a criterion for a general LDP as defined in the previous post to hold.

Preliminaries

For general random variables (Z_n) on \mathbb{R}^d with laws (\mu_n), we will continue to have an upper bound like in Cramer’s theorem, provided the moment generating functions of Z_n converge as required. For analogy with Cramer, take Z_n=\frac{S_n}{n}. The Gartner-Ellis theorem gives conditions for the existence of a suitable lower bound and, in particular, when this is the same as the upper bound.

We define the logarithmic moment generating function

\Lambda_n(\lambda):=\log\mathbb{E}e^{\langle \lambda,Z_n\rangle},

and assume that the limit

\Lambda(\lambda)=\lim_{n\rightarrow\infty}\frac{1}{n}\Lambda_n(n\lambda)\in[-\infty,\infty],

exists for all \lambda\in\mathbb{R}^d. We also assume that 0\in\text{int}(\mathcal{D}_\Lambda), where \mathcal{D}_\Lambda:=\{\lambda\in\mathbb{R}^d:\Lambda(\lambda)<\infty\}. We also define the Fenchel-Legendre transform as before:

\Lambda^*(x)=\sup_{\lambda\in\mathbb{R}^d}\left[\langle x,\lambda\rangle - \Lambda(\lambda)\right],\quad x\in\mathbb{R}^d.

We say y\in\mathbb{R}^d is an exposed point of \Lambda^* if for some \lambda,

\langle \lambda,y\rangle - \Lambda^*(y)>\langle\lambda,x\rangle - \Lambda^*(x),\quad \forall x\in\mathbb{R}^d.

Such a \lambda is then called an exposing hyperplane. One way of thinking about this definition is that \Lambda^*(x) is convex, but is strictly convex in any direction at an exposed point. Alternatively, at an exposed point y, there is a vector \lambda such that \Lambda^*\circ \pi_\lambda has a global minimum or maximum at y, where \pi_\lambda is the projection into \langle \lambda\rangle. Roughly speaking, this vector is what we will to take the Cramer transform for the lower bound at x. Recall that the Cramer transform is an exponential reweighting of the probability density, which makes a previously unlikely event into a normal one. We may now state the theorem.

Gartner-Ellis Theorem

With the assumptions above:

  1. \limsup_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(F)\leq -\inf_{x\in F}\Lambda^*(x), \forall F\subset\mathbb{R}^d closed.
  2. \liminf_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(G)\geq -\inf_{x\in G\cap E}\Lambda^*(x), \forall G\subset\mathbb{R}^d open, where E is the set of exposed points of \Lambda^* whose exposing hyperplane is in \text{int}(\mathcal{D}_\Lambda).
  3. If \Lambda is also lower semi-continuous, and is differentiable on \text{int}(\mathcal{D}_\Lambda) (which is non-empty by the previous assumption), and is steep, that is, for any \lambda\in\partial\mathcal{D}_\Lambda, \lim_{\nu\rightarrow\lambda}|\nabla \Lambda(\nu)|=\infty, then we may replace G\cap E by G in the second statement. Then (\mu_n) satisfies the LDP on \mathbb{R}^d with rate n and rate function \Lambda^*.

Where do all the terms come from?

As ever, because everything is on an exponential scale, the infimum in the statements affirms the intuitive notion that in the limit, “an unlikely event will happen in the most likely of the possible (unlikely) ways”. The reason why the first statement does not hold for open sets in general is that the infimum may not be attained for open sets. For the proof, we need an exposing hyperplane at x so we can find an exponential tilt (or Cramer transform) that makes x the standard outcome. Crucially, in order to apply probabilistic ideas to the resulting distribution, everything must be normalisable. So we need an exposing hyperplane so as to isolate the point x on an exponential scale in the transform. And the exposing hyperplane must be in \mathcal{D}_\Lambda if we are to have a chance of getting any useful information out of the transform. By convexity, this is equivalent to the exposing hyperplane being in \text{int}(\mathcal{D}_\Lambda).

Large Deviations 2 – LDPs, Rate Functions and Lower Semi-Continuity

Remarks from Cramer’s Theorem

So in the previous post we discussed Cramer’s theorem on large deviations for means of i.i.d. random variables. It’s worth stepping back and thinking more abstractly about what we showed. Each S_n has some law, which we think of as a measure on \mathbb{R}, though this could equally well be some other space, depending on where the random variables are supported. The law of large numbers asserts that as n\rightarrow\infty, these measures are increasingly concentrated at a single point in \mathbb{R}, which in this case is \mathbb{E}X_1. Cramer’s theorem then asserts that the measure of certain sets not containing this point of concentration decays exponentially in n, and quantifies the exponent, a so-called rate function, via a Legendre transform of the log moment generating function of the underlying distribution.

One key point is that we considered only certain sets [a,\infty),\,a>\mathbb{E}X_1, though we could equally well have considered (-\infty,a],\,a<\mathbb{E}X_1. What would happen if we wanted to consider an interval, say [a,b],\,\mathbb{E}X_1<a<b? Well, \mu_n([a,b])=\mu_n([a,\infty))-\mu_n((b,\infty)), and we might as well assume that \mu_n is sufficiently continuous, at least in the limit, that we can replace the open interval bound with a closed one. Then Cramer’s theorem asserts, written in a more informal style, that \mu_n([a,\infty))\sim e^{-nI(a)} and  similarly for [b,\infty). So provided I(a)<I(b), we have

\mu_n([a,b])\sim e^{-nI(a)}-e^{-nI(b)}\sim e^{-nI(a)}.

To in order to accord with our intuition, we would like I(x) to be increasing for x>\mathbb{E}X_1, and decreasing for x<\mathbb{E}X_1. Also, we want I(\mathbb{E}X_1)=0, to account for the fact that \mu_n([\mathbb{E}X_1,\infty))=O(1). For each consider a sequence of coin tosses. The probability that the observed proportion of heads is in [\frac12,1] should be roughly 1/2 for all n.

Note that in the previous displayed equation for \mu_n([a,b]) the right hand side has no dependence on b. Informally, this means that any event which is at least as unlikely as the event of a deviation to a, will in the limit happen in the most likely of the unlikely ways, which will in this case be a deviation to a, because of relative domination of exponential functions. So if, rather than just half-lines and intervals, we wanted to consider more general sets, we might conjecture a result of the form:

\mu_n(\Gamma)\sim e^{-n\inf_{z\in\Gamma}(z)},

with the approximation defined formally as in the statement of Cramer’s theorem. What can go wrong?

Large Deviations Principles

Well, if the set \Gamma=\{\gamma\} a single point, and the underlying distribution is continuous, then we would expect \mu_n(\{\gamma\})=0 for all n. Similarly, we would expect \mu_n((\mathbb{E}X_1,\infty))\sim O(1), but there is no a priori reason why I(z) should be continuous at \mathbb{E}X_1. (In fact, this is false.), so taking \Gamma=(\mathbb{E}X_1,\infty) again gives a contradiction.

So we need something a bit more precise. Noting that the problem here is that measure (in this case, measure of likeliness on an exponential scale) can leak into open sets through the boundary in the limit, and also the rate function requires some sort of neighbourhood to make sense for continuous RVs, so boundaries of closed sets may give an overestimate. This is reminiscent of weak convergence, and motivated by this, the appropriate general definition for a Large Deviation Principle is:

A sequence of measure (\mu_n) on some space E satisfies an LDP with rate function I and speed n if \forall \Gamma\in \mathcal{B}(E):

-\inf_{x\in\Gamma^\circ}I(x)\leq \liminf \frac{1}{n}\log\mu_n(\Gamma)\leq \limsup\frac{1}{n}\log\mu_n(\Gamma)\leq -\inf_{x\in \bar{\Gamma}}I(x).

Although this might look very technical, you might as well think of it as nothing more than the previous conjecture for general sets, with the two problems that we mentioned now taken care of.

So, we need to define a rate function. I: E\rightarrow[0,\infty] is a rate function, if it not identically infinite. We also demand that it is lower semi-continuous, and has closed level sets \Psi_I^\alpha:=\{x\in E: I(x)\leq\alpha\}. These definitions are in fact equivalent. I will say what lower semi-continuity is in a moment. Some authors also demand that the level sets be compact. Others call this a good rate function, or similar. The advantage of this is that infima on closed sets are attained.

It is possible to specify a different rate. The rate gives the speed of convergence. \frac 1 n can be replaced with any function converging to 0, including continuously.

Lower Semi-Continuity

A function f is lower semi-continuous if

f(x)\leq \liminf f(x_n),\text{ for all sequences }x_n\rightarrow x.

One way of thinking about this definition is to say that the function cannot jump upwards as it reaches a boundary, it can only jump downwards (or not jump at all). The article on Wikipedia for semi-continuity has this picture explaining how a lower semi-continuous function must behave at discontinuities. Note that the value of f at the discontinuity could be the blue dot, or anything less than the blue dot. It is reasonable clear why this definition is equivalent to having closed level sets.

So the question to ask is: why should rate functions be lower semi-continuous? Rather than proceeding directly, we argue by uniqueness. Given a function on \mathbb{R} with discontinuities, we can turn it into a cadlag function, or a caglad function by fiddling with the values taken at points of discontinuity. We can do a similar thing to turn any function into a lower semi-continuous function. Given f, we define

f_*(x):=\liminf_{x_n\rightarrow x}f(x_n)=\sup\{\inf_G f: x\ni G, G \text{ open}\}.

The notes I borrowed this idea from described this as the maximal lower semi-continuous regularisation, which I think is quite a good explanation despite the long words.

Anyway, the claim is that if I(x) satisfies a LDP then so does $I_*(x)$. This needs to be checked, but it explains why we demand that the rate function be lower semi-continuous. We really want the rate function not to be unique, and this is a good way to prevent an obvious cause of non-uniqueness. It needs to be checked that it is actually unique once we have this assumption, but that is relatively straightforward.

So, to check that the lower semi-continuous regularisation of I satisfies the LDP if I does, we observe that the upper bound is trivial, since I^*\leq I everywhere. Then, for every open set G, note that for x\in G, I_*(x)=\liminf_{x_n\rightarrow x}I(x), so we might as well consider sequences within G, and so I_*(x)\geq \inf \inf_G I. So, since I_*(x)\leq I(x), it follows that

\inf_G I_*=\inf_G I,

and thus we get the upper bound for the LDP.

References

The motivation for this particular post was my own, but the set of notes here, as cited in the previous post were very useful. Also the Wikipedia page on semi-continuity, and Frank den Hollander’s book ‘Large Deviations’.

Large Deviations 1 – Motivation and Cramer’s Theorem

I’ve been doing a lot of thinking about Large Deviations recently, in particular how to apply the theory to random graphs and related models. I’ve just writing an article about some of the more interesting aspects, so thought it was probably worth turning it into a few posts.

Motivation

Given X_1,X_2,\ldots i.i.d. real-valued random variables with finite expectation, and S_n:=X_1+\ldots+X_n, the Weak Law of Large Numbers asserts that the empirical mean \frac{S_n}{n} converges in distribution to \mathbb{E}X_1. So \mathbb{P}(S_n\geq n(\mathbb{E}X_1+\epsilon))\rightarrow 0. In fact, if \mathbb{E}X_1^2<\infty, we have the Central Limit Theorem, and a consequence is that \mathbb{P}(S_n\geq n\mathbb{E}X_1+n^\alpha)\rightarrow 0 whenever \alpha>\frac12.

In a concrete example, if we toss a coin some suitably large number of times, the probability that the proportion of heads will be substantially greater or smaller than \frac12 tends to zero. So the probability that at least \frac34 of the results are heads tends to zero. But how fast? Consider first four tosses, then eight. A quick addition of the relevant terms in the binomial distribution gives:

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of four tosses are heads}\right)=\frac{1}{16}+\frac{4}{16}=\frac{5}{16},

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of twelve tosses are heads}\right)=\frac{1}{2^{12}}+\frac{12}{2^{12}}+\frac{66}{2^{12}}+\frac{220}{2^{12}}=\frac{299}{2^{12}}.

There are two observations to be made. The first is that the second is substantially smaller than the first – the decay appears to be relatively fast. The second observation is that \frac{220}{2^{12}} is substantially larger than the rest of the sum. So by far the most likely way for at least \tfrac34 out of twelve tosses to be heads is if exactly \tfrac34 are heads. Cramer’s theorem applies to a general i.i.d. sequence of RVs, provided the tail is not too heavy. It show that the probability of any such large deviation event decays exponentially with n, and identifies the exponent.

Theorem (Cramer): Let (X_i) be i.i.d. real-valued random variables which satisfy \mathbb{E}e^{tX_1}<\infty for every t\in\mathbb{R}. Then for any a>\mathbb{E}X_1,

\lim_{n\rightarrow \infty}\frac{1}{n}\log\mathbb{P}(S_n\geq an)=-I(a),

\text{where}\quad I(z):=\sup_{t\in\mathbb{R}}\left[zt-\log\mathbb{E}e^{tX_1}\right].

Remarks

  • So, informally, \mathbb{P}(S_n\geq an)\sim e^{-nI(a)}.
  • I(z) is called the Fenchel-Legendre transform (or convex conjugate) of \log\mathbb{E}e^{tX_1}.
  • Considering t=0 confirms that I(z)\in[0,\infty].
  • In their extremely useful book, Dembo and Zeitouni present this theorem in greater generality, allowing X_i to be supported on \mathbb{R}^d, considering a more general set of large deviation events, and relaxing the requirement for finite mean, and thus also the finite moment generating function condition. All of this will still be a special case of the Gartner-Ellis theorem, which will be examined in a subsequent post, so we make do with this form of Cramer’s result for now.

The proof of Cramer’s theorem splits into an upper bound and a lower bound. The former is relatively straightforward, applying Markov’s inequality to e^{tS_n}, then optimising over the choice of t. This idea is referred to by various sources as the exponential Chebyshev inequality or a Chernoff bound. The lower bound is more challenging. We reweight the distribution function F(x) of X_1 by a factor e^{tx}, then choose t so that the large deviation event is in fact now within the treatment of the CLT, from which suitable bounds are obtained.

To avoid overcomplicating this initial presentation, some details have been omitted. It is not clear, for example, whether I(x) should be finite whenever x is in the support of X_1. (It certainly must be infinite outside – consider the probability that 150% or -40% of coin tosses come up heads!) In order to call this a Large Deviation Principle, we also want some extra regularity on I(x), not least to ensure it is unique. This will be discussed in the next posts.

Gaussian tail bounds and a word of caution about CLT

The first part is more of an aside. In a variety of contexts, whether for testing Large Deviations Principles or calculating expectations by integrating over the tail, it is useful to know good approximations to the tail of various distributions. In particular, the exact form of the tail of a standard normal distribution is not particularly tractable. The following upper bound is therefore often extremely useful, especially because it is fairly tight, as we will see.

Let Z\sim N(0,1) be a standard normal RV. We are interested in the tail probability R(x)=\mathbb{P}(Z\geq x). The density function of a normal RV decays very rapidly, as the exponential of a quadratic function of x. This means we might expect that conditional on \{Z\geq x\}, with high probability Z is in fact quite close to x. This concentration of measure property would suggest that the tail probability decays at a rate comparable to the density function itself. In fact, we can show that:

\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.

It is in fact relatively straightforward:

\mathbb{P}(Z>x)=\frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-u^2/2}du< \frac{1}{\sqrt{2\pi}}\int_x^\infty \frac{u}{x}e^{-u^2/2}du=\frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.

Just by comparing derivatives, we can also show that this bound is fairly tight. In particular:

\frac{1}{\sqrt{2\pi}}\frac{x}{x^2+1}e^{-x^2/2}<\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.

—-

Now for the second part about CLT. The following question is why I started thinking about various interpretations of CLT in the previous post. Suppose we are trying to prove the Strong Law of Large Numbers for a random variable with 0 mean and unit variance. Suppose we try to use an argument via Borel-Cantelli:

\mathbb{P}(\frac{S_n}{n}>\epsilon) = \mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n})\stackrel{\text{CLT}}{\approx}\mathbb{P}(Z>\epsilon\sqrt{n}).

Now we can use our favourite estimate on the tail of a normal distribution.

\mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\epsilon\sqrt{n}\sqrt{2\pi}}e^{-n/2}

\Rightarrow \sum_n \mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\sqrt{2\pi}}(e^{-1/2})^n=\frac{1}{\sqrt{2\pi}(e^{1/2}-1)}<\infty.

By Borel-Cantelli, we conclude that with probability 1, eventually \frac{S_n}{n}<\epsilon. This holds for all \epsilon>0, and a symmetric result for the negative case. We therefore obtain the Strong Law of Large Numbers.

The question is: was that application of CLT valid? It certainly looks ok, but I claim not. The main problem is that the deviations under discussion fall outside the remit of discussion. CLT gives a limiting expression for deviations on the \sqrt{n} scale.

Let’s explain this another way. Let’s take \epsilon=10^{-2}. CLT says that as n becomes very large

\mathbb{P}(\frac{S_n}{\sqrt{n}}>1000)\approx \mathbb{P}(Z>1000).

But we don’t know how large n has to be before this approximation is vaguely accurate. If in fact it only becomes accurate for n=10^{12}, then it is not relevant for estimating

\mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n}).

This looks like an artificial example, but the key is that this problem becomes worse as n grows, (or as we increase the number which currently reads as 1000), and certainly is invalid in the limit. [I find the original explanation about scale of deviation treated by CLT more manageable, but hopefully this further clarifies.]

One solution might be to find some sort of uniform convergence criterion for CLT, ie a (hopefully rapidly decreasing) function f(n) such that

\sup_{x\in\mathbb{R}}|\mathbb{P}(\frac{S_n}{\sqrt{n}}>x)-\mathbb{P}(Z>x)|\leq f(n).

This is possible, as given by the Berry-Esseen theorem, but even the most careful refinements in the special case where the third moment is bounded fail to give better bounds than

f(n)\sim \frac{1}{\sqrt{n}}.

Adding this error term will certainly destroy any hope we had of the sum being finite. Of course, part of the problem is that the supremum in the above definition is certainly not going to be attained at any point under discussion in these post-\sqrt{n} deviations. We really want to take a supremum over larger-than-usual deviations if this is to work out.

By this stage, however, I hope it is clear what the cautionary note is, even if the argument could potentially be patched. CLT is a theorem about standard deviations. Separate principles are required to deal with the case of large deviations. This feels like a strangely ominous note on which to end, but I don’t think there’s much more to say. Do comment below if you think there’s a quick fix to the argument for SLLN presented above.

Large Deviations and the CLT

Taking a course on Large Deviations has forced me to think a bit more carefully about what happens when you have large collections of IID random variables. I guess the first thing think to think about is ‘What is a Large Deviation‘? In particular, how large or deviant does it have to be?

Of primary interest is the tail of the distribution function of S_n=X_1+\ldots+X_n, where the X_i are independent and identically distributed as X. As we can always negate everything later if necessary, we typically consider the probability of events of the type:

\mathbb{P}(S_n\geq \theta(n))

where \theta(n) is some function which almost certainly increases fairly fast with n. More pertinently, if we are looking for some limit which corresponds to an actual random variable, we perhaps want to look at lots of related \theta(n)s simultaneously. More concretely, we should fix \theta and consider the probabilities

\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha). (*)

Throughout, we lose no generality by assuming that \mathbb{E}X=0. Of course, it is possible that this expectation does not exist, but that is certainly a question for another post!

Now let’s consider the implications of our choice of \theta(n). If this increases with n too slowly, and the likely deviation of S_n is greater than \theta(n), then the event might not be a large deviation at all. In fact, the difference between this event and the event (S_n is above 0, that is, its mean) becomes negligible, and so the probability at (*) might be 1/2 or whatever, regardless of the value of \alpha. So object \lim \frac{S_n}{\theta(n)} whatever that means, certainly cannot be a proper random variable, as if we were to have convergence in distribution, this would imply that the limit RV consisted of point mass at each of \{+\infty, -\infty\}.

On the other hand, if \theta(n) increases rapidly with n, then the probabilities at (*) might become very small indeed when \alpha>0. For example, we might expect:

\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha)=\begin{cases}0& \alpha>0\\1&\alpha<0.\end{cases}

and more information to be required when \alpha=0. This is what we mean by a large deviation event. Although we always have to define everything concretely in terms of some finite sum S_n, we are always thinking about the behaviour in the limit. A large deviation principle exists in an enormous range of cases to show that these probabilities in fact decay exponentially. Again, that is the subject for another post, or indeed the lecture course I’m attending.

Instead, I want to return to the Central Limit Theorem. I first encountered this result in popular science books in a vague “the histogram of boys’ heights looks like a bell” kind of way, then, once a normal random variable had been to some extent defined, it returned in A-level statistics courses in a slightly more fleshed out form. As an undergraduate, you see it in several forms, including as a corollary following from Levy’s convergence theorem.

In all applications though, it is generally used as a method of calculating good approximations. It is not uncommon to see it presented as:

\mathbb{P}(a\sigma\sqrt{n}+\mu n\leq S_n\leq b\sigma\sqrt{n}+\mu n)\approx \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}dx.

Although in many cases that is the right way to think use it, it isn’t the most interesting aspect of the theorem itself. CLT says that the correct scaling of \theta(n) so that the deviation probabilities lie between the two cases outline above is the same (that is, \theta(n)=O(\sqrt{n}) in some sense) for an enormous class of distributions, and in particular, most distributions that one might encounter in practice (ie finite mean, finite variance). There is even greater universality, as furthermore the limit distribution at this interface has the same form (some appropriate normal distribution) whenever X is in this class of distributions. I think that goes a long way to explaining why we should care about the theorem. It also immediately prompts several questions:

  • What happens for less regular distributions? It is now more clear what the right question to ask in this setting might be. What is the appropriate scaling for \theta(n) in this case, if such a scaling exists? Is there a similar universality property for suitable classes of distributions?
  • What is special about the normal distribution? The theorem itself shows us that it appears as a universal scaling limit in distribution, but we might reasonably ask what properties such a distribution should have, as perhaps this will offer a clue to a version of CLT or LLNs for less regular distributions.
  • We can see that the Weak Law of Large Numbers follows immediately from CLT. In fact we can say more, perhaps a Slightly Less Weak LLN, that

\frac{S_n-\mu n}{\sigma \theta(n)}\stackrel{d}{\rightarrow}0

  • whenever \sqrt{n}<<\theta(n). But of course, we also have a Strong Law of Large Numbers, which asserts that the empirical mean converges almost surely. What is the threshhold for almost sure convergence, because there is no a priori reason why it should be \theta(n)=n?

To be continued next time.