Characterisations of Geometric Random Graphs

Continuing the LMS-EPSRC summer school on Random Graphs, Geometry and Asymptotic Structure, we’ve now had three of the five lectures by Mathew Penrose on Geometric Random Graphs.

The basic idea is that instead of viewing a graph entirely abstractly, we now place the vertices in the plane, or some other real space. In many network situations, we would expect connectivity to depend somehow on distance. Agents or sites which are close together might be considered more likely to have the sort of relationship indicated by being connected with an edge. In the model discussed in this course, this dependence is deterministic. We have some parameter r, and once we have chosen the location of all the vertices, we connect a pair of vertices if the distance between them is less than r.

For the purposes of this, we work in a compact space [0,1]^d, and we are interested in the limit as the number of vertices n grows to infinity. To avoid the graph getting too connected, as in the standard random graph model, we take r to be a decreasing function of n. Anyway, we place the n points into the unit hypercube uniformly at random, and then the edges are specified by the adjacency rule above. In general, because r_n will be o(1), we won’t have to worry too much above boundary effects. The number of vertices within r_n of the boundary of the cube will be o(1). For some results, this is a genuine problem, when it may be easier to work on the torus.

In G(n,p), the order of np in the limit determines the qualitative structure of the graph. This is the expected degree of a given fixed vertex. In this geometric model, the relevant parameter is nr_n^d, where d is the dimension of the hypercube. If this parameter tends to 0, we say the graph is sparse, and dense if it tends to infinity. The intermediate case is called a thermodynamic limit. Note that the definition of sparse here is slightly different from G(n,p).

Much of the content of the first three lectures has been verifying that the distributions of various quantities in the graph, for example the total number of edges, are asymptotically Poisson. Although sometimes arguments are applicable over a broad spectrum, we also sometimes have to use different calculations for different scaling windows. For example, it is possible to show convergence to a Poisson distribution for the number of edges in the sparse case, from which we get an asymptotic normal approximation almost for free. In the denser regimes, the argument is somewhat more technical, with some substantial moment calculations.

A useful tool in these calculations are some bounds derived via Stein’s method for sums of ‘almost independent’ random variables. For example, the presence or non-presence of an edge between two pairs of vertices are independent in this setting if the pairs are disjoint, and the dependence is still only mild if they share a vertex. An effective description is via a so-called dependency graph, where we view the random variables as the vertices of a graph, with an edge between them if there is some dependence. This description doesn’t have any power in itself, but it does provide a concise notation for what would otherwise be very complicated, and we are able to show versions of (Binomials converge to Poisson) and CLT via these that are exactly as required for this purpose.

In particular, we are able to show that if E_n is the total number of edges, under a broad set of scaling regimes, if \lambda_n is the expected total number of edges, then d_{TV}(E_n,\mathrm{Po}(\lambda_n))\rightarrow 0, as n grows. This convergence in total variation distance is as strong a result as one could hope for, and when the sequence of \lambda_n is O(1), we can derive a normal approximation as well.

At this point it is worth discussing an alternative specification of the model. Recall that for a standard homogenous random graph, we have the choice of G(n,m) and G(n,p) as definitions. G(n,m) is the finer measure, and G(n,p) can be viewed as a weighted mix of G(n,m). We can’t replicate this directly in the geometric setting because the edges and non-edges are a deterministic function of the vertex locations. What we can randomise is the number of vertices. Since we are placing the vertices uniformly at random, it makes sense to consider as an alternative a Poisson Point Process with intensity n. The number of vertices we get overall will be distributed as Po(n), which is concentrated near n, in the same manner as G(n,c/n).

As in G(n,p), this is a less basic model because it is a mixture of the fixed-vertex models. Let’s see if how we would go about extending the total variation convergence result to this slightly different setting without requiring a more general version of the Poisson Approximation Lemma. To avoid having to define everything again, we add a ‘ to indicate that we are talking about the Poisson Point Process case. Writing d(.,.) for total variation distance, the result we have is:

\lim_{n\rightarrow\infty} d(E_n,\mathrm{Po}(\lambda_n))=0.

We want to show that

\lim_{n\rightarrow\infty}d(E_n',\mathrm{Po}(\lambda_n'))=0,

which we can decompose in terms of expectations in the original model by conditioning on N_n

\leq \lim_{n\rightarrow\infty}\mathbb{E}\Big[\mathbb{E}[d(E_{N_n},\mathrm{Po}(\lambda_n')) | N_n]\Big],

where the outer expectation is over N. The observation here, is that the number of points given by the Poisson process induces a measure on distributions, the overwhelming majority of which look quite like Poisson distributions with parameter n. The reason we have a less than sign is that we are applying the triangle inequality in the sum giving total variation distance:

d(X,Y)=\sum_{k\geq 0}|\mathbb{P}(X=k)-\mathbb{P}(Y=k)|.

From this, we use the triangle inequality again:

\lim_{n\rightarrow\infty} \mathbb{E}\Big[\mathbb{E}[d(E_{N_n},\mathrm{Po}(\lambda_{N_n})) | N_n]\Big]

+\lim_{n\rightarrow\infty}\mathbb{E}\Big[\mathbb{E}[d(\mathrm{Po}(\lambda_{N_n}),\mathrm{Po}(\lambda_n')) | N_n]\Big].

Then, by a large deviations argument, we have that for any \epsilon>0, \mathbb{P}(|N_n-n|\geq \epsilon n)\rightarrow 0 exponentially in n. Also, total variation distance is, by definition, bounded above by 1. In the first term, the inner conditioning on N_n is irrelevant, and we have that E_{N_n} converges to the Poisson distribution for any fixed N_n\in (n(1-\epsilon),n(1+\epsilon)). Furthermore, we showed in the proof of the non-PPP result that this convergence is uniform in this interval. (This is not surprising – the upper bound is some well-behaved polynomial in 1/n.) So with probability 1- e^{-\Theta(n)} N_n is in the region where this convergence happens, and elsewhere, the expected TV distance is bounded below 1, so the overall expectation tends to 0. With a similar LD argument, for the second term it suffices to prove that when \lambda\rightarrow\mu, we must have d(\mathrm{Po}(\lambda),\mathrm{Po}(\mu))\rightarrow 0. This is ‘obviously’ true. Formally, it is probably easiest to couple the distributions \mathrm{Bin}(n,\lambda/n),\mathrm{Bin}(n,\mu/n) in the obvious way, and carry the convergence of TV distance as the parameter varies through the convergence in n.

That all sounded a little bit painful, but is really just the obvious thing to do with each term – it’s only the language that’s long-winded!

Anyway, I’m looking forward to seeing how the course develops. In particular, when you split the space into small blocks, the connectivity properties resemble those of (site) percolation, so I wonder whether there will be concrete parallels. Also, after reading about some recent results concerning the metric structure of the critical components in the standard random graph process, it will be interesting to see how these compare to the limit of a random graph process which comes equipped with metric structure for free!

Advertisements

Large Deviations 6 – Random Graphs

As a final instalment in this sequence of posts on Large Deviations, I’m going to try and explain how one might be able to apply some of the theory to a problem about random graphs. I should explain in advance that much of what follows will be a heuristic argument only. In a way, I’m more interested in explaining what the technical challenges are than trying to solve them. Not least because at the moment I don’t know exactly how to solve most of them. At the very end I will present a rate function, and reference properly the authors who have proved this. Their methods are related but not identical to what I will present.

Problem

Recall the two standard definitions of random graphs. As in many previous posts, we are interested in the sparse case where the average degree of a vertex is o(1). Anyway, we start with n vertices, and in one description we add an edge between any pair of vertices independently and with fixed probability \frac{\lambda}{n}. In the second model, we choose uniformly at random from the set of graphs with n vertices and \frac{\lambda n}{2} edges. Note that if we take the first model and condition on the number of edges, we get the second model, since the probability of a given configuration appearing in G(n,p) is a function only of the number of edges present. Furthermore, the number of edges in G(n,p) is binomial with parameters \binom{n}{2} and p. For all purposes here it will make no difference to approximate the former by \frac{n^2}{2}.

Of particular interest in the study of sparse random graphs is the phase transition in the size of the largest component observed as \lambda passes 1. Below 1, the largest component has size on a scale of log n, and with high probability all components are trees. Above 1, there is a unique giant component containing \alpha_\lambda n vertices, and all other components are small. For \lambda\approx 1, where I don’t want to discuss what ‘approximately’ means right now, we have a critical window, for which there are infinitely many components with sizes on a scale of n^{2/3}.

A key observation is that this holds irrespective of which model we are using. In particular, this is consistent. By the central limit theorem, we have that:

|E(G(n,\frac{\lambda}{n}))|\sim \text{Bin}\left(\binom{n}{2},\frac{\lambda}{n}\right)\approx \frac{n\lambda}{2}\pm\alpha,

where \alpha is the error due to CLT-scale fluctuations. In particular, these fluctuations are on a scale smaller than n, so in the limit have no effect on which value of \lambda in the edge-specified model is appropriate.

However, it is still a random model, so we can condition on any event which happens with positive probability, so we might ask: what does a supercritical random graph look like if we condition it to have no giant component? Assume for now that we are considering G(n,\frac{\lambda}{n}),\lambda>1.

This deviation from standard behaviour might be achieved in at least two ways. Firstly, we might just have insufficient edges. If we have a large deviation towards too few edges, then this would correspond to a subcritical G(n,\frac{\mu n}{2}), so would have no giant components. However, it is also possible that the lack of a giant component is due to ‘clustering’. We might in fact have the correct number of edges, but they might have arranged themselves into a configuration that keeps the number of components small. For example, we might have a complete graph on Kn^{1/2} vertices plus a whole load of isolated vertices. This has the correct number of edges, but certainly no giant component (that is an O(n) component).

We might suspect that having too few edges would be the primary cause of having no giant component, but it would be interesting if clustering played a role. In a previous post, I talked about more realistic models of complex networks, for which clustering beyond the levels of Erdos-Renyi is one of the properties we seek. There I described a few models which might produce some of these properties. Obviously another model is to take Erdos-Renyi and condition it to have lots of clustering but that isn’t hugely helpful as it is not obvious what the resulting graphs will in general look like. It would certainly be interesting if conditioning on having no giant component were enough to get lots of clustering.

To do this, we need to find a rate function for the size of the giant component in a supercritical random graph. Then we will assume that evaluating this near 0 gives the LD probability of having ‘no giant component’. We will then compare this to the straightforward rate function for the number of edges; in particular, evaluated at criticality, so the probability that we have a subcritical number of edges in our supercritical random graph. If they are the same, then this says that the surfeit of edges dominates clustering effects. If the former is smaller, then clustering may play a non-trivial role. If the former is larger, then we will probably have made a mistake, as we expect on a LD scale that having too few edges will almost surely lead to a subcritical component.

Methods

The starting point is the exploration process for components of the random graph. Recall we start at some vertex v and explore the component containing v depth-first, tracking the number of vertices which have been seen but not yet explored. We can extend this to all components by defining:

S(0)=0, \quad S(t)=S(t-1)+(X(t)-1),

where X(t) is the number of children of the t’th vertex. For a single component, S(t) is precisely the number of seen but unexplored vertices. It is more complicated in general. Note that when we exhaust the first component S(t)=-1, and then when we exhaust the second component S(t)=-2 and so on. So in fact

S_t-\min_{0\leq s\leq t}S_s

is the number of seen but unexplored vertices, with \min_{0\leq s\leq t}S_s equal to (-1) times the number of components already explored up to time t.

Once we know the structure of the first t vertices, we expect the distribution of X(t) – 1 to be

\text{Bin}\Big(n-t-[S_t-\min_{0\leq s\leq t}S_s],\tfrac{\lambda}{n}\Big)-1.

We aren’t interested in all the edges of the random graph, only in some tree skeleton of each component. So we don’t need to consider the possibility of edges connecting our current location to anywhere we’ve previously visited (as such an edge would have been consider then – it’s a depth-first exploration), hence the -t. But we also don’t want to consider edges connecting our current location to anywhere we’ve seen, since that would be a surplus edge creating a cycle, hence the -S_s. It is binomial because by independence even after all this conditioning, the probability that there’s an edge from my current location to any other vertex apart from those discounted is equal to \frac{\lambda}{n} and independent.

For Mogulskii’s theorem in the previous post, we had an LDP for the rescaled paths of a random walk with independent stationary increments. In this situation we have a random walk where the increments do not have this property. They are not stationary because the pre-limit distribution depends on time. They are also not independent, because the distribution depends on behaviour up to time t, but only through the value of the walk at the present time.

Nonetheless, at least by following through the heuristic of having an instantaneous exponential cost for a LD event, then products of sums becoming integrals within the exponent, we would expect to have a similar result for this case. We can find the rate function \Lambda_\lambda^*(x)of latex \text{Po}(\lambda)-1$ and thus get a rate function for paths of the exploration process

I_\lambda(f)=\int_0^1 \Lambda_{(1-t-\bar{f}(t))\lambda}^*(f')dt,

where \bar{f}(t) is the height of f above its previous minimum.

Technicalities and Challenges

1) First we need to prove that it is actually possible to extend Mogulskii to this more general setting. Even though we are varying the distribution continuously, so we have some sort of ‘local almost convexity’, the proof is going to be fairly fiddly.

2) Having to consider excursions above the local minima is a massive hassle. We would ideally like to replace \bar{f} with f. This doesn’t seem unreasonable. After all, if we pick a giant component within o(n) steps, then everything considered before the giant component won’t show up in the O(n) rescaling, so we will have a series of macroscopic excursions above 0 with widths giving the actual sizes of the giant components. The problem is that even though with high probability we will pick a giant component after O(1) components, then probability that we do not do this decays only exponentially fast, so will show up as a term in the LD analysis. We would hope that this would not be important – after all later we are going to take an infimum, and since the order we choose the vertices to explore is random and in particular independent of the actual structure, it ought not to make a huge difference to any result.

3) A key lemma in the proof of Mogulskii in Dembo and Zeitouni was the result that it doesn’t matter from an LDP point of view whether we consider the linear (continuous) interpolation or the step-wise interpolation to get a process that actually lives in L_\infty([0,1]). In this generalised case, we will also need to check that approximating the Binomial distribution by its Poisson limit is valid on an exponential scale. Note that because errors in the approximation for small values of t affect the parameter of the distribution at larger times, this will be more complicated to check than for the IID case.

4) Once we have a rate function, if we actually want to know about the structure of the ‘typical’ graph displaying some LD property, we will need to find the infimum of the integrated rate function with some constraints. This is likely to be quite nasty unless we can directly use Euler-Lagrange or some other variational tool.

Answer

Papers by O’Connell and Puhalskii have found the rate function. Among other interesting things, we learn that:

I_{(1+\epsilon)}(0)\approx \frac{\epsilon^3}{6},

while the rate function for the number of edges:

-\lim\tfrac{1}{n}\log\mathbb{P}\Big(\text{Bin}(\tfrac{n^2}{2},\tfrac{1+\epsilon}{n})\leq\tfrac{n}{2}\Big)\approx \frac{\epsilon^2}{4}.

So in fact it looks as if there might be a significant contribution from clustering after all.

Large Deviations 5 – Stochastic Processes and Mogulskii’s Theorem

Motivation

In the previous posts about Large Deviations, most of the emphasis has been on the theory. To summarise briefly, we have a natural idea that for a family of measures supported on the same metric space, increasingly concentrated as some index grows, we might expect the probability of seeing values in a set not containing the limit in distribution to grow exponentially. The canonical example is the sample mean of a family of IID random variables, as treated by Cramer’s theorem.

It becomes apparent that it will not be enough to specify the exponent for a given large deviation event just by taking the infimum of the rate function, so we have to define an LDP topologically, with different behaviour on open and closed sets. Now we want to find some LDPs for more complicated measures, but which will have genuinely non-trivial applications. The key idea in all of this is that the infimum present in the definition of an LDP doesn’t just specify the rate function, it also might well give us some information about the configurations or events that lead to the LDP.

The slogan for the LDP as in Frank den Hollander’s excellent book is: “A large deviation event will happen in the least unlikely of all the unlikely ways.” This will be useful when our underlying space is a bit more complicated.

Setup

As a starting point, consider the set-up for Cramer’s theorem, with IID X_1,\ldots,X_n. But instead of investigating LD behaviour for the sample mean, we investigate LD behaviour for the whole set of RVs. There is a bijection between sequences and the partial sums process, so we investigate the partial sums process, rescaled appropriately. For the moment this is a sequence not a function or path (continuous or otherwise), but in the limit it will be, and furthermore it won’t make too much difference whether we interpolate linearly or step-wise.

Concretely, we consider the rescaled random walk:

Z_n(t):=\tfrac{1}{n}\sum_{i=1}^{[nt]}X_i,\quad t\in[0,1],

with laws \mu_n supported on L_\infty([0,1]). Note that the expected behaviour is a straight line from (0,0) to (1,\mathbb{E}X_1). In fact we can say more than that. By Donsker’s theorem we have a functional version of a central limit theorem, which says that deviations from this expected behaviour are given by suitably scaled Brownian motion:

\sqrt{n}\left(\frac{Z_n(t)-t\mathbb{E}X}{\sqrt{\text{Var}(X_1)}}\right)\quad\stackrel{d}{\rightarrow}\quad B(t),\quad t\in[0,1].

This is what we expect ‘standard’ behaviour to look like:

mog1 - Copy

The deviations from a straight line are on a scale of \sqrt{n}. Here are two examples of potential large deviation behaviour:

mog2 - Copy

Or this:

mog3 - Copy

Note that these are qualitatively different. In the first case, the first half of the random variables are in general much larger than the second half, which appear to have empirical mean roughly 0. In the second case, a large deviation in overall mean is driven by a single very large value. It is obviously of interest to find out what the probabilities of each of these possibilities are.

We can do this via an LDP for (\mu_n). Now it is really useful to be working in a topological context with open and closed sets. It will turn out that the rate function is supported on absolutely continuous functions, whereas obviously for finite n, none of the sample paths are continuous!

We assume that \Lambda(\lambda) is the logarithmic moment generating function of X_1 as before, with \Lambda^*(x) the Fenchel-Legendre transform. Then the key result is:

Theorem (Mogulskii): The measures (\mu_n) satisfy an LDP on L_\infty([0,1]) with good rate function:

I(\phi)=\begin{cases}\int_0^1 \Lambda^*(\phi'(t))dt,&\quad \text{if }\phi\in\mathcal{AC}, \phi(0)=0,\\ \infty&\quad\text{otherwise,}\end{cases}

where AC is the space of absolutely continuous functions on [0,1]. Note that AC is dense in L_\infty([0,1]), so any open set contains a \phi for which I(\phi) is at least in principle finite. (Obviously, if \Lambda^* is not finite everywhere, then extra restrictions of \phi' are required.)

The following picture may be helpful at providing some motivation:

CptPath2

So what is going on is that if we take a path and zoom in on some small interval around a point, note first that behaviour on this interval is independent of behaviour everywhere else. Then the gradient at the point is the local empirical mean of the random variables around this point in time. The probability that this differs from the actual mean is given by Cramer’s rate function applied to the empirical mean, so we obtain the rate function for the whole path by integrating.

More concretely, but still very informally, suppose there is some \phi'(t)\neq \mathbb{E}X, then this says that:

Z_n(t+\delta t)-Z_n(t)=\phi'(t)\delta t+o(\delta t),

\Rightarrow\quad \mu_n\Big(\phi'(t)\delta t+o(\delta t)=\frac{1}{n}\sum_{i=nt+1}^{n(t+\delta t)}X_i\Big),

= \mu_n\Big( \phi'(t)+o(1)=\frac{1}{n\delta t}\sum_{i=1}^{n\delta t}X_i\Big)\sim e^{-n\delta t\Lambda^*(\phi'(t))},

by Cramer. Now we can use independence:

\mu_n(Z_n\approx \phi)=\prod_{\delta t}e^{-n\delta t \Lambda^*(\phi'(t))}=e^{-\sum_{\delta t}n\delta t \Lambda^*(\phi'(t))}\approx e^{-n\int_0^1 \Lambda^*(\phi'(t))dt},

as in fact is given by Mogulskii.

Remarks

1) The absolutely continuous requirement is useful. We really wouldn’t want to be examining carefully the tail of the underlying distribution to see whether it is possible on an exponential scale that o(n) consecutive RVs would have sum O(n).

2) In general \Lambda^*(x) will be convex, which has applications as well as playing a useful role in the proof. Recalling den Hollander’s mantra, we are interested to see where infima hold for LD sets in the host space. So for the event that the empirical mean is greater than some threshold larger than the expectation, Cramer’s theorem told us that this is exponentially the same as same the empirical mean is roughly equal to the threshold. Now Mogulskii’s theorem says more. By convexity, we know that the integral functional for the rate function is minimised by straight lines. So we learn that the contributions to the large deviation are spread roughly equally through the sample. Note that this is NOT saying that all the random variables will have the same higher than expected value. The LDP takes no account of fluctuations in the path on a scale smaller than n. It does however rule out both of the situations pictured a long way up the page. We should expect to see roughly a straight line, with unexpectedly steep gradient.

3) The proof as given in Dembo and Zeitouni is quite involved. There are a few stages, the first and simplest of which is to show that it doesn’t matter on an exponential scale whether we interpolate linearly or step-wise. Later in the proof we will switch back and forth at will. The next step is to show the LDP for the finite-dimensional problem given by evaluating the path at finitely many points in [0,1]. A careful argument via the Dawson-Gartner theorem allows lifting of the finite-dimensional projections back to the space of general functions with the topology of pointwise convergence. It remains to prove that the rate function is indeed the supremum of the rate functions achieved on projections. Convexity of \Lambda^*(x) is very useful here for the upper bound, and this is where it comes through that the rate function is infinite when the comparison path is not absolutely continuous. To lift to the finer topology of L_\infty([0,1]) requires only a check of exponential tightness in the finer space, which follows from Arzela-Ascoli after some work.

In conclusion, it is fairly tricky to prove even this most straightforward case, so unsurprisingly it is hard to extend to the natural case where the distributions of the underlying RVs (X) change continuously in time, as we will want for the analysis of more combinatorial objects. Next time I will consider why it is hard but potentially interesting to consider with adaptations of these techniques an LDP for the size of the largest component in a sparse random graph near criticality.

Poisson Tails

I’ve had plenty of ideas for potential probability posts recently, but have been a bit too busy to write any of them up. I guess that’s a good thing in some sense. Anyway, this is a quick remark based on an argument I was thinking about yesterday. It combines Large Deviation theory, which I have spent a lot of time learning about this year, and the Poisson process, which I have spent a bit of time teaching.

Question

Does the Poisson distribution have an exponential tail? I ended up asking this question for two completely independent reasons yesterday. Firstly, I’ve been reading up about some more complex models of random networks. Specifically, the Erdos-Renyi random graph is interesting mathematical structure in its own right, but the independent edge condition results in certain regularity properties which are not seen in many real-world networks. In particular, the degree sequence of real-world networks typically follows an approximate power law. That is, the tail is heavy. This corresponds to our intuition that most networks contain ‘hubs’ which are connected to a large region of the network. Think about key servers or websites like Wikipedia and Google which are linked to by millions of other pages, or the social butterfly who will introduce friends from completely different circles. In any case, this property is not observed in an Erdos-Renyi graph, where the degrees are binomial, and in the sparse situation, rescale in the limit to a Poisson distribution. So, to finalise this observation, we want to be able to prove formally that the Poisson distribution has an exponential (so faster than power-law) tail.

The second occurrence of this question concerns large deviations for the exploration process of a random graph. This is a topic I’ve mentioned elsewhere (here for the exploration process, here for LDs) so I won’t recap extensively now. Anyway, the results we are interested in give estimates for the rate of decay in probability for the event that the path defined by the exploration process differs substantially from the expected path as n grows. A major annoyance in this analysis is the possibility of jumps. A jump occurs if a set of o(n) adjacent underlying random variables (here, the increments in the exploration process) have O(n) sum. A starting point might be to consider whether O(1) adjacent RVs can have O(n) sum, or indeed whether a single Poisson random variable can have sum of order n. In practice, this asks whether the probability \mathbb{P}(X>\alpha n) decays faster than exponentially in n. If it does, then this is dominated on a large deviations scale. If it decays exactly exponentially in n, then we have to consider such jumps in the analysis.

Approach

We can give a precise statement of the probabilities that a Po(\lambda) random variable X returns a given integer value:

\mathbb{P}(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}.

Note that these are the terms in the Taylor expansion of e^{\lambda} appropriately normalised. So, while it looks like it should be possible to evaluate

\mathbb{P}(X>\alpha n)=e^{-\lambda}\sum_{\alpha n}^\infty \frac{\lambda^k}{k!},

this seems impossible to do directly, and it isn’t even especially obvious what a sensible bounding strategy might be.

The problem of estimating the form of the limit in probability of increasing unlikely deviations from expected behaviour surely reminds us of Cramer’s theorem. But this and other LD theory is generally formulated in terms of n random variables displaying some collective deviation, rather than a single random variable, with the size of the deviation growing. But we can transform our problem into that form by appealing to the three equivalent definitions of the Poisson process.

Recall that the Poisson process is the canonical description of, say, an arrivals process, where events in disjoint intervals are independent, and the expected number of arrives in a fixed interval is proportional to the width of the interval, giving a well-defined notion of ‘rate’ as we would want. The two main ways to define the process are: 1) the times between arrivals are given by i.i.d. Exponential RVs with parameter \lambda equal to the rate; and 2) the number of arrivals in interval [s,t] is independent of all other times, and has distribution given by Po(\lambda(t-s)). The fact that this definition gives a well-defined process is not necessarily obvious, but let’s not discuss that further here.

So the key equivalence to be exploited is that the event X>n for X\sim \text{Po}(\lambda) is a statement that there are at least n arrivals by time 1. If we move to the exponential inter-arrival times definition, we can write this as:

\mathbb{P}(Z_1+\ldots+Z_n<1),

where the Z’s are the i.i.d. exponential random variables. But this is exactly what we are able to specify through Cramer’s theorem. Recall that the moment generating function of an exponential distribution is not finite everywhere, but that doesn’t matter as we construct our rate function by taking the supremum over some index t of:

I(x)=\sup_t (xt-\log \mathbb{E}e^{tZ_1})=\sup_t(xt-\log(\frac{\lambda}{\lambda-t})).

A simple calculation then gives

I(x)=\lambda x-1 - \log \lambda x.

\Rightarrow I(x)\uparrow \infty\text{ as }x\downarrow 0.

Note that I(1) is the same for both Exp(\lambda) and Po(\lambda), because of the PP equality of events:

\{Z_1+\ldots+Z_n\leq n\}=\{\text{Po}(\lambda n)=\text{Po}(\lambda)_1+\ldots+\text{Po}(\lambda)_n> n\},

similar to the previous argument. In particular, for all \epsilon>0,

\mathbb{P}(\text{Po}(\lambda)>n)=\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}<\frac{1}{n})<\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}<\epsilon),\text{ for large }n.

\mathbb{P}(\text{Po}(\lambda)>n)=O(e^{-nI(\epsilon)}),\text{ for all }\epsilon.

Since we can take I(\epsilon) as large as we want, we conclude that the probability decays faster than exponentially in n.

Delayed Connectivity in Random Graphs

Aside

I presented a poster at the Oxford SIAM Student Chapter Conference on Friday. It was nice to win the prize for best poster, but mainly I enjoyed putting it together. For once it meant ignoring the technical details and anything requiring displayed formulae, and focusing only on aspects that could be conveyed with bullet points and images. Anyway, this is what I came up with. The real thing is sitting safely in a tube in my office, ready for the next time it is needed in a hurry!

Delayed Connectivity Poster

Large Deviations 4 – Sanov’s Theorem

Although we could have defined things for a more general topological space, most of our thoughts about Cramer’s theorem, and the Gartner-Ellis theorem which generalises it, are based on means of real-valued random variables. For Cramer’s theorem, we genuinely are interested only in means of i.i.d. random variables. In Gartner-Ellis, one might say that we are able to relax the condition on independence and perhaps identical distribution too, in a controlled way. But this is somewhat underselling the theorem: using G-E, we can deal with a much broader category of measures than just means of collections of variables. The key is that convergence of the log moment generating function is exactly enough to give a LDP with some rate, and we have a general method for finding the rate function.

So, Gartner-Ellis provides a fairly substantial generalisation to Cramer’s theorem, but is still similar in flavour. But what about if we look for additional properties of a collection of i.i.d. random variables (X_n). After all, the mean is not the only interesting property. One thing we could look at is the actual values taken by the X_ns. If the underlying distribution is continuous, this is not going to give much more information than what we started with. With probability, \{X_1,\ldots,X_n\} is a set of size n, with distribution given by the product of the underlying measure. However, if the random variables take values in a discrete set, or better still a finite set, then (X_1,\ldots,X_n) gives a so-called empirical distribution.

As n grows towards infinity, we expect this empirical distribution to approximate the real underlying distribution fairly well. This isn’t necessarily quite as easy as it sounds. By the strong law of large numbers applied to indicator functions 1(X_i\leq t), the empirical cdf at t converges almost surely to the true cdf at t. To guarantee that this convergence is uniform in t is tricky in general (for reference, see the Glivenko-Cantelli theorem), but is clear for random variables defined on finite sets, and it seems reasonable that an extension to discrete sets should be possible.

So such empirical distributions might well admit an LDP. Note that in the case of Bernoulli random variables, the empirical distribution is in fact exactly equivalent to the empirical mean, so Cramer’s theorem applies. But, in fact we have a general LDP for empirical distributions. I claim that the main point of interest here is the nature of the rate function – I will discuss why the existence of an LDP is not too surprising at the end.

The rate function is going to be interesting whatever form it ends up taking. After all, it is effectively going to some sort of metric on measures, as it records how far a possible empirical measure is from the true distribution. Apart from total variation distance, we don’t currently have many standard examples for metrics on a space of measures. Anyway, the rate function is the main content of Sanov’s theorem. This has various forms, depending on how fiddly you are prepared for the proof to be.

Define L_n:=\sum_{i=1}^n \delta_{X_i}\in\mathcal{M}_1(E) to be the empirical measure generated by X_1,\ldots,X_n. Then L_n satisfies an LDP on \mathcal{M}_1(E) with rate n and rate function given by H(\cdot,\mu), where \mu is the underlying distribution.

The function H is the relative entropy, defined by:

H(\nu|\mu):=\int_E \log\frac{\nu(x)}{\mu(x)}d\nu(v),

whenever \nu<<\mu, and \infty otherwise. We can see why this absolute continuity condition is required from the statement of the LDP. If the underlying distribution \mu has measure zero on some set A, then the observed values will not be in A with probability 1, and so the empirical measure will be zero on A also.

Note that an alternative form is:

H(\nu|\mu)=\int_E \frac{\nu(x)}{\mu(x)}\log\frac{\nu(x)}{\mu(x)}d\mu(v)=\mathbb{E}_\nu\frac{\nu(x)}{\mu(x)}\log\frac{\nu(x)}{\mu(x)}.

Perhaps it is more clear why this expectation is something we would want to minimise.

In particular, if we want to know the most likely asymptotic empirical distribution inducing a large deviation empirical mean (as in Cramer), then we find the distribution with suitable mean, and smallest entropy relative to the true underlying distribution.

A remark on the proof. If the underlying set of values is finite, then a proof of this result is essentially combinatorial. The empirical distribution is some multinomial distribution, and we can obtain exact forms for everything and then proceed with asymptotic approximations.

I said earlier that I would comment on why the LDP is not too surprising even in general, once we know Gartner-Ellis. Instead of letting X_i take values in whatever space we were considering previously, say the reals, consider instead the point mass function \delta_{X_i} which is effectively exactly the same random variable, only now defined on the space of probability measures. The empirical measure is then exactly:

\frac{1}{n}\sum_{i=1}^n \delta_{X_i}.

If the support K of the (X_i)s is finite, then in fact this space of measures is a convex subspace of \mathbb{R}^K, and so the multi-dimensional version of Cramer’s theorem applies. In general, we can work in the possibly infinite-dimensional space [0,1]^K, and our relevant subset is compact, as a closed subset of a compact space (by Tychonoff). So the LDP in this case follows from our previous work.

Large Deviations 3 – Gartner-Ellis Theorem: Where do the all terms come from?

We want to drop the i.i.d. assumption from Cramer’s theorem, to get a criterion for a general LDP as defined in the previous post to hold.

Preliminaries

For general random variables (Z_n) on \mathbb{R}^d with laws (\mu_n), we will continue to have an upper bound like in Cramer’s theorem, provided the moment generating functions of Z_n converge as required. For analogy with Cramer, take Z_n=\frac{S_n}{n}. The Gartner-Ellis theorem gives conditions for the existence of a suitable lower bound and, in particular, when this is the same as the upper bound.

We define the logarithmic moment generating function

\Lambda_n(\lambda):=\log\mathbb{E}e^{\langle \lambda,Z_n\rangle},

and assume that the limit

\Lambda(\lambda)=\lim_{n\rightarrow\infty}\frac{1}{n}\Lambda_n(n\lambda)\in[-\infty,\infty],

exists for all \lambda\in\mathbb{R}^d. We also assume that 0\in\text{int}(\mathcal{D}_\Lambda), where \mathcal{D}_\Lambda:=\{\lambda\in\mathbb{R}^d:\Lambda(\lambda)<\infty\}. We also define the Fenchel-Legendre transform as before:

\Lambda^*(x)=\sup_{\lambda\in\mathbb{R}^d}\left[\langle x,\lambda\rangle - \Lambda(\lambda)\right],\quad x\in\mathbb{R}^d.

We say y\in\mathbb{R}^d is an exposed point of \Lambda^* if for some \lambda,

\langle \lambda,y\rangle - \Lambda^*(y)>\langle\lambda,x\rangle - \Lambda^*(x),\quad \forall x\in\mathbb{R}^d.

Such a \lambda is then called an exposing hyperplane. One way of thinking about this definition is that \Lambda^*(x) is convex, but is strictly convex in any direction at an exposed point. Alternatively, at an exposed point y, there is a vector \lambda such that \Lambda^*\circ \pi_\lambda has a global minimum or maximum at y, where \pi_\lambda is the projection into \langle \lambda\rangle. Roughly speaking, this vector is what we will to take the Cramer transform for the lower bound at x. Recall that the Cramer transform is an exponential reweighting of the probability density, which makes a previously unlikely event into a normal one. We may now state the theorem.

Gartner-Ellis Theorem

With the assumptions above:

  1. \limsup_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(F)\leq -\inf_{x\in F}\Lambda^*(x), \forall F\subset\mathbb{R}^d closed.
  2. \liminf_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(G)\geq -\inf_{x\in G\cap E}\Lambda^*(x), \forall G\subset\mathbb{R}^d open, where E is the set of exposed points of \Lambda^* whose exposing hyperplane is in \text{int}(\mathcal{D}_\Lambda).
  3. If \Lambda is also lower semi-continuous, and is differentiable on \text{int}(\mathcal{D}_\Lambda) (which is non-empty by the previous assumption), and is steep, that is, for any \lambda\in\partial\mathcal{D}_\Lambda, \lim_{\nu\rightarrow\lambda}|\nabla \Lambda(\nu)|=\infty, then we may replace G\cap E by G in the second statement. Then (\mu_n) satisfies the LDP on \mathbb{R}^d with rate n and rate function \Lambda^*.

Where do all the terms come from?

As ever, because everything is on an exponential scale, the infimum in the statements affirms the intuitive notion that in the limit, “an unlikely event will happen in the most likely of the possible (unlikely) ways”. The reason why the first statement does not hold for open sets in general is that the infimum may not be attained for open sets. For the proof, we need an exposing hyperplane at x so we can find an exponential tilt (or Cramer transform) that makes x the standard outcome. Crucially, in order to apply probabilistic ideas to the resulting distribution, everything must be normalisable. So we need an exposing hyperplane so as to isolate the point x on an exponential scale in the transform. And the exposing hyperplane must be in \mathcal{D}_\Lambda if we are to have a chance of getting any useful information out of the transform. By convexity, this is equivalent to the exposing hyperplane being in \text{int}(\mathcal{D}_\Lambda).

Large Deviations 2 – LDPs, Rate Functions and Lower Semi-Continuity

Remarks from Cramer’s Theorem

So in the previous post we discussed Cramer’s theorem on large deviations for means of i.i.d. random variables. It’s worth stepping back and thinking more abstractly about what we showed. Each S_n has some law, which we think of as a measure on \mathbb{R}, though this could equally well be some other space, depending on where the random variables are supported. The law of large numbers asserts that as n\rightarrow\infty, these measures are increasingly concentrated at a single point in \mathbb{R}, which in this case is \mathbb{E}X_1. Cramer’s theorem then asserts that the measure of certain sets not containing this point of concentration decays exponentially in n, and quantifies the exponent, a so-called rate function, via a Legendre transform of the log moment generating function of the underlying distribution.

One key point is that we considered only certain sets [a,\infty),\,a>\mathbb{E}X_1, though we could equally well have considered (-\infty,a],\,a<\mathbb{E}X_1. What would happen if we wanted to consider an interval, say [a,b],\,\mathbb{E}X_1<a<b? Well, \mu_n([a,b])=\mu_n([a,\infty))-\mu_n((b,\infty)), and we might as well assume that \mu_n is sufficiently continuous, at least in the limit, that we can replace the open interval bound with a closed one. Then Cramer’s theorem asserts, written in a more informal style, that \mu_n([a,\infty))\sim e^{-nI(a)} and  similarly for [b,\infty). So provided I(a)<I(b), we have

\mu_n([a,b])\sim e^{-nI(a)}-e^{-nI(b)}\sim e^{-nI(a)}.

To in order to accord with our intuition, we would like I(x) to be increasing for x>\mathbb{E}X_1, and decreasing for x<\mathbb{E}X_1. Also, we want I(\mathbb{E}X_1)=0, to account for the fact that \mu_n([\mathbb{E}X_1,\infty))=O(1). For each consider a sequence of coin tosses. The probability that the observed proportion of heads is in [\frac12,1] should be roughly 1/2 for all n.

Note that in the previous displayed equation for \mu_n([a,b]) the right hand side has no dependence on b. Informally, this means that any event which is at least as unlikely as the event of a deviation to a, will in the limit happen in the most likely of the unlikely ways, which will in this case be a deviation to a, because of relative domination of exponential functions. So if, rather than just half-lines and intervals, we wanted to consider more general sets, we might conjecture a result of the form:

\mu_n(\Gamma)\sim e^{-n\inf_{z\in\Gamma}(z)},

with the approximation defined formally as in the statement of Cramer’s theorem. What can go wrong?

Large Deviations Principles

Well, if the set \Gamma=\{\gamma\} a single point, and the underlying distribution is continuous, then we would expect \mu_n(\{\gamma\})=0 for all n. Similarly, we would expect \mu_n((\mathbb{E}X_1,\infty))\sim O(1), but there is no a priori reason why I(z) should be continuous at \mathbb{E}X_1. (In fact, this is false.), so taking \Gamma=(\mathbb{E}X_1,\infty) again gives a contradiction.

So we need something a bit more precise. Noting that the problem here is that measure (in this case, measure of likeliness on an exponential scale) can leak into open sets through the boundary in the limit, and also the rate function requires some sort of neighbourhood to make sense for continuous RVs, so boundaries of closed sets may give an overestimate. This is reminiscent of weak convergence, and motivated by this, the appropriate general definition for a Large Deviation Principle is:

A sequence of measure (\mu_n) on some space E satisfies an LDP with rate function I and speed n if \forall \Gamma\in \mathcal{B}(E):

-\inf_{x\in\Gamma^\circ}I(x)\leq \liminf \frac{1}{n}\log\mu_n(\Gamma)\leq \limsup\frac{1}{n}\log\mu_n(\Gamma)\leq -\inf_{x\in \bar{\Gamma}}I(x).

Although this might look very technical, you might as well think of it as nothing more than the previous conjecture for general sets, with the two problems that we mentioned now taken care of.

So, we need to define a rate function. I: E\rightarrow[0,\infty] is a rate function, if it not identically infinite. We also demand that it is lower semi-continuous, and has closed level sets \Psi_I^\alpha:=\{x\in E: I(x)\leq\alpha\}. These definitions are in fact equivalent. I will say what lower semi-continuity is in a moment. Some authors also demand that the level sets be compact. Others call this a good rate function, or similar. The advantage of this is that infima on closed sets are attained.

It is possible to specify a different rate. The rate gives the speed of convergence. \frac 1 n can be replaced with any function converging to 0, including continuously.

Lower Semi-Continuity

A function f is lower semi-continuous if

f(x)\leq \liminf f(x_n),\text{ for all sequences }x_n\rightarrow x.

One way of thinking about this definition is to say that the function cannot jump upwards as it reaches a boundary, it can only jump downwards (or not jump at all). The article on Wikipedia for semi-continuity has this picture explaining how a lower semi-continuous function must behave at discontinuities. Note that the value of f at the discontinuity could be the blue dot, or anything less than the blue dot. It is reasonable clear why this definition is equivalent to having closed level sets.

So the question to ask is: why should rate functions be lower semi-continuous? Rather than proceeding directly, we argue by uniqueness. Given a function on \mathbb{R} with discontinuities, we can turn it into a cadlag function, or a caglad function by fiddling with the values taken at points of discontinuity. We can do a similar thing to turn any function into a lower semi-continuous function. Given f, we define

f_*(x):=\liminf_{x_n\rightarrow x}f(x_n)=\sup\{\inf_G f: x\ni G, G \text{ open}\}.

The notes I borrowed this idea from described this as the maximal lower semi-continuous regularisation, which I think is quite a good explanation despite the long words.

Anyway, the claim is that if I(x) satisfies a LDP then so does $I_*(x)$. This needs to be checked, but it explains why we demand that the rate function be lower semi-continuous. We really want the rate function not to be unique, and this is a good way to prevent an obvious cause of non-uniqueness. It needs to be checked that it is actually unique once we have this assumption, but that is relatively straightforward.

So, to check that the lower semi-continuous regularisation of I satisfies the LDP if I does, we observe that the upper bound is trivial, since I^*\leq I everywhere. Then, for every open set G, note that for x\in G, I_*(x)=\liminf_{x_n\rightarrow x}I(x), so we might as well consider sequences within G, and so I_*(x)\geq \inf \inf_G I. So, since I_*(x)\leq I(x), it follows that

\inf_G I_*=\inf_G I,

and thus we get the upper bound for the LDP.

References

The motivation for this particular post was my own, but the set of notes here, as cited in the previous post were very useful. Also the Wikipedia page on semi-continuity, and Frank den Hollander’s book ‘Large Deviations’.

Large Deviations 1 – Motivation and Cramer’s Theorem

I’ve been doing a lot of thinking about Large Deviations recently, in particular how to apply the theory to random graphs and related models. I’ve just writing an article about some of the more interesting aspects, so thought it was probably worth turning it into a few posts.

Motivation

Given X_1,X_2,\ldots i.i.d. real-valued random variables with finite expectation, and S_n:=X_1+\ldots+X_n, the Weak Law of Large Numbers asserts that the empirical mean \frac{S_n}{n} converges in distribution to \mathbb{E}X_1. So \mathbb{P}(S_n\geq n(\mathbb{E}X_1+\epsilon))\rightarrow 0. In fact, if \mathbb{E}X_1^2<\infty, we have the Central Limit Theorem, and a consequence is that \mathbb{P}(S_n\geq n\mathbb{E}X_1+n^\alpha)\rightarrow 0 whenever \alpha>\frac12.

In a concrete example, if we toss a coin some suitably large number of times, the probability that the proportion of heads will be substantially greater or smaller than \frac12 tends to zero. So the probability that at least \frac34 of the results are heads tends to zero. But how fast? Consider first four tosses, then eight. A quick addition of the relevant terms in the binomial distribution gives:

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of four tosses are heads}\right)=\frac{1}{16}+\frac{4}{16}=\frac{5}{16},

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of twelve tosses are heads}\right)=\frac{1}{2^{12}}+\frac{12}{2^{12}}+\frac{66}{2^{12}}+\frac{220}{2^{12}}=\frac{299}{2^{12}}.

There are two observations to be made. The first is that the second is substantially smaller than the first – the decay appears to be relatively fast. The second observation is that \frac{220}{2^{12}} is substantially larger than the rest of the sum. So by far the most likely way for at least \tfrac34 out of twelve tosses to be heads is if exactly \tfrac34 are heads. Cramer’s theorem applies to a general i.i.d. sequence of RVs, provided the tail is not too heavy. It show that the probability of any such large deviation event decays exponentially with n, and identifies the exponent.

Theorem (Cramer): Let (X_i) be i.i.d. real-valued random variables which satisfy \mathbb{E}e^{tX_1}<\infty for every t\in\mathbb{R}. Then for any a>\mathbb{E}X_1,

\lim_{n\rightarrow \infty}\frac{1}{n}\log\mathbb{P}(S_n\geq an)=-I(a),

\text{where}\quad I(z):=\sup_{t\in\mathbb{R}}\left[zt-\log\mathbb{E}e^{tX_1}\right].

Remarks

  • So, informally, \mathbb{P}(S_n\geq an)\sim e^{-nI(a)}.
  • I(z) is called the Fenchel-Legendre transform (or convex conjugate) of \log\mathbb{E}e^{tX_1}.
  • Considering t=0 confirms that I(z)\in[0,\infty].
  • In their extremely useful book, Dembo and Zeitouni present this theorem in greater generality, allowing X_i to be supported on \mathbb{R}^d, considering a more general set of large deviation events, and relaxing the requirement for finite mean, and thus also the finite moment generating function condition. All of this will still be a special case of the Gartner-Ellis theorem, which will be examined in a subsequent post, so we make do with this form of Cramer’s result for now.

The proof of Cramer’s theorem splits into an upper bound and a lower bound. The former is relatively straightforward, applying Markov’s inequality to e^{tS_n}, then optimising over the choice of t. This idea is referred to by various sources as the exponential Chebyshev inequality or a Chernoff bound. The lower bound is more challenging. We reweight the distribution function F(x) of X_1 by a factor e^{tx}, then choose t so that the large deviation event is in fact now within the treatment of the CLT, from which suitable bounds are obtained.

To avoid overcomplicating this initial presentation, some details have been omitted. It is not clear, for example, whether I(x) should be finite whenever x is in the support of X_1. (It certainly must be infinite outside – consider the probability that 150% or -40% of coin tosses come up heads!) In order to call this a Large Deviation Principle, we also want some extra regularity on I(x), not least to ensure it is unique. This will be discussed in the next posts.

Weak Convergence and the Portmanteau Lemma

Much of the theory of Large Deviations splits into separate treatment of open and closed sets in the rescaled domains. Typically we seek upper bounds for the rate function on closed sets, and lower bounds for the rate function on open sets. When things are going well, these turn out to be same, and so we can get on with some applications and pretty much forget about the topology underlying the construction. Many sources made a comment along the lines of “this is natural, by analogy with weak convergence”.

Weak convergence is a topic I learned about in Part III Advanced Probability. I fear it may have been one of those things that leaked out of my brain shortly after the end of the exam season… Anyway, this feels like a good time to write down what it is all about a bit more clearly. (I’ve slightly cheated, and chosen definitions and bits of the portmanteau lemma which look maximally similar to the Large Deviation material, which I’m planning on writing a few posts about over the next week.)

The motivation is that we want to extend the notion of convergence in distribution of random variables to general measures. There are several ways to define convergence in distribution, so accordingly there are several ways to generalise it. Much of what follows will be showing that these are equivalent.

We work in a metric space (X,d) and have a sequence (\mu_n) and \mu of (Borel) probability measures. We say that (\mu_n) converges weakly to \mu, or \mu_n\Rightarrow\mu if:

\mu_n(f)\rightarrow\mu(f), \quad\forall f\in\mathcal{C}_b(X).

So the test functions required for result are the class of bounded, continuous functions on X. We shall see presently that it suffices to check a smaller class, eg bounded Lipschitz functions. Indeed the key result, which is often called the portmanteau lemma, gives a set of alternative conditions for weak convergence. We will prove the equivalence cyclically.

Portmanteau Lemma

The following are equivalent.

a) \mu_n\Rightarrow \mu.

b) \mu_n(f)\rightarrow\mu(f) for all bounded Lipschitz functions f.

c) \limsup_n \mu_n(F)\leq \mu(F) for all closed sets F. Note that we demanded that all the measures be Borel, so there is no danger of \mu(F) not being defined.

d) \liminf_n \mu_n(F)\geq \mu(G) for all open sets G.

e) \lim_n \mu_n(A)=\mu(A) whenever \mu(\partial A)=0. Such an A is called a continuity set.

Remarks

a) All of these statements are well-defined if X is a general topological space. I can’t think of any particular examples where we want to use measures on a non-metrizable space (eg C[0,1] with topology induced by pointwise convergence), but there seem to be a few references (such as the one cited here) implying that the results continue to hold in this case provided X is locally compact Hausdorff. This seems like an interesting thing to think about, but perhaps not right now.

b1) This doesn’t strike me as hugely surprising. I want to say that any bounded continuous function can be uniformly approximated almost everywhere by bounded Lipschitz functions. Even if that isn’t true, I am still not surprised.

b2) In fact this condition could be replaced by several alternatives. In the proof that follows, we only use one type of function, so any subset of \mathcal{C}_b(X) that contains the ones we use will be sufficient to determine weak convergence.

c) and d) Why should the sign be this way round? The canonical example to have in mind is some sequence of point masses \delta_{x_n} where x_n\rightarrow x in some non-trivial way. Then there is some open set eg X\{x} such that \mu_n(X\backslash x)=1 but \mu(X\backslash x)=0. Informally, we might say that in the limit, some positive mass could ‘leak out’ into the boundary of an open set.

e) is then not surprising, as the condition of being a continuity set precisely prohibits the above situation from happening.

Proof

a) to b) is genuinely trivial. For b) to c), find some set F’ containing F such that \mu(F')-\mu(F)=\epsilon. Then find a Lipschitz function f which is 0 outside F’ and 1 on F. We obtain

\limsup_n \mu_n(F)\leq \limsup \mu_n(f)=\mu(f)\leq \mu(F').

But \epsilon was arbitrary, so the result follows as it tends to zero. c) and d) are equivalent after taking F^c=G. If we assume c) and d) and apply them to A^\circ, \bar{A}, then e) follows.

e) to a) is a little trickier. Given a bounded continuous function f, assume WLOG that it has domain [0,1]. At most countably many events \{f=a\} have positive mass under each of \mu, (\mu_n). So given M>0, we can choose a sequence

-1=a_0<a_1<\ldots<a_M=2, such that |a_{k+1}-a_k|<\frac{1}{M},

and \mu(f=a_k)=\mu_n(f=a_k)=0 for all k,n. Now it is clear what to do. \{f\in[a_k,a_{k+1}]\} is a continuity set, so we can apply e), then patch everything together. There are slightly too many Ms and \epsilons to do this sensibly in WordPress, so I will leave it at that.

I will conclude by writing down a combination of c) and d) that will look very familiar soon.

\mu(A^\circ)\leq \liminf_n \mu_n(A)\leq \limsup_n\mu_n(A)\leq \mu(\bar{B}).

References

Apart from the Part III Advanced Probability course, this article was prompted by various books on Large Deviations, including those by Frank den Hollander and Ellis / Dupuis. I’ve developed the proof above from the hints given in the appendix of these very comprehensible notes by Rassoul-Agha and Seppalainen.