Weak Convergence and the Portmanteau Lemma

Much of the theory of Large Deviations splits into separate treatment of open and closed sets in the rescaled domains. Typically we seek upper bounds for the rate function on closed sets, and lower bounds for the rate function on open sets. When things are going well, these turn out to be same, and so we can get on with some applications and pretty much forget about the topology underlying the construction. Many sources made a comment along the lines of “this is natural, by analogy with weak convergence”.

Weak convergence is a topic I learned about in Part III Advanced Probability. I fear it may have been one of those things that leaked out of my brain shortly after the end of the exam season… Anyway, this feels like a good time to write down what it is all about a bit more clearly. (I’ve slightly cheated, and chosen definitions and bits of the portmanteau lemma which look maximally similar to the Large Deviation material, which I’m planning on writing a few posts about over the next week.)

The motivation is that we want to extend the notion of convergence in distribution of random variables to general measures. There are several ways to define convergence in distribution, so accordingly there are several ways to generalise it. Much of what follows will be showing that these are equivalent.

We work in a metric space (X,d) and have a sequence (\mu_n) and \mu of (Borel) probability measures. We say that (\mu_n) converges weakly to \mu, or \mu_n\Rightarrow\mu if:

\mu_n(f)\rightarrow\mu(f), \quad\forall f\in\mathcal{C}_b(X).

So the test functions required for result are the class of bounded, continuous functions on X. We shall see presently that it suffices to check a smaller class, eg bounded Lipschitz functions. Indeed the key result, which is often called the portmanteau lemma, gives a set of alternative conditions for weak convergence. We will prove the equivalence cyclically.

Portmanteau Lemma

The following are equivalent.

a) \mu_n\Rightarrow \mu.

b) \mu_n(f)\rightarrow\mu(f) for all bounded Lipschitz functions f.

c) \limsup_n \mu_n(F)\leq \mu(F) for all closed sets F. Note that we demanded that all the measures be Borel, so there is no danger of \mu(F) not being defined.

d) \liminf_n \mu_n(F)\geq \mu(G) for all open sets G.

e) \lim_n \mu_n(A)=\mu(A) whenever \mu(\partial A)=0. Such an A is called a continuity set.


a) All of these statements are well-defined if X is a general topological space. I can’t think of any particular examples where we want to use measures on a non-metrizable space (eg C[0,1] with topology induced by pointwise convergence), but there seem to be a few references (such as the one cited here) implying that the results continue to hold in this case provided X is locally compact Hausdorff. This seems like an interesting thing to think about, but perhaps not right now.

b1) This doesn’t strike me as hugely surprising. I want to say that any bounded continuous function can be uniformly approximated almost everywhere by bounded Lipschitz functions. Even if that isn’t true, I am still not surprised.

b2) In fact this condition could be replaced by several alternatives. In the proof that follows, we only use one type of function, so any subset of \mathcal{C}_b(X) that contains the ones we use will be sufficient to determine weak convergence.

c) and d) Why should the sign be this way round? The canonical example to have in mind is some sequence of point masses \delta_{x_n} where x_n\rightarrow x in some non-trivial way. Then there is some open set eg X\{x} such that \mu_n(X\backslash x)=1 but \mu(X\backslash x)=0. Informally, we might say that in the limit, some positive mass could ‘leak out’ into the boundary of an open set.

e) is then not surprising, as the condition of being a continuity set precisely prohibits the above situation from happening.


a) to b) is genuinely trivial. For b) to c), find some set F’ containing F such that \mu(F')-\mu(F)=\epsilon. Then find a Lipschitz function f which is 0 outside F’ and 1 on F. We obtain

\limsup_n \mu_n(F)\leq \limsup \mu_n(f)=\mu(f)\leq \mu(F').

But \epsilon was arbitrary, so the result follows as it tends to zero. c) and d) are equivalent after taking F^c=G. If we assume c) and d) and apply them to A^\circ, \bar{A}, then e) follows.

e) to a) is a little trickier. Given a bounded continuous function f, assume WLOG that it has domain [0,1]. At most countably many events \{f=a\} have positive mass under each of \mu, (\mu_n). So given M>0, we can choose a sequence

-1=a_0<a_1<\ldots<a_M=2, such that |a_{k+1}-a_k|<\frac{1}{M},

and \mu(f=a_k)=\mu_n(f=a_k)=0 for all k,n. Now it is clear what to do. \{f\in[a_k,a_{k+1}]\} is a continuity set, so we can apply e), then patch everything together. There are slightly too many Ms and \epsilons to do this sensibly in WordPress, so I will leave it at that.

I will conclude by writing down a combination of c) and d) that will look very familiar soon.

\mu(A^\circ)\leq \liminf_n \mu_n(A)\leq \limsup_n\mu_n(A)\leq \mu(\bar{B}).


Apart from the Part III Advanced Probability course, this article was prompted by various books on Large Deviations, including those by Frank den Hollander and Ellis / Dupuis. I’ve developed the proof above from the hints given in the appendix of these very comprehensible notes by Rassoul-Agha and Seppalainen.


Advanced Probability Revision Summary


My plan to write a post a day about a likely piece of Advanced Probability bookwork has been more useful than even I anticipated. It’s surprising how often you come to write something down and realise you don’t understand half the steps. Anyway, here’s a list, in approximately chronological order through the course, of what I’ve covered:

Recurrence and Transience of BM

In this post, we consider Brownian motion as a Markov process, and consider the recurrence and transience properties in several dimensions. As motivation, observe from Question 5 of this exam paper that it is a very much non-trivial operation to show that Brownian motion in two-dimensions almost surely has zero Lebesgue measure. We would expect this to be true by default, as we visualise BM as a curve. So it is interesting to see how much we can deduce without significant technical analysis. We will make use of Ito’s formula. This material comes from the Part III course Advanced Probability, which doesn’t explicitly mention Ito’s result, and instead proves the result required separately, making use of the nature of solutions to the diffusion equation. In this context we assume that for f\in C_b^{1,2}:

M_t:=f(t,B_t)-f(0,B_0)-\int_0^t(\frac{\partial}{\partial t}+\frac12\Delta)f(s,B_s)ds

is a martingale. Of course, precisely, from Ito’s formula, this can be expressed as the stochastic integral of a bounded function with respect to Brownian motion, which is therefore a (continuous local, but bounded) martingale.

d=1: In one dimension, BM is point-recurrent. This means that almost surely, BM returns to zero infinitely many times. This is easiest shown by using the time-reversal equivalence to deduce that \lim\sup B_t=-\lim\inf B_t=\infty.

d=2: BM is two dimensions is point-transient. That means that the probability of returning to a given point is <1. In fact it is 0, as one might suspect from the fact that BM is space-invariant and, intuitively at least, has measure 0. However, it is neighbourhood-recurrent, meaning that it almost surely returns to a ball around a given point infinitely often. We discuss small balls around 0, but obviously the conclusions apply equally well elsewhere.

The aim is to choose a function so that the expression in Ito’s formula as above is as simple as possible. Taking f a function of space alone and harmonic causes the integral term to vanish. In this case, f(y)=\log|y| will suffice. Obviously we have to restrict attention to \epsilon\leq |y|\leq R. We stop M at T_\epsilon\wedge T_R, that is the first time that the BM hits the boundary of the annulus on which f is defined, and apply OST, since \log|B_t| is bounded here and the stopping time is a.s. finite. We obviously have to assume the BM starts in this annulus, but then we obtain:

\mathbb{E}_x\log|B_{T_\epsilon\wedge T_R}|=\log|x|

and so we can consider the two possibilities for B_{T_\epsilon\wedge T_R} to deduce:

\mathbb{P}_x(T_\epsilon<T_R)=\frac{\log R-\log|x|}{\log R-\log\epsilon}

Now let \epsilon\downarrow 0 to see that \mathbb{P}_x(B_t=0,\text{ some }t>0)=0. Now apply the (weak) Markov property at a small fixed time a, to deduce, with a mild abuse of notation:

\mathbb{P}_0(B_t=0,\text{ some }t>a)=\int \mathbb{P}_x(B_t=0,t>0)\mathbb{P}_0(B_a=dx)=0

as the first term in the integral we have shown to be 0 B_a-a.e. Then let a\downarrow 0 to obtain the result about point-transience.

For neighbourhood recurrence, instead let R\uparrow\infty, so \mathbb{P}_x(T_\epsilon<\infty)=1. As before, can integrate over law of B_n to obtain

\mathbb{P}_0(|B_t|\leq \epsilon,\text{ some }t\geq n)=1

which is precisely what we require for transience.

d=>3: BM is transient. That is, |B_t|\rightarrow\infty a.s. Note that for d>3, the first three components have the same distribution as BM in three dimensions, and so it suffices to consider the case d=3.

Here, the correct choice of harmonic function is f(y)=\frac{1}{|y|}, so conclude as before that


From this, we can take a limit to see that

\mathbb{P}_x(T_\epsilon<\infty)\leq \frac{\epsilon}{|x|}

We deploy a neat trick to lift this result to a global statement about transience. Define the events that the modulus never returns to n after hitting n^3

A_n:=\{|B_t|>n\quad \forall t\geq T_{n^3}\}



Applying Borel-Cantelli 1, A_n eventually holds almost surely, which certainly implies the desired result.

Coalescence 1: What is it, and why do we care?

As part of Part III, instead of sitting an extra exam paper I am writing an essay. I have chosen the topic of ‘Multiplicative Coalescence’. I want to avoid contravening plagiarism rules, which don’t allow you to quote your own words without a proper citation, which I figure is tricky on a blog, nor open publishing of anything you intend to submit. So just to be absolutely sure, I’m going to suppress this series of posts until after May 4th, when everything has to be handed in.


Informal Description

Coalescence refers to a process in which particles join together over time. An example might be islands of foam on the surface of a cup of coffee. When two clumps meet, they join, and will never split. In this example, a model would need to take into account the shape of all the islands, their positions, their velocities, and boundary properties. To make things tractable, we need to distance ourselves from the idea that particles merge through collisions, which are highly physical and complicated, and instead just consider that they merge.

Description of the Model

When two particles coalesce, it is natural to assume that mass is conserved, as this will be necessary in any physical application. With this in mind, it makes sense to set up the entire model using only the masses of particles. Define the kernel K(x,y) which describes the relative rate or likelihood of the coalescence {x,y} -> x+y. This has a different precise meaning in different contexts. Effectively, we are making a mean-field assumption that all the complications of a physical model as described above can be absorbed into this coalescent kernel, either because the number of particles is large, or because the other effects are small.

When there is, initially, a finite number of particles, the process is stochastic. Coalescence almost surely happen one at a time, and so we can view the process as a continuous time Markov Chain with state space the set of relevant partitions of the total mass present. The transition rate p(A,B) is given by K(x,y) when the coalescence {x,y} -> x+y transforms partition into B, and 0 otherwise. An observation is that the process recording the number of {x,y} -> x+y coalescences is an inhomogeneous Poisson process with local intensity n(x,t)n(y,t)K(x,y) where n(x,t) is the number of particles with mass at time t.

This motivates the move to an infinite-volume setting. Suppose that there are infinitely many particles, so that coalescences are occurring continuously. The rate of {x,y} -> x+y coalescences is still n(x,t)n(y,t)K(x,y) but now n(x,t) specifies the density of particles with mass at time t. Furthermore, because of the continuum framework, this rate is now deterministic rather than stochastic. This is extremely important, as by removing the probability from a probabilistic model, it can be treated as a large but simple ODE.

Two Remarks

1) Once this introduction is finished, we shall be bringing our focus onto multiplicative coalescence, where K(x,y) = xy. In particular, this is a homogeneous function, as are the other canonical kernels. This means that considering K(x,y) = cxy is the same as performing a constant factor time-change when K(x,y) = xy. Similarly, it is not important how the density n(x,t) is scaled as this can also be absorbed with a time-change. In some contexts, it will be natural and useful to demand that the total density be 1, but this will not always be possible. In general it is convenient to absorb as much as possible into the time parameter, particularly initial conditions, as will be discussed.

2) Working with an infinite volume of particles means that mass is no longer constrained to finitely many values. Generally, it is assumed that the masses are discrete, taking values in the positive integers, or continuous, taking values in the positive reals. In this case, the rate of coalescences between particles with masses in (x, x+dx) and (y,y+dy) is n(x,t)n(y,t)K(x,y)dxdy. The main difference between these will arise when we try to view the process as limits of finite processes. Continue reading