Gromov-Hausdorff Distance on Trees

This post continues the exposition of Gromov-Hausdorff distance, as introduced in Chapter Four of Steve Evans’ Probability and Real Trees, which we are reading as a group in the Stats Dept in Oxford at the moment. In this post, we consider applications of the Gromov-Hausdorff distance we have just introduced in the context of trees, viewed as metric spaces.

First we consider a direct application of the previous result, which related the Gromov-Hausdorff distance to the infimum of the distortion across the family of correspondences between the two relevant metric spaces. I found this as Corollary 3.7 in notes by Le Gall and Miermont on scaling limits of random trees and maps, which can be found here. I’m not clear whether there is an original source, but the result is simple enough that probably it does not matter hugely.

Proposition – Given f,g excursions above [0,1], and T_f, T_g the real trees associated with these excursions in the standard way, then

d_{GH}(T_f,T_g)\le 2 ||f-g||.

Proof: We construct an appropriate correspondence

\mathcal {R}=\left\{(a,b) \,:\, \exists t\in[0,1]\text{ s.t. } a=p_f(t), b=p_g(t)\right\}.

In other words, the trees are defined as projections from [0,1] (with some equivalence structure), so taking pairs of projections from [0,1] gives a natural correspondence. Now, suppose (p_f(s),p_g(s)), (p_f(t),p_g(t))\in \mathcal {R}. Then

d_{T_f}(p_f(s),p_f(t)) = f(s) + f(t) - 2\hat f(s,t),

where \hat f(s,t):= \min_{r\in[s,t]}f(r). Obviously, we can replace f by g to obtain

d_{T_g}(p_g(s),p_g(t)) = g(s) + g(t) - 2\hat g(s,t).

Each of the three terms on the right differs from its counterpart for g by at most ||f-g|| (the minima \hat f(s,t), \hat g(s,t) are taken over the same interval), so we can conclude that

|d_{T_f}(p_f(s),p_f(t)) - d_{T_g}(p_g(s),p_g(t)) | \le 4||f-g||,

and so the result follows from the relation between Gromov-Hausdorff distance and the infimum of distortions over the set of correspondences proved at the end of the previous post.
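This bound is easy to sanity-check numerically. Here is a minimal sketch (my own, not from the notes): discretise two excursions on a grid, compute the tree distance d(s,t)=f(s)+f(t)-2\hat f(s,t) directly, and confirm that no pair of projected points has distances differing by more than 4||f-g||.

```python
import numpy as np

def tree_dist(f, i, j):
    # d_{T_f}(p_f(s), p_f(t)) = f(s) + f(t) - 2 * min over [s, t], on a grid
    i, j = min(i, j), max(i, j)
    return f[i] + f[j] - 2 * f[i:j + 1].min()

t = np.linspace(0, 1, 401)
f = np.sin(np.pi * t)                    # an excursion: positive on (0,1), zero at the ends
g = f + 0.05 * np.sin(7 * np.pi * t)     # a perturbed excursion

gap = max(abs(tree_dist(f, i, j) - tree_dist(g, i, j))
          for i in range(0, 401, 20) for j in range(0, 401, 20))
assert gap <= 4 * np.abs(f - g).max() + 1e-12
```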

Gromov-Hausdorff Limits of Real Trees

If we are going to consider the Gromov-Hausdorff topology for limits of trees, we want to be sure that the limit of a sequence of real trees is another real tree. In particular, we want to show that this convergence preserves the property of being a geodesic space.

Theorem 4.19 – Given X_n geodesic spaces, and X complete, such that X_n \stackrel{d_{GH}}\rightarrow X, then X is geodesic.

Proof: Here, I'll go in the opposite order to Evans' book, as I think it's easier to understand why a special case of this implies the whole result once you've actually shown that special case.

Anyway, we want to show that given x,y\in X, there is a geodesic segment x\leftrightarrow y in X. We will start by showing that there is a well-defined midpoint of the geodesic, that is a point z \in X such that d(x,z)=d(y,z)=\frac12 d(x,y).

Given arbitrary \epsilon>0, we can take n such that d_{GH}(X,X_n) < \frac{\epsilon}{3} and then a correspondence \mathcal{R} between X,X_n with \mathrm{dis}\mathcal {R}<\frac{2\epsilon}{3}. Now, by definition of correspondence, we have x',y'\in X_n such that (x,x'),(y,y')\in\mathcal{R}. But we do have geodesics in X_n, so we can take z'\in X_n the midpoint of geodesic x'\leftrightarrow y'. Predictably, we now take z\in X such that (z,z')\in\mathcal {R}.

We can show that z is ‘almost’ the midpoint of [x,y] in the sense that

|d(x,z) - \frac12 d(x,y)| \le \left|d(x',z')-\frac12 d(x',y')\right|+\frac32 \mathrm{dis}\mathcal{R} \le \epsilon,

since z' is an exact midpoint in X_n, so the first term on the right vanishes.

Similarly, we have |d(y,z)-\frac12 d(x,y)|\le \epsilon.

Perhaps it's helpful to think of this point z that we've constructed as being like the midpoint of a taut, but not quite rigid, string: it has to be fairly near the midpoint of the endpoints. In particular, as we let \epsilon\rightarrow 0, the z's we construct form a Cauchy sequence, and thus (by completeness of X) converge to some point, which (in a case of poor notation planning) we also call z \in X; this is the midpoint of [x,y] as described before.

We can now iterate this construction, to demonstrate that whenever q is a dyadic rational in [0,1], there exists z_q \in X such that

d(x,z_q)=q d(x,y),\quad d(y,z_q) = (1-q)d(x,y).

Again then, if we want the above to hold for some general real r in [0,1], we can approximate r arbitrarily well by dyadic rationals q, and the associated points z_q are Cauchy and thus have a well-defined limit with the required properties. We thus have our geodesic segment x\leftrightarrow y.

Rooted Gromov-Hausdorff Distance

In the end, the trees we want to compare might be rooted. For example, we talk about finite trees being invariant under random re-rooting, and we might be interested in similar results for real trees, in particular the BCRT. So we need to compare metric spaces as viewed outwards from particular identified points of each space.

An isometry of rooted spaces must map the root to the root, and so we adjust to obtain rooted Gromov-Hausdorff distance. We might try to consider embeddings into a common space so that the roots are shared, but it will be more convenient to maintain the infimum over metrics on the disjoint union as before. But to ensure the roots sit in roughly the 'same' place in both embeddings, we minimise the maximum of the Hausdorff distance between the sets, and the distance between the images of the two roots in the common space.

Similar results apply as in the unrooted case, and normally the proofs are very similar. As we might expect, when defining a correspondence between rooted spaces, we demand that the pair of roots is one of the pairs in the correspondence, and then the same equivalence between the minimal distortion and the rooted G-H distance applies.

Evans shows that \mathbb{T}^{\mathrm{root}}, the set of compact rooted trees, is complete and separable under the rooted G-H topology. Separability is relatively easy to see. Compact trees have finite \epsilon-nets, and there is a canonical way to view the net as the vertices of a finite tree with edge lengths. Approximating these edge lengths by rationals and considering the countable family of isomorphism classes of rooted finite trees gives separability.

If we want, we can also define k-pointed Gromov-Hausdorff distance, where we demand k points in each space are held fixed.

Tree \eta-erasure

To show that this is a natural topology to consider for the family of trees, Evans devotes a short section to the operation of \eta-erasure, where all points within \eta of a leaf are removed from a given tree.

Formally, R_\eta is a map \mathbb{T}^{\mathrm{root}}\rightarrow \mathbb{T}^{\mathrm{root}} (recalling that these are compact real trees), so that R_\eta(T) consists of the tree

\{\rho\}\cup\{a \in T \, : \, \exists x\in T,\, a\in[\rho, x],d_T(a,x)\ge \eta\},

rooted again at \rho. We claim that the range of R_\eta is the set of compact rooted trees with a finite number of leaves. In this setting, we want a geodesic definition of leaf, for example a leaf is a point that doesn’t lie in the interior of the (unique) geodesic segment between any other point and the root.

Given a tree T with a finite number of leaves, we can glue disjoint segments of length \eta onto every leaf. Taking R_\eta of this deeper tree will give T. Conversely, suppose R_\eta(T) has infinitely many leaves, which we can label a_1,a_2,\ldots. Then we also have x_1,x_2,\ldots \in T such that a_i\in[\rho,x_i], and the segments \{[a_i,x_i]\} are disjoint, and all have length at least \eta. Hence T has no finite \frac{\eta}{2}-net, and so cannot be compact.
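For finite trees, the erasure map is easy to simulate. Below is a rough sketch under my own (hypothetical) representation of a rooted tree with edge lengths as parent pointers: we compute, for each vertex, the distance to the deepest leaf below it, keep the edges with at least \eta of tree hanging below them, and cut the remaining edges part-way.

```python
def erase(parent, root, eta):
    children = {}
    for v, (p, l) in parent.items():
        children.setdefault(p, []).append((v, l))

    height = {}  # height[v] = distance from v to the deepest leaf below it
    def fill(v):
        height[v] = max((l + fill(c) for c, l in children.get(v, [])), default=0.0)
        return height[v]
    fill(root)

    kept, stubs = {}, []
    for v, (p, l) in parent.items():
        if height[v] >= eta:
            kept[v] = (p, l)                        # the whole edge [p, v] survives
        elif l + height[v] > eta:
            stubs.append((p, l + height[v] - eta))  # edge cut part-way below p
    return kept, stubs

# e.g. a path root-a-b with unit edge lengths: R_0.5 keeps (root,a) and half of (a,b)
print(erase({'a': ('root', 1.0), 'b': ('a', 1.0)}, 'root', 0.5))
```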

It is clear that for fixed T, the family of these maps applied to T is continuous with respect to \eta in the G-H topology. When \eta is changed a small amount, the amount of extra tree removed is locally small, and so approximating correspondences by points in what is left is absolutely fine. Indeed, the operations R_{\eta_1}, R_{\eta_2} compose to give R_{\eta_1+\eta_2}.

We want to show that for fixed \eta this map R_\eta is continuous with respect to G-H topology. Suppose two compact rooted trees S,T are covered by a rooted correspondence \mathcal R with distortion \epsilon \ll \eta. We can’t immediately restrict \mathcal R to R_\eta(S)\times R_\eta(T), as it won’t necessarily be a surjection under the projection maps any more.

But we can get around this. Note that if (x,y)\in\mathcal{R}, then |d_S(\rho_S,x) - d_T(\rho_T,y)| \le \epsilon by assumption. We will construct a correspondence \mathcal R' on the erased trees as follows. Given (x,y)\in\mathcal{R}, if (x,y)\in R_\eta(S)\times R_\eta(T), we keep it, and if x\not\in R_\eta(S), y\not \in R_\eta(T), we throw it away. Suppose we have (s,t) in \mathcal{R} with s\in R_\eta(S), t\not \in R_\eta(T). Let l_s be a leaf of S such that s \in [\rho_S,l_s], and let \bar t be the farthest point from the root of the geodesic [\rho_T,t] restricted to R_\eta(T). So the tree above \bar t includes t and has height at most \eta.

In words, if a point appears in a pair in the correspondence but is removed by the erasure, we replace it in the pair with the point closest to the original point that was not removed by the erasure.

It remains to prove that this works. My original proof was short but false, and its replacement is long (and I hope true now), but I will postpone writing this down either until another post, or indefinitely. The original proof by Evans and co-authors can be found as the main content of Lemma 2.6 in the original paper [1].

REFERENCES

[1] Evans, Pitman, Winter – Rayleigh processes, real trees, and root growth with regrafting.

Gromov-Hausdorff Distance and Correspondences

This term, some members of the probability group have been reading Probability and Real Trees, by Steve Evans, based on his Saint-Flour course in 2005. A pdf can be found here. This morning was my turn to present, and I gave a precis of Chapter 4, which concerns metrics on metric spaces, a family of tools which will be essential for later chapters which discuss convergence of trees, viewed as metric objects.

Hausdorff Distance

We start by considering a metric on subsets of a given base space X. The Hausdorff distance between two sets A, B is defined as

d_H(A,B):=\inf\{ r>0: A\subset U_r(B), B\subset U_r(A)\},

where U_r(A):=\{x\in X \,:\, d(x,A)< r\} consists of the set A, together with all the points within r of A. So the Hausdorff distance measures how much we have to fatten each set before it contains the other. Note that if we have a giant set next to a tiny one, we will have to fatten the tiny one a great deal more to achieve this. Sometimes it will be more helpful to think of the following alternative characterisation

d_H(A,B) = \max\left \{ \sup_{a\in A}\inf_{b\in B} d(a,b), \sup_{b\in B}\inf_{a\in A} d(a,b)\right\}.

In words, we measure how far from B the farthest point of A is, and vice versa, and take the larger of the two. The presence of the sups and infs indicates that the inclusion or otherwise of the boundaries of A,B does not affect this distance. In particular, this means that d_H(A,\bar A)=0, and so to make the Hausdorff distance a metric rather than a pseudometric, we restrict attention to M(X), the closed subsets of X.
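For finite subsets of Euclidean space, this characterisation translates directly into a few lines of code; a minimal sketch:

```python
import numpy as np

def hausdorff(A, B):
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # all pairwise distances
    return max(D.min(axis=1).max(),   # sup_{a in A} inf_{b in B} d(a,b)
               D.min(axis=0).max())   # sup_{b in B} inf_{a in A} d(a,b)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.1]])
print(hausdorff(A, B))   # ~1.005: the tiny set needs fattening much more, as noted above
```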

We also observe immediately that if X does not have bounded diameter, then it is possible for the Hausdorff distance between two sets to be infinite. We won’t worry about that for now, because we will mainly be considering host spaces which are compact, for which the following results will be useful.

Lemma 4.4 – If (X,d) is complete, then (M(X),d_H) is also complete.

Proof: Assume we have a sequence of closed sets S_1,S_2,\ldots \subset X which have the Cauchy property under d_H. I’m going to do this slightly differently to Evans, because it’s not the case you can immediately choose x_n\in S_n for each n, such that d(x_n,x_m)\le d_H(S_n,S_m) for all m,n. For an explicit counterexample, see comment from Ed Crane below the article.

Note that if a subsequence of a Cauchy sequence converges to a limit, then the whole sequence converges to that same limit. So we can WLOG replace S_1,S_2,\ldots by some subsequence such that d_H(S_n,S_{n+1})\le 2^{-n}. Now it is clear that for any x_n\in S_n, there is a choice of x_{n+1}\in S_{n+1} such that d(x_n,x_{n+1})\le 2^{-n} (*). Starting from arbitrary x_1\in S_1, we can construct in this manner a sequence x_1,x_2,\ldots,\in X that is Cauchy, and thus has a limit point x\in X.

Let \mathcal{X} be the set of sequences (x_m,x_{m+1},\ldots) for some m, with x_n\in S_n\,\forall n\ge m, satisfying (*). Now let S be the closure of the set of limit points of this family of sequences, which we have shown is non-empty.

Then for any n, and any x_n\in S_n, we can construct such a sequence, and its limit point x, and the triangle inequality down the path x_n,x_{n+1},\ldots gives d(x_n,S)\le 2^{-(n-1)}. Furthermore, by construction S\subset U_{2^{-(n-1)}}(S_n), hence it follows that S_n \stackrel{d_H}\rightarrow S.

Lemma 4.5 – Given (X,d) compact, (M(X),d_H) is also compact.

Sketch proof: We are working with metric spaces, for which the existence of a finite \epsilon-net for every \epsilon>0 is equivalent to compactness. [An \epsilon-net is a set S of points in the space such that every point of the space lies within \epsilon of an element of S. Thinking another way, it is the set of centres of the balls in a finite covering of the space by \epsilon-balls.] It is not too hard to check that if S_\epsilon is an \epsilon-net for (X,d), then \mathcal{P}(S_\epsilon) is finite, and an \epsilon-net for (M(X),d_H).

Gromov-Hausdorff Distance

So far this is fine, but won’t be useful by itself for comparing how similar two trees are as metric spaces, because we can’t be sure a priori that we can embed them in a common host space. To resolve this, we consider instead the Gromov-Hausdorff distance, which will serve as a distance between metric spaces, even when they are not canonically defined as subsets of a common metric space.

Given X, Y metric spaces, we define

d_{GH}(X,Y)=\inf_Z \left\{ d_H(X',Y') \, : \, X',Y' \subset (Z,d)\text{ a metric space }, X'\simeq X, Y'\simeq Y\right\}.

In words, the Gromov-Hausdorff distance between two metric spaces examines the ways to embed them isometrically into a common larger metric space, and gives the minimal Hausdorff distance between them under the class of such embeddings. One issue is that the collection of all metric spaces is not a set. For example, given any set, we can define a metric via the discrete metric, so the collection of metric spaces is at least as large as the collection of all sets, which is not a set. Fortunately, all is not broken, as when we consider a general metric space Z in which we might embed copies of X and Y we are wasting lots of the perhaps very complicated space, because we only need to compare the subsets which are isometric copies of X and Y. So in fact, we lose nothing if we assume that Z is a disjoint union of copies of X and Y, with a metric chosen appropriately. So

d_{GH}(X,Y)=\inf\left\{ d_H(X,Y) : d\text{ a metric on }X\coprod Y\text{ restricting to }d_X \text{ on }X,\, d_Y\text{ on }Y \right\}.

In practice though, this is difficult to compute, since the set of things we have to minimise over is complicated. It turns out we can find an equivalent characterisation which will be easier to use in a number of examples, including the case of real trees, which is the whole point of the course.

Correspondence and Distortion

We define a correspondence from X to Y to be

\mathcal{R}\subset X\times Y\text{ s.t. } \pi_X(\mathcal {R}) = X, \, \pi_Y(\mathcal {R}) = Y,

where \pi_X,\pi_Y are the canonical projection maps from X\times Y into X,Y respectively. So we can think of a correspondence as being something like a matching from X to Y. In a matching, we insist that the projection maps into each set are injections, ie that each element of X (resp Y) can appear in at most one pair, whereas for a correspondence, we demand that the projection maps are surjections, ie that each element of X appears in at least one pair.

Then the distortion of a correspondence is defined by

\mathrm{dis}(\mathcal{R}):= \sup\left\{ |d_X(x,x') - d_Y(y,y')| \,;\, (x,y),(x',y')\in \mathcal{R} \right\}.

In words, if two spaces are not isometric, then a correspondence can't describe an isometry between them, and the distortion is a measure of how far from being an isometry the correspondence is. That is, given a pair of pairs in the correspondence, for an isometry, the distance between the X-elements would be equal to the distance between the Y-elements, and the distortion measures the largest discrepancy between such pairs of pairs over the correspondence.

Theorem 4.11 – d_{GH}(X,Y) = \frac12 \inf_{\mathcal{R}} (\mathrm{dis}\mathcal R), where the infimum is taken over all correspondences \mathcal{R} from X to Y.

Remark: The RHS of this equivalence can be thought of as a coupling of the sets X and Y in which paired points have distances as equal as possible.
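For very small finite metric spaces, the theorem can be checked by brute force: enumerate every correspondence (every subset of X\times Y with surjective projections) and take half the minimal distortion. A sketch, feasible only for tiny examples since the enumeration is exponential:

```python
from itertools import chain, combinations
import numpy as np

def distortion(R, dX, dY):
    # largest discrepancy |d_X(x,x') - d_Y(y,y')| over pairs of pairs in R
    return max(abs(dX[x, x2] - dY[y, y2]) for (x, y) in R for (x2, y2) in R)

def gh(dX, dY):
    # d_GH via Theorem 4.11: half the minimal distortion over all correspondences
    pairs = [(x, y) for x in range(len(dX)) for y in range(len(dY))]
    subsets = chain.from_iterable(combinations(pairs, k) for k in range(1, len(pairs) + 1))
    return min(distortion(R, dX, dY) / 2 for R in subsets
               if {x for x, _ in R} == set(range(len(dX)))
               and {y for _, y in R} == set(range(len(dY))))

dX = np.array([[0.0, 1.0], [1.0, 0.0]])   # two points at distance 1
dY = np.ones((3, 3)) - np.eye(3)          # equilateral triangle of side 1
print(gh(dX, dY))                         # prints 0.5
```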

Proof: Given a metric on X\coprod Y under which d_H(X,Y)<r, we obtain \mathcal{R} with \mathrm{dis}\mathcal {R}<2r, by taking:

\mathcal{R}=\{(x,y): d(x,y)<r\}.

From the definition of Hausdorff distance, it follows that for every x there is a y with d(x,y)<r, and vice versa, hence the appropriate projection maps are surjections. Then, for (x,y),(x',y')\in\mathcal{R}, the triangle inequality in the common space gives |d_X(x,x')-d_Y(y,y')|\le d(x,y)+d(x',y')<2r.

So it remains to prove that d_{GH}(X,Y)\le \frac12 \mathrm{dis}\mathcal{R}=:r. We can define a metric on X\coprod Y, restricting to d_X,d_Y on X,Y respectively, by setting, for x\in X,\, y\in Y,

d(x,y)=\inf\left\{ d_X(x,x')+d_Y(y,y') + r \,:\, (x',y')\in \mathcal{R} \right\}.

Then for any x\in X, there is (x,y)\in\mathcal{R}, and thus d(x,y)\le r, and vice versa hence d_H(X,Y)\le r.

It only remains to check that this is actually a metric. Let’s take x,\bar x \in X, and so

d(\bar x,y)\le \inf\{ d_X(x,\bar x) + d_X(x,x')+d_Y(y,y')+r \,: \, (x',y')\in\mathcal{R}\},

so taking d_X(x,\bar x) outside the brackets gives one form of the triangle inequality. We have to check the ‘other combination’ of the triangle inequality. We assume that the infima for (x,y), (\bar x,y) are attained at (x',y'),(\bar x',\bar y') respectively.

d(x,y)+d(\bar x,y)= 2r+ d_X(x,x')+d_X(\bar x,\bar x') + d_Y(y,y')+d_Y(y,\bar y').

But we also have d_Y(y,y')+d_Y(y,\bar y')\ge d_Y(y',\bar y'), and d_X(x',\bar x')\ge d_Y(y',\bar y')-2r from the definition of distortion. Combining these gives d(x,y)+d(\bar x,y)\ge d_X(x,\bar x), the remaining triangle inequality, which completes the proof of this theorem.
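The construction in the second half of the proof is also easy to test numerically: given a correspondence between finite spaces, build the cross-distances d(x,y) as above with r=\frac12\mathrm{dis}\mathcal{R}, and check the Hausdorff bound. A sketch:

```python
import numpy as np

def glued_cross_distances(R, dX, dY):
    # r = dis(R)/2; d(x,y) = min over (x',y') in R of d_X(x,x') + d_Y(y,y') + r
    r = max(abs(dX[x, x2] - dY[y, y2]) for (x, y) in R for (x2, y2) in R) / 2
    d = np.array([[min(dX[x, x2] + dY[y, y2] + r for (x2, y2) in R)
                   for y in range(len(dY))] for x in range(len(dX))])
    return d, r

dX = np.array([[0.0, 1.0], [1.0, 0.0]])
dY = np.ones((3, 3)) - np.eye(3)
d, r = glued_cross_distances([(0, 0), (1, 1), (1, 2)], dX, dY)
assert max(d.min(axis=1).max(), d.min(axis=0).max()) <= r   # i.e. d_H(X,Y) <= r
```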

Tightness in Skorohod Space

This post continues the theme of revising topics in the analytic toolkit relevant to proving convergence of stochastic processes. Of particular interest is the question of how to prove that families of Markov chains might have a process scaling limit converging to a solution of some stochastic differential equation, in a generalisation of Donsker’s theorem for Brownian motion. In this post, however, we address more general aspects of convergence of stochastic processes, with particular reference to Skorohod space.

Topological Background

I've discussed Skorohod space in a previous post. For now, we focus attention on functions on a compact time interval, D[0,T]. Some of what follows can be extended to the infinite-time setting easily, and some requires more work. Although we can define a metric on the space of cadlag functions in lots of ways, it is more useful to think topologically, or at least with a more vague sense of metric. We say two cadlag functions are close to one another if there is a reparameterisation of the time-axis (a function from [0,T] to itself) that is uniformly close to the identity function, and when applied to one of the cadlag functions, brings it close to the other cadlag function. Heuristically, two cadlag functions are close if their large jumps are close to one another and of similar size, and if they are uniformly close elsewhere. It is worth remembering that a cadlag function on even an unbounded interval can have only countably many jumps, and only finitely many with magnitude greater than some threshold on any compact interval.

For much of the theory one would like to use, it is useful for the spaces under investigation to be separable. Recall a topological space is separable if there exists a countable dense subset. Note in particular that D[0,T] is not separable under the uniform metric, since we can define f_x(\cdot)=\mathbf{1}_{(\cdot \ge x)} for each x\in[0,T], then ||f_x-f_y||_\infty=1 whenever x\ne y. In particular, we have an uncountable collection of disjoint open sets given by the balls \mathcal{B}(f_x,\frac12), and so the space is not separable. Similarly, C[0,\infty) is not separable. A counterexample might be given by considering functions which take the values {0,1} on the integers, interpolated linearly in between. Thus we have a map from \{0,1\}^{\mathbb{N}}\rightarrow C[0,\infty), where the uniform distance between any two distinct image points is at least one, hence the open balls of radius 1/2 around each image point give the same contradiction as before. However, the Stone-Weierstrass theorem shows that C[0,T] is separable, as we can approximate any such function uniformly well by a polynomial, and thus uniformly well by a polynomial with rational coefficients.

In any case, it can be shown that D[0,T] is separable with respect to the natural choice of metric. It can also be shown that there is a metric which gives the same open sets (hence is a topologically equivalent metric) under which D[0,T] is complete, and hence a Polish space.

Compactness in C[0,T] and D[0,T]

We are interested in tightness of measures on D[0,T], so first we need to address compactness for sets of deterministic functions in D[0,T]. First, we consider C[0,T]. Here, the conditions for a set of functions to be compact are given by the celebrated Arzela-Ascoli theorem. We are really interested in compactness as a property of size, so we consider instead relative compactness. A set is relatively compact (sometimes pre-compact) if its closure is compact. For the existence of subsequential limits, this is identical to compactness, only now we allow the possibility of the limit point lying outside the set.

We note that the function C[0,T]\rightarrow \mathbb{R} given by f\mapsto ||f||_\infty is continuous, and hence uniform boundedness is certainly a required condition for relative compactness in C[0,T]. Arzela-Ascoli states that uniform boundedness plus equicontinuity is sufficient for a set of such functions to be relatively compact. Equicontinuity should be thought of as uniform continuity that is uniform among all the functions in the set, rather than just within each individual function.

For identical reasons, we need uniform boundedness for relative compactness in D[0,T], but obviously uniform continuity won’t work as a criterion for discontinuous functions! We seek some analogue of the modulus of continuity that ignores jumps. We define

\omega'_\delta(f):=\inf_{\{t_i\}} \max_i \sup_{s,t\in[t_{i-1},t_i)} |f(s)-f(t)|,

where the infimum is taken over all meshes 0=t_0<t_1<\ldots<t_r with t_i-t_{i-1}>\delta. Note that as \delta\downarrow 0, we can, if we want, place the t_i so that large jumps of the function f take place over the boundaries between adjacent parts of the mesh. In particular, for a given cadlag function, it can be shown fairly easily that \omega'_\delta(f)\downarrow 0 as \delta\rightarrow 0. Then, unsurprisingly, in a similar fashion to the Arzela-Ascoli theorem, it follows that a set of functions A\subset D[0,T] is relatively compact if it is uniformly bounded, and

\lim_{\delta\rightarrow 0} \sup_{f\in A}\omega'_\delta(f)=0.

Note that this 'modulus of continuity' needs to decay uniformly across the set of functions, but that we do not need to choose the mesh at level \delta uniformly across all functions. That would obviously not work: the set of functions \mathbf{1}_{(\cdot\ge x_n)} for a sequence x_n\rightarrow x would then fail to be relatively compact, but these functions clearly converge in Skorohod space!
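To make the role of the mesh concrete, here is a small sketch: it evaluates the bracketed quantity for one particular mesh, which gives an upper bound on \omega'_\delta(f) (the infimum over all admissible meshes). For a single unit jump, a mesh with the jump on a cell boundary sees only the continuous oscillation.

```python
import numpy as np

def mesh_osc(f, ts, mesh):
    # max over mesh cells [t_{i-1}, t_i) of the oscillation of f, on a grid ts
    osc = 0.0
    for a, b in zip(mesh[:-1], mesh[1:]):
        cell = f[(ts >= a) & (ts < b)]
        if cell.size:
            osc = max(osc, cell.max() - cell.min())
    return osc

ts = np.linspace(0, 1, 1000, endpoint=False)
f = np.where(ts < 0.5, ts, 1.0 + ts)        # cadlag: drift plus a unit jump at t = 0.5
print(mesh_osc(f, ts, [0.0, 0.5, 1.0]))     # ~0.5: the jump sits on a mesh boundary
print(mesh_osc(f, ts, [0.0, 0.3, 1.0]))     # ~1.7: the jump falls inside a cell
```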

Tightness in C[0,T] and D[0,T]

Naturally, we are mainly interested in (probability) measures on D[0,T], and in particular conditions for tightness on this space. Recall a family of measures \Pi is tight if for any \epsilon>0, there exists some compact set A such that

\pi(A)>1-\epsilon,\quad \forall \pi\in\Pi.

So, for measures (\mu_n) on D[0,T], the sequence is tight precisely if for any \epsilon>0, there exists M,\delta and some N such that for any n>N, both

\mu_n(||f||_\infty >M)\le \epsilon,\quad \mu_n(\omega'_\delta(f)>\epsilon)\le \epsilon

hold. In fact, the second condition controls variation sufficiently strongly that we can replace the first condition with

\mu_n(|f(0)|>M)\le \epsilon.

Often we might be taking some sort of scaling limit of these processes in D[0,T], where the jumps become so small in the limit that we expect the limit process to be continuous, perhaps the solution to an SDE, or a diffusion. If we can replace \omega'_\delta by \omega_\delta, the standard modulus of continuity, then we have in addition that any weak limit lies in C[0,T].

In general, to prove convergence of some stochastic processes, we will want to show that the processes are tight, by demonstrating the properties above, or something equivalent. Then Prohorov’s theorem (which I tend to think of as a probabilistic functional version of Bolzano-Weierstrass) asserts that the family of processes has a weak subsequential limit. Typically, one then shows that any weak subsequential limit must have the law of some particular random process. Normally this is achieved by showing some martingale property (eg for an SDE) in the limit, often by using the Skorohod representation theorem to use almost sure subsequential convergence rather than merely weak convergence. Then one argues that there is a unique process with this property and a given initial distribution. So since all weak subsequential limits are this given process, in fact the whole family has a weak limit.

Non-separable Skorohod Representations

In the previous post, I discussed the statement and proof of the Skorohod representation theorem. This concerns the conditions under which it is possible to couple distributions which converge in law, to obtain a family of random variables on a possibly very large probability space, which converge almost surely. The condition for the theorem to hold is that the base space, or at least the support of the limiting distribution, should be a separable metric space. Skorohod's original proof concerned the case where all the distributions were supported on a complete, separable metric space (Polish space), but this extension is not particularly involved, and was proven not long after the original result.

It is natural to ask exactly what goes wrong in non-separable or non-metrizable spaces. Recall a space is separable if it contains a countable dense subset. Obviously, finite or countable sets are by definition separable with any metric. Considering the points with rational coordinates shows that \mathbb{R}^d is separable for each d, and the Stone-Weierstrass theorem shows that continuous functions on a bounded interval are also separable with the uniform topology, as they can be approximated uniformly well by polynomials with rational coefficients. One heuristic is that a separable space does not have 'too many' open sets.

There are references (for example, see [2]) to examples of Skorohod non-representation in non-metrizable topological spaces, which are ‘big’ enough to allow convergence in distribution with respect to a particular class of test functions, but where the distributions are not uniformly tight, so cannot converge almost surely. However, I don’t really understand this well at all, and have struggled to chase the references, some of which are unavailable, and some in French.

Instead, I want to talk about an example given in [1] of a family of distributions on a non-separable space, which cannot be coupled to converge almost surely. The space is (0,1) equipped with the discrete metric, which says that d(x,y)=1 whenever x\ne y. Note that it is very hard to have even deterministic convergence in this space, since the only way to be close to an element of the space is indeed to be equal to that element. We will construct random variables, and it will be unsurprising that they cannot possibly converge almost surely in any coupling, but the exact nature of the construction will lead to convergence in distribution.

Based on what we proved last time, the support of the limiting distribution will be non-separable. It turns out that the existence of such a distribution is equiconsistent in the sense of formal logic with the existence of an extension of Lebesgue measure to the whole power set of (0,1). This is not allowed under the Axiom of Choice, but is consistent under the slightly weaker Axiom of Dependent Choice (DC). This weaker condition says, translated into language more familiar to me, that every directed graph with arbitrary (and in particular, potentially uncountable) vertex set, and with all out-degrees at least 1, contains an infinite directed path. This seems obvious when viewed through the typically countable context of graph theory. But the natural construction is to start somewhere and 'just keep going' wherever possible, which involves making a choice from the out-neighbourhood at lots of vertices. Thus it is clear why this is weaker than AC. Anyway, in the sequel, we assume that this extension of Lebesgue measure exists.

Example (from [1]): We take (X_n)_{n\ge 1} to be an IID sequence of non-negative RVs defined on the probability space ((0,1),\mathcal{B}(0,1),\mathrm{Leb}), with expectation under Lebesgue measure equal to 1. It is not obvious how to do this, with the restriction on the probability space. One example might be to write \omega\in(0,1) as \overline{\omega_1\omega_2\ldots}, the binary expansion, and then set X_n=2\omega_n. We will later require that X_n is not identically 1, which certainly holds in this example just given.
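A sketch of this suggested construction: under Lebesgue measure on (0,1), the binary digits \omega_n are IID Bernoulli(1/2), so X_n=2\omega_n are IID taking values in \{0,2\} with mean 1 (up to floating-point precision in the snippet below).

```python
import random

def digit(omega, n):
    # nth binary digit of omega in (0,1), i.e. omega = 0.omega_1 omega_2 ... in base 2
    return int(omega * 2 ** n) % 2

omega = random.random()
X = [2 * digit(omega, n) for n in range(1, 21)]  # IID in {0, 2}, mean 1 under Lebesgue
print(X)
```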

Let \mu be the extension of Lebesgue measure to the power set \mathcal{P}=\mathcal{P}(0,1). Now define the measures:

\mu_n(B)=\mathbb{E}_\mu(X_n \mathbf{1}_B),\quad \forall B\in\mathcal{P}.

To clarify, we are defining a family of measures which also are defined for all elements of the power set. We have defined them in a way that is by definition a coupling. This will make it possible to show convergence in distribution, but they will not converge almost surely in this coupling, or, in fact, under any coupling. Now consider a restricted class of sets, namely B\in \sigma(X_1,\ldots,X_k), the class of sets distinguishable by the outcomes of the first k RVs.

[Caution: the interpretation of this increasing filtration is a bit different to the standard setting with for example Markov processes, as the sets under consideration are actually subsets of the probability space on which everything is defined. In particular, there is no notion that a ‘fixed deterministic set’ lies in all the layers of the filtration.]

Anyway, when n>k, by independence we have

\mu_n(B)=\mathbb{E}_\mu(X_n\mathbf{1}_B)=\mathbb{E}_\mu(X_n)\mathbb{E}_\mu(\mathbf{1}_B)=\mu(B).

So whenever B\in\mathcal{F}:=\bigcup_k \sigma(X_1,\ldots,X_k), we have \lim_n \mu_n(B)=\mu(B). By MCT, we can extend this convergence to any bounded \mathcal F-measurable function.

This is the clever bit. We want to show that \mu_n(B)\rightarrow\mu(B) for all B\in\mathcal P, but we only have it so far for B\in\mathcal F. But since \mathcal{F}\subset \mathcal P, and \mathcal{P} is the \sigma-field of the probability space under the (non-AC) assumption, we can take conditional expectations. In particular, for any B\in\mathcal P, \mathbb{E}_\mu[\mathbf{1}_B | \mathcal{F}] is a bounded, \mathcal F-measurable function, and since X_n is \mathcal{F}-measurable, \mathbb{E}_\mu[X_n\mathbf{1}_B]=\mathbb{E}_\mu[X_n\mathbb{E}_\mu[\mathbf{1}_B|\mathcal F]]. Hence, by definition of \mu_n and the extended MCT result:

\mu_n(B)=\mathbb{E}_\mu[X_n\mathbb{E}_\mu[\mathbf{1}_B|\mathcal F]]=\mathbb{E}_{\mu_n}[\mathbb{E}_\mu[\mathbf{1}_B|\mathcal F]] \rightarrow \mathbb{E}_\mu [\mathbb{E}_\mu[\mathbf{1}_B |\mathcal{F}]].

Now, since by definition \mathbf{1}_B is \mathcal{P}-measurable, applying the tower law gives that this is equal to \mu(B). So we have

\mu_n(B)\rightarrow \mu(B),\quad \forall B\in\mathcal{P}. (*)

This gives weak convergence \mu_n\Rightarrow \mu. At first glance it might look like we have proved a much stronger condition than we need. But recall that in a space equipped with the discrete topology, every subset is both open and closed, and so to use the portmanteau lemma, (*) really is required.

Now we have to check that we can't have almost sure convergence in any coupling of these measures. Suppose that we have a probability space with random variables Y,(Y_n) satisfying \mathcal L(Y)=\mu, \mathcal L(Y_n)=\mu_n. Citing the example I gave of X_n satisfying the conditions, the event \{X_n=0\} has \mu_n-measure \mathbb{E}_\mu[X_n\mathbf{1}_{\{X_n=0\}}]=0, and so X_n(Y_n)=2 almost surely, for every n, irrespective of the coupling. But almost sure convergence in the discrete metric forces Y_n=Y eventually, which would give X_n(Y)=2, that is \omega_n(Y)=1, for all large n. The set of \omega whose binary digits are eventually all 1 is countable, and so has \mu-measure zero, since \mu extends Lebesgue measure. So it is impossible that Y_n converges almost surely to Y in any coupling.

References

[1] Berti, Pratelli, Rigo – Skorohod Representation and Disintegrability (here – possibly not open access)

[2] Jakubowski – The almost sure Skorokhod representation for subsequences in non-metric spaces.

Skorohod Representation Theorem

Continuing the theme of revising theory in the convergence of random processes that I shouldn’t have forgotten so rapidly, today we consider the Skorohod Representation Theorem. Recall from the standard discussion of the different modes of convergence of random variables that almost sure convergence is among the strongest since it implies convergence in probability and thus convergence in distribution. (But not convergence in L_1. For example, take U uniform on [0,1], and X_n=n\mathbf{1}_{\{U<\frac{1}{n}\}}.)

Almost sure convergence is therefore in some sense the most useful form of convergence to have. However, it comes with a strong prerequisite, that the random variables be defined on the same probability space, which is not required for convergence in distribution. Indeed, one can set up weak versions of convergence in distribution which do not even require the convergents to be random variables. The Skorohod representation theorem gives a partial converse to this result. It states some conditions under which random variables which converge in distribution can be coupled on some larger probability space to obtain almost sure convergence.

Skorohod’s original proof dealt with convergence of distributions defined on complete, separable metric spaces (Polish spaces). The version discussed here is from Chapter 5 of Billingsley [1], and assumes the limiting distribution has separable support. More recent authors have considered stronger convergence conditions (convergence in total variation or Wasserstein distance, for example) with weaker topological requirements, and convergence of random variables defined in non-metrizable spaces.

Theorem (Skorohod representation theorem): Suppose that distributions P_n\Rightarrow P, where P is a distribution with separable support. Then we can define a probability space (\Omega,\mathcal{F},\mathbb{P}) and random variables X,(X_n)_{n\ge 1} on this space such that the laws of X,X_n are P,P_n respectively and X_n(\omega)\rightarrow X(\omega) for all \omega\in\Omega.

NB. We are proving ‘sure convergence’ rather than merely almost sure convergence! It is not surprising that this is possible, since changing the value of all the X_ns on a set with measure zero doesn’t affect the conditions for convergence in distribution.

Applications: Before going through the Billingsley proof, we consider one simple application of this result. Let S be a separable metric space containing the support of X, and g a continuous function S\rightarrow S'. Then

X_n\stackrel{a.s.}{\rightarrow}X\quad\Rightarrow\quad g(X_n)\stackrel{a.s.}{\rightarrow}g(X).

So, by applying the Skorohod representation theorem once, and the result that almost sure convergence implies convergence in distribution, we have shown that

X_n\stackrel{d}{\rightarrow}X\quad \Rightarrow \quad g(X_n)\stackrel{d}{\rightarrow}g(X),

subject to these conditions on the space supporting X.

Proof (from [1]): Unsurprisingly, the idea is to construct realisations of the (X_n) from a realisation of X. We take X, and a partition of the support of X into small measurable sets, chosen so that the probability of lying in a particular set is almost the same for X_n as for X, for large n. Then, the X_n are constructed so that for large n, with limitingly high probability X_n lies in the same small set as X.

Constructing the partition is the first step. For each x\in S:=\mathrm{supp}(X), there must be some radius \frac{\epsilon}{4}<r_x<\frac{\epsilon}{2} such that P(\partial B(x,r_x))=0, since the boundaries \partial B(x,r) are disjoint for different values of r, so at most countably many of them can carry positive mass. This is where we use separability. Since every point in the space is within \frac{\epsilon}{4} of some element of a countable sequence of elements of the space, we can take a countable subset of these open balls B(x,r_x) which cover the space. Furthermore, we can take a finite subset of the balls which cover all of the space apart from a set of measure at most \epsilon. We want the sets to be disjoint, and we can achieve this by removing the intersections inductively in the obvious way. We end up with a collection B_0,B_1,\ldots,B_k, where B_0 is the leftover space, such that

  • P(B_0)<\epsilon
  • P(\partial B_i)=0,\quad i=0,1,\ldots,k
  • \mathrm{diam}(B_i)<\epsilon,\quad i=1,\ldots,k.

Now suppose for each m, we take such a partition B^m_0,B^m_1,\ldots,B^m_{k_m}, for which \epsilon_m=\frac{1}{2^m}. Unsurprisingly, this scaling of \epsilon is chosen so as to use Borel-Cantelli at the end. Then, from convergence in distribution, there exists an integer N_m such that for n\ge N_m, we have

P_n(B^m_i)\ge (1-\epsilon_m)P(B^m_i),\quad i=0,1,\ldots,k_m. (*)

Now, for N_m\le n <N_{m+1}, for each B^m_i with non-zero probability under P, take Y_{n,i} to be independent random variables with law P_n(\cdot | B^m_i) equal to the restriction onto the set. Now take \xi\sim U[0,1] independent of everything so far. Now we make concrete the heuristic for constructing X_n from X. We define:

X_n=\sum_{i=0}^{k_m}\mathbf{1}_{\{\xi\le 1-\epsilon_m, X\in B^m_i\}} Y_{n,i} + \mathbf{1}_{\{\xi>1-\epsilon_m\}}Z_n.

We haven’t defined Z_n yet. But, from (*), there is a unique distribution such that taking Z_n to be independent of everything so far, with this distribution, we have \mathcal{L}(X_n)=P_n. Note that by iteratively defining random variables which are independent of everything previously defined, our resulting probability space \Omega will be a large product space.

Note that \xi controls whether the X_n follow the law we have good control over, and we also want to avoid the set B^m_0. So define E_m:=\{X\not \in B^m_0, \xi\le 1-\epsilon_m\}. Then, P(E_m^c)<2\epsilon_m=2^{-(m-1)}, and so by Borel-Cantelli, with probability 1, E_m holds for all m larger than some threshold. Let us call this event E:=\liminf_m E_m; on E, we have by definition X_n \rightarrow X. So we have almost sure convergence. But we can easily convert this to sure convergence by setting X_n\equiv X on the null set E^c, as this does not affect the distributions.

Omissions: 

  • Obviously, I have omitted the exact construction of the distribution of Z_n. This can be reconstructed very easily from (*), but requires more notation than is ideal for this medium.
  • It is necessary to remove any sets B^m_i with zero measure under P for the conditioning to make sense. These can be added to B^m_0 without changing any of the required conditions.
  • We haven’t dealt with any X_n for n<N_1.

The natural question to ask is what happens if we remove the restriction that the space be separable. There are indeed counterexamples to the existence of a Skorohod representation. The clearest example I’ve found so far is supported on (0,1) with a metric inducing the discrete topology. If time allows, I will explain this construction in a post shortly.

References

[1] – Billingsley – Convergence of Probability Measures, 2nd edition (1999)

Random Maps 1 – Towards the Schaeffer Bijection

I have spent the past ten days in Saint Flour, an inaccessible but picturesque town in the rolling hills of Cantal, in the middle of France, and venue for perhaps the most notable summer school in probability. My highlight has been the course ‘Aspects of Random Maps’ delivered by Gregory Miermont, and I thought I should write a few posts about points of interest encountered during the lectures and private study.

A map is an embedding of a connected graph onto a surface. We consider such embeddings up to homeomorphisms of the surface which preserve its orientation. One advantage of doing this is that the set of maps now might be countable, and the set of maps with n edges might be finite. This can be proved by considering a map to be obtained by glueing together polygonal faces. Some potential glueings are impossible, and some are equivalent, but for a fixed number of edges, the number of such sets of polygons and a possible glueing is finite. In fact we can be much more precise about how to describe the legal glueings through a triple of permutations, but I won't discuss this here.

I haven’t yet given a complete definition of a map. We want a typical large map, that is a map with a large number of vertices and edges, to be topologically roughly the same as the surface it is embedded into. In particular, the map needs to encode the geometric features of the surface. So a small triangle on the surface of a torus should not be considered a map. To rigorise this, we demand that any face of a map should be a topological disc, in particular, it should be simply connected. Since the torus itself is not simply connected, this excludes our triangle example. Note a single vertex on a torus is also excluded.

Although it goes against the usual order of definitions, it might be helpful to think of a map as an embedding which satisfies Euler's formula: V - E + F = 2 - 2g, where g is the genus of the surface. For a connected planar graph, induction on the number of edges and vertices is the typical way to prove this result. The inductive step works the same on a more general surface, but it is less clear what the base case should be. Another consequence of the definition is that we should work on the sphere rather than the plane. From now on, this is our surface of interest.

We begin by considering \mathcal{M}_n to be the family of rooted plane maps with n edges. The root is a distinguished oriented edge. Our aim is to count the size of this set.

Before doing this, we digress onto the topic of rooted plane trees. Note that any (rooted) tree in the classical sense is planar, but in a rooted plane tree, we also specify the geometric ordering of the offspring. For example, if the root has two offspring, of which one has precisely one offspring and the other has none, we consider these as two separate cases.

So now, if we denote by a_k the number of plane trees with k edges, we can define a generating function via A(z):=\sum_{k\ge 0} a_k z^k. If the root vertex v has no offspring, this gives one possibility corresponding to k=0. Otherwise, there is a well-defined left-most offspring of the root, called u. Then u and its descendants form a plane tree, and v and its descendants apart from those through u also form a plane tree. So after accounting for the edge between u and v, we obtain

A(z)=1+zA(z)^2,

whenever A(z) is defined. We can now apply whichever is our favourite method of showing that this is the generating function of the Catalan numbers, a_k=\frac{1}{k+1}\binom{2k}{k}.
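For instance, extracting coefficients from A(z)=1+zA(z)^2 gives the recurrence a_k=\sum_{i=0}^{k-1}a_i a_{k-1-i}, which a few lines of code confirm reproduces the Catalan numbers:

```python
from math import comb

a = [1]                                           # a_0 = 1: the tree with no edges
for k in range(1, 10):
    a.append(sum(a[i] * a[k - 1 - i] for i in range(k)))   # split at left-most offspring
assert all(a[k] == comb(2 * k, k) // (k + 1) for k in range(10))
print(a)   # [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]
```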

There is a more complicated version of this generating function argument due to Tutte that allows us to enumerate \mathcal{M}_n. It is convenient to work with a second variable in the generating function that encodes the degree of the root face. The resulting equation of generating functions is less well-known but using the Lagrange inversion formula gives the explicit expression

|\mathcal{M}_n|=\frac{2}{n+2}\cdot \frac{3^n}{n+1}\binom{2n}{n}.

Although there are extra terms, this motivates seeking a bijection between maps, and some version of rooted plane trees, perhaps decorated with some extra information. As in many cases, this will turn out to be possible. The bijection we end up with will not just help us enumerate the maps, but will also allow us to control a lot more information about distances in the map, which will be particularly useful when we try to take limits.

The first observation is that given a map, we can construct a dual map, by placing fresh vertices somewhere in the middle of each face, and joining a pair of these if the corresponding faces in the original graph share an edge.

Alternatively, we can place the same fresh vertices in the middle of each face, then join each new vertex to an original vertex, if that original vertex lies on the face corresponding to the new vertex. If you focus in on an original edge, it is clear that it is now surrounded by a ‘diamond’ (if you’ve drawn the diagram in a natural way) of new edges. Removing the original edges thus leaves us with a quadrangulation. This procedure is called the ‘trivial bijection’ between \mathcal{M}_n and \mathcal{Q}_n, the family of rooted quadrangulations with n faces. Note that the root in such a quadrangulation is an identified directed edge, rather than a vertex. We haven’t yet specified how to describe the root of the resulting quadrangulation. It suffices to take the first new edge which lies clockwise of the root edge in the original graph, seen from the ‘tail’ of the root, which is of course oriented.

In this, and the bijections which follow, the natural questions to ask are: a) is the inverse obvious? and b) what happens to self-loops and isthmuses? Here, the inverse really is obvious. Any quadrangulation is bipartite, hence two-colourable, so we need to fix one colour and join the two vertices of that colour within each face to recover the original graph. The root tells us which colour we need to take. As for the second question, first we should say that an isthmus is an edge which has the same face on both sides. This causes no problems in this particular bijection. For the self-loops, we get a sort of Pacman-like quadrangle, with two ‘outer-edges’ between the same two vertices, and an edge between one of the outer vertices and some internal vertex. This edge contributes twice to the degree of the face.

The upshot of this is that for a simple enumeration, it suffices to prove that |\mathcal{Q}_n|=\frac{2}{n+2}\cdot \frac{3^n}{n+1}\binom{2n}{n}. This may not look like we have achieved much, but we can now apply Euler's formula to any quadrangulation in this set to deduce that the number of vertices present is n+2. If we consider the set \mathcal{Q}_n^*, where now we identify a particular vertex v_* in the quadrangulation, it suffices to prove that |\mathcal{Q}_n^*|=2\cdot 3^n a_n, where a_n is the nth Catalan number as before. Now we have the most efficient setup to look for a bijection with some type of decorated plane tree as discussed before.
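As a quick arithmetic sanity check that these formulas are consistent with each other:

```python
from math import comb

# |M_n| = |Q_n| = 2*3^n/((n+1)(n+2)) * C(2n,n), and |Q_n^*| = (n+2)|Q_n| = 2*3^n*a_n
for n in range(1, 8):
    M = 2 * 3 ** n * comb(2 * n, n) // ((n + 1) * (n + 2))
    catalan = comb(2 * n, n) // (n + 1)
    assert (n + 2) * M == 2 * 3 ** n * catalan
print("consistent")
```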

Hausdorff Dimension

We work out the area of a square by multiplying the side length by itself. We think of the area as a measure of the size of the square. This value remains meaningful however we think of the square. Whether as a region of R^2 (the plane), or as a subset of the complex numbers, viewed as the Argand diagram, or as an object sitting in the 3-dimensional real world.

But this then becomes a problem if we want to keep using the word ‘measure’. Because the standard Lebesgue measure on R^2 coincides with this definition of area, but if we are in R^3, then the measure of a square is zero. Of course, we are really articulating the idea that dimension is key. The area of a square may be large or small, but its volume will always be zero. Similarly, if we were to try and measure the ‘length’ of a square, in some sense, the only sensible answer could be that the length is infinite. Why? Well, any measure must satisfy the constraint of being increasing under containment. In this sense, any square contains an arbitrarily large number of disjoint lines of length k/2, where k is the side-length of the square, and so the only possibility for the length measure is infinite.

So we realise that dimension is key to resolving this problem. And we are clear in our minds that a square is two-dimensional, since it can be embedded in R^2 but not in R. But this is all rather vague. We would ideally like to have a definition of dimension that is independent of choice of embedding. Not least, the property used in the case of the square has an algebraic flavour that is clear here, but which will certainly not generalise. In vector space terminology, a square is not a subspace, but it certainly generates a subspace of dimension 2. But we want a definition that exploits topological properties. Not least because, for example, the surface of a sphere sits in R^3 and indeed generates R^3 as a vector space, but is very much two-dimensional under any sensible definition.

Ideally we want a way to find the dimension of a space, given only its metric properties. Heuristically, we feel that a measure is the metric to the power of the dimension, so this would resolve all the problems we have been considering.

The rest of this post will talk about one possible solution, the Hausdorff dimension, which specifies the dimension of a metric space in an embedding-independent manner. The motivation is as follows. We can cover a square with lots of smaller squares. Indeed, a square of side-length 1 can be decomposed into (1+o(1))r^{-2} squares of side-length r, as r \rightarrow 0. Although they do not tessellate as nicely, the same result holds for small balls. Such a square can be covered by \Theta(r^{-2}) balls of radius r. This is a useful observation, because a ball is specified only by metric properties of the space.

From this idea, we obtain a way of constructing a family of measures on the Borel sets of a metric space, which specify the size of subsets of any dimension. We define the \alpha-Hausdorff measure of A by:

\mathcal{H}^\alpha(A):= \lim_{\delta\downarrow 0}\, \inf\left\{\sum_{i=1}^\infty \text{diam}(E_i)^\alpha \,:\, (E_i)\text{ a covering of }A,\ \text{diam}(E_i)<\delta\right\}.

In other words, we take a covering of A by small sets of diameter at most \delta. These might as well be balls at this point. Then we calculate the infimum of the \alpha-dimensional measure of these balls, across all such coverings. Letting \delta\downarrow 0 gives a measure of the \alpha-dimensional volume required to cover a subset with small \alpha-dimensional balls. We expect this to be zero when the set has dimension less than \alpha, and infinite when the set has dimension greater than \alpha. Naturally then, we define the Hausdorff dimension by

\text{dim}(E)=\inf\{s\geq 0: \mathcal{H}^s(E)=0\}.

An inquisitive child might well ask the following question: we are happy with the notion of 2-dimensional space, and 3-dimensional space, but what about other non-integer values? Can we have a 2.5-dimensional space, and what does it mean?

As an example of why the integers might not be enough to specify dimension, it is useful to consider fractals. A fractal is some object that has properties of self-similarity, which might mean that, for example, it is impossible to tell how zoomed in a view is. These are typically constructed as the limit of an iterative procedure where each unit present is broken down into several smaller units in a self-similar way. Examples include the Koch snowflake, where the middle third of a line is replaced by the other two sides of the equilateral triangle with base equal to that middle third. The same operation is then successively applied to each smaller line segment and so on. The Sierpinski gasket is constructed by applying a similar operation to the areas of triangles.

In the case of the Koch snowflake, in each operation, the total length of the object increases by a factor of 4/3, and so the length of the limiting object is infinite, even though it is constructed from lines, and the endpoints are still distance 1 apart. So it is not clear what the dimension of this object should be. It doesn’t fill space to quite the same extent as a plane area, but by self-similarity, it fills space a bit more than a line locally to any given point.

It's perhaps easiest to consider as an example for calculation the Cantor set. Recall this is constructed by starting with the unit interval [0,1], then removing the middle third to leave [0,\frac13]\cup[\frac23,1], and then successively removing the middle third of every remaining interval. What remains are precisely those reals which have no '1' in their ternary expansion. Unsurprisingly, this is nowhere dense, and has zero Lebesgue measure, since the measure after the nth iteration is (\frac23)^n\rightarrow 0.

However, we can calculate its Hausdorff dimension, by taking advantage of its self-similarity. We suppose that this dimension is s, and we consider the s-Hausdorff measure. It is clear from the definition that the measure of a disjoint union of sets is equal to the sum of the individual measures. So we apply this to the decomposition of the Cantor set C as:

C=(C\cap [0,\frac13] )\cup (C\cap [\frac23,1]).

Note that C\cap[0,\frac13] and C\cap[\frac23,1] are isometric to \frac{C}{3}. So a covering of C with diameter less than d induces a covering of C/3 with diameter less than d/3, and the s-dimensional volume of the covering sets will be scaled by a factor of 3^{-s}. We conclude that

H^s(C)=2H^s(\tfrac{C}{3})=\frac{2H^s(C)}{3^s}.

Granted that 0<H^s(C)<\infty at the critical dimension, this forces

3^s=2,\quad\text{i.e.}\quad s=\frac{\log 2}{\log 3},

giving us an example of a non-integral Hausdorff dimension. It turns out that a space with any real Hausdorff dimension can be constructed in this way by adjusting the constants of self-similarity in the iterative construction.
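This calculation can be illustrated with a box-counting estimate, which for this self-similar set agrees with the Hausdorff dimension. A sketch: cover a deep level of the Cantor construction by intervals of side 3^{-m} and read off the exponent.

```python
from math import log

def cantor_endpoints(level):
    # integer left endpoints of the 2^level surviving intervals, in units of 3^-level
    pts = [0]
    for _ in range(level):
        pts = [3 * p for p in pts] + [3 * p + 2 for p in pts]
    return pts

level = 12
pts = cantor_endpoints(level)
for m in range(1, 9):
    boxes = len({p // 3 ** (level - m) for p in pts})   # occupied boxes of side 3^-m
    print(m, log(boxes) / log(3 ** m))                  # ~0.6309 = log 2 / log 3
```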

Unsurprisingly, once we are comfortable with objects which are self-similar, it makes sense to look at the Hausdorff dimension of random objects which are self-similar in distribution. The primary example is Brownian motion. Again, we might assume a priori that since Brownian motion is a function of a one-dimensional variable, typically time, it should have dimension one. In fact, recalling that the variation of BM is infinite, it is clear that we have the same situation as in the Koch snowflake example discussed above. Some of the tools involved to deal with the dimension of BM in the two cases d=1 and d\ge 2 are interesting and non-trivial, so I will postpone discussion of this until a possible further post.

Recent Progress and Gromov-Hausdorff Convergence

For the past few weeks I’ve been working on the problem of Cycle-Induced Forest Fires, which I’ve referred to in passing in some recent posts. The aim has been to find a non-contrived process which exhibits self-organised criticality, that is, where the process displays critical characteristics (scaling laws, multiple components at the largest order of magnitude) forever. Note that this is in contrast to the conventional Erdos-Renyi graph process, which is only critical at a single time n/2.

The conjecture is that the largest component in equilibrium typically has size on a scale of n^{2/3}. An argument based on the equilibrium proportion of isolated vertices gives an upper bound on this exponent. The working argument I have for the lower bound at the moment can comfortably fit on the back of a napkin, with perhaps some context provided verbally. Of course, the current full text is very much larger than that, mainly because the napkin would feature assertions like "event A happens at time O(n^\beta)"; whereas the more formal argument has to go like:

“With high probability as n\rightarrow\infty, event A happens between times n^{\beta-\epsilon},n^{\beta+\epsilon}, for any suitably small \epsilon>0. Furthermore, the probability that A happens after this upper threshold decays exponentially with n for fixed \epsilon, and the probability that A happens before the lower threshold is at most n^{-\epsilon}. Finally, this is under the implicit assumption that there will be no fragmentations before event A, and this holds with probability 1-o(1) etc.”

It’s got to the point where I’ve exhausted the canonical set of symbols for small quantities: \epsilon,\delta,(\eta ?).

This has been a very long way of setting up what was going to be my main point, which is that at many points during undergraduate mathematics, colleagues (and occasionally to be honest, probably myself too) have complained that they “don’t want to have anything to do with analysis. They just want to focus on algebra / number theory / statistics / fluids…” Anyway, the point of this ramble was that I think I’ve realised that it is very hard to think about any sort of open problem without engaging with the sort of ideas that a few years ago I would have thought of (and possibly dismissed) as ‘analysis’.

Much of my work on this problem has been rather from first principles, so I haven’t been thinking much about any neat, less elementary theory recently.

Ok, so on with the actual post now.

Last month I talked about local limits of graphs, which describe convergence in distribution of (local) neighbourhood structure about a ‘typical’ vertex. This is the correct context in which to make claims like “components of G(n,\frac{\lambda}{n}) look like Galton-Watson trees with offspring distribution \text{Po}(\lambda)“.

Even from this example, we can see a couple of drawbacks and omissions in this limiting picture. In the sub-critical regime, the G-W tree is almost surely finite, while the number of vertices in the graph tends to infinity. More concretely, the limit description only tells us about a single component. If we wanted to know about a second component, in this case it would be roughly independent of the first, and with the same distribution, but if we wanted to know about all the components at once, matters would get much more complicated.

Similarly, this local limit description isn’t particularly satisfactory in the supercritical regime. When the component in question is finite, the description is correct, but with high probability there is a giant component, and so the ‘typical’ vertex lies in the giant component with some positive probability. This is reflected by the fact that the G-W tree with supercritical offspring distribution is infinite with positive probability. However, the giant component does not look like a \text{Po}(\lambda) G-W tree: as we exhaust O(n) vertices, the offspring distribution decreases, in expectation at least. In fact, without the assumption that the giant component is with high probability unique (so that \frac{L_1}{n}\rightarrow 1-\mathbb{P}(|C(v)|<\infty)), we can’t even deduce the expected size of the giant component from the local limit result.

This is all unsurprising. By definition a local limit describes the structure near some vertex. How near? Well, finitely near. It can be arbitrarily large, but still finite, so in particular, the change in the offspring distribution after O(n) vertices as mentioned above will not be covered.

So, if we want to learn more about the global structure of a large discrete object, we need to consider a different type of limit. In particular, the limit will not necessarily be a graph. Rather than try to define a priori a ‘continuum’ version of a graph, it is sensible to generalise from the idea that a graph is a discrete object and instead consider it as a metric space.

In this article, I don’t want to spend much time at all thinking about how to encode a finite graph as a metric space. We have a natural notion of graph distance between vertices, and it is not hard to extend this to points on edges. Alternatively, for sparse graphs, we have an encoding through various functions, which live in some (metric) function space.

However, in general, the graph will be a metric object itself, rather than necessarily a subset of a global metric space. We will be interested in convergence, so we need a suitable style of convergence of different metric spaces.

The natural candidate for this is the Gromov-Hausdorff metric, and the corresponding Gromov-Hausdorff convergence.

The Hausdorff distance between two subsets X, Y of a metric space is defined as follows. Informally, we say that d_H(X,Y)<\epsilon if every point of X is within distance \epsilon of some point of Y, and vice versa, in the sense of the original metric. Formally

d_H(X,Y):=\max \{\sup_{x\in X}\inf_{y\in Y}d(x,y), \sup_{y\in Y}\inf_{x\in X}d(x,y)\}.
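For finite sets this formula can be transcribed directly. A minimal sketch in Python, with an invented configuration of points in the plane under the Euclidean metric:

```python
import math

def hausdorff(X, Y):
    """d_H for finite subsets of the plane: a direct transcription
    of the sup-inf formula above."""
    sup_x = max(min(math.dist(x, y) for y in Y) for x in X)
    sup_y = max(min(math.dist(x, y) for x in X) for y in Y)
    return max(sup_x, sup_y)

X = [(0, 0), (1, 0)]
Y = [(0, 0), (1, 0), (0.5, 2)]
print(hausdorff(X, Y))   # ~2.06: every point of X is near Y, but the
                         # extra point (0.5, 2) of Y is far from X
```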

It is not particularly illuminating to prove that this is in fact a metric. Indeed, as the definition stands it is not quite a metric, but rather a pseudo-metric, which is exactly the same except that we may have d_H(X,Y)=0 for X\neq Y. Note that

d_H(X^\circ,\bar{X})=0,

for any set X, so this gives an example, provided X is not both open and closed. Furthermore, if the underlying metric space is unbounded, then the Hausdorff distance between two sets might be infinite. For example in \mathbb{R},

d_H(\mathbb{R}_{<0},\mathbb{R}_{>0})=\infty.

We can overcome this pair of objections by restricting attention to closed, bounded sets. In practice, many of the spaces under consideration will be real in flavour, that is, subsets of some \mathbb{R}^d, where closed and bounded together are equivalent to compact, so it makes sense to phrase the definition for compact sets when appropriate.

But this still leaves the underlying problem, which is how to define a distance function on metric spaces themselves. If two metric spaces X and Y were both subspaces of some larger metric space, then it would be easy, as we could use the Hausdorff distance. So this is in fact how we proceed in general. We don’t need any knowledge of this ambient space a priori; we simply take the infimum of the resulting Hausdorff distances over all possible embeddings. That is

d_{GH}(X,Y)=\inf\{d_H(\phi(X),\psi(Y))\},

where the infimum is taken over all metric spaces (E,d), and isometric embeddings \phi: X\rightarrow E, \psi: Y\rightarrow E.

The first observation is that this will again be a pseudometric unless we demand that X, Y be compact. The second is that the index set here is not a set but a proper class. Fortunately, this is quickly rectified: instead consider all metrics on the disjoint union of the underlying sets of X and Y, which is a set, and within it the subset of those metrics which restrict to the correct metric on each of X and Y. It can be checked that this gives a true metric on the set of compact metric spaces up to isometry.

We have an alternative characterisation. Given compact spaces X and Y, a correspondence between X and Y is a subset of X\times Y such that both projection maps are surjective. That is, for any x\in X there is some pair (x,y) in the correspondence, and likewise for any y\in Y. Let \mathcal{C}(X,Y) be the set of such correspondences. We then define the distortion of a correspondence \mathcal{R} by:

\text{dis}(\mathcal{R}):=\sup\{|d_X(x_1,x_2)-d_Y(y_1,y_2)| \,:\, (x_1,y_1),(x_2,y_2)\in\mathcal{R}\}.

Then

d_{GH}(X,Y)=\frac{1}{2}\inf_{\mathcal{R}\in\mathcal{C}(X,Y)}\text{dis}(\mathcal{R}).

In particular, this gives another reason why we don’t have to worry about taking an infimum over a proper class.
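For tiny finite metric spaces we can even compute d_{GH} exactly from this characterisation, by brute force over all correspondences. A sketch (the two example spaces are invented for illustration; the enumeration is exponential in |X||Y|, so this is strictly a toy):

```python
from itertools import combinations

def d_gh(dX, dY):
    """Brute-force d_GH for finite metric spaces given as distance
    matrices: minimise distortion over all correspondences, then halve."""
    nX, nY = len(dX), len(dY)
    pairs = [(i, j) for i in range(nX) for j in range(nY)]
    best = float("inf")
    for k in range(max(nX, nY), len(pairs) + 1):
        for R in combinations(pairs, k):
            if {i for i, _ in R} != set(range(nX)):
                continue                  # projection to X not surjective
            if {j for _, j in R} != set(range(nY)):
                continue                  # projection to Y not surjective
            dis = max(abs(dX[i1][i2] - dY[j1][j2])
                      for (i1, j1) in R for (i2, j2) in R)
            best = min(best, dis)
    return best / 2.0

two_points = [[0, 1], [1, 0]]             # two points at distance 1
path = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]  # the path 0 - 1 - 2
print(d_gh(two_points, path))             # 0.5, matching the general
                                          # lower bound |diam X - diam Y|/2
```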

Gromov-Hausdorff convergence then has the natural definition: X_n\stackrel{GH}{\rightarrow} X if d_{GH}(X_n,X)\rightarrow 0. Note that this does not respect topological equivalence, ie homeomorphism. For example,

\overline{B(0,\frac{1}{n})}\stackrel{GH}{\rightarrow} \{0\},

where the latter carries the trivial metric. In particular, although all the closed balls are homeomorphic to each other, the G-H limit is not homeomorphic to them.

A final remark is that the trees we might be looking at are not necessarily compact, so it is useful to have a notion of how this might be extended to non-compact spaces. The answer is to borrow the idea from local limits of considering large finite balls around a fixed central point. In the case of trees, this is particularly well-motivated, as it is often quite natural to have a canonical choice for the ‘root’.

Then, with identified points p_n\in X_n, we say (X_n,p_n)\rightarrow (X,p) if for any R>0 the R-ball around p_n in X_n converges to the R-ball around p in X. Correspondingly, we adjust the definition of the distortion so that the infimum is taken over those correspondences which contain the pair of roots (p_n,p) as an element.

REFERENCES

This article was based on some lecture notes by Jean-Francois Le Gall from the Clay Institute Summer School which can be found on the author’s website here (about halfway down the page). This material is in chapter 3. I also used Nicolas Curien’s tutorials on this chapter to inform some of the examples. The resolution of the proper class problem was mentioned by several sources I examined. These notes by Jan Christina were among the best.

Nested Closed Intervals

The UK team for this year’s International Mathematical Olympiad in Santa Marta, Colombia, has just been selected. For details, see the BMOC website.

During the selection camp, which was hosted at Oundle School near Peterborough, we spent a while discussing analytic questions that typically lie outside the scope of the olympiad syllabus. Incorrect treatment of, for example, the exact conditions for a stationary point to be a global maximum is likely to incur very heavy penalties if a candidate attempts a solution using, say, Lagrange multipliers. As a result we face a dilemma about how much analysis to teach during the training process: we want the students to be able to use sophisticated methods if necessary, but we don’t want to spoil the experience of learning this theory in a serious, step-by-step manner as first-year undergraduates.

This post does not present a solution to this dilemma. Rather, I want to discuss one question that arose on the last day of exams. Because the statement of the question is currently classified, I will have to be oblique in discussion of the solution, but this shouldn’t distract from the maths I actually want to talk about.

The setup is as follows. We have a sequence of nested closed intervals in the reals, that is:

[a_1,b_1]\supset [a_2,b_2]\supset [a_3,b_3]\supset\ldots

We want to demonstrate that there is some real number that lies in all of the intervals, that is \cap_{n\geq 1}[a_n,b_n]\neq \varnothing. This feels intuitively obvious, but some form of proof is required.

First, what does closed mean? Well a closed interval is a closed set, but what about, for example, [a,\infty)\subset\mathbb{R}? It turns out that it is most convenient to define an open set, and then take a closed set to be the complement of an open set.

The best way of thinking about an open set is to say that it does not contain its boundary. This is certainly the case for an open interval or an open ball. It is not immediately clear how to extend this to a general definition. But note that if no point of the set X lies on the boundary, then in all the natural examples there must be some positive distance between any point x\in X and the boundary. So in particular there is a small open ball centred on x that is entirely contained within X. This is then the definition of an open set in a metric space (in particular in some \mathbb{R}^d).

Note that it is not a sensible definition to say that a set is closed if around each of its points there is a closed ball contained in the set. Any open set has this property too! For if there is an open ball of radius R around a point x, then there is a closed ball of radius R/2 around that same point. So we really do have to say that a set is closed if its complement is open. Note that in \mathbb{R}, a closed interval is closed, and a finite union of closed intervals is closed, but a countable union need not be, as:

(0,1]=\cup_{n\geq 1}[\frac{1}{n},1].

Now we know what a closed set is, we can start thinking about the question.

First we remark that the result is not true if we allow the intervals to be unbounded. Obviously \cap_n [n,\infty)=\varnothing. Note that even though these sets appear to have an open upper boundary, they are closed because the complement (-\infty,n) is open. This will not be a problem in our question, because everything is contained within the first interval [a_1,b_1] and so is bounded.

Secondly we remark that the result is not true if we move to a general host space. It makes sense, for instance, to consider open and closed sets in the rationals: the open ball of radius 1 around 1/2 is just the set of rationals strictly between -1/2 and 3/2, which we could write as (-\frac12,\frac32)\cap\mathbb{Q}. But then note that

\cap_{n\geq 1}[\pi-\frac{1}{n},\pi+\frac{1}{n}]\cap\mathbb{Q}

cannot contain any rational elements, so must be empty.

There was some talk that this result might in general be a consequence of the Baire Category Theorem. I think this is overkill. Firstly, there is a straightforward proof for the reals, which I will give at the end. Secondly, if anything it is a lemma required for the proof of BCT. The result is often called Cantor’s Lemma.

Lemma (Cantor): Let X be a complete metric space, and let F_1\supset F_2\supset\ldots be a nested sequence of non-empty closed subsets of X with \text{diam}(F_n)\rightarrow 0. Then there exists x\in X such that

\cap F_n=\{x\}.

Translation: for ‘complete metric space’ think ‘the reals’ or \mathbb{R}^d. The diameter is, unsurprisingly, the supremum of the distances between pairs of points in the set. For reasons that I won’t go into, the argument for the olympiad question gave \text{diam}(F_{n+1})\leq \frac12\text{diam}(F_n), so this certainly holds here.

Proof: With reference to AC if necessary, pick an element x_n\in F_n for each n. Note that by nesting, x_m\in F_n\;\forall m\geq n. As a result, for m>n the distance d(x_n,x_m)\leq \text{diam}(F_n), which tends to 0 as n grows, so the sequence (x_n) is Cauchy. The definition of completeness is precisely that such a sequence has a limit x\in X.

Informally, completeness says that if the points of a sequence get increasingly close together, they must tend towards a point of the space. This is why the result fails for the rationals: you can have a sequence of rationals that get very close together but approach a point not in the set, eg an irrational. We use the characterisation of closed sets in terms of sequences: if a sequence lies within a closed set, then so does its limit. This could only go wrong if we ‘leaked onto the boundary’ in the limit, but for a closed set, the boundary is in the set. This shows that x\in F_n for each n, and so x\in\cap_n F_n. But if there were another point in \cap_n F_n, the distance between them would be strictly positive, contradicting the claim that the diameter tends to 0. This ends the proof.

Olympiad-friendly version: I think the following works fine as a proof essentially free of topological definitions. Consider the sequence of left endpoints

a_1\leq a_2\leq a_3\leq \ldots <b_1.

This sequence is non-decreasing and bounded, so it has a well-defined limit. Why? Consider the supremum. We can’t exceed the sup, but we must eventually get arbitrarily close, by definition of supremum and because the sequence is non-decreasing. Call this limit a. Then do the same for the upper boundaries to get limit b.

If a>b, then there must be some n with a_n>b_n, which is absurd. So a\leq b, and the intersection is the non-empty interval [a,b]. In the setting of the olympiad question, consideration of the diameter again shows that this is in fact a single point.
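As a concrete instance of this mechanism (my own toy example, nothing to do with the classified question): bisection for \sqrt{2} produces exactly such a nested sequence of closed intervals, with non-decreasing left endpoints, non-increasing right endpoints, and diameters 2^{-n}\rightarrow 0.

```python
# Nested closed intervals [a, b] homing in on sqrt(2); after 40 halvings
# the diameter is 2^-40, so both endpoints agree to ~12 decimal places.
a, b = 1.0, 2.0
for _ in range(40):
    m = (a + b) / 2
    if m * m < 2:
        a = m       # sqrt(2) lies in the right half: raise the left endpoint
    else:
        b = m       # sqrt(2) lies in the left half: lower the right endpoint
print(a, b)
```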

Local Limits

In several previous posts, I have talked about scaling limits of various random graphs. Typically in this situation we are interested in convergence of large-scale properties of the graph as the size grows to some limit. These properties will normally be metric in flavour: diameter, component size and so on. To describe convergence of these properties, we divide by the relevant scale, which will often be some simple function of n. If we are looking to find an actual limit object, this is even more important. This is rather similar to describing properties of centred random walks. There, if we run the walk for time n, we have to rescale by \frac{1}{\sqrt{n}} to see the fluctuations on a finite positive scale.

One of the best examples is Aldous’ Continuum Random Tree, which we can view as the limit of a Galton-Watson tree conditioned to have total size n, as n tends to infinity. Because of the exploration process or contour process interpretation, where these functions behave rather like a random walk, the correct scaling in this context is again \frac{1}{\sqrt{n}}. The point about this convergence is that it is realised entirely as convergence of functions that represent the trees. For each finite n, the tree with n vertices is clearly a graph, but this is neither clear nor true for the limit object. It does indeed have no cycles, but, if nothing else, if the CRT were a graph it would have [0,1] as vertex set, and it would then be highly non-obvious how to define the edges.

Local limits aim to give convergence towards a (discrete) infinite graph. The sort of properties we are looking for are now local properties such as degrees and correlations of degrees. These don’t require knowledge of the whole graph, only of some finite subset. First consider the possibility that the sequence of deterministic graphs has the property:

G_1\leq G_2\leq G_3\leq\ldots

where \leq denotes an induced subgraph. Then it is relatively clear what the limit should be, as it is well-defined to take a union. This won’t work directly for a limit of random graphs, because the above relation in probability doesn’t even really make sense if we have a different probability space for each finite graph. This is a general clue that we should be looking to use convergence in distribution rather than anything stronger.

In the previous example, suppose the first finite graph G_1 consists of a single vertex v. If the limit graph (remember this is just the union, since that is well-defined) has bounded degrees, then there is some N such that G_N contains all the information we might want about the limiting neighbourhood of vertex v. That is, for any radius r, some larger G_N contains all the vertices and edges within distance r of our starting vertex v that appear in the limit graph.

This is all the motivation we require for a genuine definition. We will define our limit in terms of neighbourhoods, so we need some mechanism to choose the central vertex of such a neighbourhood. The answer is to consider rooted graphs, that is, graphs with an identified vertex. We can introduce randomness by specifying a random graph, or by giving a distribution for the choice of root. If G is finite, the canonical choice is to pick the root uniformly from the set of vertices. This isn’t an option for an infinite graph, so we instead consider a pair (G, p), where G is a (for now deterministic) graph and p is a probability measure on V(G).

We say that the limit of the finite graphs (G_n) is the random rooted infinite graph (G, p) if the neighbourhoods of G_n around a uniformly chosen vertex converge in distribution to the neighbourhoods of G around a p-distributed root. Formally, say (G_n,U_n)\stackrel{d}{\rightarrow} (G,p) if for all r>0 and for any finite rooted graph (H,w), the probability that (H,w) is isomorphic to the ball of radius r in G_n centred at the randomly chosen vertex U_n converges to the probability that (H,w) is isomorphic to the ball of radius r around v in (G,v), where v is distributed according to the measure p.

Informally, we might say that if we zoom in on an average vertex in G_n for large n, the neighbourhood looks the same as the neighbourhood around the root in (G, p). We now consider three examples.

1) When we talk about approximating the component of a typical vertex in a sparse Erdos-Renyi random graph by a \text{Po}(\lambda) branching process, this is exactly the sense of limit we mean. The approximation fails if we fix n and take the neighbourhood size very large (eg radius n), but for finite neighbourhoods, or any radius growing sufficiently slowly with n, the approximation is good. (A quick empirical check of this appears after these examples.)

2) To emphasise why rooting the finite graphs makes a difference, consider the full binary tree with n levels (so 2^n-1 vertices). If we fix the root, then the limit is the infinite-level binary tree, though this isn’t especially surprising or interesting.

Things get a bit more complicated if we root randomly. Remember that the motivation for random rooting is that we want to know the local structure around a vertex chosen at random in many applications. If we definitely know what vertex we are going to choose, we know the local structure a priori. Note that in an n-level binary tree, 2^{n-1} vertices are leaves, not counting the base of the tree, and 2^{n-2} are distance 1 from a leaf, and 2^{n-3} are distance 2 from a leaf and so on.

This gives us a precise description of the limiting local neighbourhood structure. The resulting limiting object is called the canopy tree. One picture of this can be found on page 6 of this paper. A verbal description is also possible. Consider the set of non-negative integers, arranged in the usual manner on the real line, with edges between adjacent elements. The distribution of the root will be supported on this set of vertices, with position corresponding to the distance from the leaves in the pre-limit graph. So we have mass 1/2 at 0, 1/4 at 1, 1/8 at 2 and so on. We then attach to each vertex k a full k-level binary tree. The resulting canopy tree looks like an infinite-level full binary tree viewed from the leaves, which is of course a reasonable heuristic, since that is where the mass is concentrated when we root uniformly. (This root distribution is verified numerically after example 3.)

3) In particular, the limit is not the infinite-level binary tree. The canopy tree and the infinite-level binary tree have qualitatively different properties: simple random walk on the canopy tree is recurrent, for example. In fact, a result of Benjamini and Schramm, as explained in this review by Curien, says that any local limit of uniformly rooted planar graphs with uniformly bounded degrees is recurrent for SRW. The infinite-level binary tree can be expressed as a local limit if we choose the root distribution sensibly, using large random 3-regular graphs. The previous result does not apply because random 3-regular graphs are not almost surely planar.
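Both of the numerical claims above are easy to check empirically, using only the standard library. First, for example 1 (parameters invented for illustration): the mean size of the radius-2 ball around a uniform vertex of G(n,\frac{\lambda}{n}) should be close to the mean size of the first two generations of a \text{Po}(\lambda) Galton-Watson tree, namely 1+\lambda+\lambda^2.

```python
import math, random

def poisson(lam, rng):
    """Knuth's Poisson sampler; fine for small lambda."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

def ball2_gnp(n, lam, rng):
    """Size of the radius-2 ball around a uniform vertex of G(n, lam/n),
    revealing each potential edge lazily during a depth-2 BFS."""
    p, seen = lam / n, {rng.randrange(n)}
    frontier = list(seen)
    for _ in range(2):
        nxt = []
        for v in frontier:
            for u in range(n):
                if u not in seen and rng.random() < p:
                    seen.add(u)
                    nxt.append(u)
        frontier = nxt
    return len(seen)

def ball2_gw(lam, rng):
    """Number of vertices in generations 0-2 of a Po(lam) G-W tree."""
    z1 = poisson(lam, rng)
    return 1 + z1 + sum(poisson(lam, rng) for _ in range(z1))

rng = random.Random(42)
lam, n, samples = 1.5, 2000, 500
emp = sum(ball2_gnp(n, lam, rng) for _ in range(samples)) / samples
gw = sum(ball2_gw(lam, rng) for _ in range(samples)) / samples
print(emp, gw, 1 + lam + lam ** 2)   # all three should be close
```

Second, the root distribution of the canopy tree in example 2: in an n-level full binary tree, 2^{n-1-k} of the 2^n-1 vertices lie at distance k from the leaves, and this proportion tends to 2^{-(k+1)}, as the following trivial check confirms.

```python
n = 20
total = 2 ** n - 1                 # vertices in an n-level full binary tree
for k in range(5):
    print(k, 2 ** (n - 1 - k) / total, 2.0 ** -(k + 1))
```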

REFERENCES:

– Much of this article is a paraphrase of a section of Itai Benjamini’s mini-course at the DSSA in Haifa March 2013.

– As well as the review paper linked above, these notes by David Aldous were very useful.