# The reflection principle and conditioned RWs

I haven’t published a post about probability for far too long. Several are queued, so perhaps this will be the start of a deluge.

Anyway, with my advisor at Technion, I’m still working on some problems concerning Gaussian random walk subject to some conditioning which is complicated, but in practice (we hope) only mildly different to conditioning the walk to stay positive. Our conditioning at step n depends on some external randomness, but also on the future trajectory of the walk (related to the embedding of the walk in a 2D DGFF), thus ruining the possibility of applying the Markov property in any proof without significant preliminary work.

It seemed worth writing briefly about some of these results in a slightly simpler setting. The goal is to assemble many of the ingredients required to prove a local limit for Gaussian random walk conditioned to stay positive, in a sense which will be clarified towards the end. This is not the best way to go about getting scaling limits (as discussed briefly here, and for which see references [Ig74] and [Bo76]), and it’s probably not the best way to get local limits in the simplest setting, but it’s the method we are currently working to generalise, and follows the outline of [B-JD05], but in much less technical detail.

## Probabilities via the reflection principle

We start with Brownian motion. The reflection principle, as described briefly in this post from the depths of history, is a classical technique for studying the maximum of Brownian motion. Roughly speaking, we exploit the fact that $(W_t,t\ge 0)\stackrel{d}=(-W_t,t\ge 0)$, but we then apply this at the hitting time of a particular positive value, using the Strong Markov Property.

Let $S_t=\max_{0\le s\le t}W_s$ be the running maximum of the Brownian motion $W$ (also written $B$ in the displays below), and $\tau_b$ the hitting time of $b$. Then

$\mathbb{P}(S_t\ge b, B_t\le a)=\mathbb{P}(\tau_b\le t, B_t\le a),$

which, by SMP at $\tau_b$ and the reflection invariance of a standard BM, is equal to

$\mathbb{P}(\tau_b\le t, B_t\ge 2b-a)=\mathbb{P}(B_t\ge 2b-a).$

This obviously assumed $b\ge a$, but if we set $b=a$, we find

$\mathbb{P}(S_t\ge b)=\mathbb{P}(B_t>b)+\mathbb{P}(S_t\ge b,B_t\le b)=2\mathbb{P}(B_t\ge b).$

Or, in other words, $S_t\stackrel{d}=|B_t|$.
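For anyone who wants to see the reflection step in action without worrying about discretisation error, here is a minimal sketch for simple random walk $X$ rather than Brownian motion. The discrete analogue of the display above is $\mathbb{P}(\max_{k\le n}X_k\ge b)=\mathbb{P}(X_n\ge b)+\mathbb{P}(X_n>b)$ for integer $b\ge 1$. The function names and the choice $n=20$, $b=4$ are mine, purely for illustration.

```python
from math import comb

def prob_max_at_least(n, b):
    """Exact P(max_{k<=n} X_k >= b) for simple random walk X from 0 (needs b >= 1)."""
    # alive[s] = probability of being at s now without ever having reached level b.
    alive = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for s, p in alive.items():
            for step in (-1, 1):
                if s + step < b:  # paths that reach b are dropped from `alive`
                    nxt[s + step] = nxt.get(s + step, 0.0) + p / 2
        alive = nxt
    return 1.0 - sum(alive.values())

def prob_endpoint(n, b, strict):
    """Exact P(X_n > b) if strict, else P(X_n >= b)."""
    count = 0
    for ups in range(n + 1):
        end = 2 * ups - n
        if end > b or (end == b and not strict):
            count += comb(n, ups)
    return count / 2 ** n

n, b = 20, 4
lhs = prob_max_at_least(n, b)
rhs = prob_endpoint(n, b, strict=False) + prob_endpoint(n, b, strict=True)
print(lhs, rhs)  # the two values should agree
```

The first function never uses reflection, it just kills paths at level $b$, so agreement of the two numbers is a genuine check of the identity.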

While we can’t derive such nice equalities in distribution, the reflection principle is robust enough to treat more complicated settings, such as Brownian bridge.

We might want to ask about the maximum of a standard Brownian bridge, but we can be more general, and ask about the maximum of a Brownian bridge with drift (let’s say general bridge here). It’s important to remember that a general Brownian bridge has the same distribution as a linear transformation of a standard Brownian bridge. Everything is Gaussian, after all. So asking whether the maximum of a general Brownian bridge is less than a particular value is equivalent to asking whether a standard Brownian bridge lies below a fixed line. Wherever possible, we make such a transformation at the start and perform the simplest version of the required calculation.

So, suppose we have a bridge B from (0,0) to (t,a), and we want to study $\max_{s\in[0,t]} B_s$. Fix some $b>a$, and work with a standard Brownian motion $W_s$. By a similar argument to before,

$\mathbb{P}(\tau_b\le t, W_t\in[a,a+\mathrm{d}x]) = \mathbb{P}(W_t\in [2b-a-\mathrm{d}x,2b-a]) = \frac{\mathrm{d}x}{\sqrt{2\pi t}}e^{-(2b-a)^2/2t},$

and

$\mathbb{P}(W_t\in[a,a+\mathrm{d}x])=\frac{\mathrm{d}x}{\sqrt{2\pi t}}e^{-a^2/2t}.$

So

$\mathbb{P}(\max_{s\in[0,t]}B_s\ge b) = \exp\left(\frac{-(2b-a)^2 + a^2}{2t}\right).$
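As a sanity check on the bridge formula, note that the exponent simplifies to $-2b(b-a)/t$. The sketch below compares this with the exact analogue for a simple random walk bridge: reflecting at the first visit to $b$ shows that the number of $n$-step paths from 0 to $a$ which reach $b$ equals the number of $n$-step paths from 0 to $2b-a$. The simple random walk setting, the function names and the parameter values are my own choices, not part of the argument above.

```python
from itertools import accumulate, product
from math import comb, exp

def srw_bridge_hits(n, a, b):
    """P(n-step SRW bridge from 0 to a reaches b), for b >= max(a, 0) and n + a even."""
    return comb(n, (n + 2 * b - a) // 2) / comb(n, (n + a) // 2)

def brownian_bridge_hits(t, a, b):
    """P(Brownian bridge from (0,0) to (t,a) reaches b), for b >= max(a, 0)."""
    return exp(-2 * b * (b - a) / t)

def brute_force_count(n, a, b):
    """Count n-step +-1 paths from 0 to a that reach level b >= 1, by enumeration."""
    count = 0
    for steps in product((-1, 1), repeat=n):
        heights = list(accumulate(steps))
        if heights[-1] == a and max(heights) >= b:
            count += 1
    return count

# Small case: enumeration against the reflected count.
print(brute_force_count(12, 2, 4), comb(12, (12 + 2 * 4 - 2) // 2))
# Large case: discrete bridge against the Brownian formula with t = n.
print(srw_bridge_hits(10_000, 50, 100), brownian_bridge_hits(10_000, 50, 100))
```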

## Random walk conditioned to stay positive

Our main concern is conditioning to stay above zero. Let $\mathbb{P}_{0,x}^{t,y}$ be some complete if cumbersome notation for a Brownian bridge B from (0,x) to (t,y). Then another simple transformation of the previous result gives

$\mathbb{P}_{0,x}^{t,y}(B_s\ge 0,\,s\in[0,t])=1-\exp\left( \frac{-(x+y)^2 + (x-y)^2}{2t} \right)= 1-\exp\left(-\frac{2xy}{t}\right).$

Then, if $xy\ll t$, we can approximate this by $\frac{2xy}{t}$. (*)

Extend the notation so $\mathbb{P}_{0,x}$ describes Brownian motion started from (0,x). Then integrating over y gives

$\mathbb{P}_{0,x}(B_s\ge 0,\, s\in[0,t] ) \approx \frac{2x}{t}\mathbb{E}[B_t\vee 0] \approx \sqrt{\frac{2}{\pi}} \frac{x}{\sqrt{t}}.$

(It might appear that we have integrated the approximation (*) over parts of the range where it is not a valid approximation, but the density of $B_t=\Theta(t)$ vanishes exponentially fast, and so actually it’s not a problem.)
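To see how good the approximation is, one can compare with the exact answer. By the reflection principle ($S_t\stackrel{d}=|B_t|$ applied to the motion reflected about its starting point), Brownian motion started from $x$ stays non-negative on $[0,t]$ with probability $\mathbb{P}(|N(0,t)|\le x)$. The sketch below also integrates the exact bridge probability $1-e^{-2xy/t}$ against the Gaussian density of the endpoint. The function names, the midpoint rule and the values $x=1$, $t=100$ are my own choices.

```python
import math

def exact_survival(x, t):
    """P(BM started at x stays >= 0 on [0, t]) = P(|N(0, t)| <= x)."""
    return math.erf(x / math.sqrt(2 * t))

def small_x_approximation(x, t):
    """The leading-order expression sqrt(2/pi) * x / sqrt(t)."""
    return math.sqrt(2 / math.pi) * x / math.sqrt(t)

def integrated_bridge_formula(x, t, steps=20_000):
    """Midpoint-rule integral over y > 0 of (1 - exp(-2xy/t)) times the N(x, t) density."""
    upper = x + 12 * math.sqrt(t)  # the Gaussian tail beyond this point is negligible
    h = upper / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        density = math.exp(-(y - x) ** 2 / (2 * t)) / math.sqrt(2 * math.pi * t)
        total += (1 - math.exp(-2 * x * y / t)) * density
    return total * h

x, t = 1.0, 100.0
print(exact_survival(x, t), integrated_bridge_formula(x, t), small_x_approximation(x, t))
```

The first two numbers should agree up to integration error, and the third should be close to both whenever $x\ll\sqrt{t}$.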

We now want to extend this to random walks. Some remarks:

• We used the Gaussian property of Brownian motion fairly heavily throughout this derivation. In general random walks are not Gaussian, though we can make life easier by focusing on this case.
• We also used the continuity property of Brownian motion when we applied the reflection principle. For a general random walk, it’s hopeless to consider the hitting times of individual values. We have to consider instead the hitting times of regions $\tau_{(-\infty,b]}$, and so on. One can still apply SMP and a reflection principle, but this gives bounds rather than equalities. (The exception is simple random walk, for which other more combinatorial methods may be available anyway.)
• On the flip side, if we are interested in Brownian motion/bridge staying positive, we can’t start it from zero, as then the probability of this event is zero, by Blumenthal’s 0-1 law. By contrast, we can certainly ask about random walk staying positive when started from zero without taking a limit.

A useful technique will be the viewpoint of random walk as the values taken by Brownian motion at a sequence of stopping times. This Skorohod embedding is slightly less obvious when considering a general random walk bridge inside a general Brownian bridge, but is achievable. We want to study quantities like

$\mathbb{P}(S_k\ge 0,\, k=1,\ldots,n \big| S_0=x,S_n=y),$

where for simplicity let’s just take $(S_k,k\ge 0)$ to be a random walk with standard Gaussian increments. It’s possible we might want to take a scaling limit in x and y as functions of n. But first if we take x,y fixed, and embed the random walk bridge with these endpoints into the corresponding Brownian bridge with $t\approx n$, we are then faced with the question:

What’s the probability that the Brownian bridge goes below zero, but the embedded RW with n steps does not?

If the Brownian bridge conditioned to go below zero spends time $\Theta_p(n)$ below zero, then for large n it’s asymptotically very unlikely that the n places at which we embed the random walk avoid this set of intervals.

Several technical estimates are required to make this analysis rigorous. The conclusion is that there exists a function $f(x)$ for which $f(x)=x(1+o(1))$ as $x\rightarrow\infty$, such that

$q_n(x,y):=\mathbb{P}(S_k\ge 0,\, k=0,1,\ldots,n \,\big|\, S_0=x,S_n=y) \sim \frac{2f(x)f(y)}{n},$

$\text{and}\quad q_n(x):=\mathbb{P}(S_k\ge 0,\,k=0,1,\ldots,n\,\big|\, S_0=x)\sim \sqrt{\frac{2}{\pi}}\frac{f(x)}{\sqrt{n}}.$

As earlier, the second is obtained from the first by integrating over suitable y. This function $f$ has to account for the extra complications when either end-point is near zero, for which the event where the Brownian motion goes negative without the random walk going negative requires additional care.
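The $n^{-1/2}$ decay of $q_n(x)$ is easy to see numerically. Here is a rough Monte Carlo sketch for $x=0$ with standard Gaussian increments. The function name, seed and sample sizes are arbitrary choices of mine, and the output is an estimate rather than a proof. If $q_n(0)\sim cn^{-1/2}$, then quadrupling $n$ should roughly halve the probability.

```python
import random

def estimate_q(n, trials, rng):
    """Monte Carlo estimate of P(S_1 >= 0, ..., S_n >= 0 | S_0 = 0), Gaussian steps."""
    survived = 0
    for _ in range(trials):
        position = 0.0
        for _ in range(n):
            position += rng.gauss(0.0, 1.0)
            if position < 0:
                break  # the walk has gone negative, so stop this trial early
        else:
            survived += 1  # the inner loop finished without going negative
    return survived / trials

rng = random.Random(2024)
q_short = estimate_q(25, 50_000, rng)
q_long = estimate_q(100, 50_000, rng)
print(q_short, q_long, q_long / q_short)  # the ratio should be close to 1/2
```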

## Limits for the conditioned random walk

In the previous post on this topic, we addressed scaling limits in space and time for conditioned random walks. But we don’t have to look at the classical Donsker scaling to see the effects of conditioning to stay positive. In our setting, we are interested in studying the distribution of $S_m$ conditional on the event $(S_1\ge 0,S_2\ge 0,\ldots, S_n\ge 0)$, with limits taken in the order $n\rightarrow\infty$ and then $m\rightarrow\infty$.

(At a more general level, it is meaningful to describe the random walk conditioned on staying positive forever. Although this would a priori require conditioning on an event of probability zero, it can be handled formally as an example of an h-transform.)

As explained in that previous post, the scaling invariance of the Bessel process $W^+$ (which it’s not unreasonable to think of as ‘Brownian motion conditioned to stay non-negative’) suggests that this limit should exist, and be given by the entrance law of $W^+$. But this is hard to extract from a scaling limit.

However, we can use the previous estimate to start a direct calculation.

$\mathbb{P}(S_m\in \mathrm{d}y \,\big|\, S_k\ge 0,\, k=1,\ldots,n) = \frac{q_m(0,y) q_{n-m}(y) \mathbb{P}(S_m\in\mathrm{d}y)}{q_n(0)}.$

Here, we used the Markov property at time m to split the event that $S_m=y$ and the walk stays positive into two time-intervals. We will later take m large, so we may approximate as

$\frac{2f(0)f(y)/m \times \sqrt{\frac{2}{\pi}}f(y)/\sqrt{n-m}\times \mathbb{P}(S_m\in\mathrm{d}y) } { \sqrt{\frac{2}{\pi}}f(0)/\sqrt{n}}\stackrel{n\rightarrow\infty}=\frac{2f(y)^2}{m}\mathbb{P}(S_m\in\mathrm{d}y).$

This final probability emphasises that as $m\rightarrow\infty$ we only really have to consider $y=\Theta(\sqrt{m})$, so set $y=z\sqrt{m}$, and we obtain

$\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_m}{\sqrt{m}}\in\mathrm{d}z\,\big|\, S_k\ge 0,\,k=1,\ldots,n)$

$\sim \sqrt{m}\cdot\frac{2z^2m}{m}\cdot \frac{1}{\sqrt{2\pi}}\frac{1}{\sqrt{m}}e^{-z^2/2} = \sqrt{\frac{2}{\pi}}z^2 e^{-z^2/2}.$

This is precisely the entrance law of the 3-dimensional Bessel process, usually denoted $R$. This process is invariant under time-rescaling in the same fashion as Brownian motion. Indeed, one representation of R is as the radial part of a three-dimensional Brownian motion, given by independent BMs in each coordinate. (See [Pi75] for explanation of the relation to ‘BM conditioned to stay non-negative’.) We could complete the analogy by showing that $q_n(x,y)$ converges to the transition density of R as well. (Cf the prelude to Theorem 2.2 of [B-JD05].)
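As a quick numerical sanity check (my own, not part of the argument), the limiting density $\sqrt{2/\pi}\,z^2e^{-z^2/2}$ integrates to one and has second moment 3, as it must if it is the law of the norm of a standard Gaussian vector in three dimensions. The midpoint rule and the cutoff at $z=12$ are arbitrary choices.

```python
import math

def entrance_density(z):
    """Density sqrt(2/pi) * z^2 * exp(-z^2/2) on z >= 0."""
    return math.sqrt(2 / math.pi) * z * z * math.exp(-z * z / 2)

def integrate(f, a, b, steps=20_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

total_mass = integrate(entrance_density, 0.0, 12.0)
mean = integrate(lambda z: z * entrance_density(z), 0.0, 12.0)
second_moment = integrate(lambda z: z * z * entrance_density(z), 0.0, 12.0)
print(total_mass, mean, second_moment)
```

The second moment is the expected squared norm of three independent standard normals, which is where the 3 comes from.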

## Final remarks

The order of taking limits is truly crucial. We can also obtain a distributional scaling limit at time n under conditioning to stay non-negative up to time n. But then this is the size-biased normal distribution $\sim ze^{-z^2/2}$ (the Rayleigh distribution), rather than the square-size-biased normal distribution we saw in this setting. And we can exactly see why. Relative to the normal distribution which applies in the absence of conditioning, we require size-biasing to account for the walk staying positive up to time m, and then also size-biasing to account for the walk staying positive for the rest of time (or up to n in the $n\rightarrow\infty$ limit if you prefer).

The asymptotics for $q_n(x,y)$ were the crucial step, for which only heuristics are present in this post. It remains the case that estimates of this kind form the crucial step in other more exotic conditioning scenarios. This is immediately visible (even if the random walk notation is rather exotic) in, for example, Proposition 2.2 of [CHL17], of which we currently require a further level of generalisation.

## References

[Bo76] – Bolthausen – On a functional central limit theorem for random walks conditioned to stay positive

[B-JD05] – Bryn-Jones, Doney – A functional limit theorem for random walk conditioned to stay non-negative

[CHL17] – Cortines, Hartung, Louidor – The structure of extreme level sets in branching Brownian motion

[Ig74] – Iglehart – Functional central limit theorems for random walks conditioned to stay positive

[Pi75] – Pitman – One-dimensional Brownian motion and the three-dimensional Bessel process

# Kernels of critical graph components

This post is motivated by G(N,p), the classical Erdos-Renyi random graph, specifically its critical window, when $p=p(N)=\frac{1}{N}(1+\lambda N^{-1/3})$.

We start with the following observation, which makes no restriction on p. Suppose a component of G(N,p) is a tree. Then, the graph geometry of this component is that of a uniform random tree on the appropriate number of vertices. This is deliberately informal. To be formal, we’d have to say “condition on a particular subset of vertices forming a tree-component” and so on. But the formality is broadly irrelevant, because at the level of metric scaling limits, if we want to describe the structure of a tree component, it doesn’t matter whether it has $\log N$ or $\frac{1}{7}N$ vertices, because in both cases the tree structure is uniform. The only thing that changes is the scaling factor.

In general, when V vertices form a connected component of a graph with E edges, we define the excess to be E-V+1. So the excess is non-negative, and is zero precisely when the component is a tree. I’m reluctant to say that the excess counts the number of cycles in the component, but certainly it quantifies the amount of cyclic structure present. We will sometimes, in a mild abuse of notation, talk about excess edges. But note that for a connected component with positive excess, there is a priori no way to select which edges would be the excess edges. In a graph process, or when there is some underlying exploration of the component, there sometimes might be a canonical way to classify the excess edges, though it’s worth remarking that the risk of size-biasing errors is always extremely high in this sort of situation.

Returning to the random graph process, as so often there are big changes around criticality. In the subcritical regime, the components are small, and most of them, even the largest with high probability, are trees. In the supercritical regime, the giant component has excess $\Theta(N)$, which is qualitatively very different.

It feels like every talk I’ve ever given has begun with an exposition of Aldous’s seminal paper [Al97] giving a distributional scaling limit of the sizes of critical components in the critical window, and a relation between the process on this time-scale and the multiplicative coalescent. And it remains relevant here, because the breadth-first exploration process can also be used to track the number of excess edges.

In a breadth-first exploration, we have a stack of vertices we are waiting to explore. We pick one and look at its neighbours restricted to the rest of the graph, that is without the vertices we have already fully explored, and also without the other vertices in the stack. That’s the easiest way to handle the total component size. But we can simultaneously track how many times we would have joined to a neighbour within the stack, which leads to an excess edge, and Aldous derives a joint distributional scaling limit for the sizes of the critical components and their excesses. (Note that in this case, there is a canonical notion of excess edge, but it depends not just on the graph structure, but also on the extra randomness of the ordering within the breadth-first search.)

Roughly speaking, we consider the reflected exploration process, and its scaling limit, which is a reflected parabolically-drifting Brownian motion (though the details of this are not important at this level of exposition, except that it’s a well-behaved non-negative process that hits zero often). The component sizes are given by the widths of the excursions above zero, scaled up by a factor $N^{2/3}$. Then conditional on the shape of the excursion, the excess is Poisson with parameter the area under the excursion, with no rescaling. That is, a critical component has $\Theta(1)$ excess.

So, with Aldous’s result in the background, when we ask about the metric structure of these critical components, we are really asking: “what does a uniformly-chosen connected component with fixed excess look like when the number of vertices grows?”

I’ll try to keep notation light, but let’s say T(n,k) is a uniform choice from connected graphs on n vertices with excess k.

[Note, the separation of N and n is deliberate, because in the critical window, the connected components have size $n = \Theta(N^{2/3})$, so I want to distinguish the two problems.]

In this post, we will mainly address the question: “what does the cycle structure of T(n,k) look like for large n?” When k=0, we have a uniform tree, and the convergence of this to the Brownian CRT is now well-known [CRT2, LeGall]. We hope for results with a similar flavour for positive excess k.

## 2-cores and kernels

First, we have to give a precise statement of what it means to study just the cycle structure of a connected component. From now on I will assume we are always working with a connected graph.

There are several equivalent definitions of the 2-core C(G) of a graph G:

• When the excess is positive, there are some cycles. The 2-core is the union of all edges which form part of some cycle, and any edges which lie on a path between two edges which both form part of some cycle.
• C(G) is the maximal induced subgraph where all degrees are at least two.
• If you remove all the leaves from the graph, then all the leaves from the remaining graph, and continue, the 2-core is the state you arrive at where there are no leaves.

It’s very helpful to think of the overall structure of the graph as consisting of the 2-core, with pendant trees ‘hanging off’ the 2-core. That is, we can view every vertex of the 2-core as the root of a (possibly size 1) tree. This is particularly clear if we remove all the edges of the 2-core from the graph. What remains is a forest, with one tree for each vertex of the 2-core.
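The leaf-stripping definition translates directly into code. Here is a minimal sketch, assuming a simple connected graph given as a list of edges. The function names and the example graph (two triangles joined by a path, with a pendant path attached) are mine.

```python
from collections import defaultdict

def excess(edges):
    """E - V + 1 for a connected graph given by its edge list."""
    vertices = {v for edge in edges for v in edge}
    return len(edges) - len(vertices) + 1

def two_core(edges):
    """Edges of the 2-core of a simple graph, found by repeatedly deleting leaves."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    leaves = [v for v in adj if len(adj[v]) == 1]
    while leaves:
        v = leaves.pop()
        if len(adj[v]) != 1:
            continue  # v has already lost its last edge
        (u,) = adj[v]  # the unique remaining neighbour of the leaf v
        adj[u].discard(v)
        adj[v].clear()
        if len(adj[u]) == 1:
            leaves.append(u)  # deleting v has turned u into a leaf
    return sorted((u, v) for u in adj for v in adj[u] if u < v)

graph = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (6, 7), (5, 7), (4, 8), (8, 9)]
core = two_core(graph)
print(core)
print(excess(graph), excess(core))  # stripping pendant trees preserves the excess
```

Each deleted leaf removes one vertex and one edge, which is why the excess of the core matches the excess of the whole component.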

In general, the k-core is the maximal induced subgraph where all degrees are at least k. The core is generally taken to be something rather different. For this post (and any immediate sequels) I will never refer to the k-core for k>2, and certainly not to the traditional core. So I write ‘core’ for ‘2-core’.

As you can see in the diagram, the core consists of lots of paths, and topologically, the lengths of these paths are redundant. So we will often consider instead the kernel, K(G), which is constructed by taking the core and contracting all the paths between vertices of degree greater than 2. The resulting graph has minimal degree at least three. So far we’ve made no comment about the simplicity of the original graphs, but certainly the kernel need not be simple. It will regularly have loops and multiple edges. The kernel of the graph and core in the previous diagram is therefore this:

## Kernels of critical components

To recap, we can deconstruct a connected graph as follows. It has a kernel, and each edge of the kernel is a path of some length in the core. The rest of the graph consists of trees hanging off from the core vertices.
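The contraction from core to kernel can be sketched in a few lines. This assumes the core is given as a list of edges and has excess at least 2, so that every degree-2 vertex lies on a path between vertices of higher degree. The function name and the example core (two triangles joined by a path of length two) are mine.

```python
from collections import Counter

def kernel(core_edges):
    """Contract every degree-2 vertex of the core; returns a multigraph as a sorted edge list."""
    edges = [tuple(sorted(e)) for e in core_edges]
    while True:
        degree = Counter()
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1  # a loop (u, u) contributes 2 to the degree of u
        target = next((v for v in degree
                       if degree[v] == 2 and (v, v) not in edges), None)
        if target is None:
            return sorted(edges)
        # The two edges at `target` lead to neighbours a and b (possibly equal).
        a, b = [u if v == target else v for u, v in edges if target in (u, v)]
        edges = [e for e in edges if target not in e]
        edges.append(tuple(sorted((a, b))))

core = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
print(kernel(core))  # a loop at 3, a loop at 5, and one edge joining them
```

In this example the kernel is the dumbbell on the two degree-3 vertices, and it is not a simple graph, as expected.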

For now, we ask about the distribution of the kernel of a T(n,k). You might notice that the case k=1 is slightly awkward, as when the core consists of a single cycle, it’s somewhat ambiguous how to define the kernel. Everything we do is easily fixable for k=1, but rather than carry separate cases, we handle the case $k\ge 2$.

We first observe that fixing k doesn’t confirm the number of vertices or edges in the kernel. For example, both of the following pictures could correspond to k=3:

However, with high probability the kernel is 3-regular, which suddenly makes the previous post relevant. As I said earlier, it can introduce size-biasing errors to add the excess edges one-at-a-time, but these should be constant factor errors, not scaling errors. So imagine the core of a large graph with excess k=2. For the sake of argument, assume the kernel has the dumbbell / handcuffs shape. Now add an extra edge somewhere. It’s asymptotically very unlikely that this is incident to one of the two vertices with degree three in the core. Note it would need to be incident to both to generate the right-hand picture above. Instead, the core will gain two new vertices of degree three.

Roughly equivalently, once the size of the core is fixed (and large) we have to make a uniform choice from connected graphs of this size where almost every vertex has degree 2, and $\Theta(1)$ of the rest have degree 3 or higher. But the sum of the degrees is fixed, because the excess is fixed. If there are n vertices in the core, then there are $\Theta(n)$ times as many graphs where all the vertices have degree 2 or 3 as graphs where a vertex has degree at least 4. Let’s state this formally.

Proposition: The kernel of a uniform graph with n vertices and excess $k\ge 2$ is, with high probability as $n\rightarrow\infty$, 3-regular.

This is proved rather more formally as part of Theorem 7 of [JKLP], essentially as a corollary after some very comprehensive generating function setup; and in [LPW] with a more direct computation.

In the previous post, we introduced the configuration model as a method for constructing regular graphs (or any graphs with fixed degree sequence). We observe that, conditional on the event that the resulting graph is simple, it is in fact uniformly-distributed among simple graphs. When the graph is allowed to be a multigraph, this is no longer true. However, as remarked in (1.1) of [JKLP], for most applications the configuration model measure on multigraphs is the most natural.

Given a 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges, and K a uniform choice from the configuration model with these parameters, we have

$\mathbb{P}\left( K \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1},$

where t(H) is the number of loops in H, and mult(e) the multiplicity of an edge e. This might seem initially counter-intuitive, because it looks like we are biasing against graphs with multiple edges, when perhaps our intuition is that because there are more ways to form a set of multiple edges we should bias in favour of it.

I think it’s most helpful to look at a diagram of a multigraph as shown, and ask how to assign stubs to edges. At a vertex with degree three, all stub assignments are different, that is 3!=6 possibilities. At the multiple edge, however, we care which stubs match with which stubs, but we don’t care about the order within the multi-edge. Alternatively, there are three choices of how to divide each vertex’s stubs into (2 for the multi-edge, 1 for the rest), and then two choices for how to match up the multi-edge stubs, ie 18 in total = 36/2, and a discount factor of 2.
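For k=2 the formula can be checked by brute force, since a 3-regular multigraph on two labelled vertices has only six stubs and hence 15 stub pairings. The sketch below enumerates them and compares the counts with the weights $\left(2^{t(H)}\prod \mathrm{mult}(e)!\right)^{-1}$. The function names are mine, and I am reading the product as running over distinct edges of H.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def pairings(stubs):
    """Yield every perfect matching of the list of stubs."""
    if not stubs:
        yield []
        return
    first, rest = stubs[0], stubs[1:]
    for i, partner in enumerate(rest):
        for tail in pairings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + tail

def to_multigraph(pairing, degree=3):
    """Stub s belongs to vertex s // degree; forget which stubs were used."""
    return tuple(sorted(tuple(sorted((a // degree, b // degree))) for a, b in pairing))

def weight(graph):
    """1 / (2^{t(H)} * product of mult(e)!), product over distinct edges of H."""
    multiplicity = Counter(graph)
    loops = sum(m for (u, v), m in multiplicity.items() if u == v)
    denominator = 2 ** loops
    for m in multiplicity.values():
        denominator *= factorial(m)
    return Fraction(1, denominator)

counts = Counter(to_multigraph(p) for p in pairings(list(range(6))))
for graph, count in sorted(counts.items()):
    print(graph, count, weight(graph))
```

Only two multigraphs appear, the dumbbell and the triple edge, and the ratio of their counts should match the ratio of their weights.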

We mention this because in fact K(T(n,k)) converges in distribution to this uniform configuration model. Once you know that K(T(n,k)) is with high probability 3-regular, then again it’s probably easiest to think about the core, indeed you might as well condition on its total size and number of degree 3 vertices. It’s then not hard to convince yourself that a uniform choice induces a uniform choice of kernel. Again, let’s state that as a proposition.

Proposition: For any 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges as before,

$\lim_{n\rightarrow\infty}\mathbb{P}\left( K(T(n,k)) \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1}.$

As we said before, the kernel describes the topology of the core. To reconstruct the graph, we need to know the lengths in the core, and then how to glue pendant trees onto the core. But this final stage depends on k only through the total length of paths in the core. Given that information, it’s a combinatorial problem, and while I’m not claiming it’s easy, it’s essentially the same as for the case with k=1, and is worth treating separately.

It is worth clarifying a couple of things first though. Even the outline of methods above relies on the fact that the size of the core diverges as n grows. Again, the heuristic is that up to size-biasing errors, T(n,k) looks like a uniform tree with some uniformly-chosen extra edges. But distances in T(n,k) scale like $n^{1/2}$ (and thus in critical components of G(N,p) scale like $N^{1/3}$). And the core will be roughly the set of edges on paths between the uniformly-chosen pairs of vertices, and so will also have length $\Theta(n^{1/2})$.

Once you have conditioned on the kernel structure, and the (large) number of internal vertices on paths in the core (ie the length of the core), it is natural that the assignment of the degree-2 vertices to core paths / kernel edges is uniform. A consequence of this is that if you record $(Y_1,\ldots,Y_m)$ the lengths of paths in the core, where m=3(k-1), then

$\frac{(Y_1,\ldots,Y_m)}{\sum Y_i} \stackrel{d}\rightarrow \mathrm{Dirichlet}(1,1,\ldots,1).$

This is stated formally as Corollary 7 b) of [ABG09]. It’s worth noting that this confirms that the lengths of core paths are bounded in probability away from zero after the appropriate rescaling. In seeking a metric scaling limit, this is convenient as it means there’s no danger that two of the degree-3 vertices end up in ‘the same place’ in the scaling limit object.
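One simple way to get a feel for the Dirichlet limit is to model the uniform assignment as a uniformly random way of writing the total core length as an ordered sum of m non-negative integers (stars and bars). This is a modelling assumption of mine for illustration, not the construction in [ABG09], and the function name, seed and sizes are arbitrary.

```python
import random

def uniform_path_lengths(total, m, rng):
    """Uniformly random way to write `total` as an ordered sum of m non-negative integers."""
    # Stars and bars: choose the positions of m - 1 bars among total + m - 1 slots.
    bars = sorted(rng.sample(range(total + m - 1), m - 1))
    cuts = [-1] + bars + [total + m - 1]
    return [cuts[i + 1] - cuts[i] - 1 for i in range(m)]

rng = random.Random(7)
k = 2
m = 3 * (k - 1)  # number of kernel edges
samples = [uniform_path_lengths(10_000, m, rng) for _ in range(2000)]
mean_first = sum(s[0] for s in samples) / (2000 * 10_000)
print(samples[0], mean_first)  # each rescaled length should average about 1/m
```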

To recap, the only missing ingredients now to give a complete limiting metric description of T(n,k) are 1) a distributional limit of the total core length; 2) some appropriate description of the set of pendant trees conditional on the size of the pendant forest. [ABG09] show the first of these. As remarked before, all the content of the second of these is encoded in the unicyclic k=1 case, which I have written about before, albeit slightly sketchily, here. (Note that in that post we get around size-biasing by counting a slightly different object, namely unicyclic graphs with an identified cyclic edge.)

However, [ABG09] also propose an alternative construction, which you can think of as glueing CRTs directly onto the stubs of the kernel (with the same distribution as before). The proof that this construction works isn’t as painful as one might fear, and allows a lot of the other metric distributional results to be read off as corollaries.

## References

[ABG09] – Addario-Berry, Broutin, Goldschmidt – Critical random graphs: limiting constructions and distributional properties

[CRT2] – Aldous – The continuum random tree: II

[Al97] – Aldous – Brownian excursions, critical random graphs and the multiplicative coalescent

[JKLP] – Janson, Knuth, Luczak, Pittel – The birth of the giant component

[LeGall] – Le Gall – Random trees and applications

[LPW] – Luczak, Pittel, Wierman – The structure of a random graph at the point of the phase transition

# Random transpositions

We study a procedure for generating a random sequence of permutations of [N]. We start with the identity permutation, and then in each step, we choose two elements uniformly at random, and swap them. We obtain a sequence of permutations, where each term is obtained from the previous one by multiplying by a uniformly-chosen transposition.

Some more formality and some technical remarks:

• This is a Markov chain, and as often with Markov chains, it would be better if it were aperiodic. As described, the chain will alternate between odd and even permutations. So we allow the two elements chosen to be the same. This laziness slows down the chain by a factor $(N-1)/N$, but removes periodicity. We will work over timescales where this adjustment makes no practical difference.
• Let $\tau_1,\tau_2,\ldots$ be the sequence of transpositions. We could define the sequence of permutations by $\pi_m= \tau_m\cdot\tau_{m-1}\cdot \ldots\cdot \tau_1$. I find it slightly more helpful to think of swapping the elements in places i and j, rather than the elements i and j themselves, and so I’ll use this language, for which $\pi_m = \tau_1\cdot \tau_2\cdot\ldots \cdot \tau_m$ is the appropriate description. Of course, transpositions and the identity are self-inverse permutations, so it makes no difference to anything we might discuss.
• You can view this as lazy random walk on the Cayley graph of $S_N$ generated by the set of transpositions. That is, the vertices of the graph are elements of $S_N$, and two are connected by an edge if one can be obtained from the other by multiplying by a transposition. Note this relation is symmetric. Hence random transposition random walk.
• Almost everything under discussion would work in continuous time too.
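The chain is straightforward to simulate. Here is a minimal sketch of the lazy version in the ‘places’ convention, where `perm[p]` is the label currently sitting in place `p`. The function names, the seed and the choice of N=10 are mine.

```python
import random

def random_transposition_walk(n, steps, rng):
    """Run the lazy random transposition walk on S_n, started from the identity."""
    perm = list(range(n))  # perm[p] = label currently in place p
    for _ in range(steps):
        i = rng.randrange(n)
        j = rng.randrange(n)  # i == j is allowed: this is the lazy step
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def cycle_lengths(perm):
    """Cycle lengths of the map p -> perm[p], in decreasing order."""
    seen = [False] * len(perm)
    lengths = []
    for start in range(len(perm)):
        length, v = 0, start
        while not seen[v]:
            seen[v] = True
            v = perm[v]
            length += 1
        if length:
            lengths.append(length)
    return sorted(lengths, reverse=True)

rng = random.Random(1)
perm = random_transposition_walk(10, 30, rng)
print(perm, cycle_lengths(perm))
```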

At a very general level, this sort of model is interesting because sometimes the only practical way to introduce ‘global randomness’ is repeatedly to apply ‘local randomness’. This is not the case for permutations – it is not hard to sample uniformly from $S_N$. But it is a tractable model in which to study relevant questions about generating randomness on a complicated set through iterated local operations.

Since it is a Markov chain with a straightforward invariant distribution, we can ask about the mixing time. That is, the correct scaling for the number of moves before the random permutation is close in distribution (say in the sense of total variation distance) to the equilibrium distribution. See this series of posts for an odd collection of background material on the topic. Diaconis and Shahshahani [DS81] give an analytic argument for mixing around $\frac{N\log N}{2}$ transpositions. Indeed they include a constant because it is a sharp cutoff, where the total variation distance drops from approximately 1 to approximately 0 in O(N) steps.

## Comparison with Erdos-Renyi random graph process

In the previous result, one might observe that $m=\frac{N\log N}{2}$ is also the threshold number of edges to guarantee connectivity of the Erdos-Renyi random graph G(N,m) with high probability [ER59]. Indeed, there is a sharp transition around this threshold in this setting too.

We explore this link further. We can construct a sequence of random graphs simultaneously with the random transposition random walk. When we multiply by transposition (i j), we add edge ij in the graph. Laziness of RTRW and the possibility of multiple edges mean this definition isn’t literally the same as the conventional definition of a discrete-time Erdos-Renyi random graph process, but again this is not a problem for any of the effects we seek to study.

The similarity between the constructions is clear. But what about the differences? For the RTRW, we need to track more information than the random graph. That is, we need to know what order the transpositions were added, rather than merely which edges were added. However, the trade-off is that a permutation is a simpler object than a graph in the following sense. A permutation can be described as a union of disjoint cycles. In an exchangeable setting, all the information about a random permutation is encoded in the lengths of these cycles. Whereas in a graph, geometry is important. It’s an elegant property of the Erdos-Renyi process that we can forget about the geometry and treat it as a process on component sizes (indeed, a multiplicative coalescent process), but there are other questions we might need to ask for which we do have to study the graph structure itself.

Within this analogy, unfortunately the word cycle means different things in the two different settings. In a permutation, a cycle is a directed orbit, while in a graph it has the usual definition. I’m going to write graph-cycle whenever relevant to avoid confusion.

A first observation is that, under this equivalence, the cycles of the permutation form a finer partition than the components of the graph. This is obvious. If we split the vertices into sets A and B, and there are no edges between them, then nothing in set A will ever get moved out of set A by a transposition. (Note that the slickness of this analogy is the advantage of viewing a transposition as swapping the elements in places i and j.)

However, we might then ask under what circumstances is a cycle of the permutation the same as a component of the graph (rather than a strict subset of it). A first answer is the following:

Lemma: [Den59] The permutation formed by a product of transpositions corresponding in any order to a tree in the graph has a single cycle.

We can treat this as a standalone problem and argue in the following predictable fashion. (Indeed, I was tempted to set this as a problem during selection for the UK team for IMO 2017 – it’s perfectly suitable in this context I think.) The first transposition corresponds to some edge say ab, and removing this edge divides the vertices into components $A \ni a, B\ni b$. Since no further transposition swaps between places in A and places in B, the final permutation maps a into B and b into A, and otherwise preserves A and B.

This argument extends to later transpositions too. Now, suppose there are multiple cycles. Colour one of them. So during the process, the coloured labels move around. At some point, we must swap a coloured label with an uncoloured label. Consider this edge, between places a and b as before, and indeed the same conclusion holds. WLOG we move the coloured label from a to b. But then at the end of the process (ie in the permutation) there are more coloured labels in B than initially. But the number of coloured labels should be the same, because they just cycle around in the final permutation.

We can learn a bit more by thinking about the action on cycles (in the permutation) of adding a transposition. In the following pair of diagrams, the black arrows represent the original permutation (note it’s not helpful to think of the directed edges as having anything to do with transpositions now), the dashed line represents a new transposition, and the new arrows describe the new permutation which results from this product.

It’s clear from this that adding a transposition between places corresponding to different cycles causes the cycles to merge, while adding a transposition between places already in the same cycle causes the cycle to split into two cycles. Furthermore, the sizes of the two cycles formed are determined by the distance in the cycle between the places defining the transposition.

This allows us to prove the lemma by adding the edges of the tree one-at-a-time and using induction. The inductive claim is that cycles of the permutation exactly correspond to components of the partially-built tree. Assuming this claim guarantees that the next step is definitely a merge, not a split (otherwise the edge corresponding to the next step would have to form a cycle). If all N-1 steps are merges, then the number of cycles is reduced by one on each step, and so the final permutation must be a single cycle.
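The lemma is also easy to check by brute force on a small example. The following is a minimal sketch: the helper names `apply_transpositions` and `cycle_count` and the particular tree are my own choices for illustration. It multiplies the transpositions for the edges of a tree on {0,…,5} in every possible order, each time swapping the elements in places i and j, and checks that the result always has a single cycle.

```python
from itertools import permutations

def apply_transpositions(n, edges):
    """Start from the identity arrangement and, for each edge (i, j) in turn,
    swap the elements currently sitting in places i and j."""
    arrangement = list(range(n))
    for i, j in edges:
        arrangement[i], arrangement[j] = arrangement[j], arrangement[i]
    return arrangement

def cycle_count(perm):
    """Number of cycles of the permutation k -> perm[k]."""
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            k = start
            while not seen[k]:
                seen[k] = True
                k = perm[k]
    return cycles

# An arbitrary tree on 6 vertices (5 edges, no graph-cycle).
tree_edges = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]

# Check all 5! orderings of the edges.
all_single = all(
    cycle_count(apply_transpositions(6, order)) == 1
    for order in permutations(tree_edges)
)
print(all_single)

# Adding one more edge creates a graph-cycle, so the last step must be a split.
print(cycle_count(apply_transpositions(6, tree_edges + [(0, 2)])))
```

The second check illustrates the merge/split dichotomy: once the tree is complete there is only one cycle left, so any further transposition can only split it.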

Uniform split-merge

This gives another framework for thinking about the RTRW itself, entirely in terms of cycle lengths as a partition of [N]. That is, given a partition, we choose a pair of parts in a size-biased way. If they are different, we merge them; and if it is the same part, with size k, we split it into two parts, with sizes chosen uniformly from { (1,k-1), (2,k-2), …, (k-1,1) }.

What’s nice about this is that it’s easy to generalise to real-valued partitions, eg of [0,1]. Given a partition of [0,1], we sample two IID U[0,1] random variables $U_1,U_2$. If these correspond to different parts, we replace these parts by a single part with size given by the sum. If these correspond to the same part, with size $\alpha$, we split this part into two parts with sizes $|U_1-U_2|$ and $\alpha - |U_1-U_2|$. This is equivalent in a distributional sense to sampling another U[0,1] variable U and replacing $\alpha$ with $(\alpha U, \alpha(1-U))$. We probably want our partition to live in $\ell^1_\searrow$, so we might have to reorder the parts afterwards too.
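A single step of these continuous dynamics can be sketched as follows. This is an illustration only: the function name `split_merge_step`, the list representation of the partition, and the convention of laying the parts end to end along [0,1] are my assumptions, not anything from the references.

```python
import random

def split_merge_step(parts, rng):
    """One uniform split-merge step on a partition of [0,1].

    `parts` is a list of part sizes summing to 1; the parts are laid end to
    end along [0,1] in the given order."""
    u1, u2 = rng.random(), rng.random()

    def locate(u):
        # Index of the part whose interval contains u.
        total = 0.0
        for idx, size in enumerate(parts):
            total += size
            if u < total:
                return idx
        return len(parts) - 1  # guard against floating-point rounding

    i, j = locate(u1), locate(u2)
    rest = [s for k, s in enumerate(parts) if k not in (i, j)]
    if i != j:
        new = [parts[i] + parts[j]]      # different parts: merge
    else:
        d = abs(u1 - u2)
        new = [d, parts[i] - d]          # same part: split
    # Reorder so that the partition stays in decreasing order.
    return sorted(rest + new, reverse=True)

rng = random.Random(0)
parts = [1.0]
for _ in range(1000):
    parts = split_merge_step(parts, rng)
print(len(parts), sum(parts))
```

Total mass is conserved at every step, and the final sort is the reordering needed to stay in $\ell^1_\searrow$.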

These uniform split-merge dynamics have a (unique) stationary distribution, the canonical Poisson-Dirichlet random partition, hereafter PD(0,1). This was first shown in [DMZZ04], and then in a framework more relevant to this post by Schramm [Sch08].

Conveniently, PD(0,1) is also the scaling limit of the cycle lengths in a uniform random permutation (scaled by N). The best way to see this is to start with the observation that the length of the cycle containing 1 in a permutation chosen uniformly from $S_N$ has the uniform distribution on {1,…,N}. This matches up well with the uniform stick-breaking construction of PD(0,1), though other arguments are available too. Excellent background on Poisson-Dirichlet distributions and this construction and equivalence can be found in Chapter 3 of Pitman’s comprehensive St. Flour notes [CSP]. Also see this post, and the links within, with the caveat that my understanding of the topic was somewhat shaky then (as presently, for now).

However, Schramm says slightly more than this. As the Erdos-Renyi graph passes criticality, there is a well-defined (and whp unique) giant component including $\Theta(N)$ vertices. It’s not clear that the corresponding permutation should have giant cycles. Indeed, whp the giant component has $\Theta(N)$ surplus edges, so the process of cycle lengths will have undergone $O(N)$ splits. Schramm shows that most of the labels within the giant component are contained in giant cycles in the permutation. Furthermore, the distribution of cycle lengths within the giant component, rescaled by the size of the giant component, converges in distribution to PD(0,1) at any supercritical time $\frac{(1+\epsilon)N}{2}$.

This is definitely surprising, since we already know that the whole permutation doesn’t look close to uniform until time $\frac{N\log N}{2}$. Essentially, even though the size of the giant component is non-constant (ie it’s gaining vertices), the uniform split-merge process is happening to the cycles within it at rate N. So heuristically, at the level of the largest cycles, at any supercritical time we have a non-trivial partition, so at any slightly later time (eg $\frac{(1+\epsilon/2)N}{2}$ and $\frac{(1+\epsilon)N}{2}$ ), mixing will have comfortably occurred, and so the distribution is close to PD(0,1).

This is explained very clearly in the introduction of [Ber10], in which the approach is extended to a random walk on $S_N$ driven by a uniform choice from any conjugacy class.

So this really does tell us how the global uniform randomness emerges. As the random graph process passes criticality, we have a positive mass of labels in a collection of giant cycles which are effectively a continuous-space uniform split-merge model near equilibrium (and thus with PD(0,1) marginals). The remaining cycles are small, corresponding to small trees which make up the remaining (subcritical by duality) components of the ER graph. These cycles slowly get absorbed into the giant cycles, but on a sufficiently slow timescale relative to the split-merge dynamics that we do not need to think of a separate split-merge-with-immigration model. Total variation distance on permutations does feel the final few fixed points (corresponding to isolated vertices in the graph), hence the sharp cutoff corresponding to the sharp transition in the number of isolated vertices.

References

[Ber10] – N. Berestycki – Emergence of giant cycles and slowdown transition in random transpositions and k-cycles. [arXiv version]

[CSP] – Pitman – Combinatorial stochastic processes. [pdf available]

[Den59] – Denes – The representation of a permutation as a product of a minimal number of transpositions, and its connection with the theory of graphs

[DS81] – Diaconis, Shahshahani – Generating a random permutation with random transpositions

[DMZZ04] – Diaconis, Mayer-Wolf, Zeitouni, Zerner – The Poisson-Dirichlet distribution is the unique invariant distribution for uniform split-merge transformations [link]

[ER59] – Erdos, Renyi – On random graphs I.

[Sch08] – Schramm – Compositions of random transpositions [book link]

# Generating uniform trees

A long time ago, I wrote quite a few things about uniform trees. That is, a uniform choice from the $n^{n-2}$ unrooted trees with vertex set [n]. This enumeration, normally called Cayley’s formula, has several elegant arguments, including the classical Prufer bijection. But making a uniform choice from a large set is awkward, and so we seek more probabilistic methods to sample such a tree, which might also give insight into the structure of a ‘typical’ uniform tree.

In another historic post, I talked about the Aldous-Broder algorithm. Here’s a quick summary. We run a random walk on the complete graph $K_n$ started from a uniformly-chosen vertex. Every time we arrive at a vertex we haven’t visited before, we record the edge just traversed. Eventually we have visited all n vertices, so have recorded n-1 edges. It’s easy enough to convince yourself that these n-1 edges form a tree (how could there be a cycle?) and a bit more complicated to decide that the distribution of this tree is uniform.

It’s worth noting that this algorithm works to construct a uniform spanning tree on any connected base graph.
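The summary above translates directly into a few lines of code. This is a minimal sketch for the complete graph only; the function name `aldous_broder_complete` and the labelling of vertices by 0,…,n-1 are my own conventions.

```python
import random

def aldous_broder_complete(n, rng):
    """Sample a spanning tree of K_n on vertices 0..n-1 by recording the edge
    along which the random walk first enters each new vertex."""
    current = rng.randrange(n)      # uniformly-chosen starting vertex
    visited = {current}
    edges = []
    while len(visited) < n:
        # One step of the walk on K_n: move to a uniform *other* vertex.
        nxt = rng.randrange(n - 1)
        if nxt >= current:
            nxt += 1
        if nxt not in visited:
            visited.add(nxt)
            edges.append((current, nxt))   # first-entrance edge
        current = nxt
    return edges

rng = random.Random(1)
edges = aldous_broder_complete(8, rng)
print(edges)
```

Each recorded edge joins a new vertex to an already-visited one, which is why no graph-cycle can appear. For a general connected base graph, the only change would be to replace the uniform jump by a step to a uniform neighbour of the current vertex.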

This post is about a few alternative constructions and interpretations of the uniform random tree. The first construction uses a Galton-Watson process. We take a Galton-Watson process where the offspring distribution is Poisson(1), and condition that the total population size is n. The resulting random tree has a root but no labels, however if we assign labels in [n] uniformly at random, the resulting rooted tree has the uniform distribution among rooted trees on [n].

Proof

This is all about moving from ordered trees to non-ordered trees. That is, when setting up a Galton-Watson tree, we distinguish between the following two trees, drawn extremely roughly in Paint:

That is, it matters which of the first-generation vertices have three children. Anyway, for such a (rooted) ordered tree T with n vertices, the probability that the Galton-Watson process ends up equal to T is

$\mathbb{P}(GW = T) = \prod_{v\in T} \frac{e^{-1}}{C(v)!} = e^{-n} \prod_{v\in T}\frac{1}{C(v)!},$

where $C(v)$ is the number of children of a vertex $v\in T$. Then, since $\mathbb{P}( |GW|=n )$ is a function of n, we find

$\mathbb{P}(GW=T \,\big|\, |GW|=n) = f(n)\prod_{v\in T} \frac{1}{C(v)!},$

where f(n) is a function of n alone (ie depends on T only through its size n).

But given an unordered rooted tree t, labelled by [n], there are $\prod_{v \in t} C(v)!$ ordered trees associated to t in the natural way. Furthermore, if we take the Poisson Galton-Watson tree conditioned to have total population size n, and label uniformly at random with [n], we obtain any one of these ordered trees with probability $\frac{f(n)}{n!} \prod_{v\in t} \frac{1}{C(v)!}$. So the probability that we have t after we forget about the ordering is $\frac{f(n)}{n!}$, which is a function of n alone, and so the distribution is uniform among the set of rooted unordered trees labelled by [n], exactly as required.

Heuristic for Poisson offspring distribution

In this proof, the fact that $\mathbb{P}(C(v)=k)\propto \frac{1}{k!}$ exactly balances the number of orderings of the k children explains why Poisson(1) works out. Indeed, you can see in the proof that Poisson(c) works equally well, though when $c\ne 1$, the event we are conditioning on (namely that the total population size is n) has probability decaying exponentially in n, whereas for c=1, the branching process is critical, and the probability decays polynomially.

We can provide independent motivation though, from the Aldous-Broder construction. Both the conditioned Galton-Watson construction and the A-B algorithm supply the tree with a root, so we’ll keep that, and look at the distribution of the degree of the root as constructed by A-B. Let $\rho=v_1,v_2,v_3,\ldots$ be the vertices [n], ordered by their discovery during the construction. Then $\rho$ is definitely connected by an edge to $v_2$, but thereafter it follows by an elementary check that the probability $\rho$ is connected to $v_m$ is $\frac{1}{n-1}$, independently across all m. In other words, the distribution of the degree of $\rho$ in the tree as constructed by A-B is

$1+ \mathrm{Bin}\left(n-2,\frac{1}{n-1}\right) \approx 1+\mathrm{Poisson}(1).$

Now, in the Galton-Watson process, conditioning the tree to have fixed, large size changes the offspring distribution of the root. Conveniently though, in a limiting sense it’s the same change as conditioning the tree to have size at least n. Since these events are monotone in n, it’s possible to take a limit of the conditioning events, and interpret the result as the Galton-Watson tree conditioned to survive. It’s a beautiful result that this interpretation can be formalised as a local limit. The limiting spine decomposition consists of an infinite spine, where the offspring distribution is a size-biased version of the original offspring distribution (and so in particular, always has at least one child) and where non-spine vertices have the original distribution.

In particular, the number of offspring of the root is size-biased, and it is well-known and not hard to check that size-biasing Poisson(c) gives 1+Poisson(c)! So in fact we have, in an appropriate limiting sense in both objects, a match between the degree distribution of the root in the uniform tree, and in the conditioned Galton-Watson tree.
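For completeness, the size-biasing check is a one-line computation. Writing $X\sim\mathrm{Poisson}(c)$ and $\hat X$ for its size-biased version (notation introduced here just for this check), we have for every $k\ge 1$:

$\mathbb{P}(\hat X=k) = \frac{k\,\mathbb{P}(X=k)}{\mathbb{E}[X]} = \frac{k}{c}\cdot\frac{e^{-c}c^k}{k!} = \frac{e^{-c}c^{k-1}}{(k-1)!} = \mathbb{P}(1+X=k).$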

This isn’t supposed to justify why a conditioned Galton-Watson tree is relevant a priori (especially the unconditional independence of degrees), but it does explain why Poisson offspring distributions are relevant.

Construction via G(N,p) and the random cluster model

The main reason uniform trees were important to my thesis was their appearance in the Erdos-Renyi random graph G(N,p). The probability that vertices {1, …, n} form a tree component in G(N,p) with some particular structure is

$p^{n-1} (1-p)^{\binom{n}{2}-(n-1)} \times (1-p)^{n(N-n)}.$

Here, the first two terms give the probability that the graph structure on {1, …, n} is correct, and the final term gives the probability of the (independent) event that these vertices are not connected to anything else in the graph. In particular, this has no dependence on the tree structure chosen on [n] (for example, whether it should be a path or a star – both examples of trees). So the conditional distribution is uniform among all trees.

If we work in some limiting regime, where $pn\rightarrow 0$ (for example if n is fixed and $p=\frac{1}{N}\rightarrow 0$), then we can get away asymptotically with less strong conditioning. Suppose we condition instead just that [n] form a component. Now, there are more ways to form a connected graph with one cycle on [n] than there are trees on [n], but the former all require an extra edge, and so the probability that a given one such tree-with-extra-edge appears as the restriction to [n] in G(N,p) is asymptotically negligible compared to the probability that the restriction to [n] of G(N,p) is a tree. Naturally, the local limit of components in G(N,c/N) is a Poisson(c) Galton-Watson branching process, and so this is all consistent with the original construction.

One slightly unsatisfying aspect to this construction is that we have to embed the tree on [n] within a much larger graph on [N] to see uniform trees. We can’t choose a scaling p=p(n) such that G(n,p) itself concentrates on trees. To guarantee connectivity with high probability, we need to take $p> \frac{\log n}{n}$, but by this threshold, the graph has (many) cycles with high probability.

At this PIMS summer school in Vancouver, one of the courses is focusing on lattice spin models, including the random cluster model, which we now briefly define. We start with some underlying graph G. From a physical motivation, we might take G to be $\mathbb{Z}^d$ or some finite subset of it, or a d-ary tree, or the complete graph $K_N$. As in classical bond percolation (note G(N,p) is bond percolation on $K_N$), a random subset of the edges of G are included, or declared open. The probability of a given configuration w, with e open edges is proportional to

$p^e (1-p)^{|E(G)| - e} q^{k(w)},$ (*)

where the edge-weight $p\in(0,1)$ as usual, and cluster weight $q\in (0,\infty)$, and $k(w)$ counts the number of connected components in configuration w. When q=1, we recover classical bond percolation (including G(N,p) ), while for q>1, this cluster-reweighting favours having more components, and q<1 favours fewer components. Note that in the case $q\ne 1$, the normalising constant (or partition function) of (*) is generally intractable to calculate explicitly.

As in the Erdos-Renyi graph, consider fixing the underlying graph G, and taking $p\rightarrow 0$, but also taking $\frac{q}{p}\rightarrow 0$. So the resulting graph asymptotically ‘wants to have as few edges as possible, but really wants to have as few components as possible’. In particular, 1) all spanning trees of G are equally likely; 2) any configuration with more than one component has asymptotically negligible probability relative to any tree; 3) any graph with a cycle has #components + #edges greater than that of a tree, and so has asymptotically negligible probability relative to any tree.

In other words, the limit of the distribution is the uniform spanning tree of G, and so this (like Aldous-Broder) is a substantial generalisation, which constructs the uniform random tree in the special case where $G=K_n$.
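Because the partition function is a finite sum, this limit can be checked by exhaustive enumeration on a small graph. The sketch below is my own illustration, with $G=K_4$ and hypothetical helper names `components` and `tree_mass`. It computes the total probability that the open edges form a spanning tree under (*), for a few pairs (p,q) with both p and q/p shrinking.

```python
from itertools import combinations

def components(n, edges):
    """Number of connected components of the graph on 0..n-1 with these edges."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    count = n
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            count -= 1
    return count

def tree_mass(n, graph_edges, p, q):
    """Probability under (*) that the open edges form a spanning tree,
    together with the set of distinct weights seen among spanning trees."""
    m = len(graph_edges)
    total = 0.0
    tree_total = 0.0
    tree_weights = set()
    for e in range(m + 1):
        for config in combinations(graph_edges, e):
            k = components(n, config)
            weight = p**e * (1 - p)**(m - e) * q**k
            total += weight
            if k == 1 and e == n - 1:   # connected with n-1 edges: a spanning tree
                tree_total += weight
                tree_weights.add(weight)
    return tree_total / total, tree_weights

K4 = list(combinations(range(4), 2))
for p, q in [(0.1, 0.01), (1e-2, 1e-4), (1e-3, 1e-6)]:
    mass, weights = tree_mass(4, K4, p, q)
    print(p, q, mass, len(weights))
```

Every spanning tree has the same number of edges and a single component, so all of them receive the same weight, which is point 1) above. The printed mass should approach 1 along this sequence, reflecting points 2) and 3).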

# Real Trees – Root Growth and Regrafting

Two weeks ago in our reading group meeting, Raphael told us about Chapter Five which introduces root growth and regrafting. One of the points of establishing the Gromov-Hausdorff topology in this book was to provide a more natural setting for a discussion of tree-valued processes. Indeed in what follows, one can imagine how to start the construction of a similar process for the excursions which can be used to encode real trees, involving cutting off sub-excursions above one-sided local minima, then glueing them back in elsewhere. But taking account of the equivalence structure will be challenging, and it is much nicer to be able to describe cutting a tree in two by removing a single point without having to worry about quotient maps.

We have seen in Chapter Two an example of a process defined on the family of rooted trees with n labelled vertices which has the uniform rooted tree as an invariant distribution. Given a rooted tree with root p, we choose uniformly at random a vertex p’ in [n] to be the new root. Then if p’=p we do nothing, otherwise we remove the unique first edge in the path from p’ to p, giving two trees. Adding an edge from p to p’ completes the step and gives a new tree with p’ as root. We might want to take a metric limit of these processes as n grows and see whether we end up with a stationary real tree-valued process whose marginals are the BCRT.
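In terms of data structures, one step of this chain is very cheap. Here is a minimal sketch, assuming the tree is stored as a parent array (my choice of representation, as is the name `root_move_step`). Removing the first edge on the path from p’ to p just detaches p’ from its parent, and the new edge makes the old root a child of p’.

```python
import random

def root_move_step(parent, root, rng):
    """One step of the chain on rooted trees with vertex set 0..n-1.

    `parent[v]` is the parent of v, with parent[root] = None. The list is
    updated in place and the new root is returned."""
    new_root = rng.randrange(len(parent))
    if new_root == root:
        return root
    # Cut the first edge on the path from new_root to root: this detaches
    # the subtree hanging from new_root.
    parent[new_root] = None
    # Add an edge between the old root and new_root, with new_root on top.
    parent[root] = new_root
    return new_root

rng = random.Random(2)
parent = [None, 0, 1, 2, 3, 4]   # a path 0-1-2-3-4-5 rooted at 0
root = 0
for _ in range(200):
    root = root_move_step(parent, root, rng)
print(root, parent)
```

All other parent pointers are untouched, since the two pieces keep their internal orientation when they are glued back together.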

To see non-trivial limiting behaviour, it is most interesting to consider the evolution of a particular subtree (which includes the root) through this process. If the vertex chosen for cutting lies in our observed subtree, then the subtree undergoes a prune and regraft operation. On the other hand, if the vertex chosen for cutting does not lie in the subtree, then we do not see any effect of the pruning, except the addition of a new vertex below the original root, which becomes the new root. So essentially, from the point of view of our observed subtree, the root is growing.

Now we can think about interpreting the dynamics of a natural limit process acting on real trees. The key idea is that we don’t change the set on which the tree is supported much, but instead just change the metric. In particular, we will keep the original tree, and add on length at unit rate. Of course, where this length gets added on entirely determines the metric structure of the tree, but that doesn’t stop us giving a simple ‘name’ for the extra length. If we consider a process $X^T$ starting from a particular finite subtree T, then at time t, the tree $X^T_t$ has vertex set $T \coprod (0,t]$. (Finite subtree here means that it has finite total length.)

Root regrafting should happen at a rate proportional to the total length of the current observed tree. This is reasonable since after all it is supported within a larger tree, so in the discrete case the probability of a prune-regrafting event happening within a given observed subtree is proportional to the number of vertices in that subtree, which scales naturally as length in the real tree limit. It turns out that to get unit rate root growth with $\Theta(1)$ rate prune-regrafting, we should consider subtrees of size $\sqrt{n}$ within a host tree of size n as $n\rightarrow\infty$. We also rescale the lengths by $\frac{1}{\sqrt{n}}$, and time by $\sqrt{n}$ so we actually see prune-regraft events.

Furthermore, if the subtree is pruned, the location of the pruning is chosen uniformly by length of the current observed subtree. So we can view the pruning process as being driven by a Poisson point process with intensity given by the instantaneous length measure of the tree, which at time t has vertex set $T\coprod (0,t]$. It will turn out to be consistent that there is a ‘piecewise isometry’ for want of a better phrase between the metric (and thus length measure) on $X^T_t$ and the canonical induced measure on $T\coprod (0,t]$, so we can describe the instances and locations of the pruning events via a pair of PPPs. The first is supported on $T \times [0,\infty)$, and the second on $\{(t,x): 0 \le x \le t\}$, since we only ‘notice’ pruning at the point labelled x if the pruning happens at some time t after x was created.

If we start from a compact tree T, then the total intensity of this pair is finite up to some time t, and so we have a countable sequence $\tau_0=0<\tau_1<\tau_2<\ldots$ of times for pruning events. It is easy to describe (but a bit messy to notate) the evolution of the metric between these pruning times. Essentially the distance between any pair of points in the observed tree at time $\tau_m$ with root $\rho_{\tau_m}$ is constant between times $\tau_m,\tau_{m+1}$, and new points are added so that the distance between $\rho_{\tau_m}$ and any new point $a\in(\tau_m,\tau_{m+1}]$ is $a-\tau_m$, and everything else follows from straightforward consideration of geodesics.

When a pruning event happens at point $x_m$ at time $\tau_m$, distances are preserved within the subtree above $x_m$ in $X^T_{\tau_m -}$, and within the rest of the tree. Again, an expression for the cross distances is straightforward but requires a volume of notation not ideally suited to this medium.

The natural thing to consider is the coupled processes started from different subtrees (again both must contain the original root) of the same host tree. Say $T^1,T^2\le T$, then it is relatively easy to check that $X^{T^1}_t,X^{T^2}_t \le X^T_t \,\forall t$, when we drive the processes by consistent coupled Poisson processes. Furthermore, it is genuinely obvious that the Hausdorff distance between $X^{T^1}_t,X^{T^2}_t$, here viewed as compact subsets of $(X^T_t, d^T_t)$, remains constant during the root growth phase.

Less obvious but more important is that the Hausdorff distance decreases during regrafting events. Suppose that just before a regrafting event, the two subtrees are T’ and T”, and the Hausdorff distance between them is $\epsilon$. This Hausdorff distance is with respect to the metric on the whole tree T. [Actually this is a mild abuse of notation – I’m now taking T to be the whole tree just before the regraft, rather than the tree at time 0.]

So for any $a\in T'$, we can choose $b\in T''$ such that $d_T(a,b)\le \epsilon$. This is preserved under the regraft unless the pruning point lies on the geodesic segment (in T) between a and b. But in that case, the distance between a and the pruning point is again at most $\epsilon$, and so after the regrafting, a is at most $\epsilon$ away from the new root, which is in both subtrees, and in particular the regrafted version of T”.

This is obviously a useful first step on the path to proving any kind of convergence result. There are some technicalities which we have skipped over. It is fairly natural that this leads to a Markov process when the original tree is finite, but it is less clear how to define these dynamics when the total tree length is infinite, as we don’t want regrafting events to be happening continuously unless we can bound their net effect in some sense.

Last week, Franz showed us how to introduce the BCRT into matters. Specifically, that the BCRT is the unique stationary distribution for this process. After a bit more work, the previous result says that for convergence properties it doesn’t matter too much what tree we start from, so it is fine to start from a single point. Then, the cut points and growth mechanism correspond very well to the Poisson line-breaking construction of the BCRT. With another ‘grand coupling’ we can indeed construct them simultaneously. Furthermore, we can show weak convergence of the discrete-world Markov chain tree algorithm to the process with these root growth with regrafting dynamics.

It does seem slightly counter-intuitive that a process defined on the whole of the discrete tree converges to a process defined through subtrees. Evans remarks in the introduction to the chapter that this is a consequence of having limits described as compact real trees. Then limitingly almost all vertices are close to leaves, so in a Hausdorff sense, considering only $\sqrt{n}$ of the vertices (ie a subtree) doesn’t really make any difference after rescaling edge lengths. I feel I don’t understand exactly why it’s ok to take the limits in this order, but I can see why this might work after more checking.

Tomorrow, we will have our last session, probably discussing subtree prune-and-regraft, where the regrafting does not necessarily happen at the root.

# Gromov-Hausdorff Distance on Trees

This post continues the exposition of Gromov-Hausdorff distance, as introduced in Chapter Four of Steve Evans’ Probability and Real Trees, which we are reading as a group in the Stats Dept in Oxford at the moment. In this post, we consider applications of the Gromov-Hausdorff distance we have just introduced in the context of trees, viewed as metric spaces.

First we consider a direct application of the previous result, which related the Gromov-Hausdorff distance to the infimum of the distortion across the family of correspondences between the two relevant metric spaces. I found this as Corollary 3.7 in notes by Le Gall and Miermont on scaling limits of random trees and maps, which can be found here. I’m not clear whether there is an original source, but the result is simple enough that probably it does not matter hugely.

Proposition – Given f,g excursions above [0,1], and $T_f, T_g$ the real trees associated with these excursions in the standard way, then

$d_{GH}(T_f,T_g)\le 2 ||f-g||.$

Proof: We construct an appropriate correspondence

$\mathcal {R}=\left\{(a,b) \,:\, \exists t\in[0,1]\text{ s.t. } a=p_f(t), b=p_g(t)\right\}.$

In other words, the trees are defined as projections from [0,1] (with some equivalence structure), so taking pairs of projections from [0,1] gives a natural correspondence. Now, suppose $(p_f(s),p_g(s)), (p_f(t),p_g(t))\in \mathcal {R}$. Then

$d_{T_f}(p_f(s),p_f(t)) = f(s) + f(t) - 2\hat f(s,t),$

where $\hat f(s,t):= \min_{r\in[s,t]}f(r)$. Obviously, we can replace f by g to obtain

$d_{T_g}(p_g(s),p_g(t)) = g(s) + g(t) - 2\hat g(s,t).$

By thinking slightly carefully about where the functions achieve their minima compared with where the minimum in the sup norm is achieved, we can conclude that

$|d_{T_f}(p_f(s),p_f(t)) - d_{T_g}(p_g(s),p_g(t)) | \le 4||f-g||,$

and so the result follows from the relation between Gromov-Hausdorff distance and the infimum of distortions over the set of correspondences proved at the end of the previous post.
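The distortion bound can be sanity-checked numerically on a grid. The sketch below is only a discretised illustration: the two excursions f and g are arbitrary choices of mine, and the minimum over [s,t] is taken over grid points only. It computes all pairwise tree distances from the formula above and compares the largest discrepancy with four times the sup norm of f-g.

```python
import math

def tree_distances(h):
    """d[i][j] = h[i] + h[j] - 2 * min(h[i..j]) for grid indices i <= j."""
    n = len(h)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        running_min = h[i]
        for j in range(i, n):
            running_min = min(running_min, h[j])
            d[i][j] = h[i] + h[j] - 2 * running_min
    return d

n = 201
grid = [k / (n - 1) for k in range(n)]
# Two non-negative functions on [0,1] vanishing at the endpoints.
f = [math.sin(math.pi * t) for t in grid]
g = [math.sin(math.pi * t) + 0.2 * math.sin(3 * math.pi * t) ** 2 for t in grid]

sup_norm = max(abs(a - b) for a, b in zip(f, g))
df, dg = tree_distances(f), tree_distances(g)
# Distortion of the correspondence {(p_f(t), p_g(t))}, restricted to the grid.
distortion = max(
    abs(df[i][j] - dg[i][j]) for i in range(n) for j in range(i, n)
)
print(distortion, 4 * sup_norm, distortion <= 4 * sup_norm + 1e-12)
```

On the grid the bound holds for the same reason as in the continuum: each of the three terms f(s), f(t) and the minimum over [s,t] changes by at most the sup norm when f is replaced by g, and the minimum is counted twice.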

Gromov-Hausdorff Limits of Real Trees

If we are going to consider the Gromov-Hausdorff topology for limits of trees, we want to be sure that the limit of a sequence of real trees is another real tree. In particular, we want to show that this convergence preserves the property of being a geodesic space.

Theorem 4.19 – Given $X_n$ geodesic spaces, and $X$ complete, such that $X_n \stackrel{d_{GH}}\rightarrow X$, then X is geodesic.

Proof: Here, I’ll go in the opposite order to Evans’ book, as I think it’s easier to understand why a special case of this implies the whole result once you’ve actually shown that special case.

Anyway, we want to show that given $x,y\in X$, there is a geodesic segment $x\leftrightarrow y$ in X. We will start by showing that there is a well-defined midpoint of the geodesic, that is a point $z \in X$ such that $d(x,z)=d(y,z)=\frac12 d(x,y)$.

Given arbitrary $\epsilon>0$, we can take n such that $d_{GH}(X,X_n) < \frac{\epsilon}{3}$ and then a correspondence $\mathcal{R}$ between $X,X_n$ with $\mathrm{dis}\mathcal {R}<\frac{2\epsilon}{3}$. Now, by definition of correspondence, we have $x',y'\in X_n$ such that $(x,x'),(y,y')\in\mathcal{R}$. But we do have geodesics in $X_n$, so we can take $z'\in X_n$ the midpoint of geodesic $x'\leftrightarrow y'$. Predictably, we now take $z\in X$ such that $(z,z')\in\mathcal {R}$.

We can show that z is ‘almost’ the midpoint of [x,y] in the sense that

$|d(x,z) - \frac12 d(x,y)| \le |d(x',z')-\frac12 d(x',y')|+\frac32 \mathrm{dis}\mathcal{R} \le \epsilon.$

Similarly, we have $|d(y,z)-\frac12 d(x,y)|\le \epsilon$.

Perhaps it’s helpful to think of this point z that we’ve constructed as being like a taut, but not quite rigid string. The midpoint of the string has to be fairly near the midpoint of the endpoints, and in particular, as we let $\epsilon\rightarrow 0$, the z’s we deal with form a Cauchy sequence, and thus converge to some point, which (in a case of poor notation planning) we also call $z \in X$, which is the midpoint of $[x,y]$ as described before.

We can now iterate this construction, to demonstrate that whenever q is a dyadic rational in [0,1], there exists $z_q \in X$ such that

$d(x,z_q)=q d(x,y),\quad d(y,z_q) = (1-q)d(x,y).$

Again then, if we want the above to hold for some general real r in [0,1], we can approximate r arbitrarily well by dyadic rationals q, and the associated points $z_q$ are Cauchy and thus have a well-defined limit with the required properties. We thus have our geodesic segment $x\leftrightarrow y$.

Rooted Gromov-Hausdorff Distance

In the end, the trees we want to compare might be rooted. For example, we talk about finite trees being invariant under random re-rooting, and we might be interested in similar results for real trees, in particular the BCRT. So we need to compare metric spaces as viewed outwards from particular identified points of each space.

An isometry of rooted spaces must map the root to the root, and so we adjust to obtain rooted Gromov-Hausdorff distance. We might try to consider embeddings into a common space so that the roots are shared, but it will be more convenient to maintain the infimum over metrics on the disjoint union as before. But to ensure the roots are in roughly the ‘same’ place in both set embeddings, we minimise the maximum of the Hausdorff distance between the sets, and the distance between the images of the roots in the common covering space.

Similar results apply as in the unrooted case, and normally the proofs are very similar. As we might expect, when defining a correspondence between rooted spaces, we demand that the pair of roots is one of the pairs in the correspondence, and then the same equivalence between the minimal distortion and the rooted G-H distance applies.

Evans shows that $\mathbb{T}^{\mathrm{root}}$, the set of compact rooted trees, is complete and separable under the rooted G-H topology. Separability is relatively easy to see. Compact trees have finite $\epsilon$-nets, and there is a canonical way to view the net as the vertices of a finite tree with edge lengths. Approximating these edge lengths by rationals and considering the countable family of isomorphism classes of rooted finite trees gives separability.

If we want, we can also define k-pointed Gromov-Hausdorff distance, where we demand k points in each space are held fixed.

Tree $\eta$-erasure

To show that this is a natural topology to consider for the family of trees, Evans devotes a short section to the operation of $\eta$-erasure, where all points within $\eta$ of a leaf are removed from a given tree.

Formally, $R_\eta$ is a map $\mathbb{T}^{\mathrm{root}}\rightarrow \mathbb{T}^{\mathrm{root}}$ (recalling that these are compact real trees), so that $R_\eta(T)$ consists of the tree

$\{\rho\}\cup\{a \in T \, : \, \exists x\in T,\, a\in[\rho, x],d_T(a,x)\ge \eta\},$

rooted again at $\rho$. We claim that the range of $R_\eta$ is the set of compact rooted trees with a finite number of leaves. In this setting, we want a geodesic definition of leaf, for example a leaf is a point that doesn’t lie in the interior of the (unique) geodesic segment between any other point and the root.

Given a tree T with a finite number of leaves, we can glue disjoint segments of length $\eta$ onto every leaf. Taking $R_\eta$ of this deeper tree will give T. Similarly, suppose $R_\eta(T)$ has infinitely many leaves, which we can label $a_1,a_2,\ldots$. Thus we also have $x_1,x_2,\ldots \in T$ such that $a_i\in[\rho,x_i]$, and the segments $\{[a_i,x_i]\}$ are disjoint, and all have length at least $\eta$, hence T cannot be compact, as it has no finite $\frac{\eta}{2}$ net.
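For a finite tree with edge lengths, the erasure can be computed edge by edge. The following is only a sketch under assumptions that are mine rather than Evans's: the tree is stored as a hypothetical dictionary `child -> (parent, length)`, and a point at distance u below vertex c on the edge into c survives exactly when u plus the height of the subtree above c is at least $\eta$.

```python
def eta_erase(parent, eta):
    """Edge lengths of R_eta(T) for a finite rooted tree with edge lengths.

    parent: dict mapping each non-root vertex to (its parent, edge length).
    Returns a dict child -> remaining edge length; edges that are erased
    completely are omitted. (This representation is an assumption.)
    """
    children = {}
    for c, (p, _) in parent.items():
        children.setdefault(p, []).append(c)

    def height(v):
        # Greatest distance from v to any point of the subtree above v.
        return max(
            (parent[c][1] + height(c) for c in children.get(v, [])),
            default=0.0,
        )

    kept = {}
    for c, (p, length) in parent.items():
        # Points within (eta - height(c)) of c along this edge are erased.
        cut = max(0.0, eta - height(c))
        if length > cut:
            kept[c] = length - cut
    return kept


# Hypothetical example: root rho, a trunk to A, then two twigs B and C.
tree = {"A": ("rho", 2.0), "B": ("A", 1.0), "C": ("A", 0.3)}
print(eta_erase(tree, 0.5))
```

In this example the short twig C disappears entirely and the twig B is shortened by $\eta$, so the erased tree has a single leaf.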

It is clear that for fixed T, the family of these maps applied to T is continuous with respect to $\eta$ in the G-H topology. When $\eta$ is changed a small amount, the amount of extra tree removed is locally small, and so approximating correspondences by points in what is left is absolutely fine. Indeed, the operations $R_{\eta_1}, R_{\eta_2}$ commute, and compose to give $R_{\eta_1+\eta_2}$.

We want to show that for fixed $\eta$ this map $R_\eta$ is continuous with respect to G-H topology. Suppose two compact rooted trees S,T are covered by a rooted correspondence $\mathcal R$ with distortion $\epsilon \ll \eta$. We can’t immediately restrict $\mathcal R$ to $R_\eta(S)\times R_\eta(T)$, as it won’t necessarily be a surjection under the projection maps any more.

But we can get around this. Note that if $(x,y)\in\mathcal{R}$, then $|d_S(\rho_S,x) - d_T(\rho_T,y)| \le \epsilon$ by assumption. We will construct a correspondence $\mathcal R'$ on the erased trees as follows. Given $(x,y)\in\mathcal{R}$, if $(x,y)\in R_\eta(S)\times R_\eta(T)$, we keep it, and if $x\not\in R_\eta(S), y\not \in R_\eta(T)$, we throw it away. Suppose we have $(s,t) \in \mathcal{R}$ with $s\in R_\eta(S), t\not \in R_\eta(T)$. Let $l_s$ be a leaf of S such that $s \in [\rho_S,l_s]$, and let $\bar t$ be the farthest point from the root of the geodesic $[\rho_T,t]$ restricted to $R_\eta(T)$. So the tree above $\bar t$ includes t and has height at most $\eta$.

In words, if a point appears in a pair in the correspondence but is removed by the erasure, we replace it in the pair with the point closest to the original point that was not removed by the erasure.

It remains to prove that this works. My original proof was short but false, and its replacement is long (and I hope true now), but I will postpone writing this down either until another post, or indefinitely. The original proof by Evans and co-authors can be found as the main content of Lemma 2.6 in the original paper [1].

REFERENCES

[1] Evans, Pitman, Winter – Rayleigh processes, real trees, and root growth with regrafting.

# Gromov-Hausdorff Distance and Correspondences

This term, some members of the probability group have been reading Probability and Real Trees, by Steve Evans based on his Saint-Flour course in 2005. A pdf can be found here. This morning was my turn to present, and I gave a precis of Chapter 4, which concerns metrics on metric spaces, a family of tools which will be essential for later chapters which discuss convergence of trees, viewed as metric objects.

Hausdorff Distance

We start by considering a metric on subsets of a given base space X. The Hausdorff distance between two sets A, B is defined as

$d_H(A,B):=\inf\{ r>0: A\subset U_r(B), B\subset U_r(A)\},$

where $U_r(A):=\{x\in X, d(x,A)< r\}$ consists of set A, and all the points within r of set A. So the Hausdorff distance measures how much we have to fatten each set before it contains the other. Note that if we have a giant set next to a tiny one, we will have to fatten the tiny one a great deal more to achieve this. Sometimes it will be more helpful to think of the following alternative characterisation

$d_H(A,B) = \max\left \{ \sup_{a\in A}\inf_{b\in B} d(a,b), \sup_{b\in B}\inf_{a\in A} d(a,b)\right\}.$

In words, we measure how far away is the point in A farthest from B, and vice versa, and take the larger of the two. The presence of the sups and infs indicates that the inclusion or otherwise of the boundaries of A,B does not affect this distance. In particular, this means that $d_H(A,\bar A) = 0$ even when A is not closed, and so to allow us to call Hausdorff distance a metric, we restrict attention to the closed subsets of X, M(X).
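For finite sets the sups and infs are just maxima and minima, so the 'farthest point from the other set' description can be computed directly. Here is a minimal sketch, assuming finite non-empty point sets in the plane with the Euclidean metric; the function names are mine.

```python
import math


def hausdorff_distance(A, B, d):
    """Hausdorff distance between finite non-empty sets A and B under metric d."""
    # How far is the point of A that is farthest from B?
    a_to_b = max(min(d(a, b) for b in B) for a in A)
    # The same with the roles of A and B swapped.
    b_to_a = max(min(d(a, b) for a in A) for b in B)
    return max(a_to_b, b_to_a)


def euclid(p, q):
    return math.dist(p, q)


# A is contained in B, but B has one extra point far away from A.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
print(hausdorff_distance(A, B, euclid))
```

Here every point of A lies in B, so the first term vanishes, and the distance is entirely due to the extra point of B, which is the 'fatten the smaller set' effect described above.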

We also observe immediately that if X does not have bounded diameter, then it is possible for the Hausdorff distance between two sets to be infinite. We won’t worry about that for now, because we will mainly be considering host spaces which are compact, for which the following results will be useful.

Lemma 4.4 – If (X,d) is complete, then (M(X),d_H) is also complete.

Proof: Assume we have a sequence of closed sets $S_1,S_2,\ldots \subset X$ which have the Cauchy property under d_H. I’m going to do this slightly differently to Evans, because it’s not the case you can immediately choose $x_n\in S_n$ for each n, such that $d(x_n,x_m)\le d_H(S_n,S_m)$ for all m,n. For an explicit counterexample, see comment from Ed Crane below the article.

Note that if a subsequence of a Cauchy sequence converges to a limit, then the whole sequence converges to that same limit. So we can WLOG replace $S_1,S_2,\ldots$ by some subsequence such that $d_H(S_n,S_{n+1})\le 2^{-n}$. Now it is clear that for any $x_n\in S_n$, there is a choice of $x_{n+1}\in S_{n+1}$ such that $d(x_n,x_{n+1})\le 2^{-n}$ (*). Starting from arbitrary $x_1\in S_1$, we can construct in this manner a sequence $x_1,x_2,\ldots\in X$ that is Cauchy, and thus has a limit point $x\in X$.

Let $\mathcal{X}$ be the set of sequences $(x_m,x_{m+1},\ldots)$ for some m, with $x_n\in S_n\,\forall n\ge m$, satisfying (*). Now let S be the closure of the set of limit points of this family of sequences, which we have shown is non-empty.

Then for any n, and any $x_n\in S_n$, we can construct such a sequence, and its limit point x, and the triangle inequality down the path $x_n,x_{n+1},\ldots$ gives $d(x_n,S)\le 2^{-(n-1)}$. Furthermore, by construction $S\subset U_{2^{-(n-1)}}(S_n)$, hence it follows that $S_n \stackrel{d_H}\rightarrow S$.

Lemma 4.5 – Given (X,d) compact, (M(X),d_H) is also compact.

Sketch proof: We are working with metric spaces, for which the existence of a finite $\epsilon$-net for every $\epsilon>0$ is equivalent to compactness. [An $\epsilon$-net is a set S of points in the space such that every point of the space lies within $\epsilon$ of an element of S. Thinking another way, it is the set of centres of the balls in a finite covering of the space by $\epsilon$-balls.] It is not too hard to check that if $S_\epsilon$ is an $\epsilon$-net for (X,d), then $\mathcal{P}(S_\epsilon)$ is finite, and an $\epsilon$-net for (M(X),d_H).

Gromov-Hausdorff Distance

So far this is fine, but won’t be useful by itself for comparing how similar two trees are as metric spaces, because we can’t be sure a priori that we can embed them in a common host space. To resolve this, we consider instead the Gromov-Hausdorff distance, which will serve as a distance between metric spaces, even when they are not canonically defined as subsets of a common metric space.

Given X, Y metric spaces, we define

$d_{GH}(X,Y)=\inf_Z \left\{ d_H(X',Y') \, : \, X',Y' \subset (Z,d)\text{ a metric space }, X'\simeq X, Y'\simeq Y\right\}.$

In words, the Gromov-Hausdorff distance between two metric spaces examines the ways to embed them isometrically into a common larger metric space, and gives the minimal Hausdorff distance between them under the class of such embeddings. One issue is that the collection of all metric spaces is not a set. For example, given any set, we can define a metric via the discrete metric, so the collection of metric spaces is at least as large as the collection of all sets, which is not a set. Fortunately, all is not broken, as when we consider a general metric space Z in which we might embed copies of X and Y we are wasting lots of the perhaps very complicated space, because we only need to compare the subsets which are isometric copies of X and Y. So in fact, we lose nothing if we assume that Z is a disjoint union of copies of X and Y, with a metric chosen appropriately. So

$d_{GH}(X,Y)=\inf\left\{ d_H(X,Y) : d\text{ a metric on }X\coprod Y\text{ restricting to }d_X \text{ on }X,\, d_Y\text{ on }Y \right\}.$

In practice though, this is difficult to compute, since the set of things we have to minimize over is complicated. It turns out we can find an equivalent characterisation which will be easier to use in a number of examples, including the case of real trees which is the whole point of the course.

Correspondence and Distortion

We define a correspondence from X to Y to be

$\mathcal{R}\subset X\times Y\text{ s.t. } \pi_X(\mathcal {R}) = X, \, \pi_Y(\mathcal {R}) = Y,$

where $\pi_X,\pi_Y$ are the canonical projection maps from $X\times Y$ into $X,Y$ respectively. So we can think of a correspondence as being something like a matching from X to Y. In a matching, we insist that the projection maps into each set are injections, ie that each element of X (resp Y) can appear in at most one pair, whereas for a correspondence, we demand that the projection maps are surjections, ie that each element of X appears in at least one pair.

Then the distortion of a correspondence

$\mathrm{dis}(\mathcal{R}):= \sup\left\{ |d_X(x,x') - d_Y(y,y')| \,;\, (x,y),(x',y')\in \mathcal{R} \right\}.$

In words, if two spaces are not isometric, then a correspondence can’t describe an isometry between them, and the distortion is a measure of how far from being an isometry the correspondence is. That is, given a pair of pairs in the correspondence, for an isometry, the distance between the X-elements would be equal to the distance between the Y-elements, and the distortion measures the largest discrepancy between such pairs of pairs over the correspondence.

Theorem 4.11 $d_{GH}(X,Y) = \frac12 \inf_{\mathcal{R}} (\mathrm{dis}\mathcal R)$, where the infimum is taken over all correspondences $\mathcal{R}$ from X to Y.

Remark: The RHS of this equivalence can be thought of as the set coupling between X and Y such that the pairs have as equal distances as possible.

Proof: Given an embedding into $X\coprod Y$ with $d_H(X,Y)<r$, we have $\mathcal{R}$ with $\mathrm{dis}\mathcal {R}<2r$, by taking:

$\mathcal{R}=\{(x,y): d(x,y)<r\}.$

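From the definition of Hausdorff distance, it follows that for every x there is a y with d(x,y)<r, and vice versa, and hence the appropriate projection maps are surjections. The bound on the distortion follows from the triangle inequality, since $|d(x,x')-d(y,y')|\le d(x,y)+d(x',y')$.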

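So it remains to prove that $d_{GH}(X,Y)\le \frac12 \mathrm{dis}\mathcal{R}$ for any correspondence $\mathcal{R}$. Write $\mathrm{dis}\mathcal{R}=2r$. We can define a metric on $X\coprod Y$, restricting to $d_X$ and $d_Y$ on the two components, by setting for $x\in X, y\in Y$

$d(x,y)=\inf\left\{ d_X(x,x')+d_Y(y,y') + r \,:\, (x',y')\in \mathcal{R} \right\}.$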

Then for any $x\in X$, there is $(x,y)\in\mathcal{R}$, and thus $d(x,y)\le r$, and vice versa hence $d_H(X,Y)\le r$.

It only remains to check that this is actually a metric. Let’s take $x,\bar x \in X$, and so

$d(\bar x,y)\le \inf\{ d_X(x,\bar x) + d_X(x,x')+d_Y(y,y')+r \,: \, (x',y')\in\mathcal{R}\},$

so taking $d_X(x,\bar x)$ outside the brackets gives one form of the triangle inequality. We have to check the ‘other combination’ of the triangle inequality. We assume that the infima for $(x,y), (\bar x,y)$ are attained at $(x',y'),(\bar x',\bar y')$ respectively.

$d(x,y)+d(\bar x,y)= 2r+ d_X(x,x')+d_X(\bar x,\bar x') + d_Y(y,y')+d_Y(y,\bar y').$

But we also have $d_Y(y',\bar y')-d_X(x',\bar x')\ge -2r$ from the definition of distortion, and $d_Y(y,y')+d_Y(y,\bar y')\ge d_Y(y',\bar y')$, so combining these gives $d(x,y)+d(\bar x,y)\ge d_X(x,\bar x)$, which is the triangle inequality we want, and completes the proof of this theorem.
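For very small finite metric spaces, the characterisation in Theorem 4.11 can be checked by brute force, since there are only finitely many correspondences. The sketch below assumes each space is given as a symmetric distance matrix (a list of lists); the function names are hypothetical, and the search is exponential in $|X||Y|$, so it is only meant for toy examples.

```python
from itertools import product


def distortion(R, dX, dY):
    """Largest discrepancy |d_X(x,x') - d_Y(y,y')| over pairs of pairs in R."""
    return max(abs(dX[x][xp] - dY[y][yp]) for (x, y) in R for (xp, yp) in R)


def gh_distance(dX, dY):
    """Half the minimal distortion over all correspondences between X and Y."""
    nX, nY = len(dX), len(dY)
    pairs = list(product(range(nX), range(nY)))
    best = float("inf")
    for mask in range(1, 2 ** len(pairs)):
        R = [p for i, p in enumerate(pairs) if (mask >> i) & 1]
        # A correspondence must project onto all of X and all of Y.
        if {x for x, _ in R} != set(range(nX)):
            continue
        if {y for _, y in R} != set(range(nY)):
            continue
        best = min(best, distortion(R, dX, dY))
    return best / 2


point = [[0.0]]
two_points_near = [[0.0, 1.0], [1.0, 0.0]]
two_points_far = [[0.0, 3.0], [3.0, 0.0]]
print(gh_distance(point, two_points_near))
print(gh_distance(two_points_near, two_points_far))
```

In the first example there is only one correspondence, which pairs the single point with both points of the other space, so the distortion is the diameter of the two-point space.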

# Enumerating Forests

I’ve just got back from a visit to Budapest University of Technology, where it was very pleasant to be invited to give a talk, as well as continuing the discussion of our research programme with Balazs. My talk concerned a limit for the exploration process of an Erdos-Renyi random graph conditioned to have no cycles. Watch this space (hopefully very soon) for a fully rigorous account of this. In any case, my timings were not as slick as I would like, and I had to miss out a chunk I’d planned to say about a result of Britikov concerning enumerating unrooted forests. It therefore feels like an excellent time to write something again, and explain this paper, which you might be able to find here, if you have appropriate journal rights.

We want to calculate $a_{n,m}$, the number of forests with vertex set [n] consisting of m unrooted trees. Recall that if we were interested in rooted trees, we could appeal to Prufer codes to show that there are $m n^{n-m-1}$ such forests with a given set of m roots, and indeed results of Pitman give a coalescent/fragmentation scheme as m varies between 1 and n-1. It seems that there is no neat combinatorial re-interpretation of the unrooted case though, so Britikov uses an analytic method.

We know that

$a_{n,m}= \frac{n!}{m!} \sum_{\substack{k_1+\ldots+k_m=n\\ k_i\ge 1}} \prod_{j=1}^m \frac{k_j^{k_j-2}}{k_j!}.$

To see this, observe that the $k_j$s correspond to the sizes of the m trees in the forest; $\frac{n!}{\prod k_j!}$ gives the multinomial number of ways to assign vertices to the trees; given the labels for a tree of size $k_j$, there are $k_j^{k_j-2}$ ways to make up the tree itself; and $\frac{1}{m!}$ accounts for the fact that the trees have no order.
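As a sanity check on this formula, we can compare it with a direct enumeration for small n. The sketch below is my own and not taken from Britikov's paper: it evaluates the sum exactly with rational arithmetic, and separately counts acyclic edge sets of size n-m in the complete graph, which are exactly the forests with m components.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def compositions(n, m):
    """Yield all tuples (k_1, ..., k_m) of positive integers summing to n."""
    if m == 1:
        if n >= 1:
            yield (n,)
        return
    for first in range(1, n - m + 2):
        for rest in compositions(n - first, m - 1):
            yield (first,) + rest


def forests_formula(n, m):
    """a_{n,m} from the sum over tree sizes, in exact arithmetic."""
    total = Fraction(0)
    for ks in compositions(n, m):
        term = Fraction(1)
        for k in ks:
            term *= Fraction(k) ** (k - 2) / factorial(k)
        total += term
    value = Fraction(factorial(n), factorial(m)) * total
    assert value.denominator == 1
    return int(value)


def forests_brute(n, m):
    """Count acyclic sets of n-m edges on n labelled vertices."""
    edges = list(combinations(range(n), 2))
    count = 0
    for chosen in combinations(edges, n - m):
        root = list(range(n))  # union-find forest to detect cycles

        def find(v):
            while root[v] != v:
                v = root[v]
            return v

        acyclic = True
        for u, v in chosen:
            ru, rv = find(u), find(v)
            if ru == rv:
                acyclic = False
                break
            root[ru] = rv
        count += acyclic
    return count


print(forests_formula(4, 2), forests_brute(4, 2))
```

For n=4 and m=2, both counts can also be checked by hand: either an isolated vertex together with one of the three trees on the remaining vertices, or a perfect matching.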

What we would really like to do is to take the uniform distribution on the set of all labelled trees, then simulate m IID copies of this distribution, and condition the union to contain precisely n vertices. But obviously this is an infinite set, so we cannot choose uniformly from it. Instead, we can tilt so that large trees are unlikely. In particular, for each x we define

$\mathbb{P}(\xi=k) \propto \frac{k^{k-2} x^k}{k!}$,

and define the normalising constant

$B(x):= \sum_{k\ge 1} \frac{k^{k-2}x^k}{k!},$

whenever it exists. It turns out that $x\le e^{-1}$ is precisely the condition for $B(x)<\infty$. Note now that if $\xi_1,\xi_2,\ldots$ are IID copies of $\xi$, then

$\mathbb{P}(\xi_1+\ldots+\xi_m=n) = \frac{x^n}{B(x)^m} \sum_{k_1+\ldots + k_m=n} \prod_{j=1}^m \frac{k_j^{k_j-2}}{k_j!},$

and so we obtain

$a_{n,m}= \frac{n!}{m!} \frac{B(x)^m}{x^n} \mathbb{P}(\xi_1+\ldots + \xi_m=n).$

So asymptotics for $a_{n,m}$ might follow from limit theorems for sums of IID copies of this distribution $\xi$.

So far, we haven’t said anything about how to choose this value x. But observe that if you want to have lots of trees in the forest, then the individual trees should generally be small, so we take x small to tilt away from a preference for large trees. It turns out that there is a similar interpretation of criticality for forests as for general graphs, and taking x equal to 1/e, the radius of convergence, works well for this setting. If you want even fewer trees, there is no option to take x larger than 1/e, but instead one can use large deviations machinery rather than law of large numbers asymptotics.

We will be interested in asymptotics of the characteristic function of $\xi$ for x=1/e. In particular $\mathbb{E}[e^{it\xi}]=\frac{B(xe^{it})}{B(x)}$, and it will be enough to clarify the behaviour of this as $t\rightarrow 0$. It’s easier to work with a related analytic function

$\theta(x)=\sum_{k\ge 1} \frac{k^{k-1}x^k}{k!},$

ie $\theta(x)=xB'(x)$. What now feels like a long time ago I wrote a masters’ thesis on the subject of multiplicative coalescence, and this shows up as the generating function of the solutions to Smoluchowski’s equations with monodisperse initial conditions, which are themselves closely related to the Borel distributions. In any case, several of the early papers on this topic made progress by establishing that the radius of convergence is 1/e, and that $\theta(x)e^{-\theta(x)}=x$ wherever $|x|\le 1/e$. We want to consider x=1/e, for which $\theta=1$.

Note that $\mathbb{E}\xi = \frac{\theta(x)}{B(x)}$, so we will make progress by relating $B(x),\theta(x)$ in two ways. One way involves playing around with contour integrals in a fashion that is clear in print, but involves quite a lot of notation. The second way is the Renyi relation which asserts that $\theta(x)=B(x)+\frac{\theta(x)^2}{2}$. We will briefly give a combinatorial proof. Observe that after multiplying through by factorials and interpreting the square of a generating function, this is equivalent to

$k^{k-1} = k^{k-2} + \frac12 \sum_{\substack{l+m=k\\l,m\ge 1}} l^{l-1}m^{m-1}\binom{k}{l},$

for all k. As we might expect from the appearance of this equality, we can prove it using a bijection on trees. Obviously on the LHS we have the size of the set of rooted trees on [k]. Now consider the set of pairs of disjoint rooted trees with vertex set [k]. This second term on the RHS is clearly the size of this set. Given an element of this set, join up the two roots, and choose whichever root was not initially in the same tree as 1 to be the new root. We claim this gives a bijection between this set, and the set of rooted trees on [k], for which 1 is not the root. Given the latter, the only pair of trees that leads to the right rooted tree on [k] under this mapping is given by cutting off the unique edge incident to the root that separates the root and vertex 1. In particular, since there is a canonical bijection between rooted trees for which 1 is the root, and unrooted trees (!), we can conclude the Renyi relation.
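The counting identity behind the Renyi relation involves only integers, so it is easy to check exactly for small k. This is just a numerical sanity check of the displayed identity, not a substitute for the bijection; the function names are mine, and the identity is multiplied through by 2 to avoid fractions.

```python
from math import comb


def rooted_trees(k):
    # Number of rooted labelled trees on k vertices.
    return k ** (k - 1)


def unrooted_trees(k):
    # Cayley's formula; for k = 1 we read 1^{-1} as 1.
    return 1 if k == 1 else k ** (k - 2)


def ordered_pairs_of_rooted_trees(k):
    # Ordered pairs of disjoint rooted trees with sizes l and k - l covering [k].
    return sum(
        comb(k, l) * rooted_trees(l) * rooted_trees(k - l) for l in range(1, k)
    )


for k in range(1, 13):
    lhs = 2 * rooted_trees(k)
    rhs = 2 * unrooted_trees(k) + ordered_pairs_of_rooted_trees(k)
    assert lhs == rhs
print("identity holds for k = 1, ..., 12")
```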

The Renyi relation now gives $\mathbb{E}\xi = \frac{\theta(x)}{B(x)}=2$ when x=1/e. If we wanted, we could show that the variance is infinite, which is not completely surprising, as the parameter x lies on the radius of convergence of the generating function.

Now, playing around with contour integrals, and being careful about which strands to take leads to the asymptotic as $t\rightarrow 0$

$\mathbb{E}[ e^{it\xi}] = 1+2it + \frac{2}{3}i |2t|^{3/2} (i\mathrm{sign}(t))^{3/2} + o(|t|^{3/2}).$

So from this, we can show that the characteristic function of the rescaled centred partial sum $\frac{\xi_1+\ldots+\xi_N-2N}{bN^{2/3}}$ converges to $\exp(-|t|^{3/2}\exp(\frac{i\pi}{4}\mathrm{sign} t))$, where $b= (32/9)^{1/3}$ is a constant arising out of the previous step.

We recognise this as the characteristic function of the stable distribution with parameters 3/2 and -1. In particular, we know now that $\xi$ is in the domain of attraction for a stable-3/2 distribution. If we wanted a version of the central limit theorem for such partial sums, we could have that, but since we care about the partial sums of the $\xi_i$s taking a specific value, rather than a range of values on the scale of the fluctuations, we actually need a local limit theorem.

To make this clear, let’s return to the simplest example of the CLT, with some random variables with mean $\mu$ and variance $\sigma^2<\infty$. Then the partial sums satisfy

$\mathbb{P}(\mu N + a\sigma\sqrt{N} \le S_N \le \mu N+b\sigma\sqrt{N}) \rightarrow \int_a^b f_{\mathcal N}(x)dx,$

as $N\rightarrow\infty$. But what about the probability of $S_N$ taking a particular value m that lies between $\mu N+a\sigma \sqrt{N}$ and $\mu N + b\sigma \sqrt{N}$? If the underlying distribution was continuous, this would be uncontroversial – considering the probability of lying in a range that is smaller than the scale of the CLT can be shown in a similar way to the CLT itself. A local limit theorem asserts that when the underlying distribution is supported on some lattice, most naturally the integers, then these probabilities are in the limit roughly the same whenever m is close to $\mu N+a\sigma\sqrt{N}$.
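The simplest concrete instance is a sum of Bernoulli(1/2) variables, where the point probabilities are binomial and can be computed exactly. The following sketch (my own illustration, in the finite-variance setting rather than the stable one) compares $\sigma\sqrt{N}\,\mathbb{P}(S_N=m)$ with the standard normal density at the rescaled point.

```python
from math import comb, exp, pi, sqrt


def local_limit_gap(N, a):
    """Compare sigma*sqrt(N)*P(S_N = m) with the normal density at z.

    S_N is a sum of N Bernoulli(1/2) variables, and m is the integer
    nearest to mu*N + a*sigma*sqrt(N).
    """
    mu, sigma = 0.5, 0.5
    m = round(mu * N + a * sigma * sqrt(N))
    exact = comb(N, m) / 2 ** N          # exact binomial point probability
    z = (m - mu * N) / (sigma * sqrt(N))
    density = exp(-z * z / 2) / sqrt(2 * pi)
    return sigma * sqrt(N) * exact, density


for N in (100, 10_000):
    scaled, density = local_limit_gap(N, 1.0)
    print(N, scaled, density)
```

The two printed columns should get closer as N grows, which is what the local limit theorem asserts for this lattice distribution.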

In this setting, a result of Ibragimov and Linnik that I have struggled to find anywhere in print (especially in English) gives us local limit theory for integer-supported distributions in the domain of attraction of a stable distribution. Taking p( ) to be the density of this distribution, we obtain

$bm^{2/3}\mathbb{P}(\xi_1+\ldots+\xi_m=n) - p(\frac{n-2m}{b m^{2/3}}) \rightarrow 0$

as $n\rightarrow\infty$, uniformly on any set of m for which $z= \frac{n-2m}{bm^{2/3}}$ is bounded. Conveniently, the two occurrences of b clear, and Britikov obtains

$a_{n,m} = (1+o(1)) \frac{\sqrt{2\pi} n^{n-1/6}}{2^{n-m}(n-m)!} p\left(\frac{n-2m}{n^{2/3}}\right),$

uniformly in the same sense as before.

# Random Maps 3 – Leaves and Geodesics in BCRT

Recall in the previous two posts, we’ve introduced some of the background to maps on various surfaces. In particular, we’ve introduced the remarkable Cori-Vauquelin-Schaeffer bijection which maps between plane trees labelled with uniform increments and quadrangulations of the sphere, up to some careful fiddling around with rooting and pointing an edge.

We are interested in the case where we choose uniformly a large element from these classes. We want to derive a scaling limit for the uniform planar quadrangulation, and we hope that we will be able to carry some properties of the scaling limit of the labelled trees, which may well be simpler, across the CVS bijection. It is convenient that the vertices of the plane tree become the vertices of the quadrangulation. We are looking to find some sort of metric limit, in the Gromov-Hausdorff sense, and so it will remain to deduce exactly how to use the labelling obtained from the tree to gain information about distances in the (limiting) quadrangulation.

Of course, all of this relies on the fact that there is a nice limit for the ordered plane trees in the first place. Unsurprisingly, it turns out that this is Aldous’s Brownian continuum random tree. The easiest way to see this is to consider the contour process of the ordered plane tree. This is chosen uniformly at random from the set of paths from (0,0) to (2n,0) with increments of size {-1,1} and which stay non-negative. It is thus precisely a simple random walk started at (0,0) conditioned to hit (2n,0) and to be non-negative. Since SRW suitably rescaled converges to Brownian motion, it is unsurprising (but not totally trivial) that this conditioned object converges to a Brownian excursion.

The Brownian excursion can be viewed as a continuous analogue of the contour process for the BCRT, but it is more natural to consider this convergence in the Gromov-Hausdorff topology. In this setting, we say that for a large value of n, the tree is ‘roughly isometric’ to the BCRT in distribution. Here, roughly isometric means the two metric spaces can be embedded isometrically into a common metric space such that they are close together, now in the sense of Hausdorff distance.

At this point, it is worth thinking about this interpretation of the BCRT. We have previously considered this as the scaling limit of a uniformly chosen Cayley tree, that is any unrooted tree on n labelled vertices. Essentially, we are now specifying that the BCRT can carry extra information, namely a root, and geometric information about the order of branches. The root is uncontroversial. Canonically, the root of the BCRT will be at the point associated with time 0 in the driving Brownian excursion. However, we can easily check that the distribution of a uniform rooted plane tree is invariant under re-rooting, and so any argument we have for convergence of the rooted trees to the BCRT will work with the root in a different place. Applying something like a tower law, we conclude that the convergence works when the root is chosen uniformly in the limit.

One potential problem to be discussed is what it means to choose a point uniformly in the limit. We have two possible approaches. One is to consider Lebesgue measure on any path in the BCRT, and glue these together. However, we have a uniform stick-breaking construction of the BCRT, and one consequence of the construction is that the total length of sticks required is infinite, so this won’t work.

The other option is to project Lebesgue measure on [0,1] via the same map that sends points on the Brownian excursion to points in the tree. Note that the so-called real tree is constructed from the excursion by identifying points s and t where f(s)=f(t), and f(x)>f(s) for x in (s,t). But then we might wonder whether this can really be said to be ‘uniform’, since different points in the BCRT will have a different number of pre-images in [0,1]. In fact though, it turns out that in this sense, projected-Leb[0,1]-almost all the points in the BCRT are leaves.

To prove this, naturally we first need to define a leaf, in the setting of these continuum trees. The degree of a vertex is an idea we might keep in mind, but we can’t use this, as we don’t have vertices any more. However, we have a continuous analogue of degree, given by counting the number of connected components remaining after removing a vertex. In particular, we can define the set of leaves as

$\mathcal{L}(\mathcal{T}):=\{x\in\mathcal{T}:\mathcal{T}\backslash \{x\}\text{ is connected}\}.$

We will give a sketch proof of this result about leaves shortly. First, we clarify some notation, and consider properties of geodesics (shortest-length paths) in the tree.

Define $\check{f}(u,v):=\min_{x\in[u\wedge v,u\vee v]} f(x)$ to be the minimum value attained by f between u and v. Consider the value x at which this minimum is attained. Then, projecting onto the tree, p(x) is the ‘most recent common ancestor’ of points u and v. We can make this a bit more precise by considering geodesics in the tree starting at the root. Analogous to the unique path property in a discrete tree, in this continuous setting there is a unique path from the root to any given point, along which the height is strictly increasing. This is not surprising. It follows from one of the definitions of a real tree that the length of the path from p(0) to p(s) should be f(s), and so there is a unique isometric embedding of [0,f(s)] into $\mathcal{T}_f$ which starts at 0 and ends at p(s). Anyway, under this $p(\check{f}(s,t))$ gives the point at which the geodesics from p(s) to p(0) and from p(t) to p(0) meet.

Furthermore, we can now describe the distance in the tree between p(s) and p(t). This is given by

$d_f(s,t):= f(s)+f(t)-2\check{f}(s,t),$

and with the geodesic picture, it is easy to see why. Consider the point x at which $\check{f}(s,t)$ achieves the minimum. As we have said, this lies on the geodesics from p(s) to 0 and p(t) to 0, and paths between points are unique, so removing point x disconnects p(s) and p(t) in the tree. So we need to concatenate the geodesic from p(s) to p(x) and from p(x) to p(t). But these are subsets of the two geodesics discussed, and their respective lengths are $f(s)-\check{f}(s,t)$ and $f(t)-\check{f}(s,t)$.
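The same formula makes sense for the contour function of a finite plane tree, which gives a quick way to check the geodesic picture. Below is a minimal sketch, assuming f is stored as a list of heights at integer times; the example path is a hypothetical tree in which the root has children A and B, and A has children A1 and A2.

```python
def tree_distance(f, s, t):
    """d_f(s,t) = f(s) + f(t) - 2 * min of f between s and t."""
    lo, hi = min(s, t), max(s, t)
    return f[s] + f[t] - 2 * min(f[lo:hi + 1])


# Contour heights visiting: root, A, A1, A, A2, A, root, B, root.
f = [0, 1, 2, 1, 2, 1, 0, 1, 0]

print(tree_distance(f, 2, 4))  # A1 to A2, via their common parent A
print(tree_distance(f, 2, 7))  # A1 to B, via the root
print(tree_distance(f, 1, 3))  # two visits to the same vertex A
```

Times 1, 3 and 5 all correspond to the vertex A, and the formula gives distance zero between them, which is exactly the identification of points used to build the real tree from f.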

We can now give a sketch proof of the result that almost all the support of $\lambda_f$, the projection of Lebesgue measure from [0,1] onto $\mathcal{T}_{f}$, is on $\mathcal{L}(\mathcal{T}_f)$.

Given $s,t\in[0,1]$, suppose we are removing p(s), and this separates p(t) from the root, which is canonically p(0). Without loss of generality, take t>s. Now suppose that $\check{f}(s,t)<f(s)$, and that, as before, this infimum is attained at $x\in[s,t]$. Then the geodesic from p(0) to p(t) will pass through p(x), but not through p(s), so in particular, removing p(s) cannot disconnect p(t) from the root.

Thus, p(s) is not a leaf if and only if there exists some small window [s,t] such that $f(s)\le f(x),\;\forall x\in[s,t]$. By Blumenthal’s 0-1 law, for fixed s, this happens with probability 0 if f is Brownian motion. Here, f is not Brownian motion, but a Brownian excursion with length 1. However, Blumenthal’s 0-1 law depends on the instantaneous behaviour after time s, ie the sigma field $\mathcal{F}_s^+$. So, for $s\in(0,1)$, the value of a Brownian motion at time 1 is independent of this sigma field, so if we imagine Brownian excursion as a ‘conditioned’ Brownian motion, this conditioning should have no effect on the conclusion of this corollary to Blumenthal’s 0-1 law.

This is not a formal argument, but it sketches why with probability 1, p(s) is a leaf for each $s\in(0,1)$, from which the result follows.

# Random Maps 2 – The Schaeffer Bijection

As indicated at the end of the previous post, our aim is to find a natural bijection between the set of pointed, rooted quadrangulations with n faces, and some set of objects based on decorating rooted plane trees with n edges in some fashion. Unlike our previous example, the construction of this bijection is definitely not trivial. It seems like a foolish ambition to explain this without several pictures, so I’m going to focus on some aspects of the analysis which I found challenging, rather than the construction itself.

Anyway, we don’t yet know what the extended set of trees should be. We need an extra factor of $3^n$, so it is natural to consider adding some sort of labelling of the tree, where for each non-root vertex in turn there are three options. So, given a rooted tree T, we label the vertices such that the root has label 0, and if a parent vertex has label k, any offspring has label k-1, k or k+1. Such a labelling is called admissible, and $\mathbb{T}_n$ is the set of rooted plane trees with n edges and an admissible labelling.

We now demonstrate how to construct an element of $\mathcal{Q}_n$ from an element of $\mathbb{T}_n$. Various authors had considered this problem to various extents, and so what follows is known as the Cori-Vauquelin-Schaeffer bijection, at least in this course.

Consider a contour exploration of the tree. That is, start out at the root and at all times take the first edge you encounter going clockwise from your current direction. When you arrive at a leaf, you will therefore immediately retrace your most recent step. The key property is that you traverse each edge exactly twice, and so we may think of the tree as having 2n oriented edges. It is more useful to think about corners. A corner is the directed arc (WLOG clockwise) between adjacent edges at a vertex. There is a natural bijection between corners and directed edges, by looking anti-clockwise from the tail of the edge. So the contour process explores the directed edges in some order, and hence explores the corners of the tree. One thing I found confusing initially was switching between considering vertices and corners. I feel in retrospect that the only reason we need the vertices themselves is to induce the labelling onto the corners. These are the only things we will use in the construction.

As we trace out the contour process, naturally we see different labels. We define the successor of a corner with label k to be the next corner seen in the contour process (taken modulo 2n if necessary) with label k-1. Note that any corner on a vertex with minimal label will not have a successor. To counter this, we add a new vertex, suggestively called $v_*$, with a single corner (ie no edges yet) and denote this corner to be the successor of the corners in the original tree with minimal label.

To construct our quadrangulation, we simply join up every corner with its successor corner. Note that if you are thinking of the successor of a corner as a vertex (rather than as a corner) you will get in trouble here, as there might be several ways to draw this arc.

The red arcs and vertex v* are added to form the quadrangulation. Note the blue angles indicate the three corners around the vertex labelled -1.

It is not obvious that it is possible to do this so that the arcs do not overlap. However, by considering the label process as you explore via the contour process, it becomes clear that you can discount the possibility of any overlaps one by one. This applies equally to pairs of new arcs overlapping, as well as new arcs overlapping with edges of the original tree. In any case, we remove the edges of the original tree to obtain the quadrangulation.

Note that when you move from any corner of a vertex with label k to its successor, then to the successor of its successor and so on, the labels are decreasing, so eventually you must end up at a corner with minimal label, and hence at $v_*$. We conclude that the graph of arcs is connected. It remains to show that it is a quadrangulation.
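Since the construction only uses the labels of the corners in contour order, the successor map is easy to compute from that list alone. Here is a minimal sketch under my own conventions: `labels[i]` is the label of the i-th corner met by the contour exploration, the string `"v*"` stands for the extra vertex, and the example labelling is a hypothetical admissible one on a tree with four edges (so eight corners).

```python
V_STAR = "v*"


def successors(labels):
    """Successor of each corner: the next corner, cyclically, with label one less.

    Corners carrying the minimal label are sent to the extra vertex v*.
    """
    n = len(labels)
    lowest = min(labels)
    succ = []
    for i, k in enumerate(labels):
        if k == lowest:
            succ.append(V_STAR)
            continue
        for step in range(1, n):
            j = (i + step) % n
            if labels[j] == k - 1:
                succ.append(j)
                break
        else:
            # Cannot happen if labels change by at most 1 between corners.
            raise ValueError("no corner with label k - 1 found")
    return succ


def path_to_v_star(succ, i):
    """Follow successors from corner i until reaching v*."""
    path = [i]
    while path[-1] != V_STAR:
        path.append(succ[path[-1]])
    return path


# Corners in contour order: root(0), A(1), A1(0), A(1), A2(1), A(1), root(0), B(-1).
labels = [0, 1, 0, 1, 1, 1, 0, -1]
succ = successors(labels)
print(succ)
print(path_to_v_star(succ, 1))
```

Each of the 2n corners of the tree contributes exactly one arc, and every successor chain ends at $v_*$ because the label drops by one at each step, which is the connectivity argument above.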

This is rather fiddly to do without a diagram. Note first that whenever we have a directed edge in the tree going from label k to label k-1, then this edge essentially becomes an arc of the quadrangulation. We show that the edge oriented in the other direction, called say e, induces three further arcs of a quadrangle. So e goes from label k-1 to k. Consider the corners immediately before and after e in the contour exploration: the corner before e is at the vertex with label k-1, and the corner after e is at the vertex with label k. The successor of the corner after e is a corner with label k-1, and this has a successor with label k-2. By construction, this must also be the successor of the corner before e. Why? Well, as we traverse the contour beyond e, the first appearance of label k-1 must happen before the first appearance of label k-2, as the increments can only be in {-1,0,1}. This gives us the three further arcs. Note also that the 2-colouring of the quadrangulation is given by the parity of the tree-labelling.

I was bothered about what happens if the two vertices with label k-1 are in fact the same. This would happen if, for example, the vertex labelled k is a leaf. Then at least two of the corners around the single vertex with label k-1 have the same corner as successor. A naïve attempt at drawing the resulting arcs did not give a quadrangle. The key observation is that you have to draw the arcs in the direction of the contour process. So in this case, the arc from the corner before edge e will loop all the way around the vertex with label k, so that it encloses the other two relevant arcs on its way to the vertex with label k-2, giving us the ‘pacman’ quadrangle discussed earlier.

The other case we have to check is when our base edge joins two vertices with label k. Then the other two vertices of the face will have label k-1. This is similar to the above, and slightly easier.
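As a sanity check on the counting (not a substitute for the case analysis above): the tree has n edges, hence n+1 vertices and 2n corners, and each corner emits exactly one arc. Writing V, E and F for the numbers of vertices, arcs and faces of the resulting map (my notation), and using that it is a connected planar map as argued above, Euler's formula gives

$V=n+2,\quad E=2n,\quad F=2-V+E=n.$

So there is one face per edge of the tree, and since the face degrees sum to $2E=4n$, the average face degree is exactly 4, which is consistent with every face being a quadrangle.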

As a preliminary to checking that we can invert this construction, we observe that the vertices of the quadrangulation are the vertices of the original tree plus $v_*$, and furthermore, the labels in the tree are given by the graph distance from $v_*$ in the quadrangulation, with a constant added uniformly so that the root vertex has label 0.
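This observation is easy to test numerically on a small example. The sketch below builds the arcs from a labelled tree and compares the tree labels with breadth-first-search distances from $v_*$. Everything about the representation is an assumption for illustration: the tree is a dictionary of ordered child lists, `labels` maps vertices to integers with the root labelled 0, and the names `V_STAR`, `quadrangulation_arcs` and `distances_from` are made up. Arcs are stored only as pairs of endpoints, which loses the planar embedding but is enough for graph distances.

```python
from collections import deque

V_STAR = "v*"  # stand-in name for the extra vertex


def contour_vertices(children, root):
    """Vertex at each of the 2n corners, in contour order."""
    seq = []

    def visit(v):
        seq.append(v)
        for c in children.get(v, []):
            visit(c)
            seq.append(v)
    visit(root)
    return seq[:-1]


def quadrangulation_arcs(children, labels, root):
    """One arc per corner, joining it to its successor (endpoints only)."""
    corners = contour_vertices(children, root)
    m = len(corners)
    arcs = []
    for i, v in enumerate(corners):
        target = V_STAR  # used when no corner with smaller label exists
        for step in range(1, m):
            w = corners[(i + step) % m]
            if labels[w] == labels[v] - 1:
                target = w
                break
        arcs.append((v, target))
    return arcs


def distances_from(arcs, source):
    """Graph distances from `source` by breadth-first search."""
    neighbours = {}
    for a, b in arcs:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in neighbours[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


# Labelled tree with 3 edges; labels differ by at most 1 along each edge.
children = {0: [1, 2], 1: [3]}
labels = {0: 0, 1: -1, 2: 1, 3: 0}
arcs = quadrangulation_arcs(children, labels, root=0)
dist = distances_from(arcs, V_STAR)
shift = 1 - min(labels.values())  # minimal-label vertices sit at distance 1
print(all(dist[v] == labels[v] + shift for v in labels))
```

The constant is pinned down by the vertices of minimal label, which are adjacent to $v_*$ by construction.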

At this point, we observe that in the construction, we didn’t specify how to choose the rooted edge of the quadrangulation. Canonically, we take it to be the arc between the first corner of the root in the contour process, and its successor. However, we can orient it in either direction, giving us the extra factor of 2 we were looking for.

Returning to the inverse, it is clear what to do when we see a quadrangle corresponding to the second case above – namely put an edge between the two vertices with label k. In the case where the face has labels {k,k-1,k-1,k-2} it is less obvious. Note though that by starting at the first corner of the root, which is identified by the rooted edge in the quadrangulation, we can recover the contour process from the arcs of the quadrangulation, and the labels. So when we see such a face, we can use this information to choose which of the (k-1)-labelled vertices to join to the vertex with label k.

Anyway, now that we are convinced that this bijection works, the next stage is to apply it to gain extra information about a uniformly-chosen large quadrangulation. We can view the vertices as being those of a large uniform plane tree, and the labels as given by a random walk along this large tree. We might expect to see this labelling structure converge to something that looks like Brownian motion indexed by a Brownian continuum random tree, in a sense to be made more precise. And the labelling is not merely a decoration in the quadrangulation, since it specifies the distance to the identified point $v_*$. In particular, this gives a bound on the distance between any two vertices in the quadrangulation, e.g. two vertices chosen uniformly at random. In fact, by looking more carefully at the scaling limit of the uniform tree’s contour process, we can say rather more than that.
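To spell out the simplest version of that bound: write $\ell(u)$ for the tree label of a vertex u and $d$ for graph distance in the quadrangulation (both my notation). Since the labels are distances from $v_*$ up to an additive constant, and the vertices of minimal label are at distance 1 from $v_*$, the triangle inequality through $v_*$ gives

$d(u,v)\le d(u,v_*)+d(v_*,v)=\ell(u)+\ell(v)-2\min_w \ell(w)+2.$

This is crude, since it ignores where u and v sit relative to each other in the contour order, which is exactly the information the finer analysis exploits.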