Generating uniform trees

A long time ago, I wrote quite a few a things about uniform trees. That is, a uniform choice from the n^{n-2} unrooted trees with vertex set [n]. This enumeration, normally called Cayley’s formula, has several elegant arguments, including the classical Prufer bijection. But making a uniform choice from a large set is awkward, and so we seek more probabilistic methods to sample such a tree, which might also give insight into the structure of a ‘typical’ uniform tree.

In another historic post, I talked about the Aldous-Broder algorithm. Here’s a quick summary. We run a random walk on the complete graph K_n started from a uniformly-chosen vertex. Every time we arrive at a vertex we haven’t visited before, we record the edge just traversed. Eventually we have visited all n vertices, so have recorded n-1 edges. It’s easy enough to convince yourself that these n-1 edges form a tree (how could there be a cycle?) and a bit more complicated to decide that the distribution of this tree is uniform.

It’s worth noting that this algorithm works to construct a uniform spanning tree on any connected base graph.

This post is about a few alternative constructions and interpretations of the uniform random tree. The first construction uses a Galton-Watson process. We take a Galton-Watson process where the offspring distribution is Poisson(1), and condition that the total population size is n. The resulting random tree has a root but no labels, however if we assign labels in [n] uniformly at random, the resulting rooted tree has the uniform distribution among rooted trees on [n].

Proof

This is all about moving from ordered trees to non-ordered trees. That is, when setting up a Galton-Watson tree, we distinguish between the following two trees, drawn extremely roughly in Paint:

That is, it matters which of the first-generation vertices have three children. Anyway, for such a (rooted) ordered tree T with n vertices, the probability that the Galton-Watson process ends up equal to T is

\mathbb{P}(GW = T) = \prod_{v\in T} \frac{e^{-1}}{C(v)!} = e^{-n} \prod_{v\in T}\frac{1}{C(v)!},

where C(v) is the number of children of a vertex v\in T. Then, since \mathbb{P}( |GW|=n ) is a function of n, we find

\mathbb{P}(GW=T \,\big|\, |GW|=n) = f(n)\prod_{v\in T} \frac{1}{C(v)!},

where f(n) is a function of n alone (ie depends on T only through its size n).

But given an unordered rooted tree t, labelled by [n], there are \prod_{v \in t} C(v)! ordered trees associated to t in the natural way. Furthermore, if we take the Poisson Galton-Watson tree conditioned to have total population size n, and label uniformly at random with [n], we obtain any one of these ordered trees with probability \frac{f(n)}{n!} \prod_{v\in t} \frac{1}{C(v)!}. So the probability that we have t after we forget about the ordering is \frac{f(n)}{n!}, which is a function of n alone, and so the distribution is uniform among the set of rooted unordered trees labelled by [n], exactly as required.

Heuristic for Poisson offspring distribution

In this proof, the fact that \mathbb{P}(C(v)=k)\propto \frac{1}{k!} exactly balances the number of orderings of the k children explains why Poisson(1) works out. Indeed, you can see in the proof that Poisson(c) works equally well, though when c\ne 1, the event we are conditioning on (namely that the total population size is n) has probability decaying exponentially in n, whereas for c=1, the branching process is critical, and the probability decays polynomially.

We can provide independent motivation though, from the Aldous-Broder construction. Both the conditioned Galton-Watson construction and the A-B algorithm supply the tree with a root, so we’ll keep that, and look at the distribution of the degree of the root as constructed by A-B. Let \rho=v_1,v_2,v_3,\ldots be the vertices [n], ordered by their discovery during the construction. Then \rho is definitely connected by an edge to v_2, but thereafter it follows by an elementary check that the probability \rho is connected to v_m is \frac{1}{n-1}, independently across all m. In other words, the distribution of the degree of \rho in the tree as constructed by A-B is

1+ \mathrm{Bin}\left(n-2,\frac{1}{n-1}\right) \approx 1+\mathrm{Poisson}(1).

Now, in the Galton-Watson process, conditioning the tree to have fixed, large size changes the offspring distribution of the root. Conveniently though, in a limiting sense it’s the same change as conditioning the tree to have size at least n. Since these events are monotone in n, it’s possible to take a limit of the conditioning events, and interpret the result as the Galton-Watson tree conditioned to survive. It’s a beautiful result that this interpretation can be formalised as a local limit. The limiting spine decomposition consists of an infinite spine, where the offspring distribution is a size-biased version of the original offspring distribution (and so in particular, always has at least one child) and where non-spine vertices have the original distribution.

In particular, the number of the offspring of the root is size-biased, and it is well-known and not hard to check that size-biasing Poisson(c) gives 1+Poisson(c) ! So in fact we have, in an appropriate limiting sense in both objects, a match between the degree distribution of the root in the uniform tree, and in the conditioned Galton-Watson tree.

This isn’t supposed to justify why a conditioned Galton-Watson tree is relevant a priori (especially the unconditional independence of degrees), but it does explain why Poisson offspring distributions are relevant.

Construction via G(N,p) and the random cluster model

The main reason uniform trees were important to my thesis was their appearance in the Erdos-Renyi random graph G(N,p). The probability that vertices {1, …, n} form a tree component in G(N,p) with some particular structure is

p^{n-1} (1-p)^{\binom{n}{2}-(n-1)} \times (1-p)^{n(N-m)}.

Here, the first two terms give the probability that the graph structure on {1, …, n} is correct, and the the final term gives the probability of the (independent) event that these vertices are not connected to anything else in the graph. In particular, this has no dependence on the tree structure chosen on [n] (for example, whether it should be a path or a star – both examples of trees). So the conditional distribution is uniform among all trees.

If we work in some limiting regime, where pn\rightarrow 0 (for example if n is fixed and p=\frac{1}{N}\rightarrow 0), then we can get away asymptotically with less strong conditioning. Suppose we condition instead just that [n] form a component. Now, there are more ways to form a connected graph with one cycle on [n] than there are trees on [n], but the former all require an extra edge, and so the probability that a given one such tree-with-extra-edge appears as the restriction to [n] in G(N,p) is asymptotically negligible compared to the probability that the restriction to [n] of G(N,p) is a tree. Naturally, the local limit of components in G(N,c/N) is a Poisson(c) Galton-Watson branching process, and so this is all consistent with the original construction.

One slightly unsatisfying aspect to this construction is that we have to embed the tree of size [n] within a much larger graph on [N] to see uniform trees. We can’t choose a scaling p=p(n) such that G(n,p) itself concentrates on trees. To guarantee connectivity with high probability, we need to take p> \frac{\log n}{n}, but by this threshold, the graph has (many) cycles with high probability.

At this PIMS summer school in Vancouver, one of the courses is focusing on lattice spin models, including the random cluster model, which we now briefly define. We start with some underlying graph G. From a physical motivation, we might take G to be \mathbb{Z}^d or some finite subset of it, or a d-ary tree, or the complete graph K_N. As in classical bond percolation (note G(N,p) is bond percolation on K_N), a random subset of the edges of G are included, or declared open. The probability of a given configuration w, with e open edges is proportional to

p^e (1-p)^{|E(G)| - e} q^{k(w)}, (*)

where the edge-weight p\in(0,1) as usual, and cluster weight q\in (0,\infty), and k(w) counts the number of connected components in configuration w. When q=1, we recover classical bond percolation (including G(N,p) ), while for q>1, this cluster-reweighting favours having more components, and q<1 favours fewer components. Note that in the case q\ne 1, the normalising constant (or partition function) of (*) is generally intractable to calculate explicitly.

As in the Erdos-Renyi graph, consider fixing the underlying graph G, and taking p\rightarrow 0, but also taking \frac{q}{p}\rightarrow 0. So the resulting graph asymptotically ‘wants to have as few edges as possible, but really wants to have as few components as possible’. In particular, 1) all spanning trees of G are equally likely; 2) any configuration with more than one component has asymptotically negligible probability relative to any tree; 3) any graph with a cycle has #components + #edges greater than that of a tree, and so is asymptotically negligible probability relative to any tree.

In other words, the limit of the distribution is the uniform spanning tree of G, and so this (like Aldous-Broder) is a substantial generalisation, which constructs the uniform random tree in the special case where G=K_n.

 

Random walks conditioned to stay positive

In this post, I’m going to discuss some of the literature concerning the question of conditioning a simple random walk to lie above a line with fixed gradient. A special case of this situation is conditioning to stay non-negative. Some notation first. Let (S_n)_{n\ge 0} be a random walk with IID increments, with distribution X. Take \mu to be the expectation of these increments, and we’ll assume that the variance \sigma^2 is finite, though at times we may need to enforce slightly stronger regularity conditions.

(Although simple symmetric random walk is a good example for asymptotic heuristics, in general we also assume that if the increments are discrete they don’t have parity-based support, or any other arithmetic property that prevents local limit theorems holding.)

We will investigate the probability that S_n\ge 0 for n=0,1,…,N, particularly for large N. For ease of notation we write T=\inf\{n\ge 0\,:\, S_n<0\} for the hitting time of the negative half-plane. Thus we are interested in S_n conditioned on T>N, or T=N, mindful that these might not be the same. We will also discuss briefly to what extent we can condition on T=\infty.

In the first paragraph, I said that this is a special case of conditioning SRW to lie above a line with fixed gradient. Fortunately, all the content of the general case is contained in the special case. We can repose the question of S_n conditioned to stay above n\alpha until step N by the question of S_n-n\alpha (which, naturally, has drift \mu-\alpha) conditioned to stay non-negative until step N, by a direct coupling.

Applications

Simple random walk is a perfectly interesting object to study in its own right, and this is a perfectly natural question to ask about it. But lots of probabilistic models can be studied via naturally embedded SRWs, and it’s worth pointing out a couple of applications to other probabilistic settings (one of which is the reason I was investigating this literature).

In many circumstances, we can desribe random trees and random graphs by an embedded random walk, such as an exploration process, as described in several posts during my PhD, such as here and here. The exploration process of a Galton-Watson branching tree is a particularly good example, since the exploration process really is simple random walk, unlike in, for example, the Erdos-Renyi random graph G(N,p), where the increments are only approximately IID. In this setting, the increments are given by the offspring distribution minus one, and the hitting time of -1 is the total population size of the branching process. So if the expectation of the offspring distribution is at most 1, then the event that the size of the tree is large is an atypical event, corresponding to delayed extinction. Whereas if the expectation is greater than one, then it is an event with limiting positive probability. Indeed, with positive probability the exploration process never hits -1, corresponding to survival of the branching tree. There are plenty of interesting questions about the structure of a branching process tree conditional on having atypically large size, including the spine decomposition of Kesten [KS], but the methods described in this post can be used to quantify the probability, or at least the scale of the probability of this atypical event.

In my current research, I’m studying a random walk embedded in a construction of the infinite-volume DGFF pinned at zero, as introduced by Biskup and Louidor [BL]. The random walk controls the gross behaviour of the field on annuli with dyadically-growing radii. Anyway, in this setting the random walk has Gaussian increments. (In fact, there is a complication because the increments aren’t exactly IID, but that’s definitely not a problem at this level of exposition.) The overall field is decomposed as a sum of the random walk, plus independent DGFFs with Dirichlet boundary conditions on each of the annuli, plus asymptotically negligible corrections from a ‘binding field’. Conditioning that this pinned field be non-negative up to the Kth annulus corresponds to conditioning the random walk to stay above the magnitude of the minimum of each successive annular DGFF. (These minima are random, but tightly concentrated around their expectations.)

Conditioning on \{T > N\}

When we condition on \{T>N\}, obviously the resulting distribution (of the process) is a mixture of the distributions we obtain by conditioning on each of \{T=N+1\}, \{T=N+2\},\ldots. Shortly, we’ll condition on \{T=N\} itself, but first it’s worth establishing how to relate the two options. That is, conditional on \{T>N\}, what is the distribution of T?

Firstly, when \mu>0, this event always has positive probability, since \mathbb{P}(T=\infty)>0. So as N\rightarrow\infty, the distribution of the process conditional on \{T>N\} converges to the distribution of the process conditional on survival. So we’ll ignore this for now.

In the case \mu\le 0, everything is encapsulated in the tail of the probabilities \mathbb{P}(T=N), and these tails are qualitatively different in the cases \mu=0 and \mu<0.

When \mu=0, then \mathbb{P}(T=N) decays polynomially in N. In the special case where S_n is simple symmetric random walk (and N has the correct parity), we can check this just by an application of Stirling’s formula to count paths with this property. By contrast, when \mu<0, even demanding S_N=-1 is a large deviations event in the sense of Cramer’s theorem, and so the probability decays exponentially with N. Mogulskii’s theorem gives a large deviation principle for random walks to lie above a line defined on the scale N. The crucial fact here is that the probabilistic cost of staying positive until N has the same exponent as the probabilistic cost of being positive at N. Heuristically, we think of spreading the non-expected behaviour of the increments uniformly through the process, at only polynomial cost once we’ve specified the multiset of values taken by the increments. So, when \mu<0, we have

\mathbb{P}(T\ge(1+\epsilon)N) \ll \mathbb{P}(T= N).

Therefore, conditioning on \{T\ge N\} in fact concentrates T on N+o(N). Whereas by contrast, when \mu=0, conditioning on \{T\ge N\} gives a nontrivial limit in distribution for T/N, supported on [1,\infty).

A related problem is the value taken by S_N, conditional on {T>N}. It’s a related problem because the event {T>N} depends only on the process up to time N, and so given the value of S_N, even with the conditioning, after time N, the process is just an unconditioned RW. This is a classic application of the Markov property, beloved in several guises by undergraduate probability exam designers.

Anyway, Iglehart [Ig2] shows an invariance principle for S_N | T>N when \mu<0, without scaling. That is S_N=\Theta(1), though the limiting distribution depends on the increment distribution in a sense that is best described through Laplace transforms. If we start a RW with negative drift from height O(1), then it hits zero in time O(1), so in fact this shows that conditonal on \{T\ge N\}, we have T= N +O(1) with high probability. When \mu=0, we have fluctuations on a scale \sqrt{N}, as shown earlier by Iglehart [Ig1]. Again, thinking about the central limit theorem, this fits the asymptotic description of T conditioned on T>N.

Conditioning on T=N

In the case \mu=0, conditioning on T=N gives

\left[\frac{1}{\sqrt{N}}S(\lfloor Nt\rfloor ) ,t\in[0,1] \right] \Rightarrow W^+(t), (*)

where W^+ is a standard Brownian excursion on [0,1]. This is shown roughly simultaneously in [Ka] and [DIM]. This is similar to Donsker’s theorem for the unconditioned random walk, which converges after rescaling to Brownian motion in this sense, or Brownian bridge if you condition on S_N=0. Skorohod’s proof for Brownian bridge [Sk] approximates the event \{S_N=0\} by \{S_N\in[-\epsilon \sqrt{N},+\epsilon \sqrt{N}]\}, since the probability of this event is bounded away from zero. Similarly, but with more technicalities, a proof of convergence conditional on T=N can approximate by \{S_m\ge 0, m\in[\delta N,(1-\delta)N], S_N\in [-\epsilon \sqrt{N},+\epsilon\sqrt{N}]\}. The technicalities here emerge since T, the first return time to zero, is not continuous as a function of continuous functions. (Imagine a sequence of processes f^N for which f^N(x)\ge 0 on [0,1] and f^N(\frac12)=\frac{1}{N}.)

Once you condition on T=N, the mean \mu doesn’t really matter for this scaling limit. That is, so long as variance is finite, for any \mu\in\mathbb{R}, the same result (*) holds, although a different proof is in general necessary. See [BD] and references for details. However, this is particularly clear in the case where the increments are Gaussian. In this setting, we don’t actually need to take a scaling limit. The distribution of Gaussian *random walk bridge* doesn’t depend on the mean of the increments. This is related to the fact that a linear transformation of a Gaussian is Gaussian, and can be seen by examining the joint density function directly.

Conditioning on T=\infty

When \mu>0, the event \{T=\infty\} occurs with positive probability, so it is well-defined to condition on it. When \mu\le 0, this is not the case, and so we have to be more careful.

First, an observation. Just for clarity, let’s take \mu<0, and condition on \{T>N\}, and look at the distribution of S_{\epsilon N}, where \epsilon>0 is small. This is approximately given by

\frac{S_{\epsilon N}}{\sqrt{N}}\stackrel{d}{\approx}W^+(\epsilon).

Now take \epsilon\rightarrow\infty and consider the RHS. If instead of the Brownian excursion W^+, we instead had Brownian motion, we could specify the distribution exactly. But in fact, we can construct Brownian excursion as the solution to an SDE:

\mathrm{d}W^+(t) = \left[\frac{1}{W^+(t)} - \frac{W^+(t)}{1-t}\right] \mathrm{d}t + \mathrm{d}B(t),\quad t\in(0,1) (**)

for B a standard Brownian motion. I might return in the next post to why this is valid. For now, note that the first drift term pushes the excursion away from zero, while the second term brings it back to zero as t\rightarrow 1.

From this, the second drift term is essentially negligible if we care about scaling W^+(\epsilon) as \epsilon\rightarrow 0, and we can say that W^+(\epsilon)=\Theta(\sqrt{\epsilon}).

So, returning to the random walk, we have

\frac{S_{\epsilon N}}{\sqrt{\epsilon N}}\stackrel{d}{\approx} \frac{W^+(\epsilon)}{\sqrt{\epsilon}} = \Theta(1).

At a heuristic level, it’s tempting to try ‘taking N\rightarrow\infty while fixing \epsilon N‘, to conclude that there is a well-defined scaling limit for the RW conditioned to stay positive forever. But we came up with this estimate by taking N\rightarrow\infty and then \epsilon\rightarrow 0 in that order. So while the heuristic might be convincing, this is not the outline of a valid argument in any way. However, the SDE representation of W^+ in the \epsilon\rightarrow 0 regime is useful. If we drop the second drift term in (**), we define the three-dimensional Bessel process, which (again, possibly the subject of a new post) is the correct scaling limit we should be aiming for.

Finally, it’s worth observing that the limit \{T=\infty\}=\lim_{N\rightarrow\infty} \{T>N\} is a monotone limit, and so further tools are available. In particular, if we know that the trajectories of the random walk satisfy the FKG property, then we can define this limit directly. It feels intuitively clear that random walks should satisfy the FKG inequality (in the sense that if a RW is large somewhere, it’s more likely to be large somewhere else). You can do a covariance calculation easily, but a standard way to show the FKG inequality applies is by verifying the FKG lattice condition, and unless I’m missing something, this is clear (though a bit annoying to check) when the increments are Gaussian, but not in general. Even so, defining this monotone limit does not tell you that it is non-degenerate (ie almost-surely finite), for which some separate estimates would be required.

A final remark: in a recent post, I talked about the Skorohod embedding, as a way to construct any centered random walk where the increments have finite variance as a stopped Brownian motion. One approach to conditioning a random walk to lie above some discrete function is to condition the corresponding Brownian motion to lie above some continuous extension of that function. This is a slightly stronger conditioning, and so any approach of this kind must quantify how much stronger. In Section 4 of [BL], the authors do this for the random walk associated with the DGFF conditioned to lie above a polylogarithmic curve.

References

[BD] – Bertoin, Doney – 1994 – On conditioning a random walk to stay nonnegative

[BL] – Biskup, Louidor – 2016 – Full extremal process, cluster law and freezing for two-dimensional discrete Gaussian free field

[DIM] – Durrett, Iglehart, Miller – 1977 – Weak convergence to Brownian meander and Brownian excursion

[Ig1] – Iglehart – 1974 – Functional central limit theorems for random walks conditioned to stay positive

[Ig2] – Iglehart – 1974 – Random walks with negative drift conditioned to stay positive

[Ka] – Kaigh – 1976 – An invariance principle for random walk conditioned by a late return to zero

[KS] – Kesten, Stigum – 1966 – A limit theorem for multidimensional Galton-Watson processes

[Sk] – Skorohod – 1955 – Limit theorems for stochastic processes with independent increments