# The reflection principle and conditioned RWs

I haven’t published a post about probability for far too long. Several are queued, so perhaps this will be the start of a deluge.

Anyway, with my advisor at Technion, I’m still working on some problems concerning Gaussian random walk subject to some conditioning which is complicated, but in practice (we hope) only mildly different to conditioning the walk to stay positive. Our conditioning at step n depends on some external randomness, but also on the future trajectory of the walk (related to the embedding of the walk in a 2D DGFF), thus ruining the possibility of applying the Markov property in any proof without significant preliminary work.

It seemed worth writing briefly about some of these results in a slightly simpler setting. The goal is to assemble many of the ingredients required to prove a local limit for Gaussian random walk conditioned to stay positive, in a sense which will be clarified towards the end. This is not the best way to go about getting scaling limits (as discussed briefly here, and for which see references [Ig74] and [Bo76]), and it’s probably not the best way to get local limits in the simplest setting, but it’s the method we are currently working to generalise, and follows the outline of [B-JD05], but in much less technical detail.

Probabilities via the reflection principle

We start with Brownian motion. The reflection principle, as described briefly in this post from the depths of history, is a classical technique for studying the maximum of Brownian motion. Roughly speaking, we exploit the fact that $(W_t,t\ge 0)\stackrel{d}=(-W_t,t\ge 0)$, but we then apply this at the hitting time of a particular positive value, using the Strong Markov Property.

Let $S_t=\max_{0\le s\le t}W_s$ be the running maximum of the Brownian motion $W_t$, and $\tau_b$ the hitting time of b. Then

$\mathbb{P}(S_t\ge b, B_t\le a)=\mathbb{P}(\tau_b

which, by SMP at $\tau_b$ and the reflection invariance of a standard BM, is equal to

$\mathbb{P}(\tau_b

This obviously assumed $b\ge a$, but if we set $b=a$, we find

$\mathbb{P}(S_t\ge b)=\mathbb{P}(B_t>b)+\mathbb{P}(S_t\ge b,B_t\le b)=2\mathbb{P}(B_t\ge b).$

Or, in other words, $S_t\stackrel{d}=|B_t|$.

While we can’t derive such nice equalities in distribution, the reflection principle is robust enough to treat more complicated settings, such as Brownian bridge.

We might want to ask about the maximum of a standard Brownian bridge, but we can be more general, and ask about the maximum of a Brownian bridge with drift (let’s say general bridge here). It’s important to remember that a general Brownian bridge has the same distribution as a linear transformation of a standard Brownian bridge. Everything is Gaussian, after all. So asking whether the maximum of a general Brownian bridge is less than a particular value is equivalent to asking whether a standard Brownian bridge lies below a fixed line. Wherever possible, we make such a transformation at the start and perform the simplest version of the required calculation.

So, suppose we have a bridge B from (0,0) to (t,a), and we want to study $\max_{s\in[0,t]} B_s$. Fix some $b>a$, and work with a standard Brownian motion $W_s$. By a similar argument to before,

$\mathbb{P}(\tau_b\le t, W_t\in[a,a+\mathrm{d}x]) = \mathbb{P}(W_t\in [2b-a-\mathrm{d}x,2b-a]) = \frac{\mathrm{d}x}{\sqrt{2\pi t}}e^{-(2b-a)^2/2t},$

and

$\mathbb{P}(W_t\in[a,a+\mathrm{d}x])=\frac{\mathrm{d}x}{\sqrt{2\pi t}}e^{-a^2/2t}.$

So

$\mathbb{P}(\max_{s\in[0,t]}B_t\ge b) = \exp\left(\frac{-(2b-a)^2 + a^2}{2t}\right).$

Random walk conditioned to stay positive

Our main concern is conditioning to stay above zero. Let $\mathbb{P}_{0,x}^{t,y}$ be some complete if cumbersome notation for a Brownian bridge B from (0,x) to (t,y). Then another simple transformation of the previous result gives

$\mathbb{P}_{0,x}^{t,y}(B_s\ge 0,\,s\in[0,t])=1-\exp\left( \frac{-(x+y)^2 + (x-y)^2}{2t} \right)= 1-\exp\left(-\frac{2xy}{t}\right).$

Then, if $xy\ll t$, we can approximate this by $\frac{2xy}{t}$. (*)

Extend the notation so $\mathbb{P}_{0,x}$ describes Brownian motion started from (0,x). Then integrating over y, gives

$\mathbb{P}_{0,x}(B_s\ge 0,\, s\in[0,t] ) = \frac{x}{t}\mathbb{E}[B_t\vee 0] = \sqrt{\frac{2}{\pi}} \frac{x}{\sqrt{t}}.$

(It might appear that we have integrated the approximation (*) over parts of the range where it is not a valid approximation, but the density of $B_t=\Theta(t)$ vanishes exponentially fast, and so actually it’s not a problem.)

We now want to extend this to random walks. Some remarks:

• We used the Gaussian property of Brownian motion fairly heavily throughout this derivation. In general random walks are not Gaussian, though we can make life easier by focusing on this case.
• We also used the continuity property of Brownian motion when we applied the reflection principle. For a general random walk, it’s hopeless to consider the hitting times of individual values. We have to consider instead the hitting times of regions $\tau_{(-\infty,b]}$, and so on. One can still apply SMP and a reflection principle, but this gives bounds rather than equalities. (The exception is simple random walk, for which other more combinatorial methods may be available anyway.)
• On the flip side, if we are interested in Brownian motion/bridge staying positive, we can’t start it from zero, as then the probability of this event is zero, by Blumenthal’s 0-1 law. By contrast, we can certainly ask about random walk staying positive when started from zero without taking a limit.

A useful technique will be the viewpoint of random walk as the values taken by Brownian motion at a sequence of stopping times. This Skorohod embedding is slightly less obvious when considering a general random walk bridge inside a general Brownian bridge, but is achievable. We want to study quantities like

$\mathbb{P}(S_k\ge 0,\, k=1,\ldots,n \big| S_0=x,S_n=y),$

where for simplicity let’s just take $(S_k,k\ge 0)$ to be a random walk with standard Gaussian increments. It’s possible we might want to take a scaling limit in x and y as functions of n. But first if we take x,y fixed, and embed the random walk bridge with these endpoints into the corresponding Brownian bridge with $t\approx n$, we are then faced with the question:

What’s the probability that the Brownian bridge goes below zero, but the embedded RW with n steps does not?

If the Brownian bridge conditioned to go below zero spends time $\Theta_p(n)$ below zero, then for large n it’s asymptotically very unlikely that the n places at which we embed the random walk avoids this set of intervals.

Several technical estimates are required to make this analysis rigorous. The conclusion is that there exists a function $f(x)$ for which $f(x)=x(1+o(1))$ as $x\rightarrow\infty$, such that

$q_n(x,y):=\mathbb{P}(S_k\ge 0,\, k=0,1,\ldots,n \,\big|\, S_0=x,S_n=y) \sim \frac{2f(x)f(y)}{n},$

$\text{and}\quad q_n(x):=\mathbb{P}(S_k\ge 0,\,k=0,1,\ldots,n\,\big|\, S_0=x)\sim \sqrt{\frac{2}{\pi}}\frac{f(x)}{\sqrt{n}}.$

As earlier, the second is obtained from the first by integrating over suitable y. This function $f$ has to account for the extra complications when either end-point is near zero, for which the event where the Brownian motion goes negative without the random walk going negative requires additional care.

Limits for the conditioned random walk

In the previous post on this topic, we addressed scaling limits in space and time for conditioned random walks. But we don’t have to look at the classical Donsker scaling to see the effects of conditioning to stay positive. In our setting, we are interested in studying the distribution of $S_m$ conditional on the event $(S_1\ge 0,S_2\ge 0,\ldots, S_n\ge 0)$, with limits taken in the order $n\rightarrow\infty$ and then $m\rightarrow\infty$.

(At a more general level, it is meaningful to describe the random walk conditioned on staying positive forever. Although this would a priori require conditioning on an event of probability zero, it can be handled formally as an example of an h-transform.)

As explained in that previous post, the scaling invariance of the Bessel process $W^+$ (which it’s not unreasonable to think of as ‘Brownian motion conditioned to stay non-negative’) suggests that this limit should exist, and be given by the entrance law of $W^+$. But this is hard to extract from a scaling limit.

However, we can use the previous estimate to start a direct calculation.

$\mathbb{P}(S_m\in \mathrm{d}y \,\big|\, S_k\ge 0,\, k=1,\ldots,n) = \frac{q_m(0,y) q_{n-m}(y) \mathbb{P}(S_m\in\mathrm{d}y)}{q_n(0)}.$

Here, we used the Markov property at time m to split the event that $S_m=y$ and the walk stays positive into two time-intervals. We will later take m large, so we may approximate as

$\frac{2f(0)f(y)/m \times \sqrt{\frac{2}{\pi}}f(y)/\sqrt{n-m}\times \mathbb{P}(S_m\in\mathrm{d}y) } { \sqrt{\frac{2}{\pi}}f(0)/\sqrt{n}}\stackrel{n\rightarrow\infty}=\frac{2f(y)^2}{m}\mathbb{P}(S_m\in\mathrm{d}y).$

This final probability emphasises that as $m\rightarrow\infty$ we only really have to consider $y=\Theta(\sqrt{m})$, so set $y=z\sqrt{m}$, and we obtain

$\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_m}{\sqrt{m}}\in\mathrm{d}z\,\big|\, S_k\ge 0,\,k=1,\ldots,n)$

$\sim \sqrt{m}\cdot\frac{2z^2m}{m}\cdot \frac{1}{\sqrt{2\pi}}\frac{1}{\sqrt{m}}e^{-z^2/2} = \sqrt{\frac{2}{\pi}}z^2 e^{-z^2/2}.$

This is precisely the entrance law of the 3-dimensional Bessel process, usually denoted $R$. This process is invariant under time-rescaling in the same fashion as Brownian motion. Indeed, one representation of R is as the radial part of a three-dimensional Brownian motion, given by independent BMs in each coordinate. (See [Pi75] for explanation of the relation to ‘BM conditioned to stay non-negative’.) We could complete the analogy by showing that $q_n(x,y)$ converges to the transition density of R as well. (Cf the prelude to Theorem 2.2 of [B-JD05].)

Final remarks

The order of taking limits is truly crucial. We can also obtain a distributional scaling limit at time n under conditioning to stay non-negative up to time n. But then this is the size-biased normal distribution $\sim ze^{-z^2/2}$ (the Rayleigh distribution), rather than the square-size-biased normal distribution we say in this setting. And we can exactly see why. Relative to the normal distribution which applies in the absence of conditioning, we require size-biasing to account for the walk staying positive up to time m, and then also size-biasing to account for the walk staying positive for the rest of time (or up to n in the $n\rightarrow\infty$ limit if you prefer).

The asymptotics for $q_n(x,y)$ were the crucial step, for which only heuristics are present in this post. It remains the case that estimates of this kind form the crucial step in other more exotic conditioning scenarios. This is immediately visible (even if the random walk notation is rather exotic) in, for example, Proposition 2.2 of [CHL17], of which we currently require a further level of generalisation.

References

[Bo76] – Bolthausen – On a functional central limit theorem for random walks conditioned to stay positive

[B-JD05] – Bryn-Jones, Doney – A functional limit theorem for random walk conditioned to stay non-negative

[CHL17] – Cortines, Hartung, Louidor – The structure of extreme level sets in branching Brownian motion

[Ig74] – Iglehart – Functional central limit theorems for random walks conditioned to stay positive

[Pi75] – Pitman – One-dimensional Brownian motion and the three-dimensional Bessel process

# Kernels of critical graph components

This post is motivated by G(N,p), the classical Erdos-Renyi random graph, specifically its critical window, when $p=p(N)=\frac{1}{N}(1+\lambda N^{-1/3})$.

We start with the following observation, which makes no restriction on p. Suppose a component of G(N,p) is a tree. Then, the graph geometry of this component is that of a uniform random tree on the appropriate number of vertices. This is deliberately informal. To be formal, we’d have to say “condition on a particular subset of vertices forming a tree-component” and so on. But the formality is broadly irrelevant, because at the level of metric scaling limits, if we want to describe the structure of a tree component, it doesn’t matter whether it has $\log N$ or $\frac{1}{7}N$ vertices, because in both cases the tree structure is uniform. The only thing that changes is the scaling factor.

In general, when V vertices form a connected component of a graph with E edges, we define the excess to be E-V+1. So the excess is non-negative, and is zero precisely when the component is a tree. I’m reluctant to say that the excess counts the number of cycles in the component, but certainly it quantifies the amount of cyclic structure present. We will sometimes, in a mild abuse of notation, talk about excess edges. But note that for a connected component with positive excess, there is a priori no way to select which edges would be the excess edges. In a graph process, or when there is some underlying exploration of the component, there sometimes might be a canonical way to classify the excess edges, though it’s worth remarking that the risk of size-biasing errors is always extremely high in this sort of situation.

Returning to the random graph process, as so often there are big changes around criticality. In the subcritical regime, the components are small, and most of them, even the largest with high probability, are trees. In the supercritical regime, the giant component has excess $\Theta(N)$, which is qualitatively very different.

It feels like every talk I’ve ever given has begun with an exposition of Aldous’s seminal paper [Al97] giving a distributional scaling limit of the sizes of critical components in the critical window, and a relation between the process on this time-scale and the multiplicative coalescent. And it remains relevant here, because the breadth-first exploration process can also be used to track the number of excess edges.

In a breadth-first exploration, we have a stack of vertices we are waiting to explore. We pick one and look its neighbours restricted to the rest of the graph, that is without the vertices we have already fully explored, and also without the other vertices in the stack. That’s the easiest way to handle the total component size. But we can simultaneously track how many times we would have joined to a neighbour within the stack, which leads to an excess edge, and Aldous derives a joint distributional scaling limit for the sizes of the critical components and their excesses. (Note that in this case, there is a canonical notion of excess edge, but it depends not just on the graph structure, but also on the extra randomness of the ordering within the breadth-first search.)

Roughly speaking, we consider the reflected exploration process, and its scaling limit, which is a reflected parabolically-drifting Brownian motion (though the details of this are not important at this level of exposition, except that it’s a well-behaved non-negative process that hits zero often). The component sizes are given by the widths of the excursions above zero, scaled up in a factor $N^{1/3}$. Then conditional on the shape of the excursion, the excess is Poisson with parameter the area under the excursion, with no rescaling. That is, a critical component has $\Theta(1)$ excess.

So, with Aldous’s result in the background, when we ask about the metric structure of these critical components, we are really asking: “what does a uniformly-chosen connected component with fixed excess look like when the number of vertices grows?”

I’ll try to keep notation light, but let’s say T(n,k) is a uniform choice from connected graphs on n vertices with excess k.

[Note, the separation of N and n is deliberate, because in the critical window, the connected components have size $n = \Theta(N^{2/3})$, so I want to distinguish the two problems.]

In this post, we will mainly address the question: “what does the cycle structure of T(n,k) look like for large n?” When k=0, we have a uniform tree, and the convergence of this to the Brownian CRT is now well-known [CRT2, LeGall]. We hope for results with a similar flavour for positive excess k.

2-cores and kernels

First, we have to give a precise statement of what it means to study just the cycle structure of a connected component. From now on I will assume we are always working with a connected graph.

There are several equivalent definitions of the 2-core C(G) of a graph G:

• When the excess is positive, there are some cycles. The 2-core is the union of all edges which form part of some cycle, and any edges which lie on a path between two edges which both form part of some cycle.
• C(G) is the maximal induced subgraph where all degrees are at least two.
• If you remove all the leaves from the graph, then all the leaves from the remaining graph, and continue, the 2-core is the state you arrive at where there are no leaves.

It’s very helpful to think of the overall structure of the graph as consisting of the 2-core, with pendant trees ‘hanging off’ the 2-core. That is, we can view every vertex of the 2-core as the root of a (possibly size 1) tree. This is particular clear if we remove all the edges of the 2-core from the graph. What remains is a forest, with one tree for each vertex of the 2-core.

In general, the k-core is the maximal induced subgraph where all degrees are at least k. The core is generally taken to be something rather different. For this post (and any immediate sequels) I will never refer to the k-core for k>2, and certainly not to the traditional core. So I write ‘core’ for ‘2-core’.

As you can see in the diagram, the core consists of lots of paths, and topologically, the lengths of these paths are redundant. So we will often consider instead the kernel, K(G), which is constructed by taking the core and contracting all the paths between vertices of degree greater than 2. The resulting graph has minimal degree at least three. So far we’ve made no comment about the simplicity of the original graphs, but certainly the kernel need not be simple. It will regularly have loops and multiple edges. The kernel of the graph and core in the previous diagram is therefore this:

Kernels of critical components

To recap, we can deconstruct a connected graph as follows. It has a kernel, and each edge of the kernel is a path length of some length in the core. The rest of the graph consists of trees hanging off from the core vertices.

For now, we ask about the distribution of the kernel of a T(n,K). You might notice that the case k=1 is slightly awkward, as when the core consists of a single cycle, it’s somewhat ambiguous how to define the kernel. Everything we do is easily fixable for k=1, but rather than carry separate cases, we handle the case $k\ge 2$.

We first observe that fixing k doesn’t confirm the number of vertices or edges in the kernel. For example, both of the following pictures could correspond to k=3:

However, with high probability the kernel is 3-regular, which suddenly makes the previous post relevant. As I said earlier, it can introduce size-biasing errors to add the excess edges one-at-a-time, but these should be constant factor errors, not scaling errors. So imagine the core of a large graph with excess k=2. For the sake of argument, assume the kernel has the dumbbell / handcuffs shape. Now add an extra edge somewhere. It’s asymptotically very unlikely that this is incident to one of the two vertices with degree three in the core. Note it would need to be incident to both to generate the right-hand picture above. Instead, the core will gain two new vertices of degree three.

Roughly equivalently, once the size of the core is fixed (and large) we have to make a uniform choice from connected graphs of this size where almost every vertex has degree 2, and $\Theta(1)$ of the rest have degree 3 or higher. But the sum of the degrees is fixed, because the excess is fixed. If there are n vertices in the core, then there are $\Theta(n)$ more graphs where all the vertices have degree 2 or 3, than graphs where a vertex has degree at least 4. Let’s state this formally.

Proposition: The kernel of a uniform graph with n vertices and excess $k\ge 2$ is, with high probability as $n\rightarrow\infty$, 3-regular.

This proved rather more formally as part of Theorem 7 of [JKLP], essentially as a corollary after some very comprehensive generating function setup; and in [LPW] with a more direct computation.

In the previous post, we introduced the configuration model as a method for constructing regular graphs (or any graphs with fixed degree sequence). We observe that, conditional on the event that the resulting graph is simple, it is in fact uniformly-distributed among simple graphs. When the graph is allowed to be a multigraph, this is no longer true. However, in many circumstances, as remarked in (1.1) of [JKLP], for most applications the configuration model measure on multigraphs is the most natural.

Given a 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges, and K a uniform choice from the configuration model with these parameters, we have

$\mathbb{P}\left( K \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1},$

where t(H) is the number of loops in H, and mult(e) the multiplicity of an edge e. This might seem initially counter-intuitive, because it looks we are biasing against graphs with multiple edges, when perhaps our intuition is that because there are more ways to form a set of multiple edges we should bias in favour of it.

I think it’s most helpful to look at a diagram of a multigraph as shown, and ask how to assign stubs to edges. At a vertex with degree three, all stub assignments are different, that is 3!=6 possibilities. At the multiple edge, however, we care which stubs match with which stubs, but we don’t care about the order within the multi-edge. Alternatively, there are three choices of how to divide each vertex’s stubs into (2 for the multi-edge, 1 for the rest), and then two choices for how to match up the multi-edge stubs, ie 18 in total = 36/2, and a discount factor of 2.

We mention this because in fact K(T(n,k)) converges in distribution to this uniform configuration model. Once you know that K(T(n,k)) is with high probability 3-regular, then again it’s probably easiest to think about the core, indeed you might as well condition on its total size and number of degree 3 vertices. It’s then not hard to convince yourself that a uniform choice induces a uniform choice of kernel. Again, let’s state that as a proposition.

Proposition: For any H a 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges as before,

$\lim_{n\rightarrow\infty}\mathbb{P}\left( K(T(n,k)) \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1}.$

As we said before, the kernel describes the topology of the core. To reconstruct the graph, we need to know the lengths in the core, and then how to glue pendant trees onto the core. But this final stage depends on k only through the total length of paths in the core. Given that information, it’s a combinatorial problem, and while I’m not claiming it’s easy, it’s essentially the same as for the case with k=1, and is worth treating separately.

It is worth clarifying a couple of things first though. Even the outline of methods above relies on the fact that the size of the core diverges as n grows. Again, the heuristic is that up to size-biasing errors, T(n,k) looks like a uniform tree with some uniformly-chosen extra edges. But distances in T(n,k) scale like $n^{1/2}$ (and thus in critical components of G(N,p) scale like $N^{1/3}$). And the core will be roughly the set of edges on paths between the uniformly-chosen pairs of vertices, and so will also have length $\Theta(n^{1/2})$.

Once you have conditioned on the kernel structure, and the (large) number of internal vertices on paths in the core (ie the length of the core), it is natural that the assignment of the degree-2 vertices to core paths / kernel edges is uniform. A consequence of this is that if you record $(Y_1,\ldots,Y_m)$ the lengths of paths in the core, where m=3(k-1), then

$\frac{(Y_1,\ldots,Y_m)}{\sum Y_i} \stackrel{d}\rightarrow \mathrm{Dirichlet}(1,1,\ldots,1).$

This is stated formally as Corollary 7 b) of [ABG09]. It’s worth noting that this confirms that the lengths of core paths are bounded in probability away from zero after the appropriate rescaling. In seeking a metric scaling limit, this is convenient as it means there’s so danger that two of the degree-3 vertices end up in ‘the same place’ in the scaling limit object.

To recap, the only missing ingredients now to give a complete limiting metric description of T(n,k) are 1) a distributional limit of the total core length; 2) some appropriate description of set of pendant trees conditional on the size of the pendant forest. [ABG09] show the first of these. As remarked before, all the content of the second of these is encoded in the unicyclic k=1 case, which I have written about before, albeit slightly sketchily, here. (Note that in that post we get around size-biasing by counting a slightly different object, namely unicyclic graphs with an identified cyclic edge.)

However, [ABG09] also propose an alternative construction, which you can think of as glueing CRTs directly onto the stubs of the kernel (with the same distribution as before). The proof that this construction works isn’t as painful as one might fear, and allows a lot of the other metric distributional results to be read off as corollaries.

References

[ABG09] – Addario-Berry, Broutin, Goldschmidt – Critical random graphs: limiting constructions and distributional properties

[CRT2] – Aldous – The continuum random tree: II

[Al97] – Aldous – Brownian excursions, critical random graphs and the multiplicative coalescent

[JKLP] – Janson, Knuth, Luczak, Pittel – The birth of the giant component

[LeGall] – Le Gall – Random trees and applications

[LPW] – Luczak, Pittel, Wierman – The structure of a random graph at the point of the phase transition

# Generating uniform trees

A long time ago, I wrote quite a few a things about uniform trees. That is, a uniform choice from the $n^{n-2}$ unrooted trees with vertex set [n]. This enumeration, normally called Cayley’s formula, has several elegant arguments, including the classical Prufer bijection. But making a uniform choice from a large set is awkward, and so we seek more probabilistic methods to sample such a tree, which might also give insight into the structure of a ‘typical’ uniform tree.

In another historic post, I talked about the Aldous-Broder algorithm. Here’s a quick summary. We run a random walk on the complete graph $K_n$ started from a uniformly-chosen vertex. Every time we arrive at a vertex we haven’t visited before, we record the edge just traversed. Eventually we have visited all n vertices, so have recorded n-1 edges. It’s easy enough to convince yourself that these n-1 edges form a tree (how could there be a cycle?) and a bit more complicated to decide that the distribution of this tree is uniform.

It’s worth noting that this algorithm works to construct a uniform spanning tree on any connected base graph.

This post is about a few alternative constructions and interpretations of the uniform random tree. The first construction uses a Galton-Watson process. We take a Galton-Watson process where the offspring distribution is Poisson(1), and condition that the total population size is n. The resulting random tree has a root but no labels, however if we assign labels in [n] uniformly at random, the resulting rooted tree has the uniform distribution among rooted trees on [n].

Proof

This is all about moving from ordered trees to non-ordered trees. That is, when setting up a Galton-Watson tree, we distinguish between the following two trees, drawn extremely roughly in Paint:

That is, it matters which of the first-generation vertices have three children. Anyway, for such a (rooted) ordered tree T with n vertices, the probability that the Galton-Watson process ends up equal to T is

$\mathbb{P}(GW = T) = \prod_{v\in T} \frac{e^{-1}}{C(v)!} = e^{-n} \prod_{v\in T}\frac{1}{C(v)!},$

where $C(v)$ is the number of children of a vertex $v\in T$. Then, since $\mathbb{P}( |GW|=n )$ is a function of n, we find

$\mathbb{P}(GW=T \,\big|\, |GW|=n) = f(n)\prod_{v\in T} \frac{1}{C(v)!},$

where f(n) is a function of n alone (ie depends on T only through its size n).

But given an unordered rooted tree t, labelled by [n], there are $\prod_{v \in t} C(v)!$ ordered trees associated to t in the natural way. Furthermore, if we take the Poisson Galton-Watson tree conditioned to have total population size n, and label uniformly at random with [n], we obtain any one of these ordered trees with probability $\frac{f(n)}{n!} \prod_{v\in t} \frac{1}{C(v)!}$. So the probability that we have t after we forget about the ordering is $\frac{f(n)}{n!}$, which is a function of n alone, and so the distribution is uniform among the set of rooted unordered trees labelled by [n], exactly as required.

Heuristic for Poisson offspring distribution

In this proof, the fact that $\mathbb{P}(C(v)=k)\propto \frac{1}{k!}$ exactly balances the number of orderings of the k children explains why Poisson(1) works out. Indeed, you can see in the proof that Poisson(c) works equally well, though when $c\ne 1$, the event we are conditioning on (namely that the total population size is n) has probability decaying exponentially in n, whereas for c=1, the branching process is critical, and the probability decays polynomially.

We can provide independent motivation though, from the Aldous-Broder construction. Both the conditioned Galton-Watson construction and the A-B algorithm supply the tree with a root, so we’ll keep that, and look at the distribution of the degree of the root as constructed by A-B. Let $\rho=v_1,v_2,v_3,\ldots$ be the vertices [n], ordered by their discovery during the construction. Then $\rho$ is definitely connected by an edge to $v_2$, but thereafter it follows by an elementary check that the probability $\rho$ is connected to $v_m$ is $\frac{1}{n-1}$, independently across all m. In other words, the distribution of the degree of $\rho$ in the tree as constructed by A-B is

$1+ \mathrm{Bin}\left(n-2,\frac{1}{n-1}\right) \approx 1+\mathrm{Poisson}(1).$

Now, in the Galton-Watson process, conditioning the tree to have fixed, large size changes the offspring distribution of the root. Conveniently though, in a limiting sense it’s the same change as conditioning the tree to have size at least n. Since these events are monotone in n, it’s possible to take a limit of the conditioning events, and interpret the result as the Galton-Watson tree conditioned to survive. It’s a beautiful result that this interpretation can be formalised as a local limit. The limiting spine decomposition consists of an infinite spine, where the offspring distribution is a size-biased version of the original offspring distribution (and so in particular, always has at least one child) and where non-spine vertices have the original distribution.

In particular, the number of the offspring of the root is size-biased, and it is well-known and not hard to check that size-biasing Poisson(c) gives 1+Poisson(c) ! So in fact we have, in an appropriate limiting sense in both objects, a match between the degree distribution of the root in the uniform tree, and in the conditioned Galton-Watson tree.

This isn’t supposed to justify why a conditioned Galton-Watson tree is relevant a priori (especially the unconditional independence of degrees), but it does explain why Poisson offspring distributions are relevant.

Construction via G(N,p) and the random cluster model

The main reason uniform trees were important to my thesis was their appearance in the Erdos-Renyi random graph G(N,p). The probability that vertices {1, …, n} form a tree component in G(N,p) with some particular structure is

$p^{n-1} (1-p)^{\binom{n}{2}-(n-1)} \times (1-p)^{n(N-m)}.$

Here, the first two terms give the probability that the graph structure on {1, …, n} is correct, and the the final term gives the probability of the (independent) event that these vertices are not connected to anything else in the graph. In particular, this has no dependence on the tree structure chosen on [n] (for example, whether it should be a path or a star – both examples of trees). So the conditional distribution is uniform among all trees.

If we work in some limiting regime, where $pn\rightarrow 0$ (for example if n is fixed and $p=\frac{1}{N}\rightarrow 0$), then we can get away asymptotically with less strong conditioning. Suppose we condition instead just that [n] form a component. Now, there are more ways to form a connected graph with one cycle on [n] than there are trees on [n], but the former all require an extra edge, and so the probability that a given one such tree-with-extra-edge appears as the restriction to [n] in G(N,p) is asymptotically negligible compared to the probability that the restriction to [n] of G(N,p) is a tree. Naturally, the local limit of components in G(N,c/N) is a Poisson(c) Galton-Watson branching process, and so this is all consistent with the original construction.

One slightly unsatisfying aspect to this construction is that we have to embed the tree of size [n] within a much larger graph on [N] to see uniform trees. We can’t choose a scaling p=p(n) such that G(n,p) itself concentrates on trees. To guarantee connectivity with high probability, we need to take $p> \frac{\log n}{n}$, but by this threshold, the graph has (many) cycles with high probability.

At this PIMS summer school in Vancouver, one of the courses is focusing on lattice spin models, including the random cluster model, which we now briefly define. We start with some underlying graph G. From a physical motivation, we might take G to be $\mathbb{Z}^d$ or some finite subset of it, or a d-ary tree, or the complete graph $K_N$. As in classical bond percolation (note G(N,p) is bond percolation on $K_N$), a random subset of the edges of G are included, or declared open. The probability of a given configuration w, with e open edges is proportional to

$p^e (1-p)^{|E(G)| - e} q^{k(w)},$ (*)

where the edge-weight $p\in(0,1)$ as usual, and cluster weight $q\in (0,\infty)$, and $k(w)$ counts the number of connected components in configuration w. When q=1, we recover classical bond percolation (including G(N,p) ), while for q>1, this cluster-reweighting favours having more components, and q<1 favours fewer components. Note that in the case $q\ne 1$, the normalising constant (or partition function) of (*) is generally intractable to calculate explicitly.

As in the Erdos-Renyi graph, consider fixing the underlying graph G, and taking $p\rightarrow 0$, but also taking $\frac{q}{p}\rightarrow 0$. So the resulting graph asymptotically ‘wants to have as few edges as possible, but really wants to have as few components as possible’. In particular, 1) all spanning trees of G are equally likely; 2) any configuration with more than one component has asymptotically negligible probability relative to any tree; 3) any graph with a cycle has #components + #edges greater than that of a tree, and so is asymptotically negligible probability relative to any tree.

In other words, the limit of the distribution is the uniform spanning tree of G, and so this (like Aldous-Broder) is a substantial generalisation, which constructs the uniform random tree in the special case where $G=K_n$.

# Exploring the Supercritical Random Graph

I’ve spent a bit of time this week reading and doing all the exercises from some excellent notes by van der Hofstad about random graphs. I think they are absolutely excellent and would not be surprised if they become the standard text for an introduction to probabilistic combinatorics. You can find them hosted on the author’s website. I’ve been reading chapters 4 and 5, which approaches the properties of phase transitions in G(n,p) by formalising the analogy between component sizes and population sizes in a binomial branching process. When I met this sort of material for the first time during Part III, the proofs generally relied on careful first and second moment bounds, which is fine in many ways, but I enjoyed vdH’s (perhaps more modern?) approach, as it seems to give a more accurate picture of what is actually going on. In this post, I am going to talk about using the branching process picture to explain why the giant component emerges when it does, and how to get a grip on how large it is at any time after it has emerged.

Background

A quick tour through the background, and in particular the notation will be required. At some point I will write a post about this topic in a more digestible format, but for now I want to move on as quickly as possible.

We are looking at the sparse random graph $G(n,\frac{\lambda}{n})$, in the super-critical phase $\lambda>1$. With high probability (that is, with probability tending to 1 as n grows), we have a so-called giant component, with O(n) vertices.

Because all the edges in the configuration are independent, we can view the component containing a fixed vertex as a branching process. Given vertex v(1), the number of neighbours is distributed like $\text{Bi}(n-1,\frac{\lambda}{n})$. The number of neighbours of each of these which we haven’t already considered is then $\text{Bi}(n-k,\frac{\lambda}{n})$, conditional on k, the number of vertices we have already discounted. After any finite number of steps, k=o(n), and so it is fairly reasonable to approximate this just by $\text{Bi}(n,\frac{\lambda}{n})$. Furthermore, as n grows, this distribution converges to $\text{Po}(\lambda)$, and so it is natural to expect that the probability that the fixed vertex lies in a giant component is equal to the survival probability $\zeta_\lambda$ (that is, the probability that it is infinite) of a branching process with $\text{Po}(\lambda)$ offspring distribution. Note that given a graph, the probability of a fixed vertex lying in a giant component is equal to the fraction of the vertex in the giant component. At this point it is clear why the emergence of the giant component must happen at $\lambda=1$, because we require $\mathbb{E}\text{Po}(\lambda)>1$ for the survival probability to be non-zero. Obviously, all of this needs to be made precise and rigorous, and this is treated in sections 4.3 and 4.4 of the notes.

Exploration Process

A common functional of a rooted branching process to consider is the following. This is called in various places an exploration process, a depth-first process or a Lukasiewicz path. We take a depth-first labelling of the tree v(0), v(1), v(2),… , and define c(k) to be the number of children of vertex v(k). We then define the exploration process by:

$S(0)=0,\quad S(k+1)=S(k)+c(k)-1.$

By far the best way to think of this is to imagine we are making the depth-first walk on the tree. S(k) records how many vertices we have seen (because they are connected by an edge to a vertex we have visited) but have not yet visited. To clarify understanding of the definition, note that when you arrive at a vertex with no children, this should decrease by one, as you can see no new vertices, but have visited an extra one.

This exploration process is useful to consider for a couple of reasons. Firstly, you can reconstruct the branching process directly from it. Secondly, while other functionals (eg the height, or contour process) look like random walks, the exploration process genuinely is a random walk. The distribution of the number of children of the next vertex we arrive at is independent of everything we have previously seen in the tree, and is the same for every vertex. If we were looking at branching processes in a different context, we might observe that this gives some information in a suitably-rescaled limit, as rescaled random walks converge to Brownian motion if the variance of the (offspring) distribution is finite. (This is Donsker’s result, which I should write something about soon…)

The most important property is that the exploration process returns to 0 precisely when we have exhausted all the vertices in a component. At that point, we have seen exactly the vertices which we have explored. There is no reason not to extend the definition to forests, that is a union of trees. The depth-first exploration is the same – but when we have exhausted one component, we move onto another component, chosen according to some labelling property. Then, running minima of the exploration process (ie times when it is smaller than it has been before) correspond to jumping between components, and thus excursions above the minimum to components themselves. The running minimum will be non-positive, with absolute value equal to the number of components already exhausted.

Although the exploration process was defined with reference to and in the language of trees, the result of a branching process, this is not necessary. With some vertex denoted as the root, we can construct a depth-first labelling of a general graph, and the exploration process follows exactly as before. Note that we end up ignoring all edges except a set that forms a forest. This is what we will apply to G(n,p).

Exploring G(n,p)

When we jump between components in the exploration process on a supercritical (that is $\lambda>1$) random graph, we move to a component chosen randomly with size-biased distribution. If there is a giant component, as we know there is in the supercritical case, then this will dominate the size-biased distribution. Precisely, if the giant component takes up a fraction H of the vertices, then the number of components to be explored before we get to the giant component is geometrically distributed with parameter H. All other components have size O(log n), so the expected number of vertices explored before we get to the giant component is O(log n)/H = o(n), and so in the limit, we explore the giant component immediately.

The exploration process therefore gives good control on the giant component in the limit, as roughly speaking the first time it returns to 0 is the size of the giant component. Fortunately, we can also control the distribution of S_t, the exploration process at time t. We have that:

$S_t+(t-1)\sim \text{Bi}(n-1,1-(1-p)^t).$

This is not too hard to see. $S_t+(t-1)$ is number of vertices we have explored or seen, ie are connected to a vertex we have explored. Suppose the remaining vertices are called unseen, and we began the exploration at vertex 1. Then any vertex with label in {2,…,n} is unseen if it successively avoids being in the neighbourhood of v(1), v(2), … v(t). This happens with probability $(1-p)^t$, and so the probability of being an explored or seen vertex is the complement of this.

In the supercritical case, we are taking $p=\frac{\lambda}{n}$ with $\lambda>1$, and we also want to speed up S, so that all the exploration processes are defined on [0,1], and rescale the sizes by n, so that it records the fraction of the graph rather than the number of vertices. So we set consider the rescaling $\frac{1}{n}S_{nt}$.

It is straightforward to use the distribution of S_t we deduce that the asymptotic mean $\mathbb{E}\frac{1}{n}S_{nt}=\mu_t = 1-t-e^{-\lambda t}$.

Now we are in a position to provide more concrete motivation for the claim that the proportion of vertices in the giant component is $\zeta_\lambda$, the survival probability of a branching process with $\text{Po}(\lambda)$ offspring distribution. It helps to consider instead the extinction probability $1-\zeta_\lambda$. We have:

$1-\zeta_\lambda=\sum_{k\geq 0}\mathbb{P}(\text{Po}(\lambda)=k)(1-\zeta_\lambda)^k=e^{-\lambda\zeta_\lambda},$

where the second equality is a consequence of the simple form for the moment generating function of the Poisson distribution.

As a result, we have that $\mu_{\zeta_\lambda}=0$. In fact we also have a central limit theorem for S_t, which enables us to deduce that $\frac{1}{n}S_{n\zeta_\lambda}=0$ with high probability, as well as in expectation, which is precisely what is required to prove that the giant component of $G(n,\frac{\lambda}{n})$ has size $n(\zeta_\lambda+o(1))$.

# The Inspection Paradox and related topics

In the final class for Applied Probability, we discussed the so-called Inspection Paradox for an arrivals process. We assume that buses, sat, arrive as a Poisson process with rate 1, and consider the size of the interval (between buses) containing some fixed time T. The ‘paradox’ is that the size of this interval is larger in expectation than the average time between buses, which of course is given by an exponential random variable.

As with many paradoxes, it isn’t really that surprising after all. Perhaps what is more surprising is that the difference between the expected observed interval and the expected actual interval time is quite small here. There are several points of interest:

1) The Markov property of the Poisson process is key. In particular, this says that the expectation (and indeed the distribution) of the waiting time for a given customer arriving at T is not dependent on T, even if T is a random variable (or rather, a class of random variables, the stopping times). So certainly the inspection paradox property will hold whenever the process has the Markov property, because the inspected interval contains the inspected waiting time, which is equal in distribution to any fixed interval.

2) Everything gets slightly complicated by the fact that the Poisson process is defined to begin at 0. In particular, it is not reversible. Under the infinitesimal (or even the independent Poisson increments) definition, we can view the Poisson process not as a random non-decreasing function of time, but rather as a random collection of points on the positive reals. With this setup, it is clearly no problem to define instead a random collection of points on all the reals. [If we consider this instead as a random collection of point masses, then this is one of the simplest examples of a random measure, but that’s not hugely relevant here.]

We don’t need to worry too much about what value the Poisson process takes at any given time if we are only looking at increments, but if it makes you more comfortable, you can still safely assume that it is 0 at time 0. Crucially, the construction IS now reversible. The number of points in the interval [s,t] has distribution parameterised by t-s, so we it doesn’t matter which direction we are moving in down the real line. In this case, A_t, the time since the previous arrival, and E_t, the waiting time until the next arrival, are both Exp(1) RVs, as the memorylessness property applies in each time direction.

For the original Poisson process, we actually have A_t stochastically dominated by an Exp(1) distribution, because it is conditioned to be less than or equal to t. So in this case, the expected interval time is some complicated function of t, lying strictly between 1 and 2. In our process extended to the whole real line, the expected interval time is exactly 2.

This idea of extending onto the whole real line explains why we mainly consider delayed renewal processes rather than just normal renewal processes. The condition that we start a holding time at 0 is often not general enough, particularly when the holding times are not exponential and so the process is not memoryless.

3) There is a general size-biasing principle in action here. Roughly speaking, we are more likely to arrive in a large interval than in a small interval. The scaling required is proportional to the length of the interval. Given a density function f(x) of X, we define the size-biased density function to be xf(x). We need to normalise to give a probability distribution, and dividing by the expectation EX is precisely what is needed. Failure to account for when an observation should have the underlying distribution or the size-biased distribution is a common cause of supposed paradoxes. A common example is ‘on average my friends have more friends than I do’. Obviously, various assumption on me and my friends, and how we are drawn from the set of people (and the distribution of number of friends) is required that might not necessarily be entirely accurate in all situations.

In the Poisson process example above, the holding times have density function $e^{-x}$, so the size-biased density function if $xe^{-x}$. This corresponds to a $\Gamma(2,1)$ distribution, which may be written as the sum of two independent Exp(1) RVs as suggested above.

4) A further paradox mentioned on the sheet is the waiting time paradox. This says that the expected waiting time is longer if you arrive at a random time than if you just miss a bus. This is not too surprising: consider at least the stereotypical complaint about buses in this country arriving in threes, at least roughly. Then if you just miss a bus, there is a 2/3 chance that another will be turning up essentially immediately. On the sheet, we showed that the $\Gamma(\alpha,1)$ distribution has this property also, provided $\alpha<1$.

We can probably do slightly better than this. The memoryless property of the exponential distribution says that:

$\mathbb{P}(Z>t+s|Z>t)=\mathbb{P}(Z>s).$

In general, for the sort of holding times we might see at a bus stop, we might expect it to be the case that if we have waited a long time already, then we are less likely relatively to have to wait a long time more, that is:

$\mathbb{P}(Z>t+s|Z>t)\leq\mathbb{P}(Z>s),$

and furthermore this will be strict if neither s nor t is 0. I see no reason not to make up a definition, and call this property supermemorylessness. However, for the subclass of Gamma distributions described above, we have the opposite property:

$\mathbb{P}(Z>t+s|Z>t)\geq\mathbb{P}(Z>s).$

Accordingly, let’s call this submemorylessness. If this is strict, then it says that we are more likely to have to wait a long time if we have already been waiting a long time. This seems contrary to most real-life distributions, but it certainly explains the paradox. If we arrive at a random time, then the appropriate holding time has been ‘waiting’ for a while, so is more likely to require a longer observed wait than if I had arrived as the previous bus departed.

In conclusion, before you think of something as a paradox, think about whether the random variables being compared are being generated in the same way, in particular whether one is size-biased, and whether there are effects due to non-memorylessness.