# The Yule Process

The second problem sheet for classes on the Applied Probability course this term features a long question about the Yule process. This is probably the simplest example of a birth process. It’s named for the British statistician George Udny Yule, though some sources prefer to call it the Yule-Furry process for the American physicist Wendell Furry who used it as a model of a radioactive reaction.

The model is straightforward. At any time there is some number of individuals in the population, and each individual gives birth to an offspring at constant rate $\lambda$, independently from the rest of the population. After a birth has happened, the parent and child evolve independently. In the notation of general birth processes, the birth rate when there are n individuals is $\lambda_n=\lambda n$.

Note that if we start with two or more individuals, the sizes of the two or more families of descendents evolve as a continuous-time Polya’s urn. The arrivals process speeds up with time, but the jump chain is exactly Polya’s urn. Unsurprisingly, the Yule process can be found embedded in preferential attachment models, and other processes which are based around Polya’s urn with extra information.

This is a discrete, random version of exponential growth. Since the geometric distribution is the discrete analogue of the exponential distribution, we probably shouldn’t be surprised to learn that this is indeed the distribution of the process at some fixed time t, when it is started from a single original ancestor. This is all we care about, since the numbers of descendents of the different original ancestors are independent. In general, the distribution of the population size at some fixed time will be negative binomial, that is, the distribution of a sum of IID geometric random variables.

The standard method here is to proceed using generating functions. Conditioning on the first splitting time gives two independent copies of the original process over a shorter time-scale. One derives an ODE in time for the generating function evaluated at any particular value z. This can be solved uniquely for each z, and patching together gives the generating function of the distribution at any specific time t, which can be seen to coincide with the corresponding generating function of the geometric distribution with parameter $e^{-\lambda t}$.
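For concreteness, here is a sketch of how that calculation goes. Write $X_t$ for the population size at time t started from one ancestor, and $G_t(z)=\mathbb{E}[z^{X_t}]$ for its generating function (the symbol $G_t$ is introduced just for this sketch). Conditioning on the first splitting time, which is $\mathrm{Exp}(\lambda)$, gives

$G_t(z)=ze^{-\lambda t}+\int_0^t \lambda e^{-\lambda s}G_{t-s}(z)^2\,ds.$

Substituting $u=t-s$ and differentiating in t yields

$\frac{\partial}{\partial t}G_t(z)=\lambda G_t(z)\left(G_t(z)-1\right),\qquad G_0(z)=z,$

whose solution is

$G_t(z)=\frac{ze^{-\lambda t}}{1-z(1-e^{-\lambda t})},$

which is the generating function of a geometric distribution on $\{1,2,\ldots\}$ with success probability $e^{-\lambda t}$.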

So we were trying to decide whether there might be a more heuristic argument for this geometric distribution. The method we came up with is not immediate, but does justify the geometric distribution in a couple of steps. First, we say that the birth times are $T_2,T_3,\ldots$, so between times $[T_n,T_{n+1})$ there are n individuals, with $T_1:=0$ for concreteness. Then by construction of the birth process, $T_{n+1}-T_n\stackrel{d}{=}\mathrm{Exp}(\lambda n)$.

We now look at these ‘inter-birth times’ backwards, starting from $T_{n+1}$. Note that $\mathrm{Exp}(\lambda n)$ is the distribution of the time for the first of n IID $\mathrm{Exp}(\lambda)$ clocks to ring. Looking backwards, the next inter-birth time then has the distribution of the time for the first of (n-1) IID $\mathrm{Exp}(\lambda)$ clocks to ring. So by memorylessness of the exponential distribution (discussed at great length on the first problem sheet), we can actually take these (n-1) clocks to be exactly those of the original n clocks which did not ring first. Continuing this argument, we can show that the first (in the original time direction) inter-birth time corresponds to the time spent waiting for the final clock to ring. Rewriting this observation formally:

$T_{n+1}\stackrel{d}{=}\max\{X_i : X_1,\ldots,X_n\stackrel{\text{iid}}{\sim}\mathrm{Exp}(\lambda)\}.$ (*)

To return to justifying the geometric form of the distribution, we need to clarify the easiest relationship between the population size at a fixed time, which we write as $X_t$, and these birth times. As we are aiming for the geometric distribution, the probability of the event $\{X_t>n\}$ will be most useful. Clearly this event is the same as $\{T_{n+1}\le t\}$, and from the description involving maxima of IID exponentials, this is easy to compute as $(1-e^{-\lambda t})^n$, which is exactly what we want.
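As a sanity check on this tail formula, here is a minimal simulation sketch. It assumes nothing beyond the rates $\lambda_n=\lambda n$ described above; the function names and the choice $\lambda=t=1$ are mine, purely for illustration.

```python
import math
import random

def yule_population(lam, t, rng):
    """Population at time t of a Yule process started from one individual."""
    n, clock = 1, 0.0
    while True:
        # With n individuals, the next birth arrives after an Exp(lam * n) time.
        clock += rng.expovariate(lam * n)
        if clock > t:
            return n
        n += 1

def geometric_tail(lam, t, n):
    """P(X_t > n) for the geometric law with success probability exp(-lam * t)."""
    return (1.0 - math.exp(-lam * t)) ** n

rng = random.Random(0)
lam, t, runs = 1.0, 1.0, 20000
samples = [yule_population(lam, t, rng) for _ in range(runs)]
for n in range(4):
    empirical = sum(x > n for x in samples) / runs
    print(n, round(empirical, 3), round(geometric_tail(lam, t, n), 3))
```

The two printed columns should agree up to Monte Carlo error, and the sample mean should be close to $e^{\lambda t}$.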

There are two interesting couplings hidden in these constructions. On closer inspection they turn out to be essentially the same from two different perspectives.

We have specified the distribution of $T_{n+1}$ at (*). Look at this distribution on the right hand side. There is a very natural way to couple these distributions for all n, namely to take some infinite sequence $X_1,X_2,\ldots$ of IID $\mathrm{Exp}(\lambda)$ random variables, then use initial sequences of these to generate each of the $T_n$s as described in (*).

Does this coupling correspond to the use of these IID RVs in the birth process? Well, in fact it doesn’t. Examining the argument, we can see that $X_1$ gives a different inter-birth time for each value of n in the correspondence proposed. Even more concretely, in the birth process, almost surely $T_{n+1}>T_n$ for each n. This is not true if we take the canonical coupling of (*). Here, if $X_n<\max\{X_1,\ldots,X_{n-1}\}$, which happens with high probability for large n, we have $T_{n+1}=T_n$ in the process of running maxima.
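To see how badly the canonical coupling fails to look like a birth process, here is a small sketch that builds the running maxima from one IID exponential sequence and counts how often consecutive values coincide. The sequence length and seed are arbitrary choices of mine.

```python
import random

def running_maxima(xs):
    """Return [max(xs[:1]), max(xs[:2]), ...], i.e. the coupled T_2, T_3, ... of (*)."""
    out, current = [], float("-inf")
    for x in xs:
        current = max(current, x)
        out.append(current)
    return out

rng = random.Random(1)
lam, n = 1.0, 1000
xs = [rng.expovariate(lam) for _ in range(n)]
maxima = running_maxima(xs)
# A tie between consecutive running maxima means T_(k+1) = T_k in the coupling.
ties = sum(1 for a, b in zip(maxima, maxima[1:]) if a == b)
print(f"{ties} of {n - 1} consecutive pairs are equal")
```

Almost all consecutive pairs are equal, since a new maximum only appears at the record times of the sequence, and records are rare.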

Perhaps more interestingly, we might observe that this birth process gives a coupling of the geometric distributions. If we want to recover the standard parameterisation of the geometric distribution, we should reparameterise time. [And thus generate an essentially inevitable temptation to make some joke about now having a Yule Log process.]

Let’s consider what the standard coupling might be. For a binomial random variable, either on [n] or some more exotic set, as in percolation, we can couple across all values of the parameter by constructing a family of independent uniform random variables, and returning a 1 if $U_i>1-p$ and so on, where p is the parameter of a specific binomial realisation.

We can do exactly the same here. A geometric distribution can be justified as the first success in a sequence of Bernoulli trials, so again we can replace the relevant Bernoulli distribution with a uniform distribution. Take $U_1,U_2,\ldots$ to be IID U[0,1] random variables. Then, we have:

$X_t\stackrel{d}{=}\bar X_t:= \max\{n: U_1,\ldots,U_{n-1}\ge e^{-\lambda t}\}.$

The equality in distribution holds for any particular value of t by construction. But it certainly doesn’t hold uniformly in t. Note that if we define $\bar X_t$ as a process, then typically the jumps of this process will be greater than 1, which is forbidden in the Yule process.
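Here is a sketch of this uniform coupling, evaluating $\bar X_t$ on a grid of times from one fixed sequence of uniforms. The grid spacing, the sequence length and the name `bar_x` are illustrative assumptions; the definition of $\bar X_t$ as the index of the first uniform below $e^{-\lambda t}$ is just a restatement of the display above.

```python
import math
import random

def bar_x(us, lam, t):
    """Index of the first uniform below exp(-lam * t), i.e. the coupled geometric variable."""
    threshold = math.exp(-lam * t)
    for i, u in enumerate(us, start=1):
        if u < threshold:
            return i
    raise ValueError("need a longer sequence of uniforms")

rng = random.Random(2)
us = [rng.random() for _ in range(100_000)]
# Evaluate the coupled process at t = 0.05, 0.10, ..., 5.00 with lam = 1.
values = [bar_x(us, 1.0, 0.05 * k) for k in range(1, 101)]
# Jumps seen on the grid; a coarse grid can only merge jumps, never split them.
jumps = [b - a for a, b in zip(values, values[1:]) if b != a]
print(values[::10])
print("jump sizes:", jumps)
```

The path is non-decreasing in t, as it must be, but the jump sizes are not restricted to 1.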

So, we have seen that this Yule process, even though its distribution at a fixed time has a standard form, provides a coupling of such distributions that is perhaps slightly surprising.

# Poisson Random Measures

[This is a companion to the previous post. They explore different aspects of the same problem which I have been thinking about from a research point of view. So that they can be read independently, there has inevitably been some overlap.]

As I explained in passing previously, Poisson Random Measures have come up in my current research project. Indeed, the context where they have appeared seems like a very good motivation for considering the construction and some properties of PRMs.

We begin not with a Poisson variable, but with a standard Erdos-Renyi random graph $G(n,\frac{c}{n})$. The local limit of a component in this random graph is given by a Galton-Watson branching process with Poisson(c) offspring distribution. Recall that a local limit is a description of what the structure looks like near a given (or random) vertex. Since the vertices in G(n,p) are exchangeable, this rooting matters less. Anyway, the number of neighbours in the graph of our root is given by Bin(n-1,c/n). Suppose that the root v_0 has k neighbours. Then if we are just interested in determining the vertices in the component, we can ignore the possibility of further edges between these neighbours. So if we pick one of the neighbours of the root, say v_1, and count the number of neighbours of this vertex that we haven’t already considered, this is distributed as Bin(n-1-k,c/n), since we discount the root and the k neighbours of the root.

Then, as n grows large, Bin(n-1,c/n) converges in distribution to Po(c). Except on a very unlikely event whose probability we can control if we need, so does Bin(n-1-k,c/n). Indeed if we consider a set of K vertices which are already connected in some way, then the distribution of the number of neighbours of one of them which we haven’t already considered is still Po(c) in the limit.
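To put a number on this convergence, here is a sketch that computes the total variation distance between Bin(n-1,c/n) and Po(c) exactly, up to truncating both mass functions at a cutoff `kmax`. The cutoff and the value c=2 are my choices for illustration.

```python
import math

def binomial_pmf(m, p, kmax):
    """P(Bin(m, p) = k) for k = 0..kmax, via the ratio of successive terms."""
    pmf = [(1.0 - p) ** m]
    for k in range(kmax):
        pmf.append(pmf[-1] * (m - k) / (k + 1) * p / (1.0 - p))
    return pmf

def poisson_pmf(c, kmax):
    """P(Po(c) = k) for k = 0..kmax."""
    pmf = [math.exp(-c)]
    for k in range(kmax):
        pmf.append(pmf[-1] * c / (k + 1))
    return pmf

def tv_distance(n, c, kmax=60):
    """Total variation distance between Bin(n-1, c/n) and Po(c), truncated at kmax."""
    b = binomial_pmf(n - 1, c / n, kmax)
    q = poisson_pmf(c, kmax)
    return 0.5 * sum(abs(x - y) for x, y in zip(b, q))

for n in (10, 100, 1000, 10000):
    print(n, tv_distance(n, 2.0))
```

The distance shrinks roughly like 1/n, which is the sort of control referred to above.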

Now we consider what happens if we declare the graph to be inhomogeneous. The simplest possible way to achieve this is to specify two types of vertices, say type A and type B. Then we specify the proportion of vertices of each type, and the probability that there is an edge between two vertices of given types. This is best given by a symmetric matrix. So for example, if we wanted a random bipartite graph, we could achieve this as described by setting all the diagonal entries of the matrix to be zero.

So does the local limit extend to this setting? Yes, unsurprisingly it does. To be concrete, let’s say that the proportions of types A and B are a and b respectively, and the probabilities of having edges between vertices of various types are given by $P=(p_{ij}/n)_{i,j\in\{A,B\}}$. So we can proceed exactly as before, only now we have to count how many type A neighbours and how many type B neighbours we see at all stages. We have to specify the type of our starting vertex. Suppose for now that it is type A. Then the number of type A neighbours is distributed as

$\text{Bin}(an,p_{AA}/n)\stackrel{d}{\rightarrow}\text{Po}(ap_{AA})$,

and similarly the limiting number of type B neighbours is $\sim \text{Po}(bp_{AB})$. Crucially, this is independent of the number of type A neighbours. The argument extends naturally to later generations, and the result is exactly a multitype Galton-Watson process as defined in the previous post.

My motivating model is the forest fire. Here, components get burned when they are large and reduced to singletons. It is therefore natural to talk about the ‘age’ of a vertex, that is, how long has elapsed since it was last burned. If we are interested in the forest fire process at some fixed time T>1, that is, once burning has started, then we can describe it as an inhomogeneous random graph, given that we know the ages of the vertices.

For, given two vertices with ages s and t, where WLOG s<t, we know that the older vertex could not have been joined to the other vertex between times T-t and T-s. Why? Well, if it had, then it too would have been burned at time T-s when the other vertex was burned. So the only possibility is that they might have been joined by an edge between times T-s and T. Since each edge arrives at rate 1/n, the probability that this happens is $1-e^{-s/n}\approx \frac{s}{n}$. Indeed, in general the probability that two vertices of ages s and t are joined at time T is $\frac{s\wedge t}{n}$.

Again at fixed time T>1, the sequence of ages of the vertices converges weakly to some fixed distribution (which depends on T) as the number of vertices grows to infinity. We can then recover the graph structure by assigning ages according to this distribution, then growing the inhomogeneous random graph with the kernel as described. The question is: when we look for a local limit, how do we describe the offspring distribution?

Note that in the limit, components will be burned continuously, so the distribution of possible ages is continuous (with an atom at T for those vertices which have never been burned). So if we try to calculate the distribution of the number of neighbours of age s, we are going to be doomed, because with probability 1 there is no vertex of age s anywhere!

The answer is that the offspring distribution is given by a Poisson Random Measure. You can think of this as a Poisson Point Process where the intensity is non-constant. For example, let us consider how many neighbours we expect to have with ages [s,s+ds]. Let us suppose the age of our root is t>s+ds for now. Assuming the distribution of ages, $f(\cdot)$, is positive and continuous, the number of vertices with these ages in the system is roughly nf(s)ds, and so the number of neighbours with this property is roughly $\text{Bin}(nf(s)ds,\frac{s}{n})$. In particular, this does have a Poisson limit. We need to be careful about whether this Poisson limit is preserved by the approximation. In fact this is fine. Let’s assume WLOG that f is increasing at s. Then the number of age [s,s+ds] neighbours can be stochastically bounded between $\text{Bin}(nf(s)ds,\frac{s}{n})$ and $\text{Bin}(nf(s+ds)ds,\frac{s+ds}{n})$. As n grows, these converge in distribution to two Poisson random variables, and then we can let ds go to zero. Note for full formalism, we may need to account for the large deviations event that the number of age s vertices in the system is noticeably different from its expectation. Whether this is necessary depends on whether the ages are assigned deterministically, or drawn IID-ly from f.

One important result to be drawn from this example is that the number of offspring from disjoint type sets, say $[s_1,s_2], [t_1,t_2]$ are independent, for the same reason as in the two-type setting, namely that the underlying binomial variables are independent. We are, after all, testing different sets of vertices! The other is that the number of neighbours with ages in some range is Poisson. Notice that these two results are consistent. The number of neighbours with ages in the set $[s_1,s_2]\cup [t_1,t_2]$ is given by the sum of two independent Poisson RVs, and hence is Poisson itself. The parameter of the sum RV is given by the sum of the original parameters.

These are essentially all the ingredients required for the definition of a Poisson Random Measure. Note that the set of offspring is a measure on the space of ages, or types. (Obviously, this isn’t a probability measure.) We take a general space E, with sigma algebra $\mathcal{E}$, and an underlying measure $\mu$ on E. We want a distribution $\nu$ for measures on E, such that for each measurable set $A\in\mathcal{E}$, $\nu(A)$, which is random because $\nu$ is, is distributed as $\text{Po}(\mu(A))$, and furthermore, for disjoint $A,B\in\mathcal{E}$, the random variables $\nu(A),\nu(B)$ are independent.

If $M=\mu(E)<\infty$, then constructing such a random measure is not too hard using a thinning property. We know that $\nu(E)\stackrel{d}{=}\text{Po}(M)$, and so if we sample a Poisson(M) number of IID RVs with distribution given by $\frac{\mu(\cdot)}{M}$, and place a unit atom at each, we get precisely the desired PRM. Proving this is the unique distribution with this property is best done using properties of the Laplace transform, which uniquely defines the law of a random measure in the same manner that the moment generating function defines the law of a random variable. Here the argument is a function, rather than a single variable for the MGF, reflecting the fact that the space of measures is a lot ‘bigger’ than the reals, where a random variable is supported. We can extend this construction to sigma-finite spaces, that is, countable unions of sets of finite measure.
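The finite-mass construction is short enough to sketch in code. The example intensity here is an assumption of mine: E=[0,1] with $\mu$ having density 2Mx, so that $\mu/M$ can be sampled as the square root of a uniform. The Poisson sampler counts rate-1 exponential arrivals, since the standard library has none built in.

```python
import math
import random

def sample_poisson(mean, rng):
    """Poisson(mean) sample: count rate-1 exponential arrivals up to time `mean`."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def sample_prm(total_mass, sample_point, rng):
    """Atoms of a PRM with finite total mass M; sample_point draws from mu / M."""
    return [sample_point(rng) for _ in range(sample_poisson(total_mass, rng))]

rng = random.Random(3)
total_mass, runs = 5.0, 20000
low_counts, high_counts = [], []
for _ in range(runs):
    # Hypothetical intensity: density 2 * M * x on [0, 1], sampled by inverse CDF.
    atoms = sample_prm(total_mass, lambda r: math.sqrt(r.random()), rng)
    low = sum(1 for a in atoms if a <= 0.5)
    low_counts.append(low)
    high_counts.append(len(atoms) - low)
mean_low = sum(low_counts) / runs
mean_high = sum(high_counts) / runs
print(mean_low, "vs mu([0, 0.5]) =", total_mass * 0.25)
print(mean_high, "vs mu((0.5, 1]) =", total_mass * 0.75)
```

The empirical mean counts in the two disjoint sets should match $\mu$ of each set up to sampling error.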

One nice result about Poisson random measures concerns the expectation of functions evaluated at such a random measure. Recall that some function f evaluated at the measure $\sum \delta_{x_i}$ is given by $\sum f(x_i)$. Then, subject to mild conditions on f, the expectation

$\mathbb{E}\nu (f)=\mu(f).$

Note that when $f=1_A$, this is precisely one of the definitions of the PRM. So by a monotone class result, it is not surprising that this holds more generally. Anyway, I’m currently trying to use results like these to get some control over what the structure of these branching processes looks like, even when the type space is continuous as in the random graph with specified ages.
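To see why the identity $\mathbb{E}\nu(f)=\mu(f)$ is plausible without the monotone class machinery, here is a sketch in the finite case $M=\mu(E)<\infty$, using the construction described earlier: $\nu=\sum_{i=1}^N\delta_{Y_i}$ with $N\sim\text{Po}(M)$ independent of the IID points $Y_i\sim\mu/M$ (the names N and $Y_i$ are introduced here only for illustration). Assuming f is $\mu$-integrable, conditioning on N gives

$\mathbb{E}\nu(f)=\mathbb{E}\left[\sum_{i=1}^N f(Y_i)\right]=\mathbb{E}[N]\,\mathbb{E}[f(Y_1)]=M\cdot\frac{\mu(f)}{M}=\mu(f).$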

# Multitype Branching Processes

One of the fundamental objects in classical probability theory is the Galton-Watson branching process. This is defined to be a model for the growth of a population, where each individual in a generation gives birth to some number (possibly zero) of offspring, who form the next generation. Crucially, the numbers of offspring of the individuals are IID, with the same distribution both within generations and between generations.

There are several ways one might generalise this, such as non-IID offspring distributions, or pairs of individuals producing some number of offspring, but here we consider the situation where each individual has some type, and different types have different offspring distributions. Note that if there are K types, say, then the offspring distributions should now be supported on $\mathbb{Z}_{\ge 0}^K$. Let’s say the offspring distribution from a parent of type i is $\mu^{(i)}$.

The first question to address is one of survival. Recall that if we want to know whether a standard Galton-Watson process has positive probability of having infinite size, that is never going extinct, we only need to know the expectation of the offspring distribution. If this is less than 1, then the process is subcritical and is almost surely finite. If it is greater than 1, then it is supercritical and survives with positive probability. If the expectation is exactly 1 (and the variance is finite) then the process is critical and although it is still almost surely finite, the overall population size has a power-law tail, and hence (or otherwise) the expected population size is infinite.

We would like a similar result for the multitype process, saying that we do not need to know everything about the distribution to decide what the survival probability should be.

The first thing to address is why we can’t just reduce the multitype case to the monotype setting. It’s easiest to assume that we know the type of the root in the multitype tree. The case where the type of the root is random can be reconstructed later. Anyway, suppose now that we want to know the offspring distribution of a vertex in the m-th generation. To decide this, we need to know the probability that this vertex has a given type, say type j. To calculate this, we need to work out all the type possibilities for the first m generations, and their probabilities, which may well include lots of complicated size-biasing. Certainly it is not easy, and there’s no reason why these offspring distributions should be IID. The best we can say is that they should probably be exchangeable within each generation.

Obviously if the offspring distribution does not depend on the parent’s type, then we have a standard Galton-Watson tree with types assigned in an IID manner to the realisation. If the types are symmetric (for example if M, to be defined, is invariant under permuting the indices) then life gets much easier. In general, however, it will be more complicated than this.

We can however think about how to decide on survival probability. We consider the expected number of offspring, allowing both the type of the parent and the type of the child to vary. So define $m_{ij}$ to be the expected number of type j children born to a type i parent. Then write these in a matrix $M=(m_{ij})$.

One generalisation is to consider a Galton-Watson forest started from some positive number of roots of various types. Suppose we have a vector $\nu=(\nu_i)$ listing the number of roots of each type. Then the expected number of descendents of each type at generation n is given by the vector $\nu M^n$.

Let $\lambda$ be the largest eigenvalue of M. As for the transition matrices of Markov chains, the Perron-Frobenius theorem applies here, which confirms that, because the entries of M are positive, the eigenvalue with largest modulus is simple and real, and the associated eigenvector has entirely positive entries. [In fact we need a couple of extra conditions on M, including that it is possible to get from any type to any other type – we say irreducible – but that isn’t worth going into now.]

So in fact the total number of descendents at generation n grows like $\lambda^n$ in expectation, and so we have the same description of subcriticality and supercriticality. We can also make a sensible comment about the left-$\lambda$-eigenvector of M. This is the limiting proportion of the different types of vertices.
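As an illustration of the matrix calculation, here is a sketch using power iteration to find the largest eigenvalue and the normalised left eigenvector. The two-type mean matrix below is hypothetical, chosen by me so that the process is exactly critical.

```python
def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def perron_left(m, iterations=200):
    """Power iteration for the largest eigenvalue and normalised left eigenvector."""
    v = [1.0 / len(m)] * len(m)
    lam = 0.0
    for _ in range(iterations):
        w = vec_mat(v, m)
        lam = sum(w)              # v sums to 1, so sum(w) converges to the eigenvalue
        v = [x / lam for x in w]
    return lam, v

# Hypothetical mean matrix: M[i][j] = expected number of type j children of a type i parent.
M = [[0.5, 1.0], [0.25, 0.5]]
lam, left = perron_left(M)
print("largest eigenvalue:", lam)
print("limiting type proportions:", left)

nu = [1.0, 0.0]                   # a single root of the first type
for n in range(1, 6):
    nu = vec_mat(nu, M)           # expected numbers of each type at generation n
    print(n, nu)
```

For this matrix the expected generation sizes neither grow nor decay, and the expected type counts are proportional to the left eigenvector.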

It’s a result (eg. [3]) that the height profile of a depth-first search on a standard Galton-Watson tree converges to Brownian Motion. Another way to phrase this is that a GW tree conditioned to have some size N has the Brownian Continuum Random Tree as a scaling limit as N grows to infinity. Miermont [4] proves that this result holds for the multitype tree as well. In the remainder of this post I want to discuss one idea along the way to the proof, and one application.

I said initially that there wasn’t a trivial reduction of a multitype process to a monotype process. There is however a non-trivial embedding of a monotype process in a multitype process. Consider all the vertices of type 1, and all the paths between such vertices. Then draw a new tree consisting of just the type 1 vertices. Two of these are joined by an edge if there is no other type 1 vertex on the unique path between them in the original tree. If that definition is confusing, think of the most sensible way to construct a tree on the type 1 vertices from the original, and you’ve probably chosen this definition.

There are two important things about this new tree. 1) It is a Galton-Watson tree, and 2) if the original tree is critical, then this reduced tree is also critical. Proving 1) is heavily dependent on exactly what definitions one takes for both the multitype branching mechanism and the standard G-W mechanism. Essentially, at a type 1 vertex, the number of type 1 descendents is not dependent on anything that happened at previous generations, nor in other branches of the original tree. This gives IID offspring distributions once it is formalised. As for criticality, we note that by the matrix argument given before, under the irreducibility condition discussed, the expectation of the total population size is infinite iff the expected number of type 1 vertices is also infinite. Since the proportion of type 1 vertices is given by the first element of the left eigenvector, which is positive, we can make a further argument that the number of type 1 vertices has a power-law tail iff the total population size also has a power-law tail.

I want to end by explaining why I was thinking about this model at all. In many previous posts I’ve discussed the forest fire model, where occasionally all the edges in some large component are deleted, and the component becomes a set of singletons again. We are interested in the local limit. That is, what do the large components look like from the point of view of a single vertex in the component? If we were able to prove that the large components have BCRT as the scaling limit, this would answer this question.

This holds for the original random graph process. There are two sensible ways to motivate this. Firstly, given that a component is a tree (which it is with high probability if its size is O(1) ), its distribution is that of the uniform tree, and it is known that this has BCRT as a scaling limit [1]. Alternatively, we know that the components have a Poisson Galton-Watson process as a local limit by the same argument used to calculate the increments of the exploration process. So we have an alternative description of the BCRT appearing: the scaling limit of G-W trees conditioned on their size.

Regarding the forest fires, if we stop the process at some time T>1, we know that some vertices have been burned several times and some vertices have never received an edge. What is clear though is that if we specify the age of each vertex, that is, how long has elapsed since it was last burned, then conditional on this we have an inhomogeneous random graph. Note that if we have two vertices of ages s and t, then the probability that there is an edge between them is $1-e^{-\frac{s\wedge t}{n}}$, ie approximately $\frac{s\wedge t}{n}$. The function giving the probabilities of edges between different types of vertices is called the kernel, and here it is sufficiently well-behaved (in particular, it is bounded) that we are able to use the results of Bollobas et al in [2], where they discuss general sparse inhomogeneous random graphs. They show, among many other things, that in this setting as well the local limit is a multitype branching process.

So in conclusion, we have almost all the ingredients towards proving the result we want, that forest fire components have BCRT scaling limit. The only outstanding matter is that the Miermont result deals with a finite number of types, whereas obviously in the setting where we parameterise by age, the set of types is continuous. In other words, I’m working hard!

References

[1] Aldous – The Continuum Random Tree III

[2] Bollobas, Janson, Riordan – The phase transition in inhomogeneous random graphs

[3] Le Gall – Random Trees and Applications

[4] Miermont – Invariance principles for spatial multitype Galton-Watson trees

# Branching Random Walk and Amenability

This post is about some of the things I learned in an interesting talk given by Elisabetta Candellero in Oxford last week, based on joint work with Matt Roberts. The paper on which this is based can be found here. The main things I want to talk about are some properties of graphs which were mentioned near the beginning and which I hadn’t heard about before.

Branching Random Walk (hereafter BRW) is a model to which much attention has been paid, because of its natural applications in a range of physical and genetic settings. As with many of the best models, the definition is pretty much in the title. We take the ingredients for a random walk on a graph, namely a graph and a transition matrix P on that graph. For most of the time we will consider simple random walk, so the graph G exactly specifies P. This requires the additional condition that the graph G is locally finite. We will introduce a branching mechanism, so at discrete times {0,1,2,…} we will track both the number of particles, and their current locations. We start at time 0 with a single particle at some vertex. Then at each time-step, all the particles present die, and each gives birth independently to some number of offspring according to a fixed probability distribution $\mu$. These offspring then perform one move according to transition matrix P. Note that if you want the system to carry the appearance of having no death, then taking the support of the offspring distribution to be {1,2,3,…} achieves precisely this. The properties we consider will not be very interesting unless G is infinite, so assume that from now on.

There are almost limitless ways we could think of to generalise these dynamics. The offspring distribution could be allowed to depend on the vertex the particle is occupying. The joint transition probabilities of the offspring at a vertex could be biased in favour or against the offspring moving to the same site next. The environment could be chosen in advance before the process starts, but random.

The classical question about BRW is that of recurrence and transience. The definition extends naturally from that of a Markov chain (which any non-branching random walk on a graph is). As in that setting, we say a BRW is recurrent if every vertex is almost surely visited infinitely often by particles of the graph.

Heuristically, we should observe that in some sense, it is quite difficult for simple random walk on an infinite graph to be recurrent. We have examples in $\mathbb{Z},\mathbb{Z}^2$, but these are about as ‘small’ as an infinite graph can be. An idea might be that if the number of sites some distance away from where we start grows rapidly as the distance grows, then there isn’t enough ‘pull’ back to visit the sites near where we start infinitely often. Extending this argument, it is easier for a BRW to be recurrent, as we have the option to make the branching rate large, which means that there are lots of particles at large times, hence more possibility for visiting everywhere. Note that if the offspring distribution is subcritical, we don’t stand a chance of having interesting properties. If we ignore the random walk part, we just have a subcritical Galton-Watson process, which dies out almost surely.

We need a measure of the concept discussed in the heuristic for how fast the number of vertices in the graph grows as we consider bands of vertices further and further away from the starting vertex. The standard measure for this is the spectral radius, which is defined not in terms of number of vertices, but through the limiting probability of returning to a fixed vertex at large time n. Precisely

$\rho:= \limsup_{n\to\infty} \mathbb{P}_i(X_n=i)^{1/n},$

so in some approximation sense

$\mathbb{P}_i(X_n=i)\sim \rho^{n},$

which explains why $\rho\le 1$. Note that by considering the sum of such terms, if simple random walk on G is recurrent, then $\rho=1$, but the converse does not hold. (Consider SRW on $\mathbb{Z}^3$ for example.)
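To make the definition concrete, here is a sketch that estimates $\rho$ from $\mathbb{P}_i(X_n=i)^{1/n}$ at a single large even n. It tracks only the distance from the starting vertex, which is enough for SRW on $\mathbb{Z}$ and on a regular tree. The 3-regular tree is my own choice of comparison example, for which the known value is $\rho=2\sqrt{2}/3$.

```python
import math

def return_probability(up, n):
    """P_0(X_n = 0) for the distance chain: from 0 move to 1; from k >= 1 move up w.p. `up`, else down."""
    dist = [0.0] * (n + 2)        # dist[k] = probability of being at distance k
    dist[0] = 1.0
    for _ in range(n):
        new = [0.0] * (n + 2)
        new[1] += dist[0]         # from the start, every step moves away
        for k in range(1, n + 1):
            new[k + 1] += up * dist[k]
            new[k - 1] += (1.0 - up) * dist[k]
        dist = new
    return dist[0]

n = 400
# up = 1/2 gives |SRW| on Z; up = 2/3 gives the distance from the root on the 3-regular tree.
for name, up in (("Z", 0.5), ("3-regular tree", 2.0 / 3.0)):
    estimate = return_probability(up, n) ** (1.0 / n)
    print(name, round(estimate, 4))
print("known value for the tree:", 2.0 * math.sqrt(2.0) / 3.0)
```

The convergence is slow because of polynomial corrections to the return probability, so the estimates sit a little below the limiting values.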

It’s also worth remarking that $\rho$ is a class property. In particular, for a connected graph, the value of $\rho$ is independent of i. This is not surprising, as if d is the graph distance between vertices i and j, then

$p_{ii}^{(n)}\ge p_{ij}^{(d)}p_{jj}^{(n-2d)}p_{ji}^{(d)},$

and vice versa, which enables us to sandwich usefully for the limits.

Really, $\rho$ is a function of the transition matrix P. In fact, we can be more specific, by considering diagonalising P. The only case we care about is when P is infinite, so this is not especially nice, but it makes it clear why $p_{ii}^{(n)}$ decays like $|\rho|^n$ where $\rho$ is the largest eigenvalue of P. Indeed this is an alternative definition of the spectral radius. Note that Perron-Frobenius theory (which seems to keep coming up on the blog this week…) says that since $|\rho|\le 1$, then if $|\rho|=1$, we must have $\rho=1$. So the spectral radius being 1 is precisely equivalent to having an invariant measure. We don’t know whether we can normalise it, but P-F guarantees the relevant left-eigenvector is non-negative, and hence a measure.

Next we give this situation a name. Say that a random walk is amenable if $\rho(P)=1$. We can extend this property to say that a graph is amenable if SRW on it is amenable.

This is not the standard definition of amenability. This property is originally defined (by von Neumann) in the context of groups. A group G is said to be amenable if there exists a left-invariant, finitely additive probability measure on G, ie $\mu$ such that

$\forall A\subset G, \forall g\in G, \mu(gA)=\mu(A).$

The uniform distribution shows that any finite group is amenable.

It turns out that in general there are several conditions for a group which are equivalent to amenability. One is that, given G finitely generated by B, the Cayley graph for G with edges given by elements of B does not satisfy a strong isoperimetric inequality. Such an inequality is an alternative way of saying that the graph grows rapidly. It says that the size of the boundary of a subset of the vertices is uniformly large relative to the size of the set. Precisely, there exists a constant c>0 such that whenever U is a finite subset of the vertices, we have $|\partial U|\ge c|U|$. (Note that finiteness of U is important – we would not expect results like this to hold for very large subsets.)

Kesten proved that it is further equivalent to the statement that simple random walk on $Cay(G,B)$ is amenable in our original sense. This technical and important result links the two definitions.

We finish by stating the main classical result in BRW, which is a precise condition for transience. As motivated earlier, the rate of branching and the spectral radius have opposing effects on whether the system is recurrent or transient. From now on we abuse notation and write $\mu$ also for the mean of the offspring distribution. Note that at some large time n, the expected number of particles which have returned to the starting vertex is given by the expected number of particles in the system multiplied by the probability that any one of them is back at its origin, ie $\sim \mu^n\rho^n$. So the probability that there is a particle back at the origin at this time is (crudely transferring from expectation to probability) $1\wedge (\mu \rho)^n$. We can conclude that the BRW is recurrent if $\mu > \rho^{-1}$ and transient if $\mu<\rho^{-1}$. This result is due to Benjamini and Peres.

The remaining case, when $\mu=\rho^{-1}$ is called, unsurprisingly, critical BRW. It was proved in ’06 by Gantert and Muller that, in fact, all critical BRWs are transient too. This must exclude the amenable case, as we could think of SRW on $\mathbb{Z}$ as a critical BRW by taking the branching distribution to be identically one, as the spectral radius is also 1.

In the end, the material in this post is rather preliminary to the work presented in EC’s talk, which concerned the trace of BRW, and whether there are infinitely many essentially different paths to infinity taken by the particles of the BRW. They show that this holds in a broad class of graphs with symmetric properties.

# Discontinuous Phase Transitions

Yesterday, Demeter Kiss from Cambridge gave a seminar in Oxford about a model for self-destructive percolation on $\mathbb{Z}^2$ that had implications for the (non-)existence of an infinite-parameter forest fire model on the same lattice. I enjoyed talking about this and his recent work on the related model of frozen percolation on $\mathbb{Z}^2$. Considering these models in the lattice setting presents a whole range of interesting geometric challenges that are not present in the mean-field case that has mainly occupied my research direction so far.

The afternoon’s discussion included lots of open problems about percolation. Several of these are based around continuity of the phase transition, so I thought I would write a quick post about some simple examples of this, and one example where it does not hold.

A helpful base example is bond percolation on the lattice $\mathbb{Z}^2$. Here, we specify some probability p in [0,1], and we declare edges of the lattice open with probability p, independently of each other. We then consider the graph induced by the open edges. We say that percolation occurs if the origin is contained in an infinite open component. The terminology arises from the interpretation as fluid being added at the origin and flowing down open edges. We define $\theta(p)$ to be the probability that the origin is in an infinite component when the parameter is p. By translation-invariance, we can get some sort of 0-1 law, to conclude that there is an infinite component somewhere in the system with probability either 0 or 1, depending on whether $\theta(p)$ is positive or zero. Indeed, we can further show that if it is positive, then with probability 1 there is a unique infinite component.

We define the critical probability $p_c:= \inf\{p:\theta(p)>0\}$. A question worth asking is then, what is $\theta(p_c)$? In some examples, we can find $p_c$, but we cannot prove that $\theta(p)$ is continuous around $p_c$. In the case of $\mathbb{Z}^2$ this is known, and it is known from work of Kesten that $p_c=1/2$. See below for a plot of $\theta(p)$ in this setting (obtained from this blog, though possibly originating elsewhere).

The aim is to find an example where we do not have such a continuous phase transition. The original work on frozen percolation took place on trees, and one of Kiss’s results confirms that these show qualitatively different phenomena to the same process on the lattice. In some sense, trees lie halfway between a lattice and a mean-field model, since there is often some independence when we look down the tree from a given generation, if it is well-defined to use such language.

Anyway, first we consider percolation on an infinite regular rooted k-ary tree. This means we have a root, which has k children, each of which in turn has k children, and so on. As before we consider bond percolation with parameter p. In this setting, we have a language to describe the resulting open component of the root. The offspring distribution of any vertex in the open component is given by Bin(k,p) independently of everything else, so we can view this component as the realisation of a Galton-Watson tree with this offspring distribution. This distribution has finite mean kp, and so we can state explicitly when the survival probability is positive. This happens when the mean is greater than 1, ie p>1/k.

For our actual example, we will consider the survival probability, but the technicalities are easier to explain if we look at the extinction probability, now using the language of branching processes. Suppose the offspring distribution has pgf given by

$f(x)=p_0+p_1x+p_2x^2+\ldots.$

Then the extinction probability q satisfies f(q)=q. I want to pause to consider what happens if this equation has multiple solutions. Indeed, in most interesting cases it will have multiple solutions, since f(1) will always be 1 if it is a non-defective offspring distribution. It is typically cited that: the extinction probability q is the smallest solution to this equation. I want to discuss why that is the case.

To approach this, we have to consider what extinction means. It is the limit in the event sense of the events {we are extinct after n generations}. Let the probabilities of these events be $q_n$, so $q_0=0$. Then by a straightforward coupling argument, we must have

$0=q_0\le q_1\le q_2 \le\ldots\le q:= \lim q_n \le 1.$

But, by the same generating function argument as before, $q_{n+1}=f(q_n)\ge q_n$. So if we split [0,1] into regions A where $f(x)\ge x$ and B where $f(x)<x$, all the $q_n$ must occur in the former, and so since it is closed, their limit must be in A also. Note that if f(x) intersects x lots of times, then region A is not necessarily connected. In the diagram below, in moving from $q_n$ to $q_{n+1}$ we might jump across part of B.

This is bad, as we are trying to prove that q is the right boundary of the connected component of A containing 0. But this cannot happen, as f is monotonic. So if one of the roots of f(x)=x in between the hypothesised $q_n<q_{n+1}$ is called z, then $q_{n+1}=f(q_n)\le f(z)=z < q_{n+1}$, a contradiction.
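The iteration $q_{n+1}=f(q_n)$ started from $q_0=0$ is easy to try numerically. The following is a minimal sketch for the Bin(k,p) offspring distribution from the tree percolation example, whose pgf is $f(x)=(1-p+px)^k$; the function names and the number of iterations are illustrative choices, not part of the original argument.

```python
def binomial_pgf(x, k, p):
    """pgf of Bin(k, p): f(x) = (1 - p + p x)^k."""
    return (1 - p + p * x) ** k


def extinction_probability(k, p, generations=2000):
    """Iterate q_{n+1} = f(q_n) from q_0 = 0.

    The sequence is increasing, so it converges to the smallest
    root of f(q) = q in [0, 1].
    """
    q = 0.0
    for _ in range(generations):
        q = binomial_pgf(q, k, p)
    return q


# Subcritical: mean kp = 0.8, so extinction is certain.
q_sub = extinction_probability(k=2, p=0.4)
# Supercritical: mean kp = 1.5; here f(q) = q reads 9q^2 - 10q + 1 = 0,
# with roots 1/9 and 1, and the iteration picks out the smaller one.
q_sup = extinction_probability(k=2, p=0.75)
print(q_sub, q_sup)
```

In the supercritical case the iteration never gets near the root at 1, which is the numerical shadow of the monotonicity argument above.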

Ok, so now we are ready to consider our counterexample to continuity over the percolation threshold. See references for a link to the original source of this example. We have to choose a slightly more complicated event than mere survival or extinction. We consider bond percolation as before on the infinite ternary tree, where every vertex has precisely 3 offspring. Our percolation event is now that the root is the root of an infinite binary tree. That is, the root has at least two children, each of which have at least two children, each of which, and so on.

If we set this probability equal to q, and the probability of an edge being open equal to p, then we have the recurrence:

$q=3p^2(1-p)q^2+p^3[3q^2(1-q)+q^3].$

The first term corresponds to the root having two open edges to offspring, and the second to the root having all three open edges to offspring. After manipulating, we end up with

$q\left[2p^3q^2-3p^2q+1\right]=0.$

We are therefore interested in roots of the quadratic lying between 0 and 1. The discriminant can be evaluated as

$\Delta=p^3(9p-8),$

and so there are no real roots where p<8/9. But when p=8/9, we have a repeated root at q=27/32, which is obviously not zero!
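To see the jump numerically, here is a small sketch under one assumption that the text does not spell out: the event "the root has a binary subtree to depth n" is decreasing in n, so the probability of an infinite binary subtree should be the limit of the recurrence iterated from 1, that is, the largest root in [0,1]. The function names are hypothetical.

```python
import math
from fractions import Fraction


def binary_subtree_probability(p):
    """Largest root in [0, 1] of q = 3 p^2 q^2 - 2 p^3 q^3.

    Returns 0 when the quadratic factor 2 p^3 q^2 - 3 p^2 q + 1
    has no real roots, i.e. when p < 8/9.
    """
    disc = p ** 3 * (9 * p - 8)
    if disc < 0:
        return 0.0
    return float((3 * p ** 2 + math.sqrt(disc)) / (4 * p ** 3))


def iterate_from_one(p, steps=5000):
    """Iterate the recurrence starting from q = 1 (depth-0 event)."""
    q = 1.0
    for _ in range(steps):
        q = 3 * p ** 2 * q ** 2 - 2 * p ** 3 * q ** 3
    return q


# An exact rational p avoids rounding trouble right at the threshold.
at_threshold = binary_subtree_probability(Fraction(8, 9))
print(binary_subtree_probability(0.88), at_threshold)
```

Just below 8/9 the function returns 0, while at 8/9 it returns 27/32, matching the repeated root computed above.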

This equation is qualitatively different to the previous one for the extinction probability of a Galton-Watson tree. There, we had a quadratic, with one root at 1. As we varied p, the other root moved continuously from greater than one to less than one, so it passed through 1, giving continuity at the critical probability. Here, we have a cubic, again with one trivial root, this time at 0 since q is now the probability of the survival-type event. But now the other roots are complex for small p, meaning that the local minimum of the cubic lies above the x-axis. As p gets to the critical value, the local minimum touches the x-axis, and suddenly we have a repeated root, not at zero.

I would like to have a neat probabilistic heuristic for this result, without having to make reference to generating functions. At the moment, the best I can come up with is to say that the original problem is simple, in the sense that the critical probability is as small as it could be while still making sense in expectation. To be concrete, when the mean of the offspring distribution is less than 1, the expected size of the nth generation tends to zero, so there certainly could not be positive probability of having an infinite component.

Whereas in the binary tree example, we only require p=2/3 to have, in expectation, the right number of open edges to theoretically allow an infinite binary tree. If we think of percolation as a dynamic process by coupling in p, essentially as we move from p=2/3 to p=8/9 we need to add enough edges near the origin to be able to take advantage of the high density of edges available far from the origin. The probability of this working given you start from n vertices grows much faster (as n grows) than in the original problem, so you might expect a faster transition.

This is so content-free I’m reluctant even to call it a heuristic. I would be very interested to hear of any more convincing argument for this phenomenon!

REFERENCES

Dekking, Pakes – On family trees and subtrees of simple branching processes (link)

# The Contour Process

As I explained in my previous post, I haven’t been reading around as much as I would generally like to recently. A few days in London staying with my parents and catching up with some friends has therefore been a good chance to get back into the habit of leafing through papers and Pitman’s book among other things.

This morning’s post should be a relatively short one. I’m going to define the contour process, a function of a (random or deterministic) tree, related to the exploration process which I have mentioned a few times previously. I will then use this to prove a simple but cute result equating in distribution the sizes of two different branching processes via a direct bijection.

The Contour Process

To start with, we have to have a root, and from that root we label the tree with a depth-first labelling. An example of this is given below. It is helpful at this stage to conceive this process as an explorer walking on the tree, and turning back on themselves only when there is no option to visit a vertex they haven’t already seen. So in the example tree shown, the depth-first exploration visits vertex V_2 exactly four times. Note that with this description, it is clear that the exploration traverses every edge exactly twice, and so the length of the sequence is 2n-1, where n is the number of vertices in the tree, since there are n-1 edges and we start and end at the root.

Another common interpretation of this depth-first exploration is to take some planar realisation of the tree. (Note trees are always planar – proof via induction after removing a leaf.) Then if you treat the tree as a hedge and starting at the root walk along, following the outer boundary with your right hand, this exactly recreates the process.

The height of a tree at a particular vertex is simply the graph distance between that vertex and the root. So when we move from one vertex to an adjacent vertex, the height must increase or decrease by 1.

The contour process is the sequence of heights seen along the depth-first exploration. It is therefore a sequence:

$0=h_0,h_1,\ldots,h_{2n-2}=0,\quad h_i\geq 0,$

and such that $|h_{i+1}-h_i|=1$.
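As a concrete illustration, the sketch below computes the contour process of a small rooted tree given as a dictionary of ordered child lists. The example tree and the function name are made up for illustration and are not the tree from the figure.

```python
def contour_process(children, root=0):
    """Heights seen along the depth-first walk, starting and ending at the root.

    `children` maps a vertex to the ordered list of its children;
    vertices missing from the dictionary are leaves.
    """
    heights = [0]

    def explore(vertex, height):
        for child in children.get(vertex, []):
            heights.append(height + 1)   # step down the edge to the child
            explore(child, height + 1)
            heights.append(height)       # step back up the same edge

    explore(root, 0)
    return heights


# Hypothetical tree on n = 5 vertices: 0 has children 1 and 4, 1 has children 2 and 3.
example_tree = {0: [1, 4], 1: [2, 3]}
contour = contour_process(example_tree)
print(contour)
```

Each edge contributes one up-step and one down-step, so the list has 2(n-1)+1 = 2n-1 entries, and a vertex with c children appears c+1 times.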

Note that though the contour process uniquely determines the tree structure, the choice of depth-first labelling is a priori non-canonical. For example, in the display above, V_3 might have been explored before V_2. Normally this is resolved by taking the suitable vertex with the smallest label in the original tree to be next. It makes little difference to any analysis to choose the ordering of descendents of some vertex in a depth-first labelling randomly. Note that this explains why it is rather hard to recover Cayley’s theorem about the number of rooted trees on n vertices from this characterisation. Although the number of suitable contour functions is possible to calculate, we would require a complicated multiplicative correction for labelling if we wanted to recover the number of trees.

The only real observation about the uses of the contour process at this stage is that it is not in general a random walk with IID increments for a Galton-Watson branching process. This equivalence is what made the exploration process so useful. In particular, it made it straightforward, at least heuristically, to see why large trees might have a limit interpretation through Brownian excursions. If for example, the offspring distribution is bounded above, say by M, then the contour process certainly cannot be a random walk, as if we have visited a particular vertex exactly M+1 times, then it cannot have another descendent, and so we must return closer to the root at the next step.

I want to mention that in fact Aldous showed his results on scaling limits towards the Continuum Random Tree through the contour process rather than the exploration process. However, I don’t want to say any more about that right now.

A Neat Equivalence

What I do want to talk about is the following distribution on the positive integers. This comes up in Balazs Rath and Balint Toth’s work on forest-fires on the complete graph that I have been reading about recently. The role of this distribution is a conjectured equilibrium distribution for component size in a version of the Erdos-Renyi process where components are deleted (or ‘struck by lightning’) at a rate tuned so that giant components ‘just’ never emerge.

This distribution has the possibly useful property that it is the distribution of the total population size in a Galton-Watson process with Geom(1/2) offspring distribution. It is also the distribution of the total number of leaves in a critical binary branching process, where every vertex has either two descendents or zero descendents, each with probability 1/2. Note that both of these tree processes are critical, as the expected number of offspring is 1 in each case. This is a good start, as it suggests that the relevant equilibrium distribution should also have the power-law tail that is found in these critical branching processes. This would confirm that the forest-fire model exhibits self-organised criticality.

Anyway, as a sanity check, I tried to find a reason why, ignoring the forest-fires for now, these two distributions should be the same. One can argue using generating functions, but there is also the following nice bijective argument.

We focus first on the critical Geometric branching process. We examine its contour function. As explained above, the contour process is not in general a random walk with IID increments. However, for this particular case, it is. The geometric distribution should be viewed as the family of discrete memoryless distributions.

This is useful for the contour process. Note that if we are at vertex V for the (m+1)th time, that is we have already explored m of the edges out of V, then the probability that there is at least one further edge is 1/2, independently of the history of the exploration, as the offspring distribution is Geometric(1/2), which we can easily think of as adding edges one at a time based on independent fair coin tosses until we see a tail for example. The contour process for this random tree is therefore a simple symmetric random walk on Z. Note that this will hit -1 at some point, and the associated contour process is the RW up to the final time it hits 0 before hitting -1. We can check that this obeys the clear rule that with probability 1/2 the tree is a single vertex.

Now we consider the other model, the Galton-Watson process with critical binary branching mechanism. We should consider the exploration process. Recall that the increments in this process are given by the offspring distribution minus one. So this random sequence also behaves as a simple symmetric random walk on Z, again stopped when we hit -1.

To complete the bijective argument, we have to relate leaves in the binary process to vertices in the geometric one. A vertex is a leaf if it has no offspring, so the number of leaves is the number of times before the hitting time of -1 that the exploration process decreases by 1. (*)

Similarly for the contour process. Note that there is a bijection between the set of vertices that aren’t the root and the set of edges. The contour process explores every edge exactly twice, once giving an increase of 1 and once giving a decrease of 1. So there is a bijection between the times that the contour process decreases by 1 and the non-root vertices. But the contour process was defined only up to the time we return to the root. This is fine if we know in advance how large the tree is, but we don’t know which return to the root is the final return to the root. So if we extend the random walk to the first time it hits -1, the portion up until the last increment is the contour process, and the final increment must be a decrease by 1, hence there is a bijection between the number of vertices in the Geom(1/2) G-W tree and the number of times that the contour process decreases by 1 before the hitting time of -1. Comparing with (*) gives the result.
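The generating function route mentioned earlier can also be checked by machine. The sketch below, with hypothetical function names, computes the first few probabilities of each distribution exactly by iterating the two functional equations on truncated power series: $T(x)=x\,f(T(x))$ with $f(s)=\sum_k 2^{-(k+1)}s^k$ for the total size of the Geom(1/2) tree, and $L(x)=\frac{x}{2}+\frac{L(x)^2}{2}$ for the number of leaves of the critical binary tree. It also compares both with the standard first-passage formula $C_{m-1}2^{-(2m-1)}$ for a simple symmetric random walk hitting -1 at time 2m-1, where $C_m$ are the Catalan numbers.

```python
from fractions import Fraction
from math import comb

N = 8  # truncation order: keep coefficients of x^0, ..., x^N


def multiply(a, b):
    """Product of two power series (coefficient lists), truncated at x^N."""
    out = [Fraction(0)] * (N + 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b[: N + 1 - i]):
            out[i + j] += a_i * b_j
    return out


def geometric_total_size():
    """Coefficients of T(x) = x f(T(x)), the pgf of the total population."""
    T = [Fraction(0)] * (N + 1)
    for _ in range(N + 1):  # each pass fixes at least one more coefficient
        f_of_T = [Fraction(0)] * (N + 1)
        power = [Fraction(1)] + [Fraction(0)] * N  # T^0
        for k in range(N + 1):
            weight = Fraction(1, 2 ** (k + 1))  # P(k offspring)
            f_of_T = [c + weight * t for c, t in zip(f_of_T, power)]
            power = multiply(power, T)
        T = [Fraction(0)] + f_of_T[:N]  # multiply by x
    return T


def binary_leaf_count():
    """Coefficients of L(x) = x/2 + L(x)^2/2, the pgf of the number of leaves."""
    L = [Fraction(0)] * (N + 1)
    for _ in range(N + 1):
        L = [c / 2 for c in multiply(L, L)]
        L[1] += Fraction(1, 2)
    return L


def catalan_formula(m):
    """C_{m-1} / 2^(2m-1), with C_{m-1} = binom(2m-2, m-1) / m."""
    return Fraction(comb(2 * m - 2, m - 1), m * 2 ** (2 * m - 1))


sizes = geometric_total_size()
leaves = binary_leaf_count()
print(sizes[1:4], leaves[1:4])
```

Both lists begin 1/2, 1/8, 1/16, in agreement with the rule that the tree is a single vertex with probability 1/2.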

# Uniform Spanning Trees

For applications to random graphs, the local binomial structure and independence means that the Galton-Watson branching process is a useful structure to consider embedding in the graph. In several previous posts, I have shown how we can set up the so-called exploration process which visits the sites in a component as if the component were actually a tree. The typical degree is O(1), and so in particular small components will be trees with high probability in the limit. In the giant component for a supercritical graph, this is not the case, but it doesn’t matter, as we ignore vertices we have already explored in our exploration process. We can consider the excess edges separately by ‘sprinkling’ them back in once we have the tree-like backbone of all the components. Again, independence is crucial here.

I am now thinking about a new model. We take an Erdos-Renyi process as before, with edges arriving at some fixed rate, but whenever a cycle appears, we immediately delete all the edges that make up the cycle. Thus at all times the system consists of a collection (or forest) of trees on the n vertices. So initially this process will look exactly like the normal E-R process, but as soon as the components start getting large, we start getting excess edges which destroy the cycles and make everything small again. The question to ask is: if we run the process for long enough, roughly how large are all the components? It seems unlikely that the splitting mechanism is so weak that we will get true giant components forming, ie O(n) sizes, so we might guess that, in common with some other split-merge models of this type, we end up with components of size $n^{2/3}$, as in the critical window for the E-R process.

In any case, the scaling limit process is likely to have components whose sizes grow with n, so we will have a class of trees larger than those we have considered previously, which have typically been O(1). So it’s worth thinking about some ways to generate random trees on a fixed number of vertices.

Conditioned Galton-Watson

Our favourite method of creating trees is inductive. We take a root and connect the root to a number of offspring given by a fixed distribution, and each of these to some offspring given by an independent sample from the same distribution and so on. The natural formulation gives no control over the size of the tree. This is a random variable whose distribution depends on the offspring distribution, and which in some circumstances can be computed explicitly, for example when the offspring distribution is geometric. In other cases, it is easier to make recourse to generating functions or to a random walk analogue as described in the exploration process discussion.

Of course, there is nothing to stop us conditioning on the total size of the population. This is equivalent to conditioning on the hitting time of -1 for the corresponding random walk, and Donsker’s theorem gives several consequences of a convergence relation towards a rescaled Brownian excursion. Note that there is no a priori labelling for the resulting tree. This will have to be supplied later, with breadth-first and depth-first the most natural choices, which might cause annoyance if you actually want to use it. In particular, it is not obvious, and probably not true unless you are careful, that the distribution is invariant under permuting the labels (having initially assumed 1 is the root etc) which is not ideal if you are embedding into the complete graph.

However, we would like to have some more direct constructions of random trees on n vertices. We now consider perhaps the two best known such methods. These are of particular interest as they are applicable to finding random spanning trees embedded in any graph, rather than just the complete graph.

Uniform Spanning Tree

Given a connected graph, consider the set of all subgraphs which are trees and span the vertex set of the original graph. An element of this set is called a spanning tree. A uniform spanning tree is chosen uniformly at random from this set; here we take the underlying graph to be the complete graph on n vertices. A famous result of Arthur Cayley says that the number of such spanning trees is $n^{n-2}$. There are various neat proofs, many of which consider a mild generalisation which gives us a more natural framework for using induction. This might be a suitable subject for a subsequent post.

While there is no objective answer to the question of what is the right model for random trees on n vertices, this is what you get from the Erdos-Renyi process. Formally, conditional on the sizes of the (tree) components, the structures of the tree components are given by UST.

To see why this is the case, observe that when we condition that a component has m vertices and is a tree, we are demanding that it be connected and have m-1 edges. Since the probability of a particular configuration appearing in G(n,p) is a function only of the number of edges in the configuration, it follows that the probability of each spanning tree on the m vertices in question is equal.

Interesting things happen when you do this dynamically. That is, if we have two USTs of sizes m and n at some time t, and condition that the next edge to be added in the process joins them, then the resulting component is not a UST on m+n vertices. To see why, consider the probability of a ‘star’, that is a tree with a single distinguished vertex to which every other vertex is joined. Then the probability that the UST on m vertices is a star is $\frac{m}{m^{m-2}}=m^{-(m-3)}$. By contrast, when m and n are both at least 2, it is not possible to obtain a star on m+n vertices by joining a tree on m vertices and a tree on n vertices with an additional edge.
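For small m, both Cayley's count and the star probability can be confirmed by brute force. The sketch below enumerates labelled trees through Prüfer sequences, which is one standard bijective proof of Cayley's formula and not necessarily the proof alluded to above; the function names and the choice m = 5 are illustrative.

```python
from itertools import product


def prufer_to_edges(seq, m):
    """Decode a Prüfer sequence of length m - 2 into a tree on {0, ..., m-1}."""
    degree = [1] * m
    for v in seq:
        degree[v] += 1
    edges = []
    for v in seq:
        # Attach the smallest remaining leaf to the next entry of the sequence.
        leaf = min(u for u in range(m) if degree[u] == 1)
        edges.append((leaf, v))
        degree[leaf] -= 1
        degree[v] -= 1
    # Exactly two vertices of degree 1 remain; join them.
    u, w = [x for x in range(m) if degree[x] == 1]
    edges.append((u, w))
    return edges


def is_star(edges, m):
    """A star has one vertex adjacent to all the other m - 1 vertices."""
    degree = [0] * m
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return max(degree) == m - 1


m = 5
trees = {
    frozenset(tuple(sorted(e)) for e in prufer_to_edges(seq, m))
    for seq in product(range(m), repeat=m - 2)
}
stars = sum(is_star(tree, m) for tree in trees)
print(len(trees), stars)
```

For m = 5 this finds 125 distinct trees of which 5 are stars, so the star has probability 1/25, in line with $m^{-(m-3)}$.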

However, I think the UST property is preserved by the cycle deletion mechanism mentioned at the very start of this post. My working has been very much of the back of the envelope variety, but I am fairly convinced that once you have taken a UST and conditioned on the sizes of the smaller trees which result from cycle deletion, each of these smaller trees is again a UST. My argument is that you might as well fix the cycle to be deleted, then condition on how many vertices are in each of the trees coming off this cycle. Now the choice of each of these trees is clearly uniform among spanning trees on the correct number of vertices.

However, it is my current belief that the combination of these two mechanisms does not give UST-like trees even after conditioning on the sizes at fixed time.

# Analytic vs Probabilistic Arguments for a Supercritical BP

This follows on directly from the previous post. I was originally going to talk only about what follows, but I got rather carried away with the branching process account. I was stuck on a particular exercise, and we ended up coming up with two arguments: one analytic and one probabilistic. Since the typical flavour of this blog is to present problems which show the advantage of the probabilistic approach, it seems only fair to remark on this case, where the analytic method was less interesting, but much simpler.

Recall that we have a supercritical random graph $G(n,\frac{\lambda}{n}), \lambda>1$, and we are considering the rescaled exploration process $S_{nt}$, which has asymptotic mean $\mu_t=1-t-e^{-\lambda t}$. We can calculate similarly an expression for the asymptotic variance

$\frac{\text{Var}(S_{nt})}{n}\rightarrow v_t=e^{-\lambda t}(1-e^{-\lambda t}).$

To use this to verify the result about the size of the giant component, we verify that $\mu_{\zeta_\lambda+x/\sqrt{n}}$ is negative, and has small variance, which would confirm that the giant component has size bounded above by $\zeta_\lambda$ almost surely. A similar argument is required for the lower bound. The variance is a separate matter, but it is therefore necessary that $\mu_t$ should be decreasing at $t=\zeta_\lambda$, that is $\mu_{\zeta_\lambda}'=\lambda e^{-\lambda \zeta_\lambda}-1<0$. This is what we try to prove in the remainder of this post. Recall that in the previous post we have checked that $\mu_t$ itself is equal to zero here.

Heuristic Explanation

$\mu_t$ has been rescaled from the original definition of the exploration process in both size and time-scale so some care is needed to see why this should hold in the limit. Remember that all components apart from the giant component are of size O(log n). So immediately after exhausting the giant component, you are likely to be visiting components of size roughly log n. A time interval of dt for $\mu$ corresponds to ndt for S, during which S will visit some components of size log n and some of O(1) and some in between. In particular, some fixed proportion of vertices are isolated, that is, in a component of size 1.

There is then a complicated size-biasing train of thought. A component of size log n is more likely to come up than an isolated vertex, but there are not as many of them. The log n components push the derivative $\mu_t'$ towards zero, because S_t decreases by 1 over a time-interval of length log n, which gives a gradient of zero in the limit. However, the isolated vertices give a gradient of -1, because S_t decreases by 1 over a time interval of 1. Despite the fact that log n intervals are likely to appear earlier, it still remains the case that after exhausting a component (in particular, at time $t=\zeta_\lambda$, after exhausting the giant component), with some bounded below positive probability you will choose an isolated vertex next. The component size only affects that time-scale if it is O(n), which none of the remaining components are, so the derivative $\mu_{\zeta_\lambda}'$ consists of some complicated weighted mean of 0 and -1. In particular, it is negative.

Analytic solution

Obviously, that won’t do in practice. Suppressing lambdas for ease of notation, the key fact is: $e^{-\lambda \zeta}=1-\zeta$. We want to show that $\lambda e^{-\lambda \zeta}<1$. Substituting

$\lambda=-\frac{\log(1-\zeta)}{\zeta},$

means that it is required to show:

$-\frac{1-\zeta}{\zeta}\log(1-\zeta)<1.$

Differentiating the left hand side gives:

$\frac{\log(1-\zeta)+\zeta}{\zeta^2}<0,$

since of course $\log(1-\zeta)=-\zeta-\frac{\zeta^2}{2}-\frac{\zeta^3}{3}-\dots$. So it suffices to check the result for small $\zeta$. But, again using a Taylor series:

$-\frac{1-\zeta}{\zeta}\log(1-\zeta)=1-\frac12\zeta+O(\zeta^2)<1,$

for small $\zeta$. This gives the required result.
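As a quick numerical sanity check of the inequality, the sketch below solves $e^{-\lambda\zeta}=1-\zeta$ by fixed-point iteration from $\zeta=1$ and evaluates $\lambda(1-\zeta)$. The function name, the iteration count and the chosen values of $\lambda$ are illustrative assumptions; for $\lambda$ very close to 1 the iteration converges slowly and would need more steps.

```python
import math


def survival_probability(lam, iterations=500):
    """Positive root of zeta = 1 - exp(-lam * zeta), for lam > 1.

    Starting from 1, the iterates decrease monotonically to the
    largest fixed point, which is positive when lam > 1.
    """
    zeta = 1.0
    for _ in range(iterations):
        zeta = 1.0 - math.exp(-lam * zeta)
    return zeta


for lam in (1.1, 1.5, 2.0, 5.0):
    zeta = survival_probability(lam)
    # lam * (1 - zeta) = lam * exp(-lam * zeta) should be strictly below 1.
    print(lam, zeta, lam * (1.0 - zeta))
```

For example $\lambda=2$ gives $\zeta\approx 0.797$ and $\lambda(1-\zeta)\approx 0.406$.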

Probabilistic Interpretation and Solution

First, we observe that $\lambda e^{-\lambda\zeta}=\lambda(1-\zeta)$ is the expected number of vertices in the first generation of a $\text{Po}(\lambda)$ branching process whose progeny become extinct. This motivates considering the canonical decomposition of a supercritical branching process Z into the skeleton process and the dual process. The skeleton $Z^+$ consists of all vertices which have infinitely many successors. It is relatively easy to show that this is a branching process with offspring distribution $\text{Po}(\lambda\zeta)$ conditioned on being positive. The dual process $Z^*$ is a G-W branching process with offspring distribution $\text{Po}(\lambda)$ conditioned on dying. This is the same as a branching process with offspring distribution $\text{Po}(\lambda(1-\zeta))$, by a sprinkling argument, which says that if we begin with a Poisson number of things, then remove each one independently with some fixed probability, the remaining number of things is Poisson also.

We can construct the original branching process by

• With probability $\zeta$, take the skeleton, and affix independent copies of $Z^*$ at every vertex in the skeleton.
• With probability $1-\zeta$, just take a copy of $Z^*$.

It is immediately clear that $\lambda(1-\zeta)\leq 1$. After all, the dual process is almost surely finite, so the offspring distribution cannot have expectation greater than 1. Checking that this inequality is strict is more fiddly. The best way I have come up with is to examine the tail of the distribution of total population size of the original branching process.

The total population size T of a branching process has an exponential tail if the offspring distribution is subcritical. It isn’t hugely surprising that this behaves like a large deviation for iid RVs, since in the limit such an event requires a lot of the offspring counts to deviate substantially from the mean. The same holds in the supercritical case, with the additional complication that though the finite tail decays exponentially, there is positive probability that the total size will be infinite. In the critical case, however, there is a power-law decay. This is not hugely surprising as it marks the threshold for the appearance of the infinite population, just as in a multiplicative coalescent at time 1, we have a load of very large components just about to form a giant component. The tool for all of these results is Dwass’s Theorem, which says:

$\mathbb{P}(T=n)=\frac{1}{n}\mathbb{P}(X_1+\ldots+X_n=n-1),$

where $X_1,\ldots,X_n$ are iid with the offspring distribution. When $\mathbb{E}X_1\neq 1$, this is a large deviation event, for which Cramer’s theorem applies (assuming, as is the case for the Poisson distribution, that the offspring distribution has finite variance). When $\mathbb{E}X_1=1$, the Central Limit Theorem says that with high probability,

$X_1+\ldots+X_n\in [n-n^{3/4},n+n^{3/4}],$

so, skating over the details of whether everything is exactly uniform within this CLT scaling window,

$\mathbb{P}(T=n)\geq \frac{1}{n}\cdot\frac{1}{2n^{3/4}}.$

The true power-law decay is substantially slower than this (the exponent is 3/2 rather than 7/4), but the above argument works as a back-of-the-envelope bound.
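For Poisson offspring the right-hand side of Dwass's Theorem is explicit, since $X_1+\ldots+X_n\sim\text{Po}(n\lambda)$, which gives $\mathbb{P}(T=n)=e^{-\lambda n}(\lambda n)^{n-1}/n!$. The sketch below evaluates this in log-space and looks at the three regimes; the function name and the particular values of n and $\lambda$ are illustrative.

```python
import math


def total_progeny_pmf(n, lam):
    """P(T = n) for Po(lam) offspring: (1/n) P(Po(n lam) = n - 1)."""
    log_p = -lam * n + (n - 1) * math.log(lam * n) - math.lgamma(n + 1)
    return math.exp(log_p)


# Subcritical: the probabilities sum to 1 and decay exponentially.
subcritical_mass = sum(total_progeny_pmf(n, 0.5) for n in range(1, 201))
# Supercritical: the finite part only carries the extinction probability.
supercritical_mass = sum(total_progeny_pmf(n, 2.0) for n in range(1, 201))
# Critical: n^{3/2} P(T = n) settles near 1 / sqrt(2 pi) by Stirling's formula.
critical_scaled = total_progeny_pmf(10_000, 1.0) * 10_000 ** 1.5
print(subcritical_mass, supercritical_mass, critical_scaled)
```

The supercritical mass for $\lambda=2$ is about 0.203, which is $1-\zeta_2$, and the critical case shows the polynomial tail directly.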

In particular, if the dual process has mean 1, then the population size of the original branching process is given by taking a distribution with exponential tail with some probability and a distribution with power-law tail with some probability. Obviously the power-law will dominate, which contradicts the assumption that the original branching process was supercritical, and so has an exponential tail.

# Exploring the Supercritical Random Graph

I’ve spent a bit of time this week reading and doing all the exercises from some excellent notes by van der Hofstad about random graphs. I think they are absolutely excellent and would not be surprised if they become the standard text for an introduction to probabilistic combinatorics. You can find them hosted on the author’s website. I’ve been reading chapters 4 and 5, which approach the properties of phase transitions in G(n,p) by formalising the analogy between component sizes and population sizes in a binomial branching process. When I met this sort of material for the first time during Part III, the proofs generally relied on careful first and second moment bounds, which is fine in many ways, but I enjoyed vdH’s (perhaps more modern?) approach, as it seems to give a more accurate picture of what is actually going on. In this post, I am going to talk about using the branching process picture to explain why the giant component emerges when it does, and how to get a grip on how large it is at any time after it has emerged.

Background

A quick tour through the background, and in particular the notation, will be required. At some point I will write a post about this topic in a more digestible format, but for now I want to move on as quickly as possible.

We are looking at the sparse random graph $G(n,\frac{\lambda}{n})$, in the super-critical phase $\lambda>1$. With high probability (that is, with probability tending to 1 as n grows), we have a so-called giant component, with $\Theta(n)$ vertices.

Because all the edges in the configuration are independent, we can view the component containing a fixed vertex as a branching process. Given vertex v(1), the number of neighbours is distributed like $\text{Bi}(n-1,\frac{\lambda}{n})$. The number of neighbours of each of these which we haven’t already considered is then $\text{Bi}(n-k,\frac{\lambda}{n})$, conditional on k, the number of vertices we have already discounted. After any finite number of steps, k=o(n), and so it is fairly reasonable to approximate this just by $\text{Bi}(n,\frac{\lambda}{n})$. Furthermore, as n grows, this distribution converges to $\text{Po}(\lambda)$, and so it is natural to expect that the probability that the fixed vertex lies in a giant component is equal to the survival probability $\zeta_\lambda$ (that is, the probability that it is infinite) of a branching process with $\text{Po}(\lambda)$ offspring distribution. Note that given a graph, the probability of a fixed vertex lying in a giant component is equal to the fraction of the vertices in the giant component. At this point it is clear why the emergence of the giant component must happen at $\lambda=1$, because we require $\mathbb{E}\text{Po}(\lambda)>1$ for the survival probability to be non-zero. Obviously, all of this needs to be made precise and rigorous, and this is treated in sections 4.3 and 4.4 of the notes.
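To see how quickly the binomial offspring law approaches the Poisson one, here is a small sketch that computes the total variation distance between $\text{Bi}(n,\frac{\lambda}{n})$ and $\text{Po}(\lambda)$ directly from the two mass functions. The helper names are illustrative, and the code assumes $\lambda<n$.

```python
import math

def log_binomial_pmf(k, n, p):
    """log P(Bi(n, p) = k), via log-gamma to stay stable for large n."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def log_poisson_pmf(k, lam):
    """log P(Po(lam) = k)."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def total_variation(n, lam):
    """Total variation distance between Bi(n, lam/n) and Po(lam)."""
    p = lam / n
    diff = 0.0
    poisson_mass = 0.0
    for k in range(n + 1):
        b = math.exp(log_binomial_pmf(k, n, p))
        q = math.exp(log_poisson_pmf(k, lam))
        diff += abs(b - q)
        poisson_mass += q
    # The Poisson law also puts a little mass above n, where the binomial has none.
    tail = max(0.0, 1.0 - poisson_mass)
    return 0.5 * (diff + tail)

for n in (10, 100, 1000):
    print(n, total_variation(n, 2.0))
```

The distance shrinks as n grows, which is the sense in which the $\text{Po}(\lambda)$ branching process is a reasonable stand-in for the early stages of the exploration.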

Exploration Process

A common functional of a rooted branching process to consider is the following. This is called in various places an exploration process, a depth-first process or a Lukasiewicz path. We take a depth-first labelling of the tree v(0), v(1), v(2),… , and define c(k) to be the number of children of vertex v(k). We then define the exploration process by:

$S(0)=0,\quad S(k+1)=S(k)+c(k)-1.$

By far the best way to think of this is to imagine we are making the depth-first walk on the tree. S(k) records how many vertices we have seen (because they are connected by an edge to a vertex we have visited) but have not yet visited. To clarify understanding of the definition, note that when you arrive at a vertex with no children, this should decrease by one, as you can see no new vertices, but have visited an extra one.
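A minimal sketch of the definition, assuming the tree is handed to us as the list of child counts c(0), c(1), … in depth-first order (the function names are my own). With the convention S(0)=0, the value S(k) does not count the vertex v(k) we are currently visiting, so a finite tree is exhausted at the first time the path hits -1.

```python
def exploration_path(child_counts):
    """Return [S(0), S(1), ...] with S(0)=0 and S(k+1)=S(k)+c(k)-1."""
    path = [0]
    for c in child_counts:
        path.append(path[-1] + c - 1)
    return path

def first_exhaustion_time(path):
    """First k with S(k) = -1, or None if the path never gets there."""
    for k, s in enumerate(path):
        if s == -1:
            return k
    return None

def child_counts_from_path(path):
    """Invert the definition: c(k) = S(k+1) - S(k) + 1."""
    return [path[k + 1] - path[k] + 1 for k in range(len(path) - 1)]

# Root with two children; the first child has one child of its own.
counts = [2, 1, 0, 0]
path = exploration_path(counts)
print(path, first_exhaustion_time(path))
```

In this example the path is exhausted at time 4, which is the number of vertices in the tree, and the child counts can be read back off the path.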

This exploration process is useful to consider for a couple of reasons. Firstly, you can reconstruct the branching process directly from it. Secondly, while other functionals (eg the height, or contour process) look like random walks, the exploration process genuinely is a random walk. The distribution of the number of children of the next vertex we arrive at is independent of everything we have previously seen in the tree, and is the same for every vertex. If we were looking at branching processes in a different context, we might observe that this gives some information in a suitably-rescaled limit, as rescaled random walks converge to Brownian motion if the variance of the (offspring) distribution is finite. (This is Donsker’s result, which I should write something about soon…)

The most important property is that the exploration process reaches a new minimum precisely when we have exhausted all the vertices in a component: with $S(0)=0$ as above, a tree with $m$ vertices is exhausted at time $m$, which is the first time the process hits $-1$. At that point, we have seen exactly the vertices which we have explored. There is no reason not to extend the definition to forests, that is a union of trees. The depth-first exploration is the same, but when we have exhausted one component, we move onto another component, chosen according to some labelling property. Then, running minima of the exploration process (ie times when it is smaller than it has been before) correspond to jumping between components, and thus excursions above the minimum to components themselves. The running minimum will be non-positive, with absolute value equal to the number of components already exhausted.

Although the exploration process was defined with reference to and in the language of trees, the result of a branching process, this is not necessary. With some vertex denoted as the root, we can construct a depth-first labelling of a general graph, and the exploration process follows exactly as before. Note that we end up ignoring all edges except a set that forms a forest. This is what we will apply to G(n,p).

Exploring G(n,p)

When we jump between components in the exploration process on a supercritical (that is $\lambda>1$) random graph, we move to a component chosen randomly with size-biased distribution. If there is a giant component, as we know there is in the supercritical case, then this will dominate the size-biased distribution. Precisely, if the giant component takes up a fraction H of the vertices, then the number of components to be explored before we get to the giant component is geometrically distributed with parameter H. All other components have size O(log n), so the expected number of vertices explored before we get to the giant component is O(log n)/H = o(n), and so in the limit, we explore the giant component immediately.

The exploration process therefore gives good control on the giant component in the limit, as roughly speaking the first time it returns to 0 is the size of the giant component. Fortunately, we can also control the distribution of S_t, the exploration process at time t. We have that:

$S_t+(t-1)\sim \text{Bi}(n-1,1-(1-p)^t).$

This is not too hard to see. $S_t+(t-1)$ is the number of vertices we have explored or seen, ie are connected to a vertex we have explored. Suppose the remaining vertices are called unseen, and we began the exploration at vertex 1. Then any vertex with label in {2,…,n} is unseen if it successively avoids being in the neighbourhood of v(1), v(2), … v(t). This happens with probability $(1-p)^t$, and so the probability of being an explored or seen vertex is the complement of this.

In the supercritical case, we are taking $p=\frac{\lambda}{n}$ with $\lambda>1$, and we also want to speed up S, so that all the exploration processes are defined on [0,1], and rescale the sizes by n, so that it records the fraction of the graph rather than the number of vertices. So we consider the rescaling $\frac{1}{n}S_{nt}$.

It is straightforward to use the distribution of $S_t$ to deduce that the asymptotic mean is $\mathbb{E}\frac{1}{n}S_{nt}\rightarrow\mu_t := 1-t-e^{-\lambda t}$ as $n\rightarrow\infty$.
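For completeness, here is a sketch of that computation, using only the binomial law above with $p=\frac{\lambda}{n}$ and treating $nt$ as an integer:

$\mathbb{E}\left[\frac{1}{n}S_{nt}\right]=\frac{n-1}{n}\left(1-\left(1-\frac{\lambda}{n}\right)^{nt}\right)-\frac{nt-1}{n}\;\longrightarrow\;\left(1-e^{-\lambda t}\right)-t,\quad n\rightarrow\infty,$

since $(1-\frac{\lambda}{n})^{nt}\rightarrow e^{-\lambda t}$, while $\frac{n-1}{n}\rightarrow 1$ and $\frac{nt-1}{n}\rightarrow t$.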

Now we are in a position to provide more concrete motivation for the claim that the proportion of vertices in the giant component is $\zeta_\lambda$, the survival probability of a branching process with $\text{Po}(\lambda)$ offspring distribution. It helps to consider instead the extinction probability $1-\zeta_\lambda$. We have:

$1-\zeta_\lambda=\sum_{k\geq 0}\mathbb{P}(\text{Po}(\lambda)=k)(1-\zeta_\lambda)^k=e^{-\lambda\zeta_\lambda},$

where the second equality is a consequence of the simple form for the probability generating function of the Poisson distribution.

As a result, we have that $\mu_{\zeta_\lambda}=0$. In fact we also have a central limit theorem for S_t, which enables us to deduce that $\frac{1}{n}S_{n\zeta_\lambda}$ is close to 0 with high probability, as well as in expectation, which is precisely what is required to prove that the giant component of $G(n,\frac{\lambda}{n})$ has size $n(\zeta_\lambda+o(1))$.
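The identity $\mu_{\zeta_\lambda}=0$ is easy to check numerically. The sketch below finds the extinction probability by iterating the fixed-point equation above from 0 (a standard way to reach the smallest root, though not necessarily the one used in the notes), and then evaluates both the limit $\mu_t=1-t-e^{-\lambda t}$ and the finite-n mean coming from the binomial law for $S_t$. The function names and the choice $\lambda=2$ are illustrative.

```python
import math

def survival_probability(lam, tol=1e-14, max_iter=100000):
    """zeta_lambda for Po(lam) offspring, via q -> exp(-lam * (1 - q)) started at q = 0."""
    q = 0.0
    q_next = q
    for _ in range(max_iter):
        q_next = math.exp(-lam * (1.0 - q))
        if abs(q_next - q) < tol:
            break
        q = q_next
    return 1.0 - q_next

def mu(t, lam):
    """Limiting rescaled mean 1 - t - exp(-lam * t)."""
    return 1.0 - t - math.exp(-lam * t)

def finite_n_mean(n, t, lam):
    """E[S_t] / n, using S_t + (t - 1) ~ Bi(n - 1, 1 - (1 - p)^t) with p = lam / n."""
    p = lam / n
    return ((n - 1) * (1.0 - (1.0 - p) ** t) - (t - 1)) / n

lam = 2.0
zeta = survival_probability(lam)
n = 10 ** 6
print(zeta, mu(zeta, lam), finite_n_mean(n, round(n * zeta), lam))
```

Both printed means are tiny, while for instance $\mu_{1/2}$ is clearly positive when $\lambda=2$, consistent with the rescaled process staying above 0 until roughly time $\zeta_\lambda$.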

# Branching Processes and Dwass’s Theorem

This is something I had to think about when writing my Part III essay, and it turns out to be relevant to some of the literature I’ve been reading this week. The main result is hugely helpful for reducing a potentially complicated combinatorial object to a finite sum of i.i.d. random variables, which in general we do know quite a lot about. I was very pleased with the proof I came up with while writing the essay, even if in the end it turned out to have appeared elsewhere before. (Citation at end)

Galton-Watson processes

A Galton-Watson process is a stochastic process describing a simple model for evolution of a population. At each stage of the evolution, a new generation is created as every member of the current generation produces some number of ‘offspring’ with identical and independent (both across all generations and within generations) distributions. Such processes were introduced by Galton and Watson to examine the evolution of surnames through history.

More precisely, we specify an offspring distribution, a probability distribution supported on $\mathbb{N}_0$. Then define a sequence of random variables $(Z_n,n\in\mathbb{N})$ by:

$Z_{n+1}=Y_1^n+\ldots+Y_{Z_n}^n,$

where $(Y_k^n,k\geq 1,n\geq 0)$ is a family of i.i.d. random variables with the offspring distribution $Y$. We say $Z_n$ is the size of the $n$th generation. From now on, assume $Z_0=1$ and then we call $(Z_n,n\geq 0)$ a Galton-Watson process. We also define the total population size to be

$X:=Z_0+Z_1+Z_2+\ldots,$

noting that this might be infinite. We refer to the situation where $X<\infty$ as extinction, and can show that extinction occurs almost surely when $\mathbb{E}Y\leq 1$, excepting the trivial case $Y=\delta_1$. The strict inequality parts are as you would expect. We say the process is critical if $\mathbb{E}Y=1$, and this is less obvious to visualise, but works equally well in the proof, which is usually driven using generating functions.

Total Population Size and Dwass’s Theorem

Of particular interest is $X$, the total population size, and its distribution. The following theorem gives a precise and useful link between the probability of the population having size $n$ and the distribution of the sum of $n$ RVs with the relevant offspring distribution. Among the consequences are that we can conclude immediately, by CLT and Cramer’s Large Deviations Theorem, that the total population size distribution has power-law decay in the critical case, and exponential decay otherwise.

Theorem (Dwass (1)): For a general branching process with a single time-0 ancestor and offspring distribution $Y$ and total population size $X$:

$\mathbb{P}(X=k)=\frac{1}{k}\mathbb{P}(Y^1+\ldots+ Y^k=k-1),\quad k\geq 1$

where $Y^1,\ldots,Y^k$ are independent copies of $Y$.
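Before the proof, here is a small numerical check of the statement, as a sketch under some assumptions of my own. It takes $\text{Po}(\lambda)$ offspring, so that $Y^1+\ldots+Y^k\sim\text{Po}(k\lambda)$, and computes $\mathbb{P}(X=k)$ independently as the probability that a random walk with steps distributed as $Y-1$, started from 0, first hits -1 at time k. This uses the standard fact that the total population is the first hitting time of -1 for the depth-first walk of the tree. The function names are illustrative.

```python
import math

def poisson_pmf(y, lam):
    """P(Po(lam) = y), computed in log space."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def total_size_by_walk(lam, kmax):
    """P(X = k) for k = 1..kmax, as first-passage probabilities to -1."""
    # Offspring counts above kmax cannot lead to extinction within kmax steps.
    pmf = [poisson_pmf(y, lam) for y in range(kmax + 1)]
    alive = {0: 1.0}  # walk position -> probability, for paths that have not hit -1 yet
    probs = {}
    for k in range(1, kmax + 1):
        nxt = {}
        hit = 0.0
        for s, mass in alive.items():
            for y, py in enumerate(pmf):
                t = s + y - 1
                if t == -1:
                    hit += mass * py
                elif t <= kmax:  # higher positions cannot get back down in time
                    nxt[t] = nxt.get(t, 0.0) + mass * py
        probs[k] = hit
        alive = nxt
    return probs

def total_size_by_dwass(lam, k):
    """(1/k) * P(Y^1 + ... + Y^k = k - 1) for iid Po(lam) summands."""
    return poisson_pmf(k - 1, k * lam) / k

walk = total_size_by_walk(2.0, 8)
for k in range(1, 9):
    print(k, walk[k], total_size_by_dwass(2.0, k))
```

The two columns agree up to floating-point error for subcritical, critical and supercritical values of $\lambda$.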

We now give a proof via a combinatorial argument. The approach is similar to that given in (2). Much of the literature gives a proof using generating functions.

Proof: For motivation, consider the following. It is natural to consider a branching process as a tree, with the time-0 ancestor as the root. Suppose the event $\{X=k\}$ holds, which means that the tree has $k$ vertices. Now consider the numbers of offspring of each vertex in the tree. Since every vertex except the root has exactly one parent, and there are no vertices outside the tree, we must have $Y^1+\ldots+Y^k=k-1$ where $Y^1,\ldots,Y^k$ are the offspring numbers in some order. However, observe that this is not sufficient. For example, if $Y^1$ is the number of offspring of the root, and $k\geq 2$, then we must have $Y^1\geq 1$.