Random walks conditioned to stay positive

In this post, I’m going to discuss some of the literature concerning the question of conditioning a simple random walk to lie above a line with fixed gradient. A special case of this situation is conditioning to stay non-negative. Some notation first. Let (S_n)_{n\ge 0} be a random walk with IID increments, with distribution X. Take \mu to be the expectation of these increments, and we’ll assume that the variance \sigma^2 is finite, though at times we may need to enforce slightly stronger regularity conditions.

(Although simple symmetric random walk is a good example for asymptotic heuristics, in general we also assume that if the increments are discrete they don’t have parity-based support, or any other arithmetic property that prevents local limit theorems holding.)

We will investigate the probability that S_n\ge 0 for n=0,1,…,N, particularly for large N. For ease of notation we write T=\inf\{n\ge 0\,:\, S_n<0\} for the hitting time of the negative half-plane. Thus we are interested in S_n conditioned on T>N, or T=N, mindful that these might not be the same. We will also discuss briefly to what extent we can condition on T=\infty.

In the first paragraph, I said that this is a special case of conditioning SRW to lie above a line with fixed gradient. Fortunately, all the content of the general case is contained in the special case. We can repose the question of S_n conditioned to stay above n\alpha until step N by the question of S_n-n\alpha (which, naturally, has drift \mu-\alpha) conditioned to stay non-negative until step N, by a direct coupling.

Applications

Simple random walk is a perfectly interesting object to study in its own right, and this is a perfectly natural question to ask about it. But lots of probabilistic models can be studied via naturally embedded SRWs, and it’s worth pointing out a couple of applications to other probabilistic settings (one of which is the reason I was investigating this literature).

In many circumstances, we can desribe random trees and random graphs by an embedded random walk, such as an exploration process, as described in several posts during my PhD, such as here and here. The exploration process of a Galton-Watson branching tree is a particularly good example, since the exploration process really is simple random walk, unlike in, for example, the Erdos-Renyi random graph G(N,p), where the increments are only approximately IID. In this setting, the increments are given by the offspring distribution minus one, and the hitting time of -1 is the total population size of the branching process. So if the expectation of the offspring distribution is at most 1, then the event that the size of the tree is large is an atypical event, corresponding to delayed extinction. Whereas if the expectation is greater than one, then it is an event with limiting positive probability. Indeed, with positive probability the exploration process never hits -1, corresponding to survival of the branching tree. There are plenty of interesting questions about the structure of a branching process tree conditional on having atypically large size, including the spine decomposition of Kesten [KS], but the methods described in this post can be used to quantify the probability, or at least the scale of the probability of this atypical event.

In my current research, I’m studying a random walk embedded in a construction of the infinite-volume DGFF pinned at zero, as introduced by Biskup and Louidor [BL]. The random walk controls the gross behaviour of the field on annuli with dyadically-growing radii. Anyway, in this setting the random walk has Gaussian increments. (In fact, there is a complication because the increments aren’t exactly IID, but that’s definitely not a problem at this level of exposition.) The overall field is decomposed as a sum of the random walk, plus independent DGFFs with Dirichlet boundary conditions on each of the annuli, plus asymptotically negligible corrections from a ‘binding field’. Conditioning that this pinned field be non-negative up to the Kth annulus corresponds to conditioning the random walk to stay above the magnitude of the minimum of each successive annular DGFF. (These minima are random, but tightly concentrated around their expectations.)

Conditioning on \{T > N\}

When we condition on \{T>N\}, obviously the resulting distribution (of the process) is a mixture of the distributions we obtain by conditioning on each of \{T=N+1\}, \{T=N+2\},\ldots. Shortly, we’ll condition on \{T=N\} itself, but first it’s worth establishing how to relate the two options. That is, conditional on \{T>N\}, what is the distribution of T?

Firstly, when \mu>0, this event always has positive probability, since \mathbb{P}(T=\infty)>0. So as N\rightarrow\infty, the distribution of the process conditional on \{T>N\} converges to the distribution of the process conditional on survival. So we’ll ignore this for now.

In the case \mu\le 0, everything is encapsulated in the tail of the probabilities \mathbb{P}(T=N), and these tails are qualitatively different in the cases \mu=0 and \mu<0.

When \mu=0, then \mathbb{P}(T=N) decays polynomially in N. In the special case where S_n is simple symmetric random walk (and N has the correct parity), we can check this just by an application of Stirling’s formula to count paths with this property. By contrast, when \mu<0, even demanding S_N=-1 is a large deviations event in the sense of Cramer’s theorem, and so the probability decays exponentially with N. Mogulskii’s theorem gives a large deviation principle for random walks to lie above a line defined on the scale N. The crucial fact here is that the probabilistic cost of staying positive until N has the same exponent as the probabilistic cost of being positive at N. Heuristically, we think of spreading the non-expected behaviour of the increments uniformly through the process, at only polynomial cost once we’ve specified the multiset of values taken by the increments. So, when \mu<0, we have

\mathbb{P}(T\ge(1+\epsilon)N) \ll \mathbb{P}(T= N).

Therefore, conditioning on \{T\ge N\} in fact concentrates T on N+o(N). Whereas by contrast, when \mu=0, conditioning on \{T\ge N\} gives a nontrivial limit in distribution for T/N, supported on [1,\infty).

A related problem is the value taken by S_N, conditional on {T>N}. It’s a related problem because the event {T>N} depends only on the process up to time N, and so given the value of S_N, even with the conditioning, after time N, the process is just an unconditioned RW. This is a classic application of the Markov property, beloved in several guises by undergraduate probability exam designers.

Anyway, Iglehart [Ig2] shows an invariance principle for S_N | T>N when \mu<0, without scaling. That is S_N=\Theta(1), though the limiting distribution depends on the increment distribution in a sense that is best described through Laplace transforms. If we start a RW with negative drift from height O(1), then it hits zero in time O(1), so in fact this shows that conditonal on \{T\ge N\}, we have T= N +O(1) with high probability. When \mu=0, we have fluctuations on a scale \sqrt{N}, as shown earlier by Iglehart [Ig1]. Again, thinking about the central limit theorem, this fits the asymptotic description of T conditioned on T>N.

Conditioning on T=N

In the case \mu=0, conditioning on T=N gives

\left[\frac{1}{\sqrt{N}}S(\lfloor Nt\rfloor ) ,t\in[0,1] \right] \Rightarrow W^+(t), (*)

where W^+ is a standard Brownian excursion on [0,1]. This is shown roughly simultaneously in [Ka] and [DIM]. This is similar to Donsker’s theorem for the unconditioned random walk, which converges after rescaling to Brownian motion in this sense, or Brownian bridge if you condition on S_N=0. Skorohod’s proof for Brownian bridge [Sk] approximates the event \{S_N=0\} by \{S_N\in[-\epsilon \sqrt{N},+\epsilon \sqrt{N}]\}, since the probability of this event is bounded away from zero. Similarly, but with more technicalities, a proof of convergence conditional on T=N can approximate by \{S_m\ge 0, m\in[\delta N,(1-\delta)N], S_N\in [-\epsilon \sqrt{N},+\epsilon\sqrt{N}]\}. The technicalities here emerge since T, the first return time to zero, is not continuous as a function of continuous functions. (Imagine a sequence of processes f^N for which f^N(x)\ge 0 on [0,1] and f^N(\frac12)=\frac{1}{N}.)

Once you condition on T=N, the mean \mu doesn’t really matter for this scaling limit. That is, so long as variance is finite, for any \mu\in\mathbb{R}, the same result (*) holds, although a different proof is in general necessary. See [BD] and references for details. However, this is particularly clear in the case where the increments are Gaussian. In this setting, we don’t actually need to take a scaling limit. The distribution of Gaussian *random walk bridge* doesn’t depend on the mean of the increments. This is related to the fact that a linear transformation of a Gaussian is Gaussian, and can be seen by examining the joint density function directly.

Conditioning on T=\infty

When \mu>0, the event \{T=\infty\} occurs with positive probability, so it is well-defined to condition on it. When \mu\le 0, this is not the case, and so we have to be more careful.

First, an observation. Just for clarity, let’s take \mu<0, and condition on \{T>N\}, and look at the distribution of S_{\epsilon N}, where \epsilon>0 is small. This is approximately given by

\frac{S_{\epsilon N}}{\sqrt{N}}\stackrel{d}{\approx}W^+(\epsilon).

Now take \epsilon\rightarrow\infty and consider the RHS. If instead of the Brownian excursion W^+, we instead had Brownian motion, we could specify the distribution exactly. But in fact, we can construct Brownian excursion as the solution to an SDE:

\mathrm{d}W^+(t) = \left[\frac{1}{W^+(t)} - \frac{W^+(t)}{1-t}\right] \mathrm{d}t + \mathrm{d}B(t),\quad t\in(0,1) (**)

for B a standard Brownian motion. I might return in the next post to why this is valid. For now, note that the first drift term pushes the excursion away from zero, while the second term brings it back to zero as t\rightarrow 1.

From this, the second drift term is essentially negligible if we care about scaling W^+(\epsilon) as \epsilon\rightarrow 0, and we can say that W^+(\epsilon)=\Theta(\sqrt{\epsilon}).

So, returning to the random walk, we have

\frac{S_{\epsilon N}}{\sqrt{\epsilon N}}\stackrel{d}{\approx} \frac{W^+(\epsilon)}{\sqrt{\epsilon}} = \Theta(1).

At a heuristic level, it’s tempting to try ‘taking N\rightarrow\infty while fixing \epsilon N‘, to conclude that there is a well-defined scaling limit for the RW conditioned to stay positive forever. But we came up with this estimate by taking N\rightarrow\infty and then \epsilon\rightarrow 0 in that order. So while the heuristic might be convincing, this is not the outline of a valid argument in any way. However, the SDE representation of W^+ in the \epsilon\rightarrow 0 regime is useful. If we drop the second drift term in (**), we define the three-dimensional Bessel process, which (again, possibly the subject of a new post) is the correct scaling limit we should be aiming for.

Finally, it’s worth observing that the limit \{T=\infty\}=\lim_{N\rightarrow\infty} \{T>N\} is a monotone limit, and so further tools are available. In particular, if we know that the trajectories of the random walk satisfy the FKG property, then we can define this limit directly. It feels intuitively clear that random walks should satisfy the FKG inequality (in the sense that if a RW is large somewhere, it’s more likely to be large somewhere else). You can do a covariance calculation easily, but a standard way to show the FKG inequality applies is by verifying the FKG lattice condition, and unless I’m missing something, this is clear (though a bit annoying to check) when the increments are Gaussian, but not in general. Even so, defining this monotone limit does not tell you that it is non-degenerate (ie almost-surely finite), for which some separate estimates would be required.

A final remark: in a recent post, I talked about the Skorohod embedding, as a way to construct any centered random walk where the increments have finite variance as a stopped Brownian motion. One approach to conditioning a random walk to lie above some discrete function is to condition the corresponding Brownian motion to lie above some continuous extension of that function. This is a slightly stronger conditioning, and so any approach of this kind must quantify how much stronger. In Section 4 of [BL], the authors do this for the random walk associated with the DGFF conditioned to lie above a polylogarithmic curve.

References

[BD] – Bertoin, Doney – 1994 – On conditioning a random walk to stay nonnegative

[BL] – Biskup, Louidor – 2016 – Full extremal process, cluster law and freezing for two-dimensional discrete Gaussian free field

[DIM] – Durrett, Iglehart, Miller – 1977 – Weak convergence to Brownian meander and Brownian excursion

[Ig1] – Iglehart – 1974 – Functional central limit theorems for random walks conditioned to stay positive

[Ig2] – Iglehart – 1974 – Random walks with negative drift conditioned to stay positive

[Ka] – Kaigh – 1976 – An invariance principle for random walk conditioned by a late return to zero

[KS] – Kesten, Stigum – 1966 – A limit theorem for multidimensional Galton-Watson processes

[Sk] – Skorohod – 1955 – Limit theorems for stochastic processes with independent increments

Advertisements

Poisson Random Measures

[This is a companion to the previous post. They explore different aspects of the same problem which I have been thinking about from a research point of view. So that they can be read independently, there has inevitably been some overlap.]

As I explained in passing previously, Poisson Random Measures have come up in my current research project. Indeed, the context where they have appeared seems like a very good motivation for considering the construction and some properties of PRMs.

We begin not with a Poisson variable, but with a standard Erdos-Renyi random graph G(n,\frac{c}{n}). The local limit of a component in this random graph is given by a Galton-Watson branching process with Poisson(c) offspring distribution. Recall that a local limit is description of what the structure looks like near a given (or random) vertex. Since the vertices in G(n,p) are exchangeable, this rooting matters less. Anyway, the number of neighbours in the graph of our root is given by Bin(n-1,c/n). Suppose that the root v_0, has k neighbours. Then if we are just interested in determining the vertices in the component, we can ignore the possibility of further edges between these neighbours. So if we pick one of the neighbours of the root, say v_1, and count the number of neighbours of this vertex that we haven’t already considered, this is distributed as Bin(n-1-k,c/n), since we discount the root and the k neighbours of the root.

Then, as n grows large, Bin(n-1,c/n) converges in distribution to Po(c). Except on a very unlikely event whose probability we can control if we need, so does Bin(n-1-k,c/n). Indeed if we consider a set of K vertices which are already connected in some way, then the distribution of the number of neighbours of one of them which we haven’t already considered is still Po(c) in the limit.

Now we consider what happens if we declare the graph to be inhomogeneous. The simplest possible way to achieve this is to specify two types of vertices, say type A and type B. Then we specify the proportion of vertices of each type, and the probability that there is an edge between two vertices of given types. This is best given by a symmetric matrix. So for example, if we wanted a random bipartite graph, we could achieve this as described by setting all the diagonal entries of the matrix to be zero.

So does the local limit extend to this setting? Yes, unsurprisingly it does. To be concrete, let’s say that the proportion of types A and B are a and b respectively, and the probabilities of having edges between vertices of various types is given by P=(p_{ij}/n)_{i,j\in\{A,B\}}. So we can proceed exactly as before, only now we have to count how many type A neighbours and how many type B neighbours we see at all stages. We have to specify the type of our starting vertex. Suppose for now that it is type A. Then the number of type A neighbours is distributed as

\text{Bin}(an,p_{AA}/n)\stackrel{d}{\rightarrow}\text{Po}(ap_{AA}),

and similarly the limiting number of type B neighbours is \sim \text{Po}(bp_{AB}). Crucially, this is independent of the number of type A neighbours. The argument extends naturally to later generations, and the result is exactly a multitype Galton-Watson process as defined in the previous post.

My motivating model is the forest fire. Here, components get burned when they are large and reduced to singletons. It is therefore natural to talk about the ‘age’ of a vertex, that is, how long has elapsed since it was last burned. If we are interested in the forest fire process at some fixed time T>1, that is, once burning has started, then we can describe it as an inhomogeneous random graph, given that we know the ages of the vertices.

For, given two vertices with ages s and t, where WLOG s<t, we know that the older vertex could not have been joined to the other vertex between times T-t and T-s. Why? Well, if it had, then it too would have been burned at time T-s when the other vertex was burned. So the only possibility is that they might have been joined by an edge between times T-s and T. Since each edge arrives at rate 1/n, the probability that this happens is 1-e^{-s/n}\approx \frac{s}{n}. Indeed, in general the probability that two vertices of ages s and t are joined at time T is \frac{s\wedge t}{n}.

Again at fixed time T>1, the sequence of ages of the vertices converges weakly to some fixed distribution (which depends on T) as the number of vertices grows to infinity. We can then recover the graph structure by assigning ages according to this distribution, then growing the inhomogeneous random graph with the kernel as described. The question is: when we look for a local limit, how to do we describe the offspring distribution?

Note that in the limit, components will be burned continuously, so the distribution of possible ages is continuous (with an atom at T for those vertices which have never been burned). So if we try to calculate the distribution of the number of neighbours of age s, we are going to be doomed, because with probability 1 then is no vertex of age s anywhere!

The answer is that the offspring distribution is given by a Poisson Random Measure. You can think of this as a Poisson Point Process where the intensity is non-constant. For example, let us consider how many neighbours we expect to have with ages [s,s+ds]. Let us suppose the age of our root is t>s+ds for now. Assuming the distribution of ages, f(\cdot) is positive and continuous, the number of vertices with these ages in the system is roughly nf(s)ds, and so the number of neighbours with this property is roughly \text{Bin}(nf(s)ds,\frac{s}{n}). In particular, this does have a Poisson limit. We need to be careful about whether this Poisson limit is preserved by the approximation. In fact this is fine. Let’s assume WLOG that f is increasing at s. Then the number of age [s,s+ds] neighbours can be stochastically bounded between \text{Bin}(nf(s)ds,\frac{s}{n}) and \text{Bin}(nf(s+ds)ds,\frac{s+ds}{n}. As n grows, these converge in the distribution to two Poisson random variables, and then we can let ds go to zero. Note for full formalism, we may need to account for the large deviations event that the number of age s vertices in the system is noticeably different from its expectation. Whether this is necessary depends on whether the ages are assigning deterministically, or drawn IID-ly from f.

One important result to be drawn from this example is that the number of offspring from disjoint type sets, say [s_1,s_2], [t_1,t_2] are independent, for the same reason as in the two-type setting, namely that the underlying binomial variables are independent. We are, after all, testing different sets of vertices! The other is that the number of neighbours with ages in some range is Poisson. Notice that these two results are consistent. The number of neighbours with ages in the set [s_1,s_2]\cup [t_1,t_2] is given by the sum of two independent Poisson RVs, and hence is Poisson itself. The parameter of the sum RV is given by the sum of the original parameters.

These are essentially all the ingredients required for the definition of a Poisson Random Measure. Note that the set of offspring is a measure of the space of ages, or types. (Obviously, this isn’t a probability measure.) We take a general space E, with sigma algebra \mathcal{E}, and an underlying measure \mu on E. We want a distribution \nu for measures on E, such that for each Borel set A\in\mathcal{E}, \nu(A), which is random because \nu is, is distributed as \text{Po}(\mu(A)), and furthermore, for disjoint A,B\in\mathcal{E}, the random variables \nu(A),\nu(B) are independent.

If M=\mu(E)<\infty, then constructing such a random measure is not too hard using a thinning property. We know that \nu(E)\stackrel{d}{=}\text{Po}(M), and so if we sample a Poisson(M) number of RVs with distribution given by \frac{\mu(\cdot)}{M}, we get precisely the desired PRM. Proving this is the unique distribution with this property is best done using properties of the Laplace transform, which uniquely defines the law of a random measure in the same manner that the moment generating function defines the law of a random variable. Here the argument is a function, rather than a single variable for the MGF, reflecting the fact that the space of measures is a lot ‘bigger’ than the reals, where a random variable is supported. We can extend this construction for sigma-finite spaces, that is some countable union of finite spaces.

One nice result about Poisson random measures concerns the expectation of functions evaluated at such a random measure. Recall that some function f evaluated at the measure \sum \delta_{x_i} is given by \sum f(x_i). Then, subject to mild conditions on f, the expectation

\mathbb{E}\nu (f)=\mu(f).

Note that when f=1_A, this is precisely one of the definitions of the PRM. So by a monotone class result, it is not surprising that this holds more generally. Anyway, I’m currently trying to use results like these to get some control over what the structure of this branching processes look like, even when the type space is continuous as in the random graph with specified ages.

Enhanced by Zemanta

Multitype Branching Processes

One of the fundamental objects in classical probability theory is the Galton-Watson branching process. This is defined to be a model for the growth of a population, where each individual in a generation gives birth to some number (possibly zero) of offspring, who form the next generation. Crucially, the numbers of offspring of the individuals are IID, with the same distribution both within generations and between generations.

There are several ways one might generalise this, such as non-IID offspring distributions, or pairs of individuals producing some number of offspring, but here we consider the situation where each individual has some type, and different types have different offspring distributions. Note that if there are K types, say, then the offspring distributions should now be supported on \mathbb{Z}_{\ge 0}^K. Let’s say the offspring distribution from a parent of type i is \mu^{(i)}.

The first question to address is one of survival. Recall that if we want to know whether a standard Galton-Watson process has positive probability of having infinite size, that is never going extinct, we only need to know the expectation of the offspring distribution. If this is less than 1, then the process is subcritical and is almost surely finite. If it is greater than 1, then it is supercritical and survives with positive probability. If the expectation is exactly 1 (and the variance is finite) then the process is critical and although it is still almost surely finite, the overall population size has a power-law tail, and hence (or otherwise) the expected population size is infinite.

We would like a similar result for the multitype process, saying that we do not need to know everything about the distribution to decide what the survival probability should be.

The first thing to address is why we can’t just reduce the multitype change to the monotype setting. It’s easiest to assume that we know the type of the root in the multitype tree. The case where the type of the root is random can be reconstructed later. Anyway, suppose now that we want to know the offspring distribution of a vertex in the m-th generation. To decide this, we need to know the probability that this vertex has a given type, say type j. To calculate this, we need to work out all the type possibilities for the first m generations, and their probabilities, which may well include lots of complicated size-biasing. Certainly it is not easy, and there’s no reason why these offspring distributions should be IID. The best we can say is that they should probably be exchangeable within each generation.

Obviously if the offspring distribution does not depend on the parent’s type, then we have a standard Galton-Watson tree with types assigned in an IID manner to the realisation. If the types are symmetric (for example if M, to be defined, is invariant under permuting the indices) then life gets much easier. In general, however, it will be more complicated than this.

We can however think about how to decide on survival probability. We consider the expected number of offspring, allowing both the type of the parent and the type of the child to vary. So define m_{ij} to be the expected number of type j children born to a type i parent. Then write these in a matrix M=(m_{ij}).

One generalisation is to consider a Galton-Watson forest started from some positive number of roots of various types. Suppose we have a vector \nu=(\nu_i) listing the number of roots of each type. Then the expected number of descendents of each type at generation n is given by the vector \nu M^n.

Let \lambda be the largest eigenvalue of M. As for the transition matrices of Markov chains, the Perron-Frobenius theorem applies here, which confirms that, because the entries of M are positive, the eigenvalue with largest modulus is simple and real, and the associated eigenvector has entirely positive entries. [In fact we need a couple of extra conditions on M, including that it is possible to get from any type to any other type – we say irreducible – but that isn’t worth going into now.]

So in fact the total number of descendents at generation n grows like \lambda^n in expectation, and so we have the same description of subcriticality and supercriticality. We can also make a sensible comment about the left-\lambda-eigenvector of M. This is the limiting proportion of the different types of vertices.

It’s a result (eg. [3]) that the height profile of a depth-first search on a standard Galton-Watson tree converges to Brownian Motion. Another way to phrase this is that a GW tree conditioned to have some size N has the Brownian Continuum Random Tree as a scaling limit as N grows to infinity. Miermont [4] proves that this result holds for the multitype tree as well. In the remainder of this post I want to discuss one idea along the way to the proof, and one application.

I said initially that there wasn’t a trivial reduction of a multitype process to a monotype process. There is however a non-trivial embedding of a monotype process in a multitype process. Consider all the vertices of type 1, and all the paths between such vertices. Then draw a new tree consisting of just the type 1 vertices. Two of these are joined by an edge if there is no other type 1 vertex on the unique path between them in the original tree. If that definition is confusing, think of the most sensible way to construct a tree on the type 1 vertices from the original, and you’ve probably chosen this definition.

There are two important things about this new tree. 1) It is a Galton-Watson tree, and 2) if the original tree is critical, then this reduced tree is also critical. Proving 1) is heavily dependent on exactly what definitions one takes for both the multitype branching mechanism and the standard G-W mechanism. Essentially, at a type 1 vertex, the number of type 1 descendents is not dependent on anything that happened at previous generations, nor in other branches of the original tree. This gives IID offspring distributions once it is formalised. As for criticality, we note that by the matrix argument given before, under the irreducibility condition discussed, the expectation of the total population size is infinite iff the expected number of type 1 vertices is also infinite. Since the proportion of type 1 vertices is given by the first element of the left eigenvector, which is positive, we can make a further argument that the number of type 1 vertices has a power-law tail iff the total population size also has a power-law tail.

I want to end by explaining why I was thinking about this model at all. In many previous posts I’ve discussed the forest fire model, where occasionally all the edges in some large component are deleted, and the component becomes a set of singletons again. We are interested in the local limit. That is, what do the large components look like from the point of view of a single vertex in the component? If we were able to prove that the large components have BCRT as the scaling limit, this would answer this question.

This holds for the original random graph process. There are two sensible ways to motivate this. Firstly, given that a component is a tree (which it is with high probability if its size is O(1) ), its distribution is that of the uniform tree, and it is known that this has BCRT as a scaling limit [1]. Alternatively, we know that the components have a Poisson Galton-Watson process as a local limit by the same argument used to calculate the increments of the exploration process. So we have an alternative description of the BCRT appearing: the scaling limit of G-W trees conditioned on their size.

Regarding the forest fires, if we stop the process at some time T>1, we know that some vertices have been burned several times and some vertices have never received an edge. What is clear though is that if we specify the age of each vertex, that is, how long has elapsed since it was last burned; conditional on this, we have an inhomogeneous random graph. Note that if we have two vertices of ages s and t, then the probability that there is an edge between them is 1-e^{-\frac{s\wedge t}{n}}, ie approximately \frac{s\wedge t}{n}. The function giving the probabilities of edges between different types of vertices is called the kernel, and here it is sufficiently well-behaved (in particular, it is bounded) that we are able to use the results of Bollobas et al in [2], where they discuss general sparse inhomogeneous random graphs. They show, among many other things, that in this setting as well the local limit is a multitype branching process.

So in conclusion, we have almost all the ingredients towards proving the result we want, that forest fire components have BCRT scaling limit. The only outstanding matter is that the Miermont result deals with a finite number of types, whereas obviously in the setting where we parameterise by age, the set of types is continuous. In other words, I’m working hard!

References

[1] Aldous – The Continuum Random Tree III

[2] Bollobas, Janson, Riordan – The phase transition in inhomogeneous random graphs

[3] Le Gall – Random Trees and Applications

[4] Miermont – Invariance principles for spatial multitype Galton-Watson trees

Enhanced by Zemanta

Discontinuous Phase Transitions

Yesterday, Demeter Kiss from Cambridge gave a seminar in Oxford about a model for self-destructive percolation on \mathbb{Z}^2 that had implications for the (non-)existence of an infinite-parameter forest fire model on the same lattice. I enjoyed talking about this and his recent work on the related model of frozen percolation on \mathbb{Z}^2. Considering these models in the lattice setting present a whole range of interesting geometric challenges that are not present in the mean-field case that has mainly occupied my research direction so far.

The afternoon’s discussion included lots of open problems about percolation. Several of these are based around continuity of the phase transition, so I thought I would write a quite post about some simple examples of this, and one example where it does not hold.

A helpful base example is bond percolation on the lattice \mathbb{Z}^2. Here, we specify some probability p in [0,1], and we declare edges of the lattice open with probability p, independently of each other. We then consider the graph induced by the open edges. We say that percolation occurs if the origin is contained in an infinite open component. The terminology arises from the interpretation as fluid being added at the origin and flowing down open edges. We define \theta(p) to be the probability that the origin is in an infinite component when the parameter is p. By translation-invariance, we can get some sort of 0-1 law, to conclude that there is an infinite component somewhere in the system with probability either 0 or 1, depending on whether \theta(p) is positive or zero. Indeed, we can further show that if it is positive, then with probability 1 there is a unique infinite component.

We define the critical probability p_c:= \inf\{\theta(p)>0\}. A question worth asking is then, what is \theta(p_c)? In some examples, we can find p_c, but we cannot prove that \theta(p) is continuous around p_c. In the case of \mathbb{Z}^2 this is known, and it is known from work of Kesten that p_c=1/2. See below for a plot of \theta(p) in this setting (obtained from this blog, though possibly originating elsewhere).

percolation probabilityThe aim is to find an example where we do not have such a continuous phase transition. The original work on frozen percolation took place on trees, and one of Kiss’s results is confirms that these show qualitatively different phenomena to the same process on the lattice. In some sense, trees lie halfway between a lattice and a mean-field model, since there is often some independence when we look down the tree from a given generation, if it is well-defined to use such language.

Anyway, first we consider percolation on an infinite regular rooted k-ary tree. This means we have a root, which has k children, each of which in turn has k children, and so on. As before we consider bond percolation with parameter p. In this setting, we have a language to describe the resulting open component of the root. The offspring distribution of any vertex in the open component is given by Bin(k,p) independently of everything else, so we can view this component as the realisation of a Galton-Watson tree with this offspring distribution. This distribution has finite mean kp, and so we can state explicitly when the survival probability is positive. This happens when the mean is greater than 1, ie p>1/k.

For our actual example, we will consider the survival probability, but the technicalities are easier to explain if we look at the extinction probability, now using the language of branching processes. Suppose the offspring distribution has pgf given by

f(x)=p_0+p_1x+p_2x^2+\ldots.

Then the extinction probability q satisfies f(q)=q. I want to pause to consider what happens if this equation has multiple solutions. Indeed, in most interesting cases it will have multiple solutions, since f(1) will always be 1 if it is a non-defective offspring distribution. It is typically cited that: the extinction probability q is the smallest solution to this equation. I want to discuss why that is the case.

To approach this, we have to consider what extinction means. It is the limit in the event sense of the events {we are extinct after n generations}. Let the probabilities of these events be q_n, so q_0=0. Then by a straightforward coupling argument, we must have

0=q_0\le q_1\le q_2 \le\ldots\le q:= \lim q_n \le 1.

But, by the same generating function argument as before, q_{n+1}=f(q_n)\ge q_n. So if we split [0,1] into regions A where f(x)\ge x and B where f(x)<x, all the (q_n)s must occur in the former, and so since it is closed, their limit must be in A also. Note that if f(x) intersects x lots of times, then region A is not necessarily connected. In the diagram below, in moving from q_n to q_{n+1} we might jump across part of B.

Iterative percolation graphThis is bad, as we are trying to prove that q is the right boundary of the connected component of A containing 0. But this cannot happen, as f is monotonic. So if one of the roots of f(x)=x in between the hypothesised q_n<q_{n+1} is called z, then f(q_n)< f(z)=z < q_{n+1}, a contradiction.

Ok, so now we are ready to consider our counterexample to continuity over the percolation threshold. See references for a link to the original source of this example. We have to choose a slightly more complicated event than mere survival or extinction. We consider bond percolation as before on the infinite ternary tree, where every vertex has precisely 3 offspring. Our percolation event is now that the root is the root of an infinite binary tree. That is, the root has at least two children, each of which have at least two children, each of which, and so on.

If we set this probability equal to q, and the probability of an edge being open equal to p, then we have the recurrence:

q=3p^2(1-p)q^2+p^3[3q^2(1-q)+q^3].

The first term corresponds to the root having two open edges to offspring, and the second to the root having all three open edges to offspring. After manipulating, we end up with

q\left[2p^3q^2-3p^2q+1\right]=0.

We are therefore interested in roots of the quadratic lying between 0 and 1. The discriminant can be evaluated as

\Delta=p^3(9p-8),

and so there are no real roots where p<8/9. But when p=8/9, we have a repeated root at q=27/32, which is obviously not zero!

This equation is qualitatively different to the previous one for the extinction probability of a Galton-Watson tree. There, we had a quadratic, with one root at 1. As we varied p, the other root moved continuously from greater than one to less than one, so it passed through 1, giving continuity at the critical probability. Here, we have a cubic, again with one root at 1. But now the other roots are complex for small p, meaning that the local minimum of the cubic lies above the x-axis. As p gets to the critical value, it the local minimum passes below the x-axis, and suddenly we have a repeated root, not at zero.

I would like to have a neat probabilistic heuristic for this result, without having to make reference to generating functions. At the moment, the best I can come up with is to say that the original problem is simple, in the sense that the critical probability is as small as it could be while still making sense in expectation. To be concrete, when the mean of the offspring generation is less than 1, the expected size of the nth generation tends to zero, so there certainly could not be positive probability of having an infinite component.

Whereas in the binary tree example, we only require p=2/3 to have, in expectation, the right number of open edges to theoretically allow an infinite binary tree. If we think of percolation as a dynamic process by coupling in p, essentially as we move from p=2/3 to p=8/9 we need to add enough edges near the origin to be able to take advantage of the high density of edges available far from the origin. The probability of this working given you start from n vertices grows much faster (as n grows) than in the original problem, so you might expect a faster transition.

This is so content-free I’m reluctant even to call it a heuristic. I would be very interested to hear of any more convincing argument for this phenomenon!

REFERENCES

Dekking, Pakes – On family trees and subtrees of simple branching processes (link)

Enhanced by Zemanta

The Configuration Model

In the past, I’ve talked about limitations of the Erdos-Renyi model of homogeneous random graphs for applications in real-world networks. In a previous post, I’ve discussed a dynamic model, the Preferential Attachment mechanism, that ‘grows’ a graph dynamically by adding edges from new vertices preferentially to existing vertices with high degree. The purpose of this adjustment is to ensure that the distribution of the degrees is not concentrated around some fixed value (which would be c in G(n,c/n) ) but rather exhibits a power-law tail such as observed in many genuine examples.

In this post, we introduce some aspects of the configuration model, which achieves this property more directly. This idea probably first arose in the guise of regular graphs. Recall a regular graph has all degrees equal. How would we construct a random d-regular graph on a large number of vertices?

What we probably want to do is to choose uniformly at random from the set of such graphs, but it is not clear even how large this set is, let alone how one would order its elements to make it possible to make this uniform choice. Instead, we try the following. Assign to each vertex d so-called stubs, which will end up being ‘half-edges’. We then choose two stubs uniformly at random, and glue them together. More formally, we construct an edge between the host vertices, and then delete the chosen stubs. We then continue.

The construction makes no reference to the distribution of stubs, so we are free to choose this as we please. We could for example specify some sequence of degrees which approximates a power-law, so we could sample a random sequence of degrees in some way. So long as we have a sequence of stub set sizes before we start building the edges of the graph we will be able to use the above algorithm.

So what might go wrong? There seem to me to be three potential problems that might arise with this construction.

Firstly, there might be a stub left over, if the sum of the stub set sizes is odd. Recall that in a graph the sum of the degrees is twice the sum of the number of edges, and so in particular the sum of the degrees should be even. But this is a small problem. When the degree sequence is deterministic we can demand that it have even sum, and if it is random, we will typically be working in a large N regime, and so deleting the solitary stub, if such a thing exists, will not affect the sort of properties of the graph we are likely to be interested in.

The second and third objections are perhaps more serious. If we glue together stubs naively, we might end up with loops, that is, edges that ‘begin’ and ‘end’ at the same vertex. These are not allowed in the standard definition of a graph. Alternatively, we might end up with more than one edge between the same pair of vertices.

Our overall aim is that this mechanism gives a convenient way of simulating the uniform distribution on simple graphs with a given degree sequence. At present we have the uniform distribution on potential multigraphs, with a weighting of 1/k! for every multi-edge with multiplicity k, and a weighting of 1/2 for every loop. The latter can be seen because there is an initial probability proportional to d(v_i)d(v_j) that vertices v_i and v_j will be joined, whereas a probability proportional (with the same constant) to d(v_i)^2 that v_i will receive a loop. The multi-edge weighting justification is similar.

However, conditional on getting a simple graph, the distribution is uniform on the set of simple graphs with that degree sequence. So it remains to investigate the probability that a graph generated in this way is simple. So long as this probability does not tend to 0 as n grows, we will probably be happy.

The strongest results on this topic are due to Janson. First observe that if the sum of the degrees grows faster than the number of vertices n, we fail to get a graph without loops with high probability. Heuristically, note that on the first pass, we are taking two picks from the set of vertices, biased by the number of stubs. By Cauchy-Schwarz, Rearrangement Inequality or just intuition, the probability of getting the same vertex is greater than if we picked uniformly from the set of vertices without biasing. So the probability of getting no loop on the first pass is \le (1-\frac{1}{n}). Take some function a(n) that grows faster than n, but slower than the sum of the degrees. Then after a(n) passes, the degree distribution is still roughly the same. In particular, the sum of the degrees is still an order of magnitude greater than n. So we obtain:

\mathbb{P}(\text{no loops})\leq (1-\frac{1}{n})^{a(n)}\approx e^{-\frac{a(n)}{n}}\rightarrow 0.

So, since isolated vertices have no effect on the simplicity or otherwise, we assume the sum of the degrees is \Theta(n). Then, Janson shows that the further condition

\sum_{i=1}^n d_i^2=O(n),

is essentially necessary and sufficient for simplicity. We can see why this might be true by looking at the probability that the first edge added is a loop, which is roughly

\frac{d_1^2+d_2^2+\ldots+d_n^2}{2(\sum d_i)^2}.

We have to consider O(\sum d_i) edges, so if the above expression is much larger than this, we can perform a similar exponential estimate to show that the probability there are no loops is o(1). The technical part is showing that this probability doesn’t change dramatically as the first few stubs disappear.

Note that in both cases, considering only loops is sufficient for simplicity. Although it looks like loop appearance is weaker than multiplicity of edges, in fact they have the same threshold. It should also be pointed out that, like the uniform random forests, an alternative approach is simply to count the number of simple graphs and multigraphs with a given degree sequence. Good asymptotics can then be found for the probability of simplicity.

In the case of G(n,c/n), we were particularly interested in the emergence of the giant component at time c=1. While first-moment methods can be very effective in demonstrating such results, a branching process local limit representation is probably easiest heuristic for this phase transition.

So long as the degree sequences converge in a natural way, we can apply a similar approach to this configuration model. Concretely, we assume that the proportion of vertices with degree i is \lambda_i in the limit. Although the algebra might push through, we should be aware that this means we are not explicitly specifying how many vertices have degree, eg \Theta(n^{1/2}). For now assume the \lambda_is sum to 1, so specify a probability distribution for degree induced by choosing a vertex uniformly at random.

So we start at a vertex, and look at its neighbours. The expected number of neighbours of this root vertex is \sum i\lambda i. Thereafter, when we consider a child vertex, based on how the stubs are paired up (and in particular the fact that the order of the operations does not matter – the choice of partner of a given stub is chosen uniformly at random), we are really choosing a stub uniformly at random. This corresponds to choosing a vertex at random, biased by the number of stubs available. The quantity of interest is how many additional stubs (other than the one that led to the vertex) are attached to this vertex. We assume we don’t need to worry too much about repeating vertices, in a similar way to G(n,c/n). So the expected number of additional stubs is

\frac{1}{\sum i\lambda_i}\sum i\lambda_i(i-1).

For an infinite component, we required the expectation to be > 1, which is equivalent to

\sum \lambda_i i(i-2)>0.

This was proven by Molloy and Reed (95), then with fewer conditions by Janson (07). The latter also shows how to use this construction to derive the giant component for G(n,c/n) result.

REFERENCES

Janson – A New Approach to the Giant Component Problem

Molloy, Reed – A Critical Point for Random Graphs with a Given Degree Sequence

Janson – The Probability that  Random Multigraph is Simple

The Contour Process

As I explained in my previous post, I haven’t been reading around as much as I would generally like to recently. A few days in London staying with my parents and catching up with some friends has therefore been a good chance to get back into the habit of leafing through papers and Pitman’s book among other things.

This morning’s post should be a relatively short one. I’m going to define the contour process, a function of a (random or deterministic) tree, related to the exploration process which I have mentioned a few times previously. I will then use this to prove a simple but cute result equating in distribution the sizes of two different branching processes via a direct bijection.

The Contour Process

To start with, we have to have a root, and from that root we label the tree with a depth-first labelling. An example of this is given below. It is helpful at this stage to conceive this process as an explorer walking on the tree, and turning back on themselves only when there is no option to visit a vertex they haven’t already seen. So in the example tree shown, the depth-first exploration visits vertex V_2 exactly four times. Note that with this description, it is clear that the exploration traverses every edge exactly twice, and so the length of the sequence is 2n-1, where n is the number of vertices in the tree since obviously, we start and end at the root.

Another common interpretation of this depth-first exploration is to take some planar realisation of the tree. (Note trees are always planar – proof via induction after removing a leaf.) Then if you treat the tree as a hedge and starting at the root walk along, following the outer boundary with your right hand, this exactly recreates the process.

The height of a tree at a particular vertex is simply the graph distance between that vertex and the root. So when we move from one vertex to an adjacent vertex, the height must increase or decrease by 1.

The contour process is the sequence of heights seen along the depth-first exploration. It is therefore a sequence:

0=h_0,h_1,\ldots,h_{2n-1}=0,\quad h_i\geq 0,

and such that |h_{i+1}-h_i|=1.

Note that though the contour process uniquely determines the tree structure, the choice of depth-first labelling is a priori non-canonical. For example, in the display above, V_3 might have been explored before V_2. Normally this is resolved by taking the suitable vertex with the smallest label in the original tree to be next. It makes little difference to any analysis to choose the ordering of descendents of some vertex in a depth-first labelling randomly. Note that this explains why it is rather hard to recover Cayley’s theorem about the number of rooted trees on n vertices from this characterisation. Although the number of suitable contour functions is possible to calculate, we would require a complicated multiplicative correction for labelling if we wanted to recover the number of trees.

The only real observation about the uses of the contour process at this stage is that it is not in general a random walk with IID increments for a Galton-Watson branching process. This equivalence is what made the exploration process so useful. In particular, it made it straightforward, at least heuristically, to see why large trees might have a limit interpretation through Brownian excursions. If for example, the offspring distribution is bounded above, say by M, then the contour process certainly cannot be a random walk, as if we have visited a particular vertex exactly M+1 times, then it cannot have another descendent, and so we must return closer to the root at the next step.

I want to mention that in fact Aldous showed his results on scaling limits towards the Continuum Random Tree through the contour process rather than the exploration process. However, I don’t want to say any more about that right now.

A Neat Equivalence

What I do want to talk about is the following distribution on the positive integers. This comes up in Balazs Rath and Balint Toth’s work on forest-fires on the complete graph that I have been reading about recently. The role of this distribution is a conjectured equilibrium distribution for component size in a version of the Erdos-Renyi process where components are deleted (or ‘struck by lightning’) at a rate tuned so that giant components ‘just’ never emerge.

This distribution has the possibly useful property that it is the distribution of the total population size in a Galton-Watson process with Geom(1/2) offspring distribution. It is also the distribution of the total number of leaves in a critical binary branching process, where every vertex has either two descendents or zero descendents, each with probability 1/2. Note that both of these tree processes are critical, as the expected number of offspring is 1 in each case. This is a good start, as it suggests that the relevant equilibrium distribution should also have the power-law tail that is found in these critical branching processes. This would confirm that the forest-fire model exhibits self-organised criticality.

Anyway, as a sanity check, I tried to find a reason why, ignoring the forest-fires for now, these two distributions should be the same. One can argue using generating functions, but there is also the following nice bijective argument.

We focus first on the critical Geometric branching process. We examine its contour function. As explained above, the contour process is not in general a random walk with IID increments. However, for this particular case, it is. The geometric distribution should be viewed as the family of discrete memoryless distributions.

This is useful for the contour process. Note that if we are at vertex V for the (m+1)th time, that is we have already explored m of the edges out of V, then the probability that there is at least one further edge is 1/2, independently of the history of the exploration, as the offspring distribution is Geometric(1/2), which we can easily think of as adding edges one at a time based on independent fair coin tosses until we see a tail for example. The contour process for this random tree is therefore a simple symmetric random walk on Z. Note that this will hit -1 at some point, and the associated contour process is the RW up to the final time it hits 0 before hitting -1. We can check that this obeys the clear rule that with probability 1/2 the tree is a single vertex.

Now we consider the other model, the Galton-Watson process with critical binary branching mechanism. We should consider the exploration process. Recall that the increments in this process are given by the offspring distribution minus one. So this random sequence also behaves as a simple symmetric random walk on Z, again stopped when we hit -1.

To complete the bijective argument, we have to relate leaves in the binary process to vertices in the geometric one. A vertex is a leaf if it has no offspring, so the number of leaves is the number of times before the hitting time of -1 that the exploration process decreases by 1. (*)

Similarly for the contour process. Note that there is bijection between the set of vertices that aren’t the root and the set of edges. The contour process explores every edge exactly twice, once giving an increase of 1 and once giving a decrease of 1. So there is a bijection between the times that the contour process decreases by 1 and the non-root vertices. But the contour process was defined only up to the time we return to the root. This is fine if we know in advance how large the tree is, but we don’t know which return to the root is the final return to the root. So if we extend the random walk to the first time it hits -1, the portion up until the last increment is the contour process, and the final increment must be a decrease by 1, hence there is a bijection between the number of vertices in the Geom(1/2) G-W tree and the number of times that the contour process decreases by 1 before the hitting time of -1. Comparing with (*) gives the result.

Local Limits

In several previous posts, I have talked about scaling limits of various random graphs. Typically in this situation we are interested in convergence of large-scale properties of the graph as the size grows to some limit. These properties will normally be metric in flavour: diameter, component size and so on. To describe convergence of these properties, we divide by the relevant scale, which will often be some simple function of n. If we are looking to find an actual limit object, this is even more important. This is rather similar to describing properties of centred random walks. There, if we run the walk for time n, we have to rescale by \frac{1}{\sqrt{n}} to see the fluctuations on a finite positive scale.

One of the best examples is Aldous’ Continuum Random Tree which we can view as the limit of a Galton-Watson tree conditioned to have total size n, as n tends to infinity. Because of the exploration process or contour process interpretation, where these functions behave rather like a random walk, the correct scaling in this context is again \frac{1}{\sqrt{n}}. The point about this convergence is that it is realised entirely as a convergence of some function that represents the tree. For each finite n, it is clear that the tree with n vertices is a graph, but this is neither clear nor true for the limit object. Although it does indeed have no cycles, if nothing else, if the CRT were a graph it would have [0,1] as vertex set and then would be highly non-obvious how to define the edges.

Local limits aim to give convergence towards a (discrete) infinite graph. The sort of properties we are looking for are now local properties such as degrees and correlations of degrees. These don’t require knowledge of the whole graph, only of some finite subset. First consider the possibility that the sequence of deterministic graphs has the property:

G_1\leq G_2\leq G_3\leq\ldots

where \leq denotes an induced subgraph. Then it is relatively clear what the limit should be, as it is well-defined to take a union. This won’t work directly for a limit of random graphs, because the above relation in probability doesn’t even really make sense if we have a different probability space for each finite graph. This is a general clue that we should be looking to use convergence in distribution rather than anything stronger.

In the previous example, suppose the first finite graph G_1 consists of a single vertex v. If the limit graph (remember this is just the union, since that is well-defined) has bounded degrees, then there is some N such that G_N contains all the information we might want about the limiting neighbourhood of vertex v. For some larger N, G_N contains all the vertex and edges within distance r from our starting vertex v that appear in the limit graph.

This is all the motivation we require for a genuine definition. We will define our limit in terms of neighbourhoods, so we need some mechanism to choose the central vertex of such a neighbourhood. The answer is to consider rooted graphs, that it a graph with an identified vertex. We can introduce randomness by specifying a random graph, or by giving a distribution for the choice of root. If G is finite, the canonical choice is to choose the root uniformly from the set of vertices. This isn’t an option for an infinite graph, so we define the system as (G, p) where G is a (for now deterministic) graph, and p is a probability measure on V(G).

We say that the limit of finite (G_n) is the random rooted infinite graph (G, p) if the neighbourhoods of G_n around a randomly chosen vertex converge in distribution to the neighbourhoods of G around p. Formally, say (G_n)[U_n]\stackrel{d}{\rightarrow} (G,p) if for all r>0, for any finite rooted graph (H,w), the probability that (H,w) is isomorphic to the ball of radius r in G_n centred at randomly chosen $v_n$ converges to the probability that (H,w) is isomorphic to the ball of radius r around v in (G,v), where v is distributed according to measure p.

Informally, we might say that if we zoom in on an average vertex in G_n for large n, the neighbourhood looks the same as the neighbourhood around the root in (G, p). We now consider three examples.

1) When we talk about approximating the component size in a sparse Erdos-Renyi random graph by a \text{Po}(\lambda) branching process, this is exactly the limit sense we mean. The approximation fails if we fix n and take the neighbourhood size very large (eg radius n), but for finite neighbourhoods, or any radius growing more slowly than n, the approximation is good.

2) To emphasise why rooting the finite graphs makes a difference, consider the full binary tree with n levels (so 2^n-1 vertices). If we fix the root, then the limit is the infinite-level binary tree, though this isn’t especially surprising or interesting.

Things get a bit more complicated if we root randomly. Remember that the motivation for random rooting is that we want to know the local structure around a vertex chosen at random in many applications. If we definitely know what vertex we are going to choose, we know the local structure a priori. Note that in an n-level binary tree, 2^{n-1} vertices are leaves, not counting the base of the tree, and 2^{n-2} are distance 1 from a leaf, and 2^{n-3} are distance 2 from a leaf and so on.

This gives us a precise description of the limiting local neighbourhood structure. The resulting limiting object is called the canopy tree. One picture of this can be found on page 6 of this paper. A verbal description is also possible. Consider the set of non-negative integers, arranged in the usual manner on the real line, with edges between adjacent elements. The distribution of the root will be supported on this set of vertices, corresponding to the distance from the leaves in the pre-limit graph. So we have mass 1/2 at 0, 1/4 at 1, 1/8 at 2 and so on. We then connect each vertex k to a full k-level binary tree. The resulting canopy tree looks like an infinite-level full binary tree, viewed from the leaves, which is of course a reasonable heuristic, since that is there the mass is concentrated if we randomly root.

3) In particular, the limit is not the infinite-level binary tree. The canopy tree and the infinite-level binary tree have qualitatively different properties. Simple random walk on the canopy tree is recurrent for example. In fact, a result of Benjamini and Schramm, as explained in this review by Curien, says that any local limit of uniformly bounded degree, uniformly rooted, planar graphs is recurrent for SRW. The infinite-level binary tree can be expressed as a local limit if we choose the root distribution sensibly, using large random 3-regular graphs. The previous result does not apply because the random 3-regular graphs are not almost surely planar.

REFERENCES:

– Much of this article is a paraphrase of a section of Itai Benjamini’s mini-course at the DSSA in Haifa March 2013.

– As well as the review paper linked above, these notes by David Aldous were very useful.

Uniform Spanning Trees

For applications to random graphs, the local binomial structure and independence means that the Galton-Watson branching process is a useful structure to consider embedding in the graph. In several previous posts, I have shown how we can set up the so-called exploration process which visits the sites in a component as if the component were actually a tree. The typical degree is O(1), and so in particular small components will be trees with high probability in the limit. In the giant component for a supercritical graph, this is not the case, but it doesn’t matter, as we ignore vertices we have already explored in our exploration process. We can consider the excess edges separately by ‘sprinkling’ them back in once we have the tree-like backbone of all the components. Again, independence is crucial here.

I am now thinking about a new model. We take an Erdos-Renyi process as before, with edges arriving at some fixed rate, but whenever a cycle appears, we immediately delete all the edges that make up the cycle. Thus at all times the system consists of a collection (or forest) of trees on the n vertices. So initially this process will look exactly like the normal E-R process, but as soon as the components start getting large, we start getting excess edges which destroy the cycles and make everything small again. The question to ask is: if we run the process for long enough, roughly how large are all the components? It seems unlikely that the splitting mechanism is so weak that we will get true giant components forming, ie O(n) sizes, so we might guess that, in common with some other split-merge models of this type, we end up with components of size n^{2/3}, as in the critical window for the E-R process.

In any case, the scaling limit process is likely to have components whose sizes grow with n, so we will have a class of trees larger than those we have considered previously, which have typically been O(1). So it’s worth thinking about some ways to generate random trees on a fixed number of vertices.

Conditioned Galton-Watson

Our favourite method of creating trees is inductive. We take a root and connect the root to a number of offspring given by a fixed distribution, and each of these some offspring given by an independent sample from the same distribution and so on. The natural formulation gives no control over the size of the tree. This is a random variable whose distribution depends on the offspring distribution, and which in some circumstances be computed explicitly, for example when the offspring distribution is geometric. In other cases, it is easier to make recourse to generating functions or to a random walk analogue as described in the exploration process discussion.

Of course, there is nothing to stop us conditioning on the total size of the population. This is equivalent to conditioning on the hitting time of -1 for the corresponding random walk, and Donsker’s theorem gives several consequences of a convergence relation towards a rescaled Brownian excursion. Note that there is no a priori labelling for the resulting tree. This will have to be supplied later, with breadth-first and depth-first the most natural choices, which might cause annoyance if you actually want to use it. In particular, it is not obvious, and probably not true unless you are careful, that the distribution is invariant under permuting the labels (having initially assumed 1 is the root etc) which is not ideal if you are embedding into the complete graph.

However, we would like to have some more direct constructions of random trees on n vertices. We now consider perhaps the two best known such methods. These are of particular interest as they are applicable to finding random spanning trees embedded in any graph, rather than just the complete graph.

Uniform Spanning Tree

Given a connected graph, consider the set of all subgraphs which are trees and span the vertex set of the original graph. An element of this set is called a spanning tree. A uniform spanning tree is chosen uniformly at random from the set of spanning trees on the complex graph on n vertices. A famous result of Arthur Cayley says that the number of such spanning trees is n^{n-2}. There are various neat proofs, many of which consider a mild generalisation which gives us a more natural framework for using induction. This might be a suitable subject for a subsequent post.

While there is no objective answer to the question of what is the right model for random trees on n vertices, this is what you get from the Erdos-Renyi process. Formally, conditional on the sizes of the (tree) components, the structures of the tree components are given by UST.

To see why this is the case, observe that when we condition that a component has m vertices and is a tree, we are demanding that it be connected and have m-1 edges. Since the probability of a particular configuration appearing in G(n,p) is a function only of the number of edges in the configuration, it follows that the probability of each spanning tree on the m vertices in question is equal.

Interesting things happen when you do this dynamically. That is, if we have two USTs of sizes m and n at some time t, and condition that the next edge to be added in the process joins them, then the resulting component is not a UST on m+n vertices. To see why, consider the probability of a ‘star’, that is a tree with a single distinguished vertex to which every other vertex is joined. Then the probability that the UST on m vertices is a star is \frac{m}{m^{m-2}}=m^{-(m-3)}. By contrast, it is not possible to obtain a star on m+n vertices by joining a tree on m vertices and a tree on n vertices with an additional edge.

However, I think the UST property is preserved by the cycle deletion mechanism mentioned at the very start of this post. My working has been very much of the back of the envelope variety, but I am fairly convinced that once you have taken a UST and conditioned on the sizes of the smaller trees which result from cycle deletion. My argument is that you might as well fix the cycle to be deleted, then condition on how many vertices are in each of the trees coming off this cycle. Now the choice of each of these trees is clearly uniform among spanning trees on the correct number of vertices.

However, it is my current belief that the combination of these two mechanisms does not give UST-like trees even after conditioning on the sizes at fixed time.

Beyond Erdos-Renyi: more realistic models of networks

The claim is often made that the study of random graphs such as the Erdos-Renyi model is worthwhile because it gives us information about complex systems which exist in the real world. The internet or social networks provide the example du jour at the moment, but it’s equally plausible to think about traffic flows, electrical systems or interacting biological processes too.

If this were entirely true, it would be great for two reasons. Firstly, in my opinion at least, it is a beautiful subject in its own right, and to have a concrete applicable reason to continue studying it would make it even better. (Not to mention the dreaded competition for funding…) Secondly, Erdos-Renyi is so simple. After all, it involves little more than adding some simple topology to a collection of IID Bernoulli random variables, and so it would surely be possible to draw some significant conclusions about how complicated real-world objects interact without too much mathematical effort.

Unfortunately, but unsurprising, this simplicity is a drawback as far as applications go. It is fairly clear that most real-world systems cannot offer any property even approaching the niceness of the independent, same probability edges condition. But rather than consign E-R to the ‘pretty but useless’ category of mathematical structures, we should think carefully about exactly why it fails to be a good model for real-world networks, and see whether there are any small adjustments that could be made to improve it.

This is something I’ve been meaning to read up about for ages and ages. What follows is based heavily on the Albert and Barabasi 2002 review paper. I suspect that many of the open problems and intuitive calculations have since been finished and formalised, but for an overview I hope that doesn’t matter hugely. I’ve also leafed through the relevant chapters of Remco Van der Hofstad’s notes, but am setting the details and the exercises aside for the holidays when I have a bit more time!

Problems with Erdos-Renyi

Recall that G(n,p) takes n vertices, and adds edges between any pair of vertices independently with probability p.

One property shared by most real-world networks is the scale-free phenomenon, which says that the degree distribution has a power law tail. The Albert-Barabasi papers gives a comprehensive survey of data verifying this claim. By contrast, G(n,p) has degree distribution which is approximately Poisson as n grows. This is concentrated near the average degree with a thin exponential tail, so does not satisfy this requirement. I was and still am a bit confused by the term ‘scale-free’. The idea is certainly that the local structure is independent of the size of the system, which seems to be true for the degree distributions in sparse ER, that is where p = O(1/n). But I think the correct heuristic is that it doesn’t matter how far zoomed in you are – the macroscopic structure looks similar for n vertices as for n^2 vertices. This certainly fails to be true for ER, where no vertex has O(n) neighbours, whereas with a power law tail, this does hold.

The main consequence of this is that there are a few vertices with very high degree. These are often called ‘hubs’ and parallels are drawn to the internet, where key websites and servers connect lots of traffic and pages from different areas. The idea is that the hubs are almost certainly well-connected to each other, and this offers a step towards a small-world phenomenon, where the shortest path between any two vertices is very small relative to the size of the system. This notion was introduced to mainstream culture by Stanley Milgram’s ‘Six degrees of separation’ experiment in the 60s, where it became clear that subjects were able to deliver a package to a complete stranger on the other side of the USA, using only personal contacts, in about six stages. The graph theoretic notion for this is the diameter, defined as the maximal graph distance between two points. Here, the graph distance means the length of the shortest path between the points. This definition, with the max-min formalism looks rather complicated, but isn’t really. The diameter of an Erdos-Renyi graph for fixed p, increases like log n, which is small relative to n, and so this property holds.

A quick glance at your list of Facebook friends will confirm that the independent edges condition in an Erdos-Renyi random graph is not a plausible model for social networks. How many friends do you have? Let’s say about 1000, more to make the calculation easier than because you’re necessarily very popular. How many does your friend Tom have? Let’s say 1000 again. As was in the news a few months ago, there are now over a billion people on Facebook. Let’s say exactly a billion (that is 10^9 for these purposes). So both you and Tom are friends with 1/10^6 of the total membership of the network. So how large would you expect the overlap of your friendships to be, if they were all chosen independently at random? Well, the probability that you are both friends with Alice is 10^-12, and so the expected number of your mutual friends is 10^-12 x 10^9 = 10^-3 which is substantially less than 1. Yet I imagine if you substituted names suitably, you and Tom might well have over 50 mutual friends if you were, say, in the same year at school or niversity and haven’t yet purged your list.

We want a statistic that records this idea quantitatively. There are various candidates for such a clustering coefficient. The underlying notion that we might expect there to be greater connectivity between neighbours of some fixed point v than in the graph as a whole gives an intuition for a possible definition. Compare the proportion of triangles in the graph to the cube of the proportion of edges. When this ratio is large, then there is a lot of clustering. In the E-R case, we would expect these to be equal, as the probability of forming a triangle is equal to the cube of the probability of the presence of each of the three independent edges that make up the triangle.

So we have three properties of real networks that we would like to incorporate into a model: small diameter, power-law degree distribution, and high clustering. To avoid this turning into a book, I’m going to write a paragraph about each of the possibilities discussed by Albert and Barabasi.

Generalised Random Graph

The degree distribution will typically emerge as a consequence of the construction of a given model. The general idea here is to condition on the degree distribution having the form we want, and see what this does to the structure. Of course, the choice of how to do this conditioning is absolutely key. It certainly isn’t obvious what it means to ‘condition G(n,p) to have power-law distribution’, since the very idea of a power-law vs exponential tail requires the number of vertices to be large.

The first idea for achieving this gives the vertices ‘stubs’, which join up in pairs to form edges. We decide on the distribution of stubs according to this power law, then pair them up uniformly at random. Obviously, there is a possibility of getting some loops, but this is not going to happen so often as to be a genuine problem in the limit. This construction is similarly open to the branching process exploration ideas well covered for the E-R random graph, though we have to be careful to size-bias the degree distributions when necessary. There is still an underlying independence in the location of edges though, so it is reasonably clear that the amount of clustering may be closer to E-R than to the real examples cited.

The other possibility suggested is to retain the independent edge property, but give the vertices weights, and let the probability of an edge between two vertices be some sensible function of the weights. In the end it turns out to make little difference whether the weights are chosen deterministically or randomly, but by taking the weights i.i.d. with infinite mean, we can generate a so-called generalised random graph where the degree distribution has a power law.

Watts-Strogatz

In the WS model, the idea is to interpolate between a graph with maximal clustering and a random graph. A d-regular graph, say on a ring, where every vertex is connected to its d nearest neighbours has high clustering, but large diameter, as for example it takes roughly n/2d steps to get to the other side of the ring. Whereas in the standard E-R model we add edges with some fixed probability p, here we replace edges with some fixed probability p. That is, we take an edge in the regular graph and with some small probability we remove it and instead add an edge between two vertices chosen uniformly at random. The theoretical motivation is that removing a few edges doesn’t destroy the high clustering evident in the regular graph, but even a sparse random graph has small diameter, so adding a few ‘long-range’ edges should be enough to decrease the diameter significantly.

It obviously needs to be checked that a substantial drop in diameter occurs before a substantial decrease in clustering, and there is a calculation and diagram to support this intuitive idea in the paper. The one drawback of this model is that it fails to provide the power-law degree distributions we want. After all, an E-R graph has a concentrated degree distribution, and a d-regular graph has all degrees the same, so we would expect some interpolation between the two to have a concentrated distribution as well. Nonetheless, this model accords well with an idea of how complex networks might form, particularly if there is some underlying geometry. It is reasonable to assume that an initial setup for a network would be that people are connected to those closest to them, and then slowly acquire distant contacts as time progresses.

Preferential Attachment – Barabasi-Albert model

Most of our intuition for networks can be extended to an intuition for the formation of networks. The idea of prescribing a degree distribution is neat, but it doesn’t give any account to the mechanism of formation. Complexity emerges over time, and a good model should be able to describe why this happens. The Barabasi-Albert model takes this as its starting point, with the aim of producing a highly clustered system dynamically. Recall that we can describe G(n,p) as a process by coupling, then increasing p from 0 to 1, and seeing edges emerge. The independence assumption can be lifted through the coupling, and so which edge appears next is independent of the current state of the system.

This is what we need to relax. Recall the motivating idea of ‘hubs’, where a small collection of vertices have very high connectivity across the whole system, as observed in several real situations. A consequence of this is that new edge is more likely to be attached to a hub, than to a pair of poorly connected vertex elsewhere. But it turns out that this idea of preferential attachment isn’t enough by itself. Because as a network forms, it is not just the connectivity that increases, but also the size of the system itself. So in fact it makes sense to add vertices rather than edges, and join the new vertices to existing vertices in proportion to the degrees of the existing vertices. This combination of growth and preferential attachment is key to the scale-free graphs that this Barabasi-Albert model generates. Relaxing either mechanism returns us to the case of exponential tails. However, there are methods in the literature for generating such graphs without the need for a dynamic model, but they are harder to understand and describe. None I have seen so far has a high clustering coefficient.

Hubs are effectively a way to reduce the diameter. Recall the description of Milgram’s experiment where he encouraged randomly chosen people to send a package to Harvard. For the purposes of this model, an undergraduate from Wyoming or a husband from Alabama moving in with his wife in Boston are clear hubs, as for very many people near their previous home, they represent a good connection to Harvard. So it is unsurprising that BA, which reinforces hubs, has a sub-logarithmic small diameter.

Conclusions

I’m not entirely what conclusions I should draw from my reading. Probably the main one is that I should read more as there is plenty of interesting stuff going on in this area. Intuitively, it seems unlikely that there is going to be a single model which unites the descriptions of all relevant real-world networks. As ever, it is pleasant to find structures that are both mathematically interesting in their own right and relevant to applied problems. So it is reassuring to observe how similar many of the models discussed above are to the standard random graph.

From G(n,p) to G(n,m)

In the previous posts about random graphs, I was focusing on the model G(n,p). Here, we have n vertices, and we insert an edge between any pair of vertices independently with probability p. In particular, the number of edge which appear in a realisation of G(n,p) is a random variable, distributed as \text{Bin}(\frac{n(n-1)}{2},p).

The original model examined by Erdos and Renyi, after whom the random graph described above was named, was slightly different. Still with n vertices, they specified how many edges m they wanted in the graph, and chose uniformly at random from the set of graphs with this number of edges. This model is usually denoted G(n,m). Normally we can tell them apart by context. Obviously, p is a probability so lies in [0,1], whereas m is a positive integer, so there isn’t much room for ambiguity.

The key observation is that, if H is some graph with n vertices and m edges, then the probability that this is appears as G(n,p) is

p^{E(H)}(1-p)^{n-E(H)}.

This is constant if we vary H while fixing m. In other words, G(n,p) conditioned to have m edges is G(n,m). So, via some sort of law of total probability, we can construct G(n,p) by taking m to be distributed as \text{Bin}(\frac{n(n-1)}{2},p), and conditional on that, sampling from G(n,m). (*)

We can couple G(n,p) for all p, by assigning iid uniform [0,1] random variables to each pair of vertices, then including the edge in G(n,p) only if the value of the RV is greater than 1-p. Similarly, it is often helpful to think of G(n,m) as m varies as a random process, where edges are added one at a time, and at each stage the next edge is chosen uniformly at random from those not currently present. Perhaps because of this, it is sometimes easier to prove results for G(n,p) than for G(n,m) so we want to develop a framework to move between the two.

The decomposition (*) gives a relatively straightforward way to move from a result in G(n,m) to a result in G(n,p). By the Central Limit Theorem, the number of edges in G(n,p) is \binom{n}{2}p+O(n) in the limit, and so if a result with high probability in G(n,m) for all m in some interval, say [\binom{n}{2}p-Kn,\binom{n}{2}p+Kn] for some large K, then the law of total probability shows that the property holds with high probability in G(n,p).

In general, we get more interesting properties when p is a function of n. As discussed in previous posts, the scaling p=\frac{\lambda}{n} is particularly worth studying. CLT now shows that G(n,\frac{\lambda}{n}) has \frac{\lambda n}{2}+O(\sqrt{n}) edges in the limit. If you are confused why you can’t just substitute this value for p into the previous expression, note that p(1-p) does appear in the general asymptotic variance, but this gets absorbed into the “big O” notation when p is constant.

More importantly, many properties that we might want to consider are not in general affected in the limit by adding or removing O(\sqrt{n}) edges. For example, with high probability, G(n,m) has largest component of size (\zeta_\lambda+o(1)) n whenever \lambda>1 and m\in [\frac{\lambda n}{2}-O(\sqrt{n}),\frac{\lambda n}{2}+O(\sqrt{n})]. Some of this notation would need to be made a bit more precise in a formal argument, but for now, let’s take that as given. This then implies that with high probability, G(n,\frac{\lambda}{n}) has largest component of size (\zeta_\lambda+o(1))n also.

Of course, from the logical structure of this blog, this deduction is a bit bogus, because we have only just introduced G(n,m), and have no idea about the properties of its giant components yet. We seek instead an argument to deduce facts about G(n,m) from facts about G(n,p). Because G(n,m) cannot obviously be written as some conditioned combination of G(n,p)s, this instinctively seems harder. Bollobas gives various general conditions to carry results over between the two regimes in his Part III course notes, but I feel that an examples would be the easiest way to explain the ideas.

The size of the largest component is such an important quantity, we might as well consider that, in the subcritical case. We work with G(n,\frac{\lambda}{n}), for which we have the result:

\mathbb{P}_{\lambda+o(1)}(C_1\geq a\log n)=O(n^{-\delta}),

for some \delta>0, whenever a>I_\lambda^{-1}, the rate function at 1 of the total population size of a Poisson \lambda branching process. For now, that doesn’t matter too much, except that it is continuous as a function of \lambda. We want to show that G(n,\frac{\lambda n}{2}) has the same property.

The trick is to consider G(n,\frac{\lambda+\epsilon}{n}) instead. Let E_a be the event described above. By the law of total probability and the decomposition mentioned above, we have:

\mathbb{P}_{\lambda+\epsilon}(E_a)=\sum_{m=0}^n\mathbb{P}(\text{Bin}(\frac{n(n-1)}{2},\frac{\lambda+\epsilon}{n})=m)\mathbb{P}_{n,m}(E_a).

We are going to split this sum into

\sum_{m=0}^{\frac{(\lambda+\epsilon )n}{2}-n^{3/4}}+\sum_{m=\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}^n.

On the first of these sums, we bound using the fact that probabilities are less than 1, and on the second, we use that \mathbb{P}_{n,m}(E_a) is an increasing function of m. This property is special to the event we are considering – in general one might have to be a bit more clever, perhaps using continuity of P_{n,m}(E), interpreting continuity in the limit with n. Anyway, this enables us to bound:

\mathbb{P}_{\lambda+\epsilon}(E_a)\geq \mathbb{P}(\text{Bin}(\frac{n(n-1)}{2},\frac{\lambda+\epsilon}{n})\leq \frac{(\lambda+\epsilon)n}{2}-n^{3/4})+\mathbb{P}_{\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}(E_a)\mathbb{P}(\text{Bin}(\frac{n(n-1)}{2},\frac{\lambda+\epsilon}{n})\geq\frac{(\lambda+\epsilon)n}{2}-n^{3/4}).

By the Central Limit Theorem, this first probability tends to 0, while the final term tends to 1. We therefore have:

\mathbb{P}_{\lambda+\epsilon}(E_a)\geq \mathbb{P}_{\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}(E_a).

We demanded that a>I_\lambda^{-1}, and mentioned that this function was continuous, so since we have total freedom over \epsilon, in particular, we can choose \epsilon>0 such that a>I_{\lambda+\epsilon}^{-1}. By the work on G(n,p), we have \mathbb{P}_{\lambda+\epsilon}(E_a)=O(n^{-\delta}), and for large enough n, we have \frac{(\lambda+\epsilon)n}{2}-n^{3/4}>>\frac{\lambda n}{2}, and so the result \mathbb{P}_{\frac{\lambda n}{2}}(E_a)=O(n^{-\delta}) follows.