Random transpositions

We study a procedure for generating a random sequence of permutations of [N]. We start with the identity permutation, and then in each step, we choose two elements uniformly at random, and swap them. We obtain a sequence of permutations, where each term is obtained from the previous one by multiplying by a uniformly-chosen transposition.

Some more formality and some technical remarks:

  • This is a Markov chain, and as often with Markov chains, it would be better it was aperiodic. As described, the cycle will alternate between odd and even permutations. So we allow the two elements chosen to be the same. This laziness slows down the chain by a factor N-1/N, but removes periodicity. We will work over timescales where this adjustment makes no practical difference.
  • Let \tau_1,\tau_2,\ldots be the sequence of transpositions. We could define the sequence of permutations by \pi_m= \tau_m\cdot\tau_{m-1}\cdot \ldots\cdot \tau_1. I find it slightly more helpful to think of swapping the elements in places i and j, rather the elements i and j themselves, and so I’ll use this language, for which \pi_m = \tau_1\cdot \tau_2\cdot\ldots \cdot \tau_m is the appropriate description. Of course, transpositions and the identity are self-inverse permutations, so it makes no difference to anything we might discuss.
  • You can view this as lazy random walk on the Cayley graph of S_N generated by the set of transpositions. That is, the vertices of the graph are elements of S_N, and two are connected by an edge if one can be obtained from the other by multiplying by a transposition. Note this relation is symmetric. Hence random transposition random walk.
  • Almost everything under discussion would work in continuous time too.

At a very general level, this sort of model is interesting because sometimes the only practical way to introduce ‘global randomness’ is repeatedly to apply ‘local randomness’. This is not the case for permutations – it is not hard to sample uniformly from S_N. But it is a tractable model in which to study relevant questions about the generating randomness on a complicated set through iterated local operations.

Since it is a Markov chain with a straightforward invariant distribution, we can ask about the mixing time. That is, the correct scaling for the number of moves before the random permutation is close in distribution (say in the sense of total variation distance) to the equilibrium distribution. See this series of posts for an odd collection of background material on the topic. Diaconis and Shahshahani [DS81] give an analytic argument for mixing around \frac{N\log N}{2} transpositions. Indeed they include a constant because it is a sharp cutoff, where the total variation distance drops from approximately 1 to approximately 0 in O(N) steps.

Comparison with Erdos-Renyi random graph process

In the previous result, one might observe that m=\frac{N\log N}{2} is also the threshold number of edges to guarantee connectivity of the Erdos-Renyi random graph G(N,m) with high probability. [ER59] Indeed, there is also a sharp transition around this threshold in this setting too.

We explore this link further. We can construct a sequence of random graphs simultaneously with the random transposition random walk. When we multiply by transposition (i j), we add edge ij in the graph. Laziness of RTRW and the possibility of multiple edges mean this definition isn’t literally the same as the conventional definition of a discrete-time Erdos-Renyi random graph process, but again this is not a problem for any of the effects we seek to study.

The similarity between the constructions is clear. But what about the differences? For the RTRW, we need to track more information than the random graph. That is, we need to know what order the transpositions were added, rather than merely which edges were added. However, the trade-off is that a permutation is a simpler object than a graph in the following sense. A permutation can be a described as a union of disjoint cycles. In an exchangeable setting, all the information about a random permutation is encoded in the lengths of the these cycles. Whereas in a graph, geometry is important. It’s an elegant property of the Erdos-Renyi process that we can forget about the geometry and treat it as a process on component sizes (indeed, a multiplicative coalescent process), but there are other questions we might need to ask for which we do have to study the graph structure itself.

Within this analogy, unfortunately the word cycle means different things in the two different settings. In a permutation, a cycle is a directed orbit, while in a graph it has the usual definition. I’m going to write graph-cycle whenever relevant to avoid confusion.

A first observation is that, under this equivalence, the cycles of the permutation form a finer partition than the components of the graph. This is obvious. If we split the vertices into sets A and B, and there are no edges between them, then nothing in set A will ever get moved out of set A by a transposition. (Note that the slickness of this analogy is the advantage of viewing a transposition as swapping the elements in places i and j.)

However, we might then ask under what circumstances is a cycle of the permutation the same as a component of the graph (rather than a strict subset of it). A first answer is the following:

Lemma: [Den59] The permutation formed by a product of transpositions corresponding in any order to a tree in the graph has a single cycle.

We can treat this as a standalone problem and argue in the following predictable fashion. (Indeed, I was tempted to set this as a problem during selection for the UK team for IMO 2017 – it’s perfectly suitable in this context I think.) The first transposition corresponds to some edge say ab, and removing this edge divides the vertices into components A \ni a, B\ni b. Since no further transposition swaps between places in A and places in B, the final permutation maps a into B and b into A, and otherwise preserves A and B.

This argument extends to later transpositions too. Now, suppose there are multiple cycles. Colour one of them. So during the process, the coloured labels move around. At some point, we must swap a coloured label with an uncoloured label. Consider this edge, between places a and b as before, and indeed the same conclusion holds. WLOG we move the coloured label from a to b. But then at the end of the process (ie in the permutation) there are more coloured labels in B than initially. But the number of coloured labels should be the same, because they just cycle around in the final permutation.

We can learn a bit more by trying thinking about the action on cycles (in the permutation) of adding a transposition. In the following pair of diagrams, the black arrows represent the original permutation (note it’s not helpful to think of the directed edges as having anything to do with transpositions now), the dashed line represents a new transposition, and the new arrows describe the new permutation which results from this product.

It’s clear from this that adding a transposition between places corresponding to different cycles causes the cycles to merge, while adding a transposition between places already in the same cycle causes the cycle to split into two cycles. Furthermore the sizes of the two cycles formed is related to the distance in the cycle between the places defining the transposition.

This allows us to prove the lemma by adding the edges of the tree one-at-a-time and using induction. The inductive claim is that cycles of the permutation exactly correspond to components of the partially-built tree. Assuming this claim guarantees that the next step is definitely a merge, not a split (otherwise the edge corresponding to the next step would have to form a cycle). If all N-1 steps are merges, then the number of cycles is reduced by one on each step, and so the final permutation must be a single cycle.

Uniform split-merge

This gives another framework for thinking about the RTRW itself, entirely in terms of cycle lengths as a partition of [N]. That is, given a partition, we choose a pair of parts in a size-biased way. If they are different, we merge them; and if it is the same part, with size k, we split them into two parts, with sizes chosen uniformly from { (1,k-1), (2,k-2), …  (k-1,1) }.

What’s nice about this is that it’s easy to generalise to real-valued partitions, eg of [0,1]. Given a partition of [0,1], we sample two IID U[0,1] random variables U_1,U_2. If these correspond to different parts, we replace these parts by a single part with size given by the sum. If these correspond to the same part, with size \alpha, we split this part into two parts with sizes |U_1-U_2| and \alpha - |U_1-U_2|. This is equivalent in a distributional sense to sampling another U[0,1] variable U and replacing \alpha with (\alpha U, \alpha(1-U)). We probably want our partition to live in \ell^1_\searrow, so we might have to reorder the parts afterwards too.

These uniform split-merge dynamics have a (unique) stationary distribution, the canonical Poisson-Dirichlet random partition, hereafter PD(0,1). This was first shown in [DMZZ04], and then in a framework more relevant to this post by Schramm [Sch08].

Conveniently, PD(0,1) is also the scaling limit of the cycle lengths in a uniform random permutation (scaled by N). The best way to see this is to start with the observation that the length of the cycle containing 1 in a permutation chosen uniformly from S_N has the uniform distribution on {1,…,N}. This matches up well with the uniform stick-breaking construction of PD(0,1), though other arguments are available too. Excellent background on Poisson-Dirichlet distributions and this construction and equivalence can be found in Chapter 3 of Pitman’s comprehensive St. Flour notes [CSP]. Also see this post, and the links within, with the caveat that my understanding of the topic was somewhat shaky then (as presently, for now).

However, Schramm says slightly more than this. As the Erdos-Renyi graph passes criticality, there is a well-defined (and whp unique) giant component including \Theta(N) vertices. It’s not clear that the corresponding permutation should have giant cycles. Indeed, whp the giant component has \Theta(N) surplus edges, so the process of cycle lengths will have undergone O(N) splits. Schramm shows that most of the labels within the giant component are contained in giant cycles in the permutation. Furthermore, the distribution of cycle lengths within the giant component, rescaled by the size of the giant component, converges in distribution to PD(0,1) at any supercritical time \frac{(1+\epsilon)N}{2}

This is definitely surprising, since we already know that the whole permutation doesn’t look close to uniform until time \frac{N\log N}{2}. Essentially, even though the size of the giant component is non-constant (ie it’s gaining vertices), the uniform split-merge process is happening to the cycles within it at rate N. So heuristically, at the level of the largest cycles, at any supercritical time we have a non-trivial partition, so at any slightly later time (eg \frac{(1+\epsilon/2)N}{2} and \frac{(1+\epsilon)N}{2} ), mixing will have comfortably occurred, and so the distribution is close to PD(0,1).

This is explained very clearly in the introduction of [Ber10], in which the approach is extended to a random walk on S_N driven by a uniform choice from any conjugacy class.

So this really does tell us how the global uniform randomness emerges. As the random graph process passes criticality, we have a positive mass of labels in a collection of giant cycles which are effectively a continuous-space uniform split-merge model near equilibrium (and thus with PD(0,1) marginals). The remaining cycles are small, corresponding to small trees which make up the remaining (subcritical by duality) components of the ER graph. These cycles slowly get absorbed into the giant cycles, but on a sufficiently slow timescale relevant to the split-merge dynamics that we do not need to think of a separate split-merge-with-immigration model. Total variation distance on permutations does feel the final few fixed points (corresponding to isolated vertices in the graph), hence the sharp cutoff corresponding to sharp transition in the number of isolated vertices.

References

[Ber10] – N. Berestycki – Emergence of giant cycles and slowdown transition in random transpositions and k-cycles. [arXiv version]

[CSP] – Pitman – Combinatorial stochastic processes. [pdf available]

[Den59] – Denes – the representation of a permutation as a product of a minimal number of transpositions, and its connection with the theory of graphs

[DS81] – Diaconis, Shahshahani – Generating a random permutation with random transpositions

[DMZZ04] – Diaconis, Mayer-Wolf, Zeitouni, Zerner – The Poisson-Dirichlet distribution is the unique invariant distribution for uniform split-merge transformations [link]

[ER59] – Erdos, Renyi – On random graphs I.

[Sch08] – Schramm – Compositions of random transpositions [book link]

Advertisements

Large Deviations 6 – Random Graphs

As a final instalment in this sequence of posts on Large Deviations, I’m going to try and explain how one might be able to apply some of the theory to a problem about random graphs. I should explain in advance that much of what follows will be a heuristic argument only. In a way, I’m more interested in explaining what the technical challenges are than trying to solve them. Not least because at the moment I don’t know exactly how to solve most of them. At the very end I will present a rate function, and reference properly the authors who have proved this. Their methods are related but not identical to what I will present.

Problem

Recall the two standard definitions of random graphs. As in many previous posts, we are interested in the sparse case where the average degree of a vertex is o(1). Anyway, we start with n vertices, and in one description we add an edge between any pair of vertices independently and with fixed probability \frac{\lambda}{n}. In the second model, we choose uniformly at random from the set of graphs with n vertices and \frac{\lambda n}{2} edges. Note that if we take the first model and condition on the number of edges, we get the second model, since the probability of a given configuration appearing in G(n,p) is a function only of the number of edges present. Furthermore, the number of edges in G(n,p) is binomial with parameters \binom{n}{2} and p. For all purposes here it will make no difference to approximate the former by \frac{n^2}{2}.

Of particular interest in the study of sparse random graphs is the phase transition in the size of the largest component observed as \lambda passes 1. Below 1, the largest component has size on a scale of log n, and with high probability all components are trees. Above 1, there is a unique giant component containing \alpha_\lambda n vertices, and all other components are small. For \lambda\approx 1, where I don’t want to discuss what ‘approximately’ means right now, we have a critical window, for which there are infinitely many components with sizes on a scale of n^{2/3}.

A key observation is that this holds irrespective of which model we are using. In particular, this is consistent. By the central limit theorem, we have that:

|E(G(n,\frac{\lambda}{n}))|\sim \text{Bin}\left(\binom{n}{2},\frac{\lambda}{n}\right)\approx \frac{n\lambda}{2}\pm\alpha,

where \alpha is the error due to CLT-scale fluctuations. In particular, these fluctuations are on a scale smaller than n, so in the limit have no effect on which value of \lambda in the edge-specified model is appropriate.

However, it is still a random model, so we can condition on any event which happens with positive probability, so we might ask: what does a supercritical random graph look like if we condition it to have no giant component? Assume for now that we are considering G(n,\frac{\lambda}{n}),\lambda>1.

This deviation from standard behaviour might be achieved in at least two ways. Firstly, we might just have insufficient edges. If we have a large deviation towards too few edges, then this would correspond to a subcritical G(n,\frac{\mu n}{2}), so would have no giant components. However, it is also possible that the lack of a giant component is due to ‘clustering’. We might in fact have the correct number of edges, but they might have arranged themselves into a configuration that keeps the number of components small. For example, we might have a complete graph on Kn^{1/2} vertices plus a whole load of isolated vertices. This has the correct number of edges, but certainly no giant component (that is an O(n) component).

We might suspect that having too few edges would be the primary cause of having no giant component, but it would be interesting if clustering played a role. In a previous post, I talked about more realistic models of complex networks, for which clustering beyond the levels of Erdos-Renyi is one of the properties we seek. There I described a few models which might produce some of these properties. Obviously another model is to take Erdos-Renyi and condition it to have lots of clustering but that isn’t hugely helpful as it is not obvious what the resulting graphs will in general look like. It would certainly be interesting if conditioning on having no giant component were enough to get lots of clustering.

To do this, we need to find a rate function for the size of the giant component in a supercritical random graph. Then we will assume that evaluating this near 0 gives the LD probability of having ‘no giant component’. We will then compare this to the straightforward rate function for the number of edges; in particular, evaluated at criticality, so the probability that we have a subcritical number of edges in our supercritical random graph. If they are the same, then this says that the surfeit of edges dominates clustering effects. If the former is smaller, then clustering may play a non-trivial role. If the former is larger, then we will probably have made a mistake, as we expect on a LD scale that having too few edges will almost surely lead to a subcritical component.

Methods

The starting point is the exploration process for components of the random graph. Recall we start at some vertex v and explore the component containing v depth-first, tracking the number of vertices which have been seen but not yet explored. We can extend this to all components by defining:

S(0)=0, \quad S(t)=S(t-1)+(X(t)-1),

where X(t) is the number of children of the t’th vertex. For a single component, S(t) is precisely the number of seen but unexplored vertices. It is more complicated in general. Note that when we exhaust the first component S(t)=-1, and then when we exhaust the second component S(t)=-2 and so on. So in fact

S_t-\min_{0\leq s\leq t}S_s

is the number of seen but unexplored vertices, with \min_{0\leq s\leq t}S_s equal to (-1) times the number of components already explored up to time t.

Once we know the structure of the first t vertices, we expect the distribution of X(t) – 1 to be

\text{Bin}\Big(n-t-[S_t-\min_{0\leq s\leq t}S_s],\tfrac{\lambda}{n}\Big)-1.

We aren’t interested in all the edges of the random graph, only in some tree skeleton of each component. So we don’t need to consider the possibility of edges connecting our current location to anywhere we’ve previously visited (as such an edge would have been consider then – it’s a depth-first exploration), hence the -t. But we also don’t want to consider edges connecting our current location to anywhere we’ve seen, since that would be a surplus edge creating a cycle, hence the -S_s. It is binomial because by independence even after all this conditioning, the probability that there’s an edge from my current location to any other vertex apart from those discounted is equal to \frac{\lambda}{n} and independent.

For Mogulskii’s theorem in the previous post, we had an LDP for the rescaled paths of a random walk with independent stationary increments. In this situation we have a random walk where the increments do not have this property. They are not stationary because the pre-limit distribution depends on time. They are also not independent, because the distribution depends on behaviour up to time t, but only through the value of the walk at the present time.

Nonetheless, at least by following through the heuristic of having an instantaneous exponential cost for a LD event, then products of sums becoming integrals within the exponent, we would expect to have a similar result for this case. We can find the rate function \Lambda_\lambda^*(x)of latex \text{Po}(\lambda)-1$ and thus get a rate function for paths of the exploration process

I_\lambda(f)=\int_0^1 \Lambda_{(1-t-\bar{f}(t))\lambda}^*(f')dt,

where \bar{f}(t) is the height of f above its previous minimum.

Technicalities and Challenges

1) First we need to prove that it is actually possible to extend Mogulskii to this more general setting. Even though we are varying the distribution continuously, so we have some sort of ‘local almost convexity’, the proof is going to be fairly fiddly.

2) Having to consider excursions above the local minima is a massive hassle. We would ideally like to replace \bar{f} with f. This doesn’t seem unreasonable. After all, if we pick a giant component within o(n) steps, then everything considered before the giant component won’t show up in the O(n) rescaling, so we will have a series of macroscopic excursions above 0 with widths giving the actual sizes of the giant components. The problem is that even though with high probability we will pick a giant component after O(1) components, then probability that we do not do this decays only exponentially fast, so will show up as a term in the LD analysis. We would hope that this would not be important – after all later we are going to take an infimum, and since the order we choose the vertices to explore is random and in particular independent of the actual structure, it ought not to make a huge difference to any result.

3) A key lemma in the proof of Mogulskii in Dembo and Zeitouni was the result that it doesn’t matter from an LDP point of view whether we consider the linear (continuous) interpolation or the step-wise interpolation to get a process that actually lives in L_\infty([0,1]). In this generalised case, we will also need to check that approximating the Binomial distribution by its Poisson limit is valid on an exponential scale. Note that because errors in the approximation for small values of t affect the parameter of the distribution at larger times, this will be more complicated to check than for the IID case.

4) Once we have a rate function, if we actually want to know about the structure of the ‘typical’ graph displaying some LD property, we will need to find the infimum of the integrated rate function with some constraints. This is likely to be quite nasty unless we can directly use Euler-Lagrange or some other variational tool.

Answer

Papers by O’Connell and Puhalskii have found the rate function. Among other interesting things, we learn that:

I_{(1+\epsilon)}(0)\approx \frac{\epsilon^3}{6},

while the rate function for the number of edges:

-\lim\tfrac{1}{n}\log\mathbb{P}\Big(\text{Bin}(\tfrac{n^2}{2},\tfrac{1+\epsilon}{n})\leq\tfrac{n}{2}\Big)\approx \frac{\epsilon^2}{4}.

So in fact it looks as if there might be a significant contribution from clustering after all.

Uniform Spanning Trees

For applications to random graphs, the local binomial structure and independence means that the Galton-Watson branching process is a useful structure to consider embedding in the graph. In several previous posts, I have shown how we can set up the so-called exploration process which visits the sites in a component as if the component were actually a tree. The typical degree is O(1), and so in particular small components will be trees with high probability in the limit. In the giant component for a supercritical graph, this is not the case, but it doesn’t matter, as we ignore vertices we have already explored in our exploration process. We can consider the excess edges separately by ‘sprinkling’ them back in once we have the tree-like backbone of all the components. Again, independence is crucial here.

I am now thinking about a new model. We take an Erdos-Renyi process as before, with edges arriving at some fixed rate, but whenever a cycle appears, we immediately delete all the edges that make up the cycle. Thus at all times the system consists of a collection (or forest) of trees on the n vertices. So initially this process will look exactly like the normal E-R process, but as soon as the components start getting large, we start getting excess edges which destroy the cycles and make everything small again. The question to ask is: if we run the process for long enough, roughly how large are all the components? It seems unlikely that the splitting mechanism is so weak that we will get true giant components forming, ie O(n) sizes, so we might guess that, in common with some other split-merge models of this type, we end up with components of size n^{2/3}, as in the critical window for the E-R process.

In any case, the scaling limit process is likely to have components whose sizes grow with n, so we will have a class of trees larger than those we have considered previously, which have typically been O(1). So it’s worth thinking about some ways to generate random trees on a fixed number of vertices.

Conditioned Galton-Watson

Our favourite method of creating trees is inductive. We take a root and connect the root to a number of offspring given by a fixed distribution, and each of these some offspring given by an independent sample from the same distribution and so on. The natural formulation gives no control over the size of the tree. This is a random variable whose distribution depends on the offspring distribution, and which in some circumstances be computed explicitly, for example when the offspring distribution is geometric. In other cases, it is easier to make recourse to generating functions or to a random walk analogue as described in the exploration process discussion.

Of course, there is nothing to stop us conditioning on the total size of the population. This is equivalent to conditioning on the hitting time of -1 for the corresponding random walk, and Donsker’s theorem gives several consequences of a convergence relation towards a rescaled Brownian excursion. Note that there is no a priori labelling for the resulting tree. This will have to be supplied later, with breadth-first and depth-first the most natural choices, which might cause annoyance if you actually want to use it. In particular, it is not obvious, and probably not true unless you are careful, that the distribution is invariant under permuting the labels (having initially assumed 1 is the root etc) which is not ideal if you are embedding into the complete graph.

However, we would like to have some more direct constructions of random trees on n vertices. We now consider perhaps the two best known such methods. These are of particular interest as they are applicable to finding random spanning trees embedded in any graph, rather than just the complete graph.

Uniform Spanning Tree

Given a connected graph, consider the set of all subgraphs which are trees and span the vertex set of the original graph. An element of this set is called a spanning tree. A uniform spanning tree is chosen uniformly at random from the set of spanning trees on the complex graph on n vertices. A famous result of Arthur Cayley says that the number of such spanning trees is n^{n-2}. There are various neat proofs, many of which consider a mild generalisation which gives us a more natural framework for using induction. This might be a suitable subject for a subsequent post.

While there is no objective answer to the question of what is the right model for random trees on n vertices, this is what you get from the Erdos-Renyi process. Formally, conditional on the sizes of the (tree) components, the structures of the tree components are given by UST.

To see why this is the case, observe that when we condition that a component has m vertices and is a tree, we are demanding that it be connected and have m-1 edges. Since the probability of a particular configuration appearing in G(n,p) is a function only of the number of edges in the configuration, it follows that the probability of each spanning tree on the m vertices in question is equal.

Interesting things happen when you do this dynamically. That is, if we have two USTs of sizes m and n at some time t, and condition that the next edge to be added in the process joins them, then the resulting component is not a UST on m+n vertices. To see why, consider the probability of a ‘star’, that is a tree with a single distinguished vertex to which every other vertex is joined. Then the probability that the UST on m vertices is a star is \frac{m}{m^{m-2}}=m^{-(m-3)}. By contrast, it is not possible to obtain a star on m+n vertices by joining a tree on m vertices and a tree on n vertices with an additional edge.

However, I think the UST property is preserved by the cycle deletion mechanism mentioned at the very start of this post. My working has been very much of the back of the envelope variety, but I am fairly convinced that once you have taken a UST and conditioned on the sizes of the smaller trees which result from cycle deletion. My argument is that you might as well fix the cycle to be deleted, then condition on how many vertices are in each of the trees coming off this cycle. Now the choice of each of these trees is clearly uniform among spanning trees on the correct number of vertices.

However, it is my current belief that the combination of these two mechanisms does not give UST-like trees even after conditioning on the sizes at fixed time.

Analytic vs Probabilistic Arguments for a Supercritical BP

This follows on directly from the previous post. I was originally going to talk only about what follows, but I got rather carried away with the branching process account. I was stuck on a particular exercise, and we ended up coming up with two arguments: one analytic and one probabilistic. Since the typical flavour of this blog is to present problems which show the advantage of the probabilistic approach, it seems only fair to remark on this case, where the analytic method was less interesting, but much simpler.

Recall that we have a supercritical random graph G(n,\frac{\lambda}{n}), \lambda>1, and we are considering the rescaled exploration process S_{nt}, which has asymptotic mean \mu_t=1-t-e^{-\lambda t}. We can calculate similarly an expression for the asymptotic variance

\frac{\text{Var}(S_{nt})}{n}\rightarrow v_t=e^{-\lambda t}(1-e^{-\lambda t}).

To use this to verify the result about the size of the giant component, we verify that \mu_{\zeta_\lambda+x/\sqrt{n}} is negative, and has small variance, which would confirm that the giant component has size bounded above by \zeta_\lambda almost surely. A similar argument is required for the lower bound. The variance is a separate matter, but it is therefore necessary that \mu_t should be decreasing at t=\zeta_\lambda, that is \mu_t'=\lambda e^{-\lambda \zeta_\lambda}<0. This is what we try to prove in the remainder of this post. Recall that in the previous post we have checked that it is equal to zero here.

Heuristic Explanation

\mu_t has been rescaled from the original definition of the exploration process in both size and time-scale so some care is needed to see why this should hold in the limit. Remember that all components apart from the giant component are of size O(log n). So immediately after exhausting the giant component, you are likely to be visiting components of size roughly log n. A time interval of dt for \mu corresponds to ndt for S, during which S will visit some components of size log n and some of O(1) and some in between. In particular, some fixed proportion of vertices are isolated, that is, in a component of size 1.

There is then a complicated size-biasing train of thought. A component of size log n is more likely to come up than an isolated vertex, but there are not as many of them. The log n components push the derivative \mu_t' towards zero, because S_t decreases by 1 over a time-interval of length log n, which gives a gradient of zero in the limit. However, the isolated vertices give a gradient of -1, because S_t decreases by 1 over a time interval of 1. Despite the fact that log n intervals are likely to appear earlier, it still remains the case that after exhausting a component (in particular, at time t=\zeta_\lambda, after exhausting the giant component), with some bounded below positive probability you will choose an isolated vertex next. The component size only affects that time-scale if it is O(n), which none of the remaining components are, so the derivative \mu_{\zeta_\lambda}' consists of some complicated weighted mean of 0 and -1. In particular, it is negative.

Analytic solution

Obviously, that won’t do in practice. Suppressing lambdas for ease of notation, the key fact is: e^{-\lambda \zeta}=1-\zeta. We want to show that \lambda e^{-\lambda \zeta}<1. Substituting

\lambda=-\frac{\log(1-\zeta)}{\zeta},

means that it is required to show:

-\frac{1-\zeta}{\zeta}\log(1-\zeta)<1.

Differentiating the left hand side gives:

\frac{\log(1-\zeta)+\zeta}{\zeta^2}<0,

since of course \log(1-\zeta)=\zeta+\frac{\zeta^2}{2}+\frac{\zeta^3}{3}+\dots. So it suffice to check the result for small \zeta. But, again using a Taylor series:

-\frac{1-\zeta}{\zeta}\log(1-\zeta)=1-\frac12\zeta+O(\zeta^2)<1,

for small \zeta. This gives the required result.

Probabilistic Interpretation and Solution

First, we observe that \lambda e^{-\lambda\zeta}=\lambda(1-\zeta) is the expected number of vertices in the first generation of a \text{Po}(\lambda) whose progeny become extinct. This motivates considering the canonical decomposition of a supercritical branching process Z into the skeleton process and the dual process. The skeleton Z^+ consists of all vertices which have infinitely many successors. It is relatively easy to show that this is a branching process with offspring distribution \text{Po}(\lambda\zeta) conditioned on being positive. The dual process Z^* is a G-W branching process with offspring distribution \text{Po}(\lambda) conditioned on dying. This is the same as a branching process with offspring distribution \text{Po}(\lambda(1-\zeta), by a sprinkling argument, which says that if we begin with a Poisson number of things, then remove each one independently with some fixed probability, the remaining number of things is Poisson also.

We can construct the original branching process by

  • With probability \zeta, take the skeleton, and affixe independent copies of Z^* at every vertex in the skeleton.
  • With probability 1-\zeta, just take a copy of Z^*.

It is immediately clear that \lambda(1-\zeta)\leq 1. After all, the dual process is almost surely finite, so the offspring distribution cannot have expectation greater than 1. Checking that this is strong is more fiddly. The best way I have come up with is to examine the tail of the distribution of total population size of the original branching process.

The total population size T of a branching process has an exponential tail if the offspring distribution is subcritical. It isn’t hugely surprising that this behaves like a large deviation for iid RVs, since in the limit such an event requires a lot of the offspring counts to deviate substantially from the mean. The same holds in the supercritical case, with the additional complication that though the finite tail decays exponential, there is positive probability that the total size will be infinite. In the critical case, however, there is a power-law decay. This is not hugely surprising as it marks the threshhold for the appearance of the infinite population, just as in a multiplicative coalescent at time 1, we have a load of very large components just about to form a giant component. The tool for all of these results is Dwass’s Theorem, which says:

\mathbb{P}(T=n)=\frac{1}{n}\mathbb{P}(X_1+\ldots+X_n=n-1),

where X_1 are iid with the offspring distribution. When \mathbb{E}X_1\neq 1, this is a large deviation event, for which Cramer’s theorem applies (assuming, as is the case for the Poisson distribution, that the offspring distribution has finite variance). When, \mathbb{E}X=1, the Central Limit Theorem says that with high probability,

X_1+\ldots+X_n\in [n-n^{3/4},n+n^{3/4}],

so, skating over the details of whether everything is exactly uniform within this CLT scaling window,

\mathbb{P}(T=n)\geq \frac{1}{n}\cdot\frac{1}{2n^{3/4}}.

The true exponent of the power law decay is substantially slower than this, but the above argument works as a back-of-the-envelope bound.

In particular, if the dual process has mean 1, then the population size of the original branching process is given by taking a distribution with exponential tail with some probability and a distribution with power-law tail with some probability. Obviously the power-law will dominate, which contradicts the assumption that the original branching process was supercritical, and so has an exponential tail.

Exploring the Supercritical Random Graph

I’ve spent a bit of time this week reading and doing all the exercises from some excellent notes by van der Hofstad about random graphs. I think they are absolutely excellent and would not be surprised if they become the standard text for an introduction to probabilistic combinatorics. You can find them hosted on the author’s website. I’ve been reading chapters 4 and 5, which approaches the properties of phase transitions in G(n,p) by formalising the analogy between component sizes and population sizes in a binomial branching process. When I met this sort of material for the first time during Part III, the proofs generally relied on careful first and second moment bounds, which is fine in many ways, but I enjoyed vdH’s (perhaps more modern?) approach, as it seems to give a more accurate picture of what is actually going on. In this post, I am going to talk about using the branching process picture to explain why the giant component emerges when it does, and how to get a grip on how large it is at any time after it has emerged.

Background

A quick tour through the background, and in particular the notation will be required. At some point I will write a post about this topic in a more digestible format, but for now I want to move on as quickly as possible.

We are looking at the sparse random graph G(n,\frac{\lambda}{n}), in the super-critical phase \lambda>1. With high probability (that is, with probability tending to 1 as n grows), we have a so-called giant component, with O(n) vertices.

Because all the edges in the configuration are independent, we can view the component containing a fixed vertex as a branching process. Given vertex v(1), the number of neighbours is distributed like \text{Bi}(n-1,\frac{\lambda}{n}). The number of neighbours of each of these which we haven’t already considered is then \text{Bi}(n-k,\frac{\lambda}{n}), conditional on k, the number of vertices we have already discounted. After any finite number of steps, k=o(n), and so it is fairly reasonable to approximate this just by \text{Bi}(n,\frac{\lambda}{n}). Furthermore, as n grows, this distribution converges to \text{Po}(\lambda), and so it is natural to expect that the probability that the fixed vertex lies in a giant component is equal to the survival probability \zeta_\lambda (that is, the probability that it is infinite) of a branching process with \text{Po}(\lambda) offspring distribution. Note that given a graph, the probability of a fixed vertex lying in a giant component is equal to the fraction of the vertex in the giant component. At this point it is clear why the emergence of the giant component must happen at \lambda=1, because we require \mathbb{E}\text{Po}(\lambda)>1 for the survival probability to be non-zero. Obviously, all of this needs to be made precise and rigorous, and this is treated in sections 4.3 and 4.4 of the notes.

Exploration Process

A common functional of a rooted branching process to consider is the following. This is called in various places an exploration process, a depth-first process or a Lukasiewicz path. We take a depth-first labelling of the tree v(0), v(1), v(2),… , and define c(k) to be the number of children of vertex v(k). We then define the exploration process by:

S(0)=0,\quad S(k+1)=S(k)+c(k)-1.

By far the best way to think of this is to imagine we are making the depth-first walk on the tree. S(k) records how many vertices we have seen (because they are connected by an edge to a vertex we have visited) but have not yet visited. To clarify understanding of the definition, note that when you arrive at a vertex with no children, this should decrease by one, as you can see no new vertices, but have visited an extra one.

This exploration process is useful to consider for a couple of reasons. Firstly, you can reconstruct the branching process directly from it. Secondly, while other functionals (eg the height, or contour process) look like random walks, the exploration process genuinely is a random walk. The distribution of the number of children of the next vertex we arrive at is independent of everything we have previously seen in the tree, and is the same for every vertex. If we were looking at branching processes in a different context, we might observe that this gives some information in a suitably-rescaled limit, as rescaled random walks converge to Brownian motion if the variance of the (offspring) distribution is finite. (This is Donsker’s result, which I should write something about soon…)

The most important property is that the exploration process returns to 0 precisely when we have exhausted all the vertices in a component. At that point, we have seen exactly the vertices which we have explored. There is no reason not to extend the definition to forests, that is a union of trees. The depth-first exploration is the same – but when we have exhausted one component, we move onto another component, chosen according to some labelling property. Then, running minima of the exploration process (ie times when it is smaller than it has been before) correspond to jumping between components, and thus excursions above the minimum to components themselves. The running minimum will be non-positive, with absolute value equal to the number of components already exhausted.

Although the exploration process was defined with reference to and in the language of trees, the result of a branching process, this is not necessary. With some vertex denoted as the root, we can construct a depth-first labelling of a general graph, and the exploration process follows exactly as before. Note that we end up ignoring all edges except a set that forms a forest. This is what we will apply to G(n,p).

Exploring G(n,p)

When we jump between components in the exploration process on a supercritical (that is \lambda>1) random graph, we move to a component chosen randomly with size-biased distribution. If there is a giant component, as we know there is in the supercritical case, then this will dominate the size-biased distribution. Precisely, if the giant component takes up a fraction H of the vertices, then the number of components to be explored before we get to the giant component is geometrically distributed with parameter H. All other components have size O(log n), so the expected number of vertices explored before we get to the giant component is O(log n)/H = o(n), and so in the limit, we explore the giant component immediately.

The exploration process therefore gives good control on the giant component in the limit, as roughly speaking the first time it returns to 0 is the size of the giant component. Fortunately, we can also control the distribution of S_t, the exploration process at time t. We have that:

S_t+(t-1)\sim \text{Bi}(n-1,1-(1-p)^t).

This is not too hard to see. S_t+(t-1) is number of vertices we have explored or seen, ie are connected to a vertex we have explored. Suppose the remaining vertices are called unseen, and we began the exploration at vertex 1. Then any vertex with label in {2,…,n} is unseen if it successively avoids being in the neighbourhood of v(1), v(2), … v(t). This happens with probability (1-p)^t, and so the probability of being an explored or seen vertex is the complement of this.

In the supercritical case, we are taking p=\frac{\lambda}{n} with \lambda>1, and we also want to speed up S, so that all the exploration processes are defined on [0,1], and rescale the sizes by n, so that it records the fraction of the graph rather than the number of vertices. So we set consider the rescaling \frac{1}{n}S_{nt}.

It is straightforward to use the distribution of S_t we deduce that the asymptotic mean \mathbb{E}\frac{1}{n}S_{nt}=\mu_t = 1-t-e^{-\lambda t}.

Now we are in a position to provide more concrete motivation for the claim that the proportion of vertices in the giant component is \zeta_\lambda, the survival probability of a branching process with \text{Po}(\lambda) offspring distribution. It helps to consider instead the extinction probability 1-\zeta_\lambda. We have:

1-\zeta_\lambda=\sum_{k\geq 0}\mathbb{P}(\text{Po}(\lambda)=k)(1-\zeta_\lambda)^k=e^{-\lambda\zeta_\lambda},

where the second equality is a consequence of the simple form for the moment generating function of the Poisson distribution.

As a result, we have that \mu_{\zeta_\lambda}=0. In fact we also have a central limit theorem for S_t, which enables us to deduce that \frac{1}{n}S_{n\zeta_\lambda}=0 with high probability, as well as in expectation, which is precisely what is required to prove that the giant component of G(n,\frac{\lambda}{n}) has size n(\zeta_\lambda+o(1)).