# Random transpositions

We study a procedure for generating a random sequence of permutations of [N]. We start with the identity permutation, and then in each step, we choose two elements uniformly at random, and swap them. We obtain a sequence of permutations, where each term is obtained from the previous one by multiplying by a uniformly-chosen transposition.

Some more formality and some technical remarks:

• This is a Markov chain, and as often with Markov chains, it would be better it was aperiodic. As described, the cycle will alternate between odd and even permutations. So we allow the two elements chosen to be the same. This laziness slows down the chain by a factor N-1/N, but removes periodicity. We will work over timescales where this adjustment makes no practical difference.
• Let $\tau_1,\tau_2,\ldots$ be the sequence of transpositions. We could define the sequence of permutations by $\pi_m= \tau_m\cdot\tau_{m-1}\cdot \ldots\cdot \tau_1$. I find it slightly more helpful to think of swapping the elements in places i and j, rather the elements i and j themselves, and so I’ll use this language, for which $\pi_m = \tau_1\cdot \tau_2\cdot\ldots \cdot \tau_m$ is the appropriate description. Of course, transpositions and the identity are self-inverse permutations, so it makes no difference to anything we might discuss.
• You can view this as lazy random walk on the Cayley graph of $S_N$ generated by the set of transpositions. That is, the vertices of the graph are elements of $S_N$, and two are connected by an edge if one can be obtained from the other by multiplying by a transposition. Note this relation is symmetric. Hence random transposition random walk.
• Almost everything under discussion would work in continuous time too.

At a very general level, this sort of model is interesting because sometimes the only practical way to introduce ‘global randomness’ is repeatedly to apply ‘local randomness’. This is not the case for permutations – it is not hard to sample uniformly from $S_N$. But it is a tractable model in which to study relevant questions about the generating randomness on a complicated set through iterated local operations.

Since it is a Markov chain with a straightforward invariant distribution, we can ask about the mixing time. That is, the correct scaling for the number of moves before the random permutation is close in distribution (say in the sense of total variation distance) to the equilibrium distribution. See this series of posts for an odd collection of background material on the topic. Diaconis and Shahshahani [DS81] give an analytic argument for mixing around $\frac{N\log N}{2}$ transpositions. Indeed they include a constant because it is a sharp cutoff, where the total variation distance drops from approximately 1 to approximately 0 in O(N) steps.

Comparison with Erdos-Renyi random graph process

In the previous result, one might observe that $m=\frac{N\log N}{2}$ is also the threshold number of edges to guarantee connectivity of the Erdos-Renyi random graph G(N,m) with high probability. [ER59] Indeed, there is also a sharp transition around this threshold in this setting too.

We explore this link further. We can construct a sequence of random graphs simultaneously with the random transposition random walk. When we multiply by transposition (i j), we add edge ij in the graph. Laziness of RTRW and the possibility of multiple edges mean this definition isn’t literally the same as the conventional definition of a discrete-time Erdos-Renyi random graph process, but again this is not a problem for any of the effects we seek to study.

The similarity between the constructions is clear. But what about the differences? For the RTRW, we need to track more information than the random graph. That is, we need to know what order the transpositions were added, rather than merely which edges were added. However, the trade-off is that a permutation is a simpler object than a graph in the following sense. A permutation can be a described as a union of disjoint cycles. In an exchangeable setting, all the information about a random permutation is encoded in the lengths of the these cycles. Whereas in a graph, geometry is important. It’s an elegant property of the Erdos-Renyi process that we can forget about the geometry and treat it as a process on component sizes (indeed, a multiplicative coalescent process), but there are other questions we might need to ask for which we do have to study the graph structure itself.

Within this analogy, unfortunately the word cycle means different things in the two different settings. In a permutation, a cycle is a directed orbit, while in a graph it has the usual definition. I’m going to write graph-cycle whenever relevant to avoid confusion.

A first observation is that, under this equivalence, the cycles of the permutation form a finer partition than the components of the graph. This is obvious. If we split the vertices into sets A and B, and there are no edges between them, then nothing in set A will ever get moved out of set A by a transposition. (Note that the slickness of this analogy is the advantage of viewing a transposition as swapping the elements in places i and j.)

However, we might then ask under what circumstances is a cycle of the permutation the same as a component of the graph (rather than a strict subset of it). A first answer is the following:

Lemma: [Den59] The permutation formed by a product of transpositions corresponding in any order to a tree in the graph has a single cycle.

We can treat this as a standalone problem and argue in the following predictable fashion. (Indeed, I was tempted to set this as a problem during selection for the UK team for IMO 2017 – it’s perfectly suitable in this context I think.) The first transposition corresponds to some edge say ab, and removing this edge divides the vertices into components $A \ni a, B\ni b$. Since no further transposition swaps between places in A and places in B, the final permutation maps a into B and b into A, and otherwise preserves A and B.

This argument extends to later transpositions too. Now, suppose there are multiple cycles. Colour one of them. So during the process, the coloured labels move around. At some point, we must swap a coloured label with an uncoloured label. Consider this edge, between places a and b as before, and indeed the same conclusion holds. WLOG we move the coloured label from a to b. But then at the end of the process (ie in the permutation) there are more coloured labels in B than initially. But the number of coloured labels should be the same, because they just cycle around in the final permutation.

We can learn a bit more by trying thinking about the action on cycles (in the permutation) of adding a transposition. In the following pair of diagrams, the black arrows represent the original permutation (note it’s not helpful to think of the directed edges as having anything to do with transpositions now), the dashed line represents a new transposition, and the new arrows describe the new permutation which results from this product.

It’s clear from this that adding a transposition between places corresponding to different cycles causes the cycles to merge, while adding a transposition between places already in the same cycle causes the cycle to split into two cycles. Furthermore the sizes of the two cycles formed is related to the distance in the cycle between the places defining the transposition.

This allows us to prove the lemma by adding the edges of the tree one-at-a-time and using induction. The inductive claim is that cycles of the permutation exactly correspond to components of the partially-built tree. Assuming this claim guarantees that the next step is definitely a merge, not a split (otherwise the edge corresponding to the next step would have to form a cycle). If all N-1 steps are merges, then the number of cycles is reduced by one on each step, and so the final permutation must be a single cycle.

Uniform split-merge

This gives another framework for thinking about the RTRW itself, entirely in terms of cycle lengths as a partition of [N]. That is, given a partition, we choose a pair of parts in a size-biased way. If they are different, we merge them; and if it is the same part, with size k, we split them into two parts, with sizes chosen uniformly from { (1,k-1), (2,k-2), …  (k-1,1) }.

What’s nice about this is that it’s easy to generalise to real-valued partitions, eg of [0,1]. Given a partition of [0,1], we sample two IID U[0,1] random variables $U_1,U_2$. If these correspond to different parts, we replace these parts by a single part with size given by the sum. If these correspond to the same part, with size $\alpha$, we split this part into two parts with sizes $|U_1-U_2|$ and $\alpha - |U_1-U_2|$. This is equivalent in a distributional sense to sampling another U[0,1] variable U and replacing $\alpha$ with $(\alpha U, \alpha(1-U))$. We probably want our partition to live in $\ell^1_\searrow$, so we might have to reorder the parts afterwards too.

These uniform split-merge dynamics have a (unique) stationary distribution, the canonical Poisson-Dirichlet random partition, hereafter PD(0,1). This was first shown in [DMZZ04], and then in a framework more relevant to this post by Schramm [Sch08].

Conveniently, PD(0,1) is also the scaling limit of the cycle lengths in a uniform random permutation (scaled by N). The best way to see this is to start with the observation that the length of the cycle containing 1 in a permutation chosen uniformly from $S_N$ has the uniform distribution on {1,…,N}. This matches up well with the uniform stick-breaking construction of PD(0,1), though other arguments are available too. Excellent background on Poisson-Dirichlet distributions and this construction and equivalence can be found in Chapter 3 of Pitman’s comprehensive St. Flour notes [CSP]. Also see this post, and the links within, with the caveat that my understanding of the topic was somewhat shaky then (as presently, for now).

However, Schramm says slightly more than this. As the Erdos-Renyi graph passes criticality, there is a well-defined (and whp unique) giant component including $\Theta(N)$ vertices. It’s not clear that the corresponding permutation should have giant cycles. Indeed, whp the giant component has $\Theta(N)$ surplus edges, so the process of cycle lengths will have undergone $O(N)$ splits. Schramm shows that most of the labels within the giant component are contained in giant cycles in the permutation. Furthermore, the distribution of cycle lengths within the giant component, rescaled by the size of the giant component, converges in distribution to PD(0,1) at any supercritical time $\frac{(1+\epsilon)N}{2}$

This is definitely surprising, since we already know that the whole permutation doesn’t look close to uniform until time $\frac{N\log N}{2}$. Essentially, even though the size of the giant component is non-constant (ie it’s gaining vertices), the uniform split-merge process is happening to the cycles within it at rate N. So heuristically, at the level of the largest cycles, at any supercritical time we have a non-trivial partition, so at any slightly later time (eg $\frac{(1+\epsilon/2)N}{2}$ and $\frac{(1+\epsilon)N}{2}$ ), mixing will have comfortably occurred, and so the distribution is close to PD(0,1).

This is explained very clearly in the introduction of [Ber10], in which the approach is extended to a random walk on $S_N$ driven by a uniform choice from any conjugacy class.

So this really does tell us how the global uniform randomness emerges. As the random graph process passes criticality, we have a positive mass of labels in a collection of giant cycles which are effectively a continuous-space uniform split-merge model near equilibrium (and thus with PD(0,1) marginals). The remaining cycles are small, corresponding to small trees which make up the remaining (subcritical by duality) components of the ER graph. These cycles slowly get absorbed into the giant cycles, but on a sufficiently slow timescale relevant to the split-merge dynamics that we do not need to think of a separate split-merge-with-immigration model. Total variation distance on permutations does feel the final few fixed points (corresponding to isolated vertices in the graph), hence the sharp cutoff corresponding to sharp transition in the number of isolated vertices.

References

[Ber10] – N. Berestycki – Emergence of giant cycles and slowdown transition in random transpositions and k-cycles. [arXiv version]

[CSP] – Pitman – Combinatorial stochastic processes. [pdf available]

[Den59] – Denes – the representation of a permutation as a product of a minimal number of transpositions, and its connection with the theory of graphs

[DS81] – Diaconis, Shahshahani – Generating a random permutation with random transpositions

[DMZZ04] – Diaconis, Mayer-Wolf, Zeitouni, Zerner – The Poisson-Dirichlet distribution is the unique invariant distribution for uniform split-merge transformations [link]

[ER59] – Erdos, Renyi – On random graphs I.

[Sch08] – Schramm – Compositions of random transpositions [book link]

# Critical Components in Erdos-Renyi

In various previous posts, I’ve talked about the phase transition in the Erdos-Renyi random graph process. Recall the definition of the process. Here we will use the Gilbert model G(n,p), where we have n vertices, and between any pair of vertices we add an edge, independently of other pairs with probability p. We are interested in the sparse scaling, where the typical vertex has degree O(1) in n, and so p=c/n for constant c>0, and we assume throughout that n is large. We could alternatively have considered the alternative Erdos-Renyi model where we choose uniformly at random from the set of graphs with n vertices and some fixed number of edges. Almost all the results present work equally well in this setting.

As proved by Erdos and Renyi, the typical component structure of such a graph changes noticeably around the threshold c=1. Below this, in the subcritical regime, all the components are small, meaning of size at most order O(log n). Above this, in the supercritical regime, there is a single giant component on some non-zero proportion of the vertices. The rest of the graph looks subcritical. The case c=1 exhibits a phase transition between these qualitatively different behaviours. They proved that here, the largest component is with high probability O(n^2/3). It seems that they thought this result held whenever c=1-o(1), but it turns out that this is not the case. In this post, I will discuss some aspects of behaviour around criticality, and the tools needed to treat them.

The first question to address is this: how many components of size n^{2/3} are there? It might be plausible that there is a single such component, like for the subsequent giant component. It might also be plausible that there are n^1/3 such components, so O(n) vertices are on such critical components. As then it is clear how we transition out of criticality into supercriticality – all the vertices on critical components coalesce to form the new giant component.

In fact neither of these are correct. The answer is that for all integers k>0, with high probability the k-th largest component is on a size scale of n^2/3. This is potentially a confusing statement. It looks like there are infinitely many such components, but of course for any particular value of n, this cannot be the case. We should think of there being w(1) components, but o(n^b) for any b>0.

The easiest way to see this is by a duality argument, as we have discussed previously for the supercritical phase. If we remove a component of size O(n^2/3), then what remains is a random graph with n-O(n^2/3) vertices, and edge probability the same as originally. It might make sense to rewrite this probability 1/n as

$\frac{1}{n-O(n^{2/3})}\cdot \frac{n-O(n^{2/3})}{n}=\frac{1-O(n^{-1/3})}{n-O(n^{2/3})}.$

The approximation in the final numerator is basically the same as

$1-o\left(n-O(n^{2/3})\right).$

Although we have no concrete reasoning, it seems at least plausible that this should look similar in structure to G(n,1/n). In particular, there should be another component of size

$O\left([n-O(n^{2/3})]^{2/3}\right)=O(n^{2/3})$.

In fact, the formal proof of this proceeds by an identical argument, only using the exploration process. Because I’ve described this several times before, I’ll be brief. We track how far we have gone through each component in a depth-first walk. In both the supercritical and subcritical cases, when we scale correctly we get a random path which is basically deterministic in the limit (in n). For exactly the same reasons as visible CLT fluctuations for partial sums of RVs with expectation zero, we start seeing interesting effects at criticality.

The important question is the order of rescaling to choose. At each stage of the exploration process, the number of vertices added to the stack is binomial. We want to distinguish between components of size $O(n^{2/3})$ so we should look at the exploration process at time $sn^{2/3}$. The drift of the exploration process is given by the expectation of a binomial random variable minus one (since we remove the current vertex from the stack as we finish exploring it). This is given by

$\mathbb{E}=\left[n-sn^{2/3}\right]\cdot \frac{1}{n}-1=-sn^{-1/3}.$

Note that this is the drift in one time-step. The drift in $n^{2/3}$ time-steps will accordingly by $sn^{1/3}$. So, if we rescale time by $n^{2/3}$ and space by $n^{1/3}$, we should get a nice stochastic process. Specifically, if Z is the exploration process, then we obtain:

$\frac{1}{n^{1/3}}Z^{(n)}_{sn^{2/3}} \rightarrow_d W_s,$

where W is a Brownian motion with inhomogeneous drift -s at time s. The net effect of such a drift at a fixed positive time is given by integrating up to that time, and hence we might say the process has quadratic drift, or is parabolic.

We should remark that our binomial expectation is not entirely correct. We have discounted those $sn^{2/3}$ vertices that have already been explored, but we have not accounted for the vertices currently in the stack. We should also be avoiding considering these. However, we now have a heuristic for the approximate number of these. The number of vertices in the stack should be $O(n^{1/3})$ at all times, and so in particular will always be an order of magnitude smaller than the number of vertices already considered. Therefore, they won’t affect this drift term, though this must be accounted for in any formal proof of convergence. On the subject of which, the mode of convergence is, unsurprisingly, weak convergence uniformly on compact sets. That is, for any fixed S, the convergence holds weakly on the random functions up to time $sn^{2/3}$.

Note that this process will tend to minus infinity almost surely. Component sizes are given by excursions above the running minimum. The process given by the height of the original process above the running minimum is called reflected. Essentially, we construct the reflected process by having the same generator when the current value is positive, and forcing the process up when it is at zero. There are various ways to construct this more formally, including as the scaling limit of some simple random walks conditioned never to stay non-negative.

The cute part of the result is that it holds equally well in a so-called critical window either side of the critical probability 1/n. When the probability is $\frac{1+tn^{-1/3}}{n}$, for any $t\in \mathbb{R}$, the same argument holds. Now the drift at time s is t-s, though everything else still holds.

This result was established by Aldous in [1], and gives a mechanism for calculating distributions of component sizes and so on through this critical window.

In particular, we are now in a position to answer the original question regarding how many such components there were. The key idea is that because whenever we exhaust a component in the exploration process, we choose a new vertex uniformly at random, we are effectively choosing a component according to the size-biased distribution. Roughly speaking, the largest components will show up near the beginning. Note that a critical $O(n^{2/3})$ component will not necessarily be exactly the first component in the exploration process, but the components that are explored before this will take up sufficiently few vertices that they won’t show up in the scaling of the limit.

In any case, the reflected Brownian motion ‘goes on forever’, and the drift is eventually very negative, so there cannot be infinitely wide excursions, hence there are infinitely many such critical components.

If we care about the number of cycles, we can treat this also via the exploration process. Note that in any depth-first search we are necessarily only interested in a spanning tree of the host graph. Anyway, when we are exploring a vertex, there could be extra edges to other vertices in the stack, but not to vertices we’ve already finished exploring (otherwise the edge would have been exposed then). So the expected number of excess edges into a vertex is proportional to the height of the exploration process at that vertex. So the overall expected number of excess edges, conditional on the exploration process is the area under the curve. This carries over perfectly well into the stochastic process limit. It is then a calculation to verify that the area under the curve is almost surely infinite, and thus that we expect there to be infinitely many cycles in a critical random graph.

REFERENCES

[1] Aldous D. – Brownian excursions, critical random graphs and the multiplicative coalescent

# The Configuration Model

In the past, I’ve talked about limitations of the Erdos-Renyi model of homogeneous random graphs for applications in real-world networks. In a previous post, I’ve discussed a dynamic model, the Preferential Attachment mechanism, that ‘grows’ a graph dynamically by adding edges from new vertices preferentially to existing vertices with high degree. The purpose of this adjustment is to ensure that the distribution of the degrees is not concentrated around some fixed value (which would be c in G(n,c/n) ) but rather exhibits a power-law tail such as observed in many genuine examples.

In this post, we introduce some aspects of the configuration model, which achieves this property more directly. This idea probably first arose in the guise of regular graphs. Recall a regular graph has all degrees equal. How would we construct a random d-regular graph on a large number of vertices?

What we probably want to do is to choose uniformly at random from the set of such graphs, but it is not clear even how large this set is, let alone how one would order its elements to make it possible to make this uniform choice. Instead, we try the following. Assign to each vertex d so-called stubs, which will end up being ‘half-edges’. We then choose two stubs uniformly at random, and glue them together. More formally, we construct an edge between the host vertices, and then delete the chosen stubs. We then continue.

The construction makes no reference to the distribution of stubs, so we are free to choose this as we please. We could for example specify some sequence of degrees which approximates a power-law, so we could sample a random sequence of degrees in some way. So long as we have a sequence of stub set sizes before we start building the edges of the graph we will be able to use the above algorithm.

So what might go wrong? There seem to me to be three potential problems that might arise with this construction.

Firstly, there might be a stub left over, if the sum of the stub set sizes is odd. Recall that in a graph the sum of the degrees is twice the sum of the number of edges, and so in particular the sum of the degrees should be even. But this is a small problem. When the degree sequence is deterministic we can demand that it have even sum, and if it is random, we will typically be working in a large N regime, and so deleting the solitary stub, if such a thing exists, will not affect the sort of properties of the graph we are likely to be interested in.

The second and third objections are perhaps more serious. If we glue together stubs naively, we might end up with loops, that is, edges that ‘begin’ and ‘end’ at the same vertex. These are not allowed in the standard definition of a graph. Alternatively, we might end up with more than one edge between the same pair of vertices.

Our overall aim is that this mechanism gives a convenient way of simulating the uniform distribution on simple graphs with a given degree sequence. At present we have the uniform distribution on potential multigraphs, with a weighting of 1/k! for every multi-edge with multiplicity k, and a weighting of 1/2 for every loop. The latter can be seen because there is an initial probability proportional to $d(v_i)d(v_j)$ that vertices v_i and v_j will be joined, whereas a probability proportional (with the same constant) to $d(v_i)^2$ that v_i will receive a loop. The multi-edge weighting justification is similar.

However, conditional on getting a simple graph, the distribution is uniform on the set of simple graphs with that degree sequence. So it remains to investigate the probability that a graph generated in this way is simple. So long as this probability does not tend to 0 as n grows, we will probably be happy.

The strongest results on this topic are due to Janson. First observe that if the sum of the degrees grows faster than the number of vertices n, we fail to get a graph without loops with high probability. Heuristically, note that on the first pass, we are taking two picks from the set of vertices, biased by the number of stubs. By Cauchy-Schwarz, Rearrangement Inequality or just intuition, the probability of getting the same vertex is greater than if we picked uniformly from the set of vertices without biasing. So the probability of getting no loop on the first pass is $\le (1-\frac{1}{n})$. Take some function a(n) that grows faster than n, but slower than the sum of the degrees. Then after a(n) passes, the degree distribution is still roughly the same. In particular, the sum of the degrees is still an order of magnitude greater than n. So we obtain:

$\mathbb{P}(\text{no loops})\leq (1-\frac{1}{n})^{a(n)}\approx e^{-\frac{a(n)}{n}}\rightarrow 0.$

So, since isolated vertices have no effect on the simplicity or otherwise, we assume the sum of the degrees is $\Theta(n)$. Then, Janson shows that the further condition

$\sum_{i=1}^n d_i^2=O(n),$

is essentially necessary and sufficient for simplicity. We can see why this might be true by looking at the probability that the first edge added is a loop, which is roughly

$\frac{d_1^2+d_2^2+\ldots+d_n^2}{2(\sum d_i)^2}.$

We have to consider $O(\sum d_i)$ edges, so if the above expression is much larger than this, we can perform a similar exponential estimate to show that the probability there are no loops is o(1). The technical part is showing that this probability doesn’t change dramatically as the first few stubs disappear.

Note that in both cases, considering only loops is sufficient for simplicity. Although it looks like loop appearance is weaker than multiplicity of edges, in fact they have the same threshold. It should also be pointed out that, like the uniform random forests, an alternative approach is simply to count the number of simple graphs and multigraphs with a given degree sequence. Good asymptotics can then be found for the probability of simplicity.

In the case of G(n,c/n), we were particularly interested in the emergence of the giant component at time c=1. While first-moment methods can be very effective in demonstrating such results, a branching process local limit representation is probably easiest heuristic for this phase transition.

So long as the degree sequences converge in a natural way, we can apply a similar approach to this configuration model. Concretely, we assume that the proportion of vertices with degree i is $\lambda_i$ in the limit. Although the algebra might push through, we should be aware that this means we are not explicitly specifying how many vertices have degree, eg $\Theta(n^{1/2})$. For now assume the $\lambda_i$s sum to 1, so specify a probability distribution for degree induced by choosing a vertex uniformly at random.

So we start at a vertex, and look at its neighbours. The expected number of neighbours of this root vertex is $\sum i\lambda i$. Thereafter, when we consider a child vertex, based on how the stubs are paired up (and in particular the fact that the order of the operations does not matter – the choice of partner of a given stub is chosen uniformly at random), we are really choosing a stub uniformly at random. This corresponds to choosing a vertex at random, biased by the number of stubs available. The quantity of interest is how many additional stubs (other than the one that led to the vertex) are attached to this vertex. We assume we don’t need to worry too much about repeating vertices, in a similar way to G(n,c/n). So the expected number of additional stubs is

$\frac{1}{\sum i\lambda_i}\sum i\lambda_i(i-1).$

For an infinite component, we required the expectation to be > 1, which is equivalent to

$\sum \lambda_i i(i-2)>0.$

This was proven by Molloy and Reed (95), then with fewer conditions by Janson (07). The latter also shows how to use this construction to derive the giant component for G(n,c/n) result.

REFERENCES

Janson – A New Approach to the Giant Component Problem

Molloy, Reed – A Critical Point for Random Graphs with a Given Degree Sequence

Janson – The Probability that  Random Multigraph is Simple

# Recent Progress and Gromov-Hausdorff Convergence

For the past few weeks I’ve been working on the problem of Cycle-Induced Forest Fires, which I’ve referred to in passing in some recent posts. The aim has been to find a non-contrived process which exhibits self-organised criticality, that is, where the process displays critical characteristics (scaling laws, multiple components at the largest order of magnitude) forever. Note that this is in contrast to the conventional Erdos-Renyi graph process, which is only critical at a single time n/2.

The conjecture is that the largest component in equilibrium typically has size on a scale of n^2/3. An argument based on the equilibrium proportion of isolated vertices gives an upper bound on this exponent. The working argument I have for the lower bound at the moment can comfortably fit on the back of a napkin, with perhaps some context provided verbally. Of course, the current full text is very much larger than that, mainly because the napkin would feature assertions like “event A happens at time $O(n^\beta)$“; whereas the more formal argument has to go like:

“With high probability as $n\rightarrow\infty$, event A happens between times $n^{\beta-\epsilon},n^{\beta+\epsilon},$ for any suitably small $\epsilon>0.$ Furthermore, the probability that A happens after this upper threshold decays exponentially with n for fixed $\epsilon$, and the probability that A happens before the lower threshold is at most $n^{-\epsilon}$. Finally, this is under the implicit assumption that there will be no fragmentations before event A, and this holds with probability $1-o(1)$ etc.”

It’s got to the point where I’ve exhausted the canonical set of symbols for small quantities: $\epsilon,\delta,(\eta ?)$.

This has been a very long way of setting up what was going to be my main point, which is that at many points during undergraduate mathematics, colleagues (and occasionally to be honest, probably myself too) have complained that they “don’t want to have anything to do with analysis. They just want to focus on algebra / number theory / statistics / fluids…” Anyway, the point of this ramble was that I think I’ve realised that it is very hard to think about any sort of open problem without engaging with the sort of ideas that a few years ago I would have thought of (and possibly dismissed) as ‘analysis’.

Much of my working on this problem has been rather from first principles, so haven’t been thinking much about any neat less elementary theory recently.

Ok, so on with the actual post now.

Last month I talked about local limits of graphs, which describe convergence in distribution of (local) neighbourhood structure about a ‘typical’ vertex. This is the correct context in which to make claims like “components of $G(n,\frac{\lambda}{n})$ look like Galton-Watson trees with offspring distribution $\text{Po}(\lambda)$“.

Even from this example, we can see a couple of drawbacks and omissions from this limiting picture. In the sub-critical regime, this G-W tree will be almost surely finite, but the number of vertices in the graph is going to infinity. More concretely, the limit description only tells us about a single component. If we wanted to know about a second component, in this case, it would be roughly independent of the size of the first component, and with the same distribution, but if we wanted to know about all components, it would get much more complicated.

Similarly, this local limit description isn’t particularly satisfactory in the supercritical regime. When the component in question is finite, this description is correct, but with high probability we have a giant component, and so the ‘typical’ vertex is with some positive probability in the giant component. This is reflected by the fact that the G-W tree with supercritical offspring distribution is infinite with some positive probability. However, the giant component does not look like a $\text{Po}(\lambda)$ G-W tree. As we exhaust O(n) vertices, the offspring distribution decreases, in expectation at least. In fact, without the assumption that the giant component is with high probability unique (so $\frac{L_1}{n}=1-\mathbb{P}(|C(v)|<\infty$), we can’t even deduce the expected size of the giant component from the local limit result.

This is all unsurprising. By definition a local limit describes the structure near some vertex. How near? Well, finitely near. It can be arbitrarily large, but still finite, so in particular, the change in the offspring distribution after O(n) vertices as mentioned above will not be covered.

So, if we want to learn more about the global structure of a large discrete object, we need to consider a different type of limit. In particular, the limit will not necessarily be a graph. Rather than try to define a priori a ‘continuum’ version of a graph, it is sensible to generalise from the idea that a graph is a discrete object and instead consider it as a metric space.

In this article, I don’t want to spend much time at all thinking about how to encode a finite graph as a metric space. We have a natural notion of graph distance between vertices, and it is not hard to extend this to points on edges. Alternatively, for sparse graphs, we have an encoding through various functions, which live in some (metric) function space.

However, in general, the graph will be a metric object itself, rather than necessarily a subset of a global metric space. We will be interested in convergence, so we need a suitable style of convergence of different metric spaces.

The natural candidate for this is the Gromov-Hausdorff metric, and the corresponding Gromov-Hausdorff convergence.

The Hausdorff distance between two subsets X, Y of a metric space is defined as follows. Informally, we say that $d_H(X,Y)<\epsilon$ if any point of X is within distance $\epsilon$ from some point of Y, in the sense of the original metric. Formally

$d_H(X,Y):=\max \{\sup_{x\in X}\inf_{y\in Y}d(x,y), \sup_{y\in Y}\inf_{x\in X}d(x,y)\}.$

It is not particularly illuminating to prove that this is in fact a metric. In fact, it isn’t a metric as the definition stands, but rather a pseudo-metric, which is exactly the same, only allowing d(X,Y)=0 when X and Y are not equal. Note that

$d(X^\circ,\bar X)=0,$

for any set X, so this gives an example, provided X is not both open and closed. Furthermore, if the underlying metric space is unbounded, then the Hausdorff distance between two sets might be infinite. For example in $\mathbb{R}$,

$d_H(\mathbb{R}_{<0},\mathbb{R}_{>0})=\infty.$

We can overcome this pair of objections by restricting attention to closed, bounded sets. In practice, many spaces under consideration will be real in flavour, so it makes sense to define this for compact sets when appropriate.

But this still leaves the underlying problem, which is how to define a distance function on metric spaces. If two metric spaces X and Y were both subspaces of some larger metric space then it would be easy, as we now have the Hausdorff distance. So this is in fact how we proceed in general. We don’t need any knowledge of this covering space a priori, we can just choose the one which minimises the resulting Hausdorff distance. That is

$d_{GH}(X,Y)=\inf\{d_H(\phi(X),\psi(Y))\},$

where the infimum is taken over all metric spaces (E,d), and isometric embeddings $\phi: X\rightarrow E, \psi: Y\rightarrow E$.

The first observation is that this will again be a pseudometric unless we demand that X, Y be closed and bounded. The second is that this index set is not a set. Fortunately, this is quickly rectified. Instead consider all metrics on the disjoint union of sets X and Y, which is set, and contains the subset of those metrics which restrict to the correct metric on each of X and Y. It can be checked that this forms a true metric on the set of compact metric spaces up to isometry.

We have an alternative characterisation. Given compact sets X and Y, a correspondence between X and Y is a set of pairs in $X\times Y$, such that both projection maps are surjective. Ie for any x in X, there is some pair (x,y) in the correspondence. Let $\mathcal{C}(X,Y)$ be the set of such correspondences. We then define the distortion of correspondence $\mathcal{R}$ by:

$\text{dis}(\mathcal{R}):=\sup\{|d_X(x_1,x_2)-d_2(y_1,y_2)|: (x_i,y_i)\in\mathcal{R}\}.$

Then

$d_{GH}(X,Y)=\frac{1}{2}\inf_{\mathcal{R}\in\mathcal{C}(X,Y)}\text{dis}(\mathcal{R}).$

In particular, this gives another reason why we don’t have to worry about taking an infimum over a proper class.

Gromov-Hausdorff convergence then has the natural definition. Note that this does not respect topological equivalence, ie homeomorphism. For example,

$\bar{B(0,\frac{1}{n})}\stackrel{GH}{\rightarrow} \{0\},$

where the latter has the trivial metric. In particular, although all the closed balls are homeomorphic, the G-H limit is not.

A final remark is that the trees we might be looking at are not necessarily compact, so it is useful to have a notion of how this might be extended to non-compact spaces. The answer is to borrow the idea from local limits of considering large finite balls around a fixed central point. In the case of trees, this is particularly well-motivated, as it is often quite natural to have a canonical choice for the ‘root’.

Then with identified points $p_n\in X_n$, say $(X_n,p_n)\rightarrow (X,p)$ if for any R>0 the R-ball around p_n in X_n converges to the R-ball around p in X. We adjust the definition of distortion to include the condition that the infimum be over correspondences for which $(p_X,p_Y)$ is an element.

REFERENCES

This article was based on some lecture notes by Jean-Francois Le Gall from the Clay Institute Summer School which can be found on the author’s website here (about halfway down the page). This material is in chapter 3. I also used Nicolas Curien’s tutorials on this chapter to inform some of the examples. The resolution of the proper class problem was mentioned by several sources I examined. These notes by Jan Christina were among the best.

# Large Deviations 6 – Random Graphs

As a final instalment in this sequence of posts on Large Deviations, I’m going to try and explain how one might be able to apply some of the theory to a problem about random graphs. I should explain in advance that much of what follows will be a heuristic argument only. In a way, I’m more interested in explaining what the technical challenges are than trying to solve them. Not least because at the moment I don’t know exactly how to solve most of them. At the very end I will present a rate function, and reference properly the authors who have proved this. Their methods are related but not identical to what I will present.

Problem

Recall the two standard definitions of random graphs. As in many previous posts, we are interested in the sparse case where the average degree of a vertex is o(1). Anyway, we start with n vertices, and in one description we add an edge between any pair of vertices independently and with fixed probability $\frac{\lambda}{n}$. In the second model, we choose uniformly at random from the set of graphs with n vertices and $\frac{\lambda n}{2}$ edges. Note that if we take the first model and condition on the number of edges, we get the second model, since the probability of a given configuration appearing in G(n,p) is a function only of the number of edges present. Furthermore, the number of edges in G(n,p) is binomial with parameters $\binom{n}{2}$ and p. For all purposes here it will make no difference to approximate the former by $\frac{n^2}{2}$.

Of particular interest in the study of sparse random graphs is the phase transition in the size of the largest component observed as $\lambda$ passes 1. Below 1, the largest component has size on a scale of log n, and with high probability all components are trees. Above 1, there is a unique giant component containing $\alpha_\lambda n$ vertices, and all other components are small. For $\lambda\approx 1$, where I don’t want to discuss what ‘approximately’ means right now, we have a critical window, for which there are infinitely many components with sizes on a scale of $n^{2/3}$.

A key observation is that this holds irrespective of which model we are using. In particular, this is consistent. By the central limit theorem, we have that:

$|E(G(n,\frac{\lambda}{n}))|\sim \text{Bin}\left(\binom{n}{2},\frac{\lambda}{n}\right)\approx \frac{n\lambda}{2}\pm\alpha,$

where $\alpha$ is the error due to CLT-scale fluctuations. In particular, these fluctuations are on a scale smaller than n, so in the limit have no effect on which value of $\lambda$ in the edge-specified model is appropriate.

However, it is still a random model, so we can condition on any event which happens with positive probability, so we might ask: what does a supercritical random graph look like if we condition it to have no giant component? Assume for now that we are considering $G(n,\frac{\lambda}{n}),\lambda>1$.

This deviation from standard behaviour might be achieved in at least two ways. Firstly, we might just have insufficient edges. If we have a large deviation towards too few edges, then this would correspond to a subcritical $G(n,\frac{\mu n}{2})$, so would have no giant components. However, it is also possible that the lack of a giant component is due to ‘clustering’. We might in fact have the correct number of edges, but they might have arranged themselves into a configuration that keeps the number of components small. For example, we might have a complete graph on $Kn^{1/2}$ vertices plus a whole load of isolated vertices. This has the correct number of edges, but certainly no giant component (that is an O(n) component).

We might suspect that having too few edges would be the primary cause of having no giant component, but it would be interesting if clustering played a role. In a previous post, I talked about more realistic models of complex networks, for which clustering beyond the levels of Erdos-Renyi is one of the properties we seek. There I described a few models which might produce some of these properties. Obviously another model is to take Erdos-Renyi and condition it to have lots of clustering but that isn’t hugely helpful as it is not obvious what the resulting graphs will in general look like. It would certainly be interesting if conditioning on having no giant component were enough to get lots of clustering.

To do this, we need to find a rate function for the size of the giant component in a supercritical random graph. Then we will assume that evaluating this near 0 gives the LD probability of having ‘no giant component’. We will then compare this to the straightforward rate function for the number of edges; in particular, evaluated at criticality, so the probability that we have a subcritical number of edges in our supercritical random graph. If they are the same, then this says that the surfeit of edges dominates clustering effects. If the former is smaller, then clustering may play a non-trivial role. If the former is larger, then we will probably have made a mistake, as we expect on a LD scale that having too few edges will almost surely lead to a subcritical component.

Methods

The starting point is the exploration process for components of the random graph. Recall we start at some vertex v and explore the component containing v depth-first, tracking the number of vertices which have been seen but not yet explored. We can extend this to all components by defining:

$S(0)=0, \quad S(t)=S(t-1)+(X(t)-1),$

where X(t) is the number of children of the t’th vertex. For a single component, S(t) is precisely the number of seen but unexplored vertices. It is more complicated in general. Note that when we exhaust the first component S(t)=-1, and then when we exhaust the second component S(t)=-2 and so on. So in fact

$S_t-\min_{0\leq s\leq t}S_s$

is the number of seen but unexplored vertices, with $\min_{0\leq s\leq t}S_s$ equal to (-1) times the number of components already explored up to time t.

Once we know the structure of the first t vertices, we expect the distribution of X(t) – 1 to be

$\text{Bin}\Big(n-t-[S_t-\min_{0\leq s\leq t}S_s],\tfrac{\lambda}{n}\Big)-1.$

We aren’t interested in all the edges of the random graph, only in some tree skeleton of each component. So we don’t need to consider the possibility of edges connecting our current location to anywhere we’ve previously visited (as such an edge would have been consider then – it’s a depth-first exploration), hence the -t. But we also don’t want to consider edges connecting our current location to anywhere we’ve seen, since that would be a surplus edge creating a cycle, hence the -S_s. It is binomial because by independence even after all this conditioning, the probability that there’s an edge from my current location to any other vertex apart from those discounted is equal to $\frac{\lambda}{n}$ and independent.

For Mogulskii’s theorem in the previous post, we had an LDP for the rescaled paths of a random walk with independent stationary increments. In this situation we have a random walk where the increments do not have this property. They are not stationary because the pre-limit distribution depends on time. They are also not independent, because the distribution depends on behaviour up to time t, but only through the value of the walk at the present time.

Nonetheless, at least by following through the heuristic of having an instantaneous exponential cost for a LD event, then products of sums becoming integrals within the exponent, we would expect to have a similar result for this case. We can find the rate function $\Lambda_\lambda^*(x)of$latex \text{Po}(\lambda)-1\$ and thus get a rate function for paths of the exploration process

$I_\lambda(f)=\int_0^1 \Lambda_{(1-t-\bar{f}(t))\lambda}^*(f')dt,$

where $\bar{f}(t)$ is the height of f above its previous minimum.

Technicalities and Challenges

1) First we need to prove that it is actually possible to extend Mogulskii to this more general setting. Even though we are varying the distribution continuously, so we have some sort of ‘local almost convexity’, the proof is going to be fairly fiddly.

2) Having to consider excursions above the local minima is a massive hassle. We would ideally like to replace $\bar{f}$ with f. This doesn’t seem unreasonable. After all, if we pick a giant component within o(n) steps, then everything considered before the giant component won’t show up in the O(n) rescaling, so we will have a series of macroscopic excursions above 0 with widths giving the actual sizes of the giant components. The problem is that even though with high probability we will pick a giant component after O(1) components, then probability that we do not do this decays only exponentially fast, so will show up as a term in the LD analysis. We would hope that this would not be important – after all later we are going to take an infimum, and since the order we choose the vertices to explore is random and in particular independent of the actual structure, it ought not to make a huge difference to any result.

3) A key lemma in the proof of Mogulskii in Dembo and Zeitouni was the result that it doesn’t matter from an LDP point of view whether we consider the linear (continuous) interpolation or the step-wise interpolation to get a process that actually lives in $L_\infty([0,1])$. In this generalised case, we will also need to check that approximating the Binomial distribution by its Poisson limit is valid on an exponential scale. Note that because errors in the approximation for small values of t affect the parameter of the distribution at larger times, this will be more complicated to check than for the IID case.

4) Once we have a rate function, if we actually want to know about the structure of the ‘typical’ graph displaying some LD property, we will need to find the infimum of the integrated rate function with some constraints. This is likely to be quite nasty unless we can directly use Euler-Lagrange or some other variational tool.

Papers by O’Connell and Puhalskii have found the rate function. Among other interesting things, we learn that:

$I_{(1+\epsilon)}(0)\approx \frac{\epsilon^3}{6},$

while the rate function for the number of edges:

$-\lim\tfrac{1}{n}\log\mathbb{P}\Big(\text{Bin}(\tfrac{n^2}{2},\tfrac{1+\epsilon}{n})\leq\tfrac{n}{2}\Big)\approx \frac{\epsilon^2}{4}.$

So in fact it looks as if there might be a significant contribution from clustering after all.

# From G(n,p) to G(n,m)

In the previous posts about random graphs, I was focusing on the model G(n,p). Here, we have n vertices, and we insert an edge between any pair of vertices independently with probability p. In particular, the number of edge which appear in a realisation of G(n,p) is a random variable, distributed as $\text{Bin}(\frac{n(n-1)}{2},p)$.

The original model examined by Erdos and Renyi, after whom the random graph described above was named, was slightly different. Still with n vertices, they specified how many edges m they wanted in the graph, and chose uniformly at random from the set of graphs with this number of edges. This model is usually denoted G(n,m). Normally we can tell them apart by context. Obviously, p is a probability so lies in [0,1], whereas m is a positive integer, so there isn’t much room for ambiguity.

The key observation is that, if H is some graph with n vertices and m edges, then the probability that this is appears as G(n,p) is

$p^{E(H)}(1-p)^{n-E(H)}.$

This is constant if we vary H while fixing m. In other words, G(n,p) conditioned to have m edges is G(n,m). So, via some sort of law of total probability, we can construct G(n,p) by taking m to be distributed as $\text{Bin}(\frac{n(n-1)}{2},p)$, and conditional on that, sampling from G(n,m). (*)

We can couple G(n,p) for all p, by assigning iid uniform [0,1] random variables to each pair of vertices, then including the edge in G(n,p) only if the value of the RV is greater than 1-p. Similarly, it is often helpful to think of G(n,m) as m varies as a random process, where edges are added one at a time, and at each stage the next edge is chosen uniformly at random from those not currently present. Perhaps because of this, it is sometimes easier to prove results for G(n,p) than for G(n,m) so we want to develop a framework to move between the two.

The decomposition (*) gives a relatively straightforward way to move from a result in G(n,m) to a result in G(n,p). By the Central Limit Theorem, the number of edges in G(n,p) is $\binom{n}{2}p+O(n)$ in the limit, and so if a result with high probability in G(n,m) for all m in some interval, say $[\binom{n}{2}p-Kn,\binom{n}{2}p+Kn]$ for some large K, then the law of total probability shows that the property holds with high probability in G(n,p).

In general, we get more interesting properties when p is a function of n. As discussed in previous posts, the scaling $p=\frac{\lambda}{n}$ is particularly worth studying. CLT now shows that $G(n,\frac{\lambda}{n})$ has $\frac{\lambda n}{2}+O(\sqrt{n})$ edges in the limit. If you are confused why you can’t just substitute this value for p into the previous expression, note that p(1-p) does appear in the general asymptotic variance, but this gets absorbed into the “big O” notation when p is constant.

More importantly, many properties that we might want to consider are not in general affected in the limit by adding or removing $O(\sqrt{n})$ edges. For example, with high probability, G(n,m) has largest component of size $(\zeta_\lambda+o(1)) n$ whenever $\lambda>1$ and $m\in [\frac{\lambda n}{2}-O(\sqrt{n}),\frac{\lambda n}{2}+O(\sqrt{n})]$. Some of this notation would need to be made a bit more precise in a formal argument, but for now, let’s take that as given. This then implies that with high probability, $G(n,\frac{\lambda}{n})$ has largest component of size $(\zeta_\lambda+o(1))n$ also.

Of course, from the logical structure of this blog, this deduction is a bit bogus, because we have only just introduced G(n,m), and have no idea about the properties of its giant components yet. We seek instead an argument to deduce facts about G(n,m) from facts about G(n,p). Because G(n,m) cannot obviously be written as some conditioned combination of G(n,p)s, this instinctively seems harder. Bollobas gives various general conditions to carry results over between the two regimes in his Part III course notes, but I feel that an examples would be the easiest way to explain the ideas.

The size of the largest component is such an important quantity, we might as well consider that, in the subcritical case. We work with $G(n,\frac{\lambda}{n})$, for which we have the result:

$\mathbb{P}_{\lambda+o(1)}(C_1\geq a\log n)=O(n^{-\delta}),$

for some $\delta>0$, whenever $a>I_\lambda^{-1}$, the rate function at 1 of the total population size of a Poisson $\lambda$ branching process. For now, that doesn’t matter too much, except that it is continuous as a function of $\lambda$. We want to show that $G(n,\frac{\lambda n}{2})$ has the same property.

The trick is to consider $G(n,\frac{\lambda+\epsilon}{n})$ instead. Let $E_a$ be the event described above. By the law of total probability and the decomposition mentioned above, we have:

$\mathbb{P}_{\lambda+\epsilon}(E_a)=\sum_{m=0}^n\mathbb{P}(\text{Bin}(\frac{n(n-1)}{2},\frac{\lambda+\epsilon}{n})=m)\mathbb{P}_{n,m}(E_a).$

We are going to split this sum into

$\sum_{m=0}^{\frac{(\lambda+\epsilon )n}{2}-n^{3/4}}+\sum_{m=\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}^n.$

On the first of these sums, we bound using the fact that probabilities are less than 1, and on the second, we use that $\mathbb{P}_{n,m}(E_a)$ is an increasing function of m. This property is special to the event we are considering – in general one might have to be a bit more clever, perhaps using continuity of $P_{n,m}(E)$, interpreting continuity in the limit with n. Anyway, this enables us to bound:

$\mathbb{P}_{\lambda+\epsilon}(E_a)\geq \mathbb{P}(\text{Bin}(\frac{n(n-1)}{2},\frac{\lambda+\epsilon}{n})\leq \frac{(\lambda+\epsilon)n}{2}-n^{3/4})+\mathbb{P}_{\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}(E_a)\mathbb{P}(\text{Bin}(\frac{n(n-1)}{2},\frac{\lambda+\epsilon}{n})\geq\frac{(\lambda+\epsilon)n}{2}-n^{3/4}).$

By the Central Limit Theorem, this first probability tends to 0, while the final term tends to 1. We therefore have:

$\mathbb{P}_{\lambda+\epsilon}(E_a)\geq \mathbb{P}_{\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}(E_a).$

We demanded that $a>I_\lambda^{-1}$, and mentioned that this function was continuous, so since we have total freedom over $\epsilon$, in particular, we can choose $\epsilon>0$ such that $a>I_{\lambda+\epsilon}^{-1}$. By the work on G(n,p), we have $\mathbb{P}_{\lambda+\epsilon}(E_a)=O(n^{-\delta})$, and for large enough n, we have $\frac{(\lambda+\epsilon)n}{2}-n^{3/4}>>\frac{\lambda n}{2}$, and so the result $\mathbb{P}_{\frac{\lambda n}{2}}(E_a)=O(n^{-\delta})$ follows.

# Analytic vs Probabilistic Arguments for a Supercritical BP

This follows on directly from the previous post. I was originally going to talk only about what follows, but I got rather carried away with the branching process account. I was stuck on a particular exercise, and we ended up coming up with two arguments: one analytic and one probabilistic. Since the typical flavour of this blog is to present problems which show the advantage of the probabilistic approach, it seems only fair to remark on this case, where the analytic method was less interesting, but much simpler.

Recall that we have a supercritical random graph $G(n,\frac{\lambda}{n}), \lambda>1$, and we are considering the rescaled exploration process $S_{nt}$, which has asymptotic mean $\mu_t=1-t-e^{-\lambda t}$. We can calculate similarly an expression for the asymptotic variance

$\frac{\text{Var}(S_{nt})}{n}\rightarrow v_t=e^{-\lambda t}(1-e^{-\lambda t}).$

To use this to verify the result about the size of the giant component, we verify that $\mu_{\zeta_\lambda+x/\sqrt{n}}$ is negative, and has small variance, which would confirm that the giant component has size bounded above by $\zeta_\lambda$ almost surely. A similar argument is required for the lower bound. The variance is a separate matter, but it is therefore necessary that $\mu_t$ should be decreasing at $t=\zeta_\lambda$, that is $\mu_t'=\lambda e^{-\lambda \zeta_\lambda}<0$. This is what we try to prove in the remainder of this post. Recall that in the previous post we have checked that it is equal to zero here.

Heuristic Explanation

$\mu_t$ has been rescaled from the original definition of the exploration process in both size and time-scale so some care is needed to see why this should hold in the limit. Remember that all components apart from the giant component are of size O(log n). So immediately after exhausting the giant component, you are likely to be visiting components of size roughly log n. A time interval of dt for $\mu$ corresponds to ndt for S, during which S will visit some components of size log n and some of O(1) and some in between. In particular, some fixed proportion of vertices are isolated, that is, in a component of size 1.

There is then a complicated size-biasing train of thought. A component of size log n is more likely to come up than an isolated vertex, but there are not as many of them. The log n components push the derivative $\mu_t'$ towards zero, because S_t decreases by 1 over a time-interval of length log n, which gives a gradient of zero in the limit. However, the isolated vertices give a gradient of -1, because S_t decreases by 1 over a time interval of 1. Despite the fact that log n intervals are likely to appear earlier, it still remains the case that after exhausting a component (in particular, at time $t=\zeta_\lambda$, after exhausting the giant component), with some bounded below positive probability you will choose an isolated vertex next. The component size only affects that time-scale if it is O(n), which none of the remaining components are, so the derivative $\mu_{\zeta_\lambda}'$ consists of some complicated weighted mean of 0 and -1. In particular, it is negative.

Analytic solution

Obviously, that won’t do in practice. Suppressing lambdas for ease of notation, the key fact is: $e^{-\lambda \zeta}=1-\zeta$. We want to show that $\lambda e^{-\lambda \zeta}<1$. Substituting

$\lambda=-\frac{\log(1-\zeta)}{\zeta},$

means that it is required to show:

$-\frac{1-\zeta}{\zeta}\log(1-\zeta)<1.$

Differentiating the left hand side gives:

$\frac{\log(1-\zeta)+\zeta}{\zeta^2}<0,$

since of course $\log(1-\zeta)=\zeta+\frac{\zeta^2}{2}+\frac{\zeta^3}{3}+\dots$. So it suffice to check the result for small $\zeta$. But, again using a Taylor series:

$-\frac{1-\zeta}{\zeta}\log(1-\zeta)=1-\frac12\zeta+O(\zeta^2)<1,$

for small $\zeta$. This gives the required result.

Probabilistic Interpretation and Solution

First, we observe that $\lambda e^{-\lambda\zeta}=\lambda(1-\zeta)$ is the expected number of vertices in the first generation of a $\text{Po}(\lambda)$ whose progeny become extinct. This motivates considering the canonical decomposition of a supercritical branching process Z into the skeleton process and the dual process. The skeleton $Z^+$ consists of all vertices which have infinitely many successors. It is relatively easy to show that this is a branching process with offspring distribution $\text{Po}(\lambda\zeta)$ conditioned on being positive. The dual process $Z^*$ is a G-W branching process with offspring distribution $\text{Po}(\lambda)$ conditioned on dying. This is the same as a branching process with offspring distribution $\text{Po}(\lambda(1-\zeta)$, by a sprinkling argument, which says that if we begin with a Poisson number of things, then remove each one independently with some fixed probability, the remaining number of things is Poisson also.

We can construct the original branching process by

• With probability $\zeta$, take the skeleton, and affixe independent copies of $Z^*$ at every vertex in the skeleton.
• With probability $1-\zeta$, just take a copy of $Z^*$.

It is immediately clear that $\lambda(1-\zeta)\leq 1$. After all, the dual process is almost surely finite, so the offspring distribution cannot have expectation greater than 1. Checking that this is strong is more fiddly. The best way I have come up with is to examine the tail of the distribution of total population size of the original branching process.

The total population size T of a branching process has an exponential tail if the offspring distribution is subcritical. It isn’t hugely surprising that this behaves like a large deviation for iid RVs, since in the limit such an event requires a lot of the offspring counts to deviate substantially from the mean. The same holds in the supercritical case, with the additional complication that though the finite tail decays exponential, there is positive probability that the total size will be infinite. In the critical case, however, there is a power-law decay. This is not hugely surprising as it marks the threshhold for the appearance of the infinite population, just as in a multiplicative coalescent at time 1, we have a load of very large components just about to form a giant component. The tool for all of these results is Dwass’s Theorem, which says:

$\mathbb{P}(T=n)=\frac{1}{n}\mathbb{P}(X_1+\ldots+X_n=n-1),$

where $X_1$ are iid with the offspring distribution. When $\mathbb{E}X_1\neq 1$, this is a large deviation event, for which Cramer’s theorem applies (assuming, as is the case for the Poisson distribution, that the offspring distribution has finite variance). When, $\mathbb{E}X=1$, the Central Limit Theorem says that with high probability,

$X_1+\ldots+X_n\in [n-n^{3/4},n+n^{3/4}],$

so, skating over the details of whether everything is exactly uniform within this CLT scaling window,

$\mathbb{P}(T=n)\geq \frac{1}{n}\cdot\frac{1}{2n^{3/4}}.$

The true exponent of the power law decay is substantially slower than this, but the above argument works as a back-of-the-envelope bound.

In particular, if the dual process has mean 1, then the population size of the original branching process is given by taking a distribution with exponential tail with some probability and a distribution with power-law tail with some probability. Obviously the power-law will dominate, which contradicts the assumption that the original branching process was supercritical, and so has an exponential tail.

# Exploring the Supercritical Random Graph

I’ve spent a bit of time this week reading and doing all the exercises from some excellent notes by van der Hofstad about random graphs. I think they are absolutely excellent and would not be surprised if they become the standard text for an introduction to probabilistic combinatorics. You can find them hosted on the author’s website. I’ve been reading chapters 4 and 5, which approaches the properties of phase transitions in G(n,p) by formalising the analogy between component sizes and population sizes in a binomial branching process. When I met this sort of material for the first time during Part III, the proofs generally relied on careful first and second moment bounds, which is fine in many ways, but I enjoyed vdH’s (perhaps more modern?) approach, as it seems to give a more accurate picture of what is actually going on. In this post, I am going to talk about using the branching process picture to explain why the giant component emerges when it does, and how to get a grip on how large it is at any time after it has emerged.

Background

A quick tour through the background, and in particular the notation will be required. At some point I will write a post about this topic in a more digestible format, but for now I want to move on as quickly as possible.

We are looking at the sparse random graph $G(n,\frac{\lambda}{n})$, in the super-critical phase $\lambda>1$. With high probability (that is, with probability tending to 1 as n grows), we have a so-called giant component, with O(n) vertices.

Because all the edges in the configuration are independent, we can view the component containing a fixed vertex as a branching process. Given vertex v(1), the number of neighbours is distributed like $\text{Bi}(n-1,\frac{\lambda}{n})$. The number of neighbours of each of these which we haven’t already considered is then $\text{Bi}(n-k,\frac{\lambda}{n})$, conditional on k, the number of vertices we have already discounted. After any finite number of steps, k=o(n), and so it is fairly reasonable to approximate this just by $\text{Bi}(n,\frac{\lambda}{n})$. Furthermore, as n grows, this distribution converges to $\text{Po}(\lambda)$, and so it is natural to expect that the probability that the fixed vertex lies in a giant component is equal to the survival probability $\zeta_\lambda$ (that is, the probability that it is infinite) of a branching process with $\text{Po}(\lambda)$ offspring distribution. Note that given a graph, the probability of a fixed vertex lying in a giant component is equal to the fraction of the vertex in the giant component. At this point it is clear why the emergence of the giant component must happen at $\lambda=1$, because we require $\mathbb{E}\text{Po}(\lambda)>1$ for the survival probability to be non-zero. Obviously, all of this needs to be made precise and rigorous, and this is treated in sections 4.3 and 4.4 of the notes.

Exploration Process

A common functional of a rooted branching process to consider is the following. This is called in various places an exploration process, a depth-first process or a Lukasiewicz path. We take a depth-first labelling of the tree v(0), v(1), v(2),… , and define c(k) to be the number of children of vertex v(k). We then define the exploration process by:

$S(0)=0,\quad S(k+1)=S(k)+c(k)-1.$

By far the best way to think of this is to imagine we are making the depth-first walk on the tree. S(k) records how many vertices we have seen (because they are connected by an edge to a vertex we have visited) but have not yet visited. To clarify understanding of the definition, note that when you arrive at a vertex with no children, this should decrease by one, as you can see no new vertices, but have visited an extra one.

This exploration process is useful to consider for a couple of reasons. Firstly, you can reconstruct the branching process directly from it. Secondly, while other functionals (eg the height, or contour process) look like random walks, the exploration process genuinely is a random walk. The distribution of the number of children of the next vertex we arrive at is independent of everything we have previously seen in the tree, and is the same for every vertex. If we were looking at branching processes in a different context, we might observe that this gives some information in a suitably-rescaled limit, as rescaled random walks converge to Brownian motion if the variance of the (offspring) distribution is finite. (This is Donsker’s result, which I should write something about soon…)

The most important property is that the exploration process returns to 0 precisely when we have exhausted all the vertices in a component. At that point, we have seen exactly the vertices which we have explored. There is no reason not to extend the definition to forests, that is a union of trees. The depth-first exploration is the same – but when we have exhausted one component, we move onto another component, chosen according to some labelling property. Then, running minima of the exploration process (ie times when it is smaller than it has been before) correspond to jumping between components, and thus excursions above the minimum to components themselves. The running minimum will be non-positive, with absolute value equal to the number of components already exhausted.

Although the exploration process was defined with reference to and in the language of trees, the result of a branching process, this is not necessary. With some vertex denoted as the root, we can construct a depth-first labelling of a general graph, and the exploration process follows exactly as before. Note that we end up ignoring all edges except a set that forms a forest. This is what we will apply to G(n,p).

Exploring G(n,p)

When we jump between components in the exploration process on a supercritical (that is $\lambda>1$) random graph, we move to a component chosen randomly with size-biased distribution. If there is a giant component, as we know there is in the supercritical case, then this will dominate the size-biased distribution. Precisely, if the giant component takes up a fraction H of the vertices, then the number of components to be explored before we get to the giant component is geometrically distributed with parameter H. All other components have size O(log n), so the expected number of vertices explored before we get to the giant component is O(log n)/H = o(n), and so in the limit, we explore the giant component immediately.

The exploration process therefore gives good control on the giant component in the limit, as roughly speaking the first time it returns to 0 is the size of the giant component. Fortunately, we can also control the distribution of S_t, the exploration process at time t. We have that:

$S_t+(t-1)\sim \text{Bi}(n-1,1-(1-p)^t).$

This is not too hard to see. $S_t+(t-1)$ is number of vertices we have explored or seen, ie are connected to a vertex we have explored. Suppose the remaining vertices are called unseen, and we began the exploration at vertex 1. Then any vertex with label in {2,…,n} is unseen if it successively avoids being in the neighbourhood of v(1), v(2), … v(t). This happens with probability $(1-p)^t$, and so the probability of being an explored or seen vertex is the complement of this.

In the supercritical case, we are taking $p=\frac{\lambda}{n}$ with $\lambda>1$, and we also want to speed up S, so that all the exploration processes are defined on [0,1], and rescale the sizes by n, so that it records the fraction of the graph rather than the number of vertices. So we set consider the rescaling $\frac{1}{n}S_{nt}$.

It is straightforward to use the distribution of S_t we deduce that the asymptotic mean $\mathbb{E}\frac{1}{n}S_{nt}=\mu_t = 1-t-e^{-\lambda t}$.

Now we are in a position to provide more concrete motivation for the claim that the proportion of vertices in the giant component is $\zeta_\lambda$, the survival probability of a branching process with $\text{Po}(\lambda)$ offspring distribution. It helps to consider instead the extinction probability $1-\zeta_\lambda$. We have:

$1-\zeta_\lambda=\sum_{k\geq 0}\mathbb{P}(\text{Po}(\lambda)=k)(1-\zeta_\lambda)^k=e^{-\lambda\zeta_\lambda},$

where the second equality is a consequence of the simple form for the moment generating function of the Poisson distribution.

As a result, we have that $\mu_{\zeta_\lambda}=0$. In fact we also have a central limit theorem for S_t, which enables us to deduce that $\frac{1}{n}S_{n\zeta_\lambda}=0$ with high probability, as well as in expectation, which is precisely what is required to prove that the giant component of $G(n,\frac{\lambda}{n})$ has size $n(\zeta_\lambda+o(1))$.