Lecture 7 – The giant component

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we edge into the second half of the course, we are now in a position to return to the question of the phase transition between the subcritical regime \lambda<1 and the supercritical regime \lambda>1 concerning the size of the largest component L_1(G(n,\lambda/n)).

In Lecture 3, we used the exploration process to give upper bounds on the size of this largest component in the subcritical regime. In particular, we showed that

\frac{1}{n}\big| L_1(G(n,\lambda/n)) \big| \stackrel{\mathbb{P}}\rightarrow 0.

If we used slightly stronger random walk concentration estimates (Chernoff bounds rather than 2nd-moment bounds from Chebyshev’s inequality), we could in fact have shown that with high probability the size of this largest component was at most some logarithmic function of n.

In this lecture, we turn to the supercritical regime. In the previous lecture, we defined various forms of weak local limit, and asserted (without attempting the notationally-involved combinatorial calculation) that the random graph G(n,\lambda/n) converges locally weakly in probability to the Galton-Watson tree with \text{Poisson}(\lambda) offspring distribution, as we’ve used informally earlier in the course.

Of course, when \lambda>1, this branching process has strictly positive survival probability \zeta_\lambda>0. At a heuristic level, we imagine that all vertices whose local neighbourhood is ‘infinite’ are in fact part of the same giant component, which should occupy (\zeta_\lambda+o_{\mathbb{P}}(1))n vertices. In its most basic form, the result is

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big|\;\stackrel{\mathbb{P}}\longrightarrow\; \zeta_\lambda,\quad \frac{1}{n}\big|L_2(G(n,\lambda/n))\big| \;\stackrel{\mathbb{P}}\longrightarrow\; 0, (*)

where the second part is a uniqueness result for the giant component.

The usual heuristic for proving this result is that all ‘large’ components must in fact be joined. For example, if there are two giant components, with sizes \approx \alpha n,\approx \beta n, then each time we add a new edge (such an argument is often called ‘sprinkling‘), the probability that these two components are joined is \approx 2ab, and so if we add lots of edges (which happens as we move from edge probability \lambda-\epsilon\mapsto \lambda ) then with high probability these two components get joined.

It is hard to make this argument rigorous, and the normal approach is to show that with high probability there are no components with sizes within a certain intermediate range (say between \Theta(\log n) and n^\alpha) and then show that all larger components are the same by a joint exploration process or a technical sprinkling argument. Cf the books of Bollobas and of Janson, Luczak, Rucinski. See also this blog post (and the next page) for a readable online version of this argument.

I can’t find any version of the following argument, which takes the weak local convergence as an assumption, in the literature, but seems appropriate to this course. It is worth noting that, as we shall see, the method is not hugely robust to adjustments in case one is, for example, seeking stronger estimates on the giant component (eg a CLT).

Anyway, we proceed in three steps:

Step 1: First we show, using the local limit, that for any \epsilon>0,

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \le \zeta_\lambda+\epsilon, with high probability as n\rightarrow\infty.

Step 2: Using a lower bound on the exploration process, for \epsilon>0 small enough

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \ge \epsilon, with high probability.

Step 3: Motivated by duality, we count isolated vertices to show

\mathbb{P}(\epsilon n\le |L_1| \le (\zeta_\lambda-\epsilon)n) \rightarrow 0.

We will return to uniqueness at the end.

Step 1

This step is unsurprising. The local limit gives control on how many vertices are in small components of various sizes, and so gives control on how many vertices are in small components of all finite sizes (taking limits in the right order). This gives a bound on how many vertices can be in the giant component.

(Note: parts of this argument appear in the text and exercises of Section 1.4 in the draft of Volume II of van der Hofstad’s notes, which can be found here.)

We can proceed in greater generality, by considering a sequence of random graphs G_n which converge locally weakly in probability to T, a random tree, with survival probability \zeta=\mathbb{P}(|T|=\infty)>0. We will show that:

Proposition: \mathbb{P}(L_1(G_n)\ge (\zeta+\epsilon)n) \rightarrow 0, for each \epsilon>0.

As a preliminary, note that for every k\in\mathbb{N}, there are finitely many rooted graphs (H,\rho_H) with size k. We can also identify whether a graph has size k by looking at a ball of radius r>k around any vertex. In particular, by summing over all graphs with size k, the weak local limit implies:

\frac{1}{n}\sum_{v\in[n]} \mathbf{1}_{\{|C^{G_n}(\rho_n)|=k\}} = \frac{1}{n} \sum_{|V(H)|=k} \sum_{v\in[n]} \mathbf{1}_{\{B_r^{G_n}(\rho_n)\simeq (H,\rho_H)\}}

\stackrel{\mathbb{P}}\longrightarrow \;\sum_{|V(H)|=k} \mathbb{P}(B_r^T(\rho)\simeq (H,\rho_H)) = \mathbb{P}(|T|=k).

Furthermore, we can then control the tail as

\frac{1}{n}\sum_{v\in[n]} \mathbf{1}_{\{|C^{G_n}(v)|\ge k\}}\;\stackrel{\mathbb{P}}\longrightarrow \mathbb{P}(|T|\ge k).

(Recall that the LHS of this statement is the proportion of vertices in components of size at least k.)

We will make the trivial but useful observation that in any graph the largest component has size at least k precisely if at least k vertices are in components of size at least k (!). Ie

|L_1(G)|\ge k\quad\iff\quad \sum_{v\in[n]} \mathbf{1}{\{C^G(v)|\ge k\}} \ge k.

Returning now to the problem at hand, we have \mathbb{P}(|T|\ge k)\downarrow\zeta as k\rightarrow\infty, so we may pick k such that \mathbb{P}(|T|\ge k)<\zeta+\epsilon.

But then, using our ‘trivial but useful’ observation:

\mathbb{P}(L_1(G_n)\ge (\zeta+\epsilon)n) = \mathbb{P}(\sum_{v\in[n]} \mathbf{1}_{\{|C^{G_n}(v)|\ge (\zeta+\epsilon)n \}} \ge (\zeta+\epsilon)n)

\le \mathbb{P}(\frac{1}{n}\sum_{v\in[n]} \mathbf{1}_{\{|C^{G_n}(v)|\ge k\}} \ge \zeta+\epsilon). (**)

Note that we have replaced (\zeta+\epsilon)n by k in this final step for a bound. However, the random quantity inside the probability is known to converge in probability to \mathbb{P}(|T|\ge k)<\zeta+\epsilon. So in fact this probability (**) vanishes as n\rightarrow\infty.

Step 2

Remember the exploration process, where v=v_1,v_2,\ldots,v_n is a labelling of the vertices of G(n,\lambda/n) in breadth-first order. Defining

X_i:= \#\{w\in[n]\,:\, w\in\Gamma(v_i),\,w\not\in \Gamma(v_j),\,j\in[i-1]\},

the number of children of vertex v_i, we set

S_0:=0,\quad S_i:=S_{i-1}+(X_i-1),\; i\ge 1,

to be (a version of) the exploration process. It will be useful to study

H_0:=0,\quad H_k:=\min\{i\,:\, S_i=-k\},

the hitting times of (-k), as then \{v_{H_{k-1}+1},\ldots,v_{H_k}\} is the kth component to be explored.

Unlike for a tree, we have multiple components, and essentially the process decreases by one each time we start a new component, which means that the current value no longer describes the number of vertices on the stack. In general, this is given by S_i - \min_{0\le j\le i}S_j, and so

X_i\stackrel{d}= \text{Bin}(n-i-(S_{i-1}-\min_{0\le j\le i-1}S_j),\, \frac{\lambda}{n}),

which we may stochastically bound below by

\ge_{st} \text{Bin}(n-2i-S_{i-1},\,\frac{\lambda}{n}),

noting that this is extremely crude.

We want to study whether S_i ever exceeds \epsilon n, for some \epsilon>0 to be determined later.

For reasons that will become clear in the following deduction, it’s convenient to fix \alpha>0 small such that \lambda(1-2\alpha)>1, and then choose \epsilon>0 such that


(which is possible by continuity since the given relation holds when \epsilon=0.)

Now, when i\le \alpha n and S_{i-1}\le \epsilon n, we have

X_i\ge_{st} \text{Bin}(n(1-2\alpha-\epsilon),\frac{\lambda}{n}).

The following argument requires some kind of submartingale approach (involving coupling with a simpler process at the stopping time) to make rigorous, which is beyond the scope of this course’s prerequisites.

However, informally, if we assume that \max_{i\le \alpha n} S_i\le \epsilon n, ‘then’

\frac{1}{n}S_{\alpha n}\ge_{st} \frac{1}{n}\text{Bin}(\alpha n\cdot n(1-2\alpha-\epsilon),\,\frac{\lambda}{n}) - \alpha.

But this distribution is concentrated on a value which is, by our obscure assumption, >\epsilon (!) contradicting the assumption on the maximum. Thus we conclude that

\mathbb{P}(\max_{i\le \alpha n} S_i\le \epsilon n)\rightarrow 0, as n\rightarrow\infty.

We conclude that \max_{i\le \alpha n} S_i\ge\epsilon n holds with high probability. But remember that S_{i+1}\ge S_i -1 so if S_i\ge \epsilon n, then all of S_i,S_{i+1},\ldots,S_{i+\lfloor \epsilon n\rfloor} are non-negative, and so certainly v_i,v_{i+1},\ldots,v_{i+\lfloor \epsilon n\rfloor} are in the same component of the graph, and L_1(G(n,\lambda/n))\ge \epsilon n with high probability.

Step 3

The motivation for this section is duality. Recall (from Lecture 5) that if we condition a supercritical Poisson GW tree on extinction, we obtain the distribution of a dual subcritical Poisson GW tree. This relation moves across to the world of the sparse Erdos-Renyi random graph. If you exclude the giant component, you are left with a subcritical random graph (on a smaller vertex set), and this applies equally well to the local limits. Essentially, if we exclude a component, and take the local limit of what remains, we get the wrong answer unless the component we excluded was a giant component with size \approx \zeta_\lambda n, or was small.

As we shall see, this effect is captured sufficiently by counting isolated vertices.

First, we state a Fact: when 1-\zeta_\lambda <y<1, then ye^{-\lambda y}>e^{-\lambda}. This convexity property is easily checked by comparing derivatives, and will be useful shortly.

Now, we study I_n, the number of isolated vertices in G_n, under conditioning that \{1,2,\ldots,k\} is a component for various values k. Note that unless k=1, we have

\mathbb{E}[I_n\,\big|\, \{1,2,\ldots,k\}\text{ a cpt}] = (n-k)(1-\frac{\lambda}{n})^{n-k-1},

for exactly the same reason as when we did this calculation for the original graph several lectures back. We will consider k in the range \epsilon n \le k\le (\zeta_\lambda - \epsilon) n.

We can take a limit of this expectation in appropriate uniformly using the Fact above, since the function ye^{-\lambda y} is suitably well-behaved, to obtain

\liminf_{n\rightarrow\infty} \frac{1}{n}\min_{\epsilon n\le k\le (\zeta-\epsilon)n} (n-k)(1-\frac{\lambda}{n})^{n-k-1}

\ge \min_{\epsilon \le x \le \zeta-\epsilon} (1-x)e^{-\lambda(1-x)}\ge e^{-\lambda}+\epsilon',

where \epsilon'>0. So

\liminf_{n\rightarrow\infty} \min_{\epsilon n\le k\le (zeta-\epsilon)n} \frac{1}{n}\mathbb{E}\left[ I_n\,\big|\, \{1,\ldots,k\}\text{ a cpt}\right] \ge e^{-\lambda}+\epsilon'.

But \frac{I_n}{n} is bounded above (by 1, of course), and so lower bounds on the expectation give lower bounds on upper tail, leading to

\liminf_{n\rightarrow\infty}\min_{\epsilon n\le k\le (\zeta-\epsilon)n} \mathbb{P}\left( \frac{I_n}{n}\ge e^{-\lambda}+\frac{\epsilon'}{2}\,\big|\, \{1,\ldots,k\}\text{ a cpt} \right) >0.

However, we know \frac{I_n}{n}\stackrel{\mathbb{P}}\rightarrow e^{-\lambda} (for example by local convergence…). Therefore, in order to make the unconditional probability vanish, the probability of the conditioning event in question must also vanish, ie

\mathbb{P}\left(\epsilon n\le |C^{G_n}(1)|\le (\zeta_\lambda-\epsilon)n\right)\rightarrow 0.

Finally, since

\mathbb{P}\left(\epsilon n \le |C^{G_n}(1)|\le (\zeta_\lambda-\epsilon)n\right) \ge \frac{1}{\epsilon}\mathbb{P}(\epsilon n\le |L_1(G)|\le (\zeta_\lambda-\epsilon)n),

the corresponding result holds for the largest component, not just the observed component.

Uniqueness and overall comments

Uniqueness can be obtained by a slight adjustment of Step 1. Morally, Step 1 is saying that a proportion asymptotically at most \zeta_\lambda of the vertices are in large components, so it is possible (and an exercise in the course) to adjust the argument to show

\frac{1}{n}|L_1|+\frac{1}{n}|L_2| \le \zeta_\lambda+\epsilon, with high probability,

from which the uniqueness result follows immediately.

In particular, it’s worth noting that this is an example of a bootstrapping argument, where we show a weak version of our goal result (in Step 2), but then use this to show the full result.

Note also that we can use the duality principle to show logarithmic bounds on the size of the second-largest component in exactly the same way that we showed logarithmic bounds on the size of the largest component in the subcritical regime. The whole point of duality is that these are the same problem!


Lecture 6 – Local limits

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

By this point of the course, we’ve studied several aspects of the Erdos-Renyi random graph, especially in the sparse setting G(n,\frac{\lambda}{n}). We’ve also taken a lengthy detour to revise Galton-Watson trees, with a particular focus on the case of Poisson offspring distribution.

This is deliberate. Note that a given vertex v of G(n,\frac{\lambda}{n}) has some number of neighbours distributed as \mathrm{Bin}(n-1,\frac{\lambda}{n})\stackrel{d}\approx\mathrm{Po}(\lambda), and the same approximation remains valid as we explore the graph (for example in a breadth-first fashion) either until we have seen a large number of vertices, or unless some ultra-pathological event happens, such as a vertex having degree n/3.

In any case, we are motivated by the notion that the local structure of G(n,\frac{\lambda}{n}) is well-approximated by the Galton-Watson tree with \mathrm{Po}(\lambda) offspring, and in this lecture and the next we try to make this notion precise, and discuss some consequences when we can show that this form of convergence occurs.

Deterministic graphs

Throughout, we will be interested in rooted graphs, since by definition we have to choose a root vertex whose local neighbourhood is to be studied. Usually, we will study a sequence of rooted graphs (G_n,\rho_n), where the vertex set of G_n is [n], or certainly increasing in n (as in the first example).

For some rooted graph (G,\rho), we say such a sequence (G_n,\rho_n) converges to (G,\rho) locally if for all radii r\ge 1, we have B_r^{G_n}(\rho_n)\simeq B_r^G(\rho). In words, the neighbourhood around \rho_n in G_n is the same up to radius r as the neighbourhood around \rho in G, so long as n is large enough (for given r).

This is best illustrated by an example, such as T_n, the binary tree to depth n.

If we take \rho_n to be the usual root, then the trees are nested, and converge locally to the infinite binary tree T_\infty. Slightly less obviously, if we take \rho_n to be one of the leaves, then the trees are still nested (up to labelling – ie in the sense of isomorphisms of rooted trees), and converge locally to the canopy tree, defined by a copy of \mathbb{Z}_{\ge 0} with nearest-neighbour edges, and where each vertex n\ge 1 is connected to the root of a disjoint copy of T_{n-1}, as shown below:

Things get more interesting when the root is chosen randomly, for example, uniformly at random, as this encodes more global information about the graphs G_n. In the case where the G_n are vertex-transitive, then if we only care about rooted graphs up to isomorphism, then it doesn’t matter how we choose the root.

Otherwise, we say that G_n converges in the local weak sense to (G,\rho) if, for all $r\ge 1$ and for all rooted graphs (H,\rho_H),

\mathbb{P}\left( B^{G_n}_r(\rho_n)\simeq (H,\rho_H) \right) \longrightarrow \mathbb{P}\left( B_r^G(\rho)\simeq H\right),

as n\rightarrow\infty.

Alternatively, one can phrase this as a result about convergence of rooted-graph-valued distributions.

A simple non-transitive example is G_n\simeq P_n, the path of length n. Then, the r-neighbourhood of a vertex is isomorphic to P_{2r}unless that vertex is within graph-distance (r-1) of one of the leaves of G_n. As n\rightarrow\infty, the proportion of such vertices vanishes, and so, \mathbb{P}\left( B^{P_n}_r(\rho_n)\simeq P_{2r}\right)\rightarrow 1, from which we conclude the unsurprising result that P_{n} converges in the local weak sense to \mathbb{Z}. (Which is vertex-transitive, so it doesn’t matter where we select the root.)

The binary trees offer a slightly richer perspective. Let \mathcal{L}_n be the set of leaves of T_n, and we claim that when \rho_n is chosen uniformly from the vertices of T_n, then d_{T_n}(\rho_n,\mathcal{L}_n) converges in distribution. Indeed, \mathbb{P}\left( d_{T_n}(\rho_n,\mathcal{L}_n)=k\right) = \frac{2^{n-k}}{2^{n+1}-1}, whenever n\ge k, and so the given distance converges in distribution to the Geometric distribution with parameter 1/2 supported on {0,1,2,…}.

This induces a random local weak limit, namely the canopy tree, rooted at one of the vertices we denoted by \mathbb{Z}_{\ge 0}, with the choice of this vertex given by Geometric(1/2).


Such limits, and the random graph versions to follow, are sometimes referred to as Benjamini-Schramm convergence, after the authors of the paper [LINK] in which this was first heavily introduced. Various subsequent authors use local weak convergence and Benjamini-Schramm convergence to denote a general distribution on roots versus a uniform measure on roots (or some similar distinction). For the purpose of this course, I followed the notation in Volume II of van der Hofstad’s notes exactly, and did not use the latter phrase.

A note on finiteness and tightness

Note at this point that we have not said anything about the class of random rooted graphs (G,\rho) which may appear as a limit. However, it can be seen that they must be locally-finite if there’s a chance that they appear as a weak limit of a sequence of finite graphs.

In this course, we won’t think of metrizing the space of rooted graphs, but it’s worth mentioning that a tightness criterion for local weak convergence is that the degree of the root is uniformly integrable, meaning that very large degrees do not contribute significantly to the expected degree in the limit. This induces probabilistic bounds on the largest degree to be seen within the radius r neighbourhood (bearing in mind that size-biasing appears everywhere when studying degrees), and tightness follows.

Random graphs – quenched versus annealed settings

In the previous example, all the randomness came from the choice of root. We want to return to a setting where the graph itself is random as well, though will restrict attention to the case where the set of vertices of G_n is always [n]. In this course, we also restrict attention to the two cases where either the root is chosen uniformly from the vertex set (as relevant to G(n,p)) or when the graph is a branching process tree with the usual (deterministic root).

We say that (G_n,\rho_n) converges in distribution locally weakly to (G,\rho) if, as before,

\mathbb{P}\left( B^{G_n}_r(\rho_n)\simeq (H,\rho_H) \right) \longrightarrow \mathbb{P}\left( B_r^G(\rho)\simeq H\right),

as n\rightarrow\infty. Here, we are taking the probability over both sources of randomness on the LHS. This is more transparent if we rewrite as

\mathbb{P}\left(B^{G_n}_r(\rho_n) \simeq H\right) = \mathbb{E}_{G_n}\left[ \frac{1}{n}\sum_{v\in[n]} \mathbf{1}\left\{B^{G_n}_r(v) \simeq H\right\}\right]\longrightarrow \mathbb{P}\left(B_r^G(\rho)\simeq H\right).

In particular, if the graph G_n is vertex-transitive then \mathbb{P}(B_r^{G_n}(\rho_n)\simeq H) = \mathbb{P}(B_r^{G_n}(1)\simeq H).

We also say that (G_n,\rho_n) converges in probability locally weakly to (G,\rho) if

\mathbb{P}\left( B^{G_n}_r(\rho_n)\simeq H \,\big|\, G_n\right)\stackrel{\mathbb{P}}\longrightarrow \mathbb{P}\left(B_r^G(\rho)\simeq H\right),

as n\rightarrow\infty. Note now that the LHS is a random variable, depending on G_n (but not on \rho_n). Again, the indicator function form may be more transparent:

\frac{1}{n}\sum_{v\in[n]} \mathbf{1}\{B^{G_n}_r(v)\simeq H\} \stackrel{\mathbb{P}}\longrightarrow \mathbb{P}\left( B^G_r(\rho)\simeq H\right).

The choice of names for these convergence settings might be counter-intuitive on a first reading. In both settings, we are studying the number of vertices for which (H,\rho_H) exactly describes its local neighbourhood.

  • In the first setting, we require this quantity to converge in expectation after rescaling.
  • In the second setting, we require this quantity to concentrate in distribution to a constant (which is the same as converging in probability to a constant).
  • It’s probably not helpful actually to think of these as convergence in distribution vs convergence in probability of any quantity.
  • This distinction is referred to in general as quenched (conditional on the graph) versus annealed (averaged over the graph).

It’s clear that this represents a division into first- and second-moment arguments, such as we’ve seen earlier in the course. Unsurprisingly, this induces a result about asymptotic independence, when we have local weak convergence in probability. In the lecture we phrased this as

\mathbb{P}\left(B^{G_n}_r(\rho_n^{(1)})\simeq H^{(1)},\, B^{G_n}_s(\rho_n^{(2)}) \simeq H^{(2)} \right)

\longrightarrow \mathbb{P}\left(B^G_r(\rho)\simeq H^{(1)}\right) \mathbb{P}\left( B^G_s(\rho)\simeq H^{(2)}\right).

We prove this by conditioning on G_n, and then the conditional probability of the event on the LHS becomes

\left(\frac{1}{n} \sum_{v\in[n]} \mathbf{1}\{B_r^{G_n}(v)\simeq H^{(1)}\} \right) \left(\frac{1}{n}\sum_{w\in[n]} \mathbf{1}\{B_s^{G_n}(w)\simeq H^{(2)}\right).

Both of these bracketed terms are random, but they converge in probability to constants by the convergence assumption. And it is true and straightforward to show that if X_n\stackrel{\mathbb{P}}\rightarrow x and Y_n\stackrel{\mathbb{P}}\rightarrow y, then X_nY_n\stackrel{\mathbb{P}}\rightarrow xy, which shows that the given result in fact holds in the quenched setting, so certainly holds in the annealed setting (ie by removing the conditioning on G_n) too. Note throughout, that the fact the quantities of interest are bounded (by 1, since they are probabilities…) makes it easy to move from convergence in distribution to convergence in expectation.


  • In the quenched setting, if we start breaking the setting that the pre-limit graphs are growing, we get some pathologies, such as a random graph not converging in probability locally weakly to itself. This phenomenon is a feature of the quenched setting, rather than the space of rooted graphs.
  • Proving that G(n,\frac{\lambda}{n}) converges locally weakly in probability to the Galton-Watson tree with offspring distribution \mathrm{Po}(\lambda) is not going to be part of the course, since the details are a more technical version of previous second-moment calculations. What is most relevant is that a lot useful results can be read off as a consequence of this property. At a meta level, it is preferable to do one combinatorially technical second-moment argument and derive lots of results from this, than to do slightly technical arguments from scratch every time.
  • In the next lecture, we’ll return to the question of component sizes in G(n,\frac{\lambda}{n}), starting to consider the supercritical case \lambda>1. We will see that the weak local limits provide upper bounds on the size of the giant component (in generality), but will need more involved arguments to provide matching lower bounds.

Lecture 4 – Hitting time theorem

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

This lecture consisted of revision of the most relevant theory of Galton-Watson trees, with a focus on the case where the offspring distribution is Poisson, since, as we have seen in previous lectures, this is a strong candidate to approximate the structure of G(n,\lambda/n). It makes sense to cover the theory of the trees before attempting to make rigorous the sense of approximation.

Given a Galton-Watson tree T, it is natural to label the vertices in a breadth-first order as \varnothing=v_1,v_2,\ldots,v_{|T|}. This is easiest if we have constructed the Galton-Watson tree as a subset of the infinite Ulam-Harris tree, where vertices have labels like (3,5,17,4), whose parent is (3,5,17). If this child vertex is part of the tree, then so are (3,5,17,1), (3,5,17,2), and (3,5,17,3). This means our breadth-first order is canonically well-defined, as we have a natural ordering of the children of each parent vertex.

Note: one advantage of using breadth-first order rather than depth-first order (which corresponds to the usual dictionary, or lexicographic ordering of the labels) is that if the tree is infinite, we don’t explore all of it during a depth-first search. (In the sense that there exist vertices which are never given a finite label.) For breadth-first search, a similar problem arises precisely when some vertex has infinitely many children. For a conventional Galton-Watson tree, the latter situation is much less of a problem than the infinite total population problem, which happens with positive probability whenever \mu=\mathbb{E}[X]>1.

Anyway, given the depth-first order, one can consider an exploration process S_0,S_1,S_2,\ldots,S_{|T|} given by

S_0=1,\quad S_i=S_{i-1}+(X_i-1),

where X_i is the number of children of v_i. In this way, we see that

S_i=\big| \Gamma(v_1)\cup\ldots\cup\Gamma(v_i)\backslash \{v_1,\ldots,v_i\}\big|,\quad i\ge 1,

records the number of vertices in some stack containing those which we have ‘seen but not explored’. Some authors prefer to start from 0, in which case one ends up with a similar but slightly different interpretation of the ‘stack’, but that’s fine since we aren’t going to define formally what ‘seen’ and ‘explored’ means in this post.

Essentially, we exhaust the vertices of the tree whenever S_t=0, and so the condition that |T|=n requires

S_n=0,\quad S_m\ge 1,\; m=0,1,\ldots,n-1.

Conveniently, so long as we have avoiding ordering ambiguity, for example by insisting that trees live within the Ulam-Harris tree, we can reconstruct T uniquely from (S_0,S_1,\ldots,S_{|T|}).

Furthermore, if T is a Galton-Watson process, then the numbers of children X_i are IID, and so in fact this exploration process is a random walk, and the size of the tree can be recovered as the hitting time of zero.

Note: making fully rigorous the argument that children in the GW tree are independent of the breadth-first walk fully rigorous is somewhat technical, and not to be dismissed lightly, though not of principle interest at the level of this topics course. See Proposition 1.5 in Section 1.2 of Le Gall’s notes or Section 1.2.2 of my doctoral thesis for further discussion and argument.

The hitting time theorem allows us to study the distribution of the hitting time of a random walk whose increments are bounded below by -1, in terms of the distribution of the value of the random walk.

Theorem: Let (S_n,\, n\ge 0) be a random walk with S_0=0 and IID increments (X_n,n\ge 1) satisfying \mathbb{P}(X_n\ge -1)=1. Let H_{-k}=\inf \left\{n\,:\, S_n=-k\right\} be the hitting time of -k.

Then \mathbb{P}\big( H_{-k}=n\big) = \frac{k}{n}\mathbb{P}\big(S_n=-k).

Commentary: there are local central limit theorem estimates and large deviation estimates that allow good control of the probability on the RHS for a rich class of contexts. So at a meta-level, the hitting time theorem allows us to reduce a complicated (though still classical) problem, to a real classical problem, which is particularly helpful when the LHS is a device for capturing relevant information about our random tree model.

Proof context/history: The case k=1, as relevant to the trees, is often referred to as Dwass’s theorem, though the combinatorial argument pre-dates this, sometimes in the form of the similar ballot theoremA complete reference list would be challenging, but it certainly appears in Feller, due to Spitzer. In his recent book, van der Hofstad uses an argument by induction on n, noting that the statement is obvious when n=0,1,…,k-1, and obvious for different reasons when n=k. This argument can also be found published here.

I’m going to give a combinatorial argument, similar to one I’ve given in a post from years ago (back then, for the case k=1), because there wasn’t time to include this in the lecture.

Proof: from the point of view of the course, the most important probabilistic take home message from this argument is that we get significant control by introducing the randomness in two stages, and conditioning on the first round of randomness.

Here, what we do is to condition on the increments (X_1,X_2,\ldots,X_n)_{\mathrm{cyc}} in cyclic order, ie without distinguishing between (x_1,x_2,\ldots,x_n) and (x_4,x_5,\ldots,x_n,x_1,x_2,x_3). We’ll declare which of these is the true ordering at the end. Anyway, the punchline is the following lemma.

Lemma: when x_1+\ldots +x_n=-k, and all x_i\ge -1, then exactly k of the n orderings corresponding to the cyclic ordering (x_1,\ldots,x_n)_{\mathrm{cyc}} have the property that x_1+\ldots+x_\ell \ge -k+1.

With this lemma, it’s clear why the hitting time theorem holds. We condition on (x_1,\ldots,x_n)_{\mathrm{cyc}}. If the sum of these elements is not -k, then there’s no chance that the hitting time is n. If the sum is -k, then when we choose a true ordering at random, we get the hitting time as n with conditional probability k/n.

Proof of lemma: At all times, take indices modulo n. We want labels \alpha satisfying:

x_\alpha,\; x_\alpha+x_{\alpha+1},\; \ldots\;, ;\ x_{\alpha}+\ldots+x_{\alpha+n-2}\text{ are all }\ge -(k-1). (*)

Step 1: Suppose there are \alpha_1<\alpha_2<\ldots<\alpha_{k+1} satisfying (*). It’s helpful to think about going round the circle of values. If we go all the way round the circle, we get -k, but if we go from x_{\alpha_{i+1}}\to x_{\alpha_i-1} then we have an initial segment from x_{\alpha_{i+1}}, and so

x_{\alpha_i}+\ldots+x_{\alpha_{i+1}-1} = -k - (x_{\alpha_{i+1}}+\ldots+x_{\alpha_i-1}) \le -k - \left(-(k-1)\right) = -1.

However, if we repeat this for the k+1 such blocks, we end up going all the way round the circle, and get

\left(x_{\alpha_1}+\ldots+x_{\alpha_2-1}\right) + \left(x_{\alpha_2}+\ldots+x_{\alpha_3-1}\right) + \ldots

\ldots+ \left(x_{\alpha_{k+1}} + \ldots+x_{\alpha_1-1}\right) \le -(k+1),

which is a contradiction since the total sum is k. So we have at most k such values of \alpha.

Step 2: Now we show that there is at least one such \alpha. Let

\beta=\mathrm{argmin}_{1\le m\le n} (x_1+\ldots+x_m),

and if there are multiple such m, let \beta be the minimal such. Now set \alpha=\beta+1, and we will show this satisfies the requirements.

  • Then x_\alpha+x_{\alpha+1}+\ldots+x_{\alpha+m}\ge 0, whenever \alpha+m\le n.
  • And when m<\beta, we have

x_{\alpha}+\ldots+x_n+x_1+\ldots+x_m = -k+(x_1+\ldots+x_m)-(x_1+\ldots+x_\beta).

And by assumption, the first bracket is strictly greater than the second bracket, so

x_\alpha + \ldots+x_n+x_1+\ldots+x_m \ge -(k-1).

Step 3: Now we have an \alpha which satisfies, we may assume WLOG that it is \alpha_0=1. Now we construct the hitting times:

\alpha_1=\min\big(m\,:\, x_1+\ldots+x_m=-1\big),

\alpha_{k-1}=\min\big(m\,:\, x_1+\ldots+x_m=-(k-1)\big).

We claim that all these also satisfy the requirements for \alpha. This is fairly clear from a diagram, or from the fact that when \alpha_i\le m<\alpha_{i+1}, then

x_{\alpha_k}+x_{\alpha_k+1}+\ldots+x_m\ge x_{\alpha_k}+\ldots+x_{\alpha_i}= i-k,

(or i-k+n if i<k).

Lecture 3 – Couplings, comparing distributions

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

In this third lecture, we made our first foray into the scaling regime for G(n,p) which will be the main focus of the course, namely the sparse regime when p=\frac{\lambda}{n}. The goal for today was to give a self-contained proof of the result that in the subcritical setting \lambda<1, there is no giant component, that is, a component supported on a positive proportion of the vertices, with high probability as n\rightarrow\infty.

More formally, we showed that the proportion of vertices contained within the largest component of G(n,\frac{\lambda}{n}) vanishes in probability:

\frac{1}{n} \left| L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right| \stackrel{\mathbb{P}}\longrightarrow 0.

The argument for this result involves an exploration process of a component of the graph. This notion will be developed more formally in future lectures, aiming for good approximation rather than bounding arguments.

But for now, the key observation is that when we ‘explore’ the component of a uniformly chosen vertex v\in[n] outwards from v, at all times the number of ‘children’ of v which haven’t already been considered is ‘at most’ \mathrm{Bin}(n-1,\frac{\lambda}{n}). Since, for example, if we already know that eleven vertices, including the current one w are in C(v), then the distribution of the number of new vertices to be added to consideration because they are directly connected to w has conditional distribution \mathrm{Bin}(n-11,\frac{\lambda}{n}).

Firstly, we want to formalise the notion that this is ‘less than’ \mathrm{Bin}(n,\frac{\lambda}{n}), and also that, so long as we don’t replace 11 by a linear function of n, that \mathrm{Bin}(n-11,\frac{\lambda}{n})\stackrel{d}\approx \mathrm{Poisson}(\lambda).

Couplings to compare distributions

coupling of two random variables (or distributions) X and Y is a realisation (\hat X,\hat Y) on the same probability space with correct marginals, that is

\hat X\stackrel{d}=X,\quad \hat Y\stackrel{d}=Y.

We saw earlier in the course that we could couple G(n,p) and G(n,q) by simulating both from the same family of uniform random variables, and it’s helpful to think of this in general: ‘constructing the distributions from the same source of randomness’.

Couplings are a useful notion to digest at this point, as they embody a general trend in discrete probability theory. Wherever possible, we try to do as we can with the random objects, before starting any calculations. Think about the connectivity property of G(n,p) as discussed in the previous lecture. This can be expressed directly as a function of p in terms of a large sum, but showing it is an increasing function of p is essentially impossible by computation, whereas this is very straightforward using the coupling.

We will now review how to use couplings to compare distributions.

For a real-valued random variable X, with distribution function F_X, we always have the option to couple with a uniform U(0,1) random variable. That is, when U\sim U[0,1], we have (F_X^{-1}(U)\stackrel{d}= X, where the inverse of the distribution function is defined (in the non-obvious case of atoms) as

F_X^{-1}(u)=\inf\left\{ x\in\mathbb{R}\,:\, F(x)\ge u\right\}.

Note that when the value taken by U increases, so does the value taken by F_X^{-1}(U). This coupling can be used simultaneously on two random variables X and Y, as (F_X^{-1}(U),F_Y^{-1}(U)), to generate a coupling of X and Y.

The total variation distance between two probability measures is

d_{\mathrm{TV}}(\mu,\nu):= \sup_{A}|\mu(A)-\nu(A)|,

with supremum taken over all events in the joint support S of \mu,\nu. This is particularly clear in the case of discrete measures, as then

d_{\mathrm{TV}}(\mu,\nu)=\frac12 \sum_{x\in S} \left| \mu\left(\{x\}\right) - \nu\left(\{x\}\right) \right|.

(Think of the difference in heights between the bars, when you plot \mu,\nu simultaneously as a bar graph…)

The total variation distances records how well we can couple two distributions, if we want them to be equal as often as possible. It is therefore a bad measure of distributions with different support. For example, the distributions \delta_0 and \delta_{1/n} are distance 1 apart (the maximum) for all values of n. Similarly, the uniform distribution on [0,1] and the uniform distribution on \{0,1/n,2/n,\ldots, n-1/n, 1\} are also distance 1 apart.

When there is more overlap, the following result is useful.

Proposition: Any coupling (\hat X,\hat Y) of X\sim \mu,\,Y\sim \nu satisfies \mathbb{P}(X=Y)\le 1-d_{\mathrm{TV}}(\mu,\nu), and there exists a coupling such that equality is achieved.

Proof: The first part of the resultis shown by

\mathbb{P}\left(\hat X=\hat Y = x\right)\le \mathbb{P}\left(\hat X=x\right) = \mu\left(\{x\}\right),

and so we may also bound by \nu\left(\{x\}\right). Thus

\mathbb{P}\left(\hat X=\hat Y\right) = \sum_x \min\left( \mu\left(\{x\}\right),\, \nu\left(\{x\}\right) \right),

leading to

\mathbb{P}\left(\hat X \ne\hat Y\right) = 1 - \sum_x \min\left(\mu\left(\{x\}\right),\, \nu\left(\{x\}\right) \right)

= \sum_x \left[ \frac12 \mu(\{x\}) + \frac12 \nu(\{x\}) - \min\left( \mu(\{x\}), \,\nu(\{x\})\right) \right] = \frac12 \sum_x \left| \mu(\{x\}) - \nu(\{x\}) \right|.

The second part of the theorem is more of an exercise in notation, and less suitable for this blog article.

Crucially, observe that the canonical coupling discussed above is definitely not necessarily the best coupling for total variation distance. For example, the uniform distribution on \{0,1,2,\ldots, n-1\} and the uniform distribution on \{1,2,\ldots, n-1, n\} are very close in the total variation distance, but under the canonical coupling involving the distribution function, they are never actually equal!

Stochastic domination

Given two random variables X and Y, we say that Y stochastically dominates X if

\mathbb{P}(X\ge x)\le \mathbb{P}(Y\ge x),\quad \forall x\in\mathbb{R},

and write X\le_{st}Y. If we are aiming for results in probability, then it is easy to read results in probability for X off from results in probability for Y, at least in one direction. For this relation, it turns out that the canonical coupling is particularly useful, unsurprisingly so since it is defined in terms of the distribution function, which precisely describes the tail probabilities. In fact the reverse implication is true too, as stated in the following theorem.

Theorem (Strassen): X\le{st} Y if and only if there exists a coupling (\hat X,\hat Y) such that \mathbb{P}(\hat X\le \hat Y)=1.

Proof: The reverse implication is clear since

\mathbb{P}\left(\hat X \ge x\right) = \mathbb{P}\left( \hat Y\ge \hat X\ge x\right) \le \mathbb{P}\left( \hat Y \ge x\right),

and \hat X,\,\hat Y have the correct marginals.

Examples: In many situations, the coupling description makes it much easier to verify the stochastic domination relation.

  • \lambda\le \mu implies \mathrm{Poisson}(\lambda)\le_{st} \mathrm{Poisson}(\mu). Checking the tail of the Poisson mass function would be a bit of a nightmare, but since we have the superposition coupling \mathrm{Po}(\mu)\stackrel{d}= \mathrm{Po}(\lambda) + \mathrm{Po}(\mu-\lambda), for independent RVs on the RHS, we can read it off.
  • Similarly \mathrm{Bin}(n,p)\le \mathrm{Bin}(m,q) when n\le m,\, p\le q. Again, checking the tail would be computationally-annoying.
  • One of the most useful examples is the size-biased distribution, obtained from a distribution X by \mathbb{P}(X_{sb}=x) \propto x\mathbb{P}(X=x), when X is a non-negative RV. As discussed on one of the exercises, |C(v)|, which we study repeatedly, can be expressed as the size-biased version of a uniform choice from amongst the list of component sizes in the graph. We have X\le_{st} X_{sb}, which is intuitively reasonable since the size-biased distribution ‘biases in favour of larger values’.


1) The fact that we study the component C(v) containing a uniformly chosen vertex v\in[n] is another example of ‘revealing’ the total randomness in a helpful order. Note that |C(v)| requires extra randomness on top of the randomness of the graph itself. When analysing C(v) it was convenient to sample the graph section-by-section having settled on v. However, for the final comparison of |C(v)| and |L_1(G)|, the opposite order is helpful.

Since, whenever there is a component of size at least \epsilon n, the probability that we pick v within that component is at least \frac{\epsilon n}{n}. Thus we can conclude that

\mathbb{P}\left(|C(v)|\ge \epsilon n\right) \ge \frac{\epsilon n}{n}\mathbb{P}\left( \left|L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right| \ge \epsilon n\right).

Here, since we are aiming for convergence in probability, \epsilon>0 is fixed, and

\frac{1}{n}\left|L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right|\stackrel{\mathbb{P}}\longrightarrow 0,

follows from

\frac{1}{n}\left| C(v) \right| \stackrel{\mathbb{P}}\longrightarrow 0,

which we have shown using the exploration process.

2) In fact, the final step of the argument can be strengthened to show that

\lim_{K\rightarrow\infty}\limsup_{n\rightarrow\infty}\mathbb{P}\left(\left|L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right)\ge K\log n\right)=0,

though this requires familiarity with Chernoff bounds / Large Deviations estimates, which would have been a distraction to have introduced at exactly this point in the course. Showing that the size of the largest component is concentrated in probability on the scale of \log n is also possible, though for this we obviously need to track exact errors, rather than just use bounding arguments.

3) There are many more ways to compare distributions, with and without couplings, some of which have been of central importance in studying the age distributions corresponding to mean field forest fires, as introduced in a recent paper with Ed Crane and Balazs Rath, which can, as of this morning, be found on Arxiv. In the interests of space, further discussion of more exotic measures on distributions is postponed to a separate post.

Lecture 2 – Connectivity threshold

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

The goal of the second lecture was to establish the sharp phase transition for the connectivity of the random graph G(n,p(n)) around the critical regime p(n)\sim \frac{\log n}{n}. In the end, we showed that when \omega(n) is any diverging sequence, and p(n)=\frac{\log n-\omega(n)}{n}, then we have that G(n,p(n)) is with high probability not connected.

In the next lecture, we will finish the classification by studying p(n)=\frac{\log n+\omega(n)}{n}, and show that for this range of p, the graph G(n,p(n)) is with high probability connected.

The details of the lecture, especially the calculation, are not presented fully here. There, I followed van der Hofstad’s recent book fairly closely, sometimes taking different approximations and routes through the algebra, though all versions remain fairly close to the original enumerations by Renyi.

Immediate remarks

  • One is allowed to be surprised that for almost all scalings of p(n), G(n,p) is either whp connected or whp not connected. The speed of the transition is definitely interesting.
  • As defined in lectures, the property that a graph is connected is an increasing property, meaning that it is preserved when you add additional edges to the graph.
  • Because of the natural coupling between G(n,p) and G(n,q), the fact that connectedness is an increasing property makes life easier. For example, we can insist temporarily that \omega(n)\ll \log n, or whatever scaling turns out to be convenient for the proof, but conclude the result for all diverging \omega(n). This avoids the necessity for an annoying case distinction.

Heuristics – Isolated vertices

It turns out that the ‘easiest’ way for such a graph to be disconnected is for it to have an isolated vertex. In determining that the graph has a cut into classes of sizes a and b with no edges between them, there is a trade-off between the number of ways to choose the partition (which increases with min(a,b) ) and the probabilistic penalty from banning the ab edges between the classes (which decreases with min(a,b) ). It turns out that the latter effect is slightly stronger, and so (1,n-1) dominates.

Method 1: second-moment method

In the case p(n)=\frac{\log n - \omega(n)}{n}, we use a second-moment method argument to establish that G(n,p) contains an isolated vertex with high probability. Note that a given vertex v is isolated precisely if n-1 edges are not present. Furthermore, two given vertices v,w are both isolated, precisely if 2n-3 edges are not present. So in fact, both the first moment and the second moment of the number of isolated vertices are straightforward to evaluate.

It turns out that the number of isolated vertices, Y_n, satisfies

\mathbb{E}[Y_n]= \exp(\omega(n)+o(1))\rightarrow\infty. (*)

As always, we have to eliminate the possibility that this divergent expectation is achieved by the graph typically having no isolated vertices, but occasionally having very many. So we turn to the second moment, and can show

\mathrm{Var}(Y_n)= (1+o(1))\mathbb{E}[Y_n],

and so by Chebyshev’s inequality, we have \mathbb{P}(Y_n=0)\rightarrow 0.

Method 2: first-moment method

Counter-intuitively, although the case p(n)=\frac{\log n + \omega(n)}{n} requires only a first-moment method, it is more technical because it involves the non-clear direction of the informal equivalence:

\text{Connected}\; ``\iff ''\; \text{no isolated vertices}.

At the time we showed (*), we also showed that for this regime of p(n), G(n,p) whp has no isolated vertices. It remains to show that it has no splits into (unions of) connected components of sizes k and n-k.

It turns out that it is annoying to count components of size k because when we take an expectation, we can’t study all connected components on k vertices equally, since the probability of a given component structure depends on the exact number of edges it contains (which must be at least k-1, but will be larger if it includes lots of cycles). It is much easier to study the number of spanning trees of components of size k. So if the component itself is a tree, it has exactly one spanning tree, and if it has more cyclic structure then it will have several spanning trees.

The point is that we expect the expected number of such components to be very very small when k\ge 2, so it doesn’t really matter if we overcount by some factor. Somewhat abusing notation, we define X_k to be the number of such spanning trees in G(n,p), so

\mathbb{E}[X_k]=\binom{n}{k} k^{k-2}p^{k-1}(1-p)^{k(n-k)}.

Here the terms determine, respectively:

  • the number of ways to choose the vertex set of the component;
  • the number of trees on that vertex set (Cayley’s formula);
  • the probability of E(G(n,p)) including the edges of the tree in question;
  • the probability that G(n,p) includes no edges between the vertex of the tree and the rest of the graph (so it is actually a component).

Applying Stirling’s formula, and bounding helpfully, we obtain

\mathbb{E}[X_k]\le n(npe)^k \exp(-k(n-k)p).

We have already seen (modulo the abuse of notation) that \mathbb{E}[X_1]\rightarrow 0. Now we need to show that

\mathbb{E}\left[ \sum_{k=2}^{\lfloor n/2\rfloor} X_k\right]\rightarrow 0.

At this point, we should optimise our approximations based on how easy they are to sum, rather than how strong they are. One weak option is

\mathbb{E}[X_k] \le n(npe^{1-np/2})^k,

but it turns out that this is perfectly strong enough, and gives a geometric series, which is easy to sum. The only issue is that we get

\mathbb{E}\left[\sum_{k=\ell}^{\lfloor n/2\rfloor} X_k\right] < n \left(\frac{\log n}{\sqrt{n}}\right)^\ell,

so the sum is only bounded as we want when \ell\ge 3. Fortunately, almost any argument will work to show \mathbb{E}[X_2]\rightarrow 0 separately.

Once we know that \mathbb{E}[\sum X_k] \rightarrow 0, we can use the fact that for non-negative-valued RVs X,

\mathbb{P}(X\ne 0)=\mathbb{P}(X\ge 1)\le \mathbb{E}[X],

which you can view as an example of Markov’s inequality if you want, but really is just one of the fundamental properties of integer-valued RVs that allows useful interchange of probabilities and expectations.

In any case, combining this shows that with high probability there is no cut into classes of size k and n-k for any k, which is the same as being connected.


  • The similarity to the famous coupon collector problem is unsurprising, modulo the heuristic about isolated vertices. Under the right setup (continuous-time, for example), whether the coupons have been collected is independent between different coupons. That’s not the case here, as can be seen in the second-moment calculation. But under any version of the coupon collector problem, the second-moment calculation setup is essentially the same as here.
  • It becomes clear in the proof that the threshold for the existence of cuts into k and n-k with high probability is p(n)\sim \frac{\log n}{kn} for fixed k. In particular for every k>1, we are working with a range of p much larger than this threshold in the calculation above, so it’s unsurprising that we could make weak assumptions and approximations, and still show that the expected number of such cuts vanishes.
  • Once one has a coupling of G(n,p) for all p, one can ask whether the graph becomes connected at exactly the same time / value of p as the last isolated vertex receives an incident edge. I will not discuss this more here, as it’s one of the extended exercises, but this is just the start of a family of questions concerning whether closely-related properties occur simultaneously or not.

Renyi’s study of G(n,m)

Renyi’s original argument concerned the related model G(n,m), that is, a uniform choice amongst the graphs on vertex set [n] with exactly m edges. There’s no right or wrong answer to the question of which is the better choice as the canonical model for a random graph. However, for this question, I guess it is natural to pose the question for G(n,m) since there’s an obvious extremal answer in terms of edges. That is, we require at least n-1 edges before there’s any chance that the graph is connected (cf the exercise about trees), so it’s natural to ask how many edges are required before there’s a strong chance that the G(n,m) is connected.

Clearly, conditioning on m edges being present in G(n,p) gives G(n,m). We can view G(n,p) as a mixture of distributions G(n,m) across values of m from 0 to \binom{n}{2}. Some graph properties are clearly badly focused under this mixture, for example the property A={the graph has an even number of edges}.

However, for increasing events (which are preserved under the addition of edges), proving that an event holds whp for G(n,p) given it holds whp for G(n,m), and vice versa, is normally straightforward.

For example, given a sequence p(n), take m(n) to be a sequence such that E(G(n,p))\ge m(n) with high probability. This is just about the tail of the binomial distribution, so for example taking m(n)=\binom{n}{2}p - \omega(\sqrt{n^2p}) suffices. Then if increasing property A holds for G(n,m) with high probability, it holds for G(n,p) with high probability too.

One can argue similarly in the other direction.

An interesting adjustment to this argument is required in the final section of my paper with James Martin in ALEA concerning a model of random forests which does not have the natural coupling property enjoyed by G(n,p) and G(n,q). This may be the subject of an upcoming post, time permitting.

Random Graphs – Lecture 1

My plan is to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics about the course can be found here.

In the first lecture, we revised some basic definitions about graphs, focusing on those which are most relevant to a first study of the Erdos-Renyi random graph G(n,p) which will be the focus of the lecture course. We discussed in abstract why the independence of the (potential) edges makes the model easier to analyse, but reduces its suitability as a direct model for lots of networks one might see in the real world, where knowledge that A is directly connected to both B and C affects the probability that B is directly connected to C, in either direction. Thinking about the Facebook friendship graph is one of the best examples, where in this case, we expect this extra information to increase the probability that B and C are connected. Even as the world moves away from heteronormativity, it realistically remains the case that in a graph of the dating history amongst a well-defined community we would likely observe the opposite effect.

All of these more complicated phenomena can be captured by various random graphs, but G(n,p) remains the corner stone, evinced by the >10^5 citations towards one of Erdos and Renyi’s original papers on the topic.

Somewhat paraphrasing, one of their (well, mostly Renyi’s) original questions was: when n is large, what should p be so that there’s a good chance that G(n,p) is connected?

The answer to this question lies in Lecture 2, but to cement understanding of the model, and explore some key methods for proofs in discrete probability (as well as play around with the big-O and little-o notation), we investigated the following two situations, which are very far from interesting as far as connectivity of G(n,p) is concerned.

Dense regime

When p is fixed, there are many interesting questions one could ask about the asymptotic properties of G(n,p), but connectivity is not one of them. In particular, for p\in(0,1) we claim:

Proposition: \mathrm{diam}(G(n,p)) \stackrel{\mathbb{P}}\rightarrow 2 as n\rightarrow\infty.

Note that if \mathrm{diam}(G(n,p))=1, then G(n,p)\simeq K_n, the complete graph on n vertices. In other words, every possible edge is actually present. But the probability of this event is p^{\binom{n}{2}}\rightarrow 0, so long as p<1.

It then suffices to prove that \mathbb{P}(G(n,p)>2) \rightarrow 0. We use a union bound, where we study the probability that the graph distance d_{G(n,p}(v,w)>2 for two fixed vertices v\ne w first, and then sum over all such pairs. Of course, there is a probability p that the two vertices are directly connected by an edge. Then, there are (n-2) other vertices with the potential to be a common neighbour of v and w, which would ensure that the graph distance between them is at most two. So

\mathbb{P}(d_{G(n,p)}(v,w)>2)=(1-p)[1-p^2]^{n-2} .

Note that we are using independence throughout this calculation. Then comes the union bound:

\mathbb{P}(\mathrm{diam}(G(n,p)) >2) \le \sum_{v\ne w \in[n]} \mathbb{P}(d_{G(n,P)}(v,w)>2)

\le \binom{n}{2} (1-p)[1-p^2]^{n-2} \rightarrow 0,

since exponential decay ‘kills’ polynomial growth.

Ultra-sparse regime

In general, we work in the setting where p=p(n) depends on n. If p(n) decays fast enough (see Exercise 2), then with high probability G(n,p) has no edges at all. However, when p(n)=o(n^{-3/2}) we have

Proposition: \mathbb{P}(\text{edges of }G(n,p)\text{ form a matching}) \rightarrow 1 as n\rightarrow\infty.

A matching is a collection of edges with no vertices in common. So if the edge set of the graph is a matching, we have essentially no interesting connectivity structure at all. The longest path has length one, for example.

To prove this, note that the edge set of the graph fails to be a matching precisely if one of the vertices has degree at least two. But since a vertex v is connected to each of the (n-1) other vertices in the graph independently with probability p, we have

\mathrm{deg}_{G(n,p)}(v) \sim \mathrm{Bin}(n-1,p),

and so we can directly make the crude approximation

\mathbb{P}(\mathrm{deg}_{G(n,p)}(v) =k) = \binom{n-1}{k}p^k(1-p)^{n-1-k}\le n^k p^k.

We’ve made this very weak bound to make life easier when we sum:

\mathbb{P}(\mathrm{deg}_{G(n,p)}(v) \ge 2) \le \sum_{k\ge 2}(np)^k = \frac{(np)^2}{1-np}.

Since p=o(n^{-3/2}), we have \frac{1}{1-np}=\frac{1}{1-o(1)}=1+o(1), and overall we obtain

\mathbb{P}(\mathrm{deg}_{G(n,p)}(v) \ge 2) = o(\frac{1}{n}).

Again, we finish with a union bound, considering this event across all vertices v\in[n].

\mathbb{P}(E(G(n,p))\text{ not a matching}) \le \sum_{v\in[n]} \mathbb{P}(\mathrm{deg}_{G(n,p)}(v)\ge 2)

=n\mathbb{P}(\mathrm{deg}_{G(n,p)}(1)\ge 2) = o(1),

as required.

Next time

In the next lecture, we’ll study the regime p(n)\sim \frac{\log n}{n}, where G(n,p) experiences a phase transition from probably not connected to probably connected. Part of this involves making the notion probably connected precise, which will be useful throughout the rest of the course, as well as establishing the language for comparing G(n,p) and G(n,q).

The proof itself requires some more sophisticated versions of calculations from Lecture 1, and more sophisticated probabilistic tools (first- and second-moment methods) to convert them into statements about convergence in probability. This will be an advertisement for the more classical enumerative methods that underpinned much of the early work on random graphs.

The rest of the course will exploit much more some comparisons and embeddings involving branching processes and exploration processes, so don’t worry – it won’t be 26 hours of counting trees!

Kernels of critical graph components

This post is motivated by G(N,p), the classical Erdos-Renyi random graph, specifically its critical window, when p=p(N)=\frac{1}{N}(1+\lambda N^{-1/3}).

We start with the following observation, which makes no restriction on p. Suppose a component of G(N,p) is a tree. Then, the graph geometry of this component is that of a uniform random tree on the appropriate number of vertices. This is deliberately informal. To be formal, we’d have to say “condition on a particular subset of vertices forming a tree-component” and so on. But the formality is broadly irrelevant, because at the level of metric scaling limits, if we want to describe the structure of a tree component, it doesn’t matter whether it has \log N or \frac{1}{7}N vertices, because in both cases the tree structure is uniform. The only thing that changes is the scaling factor.

In general, when V vertices form a connected component of a graph with E edges, we define the excess to be E-V+1. So the excess is non-negative, and is zero precisely when the component is a tree. I’m reluctant to say that the excess counts the number of cycles in the component, but certainly it quantifies the amount of cyclic structure present. We will sometimes, in a mild abuse of notation, talk about excess edges. But note that for a connected component with positive excess, there is a priori no way to select which edges would be the excess edges. In a graph process, or when there is some underlying exploration of the component, there sometimes might be a canonical way to classify the excess edges, though it’s worth remarking that the risk of size-biasing errors is always extremely high in this sort of situation.

Returning to the random graph process, as so often there are big changes around criticality. In the subcritical regime, the components are small, and most of them, even the largest with high probability, are trees. In the supercritical regime, the giant component has excess \Theta(N), which is qualitatively very different.

It feels like every talk I’ve ever given has begun with an exposition of Aldous’s seminal paper [Al97] giving a distributional scaling limit of the sizes of critical components in the critical window, and a relation between the process on this time-scale and the multiplicative coalescent. And it remains relevant here, because the breadth-first exploration process can also be used to track the number of excess edges.

In a breadth-first exploration, we have a stack of vertices we are waiting to explore. We pick one and look its neighbours restricted to the rest of the graph, that is without the vertices we have already fully explored, and also without the other vertices in the stack. That’s the easiest way to handle the total component size. But we can simultaneously track how many times we would have joined to a neighbour within the stack, which leads to an excess edge, and Aldous derives a joint distributional scaling limit for the sizes of the critical components and their excesses. (Note that in this case, there is a canonical notion of excess edge, but it depends not just on the graph structure, but also on the extra randomness of the ordering within the breadth-first search.)

Roughly speaking, we consider the reflected exploration process, and its scaling limit, which is a reflected parabolically-drifting Brownian motion (though the details of this are not important at this level of exposition, except that it’s a well-behaved non-negative process that hits zero often). The component sizes are given by the widths of the excursions above zero, scaled up in a factor N^{1/3}. Then conditional on the shape of the excursion, the excess is Poisson with parameter the area under the excursion, with no rescaling. That is, a critical component has \Theta(1) excess.

So, with Aldous’s result in the background, when we ask about the metric structure of these critical components, we are really asking: “what does a uniformly-chosen connected component with fixed excess look like when the number of vertices grows?”

I’ll try to keep notation light, but let’s say T(n,k) is a uniform choice from connected graphs on n vertices with excess k.

[Note, the separation of N and n is deliberate, because in the critical window, the connected components have size n = \Theta(N^{2/3}), so I want to distinguish the two problems.]

In this post, we will mainly address the question: “what does the cycle structure of T(n,k) look like for large n?” When k=0, we have a uniform tree, and the convergence of this to the Brownian CRT is now well-known [CRT2, LeGall]. We hope for results with a similar flavour for positive excess k.

2-cores and kernels

First, we have to give a precise statement of what it means to study just the cycle structure of a connected component. From now on I will assume we are always working with a connected graph.

There are several equivalent definitions of the 2-core C(G) of a graph G:

  • When the excess is positive, there are some cycles. The 2-core is the union of all edges which form part of some cycle, and any edges which lie on a path between two edges which both form part of some cycle.
  • C(G) is the maximal induced subgraph where all degrees are at least two.
  • If you remove all the leaves from the graph, then all the leaves from the remaining graph, and continue, the 2-core is the state you arrive at where there are no leaves.

It’s very helpful to think of the overall structure of the graph as consisting of the 2-core, with pendant trees ‘hanging off’ the 2-core. That is, we can view every vertex of the 2-core as the root of a (possibly size 1) tree. This is particular clear if we remove all the edges of the 2-core from the graph. What remains is a forest, with one tree for each vertex of the 2-core.

In general, the k-core is the maximal induced subgraph where all degrees are at least k. The core is generally taken to be something rather different. For this post (and any immediate sequels) I will never refer to the k-core for k>2, and certainly not to the traditional core. So I write ‘core’ for ‘2-core’.

As you can see in the diagram, the core consists of lots of paths, and topologically, the lengths of these paths are redundant. So we will often consider instead the kernel, K(G), which is constructed by taking the core and contracting all the paths between vertices of degree greater than 2. The resulting graph has minimal degree at least three. So far we’ve made no comment about the simplicity of the original graphs, but certainly the kernel need not be simple. It will regularly have loops and multiple edges. The kernel of the graph and core in the previous diagram is therefore this:

Kernels of critical components

To recap, we can deconstruct a connected graph as follows. It has a kernel, and each edge of the kernel is a path length of some length in the core. The rest of the graph consists of trees hanging off from the core vertices.

For now, we ask about the distribution of the kernel of a T(n,K). You might notice that the case k=1 is slightly awkward, as when the core consists of a single cycle, it’s somewhat ambiguous how to define the kernel. Everything we do is easily fixable for k=1, but rather than carry separate cases, we handle the case k\ge 2.

We first observe that fixing k doesn’t confirm the number of vertices or edges in the kernel. For example, both of the following pictures could correspond to k=3:

However, with high probability the kernel is 3-regular, which suddenly makes the previous post relevant. As I said earlier, it can introduce size-biasing errors to add the excess edges one-at-a-time, but these should be constant factor errors, not scaling errors. So imagine the core of a large graph with excess k=2. For the sake of argument, assume the kernel has the dumbbell / handcuffs shape. Now add an extra edge somewhere. It’s asymptotically very unlikely that this is incident to one of the two vertices with degree three in the core. Note it would need to be incident to both to generate the right-hand picture above. Instead, the core will gain two new vertices of degree three.

Roughly equivalently, once the size of the core is fixed (and large) we have to make a uniform choice from connected graphs of this size where almost every vertex has degree 2, and \Theta(1) of the rest have degree 3 or higher. But the sum of the degrees is fixed, because the excess is fixed. If there are n vertices in the core, then there are \Theta(n) more graphs where all the vertices have degree 2 or 3, than graphs where a vertex has degree at least 4. Let’s state this formally.

Proposition: The kernel of a uniform graph with n vertices and excess k\ge 2 is, with high probability as n\rightarrow\infty, 3-regular.

This proved rather more formally as part of Theorem 7 of [JKLP], essentially as a corollary after some very comprehensive generating function setup; and in [LPW] with a more direct computation.

In the previous post, we introduced the configuration model as a method for constructing regular graphs (or any graphs with fixed degree sequence). We observe that, conditional on the event that the resulting graph is simple, it is in fact uniformly-distributed among simple graphs. When the graph is allowed to be a multigraph, this is no longer true. However, in many circumstances, as remarked in (1.1) of [JKLP], for most applications the configuration model measure on multigraphs is the most natural.

Given a 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges, and K a uniform choice from the configuration model with these parameters, we have

\mathbb{P}\left( K \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1},

where t(H) is the number of loops in H, and mult(e) the multiplicity of an edge e. This might seem initially counter-intuitive, because it looks we are biasing against graphs with multiple edges, when perhaps our intuition is that because there are more ways to form a set of multiple edges we should bias in favour of it.

I think it’s most helpful to look at a diagram of a multigraph as shown, and ask how to assign stubs to edges. At a vertex with degree three, all stub assignments are different, that is 3!=6 possibilities. At the multiple edge, however, we care which stubs match with which stubs, but we don’t care about the order within the multi-edge. Alternatively, there are three choices of how to divide each vertex’s stubs into (2 for the multi-edge, 1 for the rest), and then two choices for how to match up the multi-edge stubs, ie 18 in total = 36/2, and a discount factor of 2.

We mention this because in fact K(T(n,k)) converges in distribution to this uniform configuration model. Once you know that K(T(n,k)) is with high probability 3-regular, then again it’s probably easiest to think about the core, indeed you might as well condition on its total size and number of degree 3 vertices. It’s then not hard to convince yourself that a uniform choice induces a uniform choice of kernel. Again, let’s state that as a proposition.

Proposition: For any H a 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges as before,

\lim_{n\rightarrow\infty}\mathbb{P}\left( K(T(n,k)) \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1}.

As we said before, the kernel describes the topology of the core. To reconstruct the graph, we need to know the lengths in the core, and then how to glue pendant trees onto the core. But this final stage depends on k only through the total length of paths in the core. Given that information, it’s a combinatorial problem, and while I’m not claiming it’s easy, it’s essentially the same as for the case with k=1, and is worth treating separately.

It is worth clarifying a couple of things first though. Even the outline of methods above relies on the fact that the size of the core diverges as n grows. Again, the heuristic is that up to size-biasing errors, T(n,k) looks like a uniform tree with some uniformly-chosen extra edges. But distances in T(n,k) scale like n^{1/2} (and thus in critical components of G(N,p) scale like N^{1/3}). And the core will be roughly the set of edges on paths between the uniformly-chosen pairs of vertices, and so will also have length \Theta(n^{1/2}).

Once you have conditioned on the kernel structure, and the (large) number of internal vertices on paths in the core (ie the length of the core), it is natural that the assignment of the degree-2 vertices to core paths / kernel edges is uniform. A consequence of this is that if you record (Y_1,\ldots,Y_m) the lengths of paths in the core, where m=3(k-1), then

\frac{(Y_1,\ldots,Y_m)}{\sum Y_i} \stackrel{d}\rightarrow \mathrm{Dirichlet}(1,1,\ldots,1).

This is stated formally as Corollary 7 b) of [ABG09]. It’s worth noting that this confirms that the lengths of core paths are bounded in probability away from zero after the appropriate rescaling. In seeking a metric scaling limit, this is convenient as it means there’s so danger that two of the degree-3 vertices end up in ‘the same place’ in the scaling limit object.

To recap, the only missing ingredients now to give a complete limiting metric description of T(n,k) are 1) a distributional limit of the total core length; 2) some appropriate description of set of pendant trees conditional on the size of the pendant forest. [ABG09] show the first of these. As remarked before, all the content of the second of these is encoded in the unicyclic k=1 case, which I have written about before, albeit slightly sketchily, here. (Note that in that post we get around size-biasing by counting a slightly different object, namely unicyclic graphs with an identified cyclic edge.)

However, [ABG09] also propose an alternative construction, which you can think of as glueing CRTs directly onto the stubs of the kernel (with the same distribution as before). The proof that this construction works isn’t as painful as one might fear, and allows a lot of the other metric distributional results to be read off as corollaries.


[ABG09] – Addario-Berry, Broutin, Goldschmidt – Critical random graphs: limiting constructions and distributional properties

[CRT2] – Aldous – The continuum random tree: II

[Al97] – Aldous – Brownian excursions, critical random graphs and the multiplicative coalescent

[JKLP] – Janson, Knuth, Luczak, Pittel – The birth of the giant component

[LeGall] – Le Gall – Random trees and applications

[LPW] – Luczak, Pittel, Wierman – The structure of a random graph at the point of the phase transition


Random 3-regular graphs

A graph is d-regular if every vertex has degree d. Probably the easiest examples of d-regular graphs are the complete graph on (d+1) vertices, and the infinite d-ary tree. A less trivial example is the Petersen graph, which is 3-regular. 3-regular graphs will be the main focus for some of this post, but initially we lose nothing by considering general d.

Throughout, a necessary condition for the existence of a d-regular graph with N vertices is that at least one of d and N is even, as the sum of the degrees of a graph must be even. We will always assume that this holds, so that when d=3, we are always taking N to be even.

A natural pair of questions for a probabilist is ‘can we sample a d-regular graph with N vertices uniformly at random?’ and ‘what does a typical large d-regular graph look like?’

In a rather old post, I addressed some aspects of the first question, but revisit it briefly here. A good idea, due to Bollobas [B80] is to assign to all the vertices d stubs (or half-edges), and choose a matching of the Nd stubs uniformly at random. This works as a method to generate a random graph with any fixed degree sequence.

If you want your graphs to be simple, this can go wrong, because there’s a chance you get loops (that is, an edge from a vertex v to itself) and multiple edges between the same pair of vertices. It would be nice the graph formed in this fashion was simple with high probability when N\rightarrow\infty. Unfortunately that’s not the case, however the probability that the graph is simple remains asymptotically bounded away from 0 and 1. Indeed, because the presence of a loop / multiple edge is asymptotically independent of the presence of a loop / multiple edge elsewhere, it’s unsurprising we have a Poisson limit for the number of such occurences. So from a sampling point of view, it’s reasonable to sample a graph in this way until you find a simple one. This takes O(1) steps, and it’s O(N) steps to check whether a given multigraph is simple.

It’s clear that conditional on the graph generated in this fashion being simple, its distribution is uniform on the set of simple graphs with the correct degree distribution. If you are happy for your graphs to have loops, then it’s a little bit more complicated, because if an edge has multiplicity k, these can appear in k! ways in the configuration construction.

Other asymptotic properties

Loops and multiple edges can be thought of as cycles of length 1 and 2 respectively if you want. We might ask about other small cycles. A calculation in expectation is relatively straightforward. Given three vertices, the probability they form a triangle (in at least one way) is \Theta(N^{-3}), and there are \Theta(N^3) ways to choose three vertices. Thus the expected number of triangles is \Theta(1). Finally, the edge structure induced on disjoint triples is asymptotically independent, and hence a Poisson limit. (See [J06] for details, including more detail on the general configuration construction.) The same result holds for the same reasons for cycles of any fixed finite length.

We might also ask about connectivity. At a heuristic level, there are two ways for the graph to be disconnected: it could have some small components; or it could have two components of size \Theta(N). The smallest possible component is K_4, and an argument like for the cycles above shows that the number of copies of K_4 vanishes in expectation. Now, consider having two components of size roughly N/2. There are \binom{N}{N/2} \sim 2^{2N} ways to make this choice. However, given such a choice, we can handle the probability that all the stubs from one class match within that class by going through the class one stub at a time:

\frac{\frac{3N}{2}-1}{3N-1} \times \frac{\frac{3N}{2}-3}{3N-3} \times \cdots \times \frac{1}{\frac{3N}{2}+1}.

We approximate this as

\frac{\sqrt{(3N/2)!}}{\sqrt{ (3N)!}} \sim  e^{3N/2} 2^{-3N/2} \left(3N\right)^{-3N/2},

and this dominates the number of choices powerfully enough that we might believe it remains valid for a broader range of class sizes. In fact we have a much stronger statement, namely that G(N,3) is 3-connected with high probability. This means that the graph cannot be disconnected by removing two vertices, or equivalently that there are three vertex-disjoint paths between any pair of vertices in the graph, essentially one emerging from each stub. See this note by David Ellis for a quick proof. We might return to this later.

You might ask about planarity. It’s clear from degree consideration that there are no induced copies of K_5 in any random 3-regular graph, and since K_{3,3} contains a cycle of length 4, and with high probability G(N,3) doesn’t, that takes care of that possibility too. However, there might be minors of this form. This seemed a good example of the Kuratowski criterion not actually being that useful, since I certainly don’t find the minors of the 3-regular graph an obvious structure to handle.

However, we can use Euler’s formula V – E + F = 2 for planar graphs. Here V = N, E = 3N/2. Faces are described by (a subset of the) cycles, and we there are asymptotically O(1) small cycles, so most faces include a large number of edges. But each edge corresponds to at most two faces. So we have F \ll E, and so with high probability Euler’s formula can’t hold in G(N,3) for large N.

We can also ask about the local limit of G(N,3). Since the vertices are exchangeable, we don’t need to worry about whether we choose the root uniformly at random (often referred to as the Benjamini-Schramm sense) or by some other method.

The root has up to three neighbours, and with high probability it has exactly three neighbours. These neighbours have at most two other neighbours themselves. However, we’ve already seen that there are asymptotically O(1) cycles, and so with high probability there are no small cycles near a fixed root vertex. So the six neighbours-of-neighbours are with high probability different to the root and the root’s neighbours and to each other. We can make this argument at arbitrary finite radius from the root, to conclude that the local limit of G(N,3) is the infinite 3-ary tree.

Spectral expansion

[Caveat – this is something I read about and wanted to mention, but I really don’t know much at all about any of this theory, and it’s definitely not certain that what follows wouldn’t be better replaced by a set of links.]

This straightforward local limit offers good heuristics on some of the more global properties. Almost by definition, the d-ary tree expands as rapidly as is possible away from the root among infinite d-regular graphs. There are a number of ways to measure the expansion of a graph, and some methods transfer better to the infinite setting than others. The adjacency matrix of an infinite graph can be defined similarly to that of a finite graph, and it remains possible to talk about eigenfunctions and spectrum. As for the finite setting, d is an eigenvalue because the tree is d-regular, and -d is an eigenvalue because it is also bipartite.

The next largest eigenvalue \lambda_2 governs the spectral gap d-\lambda_2 which is a measure of the expansion of a graph. A graph is a good (spectral) expander if all the non-trivial eigenvalues are close to zero. A priori, all we know is that |\lambda_2|\le d. For the infinite d-ary tree, we have \lambda_2 = 2\sqrt{d-1}. This blog post by Luca Trevisan gives a very readable proof.

A key result is that finite graphs can have \lambda_2 \le 2\sqrt{d-1}, but not asymptotically. That is, taking N to be the number of vertices:

\lambda_2 \ge 2\sqrt{d-1} - o_N(1).

This is the content of the Alon-Boppana theorem [Al86]. In fact the error can be quantified as O(\frac{1}{\log N}) – the diamater of the graph is relevant here. A finite d-regular graph for which \lambda_2\le 2\sqrt{d-1} is called a Ramanujan graph. The existence of Ramanujan graphs has been much studied, and various constructions often rely on number theoretic properties of N, and lie at the interface of disparate branches of mathematics where my understanding is zero rather than epsilon.

Now return to our view of the d-ary tree as the local limit of a d-regular graph on N vertices for large N. We might expect from everything above that the uniform d-regular graph is a good expander. Bollobas shows that in the sense of edge-expansion, asymptotically almost all d-regular graphs have edge-expansion bounded away from zero. (See Section 2 of [Ell], including history of the d=3 case.) Friedman [Fri08] proves the conjecture of Alon that for every \epsilon>0, a.a.s. \lambda_2 for G(N,d) is at most 2\sqrt{d-1}+\epsilon. In this sense, G(N,d) is asymptotically ‘almost Ramanujan’. (See also [Bor17] for another proof and an introduction including history, context and references.)

Some other links: The Wikipedia page on expanders, which includes a discussion of the different descriptions of expansion, and the Cheeger inequalities and other relations between them; slides for a talk by Spielman on spectra and Ramanujan graphs; a survey by Murty on Ramanujan graphs;.

What next?

This post took a slightly different direction from what I had intended, and rather than make a halting U-turn back to my planned finale, I’ll postpone this. However, a short overture is that I’m interested in the structure of critical components of random graphs during the critical window. This is the window during which the largest components first have cycles with probability \Theta(1). Indeed, the critical components have size \Theta(N^{2/3}) and \Theta(1) surplus edges. Conditional on their size, and number of surplus edges, the choice of the graph structure on the component is uniform among such (connected) graphs.

Addario-Berry, Broutin and Goldschmidt [ABG09] study scaling limits of such components. Central to this analysis is the 2-core of such components, which can be described in terms of 3-regular (multi)graphs. Various processes we are now interested in running on the critical components of critical RGs can then be studied in terms of related processes on random 3-regular graphs.


[ABG09] – Addario-Berry, Broutin, Goldschmidt – Critical random graphs: limiting constructions and distributional properties

[Al86] – Alon – Eigenvalues and expanders

[B80] – Bollobas – A probabilistic proof of an asymptotic formula for the number of labelled regular graphs

[B88] – Bollobas – The isoperimetric number of random regular graphs

[Bor17] – Bordenave – A new proof of Friedman’s second eigenvalue theorem and its extension to random lifts. Arxiv.

[Ell] – Ellis – The expansion of random regular graphs

[Fri08] – Friedman – A proof of Alon’s second eigenvalue conjecture and related problems

[J06] – Janson – The probability that a random multigraph is simple by

Random transpositions

We study a procedure for generating a random sequence of permutations of [N]. We start with the identity permutation, and then in each step, we choose two elements uniformly at random, and swap them. We obtain a sequence of permutations, where each term is obtained from the previous one by multiplying by a uniformly-chosen transposition.

Some more formality and some technical remarks:

  • This is a Markov chain, and as often with Markov chains, it would be better it was aperiodic. As described, the cycle will alternate between odd and even permutations. So we allow the two elements chosen to be the same. This laziness slows down the chain by a factor N-1/N, but removes periodicity. We will work over timescales where this adjustment makes no practical difference.
  • Let \tau_1,\tau_2,\ldots be the sequence of transpositions. We could define the sequence of permutations by \pi_m= \tau_m\cdot\tau_{m-1}\cdot \ldots\cdot \tau_1. I find it slightly more helpful to think of swapping the elements in places i and j, rather the elements i and j themselves, and so I’ll use this language, for which \pi_m = \tau_1\cdot \tau_2\cdot\ldots \cdot \tau_m is the appropriate description. Of course, transpositions and the identity are self-inverse permutations, so it makes no difference to anything we might discuss.
  • You can view this as lazy random walk on the Cayley graph of S_N generated by the set of transpositions. That is, the vertices of the graph are elements of S_N, and two are connected by an edge if one can be obtained from the other by multiplying by a transposition. Note this relation is symmetric. Hence random transposition random walk.
  • Almost everything under discussion would work in continuous time too.

At a very general level, this sort of model is interesting because sometimes the only practical way to introduce ‘global randomness’ is repeatedly to apply ‘local randomness’. This is not the case for permutations – it is not hard to sample uniformly from S_N. But it is a tractable model in which to study relevant questions about the generating randomness on a complicated set through iterated local operations.

Since it is a Markov chain with a straightforward invariant distribution, we can ask about the mixing time. That is, the correct scaling for the number of moves before the random permutation is close in distribution (say in the sense of total variation distance) to the equilibrium distribution. See this series of posts for an odd collection of background material on the topic. Diaconis and Shahshahani [DS81] give an analytic argument for mixing around \frac{N\log N}{2} transpositions. Indeed they include a constant because it is a sharp cutoff, where the total variation distance drops from approximately 1 to approximately 0 in O(N) steps.

Comparison with Erdos-Renyi random graph process

In the previous result, one might observe that m=\frac{N\log N}{2} is also the threshold number of edges to guarantee connectivity of the Erdos-Renyi random graph G(N,m) with high probability. [ER59] Indeed, there is also a sharp transition around this threshold in this setting too.

We explore this link further. We can construct a sequence of random graphs simultaneously with the random transposition random walk. When we multiply by transposition (i j), we add edge ij in the graph. Laziness of RTRW and the possibility of multiple edges mean this definition isn’t literally the same as the conventional definition of a discrete-time Erdos-Renyi random graph process, but again this is not a problem for any of the effects we seek to study.

The similarity between the constructions is clear. But what about the differences? For the RTRW, we need to track more information than the random graph. That is, we need to know what order the transpositions were added, rather than merely which edges were added. However, the trade-off is that a permutation is a simpler object than a graph in the following sense. A permutation can be a described as a union of disjoint cycles. In an exchangeable setting, all the information about a random permutation is encoded in the lengths of the these cycles. Whereas in a graph, geometry is important. It’s an elegant property of the Erdos-Renyi process that we can forget about the geometry and treat it as a process on component sizes (indeed, a multiplicative coalescent process), but there are other questions we might need to ask for which we do have to study the graph structure itself.

Within this analogy, unfortunately the word cycle means different things in the two different settings. In a permutation, a cycle is a directed orbit, while in a graph it has the usual definition. I’m going to write graph-cycle whenever relevant to avoid confusion.

A first observation is that, under this equivalence, the cycles of the permutation form a finer partition than the components of the graph. This is obvious. If we split the vertices into sets A and B, and there are no edges between them, then nothing in set A will ever get moved out of set A by a transposition. (Note that the slickness of this analogy is the advantage of viewing a transposition as swapping the elements in places i and j.)

However, we might then ask under what circumstances is a cycle of the permutation the same as a component of the graph (rather than a strict subset of it). A first answer is the following:

Lemma: [Den59] The permutation formed by a product of transpositions corresponding in any order to a tree in the graph has a single cycle.

We can treat this as a standalone problem and argue in the following predictable fashion. (Indeed, I was tempted to set this as a problem during selection for the UK team for IMO 2017 – it’s perfectly suitable in this context I think.) The first transposition corresponds to some edge say ab, and removing this edge divides the vertices into components A \ni a, B\ni b. Since no further transposition swaps between places in A and places in B, the final permutation maps a into B and b into A, and otherwise preserves A and B.

This argument extends to later transpositions too. Now, suppose there are multiple cycles. Colour one of them. So during the process, the coloured labels move around. At some point, we must swap a coloured label with an uncoloured label. Consider this edge, between places a and b as before, and indeed the same conclusion holds. WLOG we move the coloured label from a to b. But then at the end of the process (ie in the permutation) there are more coloured labels in B than initially. But the number of coloured labels should be the same, because they just cycle around in the final permutation.

We can learn a bit more by trying thinking about the action on cycles (in the permutation) of adding a transposition. In the following pair of diagrams, the black arrows represent the original permutation (note it’s not helpful to think of the directed edges as having anything to do with transpositions now), the dashed line represents a new transposition, and the new arrows describe the new permutation which results from this product.

It’s clear from this that adding a transposition between places corresponding to different cycles causes the cycles to merge, while adding a transposition between places already in the same cycle causes the cycle to split into two cycles. Furthermore the sizes of the two cycles formed is related to the distance in the cycle between the places defining the transposition.

This allows us to prove the lemma by adding the edges of the tree one-at-a-time and using induction. The inductive claim is that cycles of the permutation exactly correspond to components of the partially-built tree. Assuming this claim guarantees that the next step is definitely a merge, not a split (otherwise the edge corresponding to the next step would have to form a cycle). If all N-1 steps are merges, then the number of cycles is reduced by one on each step, and so the final permutation must be a single cycle.

Uniform split-merge

This gives another framework for thinking about the RTRW itself, entirely in terms of cycle lengths as a partition of [N]. That is, given a partition, we choose a pair of parts in a size-biased way. If they are different, we merge them; and if it is the same part, with size k, we split them into two parts, with sizes chosen uniformly from { (1,k-1), (2,k-2), …  (k-1,1) }.

What’s nice about this is that it’s easy to generalise to real-valued partitions, eg of [0,1]. Given a partition of [0,1], we sample two IID U[0,1] random variables U_1,U_2. If these correspond to different parts, we replace these parts by a single part with size given by the sum. If these correspond to the same part, with size \alpha, we split this part into two parts with sizes |U_1-U_2| and \alpha - |U_1-U_2|. This is equivalent in a distributional sense to sampling another U[0,1] variable U and replacing \alpha with (\alpha U, \alpha(1-U)). We probably want our partition to live in \ell^1_\searrow, so we might have to reorder the parts afterwards too.

These uniform split-merge dynamics have a (unique) stationary distribution, the canonical Poisson-Dirichlet random partition, hereafter PD(0,1). This was first shown in [DMZZ04], and then in a framework more relevant to this post by Schramm [Sch08].

Conveniently, PD(0,1) is also the scaling limit of the cycle lengths in a uniform random permutation (scaled by N). The best way to see this is to start with the observation that the length of the cycle containing 1 in a permutation chosen uniformly from S_N has the uniform distribution on {1,…,N}. This matches up well with the uniform stick-breaking construction of PD(0,1), though other arguments are available too. Excellent background on Poisson-Dirichlet distributions and this construction and equivalence can be found in Chapter 3 of Pitman’s comprehensive St. Flour notes [CSP]. Also see this post, and the links within, with the caveat that my understanding of the topic was somewhat shaky then (as presently, for now).

However, Schramm says slightly more than this. As the Erdos-Renyi graph passes criticality, there is a well-defined (and whp unique) giant component including \Theta(N) vertices. It’s not clear that the corresponding permutation should have giant cycles. Indeed, whp the giant component has \Theta(N) surplus edges, so the process of cycle lengths will have undergone O(N) splits. Schramm shows that most of the labels within the giant component are contained in giant cycles in the permutation. Furthermore, the distribution of cycle lengths within the giant component, rescaled by the size of the giant component, converges in distribution to PD(0,1) at any supercritical time \frac{(1+\epsilon)N}{2}

This is definitely surprising, since we already know that the whole permutation doesn’t look close to uniform until time \frac{N\log N}{2}. Essentially, even though the size of the giant component is non-constant (ie it’s gaining vertices), the uniform split-merge process is happening to the cycles within it at rate N. So heuristically, at the level of the largest cycles, at any supercritical time we have a non-trivial partition, so at any slightly later time (eg \frac{(1+\epsilon/2)N}{2} and \frac{(1+\epsilon)N}{2} ), mixing will have comfortably occurred, and so the distribution is close to PD(0,1).

This is explained very clearly in the introduction of [Ber10], in which the approach is extended to a random walk on S_N driven by a uniform choice from any conjugacy class.

So this really does tell us how the global uniform randomness emerges. As the random graph process passes criticality, we have a positive mass of labels in a collection of giant cycles which are effectively a continuous-space uniform split-merge model near equilibrium (and thus with PD(0,1) marginals). The remaining cycles are small, corresponding to small trees which make up the remaining (subcritical by duality) components of the ER graph. These cycles slowly get absorbed into the giant cycles, but on a sufficiently slow timescale relevant to the split-merge dynamics that we do not need to think of a separate split-merge-with-immigration model. Total variation distance on permutations does feel the final few fixed points (corresponding to isolated vertices in the graph), hence the sharp cutoff corresponding to sharp transition in the number of isolated vertices.


[Ber10] – N. Berestycki – Emergence of giant cycles and slowdown transition in random transpositions and k-cycles. [arXiv version]

[CSP] – Pitman – Combinatorial stochastic processes. [pdf available]

[Den59] – Denes – the representation of a permutation as a product of a minimal number of transpositions, and its connection with the theory of graphs

[DS81] – Diaconis, Shahshahani – Generating a random permutation with random transpositions

[DMZZ04] – Diaconis, Mayer-Wolf, Zeitouni, Zerner – The Poisson-Dirichlet distribution is the unique invariant distribution for uniform split-merge transformations [link]

[ER59] – Erdos, Renyi – On random graphs I.

[Sch08] – Schramm – Compositions of random transpositions [book link]

Generating uniform trees

A long time ago, I wrote quite a few a things about uniform trees. That is, a uniform choice from the n^{n-2} unrooted trees with vertex set [n]. This enumeration, normally called Cayley’s formula, has several elegant arguments, including the classical Prufer bijection. But making a uniform choice from a large set is awkward, and so we seek more probabilistic methods to sample such a tree, which might also give insight into the structure of a ‘typical’ uniform tree.

In another historic post, I talked about the Aldous-Broder algorithm. Here’s a quick summary. We run a random walk on the complete graph K_n started from a uniformly-chosen vertex. Every time we arrive at a vertex we haven’t visited before, we record the edge just traversed. Eventually we have visited all n vertices, so have recorded n-1 edges. It’s easy enough to convince yourself that these n-1 edges form a tree (how could there be a cycle?) and a bit more complicated to decide that the distribution of this tree is uniform.

It’s worth noting that this algorithm works to construct a uniform spanning tree on any connected base graph.

This post is about a few alternative constructions and interpretations of the uniform random tree. The first construction uses a Galton-Watson process. We take a Galton-Watson process where the offspring distribution is Poisson(1), and condition that the total population size is n. The resulting random tree has a root but no labels, however if we assign labels in [n] uniformly at random, the resulting rooted tree has the uniform distribution among rooted trees on [n].


This is all about moving from ordered trees to non-ordered trees. That is, when setting up a Galton-Watson tree, we distinguish between the following two trees, drawn extremely roughly in Paint:

That is, it matters which of the first-generation vertices have three children. Anyway, for such a (rooted) ordered tree T with n vertices, the probability that the Galton-Watson process ends up equal to T is

\mathbb{P}(GW = T) = \prod_{v\in T} \frac{e^{-1}}{C(v)!} = e^{-n} \prod_{v\in T}\frac{1}{C(v)!},

where C(v) is the number of children of a vertex v\in T. Then, since \mathbb{P}( |GW|=n ) is a function of n, we find

\mathbb{P}(GW=T \,\big|\, |GW|=n) = f(n)\prod_{v\in T} \frac{1}{C(v)!},

where f(n) is a function of n alone (ie depends on T only through its size n).

But given an unordered rooted tree t, labelled by [n], there are \prod_{v \in t} C(v)! ordered trees associated to t in the natural way. Furthermore, if we take the Poisson Galton-Watson tree conditioned to have total population size n, and label uniformly at random with [n], we obtain any one of these ordered trees with probability \frac{f(n)}{n!} \prod_{v\in t} \frac{1}{C(v)!}. So the probability that we have t after we forget about the ordering is \frac{f(n)}{n!}, which is a function of n alone, and so the distribution is uniform among the set of rooted unordered trees labelled by [n], exactly as required.

Heuristic for Poisson offspring distribution

In this proof, the fact that \mathbb{P}(C(v)=k)\propto \frac{1}{k!} exactly balances the number of orderings of the k children explains why Poisson(1) works out. Indeed, you can see in the proof that Poisson(c) works equally well, though when c\ne 1, the event we are conditioning on (namely that the total population size is n) has probability decaying exponentially in n, whereas for c=1, the branching process is critical, and the probability decays polynomially.

We can provide independent motivation though, from the Aldous-Broder construction. Both the conditioned Galton-Watson construction and the A-B algorithm supply the tree with a root, so we’ll keep that, and look at the distribution of the degree of the root as constructed by A-B. Let \rho=v_1,v_2,v_3,\ldots be the vertices [n], ordered by their discovery during the construction. Then \rho is definitely connected by an edge to v_2, but thereafter it follows by an elementary check that the probability \rho is connected to v_m is \frac{1}{n-1}, independently across all m. In other words, the distribution of the degree of \rho in the tree as constructed by A-B is

1+ \mathrm{Bin}\left(n-2,\frac{1}{n-1}\right) \approx 1+\mathrm{Poisson}(1).

Now, in the Galton-Watson process, conditioning the tree to have fixed, large size changes the offspring distribution of the root. Conveniently though, in a limiting sense it’s the same change as conditioning the tree to have size at least n. Since these events are monotone in n, it’s possible to take a limit of the conditioning events, and interpret the result as the Galton-Watson tree conditioned to survive. It’s a beautiful result that this interpretation can be formalised as a local limit. The limiting spine decomposition consists of an infinite spine, where the offspring distribution is a size-biased version of the original offspring distribution (and so in particular, always has at least one child) and where non-spine vertices have the original distribution.

In particular, the number of the offspring of the root is size-biased, and it is well-known and not hard to check that size-biasing Poisson(c) gives 1+Poisson(c) ! So in fact we have, in an appropriate limiting sense in both objects, a match between the degree distribution of the root in the uniform tree, and in the conditioned Galton-Watson tree.

This isn’t supposed to justify why a conditioned Galton-Watson tree is relevant a priori (especially the unconditional independence of degrees), but it does explain why Poisson offspring distributions are relevant.

Construction via G(N,p) and the random cluster model

The main reason uniform trees were important to my thesis was their appearance in the Erdos-Renyi random graph G(N,p). The probability that vertices {1, …, n} form a tree component in G(N,p) with some particular structure is

p^{n-1} (1-p)^{\binom{n}{2}-(n-1)} \times (1-p)^{n(N-m)}.

Here, the first two terms give the probability that the graph structure on {1, …, n} is correct, and the the final term gives the probability of the (independent) event that these vertices are not connected to anything else in the graph. In particular, this has no dependence on the tree structure chosen on [n] (for example, whether it should be a path or a star – both examples of trees). So the conditional distribution is uniform among all trees.

If we work in some limiting regime, where pn\rightarrow 0 (for example if n is fixed and p=\frac{1}{N}\rightarrow 0), then we can get away asymptotically with less strong conditioning. Suppose we condition instead just that [n] form a component. Now, there are more ways to form a connected graph with one cycle on [n] than there are trees on [n], but the former all require an extra edge, and so the probability that a given one such tree-with-extra-edge appears as the restriction to [n] in G(N,p) is asymptotically negligible compared to the probability that the restriction to [n] of G(N,p) is a tree. Naturally, the local limit of components in G(N,c/N) is a Poisson(c) Galton-Watson branching process, and so this is all consistent with the original construction.

One slightly unsatisfying aspect to this construction is that we have to embed the tree of size [n] within a much larger graph on [N] to see uniform trees. We can’t choose a scaling p=p(n) such that G(n,p) itself concentrates on trees. To guarantee connectivity with high probability, we need to take p> \frac{\log n}{n}, but by this threshold, the graph has (many) cycles with high probability.

At this PIMS summer school in Vancouver, one of the courses is focusing on lattice spin models, including the random cluster model, which we now briefly define. We start with some underlying graph G. From a physical motivation, we might take G to be \mathbb{Z}^d or some finite subset of it, or a d-ary tree, or the complete graph K_N. As in classical bond percolation (note G(N,p) is bond percolation on K_N), a random subset of the edges of G are included, or declared open. The probability of a given configuration w, with e open edges is proportional to

p^e (1-p)^{|E(G)| - e} q^{k(w)}, (*)

where the edge-weight p\in(0,1) as usual, and cluster weight q\in (0,\infty), and k(w) counts the number of connected components in configuration w. When q=1, we recover classical bond percolation (including G(N,p) ), while for q>1, this cluster-reweighting favours having more components, and q<1 favours fewer components. Note that in the case q\ne 1, the normalising constant (or partition function) of (*) is generally intractable to calculate explicitly.

As in the Erdos-Renyi graph, consider fixing the underlying graph G, and taking p\rightarrow 0, but also taking \frac{q}{p}\rightarrow 0. So the resulting graph asymptotically ‘wants to have as few edges as possible, but really wants to have as few components as possible’. In particular, 1) all spanning trees of G are equally likely; 2) any configuration with more than one component has asymptotically negligible probability relative to any tree; 3) any graph with a cycle has #components + #edges greater than that of a tree, and so is asymptotically negligible probability relative to any tree.

In other words, the limit of the distribution is the uniform spanning tree of G, and so this (like Aldous-Broder) is a substantial generalisation, which constructs the uniform random tree in the special case where G=K_n.