# Borel-Cantelli Lemmas

I’m currently lecturing my first course at King’s, which builds measure theory from the ground up to the construction of the Lebesgue integral, along with some more probabilistic topics. In this second week, we have been discussing various matters related to the construction of sigma-algebras. It took me slightly by surprise that this ends up being the natural moment during an undergraduate’s progression through probability to introduce the Borel-Cantelli lemmas. It’s not hard to imagine a pedagogical schedule in which they appear earlier, but equally, one benefits from seeing them with a bit more mathematical experience, even if they are only very loosely related to the construction of the Lebesgue integral!

Anyway, I noticed that I have not written about these results on the blog before. There are a couple of neat applications that lie slightly beyond the remit (and time allowance) of the lecture course, so they are presented at the end.

Context – events holding eventually and infinitely often

Recall that (in the notation of probability spaces) a sigma-algebra $\mathcal F$ provides a set of events $A\subseteq \Omega$ for which it is meaningful and consistent to define a probability measure. This is extremely uncontroversial in the setting where the set of outcomes $\Omega$ is finite, but thornier in the setting where $\Omega$ is uncountable, for example [0,1] or $\mathbb{R}$. The key axiom for sigma-algebras involves countable unions. If $(A_n)_{n\ge 1}$ is a sequence of events in $\mathcal F$, we must also have $\bigcup_{n\ge 1}A_n\in\mathcal F$.

Since the axioms also demand that $\mathcal F$ is closed under taking complements, it only takes a few steps to show that countable intersections are also valid, ie $\bigcap_{n\ge 1}A_n\in\mathcal F$ also. It is uncontroversial to note that

$\bigcap_{n\ge 1}A_n \subseteq \bigcup_{n\ge 1}A_n.$

The main initial motivation for considering the events $\{A_n\text{ infinitely often}\},\{A_n\text{ eventually}\}$ is that they lie between the intersection and the union in terms of containment.

• ‘Infinitely often’ is relatively uncontroversial, just meaning ‘for infinitely many values of n’.
• ‘Eventually’ is less clear on an initial reading. Perhaps ‘eventually always’ would be better? [And closer to the title of this blog!] The meaning is ‘always, except for finitely many n’, which implies that there is some threshold N such that it holds for all n>N.
• In set notation, infinitely often (or i.o.) means that $\bigcup_{k\ge n}A_k$ must hold for every n, and so the event $\{A_n\text{ i.o.}\}=\bigcap_n\bigcup_{k\ge n}A_k$.
• Similarly, eventually means that there must be an n for which $\bigcap_{k\ge n}A_k$ holds, so the event $\{A_n\text{ eventually}\}=\bigcup_n\bigcap_{k\ge n}A_k$.

We end up with the richer containment sequence

$\bigcap_{n\ge 1}A_n\subseteq \{A_n\text{ eventually}\}\subseteq \{A_n\text{ i.o.}\}\subseteq \bigcup_{n\ge 1}A_n.$

These individual containment relations may not be obvious a priori in this form. The clearest is probably the central one: if $A_n$ holds for all but finitely many n, then it holds for infinitely many n. The right-most says that if it holds infinitely often, then it holds at least once. For the left-most containment, note that $\bigcap_{n\ge 1}A_n$ is literally the n=1 case of $\bigcap_{k\ge n}A_k$, so taking a union over other values of n only increases the set.

It’s also worth thinking about the complements. Note that $\left(\bigcap_{n\ge 1}A_n\right)^c = \bigcup_{n\ge 1} A_n^c$ and $\left(\bigcup_{n\ge 1}A_n\right)^c=\bigcap_{n\ge 1}A_n^c$. Versions of these apply to eventually/i.o. also. In particular

$\{A_n\text{ i.o.}\}^c = \{A_n\text{ for only finitely many }n\} = \{A_n^c\text{ eventually}\}.$

and $\{A_n\text{ eventually}\}^c = \{A_n^c\text{ for infinitely many }n\}$.

So in any calculation or application, there is flexibility to study $(A_n^c)$ if this is more convenient. As we will see, it makes sense to focus on whichever of $(A_n),(A_n^c)$ has vanishing probabilities as $n\to\infty$.

Comment: It is standard to write $\limsup_{n}A_n= \bigcap_{n\ge 1}\bigcup_{k\ge n}A_k$, and $\liminf_n A_n=\bigcup_{n\ge 1}\bigcap_{k\ge n}A_k$. I am happy to admit I get these confused with non-vanishing probability, and prefer to stick to eventually/i.o. or the full union/intersection statements for safety!
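For concreteness, here is a toy computation (my own, not part of the course) on a six-point outcome space. The sequence of events is eventually periodic, alternating between two sets E and O, so truncating the unions and intersections at a finite horizon computes $\{A_n\text{ i.o.}\}$ and $\{A_n\text{ eventually}\}$ exactly, and we can verify the containment chain directly.

```python
# Toy events on {0,...,5}: for n >= 3 the sequence alternates between
# E (even n) and O (odd n), so {A_n i.o.} = E | O and
# {A_n eventually} = E & O; truncating at n = 99 is exact here.
E = {0, 1, 2}              # A_n for even n >= 3
O = {1, 2, 3}              # A_n for odd n >= 3
A1, A2 = {2, 3, 4}, {1, 2, 3, 4}

events = [A1, A2] + [E if n % 2 == 0 else O for n in range(3, 100)]

inter_all = set.intersection(*events)
union_all = set.union(*events)
# {A_n i.o.} = cap_n cup_{k>=n} A_k ; {A_n eventually} = cup_n cap_{k>=n} A_k
# (stop one short of the end so that every tail contains both parities)
io = set.intersection(*(set.union(*events[n:]) for n in range(len(events) - 1)))
ev = set.union(*(set.intersection(*events[n:]) for n in range(len(events) - 1)))
```

Here the containments are strict: $\{2\}\subsetneq\{1,2\}\subsetneq\{0,1,2,3\}\subsetneq\{0,1,2,3,4\}$.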

Borel-Cantelli Lemmas

The Borel-Cantelli lemmas establish some conditions under which the event $\{A_n\text{ i.o.}\}$ almost surely holds, or almost surely doesn’t hold. We’ll state them now:

BC1: If $\sum_{n\ge 1}\mathbb{P}(A_n)<\infty$, then $\mathbb{P}(A_n\text{ i.o.})=0$.

BC2: If $\sum_{n\ge 1}\mathbb{P}(A_n)=\infty$ and the events $(A_n)$ are independent, then $\mathbb{P}(A_n\text{ i.o.})=1$.

Proof of BC1: Since $\{A_n\text{ i.o.}\}\subseteq \bigcup_{k\ge n}A_k$ for every $n$, we have

$\mathbb{P}(A_n\text{ i.o.}) \le \sum_{k\ge n}\mathbb{P}(A_k)\;\stackrel{n\to\infty}\longrightarrow\; 0,$

since $\sum \mathbb{P}(A_n)<\infty$.

Comment: we could also introduce the random variable $N$ which counts how many events $A_n$ occur, ie $N=\sum_{n\ge 1}\mathbf{1}_{A_n}$. Then $\{A_n\text{ i.o.}\}=\{N=\infty\}$, and the given condition is equivalent to $\mathbb{E}[N]<\infty$, which certainly implies $\mathbb{P}(N=\infty)=0$. Indeed, it’s reasonable to think of BC1 as a first-moment result, with BC2 as (sort of) a second-moment result, and this is reflected in the comparative complexity of the two proofs.
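The first-moment viewpoint is easy to see in simulation. Here is a quick sketch (my own, not part of the course) with independent events $\mathbb{P}(A_n)=1/n^2$, whose probabilities are summable: the number of events which occur is small in every trial, with empirical mean close to $\mathbb{E}[N]=\sum 1/n^2$.

```python
import random

random.seed(0)
LIMIT, TRIALS = 10_000, 200

counts = []
for _ in range(TRIALS):
    # independent events A_n with P(A_n) = 1/n^2; count how many occur
    counts.append(sum(1 for n in range(1, LIMIT + 1)
                      if random.random() < 1 / n**2))

mean_count = sum(counts) / TRIALS
expected = sum(1 / n**2 for n in range(1, LIMIT + 1))  # close to pi^2/6
```

Note $A_1$ occurs in every trial (it has probability 1), and the total count never gets anywhere near the horizon, consistent with $\mathbb{P}(N=\infty)=0$.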

Proof of BC2: As discussed, it is equivalent to show that $\mathbb{P}(A_n^c\text{ eventually})=0$. To do this, we study the intersections of the complements, whose probabilities are tractable using independence:

$\mathbb{P}\left( \bigcap_{k\ge n}A_k^c\right) = \prod_{k\ge n}\mathbb{P}(A_k^c)=\prod_{k\ge n}(1-\mathbb{P}(A_k))$

$\le \prod_{k\ge n}\exp(-\mathbb{P}(A_k))=\exp(-\sum_{k\ge n}\mathbb{P}(A_k)).$

We have assumed that $\sum_{k\ge n}\mathbb{P}(A_k)=\infty$ for each n, and so we obtain

$\mathbb{P}\left(\bigcap_{k\ge n}A_k^c\right)=0,\quad\forall n\ge 1.$

Since $(\bigcap_{k\ge n}A_k^c)_{n\ge 1}$ is an increasing sequence of events, taking a union over n is particularly natural. Precisely, by continuity of measure, we have

$\mathbb{P}\left(\bigcup_{n\ge 1}\bigcap_{k\ge n}A_k^c\right) = \lim_{n\to\infty}\mathbb{P}\left( \bigcap_{k\ge n}A_k^c\right)=0.$

Comment: the independence condition cannot simply be dropped, as otherwise one can have situations where $A_1\supseteq A_2\supseteq\cdots$. For example, with $U\sim \mathrm{Unif}[0,1]$, consider $A_n:=\{U\le \frac1n\}$, for which $\mathbb{P}(A_n)=\frac1n$, whose sum diverges, while clearly $\mathbb{P}(A_n\text{ i.o.})=\mathbb{P}(U=0)=0$.
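The contrast is easy to see in simulation (a sketch of my own): with the same marginal probabilities $1/n$, independent events keep occurring at logarithmic rate, while the nested events $\{U\le 1/n\}$ occur exactly $\lfloor 1/U\rfloor$ times.

```python
import random

random.seed(1)
LIMIT = 100_000

# independent events with P(A_n) = 1/n: the count up to LIMIT grows like
# log(LIMIT), about 11.5 here, consistent with BC2's 'infinitely often'
indep_count = sum(1 for n in range(1, LIMIT + 1) if random.random() < 1 / n)

# nested events A_n = {U <= 1/n}: the same marginals, but A_n occurs
# precisely for n <= 1/U, so only floor(1/U) times in total
U = random.random()
nested_count = sum(1 for n in range(1, LIMIT + 1) if U <= 1 / n)
```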

Classic examples: the famous ‘infinite monkey theorem’, relatively well-known outside the realm of theoretical probability, asserts that a monkey typing randomly at a keyboard for long enough will eventually type the complete works of William Shakespeare. Other standard examples involve independent $(A_n)$ with probabilities decaying at various speeds. The latter do have some overlap with the slightly less standard examples presented below.

Example – generating random permutations

Consider a sequence of random permutations $(\sigma_n)_{n\ge 1}$ where $\sigma_n$ is a permutation of $[n]$. We construct these iteratively as follows: $\sigma_1$ is the identity (ie the only permutation of one element!); then for each $n\ge 2$, define $\sigma_n$ from $\sigma_{n-1}$ by inserting element $n$ at a uniformly-chosen position in $\sigma_{n-1}$.

We ask the question whether this gives a well-defined limiting permutation $\sigma_\infty$ on $\mathbb{N}$. A necessary condition for this is that the first element $\sigma_n(1)$ is eventually constant. But in fact the events $\{\sigma_n(1)=n\}$ are independent, and each occurs with probability $1/n$, and so

$\mathbb{P}(\text{first element changes i.o.})\ge \mathbb{P}(\sigma_n(1)=n\text{ for infinitely many }n)=1,$

by BC2.

On reflection, each $\sigma_n$ has the uniform distribution on $\Sigma_n$, permutations of $[n]$. So our result is consistent with the idea that one cannot construct a uniform random permutation on an infinite set.

Now, alternatively, suppose the new element is added at place $\max(n-X_n,1)$ where $(X_2,X_3,\ldots)$ are IID $\mathrm{Geometric}(q)$ random variables, supported on $\{1,2,3,\ldots\}$. (The truncation is just to prevent pathologies like trying to add the new entry in ‘place -6’.) Then

$\mathbb{P}(\sigma_n(1)=n)=(1-q)^{n-2},$

and since $\sum_n (1-q)^{n-2}<\infty$, BC1 shows that the first element $(\sigma_1(1),\sigma_2(1),\sigma_3(1),\ldots)$ changes only finitely often. A similar argument applies for the second element, and the third element, and so on. Consequently, almost surely we have a well-defined pointwise limit permutation $\sigma_\infty$. This is known as the Mallows Permutation on the integers, and is one of the more popular models for a random permutation on an infinite set, with lots of interesting properties. I supervised an undergraduate research project on this model over Summer 2022, and will write more when time allows.
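Both regimes can be simulated cheaply (my own sketch): we only care about whether each insertion lands at the front, so there is no need to store the permutations themselves, and the geometric variables are taken with support $\{1,2,\ldots\}$ as above.

```python
import random

random.seed(2)
N = 100_000
q = 0.5

def geometric(q):
    # support {1, 2, ...}: number of flips up to and including first success
    k = 1
    while random.random() >= q:
        k += 1
    return k

# uniform insertion: element n lands at the front with probability 1/n,
# so by BC2 the front changes infinitely often (about log N times here)
uniform_front_changes = sum(1 for n in range(2, N + 1)
                            if random.randint(1, n) == 1)

# geometric insertion at place max(n - X_n, 1): the front changes iff
# X_n >= n - 1, probability (1-q)^(n-2), summable, so BC1 applies
geom_front_changes = sum(1 for n in range(2, N + 1)
                         if geometric(q) >= n - 1)
```

The geometric count is dominated by the first few insertions (n=2 always lands at the front), in line with the front being eventually constant.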

Example – well-separated random integers

The final example is a toy version of a construction I’ve been working on recently regarding subsets of the vertices of a graph which ‘trap’ random walk on that graph.

Suppose we are trying to colour some positive integers randomly, so that the coloured integers are infinite, but also asymptotically sparse. We also do not want pairs of coloured integers to be close together. Let’s say that two integers $m,n$ are close if $|m-n|\le \sqrt{\max(m,n)}$.

Now, suppose we colour each integer $n$ green independently with probability $\frac1n$. Then, whenever we have two integers $m,n$ which are close, and coloured green, we re-colour them both red.

We would like to prove that infinitely many integers are coloured green. Let $C_n:=\{n\text{ coloured with some colour}\}$. Using BC2, we know that almost surely, infinitely many are coloured with some colour, and it seems implausible that many would be close. However, the events $G_n=\{n\text{ coloured green}\}$ are not independent, so one cannot apply BC2 directly to these. Instead, we can study $R_n=\{n\text{ coloured red}\}$. We can bound these probabilities as

$\mathbb{P}(R_n)\le\frac{1}{n}\frac{2\sqrt{n}}{n-\sqrt{n}} = \frac{2+o(1)}{n^{3/2}},$

and apply BC1 to show $\mathbb{P}(R_n\text{ i.o.})=0$. Then, since $\{G_n\text{ i.o.}\}\supseteq \{C_n\text{ i.o.}\}\setminus\{R_n\text{ i.o.}\}$, we obtain $\mathbb{P}(G_n\text{ i.o.})=1$, as required.
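Here is a simulation sketch of the colouring (my own code): colour each $n$ with probability $1/n$, then re-colour close pairs red, using the fact that for fixed $m$ the quantity $n-m-\sqrt n$ is increasing in $n$, so the scan for close partners can stop early.

```python
import math
import random

random.seed(3)
LIMIT = 200_000

# colour n with probability 1/n; m < n are close when n - m <= sqrt(n)
coloured = [n for n in range(1, LIMIT + 1) if random.random() < 1 / n]

red = set()
for i, m in enumerate(coloured):
    for n in coloured[i + 1:]:
        if n - m <= math.sqrt(n):
            red.update((m, n))
        else:
            break  # n - m - sqrt(n) increases in n: no later n is close to m

green = sorted(set(coloured) - red)
```

By construction any close coloured pair ends up red, so the surviving greens are pairwise well-separated.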

# Convex ordering on Galton-Watson trees

This blog was recently revived, via a post about convex ordering and its relevance to the problem of sampling with and without replacement that forms part of the potpourri of related results all sometimes referred to as Hoeffding’s inequality.

The previous post had been lying almost-complete but dormant for over two years. I revisited it because of a short but open-ended question about trees posed in our research group meeting by Serte Donderwinkel, one of our high-flying doctoral students.

Simplified Question:

For a Galton-Watson tree, can one obtain upper bounds in probability on the height of the tree, uniformly across all offspring distributions with mean $\mu$?

Note that in this setting, it is helpful to have in mind the classical notation $(Z_0,Z_1,Z_2,\ldots)$ for a Galton-Watson process, where typically $Z_0=1$, and $Z_{n+1}$ is the sum of $Z_n$ IID copies of the offspring distribution. Then we have

$\mathbb{P}(\mathrm{height}(\mathcal{T}) < k) = \mathbb{P}(Z_k=0).$

1) Subcritical case. When $\mu<1$, we certainly have $\mathbb{P}(Z_k>0)\le \mathbb{E}[Z_k]=\mu^k$.

Furthermore, if we’re studying all such offspring distributions, this is the best possible upper bound, by considering the offspring distribution given by $Z_1=1$ with probability $\mu$ and zero otherwise.

2) Critical or supercritical case. When $\mu\ge 1$, it is possible that the height is infinite with probability one.

So neither case is especially interesting for now.

Refined question:

What if instead we aren’t trying to obtain a bound uniformly across all offspring distributions with given mean $\mu$, but instead across a subset $\mathcal{X}$ of these distributions? How do we determine which distribution in $\mathcal{X}$ maximises the probability of reaching height k?

This is the question Serte was asking in our group meeting, in the setting where $\mu=1+o(1)$ and the height k has a particular scaling. Also, as far as I understand, the approach outlined in this post didn’t provide strong enough bounds in this particular context. Happily, Serte has recently tied up all the corners of this project concerning the supercritical Galton-Watson forest, and interested readers can find her preprint here on the arXiv.

Nonetheless the interpretation via convex ordering feels perfect for a blog post, rather than being lost forever.

Convex ordering for offspring distributions

The main observation is that given two offspring distributions X and Y, such that $X\le_{cx} Y$ (which recall means that the means are the same but X is more concentrated) then a number of distributions associated to the Galton-Watson trees for X and Y also satisfy convex ordering relations.

As a warm-up, and because it was the original genesis, we first study heights. We will use the notation

$(Z_0^X,Z_1^X,Z_2^X,\ldots), (Z_0^Y,Z_1^Y,Z_2^Y,\ldots),$

to denote the two Galton-Watson processes. We shall compare $\mathbb{P}(Z^X_k=0)$ and $\mathbb{P}(Z^Y_k=0)$. If we write $\delta_0(\cdot)$ for the function defined on the non-negative integers such that

$\delta_0(0)=1,\quad \delta_0(n)=0,\,n\ge 1$,

it holds that $\delta_0(\cdot)$ is convex. In particular, if $X\le_{cx}Y$, then $\mathbb{E}[\delta_0(X)]\le \mathbb{E}[\delta_0(Y)]$, which exactly says that

$\mathbb{P}(Z^X_1 = 0)\le \mathbb{P}(Z^Y_1 = 0).$

We can then prove that $\mathbb{P}(Z^X_k=0)\le \mathbb{P}(Z^Y_k=0)$ by induction on $k\ge 1$. Note that $\mathbb{P}(Z^X_k=0)^n$ is a convex function of n, regardless of the value of this probability, and so we have

$\mathbb{P}(Z^X_{k+1}=0) = \mathbb{E}\left[ (\mathbb{P}(Z^X_k=0))^X\right] \le \mathbb{E}\left[(\mathbb{P}(Z^X_k=0))^Y\right].$

By the induction hypothesis, this final quantity is at most

$\mathbb{E}\left[(\mathbb{P}(Z^Y_k=0))^Y\right] = \mathbb{P}(Z^Y_{k+1}=0)$.

In conclusion, we have shown that $\mathbb{P}(Z^X_k=0)\le \mathbb{P}(Z^Y_k=0)$ holds for all k, and thus

$\mathrm{height}(\mathcal{T}^X) \ge_{st} \mathrm{height}(\mathcal{T}^Y).$
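As a numerical sanity check (a toy pair of my own, not from the group meeting): take $X$ uniform on $\{0,2\}$ and $Y$ equal to 0 with probability 2/3 and 3 with probability 1/3. Both have mean 1, and $X\le_{cx}Y$. Iterating the probability generating functions via $\mathbb{P}(Z_{k+1}=0)=G(\mathbb{P}(Z_k=0))$ computes the probabilities exactly:

```python
# pgf iteration: P(Z_{k+1} = 0) = G(P(Z_k = 0)), starting from P(Z_0 = 0) = 0
def G_X(s):
    return 0.5 + 0.5 * s ** 2        # X: 0 or 2, probability 1/2 each

def G_Y(s):
    return 2 / 3 + (1 / 3) * s ** 3  # Y: 0 w.p. 2/3, 3 w.p. 1/3

qX = qY = 0.0
history = []
for _ in range(30):
    qX, qY = G_X(qX), G_Y(qY)
    history.append((qX, qY))
```

The check that $q^X_k\le q^Y_k$ at every step is exactly the inequality proved above; both sequences creep up towards the (critical) extinction probability 1.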

To return to the original context, suppose we have a large class of offspring distributions $\mathcal{Y}$ and a subclass $\mathcal{X}\subseteq \mathcal{Y}$ such that for all $Y\in\mathcal{Y}$, there exists $X\in \mathcal{X}$ such that $X\le_{cx} Y$. Then one can obtain uniform bounds on the heights of Galton-Watson trees with offspring distributions drawn from $\mathcal{Y}$ by checking those generated from distributions in $\mathcal{X}$ (which is particularly convenient if, for example, $\mathcal{X}$ is finite).

Convex ordering of generation sizes

The above argument solves the original problem, but brushes over the natural question: is it true that $Z^X_k \le_{cx} Z^Y_k$?

The answer is yes. Here’s a proof.

This follows from the following general statement:

Lemma 1: Suppose $X\le_{cx} Y$ are non-negative valued RVs and the non-negative integer valued RVs $M,N$ also satisfy $M \le_{cx} N$. Then

$X_1+\ldots+X_M \le_{cx} Y_1+\ldots+Y_N,$

where $X_1,X_2,\ldots$ are IID copies of X and, independently, $Y_1,Y_2,\ldots$ are IID copies of Y.

Lemma 2: Suppose $W_1\le_{cx}Z_1$ and $W_2\le_{cx} Z_2$, and the four random variables are independent. Then $W_1+W_2\le_{cx}Z_1+Z_2$.

Proof of Lemma 2: First, note that for any random variable X, and convex function f

$\mathbb{E}\left[f(Z+x)\right]$ is a convex function of x.

(Indeed, this holds since “f(z+x) is convex” holds for every z, and any definition of convex will pass to the expectation.)

Now we can attack the lemma directly. We may write

$\mathbb{E}\left[ f(W_1+W_2)\right]=\mathbb{E}\left[\, \mathbb{E}[f(W_1+W_2) \mid W_2 ] \,\right] \le \mathbb{E}\left[\, \mathbb{E}[f(W_1+Z_2)\mid Z_2 ] \, \right].$

But then for any $z_2$, we know $f(\cdot+z_2)$ is convex, so $\mathbb{E}[f(W_1+z_2)]\le \mathbb{E}[f(Z_1+z_2)]$, and it follows that

$\mathbb{E}\left[ f(W_1+W_2)\right]\le \mathbb{E} \left[ f(Z_1+Z_2)\right],$

which proves the lemma.
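Lemma 2 can be sanity-checked exactly on a toy pair (my own example): $W$ constant at 1 and $Z$ uniform on $\{0,2\}$, so $W\le_{cx}Z$ with equal means. Convolving each distribution with an independent copy of itself and testing several convex functions:

```python
import math
from itertools import product

W = [(1.0, 1.0)]                 # (value, probability) pairs; W = 1 always
Z = [(0.0, 0.5), (2.0, 0.5)]     # Z uniform on {0, 2}

def convolve(d1, d2):
    # distribution of the sum of two independent discrete random variables
    out = {}
    for (v1, p1), (v2, p2) in product(d1, d2):
        out[v1 + v2] = out.get(v1 + v2, 0.0) + p1 * p2
    return sorted(out.items())

def expect(f, dist):
    return sum(f(v) * p for v, p in dist)

WW, ZZ = convolve(W, W), convolve(Z, Z)
convex_fs = [abs, lambda s: s * s, lambda s: max(s - 1.0, 0.0),
             lambda s: math.exp(s / 3)]
gaps = [expect(f, ZZ) - expect(f, WW) for f in convex_fs]
```

Every gap is non-negative, as Lemma 2 predicts, while the means of the two sums agree.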

Corollary 3: When $W_1,\ldots,W_m, Z_1,\ldots,Z_m$ are independent, and satisfy $W_i \le_{cx} Z_i$, then we have $W_1+\ldots+W_m\le_{cx} Z_1+\ldots+Z_m$.

Proof of Lemma 1: Note that

$\mathbb{E}\left[ f(X_1+\ldots+X_M)\mid M=n\right] \le \mathbb{E}\left[ f(Y_1+\ldots+Y_N)\mid N=n\right],$

follows from Corollary 3. So a useful question to consider is whether $\mathbb{E}\left[f(Y_1+\ldots+Y_n)\right]$ is a convex function of n.

Denote this quantity by $F(n)$. To check convexity of a function defined on the integers, it suffices to verify that $F(n+1)-F(n)\ge F(n)-F(n-1)$.

There is a canonical coupling between the RVs used to define all of $F(n-1),F(n),F(n+1)$, but it will be convenient to adjust the coupling, and write:

$F(n+1)-F(n)= \mathbb{E}\left[ f(Y_1+\ldots+Y_n + Y^*) - f(Y_1+\ldots+Y_n)\right],$

$F(n)-F(n-1)=\mathbb{E}\left[f(Y_1+\ldots+Y_{n-1}+Y^*) - f(Y_1+\ldots+Y_{n-1})\right],$

where $Y^*$ is a further independent copy of $Y$. But note that for any choice $C\ge c$ and $y\ge 0$,

$f(C+y) - f(C) - f(c+y) + f(c)\ge 0.$ (*)

(Essentially, this says that the ‘chord’ of f on the interval $[c,C+y]$ lies above the chord on interval $[C,c+y]$ or $[c+y,C]$, which some people choose to call Karamata’s inequality, but I think is more helpful to think of as part of the visual definition of convexity.)

In any case, setting $y=Y^*, c=Y_1+\ldots+Y_{n-1}, C=Y_1+\ldots+Y_n$ and taking expectations, we obtain

$\mathbb{E}\left[ f(Y_1+\ldots+Y_n+Y^*) - f(Y_1+\ldots+Y_n)\right.$

$\left.- f(Y_1+\ldots+Y_{n-1}+Y^*) + f(Y_1+\ldots+Y_{n-1})\right]\ge 0,$

as required. So $F(n)$ is convex. We may now finish off as

$\mathbb{E}\left[ f(X_1+\ldots+X_M)\right] = \mathbb{E}\left[ \,\mathbb{E}[f(X_1+\ldots+X_M)\mid M]\,\right] \le \mathbb{E}\left[\, \mathbb{E}[f(Y_1+\ldots+Y_M)\mid M]\,\right] = \mathbb{E}[F(M)]\le \mathbb{E}[F(N)] = \mathbb{E}\left[f(Y_1+\ldots+Y_N)\right],$

completing the proof of Lemma 1.
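The convexity of $F(n)=\mathbb{E}[f(Y_1+\ldots+Y_n)]$ can be checked numerically on a toy example (my own): take $Y$ uniform on $\{0,2\}$ and $f(s)=e^{s/4}$, for which $F(n)=r^n$ with $r=\mathbb{E}[e^{Y/4}]=(1+e^{1/2})/2>1$, visibly convex in n.

```python
import math

Y = [(0.0, 0.5), (2.0, 0.5)]      # Y uniform on {0, 2}
f = lambda s: math.exp(s / 4)     # a convex test function

def add_one(dist):
    # convolve the current distribution of the partial sum with one more Y
    out = {}
    for v, p in dist.items():
        for w, q in Y:
            out[v + w] = out.get(v + w, 0.0) + p * q
    return out

dist = {0.0: 1.0}
F = [f(0.0)]                      # F(0) = f(0)
for _ in range(12):
    dist = add_one(dist)
    F.append(sum(f(v) * p for v, p in dist.items()))

second_diffs = [F[n + 1] - 2 * F[n] + F[n - 1] for n in range(1, 12)]
```

All second differences are strictly positive, matching $F(n+1)-F(n)\ge F(n)-F(n-1)$.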

• The analysis in this post is not sufficient to study the total population sizes of two Galton-Watson trees generated by X and Y. Note that in Lemma 2, it is important that the random variables are independent. Otherwise, we could, for example, take symmetric $X\le_{cx}Y$ with $\mathbb{E}[X]=\mathbb{E}[Y]=0$, and the dependent pair $(Y,-Y)$: clearly it should not hold that $X_1+X_2 \le_{cx} Y + (-Y) = 0$, unless X is degenerate. So for total population size, since the generation sizes $(Z^X_k,\,k\ge 1)$ are not independent, an alternative approach would be required.
• A further characterisation of convex ordering is given by Strassen’s theorem [Str65], which is touched on in the previous post, and to which I may return in a future post on this topic. This may be a more promising avenue for establishing a convex ordering result on total population size.
• Lemma 1 requires that X,Y are non-negative. Note that during the argument we set $y=Y^*, c=Y_1+\ldots+Y_{n-1}, C=Y_1+\ldots+Y_n$, and when we relax the non-negative support condition, it is no longer guaranteed that $C\ge c$, nor that $y\ge 0$, both of which are crucial for the step which follows.
• In a recent article in ECP addressing Lemma 1 by a different method, Berard and Juillet [BJ20] provide a simple example showing that the non-negativity assumption is genuinely necessary. Consider the stopping time $\tau\in \{0,2\}$ with equal probability, so $1\le_{cx} \tau$. Then, taking $(S_n)$ to be simple random walk on $\mathbb{Z}$, whose increments are certainly not non-negative, we do not have $S_1\le_{cx}S_{\tau}$.
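The failure in the [BJ20] example can be verified in exact arithmetic (my own check): both $S_1$ and $S_\tau$ have mean zero, but the convex function $|s|$ has strictly larger expectation under $S_1$, so $S_1\le_{cx}S_\tau$ fails.

```python
from fractions import Fraction as Fr

# S_1 is +-1 with probability 1/2 each; tau is 0 or 2 with probability 1/2,
# and S_tau = 0 when tau = 0, while S_2 is -2, 0, 2 w.p. 1/4, 1/2, 1/4
S1 = {-1: Fr(1, 2), 1: Fr(1, 2)}
S_tau = {-2: Fr(1, 8), 0: Fr(3, 4), 2: Fr(1, 8)}

mean_S1 = sum(v * p for v, p in S1.items())
mean_Stau = sum(v * p for v, p in S_tau.items())
E_abs_S1 = sum(abs(v) * p for v, p in S1.items())
E_abs_Stau = sum(abs(v) * p for v, p in S_tau.items())
```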

References

[BJ20] – Berard, Juillet – A coupling proof of convex ordering for compound distributions, 2020

[Str65] – Strassen – The existence of probability measures with given marginals, 1965

# Hoeffding’s inequality and convex ordering

A number of previous posts on this blog have discussed Markov’s inequality, and the control one can obtain on a distribution by applying this to functions of the distribution, or functions of random variables with this distribution. In particular, studying the square of the deviation from the mean $\left(X-\mathbb{E}X\right)^2$ leads to Chebyshev’s inequality, while even stronger bounds can sometimes be obtained using $e^{tX}$, for some appropriate value of $t\in\mathbb{R}$. Sometimes this final method, which relies on the finiteness of $\mathbb{E}[e^{tX}]$ is referred to as a Chernoff bound, especially when an attempt is made to choose the optimal value of t for the estimate under consideration.

The case of a random walk, that is the sum of IID increments, is particularly well-studied. In [Ho63], Hoeffding studies a generalisation where the increments are merely independent, and outlines a further generalisation to discrete-time martingales, which is developed by Azuma in [Az67], leading to the famous Azuma-Hoeffding inequality. A key feature of this theorem is that we require bounded increments, which leads to tight choices of $t_i$ in the Chernoff bound for each increment.

In addition, [Ho63] addresses the alternative model where the increments of a random walk are chosen uniformly without replacement from a particular set. The potted summary is that the sum of random increments chosen without replacement has the same mean, but is more concentrated than the corresponding sum of random increments chosen with replacement. This means that any of the concentration results proved in the earlier sections of [Ho63] for the latter situation apply equally to the setting without replacement.

Various versions of this result are referred to as Hoeffding’s inequality by the literature, and the main goal of this short article is to list some results which are definitely true about this setting!

(Note, in some contexts, ‘Hoeffding’s inequality’ will mean the concentration bound for either the with-replacement or the without-replacement setting. In this blog post, we are discussing inequalities comparing these two settings.)

Starting point – second moments in ‘vanilla’ Hoeffding

We won’t use this notation heavily, but for now let’s assume that we are studying a (multi)set of values $\mathcal{X}=\{x_1,\ldots,x_N\}$, and will study a sequence of m elements $(X_1,\ldots,X_m)$ from this set with replacement (ie chosen in an IID fashion), and a sequence of m elements $(Y_1,\ldots,Y_m)$ chosen from this set without replacement.

Some preliminary remarks:

• The $Y_i$s are still chosen uniformly, which can be interpreted as:
• Any permutation of the multiset $\{x_1,\ldots,x_N\}$ is equally likely to appear as $(Y_1,\ldots,Y_N)$;
• If determining the sequence one element at a time, then in a conditional sense, the choice of the ‘next’ element is uniform among those elements which have not yet been chosen.
• In particular, the sequence $(Y_1,\ldots,Y_m)$ is exchangeable (though not knowing what that means will not be a significant impediment to what follows).
• It is clear that $\mathbb{E}[X_1+\ldots+X_m]=\mathbb{E}[Y_1+\ldots+Y_m]$, so concentration around this common mean is the next question of interest.
• Intuitively, when $m\ll N$, we wouldn’t expect there to be much difference between sampling with or without replacement.
• When sampling without replacement, the cases m and N-m are symmetric with respect to the sum of $\mathcal{X}$ and, in particular, have the same concentration.

As a brief warmup, we establish that

$\mathrm{Var}(Y_1+\ldots+Y_m)\le \mathrm{Var}(X_1+\ldots+X_m).$

Note that the RHS is $m\sigma^2$, where $\sigma^2<\infty$ is the variance of a single uniform sample from $\mathcal{X}$. Since these sums have the same mean, it suffices to show that

$\mathbb{E}\left[ (Y_1+\ldots+Y_m)^2\right] \le \mathbb{E}\left[ (X_1+\ldots+X_m)^2\right].$

But this really has nothing to do with probability now, since these quantities are just symmetric polynomials in $\mathcal{X}$. That said, we gain a little bit if we use the exchangeability property to rewrite the LHS and the RHS as

$m\mathbb{E}[Y_1^2] + m(m-1)\mathbb{E}[Y_1Y_2] \le m\mathbb{E}[X_1^2] + m(m-1)\mathbb{E}[X_1X_2].$

Since $Y_1\stackrel{d}= X_1$, it remains to handle the cross term, which involves showing that

$\frac{1}{N(N-1)}\sum_{i\ne j} x_ix_j \le \frac{1}{N^2} \sum_{i,j}x_ix_j,$

which really is an algebra exercise, and can be demonstrated by the rearrangement inequality, or Chebyshev’s inequality (the other one – ie sequence-valued FKG not the probabilistic second-moment bound one), or by clever pre-multiplying and cancelling to reduce to Cauchy-Schwarz.
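This can all be checked by brute-force enumeration on a toy multiset (numbers my own). The final assertion below also checks the classical finite-population-correction factor $(N-m)/(N-1)$ relating the two variances.

```python
from itertools import permutations, product

xs = [1.0, 2.0, 4.0, 8.0]   # toy multiset
N, m = len(xs), 2

def mean_var(sums):
    mu = sum(sums) / len(sums)
    return mu, sum((s - mu) ** 2 for s in sums) / len(sums)

# with replacement: all N^m ordered sequences are equally likely
mean_with, var_with = mean_var([sum(t) for t in product(xs, repeat=m)])
# without replacement: first m entries of a uniform permutation
mean_without, var_without = mean_var([sum(p[:m]) for p in permutations(xs)])
```

Here $\sigma^2 = 7.1875$, so the with-replacement variance is $m\sigma^2 = 14.375$, and the without-replacement variance is strictly smaller.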

In [BPS18], which we will turn to in much greater detail shortly, the authors assert in the first sentence that Hoeffding treats also the setting of weighted sums, that is a comparison of

$\alpha_1X_1+\ldots+\alpha_mX_m,\quad \alpha_1Y_1+\ldots+\alpha_mY_m.$

A casual reader of [Ho63] may struggle initially to identify that this has indeed happened. At the level of second-moment comparison though, the argument given above carries over essentially immediately. We have

$\mathbb{E}\left[ (\alpha_1Y_1+\ldots+\alpha_m Y_m)^2\right] = \left(\alpha_1^2+\ldots+\alpha_m^2\right) \mathbb{E}[Y_1^2] + \left(\sum_{i\ne j} \alpha_i\alpha_j \right)\mathbb{E}\left[Y_1Y_2\right],$

and of course the corresponding decomposition for $(X_i)$, from which the variance comparison inequality follows as before.

Convex ordering

For many applications, variance is sufficient as a measure of concentration, but one can say more. As a prelude, note that one characterisation of stochastic ordering (which I have sometimes referred to as stochastic domination on this blog) is that

$X\le_{st}Y \;\iff\;\mathbb{E}[f(X)] \le \mathbb{E}[f(Y)],$

whenever f is a non-decreasing function. Unsurprisingly, it’s the $\Leftarrow$ direction which is harder to prove, though often the $\Rightarrow$ direction is the most useful in applications.

[To inform what follows, I used a number of references, in particular [BM05].]

A natural extension to handle concentration is (stochastic) convex ordering, which is defined as

$X\le_{cx} Y\;\iff \; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],$

whenever f is a convex function. Note immediately that this definition can only ever apply to random variables for which $\mathbb{E}X=\mathbb{E}Y$. Otherwise, consider the functions $f(x)=x, f(x)=-x$, both of which are convex.

This sort of comparison has plenty of application in economics and actuarial fields, where there is a notion of risk-aversion, and convexity is the usual criterion to define risk (in the sense that the danger of a significant loss is viewed as more significant than a corresponding possibility of a significant gain). Also note that in this context, X is preferred to Y by a risk-averse observer.

In fact, it can be shown [Oh69] that a sufficient condition for $X\le_{cx} Y$ is the existence of $\alpha\in\mathbb{R}$ such that

$F_X(x)\le F_Y(x)\;\iff\; x\le \alpha,$

where $F_X(x)=\mathbb{P}(X\le x)$ is the usual distribution function of X. That is, the distribution functions ‘cross exactly once’. Note that in general $\alpha\ne \mathbb{E}X=\mathbb{E}Y$.

We will return later to the question of adapting this measure to handle distributions with different expectations, but for now, we will return to the setting of sampling with and without replacement and show that

$Y_1+\ldots+Y_m\le_{cx}X_1+\ldots+X_m.$

This is the result shown by Hoeffding, but by a more complicated method. We use the observations of [BPS18], which are stated there in a more general setting to which we may return later. For notational convenience, let’s assume the elements of $\mathcal{X}$ are distinct, so that ‘repetitions of $x_i$’ has an unambiguous meaning.

We set up a natural coupling of $(X_i),(Y_i)$ as follows. Let $(X_i,i\ge 1)$ be IID uniform samples from $\mathcal{X}$, and generate $(Y_1,\ldots,Y_N)$ by removing any repetitions from $(X_1,X_2,\ldots)$.

Note then that given the sequence $(Y_1,\ldots,Y_m)$, the multiset of values $\{X_1,\ldots,X_m\}$ definitely includes $Y_1$ at least once, but otherwise includes each $Y_i$ either zero, one or more times, in a complicated way. However, this distribution doesn’t depend on the values taken by $(Y_i)$, since it is based on uniform sampling. In particular, if we condition instead on the set $\{Y_1,\ldots,Y_m\}$, we find that

$\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, \{Y_1,\ldots,Y_m\}\right] = Y_1+\ldots+Y_m,$

and so we may coarsen the conditioning to obtain

$\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, Y_1+\ldots+Y_m\right] = Y_1+\ldots+Y_m.$

This so-called martingale coupling is very useful for proving convex ordering. You can think of $\mathbb{E}[X|Y]=Y$ as saying “X is generated by Y plus additional randomness”, which is consistent with the notion that Y is more concentrated than X. In any case, it meshes well with Jensen’s inequality, since if f is convex, we obtain

$\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] = \mathbb{E}\left[ f(Y)\right].$

So we have shown that sampling without replacement is greater than sampling with replacement in the convex ordering, if we consider sums of the samples.
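Here is a Monte Carlo sketch of the coupling (my own code; the values in xs are arbitrary toy numbers), checking the martingale property by bucketing the with-replacement sum according to the without-replacement sum:

```python
import random

random.seed(4)
xs = [1.0, 2.0, 5.0, 9.0]
m, TRIALS = 2, 100_000

# coupling: draw IID uniform X's; the Y's are the first occurrences, so
# X_1+...+X_m uses the first m draws, {Y_1,...,Y_m} the first m distinct
totals, counts = {}, {}
for _ in range(TRIALS):
    draws, distinct = [], []
    while len(distinct) < m:
        x = random.choice(xs)
        draws.append(x)
        if x not in distinct:
            distinct.append(x)
    y_sum = sum(distinct)
    totals[y_sum] = totals.get(y_sum, 0.0) + sum(draws[:m])
    counts[y_sum] = counts.get(y_sum, 0) + 1

cond_means = {y: totals[y] / counts[y] for y in totals}
```

Within each bucket the empirical conditional mean of the with-replacement sum matches the without-replacement sum, up to Monte Carlo noise.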

If we want to handle weighted sums, and tried to reproduce the previous argument directly, it would fail, since

$\mathbb{E}\left[\alpha_1X_1+\ldots+\alpha_mX_m\,\big|\,\{Y_1,\ldots,Y_m\}\right]=\frac{\alpha_1+\ldots+\alpha_m}{m}(Y_1+\ldots+Y_m),$

which is not in general equal to $\alpha_1Y_1+\ldots+\alpha_mY_m$.

Non-uniform weighted sampling

Of course, it’s not always the case that we want to sample according to the uniform distribution. My recent personal motivation for investigating this setup has been the notion of exploring a graph generated by the configuration model, for which vertices appear for the first time in a random order that is size-biased by their degrees.

In general we can think of each element of $\mathcal{X}$ as having some weight w(i), and at each step, we pick $x_i$ from the remaining elements with probability proportional to w(i). This procedure is known in other contexts as successive sampling. Note that the sequence $(Y_1,\ldots,Y_m)$ is no longer exchangeable. This is clear enough when there are two elements in $\mathcal{X}$, and $w(1)\gg w(2)$: then it is much more likely that $(Y_1,Y_2)=(x_1,x_2)$ than $(Y_1,Y_2)=(x_2,x_1)$.

In addition, it will in general not be the case that

$\mathbb{E}[Y_1+\ldots+Y_m]=\mathbb{E}[X_1+\ldots+X_m],$

and this is easily seen again in the above example, where $\mathbb{E}[Y_1+Y_2]=x_1+x_2$ while $\mathbb{E}[X_1+X_2]\approx 2x_1$.
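For the two-element example this is an exact two-line computation (toy numbers my own): exhausting both elements without replacement always totals $x_1+x_2$, while two weighted with-replacement draws have expectation $2(w_1x_1+w_2x_2)/(w_1+w_2)$, close to $2x_1$ when $w_1\gg w_2$.

```python
x1, x2 = 5.0, 1.0
w1, w2 = 1000.0, 1.0

# without replacement: both elements are sampled, in some order
E_without = x1 + x2
# with replacement: each weighted draw has mean (w1*x1 + w2*x2)/(w1 + w2)
E_with = 2 * (w1 * x1 + w2 * x2) / (w1 + w2)
```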

However, one can still compare such distributions, with the notion of increasing convex ordering, which is defined as:

$X\le_{icx} Y\;\iff\; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],$

for all convex non-decreasing functions f. From an economic point of view, increasing convex ordering becomes a comment on the relative contribution of losses above each possible threshold under the two distributions (with X preferable to Y at every such threshold), and is sometimes called stop-loss order. More formally, as in [BM05] we have:

$X\le_{icx}Y\;\iff\; \mathbb{E}[\max(X-t,0)]\le \mathbb{E}[\max(Y-t,0)]\,\forall t\in\mathbb{R}.$

In [BPS18], the authors study the case of successive sampling where the weights are monotone-increasing in the values, that is $w(i)\ge w(j)\;\iff\; x_i\ge x_j$, of which size-biased sampling is a special case. With this setup, the heuristic is that ‘large values are more likely to appear earlier’ in the sample without replacement, and making this notion precise underpins all the work.

The coupling introduced above is particularly useful in this more exotic setting. The authors show that conditional on the event A of the form $\{Y_1,\ldots,Y_m\}=\{x_{i_1}\le x_{i_2}\le\ldots\le x_{i_m}\}$, one has

$\mathbb{P}(X_1=x_{i_j}\,\big|\,\mathbf{A}) \le \mathbb{P}(X_1=x_{i_k}\,\big|\,\mathbf{A}),\text{ precisely when }j\le k.$

This is achieved by comparing the explicit probabilities of pairs of orderings for $(Y_1,\ldots,Y_m)$ with $x_{i_j},x_{i_k}$ exchanged. See [BPS18] for this short calculation – a link to the arXiv version is in the references. So to calculate $\mathbb{E}[X_1\,\big|\, \mathbf{A}]$, we have the setup for the rearrangement inequality / Chebyshev again, as

$\mathbb{E}[X_1\,\big|\, \mathbf{A}] = \sum_{j=1}^m x_{i_j}\,\mathbb{P}(X_1=x_{i_j}\,\big|\,\mathbf{A}) \ge \frac{1}{m}\left(\sum_{j=1}^m x_{i_j}\right)\left(\sum_{j=1}^m \mathbb{P}(X_1=x_{i_j}\,\big|\,\mathbf{A})\right) = \frac{Y_1+\ldots+Y_m}{m},$

where the final equality uses that, conditional on $\mathbf{A}$, the sum $Y_1+\ldots+Y_m=x_{i_1}+\ldots+x_{i_m}$ is deterministic.

As before, letting $X,Y$ be the sums of the samples with and without replacement, respectively, we have $\mathbb{E}[X\,\big|\, Y] \ge Y$, which the authors term a submartingale coupling. In particular, the argument

$\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] \ge \mathbb{E}\left[ f(Y)\right],$

works exactly as before to establish convex ordering.

References

[Az67] – Azuma – 1967 – Weighted sums of certain dependent random variables

[BM05] – Bauerle, Muller – 2005 – Stochastic orders and risk measures: consistency and bounds. (Online version here)

[BPS18] – Ben Hamou, Peres, Salez – 2018 – Weighted sampling without replacement (ArXiv version here)

[Ho63] – Hoeffding – 1963 – Probability inequalities for sums of bounded random variables

[Oh69] – Ohlin – 1969 – On a class of measures of dispersion with application to optimal reinsurance

# Lecture 8 – Bounds in the critical window

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

Preliminary – positive correlation, Harris inequality

I wrote about independence, association, and the FKG property a long time ago, while I was still an undergraduate taking a first course on Percolation in Cambridge. That post is here. In the lecture I discussed the special case of the FKG inequality applied in the product measure setting, of which the Erdos-Renyi random graph is an example, and which is sometimes referred to as the Harris inequality.

Given two increasing events A and B, say for graphs on [n], then if $\mathbb{P}$ is product measure on the edge set, we have

$\mathbb{P}(A\cap B)\ge \mathbb{P}(A)\mathbb{P}(B).$

Intuitively, since both A and B are ‘positively-correlated’ with the not-rigorous notion of ‘having more edges’, they are genuinely positively-correlated with each other. We will use this later in the post, in the form $\mathbb{E}[X|A]\ge \mathbb{E}[X]$, whenever X is an increasing RV and A is an increasing event.
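As a sanity check, the Harris inequality can be verified exactly on a toy case by enumerating all $2^3$ edge configurations of a graph on 3 vertices under product measure (the two increasing events here are my own illustrative choices):

```python
from itertools import combinations, product

p = 0.3
edges = list(combinations(range(3), 2))     # the three possible edges of K_3

pa = pb = pab = 0.0
for config in product([0, 1], repeat=len(edges)):
    pr = 1.0
    for present in config:
        pr *= p if present else 1 - p
    a = sum(config) >= 2        # increasing event A: at least two edges
    b = config[0] == 1          # increasing event B: edge {0,1} is present
    pa += pr * a
    pb += pr * b
    pab += pr * (a and b)

assert pab >= pa * pb           # Harris: P(A and B) >= P(A) P(B)
```

Here $\mathbb{P}(A\cap B)=0.153$ while $\mathbb{P}(A)\mathbb{P}(B)=0.216\times 0.3=0.0648$, so the inequality holds with plenty of room.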

The critical window

During the course, we’ve discussed separately the key qualitative features of the random graph $G(n,\frac{\lambda}{n})$ in the

• subcritical regime when $\lambda<1$, for which we showed that all the components are small, in the sense that $\frac{1}{n}|L_1| \stackrel{\mathbb{P}}\rightarrow 0$, although the same argument would also give $|L_1|\le K\log n$ with high probability if we used stronger Chernoff bounds;
• supercritical regime when $\lambda>1$, for which there is a unique giant component, ie $\frac{1}{n}|L_1|\stackrel{\mathbb{P}}\rightarrow \zeta_\lambda>0$, the survival probability of a Galton-Watson branching process with Poisson($\lambda$) offspring distribution. A duality argument, for example, shows that with high probability all other components are small in the same sense as in the subcritical regime.

In between, of course we should study $G(n,\frac{1}{n})$, for which it was known that $L_1\stackrel{d}\sim n^{2/3},\, L_2\stackrel{d}\sim n^{2/3},\ldots$. (*) That is, the largest components are on the scale $n^{2/3}$, and there are lots of such critical components.

In the early work on random graphs, the story ended roughly there. But in the 80s, these questions were revived, and considerable work by Bollobas and Luczak, among many others, started investigating the critical setting in more detail. In particular, between the subcritical and the supercritical regimes, the ratio $\frac{|L_2|}{|L_1|}$ between the sizes of the largest and second-largest components goes from ‘concentrated on 1’ to ‘concentrated on 0’. So it is reasonable to ask what finer scaling of the edge probability $p(n)$ around $\frac{1}{n}$ should be chosen to see this transition happen.

Critical window

In this lecture, we studied the critical window, describing sequences of probabilities of the form

$p(n)=\frac{1+\lambda n^{-1/3}}{n},$

where $\lambda\in(-\infty,+\infty)$. (Obviously, this is a different use of $\lambda$ to previous lectures.)

It turns out that as we move $\lambda$ from $-\infty$ to $+\infty$, this window gives exactly the right scaling to see the transition of $\frac{|L_2|}{|L_1|}$ described above. Work by Bollobas and Luczak, with many co-authors and others in the 80s, established a large number of results in this window, but for the purposes of this course, this can be summarised as saying that the critical window has the same scaling behaviour as $p(n)=1/n$, with a large number of components on the scale $\sim n^{2/3}$ (see (*) earlier), but different scaling limits.

Note: Earlier in the course, we have discussed local limits, in particular for $G(n,\lambda/n)$, where the local limit is a Galton-Watson branching process tree with offspring distribution $\mathrm{Poisson}(\lambda)$. Such local properties are not sufficient to distinguish between different probabilities within the critical window. Although there are lots of critical components, it remains the case that asymptotically almost all vertices are in ‘small components’.

The precise form of the scaling limit for

$\frac{1}{n^{2/3}} \left( |L_1|, |L_2|, |L_3|,\ldots \right)$

as $n\rightarrow\infty$ was shown by Aldous in 1997, by lifting a scaling limit result for the exploration process, which was discussed in this previous lecture and this one too. Since Brownian motion lies outside the assumed background for this course, we can’t discuss that, so this lecture establishes upper bounds on the correct scale of $|L_1|$ in the critical window.

# Lecture 10 – the configuration model

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we enter the final stages of the semester, I want to discuss some extensions to the standard Erdos-Renyi random graph which has been the focus of most of the course so far. Although we will not get far into the details during this course, the overall goal is to develop models which are close to Erdos-Renyi in terms of ease of analysis, while also allowing more of the features characteristic of networks observed in the real world.

One of the more obvious deficiencies of the sparse regime of Erdos-Renyi random graphs for modelling ‘real-world phenomena’ concerns the degree sequence. Indeed, the empirical degree distribution of G(n,c/n) converges to Poisson(c). By contrast, in real-world networks, a much wider range of degrees is typically observed, and in many cases it is felt that these should follow a power law, with a small number of very highly connected agents.

One way around this problem is to construct random graphs where we insist that the graph has a given sequence of degrees. The configuration model, which is the subject of this lecture and this post (and about which I’ve written before), offers one way to achieve this.

Definition and notes

Let $n\ge 1$ and let $d=(d_1,d_2,\ldots,d_n)$ be a sequence of non-negative integers such that $\sum_{i=1}^n d_i$ is even. Then the configuration model with degree sequence d is a random multigraph with vertex set [n], constructed as follows:

• To each vertex $i\in[n]$, assign $d_i$ half-edges;
• Then, take a uniform matching of these half-edges;
• Finally, for each pair of half-edges in the matching, replace the two half-edges with a genuine edge, to obtain the multigraph $CM_n(d)$, in which, by construction, vertex i has degree $d_i$.
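The three steps above translate directly into code. A minimal sketch (the function name is mine): each half-edge is recorded as a copy of its vertex label, a uniform matching is obtained by shuffling the half-edges and pairing consecutive entries, and the pairs become the edges of the multigraph.

```python
import random

def configuration_model(degrees, rng=random):
    """Sample CM_n(d) as a list of edges (loops and multi-edges allowed);
    vertex i receives degrees[i] half-edges."""
    assert sum(degrees) % 2 == 0, "total degree must be even"
    half_edges = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(half_edges)          # a uniform perfect matching of half-edges
    return [(half_edges[i], half_edges[i + 1])
            for i in range(0, len(half_edges), 2)]
```

By construction, vertex i ends up with degree $d_i$, counting a self-loop twice.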

One should note immediately that although the matching is uniform, the multigraph is not uniform amongst multigraphs with that degree sequence. Note also that the condition on the sums of the degrees is necessary for any graph, and in this context means that the number of half-edges is even, without which it would not be possible to construct a matching.

This effect is manifest in the simplest possible example, when n=2 and d=(3,3). There are two possible graphs, up to isomorphism, which are shown below:

For obvious reasons, we might refer to these as the handcuffs and the theta, respectively. It’s helpful if we, temporarily, assume the half-edges are distinguishable at the moment we join them up in the configuration model construction. Because then there are 3×3=9 ways to join them up to form the handcuffs (think of which half-edge ends up forming the edge between the two vertices) while there are 3!=6 ways to pair up the half-edges in the theta.

In general, for multigraphs H with the correct degree sequence, we have

$\mathbb{P}( CM_n(d)\simeq H) \propto \left( 2^{\# \text{loops}(H)} \prod_{e\in E(H)} \text{mult}(e)! \right)^{-1},$

where $\text{mult}(e)$ is the multiplicity with which a given edge e appears in H.
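The 9:6 count for the handcuffs and the theta can be checked by brute force, enumerating all $5!!=15$ perfect matchings of six labelled half-edges (three at each vertex):

```python
def matchings(halves):
    """Yield all perfect matchings of a list of labelled half-edges."""
    if not halves:
        yield []
        return
    first, rest = halves[0], halves[1:]
    for i in range(len(rest)):
        for m in matchings(rest[:i] + rest[i + 1:]):
            yield [(first, rest[i])] + m

owner = lambda h: 0 if h < 3 else 1   # half-edges 0-2 at vertex 0, 3-5 at vertex 1

counts = {"handcuffs": 0, "theta": 0}
for m in matchings(list(range(6))):
    cross = sum(owner(a) != owner(b) for a, b in m)
    # with three half-edges per side, cross is 1 (handcuffs) or 3 (theta)
    counts["theta" if cross == 3 else "handcuffs"] += 1

assert counts == {"handcuffs": 9, "theta": 6}
```

The ratio 9:6 matches the weights $1/(2^2)$ and $1/3!$ for the handcuffs and the theta respectively, up to a common constant.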

Note: it might seem counterintuitive that this procedure is biased against multiple edges and self-loops, but it is really just saying that there are more ways to form two distinct edges than to form two equal edges (ie a multiedge pair) when we view the half-edges as distinguishable. (See this post for further discussion of this aspect in the 3-regular setting.)

However, a consequence of this result is that if we condition on the event that $CM_n(d)$ is simple, then the resulting random graph is uniform on the set of simple graphs satisfying the degree property. Note that the same example as above shows that there’s no guarantee that there exists a simple graph whose degrees are some given sequence.

d-regular configuration model

In general, from a modelling point of view, we are particularly interested in simple, connected graphs, and so it is valuable to study whether large examples of the configuration model are likely to have these properties. In this lecture, I will mainly focus on the case where the multigraphs are d-regular, meaning that all the vertices have degree equal to d. For the purposes of this lecture, we denote by $G^d(n)$ the d-regular configuration model $CM_n(d,\ldots,d)$.

• d=1: to satisfy the parity condition on the sums of degrees, we must have n even. But then $G^1(n)$ will consist of n/2 disjoint edges.
• d=2: $G^2(n)$ will consist of some number of disjoint cycles, and it is a straightforward calculation to check that when n is large, with high probability the graph will be disconnected.

In particular, I will focus on the case when d=3, which is the first interesting case. Most of the results we prove here can be generalised (under various conditions) to more general examples of the configuration model. The main goal of the lecture is revision of some techniques of the course, plus one new one, in a fresh setting, and the strongest possible versions of many of these results can be found amongst the references listed at the end.

Connectedness

In the lecture, we showed that $G^3(2n)$ is connected with high probability. This is, in fact, a very weak result, since $G^d(n)$ is d-connected with high probability for $d\ge 3$ [Bol81, Wor81]. Here, d-connected means that one must remove at least d vertices in order to disconnect the graph, or, equivalently, that there are d disjoint paths between any pair of vertices. Furthermore, Bollobas shows that for $d\ge 3$, $G^d(n)$ is a (random) expander family [Bol88].

Anyway, for the purposes of this course, the main tool is direct enumeration. The number $M_{2k}$ of perfect matchings of 2k objects satisfies

$M_{2k}=(2k-1)\times (2k-3)\times\ldots\times 3\times 1 = \frac{(2k)!}{2^k \cdot k!},$

and so Stirling’s approximation gives the asymptotics

$M_{2k} = (\sqrt{2}+o(1)) \left(\frac{2}{e}\right)^k k^k,$

although it will be useful to use the true bounds

$c \left(\frac{2}{e}\right)^k k^k \le M_{2k}\le C\left(\frac{2}{e}\right)^k k^k,\quad \forall k,$
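Both the closed form and the asymptotics are easy to check numerically; working with logarithms via `lgamma` avoids overflow for large k.

```python
import math

def log_matchings(two_k):
    """log M_{2k}, where M_{2k} = (2k)! / (2^k k!)."""
    k = two_k // 2
    return math.lgamma(two_k + 1) - k * math.log(2) - math.lgamma(k + 1)

# Small case by direct double factorial: M_6 = 5 * 3 * 1 = 15.
assert round(math.exp(log_matchings(6))) == 15

# The ratio to the asymptotic sqrt(2) (2/e)^k k^k tends to 1 as k grows.
for k in (10, 100, 1000):
    log_approx = 0.5 * math.log(2) + k * (math.log(2) - 1 + math.log(k))
    ratio = math.exp(log_matchings(2 * k) - log_approx)
    assert abs(ratio - 1) < 1 / k       # the error is in fact O(1/k)
```
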

instead in some places. Anyway, in $G^3(2n)$, there are 6n half-edges in total, and so, by a union bound over the choice of classes, the probability that the graph may be split into two parts consisting of $2\ell,2m$ vertices, with $2\ell+2m=2n$, and with no edges between the classes is at most $\frac{\binom{2n}{2\ell} M_{6\ell}M_{6m}}{M_{6n}}.$

# Lecture 9 – Inhomogeneous random graphs

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we enter the final stages of the semester, I want to discuss some extensions to the standard Erdos-Renyi random graph which has been the focus of most of the course so far. In doing so, we can revisit material that we have already covered, and discover how easily one can extend this directly to more exotic settings.

The focus of this lecture was the model of inhomogeneous random graphs (IRGs) introduced by Soderberg [Sod02] and first studied rigorously by Bollobas, Janson and Riordan [BJR07]. Soderberg and this blog post address the case where vertices have a type drawn from a finite set. BJR address the setting with more general typespaces, in particular a continuum of types. This generalisation is essential if one wants to use IRGs to model effects more sophisticated than those of the classical Erdos-Renyi model G(n,c/n), but most of the methodology is present in the finite-type setting, which avoids the operator-theoretic language that is perhaps intimidating for a first-time reader.

Inhomogeneous random graphs

Throughout, $k\ge 2$ is fixed. A graph with k types is a graph G=(V,E) together with a type function $V\to \{1,\ldots,k\}$. We will refer to a $k\times k$ symmetric matrix with non-negative entries as a kernel.

Given $n\in\mathbb{N}$ and a vector $p=(p_1,\ldots,p_k)\in\mathbb{N}_0^k$ satisfying $\sum p_i=n$, and $\kappa$ a kernel, we define the inhomogeneous random graph $G^n(p,\kappa)$ with k types as:

• the vertex set is [n],
• types are assigned uniformly at random to the vertices such that exactly $p_i$ vertices have type i.
• Conditional on these types, each edge $v\leftrightarrow w$ (for $v\ne w\in [n]$) is present, independently, with probability

$1 - \exp\left(-\frac{\kappa_{\mathrm{type}(v),\mathrm{type}(w)} }{n} \right).$
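The definition translates directly into a sampler. This sketch (names mine) takes the type counts $p$ as a list and the kernel $\kappa$ as a nested list:

```python
import math
import random

def sample_irg(type_counts, kernel, rng=random):
    """Sample G^n(p, kappa): returns the type of each vertex (assigned
    uniformly at random with the given counts) and the edge list, where
    v ~ w is present with probability 1 - exp(-kappa[t_v][t_w] / n)."""
    n = sum(type_counts)
    types = [i for i, c in enumerate(type_counts) for _ in range(c)]
    rng.shuffle(types)                   # uniform assignment with fixed counts
    edges = [(v, w)
             for v in range(n) for w in range(v + 1, n)
             if rng.random() < 1 - math.exp(-kernel[types[v]][types[w]] / n)]
    return types, edges
```

With a kernel such as $\begin{pmatrix}3&1\\1&2\end{pmatrix}$, same-type edges are more frequent on average than cross-type ones, as in the two-colour example discussed below.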

Notes on the definition:

• Alternatively, we could assign the types so that vertices $\{1,\ldots,p_1\}$ have type 1, $\{p_1+1,\ldots,p_1+p_2\}$ have type 2, etc etc. This makes no difference except in terms of the notation we have to use if we want to use exchangeability arguments later.
• An alternative model considers some distribution $\pi$ on [k], and assigns the types of the vertices of [n] in an IID fashion according to $\pi$. Essentially all the same results hold for these two models. (For example, this model with ‘random types’ can be studied by quenching the number of each type!) Often one works with whichever model seems easier for a given proof.
• Note that the edge probability given is $\approx \frac{\kappa_{\mathrm{type}(v),\mathrm{type}(w)}}{n}$. The exponential form has a more natural interpretation if we ever need to turn the IRGs into a process. Additionally, it avoids the requirement to treat small values of n (for which, a priori, $\kappa_{i,j}/n$ might be greater than 1) separately.

In the above example, one can see that, roughly speaking, red vertices are more likely to be connected to each other than blue vertices are. Moreover, for both colours, a vertex is more likely to be connected to a given vertex of its own colour than to a given vertex of the opposite colour. This might, for example, correspond to the kernel $\begin{pmatrix}3&1\\1&2\end{pmatrix}$.

The definition given above corresponds to a sparse setting, where the typical vertex degrees are $\Theta(1)$. Obviously, one can set up an inhomogeneous random graph in a dense regime by an identical argument.

From an applications point of view, it’s not hard to imagine that an IRG of some flavour might be a good model for many phenomena observed in reality, especially when a mean-field assumption is somewhat appropriate. The friendships of boys and girls in primary school seems a particularly resonant example, though doubtless there are many others.

One particular application is to recover the types of the vertices from the topology of the graph. That is, if you see the above picture without the colours, can you work out which vertices are red, and which are blue? (Assuming you know the kernel.) This is clearly impossible to do with anything like certainty in the sparse setting – how does one decide about isolated vertices, for example? The probabilities that a red vertex is isolated and that a blue vertex is isolated differ by a constant factor in the $n\rightarrow\infty$ limit. But in the dense setting, one can achieve this with high confidence. When studying such statistical questions, these IRGs are often referred to as stochastic block models, and the recent survey of Abbe [Abbe] gives a very rich history of this type of problem in this setting.

Poisson multitype branching processes

As in the case of the classical random graph G(n,c/n), we learn a lot about the IRG by studying its local structure. Let’s assume from now on that we are given a sequence of IRGs $G^n(p^n,\kappa)$ for which $\frac{p^n}{n}\rightarrow \pi$, where $\pi=(\pi_1,\ldots,\pi_k)\in[0,1]^k$ satisfies $||\pi||_1=1$.

Now, let $v^n$ be a uniformly-chosen vertex in [n]. Clearly $\mathrm{type}(v^n)\stackrel{d}\rightarrow \pi$, with the immediate mild notation abuse of viewing $\pi$ as a probability distribution on [k].

Then, conditional on $\mathrm{type}(v^n)=i$:

• when $j\ne i$, the number of type j neighbours of $v^n$ is distributed as $\mathrm{Bin}\left(p_j,1-\exp\left(-\frac{\kappa_{i,j}}{n}\right)\right)$.
• the number of type i neighbours of $v^n$ is distributed as $\mathrm{Bin}\left( p_i-1,1-\exp\left(-\frac{\kappa_{i,i}}{n}\right)\right)$.

Note that $p_j\left[1-\exp\left(-\frac{\kappa_{i,j}}{n}\right)\right]\approx \frac{p_j\cdot \kappa_{i,j}}{n} \approx \kappa_{i,j}\pi_j$, and similarly in the case j=i, so in both cases, the number of neighbours of type j is distributed approximately as $\mathrm{Poisson}(\kappa_{i,j}\pi_j)$.
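The quality of this Poisson approximation can be quantified numerically: the total variation distance between the binomial neighbour count and its Poisson limit vanishes as $n\to\infty$. A quick check, with an illustrative kernel entry $\kappa_{i,j}=2$ and type frequency $\pi_j=1/2$ of my choosing:

```python
import math

def binom_pmf(m, q, k):
    return math.comb(m, k) * q**k * (1 - q)**(m - k) if k <= m else 0.0

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

kappa, pi_j = 2.0, 0.5   # illustrative kernel entry and type frequency

def tv_distance(n):
    """TV distance between Bin(p_j, 1 - exp(-kappa/n)) and Poisson(kappa pi_j)."""
    p_j = int(pi_j * n)
    q = 1 - math.exp(-kappa / n)
    lam = kappa * pi_j
    # both tails beyond k = 60 are negligible at these parameter values
    return 0.5 * sum(abs(binom_pmf(p_j, q, k) - poisson_pmf(lam, k))
                     for k in range(60))

assert tv_distance(1000) < tv_distance(20)
assert tv_distance(1000) < 0.01
```
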

This motivates the following definition of a branching process tree, whose vertices have k types.

# Lecture 7 – The giant component

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we edge into the second half of the course, we are now in a position to return to the question of the phase transition between the subcritical regime $\lambda<1$ and the supercritical regime $\lambda>1$ concerning the size of the largest component $L_1(G(n,\lambda/n))$.

In Lecture 3, we used the exploration process to give upper bounds on the size of this largest component in the subcritical regime. In particular, we showed that

$\frac{1}{n}\big| L_1(G(n,\lambda/n)) \big| \stackrel{\mathbb{P}}\rightarrow 0.$

If we used slightly stronger random walk concentration estimates (Chernoff bounds rather than 2nd-moment bounds from Chebyshev’s inequality), we could in fact have shown that with high probability the size of this largest component was at most some logarithmic function of n.

In this lecture, we turn to the supercritical regime. In the previous lecture, we defined various forms of weak local limit, and asserted (without attempting the notationally-involved combinatorial calculation) that the random graph $G(n,\lambda/n)$ converges locally weakly in probability to the Galton-Watson tree with $\text{Poisson}(\lambda)$ offspring distribution, as we’ve used informally earlier in the course.

Of course, when $\lambda>1$, this branching process has strictly positive survival probability $\zeta_\lambda>0$. At a heuristic level, we imagine that all vertices whose local neighbourhood is ‘infinite’ are in fact part of the same giant component, which should occupy $(\zeta_\lambda+o_{\mathbb{P}}(1))n$ vertices. In its most basic form, the result is

$\frac{1}{n}\big|L_1(G(n,\lambda/n))\big|\;\stackrel{\mathbb{P}}\longrightarrow\; \zeta_\lambda,\quad \frac{1}{n}\big|L_2(G(n,\lambda/n))\big| \;\stackrel{\mathbb{P}}\longrightarrow\; 0,$ (*)

where the second part is a uniqueness result for the giant component.
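The limit constant is computable in practice: $\zeta_\lambda$ is the largest root of $\zeta = 1 - e^{-\lambda\zeta}$, and fixed-point iteration started from 1 converges to it. A quick sketch:

```python
import math

def survival_prob(lam, iters=500):
    """zeta_lambda: largest solution of z = 1 - exp(-lam z), found by
    iterating from z = 1 (converges to 0 in the subcritical case)."""
    z = 1.0
    for _ in range(iters):
        z = 1 - math.exp(-lam * z)
    return z

assert survival_prob(0.8) < 1e-10          # subcritical: no giant component
zeta2 = survival_prob(2.0)
assert abs(zeta2 - (1 - math.exp(-2 * zeta2))) < 1e-12   # genuine fixed point
assert 0.79 < zeta2 < 0.81                 # roughly 0.797 for lambda = 2
```
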

The usual heuristic for proving this result is that all ‘large’ components must in fact be joined. For example, if there are two giant components, with sizes $\approx \alpha n,\approx \beta n$, then each time we add a new edge (such an argument is often called ‘sprinkling‘), the probability that this edge joins the two components is $\approx 2\alpha\beta$, and so if we add lots of edges (which happens as we move from edge probability $\lambda-\epsilon\mapsto \lambda$) then with high probability these two components get joined.

It is hard to make this argument rigorous, and the normal approach is to show that with high probability there are no components with sizes within a certain intermediate range (say between $\Theta(\log n)$ and $n^\alpha$) and then show that all larger components are the same by a joint exploration process or a technical sprinkling argument. Cf the books of Bollobas and of Janson, Luczak, Rucinski. See also this blog post (and the next page) for a readable online version of this argument.

I can’t find any version of the following argument, which takes the weak local convergence as an assumption, in the literature, but it seems appropriate to this course. It is worth noting that, as we shall see, the method is not hugely robust to adjustments in case one is, for example, seeking stronger estimates on the giant component (eg a CLT).

Anyway, we proceed in three steps:

Step 1: First we show, using the local limit, that for any $\epsilon>0$,

$\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \le \zeta_\lambda+\epsilon,$ with high probability as $n\rightarrow\infty$.

Step 2: Using a lower bound on the exploration process, for $\epsilon>0$ small enough

$\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \ge \epsilon,$ with high probability.

Step 3: Motivated by duality, we count isolated vertices to show

$\mathbb{P}(\epsilon n\le |L_1| \le (\zeta_\lambda-\epsilon)n) \rightarrow 0.$

We will return to uniqueness at the end.

Step 1

This step is unsurprising. The local limit gives control on how many vertices are in small components of various sizes, and so gives control on how many vertices are in small components of all finite sizes (taking limits in the right order). This gives a bound on how many vertices can be in the giant component.

# Lecture 6 – Local limits

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

By this point of the course, we’ve studied several aspects of the Erdos-Renyi random graph, especially in the sparse setting $G(n,\frac{\lambda}{n})$. We’ve also taken a lengthy detour to revise Galton-Watson trees, with a particular focus on the case of Poisson offspring distribution.

This is deliberate. Note that a given vertex v of $G(n,\frac{\lambda}{n})$ has some number of neighbours distributed as $\mathrm{Bin}(n-1,\frac{\lambda}{n})\stackrel{d}\approx\mathrm{Po}(\lambda)$, and the same approximation remains valid as we explore the graph (for example in a breadth-first fashion) either until we have seen a large number of vertices, or unless some ultra-pathological event happens, such as a vertex having degree n/3.

In any case, we are motivated by the notion that the local structure of $G(n,\frac{\lambda}{n})$ is well-approximated by the Galton-Watson tree with $\mathrm{Po}(\lambda)$ offspring, and in this lecture and the next we try to make this notion precise, and discuss some consequences when we can show that this form of convergence occurs.

Deterministic graphs

Throughout, we will be interested in rooted graphs, since by definition we have to choose a root vertex whose local neighbourhood is to be studied. Usually, we will study a sequence of rooted graphs $(G_n,\rho_n)$, where the vertex set of $G_n$ is [n], or certainly increasing in n (as in the first example).

For some rooted graph $(G,\rho)$, we say such a sequence $(G_n,\rho_n)$ converges to $(G,\rho)$ locally if for all radii $r\ge 1$, we have $B_r^{G_n}(\rho_n)\simeq B_r^G(\rho)$ for all n sufficiently large. In words, the neighbourhood around $\rho_n$ in $G_n$ is the same up to radius r as the neighbourhood around $\rho$ in $G$, so long as n is large enough (for given r).

This is best illustrated by an example, such as $T_n$, the binary tree to depth n.

If we take $\rho_n$ to be the usual root, then the trees are nested, and converge locally to the infinite binary tree $T_\infty$. Slightly less obviously, if we take $\rho_n$ to be one of the leaves, then the trees are still nested (up to labelling – ie in the sense of isomorphisms of rooted trees), and converge locally to theÂ canopy tree, defined by a copy of $\mathbb{Z}_{\ge 0}$ with nearest-neighbour edges, and where each vertex $n\ge 1$ is connected to the root of a disjoint copy of $T_{n-1}$, as shown below:

Things get more interesting when the root is chosen randomly, for example, uniformly at random, as this encodes more global information about the graphs $G_n$. In the case where the $G_n$ are vertex-transitive, if we only care about rooted graphs up to isomorphism, it doesn’t matter how we choose the root.

Otherwise, we say that $G_n$ converges in the local weak sense to $(G,\rho)$ if, with $\rho_n$ a uniformly-chosen vertex of $G_n$, for all $r\ge 1$ and for all rooted graphs $(H,\rho_H)$,

$\mathbb{P}\left( B^{G_n}_r(\rho_n)\simeq (H,\rho_H) \right) \longrightarrow \mathbb{P}\left( B_r^G(\rho)\simeq (H,\rho_H)\right),$

as $n\rightarrow\infty$.

Alternatively, one can phrase this as a result about convergence of rooted-graph-valued distributions.

A simple non-transitive example is $G_n\simeq P_n$, the path of length n. Then, the r-neighbourhood of a vertex is isomorphic to $P_{2r}$, unless that vertex is within graph-distance (r-1) of one of the leaves of $G_n$. As $n\rightarrow\infty$, the proportion of such vertices vanishes, and so, $\mathbb{P}\left( B^{P_n}_r(\rho_n)\simeq P_{2r}\right)\rightarrow 1$, from which we conclude the unsurprising result that $P_{n}$ converges in the local weak sense to $\mathbb{Z}$. (Which is vertex-transitive, so it doesn’t matter where we select the root.)

The binary trees offer a slightly richer perspective. Let $\mathcal{L}_n$ be the set of leaves of $T_n$, and we claim that when $\rho_n$ is chosen uniformly from the vertices of $T_n$, then $d_{T_n}(\rho_n,\mathcal{L}_n)$ converges in distribution. Indeed, $\mathbb{P}\left( d_{T_n}(\rho_n,\mathcal{L}_n)=k\right) = \frac{2^{n-k}}{2^{n+1}-1}$, whenever $n\ge k$, and so the given distance converges in distribution to the Geometric distribution with parameter 1/2 supported on {0,1,2,…}.
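This calculation is easy to verify directly: $T_n$ has $2^{n+1}-1$ vertices, of which $2^{n-k}$ lie at distance k from the leaf set, and the resulting law converges to Geometric(1/2).

```python
def depth_law(n):
    """P(d(rho_n, leaves) = k) for a uniform vertex rho_n of the binary
    tree T_n, which has 2^(n+1) - 1 vertices and 2^(n-k) at distance k."""
    total = 2**(n + 1) - 1
    return [2**(n - k) / total for k in range(n + 1)]

law = depth_law(30)
assert abs(sum(law) - 1) < 1e-12                       # a genuine law
for k in range(10):
    assert abs(law[k] - 0.5**(k + 1)) < 1e-8           # Geometric(1/2) limit
```
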

This induces a random local weak limit, namely the canopy tree, rooted at one of the vertices we denoted by $\mathbb{Z}_{\ge 0}$, with the choice of this vertex given by Geometric(1/2).

# Lecture 4 – Hitting time theorem

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

This lecture consisted of revision of the most relevant theory of Galton-Watson trees, with a focus on the case where the offspring distribution is Poisson, since, as we have seen in previous lectures, this is a strong candidate to approximate the structure of $G(n,\lambda/n)$. It makes sense to cover the theory of the trees before attempting to make rigorous the sense of approximation.

Given a Galton-Watson tree T, it is natural to label the vertices in a breadth-first order as $\varnothing=v_1,v_2,\ldots,v_{|T|}$. This is easiest if we have constructed the Galton-Watson tree as a subset of the infinite Ulam-Harris tree, where vertices have labels like (3,5,17,4), whose parent is (3,5,17). If this child vertex is part of the tree, then so are (3,5,17,1), (3,5,17,2), and (3,5,17,3). This means our breadth-first order is canonically well-defined, as we have a natural ordering of the children of each parent vertex.

Note: one advantage of using breadth-first order rather than depth-first order (which corresponds to the usual dictionary, or lexicographic ordering of the labels) is that if the tree is infinite, we don’t explore all of it during a depth-first search. (In the sense that there exist vertices which are never given a finite label.) For breadth-first search, a similar problem arises precisely when some vertex has infinitely many children. For a conventional Galton-Watson tree, the latter situation is much less of a problem than the infinite total population problem, which happens with positive probability whenever $\mu=\mathbb{E}[X]>1$.

Anyway, given the breadth-first order, one can consider an exploration process $S_0,S_1,S_2,\ldots,S_{|T|}$ given by

$S_0=1,\quad S_i=S_{i-1}+(X_i-1),$

where $X_i$ is the number of children of $v_i$. In this way, we see that

$S_i=\big| \left(\Gamma(v_1)\cup\ldots\cup\Gamma(v_i)\right)\backslash \{v_1,\ldots,v_i\}\big|,\quad i\ge 1,$

records the number of vertices in some stack containing those which we have ‘seen but not explored’. Some authors prefer to start from 0, in which case one ends up with a similar but slightly different interpretation of the ‘stack’, but that’s fine since we aren’t going to define formally what ‘seen’ and ‘explored’ mean in this post.

Essentially, we have exhausted the vertices of the tree when $S_t=0$ for the first time, and so the condition that $|T|=n$ requires

$S_n=0,\quad S_m\ge 1,\; m=0,1,\ldots,n-1.$

Conveniently, so long as we have avoided ordering ambiguity, for example by insisting that trees live within the Ulam-Harris tree, we can reconstruct T uniquely from $(S_0,S_1,\ldots,S_{|T|})$.

Furthermore, if T is a Galton-Watson process, then the numbers of children $X_i$ are IID, and so in fact this exploration process is a random walk, and the size of the tree can be recovered as the hitting time of zero.
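In code, this reduces simulating $|T|$ to running the walk until it hits zero. A sketch with Poisson offspring (the inverse-transform sampler and function names are my own; in the subcritical case used below the walk almost surely hits zero):

```python
import math
import random

def poisson_sample(lam, rng):
    """Inverse-transform sampler for Poisson(lam)."""
    u, k = rng.random(), 0
    p = c = math.exp(-lam)
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k

def gw_tree_size(lam, rng, max_steps=10**6):
    """|T| for Poisson(lam) offspring, as the hitting time of 0 by the
    exploration walk S_0 = 1, S_i = S_{i-1} + (X_i - 1)."""
    s = 1
    for i in range(1, max_steps + 1):
        s += poisson_sample(lam, rng) - 1
        if s == 0:
            return i
    return None      # walk never hit 0: treat the tree as infinite

rng = random.Random(2023)
sizes = [gw_tree_size(0.8, rng) for _ in range(500)]
assert all(s is not None and s >= 1 for s in sizes)
# E|T| = 1/(1 - lam) = 5 in the subcritical case lam = 0.8
assert 3 < sum(sizes) / len(sizes) < 8
```
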

Note: making fully rigorous the argument that the numbers of children in the GW tree are IID along the breadth-first walk is somewhat technical, and not to be dismissed lightly, though not of principal interest at the level of this topics course. See Proposition 1.5 in Section 1.2 of Le Gall’s notes or Section 1.2.2 of my doctoral thesis for further discussion and argument.

The hitting time theorem allows us to study the distribution of the hitting time of a random walk whose increments are bounded below by -1, in terms of the distribution of the value of the random walk.

Theorem: Let $(S_n,\, n\ge 0)$ be a random walk with $S_0=0$ and IID increments $(X_n,n\ge 1)$ satisfying $\mathbb{P}(X_n\ge -1)=1$. Let $H_{-k}=\inf \left\{n\,:\, S_n=-k\right\}$ be the hitting time of -k.

Then $\mathbb{P}\big( H_{-k}=n\big) = \frac{k}{n}\mathbb{P}\big(S_n=-k\big)$.
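For a simple random walk (increments ±1 with probability 1/2 each, so bounded below by -1), the theorem can be verified exactly for small n by enumerating all paths with rational arithmetic:

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

for n in (1, 3, 5, 7):
    hit = walk = Fraction(0)
    for seq in product((-1, 1), repeat=n):
        pr = half**n
        s, first_hit = 0, None
        for i, x in enumerate(seq, start=1):
            s += x
            if s == -1 and first_hit is None:
                first_hit = i
        if first_hit == n:
            hit += pr                 # contributes to P(H_{-1} = n)
        if s == -1:
            walk += pr                # contributes to P(S_n = -1)
    assert hit == Fraction(1, n) * walk   # the hitting time theorem, k = 1
```

For instance, with n=3 the only path first hitting -1 at time 3 is (+1,-1,-1), of probability 1/8, while $\mathbb{P}(S_3=-1)=3/8$, and indeed $\frac{1}{3}\cdot\frac{3}{8}=\frac{1}{8}$.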

Commentary: there are local central limit theorem estimates and large deviation estimates that allow good control of the probability on the RHS for a rich class of contexts. So at a meta-level, the hitting time theorem allows us to reduce a complicated (though still classical) problem to a real classical problem, which is particularly helpful when the LHS is a device for capturing relevant information about our random tree model.

# Lecture 3 – Couplings, comparing distributions

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

In this third lecture, we made our first foray into the scaling regime for G(n,p) which will be the main focus of the course, namely the sparse regime when $p=\frac{\lambda}{n}$. The goal for today was to give a self-contained proof of the result that in the subcritical setting $\lambda<1$, with high probability as $n\rightarrow\infty$ there is no giant component, that is, no component supported on a positive proportion of the vertices.

More formally, we showed that the proportion of vertices contained within the largest component of $G(n,\frac{\lambda}{n})$ vanishes in probability:

$\frac{1}{n} \left| L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right| \stackrel{\mathbb{P}}\longrightarrow 0.$

The argument for this result involves anÂ exploration process of a component of the graph. This notion will be developed more formally in future lectures, aiming for good approximation rather than bounding arguments.

But for now, the key observation is that when we ‘explore’ the component of a uniformly chosen vertex $v\in[n]$ outwards from v, at all times the number of ‘children’ of the current vertex which haven’t already been considered is ‘at most’ $\mathrm{Bin}(n-1,\frac{\lambda}{n})$. For example, if we already know that eleven vertices, including the current one w, are in C(v), then the number of new vertices added to consideration because they are directly connected to w has conditional distribution $\mathrm{Bin}(n-11,\frac{\lambda}{n})$.

Firstly, we want to formalise the notion that this is ‘less than’ $\mathrm{Bin}(n,\frac{\lambda}{n})$, and also that, so long as we don’t replace 11 by a linear function of n, $\mathrm{Bin}(n-11,\frac{\lambda}{n})\stackrel{d}\approx \mathrm{Poisson}(\lambda)$.
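The first of these comparisons is transparent once both binomials are built from the same coin flips. A quick sketch (the numbers $n$, $\lambda$ and the sample count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n, lam = 1000, 0.8
p = lam / n

# Shared source of randomness: for each of 5000 trials, n Bernoulli(p) coins.
coins = rng.random((5000, n)) < p

# Bin(n-11, p) uses the first n-11 coins; Bin(n, p) uses all n of them.
X_small = coins[:, :n - 11].sum(axis=1)
X_big = coins.sum(axis=1)

# Under this coupling the comparison holds pointwise, not just in distribution.
assert (X_small <= X_big).all()
```

The point is that stochastic domination, which takes effort to verify by comparing distribution functions, becomes a trivial pointwise inequality under the right coupling.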

Couplings to compare distributions

A coupling of two random variables (or distributions) X and Y is a realisation $(\hat X,\hat Y)$ on the same probability space with correct marginals, that is

$\hat X\stackrel{d}=X,\quad \hat Y\stackrel{d}=Y.$

We saw earlier in the course that we could couple G(n,p) and G(n,q) by simulating both from the same family of uniform random variables, and it’s helpful to think of this in general: ‘constructing the distributions from the same source of randomness’.
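A minimal sketch of that coupling (the function name and parameters are my own, not from the course notes): one uniform per potential edge, thresholded at $p$ and at $q$ respectively.

```python
import random
from itertools import combinations

def coupled_gnp(n, p, q, seed=0):
    """Couple G(n,p) and G(n,q), p <= q, from one family of uniforms:
    edge {i,j} is present in the first graph iff U_ij < p,
    and in the second iff U_ij < q."""
    rng = random.Random(seed)
    edges_p, edges_q = set(), set()
    for i, j in combinations(range(n), 2):
        u = rng.random()
        if u < p:
            edges_p.add((i, j))
        if u < q:
            edges_q.add((i, j))
    return edges_p, edges_q

Ep, Eq = coupled_gnp(50, 0.1, 0.3)
assert Ep <= Eq  # G(n,p) is a subgraph of G(n,q) under this coupling
```

Under this coupling, any increasing event (such as connectivity) that holds for the sparser graph automatically holds for the denser one, which is exactly the monotonicity that is painful to see from the explicit formula.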

Couplings are a useful notion to digest at this point, as they embody a general trend in discrete probability theory. Wherever possible, we try to do as much as we can with the random objects themselves, before starting any calculations. Think about the connectivity property of G(n,p) as discussed in the previous lecture. This can be expressed directly as a function of p in terms of a large sum, but showing it is an increasing function of p is essentially impossible by computation, whereas it is very straightforward using the coupling.

We will now review how to use couplings to compare distributions.

For a real-valued random variable X, with distribution function $F_X$, we always have the option to couple with a uniform $U[0,1]$ random variable. That is, when $U\sim U[0,1]$, we have $F_X^{-1}(U)\stackrel{d}= X$, where the inverse of the distribution function is defined (in the non-obvious case of atoms) as

$F_X^{-1}(u)=\inf\left\{ x\in\mathbb{R}\,:\, F_X(x)\ge u\right\}.$

Note that when the value taken by U increases, so does the value taken by $F_X^{-1}(U)$. This coupling can be used simultaneously on two random variables X and Y, as $(F_X^{-1}(U),F_Y^{-1}(U))$, to generate a coupling of X and Y.
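To make the generalised inverse concrete, here is a small sketch for discrete laws (the two distributions on $\{0,1,2\}$ are illustrative choices of mine, picked so that one dominates the other stochastically):

```python
import bisect
import random

def make_quantile(support, probs):
    """Generalised inverse F^{-1}(u) = inf{x : F(x) >= u} of a discrete law."""
    cum = []
    total = 0.0
    for p in probs:
        total += p
        cum.append(total)
    cum[-1] = 1.0  # guard against floating-point undershoot at the top
    def quantile(u):
        return support[bisect.bisect_left(cum, u)]
    return quantile

# F_Y >= F_X pointwise here, i.e. Y is stochastically smaller than X.
qX = make_quantile([0, 1, 2], [0.2, 0.5, 0.3])
qY = make_quantile([0, 1, 2], [0.5, 0.3, 0.2])

# Coupling both through the SAME uniform turns stochastic domination
# into pointwise domination.
rng = random.Random(1)
for _ in range(1000):
    u = rng.random()
    assert qY(u) <= qX(u)
```

This is the same moral as before: feeding both quantile functions the same source of randomness converts a distributional inequality into a sure one.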

The total variation distance between two probability measures is

$d_{\mathrm{TV}}(\mu,\nu):= \sup_{A}|\mu(A)-\nu(A)|$,

with the supremum taken over all measurable events A. This is particularly clear in the case of discrete measures with joint support S, as then

$d_{\mathrm{TV}}(\mu,\nu)=\frac12 \sum_{x\in S} \left| \mu\left(\{x\}\right) - \nu\left(\{x\}\right) \right|.$

(Think of the difference in heights between the bars, when you plot $\mu,\nu$ simultaneously as a bar graph…)

The total variation distance records how well we can couple two distributions, if we want them to be equal as often as possible. It is therefore a poor measure of closeness for distributions with disjoint supports. For example, the distributions $\delta_0$ and $\delta_{1/n}$ are at distance 1 (the maximum) for all values of n. Similarly, the uniform distribution on [0,1] and the uniform distribution on $\{0,1/n,2/n,\ldots,(n-1)/n,1\}$ are also at distance 1.

When there is more overlap, the following result is useful.

Proposition: Any coupling $(\hat X,\hat Y)$ of $X\sim \mu,\,Y\sim \nu$ satisfies $\mathbb{P}(\hat X=\hat Y)\le 1-d_{\mathrm{TV}}(\mu,\nu)$, and there exists a coupling for which equality is achieved.
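The coupling achieving equality can be built explicitly in the discrete case: with probability equal to the overlap mass $\sum_x \mu(x)\wedge\nu(x) = 1-d_{\mathrm{TV}}(\mu,\nu)$, draw a common value from the normalised overlap; otherwise draw the two coordinates independently from the normalised excess parts. A sketch, with illustrative two-point distributions:

```python
import random

def pick(weights, total, rng):
    """Draw from a finite weighted dictionary (weights need not sum to 1)."""
    u = rng.random() * total
    for x, p in weights.items():
        u -= p
        if u <= 0:
            return x
    return x  # floating-point guard: return the last value

def maximal_coupling(mu, nu, rng):
    """Sample (X, Y) with marginals mu, nu and P(X = Y) = 1 - d_TV(mu, nu)."""
    support = set(mu) | set(nu)
    overlap = {x: min(mu.get(x, 0.0), nu.get(x, 0.0)) for x in support}
    w = sum(overlap.values())  # = 1 - d_TV(mu, nu)
    if rng.random() < w:
        x = pick(overlap, w, rng)
        return x, x
    mu_excess = {x: mu.get(x, 0.0) - overlap[x] for x in support}
    nu_excess = {x: nu.get(x, 0.0) - overlap[x] for x in support}
    return pick(mu_excess, 1 - w, rng), pick(nu_excess, 1 - w, rng)

# Here d_TV(mu, nu) = 0.3, so the coordinates should agree about 70% of the time.
mu = {0: 0.5, 1: 0.5}
nu = {0: 0.2, 1: 0.8}
rng = random.Random(0)
samples = [maximal_coupling(mu, nu, rng) for _ in range(20000)]
agree = sum(x == y for x, y in samples) / len(samples)
assert abs(agree - 0.7) < 0.03
```

Note that in the second branch the excess parts have disjoint supports, so the coordinates never agree there; this is exactly why no coupling can beat $1-d_{\mathrm{TV}}$.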