# Convex ordering on Galton-Watson trees

This blog was recently revived, via a post about convex ordering and its relevance to the problem of sampling with and without replacement that forms part of the potpourri of related results all sometimes referred to as Hoeffding’s inequality.

The previous post had been lying almost-complete but dormant for over two years. I revisited it because of a short but open-ended question about trees posed in our research group meeting by Serte Donderwinkel, one of our high-flying doctoral students.

Simplified Question:

For a Galton-Watson tree, can one obtain upper bounds in probability on the height of the tree, uniformly across all offspring distributions with mean $\mu$?

Note that in this setting, it is helpful to have in mind the classical notation $(Z_0,Z_1,Z_2,\ldots)$ for a Galton-Watson process, where typically $Z_0=1$, and $Z_{n+1}$ is the sum of $Z_n$ IID copies of the offspring distribution. Then we have

$\mathbb{P}(\mathrm{height}(\mathcal{T}) < k) = \mathbb{P}(Z_k=0).$

1) Subcritical case. When $\mu<1$, we certainly have $\mathbb{P}(Z_k>0)\le \mathbb{E}[Z_k]=\mu^k$.

Furthermore, if we’re studying all such offspring distributions, this is the best possible upper bound, by considering the offspring distribution given by $Z_1=1$ with probability $\mu$ and zero otherwise.

2) In the critical or supercritical case, $\mu\ge 1$ it is possible that the height is infinite with probability one.

So neither case is especially interesting for now.

Refined question:

What if instead we aren’t trying to obtain a bound uniformly across all offspring distributions with given mean $\mu$, but instead across a subset $\mathcal{X}$ of these distributions? How do we determine which distribution in $\mathcal{X}$ maximises the probability of reaching height k?

This is the question Serte was asking in our group meeting, in the setting where $\mu=1+o(1)$ and the height k has a particular scaling. Also, as far as I understand, the approach outlined in this post didn’t provide strong enough bounds in this particular context. Happily, Serte has recently tied up all the corners of this project concerning the supercritical Galton-Watson forest, and interested readers can find her preprint here on the Arxiv.

Nonetheless the interpretation via convex ordering feels perfect for a blog post, rather than being lost forever.

Convex ordering for offspring distributions

The main observation is that given two offspring distributions X and Y, such that $X\le_{cx} Y$ (which recall means that the means are the same but X is more concentrated) then a number of distributions associated to the Galton-Watson trees for X and Y also satisfy convex ordering relations.

As a warm-up, and because it was the original genesis, we first study heights. We will use the notation

$(Z_0^X,Z_1^X,Z_2^X,\ldots), (Z_0^Y,Z_1^Y,Z_2^Y,\ldots),$

to denote the two Galton-Watson processes. We shall compare $\mathbb{P}(Z^X_k=0)$ and $\mathbb{P}(Z^Y_k=0)$. If we write $\delta_0(\cdot)$ for the function defined on the non-negative integers such that

$\delta_0(0)=1,\quad \delta_0(n)=0,\,n\ge 1$,

it holds that $\delta_0(\cdot)$ is convex. In particular, if $X\le_{cx}Y$, then $\mathbb{E}[\delta_0(X)]\le \mathbb{E}[\delta_0(Y)]$, which exactly says that

$\mathbb{P}(Z^X_1 = 0)\le \mathbb{P}(Z^Y_1 = 0).$

We can then prove that $\mathbb{P}(Z^X_k=0)\le \mathbb{P}(Z^Y_k=0)$ by induction on $k\ge 1$. Note that $\mathbb{P}(Z^X_k=0)^n$ is a convex function of n, regardless of the value of this probability, and so we have

$\mathbb{P}(Z^X_{k+1}=0) = \mathbb{E}\left[ (\mathbb{P}(Z^X_k=0))^X\right] \le \mathbb{E}\left[(\mathbb{P}(Z^X_k=0))^Y\right].$

By the induction hypothesis, this final quantity is at most

$\mathbb{E}\left[(\mathbb{P}(Z^Y_k=0))^Y\right] = \mathbb{P}(Z^Y_{k+1}=0)$.

In conclusion, we have shown that $\mathbb{P}(Z^X_k=0)\le \mathbb{P}(Z^Y_k=0)$ holds for all k, and thus

$\mathrm{height}(\mathcal{T}^X) \ge_{st} \mathrm{height}(\mathcal{T}^Y).$

To return to the original context, suppose we have a large class of offspring distributions $\mathcal{Y}$ and a subclass $\mathcal{X}\subseteq \mathcal{Y}$ such that for all $Y\in\mathcal{Y}$, there exists $X\in \mathcal{X}$ such that $X\le_{cx} Y$. Then one can obtain uniform bounds on the heights of Galton-Watson trees with offspring distributions drawn from $\mathcal{Y}$ by checking those generated from distributions in $\mathcal{X}$ (which is particularly convenient if, for example, $\mathcal{X}$ is finite).

Convex ordering of generation sizes

The above argument solves the original problem, but brushes over the natural question: is it true that $Z^X_k \le_{cx} Z^Y_k$?

The answer is yes. Here’s a proof.

This follows from the following general statement:

Lemma 1: Suppose $X\le_{cx} Y$ are non-negative valued RVs and the non-negative integer valued RVs $M,N$ also satisfy $M \le_{cx} N$. Then

$X_1+\ldots+X_M \le_{cx} Y_1+\ldots Y_N,$

where $X_1,X_2,\ldots$ are IID copies of X and, independently, $Y_1,Y_2,\ldots$ are IID copies of Y.

Lemma 2: Suppose $W_1\le_{cx}Z_1$ and $W_2\le_{cx} Z_2$, and the four random variables are independent. Then $W_1+W_2\le_{cx}Z_1+Z_2$.

Proof of Lemma 2: First, note that for any random variable X, and convex function f

$\mathbb{E}\left[f(Z+x)\right]$ is a convex function of x.

(Indeed, this holds since “f(z+x) is convex” holds for every z, and any definition of convex will pass to the expectation.)

Now we can attack the lemma directly, we may write

$\mathbb{E}\left[ f(W_1+W_2)\right]=\mathbb{E}\left[\, \mathbb{E}[f(W_1+W_2) \mid W_2 ] \,\right] \le \mathbb{E}\left[\, \mathbb{E}[f(W_1+Z_2)\mid Z_2 ] \, \right].$

But then for any $z_2$, we know $f(\cdot+z_2)$ is convex, so $\mathbb{E}[f(W_1+z_2)]\le \mathbb{E}[f(Z_1+z_2)]$, and it follows that

$\mathbb{E}\left[ f(W_1+W_2)\right]\le \mathbb{E} \left[ f(Z_1+Z_2)\right],$

which proves the lemma.

Corollary 3: When $W_1,\ldots,W_m, Z_1,\ldots,Z_m$ are independent, and satisfy $W_i \le_{cx} Z_i$, then we have $W_1+\ldots+W_m\le_{cx} Z_1+\ldots+Z_m$.

Proof of Lemma 1: Note that

$\mathbb{E}\left[ f(X_1+\ldots+X_M)\mid M=n\right] \le \mathbb{E}\left[ f(Y_1+\ldots+Y_N)\mid N=n\right],$

follows from Corollary 3. So a useful question to consider is whether $\mathbb{E}\left[f(Y_1+\ldots+Y_n)\right]$ (*) is a convex function of n?

Denote this quantity by $F(n)$. To check convexity of a function defined on the integers, it suffices to verify that $F(n+1)-F(n)\ge F(n)-F(n-1)$.

There is a canonical coupling between the RVs used to define all of $F(n-1),F(n),F(n+1)$, but it will be convenient to adjust the coupling, and write:

$F(n+1)-F(n)= \mathbb{E}\left[ f(Y_1+\ldots+Y_n + Y^*) - f(Y_1+\ldots+Y_n)\right],$

$F(n)-F(n-1)=\mathbb{E}\left[f(Y_1+\ldots+Y_{n-1}+Y^*) - f(Y_1+\ldots+Y_{n-1})\right],$

where $Y^*$ is a further independent copy of $Y$. But note that for any choice $C\ge c$ and $y\in \mathbb{R}$,

$f(C+y) - f(C) - f(c+y) + f(c)\ge 0.$ (*)

(Essentially, this says that the ‘chord’ of f on the interval $[c,C+y]$ lies above the chord on interval $[C,c+y]$ or $[c+y,C]$, which some people choose to call Karamata’s inequality, but I think is more helpful to think of as part of the visual definition of convexity.)

In any case, setting $y=Y^*, c=Y_1+\ldots+Y_{n-1}, C=Y_1+\ldots+Y_n$ and taking expectations, we obtain

$\mathbb{E}\left[ f(Y_1+\ldots+Y_n+Y^*) - f(Y_1+\ldots+Y_n)\right.$

$\left.- f(Y_1+\ldots+Y_{n-1}+Y^*) + f(Y_1+\ldots+Y_{n-1})\right]\ge 0,$

as required. So $F(n)$ is convex. We may now finish off as

$\mathbb{E}\left[ X_1+\ldots+X_M\right] = \mathbb{E}\left[ \,\mathbb{E}[X_1+\ldots+X_M\mid M]\,\right] \le \mathbb{E}\left[\, \mathbb{E}[Y_1+\ldots+Y_M\mid M]\,\right] = \mathbb{E}[f(M)]\le \mathbb{E}[f(N)] = \mathbb{E}[Y_1+\ldots+Y_N],$

completing the proof of Lemma 1.

• The analysis in this post is not sufficient to study the total population sizes of two Galton-Watson trees generated by X and Y. Note that in Lemma 2, it is important that the random variables are independent. Otherwise, we could, for example, consider $\mathbb{E}[X]=\mathbb{E}[Y]=0$ with $X\le_{cx}Y$ but clearly it should not hold that $X_1+X_2 \le_{cx} Y + (-Y) = 0$. So for total population size, since $(Z^X_k,\,k\ge 1)$ are not independent, an alternative approach would be required.
• A further characterisation of convex ordering is given by Strassen’s theorem [Str65], which is touched on in the previous post, and to which I may return to in a future post on this topic. This may be a more promising avenue for established a convex ordering result on total population size.
• Lemma 1 requires that X,Y are non-negative. Note that during the argument we set $y=Y^*, c=Y_1+\ldots+Y_{n-1}, C=Y_1+\ldots+Y_n$, and when we relax the non-negative support condition, it is no longer guaranteed that $C\ge c$, which is crucial for the step which follows.
• In a recent article in ECP addressing Lemma 1 by a different method, Berard and Juillet [BJ20] provide a simple example showing that the non-negative assumption is genuinely necessary. Consider the random variable $\tau\in \{0,2\}$ with equal probability so $1\le_{cx} \tau$. But then, taking both X and Y to be simple random walk on $\mathbb{Z}$, we do not have $S_1\le_{cx}S_{\tau}$.

References

[BJ20] – Berard, Juillet – A coupling proof of convex ordering for compound distributions, 2020

[Str65] – Strassen – The existence of probability measures with given marginals, 1965

# Hoeffding’s inequality and convex ordering

A number of previous posts on this blog have discussed Markov’s inequality, and the control one can obtain on a distribution by applying this to functions of the distribution, or functions of random variables with this distribution. In particular, studying the square of the deviation from the mean $\left(X-\mathbb{E}X\right)^2$ leads to Chebyshev’s inequality, while even stronger bounds can sometimes be obtained using $e^{tX}$, for some appropriate value of $t\in\mathbb{R}$. Sometimes this final method, which relies on the finiteness of $\mathbb{E}[e^{tX}]$ is referred to as a Chernoff bound, especially when an attempt is made to choose the optimal value of t for the estimate under consideration.

The case of a random walk, that is the sum of IID increments, is particularly well-studied. In [Ho63], Hoeffding studies a generalisation where the increments are merely independent, and outlines a further generalisation to discrete-time martingales, which is developed by Azuma in [Az67], leading to the famous Azuma-Hoeffding inequality. A key feature of this theorem is that we require bounded increments, which leads to tight choices of $t_i$ in the Chernoff bound for each increment.

In addition, [Ho63] addresses the alternative model where the increments of a random walk are chosen uniformly without replacement from a particular set. The potted summary is that the sum of random increments chosen without replacement has the same mean, but is more concentrated that the corresponding sum of random increments chosen with replacement. This means that any of the concentration results proved in the earlier sections of [Ho63] for the latter situation apply equally to the setting without replacement.

Various versions of this result are referred to as Hoeffding’s inequality by the literature, and the main goal of this short article is to list some results which are definitely true about this setting!

(Note, in some contexts, ‘Hoeffding’s inequality’ will mean the concentration bound for either the with-replacement or the without-replacement setting. In this blog post, we are discussing inequalities comparing these two settings.)

Starting point – second moments in ‘vanilla’ Hoeffding

We won’t use this notation heavily, but for now let’s assume that we are studying a (multi)set of values $\mathcal{X}=\{x_1,\ldots,x_N\}$, and will study a sequence of m elements $(X_1,\ldots,X_m)$ from this set with replacement (ie chosen in an IID fashion), and a sequence of m elements $(Y_1,\ldots,Y_m)$ chosen from this set without replacement.

Some preliminary remarks:

• The $Y_i$s are still chosen uniformly which can be interpreted as
• Any permutation of the multiset $\{x_1,\ldots,x_n\}$ is equally likely to appear as $(Y_1,\ldots,Y_n)$;
• If determining the sequence one element at a time, then in a conditional sense, the choice of the ‘next’ element is uniform among those elements which have not yet been chosen.
• In particular, the sequence $(Y_1,\ldots,Y_m)$ is exchangeable (though not knowing what that means will not be a significant impediment to what follows).
• It is clear that $\mathbb{E}[X_1+\ldots+X_m]=\mathbb{E}[Y_1+\ldots+Y_m]$, so concentration around this common mean is the next question of interest.
• Intuitively, when $m\ll N$, we wouldn’t expect there to be much difference between sampling with or without replacement.
• When sampling without replacement, the cases m and N-m are symmetric with respect to the sum of $\mathcal{X}$ and, in particular, have the same concentration.

As a brief warmup, we establish that

$\mathrm{Var}(Y_1+\ldots+Y_m)\le \mathrm{Var}(X_1+\ldots+X_m).$

Note that the RHS is $m\sigma^2$, where $\sigma^2<\infty$ is the variance of a single uniform sample from $\mathcal{X}$. Since these sums have the same mean, it suffices to show that

$\mathbb{E}\left[ (Y_1+\ldots+Y_m)^2\right] \le \mathbb{E}\left[ (X_1+\ldots+X_m)^2\right].$

But this really has nothing to do with probability now, since these quantities are just symmetric polynomials in $\mathcal{X}$. That said, we gain a little bit if we use the exchangeability property to rewrite the LHS and the RHS as

$n\mathbb{E}[Y_1^2] + n(n-1)\mathbb{E}[Y_1Y_2] \le n\mathbb{E}[X_1^2] + n(n-1)\mathbb{E}[X_1X_2].$

Since $Y_1\stackrel{d}= X_1$, it remains to handle the cross term, which involves showing that

$\frac{1}{N(N-1)}\sum_{i\ne j} x_ix_j \le \frac{1}{N^2} \sum_{i,j}x_ix_j,$

which really is an algebra exercise, and can be demonstrated by the rearrangement inequality, or Chebyshev’s inequality (the other one – ie sequence-valued FKG not the probabilistic second-moment bound one), or by clever pre-multiplying and cancelling to reduce to Cauchy-Schwarz.

In [BPS18], which we will turn to in much greater detail shortly, the authors assert in the first sentence that Hoeffding treats also the setting of weighted sums, that is a comparison of

$\alpha_1X_1+\ldots+\alpha_mX_m,\quad \alpha_1Y_1+\ldots+\alpha_mY_m.$

A casual reader of [Ho63] may struggle initially to identify that this has indeed happened. At the level of second-moment comparison though, the argument given above carries over essentially immediately. We have

$\mathbb{E}\left[ (\alpha_1Y_1+\ldots+\alpha_n Y_n)^2\right] = \left(\alpha_1^2+\ldots+\alpha_n^2\right) \mathbb{E}[Y_1^2] + \left(\sum_{i\ne j} \alpha_i\alpha_j \right)\mathbb{E}\left[Y_1Y_2\right],$

and of course the corresponding decomposition for $(X_i)$, from which the variance comparison inequality follows as before.

Convex ordering

For many applications, variance is sufficient as a measure of concentration, but one can say more. As a prelude, note that one characterisation of stochastic ordering (which I have sometimes referred to as stochastic domination on this blog) is that

$X\le_{st}Y \;\iff\;\mathbb{E}[f(X)] \le \mathbb{E}[f(Y)],$

whenever f is a non-decreasing function. Unsurprisingly, it’s the $\Leftarrow$ direction which is harder to prove, though often the $\Rightarrow$ direction is the most useful in applications.

[To inform what follows, I used a number of references, in particular [BM05].]

A natural extension to handle concentration is (stochastic) convex ordering, which is defined as

$X\le_{cx} Y\;\iff \; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],$

whenever f is a convex function. Note immediately that this definition can only ever apply to random variables for which $\mathbb{E}X=\mathbb{E}Y$. Otherwise, consider the functions $f(x)=x, f(x)=-x$, both of which are convex.

This sort of comparison has plenty of application in economics and actuarial fields, where there is a notion of risk-aversion, and convexity is the usual criterion to define risk (in the sense that the danger of a significant loss is viewed as more significant than a corresponding possibility of a significant gain). Also note that in this context, X is preferred to Y by a risk-averse observer.

In fact, it can be shown [Oh69] that a sufficient condition for $X\le_{cx} Y$ is the existence of $\alpha\in\mathbb{R}$ such that

$F_X(x)\le F_Y(x)\;\iff\; x\ge \alpha,$

where $F_X(x)=\mathbb{P}(X\le x)$ is the usual distribution function of X. That is, the distribution functions ‘cross exactly once’. Note that in general $\alpha\ne \mathbb{E}X=\mathbb{E}Y$.

We will return later to the question of adapting this measure to handle distributions with different expectations, but for now, we will return to the setting of sampling with and without replacement and show that

$Y_1+\ldots+Y_m\le_{cx}X_1+\ldots+X_m.$

This is the result shown by Hoeffding, but by a more complicated method. We use the observations of [BPS18] which are stated in a more complicated setting which we may return to later. For notational convenience, let’s assume the elements of $\mathcal{X}$ are distinct so that ‘repetitions of $x_i$‘ has an unambiguous meaning.

We set up a natural coupling of $(X_i),(Y_i)$ as follows. Let $(X_i,i\ge 1)$ be IID uniform samples from $\mathcal{X}$, and generate $(Y_1,\ldots,Y_N)$ by removing any repetitions from $(X_1,X_2,\ldots)$.

Note then that given the sequence $(Y_1,\ldots,Y_m)$, the multiset of values $\{X_1,\ldots,X_m\}$ definitely includes $Y_1$ at least once, but otherwise includes each $Y_i$ either zero, one or more times, in a complicated way. However, this distribution doesn’t depend on the values taken by $(Y_i)$, since it is based on uniform sampling. In particular, if we condition instead on the set $\{Y_1,\ldots,Y_m\}$, we find that

$\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, \{Y_1,\ldots,Y_m\}\right] = Y_1+\ldots+Y_m,$

and so we may coarsen the conditioning to obtain

$\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, Y_1+\ldots+Y_m\right] = Y_1+\ldots+Y_m.$

This so-called martingale coupling is very useful for proving convex ordering. You can think of $\mathbb{E}[X|Y]=Y$ as saying “X is generated by Y plus additional randomness”, which is consistent with the notion that Y is more concentrated than X. In any case, it meshes well with Jensen’s inequality, since if f is convex, we obtain

$\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] = \mathbb{E}\left[ f(Y)\right].$

So we have shown that sampling without replacement is greater than sampling with replacement in the convex ordering, if we consider sums of the samples.

If we want to handle weighted sums, and tried to reproduce the previous argument directly, it would fail, since

$\mathbb{E}\left[\alpha_1X_1+\ldots+\alpha_nX_n\,\big|\,\{Y_1,\ldots,Y_m\}\right]=(\alpha_1+\ldots+\alpha_m)(Y_1+\ldots+Y_m),$

which is not in general equal to $\alpha_1Y_1+\ldots+\alpha_mY_m$.

Non-uniform weighted sampling

Of course, it’s not always the case that we want to sample according to the uniform distribution. My recent personal motivation for investigating this setup has been the notion of exploring a graph generated by the configuration model, for which vertices appear for the first time in a random order that is size-biased by their degrees.

In general we can think of each element of $\mathcal{X}$ as having some weight w(i), and at each step, we pick $x_i$ from the remaining elements with probability proportion to w(i). This procedure is known in other contexts as successive sampling. Note that the sequence $(Y_1,\ldots,Y_m)$ is no longer exchangeable. This is clear enough when there are two elements in $\mathcal{X}$, and $w(1)\gg w(2)$. So it is much more likely that $(Y_1,Y_2)=(x_1,x_2)$ than $(Y_1,Y_2)=(x_2,x_1)$.

In addition, it will in general not be the case that

$\mathbb{E}[Y_1+\ldots+Y_m]=\mathbb{E}[X_1+\ldots+X_m],$

and this is easily seen again in the above example, where $\mathbb{E}[Y_1+Y_2]=x_1+x_2$ while $\mathbb{E}[X_1+X_2]\approx 2x_1$.

However, one can still compare such distributions, with the notion of increasing convex ordering, which is defined as:

$X\le_{icx} Y\;\iff\; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],$

for all convex non-decreasing functions f. From an economic point view, increasing convex ordering becomes a comment on the relative contribution of losses above some threshold between the two distributions (and that X is better or worse than Y for all such thresholds), and is sometimes called stop-loss order. More formally, as in [BM05] we have:

$X\le_{icx}Y\;\iff\; \mathbb{E}[\max(X-t,0)]\le \mathbb{E}[\max(Y-t,0)]\,\forall t\in\mathbb{R}.$

In [BPS18], the authors study the case of successive sampling where the weights are monotone-increasing in the values, that is $w(i)\ge w(j)\;\iff\; x_i\ge x_j$, of which size-biased sampling is a special case. With this setup, the heuristic is that ‘large values are more likely to appear earlier’ in the sample without replacement, and making this notion precise underpins all the work.

The coupling introduced above is particularly useful in this more exotic setting. The authors show that conditional on the event A of the form $\{Y_1,\ldots,Y_m\}=\{x_{i_1}\le x_{i_2}\le\ldots\le x_{i_m}\}$, one has

$\mathbb{P}(X_1=x_{i_j}\,\big|\,\mathbf{A}) \le \mathbb{P}(X_1=x_{i_k}\,\big|\,\mathbf{A}),\text{ precisely when }j\le k.$

This is achieved by comparing the explicit probabilities of pairs of orderings for $(Y_1,\ldots,Y_m)$ with $x_{i_j},x_{i_k}$ exchanged. See [BPS18] for this short calculation – a link to the Arxiv version is in the references. So to calculate $\mathbb{E}[X_1\,\big|\, \mathbf{A}]$, we have the setup for the rearrangement inequality / Chebyshev again, as

$\mathbb{E}[X_1\,\big|\, \mathbf{A}] = \sum_{j=1}^m x_{i_j}\mathbb{P}(X_1=x_{i_j}) \le \frac{1}{n}\left(\sum_{j=1}^m x_{i_j}\right)\left(\sum \mathbb{P}\right) = \frac{\left(Y_1+\ldots+Y_m\,\big|\,\mathbf{A}\right)}{m}.$

As before, letting $X,Y$ be the sums of the samples with and without replacement, respectively, we have $\mathbb{E}[X\,\big|\, Y] \ge Y$, which the authors term a submartingale coupling. In particular, the argument

$\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] \ge \mathbb{E}\left[ f(Y)\right],$

works exactly as before to establish convex ordering.

References

[Az67] – Azuma – 1967 – Weighted sums of certain dependent random variables

[BM05] – Bauerle, Muller – 2005 – Stochastic orders and risk measures: consistency and bounds. (Online version here)

[BPS18] – Ben Hamou, Peres, Salez – 2018 – Weighted sampling without replacement (ArXiv version here)

[Ho63] – Hoeffding – 1963 – Probability inequalities for sums of bounded random variables

[Oh69] – Ohlin – 1969 – On a class of measures of dispersion with application to optimal reinsurance

# EGMO 2019

Last week, we held our annual IMO training and selection camp in the lovely surroundings of Trinity College, Cambridge. Four of our students have subsequently spent this week in Kiev, for the ninth edition of the prestigious European Girls’ Mathematical Olympiad.

The UK team, none of whom had attended the competition before, and all of whom remain eligible to return at least once, did extremely well, placing fourth out of the official European countries, and earning three silver medals, and one gold medal, with complete or almost-complete solutions to all six problems between them. More details are available on the competition website.

In this post, I’m going to discuss some of the non-geometry problems. As a special treat, the discussion of Question 6 is led by Joe Benton, which is only fitting, since he wrote it. You can find the first day’s three problems here, and the second day’s problems here. The geometry problems are treated in a separate post.

Problem One

Triple equalities are hard to handle directly, so generally one starts with a single equality, for example the left hand one $a^2b+c = b^2c+a$, after noting that the setup is cyclic (but not symmetric) in the variables, under cycling $a\to b \to c\to a$.

Some quick notes:

• The given equations are inhomogeneous, meaning that not every term has the same degree in terms of {a,b,c}. In complicated equations, normally we want to cancel terms, and we certainly can’t cancel terms with different degrees! So a good first idea is to find a way to make the equations homogeneous, for example using the condition, which is itself inhomogeneous. For example, one could replace $c$ with $c(ab+bc+ca)$, and it’s clear this might lead to some cancellation. In this case, the algebra is more straightforward if one rearranges and factorises first to obtain $a(1-ab)=c(1-bc)$, from which the condition gives $a(bc+ca)=c(ab+ac)$.
• Shortly after this, we find $a^2c=ac^2$, which means $a=c$ unless at least one of these terms is equal to zero.
• This case distinction should not be brushed over lightly as a tiny detail. This is the sort of thing that can lose you significant marks, especially if the omitted case turns out to be more involved than the standard case.
• This is a good example of planning your final write-up. The case distinction $\{a,c\ne 0\}, \{a\text{ or }c=0\}$ is really clumsy. What about b? We have already said that the setup is cyclic in (a,b,c), and it’s annoying to throw this aspect away. It really would be much much to have the overall case distinction at the start of your argument, for example into $\{a,b,c \ne 0\}$ and $\{a\text{ or }b\text{ or }c=0\}$, because then in the latter case you could assume that $a=0$ without loss of generality, then see what that implied for the other variables.
• In this setting, one really should check that the claimed solutions satisfy both the condition and the triple-equality. This is a complicated family of simultaneous equations, and even if you’re lucky enough to have settled on an argument that is reversible, it’s much easier to demonstrate that everything is satisfied as desired than actually to argue that the argument is reversible.

Problem Two

Maybe I speak for a number of people when I say I’m a little tired of domino tiling problems, so will leave this one out.

Problem Five

I liked this problem. A lot. It did feel like there were potential time-wasting bear traps one might slip into, and perhaps these comments are relevant:

• This feels like quite a natural problem to ask. In particular, the task specified by (A)+(B) is much more elegant than the RHS of (C). So unless you have immediate insight for the RHS of (C), it makes sense to start with the task, and aim for any bound similar to (C).
• There are a lot of transformations one could make, for example, repeatedly removing n from one of the $a_n$s, or re-ordering them conveniently somehow, which shouldn’t affect any solution. You can also attack the RHS of (C) in a number of ways (perhaps with a case split for n odd or even?). As with involved but easy angle chases discussed in the companion post, it’s probably better to hold this in mind than instantly to do all of these, since it will just obscure the argument if you end up not using the result of the simplification!
• One should always try small examples. Here, it’s worth trying examples large enough to get a sense of what’s happening, because I think the crucial observation is that (unless you were very very unlucky) there are lots of suitable sequences $B=(b_i)$. (*)
• Certainly the case where the $(a_i)$s themselves form a complete n-residue system is ‘easiest’ and certainly straightforward. One might say that the case where the $(a_i)$ are equal (or at least congruent) is ‘hardest’ (in that $(b_i-a_i)$ might have to take the largest values in some sense), but is also straightforward. There’s certainly the option of trying to rigorise this, but I don’t know whether this can be made to work.

# Lecture 8 – Bounds in the critical window

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

Preliminary – positive correlation, Harris inequality

I wrote about independence, association, and the FKG property a long time ago, while I was still an undergraduate taking a first course on Percolation in Cambridge. That post is here. In the lecture I discussed the special case of the FKG inequality applied in the setting of product measure setting, of which the Erdos-Renyi random graph is an example, and which is sometimes referred to as the Harris inequality.

Given two increasing events A and B, say for graphs on [n], then if $\mathbb{P}$ is product measure on the edge set, we have

$\mathbb{P}(A\cap B)\ge \mathbb{P}(A)\mathbb{P}(B).$

Intuitively, since both A and B are ‘positively-correlated’ with the not-rigorous notion of ‘having more edges’, then are genuinely positively-correlated with each other. We will use this later in the post, in the form $\mathbb{E}[X|A]\ge \mathbb{E}[X]$, whenever X is an increasing RV and A is an increasing event.

The critical window

During the course, we’ve discussed separately the key qualitative features of the random graph $G(n,\frac{\lambda}{n})$ in the

• subcritical regime when $\lambda<1$, for which we showed that all the components are small, in the sense that $\frac{1}{n}|L_1| \stackrel{\mathbb{P}}\rightarrow 0$, although the same argument would also give $|L_1|\le K\log n$ with high probability if we used stronger Chernoff bounds;
• supercritical regime when $\lambda>1$, for which there is a unique giant component , ie that $\frac{1}{n}|L_1|\stackrel{\mathbb{P}}\rightarrow \zeta_\lambda>0$, the survival probability of a Galton-Watson branching process with Poisson($\lambda$) offspring distribution. Arguing for example by a duality argument shows that with high probability all other components are small in the same sense as in the subcritical regime.

In between, of course we should study $G(n,\frac{1}{n})$, for which it was known that $L_1\stackrel{d}\sim n^{2/3},\, L_2\stackrel{d}\sim n^{2/3},\ldots$. (*) That is, the largest components are on the scale $n^{2/3}$, and there are lots of such critical components.

In the early work on random graphs, the story ended roughly there. But in the 80s, these questions were revived, and considerable work by Bollobas and Luczak, among many others, started investigating the critical setting in more detail. In particular, between the subcritical and the supercritical regimes, the ratio $\frac{|L_2|}{|L_1|}$ between the sizes of the largest and second-largest components goes from ‘concentrated on 1’ to ‘concentrated on 0’. So it is reasonable to ask what finer scaling of the edge probability $p(n)$ around $\frac{1}{n}$ should be chosen to see this transition happen.

Critical window

In this lecture, we studied the critical window, describing sequences of probabilities of the form

$p(n)=\frac{1+\lambda n^{-1/3}}{n},$

where $\lambda\in(-\infty,+\infty)$. (Obviously, this is a different use of $\lambda$ to previous lectures.)

It turns out that as we move $\lambda$ from $-\infty$ to $+\infty$, this window gives exactly the right scaling to see the transition of $\frac{|L_2|}{|L_1|}$ described above. Work by Bollobas and Luczak and many co-authors and others in the 80s establish a large number of results in this window, but for the purposes of this course, this can be summarised as saying that the critical window has the same scaling behaviour as $p(n)=1/n$, with a large number of components on the scale $\sim n^{2/3}$ (see (*) earlier), but different scaling limits.

Note: Earlier in the course, we have discussed local limits, in particular for $G(n,\lambda/n)$, where the local limit is a Galton-Watson branching process tree with offspring distribution $\mathrm{Poisson}(\lambda)$. Such local properties are not sufficient to distinguish between different probabilities within the critical window. Although there are lots of critical components, it remains the case that asymptotically almost all vertices are in ‘small components’.

The precise form of the scaling limit for

$\frac{1}{n^{2/3}} \left( |L_1|, |L_2|, |L_3|,\ldots \right)$

as $n\rightarrow\infty$ was shown by Aldous in 1997, by lifting a scaling limit result for the exploration process, which was discussed in this previous lecture and this one too. Since Brownian motion lies outside the assumed background for this course, we can’t discuss that, so this lecture establishes upper bounds on the correct scale of $|L_1|$ in the critical window. Continue reading

# BMO2 2019

The second round of the British Mathematical Olympiad was taken on Thursday by the 100 or so top scoring eligible participants from the first round, as well as some open entries. Qualifying for BMO2 is worth celebrating in its own right. The goal of the setters is to find the sweet spot of difficult but stimulating for the eligible participants, which ultimately means it’s likely to be the most challenging exam many of the candidates sit while in high school, at least in mathematics.

I know that lots of students view BMO2 as something actively worth preparing for. As with everything, this is a good attitude in moderation. Part of the goal in writing about the questions at such length is because I think at this level it’s particularly easy to devote more time than needed to preparation, and use it poorly. This year time is tight at the end of semester, and so what follows is closer to a set of complete solutions than usual, for which apologies, although I hope it is still possible to get a sense of how one might have come across the solutions yourself. Of course, this means that what follows will certainly spoil the problems for anyone who hasn’t tried them by themselves already.

The copyright for the problems is held by BMOS, and reproduced here with permission.

Question 1

As if often the case in geometry questions, what you’ve been asked to prove here isn’t the most natural property of the configuration. A good first step would be to see if there are stronger statements which are true.You are asked to show that triangle BPE is isosceles, but you aren’t told which of the three vertices is the apex. In fact, the task is to show that $BP=EP$ or, alternatively, $\angle BEP=\angle PBE$. It’s not in general true that BE is equal to BP=EP. Unless you’re very unlucky, you can establish this from one diagram.

Now, you don’t immediately know whether it’s going to be easier to show that two lengths are equal, or that two angles are equal. However, you know that P lies on the perpendicular bisector of BC, hence BP=CP, which is a big clue. In particular, this means that P would be the centre of the circle through BCE. This clearly implies the given result, so deciding to prove this instead is a good strategy.

There are now a number of ways to prove this. Note that D lies on the altitude from A, and the feet of the perpendiculars from D to sides AB and AC are both present in the configuration so (just as for the orthocentre diagram) we can calculate most of the angles involving {A,B,C,D,E}.

For example, ABDE is cyclic, so $\angle BED=\angle BAD = 90-\hat{B}$, hence $\angle AEB=\hat{B},\,\angle EBA=\hat{C}$. This shows that AB is tangent to the circumcircle of BCE. But then the line L is a radius of this circle, and so its centre must be P, the unique point on L which is equidistant from B and C.

Alternatively, we could directly calculate $\angle BEC=180-\hat{B}$ and $\angle CBP=90-\hat{B}$. But BPC is isosceles so $\angle BPC=2\hat{B}$. In general, the converse of ‘angle at centre is twice angle at circumference’ does not hold, but when we know P is equidistant from B and C this does hold, and so the angle relations precisely confirm that P is the centre of the circle through BPE.

My intention had been that the triangle would be acute-angled, to reduce the number of diagram options based on the magnitude of $\hat{B}$. If pursuing this second approach, one would need to be careful to account for whether P is on the same side or the opposite side of BC to E. That said, unless you do something very exotic, it should be exactly the same argument or calculation, and such a case distinction probably isn’t very important.

Question 2

First, a short remark. As stated, if n=5, a piece could move 3 squares to the left then 4 squares up, by Pythagoras. Handling all such options is likely to be quite annoying, since some values of n can be written in this Pythagorean form, and others cannot. This brings us to some good general principles for olympiad problems which look like this one:

• A construction, when one exists, will probably be possible using simple versions of the allowed moves / structures.
• An argument why a construction is impossible should probably be based on ideas which treat the simple moves similarly to the more complicated moves.

The setup of the problem encourages you to think about dividing the board into $n^2$ sub-boards, each with dimensions $n\times n$. Continue reading

# Lecture 10 – the configuration model

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we enter the final stages of the semester, I want to discuss some extensions to the standard Erdos-Renyi random graph which has been the focus of most of the course so far. Although we will not get far into the details during this course, the overall goal is to develop models which are close to Erdos-Renyi in terms of ease of analysis, while also allowing more of the features characteristic of networks observed in the real world.

One of the more obvious deficiencies of the sparse regime of Erdos-Renyi random graphs for modelling ‘real-world phenomena’ concerns the degree sequence. Indeed, the empirical degree distribution of G(n,c/n) converges to Poisson(c). By contrast, in real-world networks, a much wider range of degrees is typically observed, and in many cases it is felt that these should follow a power law, with a small number of a very highly connected agents.

One way around this problem to construct random graphs where we insist that the graph has a given sequence of degrees. The configuration model, which is the subject of this lecture and this post (and about which I’ve written before), offers one way to achieve this.

Definition and notes

Let $n\ge 1$ and let $d=(d_1,d_2,\ldots,d_n)$ be a sequence of non-negative integers such that $\sum_{i=1}^n d_i$ is even. Then the configuration model with degree sequence d is a random multigraph with vertex set [n], constructed as follows:

• To each vertex $i\in[n]$, assign $d_i$ half-edges;
• Then, take a uniform matching of these half-edges;
• Finally, for each pair of half-edges in the matching, replace the two half-edges with a genuine edge, to obtain the multigraph $CM_n(d)$, in which, by construction, vertex i has degree $d_i$.

One should note immediately that although the matching is uniform, the multigraph is not uniform amongst multigraphs with that degree sequence. Note also that the condition on the sums of the degrees is necessary for any graph, and in this context means that the number of half-edges is even, without which it would not be possible to construct a matching.

This effect is manifest in the simplest possible example, when n=2 and d=(3,3). There are two possible graphs, up to isomorphism, which are shown below:

For obvious reasons, we might refer to these as the handcuffs and the theta , respectively. It’s helpful if we, temporarily, assume the half-edges are distinguishable at the moment we join them up in the configuration model construction. Because then there are 3×3=9 ways to join them up to form the handcuffs (think of which half-edge ends up forming the edge between the two vertices) while there are 3!=6 ways to pair up the half-edges in the theta.

In general, for multigraphs H with the correct degree sequence, we have

$\mathbb{P}( CM_n(d)\simeq H) \propto \left( 2^{\# \text{loops}(H)} \prod_{e\in E(H)} \text{mult}(e)! \right),$

where $\text{mult}(e)$ is the multiplicity with which a given edge e appears in H.

Note: it might seem counterintuitive that this procedure is biased against multiple edges and self-loops, but it is really just saying that there are more ways to form two distinct edges than to form two equal edges (ie a multiedge pair) when we view the half-edges as distinguishable. (See this post for further discussion of this aspect in the 3-regular setting.)

However, a consequence of this result is that if we condition on the event that $CM_n(d)$ is simple, then the resulting random graph is uniform on the set of simple graphs satisfying the degree property. Note that the same example as above shows that there’s no guarantee that there exists a simple graph whose degrees are some given sequence.

d-regular configuration model

In general, from a modelling point of view, we are particularly interested in simple, connected graphs, and so it is valuable to study whether the large examples of the configuration model are likely to have these properties. In this lecture, I will mainly focus on the case where the multigraphs are d-regular, meaning that all the vertices have degree equal to d. For the purposes of this lecture, we denote by $G^d(n)$, the d-regular configuration model $CM_n(d,\ldots,d)$.

• d=1: to satisfy the parity condition on the sums of degrees, we must have n even. But then $G^1(n)$ will consist of n/2 disjoint edges.
• d=2: $G^2(n)$ will consist of some number of disjoint cycles, and it is a straightforward calculation to check that when n is large, with high probability the graph will be disconnected.

In particular, I will focus on the case when d=3, which is the first interesting case. Most of the results we prove here can be generalised (under various conditions) to more general examples of the configuration model. The main goal of the lecture is revision of some techniques of the course, plus one new one, in a fresh setting, and the strongest possible versions of many of these results can be found amongst the references listed at the end.

Connectedness

In the lecture, we showed that $G^3(2n)$ is connected with high probability. This is, in fact, a very weak result, since in fact $G^d(n)$ is d-connected with high probability for $d\ge 3$ [Bol81, Wor81]. Here, d-connected means that one must remove at least d vertices in order to disconnect the graph, or, equivalently, that there are d disjoint paths between any pair of vertices. Furthermore, Bollobas shows that for $d\ge 3$, $G^d(n)$ is a (random) expander family [Bol88].

Anyway, for the purposes of this course, the main tool is direct enumeration. The matching number $M_{2k}$ satisfies

$M_{2k}=(2k-1)\times (2k-3)\times\ldots\times 3\times 1 = \frac{(2k)!}{2^k \cdot k!},$

and so Stirling’s approximation gives the asymptotics

$M_{2k} = (\sqrt{2}+o(1)) \left(\frac{2}{e}\right)^k k^k,$

although it will be useful to use the true bounds

$c \left(\frac{2}{e}\right)^k k^k \le M_{2k}\le C\left(\frac{2}{e}\right)^k k^k,\quad \forall k,$

instead in some places. Anyway, in $G^3(2n)$, there are 6n half-edges in total, and so the probability that the graph may be split into two parts consisting of $2\ell,2m$ vertices, with $2\ell+2m=2n$, and with no edges between the classes is $\frac{\binom{2n}{2\ell} M_{6\ell}M_{6m}}{M_{6n}}.$ Continue reading

# Lecture 9 – Inhomogeneous random graphs

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we enter the final stages of the semester, I want to discuss some extensions to the standard Erdos-Renyi random graph which has been the focus of most of the course so far. In doing so, we can revisit material that we have already covered, and discover how easily one can extend this directly to more exotic settings.

The focus of this lecture was the model of inhomogeneous random graphs (IRGs) introduced by Soderberg [Sod02] and first studied rigorously by Bollobas, Janson and Riordan [BJR07]. Soderberg and this blog post address the case where vertices have a type drawn from a finite set. BJR address the setting with more general typespaces, in particular a continuum of types. This generalisation is essential if one wants to use IRGs to model effects more sophisticated than those of the classical Erdos-Renyi model G(n,c/n), but most of the methodology is present in the finite-type setting, and avoids the operator theory language which is perhaps intimidating for a first-time reader.

Inhomogeneous random graphs

Throughout, $k\ge 2$ is fixed. A graph with k types is a graph G=(V,E) together with a type function $V\to \{1,\ldots,k\}$. We will refer to a $k\times k$ symmetric matrix with non-negative entries as a kernel.

Given $n\in\mathbb{N}$ and a vector $p=(p_1,\ldots,p_k)\in\mathbb{N}_0^k$ satisfying $\sum p_i=n$, and $\kappa$ a kernel, we define the inhomogeneous random graph $G^n(p,\kappa)$ with k types as:

• the vertex set is [n],
• types are assigned uniformly at random to the vertices such that exactly $p_i$ vertices have type i.
• Conditional on these types, each edge $v\leftrightarrow w$ (for $v\ne w\in [n]$) is present, independently, with probability

$1 - \exp\left(-\frac{\kappa_{\mathrm{type}(v),\mathrm{type}(w)} }{n} \right).$

Notes on the definition:

• Alternatively, we could assign the types so that vertices $\{1,\ldots,p_1\}$ have type 1, $\{p_1+1,\ldots,p_1+p_2\}$ have type 2, etc etc. This makes no difference except in terms of the notation we have to use if we want to use exchangeability arguments later.
• An alternative model considers some distribution $\pi$ on [k], and assigns the types of the vertices of [n] in an IID fashion according to $\pi$. Essentially all the same results hold for these two models. (For example, this model with ‘random types’ can be studied by quenching the number of each type!) Often one works with whichever model seems easier for a given proof.
• Note that the edge probability given is $\approx \frac{\kappa_{\mathrm{type}(v),\mathrm{type}(w)}}{n}$. The exponential form has a more natural interpretation if we ever need to turn the IRGs into a process. Additionally, it avoids the requirement to treat small values of n (for which, a priori, $k/n$ might be greater than 1) separately.

In the above example, one can see that, roughly speaking, red vertices are more likely to be connected to each other than blue vertices. However, for both colours, they are more likely to be connected to a given vertex of the same colour than a vertex of the opposite colour. This might, for example, correspond to the kernel $\begin{pmatrix}3&1\\1&2\end{pmatrix}$.

The definition given above corresponds to a sparse setting, where the typical vertex degrees are $\Theta(1)$. Obviously, one can set up an inhomogeneous random graph in a dense regime by an identical argument.

From an applications point of view, it’s not hard to imagine that an IRG of some flavour might be a good model for many phenomena observed in reality, especially when a mean-field assumption is somewhat appropriate. The friendships of boys and girls in primary school seems a particularly resonant example, though doubtless there are many others.

One particular application is to recover the types of the vertices from the topology of the graph. That is, if you see the above picture without the colours, can you work out which vertices are red, and which are blue? (Assuming you know the kernel.) This is clearly impossible to do with anything like certainty in the sparse setting – how does one decide about isolated vertices, for example? The probabilities that a red vertex is isolated and that a blue vertex is isolated differ by a constant factor in the $n\rightarrow\infty$ limit. But in the dense setting, one can achieve this with high confidence. When studying such statistical questions, these IRGs are often referred to as stochastic block models, and the recent survey of Abbe [Abbe] gives a very rich history of this type of problem in this setting.

Poisson multitype branching processes

As in the case of the classical random graph G(n,c/n), we learn a lot about the IRG by studying its local structure. Let’s assume from now on that we are given a sequence of IRGs $G^n(p^n,\kappa)$ for which $\frac{p^n}{n}\rightarrow \pi$, where $\pi=(\pi_1,\ldots,\pi_k)\in[0,1]^k$ satisfies $||\pi||_1=1$.

Now, let $v^n$ be a uniformly-chosen vertex in [n]. Clearly $\mathrm{type}(v^n)\stackrel{d}\rightarrow \pi$, with the immediate mild notation abuse of viewing $\pi$ as a probability distribution on [k].

Then, conditional on $\mathrm{type}(v^n)=i$:

• when $j\ne i$, the number of type j neighbours of $v^n$ is distributed as $\mathrm{Bin}\left(p_j,1-\exp\left(-\frac{\kappa_{i,j}}{n}\right)\right)$.
• the number of type i neighbours of $v^n$ is distributed as $\mathrm{Bin}\left( p_i-1,1-\exp\left(-\frac{\kappa_{i,i}}{n}\right)\right)$.

Note that $p_j\left[1-\exp\left(-\frac{\kappa_{i,j}}{n}\right)\right]\approx \frac{p_j\cdot \kappa_{i,j}}{n} \approx \kappa_{i,j}\pi_j$, and similarly in the case j=i, so in both cases, the number of neighbours of type j is distributed approximately as $\mathrm{Poisson}(\kappa_{i,j}\pi_j)$.

This motivates the following definition of a branching process tree, whose vertices have k types. Continue reading

# Lecture 7 – The giant component

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we edge into the second half of the course, we are now in a position to return to the question of the phase transition between the subcritical regime $\lambda<1$ and the supercritical regime $\lambda>1$ concerning the size of the largest component $L_1(G(n,\lambda/n))$.

In Lecture 3, we used the exploration process to give upper bounds on the size of this largest component in the subcritical regime. In particular, we showed that

$\frac{1}{n}\big| L_1(G(n,\lambda/n)) \big| \stackrel{\mathbb{P}}\rightarrow 0.$

If we used slightly stronger random walk concentration estimates (Chernoff bounds rather than 2nd-moment bounds from Chebyshev’s inequality), we could in fact have shown that with high probability the size of this largest component was at most some logarithmic function of n.

In this lecture, we turn to the supercritical regime. In the previous lecture, we defined various forms of weak local limit, and asserted (without attempting the notationally-involved combinatorial calculation) that the random graph $G(n,\lambda/n)$ converges locally weakly in probability to the Galton-Watson tree with $\text{Poisson}(\lambda)$ offspring distribution, as we’ve used informally earlier in the course.

Of course, when $\lambda>1$, this branching process has strictly positive survival probability $\zeta_\lambda>0$. At a heuristic level, we imagine that all vertices whose local neighbourhood is ‘infinite’ are in fact part of the same giant component, which should occupy $(\zeta_\lambda+o_{\mathbb{P}}(1))n$ vertices. In its most basic form, the result is

$\frac{1}{n}\big|L_1(G(n,\lambda/n))\big|\;\stackrel{\mathbb{P}}\longrightarrow\; \zeta_\lambda,\quad \frac{1}{n}\big|L_2(G(n,\lambda/n))\big| \;\stackrel{\mathbb{P}}\longrightarrow\; 0,$ (*)

where the second part is a uniqueness result for the giant component.

The usual heuristic for proving this result is that all ‘large’ components must in fact be joined. For example, if there are two giant components, with sizes $\approx \alpha n,\approx \beta n$, then each time we add a new edge (such an argument is often called ‘sprinkling‘), the probability that these two components are joined is $\approx 2ab$, and so if we add lots of edges (which happens as we move from edge probability $\lambda-\epsilon\mapsto \lambda$ ) then with high probability these two components get joined.

It is hard to make this argument rigorous, and the normal approach is to show that with high probability there are no components with sizes within a certain intermediate range (say between $\Theta(\log n)$ and $n^\alpha$) and then show that all larger components are the same by a joint exploration process or a technical sprinkling argument. Cf the books of Bollobas and of Janson, Luczak, Rucinski. See also this blog post (and the next page) for a readable online version of this argument.

I can’t find any version of the following argument, which takes the weak local convergence as an assumption, in the literature, but seems appropriate to this course. It is worth noting that, as we shall see, the method is not hugely robust to adjustments in case one is, for example, seeking stronger estimates on the giant component (eg a CLT).

Anyway, we proceed in three steps:

Step 1: First we show, using the local limit, that for any $\epsilon>0$,

$\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \le \zeta_\lambda+\epsilon,$ with high probability as $n\rightarrow\infty$.

Step 2: Using a lower bound on the exploration process, for $\epsilon>0$ small enough

$\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \ge \epsilon,$ with high probability.

Step 3: Motivated by duality, we count isolated vertices to show

$\mathbb{P}(\epsilon n\le |L_1| \le (\zeta_\lambda-\epsilon)n) \rightarrow 0.$

Step 1

This step is unsurprising. The local limit gives control on how many vertices are in small components of various sizes, and so gives control on how many vertices are in small components of all finite sizes (taking limits in the right order). This gives a bound on how many vertices can be in the giant component. Continue reading

# Lecture 6 – Local limits

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

By this point of the course, we’ve studied several aspects of the Erdos-Renyi random graph, especially in the sparse setting $G(n,\frac{\lambda}{n})$. We’ve also taken a lengthy detour to revise Galton-Watson trees, with a particular focus on the case of Poisson offspring distribution.

This is deliberate. Note that a given vertex v of $G(n,\frac{\lambda}{n})$ has some number of neighbours distributed as $\mathrm{Bin}(n-1,\frac{\lambda}{n})\stackrel{d}\approx\mathrm{Po}(\lambda)$, and the same approximation remains valid as we explore the graph (for example in a breadth-first fashion) either until we have seen a large number of vertices, or unless some ultra-pathological event happens, such as a vertex having degree n/3.

In any case, we are motivated by the notion that the local structure of $G(n,\frac{\lambda}{n})$ is well-approximated by the Galton-Watson tree with $\mathrm{Po}(\lambda)$ offspring, and in this lecture and the next we try to make this notion precise, and discuss some consequences when we can show that this form of convergence occurs.

Deterministic graphs

Throughout, we will be interested in rooted graphs, since by definition we have to choose a root vertex whose local neighbourhood is to be studied. Usually, we will study a sequence of rooted graphs $(G_n,\rho_n)$, where the vertex set of $G_n$ is [n], or certainly increasing in n (as in the first example).

For some rooted graph $(G,\rho)$, we say such a sequence $(G_n,\rho_n)$ converges to $(G,\rho)$ locally if for all radii $r\ge 1$, we have $B_r^{G_n}(\rho_n)\simeq B_r^G(\rho)$. In words, the neighbourhood around $\rho_n$ in $G_n$ is the same up to radius r as the neighbourhood around $\rho$ in $G$, so long as n is large enough (for given r).

This is best illustrated by an example, such as $T_n$, the binary tree to depth n.

If we take $\rho_n$ to be the usual root, then the trees are nested, and converge locally to the infinite binary tree $T_\infty$. Slightly less obviously, if we take $\rho_n$ to be one of the leaves, then the trees are still nested (up to labelling – ie in the sense of isomorphisms of rooted trees), and converge locally to the canopy tree, defined by a copy of $\mathbb{Z}_{\ge 0}$ with nearest-neighbour edges, and where each vertex $n\ge 1$ is connected to the root of a disjoint copy of $T_{n-1}$, as shown below:

Things get more interesting when the root is chosen randomly, for example, uniformly at random, as this encodes more global information about the graphs $G_n$. In the case where the $G_n$ are vertex-transitive, then if we only care about rooted graphs up to isomorphism, then it doesn’t matter how we choose the root.

Otherwise, we say that $G_n$ converges in the local weak sense to $(G,\rho)$ if, for all $r\ge 1$ and for all rooted graphs $(H,\rho_H)$,

$\mathbb{P}\left( B^{G_n}_r(\rho_n)\simeq (H,\rho_H) \right) \longrightarrow \mathbb{P}\left( B_r^G(\rho)\simeq H\right),$

as $n\rightarrow\infty$.

Alternatively, one can phrase this as a result about convergence of rooted-graph-valued distributions.

A simple non-transitive example is $G_n\simeq P_n$, the path of length n. Then, the r-neighbourhood of a vertex is isomorphic to $P_{2r}$unless that vertex is within graph-distance (r-1) of one of the leaves of $G_n$. As $n\rightarrow\infty$, the proportion of such vertices vanishes, and so, $\mathbb{P}\left( B^{P_n}_r(\rho_n)\simeq P_{2r}\right)\rightarrow 1$, from which we conclude the unsurprising result that $P_{n}$ converges in the local weak sense to $\mathbb{Z}$. (Which is vertex-transitive, so it doesn’t matter where we select the root.)

The binary trees offer a slightly richer perspective. Let $\mathcal{L}_n$ be the set of leaves of $T_n$, and we claim that when $\rho_n$ is chosen uniformly from the vertices of $T_n$, then $d_{T_n}(\rho_n,\mathcal{L}_n)$ converges in distribution. Indeed, $\mathbb{P}\left( d_{T_n}(\rho_n,\mathcal{L}_n)=k\right) = \frac{2^{n-k}}{2^{n+1}-1}$, whenever $n\ge k$, and so the given distance converges in distribution to the Geometric distribution with parameter 1/2 supported on {0,1,2,…}.

This induces a random local weak limit, namely the canopy tree, rooted at one of the vertices we denoted by $\mathbb{Z}_{\ge 0}$, with the choice of this vertex given by Geometric(1/2). Continue reading

# Lecture 4 – Hitting time theorem

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

This lecture consisted of revision of the most relevant theory of Galton-Watson trees, with a focus on the case where the offspring distribution is Poisson, since, as we have seen in previous lectures, this is a strong candidate to approximate the structure of $G(n,\lambda/n)$. It makes sense to cover the theory of the trees before attempting to make rigorous the sense of approximation.

Given a Galton-Watson tree T, it is natural to label the vertices in a breadth-first order as $\varnothing=v_1,v_2,\ldots,v_{|T|}$. This is easiest if we have constructed the Galton-Watson tree as a subset of the infinite Ulam-Harris tree, where vertices have labels like (3,5,17,4), whose parent is (3,5,17). If this child vertex is part of the tree, then so are (3,5,17,1), (3,5,17,2), and (3,5,17,3). This means our breadth-first order is canonically well-defined, as we have a natural ordering of the children of each parent vertex.

Note: one advantage of using breadth-first order rather than depth-first order (which corresponds to the usual dictionary, or lexicographic ordering of the labels) is that if the tree is infinite, we don’t explore all of it during a depth-first search. (In the sense that there exist vertices which are never given a finite label.) For breadth-first search, a similar problem arises precisely when some vertex has infinitely many children. For a conventional Galton-Watson tree, the latter situation is much less of a problem than the infinite total population problem, which happens with positive probability whenever $\mu=\mathbb{E}[X]>1$.

Anyway, given the depth-first order, one can consider an exploration process $S_0,S_1,S_2,\ldots,S_{|T|}$ given by

$S_0=1,\quad S_i=S_{i-1}+(X_i-1),$

where $X_i$ is the number of children of $v_i$. In this way, we see that

$S_i=\big| \Gamma(v_1)\cup\ldots\cup\Gamma(v_i)\backslash \{v_1,\ldots,v_i\}\big|,\quad i\ge 1,$

records the number of vertices in some stack containing those which we have ‘seen but not explored’. Some authors prefer to start from 0, in which case one ends up with a similar but slightly different interpretation of the ‘stack’, but that’s fine since we aren’t going to define formally what ‘seen’ and ‘explored’ means in this post.

Essentially, we exhaust the vertices of the tree whenever $S_t=0$, and so the condition that $|T|=n$ requires

$S_n=0,\quad S_m\ge 1,\; m=0,1,\ldots,n-1.$

Conveniently, so long as we have avoiding ordering ambiguity, for example by insisting that trees live within the Ulam-Harris tree, we can reconstruct T uniquely from $(S_0,S_1,\ldots,S_{|T|})$.

Furthermore, if T is a Galton-Watson process, then the numbers of children $X_i$ are IID, and so in fact this exploration process is a random walk, and the size of the tree can be recovered as the hitting time of zero.

Note: making fully rigorous the argument that children in the GW tree are independent of the breadth-first walk fully rigorous is somewhat technical, and not to be dismissed lightly, though not of principle interest at the level of this topics course. See Proposition 1.5 in Section 1.2 of Le Gall’s notes or Section 1.2.2 of my doctoral thesis for further discussion and argument.

The hitting time theorem allows us to study the distribution of the hitting time of a random walk whose increments are bounded below by -1, in terms of the distribution of the value of the random walk.

Theorem: Let $(S_n,\, n\ge 0)$ be a random walk with $S_0=0$ and IID increments $(X_n,n\ge 1)$ satisfying $\mathbb{P}(X_n\ge -1)=1$. Let $H_{-k}=\inf \left\{n\,:\, S_n=-k\right\}$ be the hitting time of -k.

Then $\mathbb{P}\big( H_{-k}=n\big) = \frac{k}{n}\mathbb{P}\big(S_n=-k)$.

Commentary: there are local central limit theorem estimates and large deviation estimates that allow good control of the probability on the RHS for a rich class of contexts. So at a meta-level, the hitting time theorem allows us to reduce a complicated (though still classical) problem, to a real classical problem, which is particularly helpful when the LHS is a device for capturing relevant information about our random tree model.