Borel-Cantelli Lemmas

I’m currently lecturing my first course at King’s, which builds measure theory from the ground up to the construction of the Lebesgue integral, along with some more probabilistic topics. In this second week, we have been discussing various matters related to the construction of sigma-algebras. It took me slightly by surprise that this ends up being the natural moment during an undergraduate’s progression through probability to introduce the Borel-Cantelli lemmas. It’s not hard to imagine a pedagogical schedule in which they appear earlier, but equally, one benefits from seeing them with a bit more mathematical experience, even if they are only very loosely related to the construction of the Lebesgue integral!

Anyway, I noticed that I have not written about these results on the blog before. There are a couple of neat applications that lie slightly beyond the remit (and time allowance) of the lecture course, so they are presented at the end.

Context – events holding eventually and infinitely often

Recall that (in the notation of probability spaces) a sigma-algebra \mathcal F provides a set of events A\subseteq \Omega for which it is meaningful and consistent to define a probability measure. This is extremely uncontroversial in the setting where the set of outcomes \Omega is finite, but thornier in the setting where \Omega is uncountable, for example [0,1] or \mathbb{R}. The key axiom for sigma-algebras involves countable unions. If (A_n)_{n\ge 1} is a sequence of events in \mathcal F, we must also have \bigcup_{n\ge 1}A_n\in\mathcal F.

Since the axioms also demand that \mathcal F is closed under taking complements, it only takes a few steps to show that countable intersections are also valid, ie \bigcap_{n\ge 1}A_n\in\mathcal F also. It is uncontroversial to note that

\bigcap_{n\ge 1}A_n \subseteq \bigcup_{n\ge 1}A_n.

The main initial motivation for considering the events \{A_n\text{ infinitely often}\},\{A_n\text{ eventually}\} are that they lie between the intersection and the union in terms of containment.

  • ‘Infinitely often’ is relatively uncontroversial, just meaning ‘for infinitely many values of n’.
  • ‘Eventually’ is less clear on an initial reading. Perhaps ‘eventually always’ would be better? [And closer to the title of this blog!] The meaning is ‘always, except for finitely many n’, which implies that there is some threshold N such that it holds for all n>N.
  • In set notation, infinitely often (or i.o.) means that \bigcup_{k\ge n}A_k must hold for every n, and so the event \{A_n\text{ i.o.}\}=\bigcap_n\bigcup_{k\ge n}A_k.
  • Similarly, eventually means that there must be an n for which \bigcap_{k\ge n}A_k holds, so the event \{A_n\text{ eventually}\}=\bigcup_n\bigcap_{k\ge n}A_k.

We end up with the richer containment sequence

\bigcap_{n\ge 1}A_n\subseteq \{A_n\text{ eventually}\}\subseteq \{A_n\text{ i.o.}\}\subseteq \bigcup_{n\ge 1}A_n.

These individual containment relations may not be obvious a priori in this form. Probably the most clear is the central containment, that if an A_n holds for all except finitely many n, then it holds for infinitely many n. The right-most says that if it holds infinitely often, then it holds at least once. For the left-most containment, note that \bigcap_{n\ge 1}A_n is literally the same as the n=1 case of \bigcap_{k\ge n}A_k, so taking a union over other values of n only increases the set.

It’s also worth thinking about the complements. Note that \left(\bigcap_{n\ge 1}A_n\right)^c = \bigcup_{n\ge 1} A_n^c and \left(\bigcup_{n\ge 1}A_n\right)^c=\bigcap_{n\ge 1}A_n^c. Versions of these apply to eventually/i.o. also. In particular

\{A_n\text{ i.o.}\}^c = \{A_n\text{ for only finitely many }n\} = \{A_n^c\text{ eventually}\}.

and \{A_n\text{ eventually}\}^c = \{A_n^c\text{ for infinitely many }n\}.

So in any calculation or application, there is flexibility to study (A_n^c) is this is more convenient. As we will see, it makes sense to focus on whichever of (A_n),(A_n^c) has vanishing probabilities as n\to\infty.

Comment: It is standard to write \limsup_{n}A_n= \bigcap_{n\ge 1}\bigcup_{k\ge n}A_k, and \liminf_n A_n=\bigcup_{n\ge 1}\bigcap_{k\ge n}A_k. I am happy to admit I get these confused with non-vanishing probability, and prefer to stick to eventually/i.o. or the full union/intersection statements for safety!

Borel-Cantelli Lemmas

The Borel-Cantelli lemmas establish some conditions under which the event \{A_n\text{ i.o.}\} almost surely holds, or almost surely doesn’t hold. We’ll state them now:

BC1: If \sum_{n\ge 1}\mathbb{P}(A_n)<\infty, then \mathbb{P}(A_n\text{ i.o.})=0.

BC2: If \sum_{n\ge 1}\mathbb{P}(A_n)=\infty and the events (A_n) are independent, then \mathbb{P}(A_n\text{ i.o.})=1.

Proof of BC1: Since \{A_n\text{ i.o.}\}\subseteq \bigcup_{k\ge n}A_k for every $n$, we have

\mathbb{P}(A_n\text{ i.o.}) \le \sum_{k\ge n}\mathbb{P}(A_k)\;\stackrel{n\to\infty}\longrightarrow\; 0,

since \sum \mathbb{P}(A_n)<\infty.

Comment: we could also introduce the random variable N which counts how many events A_n occur, ie N=\sum_{n\ge 1}\mathbf{1}_{A_n}. Then \{A_n\text{ i.o.}\}=\{N=\infty\}, and the given condition is equivalent to \mathbb{E}[N]<\infty, which certainly implies \mathbb{P}(N=\infty)=0. Indeed, it’s reasonable to think of BC1 as a first-moment result, with BC2 as (sort of) a second-moment result, and this is reflected in the comparative complexity of the two proofs.

Proof of BC2: As discussed, it is equivalent to show that \mathbb{P}(A_n^c\text{ eventually})=0. To do this, we study the intersections of the complements, whose probabilities are tractable using independence:

\mathbb{P}\left( \bigcup_{k\ge n}A_k^c\right) = \prod_{k\ge n}\mathbb{P}(A_k^c)=\prod_{k\ge n}(1-\mathbb{P}(A_k))

\le \prod_{k\ge n}\exp(-\mathbb{P}(A_k))=\exp(-\sum_{k\ge n}\mathbb{P}(A_k)).

We have assumed that \sum_{k\ge n}\mathbb{P}(A_k)=\infty for each n, and so we obtain

\mathbb{P}\left(\bigcup_{k\ge n}A_k^c\right)=0,\quad\forall n\ge 1.

But since (\bigcap_{k\ge n}A_k^c)_{n\ge 1} is an increasing sequence of events, so taking a union over n is particularly natural. Precisely, by continuity of measure, we have

\mathbb{P}\left(\bigcup_{n\ge 1}\bigcap_{k\ge n}A_k^c\right) = \lim_{n\to\infty}\mathbb{P}\left( \bigcap_{k\ge n}A_k^c\right)=0.

Comment: the independence condition is certainly necessary, as otherwise one can have situations where A_1\supseteq A_2\supseteq\cdots. For example, with U\sim \mathrm{Unif}[0,1], consider A_n:=\{U\le \frac1n\}, for which \mathbb{P}(A_n)=\frac1n, whose sum diverges, while clearly \mathbb{P}(A_n\text{ i.o.})=\mathbb{P}(U=0)=0.

Classic examples: the famous ‘infinite monkey theorem’, relatively well-known outside the realm of theoretical probability, asserts that a monkey typing randomly at a keyboard for long enough will eventually type the complete works of William Shakespeare. Other standard examples involve independent (A_n) with probabilities decaying at various speeds. The latter do have some overlap with the slightly less standard examples presented below.

Example – generating random permutations

Consider a sequence of random permutations (\sigma_n)_{n\ge 1} where \sigma_n is a permutation of [n]. We construct these iteratively as follows: \sigma_1 is the identity (ie the only permutation of one element!); then for each n\ge 2, define \sigma_n from \sigma_{n-1} by inserting element n at a uniformly-chosen position in \sigma_{n-1}.

We ask the question whether this gives a well-defined limiting permutation \sigma_\infty on \mathbb{N}. A necessary condition for this is that the first element \sigma_n(1) is eventually constant. But in fact the events \{\sigma_n(1)=n\} are independent, and each occurs with probability 1/n, and so

\mathbb{P}(\text{first element changes i.o.})\ge \mathbb{P}(\sigma_n(1)=n\text{ for infinitely many }n)=1,

by BC2.

On reflection, each \sigma_n has the uniform distribution on \Sigma_n, permutations of [n]. So our result is consistent with the idea that one cannot construct a uniform random permutation on an infinite set.

Now, alternatively, suppose the new element is added at place \max(n-X_n,1) where (X_2,X_3,\ldots) are IID \mathrm{Geometric}(q) random variables. (The truncation is just to prevent pathologies like trying to add the new entry in ‘place -6’.) Then

\mathbb{P}(\sigma_n(1)=n)=(1-q)^{n-2},

and since \sum_n (1-q)^{n-2}<\infty, BC1 shows that the first element (\sigma_1(1),\sigma_2(1),\sigma_3(1),\ldots) changes only finitely often. A similar argument applies for the second element, and the third element, and so on. Consequently, almost surely we have a well-defined pointwise limit permutation \sigma_\infty. This is known as the Mallows Permutation on the integers, and is one of the more popular models for a random permutation on an infinite set, with lots of interesting properties. I supervised an undergraduate research project on this model over Summer 2022, and will write more when time allows.

Example – well-separated random integers

The final example is a toy version of a construction I’ve been working on recently regarding subsets of the vertices of a graph which ‘trap’ random walk on that graph.

Suppose we are trying to colour some positive integers randomly, so that the coloured integers are infinite, but also asymptotically sparse. We also do not want pairs of coloured integers to be close together. Let’s say that two integers m,n are close if |m-n|\le \sqrt{\max(m,n)}.

Now, suppose we colour each integer n green independently with probability \frac1n. Then, whenever we have two integers m,n which are close, and coloured green, we re-colour them both red.

We would like to prove that infinitely many integers are coloured green. Let C_n:={n\text{ coloured with some colour}}. Using BC2, we know that almost surely, infinitely many are coloured with some colour, and it seems implausible that many would be close. However, the events G_n={n\text{ coloured green}} are not independent, so one cannot apply BC2 directly to these. However, we can study instead R_n={n\text{ coloured red}}. We can bound these probabilities as

\mathbb{P}(R_n)\le\frac{1}{n}\frac{2\sqrt{n}}{n-\sqrt{n}} = \frac{2+o(1)}{n^{3/2}},

and apply BC1 to show \mathbb{P}(R_n\text{ i.o.})=0. Then, since \{G_n\text{ i.o.}\}\supseteq \{C_n\text{ i.o.}\}\setminus{R_n\text{ i.o.}}, we obtain \mathbb{P}(G_n\text{ i.o.})=1, as required.

Advertisement

BMO2 2022

The second and final round of this year’s British Mathematical Olympiad took place on Thursday.

Here are some thoughts on the problems. I wasn’t involved in choosing the problems, although I did write Q4. I’ll say a bit more about this geometry problem than about the other three, so I’ve put this first.

The copyright to the problems is retained by BMO, and I’m reproducing here with permission. The paper can be found in its original format here.

Problem Four

Let me explain how I devised this problem. Let A’ be the point diametrically opposite A on the circumcircle of triangle ABC. (Sometimes called the antipode of A.) Now consider any line L through A’, and let P be the point where it meets the circumcircle again. I was particularly thinking about lines L where P lay on the same side of BC as A’, as depicted.

Because AA’ is a diameter, there is a right angle \angle APA'. So I considering intersecting L with the lines \ell_B,\ell_C given in the question, since this would create two cyclic quadrilaterals, denoted BDPP_B, DCP_CP in the figure.

Let’s just focus on BDPP_B. This cyclic quadrilateral gives us two equal angles in many different ways. But a particularly nice pair is the one shown, because ACPB is also cyclic, which confirms that the angle measure is \angle C. In particular, by extending P_BD to E on AC, we get a third cyclic quadrilateral BECP_B, and hence find out that P_BD,P_BE\perp AC.

So four lines go through D: AP, BC, and the perpendiculars from P_B,P_C to the opposite sides of the triangle.

How to turn this into a question? The steps outlined above are relatively clear because

  • the right angle at P is essentially given;
  • there are four other right angles, two involving \ell_B,\ell_C which are given, then one involving P_BD,P_BE being perpendicular to AC, and the same for P_C and AB, which are to be deduced,

so the cyclic quadrilaterals just pop out directly. This would then be too straightforward for BMO2, and would also waste the fact that four lines meet at D. It would suffice to work with P_B as we did above, and assert that the corresponding result also held for P_C.

To move towards the form of the BMO2 question, we consider taking P=A’, so that line L is actually the tangent to the circumcircle. We still have the quadruple concurrency at D, but now D lies on AA’. It seems almost certain that we wouldn’t have this concurrency if we built the figure using the tangent from any point other than A’. So this will be the configuration for the problem.

As the writer, I know that the four lines meet when T=A’, so I know I can force T=A’ by insisting that three lines meet, and then set the problem asking to prove that the fourth goes through the same point. One could have stated that P_BE,P_CF,AT meet, and ask to prove that the meeting point lies on BC. I preferred the one I chose. Reflecting on the difficulty of the problem:

  • It is hard to draw a diagram which looks right! Essentially you have to draw a few, and if T ends up close to A’ on one, the diagram will look right.
  • Crucially, we now only have four right angles in the figure, so only one pair of (symmetric) cyclic quads can be read off immediately

I think the cleanest way to attack the problem is to extend AQ to meet the circumcircle at T’, with the goal of showing T’=T. Then BECP_B, ACT'B cyclic gives BQT'P_B cyclic (in an exact reversal of the argument in the prelude earlier). What we don’t know is that P_B,T',P_C are collinear, but this follows since now we derive \angle QT'P_B=\angle P_CT'Q=90^\circ from this new cyclic quad. Consequently T’=T.

There are many routes for coming up with olympiad problems in geometry and other topics, and the scenario I’ve described isn’t particularly typical for myself and certainly isn’t necessarily standard for other problem setters. Nonetheless, I hope that’s an interesting insight into the origin story of a problem like this.

Problem One

This question reduces to establishing whether

m(m+k)+k=n^2

has infinitely many solutions in (m,n). The technique for proving via ‘square sandwiching’ that this is normally not possible has been discussed previously on this blog here:

and here:

This is a good opportunity to reflect in more detail. Let P(x) be a polynomial with integer coefficients, and we consider whether the image P(\mathbb{Z}):= \{ P(n)\,:\, n\in \mathbb{Z}\} contains infinitely many squares.

Clearly if P(x)=Q(x)^2 for Q another polynomial with integer coefficients, then this holds. Square sandwiching establishes that the answer is no when P(x) is a monic quadratic (meaning that the coefficient of x^2 is one) unless P(x) is the square of a linear polynomial. The same holds if the leading coefficient is a square. Note that the general case where P(x) is negative except on a finite range (eg if P has even degree and negative leading coefficient) is also immediate.

But there are several cases where the answer is yes:

  • For example, when P(x)=2x^3, we have P(n) a square whenever n=2k^2.
  • Indeed, note that P(x)=x has this property too!
  • Amongst non-monic quadratics, note that P(x)=nx^2+1 works when n is not a square because of the theory of Pell’s equation.

We could have the alternative question: if P(n) is always a square, does this imply P(x)=Q(x)^2. Indeed, one could generalise this and ask: if P(\mathbb{Z})\subset R(\mathbb{Z}) does this imply that P(x)=R(Q(x)) for some choice Q?

Our original setting with R(x)=x^2 is classical, due to Polya, and is a good problem for any students reading this. The case R(x)=x^k is handled by Polya and Szego. The case of general R is more challenging, and though it can be treated without heavier concepts, is probably best left until one knows some undergraduate algebra. Readers looking for a paper may consult Davenport, Lewis and Schinzel Polynomials of certain special types (Acta Arith. 1964).

Problem Two

Rearranging to get

\frac{f(2xy)}{2y} = \frac{f(f(x^2)+x)}{f(x+1)}

is a good first step, as y appears only on the LHS. As y varies, the LHS is therefore constant, and so we deduce that f(2y)=Cy, ie f is linear on the even integers.

There were then a number of ways to proceed, bearing in mind one had to study the odd integers and solve for C. Taking x even in the given equation is a good next step, because (after some careful argument about C and divisibility by powers of 2) f(f(x^2)+x) is already known, leading to an expression for f(x+1), ie some control over the odd integers.

There were a few possibilities for how to piece the odds and evens together, and extract the constant C. It’s worth mentioning that if the problem had studied functions from the non-negative integers to the non-negative integers, then C=0 would have been a possibility. You might like to consider whether this could lead to extra functions satisfying the given equation.

Problem Three

I personally found this quite difficult. We have a large number of 50-dimensional integer vectors with entries bounded by [0,2022] such that their sum is (n,n,…,n) and we are tasked to show that we can split the sum to obtain (m,m,…,m) and (n-m,n-m,…,n-m).

It seemed helpful to replace each vector (x_1,x_2,\ldots,x_{50}) with the 49-dimensional vector (x_2-x_1,x_3-x_1,\ldots,x_{50}-x_1). We now have a large number of 49-dimensional integer vectors with entries bounded by [-2022,2022] with sum equal to the zero vector. Our task is to show that we can partition into two smaller sums with sum zero. (*)

It seems likely that the role of 2022 is not important, and will just determine how large n must be. I was hoping there might be a quick proof along the following lines: Let \mathcal V denote the collection of vectors, and A,B\subset \mathcal V some subsets, inducing sums \Sigma_A,\Sigma_B respectively. Clearly \Sigma_A=-\Sigma_B if B=A^c, and we might expect to be able to find A\ne B such that \Sigma_A=\Sigma_B. I was hoping one could then move some vectors around (eg from B\mapsto A^c) to produce disjoint sets A,B for which \Sigma_A=-\Sigma_B, but I couldn’t see any way to achieve this.

This doesn’t mean it’s impossible though of course! And perhaps a contestant will find a successful argument along these lines during the olympiad.

In fact, it seems that a careful induction (on the number of cards, ie the dimension of the vectors) is the best way to proceed, and I will leave that there.

BMO1 2021

The first round of the British Mathematical Olympiad was sat on Thursday by roughly 2000 pupils in the UK, and a significant number overseas on Friday.

For obvious reasons, much of the past 18 months has been dominated by logistical rather than academic problems, so it’s refreshing to be returning to the traditional format of this exam, even if the marking has been stayed online.

I wasn’t involved in setting in this paper, so here are some thoughts on the problems. These aren’t supposed to be official solutions, and it’s entirely possible that I’m missing the most natural approach or framing. Students reading this in future years are advised, as always, that such commentaries are normally more valuable after attempting and digesting the problem yourself first.

Note that the copyright to the problems is retained by BMO, and I’m reproducing here with permission. The paper can be found in its original format here.

Problem One

If one exhibits eighteen (= 6 x 3) explicit examples of relevant sums of consecutive positive odd numbers, one has finished the problem. An exhaustive search is impractical and uninteresting. However, beyond that, we would also like to find integers N which we are convinced have this property without needing to explicitly decompose them into the desired sum format.

Let’s study such a sum. We need an even number of odd numbers to end up with an even sum. Probably the algebraically neatest way is to set up such a sum symmetrically as:

2n-k, 2n-(k-2),\ldots,2n-1,2n+1,2n+3,\ldots, 2n+k,

for k an odd number. The sum of this sequence of consecutive odd numbers is 2n(k+1). So, we seek integers N which can be written as 2nm, where m is an even number, in as many ways as possible. Note that if we have an integer N which can be factorised as a product of two distinct even numbers we are forced to take the larger as 2n, and the smaller as m, since we require 2n-k to be strictly positive.

It’s equivalent to finding integers M which are less than 100 and have lots of factors (then multiplying by 4 to recover an N), and there are a few ways to do this, including now potentially by exhaustive search. It is useful to remember that is M can be factorised into primes as M=p_1^{a_1}\ldots p_k^{a_k} then its factors can be described as the set:

\{1,p_1,p_1^2,\ldots,p_1^{a_1}\}\times \{1,p_2,p_2^2,\ldots,p_2^{a_2}\}\times\cdots \times \{1,p_k,p_k^2,\ldots,p_k^{a_k}\},

In particular, the number of such factors is (a_1+1)(a_2+1)\ldots(a_k+1), independently of the choice of the primes. This gives a recipe for constructing all integers M<100 with at least twelve factors. We could take (a_i) to be some permutation of (11), or (5,1), or (3,2), or (2,1,1), which give the following valid M:

M=2^5\times 3 = 96, \; M=2^3\times 3^2 = 72, \; M=2^2\times 3\times 5 = 60, M=2^2\times 3 \times 7 = 84,\; M=3^2\times 2 \times 5 = 90,

and the corresponding valid N are 240, 288, 336, 360, and 384.

For what it’s worth, this problem took me a while. I spent a long time saying things like “we need numbers that can be written as 4n and as 8n and as…” and aiming to conclude “all multiples of 24 greater than 24 x ?? will work”, but this didn’t ever generate enough examples of one class.

Problem Two

In order to have won exactly 30% of the games so far, the number of games so far must be a multiple of 10, and the same is true for 70% success rate. The other success rates have similar constraints, but the multiples are smaller. After ten games, of course Arun cannot have simultaneously won both 30% and 70% of the total! So we need at least twenty games.

It feels as if we should be able to include the other success percentages so long as we can make 30% and 70% work. But can we do this in twenty games? This would require Arun to win 3 out of the first 10, and 14 out of the first 20 (or the exact opposite), and after some head-scratching, this is impossible, since it would require winning eleven out of the ten games between these observations. This is a nice observation, and we now know the answer must be at least thirty. Given that we have made some nifty, rigorous deductions already, one might speculate that this is the true answer. To verify this, we just have to give a construction, and there are a number of ways to achieve this, for example such that Arun wins

  • One of the first two
  • Two of the first five
  • Three of the first ten
  • Twelve of the first twenty
  • Twenty-one of the first thirty

To get full credit, almost any justification that a construction exists for 30 games would be fine – but one must be given. Otherwise how do you know that the true bound isn’t 40 or 50 because there’s some other reason why 30 fails?

Problem Three

Facing this problem for the first time, there’s a temptation to try values smaller than 2021, and this may well be useful. If nothing else, such experiments may well lead to a conjecture of the general form of the answer. Personally, I find it relatively easy to make mistakes when dealing with small numbers, and would turn to this as a second strategy after trying more theoretical routes first.

First note that the condition 0\le n\le 11 could be replaced by n\ge 0, since we are forced to include zero pieces of weight 2^{11}=2048, and would similarly be forced to include zero pieces of weight 2^n for any n\ge 11. (%)

When counting a potentially complex set, we should start by seeing whether there is anything which we can control effectively. For example, I don’t think it is clear how to control the number of pieces of weight 64 in a configuration. However, powers of two lend themselves well to inductive arguments (if you take the powers of two and multiply them all by two, you still have powers of two!) and to studying parity (that is, whether a number is odd or even).

In this case, note that you must have either one or three of the pieces with weight one to ensure the total weight is odd. Similarly, if the goal had been 2022, we would need either zero or two of these pieces to ensure the total weight is even.

So that settles the number of pieces of weight one, and it remains to solve the problem for total weight 2018 and 2020 with the extra condition that we are no longer allowed to use any pieces of weight one. A way to capture this is to divide everything in the problem by two. So we are now trying to capture all the ways to generate weights 1009 and 1010.

We can reformulate this more generally, and assert that the number of configurations f(n) to make total weight n satisfies the relations:

f(2n)=f(2n+1)=f(n)+f(n-1). (*)

It would be quite tiresome to solve backwards in a binary search fashion, but perhaps this is where trying some small examples turns useful. Either by observation on the original problem, or by thinking about (*) with some initial condition like f(0)=f(1)=1, we find that f(n)=\lfloor n/2\rfloor +1. (That is, f(n)=\frac{n+2}{2} when n is even, and f(n)=\frac{n+1}{2} when n is odd.)

As a final aside, this question is an excellent exercise in using generating functions in an olympiad context. A couple of points are worth making:

  1. It is important to decide whether you’re including f(0) as an example, much more so than in the original argument. As we’ll see, it works better if you do include it, and will avoid an awkward sequence of algebra steps.
  2. Figuring out the answer, for example from small-ish cases, will make a generating function argument easier unless you are very fluent with manipulating rational functions. In this case, we would be studying \sum_{n=0}^\infty f(n)x^n, and trying to show that it is equal to

\sum (\lfloor n/2\rfloor +1)x^n = (1+x)(1+2x^2+3x^4+\ldots) = \frac{(1+x)}{(1-x^2)^2}. (*)

Many readers who’ve played around a bit with generating functions may find it easier to go from the left to the right than vice versa.

Writing down a direct generating function for f(n) and showing it matches up with (*) remains a good exercise. It is helpful to bear in mind the discussion at (%).

Problem Four

Since I last blogged regularly, I have given a series of masterclass on Euclidean geometry three times, including twice in an online format.

Part of the point of this masterclass is to get younger students to think in more general terms about what they are doing when they are solving geometry problems, even when they are straightforward. In a competition, time is tight, and anything that gets the job done is good, but there’s still value in reflecting on these ideas here. Two ideas in the first session are:

  • Attacking the conclusion coherently
  • Clean angle chasing

and this problem offers a good example of both of these in practice.

We’ll start with attacking the conclusion coherently. We’ve got to prove that a triangle is equilateral. How can we achieve this? There are a number of ways:

  • Prove all three lengths are equal
  • Prove that all three angles are equal
  • Prove that two angles are equal to 60 degrees.
  • Prove that one angle is equal to 60, and (any) two lengths are equal
  • Prove that one angle is equal to 60, and the other two angles are equal

It’s important to keep all of these in mind as one explores the configuration. It’s genuinely unclear at the very start which of these characterisations of equilateral triangles will be most appropriate. Note though, that the configuration is symmetric in the roles of D and E, so that, for example, proving that the angle at D is 60 would be sufficient, since by symmetry the angle at E would also be 60, and we would conclude.

As regards clean angle chasing, the main moral here is to avoid angle arithmetic unless you are convinced it’s necessary. Ideally one keeps track of what could be determined in terms of other quantities, and only attempts the algebra/arithmetic once you are sure that it will work. It certainly isn’t helpful to have diagrams littered with angle measures like 240^\circ - \theta - \frac{\alpha}{2}. Equality is the best thing to aim for. Try and demonstrate and notate equal angles wherever possible.

For example, here, even before introducing C,D,E, there is plenty of structure which might useful. Essentially every length in the configuration involving (O_1,O_2,A,B) is equal, and all the angles are 60 or 120.

So we bear this in mind. I would consider proving that \angle DO_1E=60^\circ with the following figure.

There are number of ways to complete the argument. Here are a couple of more obscure ones that illustrate my approach.

Firstly, let’s ignore the result that \angle DO_1E=60^\circ and proceed directly to \angle O_2DO_1=60^\circ. It suffices to show that the blue angles are equal, which is equivalent to demonstrating that ADO_2O_1 is cyclic.

To achieve this, we would like to show that the angle \angle AO_2D is equal to the red angle we’ve already discussed, and this follows by chasing round, using that O_1 is the centre of triangle ABC, and then the fact that AO_2D is the external angle opposite \angle ABC in cyclic quadrilateral ABCO_2. Indeed, note that this use of an equal external angle rather than ‘opposite angle is 180 – […]’ is the textbook example of clear angle chasing, where focusing on equality rather than arithmetic massively cleans up many diagrams.

As a second, rather overblown way to complete the argument, let’s show that the angles at D and E are equal, without calculating them. Since O_1D,O_1E are angle bisectors, we have two pairs of equal angles as shown.

It would suffice to show that these measures are themselves equal. This is saying that line ECO_2D is the external angle bisector of \angle ACB. And this is true, since O_2 is the arc-midpoint of AB on \Gamma_1!

I hope these observations are interesting. But, to reiterate, this level of complexity is unnecessary, and arguments by congruent triangles or direct angle chasing are perfectly satisfactory too!

Problem Five

The punchline of this problem is that m(N) is an integer precisely if N is a triangle number, from which it is quick to count the number of such N.

But how to arrive that this punchline? It is reasonably clear that the N-set with minimal mean must have the form (1,2,\ldots,k, N). One could proceed by setting up inequalities involving the means of (1,2,\ldots,k-1,N) and (1,2,\ldots,k,k+1,N) to establish a relationship between k and N.

But there is a more direct route. A useful, and very intuitive result about means is the following. If you have a set of numbers with mean x, and you add into the set a new number y, then the mean of the new larger set is between x and y. In particular, if y>x, then the new mean is larger than x.

In particular, consider the mean of (1,2,\ldots,k,N). If it is less than k, then removing k gives a smaller mean. If it is greater than k+1, then adding k+1 gives a smaller mean. So since the mean is an integer, it must be equal to k or k+1. That is, we have

\frac{1+2+\ldots+k+N}{k+1}=k\text{ or }k+1.

These lead to N=\frac{k(k+1)}{2}\text{ or }\frac{(k+1)(k+2)}{2}.

Problem Six

There must be some significance to the choice of 71 terms, and the 999,999 but it’s hopeless to try and see this immediately. In this scenario, it really is helpful to play around with semi-small examples.

Note that the second term is 1 regardless of whether we view this as P for PREVIOUS or S for SUM. However, if we then have an S, our sequence starts 1,1,2,2… or 1,1,2,4,… and these are bad, as once we have two even consecutive even numbers, we are doomed to see even numbers forever.

So once we have a common factor, that common factor is retained. How would we introduce a factor of, say, 11? Well, if you take 1,1,1,1,1,1,1,1,1,1,1 then take an S, you get 11, and afterwards you get either 11 or 22.

This gives a clue of the right way to set up the problem. For definiteness, we consider the second term to be an S. If you then consider a sequence of Ss and Ps, and write down the number of Ps between successive Ss (which may be zero) as a list p_1,p_2,p_3,\ldots, it holds that \sum p_i=70, and that each string of p_i Ps introduces a factor of p_i+1, leading to

\prod (p_i+1)=999,999=3^3\times 7\times 11\times 13\times 37.

Note that 2\times 3 + 6+10+12+36=70, so there exists a choice of the p_is that satisfies these relations simultaneously. (And if we had replaced 71 by a smaller choice, this would have been impossible.)

The actual answer to the problem is obtained by studying how to rearrange the blocks of SPPP…s, but the main challenge lies in reducing to this form.

I thought this was a hard problem for BMO, and was impressed to see some solutions under exam conditions.

Convex ordering on Galton-Watson trees

This blog was recently revived, via a post about convex ordering and its relevance to the problem of sampling with and without replacement that forms part of the potpourri of related results all sometimes referred to as Hoeffding’s inequality.

The previous post had been lying almost-complete but dormant for over two years. I revisited it because of a short but open-ended question about trees posed in our research group meeting by Serte Donderwinkel, one of our high-flying doctoral students.

Simplified Question:

For a Galton-Watson tree, can one obtain upper bounds in probability on the height of the tree, uniformly across all offspring distributions with mean \mu?

Note that in this setting, it is helpful to have in mind the classical notation (Z_0,Z_1,Z_2,\ldots) for a Galton-Watson process, where typically Z_0=1, and Z_{n+1} is the sum of Z_n IID copies of the offspring distribution. Then we have

\mathbb{P}(\mathrm{height}(\mathcal{T}) < k) = \mathbb{P}(Z_k=0).

1) Subcritical case. When \mu<1, we certainly have \mathbb{P}(Z_k>0)\le \mathbb{E}[Z_k]=\mu^k.

Furthermore, if we’re studying all such offspring distributions, this is the best possible upper bound, by considering the offspring distribution given by Z_1=1 with probability \mu and zero otherwise.

2) In the critical or supercritical case, \mu\ge 1 it is possible that the height is infinite with probability one.

So neither case is especially interesting for now.

Refined question:

What if instead we aren’t trying to obtain a bound uniformly across all offspring distributions with given mean \mu, but instead across a subset \mathcal{X} of these distributions? How do we determine which distribution in \mathcal{X} maximises the probability of reaching height k?

This is the question Serte was asking in our group meeting, in the setting where \mu=1+o(1) and the height k has a particular scaling. Also, as far as I understand, the approach outlined in this post didn’t provide strong enough bounds in this particular context. Happily, Serte has recently tied up all the corners of this project concerning the supercritical Galton-Watson forest, and interested readers can find her preprint here on the Arxiv.

Nonetheless the interpretation via convex ordering feels perfect for a blog post, rather than being lost forever.

Convex ordering for offspring distributions

The main observation is that given two offspring distributions X and Y, such that X\le_{cx} Y (which recall means that the means are the same but X is more concentrated) then a number of distributions associated to the Galton-Watson trees for X and Y also satisfy convex ordering relations.

As a warm-up, and because it was the original genesis, we first study heights. We will use the notation

(Z_0^X,Z_1^X,Z_2^X,\ldots), (Z_0^Y,Z_1^Y,Z_2^Y,\ldots),

to denote the two Galton-Watson processes. We shall compare \mathbb{P}(Z^X_k=0) and \mathbb{P}(Z^Y_k=0). If we write \delta_0(\cdot) for the function defined on the non-negative integers such that

\delta_0(0)=1,\quad \delta_0(n)=0,\,n\ge 1,

it holds that \delta_0(\cdot) is convex. In particular, if X\le_{cx}Y, then \mathbb{E}[\delta_0(X)]\le \mathbb{E}[\delta_0(Y)], which exactly says that

\mathbb{P}(Z^X_1 = 0)\le \mathbb{P}(Z^Y_1 = 0).

We can then prove that \mathbb{P}(Z^X_k=0)\le \mathbb{P}(Z^Y_k=0) by induction on k\ge 1. Note that \mathbb{P}(Z^X_k=0)^n is a convex function of n, regardless of the value of this probability, and so we have

\mathbb{P}(Z^X_{k+1}=0) = \mathbb{E}\left[ (\mathbb{P}(Z^X_k=0))^X\right] \le \mathbb{E}\left[(\mathbb{P}(Z^X_k=0))^Y\right].

By the induction hypothesis, this final quantity is at most

\mathbb{E}\left[(\mathbb{P}(Z^Y_k=0))^Y\right] = \mathbb{P}(Z^Y_{k+1}=0).

In conclusion, we have shown that \mathbb{P}(Z^X_k=0)\le \mathbb{P}(Z^Y_k=0) holds for all k, and thus

\mathrm{height}(\mathcal{T}^X) \ge_{st} \mathrm{height}(\mathcal{T}^Y).

To return to the original context, suppose we have a large class of offspring distributions \mathcal{Y} and a subclass \mathcal{X}\subseteq \mathcal{Y} such that for all Y\in\mathcal{Y}, there exists X\in \mathcal{X} such that X\le_{cx} Y. Then one can obtain uniform bounds on the heights of Galton-Watson trees with offspring distributions drawn from \mathcal{Y} by checking those generated from distributions in \mathcal{X} (which is particularly convenient if, for example, \mathcal{X} is finite).

Convex ordering of generation sizes

The above argument solves the original problem, but brushes over the natural question: is it true that Z^X_k \le_{cx} Z^Y_k?

The answer is yes. Here’s a proof.

This follows from the following general statement:

Lemma 1: Suppose X\le_{cx} Y are non-negative valued RVs and the non-negative integer valued RVs M,N also satisfy M \le_{cx} N. Then

X_1+\ldots+X_M \le_{cx} Y_1+\ldots Y_N,

where X_1,X_2,\ldots are IID copies of X and, independently, Y_1,Y_2,\ldots are IID copies of Y.

Lemma 2: Suppose W_1\le_{cx}Z_1 and W_2\le_{cx} Z_2, and the four random variables are independent. Then W_1+W_2\le_{cx}Z_1+Z_2.

Proof of Lemma 2: First, note that for any random variable X, and convex function f

\mathbb{E}\left[f(Z+x)\right] is a convex function of x.

(Indeed, this holds since “f(z+x) is convex” holds for every z, and any definition of convex will pass to the expectation.)

Now we can attack the lemma directly, we may write

\mathbb{E}\left[ f(W_1+W_2)\right]=\mathbb{E}\left[\, \mathbb{E}[f(W_1+W_2) \mid W_2 ] \,\right] \le \mathbb{E}\left[\, \mathbb{E}[f(W_1+Z_2)\mid Z_2 ] \, \right].

But then for any z_2, we know f(\cdot+z_2) is convex, so \mathbb{E}[f(W_1+z_2)]\le \mathbb{E}[f(Z_1+z_2)], and it follows that

\mathbb{E}\left[ f(W_1+W_2)\right]\le \mathbb{E} \left[ f(Z_1+Z_2)\right],

which proves the lemma.

Corollary 3: When W_1,\ldots,W_m, Z_1,\ldots,Z_m are independent, and satisfy W_i \le_{cx} Z_i, then we have W_1+\ldots+W_m\le_{cx} Z_1+\ldots+Z_m.

Proof of Lemma 1: Note that

\mathbb{E}\left[ f(X_1+\ldots+X_M)\mid M=n\right] \le \mathbb{E}\left[ f(Y_1+\ldots+Y_N)\mid N=n\right],

follows from Corollary 3. So a useful question to consider is whether \mathbb{E}\left[f(Y_1+\ldots+Y_n)\right] (*) is a convex function of n?

Denote this quantity by F(n). To check convexity of a function defined on the integers, it suffices to verify that F(n+1)-F(n)\ge F(n)-F(n-1).

There is a canonical coupling between the RVs used to define all of F(n-1),F(n),F(n+1), but it will be convenient to adjust the coupling, and write:

F(n+1)-F(n)= \mathbb{E}\left[ f(Y_1+\ldots+Y_n + Y^*) - f(Y_1+\ldots+Y_n)\right],

F(n)-F(n-1)=\mathbb{E}\left[f(Y_1+\ldots+Y_{n-1}+Y^*) - f(Y_1+\ldots+Y_{n-1})\right],

where Y^* is a further independent copy of Y. But note that for any choice C\ge c and y\in \mathbb{R},

f(C+y) - f(C) - f(c+y) + f(c)\ge 0. (*)

(Essentially, this says that the ‘chord’ of f on the interval [c,C+y] lies above the chord on interval [C,c+y] or [c+y,C], which some people choose to call Karamata’s inequality, but I think is more helpful to think of as part of the visual definition of convexity.)

In any case, setting y=Y^*, c=Y_1+\ldots+Y_{n-1}, C=Y_1+\ldots+Y_n and taking expectations, we obtain

\mathbb{E}\left[ f(Y_1+\ldots+Y_n+Y^*) - f(Y_1+\ldots+Y_n)\right.

\left.- f(Y_1+\ldots+Y_{n-1}+Y^*) + f(Y_1+\ldots+Y_{n-1})\right]\ge 0,

as required. So F(n) is convex. We may now finish off as

\mathbb{E}\left[ X_1+\ldots+X_M\right] = \mathbb{E}\left[ \,\mathbb{E}[X_1+\ldots+X_M\mid M]\,\right] \le \mathbb{E}\left[\, \mathbb{E}[Y_1+\ldots+Y_M\mid M]\,\right] = \mathbb{E}[f(M)]\le \mathbb{E}[f(N)] = \mathbb{E}[Y_1+\ldots+Y_N],

completing the proof of Lemma 1.

Final comments

  • The analysis in this post is not sufficient to study the total population sizes of two Galton-Watson trees generated by X and Y. Note that in Lemma 2, it is important that the random variables are independent. Otherwise, we could, for example, consider \mathbb{E}[X]=\mathbb{E}[Y]=0 with X\le_{cx}Y but clearly it should not hold that X_1+X_2 \le_{cx} Y + (-Y) = 0. So for total population size, since (Z^X_k,\,k\ge 1) are not independent, an alternative approach would be required.
  • A further characterisation of convex ordering is given by Strassen’s theorem [Str65], which is touched on in the previous post, and to which I may return to in a future post on this topic. This may be a more promising avenue for established a convex ordering result on total population size.
  • Lemma 1 requires that X,Y are non-negative. Note that during the argument we set y=Y^*, c=Y_1+\ldots+Y_{n-1}, C=Y_1+\ldots+Y_n, and when we relax the non-negative support condition, it is no longer guaranteed that C\ge c, which is crucial for the step which follows.
  • In a recent article in ECP addressing Lemma 1 by a different method, Berard and Juillet [BJ20] provide a simple example showing that the non-negative assumption is genuinely necessary. Consider the random variable \tau\in \{0,2\} with equal probability so 1\le_{cx} \tau. But then, taking both X and Y to be simple random walk on \mathbb{Z}, we do not have S_1\le_{cx}S_{\tau}.

References

[BJ20] – Berard, Juillet – A coupling proof of convex ordering for compound distributions, 2020

[Str65] – Strassen – The existence of probability measures with given marginals, 1965

Hoeffding’s inequality and convex ordering

A number of previous posts on this blog have discussed Markov’s inequality, and the control one can obtain on a distribution by applying this to functions of the distribution, or functions of random variables with this distribution. In particular, studying the square of the deviation from the mean \left(X-\mathbb{E}X\right)^2 leads to Chebyshev’s inequality, while even stronger bounds can sometimes be obtained using e^{tX}, for some appropriate value of t\in\mathbb{R}. Sometimes this final method, which relies on the finiteness of \mathbb{E}[e^{tX}] is referred to as a Chernoff bound, especially when an attempt is made to choose the optimal value of t for the estimate under consideration.

The case of a random walk, that is the sum of IID increments, is particularly well-studied. In [Ho63], Hoeffding studies a generalisation where the increments are merely independent, and outlines a further generalisation to discrete-time martingales, which is developed by Azuma in [Az67], leading to the famous Azuma-Hoeffding inequality. A key feature of this theorem is that we require bounded increments, which leads to tight choices of t_i in the Chernoff bound for each increment.

In addition, [Ho63] addresses the alternative model where the increments of a random walk are chosen uniformly without replacement from a particular set. The potted summary is that the sum of random increments chosen without replacement has the same mean, but is more concentrated that the corresponding sum of random increments chosen with replacement. This means that any of the concentration results proved in the earlier sections of [Ho63] for the latter situation apply equally to the setting without replacement.

Various versions of this result are referred to as Hoeffding’s inequality by the literature, and the main goal of this short article is to list some results which are definitely true about this setting!

(Note, in some contexts, ‘Hoeffding’s inequality’ will mean the concentration bound for either the with-replacement or the without-replacement setting. In this blog post, we are discussing inequalities comparing these two settings.)

Starting point – second moments in ‘vanilla’ Hoeffding

We won’t use this notation heavily, but for now let’s assume that we are studying a (multi)set of values \mathcal{X}=\{x_1,\ldots,x_N\}, and will study a sequence of m elements (X_1,\ldots,X_m) from this set with replacement (ie chosen in an IID fashion), and a sequence of m elements (Y_1,\ldots,Y_m) chosen from this set without replacement.

Some preliminary remarks:

  • The Y_is are still chosen uniformly which can be interpreted as
    • Any permutation of the multiset \{x_1,\ldots,x_n\} is equally likely to appear as (Y_1,\ldots,Y_n);
    • If determining the sequence one element at a time, then in a conditional sense, the choice of the ‘next’ element is uniform among those elements which have not yet been chosen.
  • In particular, the sequence (Y_1,\ldots,Y_m) is exchangeable (though not knowing what that means will not be a significant impediment to what follows).
  • It is clear that \mathbb{E}[X_1+\ldots+X_m]=\mathbb{E}[Y_1+\ldots+Y_m], so concentration around this common mean is the next question of interest.
  • Intuitively, when m\ll N, we wouldn’t expect there to be much difference between sampling with or without replacement.
  • When sampling without replacement, the cases m and N-m are symmetric with respect to the sum of \mathcal{X} and, in particular, have the same concentration.

As a brief warmup, we establish that

\mathrm{Var}(Y_1+\ldots+Y_m)\le \mathrm{Var}(X_1+\ldots+X_m).

Note that the RHS is m\sigma^2, where \sigma^2<\infty is the variance of a single uniform sample from \mathcal{X}. Since these sums have the same mean, it suffices to show that

\mathbb{E}\left[ (Y_1+\ldots+Y_m)^2\right] \le \mathbb{E}\left[ (X_1+\ldots+X_m)^2\right].

But this really has nothing to do with probability now, since these quantities are just symmetric polynomials in \mathcal{X}. That said, we gain a little bit if we use the exchangeability property to rewrite the LHS and the RHS as

n\mathbb{E}[Y_1^2] + n(n-1)\mathbb{E}[Y_1Y_2] \le n\mathbb{E}[X_1^2] + n(n-1)\mathbb{E}[X_1X_2].

Since Y_1\stackrel{d}= X_1, it remains to handle the cross term, which involves showing that

\frac{1}{N(N-1)}\sum_{i\ne j} x_ix_j \le \frac{1}{N^2} \sum_{i,j}x_ix_j,

which really is an algebra exercise, and can be demonstrated by the rearrangement inequality, or Chebyshev’s inequality (the other one – ie sequence-valued FKG not the probabilistic second-moment bound one), or by clever pre-multiplying and cancelling to reduce to Cauchy-Schwarz.

In [BPS18], which we will turn to in much greater detail shortly, the authors assert in the first sentence that Hoeffding treats also the setting of weighted sums, that is a comparison of

\alpha_1X_1+\ldots+\alpha_mX_m,\quad \alpha_1Y_1+\ldots+\alpha_mY_m.

A casual reader of [Ho63] may struggle initially to identify that this has indeed happened. At the level of second-moment comparison though, the argument given above carries over essentially immediately. We have

\mathbb{E}\left[ (\alpha_1Y_1+\ldots+\alpha_n Y_n)^2\right] = \left(\alpha_1^2+\ldots+\alpha_n^2\right) \mathbb{E}[Y_1^2] + \left(\sum_{i\ne j} \alpha_i\alpha_j \right)\mathbb{E}\left[Y_1Y_2\right],

and of course the corresponding decomposition for (X_i), from which the variance comparison inequality follows as before.

Convex ordering

For many applications, variance is sufficient as a measure of concentration, but one can say more. As a prelude, note that one characterisation of stochastic ordering (which I have sometimes referred to as stochastic domination on this blog) is that

X\le_{st}Y \;\iff\;\mathbb{E}[f(X)] \le \mathbb{E}[f(Y)],

whenever f is a non-decreasing function. Unsurprisingly, it’s the \Leftarrow direction which is harder to prove, though often the \Rightarrow direction is the most useful in applications.

[To inform what follows, I used a number of references, in particular [BM05].]

A natural extension to handle concentration is (stochastic) convex ordering, which is defined as

X\le_{cx} Y\;\iff \; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],

whenever f is a convex function. Note immediately that this definition can only ever apply to random variables for which \mathbb{E}X=\mathbb{E}Y. Otherwise, consider the functions f(x)=x, f(x)=-x, both of which are convex.

This sort of comparison has plenty of application in economics and actuarial fields, where there is a notion of risk-aversion, and convexity is the usual criterion to define risk (in the sense that the danger of a significant loss is viewed as more significant than a corresponding possibility of a significant gain). Also note that in this context, X is preferred to Y by a risk-averse observer.

In fact, it can be shown [Oh69] that a sufficient condition for X\le_{cx} Y is the existence of \alpha\in\mathbb{R} such that

F_X(x)\le F_Y(x)\;\iff\; x\ge \alpha,

where F_X(x)=\mathbb{P}(X\le x) is the usual distribution function of X. That is, the distribution functions ‘cross exactly once’. Note that in general \alpha\ne \mathbb{E}X=\mathbb{E}Y.

We will return later to the question of adapting this measure to handle distributions with different expectations, but for now, we will return to the setting of sampling with and without replacement and show that

Y_1+\ldots+Y_m\le_{cx}X_1+\ldots+X_m.

This is the result shown by Hoeffding, but by a more complicated method. We use the observations of [BPS18] which are stated in a more complicated setting which we may return to later. For notational convenience, let’s assume the elements of \mathcal{X} are distinct so that ‘repetitions of x_i‘ has an unambiguous meaning.

We set up a natural coupling of (X_i),(Y_i) as follows. Let (X_i,i\ge 1) be IID uniform samples from \mathcal{X}, and generate (Y_1,\ldots,Y_N) by removing any repetitions from (X_1,X_2,\ldots).

Note then that given the sequence (Y_1,\ldots,Y_m), the multiset of values \{X_1,\ldots,X_m\} definitely includes Y_1 at least once, but otherwise includes each Y_i either zero, one or more times, in a complicated way. However, this distribution doesn’t depend on the values taken by (Y_i), since it is based on uniform sampling. In particular, if we condition instead on the set \{Y_1,\ldots,Y_m\}, we find that

\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, \{Y_1,\ldots,Y_m\}\right] = Y_1+\ldots+Y_m,

and so we may coarsen the conditioning to obtain

\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, Y_1+\ldots+Y_m\right] = Y_1+\ldots+Y_m.

This so-called martingale coupling is very useful for proving convex ordering. You can think of \mathbb{E}[X|Y]=Y as saying “X is generated by Y plus additional randomness”, which is consistent with the notion that Y is more concentrated than X. In any case, it meshes well with Jensen’s inequality, since if f is convex, we obtain

\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] = \mathbb{E}\left[ f(Y)\right].

So we have shown that sampling without replacement is greater than sampling with replacement in the convex ordering, if we consider sums of the samples.

If we want to handle weighted sums, and tried to reproduce the previous argument directly, it would fail, since

\mathbb{E}\left[\alpha_1X_1+\ldots+\alpha_nX_n\,\big|\,\{Y_1,\ldots,Y_m\}\right]=(\alpha_1+\ldots+\alpha_m)(Y_1+\ldots+Y_m),

which is not in general equal to \alpha_1Y_1+\ldots+\alpha_mY_m.

Non-uniform weighted sampling

Of course, it’s not always the case that we want to sample according to the uniform distribution. My recent personal motivation for investigating this setup has been the notion of exploring a graph generated by the configuration model, for which vertices appear for the first time in a random order that is size-biased by their degrees.

In general we can think of each element of \mathcal{X} as having some weight w(i), and at each step, we pick x_i from the remaining elements with probability proportion to w(i). This procedure is known in other contexts as successive sampling. Note that the sequence (Y_1,\ldots,Y_m) is no longer exchangeable. This is clear enough when there are two elements in \mathcal{X}, and w(1)\gg w(2). So it is much more likely that (Y_1,Y_2)=(x_1,x_2) than (Y_1,Y_2)=(x_2,x_1).

In addition, it will in general not be the case that

\mathbb{E}[Y_1+\ldots+Y_m]=\mathbb{E}[X_1+\ldots+X_m],

and this is easily seen again in the above example, where \mathbb{E}[Y_1+Y_2]=x_1+x_2 while \mathbb{E}[X_1+X_2]\approx 2x_1.

However, one can still compare such distributions, with the notion of increasing convex ordering, which is defined as:

X\le_{icx} Y\;\iff\; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],

for all convex non-decreasing functions f. From an economic point view, increasing convex ordering becomes a comment on the relative contribution of losses above some threshold between the two distributions (and that X is better or worse than Y for all such thresholds), and is sometimes called stop-loss order. More formally, as in [BM05] we have:

X\le_{icx}Y\;\iff\; \mathbb{E}[\max(X-t,0)]\le \mathbb{E}[\max(Y-t,0)]\,\forall t\in\mathbb{R}.

In [BPS18], the authors study the case of successive sampling where the weights are monotone-increasing in the values, that is w(i)\ge w(j)\;\iff\; x_i\ge x_j, of which size-biased sampling is a special case. With this setup, the heuristic is that ‘large values are more likely to appear earlier’ in the sample without replacement, and making this notion precise underpins all the work.

The coupling introduced above is particularly useful in this more exotic setting. The authors show that conditional on the event A of the form \{Y_1,\ldots,Y_m\}=\{x_{i_1}\le x_{i_2}\le\ldots\le x_{i_m}\}, one has

\mathbb{P}(X_1=x_{i_j}\,\big|\,\mathbf{A}) \le \mathbb{P}(X_1=x_{i_k}\,\big|\,\mathbf{A}),\text{ precisely when }j\le k.

This is achieved by comparing the explicit probabilities of pairs of orderings for (Y_1,\ldots,Y_m) with x_{i_j},x_{i_k} exchanged. See [BPS18] for this short calculation – a link to the Arxiv version is in the references. So to calculate \mathbb{E}[X_1\,\big|\, \mathbf{A}], we have the setup for the rearrangement inequality / Chebyshev again, as

\mathbb{E}[X_1\,\big|\, \mathbf{A}] = \sum_{j=1}^m x_{i_j}\mathbb{P}(X_1=x_{i_j}) \le \frac{1}{n}\left(\sum_{j=1}^m x_{i_j}\right)\left(\sum \mathbb{P}\right) = \frac{\left(Y_1+\ldots+Y_m\,\big|\,\mathbf{A}\right)}{m}.

As before, letting X,Y be the sums of the samples with and without replacement, respectively, we have \mathbb{E}[X\,\big|\, Y] \ge Y, which the authors term a submartingale coupling. In particular, the argument

\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] \ge \mathbb{E}\left[ f(Y)\right],

works exactly as before to establish convex ordering.

References

[Az67] – Azuma – 1967 – Weighted sums of certain dependent random variables

[BM05] – Bauerle, Muller – 2005 – Stochastic orders and risk measures: consistency and bounds. (Online version here)

[BPS18] – Ben Hamou, Peres, Salez – 2018 – Weighted sampling without replacement (ArXiv version here)

[Ho63] – Hoeffding – 1963 – Probability inequalities for sums of bounded random variables

[Oh69] – Ohlin – 1969 – On a class of measures of dispersion with application to optimal reinsurance

EGMO 2019

Last week, we held our annual IMO training and selection camp in the lovely surroundings of Trinity College, Cambridge. Four of our students have subsequently spent this week in Kiev, for the ninth edition of the prestigious European Girls’ Mathematical Olympiad.

The UK team, none of whom had attended the competition before, and all of whom remain eligible to return at least once, did extremely well, placing fourth out of the official European countries, and earning three silver medals, and one gold medal, with complete or almost-complete solutions to all six problems between them. More details are available on the competition website.

In this post, I’m going to discuss some of the non-geometry problems. As a special treat, the discussion of Question 6 is led by Joe Benton, which is only fitting, since he wrote it. You can find the first day’s three problems here, and the second day’s problems here. The geometry problems are treated in a separate post.

Problem One

Triple equalities are hard to handle directly, so generally one starts with a single equality, for example the left hand one a^2b+c = b^2c+a, after noting that the setup is cyclic (but not symmetric) in the variables, under cycling a\to b \to c\to a.

Some quick notes:

  • The given equations are inhomogeneous, meaning that not every term has the same degree in terms of {a,b,c}. In complicated equations, normally we want to cancel terms, and we certainly can’t cancel terms with different degrees! So a good first idea is to find a way to make the equations homogeneous, for example using the condition, which is itself inhomogeneous. For example, one could replace c with c(ab+bc+ca), and it’s clear this might lead to some cancellation. In this case, the algebra is more straightforward if one rearranges and factorises first to obtain a(1-ab)=c(1-bc), from which the condition gives a(bc+ca)=c(ab+ac).
  • Shortly after this, we find a^2c=ac^2, which means a=c unless at least one of these terms is equal to zero.
  • This case distinction should not be brushed over lightly as a tiny detail. This is the sort of thing that can lose you significant marks, especially if the omitted case turns out to be more involved than the standard case.
  • This is a good example of planning your final write-up. The case distinction \{a,c\ne 0\}, \{a\text{ or }c=0\} is really clumsy. What about b? We have already said that the setup is cyclic in (a,b,c), and it’s annoying to throw this aspect away. It really would be much much to have the overall case distinction at the start of your argument, for example into \{a,b,c \ne 0\} and \{a\text{ or }b\text{ or }c=0\}, because then in the latter case you could assume that a=0 without loss of generality, then see what that implied for the other variables.
  • In this setting, one really should check that the claimed solutions satisfy both the condition and the triple-equality. This is a complicated family of simultaneous equations, and even if you’re lucky enough to have settled on an argument that is reversible, it’s much easier to demonstrate that everything is satisfied as desired than actually to argue that the argument is reversible.

Problem Two

Maybe I speak for a number of people when I say I’m a little tired of domino tiling problems, so will leave this one out.

Problem Five

I liked this problem. A lot. It did feel like there were potential time-wasting bear traps one might slip into, and perhaps these comments are relevant:

  • This feels like quite a natural problem to ask. In particular, the task specified by (A)+(B) is much more elegant than the RHS of (C). So unless you have immediate insight for the RHS of (C), it makes sense to start with the task, and aim for any bound similar to (C).
  • There are a lot of transformations one could make, for example, repeatedly removing n from one of the a_ns, or re-ordering them conveniently somehow, which shouldn’t affect any solution. You can also attack the RHS of (C) in a number of ways (perhaps with a case split for n odd or even?). As with involved but easy angle chases discussed in the companion post, it’s probably better to hold this in mind than instantly to do all of these, since it will just obscure the argument if you end up not using the result of the simplification!
  • One should always try small examples. Here, it’s worth trying examples large enough to get a sense of what’s happening, because I think the crucial observation is that (unless you were very very unlucky) there are lots of suitable sequences B=(b_i). (*)
  • Certainly the case where the (a_i)s themselves form a complete n-residue system is ‘easiest’ and certainly straightforward. One might say that the case where the (a_i) are equal (or at least congruent) is ‘hardest’ (in that (b_i-a_i) might have to take the largest values in some sense), but is also straightforward. There’s certainly the option of trying to rigorise this, but I don’t know whether this can be made to work.

Continue reading

Lecture 8 – Bounds in the critical window

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

Preliminary – positive correlation, Harris inequality

I wrote about independence, association, and the FKG property a long time ago, while I was still an undergraduate taking a first course on Percolation in Cambridge. That post is here. In the lecture I discussed the special case of the FKG inequality applied in the setting of product measure setting, of which the Erdos-Renyi random graph is an example, and which is sometimes referred to as the Harris inequality.

Given two increasing events A and B, say for graphs on [n], then if \mathbb{P} is product measure on the edge set, we have

\mathbb{P}(A\cap B)\ge \mathbb{P}(A)\mathbb{P}(B).

Intuitively, since both A and B are ‘positively-correlated’ with the not-rigorous notion of ‘having more edges’, then are genuinely positively-correlated with each other. We will use this later in the post, in the form \mathbb{E}[X|A]\ge \mathbb{E}[X], whenever X is an increasing RV and A is an increasing event.

The critical window

During the course, we’ve discussed separately the key qualitative features of the random graph G(n,\frac{\lambda}{n}) in the

  • subcritical regime when \lambda<1, for which we showed that all the components are small, in the sense that \frac{1}{n}|L_1| \stackrel{\mathbb{P}}\rightarrow 0, although the same argument would also give |L_1|\le K\log n with high probability if we used stronger Chernoff bounds;
  • supercritical regime when \lambda>1, for which there is a unique giant component , ie that \frac{1}{n}|L_1|\stackrel{\mathbb{P}}\rightarrow \zeta_\lambda>0, the survival probability of a Galton-Watson branching process with Poisson(\lambda) offspring distribution. Arguing for example by a duality argument shows that with high probability all other components are small in the same sense as in the subcritical regime.

In between, of course we should study G(n,\frac{1}{n}), for which it was known that L_1\stackrel{d}\sim n^{2/3},\, L_2\stackrel{d}\sim n^{2/3},\ldots. (*) That is, the largest components are on the scale n^{2/3}, and there are lots of such critical components.

In the early work on random graphs, the story ended roughly there. But in the 80s, these questions were revived, and considerable work by Bollobas and Luczak, among many others, started investigating the critical setting in more detail. In particular, between the subcritical and the supercritical regimes, the ratio \frac{|L_2|}{|L_1|} between the sizes of the largest and second-largest components goes from ‘concentrated on 1’ to ‘concentrated on 0’. So it is reasonable to ask what finer scaling of the edge probability p(n) around \frac{1}{n} should be chosen to see this transition happen.

Critical window

In this lecture, we studied the critical window, describing sequences of probabilities of the form

p(n)=\frac{1+\lambda n^{-1/3}}{n},

where \lambda\in(-\infty,+\infty). (Obviously, this is a different use of \lambda to previous lectures.)

It turns out that as we move \lambda from -\infty to +\infty, this window gives exactly the right scaling to see the transition of \frac{|L_2|}{|L_1|} described above. Work by Bollobas and Luczak and many co-authors and others in the 80s establish a large number of results in this window, but for the purposes of this course, this can be summarised as saying that the critical window has the same scaling behaviour as p(n)=1/n, with a large number of components on the scale \sim n^{2/3} (see (*) earlier), but different scaling limits.

Note: Earlier in the course, we have discussed local limits, in particular for G(n,\lambda/n), where the local limit is a Galton-Watson branching process tree with offspring distribution \mathrm{Poisson}(\lambda). Such local properties are not sufficient to distinguish between different probabilities within the critical window. Although there are lots of critical components, it remains the case that asymptotically almost all vertices are in ‘small components’.

The precise form of the scaling limit for

\frac{1}{n^{2/3}} \left( |L_1|, |L_2|, |L_3|,\ldots \right)

as n\rightarrow\infty was shown by Aldous in 1997, by lifting a scaling limit result for the exploration process, which was discussed in this previous lecture and this one too. Since Brownian motion lies outside the assumed background for this course, we can’t discuss that, so this lecture establishes upper bounds on the correct scale of |L_1| in the critical window. Continue reading

BMO2 2019

The second round of the British Mathematical Olympiad was taken on Thursday by the 100 or so top scoring eligible participants from the first round, as well as some open entries. Qualifying for BMO2 is worth celebrating in its own right. The goal of the setters is to find the sweet spot of difficult but stimulating for the eligible participants, which ultimately means it’s likely to be the most challenging exam many of the candidates sit while in high school, at least in mathematics.

I know that lots of students view BMO2 as something actively worth preparing for. As with everything, this is a good attitude in moderation. Part of the goal in writing about the questions at such length is because I think at this level it’s particularly easy to devote more time than needed to preparation, and use it poorly. This year time is tight at the end of semester, and so what follows is closer to a set of complete solutions than usual, for which apologies, although I hope it is still possible to get a sense of how one might have come across the solutions yourself. Of course, this means that what follows will certainly spoil the problems for anyone who hasn’t tried them by themselves already.

The copyright for the problems is held by BMOS, and reproduced here with permission.

Question 1

As if often the case in geometry questions, what you’ve been asked to prove here isn’t the most natural property of the configuration. A good first step would be to see if there are stronger statements which are true.You are asked to show that triangle BPE is isosceles, but you aren’t told which of the three vertices is the apex. In fact, the task is to show that BP=EP or, alternatively, \angle BEP=\angle PBE. It’s not in general true that BE is equal to BP=EP. Unless you’re very unlucky, you can establish this from one diagram.

Now, you don’t immediately know whether it’s going to be easier to show that two lengths are equal, or that two angles are equal. However, you know that P lies on the perpendicular bisector of BC, hence BP=CP, which is a big clue. In particular, this means that P would be the centre of the circle through BCE. This clearly implies the given result, so deciding to prove this instead is a good strategy.

There are now a number of ways to prove this. Note that D lies on the altitude from A, and the feet of the perpendiculars from D to sides AB and AC are both present in the configuration so (just as for the orthocentre diagram) we can calculate most of the angles involving {A,B,C,D,E}.

For example, ABDE is cyclic, so \angle BED=\angle BAD = 90-\hat{B}, hence \angle AEB=\hat{B},\,\angle EBA=\hat{C}. This shows that AB is tangent to the circumcircle of BCE. But then the line L is a radius of this circle, and so its centre must be P, the unique point on L which is equidistant from B and C.

Alternatively, we could directly calculate \angle BEC=180-\hat{B} and \angle CBP=90-\hat{B}. But BPC is isosceles so \angle BPC=2\hat{B}. In general, the converse of ‘angle at centre is twice angle at circumference’ does not hold, but when we know P is equidistant from B and C this does hold, and so the angle relations precisely confirm that P is the centre of the circle through BPE.

My intention had been that the triangle would be acute-angled, to reduce the number of diagram options based on the magnitude of \hat{B}. If pursuing this second approach, one would need to be careful to account for whether P is on the same side or the opposite side of BC to E. That said, unless you do something very exotic, it should be exactly the same argument or calculation, and such a case distinction probably isn’t very important.

Question 2

First, a short remark. As stated, if n=5, a piece could move 3 squares to the left then 4 squares up, by Pythagoras. Handling all such options is likely to be quite annoying, since some values of n can be written in this Pythagorean form, and others cannot. This brings us to some good general principles for olympiad problems which look like this one:

  • A construction, when one exists, will probably be possible using simple versions of the allowed moves / structures.
  • An argument why a construction is impossible should probably be based on ideas which treat the simple moves similarly to the more complicated moves.

The setup of the problem encourages you to think about dividing the board into n^2 sub-boards, each with dimensions n\times n. Continue reading

Lecture 10 – the configuration model

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we enter the final stages of the semester, I want to discuss some extensions to the standard Erdos-Renyi random graph which has been the focus of most of the course so far. Although we will not get far into the details during this course, the overall goal is to develop models which are close to Erdos-Renyi in terms of ease of analysis, while also allowing more of the features characteristic of networks observed in the real world.

One of the more obvious deficiencies of the sparse regime of Erdos-Renyi random graphs for modelling ‘real-world phenomena’ concerns the degree sequence. Indeed, the empirical degree distribution of G(n,c/n) converges to Poisson(c). By contrast, in real-world networks, a much wider range of degrees is typically observed, and in many cases it is felt that these should follow a power law, with a small number of a very highly connected agents.

One way around this problem to construct random graphs where we insist that the graph has a given sequence of degrees. The configuration model, which is the subject of this lecture and this post (and about which I’ve written before), offers one way to achieve this.

Definition and notes

Let n\ge 1 and let d=(d_1,d_2,\ldots,d_n) be a sequence of non-negative integers such that \sum_{i=1}^n d_i is even. Then the configuration model with degree sequence d is a random multigraph with vertex set [n], constructed as follows:

  • To each vertex i\in[n], assign d_i half-edges;
  • Then, take a uniform matching of these half-edges;
  • Finally, for each pair of half-edges in the matching, replace the two half-edges with a genuine edge, to obtain the multigraph CM_n(d), in which, by construction, vertex i has degree d_i.

One should note immediately that although the matching is uniform, the multigraph is not uniform amongst multigraphs with that degree sequence. Note also that the condition on the sums of the degrees is necessary for any graph, and in this context means that the number of half-edges is even, without which it would not be possible to construct a matching.

This effect is manifest in the simplest possible example, when n=2 and d=(3,3). There are two possible graphs, up to isomorphism, which are shown below:

For obvious reasons, we might refer to these as the handcuffs and the theta , respectively. It’s helpful if we, temporarily, assume the half-edges are distinguishable at the moment we join them up in the configuration model construction. Because then there are 3×3=9 ways to join them up to form the handcuffs (think of which half-edge ends up forming the edge between the two vertices) while there are 3!=6 ways to pair up the half-edges in the theta.

In general, for multigraphs H with the correct degree sequence, we have

\mathbb{P}( CM_n(d)\simeq H) \propto \left( 2^{\# \text{loops}(H)} \prod_{e\in E(H)} \text{mult}(e)! \right),

where \text{mult}(e) is the multiplicity with which a given edge e appears in H.

Note: it might seem counterintuitive that this procedure is biased against multiple edges and self-loops, but it is really just saying that there are more ways to form two distinct edges than to form two equal edges (ie a multiedge pair) when we view the half-edges as distinguishable. (See this post for further discussion of this aspect in the 3-regular setting.)

However, a consequence of this result is that if we condition on the event that CM_n(d) is simple, then the resulting random graph is uniform on the set of simple graphs satisfying the degree property. Note that the same example as above shows that there’s no guarantee that there exists a simple graph whose degrees are some given sequence.

d-regular configuration model

In general, from a modelling point of view, we are particularly interested in simple, connected graphs, and so it is valuable to study whether the large examples of the configuration model are likely to have these properties. In this lecture, I will mainly focus on the case where the multigraphs are d-regular, meaning that all the vertices have degree equal to d. For the purposes of this lecture, we denote by G^d(n), the d-regular configuration model CM_n(d,\ldots,d).

  • d=1: to satisfy the parity condition on the sums of degrees, we must have n even. But then G^1(n) will consist of n/2 disjoint edges.
  • d=2: G^2(n) will consist of some number of disjoint cycles, and it is a straightforward calculation to check that when n is large, with high probability the graph will be disconnected.

In particular, I will focus on the case when d=3, which is the first interesting case. Most of the results we prove here can be generalised (under various conditions) to more general examples of the configuration model. The main goal of the lecture is revision of some techniques of the course, plus one new one, in a fresh setting, and the strongest possible versions of many of these results can be found amongst the references listed at the end.

Connectedness

In the lecture, we showed that G^3(2n) is connected with high probability. This is, in fact, a very weak result, since in fact G^d(n) is d-connected with high probability for d\ge 3 [Bol81, Wor81]. Here, d-connected means that one must remove at least d vertices in order to disconnect the graph, or, equivalently, that there are d disjoint paths between any pair of vertices. Furthermore, Bollobas shows that for d\ge 3, G^d(n) is a (random) expander family [Bol88].

Anyway, for the purposes of this course, the main tool is direct enumeration. The matching number M_{2k} satisfies

M_{2k}=(2k-1)\times (2k-3)\times\ldots\times 3\times 1 = \frac{(2k)!}{2^k \cdot k!},

and so Stirling’s approximation gives the asymptotics

M_{2k} = (\sqrt{2}+o(1)) \left(\frac{2}{e}\right)^k k^k,

although it will be useful to use the true bounds

c \left(\frac{2}{e}\right)^k k^k \le M_{2k}\le C\left(\frac{2}{e}\right)^k k^k,\quad \forall k,

instead in some places. Anyway, in G^3(2n), there are 6n half-edges in total, and so the probability that the graph may be split into two parts consisting of 2\ell,2m vertices, with 2\ell+2m=2n, and with no edges between the classes is \frac{\binom{2n}{2\ell} M_{6\ell}M_{6m}}{M_{6n}}. Continue reading

Lecture 9 – Inhomogeneous random graphs

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we enter the final stages of the semester, I want to discuss some extensions to the standard Erdos-Renyi random graph which has been the focus of most of the course so far. In doing so, we can revisit material that we have already covered, and discover how easily one can extend this directly to more exotic settings.

The focus of this lecture was the model of inhomogeneous random graphs (IRGs) introduced by Soderberg [Sod02] and first studied rigorously by Bollobas, Janson and Riordan [BJR07]. Soderberg and this blog post address the case where vertices have a type drawn from a finite set. BJR address the setting with more general typespaces, in particular a continuum of types. This generalisation is essential if one wants to use IRGs to model effects more sophisticated than those of the classical Erdos-Renyi model G(n,c/n), but most of the methodology is present in the finite-type setting, and avoids the operator theory language which is perhaps intimidating for a first-time reader.

Inhomogeneous random graphs

Throughout, k\ge 2 is fixed. A graph with k types is a graph G=(V,E) together with a type function V\to \{1,\ldots,k\}. We will refer to a k\times k symmetric matrix with non-negative entries as a kernel.

Given n\in\mathbb{N} and a vector p=(p_1,\ldots,p_k)\in\mathbb{N}_0^k satisfying \sum p_i=n, and \kappa a kernel, we define the inhomogeneous random graph G^n(p,\kappa) with k types as:

  • the vertex set is [n],
  • types are assigned uniformly at random to the vertices such that exactly p_i vertices have type i.
  • Conditional on these types, each edge v\leftrightarrow w (for v\ne w\in [n]) is present, independently, with probability

1 - \exp\left(-\frac{\kappa_{\mathrm{type}(v),\mathrm{type}(w)} }{n} \right).

Notes on the definition:

  • Alternatively, we could assign the types so that vertices \{1,\ldots,p_1\} have type 1, \{p_1+1,\ldots,p_1+p_2\} have type 2, etc etc. This makes no difference except in terms of the notation we have to use if we want to use exchangeability arguments later.
  • An alternative model considers some distribution \pi on [k], and assigns the types of the vertices of [n] in an IID fashion according to \pi. Essentially all the same results hold for these two models. (For example, this model with ‘random types’ can be studied by quenching the number of each type!) Often one works with whichever model seems easier for a given proof.
  • Note that the edge probability given is \approx \frac{\kappa_{\mathrm{type}(v),\mathrm{type}(w)}}{n}. The exponential form has a more natural interpretation if we ever need to turn the IRGs into a process. Additionally, it avoids the requirement to treat small values of n (for which, a priori, k/n might be greater than 1) separately.

In the above example, one can see that, roughly speaking, red vertices are more likely to be connected to each other than blue vertices. However, for both colours, they are more likely to be connected to a given vertex of the same colour than a vertex of the opposite colour. This might, for example, correspond to the kernel \begin{pmatrix}3&1\\1&2\end{pmatrix}.

The definition given above corresponds to a sparse setting, where the typical vertex degrees are \Theta(1). Obviously, one can set up an inhomogeneous random graph in a dense regime by an identical argument.

From an applications point of view, it’s not hard to imagine that an IRG of some flavour might be a good model for many phenomena observed in reality, especially when a mean-field assumption is somewhat appropriate. The friendships of boys and girls in primary school seems a particularly resonant example, though doubtless there are many others.

One particular application is to recover the types of the vertices from the topology of the graph. That is, if you see the above picture without the colours, can you work out which vertices are red, and which are blue? (Assuming you know the kernel.) This is clearly impossible to do with anything like certainty in the sparse setting – how does one decide about isolated vertices, for example? The probabilities that a red vertex is isolated and that a blue vertex is isolated differ by a constant factor in the n\rightarrow\infty limit. But in the dense setting, one can achieve this with high confidence. When studying such statistical questions, these IRGs are often referred to as stochastic block models, and the recent survey of Abbe [Abbe] gives a very rich history of this type of problem in this setting.

Poisson multitype branching processes

As in the case of the classical random graph G(n,c/n), we learn a lot about the IRG by studying its local structure. Let’s assume from now on that we are given a sequence of IRGs G^n(p^n,\kappa) for which \frac{p^n}{n}\rightarrow \pi, where \pi=(\pi_1,\ldots,\pi_k)\in[0,1]^k satisfies ||\pi||_1=1.

Now, let v^n be a uniformly-chosen vertex in [n]. Clearly \mathrm{type}(v^n)\stackrel{d}\rightarrow \pi, with the immediate mild notation abuse of viewing \pi as a probability distribution on [k].

Then, conditional on \mathrm{type}(v^n)=i:

  • when j\ne i, the number of type j neighbours of v^n is distributed as \mathrm{Bin}\left(p_j,1-\exp\left(-\frac{\kappa_{i,j}}{n}\right)\right).
  • the number of type i neighbours of v^n is distributed as \mathrm{Bin}\left( p_i-1,1-\exp\left(-\frac{\kappa_{i,i}}{n}\right)\right).

Note that p_j\left[1-\exp\left(-\frac{\kappa_{i,j}}{n}\right)\right]\approx \frac{p_j\cdot \kappa_{i,j}}{n} \approx \kappa_{i,j}\pi_j, and similarly in the case j=i, so in both cases, the number of neighbours of type j is distributed approximately as \mathrm{Poisson}(\kappa_{i,j}\pi_j).

This motivates the following definition of a branching process tree, whose vertices have k types. Continue reading