Borel-Cantelli Lemmas

I’m currently lecturing my first course at King’s, which builds measure theory from the ground up to the construction of the Lebesgue integral, along with some more probabilistic topics. In this second week, we have been discussing various matters related to the construction of sigma-algebras. It took me slightly by surprise that this ends up being the natural moment during an undergraduate’s progression through probability to introduce the Borel-Cantelli lemmas. It’s not hard to imagine a pedagogical schedule in which they appear earlier, but equally, one benefits from seeing them with a bit more mathematical experience, even if they are only very loosely related to the construction of the Lebesgue integral!

Anyway, I noticed that I have not written about these results on the blog before. There are a couple of neat applications that lie slightly beyond the remit (and time allowance) of the lecture course, so they are presented at the end.

Context – events holding eventually and infinitely often

Recall that (in the notation of probability spaces) a sigma-algebra \mathcal F provides a set of events A\subseteq \Omega for which it is meaningful and consistent to define a probability measure. This is extremely uncontroversial in the setting where the set of outcomes \Omega is finite, but thornier in the setting where \Omega is uncountable, for example [0,1] or \mathbb{R}. The key axiom for sigma-algebras involves countable unions. If (A_n)_{n\ge 1} is a sequence of events in \mathcal F, we must also have \bigcup_{n\ge 1}A_n\in\mathcal F.

Since the axioms also demand that \mathcal F is closed under taking complements, it only takes a few steps to show that countable intersections are also valid, ie \bigcap_{n\ge 1}A_n\in\mathcal F also. It is uncontroversial to note that

\bigcap_{n\ge 1}A_n \subseteq \bigcup_{n\ge 1}A_n.

The main initial motivation for considering the events \{A_n\text{ infinitely often}\},\{A_n\text{ eventually}\} are that they lie between the intersection and the union in terms of containment.

  • ‘Infinitely often’ is relatively uncontroversial, just meaning ‘for infinitely many values of n’.
  • ‘Eventually’ is less clear on an initial reading. Perhaps ‘eventually always’ would be better? [And closer to the title of this blog!] The meaning is ‘always, except for finitely many n’, which implies that there is some threshold N such that it holds for all n>N.
  • In set notation, infinitely often (or i.o.) means that \bigcup_{k\ge n}A_k must hold for every n, and so the event \{A_n\text{ i.o.}\}=\bigcap_n\bigcup_{k\ge n}A_k.
  • Similarly, eventually means that there must be an n for which \bigcap_{k\ge n}A_k holds, so the event \{A_n\text{ eventually}\}=\bigcup_n\bigcap_{k\ge n}A_k.

We end up with the richer containment sequence

\bigcap_{n\ge 1}A_n\subseteq \{A_n\text{ eventually}\}\subseteq \{A_n\text{ i.o.}\}\subseteq \bigcup_{n\ge 1}A_n.

These individual containment relations may not be obvious a priori in this form. Probably the most clear is the central containment, that if an A_n holds for all except finitely many n, then it holds for infinitely many n. The right-most says that if it holds infinitely often, then it holds at least once. For the left-most containment, note that \bigcap_{n\ge 1}A_n is literally the same as the n=1 case of \bigcap_{k\ge n}A_k, so taking a union over other values of n only increases the set.

It’s also worth thinking about the complements. Note that \left(\bigcap_{n\ge 1}A_n\right)^c = \bigcup_{n\ge 1} A_n^c and \left(\bigcup_{n\ge 1}A_n\right)^c=\bigcap_{n\ge 1}A_n^c. Versions of these apply to eventually/i.o. also. In particular

\{A_n\text{ i.o.}\}^c = \{A_n\text{ for only finitely many }n\} = \{A_n^c\text{ eventually}\}.

and \{A_n\text{ eventually}\}^c = \{A_n^c\text{ for infinitely many }n\}.

So in any calculation or application, there is flexibility to study (A_n^c) is this is more convenient. As we will see, it makes sense to focus on whichever of (A_n),(A_n^c) has vanishing probabilities as n\to\infty.

Comment: It is standard to write \limsup_{n}A_n= \bigcap_{n\ge 1}\bigcup_{k\ge n}A_k, and \liminf_n A_n=\bigcup_{n\ge 1}\bigcap_{k\ge n}A_k. I am happy to admit I get these confused with non-vanishing probability, and prefer to stick to eventually/i.o. or the full union/intersection statements for safety!

Borel-Cantelli Lemmas

The Borel-Cantelli lemmas establish some conditions under which the event \{A_n\text{ i.o.}\} almost surely holds, or almost surely doesn’t hold. We’ll state them now:

BC1: If \sum_{n\ge 1}\mathbb{P}(A_n)<\infty, then \mathbb{P}(A_n\text{ i.o.})=0.

BC2: If \sum_{n\ge 1}\mathbb{P}(A_n)=\infty and the events (A_n) are independent, then \mathbb{P}(A_n\text{ i.o.})=1.

Proof of BC1: Since \{A_n\text{ i.o.}\}\subseteq \bigcup_{k\ge n}A_k for every $n$, we have

\mathbb{P}(A_n\text{ i.o.}) \le \sum_{k\ge n}\mathbb{P}(A_k)\;\stackrel{n\to\infty}\longrightarrow\; 0,

since \sum \mathbb{P}(A_n)<\infty.

Comment: we could also introduce the random variable N which counts how many events A_n occur, ie N=\sum_{n\ge 1}\mathbf{1}_{A_n}. Then \{A_n\text{ i.o.}\}=\{N=\infty\}, and the given condition is equivalent to \mathbb{E}[N]<\infty, which certainly implies \mathbb{P}(N=\infty)=0. Indeed, it’s reasonable to think of BC1 as a first-moment result, with BC2 as (sort of) a second-moment result, and this is reflected in the comparative complexity of the two proofs.

Proof of BC2: As discussed, it is equivalent to show that \mathbb{P}(A_n^c\text{ eventually})=0. To do this, we study the intersections of the complements, whose probabilities are tractable using independence:

\mathbb{P}\left( \bigcup_{k\ge n}A_k^c\right) = \prod_{k\ge n}\mathbb{P}(A_k^c)=\prod_{k\ge n}(1-\mathbb{P}(A_k))

\le \prod_{k\ge n}\exp(-\mathbb{P}(A_k))=\exp(-\sum_{k\ge n}\mathbb{P}(A_k)).

We have assumed that \sum_{k\ge n}\mathbb{P}(A_k)=\infty for each n, and so we obtain

\mathbb{P}\left(\bigcup_{k\ge n}A_k^c\right)=0,\quad\forall n\ge 1.

But since (\bigcap_{k\ge n}A_k^c)_{n\ge 1} is an increasing sequence of events, so taking a union over n is particularly natural. Precisely, by continuity of measure, we have

\mathbb{P}\left(\bigcup_{n\ge 1}\bigcap_{k\ge n}A_k^c\right) = \lim_{n\to\infty}\mathbb{P}\left( \bigcap_{k\ge n}A_k^c\right)=0.

Comment: the independence condition is certainly necessary, as otherwise one can have situations where A_1\supseteq A_2\supseteq\cdots. For example, with U\sim \mathrm{Unif}[0,1], consider A_n:=\{U\le \frac1n\}, for which \mathbb{P}(A_n)=\frac1n, whose sum diverges, while clearly \mathbb{P}(A_n\text{ i.o.})=\mathbb{P}(U=0)=0.

Classic examples: the famous ‘infinite monkey theorem’, relatively well-known outside the realm of theoretical probability, asserts that a monkey typing randomly at a keyboard for long enough will eventually type the complete works of William Shakespeare. Other standard examples involve independent (A_n) with probabilities decaying at various speeds. The latter do have some overlap with the slightly less standard examples presented below.

Example – generating random permutations

Consider a sequence of random permutations (\sigma_n)_{n\ge 1} where \sigma_n is a permutation of [n]. We construct these iteratively as follows: \sigma_1 is the identity (ie the only permutation of one element!); then for each n\ge 2, define \sigma_n from \sigma_{n-1} by inserting element n at a uniformly-chosen position in \sigma_{n-1}.

We ask the question whether this gives a well-defined limiting permutation \sigma_\infty on \mathbb{N}. A necessary condition for this is that the first element \sigma_n(1) is eventually constant. But in fact the events \{\sigma_n(1)=n\} are independent, and each occurs with probability 1/n, and so

\mathbb{P}(\text{first element changes i.o.})\ge \mathbb{P}(\sigma_n(1)=n\text{ for infinitely many }n)=1,

by BC2.

On reflection, each \sigma_n has the uniform distribution on \Sigma_n, permutations of [n]. So our result is consistent with the idea that one cannot construct a uniform random permutation on an infinite set.

Now, alternatively, suppose the new element is added at place \max(n-X_n,1) where (X_2,X_3,\ldots) are IID \mathrm{Geometric}(q) random variables. (The truncation is just to prevent pathologies like trying to add the new entry in ‘place -6’.) Then


and since \sum_n (1-q)^{n-2}<\infty, BC1 shows that the first element (\sigma_1(1),\sigma_2(1),\sigma_3(1),\ldots) changes only finitely often. A similar argument applies for the second element, and the third element, and so on. Consequently, almost surely we have a well-defined pointwise limit permutation \sigma_\infty. This is known as the Mallows Permutation on the integers, and is one of the more popular models for a random permutation on an infinite set, with lots of interesting properties. I supervised an undergraduate research project on this model over Summer 2022, and will write more when time allows.

Example – well-separated random integers

The final example is a toy version of a construction I’ve been working on recently regarding subsets of the vertices of a graph which ‘trap’ random walk on that graph.

Suppose we are trying to colour some positive integers randomly, so that the coloured integers are infinite, but also asymptotically sparse. We also do not want pairs of coloured integers to be close together. Let’s say that two integers m,n are close if |m-n|\le \sqrt{\max(m,n)}.

Now, suppose we colour each integer n green independently with probability \frac1n. Then, whenever we have two integers m,n which are close, and coloured green, we re-colour them both red.

We would like to prove that infinitely many integers are coloured green. Let C_n:={n\text{ coloured with some colour}}. Using BC2, we know that almost surely, infinitely many are coloured with some colour, and it seems implausible that many would be close. However, the events G_n={n\text{ coloured green}} are not independent, so one cannot apply BC2 directly to these. However, we can study instead R_n={n\text{ coloured red}}. We can bound these probabilities as

\mathbb{P}(R_n)\le\frac{1}{n}\frac{2\sqrt{n}}{n-\sqrt{n}} = \frac{2+o(1)}{n^{3/2}},

and apply BC1 to show \mathbb{P}(R_n\text{ i.o.})=0. Then, since \{G_n\text{ i.o.}\}\supseteq \{C_n\text{ i.o.}\}\setminus{R_n\text{ i.o.}}, we obtain \mathbb{P}(G_n\text{ i.o.})=1, as required.


Random transpositions

We study a procedure for generating a random sequence of permutations of [N]. We start with the identity permutation, and then in each step, we choose two elements uniformly at random, and swap them. We obtain a sequence of permutations, where each term is obtained from the previous one by multiplying by a uniformly-chosen transposition.

Some more formality and some technical remarks:

  • This is a Markov chain, and as often with Markov chains, it would be better it was aperiodic. As described, the cycle will alternate between odd and even permutations. So we allow the two elements chosen to be the same. This laziness slows down the chain by a factor N-1/N, but removes periodicity. We will work over timescales where this adjustment makes no practical difference.
  • Let \tau_1,\tau_2,\ldots be the sequence of transpositions. We could define the sequence of permutations by \pi_m= \tau_m\cdot\tau_{m-1}\cdot \ldots\cdot \tau_1. I find it slightly more helpful to think of swapping the elements in places i and j, rather the elements i and j themselves, and so I’ll use this language, for which \pi_m = \tau_1\cdot \tau_2\cdot\ldots \cdot \tau_m is the appropriate description. Of course, transpositions and the identity are self-inverse permutations, so it makes no difference to anything we might discuss.
  • You can view this as lazy random walk on the Cayley graph of S_N generated by the set of transpositions. That is, the vertices of the graph are elements of S_N, and two are connected by an edge if one can be obtained from the other by multiplying by a transposition. Note this relation is symmetric. Hence random transposition random walk.
  • Almost everything under discussion would work in continuous time too.

At a very general level, this sort of model is interesting because sometimes the only practical way to introduce ‘global randomness’ is repeatedly to apply ‘local randomness’. This is not the case for permutations – it is not hard to sample uniformly from S_N. But it is a tractable model in which to study relevant questions about the generating randomness on a complicated set through iterated local operations.

Since it is a Markov chain with a straightforward invariant distribution, we can ask about the mixing time. That is, the correct scaling for the number of moves before the random permutation is close in distribution (say in the sense of total variation distance) to the equilibrium distribution. See this series of posts for an odd collection of background material on the topic. Diaconis and Shahshahani [DS81] give an analytic argument for mixing around \frac{N\log N}{2} transpositions. Indeed they include a constant because it is a sharp cutoff, where the total variation distance drops from approximately 1 to approximately 0 in O(N) steps.

Comparison with Erdos-Renyi random graph process

In the previous result, one might observe that m=\frac{N\log N}{2} is also the threshold number of edges to guarantee connectivity of the Erdos-Renyi random graph G(N,m) with high probability. [ER59] Indeed, there is also a sharp transition around this threshold in this setting too.

We explore this link further. We can construct a sequence of random graphs simultaneously with the random transposition random walk. When we multiply by transposition (i j), we add edge ij in the graph. Laziness of RTRW and the possibility of multiple edges mean this definition isn’t literally the same as the conventional definition of a discrete-time Erdos-Renyi random graph process, but again this is not a problem for any of the effects we seek to study.

The similarity between the constructions is clear. But what about the differences? For the RTRW, we need to track more information than the random graph. That is, we need to know what order the transpositions were added, rather than merely which edges were added. However, the trade-off is that a permutation is a simpler object than a graph in the following sense. A permutation can be a described as a union of disjoint cycles. In an exchangeable setting, all the information about a random permutation is encoded in the lengths of the these cycles. Whereas in a graph, geometry is important. It’s an elegant property of the Erdos-Renyi process that we can forget about the geometry and treat it as a process on component sizes (indeed, a multiplicative coalescent process), but there are other questions we might need to ask for which we do have to study the graph structure itself.

Within this analogy, unfortunately the word cycle means different things in the two different settings. In a permutation, a cycle is a directed orbit, while in a graph it has the usual definition. I’m going to write graph-cycle whenever relevant to avoid confusion.

A first observation is that, under this equivalence, the cycles of the permutation form a finer partition than the components of the graph. This is obvious. If we split the vertices into sets A and B, and there are no edges between them, then nothing in set A will ever get moved out of set A by a transposition. (Note that the slickness of this analogy is the advantage of viewing a transposition as swapping the elements in places i and j.)

However, we might then ask under what circumstances is a cycle of the permutation the same as a component of the graph (rather than a strict subset of it). A first answer is the following:

Lemma: [Den59] The permutation formed by a product of transpositions corresponding in any order to a tree in the graph has a single cycle.

We can treat this as a standalone problem and argue in the following predictable fashion. (Indeed, I was tempted to set this as a problem during selection for the UK team for IMO 2017 – it’s perfectly suitable in this context I think.) The first transposition corresponds to some edge say ab, and removing this edge divides the vertices into components A \ni a, B\ni b. Since no further transposition swaps between places in A and places in B, the final permutation maps a into B and b into A, and otherwise preserves A and B.

This argument extends to later transpositions too. Now, suppose there are multiple cycles. Colour one of them. So during the process, the coloured labels move around. At some point, we must swap a coloured label with an uncoloured label. Consider this edge, between places a and b as before, and indeed the same conclusion holds. WLOG we move the coloured label from a to b. But then at the end of the process (ie in the permutation) there are more coloured labels in B than initially. But the number of coloured labels should be the same, because they just cycle around in the final permutation.

We can learn a bit more by trying thinking about the action on cycles (in the permutation) of adding a transposition. In the following pair of diagrams, the black arrows represent the original permutation (note it’s not helpful to think of the directed edges as having anything to do with transpositions now), the dashed line represents a new transposition, and the new arrows describe the new permutation which results from this product.

It’s clear from this that adding a transposition between places corresponding to different cycles causes the cycles to merge, while adding a transposition between places already in the same cycle causes the cycle to split into two cycles. Furthermore the sizes of the two cycles formed is related to the distance in the cycle between the places defining the transposition.

This allows us to prove the lemma by adding the edges of the tree one-at-a-time and using induction. The inductive claim is that cycles of the permutation exactly correspond to components of the partially-built tree. Assuming this claim guarantees that the next step is definitely a merge, not a split (otherwise the edge corresponding to the next step would have to form a cycle). If all N-1 steps are merges, then the number of cycles is reduced by one on each step, and so the final permutation must be a single cycle.

Uniform split-merge

This gives another framework for thinking about the RTRW itself, entirely in terms of cycle lengths as a partition of [N]. That is, given a partition, we choose a pair of parts in a size-biased way. If they are different, we merge them; and if it is the same part, with size k, we split them into two parts, with sizes chosen uniformly from { (1,k-1), (2,k-2), …  (k-1,1) }.

What’s nice about this is that it’s easy to generalise to real-valued partitions, eg of [0,1]. Given a partition of [0,1], we sample two IID U[0,1] random variables U_1,U_2. If these correspond to different parts, we replace these parts by a single part with size given by the sum. If these correspond to the same part, with size \alpha, we split this part into two parts with sizes |U_1-U_2| and \alpha - |U_1-U_2|. This is equivalent in a distributional sense to sampling another U[0,1] variable U and replacing \alpha with (\alpha U, \alpha(1-U)). We probably want our partition to live in \ell^1_\searrow, so we might have to reorder the parts afterwards too.

These uniform split-merge dynamics have a (unique) stationary distribution, the canonical Poisson-Dirichlet random partition, hereafter PD(0,1). This was first shown in [DMZZ04], and then in a framework more relevant to this post by Schramm [Sch08].

Conveniently, PD(0,1) is also the scaling limit of the cycle lengths in a uniform random permutation (scaled by N). The best way to see this is to start with the observation that the length of the cycle containing 1 in a permutation chosen uniformly from S_N has the uniform distribution on {1,…,N}. This matches up well with the uniform stick-breaking construction of PD(0,1), though other arguments are available too. Excellent background on Poisson-Dirichlet distributions and this construction and equivalence can be found in Chapter 3 of Pitman’s comprehensive St. Flour notes [CSP]. Also see this post, and the links within, with the caveat that my understanding of the topic was somewhat shaky then (as presently, for now).

However, Schramm says slightly more than this. As the Erdos-Renyi graph passes criticality, there is a well-defined (and whp unique) giant component including \Theta(N) vertices. It’s not clear that the corresponding permutation should have giant cycles. Indeed, whp the giant component has \Theta(N) surplus edges, so the process of cycle lengths will have undergone O(N) splits. Schramm shows that most of the labels within the giant component are contained in giant cycles in the permutation. Furthermore, the distribution of cycle lengths within the giant component, rescaled by the size of the giant component, converges in distribution to PD(0,1) at any supercritical time \frac{(1+\epsilon)N}{2}

This is definitely surprising, since we already know that the whole permutation doesn’t look close to uniform until time \frac{N\log N}{2}. Essentially, even though the size of the giant component is non-constant (ie it’s gaining vertices), the uniform split-merge process is happening to the cycles within it at rate N. So heuristically, at the level of the largest cycles, at any supercritical time we have a non-trivial partition, so at any slightly later time (eg \frac{(1+\epsilon/2)N}{2} and \frac{(1+\epsilon)N}{2} ), mixing will have comfortably occurred, and so the distribution is close to PD(0,1).

This is explained very clearly in the introduction of [Ber10], in which the approach is extended to a random walk on S_N driven by a uniform choice from any conjugacy class.

So this really does tell us how the global uniform randomness emerges. As the random graph process passes criticality, we have a positive mass of labels in a collection of giant cycles which are effectively a continuous-space uniform split-merge model near equilibrium (and thus with PD(0,1) marginals). The remaining cycles are small, corresponding to small trees which make up the remaining (subcritical by duality) components of the ER graph. These cycles slowly get absorbed into the giant cycles, but on a sufficiently slow timescale relevant to the split-merge dynamics that we do not need to think of a separate split-merge-with-immigration model. Total variation distance on permutations does feel the final few fixed points (corresponding to isolated vertices in the graph), hence the sharp cutoff corresponding to sharp transition in the number of isolated vertices.


[Ber10] – N. Berestycki – Emergence of giant cycles and slowdown transition in random transpositions and k-cycles. [arXiv version]

[CSP] – Pitman – Combinatorial stochastic processes. [pdf available]

[Den59] – Denes – the representation of a permutation as a product of a minimal number of transpositions, and its connection with the theory of graphs

[DS81] – Diaconis, Shahshahani – Generating a random permutation with random transpositions

[DMZZ04] – Diaconis, Mayer-Wolf, Zeitouni, Zerner – The Poisson-Dirichlet distribution is the unique invariant distribution for uniform split-merge transformations [link]

[ER59] – Erdos, Renyi – On random graphs I.

[Sch08] – Schramm – Compositions of random transpositions [book link]

The Top-to-Random Shuffle

This article is based on a talk I gave to the Maths Society at St Paul’s School on Monday. It may turn into a short series if I have time before I go to ALEA in Luminy near Marseille on Saturday.

My original plan had been to talk about riffle-shuffling, and some of the interesting mixing time themed results one can obtain. As a motivating example, I began by discussing the simpler top-to-random shuffle, and this proved sufficiently interesting to occupy the time I had been allowed (and mea culpa a bit more). It therefore seems worth writing a hopefully moderately accessible blog post on the subject. The aim of this post at least is to discuss the idea that repeatedly shuffling brings a pack of cards close to randomness. We have to settle on a definition of ‘close to randomness’, and find some ways to calculate this.

Suppose we are playing some bizarre card game where it is necessary that three cards labelled, uncontroversially, 1, 2 and 3 need to be placed in a random order. If we are organised, we can write down all the ways to do this in a list:

123, 132, 213, 231, 312, 321.

We want to select each of these with equal probability. We could for example use a dice. Most relevantly, even a computer as ancient as my laptop is very happy simulating a random choice from this set. (Now is not the time to talk about exactly how pseudo-random or otherwise this choice would be.)

Of course, when we play a sensible card game we have not three cards, but fifty-two. So the approach described above still works in theory, but no longer in practice, as the list of possible arrangements now has size 52!. Recall this is defined to be

52!=1\times 2 \times\ldots \times 52.

The reason we get this particular expression is that when we are choosing the first card, we have 52 possible choices. Then, regardless of what this first card actually is, there are precisely 51 cards left from which to choose the second card. So there are 52×51 ways to pick the first two cards in the arrangement, and so on, giving the answer. We can approximate how large 52! is by counting powers of ten rather crudely. It seems reasonable that it should be about 10^{65}. Note that the number of atoms in the universe is *only* about 10^{80}, so if we are going to write down this list, we better have very compact handwriting! But being serious, this number is way too large to realistically compute with, so we have to come up with some cleverer methods.

One way is to spread the cards out on a table then pick them up one at a time, ensuring at all times that the choice of card is uniform among those currently present, and not related to any of the past choices. This is relatively easy for a computer, but hard for a human, and certainly deeply tedious for anyone waiting to receive their hand!

So we seek a different approach, namely an algorithm for shuffling. Our aim is to introduce overall randomness by repeatedly applying some simple but random process. Note we have to be careful about our definition of ‘random’ here. The permutation 123456 is just as ‘random’ as the permutation 361524. That is, if they are fixed, then they are not random at all. Just because it is easier to decribe one of them verbally does not mean it is less random. For example, if I am trying to cheat at poker, then I might be able to if I knew the exact order of the cards in the pack before the dealer dealt. It wouldn’t matter what that order was. I would have to adjust my strategy based on the order, but it wouldn’t affect the fact that I had a massive advantage!

The shuffling algorithm to be discussed here is the top-to-random shuffle. Like all the best things in life, this does exactly what it says on the tin. At a given time, we remove the top card from the deck at present, and insert it at a randomly chosen point in the deck. This could be on the bottom, and it could also be back on the top. It feels like this possibility to remain constant can’t possibly help us, but later we will discuss why we need this.

In any case, it feels natural that if we keep applying this procedure, the arrangement of the deck should start to get more and more random, in the sense that knowing the original arrangement will tell us successively little about the current arrangement as time progresses. But we need to find a way to quantify this if we are to do any mathematics.

When we are talking about real numbers, it is fairly clear what it means if I say that the numbers 2, 1.1, 1.01, 1.001 and so on are getting closer and closer to 1. Indeed we can measure the distance along the number line between each term and 1, using the absolute difference. It is not so clear how to compute the distance between two probability distributions. Bearing in mind the fact that a distribution on the set of permutations of cards is defined to be a set of 52! probabilities that sum to 1, there will be a 52!-1 dimensional space (eg the plane is two-dimensional, the world is three-dimensional, *and so on* – whatever that means) where we have a nice distance formula already.

But this is not what we will choose to use. Rather we return to the cheating-at-poker analogy. Suppose I am playing some sort of game involving the pack of cards with my enemy. He or she thinks the deck is perfectly random, but I know the actual distribution. How big a profit can I make by exploiting this knowledge? This will be our measure of how far a distribution is from uniform. It turns out that this will coincide precisely with the formal definition of total variation distance, but that language belongs to a different level of rigour and is not relevant here.

What is relevant is an explanatory example. Suppose we start with the arrangement 12345678. We are now going to perform one iteration of the top-to-random shuffle. The outcome might, for example, be 23456178, if we insert the 1 between the 6 and the 7. Note there were 8 places for the card to go, so the probability of this particular outcome is 1/8. Now let’s see how I might use my knowledge of the distribution to my advantage. Suppose I suggest the bet that the bottom card is an 8. My enemy thinks the stack is uniformly randomly arranged, so the probability of this is 1/8. On the other hand, I know that the only way the 8 might disappear from the bottom is if I place the 1 under it, which happens with probability 1/8. So in fact, I know the probability of this event is 7/8, which gives me an advantage of 3/4. In fact, I could come up with bets that do even better than this, but they are less simple to describe verbally.

At what point do I lose this advantage? Well, we said that the probability that the 8 leaves the bottom of the stack is 1/8. And it will continue to be 1/8 on every turn where it is at the bottom. Recalling that the outcomes of successive shuffles are independent, note this is reminiscent of rolling a dice until a six comes up. The number of rolls required to get the six is an example of a geometric random variable. I don’t want to spoil S1 (or whichever module) by going into too much detail, but it turns out that if the probability of an event happening on a single go is p, then the average time we have to wait is 1/p. So 1/(1/8)=8 of course, and this is how long we typically have to wait before the bet I placed before becomes much less effective.

Now seems like a good time to stop talking about 8 cards and start talking about n cards. Obviously, in practice, we will want n to be 52. Anyway, by the same argument as before, it takes on average n iterations before the bottom card leaves the bottom. This is important, because after then, my bet that the bottom card is n is no longer so effective. However, I could equally place a bet that one of the bottom *two* cards is n.

So we consider how long it takes before n is no longer one of the bottom two cards. Well certainly we need to wait until it is no long *the* bottom card, which takes time n on average. Then, once it is second bottom, there is now a 2/n chance that we move the previously top card below it, so by the same argument as before, the time for this to happen is n/2 on average. If we want this effect to disappear, we have to wait until the original bottom card is in fact at the top of the pile for the first time, and by extending our previous argument, the average time for this is


Fortunately, we have tools for approximating this sort of sum, in particular integration, which is the practice of finding the area under certain curves. It turns out that the answer is roughly n log n. You can think of as log n a measure of the number of digits required to write out n. (This is not the exact definition but it will do for now. In any case, log n gets larger as n gets larger, but not very fast.) There’s a lot more about this in my previous post on the coupon collector problem, from a more technical point of view.

The next question will be to prove that it is actually quite well shuffled by this time, but that’s for another post. The other question to ask is whether this is satisfactory overall? For n=52, the number of operations we have to perform is about 230, which is fine for a computer, but deeply tedious for anyone sitting at a casino table waiting for the next hand. So next time we’ll talk about the riffle shuffle, which seems to introduce a lot of randomness in each go, but we’ll also see that we have to be careful, because the randomness may not be as great as our intuition suggests.

The Chinese Restaurant Process

A couple of months ago I wrote a post about Polya’s Urn, the simplest example of self-reinforcing process. Recall that we have a bag containing black and white balls, and sequentially we draw a ball then replace it together with an additional ball of the same colour. The process is self-reinforcing in the sense that if there is a surplus of black balls, the dynamics will reinforce this by adding more black balls than white balls. Alternatively, you can think of a natural limit process when the number of balls is large, for which any distribution is an invariant distribution. We have seen models such as the Preferential Attachment dynamics for network creation, where the degrees of vertices clearly have this self-reinforcing property. New vertices are more likely to join to existing vertices with large degrees.

One difference between the Polya Urn and some of the models we might be interested in for applications is that for the urn model, the number of classes (in this context colours of balls) is fixed. In many applications, we will want to allow new classes to appear. In the process which follows, we will allow this, and the new classes will have initial size equal to 1, so will be at a disadvantage for the self-reinforcing dynamics. Nonetheless, some will show up in a meaningful way in the limit. It is worth emphasising that Polya’s Urn gave us the Dirichlet distribution in the limit, and this can be thought of as a partition of [0,1]. These more general processes will give us a more interesting family of partitions, called the Poisson-Dirichlet distributions. These will turn up in a wide variety of contexts, and this is perhaps the friendliest way to introduce them.

The model is this. We start with a single diner who sits at the first table. Then whenever the (n+1)th diner arrives, they join a table with k diners already with probability k/n+1, and they start a new table with probability 1/n+1.

(Aside: I’m not exactly sure how this relates to a Chinese restaurant? It seems more reminiscent of a university dining hall during freshers’ week, but I guess that would be a less catchy name for a model.)

Anyway, the interest in this description lies not in organising seating arrangements. Consider choosing uniformly at random from the set of permutations on [n+1]. Suppose x maps to n+1 and n+1 maps to y. Consider taking the permutation of [n] formed by instead mapping x to y and ignoring n+1. This has the uniform distribution on the set of permutations of [n]. By reversing this procedure, we can construct a uniform permutation of [n+1] from a uniform permutation of [n]. When you do this as a process for n growing, observe that the orbits correspond exactly to tables in the Chinese Restaurant Process. If we wanted the CRP to give all the information about the permutation, we could specify the ordering round each table, by saying that with probability 1/n+1 the new diner sits to the left of any given existing diner.

As a starting point for why this is a useful description of the uniform permutation distribution, observe that the size of the component containing the element 1 evolves as a Polya Urn with initial vector (1,1). The second 1 in the initial vector corresponds to the possibility of starting a new table, which is maintained at every stage. This tells us immediately that as n grows to infinity, the proportion of elements in the same cycle as 1 in the uniform permutation converges in distribution to U[0,1]. The construction also allows for an easy proof that the expected number of cycles is roughly log n for large n, since on each pass of the process, the probability that there is a new cycle formed is 1/k.

In this case, the partition induced on [n] by the process is clearly exchangeable given our permutation description. However, this will turn out to hold in greater generality. Note also,, that conditional on the size of the cycle containing 1, the sizes of the remaining cycles are given by a uniform permutation on a smaller number of elements. So the limiting result holds jointly in the first k cycle sizes for all k. More precisely, if (N_1,N_2,\ldots) are the cycle sizes ordered by least element, then the frequencies converge to:


where the Us are independent U[0,1] RVs. This is known as a stick-breaking procedure, where at each step we break off some proportion of the stick according to a fixed distribution, and assemble the pieces into a partition.

We generalise this process to get a two-parameter version. The standard notation for the parameters is (\alpha,\theta). Then we amend the dynamics. We now have to take into account how many tables are occupied when the (n+1)th diner arrives. If k tables are occupied, and the ith table has n_i diners, then the new one will join this table with probability \frac{n_i-\alpha}{n+\theta}, and will start a new table otherwise, so with probability \frac{\theta+k\alpha}{n+\theta}. The original process therefore corresponds to parameters (0,1).

First we examine which parameters are possible. If \alpha<0, and m|\alpha|<\theta<(m+1)|\alpha|, then with high probability the (m+1)th table will eventually be occupied, whereafter the probability of forming a further table will be negative. So we have to demand instead that \theta is an integer multiple of -\alpha. Then the number of tables is bounded by this multiple, so for large n, the probability of joining one of the k (fixed) tables is roughly \frac{n_i}{n}, so this should behave roughly like the standard Polya Urn. And indeed, the induced frequencies do converge to the Dirichlet distribution with k equal parameters.

Obviously \alpha cannot be greater than 1, otherwise the probability of the second diner joining the first table is negative. If it is equal to 1, then every diner starts a new table, which isn’t very interesting. So we care about \alpha\in[0,1), and for the probability of the second diner starting a new table to be non-negative we require \theta>-\alpha.

It turns out that the partitions induced by this process are exchangeable also. We also have a stick-breaking construction, although now the broken proportions are not IID, but distributed as

U_i\sim \mathrm{Beta}(1-\alpha,\theta+i\alpha),

with the same notation otherwise. It turns out that under mild assumptions, these are all the infinite exchangeable random partitions with this stick-breaking property.

My initial struggle with this process was to understand what roles (\alpha,\theta) played in a more precise way. It turns out this is best explained through the limit of the partition, but Pitman’s Exercise 3.2.2 does at least give an idea of how such a process with parameters (1/2,0) might naturally arise as a version of an urn model.

3.2.2. Let an urn initially contain two balls of different colours. Draw 1 is a simple draw from the urn with replacement. Thereafter, balls are drawn from the urn, with replacement of the ball drawn, and addition of two more balls as follows. If the ball drawn is of a colour never drawn before, it is replaced together with two additional balls of two distinct new colours, different to the colours of balls already in the urn. Whereas if the ball drawn is of a colour that has been drawn before, it is replaced together with two balls of its own colour.

Let n_1 be the number of times a ball of the colour of the first ball drawn (and replaced) is drawn. Let $n_2,n_3,\ldots$ be the number of times balls of each other colour are drawn. Suppose after n draws, we have drawn k colours. (There will be other colours in the bag not yet drawn.) Then, for each drawn colour i, there are 2n_i-1 balls of that colour in the bag, giving 2n-k in total. But there should be 2n balls in total, so there are k other balls. Then the probability that we see a new colour is k/2n, and the probability that we see colour i again is $\latex \frac{2n_i-1}{2n}=\frac{n_i-1/2}{n}$, which exactly corresponds to the dynamics for PD(1/2,0).

The other question I was puzzled by initially is where does the dust come from in the limit? Recall that in an infinite exchangeable partition, the sum of the frequencies does not need to be 1. The difference between this sum and 1 gives the probability that an element is in a block by itself. Obviously, when the number of tables is bounded (as when \alpha<0) this is not an issue, but for positive \alpha, this won’t hold. So we need to account for these singletons. The temptation is to imagine that these correspond to tables which are started but never joined. But this use of ‘never’ is not ideal. For each k, the k-th table will eventually include arbitrarily large numbers of diners. But for any finite n, there will likely be some proportion of people dining alone, some in pairs, and so on. So the sum of all of these proportions in the limit gives this dust.

Generalising Polya’s Urn in another direction, if I have time, I might write something about a model which I recently read about on arXiv where the classes are vertices of a graph, and there is dependence between them based on the presence of edges. This might also be a good moment to explain some other generalisations and stochastic approximation methods used to treat them.


This post is almost entirely a paraphrase of Sections 3.1 and 3.2 from Pitman’s Combinatorial Stochastic Processes, available online here.

Enhanced by Zemanta