Antichains in the grid

In the previous post on this topic, we discussed Dilworth’s theorem on chains and antichains in a general partially ordered set. In particular, whatever the size of the largest antichain in a poset, it is possible to partition the poset into exactly that many chains. So for various specific posets, or the directed acyclic graphs associated to them, we are interested in the size of this largest antichain.

The following example turned out to be more interesting than I’d expected. At a conventional modern maths olympiad, there are typically three questions on each paper, and for reasons lost in the mists of time, each student receives an integer score between 0 and 7 per question. A natural question to ask is “how many students need to sit a paper before it’s guaranteed that one will scores at least as highly as another on every question?” (I’m posing this as a straight combinatorial problem – the correlation between scores on different questions will be non-zero and presumably positive, but that is not relevant here.)

The set of outcomes is clearly \{0,1,\ldots,7\}^3, with the usual weak domination partial order inherited from \mathbb{R}^3. Then an antichain corresponds to a set of triples of scores such that no triple dominates another triple. So the answer to the question posed is: “the size of the largest antichain in this poset, plus one.”

In general, we might ask about \{1,2,\ldots,n\}^d, again with the weak domination ordering. This directed graph, which generalises the hypercube as well as our example, is called the grid.

Heuristics for the largest antichain

Retaining the language of test scores on multiple questions is helpful. In the previous post, we constructed a partition of the poset into antichains, indexed by the elements of some maximal chain, by starting with the sources, then looking at everything descended only from sources, and so on. (Recall that the statement that this is possible was referred to as the dual of Dilworth’s theorem.) In the grid, there’s a lot of symmetry (in particular under the mapping x\mapsto n+1-x in every coordinate), and so you end up with the same family of antichains whether you work upwards from the sources or downwards from the sinks. (Or vice versa depending on how you’ve oriented your diagram…) The layers of antichains also have a natural interpretation – each layer corresponds to a given total score. It’s clear a priori why each of these is an antichain. If A scores the same as B overall, but strictly more on the first question, this must be counterbalanced by a strictly lower score on another question.

So a natural guess for the largest antichain is the largest antichain corresponding to some fixed total score. Which total score should this be? It ought to be the middle layer, that is total score \frac{(n+1)d}{2}, or the two values directly on either side if this isn’t an integer. My intuition was probabilistic. The uniform distribution on the grid is achieved by IID uniform distributions in each coordinate, which you can think of as a random walk, especially if you subtract off the mean first. It feels that any symmetric random walk should have mode zero or next-to-zero. Certainly this works asymptotically in a rescaled sense by CLT, and in a slightly stronger sense by local CLT, but we don’t really want asymptotics here.

When I started writing the previous paragraph, I assumed there would be a simple justification for the claim that the middle layer(s) was largest, whether by straight enumeration, or some combinatorial argument, or even generating functions. Perhaps there is, and I didn’t spot it. Induction on d definitely works though, with a slightly stronger hypothesis that the layer sizes are symmetric around the median, and monotone on either side of the median. The details are simple and not especially interesting, so I won’t go into them.

From now on, the hypothesis is that this middle layer of the grid is the largest antichain. Why shouldn’t it, for example, be some mixture of middle-ish layers? (*) Well, heuristically, any score sequence in one layer removes several possibilities from a directly adjacent layer, and it seems unlikely that this effect is going to cancel out if you take some intermediate number of score sequences in the first layer. Also, the layers get smaller as you go away from the middle, so because of the large amount of symmetry (coordinates are exchangeable etc), it feels reasonable that there should be surjections between layers in the outward direction from the middle. The union of all these surjections gives a decomposition into chains.

This result is in fact true, and its proof by Bollobas and Leader, using shadows and compression can be found in the very readable Sections 0 and 1 of [1].

Most of the key ideas to a compression argument are present in the case n=2, for which some notes by Leader can be found here, starting with Proof 1 of Theorem 3, the approach of which is developed over subsequent sections. We treat the case n=2, but focusing on a particularly slick approach that does not generalise as successfully. We also return to the original case d=3 without using anything especially exotic.

Largest antichain in the hypercube – Sperner’s Theorem

The hypercube \{0,1\}^d is the classical example. There is a natural correspondence between the vertices of the hypercube, and subsets of [d]. The ordering on the hypercube corresponds to the ordering given by containment on \mathcal{P}([d]). Almost by definition, the k-th layer corresponds to subsets of size k, and thus includes \binom{d}{k} subsets. The claim is that the size of the largest antichain is \binom{d}{\lfloor d/2 \rfloor}, corresponding to the middle layer if d is even, and one of the two middle layers if d is odd. This result is true, and is called Sperner’s theorem.

I know a few proofs of this from the Combinatorics course I attended in my final year at Cambridge. As explained, I’m mostly going to ignore the arguments using compression and shadows, even though these generalise better.

As in the previous post, one approach is to exhibit a covering family of exactly this number of disjoint chains. Indeed, this can be done layer by layer, working outwards from the middle layer(s). The tool here is Hall’s Marriage Theorem, and we verify the relevant condition by double-counting. Probably the hardest case is demonstrating the existence of a matching between the middle pair of layers when d is odd.

Take d odd, and let d':= \lfloor d/2\rfloor. Now consider any subset S of the d’-th layer \binom{[d]}{d'}. We now let the upper shadow of S be

\partial^+(S):= \{A\in \binom{[d]}{d'+1}\,:\, \exists B\in S, B\subset A\},

the sets in the (d’+1)-th layer which lie above some set in S. To apply Hall’s Marriage theorem, we have to show that |\partial^+(S)|\ge |S| for all choice of S.

We double-count the number of edges in the hypercube from S to \partial^+(S). Firstly, for every element B\in S, there are exactly d’ relevant edges. Secondly, for every element A\in\partial^+(S), there are exactly d’ edges to some element of \binom{[d]}{d'}, and so in particular there are at most d’ edges to elements of S. Thus

d' |S|=|\text{edges }S\leftrightarrow\partial^+(S)| \le d' |\partial^+(S)|,

which is exactly what we require for Hall’s MT. The argument for the matching between other layers is the same, with a bit more notation, but also more flexibility, since it isn’t a perfect matching.

The second proof looks at maximal chains. Recall, in this context, a maximal chain is a sequence \mathcal{C}=B_0\subset B_1\subset\ldots\subset B_d where each B_k:= \binom{[d]}{k}. We now consider some largest-possible antichain \mathcal{A}, and count how many maximal chains include an element A\in\mathcal{A}. If |A|=k, it’s easy to convince yourself that there are \binom{d}{r} such maximal chains. However, given A\ne A'\in\mathcal{A}, the set of maximal chains containing A and the set of maximal chains containing A’ are disjoint, since \mathcal{A} is an antichain. From this, we obtain

\sum_{A\in\mathcal{A}} \binom{d}{|A|} \le d!. (**)

Normally after a change of notation, so that we are counting the size of the intersection of the antichain with each layer, this is called the LYM inequality after Lubell, Yamamoto and Meshalkin. The heuristic is that the sum of the proportions of layers taken up by the antichain is at most one. This is essentially the same as earlier at (*). This argument can also be phrased probabilistically, by choosing a *random* maximal chain, and considering the probability that it intersects the proposed largest antichain, which is, naturally, at most one. Of course, the content is the same as this deterministic combinatorial argument.

Either way, from (**), the statement of Sperner’s theorem follows rapidly, since we know that \binom{d}{|A|}\le \binom{d}{\lfloor d/2\rfloor} for all A.

Largest antichain in the general grid

Instead of attempting a proof or even a digest of the argument in the general case, I’ll give a brief outline of why the previous arguments don’t transfer immediately. It’s pretty much the same reason for both approaches. In the hypercube, there is a lot of symmetry within each layer. Indeed, almost by definition, any vertex in the k-th layer can be obtained from any other vertex in the k-th layer just by permuting the labels (or permuting the coordinates if thinking as a vector).

The hypercube ‘looks the same’ from every vertex, but that is not true of the grid. Consider for clarity the n=8, d=3 case we discussed right at the beginning, and compare the scores (7,0,0) and (2,2,3). The number of maximal chains through (7,0,0) is \binom{14}{7}, while the number of maximal chains through (2,2,3) is \binom{7}{2, 2,3}\binom{14}{4,5,5}, and the latter is a lot larger, which means any attempt to use the second argument is going to be tricky, or at least require an extra layer of detail. Indeed, exactly the same problem arises when we try and use Hall’s condition to construct the optimal chain covering directly. In the double-counting section, it’s a lot more complicated than just multiplying by d’, as was the case in the middle of the hypercube.

Largest antichain in the d=3 grid

We can, however, do the d=3 case. As we will see, the main reason we can do the d=3 case is that the d=2 case is very tractable, and we have lots of choices for the chain coverings, and can choose one which is well-suited to the move to d=3. Indeed, when I set this problem to some students, an explicit listing of a maximal chain covering was the approach some of them went for, and the construction wasn’t too horrible to state.

[Another factor is that it computationally feasible to calculate the size of the middle layer, which is much more annoying in d>3.]

[I’m redefining the grid here as \{0,1,\ldots,n-1\}^d rather than \{1,2,\ldots,n\}^d.]

The case distinction between n even and n odd is going to make both the calculation and the argument annoying, so I’m only going to treat the even case, since n=8 was the original problem posed. I should be honest and confess that I haven’t checked the n odd case, but I assume it’s similar.

So when n is even, there are two middle layers namely \frac{3n}{2}-2, \frac{3n}{2}-1 (corresponding to total score 10 and total score eleven in the original problem). I calculated the number of element in the \frac{3n}{2}-1 layer by splitting based on the value of the first coordinate. I found it helpful to decompose the resulting sum as

\sum_{k=0}^{n-1} = \sum_{k=0}^{\frac{n}{2}-1} + \sum_{k=\frac{n}{2}}^{n-1},

based on whether there is an upper bound, or a lower bound on the value taken by the second coordinate. This is not very interesting, and I obtained the answer \frac{3n^2}{4}, and of course this is an integer, since n is even.

Now to show that any antichain has size at most \frac{3n^2}{4}. Here we use our good control on the chain coverings in the case d=2. We note that there is a chain covering of the (n,d=2) grid where the chains have 2n-1, 2n-3,…, 3, 1 elements (%). We get this by starting with a maximal chain, then taking a maximal chain on what remains etc. It’s pretty much the first thing you’re likely to try.

Consider an antichain with size A in the (n,d=3) grid, and project into the second and third coordinates. The image sets are distinct, because otherwise a non-trivial pre-image would be a chain. So we have A sets in the (n,d=2) grid. How many can be in each chain in the decomposition (%). Well, if there are more than n in any chain in (%), then two must have been mapped from elements of the (n,d=3) grid with the same first coordinate, and so satisfy a containment relation. So in fact there are at most n image points in any of the chains of (%). So we now have a bound of n^2. But of course, some of the chains in (%) have length less than n, so we are throwing away information. Indeed, the number of images points in a given chain is at most

\max(n,\text{length of chain}),

and so the number of image points in total is bounded by

n+\ldots+n+ (n-1)+(n-3)+\ldots+1,

where there are n/2 copies of n in the first half of the sum. Evaluating this sum gives \frac{3n^2}{4}, exactly as we wanted.


[1] – Bollobas, Leader (1991) – Compressions and Isoperimetric Inequalities. Available open-access here.

Azuma-Hoeffding Inequality

It’s (probably) my last Michaelmas term in Oxford, at least for the time being, and so also the last time giving tutorials on either of the probability courses that students take in their first two years. This time, I’m teaching the second years, and as usual the aim of the majority of the first half of the course is to acquire as sophisticated an understanding as possible of the Central Limit Theorem. I feel a key step is appreciating that CLT tells you about the correct scaling for the deviations from the mean of these partial sums of IID random variables. The fact that these deviations on this correct scaling converge in law to a normal distribution, irrespective (apart from mild conditions) on the underlying distribution, is interesting, but should be viewed as a secondary, bonus, property.

Emphasising the scaling of deviations in CLT motivates the next sections of this (or any) course. We develop tools like Markov’s inequality to control the probability that a random variable is much larger than its expectation, and experiment with applying this to various functions of the random variable to get stronger bounds. When the moment generating function exists, this is an excellent choice for this analysis. We end up with a so-called Chernoff bound. For example, we might consider the probability that when we toss N coins, at least a proportion ¾ are Heads. A Chernoff bound says that this probability decays exponentially in N.

One direction to take is to ask how to control precisely the parameter of this exponential decay, which leads to Cramer’s theorem and the basis of the theory of Large Deviations. An alternative direction is to observe that the signed difference between the partial sums of independent random variables and their means is an example of a martingale, albeit not a very interesting one, since in general the increments of a martingale are not independent. So we might ask: under what circumstances can we show exponential tail bounds on the deviation of a martingale from its mean (that is, its initial value) at a fixed (perhaps large) time?

Azuma-Hoeffding inequality

The following result was derived and used by various authors in the 60s, including Azuma and Hoeffding (separately), but also others.

Let X_0,X_1,X_2,\ldots be a martingale with respect to some filtration, and we assume that the absolute value of each increment |X_i-X_{i-1}| is bounded almost surely by some c_i<\infty. Then, recalling that \mathbb{E}[X_n|\mathcal{F}_0]=X_0, we have

\mathbb{P}(X_n \ge X_0+t) \le \exp\left( -\frac{t^2}{2\sum_{i=1}^n c_i^2}\right).


We apply a Chernoff argument to each increment. First, observe that for Y a distribution supported on [-1,1] with mean zero, by convexity \mathbb{E}[e^{tY}] is maximised by taking Y equal to +1 and -1 each with probability ½. Thus

\mathbb{E}[e^{tY}]\le \frac12 e^t + \frac 12 e^{-t}=\cosh(t) \le e^{-t^2/2},

where the final inequality follows by directly comparing the Taylor series.

We’ll use this shortly. Before that, we start the usual argument for a Chernoff bound on X_n-X_0.

\mathbb{P}(X_n-X_0\ge t) = \mathbb{P}(e^{\theta(X_n-X_0)}\ge e^{\theta t})\le e^{-\theta t} \mathbb{E}[e^{\theta(X_n-X_0)}]

= e^{-\theta t} \mathbb{E}[\mathbb{E}[e^{\theta((X_n-X_{n-1}) +X_{n-1}-X_0)} | \mathcal{F}_{n-1}]]

= e^{-\theta t} \mathbb{E}[e^{\theta(X_{n-1}-X_0)} \mathbb{E}[e^{\theta(X_n-X_{n-1})}|\mathcal{F}_{n-1}] ],

and our preliminary result allows us to control this inner expectation

\le e^{-\theta t} e^{\theta^2c_n^2/2} \mathbb{E}[e^{\theta(X_{n-1}-X_0)}].

So now we can apply this inductively to obtain

\mathbb{P}(X_n-X_0\ge t) \le e^{-\theta t+ \theta^2 \sum_{i=1}^n c_i^2}.

Finally, as usual in such an argument, we need to choose a sensible value of the free parameter \theta, and naturally we want to choose it to make this RHS as small as possible, which is achieved when \theta = \frac{t}{\sum_{i=1}^n c_i^2}, and leads exactly to the statement of the inequality.


Unsurprisingly, we can easily apply this to the process of partial sums of IID random variables with mean zero and bounded support, to recover a Chernoff bound.

A more interesting example involves revealing the state (ie open or closed) of the edges of an Erdos-Renyi graph one at a time. We need to examine some quantitative property of the graph which can’t ever be heavily influenced by the presence or non-presence of a single given edge. The size of the largest clique, or the largest cut, are good examples. Adding or removing an edge can change these quantities by at most one.

So if we order the edges, and let the filtration \mathcal{F}_k be generated by the state of the first k edges in this ordering, then X_k=\mathbb{E}[\text{max cut}| \mathcal{F}_k] is a martingale. (A martingale constructed backwards in this fashion by conditioning a final state on a filtration is sometimes called a Doob martingale.) Using A-H on this shows that the deviations from the mean are of order \sqrt{N}, where N is the size of the graph. In the sparse case, it can be justified fairly easily that the maximum cut has size \Theta(N), since for example there will always be some positive proportion of isolated vertices. However, accurate asymptotics for the mean of this quantity seem (at least after a brief search of the literature – please do correct me if this is wrong!) to be unknown. So this might be an example of the curious situation where we can control the deviations around the mean better than the mean itself!

Beyond bounded increments

One observation we might make about the proof is that it is tight only if all the increments X_i-X_{i-1} are supported on \{-c_i,+c_i\}, which is stronger than demanding that the absolute value is bounded. If in fact we have X_i-X_{i-1}\in[-d_i,c_i] almost surely, then, with a more detailed preliminary lemma, we can have instead a bound of \exp\left( -\frac{2t^2}{\sum_{i=1}^n (c_i+d_i)^2} \right).

While it isn’t a problem in these examples, in many settings the restriction to bounded increments is likely to be the obstacle to applying A-H. Indeed, in the technical corner of my current research problem, this is exactly the challenge I faced. Fortunately, at least in principle, all is not necessarily lost. We might, for example, be able to establish bounds (c_i) as described, such that the probability that any |X_i-X_{i-1}| exceeds its c_i is very small. You could then construct a coupled process (Y_i), that is equal to X_i whenever the increments are within the given range, and something else otherwise. For Y to fit the conditions of A-H, the challenge is to ensure we can do this such that the increments remain bounded (ie the ‘something else’ also has to be within [-c_i,c_i] ) and also that Y remains a martingale. This total probability of a deviation is bounded above by the probability of Y experiencing that deviation, plus the probability of Y and X decoupling. To comment on the latter probability is hard in general without saying a bit more about the dependence structure in X itself.

Birthday Coincidences and Poisson Approximations

This morning, Facebook was extremely keen to remind me via every available medium that four of my friends celebrate their birthday today. My first thought was that I hope they all enjoy their day, and my second thought was to ask what the chance of this was. I have about 200 Facebook friends, and so this struck me as an unlikely occurrence. But this problem has form, and it felt worthwhile to try some calculations to see if my intuition was well-founded.


Siméon Denis Poisson celebrated his 234th birthday on 21st June this year.

The classical birthday problem

The starting point is the question: how many friends do you have to have before you expect to start seeing anyone sharing a birthday? There are a ridiculous number of articles about this on the web already, so I will say little, except that I don’t want to call this the ‘birthday paradox’, because it’s not a paradox at all. At best it might be counter-intuitive, but then the moral should be to change our intuition for this type of problem.

Throughout, let’s discount February 29th, as this doesn’t add much. So then, to guarantee having a shared pair of birthdays, you need to have 366 friends. But if you have a mere 23 friends, then the probability of having some pair that share a birthday is slightly greater than a half. The disparity between these two numbers leads to the counter-intuition. Some people might find it helpful to think that instead of counting friends, we should instead be counting pairs of friends, but I don’t personally find this especially helpful.

For me, thinking about the calculation in very slightly more generality is helpful. Here, and throughout, let’s instead take N to be the number of days in a year, and K the number of friends, or kids in the class if you prefer. Then, as usual, it is easier to calculate the probability that no two share a birthday (that is, that all the birthdays are distinct) than the probability that some two share a birthday. We could think of the number of ways to pick the set of birthdays, or we could look at the kids one-at-a-time, and demand that their birthday is not one of those we’ve already seen. Naturally, we get the same answer, that is

\frac{^N P_K}{N^K} = 1\cdot \frac{N-1}{N}\cdot\ldots \frac{N-K+1}{N}.

We’ve assumed here that all birthdates are equally likely. We’ll come back to this assumption right at the end. For now, let’s assume that both N and K are large, and we’ll try to decide roughly how large K has to be in relation to N for this answer to be away from 0 and 1. If we pair opposite terms up, we might approximate this by

(\frac{N-\frac{K}{2}}{N})^K = (1-\frac{K}{2N})^K\approx e^{-K^2/2N}.

In fact, AM-GM says that this is an overestimate, and a bit more care can be used to show that this is a good-approximation to first order. So we see that if K=\Theta(\sqrt{N}) for large N, we get a non-trivial limit.

Challenges for four-way shared birthdays

So the original problem I posed is harder, because there isn’t (unless I’m missing something) a natural way to choose birthdays one-at-a-time, or describe the set of suitable birthday sets. There are two major obstacles in a calculation such as this. Firstly, the overlap of people, that is we might have five or more birthdays overlapping; secondly, the overlap of days, that is we might have several days with four (or more) birthdays. We’ll end up worrying more about the second situation.

We start by eliminating both problems, by asking for the probability that exactly four friends are born on January 1st. The general form of this probability is \frac{\binom{K}{4} }{N^4} \cdot (\frac{N-1}{N})^{K-4}. Now, if K\ll N, this final term should not be significant. Removing this is not exactly the same as specifying the probability that at least four birthdays are on January 1st. But in fact this removal turns a lower bound (because {exactly four}<{at least four}) into an upper (in fact a union) bound. So if the factor being removed is very close to one, we can use whichever expression is more convenient.

In the real life case of N=365, K=200, this term is not negligible. But accounting for this, we get that the probability of exactly four birthdays on 1st January is ~0.0021. Our upper bound on the probability of at least four is ~0.0036.

But now that we know the probability for a given day, can we calculate (1-0.0021)^{365} to estimate the probability that we never have four-overlap? When we did our previous iterative calculation, we were using independence of the different kids’ birthdays. But the event that we have four-overlap on January 1st is not quite independent of the event that we have four-overlap on January 2nd. Why? Well if we know at least four people were born on January 1st, there are fewer people left (potentially) to be born on January 2nd. But maybe this dependence is mild enough that we can ignore it?

We can, however, use some moment estimates. The expected number of days with four-overlap is 365\cdot 0.0021 \approx 0.77. So the probability that there is at least one day with four-overlap is at most ~0.77.

But we really want a lower bound. So, maybe we can borrow exactly the second-moment argument we tried (there for isolated vertices in the random graph) in the previous post? Here, the probability that both January 1st and January 2nd are four-overlapping is

\frac{\binom{K}{4}\binom{K-4}{4}}{N^8}\cdot (\frac{N-2}{N})^{K-8}\approx 4.3\times 10^{-6}.

From this, we can evaluate the expectation of the square of the number of days with four-overlap, and thus find that the variance is ~0.74. So we use Chebyshev, calling this number of days #D for now:

\mathbb{P}(\# D=0)\le \mathbb{P}(|\#D - \mathbb{E}\# D|^2 \ge (\mathbb{E}\# D)^2 ) \le \frac{\mathrm{Var} \# D}{(\mathbb{E} \#D)^2}.

In our case, this unfortunately gives us an upper bound greater than 1 on this probability, and thus a lower bound of zero on the probability that there is at least one day with four-overlap. Which isn’t especially interesting…

Fairly recently, I spoke about the Lovasz Local Lemma, which can be used to find lower bounds on the probabilities of intersections of events, many of which are independent (in a particular precise sense). Perhaps this might be useful here? The natural choice of ‘bad event’ is that particular 4-sets of people share a birthday. There are \binom{K}{4} such events, and each is independent of the collection of \binom{K-4}{4} disjoint events. Thus we can consider using LLL if e\cdot (\binom{K}{4}-\binom{K-4}{4})\cdot 0.0021 \le 1. Unfortunately, this difference of binomial coefficients is large in our example, and so in fact the LHS has order 10^3.

Random number of friends – coupling to a Poisson Process

All of these methods failed because without independence we had to use estimates which were really not tight at all. But we can re-introduce independence if we remove the constraints on the model. Suppose instead of demanding I have K friends, I instead demand that I have a random number of friends, with distribution Poisson(K). Now it is reasonable to assume that for each day, I have a Poisson(K/365) friends with that birthday, independently for each day.

If we end up having exactly K friends with this random choice, then the distribution of the number of 4-overlap days is exactly the same as in the original setup. However, crucially, if we end up having at most K friends with this random choice, the distribution of the number of 4-overlap days is stochastically dominated by the original distribution. So instead let’s assume we have Poisson(L) friends, where L<K, and see how well we can do. For definiteness, we’ll go back to N=365, K=200 now. Let’s say X is the distribution of birthdays in the original model, and \Xi for the distribution of birthdays in the model with a random number of friends


\mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) = 1- \mathbb{P}(\mathrm{Po}(L/365)\le 3)^365. (*)

Now we can write the crucial domination relation as

\mathbb{P}(\exists \ge 4\text{-overlap in }X)\ge \mathbb{P}( \exists \ge 4\text{-overlap in }\Xi \,|\, |\Xi|\le 200),

and then use an inequality version of the law of total probability to bound further as

\ge \frac{ \mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) - \mathbb{P}(|\Xi|>200)}{\mathbb{P}(|\Xi|\le 200)}.

This is a function of L, and in principle we could find its maximum, perhaps as N\rightarrow\infty. Here, though, let’s just take L=365/2 and see what happens. For (*) we get ~0.472.

To estimate \mathbb{P}(\mathrm{Po}(365/2)>200), observe that this event corresponds to 1.4 standard deviations above the mean, so we can approximate using quantiles of the normal distribution, via the CLT. (Obviously this isn’t completely precise, but it could be made precise if we really wanted.) I looked up a table, and this probability is, conveniently for calculations, roughly 0.1. Thus we obtain a lower bound of \frac{0.472-0.1}{0.9}. Allowing for the fairly weak estimates at various points, we still get a lower bound of around 0.4. Which is good, because it shows that my intuition wasn’t right, but that I was in the right ball-park for it being a ‘middle-probability event’.

Remarks and References

– The reason for doing the upper bound for the probability of exact 4-overlap is that the same argument for at-least-4-overlap would have given an upper bound of 1. However, this Poisson Process coupling is also a much better method for obtaining an upper bound on either event.

– Birthdays are not uniformly distributed through the year. The deviation is strong enough that even from the set of birth frequencies (rather than the sequence of birth frequencies), we can reject a null hypothesis of uniformity. Early September is pretty close to the maximum. Two comments: 1) this is the time of year where small variations in birth date have a big effect on education, especially in primary school; 2) we are 37 weeks into the year…

– It is known that 187 friends is the first time the probability of having at-least-4-overlap is greater than ½. You can find the full sequence on OEIS as A014088. I used to have about 650 Facebook friends, before I decided that I’d prefer instead the pleasant surprise of finding out what old acquaintances were up to when I next spoke to them. In this case, the median of the distribution of the largest number sharing a birthday would be seven.

– Eric Weisstein’s article on Mathworld is, in my opinion, the best resource for a mathematician on the first two pages of Google hits by at least two orders of magnitude. In the notation of this article, we were calculating P_4(n=200,d=365). There are also some good general asymptotics, or at least recipes for asymptotics, in equations (17) and (18).

– The paper Methods for Studying Coincidences by Diaconis and Mosteller is, as one might expect, extremely readable, and summarises many results and applications, including several generalisations.

Generating Functions for Dice

So last week I was writing an article for Betting Expert about laws of large numbers, and I was trying to produce some representations of distributions to illustrate the Weak LLN and the Central Limit Theorem. Because tossing a coin feels too simplistic, and also because the natural state space for this random variable, at least verbally, is not a subset of the reals, I decided to go for dice instead. So it’s clear what the distribution of the outcome of a single dice roll is, and with a bit of thought or a 6×6 grid, we can work out the distribution of the average of two dice rolls. But what about 100 rolls? Obviously, we need large samples to illustrate the laws of large numbers! In this post, we discuss how to calculate the distribution of the sample mean of n dice rolls.

First we observe that the total set of outcomes of n dice rolls is 6^n. The sum of the outcomes must lie between n and 6n inclusive. The distribution of the sum and the distribution of the sample mean are equivalent up to dividing by n. The final observation is that because the total number of outcomes has a nice form, we shouldn’t expect it to make any difference to the method if we calculate the probability of a given sum, or the number of configurations giving rise to that sum.

Indeed, tying in nicely with the first year probability course, we are going to use generating functions, and there is no difference in practice between the probability generating function, and the combinatorial generating function, if the underlying mechanism is a uniform choice. Well, in practice, there is a small difference, namely a factor of 6 here. The motivation for using generating functions is clear: we are considering the distribution of a sum of independent random variables. This is pretty much exactly why we bother to set up the machinery for PGFs.

Anyway, since each of {1,2,…,6} is equally likely, the GF of a single dice roll is

x+x^2+\ldots+x^6=x\cdot \frac{1-x^6}{1-x}.

So, if we want the generating function of the sum of n independent dice rolls, we can obtain this by raising the above function to the power n. We obtain


Note the factor of x^n at the beginning arises because the minimum value of the sum is n. So to work out the number of configurations giving rise to sum k, we need to evaluate the coefficient of x^k. We can deal with (1-x^6)^n fairly straightforwardly, but some thought it required regarding whether it’s possible to do similar job on (1-x)^{-n}.

We have to engage briefly with what is meant by a binomial coefficient. Note that

\binom{x}{k}=\frac{x(x-1)\ldots(x-k+1)}{1\cdot\ldots\cdot k}

is a valid definition even when x is not a positive integer, as it is simply a degree k polynomial in x. This works if x is a general positive real, and indeed if x is a general negative real. At this stage, we do need to keep k a positive integer, but that’s not a problem for our applications.

So we need to engage with how the binomial theorem works for exponents that are not positive integers. The tricky part with the standard expression as

(a+b)^n=\binom{n}{0}a^n+\ldots + \binom{n}{n}b^n,

is that the attraction of this symmetry in a and b prompts us to work in more generality than is entirely necessary to state the result. Note if we instead write


we have unwittingly described this finite sum as an infinite series. It just happens that all the binomial coefficients apart from the first (n+1) are zero. The nice thing about this definition is that it might plausibly generalise to non-integer or negative values of n. And indeed it does. I don’t want to go into the details here, but it’s just a Taylor series really, and the binomial coefficients are set up with factorials in the right places to look like a Taylor series, so it all works out.

It is also worth remarking that it follows straight from the definition of a negative binomial coefficient, that

\binom{-n}{j}=(-1)^j \binom{n+j-1}{j}.

In any case, we can rewrite our expression for the generating function of the IID sum as

x^n\left[\sum_{k=0}^n \binom{n}{k}(-1)^k x^{6k}\right]\left[\sum_{j\ge 0} \binom{-n}{j}(-1)^j x^j\right]

By accounting for where we can gather exponents from each bracket, we can evaluate the coefficient of x^m as


Ie, k in the sum takes values in \{0,1,\ldots, \lfloor \frac{m+n}{6}\rfloor\}. At least in theory, this now gives us an explicit way to calculate the distribution of the average of multiple dice rolls. We have to be wary, however, that many compilers will not be happy dealing with large binomial coefficients, as the large factorials grow extremely rapidly. An approximation using logs is likely to be more tractable for larger settings.

Anyway, I leave you with the fruits of my labours.

dice30Related articles

Enhanced by Zemanta

Large Deviations 6 – Random Graphs

As a final instalment in this sequence of posts on Large Deviations, I’m going to try and explain how one might be able to apply some of the theory to a problem about random graphs. I should explain in advance that much of what follows will be a heuristic argument only. In a way, I’m more interested in explaining what the technical challenges are than trying to solve them. Not least because at the moment I don’t know exactly how to solve most of them. At the very end I will present a rate function, and reference properly the authors who have proved this. Their methods are related but not identical to what I will present.


Recall the two standard definitions of random graphs. As in many previous posts, we are interested in the sparse case where the average degree of a vertex is o(1). Anyway, we start with n vertices, and in one description we add an edge between any pair of vertices independently and with fixed probability \frac{\lambda}{n}. In the second model, we choose uniformly at random from the set of graphs with n vertices and \frac{\lambda n}{2} edges. Note that if we take the first model and condition on the number of edges, we get the second model, since the probability of a given configuration appearing in G(n,p) is a function only of the number of edges present. Furthermore, the number of edges in G(n,p) is binomial with parameters \binom{n}{2} and p. For all purposes here it will make no difference to approximate the former by \frac{n^2}{2}.

Of particular interest in the study of sparse random graphs is the phase transition in the size of the largest component observed as \lambda passes 1. Below 1, the largest component has size on a scale of log n, and with high probability all components are trees. Above 1, there is a unique giant component containing \alpha_\lambda n vertices, and all other components are small. For \lambda\approx 1, where I don’t want to discuss what ‘approximately’ means right now, we have a critical window, for which there are infinitely many components with sizes on a scale of n^{2/3}.

A key observation is that this holds irrespective of which model we are using. In particular, this is consistent. By the central limit theorem, we have that:

|E(G(n,\frac{\lambda}{n}))|\sim \text{Bin}\left(\binom{n}{2},\frac{\lambda}{n}\right)\approx \frac{n\lambda}{2}\pm\alpha,

where \alpha is the error due to CLT-scale fluctuations. In particular, these fluctuations are on a scale smaller than n, so in the limit have no effect on which value of \lambda in the edge-specified model is appropriate.

However, it is still a random model, so we can condition on any event which happens with positive probability, so we might ask: what does a supercritical random graph look like if we condition it to have no giant component? Assume for now that we are considering G(n,\frac{\lambda}{n}),\lambda>1.

This deviation from standard behaviour might be achieved in at least two ways. Firstly, we might just have insufficient edges. If we have a large deviation towards too few edges, then this would correspond to a subcritical G(n,\frac{\mu n}{2}), so would have no giant components. However, it is also possible that the lack of a giant component is due to ‘clustering’. We might in fact have the correct number of edges, but they might have arranged themselves into a configuration that keeps the number of components small. For example, we might have a complete graph on Kn^{1/2} vertices plus a whole load of isolated vertices. This has the correct number of edges, but certainly no giant component (that is an O(n) component).

We might suspect that having too few edges would be the primary cause of having no giant component, but it would be interesting if clustering played a role. In a previous post, I talked about more realistic models of complex networks, for which clustering beyond the levels of Erdos-Renyi is one of the properties we seek. There I described a few models which might produce some of these properties. Obviously another model is to take Erdos-Renyi and condition it to have lots of clustering but that isn’t hugely helpful as it is not obvious what the resulting graphs will in general look like. It would certainly be interesting if conditioning on having no giant component were enough to get lots of clustering.

To do this, we need to find a rate function for the size of the giant component in a supercritical random graph. Then we will assume that evaluating this near 0 gives the LD probability of having ‘no giant component’. We will then compare this to the straightforward rate function for the number of edges; in particular, evaluated at criticality, so the probability that we have a subcritical number of edges in our supercritical random graph. If they are the same, then this says that the surfeit of edges dominates clustering effects. If the former is smaller, then clustering may play a non-trivial role. If the former is larger, then we will probably have made a mistake, as we expect on a LD scale that having too few edges will almost surely lead to a subcritical component.


The starting point is the exploration process for components of the random graph. Recall we start at some vertex v and explore the component containing v depth-first, tracking the number of vertices which have been seen but not yet explored. We can extend this to all components by defining:

S(0)=0, \quad S(t)=S(t-1)+(X(t)-1),

where X(t) is the number of children of the t’th vertex. For a single component, S(t) is precisely the number of seen but unexplored vertices. It is more complicated in general. Note that when we exhaust the first component S(t)=-1, and then when we exhaust the second component S(t)=-2 and so on. So in fact

S_t-\min_{0\leq s\leq t}S_s

is the number of seen but unexplored vertices, with \min_{0\leq s\leq t}S_s equal to (-1) times the number of components already explored up to time t.

Once we know the structure of the first t vertices, we expect the distribution of X(t) – 1 to be

\text{Bin}\Big(n-t-[S_t-\min_{0\leq s\leq t}S_s],\tfrac{\lambda}{n}\Big)-1.

We aren’t interested in all the edges of the random graph, only in some tree skeleton of each component. So we don’t need to consider the possibility of edges connecting our current location to anywhere we’ve previously visited (as such an edge would have been consider then – it’s a depth-first exploration), hence the -t. But we also don’t want to consider edges connecting our current location to anywhere we’ve seen, since that would be a surplus edge creating a cycle, hence the -S_s. It is binomial because by independence even after all this conditioning, the probability that there’s an edge from my current location to any other vertex apart from those discounted is equal to \frac{\lambda}{n} and independent.

For Mogulskii’s theorem in the previous post, we had an LDP for the rescaled paths of a random walk with independent stationary increments. In this situation we have a random walk where the increments do not have this property. They are not stationary because the pre-limit distribution depends on time. They are also not independent, because the distribution depends on behaviour up to time t, but only through the value of the walk at the present time.

Nonetheless, at least by following through the heuristic of having an instantaneous exponential cost for a LD event, then products of sums becoming integrals within the exponent, we would expect to have a similar result for this case. We can find the rate function \Lambda_\lambda^*(x)of latex \text{Po}(\lambda)-1$ and thus get a rate function for paths of the exploration process

I_\lambda(f)=\int_0^1 \Lambda_{(1-t-\bar{f}(t))\lambda}^*(f')dt,

where \bar{f}(t) is the height of f above its previous minimum.

Technicalities and Challenges

1) First we need to prove that it is actually possible to extend Mogulskii to this more general setting. Even though we are varying the distribution continuously, so we have some sort of ‘local almost convexity’, the proof is going to be fairly fiddly.

2) Having to consider excursions above the local minima is a massive hassle. We would ideally like to replace \bar{f} with f. This doesn’t seem unreasonable. After all, if we pick a giant component within o(n) steps, then everything considered before the giant component won’t show up in the O(n) rescaling, so we will have a series of macroscopic excursions above 0 with widths giving the actual sizes of the giant components. The problem is that even though with high probability we will pick a giant component after O(1) components, then probability that we do not do this decays only exponentially fast, so will show up as a term in the LD analysis. We would hope that this would not be important – after all later we are going to take an infimum, and since the order we choose the vertices to explore is random and in particular independent of the actual structure, it ought not to make a huge difference to any result.

3) A key lemma in the proof of Mogulskii in Dembo and Zeitouni was the result that it doesn’t matter from an LDP point of view whether we consider the linear (continuous) interpolation or the step-wise interpolation to get a process that actually lives in L_\infty([0,1]). In this generalised case, we will also need to check that approximating the Binomial distribution by its Poisson limit is valid on an exponential scale. Note that because errors in the approximation for small values of t affect the parameter of the distribution at larger times, this will be more complicated to check than for the IID case.

4) Once we have a rate function, if we actually want to know about the structure of the ‘typical’ graph displaying some LD property, we will need to find the infimum of the integrated rate function with some constraints. This is likely to be quite nasty unless we can directly use Euler-Lagrange or some other variational tool.


Papers by O’Connell and Puhalskii have found the rate function. Among other interesting things, we learn that:

I_{(1+\epsilon)}(0)\approx \frac{\epsilon^3}{6},

while the rate function for the number of edges:

-\lim\tfrac{1}{n}\log\mathbb{P}\Big(\text{Bin}(\tfrac{n^2}{2},\tfrac{1+\epsilon}{n})\leq\tfrac{n}{2}\Big)\approx \frac{\epsilon^2}{4}.

So in fact it looks as if there might be a significant contribution from clustering after all.

Large Deviations 5 – Stochastic Processes and Mogulskii’s Theorem


In the previous posts about Large Deviations, most of the emphasis has been on the theory. To summarise briefly, we have a natural idea that for a family of measures supported on the same metric space, increasingly concentrated as some index grows, we might expect the probability of seeing values in a set not containing the limit in distribution to grow exponentially. The canonical example is the sample mean of a family of IID random variables, as treated by Cramer’s theorem.

It becomes apparent that it will not be enough to specify the exponent for a given large deviation event just by taking the infimum of the rate function, so we have to define an LDP topologically, with different behaviour on open and closed sets. Now we want to find some LDPs for more complicated measures, but which will have genuinely non-trivial applications. The key idea in all of this is that the infimum present in the definition of an LDP doesn’t just specify the rate function, it also might well give us some information about the configurations or events that lead to the LDP.

The slogan for the LDP as in Frank den Hollander’s excellent book is: “A large deviation event will happen in the least unlikely of all the unlikely ways.” This will be useful when our underlying space is a bit more complicated.


As a starting point, consider the set-up for Cramer’s theorem, with IID X_1,\ldots,X_n. But instead of investigating LD behaviour for the sample mean, we investigate LD behaviour for the whole set of RVs. There is a bijection between sequences and the partial sums process, so we investigate the partial sums process, rescaled appropriately. For the moment this is a sequence not a function or path (continuous or otherwise), but in the limit it will be, and furthermore it won’t make too much difference whether we interpolate linearly or step-wise.

Concretely, we consider the rescaled random walk:

Z_n(t):=\tfrac{1}{n}\sum_{i=1}^{[nt]}X_i,\quad t\in[0,1],

with laws \mu_n supported on L_\infty([0,1]). Note that the expected behaviour is a straight line from (0,0) to (1,\mathbb{E}X_1). In fact we can say more than that. By Donsker’s theorem we have a functional version of a central limit theorem, which says that deviations from this expected behaviour are given by suitably scaled Brownian motion:

\sqrt{n}\left(\frac{Z_n(t)-t\mathbb{E}X}{\sqrt{\text{Var}(X_1)}}\right)\quad\stackrel{d}{\rightarrow}\quad B(t),\quad t\in[0,1].

This is what we expect ‘standard’ behaviour to look like:

mog1 - Copy

The deviations from a straight line are on a scale of \sqrt{n}. Here are two examples of potential large deviation behaviour:

mog2 - Copy

Or this:

mog3 - Copy

Note that these are qualitatively different. In the first case, the first half of the random variables are in general much larger than the second half, which appear to have empirical mean roughly 0. In the second case, a large deviation in overall mean is driven by a single very large value. It is obviously of interest to find out what the probabilities of each of these possibilities are.

We can do this via an LDP for (\mu_n). Now it is really useful to be working in a topological context with open and closed sets. It will turn out that the rate function is supported on absolutely continuous functions, whereas obviously for finite n, none of the sample paths are continuous!

We assume that \Lambda(\lambda) is the logarithmic moment generating function of X_1 as before, with \Lambda^*(x) the Fenchel-Legendre transform. Then the key result is:

Theorem (Mogulskii): The measures (\mu_n) satisfy an LDP on L_\infty([0,1]) with good rate function:

I(\phi)=\begin{cases}\int_0^1 \Lambda^*(\phi'(t))dt,&\quad \text{if }\phi\in\mathcal{AC}, \phi(0)=0,\\ \infty&\quad\text{otherwise,}\end{cases}

where AC is the space of absolutely continuous functions on [0,1]. Note that AC is dense in L_\infty([0,1]), so any open set contains a \phi for which I(\phi) is at least in principle finite. (Obviously, if \Lambda^* is not finite everywhere, then extra restrictions of \phi' are required.)

The following picture may be helpful at providing some motivation:


So what is going on is that if we take a path and zoom in on some small interval around a point, note first that behaviour on this interval is independent of behaviour everywhere else. Then the gradient at the point is the local empirical mean of the random variables around this point in time. The probability that this differs from the actual mean is given by Cramer’s rate function applied to the empirical mean, so we obtain the rate function for the whole path by integrating.

More concretely, but still very informally, suppose there is some \phi'(t)\neq \mathbb{E}X, then this says that:

Z_n(t+\delta t)-Z_n(t)=\phi'(t)\delta t+o(\delta t),

\Rightarrow\quad \mu_n\Big(\phi'(t)\delta t+o(\delta t)=\frac{1}{n}\sum_{i=nt+1}^{n(t+\delta t)}X_i\Big),

= \mu_n\Big( \phi'(t)+o(1)=\frac{1}{n\delta t}\sum_{i=1}^{n\delta t}X_i\Big)\sim e^{-n\delta t\Lambda^*(\phi'(t))},

by Cramer. Now we can use independence:

\mu_n(Z_n\approx \phi)=\prod_{\delta t}e^{-n\delta t \Lambda^*(\phi'(t))}=e^{-\sum_{\delta t}n\delta t \Lambda^*(\phi'(t))}\approx e^{-n\int_0^1 \Lambda^*(\phi'(t))dt},

as in fact is given by Mogulskii.


1) The absolutely continuous requirement is useful. We really wouldn’t want to be examining carefully the tail of the underlying distribution to see whether it is possible on an exponential scale that o(n) consecutive RVs would have sum O(n).

2) In general \Lambda^*(x) will be convex, which has applications as well as playing a useful role in the proof. Recalling den Hollander’s mantra, we are interested to see where infima hold for LD sets in the host space. So for the event that the empirical mean is greater than some threshold larger than the expectation, Cramer’s theorem told us that this is exponentially the same as same the empirical mean is roughly equal to the threshold. Now Mogulskii’s theorem says more. By convexity, we know that the integral functional for the rate function is minimised by straight lines. So we learn that the contributions to the large deviation are spread roughly equally through the sample. Note that this is NOT saying that all the random variables will have the same higher than expected value. The LDP takes no account of fluctuations in the path on a scale smaller than n. It does however rule out both of the situations pictured a long way up the page. We should expect to see roughly a straight line, with unexpectedly steep gradient.

3) The proof as given in Dembo and Zeitouni is quite involved. There are a few stages, the first and simplest of which is to show that it doesn’t matter on an exponential scale whether we interpolate linearly or step-wise. Later in the proof we will switch back and forth at will. The next step is to show the LDP for the finite-dimensional problem given by evaluating the path at finitely many points in [0,1]. A careful argument via the Dawson-Gartner theorem allows lifting of the finite-dimensional projections back to the space of general functions with the topology of pointwise convergence. It remains to prove that the rate function is indeed the supremum of the rate functions achieved on projections. Convexity of \Lambda^*(x) is very useful here for the upper bound, and this is where it comes through that the rate function is infinite when the comparison path is not absolutely continuous. To lift to the finer topology of L_\infty([0,1]) requires only a check of exponential tightness in the finer space, which follows from Arzela-Ascoli after some work.

In conclusion, it is fairly tricky to prove even this most straightforward case, so unsurprisingly it is hard to extend to the natural case where the distributions of the underlying RVs (X) change continuously in time, as we will want for the analysis of more combinatorial objects. Next time I will consider why it is hard but potentially interesting to consider with adaptations of these techniques an LDP for the size of the largest component in a sparse random graph near criticality.

Bell Polynomials

Trees with a single cycle

When counting combinatorial objects, it is often the case that we have two types of structure present at different levels. The aim of this post is to introduce the Bell polynomials, which provides the most natural notation for describing this sort of situation, and to mention some of the results that become easier to derive in this framework. This post is based on material and exercises from Chapter 1 of Jim Pitman’s book Combinatorial Stochastic Processes, which is great, and also available online here.

The structures that Bell polynomials enumerate are called composite structures in this account. Rather than give a definition right away, I shall give an example. An object I have been thinking about in the past few weeks are graphs on n vertices containing precisely one cycle. Some of the background for this has been explained in recent posts.

In a recent post on Prufer codes, I gave the classical argument showing that the number of trees on n vertices is n^{n-2}. We might consider a unicyclic graph to be a tree with an extra edge. But if we consider the number of ways to add a further vertex to a tree, we get


Obviously, we have overcounted. If the single cycle in a graph has length k, then the graph has been counted exactly k times in this enumeration. But it is not obvious how many graphs have a single cycle of length k.

Instead, we stop worrying about exactly how many of these there are, as there might not be a simple expression anyway. As soon as we start using them in any actual argument, it will be useful to know various properties about the graphs, but probably not exactly how many there are.

Let’s focus on this single cycle of length k say. If we remove the edges of the cycle, we are left with a collection of trees. Why? Well if there was a cycle in the remaining graph, then the original graph would have had at least two cycles. So we have a collection of trees, unsurprisingly called a forest. Remembering that some of the trees may in fact be a single vertex (on the cycle), it is clear that there is a bijection between these trees and the vertices of the cycle in the obvious way. We can think of the graph as a k-cycle, dressed with trees.

Alternatively, once we have specified its size, we can forget about the k-cycle altogether. The graph is precisely defined by a forest of k trees on n vertices, with a specified root in each tree indicating which vertex lies on the cycle, and a permutation specifying the cyclic ordering of the trees. We can write this as

N_{n,k}=(k-1)!\sum_{(A_1,\ldots,A_k)\in\mathcal{P}^k(n)}a_1^{a_1-1}\cdot\ldots\cdot a_k^{a_k-1},\quad \text{for }a_i=|A_i|,

where \mathcal{P}^k(n) is the number of partitions of [n] with k blocks. Remember that the blocks in a partition are necessarily unordered. This makes sense in this setting as the cyclic permutation chosen from the (k-1)! possibilities specifies the order on the cycle.

Bell Polynomials

The key point about this description is that there are two types of combinatorial structure present. We have the rooted trees, and also a cyclic ordering of the rooted trees. Bell polynomials generalise this idea. It is helpful to be less specific and think of partitions of [n] into blocks. There are w_j arrangements of any block of size j, and there are v_k ways to arrange the blocks, if there are k of them. Note that we assume v_k is independent of the arrangements within the collection of blocks. So in the previous example, w_j=j^{j-2}, and v_k=(k-1)!. Pitman denotes these sequences by v_\bullet,w_\bullet. Then the (n,k)th partial Bell polynomial, B_{n,k}(w_\bullet) gives the number of divisions into k blocks:

B_{n,k}(w_\bullet):=\sum_{(A_1,\ldots,A_k)\in\mathcal{P}^k(n)}\prod_{i=1}^k w_{a_i}.

The total number of arrangements is given by the Bell polynomial

B_n(v_\bullet,w_\bullet):=\sum_{k=1}^n v_k B_{n,k}(w_\bullet).

Here are some other examples of Bell polynomials. The Stirling numbers of the first kind c_{n,k} give the number of permutations of [n] with k cycles. Since we don’t want to impose any combinatorial structure on the set of cycles, we don’t need to consider v_\bullet, and the number of ways to make a j-cycle from a j-block is w_j=(j-1)!, so c_{n,k}:=B_{n,k}((\bullet-1)!). Similarly, the Stirling numbers of the second kind S_{n,k} give the number of permutations of [n] into k blocks. Almost by definition, S_{n,k}:=B_{n,k}(1^\bullet), where $1^\bullet$ is defined to be the sequence containing all 1s.


So far, this is just a definition that gives an abbreviated description for the sizes of several interesting sets of discrete objects. Having clean notation is always important, but there are further advantages of using Bell polynomials. I don’t want to reproduce the entirety of the chapter I’ve read, so my aim for this final section is to give a very vague outline of why this is a useful formulation.

Bell polynomials can be treated rather nicely via generating functions. The key to this is to take a sum not over partitions, but rather over ordered partitions, which are exactly the same, except now we also care about the order of the blocks. This has the advantage that there is a correspondence between ordered partitions with k blocks and compositions with k terms. If the composition is n_1+\ldots+n_k=n, it is clear why there are \binom{n}{n_1,\ldots,n_k} ordered partitions encoding this structure. This multinomial coefficient can be written as a product of factorials of $n_i$s over i, and so we can write:

B_{n,k}(w_\bullet)=\frac{n!}{k!}\sum_{(n_1,\ldots,n_k)}\prod_{i=1}^k \frac{w_{n_i}}{n_i!}.

This motivates considering the exponential generating function given by

w(\xi)=\sum_{j=1}^\infty w_j\frac{\xi_j}{j!},

as this leads to the neat expressions:

B_{n,k}(w_\bullet)=n![\xi^n]\frac{w(\xi)^k}{k!},\quad B_n(v_\bullet,w_\bullet)=n![\xi^n]v(w(\xi)).

The Bell polynomial B_n(v_\bullet,w_\bullet) counts the number of partitions of [n] subject to some extra structure. If we choose uniformly from this set, we get a distribution on this combinatorial object, for which the Bell polynomial provides the normalising constant. If we then ignore the extra structure, the sequences v_\bullet,w_\bullet induce a probability distribution on the set of partitions of n. This distribution is known as a Gibbs partition. It is interesting to consider when and whether it is possible to define a splitting mechanism such that the Gibbs partitions can be coupled to form a fragmentation process. This is the opposite of a coalescence process. Here, we have a sequence of masses, and at each integer time we have rules to determine which mass to pick, and a rule for how to break it into two pieces. It is certainly not the case that for an arbitrary splitting rule and sequences v_\bullet,w_\bullet, the one-step fragmentation of the Gibbs partition on n gives the corresponding Gibbs partition on (n-1).

CLT for random permutations

For the final demonstration of the use of Bell polynomials, I am going to sketch the outline of a solution to exercise 1.5.4. which shows that the number of cycles in a uniformly chosen permutation has a CLT. This is not at all obvious, since the number of permutations of [n] with k cycles is given by B_{n,k}((\bullet-1)!) and there is certainly no simple form for this, so the possibility of doing a technical limiting argument seems slim.

For ease of notation, we copy Pitman and write c_{n,k}:=B_{n,k}((\bullet-1)!) as before. First we show exercise 1.2.3. which asserts that

x(x+1)\ldots(x+(n-1))=\sum_{k=1}^n c_{n,k}x^k.

We argue combinatorially. The RHS is the number of ways to choose \sigma\in S_n and a colouring of [n] with k colours such that the orbits of \sigma are monochromatic. We prove that the LHS also has this property by induction on the number of vertices. We claim there is a 1-to-(x+n) map from configurations on n vertices to configurations on (n+1) vertices. Given \sigma\in S_n and colouring, for any a\in[n], we construct \sigma_a\in S_{n+1} by \sigma_a(a)=n+1, \sigma_a(n)=\sigma(a) and for all other x, \sigma_a(x)=\sigma(x). We give n+1 the same colour as a. This gives us n possibilities. Alternatively, we can map (n+1) to itself and give it any colour we want. This gives us x possibilities. A slightly more careful argument shows that this is indeed a 1-to-(x+n) map, which is exactly what we require.

So the polynomial


has n real zeros, which allows us to write


where the Xs are independent but not identically distributed Bernoulli trials. The number of cycles is then given by this sum, and so becomes a simple matter to verify the CLT by checking a that the variances grows appropriately. As both mean and variance are asymptotically log n, we can conclude that:

\frac{K_n - \log n}{\sqrt{\log n}}\stackrel{d}{\rightarrow} N(0,1).

In a future post, I want to give a quick outline of section 1.3. which details how the Bell polynomials can be surprisingly useful to find the moments of infinitely divisible distributions.

Extreme Value Theory

This is something interesting which came up on the first problem sheet for the Part A Statistics course. The second question introduced the Weibull distribution, defined in terms of parameters \alpha,\lambda>0 through the distribution function:

F(x)=\begin{cases}0 & x<0\\ 1-\exp(-(\frac{x}{\lambda})^\alpha) & x\geq 0.\end{cases}

As mentioned in the statement of the question, this distribution is “typically used in industrial reliability studies in situations where failure of a system comprising many similar components occurs when the weakest component fails”. Why could that be? Expressed more theoretically, the lifetimes of various components might reasonably be assumed to behave like i.i.d. random variables in many contexts. Then the failure time of the system is given by the minimum of the constituent random variables.

So this begs the question: what does the distribution of minimum of a collection of i.i.d. random variables look like? First, we need to think why there should be an answer at all. I mean, it would not be unreasonable to assume that this would depend rather strongly on the underlying distribution. But of course, we might say the same thing about sums of i.i.d. random variables, but there is the Central Limit Theorem. Phrased in a way that is deliberately vague, this says that subject to some fairly mild conditions on the underlying distribution (finite variance in this case), the sum of n i.i.d. RVs look like a normal distribution for large n. Here we know what ‘looks like’ means, since we have a notion of a family of normal distributions. Formally, though, we might say that ‘looks like’ means that the image of the distribution under some linear transformation, where the coefficients are possibly functions of n, converges to the distribution N(0,1) as n grows.

The technical term for this is to say that the underlying RV we are considering, which in this case would be X_1+\ldots +X_n) is in the domain of attraction of N(0,1). Note that other distributions in the family of normals are also in the domain of attraction of N(0,1), and vice versa, so this forms an equivalence relation on the space of distributions, though this is not hugely helpful since most interesting statements involve some sort of limit.

Anyway, with that perspective, it is perhaps more reasonable to imagine that the minimum of a collection of i.i.d. RVs might have some limit distribution. Because we typically feel more comfortable thinking about right-tails rather than left-tails of probability distributions, this problem is more often considered for the maximum of i.i.d. RVs. The Fisher-Tippett-Gnedenko theorem, proved in various forms in the first half of the 20th century, asserts that again under mild regularity assumptions, the maximum of such a collection does lie in the domain of attraction of one of a small set of distributions. The Weibull distribution as defined above is one of these. (Note that if we are considering domains of attraction, then scaling x by a constant is of no consequence, so we can drop the parameterisation by \lambda.)

This is considered the first main theorem of Extreme Value Theory, which addresses precisely this sort of problem. It is not hard to consider why this area is of interest. To decide how much liquidity they require, an insurance company needs to know the likely size of the maximum claim during the policy. Similarly, the designer of a sea-wall doesn’t care about the average wave-height – what matters is how strong the once-in-a-century storm which threatens the town might be. A good answer might also explain how to resolve the apparent contradiction that most human characteristics are distributed roughly normally across the population. Normal distributions are unbounded, yet physiological constraints enable us to state with certainty that there will never be twelve foot tall men (or women). In some sense, EVT is a cousin of Large Deviation theory, the difference being that unlikely events in a large family of i.i.d. RVs are considered on a local scale rather than globally. Note that large deviations for Cramer’s theorem in the case where the underlying distribution has a heavy tail are driven by a single very deviant value, rather than by lots of slightly deviant data, so in this case the theories are comparable, though generally analysed from different perspectives.

In fact, we have to consider the reversed Weibull distribution for a maximum, which is supported on (-\infty,0]. This is one of three possibly distribution families for the limit of a maximum. The other two are the Gumbel distribution


and the Frechet distribution

F(x)=\exp(-x^{-\alpha}),\quad x>0.

Note that \alpha is a positive parameter in both the Frechet and Gumbel distributions. These three distributions can be expressed as a single one parameter family, the Generalised Extreme Value distribution.

The differences between them lie in the tail behaviour. The reversed Weibull distribution has an actual upper bound, the Gumbel an exponential, fast-decaying tail, and the Frechet a polynomial ‘fat’ tail. It is not completely obvious that these properties are inherited from the original distribution. After all, to get from the original distribution to extreme value distribution, we are taking the maximum, then rescaling and translating in a potentially quite complicated way. However, it is perhaps reasonable to see that the property of the underlying distribution having an upper bound is preserved through this process. Obviously, the bound itself is not preserved – after all, we are free to apply arbitrary linear transformations to the distributions!

In any case, it does turn out to be the case that the U[0,1] distribution has maximum converging to a reversed Weibull; the exponential tails of the Exp(1) and N(0,1) distributions lead to a Gumbel limit; and the fat-tailed Pareto distribution gives the Frechet limit. The calculations are reasonably straightforward, especially once the correct rescaling is known. See this article from Duke for an excellent overview and the details for these examples I have just cited. These notes discuss further properties of these limiting distributions, including the unsurprising fact that their form is preserved under taking the maximum of i.i.d. copies. This is analogous to the fact that the family of normal distributions is preserved under taking arbitrary finite sums.

From a statistical point of view, devising a good statistical test for what class of extreme value distribution a particular set of data obeys is of great interest. Why? Well mainly because of the applications, some of which were suggested above. But also because of the general statistical principle that it is unwise to extrapolate beyond the range of the available data. But that is precisely what we need to do if we are considering extreme values. After all, the designer of that sea-wall can’t necessarily rely on the largest storm in the future being roughly the same as the biggest storm in the past. So because the EVT theorem gives a clear description of the distribution, to find the limiting properties, which is where the truly large extremes might occur, it suffices to find a good test for the form of the limit distribution – that is, which of the three possibilities is relevant, and what the parameter should be. This seems to be fairly hard in general. I didn’t understand much of it, but this paper provided an interesting review.

Anyway, that was something interesting I didn’t know about (for the record, I also now know how to construct a sensible Q-Q plot for the Weibull distribution!), though I am assured that EVT was a core element of the mainstream undergraduate mathematics syllabus forty years ago.

Large Deviations 1 – Motivation and Cramer’s Theorem

I’ve been doing a lot of thinking about Large Deviations recently, in particular how to apply the theory to random graphs and related models. I’ve just writing an article about some of the more interesting aspects, so thought it was probably worth turning it into a few posts.


Given X_1,X_2,\ldots i.i.d. real-valued random variables with finite expectation, and S_n:=X_1+\ldots+X_n, the Weak Law of Large Numbers asserts that the empirical mean \frac{S_n}{n} converges in distribution to \mathbb{E}X_1. So \mathbb{P}(S_n\geq n(\mathbb{E}X_1+\epsilon))\rightarrow 0. In fact, if \mathbb{E}X_1^2<\infty, we have the Central Limit Theorem, and a consequence is that \mathbb{P}(S_n\geq n\mathbb{E}X_1+n^\alpha)\rightarrow 0 whenever \alpha>\frac12.

In a concrete example, if we toss a coin some suitably large number of times, the probability that the proportion of heads will be substantially greater or smaller than \frac12 tends to zero. So the probability that at least \frac34 of the results are heads tends to zero. But how fast? Consider first four tosses, then eight. A quick addition of the relevant terms in the binomial distribution gives:

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of four tosses are heads}\right)=\frac{1}{16}+\frac{4}{16}=\frac{5}{16},

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of twelve tosses are heads}\right)=\frac{1}{2^{12}}+\frac{12}{2^{12}}+\frac{66}{2^{12}}+\frac{220}{2^{12}}=\frac{299}{2^{12}}.

There are two observations to be made. The first is that the second is substantially smaller than the first – the decay appears to be relatively fast. The second observation is that \frac{220}{2^{12}} is substantially larger than the rest of the sum. So by far the most likely way for at least \tfrac34 out of twelve tosses to be heads is if exactly \tfrac34 are heads. Cramer’s theorem applies to a general i.i.d. sequence of RVs, provided the tail is not too heavy. It show that the probability of any such large deviation event decays exponentially with n, and identifies the exponent.

Theorem (Cramer): Let (X_i) be i.i.d. real-valued random variables which satisfy \mathbb{E}e^{tX_1}<\infty for every t\in\mathbb{R}. Then for any a>\mathbb{E}X_1,

\lim_{n\rightarrow \infty}\frac{1}{n}\log\mathbb{P}(S_n\geq an)=-I(a),

\text{where}\quad I(z):=\sup_{t\in\mathbb{R}}\left[zt-\log\mathbb{E}e^{tX_1}\right].


  • So, informally, \mathbb{P}(S_n\geq an)\sim e^{-nI(a)}.
  • I(z) is called the Fenchel-Legendre transform (or convex conjugate) of \log\mathbb{E}e^{tX_1}.
  • Considering t=0 confirms that I(z)\in[0,\infty].
  • In their extremely useful book, Dembo and Zeitouni present this theorem in greater generality, allowing X_i to be supported on \mathbb{R}^d, considering a more general set of large deviation events, and relaxing the requirement for finite mean, and thus also the finite moment generating function condition. All of this will still be a special case of the Gartner-Ellis theorem, which will be examined in a subsequent post, so we make do with this form of Cramer’s result for now.

The proof of Cramer’s theorem splits into an upper bound and a lower bound. The former is relatively straightforward, applying Markov’s inequality to e^{tS_n}, then optimising over the choice of t. This idea is referred to by various sources as the exponential Chebyshev inequality or a Chernoff bound. The lower bound is more challenging. We reweight the distribution function F(x) of X_1 by a factor e^{tx}, then choose t so that the large deviation event is in fact now within the treatment of the CLT, from which suitable bounds are obtained.

To avoid overcomplicating this initial presentation, some details have been omitted. It is not clear, for example, whether I(x) should be finite whenever x is in the support of X_1. (It certainly must be infinite outside – consider the probability that 150% or -40% of coin tosses come up heads!) In order to call this a Large Deviation Principle, we also want some extra regularity on I(x), not least to ensure it is unique. This will be discussed in the next posts.

From G(n,p) to G(n,m)

In the previous posts about random graphs, I was focusing on the model G(n,p). Here, we have n vertices, and we insert an edge between any pair of vertices independently with probability p. In particular, the number of edge which appear in a realisation of G(n,p) is a random variable, distributed as \text{Bin}(\frac{n(n-1)}{2},p).

The original model examined by Erdos and Renyi, after whom the random graph described above was named, was slightly different. Still with n vertices, they specified how many edges m they wanted in the graph, and chose uniformly at random from the set of graphs with this number of edges. This model is usually denoted G(n,m). Normally we can tell them apart by context. Obviously, p is a probability so lies in [0,1], whereas m is a positive integer, so there isn’t much room for ambiguity.

The key observation is that, if H is some graph with n vertices and m edges, then the probability that this is appears as G(n,p) is


This is constant if we vary H while fixing m. In other words, G(n,p) conditioned to have m edges is G(n,m). So, via some sort of law of total probability, we can construct G(n,p) by taking m to be distributed as \text{Bin}(\frac{n(n-1)}{2},p), and conditional on that, sampling from G(n,m). (*)

We can couple G(n,p) for all p, by assigning iid uniform [0,1] random variables to each pair of vertices, then including the edge in G(n,p) only if the value of the RV is greater than 1-p. Similarly, it is often helpful to think of G(n,m) as m varies as a random process, where edges are added one at a time, and at each stage the next edge is chosen uniformly at random from those not currently present. Perhaps because of this, it is sometimes easier to prove results for G(n,p) than for G(n,m) so we want to develop a framework to move between the two.

The decomposition (*) gives a relatively straightforward way to move from a result in G(n,m) to a result in G(n,p). By the Central Limit Theorem, the number of edges in G(n,p) is \binom{n}{2}p+O(n) in the limit, and so if a result with high probability in G(n,m) for all m in some interval, say [\binom{n}{2}p-Kn,\binom{n}{2}p+Kn] for some large K, then the law of total probability shows that the property holds with high probability in G(n,p).

In general, we get more interesting properties when p is a function of n. As discussed in previous posts, the scaling p=\frac{\lambda}{n} is particularly worth studying. CLT now shows that G(n,\frac{\lambda}{n}) has \frac{\lambda n}{2}+O(\sqrt{n}) edges in the limit. If you are confused why you can’t just substitute this value for p into the previous expression, note that p(1-p) does appear in the general asymptotic variance, but this gets absorbed into the “big O” notation when p is constant.

More importantly, many properties that we might want to consider are not in general affected in the limit by adding or removing O(\sqrt{n}) edges. For example, with high probability, G(n,m) has largest component of size (\zeta_\lambda+o(1)) n whenever \lambda>1 and m\in [\frac{\lambda n}{2}-O(\sqrt{n}),\frac{\lambda n}{2}+O(\sqrt{n})]. Some of this notation would need to be made a bit more precise in a formal argument, but for now, let’s take that as given. This then implies that with high probability, G(n,\frac{\lambda}{n}) has largest component of size (\zeta_\lambda+o(1))n also.

Of course, from the logical structure of this blog, this deduction is a bit bogus, because we have only just introduced G(n,m), and have no idea about the properties of its giant components yet. We seek instead an argument to deduce facts about G(n,m) from facts about G(n,p). Because G(n,m) cannot obviously be written as some conditioned combination of G(n,p)s, this instinctively seems harder. Bollobas gives various general conditions to carry results over between the two regimes in his Part III course notes, but I feel that an examples would be the easiest way to explain the ideas.

The size of the largest component is such an important quantity, we might as well consider that, in the subcritical case. We work with G(n,\frac{\lambda}{n}), for which we have the result:

\mathbb{P}_{\lambda+o(1)}(C_1\geq a\log n)=O(n^{-\delta}),

for some \delta>0, whenever a>I_\lambda^{-1}, the rate function at 1 of the total population size of a Poisson \lambda branching process. For now, that doesn’t matter too much, except that it is continuous as a function of \lambda. We want to show that G(n,\frac{\lambda n}{2}) has the same property.

The trick is to consider G(n,\frac{\lambda+\epsilon}{n}) instead. Let E_a be the event described above. By the law of total probability and the decomposition mentioned above, we have:


We are going to split this sum into

\sum_{m=0}^{\frac{(\lambda+\epsilon )n}{2}-n^{3/4}}+\sum_{m=\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}^n.

On the first of these sums, we bound using the fact that probabilities are less than 1, and on the second, we use that \mathbb{P}_{n,m}(E_a) is an increasing function of m. This property is special to the event we are considering – in general one might have to be a bit more clever, perhaps using continuity of P_{n,m}(E), interpreting continuity in the limit with n. Anyway, this enables us to bound:

\mathbb{P}_{\lambda+\epsilon}(E_a)\geq \mathbb{P}(\text{Bin}(\frac{n(n-1)}{2},\frac{\lambda+\epsilon}{n})\leq \frac{(\lambda+\epsilon)n}{2}-n^{3/4})+\mathbb{P}_{\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}(E_a)\mathbb{P}(\text{Bin}(\frac{n(n-1)}{2},\frac{\lambda+\epsilon}{n})\geq\frac{(\lambda+\epsilon)n}{2}-n^{3/4}).

By the Central Limit Theorem, this first probability tends to 0, while the final term tends to 1. We therefore have:

\mathbb{P}_{\lambda+\epsilon}(E_a)\geq \mathbb{P}_{\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}(E_a).

We demanded that a>I_\lambda^{-1}, and mentioned that this function was continuous, so since we have total freedom over \epsilon, in particular, we can choose \epsilon>0 such that a>I_{\lambda+\epsilon}^{-1}. By the work on G(n,p), we have \mathbb{P}_{\lambda+\epsilon}(E_a)=O(n^{-\delta}), and for large enough n, we have \frac{(\lambda+\epsilon)n}{2}-n^{3/4}>>\frac{\lambda n}{2}, and so the result \mathbb{P}_{\frac{\lambda n}{2}}(E_a)=O(n^{-\delta}) follows.