BMO1 2021

The first round of the British Mathematical Olympiad was sat on Thursday by roughly 2000 pupils in the UK, and by a significant number of pupils overseas on Friday.

For obvious reasons, much of the past 18 months has been dominated by logistical rather than academic problems, so it’s refreshing to be returning to the traditional format of this exam, even if the marking has stayed online.

I wasn’t involved in setting this paper, so here are some thoughts on the problems. These aren’t supposed to be official solutions, and it’s entirely possible that I’m missing the most natural approach or framing. Students reading this in future years are advised, as always, that such commentaries are normally more valuable after attempting and digesting the problems yourself first.

Note that the copyright to the problems is retained by BMO, and I’m reproducing them here with permission. The paper can be found in its original format here.

Problem One

If one exhibits eighteen (= 6 x 3) explicit examples of relevant sums of consecutive positive odd numbers, one has finished the problem. An exhaustive search is impractical and uninteresting. However, beyond that, we would also like to find integers N which we are convinced have this property without needing to explicitly decompose them into the desired sum format.

Let’s study such a sum. We need an even number of odd numbers to end up with an even sum. Probably the algebraically neatest way is to set up such a sum symmetrically as:

2n-k, 2n-(k-2),\ldots,2n-1,2n+1,2n+3,\ldots, 2n+k,

for k an odd number. The sum of this sequence of consecutive odd numbers is 2n(k+1). So we seek integers N which can be written as 2nm, where m=k+1 is an even number, in as many ways as possible. Note that whenever N is factorised as a product of two distinct even numbers, we are forced to take the larger factor as 2n and the smaller as m, since we require the first term 2n-k to be strictly positive.

It’s equivalent to finding integers M which are less than 100 and have lots of factors (then multiplying by 4 to recover an N), and there are a few ways to do this, including now potentially by exhaustive search. It is useful to remember that if M can be factorised into primes as M=p_1^{a_1}\ldots p_k^{a_k} then its factors can be described as the set:

\{1,p_1,p_1^2,\ldots,p_1^{a_1}\}\times \{1,p_2,p_2^2,\ldots,p_2^{a_2}\}\times\cdots \times \{1,p_k,p_k^2,\ldots,p_k^{a_k}\},

In particular, the number of such factors is (a_1+1)(a_2+1)\ldots(a_k+1), independently of the choice of the primes. This gives a recipe for constructing all integers M<100 with at least twelve factors. We could take (a_i) to be some permutation of (11), or (5,1), or (3,2), or (2,1,1) (any other exponent pattern with at least twelve factors forces M\ge 2^3\times 3\times 5=120), which give the following valid M:

M=2^5\times 3 = 96, \; M=2^3\times 3^2 = 72, \; M=2^2\times 3\times 5 = 60, \; M=2^2\times 3 \times 7 = 84, \; M=2\times 3^2 \times 5 = 90,

and the corresponding valid N are 240, 288, 336, 360, and 384.
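If you’d rather let a computer do the drudgery, here is a minimal brute-force sketch (my own addition, not part of any official solution): it counts, for each N below 400, the representations of N as a sum of two or more consecutive positive odd numbers, and confirms that the five values above are exactly those admitting at least six.

    # Brute-force check (mine) of the list above: count representations of N as a
    # sum of two or more consecutive positive odd numbers, for all N below 400.
    def count_reps(N):
        count = 0
        for start in range(1, N + 1, 2):            # smallest odd number in the sum
            total, term, length = 0, start, 0
            while total < N:
                total += term
                term += 2
                length += 1
                if total == N and length >= 2:
                    count += 1
        return count

    print([N for N in range(1, 400) if count_reps(N) >= 6])
    # [240, 288, 336, 360, 384]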

For what it’s worth, this problem took me a while. I spent a long time saying things like “we need numbers that can be written as 4n and as 8n and as…” and aiming to conclude “all multiples of 24 greater than 24 x ?? will work”, but this didn’t ever generate enough examples of one class.

Problem Two

In order to have won exactly 30% of the games so far, the number of games so far must be a multiple of 10, and the same is true for a 70% success rate. The other success rates have similar constraints, but the required multiples are smaller. After ten games, of course, Arun cannot simultaneously have won both 30% and 70% of the total! So we need at least twenty games.

It feels as if we should be able to include the other success percentages so long as we can make 30% and 70% work. But can we do this in twenty games? This would require Arun to win 3 out of the first 10 and 14 out of the first 20 (or the exact opposite), and after some head-scratching, we see this is impossible, since it would require winning eleven of the ten games played between these observations. This is a nice observation, and we now know the answer must be at least thirty. Given that we have made some nifty, rigorous deductions already, one might speculate that this is the true answer. To verify this, we just have to give a construction, and there are a number of ways to achieve this, for example such that Arun wins

  • One of the first two
  • Two of the first five
  • Three of the first ten
  • Twelve of the first twenty
  • Twenty-one of the first thirty

To get full credit, almost any justification that a construction exists for 30 games would be fine – but one must be given. Otherwise how do you know that the true bound isn’t 40 or 50 because there’s some other reason why 30 fails?
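Purely for reassurance, here is a tiny consistency check of the construction above (my own; I’m assuming, as the list suggests, that the required success rates are 30%, 40%, 50%, 60% and 70%): the win counts must be non-decreasing, cannot increase by more than the number of games played in between, and must hit the stated percentages exactly.

    # Check (mine) that the construction above is feasible and hits each percentage.
    games   = [2, 5, 10, 20, 30]
    wins    = [1, 2, 3, 12, 21]
    percent = [50, 40, 30, 60, 70]              # assumed target success rates

    prev_g, prev_w = 0, 0
    for g, w, pct in zip(games, wins, percent):
        assert 0 <= w - prev_w <= g - prev_g    # can't win more games than were played
        assert 100 * w == pct * g               # exact percentage achieved
        prev_g, prev_w = g, w
    print("construction is consistent")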

Problem Three

Facing this problem for the first time, there’s a temptation to try values smaller than 2021, and this may well be useful. If nothing else, such experiments may well lead to a conjecture of the general form of the answer. Personally, I find it relatively easy to make mistakes when dealing with small numbers, and would turn to this as a second strategy after trying more theoretical routes first.

First note that the condition 0\le n\le 11 could be replaced by n\ge 0, since we are forced to include zero pieces of weight 2^{11}=2048, and would similarly be forced to include zero pieces of weight 2^n for any n\ge 11. (%)

When counting a potentially complex set, we should start by seeing whether there is anything which we can control effectively. For example, I don’t think it is clear how to control the number of pieces of weight 64 in a configuration. However, powers of two lend themselves well to inductive arguments (if you take the powers of two and multiply them all by two, you still have powers of two!) and to studying parity (that is, whether a number is odd or even).

In this case, note that you must have either one or three of the pieces with weight one to ensure the total weight is odd. Similarly, if the goal had been 2022, we would need either zero or two of these pieces to ensure the total weight is even.

So that settles the number of pieces of weight one, and it remains to solve the problem for total weight 2018 and 2020, with the extra condition that we are no longer allowed to use any pieces of weight one. A way to encode this is to divide everything in the problem by two. So we are now trying to count all the ways to generate weights 1009 and 1010.

We can reformulate this more generally, and assert that the number of configurations f(n) to make total weight n satisfies the relations:

f(2n)=f(2n+1)=f(n)+f(n-1). (*)

It would be quite tiresome to solve backwards in a binary-search fashion, but perhaps this is where trying some small examples becomes useful. Either by observation on the original problem, or by thinking about (*) with some initial condition like f(0)=f(1)=1, we find that f(n)=\lfloor n/2\rfloor +1. (That is, f(n)=\frac{n+2}{2} when n is even, and f(n)=\frac{n+1}{2} when n is odd.) In particular, f(2021)=1011.
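For anyone who wants to check the recurrence and the closed form by machine, here is a short sketch (my own), using the initial conditions f(0)=f(1)=1 discussed above:

    from functools import lru_cache

    # Check (mine) that (*) with f(0) = f(1) = 1 matches the closed form n//2 + 1.
    @lru_cache(maxsize=None)
    def f(n):
        if n <= 1:
            return 1
        m = n // 2                  # n equals 2m or 2m+1
        return f(m) + f(m - 1)

    assert all(f(n) == n // 2 + 1 for n in range(5000))
    print(f(2021))                  # 1011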

As a final aside, this question is an excellent exercise in using generating functions in an olympiad context. A couple of points are worth making:

  1. It is important to decide whether you’re including f(0) as an example, much more so than in the original argument. As we’ll see, it works better if you do include it, and will avoid an awkward sequence of algebra steps.
  2. Figuring out the answer, for example from small-ish cases, will make a generating function argument easier unless you are very fluent with manipulating rational functions. In this case, we would be studying \sum_{n=0}^\infty f(n)x^n, and trying to show that it is equal to

\sum (\lfloor n/2\rfloor +1)x^n = (1+x)(1+2x^2+3x^4+\ldots) = \frac{1+x}{(1-x^2)^2}. (**)

Many readers who’ve played around a bit with generating functions may find it easier to go from the left to the right than vice versa.

Writing down a direct generating function for f(n) and showing it matches up with (**) remains a good exercise. It is helpful to bear in mind the discussion at (%).
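As a quick check on (**) (again, my own addition rather than anything required for the problem), sympy will expand the rational function and confirm the coefficients:

    import sympy as sp

    # Check (mine) that the coefficients of (1+x)/(1-x^2)^2 are floor(n/2) + 1.
    x = sp.symbols('x')
    F = sp.series((1 + x) / (1 - x**2)**2, x, 0, 20).removeO()
    assert all(F.coeff(x, n) == n // 2 + 1 for n in range(20))
    print([F.coeff(x, n) for n in range(8)])    # [1, 1, 2, 2, 3, 3, 4, 4]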

Problem Four

Since I last blogged regularly, I have given a masterclass series on Euclidean geometry three times, including twice in an online format.

Part of the point of this masterclass is to get younger students to think in more general terms about what they are doing when they are solving geometry problems, even when they are straightforward. In a competition, time is tight, and anything that gets the job done is good, but there’s still value in reflecting on these ideas here. Two ideas in the first session are:

  • Attacking the conclusion coherently
  • Clean angle chasing

and this problem offers a good example of both of these in practice.

We’ll start with attacking the conclusion coherently. We’ve got to prove that a triangle is equilateral. How can we achieve this? There are a number of ways:

  • Prove all three lengths are equal
  • Prove that all three angles are equal
  • Prove that two angles are equal to 60 degrees.
  • Prove that one angle is equal to 60, and (any) two lengths are equal
  • Prove that one angle is equal to 60, and the other two angles are equal

It’s important to keep all of these in mind as one explores the configuration. It’s genuinely unclear at the very start which of these characterisations of equilateral triangles will be most appropriate. Note though, that the configuration is symmetric in the roles of D and E, so that, for example, proving that the angle at D is 60 would be sufficient, since by symmetry the angle at E would also be 60, and we would conclude.

As regards clean angle chasing, the main moral here is to avoid angle arithmetic unless you are convinced it’s necessary. Ideally one keeps track of what could be determined in terms of other quantities, and only attempts the algebra/arithmetic once you are sure that it will work. It certainly isn’t helpful to have diagrams littered with angle measures like 240^\circ - \theta - \frac{\alpha}{2}. Equality is the best thing to aim for. Try and demonstrate and notate equal angles wherever possible.

For example, here, even before introducing C,D,E, there is plenty of structure which might be useful. Essentially every length in the configuration involving (O_1,O_2,A,B) is equal, and all the angles are 60 or 120.

So we bear this in mind. I would consider proving that \angle DO_1E=60^\circ with the following figure.

There are a number of ways to complete the argument. Here are a couple of more obscure ones that illustrate my approach.

Firstly, let’s ignore the result that \angle DO_1E=60^\circ and proceed directly to \angle O_2DO_1=60^\circ. It suffices to show that the blue angles are equal, which is equivalent to demonstrating that ADO_2O_1 is cyclic.

To achieve this, we would like to show that the angle \angle AO_2D is equal to the red angle we’ve already discussed, and this follows by chasing round, using that O_1 is the circumcentre of triangle ABC, and then the fact that \angle AO_2D is the external angle opposite \angle ABC in cyclic quadrilateral ABCO_2. Indeed, note that this use of an equal external angle rather than ‘opposite angle is 180 – […]’ is the textbook example of clean angle chasing, where focusing on equality rather than arithmetic massively cleans up many diagrams.

As a second, rather overblown way to complete the argument, let’s show that the angles at D and E are equal, without calculating them. Since O_1D,O_1E are angle bisectors, we have two pairs of equal angles as shown.

It would suffice to show that these measures are themselves equal. This is saying that line ECO_2D is the external angle bisector of \angle ACB. And this is true, since O_2 is the arc-midpoint of AB on \Gamma_1!

I hope these observations are interesting. But, to reiterate, this level of complexity is unnecessary, and arguments by congruent triangles or direct angle chasing are perfectly satisfactory too!

Problem Five

The punchline of this problem is that m(N) is an integer precisely if N is a triangle number, from which it is quick to count the number of such N.

But how to arrive at this punchline? It is reasonably clear that the N-set with minimal mean must have the form (1,2,\ldots,k, N). One could proceed by setting up inequalities involving the means of (1,2,\ldots,k-1,N) and (1,2,\ldots,k,k+1,N) to establish a relationship between k and N.

But there is a more direct route. A useful, and very intuitive result about means is the following. If you have a set of numbers with mean x, and you add into the set a new number y, then the mean of the new larger set is between x and y. In particular, if y>x, then the new mean is larger than x.

In particular, consider the mean of (1,2,\ldots,k,N). If it is less than k, then removing k gives a smaller mean. If it is greater than k+1, then adding k+1 gives a smaller mean. Neither is possible for the minimising set, so the mean lies between k and k+1, and if it is an integer, it must be equal to k or k+1. That is, we have

\frac{1+2+\ldots+k+N}{k+1}=k\text{ or }k+1.

These lead to N=\frac{k(k+1)}{2}\text{ or }\frac{(k+1)(k+2)}{2}.
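Here is a small numerical check of the punchline (my own; it assumes, as argued above, that the minimising set has the form (1,2,\ldots,k,N), so that m(N) can be computed by minimising over k):

    from fractions import Fraction

    # Check (mine) that m(N), computed as the minimal mean of {1, ..., k, N} over
    # 0 <= k < N, is an integer exactly when N is a triangle number.
    def m(N):
        return min(Fraction(k * (k + 1) // 2 + N, k + 1) for k in range(N))

    triangles = {j * (j + 1) // 2 for j in range(1, 50)}
    for N in range(1, 200):
        assert (m(N).denominator == 1) == (N in triangles)
    print("checked N < 200")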

Problem Six

There must be some significance to the choice of 71 terms and the 999,999, but it’s hopeless to try to see this immediately. In this scenario, it really is helpful to play around with semi-small examples.

Note that the second term is 1 regardless of whether we view this as P for PREVIOUS or S for SUM. However, if we then have an S, our sequence starts 1,1,2,2… or 1,1,2,4,… and these are bad, as once we have two consecutive even numbers, we are doomed to see even numbers forever.

So once we have a common factor, that common factor is retained. How would we introduce a factor of, say, 11? Well, if you take 1,1,1,1,1,1,1,1,1,1,1 then take an S, you get 11, and afterwards you get either 11 or 22.

This gives a clue of the right way to set up the problem. For definiteness, we consider the second term to be an S. If you then consider a sequence of Ss and Ps, and write down the number of Ps between successive Ss (which may be zero) as a list p_1,p_2,p_3,\ldots, it holds that \sum p_i=70, and that each string of p_i Ps introduces a factor of p_i+1, leading to

\prod (p_i+1)=999,999=3^3\times 7\times 11\times 13\times 37.

Note that 2\times 3 + 6+10+12+36=70, so there exists a choice of the p_is that satisfies these relations simultaneously. (And if we had replaced 71 by a smaller choice, this would have been impossible.)
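The numerology is easy to confirm directly (a check of my own, not needed for the solution):

    # 999,999 = 3^3 x 7 x 11 x 13 x 37, and the values p_i = (prime factor - 1) sum to 70.
    n, d, primes = 999999, 2, []
    while n > 1:
        while n % d == 0:
            primes.append(d)
            n //= d
        d += 1
    p = [q - 1 for q in primes]
    print(primes, p, sum(p))    # [3, 3, 3, 7, 11, 13, 37] [2, 2, 2, 6, 10, 12, 36] 70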

The actual answer to the problem is obtained by studying how to rearrange the blocks of SPPP…s, but the main challenge lies in reducing to this form.

I thought this was a hard problem for BMO, and was impressed to see some solutions under exam conditions.


Kernels of critical graph components

This post is motivated by G(N,p), the classical Erdos-Renyi random graph, specifically its critical window, when p=p(N)=\frac{1}{N}(1+\lambda N^{-1/3}).

We start with the following observation, which makes no restriction on p. Suppose a component of G(N,p) is a tree. Then, the graph geometry of this component is that of a uniform random tree on the appropriate number of vertices. This is deliberately informal. To be formal, we’d have to say “condition on a particular subset of vertices forming a tree-component” and so on. But the formality is broadly irrelevant, because at the level of metric scaling limits, if we want to describe the structure of a tree component, it doesn’t matter whether it has \log N or \frac{1}{7}N vertices, because in both cases the tree structure is uniform. The only thing that changes is the scaling factor.

In general, when V vertices form a connected component of a graph with E edges, we define the excess to be E-V+1. So the excess is non-negative, and is zero precisely when the component is a tree. I’m reluctant to say that the excess counts the number of cycles in the component, but certainly it quantifies the amount of cyclic structure present. We will sometimes, in a mild abuse of notation, talk about excess edges. But note that for a connected component with positive excess, there is a priori no way to select which edges would be the excess edges. In a graph process, or when there is some underlying exploration of the component, there sometimes might be a canonical way to classify the excess edges, though it’s worth remarking that the risk of size-biasing errors is always extremely high in this sort of situation.

Returning to the random graph process, as so often there are big changes around criticality. In the subcritical regime, the components are small, and most of them, even the largest with high probability, are trees. In the supercritical regime, the giant component has excess \Theta(N), which is qualitatively very different.

It feels like every talk I’ve ever given has begun with an exposition of Aldous’s seminal paper [Al97] giving a distributional scaling limit of the sizes of critical components in the critical window, and a relation between the process on this time-scale and the multiplicative coalescent. And it remains relevant here, because the breadth-first exploration process can also be used to track the number of excess edges.

In a breadth-first exploration, we have a stack of vertices we are waiting to explore. We pick one and look at its neighbours restricted to the rest of the graph, that is, without the vertices we have already fully explored, and also without the other vertices in the stack. That’s the easiest way to handle the total component size. But we can simultaneously track how many times we would have joined to a neighbour within the stack, which leads to an excess edge, and Aldous derives a joint distributional scaling limit for the sizes of the critical components and their excesses. (Note that in this case, there is a canonical notion of excess edge, but it depends not just on the graph structure, but also on the extra randomness of the ordering within the breadth-first search.)

Roughly speaking, we consider the reflected exploration process, and its scaling limit, which is a reflected parabolically-drifting Brownian motion (though the details of this are not important at this level of exposition, except that it’s a well-behaved non-negative process that hits zero often). The component sizes are given by the widths of the excursions above zero, scaled up by a factor of N^{2/3}. Then conditional on the shape of the excursion, the excess is Poisson with parameter the area under the excursion, with no rescaling. That is, a critical component has \Theta(1) excess.

So, with Aldous’s result in the background, when we ask about the metric structure of these critical components, we are really asking: “what does a uniformly-chosen connected component with fixed excess look like when the number of vertices grows?”

I’ll try to keep notation light, but let’s say T(n,k) is a uniform choice from connected graphs on n vertices with excess k.

[Note, the separation of N and n is deliberate, because in the critical window, the connected components have size n = \Theta(N^{2/3}), so I want to distinguish the two problems.]

In this post, we will mainly address the question: “what does the cycle structure of T(n,k) look like for large n?” When k=0, we have a uniform tree, and the convergence of this to the Brownian CRT is now well-known [CRT2, LeGall]. We hope for results with a similar flavour for positive excess k.

2-cores and kernels

First, we have to give a precise statement of what it means to study just the cycle structure of a connected component. From now on I will assume we are always working with a connected graph.

There are several equivalent definitions of the 2-core C(G) of a graph G:

  • When the excess is positive, there are some cycles. The 2-core is the union of all edges which form part of some cycle, and any edges which lie on a path between two edges which both form part of some cycle.
  • C(G) is the maximal induced subgraph where all degrees are at least two.
  • If you remove all the leaves from the graph, then all the leaves from the remaining graph, and continue, the 2-core is the state you arrive at where there are no leaves.

It’s very helpful to think of the overall structure of the graph as consisting of the 2-core, with pendant trees ‘hanging off’ the 2-core. That is, we can view every vertex of the 2-core as the root of a (possibly size 1) tree. This is particularly clear if we remove all the edges of the 2-core from the graph. What remains is a forest, with one tree for each vertex of the 2-core.

In general, the k-core is the maximal induced subgraph where all degrees are at least k. The core is generally taken to be something rather different. For this post (and any immediate sequels) I will never refer to the k-core for k>2, and certainly not to the traditional core. So I write ‘core’ for ‘2-core’.

As you can see in the diagram, the core consists of lots of paths, and topologically, the lengths of these paths are redundant. So we will often consider instead the kernel, K(G), which is constructed by taking the core and contracting all the paths between vertices of degree greater than 2. The resulting graph has minimal degree at least three. So far we’ve made no comment about the simplicity of the original graphs, but certainly the kernel need not be simple. It will regularly have loops and multiple edges. The kernel of the graph and core in the previous diagram is therefore this:

Kernels of critical components

To recap, we can deconstruct a connected graph as follows. It has a kernel, and each edge of the kernel corresponds to a path of some length in the core. The rest of the graph consists of trees hanging off from the core vertices.
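To make the leaf-stripping and path-contraction descriptions concrete, here is a small sketch of my own (for small examples only, and assuming a connected simple graph with excess at least 2, so that the kernel is well-defined):

    from collections import Counter

    # Sketch (mine): the 2-core via repeated leaf-stripping, and the kernel as a
    # multigraph obtained by contracting maximal paths of degree-2 core vertices.
    def two_core(edges):
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        leaves = [v for v, nbrs in adj.items() if len(nbrs) <= 1]
        while leaves:
            v = leaves.pop()
            for u in adj.pop(v, set()):
                adj[u].discard(v)
                if len(adj[u]) == 1:
                    leaves.append(u)
        return adj                                  # adjacency sets of the core

    def kernel(core_adj):
        branch = {v for v, nbrs in core_adj.items() if len(nbrs) >= 3}
        multiplicity = Counter()
        for v in branch:
            for first in core_adj[v]:
                prev, cur = v, first                # walk until the next branch vertex
                while cur not in branch:
                    cur, prev = next(u for u in core_adj[cur] if u != prev), cur
                key = frozenset((v, cur)) if v != cur else ('loop', v)
                multiplicity[key] += 1
        return {e: m // 2 for e, m in multiplicity.items()}   # each edge is seen from both ends

    # Example: a theta graph on {1,2,3,4} with a pendant path 3-5-6 (excess 2).
    edges = [(1, 2), (2, 3), (1, 4), (4, 3), (1, 3), (3, 5), (5, 6)]
    core = two_core(edges)
    print(sorted(core))             # [1, 2, 3, 4]
    print(kernel(core))             # {frozenset({1, 3}): 3} -- a triple edge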

For now, we ask about the distribution of the kernel of a T(n,k). You might notice that the case k=1 is slightly awkward, as when the core consists of a single cycle, it’s somewhat ambiguous how to define the kernel. Everything we do is easily fixable for k=1, but rather than carry separate cases, we handle the case k\ge 2.

We first observe that fixing k doesn’t determine the number of vertices or edges in the kernel. For example, both of the following pictures could correspond to k=3:

However, with high probability the kernel is 3-regular, which suddenly makes the previous post relevant. As I said earlier, it can introduce size-biasing errors to add the excess edges one-at-a-time, but these should be constant factor errors, not scaling errors. So imagine the core of a large graph with excess k=2. For the sake of argument, assume the kernel has the dumbbell / handcuffs shape. Now add an extra edge somewhere. It’s asymptotically very unlikely that this is incident to one of the two vertices with degree three in the core. Note it would need to be incident to both to generate the right-hand picture above. Instead, the core will gain two new vertices of degree three.

Roughly equivalently, once the size of the core is fixed (and large) we have to make a uniform choice from connected graphs of this size where almost every vertex has degree 2, and \Theta(1) of the rest have degree 3 or higher. But the sum of the degrees is fixed, because the excess is fixed. If there are n vertices in the core, then there are \Theta(n) times as many graphs in which every vertex has degree 2 or 3 as there are graphs in which some vertex has degree at least 4. Let’s state this formally.

Proposition: The kernel of a uniform graph with n vertices and excess k\ge 2 is, with high probability as n\rightarrow\infty, 3-regular.

This is proved rather more formally as part of Theorem 7 of [JKLP], essentially as a corollary after some very comprehensive generating function setup; and in [LPW] with a more direct computation.

In the previous post, we introduced the configuration model as a method for constructing regular graphs (or any graphs with fixed degree sequence). We observe that, conditional on the event that the resulting graph is simple, it is in fact uniformly-distributed among simple graphs. When the graph is allowed to be a multigraph, this is no longer true. However, as remarked in (1.1) of [JKLP], for most applications the configuration model measure on multigraphs is the most natural.

Given a 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges, and K a uniform choice from the configuration model with these parameters, we have

\mathbb{P}\left( K \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1},

where t(H) is the number of loops in H, and mult(e) the multiplicity of an edge e. This might seem initially counter-intuitive, because it looks as if we are biasing against graphs with multiple edges, when perhaps our intuition is that because there are more ways to form a set of multiple edges we should bias in favour of it.

I think it’s most helpful to look at a diagram of a multigraph as shown, and ask how to assign stubs to edges. At a vertex with degree three, all stub assignments are different, that is 3!=6 possibilities. At the multiple edge, however, we care which stubs match with which stubs, but we don’t care about the order within the multi-edge. Alternatively, there are three choices of how to divide each vertex’s stubs into (2 for the multi-edge, 1 for the rest), and then two choices for how to match up the multi-edge stubs, ie 18 = 36/2 in total, and a discount factor of 2.
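To see the weighting in action, here is a small simulation of my own for the k=2 case: two kernel vertices, each with three stubs. The only 3-regular multigraphs are the ‘theta’ (a triple edge) and the ‘dumbbell’ (two loops joined by a single edge), and the formula above weights them proportionally to 1/3!=1/6 and 1/2^2=1/4, i.e. probabilities 2/5 and 3/5. Uniform stub-matching agrees:

    import random
    from collections import Counter

    # Configuration model (mine) with two vertices of degree 3: the k = 2 kernels.
    def sample():
        stubs = list(range(6))              # stubs 0-2 at vertex 0, stubs 3-5 at vertex 1
        random.shuffle(stubs)               # uniform perfect matching: pair consecutive stubs
        pairs = [(stubs[i] // 3, stubs[i + 1] // 3) for i in range(0, 6, 2)]
        loops = sum(u == v for u, v in pairs)
        return "dumbbell" if loops == 2 else "theta"

    random.seed(0)
    runs = 100000
    counts = Counter(sample() for _ in range(runs))
    print({g: round(c / runs, 3) for g, c in counts.items()})
    # roughly {'theta': 0.4, 'dumbbell': 0.6}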

We mention this because in fact K(T(n,k)) converges in distribution to this configuration model measure. Once you know that K(T(n,k)) is with high probability 3-regular, then again it’s probably easiest to think about the core; indeed you might as well condition on its total size and number of degree 3 vertices. It’s then not hard to convince yourself that a uniform choice induces a uniform choice of kernel. Again, let’s state that as a proposition.

Proposition: For any 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges as before,

\lim_{n\rightarrow\infty}\mathbb{P}\left( K(T(n,k)) \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1}.

As we said before, the kernel describes the topology of the core. To reconstruct the graph, we need to know the lengths in the core, and then how to glue pendant trees onto the core. But this final stage depends on k only through the total length of paths in the core. Given that information, it’s a combinatorial problem, and while I’m not claiming it’s easy, it’s essentially the same as for the case with k=1, and is worth treating separately.

It is worth clarifying a couple of things first though. Even the outline of methods above relies on the fact that the size of the core diverges as n grows. Again, the heuristic is that up to size-biasing errors, T(n,k) looks like a uniform tree with some uniformly-chosen extra edges. But distances in T(n,k) scale like n^{1/2} (and thus in critical components of G(N,p) scale like N^{1/3}). And the core will be roughly the set of edges on paths between the uniformly-chosen pairs of vertices, and so will also have length \Theta(n^{1/2}).

Once you have conditioned on the kernel structure, and the (large) number of internal vertices on paths in the core (ie the length of the core), it is natural that the assignment of the degree-2 vertices to core paths / kernel edges is uniform. A consequence of this is that if you record (Y_1,\ldots,Y_m) the lengths of paths in the core, where m=3(k-1), then

\frac{(Y_1,\ldots,Y_m)}{\sum Y_i} \stackrel{d}\rightarrow \mathrm{Dirichlet}(1,1,\ldots,1).

This is stated formally as Corollary 7 b) of [ABG09]. It’s worth noting that this confirms that the lengths of core paths are bounded in probability away from zero after the appropriate rescaling. In seeking a metric scaling limit, this is convenient as it means there’s no danger that two of the degree-3 vertices end up in ‘the same place’ in the scaling limit object.

To recap, the only missing ingredients now to give a complete limiting metric description of T(n,k) are 1) a distributional limit of the total core length; 2) some appropriate description of the set of pendant trees conditional on the size of the pendant forest. [ABG09] show the first of these. As remarked before, all the content of the second of these is encoded in the unicyclic k=1 case, which I have written about before, albeit slightly sketchily, here. (Note that in that post we get around size-biasing by counting a slightly different object, namely unicyclic graphs with an identified cyclic edge.)

However, [ABG09] also propose an alternative construction, which you can think of as glueing CRTs directly onto the stubs of the kernel (with the same distribution as before). The proof that this construction works isn’t as painful as one might fear, and allows a lot of the other metric distributional results to be read off as corollaries.

References

[ABG09] – Addario-Berry, Broutin, Goldschmidt – Critical random graphs: limiting constructions and distributional properties

[CRT2] – Aldous – The continuum random tree: II

[Al97] – Aldous – Brownian excursions, critical random graphs and the multiplicative coalescent

[JKLP] – Janson, Knuth, Luczak, Pittel – The birth of the giant component

[LeGall] – Le Gall – Random trees and applications

[LPW] – Luczak, Pittel, Wierman – The structure of a random graph at the point of the phase transition

 

Enumerating Forests

I’ve just got back from a visit to Budapest University of Technology, where it was very pleasant to be invited to give a talk, as well as to continue the discussion of our research programme with Balazs. My talk concerned a limit for the exploration process of an Erdos-Renyi random graph conditioned to have no cycles. Watch this space (hopefully very soon) for a fully rigorous account of this. In any case, my timings were not as slick as I would like, and I had to miss out a chunk I’d planned to say about a result of Britikov concerning enumerating unrooted forests. It therefore feels like an excellent time to write something again, and explain this paper, which you might be able to find here, if you have appropriate journal rights.

We are interested in calculating a_{n,m}, the number of forests with vertex set [n] consisting of m unrooted trees. Recall that if we were interested in rooted trees, we could appeal to Prufer codes to show that there are m n^{n-m-1} such forests, and indeed results of Pitman give a coalescent/fragmentation scheme as m varies between 1 and n-1. It seems that there is no neat combinatorial re-interpretation of the unrooted case though, so Britikov uses an analytic method.

We know that

a_{n,m}= \frac{n!}{m!} \sum_{\substack{k_1+\ldots+k_m=n\\ k_i\ge 1}} \prod_{j=1}^m \frac{k_j^{k_j-2}}{k_j!}.

To see this, observe that the k_js correspond to the sizes of the m trees in the forest; \frac{n!}{\prod k_j!} gives the multinomial number of ways to assign vertices to the trees; given the labels for a tree of size k_j, there are k_j^{k_j-2} ways to make up the tree itself; and \frac{1}{m!} accounts for the fact that the trees have no order.
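For small values this formula can be evaluated directly; here is a quick check of my own (inefficient, but it recovers Cayley's formula when m=1 and the single-edge count when m=n-1):

    from fractions import Fraction
    from math import factorial, prod

    # Direct evaluation (mine) of the displayed formula for a_{n,m}, small cases only.
    def compositions(total, parts):
        if parts == 1:
            yield (total,)
            return
        for first in range(1, total - parts + 2):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    def a(n, m):
        s = sum(prod(Fraction(k) ** (k - 2) / factorial(k) for k in ks)
                for ks in compositions(n, m))
        return Fraction(factorial(n), factorial(m)) * s

    assert all(a(n, 1) == n ** (n - 2) for n in range(2, 8))          # Cayley's formula
    assert all(a(n, n - 1) == n * (n - 1) // 2 for n in range(2, 8))  # forests with a single edge
    print(a(5, 2))    # 110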

What we would really like to do is to take the uniform distribution on the set of all labelled trees, then simulate m IID copies of this distribution, and condition the union to contain precisely n vertices. But obviously this is an infinite set, so we cannot choose uniformly from it. Instead, we can tilt so that large trees are unlikely. In particular, for each x we define

\mathbb{P}(\xi=k) \propto \frac{k^{k-2} x^k}{k!},

and define the normalising constant

B(x):= \sum_{k\ge 1} \frac{k^{k-2}x^k}{k!},

whenever it exists. It turns out that x\le e^{-1} is precisely the condition for B(x)<\infty. Note now that if \xi_1,\xi_2,\ldots are IID copies of \xi, then

\mathbb{P}(\xi_1+\ldots+\xi_m=n) = \frac{x^n}{B(x)^m} \sum_{k_1+\ldots + k_m=n} \prod_{j=1}^m \frac{k_j^{k_j-2}}{k_j!},

and so we obtain

a_{n,m}= \frac{n!}{m!} \frac{B(x)^m}{x^n} \mathbb{P}(\xi_1+\ldots + \xi_m=n).

So asymptotics for a_{n,m} might follow from laws of large numbers for this distribution \xi.

So far, we haven’t said anything about how to choose this value x. But observe that if you want to have lots of trees in the forest, then the individual trees should generally be small, so we take x small to tilt away from a preference for large trees. It turns out that there is a similar interpretation of criticality for forests as for general graphs, and taking x equal to 1/e, the radius of convergence, works well for this setting. If you want even fewer trees, there is no option to take x larger than 1/e, but instead one can use large deviations machinery rather than law of large numbers asymptotics.

We will be interested in asymptotics of the characteristic function of \xi for x=1/e. In particular \mathbb{E}[e^{it\xi}]=\frac{B(xe^{it})}{B(x)}, and it will be enough to clarify the behaviour of this as t\rightarrow 0. It’s easier to work with a related analytic function

\theta(x)=\sum_{k\ge 1} \frac{k^{k-1}x^k}{k!},

which satisfies \theta(x)=xB'(x). What now feels like a long time ago I wrote a masters’ thesis on the subject of multiplicative coalescence, and this shows up as the generating function of the solutions to Smoluchowski’s equations with monodisperse initial conditions, which are themselves closely related to the Borel distributions. In any case, several of the early papers on this topic made progress by establishing that the radius of convergence is 1/e, and that \theta(x)e^{-\theta(x)}=x everywhere where |x|\le 1/e. We want to consider x=1/e, for which \theta=1.

Note that \mathbb{E}\xi = \frac{\theta(x)}{B(x)}, so we will make progress by relating B(x),\theta(x) in two ways. One way involves playing around with contour integrals in a fashion that is clear in print, but involves quite a lot of notation. The second way is the Renyi relation which asserts that \theta(x)=B(x)+\frac{\theta(x)^2}{2}. We will briefly give a combinatorial proof. Observe that after multiplying through by factorials and interpreting the square of a generating function, this is equivalent to

k^{k-1} = k^{k-2} + \frac12 \sum_{\substack{l+m=k\\l,m\ge 1}} l^{l-1}m^{m-1}\binom{k}{l},

for all k. As we might expect from the appearance of this equality, we can prove it using a bijection on trees. Obviously on the LHS we have the size of the set of rooted trees on [k]. Now consider the set of pairs of disjoint rooted trees with vertex set [k]. This second term on the RHS is clearly the size of this set. Given an element of this set, join up the two roots, and choose whichever root was not initially in the same tree as 1 to be the new root. We claim this gives a bijection between this set, and the set of rooted trees on [k], for which 1 is not the root. Given the latter, the only pair of trees that leads to the right rooted tree on [k] under this mapping is given by cutting off the unique edge incident to the root that separates the root and vertex 1. In particular, since there is a canonical bijection between rooted trees for which 1 is the root, and unrooted trees (!), we can conclude the Renyi relation.
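The identity is also easy to confirm numerically for small k (a check of my own):

    from math import comb

    # Numerical check (mine) of the identity above for small k.
    for k in range(2, 12):
        pairs = sum(comb(k, l) * l ** (l - 1) * (k - l) ** (k - l - 1) for l in range(1, k))
        assert k ** (k - 1) == k ** (k - 2) + pairs // 2
    print("identity holds for k = 2, ..., 11")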

The Renyi relation now gives \mathbb{E}\xi = \frac{\theta(x)}{B(x)}=2 when x=1/e. If we wanted, we could show that the variance is infinite, which is not completely surprising, as the parameter x lies on the radius of convergence of the generating function.

Now, playing around with contour integrals, and being careful about which strands to take leads to the asymptotic as t\rightarrow 0

\mathbb{E}[ e^{it\xi}] = 1+2it + \frac{2}{3}i |2t|^{3/2} (i\mathrm{sign}(t))^{3/2} + o(|t|^{3/2}).

So from this, we can show that the characteristic function of the rescaled centred partial sum \frac{\xi_1+\ldots+\xi_N-2N}{bN^{2/3}} converges to \exp(-|t|^{3/2}\exp(\frac{i\pi}{4}\mathrm{sign} t)), where b= (32/9)^{1/3} is a constant arising out of the previous step.

We recognise this as the characteristic function of the stable distribution with parameters 3/2 and -1. In particular, we know now that \xi is in the domain of attraction for a stable-3/2 distribution. If we wanted a version of the central limit theorem for such partial sums, we could have that, but since we care about the partial sums of the \xi_is taking a specific value, rather than a range of values on the scale of the fluctuations, we actually need a local limit theorem.

To make this clear, let’s return to the simplest example of the CLT, with some random variables with mean \mu and variance \sigma^2<\infty. Then the partial sums satisfy

\mathbb{P}(\mu N + a\sigma\sqrt{N} \le S_N \le \mu N+b\sigma\sqrt{N}) \rightarrow \int_a^b f_{\mathcal N}(x)dx,

as N\rightarrow\infty. But what about the probability of S_N taking a particular value m that lies between \mu N+a\sigma \sqrt{N} and \mu N + b\sigma \sqrt{N}? If the underlying distribution was continuous, this would be uncontroversial – considering the probability of lying in a range that is smaller than the scale of the CLT can be shown in a similar way to the CLT itself. A local limit theorem asserts that when the underlying distribution is supported on some lattice, most naturally the integers, then these probabilities are in the limit roughly the same whenever m is close to \mu N+a\sigma\sqrt{N}.

In this setting, a result of Ibragimov and Linnik that I have struggled to find anywhere in print (especially in English) gives us local limit theory for integer-supported distributions in the domain of attraction of a stable distribution. Taking p(\cdot) to be the density of this distribution, we obtain

bm^{2/3}\mathbb{P}(\xi_1+\ldots+\xi_m=n) - p(\frac{n-2m}{b m^{2/3}}) \rightarrow 0

as n\rightarrow\infty, uniformly on any set of m for which z= \frac{n-2m}{bm^{2/3}} is bounded. Conveniently, the two occurrences of b cancel, and Britikov obtains

a_{n,m} = (1+o(1)) \frac{\sqrt{2\pi} n^{n-1/6}}{2^{n-m}(n-m)!} p\left(\frac{n-2m}{n^{2/3}}\right),

uniformly in the same sense as before.

The Yule Process

The second problem sheet for classes on the Applied Probability course this term features a long question about the Yule process. This is probably the simplest example of a birth process. It’s named for the British statistician George Udny Yule, though some sources prefer to call it the Yule-Furry process for the American physicist Wendell Furry who used it as a model of a radioactive reaction.

The model is straightforward. At any time there is some number of individuals in the population, and each individual gives birth to an offspring at constant rate \lambda, independently from the rest of the population. After a birth has happened, the parent and child evolve independently. In the notation of general birth processes, the birth rate when there are n individuals is \lambda_n=\lambda n.

Note that if we start with two or more individuals, the sizes of the two or more families of descendants evolve as a continuous-time Polya’s urn. The arrivals process speeds up with time, but the jump chain is exactly Polya’s urn. Unsurprisingly, the Yule process can be found embedded in preferential attachment models, and other processes which are based around Polya’s urn with extra information.

This is a discrete, random version of exponential growth. Since the geometric distribution is the discrete analogue of the exponential distribution, we probably shouldn’t be surprised to learn that this is indeed the distribution of the process at some fixed time t, when it is started from a single original ancestor. This is all we care about, since the numbers of descendants of the different original ancestors are independent. In general, the distribution of the population size at some fixed time will be negative binomial, that is, a sum of IID geometric distributions.

The standard method here is to proceed using generating functions. Conditioning on the first splitting time gives two independent copies of the original process over a shorter time-scale. One derives an ODE in time for the generating function evaluated at any particular value z. This can be solved uniquely for each z, and patching together gives the generating function of the distribution at any specific time t, which can be seen to coincide with the corresponding generating function of the geometric distribution with parameter e^{-\lambda t}.

So we were trying to decide whether there might be a more heuristic argument for this geometric distribution. The method we came up with is not immediate, but does justify the geometric distribution in a couple of steps. First, we say that the birth times are T_2,T_3,\ldots, so between times [T_n,T_{n+1}) there are n individuals, with T_1:=0 for concreteness. Then by construction of the birth process, T_{n+1}-T_n\stackrel{d}{=}\mathrm{Exp}(\lambda n).

We now look at these ‘inter-birth times’ backwards, starting from T_{n+1}. Note that \mathrm{Exp}(\lambda n) is the distribution of the time for the first of n IID \mathrm{Exp}(\lambda) clocks to ring. But then, looking backwards, the next inter-birth time has the distribution of the time for the first of (n-1) IID \mathrm{Exp}(\lambda) clocks to ring. So by memorylessness of the exponential distribution (discussed at great length on the first problem sheet), we can actually take these (n-1) clocks to be exactly those of the original n clocks which did not ring first. Continuing this argument, we can show that the first (in the original time direction) inter-birth time corresponds to the time spent waiting for the final clock to ring. Rewriting this observation formally:

T_{n+1}\stackrel{d}{=}\max\{X_i : X_1,\ldots,X_n\stackrel{\text{iid}}{\sim}\mathrm{Exp}(\lambda)\}. (*)

To return to justifying the geometric form of the distribution, we need to clarify the easiest relationship between the population size at a fixed time and these birth times. As we are aiming for the geometric distribution, the probability of the event \{X_t>n\} will be most useful. Clearly this event is the same as \{T_{n+1}<t\}, and from the description involving maxima of IID exponentials, this is easy to compute as (1-e^{-\lambda t})^n, which is exactly what we want.
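If you want to see this without any calculation at all, a quick Monte Carlo sketch of my own, using the inter-birth description T_{n+1}-T_n ~ Exp(\lambda n), reproduces the geometric distribution:

    import random
    from math import exp
    from collections import Counter

    # Simulation (mine): the Yule population at time t, started from one individual,
    # compared against the geometric pmf with parameter p = exp(-lambda * t).
    def yule_population(lam, t):
        n, time = 1, 0.0
        while True:
            time += random.expovariate(lam * n)     # next inter-birth time ~ Exp(lambda * n)
            if time > t:
                return n
            n += 1

    random.seed(1)
    lam, t, runs = 1.0, 1.0, 200000
    counts = Counter(yule_population(lam, t) for _ in range(runs))
    p = exp(-lam * t)
    for k in range(1, 6):
        print(k, round(counts[k] / runs, 4), round(p * (1 - p) ** (k - 1), 4))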

There are two interesting couplings hidden in these constructions. On closer inspection they turn out to be essentially the same from two different perspectives.

We have specified the distribution of T_n at (*). Look at this distribution on the right hand side. There is a very natural way to couple these distributions for all n, namely to take some infinite sequence X_1,X_2,\ldots of IID \mathrm{Exp}(\lambda) random variables, then use initial sequences of these to generate each of the T_ns as described in (*).

Does this coupling correspond to the use of these IID RVs in the birth process? Well, in fact it doesn’t. Examining the argument, we can see that X_1 gives a different inter-birth time for each value of n in the correspondence proposed. Even more concretely, in the birth process, almost surely T_{n+1}>T_n for each n. This is not true if we take the canonical coupling of (*). Here, if X_n<\max\{X_1,\ldots,X_{n-1}\}, which happens with high probability for large n, we have T_{n+1}=T_n in the process of running maxima.

Perhaps more interestingly, we might observe that this birth process gives a coupling of the geometric distributions. If we want to recover the standard parameterisation of the geometric distribution, we should reparameterise time. [And thus generate an essentially inevitable temptation to make some joke about now having a Yule Log process.]

Let’s consider what the standard coupling might be. For a binomial random variable, either on [n] or some more exotic set, as in percolation, we can couple across all values of the parameter by constructing a family of independent uniform random variables, and returning a 1 if U_i>1-p and so on, where p is the parameter of a specific binomial realisation.

We can do exactly the same here. A geometric distribution can be justified as the first success in a sequence of Bernoulli trials, so again we can replace the relevant Bernoulli distribution with a uniform distribution. Take U_1,U_2,\ldots to be IID U[0,1] random variables. Then, we have:

X_t\stackrel{d}{=}\bar X_t:= \max\{n: U_1,\ldots,U_{n-1}\ge e^{-\lambda t}\}.

The equality in distribution holds for any particular value of t by construction. But it certainly doesn’t hold uniformly in t. Note that if we define \bar X_t as a process, then typically the jumps of this process will be greater than 1, which is forbidden in the Yule process.

So, we have seen that this Yule process, even though its distribution at a fixed time has a standard form, provides a coupling of such distributions that is perhaps slightly surprising.

Random Maps 1 – Towards the Schaeffer Bijection

I have spent the past ten days in Saint Flour, an inaccessible but picturesque town in the rolling hills of Cantal, in the middle of France, and venue for perhaps the most notable summer school in probability. My highlight has been the course ‘Aspects of Random Maps’ delivered by Gregory Miermont, and I thought I should write a few posts about points of interest encountered during the lectures and private study.

A map is an embedding of a connected graph onto a surface. We typically consider this embedding only up to homeomorphisms of the surface which preserve its orientation. One advantage of doing this is that the set of maps is now countable, and the set of maps with n edges is finite. This can be proved by considering a map to be obtained by glueing together polygonal faces. Some potential glueings are impossible, and some are equivalent, but for a fixed number of edges, the number of such sets of polygons together with a possible glueing is finite. In fact we can be much more precise about how to describe the legal glueings through a triple of permutations, but I won’t discuss this here.

I haven’t yet given a complete definition of a map. We want a typical large map, that is a map with a large number of vertices and edges, to be topologically roughly the same as the surface it is embedded into. In particular, the map needs to encode the geometric features of the surface. So a small triangle on the surface of a torus should not be considered a map. To rigorise this, we demand that any face of a map should be a topological disc, in particular, it should be simply connected. Since the torus itself is not simply connected, this excludes our triangle example. Note a single vertex on a torus is also excluded.

Although it goes against the usual order of definitions, it might be helpful to think of a map as an embedding which satisfies Euler’s formula: V - E + F = 2 - 2g, where g is the genus of the surface. For a connected planar graph, induction on the number of edges and vertices is the typical way to prove this result. The inductive step works the same on a more general surface, but it is less clear what the base case should be. Another consequence of the definition is that we should work on the sphere rather than the plane. From now on, this is our surface of interest.

We begin by considering \mathcal{M}_n to be the family of rooted plane maps with n edges. The root is a distinguished oriented edge. Our aim is to count the size of this set.

Before doing this, we digress onto the topic of rooted plane trees. Note that any (rooted) tree in the classical sense is planar, but in a rooted plane tree, we also specify the geometric ordering of the offspring. For example, if the root has two offspring, of which one has precisely one offspring and the other has none, we consider these as two separate cases.

So now, if we denote by a_k the number of plane trees with k edges, we can define a generating function via A(z):=\sum_{k\ge 0} a_k z^k. If the root vertex v has no offspring, this gives one possibility corresponding to k=0. Otherwise, there is a well-defined left-most offspring of the root, called u. Then u and its descendants form a plane tree, and v and its descendants apart from those through u also form a plane tree. So after accounting for the edge between u and v, we obtain

A(z)=1+zA(z)^2,

whenever A(z) is defined. We now can apply whichever is our favourite method of showing that this is the generating function of the Catalan numbers, a_k=\frac{1}{k+1}\binom{2k}{k}.
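As a sanity check of my own, the recursion encoded by A(z)=1+zA(z)^2, namely a_k=\sum_{i=0}^{k-1}a_i a_{k-1-i}, does indeed generate the Catalan numbers:

    from math import comb

    # The subtree through u has i edges, the rest of the tree has k-1-i edges.
    a = [1]                                         # a_0 = 1: the single-vertex tree
    for k in range(1, 12):
        a.append(sum(a[i] * a[k - 1 - i] for i in range(k)))
    assert all(a[k] == comb(2 * k, k) // (k + 1) for k in range(12))
    print(a)    # [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786]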

There is a more complicated version of this generating function argument due to Tutte that allows us to enumerate \mathcal{M}_n. It is convenient to work with a second variable in the generating function that encodes the degree of the root face. The resulting equation of generating functions is less well-known but using the Lagrange inversion formula gives the explicit expression

|\mathcal{M}_n|=\frac{2}{n+2}\cdot \frac{3^n}{n+1}\binom{2n}{n}.

Although there are extra terms, this motivates seeking a bijection between maps, and some version of rooted plane trees, perhaps decorated with some extra information. As in many cases, this will turn out to be possible. The bijection we end up with will not just help us enumerate the maps, but will also allow us to control a lot more information about distances in the map, which will be particularly useful when we try to take limits.

The first observation is that given a map, we can construct a dual map, by placing fresh vertices somewhere in the middle of each face, and joining a pair of these if the corresponding faces in the original graph share an edge.

Alternatively, we can place the same fresh vertices in the middle of each face, then join each new vertex to an original vertex, if that original vertex lies on the face corresponding to the new vertex. If you focus in on an original edge, it is clear that it is now surrounded by a ‘diamond’ (if you’ve drawn the diagram in a natural way) of new edges. Removing the original edges thus leaves us with a quadrangulation. This procedure is called the ‘trivial bijection’ between \mathcal{M}_n and \mathcal{Q}_n, the family of rooted quadrangulations with n faces. Note that the root in such a quadrangulation is an identified directed edge, rather than a vertex. We haven’t yet specified how to describe the root of the resulting quadrangulation. It suffices to take the first new edge which lies clockwise of the root edge in the original graph, seen from the ‘tail’ of the root, which is of course oriented.

In this, and the bijections which follow, the natural questions to ask are: a) is the inverse obvious? and b) what happens to self-loops and isthmuses? Here, the inverse really is obvious. Any quadrangulation is bipartite, hence two-colourable, so we need to fix one colour and join the two vertices of that colour within each face to recover the original graph. The root tells us which colour we need to take. As for the second question, first we should say that an isthmus is an edge which has the same face on both sides. This causes no problems in this particular bijection. For the self-loops, we get a sort of Pacman-like quadrangle, with two ‘outer-edges’ between the same two vertices, and an edge between one of the outer vertices and some internal vertex. This edge contributes twice to the degree of the face.

The upshot of this is that for a simple enumeration, it suffices to prove that |\mathcal{Q}_n|=\frac{2}{n+2}\cdot \frac{3^n}{n+1}\binom{2n}{n}. This may not look like we have achieved much, but we can now apply Euler’s formula to any quadrangulation in this set to deduce that the number of vertices present is n+2. If we consider the set \mathcal{Q}_n^*, where now we identify a particular vertex v_* in the quadrangulation, it suffices to prove that |\mathcal{Q}_n^*|=2\cdot 3^n \cdot a_n, where a_n is the nth Catalan number as before. Now we have the most efficient setup to look for a bijection with some type of decorated plane tree as discussed before.

Avoiding Mistakes in Probability Exams

Over the past week, I’ve given several tutorials to second year undergraduates preparing for upcoming papers on probability and statistics. In particular, I’ve now seen a lot of solutions to a lot of past papers and specimen questions, and it’s worthwhile to consider some of the typical mistakes students can make on these questions. Of course, as with any maths exam, there’s always the possibility of a particularly subtle or involved question coming up, but if the following three common areas of difficulty can be avoided, you’re on track for doing well.

Jacobians

In a previous course, a student will learn how to calculate the pdf of a function of a random variable. Here, we move onto the more interesting and useful case of finding the (joint) density of function(s) of two or more random variables. The key thing to remember here is that manipulating pdfs is not a strange arbitrary exercise – it is just integration. It is rarely of interest to consider the value of a pdf at a single point. We can draw meaningful conclusions from a pdf or from comparison of two pdfs by integrating them.

Then the question of substituting for new random variables is precisely integration by substitution, which we are totally happy with in the one-dimensional case, and should be fairly happy with in the two-dimensional case. To get from one joint density to another, we multiply by the absolute value of the Jacobian. To ensure you get it right, it makes sense to write out the informal infinitesimal relation

f_{U,V}(u,v) du dv = f_{X,Y}(x,y)dx dy.

This is certainly relevant if we put integral signs in front of both sides, and explains why you obtain f_{U,V} = \frac{d(x,y)}{d(u,v)} f_{X,Y} rather than the other way round. Note though that if \frac{d(u,v)}{d(x,y)} is easier to calculate for some reason, then you can evaluate this and take the inverse, as your functions will almost certainly be locally bijective almost everywhere.

It is important to take the modulus of the Jacobian, since densities cannot be negative! If this looks like a fudge, then consider the situation in one dimension. If we substitute for x\mapsto f(x)=1-x, then f’ is obviously negative, BUT we also end up reversing the order of the bounds of the integral, eg [1/3, 3/4] will become [2/3, 1/4]. So we have a negative integrand (after multiplying by f'(x)) but bounds in the ‘wrong’ order. These two factors of -1 will obviously cancel, so it suffices just to multiply by |f'(x)| at that stage. It is harder to express in words, but a similar relation works for the Jacobian substitution.

You also need to check where the new joint density is non-zero. Suppose X, Y are supported on [0,1], then when we write f_{X,Y}(x,y) we should indicate that it is 0 off this region, either by splitting into cases, or adding the indicator function 1_{\{x,y\in[0,1]\}} as a factor. This is even more important after substitutions, as the range of the resulting random variables might be less obvious than the originals. Eg with X,Y as above, and U=X^2, V=X/Y, the resulting pdf will be non-zero only when u\in[0,1], v\ge \sqrt{u}. Failing to account for this will often lead to ludicrous answers. A general rule is that you can always check that any distribution you’ve produced does actually integrate to one.
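As an illustration of all three points, here is a short sympy sketch of my own for the U=X^2, V=X/Y example above: it computes the Jacobian from the inverse transformation and checks that the resulting density integrates to one over the correct region.

    import sympy as sp

    # X, Y uniform on [0,1]; U = X^2, V = X/Y.  Invert, take |Jacobian|, and check
    # the new density integrates to 1 over u in [0,1], v >= sqrt(u).  (My own example.)
    u, v = sp.symbols('u v', positive=True)
    x, y = sp.sqrt(u), sp.sqrt(u) / v               # the inverse transformation

    J = sp.Matrix([[sp.diff(x, u), sp.diff(x, v)],
                   [sp.diff(y, u), sp.diff(y, v)]]).det()
    f_uv = sp.Abs(J)                                # f_{X,Y} = 1 on the unit square
    print(f_uv)                                     # 1/(2*v**2)
    print(sp.integrate(f_uv, (v, sp.sqrt(u), sp.oo), (u, 0, 1)))   # 1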

Convergence using MGFs

There are two main reasons to use MGFs and PGFs. The first is that they behave nicely when applied to (possibly random) sums of independent random variables. The independence property is crucial to allow splitting of the MGF of the sum into the product of MGFs of the summands. Of course, implicit in this argument is that MGFs determine distributions.

A key theorem of the course is that this works even in the limit, so you can use MGFs to show convergence in distribution of a family of distributions. For this, you need to show that the MGFs converge pointwise on some interval [-a,a] around 0. (Note that the moments of the distribution are given by the family of derivatives at 0, as motivation for why this condition might be necessary.) Normally for such questions, you will have been asked to define the MGF earlier in the question, and probably will have found the MGF of a particular distribution or family of distributions, which might well end up appearing as the final answer.

Sometimes such an argument might involve substituting in something unusual, like t/N, rather than t, into a known MGF. Normally a Taylor series can be used to show the final convergence result. If you have a fraction, try to cancel terms so that you only have to evaluate one Taylor series, rather than lots.

Using the Markov Property

The Markov property is initially confusing, but once we become comfortable with the statement, it is increasingly irritating to have to answer the question: “show that this process has the Markov property.” This question is irritating because in most cases we want to answer: “because it obviously does!” Which is compelling, but unlikely to be considered satisfactory in a mathematics exam. Normally we observe that the random dynamics of the next step are a function only of the present location. Looking for the word ‘independent’ in the statement of the process under discussion is a good place to start for any argument along these lines.

The most developed example of a Markov process in this course is the Poisson process. I’ve written far too much about this before, so I won’t do so again, except to say this. When we think of the Poisson process, we generally have two thoughts going through our minds, namely the equivalent definitions of IID exponential inter-arrival times, and stationary, independent, Poisson increments (or the infinitesimal version). If we draw a sketch of a sample trajectory of this process, we can label everything up and it is clear how it all fits together. But if you are asked to give a definition of the Poisson process (N_t), it is inappropriate to talk about inter-arrival times unless you define them in terms of N_t, since that is the process you are actually trying to define! It is fine to write out

T_k:=\min\{t: N_t=k\},\quad N_t=\max\{k: Y_1+Y_2+\ldots+Y_k\le t\}

but the relation between the two characterisations of the process is not obvious. That is why it is a theorem of the course.

We have to be particularly careful of the difference in definition when we are calculating probabilities of various events. A classic example is this. Find the distribution of N_2, conditional on T_3=1. It’s very tempting to come up with some massive waffle to argue that the answer is 3+Po(1). The most streamlined observation is that the problem is easy if we are conditioning instead on N_1=3. We just use the independent Poisson increments definition of (N_t), with no reference to inter-arrival times required. But then the Markov property applied at time 1 says that the distribution of (N_2) depends only on the value of N_1, not on the process on the interval [0,1). In a sense, the condition that T_3=1 is giving us extra information on the behaviour of the process up to time 1, and the Markov property, which we know holds for the Poisson process, asserts precisely that the extra information doesn’t matter.
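A quick simulation makes the point concrete (a hypothetical Python sketch, assuming a unit-rate process): conditional on N_1=3, the increment N_2-N_1 should be Po(1), so in particular it should have mean and variance close to 1.

import numpy as np

rng = np.random.default_rng(0)
runs = 200000
gaps = rng.exponential(1.0, size=(runs, 30))     # 30 unit-rate inter-arrival times per run, plenty to cover [0,2]
arrivals = np.cumsum(gaps, axis=1)
n1 = (arrivals <= 1.0).sum(axis=1)               # N_1 for each run
n2 = (arrivals <= 2.0).sum(axis=1)               # N_2 for each run
increment = n2[n1 == 3] - 3                      # N_2 - N_1 on the event {N_1 = 3}
print(increment.mean(), increment.var())         # both should be close to 1, consistent with Po(1)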

Generating Functions for Dice

So last week I was writing an article for Betting Expert about laws of large numbers, and I was trying to produce some representations of distributions to illustrate the Weak LLN and the Central Limit Theorem. Because tossing a coin feels too simplistic, and also because the natural state space for this random variable, at least verbally, is not a subset of the reals, I decided to go for dice instead. So it’s clear what the distribution of the outcome of a single dice roll is, and with a bit of thought or a 6×6 grid, we can work out the distribution of the average of two dice rolls. But what about 100 rolls? Obviously, we need large samples to illustrate the laws of large numbers! In this post, we discuss how to calculate the distribution of the sample mean of n dice rolls.

First we observe that the total number of outcomes of n dice rolls is 6^n. The sum of the outcomes must lie between n and 6n inclusive. The distribution of the sum and the distribution of the sample mean are equivalent up to dividing by n. The final observation is that because the total number of outcomes has a nice form, we shouldn’t expect it to make any difference to the method if we calculate the probability of a given sum, or the number of configurations giving rise to that sum.

Indeed, tying in nicely with the first year probability course, we are going to use generating functions, and there is no difference in practice between the probability generating function, and the combinatorial generating function, if the underlying mechanism is a uniform choice. Well, in practice, there is a small difference, namely a normalising factor of 6 for each roll (so 6^n for the sum). The motivation for using generating functions is clear: we are considering the distribution of a sum of independent random variables. This is pretty much exactly why we bother to set up the machinery for PGFs.

Anyway, since each of {1,2,…,6} is equally likely, the GF of a single dice roll is

x+x^2+\ldots+x^6=x\cdot \frac{1-x^6}{1-x}.

So, if we want the generating function of the sum of n independent dice rolls, we can obtain this by raising the above function to the power n. We obtain

x^n(1-x^6)^n(1-x)^{-n}.

Note the factor of x^n at the beginning arises because the minimum value of the sum is n. So to work out the number of configurations giving rise to sum k, we need to evaluate the coefficient of x^k. We can deal with (1-x^6)^n fairly straightforwardly, but some thought is required regarding whether it’s possible to do a similar job on (1-x)^{-n}.

We have to engage briefly with what is meant by a binomial coefficient. Note that

\binom{x}{k}=\frac{x(x-1)\ldots(x-k+1)}{1\cdot\ldots\cdot k}

is a valid definition even when x is not a positive integer, as it is simply a degree k polynomial in x. This works if x is a general positive real, and indeed if x is a general negative real. At this stage, we do need to keep k a positive integer, but that’s not a problem for our applications.

So we need to engage with how the binomial theorem works for exponents that are not positive integers. The tricky part with the standard expression as

(a+b)^n=\binom{n}{0}a^n+\ldots + \binom{n}{n}b^n,

is that the attraction of this symmetry in a and b prompts us to work in more generality than is entirely necessary to state the result. Note if we instead write

(1+x)^n=1+\binom{n}{1}x+\binom{n}{2}x^2+\ldots,

we have unwittingly described this finite sum as an infinite series. It just happens that all the binomial coefficients apart from the first (n+1) are zero. The nice thing about this definition is that it might plausibly generalise to non-integer or negative values of n. And indeed it does. I don’t want to go into the details here, but it’s just a Taylor series really, and the binomial coefficients are set up with factorials in the right places to look like a Taylor series, so it all works out.

It is also worth remarking that it follows straight from the definition of a negative binomial coefficient, that

\binom{-n}{j}=(-1)^j \binom{n+j-1}{j}.

In any case, we can rewrite our expression for the generating function of the IID sum as

x^n\left[\sum_{k=0}^n \binom{n}{k}(-1)^k x^{6k}\right]\left[\sum_{j\ge 0} \binom{-n}{j}(-1)^j x^j\right]

By accounting for where we can gather exponents from each bracket, we can evaluate the coefficient of x^m as

\sum_{6k+j=m-n}\binom{n}{k}\binom{n+j-1}{j}(-1)^k.

Ie, k in the sum takes values in \{0,1,\ldots, \lfloor \frac{m-n}{6}\rfloor\}. At least in theory, this now gives us an explicit way to calculate the distribution of the average of multiple dice rolls. We have to be wary, however, that naive implementations with fixed-width integer or floating-point types will not be happy dealing with large binomial coefficients, as the factorials involved grow extremely rapidly. An approximation using logs is likely to be more tractable for larger settings.
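To make this concrete, here is a short Python sketch (the helper name is my own) implementing exactly this coefficient extraction, using exact integer arithmetic so the large binomial coefficients are not an issue:

from math import comb

def dice_sum_counts(n):
    # number of ways n dice can sum to m, via [x^m] x^n (1-x^6)^n (1-x)^(-n)
    counts = {}
    for m in range(n, 6 * n + 1):
        total = 0
        for k in range((m - n) // 6 + 1):
            j = m - n - 6 * k
            total += (-1) ** k * comb(n, k) * comb(n + j - 1, j)
        counts[m] = total
    return counts

counts = dice_sum_counts(10)
print(counts[35])                               # configurations of ten dice summing to 35
print(sum(counts.values()) == 6 ** 10)          # sanity check: all 6^10 outcomes accounted for

Dividing these counts by 6^n gives the distribution of the sum, and relabelling m as m/n gives the distribution of the sample mean.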

Anyway, I leave you with the fruits of my labours.

[Figure: dice30]

Random Mappings for Cycle Deletion

In previous posts here and here, I’ve talked about attempts to describe a cycle deleting process. We amend the dynamics of the standard random graph process by demanding that whenever a cycle is formed in the graph we delete all the edges that lie on the cycle. The aim of this is to prevent the system growing giant components, and perhaps give a system that displays the characteristics of self-organised criticality. In the posts linked to, we discuss the difficulties caused by the fact that the tree structure of components in such a process is not necessarily uniform.

Today we look in the opposite direction. It is a perfectly reasonable model to take a multiplicative coalescent with quadratic fragmentation (this corresponds to cycle deletion, since there are O(n^2) edges which would give a cycle if added to a tree on n vertices) and a fragmentation kernel corresponding to adding an extra edge to a uniform spanning tree on n vertices then deleting the edges of the unique cycle. In the rest of this post, we consider this fragmentation mechanism, in particular thinking about how we would sample from it most practically. Not least, without going through Prufer codes or some other clever machinery, it is not trivial to sample a uniform spanning tree.

First, we count the number of unicyclic graphs on n labelled vertices. If we know that the vertices on the cycle are v_1,\ldots,v_k, then the number of cycles with an identified edge is

u_1=1,\quad u_k=\frac{k!}{2},\, k\ge 2.

If we know that the tree coming off the cycle from vertex v_i has size m, say, then each of the possible rooted labelled trees with size m is equally likely. So taking w_j=j^{j-1}, the number of rooted trees on j labelled vertices, we get B_n(u_\bullet,w_\bullet) for the number of such unicyclic graphs on [n]. Recall B_n is the nth Bell polynomial, which gives the size of a compound combinatorial structure, where we have some structure on blocks and some other structure within blocks. Then the random partition of [n] given by the tree sizes has the distribution \text{Gibbs}_n(u_\bullet,w_\bullet).

Consider now a related object, the so-called random mapping digraph. What follows is taken from Chapter 9 of Combinatorial Stochastic Processes. We can view any mapping M_n:[n]\rightarrow[n] as a digraph where every vertex has out-degree 1. Each such digraph contains a collection of directed cycles, supported on those elements x for which M_n^k(x)=x for some k. Such an element x is called a cyclic point. Each cyclic point can be viewed as the root of a labelled tree.

In an identical manner to the unicyclic graph, the sizes of these directed trees in the digraph decomposition of a uniform random mapping are distributed as \text{Gibbs}_n(\bullet !,w_\bullet). So this is exactly the same as the cycle deletion kernel, apart from in the probability that the partition has precisely one block. In practice, for large n, the probability of this event is very small in both cases. And if we wanted to sample the cycle deletion kernel exactly, we could choose the trivial partition with some probability p, and otherwise sample from the random mapping kernel, where p is chosen such that

p+\frac{1-p}{B_n(\bullet !, w_\bullet)}=\frac{1}{B_n(u_\bullet,w_\bullet)}.

At least we know from the initial definition of a random mapping, that B_n(\bullet !,w_\bullet)=n^n. The number of unicyclic graphs with an identified edge is less clear. It turns out that the partition induced by the random mapping has a nice limit, after rescaling, as the lengths of excursions away from 0 in the standard Brownian bridge on [0,1].

The time for a fuller discussion of this sort of phenomenon is in the context of Poisson-Dirichlet distributions, as the above exchangeable partition turns out to be PD(1/2,1/2). However, for now we remark that the jumps of a subordinator give a partition after rescaling. The case of a stable subordinator is particularly convenient, as calculations are made easier by the Levy-Khintchine formula.

A notable example is the stable-1/2 subordinator, which can be realised as the inverse of the local time process at zero of a Brownian motion. The jumps of this process are then the excursion lengths of the original Brownian motion. A calculation involving the tail of the w_j’s indicates that 1/2 is the correct parameter for a subordinator to describe the random mappings. Note that the number of blocks in the partition corresponds to the local time at zero of the Brownian motion. (This is certainly not obvious, but it should at least be intuitively clear why a larger local time roughly indicates more excursions which indicates more blocks.)

So it turns out, after checking some of the technicalities, that it will suffice to show that the rescaled number of blocks in the random mapping partition \frac{|\Pi_n|}{\sqrt{n}} converges to the Rayleigh distribution, which is a size-biased Normal random variable (hence effectively first conditioned to be positive), and which is also the distribution of the local time at zero of the standard Brownian bridge.

After that very approximate description, we conclude by showing that the distribution of the number of blocks does indeed converge as we require. Recall Cayley’s formula kn^{n-k-1} for the number of labelled forests on [n] with a specified set of k roots. We also need to know how many labelled forests there are with any set of roots. Suppose we introduce an extra vertex, labelled 0, and connect it only to the roots of a rooted labelled forest on [n]. This gives a bijection between labelled trees on {0,1,…,n} and labelled forests with a specified set of roots on [n]. So we can use Cayley’s original formula to conclude there are (n+1)^{n-1} such forests. We can do a quick sanity check that these are the same, which is equivalent to showing

\sum_{k=1}^n k n^{-k-1}\binom{n}{k}=\frac{1}{n}(1+\frac{1}{n})^{n-1}.

This odd way of writing it is well-motivated. The form of the LHS is reminiscent of a generating function, and the additional k suggests taking a derivative. Indeed, the LHS is \frac{1}{n^2} times the derivative

\frac{d}{dx}(1+x)^n=n(1+x)^{n-1},

evaluated at x=\frac{1}{n}, which gives exactly the RHS.
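For what it’s worth, the forest-counting identity (in its integer form, before dividing by n^n) is also easy to check directly, say with exact rational arithmetic in Python:

from fractions import Fraction
from math import comb

# check sum_k k * n^(n-k-1) * C(n,k) = (n+1)^(n-1), the displayed identity multiplied through by n^n
for n in range(1, 9):
    lhs = sum(k * comb(n, k) * Fraction(n) ** (n - k - 1) for k in range(1, n + 1))
    assert lhs == (n + 1) ** (n - 1)
print("identity holds for n = 1, ..., 8")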

That said, having established that the random mapping partition is essentially the same, it is computationally more convenient to consider that instead. By the digraph analogy, we again need to count forests with k roots on n vertices, and multiply by the number of permutations of the roots. This gives:

\mathbb{P}(|\Pi_n|=k)=\frac{kn^{n-k-1}\cdot k! \binom{n}{k}}{n^n}=\frac{k}{n}\prod_{i=1}^{k-1}\left(1-\frac{i}{n}\right).

Now we can consider the limit. Being a bit casual with notation, we get:

\lim \mathbb{P}(\frac{|\Pi_n|}{\sqrt{n}}\in dl)\approx \sqrt{n}dl \mathbb{P}(|\Pi_n|=l\sqrt{n}).

Since the Rayleigh distribution has density l\exp(-\frac12 l^2)dl, it suffices for this informal verification to check that

\prod_{i=1}^{l\sqrt{n}}(1-\frac{i}{n})\approx \exp(-\frac12 l^2). (*)

We take logs, so the LHS becomes:

\log(1-\frac{1}{n})+\log(1-\frac{2}{n})+\ldots+\log(1-\frac{l\sqrt{n}}{n}).

If we view this as a function of l and differentiate, we get

d(LHS)=\sqrt{n}dl \log (1-\frac{l}{\sqrt{n}})\approx \sqrt{n}dl \left[-\frac{l}{\sqrt{n}}-\frac{l^2}{2n}\right]\approx -ldl.

When l is zero, the LHS should be zero, so we can obtain the desired result (*) by integrating then taking an exponential.
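The exact formula above also makes this easy to check numerically (an illustrative Python sketch, not part of the original argument): the rescaled probabilities \sqrt{n}\,\mathbb{P}(|\Pi_n|=\lceil l\sqrt{n}\rceil) should be close to the Rayleigh density l e^{-l^2/2} for moderately large n.

import numpy as np

def block_count_pmf(n):
    # P(|Pi_n| = k) = (k/n) * prod_{i=1}^{k-1} (1 - i/n), for k = 1, ..., n
    probs = np.empty(n)
    partial_product = 1.0
    for k in range(1, n + 1):
        probs[k - 1] = (k / n) * partial_product
        partial_product *= 1.0 - k / n
    return probs

n = 10000
probs = block_count_pmf(n)
print(probs.sum())                                                     # should be 1, up to rounding
for l in [0.5, 1.0, 1.5, 2.0]:
    k = int(round(l * np.sqrt(n)))
    print(l, np.sqrt(n) * probs[k - 1], l * np.exp(-l ** 2 / 2))       # rescaled pmf vs Rayleigh density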

Generating Functions for the IMO

The background to this post is that these days I find myself using generating functions all the time, especially for describing the stationary states of various coalescence-like processes. I remember meeting them vaguely while preparing for the IMO as a student. However, a full working understanding must have eluded me at the time, as for Q5 on IMO 2008 in Madrid I had written down in big boxes the two statements involving generating functions that immediately implied the answer, but failed to finish it off. The aim of this post is to help this year’s team avoid that particular pitfall.

What are they?

I’m going to define some things in a way which will be most relevant to the type of problems you are meeting now. Start with a sequence (a_0,a_1,a_2,\ldots). Typically these will be the sizes of various combinatorial sets. Eg a_n = number of partitions of [n] with some property. Define the generating function of the sequence to be:

f(x)=\sum_{k\geq 0}a_k x^k=a_0+a_1x+a_2x^2+\ldots.

If the sequence is finite, then this generating function is a polynomial. In general it is a power series. As you may know, some power series can be rather complicated, in terms of where they are defined. Eg

1+x+x^2+x^3+\ldots=\frac{1}{1-x},

only when |x|<1. For other values of x, the LHS diverges. Defining f over C is fine too. This sort of thing is generally NOT important for applications of generating functions to combinatorics. To borrow a phrase from Wilf, a generating function is a convenient `clothesline’ on which to hang a sequence of numbers.

We need a notation to get back from the generating function to the coefficients. Write [x^k]g(x) to denote the coefficient of x^k in the power series g(x). So, if g(x)=3x^3-5x^2+7, then [x^2]g(x)=-5. It should hopefully never be relevant unless you read some other notes on the topic, but there is also the notation [\alpha x^2]g(x):=\frac{[x^2]g(x)}{\alpha}, which does make sense after a while.

How might they be useful?

Example: binomial coefficients a_k=\binom{n}{k} appear, as the name suggests, as coefficients of

f_n(x)=(1+x)^n=\sum_{k=0}^n \binom{n}{k}x^k.

Immediate consequence: it’s trivial to work out \sum_{k=0}^n \binom{n}{k} and \sum_{k=0}^n(-1)^k \binom{n}{k} by substituting x=\pm 1 into f_n.

Less obvious consequence. By considering the ways to choose n balls from a collection of a red balls and b blue balls, one can verify

\binom{a+b}{n}=\sum_{k=0}^n \binom{a}{k}\binom{b}{n-k}.

We can rewrite the RHS as

\sum_{k+l=n}\binom{a}{k}\binom{b}{l}.

Think how we calculate the coefficient of x^n in the product f(x)g(x), and it is now clear that \binom{a+b}{n}=[x^n](1+x)^{a+b}, while

\sum_{k+l=n}\binom{a}{k}\binom{b}{l}=[x^n](1+x)^a(1+x)^b,

so the result again follows. This provides a good slogan for generating functions: they often replicate arguments via bijections, even if you can’t find the bijection.
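The slogan can be illustrated very literally in Python (with a hypothetical choice of a, b, n): multiply the coefficient sequences of (1+x)^a and (1+x)^b, which is exactly the convolution, and read off the coefficient of x^n.

import numpy as np
from math import comb

a, b, n = 7, 5, 6
f_a = np.array([comb(a, k) for k in range(a + 1)])    # coefficients of (1+x)^a
f_b = np.array([comb(b, k) for k in range(b + 1)])    # coefficients of (1+x)^b
product = np.convolve(f_a, f_b)                       # coefficient sequence of (1+x)^(a+b)
print(product[n], comb(a + b, n))                     # both sides of Vandermonde agree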

Useful for? – Multinomial sums

The reason why the previous argument for binomial coefficients worked nicely is because we were interested in the coefficients, but had a neat expression for the generating function as a polynomial. In particular, we had an expression

\sum_{k+l=n}a_k b_l.

This is always a clue that generating functions might be useful. This is sometimes called a convolution.

Exercise: prove that in general, if f(x) is the generating function of (a_k) and g(x) the generating function of (b_l), then f(x)g(x) is the generating function of \sum_{k+l=n}a_kb_l.

Even more usefully, this works in the multinomial case:

\sum_{k_1+\ldots+k_m=n}a^{(1)}_{k_1}\ldots a^{(m)}_{k_m}.

In many applications, these a^{(i)}s will all be the same. If a_0=0, so that f(0)=0, we don’t even have to specify how many k_i’s there are to be considered. After all, if we want the sum to be n with every part positive, then at most n of them can appear. So, summing over all possible m:

\sum_{m\geq 1}\sum_{k_1+\ldots+k_m=n}a_{k_1}\ldots a_{k_m}=[x^n]\left(f(x)+f(x)^2+\ldots\right)=[x^n]\frac{f(x)}{1-f(x)}.

Useful when? – You recognise the generating function!

In some cases, you can identify the generating function as a `standard’ function, eg the geometric series. In that case, manipulating the generating functions is likely to be promising. Here is a list of some useful power series you might spot.

1+x+x^2+\ldots=\frac{1}{1-x},\quad |x|<1

1+2x+3x^2+\ldots=\frac{1}{(1-x)^2},\quad |x|<1

e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\ldots

\cos x=1-\frac{x^2}{2!}+\frac{x^4}{4!}\pm\ldots

Exercise: if you know what differentiation means, show that if f(x) is the gen fn of (a_k), then xf'(x) is the gen fn of ka_k.

Technicalities: some of these identities are defined only for certain values of x. This may be a problem if they are defined at, say, only a single point, but in general this shouldn’t be the case. In addition, you don’t need to worry about differentiability. You can define differentiation of power series term by term, via x^n\mapsto nx^{n-1}, and sort out convergence later if necessary.

Useful for? – Recurrent definitions

The Fibonacci numbers are defined by:

F_0=F_1=1,\quad F_{n+1}=F_n+F_{n-1},\quad n\geq 1.

Let F(x) be the generating function of the sequence F_n. So, for n\geq 2,

[x^n]F(x)=[x^{n-1}]F(x)+[x^{n-2}]F(x)=[x^n](xF(x)+x^2F(x)),

and checking the coefficients of x^0 and x^1 directly (F_0=F_1=1), we can conclude that:

F(x)=1+(x+x^2)F(x)\quad\Rightarrow\quad F(x)=\frac{1}{1-x-x^2}.
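You can sanity-check this sort of manipulation by asking a computer algebra system to expand the closed form again (a quick sympy sketch, not needed for the argument):

import sympy as sp

x = sp.symbols('x')
F = 1 / (1 - x - x**2)
print(sp.series(F, x, 0, 10))   # 1 + x + 2x^2 + 3x^3 + 5x^4 + ... : the Fibonacci numbers reappear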

Exercise: Find a closed form for the generating function of the Catalan numbers, defined recursively by:

C_0=1,\quad C_n=C_0C_{n-1}+C_1C_{n-2}+\ldots+C_{n-1}C_0,\quad n\geq 1.

Can you now find the coefficients explicitly for this generating function?

Useful for? – Partitions

Partitions can be an absolute nightmare to work with because of the lack of explicit formulae. Often any attempt at a calculation turns into a massive inclusion-exclusion bash. This prompts a search for bijective or bare-hands arguments, but generating functions can be useful too.

For now (*), let’s assume a partition of [n] means a sequence of positive integers a_1\geq a_2\geq\ldots\geq a_k such that a_1+\ldots+a_k=n. Let p(n) be the number of partitions of [n].

(* there are other definitions, in terms of a partition of the set [n] into k disjoint but unlabelled sets. Be careful about definitions, but the methods often extend to whatever framework is required. *)

Exercise: Show that the generating function of p(n) is:

\left(\frac{1}{1-x}\right)\left(\frac{1}{1-x^2}\right)\left(\frac{1}{1-x^3}\right)\ldots

Note that if we are interested only in partitions of [n], then we don’t need to consider any terms with exponent greater than n, so if we wanted we could take a finite product instead.

Example: the mint group will remember this problem from the first session in Cambridge:

Show that the number of partitions of [n] with distinct parts is equal to the number of partitions of [n] with odd parts.

Rather than the fiddly bijection argument found in the session, we can now treat this as a simple calculation. The generating function for distinct parts is given by:

(1+x)(1+x^2)(1+x^3)\ldots,

while the generating function for odd parts is given by:

\left(\frac{1}{1-x}\right)\left(\frac{1}{1-x^3}\right)\left(\frac{1}{1-x^5}\right)\ldots.

Writing the former as

\left(\frac{1-x^2}{1-x}\right)\left(\frac{1-x^4}{1-x^2}\right)\left(\frac{1-x^6}{1-x^3}\right)\ldots

shows that these are equal: each numerator (1-x^{2k}) cancels with the corresponding even-indexed factor (1-x^{2k}) in the denominator, leaving precisely the odd-parts product, and the result follows.
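If you do not trust the telescoping, truncated power series multiplication settles it numerically. A short Python sketch (the truncation degree N is an arbitrary choice of mine): build both products coefficient by coefficient up to x^N and compare.

N = 30   # compare coefficients up to x^N

def mul(p, q):
    # multiply two truncated coefficient lists, discarding terms beyond x^N
    r = [0] * (N + 1)
    for i in range(N + 1):
        for j in range(N + 1 - i):
            r[i + j] += p[i] * q[j]
    return r

one = [1] + [0] * N
distinct, odd = one[:], one[:]
for k in range(1, N + 1):
    distinct = mul(distinct, [1 if j in (0, k) else 0 for j in range(N + 1)])   # factor (1 + x^k)
    if k % 2 == 1:
        odd = mul(odd, [1 if j % k == 0 else 0 for j in range(N + 1)])          # factor 1/(1-x^k), truncated
print(distinct == odd)   # True: the two coefficient sequences agree up to x^N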

Other things – Multivariate Generating Functions

If you want to track a sequence in two variables, say a_{m,n}, then you can encode this with the bivariate generating function

f(x,y):=\sum_{m,n\geq 0}a_{m,n}x^my^n.

The coefficients are then extracted by [x^ay^b] and so on. There’s some interesting stuff on counting lattice paths with this method.

Sums over arithmetic progressions via roots of unity

Note that we can extract both \sum a_n and \sum (-1)^na_n by judicious choice of x in f(x). By taking half the sum or half the difference, we can obtain

a_0+a_2+a_4+\ldots=\frac12(f(1)+f(-1)),\quad a_1+a_3+a_5+\ldots=\frac12(f(1)-f(-1)).

Can we do this in general? Yes actually. If you want a_0+a_k+a_{2k}+\ldots, this is given by:

a_0+a_k+a_{2k}+\ldots=\frac{1}{k}\left(f(1)+f(w)+\ldots+f(w^{k-1})\right),

where w=e^{2\pi i/k} is a primitive kth root of unity. Exercise: Prove this.

For greater clarity, first try the case k=4, and consider the imaginary parts of the power series evaluated at +i and -i.
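A quick numerical check of the filter, using f(x)=(1+x)^{10} and k=3 (an arbitrary illustrative choice, in Python):

import cmath
from math import comb

n, k = 10, 3
f = lambda z: (1 + z) ** n
w = cmath.exp(2j * cmath.pi / k)                         # primitive kth root of unity
filtered = sum(f(w ** r) for r in range(k)) / k          # (1/k)(f(1) + f(w) + ... + f(w^{k-1}))
direct = sum(comb(n, m) for m in range(0, n + 1, k))     # C(10,0) + C(10,3) + C(10,6) + C(10,9)
print(round(filtered.real), direct)                      # the two agree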

Bell Polynomials

Trees with a single cycle

When counting combinatorial objects, it is often the case that we have two types of structure present at different levels. The aim of this post is to introduce the Bell polynomials, which provide the most natural notation for describing this sort of situation, and to mention some of the results that become easier to derive in this framework. This post is based on material and exercises from Chapter 1 of Jim Pitman’s book Combinatorial Stochastic Processes, which is great, and also available online here.

The structures that Bell polynomials enumerate are called composite structures in this account. Rather than give a definition right away, I shall give an example. The objects I have been thinking about in the past few weeks are graphs on n vertices containing precisely one cycle. Some of the background for this has been explained in recent posts.

In a recent post on Prufer codes, I gave the classical argument showing that the number of trees on n vertices is n^{n-2}. We might consider a unicyclic graph to be a tree with an extra edge. But if we consider the number of ways to add a further edge to a tree, we get

n^{n-2}\left[\binom{n}{2}-(n-1)\right]=n^{n-2}\binom{n-1}{2}.

Obviously, we have overcounted. If the single cycle in a graph has length k, then the graph has been counted exactly k times in this enumeration. But it is not obvious how many graphs have a single cycle of length k.

Instead, we stop worrying about exactly how many of these there are, as there might not be a simple expression anyway. As soon as we start using them in any actual argument, it will be useful to know various properties about the graphs, but probably not exactly how many there are.

Let’s focus on this single cycle of length k say. If we remove the edges of the cycle, we are left with a collection of trees. Why? Well if there was a cycle in the remaining graph, then the original graph would have had at least two cycles. So we have a collection of trees, unsurprisingly called a forest. Remembering that some of the trees may in fact be a single vertex (on the cycle), it is clear that there is a bijection between these trees and the vertices of the cycle in the obvious way. We can think of the graph as a k-cycle, dressed with trees.

Alternatively, once we have specified its size, we can forget about the k-cycle altogether. The graph is precisely defined by a forest of k trees on n vertices, with a specified root in each tree indicating which vertex lies on the cycle, and a permutation specifying the cyclic ordering of the trees. We can write this as

N_{n,k}=(k-1)!\sum_{(A_1,\ldots,A_k)\in\mathcal{P}^k(n)}a_1^{a_1-1}\cdot\ldots\cdot a_k^{a_k-1},\quad \text{for }a_i=|A_i|,

where \mathcal{P}^k(n) is the set of partitions of [n] with k blocks. Remember that the blocks in a partition are necessarily unordered. This makes sense in this setting as the cyclic permutation chosen from the (k-1)! possibilities specifies the order on the cycle.

Bell Polynomials

The key point about this description is that there are two types of combinatorial structure present. We have the rooted trees, and also a cyclic ordering of the rooted trees. Bell polynomials generalise this idea. It is helpful to be less specific and think of partitions of [n] into blocks. There are w_j arrangements of any block of size j, and there are v_k ways to arrange the blocks, if there are k of them. Note that we assume v_k is independent of the arrangements within the collection of blocks. So in the previous example, w_j=j^{j-1}, and v_k=(k-1)!. Pitman denotes these sequences by v_\bullet,w_\bullet. Then the (n,k)th partial Bell polynomial, B_{n,k}(w_\bullet), gives the number of divisions into k blocks:

B_{n,k}(w_\bullet):=\sum_{(A_1,\ldots,A_k)\in\mathcal{P}^k(n)}\prod_{i=1}^k w_{a_i}.

The total number of arrangements is given by the Bell polynomial

B_n(v_\bullet,w_\bullet):=\sum_{k=1}^n v_k B_{n,k}(w_\bullet).

Here are some other examples of Bell polynomials. The Stirling numbers of the first kind c_{n,k} give the number of permutations of [n] with k cycles. Since we don’t want to impose any combinatorial structure on the set of cycles, we don’t need to consider v_\bullet, and the number of ways to make a j-cycle from a j-block is w_j=(j-1)!, so c_{n,k}:=B_{n,k}((\bullet-1)!). Similarly, the Stirling numbers of the second kind S_{n,k} give the number of partitions of [n] into k blocks. Almost by definition, S_{n,k}:=B_{n,k}(1^\bullet), where 1^\bullet is defined to be the sequence containing all 1s.

Applications

So far, this is just a definition that gives an abbreviated description for the sizes of several interesting sets of discrete objects. Having clean notation is always important, but there are further advantages of using Bell polynomials. I don’t want to reproduce the entirety of the chapter I’ve read, so my aim for this final section is to give a very vague outline of why this is a useful formulation.

Bell polynomials can be treated rather nicely via generating functions. The key to this is to take a sum not over partitions, but rather over ordered partitions, which are exactly the same, except now we also care about the order of the blocks. This has the advantage that there is a correspondence between ordered partitions with k blocks and compositions with k terms. If the composition is n_1+\ldots+n_k=n, it is clear why there are \binom{n}{n_1,\ldots,n_k} ordered partitions encoding this structure. This multinomial coefficient can be written as n! divided by the product of the factorials n_i!, and so, summing over compositions, we can write:

B_{n,k}(w_\bullet)=\frac{n!}{k!}\sum_{n_1+\ldots+n_k=n}\prod_{i=1}^k \frac{w_{n_i}}{n_i!}.

This motivates considering the exponential generating function given by

w(\xi)=\sum_{j=1}^\infty w_j\frac{\xi^j}{j!},

as this leads to the neat expressions (with v(\xi) the corresponding exponential generating function of v_\bullet):

B_{n,k}(w_\bullet)=n![\xi^n]\frac{w(\xi)^k}{k!},\quad B_n(v_\bullet,w_\bullet)=n![\xi^n]v(w(\xi)).
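These formulae, and the identity B_n(\bullet !,w_\bullet)=n^n for w_j=j^{j-1} mentioned earlier, are easy to check directly for small n. A Python sketch using the composition form of B_{n,k} (function names are my own):

from fractions import Fraction
from math import factorial

def compositions(n, k):
    # all ordered k-tuples of positive integers summing to n
    if k == 1:
        yield (n,)
        return
    for first in range(1, n - k + 2):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def partial_bell(n, k, w):
    # B_{n,k}(w) = (n!/k!) * sum over compositions (n_1,...,n_k) of prod_i w(n_i)/n_i!
    total = Fraction(0)
    for comp in compositions(n, k):
        term = Fraction(1)
        for ni in comp:
            term *= Fraction(w(ni), factorial(ni))
        total += term
    return Fraction(factorial(n), factorial(k)) * total

w_rooted_trees = lambda j: j ** (j - 1)     # rooted labelled trees on j vertices
for n in range(1, 8):
    bell = sum(factorial(k) * partial_bell(n, k, w_rooted_trees) for k in range(1, n + 1))
    assert bell == n ** n                   # B_n(bullet!, w) = n^n, the number of mappings [n] -> [n]
print("checked B_n = n^n for n = 1, ..., 7")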

The Bell polynomial B_n(v_\bullet,w_\bullet) counts the number of partitions of [n] subject to some extra structure. If we choose uniformly from this set, we get a distribution on this combinatorial object, for which the Bell polynomial provides the normalising constant. If we then ignore the extra structure, the sequences v_\bullet,w_\bullet induce a probability distribution on the set of partitions of n. This distribution is known as a Gibbs partition. It is interesting to consider when and whether it is possible to define a splitting mechanism such that the Gibbs partitions can be coupled to form a fragmentation process. This is the opposite of a coalescence process. Here, we have a sequence of masses, and at each integer time we have rules to determine which mass to pick, and a rule for how to break it into two pieces. It is certainly not the case that for an arbitrary splitting rule and sequences v_\bullet,w_\bullet, the one-step fragmentation of the Gibbs partition on n gives the corresponding Gibbs partition on (n-1).

CLT for random permutations

For the final demonstration of the use of Bell polynomials, I am going to sketch the outline of a solution to exercise 1.5.4. which shows that the number of cycles in a uniformly chosen permutation has a CLT. This is not at all obvious, since the number of permutations of [n] with k cycles is given by B_{n,k}((\bullet-1)!) and there is certainly no simple form for this, so the possibility of doing a technical limiting argument seems slim.

For ease of notation, we copy Pitman and write c_{n,k}:=B_{n,k}((\bullet-1)!) as before. First we show exercise 1.2.3. which asserts that

x(x+1)\ldots(x+(n-1))=\sum_{k=1}^n c_{n,k}x^k.

We argue combinatorially. For a positive integer x, the RHS is the number of ways to choose \sigma\in S_n and a colouring of [n] with x colours such that the orbits of \sigma are monochromatic. We prove that the LHS also has this property by induction on the number of vertices. We claim there is a 1-to-(x+n) map from configurations on n vertices to configurations on (n+1) vertices. Given \sigma\in S_n and a colouring, for any a\in[n], we construct \sigma_a\in S_{n+1} by \sigma_a(a)=n+1, \sigma_a(n+1)=\sigma(a), and \sigma_a(y)=\sigma(y) for all other y. We give n+1 the same colour as a. This gives us n possibilities. Alternatively, we can map (n+1) to itself and give it any colour we want. This gives us x possibilities. A slightly more careful argument shows that this is indeed a 1-to-(x+n) map, which is exactly what we require.

So the polynomial

A_n(z)=\sum_{k=0}^nc_{n,k}z^k,

has n real zeros, which allows us to write

\frac{c_{n,k}}{A_n(1)}=\mathbb{P}(X_1+\ldots+X_n=k),

where the Xs are independent but not identically distributed Bernoulli trials. The number of cycles is then given by this sum, and so it becomes a simple matter to verify the CLT by checking that the variance grows appropriately. As both mean and variance are asymptotically log n, we can conclude that:

\frac{K_n - \log n}{\sqrt{\log n}}\stackrel{d}{\rightarrow} N(0,1).
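Since everything here happens at scale log n, convergence is slow, but it is easy to watch it happening in simulation (an illustrative Python sketch, not part of Pitman’s exercise):

import numpy as np

rng = np.random.default_rng(1)

def cycle_count(perm):
    # number of cycles of a permutation, given as an array of images
    seen = np.zeros(len(perm), dtype=bool)
    cycles = 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return cycles

n, trials = 2000, 2000
counts = np.array([cycle_count(rng.permutation(n)) for _ in range(trials)])
standardised = (counts - np.log(n)) / np.sqrt(np.log(n))
print(standardised.mean(), standardised.var())   # slowly approaching 0 and 1 as n grows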

In a future post, I want to give a quick outline of section 1.3. which details how the Bell polynomials can be surprisingly useful to find the moments of infinitely divisible distributions.