Large Deviations 5 – Stochastic Processes and Mogulskii’s Theorem

Motivation

In the previous posts about Large Deviations, most of the emphasis has been on the theory. To summarise briefly: for a family of measures supported on the same metric space, increasingly concentrated as some index grows, we expect the probability of seeing values in a set not containing the limit in distribution to decay exponentially in that index. The canonical example is the sample mean of a family of IID random variables, as treated by Cramer's theorem.

It becomes apparent that specifying the exponent for a given large deviation event requires more than just taking the infimum of a rate function over the event, so we define an LDP topologically, with different bounds for open and closed sets. Now we want to find some LDPs for more complicated measures, which will have genuinely non-trivial applications. The key idea in all of this is that the infimum in the definition of an LDP doesn't just specify the exponent: the minimisers may also tell us which configurations or events actually produce the large deviation.

The slogan for the LDP as in Frank den Hollander’s excellent book is: “A large deviation event will happen in the least unlikely of all the unlikely ways.” This will be useful when our underlying space is a bit more complicated.

Setup

As a starting point, consider the set-up for Cramer’s theorem, with IID X_1,\ldots,X_n. But instead of investigating LD behaviour for the sample mean, we investigate LD behaviour for the whole set of RVs. There is a bijection between sequences and the partial sums process, so we investigate the partial sums process, rescaled appropriately. For the moment this is a sequence not a function or path (continuous or otherwise), but in the limit it will be, and furthermore it won’t make too much difference whether we interpolate linearly or step-wise.

Concretely, we consider the rescaled random walk:

Z_n(t):=\tfrac{1}{n}\sum_{i=1}^{[nt]}X_i,\quad t\in[0,1],

with laws \mu_n supported on L_\infty([0,1]). Note that the expected behaviour is a straight line from (0,0) to (1,\mathbb{E}X_1). In fact we can say more than that. By Donsker’s theorem we have a functional version of a central limit theorem, which says that deviations from this expected behaviour are given by suitably scaled Brownian motion:

\sqrt{n}\left(\frac{Z_n(t)-t\mathbb{E}X}{\sqrt{\text{Var}(X_1)}}\right)\quad\stackrel{d}{\rightarrow}\quad B(t),\quad t\in[0,1].

This is what we expect ‘standard’ behaviour to look like:

[Figure: a typical realisation of Z_n, fluctuating around the straight line t\mathbb{E}X_1.]

The deviations from a straight line are on a scale of \sqrt{n}. Here are two examples of potential large deviation behaviour:

[Figure: a path which climbs steeply in the first half before flattening out.]

Or this:

[Figure: a path whose overall deviation is driven by a single macroscopic jump.]

Note that these are qualitatively different. In the first case, the first half of the random variables are in general much larger than the second half, which appear to have empirical mean roughly 0. In the second case, a large deviation in overall mean is driven by a single very large value. It is obviously of interest to find out what the probabilities of each of these possibilities are.
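To get a quantitative feel for the 'standard' picture first, here is a minimal simulation sketch (numpy and the choice of exponential increments are mine, purely for illustration). It measures the sup-norm distance of Z_n from the straight line, rescaled by \sqrt{n}; if the Donsker picture is right, this quantity should remain of order 1 as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# For several n, simulate Z_n and measure the sup-distance from the line t * E[X_1].
# On the Donsker scaling, sqrt(n) * sup_t |Z_n(t) - t * E[X_1]| should stay of order 1.
for n in [10**3, 10**4, 10**5]:
    X = rng.exponential(scale=1.0, size=n)  # E[X_1] = 1; any law with finite variance works here
    Z = np.cumsum(X) / n                    # Z_n(i/n) for i = 1, ..., n
    t = np.arange(1, n + 1) / n
    print(n, np.sqrt(n) * np.max(np.abs(Z - t)))
```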

We can do this via an LDP for (\mu_n). Now it is really useful to be working in a topological context with open and closed sets. It will turn out that the rate function is finite only on absolutely continuous functions, whereas obviously for finite n, none of the (step-interpolated) sample paths are continuous!

We assume that \Lambda(\lambda) is the logarithmic moment generating function of X_1 as before, with \Lambda^*(x) the Fenchel-Legendre transform. Then the key result is:

Theorem (Mogulskii): The measures (\mu_n) satisfy an LDP on L_\infty([0,1]) with good rate function:

I(\phi)=\begin{cases}\int_0^1 \Lambda^*(\phi'(t))dt,&\quad \text{if }\phi\in\mathcal{AC}, \phi(0)=0,\\ \infty&\quad\text{otherwise,}\end{cases}

where \mathcal{AC} is the space of absolutely continuous functions on [0,1]. Note that \mathcal{AC} is dense in L_\infty([0,1]), so any open set contains a \phi for which I(\phi) is at least in principle finite. (Obviously, if \Lambda^* is not finite everywhere, then extra restrictions on \phi' are required.)

The following picture may be helpful in providing some motivation:

[Figure: a sample path, with the gradient over a short interval picking out the local empirical mean of the underlying random variables.]

So what is going on is that if we take a path and zoom in on some small interval, behaviour on this interval is independent of behaviour everywhere else. The gradient over the interval is the local empirical mean of the random variables in that window of time. The exponential cost of this differing from the true mean is given by Cramer's rate function applied to the gradient, so we obtain the rate function for the whole path by integrating.

More concretely, but still very informally, suppose \phi'(t)\neq \mathbb{E}X_1 for some t. This says that:

Z_n(t+\delta t)-Z_n(t)=\phi'(t)\delta t+o(\delta t),

so the probability of tracking \phi across this increment is

\mu_n\Big(\frac{1}{n}\sum_{i=nt+1}^{n(t+\delta t)}X_i=\phi'(t)\delta t+o(\delta t)\Big)=\mu_n\Big(\frac{1}{n\delta t}\sum_{i=1}^{n\delta t}X_i=\phi'(t)+o(1)\Big)\sim e^{-n\delta t\Lambda^*(\phi'(t))},

by Cramer. Now we can use independence:

\mu_n(Z_n\approx \phi)=\prod_{\delta t}e^{-n\delta t \Lambda^*(\phi'(t))}=e^{-\sum_{\delta t}n\delta t \Lambda^*(\phi'(t))}\approx e^{-n\int_0^1 \Lambda^*(\phi'(t))dt},

which is exactly the rate function given by Mogulskii's theorem.

Remarks

1) The absolutely continuous requirement is useful: it means we do not have to examine carefully the tail of the underlying distribution to decide whether it is possible, on an exponential scale, for o(n) consecutive RVs to have sum of order n, which is what a jump in \phi would demand.

2) \Lambda^*(x) is always convex, which has applications as well as playing a useful role in the proof. Recalling den Hollander's mantra, we are interested in where the infimum over a large deviation set is attained. For the event that the empirical mean exceeds some threshold larger than the expectation, Cramer's theorem told us that this is exponentially the same as the event that the empirical mean is roughly equal to the threshold. Now Mogulskii's theorem says more. By convexity, the integral functional defining the rate function is minimised by straight lines: Jensen's inequality gives \int_0^1 \Lambda^*(\phi'(t))dt\geq \Lambda^*\left(\int_0^1\phi'(t)dt\right)=\Lambda^*(\phi(1)). So we learn that the contributions to the large deviation are spread roughly equally through the sample. Note that this is NOT saying that all the random variables will have the same higher-than-expected value. The LDP takes no account of fluctuations in the path on a scale smaller than n. It does however rule out both of the situations pictured a long way up the page. We should expect to see roughly a straight line, with unexpectedly steep gradient.
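To see the convexity remark in a concrete case, here is a small numerical sketch. I assume standard Gaussian increments, so that \Lambda^*(x)=x^2/2 has a closed form (this choice, and the function names, are mine). It compares the rate of the straight line reaching a at time 1 with a path making the same overall deviation in the first half of the time interval only, as in the first picture above.

```python
import numpy as np

rate = lambda x: x**2 / 2  # Lambda^*(x) for standard Gaussian increments

def path_cost(phi_prime, grid):
    # Approximate I(phi) = int_0^1 Lambda^*(phi'(t)) dt by the trapezium rule
    return np.trapz(rate(phi_prime(grid)), grid)

a = 1.0                                           # target: phi(1) = a
grid = np.linspace(0, 1, 10001)
straight = lambda t: np.full_like(t, a)           # phi(t) = a t, constant gradient a
lopsided = lambda t: np.where(t < 0.5, 2 * a, 0)  # gradient 2a on [0, 1/2], then flat

print(path_cost(straight, grid))  # approx a^2/2 = 0.5
print(path_cost(lopsided, grid))  # approx a^2   = 1.0, strictly more expensive
```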

3) The proof as given in Dembo and Zeitouni is quite involved. There are a few stages, the first and simplest of which is to show that it doesn’t matter on an exponential scale whether we interpolate linearly or step-wise. Later in the proof we will switch back and forth at will. The next step is to show the LDP for the finite-dimensional problem given by evaluating the path at finitely many points in [0,1]. A careful argument via the Dawson-Gartner theorem allows lifting of the finite-dimensional projections back to the space of general functions with the topology of pointwise convergence. It remains to prove that the rate function is indeed the supremum of the rate functions achieved on projections. Convexity of \Lambda^*(x) is very useful here for the upper bound, and this is where it comes through that the rate function is infinite when the comparison path is not absolutely continuous. To lift to the finer topology of L_\infty([0,1]) requires only a check of exponential tightness in the finer space, which follows from Arzela-Ascoli after some work.

In conclusion, it is fairly tricky to prove even this most straightforward case, so unsurprisingly it is hard to extend to the natural setting where the distributions of the underlying RVs (X_i) change continuously in time, as we will want for the analysis of more combinatorial objects. Next time I will consider why it is hard, but potentially interesting, to adapt these techniques to an LDP for the size of the largest component in a sparse random graph near criticality.

Increments of Random Partitions

The following is problem 2.1.4. from Combinatorial Stochastic Processes:

Let X_i be the indicator of the event that i is the least element of some block of an exchangeable random partition \Pi_n of [n]. Show that the joint law of (X_i,1\leq i\leq n) determines the law of \Pi_n.

As Pitman says, this is a result by Serban Nacu, the paper for which can be found here. In this post I’m going to explain what an exchangeable random partition is, how to prove the result, and a couple of consequences.

The starting point is the question ‘what is an exchangeable random partition?’ The most confusing aspect is that there are multiple definitions depending on whether the blocks of the partition are sets or just integers corresponding to a size. Eg, \{1,2,4\}\cup\{3\} is a partition of [4], corresponding to the partition 3+1 of 4. Obviously one induces the other, and in an exchangeable setting the laws of one may determine the laws of the other.

In the second case, we assume 3+1 is the same partition as 1+3. If order does matter then we call it a composition instead. This gets a bit annoying for set partitions, as we don’t want these to be ordered either. But if we want actually to talk about the sets in question we have to give them labels, which becomes an ordering, so we need some canonical way to assign these labels. Typically we will say \Pi_n=\{A_1,\ldots,A_k\}, where the curly brackets indicate that we don’t care about order, and we choose the labels by order of appearance, so by increasing order of least elements.

We say that a random partition \Pi_n of [n] is exchangeable if its distribution is invariant under the action on partitions induced by the symmetric group. That is, relabelling doesn't change probabilities. We can express this functionally by saying

\mathbb{P}(\Pi_n=\{A_1,\ldots,A_k\})=p(|A_1|,\ldots,|A_k|),

for p a symmetric function. This function is then called the exchangeable partition probability function (EPPF) by Pitman.

Consider a partition of 4 into sets of sizes 3 and 1. There is a danger that this definition looks like it might be saying that the probability that A_1 is the set of size 3 is the same as the probability that A_1 is the set of size 1. This would be a problem because we expect to see some size-biasing to the labelling. Larger sets are more likely to contain small elements, merely because they contain more elements. Fortunately the definition is not broken after all. The statement above makes no reference to the probabilities of seeing various sizes for A_1 etc. For that, we would have to sum over all partitions with that property. It merely says that the partitions:

\{1,2,3\}\cup\{4\},\quad \{1,2,4\}\cup\{3\},\quad\{1,3,4\}\cup\{2\},\quad \{2,3,4\}\cup\{1\}

have respective probabilities:

p(3,1),\quad p(3,1),\quad p(3,1),\quad p(1,3),

and furthermore these are equal.

Anyway, now let's turn to the problem. The key idea is that we want to be looking at strings of 0s and 1s that can only arise in one way. For example, the string 10…01, with the final 1 in position n, can only arise from the partition \{1,2,\ldots,n-1\}\cup\{n\}, since a 1 in position n forces \{n\} to be a singleton block. So now we know p(n-1,1), and so also p(1,n-1). Furthermore, note that 10…0 and 11…1 give the probabilities of 1 block of size n and n blocks of size 1 respectively at once.

So then the string 10…010 can only arise from the partitions \{1,2,\ldots,n-2,n\}\cup\{n-1\} or \{1,2,\ldots,n-2\}\cup\{n-1,n\}. The former contributes the already-known value p(n-1,1), so the remaining probability is given by p(2,n-2). Keep going. It is clear what 'keep going' means in the case of p(a,b), but for partitions with more than two blocks it seems a bit more complicated.

Let's fix k, the number of blocks in the partitions under consideration, and start talking about compositions, that is, solutions to a_1+\ldots+a_k=n. The problem we might face in trying to generalise the previous argument is that potentially lots of compositions might generate the same sequence of 0s and 1s, so the 'first time' we consider a string might correspond to more than one composition. Trying it out in the case k=3 makes it clear that this is not going to happen, but we need some partial ordering structure to explain why this is the case.

Recall that a composition with k blocks is a sequence a=(a_1,\ldots,a_k) which sums to n. Let’s say a majorizes b if all its partial sums are at least as large. That is a_1+\ldots+a_l\geq b_1+\ldots+b_l for all 1\leq l \leq k. We say this is strict if at least one of the inequalities is strict. It is not hard to see that if a majorizes b then this is strict unless a = b.

Since we don’t care about ordering, we assume for now that all compositions are arranged in non-increasing order. So we find a partition corresponding to some such composition a_1,\ldots,a_k. The partition is:

\{1,\ldots,a_1\}\cup\{a_1+1,\ldots,a_1+a_2\}\cup\{a_1+a_2+1,\ldots,a_1+a_2+a_3\}\cup\ldots\cup\{n-a_k+1,\ldots,n\}.

This generates a sequence of 0s and 1s as described above, with a_i-1 0s between the i'th 1 and the (i+1)'th 1. The claim is that any composition which admits a partition with this same corresponding sequence must majorize a; this can be proved by induction on l. So in fact we can prove Nacu's result inductively down the partial ordering described. We know the probability of the sequence of 0s and 1s corresponding to the partition of [n] described, by assumption. We know the probability of any partition corresponding to a composition which strictly majorizes a by induction, and we know how many partitions with this sequence each such composition generates. Combining all of this, we can find the probability corresponding to a.
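The result can also be sanity-checked by brute force for small n. The sketch below (function names are mine; numpy assumed) enumerates all set partitions of [n], records each partition's increment string and block-size multiset, and verifies that the linear map taking the EPPF values p(\lambda) to the law of the increments has full column rank, which gives the injectivity that Nacu's theorem asserts.

```python
import numpy as np

def set_partitions(n):
    # Enumerate partitions of {1, ..., n}; each is a list of blocks
    def rec(i, blocks):
        if i > n:
            yield [list(b) for b in blocks]
            return
        for b in blocks:              # add i to an existing block
            b.append(i)
            yield from rec(i + 1, blocks)
            b.pop()
        blocks.append([i])            # or let i start a new block
        yield from rec(i + 1, blocks)
        blocks.pop()
    yield from rec(1, [])

n = 5
strings, shapes = [], []
for pi in set_partitions(n):
    mins = {min(b) for b in pi}
    strings.append(tuple(int(i in mins) for i in range(1, n + 1)))  # (X_1, ..., X_n)
    shapes.append(tuple(sorted(len(b) for b in pi)))                # block size multiset

rows, cols = sorted(set(strings)), sorted(set(shapes))
M = np.zeros((len(rows), len(cols)))
for s, lam in zip(strings, shapes):
    M[rows.index(s), cols.index(lam)] += 1

# P(string s) = sum_lambda M[s, lambda] * p(lambda); full column rank means
# distinct EPPFs give distinct laws for the increment sequence.
print(np.linalg.matrix_rank(M), len(cols))  # both 7 for n = 5
```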

Actually I’m not going to say much about consequences of this except to paraphrase very briefly what Nacu says in the paper. One of the neat consequences of this result is that it allows us to prove in a fairly straightforward way that the only infinite family of exchangeable random partitions with independent increments is the so-called Chinese Restaurant process.

Instead of attempting to prove this, I will explain what all the bits mean. First, the Chinese Restaurant process is the main topic of the next chapter of the book, so I won't say any more about it right now, except that its definition is almost exactly what is required to make this particular result true.

We can’t extend the definition of exchangeable to infinite partitions immediately, because considering invariance under the symmetric group on the integers is not very nice, in particular because there’s a danger all the probabilities will end up being zero. Instead, we consider restrictions of the partition to [n]\subset\mathbb{N}, and demand that these nest appropriately, and are exchangeable.

Independent increments is a meaningful thing to consider since one way to construct a partition, infinite or otherwise, is to consider elements one at a time in the standard ordering, either adding the new element to an already-present block, or starting a new block. Since 0 or 1 in the increment sequence corresponds precisely to these events, it is meaningful to talk about independent increments.

Generating Functions for the IMO

The background to this post is that these days I find myself using generating functions all the time, especially for describing the stationary states of various coalescence-like processes. I remember meeting them vaguely while preparing for the IMO as a student. However, a full working understanding must have eluded me at the time, as for Q5 on IMO 2008 in Madrid I had written down in big boxes the two statements involving generating functions that immediately implied the answer, but failed to finish it off. The aim of this post is to help this year’s team avoid that particular pitfall.

What are they?

I’m going to define some things in a way which will be most relevant to the type of problems you are meeting now. Start with a sequence (a_0,a_1,a_2,\ldots). Typically these will be the sizes of various combinatorial sets. Eg a_n = number of partitions of [n] with some property. Define the generating function of the sequence to be:

f(x)=\sum_{k\geq 0}a_k x^k=a_0+a_1x+a_2x^2+\ldots.

If the sequence is finite, then this generating function is a polynomial. In general it is a power series. As you may know, some power series can be rather complicated, in terms of where they are defined. Eg

1+x+x^2+x^3+\ldots=\frac{1}{1-x},

only when |x|<1. For other values of x, the LHS diverges. Defining f over \mathbb{C} is fine too. This sort of thing is generally NOT important for applications of generating functions to combinatorics. To borrow a phrase from Wilf, a generating function is a convenient `clothesline' on which to hang a sequence of numbers.

We need a notation to get back from the generating function to the coefficients. Write [x^k]g(x) to denote the coefficient of x^k in the power series g(x). So, if g(x)=3x^3-5x^2+7, then [x^2]g(x)=-5. It should hopefully never be relevant unless you read some other notes on the topic, but there is also the notation [\alpha x^2]g(x):=\frac{[x^2]g(x)}{\alpha}, which does make sense after a while.

How might they be useful?

Example: binomial coefficients a_k=\binom{n}{k} appear, as the name suggests, as coefficients of

f_n(x)=(1+x)^n=\sum_{k=0}^n \binom{n}{k}x^k.

Immediate consequence: it’s trivial to work out \sum_{k=0}^n \binom{n}{k} and \sum_{k=0}^n(-1)^k \binom{n}{k} by substituting x=\pm 1 into f_n.

Less obvious consequence: by considering choosing n balls from a red balls and b blue balls, one can verify

\binom{a+b}{n}=\sum_{k=0}^n \binom{a}{k}\binom{b}{n-k}.

We can rewrite the RHS as

\sum_{k+l=n}\binom{a}{k}\binom{b}{l}.

Think how we calculate the coefficient of x^n in the product f(x)g(x), and it is now clear that \binom{a+b}{n}=[x^n](1+x)^{a+b}, while

\sum_{k+l=n}\binom{a}{k}\binom{b}{l}=[x^n](1+x)^a(1+x)^b,

so the result again follows. This provides a good slogan for generating functions: they often replicate arguments via bijections, even if you can’t find the bijection.

Useful for? – Multinomial sums

The reason why the previous argument for binomial coefficients worked nicely is because we were interested in the coefficients, but had a neat expression for the generating function as a polynomial. In particular, we had an expression

\sum_{k+l=n}a_k b_l.

Such a sum is called a convolution, and its appearance is always a clue that generating functions might be useful.

Exercise: prove that in general, if f(x) is the generating function of (a_k) and g(x) the generating function of (b_l), then f(x)g(x) is the generating function of \sum_{k+l=n}a_kb_l.
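For what it's worth, the exercise is easy to check numerically; the little sketch below (helper name is mine) multiplies coefficient lists directly, and confirms the Vandermonde identity from the previous paragraph at the same time.

```python
from math import comb

def convolve(a, b):
    # Coefficients of f(x)g(x) from the coefficient lists of f and g
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

a, b = 4, 7
f = [comb(a, k) for k in range(a + 1)]   # (1+x)^a
g = [comb(b, k) for k in range(b + 1)]   # (1+x)^b
assert convolve(f, g) == [comb(a + b, n) for n in range(a + b + 1)]  # Vandermonde
```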

Even more usefully, this works in the multinomial case:

\sum_{k_1+\ldots+k_m=n}a^{(1)}_{k_1}\ldots a^{(m)}_{k_m}.

In many applications, these a^{(i)}s will all be the same, with common generating function f. For fixed m, the same argument as before gives

\sum_{k_1+\ldots+k_m=n}a_{k_1}\ldots a_{k_m}=[x^n]f(x)^m.

We don't even have to specify how many k_i's there are to be considered: provided a_0=0, that is f(0)=0, a composition of n has at most n non-zero parts, so the sum over m is really finite, and

\sum_{m}\sum_{k_1+\ldots+k_m=n}a_{k_1}\ldots a_{k_m}=[x^n]\left(1+f(x)+f(x)^2+\ldots\right)=[x^n]\frac{1}{1-f(x)}.

Useful when? – You recognise the generating function!

In some cases, you can identify the generating function as a `standard’ function, eg the geometric series. In that case, manipulating the generating functions is likely to be promising. Here is a list of some useful power series you might spot.

1+x+x^2+\ldots=\frac{1}{1-x},\quad |x|<1

1+2x+3x^2+\ldots=\frac{1}{(1-x)^2},\quad |x|<1

e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\ldots

\cos x=1-\frac{x^2}{2!}+\frac{x^4}{4!}\pm\ldots

Exercise: if you know what differentiation means, show that if f(x) is the gen fn of (a_k), then xf'(x) is the gen fn of (ka_k).

Technicalities: some of these identities are defined only for certain values of x. This may be a problem if they are defined at, say, only a single point, but in general this shouldn't be the case. In addition, you don't need to worry about differentiability. You can define differentiation of power series formally by x^n\mapsto nx^{n-1}, and sort out convergence later if necessary.

Useful for? – Recurrent definitions

The Fibonacci numbers are defined by:

F_0=F_1=1,\quad F_{n+1}=F_n+F_{n-1},\quad n\geq 1.

Let F(x) be the generating function of the sequence F_n. So, for n\geq 2,

[x^n]F(x)=[x^{n-1}]F(x)+[x^{n-2}]F(x)=[x^n](xF(x)+x^2F(x)),

and F(0)=1, so we can conclude that:

F(x)=1+(x+x^2)F(x)\quad\Rightarrow\quad F(x)=\frac{1}{1-x-x^2}.
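If you want to see the coefficients come out, you can invert the denominator as a formal power series: matching coefficients in d(x)c(x)=1 gives a simple recursion for the c_n. A short sketch (function name is mine):

```python
def series_inverse(d, N):
    # Coefficients of 1/d(x) up to x^N, assuming d(0) = 1: matching coefficients
    # in d(x)c(x) = 1 gives c_0 = 1 and c_n = -(d_1 c_{n-1} + d_2 c_{n-2} + ...)
    c = [1]
    for n in range(1, N + 1):
        c.append(-sum(d[k] * c[n - k] for k in range(1, min(n, len(d) - 1) + 1)))
    return c

print(series_inverse([1, -1, -1], 10))  # 1, 1, 2, 3, 5, 8, ... -- the Fibonacci numbers
```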

Exercise: Find a closed form for the generating function of the Catalan numbers, defined recursively by C_0=1 and:

C_n=C_0C_{n-1}+C_1C_{n-2}+\ldots+C_{n-1}C_0,\quad n\geq 1.

Can you now find the coefficients explicitly for this generating function?

Useful for? – Partitions

Partitions can be an absolute nightmare to work with because of the lack of explicit formulae. Often any attempt at a calculation turns into a massive IEP bash. This prompts a search for bijective or bare-hands arguments, but generating functions can be useful too.

For now (*), let's assume a partition of n means a non-increasing sequence of positive integers a_1\geq a_2\geq\ldots\geq a_k such that a_1+\ldots+a_k=n. Let p(n) be the number of partitions of n.

(* there are other definitions, in terms of a partition of the set [n] into k disjoint but unlabelled sets. Be careful about definitions, but the methods often extend to whatever framework is required. *)

Exercise: Show that the generating function of p(n) is:

\left(\frac{1}{1-x}\right)\left(\frac{1}{1-x^2}\right)\left(\frac{1}{1-x^3}\right)\ldots

Note that if we are interested only in partitions of n, then we don't need to consider any terms with exponent greater than n, so if we wanted we could take a finite product instead.

Example: the mint group will remember this problem from the first session in Cambridge:

Show that the number of partitions of n with distinct parts is equal to the number of partitions of n with odd parts.

Rather than the fiddly bijection argument found in the session, we can now treat this as a simple calculation. The generating function for distinct parts is given by:

(1+x)(1+x^2)(1+x^3)\ldots,

while the generating function for odd parts is given by:

\left(\frac{1}{1-x}\right)\left(\frac{1}{1-x^3}\right)\left(\frac{1}{1-x^5}\right)\ldots.

Writing the former as

\left(\frac{1-x^2}{1-x}\right)\left(\frac{1-x^4}{1-x^2}\right)\left(\frac{1-x^6}{1-x^3}\right)\ldots

shows that these are equal and the result follows.
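The identity is also easy to verify by machine up to any fixed degree; the following sketch (the truncation degree N is my choice) multiplies out both products and compares coefficients.

```python
N = 40  # compare coefficients up to x^N

def mul(a, b):
    # Multiply two coefficient lists, truncating at degree N
    c = [0] * (N + 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b[: N + 1 - i]):
                c[i + j] += ai * bj
    return c

distinct = [1] + [0] * N
for k in range(1, N + 1):
    factor = [0] * (N + 1)
    factor[0], factor[k] = 1, 1                            # 1 + x^k
    distinct = mul(distinct, factor)

odd = [1] + [0] * N
for k in range(1, N + 1, 2):
    geom = [1 if j % k == 0 else 0 for j in range(N + 1)]  # 1/(1 - x^k), truncated
    odd = mul(odd, geom)

assert distinct == odd  # the two generating functions agree up to x^N
```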

Other things – Multivariate Generating Functions

If you want to track a sequence in two variables, say a_{m,n}, then you can encode this with the bivariate generating function

f(x,y):=\sum_{m,n\geq 0}a_{m,n}x^my^n.

The coefficients are then extracted by [x^ay^b] and so on. There’s some interesting stuff on counting lattice paths with this method.

Sums over arithmetic progressions via roots of unity

Note that we can extract both \sum a_n and \sum (-1)^na_n by judicious choice of x in f(x). By taking half the sum or half the difference, we can obtain

a_0+a_2+a_4+\ldots=\frac12(f(1)+f(-1)),\quad a_1+a_3+a_5+\ldots=\frac12(f(1)-f(-1)).

Can we do this in general? Yes actually. If you want a_0+a_k+a_{2k}+\ldots, this is given by:

a_0+a_k+a_{2k}+\ldots=\frac{1}{k}\left(f(1)+f(w)+\ldots+f(w^{k-1})\right),

where w=e^{2\pi i/k} is a primitive kth root of unity. Exercise: Prove this.

For greater clarity, first try the case k=4, and consider how the imaginary parts of the power series evaluated at +i and -i cancel.
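Here is the filter in code, extracting every third binomial coefficient as an example (function name and example are mine):

```python
import cmath
from math import comb

def ap_sum(coeffs, k):
    # a_0 + a_k + a_{2k} + ... via (1/k)(f(1) + f(w) + ... + f(w^{k-1}))
    f = lambda z: sum(a * z**i for i, a in enumerate(coeffs))
    w = cmath.exp(2j * cmath.pi / k)
    return round((sum(f(w**j) for j in range(k)) / k).real)

coeffs = [comb(12, i) for i in range(13)]          # f(x) = (1+x)^12
print(ap_sum(coeffs, 3))                           # C(12,0) + C(12,3) + ...
print(sum(comb(12, i) for i in range(0, 13, 3)))   # direct check: same answer
```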

Bell Polynomials

Trees with a single cycle

When counting combinatorial objects, it is often the case that we have two types of structure present at different levels. The aim of this post is to introduce the Bell polynomials, which provides the most natural notation for describing this sort of situation, and to mention some of the results that become easier to derive in this framework. This post is based on material and exercises from Chapter 1 of Jim Pitman’s book Combinatorial Stochastic Processes, which is great, and also available online here.

The structures that Bell polynomials enumerate are called composite structures in this account. Rather than give a definition right away, I shall give an example. An object I have been thinking about in the past few weeks is the class of graphs on n vertices containing precisely one cycle. Some of the background for this has been explained in recent posts.

In a recent post on Prufer codes, I gave the classical argument showing that the number of trees on n vertices is n^{n-2}. We might consider a unicyclic graph to be a tree with an extra edge. But if we count the number of ways to add a further edge to a tree, we get

n^{n-2}\left[\binom{n}{2}-(n-1)\right]=n^{n-2}\binom{n-1}{2}.

Obviously, we have overcounted. If the single cycle in a graph has length k, then the graph has been counted exactly k times in this enumeration. But it is not obvious how many graphs have a single cycle of length k.

Instead, we stop worrying about exactly how many of these there are, as there might not be a simple expression anyway. As soon as we start using them in any actual argument, it will be useful to know various properties about the graphs, but probably not exactly how many there are.

Let’s focus on this single cycle of length k say. If we remove the edges of the cycle, we are left with a collection of trees. Why? Well if there was a cycle in the remaining graph, then the original graph would have had at least two cycles. So we have a collection of trees, unsurprisingly called a forest. Remembering that some of the trees may in fact be a single vertex (on the cycle), it is clear that there is a bijection between these trees and the vertices of the cycle in the obvious way. We can think of the graph as a k-cycle, dressed with trees.

Alternatively, once we have specified its size, we can forget about the k-cycle altogether. The graph is precisely defined by a forest of k trees on n vertices, with a specified root in each tree indicating which vertex lies on the cycle, and a cyclic ordering of the trees around the cycle. Since the cycle has no distinguished direction, there are (k-1)!/2 such orderings for k\geq 3 (the triangle case n=k=3 is a quick check of this factor). We can write this as

N_{n,k}=\frac{(k-1)!}{2}\sum_{(A_1,\ldots,A_k)\in\mathcal{P}^k(n)}a_1^{a_1-1}\cdot\ldots\cdot a_k^{a_k-1},\quad \text{for }a_i=|A_i|,

where \mathcal{P}^k(n) is the set of partitions of [n] with k blocks. Remember that the blocks in a partition are necessarily unordered. This makes sense in this setting as the cyclic ordering chosen from the (k-1)!/2 possibilities specifies the arrangement around the cycle.

Bell Polynomials

The key point about this description is that there are two types of combinatorial structure present. We have the rooted trees, and also a cyclic ordering of the rooted trees. Bell polynomials generalise this idea. It is helpful to be less specific and think of partitions of [n] into blocks. There are w_j arrangements of any block of size j, and there are v_k ways to arrange the blocks, if there are k of them. Note that we assume v_k is independent of the arrangements within the collection of blocks. So in the previous example, w_j=j^{j-1} for the rooted trees, and v_k=(k-1)!/2 for the cyclic orderings. Pitman denotes these sequences by v_\bullet,w_\bullet. Then the (n,k)th partial Bell polynomial, B_{n,k}(w_\bullet), gives the number of divisions into k blocks:

B_{n,k}(w_\bullet):=\sum_{(A_1,\ldots,A_k)\in\mathcal{P}^k(n)}\prod_{i=1}^k w_{a_i}.

The total number of arrangements is given by the Bell polynomial

B_n(v_\bullet,w_\bullet):=\sum_{k=1}^n v_k B_{n,k}(w_\bullet).

Here are some other examples of Bell polynomials. The Stirling numbers of the first kind c_{n,k} give the number of permutations of [n] with k cycles. Since we don't want to impose any combinatorial structure on the set of cycles, we don't need to consider v_\bullet, and the number of ways to make a j-cycle from a j-block is w_j=(j-1)!, so c_{n,k}:=B_{n,k}((\bullet-1)!). Similarly, the Stirling numbers of the second kind S_{n,k} give the number of partitions of [n] into k blocks. Almost by definition, S_{n,k}:=B_{n,k}(1^\bullet), where 1^\bullet is defined to be the sequence containing all 1s.

Applications

So far, this is just a definition that gives an abbreviated description for the sizes of several interesting sets of discrete objects. Having clean notation is always important, but there are further advantages of using Bell polynomials. I don’t want to reproduce the entirety of the chapter I’ve read, so my aim for this final section is to give a very vague outline of why this is a useful formulation.

Bell polynomials can be treated rather nicely via generating functions. The key to this is to take a sum not over partitions, but rather over ordered partitions, which are exactly the same, except now we also care about the order of the blocks. This has the advantage that there is a correspondence between ordered partitions with k blocks and compositions with k terms. If the composition is n_1+\ldots+n_k=n, it is clear why there are \binom{n}{n_1,\ldots,n_k} ordered partitions encoding this structure. This multinomial coefficient is n! divided by the product of the n_i! terms, and so we can write:

B_{n,k}(w_\bullet)=\frac{n!}{k!}\sum_{(n_1,\ldots,n_k)}\prod_{i=1}^k \frac{w_{n_i}}{n_i!}.

This motivates considering the exponential generating function given by

w(\xi)=\sum_{j=1}^\infty w_j\frac{\xi^j}{j!},

as this leads to the neat expressions:

B_{n,k}(w_\bullet)=n![\xi^n]\frac{w(\xi)^k}{k!},\quad B_n(v_\bullet,w_\bullet)=n![\xi^n]v(w(\xi)).
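As a quick check that the composition formula above computes the right thing, here is a direct translation into code (function name is mine), evaluated on the two Stirling examples from earlier:

```python
from fractions import Fraction
from math import factorial

def bell_partial(n, k, w):
    # B_{n,k}(w) = (n!/k!) * sum over compositions n_1 + ... + n_k = n
    # of the product of w(n_i)/n_i!
    def rec(remaining, parts):
        if parts == 0:
            return Fraction(int(remaining == 0))
        return sum(Fraction(w(j), factorial(j)) * rec(remaining - j, parts - 1)
                   for j in range(1, remaining - parts + 2))
    return rec(n, k) * factorial(n) / factorial(k)

# Stirling numbers of the first kind: w_j = (j-1)!
print(bell_partial(4, 2, lambda j: factorial(j - 1)))  # 11 permutations of [4] with 2 cycles
# Stirling numbers of the second kind: w_j = 1
print(bell_partial(4, 2, lambda j: 1))                 # 7 partitions of [4] into 2 blocks
```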

The Bell polynomial B_n(v_\bullet,w_\bullet) counts the number of partitions of [n] subject to some extra structure. If we choose uniformly from this set, we get a distribution on this combinatorial object, for which the Bell polynomial provides the normalising constant. If we then ignore the extra structure, the sequences v_\bullet,w_\bullet induce a probability distribution on the set of partitions of n. This distribution is known as a Gibbs partition. It is interesting to consider when and whether it is possible to define a splitting mechanism such that the Gibbs partitions can be coupled to form a fragmentation process. This is the opposite of a coalescence process. Here, we have a sequence of masses, and at each integer time we have rules to determine which mass to pick, and a rule for how to break it into two pieces. It is certainly not the case that for an arbitrary splitting rule and sequences v_\bullet,w_\bullet, the one-step fragmentation of the Gibbs partition on n gives the corresponding Gibbs partition on (n-1).

CLT for random permutations

For the final demonstration of the use of Bell polynomials, I am going to sketch the outline of a solution to exercise 1.5.4. which shows that the number of cycles in a uniformly chosen permutation has a CLT. This is not at all obvious, since the number of permutations of [n] with k cycles is given by B_{n,k}((\bullet-1)!) and there is certainly no simple form for this, so the possibility of doing a technical limiting argument seems slim.

For ease of notation, we copy Pitman and write c_{n,k}:=B_{n,k}((\bullet-1)!) as before. First we show exercise 1.2.3. which asserts that

x(x+1)\ldots(x+(n-1))=\sum_{k=1}^n c_{n,k}x^k.

We argue combinatorially. The RHS is the number of ways to choose \sigma\in S_n and a colouring of [n] with x colours such that the orbits of \sigma are monochromatic. We prove that the LHS counts the same thing by induction on the number of vertices. We claim there is a 1-to-(x+n) map from configurations on n vertices to configurations on (n+1) vertices. Given \sigma\in S_n and a colouring, for any a\in[n], we construct \sigma_a\in S_{n+1} by \sigma_a(a)=n+1, \sigma_a(n+1)=\sigma(a), and \sigma_a(y)=\sigma(y) for all other y. We give n+1 the same colour as a. This gives us n possibilities. Alternatively, we can map (n+1) to itself and give it any colour we want. This gives us x possibilities. A slightly more careful argument shows that this is indeed a 1-to-(x+n) map, which is exactly what we require.

So the polynomial

A_n(z)=\sum_{k=0}^nc_{n,k}z^k,

has n real zeros, which allows us to write

\frac{c_{n,k}}{A_n(1)}=\mathbb{P}(X_1+\ldots+X_n=k),

where the Xs are independent but not identically distributed Bernoulli trials. The number of cycles is distributed as this sum, and so it becomes a simple matter to verify the CLT by checking that the variance grows appropriately. As both the mean and the variance are asymptotically \log n, we can conclude that:

\frac{K_n - \log n}{\sqrt{\log n}}\stackrel{d}{\rightarrow} N(0,1).
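This is easy to test by simulation; the sketch below (function name and parameters are my choices) samples uniform permutations, counts their cycles, and compares the empirical mean and variance with \log n.

```python
import math
import random

def cycle_count(n):
    # Number of cycles of a uniformly random permutation of n elements
    perm = list(range(n))
    random.shuffle(perm)
    seen, k = [False] * n, 0
    for i in range(n):
        if not seen[i]:
            k += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return k

n, trials = 10**4, 2000
samples = [cycle_count(n) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
print(mean, var, math.log(n))  # mean and variance are both log n + O(1); log n ~ 9.21 here
```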

In a future post, I want to give a quick outline of section 1.3. which details how the Bell polynomials can be surprisingly useful to find the moments of infinitely divisible distributions.

Exponentials kill Polynomials

I gave my second session at the UK IMO training and selection camp at Trinity Cambridge earlier today. This one was to a group of the more experienced students, on the subject of polynomials. This always feels like a tricky topic to present to olympiad students. I always felt that there were lots of useful connections between roots and coefficients, but it was hard to get a handle on exactly what sort of relationship would be useful for each question. Perhaps the main problem is that any of the natural interesting things to talk about lie annoyingly on the fringes of complex analysis or abstract algebra. Or, at any rate, they are best explained in that language, which isn't particularly suitable at this stage, when there's only an hour and a half to play with.

One problem I was particularly keen for the students to attempt was a proof that exponential functions always grow faster than polynomials. I think this is a good problem to think about because it is so useful in all sorts of areas. In probability, for example, polynomial decay and exponential decay are the two regimes generally discussed for the tail behaviour of distributions of random variables, and all sorts of things are qualitatively different in the two cases. It is also often a useful step in a proof when we need very crude bounds on a function.

Anyway, how to prove it? Well the first stage is to prove that a polynomial of degree n+1 dominates any polynomial of strictly smaller degree. I am writing ‘dominate’ to mean, ‘is eventually larger than’, under the assumption that the leading coefficients are always positive. (As this seems easier than sticking modulus signs everywhere.)

This isn’t too hard. If we take

P(x)=a_nx^n+\ldots+a_1x+a_0,

then whenever x>\max(1,|a_n|+\ldots+|a_0|), we have x^{n+1}>P(x).

Now we introduce the exponential function. In most applications, it turns out to be most natural to use e^x, but for students who haven’t even necessarily done AS-levels I wasn’t happy using a concept whose definition might be rather unfamiliar.

If you are happy with the Taylor series definition

e^x:=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\ldots

then the result is not too challenging. Given a polynomial P of degree k with positive leading coefficient, by the previous result we have that eventually

P(x)<1+x+\frac{x^2}{2!}+\ldots+\frac{x^k}{k!}+\frac{x^{k+1}}{(k+1)!}\leq e^x.

Although the students were able to follow this proof, they were happier thinking about P(n)<2^n. Obviously, we could replace x by n log 2 in the previous argument, but I was pleased with the following direct proof. Ironically, this has much more of the flavour of analysis than the above.

First we can show by induction that n<2^{n/2} for n > 4. It makes sense to take the broader induction hypothesis 4<n<2^{n/2}, and then show that in this range “adding 1 gives you a smaller answer than multiplying by the square root of two”.

From the initial result about polynomials dominating smaller-degree polynomials, it suffices to prove the result for P(x)=x^k, for some k, rather than for arbitrary polynomials. Now we can prove this by induction on k. The base case k=1 follows from the previous paragraph.

If n^k<2^n eventually, then (4n)^k=4^kn^k<2^{n+2k}, so by changing variables, n^k<2^{n/4+2k} eventually, which is in turn <2^{n/2} once n>8k. So, eventually

n^{k+1}<2^{n/2}\cdot 2^{n/2}=2^{n}.
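As a numerical sanity check of this induction, the following sketch (function name and search limit are mine) finds, for a few values of k, the point beyond which n^k<2^n holds throughout a tested range:

```python
def crossover(k, limit=500):
    # Smallest n beyond which n^k < 2^n throughout the tested range
    return max((n for n in range(1, limit) if n**k >= 2**n), default=0) + 1

for k in [2, 5, 10]:
    print(k, crossover(k))  # e.g. n^10 < 2^n for every n >= 59 in the tested range
```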

Bijections, Prufer Codes and Cayley’s Formula

I’m currently at the training camp in Cambridge for this year’s UK IMO squad. This afternoon I gave a talk to some of the less experienced students about combinatorics. My aim was to cover as many useful tricks for calculating the sizes of combinatorial sets as I could in an hour and a half. We started by discussing binomial coefficients, which pleasingly turned out to be revision for the majority. But my next goal was to demonstrate that we are much more interested in the fact that we can calculate these if we want than in the actual expression for their values.

Put another way, my argument was that the interpretation of \binom{n}{m} as the number of ways to choose m objects from a collection of n, or the number of up-and-right paths from (0,0) to (m,n-m), is more useful than the fact that \binom{n}{m}=\frac{n!}{m!(n-m)!}. The opening gambit was to prove the fundamental result underlying the famous construction of Pascal's triangle, that

\binom{n+1}{m+1}=\binom{n}{m}+\binom{n}{m+1}.

This is not a hard result to prove by manipulating factorials, but it is a very easy result to prove in the path-counting setting, for example.

So it turned out that the goal of my session, as further supported by some unsubtly motivated problems from the collection, was to convince the students to use bijections as much as possible. That is, if you have to count something awkward, show that counting the awkward thing is equivalent to counting something more manageable, then count that instead. For many simpler questions, this equivalence is often drawn implicitly using words (“each of the n objects can be in any subset of the collection of bags so we multiply…” etc), but it is always worth having in mind the formal bijective approach. Apart from anything else, asking the question “is this bijection so obvious I don’t need to prove it” is often a good starting-point for assessing whether the argument is in fact correct!

Anyway, I really wanted to show my favourite bijection argument, but there wasn't time, and I didn't want to spoil other lecturers' thunder by defining a graph and a tree and so forth. The exploration process encoding of trees is a strong contender, but today I want to define quickly the Prufer coding for trees, and use it to prove a famous result I've been using a lot recently: Cayley's formula for the number of spanning trees on the complete graph with n vertices, n^{n-2}.

We are going to count rooted trees instead. Since we can choose any vertex to be the root, there are n times as many rooted trees as trees, so it suffices to show that there are n^{n-1} rooted trees on n vertices. The description of the Prufer code is relatively simple. Take a rooted tree with vertices labelled by [n]. A leaf is a vertex with degree 1, other than the root. Find the leaf with the largest label. Write down the label of the single vertex to which this leaf is connected, then delete the leaf. Now repeat the procedure, writing down the label of the vertex connected to the leaf now with the largest label, until there are only two vertices remaining, when you delete the non-root vertex, and write down the label of the root. We get a string of (n-1) labels. We want to show that this mapping is a bijection from the set of rooted trees with vertices labelled by [n] to [n]^{n-1}.

Let's record informally how we would recover a tree from the Prufer code. First, observe that the label of any vertex which is not a leaf must appear in the code. Why? Well, the root label appears right at the end, if not earlier, and every other vertex must be deleted. But a vertex cannot be deleted until it has degree one, so its neighbours further from the root (its descendants) must be removed first, and each removal of an immediate descendant writes down the vertex's label. So we know what the root is, and what the leaves are, straight away.

In fact we can say slightly more than this. The number of times the root label appears is the degree of the root, while the number of times any other label appears is the degree of the corresponding vertex minus one. Call this sequence the Prufer degrees.

So we construct the tree backwards from the leaves towards the root. We add edges one at a time, with the k-th edge joining the vertex with the k-th label to some other vertex. For k=1, this other vertex is the leaf with maximum label. In general, let G_k be the graph formed after the addition of k-1 edges, so G_1 is empty, and G_n is the full tree. Define T_k to be the set of vertices whose degree in G_k is exactly one less than their Prufer degree. Note that T_1 is therefore the set of leaves suggested by the Prufer code. So we form G_{k+1} by adding an edge between the vertex whose label appears at position k in the Prufer sequence and the vertex of T_k with maximum label.

Proving that this is indeed the inverse is a bit fiddly, more because of notation than any actual mathematics. You probably want to show injectivity by an extremal argument, taking the closest vertex to the root that differs in two trees with the same Prufer code. I hope it isn't a complete cop-out to swerve around presenting this in full technical detail, as I feel I've achieved my main goal of explaining why bijection arguments can reduce a counting problem that was genuinely challenging to an exercise in choosing sensible notation for proving a fairly natural bijection.
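For the sceptical, the bijection can be verified exhaustively for small n. The sketch below (function names are mine) implements the rooted encoding described above, together with the inverse suggested by the discussion of Prufer degrees, and checks that decoding each of the n^{n-1} possible codes and re-encoding returns the same code, for n=4.

```python
import heapq
from itertools import product

def prufer_encode(adj, root):
    # Repeatedly delete the largest-labelled non-root leaf, writing down its neighbour
    adj = {v: set(nb) for v, nb in adj.items()}
    code = []
    for _ in range(len(adj) - 1):
        leaf = max(v for v in adj if v != root and len(adj[v]) == 1)
        nbr = next(iter(adj[leaf]))
        code.append(nbr)
        adj[nbr].remove(leaf)
        del adj[leaf]
    return code

def prufer_decode(code, n):
    # v is a current leaf iff it is not the root, not yet deleted,
    # and does not appear in the remaining part of the code
    root = code[-1]
    remaining = [0] * (n + 1)
    for c in code:
        remaining[c] += 1
    heap = [-v for v in range(1, n + 1) if v != root and remaining[v] == 0]
    heapq.heapify(heap)                      # max-heap via negated labels
    edges = []
    for c in code:
        leaf = -heapq.heappop(heap)          # largest current leaf
        edges.append((leaf, c))
        remaining[c] -= 1
        if remaining[c] == 0 and c != root:  # c has just become a leaf
            heapq.heappush(heap, -c)
    return edges, root

n = 4
for code in product(range(1, n + 1), repeat=n - 1):  # all of [n]^{n-1}
    edges, root = prufer_decode(list(code), n)
    adj = {v: set() for v in range(1, n + 1)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    assert prufer_encode(adj, root) == list(code)
print("all", n**(n - 1), "codes decode to rooted trees that re-encode correctly")
```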

Diameters of Trees and Cycle Deletion

In the past two posts, we introduced two models of random trees. The Uniform Spanning Tree chooses uniformly at random from the set of spanning trees for a given underlying graph. The Minimum Spanning Tree assigns IID weights to each edge in the underlying graph, then chooses the spanning tree with minimum induced total weight. We are interested to know whether these are in fact the same distribution, and if they are not, what properties can be used to distinguish them asymptotically.

While investigating my current research problem, I was interested in the diameter of large random trees under various models. Specifically, I am considering what happens if you take a standard Erdos-Renyi process on n vertices, where edges appear at constant rate between pairs of vertices chosen uniformly at random, and add an extra mechanism to prevent the components becoming too large. For this particular model, our mechanism consists of removing any cycles as they are formed. Thus all the components remain trees as time advances, so it is not unreasonable to think that there might be some sort of equilibrium distribution.

Now, by definition, any tree formed by the Erdos-Renyi process is a uniform tree. Why? Well, the probability of a configuration is determined entirely by the number of edges present, so once we condition that a particular set of vertices are the support of a tree, all possible tree structures are equally likely. Note that this relies on sampling at a single fixed time. If we know the full history of the process, then it is no longer uniform. For example, define a k-star to be a tree on k vertices where one 'centre' vertex has degree k-1. The probability that a uniform tree on k vertices is a k-star is \frac{k}{k^{k-2}}=k^{-(k-3)}. But a star can only be formed by successively adding single vertices to an existing star. That is, we cannot join a 3-tree and a 4-tree with an edge to get a 7-star. So it is certainly not immediately clear that once we've incorporated the cycle deletion mechanism, the resulting trees will be uniform once we condition on their size.

In fact, the process of component sizes is not itself Markovian. For a concrete example, observe first that there is, up to isomorphism, only one tree on each of 1, 2 or 3 vertices, so the first possible counterexample will be the splitting up of a tree on four vertices. Note that cycle deletion always removes at least three edges (ie a triangle), so the two possibilities for breaking a 4-tree are:

(4) -> (2,1,1) and (4) -> (1,1,1,1)

I claim that the probabilities of each of these are different in the two cases: a) (4) is formed from (2,2) and b) (4) is formed from (3,1). This is precisely a counterexample to the Markov property.

In the case where (4) is formed from (2,2), the 4-tree is certainly a path on 4 vertices. Therefore, with probability 1/3, the next edge added within the component creates a 4-cycle, which is deleted to leave components (1,1,1,1). In the case where (4) is formed from (3,1), with probability 2/3 it is a path on 4 vertices and with probability 1/3 it is a 4-star (a 'T' shape). In the star case, no edge can be added to make a 4-cycle, so after cycle deletion the only possibility is (2,1,1). Thus the probability of getting (1,1,1,1) is 2/9 in this case, confirming that the process is non-Markovian. However, we might remark that we are unlikely to have O(n) vertices involved in fragmentations until at least the formation of the giant component in the underlying E-R process, so it is possible that the cycle deletion process is 'almost Markov' for any property we might actually be interested in.

When we delete a cycle, how many vertices do we lose? Well, for a large tree on n vertices, the edge added which creates the cycle is chosen uniformly at random from the pairs of vertices which are not currently joined by an edge. Assuming that n is \omega(1), that is, we are thinking about a limit of fairly large trees, the number of edges present is much smaller than the number of possible edges. So we might as well assume we are choosing uniformly from all possible edges, rather than just those which aren't already present.

If we choose to add an edge between vertices x and y in the tree, then a cycle is formed and immediately deleted. So the number of edges lost is precisely the length of the path between x and y in the original tree. We are interested to know the asymptotics for this length when x and y are chosen at random. The length of the longest path in a graph is called the diameter, and in practice, if we are just interested in orders of magnitude, we might as well assume the diameter and the expected path length are of the same order.

So we want to know the asymptotic diameter of a UST on n vertices for large n. This is known to be of order n^{1/2}. Here's a quick but very informal argument that did genuinely originate on the back of a napkin. I'm using the LERW (loop-erased random walk) construction. Let's start at vertex x and run a loop-erased random walk towards y, recording how long the resultant path is as time t advances. This is a Markov chain: call the path length at time t X_t.

Then if X_t=k, with probability 1-\frac k n we get X_{t+1}=k+1, and for each j in {0,…,k-1}, with probability 1/n we have X_{t+1}=j, as this corresponds to hitting a vertex we have already visited. So

\mathbb{E}\Big[X_{t+1}|X_t=k\Big]\approx k+1-\frac{k^2}{2n}.

Note that this drift is positive for k\ll\sqrt n and negative for k\gg\sqrt n, so we would expect n^{1/2} to be the correct spatial scaling for an equilibrium distribution. And the expected hitting time of vertex y is n, by a geometric distribution argument, so we would expect this Markov chain to be well inside its equilibrium window, on the n^{1/2} scale, by the time this occurs. As a result, we expect the length of the x to y path to have magnitude n^{1/2}, and assume that the diameter is similar.
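The napkin argument is easy to simulate: run the chain for a while and compare the long-run mean of X_t with \sqrt{n} (function name and run lengths are my choices).

```python
import math
import random

def mean_path_length(n, steps=200000):
    # The chain: X -> X + 1 with probability 1 - X/n, else jump to Uniform{0, ..., X-1}
    k, total = 0, 0
    for _ in range(steps):
        if random.random() < k / n:
            k = random.randrange(k)  # hit an already-visited vertex: loop erased
        else:
            k += 1
        total += k
    return total / steps

for n in [100, 400, 1600, 6400]:
    print(n, mean_path_length(n) / math.sqrt(n))  # ratio roughly constant across n
```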

So this will be helpful for calculations in the cycle deletion model, provided that the trees look like uniform trees. But does that even matter? Do all sensible models of random trees have diameter going like n^{1/2}? Well, a recent paper of Addario-Berry, Broutin and Reed shows that this is not the case for the minimum spanning tree. They demonstrate that the diameter in this case is n^{1/3}. I found this initially surprising, so tried a small example to see if that shed any light on the situation.

The underlying claim is that MSTs are more likely to be ‘star-like’ than USTs, a term I am not going to define. Let’s consider n=4. Specifically, consider the 4-star with centre labelled as 1. There are six possible edges in K_4 and we want to see how many of the 6! weight allocations lead to this star. If the three edges into vertex 1 have weights 1, 2 and 3 then we certainly get the star, but we can also get this star if the edges have weights 1, 2 and 4, and the edge with weight 3 lies between the edges with weights 1 and 2. So the total number of possibilities for this is 3! x 3! + 3! x 2! = 48. Whereas to get a 4-path, you can assign weights 1, 2 and 3 to the edges of the path, or weights 1, 2 and 4 provided the 4 is not in the middle, and then you have the 3 joining up the triangle formed by 1 and 2. So the number of possibilities for this is 3! x 3! + 4 x 2! = 44.
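This small case is well suited to brute force. The sketch below (function names are mine) runs Kruskal's algorithm over all 6! weight assignments on K_4 and counts how many produce the fixed star centred at vertex 1, and how many produce a fixed path, recovering 48 and 44.

```python
from itertools import combinations, permutations

vertices = [1, 2, 3, 4]
edges = list(combinations(vertices, 2))  # the six edges of K_4

def mst(weight):
    # Kruskal's algorithm; weights are distinct, so the MST is unique
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    tree = set()
    for e in sorted(edges, key=weight.get):
        ra, rb = find(e[0]), find(e[1])
        if ra != rb:
            parent[ra] = rb
            tree.add(e)
    return frozenset(tree)

star = frozenset([(1, 2), (1, 3), (1, 4)])  # the 4-star centred at vertex 1
path = frozenset([(1, 2), (2, 3), (3, 4)])  # one particular 4-path
counts = {star: 0, path: 0}
for perm in permutations(range(1, 7)):      # all 6! assignments of weights 1..6
    t = mst(dict(zip(edges, perm)))
    if t in counts:
        counts[t] += 1
print(counts[star], counts[path])  # 48 and 44, matching the counts above
```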

To summarise in a highly informal way, in a star-like tree, you can ‘hide’ some fairly low-scoring weights on edges that aren’t in the tree, so long as they join up very low-scoring edges that are in the tree. Obviously, this is a long way from getting any formal results on asymptotics, but it does at least show that we need to be careful about diameters if we don’t know exactly what mechanism is generating the tree!