# Kernels of critical graph components

This post is motivated by G(N,p), the classical Erdos-Renyi random graph, specifically its critical window, when $p=p(N)=\frac{1}{N}(1+\lambda N^{-1/3})$.

We start with the following observation, which makes no restriction on p. Suppose a component of G(N,p) is a tree. Then, the graph geometry of this component is that of a uniform random tree on the appropriate number of vertices. This is deliberately informal. To be formal, we’d have to say “condition on a particular subset of vertices forming a tree-component” and so on. But the formality is broadly irrelevant, because at the level of metric scaling limits, if we want to describe the structure of a tree component, it doesn’t matter whether it has $\log N$ or $\frac{1}{7}N$ vertices, because in both cases the tree structure is uniform. The only thing that changes is the scaling factor.

In general, when V vertices form a connected component of a graph with E edges, we define the excess to be E-V+1. So the excess is non-negative, and is zero precisely when the component is a tree. I’m reluctant to say that the excess counts the number of cycles in the component, but certainly it quantifies the amount of cyclic structure present. We will sometimes, in a mild abuse of notation, talk about excess edges. But note that for a connected component with positive excess, there is a priori no way to select which edges would be the excess edges. In a graph process, or when there is some underlying exploration of the component, there sometimes might be a canonical way to classify the excess edges, though it’s worth remarking that the risk of size-biasing errors is always extremely high in this sort of situation.

Returning to the random graph process, as so often there are big changes around criticality. In the subcritical regime, the components are small, and most of them, even the largest with high probability, are trees. In the supercritical regime, the giant component has excess $\Theta(N)$, which is qualitatively very different.

It feels like every talk I’ve ever given has begun with an exposition of Aldous’s seminal paper [Al97] giving a distributional scaling limit of the sizes of critical components in the critical window, and a relation between the process on this time-scale and the multiplicative coalescent. And it remains relevant here, because the breadth-first exploration process can also be used to track the number of excess edges.

In a breadth-first exploration, we have a stack of vertices we are waiting to explore. We pick one and look its neighbours restricted to the rest of the graph, that is without the vertices we have already fully explored, and also without the other vertices in the stack. That’s the easiest way to handle the total component size. But we can simultaneously track how many times we would have joined to a neighbour within the stack, which leads to an excess edge, and Aldous derives a joint distributional scaling limit for the sizes of the critical components and their excesses. (Note that in this case, there is a canonical notion of excess edge, but it depends not just on the graph structure, but also on the extra randomness of the ordering within the breadth-first search.)

Roughly speaking, we consider the reflected exploration process, and its scaling limit, which is a reflected parabolically-drifting Brownian motion (though the details of this are not important at this level of exposition, except that it’s a well-behaved non-negative process that hits zero often). The component sizes are given by the widths of the excursions above zero, scaled up in a factor $N^{1/3}$. Then conditional on the shape of the excursion, the excess is Poisson with parameter the area under the excursion, with no rescaling. That is, a critical component has $\Theta(1)$ excess.

So, with Aldous’s result in the background, when we ask about the metric structure of these critical components, we are really asking: “what does a uniformly-chosen connected component with fixed excess look like when the number of vertices grows?”

I’ll try to keep notation light, but let’s say T(n,k) is a uniform choice from connected graphs on n vertices with excess k.

[Note, the separation of N and n is deliberate, because in the critical window, the connected components have size $n = \Theta(N^{2/3})$, so I want to distinguish the two problems.]

In this post, we will mainly address the question: “what does the cycle structure of T(n,k) look like for large n?” When k=0, we have a uniform tree, and the convergence of this to the Brownian CRT is now well-known [CRT2, LeGall]. We hope for results with a similar flavour for positive excess k.

2-cores and kernels

First, we have to give a precise statement of what it means to study just the cycle structure of a connected component. From now on I will assume we are always working with a connected graph.

There are several equivalent definitions of the 2-core C(G) of a graph G:

• When the excess is positive, there are some cycles. The 2-core is the union of all edges which form part of some cycle, and any edges which lie on a path between two edges which both form part of some cycle.
• C(G) is the maximal induced subgraph where all degrees are at least two.
• If you remove all the leaves from the graph, then all the leaves from the remaining graph, and continue, the 2-core is the state you arrive at where there are no leaves.

It’s very helpful to think of the overall structure of the graph as consisting of the 2-core, with pendant trees ‘hanging off’ the 2-core. That is, we can view every vertex of the 2-core as the root of a (possibly size 1) tree. This is particular clear if we remove all the edges of the 2-core from the graph. What remains is a forest, with one tree for each vertex of the 2-core.

In general, the k-core is the maximal induced subgraph where all degrees are at least k. The core is generally taken to be something rather different. For this post (and any immediate sequels) I will never refer to the k-core for k>2, and certainly not to the traditional core. So I write ‘core’ for ‘2-core’.

As you can see in the diagram, the core consists of lots of paths, and topologically, the lengths of these paths are redundant. So we will often consider instead the kernel, K(G), which is constructed by taking the core and contracting all the paths between vertices of degree greater than 2. The resulting graph has minimal degree at least three. So far we’ve made no comment about the simplicity of the original graphs, but certainly the kernel need not be simple. It will regularly have loops and multiple edges. The kernel of the graph and core in the previous diagram is therefore this:

Kernels of critical components

To recap, we can deconstruct a connected graph as follows. It has a kernel, and each edge of the kernel is a path length of some length in the core. The rest of the graph consists of trees hanging off from the core vertices.

For now, we ask about the distribution of the kernel of a T(n,K). You might notice that the case k=1 is slightly awkward, as when the core consists of a single cycle, it’s somewhat ambiguous how to define the kernel. Everything we do is easily fixable for k=1, but rather than carry separate cases, we handle the case $k\ge 2$.

We first observe that fixing k doesn’t confirm the number of vertices or edges in the kernel. For example, both of the following pictures could correspond to k=3:

However, with high probability the kernel is 3-regular, which suddenly makes the previous post relevant. As I said earlier, it can introduce size-biasing errors to add the excess edges one-at-a-time, but these should be constant factor errors, not scaling errors. So imagine the core of a large graph with excess k=2. For the sake of argument, assume the kernel has the dumbbell / handcuffs shape. Now add an extra edge somewhere. It’s asymptotically very unlikely that this is incident to one of the two vertices with degree three in the core. Note it would need to be incident to both to generate the right-hand picture above. Instead, the core will gain two new vertices of degree three.

Roughly equivalently, once the size of the core is fixed (and large) we have to make a uniform choice from connected graphs of this size where almost every vertex has degree 2, and $\Theta(1)$ of the rest have degree 3 or higher. But the sum of the degrees is fixed, because the excess is fixed. If there are n vertices in the core, then there are $\Theta(n)$ more graphs where all the vertices have degree 2 or 3, than graphs where a vertex has degree at least 4. Let’s state this formally.

Proposition: The kernel of a uniform graph with n vertices and excess $k\ge 2$ is, with high probability as $n\rightarrow\infty$, 3-regular.

This proved rather more formally as part of Theorem 7 of [JKLP], essentially as a corollary after some very comprehensive generating function setup; and in [LPW] with a more direct computation.

In the previous post, we introduced the configuration model as a method for constructing regular graphs (or any graphs with fixed degree sequence). We observe that, conditional on the event that the resulting graph is simple, it is in fact uniformly-distributed among simple graphs. When the graph is allowed to be a multigraph, this is no longer true. However, in many circumstances, as remarked in (1.1) of [JKLP], for most applications the configuration model measure on multigraphs is the most natural.

Given a 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges, and K a uniform choice from the configuration model with these parameters, we have

$\mathbb{P}\left( K \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1},$

where t(H) is the number of loops in H, and mult(e) the multiplicity of an edge e. This might seem initially counter-intuitive, because it looks we are biasing against graphs with multiple edges, when perhaps our intuition is that because there are more ways to form a set of multiple edges we should bias in favour of it.

I think it’s most helpful to look at a diagram of a multigraph as shown, and ask how to assign stubs to edges. At a vertex with degree three, all stub assignments are different, that is 3!=6 possibilities. At the multiple edge, however, we care which stubs match with which stubs, but we don’t care about the order within the multi-edge. Alternatively, there are three choices of how to divide each vertex’s stubs into (2 for the multi-edge, 1 for the rest), and then two choices for how to match up the multi-edge stubs, ie 18 in total = 36/2, and a discount factor of 2.

We mention this because in fact K(T(n,k)) converges in distribution to this uniform configuration model. Once you know that K(T(n,k)) is with high probability 3-regular, then again it’s probably easiest to think about the core, indeed you might as well condition on its total size and number of degree 3 vertices. It’s then not hard to convince yourself that a uniform choice induces a uniform choice of kernel. Again, let’s state that as a proposition.

Proposition: For any H a 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges as before,

$\lim_{n\rightarrow\infty}\mathbb{P}\left( K(T(n,k)) \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1}.$

As we said before, the kernel describes the topology of the core. To reconstruct the graph, we need to know the lengths in the core, and then how to glue pendant trees onto the core. But this final stage depends on k only through the total length of paths in the core. Given that information, it’s a combinatorial problem, and while I’m not claiming it’s easy, it’s essentially the same as for the case with k=1, and is worth treating separately.

It is worth clarifying a couple of things first though. Even the outline of methods above relies on the fact that the size of the core diverges as n grows. Again, the heuristic is that up to size-biasing errors, T(n,k) looks like a uniform tree with some uniformly-chosen extra edges. But distances in T(n,k) scale like $n^{1/2}$ (and thus in critical components of G(N,p) scale like $N^{1/3}$). And the core will be roughly the set of edges on paths between the uniformly-chosen pairs of vertices, and so will also have length $\Theta(n^{1/2})$.

Once you have conditioned on the kernel structure, and the (large) number of internal vertices on paths in the core (ie the length of the core), it is natural that the assignment of the degree-2 vertices to core paths / kernel edges is uniform. A consequence of this is that if you record $(Y_1,\ldots,Y_m)$ the lengths of paths in the core, where m=3(k-1), then

$\frac{(Y_1,\ldots,Y_m)}{\sum Y_i} \stackrel{d}\rightarrow \mathrm{Dirichlet}(1,1,\ldots,1).$

This is stated formally as Corollary 7 b) of [ABG09]. It’s worth noting that this confirms that the lengths of core paths are bounded in probability away from zero after the appropriate rescaling. In seeking a metric scaling limit, this is convenient as it means there’s so danger that two of the degree-3 vertices end up in ‘the same place’ in the scaling limit object.

To recap, the only missing ingredients now to give a complete limiting metric description of T(n,k) are 1) a distributional limit of the total core length; 2) some appropriate description of set of pendant trees conditional on the size of the pendant forest. [ABG09] show the first of these. As remarked before, all the content of the second of these is encoded in the unicyclic k=1 case, which I have written about before, albeit slightly sketchily, here. (Note that in that post we get around size-biasing by counting a slightly different object, namely unicyclic graphs with an identified cyclic edge.)

However, [ABG09] also propose an alternative construction, which you can think of as glueing CRTs directly onto the stubs of the kernel (with the same distribution as before). The proof that this construction works isn’t as painful as one might fear, and allows a lot of the other metric distributional results to be read off as corollaries.

References

[ABG09] – Addario-Berry, Broutin, Goldschmidt – Critical random graphs: limiting constructions and distributional properties

[CRT2] – Aldous – The continuum random tree: II

[Al97] – Aldous – Brownian excursions, critical random graphs and the multiplicative coalescent

[JKLP] – Janson, Knuth, Luczak, Pittel – The birth of the giant component

[LeGall] – Le Gall – Random trees and applications

[LPW] – Luczak, Pittel, Wierman – The structure of a random graph at the point of the phase transition

# Enumerating Forests

I’ve just got back from a visit to Budapest University of Technology, where it was very pleasant to be invited to give a talk, as well as continuing the discussion our research programme with Balazs. My talk concerned a limit for the exploration process of an Erdos-Renyi random graph conditioned to have no cycles. Watch this space (hopefully very soon) for a fully rigorous account of this. In any case, my timings were not as slick as I would like, and I had to miss out a chunk I’d planned to say about a result of Britikov concerning enumerating unrooted forests. It therefore feels like an excellent time to write something again, and explain this paper, which you might be able to find here, if you have appropriate journal rights.

We are interested to calculate $a_{n,m}$ the number of forests with vertex set [n] consisting of m unrooted trees. Recall that if we were interested in rooted trees, we could appeal to Prufer codes to show that there are $m n^{n-m-1}$ such forests, and indeed results of Pitman give a coalescent/fragmentation scheme as m varies between 1 and n-1. It seems that there is no neat combinatorial re-interpretation of the unrooted case though, so Britikov uses an analytic method.

We know that

$a_{n,m}= \frac{n!}{m!} \sum_{\substack{k_1+\ldots+k_m=n\\ k_i\ge 1}} \prod_{j=1}^m \frac{k_j^{k_j-2}}{k_j!}.$

To see this, observe that the $k_j$s correspond to the sizes of the m trees in the forest; $\frac{n!}{\prod k_j!}$ gives the multinomial number of ways to assign vertices to the trees; given the labels for a tree of size $k_j$, there are $k_j^{k_j-2}$ ways to make up the tree itself; and $\frac{1}{m!}$ accounts for the fact that the trees have no order.

What we would really like to do is to take the uniform distribution on the set of all labelled trees, then simulate m IID copies of this distribution, and condition the union to contain precisely n vertices. But obviously this is an infinite set, so we cannot choose uniformly from it. Instead, we can tilt so that large trees are unlikely. In particular, for each x we define

$\mathbb{P}(\xi=k) \propto \frac{k^{k-2} x^k}{k!}$,

and define the normalising constant

$B(x):= \sum_{k\ge 1} \frac{k^{k-2}x^k}{k!},$

whenever it exists. It turns out that $x\le e^{-1}$ is precisely the condition for $B(x)<\infty$. Note now that if $\xi_1,x_2,\ldots$ are IID copies of $\xi$, then

$\mathbb{P}(\xi_1+\ldots+\xi_m=n) = \frac{x^n}{B(x)^m} \sum_{k_1+\ldots + k_m=n} \prod_{j=1}^m \frac{k_j^{k_j-2}}{k_j!},$

and so we obtain

$a_{n,m}= \frac{n!}{m!} \frac{B(x)^m}{x^n} \mathbb{P}(\xi_1+\ldots + \xi_m=n).$

So asymptotics for $a_{n,m}$ might follows from laws of large numbers of this distribution $\xi$.

So far, we haven’t said anything about how to choose this value x. But observe that if you want to have lots of trees in the forest, then the individual trees should generally be small, so we take x small to tilt away from a preference for large trees. It turns out that there is a similar interpretation of criticality for forests as for general graphs, and taking x equal to 1/e, its radius of convergence works well for this setting. If you want even fewer trees, there is no option to take x larger than 1/e, but instead one can use large deviations machinery rather than laws of large number asymptotics.

We will be interested in asymptotics of the characteristic function of $\xi$ for x=1/e. In particular $\mathbb{E}[e^{it\xi}]=\frac{B(xe^{it})}{B(x)}$, and it will be enough to clarify the behaviour of this as $t\rightarrow 0$. It’s easier to work with a relation analytic function

$\theta(x)=\sum_{k\ge 1} \frac{k^{k-1}x^k}{k!},$

ie the integral of B. What now feels like a long time ago I wrote a masters’ thesis on the subject of multiplicative coalescence, and this shows up as the generating function of the solutions to Smoluchowski’s equations with monodisperse initial conditions, which are themselves closely related to the Borel distributions. In any case, several of the early papers on this topic made progress by establishing that the radius of convergence is 1/e, and that $\theta(x)e^{-\theta(x)}=x$ everywhere where $|x|\le 1/e$. We want to consider x=1/e, for which $\theta=1$.

Note that $\mathbb{E}\xi = \frac{\theta(x)}{B(x)}$, so we will make progress by relating $B(x),\theta(x)$ in two ways. One way involves playing around with contour integrals in a fashion that is clear in print, but involves quite a lot of notation. The second way is the Renyi relation which asserts that $\theta(x)=B(x)+\frac{\theta(x)^2}{2}$. We will briefly give a combinatorial proof. Observe that after multiplying through by factorials and interpreting the square of a generating function, this is equivalent to

$k^{k-1} = k^{k-2} + \frac12 \sum_{\substack{l+m=k\\l,m\ge 1}} l^{l-1}m^{m-1}\binom{k}{l},$

for all k. As we might expect from the appearance of this equality, we can prove it using a bijection on trees. Obviously on the LHS we have the size of the set of rooted trees on [k]. Now consider the set of pairs of disjoint rooted trees with vertex set [k]. This second term on the RHS is clearly the size of this set. Given an element of this set, join up the two roots, and choose whichever root was not initially in the same tree as 1 to be the new root. We claim this gives a bijection between this set, and the set of rooted trees on [k], for which 1 is not the root. Given the latter, the only pair of trees that leads to the right rooted tree on [k] under this mapping is given by cutting off the unique edge incident to the root that separates the root and vertex 1. In particular, since there is a canonical bijection between rooted trees for which 1 is the root, and unrooted trees (!), we can conclude the Renyi relation.

The Renyi relation now gives $\mathbb{E}\xi = \frac{\theta(x)}{B(x)}=2$ when x=1/e. If we wanted, we could show that the variance is infinite, which is not completely surprising, as the parameter x lies on the radius of convergence of the generating function.

Now, playing around with contour integrals, and being careful about which strands to take leads to the asymptotic as $t\rightarrow 0$

$\mathbb{E}[ e^{it\xi}] = 1+2it + \frac{2}{3}i |2t|^{3/2} (i\mathrm{sign}(t))^{3/2} + o(|t|^{3/2}).$

So from this, we can show that the characteristic function of the rescaled centred partial sum $\frac{\xi_1+\ldots+\xi_N-2N}{bN^{2/3}}$ converges to $\exp(-|t|^{3/2}\exp(\frac{i\pi}{4}\mathrm{sign} t))$, where $b= (32/9)^{1/3}$ is a constant arising out of the previous step.

We recognise this as the characteristic function of the stable distribution with parameters 3/2 and -1. In particular, we know now that $\xi$ is in the domain of attraction for a stable-3/2 distribution. If we wanted a version of the central limit theorem for such partial sums, we could have that, but since we care about the partial sums of the $\xi_i$s taking a specific value, rather than a range of values on the scale of the fluctuations, we actually need a local limit theorem.

To make this clear, let’s return to the simplest example of the CLT, with some random variables with mean $\mu$ and variance $\sigma^2<\infty$. Then the partial sums satisfy

$\mathbb{P}(\mu N + a\sigma\sqrt{N} \le S_N \le \mu_N+b\sigma\sqrt{N}) \rightarrow \int_a^b f_{\mathcal N}(x)dx,$

as $N\rightarrow\infty$. But what about the probability of $S_N$ taking a particular value m that lies between $\mu N+a\sigma \sqrt{N}$ and $\mu N + b\sigma \sqrt{N}$? If the underlying distribution was continuous, this would be uncontroversial – considering the probability of lying in a range that is smaller than the scale of the CLT can be shown in a similar way to the CLT itself. A local limit theorem asserts that when the underlying distribution is supported on some lattice, mostly naturally the integers, then these probabilities are in the limit roughly the same whenever m is close to $\mu N+a\sigma\sqrt{N}$.

In this setting, a result of Ibragimov and Linnik that I have struggled to find anywhere in print (especially in English) gives us local limit theory for integer-supported distributions in the domain of attraction of a stable distribution. Taking p( ) to be the density of this distribution, we obtain

$bm^{2/3}\mathbb{P}(\xi_1+\ldots+\xi_m=n) - p(\frac{n-2m}{b m^{2/3}}) \rightarrow 0$

as $n\rightarrow\infty$, uniformly on any set of m for which $z= \frac{n-2m}{bm^{2/3}}$ is bounded. Conveniently, the two occurrences of b clear, and Britikov obtains

$a_{n,m} = (1+o(1)) \frac{\sqrt{2\pi} n^{n-1/6}}{2^{n-m}(n-m)!} p(\frac{n-2m}{n^{2/3}},$

uniformly in the same sense as before.

# The Yule Process

The second problem sheet for classes on the Applied Probability course this term features a long question about the Yule process. This is probably the simplest example of a birth process. It’s named for the British statistician George Udny Yule, though some sources prefer to call it the Yule-Furry process for the American physicist Wendell Furry who used it as a model of a radioactive reaction.

The model is straightforward. At any time there is some number of individuals in the population, and each individual gives birth to an offspring at constant rate $\lambda$, independently from the rest of the population. After a birth has happened, the parent and child evolve independently. In the notation of general birth processes, the birth rate when there are n individuals is $\lambda_n=\lambda n$.

Note that if we start with two or more individuals, the sizes of the two or more families of descendents evolve as a continuous-time Polya’s urn. The arrivals process speeds up with time, but the jump chain is exactly Polya’s urn. Unsurprisingly, the Yule process can be found embedded in preferential attachment models, and other processes which are based around Polya’s urn with extra information.

This is a discrete, random version of exponential growth. Since the geometric distribution is the discrete analogue of the exponential distribution, we probably shouldn’t be surprised to learn that this is indeed the distribution of the process at some fixed time t, when it is started from a single original ancestor. This is all we care about, since the numbers of descendents from each different original ancestors are independent. In general, the distribution of the population size at some fixed time will be negative binomial, that is, a sum of IID geometric distributions.

The standard method here is to proceed using generating functions. Conditioning on the first splitting time gives two independent copies of the original process over a shorter time-scale. One derives an ODE in time for the generating function evaluated at any particular value z. This can be solved uniquely for each z, and patching together gives the generating function of the distribution at any specific time t, which can be seen to coincide with the corresponding generating function of the geometric distribution with parameter $e^{-\lambda t}$.

So we were trying to decide whether there might be a more heuristic argument for this geometric distribution. The method we came up with is not immediate, but does justify the geometric distribution in a couple of steps. First, we say that the birth times are $T_2,T_3,\ldots$, so between times $[T_n,T_{n+1})$ there are n individuals, with $T_1:=0$ for concreteness. Then by construction of the birth process, $T_{n+1}-T_n\stackrel{d}{=}\mathrm{Exp}(\lambda n)$.

We now look at these ‘inter-birth times’ backwards, starting from $T_{n+1}$. Note that $\mathrm{Exp}(\lambda n)$ is the distribution of the time for the first of n IID $\mathrm{Exp}(\lambda)$ clocks to ring. But then, looking backwards, the next inter-birth time is thus the distribution of the time for one of (n-1) IID $\mathrm{Exp}(\lambda)$ clocks to ring. So by memorylessness of the exponential distribution (discussed at great length on the first problem sheet), we can actually take these (n-1) clocks to be exactly those of the original n clocks which did not ring first. Continuing this argument, we can show that the first (in the original time direction) inter-birth time corresponds to the time spent waiting for the final clock to ring. Rewriting this observation formally:

$T_{n+1}\stackrel{d}{=}\max\{X_i : X_1,\ldots,X_n\stackrel{\text{iid}}{\sim}\mathrm{Exp}(\lambda)\}.$ (*)

To return to justifying the geometric form of the distribution, we need to clarify the easiest relationship between the population size at a fixed size and these birth times. As we are aiming for the geometric distribution, the probability of the event $\{X_t>n\}$ will be most useful. Clearly this event is the same as $\{T_{n+1}, and from the description involving maxima of IID exponentials, this is easy to compute as $(1-e^{-\lambda t})^n$, which is exactly what we want.

There are two interesting couplings hidden in these constructions. On closer inspection they turn out to be essentially the same from two different perspectives.

We have specified the distribution of $T_n$ at (*). Look at this distribution on the right hand side. There is a very natural way to couple these distributions for all n, namely to take some infinite sequence $X_1,X_2,\ldots$ of IID $\mathrm{Exp}(\lambda)$ random variables, then use initial sequences of these to generate each of the $T_n$s as described in (*).

Does this coupling correspond to the use of these IID RVs in the birth process? Well, in fact it doesn’t. Examining the argument, we can see that $X_1$ gives a different inter-birth time for each value of t in the correspondence proposed. Even more concretely, in the birth process, almost surely $T_{n+1}>T_n$ for each n. This is not true if we take the canonical coupling of (*). Here, if $X_n<\max\{X_1,\ldots,X_{n-1}\}$, which happens with high probability for large n, we have $T_{n+1}=T_n$ in the process of running maxima.

Perhaps more interestingly, we might observe that this birth process gives a coupling of the geometric distributions. If we want to recover the standard parameterisation of the geometric distribution, we should reparameterise time. [And thus generate an essentially inevitable temptation to make some joke about now having a Yule Log process.]

Let’s consider what the standard coupling might be. For a binomial random variable, either on [n] or some more exotic set, as in percolation, we can couple across all values of the parameter by constructing a family independent uniform random variables, and returning a 1 if $U_i>1-p$ and so on, where p is the parameter of a specific binomial realisation.

We can do exactly the same here. A geometric distribution can be justified as the first success in a sequence of Bernoulli trials, so again we can replace the relevant Bernoulli distribution with a uniform distribution. Take $U_1,U_2,\ldots$ to be IID U[0,1] random variables. Then, we have:

$X_t=\stackrel{d}{=}\bar X_t:= \max\{n: U_1,\ldots,U_{n-1}\ge e^{-\lambda t}\}.$

The equality in distribution holds for any particular value of t by constructing. But it certainly doesn’t hold uniformly in t. Note that if we define $\bar X_t$ as a process, then typically the jumps of this process will be greater than 1, which is forbidden in the Yule process.

So, we have seen that this Yule process, even though its distribution at a fixed time has a standard form, provides a coupling of such distributions that is perhaps slightly surprising.

# Random Maps 1 – Towards the Schaeffer Bijection

I have spent the past ten days in Saint Flour, an inaccessible but picturesque town in the rolling hills of Cantal, in the middle of France, and venue for perhaps the most notable summer school in probability. My highlight has been the course ‘Aspects of Random Maps’ delivered by Gregory Miermont, and I thought I should write a few posts about points of interest encountered during the lectures and private study.

A map is an embedding of a connected graph onto a surface. We typically do not care about the nature of this embedding up to homeomorphisms of the surface which preserve orientations of the map. One advantage for doing this is that the set of maps now might be countable, and the set of maps with n edges might be finite. This can be proved by considering a map to be obtained by glueing together polygonal faces. Some potential glueings are impossible, and some are equivalent, but for a fixed number of edges, the number of such sets of polygons and a possible glueing is finite. In fact we can be much more precise than this about how to describe precisely the legal glueing through a triple of permutations, but I won’t discuss this here.

I haven’t yet given a complete definition of a map. We want a typical large map, that is a map with a large number of vertices and edges, to be topologically roughly the same as the surface it is embedded into. In particular, the map needs to encode the geometric features of the surface. So a small triangle on the surface of a torus should not be considered a map. To rigorise this, we demand that any face of a map should be a topological disc, in particular, it should be simply connected. Since the torus itself is not simply connected, this excludes our triangle example. Note a single vertex on a torus is also excluded.

Although it goes against the usual order of definitions, it might be helpful to think of a map as an embedding which satisfies Euler’s formula: V – E + F = 2 -2g, where g is the genus of the surface. For a connected planar graph, induction is on the number of edges and vertices is the typical way to prove this result. The inductive step works the same on a more general surface, but it is less clear what the base case should be. Another consequence of the definition is that we should work on the sphere rather than the plane. From now on, this is our surface of interest.

We begin by considering $\mathcal{M}_n$ to be the family of rooted plane maps with n edges. The root is a distinguished oriented edge. Our aim is to count the size of this set.

Before doing this, we digress onto the topic of rooted plane trees. Note that any (rooted) tree in the classical sense is planar, but in a rooted plane tree, we also specify the geometric ordering of the offspring. For example, if the root has two offspring, of which one has precisely one offspring and the other has none, we consider these as two separate cases.

So now, if we denote by $a_k$ the number of plane trees with k edges, we can define a generating function via $A(z):=\sum_{k\ge 0} a_k z^k$. If the root vertex v has no offspring, this gives one possibility corresponding to k=0. Otherwise, there is a well-defined left-most offspring of the root, called u. Then u and its descendents form a plane tree, and v and its descendents apart from those through u also form a plane tree. So after accounting for the edge between u and v, we obtain

$A(z)=1+zA(z)^2,$

whenever A(z) is defined. We now can apply whichever is our favourite method of showing that this is the generating function of the Catalan numbers, $a_k=\frac{1}{k+1}\binom{2k}{k}$.

There is a more complicated version of this generating function argument due to Tutte that allows us to enumerate $\mathcal{M}_n$. It is convenient to work with a second variable in the generating function that encodes the degree of the root face. The resulting equation of generating functions is less well-known but using the Lagrange inversion formula gives the explicit expression

$|\mathcal{M}_n|=\frac{2}{n+2}\cdot \frac{3^n}{n+1}\binom{2n}{n}.$

Although there are extra terms, this motivates seeking a bijection between maps, and some version of rooted plane trees, perhaps decorated with some extra information. As in many cases, this will turn out to be possible. The bijection we end up with will not just help us enumerate the maps, but will also allow us to control a lot more information about distances in the map, which will be particularly useful when we try to take limits.

The first observation is that given a map, we can construct a dual map, by placing fresh vertices somewhere in the middle of each face, and joining a pair of these if the corresponding faces in the original graph share an edge.

Alternatively, we can place the same fresh vertices in the middle of each face, then join each new vertex to an original vertex, if that original vertex lies on the face corresponding to the new vertex. If you focus in on an original edge, it is clear that it is now surrounded by a ‘diamond’ (if you’ve drawn the diagram in a natural way) of new edges. Removing the original edges thus leaves us with a quadrangulation. This procedure is called the ‘trivial bijection’ between $\mathcal{M}_n$ and $\mathcal{Q}_n$, the family of rooted quadrangulations with n faces. Note that the root in such a quadrangulation is an identified directed edge, rather than a vertex. We haven’t yet specified how to describe the root of the resulting quadrangulation. It suffices to take the first new edge which lies clockwise of the root edge in the original graph, seen from the ‘tail’ of the root, which is of course oriented.

In this, and the bijections which follow, the natural questions to ask are: a) is the inverse obvious? and b) what happens to self-loops and isthmuses? Here, the inverse really is obvious. Any quadrangulation is bipartite, hence two-colourable, so we need to fix one colour and join the two vertices of that colour within each face to recover the original graph. The root tells us which colour we need to take. As for the second question, first we should say that an isthmus is an edge which has the same face on both sides. This causes no problems in this particular bijection. For the self-loops, we get a sort of Pacman-like quadrangle, with two ‘outer-edges’ between the same two vertices, and an edge between one of the outer vertices and some internal vertex. This edge contributes twice to the degree of the face.

The upshot of this is that for a simple enumeration, it suffices to prove that $|\mathcal{Q}_n|=\frac{2}{n+2}\cdot \frac{3^n}{n+1}\binom{2n}{n}$. This may not look like we have achieved much, but we can now apply Euler’s formula to any quadrangulation in this set to deduce that the number of vertices present is n+2. If we consider the set $\mathcal{Q}_n^*$, where now we identify a particular vertex $v_*$ in the quadrangulation, it suffices to prove that $|\mathcal{Q}_n^*|=2.3^n \cdot a_n$, where $a_n$ is the nth Catalan number as before. Now we have the most efficient setup to look for a bijection with some type of decorated plane tree as discussed before.

# Avoiding Mistakes in Probability Exams

Over the past week, I’ve given several tutorials to second year undergraduates preparing for upcoming papers on probability and statistics. In particular, I’ve now seen a lot of solutions to a lot of past papers and specimen questions, and it’s worthwhile to consider some of the typical mistakes students can make on these questions. Of course, as with any maths exam, there’s always the possibility of a particularly subtle or involved question coming up, but if the following three common areas of difficulty can be avoided, you’re on track for doing well.

Jacobians

In a previous course, a student will learn how to calculate the pdf of a function of a random variable. Here, we move onto the more interesting and useful case of finding the (joint) density of function(s) of two or more random variables. The key thing to remember here is that manipulating pdfs is not a strange arbitrary exercise – it is just integration. It is rarely of interest to consider the value of a pdf at a single point. We can draw meaningful conclusions from a pdf or from comparison of two pdfs by integrating them.

Then the question of substituting for new random variables is precisely integration by substitution, which we are totally happy with in the one-dimensional case, and should be fairly happy with in the two-dimensional case. To get from one joint density to another, we multiply by the absolute value of the Jacobian. To ensure you get it right, it makes sense to write out the informal infinitesimal relation

$f_{U,V}(u,v) du dv = f_{X,Y}(x,y)dx dy.$

This is certainly relevant if we put integral signs in front of both sides, and explains why you obtain $f_{U,V} = \frac{d(x,y)}{d(u,v)} f_{X,Y}$ rather than the other way round. Note though that if $\frac{d(u,v)}{d(x,y)}$ is easier to calculate for some reason, then you can evaluate this and take the inverse, as your functions will almost certainly be locally bijective almost everywhere.

It is important to take the modulus of the Jacobian, since densities cannot be negative! If this looks like a fudge, then consider the situation in one dimension. If we substitute for $x\mapsto f(x)=1-x$, then f’ is obviously negative, BUT we also end up reversing the order of the bounds of the integral, eg [1/3, ¾] will become [2/3,1/4]. So we have a negative integrand (after multiplying by f'(x)) but bounds in the ‘wrong’ order. These two factors of -1 will obviously cancel, so it suffices just to multiply by |f'(x)| at that stage. It is harder to express in words, but a similar relation works for the Jacobian substitution.

You also need to check where the new joint density is non-zero. Suppose X, Y are supported on [0,1], then when we write $f_{X,Y}(x,y)$ we should indicate that it is 0 off this region, either by splitting into cases, or adding the indicator function $1_{\{x,y\in[0,1]\}}$ as a factor. This is even more important after substitutions, as the range of the resulting random variables might be less obvious than the originals. Eg with X,Y as above, and $U=X^2, V=X/Y$, the resulting pdf will be non-zero only when $u\in[0,1], v\ge \sqrt{u}$. Failing to account for this will often lead to ludicrous answers. A general rule is that you can always check that any distribution you’ve produced does actually integrate to one.

Convergence using MGFs

There are two main reasons to use MGFs and PGFs. The first is that they behave nicely when applied to (possibly random) sums of independent random variables. The independence property is crucial to allow splitting of the MGF of the sum into the product of MGFs of the summands. Of course, implicit in this argument is that MGFs determine distributions.

A key theorem of the course is that this works even in the limit, so you can use MGFs to show convergence in distribution of a family of distributions. For this, you need to show that the MGFs converge pointwise on some interval [-a,a] around 0. (Note that the moments of the distribution are given by the family of derivatives at 0, as motivation for why this condition might be necessary.) Normally for such questions, you will have been asked to define the MGF earlier in the question, and probably will have found the MGF of a particular distribution or family of distributions, which might well end up appearing as the final answer.

Sometimes such an argument might involve substituting in something unusual, like t/N, rather than t, into a known MGF. Normally a Taylor series can be used to show the final convergence result. If you have a fraction, try to cancel terms so that you only have to evaluate one Taylor series, rather than lots.

Using the Markov Property

The Markov property is initially confusing, but once we become comfortable with the statement, it is increasingly irritating to have to answer the question: “show that this process has the Markov property.” This question is irritating because in most cases we want to answer: “because it obviously does!” Which is compelling, but unlikely to be considered satisfactory in a mathematics exam. Normally we observe that the random dynamics of the next step are a function only of the present location. Looking for the word ‘independent’ in the statement of the process under discussion is a good place to start for any argument along these lines.

The most developed example of a Markov process in this course is the Poisson process. I’ve written far too much about this before, so I won’t do so again, except to say this. When we think of the Poisson process, we generally have two thoughts going through our minds, namely the equivalent definitions of IID exponential inter-arrival times, and stationary, Poisson increments (or the infinitesimal version). If we draw a sketch of a sample trajectory of this process, we can label everything up and it is clear how it all fits together. But if you are asked to give a definition of the Poisson process $(N_t)$, it is inappropriate to talk about inter-arrival times unless you define them in terms of $N_t$, since that is the process you are actually trying to define! It is fine to write out

$T_k:=\min\{t: N_t=k\},\quad N_t=\max\{k: Y_1+Y_2+\ldots+Y_k\le t\}$

but the relation between the two characterisations of the process is not obvious. That is why it is a theorem of the course.

We have to be particularly careful of the difference in definition when we are calculating probabilities of various events. A classic example is this. Find the distribution of $N_2$, conditional on $T_3=1$. It’s very tempting to come up with some massive waffle to argue that the answer is 3+Po(1). The most streamlined observation is that the problem is easy if we are conditioning instead on $N_1=3$. We just use the independent Poisson increments definition of $(N_t)$, with no reference to inter-arrival times required. But then the Markov property applied at time 1 says that the distribution of $(N_2)$ depends only on the value of $N_1$, not on the process on the interval [0,1). In a sense, the condition that $T_3=1$ is giving us extra information on the behaviour of the process up to time 1, and the Markov property, which we know holds for the Poisson process, asserts precisely that the extra information doesn’t matter.

# Generating Functions for Dice

So last week I was writing an article for Betting Expert about laws of large numbers, and I was trying to produce some representations of distributions to illustrate the Weak LLN and the Central Limit Theorem. Because tossing a coin feels too simplistic, and also because the natural state space for this random variable, at least verbally, is not a subset of the reals, I decided to go for dice instead. So it’s clear what the distribution of the outcome of a single dice roll is, and with a bit of thought or a 6×6 grid, we can work out the distribution of the average of two dice rolls. But what about 100 rolls? Obviously, we need large samples to illustrate the laws of large numbers! In this post, we discuss how to calculate the distribution of the sample mean of n dice rolls.

First we observe that the total set of outcomes of n dice rolls is $6^n$. The sum of the outcomes must lie between n and 6n inclusive. The distribution of the sum and the distribution of the sample mean are equivalent up to dividing by n. The final observation is that because the total number of outcomes has a nice form, we shouldn’t expect it to make any difference to the method if we calculate the probability of a given sum, or the number of configurations giving rise to that sum.

Indeed, tying in nicely with the first year probability course, we are going to use generating functions, and there is no difference in practice between the probability generating function, and the combinatorial generating function, if the underlying mechanism is a uniform choice. Well, in practice, there is a small difference, namely a factor of 6 here. The motivation for using generating functions is clear: we are considering the distribution of a sum of independent random variables. This is pretty much exactly why we bother to set up the machinery for PGFs.

Anyway, since each of {1,2,…,6} is equally likely, the GF of a single dice roll is

$x+x^2+\ldots+x^6=x\cdot \frac{1-x^6}{1-x}.$

So, if we want the generating function of the sum of n independent dice rolls, we can obtain this by raising the above function to the power n. We obtain

$x^n(1-x^6)^n(1-x)^{-n}.$

Note the factor of $x^n$ at the beginning arises because the minimum value of the sum is n. So to work out the number of configurations giving rise to sum k, we need to evaluate the coefficient of $x^k$. We can deal with $(1-x^6)^n$ fairly straightforwardly, but some thought it required regarding whether it’s possible to do similar job on $(1-x)^{-n}$.

We have to engage briefly with what is meant by a binomial coefficient. Note that

$\binom{x}{k}=\frac{x(x-1)\ldots(x-k+1)}{1\cdot\ldots\cdot k}$

is a valid definition even when x is not a positive integer, as it is simply a degree k polynomial in x. This works if x is a general positive real, and indeed if x is a general negative real. At this stage, we do need to keep k a positive integer, but that’s not a problem for our applications.

So we need to engage with how the binomial theorem works for exponents that are not positive integers. The tricky part with the standard expression as

$(a+b)^n=\binom{n}{0}a^n+\ldots + \binom{n}{n}b^n,$

is that the attraction of this symmetry in a and b prompts us to work in more generality than is entirely necessary to state the result. Note if we instead write

$(1+x)^n=1+\binom{n}{1}x+\binom{n}{2}x^2+\ldots,$

we have unwittingly described this finite sum as an infinite series. It just happens that all the binomial coefficients apart from the first (n+1) are zero. The nice thing about this definition is that it might plausibly generalise to non-integer or negative values of n. And indeed it does. I don’t want to go into the details here, but it’s just a Taylor series really, and the binomial coefficients are set up with factorials in the right places to look like a Taylor series, so it all works out.

It is also worth remarking that it follows straight from the definition of a negative binomial coefficient, that

$\binom{-n}{j}=(-1)^j \binom{n+j-1}{j}.$

In any case, we can rewrite our expression for the generating function of the IID sum as

$x^n\left[\sum_{k=0}^n \binom{n}{k}(-1)^k x^{6k}\right]\left[\sum_{j\ge 0} \binom{-n}{j}(-1)^j x^j\right]$

By accounting for where we can gather exponents from each bracket, we can evaluate the coefficient of $x^m$ as

$\sum_{6k+j=m+n}\binom{n}{k}\binom{n+j-1}{j}(-1)^k.$

Ie, k in the sum takes values in $\{0,1,\ldots, \lfloor \frac{m+n}{6}\rfloor\}$. At least in theory, this now gives us an explicit way to calculate the distribution of the average of multiple dice rolls. We have to be wary, however, that many compilers will not be happy dealing with large binomial coefficients, as the large factorials grow extremely rapidly. An approximation using logs is likely to be more tractable for larger settings.

Anyway, I leave you with the fruits of my labours.

Related articles

# Random Mappings for Cycle Deletion

In previous posts here and here, I’ve talked about attempts to describe a cycle deleting process. We amend the dynamics of the standard random graph process by demanding that whenever a cycle is formed in the graph we delete all the edges that lie on the cycle. The aim of this is to prevent the system growing giant components, and perhaps give a system that displays the characteristics of self-organised criticality. In the posts linked to, we discuss the difficulties caused by the fact that the tree structure of components in such a process is not necessarily uniform.

Today we look in the opposite direction. It gives a perfectly reasonable model to take a multiplicative coalescent with quadratic fragmentation (this corresponds to cycle deletion, since there are $O(n^2)$ edges which would give a cycle if added to a tree on n vertices) and a fragmentation kernel corresponding to adding an extra edge to a uniform spanning tree on n vertices then deleting the edges of the unique cycle. The focus of the rest of this post, we consider this fragmentation mechanism, in particular thinking about how we would sample from it most practically. Not least, without going through Prufer codes or some other clever machinery, it is not trivial to sample a uniform spanning tree.

First, we count the number of unicyclic graphs on n labelled vertices. If we know that the vertices on the cycle are $v_1,\ldots,v_k$, then the number of cycles with an identified edge is

$u_1=1,\quad u_k=\frac{k!}{2},\, k\ge 2.$

If we know that the tree coming off the cycle from vertex v_i has size m, say, then each of the possible rooted labelled trees with size m is equally likely. So taking $w_j=j^{j-1}$, the number of rooted trees on j labelled vertices, we get $B_n(u_\bullet,w_\bullet)$ for the number of such unicyclic graphs on [n]. Recall $B_n$ is the nth Bell polynomial, which gives the size of a compound combinatorial structure, where we have some structure on blocks and some other structure within blocks. Then the random partition of [n] given by the tree sizes has the distribution $\text{Gibbs}_n(u_\bullet,w_\bullet)$.

Consider now a related object, the so-called random mapping digraph. What follows is taken from Chapter 9 of Combinatorial Stochastic Processes. We can view any mapping $M_n:[n]\rightarrow[n]$ as a digraph where every vertex has out-degree 1. Each such digraph contains a collection of directed cycles, supported on those elements x for which $M_n^k(x)=x$ for some k. Such an element x is called a cyclic point. Each cyclic point can be viewed as the root of a labelled tree.

In an identical manner to the unicyclic graph, the sizes of these directed trees in the digraph decomposition of a uniform random mapping is distributed as $\text{Gibbs}_n(\bullet !,w_\bullet)$. So this is exactly the same as the cycle deletion kernel, apart from in the probability that the partition has precisely one block. In practice, for large n, the probability of this event is very small in both cases. And if we wanted to sample the cycle deletion kernel exactly, we could choose the trivial partition with some probability p, and otherwise sample from the random mapping kernel, where p is chosen such that

$p+\frac{1-p}{B_n(\bullet !, w_\bullet)}=\frac{1}{B_n(u_\bullet,w_\bullet)}.$

At least we know from the initial definition of a random mapping, that $B_n(\bullet !,w_\bullet)=n^n$. The number of unicyclic graphs with an identified edge is less clear. It turns out that the partition induced by the random mapping has a nice limit, after rescaling, as the lengths of excursions away from 0 in the standard Brownian bridge on [0,1].

The time for a fuller discussion of this sort of phenomenon is in the context of Poisson-Dirichlet distributions, as the above exchangeable partition turns out to be PD(1/2,1/2). However, for now we remark that the jumps of a subordinator give a partition after rescaling. The case of a stable subordinator is particularly convenient, as calculations are made easier by the Levy-Khintchine formula.

A notable example is the stable-1/2 subordinator, which can be realised as the inverse of the local time process at zero of a Brownian motion. The jumps of this process are then the excursion lengths of the original Brownian motion. A calculation involving the tail of the w_j’s indicates that 1/2 is the correct parameter for a subordinator to describe the random mappings. Note that the number of blocks in the partition corresponds to the local time at zero of the Brownian motion. (This is certainly not obvious, but it should at least be intuitively clear why a larger local time roughly indicates more excursions which indicates more blocks.)

So it turns out, after checking some of the technicalities, that it will suffice to show that the rescaled number of blocks in the random mapping partition $\frac{|\Pi_n|}{\sqrt{n}}$ converges to the Raleigh density, which is a size-biased Normal random variable (hence effectively first conditioned to be positive), and which also is the distribution of the local time of the standard Brownian bridge.

After that very approximate description, we conclude by showing that the distribution of the number of blocks does indeed converge as we require. Recall Cayley’s formula $kn^{n-k-1}$ for the number of labelled forests on [n] with a specified set of k roots. We also need to know how many labelled forests there are with any set of roots. Suppose we introduce an extra vertex, labelled 0, and connect it only to the roots of a rooted labelled forest on [n]. This gives a bijection between unlabelled trees on {0,1,…,n} and labelled forests with a specified set of roots on [n]. So we can use Cayley’s original formula to conclude there are $(n+1)^{n-1}$ such forests. We can do a quick sanity check that these are the same, which is equivalent to showing

$\sum_{k=1}^n k n^{-k-1}\binom{n}{k}=\frac{1}{n}(1+\frac{1}{n})^{n-1}.$

This odd way of writing it is well-motivated. The form of the LHS is reminiscent of a generating function, and the additional k suggests taking a derivative. Indeed, the LHS is the derivative

$\frac{d}{dx}(1+x)^n,$

evaluated at $\frac{1}{n}$. This is clearly the same as the RHS.

That said, having established that the random mapping partition is essentially the same, it is computationally more convenient to consider that instead. By the digraph analogy, we again need to count forests with k roots on n vertices, and multiply by the number of permutations of the roots. This gives:

$\mathbb{P}(|\Pi_n|=k)=\frac{kn^{n-k-1}\cdot k! \binom{n}{k}}{n^n}=\frac{k}{n}\prod_{i=1}^{k-1}\left(1-\frac{i}{n}\right).$

Now we can consider the limit. Being a bit casual with notation, we get:

$\lim \mathbb{P}(\frac{|\Pi_n|}{\sqrt{n}}\in dl)\approx \sqrt{n}dl \mathbb{P}(|\Pi_n|=l\sqrt{n}).$

Since the Raleigh distribution has density $l\exp(-\frac12 l^2)dl$, it suffices for this informal verification to check that

$\prod_{i=1}^{l\sqrt{n}}(1-\frac{i}{n})\approx \exp(-\frac12 l^2).$ (*)

We take logs, so the LHS becomes:

$\log(1-\frac{1}{n})+\log(1-\frac{2}{n})+\ldots+\log(1-\frac{l\sqrt{n}}{n}).$

If we view this as a function of l and differentiate, we get

$d(LHS)=\sqrt{n}dl \log (1-\frac{l}{\sqrt{n}})\approx \sqrt{n}dl \left[-\frac{l}{\sqrt{n}}-\frac{l^2}{2n}\right]\approx -ldl.$

When l is zero, the LHS should be zero, so we can obtain the desired result (*) by integrating then taking an exponential.

# Generating Functions for the IMO

The background to this post is that these days I find myself using generating functions all the time, especially for describing the stationary states of various coalescence-like processes. I remember meeting them vaguely while preparing for the IMO as a student. However, a full working understanding must have eluded me at the time, as for Q5 on IMO 2008 in Madrid I had written down in big boxes the two statements involving generating functions that immediately implied the answer, but failed to finish it off. The aim of this post is to help this year’s team avoid that particular pitfall.

What are they?

I’m going to define some things in a way which will be most relevant to the type of problems you are meeting now. Start with a sequence $(a_0,a_1,a_2,\ldots)$. Typically these will be the sizes of various combinatorial sets. Eg a_n = number of partitions of [n] with some property. Define the generating function of the sequence to be:

$f(x)=\sum_{k\geq 0}a_k x^k=a_0+a_1x+a_2x^2+\ldots.$

If the sequence is finite, then this generating function is a polynomial. In general it is a power series. As you may know, some power series can be rather complicated, in terms of where they are defined. Eg

$1+x+x^2+x^3+\ldots=\frac{1}{1-x},$

only when |x|<1. For other values of x, the LHS diverges. Defining f over C is fine too. This sort of thing is generally NOT important for applications of generating functions to combinatorics. To borrow a phrase from Wilf, a generating function is a convenient clothesline’ on which to hang a sequence of numbers.

We need a notation to get back from the generating function to the coefficients. Write $[x^k]g(x)$ to denote the coefficient of $x^k$ in the power series g(x). So, if $g(x)=3x^3-5x^2+7$, then $[x^2]g(x)=-5$. It hopefully should never be relevant unless you read some other notes on the topic, but the notation $[\alpha x^2]g(x):=\frac{[x^2]g(x)}{\alpha}$, which does make sense after a while.

How might they be useful?

Example: binomial coefficients $a_k=\binom{n}{k}$ appear, as the name suggests, as coefficients of

$f_n(x)=(1+x)^n=\sum_{k=0}^n \binom{n}{k}x^k.$

Immediate consequence: it’s trivial to work out $\sum_{k=0}^n \binom{n}{k}$ and $\sum_{k=0}^n(-1)^k \binom{n}{k}$ by substituting $x=\pm 1$ into f_n.

Less obvious consequence. By considering choosing n from a red balls and b blue balls, one can verify

$\binom{a+b}{n}=\sum_{k=0}^n \binom{a}{k}\binom{b}{n-k}.$

We can rewrite the RHS as

$\sum_{k+l=n}\binom{a}{k}\binom{b}{l}.$

Think how we calculate the coefficient of $x^n$ in the product $f(x)g(x)$, and it is now clear that $\binom{a+b}{n}=[x^n](1+x)^{a+b}$, while

$\sum_{k+l=n}\binom{a}{k}\binom{b}{l}=[x^n](1+x)^a(1+x)^b,$

so the result again follows. This provides a good slogan for generating functions: they often replicate arguments via bijections, even if you can’t find the bijection.

Useful for? – Multinomial sums

The reason why the previous argument for binomial coefficients worked nicely is because we were interested in the coefficients, but had a neat expression for the generating function as a polynomial. In particular, we had an expression

$\sum_{k+l=n}a_k b_l.$

This is always a clue that generating functions might be useful. This is sometimes called a convolution.

Exercise: prove that in general, if f(x) is the generating function of (a_k) and g(x) the generating function of (b_l), then f(x)g(x) is the generating function of $\sum_{k+l=n}a_kb_l$.

Even more usefully, this works in the multinomial case:

$\sum_{k_1+\ldots+k_m=n}a^{(1)}_{k_1}\ldots a^{(m)}_{k_m}.$

In many applications, these $a^{(i)}$s will all be the same. We don’t even have to specify how many k_i’s there are to be considered. After all, if we want the sum to be n, then only finitely many can be non-zero. So:

$\sum_{m}\sum_{k_1+\ldots+k_m=n}a_{k_1}\ldots a_{k_m}=[x^n]f(x)^n=[x^n]f(x)^\infty,$

provided f(0)=1.

Useful when? – You recognise the generating function!

In some cases, you can identify the generating function as a standard’ function, eg the geometric series. In that case, manipulating the generating functions is likely to be promising. Here is a list of some useful power series you might spot.

$1+x+x^2+\ldots=\frac{1}{1-x},\quad |x|<1$

$1+2x+3x^2+\ldots=\frac{1}{(1-x)^2},\quad |x|<1$

$e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\ldots$

$\cos x=1-\frac{x^2}{2!}+\frac{x^4}{4!}\pm\ldots$

Exercise: if you know what differentiation means, show that if f(x) is the gen fn of (a_k), then xf'(x) is the gen fn of ka_k.

Technicalities: some of these identities are defined only for certain values of x. This may be a problem if they are defined at, say, only a single point, but in general this shouldn’t be the case. In addition, you don’t need to worry about differentiability. You can definition differentiation of power series by $x^n\mapsto nx^{n-1}$, and sort out convergence later if necessary.

Useful for? – Recurrent definitions

The Fibonacci numbers are defined by:

$F_0=F_1=1,\quad F_{n+1}=F_n+F_{n-1},\quad n\geq 1.$

Let F(x) be the generating function of the sequence F_n. So, for n=>1,

$[x^n]F(x)=[x^{n-1}]F(x)+[x^{n-2}]F(x)=[x^n](xF(x)+x^2F(x)),$

and F(0)=1, so we can conclude that:

$F(x)=1+(x+x^2)F(x)\quad\Rightarrow\quad F(x)=\frac{1}{1-x-x^2}.$

Exercise: Find a closed form for the generating function of the Catalan numbers, defined recursively by:

$C_n=C_0C_{n-1}+C_1C_{n-2}+\ldots+C_{n-1}C_0.$

Can you now find the coefficients explicitly for this generating function?

Useful for? – Partitions

Partitions can be an absolute nightmare to work with because of the lack of explicit formulae. Often any attempt at a calculation turns into a massive IEP bash. This prompts a search for bijective or bare-hands arguments, but generating functions can be useful too.

For now (*), let’s assume a partition of [n] means a sequence of positive integers $a_1\geq a_2\geq\ldots\geq a_k$ such that $a_1+\ldots+a_k=n$. Let p(n) be the number of partitions of [n].

(* there are other definitions, in terms of a partition of the set [n] into k disjoint but unlabelled sets. Be careful about definitions, but the methods often extend to whatever framework is required. *)

Exercise: Show that the generating function of p(n) is:

$\left(\frac{1}{1-x}\right)\left(\frac{1}{1-x^2}\right)\left(\frac{1}{1-x^3}\right)\ldots$

Note that if we are interested only in partitions of [n], then we don’t need to consider any terms with exponent greater than n, so if we wanted we could take a finite product instead.

Example: the mint group will remember this problem from the first session in Cambridge:

Show that the number of partitions of [n] with distinct parts is equal to the number of partitions of [n] with odd parts.

Rather than the fiddly bijection argument found in the session, we can now treat this as a simple calculation. The generating function for distinct parts is given by:

$(1+x)(1+x^2)(1+x^3)\ldots,$

while the generating function for odd parts is given by:

$\left(\frac{1}{1-x}\right)\left(\frac{1}{1-x^3}\right)\left(\frac{1}{1-x^5}\right)\ldots.$

Writing the former as

$\left(\frac{1-x^2}{1-x}\right)\left(\frac{1-x^4}{1-x^2}\right)\left(\frac{1-x^6}{1-x^3}\right)\ldots$

shows that these are equal and the result follows.

Other things – Multivariate Generating Functions

If you want to track a sequence in two variables, say $a_{m,n}$, then you can encode this with the bivariate generating function

$f(x,y):=\sum_{m,n\geq 0}a_{m,n}x^my^n.$

The coefficients are then extracted by $[x^ay^b]$ and so on. There’s some interesting stuff on counting lattice paths with this method.

Sums over arithmetic progressions via roots of unity

Note that we can extract both $\sum a_n$ and $\sum (-1)^na_n$ by judicious choice of x in f(x). By taking half the sum or half the difference, we can obtain

$a_0+a_2+a_4+\ldots=\frac12(f(1)+f(-1)),\quad a_1+a_3+a_5+\ldots=\frac12(f(1)-f(-1)).$

Can we do this in general? Yes actually. If you want $a_0+a_k+a_{2k}+\ldots$, this is given by:

$a_0+a_k+a_{2k}+\ldots+\frac{1}{k}\left(f(1)+f(w)+\ldots+f(w^{k-1})\right),$

where $w=e^{2\pi i/k}$ is a $k$th root of unity. Exercise: Prove this.

For greater clarity, first try the case k=4, and consider the complex part of the power series evaluated at +i and -1.

# Bell Polynomials

Trees with a single cycle

When counting combinatorial objects, it is often the case that we have two types of structure present at different levels. The aim of this post is to introduce the Bell polynomials, which provides the most natural notation for describing this sort of situation, and to mention some of the results that become easier to derive in this framework. This post is based on material and exercises from Chapter 1 of Jim Pitman’s book Combinatorial Stochastic Processes, which is great, and also available online here.

The structures that Bell polynomials enumerate are called composite structures in this account. Rather than give a definition right away, I shall give an example. An object I have been thinking about in the past few weeks are graphs on n vertices containing precisely one cycle. Some of the background for this has been explained in recent posts.

In a recent post on Prufer codes, I gave the classical argument showing that the number of trees on n vertices is $n^{n-2}$. We might consider a unicyclic graph to be a tree with an extra edge. But if we consider the number of ways to add a further vertex to a tree, we get

$n^{n-2}\left[\binom{n}{2}-(n-1)\right]=n^{n-2}\binom{n-1}{2}.$

Obviously, we have overcounted. If the single cycle in a graph has length k, then the graph has been counted exactly k times in this enumeration. But it is not obvious how many graphs have a single cycle of length k.

Instead, we stop worrying about exactly how many of these there are, as there might not be a simple expression anyway. As soon as we start using them in any actual argument, it will be useful to know various properties about the graphs, but probably not exactly how many there are.

Let’s focus on this single cycle of length k say. If we remove the edges of the cycle, we are left with a collection of trees. Why? Well if there was a cycle in the remaining graph, then the original graph would have had at least two cycles. So we have a collection of trees, unsurprisingly called a forest. Remembering that some of the trees may in fact be a single vertex (on the cycle), it is clear that there is a bijection between these trees and the vertices of the cycle in the obvious way. We can think of the graph as a k-cycle, dressed with trees.

Alternatively, once we have specified its size, we can forget about the k-cycle altogether. The graph is precisely defined by a forest of k trees on n vertices, with a specified root in each tree indicating which vertex lies on the cycle, and a permutation specifying the cyclic ordering of the trees. We can write this as

$N_{n,k}=(k-1)!\sum_{(A_1,\ldots,A_k)\in\mathcal{P}^k(n)}a_1^{a_1-1}\cdot\ldots\cdot a_k^{a_k-1},\quad \text{for }a_i=|A_i|,$

where $\mathcal{P}^k(n)$ is the number of partitions of [n] with k blocks. Remember that the blocks in a partition are necessarily unordered. This makes sense in this setting as the cyclic permutation chosen from the (k-1)! possibilities specifies the order on the cycle.

Bell Polynomials

The key point about this description is that there are two types of combinatorial structure present. We have the rooted trees, and also a cyclic ordering of the rooted trees. Bell polynomials generalise this idea. It is helpful to be less specific and think of partitions of [n] into blocks. There are $w_j$ arrangements of any block of size j, and there are $v_k$ ways to arrange the blocks, if there are k of them. Note that we assume $v_k$ is independent of the arrangements within the collection of blocks. So in the previous example, $w_j=j^{j-2}$, and $v_k=(k-1)!$. Pitman denotes these sequences by $v_\bullet,w_\bullet$. Then the (n,k)th partial Bell polynomial, $B_{n,k}(w_\bullet)$ gives the number of divisions into k blocks:

$B_{n,k}(w_\bullet):=\sum_{(A_1,\ldots,A_k)\in\mathcal{P}^k(n)}\prod_{i=1}^k w_{a_i}.$

The total number of arrangements is given by the Bell polynomial

$B_n(v_\bullet,w_\bullet):=\sum_{k=1}^n v_k B_{n,k}(w_\bullet).$

Here are some other examples of Bell polynomials. The Stirling numbers of the first kind $c_{n,k}$ give the number of permutations of [n] with k cycles. Since we don’t want to impose any combinatorial structure on the set of cycles, we don’t need to consider $v_\bullet$, and the number of ways to make a j-cycle from a j-block is $w_j=(j-1)!$, so $c_{n,k}:=B_{n,k}((\bullet-1)!)$. Similarly, the Stirling numbers of the second kind $S_{n,k}$ give the number of permutations of [n] into k blocks. Almost by definition, $S_{n,k}:=B_{n,k}(1^\bullet)$, where $1^\bullet$ is defined to be the sequence containing all 1s.

Applications

So far, this is just a definition that gives an abbreviated description for the sizes of several interesting sets of discrete objects. Having clean notation is always important, but there are further advantages of using Bell polynomials. I don’t want to reproduce the entirety of the chapter I’ve read, so my aim for this final section is to give a very vague outline of why this is a useful formulation.

Bell polynomials can be treated rather nicely via generating functions. The key to this is to take a sum not over partitions, but rather over ordered partitions, which are exactly the same, except now we also care about the order of the blocks. This has the advantage that there is a correspondence between ordered partitions with k blocks and compositions with k terms. If the composition is $n_1+\ldots+n_k=n$, it is clear why there are $\binom{n}{n_1,\ldots,n_k}$ ordered partitions encoding this structure. This multinomial coefficient can be written as a product of factorials of $n_i$s over i, and so we can write:

$B_{n,k}(w_\bullet)=\frac{n!}{k!}\sum_{(n_1,\ldots,n_k)}\prod_{i=1}^k \frac{w_{n_i}}{n_i!}.$

This motivates considering the exponential generating function given by

$w(\xi)=\sum_{j=1}^\infty w_j\frac{\xi_j}{j!},$

as this leads to the neat expressions:

$B_{n,k}(w_\bullet)=n![\xi^n]\frac{w(\xi)^k}{k!},\quad B_n(v_\bullet,w_\bullet)=n![\xi^n]v(w(\xi)).$

The Bell polynomial $B_n(v_\bullet,w_\bullet)$ counts the number of partitions of [n] subject to some extra structure. If we choose uniformly from this set, we get a distribution on this combinatorial object, for which the Bell polynomial provides the normalising constant. If we then ignore the extra structure, the sequences $v_\bullet,w_\bullet$ induce a probability distribution on the set of partitions of n. This distribution is known as a Gibbs partition. It is interesting to consider when and whether it is possible to define a splitting mechanism such that the Gibbs partitions can be coupled to form a fragmentation process. This is the opposite of a coalescence process. Here, we have a sequence of masses, and at each integer time we have rules to determine which mass to pick, and a rule for how to break it into two pieces. It is certainly not the case that for an arbitrary splitting rule and sequences $v_\bullet,w_\bullet$, the one-step fragmentation of the Gibbs partition on n gives the corresponding Gibbs partition on (n-1).

CLT for random permutations

For the final demonstration of the use of Bell polynomials, I am going to sketch the outline of a solution to exercise 1.5.4. which shows that the number of cycles in a uniformly chosen permutation has a CLT. This is not at all obvious, since the number of permutations of [n] with k cycles is given by $B_{n,k}((\bullet-1)!)$ and there is certainly no simple form for this, so the possibility of doing a technical limiting argument seems slim.

For ease of notation, we copy Pitman and write $c_{n,k}:=B_{n,k}((\bullet-1)!)$ as before. First we show exercise 1.2.3. which asserts that

$x(x+1)\ldots(x+(n-1))=\sum_{k=1}^n c_{n,k}x^k.$

We argue combinatorially. The RHS is the number of ways to choose $\sigma\in S_n$ and a colouring of [n] with k colours such that the orbits of $\sigma$ are monochromatic. We prove that the LHS also has this property by induction on the number of vertices. We claim there is a 1-to-(x+n) map from configurations on n vertices to configurations on (n+1) vertices. Given $\sigma\in S_n$ and colouring, for any $a\in[n]$, we construct $\sigma_a\in S_{n+1}$ by $\sigma_a(a)=n+1$, $\sigma_a(n)=\sigma(a)$ and for all other x, $\sigma_a(x)=\sigma(x)$. We give n+1 the same colour as a. This gives us n possibilities. Alternatively, we can map (n+1) to itself and give it any colour we want. This gives us x possibilities. A slightly more careful argument shows that this is indeed a 1-to-(x+n) map, which is exactly what we require.

So the polynomial

$A_n(z)=\sum_{k=0}^nc_{n,k}z^k,$

has n real zeros, which allows us to write

$\frac{c_{n,k}}{A_n(1)}=\mathbb{P}(X_1+\ldots+X_n=k),$

where the Xs are independent but not identically distributed Bernoulli trials. The number of cycles is then given by this sum, and so becomes a simple matter to verify the CLT by checking a that the variances grows appropriately. As both mean and variance are asymptotically log n, we can conclude that:

$\frac{K_n - \log n}{\sqrt{\log n}}\stackrel{d}{\rightarrow} N(0,1).$

In a future post, I want to give a quick outline of section 1.3. which details how the Bell polynomials can be surprisingly useful to find the moments of infinitely divisible distributions.