Parking on a ring, linear hashing

I’ve spent most of my doctorate trying to analyse how adding destructive dynamics affects the behaviour of a particular random growth process, the classical random graph. In this post I’m going to talk about another random growth process, which is slightly less natural, but for which one can show some similar qualitative properties.

The model, and the additive coalescent

Consider m places arranged in a circle, and for consistency of analogy we think of these as parking spaces. Some number n of cars will arrive one at a time. Each car will arrive at a space chosen uniformly at random. If it is empty they will park in it, otherwise they will look clockwise until they find an empty space, and park there. For now we are only interested in growth, so we assume cars never leave. We are interested in the sizes of blocks of consecutively parked cars.
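This is a very easy model to play with on a computer. Here is a minimal simulation sketch in Python (everything here, names and parameter choices included, is my own illustration rather than anything standard), which parks the cars with clockwise probing and reads off the block sizes:

```python
import random

def park(m, n, seed=None):
    """Park n <= m cars on a ring of m spaces with clockwise probing,
    and return the sizes of the blocks of consecutively parked cars."""
    rng = random.Random(seed)
    occupied = [False] * m
    for _ in range(n):
        s = rng.randrange(m)            # arrival space, uniform on the ring
        while occupied[s]:              # probe clockwise for an empty space
            s = (s + 1) % m
        occupied[s] = True
    if all(occupied):
        return [m]                      # n = m: one giant block
    start = occupied.index(False)       # rotate so we start inside a gap
    blocks, run = [], 0
    for i in range(start, start + m):
        if occupied[i % m]:
            run += 1
        elif run:
            blocks.append(run)
            run = 0
    if run:
        blocks.append(run)              # final block, wrapping past place m-1
    return blocks

# e.g. the largest block when a fixed proportion of spaces remain empty
print(max(park(10**5, 70000, seed=42)))
```

Running this for a few values of m suggests the largest block grows very slowly in m, consistent with the logarithmic behaviour discussed below.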

The reason to consider this slightly unnatural statement is its equivalence to the problem of hashing with linear probing, apparently a key topic in computer science, which I won’t pretend that I know anything about. In any case, it’s a nice model, and it seems reasonable that it would have a basis in more realistic search algorithms.

So, how does the sequence of sizes of blocks of consecutively parked cars grow? Well, given the sequence of block sizes, it is reasonably easy to convince yourself that the order of the blocks around the circle is uniformly random, and the number of empty spaces between adjacent blocks is also uniformly random.

Assume for now that there are at least three blocks. A block of size x can merge with a block of size y with the arrival of the next car only if the blocks are adjacent, with exactly one empty space between them. The chance of this is uniform among all pairs of blocks. Now suppose this is the case, and that the block of size y lies clockwise from the block of size x. Then they will merge precisely if the next car arrives at any of the x occupied spaces in that block, or at the empty space between the pair of blocks. This has probability \frac{x+1}{m}. There’s also the opposite ordering to consider, where the block of size x lies clockwise from the other. The total probability of this merge \{x,y\}\mapsto \{x+y+1\} is therefore proportional to (x+y+2).

So the process of block sizes looks a bit like the additive coalescent, at least for large blocks. This is in contrast to the random graph process, where the sequence of component sizes behaves exactly like a multiplicative coalescent, where blocks merge at a rate proportional to the product of their sizes.

Asymptotics

As in the random graph process, it’s interesting to ask roughly how large the largest block will be in such a configuration. Pittel [3] considers the case where the number of empty places \ell = m-n \approx \beta m, for some \beta\in (0,1).

A less interesting model would be to choose the positions of the n cars uniformly at random. But then the size of a block is roughly geometric with parameter \beta, and there are \Theta(m) blocks with high probability. Relatively straightforward calculations in extreme value theory suggest that the largest block is likely to have size on the order of \log m in this setting.

Of course, the actual model is slightly more complicated, because the size of a block is self-reinforcing, since larger blocks are more likely to grow than smaller blocks. However, we can still get somewhere with naïve estimates. Let’s label the places clockwise. Then in order for there to be a block starting at 0 and stretching beyond \alpha \log m, a necessary condition is that at least \alpha \log m cars arrive at those places. The number of cars which arrive at those places is binomial, since there are n cars, and each arrives at a place chosen uniformly, and independently of the other cars. So this event corresponds to

\mathrm{Bin}(n,\frac{\alpha \log m}{m}) \ge \alpha \log m.

Then, since n\approx (1-\beta)m, this event corresponds approximately to

\mathrm{Po}((1-\beta)\alpha \log m) \ge \alpha \log m.

The probability that a Poisson RV is at least a constant multiple larger than its mean decays exponentially with the mean, hence in this case the probability is asymptotically some negative power of m, depending on the value of \alpha. But there are O(m) possible places for such a block to start, so whether we can apply a union bound usefully or not depends on whether the power of m is strictly less than -1.

Since all of this depends on \alpha, it is reasonable to believe there is a threshold: the largest block does have size at least \alpha \log m when \alpha is small, while this is very unlikely when \alpha is large. This heuristic argument fits with Pittel’s theorem. Indeed, his result shows much stronger concentration: that the fluctuations of the size of the largest block are O(1).
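The heuristic is also easy to test numerically. The following sketch (my own code, using scipy’s binomial tail, with entirely illustrative parameter choices) estimates the power of m governing the tail probability; the union bound is only useful once this power exceeds 1:

```python
import math
from scipy.stats import binom

beta = 0.5
for alpha in (3.0, 6.0):
    for m in (10**5, 10**7):
        n, k = int((1 - beta) * m), int(alpha * math.log(m))
        tail = binom.sf(k - 1, n, k / m)   # P(Bin(n, k/m) >= k)
        print(alpha, m, round(-math.log(tail) / math.log(m), 3))
```

For \beta=1/2 the printed exponent should cross 1 somewhere between these two values of \alpha, which is exactly the threshold the heuristic identifies.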

Critical regime and empirical processes

The following is a paraphrase of the introduction and some methods from [2].

Obviously, once n=m cars have arrived, there’s no room for manoeuvre and definitely all the places are taken in one giant block. But it’s not obvious in general what scaling for the number of gaps will give rise to giant blocks of \Theta(m) cars.

As for the random graph, we can find a process similar to the exploration process of a (random) graph which encodes much of the information we care about. Let Y_k be the number of cars which arrive at place k. So the sum of the Y_ks will be n, the total number of cars. Now consider the process

C_0=0,\qquad C_{k+1}=C_k + Y_{k+1}-1.

A block has the property that the number of arrivals within that set of places is equal to the number of places. So every time this *empirical process* C drops below its previous running minimum, this indicates the end of a block. To make this equivalence precise, we need to be a bit careful about where we start counting. It works exactly if we start at the beginning of a block. If not, it might introduce some unwanted divisions within the first block.
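Here is a sketch of this decoding in Python (my own code, and I have been slightly cavalier about ties, so treat it as illustrative): rotate so that the walk starts just after a position achieving its overall minimum, after which each strict new minimum marks an empty space, and the gaps between successive minima give the block sizes.

```python
def blocks_from_arrivals(Y):
    """Recover block sizes from the arrival counts Y[0..m-1] via the
    empirical process C_k; assumes n = sum(Y) < m."""
    m = len(Y)
    def walk(ys):
        C, c = [0], 0
        for y in ys:
            c += y - 1
            C.append(c)
        return C
    C = walk(Y)
    kstar = min(range(1, m + 1), key=lambda k: (C[k], k))   # first overall argmin
    C = walk(Y[kstar % m:] + Y[:kstar % m])                 # rotated walk
    blocks, run_min, prev = [], 0, 0
    for k in range(1, m + 1):
        if C[k] < run_min:           # strict new minimum: an empty space
            run_min = C[k]
            if k - prev > 1:
                blocks.append(k - prev - 1)
            prev = k
    return blocks

import random
rng = random.Random(1)
m, n = 20, 15
arrivals = [rng.randrange(m) for _ in range(n)]
print(blocks_from_arrivals([arrivals.count(i) for i in range(m)]))
```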

What we have is a process that looks roughly like a random walk that is constrained to pass through the point (m,n-m), which is equal to (m,-\ell). Even if we aren’t totally precise about how this is like a random walk, we would expect to see Brownian fluctuations after rescaling. Indeed, we might expect to see a Brownian bridge added to a deterministic linear function with negative gradient. But this is only meaningful if the random part is at least as large as the deterministic part, and since the fluctuations have order \sqrt{m}, if \ell is much larger than this, the rescaled empirical process is essentially deterministic, so we won’t see any macroscopic excursions above the minimum.

If \ell is substantially smaller than \sqrt{m}, then there is no real difference between (m,-\ell) and (m,0), and what we see is just a Brownian bridge. At this point, where we choose to start the process is actually important. If we were to start it at the minimum of the Brownian bridge instead, we would see a Brownian excursion, which corresponds to one block occupying (almost) all of the places.

Unsurprisingly, the story is completed by considering \ell=\Theta(\sqrt{m}), where the rescaled empirical process looks like a slanted Brownian bridge, that is, Brownian motion conditioned to pass through \left(1,-\frac{\ell}{\sqrt{m}}\right). There isn’t an obvious fix to the question of where to start the process, but it turns out that the correct way is now adding a Brownian excursion onto the deterministic linear function with gradient -\frac{\ell}{\sqrt{m}}. It’s now reasonable that the excursions above the minimum should be macroscopic.

This scaling limit works dynamically as well, where the same Brownian excursion is used for different gradients of the deterministic line, corresponding to \ell moving through the critical window \Theta(\sqrt{m}). Finally, a pointer to Bertoin’s recent paper [1] for the model with an additional destructive property. Analogous to the forest fire, blocks of cars are removed at a rate proportional to their size (as a result, naturally, of ‘Molotov cocktails’…). Similar effects of self-organised criticality are seen when the rate of bombs is scaled appropriately.

References

[1] – Bertoin – Burning cars in a parking lot (paper / slides)

[2] – Chassaing + Louchard – Phase transition for parking blocks, Brownian excursion and coalescence (arXiv)

[3] – Pittel – Linear probing: the probable largest search time grows logarithmically with the number of records

Duality for Stochastic Processes

In the past couple of weeks, we’ve launched a new junior probability seminar in Oxford. (If you’re reading this and would like to come, please drop me an email!) The first session featured Daniel Straulino talking about his work on the Spatial Lambda Fleming-Viot process, a model for the evolution of populations allowing for selection and geometry. A lively discussion of duality broke out halfway through, following which it seemed reasonable to devote another session to this topic. Luke Miller presented some classical and less classical examples of the theory this afternoon. I found all of this interesting and relevant, and thought it would be worth writing some things down, and tying it in with one of the courses on this subject that we attended at ALEA in Luminy last October.

The majority of this article is based on Luke’s talk. Errors, omissions and over-simplifications are of course my own.

The setup is that we have two stochastic processes X_t and Y_t, taking values in state spaces R and S respectively. For now we make no assertion about whether the two state spaces R and S are the same or related, and we make no comment on the dependence relationship between X and Y. Let P_x,Q_y be the respective probability measures, representing starting from x and y respectively. Then given a bivariate, measurable function h(\cdot,\cdot) on R\times S, such that:

\mathbb{E}^P_x h(X_t,y)=\mathbb{E}^Q_y h(x,Y_t),\quad \forall x,y,\ \forall t,

then we say X and Y are dual with respect to h.

The interpretation should be that X is a process forwards in time, and Y is a process backwards in time. So X_t, Y_0 represent the present, while X_0, Y_t represent the past, the initial time for the original process X. The fact that the result holds for all times t allows us to carry the equality through a derivative, to obtain an equality of generators:

\mathcal{G}^X h(x,y)=\mathcal{G}^Y h(x,y),\quad \forall x,y.

On the LHS, the generator acts on x, while on the RHS it acts on y. Although it still isn’t obvious (at least to me) when a pair of processes might have this property, especially for an arbitrary function, this seems the more natural definition to think about.

Note that this does indeed require a specific function h. There were murmurings in our meeting about the possibility of two processes having a strong duality property, where this held for all h in some broad class of test functions. On further reflection, which may nonetheless be completely wrong, this seems unlikely to happen very often, except in some obviously degenerate cases, such as h constant. If this holds, then as the set of expectations of a class of functions of a random variable determines the distribution, we find that the instantaneous behaviour of Y is equal in distribution to the instantaneous behaviour of X when started from fixed (x,y). It seems unlikely that you might get many examples of this that are not deterministic or independent (eg two Brownian motions, or other space- and time-homogeneous Markov processes).

Anyway, a canonical example of this is the Wright-Fisher diffusion, which provides a simple model for a population which evolves in discrete-time generations. We assume that there are two types in the population: {A,a} seems to be the standard notation. Children choose their type randomly from the distribution of types in the previous generation. In other words, if there are N individuals at all times, and X_k is the number of type A individuals, then:

X_{k+1} | X_k \stackrel{d}{=} \mathrm{Bin}(N, \frac{X_k}{N}).

It is not hard to see that in a diffusion limit as the number of individuals tends to infinity, the proportion of type A individuals is a martingale, and so the generator for this process will not depend on f’. In fact by checking a Taylor series, we can show that:

\mathcal{G}_{WF}f(x)=\frac{1}{2} x(1-x)f''(x),

for all f subject to mild regularity conditions. In particular, we can show that for f_n(x)=x^n, we have:

\mathcal{G}_{WF} f_n(x)=\binom{n}{2}(f_{n-1}(x)-f_n(x))

after some rearranging. This looks like the generator of a jump process, indeed a jump process where all the increments are -1. This suggests there might be a coalescent as the dual process, and indeed it turns out that Kingman’s coalescent, where any pair of blocks coalesce at uniform rate, is the dual. We have the relation in expectation:

\mathbb{E}_x[X_t^n]= \mathbb{E}_n[x^{N_t}],

where the latter term is the probability generating function, evaluated at x, of the number of blocks at time t of Kingman’s coalescent started from n blocks.

In particular, we can control the moments of the Wright-Fisher diffusion using the generating function of the block-counting process of Kingman’s coalescent, which might well be easier to work with.
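As a sanity check, the duality relation is easy to test by simulation. Below is a rough Monte Carlo sketch (my own code; the Wright-Fisher side uses a crude Euler-Maruyama discretisation with no claim to numerical care, so expect the two numbers to agree only approximately):

```python
import numpy as np

def wf_moment(x0, n, t, dt=1e-3, reps=100000, seed=1):
    """Estimate E_x[X_t^n] for the WF diffusion dX = sqrt(X(1-X)) dW."""
    rng = np.random.default_rng(seed)
    x = np.full(reps, x0)
    for _ in range(int(t / dt)):
        x += np.sqrt(np.clip(x * (1 - x), 0, None) * dt) * rng.standard_normal(reps)
        np.clip(x, 0.0, 1.0, out=x)       # the boundaries are absorbing
    return (x ** n).mean()

def kingman_pgf(x0, n, t, reps=100000, seed=2):
    """Estimate E_n[x0^{N_t}]: from k blocks, coalesce at rate k(k-1)/2."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(reps):
        k, s = n, 0.0
        while k > 1:
            s += rng.exponential(2.0 / (k * (k - 1)))
            if s > t:
                break
            k -= 1
        total += x0 ** k
    return total / reps

print(wf_moment(0.3, 3, 0.5), kingman_pgf(0.3, 3, 0.5))   # should roughly agree
```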

That’s all very elegant, but we were talking about why this might be useful in a broader setting. In this context, there seems to be an obstacle to applying the idea above more generally. An underlying idea in population genetics models is that, as well as the forward process, there is also a so-called ancestral process looking backwards in time, detailing how time t individuals are related historically. It would be convenient if this process, which we might expect to be some nice coalescent, was the dual of the forward process.

But this seems to be a problem, as duals are a function of the duality function, so do we have uniqueness? It would not be very satisfying if there were several coalescent processes that could all be the dual of the forward process. Though some worked examples suggest this might not happen, because a dual and its duality function have to satisfy too many constraints, there seems no a priori reason why not. It seems that the strength of the results you can derive from the duality relation is only as strong as the duality relation itself. This is not necessarily a problem from the point of view of applications, so long as the duality function is something it might actually be useful to work with.

It’s getting late and this post is getting long, so I shall postpone a discussion of duality for interacting particle systems until tomorrow. In summary, by restricting to a finite state space, we can allow ourselves the option of a more algebraic approach, from which some direct uses of duality can be read off. I will also mention a non-technical but I feel helpful way to view many examples of duality in interacting particle systems as deterministic forward and backward walks through a random environment, in what might be considered an extreme example of coupling.

Hall’s Marriage Theorem

Hall’s Marriage Theorem gives conditions on when the vertices of a bipartite graph can be split into pairs of vertices corresponding to disjoint edges such that every vertex in the smaller class is accounted for. Such a set of edges is called a matching. If the sizes of the vertex classes are equal, then the matching naturally induces a bijection between the classes, and such a matching is called a perfect matching.

The number of possible perfect matchings of K_{n,n} is n!, which is a lot to check. Since it’s useful to have bijections, it’s useful to have matchings, so we would like a simple way to check whether a bipartite graph has a matching. Hall’s Marriage Theorem gives a way to reduce the number of things to check to 2^n, which is still large. However, much more importantly, the condition for the existence of a matching has a form which is much easier to check in many applications. The statement is as follows:

Given bipartite graph G with vertex classes X and Y, there is a matching of X into Y precisely when for every subset A\subset X, |\Gamma(A)|\ge |A|, where \Gamma(A) is the set of vertices joined to some vertex in A, called the neighbourhood of A.

Taking A=X, it is clear that |Y|\ge |X| is a necessary condition for the result to hold, unsurprisingly. Perhaps the most elementary standard proof proceeds by induction on the size of X, taking the smallest A to give a contradiction, then using the induction hypothesis to lift smaller matchings up to the original graph. This lifting is based on the idea that a subset relation between sets induces a subset relation between their neighbourhoods.

In this post, I want to consider this theorem as a special case of the Max-Flow-Min-Cut Theorem, as this will support useful generalisations much more easily. The latter theorem is a bit complicated notationally to set up, and I don’t want it to turn into the main point of this post, so I will summarise. The Wikipedia article, and lots of sets of lecture notes are excellent sources of more detailed definitions.

The setting is a weakly-connected directed graph, with two identified vertices, the source, with zero indegree, and the sink, with zero outdegree. Every other vertex lies on a path (not necessarily unique) between the source and the sink. Each edge has a positive capacity, which should be thought of as the maximum volume allowed to flow down the edge. A flow is a way of assigning values to each edge so that they do not exceed the capacity, and there is volume conservation at each interior vertex. That is, the flow into the vertex is equal to flow out of the vertex. The value of the flow is the sum of the flows out of the source, which is necessarily equal to the sum of flows into the sink.

A cut is a partition of the vertices into two classes, with the source in one and the sink in the other. The value of a cut is then the sum of the capacities of any edges going from the class containing the source to the class containing the sink. In most examples, the classes will be increasing, in the sense that any path from source to sink changes class exactly once.

The Max-Flow-Min-Cut Theorem asserts that the maximum value of a flow through the system is equal to the minimum value of a cut. The proof is elementary, though it relies on defining a sensible algorithm to construct a minimal cut from a maximal flow that is not going to be interesting to explain without more precise notation available.

First we explain why Hall’s Marriage Theorem is a special case of this result. Suppose we are given the setup of HMT, with edges directed from X to Y with infinite capacity. We add edges of capacity 1 from some new vertex x_0 to each vertex of X, and from each vertex of Y to a new vertex y_0. The aim is to give necessary and sufficient conditions for the existence of a flow of value |X|. Note that one direction of HMT is genuinely trivial: if there is a matching, then the neighbourhood size condition must hold. We focus on the other direction. If the maximum flow is less than |X|, then there is a cut of this value as well. We can parameterise a cut by the class of vertices containing the source, say S. Let A=S\cap X. If \Gamma(A)\not\subset S, then there will be an infinite capacity edge in the cut. So if we are looking for a minimal cut, this should not happen, hence \Gamma(A)\subset S if S is minimal. Edges from X\setminus A into \Gamma(A) cause no trouble: they point from the sink side into the source side, so they do not count towards the cut. The value of the cut can then be given by

|\Gamma(A)|+|X|-|A|, which is at least |X| whenever the neighbourhood size condition holds. Note we can use the same method with the original edges having capacity one, but we have to track slightly more quantities.
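For what it’s worth, the whole construction is a few lines of code. Here is a sketch using networkx (my own function and example; conveniently, networkx treats edges with no capacity attribute as having infinite capacity, which matches the construction above):

```python
import networkx as nx

def has_matching(X, Y, edges):
    """Does the bipartite graph (X, Y, edges) have a matching saturating X?"""
    G = nx.DiGraph()
    for x in X:
        G.add_edge('source', x, capacity=1)
    for y in Y:
        G.add_edge(y, 'sink', capacity=1)
    for x, y in edges:
        G.add_edge(x, y)                 # no capacity attribute = infinite
    value, _ = nx.maximum_flow(G, 'source', 'sink')
    return value == len(X)

# x1 and x2 both only like y1, so Hall's condition fails for A = {x1, x2}
print(has_matching(['x1', 'x2'], ['y1', 'y2'],
                   [('x1', 'y1'), ('x2', 'y1')]))          # False
```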

This topic came up because I’ve been thinking about fragmentation chains over this holiday. I have a specific example concerning forests of unrooted trees in mind, but won’t go into details right now. The idea is that we often have distributions governing random partitions of some kind, let’s say of [n]. Conditioning on having a given number of classes might give a family of distributions P_{n,k} for the partitions of [n] into k parts. We would be interested to know how easy it is to couple these distributions in a nice way. One way would be via a coalescence or fragmentation process. In the latter, we start with [n] itself, then at each step, split one of the parts into two according to some (random, Markovian) rule. We are interested in finding out whether such a fragmentation process exists for a given distribution.

It suffices to split the problem into single steps. Can we get from P_{n,k} to P_{n,k+1}?

The point I want to make is that this is just a version of Hall’s Marriage Theorem again, at least in terms of proof method. We can take X to be the set of partitions of [n] into k parts, and Y the set of partitions into (k+1) parts. Then we add a directed edge with infinite capacity between x\in X and y\in Y if y can be constructed from x by breaking a part into two. Finally, we connect a fresh vertex x_0 to each vertex in X, only now we insist that the capacity is equal to P_{n,k}(x), and similarly an edge from y to y_0 with capacity equal to P_{n,k+1}(y). The existence of a fragmentation chain over this step is then equivalent to the existence of a flow of value 1 in the directed graph network.
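For small n this can be checked explicitly. The sketch below (my own code; the uniform choices for P_{n,k} and P_{n,k+1} are purely for illustration) builds the network for n=4, k=2 and asks networkx for the maximum flow:

```python
import networkx as nx

def partitions(elements, k):
    """All partitions of the list `elements` into exactly k non-empty blocks."""
    if k <= 0 or k > len(elements):
        return
    if len(elements) == k:
        yield frozenset(frozenset([e]) for e in elements)
        return
    head, rest = elements[0], elements[1:]
    for p in partitions(rest, k - 1):        # head forms its own block
        yield p | {frozenset([head])}
    for p in partitions(rest, k):            # head joins an existing block
        for b in p:
            yield (p - {b}) | {b | frozenset([head])}

def splits_to(x, y):
    """Is the (k+1)-part partition y obtained from x by breaking one part in two?"""
    return any((y - {b1, b2}) | {b1 | b2} == x
               for b1 in y for b2 in y if b1 != b2)

n, k = 4, 2
X = list(partitions(list(range(n)), k))
Y = list(partitions(list(range(n)), k + 1))
G = nx.DiGraph()
for x in X:
    G.add_edge('s', x, capacity=1 / len(X))  # P_{n,k}: uniform, for illustration
for y in Y:
    G.add_edge(y, 't', capacity=1 / len(Y))  # P_{n,k+1}: uniform
for x in X:
    for y in Y:
        if splits_to(x, y):
            G.add_edge(x, y)                 # infinite capacity
value, _ = nx.maximum_flow(G, 's', 't')
print(value)   # 1.0 (up to floating point) iff the one-step coupling exists
```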

Although in many cases this remains challenging to work with, which I will explore in a future post perhaps, this is nonetheless a useful idea to have in mind when it comes to deciding on whether such a construction is possible for specific examples.

Critical Components in Erdos-Renyi

In various previous posts, I’ve talked about the phase transition in the Erdos-Renyi random graph process. Recall the definition of the process. Here we will use the Gilbert model G(n,p), where we have n vertices, and between any pair of vertices we add an edge, independently of other pairs with probability p. We are interested in the sparse scaling, where the typical vertex has degree O(1) in n, and so p=c/n for constant c>0, and we assume throughout that n is large. We could alternatively have considered the alternative Erdos-Renyi model where we choose uniformly at random from the set of graphs with n vertices and some fixed number of edges. Almost all the results presented here work equally well in this setting.

As proved by Erdos and Renyi, the typical component structure of such a graph changes noticeably around the threshold c=1. Below this, in the subcritical regime, all the components are small, meaning of size O(\log n). Above this, in the supercritical regime, there is a single giant component on some non-zero proportion of the vertices. The rest of the graph looks subcritical. The case c=1 exhibits a phase transition between these qualitatively different behaviours. They proved that here, the largest component is with high probability O(n^{2/3}). It seems that they thought this result held whenever c=1-o(1), but it turns out that this is not the case. In this post, I will discuss some aspects of behaviour around criticality, and the tools needed to treat them.

The first question to address is this: how many components of size n^{2/3} are there? It might be plausible that there is a single such component, like for the subsequent giant component. It might also be plausible that there are n^{1/3} such components, so O(n) vertices are on such critical components. Then it would be clear how we transition out of criticality into supercriticality: all the vertices on critical components coalesce to form the new giant component.

In fact neither of these is correct. The answer is that for all integers k>0, with high probability the k-th largest component is on a size scale of n^{2/3}. This is potentially a confusing statement. It looks like there are infinitely many such components, but of course for any particular value of n, this cannot be the case. We should think of there being \omega(1) components, but o(n^b) for any b>0.

The easiest way to see this is by a duality argument, as we have discussed previously for the supercritical phase. If we remove a component of size O(n^{2/3}), then what remains is a random graph with n-O(n^{2/3}) vertices, and edge probability the same as originally. It might make sense to rewrite this probability 1/n as

\frac{1}{n-O(n^{2/3})}\cdot \frac{n-O(n^{2/3})}{n}=\frac{1-O(n^{-1/3})}{n-O(n^{2/3})}.

The numerator here is basically the same as

1-O\left(\left[n-O(n^{2/3})\right]^{-1/3}\right).

Although we have no concrete reasoning, it seems at least plausible that this should look similar in structure to G(n,1/n). In particular, there should be another component of size

O\left([n-O(n^{2/3})]^{2/3}\right)=O(n^{2/3}).

In fact, the formal proof of this proceeds by an identical argument, only using the exploration process. Because I’ve described this several times before, I’ll be brief. We track how far we have gone through each component in a depth-first walk. In both the supercritical and subcritical cases, when we scale correctly we get a random path which is basically deterministic in the limit (in n). For exactly the same reasons that CLT fluctuations become visible in partial sums of RVs with zero expectation, we start seeing interesting effects at criticality.

The important question is the order of rescaling to choose. At each stage of the exploration process, the number of vertices added to the stack is binomial. We want to distinguish between components of size O(n^{2/3}) so we should look at the exploration process at time sn^{2/3}. The drift of the exploration process is given by the expectation of a binomial random variable minus one (since we remove the current vertex from the stack as we finish exploring it). This is given by

\left[n-sn^{2/3}\right]\cdot \frac{1}{n}-1=-sn^{-1/3}.

Note that this is the drift in one time-step. The drift in n^{2/3} time-steps will accordingly be -sn^{1/3}. So, if we rescale time by n^{2/3} and space by n^{1/3}, we should get a nice stochastic process. Specifically, if Z is the exploration process, then we obtain:

\frac{1}{n^{1/3}}Z^{(n)}_{sn^{2/3}} \rightarrow_d W_s,

where W is a Brownian motion with inhomogeneous drift -s at time s. The net effect of such a drift at a fixed positive time is given by integrating up to that time, and hence we might say the process has quadratic drift, or is parabolic.

We should remark that our binomial expectation is not entirely correct. We have discounted those sn^{2/3} vertices that have already been explored, but we have not accounted for the vertices currently in the stack. We should also avoid considering these. However, we now have a heuristic for the approximate number of these. The number of vertices in the stack should be O(n^{1/3}) at all times, and so in particular will always be an order of magnitude smaller than the number of vertices already considered. Therefore, they won’t affect this drift term, though this must be accounted for in any formal proof of convergence. On the subject of which, the mode of convergence is, unsurprisingly, weak convergence uniformly on compact sets. That is, for any fixed S, the convergence holds weakly on the random functions up to time Sn^{2/3}.
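The rescaling is easy to see in a picture. Here is a quick sketch (my own code) simulating the random-walk caricature with increments \mathrm{Bin}(n-k,1/n)-1, ignoring the stack correction exactly as in the heuristic above:

```python
import numpy as np
import matplotlib.pyplot as plt

def exploration(n, t=3.0, seed=0):
    """Rescaled caricature of the exploration process of G(n, 1/n)."""
    rng = np.random.default_rng(seed)
    steps = int(t * n ** (2 / 3))
    increments = rng.binomial(n - np.arange(steps), 1.0 / n) - 1
    Z = np.concatenate([[0], np.cumsum(increments)])
    return np.arange(steps + 1) / n ** (2 / 3), Z / n ** (1 / 3)

s, W = exploration(10**6)
plt.plot(s, W)             # Brownian-looking path with drift -s at time s
plt.plot(s, -s**2 / 2)     # the parabola obtained by integrating the drift
plt.show()
```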

Note that this process will tend to minus infinity almost surely. Component sizes are given by excursions above the running minimum. The process given by the height of the original process above the running minimum is called reflected. Essentially, we construct the reflected process by having the same generator when the current value is positive, and forcing the process up when it is at zero. There are various ways to construct this more formally, including as the scaling limit of simple random walks conditioned to stay non-negative.

The cute part of the result is that it holds equally well in a so-called critical window either side of the critical probability 1/n. When the probability is \frac{1+tn^{-1/3}}{n}, for any t\in \mathbb{R}, the same argument holds. Now the drift at time s is t-s, though everything else still holds.

This result was established by Aldous in [1], and gives a mechanism for calculating distributions of component sizes and so on through this critical window.

In particular, we are now in a position to answer the original question regarding how many such components there were. The key idea is that because whenever we exhaust a component in the exploration process, we choose a new vertex uniformly at random, we are effectively choosing a component according to the size-biased distribution. Roughly speaking, the largest components will show up near the beginning. Note that a critical O(n^{2/3}) component will not necessarily be exactly the first component in the exploration process, but the components that are explored before this will take up sufficiently few vertices that they won’t show up in the scaling of the limit.

In any case, the reflected Brownian motion ‘goes on forever’, and the drift is eventually very negative, so no single excursion can be infinitely wide; the process therefore keeps producing excursions, hence there are infinitely many such critical components.

If we care about the number of cycles, we can treat this also via the exploration process. Note that in any depth-first search we are necessarily only interested in a spanning tree of the host graph. Anyway, when we are exploring a vertex, there could be extra edges to other vertices in the stack, but not to vertices we’ve already finished exploring (otherwise the edge would have been exposed then). So the expected number of excess edges into a vertex is proportional to the height of the exploration process at that vertex. So the overall expected number of excess edges, conditional on the exploration process is the area under the curve. This carries over perfectly well into the stochastic process limit. It is then a calculation to verify that the area under the curve is almost surely infinite, and thus that we expect there to be infinitely many cycles in a critical random graph.

REFERENCES

[1] Aldous D. – Brownian excursions, critical random graphs and the multiplicative coalescent

Recent Research Activity

I’ve spent this week in Luminy, near Marseille, attending a summer school run by ALEA, the organisation of French probabilists. We’ve been staying in CIRM, a dedicated maths research conference centre at the edges of the calanques, the area of mountains and jagged coastal inlets between Marseille and Cassis. The walking possibilities have been excellent, as have the courses and lectures, on a range of topics in probability theory.

Anyway, the time here has been an excellent moment to reflect on my research progress, and try to come up with the sort of fresh ideas that are perhaps slightly inhibited by sitting at a desk with an endless supply of paper on which to try calculations. When I get back, I have to submit a first-year report, so at least for a little while I will have to suppress the desire to make further progress and instead diligently assemble the progress I have made.

The Model

I’ve defined some of these processes in past posts, but I see no harm in doing so again. We take the standard Erdos-Renyi random graph process, where edges are added one-at-a-time uniformly at random between n vertices, and amend it by adding a deletion mechanism. The aim is to arrive at a process which looks in equilibrium more like the critical random graph than either the subcritical or supercritical regimes, where the components are very small, and dominated by one giant component respectively. Rath, Toth and others have studied the process where each vertex is hit by lightning at uniform rate. When this happens, we delete all the edges in the component containing that vertex. Naturally, big components will be hit by lightning more often than small components, and so this acts as a mechanism to prevent the formation of giant components, if scaled correctly.

We take a different approach. We observe that criticality in the original random graph process is denoted by the first appearance of a giant component, but also by the first appearance of a) lots of cycles, and b) large cycles. In particular, it is very unlikely that a giant component could form without containing any cycles. We will therefore use the appearance of a cycle to trigger some form of deletion mechanism.

Our final goal is to treat the so-called ‘Cycle Deletion’ model. Here, whenever a cycle appears, we delete all the edges in that cycle immediately. There are several challenges in treating this model, because the rate at which cycles emerge in a tree is a function of the tree structure. The trees in this model will not be Uniform Spanning Trees (though it is very possible that they will be ‘almost USTs’ in some sense – we need to investigate this further) so it will be hard to make nice statements about the rates. For the standard random graph process, if we are only interested in the sizes of the components, we are actually allowed to ignore the graph structure entirely. The component sizes evolve as a discrete, stochastic version of the multiplicative coalescent (sometimes called a Marcus-Lushnikov process). We would like a deletion mechanism that has a nice interpretation as a fragmentation operation in the same sense. The rate at which a component fragments will be quadratic in the size of the component, since there are O(k^2) possible edges between k vertices forming a component, and adding any of precisely these will create a cycle.

I’ve talked previously about how to overcome the problems with the tree structure in Cycle Deletion with the so-called Uniform Cycle Deleting model. In any case, as a starting point we might consider the Cycle-Induced Forest Fire model. Here, whenever a cycle appears, we delete all the edges, including the new one, in the whole component which contains the cycle.

We suspect this model may resemble the critical random graph at all times. The main characteristic of G(n,1/n) is that the largest component is of size O(n^{2/3}), and indeed there are arbitrarily many components of this size, with high probability in the limit. Since CIFF is recurrent for any fixed n, meaning that it will visit any state infinitely often (rather than tending to infinity or similar), we should ask what the largest component is typically in the equilibrium distribution. Our aim is to prove that it is O(n^{2/3}). We might suspect that the typical size of the largest component will be greater in the Cycle Deletion model, since each fragmentation event is less severe there, removing fewer edges.
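Since we only care about component sizes, CIFF is easy to simulate without tracking any tree structure: a new edge between distinct components merges them, and a repeated hit shatters the component into singletons. A sketch (my own code, using ordered pairs with loops allowed, matching the edge convention discussed below):

```python
import random

def ciff_largest(n, sweeps=20, rng=random.Random(0)):
    """Run cycle-induced forest fire dynamics on component sizes only,
    sampling the largest component once every n steps."""
    comp = list(range(n))                  # comp[v] = label of v's component
    members = {v: [v] for v in range(n)}   # label -> vertices of that component
    samples = []
    for step in range(sweeps * n):
        u, v = rng.randrange(n), rng.randrange(n)
        cu, cv = comp[u], comp[v]
        if cu == cv:                       # a cycle appears: burn the component
            for w in members.pop(cu):
                comp[w] = w
                members[w] = [w]
        else:                              # tree edge: merge smaller into larger
            if len(members[cu]) < len(members[cv]):
                cu, cv = cv, cu
            for w in members[cv]:
                comp[w] = cu
            members[cu].extend(members.pop(cv))
        if step % n == 0:
            samples.append(max(len(b) for b in members.values()))
    return samples

n = 10**4
print(max(ciff_largest(n)), round(n ** (2 / 3)))   # compare with n^{2/3}
```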

An Upper Bound

The nice thing about Markov chains is that they have an ergodic property, which means that if you run them for long enough, the proportion of time spent in any state is given by the stationary probability of being in that state. It doesn’t matter whether or not you start in equilibrium, since it will converge anyway. Thus it is meaningful to talk about properties like the average number of isolated vertices as a time-average as well as an average with respect to some distribution.

This quantity is the key to an upper bound. We can equally talk about the average change in the number of isolated vertices in a time-step. This will increase when a component fragments, and will decrease when an isolated vertex coalesces with another component. In particular, the largest possible decrease in the number of isolated vertices in a single time-step is 2, corresponding to an edge appearing between two isolated vertices.

Suppose that with probability \Theta(1) there is a component of size n^\alpha for some \alpha>2/3. Then such a component makes a contribution to the expected change in the number of isolated vertices of

\Theta(1) n^\alpha \left(\frac{n^\alpha}{n}\right)^2. (*)

Where does this come from? Well, we are tracking the contributions from the event that the largest component is of this size and that it fragments, giving n^\alpha new isolated vertices. So the \Theta(1) accounts for the probability that there is such a component to begin with. Then, conditional on that, the probability that it gets fragmented in the next time-step is the probability that both ends of the next edge added lie in that component. Since each endpoint is chosen uniformly at random, this has probability \left(\frac{n^\alpha}{n}\right)^2. Note that this is under a slightly odd definition of an edge, that allows loops. Basically, I don’t want to have lots of correction terms involving \binom{n}{2} floating around. However, it would make no difference to the orders of magnitude if we were to do it with these.

So, this is only one contribution to the typical rate of gain of isolated vertices. Now note that if \alpha>2/3, then this expression is \gg 1. This is bad, since the negative contributions to this expected flux in the number of isolated vertices are O(1). So this suggests that over time, the number of isolated vertices will keep growing. This is obviously ridiculous since a) we are in equilibrium, so the expected flux should be 0, and b) the number of isolated vertices cannot exceed n, for clear reasons.

This gives us an upper bound of n^{2/3} as the typical scale of the largest component. We can come up with a similar argument for the cycle deleting model. The most helpful thing to track there is the number of edges in the graph. Note that since the graph is at all times a forest on n vertices, the number of edges is equal to n minus the number of (tree) components. We use the fact that the typical fragmentation of a component of size k creates O(\sqrt{k}) new components. It is possible to argue via isolated vertices here too, but the estimates are harder, or at least less present in the literature.

Lower Bounds?

The problem with lower bounds is that it is entirely possible that the flux in the number of isolated vertices is not driven by typical behaviour. Suppose for example we had a different rule. We begin a random graph process, and the first time we see a cycle in a component with size larger than n^2/3, we delete all the edges in the whole graph. Then we will see a sequence of random graph processes starting with the empty graph and stopped at some point close to criticality (in fact, with high probability in the *critical window*), and these will all be glued together. So then, most of the time the process will look subcritical, but the gains in isolated vertices will occur only during the critical periods, which are only an asymptotically small proportion of the time.

At the moment, my approach to the lower bound is instead to prove that the upper bound is tight. I mean this in the following sense. Suppose we wanted to be sure that (*) was in fact equal to the average rate of gain of isolated vertices. We would have to check the following:

  • That the total contributions from all other components were similar or smaller than from the component(s) of size roughly n^{\alpha}.
  • That there were only a few components of size n^{\alpha}. In particular, the estimate would be wrong if there were n^\epsilon such components for any \epsilon>0.
  • That it cannot be the case that for example, some small proportion of the time there is a component of size roughly n^{\alpha+\epsilon}, and over a large enough time these make a greater contribution to the average gain in isolated vertices.

A nice way to re-interpret this is to consider some special vertex and track the size of its component in time. It will be involved in repeated fragmentations over the course of time, so it is meaningful to talk about the distribution of the size of the component containing the vertex when it is fragmented. Our aim is to show that this distribution is concentrated on the scaling O(n^\alpha).

So this has turned out to be fairly hard. Rather than try to explain some of the ideas I’ve employed in attempting to overcome this, I will finish by giving one reason why it is hard.

We have seen that the component sizes in random graphs evolve as the multiplicative coalescent, but at a fixed moment in time, we can derive good estimates from an analogy with branching processes. We might like to do that here. If we know what the system looks like most of the time, we might try to ‘grow’ a multiplicative coalescent, viewing it like a branching process, with distribution given by the typical distribution. The problem is that when I do this, I find that the expectation of the offspring distribution is \Theta(1). This looks fine, since 1 is the threshold for extinction with probability 1. However, throughout the analysis, I have only been paying attention to the exponent of n in all the time and size estimates. For example, I view n^\alpha and n^\alpha \log n as the same. This is a problem, as when I say the expectation is \Theta(1), I am really saying it is \sim n^0. This means it could be \frac{1}{\log n} or \log n. Of course, there is a massive difference between these, since the offspring mean determines whether a branching process dies out or grows exponentially!

So, this approach appears doomed in its current form. I have some other ideas, but a bit more background may be required before going into those. I’m going to be rather busy with teaching on my return to the office, so unfortunately it is possible that there may be many posts about second year probability and third year applied probability before anything more about CIFF.

Random Mappings for Cycle Deletion

In previous posts here and here, I’ve talked about attempts to describe a cycle deleting process. We amend the dynamics of the standard random graph process by demanding that whenever a cycle is formed in the graph we delete all the edges that lie on the cycle. The aim of this is to prevent the system growing giant components, and perhaps give a system that displays the characteristics of self-organised criticality. In the posts linked to, we discuss the difficulties caused by the fact that the tree structure of components in such a process is not necessarily uniform.

Today we look in the opposite direction. It gives a perfectly reasonable model to take a multiplicative coalescent with quadratic fragmentation rate (this corresponds to cycle deletion, since there are O(n^2) edges which would give a cycle if added to a tree on n vertices) and a fragmentation kernel corresponding to adding an extra edge to a uniform spanning tree on n vertices then deleting the edges of the unique cycle. In the rest of this post, we consider this fragmentation mechanism, in particular thinking about how we would sample from it most practically. Not least, without going through Prufer codes or some other clever machinery, it is not trivial to sample a uniform spanning tree.

First, we count the number of unicyclic graphs on n labelled vertices. If we know that the vertices on the cycle are v_1,\ldots,v_k, then the number of cycles with an identified edge is

u_1=1,\quad u_k=\frac{k!}{2},\, k\ge 2.

If we know that the tree coming off the cycle from vertex v_i has size m, say, then each of the possible rooted labelled trees with size m is equally likely. So taking w_j=j^{j-1}, the number of rooted trees on j labelled vertices, we get B_n(u_\bullet,w_\bullet) for the number of such unicyclic graphs on [n]. Recall B_n is the nth Bell polynomial, which gives the size of a compound combinatorial structure, where we have some structure on blocks and some other structure within blocks. Then the random partition of [n] given by the tree sizes has the distribution \text{Gibbs}_n(u_\bullet,w_\bullet).

Consider now a related object, the so-called random mapping digraph. What follows is taken from Chapter 9 of Combinatorial Stochastic Processes. We can view any mapping M_n:[n]\rightarrow[n] as a digraph where every vertex has out-degree 1. Each such digraph contains a collection of directed cycles, supported on those elements x for which M_n^k(x)=x for some k. Such an element x is called a cyclic point. Each cyclic point can be viewed as the root of a labelled tree.

In an identical manner to the unicyclic graph, the sizes of these directed trees in the digraph decomposition of a uniform random mapping is distributed as \text{Gibbs}_n(\bullet !,w_\bullet). So this is exactly the same as the cycle deletion kernel, apart from in the probability that the partition has precisely one block. In practice, for large n, the probability of this event is very small in both cases. And if we wanted to sample the cycle deletion kernel exactly, we could choose the trivial partition with some probability p, and otherwise sample from the random mapping kernel, where p is chosen such that

p+\frac{1-p}{B_n(\bullet !, w_\bullet)}=\frac{1}{B_n(u_\bullet,w_\bullet)}.

At least we know from the initial definition of a random mapping, that B_n(\bullet !,w_\bullet)=n^n. The number of unicyclic graphs with an identified edge is less clear. It turns out that the partition induced by the random mapping has a nice limit, after rescaling, as the lengths of excursions away from 0 in the standard Brownian bridge on [0,1].

The time for a fuller discussion of this sort of phenomenon is in the context of Poisson-Dirichlet distributions, as the above exchangeable partition turns out to be PD(1/2,1/2). However, for now we remark that the jumps of a subordinator give a partition after rescaling. The case of a stable subordinator is particularly convenient, as calculations are made easier by the Levy-Khintchine formula.

A notable example is the stable-1/2 subordinator, which can be realised as the inverse of the local time process at zero of a Brownian motion. The jumps of this process are then the excursion lengths of the original Brownian motion. A calculation involving the tail of the w_j’s indicates that 1/2 is the correct parameter for a subordinator to describe the random mappings. Note that the number of blocks in the partition corresponds to the local time at zero of the Brownian motion. (This is certainly not obvious, but it should at least be intuitively clear why a larger local time roughly indicates more excursions which indicates more blocks.)

So it turns out, after checking some of the technicalities, that it will suffice to show that the rescaled number of blocks in the random mapping partition \frac{|\Pi_n|}{\sqrt{n}} converges to the Rayleigh density, which is a size-biased Normal random variable (hence effectively first conditioned to be positive), and which also is the distribution of the local time of the standard Brownian bridge.

After that very approximate description, we conclude by showing that the distribution of the number of blocks does indeed converge as we require. Recall Cayley’s formula kn^{n-k-1} for the number of labelled forests on [n] with a specified set of k roots. We also need to know how many labelled forests there are with any set of roots. Suppose we introduce an extra vertex, labelled 0, and connect it only to the roots of a rooted labelled forest on [n]. This gives a bijection between labelled trees on \{0,1,\ldots,n\} and labelled forests with a specified set of roots on [n]. So we can use Cayley’s original formula to conclude there are (n+1)^{n-1} such forests. We can do a quick sanity check that these are the same, which is equivalent to showing

\sum_{k=1}^n k n^{-k-1}\binom{n}{k}=\frac{1}{n}(1+\frac{1}{n})^{n-1}.

This odd way of writing it is well-motivated. The form of the LHS is reminiscent of a generating function, and the additional k suggests taking a derivative. Indeed, the LHS is \frac{1}{n^2} times the derivative

\frac{d}{dx}(1+x)^n,

evaluated at x=\frac{1}{n}. This is clearly the same as the RHS.

That said, having established that the random mapping partition is essentially the same, it is computationally more convenient to consider that instead. By the digraph analogy, we again need to count forests with k roots on n vertices, and multiply by the number of permutations of the roots. This gives:

\mathbb{P}(|\Pi_n|=k)=\frac{kn^{n-k-1}\cdot k! \binom{n}{k}}{n^n}=\frac{k}{n}\prod_{i=1}^{k-1}\left(1-\frac{i}{n}\right).

Now we can consider the limit. Being a bit casual with notation, we get:

\lim \mathbb{P}(\frac{|\Pi_n|}{\sqrt{n}}\in dl)\approx \sqrt{n}dl \mathbb{P}(|\Pi_n|=l\sqrt{n}).

Since the Rayleigh distribution has density l\exp(-\frac12 l^2)dl, it suffices for this informal verification to check that

\prod_{i=1}^{l\sqrt{n}}(1-\frac{i}{n})\approx \exp(-\frac12 l^2). (*)

We take logs, so the LHS becomes:

\log(1-\frac{1}{n})+\log(1-\frac{2}{n})+\ldots+\log(1-\frac{l\sqrt{n}}{n}).

If we view this as a function of l and differentiate, we get

d(LHS)=\sqrt{n}dl \log (1-\frac{l}{\sqrt{n}})\approx \sqrt{n}dl \left[-\frac{l}{\sqrt{n}}-\frac{l^2}{2n}\right]\approx -ldl.

When l is zero, the LHS should be zero, so we can obtain the desired result (*) by integrating then taking an exponential.
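This limit is also easy to see by simulation. The sketch below (my own code) finds the cyclic points of uniform random mappings by walking each trajectory until it loops, and compares the empirical mean of |\Pi_n|/\sqrt{n} with the Rayleigh mean \sqrt{\pi/2}:

```python
import math, random

def num_cyclic_points(f):
    """Number of cyclic points of the mapping f: [n] -> [n], given as a list."""
    n = len(f)
    state = [0] * n              # 0 unvisited, 1 on current path, 2 finished
    count = 0
    for start in range(n):
        path, x = [], start
        while state[x] == 0:
            state[x] = 1
            path.append(x)
            x = f[x]
        if state[x] == 1:        # the walk has closed a brand new cycle at x
            count += len(path) - path.index(x)
        for v in path:
            state[v] = 2
    return count

rng = random.Random(0)
n, reps = 4000, 1000
mean = sum(num_cyclic_points([rng.randrange(n) for _ in range(n)])
           for _ in range(reps)) / (reps * math.sqrt(n))
print(mean, math.sqrt(math.pi / 2))   # these should be close
```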

Urns and the Dirichlet Distribution

As I’ve explained in some posts from a while ago, I’ve been thinking about some models related to random graph processes, where we ensure the configuration stays critical by deleting any cycles as they appear. Under various assumptions, this behaves in the limit as the number of vertices grows to infinity, like a coagulation-fragmentation process, with multiplicative coalescence and quadratic fragmentation rate, where the fragmentation kernel is the Poisson-Dirichlet distribution, PD(1/2,1/2). I found it quite hard to find accessible notes on these, partly because the theory is still relatively recent, and also because it seems to be one of those topics where you can’t understand anything properly until you kind of understand everything.

This post was motivated and is based on chapter 3 of Pitman’s Combinatorial Stochastic Processes, and the opening pair of lectures from Pierre Tarres’s TCC course on Self-Interaction and Learning.

It makes sense to begin by discussing the Dirichlet distribution, and there to start with the most simple case, the Beta distribution. As we learned in the Part A Statistics course while trying some canonical examples of posterior distributions, it is convenient to ignore the normalising constants of various distributions until right at the end. This is particularly true of the Beta distribution, which is indeed often used as a prior in such situations. The density function of \text{Beta}(\alpha,\beta) is proportional to x^{\alpha-1}(1-x)^{\beta-1}. If \alpha and \beta are natural numbers, we have a quick proof by induction using integration by parts; otherwise a slightly longer but still elementary argument gives the normalising constant as

\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}.

(note that the ‘base case’ is the definition of the Gamma function.) For the generalisation we are about to make, it is helpful to think of this Beta density as a distribution not on [0,1], but on partitions of [0,1] into two parts. That is, pairs (x,y) such that x+y=1. Why? Because then the density has the form x^{\alpha-1}y^{\beta-1}, and it is clear how we might generalise this.

Indeed the Dirichlet distribution with parameters (\alpha_1,\ldots,\alpha_m) is a random variable supported on the subset of \mathbb{R}^m with \sum p_k=1 with density \propto \prod p_k^{\alpha_k-1}. For similar reasons, the correct normalising constant in the general case is

\frac{\Gamma(\sum \alpha_k)}{\prod \Gamma(\alpha_k)}.

You can prove this by inducting on the number of variables, using the Beta distribution as a base case.

In many situations, it is useful to be able to express some distribution as a function of IID random variables with a simpler distribution. We can’t quite do that for the Dirichlet distribution, but we can express it very simply as a function of independent RVs from the same family. It turns out that the family of gamma distributions is a wise choice. Recall that the gamma distribution with parameters (\alpha,\beta) has density:

\frac{1}{\beta^\alpha \Gamma(\alpha)}y^{\alpha-1}e^{-y/\beta},\quad y>0.

Anyway, define independent RVs Y_k with gamma distributions with parameters (\alpha_k,1); then we can specify (X_1,\ldots,X_m) as the Dirichlet distribution with these parameters by:

(X_1,\ldots,X_m)\stackrel{d}{=}\frac{(Y_1,\ldots,Y_m)}{\sum Y_k}.

In other words, the Dirichlet distribution describes the proportions of a collection of independent gamma RVs. Note the following:

– the sum of the gamma distributions, ie the factor we have to scale by to get back to a ratio, is a gamma distribution itself.

– If we wanted, we could define it in an identical way using Gamma with parameters (\alpha_k,\beta) for some fixed \beta.

– More helpfully, because the gamma distribution is additive in the first argument, we can take a limit to construct a gamma process, where the increments have the form required. This will be a useful interpretation when we take a limit, as largest increments will correspond to largest jumps.
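In code, this construction is almost a one-liner (a minimal sketch; note random.gammavariate takes shape and scale parameters):

```python
import random

def dirichlet(alphas, rng=random.Random(0)):
    """Sample Dirichlet(alphas) by normalising independent Gamma(alpha_k, 1) RVs."""
    ys = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(ys)
    return [y / total for y in ys]

print(dirichlet([0.5, 0.5, 1.0]))
```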

Polya’s Urn

This is one of the best examples of a self-reinforcing process, where an event which has happened in the past is more likely to happen again in the future.

The basic model is as follows. We start with one white ball and one black ball in a bag. We draw a ball from the bag uniformly at random then replace it along with an additional ball of the same colour. Repeat this procedure.

The first step is to look at the distribution at some time n, ie after n balls have been added, so there are n+2 in total. Note that there are exactly n+1 possibilities for the state of the bag at this time. We must have between 1 and n+1 black balls, and indeed all of these are possible. In general, part of the reason why this process is self-reinforcing is that any distribution is in some sense an equilibrium distribution.

What follows is a classic example of a situation which is a notational nightmare in general, but relatively straightforward for a fixed finite example.

Let’s take n=5, and consider the probability that the sequence of balls drawn is BBWBW. This probability is:

\frac12\times \frac 23\times \frac14\times \frac 35\times \frac26.

So far this isn’t especially illuminating, especially if we start trying to cancel these fractions. But note that the denominator of the product will clearly be 6!. What about the numerator? Well, the contribution to the numerator of the product from black balls is 1\times 2\times 3=3!, while the contribution from white balls is 1\times 2=2!. In particular, the contribution to the numerator from each colour is independent of the order of whites and blacks. It depends only on the number of whites and blacks. So we can conclude that the probability that we end up with a particular ordering of k+1 whites and (n-k)+1 blacks is

\frac{k! (n-k)!}{(n+1)!},

and so the probability that we end up with k+1 whites where we no longer care about ordering is

\binom{n}{k}\frac{k!(n-k)!}{(n+1)!}=\frac{1}{n+1}.

In other words, the distribution of the number of white balls in the bag after n balls have been added is uniform on \{1,\ldots,n+1\}.
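This is easy to verify by simulation (a minimal sketch with my own parameter choices):

```python
import random
from collections import Counter

def polya(n, rng):
    """Two-colour urn started from one white and one black ball; n draws.
    Returns the final number of white balls."""
    white, black = 1, 1
    for _ in range(n):
        if rng.random() < white / (white + black):
            white += 1
        else:
            black += 1
    return white

rng = random.Random(0)
counts = Counter(polya(5, rng) for _ in range(60000))
print(sorted(counts.items()))   # roughly 10000 in each of {1, ..., 6}
```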

That looks like it might be something of a neat trick, so the natural question to ask is what happens if we adjust the initial conditions. Suppose that instead we start with \alpha_1,\ldots,\alpha_m balls of the m colours. Obviously, this is going to turn into a proof by suggestive notation. In fact, the model doesn’t really rely on the (\alpha_i) being positive integers. Everything carries through with \alpha_i\in(0,\infty) if we view the vector as the initial weights.

As before, the order in which balls of various colours are drawn doesn’t matter hugely. Suppose that the first n balls drawn feature n_i balls of colour i. The probability of this is:

\binom{n}{n_1,\ldots,n_m}\frac{\prod_i \alpha_i(\alpha_i+1)\ldots (\alpha_i+n_i-1)}{\alpha(\alpha+1)\ldots(\alpha+(n-1))}

where \alpha=\sum_i \alpha_i. Then for large n, assuming for now that the \alpha_i\in\mathbb{N}, we have

\frac{\alpha_i(\alpha_i+1)\ldots(\alpha_i+n_i-1)}{n_i!}=\frac{[\alpha_i+(n_i-1)]!}{n_i!(\alpha_i-1)!}\approx \frac{n_i^{\alpha_i-1}}{(\alpha_i-1)!},

and similarly, combining the denominator with the n! from the multinomial coefficient, \frac{\alpha(\alpha+1)\ldots(\alpha+(n-1))}{n!}\approx \frac{n^{\alpha-1}}{(\alpha-1)!}. The factorials in the \alpha’s are just fixed constants, so overall the probability above is proportional to

\frac{\prod_i n_i^{\alpha_i-1}}{n^{\alpha-1}}\propto \prod_i \left(\frac{n_i}{n}\right)^{\alpha_i-1},

which we recall is the pdf of the Dirichlet distribution with parameters (\alpha_i), as telegraphed by our choice of notation. With some suitable martingale machinery, you can also prove that this convergence happens almost surely, for a suitable limit RV defined on the tail sigma algebra.

Next time I’ll introduce a more complicated family of self-reinforcing processes, and discuss some interesting limits of the Dirichlet distribution that relate to such processes.

Generating Functions for the IMO

The background to this post is that these days I find myself using generating functions all the time, especially for describing the stationary states of various coalescence-like processes. I remember meeting them vaguely while preparing for the IMO as a student. However, a full working understanding must have eluded me at the time, as for Q5 on IMO 2008 in Madrid I had written down in big boxes the two statements involving generating functions that immediately implied the answer, but failed to finish it off. The aim of this post is to help this year’s team avoid that particular pitfall.

What are they?

I’m going to define some things in a way which will be most relevant to the type of problems you are meeting now. Start with a sequence (a_0,a_1,a_2,\ldots). Typically these will be the sizes of various combinatorial sets. Eg a_n = number of partitions of [n] with some property. Define the generating function of the sequence to be:

f(x)=\sum_{k\geq 0}a_k x^k=a_0+a_1x+a_2x^2+\ldots.

If the sequence is finite, then this generating function is a polynomial. In general it is a power series. As you may know, some power series can be rather complicated, in terms of where they are defined. Eg

1+x+x^2+x^3+\ldots=\frac{1}{1-x},

only when |x|<1. For other values of x, the LHS diverges. Defining f over \mathbb{C} is fine too. This sort of thing is generally NOT important for applications of generating functions to combinatorics. To borrow a phrase from Wilf, a generating function is a convenient ‘clothesline’ on which to hang a sequence of numbers.

We need a notation to get back from the generating function to the coefficients. Write [x^k]g(x) to denote the coefficient of x^k in the power series g(x). So, if g(x)=3x^3-5x^2+7, then [x^2]g(x)=-5. It should hopefully never be relevant unless you read some other notes on the topic, but there is also the convention [\alpha x^2]g(x):=\frac{[x^2]g(x)}{\alpha}, which does make sense after a while.
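For what it’s worth, computer algebra systems will extract coefficients directly; a small sympy sketch:

from sympy import symbols, expand

x = symbols('x')
g = 3*x**3 - 5*x**2 + 7
print(g.coeff(x, 2))                     # -5, i.e. [x^2]g(x)
print(expand((1 + x)**5).coeff(x, 2))    # 10 = \binom{5}{2}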

How might they be useful?

Example: binomial coefficients a_k=\binom{n}{k} appear, as the name suggests, as coefficients of

f_n(x)=(1+x)^n=\sum_{k=0}^n \binom{n}{k}x^k.

Immediate consequence: it’s trivial to work out \sum_{k=0}^n \binom{n}{k} and \sum_{k=0}^n(-1)^k \binom{n}{k} by substituting x=\pm 1 into f_n.

Less obvious consequence. By counting the ways to choose n balls from a collection of a red balls and b blue balls, one can verify

\binom{a+b}{n}=\sum_{k=0}^n \binom{a}{k}\binom{b}{n-k}.

We can rewrite the RHS as

\sum_{k+l=n}\binom{a}{k}\binom{b}{l}.

Think how we calculate the coefficient of x^n in the product f(x)g(x), and it is now clear that \binom{a+b}{n}=[x^n](1+x)^{a+b}, while

\sum_{k+l=n}\binom{a}{k}\binom{b}{l}=[x^n](1+x)^a(1+x)^b,

so the result again follows. This provides a good slogan for generating functions: they often replicate arguments via bijections, even if you can’t find the bijection.

Useful for? – Multinomial sums

The reason why the previous argument for binomial coefficients worked nicely is because we were interested in the coefficients, but had a neat expression for the generating function as a polynomial. In particular, we had an expression

\sum_{k+l=n}a_k b_l.

This is always a clue that generating functions might be useful. A sum of this form is sometimes called a convolution.

Exercise: prove that in general, if f(x) is the generating function of (a_k) and g(x) the generating function of (b_l), then f(x)g(x) is the generating function of \sum_{k+l=n}a_kb_l.
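The content of the exercise is that multiplying coefficient lists is exactly this convolution. Here is a Python sketch, using the binomial identity above as a test case:

from math import comb

def poly_mult(a, b):
    # coefficient list of f(x)g(x), where a, b are the coefficient lists of f, g
    c = [0] * (len(a) + len(b) - 1)
    for k, ak in enumerate(a):
        for l, bl in enumerate(b):
            c[k + l] += ak * bl
    return c

a_, b_ = 4, 7
f = [comb(a_, k) for k in range(a_ + 1)]   # coefficients of (1+x)^a
g = [comb(b_, l) for l in range(b_ + 1)]   # coefficients of (1+x)^b
fg = poly_mult(f, g)
# Vandermonde: [x^n](1+x)^{a+b} agrees with the convolution
assert all(fg[n] == comb(a_ + b_, n) for n in range(a_ + b_ + 1))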

Even more usefully, this works in the multinomial case:

\sum_{k_1+\ldots+k_m=n}a^{(1)}_{k_1}\ldots a^{(m)}_{k_m}.

In many applications, these a^{(i)}s will all be the same. We don’t even have to specify how many k_i’s there are to be considered. After all, if each part k_i is at least 1, that is if a_0=0 and hence f(0)=0, then only the terms with m\leq n contribute to the coefficient of x^n. So:

\sum_{m\geq 0}\sum_{k_1+\ldots+k_m=n}a_{k_1}\ldots a_{k_m}=[x^n]\sum_{m\geq 0}f(x)^m=[x^n]\frac{1}{1-f(x)},

provided f(0)=0.
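As a sanity check of this formal identity: taking a_k=1 for all k\geq 1, so that f(x)=\frac{x}{1-x}, the coefficient [x^n]\frac{1}{1-f(x)} counts compositions of n, which famously number 2^{n-1}. A sympy check:

from sympy import symbols, series

x = symbols('x')
f = x / (1 - x)                    # generating function of a_k = 1 for k >= 1
print(series(1 / (1 - f), x, 0, 8))
# 1 + x + 2*x**2 + 4*x**3 + ... : 2^{n-1} compositions of n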

Useful when? – You recognise the generating function!

In some cases, you can identify the generating function as a `standard’ function, eg the geometric series. In that case, manipulating the generating functions is likely to be promising. Here is a list of some useful power series you might spot.

1+x+x^2+\ldots=\frac{1}{1-x},\quad |x|<1

1+2x+3x^2+\ldots=\frac{1}{(1-x)^2},\quad |x|<1

e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\ldots

\cos x=1-\frac{x^2}{2!}+\frac{x^4}{4!}\pm\ldots

Exercise: if you know what differentiation means, show that if f(x) is the gen fn of (a_k), then xf'(x) is the gen fn of (ka_k).

Technicalities: some of these identities are valid only for certain values of x. This may be a problem if they are defined at, say, only a single point, but in general this shouldn’t be the case. In addition, you don’t need to worry about differentiability. You can define differentiation of power series formally, term by term, via x^n\mapsto nx^{n-1}, and sort out convergence later if necessary.

Useful for? – Recurrent definitions

The Fibonacci numbers are defined by:

F_0=F_1=1,\quad F_{n+1}=F_n+F_{n-1},\quad n\geq 1.

Let F(x) be the generating function of the sequence F_n. So, for n\geq 2,

[x^n]F(x)=[x^{n-1}]F(x)+[x^{n-2}]F(x)=[x^n](xF(x)+x^2F(x)),

and checking the coefficients of x^0 and x^1 directly (using F_0=F_1=1), we can conclude that:

F(x)=1+(x+x^2)F(x)\quad\Rightarrow\quad F(x)=\frac{1}{1-x-x^2}.
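A quick sympy check of this closed form against the recurrence:

from sympy import symbols, series

x = symbols('x')
print(series(1 / (1 - x - x**2), x, 0, 8))
# 1 + x + 2*x**2 + 3*x**3 + 5*x**4 + 8*x**5 + 13*x**6 + 21*x**7 + O(x**8)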

Exercise: Find a closed form for the generating function of the Catalan numbers, defined recursively by:

C_0=1,\quad C_n=C_0C_{n-1}+C_1C_{n-2}+\ldots+C_{n-1}C_0,\quad n\geq 1.

Can you now find the coefficients explicitly for this generating function?

Useful for? – Partitions

Partitions can be an absolute nightmare to work with because of the lack of explicit formulae. Often any attempt at a calculation turns into a massive inclusion-exclusion bash. This prompts a search for bijective or bare-hands arguments, but generating functions can be useful too.

For now (*), let’s assume a partition of [n] means a sequence of positive integers a_1\geq a_2\geq\ldots\geq a_k such that a_1+\ldots+a_k=n. Let p(n) be the number of partitions of [n].

(* there are other definitions, in terms of a partition of the set [n] into k disjoint but unlabelled sets. Be careful about definitions, but the methods often extend to whatever framework is required. *)

Exercise: Show that the generating function of p(n) is:

\left(\frac{1}{1-x}\right)\left(\frac{1}{1-x^2}\right)\left(\frac{1}{1-x^3}\right)\ldots

Note that if we are interested only in partitions of [n], then we don’t need to consider any terms with exponent greater than n, so if we wanted we could take a finite product instead.

Example: the mint group will remember this problem from the first session in Cambridge:

Show that the number of partitions of [n] with distinct parts is equal to the number of partitions of [n] with odd parts.

Rather than the fiddly bijection argument found in the session, we can now treat this as a simple calculation. The generating function for distinct parts is given by:

(1+x)(1+x^2)(1+x^3)\ldots,

while the generating function for odd parts is given by:

\left(\frac{1}{1-x}\right)\left(\frac{1}{1-x^3}\right)\left(\frac{1}{1-x^5}\right)\ldots.

Writing the former as

\left(\frac{1-x^2}{1-x}\right)\left(\frac{1-x^4}{1-x^2}\right)\left(\frac{1-x^6}{1-x^3}\right)\ldots

shows that these are equal: each factor 1-x^{2k} appearing in a numerator cancels with the same factor appearing in a denominator, and after all the cancellation exactly the odd-index factors \frac{1}{1-x^{2j-1}} remain. The result follows.
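Both the p(n) exercise and this identity are easy to check numerically, using the finite-product observation above; a Python sketch:

def mult(a, b, n):
    # multiply two coefficient lists, truncating at degree n
    c = [0] * (n + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j <= n:
                c[i + j] += ai * bj
    return c

def product_gf(factors, n):
    out = [1] + [0] * n
    for f in factors:
        out = mult(out, f, n)
    return out

n = 20
geom = lambda k: [1 if i % k == 0 else 0 for i in range(n + 1)]       # 1/(1-x^k)
binary = lambda k: [1 if i in (0, k) else 0 for i in range(n + 1)]    # 1+x^k
p = product_gf([geom(k) for k in range(1, n + 1)], n)
distinct = product_gf([binary(k) for k in range(1, n + 1)], n)
odd = product_gf([geom(k) for k in range(1, n + 1, 2)], n)
print(p[:8])              # [1, 1, 2, 3, 5, 7, 11, 15]
assert distinct == odd    # the two counts agree up to degree n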

Other things – Multivariate Generating Functions

If you want to track a sequence in two variables, say a_{m,n}, then you can encode this with the bivariate generating function

f(x,y):=\sum_{m,n\geq 0}a_{m,n}x^my^n.

The coefficients are then extracted by [x^ay^b] and so on. There’s some interesting stuff on counting lattice paths with this method.

Sums over arithmetic progressions via roots of unity

Note that we can extract both \sum a_n and \sum (-1)^na_n by judicious choice of x in f(x). By taking half the sum or half the difference, we can obtain

a_0+a_2+a_4+\ldots=\frac12(f(1)+f(-1)),\quad a_1+a_3+a_5+\ldots=\frac12(f(1)-f(-1)).

Can we do this in general? Yes actually. If you want a_0+a_k+a_{2k}+\ldots, this is given by:

a_0+a_k+a_{2k}+\ldots=\frac{1}{k}\left(f(1)+f(w)+\ldots+f(w^{k-1})\right),

where w=e^{2\pi i/k} is a kth root of unity. Exercise: Prove this.

For greater clarity, first try the case k=4, and consider what happens to the imaginary parts of the power series when evaluated at \pm i.
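If you want to watch the cancellation happen numerically, here is a Python sketch for f(x)=(1+x)^n and k=4:

import cmath
from math import comb

n, k = 10, 4
f = lambda z: (1 + z)**n
w = cmath.exp(2j * cmath.pi / k)
filtered = sum(f(w**j) for j in range(k)) / k
direct = sum(comb(n, r) for r in range(0, n + 1, k))   # r = 0, 4, 8
print(abs(filtered - direct) < 1e-9)                   # True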

Bell Polynomials

Trees with a single cycle

When counting combinatorial objects, it is often the case that we have two types of structure present at different levels. The aim of this post is to introduce the Bell polynomials, which provide the most natural notation for describing this sort of situation, and to mention some of the results that become easier to derive in this framework. This post is based on material and exercises from Chapter 1 of Jim Pitman’s book Combinatorial Stochastic Processes, which is great, and also available online here.

The structures that Bell polynomials enumerate are called composite structures in this account. Rather than give a definition right away, I shall give an example. The objects I have been thinking about in the past few weeks are graphs on n vertices containing precisely one cycle. Some of the background for this has been explained in recent posts.

In a recent post on Prufer codes, I gave the classical argument showing that the number of trees on n vertices is n^{n-2}. We might consider a unicyclic graph to be a tree with an extra edge. But if we count the number of ways to add a further edge to a tree, we get

n^{n-2}\left[\binom{n}{2}-(n-1)\right]=n^{n-2}\binom{n-1}{2}.

Obviously, we have overcounted. If the single cycle in a graph has length k, then the graph has been counted exactly k times in this enumeration. But it is not obvious how many graphs have a single cycle of length k.

Instead, we stop worrying about exactly how many of these there are, as there might not be a simple expression anyway. As soon as we start using them in any actual argument, it will be useful to know various properties about the graphs, but probably not exactly how many there are.

Let’s focus on this single cycle of length k say. If we remove the edges of the cycle, we are left with a collection of trees. Why? Well if there was a cycle in the remaining graph, then the original graph would have had at least two cycles. So we have a collection of trees, unsurprisingly called a forest. Remembering that some of the trees may in fact be a single vertex (on the cycle), it is clear that there is a bijection between these trees and the vertices of the cycle in the obvious way. We can think of the graph as a k-cycle, dressed with trees.

Alternatively, once we have specified its size, we can forget about the k-cycle altogether. The graph is precisely defined by a forest of k trees on n vertices, with a specified root in each tree indicating which vertex lies on the cycle, and a cyclic ordering of the trees. Since the cycle is undirected, there are (k-1)!/2 such orderings when k\geq 3. We can write this as

N_{n,k}=\frac{(k-1)!}{2}\sum_{(A_1,\ldots,A_k)\in\mathcal{P}^k(n)}a_1^{a_1-1}\cdot\ldots\cdot a_k^{a_k-1},\quad \text{for }a_i=|A_i|,

where \mathcal{P}^k(n) is the set of partitions of [n] into k blocks. Remember that the blocks in a partition are necessarily unordered. This makes sense in this setting, as the cyclic ordering chosen from the (k-1)!/2 possibilities specifies the arrangement around the cycle.

Bell Polynomials

The key point about this description is that there are two types of combinatorial structure present. We have the rooted trees, and also a cyclic ordering of the rooted trees. Bell polynomials generalise this idea. It is helpful to be less specific and think of partitions of [n] into blocks. There are w_j arrangements of any block of size j, and there are v_k ways to arrange the blocks, if there are k of them. Note that we assume v_k is independent of the arrangements within the collection of blocks. So in the previous example, w_j=j^{j-1} (the number of rooted trees on j labelled vertices), and v_k=(k-1)!/2. Pitman denotes these sequences by v_\bullet,w_\bullet. Then the (n,k)th partial Bell polynomial, B_{n,k}(w_\bullet), gives the number of such divisions into k blocks:

B_{n,k}(w_\bullet):=\sum_{(A_1,\ldots,A_k)\in\mathcal{P}^k(n)}\prod_{i=1}^k w_{a_i}.

The total number of arrangements is given by the Bell polynomial

B_n(v_\bullet,w_\bullet):=\sum_{k=1}^n v_k B_{n,k}(w_\bullet).

Here are some other examples of Bell polynomials. The Stirling numbers of the first kind c_{n,k} give the number of permutations of [n] with k cycles. Since we don’t want to impose any combinatorial structure on the set of cycles, we don’t need to consider v_\bullet, and the number of ways to make a j-cycle from a j-block is w_j=(j-1)!, so c_{n,k}:=B_{n,k}((\bullet-1)!). Similarly, the Stirling numbers of the second kind S_{n,k} give the number of partitions of [n] into k blocks. Almost by definition, S_{n,k}:=B_{n,k}(1^\bullet), where 1^\bullet is defined to be the sequence containing all 1s.

Applications

So far, this is just a definition that gives an abbreviated description for the sizes of several interesting sets of discrete objects. Having clean notation is always important, but there are further advantages of using Bell polynomials. I don’t want to reproduce the entirety of the chapter I’ve read, so my aim for this final section is to give a very vague outline of why this is a useful formulation.

Bell polynomials can be treated rather nicely via generating functions. The key to this is to take a sum not over partitions, but rather over ordered partitions, which are exactly the same, except now we also care about the order of the blocks. This has the advantage that there is a correspondence between ordered partitions with k blocks and compositions with k terms, that is, sequences of positive integers with n_1+\ldots+n_k=n. For each such composition, it is clear why there are \binom{n}{n_1,\ldots,n_k} ordered partitions encoding this structure. This multinomial coefficient is n! divided by the product of the n_i!, and so we can write:

B_{n,k}(w_\bullet)=\frac{n!}{k!}\sum_{(n_1,\ldots,n_k)}\prod_{i=1}^k \frac{w_{n_i}}{n_i!}.

This motivates considering the exponential generating function given by

w(\xi)=\sum_{j=1}^\infty w_j\frac{\xi^j}{j!},

as this leads, with v(\xi) defined analogously from v_\bullet, to the neat expressions:

B_{n,k}(w_\bullet)=n![\xi^n]\frac{w(\xi)^k}{k!},\quad B_n(v_\bullet,w_\bullet)=n![\xi^n]v(w(\xi)).
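For experimenting with small cases, partial Bell polynomials can also be computed via the standard recursion B_{n,k}(w_\bullet)=\sum_{j\geq 1}\binom{n-1}{j-1}w_j B_{n-j,k-1}(w_\bullet), obtained by classifying partitions according to the size j of the block containing element 1. A Python sketch, checked against the Stirling numbers above:

from functools import lru_cache
from math import comb, factorial

def partial_bell(n, k, w):
    # w(j) = weight of a block of size j
    @lru_cache(maxsize=None)
    def B(n, k):
        if k == 0:
            return 1 if n == 0 else 0
        # classify by the size j of the block containing element 1
        return sum(comb(n - 1, j - 1) * w(j) * B(n - j, k - 1)
                   for j in range(1, n - k + 2))
    return B(n, k)

print(partial_bell(5, 2, lambda j: factorial(j - 1)))   # c_{5,2} = 50
print(partial_bell(5, 2, lambda j: 1))                  # S_{5,2} = 15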

The Bell polynomial B_n(v_\bullet,w_\bullet) counts the number of partitions of [n] subject to some extra structure. If we choose uniformly from this set, we get a distribution on this combinatorial object, for which the Bell polynomial provides the normalising constant. If we then ignore the extra structure, the sequences v_\bullet,w_\bullet induce a probability distribution on the set of partitions of n. This distribution is known as a Gibbs partition. It is interesting to consider when and whether it is possible to define a splitting mechanism such that the Gibbs partitions can be coupled to form a fragmentation process. This is the opposite of a coalescence process. Here, we have a sequence of masses, and at each integer time we have rules to determine which mass to pick, and a rule for how to break it into two pieces. It is certainly not the case that for an arbitrary splitting rule and sequences v_\bullet,w_\bullet, the one-step fragmentation of the Gibbs partition on n gives the corresponding Gibbs partition on (n-1).

CLT for random permutations

For the final demonstration of the use of Bell polynomials, I am going to sketch the outline of a solution to exercise 1.5.4, which shows that the number of cycles in a uniformly chosen permutation satisfies a CLT. This is not at all obvious, since the number of permutations of [n] with k cycles is given by B_{n,k}((\bullet-1)!) and there is certainly no simple form for this, so the possibility of doing a technical limiting argument seems slim.

For ease of notation, we copy Pitman and write c_{n,k}:=B_{n,k}((\bullet-1)!) as before. First we show exercise 1.2.3, which asserts that

x(x+1)\ldots(x+(n-1))=\sum_{k=1}^n c_{n,k}x^k.

We argue combinatorially. The RHS is the number of ways to choose \sigma\in S_n and a colouring of [n] with x colours such that the orbits of \sigma are monochromatic. We prove that the LHS also counts this by induction on the number of vertices. We claim there is a 1-to-(x+n) map from configurations on n vertices to configurations on (n+1) vertices. Given \sigma\in S_n and a colouring, for any a\in[n], we construct \sigma_a\in S_{n+1} by \sigma_a(a)=n+1, \sigma_a(n+1)=\sigma(a), and \sigma_a(y)=\sigma(y) for all other y. We give n+1 the same colour as a. This gives us n possibilities. Alternatively, we can map (n+1) to itself and give it any colour we want. This gives us x possibilities. A slightly more careful argument shows that this is indeed a 1-to-(x+n) map, which is exactly what we require.

So the polynomial

A_n(z)=\sum_{k=0}^nc_{n,k}z^k,

has n real zeros, at 0,-1,\ldots,-(n-1). Dividing by A_n(1)=n! gives

\frac{A_n(z)}{A_n(1)}=\prod_{i=1}^n\frac{z+(i-1)}{i},\quad\text{so}\quad \frac{c_{n,k}}{A_n(1)}=\mathbb{P}(X_1+\ldots+X_n=k),

where the X_i are independent Bernoulli trials with \mathbb{P}(X_i=1)=\frac1i, since the ith factor above is exactly the PGF of such a trial. The number of cycles K_n is then distributed as this sum, and so it becomes a simple matter to verify the CLT by checking that the variance grows appropriately. As both mean and variance are asymptotically \log n, we can conclude that:

\frac{K_n - \log n}{\sqrt{\log n}}\stackrel{d}{\rightarrow} N(0,1).
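This is pleasant to see in simulation; a numpy sketch, with arbitrary parameters. Note that the O(1) centring error coming from the Euler-Mascheroni constant only decays like 1/\sqrt{\log n}.

import numpy as np

rng = np.random.default_rng(0)

def num_cycles(perm):
    seen = [False] * len(perm)
    count = 0
    for i in range(len(perm)):
        if not seen[i]:
            count += 1          # found the start of a new cycle
            j = i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return count

n, trials = 2000, 2000
ks = np.array([num_cycles(rng.permutation(n)) for _ in range(trials)])
z = (ks - np.log(n)) / np.sqrt(np.log(n))
print(z.mean(), z.std())   # near 0 and 1; the mean is off by roughly gamma/sqrt(log n)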

In a future post, I want to give a quick outline of section 1.3, which details how Bell polynomials can be surprisingly useful for finding the moments of infinitely divisible distributions.

Uniform Spanning Trees

For applications to random graphs, the local binomial structure and independence mean that the Galton-Watson branching process is a useful structure to consider embedding in the graph. In several previous posts, I have shown how we can set up the so-called exploration process which visits the sites in a component as if the component were actually a tree. The typical degree is O(1), and so in particular small components will be trees with high probability in the limit. In the giant component for a supercritical graph, this is not the case, but it doesn’t matter, as we ignore vertices we have already explored in our exploration process. We can consider the excess edges separately by ‘sprinkling’ them back in once we have the tree-like backbone of all the components. Again, independence is crucial here.

I am now thinking about a new model. We take an Erdos-Renyi process as before, with edges arriving at some fixed rate, but whenever a cycle appears, we immediately delete all the edges that make up the cycle. Thus at all times the system consists of a collection (or forest) of trees on the n vertices. So initially this process will look exactly like the normal E-R process, but as soon as the components start getting large, we start getting excess edges which destroy the cycles and make everything small again. The question to ask is: if we run the process for long enough, roughly how large are all the components? It seems unlikely that the splitting mechanism is so weak that we will get true giant components forming, ie O(n) sizes, so we might guess that, in common with some other split-merge models of this type, we end up with components of size n^{2/3}, as in the critical window for the E-R process.
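For concreteness, here is a rough simulation sketch of this process using networkx. The parameters are arbitrary, and this is only intended to make the model precise rather than to settle the question.

import random
import networkx as nx

def cycle_deletion_process(n, steps, seed=0):
    rng = random.Random(seed)
    g = nx.empty_graph(n)
    for _ in range(steps):
        u, v = rng.sample(range(n), 2)
        if g.has_edge(u, v):
            continue
        if nx.has_path(g, u, v):
            # the new edge would close a cycle: delete the whole cycle,
            # i.e. drop the new edge and the unique tree path from u to v
            path = nx.shortest_path(g, u, v)
            g.remove_edges_from(zip(path, path[1:]))
        else:
            g.add_edge(u, v)
    return g

g = cycle_deletion_process(n=500, steps=5000)
print(max(len(c) for c in nx.connected_components(g)))   # largest tree size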

In any case, the scaling limit process is likely to have components whose sizes grow with n, so we will have a class of trees larger than those we have considered previously, which have typically been O(1). So it’s worth thinking about some ways to generate random trees on a fixed number of vertices.

Conditioned Galton-Watson

Our favourite method of creating trees is inductive. We take a root and connect the root to a number of offspring given by a fixed distribution, and each of these has a number of offspring given by an independent sample from the same distribution, and so on. The natural formulation gives no control over the size of the tree. This is a random variable whose distribution depends on the offspring distribution, and which can in some circumstances be computed explicitly, for example when the offspring distribution is geometric. In other cases, it is easier to make recourse to generating functions, or to a random walk analogue as described in the exploration process discussion.

Of course, there is nothing to stop us conditioning on the total size of the population. This is equivalent to conditioning on the hitting time of -1 for the corresponding random walk, and a conditioned version of Donsker’s theorem then gives convergence of the rescaled walk towards a Brownian excursion. Note that there is no a priori labelling for the resulting tree. This will have to be supplied later, with breadth-first and depth-first the most natural choices, which might cause annoyance if you actually want to use it. In particular, it is not obvious, and probably not true unless you are careful, that the distribution is invariant under permuting the labels (having initially assumed 1 is the root, etc), which is not ideal if you are embedding into the complete graph.

However, we would like to have some more direct constructions of random trees on n vertices. We now consider perhaps the two best known such methods. These are of particular interest as they are applicable to finding random spanning trees embedded in any graph, rather than just the complete graph.

Uniform Spanning Tree

Given a connected graph, consider the set of all subgraphs which are trees and span the vertex set of the original graph. An element of this set is called a spanning tree. A uniform spanning tree (UST) is chosen uniformly at random from the set of spanning trees on the complete graph on n vertices. A famous result of Arthur Cayley says that the number of such spanning trees is n^{n-2}. There are various neat proofs, many of which consider a mild generalisation which gives us a more natural framework for using induction. This might be a suitable subject for a subsequent post.

While there is no objective answer to the question of what is the right model for random trees on n vertices, this is what you get from the Erdos-Renyi process. Formally, conditional on the sizes of the (tree) components, the structures of the tree components are given by UST.

To see why this is the case, observe that when we condition that a component has m vertices and is a tree, we are demanding that it be connected and have m-1 edges. Since the probability of a particular configuration appearing in G(n,p) is a function only of the number of edges in the configuration, it follows that the probability of each spanning tree on the m vertices in question is equal.

Interesting things happen when you do this dynamically. That is, if we have two USTs of sizes m and n at some time t, and condition that the next edge to be added in the process joins them, then the resulting component is not a UST on m+n vertices. To see why, consider the probability of a ‘star’, that is a tree with a single distinguished vertex to which every other vertex is joined. The probability that the UST on m vertices is a star is \frac{m}{m^{m-2}}=m^{-(m-3)}. By contrast, for m,n\geq 2, it is not possible to obtain a star on m+n vertices by joining a tree on m vertices and a tree on n vertices with an additional edge.
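The star probability is transparent via the Prufer correspondence from the earlier post: a uniform labelled tree on m vertices corresponds to a uniform sequence in [m]^{m-2}, and the stars are exactly the constant sequences. A quick empirical sketch:

import random

rng = random.Random(0)
m, trials = 6, 200_000
# a uniform labelled tree on m vertices corresponds to a uniform Prufer
# sequence in [m]^{m-2}; a star with centre c is the constant sequence (c,...,c)
stars = sum(len(set(rng.choices(range(m), k=m - 2))) == 1 for _ in range(trials))
print(stars / trials, m**(-(m - 3)))   # both should be near 1/216 for m = 6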

However, I think the UST property is preserved by the cycle deletion mechanism mentioned at the very start of this post. My working has been very much of the back-of-the-envelope variety, but I am fairly convinced that if you take a UST, add an edge and delete the resulting cycle, then, conditional on the sizes of the smaller trees which result, each of these is itself uniform among spanning trees of its vertex set. My argument is that you might as well fix the cycle to be deleted, then condition on how many vertices are in each of the trees coming off this cycle. Now the choice of each of these trees is clearly uniform among spanning trees on the correct number of vertices.

However, it is my current belief that the combination of these two mechanisms does not give UST-like trees even after conditioning on the sizes at fixed time.