Parking on a ring, linear hashing

I’ve spent most of my doctorate trying to analyse how adding destructive dynamics affects the behaviour of a particular random growth process, the classical random graph. In this post I’m going to talk about another random growth process, which is slightly less natural, but for which one can show some similar qualitative properties.

The model, and the additive coalescent

Consider m places arranged in a circle, and for consistency of analogy we think of these as parking spaces. Some number n of cars will arrive one at a time. Each car will arrive at a space chosen uniformly at random. If it is empty they will park in it, otherwise they will look clockwise until they find an empty space, and park there. For now we are only interested in growth, so we assume cars never leave. We are interested in the sizes of blocks of consecutively parked cars.

The reason to consider this slightly unnatural statement is its equivalence to the problem of hashing with linear probing, apparently a key topic in computer science, which I won’t pretend that I know anything about. In any case, it’s a nice model, and it seems reasonable that it would have a basis in more realistic search algorithms.

So, how does the sequence of sizes of blocks of consecutively parked cars grow? Well, given the sequence of block sizes, it is reasonably easy to convince yourself that the order of the blocks around the circle is uniformly random, and the number of empty spaces between adjacent blocks is also uniformly random.

Assume for now that there are at least three blocks. A block of size x can merge with a block of size y with the arrival of the next car only if the blocks are adjacent, with exactly one empty space between them. The chance of this is uniform among all pairs of blocks. Now suppose this is the case, and that the block of size y lies clockwise from the block of size x. Then they will merge precisely if the next car arrives at any of the x occupied spaces in that block, or at the empty space between the pair of blocks. This has probability \frac{x+1}{m}. There’s also the opposite ordering to consider, where the block of size x lies clockwise from the other. The total probability of this merge \{x,y\}\mapsto \{x+y+1\} is therefore proportional to (x+y+2).

So the process of block sizes looks a bit like the additive coalescent, at least for large blocks. This is in contrast to the random graph process, where the sequence of component sizes behaves exactly like a multiplicative coalescent, where blocks merge at a rate proportional to the product of their sizes.

Asymptotics

As in the random graph process, it’s interesting to ask roughly how large the largest block will be in such a configuration. Pittel [3] considers the case where the number of empty places \ell = m-n \approx \beta m, for some \beta\in (0,1).

A less interesting model would be to choose the positions of the n cars uniformly at random. But then the size of a block is roughly geometric with parameter \beta, and there are \Theta(m) blocks with high probability. Relatively straightforward calculations in extreme value theory suggest that the largest block is likely to have size on the order of \log m in this setting.

Of course, the actual model is slightly more complicated, because the size of a block is self-reinforcing, since larger blocks are more likely to grow than smaller blocks. However, we can still get somewhere with naïve estimates. Let’s label the places clockwise. Then in order for there to be a block starting at 0 and stretching beyond \alpha \log m, a necessary condition is that at least \alpha \log m cars arrive at those places. The number of cars which arrive at those places is binomial, since there are n cars, and each arrives at a place chosen uniformly, and independently of the other cars. So this event corresponds to

\mathrm{Bin}(n,\frac{\alpha \log m}{m}) \ge \alpha \log m.

Then, since n\approx (1-\beta)n, this event corresponds approximately to

\mathrm{Po}((1-\beta)\alpha \log m) \ge \alpha \log m.

The probability that a Poisson RV is at least a constant multiple larger than its mean decays exponentially with the mean, hence in this case the probability is asymptotically some negative power of m, depending on the value of \alpha. But there are O(m) possible places for such a block to start, so whether we can apply a union bound usefully or not depends on whether the power of m is strictly less than -1.

Since all of this depends on \alpha, it is reasonable that everything is fine, and the largest block does have size at least \alpha \log m when \alpha is small, and very unlikely when \alpha is large. This heuristic argument fits with Pittel’s theorem. Indeed, his result shows much stronger concentration: that the fluctuations of the size of the largest block are O(1).

Critical regime and empirical processes

The following is a paraphrase of the introduction and some methods from [2].

Obviously, once m=m cars have arrived, there’s no room for manoeuvre and definitely all the places are taken in one giant block. But it’s not obvious in general what scaling for the number of gaps will give rise to giant blocks of \Theta(m) cars.

As for the random graph, we can find a process similar to the exploration process of a (random) graph which encodes much of the information we care about. Let Y_k be the number of cars which arrive at place k. So the sum of the Y_ks will be n, the total number of cars. Now consider the process

C_0=0, \ldots, C_{k+1}=C_k + Y_{k+1}-1.

A block has the property that the number of arrivals within that set of places is equal to the number of places. So every time this *empirical process* C drops below its previous running minimum, this indicates the end of a block. To make this equivalence precise, we need to be a bit careful about where we start counting. It works exactly if we start at the beginning of a block. If not, it might introduce some unwanted divisions within the first block.

What we have is a process that looks roughly like a random walk that is constrained to pass through the point (m,n-m), which is equal to (m,-l). Even if we aren’t totally precise about how this is like a random walk, we would expect to see Brownian fluctuations after rescaling. Indeed, we might expect to see a Brownian bridge added to a deterministic linear function with negative gradient. But this is only meaningful if the random part is at least as large as the deterministic part, and since the fluctuations have order \sqrt{m}, if l is much larger than this, the rescaled empirical process is essentially deterministic, so we won’t see any macroscopic excursions above the minimum.

If l is substantially smaller than \sqrt{m}, then there is no real difference between (m,-l) and (m,0), and what we see is just a Brownian bridge. At this point, where we choose to start the process is actually important. If we were to start it at the minimum of the Brownian bridge instead, we would have seen a Brownian excursion, which corresponds to one block occupying (almost) all of the places.

Unsurprisingly, the story is completed by considering \ell=\Theta(\sqrt{m}), where the rescaled empirical process looks like a slanted Brownian bridge, that is Brownian motion conditioned to pass through $(1,-\frac{\ell}{\sqrt{m})$. There isn’t an obvious fix to the question of where to start the process, but it turns out that the correct way is now adding a Brownian excursion onto the deterministic linear function with gradient - \frac{\ell}{\sqrt{m}}. It’s now reasonable that the excursions above the minimum should macroscopic.

This scaling limit works dynamically as well, where the same Brownian excursion is used for different gradients of the deterministic line, corresponding to \ell moving through the critical window m-\Theta(\sqrt{m}). Finally, a direction to Bertoin’s recent paper [1] for the model with an additional destructive property. Analogous to the forest fire, blocks of cars are removed at a rate proportional to their size (as a result, naturally, of ‘Molotov cocktails’…). Similar effects of self-organised criticality are seen when the rate of bombs is scaled appropriately.

References

[1] – Bertoin – Burning cars in a parking lot (paper / slides)

[2] – Chassaing + Louchard – Phase transition for parking blocks, Brownian excursion and coalescence (arXiv)

[3] – Pittel – Linear probing: the probable largest search time grows logarithmically with the number of records

Advertisements

Enumerating Forests

I’ve just got back from a visit to Budapest University of Technology, where it was very pleasant to be invited to give a talk, as well as continuing the discussion our research programme with Balazs. My talk concerned a limit for the exploration process of an Erdos-Renyi random graph conditioned to have no cycles. Watch this space (hopefully very soon) for a fully rigorous account of this. In any case, my timings were not as slick as I would like, and I had to miss out a chunk I’d planned to say about a result of Britikov concerning enumerating unrooted forests. It therefore feels like an excellent time to write something again, and explain this paper, which you might be able to find here, if you have appropriate journal rights.

We are interested to calculate a_{n,m} the number of forests with vertex set [n] consisting of m unrooted trees. Recall that if we were interested in rooted trees, we could appeal to Prufer codes to show that there are m n^{n-m-1} such forests, and indeed results of Pitman give a coalescent/fragmentation scheme as m varies between 1 and n-1. It seems that there is no neat combinatorial re-interpretation of the unrooted case though, so Britikov uses an analytic method.

We know that

a_{n,m}= \frac{n!}{m!} \sum_{\substack{k_1+\ldots+k_m=n\\ k_i\ge 1}} \prod_{j=1}^m \frac{k_j^{k_j-2}}{k_j!}.

To see this, observe that the k_js correspond to the sizes of the m trees in the forest; \frac{n!}{\prod k_j!} gives the multinomial number of ways to assign vertices to the trees; given the labels for a tree of size k_j, there are k_j^{k_j-2} ways to make up the tree itself; and \frac{1}{m!} accounts for the fact that the trees have no order.

What we would really like to do is to take the uniform distribution on the set of all labelled trees, then simulate m IID copies of this distribution, and condition the union to contain precisely n vertices. But obviously this is an infinite set, so we cannot choose uniformly from it. Instead, we can tilt so that large trees are unlikely. In particular, for each x we define

\mathbb{P}(\xi=k) \propto \frac{k^{k-2} x^k}{k!},

and define the normalising constant

B(x):= \sum_{k\ge 1} \frac{k^{k-2}x^k}{k!},

whenever it exists. It turns out that x\le e^{-1} is precisely the condition for B(x)<\infty. Note now that if \xi_1,x_2,\ldots are IID copies of \xi, then

\mathbb{P}(\xi_1+\ldots+\xi_m=n) = \frac{x^n}{B(x)^m} \sum_{k_1+\ldots + k_m=n} \prod_{j=1}^m \frac{k_j^{k_j-2}}{k_j!},

and so we obtain

a_{n,m}= \frac{n!}{m!} \frac{B(x)^m}{x^n} \mathbb{P}(\xi_1+\ldots + \xi_m=n).

So asymptotics for a_{n,m} might follows from laws of large numbers of this distribution \xi.

So far, we haven’t said anything about how to choose this value x. But observe that if you want to have lots of trees in the forest, then the individual trees should generally be small, so we take x small to tilt away from a preference for large trees. It turns out that there is a similar interpretation of criticality for forests as for general graphs, and taking x equal to 1/e, its radius of convergence works well for this setting. If you want even fewer trees, there is no option to take x larger than 1/e, but instead one can use large deviations machinery rather than laws of large number asymptotics.

We will be interested in asymptotics of the characteristic function of \xi for x=1/e. In particular \mathbb{E}[e^{it\xi}]=\frac{B(xe^{it})}{B(x)}, and it will be enough to clarify the behaviour of this as t\rightarrow 0. It’s easier to work with a relation analytic function

\theta(x)=\sum_{k\ge 1} \frac{k^{k-1}x^k}{k!},

ie the integral of B. What now feels like a long time ago I wrote a masters’ thesis on the subject of multiplicative coalescence, and this shows up as the generating function of the solutions to Smoluchowski’s equations with monodisperse initial conditions, which are themselves closely related to the Borel distributions. In any case, several of the early papers on this topic made progress by establishing that the radius of convergence is 1/e, and that \theta(x)e^{-\theta(x)}=x everywhere where |x|\le 1/e. We want to consider x=1/e, for which \theta=1.

Note that \mathbb{E}\xi = \frac{\theta(x)}{B(x)}, so we will make progress by relating B(x),\theta(x) in two ways. One way involves playing around with contour integrals in a fashion that is clear in print, but involves quite a lot of notation. The second way is the Renyi relation which asserts that \theta(x)=B(x)+\frac{\theta(x)^2}{2}. We will briefly give a combinatorial proof. Observe that after multiplying through by factorials and interpreting the square of a generating function, this is equivalent to

k^{k-1} = k^{k-2} + \frac12 \sum_{\substack{l+m=k\\l,m\ge 1}} l^{l-1}m^{m-1}\binom{k}{l},

for all k. As we might expect from the appearance of this equality, we can prove it using a bijection on trees. Obviously on the LHS we have the size of the set of rooted trees on [k]. Now consider the set of pairs of disjoint rooted trees with vertex set [k]. This second term on the RHS is clearly the size of this set. Given an element of this set, join up the two roots, and choose whichever root was not initially in the same tree as 1 to be the new root. We claim this gives a bijection between this set, and the set of rooted trees on [k], for which 1 is not the root. Given the latter, the only pair of trees that leads to the right rooted tree on [k] under this mapping is given by cutting off the unique edge incident to the root that separates the root and vertex 1. In particular, since there is a canonical bijection between rooted trees for which 1 is the root, and unrooted trees (!), we can conclude the Renyi relation.

The Renyi relation now gives \mathbb{E}\xi = \frac{\theta(x)}{B(x)}=2 when x=1/e. If we wanted, we could show that the variance is infinite, which is not completely surprising, as the parameter x lies on the radius of convergence of the generating function.

Now, playing around with contour integrals, and being careful about which strands to take leads to the asymptotic as t\rightarrow 0

\mathbb{E}[ e^{it\xi}] = 1+2it + \frac{2}{3}i |2t|^{3/2} (i\mathrm{sign}(t))^{3/2} + o(|t|^{3/2}).

So from this, we can show that the characteristic function of the rescaled centred partial sum \frac{\xi_1+\ldots+\xi_N-2N}{bN^{2/3}} converges to \exp(-|t|^{3/2}\exp(\frac{i\pi}{4}\mathrm{sign} t)), where b= (32/9)^{1/3} is a constant arising out of the previous step.

We recognise this as the characteristic function of the stable distribution with parameters 3/2 and -1. In particular, we know now that \xi is in the domain of attraction for a stable-3/2 distribution. If we wanted a version of the central limit theorem for such partial sums, we could have that, but since we care about the partial sums of the \xi_is taking a specific value, rather than a range of values on the scale of the fluctuations, we actually need a local limit theorem.

To make this clear, let’s return to the simplest example of the CLT, with some random variables with mean \mu and variance \sigma^2<\infty. Then the partial sums satisfy

\mathbb{P}(\mu N + a\sigma\sqrt{N} \le S_N \le \mu_N+b\sigma\sqrt{N}) \rightarrow \int_a^b f_{\mathcal N}(x)dx,

as N\rightarrow\infty. But what about the probability of S_N taking a particular value m that lies between \mu N+a\sigma \sqrt{N} and \mu N + b\sigma \sqrt{N}? If the underlying distribution was continuous, this would be uncontroversial – considering the probability of lying in a range that is smaller than the scale of the CLT can be shown in a similar way to the CLT itself. A local limit theorem asserts that when the underlying distribution is supported on some lattice, mostly naturally the integers, then these probabilities are in the limit roughly the same whenever m is close to \mu N+a\sigma\sqrt{N}.

In this setting, a result of Ibragimov and Linnik that I have struggled to find anywhere in print (especially in English) gives us local limit theory for integer-supported distributions in the domain of attraction of a stable distribution. Taking p( ) to be the density of this distribution, we obtain

bm^{2/3}\mathbb{P}(\xi_1+\ldots+\xi_m=n) - p(\frac{n-2m}{b m^{2/3}}) \rightarrow 0

as n\rightarrow\infty, uniformly on any set of m for which z= \frac{n-2m}{bm^{2/3}} is bounded. Conveniently, the two occurrences of b clear, and Britikov obtains

a_{n,m} = (1+o(1)) \frac{\sqrt{2\pi} n^{n-1/6}}{2^{n-m}(n-m)!} p(\frac{n-2m}{n^{2/3}},

uniformly in the same sense as before.

Duality for Interacting Particle Systems

Yesterday I introduced the notion of duality for two stochastic processes. My two goals for this post are to elaborate on the idea of why duality is useful, which we touched on in passing in the previous part, and to discuss duality of interacting particle systems. In the latter case, there are often nice ways to consider the forward and backward processes together that make the relation somewhat more natural.

The starting point is to assume a finite state space. This will be reasonable when we start to consider interacting particle systems, eg on \{0,1\}^{[n]}. As before, call the spaces R and S, and a duality function H(x,y). Since the state-spaces are finite, it is entirely natural to think of this as a matrix, and hence as an operator. Of course, a function defined on a finite state-space can be thought of as a vector, so it is clear what this operator will actually operate on. (I’ve chosen H rather than h for the duality function so it is more clear that it is acting as an operator here.)

We have some choice about which way round to define it, but for now let’s say that given some function f(.) on S

Hf(x):=\sum_{y\in S} H(x,y)f(y).

Note that this is a) exactly the definition of matrix (left-)multiplication; b) We should think of Hf as a function on R – perhaps (Hf)(x) might be more clear? and c) the operator H acts \mathbb{R}^S\rightarrow \mathbb{R}^R. If we want the corresponding operator \mathbb{R}^R\rightarrow\mathbb{R}^S, we simply multiply by H on the right instead.

But note also that the generator of a finite state-space Markov process is also a matrix, indeed a Q-matrix. So if we take our definition of the duality function as

\mathcal{G}_X h(x,y)=\mathcal{G}_Y h(x,y),

which, importantly, holds for all x,y, we can convert this into an algebraic form as

\mathcal{G}_X H = H \mathcal{G}_Y^\dagger.

In the same way that n-step transition probabilities for a discrete-time Markov chain are given by the product of the one-step transition matrix, general time transition probabilities for a continuous-time Markov chain are given by exponents of the Q-matrix. In particular, if X and Y have transition kernels P and Q respectively, then P_t=e^{tG_X}, and after doing some manipulation, we can show that

P_t H=H Q_t^\dagger,

also. This is really useful as in general we would hope that H might be invertible, from which we derive

P_t=HQ_t^\dagger H^{-1}.

So this is a powerful statement about the relationship between the evolutions of the two processes. In particular, it shows a correspondence (given by H) between left eigenvectors of P, and right eigenvectors of Q, and vice versa naturally.

The reason why this is useful rather than merely cute, is that when we re-interpret everything in terms of the original stochastic processes, we get a map between stationary distributions of X, and harmonic functions of Y. Stationary distributions are often hard to describe in any terms other than the left-1-eigenvector, or through some convergence property that is typically hard to work with. Harmonic functions, on the other hard, can be much more tractable. An example of a harmonic function is the survival probability started from a given state. This is useful for specifying the stationary distribution, but perhaps even more so for describing properties of the set of stationary distributions. In particular, uniqueness and existence are carried across this equivalence. So, for example, if the dual does not survive almost surely, then this says the only stationary measure is zero, and so the process is transient or similar.

Jan Swart’s course in Luminy last October dealt with duality, with a focus mainly on interacting particle systems. There are a couple of themes I want to talk about, without going into too much detail.

A typical interacting particle system will take place on a locally finite graph. At each vertex, there is either a particle, or there isn’t. Particles move between adjacent vertices, and sometimes interact with particles at adjacent vertices. These interactions might involve branching or coalescence. We will discuss shortly the set of possible forms such interaction might take. The state space is \{0,1\}^{V(G)}, with G the underlying graph. Then given a state, there is some set of actions which might happen next, and we consider the possibility that they happen with exponential rates.

At this stage, it seems like the initial configuration is important, as this affects what set of moves can happen immediately, and also thereafter. It is not clear how quickly this dependence fades. One useful idea is not to restrict ourselves to interactions involving the particles currently present in the system, but instead to consider a Poisson process of all possible interactions. Only the moves actually permitted by the current state will happen, but having this extra information allows for coupling between initial configurations.

It’s probably easier to consider a concrete example. The picture below shows the set-up for a branching random walk up an integer lattice. Each particle moves to one of the two state directly above its current state, or it branches and sends particles to both of them.DSC_2589In the diagram, we have glued arrows onto every state at every time, which tells us what to do if there is a particle there at each time. As a coupling, we can now think of the process as a deterministic walk through a random environment. The environment is given by some probability space, which in continuous time might have the appearance of a Poisson process on the set of ‘moves’, and the initial condition of the walk is up to us.

We can generalise this to a broader class of interacting particle systems. If we want all interactions to be between pairs of adjacent states, there are six possible things which could happen:

  • Annihilation: two adjacent particles destroy each other. ( 11 -> 00 )
  • Branching: one particle becomes two particles. ( 01 or 10 -> 11 )
  • Coalescence: two particles merge. ( 11 -> 01 or 10 )
  • Death: A particle is removed. ( 01 or 10 -> 00 )
  • Exclusion: a particle moves. ( 01 -> 10 )
  • Birth: a particle is created. ( 00 -> 01 or 10 )

For now we exclude the possibility of birth. Note that the way we have set this up involving two-site interactions excludes the possibility of a particle trying to move to an already-occupied site.

DSC_2588Let us say that in process X the rates at which each of these events happen are a, b, c, d and e, taking advantage of the helpful choice of naming. There is some flexibility about whether the rates are the same between every pair of vertices of note. For this post we assume that they are. Then it is a result of Lloyd and Sudbury that given some real q\neq 1, the process X’ with corresponding rates given by:

a'=a+2q\gamma, b'=b+\gamma, c'=c-(1+q)\gamma, d'=d+\gamma, e'=e-\gamma,

for \gamma:= \frac{a+c-d+qb}{1-q},

is dual to X, with duality function given by h(Y,Z)=q^{|Y\cap Z|}, for Y and Z possible states.

I want to make two comments:

1) This illustrates one of the differences between the dual and the time-reversal. It is clear that the time-reversal of branching is coalescence and vice versa, and exclusion is invariant under time-reversal. But the time-reversal of death is definitely birth, but there is no birth component in the dual of a process which features death. I don’t have a strong intuition for why this is the case, but see the final paragraph of this post. However, at least it seems plausible that both processes might simultaneously be recurrent, since in the dual, both the branching rate and the death rate have increased by the same amount.

2) This settles one problem of uniqueness of the dual that I mentioned last time, since we can vary q and get a different dual to the same original process. For example, in the voter model, we have b=d=1, and a=c=e=0, as in any update, the opinions of neighbours which were previously different become the same. Anyway, for any q\in[-1,0] there is a choice of dual, where at the extremes q=0 corresponds to coalescing random walk, and q=-1 to annihilating random walk. (Note that the duality function for q=0 is the indicator function that the systems are different.)

DSC_2590

As a final observation without much justification, suppose we add in arrows in the gaps of the branching random walk picture we had earlier, and direct them in the opposite direction. It turns out that this corresponds precisely to the dual of the process. This provides an appealing visual idea of why the dual of branching might be death. It also supports the general idea based on the coupling described earlier that the dual process is in some sense a deterministic walk in the opposite direction through the random environment specified by the original process.

REFERENCES

J.M. Swart – Duality and Intertwining of Markov Chains (mainly using chapters 2.1 and 2.7)

Thanks for Daniel Straulino for direction towards the branching random walk duality example.

Enhanced by Zemanta

Duality for Stochastic Processes

In the past couple of weeks, we’ve launched a new junior probability seminar in Oxford. (If you’re reading this and would like to come, please drop me an email!) The first session featured Daniel Straulino talking about his work on the Spatial Lambda Fleming-Viot process, a model for the evolution of populations allowing for selection and geometry. A lively discussion of duality broke out halfway through, following which it seemed reasonable to devote another session to this topic. Luke Miller presented some classical and less classical examples of the theory this afternoon. I found all of this interesting and relevant, and thought it would be worth writing some things down, and tying it in with one of the courses on this subject that we attended at ALEA in Luminy last October.

The majority of this article is based on Luke’s talk. Errors, omissions and over-simplifications are of course my own.

The setup is that we have two stochastic processes X_t\subset R, Y_t\subset S. For now we make no assertion about whether the two state spaces R and S are the same or related, and we make no comment on the dependence relationship between X and Y. Let P_x,Q_y be the respective probability measures, representing starting from x and y respectively. Then given a bivariate, measurable function h(.,.) on R x S, such that:

E^P_x h(X_t,y)=E^Q_y h(x,Y_t),\quad \forall x,y\quad\forall t,

then we say X and Y are dual with respect to h.

The interpretation should be that X is a process forwards in time, and Y is a process backwards in time. So X_t, Y_0 represent the present, while X_0, Y_t represent the past, which is the initial time for original process X. The fact that the result holds for all times t allows us to carry the equality through a derivative, to obtain an equality of generators:

\mathcal{G}^X h(x,y)=\mathcal{G}^Y h(x,y),\quad \forall x,y.

On the LHS, the generator acts on x, while on the RHS it acts on y. Although it still isn’t obvious (at least to me) when a pair of processes might have this property, especially for an arbitrary function, this seems the more natural definition to think about.

Note that this does indeed require a specific function h. There were murmurings in our meeting about the possibility of a two processes having a strong duality property, where this held for all h in some broad class of test functions. On more reflection, which may nonetheless be completely wrong, this seems unlikely to happen very often, except in some obviously degenerate cases, such as h constant. If this holds, then as the set of expectations of a class of functions of a random variable determines the distribution, we find that the instantaneous behaviour of Y is equal in distribution to the instantaneous behaviour of X when started from fixed (x,y). It seems unlikely that you might get many examples of this that are not deterministic or independent (eg two Brownian motions, or other space-, time-homogeneous Markov process).

Anyway, a canonical example of this is the Wright-Fisher diffusion, which provides a simple model for a population which evolves in discrete-time generations. We assume that there are two types in the population: {A,a} seems to be the standard notation. Children choose their type randomly from the distribution of types in the previous generation. In other words, if there are N individuals at all times, and X_k is the number of type A individuals, then:

X_{k+1} | X_k \stackrel{d}{=} \mathrm{Bin}(N, \frac{X_k}{N}).

It is not hard to see that in a diffusion limit as the number of individuals tends to infinity, the proportion of type A individuals is a martingale, and so the generator for this process will not depend on f’. In fact by checking a Taylor series, we can show that:

\mathcal{G}_{WF}f(x)=\frac{1}{2} x(1-x)f''(x),

for all f subject to mild regularity conditions. In particular, we can show that for f_n(x)=x^n, we have:

\mathcal{G}_{WF} f_n(x)=\binom{n}{2}(f_{n-1}(x)-f_n(x))

after some rearranging. This looks like the generator of a jump process, indeed a jump process where all the increments are -1. This suggests there might be a coalescent as the dual process, and indeed it turns out that Kingman’s coalescent, where any pair of blocks coalesce at uniform rate, is the dual. We have the relation in expectation:

\mathbb{E}_x[X_t^n]= \mathbb{E}_n[x^{N_t}],

where the latter term is the moment generating function of the number of blocks at time t of Kingman’s coalescent started from n blocks.

In particular, we can control the moments of the Wright-Fisher diffusion using the mgf of the Kingman’s coalescent, which might well be easier to work with.

That’s all very elegant, but we were talking about why this might be useful in a broader setting. In the context of this question, there seems to be an obstacle towards applying this idea above more generally. This is an underlying idea in population genetics models that as well as the forward process, there is also a so-called ancestral process looking backwards in time, detailing how time t individuals are related historically. It would be convenient if this process, which we might expect to be some nice coalescent, was the dual of the forward process.

But this seems to be a problem, as duals are a function of the duality function, so do we have uniqueness? It would not be very satisfying if there were several coalescents processes that could all be the dual of the forward process. Though some worked examples suggest this might not happen, because a dual and its duality function has to satisfy too many constaints, there seems no a priori reason why not. It seems that the strength of the results you can derive from the duality relation is only as strong as the duality relation itself. This is not necessarily a problem from the point of view of applications, so long as the duality function is something it might actually be useful to work with.

It’s getting late and this text is getting long, so I shall postpone a discussion of duality for interacting particle systems until tomorrow. In summary, by restricting to a finite state space, we can allow ourselves the option of a more algebraic approach, from which some direct uses of duality can be read off. I will also mention a non-technical but I feel helpful way to view many examples of duality in interacting particle systems as deterministic forward and backwards walks through a random environment, in what might be considering an extreme example of coupling.

Enhanced by Zemanta

Hall’s Marriage Theorem

Hall’s Marriage Theorem gives conditions on when the vertices of a bipartite graph can be split into pairs of vertices corresponding to disjoint edges such that every vertex in the smaller class is accounted for. Such a set of edges is called a matching. If the sizes of the vertex classes are equal, then the matching naturally induces a bijection between the classes, and such a matching is called a perfect matching.

The number of possible perfect matchings of K_{n,n} is n!, which is a lot to check. Since it’s useful to have bijections, it’s useful to have matchings, so we would like a simple way to check whether a bipartite graph has a matching. Hall’s Marriage Theorem gives a way to reduce the number of things to check to 2^n, which is still large. However, much more importantly, the condition for the existence of a matching has a form which is much easier to check in many applications. The statement is as follows:

Given bipartite graph G with vertex classes X and Y, there is a matching of X into Y precisely when for every subset A\subset X, |\Gamma(A)|\ge |A|, where \Gamma(A) is the set of vertices joined to some vertex in A, called the neighbourhood of A.

Taking A=X, it is clear that |Y|\ge |X| is a necessary condition for the result to hold, unsurprisingly. Perhaps the most elementary standard proof proceeds by induction on the size of X, taking the smallest A to give a contradiction, then using the induction hypothesis to lift smaller matchings up to the original graph. This lifting is based on the idea that a subset relation between sets induces a subset relation between their neighbourhoods.

In this post, I want to consider this theorem as a special case of the Max-Flow-Min-Cut Theorem, as this will support useful generalisations much more easily. The latter theorem is a bit complicated notationally to set up, and I don’t want it to turn into the main point of this post, so I will summarise. The Wikipedia article, and lots of sets of lecture notes are excellent sources of more detailed definitions.

The setting is a weakly-connected directed graph, with two identified vertices, the source, with zero indegree, and the sink, with zero outdegree. Every other vertex lies on a path (not necessarily unique) between the source and the sink. Each edge has a positive capacity, which should be thought of as the maximum volume allowed to flow down the edge. A flow is a way of assigning values to each edge so that they do not exceed the capacity, and there is volume conservation at each interior vertex. That is, the flow into the vertex is equal to flow out of the vertex. The value of the flow is the sum of the flows out of the source, which is necessarily equal to the sum of flows into the sink.

A cut is a partition of the vertices into two classes, with the source in one and the sink in the other. The value of a cut is then the sum of the capacities of any edges going from the class containing the source to the class containing the sink. In most examples, the classes will be increasing, in the sense that any path from source to sink changes class exactly once.

The Max-Flow-Min-Cut Theorem asserts that the maximum value of a flow through the system is equal to the minimum value of a cut. The proof is elementary, though it relies on defining a sensible algorithm to construct a minimal cut from a maximal flow that is not going to be interesting to explain without more precise notation available.

First we explain why Hall’s Marriage Theorem is a special case of this result. Suppose we are given the setup of HMT, with edges directed from X to Y with infinite capacity. We add edges of capacity 1 from some new vertex x_0 to each vertex of X, and from each vertex of Y to a new vertex y_0. The aim is to give necessary and sufficient conditions for the existence of a flow of value |X|. Note that one direction of HMT is genuinely trivial: if there is a matching, then the neighbourhood size condition must hold. We focus on the other direction. If the maximum flow is less than |X|, then there should be a cut of this size as well. We can parameterise a cut by the class of vertices containing the source, say S. Let A=SnX. If \Gamma(A)\not\subset S, then there will be an infinite capacity edge in the cut. So if we are looking for a minimal cut, this should not happen, hence \Gamma(A)\subset S if S is minimal. Similarly, there cannot be any edges from X\A to \Gamma(A). The value of the cut can then be given by

|\Gamma(A)|+|X|-|A|, which is at least |X| if the neighbourhood size assumption is given. Note we can use the same method with the original edges having capacity one, but we have to track slightly more quantities.

This topic came up because I’ve been thinking about fragmentation chains over this holiday. I have a specific example concerning forests of unrooted trees in mind, but won’t go into details right now. The idea is that we often have distributions governing random partitions of some kind, let’s say of [n]. Conditioning on having a given number of classes might give a family of distributions P_{n,k} for the partitions of [n] into k parts. We would be interested to know how easy it is to couple these distributions in a nice way. One way would be via a coalescence or fragmentation process. In the latter, we start with [n] itself, then at each step, split one of the parts into two according to some (random, Markovian) rule. We are interested in finding out whether such a fragmentation process exists for a given distribution.

It suffices to split the problem into single steps. Can we get from P_{n,k} to P_{n,k+1}?

The point I want to make is that this is just a version of Hall’s Marriage Theorem again, at least in terms of proof method. We can take X to be the set of partitions of [n] into k parts, and Y the set of partitions into (k+1) parts. Then we add a directed edge with infinite capacity between x\in X and y\in Y if y can be constructed from x by breaking a part into two. Finally, we connect a fresh vertex x_0 to each edge in X, only now we insist that the capacity is equal to P_{n,k}(x), and similarly an edge from y to y_0 with capacity equal to P_{n,k+1}(y). The existence of a fragmentation chain over this step is then equivalent to the existence of a flow of value 1 in the directed graph network.

Although in many cases this remains challenging to work with, which I will explore in a future post perhaps, this is nonetheless a useful idea to have in mind when it comes to deciding on whether such a construction is possible for specific examples.

Critical Components in Erdos-Renyi

In various previous posts, I’ve talked about the phase transition in the Erdos-Renyi random graph process. Recall the definition of the process. Here we will use the Gilbert model G(n,p), where we have n vertices, and between any pair of vertices we add an edge, independently of other pairs with probability p. We are interested in the sparse scaling, where the typical vertex has degree O(1) in n, and so p=c/n for constant c>0, and we assume throughout that n is large. We could alternatively have considered the alternative Erdos-Renyi model where we choose uniformly at random from the set of graphs with n vertices and some fixed number of edges. Almost all the results present work equally well in this setting.

As proved by Erdos and Renyi, the typical component structure of such a graph changes noticeably around the threshold c=1. Below this, in the subcritical regime, all the components are small, meaning of size at most order O(log n). Above this, in the supercritical regime, there is a single giant component on some non-zero proportion of the vertices. The rest of the graph looks subcritical. The case c=1 exhibits a phase transition between these qualitatively different behaviours. They proved that here, the largest component is with high probability O(n^2/3). It seems that they thought this result held whenever c=1-o(1), but it turns out that this is not the case. In this post, I will discuss some aspects of behaviour around criticality, and the tools needed to treat them.

The first question to address is this: how many components of size n^{2/3} are there? It might be plausible that there is a single such component, like for the subsequent giant component. It might also be plausible that there are n^1/3 such components, so O(n) vertices are on such critical components. As then it is clear how we transition out of criticality into supercriticality – all the vertices on critical components coalesce to form the new giant component.

In fact neither of these are correct. The answer is that for all integers k>0, with high probability the k-th largest component is on a size scale of n^2/3. This is potentially a confusing statement. It looks like there are infinitely many such components, but of course for any particular value of n, this cannot be the case. We should think of there being w(1) components, but o(n^b) for any b>0.

The easiest way to see this is by a duality argument, as we have discussed previously for the supercritical phase. If we remove a component of size O(n^2/3), then what remains is a random graph with n-O(n^2/3) vertices, and edge probability the same as originally. It might make sense to rewrite this probability 1/n as

\frac{1}{n-O(n^{2/3})}\cdot \frac{n-O(n^{2/3})}{n}=\frac{1-O(n^{-1/3})}{n-O(n^{2/3})}.

The approximation in the final numerator is basically the same as

1-o\left(n-O(n^{2/3})\right).

Although we have no concrete reasoning, it seems at least plausible that this should look similar in structure to G(n,1/n). In particular, there should be another component of size

O\left([n-O(n^{2/3})]^{2/3}\right)=O(n^{2/3}).

In fact, the formal proof of this proceeds by an identical argument, only using the exploration process. Because I’ve described this several times before, I’ll be brief. We track how far we have gone through each component in a depth-first walk. In both the supercritical and subcritical cases, when we scale correctly we get a random path which is basically deterministic in the limit (in n). For exactly the same reasons as visible CLT fluctuations for partial sums of RVs with expectation zero, we start seeing interesting effects at criticality.

The important question is the order of rescaling to choose. At each stage of the exploration process, the number of vertices added to the stack is binomial. We want to distinguish between components of size O(n^{2/3}) so we should look at the exploration process at time sn^{2/3}. The drift of the exploration process is given by the expectation of a binomial random variable minus one (since we remove the current vertex from the stack as we finish exploring it). This is given by

\mathbb{E}=\left[n-sn^{2/3}\right]\cdot \frac{1}{n}-1=-sn^{-1/3}.

Note that this is the drift in one time-step. The drift in n^{2/3} time-steps will accordingly by sn^{1/3}. So, if we rescale time by n^{2/3} and space by n^{1/3}, we should get a nice stochastic process. Specifically, if Z is the exploration process, then we obtain:

\frac{1}{n^{1/3}}Z^{(n)}_{sn^{2/3}} \rightarrow_d W_s,

where W is a Brownian motion with inhomogeneous drift -s at time s. The net effect of such a drift at a fixed positive time is given by integrating up to that time, and hence we might say the process has quadratic drift, or is parabolic.

We should remark that our binomial expectation is not entirely correct. We have discounted those sn^{2/3} vertices that have already been explored, but we have not accounted for the vertices currently in the stack. We should also be avoiding considering these. However, we now have a heuristic for the approximate number of these. The number of vertices in the stack should be O(n^{1/3}) at all times, and so in particular will always be an order of magnitude smaller than the number of vertices already considered. Therefore, they won’t affect this drift term, though this must be accounted for in any formal proof of convergence. On the subject of which, the mode of convergence is, unsurprisingly, weak convergence uniformly on compact sets. That is, for any fixed S, the convergence holds weakly on the random functions up to time sn^{2/3}.

Note that this process will tend to minus infinity almost surely. Component sizes are given by excursions above the running minimum. The process given by the height of the original process above the running minimum is called reflected. Essentially, we construct the reflected process by having the same generator when the current value is positive, and forcing the process up when it is at zero. There are various ways to construct this more formally, including as the scaling limit of some simple random walks conditioned never to stay non-negative.

The cute part of the result is that it holds equally well in a so-called critical window either side of the critical probability 1/n. When the probability is \frac{1+tn^{-1/3}}{n}, for any t\in \mathbb{R}, the same argument holds. Now the drift at time s is t-s, though everything else still holds.

This result was established by Aldous in [1], and gives a mechanism for calculating distributions of component sizes and so on through this critical window.

In particular, we are now in a position to answer the original question regarding how many such components there were. The key idea is that because whenever we exhaust a component in the exploration process, we choose a new vertex uniformly at random, we are effectively choosing a component according to the size-biased distribution. Roughly speaking, the largest components will show up near the beginning. Note that a critical O(n^{2/3}) component will not necessarily be exactly the first component in the exploration process, but the components that are explored before this will take up sufficiently few vertices that they won’t show up in the scaling of the limit.

In any case, the reflected Brownian motion ‘goes on forever’, and the drift is eventually very negative, so there cannot be infinitely wide excursions, hence there are infinitely many such critical components.

If we care about the number of cycles, we can treat this also via the exploration process. Note that in any depth-first search we are necessarily only interested in a spanning tree of the host graph. Anyway, when we are exploring a vertex, there could be extra edges to other vertices in the stack, but not to vertices we’ve already finished exploring (otherwise the edge would have been exposed then). So the expected number of excess edges into a vertex is proportional to the height of the exploration process at that vertex. So the overall expected number of excess edges, conditional on the exploration process is the area under the curve. This carries over perfectly well into the stochastic process limit. It is then a calculation to verify that the area under the curve is almost surely infinite, and thus that we expect there to be infinitely many cycles in a critical random graph.

REFERENCES

[1] Aldous D. – Brownian excursions, critical random graphs and the multiplicative coalescent

Coalescence 2: Marcus-Lushnikov Processes and a neat equivalence

Last time, discussed the Smoluchowski equations which define an infinite volume mean-field model for coalescence. Now we consider a stochastic coalescent model, the Marcus-Lushnikov process. Here, we start with a finite collection of finite mass particles, with total mass N. Then we define a process by demanding that given particles with masses and coalesce into a single particle with mass x + y at rate K(x,y)/N.

This can be formalised as a continuous-time Markov chain in different ways. The underlying state space consists of the set of unordered multisets of positive integers which sum to N. But rather than considering the configurations of masses themselves, it is sometimes more convenient to take the state space to be:

\{n=(n_1,\ldots,n_N): \sum xn_x=N\}.

Here n_x records the number of particles with mass x, and the constraint controls conservation of mass. Writing out a transition of this Markov chain is notationally annoying, but they have simple transition rates. The rate of transition from (n_1,\ldots,n_N) to the state where masses x and have coalesced is given by

N^{-1}K(x,y)n_xn_y.

Therefore, the Marcus-Lushnikov process ML^{(N)}_t is the process of component sizes in the finite particle setting of coalescence. The existence and mode of convergence of these processes to the deterministic solutions to Smoluchowski’s equation are of particular interest in many models.

As discussed in the previous post, there is an obvious heuristic link between the multiplicative coalescent and random graph processes. An easy but interesting explicit equivalence can be drawn between the Marcus-Lushnikov process with kernel K(x,y) = xy and monodisperse (that is, starting with unit mass particles) initial conditions and a specific random graph structure.

Proposition: The process ML^{(N)}_t with the conditions above is equivalent to the process of component sizes in \mathcal{G}(N,1-e^{-t/N}).

Proof: First, observe that we can couple the latter process in the obvious way by associating a U[0,1] random variable U_e with each of the \binom{N}{2} edges. The edge e is included at time iff U_e\leq 1-e^{-t/N}. In this process, the appearances of different edges are independent and

\mathbb{P}(\text{edge \emph{e} appears after \emph{t}})=e^{-t/N}.

Therefore the waiting times for a given edge to appear are independent \exp(1/N) RVs. In particular, the edge process is memoryless, hence the component size process is Markov. Then it is genuinely obvious that the rate at which an edge joining given distinct components of sizes and appears is N^{-1}xy. So the evolution is exactly the same as the Marcus-Lushnikov process, and the initial configuration is monodisperse.

Coalescence 1: What is it, and why do we care?

As part of Part III, instead of sitting an extra exam paper I am writing an essay. I have chosen the topic of ‘Multiplicative Coalescence’. I want to avoid contravening plagiarism rules, which don’t allow you to quote your own words without a proper citation, which I figure is tricky on a blog, nor open publishing of anything you intend to submit. So just to be absolutely sure, I’m going to suppress this series of posts until after May 4th, when everything has to be handed in.

———–

Informal Description

Coalescence refers to a process in which particles join together over time. An example might be islands of foam on the surface of a cup of coffee. When two clumps meet, they join, and will never split. In this example, a model would need to take into account the shape of all the islands, their positions, their velocities, and boundary properties. To make things tractable, we need to distance ourselves from the idea that particles merge through collisions, which are highly physical and complicated, and instead just consider that they merge.

Description of the Model

When two particles coalesce, it is natural to assume that mass is conserved, as this will be necessary in any physical application. With this in mind, it makes sense to set up the entire model using only the masses of particles. Define the kernel K(x,y) which describes the relative rate or likelihood of the coalescence {x,y} -> x+y. This has a different precise meaning in different contexts. Effectively, we are making a mean-field assumption that all the complications of a physical model as described above can be absorbed into this coalescent kernel, either because the number of particles is large, or because the other effects are small.

When there is, initially, a finite number of particles, the process is stochastic. Coalescence almost surely happen one at a time, and so we can view the process as a continuous time Markov Chain with state space the set of relevant partitions of the total mass present. The transition rate p(A,B) is given by K(x,y) when the coalescence {x,y} -> x+y transforms partition into B, and 0 otherwise. An observation is that the process recording the number of {x,y} -> x+y coalescences is an inhomogeneous Poisson process with local intensity n(x,t)n(y,t)K(x,y) where n(x,t) is the number of particles with mass at time t.

This motivates the move to an infinite-volume setting. Suppose that there are infinitely many particles, so that coalescences are occurring continuously. The rate of {x,y} -> x+y coalescences is still n(x,t)n(y,t)K(x,y) but now n(x,t) specifies the density of particles with mass at time t. Furthermore, because of the continuum framework, this rate is now deterministic rather than stochastic. This is extremely important, as by removing the probability from a probabilistic model, it can be treated as a large but simple ODE.

Two Remarks

1) Once this introduction is finished, we shall be bringing our focus onto multiplicative coalescence, where K(x,y) = xy. In particular, this is a homogeneous function, as are the other canonical kernels. This means that considering K(x,y) = cxy is the same as performing a constant factor time-change when K(x,y) = xy. Similarly, it is not important how the density n(x,t) is scaled as this can also be absorbed with a time-change. In some contexts, it will be natural and useful to demand that the total density be 1, but this will not always be possible. In general it is convenient to absorb as much as possible into the time parameter, particularly initial conditions, as will be discussed.

2) Working with an infinite volume of particles means that mass is no longer constrained to finitely many values. Generally, it is assumed that the masses are discrete, taking values in the positive integers, or continuous, taking values in the positive reals. In this case, the rate of coalescences between particles with masses in (x, x+dx) and (y,y+dy) is n(x,t)n(y,t)K(x,y)dxdy. The main difference between these will arise when we try to view the process as limits of finite processes. Continue reading