# The Yule Process

The second problem sheet for classes on the Applied Probability course this term features a long question about the Yule process. This is probably the simplest example of a birth process. It’s named for the British statistician George Udny Yule, though some sources prefer to call it the Yule-Furry process for the American physicist Wendell Furry who used it as a model of a radioactive reaction.

The model is straightforward. At any time there is some number of individuals in the population, and each individual gives birth to an offspring at constant rate $\lambda$, independently from the rest of the population. After a birth has happened, the parent and child evolve independently. In the notation of general birth processes, the birth rate when there are n individuals is $\lambda_n=\lambda n$.

Note that if we start with two or more individuals, the sizes of the two or more families of descendents evolve as a continuous-time Polya’s urn. The arrivals process speeds up with time, but the jump chain is exactly Polya’s urn. Unsurprisingly, the Yule process can be found embedded in preferential attachment models, and other processes which are based around Polya’s urn with extra information.

This is a discrete, random version of exponential growth. Since the geometric distribution is the discrete analogue of the exponential distribution, we probably shouldn’t be surprised to learn that this is indeed the distribution of the process at some fixed time t, when it is started from a single original ancestor. This is all we care about, since the numbers of descendents from each different original ancestors are independent. In general, the distribution of the population size at some fixed time will be negative binomial, that is, a sum of IID geometric distributions.

The standard method here is to proceed using generating functions. Conditioning on the first splitting time gives two independent copies of the original process over a shorter time-scale. One derives an ODE in time for the generating function evaluated at any particular value z. This can be solved uniquely for each z, and patching together gives the generating function of the distribution at any specific time t, which can be seen to coincide with the corresponding generating function of the geometric distribution with parameter $e^{-\lambda t}$.

So we were trying to decide whether there might be a more heuristic argument for this geometric distribution. The method we came up with is not immediate, but does justify the geometric distribution in a couple of steps. First, we say that the birth times are $T_2,T_3,\ldots$, so between times $[T_n,T_{n+1})$ there are n individuals, with $T_1:=0$ for concreteness. Then by construction of the birth process, $T_{n+1}-T_n\stackrel{d}{=}\mathrm{Exp}(\lambda n)$.

We now look at these ‘inter-birth times’ backwards, starting from $T_{n+1}$. Note that $\mathrm{Exp}(\lambda n)$ is the distribution of the time for the first of n IID $\mathrm{Exp}(\lambda)$ clocks to ring. But then, looking backwards, the next inter-birth time is thus the distribution of the time for one of (n-1) IID $\mathrm{Exp}(\lambda)$ clocks to ring. So by memorylessness of the exponential distribution (discussed at great length on the first problem sheet), we can actually take these (n-1) clocks to be exactly those of the original n clocks which did not ring first. Continuing this argument, we can show that the first (in the original time direction) inter-birth time corresponds to the time spent waiting for the final clock to ring. Rewriting this observation formally:

$T_{n+1}\stackrel{d}{=}\max\{X_i : X_1,\ldots,X_n\stackrel{\text{iid}}{\sim}\mathrm{Exp}(\lambda)\}.$ (*)

To return to justifying the geometric form of the distribution, we need to clarify the easiest relationship between the population size at a fixed size and these birth times. As we are aiming for the geometric distribution, the probability of the event $\{X_t>n\}$ will be most useful. Clearly this event is the same as $\{T_{n+1}, and from the description involving maxima of IID exponentials, this is easy to compute as $(1-e^{-\lambda t})^n$, which is exactly what we want.

There are two interesting couplings hidden in these constructions. On closer inspection they turn out to be essentially the same from two different perspectives.

We have specified the distribution of $T_n$ at (*). Look at this distribution on the right hand side. There is a very natural way to couple these distributions for all n, namely to take some infinite sequence $X_1,X_2,\ldots$ of IID $\mathrm{Exp}(\lambda)$ random variables, then use initial sequences of these to generate each of the $T_n$s as described in (*).

Does this coupling correspond to the use of these IID RVs in the birth process? Well, in fact it doesn’t. Examining the argument, we can see that $X_1$ gives a different inter-birth time for each value of t in the correspondence proposed. Even more concretely, in the birth process, almost surely $T_{n+1}>T_n$ for each n. This is not true if we take the canonical coupling of (*). Here, if $X_n<\max\{X_1,\ldots,X_{n-1}\}$, which happens with high probability for large n, we have $T_{n+1}=T_n$ in the process of running maxima.

Perhaps more interestingly, we might observe that this birth process gives a coupling of the geometric distributions. If we want to recover the standard parameterisation of the geometric distribution, we should reparameterise time. [And thus generate an essentially inevitable temptation to make some joke about now having a Yule Log process.]

Let’s consider what the standard coupling might be. For a binomial random variable, either on [n] or some more exotic set, as in percolation, we can couple across all values of the parameter by constructing a family independent uniform random variables, and returning a 1 if $U_i>1-p$ and so on, where p is the parameter of a specific binomial realisation.

We can do exactly the same here. A geometric distribution can be justified as the first success in a sequence of Bernoulli trials, so again we can replace the relevant Bernoulli distribution with a uniform distribution. Take $U_1,U_2,\ldots$ to be IID U[0,1] random variables. Then, we have:

$X_t=\stackrel{d}{=}\bar X_t:= \max\{n: U_1,\ldots,U_{n-1}\ge e^{-\lambda t}\}.$

The equality in distribution holds for any particular value of t by constructing. But it certainly doesn’t hold uniformly in t. Note that if we define $\bar X_t$ as a process, then typically the jumps of this process will be greater than 1, which is forbidden in the Yule process.

So, we have seen that this Yule process, even though its distribution at a fixed time has a standard form, provides a coupling of such distributions that is perhaps slightly surprising.

# Mixing of the Noisy Voter Model

I’ve been spending a bit of time recently thinking about some interacting particle systems and mixing time problems, so it was interesting to find a short paper due to Harishchandra Ramadas on arXiv a couple of weeks ago combining these two areas. I enjoyed reading it, so I thought I would summarise its arguments here, in preparation for perhaps talking about it at this week’s junior probability seminar.

The voter model is a graph-based interacting particle system. The underlying motivation is to describe the spread of opinions through a population. Each vertex represents an agent who has an opinion which might vary over time. At exponential rates, an agent will replace their opinion with the opinion of one of their neighbours, chosen uniformly at random. Note that, obviously, this might not actually involve a change of opinion.

One problem with this model is that we might be interested in convergence to equilibrium, but this might be hard to specify, as the state where everything is 0 and the state where everything is 1 are absorbing, and similarly any binomial distribution on the set of vertices is invariant. It is also not entirely realistic to assume that future updates to the system are entirely based on the current configuration. In any actual application, we should allow for some, possibly very small, influence from the outside, whether due to random errors in the update rule, or some external factors which haven’t been explicitly accounted for.

So we consider instead the noisy voter model where, in addition to the dynamics described above, each vertex changes state at rate $\delta$, independently of the configuration or the history. [Note that we are now considering the set of opinions {0,1} ]

We are interested in the mixing time of this system. It is not immediately clear that this should have an invariant distribution, but it seems reasonable since the noise, which converts 0 to 1 and 1 to 0 at equal rates if the proportion of 0s and 1s is equal, moves the system towards something like Bin(V(G),1/2), although the correlations need to take account of the local geometry.

Anyway, we assume $\delta>0$ is fixed, and n tends to infinity. Ramadas proves that the dynamics have the property of optimal temporal mixing, which states that given copies of the process X, Y starting from different initial conditions

$||X_t-Y_t||\le C n e^{-ct},$

for suitable choice of constants. Note that it doesn’t matter whether X and Y are coupled or independent. Total variation distance is a function of distribution, not of probability space. I don’t have a great intuition for exactly what this means, except to make the obvious comment that the distance from uniformity grows at most uniformly as the size of the system grows. Note that this immediately implies that the mixing time is O(log n). The rest of the paper proves that the noisy voter model has this property.

The method is to consider the relation:

$||X-Y||\le \mathbb{P}(X\ne Y),$

which says that the total variation distance gives the minimum of the RHS probability over couplings of X and Y. Anyway, as discussed in the previous post on duality in such interacting particle systems, the key step is to consider the process as a deterministic walk through a random environment. Ie, we apply the same moves to configurations started from different initial conditions.

The first step is to break down the noise mechanism to make it easier to couple everything. Instead of saying that a vertex changes state at rate $\delta$, we instead have two noise processes. One sets a vertex’s state to 0 at rate $\delta$, and another sets a vertex’s state to 1 at rate $\delta$ independently. Since at any configuration, for each vertex, precisely one of these processes has no effect.

So for the coupling, we run the same noise processes, and we choose the neighbours from which to update our opinions the same in both processes. The only difference is the initial configuration. With these two processes, X and Y defined on the same probability space, Ramadas defines $Z_t(v)=\mathbf{1}\{X_t(v)\neq Y_t(v)\}$, which counts how many vertices are different between the two processes. Almost surely, eventually all the Zs will be 0, and this state is absorbing because of how we’ve defined this coupling. We are interested to find out how long it takes for this event, which we could think of like a strong stationary time, to happen.

To control this, we track the evolution in the probabilities of $Z_t(v)=0$ jointly. We note that after a noise event at vertex v, we definitely have $Z_t(v)=0$. After an opinion-change event, whether $Z_t(v)$ changes depends on whether $Z_t(u)$ is 1, where u is the uniformly-chosen vertex from which v inherits its opinion in both processes. We obtain the following ODEs:

$\frac{d}{dt}\mathbb{P}(Z_t(v)=1)=\frac{1}{d(v)}\left(\sum_{u:u\sim v}\mathbb{P}(Z_t(v)=0,Z_t(u)=1)\right)$

$-\frac{1}{d(v)}\left(\sum_{u:u\sim v}\mathbb{P}(Z_t(v)=1,Z_t(u)=0)\right)-2\delta \mathbb{P}(Z_t(v)=1).$

We interpret this sum as a negative bit plus two terms which we would hope would cancel each other out. It seems impractical to attempt to formalise this heuristic. However, we can instead choose to consider at time t to consider the ODE for whichever vertex v maximises $Z_t(v)$. It’s then clear that the RHS is less than $-2\delta \mathbb{P}(Z_t(v)=1)$.

Right-differentiabililty of $Z_t(v_{max})$ follows from differentiability of $Z_t(v)$. The number of vertices is finite, and almost surely the vertex attaining this maximum can’t change too often. In particular, from the ODE, we must have $Z_t(v_{max})$ bounded above by $Ce^{-2\delta t}$. From which it follows that

$\mathbb{P}(X\ne Y)\le nCe^{-2\delta t },$

and we’ve shown this optimal temporal mixing property, and hence the mixing time has order log n.

What I’ve discussed so far is taken entirely from Section 3 of Ramadas’s paper. My first thought was whether this bound is actually correct in scale, as we have used only the size of the graph rather than any finer information about its graph structure. The author discusses that it is not possible to use a general result due to Hayes and Sinclair on this topic in this setting, as it is not reversible.

I certainly can’t answer this question, but I will make a comment on the extreme examples. If the graph is the empty graph on n vertices, then the voter model dynamics are null, so we essentially have just a coupon collector problem going on. It doesn’t lose any generality to start from entirely 0s, then at time t, the distribution of the system will be Bin(n,p), with p a function of time, but not n. Now note that

$||\mathrm{Bin}(n,p)-\mathrm{Bin}(n,q)||_{TV}=\Theta(\sqrt{n}|p-q|),$

so in this example, we need to choose a time so that $p=\frac12+\Theta(\sqrt{n})$. This is really a statement about the central limit theorem. It’s clear that Bin(n,1/2) should genuinely be the equilibrium distribution here. The evolution of p(t) is exponential, and hence we get the O(log n) scaling emerging. Note that the presence of $\sqrt{n}$ rather than any other power of n makes no difference, though it would be relevant if we actually wanted to know the constant.

At the other end of the connectivity spectrum, we consider the dynamics on the complete graph. Here, let’s count the number of 1s, and call it W(t) say. Then W goes from k to k+1 at rate $\frac{(n-k)\delta}{2}+\frac{k(n-k)}{n}$, and from k to k-1 at rate $\frac{k\delta}{2}+\frac{k(n-k)}{n}$. In particular, note in expectation, the contributions from the voter model dynamics cancel. This fits with our intuition that it is solely the noise driving the system towards equilibrium.

If we take an ODE approximation, we get:

$\frac{dX}{dt}=\frac{n}{2}-\delta x,$

from which it follows that

$t(x)=\alpha-\frac{1}{\delta}\log(C-\delta x),$

which gives precisely the correct scaling, when we take x=n/2.

I would be very interested to think about or hear about arguments for a general lower bound across all graphs.

REFERENCES

Hayes, Sinclair (2007) – A general lower bound for mixing of single-site dynamics on graphs (arXiv)

Ramadas (2014) – Mixing of the Noisy Voter Model (arXiv link at top of article)

# Duality for Stochastic Processes

In the past couple of weeks, we’ve launched a new junior probability seminar in Oxford. (If you’re reading this and would like to come, please drop me an email!) The first session featured Daniel Straulino talking about his work on the Spatial Lambda Fleming-Viot process, a model for the evolution of populations allowing for selection and geometry. A lively discussion of duality broke out halfway through, following which it seemed reasonable to devote another session to this topic. Luke Miller presented some classical and less classical examples of the theory this afternoon. I found all of this interesting and relevant, and thought it would be worth writing some things down, and tying it in with one of the courses on this subject that we attended at ALEA in Luminy last October.

The setup is that we have two stochastic processes $X_t\subset R, Y_t\subset S$. For now we make no assertion about whether the two state spaces R and S are the same or related, and we make no comment on the dependence relationship between X and Y. Let $P_x,Q_y$ be the respective probability measures, representing starting from x and y respectively. Then given a bivariate, measurable function h(.,.) on R x S, such that:

$E^P_x h(X_t,y)=E^Q_y h(x,Y_t),\quad \forall x,y\quad\forall t,$

then we say X and Y are dual with respect to h.

The interpretation should be that X is a process forwards in time, and Y is a process backwards in time. So $X_t, Y_0$ represent the present, while $X_0, Y_t$ represent the past, which is the initial time for original process X. The fact that the result holds for all times t allows us to carry the equality through a derivative, to obtain an equality of generators:

$\mathcal{G}^X h(x,y)=\mathcal{G}^Y h(x,y),\quad \forall x,y.$

On the LHS, the generator acts on x, while on the RHS it acts on y. Although it still isn’t obvious (at least to me) when a pair of processes might have this property, especially for an arbitrary function, this seems the more natural definition to think about.

Note that this does indeed require a specific function h. There were murmurings in our meeting about the possibility of a two processes having a strong duality property, where this held for all h in some broad class of test functions. On more reflection, which may nonetheless be completely wrong, this seems unlikely to happen very often, except in some obviously degenerate cases, such as h constant. If this holds, then as the set of expectations of a class of functions of a random variable determines the distribution, we find that the instantaneous behaviour of Y is equal in distribution to the instantaneous behaviour of X when started from fixed (x,y). It seems unlikely that you might get many examples of this that are not deterministic or independent (eg two Brownian motions, or other space-, time-homogeneous Markov process).

Anyway, a canonical example of this is the Wright-Fisher diffusion, which provides a simple model for a population which evolves in discrete-time generations. We assume that there are two types in the population: {A,a} seems to be the standard notation. Children choose their type randomly from the distribution of types in the previous generation. In other words, if there are N individuals at all times, and $X_k$ is the number of type A individuals, then:

$X_{k+1} | X_k \stackrel{d}{=} \mathrm{Bin}(N, \frac{X_k}{N}).$

It is not hard to see that in a diffusion limit as the number of individuals tends to infinity, the proportion of type A individuals is a martingale, and so the generator for this process will not depend on f’. In fact by checking a Taylor series, we can show that:

$\mathcal{G}_{WF}f(x)=\frac{1}{2} x(1-x)f''(x),$

for all f subject to mild regularity conditions. In particular, we can show that for $f_n(x)=x^n$, we have:

$\mathcal{G}_{WF} f_n(x)=\binom{n}{2}(f_{n-1}(x)-f_n(x))$

after some rearranging. This looks like the generator of a jump process, indeed a jump process where all the increments are -1. This suggests there might be a coalescent as the dual process, and indeed it turns out that Kingman’s coalescent, where any pair of blocks coalesce at uniform rate, is the dual. We have the relation in expectation:

$\mathbb{E}_x[X_t^n]= \mathbb{E}_n[x^{N_t}],$

where the latter term is the moment generating function of the number of blocks at time t of Kingman’s coalescent started from n blocks.

In particular, we can control the moments of the Wright-Fisher diffusion using the mgf of the Kingman’s coalescent, which might well be easier to work with.

That’s all very elegant, but we were talking about why this might be useful in a broader setting. In the context of this question, there seems to be an obstacle towards applying this idea above more generally. This is an underlying idea in population genetics models that as well as the forward process, there is also a so-called ancestral process looking backwards in time, detailing how time t individuals are related historically. It would be convenient if this process, which we might expect to be some nice coalescent, was the dual of the forward process.

But this seems to be a problem, as duals are a function of the duality function, so do we have uniqueness? It would not be very satisfying if there were several coalescents processes that could all be the dual of the forward process. Though some worked examples suggest this might not happen, because a dual and its duality function has to satisfy too many constaints, there seems no a priori reason why not. It seems that the strength of the results you can derive from the duality relation is only as strong as the duality relation itself. This is not necessarily a problem from the point of view of applications, so long as the duality function is something it might actually be useful to work with.

It’s getting late and this text is getting long, so I shall postpone a discussion of duality for interacting particle systems until tomorrow. In summary, by restricting to a finite state space, we can allow ourselves the option of a more algebraic approach, from which some direct uses of duality can be read off. I will also mention a non-technical but I feel helpful way to view many examples of duality in interacting particle systems as deterministic forward and backwards walks through a random environment, in what might be considering an extreme example of coupling.