# Coupling from the Past

In a long series of previous posts I have talked about mixing times for Markov chains. We consider how long it takes for the distribution of a particular Markov chain to approach equilibrium. We are particularly interested in the asymptotics when some parameter of the model grows, such as the size of the state space, grows to infinity.

But why are we interested in the underlying problem? The idea of Markov Chain Monte Carlo methods is to sample from an intractable distribution by instead sampling from a Markov chain which approximates the distribution well at large times. A distribution might be intractable because it is computationally demanding to work out the normalising constant, or it might be distributed uniformly on a complicated combinatorial set. If, however, the distribution is the equilibrium distribution of some Markov chain, then we know how to at least sample from a distribution which is close to the one we want. But we need to know how long to run the process. We will typically tolerate some small error in approximating the distribution (whether we measure this in terms of total variation distance or some other metric doesn’t really matter at this heuristic level), but we need to know how it scale. If we double the size of the system, do we need to double the number of iterations of the chain, or square it. This is really important if we are going to use this for large real-world models with finite computing power!

Sometimes though, an approximation is not enough. If we want an exact sample from the equilibrium distribution, Markov chains typically will not help us as it is only in very artificial examples that the distribution after some finite time is actually the equilibrium distribution. One thing that we might use is a stationary time, which is a stopping time T, for which $X_T\stackrel{d}{=}\pi$. Note that there is one trivial way to do this. We can sample Y from distribution $\pi$ before starting the process, then stop X at the first time T for which $X_T=Y$. But this is no help really, as we need to have Y in the first place!

So we are really interested in less trivial stationary times. Perhaps the best example is the top-to-random shuffle. Here we are given a pack of labelled cards, WLOG initially in descending order at each step we move the top card in the pile to a randomly-chosen location in the pile (which includes back onto the top). Then it turns out that the first time we move the card originally at the bottom from the top to somewhere is a strong stationary time. This is fairly natural, as by this time, every card has been involved in at least one randomising event.

Anyway, so this gives a somewhat artificial way to sample from the uniform distribution on a pack of cards. This strong stationary time is almost surely finite, with distribution given by the coupon collector problem, for which the expectation grows as $n\log n$, where n is the number of cards.

The problem with this method is that it is not easy in general to come up with a non-contrived stationary time such as this one. The idea of coupling from the past, discussed by some previous authors but introduced in this context by Propp and Wilson in the mid ’90s, is another method to achieve perfect sampling from the equilibrium distribution of a Markov chain. The idea here is to work backwards rather than forwards. The rest of this post, which discusses this idea, is based on the talk given at the Junior Probability Seminar by Irene, and on the chapter in the Levin, Peres, Wilmer book.

The key to the construction is a coupling of the transitions of a Markov chain. In the setting of a simple random walk, we have by construction a coupling of the transitions. It doesn’t matter which state we are at: we toss a coin to decide whether to move up or down, and we can do this without reference to our current position. Levin, Peres and WIlmer call this a random mapping representation in general, and it is yet another concept that is less scary than its definition might suggest.

Given a transition matrix P on state space S, such a representation is a function

$\phi: S\times[0,1]\rightarrow S,\text{ s.t. }\mathbb{P}(\phi(i,U)=j)=p_{ij},$

where U is a U(0,1) random variable independent of choice of i. In particular, once we have the random value of u, we can consider $\phi(i,u)$ as i varies, to obtain a random map $S\rightarrow S$. Crucially, this map is not necessarily a bijection.

Note first that there are many possibilities for constructing the representation $\phi$. For some chains, and some representations, in particular random walks on vertex-transitive graphs (such as SRW – only for now we are restricting attention to finite state spaces) it is possible to choose $\phi$ so that it always gives a bijection, but it is also always possible to choose it so that there is some probability it doesn’t give a bijection.

Let $U_1,U_2,\ldots$ be an IID sequence of U[0,1] random variables, and write $\phi_i$ for the random map induced by $U_i$. Then consider the sequence of iterated maps:

$\phi_1, \phi_1\circ \phi_2, \ldots, \phi_1\circ\ldots\circ\phi_n,$

and let T be the (random) smallest time such that the image of $\phi_1\circ\ldots\circ \phi_T$ is a single state. Ie, as we go backwards in time through the maps $\phi_i$, we are gradually losing various states, corresponding to the maps not being bijections. Since the state space is finite, and the probability of not being a bijection is positive, it can be shown that T is almost surely finite. The claim then is that

$Y=\text{Im}(\phi_1\circ\ldots\circ \phi_T)$

is distributed as the equilibrium distribution of the chain. We finish by proving this.

Proof: Since the algorithm terminates after finite time almost surely, given any $\epsilon>0$, we can choose N such that the probability the algorithm stops in at most N steps is greater than $1-\epsilon$.

Now run the Markov chain from time -N, started in the equilibrium distribution, with the transition from time -t to -(t-1) given by the random mapping driven by $U_t$. Thus at time 0, the distribution of the chain is still the equilibrium distribution. But if we condition on the event that $T\le N$, then $X_0=\phi_1\circ \ldots \circ\phi_n(X_{-N})=Y$ regardless of the initial value. So $\mathbb{P}(X_0\ne Y)<\epsilon$, and hence the result follows, since $\epsilon>0$ was arbitrary.

What makes this easier than strong stationary times is that we don’t have to be clever to come up with the stopping time. It is however still important to know how long on average it takes to run the algorithm. At the end of her talk, Irene showed how to adapt this algorithm to deal with Probabilistic Cellular Automata. Roughly speaking, these are a sequence of infinite strings of 0s and 1s. The value of some element is determined randomly as a function of the values in the row underneath, say the element directly underneath and the two either side. In that setting, if you start with a finite subsequence and couple from the past by looking down to lower rows, each time you drop down a row you consider one further element, so in fact the coupling from the past algorithm has to eliminate possibilities fast enough to make up for this, if we want to terminate almost surely in finite time.

Here’s a link to the paper which discusses this in fuller detail.

# Mixing of the Noisy Voter Model

I’ve been spending a bit of time recently thinking about some interacting particle systems and mixing time problems, so it was interesting to find a short paper due to Harishchandra Ramadas on arXiv a couple of weeks ago combining these two areas. I enjoyed reading it, so I thought I would summarise its arguments here, in preparation for perhaps talking about it at this week’s junior probability seminar.

The voter model is a graph-based interacting particle system. The underlying motivation is to describe the spread of opinions through a population. Each vertex represents an agent who has an opinion which might vary over time. At exponential rates, an agent will replace their opinion with the opinion of one of their neighbours, chosen uniformly at random. Note that, obviously, this might not actually involve a change of opinion.

One problem with this model is that we might be interested in convergence to equilibrium, but this might be hard to specify, as the state where everything is 0 and the state where everything is 1 are absorbing, and similarly any binomial distribution on the set of vertices is invariant. It is also not entirely realistic to assume that future updates to the system are entirely based on the current configuration. In any actual application, we should allow for some, possibly very small, influence from the outside, whether due to random errors in the update rule, or some external factors which haven’t been explicitly accounted for.

So we consider instead the noisy voter model where, in addition to the dynamics described above, each vertex changes state at rate $\delta$, independently of the configuration or the history. [Note that we are now considering the set of opinions {0,1} ]

We are interested in the mixing time of this system. It is not immediately clear that this should have an invariant distribution, but it seems reasonable since the noise, which converts 0 to 1 and 1 to 0 at equal rates if the proportion of 0s and 1s is equal, moves the system towards something like Bin(V(G),1/2), although the correlations need to take account of the local geometry.

Anyway, we assume $\delta>0$ is fixed, and n tends to infinity. Ramadas proves that the dynamics have the property of optimal temporal mixing, which states that given copies of the process X, Y starting from different initial conditions

$||X_t-Y_t||\le C n e^{-ct},$

for suitable choice of constants. Note that it doesn’t matter whether X and Y are coupled or independent. Total variation distance is a function of distribution, not of probability space. I don’t have a great intuition for exactly what this means, except to make the obvious comment that the distance from uniformity grows at most uniformly as the size of the system grows. Note that this immediately implies that the mixing time is O(log n). The rest of the paper proves that the noisy voter model has this property.

The method is to consider the relation:

$||X-Y||\le \mathbb{P}(X\ne Y),$

which says that the total variation distance gives the minimum of the RHS probability over couplings of X and Y. Anyway, as discussed in the previous post on duality in such interacting particle systems, the key step is to consider the process as a deterministic walk through a random environment. Ie, we apply the same moves to configurations started from different initial conditions.

The first step is to break down the noise mechanism to make it easier to couple everything. Instead of saying that a vertex changes state at rate $\delta$, we instead have two noise processes. One sets a vertex’s state to 0 at rate $\delta$, and another sets a vertex’s state to 1 at rate $\delta$ independently. Since at any configuration, for each vertex, precisely one of these processes has no effect.

So for the coupling, we run the same noise processes, and we choose the neighbours from which to update our opinions the same in both processes. The only difference is the initial configuration. With these two processes, X and Y defined on the same probability space, Ramadas defines $Z_t(v)=\mathbf{1}\{X_t(v)\neq Y_t(v)\}$, which counts how many vertices are different between the two processes. Almost surely, eventually all the Zs will be 0, and this state is absorbing because of how we’ve defined this coupling. We are interested to find out how long it takes for this event, which we could think of like a strong stationary time, to happen.

To control this, we track the evolution in the probabilities of $Z_t(v)=0$ jointly. We note that after a noise event at vertex v, we definitely have $Z_t(v)=0$. After an opinion-change event, whether $Z_t(v)$ changes depends on whether $Z_t(u)$ is 1, where u is the uniformly-chosen vertex from which v inherits its opinion in both processes. We obtain the following ODEs:

$\frac{d}{dt}\mathbb{P}(Z_t(v)=1)=\frac{1}{d(v)}\left(\sum_{u:u\sim v}\mathbb{P}(Z_t(v)=0,Z_t(u)=1)\right)$

$-\frac{1}{d(v)}\left(\sum_{u:u\sim v}\mathbb{P}(Z_t(v)=1,Z_t(u)=0)\right)-2\delta \mathbb{P}(Z_t(v)=1).$

We interpret this sum as a negative bit plus two terms which we would hope would cancel each other out. It seems impractical to attempt to formalise this heuristic. However, we can instead choose to consider at time t to consider the ODE for whichever vertex v maximises $Z_t(v)$. It’s then clear that the RHS is less than $-2\delta \mathbb{P}(Z_t(v)=1)$.

Right-differentiabililty of $Z_t(v_{max})$ follows from differentiability of $Z_t(v)$. The number of vertices is finite, and almost surely the vertex attaining this maximum can’t change too often. In particular, from the ODE, we must have $Z_t(v_{max})$ bounded above by $Ce^{-2\delta t}$. From which it follows that

$\mathbb{P}(X\ne Y)\le nCe^{-2\delta t },$

and we’ve shown this optimal temporal mixing property, and hence the mixing time has order log n.

What I’ve discussed so far is taken entirely from Section 3 of Ramadas’s paper. My first thought was whether this bound is actually correct in scale, as we have used only the size of the graph rather than any finer information about its graph structure. The author discusses that it is not possible to use a general result due to Hayes and Sinclair on this topic in this setting, as it is not reversible.

I certainly can’t answer this question, but I will make a comment on the extreme examples. If the graph is the empty graph on n vertices, then the voter model dynamics are null, so we essentially have just a coupon collector problem going on. It doesn’t lose any generality to start from entirely 0s, then at time t, the distribution of the system will be Bin(n,p), with p a function of time, but not n. Now note that

$||\mathrm{Bin}(n,p)-\mathrm{Bin}(n,q)||_{TV}=\Theta(\sqrt{n}|p-q|),$

so in this example, we need to choose a time so that $p=\frac12+\Theta(\sqrt{n})$. This is really a statement about the central limit theorem. It’s clear that Bin(n,1/2) should genuinely be the equilibrium distribution here. The evolution of p(t) is exponential, and hence we get the O(log n) scaling emerging. Note that the presence of $\sqrt{n}$ rather than any other power of n makes no difference, though it would be relevant if we actually wanted to know the constant.

At the other end of the connectivity spectrum, we consider the dynamics on the complete graph. Here, let’s count the number of 1s, and call it W(t) say. Then W goes from k to k+1 at rate $\frac{(n-k)\delta}{2}+\frac{k(n-k)}{n}$, and from k to k-1 at rate $\frac{k\delta}{2}+\frac{k(n-k)}{n}$. In particular, note in expectation, the contributions from the voter model dynamics cancel. This fits with our intuition that it is solely the noise driving the system towards equilibrium.

If we take an ODE approximation, we get:

$\frac{dX}{dt}=\frac{n}{2}-\delta x,$

from which it follows that

$t(x)=\alpha-\frac{1}{\delta}\log(C-\delta x),$

which gives precisely the correct scaling, when we take x=n/2.

I would be very interested to think about or hear about arguments for a general lower bound across all graphs.

REFERENCES

Hayes, Sinclair (2007) – A general lower bound for mixing of single-site dynamics on graphs (arXiv)

Ramadas (2014) – Mixing of the Noisy Voter Model (arXiv link at top of article)

# Duality for Interacting Particle Systems

Yesterday I introduced the notion of duality for two stochastic processes. My two goals for this post are to elaborate on the idea of why duality is useful, which we touched on in passing in the previous part, and to discuss duality of interacting particle systems. In the latter case, there are often nice ways to consider the forward and backward processes together that make the relation somewhat more natural.

The starting point is to assume a finite state space. This will be reasonable when we start to consider interacting particle systems, eg on $\{0,1\}^{[n]}$. As before, call the spaces R and S, and a duality function H(x,y). Since the state-spaces are finite, it is entirely natural to think of this as a matrix, and hence as an operator. Of course, a function defined on a finite state-space can be thought of as a vector, so it is clear what this operator will actually operate on. (I’ve chosen H rather than h for the duality function so it is more clear that it is acting as an operator here.)

We have some choice about which way round to define it, but for now let’s say that given some function f(.) on S

$Hf(x):=\sum_{y\in S} H(x,y)f(y).$

Note that this is a) exactly the definition of matrix (left-)multiplication; b) We should think of Hf as a function on R – perhaps (Hf)(x) might be more clear? and c) the operator H acts $\mathbb{R}^S\rightarrow \mathbb{R}^R$. If we want the corresponding operator $\mathbb{R}^R\rightarrow\mathbb{R}^S$, we simply multiply by H on the right instead.

But note also that the generator of a finite state-space Markov process is also a matrix, indeed a Q-matrix. So if we take our definition of the duality function as

$\mathcal{G}_X h(x,y)=\mathcal{G}_Y h(x,y),$

which, importantly, holds for all x,y, we can convert this into an algebraic form as

$\mathcal{G}_X H = H \mathcal{G}_Y^\dagger.$

In the same way that n-step transition probabilities for a discrete-time Markov chain are given by the product of the one-step transition matrix, general time transition probabilities for a continuous-time Markov chain are given by exponents of the Q-matrix. In particular, if X and Y have transition kernels P and Q respectively, then $P_t=e^{tG_X}$, and after doing some manipulation, we can show that

$P_t H=H Q_t^\dagger,$

also. This is really useful as in general we would hope that H might be invertible, from which we derive

$P_t=HQ_t^\dagger H^{-1}.$

So this is a powerful statement about the relationship between the evolutions of the two processes. In particular, it shows a correspondence (given by H) between left eigenvectors of P, and right eigenvectors of Q, and vice versa naturally.

The reason why this is useful rather than merely cute, is that when we re-interpret everything in terms of the original stochastic processes, we get a map between stationary distributions of X, and harmonic functions of Y. Stationary distributions are often hard to describe in any terms other than the left-1-eigenvector, or through some convergence property that is typically hard to work with. Harmonic functions, on the other hard, can be much more tractable. An example of a harmonic function is the survival probability started from a given state. This is useful for specifying the stationary distribution, but perhaps even more so for describing properties of the set of stationary distributions. In particular, uniqueness and existence are carried across this equivalence. So, for example, if the dual does not survive almost surely, then this says the only stationary measure is zero, and so the process is transient or similar.

Jan Swart’s course in Luminy last October dealt with duality, with a focus mainly on interacting particle systems. There are a couple of themes I want to talk about, without going into too much detail.

A typical interacting particle system will take place on a locally finite graph. At each vertex, there is either a particle, or there isn’t. Particles move between adjacent vertices, and sometimes interact with particles at adjacent vertices. These interactions might involve branching or coalescence. We will discuss shortly the set of possible forms such interaction might take. The state space is $\{0,1\}^{V(G)}$, with G the underlying graph. Then given a state, there is some set of actions which might happen next, and we consider the possibility that they happen with exponential rates.

At this stage, it seems like the initial configuration is important, as this affects what set of moves can happen immediately, and also thereafter. It is not clear how quickly this dependence fades. One useful idea is not to restrict ourselves to interactions involving the particles currently present in the system, but instead to consider a Poisson process of all possible interactions. Only the moves actually permitted by the current state will happen, but having this extra information allows for coupling between initial configurations.

It’s probably easier to consider a concrete example. The picture below shows the set-up for a branching random walk up an integer lattice. Each particle moves to one of the two state directly above its current state, or it branches and sends particles to both of them.In the diagram, we have glued arrows onto every state at every time, which tells us what to do if there is a particle there at each time. As a coupling, we can now think of the process as a deterministic walk through a random environment. The environment is given by some probability space, which in continuous time might have the appearance of a Poisson process on the set of ‘moves’, and the initial condition of the walk is up to us.

We can generalise this to a broader class of interacting particle systems. If we want all interactions to be between pairs of adjacent states, there are six possible things which could happen:

• Annihilation: two adjacent particles destroy each other. ( 11 -> 00 )
• Branching: one particle becomes two particles. ( 01 or 10 -> 11 )
• Coalescence: two particles merge. ( 11 -> 01 or 10 )
• Death: A particle is removed. ( 01 or 10 -> 00 )
• Exclusion: a particle moves. ( 01 -> 10 )
• Birth: a particle is created. ( 00 -> 01 or 10 )

For now we exclude the possibility of birth. Note that the way we have set this up involving two-site interactions excludes the possibility of a particle trying to move to an already-occupied site.

Let us say that in process X the rates at which each of these events happen are a, b, c, d and e, taking advantage of the helpful choice of naming. There is some flexibility about whether the rates are the same between every pair of vertices of note. For this post we assume that they are. Then it is a result of Lloyd and Sudbury that given some real $q\neq 1$, the process X’ with corresponding rates given by:

$a'=a+2q\gamma, b'=b+\gamma, c'=c-(1+q)\gamma, d'=d+\gamma, e'=e-\gamma,$

for $\gamma:= \frac{a+c-d+qb}{1-q},$

is dual to X, with duality function given by $h(Y,Z)=q^{|Y\cap Z|}$, for Y and Z possible states.

I want to make two comments:

1) This illustrates one of the differences between the dual and the time-reversal. It is clear that the time-reversal of branching is coalescence and vice versa, and exclusion is invariant under time-reversal. But the time-reversal of death is definitely birth, but there is no birth component in the dual of a process which features death. I don’t have a strong intuition for why this is the case, but see the final paragraph of this post. However, at least it seems plausible that both processes might simultaneously be recurrent, since in the dual, both the branching rate and the death rate have increased by the same amount.

2) This settles one problem of uniqueness of the dual that I mentioned last time, since we can vary q and get a different dual to the same original process. For example, in the voter model, we have b=d=1, and a=c=e=0, as in any update, the opinions of neighbours which were previously different become the same. Anyway, for any $q\in[-1,0]$ there is a choice of dual, where at the extremes q=0 corresponds to coalescing random walk, and q=-1 to annihilating random walk. (Note that the duality function for q=0 is the indicator function that the systems are different.)

As a final observation without much justification, suppose we add in arrows in the gaps of the branching random walk picture we had earlier, and direct them in the opposite direction. It turns out that this corresponds precisely to the dual of the process. This provides an appealing visual idea of why the dual of branching might be death. It also supports the general idea based on the coupling described earlier that the dual process is in some sense a deterministic walk in the opposite direction through the random environment specified by the original process.

REFERENCES

J.M. Swart – Duality and Intertwining of Markov Chains (mainly using chapters 2.1 and 2.7)

Thanks for Daniel Straulino for direction towards the branching random walk duality example.