Coupling from the Past

In a long series of previous posts I have talked about mixing times for Markov chains. We consider how long it takes for the distribution of a particular Markov chain to approach equilibrium. We are particularly interested in the asymptotics as some parameter of the model, such as the size of the state space, grows to infinity.

But why are we interested in the underlying problem? The idea of Markov Chain Monte Carlo methods is to sample from an intractable distribution by instead sampling from a Markov chain which approximates the distribution well at large times. A distribution might be intractable because it is computationally demanding to work out the normalising constant, or because it is the uniform distribution on a complicated combinatorial set. If, however, the distribution is the equilibrium distribution of some Markov chain, then we know how to at least sample from a distribution which is close to the one we want. But we need to know how long to run the process. We will typically tolerate some small error in approximating the distribution (whether we measure this in terms of total variation distance or some other metric doesn't really matter at this heuristic level), but we need to know how this error scales. If we double the size of the system, do we need to double the number of iterations of the chain, or square it? This is really important if we are going to use this for large real-world models with finite computing power!

Sometimes though, an approximation is not enough. If we want an exact sample from the equilibrium distribution, Markov chains typically will not help us, as it is only in very artificial examples that the distribution after some finite time is actually the equilibrium distribution. One thing that we might use is a stationary time: a stopping time T for which X_T\stackrel{d}{=}\pi. Note that there is one trivial way to do this. We could sample Y from the distribution \pi before starting the process, then stop X at the first time T for which X_T=Y. But this is no help really, as we needed to be able to sample Y in the first place!

So we are really interested in less trivial stationary times. Perhaps the best example is the top-to-random shuffle. Here we are given a pack of labelled cards, WLOG initially in descending order. At each step we move the top card in the pile to a uniformly random location in the pile (which includes back onto the top). Then it turns out that the first time we move the card originally at the bottom from the top to somewhere in the pile is a strong stationary time. This is fairly natural, as by this time, every card has been involved in at least one randomising event.

So this gives a somewhat artificial way to sample exactly from the uniform distribution on orderings of a pack of cards. This strong stationary time is almost surely finite, with distribution given by the coupon collector problem, for which the expectation grows as n\log n, where n is the number of cards.
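To make this concrete, here is a minimal sketch in Python of exact sampling via the top-to-random shuffle; the function name and setup are my own illustration rather than anything from the sources above.

```python
import random

def top_to_random_sample(n, rng=random):
    """Sample a uniformly random ordering of n cards exactly, by running
    the top-to-random shuffle until its strong stationary time."""
    deck = list(range(n))   # deck[0] is the top card; starts in a fixed order
    bottom_card = deck[-1]  # the card that begins at the bottom
    steps = 0
    while True:
        top = deck.pop(0)                   # take the top card
        deck.insert(rng.randrange(n), top)  # reinsert at a uniform position
        steps += 1
        # the strong stationary time: the step at which the card that
        # started at the bottom is itself moved from the top into the pile
        if top == bottom_card:
            return deck, steps

deck, steps = top_to_random_sample(52)  # E[steps] is of order n log n
```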

The problem with this method is that it is not easy in general to come up with a non-contrived stationary time such as this one. The idea of coupling from the past, discussed by some previous authors but introduced in this context by Propp and Wilson in the mid ’90s, is another method to achieve perfect sampling from the equilibrium distribution of a Markov chain. The idea here is to work backwards rather than forwards. The rest of this post, which discusses this idea, is based on the talk given at the Junior Probability Seminar by Irene, and on the chapter in the Levin, Peres, Wilmer book.

The key to the construction is a coupling of the transitions of a Markov chain. In the setting of a simple random walk, we have by construction a coupling of the transitions. It doesn't matter which state we are at: we toss a coin to decide whether to move up or down, and we can do this without reference to our current position. Levin, Peres and Wilmer call this a random mapping representation in general, and it is yet another concept that is less scary than its definition might suggest.

Given a transition matrix P on state space S, such a representation is a function

\phi: S\times[0,1]\rightarrow S,\text{ s.t. }\mathbb{P}(\phi(i,U)=j)=p_{ij},

where U is a U[0,1] random variable independent of the choice of i. In particular, once we have the random value u, we can consider \phi(i,u) as i varies, to obtain a random map S\rightarrow S. Crucially, this map is not necessarily a bijection.
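As a concrete illustration, here is one natural choice of representation (among many), sketched in Python: the inverse-CDF construction which sets \phi(i,u) to be the smallest j with p_{i1}+\ldots+p_{ij}\ge u. The helper name make_phi is mine, not standard.

```python
import bisect
import random

def make_phi(P):
    """Random mapping representation for transition matrix P (list of rows):
    phi(i, u) is the smallest j with sum_{k <= j} p_{ik} >= u, so that for
    U uniform on [0,1] we get P(phi(i, U) = j) = p_{ij}."""
    cumulative = []
    for row in P:
        cum, total = [], 0.0
        for p in row:
            total += p
            cum.append(total)
        cumulative.append(cum)
    return lambda i, u: bisect.bisect_left(cumulative[i], u)

# a single value of u induces a map S -> S, which need not be a bijection:
P = [[0.5, 0.5], [0.3, 0.7]]
phi = make_phi(P)
u = random.random()
print([phi(i, u) for i in range(len(P))])  # e.g. [0, 0] when u < 0.3
```

For this two-state chain, the induced map is a bijection only when u lands in [0.3, 0.5); otherwise both states are sent to the same place, which is exactly the coalescence that coupling from the past exploits.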

Note first that there are many possibilities for constructing the representation \phi. For some chains and some representations, in particular random walks on vertex-transitive graphs (such as SRW, though for now we are restricting attention to finite state spaces), it is possible to choose \phi so that it always gives a bijection, but it is also always possible to choose it so that there is positive probability it does not give a bijection.

Let U_1,U_2,\ldots be an IID sequence of U[0,1] random variables, and write \phi_i for the random map induced by U_i. Then consider the sequence of iterated maps:

\phi_1, \phi_1\circ \phi_2, \ldots, \phi_1\circ\ldots\circ\phi_n,

and let T be the (random) smallest time such that the image of \phi_1\circ\ldots\circ \phi_T is a single state. That is, as we go backwards in time through the maps \phi_i, we gradually lose states, corresponding to the maps not being bijections. Since the state space is finite, and the probability of not being a bijection is positive, it can be shown that T is almost surely finite. The claim then is that

Y=\text{Im}(\phi_1\circ\ldots\circ \phi_T)

(that is, the single state making up this image) is distributed according to the equilibrium distribution \pi of the chain. We finish by proving this.
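Before the proof, here is a minimal sketch of the algorithm itself in Python, reusing the make_phi helper from above. The crucial point is that as we extend further into the past we must reuse the same U_i as before, appending fresh randomness only at the far end of the composition. (This naive version recomputes the whole composition each time, so it is quadratic in T; Propp and Wilson's doubling trick is more efficient, but the logic is identical.)

```python
import random

def cftp(P, rng=random):
    """Coupling from the past: an exact sample from the equilibrium
    distribution of the finite chain with transition matrix P."""
    n = len(P)
    phi = make_phi(P)   # the random mapping representation sketched above
    us = []             # us[t-1] is the uniform variable U_t driving phi_t
    while True:
        us.append(rng.random())  # one step further into the past
        # image of phi_1 \circ ... \circ phi_T with T = len(us); the map
        # furthest in the past, phi_T, is applied first
        image = set(range(n))
        for u in reversed(us):
            image = {phi(x, u) for x in image}
        if len(image) == 1:      # coalescence: a single state survives
            return image.pop()

# empirical check for the two-state chain above, whose equilibrium
# distribution is (0.375, 0.625):
# samples = [cftp([[0.5, 0.5], [0.3, 0.7]]) for _ in range(10**4)]
```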

Proof: Since the algorithm terminates after finite time almost surely, given any \epsilon>0, we can choose N such that the probability the algorithm stops in at most N steps is greater than 1-\epsilon.

Now run the Markov chain from time -N, started in the equilibrium distribution, with the transition from time -t to -(t-1) given by the random map driven by U_t. Thus at time 0, the distribution of the chain is still the equilibrium distribution. But if we condition on the event that T\le N, then X_0=\phi_1\circ \ldots \circ\phi_N(X_{-N})=Y regardless of the value of X_{-N}. So \mathbb{P}(X_0\ne Y)\le \mathbb{P}(T>N)<\epsilon, and since X_0 is exactly in equilibrium, the total variation distance between the law of Y and \pi is at most \epsilon. The result follows, since \epsilon>0 was arbitrary.

What makes this easier than strong stationary times is that we don't have to be clever to come up with the stopping time. It is however still important to know how long on average it takes to run the algorithm. At the end of her talk, Irene showed how to adapt this algorithm to deal with Probabilistic Cellular Automata. Roughly speaking, these evolve an infinite string of 0s and 1s row by row: the value of an element in a row is determined randomly as a function of the values in the row underneath, say the element directly underneath and the two either side. In that setting, if you start with a finite subsequence and couple from the past by looking down to lower rows, each time you drop down a row you have to consider one further element at each end, so the coupling from the past algorithm has to eliminate possibilities fast enough to make up for this growth, if we want it to terminate almost surely in finite time.

Here’s a link to the paper which discusses this in fuller detail.
