Coupling from the Past

In a long series of previous posts I have talked about mixing times for Markov chains. We consider how long it takes for the distribution of a particular Markov chain to approach equilibrium. We are particularly interested in the asymptotics when some parameter of the model grows, such as the size of the state space, grows to infinity.

But why are we interested in the underlying problem? The idea of Markov Chain Monte Carlo methods is to sample from an intractable distribution by instead sampling from a Markov chain which approximates the distribution well at large times. A distribution might be intractable because it is computationally demanding to work out the normalising constant, or it might be distributed uniformly on a complicated combinatorial set. If, however, the distribution is the equilibrium distribution of some Markov chain, then we know how to at least sample from a distribution which is close to the one we want. But we need to know how long to run the process. We will typically tolerate some small error in approximating the distribution (whether we measure this in terms of total variation distance or some other metric doesn’t really matter at this heuristic level), but we need to know how it scale. If we double the size of the system, do we need to double the number of iterations of the chain, or square it. This is really important if we are going to use this for large real-world models with finite computing power!

Sometimes though, an approximation is not enough. If we want an exact sample from the equilibrium distribution, Markov chains typically will not help us as it is only in very artificial examples that the distribution after some finite time is actually the equilibrium distribution. One thing that we might use is a stationary time, which is a stopping time T, for which X_T\stackrel{d}{=}\pi. Note that there is one trivial way to do this. We can sample Y from distribution \pi before starting the process, then stop X at the first time T for which X_T=Y. But this is no help really, as we need to have Y in the first place!

So we are really interested in less trivial stationary times. Perhaps the best example is the top-to-random shuffle. Here we are given a pack of labelled cards, WLOG initially in descending order at each step we move the top card in the pile to a randomly-chosen location in the pile (which includes back onto the top). Then it turns out that the first time we move the card originally at the bottom from the top to somewhere is a strong stationary time. This is fairly natural, as by this time, every card has been involved in at least one randomising event.

Anyway, so this gives a somewhat artificial way to sample from the uniform distribution on a pack of cards. This strong stationary time is almost surely finite, with distribution given by the coupon collector problem, for which the expectation grows as n\log n, where n is the number of cards.

The problem with this method is that it is not easy in general to come up with a non-contrived stationary time such as this one. The idea of coupling from the past, discussed by some previous authors but introduced in this context by Propp and Wilson in the mid ’90s, is another method to achieve perfect sampling from the equilibrium distribution of a Markov chain. The idea here is to work backwards rather than forwards. The rest of this post, which discusses this idea, is based on the talk given at the Junior Probability Seminar by Irene, and on the chapter in the Levin, Peres, Wilmer book.

The key to the construction is a coupling of the transitions of a Markov chain. In the setting of a simple random walk, we have by construction a coupling of the transitions. It doesn’t matter which state we are at: we toss a coin to decide whether to move up or down, and we can do this without reference to our current position. Levin, Peres and WIlmer call this a random mapping representation in general, and it is yet another concept that is less scary than its definition might suggest.

Given a transition matrix P on state space S, such a representation is a function

\phi: S\times[0,1]\rightarrow S,\text{ s.t. }\mathbb{P}(\phi(i,U)=j)=p_{ij},

where U is a U(0,1) random variable independent of choice of i. In particular, once we have the random value of u, we can consider \phi(i,u) as i varies, to obtain a random map S\rightarrow S. Crucially, this map is not necessarily a bijection.

Note first that there are many possibilities for constructing the representation \phi. For some chains, and some representations, in particular random walks on vertex-transitive graphs (such as SRW – only for now we are restricting attention to finite state spaces) it is possible to choose \phi so that it always gives a bijection, but it is also always possible to choose it so that there is some probability it doesn’t give a bijection.

Let U_1,U_2,\ldots be an IID sequence of U[0,1] random variables, and write \phi_i for the random map induced by U_i. Then consider the sequence of iterated maps:

\phi_1, \phi_1\circ \phi_2, \ldots, \phi_1\circ\ldots\circ\phi_n,

and let T be the (random) smallest time such that the image of \phi_1\circ\ldots\circ \phi_T is a single state. Ie, as we go backwards in time through the maps \phi_i, we are gradually losing various states, corresponding to the maps not being bijections. Since the state space is finite, and the probability of not being a bijection is positive, it can be shown that T is almost surely finite. The claim then is that

Y=\text{Im}(\phi_1\circ\ldots\circ \phi_T)

is distributed as the equilibrium distribution of the chain. We finish by proving this.

Proof: Since the algorithm terminates after finite time almost surely, given any \epsilon>0, we can choose N such that the probability the algorithm stops in at most N steps is greater than 1-\epsilon.

Now run the Markov chain from time -N, started in the equilibrium distribution, with the transition from time -t to -(t-1) given by the random mapping driven by U_t. Thus at time 0, the distribution of the chain is still the equilibrium distribution. But if we condition on the event that T\le N, then X_0=\phi_1\circ \ldots \circ\phi_n(X_{-N})=Y regardless of the initial value. So \mathbb{P}(X_0\ne Y)<\epsilon, and hence the result follows, since \epsilon>0 was arbitrary.

What makes this easier than strong stationary times is that we don’t have to be clever to come up with the stopping time. It is however still important to know how long on average it takes to run the algorithm. At the end of her talk, Irene showed how to adapt this algorithm to deal with Probabilistic Cellular Automata. Roughly speaking, these are a sequence of infinite strings of 0s and 1s. The value of some element is determined randomly as a function of the values in the row underneath, say the element directly underneath and the two either side. In that setting, if you start with a finite subsequence and couple from the past by looking down to lower rows, each time you drop down a row you consider one further element, so in fact the coupling from the past algorithm has to eliminate possibilities fast enough to make up for this, if we want to terminate almost surely in finite time.

Here’s a link to the paper which discusses this in fuller detail.

Enhanced by Zemanta
Advertisement

The Top-to-Random Shuffle III

This post concludes my non-exhaustive list of things I think are interesting about the top-to-random shuffle. In previous posts I have talked about the construction and correct sense of convergence to randomness, and that this algorithm does genuinely achieve uniform randomness at some hitting time which is easy to specify. Promising that posts will be short hasn’t worked in the past so I won’t do that again now, but the idea of this post is brief:

When we specified the dynamics of the top-to-random shuffle, we insisted that the top card card could be placed anywhere in the deck with equal probability including back on top. This appears to be doing nothing except slowing down the shuffling process. Why is this important for convergence to randomness?

Fortunately the answer is short: if we do not let the top card be inserted back onto the top, allowing the configuration to stay the same, then we can divide up the set of orderings into two classes, and the pack will alternate between them.

Why is this a problem? Suppose the classes are called X and Y, and X is the class that contains the original ordering 1,2,…,n. Then after k shuffles, the ordering of the deck will be in X if k is even and in Y if k is odd. Remember our definition of ‘close to randomness’ will be the greatest difference in probability of an event between the actual distribution and the uniform distribution. As before, you can think of this by a betting analogy – what proportion profit can you make again someone who thinks it’s uniform by knowing the true distribution?

Well, it will turn out that the sets X and Y have the same size, so in the uniform distribution, the probability that an ordering is in X is 1/2. Whereas if the pack alternates, then so long as we know how many shuffles have occurred, this probability is either 0 or 1. In particular this is far from 1/2. We should remark that if we introduce the notion of sampling at a random time, or taking an average over all large times in some sense, such problems may disappear, but the result obtained may be less useful. See this post on Cesaro Mixing for details presented in a more rigorous style.

So it remains to see why this is true. First a definition. A transposition is when two elements in a permutation are exchanged. Eg 31452 -> 35412 by transposing 1 and 5. It makes sense intuitively that we can get from any permutation to any other permutation by making successive transpositions. Indeed, this is precisely what is happening in the top-to-random shuffle. To avoid continually having to write it out, we call the original permutation 1,2,…,n the identity permutation.

Then the idea is that X is the set of permutations we can obtain by starting with the identity and applying an even number of transpositions, while Y is the set obtained by applying an odd number of transpositions. For this to work, we will need to show that these sets are disjoint. That is, no permutation can be generated by both an odd number and an even number of transpositions. This is important, as a permutation can certainly be generated from transpositions in multiple ways. For example, if the elements are 1,2,3, we can obtain the permutation 2,1,3 by transposing 1 and 2, obviously. However, we could alternatively start by transposing 2 and 3 to get 1,3,2, then 1 and 3 to get 3,1,2, then 2 and 3 again to get 2,1,3. Note that both of these require an odd number of transpositions.

We will call a permutation even if it is generated by an even number of transpositions, and odd otherwise. We also say that its sign (alternatively signature, parity) is +1 or -1 respectively. To prove this is well-defined, we really want to find a different property that is easier to track.

A useful trick is to count how many pairs of elements are not in the correct order. Let’s do this for our previous example: 31452. There are 5 elements so 5 x 4 / 2 = 10 pairs of elements. We list them:

  • 1 and 2 are in the correct order.
  • 1 and 3 are not, as 3 comes before 1 in this permutation.
  • 1 and 4 are correct.
  • 1 and 5 are correct.
  • 2 and 3 are not.
  • 2 and 4 are not.
  • 2 and 5 are not.
  • 3 and 4 are correct.
  • 3 and 5 are correct.
  • 4 and 5 are not.

So 5 pairs are not in the correct order. Since 5 is odd, the claim is that this means 31452 is an odd permutation. To check this, and to confirm that the sign is well-defined, it suffices to check that the number of so-called inversions, or pairs in the wrong order, changes parity every time we apply a transposition.

This is clearly true if we transpose adjacent elements. Then the orderings of all pairs remain the same, apart from the pair we transposed, which changes. Then, if the elements are not adjacent, instead of transposing them directly, we can perform a succession of transpositions of adjacent elements. The easiest way to describe this is again by example. Suppose we want to transpose 3 and 5 in 31452.

31452 -> 13452 -> 14352 -> 14532 -> 15432 -> 51432.

Note that the middle transposition is actually transposing 3 and 5, and the others are symmetric about this middle operation. In particular, there is an odd number of transpositions in total. So we have proved the result for general transpositions, and thus we now know that the sign of a permutation is well-defined. Note also that there are an equal number of odd and even permutations of every n=>2. For every odd permutation, transposing 1 and 2 gives an even permutation, and vice versa, uniquely, giving a bijection.

What’s really going on is that we are able to multiply permutations, by doing one after the other. Unlike multiplying real numbers, the order in which we do this now matters. In this context, the set of permutations is an example of a general structure called a group. The idea of partitioning a group into subsets which are in some sense symmetric and where some other operation jumps between the subsets is a useful motivation point for a whole avenue of interesting theory. Not to be explored now unfortunately…

The Top-to-Random Shuffle II

In the last post, I introduced the top-to-random shuffle. In particular, we considered why this sort of procedure was important as an alternative to choosing an ordering afresh, and how to we would go about measuring how close we had got to randomness.

In this post, I want to develop the second of these points. The intuition might be that we can get very close to uniform randomness if we repeat the shuffle often enough. Recall this means that even if we choose our bet in a really complicated and careful way, we still couldn’t make much profit by knowing the actual distribution of the ordering. But we might also suspect that the pack will never be exactly random, in the same way that the distribution of the proportion of heads seen on a repeatedly-flipped coin will eventually get very close to 1/2, but will not be exactly 1/2.

This intuition is extremely sensible, and in general is true. It is a nice fact, however, that it fails for the top-to-random shuffle, where we do in fact get to a uniformly random deck. Recall that we approximated how long it would take to get to a state that was roughly random by calculating the time taken for the original bottom card to rise to the top of the deck. This time was:

n+\frac{n}{2}+\frac{n}{3}+\ldots+\frac{n}{n-1},

where n is the total number of cards. A further shuffle is required to send the original card back into the pack somewhere. The claim is that the pack is now uniformly random. Note that if we ever actually use this, we have to be careful because although we can calculate the average time at which this event happens, the time itself is random. Rather than worry about that, let’s see why it is true.

As a motivating example, let’s assume that the pack was originally in the order 1,2,3,…,n, and consider the final relative order of the cards numbered 1 and 2. There is some small probability that on the first go (and then again on the second go possibly also) that the card number 1 stays put. Let’s ignore that possibility and progress to the first time that the 1 has been moved into the interior of the pack, so the 2 is now on top. When we do this, we are choosing a number between 2 and n uniformly at random, and moving the card numbered 1 to this position. Let’s call this number X.

Now, when we try to shuffle the 2, we are choosing a number between 1 and n uniformly at random, and moving the card numbered 2 to this position. Let’s call this number Y. Under exactly what circumstances does 2 end up above 1? The clearest example is if Y=X. Then card 2 has been moved to the position previously occupied by card 1. So card 1 moves up a position (since the card 2 is no longer on the top). The final configuration therefore includes card 1 directly above card 2. So we can say

\mathbb{P}(\text{2 above 1})=\mathbb{P}(Y<X),

where the fact that the inequality in the second probability is strict is important. To calculate this second probability, we want to exploit the symmetry of the situation. The only problems are that the case X=Y is not symmetric as then 1 ends up above 2 as described above, and also that X cannot be 1. So we account for these separately. Note

\mathbb{P}(Y=1)=\frac{1}{n},\quad \mathbb{P}(Y=X)=\frac{1}{n}.

The second result holds overall since it holds whenever we condition on a particular value of X. These events are also disjoint. Then

\mathbb{P}(Y<X)=\mathbb{P}(Y=1)+\mathbb{P}(Y<X | Y>1, X\neq Y)\mathbb{P}(Y>1,X\neq Y)

= \frac{1}{n}+\frac{1}{2}(1-\mathbb{P}(Y>1,Y\neq X))

=\frac{1}{n}+\frac{1}{2}(1-\frac{1}{n}-\frac{1}{n}) = \frac{1}{2}.

In summary, the cards 1 and 2 will be in uniformly random order. We might like to extend this idea, but it will get complicated when we add card 3 to the mix, as it is possible (if unlikely) that 1 and 2 will be further mixed before this. This shouldn’t affect the result much, but it will get complicated to define the notation required to carry this sort of argument all the way up to the nth card.

Even using induction is not going to make life substantially easier. Knowing that once we have inserted cards 1,2,…,k into the pack they are in uniformly random order is not enough to make inference about what happens once we put k+1 into the pack. We have to know something about the current positions of 1,2,…,k. For example, if one of these cards is definitely on the bottom of the pack, then the probability that k+1 ends up last among 1,2,…,k+1 is 1/n rather than 1/k+1 as it should be. So in fact we would have to control an annoying amount of information jointly.

In the argument we attempted above, we were looking at the first times some card k got folded back into the pack. Note that this division of time is different to the one we were using for the coupon collector approach to the mixing time in the previous post. Let’s try to use that instead here.

Now we consider the times at which a card is moved below card n. We deliberately decline to say what these cards are. But rather, we want to prove that, conditional on the cards below n being A_k=\{a_1,\ldots,a_k\}, the ordering of these is uniform on S_{A_k}, that is, every possibility is equally likely. Now this is easy to prove by induction. For, by conditioning on A_k and a_{k+1} being the new card to be moved below n, we are conditioning on the set of cards below n now being A_{k+1}=A_k\cup\{a_{k+1}\}. The position of the new card is uniformly random within this, by construction of the top-to-random shuffle, and so the new arrangement is uniformly random on the (k+1)! possibilities.

To see why we have proved the original result we wanted, note that this argument works at the time when the original bottom card is now at the top. So the remaining cards are uniformly randomly ordered. Inserting card n at random gives an arrangement that is uniformly random overall. So as we suggested before, working out how long it takes to get close to randomness in this case reduces to working out how long it is before the original bottom card hits the top and is re-inserted, as at that point, the pack genuinely is uniformly random.

The Top-to-Random Shuffle

This article is based on a talk I gave to the Maths Society at St Paul’s School on Monday. It may turn into a short series if I have time before I go to ALEA in Luminy near Marseille on Saturday.

My original plan had been to talk about riffle-shuffling, and some of the interesting mixing time themed results one can obtain. As a motivating example, I began by discussing the simpler top-to-random shuffle, and this proved sufficiently interesting to occupy the time I had been allowed (and mea culpa a bit more). It therefore seems worth writing a hopefully moderately accessible blog post on the subject. The aim of this post at least is to discuss the idea that repeatedly shuffling brings a pack of cards close to randomness. We have to settle on a definition of ‘close to randomness’, and find some ways to calculate this.

Suppose we are playing some bizarre card game where it is necessary that three cards labelled, uncontroversially, 1, 2 and 3 need to be placed in a random order. If we are organised, we can write down all the ways to do this in a list:

123, 132, 213, 231, 312, 321.

We want to select each of these with equal probability. We could for example use a dice. Most relevantly, even a computer as ancient as my laptop is very happy simulating a random choice from this set. (Now is not the time to talk about exactly how pseudo-random or otherwise this choice would be.)

Of course, when we play a sensible card game we have not three cards, but fifty-two. So the approach described above still works in theory, but no longer in practice, as the list of possible arrangements now has size 52!. Recall this is defined to be

52!=1\times 2 \times\ldots \times 52.

The reason we get this particular expression is that when we are choosing the first card, we have 52 possible choices. Then, regardless of what this first card actually is, there are precisely 51 cards left from which to choose the second card. So there are 52×51 ways to pick the first two cards in the arrangement, and so on, giving the answer. We can approximate how large 52! is by counting powers of ten rather crudely. It seems reasonable that it should be about 10^{65}. Note that the number of atoms in the universe is *only* about 10^{80}, so if we are going to write down this list, we better have very compact handwriting! But being serious, this number is way too large to realistically compute with, so we have to come up with some cleverer methods.

One way is to spread the cards out on a table then pick them up one at a time, ensuring at all times that the choice of card is uniform among those currently present, and not related to any of the past choices. This is relatively easy for a computer, but hard for a human, and certainly deeply tedious for anyone waiting to receive their hand!

So we seek a different approach, namely an algorithm for shuffling. Our aim is to introduce overall randomness by repeatedly applying some simple but random process. Note we have to be careful about our definition of ‘random’ here. The permutation 123456 is just as ‘random’ as the permutation 361524. That is, if they are fixed, then they are not random at all. Just because it is easier to decribe one of them verbally does not mean it is less random. For example, if I am trying to cheat at poker, then I might be able to if I knew the exact order of the cards in the pack before the dealer dealt. It wouldn’t matter what that order was. I would have to adjust my strategy based on the order, but it wouldn’t affect the fact that I had a massive advantage!

The shuffling algorithm to be discussed here is the top-to-random shuffle. Like all the best things in life, this does exactly what it says on the tin. At a given time, we remove the top card from the deck at present, and insert it at a randomly chosen point in the deck. This could be on the bottom, and it could also be back on the top. It feels like this possibility to remain constant can’t possibly help us, but later we will discuss why we need this.

In any case, it feels natural that if we keep applying this procedure, the arrangement of the deck should start to get more and more random, in the sense that knowing the original arrangement will tell us successively little about the current arrangement as time progresses. But we need to find a way to quantify this if we are to do any mathematics.

When we are talking about real numbers, it is fairly clear what it means if I say that the numbers 2, 1.1, 1.01, 1.001 and so on are getting closer and closer to 1. Indeed we can measure the distance along the number line between each term and 1, using the absolute difference. It is not so clear how to compute the distance between two probability distributions. Bearing in mind the fact that a distribution on the set of permutations of cards is defined to be a set of 52! probabilities that sum to 1, there will be a 52!-1 dimensional space (eg the plane is two-dimensional, the world is three-dimensional, *and so on* – whatever that means) where we have a nice distance formula already.

But this is not what we will choose to use. Rather we return to the cheating-at-poker analogy. Suppose I am playing some sort of game involving the pack of cards with my enemy. He or she thinks the deck is perfectly random, but I know the actual distribution. How big a profit can I make by exploiting this knowledge? This will be our measure of how far a distribution is from uniform. It turns out that this will coincide precisely with the formal definition of total variation distance, but that language belongs to a different level of rigour and is not relevant here.

What is relevant is an explanatory example. Suppose we start with the arrangement 12345678. We are now going to perform one iteration of the top-to-random shuffle. The outcome might, for example, be 23456178, if we insert the 1 between the 6 and the 7. Note there were 8 places for the card to go, so the probability of this particular outcome is 1/8. Now let’s see how I might use my knowledge of the distribution to my advantage. Suppose I suggest the bet that the bottom card is an 8. My enemy thinks the stack is uniformly randomly arranged, so the probability of this is 1/8. On the other hand, I know that the only way the 8 might disappear from the bottom is if I place the 1 under it, which happens with probability 1/8. So in fact, I know the probability of this event is 7/8, which gives me an advantage of 3/4. In fact, I could come up with bets that do even better than this, but they are less simple to describe verbally.

At what point do I lose this advantage? Well, we said that the probability that the 8 leaves the bottom of the stack is 1/8. And it will continue to be 1/8 on every turn where it is at the bottom. Recalling that the outcomes of successive shuffles are independent, note this is reminiscent of rolling a dice until a six comes up. The number of rolls required to get the six is an example of a geometric random variable. I don’t want to spoil S1 (or whichever module) by going into too much detail, but it turns out that if the probability of an event happening on a single go is p, then the average time we have to wait is 1/p. So 1/(1/8)=8 of course, and this is how long we typically have to wait before the bet I placed before becomes much less effective.

Now seems like a good time to stop talking about 8 cards and start talking about n cards. Obviously, in practice, we will want n to be 52. Anyway, by the same argument as before, it takes on average n iterations before the bottom card leaves the bottom. This is important, because after then, my bet that the bottom card is n is no longer so effective. However, I could equally place a bet that one of the bottom *two* cards is n.

So we consider how long it takes before n is no longer one of the bottom two cards. Well certainly we need to wait until it is no long *the* bottom card, which takes time n on average. Then, once it is second bottom, there is now a 2/n chance that we move the previously top card below it, so by the same argument as before, the time for this to happen is n/2 on average. If we want this effect to disappear, we have to wait until the original bottom card is in fact at the top of the pile for the first time, and by extending our previous argument, the average time for this is

n+\frac{n}{2}+\frac{n}{3}+\ldots+\frac{n}{n-1}.

Fortunately, we have tools for approximating this sort of sum, in particular integration, which is the practice of finding the area under certain curves. It turns out that the answer is roughly n log n. You can think of as log n a measure of the number of digits required to write out n. (This is not the exact definition but it will do for now. In any case, log n gets larger as n gets larger, but not very fast.) There’s a lot more about this in my previous post on the coupon collector problem, from a more technical point of view.

The next question will be to prove that it is actually quite well shuffled by this time, but that’s for another post. The other question to ask is whether this is satisfactory overall? For n=52, the number of operations we have to perform is about 230, which is fine for a computer, but deeply tedious for anyone sitting at a casino table waiting for the next hand. So next time we’ll talk about the riffle shuffle, which seems to introduce a lot of randomness in each go, but we’ll also see that we have to be careful, because the randomness may not be as great as our intuition suggests.