The Top-to-Random Shuffle II

In the last post, I introduced the top-to-random shuffle. In particular, we considered why this sort of procedure was important as an alternative to choosing an ordering afresh, and how to we would go about measuring how close we had got to randomness.

In this post, I want to develop the second of these points. The intuition might be that we can get very close to uniform randomness if we repeat the shuffle often enough. Recall this means that even if we choose our bet in a really complicated and careful way, we still couldn’t make much profit by knowing the actual distribution of the ordering. But we might also suspect that the pack will never be exactly random, in the same way that the distribution of the proportion of heads seen on a repeatedly-flipped coin will eventually get very close to 1/2, but will not be exactly 1/2.

This intuition is extremely sensible, and in general is true. It is a nice fact, however, that it fails for the top-to-random shuffle, where we do in fact get to a uniformly random deck. Recall that we approximated how long it would take to get to a state that was roughly random by calculating the time taken for the original bottom card to rise to the top of the deck. This time was:

n+\frac{n}{2}+\frac{n}{3}+\ldots+\frac{n}{n-1},

where n is the total number of cards. A further shuffle is required to send the original card back into the pack somewhere. The claim is that the pack is now uniformly random. Note that if we ever actually use this, we have to be careful because although we can calculate the average time at which this event happens, the time itself is random. Rather than worry about that, let’s see why it is true.

As a motivating example, let’s assume that the pack was originally in the order 1,2,3,…,n, and consider the final relative order of the cards numbered 1 and 2. There is some small probability that on the first go (and then again on the second go possibly also) that the card number 1 stays put. Let’s ignore that possibility and progress to the first time that the 1 has been moved into the interior of the pack, so the 2 is now on top. When we do this, we are choosing a number between 2 and n uniformly at random, and moving the card numbered 1 to this position. Let’s call this number X.

Now, when we try to shuffle the 2, we are choosing a number between 1 and n uniformly at random, and moving the card numbered 2 to this position. Let’s call this number Y. Under exactly what circumstances does 2 end up above 1? The clearest example is if Y=X. Then card 2 has been moved to the position previously occupied by card 1. So card 1 moves up a position (since the card 2 is no longer on the top). The final configuration therefore includes card 1 directly above card 2. So we can say

\mathbb{P}(\text{2 above 1})=\mathbb{P}(Y<X),

where the fact that the inequality in the second probability is strict is important. To calculate this second probability, we want to exploit the symmetry of the situation. The only problems are that the case X=Y is not symmetric as then 1 ends up above 2 as described above, and also that X cannot be 1. So we account for these separately. Note

\mathbb{P}(Y=1)=\frac{1}{n},\quad \mathbb{P}(Y=X)=\frac{1}{n}.

The second result holds overall since it holds whenever we condition on a particular value of X. These events are also disjoint. Then

\mathbb{P}(Y<X)=\mathbb{P}(Y=1)+\mathbb{P}(Y<X | Y>1, X\neq Y)\mathbb{P}(Y>1,X\neq Y)

= \frac{1}{n}+\frac{1}{2}(1-\mathbb{P}(Y>1,Y\neq X))

=\frac{1}{n}+\frac{1}{2}(1-\frac{1}{n}-\frac{1}{n}) = \frac{1}{2}.

In summary, the cards 1 and 2 will be in uniformly random order. We might like to extend this idea, but it will get complicated when we add card 3 to the mix, as it is possible (if unlikely) that 1 and 2 will be further mixed before this. This shouldn’t affect the result much, but it will get complicated to define the notation required to carry this sort of argument all the way up to the nth card.

Even using induction is not going to make life substantially easier. Knowing that once we have inserted cards 1,2,…,k into the pack they are in uniformly random order is not enough to make inference about what happens once we put k+1 into the pack. We have to know something about the current positions of 1,2,…,k. For example, if one of these cards is definitely on the bottom of the pack, then the probability that k+1 ends up last among 1,2,…,k+1 is 1/n rather than 1/k+1 as it should be. So in fact we would have to control an annoying amount of information jointly.

In the argument we attempted above, we were looking at the first times some card k got folded back into the pack. Note that this division of time is different to the one we were using for the coupon collector approach to the mixing time in the previous post. Let’s try to use that instead here.

Now we consider the times at which a card is moved below card n. We deliberately decline to say what these cards are. But rather, we want to prove that, conditional on the cards below n being A_k=\{a_1,\ldots,a_k\}, the ordering of these is uniform on S_{A_k}, that is, every possibility is equally likely. Now this is easy to prove by induction. For, by conditioning on A_k and a_{k+1} being the new card to be moved below n, we are conditioning on the set of cards below n now being A_{k+1}=A_k\cup\{a_{k+1}\}. The position of the new card is uniformly random within this, by construction of the top-to-random shuffle, and so the new arrangement is uniformly random on the (k+1)! possibilities.

To see why we have proved the original result we wanted, note that this argument works at the time when the original bottom card is now at the top. So the remaining cards are uniformly randomly ordered. Inserting card n at random gives an arrangement that is uniformly random overall. So as we suggested before, working out how long it takes to get close to randomness in this case reduces to working out how long it is before the original bottom card hits the top and is re-inserted, as at that point, the pack genuinely is uniformly random.

Advertisement

The Top-to-Random Shuffle

This article is based on a talk I gave to the Maths Society at St Paul’s School on Monday. It may turn into a short series if I have time before I go to ALEA in Luminy near Marseille on Saturday.

My original plan had been to talk about riffle-shuffling, and some of the interesting mixing time themed results one can obtain. As a motivating example, I began by discussing the simpler top-to-random shuffle, and this proved sufficiently interesting to occupy the time I had been allowed (and mea culpa a bit more). It therefore seems worth writing a hopefully moderately accessible blog post on the subject. The aim of this post at least is to discuss the idea that repeatedly shuffling brings a pack of cards close to randomness. We have to settle on a definition of ‘close to randomness’, and find some ways to calculate this.

Suppose we are playing some bizarre card game where it is necessary that three cards labelled, uncontroversially, 1, 2 and 3 need to be placed in a random order. If we are organised, we can write down all the ways to do this in a list:

123, 132, 213, 231, 312, 321.

We want to select each of these with equal probability. We could for example use a dice. Most relevantly, even a computer as ancient as my laptop is very happy simulating a random choice from this set. (Now is not the time to talk about exactly how pseudo-random or otherwise this choice would be.)

Of course, when we play a sensible card game we have not three cards, but fifty-two. So the approach described above still works in theory, but no longer in practice, as the list of possible arrangements now has size 52!. Recall this is defined to be

52!=1\times 2 \times\ldots \times 52.

The reason we get this particular expression is that when we are choosing the first card, we have 52 possible choices. Then, regardless of what this first card actually is, there are precisely 51 cards left from which to choose the second card. So there are 52×51 ways to pick the first two cards in the arrangement, and so on, giving the answer. We can approximate how large 52! is by counting powers of ten rather crudely. It seems reasonable that it should be about 10^{65}. Note that the number of atoms in the universe is *only* about 10^{80}, so if we are going to write down this list, we better have very compact handwriting! But being serious, this number is way too large to realistically compute with, so we have to come up with some cleverer methods.

One way is to spread the cards out on a table then pick them up one at a time, ensuring at all times that the choice of card is uniform among those currently present, and not related to any of the past choices. This is relatively easy for a computer, but hard for a human, and certainly deeply tedious for anyone waiting to receive their hand!

So we seek a different approach, namely an algorithm for shuffling. Our aim is to introduce overall randomness by repeatedly applying some simple but random process. Note we have to be careful about our definition of ‘random’ here. The permutation 123456 is just as ‘random’ as the permutation 361524. That is, if they are fixed, then they are not random at all. Just because it is easier to decribe one of them verbally does not mean it is less random. For example, if I am trying to cheat at poker, then I might be able to if I knew the exact order of the cards in the pack before the dealer dealt. It wouldn’t matter what that order was. I would have to adjust my strategy based on the order, but it wouldn’t affect the fact that I had a massive advantage!

The shuffling algorithm to be discussed here is the top-to-random shuffle. Like all the best things in life, this does exactly what it says on the tin. At a given time, we remove the top card from the deck at present, and insert it at a randomly chosen point in the deck. This could be on the bottom, and it could also be back on the top. It feels like this possibility to remain constant can’t possibly help us, but later we will discuss why we need this.

In any case, it feels natural that if we keep applying this procedure, the arrangement of the deck should start to get more and more random, in the sense that knowing the original arrangement will tell us successively little about the current arrangement as time progresses. But we need to find a way to quantify this if we are to do any mathematics.

When we are talking about real numbers, it is fairly clear what it means if I say that the numbers 2, 1.1, 1.01, 1.001 and so on are getting closer and closer to 1. Indeed we can measure the distance along the number line between each term and 1, using the absolute difference. It is not so clear how to compute the distance between two probability distributions. Bearing in mind the fact that a distribution on the set of permutations of cards is defined to be a set of 52! probabilities that sum to 1, there will be a 52!-1 dimensional space (eg the plane is two-dimensional, the world is three-dimensional, *and so on* – whatever that means) where we have a nice distance formula already.

But this is not what we will choose to use. Rather we return to the cheating-at-poker analogy. Suppose I am playing some sort of game involving the pack of cards with my enemy. He or she thinks the deck is perfectly random, but I know the actual distribution. How big a profit can I make by exploiting this knowledge? This will be our measure of how far a distribution is from uniform. It turns out that this will coincide precisely with the formal definition of total variation distance, but that language belongs to a different level of rigour and is not relevant here.

What is relevant is an explanatory example. Suppose we start with the arrangement 12345678. We are now going to perform one iteration of the top-to-random shuffle. The outcome might, for example, be 23456178, if we insert the 1 between the 6 and the 7. Note there were 8 places for the card to go, so the probability of this particular outcome is 1/8. Now let’s see how I might use my knowledge of the distribution to my advantage. Suppose I suggest the bet that the bottom card is an 8. My enemy thinks the stack is uniformly randomly arranged, so the probability of this is 1/8. On the other hand, I know that the only way the 8 might disappear from the bottom is if I place the 1 under it, which happens with probability 1/8. So in fact, I know the probability of this event is 7/8, which gives me an advantage of 3/4. In fact, I could come up with bets that do even better than this, but they are less simple to describe verbally.

At what point do I lose this advantage? Well, we said that the probability that the 8 leaves the bottom of the stack is 1/8. And it will continue to be 1/8 on every turn where it is at the bottom. Recalling that the outcomes of successive shuffles are independent, note this is reminiscent of rolling a dice until a six comes up. The number of rolls required to get the six is an example of a geometric random variable. I don’t want to spoil S1 (or whichever module) by going into too much detail, but it turns out that if the probability of an event happening on a single go is p, then the average time we have to wait is 1/p. So 1/(1/8)=8 of course, and this is how long we typically have to wait before the bet I placed before becomes much less effective.

Now seems like a good time to stop talking about 8 cards and start talking about n cards. Obviously, in practice, we will want n to be 52. Anyway, by the same argument as before, it takes on average n iterations before the bottom card leaves the bottom. This is important, because after then, my bet that the bottom card is n is no longer so effective. However, I could equally place a bet that one of the bottom *two* cards is n.

So we consider how long it takes before n is no longer one of the bottom two cards. Well certainly we need to wait until it is no long *the* bottom card, which takes time n on average. Then, once it is second bottom, there is now a 2/n chance that we move the previously top card below it, so by the same argument as before, the time for this to happen is n/2 on average. If we want this effect to disappear, we have to wait until the original bottom card is in fact at the top of the pile for the first time, and by extending our previous argument, the average time for this is

n+\frac{n}{2}+\frac{n}{3}+\ldots+\frac{n}{n-1}.

Fortunately, we have tools for approximating this sort of sum, in particular integration, which is the practice of finding the area under certain curves. It turns out that the answer is roughly n log n. You can think of as log n a measure of the number of digits required to write out n. (This is not the exact definition but it will do for now. In any case, log n gets larger as n gets larger, but not very fast.) There’s a lot more about this in my previous post on the coupon collector problem, from a more technical point of view.

The next question will be to prove that it is actually quite well shuffled by this time, but that’s for another post. The other question to ask is whether this is satisfactory overall? For n=52, the number of operations we have to perform is about 230, which is fine for a computer, but deeply tedious for anyone sitting at a casino table waiting for the next hand. So next time we’ll talk about the riffle shuffle, which seems to introduce a lot of randomness in each go, but we’ll also see that we have to be careful, because the randomness may not be as great as our intuition suggests.

Mixing Times 5 – Cesaro Mixing

We have just finished discussing chapters 11 and 12 of Markov Chains and Mixing Times, the end of the ‘core material’. I thought that, rather than addressing some of the more interesting but technical spectral methods that have just arisen, it would be a good subject for a quick post to collate some of the information about Cesaro mixing, which is spread throughout this first section.

Idea

A main result in the introductory theory of Markov chains is that for an irreducible aperiodic chain X, the distribution of X_t\rightarrow \pi, the (unique) equilibrium distribution. The mixing time gives the rate at which this first mode of convergence takes place. We have freedom over the initial state, so we typically consider the ‘worst case scenario’, ie the slowest convergence. The most appropriate metric is given by the total variation distance, which is defined in previous posts. The most important point to note is that the mixing time should be thought of as the correct timescale for convergence, rather than some threshold. In particular, the time at which the chain is within 1/4 of the equilibrium distribution in the TV metric has the same order magnitude (in n, some parameter controlling the number of states) as the time at which it is within 1/20 of the equilibrium distribution.

But this isn’t the only result about convergence in distribution of functionals of Markov chains. Perhaps more intuitive is the ergodic theorem which asserts that the proportion of time spent in a particular state also converges to the equilibrium probability as time advances. We might write:

\frac{1}{t}\sum_t \mathbf{1}(X_t=x)\rightarrow \pi(x),\quad \forall x\in \Omega.

Note that if the state space \Omega is finite, then we can also assume that this occurs uniformly in x. We can also think of the LHS of this convergence as a measure on \Omega varying in time

\frac{1}{t}\sum_t \mathbf{1}(X_t=\cdot),

and the mixing time for this sequence of measures is defined as for the conventional mixing time, and is called the Cesaro mixing time, at least in the Levin / Peres / Wilmer text.

Advantages

There are some obvious advantages to considering Cesaro mixing. Principally, a main drawback of conventional mixing is that we are unable to consider periodic chains. This property was the main content of the previous post, but to summarise, if a chain switches between various classes in a partition of the state space in a deterministic periodic way, then the distribution does not necessarily converge to equilibrium. The previous post discusses several ways of resolving this problem in specific cases. Note that this problem does not affect Cesaro mixing as the ergodic theorem continues to hold in the periodic case. Indeed the form of the distribution (which we might call an occupation measure in some contexts) confirms the intuition about viewing global mixing as a sort of sum over mixing modulo k in time.

Other advantages include the fact that the dependence on the initial state is weaker. For instance, consider the occupation measures for a chain started at x which moves first to y, versus a chain which starts at y then proceeds as the original. It requires very little thought to see that for O(1) values of t, this difference in occupation measures between these chains becomes small.

Another bonus is that we can use so-called stationary times to control Cesaro mixing. A stationary time is a stopping time such that X_\tau\stackrel{d}{=}\pi. It is clear that if we wait until \tau, then run the chain for a further \alpha\tau, the chain will have spent \frac{\alpha}{1+\alpha} of its duration in the equilibrium distribution, and so using Markov’s inequality and bounding the total variation distance between occupation measure and \pi by 1 up until the stationary time, we can get good bounds for the Cesaro mixing time in terms of \mathbb{E}\tau.

Why does this fail to work for normal mixing? The key to the above argument was that by taking an average over time up to some T>>\tau, the dependence on the actual value of X_\tau was suppressed. Consider the deterministic walk on the cycle \mathbb{Z}_n, which advances by 1 modulo n on each go. Now sample independently a random variable Z distributed uniformly on \mathbb{Z}_n. By definition, the random hitting time \tau_Z is a stationary time, but in fact the chain’s distribution does not converge. The condition we actually require for normal mixing is that \tau be a strong stationary time, meaning that X_\tau\stackrel{d}{=}\pi, and the value of X_\tau is independent of \tau. With this definition we can proceed with a similar result for normal mixing. An example of a strong stationary time would be for shuffling a pack of cards by repeatedly inserting the top card into a random place in the rest of the pile. Then consider the moment at which the original bottom card first reaches the top of the pile. It does not take too much to reassure oneself that after the next move, we have a strong stationary time, since every card has been randomised at least once, and the position of the other cards is independent of how long it took the original bottom card to rise to the top.

Disadvantages

So why do we not consider Cesaro mixing rather than the conventional variety? Well, mainly because of how we actually use mixing times. The Metropolis algorithm gives a way to generate chains with a particular equilibrium distribution, including ones for which it is hard to sample directly. Mixing time theory then gives a quantitative answer to the question of how long it is necessary to run such a chain for before it gives a good estimate to the equilibrium distribution. In many cases, such a random walk on a large unknown network, the main aim when applying such Monte Carlo procedures is to minimise the difficulty of calculation. For Cesaro mixing, you have to store all the information about path states, while for conventional mixing you only care about your current location.

The other phenomenon that is lacking in Cesaro mixing is cutoff. This is where the total variation distance

d_n(t)=||P^t(x,\cdot)-\pi||_{TV}

converges to 0 suddenly. More formally, there is some timescale f(n) such that

\lim_{n\rightarrow\infty}d_n(cf(n))=\begin{cases}1& c<1\& c>1,\end{cases}

so in the n-limit, the graph of d looks like a step-function. Several of the shuffling chains exhibit this property, leading to the statements like “7 shuffles are required to mix a standard pack of cards”. Cesaro mixing smooths out this effect on an f(n) timescale.

A Further Example

Perhaps the best example where Cesaro mixing happens faster than normal mixing is in the case of a lazy biased random walk on \mathbb{Z}_n. (Ex. 4.10 in MTMC) Here, we stay put with probability 1/2, otherwise move clockwise with probability p>1/4 and counter-clockwise with probability 1/2 – p < 1/4. This chain is not reversible, as we can determine the direction (or arrow) of time by examining a path. Roughly speaking, the chain will drift clockwise at rate 2p – 1/2 > 0. In particular, at some time Kn, where K is large, we would expect to have completed ~ \frac{K}{2p-\frac12} circuits of the vertices, and so the occupation measure will be close to the uniform equilibrium distribution if we choose K large enough.

On the other hand, the distribution of X at time Kn is still fairly concentrated. If we assume we are instead performing the random walk on \mathbb{Z}, the distribution after Kn i.i.d. increments is

X_t\sim N(Kn(2p-\frac12), Kn\sigma^2).

That is, the standard deviation is O(\sqrt{n})<<n. So, even once we return to considering the random walk on the cyclic group, if we view it as a circle, we expect most of the probability mass to be concentrated near Kn(2p-\frac12) \mod n. By an identical heuristic argument, we see that the mixing time is achieved when the variance has order n, that is when time has order n^2.