# Coupling from the Past

In a long series of previous posts I have talked about mixing times for Markov chains. We consider how long it takes for the distribution of a particular Markov chain to approach equilibrium. We are particularly interested in the asymptotics when some parameter of the model grows, such as the size of the state space, grows to infinity.

But why are we interested in the underlying problem? The idea of Markov Chain Monte Carlo methods is to sample from an intractable distribution by instead sampling from a Markov chain which approximates the distribution well at large times. A distribution might be intractable because it is computationally demanding to work out the normalising constant, or it might be distributed uniformly on a complicated combinatorial set. If, however, the distribution is the equilibrium distribution of some Markov chain, then we know how to at least sample from a distribution which is close to the one we want. But we need to know how long to run the process. We will typically tolerate some small error in approximating the distribution (whether we measure this in terms of total variation distance or some other metric doesn’t really matter at this heuristic level), but we need to know how it scale. If we double the size of the system, do we need to double the number of iterations of the chain, or square it. This is really important if we are going to use this for large real-world models with finite computing power!

Sometimes though, an approximation is not enough. If we want an exact sample from the equilibrium distribution, Markov chains typically will not help us as it is only in very artificial examples that the distribution after some finite time is actually the equilibrium distribution. One thing that we might use is a stationary time, which is a stopping time T, for which $X_T\stackrel{d}{=}\pi$. Note that there is one trivial way to do this. We can sample Y from distribution $\pi$ before starting the process, then stop X at the first time T for which $X_T=Y$. But this is no help really, as we need to have Y in the first place!

So we are really interested in less trivial stationary times. Perhaps the best example is the top-to-random shuffle. Here we are given a pack of labelled cards, WLOG initially in descending order at each step we move the top card in the pile to a randomly-chosen location in the pile (which includes back onto the top). Then it turns out that the first time we move the card originally at the bottom from the top to somewhere is a strong stationary time. This is fairly natural, as by this time, every card has been involved in at least one randomising event.

Anyway, so this gives a somewhat artificial way to sample from the uniform distribution on a pack of cards. This strong stationary time is almost surely finite, with distribution given by the coupon collector problem, for which the expectation grows as $n\log n$, where n is the number of cards.

The problem with this method is that it is not easy in general to come up with a non-contrived stationary time such as this one. The idea of coupling from the past, discussed by some previous authors but introduced in this context by Propp and Wilson in the mid ’90s, is another method to achieve perfect sampling from the equilibrium distribution of a Markov chain. The idea here is to work backwards rather than forwards. The rest of this post, which discusses this idea, is based on the talk given at the Junior Probability Seminar by Irene, and on the chapter in the Levin, Peres, Wilmer book.

The key to the construction is a coupling of the transitions of a Markov chain. In the setting of a simple random walk, we have by construction a coupling of the transitions. It doesn’t matter which state we are at: we toss a coin to decide whether to move up or down, and we can do this without reference to our current position. Levin, Peres and WIlmer call this a random mapping representation in general, and it is yet another concept that is less scary than its definition might suggest.

Given a transition matrix P on state space S, such a representation is a function

$\phi: S\times[0,1]\rightarrow S,\text{ s.t. }\mathbb{P}(\phi(i,U)=j)=p_{ij},$

where U is a U(0,1) random variable independent of choice of i. In particular, once we have the random value of u, we can consider $\phi(i,u)$ as i varies, to obtain a random map $S\rightarrow S$. Crucially, this map is not necessarily a bijection.

Note first that there are many possibilities for constructing the representation $\phi$. For some chains, and some representations, in particular random walks on vertex-transitive graphs (such as SRW – only for now we are restricting attention to finite state spaces) it is possible to choose $\phi$ so that it always gives a bijection, but it is also always possible to choose it so that there is some probability it doesn’t give a bijection.

Let $U_1,U_2,\ldots$ be an IID sequence of U[0,1] random variables, and write $\phi_i$ for the random map induced by $U_i$. Then consider the sequence of iterated maps:

$\phi_1, \phi_1\circ \phi_2, \ldots, \phi_1\circ\ldots\circ\phi_n,$

and let T be the (random) smallest time such that the image of $\phi_1\circ\ldots\circ \phi_T$ is a single state. Ie, as we go backwards in time through the maps $\phi_i$, we are gradually losing various states, corresponding to the maps not being bijections. Since the state space is finite, and the probability of not being a bijection is positive, it can be shown that T is almost surely finite. The claim then is that

$Y=\text{Im}(\phi_1\circ\ldots\circ \phi_T)$

is distributed as the equilibrium distribution of the chain. We finish by proving this.

Proof: Since the algorithm terminates after finite time almost surely, given any $\epsilon>0$, we can choose N such that the probability the algorithm stops in at most N steps is greater than $1-\epsilon$.

Now run the Markov chain from time -N, started in the equilibrium distribution, with the transition from time -t to -(t-1) given by the random mapping driven by $U_t$. Thus at time 0, the distribution of the chain is still the equilibrium distribution. But if we condition on the event that $T\le N$, then $X_0=\phi_1\circ \ldots \circ\phi_n(X_{-N})=Y$ regardless of the initial value. So $\mathbb{P}(X_0\ne Y)<\epsilon$, and hence the result follows, since $\epsilon>0$ was arbitrary.

What makes this easier than strong stationary times is that we don’t have to be clever to come up with the stopping time. It is however still important to know how long on average it takes to run the algorithm. At the end of her talk, Irene showed how to adapt this algorithm to deal with Probabilistic Cellular Automata. Roughly speaking, these are a sequence of infinite strings of 0s and 1s. The value of some element is determined randomly as a function of the values in the row underneath, say the element directly underneath and the two either side. In that setting, if you start with a finite subsequence and couple from the past by looking down to lower rows, each time you drop down a row you consider one further element, so in fact the coupling from the past algorithm has to eliminate possibilities fast enough to make up for this, if we want to terminate almost surely in finite time.

Here’s a link to the paper which discusses this in fuller detail.

# Mixing of the Noisy Voter Model

I’ve been spending a bit of time recently thinking about some interacting particle systems and mixing time problems, so it was interesting to find a short paper due to Harishchandra Ramadas on arXiv a couple of weeks ago combining these two areas. I enjoyed reading it, so I thought I would summarise its arguments here, in preparation for perhaps talking about it at this week’s junior probability seminar.

The voter model is a graph-based interacting particle system. The underlying motivation is to describe the spread of opinions through a population. Each vertex represents an agent who has an opinion which might vary over time. At exponential rates, an agent will replace their opinion with the opinion of one of their neighbours, chosen uniformly at random. Note that, obviously, this might not actually involve a change of opinion.

One problem with this model is that we might be interested in convergence to equilibrium, but this might be hard to specify, as the state where everything is 0 and the state where everything is 1 are absorbing, and similarly any binomial distribution on the set of vertices is invariant. It is also not entirely realistic to assume that future updates to the system are entirely based on the current configuration. In any actual application, we should allow for some, possibly very small, influence from the outside, whether due to random errors in the update rule, or some external factors which haven’t been explicitly accounted for.

So we consider instead the noisy voter model where, in addition to the dynamics described above, each vertex changes state at rate $\delta$, independently of the configuration or the history. [Note that we are now considering the set of opinions {0,1} ]

We are interested in the mixing time of this system. It is not immediately clear that this should have an invariant distribution, but it seems reasonable since the noise, which converts 0 to 1 and 1 to 0 at equal rates if the proportion of 0s and 1s is equal, moves the system towards something like Bin(V(G),1/2), although the correlations need to take account of the local geometry.

Anyway, we assume $\delta>0$ is fixed, and n tends to infinity. Ramadas proves that the dynamics have the property of optimal temporal mixing, which states that given copies of the process X, Y starting from different initial conditions

$||X_t-Y_t||\le C n e^{-ct},$

for suitable choice of constants. Note that it doesn’t matter whether X and Y are coupled or independent. Total variation distance is a function of distribution, not of probability space. I don’t have a great intuition for exactly what this means, except to make the obvious comment that the distance from uniformity grows at most uniformly as the size of the system grows. Note that this immediately implies that the mixing time is O(log n). The rest of the paper proves that the noisy voter model has this property.

The method is to consider the relation:

$||X-Y||\le \mathbb{P}(X\ne Y),$

which says that the total variation distance gives the minimum of the RHS probability over couplings of X and Y. Anyway, as discussed in the previous post on duality in such interacting particle systems, the key step is to consider the process as a deterministic walk through a random environment. Ie, we apply the same moves to configurations started from different initial conditions.

The first step is to break down the noise mechanism to make it easier to couple everything. Instead of saying that a vertex changes state at rate $\delta$, we instead have two noise processes. One sets a vertex’s state to 0 at rate $\delta$, and another sets a vertex’s state to 1 at rate $\delta$ independently. Since at any configuration, for each vertex, precisely one of these processes has no effect.

So for the coupling, we run the same noise processes, and we choose the neighbours from which to update our opinions the same in both processes. The only difference is the initial configuration. With these two processes, X and Y defined on the same probability space, Ramadas defines $Z_t(v)=\mathbf{1}\{X_t(v)\neq Y_t(v)\}$, which counts how many vertices are different between the two processes. Almost surely, eventually all the Zs will be 0, and this state is absorbing because of how we’ve defined this coupling. We are interested to find out how long it takes for this event, which we could think of like a strong stationary time, to happen.

To control this, we track the evolution in the probabilities of $Z_t(v)=0$ jointly. We note that after a noise event at vertex v, we definitely have $Z_t(v)=0$. After an opinion-change event, whether $Z_t(v)$ changes depends on whether $Z_t(u)$ is 1, where u is the uniformly-chosen vertex from which v inherits its opinion in both processes. We obtain the following ODEs:

$\frac{d}{dt}\mathbb{P}(Z_t(v)=1)=\frac{1}{d(v)}\left(\sum_{u:u\sim v}\mathbb{P}(Z_t(v)=0,Z_t(u)=1)\right)$

$-\frac{1}{d(v)}\left(\sum_{u:u\sim v}\mathbb{P}(Z_t(v)=1,Z_t(u)=0)\right)-2\delta \mathbb{P}(Z_t(v)=1).$

We interpret this sum as a negative bit plus two terms which we would hope would cancel each other out. It seems impractical to attempt to formalise this heuristic. However, we can instead choose to consider at time t to consider the ODE for whichever vertex v maximises $Z_t(v)$. It’s then clear that the RHS is less than $-2\delta \mathbb{P}(Z_t(v)=1)$.

Right-differentiabililty of $Z_t(v_{max})$ follows from differentiability of $Z_t(v)$. The number of vertices is finite, and almost surely the vertex attaining this maximum can’t change too often. In particular, from the ODE, we must have $Z_t(v_{max})$ bounded above by $Ce^{-2\delta t}$. From which it follows that

$\mathbb{P}(X\ne Y)\le nCe^{-2\delta t },$

and we’ve shown this optimal temporal mixing property, and hence the mixing time has order log n.

What I’ve discussed so far is taken entirely from Section 3 of Ramadas’s paper. My first thought was whether this bound is actually correct in scale, as we have used only the size of the graph rather than any finer information about its graph structure. The author discusses that it is not possible to use a general result due to Hayes and Sinclair on this topic in this setting, as it is not reversible.

I certainly can’t answer this question, but I will make a comment on the extreme examples. If the graph is the empty graph on n vertices, then the voter model dynamics are null, so we essentially have just a coupon collector problem going on. It doesn’t lose any generality to start from entirely 0s, then at time t, the distribution of the system will be Bin(n,p), with p a function of time, but not n. Now note that

$||\mathrm{Bin}(n,p)-\mathrm{Bin}(n,q)||_{TV}=\Theta(\sqrt{n}|p-q|),$

so in this example, we need to choose a time so that $p=\frac12+\Theta(\sqrt{n})$. This is really a statement about the central limit theorem. It’s clear that Bin(n,1/2) should genuinely be the equilibrium distribution here. The evolution of p(t) is exponential, and hence we get the O(log n) scaling emerging. Note that the presence of $\sqrt{n}$ rather than any other power of n makes no difference, though it would be relevant if we actually wanted to know the constant.

At the other end of the connectivity spectrum, we consider the dynamics on the complete graph. Here, let’s count the number of 1s, and call it W(t) say. Then W goes from k to k+1 at rate $\frac{(n-k)\delta}{2}+\frac{k(n-k)}{n}$, and from k to k-1 at rate $\frac{k\delta}{2}+\frac{k(n-k)}{n}$. In particular, note in expectation, the contributions from the voter model dynamics cancel. This fits with our intuition that it is solely the noise driving the system towards equilibrium.

If we take an ODE approximation, we get:

$\frac{dX}{dt}=\frac{n}{2}-\delta x,$

from which it follows that

$t(x)=\alpha-\frac{1}{\delta}\log(C-\delta x),$

which gives precisely the correct scaling, when we take x=n/2.

I would be very interested to think about or hear about arguments for a general lower bound across all graphs.

REFERENCES

Hayes, Sinclair (2007) – A general lower bound for mixing of single-site dynamics on graphs (arXiv)

Ramadas (2014) – Mixing of the Noisy Voter Model (arXiv link at top of article)

# The Top-to-Random Shuffle II

In the last post, I introduced the top-to-random shuffle. In particular, we considered why this sort of procedure was important as an alternative to choosing an ordering afresh, and how to we would go about measuring how close we had got to randomness.

In this post, I want to develop the second of these points. The intuition might be that we can get very close to uniform randomness if we repeat the shuffle often enough. Recall this means that even if we choose our bet in a really complicated and careful way, we still couldn’t make much profit by knowing the actual distribution of the ordering. But we might also suspect that the pack will never be exactly random, in the same way that the distribution of the proportion of heads seen on a repeatedly-flipped coin will eventually get very close to 1/2, but will not be exactly 1/2.

This intuition is extremely sensible, and in general is true. It is a nice fact, however, that it fails for the top-to-random shuffle, where we do in fact get to a uniformly random deck. Recall that we approximated how long it would take to get to a state that was roughly random by calculating the time taken for the original bottom card to rise to the top of the deck. This time was:

$n+\frac{n}{2}+\frac{n}{3}+\ldots+\frac{n}{n-1},$

where n is the total number of cards. A further shuffle is required to send the original card back into the pack somewhere. The claim is that the pack is now uniformly random. Note that if we ever actually use this, we have to be careful because although we can calculate the average time at which this event happens, the time itself is random. Rather than worry about that, let’s see why it is true.

As a motivating example, let’s assume that the pack was originally in the order 1,2,3,…,n, and consider the final relative order of the cards numbered 1 and 2. There is some small probability that on the first go (and then again on the second go possibly also) that the card number 1 stays put. Let’s ignore that possibility and progress to the first time that the 1 has been moved into the interior of the pack, so the 2 is now on top. When we do this, we are choosing a number between 2 and n uniformly at random, and moving the card numbered 1 to this position. Let’s call this number X.

Now, when we try to shuffle the 2, we are choosing a number between 1 and n uniformly at random, and moving the card numbered 2 to this position. Let’s call this number Y. Under exactly what circumstances does 2 end up above 1? The clearest example is if Y=X. Then card 2 has been moved to the position previously occupied by card 1. So card 1 moves up a position (since the card 2 is no longer on the top). The final configuration therefore includes card 1 directly above card 2. So we can say

$\mathbb{P}(\text{2 above 1})=\mathbb{P}(Y

where the fact that the inequality in the second probability is strict is important. To calculate this second probability, we want to exploit the symmetry of the situation. The only problems are that the case X=Y is not symmetric as then 1 ends up above 2 as described above, and also that X cannot be 1. So we account for these separately. Note

$\mathbb{P}(Y=1)=\frac{1}{n},\quad \mathbb{P}(Y=X)=\frac{1}{n}.$

The second result holds overall since it holds whenever we condition on a particular value of X. These events are also disjoint. Then

$\mathbb{P}(Y1, X\neq Y)\mathbb{P}(Y>1,X\neq Y)$

$= \frac{1}{n}+\frac{1}{2}(1-\mathbb{P}(Y>1,Y\neq X))$

$=\frac{1}{n}+\frac{1}{2}(1-\frac{1}{n}-\frac{1}{n}) = \frac{1}{2}.$

In summary, the cards 1 and 2 will be in uniformly random order. We might like to extend this idea, but it will get complicated when we add card 3 to the mix, as it is possible (if unlikely) that 1 and 2 will be further mixed before this. This shouldn’t affect the result much, but it will get complicated to define the notation required to carry this sort of argument all the way up to the nth card.

Even using induction is not going to make life substantially easier. Knowing that once we have inserted cards 1,2,…,k into the pack they are in uniformly random order is not enough to make inference about what happens once we put k+1 into the pack. We have to know something about the current positions of 1,2,…,k. For example, if one of these cards is definitely on the bottom of the pack, then the probability that k+1 ends up last among 1,2,…,k+1 is 1/n rather than 1/k+1 as it should be. So in fact we would have to control an annoying amount of information jointly.

In the argument we attempted above, we were looking at the first times some card k got folded back into the pack. Note that this division of time is different to the one we were using for the coupon collector approach to the mixing time in the previous post. Let’s try to use that instead here.

Now we consider the times at which a card is moved below card n. We deliberately decline to say what these cards are. But rather, we want to prove that, conditional on the cards below n being $A_k=\{a_1,\ldots,a_k\}$, the ordering of these is uniform on $S_{A_k}$, that is, every possibility is equally likely. Now this is easy to prove by induction. For, by conditioning on $A_k$ and $a_{k+1}$ being the new card to be moved below n, we are conditioning on the set of cards below n now being $A_{k+1}=A_k\cup\{a_{k+1}\}$. The position of the new card is uniformly random within this, by construction of the top-to-random shuffle, and so the new arrangement is uniformly random on the (k+1)! possibilities.

To see why we have proved the original result we wanted, note that this argument works at the time when the original bottom card is now at the top. So the remaining cards are uniformly randomly ordered. Inserting card n at random gives an arrangement that is uniformly random overall. So as we suggested before, working out how long it takes to get close to randomness in this case reduces to working out how long it is before the original bottom card hits the top and is re-inserted, as at that point, the pack genuinely is uniformly random.

# The Top-to-Random Shuffle

This article is based on a talk I gave to the Maths Society at St Paul’s School on Monday. It may turn into a short series if I have time before I go to ALEA in Luminy near Marseille on Saturday.

My original plan had been to talk about riffle-shuffling, and some of the interesting mixing time themed results one can obtain. As a motivating example, I began by discussing the simpler top-to-random shuffle, and this proved sufficiently interesting to occupy the time I had been allowed (and mea culpa a bit more). It therefore seems worth writing a hopefully moderately accessible blog post on the subject. The aim of this post at least is to discuss the idea that repeatedly shuffling brings a pack of cards close to randomness. We have to settle on a definition of ‘close to randomness’, and find some ways to calculate this.

Suppose we are playing some bizarre card game where it is necessary that three cards labelled, uncontroversially, 1, 2 and 3 need to be placed in a random order. If we are organised, we can write down all the ways to do this in a list:

123, 132, 213, 231, 312, 321.

We want to select each of these with equal probability. We could for example use a dice. Most relevantly, even a computer as ancient as my laptop is very happy simulating a random choice from this set. (Now is not the time to talk about exactly how pseudo-random or otherwise this choice would be.)

Of course, when we play a sensible card game we have not three cards, but fifty-two. So the approach described above still works in theory, but no longer in practice, as the list of possible arrangements now has size 52!. Recall this is defined to be

$52!=1\times 2 \times\ldots \times 52.$

The reason we get this particular expression is that when we are choosing the first card, we have 52 possible choices. Then, regardless of what this first card actually is, there are precisely 51 cards left from which to choose the second card. So there are 52×51 ways to pick the first two cards in the arrangement, and so on, giving the answer. We can approximate how large 52! is by counting powers of ten rather crudely. It seems reasonable that it should be about $10^{65}$. Note that the number of atoms in the universe is *only* about $10^{80}$, so if we are going to write down this list, we better have very compact handwriting! But being serious, this number is way too large to realistically compute with, so we have to come up with some cleverer methods.

One way is to spread the cards out on a table then pick them up one at a time, ensuring at all times that the choice of card is uniform among those currently present, and not related to any of the past choices. This is relatively easy for a computer, but hard for a human, and certainly deeply tedious for anyone waiting to receive their hand!

So we seek a different approach, namely an algorithm for shuffling. Our aim is to introduce overall randomness by repeatedly applying some simple but random process. Note we have to be careful about our definition of ‘random’ here. The permutation 123456 is just as ‘random’ as the permutation 361524. That is, if they are fixed, then they are not random at all. Just because it is easier to decribe one of them verbally does not mean it is less random. For example, if I am trying to cheat at poker, then I might be able to if I knew the exact order of the cards in the pack before the dealer dealt. It wouldn’t matter what that order was. I would have to adjust my strategy based on the order, but it wouldn’t affect the fact that I had a massive advantage!

The shuffling algorithm to be discussed here is the top-to-random shuffle. Like all the best things in life, this does exactly what it says on the tin. At a given time, we remove the top card from the deck at present, and insert it at a randomly chosen point in the deck. This could be on the bottom, and it could also be back on the top. It feels like this possibility to remain constant can’t possibly help us, but later we will discuss why we need this.

In any case, it feels natural that if we keep applying this procedure, the arrangement of the deck should start to get more and more random, in the sense that knowing the original arrangement will tell us successively little about the current arrangement as time progresses. But we need to find a way to quantify this if we are to do any mathematics.

When we are talking about real numbers, it is fairly clear what it means if I say that the numbers 2, 1.1, 1.01, 1.001 and so on are getting closer and closer to 1. Indeed we can measure the distance along the number line between each term and 1, using the absolute difference. It is not so clear how to compute the distance between two probability distributions. Bearing in mind the fact that a distribution on the set of permutations of cards is defined to be a set of 52! probabilities that sum to 1, there will be a 52!-1 dimensional space (eg the plane is two-dimensional, the world is three-dimensional, *and so on* – whatever that means) where we have a nice distance formula already.

But this is not what we will choose to use. Rather we return to the cheating-at-poker analogy. Suppose I am playing some sort of game involving the pack of cards with my enemy. He or she thinks the deck is perfectly random, but I know the actual distribution. How big a profit can I make by exploiting this knowledge? This will be our measure of how far a distribution is from uniform. It turns out that this will coincide precisely with the formal definition of total variation distance, but that language belongs to a different level of rigour and is not relevant here.

What is relevant is an explanatory example. Suppose we start with the arrangement 12345678. We are now going to perform one iteration of the top-to-random shuffle. The outcome might, for example, be 23456178, if we insert the 1 between the 6 and the 7. Note there were 8 places for the card to go, so the probability of this particular outcome is 1/8. Now let’s see how I might use my knowledge of the distribution to my advantage. Suppose I suggest the bet that the bottom card is an 8. My enemy thinks the stack is uniformly randomly arranged, so the probability of this is 1/8. On the other hand, I know that the only way the 8 might disappear from the bottom is if I place the 1 under it, which happens with probability 1/8. So in fact, I know the probability of this event is 7/8, which gives me an advantage of 3/4. In fact, I could come up with bets that do even better than this, but they are less simple to describe verbally.

At what point do I lose this advantage? Well, we said that the probability that the 8 leaves the bottom of the stack is 1/8. And it will continue to be 1/8 on every turn where it is at the bottom. Recalling that the outcomes of successive shuffles are independent, note this is reminiscent of rolling a dice until a six comes up. The number of rolls required to get the six is an example of a geometric random variable. I don’t want to spoil S1 (or whichever module) by going into too much detail, but it turns out that if the probability of an event happening on a single go is p, then the average time we have to wait is 1/p. So 1/(1/8)=8 of course, and this is how long we typically have to wait before the bet I placed before becomes much less effective.

Now seems like a good time to stop talking about 8 cards and start talking about n cards. Obviously, in practice, we will want n to be 52. Anyway, by the same argument as before, it takes on average n iterations before the bottom card leaves the bottom. This is important, because after then, my bet that the bottom card is n is no longer so effective. However, I could equally place a bet that one of the bottom *two* cards is n.

So we consider how long it takes before n is no longer one of the bottom two cards. Well certainly we need to wait until it is no long *the* bottom card, which takes time n on average. Then, once it is second bottom, there is now a 2/n chance that we move the previously top card below it, so by the same argument as before, the time for this to happen is n/2 on average. If we want this effect to disappear, we have to wait until the original bottom card is in fact at the top of the pile for the first time, and by extending our previous argument, the average time for this is

$n+\frac{n}{2}+\frac{n}{3}+\ldots+\frac{n}{n-1}.$

Fortunately, we have tools for approximating this sort of sum, in particular integration, which is the practice of finding the area under certain curves. It turns out that the answer is roughly n log n. You can think of as log n a measure of the number of digits required to write out n. (This is not the exact definition but it will do for now. In any case, log n gets larger as n gets larger, but not very fast.) There’s a lot more about this in my previous post on the coupon collector problem, from a more technical point of view.

The next question will be to prove that it is actually quite well shuffled by this time, but that’s for another post. The other question to ask is whether this is satisfactory overall? For n=52, the number of operations we have to perform is about 230, which is fine for a computer, but deeply tedious for anyone sitting at a casino table waiting for the next hand. So next time we’ll talk about the riffle shuffle, which seems to introduce a lot of randomness in each go, but we’ll also see that we have to be careful, because the randomness may not be as great as our intuition suggests.

# Mixing Times 6 – Aldous-Broder Algorithm and Cover Times

In several previous posts, I’ve discussed the Uniform Spanning Tree. The definition is straightforward: we choose uniformly at random from the set of trees which span a fixed underlying graph. But for a dense underlying graph, there are a very large number spanning trees. Cayley’s formula says that the complete graph K_n has $n^{n-2}$ spanning trees, so to select from this list is impractical.

We seek a better algorithm. In a post about a year ago, I presented the result that the path between two fixed points x and y in the UST is distributed as the path generated by Loop-Erased Random Walk, for which we start at x and delete cycles as they appear. An initial problem might be that this only gives us a single path, which might be enough in some contexts, but in general we will want to specify the whole tree. Wilson’s Algorithm is an unsurprising but useful extension to this equivalence which does just that. You start by constructing the LERW between two vertices, then you add the LERW which connects some other vertex to the path you already have. Then you take a further vertex not currently explored and start LERW there, continuing until you hit the tree that you already have. Iterate this process, which must terminate after at most n steps when there are no vertices which to start from. The tree thus obtained is the UST. The tricky part is proving that the method for selecting which unused vertices to start from has no effect on the distribution of paths between two fixed points.

I want to consider a different algorithm, discovered roughly simultaneously by Aldous and Broder. Start a random walk on the underlying graph at some particular vertex. Every time we traverse an edge which takes us to a vertex we haven’t yet explored, add this edge to the tree. For now I don’t want to give a proof that this algorithm works, but rather to talk about how fast it works, because it ties in nicely with something from the Mixing Times book we’ve been reading recently. It is clear that the algorithm terminates at the first time the random walk has visited every vertex. This is a stopping time, called the cover time of the Markov chain. If we are working with an underlying complete, then we notice that this is annoying, because it means that the cover time will increase like n.log n. That is, it will take an increasingly long time to gather the final few vertices into the tree. Perhaps some combination of Aldous-Broder initially then Wilson’s method for the final o(n) vertices might be preferable?

I want to discuss how to treat this cover time. Often we have information about the hitting times of states from other states $\mathbb{E}_x T_y$. A relationship between S, the hitting time, defined to be the maximum of the previous display over x and y, and the expected cover time would be useful, especially for a highly symmetric graph like the complete graph where the expected hitting times are all the same.

Matthews’ Method relates these two for an irreducible finite Markov chain on n states. It says:

$t_{cov}\leq t_{hit}\left(1+\frac12+\ldots+\frac 1 n\right).$

We first remark that this agrees with what we should get for the random walk on the complete graph. There, the hitting time of x from y is a geometric random variable with success probability 1/n, hence expectation is n. The cover time is the standard coupon collector problem, giving expectation n log n, and the sum of reciprocals factor is asymptotically a good approximation.

The intuition is that if we continue until we hit state 1, then reset and continue until we hit state 2, and so on, by the time we hit state n after (n-1) iterations, this is a very poor overestimate of the cover time, because we are actually likely to have hit most states many times. What we want to do really is say that after we’ve hit state 1, we continue until we hit state 2, unless we’ve already done so, in which case we choose a different state to aim for, one which we haven’t already visited. But this becomes complicated because we then need to know the precise conditional probabilities of visiting any site on the way between two other states, which will depend rather strongly on the exact structure of the chain.

Peres et al give a coupling proof in Chapter 11 of their book which I think can be made a bit shorter, at least informally. The key step is that we still consider hitting the sites in order, only now in a random order.

That is, we choose a permutation $\sigma\in S_n$ uniformly at random, and we let $T_k$ be the first time that states $\sigma(1),\ldots,\sigma(k)$ have all been visited. This is a random time that is measurable in the product space, and for each $\sigma$ it is a stopping time.

The key observation is that $\mathbb{P}(T_{k+1}=T_k)=1-\frac{1}{k+1}$. This holds conditional on any path of the Markov chain because the requirement for the event is that $\sigma(k+1)$ is visited after $\{\sigma(1),\ldots,\sigma(k)\}$. The statement therefore holds as stated as well as just pathwise. Then, by the SMP, conditional on $\{T_{k+1}>T_k\}$, we have

$T_{k+1}-T_k \leq_{st} t_{hit}.$

Note that by the definition of $t_{hit}$, this bound on the hitting time $T_{k+1}$ is unaffected by concerns about where the chain actually is at $T_k$ (since it is not necessarily at $\sigma(k)$).

So, removing the conditioning, we have:

$\mathbb{E}\Big[T_{k+1}-T_k\Big]\leq\frac{1}{k+1}t_{hit},$

and so the telescoping sum gives us Matthews’ result.

One example is the cover time of random walk on the n x n torus, which turns out to be

$O(n^2(\log n)^2).$

If anyone remembers that Microsoft screensaver from many years ago which started with a black screen and a snake leaving a trail of white pixels as it negotiated the screen, this will be familiar. The last few black bits take a frustratingly long while to disappear. Obviously that isn’t quite a random walk, but it perhaps diminishes the surprise that it should take this long to find the cover time.

There are a couple of interesting things I wanted to say about electrical networks for Markov chains and analytic methods for mixing times, but the moment may have passed, so this is probably the last post about Mixing Times. Plans are in motion for a similar reading group next term, possible on Random Matrices.

# Mixing Times 5 – Cesaro Mixing

We have just finished discussing chapters 11 and 12 of Markov Chains and Mixing Times, the end of the ‘core material’. I thought that, rather than addressing some of the more interesting but technical spectral methods that have just arisen, it would be a good subject for a quick post to collate some of the information about Cesaro mixing, which is spread throughout this first section.

Idea

A main result in the introductory theory of Markov chains is that for an irreducible aperiodic chain X, the distribution of $X_t\rightarrow \pi$, the (unique) equilibrium distribution. The mixing time gives the rate at which this first mode of convergence takes place. We have freedom over the initial state, so we typically consider the ‘worst case scenario’, ie the slowest convergence. The most appropriate metric is given by the total variation distance, which is defined in previous posts. The most important point to note is that the mixing time should be thought of as the correct timescale for convergence, rather than some threshold. In particular, the time at which the chain is within 1/4 of the equilibrium distribution in the TV metric has the same order magnitude (in n, some parameter controlling the number of states) as the time at which it is within 1/20 of the equilibrium distribution.

But this isn’t the only result about convergence in distribution of functionals of Markov chains. Perhaps more intuitive is the ergodic theorem which asserts that the proportion of time spent in a particular state also converges to the equilibrium probability as time advances. We might write:

$\frac{1}{t}\sum_t \mathbf{1}(X_t=x)\rightarrow \pi(x),\quad \forall x\in \Omega.$

Note that if the state space $\Omega$ is finite, then we can also assume that this occurs uniformly in x. We can also think of the LHS of this convergence as a measure on $\Omega$ varying in time

$\frac{1}{t}\sum_t \mathbf{1}(X_t=\cdot),$

and the mixing time for this sequence of measures is defined as for the conventional mixing time, and is called the Cesaro mixing time, at least in the Levin / Peres / Wilmer text.

There are some obvious advantages to considering Cesaro mixing. Principally, a main drawback of conventional mixing is that we are unable to consider periodic chains. This property was the main content of the previous post, but to summarise, if a chain switches between various classes in a partition of the state space in a deterministic periodic way, then the distribution does not necessarily converge to equilibrium. The previous post discusses several ways of resolving this problem in specific cases. Note that this problem does not affect Cesaro mixing as the ergodic theorem continues to hold in the periodic case. Indeed the form of the distribution (which we might call an occupation measure in some contexts) confirms the intuition about viewing global mixing as a sort of sum over mixing modulo k in time.

Other advantages include the fact that the dependence on the initial state is weaker. For instance, consider the occupation measures for a chain started at x which moves first to y, versus a chain which starts at y then proceeds as the original. It requires very little thought to see that for O(1) values of t, this difference in occupation measures between these chains becomes small.

Another bonus is that we can use so-called stationary times to control Cesaro mixing. A stationary time is a stopping time such that $X_\tau\stackrel{d}{=}\pi$. It is clear that if we wait until $\tau$, then run the chain for a further $\alpha\tau$, the chain will have spent $\frac{\alpha}{1+\alpha}$ of its duration in the equilibrium distribution, and so using Markov’s inequality and bounding the total variation distance between occupation measure and $\pi$ by 1 up until the stationary time, we can get good bounds for the Cesaro mixing time in terms of $\mathbb{E}\tau$.

Why does this fail to work for normal mixing? The key to the above argument was that by taking an average over time up to some $T>>\tau$, the dependence on the actual value of $X_\tau$ was suppressed. Consider the deterministic walk on the cycle $\mathbb{Z}_n$, which advances by 1 modulo n on each go. Now sample independently a random variable Z distributed uniformly on $\mathbb{Z}_n$. By definition, the random hitting time $\tau_Z$ is a stationary time, but in fact the chain’s distribution does not converge. The condition we actually require for normal mixing is that $\tau$ be a strong stationary time, meaning that $X_\tau\stackrel{d}{=}\pi$, and the value of $X_\tau$ is independent of $\tau$. With this definition we can proceed with a similar result for normal mixing. An example of a strong stationary time would be for shuffling a pack of cards by repeatedly inserting the top card into a random place in the rest of the pile. Then consider the moment at which the original bottom card first reaches the top of the pile. It does not take too much to reassure oneself that after the next move, we have a strong stationary time, since every card has been randomised at least once, and the position of the other cards is independent of how long it took the original bottom card to rise to the top.

So why do we not consider Cesaro mixing rather than the conventional variety? Well, mainly because of how we actually use mixing times. The Metropolis algorithm gives a way to generate chains with a particular equilibrium distribution, including ones for which it is hard to sample directly. Mixing time theory then gives a quantitative answer to the question of how long it is necessary to run such a chain for before it gives a good estimate to the equilibrium distribution. In many cases, such a random walk on a large unknown network, the main aim when applying such Monte Carlo procedures is to minimise the difficulty of calculation. For Cesaro mixing, you have to store all the information about path states, while for conventional mixing you only care about your current location.

The other phenomenon that is lacking in Cesaro mixing is cutoff. This is where the total variation distance

$d_n(t)=||P^t(x,\cdot)-\pi||_{TV}$

converges to 0 suddenly. More formally, there is some timescale f(n) such that

$\lim_{n\rightarrow\infty}d_n(cf(n))=\begin{cases}1& c<1\& c>1,\end{cases}$

so in the n-limit, the graph of d looks like a step-function. Several of the shuffling chains exhibit this property, leading to the statements like “7 shuffles are required to mix a standard pack of cards”. Cesaro mixing smooths out this effect on an f(n) timescale.

A Further Example

Perhaps the best example where Cesaro mixing happens faster than normal mixing is in the case of a lazy biased random walk on $\mathbb{Z}_n$. (Ex. 4.10 in MTMC) Here, we stay put with probability 1/2, otherwise move clockwise with probability p>1/4 and counter-clockwise with probability 1/2 – p < 1/4. This chain is not reversible, as we can determine the direction (or arrow) of time by examining a path. Roughly speaking, the chain will drift clockwise at rate 2p – 1/2 > 0. In particular, at some time Kn, where K is large, we would expect to have completed ~ $\frac{K}{2p-\frac12}$ circuits of the vertices, and so the occupation measure will be close to the uniform equilibrium distribution if we choose K large enough.

On the other hand, the distribution of X at time Kn is still fairly concentrated. If we assume we are instead performing the random walk on $\mathbb{Z}$, the distribution after Kn i.i.d. increments is

$X_t\sim N(Kn(2p-\frac12), Kn\sigma^2).$

That is, the standard deviation is $O(\sqrt{n})<. So, even once we return to considering the random walk on the cyclic group, if we view it as a circle, we expect most of the probability mass to be concentrated near $Kn(2p-\frac12) \mod n$. By an identical heuristic argument, we see that the mixing time is achieved when the variance has order n, that is when time has order $n^2$.

# Mixing Times 4 – Avoiding Periodicity

A Markov chain is periodic if you can partition the state space such it is possible to be in a particular class only at certain, periodic times. Concretely, suppose we can find a decomposition into classes $\Omega=V_1\cup\ldots\cup V_k$ such that conditional on $X_t\in V_i$, we have $\mathbb{P}(X_{t+1}\in V_{i+1})=1$, where the indices of the Vs are taken modulo k. Such a chain is called periodic with period k. In most cases, we would want to define the period to be the maximal such k.

Why is periodicity a problem? It prevents convergence to equilibrium. The distribution at time t has some fairly strong dependence on the initial distribution. For example, if the initial distribution is entirely supported on $V_1$ as defined above, then the distribution at time t will be entirely supported on $V_i$, where $i\equiv t \mod k$. In particular, this cannot converge to some equilibrium.

Aperiodicity thus becomes a necessary condition in any theorem on convergence to equilibrium. Note that by construction this is only relevant for chains in discrete time. In an first account of Markov chains, most of the examples will either have a small state space, for which the transition matrix will have to contain lots of zeros before it stands a chance of being periodic, or obviously aperiodic birth-death or queue type processes. But some of the combinatorially motivated chains we consider for interesting mixing properties are more likely to be periodic. In particular, for a random walk on a group say, the generator measure may well be supported only on a small subset of the whole group, which is completely natural (eg transpositions as a subset of the symmetric group). Then it becomes more plausible that periodicity might arise because of some underlying regularity or symmetry in the group structure.

My first claim is that periodicity is not a disaster for convergence properties of Markov chains. Firstly, by the definition above, $P^k(x,y)$ for $x,y\in V_1$ is an irreducible (aperiodic if k is maximal) transition matrix on $V_1$, and so we have convergence to some equilibrium distribution on $V_1$ of $(X_{kt+a})$ or similar. An initial distribution mixed between classes gives a mix of such equilibria. Alternatively, we could think about large-time ergodic properties. By taking an average over all distributions up to some large t, the periodic problems get smoothed out. So, for mixing on a periodic chain, it might be possible to make headway with Cesaro mixing, which looks at the speed of convergence of the ergodic average distribution.

In most cases, though, we prefer to alter the chain directly to remove periodicity, or even any chance of periodicity. The preferred method in many contexts is to replace the transition matrix P with $\frac12 (P+I)$. This says that at every time t, we toss an independent fair coin, and with probability 1/2 make the transition suggested by P, and with probability 1/2 we stay where we are. Note that if a chain is irreducible, and some P(x,x)>0, then it is definitely aperiodic, as x cannot be in more than one class as per the definition of periodicity.

If you want to know about the mixing time of the original chain, note that this so-called lazy chain moves at half the speed of the original, so to get exact asymptotics (eg in the case of cutoff, that is mixing speed faster than the scale of the mixing time) you must multiply by 2. Also note that all of the eigenvalues of $\frac12 (P+I)$ are non-negative, and in fact, the eigenvalues are subject to a linear transform in the construction of the lazy transition matrix $\lambda\mapsto \frac12(1+\lambda)$.

Note that choosing 1/2 as the parameter is unnecessary. Firstly, it would suffice to take some $P(x,x)=\epsilon$ and rescale the rest of the row appropriately. Also, in some cases, a different constant gives a more natural interpretation of the underlying mechanism. For example, one model worth considering is the Random Transposition Random Walk on the symmetric group, where at time t we multiply (ie compose) with a transposition chosen uniformly at random. This model is interesting partly because the orbits of an element resemble, at least initially, the component size process of a Erdos-Renyi random graph, on the grounds that when the number of transpositions is small, they don’t interact too much, so can be viewed as independent edges. Anyway, some form of laziness is needed in RTRW, otherwise the chain will alternative between odd and even permutations. In this case, 1/2 is not the most natural choice. The most sensible way to sample random transpositions is to select the two elements of [n] to be transposed uniformly and independently at random. Thus each transposition is selected with probability $\frac{2}{n^2}$, while the identity, which corresponds to ‘lazily’ staying at the current state in the random walk, is selected with probability 1/n.

The lazy chain is also useful when the original chain has a lot of symmetry involved. In particular, if the original chain involves ‘switching’ say one coordinate. The best example is the random walk on the vertices of the n-hypercube, but there are others. Here, the most helpful way to visualise the configuration is to choose a coordinate uniformly at random and then flip its value (from 0 to 1 or 1 to 0). Now the lazy chain can be viewed similarly, but note that the dependence on the current value of the coordinate is suppressed. That is, having chosen the coordinate to be affected, we set it to be 0 with probability 1/2 and 1 with probability 1/2, irrespective of the prior value at that coordinate. Thus instead of viewing the action on coordinates to be ‘stay or switch’, we can view the action on the randomly chosen coordinate to be ‘randomly resample’, to use statistical terminology. This is ideal for coupling, because from the time coordinate j is first selected, the value at that coordinate is independent of the past, and in particular, the initial value or distribution. So we can couple arbitrary initial configurations or distributions, and we know that as soon as all coordinates have been selected (a time that can be described as a coupon collector problem), the chains are well coupled, that is, the values are the same.

Note that one way we definitely get periodicity is if the increment distribution for random walk on a group is supported entirely in a single coset of a normal subgroup. Why? Well if we take $H\lhd G$ to be the normal subgroup, and gH to be the relevant coset, then $P^t(g',\cdot)$ is supported entirely on $g^tg'H$, so is periodic with period equal to the order of gH in the quotient group G/H. Note that if the coset is the normal subgroup itself, then it might well include support on the identity, which immediately makes the chain aperiodic. However, there will be then be no transitions between cosets, so the chain is not irreducible on G.

The previous paragraph is the content of Remark 8.3 in the book we are reading. My final comment is that normality is precisely what is needed for this to hold. The key idea is that the set of subsets {gH, gHgH, gHgHgH, … } forms a partition of the group. This is certainly true if H is normal and gH generates G/H. If the latter statement is not true, then the set of subsets still forms a partition, but of some subset of G. The random walk is then neither irreducible nor aperiodic on the reduced state space. If H is not normal, then there are no such restrictions. For example, gHgH might be equal to the whole group G. Then the random walk is aperiodic, as this would imply we can move between any pair of states in two steps, and so by extension between any pair of states in three steps. (2,3)=1, hence the chain is aperiodic. As a concrete example, consider

$\tau=\langle (1 2)\rangle \leq S_3,$

the simplest example of a non-normal subgroup. Part of the problem is that cosets are different in the left-case and the right-case. Consider the left coset of $\tau$ given by $\sigma\tau=\{(1 2 3),(2 3)\}$. These elements have order three and two respectively, and so by a similar argument to the general one above, this random walk is aperiodic.

# Mixing Times 3 – Convex Functions on the Space of Measures

The meat of this course covers rate of convergence of the distribution of Markov chains. In particular, we want to be thinking about lots of distributions simultaneously, so we really to be comfortable working with the space of measures on a (for now) finite state space. This is not really too bad actually, since we can embed it in a finite-dimensional real vector space.

$\mathcal{M}_1(E)=\{(x_v:v\in\Omega),x_v\geq 0, \sum x_v=1\}\subset \mathbb{R}^\Omega.$

Since most operations we might want to apply to distributions are linear, it doesn’t make much sense to inherit the usual Euclidean metric. In the end, the metric we use is the same as the $L_1$ metric, but the motivation is worth exploring. Typically, the size of $|\Omega|$ will be function of n, a parameter which will tend to infinity. So we do not want to be too rooted in the actual set $\Omega$ for what will follow.

Perhaps the best justification for total variation distance is from a gambling viewpoint. Suppose your opinion for the distribution of some outcome is $\mu$, and a bookmaker has priced their odds according to their evaluation of the outcome as $\nu$. You want to make the most money, assuming that your opinion of the distribution is correct (which in your opinion, of course it is!). So assuming the bookmaker will accept an arbitrarily complicated (but finite obviously, since there are only $|\Omega|$ possible outcomes) bet, you want to place money on whichever event evinces the greatest disparity between your measure of likeliness and the bookmaker’s. If you can find an event which you think is very likely, and which the bookmaker thinks is unlikely, you are (again, according to your own opinion of the measure) on for a big profit. This difference is the total variation distance $||\mu-\nu||_{TV}$.

Formally, we define:

$||\mu-\nu||_{TV}:=\max_{A\subset\Omega}|\mu(A)-\nu(A)|.$

Note that if this maximum is achieved at A, it is also achieved at $A^c$, and so we might as well go with the original intuition of

$||\mu-\nu||_{TV}=\max_{A\subset\Omega} \left[\mu(A)-\nu(A)\right].$

If we decompose $\mu(A)=\sum_{x\in A}\mu(x)$, and similarly for $A^c$, then add the results, we obtain:

$||\mu-\nu||_{TV}=\frac12\sum_{x\in\Omega}|\mu(A)-\nu(A)|.$

There are plenty of other interesting interpretations of total variation distance, but I don’t want to get bogged down right now. We are interested in the rate of convergence of distributions of Markov chains. Given some initial distribution $\lambda$ of $X_0$, we are interested in $||\lambda P^t-\pi||_{TV}$. The problem is that doing everything in terms of some general $\lambda$ is really annoying, at the very least for notational reasons. So really we want to investigate

$d(t)=\max_{\lambda\in\mathcal{M}_1(E)}||\lambda P^t-\pi||_{TV},$

the worst-case scenario, where we choose the initial distribution that mixes the slowest, at least judging at time t. Now, here’s where the space of measures starts to come in useful. For now, we relax the requirement that measures must be probability distributions. In fact, we allow them to be negative as well. Then $\lambda P^t-\pi$ is some signed measure on $\Omega$ with zero total mass.

But although I haven’t yet been explicit about this, it is easy to see that $||\cdot||_{TV}$ is a norm on this space. In fact, it is (equivalent to – dividing by 1/2 makes no difference!) the product norm of the $L_1$ norm as defined before. Recall the norms are convex functions. This is an immediate consequence of the triangle inequality. The set of suitable distributions $\lambda$ is affine, because an affine combination of probability distributions is another probability distribution.

Then, we know from linear optimisation theory, that convex functions on an affine space achieve their maxima at boundary points. And the boundary points for this definition of $\lambda\in\mathcal{M}_1(E)$, are precisely the delta-measures at some point of the state space $\delta_v$. So in fact, we can replace our definition of d(t) by:

$d(t)=\max_{x\in\Omega}||P^t(x,\cdot)-\pi||_{TV},$

where $P^t(x,\cdot)$ is the same as $(\delta_x P^t)(\cdot)$. Furthermore, we can immediately apply this idea to get a second result for free. In some problems, particularly those with neat couplings across all initial distributions, it is easier to work with a larger class of transition probabilities, rather than the actual equilibrium distribution, so we define:

$\bar{d}(t):=\max_{x,y\in\Omega}||P^t(x,\cdot)-P^t(y,\cdot)||_{TV}.$

The triangle inequality gives $\bar{d}(t)\leq 2d(t)$ immediately. But we want to show $d(t)\leq \bar{d}(t)$, and we can do that as before, by considering

$\max_{\lambda,\mu\in\mathcal{M}_1(E)}||\lambda P^t-\mu P^t||_{TV}.$

The function we are maximising is a convex function on $\mathcal{M}_1(E)^2$, and so it attains its maximum at a boundary point, which must be $\lambda=\delta_x,\mu=\delta_y$. Hence $\bar{d}(t)$ is equal to the displayed expression above, which is certainly greater than or equal to the original formulation of d(t), as this is the maximum of the same expression over a strict subset.

I’m not suggesting this method is qualitatively different to that proposed by the authors of the book. However, I think this is very much the right way to be thinking about these matters of maximising norms over a space of measures. Partly this is good because it gives an easy ‘sanity check’ for any idea. But also because it gives some idea of whether it will or won’t be possible to extend the ideas to the case where the state space is infinite, which will be of interest much later.

# Mixing Times 2 – Metropolis Chains

In our second reading group meeting for Mixing Times of Markov Chains, we reviewed chapters 3 and 4 of the Levin, Peres and Wilmer book. This post and the next contains a couple of brief thoughts about the ideas I found most interesting in each chapter.

Before reading chapter 3, the only thing I really knew about Monte Carlo methods was the slogan. If you want to sample from a probability distribution that you can’t describe explicitly, find a Markov chain which has that distribution as an equilibrium distribution, then run it for long enough starting from wherever you fancy. Then the convergence theorem for finite Markov chains means that the state of the chain after a long time approximates well the distribution you were originally looking for.

On the one previous occasion I had stopped and thought about this, I had two questions which I never really got round to answering. Firstly, what sort of distributions might you not be able to simulate directly? Secondly, and perhaps more fundamentally, how would you go about finding a Markov chain for which a given distribution is in equilibrium?

In the end, the second question is the one answered by this particular chapter. The method is called a Metropolis chain, and the basic idea is that you take ANY Markov chain with appropriate state space, then fiddle with the transition probabilities slightly. The starting chain is called a base chain. It is completely possible to adjust the following algorithm for a general base chain, but for simplicity, let’s assume it is possible to take an irreducible chain for which the transition matrix is symmetric. By thinking about the DBEs, this shows that the uniform distribution is the (unique) equilibrium distribution. Suppose the  transition matrix is given by $\Psi(x,y)$, to copy notation from the book. Then set:

$P(x,y)=\begin{cases}\Psi(x,y)\left[1\wedge \frac{\pi(y)}{\pi(x)}\right]&y\neq x\\ 1-\sum_{z\neq x} \Psi(x,z)\left [1\wedge \frac{\pi(z)}{\pi(x)}\right]& y=x.\end{cases}$

Note that this second case (y=x) is of essentially no importance. It just confirms that the rows of P add to 1. It is easy to check from the DBEs that $\pi$ is the equilibrium distribution of matrix P. One way to think of this algorithm is that we run the normal chain, but occasionally suppress transitions is they involve a move from a state which is likely (under $\pi$), to one which is less likely. This is done in proportion to the ratio, so it is unsurprising perhaps that the limit in distribution is $\pi$.

Conveniently, this algorithm also gives us some ideas for how to answer the first question. Note that at no point do we need to know $\pi(x)$ for some state x. We only need to use $\frac{\pi(x)}{\pi(y)}$ the ratios of probabilities. So this is perfect for distributions where there is a normalising constant which is computationally taxing to evaluate. For example, in the Ising model and similar statistical physics objects, probabilities are viewed more as weightings. There is a normalising constant, often called the partition function Z in this context, lying in the background, but especially the underlying geometry is quite exotic we definitely don’t want to have to worry about actually calculating Z. Thus we have a way to generate samples from such models. The other classic example is a random walk on a large, perhaps unknown graph. Then the equilibrium distribution at a vertex is inversely proportional to the degree of that vertex, but again you might not know about this information over the entire graph. It is reasonable to think of a situation where you might be able to take a random walk on a graph, say the connectivity graph of the internet, without knowing about all the edges at any one time. So, even though you potentially explore everywhere, you only need to know a small amount at any one time.

Of course, the drawback of both of these examples is that a lack of knowledge about the overall system means that it is hard in general to know how many steps the Metropolis chain must run before we can be sure that we are the equilibrium distribution it has been constructed to approach. So, while these chains are an excellent example to have in mind while thinking about mixing times, they are also a good motivation for the subject itself. General rules about speed of convergence to equilibrium are precisely what are required to make such implementation concrete.

# Mixing Times 1 – Reversing Markov Chains

A small group of us have started meeting to discuss Levin, Peres and Wilmer’s book on Markov Chains and Mixing Times. (The plan is to cover a couple of chapters every week, then discuss points of interest and some of the exercises – if anyone is reading this and fancies joining, let me know!) Anyway, this post is motivated by something we discussed in our first session.

Here are two interesting facts about Markov Chains. 1) The Markov property can be defined in terms of products of transition probabilities giving the probability of a particular initial sequence. However, a more elegant and general formulation is to say that, conditional on the present, the past and the future are independent. 2) All transition matrices have at least one equilibrium distribution. In fact, irreducible Markov Chains have precisely one equilibrium distribution. Then, if we start with any distribution, the distribution of the chain at time t converges to the equilibrium distribution.

But hang on. This might be a fairly serious problem. On the one hand we have given a definition of the Markov property that is symmetric in time, in the sense that it remains true whether we are working forwards or backwards. While, on the other hand, the convergence to equilibrium is very much not time-symmetric: we move from disorder to order as time advances. What has gone wrong here?

We examine each of the properties in turn, then consider how to make them fit together in a non-contradictory way.

Markov Property

As many of the students in the Applied Probability course learned the hard way, there are many ways to define the Markov property depending on context, and some are much easier to work with than others. For a Markov chain, you can find a way to say that the transition probability $\mathbb{P}(X_{n+1}=x_{n+1}\,|\,X_n=x_n,\ldots,X_0=x_0)$ is independent of $x_0,\ldots,x_{n-1}$. Alternatively, you can use this to give an inductive specification for the probability of the first n values of X being some sequence.

It requires a moment’s checking to see that the earlier definition of past/future independence is consistent with this. Let’s first check that we haven’t messed up a definition somewhere, and that the time-reversal of a general Markov chain does have the Markov property, even as defined in the context of a Markov chain.

For clarity, consider $X_0,X_1,\ldots, X_N$ a Markov chain on some finite state space, with N some fixed finite end time. We aren’t losing anything by reversing over a finite time interval – after all, we need to know how to do it over a finite time interval before it could possibly make sense to do it over $(-\infty,\infty)$. We examine $(Y_n)_{n=0}^N$ defined by $Y_n:= X_{N-n}$.

$\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1},\ldots,X_N=x_N)=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1})$

is the statement of the Markov property for $(Y_n)$. We rearrange the left hand side to obtain:

$=\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1},\ldots,X_N=x_N)}{\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_N=x_N)}$

$=\frac{\mathbb{P}(X_N=x_N|X_n=x_n,\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_n=x_n,\ldots,X_{N-1}=x_{N-1})}{\mathbb{P}(X_N=x_N|X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})}.$

Now, by the standard Markov property on the original chain $(X_n)$, the first probability in each of the numerator and denominator are equal. This leaves us with exactly the same form of expression as before, but with one fewer term in the probability. So we can iterate until we end up with

$\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1})}{\mathbb{P}(X_{n+1}=x_{n+1})}=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1}),$

as required.

So there’s nothing wrong with the definition. The reversed chain Y genuinely does have this property, regardless of the initial distribution of X.

In particular, if our original Markov chain starts at a particular state with probability 1, and we run it up to time N, then saying that the time-reversal is a Markov chain too is making a claim that we have a non-trivial chain that converges from some general distribution at time 0 to a distribution concentrated at a single point by time N. This seems to contradict everything we know about these chains.

Convergence to Equilibrium – Markov Property vs Markov Chains

It took us a while to come up with a reasonable explanation for this apparent discrepancy. In the end, we come to the conclusion that Markov chains are a strict subset of stochastic processes with the Markov property.

The key thing to notice is that a Markov chain has even more regularity than the definition above implies. The usual description via a transition matrix says that the probability of moving to state y at time t+1 given that you are at state x at time t is some function of x and y. The Markov property says that this probability is independent of the behaviour up until time t. But we also have that the probability is independent of t. The transition matrix P has no dependence on time t – for example in a random walk we do not have to specify the time to know what happens next. This is the property that fails for the non-stationary time-reversal.

In the most extreme example, we say $X_0=x_0$ with probability 1. So in the time reversal, $\mathbb{P}(Y_N=x_0|Y_{N-1}=y_{N-1})=1$ for all $y_{N-1}$. But it will obviously not be the case in general that $\mathbb{P}(Y_n=x_0|Y_{n-1}=y_{n-1})=1$ for all $y_{n-1}$, as this would mean the chain Y would be absorbed after one step at state $x_0$, which is obviously not how the reversal of X should behave.

Perhaps the best way to reconcile this difference is to consider this example where you definitely start from $x_0$. Then, a Markov chain in general can be thought of as a measure on paths, that is $\Omega^N$, with non-trivial but regular correlations between adjacent components. (In the case of stationarity, all the marginals are equal to the stationary distribution – a good example of i.d. but not independent RVs.) This is indexed by the transition matrix and the initial distribution. If the initial distribution is a single point mass, then this can be viewed as a restriction to a smaller set of possible paths, with measures rescaled appropriately.

What have we learned?

Well, mainly to be careful about assuming extra structure with the Markov property. Markov Chains are nice because there is a transition matrix which is constant in time. Other processes, such as Brownian motion are space-homogeneous, where the transitions, or increments in this context, are independent of time and space. However, neither of these properties are true for a general process with the Markov property. Indeed, we have seen in a post from a long time ago that there are Markov processes which do not have the Strong Markov Property, which seems unthinkable if we limit our attention to chain-like processes.

Most importantly, we have clarified the essential point that reversing a Markov Chain only makes sense in equilibrium. It is perfectly possibly to define the reversal of a chain not started at a stationary distribution, but lots of unwelcome information from the forward chain ends up in the reversed chain. In particular, the theory of Markov Chains is not broken, which is good.