Coupling from the Past

In a long series of previous posts I have talked about mixing times for Markov chains. We consider how long it takes for the distribution of a particular Markov chain to approach equilibrium. We are particularly interested in the asymptotics when some parameter of the model grows, such as the size of the state space, grows to infinity.

But why are we interested in the underlying problem? The idea of Markov Chain Monte Carlo methods is to sample from an intractable distribution by instead sampling from a Markov chain which approximates the distribution well at large times. A distribution might be intractable because it is computationally demanding to work out the normalising constant, or it might be distributed uniformly on a complicated combinatorial set. If, however, the distribution is the equilibrium distribution of some Markov chain, then we know how to at least sample from a distribution which is close to the one we want. But we need to know how long to run the process. We will typically tolerate some small error in approximating the distribution (whether we measure this in terms of total variation distance or some other metric doesn’t really matter at this heuristic level), but we need to know how it scale. If we double the size of the system, do we need to double the number of iterations of the chain, or square it. This is really important if we are going to use this for large real-world models with finite computing power!

Sometimes though, an approximation is not enough. If we want an exact sample from the equilibrium distribution, Markov chains typically will not help us as it is only in very artificial examples that the distribution after some finite time is actually the equilibrium distribution. One thing that we might use is a stationary time, which is a stopping time T, for which X_T\stackrel{d}{=}\pi. Note that there is one trivial way to do this. We can sample Y from distribution \pi before starting the process, then stop X at the first time T for which X_T=Y. But this is no help really, as we need to have Y in the first place!

So we are really interested in less trivial stationary times. Perhaps the best example is the top-to-random shuffle. Here we are given a pack of labelled cards, WLOG initially in descending order at each step we move the top card in the pile to a randomly-chosen location in the pile (which includes back onto the top). Then it turns out that the first time we move the card originally at the bottom from the top to somewhere is a strong stationary time. This is fairly natural, as by this time, every card has been involved in at least one randomising event.

Anyway, so this gives a somewhat artificial way to sample from the uniform distribution on a pack of cards. This strong stationary time is almost surely finite, with distribution given by the coupon collector problem, for which the expectation grows as n\log n, where n is the number of cards.

The problem with this method is that it is not easy in general to come up with a non-contrived stationary time such as this one. The idea of coupling from the past, discussed by some previous authors but introduced in this context by Propp and Wilson in the mid ’90s, is another method to achieve perfect sampling from the equilibrium distribution of a Markov chain. The idea here is to work backwards rather than forwards. The rest of this post, which discusses this idea, is based on the talk given at the Junior Probability Seminar by Irene, and on the chapter in the Levin, Peres, Wilmer book.

The key to the construction is a coupling of the transitions of a Markov chain. In the setting of a simple random walk, we have by construction a coupling of the transitions. It doesn’t matter which state we are at: we toss a coin to decide whether to move up or down, and we can do this without reference to our current position. Levin, Peres and WIlmer call this a random mapping representation in general, and it is yet another concept that is less scary than its definition might suggest.

Given a transition matrix P on state space S, such a representation is a function

\phi: S\times[0,1]\rightarrow S,\text{ s.t. }\mathbb{P}(\phi(i,U)=j)=p_{ij},

where U is a U(0,1) random variable independent of choice of i. In particular, once we have the random value of u, we can consider \phi(i,u) as i varies, to obtain a random map S\rightarrow S. Crucially, this map is not necessarily a bijection.

Note first that there are many possibilities for constructing the representation \phi. For some chains, and some representations, in particular random walks on vertex-transitive graphs (such as SRW – only for now we are restricting attention to finite state spaces) it is possible to choose \phi so that it always gives a bijection, but it is also always possible to choose it so that there is some probability it doesn’t give a bijection.

Let U_1,U_2,\ldots be an IID sequence of U[0,1] random variables, and write \phi_i for the random map induced by U_i. Then consider the sequence of iterated maps:

\phi_1, \phi_1\circ \phi_2, \ldots, \phi_1\circ\ldots\circ\phi_n,

and let T be the (random) smallest time such that the image of \phi_1\circ\ldots\circ \phi_T is a single state. Ie, as we go backwards in time through the maps \phi_i, we are gradually losing various states, corresponding to the maps not being bijections. Since the state space is finite, and the probability of not being a bijection is positive, it can be shown that T is almost surely finite. The claim then is that

Y=\text{Im}(\phi_1\circ\ldots\circ \phi_T)

is distributed as the equilibrium distribution of the chain. We finish by proving this.

Proof: Since the algorithm terminates after finite time almost surely, given any \epsilon>0, we can choose N such that the probability the algorithm stops in at most N steps is greater than 1-\epsilon.

Now run the Markov chain from time -N, started in the equilibrium distribution, with the transition from time -t to -(t-1) given by the random mapping driven by U_t. Thus at time 0, the distribution of the chain is still the equilibrium distribution. But if we condition on the event that T\le N, then X_0=\phi_1\circ \ldots \circ\phi_n(X_{-N})=Y regardless of the initial value. So \mathbb{P}(X_0\ne Y)<\epsilon, and hence the result follows, since \epsilon>0 was arbitrary.

What makes this easier than strong stationary times is that we don’t have to be clever to come up with the stopping time. It is however still important to know how long on average it takes to run the algorithm. At the end of her talk, Irene showed how to adapt this algorithm to deal with Probabilistic Cellular Automata. Roughly speaking, these are a sequence of infinite strings of 0s and 1s. The value of some element is determined randomly as a function of the values in the row underneath, say the element directly underneath and the two either side. In that setting, if you start with a finite subsequence and couple from the past by looking down to lower rows, each time you drop down a row you consider one further element, so in fact the coupling from the past algorithm has to eliminate possibilities fast enough to make up for this, if we want to terminate almost surely in finite time.

Here’s a link to the paper which discusses this in fuller detail.

Enhanced by Zemanta
Advertisement

Convergence of Transition Probabilities

As you can see, I haven’t got round to writing a post for a while. Some of my reasons for this have been good, and some have not. One reason has been that I’ve had to give a large number of tutorials for the fourth quarter of the second year probability course here in Oxford. The second half of this course concerns discrete-time Markov chains, and the fourth problem sheet discusses various modes of convergence for such models, as well as a brief tangent onto Poisson Processes. I’ve written more about Poisson Processes than perhaps was justifiable in the past, so I thought I’d say some words about convergence of transition probabilities in discrete-time Markov chains.

Just to be concrete, let’s assume the state space K is finite, and labelled {1,2,…,k}, so that it becomes meaningful to discuss

p_{12}^{(n)}:=\mathbb{P}(X_n=2|X_0=1).

That is, the probability that if we start at state 1, then after n ‘moves’ we are at state 2. We are interested in the circumstances under which this converges to the stationary distribution. The heuristic is that we can view a time-step of a Markov chain as an operation on the space of distributions on K. Note that this operation is deterministic. If this sounds complicated, what we mean is that we specify an initial distribution, that is the distribution of X_0. If we consider the distribution of X_1, this is given by \lambda P, where \lambda is the initial distribution, and P the transition matrix.

Anyway, the heuristic is that the stationary distribution is the unique fixed point of this operation on the space of distributions. It is therefore not unreasonable to assume that unless there are some periodic effects, we expect repeated use of this operation to move us closer to this fixed point.

We can further clarify this by considering the matrix form. Note that a transition P always has an eigenvalue equal to 1. This is equivalent to say that there is a solution to \pi P=\pi. Note it is not immediately equivalent to saying that P has a stationary distribution, as the latter must be non-negative and have elements summing to one. Only the first property is difficult, and relies on some theory or cleverness to prove. It can also be shown that all eigenvalues satisfy |\lambda|\le 1, and in general, there will be a single eigenvalue (ie dimension 1 eigenspace) with |\lambda|=1, and the rest satisfies |\lambda|<1. Then, if we diagonalise P, it is clear why \pi P^n converges entry-wise, as \pi UP^n U^{-1} converges. In the latter, only the entries in the row corresponding to \lambda=1 converge to something non-zero.

In summary, there is a strong heuristic for why in general, the transition probabilities should converge, and if they converge, that they should converge to the stationary distribution. In fact, we can prove that for any finite Markov chain, p_{ij}^{(n)}\rightarrow \pi_j, provided we two conditions hold. The conditions are that the chain is irreducible and aperiodic.

In the rest of this post, I want to discuss what might go wrong when these conditions are not satisfied. We begin with irreducibility. A chain is irreducible if it has precisely one communicating class. That means that we can get from any state to any other state, not necessarily in one step, with positive probability. One obvious reason why the statement of the theorem cannot hold in this setting is that \pi is not uniquely defined when the chain is not irreducible. Suppose, for example, that we have two closed communicating classes A and B. Then, supported on each of them is an invariant distribution \pi^A and \pi^B, so any affine combination of the two \lambda \pi^A+(1-\lambda) \pi^B will give a stationary distribution for the whole chain.

In fact, the solution to this problem is not too demanding. If we are considering p_{ij}^{(n)} for i\in A a closed communicating class, then we know that p_{ij}^{(n)}=0 whenever j\not\in A. For the remaining j, we can use the theorem in its original form on the Markov chain, with state space reduced to A. Here, it is now irreducible.

The only case left to address is if i is in an open communicating class. In that case, it suffices to work out the hitting probabilities starting from i of each of the closed communicating classes. Provided these classes themselves satisfy the requirements of the theorem, we can write

p_{ij}^{(n)}\rightarrow h_i^A \pi^A_j,\quad i\not\in A, j\in A.

To prove this, we need to show that as the number of steps grows to infinity, the probability that we are in closed class A converges to h_i^A. Then, we decompose this large number of steps so to say that not only have we entered A with roughly the given probability, but in fact with roughly the given probability we entered A a long time in the past, and so there has been enough time for the original convergence result to hold in A.

Now we turn to periodicity. If a chain has period k, this says that we can split the state space into k classes A_1,\ldots,A_k, such that p_{ij}^{(n)}=0 whenever n\not\equiv j-i \mod k. Equivalently, the directed graph describing the possible transitions of the chain is k-partite. This definition makes it immediately clear that p_{ij}^{(n)} cannot converge in this case. However, it is possible that p_{ij}^{(kn)} will converge. Indeed, to verify this, we would need to consider the Markov chain with transition matrix P^k. Note that this is no longer irreducible, as it there are no transitions allowed between classes A_1,\ldots,A_k. Indeed, a more formal definition of the period, in terms of the lcd of possible return times allows us to conclude that there is no finer reducibility structure. That is, A_1,\ldots,A_k genuinely are the closed classes when we consider the chain with matrix P^k. And so the Markov chain with transition matrix P^k restricted to any of the A_is satisfies the conditions of the theorem.

There remains one case which I’ve casually brushed over. When we were discussing the irreducible case, I said that if we had at least one communicating classes, then we could work out the limiting transition probabilities from a state in an open class to a state in a closed class by calculating the hitting probability of that closed class, then applying the standard version of the theorem to that closed class. This relies on the closed class being aperiodic.

Suppose otherwise that the destination closed class A has period k as before. If it were to be the case that the number of steps required to arrive at A had some fixed value mod k, or modulo a non-trivial divisor of k, then we certainly wouldn’t have convergence, for the same reasons as in the globally periodic case. However, we should ask whether we can ever have convergence?

In fact, the answer is yes. For concreteness, and because it’s easier to write ‘odd’ and ‘even’ than m \mod k, let’s assume A has size 2 and period 2. That is, once we arrive in A, thereafter we alternate deterministically between the two states. Anyway, for some large time n, we can write p_{ca}^{(n)} for a\in A, c\not\in A as:

p_{ca}^{(n)}=h_i^A(n),

where the latter term is the probability that we arrive in A at a time-step which has the same parity as n. It’s not terribly hard to come up with an example where this holds, and this idea holds in greater generality, where A has period k (and not necessarily just k states), we have to demand that the probability of arriving at a time which is a mod k is equal for all a in [0,k-1].

Of course, for applications, we don’t normally care much about irreducible chains, and we can easily remove periodicity by introducing so-called laziness, whereby on each time-step we flip a coin (biased if necessary) and stay put if it comes up heads, and apply the transition matrix if it comes up tails. Then it’s possible to get from any state to itself in one step, and so we are by construction aperiodic.

Enhanced by Zemanta

Diameters of Trees and Cycle Deletion

In the past two posts, we introduced two models of random trees. The Uniform Spanning Tree chooses uniformly at random from the set of spanning trees for a given underlying graph. The Minimum Spanning Tree assigns IID weights to each edge in the underlying graph, then chooses the spanning tree with minimum induced total weight. We are interested to know whether these are in fact the same distribution, and if they are not, what properties can be used to distinguish them asymptotically.

While investigating my current research problem, I was interested in the diameter of large random trees under various models. Specifically, I am considering what happens if you take a standard Erdos-Renyi process on n vertices, where edges appear at constant rate between pairs of vertices chosen uniformly at random, and add an extra mechanism to prevent the components becoming too large. For this particular model, our mechanism consists of removing any cycles as they are formed. Thus all the components remain trees as time advances, so it is not unreasonable to think that there might be some sort of equilibrium distribution.

Now, by definition, any tree formed by the Erdos-Renyi process is a uniform tree. Why? Well, the probability of a configuration is determined entirely by the number of edges present, so once we condition that a particular set of vertices are the support of a tree, all possible tree structures are equally likely. Note that this relies on sampling at a single fixed time. If we know the full history of the process, then it is no longer uniform. For example, define a k-star to be a tree on k vertices where one ‘centre’ vertex has degree k-1. The probability that a uniform tree on k vertices is a k-star is \frac{k}{k^{k-2}}=k^{-(k-3)}. But a star can only be formed by successively adding single vertices to an existing star. That is, we cannot join a 3-tree and a 4-tree with a edge to get a 7-star. So it is certainly not immediately clear that once we’ve incorporated the cycle deletion mechanism, the resulting trees will be uniform once we condition on their size.

In fact, the process of component sizes is not itself Markovian. For a concrete example, observe first that there is, up to isomorphism, only one tree on any of {0,1,2,3} vertices, so the first possible counterexample will be splitting up a tree on four vertices. Note that cycle deletion always removes at least three edges (ie a triangle), so the two possibilities for breaking a 4-tree are:

(4) -> (2,1,1) and (4) -> (1,1,1,1)

I claim that the probabilities of each of these are different in the two cases: a) (4) is formed from (2,2) and b) (4) is formed from (3,1). This is precisely a counterexample to the Markov property.

In the case where (4) is formed from (2,2), the 4-tree is certainly a path of length 4. Therefore, with probability 1/3, the next edge added creates a 4-cycle, which is deleted to leave components (1,1,1,1). In the case where (4) is formed from (3,1), then with probability 2/3 it is a path of length 4 and with probability 1/3 it is a 4-star (a ‘T’ shape). In this second case, no edge can be added to make a 4-cycle, so after cycle deletion the only possibility is (2,1,1). Thus the probability of getting (1,1,1,1) is 2/9 in this case, confirming that the process is non-Markovian. However, we might remark that we are unlikely to have O(n) vertices involved in fragmentations until at least the formation of the giant component in the underlying E-R process, so it is possible that the cycle deletion process is ‘almost Markov’ for any property we might actually be interested in.

When we delete a cycle, how many vertices do we lose? Well, for a large tree on n vertices, the edge added which creates the cycle is chosen uniformly at random from the pairs of vertices which are not currently joined by an edge. Assuming that n is w(1), that is we are thinking about a limit of fairly large trees, then the number of edges present is much smaller than the number of possible edges. So we might as well assume we are choosing uniformly from the possible edges, rather than just the possible edges which aren’t already present.

If we choose to add an edge between vertices x and y in the tree, then a cycle is formed and immediately deleted. So the number of edges lost is precisely the length of the path between x and y in the original tree. We are interested to know the asymptotics for this length when x and y are chosen at random. The largest path in a graph is called the diameter, and in practice if we are just interested in orders of magnitude, we might as well assume diameter and expected path length are the same.

So we want to know the asymptotic diameter of a UST on n vertices for large n. This is generally taken to be n^{1/2}. Here’s a quick but very informal argument that did genuinely originate on the back of a napkin. I’m using the LERW definition. Let’s start at vertex x and perform LERW, and record how long the resultant path is as time t advances. This is a Markov chain: call the path length at time t X_t.

Then if X_t=k, with probability 1-\frac k n we get X_{t+1}=k+1, and for each j in {0,…,k-1}, with probability 1/n we have X_{t+1}=j, as this corresponds to hitting a vertex we have already visited. So

\mathbb{E}\Big[X_{t+1}|X_t=k\Big]=\frac{nk-k^2/2}{n}.

Note that this drift is positive for k<< \sqrt n and negative for k>>\sqrt n, so we would expect n^{-1/2} to be the correct scaling if we wanted to find an equilibrium distribution. And the expected hitting time of vertex y is n, by a geometric distribution argument, so in fact we would expect this Markov chain to be well into the equilibrium window with the n^{-1/2} scaling by the time this occurs. As a result, we expect the length of the x to y path to have magnitude n^{1/2}, and assume that the diameter is similar.

So this will be helpful for calculations in the cycle deletion model, provided that the trees look like uniform trees. But does that even matter? Do all sensible models of random trees have diameter going like n^{1/2}? Well, a recent paper of Addario-Berry, Broutin and Reed shows that this is not the case for the minimum spanning tree. They demonstrate that the diameter in this case is n^{1/3}. I found this initially surprising, so tried a small example to see if that shed any light on the situation.

The underlying claim is that MSTs are more likely to be ‘star-like’ than USTs, a term I am not going to define. Let’s consider n=4. Specifically, consider the 4-star with centre labelled as 1. There are six possible edges in K_4 and we want to see how many of the 6! weight allocations lead to this star. If the three edges into vertex 1 have weights 1, 2 and 3 then we certainly get the star, but we can also get this star if the edges have weights 1, 2 and 4, and the edge with weight 3 lies between the edges with weights 1 and 2. So the total number of possibilities for this is 3! x 3! + 3! x 2! = 48. Whereas to get a 4-path, you can assign weights 1, 2 and 3 to the edges of the path, or weights 1, 2 and 4 provided the 4 is not in the middle, and then you have the 3 joining up the triangle formed by 1 and 2. So the number of possibilities for this is 3! x 3! + 4 x 2! = 44.

To summarise in a highly informal way, in a star-like tree, you can ‘hide’ some fairly low-scoring weights on edges that aren’t in the tree, so long as they join up very low-scoring edges that are in the tree. Obviously, this is a long way from getting any formal results on asymptotics, but it does at least show that we need to be careful about diameters if we don’t know exactly what mechanism is generating the tree!

Mixing Times 5 – Cesaro Mixing

We have just finished discussing chapters 11 and 12 of Markov Chains and Mixing Times, the end of the ‘core material’. I thought that, rather than addressing some of the more interesting but technical spectral methods that have just arisen, it would be a good subject for a quick post to collate some of the information about Cesaro mixing, which is spread throughout this first section.

Idea

A main result in the introductory theory of Markov chains is that for an irreducible aperiodic chain X, the distribution of X_t\rightarrow \pi, the (unique) equilibrium distribution. The mixing time gives the rate at which this first mode of convergence takes place. We have freedom over the initial state, so we typically consider the ‘worst case scenario’, ie the slowest convergence. The most appropriate metric is given by the total variation distance, which is defined in previous posts. The most important point to note is that the mixing time should be thought of as the correct timescale for convergence, rather than some threshold. In particular, the time at which the chain is within 1/4 of the equilibrium distribution in the TV metric has the same order magnitude (in n, some parameter controlling the number of states) as the time at which it is within 1/20 of the equilibrium distribution.

But this isn’t the only result about convergence in distribution of functionals of Markov chains. Perhaps more intuitive is the ergodic theorem which asserts that the proportion of time spent in a particular state also converges to the equilibrium probability as time advances. We might write:

\frac{1}{t}\sum_t \mathbf{1}(X_t=x)\rightarrow \pi(x),\quad \forall x\in \Omega.

Note that if the state space \Omega is finite, then we can also assume that this occurs uniformly in x. We can also think of the LHS of this convergence as a measure on \Omega varying in time

\frac{1}{t}\sum_t \mathbf{1}(X_t=\cdot),

and the mixing time for this sequence of measures is defined as for the conventional mixing time, and is called the Cesaro mixing time, at least in the Levin / Peres / Wilmer text.

Advantages

There are some obvious advantages to considering Cesaro mixing. Principally, a main drawback of conventional mixing is that we are unable to consider periodic chains. This property was the main content of the previous post, but to summarise, if a chain switches between various classes in a partition of the state space in a deterministic periodic way, then the distribution does not necessarily converge to equilibrium. The previous post discusses several ways of resolving this problem in specific cases. Note that this problem does not affect Cesaro mixing as the ergodic theorem continues to hold in the periodic case. Indeed the form of the distribution (which we might call an occupation measure in some contexts) confirms the intuition about viewing global mixing as a sort of sum over mixing modulo k in time.

Other advantages include the fact that the dependence on the initial state is weaker. For instance, consider the occupation measures for a chain started at x which moves first to y, versus a chain which starts at y then proceeds as the original. It requires very little thought to see that for O(1) values of t, this difference in occupation measures between these chains becomes small.

Another bonus is that we can use so-called stationary times to control Cesaro mixing. A stationary time is a stopping time such that X_\tau\stackrel{d}{=}\pi. It is clear that if we wait until \tau, then run the chain for a further \alpha\tau, the chain will have spent \frac{\alpha}{1+\alpha} of its duration in the equilibrium distribution, and so using Markov’s inequality and bounding the total variation distance between occupation measure and \pi by 1 up until the stationary time, we can get good bounds for the Cesaro mixing time in terms of \mathbb{E}\tau.

Why does this fail to work for normal mixing? The key to the above argument was that by taking an average over time up to some T>>\tau, the dependence on the actual value of X_\tau was suppressed. Consider the deterministic walk on the cycle \mathbb{Z}_n, which advances by 1 modulo n on each go. Now sample independently a random variable Z distributed uniformly on \mathbb{Z}_n. By definition, the random hitting time \tau_Z is a stationary time, but in fact the chain’s distribution does not converge. The condition we actually require for normal mixing is that \tau be a strong stationary time, meaning that X_\tau\stackrel{d}{=}\pi, and the value of X_\tau is independent of \tau. With this definition we can proceed with a similar result for normal mixing. An example of a strong stationary time would be for shuffling a pack of cards by repeatedly inserting the top card into a random place in the rest of the pile. Then consider the moment at which the original bottom card first reaches the top of the pile. It does not take too much to reassure oneself that after the next move, we have a strong stationary time, since every card has been randomised at least once, and the position of the other cards is independent of how long it took the original bottom card to rise to the top.

Disadvantages

So why do we not consider Cesaro mixing rather than the conventional variety? Well, mainly because of how we actually use mixing times. The Metropolis algorithm gives a way to generate chains with a particular equilibrium distribution, including ones for which it is hard to sample directly. Mixing time theory then gives a quantitative answer to the question of how long it is necessary to run such a chain for before it gives a good estimate to the equilibrium distribution. In many cases, such a random walk on a large unknown network, the main aim when applying such Monte Carlo procedures is to minimise the difficulty of calculation. For Cesaro mixing, you have to store all the information about path states, while for conventional mixing you only care about your current location.

The other phenomenon that is lacking in Cesaro mixing is cutoff. This is where the total variation distance

d_n(t)=||P^t(x,\cdot)-\pi||_{TV}

converges to 0 suddenly. More formally, there is some timescale f(n) such that

\lim_{n\rightarrow\infty}d_n(cf(n))=\begin{cases}1& c<1\& c>1,\end{cases}

so in the n-limit, the graph of d looks like a step-function. Several of the shuffling chains exhibit this property, leading to the statements like “7 shuffles are required to mix a standard pack of cards”. Cesaro mixing smooths out this effect on an f(n) timescale.

A Further Example

Perhaps the best example where Cesaro mixing happens faster than normal mixing is in the case of a lazy biased random walk on \mathbb{Z}_n. (Ex. 4.10 in MTMC) Here, we stay put with probability 1/2, otherwise move clockwise with probability p>1/4 and counter-clockwise with probability 1/2 – p < 1/4. This chain is not reversible, as we can determine the direction (or arrow) of time by examining a path. Roughly speaking, the chain will drift clockwise at rate 2p – 1/2 > 0. In particular, at some time Kn, where K is large, we would expect to have completed ~ \frac{K}{2p-\frac12} circuits of the vertices, and so the occupation measure will be close to the uniform equilibrium distribution if we choose K large enough.

On the other hand, the distribution of X at time Kn is still fairly concentrated. If we assume we are instead performing the random walk on \mathbb{Z}, the distribution after Kn i.i.d. increments is

X_t\sim N(Kn(2p-\frac12), Kn\sigma^2).

That is, the standard deviation is O(\sqrt{n})<<n. So, even once we return to considering the random walk on the cyclic group, if we view it as a circle, we expect most of the probability mass to be concentrated near Kn(2p-\frac12) \mod n. By an identical heuristic argument, we see that the mixing time is achieved when the variance has order n, that is when time has order n^2.

Mixing Times 2 – Metropolis Chains

In our second reading group meeting for Mixing Times of Markov Chains, we reviewed chapters 3 and 4 of the Levin, Peres and Wilmer book. This post and the next contains a couple of brief thoughts about the ideas I found most interesting in each chapter.

Before reading chapter 3, the only thing I really knew about Monte Carlo methods was the slogan. If you want to sample from a probability distribution that you can’t describe explicitly, find a Markov chain which has that distribution as an equilibrium distribution, then run it for long enough starting from wherever you fancy. Then the convergence theorem for finite Markov chains means that the state of the chain after a long time approximates well the distribution you were originally looking for.

On the one previous occasion I had stopped and thought about this, I had two questions which I never really got round to answering. Firstly, what sort of distributions might you not be able to simulate directly? Secondly, and perhaps more fundamentally, how would you go about finding a Markov chain for which a given distribution is in equilibrium?

In the end, the second question is the one answered by this particular chapter. The method is called a Metropolis chain, and the basic idea is that you take ANY Markov chain with appropriate state space, then fiddle with the transition probabilities slightly. The starting chain is called a base chain. It is completely possible to adjust the following algorithm for a general base chain, but for simplicity, let’s assume it is possible to take an irreducible chain for which the transition matrix is symmetric. By thinking about the DBEs, this shows that the uniform distribution is the (unique) equilibrium distribution. Suppose the  transition matrix is given by \Psi(x,y), to copy notation from the book. Then set:

P(x,y)=\begin{cases}\Psi(x,y)\left[1\wedge \frac{\pi(y)}{\pi(x)}\right]&y\neq x\\ 1-\sum_{z\neq x} \Psi(x,z)\left [1\wedge \frac{\pi(z)}{\pi(x)}\right]& y=x.\end{cases}

Note that this second case (y=x) is of essentially no importance. It just confirms that the rows of P add to 1. It is easy to check from the DBEs that \pi is the equilibrium distribution of matrix P. One way to think of this algorithm is that we run the normal chain, but occasionally suppress transitions is they involve a move from a state which is likely (under \pi), to one which is less likely. This is done in proportion to the ratio, so it is unsurprising perhaps that the limit in distribution is \pi.

Conveniently, this algorithm also gives us some ideas for how to answer the first question. Note that at no point do we need to know \pi(x) for some state x. We only need to use \frac{\pi(x)}{\pi(y)} the ratios of probabilities. So this is perfect for distributions where there is a normalising constant which is computationally taxing to evaluate. For example, in the Ising model and similar statistical physics objects, probabilities are viewed more as weightings. There is a normalising constant, often called the partition function Z in this context, lying in the background, but especially the underlying geometry is quite exotic we definitely don’t want to have to worry about actually calculating Z. Thus we have a way to generate samples from such models. The other classic example is a random walk on a large, perhaps unknown graph. Then the equilibrium distribution at a vertex is inversely proportional to the degree of that vertex, but again you might not know about this information over the entire graph. It is reasonable to think of a situation where you might be able to take a random walk on a graph, say the connectivity graph of the internet, without knowing about all the edges at any one time. So, even though you potentially explore everywhere, you only need to know a small amount at any one time.

Of course, the drawback of both of these examples is that a lack of knowledge about the overall system means that it is hard in general to know how many steps the Metropolis chain must run before we can be sure that we are the equilibrium distribution it has been constructed to approach. So, while these chains are an excellent example to have in mind while thinking about mixing times, they are also a good motivation for the subject itself. General rules about speed of convergence to equilibrium are precisely what are required to make such implementation concrete.

Mixing Times 1 – Reversing Markov Chains

A small group of us have started meeting to discuss Levin, Peres and Wilmer’s book on Markov Chains and Mixing Times. (The plan is to cover a couple of chapters every week, then discuss points of interest and some of the exercises – if anyone is reading this and fancies joining, let me know!) Anyway, this post is motivated by something we discussed in our first session.

Here are two interesting facts about Markov Chains. 1) The Markov property can be defined in terms of products of transition probabilities giving the probability of a particular initial sequence. However, a more elegant and general formulation is to say that, conditional on the present, the past and the future are independent. 2) All transition matrices have at least one equilibrium distribution. In fact, irreducible Markov Chains have precisely one equilibrium distribution. Then, if we start with any distribution, the distribution of the chain at time t converges to the equilibrium distribution.

But hang on. This might be a fairly serious problem. On the one hand we have given a definition of the Markov property that is symmetric in time, in the sense that it remains true whether we are working forwards or backwards. While, on the other hand, the convergence to equilibrium is very much not time-symmetric: we move from disorder to order as time advances. What has gone wrong here?

We examine each of the properties in turn, then consider how to make them fit together in a non-contradictory way.

Markov Property

As many of the students in the Applied Probability course learned the hard way, there are many ways to define the Markov property depending on context, and some are much easier to work with than others. For a Markov chain, you can find a way to say that the transition probability \mathbb{P}(X_{n+1}=x_{n+1}\,|\,X_n=x_n,\ldots,X_0=x_0) is independent of x_0,\ldots,x_{n-1}. Alternatively, you can use this to give an inductive specification for the probability of the first n values of X being some sequence.

It requires a moment’s checking to see that the earlier definition of past/future independence is consistent with this. Let’s first check that we haven’t messed up a definition somewhere, and that the time-reversal of a general Markov chain does have the Markov property, even as defined in the context of a Markov chain.

For clarity, consider X_0,X_1,\ldots, X_N a Markov chain on some finite state space, with N some fixed finite end time. We aren’t losing anything by reversing over a finite time interval – after all, we need to know how to do it over a finite time interval before it could possibly make sense to do it over (-\infty,\infty). We examine (Y_n)_{n=0}^N defined by Y_n:= X_{N-n}.

\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1},\ldots,X_N=x_N)=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1})

is the statement of the Markov property for (Y_n). We rearrange the left hand side to obtain:

=\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1},\ldots,X_N=x_N)}{\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_N=x_N)}

=\frac{\mathbb{P}(X_N=x_N|X_n=x_n,\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_n=x_n,\ldots,X_{N-1}=x_{N-1})}{\mathbb{P}(X_N=x_N|X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})}.

Now, by the standard Markov property on the original chain (X_n), the first probability in each of the numerator and denominator are equal. This leaves us with exactly the same form of expression as before, but with one fewer term in the probability. So we can iterate until we end up with

\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1})}{\mathbb{P}(X_{n+1}=x_{n+1})}=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1}),

as required.

So there’s nothing wrong with the definition. The reversed chain Y genuinely does have this property, regardless of the initial distribution of X.

In particular, if our original Markov chain starts at a particular state with probability 1, and we run it up to time N, then saying that the time-reversal is a Markov chain too is making a claim that we have a non-trivial chain that converges from some general distribution at time 0 to a distribution concentrated at a single point by time N. This seems to contradict everything we know about these chains.

Convergence to Equilibrium – Markov Property vs Markov Chains

It took us a while to come up with a reasonable explanation for this apparent discrepancy. In the end, we come to the conclusion that Markov chains are a strict subset of stochastic processes with the Markov property.

The key thing to notice is that a Markov chain has even more regularity than the definition above implies. The usual description via a transition matrix says that the probability of moving to state y at time t+1 given that you are at state x at time t is some function of x and y. The Markov property says that this probability is independent of the behaviour up until time t. But we also have that the probability is independent of t. The transition matrix P has no dependence on time t – for example in a random walk we do not have to specify the time to know what happens next. This is the property that fails for the non-stationary time-reversal.

In the most extreme example, we say X_0=x_0 with probability 1. So in the time reversal, \mathbb{P}(Y_N=x_0|Y_{N-1}=y_{N-1})=1 for all y_{N-1}. But it will obviously not be the case in general that \mathbb{P}(Y_n=x_0|Y_{n-1}=y_{n-1})=1 for all y_{n-1}, as this would mean the chain Y would be absorbed after one step at state x_0, which is obviously not how the reversal of X should behave.

Perhaps the best way to reconcile this difference is to consider this example where you definitely start from x_0. Then, a Markov chain in general can be thought of as a measure on paths, that is \Omega^N, with non-trivial but regular correlations between adjacent components. (In the case of stationarity, all the marginals are equal to the stationary distribution – a good example of i.d. but not independent RVs.) This is indexed by the transition matrix and the initial distribution. If the initial distribution is a single point mass, then this can be viewed as a restriction to a smaller set of possible paths, with measures rescaled appropriately.

What have we learned?

Well, mainly to be careful about assuming extra structure with the Markov property. Markov Chains are nice because there is a transition matrix which is constant in time. Other processes, such as Brownian motion are space-homogeneous, where the transitions, or increments in this context, are independent of time and space. However, neither of these properties are true for a general process with the Markov property. Indeed, we have seen in a post from a long time ago that there are Markov processes which do not have the Strong Markov Property, which seems unthinkable if we limit our attention to chain-like processes.

Most importantly, we have clarified the essential point that reversing a Markov Chain only makes sense in equilibrium. It is perfectly possibly to define the reversal of a chain not started at a stationary distribution, but lots of unwelcome information from the forward chain ends up in the reversed chain. In particular, the theory of Markov Chains is not broken, which is good.

DBEs and stationary distributions

Aside

The most recent Applied Probability assignment sheet featured various aspects of Detailed Balance Equations for continuous-time Markov chains. We discussed the advantages and disadvantages of using DBEs rather than solving for an equilibrium distribution directly. The equations used in this second case are often called Full Balance Equations.

Briefly, the advantages of DBEs are that they are easy to solve. After all, each one contains only two components of the equilibrium distribution, so generally you can solve one-at-a-time. The disadvantage is that an equilibrium distribution might not satisfy the DBEs. The deductive structure is:

\text{Solves DBEs}\quad \stackrel{\Rightarrow}{\not\Leftarrow}\quad\text{Equilibrium distribution}

Usually, the chain will be irreducible, so the equilibrium distribution is unique. This means that if we can solve the DBEs, the result is the unique equilibrium distribution.

The DBEs are soluble only if the situation is reversible. This is probably the best definition to use in practice, but informally we can say that this means that the behaviour looks qualitatively the same if we reverse time. For example, as in Q1:

Q=\begin{pmatrix}-1 &1&0\\ 0& -1&1\\1&0&-1\end{pmatrix},

gives the Q-matrix which equilibrium distribution (\frac13,\frac13,\frac13), which does not satisfy DBEs. The chain is not reversible because sample paths always go clockwise, so if we reversed time they would go anti-clockwise (or vice-versa depending on how you’ve drawn the diagram).

What I wanted to say in the class, and made a mess of explaining was this, about why it was inappropriate to use DBEs to find stationary distributions in Q3d):

Reversibility is not just a function of the chain. It is a function of the chain AND the initial distribution. This is only in practice a concern when the chain is reducible, but in this case it really can lead you astray. Let’s consider an example, like

Q=\begin{pmatrix}-3&2&0&0&1&0\\ 0&-4&3&1&0&0\\ 0&1&-4&3&0&0\\ 0&3&1&-4&0&0\\ 0&0&0&0&-5&5\\ 0&0&0&0&5&-5\end{pmatrix}.

Then by solving as in the problem sheet, the invariant distributions are given by:

\lambda(0,\frac13,\frac13,\frac13,0,0)+\mu(0,0,0,0,\frac12\frac12),\quad \lambda+\mu=1.

If you attempted to solve the DBEs, you would succeed, but the only solution would be

(0,0,0,0,\frac12,\frac12).

The explanation is fairly simple in the end. Reversibility is a class property, and only one of the communicating classes, \{5,6\} in this example admits a reversible initial distribution, so to solve the DBEs we must assign zero mass on the other class.

Anyway, I hope that clears up any residual confusion from the class.

Invariant Distributions of Markov Chains

My lecture course in Linyi was all about Markov chains, and we spent much of the final two sessions discussing the properties of invariant distributions. I was not surprised, however, that none of the class chose this topic as the subject for a presentation to give after the end of the teaching week. One of the main problems is that so many rather similar properties are introduced roughly simultaneously. As we did in the class, I thought it was worth making some sort of executive summary, as a mixture of revision and interest.

Definition: \pi is an invariant measure if \pi P=\pi. If in addition \sum_{i\in I}\pi_i=1, then we say it is an invariant distribution. Of course, if I is finite, then any invariant measure can be normalised to give an invariant distribution.

The key initial questions are about existence and uniqueness. First, if there are multiple communicating classes, then an invariant measure (resp. distribution) is a linear (resp. affine) combination of the invariant measures / distributions on each (closed) class. So we restrict attention to irreducible Markov chains.

In the finite case, P is a stochastic matrix so has a column eigenvector with eigenvalue 1, namely the vector with all entries equal to 1. Thus, by reference to general theory in linear algebra, P has a row eigenvector \pi with eigenvalue 1. To paraphrase a remark made by one of my students, what is not clear is that this should be a measure. Demonstrating that this is true is rather non-trivial I think, normally done by reference to the rather more general Perron-Frobenius theorem, though on the flight home I came up with a short argument using Lagrangian duality. For now, we accept existence in the finite case, and note that we typically show existence by showing that the vector of expected time spent in each state between successive visits to a fixed reference state satisfies the properties of an invariant measure.

This is a good moment to note that recurrence is not a necessary condition for the existence of an invariant measure. For example, the random walk on \mathbb{Z}^3 is transient, but the uniform measure is invariant. However, it is not a sufficient condition for the existence of an invariant distribution either. (Of course, an irreducible finite chain is always recurrent, and always has an invariant distribution, so now we are considering only the infinite state space case.) The random walk on \mathbb{Z}^2 is recurrent, but the invariant measure is not normalisable.

The property we in fact need is positive recurrence. This says that the expected return time to each point is finite. Again, this is a class property. This is a common requirement in probabilistic arguments: almost surely finite is often not strong enough to show results if the expectation is infinite (see for example the various requirements for the optional stopping theorem). If this holds, then \pi_i=\frac{1}{\mathbb{E}T_i}, where T_i is the the return time starting from some i\in I.

The final question is ‘Why are we interested?’ One of the best answers is to look at convergence properties. A simple suggestion is this: if we start in equilibrium, then X_0,X_1,X_2,\ldots are all equal in distribution. Note that the dependence structure remains complicated, and much much more interesting than the individual distributions. Next, we observe that a calculation of n-step transition probabilities for a finite chain will typically involve a linear combination of nth powers of eigenvalues. One of the eigenvalues is 1, and the others lie strictly between -1 and 1. We observe in examples that the constant coefficient in p_{ij}^{(n)} is generally a function of j alone, and so p_{ij}^{(n)}\rightarrow\lambda_j, some distribution on I. By considering P^{n+1}=P\cdot P^n, it is easy to see that if this converges, (\lambda_j)_{j\in I} is an invariant distribution. The classic examples which do not work are

P=\begin{pmatrix}0&1\\1&0\end{pmatrix} and P=\begin{pmatrix}0&1&0\\ 0&0&1\\1&0&0\end{pmatrix},

as then the distribution of X_n is a function of the remainder of n modulo 3 alone. With a little thought, we can give a precise classification of such chains which force you to be in particular proper subsets of the state space at regular times n. Chains without this property are called aperiodic, and we can show that distributions for such chains converge to the equilibrium distribution as n\rightarrow\infty.