# Mixing Times 1 – Reversing Markov Chains

A small group of us have started meeting to discuss Levin, Peres and Wilmer’s book on Markov Chains and Mixing Times. (The plan is to cover a couple of chapters every week, then discuss points of interest and some of the exercises – if anyone is reading this and fancies joining, let me know!) Anyway, this post is motivated by something we discussed in our first session.

Here are two interesting facts about Markov Chains. 1) The Markov property can be defined in terms of products of transition probabilities giving the probability of a particular initial sequence. However, a more elegant and general formulation is to say that, conditional on the present, the past and the future are independent. 2) All transition matrices have at least one equilibrium distribution. In fact, irreducible Markov Chains have precisely one equilibrium distribution. Then, if we start with any distribution, the distribution of the chain at time t converges to the equilibrium distribution.

But hang on. This might be a fairly serious problem. On the one hand we have given a definition of the Markov property that is symmetric in time, in the sense that it remains true whether we are working forwards or backwards. While, on the other hand, the convergence to equilibrium is very much not time-symmetric: we move from disorder to order as time advances. What has gone wrong here?

We examine each of the properties in turn, then consider how to make them fit together in a non-contradictory way.

Markov Property

As many of the students in the Applied Probability course learned the hard way, there are many ways to define the Markov property depending on context, and some are much easier to work with than others. For a Markov chain, you can find a way to say that the transition probability $\mathbb{P}(X_{n+1}=x_{n+1}\,|\,X_n=x_n,\ldots,X_0=x_0)$ is independent of $x_0,\ldots,x_{n-1}$. Alternatively, you can use this to give an inductive specification for the probability of the first n values of X being some sequence.

It requires a moment’s checking to see that the earlier definition of past/future independence is consistent with this. Let’s first check that we haven’t messed up a definition somewhere, and that the time-reversal of a general Markov chain does have the Markov property, even as defined in the context of a Markov chain.

For clarity, consider $X_0,X_1,\ldots, X_N$ a Markov chain on some finite state space, with N some fixed finite end time. We aren’t losing anything by reversing over a finite time interval – after all, we need to know how to do it over a finite time interval before it could possibly make sense to do it over $(-\infty,\infty)$. We examine $(Y_n)_{n=0}^N$ defined by $Y_n:= X_{N-n}$.

$\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1},\ldots,X_N=x_N)=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1})$

is the statement of the Markov property for $(Y_n)$. We rearrange the left hand side to obtain:

$=\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1},\ldots,X_N=x_N)}{\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_N=x_N)}$

$=\frac{\mathbb{P}(X_N=x_N|X_n=x_n,\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_n=x_n,\ldots,X_{N-1}=x_{N-1})}{\mathbb{P}(X_N=x_N|X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})}.$

Now, by the standard Markov property on the original chain $(X_n)$, the first probability in each of the numerator and denominator are equal. This leaves us with exactly the same form of expression as before, but with one fewer term in the probability. So we can iterate until we end up with

$\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1})}{\mathbb{P}(X_{n+1}=x_{n+1})}=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1}),$

as required.

So there’s nothing wrong with the definition. The reversed chain Y genuinely does have this property, regardless of the initial distribution of X.

In particular, if our original Markov chain starts at a particular state with probability 1, and we run it up to time N, then saying that the time-reversal is a Markov chain too is making a claim that we have a non-trivial chain that converges from some general distribution at time 0 to a distribution concentrated at a single point by time N. This seems to contradict everything we know about these chains.

Convergence to Equilibrium – Markov Property vs Markov Chains

It took us a while to come up with a reasonable explanation for this apparent discrepancy. In the end, we come to the conclusion that Markov chains are a strict subset of stochastic processes with the Markov property.

The key thing to notice is that a Markov chain has even more regularity than the definition above implies. The usual description via a transition matrix says that the probability of moving to state y at time t+1 given that you are at state x at time t is some function of x and y. The Markov property says that this probability is independent of the behaviour up until time t. But we also have that the probability is independent of t. The transition matrix P has no dependence on time t – for example in a random walk we do not have to specify the time to know what happens next. This is the property that fails for the non-stationary time-reversal.

In the most extreme example, we say $X_0=x_0$ with probability 1. So in the time reversal, $\mathbb{P}(Y_N=x_0|Y_{N-1}=y_{N-1})=1$ for all $y_{N-1}$. But it will obviously not be the case in general that $\mathbb{P}(Y_n=x_0|Y_{n-1}=y_{n-1})=1$ for all $y_{n-1}$, as this would mean the chain Y would be absorbed after one step at state $x_0$, which is obviously not how the reversal of X should behave.

Perhaps the best way to reconcile this difference is to consider this example where you definitely start from $x_0$. Then, a Markov chain in general can be thought of as a measure on paths, that is $\Omega^N$, with non-trivial but regular correlations between adjacent components. (In the case of stationarity, all the marginals are equal to the stationary distribution – a good example of i.d. but not independent RVs.) This is indexed by the transition matrix and the initial distribution. If the initial distribution is a single point mass, then this can be viewed as a restriction to a smaller set of possible paths, with measures rescaled appropriately.

What have we learned?

Well, mainly to be careful about assuming extra structure with the Markov property. Markov Chains are nice because there is a transition matrix which is constant in time. Other processes, such as Brownian motion are space-homogeneous, where the transitions, or increments in this context, are independent of time and space. However, neither of these properties are true for a general process with the Markov property. Indeed, we have seen in a post from a long time ago that there are Markov processes which do not have the Strong Markov Property, which seems unthinkable if we limit our attention to chain-like processes.

Most importantly, we have clarified the essential point that reversing a Markov Chain only makes sense in equilibrium. It is perfectly possibly to define the reversal of a chain not started at a stationary distribution, but lots of unwelcome information from the forward chain ends up in the reversed chain. In particular, the theory of Markov Chains is not broken, which is good.

## DBEs and stationary distributions

### Aside

The most recent Applied Probability assignment sheet featured various aspects of Detailed Balance Equations for continuous-time Markov chains. We discussed the advantages and disadvantages of using DBEs rather than solving for an equilibrium distribution directly. The equations used in this second case are often called Full Balance Equations.

Briefly, the advantages of DBEs are that they are easy to solve. After all, each one contains only two components of the equilibrium distribution, so generally you can solve one-at-a-time. The disadvantage is that an equilibrium distribution might not satisfy the DBEs. The deductive structure is:

$\text{Solves DBEs}\quad \stackrel{\Rightarrow}{\not\Leftarrow}\quad\text{Equilibrium distribution}$

Usually, the chain will be irreducible, so the equilibrium distribution is unique. This means that if we can solve the DBEs, the result is the unique equilibrium distribution.

The DBEs are soluble only if the situation is reversible. This is probably the best definition to use in practice, but informally we can say that this means that the behaviour looks qualitatively the same if we reverse time. For example, as in Q1:

$Q=\begin{pmatrix}-1 &1&0\\ 0& -1&1\\1&0&-1\end{pmatrix},$

gives the Q-matrix which equilibrium distribution $(\frac13,\frac13,\frac13)$, which does not satisfy DBEs. The chain is not reversible because sample paths always go clockwise, so if we reversed time they would go anti-clockwise (or vice-versa depending on how you’ve drawn the diagram).

What I wanted to say in the class, and made a mess of explaining was this, about why it was inappropriate to use DBEs to find stationary distributions in Q3d):

Reversibility is not just a function of the chain. It is a function of the chain AND the initial distribution. This is only in practice a concern when the chain is reducible, but in this case it really can lead you astray. Let’s consider an example, like

$Q=\begin{pmatrix}-3&2&0&0&1&0\\ 0&-4&3&1&0&0\\ 0&1&-4&3&0&0\\ 0&3&1&-4&0&0\\ 0&0&0&0&-5&5\\ 0&0&0&0&5&-5\end{pmatrix}.$

Then by solving as in the problem sheet, the invariant distributions are given by:

$\lambda(0,\frac13,\frac13,\frac13,0,0)+\mu(0,0,0,0,\frac12\frac12),\quad \lambda+\mu=1.$

If you attempted to solve the DBEs, you would succeed, but the only solution would be

$(0,0,0,0,\frac12,\frac12).$

The explanation is fairly simple in the end. Reversibility is a class property, and only one of the communicating classes, $\{5,6\}$ in this example admits a reversible initial distribution, so to solve the DBEs we must assign zero mass on the other class.

Anyway, I hope that clears up any residual confusion from the class.