A small group of us have started meeting to discuss Levin, Peres and Wilmer’s book on Markov Chains and Mixing Times. (The plan is to cover a couple of chapters every week, then discuss points of interest and some of the exercises – if anyone is reading this and fancies joining, let me know!) Anyway, this post is motivated by something we discussed in our first session.
Here are two interesting facts about Markov chains. 1) The Markov property can be defined in terms of products of transition probabilities giving the probability of a particular initial sequence. However, a more elegant and general formulation is to say that, conditional on the present, the past and the future are independent. 2) Every transition matrix has at least one equilibrium distribution, and an irreducible Markov chain has precisely one. If the chain is also aperiodic then, starting from any distribution, the distribution of the chain at time t converges to this equilibrium distribution.
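As a quick numerical illustration of 2), here is a minimal sketch in Python. The 3×3 matrix is an arbitrary irreducible, aperiodic example of my own, not anything from the book:

```python
import numpy as np

# An arbitrary irreducible, aperiodic transition matrix (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])

# The equilibrium distribution pi solves pi P = pi together with sum(pi) = 1;
# here we find it by least squares on the stacked linear system.
A = np.vstack([P.T - np.eye(3), np.ones((1, 3))])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]

# Start from a point mass and watch the total variation distance to pi shrink.
mu = np.array([1.0, 0.0, 0.0])
for t in range(10):
    print(t, round(0.5 * np.abs(mu - pi).sum(), 6))
    mu = mu @ P
```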
But hang on. This might be a fairly serious problem. On the one hand, we have given a definition of the Markov property that is symmetric in time, in the sense that it remains true whether we are working forwards or backwards. On the other hand, the convergence to equilibrium is very much not time-symmetric: we move from disorder to order as time advances. What has gone wrong here?
We examine each of the properties in turn, then consider how to make them fit together in a non-contradictory way.
Markov Property
As many of the students in the Applied Probability course learned the hard way, there are many ways to define the Markov property depending on context, and some are much easier to work with than others. For a Markov chain $(X_t)$, one way is to say that the transition probability $\mathbb{P}(X_{t+1}=x_{t+1} \mid X_t=x_t, X_{t-1}=x_{t-1},\ldots,X_0=x_0)$ is independent of $(x_0,\ldots,x_{t-1})$. Alternatively, you can use this to give an inductive specification for the probability of the first $n$ values of $X$ being some given sequence.
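Spelled out in the notation above, this inductive specification is the standard product formula:

$\mathbb{P}(X_0=x_0, X_1=x_1, \ldots, X_n=x_n) = \mathbb{P}(X_0=x_0) \prod_{t=0}^{n-1} \mathbb{P}(X_{t+1}=x_{t+1} \mid X_t=x_t).$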
It requires a moment's checking to see that the earlier definition of past/future independence is consistent with this. Let's first check that we haven't messed up a definition somewhere, and that the time-reversal of a general Markov chain really does have the Markov property, in the transition-probability sense defined above.
For clarity, consider a Markov chain $(X_0, X_1, \ldots, X_N)$ on some finite state space $S$, with $N$ some fixed finite end time. We aren't losing anything by reversing over a finite time interval – after all, we need to know how to do it over a finite time interval before it could possibly make sense to do it over $\mathbb{Z}_{\geq 0}$. We examine the reversed process $(Y_t)_{0 \le t \le N}$ defined by $Y_t := X_{N-t}$.
Then

$\mathbb{P}(Y_{t+1}=y_{t+1} \mid Y_t=y_t, Y_{t-1}=y_{t-1}, \ldots, Y_0=y_0) = \mathbb{P}(Y_{t+1}=y_{t+1} \mid Y_t=y_t)$

is the statement of the Markov property for $Y$. We rearrange the left hand side to obtain:

$\frac{\mathbb{P}(X_N=y_0, \ldots, X_{N-t}=y_t, X_{N-t-1}=y_{t+1})}{\mathbb{P}(X_N=y_0, \ldots, X_{N-t}=y_t)} = \frac{\mathbb{P}(X_N=y_0 \mid X_{N-1}=y_1, \ldots, X_{N-t-1}=y_{t+1})\,\mathbb{P}(X_{N-1}=y_1, \ldots, X_{N-t-1}=y_{t+1})}{\mathbb{P}(X_N=y_0 \mid X_{N-1}=y_1, \ldots, X_{N-t}=y_t)\,\mathbb{P}(X_{N-1}=y_1, \ldots, X_{N-t}=y_t)}$

Now, by the standard Markov property on the original chain $X$, the first probability in each of the numerator and denominator are equal: both are $\mathbb{P}(X_N=y_0 \mid X_{N-1}=y_1)$. This leaves us with exactly the same form of expression as before, but with one fewer term in the probability. So we can iterate until we end up with

$\frac{\mathbb{P}(X_{N-t-1}=y_{t+1},\, X_{N-t}=y_t)}{\mathbb{P}(X_{N-t}=y_t)} = \mathbb{P}(Y_{t+1}=y_{t+1} \mid Y_t=y_t),$

as required.
So there’s nothing wrong with the definition. The reversed chain Y genuinely does have this property, regardless of the initial distribution of X.
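We can also double-check this numerically. Here is a minimal Python sketch (the matrix and the name reversed_kernel are my own, not from the book) that computes the one-step transition probabilities of $Y$ directly from the forward marginals of $X$, using Bayes' rule. Each matrix $Q_t$ is genuinely stochastic, confirming that $Y$ is Markov – but notice that the matrices vary with $t$, which anticipates the resolution below:

```python
import numpy as np

# Same arbitrary example matrix as above; start X from a point mass at state 0.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])
N = 5

# Forward marginals: mu[s] is the distribution of X_s.
mu = [np.array([1.0, 0.0, 0.0])]
for s in range(N):
    mu.append(mu[-1] @ P)

def reversed_kernel(t):
    """Stochastic matrix Q_t with Q_t[y, y2] = P(Y_{t+1}=y2 | Y_t=y),
    where Y_t = X_{N-t}; by Bayes' rule this equals
    mu_{N-t-1}(y2) * P(y2, y) / mu_{N-t}(y)."""
    s = N - t
    return (mu[s - 1][None, :] * P.T) / mu[s][:, None]

# Every Q_t has rows summing to 1, so Y really is Markov ...
print(reversed_kernel(0).sum(axis=1))   # -> [1. 1. 1.]
# ... but the kernel changes with t: Y is not time-homogeneous.
print(reversed_kernel(0).round(3))
print(reversed_kernel(1).round(3))
```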
In particular, if our original Markov chain starts at a particular state with probability 1, and we run it up to time N, then saying that the time-reversal is a Markov chain too is making a claim that we have a non-trivial chain that converges from some general distribution at time 0 to a distribution concentrated at a single point by time N. This seems to contradict everything we know about these chains.
Convergence to Equilibrium – Markov Property vs Markov Chains
It took us a while to come up with a reasonable explanation for this apparent discrepancy. In the end, we came to the conclusion that Markov chains are a strict subset of stochastic processes with the Markov property.
The key thing to notice is that a Markov chain has even more regularity than the definition above implies. The usual description via a transition matrix says that the probability of moving to state y at time t+1 given that you are at state x at time t is some function of x and y. The Markov property says that this probability is independent of the behaviour up until time t. But we also have that the probability is independent of t. The transition matrix P has no dependence on time t – for example in a random walk we do not have to specify the time to know what happens next. This is the property that fails for the non-stationary time-reversal.
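In fact, one more application of the same rearrangement makes the failure explicit. Writing $\mu_s$ for the distribution of $X_s$ and $P$ for the transition matrix of $X$, Bayes' rule gives

$\mathbb{P}(Y_{t+1}=y' \mid Y_t=y) = \frac{\mathbb{P}(X_{N-t-1}=y',\, X_{N-t}=y)}{\mathbb{P}(X_{N-t}=y)} = \frac{\mu_{N-t-1}(y')\, P(y',y)}{\mu_{N-t}(y)},$

which in general depends on $t$ through the marginals $\mu_s$. This dependence disappears precisely when all the marginals coincide, that is, when $X$ is started in equilibrium.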
In the most extreme example, we say $X_0 = x_0$ with probability 1. So in the time-reversal, $\mathbb{P}(Y_N = x_0 \mid Y_{N-1} = y) = 1$ for all $y$. But it will obviously not be the case in general that $\mathbb{P}(Y_t = x_0 \mid Y_{t-1} = y) = 1$ for all $y$ and all $t$, as this would mean the chain Y would be absorbed after one step at state $x_0$, which is obviously not how the reversal of X should behave.
Perhaps the best way to reconcile this difference is to consider this example where you definitely start from $x_0$. Then, a Markov chain in general can be thought of as a measure on paths, that is on $S^{\{0,1,\ldots,N\}}$, with non-trivial but regular correlations between adjacent components. (In the case of stationarity, all the marginals are equal to the stationary distribution – a good example of identically distributed but not independent random variables.) This is indexed by the transition matrix and the initial distribution. If the initial distribution is a single point mass, then this can be viewed as a restriction to a smaller set of possible paths, with measures rescaled appropriately.
What have we learned?
Well, mainly to be careful about assuming extra structure with the Markov property. Markov chains are nice because there is a transition matrix which is constant in time. Other processes, such as Brownian motion, are also space-homogeneous, where the transitions, or increments in this context, are independent of time and space. However, neither of these properties is true for a general process with the Markov property. Indeed, we have seen in a post from a long time ago that there are Markov processes which do not have the Strong Markov Property, which seems unthinkable if we limit our attention to chain-like processes.
Most importantly, we have clarified the essential point that reversing a Markov chain only makes sense in equilibrium. It is perfectly possible to define the reversal of a chain not started at a stationary distribution, but lots of unwelcome information from the forward chain ends up in the reversed chain. In particular, the theory of Markov chains is not broken, which is good.
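For completeness, here is the equilibrium picture in symbols (standard material, as in the book's discussion of time reversals). If $X$ is started from its stationary distribution $\pi$, then every marginal equals $\pi$, and the formula above reduces to a transition matrix that is constant in time:

$\hat{P}(x,y) = \frac{\pi(y)\, P(y,x)}{\pi(x)},$

so the reversal is itself a time-homogeneous Markov chain. The chain is called reversible when $\hat{P} = P$, that is, when the detailed balance equations $\pi(x)P(x,y) = \pi(y)P(y,x)$ hold.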