Mixing Times 4 – Avoiding Periodicity

A Markov chain is periodic if you can partition the state space such it is possible to be in a particular class only at certain, periodic times. Concretely, suppose we can find a decomposition into classes \Omega=V_1\cup\ldots\cup V_k such that conditional on X_t\in V_i, we have \mathbb{P}(X_{t+1}\in V_{i+1})=1, where the indices of the Vs are taken modulo k. Such a chain is called periodic with period k. In most cases, we would want to define the period to be the maximal such k.

Why is periodicity a problem? It prevents convergence to equilibrium. The distribution at time t has some fairly strong dependence on the initial distribution. For example, if the initial distribution is entirely supported on V_1 as defined above, then the distribution at time t will be entirely supported on V_i, where i\equiv t \mod k. In particular, this cannot converge to some equilibrium.

Aperiodicity thus becomes a necessary condition in any theorem on convergence to equilibrium. Note that by construction this is only relevant for chains in discrete time. In an first account of Markov chains, most of the examples will either have a small state space, for which the transition matrix will have to contain lots of zeros before it stands a chance of being periodic, or obviously aperiodic birth-death or queue type processes. But some of the combinatorially motivated chains we consider for interesting mixing properties are more likely to be periodic. In particular, for a random walk on a group say, the generator measure may well be supported only on a small subset of the whole group, which is completely natural (eg transpositions as a subset of the symmetric group). Then it becomes more plausible that periodicity might arise because of some underlying regularity or symmetry in the group structure.

My first claim is that periodicity is not a disaster for convergence properties of Markov chains. Firstly, by the definition above, P^k(x,y) for x,y\in V_1 is an irreducible (aperiodic if k is maximal) transition matrix on V_1, and so we have convergence to some equilibrium distribution on V_1 of (X_{kt+a}) or similar. An initial distribution mixed between classes gives a mix of such equilibria. Alternatively, we could think about large-time ergodic properties. By taking an average over all distributions up to some large t, the periodic problems get smoothed out. So, for mixing on a periodic chain, it might be possible to make headway with Cesaro mixing, which looks at the speed of convergence of the ergodic average distribution.

In most cases, though, we prefer to alter the chain directly to remove periodicity, or even any chance of periodicity. The preferred method in many contexts is to replace the transition matrix P with \frac12 (P+I). This says that at every time t, we toss an independent fair coin, and with probability 1/2 make the transition suggested by P, and with probability 1/2 we stay where we are. Note that if a chain is irreducible, and some P(x,x)>0, then it is definitely aperiodic, as x cannot be in more than one class as per the definition of periodicity.

If you want to know about the mixing time of the original chain, note that this so-called lazy chain moves at half the speed of the original, so to get exact asymptotics (eg in the case of cutoff, that is mixing speed faster than the scale of the mixing time) you must multiply by 2. Also note that all of the eigenvalues of \frac12 (P+I) are non-negative, and in fact, the eigenvalues are subject to a linear transform in the construction of the lazy transition matrix \lambda\mapsto \frac12(1+\lambda).

Note that choosing 1/2 as the parameter is unnecessary. Firstly, it would suffice to take some P(x,x)=\epsilon and rescale the rest of the row appropriately. Also, in some cases, a different constant gives a more natural interpretation of the underlying mechanism. For example, one model worth considering is the Random Transposition Random Walk on the symmetric group, where at time t we multiply (ie compose) with a transposition chosen uniformly at random. This model is interesting partly because the orbits of an element resemble, at least initially, the component size process of a Erdos-Renyi random graph, on the grounds that when the number of transpositions is small, they don’t interact too much, so can be viewed as independent edges. Anyway, some form of laziness is needed in RTRW, otherwise the chain will alternative between odd and even permutations. In this case, 1/2 is not the most natural choice. The most sensible way to sample random transpositions is to select the two elements of [n] to be transposed uniformly and independently at random. Thus each transposition is selected with probability \frac{2}{n^2}, while the identity, which corresponds to ‘lazily’ staying at the current state in the random walk, is selected with probability 1/n.

The lazy chain is also useful when the original chain has a lot of symmetry involved. In particular, if the original chain involves ‘switching’ say one coordinate. The best example is the random walk on the vertices of the n-hypercube, but there are others. Here, the most helpful way to visualise the configuration is to choose a coordinate uniformly at random and then flip its value (from 0 to 1 or 1 to 0). Now the lazy chain can be viewed similarly, but note that the dependence on the current value of the coordinate is suppressed. That is, having chosen the coordinate to be affected, we set it to be 0 with probability 1/2 and 1 with probability 1/2, irrespective of the prior value at that coordinate. Thus instead of viewing the action on coordinates to be ‘stay or switch’, we can view the action on the randomly chosen coordinate to be ‘randomly resample’, to use statistical terminology. This is ideal for coupling, because from the time coordinate j is first selected, the value at that coordinate is independent of the past, and in particular, the initial value or distribution. So we can couple arbitrary initial configurations or distributions, and we know that as soon as all coordinates have been selected (a time that can be described as a coupon collector problem), the chains are well coupled, that is, the values are the same.

Note that one way we definitely get periodicity is if the increment distribution for random walk on a group is supported entirely in a single coset of a normal subgroup. Why? Well if we take H\lhd G to be the normal subgroup, and gH to be the relevant coset, then P^t(g',\cdot) is supported entirely on g^tg'H, so is periodic with period equal to the order of gH in the quotient group G/H. Note that if the coset is the normal subgroup itself, then it might well include support on the identity, which immediately makes the chain aperiodic. However, there will be then be no transitions between cosets, so the chain is not irreducible on G.

The previous paragraph is the content of Remark 8.3 in the book we are reading. My final comment is that normality is precisely what is needed for this to hold. The key idea is that the set of subsets {gH, gHgH, gHgHgH, … } forms a partition of the group. This is certainly true if H is normal and gH generates G/H. If the latter statement is not true, then the set of subsets still forms a partition, but of some subset of G. The random walk is then neither irreducible nor aperiodic on the reduced state space. If H is not normal, then there are no such restrictions. For example, gHgH might be equal to the whole group G. Then the random walk is aperiodic, as this would imply we can move between any pair of states in two steps, and so by extension between any pair of states in three steps. (2,3)=1, hence the chain is aperiodic. As a concrete example, consider

\tau=\langle (1 2)\rangle \leq S_3,

the simplest example of a non-normal subgroup. Part of the problem is that cosets are different in the left-case and the right-case. Consider the left coset of \tau given by \sigma\tau=\{(1 2 3),(2 3)\}. These elements have order three and two respectively, and so by a similar argument to the general one above, this random walk is aperiodic.

Advertisements

4 thoughts on “Mixing Times 4 – Avoiding Periodicity

  1. Pingback: Mixing Times 5 – Cesaro Mixing | Eventually Almost Everywhere

  2. You say “If you want to know about the mixing time of the original chain, note that this so-called lazy chain moves at half the speed of the original”. I need exactly this for a proof in a research paper. Can you tell me a source that I can cite for this? Or can you prove it?

    • Well I was using the term ‘speed’ informally. You could construct the lazy chain by taking the original chain and a random walk with Bern(1/2) increments, then sampling the original chain at the times given by partial sums of the random walk. CLT controls the size of the error.

  3. Pingback: Random transpositions | Eventually Almost Everywhere

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s