Mixing Times 5 – Cesaro Mixing

We have just finished discussing chapters 11 and 12 of Markov Chains and Mixing Times, the end of the ‘core material’. I thought that, rather than addressing some of the more interesting but technical spectral methods that have just arisen, it would be a good subject for a quick post to collate some of the information about Cesaro mixing, which is spread throughout this first section.

Idea

A main result in the introductory theory of Markov chains is that for an irreducible aperiodic chain X, the distribution of X_t converges to \pi, the (unique) equilibrium distribution. The mixing time gives the rate at which this first mode of convergence takes place. We have freedom over the initial state, so we typically consider the ‘worst case scenario’, ie the slowest convergence. The most appropriate metric is given by the total variation distance, which is defined in previous posts. The most important point to note is that the mixing time should be thought of as the correct timescale for convergence, rather than some threshold. In particular, the time at which the chain is within 1/4 of the equilibrium distribution in the TV metric has the same order of magnitude (in n, some parameter controlling the number of states) as the time at which it is within 1/20 of the equilibrium distribution.

But this isn’t the only result about convergence in distribution of functionals of Markov chains. Perhaps more intuitive is the ergodic theorem which asserts that the proportion of time spent in a particular state also converges to the equilibrium probability as time advances. We might write:

\frac{1}{t}\sum_{s=0}^{t-1} \mathbf{1}(X_s=x)\rightarrow \pi(x),\quad \forall x\in \Omega.

Note that if the state space \Omega is finite, then this convergence is in fact uniform in x. We can also think of the LHS of this convergence as a measure on \Omega varying in time

\frac{1}{t}\sum_{s=0}^{t-1} \mathbf{1}(X_s=\cdot),

and the mixing time for this sequence of measures is defined as for the conventional mixing time, and is called the Cesaro mixing time, at least in the Levin / Peres / Wilmer text.
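
For concreteness, one way to write this down (a sketch of the definition using the conventions above, rather than a quotation from the book) is:

d_{\mathrm{Ces}}(t):=\max_{x\in\Omega}||\frac{1}{t}\sum_{s=0}^{t-1}P^s(x,\cdot)-\pi||_{TV},\qquad t_{\mathrm{Ces}}(\epsilon):=\min\{t\geq 1: d_{\mathrm{Ces}}(t)\leq \epsilon\}.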

Advantages

There are some obvious advantages to considering Cesaro mixing. Principally, conventional mixing is unable to deal with periodic chains. This property was the main content of the previous post, but to summarise, if a chain switches between various classes in a partition of the state space in a deterministic periodic way, then the distribution does not necessarily converge to equilibrium. The previous post discusses several ways of resolving this problem in specific cases. Note that this problem does not affect Cesaro mixing as the ergodic theorem continues to hold in the periodic case. Indeed the form of the distribution (which we might call an occupation measure in some contexts) confirms the intuition about viewing global mixing as a sort of sum over mixing modulo k in time.

Other advantages include the fact that the dependence on the initial state is weaker. For instance, consider the occupation measures for a chain started at x which moves first to y, versus a chain which starts at y then proceeds as the original. It requires very little thought to see that the two occupation measures differ by at most O(1/t), so for large t this difference becomes small.

Another bonus is that we can use so-called stationary times to control Cesaro mixing. A stationary time is a stopping time such that X_\tau\stackrel{d}{=}\pi. It is clear that if we wait until \tau, then run the chain for a further \alpha\tau, the chain will have spent \frac{\alpha}{1+\alpha} of its duration in the equilibrium distribution, and so using Markov’s inequality and bounding the total variation distance between occupation measure and \pi by 1 up until the stationary time, we can get good bounds for the Cesaro mixing time in terms of \mathbb{E}\tau.
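
One way to make this precise (a sketch of the standard bound, not the book’s exact statement): if \tau is a stationary time for the chain started from x, then X_{\tau+s} has distribution \pi for every s\geq 0, and the sum of indicators over \{0,\ldots,t-1\} differs from the sum over \{\tau,\ldots,\tau+t-1\} in at most \tau\wedge t terms, each bounded by 1. This gives

||\frac{1}{t}\sum_{s=0}^{t-1}P^s(x,\cdot)-\pi||_{TV}\leq \frac{\mathbb{E}_x\tau}{t},

so the Cesaro mixing time is at most of order \max_x\mathbb{E}_x\tau.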

Why does this fail to work for normal mixing? The key to the above argument was that by taking an average over time up to some T>>\tau, the dependence on the actual value of X_\tau was suppressed. Consider the deterministic walk on the cycle \mathbb{Z}_n, which advances by 1 modulo n on each go. Now sample independently a random variable Z distributed uniformly on \mathbb{Z}_n. By definition, the random hitting time \tau_Z is a stationary time, but in fact the chain’s distribution does not converge. The condition we actually require for normal mixing is that \tau be a strong stationary time, meaning that X_\tau\stackrel{d}{=}\pi, and the value of X_\tau is independent of \tau. With this definition we can proceed with a similar result for normal mixing. An example of a strong stationary time would be for shuffling a pack of cards by repeatedly inserting the top card into a random place in the rest of the pile. Then consider the moment at which the original bottom card first reaches the top of the pile. It does not take too much to reassure oneself that after the next move, we have a strong stationary time, since every card has been randomised at least once, and the position of the other cards is independent of how long it took the original bottom card to rise to the top.
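
As a concrete illustration of the card-shuffling example (a minimal simulation sketch, not taken from the book), we can follow the original bottom card and record the step at which it, having reached the top, is itself inserted into the pack:

```python
import random

def top_to_random_strong_stationary_time(n, rng=random):
    """Shuffle a deck of n cards by repeatedly inserting the top card at a
    uniformly random position. Return the step at which the original bottom
    card, having risen to the top, is itself inserted: by the argument above,
    at this moment the deck is uniformly distributed, independently of how
    long it took."""
    deck = list(range(n))            # deck[0] is the top card, deck[-1] the bottom
    bottom_card = deck[-1]
    t = 0
    while True:
        top = deck.pop(0)
        deck.insert(rng.randrange(n), top)   # n possible positions, chosen uniformly
        t += 1
        if top == bottom_card:
            return t

# averaging over many runs estimates E[tau], which controls the Cesaro mixing time
print(sum(top_to_random_strong_stationary_time(52) for _ in range(1000)) / 1000)
```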

Disadvantages

So why do we not consider Cesaro mixing rather than the conventional variety? Well, mainly because of how we actually use mixing times. The Metropolis algorithm gives a way to generate chains with a particular equilibrium distribution, including ones for which it is hard to sample directly. Mixing time theory then gives a quantitative answer to the question of how long it is necessary to run such a chain for before it gives a good estimate to the equilibrium distribution. In many cases, such as a random walk on a large unknown network, the main aim when applying such Monte Carlo procedures is to minimise the difficulty of calculation. For Cesaro mixing, you have to store all the information about path states, while for conventional mixing you only care about your current location.

The other phenomenon that is lacking in Cesaro mixing is cutoff. This is where the total variation distance

d_n(t)=\max_{x\in\Omega}||P^t(x,\cdot)-\pi||_{TV}

converges to 0 suddenly. More formally, there is some timescale f(n) such that

\lim_{n\rightarrow\infty}d_n(cf(n))=\begin{cases}1& c<1\\ 0& c>1,\end{cases}

so in the n-limit, the graph of d looks like a step-function. Several of the shuffling chains exhibit this property, leading to statements like “7 shuffles are required to mix a standard pack of cards”. Cesaro mixing smooths out this effect on an f(n) timescale.

A Further Example

Perhaps the best example where Cesaro mixing happens faster than normal mixing is in the case of a lazy biased random walk on \mathbb{Z}_n. (Ex. 4.10 in MTMC) Here, we stay put with probability 1/2, otherwise move clockwise with probability p>1/4 and counter-clockwise with probability 1/2-p<1/4. This chain is not reversible, as we can determine the direction (or arrow) of time by examining a path. Roughly speaking, the chain will drift clockwise at rate 2p-\frac12>0. In particular, at some time Kn, where K is large, we would expect to have completed roughly K(2p-\frac12) circuits of the vertices, and so the occupation measure will be close to the uniform equilibrium distribution if we choose K large enough.

On the other hand, the distribution of X at time Kn is still fairly concentrated. If we assume we are instead performing the random walk on \mathbb{Z}, the distribution after Kn i.i.d. increments is approximately

X_t\sim N(Kn(2p-\frac12), Kn\sigma^2).

That is, the standard deviation is O(\sqrt{n})<<n. So, even once we return to considering the random walk on the cyclic group, if we view it as a circle, we expect most of the probability mass to be concentrated near Kn(2p-\frac12) \mod n. By an identical heuristic argument, we see that mixing is achieved when the standard deviation has order n, that is when time has order n^2.
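
A quick exact computation makes the gap visible (a sketch; the values of n, p and K are arbitrary choices for illustration):

```python
import numpy as np

def lazy_biased_cycle(n, p):
    """Transition matrix of the lazy biased walk on Z_n: hold with probability
    1/2, step +1 with probability p, step -1 with probability 1/2 - p."""
    P = np.zeros((n, n))
    for x in range(n):
        P[x, x] += 0.5
        P[x, (x + 1) % n] += p
        P[x, (x - 1) % n] += 0.5 - p
    return P

def tv(mu, nu):
    return 0.5 * np.abs(mu - nu).sum()

n, p, K = 200, 0.45, 20
P = lazy_biased_cycle(n, p)
pi = np.full(n, 1.0 / n)              # uniform equilibrium distribution

mu = np.zeros(n); mu[0] = 1.0         # start at state 0
cesaro = np.zeros(n)
for _ in range(K * n):
    cesaro += mu
    mu = mu @ P
cesaro /= K * n

print("TV at time Kn:        ", tv(mu, pi))       # still visibly non-uniform
print("TV of occupation meas:", tv(cesaro, pi))   # much closer to uniform
```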

Mixing Times 4 – Avoiding Periodicity

A Markov chain is periodic if you can partition the state space such that it is possible to be in a particular class only at certain, periodic times. Concretely, suppose we can find a decomposition into classes \Omega=V_1\cup\ldots\cup V_k such that conditional on X_t\in V_i, we have \mathbb{P}(X_{t+1}\in V_{i+1})=1, where the indices of the Vs are taken modulo k. Such a chain is called periodic with period k. In most cases, we would want to define the period to be the maximal such k.

Why is periodicity a problem? It prevents convergence to equilibrium. The distribution at time t has some fairly strong dependence on the initial distribution. For example, if the initial distribution is entirely supported on V_1 as defined above, then the distribution at time t will be entirely supported on V_i, where i\equiv 1+t \mod k. In particular, this cannot converge to some equilibrium.

Aperiodicity thus becomes a necessary condition in any theorem on convergence to equilibrium. Note that by construction this is only relevant for chains in discrete time. In a first account of Markov chains, most of the examples will either have a small state space, for which the transition matrix will have to contain lots of zeros before it stands a chance of being periodic, or obviously aperiodic birth-death or queue type processes. But some of the combinatorially motivated chains we consider for interesting mixing properties are more likely to be periodic. In particular, for a random walk on a group say, the generator measure may well be supported only on a small subset of the whole group, which is completely natural (eg transpositions as a subset of the symmetric group). Then it becomes more plausible that periodicity might arise because of some underlying regularity or symmetry in the group structure.

My first claim is that periodicity is not a disaster for convergence properties of Markov chains. Firstly, by the definition above, P^k(x,y) for x,y\in V_1 is an irreducible (aperiodic if k is maximal) transition matrix on V_1, and so we have convergence to some equilibrium distribution on V_1 of (X_{kt+a}) or similar. An initial distribution mixed between classes gives a mix of such equilibria. Alternatively, we could think about large-time ergodic properties. By taking an average over all distributions up to some large t, the periodic problems get smoothed out. So, for mixing on a periodic chain, it might be possible to make headway with Cesaro mixing, which looks at the speed of convergence of the ergodic average distribution.

In most cases, though, we prefer to alter the chain directly to remove periodicity, or even any chance of periodicity. The preferred method in many contexts is to replace the transition matrix P with \frac12 (P+I). This says that at every time t, we toss an independent fair coin, and with probability 1/2 make the transition suggested by P, and with probability 1/2 we stay where we are. Note that if a chain is irreducible, and some P(x,x)>0, then it is definitely aperiodic, as x cannot be in more than one class as per the definition of periodicity.

If you want to know about the mixing time of the original chain, note that this so-called lazy chain moves at half the speed of the original, so to get exact asymptotics (eg in the case of cutoff, where the mixing happens over a window of time asymptotically smaller than the mixing time itself) you must multiply by 2. Also note that all of the eigenvalues of \frac12 (P+I) have non-negative real part (for a reversible chain they are real and non-negative), and in fact the eigenvalues are subject to the linear transform \lambda\mapsto \frac12(1+\lambda) in the construction of the lazy transition matrix.
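
A small numerical check of the eigenvalue statement (a sketch; the simple random walk on the 4-cycle is just a convenient periodic, reversible test case):

```python
import numpy as np

# simple random walk on the 4-cycle: reversible but periodic, so -1 is an eigenvalue
P = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])

lazy = 0.5 * (P + np.eye(4))

print(np.sort(np.linalg.eigvalsh(P)))      # [-1, 0, 0, 1]
print(np.sort(np.linalg.eigvalsh(lazy)))   # [0, 0.5, 0.5, 1]: each is (1 + lambda)/2
```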

Note that choosing 1/2 as the parameter is unnecessary. Firstly, it would suffice to take some P(x,x)=\epsilon and rescale the rest of the row appropriately. Also, in some cases, a different constant gives a more natural interpretation of the underlying mechanism. For example, one model worth considering is the Random Transposition Random Walk on the symmetric group, where at time t we multiply (ie compose) with a transposition chosen uniformly at random. This model is interesting partly because the orbits of an element resemble, at least initially, the component size process of an Erdos-Renyi random graph, on the grounds that when the number of transpositions is small, they don’t interact too much, so can be viewed as independent edges. Anyway, some form of laziness is needed in the RTRW, otherwise the chain will alternate between odd and even permutations. In this case, 1/2 is not the most natural choice. The most sensible way to sample random transpositions is to select the two elements of [n] to be transposed uniformly and independently at random. Thus each transposition is selected with probability \frac{2}{n^2}, while the identity, which corresponds to ‘lazily’ staying at the current state in the random walk, is selected with probability 1/n.
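
In code, one step of this version of the walk might look like the following sketch (using positions 0,…,n-1 rather than [n], and representing a permutation as a list):

```python
import random

def random_transposition_step(sigma, rng=random):
    """One step of the random transposition random walk: choose i and j
    uniformly and independently; with probability 1/n they coincide and we
    stay put, otherwise we compose with the transposition (i j). Any given
    transposition is chosen with probability 2/n^2."""
    n = len(sigma)
    i, j = rng.randrange(n), rng.randrange(n)
    if i != j:
        sigma[i], sigma[j] = sigma[j], sigma[i]
    return sigma
```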

The lazy chain is also useful when the original chain has a lot of symmetry involved. In particular, if the original chain involves ‘switching’ say one coordinate. The best example is the random walk on the vertices of the n-hypercube, but there are others. Here, the most helpful way to visualise the configuration is to choose a coordinate uniformly at random and then flip its value (from 0 to 1 or 1 to 0). Now the lazy chain can be viewed similarly, but note that the dependence on the current value of the coordinate is suppressed. That is, having chosen the coordinate to be affected, we set it to be 0 with probability 1/2 and 1 with probability 1/2, irrespective of the prior value at that coordinate. Thus instead of viewing the action on coordinates to be ‘stay or switch’, we can view the action on the randomly chosen coordinate to be ‘randomly resample’, to use statistical terminology. This is ideal for coupling, because from the time coordinate j is first selected, the value at that coordinate is independent of the past, and in particular, the initial value or distribution. So we can couple arbitrary initial configurations or distributions, and we know that as soon as all coordinates have been selected (a time that can be described as a coupon collector problem), the chains are well coupled, that is, the values are the same.
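
Here is a minimal sketch of that coupling (coordinates resampled rather than flipped; for complementary starting points the coupling time is exactly the coupon collector time):

```python
import random

def coupled_hypercube_time(x, y, rng=random):
    """Run the lazy walk on {0,1}^n from two starting configurations using the
    same randomness: at each step pick a coordinate uniformly and set it to the
    same uniform bit in both chains. Return the first time the chains agree;
    from then on they coincide forever."""
    n = len(x)
    x, y, t = list(x), list(y), 0
    while x != y:
        t += 1
        i = rng.randrange(n)
        b = rng.randint(0, 1)          # resample coordinate i, ignoring its old value
        x[i] = b
        y[i] = b
    return t

# e.g. couple the all-zeros and all-ones configurations on the 10-dimensional cube
print(coupled_hypercube_time([0] * 10, [1] * 10))
```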

Note that one way we definitely get periodicity is if the increment distribution for random walk on a group is supported entirely in a single coset of a normal subgroup. Why? Well if we take H\lhd G to be the normal subgroup, and gH to be the relevant coset, then P^t(g',\cdot) is supported entirely on g^tg'H, so is periodic with period equal to the order of gH in the quotient group G/H. Note that if the coset is the normal subgroup itself, then it might well include support on the identity, which immediately makes the chain aperiodic. However, there will then be no transitions between cosets, so the chain is not irreducible on G.

The previous paragraph is the content of Remark 8.3 in the book we are reading. My final comment is that normality is precisely what is needed for this to hold. The key idea is that the set of subsets {gH, gHgH, gHgHgH, … } forms a partition of the group. This is certainly true if H is normal and gH generates G/H. If the latter statement is not true, then the set of subsets still forms a partition, but of some subset of G. The random walk is then neither irreducible nor aperiodic on the reduced state space. If H is not normal, then there are no such restrictions. For example, gHgH might be equal to the whole group G. Then we can move between any pair of states in two steps, and so by extension in three steps; since \gcd(2,3)=1, the chain is aperiodic. As a concrete example, consider

\tau=\langle (1 2)\rangle \leq S_3,

the simplest example of a non-normal subgroup. Part of the problem is that cosets are different in the left-case and the right-case. Consider the left coset of \tau given by \sigma\tau=\{(1 2 3),(2 3)\}. These elements have order three and two respectively, and so by a similar argument to the general one above, this random walk is aperiodic.

Mixing Times 3 – Convex Functions on the Space of Measures

The meat of this course covers rate of convergence of the distribution of Markov chains. In particular, we want to be thinking about lots of distributions simultaneously, so we really need to be comfortable working with the space of measures on a (for now) finite state space. This is not really too bad actually, since we can embed it in a finite-dimensional real vector space.

\mathcal{M}_1(E)=\{(x_v:v\in\Omega),x_v\geq 0, \sum x_v=1\}\subset \mathbb{R}^\Omega.

Since most operations we might want to apply to distributions are linear, it doesn’t make much sense to inherit the usual Euclidean metric. In the end, the metric we use is the same as the L_1 metric, but the motivation is worth exploring. Typically, |\Omega| will be a function of n, a parameter which will tend to infinity. So we do not want to be too rooted in the actual set \Omega for what will follow.

Perhaps the best justification for total variation distance is from a gambling viewpoint. Suppose your opinion for the distribution of some outcome is \mu, and a bookmaker has priced their odds according to their evaluation of the outcome as \nu. You want to make the most money, assuming that your opinion of the distribution is correct (which in your opinion, of course it is!). So assuming the bookmaker will accept an arbitrarily complicated (but finite obviously, since there are only |\Omega| possible outcomes) bet, you want to place money on whichever event evinces the greatest disparity between your measure of likeliness and the bookmaker’s. If you can find an event which you think is very likely, and which the bookmaker thinks is unlikely, you are (again, according to your own opinion of the measure) on for a big profit. This difference is the total variation distance ||\mu-\nu||_{TV}.

Formally, we define:

||\mu-\nu||_{TV}:=\max_{A\subset\Omega}|\mu(A)-\nu(A)|.

Note that if this maximum is achieved at A, it is also achieved at A^c, and so we might as well go with the original intuition of

||\mu-\nu||_{TV}=\max_{A\subset\Omega} \left[\mu(A)-\nu(A)\right].

If we decompose \mu(A)=\sum_{x\in A}\mu(x), and similarly for A^c, then add the results, we obtain:

||\mu-\nu||_{TV}=\frac12\sum_{x\in\Omega}|\mu(x)-\nu(x)|.

There are plenty of other interesting interpretations of total variation distance, but I don’t want to get bogged down right now. We are interested in the rate of convergence of distributions of Markov chains. Given some initial distribution \lambda of X_0, we are interested in ||\lambda P^t-\pi||_{TV}. The problem is that doing everything in terms of some general \lambda is really annoying, at the very least for notational reasons. So really we want to investigate

d(t)=\max_{\lambda\in\mathcal{M}_1(E)}||\lambda P^t-\pi||_{TV},

the worst-case scenario, where we choose the initial distribution that mixes the slowest, at least judging at time t. Now, here’s where the space of measures starts to come in useful. For now, we relax the requirement that measures must be probability distributions. In fact, we allow them to be negative as well. Then \lambda P^t-\pi is some signed measure on \Omega with zero total mass.

But although I haven’t yet been explicit about this, it is easy to see that ||\cdot||_{TV} is a norm on this space. In fact, it is (up to the factor of 1/2, which makes no difference) the L_1 norm defined before. Recall that norms are convex functions; this is an immediate consequence of the triangle inequality. The set of suitable distributions \lambda is convex and compact, because a convex combination of probability distributions is another probability distribution.

Then, we know from convex optimisation theory that a convex function on a compact convex set achieves its maximum at an extreme point. And the extreme points of \mathcal{M}_1(E), with this definition, are precisely the delta-measures \delta_v at points of the state space. So in fact, we can replace our definition of d(t) by:

d(t)=\max_{x\in\Omega}||P^t(x,\cdot)-\pi||_{TV},

where P^t(x,\cdot) is the same as (\delta_x P^t)(\cdot). Furthermore, we can immediately apply this idea to get a second result for free. In some problems, particularly those with neat couplings across all initial distributions, it is easier to work with a larger class of transition probabilities, rather than the actual equilibrium distribution, so we define:

\bar{d}(t):=\max_{x,y\in\Omega}||P^t(x,\cdot)-P^t(y,\cdot)||_{TV}.

The triangle inequality gives \bar{d}(t)\leq 2d(t) immediately. But we want to show d(t)\leq \bar{d}(t), and we can do that as before, by considering

\max_{\lambda,\mu\in\mathcal{M}_1(E)}||\lambda P^t-\mu P^t||_{TV}.

The function we are maximising is a convex function on \mathcal{M}_1(E)^2, and so it attains its maximum at an extreme point, which must be of the form \lambda=\delta_x,\mu=\delta_y. Hence \bar{d}(t) is equal to the displayed expression above, which is certainly greater than or equal to the original formulation of d(t), as that is the maximum of the same expression over a strict subset.
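
For a small chain, all of these quantities can be computed directly, which gives a quick sanity check on d(t)\leq\bar{d}(t)\leq 2d(t) (a sketch; the 3-state transition matrix is an arbitrary example):

```python
import numpy as np
from itertools import product

def tv(mu, nu):
    return 0.5 * np.abs(mu - nu).sum()

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# stationary distribution: the left eigenvector of P for the eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

Pt = np.eye(3)
for t in range(1, 11):
    Pt = Pt @ P
    d = max(tv(Pt[x], pi) for x in range(3))
    d_bar = max(tv(Pt[x], Pt[y]) for x, y in product(range(3), repeat=2))
    print(t, round(d, 4), round(d_bar, 4), d <= d_bar + 1e-12, d_bar <= 2 * d + 1e-12)
```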

I’m not suggesting this method is qualitatively different to that proposed by the authors of the book. However, I think this is very much the right way to be thinking about these matters of maximising norms over a space of measures. Partly this is good because it gives an easy ‘sanity check’ for any idea. But also because it gives some idea of whether it will or won’t be possible to extend the ideas to the case where the state space is infinite, which will be of interest much later.

Mixing Times 2 – Metropolis Chains

In our second reading group meeting for Mixing Times of Markov Chains, we reviewed chapters 3 and 4 of the Levin, Peres and Wilmer book. This post and the next contains a couple of brief thoughts about the ideas I found most interesting in each chapter.

Before reading chapter 3, the only thing I really knew about Monte Carlo methods was the slogan. If you want to sample from a probability distribution that you can’t describe explicitly, find a Markov chain which has that distribution as an equilibrium distribution, then run it for long enough starting from wherever you fancy. Then the convergence theorem for finite Markov chains means that the state of the chain after a long time approximates well the distribution you were originally looking for.

On the one previous occasion I had stopped and thought about this, I had two questions which I never really got round to answering. Firstly, what sort of distributions might you not be able to simulate directly? Secondly, and perhaps more fundamentally, how would you go about finding a Markov chain for which a given distribution is in equilibrium?

In the end, the second question is the one answered by this particular chapter. The method is called a Metropolis chain, and the basic idea is that you take ANY Markov chain with appropriate state space, then fiddle with the transition probabilities slightly. The starting chain is called a base chain. It is completely possible to adjust the following algorithm for a general base chain, but for simplicity, let’s assume it is possible to take an irreducible chain for which the transition matrix is symmetric. Thinking about the DBEs shows that the uniform distribution is then the (unique) equilibrium distribution. Suppose the transition matrix is given by \Psi(x,y), to copy notation from the book. Then set:

P(x,y)=\begin{cases}\Psi(x,y)\left[1\wedge \frac{\pi(y)}{\pi(x)}\right]&y\neq x\\ 1-\sum_{z\neq x} \Psi(x,z)\left [1\wedge \frac{\pi(z)}{\pi(x)}\right]& y=x.\end{cases}

Note that this second case (y=x) is of essentially no importance. It just confirms that the rows of P add to 1. It is easy to check from the DBEs that \pi is the equilibrium distribution of matrix P. One way to think of this algorithm is that we run the normal chain, but occasionally suppress transitions if they involve a move from a state which is likely (under \pi), to one which is less likely. This is done in proportion to the ratio, so it is unsurprising perhaps that the limit in distribution is \pi.

Conveniently, this algorithm also gives us some ideas for how to answer the first question. Note that at no point do we need to know \pi(x) itself for any state x. We only need the ratios of probabilities \frac{\pi(x)}{\pi(y)}. So this is perfect for distributions where there is a normalising constant which is computationally taxing to evaluate. For example, in the Ising model and similar statistical physics objects, probabilities are viewed more as weightings. There is a normalising constant, often called the partition function Z in this context, lying in the background, but especially when the underlying geometry is quite exotic we definitely don’t want to have to worry about actually calculating Z. Thus we have a way to generate samples from such models. The other classic example is a random walk on a large, perhaps unknown graph. Then the equilibrium distribution at a vertex is proportional to the degree of that vertex, but again you might not have this information over the entire graph. It is reasonable to think of a situation where you might be able to take a random walk on a graph, say the connectivity graph of the internet, without knowing about all the edges at any one time. So, even though you potentially explore everywhere, you only need to know a small amount at any one time.
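
A minimal sketch of the resulting sampler, for a target known only through unnormalised weights w(x)\propto\pi(x), with a symmetric base chain (here a walk on \{0,\ldots,m-1\} that proposes a uniform neighbour modulo m; all of these choices are arbitrary illustrations rather than anything from the book):

```python
import random

def metropolis_step(x, weights, rng=random):
    """One Metropolis step. The base chain proposes one of the two neighbours
    of x (mod m) uniformly, so Psi is symmetric; the proposal is accepted with
    probability min(1, pi(y)/pi(x)), which only requires unnormalised weights."""
    m = len(weights)
    y = (x + rng.choice([-1, 1])) % m
    if rng.random() < min(1.0, weights[y] / weights[x]):
        return y
    return x

# example target: pi(x) proportional to (x + 1)^2 on {0, ..., 9}
weights = [(x + 1) ** 2 for x in range(10)]
x, counts = 0, [0] * 10
for _ in range(200_000):
    x = metropolis_step(x, weights)
    counts[x] += 1
print([round(c / 200_000, 3) for c in counts])   # close to weights / sum(weights)
```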

Of course, the drawback of both of these examples is that a lack of knowledge about the overall system means that it is hard in general to know how many steps the Metropolis chain must run before we can be sure that we are close to the equilibrium distribution it has been constructed to approach. So, while these chains are an excellent example to have in mind while thinking about mixing times, they are also a good motivation for the subject itself. General rules about speed of convergence to equilibrium are precisely what are required to make such implementations concrete.

Mixing Times 1 – Reversing Markov Chains

A small group of us have started meeting to discuss Levin, Peres and Wilmer’s book on Markov Chains and Mixing Times. (The plan is to cover a couple of chapters every week, then discuss points of interest and some of the exercises – if anyone is reading this and fancies joining, let me know!) Anyway, this post is motivated by something we discussed in our first session.

Here are two interesting facts about Markov Chains. 1) The Markov property can be defined in terms of products of transition probabilities giving the probability of a particular initial sequence. However, a more elegant and general formulation is to say that, conditional on the present, the past and the future are independent. 2) All transition matrices have at least one equilibrium distribution. In fact, irreducible Markov Chains have precisely one equilibrium distribution. Then, provided the chain is also aperiodic, if we start with any distribution, the distribution of the chain at time t converges to the equilibrium distribution.

But hang on. This might be a fairly serious problem. On the one hand we have given a definition of the Markov property that is symmetric in time, in the sense that it remains true whether we are working forwards or backwards. While, on the other hand, the convergence to equilibrium is very much not time-symmetric: we move from disorder to order as time advances. What has gone wrong here?

We examine each of the properties in turn, then consider how to make them fit together in a non-contradictory way.

Markov Property

As many of the students in the Applied Probability course learned the hard way, there are many ways to define the Markov property depending on context, and some are much easier to work with than others. For a Markov chain, you can find a way to say that the transition probability \mathbb{P}(X_{n+1}=x_{n+1}\,|\,X_n=x_n,\ldots,X_0=x_0) is independent of x_0,\ldots,x_{n-1}. Alternatively, you can use this to give an inductive specification for the probability of the first n values of X being some sequence.

It requires a moment’s checking to see that the earlier definition of past/future independence is consistent with this. Let’s first check that we haven’t messed up a definition somewhere, and that the time-reversal of a general Markov chain does have the Markov property, even as defined in the context of a Markov chain.

For clarity, consider X_0,X_1,\ldots, X_N a Markov chain on some finite state space, with N some fixed finite end time. We aren’t losing anything by reversing over a finite time interval – after all, we need to know how to do it over a finite time interval before it could possibly make sense to do it over (-\infty,\infty). We examine (Y_n)_{n=0}^N defined by Y_n:= X_{N-n}.

\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1},\ldots,X_N=x_N)=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1})

is the statement of the Markov property for (Y_n). We rearrange the left hand side to obtain:

=\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1},\ldots,X_N=x_N)}{\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_N=x_N)}

=\frac{\mathbb{P}(X_N=x_N|X_n=x_n,\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_n=x_n,\ldots,X_{N-1}=x_{N-1})}{\mathbb{P}(X_N=x_N|X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})}.

Now, by the standard Markov property on the original chain (X_n), the first factors in the numerator and denominator are equal. This leaves us with exactly the same form of expression as before, but with one fewer term in the probability. So we can iterate until we end up with

\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1})}{\mathbb{P}(X_{n+1}=x_{n+1})}=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1}),

as required.

So there’s nothing wrong with the definition. The reversed chain Y genuinely does have this property, regardless of the initial distribution of X.

In particular, if our original Markov chain starts at a particular state with probability 1, and we run it up to time N, then saying that the time-reversal is a Markov chain too is making a claim that we have a non-trivial chain that converges from some general distribution at time 0 to a distribution concentrated at a single point by time N. This seems to contradict everything we know about these chains.

Convergence to Equilibrium – Markov Property vs Markov Chains

It took us a while to come up with a reasonable explanation for this apparent discrepancy. In the end, we came to the conclusion that Markov chains are a strict subset of stochastic processes with the Markov property.

The key thing to notice is that a Markov chain has even more regularity than the definition above implies. The usual description via a transition matrix says that the probability of moving to state y at time t+1 given that you are at state x at time t is some function of x and y. The Markov property says that this probability is independent of the behaviour up until time t. But we also have that the probability is independent of t. The transition matrix P has no dependence on time t – for example in a random walk we do not have to specify the time to know what happens next. This is the property that fails for the non-stationary time-reversal.

In the most extreme example, we say X_0=x_0 with probability 1. So in the time reversal, \mathbb{P}(Y_N=x_0|Y_{N-1}=y_{N-1})=1 for all y_{N-1}. But it will obviously not be the case in general that \mathbb{P}(Y_n=x_0|Y_{n-1}=y_{n-1})=1 for all y_{n-1}, as this would mean the chain Y would be absorbed after one step at state x_0, which is obviously not how the reversal of X should behave.
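
This is easy to see numerically: for a chain started from a point mass, the reversed transition probabilities can be computed from the forward marginals by Bayes’ rule, and the resulting kernel visibly changes with time (a sketch with an arbitrary two-state chain):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

N = 6
mu = [np.array([1.0, 0.0])]           # X_0 = 0 with probability 1
for _ in range(N):
    mu.append(mu[-1] @ P)             # mu[m] is the law of X_m

def reversed_kernel(m):
    """Transition matrix for the step X_m -> X_{m-1} of the reversed chain,
    computed by Bayes' rule from the forward marginals."""
    Q = np.zeros((2, 2))
    for b in range(2):                # current state of the reversed chain
        for a in range(2):            # next state of the reversed chain
            Q[b, a] = P[a, b] * mu[m - 1][a] / mu[m][b]
    return Q

for m in range(N, 1, -1):
    print(m, reversed_kernel(m))      # the kernel varies with m: not time-homogeneous
```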

Perhaps the best way to reconcile this difference is to consider this example where you definitely start from x_0. Then, a Markov chain in general can be thought of as a measure on paths, that is \Omega^N, with non-trivial but regular correlations between adjacent components. (In the case of stationarity, all the marginals are equal to the stationary distribution – a good example of i.d. but not independent RVs.) This is indexed by the transition matrix and the initial distribution. If the initial distribution is a single point mass, then this can be viewed as a restriction to a smaller set of possible paths, with measures rescaled appropriately.

What have we learned?

Well, mainly to be careful about assuming extra structure with the Markov property. Markov Chains are nice because there is a transition matrix which is constant in time. Other processes, such as Brownian motion are space-homogeneous, where the transitions, or increments in this context, are independent of time and space. However, neither of these properties are true for a general process with the Markov property. Indeed, we have seen in a post from a long time ago that there are Markov processes which do not have the Strong Markov Property, which seems unthinkable if we limit our attention to chain-like processes.

Most importantly, we have clarified the essential point that reversing a Markov Chain only makes sense in equilibrium. It is perfectly possible to define the reversal of a chain not started at a stationary distribution, but lots of unwelcome information from the forward chain ends up in the reversed chain. In particular, the theory of Markov Chains is not broken, which is good.

The Inspection Paradox and related topics

In the final class for Applied Probability, we discussed the so-called Inspection Paradox for an arrivals process. We assume that buses, say, arrive as a Poisson process with rate 1, and consider the size of the interval (between buses) containing some fixed time T. The ‘paradox’ is that the size of this interval is larger in expectation than the average time between buses, which of course is given by an exponential random variable.

As with many paradoxes, it isn’t really that surprising after all. Perhaps what is more surprising is that the difference between the expected observed interval and the expected actual interval time is quite small here. There are several points of interest:

1) The Markov property of the Poisson process is key. In particular, this says that the expectation (and indeed the distribution) of the waiting time for a given customer arriving at T is not dependent on T, even if T is a random variable (or rather, a class of random variables, the stopping times). So certainly the inspection paradox property will hold whenever the process has the Markov property, because the inspected interval contains the inspected waiting time, which is equal in distribution to any fixed interval.

2) Everything gets slightly complicated by the fact that the Poisson process is defined to begin at 0. In particular, it is not reversible. Under the infinitesimal (or even the independent Poisson increments) definition, we can view the Poisson process not as a random non-decreasing function of time, but rather as a random collection of points on the positive reals. With this setup, it is clearly no problem to define instead a random collection of points on all the reals. [If we consider this instead as a random collection of point masses, then this is one of the simplest examples of a random measure, but that’s not hugely relevant here.]

We don’t need to worry too much about what value the Poisson process takes at any given time if we are only looking at increments, but if it makes you more comfortable, you can still safely assume that it is 0 at time 0. Crucially, the construction IS now reversible. The number of points in the interval [s,t] has distribution parameterised by t-s, so it doesn’t matter which direction we are moving in down the real line. In this case, A_t, the time since the previous arrival, and E_t, the waiting time until the next arrival, are both Exp(1) RVs, as the memorylessness property applies in each time direction.

For the original Poisson process, we actually have A_t stochastically dominated by an Exp(1) distribution, because it is conditioned to be less than or equal to t. So in this case, the expected interval time is some complicated function of t, lying strictly between 1 and 2. In our process extended to the whole real line, the expected interval time is exactly 2.

This idea of extending onto the whole real line explains why we mainly consider delayed renewal processes rather than just normal renewal processes. The condition that we start a holding time at 0 is often not general enough, particularly when the holding times are not exponential and so the process is not memoryless.

3) There is a general size-biasing principle in action here. Roughly speaking, we are more likely to arrive in a large interval than in a small interval. The scaling required is proportional to the length of the interval. Given a density function f(x) of X, we define the size-biased density function to be xf(x). We need to normalise to give a probability distribution, and dividing by the expectation EX is precisely what is needed. Failure to account for whether an observation should have the underlying distribution or the size-biased distribution is a common cause of supposed paradoxes. A common example is ‘on average my friends have more friends than I do’. Obviously, various assumptions on me and my friends, and on how we are drawn from the set of people (and the distribution of the number of friends), are required, and these might not be entirely accurate in all situations.

In the Poisson process example above, the holding times have density function e^{-x}, so the size-biased density function is xe^{-x}. This corresponds to a \Gamma(2,1) distribution, which may be written as the sum of two independent Exp(1) RVs as suggested above. There is a small simulation sketch illustrating this after point 4) below.

4) A further paradox mentioned on the sheet is the waiting time paradox. This says that the expected waiting time is longer if you arrive at a random time than if you just miss a bus. This is not too surprising: consider at least the stereotypical complaint about buses in this country arriving in threes, at least roughly. Then if you just miss a bus, there is a 2/3 chance that another will be turning up essentially immediately. On the sheet, we showed that the \Gamma(\alpha,1) distribution has this property also, provided \alpha<1.

We can probably do slightly better than this. The memoryless property of the exponential distribution says that:

\mathbb{P}(Z>t+s|Z>t)=\mathbb{P}(Z>s).

In general, for the sort of holding times we might see at a bus stop, we might expect it to be the case that if we have waited a long time already, then we are less likely relatively to have to wait a long time more, that is:

\mathbb{P}(Z>t+s|Z>t)\leq\mathbb{P}(Z>s),

and furthermore this will be strict if neither s nor t is 0. I see no reason not to make up a definition, and call this property supermemorylessness. However, for the subclass of Gamma distributions described above, we have the opposite property:

\mathbb{P}(Z>t+s|Z>t)\geq\mathbb{P}(Z>s).

Accordingly, let’s call this submemorylessness. If this is strict, then it says that we are more likely to have to wait a long time if we have already been waiting a long time. This seems contrary to most real-life distributions, but it certainly explains the paradox. If we arrive at a random time, then the appropriate holding time has been ‘waiting’ for a while, so is more likely to require a longer observed wait than if I had arrived as the previous bus departed.
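
To put a number on points 2) and 3), here is a quick simulation of the length of the interval containing a fixed time (a sketch; the rate-1 assumption is as above, and the particular T and horizon are arbitrary):

```python
import random

def inspected_interval(T=50.0, horizon=100.0, rng=random):
    """Simulate a rate-1 Poisson process on [0, horizon] and return the length
    of the inter-arrival interval containing the fixed time T."""
    arrivals, t = [0.0], 0.0
    while t < horizon:
        t += rng.expovariate(1.0)
        arrivals.append(t)
    for a, b in zip(arrivals, arrivals[1:]):
        if a <= T < b:
            return b - a

samples = [inspected_interval() for _ in range(20_000)]
print(sum(samples) / len(samples))    # close to 2 (the Gamma(2,1) mean), not 1
```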

In conclusion, before you think of something as a paradox, think about whether the random variables being compared are being generated in the same way, in particular whether one is size-biased, and whether there are effects due to non-memorylessness.

DBEs and stationary distributions

The most recent Applied Probability assignment sheet featured various aspects of Detailed Balance Equations for continuous-time Markov chains. We discussed the advantages and disadvantages of using DBEs rather than solving for an equilibrium distribution directly. The equations used in this second case are often called Full Balance Equations.

Briefly, the advantages of DBEs are that they are easy to solve. After all, each one contains only two components of the equilibrium distribution, so generally you can solve one-at-a-time. The disadvantage is that an equilibrium distribution might not satisfy the DBEs. The deductive structure is:

\text{Solves DBEs}\quad \stackrel{\Rightarrow}{\not\Leftarrow}\quad\text{Equilibrium distribution}

Usually, the chain will be irreducible, so the equilibrium distribution is unique. This means that if we can solve the DBEs, the result is the unique equilibrium distribution.

The DBEs are soluble only if the situation is reversible. This is probably the best definition to use in practice, but informally we can say that this means that the behaviour looks qualitatively the same if we reverse time. For example, as in Q1:

Q=\begin{pmatrix}-1 &1&0\\ 0& -1&1\\1&0&-1\end{pmatrix},

gives a Q-matrix whose equilibrium distribution is (\frac13,\frac13,\frac13), but this does not satisfy the DBEs. The chain is not reversible because sample paths always go clockwise, so if we reversed time they would go anti-clockwise (or vice-versa depending on how you’ve drawn the diagram).

What I wanted to say in the class, and made a mess of explaining, was this, about why it is inappropriate to use DBEs to find stationary distributions in Q3d):

Reversibility is not just a function of the chain. It is a function of the chain AND the initial distribution. This is only in practice a concern when the chain is reducible, but in this case it really can lead you astray. Let’s consider an example, like

Q=\begin{pmatrix}-3&2&0&0&1&0\\ 0&-4&3&1&0&0\\ 0&1&-4&3&0&0\\ 0&3&1&-4&0&0\\ 0&0&0&0&-5&5\\ 0&0&0&0&5&-5\end{pmatrix}.

Then by solving as in the problem sheet, the invariant distributions are given by:

\lambda(0,\frac13,\frac13,\frac13,0,0)+\mu(0,0,0,0,\frac12,\frac12),\quad \lambda+\mu=1.

If you attempted to solve the DBEs, you would succeed, but the only solution would be

(0,0,0,0,\frac12,\frac12).

The explanation is fairly simple in the end. Reversibility is a class property, and only one of the communicating classes, \{5,6\} in this example admits a reversible initial distribution, so to solve the DBEs we must assign zero mass on the other class.
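
For completeness, the claims above are easy to verify numerically (a sketch using the Q-matrix from this example):

```python
import numpy as np

Q = np.array([[-3,  2,  0,  0,  1,  0],
              [ 0, -4,  3,  1,  0,  0],
              [ 0,  1, -4,  3,  0,  0],
              [ 0,  3,  1, -4,  0,  0],
              [ 0,  0,  0,  0, -5,  5],
              [ 0,  0,  0,  0,  5, -5]], dtype=float)

pi1 = np.array([0, 1/3, 1/3, 1/3, 0, 0])
pi2 = np.array([0, 0, 0, 0, 1/2, 1/2])

print(pi1 @ Q)   # zero vector: invariant, yet e.g. pi1[1]*Q[1,2] != pi1[2]*Q[2,1]
print(pi2 @ Q)   # zero vector: invariant, and the DBEs do hold on the class {5,6}
```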

Anyway, I hope that clears up any residual confusion from the class.

The Perron-Frobenius Theorem for Stochastic Matrices

This article was prompted by a question asked by one of my students about 5 minutes before the end of the final lecture of my Markov Chains course in Linyi. Although I didn’t have time to think about it right then, the flight home offered plenty of opportunity for having a play around. I am well aware that the proof given below is not the best one, but the ideas of minimising quadrants of a matrix seemed fairly natural. Anyway, it’s been sitting on my desktop for over two months now, so I decided I should put it up.

———

Recall that the transition probabilities of a finite Markov chain are determined by a stochastic matrix P. That is, each row of P is a probability distribution. In both theory and applications, the existence and properties of an invariant distribution is of interest. This is a probability distribution \pi satisfying the relation:

\pi P=\pi. (1)

It is clear that \bigg(\begin{smallmatrix}1\\ \vdots\\1\end{smallmatrix}\bigg) is a right- or column eigenvector of P, with eigenvalue 1. Since the spectrum of P^T is the same as that of P, we conclude that P also has a left-eigenvector with eigenvalue 1. So we can be assured of the existence of a vector \pi satisfying (1). What is unclear is that this eigenvector \pi should be a probability distribution. Since at least one entry must be non-zero, it will suffice to show that every entry of \pi is non-negative.

A sufficient condition for the uniqueness of an invariant distribution is that the Markov chain be irreducible. This is best defined using the terminology of random walks: the chain is irreducible if for every pair of states i,j\in I, it is possible to move from i to j and back again. In terms of the transition matrix, P is irreducible if it cannot be put into block upper-triangular form by any simultaneous reordering of its rows and columns.

We want to show that when P is irreducible, the (unique) 1-eigenvector is a probability distribution. The standard method proposed in this context is to exhibit the invariant distribution directly. For example, Norris’s Markov Chains defines

\gamma_i^k=\text{ expected time spent in i between visits to k }=\mathbb{E}_k\sum_{n=0}^{T_k-1}1_{\{X_n=i\}},

and shows that (\gamma_i^k)_{i\in I} satisfies (1).

Nonetheless, the result required is clearly at least one step removed from the probabilistic interpretation, so it would be satisfying to find a direct proof of existence. Typically, one quotes the substantially more general theorem of Perron and Frobenius, the most relevant form of which is:

Theorem (Perron-Frobenius): Let A be a non-negative and irreducible square matrix. Then there is a positive real eigenvalue \lambda with multiplicity 1, such that all other eigenvalues have absolute value less than or equal to \lambda. Moreover, the (unique up to scaling) left- and right-eigenvectors corresponding to \lambda are positive.

Here we present a short proof of this result in the case where A is the (stochastic) transition matrix of a Markov chain.

Proposition: An irreducible stochastic matrix has a 1-eigenvector with all entries non-negative.

Proof: We show instead the contrapositive: that if a stochastic matrix has a 1-eigenvector with both negative and non-negative components, then it is reducible. The motivation is this third eigenvector given in example (2). Observe that the communicating classes are \{1,2\} and \{3\}, and it is not hard to see that for any eigenvector with negative and non-negative components, the sign of a component is a class property.

Informally, given an n\times n stochastic matrix P, and a 1-eigenvector \pi with this property, we relabel the states so that the non-negative components, which we call A\subset I are first. That is, in a moderate abuse of notation:

\pi=(\underbrace{\pi_A}_{\geq 0}\quad\underbrace{\pi_B}_{<0}).\quad\text{ If we write P as }\begin{pmatrix}P_{AA}&P_{AB}\\P_{BA}&P_{BB}\end{pmatrix},

our aim is to show that the sub-matrices P_{AB} and P_{BA} are both zero. This implies that states in A and states in B do not communicate, showing that P is reducible. We can formulate this as a linear programming problem:

\text{Maximise }\sum_{\substack{x\in A,y\in B\\x\in B, y\in A}}p_{xy}\text{ s.t. }\begin{cases}p_{xy}\geq 0&\forall x,y\in I\\p_{x1}+\ldots+p_{xn}=1&\forall x\in I\\\pi_1p_{1y}+\ldots+\pi_np_{ny}=\pi_y&\forall y\in I\end{cases}

It suffices to show that this maximum is 0. Now, we take |A|=i, and assume that 1\leq i\leq n-1, that is, there are a positive number of negative and non-negative components. Noting that the sum of the rows in a stochastic matrix is 1, we may consider instead:

\text{Minimise }\sum_{\substack{x,y\in A\\x,y\in B}}p_{xy}\text{ s.t. }\begin{cases}p_{xy}\geq 0&\forall x,y\in I\\p_{x1}+\ldots+p_{xn}=1&\forall x\in I\\\pi_1p_{1y}+\ldots+\pi_np_{ny}=\pi_y&\forall y\in I\end{cases}

and it will suffice to show that this minimum is n. To do this, we consider instead the dual:

\text{Maximise }\lambda_1+\ldots+\lambda_n+\pi_1\mu_1+\ldots+\pi_n\mu_n,

\text{ s.t. }\lambda_x+\pi_y\mu_x\leq\begin{cases}1&\text{if }x,y\leq i\text{ or }x,y\geq i+1\\ 0&\text{otherwise.}\end{cases}

The objective is certainly bounded by n. And in fact this is attainable, for example by taking:

\lambda_1=\ldots=\lambda_i=1,\quad \lambda_{i+1}=\ldots=\lambda_n=0

\mu_1=\ldots=\mu_i=0,\quad \mu_{i+1}=-\frac{1}{\pi_{i+1}}, \ldots,\mu_n=-\frac{1}{\pi_n}.

Applying strong duality for linear programming problems completes the proof.
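
As a quick numerical illustration of the proposition (a sketch; the matrix is an arbitrary irreducible example):

```python
import numpy as np

P = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3],
              [0.3, 0.3, 0.4]])

# left 1-eigenvector: eigenvector of P^T for the eigenvalue closest to 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()    # the entries share a sign, so this gives a probability distribution

print(pi)             # all entries positive
print(pi @ P - pi)    # (numerically) zero: pi is invariant
```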

How to take a Passport Photograph – Part 1

As seems to be the case whenever you start somewhere new, I’ve needed an almost infinite supply of passport-sized photographs recently. The university, my college, my department and of course the Chinese immigration authorities all wanted a record of my beautiful features. Anyway, as a result of all of this interest, I was in the do-it-yourself photo booth in WHSmiths in Wandsworth getting some more the other day. The first attempt looked fine, but the machine offered me the possibility of trying again, up to twice if I wanted. This seemed like a win-win situation, so I said yes, not realising that the one I already had would not be kept ‘in the bag’. The second attempt looked somewhat startled, a pose that runs in my family, but not wanting to risk the possibility of a disastrous third attempt (and the financial penalty of having to do the whole operation again) I confirmed that I was happy and made do with the result. Naturally, the question that struck me: what is the optimal strategy for such a situation? (Assuming that, unlike me, you knew the rules from the beginning)

Mathematical model and choices

Let’s formulate this mathematically. Suppose there are n possible trials, corresponding to iid random variables X_1,\ldots,X_n. (Note that this assumes that your ‘performance’ does not improve or otherwise change during the process. Perhaps not a reasonable assumption in some contexts?) After trials X_1,\ldots,X_k have been observed, you have to choose whether to accept the value X_k as your ‘final answer’, or whether to continue.

The first key decision is: what distribution should the X_is have? Since in the original problem there isn’t a natural metric for quality, let’s assume that the X_is represent some well-defined quantitative utility, distributed as a uniform [0,1] random variable. Perhaps a normal random variable might be a more realistic model, but I can solve it in this case, so let’s stick to this for now. In addition, for the sake of making the eventual answer more simple, let’s say that 0 is the best quality and 1 is the worst. That is, we are looking for a strategy that stops the process at a time T so as to minimise \mathbb{E}X_T.

Finding an optimal strategy

The key observation is the following. In words, if we reject X_1, we can forget about its value as that is independent of \{X_2,\ldots,X_n\} which is now all that remains to base future judgments on. We return to the original problem with one fewer trial. In more mathematical notation, conditional on \{T\neq 1\}, T is independent of X_1.

The following argument assumes that an optimal strategy exists, which is not ideal, but can easily be justified. For now though, we proceed relatively informally by induction on n.

Let S be the stopping time for the optimal strategy on X_2,\ldots,X_n which we assume exists by induction. It is ‘obvious’ that the optimal strategy T for X_1,\ldots,X_n should be the following:

  • T=1 iff X_1<a(n), where a(n) is a deterministic quantity with dependence only on n.
  • Conditional on T\neq 1, take T=S.

From this alone, we can calculate \mathbb{E}X_T.

\mathbb{E}X_T=\mathbb{E}X_S1_{T=S}+\mathbb{E}X_T1_{T\neq S}

= (1-\mathbb{P}(T=1))\mathbb{E}X_S+\mathbb{E}X_1 1_{X_1<a(n)}

= (1-a(n))\mathbb{E}X_S +\frac12 a(n)^2.

This is minimised precisely when a(n)=\mathbb{E}X_S. We conclude that the optimal strategy, as of course we might well expect, is to take T=1 precisely if X_1 is less than the expected result of applying the optimal strategy to the remaining random variables.

By extension, we have a(n+1)=\mathbb{E}X_T, and so

a(n+1)=a(n)\left[1-\frac{a(n)}{2}\right]. (*)

The first few values are:

a(1)=1,\quad a(2)=\frac12,\quad a(3)=\frac38,\quad a(4)=\frac{39}{128},\quad\ldots
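
The recursion is easy to iterate numerically, which also previews the asymptotics discussed in the next section (a quick sketch):

```python
def a_values(n_max):
    """Iterate a(n+1) = a(n) * (1 - a(n)/2) from a(1) = 1; a[n] is a(n)."""
    a = [None, 1.0]
    for n in range(1, n_max):
        a.append(a[n] * (1 - a[n] / 2))
    return a

a = a_values(1000)
for n in (1, 2, 3, 4, 10, 100, 1000):
    print(n, a[n], 2 / n)     # a(n) and 2/n agree increasingly well for large n
```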

Behaviour of a(n)

The first question is: as n grows large, does a(n)\rightarrow 0? Well this isn’t too hard: the recursive definition (*) confirms that the sequence a(1),a(2),\ldots is (strictly) decreasing, and so has a limit, which must be a fixed point of the equation (*). The only such fixed point is 0.

The second question is: what is the asymptotic behaviour of a(n) for large n? A quick run on MATLAB, or examination of the equation (*) suggests that

a(n)\approx \frac{2}{n}

should describe the behaviour well for large n. My basic attempts to verify this were initially unsuccessful, but I felt fairly sure that this should be true in some metric sense because of the following highly non-rigorous but nonetheless convincing idea.

Claim: a_n=\frac{2}{n}+O(n^{-3}) satisfies (*).

Why? Well, then:

\frac{2}{n+1}-\left[(\frac{2}{n}+O(n^{-3}))-\frac12(\frac{2}{n}+O(n^{-3}))^2\right]

=\frac{2}{n+1}-\frac{2}{n}+\frac{2}{n^2}+O(n^{-3})

=\frac{2}{n^2(n+1)}+O(n^{-3})=O(n^{-3}).

This proves the claim, but none of the = signs are really especially meaningful here. Perhaps there is a really slick way to tie this up that I’ve missed? In any case, I will save my own slightly involved method for a new post.

Markovian Excursions

In the previous post, I talked about the excursions of a Brownian motion. Today I’m thinking about how to extend these ideas to more general Markov chains. First we want to rule out some situations. In particular, we aren’t hugely interested in discrete time Markov chains. The machinery is fairly well established for excursions, whether or not the chain is transient. Furthermore, if the state space is discrete, as for a Poisson process for example, the discussion is not hugely interesting either. Remember that the technical challenges in the constructions of local time arise because of the Blumenthal 0-1 law property that Brownian motion visits 0 infinitely often in the small window after the start time. We therefore want the process to be regular at the point of the state space under discussion. This is precisely the condition described above for BM about 0.

Why is it harder in general?

The informal notion of a local time should transfer to a more general Markov chain, but there are some problems. Firstly, to define something in terms of an integral is not general enough. The state space E needs some topological structure, but any meaningful definition must be in terms of functions from E into the reals. There were also all sorts of special properties of Brownian motion, including the canonical time-space rescaling that came in handy in that particular case. It turns out to be easiest to consider the excursion measure on a general Markov chain through its Laplace transform.

Definition and Probabilistic interpretations

The resolvent is the Laplace transform of the transition probability P_t(x,A), viewed as an operator on functions f:E\rightarrow \mathbb{R}.

R_\lambda f(x):=\mathbb{E}_x\left[\int_0^\infty e^{-\lambda t}f(X_t)dt\right]=\int_0^\infty e^{-\lambda t}P_tf(x)dt.

We can interpret this in terms of the original process in a couple of ways which may be useful later. The motivation is that we would like to specify a Poisson process of excursions, for which we need to know the rate. We hope that the rate will in fact be constant, so it will suffice to work out the properties of the expected number of excursions (or whatever) up to various random times, in particular those given by exponential RVs.

So, we take \zeta\sim\text{Exp}(\lambda) independent of everything else, and assume that we ‘kill the chain’ at time \zeta. Then, by shuffling expectations in and out of the integral and separating independent bits, we get:

R_\lambda f(x)=\mathbb{E}_x\int_0^\zeta f(X_s)ds = \frac{1}{\lambda}\mathbb{E}_xf(X_\zeta).
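
Spelling out the ‘shuffling’ step, using the independence of \zeta from the chain:

\mathbb{E}_x\int_0^\zeta f(X_s)ds=\mathbb{E}_x\int_0^\infty \mathbf{1}_{\{s<\zeta\}}f(X_s)ds=\int_0^\infty e^{-\lambda s}P_sf(x)ds=R_\lambda f(x),

while \mathbb{E}_xf(X_\zeta)=\int_0^\infty \lambda e^{-\lambda s}P_sf(x)ds=\lambda R_\lambda f(x).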

As in the Brownian local time description

R_\lambda 1_A(x)=\mathbb{E}_x(\text{time spent in }A\text{ before death at time }\zeta_\lambda).

Markovian property

We want to show that excursions are Markov, once we’ve sorted out what an ‘excursion’ actually means in this context. We do know how to deal with the Markovian property once we are already on an excursion. It is relatively straightforward to define an extension of the standard transition probability operator, to include a condition that the chain should not hit point a during the transition. That is

_aP_t(x,A):= \mathbb{P}_x(X_t\in A,\ H_a>t).

This will suffice to define the behaviour once an excursion has started. The more complicated bit will be the entrance law n_t(A), being the probability of arriving at A after time t of an excursion. To summarise, as with BM, all the technical difficulties with an excursion happen at the beginning, ie bouncing around the start point, but once it is ‘up-and-running’, its path is Markovian and controlled by _aP_t.

Marking

The link between the resolvent and the excursions is provided, as in the Brownian case, by supplying a PPP of marks at uniform rate \lambda in real time. This induces a mark process on excursions, weighted by an (exponential) function of excursion length. We make no distinction between an excursion including one mark or many marks. Then the measure on marked excursions is, in a mild abuse of notation:

n_\lambda=(1-e^{-\lambda\zeta(f)})\cdot n.

We compare with the Laplace transform n_\lambda(dx)=\int_0^\infty e^{-\lambda t}n_t(dx)\,dt using a probabilistic argument.

We can apply the measure to a function in the usual way: \lambda n_\lambda(1_A) is the measure of those excursions for which the first mark occurs in A. So by taking A=E, we get

\lambda n_\lambda(1)=\text{ Excursion measure }=\int_U n(df)(1-e^{-\lambda\zeta(f)}).

We have therefore linked the exponential mark process on excursion measure with the Laplace transform of the entrance law. So in particular:

\frac{\lambda n_\lambda(A)}{\lambda n_\lambda(1)}=\mathbb{P}(\text{first mark when in }A)=\int_0^\infty \lambda e^{-\lambda t}P_t(0,A)dt=\lambda R_\lambda 1_A(0).

The resolvent is relatively easy to calculate explicitly, and so we can find the Laplace transform n_\lambda(A). From this it is generally possible to invert the transform to find the entrance law n_t.

References

A Guided Tour Through Excursions – L. C. G. Rogers.

This pair of posts is very much a paraphrase of chapters 3 and 4 of this excellent text. The original can be found here (possibly not open access?)