Mixing Times 3 – Convex Functions on the Space of Measures

The meat of this course covers rate of convergence of the distribution of Markov chains. In particular, we want to be thinking about lots of distributions simultaneously, so we really to be comfortable working with the space of measures on a (for now) finite state space. This is not really too bad actually, since we can embed it in a finite-dimensional real vector space.

\mathcal{M}_1(E)=\{(x_v:v\in\Omega),x_v\geq 0, \sum x_v=1\}\subset \mathbb{R}^\Omega.

Since most operations we might want to apply to distributions are linear, it doesn’t make much sense to inherit the usual Euclidean metric. In the end, the metric we use is the same as the L_1 metric, but the motivation is worth exploring. Typically, the size of |\Omega| will be function of n, a parameter which will tend to infinity. So we do not want to be too rooted in the actual set \Omega for what will follow.

Perhaps the best justification for total variation distance is from a gambling viewpoint. Suppose your opinion for the distribution of some outcome is \mu, and a bookmaker has priced their odds according to their evaluation of the outcome as \nu. You want to make the most money, assuming that your opinion of the distribution is correct (which in your opinion, of course it is!). So assuming the bookmaker will accept an arbitrarily complicated (but finite obviously, since there are only |\Omega| possible outcomes) bet, you want to place money on whichever event evinces the greatest disparity between your measure of likeliness and the bookmaker’s. If you can find an event which you think is very likely, and which the bookmaker thinks is unlikely, you are (again, according to your own opinion of the measure) on for a big profit. This difference is the total variation distance ||\mu-\nu||_{TV}.

Formally, we define:

||\mu-\nu||_{TV}:=\max_{A\subset\Omega}|\mu(A)-\nu(A)|.

Note that if this maximum is achieved at A, it is also achieved at A^c, and so we might as well go with the original intuition of

||\mu-\nu||_{TV}=\max_{A\subset\Omega} \left[\mu(A)-\nu(A)\right].

If we decompose \mu(A)=\sum_{x\in A}\mu(x), and similarly for A^c, then add the results, we obtain:

||\mu-\nu||_{TV}=\frac12\sum_{x\in\Omega}|\mu(A)-\nu(A)|.

There are plenty of other interesting interpretations of total variation distance, but I don’t want to get bogged down right now. We are interested in the rate of convergence of distributions of Markov chains. Given some initial distribution \lambda of X_0, we are interested in ||\lambda P^t-\pi||_{TV}. The problem is that doing everything in terms of some general \lambda is really annoying, at the very least for notational reasons. So really we want to investigate

d(t)=\max_{\lambda\in\mathcal{M}_1(E)}||\lambda P^t-\pi||_{TV},

the worst-case scenario, where we choose the initial distribution that mixes the slowest, at least judging at time t. Now, here’s where the space of measures starts to come in useful. For now, we relax the requirement that measures must be probability distributions. In fact, we allow them to be negative as well. Then \lambda P^t-\pi is some signed measure on \Omega with zero total mass.

But although I haven’t yet been explicit about this, it is easy to see that ||\cdot||_{TV} is a norm on this space. In fact, it is (equivalent to – dividing by 1/2 makes no difference!) the product norm of the L_1 norm as defined before. Recall the norms are convex functions. This is an immediate consequence of the triangle inequality. The set of suitable distributions \lambda is affine, because an affine combination of probability distributions is another probability distribution.

Then, we know from linear optimisation theory, that convex functions on an affine space achieve their maxima at boundary points. And the boundary points for this definition of \lambda\in\mathcal{M}_1(E), are precisely the delta-measures at some point of the state space \delta_v. So in fact, we can replace our definition of d(t) by:

d(t)=\max_{x\in\Omega}||P^t(x,\cdot)-\pi||_{TV},

where P^t(x,\cdot) is the same as (\delta_x P^t)(\cdot). Furthermore, we can immediately apply this idea to get a second result for free. In some problems, particularly those with neat couplings across all initial distributions, it is easier to work with a larger class of transition probabilities, rather than the actual equilibrium distribution, so we define:

\bar{d}(t):=\max_{x,y\in\Omega}||P^t(x,\cdot)-P^t(y,\cdot)||_{TV}.

The triangle inequality gives \bar{d}(t)\leq 2d(t) immediately. But we want to show d(t)\leq \bar{d}(t), and we can do that as before, by considering

\max_{\lambda,\mu\in\mathcal{M}_1(E)}||\lambda P^t-\mu P^t||_{TV}.

The function we are maximising is a convex function on \mathcal{M}_1(E)^2, and so it attains its maximum at a boundary point, which must be \lambda=\delta_x,\mu=\delta_y. Hence \bar{d}(t) is equal to the displayed expression above, which is certainly greater than or equal to the original formulation of d(t), as this is the maximum of the same expression over a strict subset.

I’m not suggesting this method is qualitatively different to that proposed by the authors of the book. However, I think this is very much the right way to be thinking about these matters of maximising norms over a space of measures. Partly this is good because it gives an easy ‘sanity check’ for any idea. But also because it gives some idea of whether it will or won’t be possible to extend the ideas to the case where the state space is infinite, which will be of interest much later.

Advertisement

Mixing Times 2 – Metropolis Chains

In our second reading group meeting for Mixing Times of Markov Chains, we reviewed chapters 3 and 4 of the Levin, Peres and Wilmer book. This post and the next contains a couple of brief thoughts about the ideas I found most interesting in each chapter.

Before reading chapter 3, the only thing I really knew about Monte Carlo methods was the slogan. If you want to sample from a probability distribution that you can’t describe explicitly, find a Markov chain which has that distribution as an equilibrium distribution, then run it for long enough starting from wherever you fancy. Then the convergence theorem for finite Markov chains means that the state of the chain after a long time approximates well the distribution you were originally looking for.

On the one previous occasion I had stopped and thought about this, I had two questions which I never really got round to answering. Firstly, what sort of distributions might you not be able to simulate directly? Secondly, and perhaps more fundamentally, how would you go about finding a Markov chain for which a given distribution is in equilibrium?

In the end, the second question is the one answered by this particular chapter. The method is called a Metropolis chain, and the basic idea is that you take ANY Markov chain with appropriate state space, then fiddle with the transition probabilities slightly. The starting chain is called a base chain. It is completely possible to adjust the following algorithm for a general base chain, but for simplicity, let’s assume it is possible to take an irreducible chain for which the transition matrix is symmetric. By thinking about the DBEs, this shows that the uniform distribution is the (unique) equilibrium distribution. Suppose the  transition matrix is given by \Psi(x,y), to copy notation from the book. Then set:

P(x,y)=\begin{cases}\Psi(x,y)\left[1\wedge \frac{\pi(y)}{\pi(x)}\right]&y\neq x\\ 1-\sum_{z\neq x} \Psi(x,z)\left [1\wedge \frac{\pi(z)}{\pi(x)}\right]& y=x.\end{cases}

Note that this second case (y=x) is of essentially no importance. It just confirms that the rows of P add to 1. It is easy to check from the DBEs that \pi is the equilibrium distribution of matrix P. One way to think of this algorithm is that we run the normal chain, but occasionally suppress transitions is they involve a move from a state which is likely (under \pi), to one which is less likely. This is done in proportion to the ratio, so it is unsurprising perhaps that the limit in distribution is \pi.

Conveniently, this algorithm also gives us some ideas for how to answer the first question. Note that at no point do we need to know \pi(x) for some state x. We only need to use \frac{\pi(x)}{\pi(y)} the ratios of probabilities. So this is perfect for distributions where there is a normalising constant which is computationally taxing to evaluate. For example, in the Ising model and similar statistical physics objects, probabilities are viewed more as weightings. There is a normalising constant, often called the partition function Z in this context, lying in the background, but especially the underlying geometry is quite exotic we definitely don’t want to have to worry about actually calculating Z. Thus we have a way to generate samples from such models. The other classic example is a random walk on a large, perhaps unknown graph. Then the equilibrium distribution at a vertex is inversely proportional to the degree of that vertex, but again you might not know about this information over the entire graph. It is reasonable to think of a situation where you might be able to take a random walk on a graph, say the connectivity graph of the internet, without knowing about all the edges at any one time. So, even though you potentially explore everywhere, you only need to know a small amount at any one time.

Of course, the drawback of both of these examples is that a lack of knowledge about the overall system means that it is hard in general to know how many steps the Metropolis chain must run before we can be sure that we are the equilibrium distribution it has been constructed to approach. So, while these chains are an excellent example to have in mind while thinking about mixing times, they are also a good motivation for the subject itself. General rules about speed of convergence to equilibrium are precisely what are required to make such implementation concrete.