Mixing Times 2 – Metropolis Chains

In our second reading group meeting for Mixing Times of Markov Chains, we reviewed chapters 3 and 4 of the Levin, Peres and Wilmer book. This post and the next contains a couple of brief thoughts about the ideas I found most interesting in each chapter.

Before reading chapter 3, the only thing I really knew about Monte Carlo methods was the slogan. If you want to sample from a probability distribution that you can’t describe explicitly, find a Markov chain which has that distribution as an equilibrium distribution, then run it for long enough starting from wherever you fancy. Then the convergence theorem for finite Markov chains means that the state of the chain after a long time approximates well the distribution you were originally looking for.

On the one previous occasion I had stopped and thought about this, I had two questions which I never really got round to answering. Firstly, what sort of distributions might you not be able to simulate directly? Secondly, and perhaps more fundamentally, how would you go about finding a Markov chain for which a given distribution is in equilibrium?

In the end, the second question is the one answered by this particular chapter. The method is called a Metropolis chain, and the basic idea is that you take ANY Markov chain with appropriate state space, then fiddle with the transition probabilities slightly. The starting chain is called a base chain. It is completely possible to adjust the following algorithm for a general base chain, but for simplicity, let’s assume it is possible to take an irreducible chain for which the transition matrix is symmetric. By thinking about the DBEs, this shows that the uniform distribution is the (unique) equilibrium distribution. Suppose the  transition matrix is given by \Psi(x,y), to copy notation from the book. Then set:

P(x,y)=\begin{cases}\Psi(x,y)\left[1\wedge \frac{\pi(y)}{\pi(x)}\right]&y\neq x\\ 1-\sum_{z\neq x} \Psi(x,z)\left [1\wedge \frac{\pi(z)}{\pi(x)}\right]& y=x.\end{cases}

Note that this second case (y=x) is of essentially no importance. It just confirms that the rows of P add to 1. It is easy to check from the DBEs that \pi is the equilibrium distribution of matrix P. One way to think of this algorithm is that we run the normal chain, but occasionally suppress transitions is they involve a move from a state which is likely (under \pi), to one which is less likely. This is done in proportion to the ratio, so it is unsurprising perhaps that the limit in distribution is \pi.

Conveniently, this algorithm also gives us some ideas for how to answer the first question. Note that at no point do we need to know \pi(x) for some state x. We only need to use \frac{\pi(x)}{\pi(y)} the ratios of probabilities. So this is perfect for distributions where there is a normalising constant which is computationally taxing to evaluate. For example, in the Ising model and similar statistical physics objects, probabilities are viewed more as weightings. There is a normalising constant, often called the partition function Z in this context, lying in the background, but especially the underlying geometry is quite exotic we definitely don’t want to have to worry about actually calculating Z. Thus we have a way to generate samples from such models. The other classic example is a random walk on a large, perhaps unknown graph. Then the equilibrium distribution at a vertex is inversely proportional to the degree of that vertex, but again you might not know about this information over the entire graph. It is reasonable to think of a situation where you might be able to take a random walk on a graph, say the connectivity graph of the internet, without knowing about all the edges at any one time. So, even though you potentially explore everywhere, you only need to know a small amount at any one time.

Of course, the drawback of both of these examples is that a lack of knowledge about the overall system means that it is hard in general to know how many steps the Metropolis chain must run before we can be sure that we are the equilibrium distribution it has been constructed to approach. So, while these chains are an excellent example to have in mind while thinking about mixing times, they are also a good motivation for the subject itself. General rules about speed of convergence to equilibrium are precisely what are required to make such implementation concrete.

Advertisement

DBEs and stationary distributions

Aside

The most recent Applied Probability assignment sheet featured various aspects of Detailed Balance Equations for continuous-time Markov chains. We discussed the advantages and disadvantages of using DBEs rather than solving for an equilibrium distribution directly. The equations used in this second case are often called Full Balance Equations.

Briefly, the advantages of DBEs are that they are easy to solve. After all, each one contains only two components of the equilibrium distribution, so generally you can solve one-at-a-time. The disadvantage is that an equilibrium distribution might not satisfy the DBEs. The deductive structure is:

\text{Solves DBEs}\quad \stackrel{\Rightarrow}{\not\Leftarrow}\quad\text{Equilibrium distribution}

Usually, the chain will be irreducible, so the equilibrium distribution is unique. This means that if we can solve the DBEs, the result is the unique equilibrium distribution.

The DBEs are soluble only if the situation is reversible. This is probably the best definition to use in practice, but informally we can say that this means that the behaviour looks qualitatively the same if we reverse time. For example, as in Q1:

Q=\begin{pmatrix}-1 &1&0\\ 0& -1&1\\1&0&-1\end{pmatrix},

gives the Q-matrix which equilibrium distribution (\frac13,\frac13,\frac13), which does not satisfy DBEs. The chain is not reversible because sample paths always go clockwise, so if we reversed time they would go anti-clockwise (or vice-versa depending on how you’ve drawn the diagram).

What I wanted to say in the class, and made a mess of explaining was this, about why it was inappropriate to use DBEs to find stationary distributions in Q3d):

Reversibility is not just a function of the chain. It is a function of the chain AND the initial distribution. This is only in practice a concern when the chain is reducible, but in this case it really can lead you astray. Let’s consider an example, like

Q=\begin{pmatrix}-3&2&0&0&1&0\\ 0&-4&3&1&0&0\\ 0&1&-4&3&0&0\\ 0&3&1&-4&0&0\\ 0&0&0&0&-5&5\\ 0&0&0&0&5&-5\end{pmatrix}.

Then by solving as in the problem sheet, the invariant distributions are given by:

\lambda(0,\frac13,\frac13,\frac13,0,0)+\mu(0,0,0,0,\frac12\frac12),\quad \lambda+\mu=1.

If you attempted to solve the DBEs, you would succeed, but the only solution would be

(0,0,0,0,\frac12,\frac12).

The explanation is fairly simple in the end. Reversibility is a class property, and only one of the communicating classes, \{5,6\} in this example admits a reversible initial distribution, so to solve the DBEs we must assign zero mass on the other class.

Anyway, I hope that clears up any residual confusion from the class.

Loss Networks and Erlang’s Fixed Point

Loss Networks

In Erlang’s telephone line model discussed in the previous post, we considered users competing for a finite set of resources. When insufficient resources are available, the call is lost. A loss network generalises this situation to more complicated resource configurations. We think of links 1, …, J, each with some integer capacity c_j. Each incoming call requires 1 units of each link in some subset of the set of links, and lasts for a time distributed as an exponential random variable with parameter 1, independent of everything else in the model. We call this subset the route, and denote by A_{jr} the incidence matrix of links on routes. Calls arrive as PP(\nu_r)$ independently for each route: no queueing occurs – a call is lost if some link required is operating at full capacity. We call the probability of this event L_r, the loss probability. Observe that (n_r), the number of calls on each route r, is a Markov chain on the truncated space \{An\leq c\}.

By checking the DBEs, it is clear that an ED for this Markov chain is proportional to the ED for the MC without the capacity constraint, with state-space restricted to the truncated space. But without capacity constraints, the system is a linear migration process, for which we discovered the form of the ED in the previous section. If we write H(c)=\mathbb{P}(An\leq c) in the linear migration process, we can compute the acceptance probability for the finite capacity system as:

1-L_r=\frac{H(C-Ae_r)}{H(C)}

Approximating Blocking Probabilities

We want to calculate B_j, the equilibrium blocking probability, that a given link j is full. We have two methods: firstly, to find the distribution for (n_r) with maximum probability, for which the blocking probabilities appear as shadow prices. And secondly, to make a reasonable approximation about blocking independence, and solve explicitly. We want to show that these methods give the same answers.

To maximise the probability \pi(n)\propto \prod_r \frac{\nu_r^{n_r}}{n_r!} on \{An\leq c\}, we take logs and maximise using Stirling’s approximation, which is reasonable as we are implicitly working under a regime where the throughput tends to infinity while preserving ratios.

The primal problem is

\max\quad \sum_r(x_r\log \nu_r-x_r\log x_r+x_r),\quad\text{s.t. }Ax\leq c

which has Lagrangian

L(x,y,z)=\sum_r x_r+\sum_r x_r(\log \nu_r-\log x_r-\sum_j y_jA_{jr})+\sum_j y_jc_j-\sum_j y_jc_j

We observe that complementary slackness here has the form y.z=0, and remember that by Strong Duality, which applies here because everything relevant is convex, this equality holds at the primal optimum. Differentiating the Lagrangian at the optimum allows us to specify the optimal x in terms of y:

\bar{x}_r=\nu_r e^{-\sum y_jA_{jr}}

The dual problem is then to minimise

\min\quad \sum_r \nu_re^{-\sum_jy_jA_{jr}}+\sum_j y_jc_j

At this point, we make the suggestive substitution e^{-y_j}=1-B_j, observing that this gives B non-negative by default since y is non-negative. After further work, we will deduce that these B do indeed have a sensible interpretation as blocking probabilities, but it should be stressed that this is in no way obvious yet. Now complementary slackness asserts:

\sum_rA_{jr}\nu_r\prod_i(1-B_i)^{A_{ir}}\left\{\begin{array}{l l}=c_j& \quad B_j>0\\ \leq c_j & \quad B_j=0\\ \end{array} \right.

Note that the primal objective function is strictly convex so \bar{x} as discussed is the unique optimum. The dual objective is strictly convex in yA, so if A has full rank J, this induces a unique optimum in terms of y. We assume A is full rank (since for example we can perturb slightly) and that there is no degeneracy in the blocking.

Now we consider a sequence of networks with proportionally increasing arrival rates and capacities Continue reading

Queues and Migration Processes

Simple Queues

A queue generally has the form of a countable state-space Markov Chain. Here we assume that customers are served in the order they arrive (often referred to as: FIFO – First In First Out). The standard Kendall notation is M/M/C/K. Here the Ms stand for Markov (or memoryless) arrivals, and service times respectively, possibly replaced by G if the process admits more general distributions. C is the number of servers, and K the capacity of the system.

The first example is a M/M/C/C queue, motivated by a telephone network. Here, there are c lines, and an arriving call is lost if all lines are busy at the arrival time. We assume arrivals are a PP(\lambda) process, and the call times are iid Exp(\mu) RVs. We record the number of busy lines, which is a continuous-time Markov chain on state-space [0,c]. As usual, we assume the system operates in equilibrium, and so in particular we must have \lambda<\mu c. It is easy to find the equilibrium distribution. The only non-zero transition probabilities are

q(i-1,i)=\lambda, \quad q(i,i-1)=\mu i\quad i=1,\ldots,c

and so can that that \pi_i=\frac{\nu_i}{i!}\pi_0 satisfies the DBEs where \nu:=\frac{\lambda}{\mu}, sometimes called the traffic intensity. This gives

\pi_c=\mathbb{P}(c\text{ lines busy})=\frac{\nu^c}{c!}\left(\sum_{k=0}^c \frac{\nu^k}{k!}\right)^{-1}

and we define this to be E(\nu,c), Erlang’s formula.

Note that if (X(t),t\in\mathbb{R}) is a MC in equilibrium, (X(-t),t\in\mathbb{R}) is a MC with the same ED, modulo style of discontinuities (ie whether transitions are left-continuous or right-continuous). Therefore, in any queue where all customers get served, eg an M/M/1 queue, for which \lambda<\mu (otherwise the MC is transient, so no ED!), the departure process is the same (in distribution) as the arrivals process.

We can check that this holds for a series of M/M/1 queues, and that in equilibrium, the sizes of the queues are independent. This is merely an extension of the observation that future arrivals for a given queue are independent of the present, and likewise past departures are independent of the future, but the argument is immediately obvious.

Migration Processes

We consider a closed migration process on J colonies, with populations described by a vector n=(n_1,\ldots,n_J). We say that the instantaneous rate of movement of an individual from colony j to colony k is q(n,T_{jk}n)=\lambda_{jk}\phi_j(n_j), for some functions \phi_j(0)=0. We can describe the ED of this system in terms of the distribution obtained when a single individual performs a random walk through the state space, with \phi_j(1) taken to be 1 for each j. We call the DBEs for this case the traffic equations, with solutions (\alpha_j,j\in J). Then, by checking the PBEs, it is clear that the equilibrium distribution of the original migration process satisfies

\pi(n)\propto \prod_{j=1}^J \frac{\alpha_j^{n_j}}{\prod_{r=1}^{n_j}\phi_j(r)}

This is important, as it would have been computationally difficult to solve the original equations for an ED.

The same result holds for an open migration process, where individuals can enter and leave the system, arriving at colony j at rate \nu_j, and leaving at rate \mu_k\phi_k(n_k). Note that this has the same form as if each colony was served by a PP(\alpha_j\lambda_j), with departures at rate \lambda_j\phi_j(n_j):=(\mu_j+\sum_k\lambda_{jk})\phi_j(n_j). But (obviously) this interpretation is not equivalent to the model

The time reversal is also an OMP. One has to check that the transition rates have the correct form, and so the exit process from each colony (in equilibrium, naturally), is PP(\alpha_j\mu_j)$. Most importantly, given a communicating class, we can think of the restriction to this class as an OMP, so in particular, rates of transition between classes are Poisson.

Little’s Law

Informally, the mean number of customers should be equal to the arrival rate multiplied by the mean sojourn time in the system of a customer. This is easiest to formalise by taking an expectation up to a regeneration time. This is T, the first time the system returns to its original state (assumed to be 0 customers), an a.s. finite stopping time.

Set L:=\mathbb{E}\frac{1}{T}\int_0^Tn(s)ds, the average number of customers, and W:=\frac{\mathbb{E}\sum_1^N W_n}{\mathbb{E}N} where N is the number of customers arriving in [0,T], and W_n is the waiting time of the n-th customer.

Little’s Law asserts that L=\lambda W. Note that for a Markov Chain in equilibrium, can define L more simply as \mathbb{E}n and similarly for W.

It is easiest proved by considering the area between the arrivals process and the departure process in two ways: integrating over height and width. Note that working up to a regeneration time is convenient because at that time the processes are equal.

The migration processes above are said to be linear if \phi_j(n_j)=n_j. This process has the form of a Markov chain, and so even in a more general state space, the distribution of points in disjoint subsets of the state-space are independent Poisson processes.

Often though, we start with no individuals in the system, but still the distribution is given by a time-inhomogenous Poisson random measure. The mean is specified by

M(t,E)=\nu\int_0^t P(u,E)du

where \nu is the net arrival rate, and P(u,E) is the probability that an individual is in E, a time interval of u after arriving.

As one would suspect, this is easiest to check through generating functions, since independence has a straightforward generating function analogue, and the expression for a Poisson RV is manageable.