Invariant Distributions of Markov Chains

My lecture course in Linyi was all about Markov chains, and we spent much of the final two sessions discussing the properties of invariant distributions. I was not surprised, however, that none of the class chose this topic as the subject for a presentation to give after the end of the teaching week. One of the main problems is that so many rather similar properties are introduced roughly simultaneously. As we did in the class, I thought it was worth making some sort of executive summary, as a mixture of revision and interest.

Definition: \pi is an invariant measure if \pi P=\pi. If in addition \sum_{i\in I}\pi_i=1, then we say it is an invariant distribution. Of course, if I is finite, then any invariant measure can be normalised to give an invariant distribution.

The key initial questions are about existence and uniqueness. First, if there are multiple communicating classes, then an invariant measure (resp. distribution) is a linear (resp. convex) combination of the invariant measures / distributions on each (closed) class. So we restrict attention to irreducible Markov chains.

In the finite case, P is a stochastic matrix so has a column eigenvector with eigenvalue 1, namely the vector with all entries equal to 1. Thus, by general theory in linear algebra, P also has a row eigenvector \pi with eigenvalue 1. To paraphrase a remark made by one of my students, what is not clear is that this should be a measure (that is, that it can be taken to have non-negative entries). Demonstrating that this is true is rather non-trivial I think, normally done by reference to the rather more general Perron-Frobenius theorem, though on the flight home I came up with a short argument using Lagrangian duality. For now, we accept existence in the finite case, and note that we typically show existence by showing that the vector of expected times spent in each state between successive visits to a fixed reference state satisfies the properties of an invariant measure.
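
As a quick illustration (a minimal numpy sketch; the matrix P here is an arbitrary three-state example, not one from the course), we can find \pi numerically as the left eigenvector with eigenvalue 1:

```python
import numpy as np

# An arbitrary illustrative stochastic matrix (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Left eigenvectors of P are right eigenvectors of P transpose.
eigvals, eigvecs = np.linalg.eig(P.T)
i = np.argmin(np.abs(eigvals - 1))   # locate the eigenvalue 1
pi = np.real(eigvecs[:, i])
pi /= pi.sum()                       # normalise to a distribution

print(pi)            # the invariant distribution
print(pi @ P - pi)   # ~ zero vector, confirming pi P = pi
```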

This is a good moment to note that recurrence is not a necessary condition for the existence of an invariant measure. For example, the random walk on \mathbb{Z}^3 is transient, but the uniform measure is invariant. However, it is not a sufficient condition for the existence of an invariant distribution either. (Of course, an irreducible finite chain is always recurrent, and always has an invariant distribution, so now we are considering only the infinite state space case.) The random walk on \mathbb{Z}^2 is recurrent, but the invariant measure is not normalisable.

The property we in fact need is positive recurrence. This says that the expected return time to each point is finite. Again, this is a class property. This is a common requirement in probabilistic arguments: almost surely finite is often not strong enough to show results if the expectation is infinite (see for example the various conditions for the optional stopping theorem). If positive recurrence holds, then \pi_i=\frac{1}{\mathbb{E}T_i}, where T_i is the return time to state i, starting from i.
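
A quick simulation makes the identity \pi_i=1/\mathbb{E}T_i concrete (a sketch, using the same arbitrary illustrative matrix as the previous snippet):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same arbitrary illustrative chain as in the previous sketch.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

def return_time(i):
    """Run the chain from state i until it first returns to i."""
    state, steps = i, 0
    while True:
        state = rng.choice(len(P), p=P[state])
        steps += 1
        if state == i:
            return steps

mean_T0 = np.mean([return_time(0) for _ in range(20000)])
print(1 / mean_T0)   # should be close to pi_0 from the previous sketch
```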

The final question is ‘Why are we interested?’ One of the best answers is to look at convergence properties. A simple first observation is this: if we start in equilibrium, then X_0,X_1,X_2,\ldots are all equal in distribution. Note that the dependence structure remains complicated, and much more interesting than the individual distributions. Next, we observe that a calculation of n-step transition probabilities for a finite chain will typically involve a linear combination of nth powers of eigenvalues. One of the eigenvalues is 1, and in the well-behaved case the others have modulus strictly less than 1. We observe in examples that the constant coefficient in p_{ij}^{(n)} is generally a function of j alone, and so p_{ij}^{(n)}\rightarrow\lambda_j, for some distribution (\lambda_j)_{j\in I} on I. By considering P^{n+1}=P\cdot P^n, it is easy to see that if this converges, (\lambda_j)_{j\in I} is an invariant distribution. The classic examples which do not work are

P=\begin{pmatrix}0&1\\1&0\end{pmatrix} and P=\begin{pmatrix}0&1&0\\ 0&0&1\\1&0&0\end{pmatrix},

as then the distribution of X_n is a function of the remainder of n modulo 2 (respectively modulo 3) alone. With a little thought, we can give a precise classification of such chains, which force you to be in particular proper subsets of the state space at regular times n. Chains without this property are called aperiodic, and we can show that the distributions of such chains converge to the equilibrium distribution as n\rightarrow\infty.
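
One can see both behaviours directly by taking matrix powers (a sketch; the aperiodic matrix is an arbitrary example, while the periodic one is the 2x2 swap matrix above):

```python
import numpy as np

P_aperiodic = np.array([[0.5, 0.5],
                        [0.2, 0.8]])   # arbitrary aperiodic example
P_periodic = np.array([[0., 1.],
                       [1., 0.]])      # the swap matrix: period 2

print(np.linalg.matrix_power(P_aperiodic, 50))  # both rows approach the invariant distribution
print(np.linalg.matrix_power(P_periodic, 50))   # the identity (n even)
print(np.linalg.matrix_power(P_periodic, 51))   # the swap (n odd): no convergence
```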

The Poisson Process – A Third Characterisation

There remains the matter of the distribution of the number of people to arrive in a fixed, non-infinitesimal time interval. Consider the time interval [0,1], which we divide into n smaller intervals of equal width. As n grows large, the probability that two arrivals occur in the same small interval tends to zero (as it is \leq n\cdot o(\frac{1}{n})), so we can treat this as a sequence of iid Bernoulli random variables as before. So

\mathbb{P}(N_1=k)=\binom{n}{k}(\frac{\lambda}{n})^k(1-\frac{\lambda}{n})^{n-k}
\approx \frac{n^k}{k!} \frac{\lambda^k}{n^k}(1-\frac{\lambda}{n})^n\approx \frac{\lambda^k}{k!}e^{-\lambda}.

We recognise this as belonging to a Poisson (hence the name of the process!) random variable. We can repeat this for a general time interval and obtain N_t\sim \text{Po}(\lambda t).
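
The limit is easy to check numerically (a sketch; the values \lambda=2 and n=1000 are arbitrary illustrative choices):

```python
from math import comb, exp, factorial

lam, n = 2.0, 1000   # arbitrary illustrative values

for k in range(6):
    binomial = comb(n, k) * (lam / n) ** k * (1 - lam / n) ** (n - k)
    poisson = lam ** k / factorial(k) * exp(-lam)
    print(k, round(binomial, 5), round(poisson, 5))   # the two columns agree closely
```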

Note that we implicitly assumed that, in the infinitesimal case at least, behaviour in disjoint intervals was independent. We would hope that this would lift immediately to the large intervals, but it is not immediately obvious how to make this work. This property of independent increments is one of the key definitions of a Lévy process, of which the Poisson process is one of the two canonical examples (the other is Brownian Motion).

As before, if we can show that the implication goes both ways (and for this case it is not hard – letting t\rightarrow 0 clearly recovers the infinitesimal construction), we can prove results about Poisson random variables with ease, for example that the sum of independent \text{Po}(\lambda) and \text{Po}(\mu) random variables is \text{Po}(\lambda+\mu).

This pretty much concludes the construction of the Poisson process. We have three characterisations:
1) The holding times S_1,S_2,\ldots between arrivals are iid \text{Exp}(\lambda).
2) The infinitesimal construction as before, with independence.
3) The number of arrivals in a time interval of width t is \text{Po}(\lambda t). (This is sometimes called a stationary increments property.) Furthermore, we have independent increments.

A formal derivation of the equivalence of these forms is important but technical, and so not really worth attempting here. See James Norris’s book for example for a fuller exposition.
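
Though a formal proof is beyond us here, a simulation gives an informal sanity check that characterisations (1) and (3) agree (a sketch; \lambda=3 and t=2 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

lam, t, trials = 3.0, 2.0, 50000   # arbitrary illustrative parameters

counts = np.empty(trials, dtype=int)
for i in range(trials):
    # Characterisation (1): arrival times are partial sums of iid Exp(lam) holding times.
    arrivals = np.cumsum(rng.exponential(1 / lam, size=50))  # 50 is comfortably more than needed
    counts[i] = np.searchsorted(arrivals, t)                 # number of arrivals in [0, t]

# Characterisation (3): N_t should be Po(lam * t), so mean and variance both lam * t.
print(counts.mean(), counts.var())   # both should be close to 6
```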

The final remark is that the Poisson Process has the Markov property. Recall that this says that conditional on the present, future behaviour is independent of the past. Without getting into too much detail, we might like to prove this by using the independent increments property. But remember that for a continuous-time process, it is too much information to keep track of all the distributions at once. It is sufficient to track only the finite-dimensional marginal distributions, provided the process is càdlàg, which the Poisson process is, assuming we deal with the discontinuities in the right way. Alternatively, the exponential random variable is memoryless, a property that can be lifted, albeit with some technical difficulties, to show the Markov property.

The Poisson Process – Distributions of Holding Times

So, I was planning to conclude my lecture course on Markov Chains at the Cambridge-Linyi summer school with an hour devoted to the Poisson Process. Unfortunately, working through the discrete time material and examples took longer than I’d expected, so we never got round to it. As I’d put a fair bit of work into the preparation, I figured I might as well write it up here instead.

We need a plausible mathematical model for a situation where people or events arrive randomly in continuous time, in a manner that is as close as possible to the notion of iid random variables. In discrete time, a natural candidate would be a set of iid Bernoulli random variables. For example, with probability p a single bus will arrive in a time interval of a minute. With probability 1-p, no bus will arrive. We might have some logistical motivation for why it is not possible that two or more arrive in a given interval, or we could instead choose a more complicated distribution.

One way to proceed would be to specify the distribution of the times between arrivals. These should be independent and identically distributed, at least intuitively. However, although we might be able to give a sensible guess right now, it is not immediately clear what this distribution should be. For now, we merely remark that the arrival times are called X_1,X_2,\ldots, and the holding times between arrivals are defined by
S_1=X_1, S_n=X_n-X_{n-1},n\geq 2.

In fact the motivating discrete example gives us much of the machinery we will actually need. Recall that when we define probability distributions for continuously-valued random variables we need a different plan of attack than for discrete RVs. Whereas in the discrete case it is enough to specify the probability of each outcome, for a continuous random variable we have to specify the probabilities of intervals, and take care that they have the obvious additive and nesting properties that we want. Taking the integral (whether Riemann or Lebesgue) of a so-called density function is a natural way to do this.

Similarly here, we build up from small time intervals. The first remark is this: it is natural that the expected number of arrivals in the first minute is equal to the expected number of arrivals in the second minute. After all, we are considering the most general process possible. If there is an infinite pool of potential arriving agents, then behaviour in the second minute should be independent of, and equal in distribution to, behaviour in the first minute. We can naturally extend this idea to a linear relation. If N_s is the number of arrivals in the time interval [0,s], then we should have \mathbb{E}N_s=\lambda s, where \lambda is some constant of proportionality, equal to \mathbb{E}N_1.

The key remark is that as s\rightarrow 0, \mathbb{P}(N_s=1) becomes small, and \mathbb{P}(N_s\geq 2) becomes very small. In fact it suffices that \mathbb{P}(N_s \geq 2)=o(s), as combined with \mathbb{E}N_s=\lambda s this implies:

\mathbb{P}(N_s=0)=1-\lambda s+o(s),\quad \mathbb{P}(N_s=1)=\lambda s+o(s).

Note that we are not currently attempting a formal construction of the process. As always, finding a probability space with enough freedom to equip a continuous process is a fiddly task. We are for now just trying to work out what the distribution of the holding times between arrivals should be. There are obvious advantages to defining the process as a collection of iid random variables, for example that we can construct it on a product space.

To do this, we split the time interval [0,1] into blocks

[0,1]=[0,\frac{1}{n}]\cup[\frac{1}{n},\frac{2}{n}]\cup\ldots\cup[\frac{n-1}{n},1]

Then the probability that someone arrives in the interval [\frac{k}{n},\frac{k+1}{n}] is, to leading order, \frac{\lambda}{n}. So

\mathbb{P}(\text{no-one arrives in time }[0,1])=(1-\frac{\lambda}{n})^n\approx e^{-\lambda}.

As we are working in an n\rightarrow\infty regime, we can replace 1 by a general time t (so that each of the n intervals now has width \frac{t}{n}), to obtain:

\mathbb{P}(\text{no-one arrives in time }[0,t])=(1-\frac{\lambda t}{n})^n\approx e^{-\lambda t}.

So the distribution function of the first arrival time is F(t)=1-e^{-\lambda t} in the conventional notation.
Thus X_1\sim \text{Exp}(\lambda).
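
This discretised construction is easy to simulate directly (a sketch; the parameters are arbitrary): the first success of a Bernoulli(\frac{\lambda}{n}) sequence, rescaled by n, should look exponential.

```python
import numpy as np

rng = np.random.default_rng(2)

lam, n, trials = 1.5, 1000, 20000   # arbitrary illustrative parameters

# Slot index of the first success in iid Bernoulli(lam/n) trials, rescaled to real time.
first_arrival = rng.geometric(lam / n, size=trials) / n

print(first_arrival.mean())        # should be close to 1 / lam
print(np.mean(first_arrival > 1))  # should be close to exp(-lam), the survival function at 1
```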

However, to emphasise how useful the infinitesimal definition is for actual problems, consider these examples.

1) If we have two arrival processes at the same object, for example arriving from the left and the right at the same shop, say with rates \lambda,\mu, then we want to show that the first arrival time is still exponential. Because of the linearity of expectation property, it is clear from the definition that the total arrivals process is Poisson with rate \lambda+\mu, and so the result follows. Showing this by examining the joint distribution of two exponential random variables is also possible, but much less elegant.

2) Similarly, if we have two shops, A and B, and each arriving person chooses one at random, then the first arrival time at A is: with probability 1/2 distributed as Exp(\lambda), with probability 1/4 as \Gamma(2,\lambda), and so on. A fairly non-trivial calculation is required to show that this is the same as Exp(\frac{1}{2}\lambda), whereas this follows almost instantly using the infinitesimal definition.
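
To see what the infinitesimal definition is saving us from, here is the mixture from example (2) checked by brute force (a sketch; \lambda=2 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)

lam, trials = 2.0, 50000   # arbitrary illustrative parameters

# The first arrival at shop A is the K-th arrival overall, K geometric(1/2),
# and conditional on K it is Gamma(K, lam). The claim is the mixture is Exp(lam/2).
K = rng.geometric(0.5, size=trials)
mixture = rng.gamma(shape=K, scale=1 / lam)

print(mixture.mean())         # should be close to 2 / lam, the mean of Exp(lam/2)
print(np.mean(mixture > 1))   # should be close to exp(-lam/2)
```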

Moral: with the infinitesimal definition, the difficulties are all in the probability space. However, once we have settled those problems, everything else is pleasant, as the key property (the growth of expected arrivals) is linear. Whereas for a construction by iid jump times, the existence and well-definedness are clear, but even for a distribution as tractable as the exponential random variable, manipulation can be tricky.

Queues and Migration Processes

Simple Queues

A queue generally has the form of a countable state-space Markov chain. Here we assume that customers are served in the order they arrive (often referred to as FIFO – First In First Out). The standard Kendall notation is M/M/C/K. The two Ms stand for Markov (or memoryless) arrival and service times respectively; either may be replaced by G if the process admits more general distributions. C is the number of servers, and K the capacity of the system.

The first example is an M/M/C/C queue, motivated by a telephone network. Here, there are c lines, and an arriving call is lost if all lines are busy at the arrival time. We assume arrivals form a PP(\lambda) process, and the call times are iid Exp(\mu) RVs. We record the number of busy lines, which is a continuous-time Markov chain on the state space \{0,1,\ldots,c\}. As usual, we assume the system operates in equilibrium; note that since this chain is finite and irreducible, an equilibrium distribution always exists. It is easy to find. The only non-zero transition rates are

q(i-1,i)=\lambda,\quad q(i,i-1)=\mu i,\qquad i=1,\ldots,c

and so one can check that \pi_i=\frac{\nu^i}{i!}\pi_0 satisfies the DBEs, where \nu:=\frac{\lambda}{\mu} is sometimes called the traffic intensity. This gives

\pi_c=\mathbb{P}(c\text{ lines busy})=\frac{\nu^c}{c!}\left(\sum_{k=0}^c \frac{\nu^k}{k!}\right)^{-1}

and we define this to be E(\nu,c), Erlang’s formula.
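
Erlang's formula is simple to compute (a minimal sketch; the values passed to erlang_b below are arbitrary illustrative choices):

```python
from math import factorial

def erlang_b(nu: float, c: int) -> float:
    """E(nu, c): equilibrium probability that all c lines are busy,
    where nu = lambda / mu is the traffic intensity."""
    weights = [nu ** k / factorial(k) for k in range(c + 1)]
    return weights[c] / sum(weights)

print(erlang_b(2.0, 5))   # blocking probability with traffic intensity 2 and 5 lines
```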

Note that if (X(t),t\in\mathbb{R}) is an MC in equilibrium, then (X(-t),t\in\mathbb{R}) is an MC with the same ED, modulo the style of the discontinuities (ie whether transitions are left-continuous or right-continuous). Therefore, in any queue where all customers get served, eg an M/M/1 queue for which \lambda<\mu (otherwise the MC is transient, so there is no ED!), the departure process is the same (in distribution) as the arrivals process. This is Burke's theorem.

We can check that this holds for a series of M/M/1 queues, and that in equilibrium, the sizes of the queues are independent. This is essentially an extension of the observation that future arrivals for a given queue are independent of the present, and likewise past departures are independent of the future, but the argument is not immediately obvious.

Migration Processes

We consider a closed migration process on J colonies, with populations described by a vector n=(n_1,\ldots,n_J). We say that the instantaneous rate of movement of an individual from colony j to colony k is q(n,T_{jk}n)=\lambda_{jk}\phi_j(n_j), for some functions \phi_j with \phi_j(0)=0. We can describe the ED of this system in terms of the distribution obtained when a single individual performs a random walk through the colonies, with \phi_j(1) taken to be 1 for each j. We call the balance equations for this single-individual case the traffic equations, with solutions (\alpha_j,j\in J). Then, by checking the partial balance equations (PBEs), it is clear that the equilibrium distribution of the original migration process satisfies

\pi(n)\propto \prod_{j=1}^J \frac{\alpha_j^{n_j}}{\prod_{r=1}^{n_j}\phi_j(r)}

This is important, as it would have been computationally difficult to solve the original equations for an ED.
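
As a numerical sanity check of the product form (a sketch: J=3 colonies, N=2 individuals, linear rates \phi_j(n)=n, and an arbitrary illustrative rate matrix \lambda_{jk}), we can solve the traffic equations, build \pi from the product formula, and verify \pi Q=0 directly:

```python
import numpy as np
from math import factorial
from itertools import product

J, N = 3, 2
lam = np.array([[0., 1., 2.],     # arbitrary illustrative movement rates lambda_jk
                [2., 0., 1.],
                [1., 1., 0.]])

# Traffic equations: alpha is invariant for one individual moving at rates lambda_jk.
Q1 = lam - np.diag(lam.sum(axis=1))
evals, evecs = np.linalg.eig(Q1.T)
alpha = np.real(evecs[:, np.argmin(np.abs(evals))])
alpha /= alpha.sum()

# Product form over the closed state space; with phi_j(r) = r the denominator is n_j!.
states = [n for n in product(range(N + 1), repeat=J) if sum(n) == N]
w = np.array([np.prod([alpha[j] ** n[j] / factorial(n[j]) for j in range(J)])
              for n in states])
pi = w / w.sum()

# Full generator: q(n, T_jk n) = lambda_jk * phi_j(n_j) = lambda_jk * n_j.
idx = {n: i for i, n in enumerate(states)}
Q = np.zeros((len(states), len(states)))
for n in states:
    for j in range(J):
        for k in range(J):
            if j != k and n[j] > 0:
                m = list(n); m[j] -= 1; m[k] += 1
                Q[idx[n], idx[tuple(m)]] += lam[j, k] * n[j]
    Q[idx[n], idx[n]] = -Q[idx[n]].sum()

print(np.max(np.abs(pi @ Q)))   # ~ 0: the product form is indeed invariant
```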

The same result holds for an open migration process, where individuals can enter and leave the system, arriving at colony j at rate \nu_j, and leaving at rate \mu_k\phi_k(n_k). Note that this has the same form as if each colony were served by a PP(\alpha_j\lambda_j), with departures at rate \lambda_j\phi_j(n_j):=(\mu_j+\sum_k\lambda_{jk})\phi_j(n_j). But (obviously) this interpretation is not equivalent to the model.

The time reversal is also an OMP. One has to check that the transition rates have the correct form, and so the exit process from each colony (in equilibrium, naturally) is PP(\alpha_j\mu_j). Most importantly, given a communicating class, we can think of the restriction to this class as an OMP, so in particular, rates of transition between classes are Poisson.

Little’s Law

Informally, the mean number of customers should be equal to the arrival rate multiplied by the mean sojourn time in the system of a customer. This is easiest to formalise by taking an expectation up to a regeneration time. This is T, the first time the system returns to its original state (assumed to be 0 customers), an a.s. finite stopping time.

Set L:=\mathbb{E}\frac{1}{T}\int_0^T n(s)ds, the average number of customers, and W:=\frac{\mathbb{E}\sum_{n=1}^N W_n}{\mathbb{E}N}, where N is the number of customers arriving in [0,T], and W_n is the time spent in the system by the n-th customer.

Little’s Law asserts that L=\lambda W. Note that for a Markov chain in equilibrium, we can define L more simply as \mathbb{E}n, and similarly for W.

It is most easily proved by considering the area between the arrivals process and the departures process in two ways: integrating over height and width. Note that working up to a regeneration time is convenient because at that time the two processes are equal.
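
The area argument translates directly into a simulation for, say, an M/M/1 queue (a sketch; \lambda=1 and \mu=1.5 are arbitrary, and the recursion for departure times assumes FIFO with a single server):

```python
import numpy as np

rng = np.random.default_rng(4)

lam, mu, n = 1.0, 1.5, 100000   # arbitrary illustrative parameters, lam < mu

A = np.cumsum(rng.exponential(1 / lam, n))   # arrival times
S = rng.exponential(1 / mu, n)               # service times

D = np.empty(n)   # departure times: service starts when the server frees up
t = 0.0
for i in range(n):
    t = max(t, A[i]) + S[i]
    D[i] = t

sojourn = D - A
L = sojourn.sum() / D[-1]   # area / width: time-average number in system
W = sojourn.mean()          # area / number of customers: mean sojourn time
lam_hat = n / A[-1]         # observed arrival rate

print(L, lam_hat * W)       # the two sides of Little's law; approximately equal
print(lam / (mu - lam))     # the theoretical M/M/1 value rho / (1 - rho)
```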

The migration processes above are said to be linear if \phi_j(n_j)=n_j. In this case each individual moves independently as a Markov chain, and so, even in a more general state space, the numbers of individuals in disjoint subsets of the state space are independent Poisson random variables.

Often though, we start with no individuals in the system, but still the distribution is given by a time-inhomogeneous Poisson random measure. The mean is specified by

M(t,E)=\nu\int_0^t P(u,E)du

where \nu is the net arrival rate, and P(u,E) is the probability that an individual is in E at a time u after arriving.

As one would suspect, this is easiest to check through generating functions, since independence has a straightforward generating function analogue, and the expression for a Poisson RV is manageable.

Feller Processes and the Strong Markov Property

Markov Property

We go way back to the Part 1B short course Markov Chains. In the first lecture of this course, we met discrete time Markov Chains. A definition was given, in terms of conditional single period transition probabilities, and it was immediately proved that general transition probabilities are specified by sums of products of entries in a so-called transition matrix. This proves very useful for performing calculations. But the question will inevitably be asked: “Prove that this is a Markov process.” And the answer “because it obviously is” isn’t good enough.

The point is that all of the above is relevant to the setting, but the ideal definition of a Markov process is something like the very general statement:

Conditional on the present, the past and the future are independent.

This opens up the possibility of a large class of processes being Markov processes. A technical definition would be that for s<t and A a measurable subset of the state space:

\mathbb{P}(X_t\in A|\mathcal{F}_s)=\mathbb{P}(X_t\in A|\sigma(X_s)).

It is easy to check that this is equivalent to the original definition in that context.

Strong Markov Property

SMP states that given a stopping time T, conditional on the event \{T<\infty\}:

(X_{T+t}-X_T, t\geq 0)\stackrel{d}{=}(X_t^0,t\geq 0),

that is, the process started at time T has the same distribution as the original process started from 0 (in space as well as time). Furthermore, it is independent of \mathcal{F}_T, which requires technical definition, but which informally is the sigma field of events defined by what happens up to time T.

For a discrete-time Markov chain, we prove the SMP by pre-multiplying by the indicator function 1(T=n), which reduces the SMP to the ordinary Markov property, then taking the countable sum over n (which is permissible). For Brownian Motion in one dimension, we make dyadic approximations to the stopping time from above. The SMP applies to these approximations, and measure-theoretic machinery and the (almost sure) continuity of paths allow the equality of distributions to hold in the limit. Independence follows by expressing \mathcal{F}_T=\cap_n \mathcal{F}_{T_n} as the intersection of the sigma-fields corresponding to the approximations.
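
For the discrete-time case, the calculation is short enough to sketch. Take A\in\mathcal{F}_T, so that A\cap\{T=n\}\in\mathcal{F}_n by the definition of \mathcal{F}_T. Then

\mathbb{P}(X_{T+1}=j,\ A,\ T=n,\ X_T=i)=\mathbb{P}(X_{n+1}=j,\ A\cap\{T=n\},\ X_n=i)=p_{ij}\,\mathbb{P}(A,\ T=n,\ X_T=i),

by the ordinary Markov property at the deterministic time n. Summing over n (a countable sum of non-negative terms, so permissible) gives

\mathbb{P}(X_{T+1}=j,\ A,\ T<\infty,\ X_T=i)=p_{ij}\,\mathbb{P}(A,\ T<\infty,\ X_T=i),

which is the one-step SMP; iterating extends this to the whole post-T process.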

In both of these cases, an implicit countable substructure (discrete time and continuity of paths respectively) has been required to deduce the SMP. This suggests that there are Markov processes which do not have the SMP.

Motivating Counterexample

Take B to be a Brownian Motion in one dimension, with B_0 an RV whose support contains 0. Now define the process:

X_t=B_t1_{B_0\neq 0}.

Then X is certainly not strong Markov: consider the hitting time T of 0. The process started there is either identically 0 or a standard BM, but which of the two it is depends on time-0 information rather than time-T information.

But X is Markov. Take s<t and A Borel. Then:

\mathbb{P}(X_t\in A|\mathcal{F}_s)=\mathbb{E}[1(X_t\in A, X_0\neq 0)+1(X_t\in A, X_0=0)|\mathcal{F}_s]

=1(X_0\neq 0)\int_A \frac{1}{\sqrt{2\pi(t-s)}}\exp\left(-\frac{(X_s-y)^2}{2(t-s)}\right)dy+1(X_0=0)1(0\in A).

Since X_0=0 forces X_s=0, we have 1(X_0\neq 0)=1(X_s\neq 0)+1(X_s=0,X_0\neq 0) and 1(X_0=0)=1(X_s=0)-1(X_s=0,X_0\neq 0), so this equals

1(X_s\neq 0)\int_A(\ldots)dy + 1(X_s=0,X_0\neq 0)\int_A(\ldots)dy + 1(0\in A)\left[1(X_s=0)-1(X_s=0,X_0\neq 0)\right].

Now 1(X_s=0, X_0\neq 0)=0 a.s., so the whole expression is

\mathbb{E}[1(X_t\in A)|X_s],

which is precisely the Markov property.

Feller Property

In general, it is hard to verify the Strong Markov Property for a given stochastic process. Instead, we consider a property which is stronger, but easier to check.