Perturbation of Eigenvectors

In the previous post, I talked about eigenvalues, and some alternative characterisations which could be useful in some circumstances. Recently, I’ve been interested in controlling how eigenvalues and eigenvectors change as the matrix is varied. My particular example concerns positive matrices, which have a well-defined largest eigenvalue (or Perron root), and a unique (up to normalising in some way) principal eigenvector.

We might expect that perturbing a matrix slightly does not change the eigenvectors very much, since any original eigenvector is still almost an eigenvector, in the sense that its image under the action of the perturbed matrix is almost equal to a multiple of itself. But how to make this precise? And when does it go wrong?

Eigenvalues – The non-multiple case

Throughout, we assume we have a k x k matrix. We might want to allow the entries to be complex, but for now, real entries are perfectly interesting enough.

It makes sense to start with eigenvalues, since it’s easy to define these through the characteristic equation of the matrix. The coefficients of this polynomial are well-behaved (indeed polynomial) functions of the entries of the matrix. So we are really asking how the set of roots of a finite polynomial evolves as the (k+1) coefficients of the polynomial evolve. It is fairly clear that, under any sensible choice of topology on the space of k-(multi)-subsets of \mathbb{C}, the multiset of roots is continuous in the coefficients of the polynomial.

To say anything more precise, we have to introduce some notation.

Let \chi_{A}(z)=z^k+\gamma_{k-1}(A)z^{k-1}+\ldots+\gamma_1(A)z+\gamma_0(A) be the characteristic polynomial of A. Each \gamma_i is a polynomial of degree k-i in the entries of A. Let’s consider now a matrix-valued function A(t), and we assume that the entries of A(t) are all differentiable with respect to t. So each \gamma_i(A(t)) is also differentiable with respect to t.

At this point, let’s make the assumption that t lies in some interval [r,s] for which the eigenvalues of A(t) are distinct. Let \lambda(t) be some eigenvalue of A(t), chosen such that \lambda is a continuous function of t. For example, we might take \lambda(t)=\Lambda_1(t), the eigenvalue with largest absolute value (with some canonical tie-breaking mechanism). Then \chi_{A(t)}(\lambda(t))=0, and so differentiating with respect to \gamma_i:

0=\chi'_{A(t)}(\lambda(t)) \frac{\partial \lambda}{\partial \gamma_i} \Big|_{A(t)} + \lambda(t)^i.

Because we deliberately demanded that the eigenvalues were disjoint, we have \chi'_{A(t)}(\lambda(t))\ne 0, and so \frac{\partial \lambda(t)}{\partial \gamma_i}=-\lambda(t)^i / \chi'_{A(t)}(\lambda(t)). In particular, \lambda(t) is differentiable with respect to the coefficients of the characteristic polynomial, and thus with respect to t also.

Multiple Eigenvalues

It gets more complicated when the characteristic equation has multiple roots. Typically we will be interested in the evolution of the eigenvalue with some extremal property, probably the largest one. Let’s restrict to the real, symmetric case, where the set of eigenvalues is complete and real. Suppose we have t_0 such that A(t_0) has a repeated eigenvalue. Then, in a small enough region of t_0, we can define eigenvalues \lambda(t),\mu(t) continuously such that \lambda(t_0)=\mu(t_0) while \lambda(t)\ne \mu(t) for t=\ne t_0. Then, if the entries of A(t) are analytic functions of t, then so are \lambda(t),\mu(t).

But then \max(\lambda(t),\mu(t)) will in general not be analytic, as the maximum of two smooth functions is in general Lipschitz.

This effect is most obvious in the case of a diagonal matrix A(t)=\begin{pmatrix}t&0\\0&-t\end{pmatrix}, for which the largest eigenvalue is |t|.

Eigenvectors

When the matrix A is real and symmetric, we know it has real eigenvalues, and an orthogonal basis of eigenvectors. Then the Rayleigh quotient characterises the eigenvector as well as the eigenvalue. Recall that for any x\in\mathbb{R}^k with ||x||_2=1, we have

\lambda_1\ge x^T A x \ge \lambda_k,

with equality precisely at the respective eigenvectors. So if we perturb A slightly, keeping it real and symmetric, we can control the principal eigenvector quite well by this method.

If A is not diagonalisable, we can still say something about this principal eigenvector, via large powers of A, sometimes called the Van Mises iteration. This says that for large N, A^N v should have direction close to that of the eigenvector, for any test vector v. The rate of convergence depends on the ratio of the largest eigenvalue to the second largest eigenvalue, though if the matrix is not diagonalisable, it is not completely trivial to quantify this convergence. We have to be careful though, since A maps the subspace orthogonal to the eigenvector to itself, so the magnitude of the projection of v onto the eigenvector determines the speed of convergence. Indeed, if v is orthogonal to the eigenvector, it won’t converge towards the principal eigenvector at all. (But if there is a well-defined ‘second eigenvector’ then it will converge towards that.)

Continuity of Eigenvectors

The reason why I ended up reading about some of these topics was that I wanted to show that the Perron eigenvector of a positive matrix (that is, the unique eigenvector corresponding to the Perron root) was Lipschitz continuous as a function of the entries of the positive matrix. Since for such a matrix, the largest eigenvalue is simple, we are able to make some progress.

In general, the condition that v is an eigenvector of matrix A with eigenvalue \lambda is described by the relation:

(A-\lambda I)v=0,\quad ||u||_1=1, (*)

or whatever the most appropriate normalising condition appears. This describes an implicit relation between A and the eigenvalue-eigenvector pair (\lambda,v). So given a matrix A_0 with eigenvalue \lambda_0 corresponding to eigenvector v_0, in a neighbourhood of A_0 in \mathbb{R}^{k\times k} we can use the implicit function theorem to comment on the differentiability of (\lambda,v) with respect to A in this neighbourhood.

Precisely, we require the matrix of partial derivatives from (*)

\begin{pmatrix} A_0-\lambda_0 I&v_0 \\ \mathbf{1}&0\end{pmatrix}

to have non-zero determinant. But if \lambda_0 is not simple, then if we apply this matrix from the left to one of the other eigenvectors (with a zero appended) we can see that it has non-trivial kernel. With a bit more work, we can show the converse too, and conclude that (\lambda,v) are smooth with respect to A in some neighbourhood of A_0.

Finally, we observe that when the eigenvalues are not simple, we can’t even guarantee continuity of the eigenvectors. This is unsurprising really, since for a multiple eigenvalue, a) we might not know how many LI eigenvectors exists; and b) we might have complete freedom over the choice of eigenvectors. Think about the identity matrix! Indeed the eigenvectors of \begin{pmatrix}1+\epsilon &0\\ 0&1+\epsilon\end{pmatrix} are (1,0) and (0,1), while the eigenvectors of \begin{pmatrix}1&\epsilon\\ \epsilon&1\end{pmatrix} are (1,1), (1,-1). So no continuous choice of eigenvectors is possible here.

Multitype Branching Processes

One of the fundamental objects in classical probability theory is the Galton-Watson branching process. This is defined to be a model for the growth of a population, where each individual in a generation gives birth to some number (possibly zero) of offspring, who form the next generation. Crucially, the numbers of offspring of the individuals are IID, with the same distribution both within generations and between generations.

There are several ways one might generalise this, such as non-IID offspring distributions, or pairs of individuals producing some number of offspring, but here we consider the situation where each individual has some type, and different types have different offspring distributions. Note that if there are K types, say, then the offspring distributions should now be supported on \mathbb{Z}_{\ge 0}^K. Let’s say the offspring distribution from a parent of type i is \mu^{(i)}.

The first question to address is one of survival. Recall that if we want to know whether a standard Galton-Watson process has positive probability of having infinite size, that is never going extinct, we only need to know the expectation of the offspring distribution. If this is less than 1, then the process is subcritical and is almost surely finite. If it is greater than 1, then it is supercritical and survives with positive probability. If the expectation is exactly 1 (and the variance is finite) then the process is critical and although it is still almost surely finite, the overall population size has a power-law tail, and hence (or otherwise) the expected population size is infinite.

We would like a similar result for the multitype process, saying that we do not need to know everything about the distribution to decide what the survival probability should be.

The first thing to address is why we can’t just reduce the multitype change to the monotype setting. It’s easiest to assume that we know the type of the root in the multitype tree. The case where the type of the root is random can be reconstructed later. Anyway, suppose now that we want to know the offspring distribution of a vertex in the m-th generation. To decide this, we need to know the probability that this vertex has a given type, say type j. To calculate this, we need to work out all the type possibilities for the first m generations, and their probabilities, which may well include lots of complicated size-biasing. Certainly it is not easy, and there’s no reason why these offspring distributions should be IID. The best we can say is that they should probably be exchangeable within each generation.

Obviously if the offspring distribution does not depend on the parent’s type, then we have a standard Galton-Watson tree with types assigned in an IID manner to the realisation. If the types are symmetric (for example if M, to be defined, is invariant under permuting the indices) then life gets much easier. In general, however, it will be more complicated than this.

We can however think about how to decide on survival probability. We consider the expected number of offspring, allowing both the type of the parent and the type of the child to vary. So define m_{ij} to be the expected number of type j children born to a type i parent. Then write these in a matrix M=(m_{ij}).

One generalisation is to consider a Galton-Watson forest started from some positive number of roots of various types. Suppose we have a vector \nu=(\nu_i) listing the number of roots of each type. Then the expected number of descendents of each type at generation n is given by the vector \nu M^n.

Let \lambda be the largest eigenvalue of M. As for the transition matrices of Markov chains, the Perron-Frobenius theorem applies here, which confirms that, because the entries of M are positive, the eigenvalue with largest modulus is simple and real, and the associated eigenvector has entirely positive entries. [In fact we need a couple of extra conditions on M, including that it is possible to get from any type to any other type – we say irreducible – but that isn’t worth going into now.]

So in fact the total number of descendents at generation n grows like \lambda^n in expectation, and so we have the same description of subcriticality and supercriticality. We can also make a sensible comment about the left-\lambda-eigenvector of M. This is the limiting proportion of the different types of vertices.

It’s a result (eg. [3]) that the height profile of a depth-first search on a standard Galton-Watson tree converges to Brownian Motion. Another way to phrase this is that a GW tree conditioned to have some size N has the Brownian Continuum Random Tree as a scaling limit as N grows to infinity. Miermont [4] proves that this result holds for the multitype tree as well. In the remainder of this post I want to discuss one idea along the way to the proof, and one application.

I said initially that there wasn’t a trivial reduction of a multitype process to a monotype process. There is however a non-trivial embedding of a monotype process in a multitype process. Consider all the vertices of type 1, and all the paths between such vertices. Then draw a new tree consisting of just the type 1 vertices. Two of these are joined by an edge if there is no other type 1 vertex on the unique path between them in the original tree. If that definition is confusing, think of the most sensible way to construct a tree on the type 1 vertices from the original, and you’ve probably chosen this definition.

There are two important things about this new tree. 1) It is a Galton-Watson tree, and 2) if the original tree is critical, then this reduced tree is also critical. Proving 1) is heavily dependent on exactly what definitions one takes for both the multitype branching mechanism and the standard G-W mechanism. Essentially, at a type 1 vertex, the number of type 1 descendents is not dependent on anything that happened at previous generations, nor in other branches of the original tree. This gives IID offspring distributions once it is formalised. As for criticality, we note that by the matrix argument given before, under the irreducibility condition discussed, the expectation of the total population size is infinite iff the expected number of type 1 vertices is also infinite. Since the proportion of type 1 vertices is given by the first element of the left eigenvector, which is positive, we can make a further argument that the number of type 1 vertices has a power-law tail iff the total population size also has a power-law tail.

I want to end by explaining why I was thinking about this model at all. In many previous posts I’ve discussed the forest fire model, where occasionally all the edges in some large component are deleted, and the component becomes a set of singletons again. We are interested in the local limit. That is, what do the large components look like from the point of view of a single vertex in the component? If we were able to prove that the large components have BCRT as the scaling limit, this would answer this question.

This holds for the original random graph process. There are two sensible ways to motivate this. Firstly, given that a component is a tree (which it is with high probability if its size is O(1) ), its distribution is that of the uniform tree, and it is known that this has BCRT as a scaling limit [1]. Alternatively, we know that the components have a Poisson Galton-Watson process as a local limit by the same argument used to calculate the increments of the exploration process. So we have an alternative description of the BCRT appearing: the scaling limit of G-W trees conditioned on their size.

Regarding the forest fires, if we stop the process at some time T>1, we know that some vertices have been burned several times and some vertices have never received an edge. What is clear though is that if we specify the age of each vertex, that is, how long has elapsed since it was last burned; conditional on this, we have an inhomogeneous random graph. Note that if we have two vertices of ages s and t, then the probability that there is an edge between them is 1-e^{-\frac{s\wedge t}{n}}, ie approximately \frac{s\wedge t}{n}. The function giving the probabilities of edges between different types of vertices is called the kernel, and here it is sufficiently well-behaved (in particular, it is bounded) that we are able to use the results of Bollobas et al in [2], where they discuss general sparse inhomogeneous random graphs. They show, among many other things, that in this setting as well the local limit is a multitype branching process.

So in conclusion, we have almost all the ingredients towards proving the result we want, that forest fire components have BCRT scaling limit. The only outstanding matter is that the Miermont result deals with a finite number of types, whereas obviously in the setting where we parameterise by age, the set of types is continuous. In other words, I’m working hard!

References

[1] Aldous – The Continuum Random Tree III

[2] Bollobas, Janson, Riordan – The phase transition in inhomogeneous random graphs

[3] Le Gall – Random Trees and Applications

[4] Miermont – Invariance principles for spatial multitype Galton-Watson trees

Enhanced by Zemanta

Branching Random Walk and Amenability

This post is about some of the things I learned in an interesting given by Elisabetta Candellero in Oxford last week, based on joint work with Matt Roberts. The paper on which this is based can be found here. The main thing I want to talk about are some properties of graphs which were mentioned near the beginning which I hadn’t heard about before.

Branching Random Walk (hereafter BRW) is a model to which much attention has been paid, because of its natural applications in a range of physical and genetic settings. As with many of the best models, the definition is pretty much in the title. We take the ingredients for a random walk on a graph, which is a graph, and a transition matrix P on that graph. For most of the time we will consider simple random walk, so the graph G exactly specifies P. This requires the additional condition that the graph G is locally finite. We will introduce a branching mechanism, so at discrete times {0,1,2,…} we will track both the number of particles, and their current locations. We start at time 0 with a single particle at some vertex. Then at each time-step, all the vertices present die, and each gives birth independently to some number of offspring according to a fixed probability distribution \mu. These offspring then perform one move according to transition matrix P. Note that if you want the system to carry the appearance of having no death, then taking the support of the offspring distribution to be {1,2,3,…} achieves precisely this. The properties we consider will not be very interesting unless G is infinite, so assume that from now on.

There are almost limitless ways we could think of to generalise these dynamics. The offspring distribution could be allowed to depend on the vertex the particle is occupying. The joint transition probabilities of the offspring at a vertex could be biased in favour or against the offspring moving to the same site next. The environment could be chosen in advance before the process starts, but random.

The classical question about BRW is that of recurrence and transience. The definition extends naturally from that of a Markov chain (which any non-branching random walk on a graph is). As in that setting, we say a BRW is recurrent if every vertex is almost surely visited infinitely often by particles of the graph.

Heuristically, we should observe that in some sense, it is quite difficult for simple random walk on an infinite graph to be recurrent. We have examples in \mathbb{Z},\mathbb{Z}^2, but these are about as ‘small’ as an infinite graph can be. An idea might be that if the number of sites some distance away from where we start grows rapidly as the distance grows, then there isn’t enough ‘pull’ back to visit the sites near where we start infinitely often. Extending this argument, it is easier for a BRW to be recurrent, as we have the option to make the branching rate large, which means that there are lots of particles at large times, hence more possibility for visiting everywhere. Note that if the offspring distribution is subcritical, we don’t stand a chance of having interesting properties. If we ignore the random walk part, we just have a subcritical Galton-Watson process, which dies out almost surely.

We need a measure of the concept discussed in the heuristic for how fast the number of vertices in the graph grows as we consider bands of vertices further and further away from the starting vertex. The standard measure for this is the spectral radius, which is defined not in terms of number of vertices, but through the limiting probability of returning to a fixed vertex at large time n. Precisely

\rho:= \limsup \mathbb{P}_i(X_n=i)^{1/n},

so in some approximation sense

\mathbb{P}_i(X_n=i)\sim \rho^{n},

which explains why \rho\le 1. Note that by considering the sum of such terms, if simple random walk on G is recurrent, then \rho=1, but the converse does not hold. (Consider SRW on \mathbb{Z}^3 for example.)

It’s also worth remarking that \rho is a class property. In particular, for a connected graph, the value of \rho is independent of i. This is not surprising, as if d is the graph distance between vertices i and j, then

p_{ii}^{(n)}\ge p_{ij}^{(d)}p_{jj}^{(n-2d)}p_{ji}^{(d)},

and vice versa, which enables us to sandwich usefully for the limits.

Really, \rho is a function of the transition matrix P. In fact, we can be more specific, by considering diagonalising P. The only case we care about is when P is infinite, so this is not especially nice, but it makes it clear why p_{ii}^{(n)} decays like |\rho|^n where \rho is the largest eigenvalue of P. Indeed this is an alternative definition of the spectral radius. Note that Perron-Frobenius theory (which seems to keep coming up on the blog this week…) says that since |\rho|\le 1, then if |\rho|=1, we must have \rho=1. So the spectral radius being 1 is precisely equivalent to having an invariant measure. We don’t know whether we can normalise it, but P-F guarantees the relevant left-eigenvector is non-negative, and hence a measure.

Next we give this situation a name. Say that a random walk is amenable if \rho(P)=1. We can extend this property to say that a graph is amenable if SRW on it is amenable.

This is not the standard definition of amenability. This property is originally defined (by von Neumann) in the context of groups. A group G is said to be amenable if there exists a left-invariant probability measure on G, ie \mu such that

\forall A\subset G, \forall g\in G, \mu(gA)=A.

The uniform distribution shows that any finite group is amenable.

It turns out that in general there are several conditions for a group which are equivalent to amenability. One is that, given G finitely generated by B, the Cayley graph for G with edges given by elements of B does not satisfy a strong isoperimetric inequality. Such an inequality is an alternative way of saying that the graph grows rapidly. It says that the size of the boundary of a subset of the vertices is uniformly large relative to the size of the set. Precisely, there exists a constant c>0 such that whenever U is a finite subset of the vertices, we have |\partial U|\ge c|U|. (Note that finiteness of U is important – we would not expect results like this to hold for very large subsets.)

Kesten proved that it is further equivalent to the statement that simple random walk on Cay(G,B) is amenable in our original sense. This technical and important result links the two definitions.

We finish by declaring the main classical result in BRW, which is a precise condition for transience. As motivated earlier, the rate of branching and the spectral radius have opposing effects on whether the system is recurrent or transient. Note that at some large time, the expected number of particles which have returned to the starting vertex is given by the expected number of particles in the system multiplied by the probability that any one of them is back at its origin, ie \sim \mu^n\rho^n. So the probability that there is a particle back at the origin at this time is (crudely transferring from expectation to probability) 1\wedge (\mu \rho)^n. We can conclude that the chain is recurrent if \mu > \rho^{-1} and transient if \mu<\rho^{-1}. This result is due to Benjamini and Peres.

The remaining case, when \mu=\rho^{-1} is called, unsurprisingly, critical BRW. It was proved in ’06 by Gantert and Muller that, in fact, all critical BRWs are transient too. This must exclude the amenable case, as we could think of SRW on \mathbb{Z} as a critical BRW by taking the branching distribution to be identically one, as the spectral radius is also 1.

In the end, the material in this post is rather preliminary to the work presented in EC’s talk, which concerned the trace of BRW, and whether there are infinitely many essentially different paths to infinity taken by the particles of the BRW. They show that this holds in a broad class of graphs with symmetric properties.

Enhanced by Zemanta

Convergence of Transition Probabilities

As you can see, I haven’t got round to writing a post for a while. Some of my reasons for this have been good, and some have not. One reason has been that I’ve had to give a large number of tutorials for the fourth quarter of the second year probability course here in Oxford. The second half of this course concerns discrete-time Markov chains, and the fourth problem sheet discusses various modes of convergence for such models, as well as a brief tangent onto Poisson Processes. I’ve written more about Poisson Processes than perhaps was justifiable in the past, so I thought I’d say some words about convergence of transition probabilities in discrete-time Markov chains.

Just to be concrete, let’s assume the state space K is finite, and labelled {1,2,…,k}, so that it becomes meaningful to discuss

p_{12}^{(n)}:=\mathbb{P}(X_n=2|X_0=1).

That is, the probability that if we start at state 1, then after n ‘moves’ we are at state 2. We are interested in the circumstances under which this converges to the stationary distribution. The heuristic is that we can view a time-step of a Markov chain as an operation on the space of distributions on K. Note that this operation is deterministic. If this sounds complicated, what we mean is that we specify an initial distribution, that is the distribution of X_0. If we consider the distribution of X_1, this is given by \lambda P, where \lambda is the initial distribution, and P the transition matrix.

Anyway, the heuristic is that the stationary distribution is the unique fixed point of this operation on the space of distributions. It is therefore not unreasonable to assume that unless there are some periodic effects, we expect repeated use of this operation to move us closer to this fixed point.

We can further clarify this by considering the matrix form. Note that a transition P always has an eigenvalue equal to 1. This is equivalent to say that there is a solution to \pi P=\pi. Note it is not immediately equivalent to saying that P has a stationary distribution, as the latter must be non-negative and have elements summing to one. Only the first property is difficult, and relies on some theory or cleverness to prove. It can also be shown that all eigenvalues satisfy |\lambda|\le 1, and in general, there will be a single eigenvalue (ie dimension 1 eigenspace) with |\lambda|=1, and the rest satisfies |\lambda|<1. Then, if we diagonalise P, it is clear why \pi P^n converges entry-wise, as \pi UP^n U^{-1} converges. In the latter, only the entries in the row corresponding to \lambda=1 converge to something non-zero.

In summary, there is a strong heuristic for why in general, the transition probabilities should converge, and if they converge, that they should converge to the stationary distribution. In fact, we can prove that for any finite Markov chain, p_{ij}^{(n)}\rightarrow \pi_j, provided we two conditions hold. The conditions are that the chain is irreducible and aperiodic.

In the rest of this post, I want to discuss what might go wrong when these conditions are not satisfied. We begin with irreducibility. A chain is irreducible if it has precisely one communicating class. That means that we can get from any state to any other state, not necessarily in one step, with positive probability. One obvious reason why the statement of the theorem cannot hold in this setting is that \pi is not uniquely defined when the chain is not irreducible. Suppose, for example, that we have two closed communicating classes A and B. Then, supported on each of them is an invariant distribution \pi^A and \pi^B, so any affine combination of the two \lambda \pi^A+(1-\lambda) \pi^B will give a stationary distribution for the whole chain.

In fact, the solution to this problem is not too demanding. If we are considering p_{ij}^{(n)} for i\in A a closed communicating class, then we know that p_{ij}^{(n)}=0 whenever j\not\in A. For the remaining j, we can use the theorem in its original form on the Markov chain, with state space reduced to A. Here, it is now irreducible.

The only case left to address is if i is in an open communicating class. In that case, it suffices to work out the hitting probabilities starting from i of each of the closed communicating classes. Provided these classes themselves satisfy the requirements of the theorem, we can write

p_{ij}^{(n)}\rightarrow h_i^A \pi^A_j,\quad i\not\in A, j\in A.

To prove this, we need to show that as the number of steps grows to infinity, the probability that we are in closed class A converges to h_i^A. Then, we decompose this large number of steps so to say that not only have we entered A with roughly the given probability, but in fact with roughly the given probability we entered A a long time in the past, and so there has been enough time for the original convergence result to hold in A.

Now we turn to periodicity. If a chain has period k, this says that we can split the state space into k classes A_1,\ldots,A_k, such that p_{ij}^{(n)}=0 whenever n\not\equiv j-i \mod k. Equivalently, the directed graph describing the possible transitions of the chain is k-partite. This definition makes it immediately clear that p_{ij}^{(n)} cannot converge in this case. However, it is possible that p_{ij}^{(kn)} will converge. Indeed, to verify this, we would need to consider the Markov chain with transition matrix P^k. Note that this is no longer irreducible, as it there are no transitions allowed between classes A_1,\ldots,A_k. Indeed, a more formal definition of the period, in terms of the lcd of possible return times allows us to conclude that there is no finer reducibility structure. That is, A_1,\ldots,A_k genuinely are the closed classes when we consider the chain with matrix P^k. And so the Markov chain with transition matrix P^k restricted to any of the A_is satisfies the conditions of the theorem.

There remains one case which I’ve casually brushed over. When we were discussing the irreducible case, I said that if we had at least one communicating classes, then we could work out the limiting transition probabilities from a state in an open class to a state in a closed class by calculating the hitting probability of that closed class, then applying the standard version of the theorem to that closed class. This relies on the closed class being aperiodic.

Suppose otherwise that the destination closed class A has period k as before. If it were to be the case that the number of steps required to arrive at A had some fixed value mod k, or modulo a non-trivial divisor of k, then we certainly wouldn’t have convergence, for the same reasons as in the globally periodic case. However, we should ask whether we can ever have convergence?

In fact, the answer is yes. For concreteness, and because it’s easier to write ‘odd’ and ‘even’ than m \mod k, let’s assume A has size 2 and period 2. That is, once we arrive in A, thereafter we alternate deterministically between the two states. Anyway, for some large time n, we can write p_{ca}^{(n)} for a\in A, c\not\in A as:

p_{ca}^{(n)}=h_i^A(n),

where the latter term is the probability that we arrive in A at a time-step which has the same parity as n. It’s not terribly hard to come up with an example where this holds, and this idea holds in greater generality, where A has period k (and not necessarily just k states), we have to demand that the probability of arriving at a time which is a mod k is equal for all a in [0,k-1].

Of course, for applications, we don’t normally care much about irreducible chains, and we can easily remove periodicity by introducing so-called laziness, whereby on each time-step we flip a coin (biased if necessary) and stay put if it comes up heads, and apply the transition matrix if it comes up tails. Then it’s possible to get from any state to itself in one step, and so we are by construction aperiodic.

Enhanced by Zemanta

The Perron-Frobenius Theorem for Stochastic Matrices

This article was prompted by a question asked by one of my students about 5 minutes before the end of the final lecture of my Markov Chains course in Linyi. Although I didn’t have time to think about it right then, the flight home offered plenty of opportunity for having a play around. I am well aware that the proof given below is not the best one, but the ideas of minimising quadrants of a matrix seemed fairly natural. Anyway, it’s been sitting on my desktop for over two months now, so I decided I should put it up.

———

Recall that the transition probabilities of a finite Markov chain are determined by a stochastic matrix P. That is, each row of P is a probability distribution. In both theory and applications, the existence and properties of an invariant distribution is of interest. This is a probability distribution \pi satisfying the relation:

\pi P=\pi. (1)

It is clear that \bigg(\begin{smallmatrix}1\\ \vdots\\1\end{smallmatrix}\bigg) is a right- or column eigenvector of P, with eigenvalue 1. Since the spectrum of P^T is the same as that of P, we conclude that 1 is a left-eigenvalue of P also. So we can be assured of the existence of a vector \pi satisfying (1). What is unclear is that this eigenvector \pi should be a probability distribution. Since at least one entry must be non-zero, it will suffice to show that every entry of \pi is non-negative.

A necessary condition for the uniqueness of an invariant distribution is that the Markov chain be irreducible. This is best defined using the terminology of random walks: the chain is irreducible if for every pair of states i,j\in I, it is possible to move from i to j and back again. In terms of the transition matrix, P is irreducible if it is not block upper-triangular, up to reordering rows and columns.

We want to show that when P is irreducible, the (unique) 1-eigenvector is a probability distribution. The standard method proposed in this context is to exhibit the invariant distribution directly. For example, Norris’s Markov Chains defines

\gamma_i^k=\text{ expected time spent in i between visits to k }=\mathbb{E}_k\sum_{n=0}^{T_k}1_{\{X_n=i\}},

and shows that (\gamma_i^k)_{i\in I} satisfies (1).

Nonetheless, the result required is clearly at least one step removed from the probabilistic interpretation, so it would be satisfying to find a direct proof of existence. Typically, one quotes the substantially more general theorem of Perron and Frobenius, the most relevant form of which is:

Theorem (Perron-Frobenius): Given A a non-negative and irreducible square matrix. Then there is a positive real eigenvalue \lambda with multiplicity 1 and such that all other eigenvalues have absolute value less than or equal to \lambda. Then the (unique up to scaling) left- and right-eigenvectors corresponding to \lambda are positive.

Here we present a short proof of this result in the case where A is the (stochastic) transition matrix of a Markov chain.

Proposition: An irreducible stochastic matrix has a 1-eigenvector with all entries non-negative.

Proof: We show instead the contrapositive: that if a stochastic matrix has a 1-eigenvector with both negative and non-negative components, then it is reducible. The motivation is this third eigenvector given in example (2). Observe that the communicating classes are \{1,2\} and \{3\}, and it is not hard to see that for any eigenvector with negative and non-negative components, the sign of a component is a class property.

Informally, given an n\times n stochastic matrix P, and a 1-eigenvector \pi with this property, we relabel the states so that the non-negative components, which we call A\subset I are first. That is, in a moderate abuse of notation:

\pi=(\underbrace{\pi_A}_{\geq 0}\quad\underbrace{\pi_B}_{<0}).\quad\text{ If we write P as }\begin{pmatrix}P_{AA}&P_{AB}\\P_{BA}&P_{BB}\end{pmatrix},

our aim is to show that the sub-matrices P_{AB} and P_{BA} are both zero. This implies that states in A and states in B do not communicate, showing that P is reducible. We can formulate this as a linear programming problem:

\text{Maximise }\sum_{\substack{x\in A,y\in B\\x\in B, y\in A}}p_{xy}\text{ s.t. }\begin{cases}p_{xy}\geq 0&\forall x,y\in I\\p_{x1}+\ldots+p_{xn}=1&\forall x\in I\\\pi_1p_{1y}+\ldots+\pi_np_{ny}=\pi_y&\forall y\in I\end{cases}

It suffices to show that this maximum is 0. Now, we take |A|=i, and assume that 1\leq i\leq n-1, that is, there are a positive number of negative and non-negative components. Noting that the sum of the rows in a stochastic matrix is 1, we may consider instead:

\text{Minimise }\sum_{\substack{x,y\in A\\x,y\in B}}p_{xy}\text{ s.t. }\begin{cases}p_{xy}\geq 0&\forall x,y\in I\\p_{x1}+\ldots+p_{xn}=1&\forall x\in I\\\pi_1p_{1y}+\ldots+\pi_np_{ny}=\pi_y&\forall y\in I\end{cases}

and it will suffice to show that this minimum is n. To do this, we consider instead the dual:

\text{Maximise }\lambda_1+\ldots+\lambda_n+\pi_1\mu_1+\ldots+\pi_n\mu_n,

\text{ s.t. }\lambda_x+\pi_y\mu_x\leq\begin{cases}1&\text{if }x,y\leq i\text{ or }x,y\geq i+1\& \text{otherwise}\end{cases}

The objective is certainly bounded by n. And in fact this is attainable, for example by taking:

\lambda_1=\ldots=\lambda_i=1,\quad \lambda_{i+1}=\ldots=\lambda_n=0

\mu_1=\ldots=\mu_i=0,\quad \mu_{i+1}=-\frac{1}{\pi_{i+1}}, \ldots,\mu_n=-\frac{1}{\pi_n}.

Applying strong duality for linear programming problems completes the proof.

Invariant Distributions of Markov Chains

My lecture course in Linyi was all about Markov chains, and we spent much of the final two sessions discussing the properties of invariant distributions. I was not surprised, however, that none of the class chose this topic as the subject for a presentation to give after the end of the teaching week. One of the main problems is that so many rather similar properties are introduced roughly simultaneously. As we did in the class, I thought it was worth making some sort of executive summary, as a mixture of revision and interest.

Definition: \pi is an invariant measure if \pi P=\pi. If in addition \sum_{i\in I}\pi_i=1, then we say it is an invariant distribution. Of course, if I is finite, then any invariant measure can be normalised to give an invariant distribution.

The key initial questions are about existence and uniqueness. First, if there are multiple communicating classes, then an invariant measure (resp. distribution) is a linear (resp. affine) combination of the invariant measures / distributions on each (closed) class. So we restrict attention to irreducible Markov chains.

In the finite case, P is a stochastic matrix so has a column eigenvector with eigenvalue 1, namely the vector with all entries equal to 1. Thus, by reference to general theory in linear algebra, P has a row eigenvector \pi with eigenvalue 1. To paraphrase a remark made by one of my students, what is not clear is that this should be a measure. Demonstrating that this is true is rather non-trivial I think, normally done by reference to the rather more general Perron-Frobenius theorem, though on the flight home I came up with a short argument using Lagrangian duality. For now, we accept existence in the finite case, and note that we typically show existence by showing that the vector of expected time spent in each state between successive visits to a fixed reference state satisfies the properties of an invariant measure.

This is a good moment to note that recurrence is not a necessary condition for the existence of an invariant measure. For example, the random walk on \mathbb{Z}^3 is transient, but the uniform measure is invariant. However, it is not a sufficient condition for the existence of an invariant distribution either. (Of course, an irreducible finite chain is always recurrent, and always has an invariant distribution, so now we are considering only the infinite state space case.) The random walk on \mathbb{Z}^2 is recurrent, but the invariant measure is not normalisable.

The property we in fact need is positive recurrence. This says that the expected return time to each point is finite. Again, this is a class property. This is a common requirement in probabilistic arguments: almost surely finite is often not strong enough to show results if the expectation is infinite (see for example the various requirements for the optional stopping theorem). If this holds, then \pi_i=\frac{1}{\mathbb{E}T_i}, where T_i is the the return time starting from some i\in I.

The final question is ‘Why are we interested?’ One of the best answers is to look at convergence properties. A simple suggestion is this: if we start in equilibrium, then X_0,X_1,X_2,\ldots are all equal in distribution. Note that the dependence structure remains complicated, and much much more interesting than the individual distributions. Next, we observe that a calculation of n-step transition probabilities for a finite chain will typically involve a linear combination of nth powers of eigenvalues. One of the eigenvalues is 1, and the others lie strictly between -1 and 1. We observe in examples that the constant coefficient in p_{ij}^{(n)} is generally a function of j alone, and so p_{ij}^{(n)}\rightarrow\lambda_j, some distribution on I. By considering P^{n+1}=P\cdot P^n, it is easy to see that if this converges, (\lambda_j)_{j\in I} is an invariant distribution. The classic examples which do not work are

P=\begin{pmatrix}0&1\\1&0\end{pmatrix} and P=\begin{pmatrix}0&1&0\\ 0&0&1\\1&0&0\end{pmatrix},

as then the distribution of X_n is a function of the remainder of n modulo 3 alone. With a little thought, we can give a precise classification of such chains which force you to be in particular proper subsets of the state space at regular times n. Chains without this property are called aperiodic, and we can show that distributions for such chains converge to the equilibrium distribution as n\rightarrow\infty.