Linear Algebra II: Eigenvectors and Diagonalisability

This post continues the discussion of the Oxford first-year course Linear Algebra II. We’ve moved on from determinants, and are now considering eigenvalues and eigenvectors of matrices and linear maps.

A good question to ask is: what’s the point of knowing about eigenvectors? I can think of a quick answer and a longer answer. The quick answer is that whenever we have a mapping of any kind, it is natural to ask about its fixed points. And since we are thinking about vector spaces and linear maps, if we can’t find any fixed points, we might nonetheless be able to find the next best thing: vectors whose direction is fixed by the map. In general, knowing about fixed points of a mapping might tell us other more qualitative properties, including the behaviour seen when you apply the map iteratively a large number of times. (Indeed a recent post discusses this exact problem for positive matrices in a context relevant to a chapter of my thesis…)

A more specific answer concerns bases. Recall that a linear map is defined independently of any basis: it’s just a map from the vector space to itself. We can express the linear map via a matrix with respect to some basis, but how to choose the basis? We could always choose the canonical basis in \mathbb{R}^n, since it’s easy to do vector and matrix calculations when most of the entries of all the vectors are zero. We also have a good visual idea (at least in up to three dimensions) of what a matrix might mean with respect to that basis. If we needed to divide the three-dimensional world around us into small volumes, we’d tend to describe it with small cubes rather than small arbitrary parallelepipeds.

But once we know something about the linear map, we might want to choose a basis of vectors on which the behaviour of the map is particularly easy to describe. And eigenvectors fulfil precisely this role. If we are able to choose a basis of eigenvectors, describing the map’s action, either abstractly, or via a (diagonal) matrix, is very straightforward. If we are given a matrix to begin with, we know how to do a change of basis, and changing to the basis of eigenvectors is precisely what’s going on when we write A=PDP^{-1}, where D is a diagonal matrix. We construct P by taking its columns to be these eigenvectors. In particular, for a given vector x, y=P^{-1}x is the vector giving the coefficients of x in the basis of eigenvectors.
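To make this concrete, here is a small numpy sketch (my own illustration, with an arbitrarily chosen matrix, not something from the course) of the change of basis A=PDP^{-1}, and of reading off the coefficients of a vector in the eigenvector basis via y=P^{-1}x.

```python
import numpy as np

# An arbitrary diagonalisable example (distinct eigenvalues 5 and 2)
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, P = np.linalg.eig(A)   # columns of P are eigenvectors
D = np.diag(eigenvalues)

# Change of basis: A = P D P^{-1}
assert np.allclose(A, P @ D @ np.linalg.inv(P))

# Coefficients of a vector x in the eigenvector basis
x = np.array([1.0, 2.0])
y = np.linalg.solve(P, x)           # y = P^{-1} x
assert np.allclose(P @ y, x)        # reconstruct x from its eigen-coordinates
```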

So the case where we have a basis of eigenvectors is particularly useful, and in this case, we say the matrix or the map is diagonalisable. Remember how we find eigenvalues. If there exists a non-zero vector x satisfying Ax=\lambda x, then x is in the kernel of A-\lambda I. As we discussed last time, introducing the determinant gives a much more manageable way to verify which values of \lambda result in A-\lambda I having a non-trivial kernel. In particular, if non-zero x is in the kernel, we have \mathrm{det}(A-\lambda I)=0, and this leads to a polynomial of degree n (the dimension of the vector space / size of the matrix) for \lambda, called the characteristic polynomial \chi_A(z), which has the eigenvalues as its roots.

If we agree to work over the complex field, then this is good, because it means we always have eigenvalues, and so it becomes sensible to talk about exactly how many eigenvalues and eigenvectors we have. Observe that if we restrict to real vector spaces, this might not be the case. In the plane, rotation by \pi/2, for example, has no real eigenvectors at all.

Multiplicities of eigenvalues

We define the algebraic multiplicity \alpha(\lambda) of an eigenvalue \lambda to be the exponent of the factor (z-\lambda) in the factorisation of the characteristic polynomial. To define the geometric multiplicity, observe that the eigenvectors with eigenvalue \lambda, together with the zero vector, form a subspace, and so it is meaningful to talk about the dimension of this subspace (the ‘eigenspace’), which is the geometric multiplicity \gamma(\lambda). There are two facts that one needs to remember. The slightly less obvious one is that \gamma(\lambda)\le \alpha(\lambda) for all \lambda. One can see this by, for example, working in a basis that extends a basis of the \lambda-eigenspace. Observe at this stage that the sum of the algebraic multiplicities has to be n by definition, while the sum of geometric multiplicities is at most n. And this makes sense, because the space spanned by all the eigenvectors is a subspace, and so has dimension at most n.

The more obvious, but more frequently forgotten result is that

\alpha(\lambda)\ge 1 \quad \iff \quad \gamma(\lambda)\ge 1,

which is simply a consequence of the property discussed a few paragraphs previously concerning the kernel of A-\lambda I.

In particular, we might make the heuristic observation that ‘most’ polynomials of degree n have n distinct roots. This is certainly true for quadratics: there is only one value that the discriminant can take such that we see a repeated root. Alternatively, imagine shifting the quadratic up and down (in a complex way if necessary); again there is only one moment at which it might have a repeated root. This observation can be generalised easily to higher degree polynomials in a number of ways.

So if we lift this observation across to matrices, we see that most matrices have n distinct eigenvalues, and thus have n linearly independent eigenvectors which form a basis, hence the matrix is diagonalisable. I think it’s really worth reflecting on this, since much of a first exploration into linear algebra ends up treating exactly the case where the matrix is not diagonalisable.

The principal example of a non-diagonalisable matrix is \begin{pmatrix}2&1\\0&2\end{pmatrix}, where the 2s can be replaced by any value, and the 1 can be replaced by any non-zero value. There’s plenty to learn about the extent to which higher-dimensional versions of this matrix represent all non-diagonalisable matrices, but such an exposition of Jordan normal form comes next year for the students taking this course.
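A quick numerical illustration of this example (my own, not part of the course): numpy reports the repeated eigenvalue twice, but the eigenvectors it returns are parallel, so they do not form a basis.

```python
import numpy as np

J = np.array([[2.0, 1.0],
              [0.0, 2.0]])

eigenvalues, V = np.linalg.eig(J)
print(eigenvalues)              # both equal to 2: algebraic multiplicity 2

# The columns of V are the reported eigenvectors; their determinant is
# (numerically) zero, so they are parallel and do not form a basis:
# the geometric multiplicity of the eigenvalue 2 is only 1.
print(np.linalg.det(V))         # essentially 0
```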

It probably is worth saying now though, that this example gives a good sanity check for whether a method is actually using diagonalisability correctly. For example, it is easily seen that elementary row operations do not preserve diagonalisability, by starting from \begin{pmatrix}2&0\\0&2\end{pmatrix} and ending up at our counter-example. One could also argue from this that the set of non-diagonalisable matrices is dense within the set of matrices with a repeated eigenvalue. That is, having a repeated eigenvalue but a full eigenspace is doubly-infinitely-unlikely.

Cayley-Hamilton theorem

Anyway, among other results, we also saw the Cayley-Hamilton theorem, which states that a matrix A satisfies its own characteristic equation. That is, \chi_A(A)=0, where the zero on the right-hand side is the zero matrix. It’s tempting to substitute A into the expression \mathrm{det}(A-\lambda I), but of course this is not valid. Indeed, imagine the determinant of a typical matrix A-\lambda I, with terms like (7-\lambda) on the diagonal; it doesn’t make sense to substitute a matrix for \lambda as one of the entries of the overall matrix!

Fortunately, we can argue convincingly in the case where A is a diagonalisable matrix. Remember that \chi_A(A) is a matrix. Now look at the action of \chi_A(A) on any eigenvector v, corresponding to eigenvalue \lambda. Applying some power of A to v gives v multiplied by the same power of \lambda, and so we end up with

\chi_A(A)v = \chi_A(\lambda)v = 0.

This only worked when v was an eigenvector, but fortunately there is a basis of eigenvectors if A is diagonalisable, and so \chi_A(A)v=0 for all v, hence \chi_A(A)=0.

But \chi_A(A) is just a matrix-valued function of A. If you think about it, \chi_A is a monic polynomial, all of whose non-leading coefficients are multinomials in the entries of A: the coefficient of z^{n-k} is, up to sign, the sum of the k\times k principal minors of A, so is a multinomial of degree k with integer coefficients. Therefore the entries of \chi_A(A) are multinomials of degree at most n in the entries of A, and again have integer coefficients.

Even without the integrality of the coefficients, this says that, under any reasonable definition of continuity of matrices (which could be induced from any topology on \mathbb{C}^{n\times n}), the function \chi_A(A) should be continuous as a function of A. But we’ve shown \chi_A(A)=0 for all diagonalisable A, and also argued that most complex-valued matrices are diagonalisable. Turning this into a formal statement about denseness means that we’ve shown the Cayley-Hamilton theorem for non-diagonalisable matrices also. It feels that because the coefficients are integers, we might also have shown the result for other fields too, but I have minimal knowledge or recollection at the moment of the things one has to check for this sort of result.

It’s worth ending with the brief comment that Cayley-Hamilton is useful, among other reasons because it enables us to write the inverse of A as a polynomial of degree at most n-1 in terms of A. In many settings this is a lot easier to work with in terms of calculations than an argument with minors.
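As a sanity check (my own sketch, with an arbitrary invertible matrix), one can verify Cayley-Hamilton numerically and then use it to recover the inverse as a polynomial of degree n-1 in A; here np.poly returns the coefficients of the characteristic polynomial, highest degree first.

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
n = A.shape[0]

c = np.poly(A)            # coefficients of chi_A(z), highest degree first

# Evaluate chi_A(A) by Horner's scheme: should be (numerically) the zero matrix
chi_of_A = np.zeros_like(A)
for coeff in c:
    chi_of_A = chi_of_A @ A + coeff * np.eye(n)
print(np.max(np.abs(chi_of_A)))   # ~0, i.e. Cayley-Hamilton holds

# Rearranging chi_A(A) = 0 expresses A^{-1} as a polynomial of degree n-1 in A:
# A^{-1} = -(A^{n-1} + c[1] A^{n-2} + ... + c[n-1] I) / c[n]
inv_A = np.zeros_like(A)
for coeff in c[:-1]:
    inv_A = inv_A @ A + coeff * np.eye(n)
inv_A = -inv_A / c[-1]
print(np.allclose(inv_A, np.linalg.inv(A)))   # True
```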

Perturbation of Eigenvectors

In the previous post, I talked about eigenvalues, and some alternative characterisations which could be useful in some circumstances. Recently, I’ve been interested in controlling how eigenvalues and eigenvectors change as the matrix is varied. My particular example concerns positive matrices, which have a well-defined largest eigenvalue (or Perron root), and a unique (up to normalising in some way) principal eigenvector.

We might expect that perturbing a matrix slightly does not change the eigenvectors very much, since any original eigenvector is still almost an eigenvector, in the sense that its image under the action of the perturbed matrix is almost equal to a multiple of itself. But how to make this precise? And when does it go wrong?

Eigenvalues – The non-multiple case

Throughout, we assume we have a k\times k matrix. We might want to allow the entries to be complex, but for now, real entries are interesting enough.

It makes sense to start with eigenvalues, since it’s easy to define these through the characteristic equation of the matrix. The coefficients of this polynomial are well-behaved (indeed polynomial) functions of the entries of the matrix. So we are really asking how the set of roots of a degree-k polynomial evolves as its (k+1) coefficients evolve. It is fairly clear that, under any sensible choice of topology on the space of k-(multi)-subsets of \mathbb{C}, the multiset of roots is continuous in the coefficients of the polynomial.

To say anything more precise, we have to introduce some notation.

Let \chi_{A}(z)=z^k+\gamma_{k-1}(A)z^{k-1}+\ldots+\gamma_1(A)z+\gamma_0(A) be the characteristic polynomial of A. Each \gamma_i is a polynomial of degree k-i in the entries of A. Let’s consider now a matrix-valued function A(t), and we assume that the entries of A(t) are all differentiable with respect to t. So each \gamma_i(A(t)) is also differentiable with respect to t.

At this point, let’s make the assumption that t lies in some interval [r,s] for which the eigenvalues of A(t) are distinct. Let \lambda(t) be some eigenvalue of A(t), chosen such that \lambda is a continuous function of t. For example, we might take \lambda(t)=\Lambda_1(t), the eigenvalue with largest absolute value (with some canonical tie-breaking mechanism). Then \chi_{A(t)}(\lambda(t))=0, and so differentiating with respect to \gamma_i:

0=\chi'_{A(t)}(\lambda(t)) \frac{\partial \lambda}{\partial \gamma_i} \Big|_{A(t)} + \lambda(t)^i.

Because we deliberately demanded that the eigenvalues were distinct, we have \chi'_{A(t)}(\lambda(t))\ne 0, and so \frac{\partial \lambda(t)}{\partial \gamma_i}=-\lambda(t)^i / \chi'_{A(t)}(\lambda(t)). In particular, \lambda(t) is differentiable with respect to the coefficients of the characteristic polynomial, and thus with respect to t also.

Multiple Eigenvalues

It gets more complicated when the characteristic equation has multiple roots. Typically we will be interested in the evolution of the eigenvalue with some extremal property, probably the largest one. Let’s restrict to the real, symmetric case, where all the eigenvalues are real. Suppose we have t_0 such that A(t_0) has a repeated eigenvalue. Then, in a small enough region of t_0, we can define eigenvalues \lambda(t),\mu(t) continuously such that \lambda(t_0)=\mu(t_0) while \lambda(t)\ne \mu(t) for t\ne t_0. If, moreover, the entries of A(t) are analytic functions of t, then so are \lambda(t),\mu(t).

But then \max(\lambda(t),\mu(t)) will in general not be analytic, as the maximum of two smooth functions is in general only Lipschitz, not differentiable.

This effect is most obvious in the case of a diagonal matrix A(t)=\begin{pmatrix}t&0\\0&-t\end{pmatrix}, for which the largest eigenvalue is |t|.
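A minimal numerical sketch of this example (mine, not from the original post): tabulating the largest eigenvalue for a few values of t shows the kink at t=0.

```python
import numpy as np

def largest_eigenvalue(t):
    A = np.array([[t, 0.0],
                  [0.0, -t]])
    return np.max(np.linalg.eigvalsh(A))

for t in [-0.2, -0.1, 0.0, 0.1, 0.2]:
    # equals |t|: continuous in t, but not differentiable at t = 0
    print(t, largest_eigenvalue(t))
```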

Eigenvectors

When the matrix A is real and symmetric, we know it has real eigenvalues, and an orthogonal basis of eigenvectors. Then the Rayleigh quotient characterises the eigenvector as well as the eigenvalue. Recall that for any x\in\mathbb{R}^k with ||x||_2=1, we have

\lambda_1\ge x^T A x \ge \lambda_k,

with equality precisely at the respective eigenvectors. So if we perturb A slightly, keeping it real and symmetric, we can control the principal eigenvector quite well by this method.

If A is not symmetric, we can still say something about this principal eigenvector, via large powers of A, sometimes called the von Mises iteration (or power iteration). This says that for large N, A^N v should have direction close to that of the eigenvector, for any test vector v. The rate of convergence depends on the ratio of the largest eigenvalue to the second largest eigenvalue, though if the matrix is not diagonalisable, it is not completely trivial to quantify this convergence. We have to be careful though, since A maps the invariant subspace complementary to the principal eigenvector to itself, so the size of the component of v along the principal eigenvector determines the speed of convergence. Indeed, if v lies entirely in this complementary subspace, it won’t converge towards the principal eigenvector at all. (But if there is a well-defined ‘second eigenvector’ then it will converge towards that.)
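Here is a minimal power-iteration sketch (my own, with an arbitrary test matrix and a function name of my choosing); it assumes the starting vector has a non-zero component along the principal eigenvector, as discussed above.

```python
import numpy as np

def power_iteration(A, num_steps=200, seed=0):
    """Estimate the principal eigenvalue/eigenvector of A by repeated application."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])   # generic start: non-zero projection w.h.p.
    for _ in range(num_steps):
        v = A @ v
        v /= np.linalg.norm(v)            # renormalise to avoid overflow/underflow
    eigenvalue_estimate = v @ A @ v / (v @ v)   # Rayleigh-quotient style estimate
    return eigenvalue_estimate, v

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, v = power_iteration(A)
print(lam)                     # close to the largest eigenvalue of A
print(A @ v - lam * v)         # close to the zero vector
```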

Continuity of Eigenvectors

The reason why I ended up reading about some of these topics was that I wanted to show that the Perron eigenvector of a positive matrix (that is, the unique eigenvector corresponding to the Perron root) was Lipschitz continuous as a function of the entries of the positive matrix. Since for such a matrix, the largest eigenvalue is simple, we are able to make some progress.

In general, the condition that v is an eigenvector of matrix A with eigenvalue \lambda is described by the relation:

(A-\lambda I)v=0,\quad ||v||_1=1, (*)

or whatever normalising condition is most appropriate. This describes an implicit relation between A and the eigenvalue-eigenvector pair (\lambda,v). So given a matrix A_0 with eigenvalue \lambda_0 corresponding to eigenvector v_0, in a neighbourhood of A_0 in \mathbb{R}^{k\times k} we can use the implicit function theorem to comment on the differentiability of (\lambda,v) with respect to A in this neighbourhood.

Precisely, we require the matrix of partial derivatives from (*)

\begin{pmatrix} A_0-\lambda_0 I& -v_0 \\ \mathbf{1}^T&0\end{pmatrix}

to have non-zero determinant. But if \lambda_0 is not simple, then if we apply this matrix from the left to one of the other eigenvectors (with a zero appended) we can see that it has non-trivial kernel. With a bit more work, we can show the converse too, and conclude that (\lambda,v) are smooth with respect to A in some neighbourhood of A_0.

Finally, we observe that when the eigenvalues are not simple, we can’t even guarantee continuity of the eigenvectors. This is unsurprising really, since for a multiple eigenvalue, a) we might not know how many linearly independent eigenvectors exist; and b) we might have complete freedom over the choice of eigenvectors. Think about the identity matrix! Indeed the eigenvectors of \begin{pmatrix}1+\epsilon &0\\ 0&1-\epsilon\end{pmatrix} are (1,0) and (0,1), while the eigenvectors of \begin{pmatrix}1&\epsilon\\ \epsilon&1\end{pmatrix} are (1,1) and (1,-1). So no continuous choice of eigenvectors is possible here.
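Numerically (my own sketch): both matrices below are within \epsilon of the identity, yet their eigenvector bases sit at 45 degrees to each other, however small \epsilon is.

```python
import numpy as np

eps = 1e-6

A1 = np.array([[1 + eps, 0.0],
               [0.0, 1 - eps]])
A2 = np.array([[1.0, eps],
               [eps, 1.0]])

# Both matrices are within eps of the identity, but their eigenvectors are
# (1,0),(0,1) and (1,1)/sqrt(2),(1,-1)/sqrt(2) respectively -- at 45 degrees.
print(np.linalg.eigh(A1)[1])
print(np.linalg.eigh(A2)[1])
```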

Characterisations of Eigenvalues

I’ve been working for much of the past few months on a version of the frozen percolation random graph process with types. The connectivity between types is controlled by a (finite) non-negative square matrix, and so I’ve been engaging with linear algebra theory to an extent I haven’t really experienced since the second or third year of undergraduate maths.

We are interested in whether the graphs in question are subcritical, critical or supercritical. As in the case of multitype branching processes, this is controlled by the principal eigenvalue of a related non-negative matrix. So I’ve been looking up lots of methods for controlling eigenvalues, and some have proved useful, and some have not, but I thought it would be worthwhile to present some of them here.

Bounds and characterisations of spectral radius

Throughout, I will be talking about finite, square matrices. Eigenvalues may be defined as roots of the characteristic polynomial, and so by the fundamental theorem of algebra, there is always at least one complex eigenvalue. There is always at least one eigenvector associated to any eigenvalue. However, the dimension of the eigenspace is not always the same as the multiplicity of the eigenvalue as a root of the characteristic polynomial. The latter is called algebraic multiplicity, while the former is geometric multiplicity.

For now though, this distinction will be unimportant. The spectral radius of a matrix A is defined as

\rho(A)=\max \{|\lambda|\, : \, \lambda \text{ an eigenvalue of }A\}.

We can bound the spectral radius in terms of the norm of the matrix. Remember that a matrix norm has to satisfy all the usual properties of a norm, as well as a submultiplicative property |||AB|||\le |||A|||\cdot |||B|||. This is good, as otherwise we would be free to replace any norm by an arbitrary multiple of itself, and so no useful bounds could ever emerge. Note that the submultiplicativity implies that |||I_n|||\ge 1.

Now, let \lambda,x be some eigenvalue and associated (right-)eigenvector respectively of matrix A. Let X be the square matrix given by taking all the columns to be x. Now Ax=\lambda x implies AX=\lambda X, and so

|\lambda| \cdot|||X||| = |||\lambda X||| = |||AX||| \le |||A|||\cdot |||X|||,

and thus, since |||X|||>0, we conclude our most basic bound |\lambda| \le |||A|||, and hence \rho(A)\le |||A|||.
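A quick check of this bound (my own sketch, on a random matrix) against a few standard submultiplicative matrix norms:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))

spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))

# Three standard submultiplicative matrix norms
norms = {
    "operator 2-norm": np.linalg.norm(A, 2),
    "max row sum (infinity-norm)": np.linalg.norm(A, np.inf),
    "Frobenius norm": np.linalg.norm(A, "fro"),
}

print(spectral_radius)
for name, value in norms.items():
    print(name, value, spectral_radius <= value)   # always True
```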

When A is diagonalisable, life is particularly easy, but in general we can write A as a conjugate of its Jordan normal form. Then, by looking at each diagonal block of the Jordan normal form separately, we can show that

\lim_{k\rightarrow \infty}A^k = 0\quad \iff \quad \rho(A)<1.

Then, applying this, with additional care, to the matrices A / (\rho(A)\pm \epsilon), we derive Gelfand’s Formula, that \rho(A) = \lim_{k\rightarrow \infty} ||A^k||^{1/k}. Again, this applies for any matrix norm.
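And a sketch of Gelfand’s formula in action (again my own illustration, using the Frobenius norm on a random matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))

Ak = np.eye(4)
for k in range(1, 31):
    Ak = Ak @ A
    if k % 10 == 0:
        # ||A^k||^{1/k} approaches rho(A) as k grows, for any matrix norm
        print(k, np.linalg.norm(Ak, "fro") ** (1 / k), spectral_radius)
```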

Real symmetric matrices

When the matrix is real and symmetric, it is not too hard to show that all the eigenvalues are real, and furthermore that all the geometric multiplicities are equal to the algebraic multiplicities. That is, the matrix is diagonalisable, and there is an (orthogonal) basis of eigenvectors. Once we assume we are working with respect to this eigenbasis, it is easy to see how the Rayleigh quotient characterisation of the largest (and smallest) eigenvalue works. Let’s say the eigenvalues are \lambda_1\ge \lambda_2\ge\ldots \ge \lambda_n, then for any ||x||_2=1, we have \lambda_1\ge x^T A x\ge \lambda_n, and equality is attained when x is the respective eigenvector, normalised appropriately.

This is an especially useful characterisation of the largest eigenvalue, as for example we can see fairly easily that this means \lambda_1 is a convex function of the (real, symmetric) matrix.

We can generalise this Rayleigh quotient idea if we take k orthonormal vectors in \mathbb{R}^n and arrange them as the columns of an n\times k matrix P, so that P^T P = I_k. Now we consider the matrix P^TAP. [Note that if k=1, we are exactly considering x^TA x as before.] Then Poincare’s Separation Theorem says that the eigenvalues \mu_1\ge \mu_2\ge\ldots\ge \mu_k of P^TAP (which is also real and symmetric) are bounded by the original eigenvalues:

\lambda_i \ge \mu_i\ge \lambda_{n-k+i}.

Since the trace is preserved under conjugation, and the trace is the sum of the eigenvalues, we can apply this result with the columns of P taken to be any k canonical basis vectors of \mathbb{R}^n. Without loss of generality, we may assume the basis has been chosen so that the diagonal elements of A satisfy a_{11}\ge a_{22}\ge\ldots\ge a_{nn}, and so now we have that the sequence (a_{11},a_{22},\ldots,a_{nn}) is majorised by (\lambda_1,\lambda_2,\ldots,\lambda_n) and majorises (\lambda_n,\lambda_{n-1},\ldots,\lambda_1). The first of these relations can be used via the setup of Karamata’s inequality to conclude that for any convex function f, we have

\sum_{i=1}^n f(\lambda_i)\ge \sum_{i=1}^n f(a_{ii}).
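The following sketch (mine, not from the book) checks the majorisation of the diagonal entries by the eigenvalues, and the resulting inequality for the convex function f(x)=x^2, on a random real symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2                      # random real symmetric matrix

eigs = np.sort(np.linalg.eigvalsh(A))[::-1]     # lambda_1 >= ... >= lambda_n
diag = np.sort(np.diag(A))[::-1]                # a_11 >= ... >= a_nn after reordering

# Majorisation: partial sums of the sorted eigenvalues dominate those of the diagonal
print(np.all(np.cumsum(eigs) >= np.cumsum(diag) - 1e-12))   # True
print(np.isclose(eigs.sum(), diag.sum()))                   # equal totals (the trace)

# Karamata-style consequence for the convex function f(x) = x^2
print(np.sum(eigs**2) >= np.sum(diag**2))                   # True
```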

Gershgorin Circles

In fact, we can relate the eigenvalues to the diagonal entries of the matrix in a more general setting. We are motivated by the thought that if the off-diagonal entries are all very small, then the set of eigenvalues should be approximately given by the set of diagonal entries.

For a square complex matrix, let \lambda,x be an eigenvalue, eigenvector pair. For any index i, we have

\lambda - a_{ii}= \frac{\sum_j a_{ij}x_j}{x_i} - a_{ii} = \frac{\sum_{j\ne i}a_{ij}x_j}{x_i}.

Now consider the i such that |x_i|=\max_j |x_j|, and take absolute values and apply the triangle inequality,

|\lambda - a_{ii}| \le \sum_{j\ne i} \left| \frac{a_{ij}x_j}{x_i} \right| \le \sum_{j\ne i}|a_{ij}|.

Let’s define R_i=\sum_{j\ne i}|a_{ij}| to be the sum of the non-diagonal entries of the ith row. Then the Gershgorin circle theorem says that every eigenvalue lies within at least one of the discs B(a_{ii},R_i), in the complex plane. So our motivation still makes sense. If the off-diagonal entries are small, this is a strong restriction, and if they are not typically smaller than the diagonal entries, then we perhaps do not learn very much. Obviously, we could apply the same argument to the columns too.
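A short numerical check of the circle theorem (my own sketch, with an arbitrarily chosen matrix): every computed eigenvalue lands in at least one of the discs B(a_{ii},R_i).

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.diag([5.0, -2.0, 1.0, 8.0]) + 0.3 * rng.standard_normal((4, 4))

radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))   # R_i: off-diagonal row sums
centres = np.diag(A)

for lam in np.linalg.eigvals(A):
    in_some_disc = np.any(np.abs(lam - centres) <= radii)
    print(lam, in_some_disc)   # True for every eigenvalue
```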

When the diagonal entries are distinct, and the off-diagonal entries are small, the Gershgorin discs are disjoint, and we would expect each to contain exactly one eigenvalue, corresponding to the appropriate diagonal entry. In fact, we can say something stronger. In general, the union of the discs is a subset of the complex plane with some connected components. Then, if a component is the union of exactly r discs, then it contains exactly r of the eigenvalues.

To see this, consider multiplying all the off-diagonal entries by z\in[0,1] and observe what happens as z varies from 0 to 1. When z=0, the matrix is diagonal, and each eigenvalue sits in its Gershgorin disc, which is just the single point a_{ii}. As z varies continuously, the characteristic polynomial varies continuously, and so does its multiset of roots, that is the set of eigenvalues. So since each of the r eigenvalues is initially within the union of the r original, large Gershgorin discs, it must remain within this union as z varies, since it cannot ‘jump’ to another component.

It’s hard to know how time will allow, but provisionally in the next post I will talk about how to control the evolution of eigenvectors as a function of the matrix, and in particular what can go wrong.

REFERENCES

For the middle section, I used the progression from Chapter 4 of Matrix Differential Calculus with Applications in Statistics and Econometrics (Magnus and Neudecker).

Branching Random Walk and Amenability

This post is about some of the things I learned in an interesting talk given by Elisabetta Candellero in Oxford last week, based on joint work with Matt Roberts. The paper on which this is based can be found here. The main things I want to talk about are some properties of graphs which were mentioned near the beginning, and which I hadn’t heard about before.

Branching Random Walk (hereafter BRW) is a model to which much attention has been paid, because of its natural applications in a range of physical and genetic settings. As with many of the best models, the definition is pretty much in the title. We take the ingredients for a random walk on a graph, namely a graph G and a transition matrix P on that graph. For most of the time we will consider simple random walk, so the graph G exactly specifies P. This requires the additional condition that the graph G is locally finite. We will introduce a branching mechanism, so at discrete times {0,1,2,…} we will track both the number of particles and their current locations. We start at time 0 with a single particle at some vertex. Then at each time-step, all the particles present die, and each gives birth independently to some number of offspring according to a fixed probability distribution \mu. These offspring then perform one move according to transition matrix P. Note that if you want the system to carry the appearance of having no death, then taking the support of the offspring distribution to be {1,2,3,…} achieves precisely this. The properties we consider will not be very interesting unless G is infinite, so assume that from now on.

There are almost limitless ways we could think of to generalise these dynamics. The offspring distribution could be allowed to depend on the vertex the particle is occupying. The joint transition probabilities of the offspring at a vertex could be biased in favour or against the offspring moving to the same site next. The environment could be chosen at random in advance, before the process starts.

The classical question about BRW is that of recurrence and transience. The definition extends naturally from that of a Markov chain (which any non-branching random walk on a graph is). As in that setting, we say a BRW is recurrent if every vertex is almost surely visited infinitely often by particles of the process.

Heuristically, we should observe that in some sense, it is quite difficult for simple random walk on an infinite graph to be recurrent. We have examples in \mathbb{Z},\mathbb{Z}^2, but these are about as ‘small’ as an infinite graph can be. An idea might be that if the number of sites some distance away from where we start grows rapidly as the distance grows, then there isn’t enough ‘pull’ back to visit the sites near where we start infinitely often. Extending this argument, it is easier for a BRW to be recurrent, as we have the option to make the branching rate large, which means that there are lots of particles at large times, hence more possibility for visiting everywhere. Note that if the offspring distribution is subcritical, we don’t stand a chance of having interesting properties. If we ignore the random walk part, we just have a subcritical Galton-Watson process, which dies out almost surely.

We need a measure of the concept discussed in the heuristic for how fast the number of vertices in the graph grows as we consider bands of vertices further and further away from the starting vertex. The standard measure for this is the spectral radius, which is defined not in terms of number of vertices, but through the limiting probability of returning to a fixed vertex at large time n. Precisely

\rho:= \limsup_{n\rightarrow\infty} \mathbb{P}_i(X_n=i)^{1/n},

so in some approximation sense

\mathbb{P}_i(X_n=i)\sim \rho^{n},

which explains why \rho\le 1. Note that by considering the sum of such terms, if simple random walk on G is recurrent, then \rho=1, but the converse does not hold. (Consider SRW on \mathbb{Z}^3 for example.)
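As a concrete sanity check of this definition (my own sketch, not from the talk): for SRW on \mathbb{Z} the return probabilities are p_{00}^{(2n)}=\binom{2n}{n}2^{-2n}\sim 1/\sqrt{\pi n}, which decays polynomially rather than exponentially, so the (2n)-th roots tend to 1 and indeed \rho=1.

```python
from math import exp, lgamma, log, pi, sqrt

# Return probabilities for simple random walk on the integers:
# p_00^(2n) = C(2n, n) / 2^(2n) ~ 1 / sqrt(pi * n), computed via log-gamma
for n in [10, 100, 1000, 10000]:
    log_p = lgamma(2 * n + 1) - 2 * lgamma(n + 1) - 2 * n * log(2)
    # exact value, Stirling-type approximation, and the (2n)-th root tending to 1
    print(n, exp(log_p), 1 / sqrt(pi * n), exp(log_p / (2 * n)))
```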

It’s also worth remarking that \rho is a class property. In particular, for a connected graph, the value of \rho is independent of i. This is not surprising, as if d is the graph distance between vertices i and j, then

p_{ii}^{(n)}\ge p_{ij}^{(d)}p_{jj}^{(n-2d)}p_{ji}^{(d)},

and vice versa, which enables us to sandwich usefully for the limits.

Really, \rho is a function of the transition matrix P. In fact, we can be more specific, by considering diagonalising P. The only case we care about is when P is infinite, so this is not especially nice, but it makes it clear why p_{ii}^{(n)} decays like |\rho|^n where \rho is the largest eigenvalue of P. Indeed this is an alternative definition of the spectral radius. Note that Perron-Frobenius theory (which seems to keep coming up on the blog this week…) says that since |\rho|\le 1, then if |\rho|=1, we must have \rho=1. So the spectral radius being 1 is precisely equivalent to having an invariant measure. We don’t know whether we can normalise it, but P-F guarantees the relevant left-eigenvector is non-negative, and hence a measure.

Next we give this situation a name. Say that a random walk is amenable if \rho(P)=1. We can extend this property to say that a graph is amenable if SRW on it is amenable.

This is not the standard definition of amenability. This property is originally defined (by von Neumann) in the context of groups. A group G is said to be amenable if there exists a finitely-additive left-invariant probability measure on G, ie \mu such that

\forall A\subset G, \forall g\in G, \mu(gA)=\mu(A).

The uniform distribution shows that any finite group is amenable.

It turns out that in general there are several conditions for a group which are equivalent to amenability. One is that, given G finitely generated by B, the Cayley graph for G with edges given by elements of B does not satisfy a strong isoperimetric inequality. Such an inequality is an alternative way of saying that the graph grows rapidly. It says that the size of the boundary of a subset of the vertices is uniformly large relative to the size of the set. Precisely, there exists a constant c>0 such that whenever U is a finite subset of the vertices, we have |\partial U|\ge c|U|. (Note that finiteness of U is important – we would not expect results like this to hold for very large subsets.)

Kesten proved that it is further equivalent to the statement that simple random walk on Cay(G,B) is amenable in our original sense. This technical and important result links the two definitions.

We finish by stating the main classical result in BRW, which is a precise condition for transience. As motivated earlier, the rate of branching and the spectral radius have opposing effects on whether the system is recurrent or transient. Note that at some large time n, the expected number of particles which have returned to the starting vertex is given by the expected number of particles in the system multiplied by the probability that any one of them is back at its origin, ie \sim \mu^n\rho^n, where we write \mu also for the mean of the offspring distribution. So the probability that there is a particle back at the origin at this time is (crudely transferring from expectation to probability) 1\wedge (\mu \rho)^n. We can conclude that the chain is recurrent if \mu > \rho^{-1} and transient if \mu<\rho^{-1}. This result is due to Benjamini and Peres.

The remaining case, when \mu=\rho^{-1} is called, unsurprisingly, critical BRW. It was proved in ’06 by Gantert and Muller that, in fact, all critical BRWs are transient too. This must exclude the amenable case, as we could think of SRW on \mathbb{Z} as a critical BRW by taking the branching distribution to be identically one, as the spectral radius is also 1.

In the end, the material in this post is rather preliminary to the work presented in EC’s talk, which concerned the trace of BRW, and whether there are infinitely many essentially different paths to infinity taken by the particles of the BRW. They show that this holds in a broad class of graphs with certain symmetry properties.


Mixing Times 4 – Avoiding Periodicity

A Markov chain is periodic if you can partition the state space such that it is possible to be in a particular class only at certain, periodic times. Concretely, suppose we can find a decomposition into classes \Omega=V_1\cup\ldots\cup V_k such that conditional on X_t\in V_i, we have \mathbb{P}(X_{t+1}\in V_{i+1})=1, where the indices of the Vs are taken modulo k. Such a chain is called periodic with period k. In most cases, we would want to define the period to be the maximal such k.

Why is periodicity a problem? It prevents convergence to equilibrium. The distribution at time t has some fairly strong dependence on the initial distribution. For example, if the initial distribution is entirely supported on V_1 as defined above, then the distribution at time t will be entirely supported on V_i, where i\equiv t \mod k. In particular, this cannot converge to some equilibrium.

Aperiodicity thus becomes a necessary condition in any theorem on convergence to equilibrium. Note that by construction this is only relevant for chains in discrete time. In a first account of Markov chains, most of the examples will either have a small state space, for which the transition matrix will have to contain lots of zeros before it stands a chance of being periodic, or obviously aperiodic birth-death or queue type processes. But some of the combinatorially motivated chains we consider for interesting mixing properties are more likely to be periodic. In particular, for a random walk on a group say, the generator measure may well be supported only on a small subset of the whole group, which is completely natural (eg transpositions as a subset of the symmetric group). Then it becomes more plausible that periodicity might arise because of some underlying regularity or symmetry in the group structure.

My first claim is that periodicity is not a disaster for convergence properties of Markov chains. Firstly, by the definition above, P^k(x,y) for x,y\in V_1 is an irreducible (aperiodic if k is maximal) transition matrix on V_1, and so we have convergence to some equilibrium distribution on V_1 of (X_{kt+a}) or similar. An initial distribution mixed between classes gives a mix of such equilibria. Alternatively, we could think about large-time ergodic properties. By taking an average over all distributions up to some large t, the periodic problems get smoothed out. So, for mixing on a periodic chain, it might be possible to make headway with Cesaro mixing, which looks at the speed of convergence of the ergodic average distribution.

In most cases, though, we prefer to alter the chain directly to remove periodicity, or even any chance of periodicity. The preferred method in many contexts is to replace the transition matrix P with \frac12 (P+I). This says that at every time t, we toss an independent fair coin, and with probability 1/2 make the transition suggested by P, and with probability 1/2 we stay where we are. Note that if a chain is irreducible, and some P(x,x)>0, then it is definitely aperiodic, as x cannot be in more than one class as per the definition of periodicity.

If you want to know about the mixing time of the original chain, note that this so-called lazy chain moves at half the speed of the original, so to get exact asymptotics (eg in the case of cutoff, that is mixing speed faster than the scale of the mixing time) you must multiply by 2. Also note that all of the eigenvalues of \frac12 (P+I) are non-negative, and in fact, the eigenvalues are subject to a linear transform in the construction of the lazy transition matrix \lambda\mapsto \frac12(1+\lambda).
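A small illustration of that eigenvalue transform (my own sketch), using the deterministic two-state chain of period 2:

```python
import numpy as np

# Period-2 chain: the walker flips state deterministically at every step
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
lazy_P = 0.5 * (P + np.eye(2))

print(np.linalg.eigvals(P))        # eigenvalues 1 and -1: the -1 reflects the periodicity
print(np.linalg.eigvals(lazy_P))   # eigenvalues 1 and 0: each mapped by l -> (1 + l)/2
```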

Note that choosing 1/2 as the parameter is unnecessary. Firstly, it would suffice to take some P(x,x)=\epsilon and rescale the rest of the row appropriately. Also, in some cases, a different constant gives a more natural interpretation of the underlying mechanism. For example, one model worth considering is the Random Transposition Random Walk on the symmetric group, where at time t we multiply (ie compose) with a transposition chosen uniformly at random. This model is interesting partly because the orbits of an element resemble, at least initially, the component size process of an Erdos-Renyi random graph, on the grounds that when the number of transpositions is small, they don’t interact too much, so can be viewed as independent edges. Anyway, some form of laziness is needed in RTRW, otherwise the chain will alternate between odd and even permutations. In this case, 1/2 is not the most natural choice. The most sensible way to sample random transpositions is to select the two elements of [n] to be transposed uniformly and independently at random. Thus each transposition is selected with probability \frac{2}{n^2}, while the identity, which corresponds to ‘lazily’ staying at the current state in the random walk, is selected with probability 1/n.

The lazy chain is also useful when the original chain has a lot of symmetry involved. In particular, if the original chain involves ‘switching’ say one coordinate. The best example is the random walk on the vertices of the n-hypercube, but there are others. Here, the most helpful way to visualise the configuration is to choose a coordinate uniformly at random and then flip its value (from 0 to 1 or 1 to 0). Now the lazy chain can be viewed similarly, but note that the dependence on the current value of the coordinate is suppressed. That is, having chosen the coordinate to be affected, we set it to be 0 with probability 1/2 and 1 with probability 1/2, irrespective of the prior value at that coordinate. Thus instead of viewing the action on coordinates to be ‘stay or switch’, we can view the action on the randomly chosen coordinate to be ‘randomly resample’, to use statistical terminology. This is ideal for coupling, because from the time coordinate j is first selected, the value at that coordinate is independent of the past, and in particular, the initial value or distribution. So we can couple arbitrary initial configurations or distributions, and we know that as soon as all coordinates have been selected (a time that can be described as a coupon collector problem), the chains are well coupled, that is, the values are the same.
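A quick simulation of this coupling (my own sketch, with n chosen arbitrarily): both copies resample the same randomly chosen coordinate to the same random bit at each step, and the empirical coupling time sits on the coupon-collector scale of roughly n\log n.

```python
import numpy as np

def coupling_time(n, seed):
    """Steps until two lazy walks on {0,1}^n, coupled by resampling the same
    randomly chosen coordinate to the same random bit, agree everywhere."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n, dtype=int)          # start from all-zeros
    y = np.ones(n, dtype=int)           # start from all-ones
    steps = 0
    while not np.array_equal(x, y):
        i = rng.integers(n)             # coordinate chosen uniformly at random
        b = rng.integers(2)             # resampled value, independent of the past
        x[i] = b
        y[i] = b
        steps += 1
    return steps

n = 50
times = [coupling_time(n, seed) for seed in range(200)]
print(np.mean(times), n * np.log(n))    # comparable: coupon-collector scale
```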

Note that one way we definitely get periodicity is if the increment distribution for random walk on a group is supported entirely in a single coset of a normal subgroup. Why? Well if we take H\lhd G to be the normal subgroup, and gH to be the relevant coset, then P^t(g',\cdot) is supported entirely on g^tg'H, so is periodic with period equal to the order of gH in the quotient group G/H. Note that if the coset is the normal subgroup itself, then it might well include support on the identity, which immediately makes the chain aperiodic. However, there will then be no transitions between cosets, so the chain is not irreducible on G.

The previous paragraph is the content of Remark 8.3 in the book we are reading. My final comment is that normality is precisely what is needed for this to hold. The key idea is that the set of subsets {gH, gHgH, gHgHgH, … } forms a partition of the group. This is certainly true if H is normal and gH generates G/H. If the latter statement is not true, then the set of subsets still forms a partition, but of some subset of G. The random walk is then neither irreducible nor aperiodic on the reduced state space. If H is not normal, then there are no such restrictions. For example, gHgH might be equal to the whole group G. In that case we can move between any pair of states in two steps, and so by extension between any pair of states in three steps; since \gcd(2,3)=1, the chain is aperiodic. As a concrete example, consider

\tau=\langle (1 2)\rangle \leq S_3,

the simplest example of a non-normal subgroup. Part of the problem is that cosets are different in the left-case and the right-case. Consider the left coset of \tau given by \sigma\tau=\{(1 2 3),(2 3)\}. These elements have order three and two respectively, and so by a similar argument to the general one above, this random walk is aperiodic.