Linear Algebra II: Eigenvectors and Diagonalisability

This post continues the discussion of the Oxford first-year course Linear Algebra II. We’ve moved on from determinants, and are now considering eigenvalues and eigenvectors of matrices and linear maps.

A good question to ask is: what’s the point of knowing about eigenvectors? I can think of a quick answer and a longer answer. The quick answer is that whenever we have a mapping of any kind, it is natural to ask about its fixed points. And since we are thinking about vector spaces and linear maps, if we can’t find any fixed points, we might nonetheless be able to find the best thing, some vectors whose direction is fixed by the map. In general, knowing about fixed points of a mapping might tell us other more qualitative properties, including the behaviour seen when you apply the map iteratively a large number of times. (Indeed a recent post discusses this exact problem for positive matrices in a context relevant to a chapter of my thesis…)

A more specific answer concerns bases. Recall that a linear map is defined independently of any basis: it’s just a map from the vector space to itself. We can express the linear map via a matrix with respect to some basis, but how to choose the basis? We could always choose the canonical basis in \mathbb{R}^n, since it’s easy to do vector and matrix calculations when most of the entries of all the vectors are zero. We also have a good visual idea (at least in up to three dimensions) of what a matrix might mean with respect to that basis. If we needed to divide the three-dimensional world around us into small volumes, we’d tend to describe it with small cubes rather than small arbitrary parallelopipeds.

But once we know something about the linear map, we might want to choose a basis of vectors on which the behaviour of the map is particularly easy to describe. And eigenvectors fulfil precisely this role. If we are able to choose a basis of eigenvectors, describing the map’s action, either abstractly, or via a (diagonal) matrix, is very straightforward. If we are given a matrix to begin with, we know how to do a change of basis, and changing to the basis of eigenvectors is precisely what’s going when we write A=P^{-1}DP, where D is a diagonal matrix. We construct P by taking its columns to be these eigenvectors. In particular, for a given vector x, y=Px is the vector giving the coefficients of x in the basis of eigenvectors.

So the case where we have a basis of eigenvectors is particularly useful, and in this case, we say the matrix or the map is diagonalisable. Remember how we find eigenvalues. If there exists a non-zero vector x satisfying Ax=\lambda x, then x is in the kernel of A-\lambda I. As we discussed last time, introducing the determinant gives a much more manageable way to verify which values of \lambda result in A-\lambda I having a non-trivial kernel. In particular, if non-zero x is in the kernel, we have \mathrm{det}(A-\lambda I)=0, and this leads to a polynomial of degree n (the dimensional of the vector space / size of the matrix) for \lambda, called the characteristic polynomial \chi_A(z), which has the eigenvalues as its roots.

If we agree to work over the complex field, then this is good, because it means we always have eigenvalues, and so it becomes sensible to talk about exactly how many eigenvalues and eigenvectors we have. Observe that if we restrict to real vector spaces, this might not be the case. In the plane, the rotation by \pi/2 for example has no fixed vectors.

Multiplicities of eigenvalues

We call the algebraic multiplicity \alpha(\lambda) of an eigenvalue \lambda to be the exponent of the factor (z-\lambda) in the factorisation of the characteristic polynomial. To define the geometric multiplicity, observe that all the eigenvectors with eigenvalue \lambda form a subspace, and so it is meaningful to talk about the dimension of this subspace (‘eigenspace’), which is the geometric multiplicity \gamma(\lambda). There are two facts that one needs to remember. The slightly less obvious one is that \gamma(\lambda)\le \alpha(\lambda) for all \lambda. One can see this by, for example, working in a basis that extends a basis of the \lambda-eigenspace. Observe at this stage that the sum of the algebraic multiplicities has to be n by definition, while the sum of geometric multiplicities is at most n. And this makes sense, because the space spanned by all the eigenvectors is a subspace, and so has dimension at most n.

The more obvious, but more frequently forgotten result is that

\alpha(\lambda)\ge 1 \quad \iff \quad \gamma(\lambda)\ge 1,

which is simply a consequence of the property discussed a few paragraphs previously concerning the kernel of A-\lambda I.

In particular, we might make the heuristic observation that ‘most’ polynomials of degree n have n distinct roots. This is certainly true for quadratics: there is only one value that the discriminant can take such that we see a repeated root. Alternatively, imagine shifting the quadratic up and down (in a complex way if necessary); again there is only one moment at which it might have a repeated root. This observation can be generalised easily to higher degree polynomials in a number of ways.

So if we lift this observation across to matrices, we see that most matrices have n distinct eigenvalues, and thus have n linearly independent eigenvectors which form a basis, hence the matrix is diagonalisable. I think it’s really worth reflecting on this, since much of a first exploration into linear algebra ends up treating exactly the case where the matrix is not diagonalisable.

The principal example of a non-diagonalisable matrix is \begin{pmatrix}2&1\\0&2\end{pmatrix}, where the 2s can be replaced by any value, and the 1 can be replaced by an non-zero value. There’s plenty to learn about to what extent versions of this matrix of higher size represent all non-diagonalisable matrices, but such an exposition of Jordan normal form comes next year for the students taking this course.

It probably is worth saying now though, that this example gives a good sanity check for whether a method is actually using diagonalisability correctly. For example, it is easily seen that elementary row operations to not preserve diagonalisability by starting from \begin{pmatrix}2&0\\0&2\end{pmatrix} and ending up at our counter-example. One could also argue from this that the set of non-diagonalisable matrices are dense within the set of matrices with a repeated eigenvalue. That is, having a repeated eigenvalue but full eigenspace is doubly-infinitely-unlikely.

Cayley-Hamilton theorem

Anyway, among other results, we also saw the Cayley-Hamilton theorem, which states that a matrix A satisfies its own characteristic equation. That is \chi_A(A)=0, where the zero on the right-hand side is the zero matrix. It’s tempting to substitute A into the expression \mathrm{det}(A-\lambda I), but of course this is not valid. Indeed imagine a typical eigenvalue determinant matrix with terms like (7-\lambda) on the diagonal; it doesn’t make sense to substitute a matrix for \lambda as one of the entries of the overall matrix!

Fortunately, we can argue convincingly in the case where A is a diagonalisable matrix. Remember that \chi_A(A) is a matrix. Now looki at the action of \chi_A(A) on any eigenvector v, corresponding to eigenvalue \lambda. Applying some power of A to v gives v multiplied by the same power of \lambda, and so we end up with

\chi_A(A)v = \chi_A(\lambda)v = 0.

This only worked when v was an eigenvector, but fortunately there is a basis of eigenvectors if A is diagonalisable, and so \chi_A(A)v=0 for all v, hence \chi_A(A)=0.

But \chi_A(A) is just a matrix-valued function of A. If you think about it, \chi_A is a monic polynomial, all of whose non-leading coefficients are multinomials of degree at most n-1 in the entries of A. Furthermore, these multinomials have (non-negative) integer coefficients. Therefore the entries of \chi_A(A) are multinomials of degree at most 2n-1 in the entries of A, and again have (non-negative) integer coefficients.

Even without the integrality of the coefficients, this says that, under any reasonable definition of continuity of matrices (which could be induced from any topology on \mathbb{R}^{n\times n}) the function \chi_A(A) should be continuous as a function of A. But we’ve shown \chi_A(A)=0 for all diagonalisable A, and also argued that most complex-valued matrices are diagonalisable. Turning this into a formal statement about denseness means that we’ve shown the Cayley-Hamilton theorem for non-diagonalisable matrices also. It feels that because the coefficients are non-negative integers, we might also have shown the result for other fields too, but I have minimal knowledge or recollection at the moment of the things one has to check for this sort of result.

It’s worth ending with the brief comment that Cayley-Hamilton is useful, among other reasons because it enables us to write the inverse of A as a polynomial of degree at most n-1 in terms of A. In many settings this is a lot easier to work with in terms of calculations than an argument with minors.

Linear Algebra II: Determinants 2

In the previous post, we introduced determinants of matrices (and by extension linear maps) via its multilinearity properties, and as the change-of-volume factor. We also discussed how to calculate them, via row operations, or Laplace expansion, or directly via a sum of products of entries over permutations.

The question of why this is ever a useful quantity to consider remains, and this post tries to answer it. We’ll start by seeing one example where this is a very natural quantity to consider, and then the main abstract setting, where the determinant is zero, and consider a particularly nice example of this.

Jacobeans as a determinant

We consider integration by substitution. Firstly, in one variable: when it comes to Riemann integration of a function g(x) with respect to x, we view dx as the width of a small column which approximates the function near x. Now, if we reparameterise, that is if we write x=f(y) for some well-behaved (in particular differentiable) function f, then the width of the column is dx= dy.(dx/dy)=f'(y) dy. This may be negative, if y is decreasing while x is increasing, but for now let’s not worry about this overly, for example by assuming the function g is non-negative. Thus if we want to integrate with y as the variable, we multiply the integrand by this factor |f'(y)|.

What about in higher dimensions? We have exactly the same situation, only instead of two-dimensional columns, we have (n+1)-dimensional columns. We then multiply the n-dimensional volume of the base by the height, again given by g(\mathbf{x}). If we have a similar transformation of the base variable \mathbf{x}=f(\mathbf{y}), we differentiate to get

\mathrm{d}x_i = \sum_{j=1}^n\frac{\mathrm{d}f_i}{\mathrm{d}y_j} \mathrm{d}y_j.

In other words

\mathrm{d}\mathbf{x}= J \mathrm{d}\mathbf{y},

where J is the Jacobean matrix of partial deriatives. In particular, we know how to relate the volume [0,\mathrm{d}x_1]\times\ldots\times [0,\mathrm{d}x_n] to the volume [0,\mathrm{d}y_1]\times \ldots\times [0,\mathrm{d}y_n]. It’s simply the determinant of the Jacobean J. So if we want to integrate with respect to \mathbf{y}, it only remains to pre-multiply the integrand by |\mathrm{det}J| and proceed otherwise as in the one-dimensional case.

Det A = 0

A first linear algebra course might well motivate the introducing matrices as a notational shortcut for solving families of linear equations, Ax=b. The main idea is that generally we can solve this equation uniquely. Almost all of the theory developed in such a first linear algebra course deals with the case when this fails to hold. In particular, there are many ways to characterise this case, and we list some of them now:

  • Ax=b has no solutions for some b;
  • A is not invertible;
  • A has non-trivial kernel, that is, with dimension at least one;
  • A does not have full rank, that is, the image has dimension less than n;
  • The columns (or indeed the rows) are linearly dependent;
  • The matrix can be row-reduced to a matrix with a row of zeroes.

It is useful that these are equivalent, as in abstract problems one can choose whichever interpretation from this list is most relevant. However, all of these are quite hard to check. Exhibiting a non-trivial kernel element is hard – one either has to do manual row-reduction, or the equivalent in the context of linear equations. But we can add the characterisation

  • det A = 0;

to the list. And this is genuinely much easier to check for specific examples, either abstract or numerical.

Let’s quickly convince ourselves of a couple of these equivalences. Determinant is invariant under row-reductions, and by multilinearity it is certainly the case that det A = 0 if A has a row of zeroes. We also said that A is the change-of-volume factor. Note that A is a map from the domain to its image, so if A has less than full rank, then any set in the image has zero volume.

The Vandermonde matrix

This is a good example of this theory in practice. Consider the Vandermonde matrix where each row is a geometric progression:

V=\begin{pmatrix}1&\alpha_1&\ldots&\alpha_1^{n-1}\\1&\alpha_2&\ldots&\alpha_2^{n-1}\\ \vdots&\vdots&\ddots&\vdots\\ 1&\alpha_n&\ldots&\alpha_n^{n-1}\end{pmatrix}.

Now suppose we attempt to solve

V\begin{pmatrix}a_0\\a_1\\ \vdots\\ a_{n-1}\end{pmatrix}=\begin{pmatrix}b_1\\b_2\\ \vdots \\ b_n\end{pmatrix}.

There’s a natural interpretation to this, that’s especially clear with this suggestive notation. Each row corresponds to a polynomial, where the coefficients are given by the (a_0,a_1,\ldots,a_{n-1}), and the argument is given by \alpha_i.

So if we try to solve for (a_0,a_1,\ldots,a_{n-1}), given (\alpha_1,\ldots,\alpha_n) and (b_1,\ldots,b_n), we are asking whether we can find a polynomial P with degree at most n-1 such that P(\alpha_i)=b_i for $i=1. Lagrange interpolation gives an argument where we just directly write down the relevant polynomial, but we can also deploy our linear algebraic arguments too.

The equivalence of all these statements means that to verify existence and uniqueness of such a polynomial, we only need to check that the Vandermonde matrix has non-zero determinant. And in fact there are a variety of methods to show that

\mathrm{det}V=\prod_{1\le i< j}\le n(\alpha_j-\alpha_i).

For the polynomial question to be meaningful, we would certainly demand that the (\alpha_i) are distinct, and so this determinant is non-zero, and we’ve shown that n points determine a degree (n-1) polynomial uniquely.

If we multiply on the left instead, suppose that we are considering a discrete probability distribution X that takes n known values (\alpha_1,\ldots,\alpha_n) with unknown probabilities (p_1,\ldots,p_n). Then we have

(p_1,\ldots,p_n) V = (1,\mathbb{E}X, \mathbb{E}[X^2],\ldots, \mathbb{E}[X^{n-1}]).

So, again by inverting the Vandermonde matrix (which is know is possible since its determinant is non-zero…) we can recover the distribution from the first (n-1) moments of the distribution.

A similar argument applies to show that the Discrete Fourier Transform is invertible, and in this case (where the \alpha_is are roots of unity), the expression for the Vandermonde determinant is particularly tractable.

Linear Algebra II: Determinants 1

This term, I’m giving tutorials on a course that’s new to me, the apparently notorious ‘Linear Algebra II’ for first year undergraduates. I can appreciate how it might have ended up with this reputation, but as always, every challenge is also an opportunity, So I’m going to (try to) write a short series of blog posts about what we’ve discussed in the tutorials.

The first problem sheet-and-a-half concerned determinants of matrices. There are three things worth addressing here:

  1. What are abstract definitions, and which is most useful in each setting?
  2. How to actually calculate them?
  3. What’s the overall point?

The answers are obviously not completely unrelated, but we’ll probably defer the third question to a second post.

The determinant is a map from the set of matrices \mathcal{M}_n to the base field (hereafter assumed to be \mathbb{R},\mathbb{C}). The Oxford course defines it through its properties:

  • Multilinear in the columns of the matrix.
  • Equal to zero if two columns are equal.
  • Equal to one if the matrix is the identity.

One then checks that there is a unique such map, and so from now on it’s reasonable to call it the determinant of the matrix. It will follow from pretty much any consequence that we can replace ‘columns’ with ‘rows’ throughout and get the same map.

Other definitions

We have a closed form expression for the determinant given via permutations of n

\mathrm{det}(A)=\sum_{\sigma\in \Sigma_n} \mathrm{sign}(\sigma) a_{1\sigma(1)}\ldots a_{n\sigma(n)}.

We’ll come back to a discussion of when this particular definition is useful. It can be derived by carefully transforming the identity matrix into A, using the operations which are mentioned in the original definition of the determinant, in particular, keeping track of the number of transpositions of columns.

It’s clear from any definition that the determinant is a polynomial of degree n in the entries of the matrix, but this definition will be useful if you want to make some more precise comment on the nature of this polynomial. For example, if entries of the matrix are polynomials in x of various degree (think of the eigenvalue equation for example) this allows you to control (or at least bound) the overall degree of the determinant as a polynomial in x.

The determinant is also the volume of the n-dimensional parallelopiped formed by the column vectors of the matrix. This is easy to check in two dimensions, for the matrix \begin{pmatrix}a&b\\c&d\end{pmatrix}:

20160225_172206To calculate the area of the central parallelogram, we have to subtract the area of two small rectangles and four small triangles from the outer rectangle, obtaining

(a+b)(c+d)-2bc - \frac12(ac+ac+bd+bd)=ad-bc,

as we expect. This calculation is harder to execute in higher dimensions, and certainly harder to visualise.

Maybe, though, we don’t have to, so long as we can reassure ourselves that this volume satisfies the implicit definition of the determinant map at the start. Multilinearity in the columns is not that hard to see. If we multiply the jth column by some constant, we are stretching the parallelopiped by the same constant factor in one direction, and so the volume grows appropriately. The additivity property can similarly be thought of as joining together two parallelopipeds at their common face (which is common since the other column vectors have to be constant in this construction). If two column vectors are equal, then clearly this volume actually has dimension at most n-1, and thus volume zero, so the final two conditions are genuinely easy to check.

The challenge here is that there is a direction involved. Determinants can be negative, but in our classical viewpoint, areas generally are not. In 2D, we can think of this as saying that the area is positive if the vector (b,d) lies anti-clockwise from (a,c) in the parallelogram, while is it negative otherwise. Again, this is harder to visualise in higher dimensions, but it is at least plausible that one could develop a similar decomposition. Ultimately, we are happy with the notion of directed lengths (ie vectors on the real line), and these are easy to add up without having to separate into cases, and the case holds for areas and higher-dimensional volumes.

Evaluating determinants

If we actually want to compute the determinant of a given matrix, the sum over permutations is intractable since it doesn’t have any natural splits into stages. The implicit definitions and this area consideration are clearly useless for all but the most special of examples.

The Laplace expansion is the usual algorithm to calculate the determinant of an n x n matrix. You pick a row (or a column), and evaluate the determinants of the (n-1) x (n-1) minor matrices given by deleting this row (or column), and each column (or row) in turn. This leaves us with n determinants of smaller matrices, which we pre-multiply by the entries in the original deleted row (or column), and add up in an alternating way (*). This is highly computationally intensive for large matrices, but for 3×3 and 4×4 can be done by hand with probability of an error bounded away from 1.

There is the flexibility to choose the reference row or column. Since the entries of these affect the sum through small products, it is highly convenient to choose a row or column with a lot of zeros. In particular, if there’s a row or column with exactly one non-zero entry, this is an ideal candidate.

The sum over permutations also works well when a lot of the entries are zero, because then a lot of the permutations give a summand which is zero. Upper-triangular matrices are a good example: only one permutation (the identity permutation) avoids all the zero elements underneath the diagonal.

One can also observe from the multilinearity property of the determinant map that there are lots of operations we can apply to the matrix which leave the determinant fixed. These are often called elementary row operations, though obviously we can apply them to the columns as well. To summarise, if we interchange two rows, the sign of the determinant is reversed. And if we add some multiple of one row to any other row, the determinant stays the same.

When matrices are not square, it’s quite important to be specific about exactly what form you can reduce a general matrix to via such row operations, but in this context, it’s not hugely important. Reduced echelon form (without the condition that leading coefficients will be one) is achievable, but this is a special case of an upper triangular matrix, for which the determinant is given by the product of the diagonal entries, ie is easy.

Whether this is substantially easier than Laplace expansion depends on the matrix itself and taste, both to do manually, and to code.

(*) I’m not a fan personally of this alternating definition. It seems to me much more natural to define the minor as

M^{i,j}=(a_{i+k,j_\ell})_{1\le k,\ell\le n-1},

with indices taken modulo n. Then you don’t have any \pm 1s in the Laplace expansion.

Using determinants in abstract problems

So the determinant gives directly the area of the image (under A) of the unit hypercube. By linearity (of A), it is easy to see that it also gives the scale factor of the area change (under A) of any hyper-cuboid, parallel to the conventional axes, anywhere in the space. Then, eg by approximating any sensible n-dimensional shape (*) as a union of such hyper-cuboids, we can show that in fact the area of any sensible shape increases by a factor (det A) under application of A.

This is a good thing to remember, because it is an excellent heuristic for seeing why the determinant of a linear map is basis-independent. It also gives a much easier proof of the key result

\mathrm{det}(AB)=\mathrm{det}(A) \mathrm{det}(B),

than that given by fixing B and viewing det(AB) as a map from matrices to the field, just like the original definition of determinant.

Some of the theory in the course is proved using elementary row operations. But these invite complicated notation, so are best used only in simple arguments, or when things are fairly explicit to begin with. Given an abstract problem about determinants of matrices, it is often tempting to induct on the size of the matrix in some way. I think it’s worth saying that even though the Laplace expansion is explicitly set up in this way, the notation involved is also likely to be annoying here, while permutations are easy to describe inductively: eg let \sigma(1)=k, then view the remainder of the permutation as a bijection \{2,3,\ldots,n\}\rightarrow [n]\backslash \{k\}.

Shortly, we’ll have a second post answering the final question: what’s the point of working with determinants? We’ve already seen half an answer, in that they describe the change-of-volume factor of a matrix (or linear map), but this can be substantially developed.

Lagrange multipliers Part Two

My own question on last week’s BMO2 notwithstanding, inequalities seem out of fashion at the moment among mainstream international olympiads. Such problems often involve minimising some function subject to a constraint, and word has, over the years, filtered down to students interested in such things, that there’s a general method for achieving this via Lagrange multipliers. The motivation for my talk in Hungary, summarised by the previous post and this one, is to dispute the claim made by some of the UK students that these are hard to justify rigorously. I dispute this because I don’t think it’s qualitatively much harder to justify Lagrange multipliers rigorously than an unconstrained optimisation problem, whereas I would claim instead that Lagrange multipliers are merely unlikely to work at a computational level on the majority of olympiad problems.

Unconstrained optimisation in two variables

Before we can possibly discuss constrained optimisation, we should discuss unconstrained optimisation. That is, finding minima of a function of several variables. We don’t lose too much by assuming that our function f(x,y) depends on two variables.

Recall that our method in the previous post for justifying the A-level approach to minima was to find a necessary condition to be a local minimum, and also a general reason why there should be a global minimum. That way, if there are finitely many points satisfying the condition, we just check all of them, and the one with the smallest value of f is the global minimum. We’ll discuss the existence of the global minimum later.

If we hold one coordinate fixed, the local variation of a function is equivalent to the one-dimensional case.

f(x+h,y)-f(x,y)= h\frac{\partial f}{\partial x}(x,y)+O(h^2).

In general we want to vary both variables, which is fine since

f(x+h,y+\ell)-f(x+h,y)=\ell \frac{\partial f}{\partial y}(x+h,y) + O(h^2).

But since we really want everything to be determined by the function at (x,y), we really want

\frac{\partial f}{\partial y}(x+h,y) \approx \frac{\partial f}{\partial y}(x,y),

and so we be mindful that we may have to assume that both partial derivatives are continuous everywhere they exist. Once we have this though, we can rewrite as

f(x+h,y+\ell) - f(x,y)= h\frac{\partial f}{\partial x}(x,y) + \ell \frac{\partial f}{\partial y}(x,y) + O(h^2)

= (\frac{\partial f}{\partial x}(x,y) ,\frac{\partial f}{\partial y}(x,y) )\cdot (h,\ell) + O(h\vee \ell^2).

In particular, if we define grad of f to be \nabla f(x,y)=(\frac{\partial f}{\partial x}(x,y), \frac{\partial f}{\partial y}(x,y) ) and apply a similar argument to that which we used in the original setting. If \nabla f(x,y)\ne 0, then we can choose some small (h,\ell), such that f(x+h,y+\ell)<f(x,y). Thus a necessary condition for (x,y) to be a local minimum for f is that \nabla f=0.

Lagrange multipliers

This is natural time to discuss where Lagrange multipliers emerge. The setting now is that we still want to minimise some function f(\mathbf{x}), but only across those values of \mathbf{x} which satisfy the constraint g(\mathbf{x})=0.

But our approach is exactly the same, namely we find a necessary condition to be a local minimum subject to the condition. As before, we have

f(\mathbf{x}+\mathbf{h}) - f(\mathbf{x}) = \mathbf{h}\cdot \nabla f + O(|\mathbf{h}|^2),

but we are only interested in those small vectors \mathbf{h} for which \mathbf{x}+\mathbf{h} actually satisfies the constraint, namely g(\mathbf{x}+\mathbf{h})=0. But then

0 = g(\mathbf{x}+\mathbf{h})- g(\mathbf{x})=\mathbf{h}\cdot \nabla g + O(|\mathbf{h}|^2).

From this, we conclude that the set of small relevant \mathbf{h} is described by \mathbf{h}\nabla g=O(|\mathbf{h}|^2). And now we really can revert to the original argument. If there’s a small \mathbf{h} such that \mathbf{h}\cdot \nabla g=0 but \mathbf{h}\cdot \nabla f \ne 0, then we can find some \mathbf{h'_+},\mathbf{h'_-}=\pm \mathbf{h}+O(|\mathbf{h}|^2) such that at least one of f(\mathbf{h'_+}),f(\mathbf{h'_-})<f(\mathbf{h}).

So a necessary condition to be a constrained local minimum is that every vector which is perpendicular to \nabla g must also be perpendicular to \nabla f. From which it follows that these two vectors must be parallel, that is \nabla f(\mathbf{x})=\lambda \nabla g(\mathbf{x}), where \lambda is the so-called Lagrange multiplier. Of course we must also have that g(\mathbf{x})=0, and so we have a complete characterisation for a necessary condition that the constrained optimisation has a local minimum at \mathbf{x}, assuming that all the derivatives of both f and g exist with suitable regularity near \mathbf{x}.


The point of setting up the one-variable case in the unusual way in the previous post was to allow me to say at this stage: “it’s exactly the same”. Well, we’ve already seen an extra differentiability condition we might require, but apart from that, the same approach holds. Multi-variate continuous functions also attain their bounds when the domain is finite and includes its boundary.

Checking the boundary might be more complicated in this setting. If the underlying domain is \{x,y,z\ge 0\}, then one will have to produce a separate argument for why the behaviour when at least one of the variables is zero fits what you are looking for. Especially in the constrained case, it’s possible that the surface corresponding to the constraint doesn’t actually have a boundary, for example if it is the surface of a sphere. Similarly, checking that the objective function gets large as the variables diverge to infinity can be annoying, as there are many ‘directions’ down which to diverge to infinity.

Motivating Cauchy-Schwarz

In practise, you probably want to have such methods in hand as a last resort on olympiad problems. It’s always possible that something will slip through the net, but typically problem-setters are going to trying to ensure that their problems are not amenable to mindless application of non-elementary methods. And even then, one runs the risk of accusations of non-rigour if you don’t state exact, precise results which justify everything which I’ve presented above.

One thing that can be useful, on the other hand, is to observe that the Lagrange multiplier condition looks a lot like the equality condition for Cauchy-Schwarz. So, even if you can’t solve the family of Lagrange multiplier ‘equations’, this does suggest that applying Cauchy-Schwarz to the vectors involved might give you some insight into the problem.

The following inequality from the IMO 2007 shortlist is a good example.

Suppose a_1,\ldots,a_{100}\ge 0 satisfy a_1^2+\ldots+a_{100}^2=1. Prove that

a_1^2a_2+a_2^2a_3+\ldots + a_{100}^2 a_1<\frac{12}{25}.

We shouldn’t be perturbed by the strictness. Maybe we’ll end up showing a true bound in terms of surds that is less neat to write down…

Anyway, applying Lagrange multipliers would require us to solve

a_{k-1}^2 + 2a_ka_{k+1}=2\lambda a_k,\quad k=1,\ldots,100,

with indices taken modulo 100. As so often with these cyclic but non-symmetric expressions, this looks quite hard to solve. However, it turns out that by applying Cauchy-Schwarz to the vectors (a_k),(a_{k-1}^2+2a_ka_{k+1}) gets us a long way into the problem by classical means. Working all the way through is probably best left as an exercise.

Lagrange multipliers Part One: A much simpler setting

I am currently in northern Hungary for our annual winter school for some of the strongest young school-aged mathematicians in the UK and Hungary. We’ve had a mixture of lectures, problem-solving sessions and the chance to enjoy a more authentic version of winter than is currently on offer in balmy Oxford.

One of my favourite aspects of this event is the chance it affords for the students and the staff to see a slightly different mathematical culture. It goes without saying that Hungary has a deep tradition in mathematics, and the roots start at school. The British students observe fairly rapidly that their counterparts have a much richer diet of geometry, and methods in combinatorics at school, which is certainly an excellent grounding for use in maths competitions. By contrast, our familiarity with calculus is substantially more developed – by the time students who study further maths leave school, they can differentiate almost anything.

But the prevailing attitude in olympiad circles is that calculus is unrigorous and hence illegal method. The more developed summary is that calculus methods are hard, or at least technical. This is true, and no-one wants to spoil a measured development of analysis from first principles, but since some of the British students asked, it seemed worth giving a short exposition of why calculus can be made rigorous. They are mainly interested in the multivariate case, and the underlying problem is that the approach suggested by the curriculum doesn’t generalise well at all to the multivariate setting. Because it’s much easier to imagine functions of one variable, we’ll develop the machinery of the ideas in this setting in this post first.

Finding minima – the A-level approach

Whether in an applied or an abstract setting, the main use of calculus at school is to find where functions attain their maximum or minimum. The method can be summarised quickly: differentiate, find where the derivative is zero, and check the second-derivative at that value to determine that the stationary point has the form we want.

Finding maxima and finding minima are a symmetric problem, so throughout, we talk about finding minima. It’s instructive to think of some functions where the approach outlined above fails.

20160131_124059In the top left, there clearly is a minimum, but the function is not differentiable at the relevant point. We can probably assert this without defining differentiability formally: there isn’t a well-defined local tangent at the minimum, so we can’t specify the gradient of the tangent. In the top right, there’s a jump, so depending on the value the function takes at the jump point, maybe there is a minimum. But in either case, the derivative doesn’t exist at the jump point, so our calculus approach will fail.

In the middle left, calculus will tell us that the stationary point in the middle is a ‘minimum’, but it isn’t the minimal value taken by the function. Indeed the function doesn’t have a minimum, because it seems to go off to -\infty in both directions. In the middle right, the asymptote provides a lower bound on the values taken by the function, but this bound is never actually achieved. Indeed, we wouldn’t make any progress by calculus, since there are no stationary points.

At the bottom, the functions are only defined on some interval. In both cases, the minimal value is attained at one of the endpoints of the interval, even though the second function has a point which calculus would identify as a minimum.

The underlying problem in any calculus argument is that the derivative, if it exists, only tells us about the local behaviour of the function. At best, it tells us that a point is a local minimum. This is at least a necessary condition to be a global minimum, which is what we actually care about. But this is a change of emphasis from the A-level approach, for which having zero derivative and appropriately-signed second-derivative is treated as a sufficient condition to be a global minimum.

Fortunately, the A-level approach is actually valid. It can be shown that if a function is differentiable everywhere, and it only has one stationary point, where the second-derivative exists and is positive, then this is in fact the global minimum. The first problem is that this is really quite challenging to show – since in general the derivative might not be continuous, although it might have many of the useful properties of a continuous function. Showing all of this really does require setting everything up carefully with proper definitions. The second problem is that this approach does not generalise well to multivariate settings.

Finding minima – an alternative recipe

What we do is narrow down the properties which the global minimum must satisfy. Here are some options:

0) There is no global minimum. For example, the functions pictured in the middle row satisfy this.

Otherwise, say the global minimum is attained at x. It doesn’t matter if it is attained at several points. At least one of the following options must apply to each such x.

1) f'(x)=0,

2) f'(x) is not defined,

3) x lies on the boundary of the domain where f is defined.

We’ll come back to why this is true. But with this decomposition, the key to identifying a global minimum via calculus is to eliminate options 0), 2) and 3). Hopefully we can eliminate 2) immediately. If we know we can differentiate our function everywhere, then 2) couldn’t possibly hold for any value of x. Sometimes we will be thinking about functions defined everywhere, in which case 3) won’t matter. Even if our function is defined on some interval, this only means we have to check two extra values, and this isn’t such hard work.

20160131_124716It’s worth emphasising why if x is a local minimum not on the boundary and f'(x) exists, then f'(x)=0. We show that if f'(x)\ne 0, then x can’t be a local minimum. Suppose f'(x)>0. Then both the formal definition of derivative, and the geometric interpretation in terms of the gradient of a tangent which locally approximates the function, give that, when h is small,

f(x-h) = f(x)-h f'(x) +o(h),

where this ‘little o’ notation indicates that for small enough h, the final term is much smaller than the second term. So for small enough h, f(x-h)<f(x), and so we don’t have a local minimum.

The key is eliminating option 0). Once we know that there definitely is a global minimum, we are in a good position to identify it using calculus and a bit of quick checking. But how would we eliminate option 0)?

Existence of global minima

This is the point where I’m in greatest danger of spoiling first-year undergraduate course content, so I’ll be careful.

As we saw in the middle row, when functions are defined on the whole real line, there’s the danger that they can diverge to \pm \infty, or approach some bounding value while never actually attaining it. So life gets much easier if you work with functions defined on a closed interval. We also saw what can go wrong if there are jumps, so we will assume the function is continuous, meaning that it has no jumps, or that as y gets close to x, f(y) gets close to f(x). If you think a function can be differentiated everywhere, then it is continuous, because we’ve seen that once a function has a jump (see caveat 2) then it certainly isn’t possible to define the derivative at the jump point.

It’s a true result that a continuous function defined on a closed interval is bounded and attains its bounds. Suppose such a function takes arbitrarily large values. The main idea is that if the function takes arbitrarily large values throughout the interval, then because the interval is finite it also takes arbitrarily large values near some point, which will make it hard to be continuous at that point. You can apply a similar argument to show that the function can’t approach a threshold without attaining it somewhere. So how do you prove that this point exists? Well, you probably need to set up some formal definitions of all the properties under discussion, and manipulate them carefully. Which is fine. If you’re still at school, then you can either enjoy thinking about this yourself, or wait until analysis courses at university.

My personal opinion is that this is almost as intuitive as the assertion that if a continuous function takes both positive and negative values, then it has a zero somewhere in between. I feel if you’re happy citing the latter, then you can also cite the behaviour of continuous functions on closed intervals.

Caveat 2) It’s not true to say that if a function doesn’t have jumps then it is continuous. There are other kinds of discontinuity, but in most contexts these are worse than having a jump, so it’s not disastrous in most circumstances to have this as your prime model of non-continuity.

Worked example

Question 1 of this year’s BMO2 was a geometric inequality. I’ve chosen to look at this partly because it’s the first question I’ve set to make it onto BMO, but mainly because it’s quite hard to find olympiad problems which come down to inequalities in a single variable.

Anyway, there are many ways to parameterise and reparameterise the problem, but one method reduces, after some sensible application of Pythagoras, to showing

f(x)=x+ \frac{1}{4x} + \frac{1}{4x+\frac{1}{x}+4}\ge \frac{9}{8}, (*)

for all positive x.

There are simpler ways to address this than calculus, especially if you establish or guess that the equality case is x=1/2. Adding one to both sides is probably a useful start.

But if you did want to use calculus, you should argue as follows. (*) is certainly true when x\ge \frac{9}{8} and also when $x\le \frac{2}{9}$. The function f(x) is continuous, and so on the interval [\frac{2}{9},\frac{9}{8}] it has a minimum somewhere. We can differentiate, and fortunately the derivative factorises (this might be a clue that there’s probably a better method…) as

(1-\frac{1}{4x^2}) \left[ 1 - \frac{4}{(4x+\frac{1}{x}+4)^2} \right].

If x is positive, the second bracket can’t be zero, so the only stationary point is found at x=1/2. We can easily check that f(\frac12)=\frac98, and we have already seen that f(\frac29),f(\frac98)>\frac98. We know f attains its minimum on [\frac29,\frac98], and so this minimal value must be $\frac98$, as we want.

Overall, the moral of this approach is that even if we know how to turn the handle both for the calculation, and for the justification, it probably would be easier to use a softer approach if possible.

Next stage

For the next stage, we assess how much of this carries across to the multivariate setting, including Lagrange multipliers to find minima of a function subject to a constraint.

Perturbation of Eigenvectors

In the previous post, I talked about eigenvalues, and some alternative characterisations which could be useful in some circumstances. Recently, I’ve been interested in controlling how eigenvalues and eigenvectors change as the matrix is varied. My particular example concerns positive matrices, which have a well-defined largest eigenvalue (or Perron root), and a unique (up to normalising in some way) principal eigenvector.

We might expect that perturbing a matrix slightly does not change the eigenvectors very much, since any original eigenvector is still almost an eigenvector, in the sense that its image under the action of the perturbed matrix is almost equal to a multiple of itself. But how to make this precise? And when does it go wrong?

Eigenvalues – The non-multiple case

Throughout, we assume we have a k x k matrix. We might want to allow the entries to be complex, but for now, real entries are perfectly interesting enough.

It makes sense to start with eigenvalues, since it’s easy to define these through the characteristic equation of the matrix. The coefficients of this polynomial are well-behaved (indeed polynomial) functions of the entries of the matrix. So we are really asking how the set of roots of a finite polynomial evolves as the (k+1) coefficients of the polynomial evolve. It is fairly clear that, under any sensible choice of topology on the space of k-(multi)-subsets of \mathbb{C}, the multiset of roots is continuous in the coefficients of the polynomial.

To say anything more precise, we have to introduce some notation.

Let \chi_{A}(z)=z^k+\gamma_{k-1}(A)z^{k-1}+\ldots+\gamma_1(A)z+\gamma_0(A) be the characteristic polynomial of A. Each \gamma_i is a polynomial of degree k-i in the entries of A. Let’s consider now a matrix-valued function A(t), and we assume that the entries of A(t) are all differentiable with respect to t. So each \gamma_i(A(t)) is also differentiable with respect to t.

At this point, let’s make the assumption that t lies in some interval [r,s] for which the eigenvalues of A(t) are distinct. Let \lambda(t) be some eigenvalue of A(t), chosen such that \lambda is a continuous function of t. For example, we might take \lambda(t)=\Lambda_1(t), the eigenvalue with largest absolute value (with some canonical tie-breaking mechanism). Then \chi_{A(t)}(\lambda(t))=0, and so differentiating with respect to \gamma_i:

0=\chi'_{A(t)}(\lambda(t)) \frac{\partial \lambda}{\partial \gamma_i} \Big|_{A(t)} + \lambda(t)^i.

Because we deliberately demanded that the eigenvalues were disjoint, we have \chi'_{A(t)}(\lambda(t))\ne 0, and so \frac{\partial \lambda(t)}{\partial \gamma_i}=-\lambda(t)^i / \chi'_{A(t)}(\lambda(t)). In particular, \lambda(t) is differentiable with respect to the coefficients of the characteristic polynomial, and thus with respect to t also.

Multiple Eigenvalues

It gets more complicated when the characteristic equation has multiple roots. Typically we will be interested in the evolution of the eigenvalue with some extremal property, probably the largest one. Let’s restrict to the real, symmetric case, where the set of eigenvalues is complete and real. Suppose we have t_0 such that A(t_0) has a repeated eigenvalue. Then, in a small enough region of t_0, we can define eigenvalues \lambda(t),\mu(t) continuously such that \lambda(t_0)=\mu(t_0) while \lambda(t)\ne \mu(t) for t=\ne t_0. Then, if the entries of A(t) are analytic functions of t, then so are \lambda(t),\mu(t).

But then \max(\lambda(t),\mu(t)) will in general not be analytic, as the maximum of two smooth functions is in general Lipschitz.

This effect is most obvious in the case of a diagonal matrix A(t)=\begin{pmatrix}t&0\\0&-t\end{pmatrix}, for which the largest eigenvalue is |t|.


When the matrix A is real and symmetric, we know it has real eigenvalues, and an orthogonal basis of eigenvectors. Then the Rayleigh quotient characterises the eigenvector as well as the eigenvalue. Recall that for any x\in\mathbb{R}^k with ||x||_2=1, we have

\lambda_1\ge x^T A x \ge \lambda_k,

with equality precisely at the respective eigenvectors. So if we perturb A slightly, keeping it real and symmetric, we can control the principal eigenvector quite well by this method.

If A is not diagonalisable, we can still say something about this principal eigenvector, via large powers of A, sometimes called the Van Mises iteration. This says that for large N, A^N v should have direction close to that of the eigenvector, for any test vector v. The rate of convergence depends on the ratio of the largest eigenvalue to the second largest eigenvalue, though if the matrix is not diagonalisable, it is not completely trivial to quantify this convergence. We have to be careful though, since A maps the subspace orthogonal to the eigenvector to itself, so the magnitude of the projection of v onto the eigenvector determines the speed of convergence. Indeed, if v is orthogonal to the eigenvector, it won’t converge towards the principal eigenvector at all. (But if there is a well-defined ‘second eigenvector’ then it will converge towards that.)

Continuity of Eigenvectors

The reason why I ended up reading about some of these topics was that I wanted to show that the Perron eigenvector of a positive matrix (that is, the unique eigenvector corresponding to the Perron root) was Lipschitz continuous as a function of the entries of the positive matrix. Since for such a matrix, the largest eigenvalue is simple, we are able to make some progress.

In general, the condition that v is an eigenvector of matrix A with eigenvalue \lambda is described by the relation:

(A-\lambda I)v=0,\quad ||u||_1=1, (*)

or whatever the most appropriate normalising condition appears. This describes an implicit relation between A and the eigenvalue-eigenvector pair (\lambda,v). So given a matrix A_0 with eigenvalue \lambda_0 corresponding to eigenvector v_0, in a neighbourhood of A_0 in \mathbb{R}^{k\times k} we can use the implicit function theorem to comment on the differentiability of (\lambda,v) with respect to A in this neighbourhood.

Precisely, we require the matrix of partial derivatives from (*)

\begin{pmatrix} A_0-\lambda_0 I&v_0 \\ \mathbf{1}&0\end{pmatrix}

to have non-zero determinant. But if \lambda_0 is not simple, then if we apply this matrix from the left to one of the other eigenvectors (with a zero appended) we can see that it has non-trivial kernel. With a bit more work, we can show the converse too, and conclude that (\lambda,v) are smooth with respect to A in some neighbourhood of A_0.

Finally, we observe that when the eigenvalues are not simple, we can’t even guarantee continuity of the eigenvectors. This is unsurprising really, since for a multiple eigenvalue, a) we might not know how many LI eigenvectors exists; and b) we might have complete freedom over the choice of eigenvectors. Think about the identity matrix! Indeed the eigenvectors of \begin{pmatrix}1+\epsilon &0\\ 0&1+\epsilon\end{pmatrix} are (1,0) and (0,1), while the eigenvectors of \begin{pmatrix}1&\epsilon\\ \epsilon&1\end{pmatrix} are (1,1), (1,-1). So no continuous choice of eigenvectors is possible here.

Characterisations of Eigenvalues

I’ve been working for much of the past few months on a version of the frozen percolation random graph process with types. The connectivity between types is controlled by a (finite) non-negative square matrix, and so I’ve been engaging with linear algebra theory to an extent I haven’t really experienced since the second or third year of undergraduate maths.

We are interested in whether the graphs in question are subcritical, critical or supercritical. As in the case of multitype branching processes, this is controlled by the principal eigenvalue of a related non-negative matrix. So I’ve been looking up lots of methods for controlling eigenvalues, and some have proved useful, and some have not, but I thought it would be worthwhile to present some of them here.

Bounds and characterisations of spectral radius

Throughout, I will be talking about finite, square matricies. Eigenvalues may be defined as roots of the characteristic polynomial, and so by the fundamental theorem of algebra, there is always at least one complex eigenvalue. There is always at least one eigenvector associated to any eigenvalue. However, the dimension of the eigenspace is not always the same as the multiplicity of the eigenvalue as a root of the characteristic polynomial. The latter is called algebraic multiplicity, while the former is geometric multiplicity.

For now though, this distinction will be unimportant. The spectral radius of a matrix A is defined as

\rho(A)=\max \{|\lambda|\, : \, \lambda \text{ and eigenvalue of }A\}.

We can bound the spectral radius in terms of the norm of the matrix. Remember that a matrix norm has to satisfy all the usual properties of a norm, as well as a submultiplicative property |||AB|||\le |||A|||\cdot |||B|||. This is good, as otherwise we would be free to replace any norm by an arbitrary multiple of itself, and so no useful bounds could ever emerge. Note that the submultiplicativity implies that |||I_n||\ge 1.

Now, let \lambda,x be some eigenvalue and associated (right-)eigenvector respectively of matrix A. Let X be the square matrix given by taking all the columns to be x. Now Ax=\lambda x implies AX=\lambda X, and so

|\lambda| \cdot|||X||| = |||\lambda X||| = |||AX||| \le |||A|||\cdot |||X|||,

and thus we conclude our most basic bound \lambda \le |||A|||.

When A is diagonalisable, life is particularly easy, but in general we can write A as a conjugate of its Jordan normal form. Then, by looking at each diagonal block of the Jordan normal form separately, we can show that

\lim_{k\rightarrow 0}A^k = 0\quad \iff \quad \rho(A)<1.

Then, applying this, with additional care, to the matrices A / (\rho(A)\pm \epsilon), we derive Gelfand’s Formula, that \rho(A) = \lim_{k\rightarrow \infty} ||A^k||^{1/k}. Again, this applies for any matrix norm.

Real symmetric matrices

When the matrix is real and symmetric, it is not too hard to show that all the eigenvalues are real, and furthermore that all the geometric multiplicities are equal to the algebraic multiplicities. That is, the matrix is diagonalisable, and there is an (orthogonal) basis of eigenvectors. Once we assume we are working with respect to this eigenbasis, it is easy to see how the Rayleigh quotient characterisation of the largest (and smallest) eigenvalue works. Let’s say the eigenvalues are \lambda_1\ge \lambda_2\ge\ldots \ge \lambda_n, then for any ||x||_2=1, we have \lambda_1\ge x^T A x\ge \lambda_n, and equality is attained when x is the respective eigenvector, normalised appropriately.

This is an especially useful characterisation of the largest eigenvalue, as for example we can see fairly easily that this means \lambda_1 is a convex function of the (real, symmetric) matrix.

We can generalise this Rayleigh quotient idea if we take k orthonormal vectors in R^k, arrange them in an nxk matrix P, so that P^T P = I_k. Now we consider the matrix P^TAP. [Note that if k=1, we are exactly considering x^TA x as before.] Then Poincare’s Separation Theorem say that the eigenvalues \mu_1\ge \mu_2\ge\ldots \mu \mu_k of P^TAP (which is also real, symmetric) are bounded by the original eigenvalues:

\lambda_{n-k+i} \ge \mu_i\ge \lambda_i.

Since the trace is preserved under conjugation, and the trace is the sum of eigenvalues, we can apply this result with P’s columns taken to be the any k canonical basis vectors of \mathbb{R}^k. Without loss of generality, we may assume the basis has been chosen so that the diagonal elements of A satisfy a_{11}\ge a_{22}\ge\ldots\ge a_{nn}, and so now we have that the sequence (a_{11},a_{22},\ldots,a_{nn}) is majorised by (\lambda_1,\lambda_2,\ldots,\lambda_n) and majorises (\lambda_n,\lambda_{n-1},\ldots,\lambda_1). The first of these relations can be used via the setup of Karamata’s inequality to conclude that for any convex function f, we have

\sum_{i=1}^n f(\lambda_i)\ge \sum_{i=1}^n f(a_{ii}).

Gershgorin Circles

In fact, we can relate the eigenvalues to the diagonal entries of the matrix in a more general setting. We are motivated by the thought that if the off-diagonal entries are all very small, then the set of eigenvalues should be approximately given by the set of diagonal entries.

For a square complex matrix, let \lambda,x be an eigenvalue, eigenvector pair. For any index i, we have

\lambda - a_{ii}= \frac{\sum_j a_{ij}x_j}{x_i} - a_{ii} = \frac{\sum_{j\ne i}a_{ij}x_j}{x_i}.

Now consider the i such that x_i=\max |x_j|, and take absolute values and apply the triangle inequality,

|\lambda - a_{ii}| \le \sum_{j\ne i} \left| \frac{a_{ij}x_j}{x_i} \right| \le \sum_{j\ne i}|a_{ij}|.

Let’s define R_i=\sum_{j\ne i}|a_{ij}| to be the sum of the non-diagonal entries of the ith row. Then the Gershgorin circle theorem says that every eigenvalue lies within at least one of the discs B(a_{ii},R_i), in the complex plane. So our motivation still makes sense. If the off-diagonal entries are small, this is a strong restriction, and if they are not typically smaller than the diagonal entries, then we perhaps do not learn very much. Obviously, we could apply the same argument to the columns too.

When the diagonal entries are distinct, and the off-diagonal entries are small, the Gershgorin discs are distinct, and we would expect each to contain exactly one eigenvalue, corresponding to the appropriate diagonal entry. In fact, we can say something stronger. In general, the union of the discs is a subset of the complex plane with some connected components. Then, if a component is the union of exactly r discs, then it contains exactly r of the eigenvalues.

To see this, consider multiplying all the off-diagonal entries by z\in[0,1] and observe what happens as z varies from 0 to 1. When z=0, the matrix is diagonal, and each eigenvalue is in the Gershgorin disc (which is a single complex number). As z varies continuously, the characteristic polynomial varies continuously, and also its roots, that is the set of eigenvalues. So since each of the r eigenvalues are initially within the union of the r original, large Gershgorin discs, they must remain within this union as z varies, since they cannot ‘jump’ to another component.

It’s hard to know how time will allow, but provisionally in the next post I will talk about how to control the evolution of eigenvectors as a function of the matrix, and in particular what can go wrong.


For the middle section, I used the progression from Chapter 4 of Matrix Differential Calculus with Applications in Statistics and Econometrics (Magnus and Neudecker).

Gromov-Hausdorff Distance on Trees

This post continues the exposition of Gromov-Hausdorff distance, as introduced in Chapter Four of Steve Evans’ Probability and Real Trees, which we are reading as a group in the Stats Dept in Oxford at the moment. In this post, we consider applications of the Gromov-Hausdorff distance we have just introduced in the context of trees, viewed as metric spaces.

First we consider a direct application of the previous result, which related the Gromov-Hausdorff distance to the infimum of the distortion across the family of correspondences between the two relevant metric spaces. I found this as Corollary 3.7 in notes by Le Gall and Miermont on scaling limits of random trees and maps, which can be found here. I’m not clear whether there is an original source, but the result is simple enough that probably it does not matter hugely.

Proposition – Given f,g excursions above [0,1], and T_f, T_g the real trees associated with these excursions in the standard way, then

d_{GH}(T_f,T_g)\le 2 ||f-g||.

Proof: We construct an appropriate correspondence

\mathcal {R}=\left\{(a,b) \,:\, \exists t\in[0,1]\text{ s.t. } a=p_f(t), b=p_g(t)\right\}.

In other words, the trees are defined as projections from [0.1] (with some equivalence structure), so taking pairs of projections from [0,1] gives a natural correspondence. Now, suppose (p_f(s),p_g(s)), (p_f(t),p_g(t))\in \mathcal {R}. Then

d_{T_f}(p_f(s),p_f(t)) = f(s) + f(t) - 2\hat f(s,t),

where \hat f(s,t):= \min{r\in[s,t]}f(r). Obviously, we can replace f by g to obtain

d_{T_g}(p_g(s),p_g(t)) = g(s) + g(t) - 2\hat g(s,t).

By thinking slightly carefully about where the functions achieve their minima compared with where the minimum in the sup norm is achieved, we can conclude that

|d_{T_f}(p_f(s),p_f(t)) - d_{T_g}(p_g(s),p_g(t)) | \le 4||f-g||,

and so the result follows from the relation between Gromov-Hausdorff distance and the infimum of distortions over the set of correspondences proved at the end of the previous post.

Gromov-Hausdorff Limits of Real Trees

If we are going to consider the Gromov-Hausdorff topology for limits of tree, we want to be sure that the limit of a sequence of real trees is another real tree. In particular, we want to show that this convergence preserves the property of being a geodesic space.

Theorem 4.19 – Given X_n geodesic spaces, and X complete, such that X_n \stackrel{d_{GH}}\rightarrow X, then X is geodesic.

Proof: Here, I’ll go in the opposite order to Evans’ book, as I think it’s easier understand why a special case of this implies the whole result once you’ve actually shown that special case.

Anyway, we want to show that given x,y\in X, there is a geodesic segment x\leftrightarrow y in X. We will start by showing that there is a well-defined midpoint of the geodesic, that is a point z \in X such that d(x,z)=d(y,z)=\frac12 d(x,y).

Given arbitrary \epsilon>0, we can take n such that d_{GH}(X,X_n) < \frac{\epsilon}{3} and then a correspondence \mathcal{R} between X,X_n with \mathrm{dis}\mathcal {R}<\frac{2\epsilon}{3}. Now, by definition of correspondence, we have x',y'\in X_n such that (x,x'),(y,y')\in\mathcal{R}. But we do have geodesics in X_n, so we can take z'\in X_n the midpoint of geodesic x'\leftrightarrow y'. Predictably, we now take z\in X such that (z,z')\in\mathcal {R}.

We can show that z is ‘almost’ the midpoint of [x,y] in the sense that

|d(x,z) - \frac12 d(x,y)| \le |d(x',z')-\frac12 d(x',y')+\frac32 \mathrm{dis}\mathcal{R} \le \epsilon.

Similarly, we have |d(y,z)-\frac12 d(x,y)|\le \epsilon.

Perhaps it’s helpful to think of this point z that we’ve constructed as being like a taut, but not quite rigid string. The midpoint of the string has to be fairly near the midpoint of the endpoints, and in particular, as we let \epsilon\rightarrow 0, the z’s we deal with form a Cauchy sequence, and thus converge to some point, which (in a case of poor notation planning) we also call z \in X, which is the midpoint of [x,y] as described before.

We can now iterate this iteration, to demonstrate that whenever q is a dyadic rational in [0,1], there exists z_q \in X such that

d(x,z_q)=q d(x,y),\quad d(y,z_q) = (1-q)d(x,y).

Again then, if we want the above to hold for some general real r in [0,1], we can approximate r arbitrarily well by dyadic rationals q, and the associated points z_q are Cauchy and thus have a well-defined limit with the required properties. We thus have our geodesic segment x\leftrightarrow y.

Rooted Gromov-Hausdorff Distance

In the end, the trees we want to compare might be rooted. For example, we talk about finite trees being invariant under random re-rooting, and we might be interested in similar results for real trees, in particular the BCRT. So we need to compare metric spaces as viewed outwards from particular identified points of each space.

An isometry of rooted spaces must map the root to the root, and so we adjust to obtain rooted Gromov-Hausdorff distance. We might try to consider embeddings into a common space so that the roots are shared, but it will be more convenient to maintain the infimum over metrics on the disjoint union as before. But to ensure the roots are in roughly the ‘same’ place in both set embeddings, we minimise the maximum of the Hausdorff distance between the sets, and the distance between the images of the roots in the common covering space.

Similar results apply as in the unrooted case, and normally the proofs are very similar. As we might expect, when defining a correspondence between rooted spaces, we demand that the pair of roots is one of the roots in the correspondence, and then the same equivalence between the minimal distortion and the rooted G-H distance applies.

Evans shows that \mathbb{T}^{\mathrm{root}}, the set of compact rooted trees is complete and separable under the rooted G-H topology. Separability is relatively easy to see. Compact trees have finite \epsilon-nets, and there is a canonical way to view the net as the vertices of a finite tree with edge lengths. Approximating these edge lengths by rationals and consider the countable family of isomorphism classes of rooted finite trees gives separability.

If we want, we can also define k-pointed Gromov-Hausdorff distance, where we demand k points in each space are held fixed.

Tree \eta-erasure

To show that this is a natural topology to consider for the family of trees, Evans devotes a short section to the operation of \eta-erasure, where all points within \eta of a leaf are removed from a given tree.

Formally, R_\eta is a map \mathbb{T}^{\mathrm{root}}\rightarrow \mathbb{T}^{\mathrm{root}} (recalling that these are compact real trees), so that R_\eta(T) consists of the tree

\{\rho\}\cup\{a \in T \, : \, \exists x\in T,\, a\in[\rho, x],d_T(a,x)\ge \eta\},

rooted again at \rho. We claim that the range of R_\eta is the set of compact rooted trees with a finite number of leaves. In this setting, we want a geodesic definition of leaf, for example a leaf is a point that doesn’t lie in the interior of the (unique) geodesic segment between any other point and the root.

Given a tree T with a finite number of leaves, we can glue disjoint segments of length \eta onto every leaf. Taking R_\eta of this deeper tree will give T. Similarly, suppose R_\eta(T) has infinitely many leaves, which we can label a_1,a_2,\ldots. Thus we also have x_1,x_2,\ldots \in T such that a_i\in[\rho,x_i], and the segments \{[a_i,x_i]\} are disjoint, and all have length at least \eta, hence T cannot be compact, as it has no finite \frac{\eta}{2} net.

It is clear that for fixed T, the family of these maps applied to T is continuous with respect to \eta in the G-H topology. When \eta is changed a small amount, the amount of extra tree removed is locally small, and so approximating correspondences by points in what is left is absolutely fine. Indeed, the operations T_{\eta_1}, T_{\eta_2} commute to give T_{\eta_1+\eta_2}.

We want to show that for fixed \eta this map R_\eta is continuous with respect to G-H topology. Suppose two compact rooted trees S,T are covered by a rooted correspondence \mathcal R with distortion \epsilon \ll \eta. We can’t immediately restrict \mathcal R to R_\eta(S)\times R_\eta(T), as it won’t necessarily be a surjection under the projection maps any more.

But we can get around this. Note that if (x,y)\in\mathcal{R}, then |d_S(\rho_S,x) - d_T(\rho_T,y)| \le \epsilon by assumption. We will construct a correspondence \mathcal R' on the erased trees as follows. Given (x,y)\in\mathcal{R}, if (x,y)\in R_\eta(S)\times R_\eta(T), we keep it, and if x\not\in R_\eta(S), y\not \in R_\eta(y), we throw it away. Suppose we have (s,t) in \mathcal{R} with s\in R_\eta(S), t\not \in R_\eta(T). Let l_s be a leaf of S such that s \in [\rho_S,l_s], and let \bar t be the farthest point from the root of the geodesic [\rho_T,t] restricted to R_\eta(T). So the tree above \bar t includes t and has height at most \eta.

In words, if a point appears in a pair in the correspondence but is removed by the erasure, we replace it in the pair with the point closest to the original point that was not removed by the erasure.

It remains to prove that this works. My original proof was short but false, and its replacement is long (and I hope true now), but will postpone writing this down either until another post, or indefinitely. The original proof by Evans and co-authors can be found as the main content of Lemma 2.6 in the original paper [1].


[1] Evans, Pitman, Winter – Rayleigh processes, real trees, and root growth with regrafting.

Gromov-Hausdorff Distance and Correspondences

This term, some members of the probability group have been reading Probability and Real Trees, by Steve Evans based on his Saint-Flour course in 2005. A pdf can be found here. This morning was my turn to present, and I gave a precis of Chapter 4, which concerns metrics on metric spaces, a family of tools which will be essential for later chapters which discuss convergence of trees, viewed as metric objects. Hausdorff Distance We start by considering a metric on subsets of a given base space X. The Hausdorff distance between two sets A, B is defined as

d_H(A,B):=\inf\{ r>0: A\subset U_r(B), B\subset U_r(A)\},

where U_r(A):=\{x\in X, d(x,A)< r\} consists of set A, and all the points within r of set A. So the Hausdorff distance measures how much we have to fatten each set before it contains the other. Note that if we have a giant set next to a tiny one, we will have to fatten the tiny one a great deal more to achieve this. Sometimes it will be more helpful to think of the following alternative characterisation

d_H(A,B) = \max\left \{ \sup_{a\in A}\inf_{b\in B} d(a,b), \inf_{a\in A}\sup_{b\in B} d(a,b)\right\}.

In words, we measure how far away is the point in A farthest from B, and vice versa, and take the larger of the two. The presence of the sups and infs indicates that the inclusion or otherwise of the boundaries of A,B does not affect this distance. In particular, this means that d_H(A,\bar A) = d_H(A,A^{\circ}), and so to allow us to call Hausdorff distance a metric, we restrict attention to the closed subsets of X, M(X).

We also observe immediately that if X does not have bounded diameter, then it is possible for the Hausdorff distance between two sets to be infinite. We won’t worry about that for now, because we will mainly be considering host spaces which are compact, for which the following results will be useful.

Lemma 4.4 – If (X,d) is complete, then (M(X),d_H) is also complete.

Proof: Assume we have a sequence of closed sets S_1,S_2,\ldots \subset X which have the Cauchy property under d_H. I’m going to do this slightly differently to Evans, because it’s not the case you can immediately choose x_n\in S_n for each n, such that d(x_n,x_m)\le d_H(S_n,S_m) for all m,n. For an explicit counterexample, see comment from Ed Crane below the article.

Note that if a subsequence of a Cauchy sequence converges to a limit, then the whole sequence converges to that same limit. So we can WLOG replace S_1,S_2,\ldots by some subsequence such that d_H(S_n,S_{n+1})\le 2^{-n}. Now it is clear that for any x_n\in S_n, there is a choice of x_{n+1}\in S_{n+1} such that d(x_n,x_{n+1})\le 2^{-n} (*). Starting from arbitrary x_1\in S_1, we can construct in this manner a sequence x_1,x_2,\ldots,\in X that is Cauchy, and thus has a limit point x\in X.

Let \mathcal{X} be the set of sequences (x_m,x_{m+1},\ldots) for some m, with x_n\in S_n\,\forall n\ge m, satisfying (*). Now let S be the closure of the set of limit points of this family of sequences, which we have shown is non-empty.

Then for any n, and any x_n\in S_n, we can construct such a sequence, and its limit point x, and the triangle inequality down the path x_n,x_{n+1},\ldots gives d(x_n,S)\le 2^{-(n-1)}. Furthermore, by construction S\subset U_{2^{-(n-1)}}(S_n), hence it follows that S_n \stackrel{d_H}\rightarrow S.

Lemma 4.5 – Given (X,d) compact, (M(X),d_H) is also compact.

Sketch proof: We are working with metric spaces, for which the existence of a finite \epsilon-net for every \epsilon>0 is equivalent to compactness. [An \epsilon-net is a set S of points in the space such that every point of the space lies within \epsilon of an element of S. Thinking another way, it is the set of centres of the balls in a finite covering of the space by \epsilon-balls.] It is not too hard to check that if S_\epsilon is an \epsilon-net for (X,d), then \mathcal{P}(S_\epsilon) is finite, and an \epsilon-net for (M(X),d_H).

Gromov-Hausdorff Distance

So far this is fine, but won’t be useful by itself for comparing how similar two trees are as metric spaces, because we can’t be sure a priori that we can embed them in a common host space. To resolve this, we consider instead the Gromov-Hausdorff distance, which will serve as a distance between metric spaces, even when they are not canonically defined as subsets of a common metric space.

Given X, Y metric spaces, we define

d_{GH}(X,Y)=\inf_Z \left\{ d_H(X',Y') \, : \, X',Y' \subset (Z,d)\text{ a metric space }, X'\simeq X, Y'\simeq Y\right\}.

In words, the Gromov-Hausdorff distance between two metric spaces examines the ways to embed them isometrically into a common larger metric space, and gives the minimal Hausdorff distance between them under the class of such embeddings. One issue is that the collection of all metric spaces is not a set. For example, given any set, we can define a metric via the discrete metric, so the collection of metric spaces is at least as large as the collection of all sets, which is not a set. Fortunately, all is not broken, as when we consider a general metric space Z in which we might embed copies of X and Y we are wasting lots of the perhaps very complicated space, because we only need to compare the subsets which are isometric copies of X and Y. So in fact, we lose nothing if we assume that Z is a disjoint union of copies of X and Y, with a metric chosen appropriately. So

d_{GH}=\inf\left\{ d_H(X,Y) : d\text{ a metric on }X\coprod Y\text{ restricting to }d_X \text{ on }X,\, d_Y\text{ on }Y \right\}.

In practice though, this is difficult to compute, since the set of things we have to minimize over is complicated. It turns out we can find an equivalent characterisation which will be easier to use in a number of examples, including the case of real trees which is the whole point of the course.

Correspondence and Distortion

We define a correspondence from X to Y to be

\mathcal{R}\subset X\times Y\text{ s.t. } \pi_X(\mathcal {R}) = X, \, \pi_Y(\mathcal {R}) = Y,

where \pi_X,\pi_Y are the canonical projection maps from X\times Y into X,Y respectively. So we can think of a correspondence as being something like a matching from X to Y. In a matching, we insist that the projection maps into each set are injections, ie that each element of X (resp Y) can appear in at most one pair, whereas for a correspondence, we demand that the projection maps are surjections, ie that each element of X appears in at least one pair.

Then the distortion of a correspondence

\mathrm{dis}(\mathcal{R}):= \sup\left\{ |d_X(x,x') - d_Y(y,y')| \,;\, (x,y),(x',y')\in \mathcal{R} \right\}.

In words, if two sets are non-isomorphic, then a correspondence can’t describe an isometry between the sets, and the distortion is a measure of how far from being an isometry the correspondence is. That is, given a pair of pairs in the correspondence, for an isometry, the distance between the X-elements would be equal to the distance between the Y-elements, and the distortion measures the largest discrepancy between such pairs of pairs over the correspondence.

Theorem 4.11 d_{GH}(X,Y) = \frac12 \inf_{\mathcal{R}} (\mathrm{dis}\mathcal R), where the infimum is taken over all correspondences \mathcal{R} X to Y.

Remark: The RHS of this equivalence can be thought of as the set coupling between X and Y such that the pairs have as equal distances as possible.

Proof: Given an embedding into X\coprod Y with d_H(X,Y)<r, we have \mathcal{R} with \mathrm{dis}\mathcal {R}<2r, by taking:

\mathcal{R}=\{(x,y): d(x,y)<r\}.

From the definition of Hausdorff distance, it follows that the for every x, there is a y with d(x,y)<r, and hence the appropriate projection maps are projections.

So it remains to prove that d_{GH}(X,Y)\le \frac12 \mathrm{dis}\mathcal{R}. We can define a metric on X\times Y by

d(x,y)=\int\left\{ d_X(x,x')+d_Y(y,y') + r \,:\, (x',y')\in \mathcal{R} \right\}.

Then for any x\in X, there is (x,y)\in\mathcal{R}, and thus d(x,y)\le r, and vice versa hence d_H(X,Y)\le r.

It only remains to check that this is actually a metric. Let’s take x,\bar x \in X, and so

d(\bar x,y)\le \inf\{ d_X(x,\bar x) + d_X(x,x')+d_Y(y,y')+r \,: \, (x',y')\in\mathcal{R}\},

so taking d_X(x,\bar x) outside the brackets gives one form of the triangle inequality. We have to check the ‘other combination’ of the triangle inequality. We assume that the infima for (x,y), (\bar x,y) are attained at (x',y'),(\bar x',\bar y') respectively.

d(x,y)+d(\bar x,y)= 2r+ d_X(x,x')+d_X(\bar x,\bar x') + d_Y(y,y')+d_Y(d,\bar y').

But we also have d_X(x',\bar x')-d_Y(y',\bar y')\ge -r from the definition of distortion, and so adding these gives the triangle inequality we want, and completes the proof of this theorem.corc

Tightness in Skorohod Space

This post continues the theme of revising topics in the analytic toolkit relevant to proving convergence of stochastic processes. Of particular interest is the question of how to prove that families of Markov chains might have a process scaling limit converging to a solution of some stochastic differential equation, in a generalisation of Donsker’s theorem for Brownian motion. In this post, however, we address more general aspects of convergence of stochastic processes, with particular reference to Skorohod space.

Topological Background

I’ve discussed Skorohod space in a previous post. For now, we focus attention on compactly supported functions, D[0,T]. Some of what follows can be extended to the infinite-time setting easily, and some requires more work. Although we can define a metric on the space of cadlag functions in lots of ways, it is more useful to think topologically, or at least with a more vague sense of metric. We say two cadlag functions are close to one another if there is a reparameterisation of the time-axis, (a function [0,T] to itself) that is uniformly close to the identity function, and when applied to one of the cadlag functions, brings it close to the other cadlag function. Heuristically, two cadlag functions are close if their large jumps are close to one another and of similar size, and if they are uniformly close elsewhere. It is worth remembering that a cadlag function on even an unbounded interval can have only countably many jumps, and only finitely many with magnitude greater than some threshold on any compact interval.

For much of the theory one would like to use, it is useful for the spaces under investigation to be separable. Recall a topological space is separable if there exists a countable dense subset. Note in particular that D[0,T] is not separable under the uniform metric, since we can define f_x(\cdot)=\mathbf{1}_{(\cdot \ge x)} for each x\in[0,T], then ||f_x-f_y||_\infty=1 whenever x\ne y. In particular, we have an uncountable collection of disjoint open sets given by the balls \mathcal{B}(f_x,\frac12), and so the space is not countable. Similarly, C[0,\infty) is not separable. A counterexample might be given by considering functions which take the values {0,1} on the integers. Thus we have a map from \{0,1\}^{\mathbb{N}}\rightarrow C[0,\infty), where the uniform distance between any two distinct image points is at least one, hence the open balls of radius 1/2 around each image point give the same contradiction as before. However, the Stone-Weierstrass theorem shows that C[0,T] is separable, as we can approximate any such function uniformly well by a polynomial, and thus uniformly well by a polynomial with rational coefficients.

In any case, it can be shown that D[0,T] is separable with respect to the natural choice of metric. It can also be shown that there is a metric which gives the same open sets (hence is a topologically equivalent metric) under which D[0,T] is complete, and hence a Polish space.

Compactness in C[0,T] and D[0,T]

We are interested in tightness of measures on D[0,T], so first we need to address compactness for sets of deterministic functions in D[0,T]. First, we consider C[0,T]. Here, the conditions for a set of functions to be compact is given by the celebrated Arzela-Ascoli theorem. We are really interested in compactness as a property of size, so we consider instead relative compactness. A set is relatively compact (sometimes pre-compact) if its closure is compact. For the existence of subsequential limits, this is identical to compactness, only now we allow the possibility of the limit point lying outside the set.

We note that the function C[0,T]\rightarrow \mathbb{R} given by ||f||_\infty is continuous, and hence uniform boundedness is certainly a required condition for compactness in C[0,T]. Arzela-Ascoli states that uniform boundedness plus equicontinuity is sufficient for a set of such functions to be compact. Equicontinuity should be thought of as uniform continuity that is uniform among all the functions in the set, rather than just within the argument of an individual particular function.

For identical reasons, we need uniform boundedness for relative compactness in D[0,T], but obviously uniform continuity won’t work as a criterion for discontinuous functions! We seek some analogue of the modulus of continuity that ignores jumps. We define

\omega'_\delta(f):=\inf_{\{t_i\}} \max_i \sup_{s,t\in[t_{i-1},t_i)} |f(s)-f(t)|,

where the infimum is taken over all meshes 0=t_0<t_1<\ldots<t_r with t_i-t_{i-1}>\delta. Note that as \delta\downarrow 0, we can, if we want, place the t_i so that large jumps of the function f take place over the boundaries between adjacent parts of the mesh. In particular, for a given cadlag function, it can be shown fairly easily that \omega'_\delta(f)\downarrow 0 as \delta\rightarrow 0. Then, unsurprisingly, in a similar fashion to the Arzela-Ascoli theorem, it follows that a set of functions A\subset D[0,T] is relatively compact if it is uniformly bounded, and

\lim_{\delta\rightarrow 0} \sup_{f\in A}\omega'_\delta(f)=0.

Note that this ‘modulus of continuity’ needs to decay uniformly across the set of functions, but that we do not need to choose the mesh at level \delta uniformly across all functions. This would obviously not work, as then the functions \mathbf{1}_{(\cdot\ge x_n)} for any sequence x_n\rightarrow x would not be compact, but they clearly converge in Skorohod space!

Tightness in C[0,T] and D[0,T]

Naturally, we are mainly interested in (probability) measures on D[0,T], and in particular conditions for tightness on this space. Recall a family of measures is tight if for any \epsilon>0, there exists some compact set A such that

\pi(A)>1-\epsilon,\quad \forall \pi\in\Pi.

So, for measures (\mu_n) on D[0,T], the sequence is tight precisely if for any \epsilon>0, there exists M,\delta and some N such that for any n>N, both

\mu_n(||f||_\infty >M)\le \epsilon,\quad \mu_n(\omega'_\delta(f)>\epsilon)\le \epsilon

hold. In fact, the second condition controls variation sufficiently strongly, that we can replace the first condition with

\mu_n(|f(0)|>M)\le \epsilon.

Often we might be taking some sort of scaling limit of these processes in D[0,T], where the jumps become so small in the limit that we expect the limit process to be continuous, perhaps an SDE or diffusion. If we can replace \omega'_\delta by \omega_\delta, the standard modulus of continuity, then we have the additional that any weak limit lies in C[0,T].

In general, to prove convergence of some stochastic processes, we will want to show that the processes are tight, by demonstrating the properties above, or something equivalent. Then Prohorov’s theorem (which I tend to think of as a probabilistic functional version of Bolzano-Weierstrass) asserts that the family of processes has a weak subsequential limit. Typically, one then shows that any weak subsequential limit must have the law of some particular random process. Normally this is achieved by showing some martingale property (eg for an SDE) in the limit, often by using the Skorohod representation theorem to use almost sure subsequential convergence rather than merely weak convergence. Then one argues that there is a unique process with this property and a given initial distribution. So since all weak subsequential limits are this given process, in fact the whole family has a weak limit.