Recent Progress and Gromov-Hausdorff Convergence

For the past few weeks I’ve been working on the problem of Cycle-Induced Forest Fires, which I’ve referred to in passing in some recent posts. The aim has been to find a non-contrived process which exhibits self-organised criticality, that is, where the process displays critical characteristics (scaling laws, multiple components at the largest order of magnitude) forever. Note that this is in contrast to the conventional Erdos-Renyi graph process, which is only critical at a single time n/2.

The conjecture is that the largest component in equilibrium typically has size on a scale of n^{2/3}. An argument based on the equilibrium proportion of isolated vertices gives an upper bound on this exponent. The working argument I have for the lower bound at the moment can comfortably fit on the back of a napkin, with perhaps some context provided verbally. Of course, the current full text is very much larger than that, mainly because the napkin would feature assertions like “event A happens at time O(n^\beta)”; whereas the more formal argument has to go like:

“With high probability as n\rightarrow\infty, event A happens between times n^{\beta-\epsilon},n^{\beta+\epsilon}, for any suitably small \epsilon>0. Furthermore, the probability that A happens after this upper threshold decays exponentially with n for fixed \epsilon, and the probability that A happens before the lower threshold is at most n^{-\epsilon}. Finally, this is under the implicit assumption that there will be no fragmentations before event A, and this holds with probability 1-o(1) etc.”

It’s got to the point where I’ve exhausted the canonical set of symbols for small quantities: \epsilon,\delta,(\eta ?).

This has been a very long way of setting up what was going to be my main point, which is that at many points during undergraduate mathematics, colleagues (and occasionally to be honest, probably myself too) have complained that they “don’t want to have anything to do with analysis. They just want to focus on algebra / number theory / statistics / fluids…” Anyway, the point of this ramble was that I think I’ve realised that it is very hard to think about any sort of open problem without engaging with the sort of ideas that a few years ago I would have thought of (and possibly dismissed) as ‘analysis’.

Much of my work on this problem has been rather from first principles, so I haven’t been thinking much about any neat, less elementary theory recently.

Ok, so on with the actual post now.

Last month I talked about local limits of graphs, which describe convergence in distribution of (local) neighbourhood structure about a ‘typical’ vertex. This is the correct context in which to make claims like “components of G(n,\frac{\lambda}{n}) look like Galton-Watson trees with offspring distribution \text{Po}(\lambda)“.

Even from this example, we can see a couple of drawbacks and omissions from this limiting picture. In the sub-critical regime, this G-W tree will be almost surely finite, but the number of vertices in the graph is going to infinity. More concretely, the limit description only tells us about a single component. If we wanted to know about a second component, in this case, it would be roughly independent of the size of the first component, and with the same distribution, but if we wanted to know about all components, it would get much more complicated.

Similarly, this local limit description isn’t particularly satisfactory in the supercritical regime. When the component in question is finite, this description is correct, but with high probability we have a giant component, and so the ‘typical’ vertex lies in the giant component with some positive probability. This is reflected by the fact that the G-W tree with supercritical offspring distribution is infinite with some positive probability. However, the giant component does not look like a \text{Po}(\lambda) G-W tree. As we exhaust O(n) vertices, the offspring distribution decreases, in expectation at least. In fact, without the assumption that the giant component is with high probability unique (so \frac{L_1}{n}=1-\mathbb{P}(|C(v)|<\infty)), we can’t even deduce the expected size of the giant component from the local limit result.

This is all unsurprising. By definition a local limit describes the structure near some vertex. How near? Well, finitely near. It can be arbitrarily large, but still finite, so in particular, the change in the offspring distribution after O(n) vertices as mentioned above will not be covered.

So, if we want to learn more about the global structure of a large discrete object, we need to consider a different type of limit. In particular, the limit will not necessarily be a graph. Rather than try to define a priori a ‘continuum’ version of a graph, it is sensible to generalise from the idea that a graph is a discrete object and instead consider it as a metric space.

In this article, I don’t want to spend much time at all thinking about how to encode a finite graph as a metric space. We have a natural notion of graph distance between vertices, and it is not hard to extend this to points on edges. Alternatively, for sparse graphs, we have an encoding through various functions, which live in some (metric) function space.

However, in general, the graph will be a metric object itself, rather than necessarily a subset of a global metric space. We will be interested in convergence, so we need a suitable style of convergence of different metric spaces.

The natural candidate for this is the Gromov-Hausdorff metric, and the corresponding Gromov-Hausdorff convergence.

The Hausdorff distance between two subsets X, Y of a metric space is defined as follows. Informally, we say that d_H(X,Y)<\epsilon if every point of X is within distance \epsilon of some point of Y, and every point of Y is within distance \epsilon of some point of X, in the sense of the original metric. Formally

d_H(X,Y):=\max \{\sup_{x\in X}\inf_{y\in Y}d(x,y), \sup_{y\in Y}\inf_{x\in X}d(x,y)\}.
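As a quick concrete aside (my own illustrative sketch, not something from the notes this post is based on), here is the sup-inf definition computed directly for finite point sets in the plane. The helper name hausdorff and the example sets are just made up for the example.

```python
# Hausdorff distance between two finite subsets of the plane,
# computed straight from the sup-inf definition above.
from math import dist  # Euclidean distance, available in Python 3.8+

def hausdorff(X, Y):
    sup_x = max(min(dist(x, y) for y in Y) for x in X)  # sup_{x in X} inf_{y in Y} d(x,y)
    sup_y = max(min(dist(x, y) for x in X) for y in Y)  # sup_{y in Y} inf_{x in X} d(x,y)
    return max(sup_x, sup_y)

# Example: three points on the x-axis against a single nearby point.
X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
Y = [(1.0, 0.5)]
print(hausdorff(X, Y))  # ~1.118, driven by the endpoints of X
```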

It is not particularly illuminating to check that this is in fact a metric. Indeed, as the definition stands it is not quite a metric but a pseudo-metric, which is exactly the same, except that d_H(X,Y)=0 is allowed even when X and Y are not equal. Note that

d_H(X^\circ,\bar X)=0,

for any set X, so this gives an example, provided X is not both open and closed. Furthermore, if the underlying metric space is unbounded, then the Hausdorff distance between two sets might be infinite. For example in \mathbb{R},

d_H(\mathbb{R}_{<0},\mathbb{R}_{>0})=\infty.

We can overcome this pair of objections by restricting attention to closed, bounded sets. In practice, many spaces under consideration will be real in flavour, so it makes sense to define this for compact sets when appropriate.

But this still leaves the underlying problem, which is how to define a distance function on metric spaces themselves. If two metric spaces X and Y were both subspaces of some larger metric space then it would be easy, as we could use the Hausdorff distance. So this is in fact how we proceed in general. We don’t need any knowledge of this covering space a priori; we simply take the infimum of the resulting Hausdorff distance over all possible choices. That is

d_{GH}(X,Y)=\inf\{d_H(\phi(X),\psi(Y))\},

where the infimum is taken over all metric spaces (E,d), and isometric embeddings \phi: X\rightarrow E, \psi: Y\rightarrow E.

The first observation is that this will again be a pseudometric unless we demand that X and Y be compact. The second is that the collection over which we take this infimum is not a set. Fortunately, this is quickly rectified. Instead consider all metrics on the disjoint union of the sets X and Y, which is a set, and within it the subset of those metrics which restrict to the correct metric on each of X and Y. It can be checked that this gives a true metric on the set of compact metric spaces up to isometry.

We have an alternative characterisation. Given compact spaces X and Y, a correspondence between X and Y is a set of pairs in X\times Y such that both projection maps are surjective. That is, for any x\in X there is some pair (x,y) in the correspondence, and likewise for any y\in Y. Let \mathcal{C}(X,Y) be the set of such correspondences. We then define the distortion of a correspondence \mathcal{R} by:

\text{dis}(\mathcal{R}):=\sup\{|d_X(x_1,x_2)-d_Y(y_1,y_2)|: (x_1,y_1),(x_2,y_2)\in\mathcal{R}\}.

Then

d_{GH}(X,Y)=\frac{1}{2}\inf_{\mathcal{R}\in\mathcal{C}(X,Y)}\text{dis}(\mathcal{R}).

In particular, this gives another reason why we don’t have to worry about taking an infimum over a proper class.
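To make the correspondence formula concrete, here is a brute-force sketch of my own (purely illustrative, and hopelessly inefficient beyond a handful of points) which computes d_{GH} for two finite metric spaces given as distance matrices, by enumerating every correspondence.

```python
from itertools import combinations, product

def gromov_hausdorff(dX, dY):
    """Brute-force (1/2) * inf over correspondences of the distortion."""
    nX, nY = len(dX), len(dY)
    pairs = list(product(range(nX), range(nY)))
    best = float("inf")
    # Enumerate all subsets of X x Y; keep those whose projections are surjective.
    for r in range(1, len(pairs) + 1):
        for R in combinations(pairs, r):
            if {i for i, _ in R} != set(range(nX)):
                continue
            if {j for _, j in R} != set(range(nY)):
                continue
            dis = max(abs(dX[i1][i2] - dY[j1][j2])
                      for (i1, j1) in R for (i2, j2) in R)
            best = min(best, dis)
    return best / 2

# A two-point space at distance 1 against a one-point space: d_GH = 1/2.
print(gromov_hausdorff([[0, 1], [1, 0]], [[0]]))  # 0.5
```

In particular, the distance from any finite space to a one-point space comes out as half the diameter, which foreshadows the example below.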

Gromov-Hausdorff convergence then has the natural definition. Note that this does not respect topological equivalence, ie homeomorphism. For example,

\bar{B(0,\frac{1}{n})}\stackrel{GH}{\rightarrow} \{0\},

where the latter has the trivial metric. In particular, although all the closed balls are homeomorphic to one another, the G-H limit is not homeomorphic to them. (Indeed, the G-H distance from any compact space to a one-point space is half its diameter, which here is \frac{1}{n}\rightarrow 0.)

A final remark is that the trees we might be looking at are not necessarily compact, so it is useful to have a notion of how this might be extended to non-compact spaces. The answer is to borrow the idea from local limits of considering large finite balls around a fixed central point. In the case of trees, this is particularly well-motivated, as it is often quite natural to have a canonical choice for the ‘root’.

Then with identified points p_n\in X_n, say (X_n,p_n)\rightarrow (X,p) if for any R>0 the R-ball around p_n in X_n converges to the R-ball around p in X. In the correspondence characterisation, we correspondingly take the infimum only over those correspondences which contain the pair of marked points (p_X,p_Y).

REFERENCES

This article was based on some lecture notes by Jean-Francois Le Gall from the Clay Institute Summer School which can be found on the author’s website here (about halfway down the page). This material is in chapter 3. I also used Nicolas Curien’s tutorials on this chapter to inform some of the examples. The resolution of the proper class problem was mentioned by several sources I examined. These notes by Jan Christina were among the best.

Urns and the Dirichlet Distribution

As I’ve explained in some posts from a while ago, I’ve been thinking about some models related to random graph processes, where we ensure the configuration stays critical by deleting any cycles as they appear. Under various assumptions, this behaves, in the limit as the number of vertices grows to infinity, like a coagulation-fragmentation process, with multiplicative coalescence and quadratic fragmentation rate, where the fragmentation kernel is the Poisson-Dirichlet distribution, PD(1/2,1/2). I found it quite hard to find accessible notes on these topics, partly because the theory is still relatively recent, and also because it seems to be one of those topics where you can’t understand anything properly until you kind of understand everything.

This post was motivated and is based on chapter 3 of Pitman’s Combinatorial Stochastic Processes, and the opening pair of lectures from Pierre Tarres’s TCC course on Self-Interaction and Learning.

It makes sense to begin by discussing the Dirichlet distribution, and there to start with the simplest case, the Beta distribution. As we learned in the Part A Statistics course while trying some canonical examples of posterior distributions, it is convenient to ignore the normalising constants of various distributions until right at the end. This is particularly true of the Beta distribution, which is indeed often used as a prior in such situations. The density function of \text{Beta}(\alpha,\beta) is proportional to x^{\alpha-1}(1-x)^{\beta-1}. If \alpha,\beta are natural numbers, we have a quick proof by induction using integration by parts; otherwise a slightly longer but still elementary argument gives the normalising constant as

\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}.

(note that the ‘base case’ is the definition of the Gamma function.) For the generalisation we are about to make, it is helpful to think of this Beta density as a distribution not on [0,1], but on partitions of [0,1] into two parts, that is, pairs (x,y) with x,y\geq 0 such that x+y=1. Why? Because then the density has the form x^{\alpha-1}y^{\beta-1}, and it is clear how we might generalise this.

Indeed the Dirichlet distribution with parameters (\alpha_1,\ldots,\alpha_m) is a random variable supported on the simplex \{(p_1,\ldots,p_m)\in\mathbb{R}^m: p_k\geq 0, \sum p_k=1\}, with density \propto \prod p_k^{\alpha_k-1}. For similar reasons, the correct normalising constant in the general case is

\frac{\Gamma(\sum \alpha_k)}{\prod \Gamma(\alpha_k)}.

You can prove this by inducting on the number of variables, using the Beta distribution as a base case.
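As a sanity check (my own, with made-up parameter values rather than anything from Pitman), one can compare a crude numerical integral of the unnormalised density with the Gamma-function formula, both for the Beta base case and for a three-parameter Dirichlet.

```python
from math import gamma

# Midpoint Riemann sum of the unnormalised Beta(a,b) density over [0,1].
def beta_integral(a, b, n=100_000):
    h = 1.0 / n
    return h * sum(((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
                   for i in range(n))

a, b = 2.5, 1.5
print(beta_integral(a, b))                 # numerical value of the integral
print(gamma(a) * gamma(b) / gamma(a + b))  # Gamma-function formula; should agree closely

# Same idea on the simplex {p1 + p2 + p3 = 1} for a three-parameter Dirichlet.
def dirichlet_integral(a1, a2, a3, n=400):
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        p1 = (i + 0.5) * h
        for j in range(n):
            p2 = (j + 0.5) * h
            p3 = 1.0 - p1 - p2
            if p3 > 0:
                total += p1 ** (a1 - 1) * p2 ** (a2 - 1) * p3 ** (a3 - 1)
    return total * h * h

print(dirichlet_integral(2.0, 3.0, 1.5))
print(gamma(2.0) * gamma(3.0) * gamma(1.5) / gamma(6.5))  # should agree closely
```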

In many situations, it is useful to be able to express some distribution as a function of IID random variables with a simpler distribution. We can’t quite do that for the Dirichlet distribution, but we can express it very simply as a function of independent RVs from the same family. It turns out that the family of gamma distributions is a wise choice. Recall that the gamma distribution with parameters (\alpha,\beta) has density:

\frac{1}{\beta^\alpha \Gamma(\alpha)}y^{\alpha-1}e^{-y/\beta},\quad y>0.

Anyway, define independent RVs Y_k, with Y_k having the gamma distribution with parameters (\alpha_k,1). Then we can specify (X_1,\ldots,X_m) as the Dirichlet distribution with these parameters by:

(X_1,\ldots,X_m)\stackrel{d}{=}\frac{(Y_1,\ldots,Y_m)}{\sum Y_k}.

In other words, the Dirichlet distribution describes the relative sizes of independent gamma RVs, once we normalise by their sum. Note the following:

– the sum of the gamma RVs, ie the factor we have to scale by to turn the vector into proportions, itself has a gamma distribution, with parameters (\sum \alpha_k,1).

– If we wanted, we could define it in an identical way using Gamma with parameters (\alpha_k,\beta) for some fixed \beta.

– More helpfully, because the gamma distribution is additive in the first argument, we can take a limit to construct a gamma process, where the increments have the form required. This will be a useful interpretation when we take a limit, as the largest increments will correspond to the largest jumps.
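Here is a small simulation sketch of the construction above (mine, with arbitrary parameter values): normalise independent \text{Gamma}(\alpha_k,1) variables by their sum and check that the component means come out as \alpha_k/\alpha, which is what the Dirichlet distribution predicts.

```python
import random

# Independent Y_k ~ Gamma(alpha_k, 1); the normalised vector should be
# Dirichlet(alpha_1, ..., alpha_m), whose component means are alpha_k / sum(alpha).
alphas = [0.5, 1.0, 2.5]
n_samples = 100_000
totals = [0.0] * len(alphas)

for _ in range(n_samples):
    ys = [random.gammavariate(a, 1.0) for a in alphas]  # shape alpha_k, scale 1
    s = sum(ys)
    for k, y in enumerate(ys):
        totals[k] += y / s  # the Dirichlet coordinate X_k

print([t / n_samples for t in totals])    # empirical means
print([a / sum(alphas) for a in alphas])  # theoretical means alpha_k / alpha
```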

Polya’s Urn

This is one of the best examples of a self-reinforcing process, where an event which has happened in the past is more likely to happen again in the future.

The basic model is as follows. We start with one white ball and one black ball in a bag. We draw a ball from the bag uniformly at random then replace it along with an additional ball of the same colour. Repeat this procedure.

The first step is to look at the distribution at some time n, ie after n balls have been added, so there are n+2 in total. Note that there are exactly n+1 possibilities for the state of the bag at this time. We must have between 1 and n+1 black balls, and indeed all of these are possible. In general, part of the reason why this process is self-reinforcing is that any distribution is in some sense an equilibrium distribution.

What follows is a classic example of a situation which is a notational nightmare in general, but relatively straightforward for a fixed finite example.

Let’s take n=5, and consider the probability that the sequence of balls drawn is BBWBW. This probability is:

\frac12\times \frac 23\times \frac14\times \frac 35\times \frac26.

So far this isn’t especially illuminating, especially if we start trying to cancel these fractions. But note that the denominator of the product will clearly be 6!. What about the numerator? Well, the contribution to the numerator of the product from black balls is 1\times 2\times 3=3!, while the contribution from white balls is 1\times 2=2!. In particular, the contribution to the numerator from each colour is independent of the order of whites and blacks; it depends only on the number of whites and blacks. So we can conclude that the probability of any particular ordering of draws with k whites and n-k blacks (so that we end up with k+1 whites and (n-k)+1 blacks in the bag) is

\frac{k! (n-k)!}{(n+1)!},

and so the probability that we end up with k+1 whites where we no longer care about ordering is

\binom{n}{k}\frac{k!(n-k)!}{(n+1)!}=\frac{1}{n+1}.

In other words, the distribution of the number of white balls in the bag after n balls have been added is uniform on \{1,\ldots,n+1\}.
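Since the uniform answer is quite striking, here is a quick simulation sketch (mine, not part of the argument) which runs the urn for n=5 many times and tallies the number of white balls; the empirical frequencies should all sit close to 1/6.

```python
import random
from collections import Counter

def polya_whites(n):
    """Run the basic urn (starting 1 white, 1 black) for n draws; return #whites."""
    white, black = 1, 1
    for _ in range(n):
        if random.random() < white / (white + black):
            white += 1
        else:
            black += 1
    return white

n, trials = 5, 200_000
counts = Counter(polya_whites(n) for _ in range(trials))
for k in range(1, n + 2):
    print(k, round(counts[k] / trials, 4))  # each should be close to 1/(n+1) = 1/6
```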

That looks like it might be something of a neat trick, so the natural question to ask is what happens if we adjust the initial conditions. Suppose that instead we start with \alpha_1,\ldots,\alpha_m balls of the m colours. Obviously, this is going to turn into a proof by suggestive notation. In fact, the model doesn’t really rely on the (\alpha_i) being positive integers. Everything carries through with \alpha_i\in(0,\infty) if we view the vector as the initial weights of the colours.

As before, the order in which balls of various colours are drawn doesn’t matter hugely. Suppose that the first n balls drawn feature n_i balls of colour i. The probability of this is:

\binom{n}{n_1,\ldots,n_m}\frac{\prod_i \alpha_i(\alpha_i+1)\ldots (\alpha_i+n_i-1)}{\alpha(\alpha+1)\ldots(\alpha+(n-1))}

where \alpha=\sum_i \alpha_i. Then for large n, assuming for now that the \alpha_i\in\mathbb{N}, we have, for each colour i,

\frac{\alpha_i(\alpha_i+1)\ldots(\alpha_i+n_i-1)}{n_i!}=\frac{(\alpha_i+n_i-1)!}{n_i!\,(\alpha_i-1)!}\approx \frac{n_i^{\alpha_i-1}}{(\alpha_i-1)!},\qquad\text{and similarly}\qquad \frac{\alpha(\alpha+1)\ldots(\alpha+n-1)}{n!}\approx \frac{n^{\alpha-1}}{(\alpha-1)!}.

The factorials (\alpha_i-1)! and (\alpha-1)! are just fixed constants, so we get that overall, up to a constant, the probability above is approximately

\frac{\prod_i n_i^{\alpha_i-1}}{n^{\alpha-1}}=\frac{1}{n^{m-1}}\prod_i \left(\frac{n_i}{n}\right)^{\alpha_i-1},

and the factor n^{-(m-1)} is, up to a constant, the volume of a cell in the lattice of possible values of the proportions (\frac{n_1}{n},\ldots,\frac{n_m}{n}) within the simplex. So the vector of proportions has approximate density \propto \prod (\frac{n_i}{n})^{\alpha_i-1}, which we recall is the pdf of the Dirichlet distribution with parameters (\alpha_i), as telegraphed by our choice of notation. With some suitable martingale machinery, you can also prove that this convergence happens almost surely, for a suitable limit RV defined on the tail sigma algebra.
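To finish, here is a simulation sketch of my own (with arbitrary weights, and only comparing component means rather than proving anything) which runs the multi-colour urn for a large number of steps and puts the resulting proportions alongside samples from the Dirichlet distribution built from the gamma construction earlier.

```python
import random

def urn_proportions(alphas, steps):
    """Multi-colour urn with (possibly real-valued) initial weights alphas."""
    w = list(alphas)
    for _ in range(steps):
        u, acc, i = random.random() * sum(w), 0.0, 0
        while acc + w[i] < u:  # pick colour i with probability w_i / total weight
            acc += w[i]
            i += 1
        w[i] += 1.0            # add one ball of the chosen colour
    total = sum(w)
    return [x / total for x in w]

def dirichlet_sample(alphas):
    ys = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(ys)
    return [y / s for y in ys]

alphas, steps, trials = [1.0, 2.0, 0.5], 1000, 4000
urn = [urn_proportions(alphas, steps) for _ in range(trials)]
ref = [dirichlet_sample(alphas) for _ in range(trials)]

# Compare component means; histograms or higher moments work similarly.
print([sum(p[k] for p in urn) / trials for k in range(len(alphas))])
print([sum(p[k] for p in ref) / trials for k in range(len(alphas))])
```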

Next time I’ll introduce a more complicated family of self-reinforcing processes, and discuss some interesting limits of the Dirichlet distribution that relate to such processes.
