Isolated Vertices and the Second-Moment Method

It’s back-to-school day or week for much of the UK. I’m sure for many, this brings resolutions of better work habits, and while I could always use some of those too as I try to finish my thesis, I also want to start blogging again when possible.

Right now, I’m in Haifa for the Mostly Markov Mixing summer school and workshop hosted at the Technion. The talks have been interesting so far, and the environment stimulating both for the discussion of new problems, and getting work done on existing research.

I wrote much of this post a while ago, after some of the UK olympiad students asked me to tell them about second-moment methods and I politely declined. I was reminded of this by an interesting problem introduced by Elchanan Mossel in the first lecture of his series. I’ll start with this.

Suppose we consider a symmetric inhomogeneous random graph with two types. That is, we divide the vertices into two equally-sized classes, and we connect vertices in the same class with probability p, and vertices in different classes with probability q, all independently. The question is: if we can see the resulting graph structure, with what accuracy can we recover the class division? [Note: this setup with symmetry between the types can be called the block model.]

This is a hard problem, and got us all thinking a great deal. In the most relevant regime, even the most sophisticated techniques do not allow us to identify the partition perfectly with high probability as the number of vertices goes to infinity. For now, though, I only want to use the most uninteresting regime as a starting point for the rest of this post. EM asked: what happens if we take sparse scaling, that is p,q=\Theta(1/N), where N is the number of vertices? Do we have a chance to identify the classes correctly?

Well in this case the answer is easy, because it is ‘no’, and the reason is that in sparse random graphs, there is a positive proportion of isolated vertices. In particular, there is a positive proportion of isolated vertices of each type. And so, when we see an isolated vertex, for large N we can specify accurately the distribution of each type given this extra information, but in doing so, we admit that we certainly cannot partition the isolated vertices into their classes. So for the rest of the talk, EM focused on the regime where the random graph is connected with high probability.

There are two ideas with worthwhile and simple content here. Firstly, the fact that such a random graph has a positive proportion of isolated vertices. I am finishing off a result in a model where the base structure is the inhomogeneous random graph, and have just now proved a short lemma showing exactly this in a slightly more complicated context. It’s a good example of the second-moment method. Secondly, implicit in the final statement of the previous paragraph is that the absence of isolated vertices and connectivity are roughly equivalent in a random graph. This is something I’ve talked about briefly before, and in the interests of keeping the resolution and actually finishing this post, I won’t talk about it here.

Earlier this year, I gave a short talk by request to the UK olympiad students about first-moment methods. In particular, they were hoping that they might occasionally want to apply such approaches to the sort of combinatorics problems you encounter in olympiads. Typically, the best examples in this setting involve demonstrating the existence of a set without certain bad properties, by showing that the probability that at least one bad property holds is strictly less than one.

The students asked why I wasn’t talking about second-moment methods. Firstly, the models (such as this one) which are the best examples are less familiar to the students, and are really rooted in the randomness. They can’t easily be turned into a combinatorics problem. Secondly, looking for lower bounds in probability suggests we are aiming for convergence in probability, rather than convergence in expectation, and this is a distinction that is unlikely to be appreciated before some undergraduate probability courses have been taken.

Anyway, we want to show that the proportion of isolated vertices in G(n,c/n) is bounded below in probability as n\rightarrow\infty. All the content of the argument is seen in this classical Erdos-Renyi setting. Any inhomogeneous example will merely demand extra notation.

So, first we deal with the expected number of isolated vertices. The event that a given vertex v is isolated demands that the n-1 edges potentially incident to v do not appear. The probability of this is (1-\frac{c}{n})^{n-1} \rightarrow e^{-c}. Thus the expected number of such vertices is approximately ne^{-c}. We now want to show that the fluctuations (standard deviation if you will) of this quantity are small relative to its mean. To bound the variance, we look at pairs of vertices, v and w. Note that the events {v isolated} and {w isolated} are not independent, since knowing that v is isolated tells us that the edge from v to w is not present, which slightly increases the chance of w being isolated. (For a very concrete example, think of the case n=2.)

However, in a large graph, the events are almost independent. That’s a statement which we feel is true, but we need to quantify, so we ask for the probability that {v and w isolated}. This happens precisely if none of the edges incident to either v or w are present. There are 2n-3 such edges, and so the probability of this is (1-\frac{c}{n})^{2n-3}\rightarrow e^{-2c}. So now we can write

\mathbb{E}\left[ \left(\mathbf{1}(1\text{ isolated})+\ldots+\mathbf{1}(n\text{ isolated})\right)^2\right]= \sum_{k=1}^n \mathbb{E}\left[\mathbf{1}(k\text{ isolated})^2\right] + \sum_{j\ne k} \mathbb{E}\left[\mathbf{1}(j,k\text{ isolated})\right].

An indicator function takes the values 0 and 1, and so in the first sum we can remove the square sign to leave us with the expected number of isolated vertices. Secondly, the vertices are exchangeable and so we can replace each summand with the value we have already established for v and w. We can now calculate the variance of the number of isolated vertices, and we see that it is o(n^2). With a little bit more care about the limits in n, we can check it is actually O(n). In particular, the variance of the proportion of isolated vertices tends to zero.
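To see the orders of magnitude concretely, here is a minimal sketch that evaluates the two probabilities above exactly for finite n and assembles the mean and variance from them. The function name and the choice c=2 are purely illustrative.

```python
def isolated_vertex_moments(n, c):
    """Exact mean and variance of the number of isolated vertices in G(n, c/n)."""
    p = c / n
    p_one = (1 - p) ** (n - 1)       # P(a given vertex is isolated)
    p_two = (1 - p) ** (2 * n - 3)   # P(two given vertices are both isolated)
    mean = n * p_one
    # n diagonal terms plus n(n-1) ordered pairs of distinct vertices
    second_moment = n * p_one + n * (n - 1) * p_two
    variance = second_moment - mean ** 2
    return mean, variance


for n in (10, 100, 1000, 10000):
    mean, variance = isolated_vertex_moments(n, c=2.0)
    # mean / n should approach exp(-c); variance / n**2 should shrink like 1/n
    print(n, mean / n, variance / n ** 2)
```

The printed ratio variance / n**2 decays roughly like 1/n, which is the O(n) variance claimed above.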

For explicit lower bounds in probability on the proportion of isolated vertices we could appeal to Chebyshev’s inequality. However, since the variance vanishes, we have convergence in distribution to a constant, and thus convergence in probability.

Finally, a word on the end of EM’s talk. Having said that the sparse phase is not interesting because it is impossible, we might ask about the dense phase, where p and q are fixed. Just for a concrete example, suppose the probability of connection within the class is ½, and between classes is 1/3. Thus between any pair of vertices in the same class we expect to see roughly N/8+N/18= 13N/72 paths of length 2. The summands correspond to the middle vertex of the path in the same class, and in the opposite class respectively. However, between any pair of vertices in different classes, we expect to see roughly N/6 paths of length 2 for similar reasons. Both of these quantities will be highly concentrated on their means: consider the second-moment as the existence of each possible path is independent. Indeed, the chance that we see a proportion of paths closer to N/6 when it should be 13N/72 is a large deviations event, and so has exponential decay. As a result, the chance that we get the relative positions of any pair of vertices wrong with this method vanishes for large N.
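The arithmetic behind 13N/72 and N/6 can be checked with exact fractions. This is just a sketch of the leading-order coefficients of N, ignoring the O(1) correction from excluding the two endpoints; the function name is made up for illustration.

```python
from fractions import Fraction


def two_path_rates(p, q):
    """Leading coefficients of N in the expected number of paths of length 2.

    Each class holds N/2 candidate middle vertices.
    """
    # same class: middle vertex in own class (p*p) or in the other class (q*q)
    same = (p * p + q * q) / 2
    # different classes: whichever class the middle vertex is in, one edge has
    # probability p and the other q
    different = (p * q + q * p) / 2
    return same, different


same, different = two_path_rates(Fraction(1, 2), Fraction(1, 3))
print(same, different)  # 13/72 1/6
```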

In fact, the condition that (for p<q),

N \mathbb{P}(\text{Bin}(N-1,p)\ge \text{Bin}(N-1,q)) \rightarrow 0

should be enough to identify the partition with high probability, and indeed this is proved by several authors including EM. Note that the dense regime comfortably satisfies this condition, since it holds even without the factor of N. (The sparse regime completely fails, as the probability is roughly constant.) Even closer to the connectivity threshold remains interesting!
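For small N the quantity in this condition can be evaluated exactly. The sketch below does so by summing binomial probabilities in log space; the parameter choices (p=1/3, q=1/2 for the dense case, and p=1/N, q=2/N for the sparse case) are illustrative assumptions, not values from the talk.

```python
from math import exp, lgamma, log


def binom_pmf(n, p):
    """Return [P(Bin(n, p) = k) for k = 0..n], for 0 < p < 1."""
    pmf = []
    for k in range(n + 1):
        log_term = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                    + k * log(p) + (n - k) * log(1 - p))
        pmf.append(exp(log_term))
    return pmf


def prob_x_at_least_y(n, p, q):
    """P(X >= Y) for independent X ~ Bin(n, p) and Y ~ Bin(n, q)."""
    px, py = binom_pmf(n, p), binom_pmf(n, q)
    tail = [0.0] * (n + 2)           # tail[k] = P(X >= k)
    for k in range(n, -1, -1):
        tail[k] = tail[k + 1] + px[k]
    return sum(py[k] * tail[k] for k in range(n + 1))


for big_n in (50, 100, 200, 400):
    # dense column includes the factor N; sparse column is the bare probability
    dense = big_n * prob_x_at_least_y(big_n - 1, 1 / 3, 1 / 2)
    sparse = prob_x_at_least_y(big_n - 1, 1 / big_n, 2 / big_n)
    print(big_n, dense, sparse)
```

In the dense column the value falls away quickly with N, while the sparse probability stays roughly constant, which is the contrast described above.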


Multitype Branching Processes

One of the fundamental objects in classical probability theory is the Galton-Watson branching process. This is defined to be a model for the growth of a population, where each individual in a generation gives birth to some number (possibly zero) of offspring, who form the next generation. Crucially, the numbers of offspring of the individuals are IID, with the same distribution both within generations and between generations.

There are several ways one might generalise this, such as non-IID offspring distributions, or pairs of individuals producing some number of offspring, but here we consider the situation where each individual has some type, and different types have different offspring distributions. Note that if there are K types, say, then the offspring distributions should now be supported on \mathbb{Z}_{\ge 0}^K. Let’s say the offspring distribution from a parent of type i is \mu^{(i)}.

The first question to address is one of survival. Recall that if we want to know whether a standard Galton-Watson process has positive probability of having infinite size, that is never going extinct, we only need to know the expectation of the offspring distribution. If this is less than 1, then the process is subcritical and is almost surely finite. If it is greater than 1, then it is supercritical and survives with positive probability. If the expectation is exactly 1 (and the variance is finite) then the process is critical and although it is still almost surely finite, the overall population size has a power-law tail, and hence (or otherwise) the expected population size is infinite.
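As a concrete illustration of the trichotomy, here is a sketch that computes the extinction probability when the offspring distribution is Poisson. The Poisson choice is an assumption made for the example, since it gives the simple generating function G(s)=e^{m(s-1)}. The extinction probability is the smallest fixed point of G, reached by iterating from 0.

```python
from math import exp


def extinction_probability(mean, iterations=10000):
    """Smallest solution of q = exp(mean * (q - 1)), found by iterating from q = 0."""
    q = 0.0
    for _ in range(iterations):
        q = exp(mean * (q - 1.0))
    return q


for mean in (0.5, 1.0, 2.0):
    # subcritical and critical means give (approximately) 1; supercritical gives < 1
    print(mean, extinction_probability(mean))
```

At the critical value the iteration converges to 1 only slowly, which is a small echo of the power-law behaviour mentioned above.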

We would like a similar result for the multitype process, saying that we do not need to know everything about the distribution to decide what the survival probability should be.

The first thing to address is why we can’t just reduce the multitype case to the monotype setting. It’s easiest to assume that we know the type of the root in the multitype tree. The case where the type of the root is random can be reconstructed later. Anyway, suppose now that we want to know the offspring distribution of a vertex in the m-th generation. To decide this, we need to know the probability that this vertex has a given type, say type j. To calculate this, we need to work out all the type possibilities for the first m generations, and their probabilities, which may well include lots of complicated size-biasing. Certainly it is not easy, and there’s no reason why these offspring distributions should be IID. The best we can say is that they should probably be exchangeable within each generation.

Obviously if the offspring distribution does not depend on the parent’s type, then we have a standard Galton-Watson tree with types assigned in an IID manner to the realisation. If the types are symmetric (for example if M, to be defined, is invariant under permuting the indices) then life gets much easier. In general, however, it will be more complicated than this.

We can however think about how to decide on survival probability. We consider the expected number of offspring, allowing both the type of the parent and the type of the child to vary. So define m_{ij} to be the expected number of type j children born to a type i parent. Then write these in a matrix M=(m_{ij}).

One generalisation is to consider a Galton-Watson forest started from some positive number of roots of various types. Suppose we have a vector \nu=(\nu_i) listing the number of roots of each type. Then the expected number of descendants of each type at generation n is given by the vector \nu M^n.

Let \lambda be the largest eigenvalue of M. As for the transition matrices of Markov chains, the Perron-Frobenius theorem applies here, which confirms that, because the entries of M are positive, the eigenvalue with largest modulus is simple and real, and the associated eigenvector has entirely positive entries. [In fact we need a couple of extra conditions on M, including that it is possible to get from any type to any other type – we say irreducible – but that isn’t worth going into now.]

So in fact the total number of descendants at generation n grows like \lambda^n in expectation, and so we have the same description of subcriticality and supercriticality. We can also make a sensible comment about the left-\lambda-eigenvector of M. This is the limiting proportion of the different types of vertices.
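Here is a small sketch of these matrix calculations for two types. The mean matrix M below is a made-up example with positive entries, and the largest eigenvalue and its left eigenvector are estimated by power iteration rather than by any library routine.

```python
def vec_mat(v, m):
    """Row vector times square matrix."""
    k = len(m)
    return [sum(v[i] * m[i][j] for i in range(k)) for j in range(k)]


def expected_generation(nu, m, n):
    """Expected number of individuals of each type at generation n, i.e. nu M^n."""
    v = list(nu)
    for _ in range(n):
        v = vec_mat(v, m)
    return v


def perron_left(m, iterations=500):
    """Estimate the largest eigenvalue and the left eigenvector, normalised to sum to 1."""
    k = len(m)
    v = [1.0 / k] * k
    growth = 0.0
    for _ in range(iterations):
        w = vec_mat(v, m)
        growth = sum(w)              # v sums to 1, so this estimates lambda
        v = [x / growth for x in w]
    return growth, v


M = [[0.5, 0.5], [1.0, 0.25]]        # hypothetical m_ij: type i parent, type j child
lam, proportions = perron_left(M)
print(lam, proportions)
print(expected_generation([1, 0], M, 10))
```

For this M the growth rate comes out slightly above 1, so the example is supercritical, and the normalised left eigenvector gives the limiting proportions of the two types.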

It’s a result (eg. [3]) that the height profile of a depth-first search on a standard Galton-Watson tree converges to Brownian Motion. Another way to phrase this is that a GW tree conditioned to have some size N has the Brownian Continuum Random Tree as a scaling limit as N grows to infinity. Miermont [4] proves that this result holds for the multitype tree as well. In the remainder of this post I want to discuss one idea along the way to the proof, and one application.

I said initially that there wasn’t a trivial reduction of a multitype process to a monotype process. There is however a non-trivial embedding of a monotype process in a multitype process. Consider all the vertices of type 1, and all the paths between such vertices. Then draw a new tree consisting of just the type 1 vertices. Two of these are joined by an edge if there is no other type 1 vertex on the unique path between them in the original tree. If that definition is confusing, think of the most sensible way to construct a tree on the type 1 vertices from the original, and you’ve probably chosen this definition.
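One way to make the construction concrete is sketched below. It assumes the tree is stored as a parent array and that the root has type 1, and it attaches each type 1 vertex to its nearest type 1 ancestor; the representation and names are illustrative only.

```python
def reduced_tree(parent, types, keep=1):
    """Parent map of the tree induced on vertices of type `keep`.

    parent[v] is the parent of v in the original tree (None for the root).
    Each kept vertex is attached to its nearest proper ancestor of the same type.
    """
    reduced_parent = {}
    for v, t in enumerate(types):
        if t != keep:
            continue
        u = parent[v]
        while u is not None and types[u] != keep:
            u = parent[u]            # skip over ancestors of other types
        reduced_parent[v] = u        # None marks the root of the reduced tree
    return reduced_parent


# Original tree: 0 is the root; 1 and 2 are children of 0; 3 and 4 of 1; 5 of 2.
parent = [None, 0, 0, 1, 1, 2]
types = [1, 2, 1, 1, 2, 2]
print(reduced_tree(parent, types))  # {0: None, 2: 0, 3: 0}
```

Vertex 3 hangs off a type 2 vertex in the original tree, but in the reduced tree it becomes a child of the root.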

There are two important things about this new tree. 1) It is a Galton-Watson tree, and 2) if the original tree is critical, then this reduced tree is also critical. Proving 1) is heavily dependent on exactly what definitions one takes for both the multitype branching mechanism and the standard G-W mechanism. Essentially, at a type 1 vertex, the number of type 1 descendants is not dependent on anything that happened at previous generations, nor in other branches of the original tree. This gives IID offspring distributions once it is formalised. As for criticality, we note that by the matrix argument given before, under the irreducibility condition discussed, the expectation of the total population size is infinite iff the expected number of type 1 vertices is also infinite. Since the proportion of type 1 vertices is given by the first element of the left eigenvector, which is positive, we can make a further argument that the number of type 1 vertices has a power-law tail iff the total population size also has a power-law tail.

I want to end by explaining why I was thinking about this model at all. In many previous posts I’ve discussed the forest fire model, where occasionally all the edges in some large component are deleted, and the component becomes a set of singletons again. We are interested in the local limit. That is, what do the large components look like from the point of view of a single vertex in the component? If we were able to prove that the large components have BCRT as the scaling limit, this would answer this question.

This holds for the original random graph process. There are two sensible ways to motivate this. Firstly, given that a component is a tree (which it is with high probability if its size is O(1)), its distribution is that of the uniform tree, and it is known that this has BCRT as a scaling limit [1]. Alternatively, we know that the components have a Poisson Galton-Watson process as a local limit by the same argument used to calculate the increments of the exploration process. So we have an alternative description of the BCRT appearing: the scaling limit of G-W trees conditioned on their size.

Regarding the forest fires, if we stop the process at some time T>1, we know that some vertices have been burned several times and some vertices have never received an edge. What is clear though is that if we specify the age of each vertex, that is, how long has elapsed since it was last burned; conditional on this, we have an inhomogeneous random graph. Note that if we have two vertices of ages s and t, then the probability that there is an edge between them is 1-e^{-\frac{s\wedge t}{n}}, ie approximately \frac{s\wedge t}{n}. The function giving the probabilities of edges between different types of vertices is called the kernel, and here it is sufficiently well-behaved (in particular, it is bounded) that we are able to use the results of Bollobas et al in [2], where they discuss general sparse inhomogeneous random graphs. They show, among many other things, that in this setting as well the local limit is a multitype branching process.

So in conclusion, we have almost all the ingredients towards proving the result we want, that forest fire components have BCRT scaling limit. The only outstanding matter is that the Miermont result deals with a finite number of types, whereas obviously in the setting where we parameterise by age, the set of types is continuous. In other words, I’m working hard!

References

[1] Aldous – The Continuum Random Tree III

[2] Bollobas, Janson, Riordan – The phase transition in inhomogeneous random graphs

[3] Le Gall – Random Trees and Applications

[4] Miermont – Invariance principles for spatial multitype Galton-Watson trees


Characterisations of Geometric Random Graphs

Continuing the LMS-EPSRC summer school on Random Graphs, Geometry and Asymptotic Structure, we’ve now had three of the five lectures by Mathew Penrose on Geometric Random Graphs.

The basic idea is that instead of viewing a graph entirely abstractly, we now place the vertices in the plane, or some other real space. In many network situations, we would expect connectivity to depend somehow on distance. Agents or sites which are close together might be considered more likely to have the sort of relationship indicated by being connected with an edge. In the model discussed in this course, this dependence is deterministic. We have some parameter r, and once we have chosen the location of all the vertices, we connect a pair of vertices if the distance between them is less than r.

For the purposes of this, we work in a compact space [0,1]^d, and we are interested in the limit as the number of vertices n grows to infinity. To avoid the graph getting too connected, as in the standard random graph model, we take r to be a decreasing function of n. Anyway, we place the n points into the unit hypercube uniformly at random, and then the edges are specified by the adjacency rule above. In general, because r_n will be o(1), we won’t have to worry too much about boundary effects. The proportion of vertices within r_n of the boundary of the cube will be o(1). For some results, this is a genuine problem, when it may be easier to work on the torus.
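The model is easy to sample directly. The following is a minimal sketch using a brute-force check over all pairs, which is fine for small n; the function name, the seed argument and the Euclidean norm are choices made for the example.

```python
import random
from itertools import combinations
from math import dist


def geometric_graph(n, r, d=2, seed=None):
    """Place n uniform points in [0,1]^d and join pairs at Euclidean distance < r."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) < r]
    return points, edges


points, edges = geometric_graph(n=200, r=0.05, seed=1)
# n * r^d = 0.5 here, so this sample sits in the intermediate scaling
print(len(edges))
```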

In G(n,p), the order of np in the limit determines the qualitative structure of the graph. This is the expected degree of a given fixed vertex. In this geometric model, the relevant parameter is nr_n^d, where d is the dimension of the hypercube. If this parameter tends to 0, we say the graph is sparse, and dense if it tends to infinity. The intermediate case is called a thermodynamic limit. Note that the definition of sparse here is slightly different from G(n,p).

Much of the content of the first three lectures has been verifying that the distributions of various quantities in the graph, for example the total number of edges, are asymptotically Poisson. Although sometimes arguments are applicable over a broad spectrum, we also sometimes have to use different calculations for different scaling windows. For example, it is possible to show convergence to a Poisson distribution for the number of edges in the sparse case, from which we get an asymptotic normal approximation almost for free. In the denser regimes, the argument is somewhat more technical, with some substantial moment calculations.

A useful tool in these calculations is a family of bounds derived via Stein’s method for sums of ‘almost independent’ random variables. For example, the indicators of an edge between two pairs of vertices are independent in this setting if the pairs are disjoint, and the dependence is still only mild if they share a vertex. An effective description is via a so-called dependency graph, where we view the random variables as the vertices of a graph, with an edge between them if there is some dependence. This description doesn’t have any power in itself, but it does provide a concise notation for what would otherwise be very complicated, and we are able to show versions of (Binomials converge to Poisson) and CLT via these that are exactly as required for this purpose.

In particular, we are able to show that if E_n is the total number of edges, under a broad set of scaling regimes, if \lambda_n is the expected total number of edges, then d_{TV}(E_n,\mathrm{Po}(\lambda_n))\rightarrow 0, as n grows. This convergence in total variation distance is as strong a result as one could hope for, and when \lambda_n\rightarrow\infty, we can derive a normal approximation as well.

At this point it is worth discussing an alternative specification of the model. Recall that for a standard homogeneous random graph, we have the choice of G(n,m) and G(n,p) as definitions. G(n,m) is the finer measure, and G(n,p) can be viewed as a weighted mix of G(n,m). We can’t replicate this directly in the geometric setting because the edges and non-edges are a deterministic function of the vertex locations. What we can randomise is the number of vertices. Since we are placing the vertices uniformly at random, it makes sense to consider as an alternative a Poisson Point Process with intensity n. The number of vertices we get overall will be distributed as Po(n), which is concentrated near n, in the same manner as G(n,c/n).
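A sketch of the Poissonised vertex set: first draw the number of points from Po(n), then place that many points uniformly. The Poisson draw here counts rate-1 exponential arrivals before time n, which is one standard way to sample it without extra libraries; the names are illustrative.

```python
import random


def poisson_count(mean, rng):
    """Sample Po(mean) as the number of rate-1 exponential arrivals in [0, mean]."""
    count, time = 0, rng.expovariate(1.0)
    while time <= mean:
        count += 1
        time += rng.expovariate(1.0)
    return count


def poisson_points(n, d=2, seed=None):
    """Points of a Poisson point process with intensity n on [0,1]^d."""
    rng = random.Random(seed)
    size = poisson_count(n, rng)     # random number of vertices, mean n
    return [tuple(rng.random() for _ in range(d)) for _ in range(size)]


for seed in range(5):
    # counts should scatter around 400 with standard deviation 20
    print(len(poisson_points(400, seed=seed)))
```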

As in G(n,p), this is a less basic model because it is a mixture of the fixed-vertex models. Let’s see how we would go about extending the total variation convergence result to this slightly different setting without requiring a more general version of the Poisson Approximation Lemma. To avoid having to define everything again, we add a prime (') to indicate that we are talking about the Poisson Point Process case. Writing d(.,.) for total variation distance, the result we have is:

\lim_{n\rightarrow\infty} d(E_n,\mathrm{Po}(\lambda_n))=0.

We want to show that

\lim_{n\rightarrow\infty}d(E_n',\mathrm{Po}(\lambda_n'))=0,

which we can decompose in terms of expectations in the original model by conditioning on N_n

\lim_{n\rightarrow\infty}d(E_n',\mathrm{Po}(\lambda_n'))\leq \lim_{n\rightarrow\infty}\mathbb{E}\Big[\mathbb{E}[d(E_{N_n},\mathrm{Po}(\lambda_n')) | N_n]\Big],

where the outer expectation is over N_n. The observation here is that the number of points given by the Poisson process induces a measure on distributions, the overwhelming majority of which look quite like Poisson distributions with parameter n. The reason we have an inequality rather than an equality is that we are applying the triangle inequality in the sum giving total variation distance:

d(X,Y)=\frac{1}{2}\sum_{k\geq 0}|\mathbb{P}(X=k)-\mathbb{P}(Y=k)|.

From this, we use the triangle inequality again:

\leq \lim_{n\rightarrow\infty} \mathbb{E}\Big[\mathbb{E}[d(E_{N_n},\mathrm{Po}(\lambda_{N_n})) | N_n]\Big] +\lim_{n\rightarrow\infty}\mathbb{E}\Big[\mathbb{E}[d(\mathrm{Po}(\lambda_{N_n}),\mathrm{Po}(\lambda_n')) | N_n]\Big].

Then, by a large deviations argument, we have that for any \epsilon>0, \mathbb{P}(|N_n-n|\geq \epsilon n)\rightarrow 0 exponentially in n. Also, total variation distance is, by definition, bounded above by 1. In the first term, the inner conditioning on N_n is irrelevant, and we have that E_{N_n} converges to the Poisson distribution for any fixed N_n\in (n(1-\epsilon),n(1+\epsilon)). Furthermore, we showed in the proof of the non-PPP result that this convergence is uniform in this interval. (This is not surprising – the upper bound is some well-behaved polynomial in 1/n.) So with probability 1- e^{-\Theta(n)}, N_n is in the region where this convergence happens, and elsewhere, the expected TV distance is bounded above by 1, so the overall expectation tends to 0. With a similar LD argument, for the second term it suffices to prove that when \lambda\rightarrow\mu, we must have d(\mathrm{Po}(\lambda),\mathrm{Po}(\mu))\rightarrow 0. This is ‘obviously’ true. Formally, it is probably easiest to couple the distributions \mathrm{Bin}(n,\lambda/n),\mathrm{Bin}(n,\mu/n) in the obvious way, and carry the convergence of TV distance as the parameter varies through the convergence in n.
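The ‘obvious’ fact about nearby Poisson distributions is easy to check numerically. The sketch below uses the convention with a factor of 1/2 in front of the sum, so that the distance is at most 1, and truncates the sum at a cutoff far out in the tail. Both the cutoff rule and the function names are assumptions of the example.

```python
from math import exp, sqrt


def poisson_pmf(lam, cutoff):
    """Return [P(Po(lam) = k) for k = 0..cutoff]."""
    pmf = [exp(-lam)]
    for k in range(1, cutoff + 1):
        pmf.append(pmf[-1] * lam / k)
    return pmf


def tv_poisson(lam, mu):
    """Total variation distance between Po(lam) and Po(mu), up to truncation error."""
    top = max(lam, mu)
    cutoff = int(top + 20 * sqrt(top) + 50)   # mass beyond this is negligible
    a, b = poisson_pmf(lam, cutoff), poisson_pmf(mu, cutoff)
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


for mu in (3.0, 2.5, 2.1, 2.01, 2.001):
    # the distance should shrink as mu approaches lambda = 2
    print(mu, tv_poisson(2.0, mu))
```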

That all sounded a little bit painful, but is really just the obvious thing to do with each term – it’s only the language that’s long-winded!

Anyway, I’m looking forward to seeing how the course develops. In particular, when you split the space into small blocks, the connectivity properties resemble those of (site) percolation, so I wonder whether there will be concrete parallels. Also, after reading about some recent results concerning the metric structure of the critical components in the standard random graph process, it will be interesting to see how these compare to the limit of a random graph process which comes equipped with metric structure for free!

Recent Progress and Gromov-Hausdorff Convergence

For the past few weeks I’ve been working on the problem of Cycle-Induced Forest Fires, which I’ve referred to in passing in some recent posts. The aim has been to find a non-contrived process which exhibits self-organised criticality, that is, where the process displays critical characteristics (scaling laws, multiple components at the largest order of magnitude) forever. Note that this is in contrast to the conventional Erdos-Renyi graph process, which is only critical at a single time n/2.

The conjecture is that the largest component in equilibrium typically has size on a scale of n^{2/3}. An argument based on the equilibrium proportion of isolated vertices gives an upper bound on this exponent. The working argument I have for the lower bound at the moment can comfortably fit on the back of a napkin, with perhaps some context provided verbally. Of course, the current full text is very much larger than that, mainly because the napkin would feature assertions like “event A happens at time O(n^\beta)”; whereas the more formal argument has to go like:

“With high probability as n\rightarrow\infty, event A happens between times n^{\beta-\epsilon},n^{\beta+\epsilon}, for any suitably small \epsilon>0. Furthermore, the probability that A happens after this upper threshold decays exponentially with n for fixed \epsilon, and the probability that A happens before the lower threshold is at most n^{-\epsilon}. Finally, this is under the implicit assumption that there will be no fragmentations before event A, and this holds with probability 1-o(1) etc.”

It’s got to the point where I’ve exhausted the canonical set of symbols for small quantities: \epsilon,\delta,(\eta ?).

This has been a very long way of setting up what was going to be my main point, which is that at many points during undergraduate mathematics, colleagues (and occasionally to be honest, probably myself too) have complained that they “don’t want to have anything to do with analysis. They just want to focus on algebra / number theory / statistics / fluids…” Anyway, the point of this ramble was that I think I’ve realised that it is very hard to think about any sort of open problem without engaging with the sort of ideas that a few years ago I would have thought of (and possibly dismissed) as ‘analysis’.

Much of my working on this problem has been rather from first principles, so I haven’t been thinking much about any neat less elementary theory recently.

Ok, so on with the actual post now.

Last month I talked about local limits of graphs, which describe convergence in distribution of (local) neighbourhood structure about a ‘typical’ vertex. This is the correct context in which to make claims like “components of G(n,\frac{\lambda}{n}) look like Galton-Watson trees with offspring distribution \text{Po}(\lambda)”.

Even from this example, we can see a couple of drawbacks and omissions from this limiting picture. In the sub-critical regime, this G-W tree will be almost surely finite, but the number of vertices in the graph is going to infinity. More concretely, the limit description only tells us about a single component. If we wanted to know about a second component, in this case, it would be roughly independent of the size of the first component, and with the same distribution, but if we wanted to know about all components, it would get much more complicated.

Similarly, this local limit description isn’t particularly satisfactory in the supercritical regime. When the component in question is finite, this description is correct, but with high probability we have a giant component, and so the ‘typical’ vertex is with some positive probability in the giant component. This is reflected by the fact that the G-W tree with supercritical offspring distribution is infinite with some positive probability. However, the giant component does not look like a \text{Po}(\lambda) G-W tree. As we exhaust O(n) vertices, the offspring distribution decreases, in expectation at least. In fact, without the assumption that the giant component is with high probability unique (so \frac{L_1}{n}=1-\mathbb{P}(|C(v)|<\infty)), we can’t even deduce the expected size of the giant component from the local limit result.

This is all unsurprising. By definition a local limit describes the structure near some vertex. How near? Well, finitely near. It can be arbitrarily large, but still finite, so in particular, the change in the offspring distribution after O(n) vertices as mentioned above will not be covered.

So, if we want to learn more about the global structure of a large discrete object, we need to consider a different type of limit. In particular, the limit will not necessarily be a graph. Rather than try to define a priori a ‘continuum’ version of a graph, it is sensible to generalise from the idea that a graph is a discrete object and instead consider it as a metric space.

In this article, I don’t want to spend much time at all thinking about how to encode a finite graph as a metric space. We have a natural notion of graph distance between vertices, and it is not hard to extend this to points on edges. Alternatively, for sparse graphs, we have an encoding through various functions, which live in some (metric) function space.

However, in general, the graph will be a metric object itself, rather than necessarily a subset of a global metric space. We will be interested in convergence, so we need a suitable style of convergence of different metric spaces.

The natural candidate for this is the Gromov-Hausdorff metric, and the corresponding Gromov-Hausdorff convergence.

The Hausdorff distance between two subsets X, Y of a metric space is defined as follows. Informally, we say that d_H(X,Y)<\epsilon if any point of X is within distance \epsilon from some point of Y, and vice versa, in the sense of the original metric. Formally

d_H(X,Y):=\max \{\sup_{x\in X}\inf_{y\in Y}d(x,y), \sup_{y\in Y}\inf_{x\in X}d(x,y)\}.
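For finite sets the suprema and infima are just maxima and minima, so the definition can be computed directly. A minimal sketch, assuming the points are tuples of coordinates and the ambient metric is Euclidean:

```python
from math import dist


def hausdorff(xs, ys):
    """Hausdorff distance between two finite, non-empty sets of points in R^d."""
    def one_sided(a, b):
        # the furthest that any point of a is from its nearest point of b
        return max(min(dist(p, q) for q in b) for p in a)
    return max(one_sided(xs, ys), one_sided(ys, xs))


# Every point of {0} is in {0, 1}, but the point 1 is at distance 1 from {0}.
print(hausdorff([(0.0,)], [(0.0,), (1.0,)]))  # 1.0
```

The example shows why both terms in the maximum are needed: one of the two one-sided quantities is zero here.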

It is not particularly illuminating to prove that this is in fact a metric. In fact, it isn’t a metric as the definition stands, but rather a pseudo-metric, which is exactly the same, only allowing d(X,Y)=0 when X and Y are not equal. Note that

d_H(X,\bar X)=0,

for any set X, so this gives an example, provided X is not closed. Furthermore, if the underlying metric space is unbounded, then the Hausdorff distance between two sets might be infinite. For example in \mathbb{R},

d_H(\mathbb{R}_{<0},\mathbb{R}_{>0})=\infty.

We can overcome this pair of objections by restricting attention to closed, bounded sets. In practice, many spaces under consideration will be real in flavour, so it makes sense to define this for compact sets when appropriate.
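For finite subsets of the real line the suprema and infima in the definition become maxima and minima, so the Hausdorff distance can be computed directly. The following is a minimal sketch under that assumption; the function name hausdorff_distance is purely illustrative.

```python
from typing import Sequence


def hausdorff_distance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Hausdorff distance between two finite, non-empty subsets of the real line."""
    if not xs or not ys:
        raise ValueError("both sets must be non-empty")
    # Largest distance from a point of xs to its nearest point of ys.
    x_to_y = max(min(abs(x - y) for y in ys) for x in xs)
    # Largest distance from a point of ys to its nearest point of xs.
    y_to_x = max(min(abs(x - y) for x in xs) for y in ys)
    return max(x_to_y, y_to_x)


# The two one-sided terms can differ: every point of {0, 1} is within 1 of {0, 3},
# but the point 3 is at distance 2 from {0, 1}.
print(hausdorff_distance([0.0, 1.0], [0.0, 3.0]))
```

Taking the maximum of the two one-sided terms is what makes the distance symmetric in X and Y.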

But this still leaves the underlying problem, which is how to define a distance function on metric spaces. If two metric spaces X and Y were both subspaces of some larger metric space then it would be easy, as we now have the Hausdorff distance. So this is in fact how we proceed in general. We don’t need any knowledge of this covering space a priori, we can just choose the one which minimises the resulting Hausdorff distance. That is

d_{GH}(X,Y)=\inf\{d_H(\phi(X),\psi(Y))\},

where the infimum is taken over all metric spaces (E,d), and isometric embeddings \phi: X\rightarrow E, \psi: Y\rightarrow E.

The first observation is that this will again be a pseudometric unless we demand that X, Y be closed and bounded. The second is that this index set is not a set. Fortunately, this is quickly rectified. Instead consider all metrics on the disjoint union of sets X and Y, which is a set, and contains the subset of those metrics which restrict to the correct metric on each of X and Y. It can be checked that this forms a true metric on the set of compact metric spaces up to isometry.

We have an alternative characterisation. Given compact sets X and Y, a correspondence between X and Y is a set of pairs in X\times Y, such that both projection maps are surjective. I.e. for any x in X, there is some pair (x,y) in the correspondence, and likewise for any y in Y. Let \mathcal{C}(X,Y) be the set of such correspondences. We then define the distortion of correspondence \mathcal{R} by:

\text{dis}(\mathcal{R}):=\sup\{|d_X(x_1,x_2)-d_Y(y_1,y_2)|: (x_i,y_i)\in\mathcal{R}\}.

Then

d_{GH}(X,Y)=\frac{1}{2}\inf_{\mathcal{R}\in\mathcal{C}(X,Y)}\text{dis}(\mathcal{R}).

In particular, this gives another reason why we don’t have to worry about taking an infimum over a proper class.
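For very small finite metric spaces, the correspondence formula can be evaluated by brute force, since there are only finitely many subsets of X\times Y to check. The sketch below assumes each space is given as a square matrix of pairwise distances; gh_distance is a hypothetical name, and the enumeration is exponential in |X||Y|, so it is only intended for spaces with a handful of points.

```python
from itertools import product


def gh_distance(dX, dY):
    """Gromov-Hausdorff distance between two tiny finite metric spaces.

    dX and dY are square distance matrices (lists of lists).
    """
    nX, nY = len(dX), len(dY)
    pairs = list(product(range(nX), range(nY)))
    best = float("inf")
    # Enumerate every non-empty subset of X x Y via a bitmask.
    for mask in range(1, 1 << len(pairs)):
        R = [p for i, p in enumerate(pairs) if (mask >> i) & 1]
        # Keep only correspondences: both projections must be surjective.
        if {x for x, _ in R} != set(range(nX)):
            continue
        if {y for _, y in R} != set(range(nY)):
            continue
        distortion = max(
            abs(dX[x1][x2] - dY[y1][y2]) for (x1, y1) in R for (x2, y2) in R
        )
        best = min(best, distortion)
    return best / 2


two_points = [[0, 1], [1, 0]]   # two points at distance 1
one_point = [[0]]               # a single point
print(gh_distance(two_points, one_point))  # 0.5
```

In the example the only correspondence pairs both points of X with the single point of Y, which has distortion 1, so the distance is 1/2. This is the finite analogue of the shrinking balls converging to a point.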

Gromov-Hausdorff convergence then has the natural definition. Note that this does not respect topological equivalence, ie homeomorphism. For example,

\bar{B(0,\frac{1}{n})}\stackrel{GH}{\rightarrow} \{0\},

where the latter has the trivial metric. In particular, although all the closed balls are homeomorphic to one another, the G-H limit is not homeomorphic to any of them.

A final remark is that the trees we might be looking at are not necessarily compact, so it is useful to have a notion of how this might be extended to non-compact spaces. The answer is to borrow the idea from local limits of considering large finite balls around a fixed central point. In the case of trees, this is particularly well-motivated, as it is often quite natural to have a canonical choice for the ‘root’.

Then with identified points p_n\in X_n, say (X_n,p_n)\rightarrow (X,p) if for any R>0 the R-ball around p_n in X_n converges to the R-ball around p in X. We adjust the definition of distortion to include the condition that the infimum be over correspondences for which (p_X,p_Y) is an element.

REFERENCES

This article was based on some lecture notes by Jean-Francois Le Gall from the Clay Institute Summer School which can be found on the author’s website here (about halfway down the page). This material is in chapter 3. I also used Nicolas Curien’s tutorials on this chapter to inform some of the examples. The resolution of the proper class problem was mentioned by several sources I examined. These notes by Jan Christina were among the best.

Local Limits

In several previous posts, I have talked about scaling limits of various random graphs. Typically in this situation we are interested in convergence of large-scale properties of the graph as the size grows to some limit. These properties will normally be metric in flavour: diameter, component size and so on. To describe convergence of these properties, we divide by the relevant scale, which will often be some simple function of n. If we are looking to find an actual limit object, this is even more important. This is rather similar to describing properties of centred random walks. There, if we run the walk for time n, we have to rescale by \frac{1}{\sqrt{n}} to see the fluctuations on a finite positive scale.

One of the best examples is Aldous’ Continuum Random Tree which we can view as the limit of a Galton-Watson tree conditioned to have total size n, as n tends to infinity. Because of the exploration process or contour process interpretation, where these functions behave rather like a random walk, the correct scaling in this context is again \frac{1}{\sqrt{n}}. The point about this convergence is that it is realised entirely as a convergence of some function that represents the tree. For each finite n, it is clear that the tree with n vertices is a graph, but this is neither clear nor true for the limit object. Although it does indeed have no cycles, if nothing else, if the CRT were a graph it would have [0,1] as vertex set and then it would be highly non-obvious how to define the edges.

Local limits aim to give convergence towards a (discrete) infinite graph. The sort of properties we are looking for are now local properties such as degrees and correlations of degrees. These don’t require knowledge of the whole graph, only of some finite subset. First consider the possibility that the sequence of deterministic graphs has the property:

G_1\leq G_2\leq G_3\leq\ldots

where \leq denotes an induced subgraph. Then it is relatively clear what the limit should be, as it is well-defined to take a union. This won’t work directly for a limit of random graphs, because the above relation in probability doesn’t even really make sense if we have a different probability space for each finite graph. This is a general clue that we should be looking to use convergence in distribution rather than anything stronger.

In the previous example, suppose the first finite graph G_1 consists of a single vertex v. If the limit graph (remember this is just the union, since that is well-defined) has bounded degrees, then there is some N such that G_N contains all the information we might want about the limiting neighbourhood of vertex v. For some larger N, G_N contains all the vertices and edges within distance r from our starting vertex v that appear in the limit graph.

This is all the motivation we require for a genuine definition. We will define our limit in terms of neighbourhoods, so we need some mechanism to choose the central vertex of such a neighbourhood. The answer is to consider rooted graphs, that is, a graph with an identified vertex. We can introduce randomness by specifying a random graph, or by giving a distribution for the choice of root. If G is finite, the canonical choice is to choose the root uniformly from the set of vertices. This isn’t an option for an infinite graph, so we define the system as (G, p) where G is a (for now deterministic) graph, and p is a probability measure on V(G).

We say that the limit of finite (G_n) is the random rooted infinite graph (G, p) if the neighbourhoods of G_n around a randomly chosen vertex converge in distribution to the neighbourhoods of G around p. Formally, say (G_n)[U_n]\stackrel{d}{\rightarrow} (G,p) if for all r>0, for any finite rooted graph (H,w), the probability that (H,w) is isomorphic to the ball of radius r in G_n centred at randomly chosen v_n converges to the probability that (H,w) is isomorphic to the ball of radius r around v in (G,v), where v is distributed according to measure p.

Informally, we might say that if we zoom in on an average vertex in G_n for large n, the neighbourhood looks the same as the neighbourhood around the root in (G, p). We now consider three examples.

1) When we talk about approximating the component size in a sparse Erdos-Renyi random graph by a \text{Po}(\lambda) branching process, this is exactly the limit sense we mean. The approximation fails if we fix n and take the neighbourhood size very large (eg radius n), but for finite neighbourhoods, or any radius growing more slowly than n, the approximation is good.

2) To emphasise why rooting the finite graphs makes a difference, consider the full binary tree with n levels (so 2^n-1 vertices). If we fix the root, then the limit is the infinite-level binary tree, though this isn’t especially surprising or interesting.

Things get a bit more complicated if we root randomly. Remember that the motivation for random rooting is that we want to know the local structure around a vertex chosen at random in many applications. If we definitely know what vertex we are going to choose, we know the local structure a priori. Note that in an n-level binary tree, 2^{n-1} vertices are leaves, not counting the base of the tree, and 2^{n-2} are distance 1 from a leaf, and 2^{n-3} are distance 2 from a leaf and so on.

This gives us a precise description of the limiting local neighbourhood structure. The resulting limiting object is called the canopy tree. One picture of this can be found on page 6 of this paper. A verbal description is also possible. Consider the set of non-negative integers, arranged in the usual manner on the real line, with edges between adjacent elements. The distribution of the root will be supported on this set of vertices, corresponding to the distance from the leaves in the pre-limit graph. So we have mass 1/2 at 0, 1/4 at 1, 1/8 at 2 and so on. We then connect each vertex k to a full k-level binary tree. The resulting canopy tree looks like an infinite-level full binary tree, viewed from the leaves, which is of course a reasonable heuristic, since that is where the mass is concentrated if we randomly root.
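The masses 1/2, 1/4, 1/8, … can be checked against the exact counts in the finite tree. The sketch below uses the counts quoted above (2^{n-1-k} vertices at distance k from the leaves, out of 2^n-1 in total); the function name is illustrative only.

```python
from fractions import Fraction


def height_distribution(n: int, max_height: int):
    """Law of the distance to the leaves for a uniform vertex of the n-level full binary tree.

    Returns the probabilities for heights 0, 1, ..., min(max_height, n - 1).
    """
    total = 2**n - 1
    top = min(max_height, n - 1)
    return [Fraction(2 ** (n - 1 - k), total) for k in range(top + 1)]


# Three levels: 4 leaves, 2 vertices one step up, 1 root, out of 7 vertices.
print(height_distribution(3, 5))
# Thirty levels: already very close to 1/2, 1/4, 1/8, 1/16.
print([float(p) for p in height_distribution(30, 3)])
```

Exact fractions make it clear that the only discrepancy from the geometric masses is the factor 2^n/(2^n-1), which tends to 1.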

3) In particular, the limit is not the infinite-level binary tree. The canopy tree and the infinite-level binary tree have qualitatively different properties. Simple random walk on the canopy tree is recurrent for example. In fact, a result of Benjamini and Schramm, as explained in this review by Curien, says that any local limit of uniformly bounded degree, uniformly rooted, planar graphs is recurrent for SRW. The infinite-level binary tree can be expressed as a local limit if we choose the root distribution sensibly, using large random 3-regular graphs. The previous result does not apply because the random 3-regular graphs are not almost surely planar.

REFERENCES:

– Much of this article is a paraphrase of a section of Itai Benjamini’s mini-course at the DSSA in Haifa March 2013.

– As well as the review paper linked above, these notes by David Aldous were very useful.

Large Deviations 6 – Random Graphs

As a final instalment in this sequence of posts on Large Deviations, I’m going to try and explain how one might be able to apply some of the theory to a problem about random graphs. I should explain in advance that much of what follows will be a heuristic argument only. In a way, I’m more interested in explaining what the technical challenges are than trying to solve them. Not least because at the moment I don’t know exactly how to solve most of them. At the very end I will present a rate function, and reference properly the authors who have proved this. Their methods are related but not identical to what I will present.

Problem

Recall the two standard definitions of random graphs. As in many previous posts, we are interested in the sparse case where the average degree of a vertex is O(1). Anyway, we start with n vertices, and in one description we add an edge between any pair of vertices independently and with fixed probability \frac{\lambda}{n}. In the second model, we choose uniformly at random from the set of graphs with n vertices and \frac{\lambda n}{2} edges. Note that if we take the first model and condition on the number of edges, we get the second model, since the probability of a given configuration appearing in G(n,p) is a function only of the number of edges present. Furthermore, the number of edges in G(n,p) is binomial with parameters \binom{n}{2} and p. For all purposes here it will make no difference to approximate the former by \frac{n^2}{2}.

Of particular interest in the study of sparse random graphs is the phase transition in the size of the largest component observed as \lambda passes 1. Below 1, the largest component has size on a scale of log n, and with high probability all components are trees. Above 1, there is a unique giant component containing \alpha_\lambda n vertices, and all other components are small. For \lambda\approx 1, where I don’t want to discuss what ‘approximately’ means right now, we have a critical window, for which there are infinitely many components with sizes on a scale of n^{2/3}.

A key observation is that this holds irrespective of which model we are using. In particular, this is consistent. By the central limit theorem, we have that:

|E(G(n,\frac{\lambda}{n}))|\sim \text{Bin}\left(\binom{n}{2},\frac{\lambda}{n}\right)\approx \frac{n\lambda}{2}\pm\alpha,

where \alpha is the error due to CLT-scale fluctuations. In particular, these fluctuations are on a scale smaller than n, so in the limit have no effect on which value of \lambda in the edge-specified model is appropriate.

However, it is still a random model, so we can condition on any event which happens with positive probability, so we might ask: what does a supercritical random graph look like if we condition it to have no giant component? Assume for now that we are considering G(n,\frac{\lambda}{n}),\lambda>1.

This deviation from standard behaviour might be achieved in at least two ways. Firstly, we might just have insufficient edges. If we have a large deviation towards too few edges, then this would correspond to a subcritical G(n,\frac{\mu n}{2}), so would have no giant components. However, it is also possible that the lack of a giant component is due to ‘clustering’. We might in fact have the correct number of edges, but they might have arranged themselves into a configuration that keeps every component small. For example, we might have a complete graph on Kn^{1/2} vertices plus a whole load of isolated vertices. This has the correct number of edges, but certainly no giant component (that is an O(n) component).

We might suspect that having too few edges would be the primary cause of having no giant component, but it would be interesting if clustering played a role. In a previous post, I talked about more realistic models of complex networks, for which clustering beyond the levels of Erdos-Renyi is one of the properties we seek. There I described a few models which might produce some of these properties. Obviously another model is to take Erdos-Renyi and condition it to have lots of clustering but that isn’t hugely helpful as it is not obvious what the resulting graphs will in general look like. It would certainly be interesting if conditioning on having no giant component were enough to get lots of clustering.

To do this, we need to find a rate function for the size of the giant component in a supercritical random graph. Then we will assume that evaluating this near 0 gives the LD probability of having ‘no giant component’. We will then compare this to the straightforward rate function for the number of edges; in particular, evaluated at criticality, so the probability that we have a subcritical number of edges in our supercritical random graph. If they are the same, then this says that the shortage of edges dominates clustering effects. If the former is smaller, then clustering may play a non-trivial role. If the former is larger, then we will probably have made a mistake, as we expect on a LD scale that having too few edges will almost surely lead to a subcritical component.

Methods

The starting point is the exploration process for components of the random graph. Recall we start at some vertex v and explore the component containing v depth-first, tracking the number of vertices which have been seen but not yet explored. We can extend this to all components by defining:

S(0)=0, \quad S(t)=S(t-1)+(X(t)-1),

where X(t) is the number of children of the t’th vertex. For a single component, S(t) is precisely the number of seen but unexplored vertices. It is more complicated in general. Note that when we exhaust the first component S(t)=-1, and then when we exhaust the second component S(t)=-2 and so on. So in fact

S_t-\min_{0\leq s\leq t}S_s

is the number of seen but unexplored vertices, with \min_{0\leq s\leq t}S_s equal to (-1) times the number of components already explored up to time t.
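The walk is easy to compute for a concrete graph. The following is a minimal sketch, assuming the graph is given as a dictionary of adjacency lists and that a new component is started from the smallest unseen label (an arbitrary choice made only for this illustration). The function names are hypothetical. Each new running minimum marks the exhaustion of a component, so the gaps between those times give the component sizes.

```python
def exploration_walk(adj):
    """Exploration walk S(0), ..., S(n) of a graph given as {vertex: [neighbours]}."""
    order = sorted(adj)
    seen = set()
    stack = []          # seen but not yet explored vertices
    walk = [0]
    for _ in range(len(adj)):
        if not stack:
            # Current component exhausted: start a new one at an unseen vertex.
            root = next(v for v in order if v not in seen)
            seen.add(root)
            stack.append(root)
        v = stack.pop()
        # X(t): neighbours of v that have not been seen before.
        children = [w for w in adj[v] if w not in seen]
        seen.update(children)
        stack.extend(children)
        walk.append(walk[-1] + len(children) - 1)
    return walk


def component_sizes(walk):
    """Component sizes, read off from the times at which the walk hits a new minimum."""
    sizes, last_time, current_min = [], 0, 0
    for t, s in enumerate(walk):
        if s < current_min:
            sizes.append(t - last_time)
            last_time, current_min = t, s
    return sizes


# A path 0-1-2, an isolated vertex 3 and an edge 4-5.
graph = {0: [1], 1: [0, 2], 2: [1], 3: [], 4: [5], 5: [4]}
walk = exploration_walk(graph)
print(walk)                   # [0, 0, 0, -1, -2, -2, -3]
print(component_sizes(walk))  # [3, 1, 2]
```

The final value of the walk is -3, matching the three components of the example graph.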

Once we know the structure of the first t vertices, we expect the distribution of X(t) – 1 to be

\text{Bin}\Big(n-t-[S_t-\min_{0\leq s\leq t}S_s],\tfrac{\lambda}{n}\Big)-1.

We aren’t interested in all the edges of the random graph, only in some tree skeleton of each component. So we don’t need to consider the possibility of edges connecting our current location to anywhere we’ve previously visited (as such an edge would have been considered then – it’s a depth-first exploration), hence the -t. But we also don’t want to consider edges connecting our current location to anywhere we’ve seen, since that would be a surplus edge creating a cycle, hence the -[S_t-\min_{0\leq s\leq t}S_s]. It is binomial because by independence even after all this conditioning, the probability that there’s an edge from my current location to any other vertex apart from those discounted is equal to \frac{\lambda}{n} and independent.

For Mogulskii’s theorem in the previous post, we had an LDP for the rescaled paths of a random walk with independent stationary increments. In this situation we have a random walk where the increments do not have this property. They are not stationary because the pre-limit distribution depends on time. They are also not independent, because the distribution depends on behaviour up to time t, but only through the value of the walk at the present time.

Nonetheless, at least by following through the heuristic of having an instantaneous exponential cost for a LD event, then products of sums becoming integrals within the exponent, we would expect to have a similar result for this case. We can find the rate function \Lambda_\lambda^*(x) of \text{Po}(\lambda)-1 and thus get a rate function for paths of the exploration process:

I_\lambda(f)=\int_0^1 \Lambda_{(1-t-\bar{f}(t))\lambda}^*(f')dt,

where \bar{f}(t) is the height of f above its previous minimum.

Technicalities and Challenges

1) First we need to prove that it is actually possible to extend Mogulskii to this more general setting. Even though we are varying the distribution continuously, so we have some sort of ‘local almost convexity’, the proof is going to be fairly fiddly.

2) Having to consider excursions above the local minima is a massive hassle. We would ideally like to replace \bar{f} with f. This doesn’t seem unreasonable. After all, if we pick a giant component within o(n) steps, then everything considered before the giant component won’t show up in the O(n) rescaling, so we will have a series of macroscopic excursions above 0 with widths giving the actual sizes of the giant components. The problem is that even though with high probability we will pick a giant component after O(1) components, the probability that we do not do this decays only exponentially fast, so will show up as a term in the LD analysis. We would hope that this would not be important – after all later we are going to take an infimum, and since the order we choose the vertices to explore is random and in particular independent of the actual structure, it ought not to make a huge difference to any result.

3) A key lemma in the proof of Mogulskii in Dembo and Zeitouni was the result that it doesn’t matter from an LDP point of view whether we consider the linear (continuous) interpolation or the step-wise interpolation to get a process that actually lives in L_\infty([0,1]). In this generalised case, we will also need to check that approximating the Binomial distribution by its Poisson limit is valid on an exponential scale. Note that because errors in the approximation for small values of t affect the parameter of the distribution at larger times, this will be more complicated to check than for the IID case.

4) Once we have a rate function, if we actually want to know about the structure of the ‘typical’ graph displaying some LD property, we will need to find the infimum of the integrated rate function with some constraints. This is likely to be quite nasty unless we can directly use Euler-Lagrange or some other variational tool.

Answer

Papers by O’Connell and Puhalskii have found the rate function. Among other interesting things, we learn that:

I_{(1+\epsilon)}(0)\approx \frac{\epsilon^3}{6},

while the rate function for the number of edges:

-\lim\tfrac{1}{n}\log\mathbb{P}\Big(\text{Bin}(\tfrac{n^2}{2},\tfrac{1+\epsilon}{n})\leq\tfrac{n}{2}\Big)\approx \frac{\epsilon^2}{4}.

So in fact it looks as if there might be a significant contribution from clustering after all.
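The \frac{\epsilon^2}{4} for the number of edges can be recovered from the standard relative-entropy (Chernoff) estimate for the lower tail of a binomial. The following is a sketch of that calculation only; the shorthand N=\frac{n^2}{2}, p=\frac{1+\epsilon}{n} and a=\frac{1}{n} is introduced here for convenience and is not notation from the papers cited.

```latex
% Sketch: relative-entropy estimate for a binomial lower tail.
% Shorthand (assumed here): N = n^2/2 trials, success probability p = (1+\epsilon)/n,
% target fraction a = 1/n, so that Na = n/2.
\begin{align*}
-\log \mathbb{P}\big(\text{Bin}(N,p)\le Na\big)
  &\approx N\Big[a\log\tfrac{a}{p} + (1-a)\log\tfrac{1-a}{1-p}\Big] \\
  &= \tfrac{n^2}{2}\Big[-\tfrac{1}{n}\log(1+\epsilon) + \tfrac{\epsilon}{n} + O(n^{-2})\Big] \\
  &= \tfrac{n}{2}\big[\epsilon - \log(1+\epsilon)\big] + O(1).
\end{align*}
% Dividing by n and expanding the logarithm for small \epsilon:
\[
\tfrac{1}{2}\big[\epsilon-\log(1+\epsilon)\big]
  = \tfrac{\epsilon^2}{4} - \tfrac{\epsilon^3}{6} + O(\epsilon^4).
\]
```

For small \epsilon the leading term \frac{\epsilon^2}{4} is much larger than \frac{\epsilon^3}{6}, which is the comparison being made above.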

Exploring the Supercritical Random Graph

I’ve spent a bit of time this week reading and doing all the exercises from some excellent notes by van der Hofstad about random graphs. I think they are absolutely excellent and would not be surprised if they become the standard text for an introduction to probabilistic combinatorics. You can find them hosted on the author’s website. I’ve been reading chapters 4 and 5, which approach the properties of phase transitions in G(n,p) by formalising the analogy between component sizes and population sizes in a binomial branching process. When I met this sort of material for the first time during Part III, the proofs generally relied on careful first and second moment bounds, which is fine in many ways, but I enjoyed vdH’s (perhaps more modern?) approach, as it seems to give a more accurate picture of what is actually going on. In this post, I am going to talk about using the branching process picture to explain why the giant component emerges when it does, and how to get a grip on how large it is at any time after it has emerged.

Background

A quick tour through the background, and in particular the notation will be required. At some point I will write a post about this topic in a more digestible format, but for now I want to move on as quickly as possible.

We are looking at the sparse random graph G(n,\frac{\lambda}{n}), in the super-critical phase \lambda>1. With high probability (that is, with probability tending to 1 as n grows), we have a so-called giant component, with O(n) vertices.

Because all the edges in the configuration are independent, we can view the component containing a fixed vertex as a branching process. Given vertex v(1), the number of neighbours is distributed like \text{Bi}(n-1,\frac{\lambda}{n}). The number of neighbours of each of these which we haven’t already considered is then \text{Bi}(n-k,\frac{\lambda}{n}), conditional on k, the number of vertices we have already discounted. After any finite number of steps, k=o(n), and so it is fairly reasonable to approximate this just by \text{Bi}(n,\frac{\lambda}{n}). Furthermore, as n grows, this distribution converges to \text{Po}(\lambda), and so it is natural to expect that the probability that the fixed vertex lies in a giant component is equal to the survival probability \zeta_\lambda (that is, the probability that it is infinite) of a branching process with \text{Po}(\lambda) offspring distribution. Note that given a graph, the probability of a fixed vertex lying in a giant component is equal to the fraction of vertices in the giant component. At this point it is clear why the emergence of the giant component must happen at \lambda=1, because we require \mathbb{E}\text{Po}(\lambda)>1 for the survival probability to be non-zero. Obviously, all of this needs to be made precise and rigorous, and this is treated in sections 4.3 and 4.4 of the notes.

Exploration Process

A common functional of a rooted branching process to consider is the following. This is called in various places an exploration process, a depth-first process or a Lukasiewicz path. We take a depth-first labelling of the tree v(0), v(1), v(2),… , and define c(k) to be the number of children of vertex v(k). We then define the exploration process by:

S(0)=0,\quad S(k+1)=S(k)+c(k)-1.

By far the best way to think of this is to imagine we are making the depth-first walk on the tree. S(k) records how many vertices we have seen (because they are connected by an edge to a vertex we have visited) but have not yet visited. To clarify understanding of the definition, note that when you arrive at a vertex with no children, this should decrease by one, as you can see no new vertices, but have visited an extra one.
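As a small worked example, the path can be computed from the list of child counts read off in depth-first order. This is only a sketch: the helper name lukasiewicz_path is made up for illustration, and the `initial` argument of itertools.accumulate needs Python 3.8 or later.

```python
from itertools import accumulate


def lukasiewicz_path(child_counts):
    """S(0), S(1), ..., S(m) for child counts c(0), ..., c(m-1) in depth-first order."""
    return list(accumulate((c - 1 for c in child_counts), initial=0))


# A root with two children; the first child has one child of its own and
# the remaining vertices are leaves, so the depth-first child counts are 2, 1, 0, 0.
path = lukasiewicz_path([2, 1, 0, 0])
print(path)            # [0, 1, 1, 0, -1]
print(path.index(-1))  # 4
```

With the convention S(0)=0, the path for this four-vertex tree first reaches -1 at step 4, that is, after every vertex has been visited.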

This exploration process is useful to consider for a couple of reasons. Firstly, you can reconstruct the branching process directly from it. Secondly, while other functionals (eg the height, or contour process) look like random walks, the exploration process genuinely is a random walk. The distribution of the number of children of the next vertex we arrive at is independent of everything we have previously seen in the tree, and is the same for every vertex. If we were looking at branching processes in a different context, we might observe that this gives some information in a suitably-rescaled limit, as rescaled random walks converge to Brownian motion if the variance of the (offspring) distribution is finite. (This is Donsker’s result, which I should write something about soon…)

The most important property is that the exploration process drops below its starting level for the first time (so reaches -1, with the convention S(0)=0 above) precisely when we have exhausted all the vertices in a component. At that point, we have seen exactly the vertices which we have explored. There is no reason not to extend the definition to forests, that is a union of trees. The depth-first exploration is the same – but when we have exhausted one component, we move onto another component, chosen according to some labelling property. Then, running minima of the exploration process (ie times when it is smaller than it has been before) correspond to jumping between components, and thus excursions above the minimum to components themselves. The running minimum will be non-positive, with absolute value equal to the number of components already exhausted.

Although the exploration process was defined with reference to and in the language of trees, the result of a branching process, this is not necessary. With some vertex denoted as the root, we can construct a depth-first labelling of a general graph, and the exploration process follows exactly as before. Note that we end up ignoring all edges except a set that forms a forest. This is what we will apply to G(n,p).

Exploring G(n,p)

When we jump between components in the exploration process on a supercritical (that is \lambda>1) random graph, we move to a component chosen randomly with size-biased distribution. If there is a giant component, as we know there is in the supercritical case, then this will dominate the size-biased distribution. Precisely, if the giant component takes up a fraction H of the vertices, then the number of components to be explored before we get to the giant component is geometrically distributed with parameter H. All other components have size O(log n), so the expected number of vertices explored before we get to the giant component is O(log n)/H = o(n), and so in the limit, we explore the giant component immediately.

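The exploration process therefore gives good control on the giant component in the limit, as roughly speaking the first time it returns to 0 is the size of the giant component. Fortunately, we can also control the distribution of S_t, the exploration process at time t. Here we use the convention S_0=1, so that S_t counts exactly the vertices which have been seen but not yet explored, and the first component is exhausted when S_t first reaches 0. We have that: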

S_t+(t-1)\sim \text{Bi}(n-1,1-(1-p)^t).

This is not too hard to see. S_t+(t-1) is the number of vertices we have explored or seen, ie are connected to a vertex we have explored. Suppose the remaining vertices are called unseen, and we began the exploration at vertex 1. Then any vertex with label in {2,…,n} is unseen if it successively avoids being in the neighbourhood of v(1), v(2), … v(t). This happens with probability (1-p)^t, and so the probability of being an explored or seen vertex is the complement of this.

In the supercritical case, we are taking p=\frac{\lambda}{n} with \lambda>1, and we also want to speed up S, so that all the exploration processes are defined on [0,1], and rescale the sizes by n, so that it records the fraction of the graph rather than the number of vertices. So we consider the rescaling \frac{1}{n}S_{nt}.

It is straightforward to use the distribution of S_t to deduce that the asymptotic mean is \mathbb{E}\frac{1}{n}S_{nt}\rightarrow\mu_t = 1-t-e^{-\lambda t}.
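The convergence of the mean can be checked numerically, with no simulation needed, by taking expectations in the binomial law for S_t+(t-1) quoted above. This is a sketch only; the function names and the particular values of n, \lambda and t are arbitrary choices for illustration.

```python
import math


def exact_rescaled_mean(n: int, lam: float, t: float) -> float:
    """E[S_{nt}] / n, using S_k + (k - 1) ~ Bi(n - 1, 1 - (1 - p)^k) with p = lam / n."""
    k = round(n * t)
    p = lam / n
    mean_seen_or_explored = (n - 1) * (1 - (1 - p) ** k)
    return (mean_seen_or_explored - (k - 1)) / n


def limit_mean(lam: float, t: float) -> float:
    """The claimed limit mu_t = 1 - t - exp(-lam * t)."""
    return 1 - t - math.exp(-lam * t)


for n in (100, 10_000, 1_000_000):
    print(n, exact_rescaled_mean(n, 2.0, 0.5), limit_mean(2.0, 0.5))
```

The only approximation involved is (1-\frac{\lambda}{n})^{nt}\approx e^{-\lambda t}, so the error shrinks as n grows.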

Now we are in a position to provide more concrete motivation for the claim that the proportion of vertices in the giant component is \zeta_\lambda, the survival probability of a branching process with \text{Po}(\lambda) offspring distribution. It helps to consider instead the extinction probability 1-\zeta_\lambda. We have:

1-\zeta_\lambda=\sum_{k\geq 0}\mathbb{P}(\text{Po}(\lambda)=k)(1-\zeta_\lambda)^k=e^{-\lambda\zeta_\lambda},

where the second equality is a consequence of the simple form for the probability generating function of the Poisson distribution.

As a result, we have that \mu_{\zeta_\lambda}=0. In fact we also have a central limit theorem for S_t, which enables us to deduce that \frac{1}{n}S_{n\zeta_\lambda} is close to 0 with high probability, as well as in expectation, which is precisely what is required to prove that the giant component of G(n,\frac{\lambda}{n}) has size n(\zeta_\lambda+o(1)).
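The survival probability has no closed form, but it is easy to approximate. A minimal sketch, assuming we iterate the fixed-point equation for the extinction probability starting from 0 (which increases to the smallest solution); the function name and the iteration count are arbitrary choices made for this illustration.

```python
import math


def survival_probability(lam: float, iterations: int = 200) -> float:
    """Approximate survival probability of a Po(lam) branching process."""
    q = 0.0  # extinction probability, approached from below
    for _ in range(iterations):
        # q -> G(q), where G(s) = exp(lam * (s - 1)) is the Poisson generating function.
        q = math.exp(-lam * (1 - q))
    return 1 - q


lam = 2.0
zeta = survival_probability(lam)
print(zeta)
# mu_t = 1 - t - exp(-lam * t) should vanish at t = zeta.
print(1 - zeta - math.exp(-lam * zeta))
```

For \lambda=2 this gives \zeta_\lambda just under 0.8, so roughly four fifths of the vertices lie in the giant component. Convergence of the iteration becomes slow as \lambda approaches 1, so more iterations would be needed near criticality.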