Sharpness of Phase Transition in Percolation

On the flight to Romania a couple of weeks ago, I read this very nice paper by Duminil-Copin and Tassion in which, as a corollary to a more general result in a companion paper, they provide a short proof of a result about percolation on \mathbb{Z}^d. Mitch talked about this paper at our final Junior Probability Seminar of this term yesterday, so it seemed appropriate to write up something about this nice argument. I must confess I know relatively little about these problems, and in particular know nothing about how the original authors, Aizenmann + Barsky (1987), and Menshikov (1986) approached this problem, but experts have said that this is substantially easier to digest.

Rather than reference the first paper repeatedly, I remark now that everything which follows comes from there.

We consider conventional bond percolation on the edges of \mathbb{Z}^d, for d\ge 2, and are interested in whether the the origin percolates with positive probability. That is, that zero is contained in an infinite component. As usual we define p_c=\sup\{p: \mathbb{P}_p(0\leftrightarrow\infty)=0\} to be the critical probability above which percolation happens with positive probability. Defining \theta(p)=\mathbb{P}_p(0\leftrightarrow\infty), we do not know whether \theta(p_c)=0 for some values of d, notably d=3.

If the origin is connected to infinity, it is by definition connected to the boundary of every \Lambda_n:=[-n,n]^d. The number of distinct paths from the origin to \partial \Lambda_n is bounded by the number of self-avoiding walks on the lattice of length n starting from 0, which grows at most as fast as (2d-1)^n. In particular, we know that p_c\ge \frac{1}{2d-1}, but also, for any p<\frac{1}{2d-1}, the probability \mathbb{P}_p[0\leftrightarrow \partial\Lambda_n] decays exponentially in n. We would expect this in fact to hold for all p<p_c, and this is something that the authors prove, called Item 1. They also show that the percolation probability grows at least linearly beyond p_c, specifically (called Item 2)

\theta(p)\ge \frac{p-p_c}{p(1-p_c)}.

The proof here proceeds by considering the function of subsets S which contain the origin:

\varphi_p(S):= p\sum_{(x,y)\in\Delta S} \mathbb{P}_p[0\stackrel{S}{\leftrightarrow} x],\quad S\subset \mathbb{Z}^d.

In words, this gives the expected number of edges across the boundary which are connected to the origin by a path within S. So this gives a measure of how likely we are to escape S, and in particular, an upper bound on the probability that an open path exists from 0 to outside S. The authors then define the alternative critical probability

\tilde p_c := \sup_\{p\text{ s.t. }\exists\text{ finite }0\in S\text{ with }\varphi_p(S)<1\}.

They will show that \tilde p_c satisfies the statements of both Item 1 and Item 2. Item 2 for \tilde p_c implies \tilde p_c\le p_c, and Item 1 for \tilde p_c implies p_c\le \tilde p_c, so this is exactly what we need.

They show Item 1 first. We consider this set S for which \varphi_p(S)<1, and take some box \Lambda_L which strictly contains S. Now we consider the probability of escaping from a box of size kL. The reason why considering this definition of S works really nicely is that it makes it possible to split this event of escaping from \Lambda_{kL} into an event involving subjects of various disjoint sets of edges being open, so we can use independence.

We decompose the path from 0 to \partial\Lambda_{kL} based on the first time it leaves S. We are mindful that there might be lots of paths from from 0 to this boundary. The way we are bounding means it doesn’t matter if we have lots of suitable paths, but they should all spend a maximal number of steps in S, in the sense that whenever the path re-enters S, say to vertex z, there is no open path from 0 to z contained in S. Let the vertex on \partial S we leave from for the first time be x. Then, for all vertices y later in the path, 0\stackrel{S}{\not\leftrightarrow}y.

So under any suitable path, now take y to be the vertex directly following x, hence (x,y)\in\Delta S. If we take \mathcal{C} to be the set of vertices z for which 0\stackrel{S}{\leftrightarrow}z, we can split the expression based on S to obtain:

\mathbb{P}_p[0\leftrightarrow \partial \Lambda_{kL}]\le p \sum_{(x,y)\in\Delta S}\sum_{C\subset S} \mathbb{P}_p[0\stackrel{S}{\leftrightarrow} x,\mathcal{C}=C] \mathbb{P}_p[y\stackrel{\mathbb{Z}^d\backslash C}{\leftrightarrow}\partial\Lambda_{kL}].

Splitting based on C gives us independence between all of these sets of edges, but then we immediately forget about it. Irrespective of choice of y (recall, y\in S\subset \Lambda_L), this final probability is definitely bounded by \mathbb{P}_p[0\leftrightarrow \partial \Lambda_{(k-1)L}], while the p and the first term can be summed over C to give \varphi_p(S). They obtain:

\mathbb{P}_p[0\leftrightarrow \partial \Lambda_{kL}] \le \varphi_p(S)\mathbb{P}_p[y\leftrightarrow \partial \Lambda_{(k-1)L}] \le \varphi_p(S)^{k-1},

where the final relation holds by induction, and clearly gives exponential decay as required.

For Item 2 we use Russo’s formula. Here we have a slightly simpler example than the most general version, since the event under consideration A_n=\{0\leftrightarrow \partial\Lambda_n\} is increasing with respect to adding edges. It is also a function of a finite number of edges. Then we can consider \frac{d}{dp}\mathbb{P}_p[A] under the coupling which adds each edge independently as a Poisson process with (locally) rate \frac{1}{1-t}. (We take this to be the rate to avoid having to reparameterise exponentially between time and probability. Here t=p.)

Just for ease, we only consider the right-derivative at p. Then with \mathbb{P} as the law of the coupled process:

\frac{d}{dp}\mathbb{P}_p[A] \approx \frac{1}{\epsilon} \mathbb{P}[A\text{ holds at }p+\epsilon,\text{ but not at }p]

= \frac{1}{\epsilon} \sum_{e\in E}\mathbb{P}[e\text{ added between }p,p+\epsilon\text{ and }e\text{ completes event }A]

+ \frac{1}{\epsilon}\mathbb{P}[\text{two or more edges relevant to }A\text{ added}].

Since the number of edges whose states determine A is finite, this second term vanishes as \epsilon\rightarrow 0. So

=\frac{1}{\epsilon}\sum \frac{\epsilon}{1-p} \mathbb{P}(A \text{ doesn't hold at p, and edge }e\text{ pivotal at p}).

Taking the limit in \epsilon in this example gives

\frac{d}{dp}\mathbb{P}_p[0\leftrightarrow \partial\Lambda_n] = \frac{1}{1-p} \sum_{e\in\Lambda_n} \mathbb{P}_p[e\text{ pivotal, }0\not\leftrightarrow \partial \Lambda_n].

The argument then proceeds in a similar way to Item 1, decomposing \{0\not\leftrightarrow \partial \Lambda_n\} by conditioning on the set of vertices \mathcal{S} from which it is not possible to get to \partial \Lambda_n. In particular, this set is an excellent candidate to view as S, since on this event it contains 0 by definition. Once we have specified \mathcal{S} we know which edges might be pivotal, namely those across the boundary of \mathcal{S}. Crucially, the event \{\mathcal{S}=S\} only depends on those edges between the boundary of S and \partial \Lambda_n, so is independent of the event \{0\stackrel{S}{\leftrightarrow}x\} whenever x\in\partial \mathcal{S}. So applying this version of Russo gives

\frac{d}{dp}\mathbb{P}_p[0\leftrightarrow \partial\Lambda_n] = \frac{1}{1-p}\sum_{0\in S\subset \Lambda_n} \sum_{(x,y)\in\Delta S} \mathbb{P}_p[0\stackrel{S}{\leftrightarrow} x]\mathbb{P}_p[\mathcal{S}=S].

It is clear where \varphi_p(S) might turn up within the sum (after removing a factor of p), so for a bound we can take \inf_S \varphi_p(S) outside the sum, and arrive at

\frac{d}{dp}\mathbb{P}_p[0\leftrightarrow \partial\Lambda_n] \ge \frac{1}{p(1-p)}\inf_{0\in S\subset \Lambda_n} (1-\mathbb{P}_p[0\leftrightarrow \partial \Lambda_n].

It wasn’t immediately clear to me immediately that this implied the given form of Item 2 (though it certainly is consistent). I think perhaps I was trying to be too complicated and thinking about Gronwall’s lemma when in fact everything really follows from bounding \inf \varphi_p(S) below by 1 (since we have assumed p>\tilde p_c here), then integrating the differential inequality

\frac{d}{dp}\left[ \frac{p}{1-p}f(p) \right] = \frac{p}{1-p}f'(p)+\frac{1}{(1-p)^2}f(p) \ge \frac{1}{(1-p)^2}.

I include this not because it’s an interesting part of the argument (I don’t think it is really) but because I struggled to digest it first time round.

What is interesting is how well this choice to consider \varphi_p(S) works out. In both parts of the argument, sets which work well for splitting the crossing probabilities into disjoint edge events mesh nicely with considering this function after conditioning on sub- or super-criticality with respect to \tilde p_c.

Preferential Attachment Models

I’ve just read a really interesting paper by Peter Morters and Maren Eckhoff that made me feel I should look up some of the background and write a quick post. I may get onto some of the results in the paper at the end of this post, but I want to start by saying a bit about the model itself. I’ve spoken about this briefly in a previous post about several descriptions of complex networks, but I think it’s worth having a second attempt.

We seek a model for random graphs that gives a distribution which exhibits some of the properties of the sort of complex networks seen in the real world. In particular, whereas the degree distribution is Poisson, and so concentrated with exponential tails for the Erdos-Renyi random graph, data indicates that a better model for most applications would have power law tails for this degree distribution.

Albert and Barabasi propose growing such a graph via a so-called preferential attachment scheme. We start with some small possibly empty graph, and add new vertices one at a time. For each new vertex, we add exactly M edges between the new vertex and the vertices already present. The choice of these M other vertices is given by weighting by the degree of the (pre-existing) vertices. That is, vertices with large degree are more likely to be joined to new vertices. This is obviously designed to replicate some of the behaviour seen in say the formation of the internet, where new sites are more likely to link to established and popular sites (Google, Youtube and so on) than a uniformly chosen site.

This model has a couple of problems. Firstly, it is not immediately obvious how to start it. Obviously we need M vertices present for the PA dynamics to start working. In fact, whether one starts with a empty graph or a complete graph on M vertices makes little difference to the large n behaviour. Trickier is the question of multiple edges, which may emerge if we define the PA dynamics in the natural way, that is for each of the M edges in turn. Overcoming this is likely to be annoying.

Bollobas and Riordan do indeed overcome this possible problems in a formal way, and prove that a version of this model does indeed have power law decay of the degree distribution, with exponent equal to 3. The model in the paper instead joins new vertex (n+1) to old vertex m with probability:

\frac{f(\text{in-degree of n})}{n},

where f is some function, which for now we assume has the form f(k)=\gamma k+\beta. Since the vertices are constructed one at a time, it is well-defined to orient these edges from new to old vertices, hence this notion of in-degree makes sense.

It was not obvious to me that this model was more general than the Bollobas/Riordan model, but we will explain this in a little while. First I want to explain why the Bollobas/Riordan model has power law tails, and how one goes about finding the exponent of this decay, since this was presented as obvious in most of the texts I read yet is definitely an important little calculation.

So let’s begin with the Bollobas/Riordan model. It makes sense to think of the process in terms of time t, so there are t – M vertices in the graph. But if t is large, this is essentially equal to t. We want to track the evolution of the degree of some fixed vertex v_i, the ith vertex to be formed. Say this degree is d(t) at time t. Then the total number of edges in the graph at time t is roughly tM. Therefore, the probability that a new vertex gets joined to vertex v is roughly \frac{Md}{2Mt}, where the M appears in the numerator because there are M fresh edges available. Note that we have ignored the possibility of trying to connect multiple edges from the new vertex to v, so this holds provided d is substantially smaller than t. With the boundary condition d(i)=M, this leads to the simple ODE

\dot{d}=\frac{d}{2t}\quad \Rightarrow\quad d=M(\frac{t}{i})^{1/2}.

To me at least it was not immediately clear why this implied that the tail of the degree distribution had exponent 3. The calculation works as follows. Let D be the degree of a vertex at large time t, chosen uniformly at random.

d_i\propto (\frac{t}{i})^{1/2}

\Rightarrow\quad \mathbb{P}(D\geq d)=\frac{1}{t}|\{i:(\frac{t}{i})^{1/2}\geq d\}|=\frac{1}{t}|\{i:i\leq \frac{t}{d^2}\}|=\frac{1}{d^2}

Now we consider the Eckhoff / Morters model. The main difference here is that instead of assuming that each new vertex comes with a fixed number of edges, instead the new vertex joins to each existing vertex independently with probability proportional to the degree of the existing vertex. More precisely, they assume that edges are directed from new vertices to old vertices, and then each new vertex n+1 is joined to vertex m<n+1 with probability \frac{f(\text{indegree of }m\text{ at time }n)}{n}\wedge 1, where f(k)=\gamma k +\beta, for \gamma\in[0,1), \beta>0.

I was stuck for a long time before I read carefully enough the assertion that \beta>0. Of course, if this doesn’t hold, then the graph won’t grow fast enough. For, since the function f is now linear, we can lift the statement about evolution of the degree of a vertex to a statement about the evolution of the total number of edges. Note that each edge contributes exactly one to the total number of in-degrees. So we obtain

\dot{E}=\frac{\gamma E}{t}\quad\Rightarrow E(t)\propto t^\gamma.

In particular, this is much less than t, so the majority of vertices have small degree. The answer is fairly clear in fact: since the preferential attachment mechanism depends only on in-degree, then if f(0)=0, since the in-degree of a new vertex will always be zero by construction, there is no way to get an additional edge to that vertex. So all the edges in the graph for large t will be incident to a vertex that had positive in-degree in the time 0 configuration. Hence we need \beta>0 for the model to be meaningful. Note that this means we effectively have a Erdos-Renyi type mechanism AND a preferential attachment evolution. As, for each new vertex, we add roughly \beta edges to existing vertices chosen uniformly at random (rather than by a PA method) and also some assigned via PA. A previous paper by Dereich and Morters shows that the asymptotic degree distribution has a power law tail with exponent

\tau:=\frac{\gamma+1}{\gamma}.

Note that \gamma=\frac12 gives the same exponent (3) as the Bollobas / Riordan model.

We can apply a similar ODE approximation as above to estimate the likely large time behaviour of the number of edges:

E'=\frac{\gamma E + \beta t}{t}.

So since E'\geq \beta, we have E\geq \beta t so defining F to be E(t)/t, we get:

tF'(t)=\beta-(1-\gamma)F(t)        (1)

Noting that F’ is positive when F< \frac{\beta}{1-\gamma} and negative when F>\frac{\beta}{1-\gamma} suggests that for large t, this is an equilibrium point for F and hence E(t)\approx \frac{\beta t}{1-\gamma}. Obviously, this is highly non-rigorous, as F’ can be very small and still satisfy the relation (1), so it is not clear that the ‘equilibrium’ for F is stable. Furthermore, one needs to check that the binomial variables that supply the randomness to this model are sufficiently concentrated that this approximation by expectation is reasonable.

Nonetheless, as a heuristic this is not completely unsatisfactory, and it leads to the conclusion that E(t) is a linear function of t, and so the distribution of the out-degrees for vertices formed at large times t is asymptotically Poisson, with parameter

\lambda =\frac{\beta\gamma}{1-\gamma}+\beta=\frac{\beta}{1-\gamma}.

Note that this is the same situation as in Erdos-Renyi. In particular, it shows that all the power tail behaviour comes from the in-degrees. In a way this is unsurprising, as these evolve in time, whereas the out-degree of vertex t does not change after time t. Dereich and Morters formalise this heuristic with martingale analysis.

The reason we are interested in this type of model is that it better reflects models seen in real life. Some of these networks are organic, and so there it is natural to consider some form of random destructive mechanism, for example lightning, that kills a vertex and all its edges. We have to compare this sort of mechanism, which chooses a vertex uniformly at random, against a targeted attack, which deletes the vertices with largest degree. Note that in Erdos-Renyi, the largest degree is not much larger than the size of the typical degree, because the degree distribution is asymptotically Poisson. We might imagine that this is not the case in some natural networks. For example, if one wanted to destroy the UK power network, it would make more sense to target a small number of sub-stations serving large cities, than, say, some individual houses. However, a random attack on a single vertex is unlikely to make much difference, since the most likely outcome by far is that we lose only a single house etc.

In Eckhoff / Morters’ model, the oldest vertices are by construction have roughly the largest degree, so it is clear what targeting the most significant \epsilon n vertices means. They then show that these vertices include all the vertices that give the power law behaviour. In particular, if you remove all of these vertices and, obviously, the edges incident to them, you are left with a graph with exponential tail in the asymptotic degree distribution, with largest degree on the order of log n. It was shown in a previous paper that this type of network is not vulnerable to random removal of nodes. Perhaps most interestingly, these authors now prove that after removing the most significant \epsilon n vertices, the network IS now vulnerable to random removal of nodes, leading to the conclusion that it is preferable to experience a random attack followed by a targeted attack than vice versa!

In a future (possibly distant) post, I want to say some slightly more concrete things about how these processes link to combinatorial stochastic processes I understand slightly better, in particular urn models. I might also discuss the configuration model, an alternative approach to generating complex random networks.

Exponentials kill Polynomials

I gave my second session at the UK IMO training and selection camp at Trinity Cambridge earlier today. This one to a group of the more experienced students on the subject of polynomials. This always feels like a tricky topic to present to olympiad students. I always felt that there were lots of useful connections between roots and coefficients, but it was hard to get a handle on exactly what sort of relationship would be useful for each question. Perhaps the main problem is that any of the natural interesting things to talk about lie annoyingly on the fringes of complex analysis or abstract algebra. Or, at any rate, are best explained in that language, which isn’t particularly suitable at this stage when there’s only an hour and a half to play with.

One problem I was particularly keen for the students to attempt was a proof that exponential functions always grow faster than polynomials. I think this is a good problem to think about because it is so useful in all sorts of areas. In probability for example, polynomial decay and exponential decay are the two regimes generally discussed for the tail behaviour of distributions of random variables, and all sorts of things are qualitatively different in the two cases. It is also often a useful step in a proof when we need very crude bounds on function.

Anyway, how to prove it? Well the first stage is to prove that a polynomial of degree n+1 dominates any polynomial of strictly smaller degree. I am writing ‘dominate’ to mean, ‘is eventually larger than’, under the assumption that the leading coefficients are always positive. (As this seems easier than sticking modulus signs everywhere.)

This isn’t too hard. If we take

P(x)=a_nx^n+\ldots+a_1x+a_0,

then for any x>|a_n|+\ldots+|a_0|, we must have x^{n+1}>P(x) eventually.

Now we introduce the exponential function. In most applications, it turns out to be most natural to use e^x, but for students who haven’t even necessarily done AS-levels I wasn’t happy using a concept whose definition might be rather unfamiliar.

If you are happy with the Taylor series definition

e^x:=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\ldots

then the result is not too challenging. Given a polynomial P of degree k with positive leading coefficient, by the previous result we have that eventually

P(x)<1+x+\frac{x^2}{2!}+\ldots+\frac{x^k}{k!}+\frac{x^{k+1}}{(k+1)!}\leq e^x.

Although the students were able to follow this proof, they were happier thinking about P(n)<2^n. Obviously, we could replace x by n log 2 in the previous argument, but I was pleased with the following direct proof. Ironically, this has much more of the flavour of analysis than the above.

First we can show by induction that n<2^{n/2} for n > 4. It makes sense to take the broader induction hypothesis 4<n<2^{n/2}, and then show that in this range “adding 1 gives you a smaller answer than multiplying by the square root of two”.

From the initial result about polynomials dominating smaller degree polynomials, it suffices to prove the result for P(x)=x^k, for some k, rather than arbitrary polynomials. Now we can proof this by induction on k, the degree of P. We can prove the base case k = 1 via the previous paragraph.

If n^k<2^n eventually, then (4n)^k<2^{n+2k}, so by changing variables, n^k<2^{n/4+2k} eventually, which is in turn < 2^{n/2}. So, eventually

n^{k+1}<2^{n/2}\cdot 2^{n/2}=2^{n}.

Beyond Erdos-Renyi: more realistic models of networks

The claim is often made that the study of random graphs such as the Erdos-Renyi model is worthwhile because it gives us information about complex systems which exist in the real world. The internet or social networks provide the example du jour at the moment, but it’s equally plausible to think about traffic flows, electrical systems or interacting biological processes too.

If this were entirely true, it would be great for two reasons. Firstly, in my opinion at least, it is a beautiful subject in its own right, and to have a concrete applicable reason to continue studying it would make it even better. (Not to mention the dreaded competition for funding…) Secondly, Erdos-Renyi is so simple. After all, it involves little more than adding some simple topology to a collection of IID Bernoulli random variables, and so it would surely be possible to draw some significant conclusions about how complicated real-world objects interact without too much mathematical effort.

Unfortunately, but unsurprising, this simplicity is a drawback as far as applications go. It is fairly clear that most real-world systems cannot offer any property even approaching the niceness of the independent, same probability edges condition. But rather than consign E-R to the ‘pretty but useless’ category of mathematical structures, we should think carefully about exactly why it fails to be a good model for real-world networks, and see whether there are any small adjustments that could be made to improve it.

This is something I’ve been meaning to read up about for ages and ages. What follows is based heavily on the Albert and Barabasi 2002 review paper. I suspect that many of the open problems and intuitive calculations have since been finished and formalised, but for an overview I hope that doesn’t matter hugely. I’ve also leafed through the relevant chapters of Remco Van der Hofstad’s notes, but am setting the details and the exercises aside for the holidays when I have a bit more time!

Problems with Erdos-Renyi

Recall that G(n,p) takes n vertices, and adds edges between any pair of vertices independently with probability p.

One property shared by most real-world networks is the scale-free phenomenon, which says that the degree distribution has a power law tail. The Albert-Barabasi papers gives a comprehensive survey of data verifying this claim. By contrast, G(n,p) has degree distribution which is approximately Poisson as n grows. This is concentrated near the average degree with a thin exponential tail, so does not satisfy this requirement. I was and still am a bit confused by the term ‘scale-free’. The idea is certainly that the local structure is independent of the size of the system, which seems to be true for the degree distributions in sparse ER, that is where p = O(1/n). But I think the correct heuristic is that it doesn’t matter how far zoomed in you are – the macroscopic structure looks similar for n vertices as for n^2 vertices. This certainly fails to be true for ER, where no vertex has O(n) neighbours, whereas with a power law tail, this does hold.

The main consequence of this is that there are a few vertices with very high degree. These are often called ‘hubs’ and parallels are drawn to the internet, where key websites and servers connect lots of traffic and pages from different areas. The idea is that the hubs are almost certainly well-connected to each other, and this offers a step towards a small-world phenomenon, where the shortest path between any two vertices is very small relative to the size of the system. This notion was introduced to mainstream culture by Stanley Milgram’s ‘Six degrees of separation’ experiment in the 60s, where it became clear that subjects were able to deliver a package to a complete stranger on the other side of the USA, using only personal contacts, in about six stages. The graph theoretic notion for this is the diameter, defined as the maximal graph distance between two points. Here, the graph distance means the length of the shortest path between the points. This definition, with the max-min formalism looks rather complicated, but isn’t really. The diameter of an Erdos-Renyi graph for fixed p, increases like log n, which is small relative to n, and so this property holds.

A quick glance at your list of Facebook friends will confirm that the independent edges condition in an Erdos-Renyi random graph is not a plausible model for social networks. How many friends do you have? Let’s say about 1000, more to make the calculation easier than because you’re necessarily very popular. How many does your friend Tom have? Let’s say 1000 again. As was in the news a few months ago, there are now over a billion people on Facebook. Let’s say exactly a billion (that is 10^9 for these purposes). So both you and Tom are friends with 1/10^6 of the total membership of the network. So how large would you expect the overlap of your friendships to be, if they were all chosen independently at random? Well, the probability that you are both friends with Alice is 10^-12, and so the expected number of your mutual friends is 10^-12 x 10^9 = 10^-3 which is substantially less than 1. Yet I imagine if you substituted names suitably, you and Tom might well have over 50 mutual friends if you were, say, in the same year at school or niversity and haven’t yet purged your list.

We want a statistic that records this idea quantitatively. There are various candidates for such a clustering coefficient. The underlying notion that we might expect there to be greater connectivity between neighbours of some fixed point v than in the graph as a whole gives an intuition for a possible definition. Compare the proportion of triangles in the graph to the cube of the proportion of edges. When this ratio is large, then there is a lot of clustering. In the E-R case, we would expect these to be equal, as the probability of forming a triangle is equal to the cube of the probability of the presence of each of the three independent edges that make up the triangle.

So we have three properties of real networks that we would like to incorporate into a model: small diameter, power-law degree distribution, and high clustering. To avoid this turning into a book, I’m going to write a paragraph about each of the possibilities discussed by Albert and Barabasi.

Generalised Random Graph

The degree distribution will typically emerge as a consequence of the construction of a given model. The general idea here is to condition on the degree distribution having the form we want, and see what this does to the structure. Of course, the choice of how to do this conditioning is absolutely key. It certainly isn’t obvious what it means to ‘condition G(n,p) to have power-law distribution’, since the very idea of a power-law vs exponential tail requires the number of vertices to be large.

The first idea for achieving this gives the vertices ‘stubs’, which join up in pairs to form edges. We decide on the distribution of stubs according to this power law, then pair them up uniformly at random. Obviously, there is a possibility of getting some loops, but this is not going to happen so often as to be a genuine problem in the limit. This construction is similarly open to the branching process exploration ideas well covered for the E-R random graph, though we have to be careful to size-bias the degree distributions when necessary. There is still an underlying independence in the location of edges though, so it is reasonably clear that the amount of clustering may be closer to E-R than to the real examples cited.

The other possibility suggested is to retain the independent edge property, but give the vertices weights, and let the probability of an edge between two vertices be some sensible function of the weights. In the end it turns out to make little difference whether the weights are chosen deterministically or randomly, but by taking the weights i.i.d. with infinite mean, we can generate a so-called generalised random graph where the degree distribution has a power law.

Watts-Strogatz

In the WS model, the idea is to interpolate between a graph with maximal clustering and a random graph. A d-regular graph, say on a ring, where every vertex is connected to its d nearest neighbours has high clustering, but large diameter, as for example it takes roughly n/2d steps to get to the other side of the ring. Whereas in the standard E-R model we add edges with some fixed probability p, here we replace edges with some fixed probability p. That is, we take an edge in the regular graph and with some small probability we remove it and instead add an edge between two vertices chosen uniformly at random. The theoretical motivation is that removing a few edges doesn’t destroy the high clustering evident in the regular graph, but even a sparse random graph has small diameter, so adding a few ‘long-range’ edges should be enough to decrease the diameter significantly.

It obviously needs to be checked that a substantial drop in diameter occurs before a substantial decrease in clustering, and there is a calculation and diagram to support this intuitive idea in the paper. The one drawback of this model is that it fails to provide the power-law degree distributions we want. After all, an E-R graph has a concentrated degree distribution, and a d-regular graph has all degrees the same, so we would expect some interpolation between the two to have a concentrated distribution as well. Nonetheless, this model accords well with an idea of how complex networks might form, particularly if there is some underlying geometry. It is reasonable to assume that an initial setup for a network would be that people are connected to those closest to them, and then slowly acquire distant contacts as time progresses.

Preferential Attachment – Barabasi-Albert model

Most of our intuition for networks can be extended to an intuition for the formation of networks. The idea of prescribing a degree distribution is neat, but it doesn’t give any account to the mechanism of formation. Complexity emerges over time, and a good model should be able to describe why this happens. The Barabasi-Albert model takes this as its starting point, with the aim of producing a highly clustered system dynamically. Recall that we can describe G(n,p) as a process by coupling, then increasing p from 0 to 1, and seeing edges emerge. The independence assumption can be lifted through the coupling, and so which edge appears next is independent of the current state of the system.

This is what we need to relax. Recall the motivating idea of ‘hubs’, where a small collection of vertices have very high connectivity across the whole system, as observed in several real situations. A consequence of this is that new edge is more likely to be attached to a hub, than to a pair of poorly connected vertex elsewhere. But it turns out that this idea of preferential attachment isn’t enough by itself. Because as a network forms, it is not just the connectivity that increases, but also the size of the system itself. So in fact it makes sense to add vertices rather than edges, and join the new vertices to existing vertices in proportion to the degrees of the existing vertices. This combination of growth and preferential attachment is key to the scale-free graphs that this Barabasi-Albert model generates. Relaxing either mechanism returns us to the case of exponential tails. However, there are methods in the literature for generating such graphs without the need for a dynamic model, but they are harder to understand and describe. None I have seen so far has a high clustering coefficient.

Hubs are effectively a way to reduce the diameter. Recall the description of Milgram’s experiment where he encouraged randomly chosen people to send a package to Harvard. For the purposes of this model, an undergraduate from Wyoming or a husband from Alabama moving in with his wife in Boston are clear hubs, as for very many people near their previous home, they represent a good connection to Harvard. So it is unsurprising that BA, which reinforces hubs, has a sub-logarithmic small diameter.

Conclusions

I’m not entirely what conclusions I should draw from my reading. Probably the main one is that I should read more as there is plenty of interesting stuff going on in this area. Intuitively, it seems unlikely that there is going to be a single model which unites the descriptions of all relevant real-world networks. As ever, it is pleasant to find structures that are both mathematically interesting in their own right and relevant to applied problems. So it is reassuring to observe how similar many of the models discussed above are to the standard random graph.

Poisson Tails

I’ve had plenty of ideas for potential probability posts recently, but have been a bit too busy to write any of them up. I guess that’s a good thing in some sense. Anyway, this is a quick remark based on an argument I was thinking about yesterday. It combines Large Deviation theory, which I have spent a lot of time learning about this year, and the Poisson process, which I have spent a bit of time teaching.

Question

Does the Poisson distribution have an exponential tail? I ended up asking this question for two completely independent reasons yesterday. Firstly, I’ve been reading up about some more complex models of random networks. Specifically, the Erdos-Renyi random graph is interesting mathematical structure in its own right, but the independent edge condition results in certain regularity properties which are not seen in many real-world networks. In particular, the degree sequence of real-world networks typically follows an approximate power law. That is, the tail is heavy. This corresponds to our intuition that most networks contain ‘hubs’ which are connected to a large region of the network. Think about key servers or websites like Wikipedia and Google which are linked to by millions of other pages, or the social butterfly who will introduce friends from completely different circles. In any case, this property is not observed in an Erdos-Renyi graph, where the degrees are binomial, and in the sparse situation, rescale in the limit to a Poisson distribution. So, to finalise this observation, we want to be able to prove formally that the Poisson distribution has an exponential (so faster than power-law) tail.

The second occurrence of this question concerns large deviations for the exploration process of a random graph. This is a topic I’ve mentioned elsewhere (here for the exploration process, here for LDs) so I won’t recap extensively now. Anyway, the results we are interested in give estimates for the rate of decay in probability for the event that the path defined by the exploration process differs substantially from the expected path as n grows. A major annoyance in this analysis is the possibility of jumps. A jump occurs if a set of o(n) adjacent underlying random variables (here, the increments in the exploration process) have O(n) sum. A starting point might be to consider whether O(1) adjacent RVs can have O(n) sum, or indeed whether a single Poisson random variable can have sum of order n. In practice, this asks whether the probability \mathbb{P}(X>\alpha n) decays faster than exponentially in n. If it does, then this is dominated on a large deviations scale. If it decays exactly exponentially in n, then we have to consider such jumps in the analysis.

Approach

We can give a precise statement of the probabilities that a Po(\lambda) random variable X returns a given integer value:

\mathbb{P}(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}.

Note that these are the terms in the Taylor expansion of e^{\lambda} appropriately normalised. So, while it looks like it should be possible to evaluate

\mathbb{P}(X>\alpha n)=e^{-\lambda}\sum_{\alpha n}^\infty \frac{\lambda^k}{k!},

this seems impossible to do directly, and it isn’t even especially obvious what a sensible bounding strategy might be.

The problem of estimating the form of the limit in probability of increasing unlikely deviations from expected behaviour surely reminds us of Cramer’s theorem. But this and other LD theory is generally formulated in terms of n random variables displaying some collective deviation, rather than a single random variable, with the size of the deviation growing. But we can transform our problem into that form by appealing to the three equivalent definitions of the Poisson process.

Recall that the Poisson process is the canonical description of, say, an arrivals process, where events in disjoint intervals are independent, and the expected number of arrives in a fixed interval is proportional to the width of the interval, giving a well-defined notion of ‘rate’ as we would want. The two main ways to define the process are: 1) the times between arrivals are given by i.i.d. Exponential RVs with parameter \lambda equal to the rate; and 2) the number of arrivals in interval [s,t] is independent of all other times, and has distribution given by Po(\lambda(t-s)). The fact that this definition gives a well-defined process is not necessarily obvious, but let’s not discuss that further here.

So the key equivalence to be exploited is that the event X>n for X\sim \text{Po}(\lambda) is a statement that there are at least n arrivals by time 1. If we move to the exponential inter-arrival times definition, we can write this as:

\mathbb{P}(Z_1+\ldots+Z_n<1),

where the Z’s are the i.i.d. exponential random variables. But this is exactly what we are able to specify through Cramer’s theorem. Recall that the moment generating function of an exponential distribution is not finite everywhere, but that doesn’t matter as we construct our rate function by taking the supremum over some index t of:

I(x)=\sup_t (xt-\log \mathbb{E}e^{tZ_1})=\sup_t(xt-\log(\frac{\lambda}{\lambda-t})).

A simple calculation then gives

I(x)=\lambda x-1 - \log \lambda x.

\Rightarrow I(x)\uparrow \infty\text{ as }x\downarrow 0.

Note that I(1) is the same for both Exp(\lambda) and Po(\lambda), because of the PP equality of events:

\{Z_1+\ldots+Z_n\leq n\}=\{\text{Po}(\lambda n)=\text{Po}(\lambda)_1+\ldots+\text{Po}(\lambda)_n> n\},

similar to the previous argument. In particular, for all \epsilon>0,

\mathbb{P}(\text{Po}(\lambda)>n)=\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}<\frac{1}{n})<\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}<\epsilon),\text{ for large }n.

\mathbb{P}(\text{Po}(\lambda)>n)=O(e^{-nI(\epsilon)}),\text{ for all }\epsilon.

Since we can take I(\epsilon) as large as we want, we conclude that the probability decays faster than exponentially in n.

Extreme Value Theory

This is something interesting which came up on the first problem sheet for the Part A Statistics course. The second question introduced the Weibull distribution, defined in terms of parameters \alpha,\lambda>0 through the distribution function:

F(x)=\begin{cases}0 & x<0\\ 1-\exp(-(\frac{x}{\lambda})^\alpha) & x\geq 0.\end{cases}

As mentioned in the statement of the question, this distribution is “typically used in industrial reliability studies in situations where failure of a system comprising many similar components occurs when the weakest component fails”. Why could that be? Expressed more theoretically, the lifetimes of various components might reasonably be assumed to behave like i.i.d. random variables in many contexts. Then the failure time of the system is given by the minimum of the constituent random variables.

So this begs the question: what does the distribution of minimum of a collection of i.i.d. random variables look like? First, we need to think why there should be an answer at all. I mean, it would not be unreasonable to assume that this would depend rather strongly on the underlying distribution. But of course, we might say the same thing about sums of i.i.d. random variables, but there is the Central Limit Theorem. Phrased in a way that is deliberately vague, this says that subject to some fairly mild conditions on the underlying distribution (finite variance in this case), the sum of n i.i.d. RVs look like a normal distribution for large n. Here we know what ‘looks like’ means, since we have a notion of a family of normal distributions. Formally, though, we might say that ‘looks like’ means that the image of the distribution under some linear transformation, where the coefficients are possibly functions of n, converges to the distribution N(0,1) as n grows.

The technical term for this is to say that the underlying RV we are considering, which in this case would be X_1+\ldots +X_n) is in the domain of attraction of N(0,1). Note that other distributions in the family of normals are also in the domain of attraction of N(0,1), and vice versa, so this forms an equivalence relation on the space of distributions, though this is not hugely helpful since most interesting statements involve some sort of limit.

Anyway, with that perspective, it is perhaps more reasonable to imagine that the minimum of a collection of i.i.d. RVs might have some limit distribution. Because we typically feel more comfortable thinking about right-tails rather than left-tails of probability distributions, this problem is more often considered for the maximum of i.i.d. RVs. The Fisher-Tippett-Gnedenko theorem, proved in various forms in the first half of the 20th century, asserts that again under mild regularity assumptions, the maximum of such a collection does lie in the domain of attraction of one of a small set of distributions. The Weibull distribution as defined above is one of these. (Note that if we are considering domains of attraction, then scaling x by a constant is of no consequence, so we can drop the parameterisation by \lambda.)

This is considered the first main theorem of Extreme Value Theory, which addresses precisely this sort of problem. It is not hard to consider why this area is of interest. To decide how much liquidity they require, an insurance company needs to know the likely size of the maximum claim during the policy. Similarly, the designer of a sea-wall doesn’t care about the average wave-height – what matters is how strong the once-in-a-century storm which threatens the town might be. A good answer might also explain how to resolve the apparent contradiction that most human characteristics are distributed roughly normally across the population. Normal distributions are unbounded, yet physiological constraints enable us to state with certainty that there will never be twelve foot tall men (or women). In some sense, EVT is a cousin of Large Deviation theory, the difference being that unlikely events in a large family of i.i.d. RVs are considered on a local scale rather than globally. Note that large deviations for Cramer’s theorem in the case where the underlying distribution has a heavy tail are driven by a single very deviant value, rather than by lots of slightly deviant data, so in this case the theories are comparable, though generally analysed from different perspectives.

In fact, we have to consider the reversed Weibull distribution for a maximum, which is supported on (-\infty,0]. This is one of three possibly distribution families for the limit of a maximum. The other two are the Gumbel distribution

F(x)=e^{-e^{-x}},

and the Frechet distribution

F(x)=\exp(-x^{-\alpha}),\quad x>0.

Note that \alpha is a positive parameter in both the Frechet and Gumbel distributions. These three distributions can be expressed as a single one parameter family, the Generalised Extreme Value distribution.

The differences between them lie in the tail behaviour. The reversed Weibull distribution has an actual upper bound, the Gumbel an exponential, fast-decaying tail, and the Frechet a polynomial ‘fat’ tail. It is not completely obvious that these properties are inherited from the original distribution. After all, to get from the original distribution to extreme value distribution, we are taking the maximum, then rescaling and translating in a potentially quite complicated way. However, it is perhaps reasonable to see that the property of the underlying distribution having an upper bound is preserved through this process. Obviously, the bound itself is not preserved – after all, we are free to apply arbitrary linear transformations to the distributions!

In any case, it does turn out to be the case that the U[0,1] distribution has maximum converging to a reversed Weibull; the exponential tails of the Exp(1) and N(0,1) distributions lead to a Gumbel limit; and the fat-tailed Pareto distribution gives the Frechet limit. The calculations are reasonably straightforward, especially once the correct rescaling is known. See this article from Duke for an excellent overview and the details for these examples I have just cited. These notes discuss further properties of these limiting distributions, including the unsurprising fact that their form is preserved under taking the maximum of i.i.d. copies. This is analogous to the fact that the family of normal distributions is preserved under taking arbitrary finite sums.

From a statistical point of view, devising a good statistical test for what class of extreme value distribution a particular set of data obeys is of great interest. Why? Well mainly because of the applications, some of which were suggested above. But also because of the general statistical principle that it is unwise to extrapolate beyond the range of the available data. But that is precisely what we need to do if we are considering extreme values. After all, the designer of that sea-wall can’t necessarily rely on the largest storm in the future being roughly the same as the biggest storm in the past. So because the EVT theorem gives a clear description of the distribution, to find the limiting properties, which is where the truly large extremes might occur, it suffices to find a good test for the form of the limit distribution – that is, which of the three possibilities is relevant, and what the parameter should be. This seems to be fairly hard in general. I didn’t understand much of it, but this paper provided an interesting review.

Anyway, that was something interesting I didn’t know about (for the record, I also now know how to construct a sensible Q-Q plot for the Weibull distribution!), though I am assured that EVT was a core element of the mainstream undergraduate mathematics syllabus forty years ago.

Analytic vs Probabilistic Arguments for a Supercritical BP

This follows on directly from the previous post. I was originally going to talk only about what follows, but I got rather carried away with the branching process account. I was stuck on a particular exercise, and we ended up coming up with two arguments: one analytic and one probabilistic. Since the typical flavour of this blog is to present problems which show the advantage of the probabilistic approach, it seems only fair to remark on this case, where the analytic method was less interesting, but much simpler.

Recall that we have a supercritical random graph G(n,\frac{\lambda}{n}), \lambda>1, and we are considering the rescaled exploration process S_{nt}, which has asymptotic mean \mu_t=1-t-e^{-\lambda t}. We can calculate similarly an expression for the asymptotic variance

\frac{\text{Var}(S_{nt})}{n}\rightarrow v_t=e^{-\lambda t}(1-e^{-\lambda t}).

To use this to verify the result about the size of the giant component, we verify that \mu_{\zeta_\lambda+x/\sqrt{n}} is negative, and has small variance, which would confirm that the giant component has size bounded above by \zeta_\lambda almost surely. A similar argument is required for the lower bound. The variance is a separate matter, but it is therefore necessary that \mu_t should be decreasing at t=\zeta_\lambda, that is \mu_t'=\lambda e^{-\lambda \zeta_\lambda}<0. This is what we try to prove in the remainder of this post. Recall that in the previous post we have checked that it is equal to zero here.

Heuristic Explanation

\mu_t has been rescaled from the original definition of the exploration process in both size and time-scale so some care is needed to see why this should hold in the limit. Remember that all components apart from the giant component are of size O(log n). So immediately after exhausting the giant component, you are likely to be visiting components of size roughly log n. A time interval of dt for \mu corresponds to ndt for S, during which S will visit some components of size log n and some of O(1) and some in between. In particular, some fixed proportion of vertices are isolated, that is, in a component of size 1.

There is then a complicated size-biasing train of thought. A component of size log n is more likely to come up than an isolated vertex, but there are not as many of them. The log n components push the derivative \mu_t' towards zero, because S_t decreases by 1 over a time-interval of length log n, which gives a gradient of zero in the limit. However, the isolated vertices give a gradient of -1, because S_t decreases by 1 over a time interval of 1. Despite the fact that log n intervals are likely to appear earlier, it still remains the case that after exhausting a component (in particular, at time t=\zeta_\lambda, after exhausting the giant component), with some bounded below positive probability you will choose an isolated vertex next. The component size only affects that time-scale if it is O(n), which none of the remaining components are, so the derivative \mu_{\zeta_\lambda}' consists of some complicated weighted mean of 0 and -1. In particular, it is negative.

Analytic solution

Obviously, that won’t do in practice. Suppressing lambdas for ease of notation, the key fact is: e^{-\lambda \zeta}=1-\zeta. We want to show that \lambda e^{-\lambda \zeta}<1. Substituting

\lambda=-\frac{\log(1-\zeta)}{\zeta},

means that it is required to show:

-\frac{1-\zeta}{\zeta}\log(1-\zeta)<1.

Differentiating the left hand side gives:

\frac{\log(1-\zeta)+\zeta}{\zeta^2}<0,

since of course \log(1-\zeta)=\zeta+\frac{\zeta^2}{2}+\frac{\zeta^3}{3}+\dots. So it suffice to check the result for small \zeta. But, again using a Taylor series:

-\frac{1-\zeta}{\zeta}\log(1-\zeta)=1-\frac12\zeta+O(\zeta^2)<1,

for small \zeta. This gives the required result.

Probabilistic Interpretation and Solution

First, we observe that \lambda e^{-\lambda\zeta}=\lambda(1-\zeta) is the expected number of vertices in the first generation of a \text{Po}(\lambda) whose progeny become extinct. This motivates considering the canonical decomposition of a supercritical branching process Z into the skeleton process and the dual process. The skeleton Z^+ consists of all vertices which have infinitely many successors. It is relatively easy to show that this is a branching process with offspring distribution \text{Po}(\lambda\zeta) conditioned on being positive. The dual process Z^* is a G-W branching process with offspring distribution \text{Po}(\lambda) conditioned on dying. This is the same as a branching process with offspring distribution \text{Po}(\lambda(1-\zeta), by a sprinkling argument, which says that if we begin with a Poisson number of things, then remove each one independently with some fixed probability, the remaining number of things is Poisson also.

We can construct the original branching process by

  • With probability \zeta, take the skeleton, and affixe independent copies of Z^* at every vertex in the skeleton.
  • With probability 1-\zeta, just take a copy of Z^*.

It is immediately clear that \lambda(1-\zeta)\leq 1. After all, the dual process is almost surely finite, so the offspring distribution cannot have expectation greater than 1. Checking that this is strong is more fiddly. The best way I have come up with is to examine the tail of the distribution of total population size of the original branching process.

The total population size T of a branching process has an exponential tail if the offspring distribution is subcritical. It isn’t hugely surprising that this behaves like a large deviation for iid RVs, since in the limit such an event requires a lot of the offspring counts to deviate substantially from the mean. The same holds in the supercritical case, with the additional complication that though the finite tail decays exponential, there is positive probability that the total size will be infinite. In the critical case, however, there is a power-law decay. This is not hugely surprising as it marks the threshhold for the appearance of the infinite population, just as in a multiplicative coalescent at time 1, we have a load of very large components just about to form a giant component. The tool for all of these results is Dwass’s Theorem, which says:

\mathbb{P}(T=n)=\frac{1}{n}\mathbb{P}(X_1+\ldots+X_n=n-1),

where X_1 are iid with the offspring distribution. When \mathbb{E}X_1\neq 1, this is a large deviation event, for which Cramer’s theorem applies (assuming, as is the case for the Poisson distribution, that the offspring distribution has finite variance). When, \mathbb{E}X=1, the Central Limit Theorem says that with high probability,

X_1+\ldots+X_n\in [n-n^{3/4},n+n^{3/4}],

so, skating over the details of whether everything is exactly uniform within this CLT scaling window,

\mathbb{P}(T=n)\geq \frac{1}{n}\cdot\frac{1}{2n^{3/4}}.

The true exponent of the power law decay is substantially slower than this, but the above argument works as a back-of-the-envelope bound.

In particular, if the dual process has mean 1, then the population size of the original branching process is given by taking a distribution with exponential tail with some probability and a distribution with power-law tail with some probability. Obviously the power-law will dominate, which contradicts the assumption that the original branching process was supercritical, and so has an exponential tail.