100k Views

When I started this blog, I implicitly made a promise to myself that I would aim for quality of posts rather than quantity of posts. Evaluating quality is hard, whereas one always feels some vague sense of purpose from seeing some measure of output increase. Nonetheless, I feel I have mostly kept this promise, and haven’t written much solely for the sake of getting out another post. This post is something of an exception, since I noticed in the office on Friday that this blog was closing in on 100,000 page views. Not all of these were just me pressing F5 repeatedly. Obviously this is no more relevant an integer to serve as a threshold than any other, and one shouldn’t feel constrained by base 10, but it still feels like a good moment for a quick review.

Here are some interesting statistics over the (3+\epsilon) years of this website’s existence.

  • 175 posts, not including this one, or the three which are currently in draft status. This works out at about 4.7 posts a month. By some margin the most prolific period was May 2012, when I was revising for my Part III exams in Cambridge, and a series of posts about the fiddliest parts of stochastic calculus and network theory seemed a good way to consolidate this work. I’ve learned recently that PhDs are hard, and in fact it’s been a year since I last produced at least five posts in a month, if you discount the series of olympiad reports, which, though enjoyable, don’t exactly require a huge amount of mathematical engagement.
  • By at least one order of magnitude, the most viewed day was 17th August 2014, when several sources simultaneously linked to the third part of my report on IMO 2014 in Cape Town. An interesting point to note is that WordPress counts image views separately from page views, so the rare posts which have a gallery attached count well in a high risk / high return sense. In any case, the analytics record that this day resulted in 2,366 views from 345 visitors. During a typical period, the number of views per visitor fluctuates between roughly 1.5 and 1.8, so clearly uploading as many photos of maths competitions as possible is the cheap route to lots of hits, at least by the WordPress metric.

[Figure: monthly page views since late 2012, from the WordPress stats dashboard]

  • One might well expect the distributions involved in such a setup to follow a power-law. It’s not particularly clear from the above data about views per month since late 2012 whether this holds. One anomalously large data point (corresponding to the interest in IMO 2014) does not indicate a priori a power law tail… In addition, there is a general upward trend. Since a substantial proportion of traffic arrives from Google, one might naively assume that the traffic rate is proportional to the amount of content, which naturally will grow in time, though it seems impractical to test this. One might also expect more recent posts to be more popular, though in practice this seems not to have been the case.
  • The variance in popularity of the various posts has been surprisingly large. At some level, I guess I thought there would be more viewers who browse through lots of things, but such people would probably do so from the home page, so it doesn’t show up as viewing lots of different articles. There is some internal linking between some pages, but not enough to be a major effect.
  • At either end of the spectrum, a post about coupling and the FKG inequality has received only 16 views in 2.5 years, while a guide to understanding the Levy-Khintchine formula has, in slightly less time, racked up 2,182 hits. There are direct reasons for this. Try googling Levy-Khintchine formula and see what comes up. In a sense, this is not enough, since you also need people to be googling the term in question, and picking topics that are hard but not super-hard at a masters level probably maximises interest. But I don’t have a good underlying reason for why some posts should end up being more Google-friendly than others.
  • In particular, quality of article seems at best loosely correlated with number of views. This is perhaps worrying, both for my reputation, and for the future of written media, but we will see. Indeed, two articles, on the Dubins-Schwarz theorem and a short crib sheet for convergence of random variables, both get a regular readership, despite seeming to have been written (in as much as a blog post can be) on the back of an envelope. I also find it amusing that the Dubins-Schwarz theorem is always viewed at the same time of the year, roughly mid-February, as presumably it comes up in the second term of masters courses, just like it did in my own.
  • By contrast, I remain quite pleased with the Levy-Khintchine article. It’s the sort of topic which is perfectly suited to this medium, since most books on Levy processes seem to assume implicitly that their readers will already be familiar with this statement. So it seemed like a worthwhile enterprise to summarise this derivation, and it’s nice to see that others clearly feel the same, and indeed I still find some posts of this flavour useful as revision for myself.

[Figure: log plot of page view counts for articles with at least 200 hits]

  • This seemed like a particularly good data set in which to go hunting for power-laws. I appreciate that taking a print-screen of an Excel chart will horrify many viewers, but never mind. The above plot shows the log of page view counts for those mathematical articles with at least 200 hits. You can see the Levy-Khintchine post as a mild anomaly at the far left. While I haven’t done any serious analysis, this looks fairly convincing.
  • I haven’t engaged particularly seriously in interaction with other blogs and other websites. Perhaps I should have done? I enjoy reading such things, but networking in this fashion seems like a zero-sum game overall except for a few particularly engaged people, even if one gets a pleasing spike in views from a reciprocal tweet somewhere. As a result, the number of comments and outgoing clicks here is essentially negligible.
  • Views from the UK narrowly outnumber views from the US, but at present rates this will be reversed very soon. I imagine if I discounted the olympiad posts, which are sometimes linked from UK social media, this would have happened already.
  • From purely book-keeping curiosity, WordPress currently thinks the following countries (and territories – I’m not sure how the division occurs…) have recorded exactly one view: Aaland Islands, Afghanistan, Belize, Cuba, Djibouti, El Salvador, Fiji, Guatemala, Guernsey, Laos, Martinique, Mozambique, New Caledonia, Tajikistan, US Virgin Islands, and Zambia. Visiting all of those would be a fun post-viva trip…

Conclusion

As I said, we all know that 100,000 is just a number, but taking half an hour to write this has been a good chance to reflect on what I’ve written here in the past three years. People often ask whether I would recommend that they start something similar. My answer is ‘probably yes’, so long as the writer is getting something out of most posts they produce in real time. When writing about anything hard and technical, you have to accept that until you become very famous, interest in what you produce is always going to be quite low, so the satisfaction has to be gained from the process of understanding and digesting the mathematics itself. None of us will be selling the musical rights any time soon.

I have two pieces of advice to anyone in such a position. 1) Wait until you’ve written five posts before publishing any of them. This is a good test of whether you actually want to do it, and you’ll feel much more plausible linking to a website with more than two articles on it. 2) Don’t set monthly post count targets. Tempting though it is to try this to prevent your blog dying, it doesn’t achieve anything in the long term. If you have lots to say, say lots; if you don’t, occasionally saying something worthwhile feels a lot better when you look back on it than producing your target number of articles which later feel underwhelming.

I don’t know whether this will make it to 6+2\epsilon years, but for now, I’m still enjoying the journey through mathematics.


The Configuration Model

In the past, I’ve talked about limitations of the Erdos-Renyi model of homogeneous random graphs for applications in real-world networks. In a previous post, I’ve discussed a dynamic model, the Preferential Attachment mechanism, that ‘grows’ a graph dynamically by adding edges from new vertices preferentially to existing vertices with high degree. The purpose of this adjustment is to ensure that the distribution of the degrees is not concentrated around some fixed value (which would be c in G(n,c/n) ) but rather exhibits a power-law tail such as observed in many genuine examples.

In this post, we introduce some aspects of the configuration model, which achieves this property more directly. This idea probably first arose in the guise of regular graphs. Recall a regular graph has all degrees equal. How would we construct a random d-regular graph on a large number of vertices?

What we probably want to do is to choose uniformly at random from the set of such graphs, but it is not clear even how large this set is, let alone how one would enumerate its elements in order to make this uniform choice. Instead, we try the following. Assign to each vertex d so-called stubs, which will end up being ‘half-edges’. We then choose two stubs uniformly at random, and glue them together. More formally, we construct an edge between the host vertices, and then delete the chosen stubs. We then repeat until no stubs remain.

The construction makes no reference to the distribution of stubs, so we are free to choose this as we please. We could, for example, specify some deterministic sequence of degrees which approximates a power-law, or we could sample a random sequence of degrees in some way. So long as we have a sequence of stub set sizes before we start building the edges of the graph, we will be able to use the above algorithm.
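Here is a minimal sketch of the stub-pairing construction in Python, just to make the algorithm concrete (the function name, the edge-list output and the example degree sequence are all my own illustrative choices, not taken from any reference). Gluing stubs together sequentially, as described above, gives the same distribution as shuffling all the stubs uniformly and reading them off in consecutive pairs, which is what the code does.

```python
import random

def configuration_model(degrees, seed=None):
    """Return the edge list of a random multigraph with the given degree sequence.

    Loops and multiple edges are allowed: this samples a uniform pairing of stubs,
    not a uniform simple graph.
    """
    if sum(degrees) % 2:
        raise ValueError("the degree sum must be even (or delete one stub first)")
    rng = random.Random(seed)
    # one stub per unit of degree, labelled by the vertex it belongs to
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)  # a uniform shuffle, read in consecutive pairs, is a uniform pairing
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

# example: a random 3-regular multigraph on 10 vertices
edges = configuration_model([3] * 10, seed=1)
loops = sum(u == v for u, v in edges)
print(len(edges), "edges, of which", loops, "are loops")
```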

So what might go wrong? There seem to me to be three potential problems that might arise with this construction.

Firstly, there might be a stub left over, if the sum of the stub set sizes is odd. Recall that in a graph the sum of the degrees is twice the number of edges, and so in particular the sum of the degrees should be even. But this is a small problem. When the degree sequence is deterministic we can demand that it have even sum, and if it is random, we will typically be working in a large N regime, and so deleting the solitary stub, if such a thing exists, will not affect the sort of properties of the graph we are likely to be interested in.

The second and third objections are perhaps more serious. If we glue together stubs naively, we might end up with loops, that is, edges that ‘begin’ and ‘end’ at the same vertex. These are not allowed in the standard definition of a graph. Alternatively, we might end up with more than one edge between the same pair of vertices.

Our overall aim is that this mechanism gives a convenient way of simulating the uniform distribution on simple graphs with a given degree sequence. At present we have the uniform distribution on potential multigraphs, with a weighting of 1/k! for every multi-edge with multiplicity k, and a weighting of 1/2 for every loop. The latter can be seen because there is an initial probability proportional to d(v_i)d(v_j) that vertices v_i and v_j will be joined, whereas a probability proportional (with the same constant) to d(v_i)^2 that v_i will receive a loop. The multi-edge weighting justification is similar.

However, conditional on getting a simple graph, the distribution is uniform on the set of simple graphs with that degree sequence. So it remains to investigate the probability that a graph generated in this way is simple. So long as this probability does not tend to 0 as n grows, we will probably be happy.

The strongest results on this topic are due to Janson. First observe that if the sum of the degrees grows faster than the number of vertices n, then with high probability the graph will contain a loop. Heuristically, note that on the first pass, we are taking two picks from the set of vertices, biased by the number of stubs. By Cauchy-Schwarz, the Rearrangement Inequality or just intuition, the probability of picking the same vertex twice is greater than if we picked uniformly from the set of vertices without biasing. So the probability of getting no loop on the first pass is at most 1-\frac{1}{n}. Take some function a(n) that grows faster than n, but slower than the sum of the degrees. Then after a(n) passes, the degree distribution is still roughly the same. In particular, the sum of the degrees is still an order of magnitude greater than n. So we obtain:

\mathbb{P}(\text{no loops})\leq (1-\frac{1}{n})^{a(n)}\approx e^{-\frac{a(n)}{n}}\rightarrow 0.

So, since isolated vertices have no effect on the simplicity or otherwise, we assume the sum of the degrees is \Theta(n). Then, Janson shows that the further condition

\sum_{i=1}^n d_i^2=O(n),

is essentially necessary and sufficient for simplicity. We can see why this might be true by looking at the probability that the first edge added is a loop, which is roughly

\frac{d_1^2+d_2^2+\ldots+d_n^2}{2(\sum d_i)^2}.

We have to consider O(\sum d_i) edges in total, so if the above expression is much larger than \frac{1}{\sum d_i}, the expected number of loops diverges, and we can perform a similar exponential estimate to show that the probability that there are no loops is o(1). The technical part is showing that this probability doesn’t change dramatically as the first few stubs disappear.
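To see what this says in the simplest concrete case (a quick check of my own, not taken from Janson’s paper), consider a d-regular degree sequence with d fixed. Then

\sum_{i=1}^n d_i^2=nd^2=O(n),

and the expression above for the probability that a given edge is a loop is roughly \frac{nd^2}{2(nd)^2}=\frac{1}{2n}. Since we build about \frac{nd}{2} edges, the expected number of loops is around \frac{d}{4}, which stays bounded as n\rightarrow\infty, consistent with the probability of simplicity remaining bounded away from zero in this regime.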

Note that in both of the regimes above, considering only loops is enough to settle the question of simplicity. Although it looks like the appearance of loops is a weaker event than the appearance of multiple edges, in fact they have the same threshold. It should also be pointed out that, as with uniform random forests, an alternative approach is simply to count the number of simple graphs and multigraphs with a given degree sequence. Good asymptotics can then be found for the probability of simplicity.

In the case of G(n,c/n), we were particularly interested in the emergence of the giant component at time c=1. While first-moment methods can be very effective in demonstrating such results, a branching process local limit representation is probably the easiest heuristic for this phase transition.

So long as the degree sequences converge in a natural way, we can apply a similar approach to this configuration model. Concretely, we assume that the proportion of vertices with degree i converges to \lambda_i in the limit. Although the algebra might push through, we should be aware that this means we are not explicitly controlling how many vertices have degree growing with n, for example \Theta(n^{1/2}). For now assume the \lambda_i sum to 1, so that they specify the probability distribution of the degree of a vertex chosen uniformly at random.

So we start at a vertex, and look at its neighbours. The expected number of neighbours of this root vertex is \sum i\lambda_i. Thereafter, when we consider a child vertex, based on how the stubs are paired up (and in particular the fact that the order of the operations does not matter – the partner of a given stub is chosen uniformly at random), we are really choosing a stub uniformly at random. This corresponds to choosing a vertex at random, biased by the number of stubs available. The quantity of interest is how many additional stubs (other than the one that led to the vertex) are attached to this vertex. We assume we don’t need to worry too much about repeating vertices, in a similar way to G(n,c/n). So the expected number of additional stubs is

\frac{1}{\sum i\lambda_i}\sum i\lambda_i(i-1).

For an infinite component, we required the expectation to be > 1, which is equivalent to

\sum \lambda_i i(i-2)>0.

This was proven by Molloy and Reed (95), then with fewer conditions by Janson (07). The latter also shows how to use this construction to derive the giant component result for G(n,c/n).
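As a sanity check on this criterion (a standard observation rather than anything specific to these papers), suppose the limiting degree distribution is Poisson with mean c, which is the degree distribution arising from G(n,c/n). Writing D for a random variable with this distribution,

\sum \lambda_i i(i-2)=\mathbb{E}[D(D-2)]=\mathbb{E}[D^2]-2\mathbb{E}[D]=(c+c^2)-2c=c(c-1),

which is positive precisely when c>1, recovering the classical threshold for the emergence of the giant component in G(n,c/n).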

REFERENCES

Janson – A New Approach to the Giant Component Problem

Molloy, Reed – A Critical Point for Random Graphs with a Given Degree Sequence

Janson – The Probability that a Random Multigraph is Simple

Preferential Attachment Models

I’ve just read a really interesting paper by Peter Morters and Maren Eckhoff that made me feel I should look up some of the background and write a quick post. I may get onto some of the results in the paper at the end of this post, but I want to start by saying a bit about the model itself. I’ve spoken about this briefly in a previous post about several descriptions of complex networks, but I think it’s worth having a second attempt.

We seek a model for random graphs that gives a distribution which exhibits some of the properties of the sort of complex networks seen in the real world. In particular, whereas the degree distribution of the Erdos-Renyi random graph is asymptotically Poisson, and so concentrated with exponential tails, data indicates that a better model for most applications would have power law tails for this degree distribution.

Albert and Barabasi propose growing such a graph via a so-called preferential attachment scheme. We start with some small possibly empty graph, and add new vertices one at a time. For each new vertex, we add exactly M edges between the new vertex and the vertices already present. The choice of these M other vertices is given by weighting by the degree of the (pre-existing) vertices. That is, vertices with large degree are more likely to be joined to new vertices. This is obviously designed to replicate some of the behaviour seen in say the formation of the internet, where new sites are more likely to link to established and popular sites (Google, Youtube and so on) than a uniformly chosen site.
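A minimal sketch of this growth rule in Python may help fix ideas. Everything here is an illustrative choice of mine rather than the precise construction in any of the papers: the seed graph is a path on the first M vertices, and multiple edges from a new vertex are avoided by sampling its M targets as a set.

```python
import random

def preferential_attachment(n, M, seed=None):
    """Grow a graph on n vertices: each new vertex sends M edges to existing
    vertices chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    edges = [(v - 1, v) for v in range(1, M)]   # seed graph: a path on vertices 0,...,M-1
    # attachment list: each vertex appears once per unit of degree, so a uniform
    # choice from this list is a degree-biased choice of vertex
    attach = [v for e in edges for v in e]
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    for new in range(M, n):
        targets = set()
        while len(targets) < M:
            # fall back to a uniform choice if no stubs exist yet (only relevant when M=1)
            targets.add(rng.choice(attach) if attach else rng.randrange(new))
        for t in targets:
            edges.append((new, t))
            degree[t] += 1
        degree[new] = M
        attach.extend(targets)
        attach.extend([new] * M)
    return edges, degree

edges, degree = preferential_attachment(10_000, 3, seed=0)
print("largest degrees:", sorted(degree)[-5:])   # a few large 'hub' degrees should appear
```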

This model has a couple of problems. Firstly, it is not immediately obvious how to start it. Obviously we need M vertices present for the PA dynamics to start working. In fact, whether one starts with an empty graph or a complete graph on M vertices makes little difference to the large n behaviour. Trickier is the question of multiple edges, which may emerge if we define the PA dynamics in the natural way, that is for each of the M edges in turn. Overcoming this is likely to be annoying.

Bollobas and Riordan do indeed overcome these possible problems in a formal way, and prove that a version of this model does indeed have power law decay of the degree distribution, with exponent equal to 3. The model in the Eckhoff / Morters paper instead joins new vertex (n+1) to old vertex m with probability:

\frac{f(\text{in-degree of }m\text{ at time }n)}{n},

where f is some function, which for now we assume has the form f(k)=\gamma k+\beta. Since the vertices are constructed one at a time, it is well-defined to orient these edges from new to old vertices, hence this notion of in-degree makes sense.

It was not obvious to me that this model was more general than the Bollobas/Riordan model, but we will explain this in a little while. First I want to explain why the Bollobas/Riordan model has power law tails, and how one goes about finding the exponent of this decay, since this was presented as obvious in most of the texts I read yet is definitely an important little calculation.

So let’s begin with the Bollobas/Riordan model. It makes sense to think of the process in terms of time t, so there are t - M vertices in the graph. But if t is large, this is essentially equal to t. We want to track the evolution of the degree of some fixed vertex v_i, the ith vertex to be formed. Say this degree is d(t) at time t. Then the total number of edges in the graph at time t is roughly tM. Therefore, the probability that a new vertex gets joined to vertex v_i is roughly \frac{Md}{2Mt}=\frac{d}{2t}, where the M appears in the numerator because there are M fresh edges available, and the 2Mt in the denominator is the total degree. Note that we have ignored the possibility of trying to connect multiple edges from the new vertex to v_i, so this holds provided d is substantially smaller than t. With the boundary condition d(i)=M, this leads to the simple ODE

\dot{d}=\frac{d}{2t}\quad \Rightarrow\quad d=M(\frac{t}{i})^{1/2}.

To me at least it was not immediately clear why this implied that the tail of the degree distribution had exponent 3. The calculation works as follows. Let D be the degree of a vertex at large time t, chosen uniformly at random.

d_i\propto (\frac{t}{i})^{1/2}

\Rightarrow\quad \mathbb{P}(D\geq d)=\frac{1}{t}|\{i:(\frac{t}{i})^{1/2}\geq d\}|=\frac{1}{t}|\{i:i\leq \frac{t}{d^2}\}|=\frac{1}{d^2}

So the tail of the degree distribution decays like d^{-2}, and hence the density decays like d^{-3}, which is what is meant by the power law having exponent 3.

Now we consider the Eckhoff / Morters model. The main difference here is that instead of assuming that each new vertex comes with a fixed number of edges, instead the new vertex joins to each existing vertex independently with probability proportional to the degree of the existing vertex. More precisely, they assume that edges are directed from new vertices to old vertices, and then each new vertex n+1 is joined to vertex m<n+1 with probability \frac{f(\text{indegree of }m\text{ at time }n)}{n}\wedge 1, where f(k)=\gamma k +\beta, for \gamma\in[0,1), \beta>0.

I was stuck for a long time before I read carefully enough the assertion that \beta>0. Of course, if this doesn’t hold, that is if \beta=0, then the graph won’t grow fast enough. For, since the function f is now linear, we can lift the statement about the evolution of the degree of a single vertex to a statement about the evolution of the total number of edges. Note that each edge contributes exactly one to the total in-degree. So in the case \beta=0 we obtain

\dot{E}=\frac{\gamma E}{t}\quad\Rightarrow E(t)\propto t^\gamma.

In particular, this is much less than t, so the majority of vertices have small degree. In fact the reason is fairly clear: the preferential attachment mechanism depends only on in-degree, and the in-degree of a new vertex is always zero by construction, so if f(0)=0 there is no way for that vertex ever to receive an additional edge. So all the edges in the graph for large t will be incident to a vertex that had positive in-degree in the time 0 configuration. Hence we need \beta>0 for the model to be meaningful. Note that this means we effectively have an Erdos-Renyi type mechanism AND a preferential attachment evolution: for each new vertex, we add roughly \beta edges to existing vertices chosen uniformly at random (rather than by a PA method), and also some assigned via PA. A previous paper by Dereich and Morters shows that the asymptotic degree distribution has a power law tail with exponent

\tau:=\frac{\gamma+1}{\gamma}.

Note that \gamma=\frac12 gives the same exponent (3) as the Bollobas / Riordan model.

We can apply a similar ODE approximation as above to estimate the likely large time behaviour of the number of edges:

E'=\frac{\gamma E + \beta t}{t}.

So, since E'\geq \beta, we have E(t)\geq \beta t, and defining F(t)=E(t)/t, we get:

tF'(t)=\beta-(1-\gamma)F(t)        (1)

Noting that F’ is positive when F< \frac{\beta}{1-\gamma} and negative when F>\frac{\beta}{1-\gamma} suggests that for large t, this is an equilibrium point for F and hence E(t)\approx \frac{\beta t}{1-\gamma}. Obviously, this is highly non-rigorous, as F’ can be very small and still satisfy the relation (1), so it is not clear that the ‘equilibrium’ for F is stable. Furthermore, one needs to check that the binomial variables that supply the randomness to this model are sufficiently concentrated that this approximation by expectation is reasonable.
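In fact, for this particular step we can sidestep the stability question, because the approximating ODE is linear and can be solved exactly (a quick check of my own, not something taken from the paper). Rewriting the equation for E as E'-\frac{\gamma}{t}E=\beta and multiplying by the integrating factor t^{-\gamma} gives (t^{-\gamma}E)'=\beta t^{-\gamma}, so

E(t)=\frac{\beta t}{1-\gamma}+Ct^{\gamma},\qquad F(t)=\frac{\beta}{1-\gamma}+Ct^{-(1-\gamma)}\rightarrow \frac{\beta}{1-\gamma},

as t\rightarrow\infty, for any constant C, since \gamma<1. Whether the binomial randomness is concentrated enough for the ODE to be a faithful description of the process is, of course, still a separate matter.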

Nonetheless, as a heuristic this is not completely unsatisfactory, and it leads to the conclusion that E(t) is a linear function of t, and so the distribution of the out-degrees for vertices formed at large times t is asymptotically Poisson, with parameter

\lambda =\frac{\beta\gamma}{1-\gamma}+\beta=\frac{\beta}{1-\gamma}.

Note that this is the same situation as in Erdos-Renyi. In particular, it shows that all the power tail behaviour comes from the in-degrees. In a way this is unsurprising, as these evolve in time, whereas the out-degree of vertex t does not change after time t. Dereich and Morters formalise this heuristic with martingale analysis.

The reason we are interested in this type of model is that it better reflects networks seen in real life. Some of these networks are organic, and so it is natural to consider some form of random destructive mechanism, for example lightning, that kills a vertex and all its edges. We have to compare this sort of mechanism, which chooses a vertex uniformly at random, against a targeted attack, which deletes the vertices with largest degree. Note that in Erdos-Renyi, the largest degree is not much larger than the size of the typical degree, because the degree distribution is asymptotically Poisson. We might imagine that this is not the case in some natural networks. For example, if one wanted to destroy the UK power network, it would make more sense to target a small number of sub-stations serving large cities, than, say, some individual houses. However, a random attack on a single vertex is unlikely to make much difference, since the most likely outcome by far is that we lose only a single house etc.

In Eckhoff / Morters’ model, the oldest vertices by construction have roughly the largest degrees, so it is clear what targeting the most significant \epsilon n vertices means. They then show that these vertices include all the vertices that give the power law behaviour. In particular, if you remove all of these vertices and, obviously, the edges incident to them, you are left with a graph with exponential tail in the asymptotic degree distribution, with largest degree on the order of log n. It was shown in a previous paper that this type of network is not vulnerable to random removal of nodes. Perhaps most interestingly, these authors now prove that after removing the most significant \epsilon n vertices, the network IS now vulnerable to random removal of nodes, leading to the conclusion that it is preferable to experience a random attack followed by a targeted attack than vice versa!

In a future (possibly distant) post, I want to say some slightly more concrete things about how these processes link to combinatorial stochastic processes I understand slightly better, in particular urn models. I might also discuss the configuration model, an alternative approach to generating complex random networks.

Beyond Erdos-Renyi: more realistic models of networks

The claim is often made that the study of random graphs such as the Erdos-Renyi model is worthwhile because it gives us information about complex systems which exist in the real world. The internet or social networks provide the example du jour at the moment, but it’s equally plausible to think about traffic flows, electrical systems or interacting biological processes too.

If this were entirely true, it would be great for two reasons. Firstly, in my opinion at least, it is a beautiful subject in its own right, and to have a concrete applicable reason to continue studying it would make it even better. (Not to mention the dreaded competition for funding…) Secondly, Erdos-Renyi is so simple. After all, it involves little more than adding some simple topology to a collection of IID Bernoulli random variables, and so it would surely be possible to draw some significant conclusions about how complicated real-world objects interact without too much mathematical effort.

Unfortunately, but unsurprisingly, this simplicity is a drawback as far as applications go. It is fairly clear that most real-world systems cannot offer any property even approaching the niceness of the independent, equal-probability edges condition. But rather than consign E-R to the ‘pretty but useless’ category of mathematical structures, we should think carefully about exactly why it fails to be a good model for real-world networks, and see whether there are any small adjustments that could be made to improve it.

This is something I’ve been meaning to read up about for ages and ages. What follows is based heavily on the Albert and Barabasi 2002 review paper. I suspect that many of the open problems and intuitive calculations have since been finished and formalised, but for an overview I hope that doesn’t matter hugely. I’ve also leafed through the relevant chapters of Remco Van der Hofstad’s notes, but am setting the details and the exercises aside for the holidays when I have a bit more time!

Problems with Erdos-Renyi

Recall that G(n,p) takes n vertices, and adds edges between any pair of vertices independently with probability p.

One property shared by most real-world networks is the scale-free phenomenon, which says that the degree distribution has a power law tail. The Albert-Barabasi paper gives a comprehensive survey of data verifying this claim. By contrast, G(n,p) has degree distribution which is approximately Poisson as n grows. This is concentrated near the average degree with a thin exponential tail, so does not satisfy this requirement. I was and still am a bit confused by the term ‘scale-free’. The idea is certainly that the local structure is independent of the size of the system, which seems to be true for the degree distributions in sparse ER, that is where p = O(1/n). But I think the correct heuristic is that it doesn’t matter how far zoomed in you are – the macroscopic structure looks similar for n vertices as for n^2 vertices. This certainly fails to be true for ER, where no vertex has O(n) neighbours, whereas with a power law tail, this does hold.

The main consequence of this is that there are a few vertices with very high degree. These are often called ‘hubs’ and parallels are drawn to the internet, where key websites and servers connect lots of traffic and pages from different areas. The idea is that the hubs are almost certainly well-connected to each other, and this offers a step towards a small-world phenomenon, where the shortest path between any two vertices is very small relative to the size of the system. This notion was introduced to mainstream culture by Stanley Milgram’s ‘Six degrees of separation’ experiment in the 60s, where it became clear that subjects were able to deliver a package to a complete stranger on the other side of the USA, using only personal contacts, in about six stages. The graph theoretic notion for this is the diameter, defined as the maximal graph distance between two points. Here, the graph distance means the length of the shortest path between the points. This definition, with its max-min formalism, looks rather complicated, but isn’t really. The diameter of a sparse supercritical Erdos-Renyi graph increases like log n, which is small relative to n, and so this property holds.

A quick glance at your list of Facebook friends will confirm that the independent edges condition in an Erdos-Renyi random graph is not a plausible model for social networks. How many friends do you have? Let’s say about 1000, more to make the calculation easier than because you’re necessarily very popular. How many does your friend Tom have? Let’s say 1000 again. As was in the news a few months ago, there are now over a billion people on Facebook. Let’s say exactly a billion (that is 10^9 for these purposes). So both you and Tom are friends with 1/10^6 of the total membership of the network. So how large would you expect the overlap of your friendships to be, if they were all chosen independently at random? Well, the probability that you are both friends with Alice is 10^-12, and so the expected number of your mutual friends is 10^-12 x 10^9 = 10^-3 which is substantially less than 1. Yet I imagine if you substituted names suitably, you and Tom might well have over 50 mutual friends if you were, say, in the same year at school or university and haven’t yet purged your list.

We want a statistic that records this idea quantitatively. There are various candidates for such a clustering coefficient. The underlying notion that we might expect there to be greater connectivity between neighbours of some fixed point v than in the graph as a whole gives an intuition for a possible definition. Compare the proportion of triangles in the graph to the cube of the proportion of edges. When this ratio is large, then there is a lot of clustering. In the E-R case, we would expect these to be roughly equal, as the probability of forming a particular triangle is exactly the cube of the edge probability, since the three edges involved appear independently.
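As a quick check of that last claim, the ratio can be estimated directly for G(n,p) (a sketch of my own using networkx; the values of n and p are arbitrary):

```python
import math
import networkx as nx

n, p = 2000, 0.01
G = nx.fast_gnp_random_graph(n, p, seed=0)

prop_edges = G.number_of_edges() / math.comb(n, 2)
# nx.triangles counts, for each vertex, the triangles through it, so divide the total by 3
prop_triangles = sum(nx.triangles(G).values()) / 3 / math.comb(n, 3)
print(prop_triangles / prop_edges ** 3)   # close to 1 for Erdos-Renyi, large for clustered graphs
```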

So we have three properties of real networks that we would like to incorporate into a model: small diameter, power-law degree distribution, and high clustering. To avoid this turning into a book, I’m going to write a paragraph about each of the possibilities discussed by Albert and Barabasi.

Generalised Random Graph

The degree distribution will typically emerge as a consequence of the construction of a given model. The general idea here is to condition on the degree distribution having the form we want, and see what this does to the structure. Of course, the choice of how to do this conditioning is absolutely key. It certainly isn’t obvious what it means to ‘condition G(n,p) to have a power-law degree distribution’, since the very idea of a power-law vs exponential tail requires the number of vertices to be large.

The first idea for achieving this gives the vertices ‘stubs’, which join up in pairs to form edges. We decide on the distribution of stubs according to this power law, then pair them up uniformly at random. Obviously, there is a possibility of getting some loops, but this is not going to happen so often as to be a genuine problem in the limit. This construction is similarly open to the branching process exploration ideas well covered for the E-R random graph, though we have to be careful to size-bias the degree distributions when necessary. There is still an underlying independence in the location of edges though, so it is reasonably clear that the amount of clustering may be closer to E-R than to the real examples cited.

The other possibility suggested is to retain the independent edge property, but give the vertices weights, and let the probability of an edge between two vertices be some sensible function of the weights. In the end it turns out to make little difference whether the weights are chosen deterministically or randomly, but by taking the weights i.i.d. with a suitably heavy, power-law tail, we can generate a so-called generalised random graph where the degree distribution also has a power law.

Watts-Strogatz

In the WS model, the idea is to interpolate between a graph with maximal clustering and a random graph. A d-regular graph, say on a ring, where every vertex is connected to its d nearest neighbours, has high clustering, but large diameter: for example, it takes roughly n/2d steps to get to the other side of the ring. Whereas in the standard E-R model we add edges with some fixed probability p, here we rewire edges with some fixed probability p. That is, we take an edge in the regular graph and with some small probability we remove it and instead add an edge between two vertices chosen uniformly at random. The theoretical motivation is that removing a few edges doesn’t destroy the high clustering evident in the regular graph, but even a sparse random graph has small diameter, so adding a few ‘long-range’ edges should be enough to decrease the diameter significantly.

It obviously needs to be checked that a substantial drop in diameter occurs before a substantial decrease in clustering, and there is a calculation and diagram to support this intuitive idea in the paper. The one drawback of this model is that it fails to provide the power-law degree distributions we want. After all, an E-R graph has a concentrated degree distribution, and a d-regular graph has all degrees the same, so we would expect some interpolation between the two to have a concentrated distribution as well. Nonetheless, this model accords well with an idea of how complex networks might form, particularly if there is some underlying geometry. It is reasonable to assume that an initial setup for a network would be that people are connected to those closest to them, and then slowly acquire distant contacts as time progresses.
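To see the interpolation numerically without reproducing the diagram from the paper, something like the following sketch is enough (it leans on networkx’s built-in generator; the parameter values are arbitrary, and p here is the rewiring probability described above):

```python
import networkx as nx

n, d = 1000, 10   # ring on n vertices, each joined to its d nearest neighbours
for p in [0.0, 0.01, 0.1, 1.0]:
    # the 'connected' variant retries until the rewired graph is connected,
    # so that the mean shortest-path distance is well-defined
    G = nx.connected_watts_strogatz_graph(n, d, p, tries=100, seed=0)
    C = nx.average_clustering(G)
    L = nx.average_shortest_path_length(G)
    print(f"p={p:<5} clustering={C:.3f}  mean distance={L:.2f}")
# typically the mean distance collapses for small p while the clustering
# remains close to that of the ring lattice
```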

Preferential Attachment – Barabasi-Albert model

Most of our intuition for networks can be extended to an intuition for the formation of networks. The idea of prescribing a degree distribution is neat, but it doesn’t give any account of the mechanism of formation. Complexity emerges over time, and a good model should be able to describe why this happens. The Barabasi-Albert model takes this as its starting point, with the aim of producing a highly clustered system dynamically. Recall that we can describe G(n,p) as a process by coupling, then increasing p from 0 to 1, and seeing edges emerge. The independence assumption can be lifted through the coupling, and so which edge appears next is independent of the current state of the system.

This is what we need to relax. Recall the motivating idea of ‘hubs’, where a small collection of vertices have very high connectivity across the whole system, as observed in several real situations. A consequence of this is that a new edge is more likely to be attached to a hub than to a poorly connected vertex elsewhere. But it turns out that this idea of preferential attachment isn’t enough by itself, because as a network forms, it is not just the connectivity that increases, but also the size of the system itself. So in fact it makes sense to add vertices rather than edges, and join the new vertices to existing vertices in proportion to the degrees of the existing vertices. This combination of growth and preferential attachment is key to the scale-free graphs that this Barabasi-Albert model generates. Relaxing either mechanism returns us to the case of exponential tails. There are methods in the literature for generating such graphs without the need for a dynamic model, but they are harder to understand and describe. None I have seen so far has a high clustering coefficient.

Hubs are effectively a way to reduce the diameter. Recall the description of Milgram’s experiment where he encouraged randomly chosen people to send a package to Harvard. For the purposes of this model, an undergraduate from Wyoming or a husband from Alabama moving in with his wife in Boston are clear hubs, as for very many people near their previous home, they represent a good connection to Harvard. So it is unsurprising that BA, which reinforces hubs, has a sub-logarithmic small diameter.

Conclusions

I’m not entirely sure what conclusions I should draw from my reading. Probably the main one is that I should read more, as there is plenty of interesting stuff going on in this area. Intuitively, it seems unlikely that there is going to be a single model which unites the descriptions of all relevant real-world networks. As ever, it is pleasant to find structures that are both mathematically interesting in their own right and relevant to applied problems. So it is reassuring to observe how similar many of the models discussed above are to the standard random graph.

Poisson Tails

I’ve had plenty of ideas for potential probability posts recently, but have been a bit too busy to write any of them up. I guess that’s a good thing in some sense. Anyway, this is a quick remark based on an argument I was thinking about yesterday. It combines Large Deviation theory, which I have spent a lot of time learning about this year, and the Poisson process, which I have spent a bit of time teaching.

Question

Does the Poisson distribution have an exponential tail? I ended up asking this question for two completely independent reasons yesterday. Firstly, I’ve been reading up about some more complex models of random networks. Specifically, the Erdos-Renyi random graph is an interesting mathematical structure in its own right, but the independent edge condition results in certain regularity properties which are not seen in many real-world networks. In particular, the degree sequence of real-world networks typically follows an approximate power law. That is, the tail is heavy. This corresponds to our intuition that most networks contain ‘hubs’ which are connected to a large region of the network. Think about key servers or websites like Wikipedia and Google which are linked to by millions of other pages, or the social butterfly who will introduce friends from completely different circles. In any case, this property is not observed in an Erdos-Renyi graph, where the degrees are binomial, and in the sparse situation, rescale in the limit to a Poisson distribution. So, to finalise this observation, we want to be able to prove formally that the Poisson distribution has an exponential (so faster than power-law) tail.

The second occurrence of this question concerns large deviations for the exploration process of a random graph. This is a topic I’ve mentioned elsewhere (here for the exploration process, here for LDs) so I won’t recap extensively now. Anyway, the results we are interested in give estimates for the rate of decay in probability for the event that the path defined by the exploration process differs substantially from the expected path as n grows. A major annoyance in this analysis is the possibility of jumps. A jump occurs if a set of o(n) adjacent underlying random variables (here, the increments in the exploration process) have O(n) sum. A starting point might be to consider whether O(1) adjacent RVs can have O(n) sum, or indeed whether a single Poisson random variable can take a value of order n. In practice, this asks whether the probability \mathbb{P}(X>\alpha n) decays faster than exponentially in n. If it does, then this is dominated on a large deviations scale. If it decays exactly exponentially in n, then we have to consider such jumps in the analysis.

Approach

We can give a precise statement of the probabilities that a Po(\lambda) random variable X returns a given integer value:

\mathbb{P}(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}.

Note that these are the terms in the Taylor expansion of e^{\lambda} appropriately normalised. So, while it looks like it should be possible to evaluate

\mathbb{P}(X>\alpha n)=e^{-\lambda}\sum_{k>\alpha n} \frac{\lambda^k}{k!},

this seems impossible to do directly, and it isn’t even especially obvious what a sensible bounding strategy might be.

The problem of estimating the form of the limit in probability of increasingly unlikely deviations from expected behaviour surely reminds us of Cramer’s theorem. But this and other LD theory is generally formulated in terms of n random variables displaying some collective deviation, rather than a single random variable, with the size of the deviation growing. But we can transform our problem into that form by appealing to the equivalent definitions of the Poisson process.

Recall that the Poisson process is the canonical description of, say, an arrivals process, where events in disjoint intervals are independent, and the expected number of arrivals in a fixed interval is proportional to the width of the interval, giving a well-defined notion of ‘rate’ as we would want. The two main ways to define the process are: 1) the times between arrivals are given by i.i.d. Exponential RVs with parameter \lambda equal to the rate; and 2) the number of arrivals in an interval [s,t] is independent of the process outside [s,t], and has distribution Po(\lambda(t-s)). The fact that these definitions agree and give a well-defined process is not necessarily obvious, but let’s not discuss that further here.

So the key equivalence to be exploited is that the event \{X\geq n\} for X\sim \text{Po}(\lambda) is precisely the statement that there are at least n arrivals by time 1. If we move to the exponential inter-arrival times definition, we can write this as:

\mathbb{P}(Z_1+\ldots+Z_n<1),

where the Z’s are the i.i.d. exponential random variables. But this is exactly what we are able to specify through Cramer’s theorem. Recall that the moment generating function of an exponential distribution is not finite everywhere, but that doesn’t matter as we construct our rate function by taking the supremum over some index t of:

I(x)=\sup_t (xt-\log \mathbb{E}e^{tZ_1})=\sup_t(xt-\log(\frac{\lambda}{\lambda-t})).

The supremum is attained at t=\lambda-\frac{1}{x}, and a simple calculation then gives

I(x)=\lambda x-1 - \log \lambda x.

\Rightarrow I(x)\uparrow \infty\text{ as }x\downarrow 0.

Note that I(1) is the same for both Exp(\lambda) and Po(\lambda), because of the PP equality of events:

\{Z_1+\ldots+Z_n\leq n\}=\{\text{Po}(\lambda n)=\text{Po}(\lambda)_1+\ldots+\text{Po}(\lambda)_n\geq n\},

similar to the previous argument. In particular, for all \epsilon>0,

\mathbb{P}(\text{Po}(\lambda)\geq n)=\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}\leq\frac{1}{n})\leq\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}<\epsilon),\text{ for large }n,

and so, by Cramer’s theorem,

\mathbb{P}(\text{Po}(\lambda)\geq n)=O(e^{-nI(\epsilon)}),\text{ for all }\epsilon.

Since we can take I(\epsilon) as large as we want, we conclude that the probability decays faster than exponentially in n.
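The same conclusion is visible numerically, if an explicit check is reassuring (a quick sketch assuming scipy; the choice \lambda=1 and the values of n are arbitrary):

```python
from scipy.stats import poisson

lam = 1.0
for n in [10, 20, 50, 100]:
    # logsf(n) = log P(X > n); for an exponential tail, -logsf(n)/n would level off
    rate = -poisson.logsf(n, lam) / n
    print(n, round(rate, 2))
# the printed 'rate' keeps growing (roughly like log n), i.e. the tail decays
# faster than e^{-cn} for every fixed c
```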

Analytic vs Probabilistic Arguments for a Supercritical BP

This follows on directly from the previous post. I was originally going to talk only about what follows, but I got rather carried away with the branching process account. I was stuck on a particular exercise, and we ended up coming up with two arguments: one analytic and one probabilistic. Since the typical flavour of this blog is to present problems which show the advantage of the probabilistic approach, it seems only fair to remark on this case, where the analytic method was less interesting, but much simpler.

Recall that we have a supercritical random graph G(n,\frac{\lambda}{n}), \lambda>1, and we are considering the rescaled exploration process S_{nt}, which has asymptotic mean \mu_t=1-t-e^{-\lambda t}. We can calculate similarly an expression for the asymptotic variance

\frac{\text{Var}(S_{nt})}{n}\rightarrow v_t=e^{-\lambda t}(1-e^{-\lambda t}).

To use this to verify the result about the size of the giant component, we verify that \mu_{\zeta_\lambda+x/\sqrt{n}} is negative, and that the variance there is small, which would confirm that the giant component has size bounded above by roughly \zeta_\lambda n with high probability. A similar argument is required for the lower bound. The variance is a separate matter, but it is therefore necessary that \mu_t should be decreasing at t=\zeta_\lambda, that is \mu_{\zeta_\lambda}'=\lambda e^{-\lambda \zeta_\lambda}-1<0. This is what we try to prove in the remainder of this post. Recall that in the previous post we checked that \mu_{\zeta_\lambda}=0.
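As an aside, the statement being targeted here, that the giant component contains roughly a \zeta_\lambda proportion of the vertices, is easy to check numerically (a sketch of my own assuming networkx; the fixed-point iteration and the parameter values are arbitrary choices):

```python
import math
import networkx as nx

def survival_probability(lam, iterations=200):
    """Fixed-point iteration for zeta solving zeta = 1 - exp(-lam * zeta), for lam > 1."""
    z = 0.5
    for _ in range(iterations):
        z = 1 - math.exp(-lam * z)
    return z

lam, n = 1.5, 200_000
G = nx.fast_gnp_random_graph(n, lam / n, seed=0)
giant = max(nx.connected_components(G), key=len)
print(len(giant) / n, survival_probability(lam))   # the two values should agree closely
```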

Heuristic Explanation

\mu_t has been rescaled from the original definition of the exploration process in both size and time-scale so some care is needed to see why this should hold in the limit. Remember that all components apart from the giant component are of size O(log n). So immediately after exhausting the giant component, you are likely to be visiting components of size roughly log n. A time interval of dt for \mu corresponds to ndt for S, during which S will visit some components of size log n and some of O(1) and some in between. In particular, some fixed proportion of vertices are isolated, that is, in a component of size 1.

There is then a complicated size-biasing train of thought. A component of size log n is more likely to come up than an isolated vertex, but there are not as many of them. The log n components push the derivative \mu_t' towards zero, because S_t decreases by 1 over a time-interval of length log n, which gives a gradient of zero in the limit. However, the isolated vertices give a gradient of -1, because S_t decreases by 1 over a time interval of 1. Despite the fact that log n intervals are likely to appear earlier, it still remains the case that after exhausting a component (in particular, at time t=\zeta_\lambda, after exhausting the giant component), with positive probability bounded away from zero you will choose an isolated vertex next. The component size only affects the time-scale if it is O(n), which none of the remaining components are, so the derivative \mu_{\zeta_\lambda}' is some complicated weighted mean of 0 and -1. In particular, it is negative.

Analytic solution

Obviously, that won’t do in practice. Suppressing lambdas for ease of notation, the key fact is: e^{-\lambda \zeta}=1-\zeta. We want to show that \lambda e^{-\lambda \zeta}<1. Substituting

\lambda=-\frac{\log(1-\zeta)}{\zeta},

means that it is required to show:

-\frac{1-\zeta}{\zeta}\log(1-\zeta)<1.

Differentiating the left hand side gives:

\frac{\log(1-\zeta)+\zeta}{\zeta^2}<0,

since of course \log(1-\zeta)=-(\zeta+\frac{\zeta^2}{2}+\frac{\zeta^3}{3}+\dots), and so \log(1-\zeta)+\zeta<0. The left hand side is therefore decreasing in \zeta, so it suffices to check the result for small \zeta. But, again using a Taylor series:

-\frac{1-\zeta}{\zeta}\log(1-\zeta)=1-\frac12\zeta+O(\zeta^2)<1,

for small \zeta. This gives the required result.

Probabilistic Interpretation and Solution

First, we observe that \lambda e^{-\lambda\zeta}=\lambda(1-\zeta) is the expected number of vertices in the first generation of a \text{Po}(\lambda) branching process whose progeny become extinct. This motivates considering the canonical decomposition of a supercritical branching process Z into the skeleton process and the dual process. The skeleton Z^+ consists of all vertices which have infinitely many successors. It is relatively easy to show that this is a branching process with offspring distribution \text{Po}(\lambda\zeta) conditioned on being positive. The dual process Z^* is a G-W branching process with offspring distribution \text{Po}(\lambda) conditioned on dying out. This is the same as a branching process with offspring distribution \text{Po}(\lambda(1-\zeta)), by a thinning argument, which says that if we begin with a Poisson number of things, then remove each one independently with some fixed probability, the remaining number of things is Poisson also.

We can construct the original branching process by

  • With probability \zeta, take the skeleton, and affix independent copies of Z^* at every vertex in the skeleton.
  • With probability 1-\zeta, just take a copy of Z^*.

It is immediately clear that \lambda(1-\zeta)\leq 1. After all, the dual process is almost surely finite, so the offspring distribution cannot have expectation greater than 1. Checking that this inequality is strict is more fiddly. The best way I have come up with is to examine the tail of the distribution of total population size of the original branching process.

The total population size T of a branching process has an exponential tail if the offspring distribution is subcritical. It isn’t hugely surprising that this behaves like a large deviation for iid RVs, since in the limit such an event requires a lot of the offspring counts to deviate substantially from the mean. The same holds in the supercritical case, with the additional complication that though the finite tail decays exponentially, there is positive probability that the total size will be infinite. In the critical case, however, there is a power-law decay. This is not hugely surprising as it marks the threshold for the appearance of the infinite population, just as in a multiplicative coalescent at time 1, we have a load of very large components just about to form a giant component. The tool for all of these results is Dwass’s Theorem, which says:

\mathbb{P}(T=n)=\frac{1}{n}\mathbb{P}(X_1+\ldots+X_n=n-1),

where X_1,\ldots,X_n are iid with the offspring distribution. When \mathbb{E}X_1\neq 1, this is a large deviation event, for which Cramer’s theorem applies (assuming, as is the case for the Poisson distribution, that the offspring distribution has a finite moment generating function near zero). When \mathbb{E}X_1=1, the Central Limit Theorem says that with high probability,

X_1+\ldots+X_n\in [n-n^{3/4},n+n^{3/4}],

so, skating over the details of whether everything is exactly uniform within this CLT scaling window,

\mathbb{P}(T=n)\geq \frac{1}{n}\cdot\frac{1}{2n^{3/4}}.

The true power-law decay, which is of order n^{-3/2} when the offspring distribution has finite positive variance, is in fact slower than the n^{-7/4} appearing here, but the above argument works as a back-of-the-envelope bound.
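As an aside, Dwass’s theorem itself is easy to sanity-check by simulation (a sketch of my own with Poisson offspring; the subcritical mean 0.8 is chosen so that the total population is almost surely finite, and numpy and scipy are assumed):

```python
import numpy as np
from scipy.stats import poisson

lam, trials = 0.8, 200_000
rng = np.random.default_rng(0)

def total_population(lam, rng, cap=10**5):
    """Total progeny of a Galton-Watson tree with Po(lam) offspring (capped for safety)."""
    alive, total = 1, 1
    while alive and total < cap:
        alive = rng.poisson(lam * alive)   # the sum of `alive` iid Po(lam) offspring counts
        total += alive
    return total

sizes = np.array([total_population(lam, rng) for _ in range(trials)])
for n in [1, 2, 3, 5, 10]:
    empirical = np.mean(sizes == n)
    # Dwass: P(T=n) = (1/n) P(X_1+...+X_n = n-1), and a sum of n iid Po(lam) is Po(n*lam)
    dwass = poisson.pmf(n - 1, n * lam) / n
    print(n, round(float(empirical), 4), round(float(dwass), 4))
```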

Returning to the argument: if the dual process had mean exactly 1, then the population size of the original branching process would be infinite with some probability, and otherwise given by the total population of a critical branching process, with its power-law tail. This power-law would dominate the finite tail, contradicting the fact noted above that a supercritical branching process has an exponentially decaying finite tail. Hence \lambda(1-\zeta)<1 holds strictly, as required.