# Dispersion in Social Networks

This post is based on a paper that appeared a couple of weeks ago on the Computer Science section of arXiv. You can find it here. I’m going to write a few things about the ideas behind the paper, and avoid pretty much entirely the convincing data the authors present, as well as questions of implementing the algorithms discussed.

The setting is a social network, which we can describe as a graph. Nodes stand for people, and an edge represents that the two associated people have some social connection. This paper focuses on edges corresponding to friendship in the Facebook graph.

A key empirical feature of the graph topology of such social networks as compared to most mathematical models of random graphs is the prevalence of short cycles, and so-called clustering. Loosely speaking, in an Erdos-Renyi random graph, each potential edge appears in the graph independently of the rest of the configuration. But this does not accord well with our experience of our own Facebook friend circle. In general, knowledge that A is friends with both B and C increases the likelihood that B and C are themselves friends. This effect appears to be more present in other models, such as Preferential Attachment and the Configuration Model, but that is really more a consequence of the degree sequence being less concentrated.

The reason for this phenomenon appearing in social networks is clear. People meet other people generally by sharing common activities, whether that be choice of school, job or hobbies. The question of how readily people choose to add others on Facebook is a worthwhile one, but not something I have the time or the sociological credibility to consider! In any case, it is not a controversial idea that for some typical activity, it is entirely possible that almost all the participants will end up as friends, leading to a large (almost-) ‘clique’ in the graph. Recall a clique is a copy of a complete graph embedded in a larger graph – that is, a set of nodes all of which are pairwise connected.

We could think of much of the structure of this sort of network as being generated in the following way. Suppose we were able to perform the very unlikely-sounding task of listing every conceivable activity or shared attribute that might engender a friendship. Each activity corresponds to a set of people. We then construct a graph on the set of people by declaring that a pair of nodes are connected by an edge precisely if the people corresponding to these nodes can both be found in some activity set.

Another way of thinking about this setup is to consider a bipartite graph, with people as one class of vertices, and activities as the other. Predictably, we join a person to an activity if they engage in that activity. The edges within the class of people are then induced by the bipartite edges. Obviously, under this interpretation, we could equally well construct a graph on the set of activities. Here, two activities would be joined if there is a person who does them both. Graphs formed in this way can be called Intersection Graphs, and there is lots of interest in investigating various models of Random Intersection Graphs.
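As a small illustration of the construction just described, here is a minimal sketch that builds the friendship edges from a collection of activity sets. The function name `intersection_graph` and the toy data are made up for the example and are not taken from the paper.

```python
from itertools import combinations

def intersection_graph(activities):
    """Return the set of edges on people induced by shared activities.

    `activities` maps an activity name to the set of people taking part.
    Two people are joined precisely if some activity set contains both.
    """
    edges = set()
    for members in activities.values():
        # Every pair of participants in the same activity becomes an edge.
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return edges

# Hypothetical toy data, purely for illustration.
activities = {
    "choir": {"alice", "bob", "carol"},
    "office": {"carol", "dave"},
}
edges = intersection_graph(activities)
```

Swapping the roles of people and activities in the input gives the corresponding graph on activities.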

The question addressed by the authors of the paper can be summarised as follows. A social network graph tells us whether two people are ‘friends’, but it does not directly tell us how close their relationship is. It is certainly an interesting question to ask whether the (local) network topology can give us a more quantitative measure of the strength of a friendship.

As the authors explain, a first approach might be to consider how many mutual friends two people have. (We consider only pairs of people who are actually friends. It seems reasonable to take this as a pre-requisite for a strong relationship among people who do actually use Facebook.) However, this can fail because of the way these social networks organise themselves around shared attributes and activities. The size of one of these cliques (which are termed social foci in parts of the literature) is not especially likely to be well correlated to the strengths of the friendships within the clique. In particular, the clique corresponding to someone’s workplace is likely to grow in size over time, especially when people grow towards an age where, on average, they move job much less. So it seems likely that, according to a naive examination of the number of mutual friends, we would predict that a person’s strongest friend is likely to be someone they work with, who perhaps by chance also does some other activity with that person.

The authors phrase this problem slightly differently. They examine algorithms for establishing a person’s spouse or long-term partner with good accuracy from only the local network structure.

Heuristically we might expect that a husband knows many of his wife’s work colleagues, and vice versa. Not all of these ties might be so strong that they actually lead to friendship, in the Boolean sense of Facebook, but we might expect that some noticeable proportion have this property. Naturally, there will be cliques to which both belong. One or more of these might be the reason they met in the first place, and others (eg parents at children’s schools) might have developed over the course of their relationship. However, as we’ve explained, this doesn’t narrow things down much.

(We need not be constrained by this heteronormative scenario. However, as the authors point out in a footnote, there are challenges in collecting data because of the large number of ironic relationship listings on Facebook, mainly among the undergraduate and younger community. This problem is particularly obstructive in the case of same-sex marriage, owing to the smaller numbers of genuine pairings, and larger numbers of false listings for this setting.)

The crucial observation is that if we look at the couple’s mutual friends, we expect to see large parts of the most important cliques from both husband and wife’s lives. Among these mutual friends, there will be some overlap, that is cliques of which both are an integral member. But among the rest, there will be a natural partition into friends who really originate from the husband, and friends who were introduced via the wife. So the induced graph on these mutual friends is likely to split into three classes of vertices, with very poor connectivity between two of them.

This is, up to sorting out scaling and so on, precisely the definition of dispersion, introduced by the authors. The dispersion between two vertices is high if the induced graph on their mutual neighbourhood has poor connectivity. Modulo exact choice of definition, they then exhibit data showing that this is indeed a good metric for determining marriages from the network topology, with success rate of around 50% over a wide range of users.
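To make the idea concrete, here is a minimal sketch of one simple way to score dispersion. It is an assumption about the details, not necessarily the authors' exact definition: it counts pairs of mutual friends of u and v that are not themselves friends and have no common neighbour other than u and v, and it ignores any normalisation. The toy network (an office clique plus a partner) is invented for the example.

```python
from itertools import combinations

def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbour sets."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def dispersion(adj, u, v):
    """Count poorly connected pairs among the mutual friends of u and v."""
    common = adj[u] & adj[v]
    score = 0
    for s, t in combinations(sorted(common), 2):
        if t in adj[s]:
            continue  # s and t are directly linked
        if (adj[s] & adj[t]) - {u, v}:
            continue  # s and t share a neighbour besides u and v
        score += 1
    return score

# Hypothetical network: u works with w, a, b, e, f, g (a clique);
# v is u's partner, knows a and b, and shares friends c and d with u.
office = list(combinations(["u", "w", "a", "b", "e", "f", "g"], 2))
home = [("u", "v"), ("v", "a"), ("v", "b"),
        ("u", "c"), ("u", "d"), ("v", "c"), ("v", "d"), ("c", "d")]
adj = build_adjacency(office + home)

partner_score = dispersion(adj, "u", "v")
colleague_score = dispersion(adj, "u", "w")
```

In this toy example the colleague w has more mutual friends with u than the partner v does, but all of them sit in one clique, so the colleague's score is zero while the partner's is positive.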

# Persistent Hubs

This post is based on the paper “Existence of a persistent hub in the convex preferential attachment model” which appeared on arXiv last week. It can be found here. My aim is to explain (again) the model; the application-based motivation for the result; and a heuristic for the result in a special case. In particular, I want to stress the relationship between PA models and urns.

The preferential attachment model attempts to describe the formation of large complex networks. It is constructed dynamically: vertices are introduced one at a time. For each new vertex, we have to choose which existing vertices to join to. This choice is random and reinforced. That is, the new vertex is more likely to join to an existing vertex with high degree than to an existing vertex with degree 1. It’s clear why this might correspond well to the evolution of, say, the world wide web. New webpages are much more likely to link to an established site, eg Wikipedia or Google, than to a uniformly randomly chosen page.

The model is motivated also by a desire to fit a common property of real-world networks that is not exhibited by, among others, the Erdos-Renyi random graph model. In such a network, we expect a few nodes to have much greater connectivity than average. In a sense these so-called hubs drive connectivity of the system. This makes sense in practice. If you are travelling by train around the South-East of England, it is very likely you will pass through at least one of Reading, East Croydon, or about five major termini in London. It would be absurd for every station to be of equal significance to the network. By contrast, the typical vertex degree in the sparse Erdos-Renyi model is O(1), and has a limiting Poisson distribution, with a super-exponential tail.

So, this paper addresses the following question. We know that the PA model, when set up right, has power-law tails for the degree distribution, and so has a largest degree that is an order of magnitude larger than the average degree. Let’s call this the ‘hub’ for now. But the model is dynamic, so we should ask how this hub changes in time as we add extra vertices. In particular, is it the case that one vertex should grow so large that it remains as the dominant hub forever? This paper answers this question in the affirmative, for a certain class of preferential attachment schemes.

We assign a weighting system to possible degrees, that is a function from N to R+. In the case of proportional PA, this function could be f(n)=n. In general, we assume it is convex. Note that the more convex this weight function is, the stronger the preference a new vertex feels towards existing dominant vertices. Part of the author’s proof is a formalisation of this heuristic, which provides some machinery allowing us to treat only really the case f(n)=n. I will discuss only this case from now on.

I want to focus on the fact that we have another model which describes aspects of the degree evolution rather well. We consider some finite fixed collection of vertices at some time, and consider the evolution of their degrees. We will be interested in limiting properties, so the exact time doesn’t matter too much. We look instead at the jump chain, ie those times when one of the degrees changes. This happens when a new vertex joins to one of the chosen vertices. Given that the new vertex has joined one of the chosen vertices, the choice of which of the chosen vertices is still size-biased proportional to the current degrees. In other words, the jump chain of this degree sequence is precisely Polya’s Urn.

This is a powerful observation, as it allows us to make comments about the limiting behaviour of finite quantities almost instantly. In particular, we know that from any starting arrangement, Polya’s Urn converges almost surely. This is useful to the question of persistence for the following reason.

Recall that in the case of two colours, starting with one of each, the proportion converges to a limit which has the uniform distribution. We should view this as a special case of the Dirichlet distribution, which is supported on partitions into k intervals of [0,1]. In particular, for any fixed k, the probability that two of the intervals have the same size is 0, as the distribution is continuous. So, since the convergence of the proportions in Polya’s Urn is almost sure, with probability one all of the proportions are eventually within $\epsilon>0$ of their limits. Taking $\epsilon$ small enough given the limit (less than half the gap between the two largest limiting proportions), which we are allowed to do, we can show that the colour which is largest in the limit is also the largest at all sufficiently large finite times.
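The following is a minimal simulation sketch of Polya's Urn, included to illustrate the claim that the leading colour eventually stops changing. The function name, the three-colour starting configuration and the fixed seed are choices made for the example only.

```python
import random

def polya_urn(initial, steps, rng):
    """Run Polya's Urn; return final counts and the last step at which the leader changed."""
    counts = list(initial)
    leader = max(range(len(counts)), key=counts.__getitem__)
    last_change = 0
    for step in range(1, steps + 1):
        # Draw a colour with probability proportional to its current count,
        # then add one extra ball of that colour.
        i = rng.choices(range(len(counts)), weights=counts)[0]
        counts[i] += 1
        if counts[i] > counts[leader]:
            leader = i
            last_change = step
    return counts, last_change

rng = random.Random(0)  # fixed seed only so that a run is reproducible
counts, last_change = polya_urn([1, 1, 1], 10_000, rng)
proportions = [c / sum(counts) for c in counts]
```

In a typical run one would expect `last_change` to be small compared with the number of steps, though a single simulation is of course not evidence of almost sure behaviour.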

Unfortunately, we can’t mesh together these finite-dimensional observations particularly nicely. What we require instead is a result showing that if a vertex has large enough degree, then it can never be overtaken by any new vertex. This is proved via a direct calculation of the probability that a new vertex ‘catches up’ with a pre-existing vertex of some specified size.

That calculation is nice and not too complicated, but has slightly too many stages and factorial approximations to consider reproducing or summarising here. Instead, I offer the following heuristic for a bound on the probability that a new vertex will catch up with a pre-existing vertex of degree k. Let’s root ourselves in the urn interpretation for convenience.

If the initial configuration is (k,1), corresponding to k red balls and 1 blue, we should consider instead the proportion of red balls, which is initially $\frac{k}{k+1}$. Crucially (for proving convergence results if nothing else), this is a martingale, which is clearly bounded within [0,1]. So the expectation of the limiting proportion $X_\infty$ is also $\frac{k}{k+1}$. Let us consider the stopping time T at which the number of red balls is equal to the number of blue balls. We decompose the expectation by conditioning on whether T is finite.

$\mathbb{E}X_\infty=\mathbb{E}[X_\infty|T<\infty]\mathbb{P}(T<\infty)+\mathbb{E}[X_\infty|T=\infty]\mathbb{P}(T=\infty)$

$\leq \mathbb{E}[X_\infty | X_T,T<\infty]\mathbb{P}(T<\infty)+(1-\mathbb{P}(T<\infty))$

using that $X_\infty\leq 1$, regardless of the conditioning,

$= \frac12 \mathbb{P}(T<\infty) + (1-\mathbb{P}(T<\infty))$

Since $\mathbb{E}X_\infty=\frac{k}{k+1}$, rearranging gives

$\mathbb{P}(T<\infty) \leq \frac{2}{k+1}.$
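As a sanity check on this bound, here is a small sketch that computes exactly the probability that the blue count catches up with the red count within a fixed number of draws, starting from (k,1). This is a lower bound on $\mathbb{P}(T<\infty)$, so it should never exceed $\frac{2}{k+1}$. The function name and the choice of 60 draws are arbitrary choices for the example.

```python
from fractions import Fraction

def catch_up_probability(k, steps):
    """P(blue count equals red count within `steps` draws), from k red and 1 blue."""
    if k == 1:
        return Fraction(1)  # the counts are already equal at time 0
    # dist maps the current number of blue balls to the probability of being
    # there without the two counts having been equal so far.
    dist = {1: Fraction(1)}
    hit = Fraction(0)
    for n in range(steps):
        total = k + 1 + n
        new = {}
        for b, p in dist.items():
            r = total - b
            # Red is drawn: red pulls further ahead.
            new[b] = new.get(b, 0) + p * Fraction(r, total)
            # Blue is drawn: either blue catches up, or we carry on.
            p_blue = p * Fraction(b, total)
            if b + 1 == r:
                hit += p_blue
            else:
                new[b + 1] = new.get(b + 1, 0) + p_blue
        dist = new
    return hit

# Compare the exact finite-horizon probability with the bound 2/(k+1).
table = {k: (catch_up_probability(k, 60), Fraction(2, k + 1)) for k in range(2, 7)}
```

Comparing the two entries for each k gives a feel for how much room the crude estimate leaves.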

We really want a bound that is summable over k so we can use some kind of Borel-Cantelli argument, and $\frac{2}{k+1}$ is not. Indeed, Galashin gets a bound of $O(k^{-3/2})$ for this quantity. We should stress where we have lost information. We have made the estimate $\mathbb{E}[X_\infty|T=\infty]\leq 1$ which is very weak. This is unsurprising. After all, the probability of this event is large, so conditioning on it shouldn’t really affect the limit that much. The conditioned process is repelled from 1/2, but that is of little relevance when starting from $\frac{k}{k+1}$. It seems likely this expectation is in fact $\frac{k}{k+1}+O(k^{-3/2})$, from which the result will follow.

# The Configuration Model

In the past, I’ve talked about limitations of the Erdos-Renyi model of homogeneous random graphs for applications in real-world networks. In a previous post, I’ve discussed a dynamic model, the Preferential Attachment mechanism, that ‘grows’ a graph dynamically by adding edges from new vertices preferentially to existing vertices with high degree. The purpose of this adjustment is to ensure that the distribution of the degrees is not concentrated around some fixed value (which would be c in G(n,c/n) ) but rather exhibits a power-law tail such as observed in many genuine examples.

In this post, we introduce some aspects of the configuration model, which achieves this property more directly. This idea probably first arose in the guise of regular graphs. Recall a regular graph has all degrees equal. How would we construct a random d-regular graph on a large number of vertices?

What we probably want to do is to choose uniformly at random from the set of such graphs, but it is not clear even how large this set is, let alone how one would order its elements to make it possible to make this uniform choice. Instead, we try the following. Assign to each vertex d so-called stubs, which will end up being ‘half-edges’. We then choose two stubs uniformly at random, and glue them together. More formally, we construct an edge between the host vertices, and then delete the chosen stubs. We then continue.

The construction makes no reference to the distribution of stubs, so we are free to choose this as we please. We could for example specify some sequence of degrees which approximates a power-law, or we could sample a random sequence of degrees in some way. So long as we have a sequence of stub set sizes before we start building the edges of the graph we will be able to use the above algorithm.

So what might go wrong? There seem to me to be three potential problems that might arise with this construction.

Firstly, there might be a stub left over, if the sum of the stub set sizes is odd. Recall that in a graph the sum of the degrees is twice the number of edges, and so in particular the sum of the degrees should be even. But this is a small problem. When the degree sequence is deterministic we can demand that it have even sum, and if it is random, we will typically be working in a large N regime, and so deleting the solitary stub, if such a thing exists, will not affect the sort of properties of the graph we are likely to be interested in.

The second and third objections are perhaps more serious. If we glue together stubs naively, we might end up with loops, that is, edges that ‘begin’ and ‘end’ at the same vertex. These are not allowed in the standard definition of a graph. Alternatively, we might end up with more than one edge between the same pair of vertices.
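Here is a minimal sketch of the stub-pairing construction, together with a count of the loops and repeated edges it produces. Shuffling the list of stubs and pairing consecutive entries gives a uniformly random pairing, which has the same law as repeatedly gluing two uniformly chosen stubs. The function names and the 3-regular example are choices made for illustration.

```python
import random
from collections import Counter

def configuration_model(degrees, rng):
    """Pair up stubs uniformly at random; return a list of (possibly multi-) edges."""
    # Vertex v contributes degrees[v] stubs.
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    if len(stubs) % 2 == 1:
        stubs.pop()  # odd total: discard the solitary leftover stub
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

def count_defects(edges):
    """Return (number of loops, number of surplus parallel edges)."""
    loops = sum(1 for a, b in edges if a == b)
    multiplicity = Counter(tuple(sorted(e)) for e in edges if e[0] != e[1])
    repeated = sum(m - 1 for m in multiplicity.values())
    return loops, repeated

rng = random.Random(1)  # fixed seed only for reproducibility
degrees = [3] * 10      # a random 3-regular multigraph on 10 vertices
edges = configuration_model(degrees, rng)
loops, repeated = count_defects(edges)
realised = Counter(v for e in edges for v in e)  # a loop counts twice
```

The multigraph is simple exactly when both counts returned by `count_defects` are zero.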

Our overall aim is that this mechanism gives a convenient way of simulating the uniform distribution on simple graphs with a given degree sequence. At present we have the uniform distribution on potential multigraphs, with a weighting of 1/k! for every multi-edge with multiplicity k, and a weighting of 1/2 for every loop. The latter can be seen because there is an initial probability proportional to $d(v_i)d(v_j)$ that vertices v_i and v_j will be joined, whereas there is a probability proportional (with the same constant) to roughly $\frac12 d(v_i)^2$ that v_i will receive a loop, since each unordered pair of stubs at v_i is counted only once. The multi-edge weighting justification is similar.

However, conditional on getting a simple graph, the distribution is uniform on the set of simple graphs with that degree sequence. So it remains to investigate the probability that a graph generated in this way is simple. So long as this probability does not tend to 0 as n grows, we will probably be happy.

The strongest results on this topic are due to Janson. First observe that if the sum of the degrees grows faster than the number of vertices n, we fail to get a graph without loops with high probability. Heuristically, note that on the first pass, we are taking two picks from the set of vertices, biased by the number of stubs. By Cauchy-Schwarz, Rearrangement Inequality or just intuition, the probability of getting the same vertex is greater than if we picked uniformly from the set of vertices without biasing. So the probability of getting no loop on the first pass is $\le (1-\frac{1}{n})$. Take some function a(n) that grows faster than n, but slower than the sum of the degrees. Then after a(n) passes, the degree distribution is still roughly the same. In particular, the sum of the degrees is still an order of magnitude greater than n. So we obtain:

$\mathbb{P}(\text{no loops})\leq (1-\frac{1}{n})^{a(n)}\approx e^{-\frac{a(n)}{n}}\rightarrow 0.$

So, since isolated vertices have no effect on the simplicity or otherwise, we assume the sum of the degrees is $\Theta(n)$. Then, Janson shows that the further condition

$\sum_{i=1}^n d_i^2=O(n),$

is essentially necessary and sufficient for simplicity. We can see why this might be true by looking at the probability that the first edge added is a loop, which is roughly

$\frac{\sum_i d_i(d_i-1)}{(\sum d_i)(\sum d_i-1)}\approx\frac{d_1^2+d_2^2+\ldots+d_n^2}{(\sum d_i)^2}.$

We have to consider $O(\sum d_i)$ edges, so if the above expression is much larger than the reciprocal of this, we can perform a similar exponential estimate to show that the probability there are no loops is o(1). The technical part is showing that this probability doesn’t change dramatically as the first few stubs disappear.

Note that in both cases, considering only loops is sufficient for simplicity. Although it looks like loop appearance is weaker than multiplicity of edges, in fact they have the same threshold. It should also be pointed out that, like the uniform random forests, an alternative approach is simply to count the number of simple graphs and multigraphs with a given degree sequence. Good asymptotics can then be found for the probability of simplicity.

In the case of G(n,c/n), we were particularly interested in the emergence of the giant component at time c=1. While first-moment methods can be very effective in demonstrating such results, a branching process local limit representation is probably the easiest heuristic for this phase transition.

So long as the degree sequences converge in a natural way, we can apply a similar approach to this configuration model. Concretely, we assume that the proportion of vertices with degree i is $\lambda_i$ in the limit. Although the algebra might push through, we should be aware that this means we are not explicitly specifying how many vertices have degree, eg $\Theta(n^{1/2})$. For now assume the $\lambda_i$s sum to 1, so specify a probability distribution for degree induced by choosing a vertex uniformly at random.

So we start at a vertex, and look at its neighbours. The expected number of neighbours of this root vertex is $\sum i\lambda_i$. Thereafter, when we consider a child vertex, based on how the stubs are paired up (and in particular the fact that the order of the operations does not matter, since the partner of a given stub is chosen uniformly at random), we are really choosing a stub uniformly at random. This corresponds to choosing a vertex at random, biased by the number of stubs available. The quantity of interest is how many additional stubs (other than the one that led to the vertex) are attached to this vertex. We assume we don’t need to worry too much about repeating vertices, in a similar way to G(n,c/n). So the expected number of additional stubs is

$\frac{1}{\sum i\lambda_i}\sum i\lambda_i(i-1).$

For an infinite component, we require the expectation to be > 1, which is equivalent to

$\sum \lambda_i i(i-2)>0.$

This was proven by Molloy and Reed (95), then with fewer conditions by Janson (07). The latter also shows how to use this construction to derive the giant component result for G(n,c/n).
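To see the criterion in action, here is a minimal sketch that evaluates both the expected number of additional stubs and the quantity $\sum \lambda_i i(i-2)$ for a given limiting degree distribution. The function name and the example distributions are invented for illustration.

```python
def giant_component_criterion(lam):
    """Given lam[i] = limiting proportion of vertices of degree i, return
    (expected number of additional stubs, sum of lam_i * i * (i - 2))."""
    mean_degree = sum(i * p for i, p in lam.items())
    # Size-biased degree minus the stub we arrived along.
    additional = sum(i * (i - 1) * p for i, p in lam.items()) / mean_degree
    criterion = sum(i * (i - 2) * p for i, p in lam.items())
    return additional, criterion

three_regular = giant_component_criterion({3: 1.0})    # supercritical
two_regular = giant_component_criterion({2: 1.0})      # exactly critical
mixed = giant_component_criterion({1: 0.5, 3: 0.5})    # half leaves, half degree 3
```

In each case the first entry exceeds 1 precisely when the second is positive, as the algebra above says it should.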

REFERENCES

Janson – A New Approach to the Giant Component Problem

Molloy, Reed – A Critical Point for Random Graphs with a Given Degree Sequence

Janson – The Probability that a Random Multigraph is Simple

# Characterisations of Geometric Random Graphs

Continuing the LMS-EPSRC summer school on Random Graphs, Geometry and Asymptotic Structure, we’ve now had three of the five lectures by Mathew Penrose on Geometric Random Graphs.

The basic idea is that instead of viewing a graph entirely abstractly, we now place the vertices in the plane, or some other real space. In many network situations, we would expect connectivity to depend somehow on distance. Agents or sites which are close together might be considered more likely to have the sort of relationship indicated by being connected with an edge. In the model discussed in this course, this dependence is deterministic. We have some parameter r, and once we have chosen the location of all the vertices, we connect a pair of vertices if the distance between them is less than r.

For the purposes of this, we work in a compact space [0,1]^d, and we are interested in the limit as the number of vertices n grows to infinity. To avoid the graph getting too connected, as in the standard random graph model, we take r to be a decreasing function of n. Anyway, we place the n points into the unit hypercube uniformly at random, and then the edges are specified by the adjacency rule above. In general, because r_n will be o(1), we won’t have to worry too much about boundary effects. The proportion of vertices within r_n of the boundary of the cube will be o(1). For some results, this is a genuine problem, when it may be easier to work on the torus.
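A minimal sketch of this construction, assuming Euclidean distance on the unit hypercube (the lectures may well allow other norms). The function name and parameter values are choices made for the example.

```python
import math
import random
from itertools import combinations

def random_geometric_graph(n, r, d, rng):
    """Place n uniform points in [0,1]^d and join pairs at distance less than r."""
    points = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
    edges = [
        (i, j)
        for i, j in combinations(range(n), 2)
        if math.dist(points[i], points[j]) < r  # Euclidean distance (an assumption)
    ]
    return points, edges

rng = random.Random(2)  # fixed seed only for reproducibility
n = 200
r = n ** -0.5           # in d = 2 this keeps n * r^d equal to 1
points, edges = random_geometric_graph(n, r, 2, rng)
```

The quadratic loop over pairs is fine for a few hundred points; for large n one would normally bucket the points into cells of side r first.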

In G(n,p), the order of np in the limit determines the qualitative structure of the graph. This is the expected degree of a given fixed vertex. In this geometric model, the relevant parameter is $nr_n^d$, where d is the dimension of the hypercube. If this parameter tends to 0, we say the graph is sparse, and dense if it tends to infinity. The intermediate case is called a thermodynamic limit. Note that the definition of sparse here is slightly different from G(n,p).

Much of the content of the first three lectures has been verifying that the distributions of various quantities in the graph, for example the total number of edges, are asymptotically Poisson. Although sometimes arguments are applicable over a broad spectrum, we also sometimes have to use different calculations for different scaling windows. For example, it is possible to show convergence to a Poisson distribution for the number of edges in the sparse case, from which we get an asymptotic normal approximation almost for free. In the denser regimes, the argument is somewhat more technical, with some substantial moment calculations.

A useful tool in these calculations is a collection of bounds derived via Stein’s method for sums of ‘almost independent’ random variables. For example, the presence or non-presence of edges between two pairs of vertices are independent in this setting if the pairs are disjoint, and the dependence is still only mild if they share a vertex. An effective description is via a so-called dependency graph, where we view the random variables as the vertices of a graph, with an edge between them if there is some dependence. This description doesn’t have any power in itself, but it does provide a concise notation for what would otherwise be very complicated, and we are able to show versions of (Binomials converge to Poisson) and CLT via these that are exactly as required for this purpose.

In particular, we are able to show that if $E_n$ is the total number of edges, under a broad set of scaling regimes, if $\lambda_n$ is the expected total number of edges, then $d_{TV}(E_n,\mathrm{Po}(\lambda_n))\rightarrow 0$, as n grows. This convergence in total variation distance is as strong a result as one could hope for, and when $\lambda_n\rightarrow\infty$, we can derive a normal approximation as well.

At this point it is worth discussing an alternative specification of the model. Recall that for a standard homogeneous random graph, we have the choice of G(n,m) and G(n,p) as definitions. G(n,m) is the finer measure, and G(n,p) can be viewed as a weighted mix of G(n,m). We can’t replicate this directly in the geometric setting because the edges and non-edges are a deterministic function of the vertex locations. What we can randomise is the number of vertices. Since we are placing the vertices uniformly at random, it makes sense to consider as an alternative a Poisson Point Process with intensity n. The number of vertices we get overall will be distributed as Po(n), which is concentrated near n, in the same manner as G(n,c/n).

As in G(n,p), this is a less basic model because it is a mixture of the fixed-vertex models. Let’s see how we would go about extending the total variation convergence result to this slightly different setting without requiring a more general version of the Poisson Approximation Lemma. To avoid having to define everything again, we add a prime (′) to indicate that we are talking about the Poisson Point Process case. Writing d(.,.) for total variation distance, the result we have is:

$\lim_{n\rightarrow\infty} d(E_n,\mathrm{Po}(\lambda_n))=0.$

We want to show that

$\lim_{n\rightarrow\infty}d(E_n',\mathrm{Po}(\lambda_n'))=0,$

which we can decompose in terms of expectations in the original model by conditioning on $N_n$

$\leq \lim_{n\rightarrow\infty}\mathbb{E}\Big[\mathbb{E}[d(E_{N_n},\mathrm{Po}(\lambda_n')) | N_n]\Big],$

where the outer expectation is over $N_n$. The observation here is that the number of points given by the Poisson process induces a measure on distributions, the overwhelming majority of which look quite like Poisson distributions with parameter n. The reason we have a less than sign is that we are applying the triangle inequality in the sum giving total variation distance:

$d(X,Y)=\frac12\sum_{k\geq 0}|\mathbb{P}(X=k)-\mathbb{P}(Y=k)|.$

From this, we use the triangle inequality again:

$\leq \lim_{n\rightarrow\infty} \mathbb{E}\Big[\mathbb{E}[d(E_{N_n},\mathrm{Po}(\lambda_{N_n})) | N_n]\Big]$

$+\lim_{n\rightarrow\infty}\mathbb{E}\Big[\mathbb{E}[d(\mathrm{Po}(\lambda_{N_n}),\mathrm{Po}(\lambda_n')) | N_n]\Big].$

Then, by a large deviations argument, we have that for any $\epsilon>0$, $\mathbb{P}(|N_n-n|\geq \epsilon n)\rightarrow 0$ exponentially in n. Also, total variation distance is, by definition, bounded above by 1. In the first term, the inner conditioning on $N_n$ is irrelevant, and we have that $E_{N_n}$ converges to the Poisson distribution for any fixed $N_n\in (n(1-\epsilon),n(1+\epsilon))$. Furthermore, we showed in the proof of the non-PPP result that this convergence is uniform in this interval. (This is not surprising: the upper bound is some well-behaved polynomial in 1/n.) So with probability $1- e^{-\Theta(n)}$, $N_n$ is in the region where this convergence happens, and elsewhere, the expected TV distance is bounded above by 1, so the overall expectation tends to 0. With a similar LD argument, for the second term it suffices to prove that when $\lambda\rightarrow\mu$, we must have $d(\mathrm{Po}(\lambda),\mathrm{Po}(\mu))\rightarrow 0$. This is ‘obviously’ true. Formally, it is probably easiest to couple the distributions $\mathrm{Bin}(n,\lambda/n),\mathrm{Bin}(n,\mu/n)$ in the obvious way, and carry the convergence of TV distance as the parameter varies through the convergence in n.

That all sounded a little bit painful, but is really just the obvious thing to do with each term – it’s only the language that’s long-winded!
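For the last step, the claim that $d(\mathrm{Po}(\lambda),\mathrm{Po}(\mu))\rightarrow 0$ as $\lambda\rightarrow\mu$ is easy to check numerically. The sketch below uses the normalisation with a factor of 1/2, under which total variation distance is at most 1, and truncates the sum at an arbitrary cut-off `kmax`, which is harmless for small parameters. The function names and the parameter values are choices made for the example.

```python
import math

def poisson_pmf(lam, k):
    """P(Po(lam) = k) for lam > 0, computed on the log scale to avoid overflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tv_poisson(lam, mu, kmax=200):
    """Total variation distance between Po(lam) and Po(mu), truncated at kmax."""
    return 0.5 * sum(
        abs(poisson_pmf(lam, k) - poisson_pmf(mu, k)) for k in range(kmax)
    )

# Distances from Po(3) as the second parameter approaches 3.
gaps = [1.0, 0.1, 0.01]
distances = [tv_poisson(3.0, 3.0 + gap) for gap in gaps]
```

The distances shrink with the gap, consistent with the coupling in which $\mathrm{Po}(\mu)$ is written as $\mathrm{Po}(\lambda)$ plus an independent $\mathrm{Po}(\mu-\lambda)$ variable for $\mu>\lambda$, which bounds the distance by $1-e^{-(\mu-\lambda)}$.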

Anyway, I’m looking forward to seeing how the course develops. In particular, when you split the space into small blocks, the connectivity properties resemble those of (site) percolation, so I wonder whether there will be concrete parallels. Also, after reading about some recent results concerning the metric structure of the critical components in the standard random graph process, it will be interesting to see how these compare to the limit of a random graph process which comes equipped with metric structure for free!

# Preferential Attachment Models

I’ve just read a really interesting paper by Peter Morters and Maren Eckhoff that made me feel I should look up some of the background and write a quick post. I may get onto some of the results in the paper at the end of this post, but I want to start by saying a bit about the model itself. I’ve spoken about this briefly in a previous post about several descriptions of complex networks, but I think it’s worth having a second attempt.

We seek a model for random graphs that gives a distribution which exhibits some of the properties of the sort of complex networks seen in the real world. In particular, whereas the degree distribution is Poisson, and so concentrated with exponential tails for the Erdos-Renyi random graph, data indicates that a better model for most applications would have power law tails for this degree distribution.

Albert and Barabasi propose growing such a graph via a so-called preferential attachment scheme. We start with some small possibly empty graph, and add new vertices one at a time. For each new vertex, we add exactly M edges between the new vertex and the vertices already present. The choice of these M other vertices is given by weighting by the degree of the (pre-existing) vertices. That is, vertices with large degree are more likely to be joined to new vertices. This is obviously designed to replicate some of the behaviour seen in say the formation of the internet, where new sites are more likely to link to established and popular sites (Google, Youtube and so on) than a uniformly chosen site.
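A minimal simulation sketch of this growth rule follows. Two details are assumptions made for the example rather than part of the description above: the process is started from a complete graph on M+1 vertices, and the M targets of each new vertex are sampled with replacement, so multiple edges are allowed. Choosing a uniformly random entry from the list of all edge endpoints is the same as choosing a vertex with probability proportional to its degree.

```python
import random
from collections import Counter
from itertools import combinations

def preferential_attachment(n, M, rng):
    """Grow a graph on n vertices where each new vertex sends M edges to
    existing vertices chosen with probability proportional to degree."""
    # Assumed starting graph: a complete graph on vertices 0..M.
    edges = list(combinations(range(M + 1), 2))
    endpoints = [v for e in edges for v in e]  # vertex v appears deg(v) times
    for new in range(M + 1, n):
        # Pick all M targets before adding any edge, so that the new vertex
        # cannot choose itself. Repeats give multiple edges.
        targets = [rng.choice(endpoints) for _ in range(M)]
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges

rng = random.Random(4)  # fixed seed only for reproducibility
edges = preferential_attachment(200, 2, rng)
degree = Counter(v for e in edges for v in e)
```

Tabulating `degree.values()` for larger n gives a quick empirical look at the tail of the degree distribution.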

This model has a couple of problems. Firstly, it is not immediately obvious how to start it. Obviously we need M vertices present for the PA dynamics to start working. In fact, whether one starts with an empty graph or a complete graph on M vertices makes little difference to the large n behaviour. Trickier is the question of multiple edges, which may emerge if we define the PA dynamics in the natural way, that is for each of the M edges in turn. Overcoming this is likely to be annoying.

Bollobas and Riordan do indeed overcome these possible problems in a formal way, and prove that a version of this model does indeed have power law decay of the degree distribution, with exponent equal to 3. The model in the paper instead joins new vertex (n+1) to old vertex m with probability:

$\frac{f(\text{in-degree of }m\text{ at time }n)}{n},$

where f is some function, which for now we assume has the form $f(k)=\gamma k+\beta$. Since the vertices are constructed one at a time, it is well-defined to orient these edges from new to old vertices, hence this notion of in-degree makes sense.

It was not obvious to me that this model was more general than the Bollobas/Riordan model, but we will explain this in a little while. First I want to explain why the Bollobas/Riordan model has power law tails, and how one goes about finding the exponent of this decay, since this was presented as obvious in most of the texts I read yet is definitely an important little calculation.

So let’s begin with the Bollobas/Riordan model. It makes sense to think of the process in terms of time t, so there are t – M vertices in the graph. But if t is large, this is essentially equal to t. We want to track the evolution of the degree of some fixed vertex v_i, the ith vertex to be formed. Say this degree is d(t) at time t. Then the total number of edges in the graph at time t is roughly tM, so the sum of all the degrees is roughly 2Mt. Therefore, the probability that a new vertex gets joined to vertex v_i is roughly $\frac{Md}{2Mt}$, where the M appears in the numerator because there are M fresh edges available. Note that we have ignored the possibility of trying to connect multiple edges from the new vertex to v_i, so this holds provided d is substantially smaller than t. With the boundary condition $d(i)=M$, this leads to the simple ODE

$\dot{d}=\frac{d}{2t}\quad \Rightarrow\quad d=M(\frac{t}{i})^{1/2}.$

To me at least it was not immediately clear why this implied that the tail of the degree distribution had exponent 3. The calculation works as follows. Let D be the degree of a vertex at large time t, chosen uniformly at random.

$d_i\propto (\frac{t}{i})^{1/2}$

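$\Rightarrow\quad \mathbb{P}(D\geq d)=\frac{1}{t}|\{i:(\frac{t}{i})^{1/2}\geq d\}|=\frac{1}{t}|\{i:i\leq \frac{t}{d^2}\}|=\frac{1}{d^2}$

So, up to the constant of proportionality, the tail probability decays like $d^{-2}$. Differentiating in d, the density of D decays like $d^{-3}$, which is the claimed exponent 3.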
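As a sanity check on this counting argument, here is a small Python sketch that takes the ODE prediction $d_i=M(t/i)^{1/2}$ at face value and counts the fraction of vertices whose predicted degree is at least d. The function name `tail_fraction` and the parameter values are assumptions chosen for illustration; nothing random is simulated here.

```python
import math

def tail_fraction(t, M, d):
    """Fraction of vertices i = 1..t whose predicted degree M*sqrt(t/i) is at least d."""
    count = sum(1 for i in range(1, t + 1) if M * math.sqrt(t / i) >= d)
    return count / t

t, M = 100_000, 3
for d in (6, 12, 30):
    # Compare the counted tail with the prediction (M/d)^2.
    print(d, tail_fraction(t, M, d), (M / d) ** 2)
```

The two printed columns should agree closely, with the constant $M^2$ playing the role of the constant of proportionality.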

Now we consider the Eckhoff / Morters model. The main difference here is that, instead of assuming that each new vertex comes with a fixed number of edges, the new vertex joins to each existing vertex independently, with probability depending linearly on the in-degree of the existing vertex. More precisely, they assume that edges are directed from new vertices to old vertices, and then each new vertex n+1 is joined to vertex m<n+1 with probability $\frac{f(\text{indegree of }m\text{ at time }n)}{n}\wedge 1$, where $f(k)=\gamma k +\beta$, for $\gamma\in[0,1), \beta>0$.

I was stuck for a long time before I read carefully enough the assertion that $\beta>0$. Of course, if this doesn’t hold, then the graph won’t grow fast enough. For, since the function f is now linear, we can lift the statement about evolution of the degree of a vertex to a statement about the evolution of the total number of edges. Note that each edge contributes exactly one to the total number of in-degrees. So we obtain

$\dot{E}=\frac{\gamma E}{t}\quad\Rightarrow E(t)\propto t^\gamma.$

In particular, this is much less than t, so the majority of vertices have small degree. The reason is fairly clear in fact: since the preferential attachment mechanism depends only on in-degree, if f(0)=0 then, since the in-degree of a new vertex will always be zero by construction, there is no way to get an additional edge to that vertex. So all the edges in the graph for large t will be incident to a vertex that had positive in-degree in the time 0 configuration. Hence we need $\beta>0$ for the model to be meaningful. Note that this means we effectively have an Erdos-Renyi type mechanism AND a preferential attachment evolution: for each new vertex, we add roughly $\beta$ edges to existing vertices chosen uniformly at random (rather than by a PA method) and also some assigned via PA. A previous paper by Dereich and Morters shows that the asymptotic degree distribution has a power law tail with exponent

$\tau:=\frac{\gamma+1}{\gamma}.$

Note that $\gamma=\frac12$ gives the same exponent (3) as the Bollobas / Riordan model.

We can apply a similar ODE approximation as above to estimate the likely large time behaviour of the number of edges:

$E'=\frac{\gamma E + \beta t}{t}.$

So since $E'\geq \beta$, we have $E\geq \beta t$. Defining $F(t)=E(t)/t$, so that $E'=F+tF'$, we get:

$tF'(t)=\beta-(1-\gamma)F(t)$        (1)

Noting that F’ is positive when $F< \frac{\beta}{1-\gamma}$ and negative when $F>\frac{\beta}{1-\gamma}$ suggests that for large t, this is an equilibrium point for F and hence $E(t)\approx \frac{\beta t}{1-\gamma}$. Obviously, this is highly non-rigorous, as F’ can be very small and still satisfy the relation (1), so it is not clear that the ‘equilibrium’ for F is stable. Furthermore, one needs to check that the binomial variables that supply the randomness to this model are sufficiently concentrated that this approximation by expectation is reasonable.
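In fact, within the ODE approximation, relation (1) can be solved explicitly, which settles the stability worry at this heuristic level (though not the question of concentration). Writing $G(t)=F(t)-\frac{\beta}{1-\gamma}$, where G and the constant C below are notation introduced only for this aside, relation (1) becomes

$tG'(t)=-(1-\gamma)G(t)\quad\Rightarrow\quad G(t)=Ct^{-(1-\gamma)},$

and hence

$F(t)=\frac{\beta}{1-\gamma}+Ct^{-(1-\gamma)},\qquad E(t)=\frac{\beta t}{1-\gamma}+Ct^{\gamma}.$

Since $\gamma<1$, the correction term is of lower order than t, so every solution of the ODE has $E(t)/t\rightarrow\frac{\beta}{1-\gamma}$.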

Nonetheless, as a heuristic this is not completely unsatisfactory, and it leads to the conclusion that E(t) is a linear function of t, and so the distribution of the out-degrees for vertices formed at large times t is asymptotically Poisson, with parameter

$\lambda =\frac{\beta\gamma}{1-\gamma}+\beta=\frac{\beta}{1-\gamma}$.

Note that this is the same situation as in Erdos-Renyi. In particular, it shows that all the power tail behaviour comes from the in-degrees. In a way this is unsurprising, as these evolve in time, whereas the out-degree of vertex t does not change after time t. Dereich and Morters formalise this heuristic with martingale analysis.
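For what it is worth, the heuristic is easy to probe numerically. The following Python sketch simulates the attachment rule described above with $f(k)=\gamma k+\beta$ and compares the number of edges per vertex with $\frac{\beta}{1-\gamma}$. The function name `simulate_in_degrees`, the single-vertex starting configuration and the parameter values are my own assumptions, and a single run at modest size is an illustration rather than evidence.

```python
import random

def simulate_in_degrees(n, gamma, beta, seed=0):
    """Grow the graph to n vertices and return the list of in-degrees."""
    rng = random.Random(seed)
    in_deg = [0]                      # assumed start: a single vertex with no edges
    for existing in range(1, n):      # `existing` vertices are present so far
        for m in range(existing):
            # Edge from the new vertex to m with probability f(in-degree)/existing, capped at 1.
            p = min(1.0, (gamma * in_deg[m] + beta) / existing)
            if rng.random() < p:
                in_deg[m] += 1
        in_deg.append(0)              # the new vertex starts with in-degree zero
    return in_deg

gamma, beta, n = 0.5, 1.0, 1500
in_deg = simulate_in_degrees(n, gamma, beta)
print("edges per vertex:", sum(in_deg) / n, "heuristic limit:", beta / (1 - gamma))
print("largest in-degree:", max(in_deg))
```

Each existing vertex is examined once per arrival, so updating its in-degree immediately does not affect the other attachment decisions made at the same time step.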

The reason we are interested in this type of model is that it better reflects networks seen in real life. Some of these networks are organic, and so there it is natural to consider some form of random destructive mechanism, for example lightning, that kills a vertex and all its edges. We have to compare this sort of mechanism, which chooses a vertex uniformly at random, against a targeted attack, which deletes the vertices with largest degree. Note that in Erdos-Renyi, the largest degree is not much larger than the size of the typical degree, because the degree distribution is asymptotically Poisson. We might imagine that this is not the case in some natural networks. For example, if one wanted to destroy the UK power network, it would make more sense to target a small number of sub-stations serving large cities, than, say, some individual houses. However, a random attack on a single vertex is unlikely to make much difference, since the most likely outcome by far is that we lose only a single house etc.

In Eckhoff / Morters’ model, the oldest vertices by construction have roughly the largest degree, so it is clear what targeting the most significant $\epsilon n$ vertices means. They then show that these vertices include all the vertices that give the power law behaviour. In particular, if you remove all of these vertices and, obviously, the edges incident to them, you are left with a graph with exponential tail in the asymptotic degree distribution, with largest degree on the order of log n. It was shown in a previous paper that this type of network is not vulnerable to random removal of nodes. Perhaps most interestingly, these authors now prove that after removing the most significant $\epsilon n$ vertices, the network IS now vulnerable to random removal of nodes, leading to the conclusion that it is preferable to experience a random attack followed by a targeted attack than vice versa!

In a future (possibly distant) post, I want to say some slightly more concrete things about how these processes link to combinatorial stochastic processes I understand slightly better, in particular urn models. I might also discuss the configuration model, an alternative approach to generating complex random networks.

# Minimum Spanning Trees

In my last post, I discussed the Uniform Spanning Tree. To summarise very briefly, given a connected graph on n vertices, a spanning tree is a subgraph, that is a subset of the edges, which connects all n vertices but contains no cycles. It turns out this requires the tree to have n-1 edges.

We are interested in natural mechanisms for generating randomly chosen spanning trees of a given graph. One way we can always do this is to choose uniformly at random from the set of possible trees. This UST is in some sense canonical, but it is worth knowing about some other measures on trees that might be of interest.

A family of natural problems in operations research concerns an arbitrary complex network, with some weight or cost associated to each connection. The question is how to perform some operation on the network so as to minimise the resulting cost. Perhaps the most famous such problem is that of the Travelling Salesman. The story is that a salesman needs to visit n locations and wants to do the trip as efficiently as possible. This might be thought of as some sort of financial or time cost, but probably the easiest way to set it up is to imagine he is trying to minimise the distance he has to travel. It is not hard to see why this problem might genuinely arise in plenty of real-world situations, where an organisation or agent is trying to be as efficient as possible.

It might be the case that it is not possible to travel between every pair of locations, but we needn’t assume that for now. So if he knows the distance between any pair of cities, he wants to know which of the possible routes gives the shortest overall distance. The problem is that there are n! routes, and this grows roughly like n^n, which is faster than exponential, so for as few as 20 cities it has turned into a comparison which is too large to compute.

There are various algorithms which reduce the number of routes that must be checked, and some approximation methods. But if you want the exact answer, no algorithm is currently known that calculates it in polynomial time.

Minimal Spanning Trees and Uniqueness

For the travelling salesman, we were looking for the minimal cost spanning path. In the case of the complete graph, this is the same as the minimal cost non-repeating path of length n-1. Such paths are a subset of the set of spanning trees on the underlying graph. So what if we look instead for the minimal cost spanning tree? This exists as after all, there are only finitely many spanning trees.

So far, this has been deterministic, but we were looking for a random spanning tree. We can achieve this by choosing the weights at random. Anything other than assigning the weights as an IID sequence seems likely to be complicated, but there isn’t a canonical choice of the distribution of the weights. Our first question will be whether the distribution of the weights affects the distribution of the induced MST. In fact it will turn out that so long as the distribution is continuous, it has no effect on the distribution of the MST. The continuous condition might seem odd, but it is present only to ensure that the weights almost certainly end up generating a unique MST.

It turns out that there is a straightforward greedy algorithm to find the MST once the weights are known. We will examine some consequences of this algorithm in the random setting. First we check uniqueness. The condition required for uniqueness is that the weights be distinct. Note that this is slightly weaker than the statement that all the sums of (n-1)-tuples of weights be distinct, which immediately implies a unique MST.

We now prove this condition. Suppose we have distinct weights, and an associated MST. If the underlying graph is a tree, then the result is clear. Otherwise, add to the MST some extra edge e of the underlying graph, with weight w(e). By the definition of a tree, this generates exactly one cycle. Consider the other edges, say $e_1,\ldots,e_k$ in this cycle. If $w(e_i)>w(e)$ for any i, then we can replace e_i with e to get a spanning tree with smaller weight, a contradiction of the claim that we started with an MST. So by distinctness of weights, we conclude that $w(e)>w(e_i)$ for all i.

Conversely, suppose we remove some edge e which IS in the MST. We end up with exactly two connected components. Consider all the edges in the underlying graph between the two components, and suppose that one of these f satisfies w(f)<w(e). Then if we add in edge f, which is by construction not in the original MST, we end up with a smaller total weight than we started with, a further contradiction.

We can summarise this in a neat form. Given an edge e between x and y, consider the subgraph consisting of all edges in the underlying graph with weight LESS THAN w(e). Then the edge e is in the MST precisely when x and y lie in different components of this subgraph. Since we have an explicit description of which edges are present, it follows that the MST is unique. The problem is that working out the component structure of the graph with higher weights removed is computationally rather intensive. We want a slightly faster algorithm.

Kruskal’s Algorithm

Several rather similar algorithms were developed roughly simultaneously. Prim’s algorithm is a close relative of what we will discuss. Anyway, for now we consider Kruskal’s algorithm which has the advantage that it can be described without really needing to draw a diagram.

We start by ordering the weights. Without loss of generality, we might as well relabel the edges so that

$w(e_1)< w(e_2)<\ldots< w(e_{|E|}).$

Now, by the condition derived in the argument for uniqueness, we must have e_1 and e_2 in any MST. Now consider e_3. Unless doing so would create a cycle, add e_3. Then, unless doing so would create a cycle, add e_4. Continue. It is clear that the result of this procedure is acyclic. To check it is actually a spanning tree, we show that it is also connected. Suppose not, and two of the components are A and B. Let e be the edge between A and B with minimal weight. According to the algorithm, we should have included e in our MST because at no point would adding it possibly have created a cycle. So we have proved that this greedy algorithm does indeed give the (unique) MST.
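The procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than a tuned implementation: the function name `kruskal`, the edge format `(weight, u, v)` and the example graph are my own assumptions, and the union-find structure used to detect cycles is a standard bookkeeping device that the description above does not mention.

```python
def kruskal(n, weighted_edges):
    """Return the MST of a connected graph on vertices 0..n-1 as a list of (weight, u, v)."""
    parent = list(range(n))          # union-find forest: one component per vertex initially

    def find(x):
        # Follow parent pointers to the representative of x's component,
        # shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(weighted_edges):   # lightest edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                 # different components, so no cycle is created
            parent[root_u] = root_v
            tree.append((w, u, v))
    return tree

example_edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3), (5, 1, 3), (6, 0, 3)]
print(kruskal(4, example_edges))   # [(1, 0, 1), (2, 1, 2), (4, 2, 3)]
```

In the example, the edge of weight 3 is rejected because its endpoints are already joined by the two lighter edges, exactly as in the cycle argument above.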

A useful consequence of this is that we know the two edges with overall minimum weight are definitely in the MST. In the search for a random measure on spanning trees, what is most important is that we didn’t use the actual values of the weights in this construction, only the order. In other words, we might as well have assumed the weights were a uniformly chosen random permutation from $S_{|E|}$. This now answers our original question about how the random weight MST depends in distribution on the underlying edge weight distribution. So long as with probability one the weights are distinct (which holds if the distribution is continuous), then the distribution of the resulting spanning tree does not depend on the weight distribution.

It’s not too hard to show this isn’t the same as UST: n=4 suffices as a counterexample. But the difference in asymptotic behaviour of properties such as the diameter is of interest, and will be explored in the next post.
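The n=4 claim can be checked by brute force. Since only the order of the weights matters, the following Python sketch runs the greedy rule on all 6! orderings of the edges of the complete graph on 4 vertices and counts how often each spanning tree appears. The helper names `greedy_tree` and `is_star` are assumptions made for this illustration.

```python
from collections import Counter
from itertools import combinations, permutations

def greedy_tree(order, n):
    """Apply the greedy rule to edges listed from lightest to heaviest."""
    label = list(range(n))               # component label of each vertex
    tree = []
    for u, v in order:
        if label[u] != label[v]:         # adding (u, v) creates no cycle
            old, new = label[u], label[v]
            label = [new if c == old else c for c in label]
            tree.append((u, v))
    return frozenset(tree)

def is_star(tree, n):
    """True if some vertex lies on every edge of the tree."""
    return any(all(c in e for e in tree) for c in range(n))

n = 4
edges = list(combinations(range(n), 2))  # the 6 edges of the complete graph
counts = Counter(greedy_tree(order, n) for order in permutations(edges))

for tree, k in sorted(counts.items(), key=lambda item: sorted(item[0])):
    print(sorted(tree), "star" if is_star(tree, n) else "path", k)
```

Under the UST each of the 16 spanning trees would be produced by 720/16 = 45 orderings. The enumeration instead gives 48 orderings for each of the 4 stars and 44 for each of the 12 paths, so the MST with random weights slightly favours stars.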

# Beyond Erdos-Renyi: more realistic models of networks

The claim is often made that the study of random graphs such as the Erdos-Renyi model is worthwhile because it gives us information about complex systems which exist in the real world. The internet or social networks provide the example du jour at the moment, but it’s equally plausible to think about traffic flows, electrical systems or interacting biological processes too.

If this were entirely true, it would be great for two reasons. Firstly, in my opinion at least, it is a beautiful subject in its own right, and to have a concrete applicable reason to continue studying it would make it even better. (Not to mention the dreaded competition for funding…) Secondly, Erdos-Renyi is so simple. After all, it involves little more than adding some simple topology to a collection of IID Bernoulli random variables, and so it would surely be possible to draw some significant conclusions about how complicated real-world objects interact without too much mathematical effort.

Unfortunately, but unsurprisingly, this simplicity is a drawback as far as applications go. It is fairly clear that most real-world systems cannot offer any property even approaching the niceness of the independent, same probability edges condition. But rather than consign E-R to the ‘pretty but useless’ category of mathematical structures, we should think carefully about exactly why it fails to be a good model for real-world networks, and see whether there are any small adjustments that could be made to improve it.

This is something I’ve been meaning to read up about for ages and ages. What follows is based heavily on the Albert and Barabasi 2002 review paper. I suspect that many of the open problems and intuitive calculations have since been finished and formalised, but for an overview I hope that doesn’t matter hugely. I’ve also leafed through the relevant chapters of Remco Van der Hofstad’s notes, but am setting the details and the exercises aside for the holidays when I have a bit more time!

Problems with Erdos-Renyi

Recall that G(n,p) takes n vertices, and adds edges between any pair of vertices independently with probability p.

One property shared by most real-world networks is the scale-free phenomenon, which says that the degree distribution has a power law tail. The Albert-Barabasi paper gives a comprehensive survey of data verifying this claim. By contrast, G(n,p) has degree distribution which is approximately Poisson as n grows. This is concentrated near the average degree with a thin exponential tail, so does not satisfy this requirement. I was and still am a bit confused by the term ‘scale-free’. The idea is certainly that the local structure is independent of the size of the system, which seems to be true for the degree distributions in sparse ER, that is where p = O(1/n). But I think the correct heuristic is that it doesn’t matter how far zoomed in you are – the macroscopic structure looks similar for n vertices as for n^2 vertices. This certainly fails to be true for ER, where no vertex has O(n) neighbours, whereas with a power law tail, this does hold.

The main consequence of this is that there are a few vertices with very high degree. These are often called ‘hubs’ and parallels are drawn to the internet, where key websites and servers connect lots of traffic and pages from different areas. The idea is that the hubs are almost certainly well-connected to each other, and this offers a step towards a small-world phenomenon, where the shortest path between any two vertices is very small relative to the size of the system. This notion was introduced to mainstream culture by Stanley Milgram’s ‘Six degrees of separation’ experiment in the 60s, where it became clear that subjects were able to deliver a package to a complete stranger on the other side of the USA, using only personal contacts, in about six stages. The graph theoretic notion for this is the diameter, defined as the maximal graph distance between two points. Here, the graph distance means the length of the shortest path between the points. This definition, with the max-min formalism looks rather complicated, but isn’t really. In the sparse regime where the average degree np is fixed and greater than 1, the diameter of the giant component of an Erdos-Renyi graph increases like log n, which is small relative to n, and so this property holds.

A quick glance at your list of Facebook friends will confirm that the independent edges condition in an Erdos-Renyi random graph is not a plausible model for social networks. How many friends do you have? Let’s say about 1000, more to make the calculation easier than because you’re necessarily very popular. How many does your friend Tom have? Let’s say 1000 again. As was in the news a few months ago, there are now over a billion people on Facebook. Let’s say exactly a billion (that is 10^9 for these purposes). So both you and Tom are friends with 1/10^6 of the total membership of the network. So how large would you expect the overlap of your friendships to be, if they were all chosen independently at random? Well, the probability that you are both friends with Alice is 10^-12, and so the expected number of your mutual friends is 10^-12 x 10^9 = 10^-3 which is substantially less than 1. Yet I imagine if you substituted names suitably, you and Tom might well have over 50 mutual friends if you were, say, in the same year at school or university and haven’t yet purged your list.
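The arithmetic is short enough to check exactly. This tiny Python snippet uses the round numbers from the paragraph above (1000 friends each, a billion members); the variable names are only for illustration.

```python
from fractions import Fraction

members = 10**9                                   # people on the network
friends = 1000                                    # friends each of you has
p_friend = Fraction(friends, members)             # chance a given member is your friend
expected_mutual = members * p_friend ** 2         # independence: multiply the two chances
print(expected_mutual)                            # 1/1000
```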

We want a statistic that records this idea quantitatively. There are various candidates for such a clustering coefficient. The underlying notion that we might expect there to be greater connectivity between neighbours of some fixed point v than in the graph as a whole gives an intuition for a possible definition. Compare the proportion of triangles in the graph to the cube of the proportion of edges. When this ratio is large, then there is a lot of clustering. In the E-R case, we would expect these to be equal, as the probability of forming a triangle is equal to the cube of the probability of the presence of each of the three independent edges that make up the triangle.
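Here is a minimal Python sketch of the statistic just described: the proportion of vertex triples forming a triangle, divided by the cube of the proportion of vertex pairs joined by an edge. The function names `clustering_ratio`, `erdos_renyi` and `ring`, and the choice of a ring lattice as the clustered comparison graph, are my own assumptions for illustration.

```python
import random
from itertools import combinations

def clustering_ratio(n, edges):
    """(proportion of triples that are triangles) / (proportion of pairs that are edges)^3."""
    edge_set = {frozenset(e) for e in edges}
    triangles = sum(
        1 for a, b, c in combinations(range(n), 3)
        if frozenset((a, b)) in edge_set
        and frozenset((b, c)) in edge_set
        and frozenset((a, c)) in edge_set
    )
    pairs = n * (n - 1) // 2
    triples = n * (n - 1) * (n - 2) // 6
    return (triangles / triples) / (len(edge_set) / pairs) ** 3

def erdos_renyi(n, p, seed=0):
    """Each pair of vertices is joined independently with probability p."""
    rng = random.Random(seed)
    return [e for e in combinations(range(n), 2) if rng.random() < p]

def ring(n, k):
    """Each vertex is joined to its k nearest neighbours on each side of a cycle."""
    return [(i, (i + j) % n) for i in range(n) for j in range(1, k + 1)]

n = 60
print("E-R ratio: ", clustering_ratio(n, erdos_renyi(n, 0.2)))
print("ring ratio:", clustering_ratio(n, ring(n, 2)))
```

For the independent-edge graph the ratio should come out close to 1, while the ring lattice, where neighbours of a vertex are very likely to be neighbours of each other, gives a much larger value.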

So we have three properties of real networks that we would like to incorporate into a model: small diameter, power-law degree distribution, and high clustering. To avoid this turning into a book, I’m going to write a paragraph about each of the possibilities discussed by Albert and Barabasi.

Generalised Random Graph

The degree distribution will typically emerge as a consequence of the construction of a given model. The general idea here is to condition on the degree distribution having the form we want, and see what this does to the structure. Of course, the choice of how to do this conditioning is absolutely key. It certainly isn’t obvious what it means to ‘condition G(n,p) to have power-law distribution’, since the very idea of a power-law vs exponential tail requires the number of vertices to be large.

The first idea for achieving this gives the vertices ‘stubs’, which join up in pairs to form edges. We decide on the distribution of stubs according to this power law, then pair them up uniformly at random. Obviously, there is a possibility of getting some loops, but this is not going to happen so often as to be a genuine problem in the limit. This construction is similarly open to the branching process exploration ideas well covered for the E-R random graph, though we have to be careful to size-bias the degree distributions when necessary. There is still an underlying independence in the location of edges though, so it is reasonably clear that the amount of clustering may be closer to E-R than to the real examples cited.
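The stub-pairing construction is simple to sketch in Python. The function name `configuration_model` and the small degree sequence below are assumptions for illustration (in practice the degrees would be drawn from the desired power law), and no attempt is made to remove the loops or multiple edges that may appear.

```python
import random
from collections import Counter

def configuration_model(degrees, seed=0):
    """Pair up stubs uniformly at random; returns a list of edges (loops and repeats allowed)."""
    if sum(degrees) % 2:
        raise ValueError("the total number of stubs must be even")
    rng = random.Random(seed)
    # Vertex v contributes degrees[v] stubs.
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    # Shuffling and pairing consecutive stubs gives a uniform random matching.
    rng.shuffle(stubs)
    return list(zip(stubs[::2], stubs[1::2]))

degrees = [3, 2, 2, 1, 1, 1]
edges = configuration_model(degrees)
print(edges)
# Every stub is used exactly once, so the degree sequence is recovered
# (a loop counts twice towards the degree of its vertex).
realised = Counter(v for e in edges for v in e)
print([realised[v] for v in range(len(degrees))])
```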

The other possibility suggested is to retain the independent edge property, but give the vertices weights, and let the probability of an edge between two vertices be some sensible function of the weights. In the end it turns out to make little difference whether the weights are chosen deterministically or randomly, but by taking the weights i.i.d. with infinite mean, we can generate a so-called generalised random graph where the degree distribution has a power law.

Watts-Strogatz

In the WS model, the idea is to interpolate between a graph with maximal clustering and a random graph. A d-regular graph, say on a ring, where every vertex is connected to its d nearest neighbours has high clustering, but large diameter, as for example it takes roughly n/2d steps to get to the other side of the ring. Whereas in the standard E-R model we add edges with some fixed probability p, here we replace edges with some fixed probability p. That is, we take an edge in the regular graph and with some small probability we remove it and instead add an edge between two vertices chosen uniformly at random. The theoretical motivation is that removing a few edges doesn’t destroy the high clustering evident in the regular graph, but even a sparse random graph has small diameter, so adding a few ‘long-range’ edges should be enough to decrease the diameter significantly.

It obviously needs to be checked that a substantial drop in diameter occurs before a substantial decrease in clustering, and there is a calculation and diagram to support this intuitive idea in the paper. The one drawback of this model is that it fails to provide the power-law degree distributions we want. After all, an E-R graph has a concentrated degree distribution, and a d-regular graph has all degrees the same, so we would expect some interpolation between the two to have a concentrated distribution as well. Nonetheless, this model accords well with an idea of how complex networks might form, particularly if there is some underlying geometry. It is reasonable to assume that an initial setup for a network would be that people are connected to those closest to them, and then slowly acquire distant contacts as time progresses.
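The replacement mechanism can be sketched as follows in Python. This is a rough illustration under my own assumptions, not the exact procedure from the Watts-Strogatz paper: the function names are made up, each original edge is replaced with probability p by an edge between a uniformly chosen pair of distinct vertices, and a proposed replacement is redrawn if it is already present.

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Each vertex is joined to its k nearest neighbours on each side of a cycle."""
    return {frozenset((i, (i + j) % n)) for i in range(n) for j in range(1, k + 1)}

def rewire(edges, n, p, seed=0):
    """Replace each edge, with probability p, by an edge between a uniform pair of vertices."""
    rng = random.Random(seed)
    result = set(edges)
    for e in sorted(edges, key=sorted):
        if rng.random() < p:
            result.remove(e)
            while True:
                f = frozenset(rng.sample(range(n), 2))
                if f not in result:        # avoid creating a multiple edge
                    result.add(f)
                    break
    return result

def diameter(edges, n):
    """Largest finite graph distance, found by breadth-first search from every vertex."""
    adj = {i: [] for i in range(n)}
    for e in edges:
        u, v = tuple(e)
        adj[u].append(v)
        adj[v].append(u)
    best = 0
    for s in range(n):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        best = max(best, max(dist.values()))
    return best

n, k = 100, 2
lattice = ring_lattice(n, k)
print("ring diameter:", diameter(lattice, n))
print("after rewiring with p = 0.1:", diameter(rewire(lattice, n, 0.1), n))
```

The number of edges is unchanged by the rewiring, so any drop in the second printed value comes purely from the few long-range edges.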

Preferential Attachment – Barabasi-Albert model

Most of our intuition for networks can be extended to an intuition for the formation of networks. The idea of prescribing a degree distribution is neat, but it doesn’t give any account of the mechanism of formation. Complexity emerges over time, and a good model should be able to describe why this happens. The Barabasi-Albert model takes this as its starting point, with the aim of producing a highly clustered system dynamically. Recall that we can describe G(n,p) as a process by coupling, then increasing p from 0 to 1, and seeing edges emerge. The independence assumption can be lifted through the coupling, and so which edge appears next is independent of the current state of the system.

This is what we need to relax. Recall the motivating idea of ‘hubs’, where a small collection of vertices have very high connectivity across the whole system, as observed in several real situations. A consequence of this is that a new edge is more likely to be attached to a hub than to a pair of poorly connected vertices elsewhere. But it turns out that this idea of preferential attachment isn’t enough by itself. Because as a network forms, it is not just the connectivity that increases, but also the size of the system itself. So in fact it makes sense to add vertices rather than edges, and join the new vertices to existing vertices in proportion to the degrees of the existing vertices. This combination of growth and preferential attachment is key to the scale-free graphs that this Barabasi-Albert model generates. Relaxing either mechanism returns us to the case of exponential tails. However, there are methods in the literature for generating such graphs without the need for a dynamic model, but they are harder to understand and describe. None I have seen so far has a high clustering coefficient.

Hubs are effectively a way to reduce the diameter. Recall the description of Milgram’s experiment where he encouraged randomly chosen people to send a package to Harvard. For the purposes of this model, an undergraduate from Wyoming or a husband from Alabama moving in with his wife in Boston are clear hubs, as for very many people near their previous home, they represent a good connection to Harvard. So it is unsurprising that BA, which reinforces hubs, has a small, sub-logarithmic diameter.

Conclusions

I’m not entirely sure what conclusions I should draw from my reading. Probably the main one is that I should read more as there is plenty of interesting stuff going on in this area. Intuitively, it seems unlikely that there is going to be a single model which unites the descriptions of all relevant real-world networks. As ever, it is pleasant to find structures that are both mathematically interesting in their own right and relevant to applied problems. So it is reassuring to observe how similar many of the models discussed above are to the standard random graph.

## Delayed Connectivity in Random Graphs

### Aside

I presented a poster at the Oxford SIAM Student Chapter Conference on Friday. It was nice to win the prize for best poster, but mainly I enjoyed putting it together. For once it meant ignoring the technical details and anything requiring displayed formulae, and focusing only on aspects that could be conveyed with bullet points and images. Anyway, this is what I came up with. The real thing is sitting safely in a tube in my office, ready for the next time it is needed in a hurry!

Delayed Connectivity Poster

# Mixing Times 2 – Metropolis Chains

In our second reading group meeting for Mixing Times of Markov Chains, we reviewed chapters 3 and 4 of the Levin, Peres and Wilmer book. This post and the next contain a couple of brief thoughts about the ideas I found most interesting in each chapter.

Before reading chapter 3, the only thing I really knew about Monte Carlo methods was the slogan. If you want to sample from a probability distribution that you can’t describe explicitly, find a Markov chain which has that distribution as an equilibrium distribution, then run it for long enough starting from wherever you fancy. Then the convergence theorem for finite Markov chains means that the state of the chain after a long time approximates well the distribution you were originally looking for.

On the one previous occasion I had stopped and thought about this, I had two questions which I never really got round to answering. Firstly, what sort of distributions might you not be able to simulate directly? Secondly, and perhaps more fundamentally, how would you go about finding a Markov chain for which a given distribution is in equilibrium?

In the end, the second question is the one answered by this particular chapter. The method is called a Metropolis chain, and the basic idea is that you take ANY Markov chain with appropriate state space, then fiddle with the transition probabilities slightly. The starting chain is called a base chain. It is completely possible to adjust the following algorithm for a general base chain, but for simplicity, let’s assume it is possible to take an irreducible chain for which the transition matrix is symmetric. By thinking about the detailed balance equations (DBEs), this shows that the uniform distribution is the (unique) equilibrium distribution. Suppose the transition matrix is given by $\Psi(x,y)$, to copy notation from the book. Then set:

$P(x,y)=\begin{cases}\Psi(x,y)\left[1\wedge \frac{\pi(y)}{\pi(x)}\right]&y\neq x\\ 1-\sum_{z\neq x} \Psi(x,z)\left [1\wedge \frac{\pi(z)}{\pi(x)}\right]& y=x.\end{cases}$

Note that this second case (y=x) is of essentially no importance. It just confirms that the rows of P add to 1. It is easy to check from the DBEs that $\pi$ is the equilibrium distribution of matrix P. One way to think of this algorithm is that we run the normal chain, but occasionally suppress transitions if they involve a move from a state which is likely (under $\pi$), to one which is less likely. This is done in proportion to the ratio, so it is unsurprising perhaps that the limit in distribution is $\pi$.
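To see the construction in action, here is a small Python sketch that builds P from a symmetric base chain and checks the detailed balance equations numerically. The function name `metropolis_chain`, the choice of base chain (simple random walk on a cycle of 5 states) and the target weights are assumptions made for the example. Note that the weights are deliberately left unnormalised, since only ratios enter the formula.

```python
import math

def metropolis_chain(psi, weights):
    """Metropolis transition matrix for a symmetric base chain psi and target weights."""
    n = len(weights)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                # Accept a proposed move x -> y with probability 1 ∧ pi(y)/pi(x).
                P[x][y] = psi[x][y] * min(1.0, weights[y] / weights[x])
        P[x][x] = 1.0 - sum(P[x])    # stay put with whatever probability is left over
    return P

n = 5
psi = [[0.0] * n for _ in range(n)]
for x in range(n):                    # simple random walk on a cycle: symmetric
    psi[x][(x + 1) % n] = 0.5
    psi[x][(x - 1) % n] = 0.5

weights = [1.0, 2.0, 3.0, 4.0, 5.0]   # proportional to pi, normalising constant unknown
P = metropolis_chain(psi, weights)

balanced = all(
    math.isclose(weights[x] * P[x][y], weights[y] * P[y][x], abs_tol=1e-12)
    for x in range(n) for y in range(n)
)
print("detailed balance holds:", balanced)
```

Both sides of each detailed balance equation reduce to $\Psi(x,y)$ times the smaller of the two weights, which is why the check succeeds whatever the normalising constant is.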

Conveniently, this algorithm also gives us some ideas for how to answer the first question. Note that at no point do we need to know $\pi(x)$ for some state x. We only need the ratios of probabilities $\frac{\pi(y)}{\pi(x)}$. So this is perfect for distributions where there is a normalising constant which is computationally taxing to evaluate. For example, in the Ising model and similar statistical physics objects, probabilities are viewed more as weightings. There is a normalising constant, often called the partition function Z in this context, lying in the background, but especially if the underlying geometry is quite exotic we definitely don’t want to have to worry about actually calculating Z. Thus we have a way to generate samples from such models. The other classic example is a random walk on a large, perhaps unknown graph. Then the equilibrium distribution at a vertex is proportional to the degree of that vertex, but again you might not know about this information over the entire graph. It is reasonable to think of a situation where you might be able to take a random walk on a graph, say the connectivity graph of the internet, without knowing about all the edges at any one time. So, even though you potentially explore everywhere, you only need to know a small amount at any one time.

Of course, the drawback of both of these examples is that a lack of knowledge about the overall system means that it is hard in general to know how many steps the Metropolis chain must run before we can be sure that we are close to the equilibrium distribution it has been constructed to approach. So, while these chains are an excellent example to have in mind while thinking about mixing times, they are also a good motivation for the subject itself. General rules about speed of convergence to equilibrium are precisely what are required to make such implementation concrete.

# Loss Networks and Erlang’s Fixed Point

Loss Networks

In Erlang’s telephone line model discussed in the previous post, we considered users competing for a finite set of resources. When insufficient resources are available, the call is lost. A loss network generalises this situation to more complicated resource configurations. We think of links 1, …, J, each with some integer capacity $c_j$. Each incoming call requires 1 unit of each link in some subset of the set of links, and lasts for a time distributed as an exponential random variable with parameter 1, independent of everything else in the model. We call this subset the route, and denote by $A_{jr}$ the incidence matrix of links on routes. Calls arrive as PP($\nu_r$) independently for each route: no queueing occurs – a call is lost if some link required is operating at full capacity. We call the probability of this event $L_r$, the loss probability. Observe that $(n_r)$, the number of calls on each route r, is a Markov chain on the truncated space $\{An\leq c\}$.

By checking the DBEs, it is clear that an ED for this Markov chain is proportional to the ED for the MC without the capacity constraint, with state-space restricted to the truncated space. But without capacity constraints, the system is a linear migration process, for which we discovered the form of the ED in the previous section. If we write $H(c)=\mathbb{P}(An\leq c)$ in the linear migration process, we can compute the acceptance probability for the finite capacity system as:

$1-L_r=\frac{H(c-Ae_r)}{H(c)}$
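For very small networks this ratio can be computed by brute-force enumeration of the truncated state space. The sketch below does exactly that, assuming each $n_r$ is an independent Poisson($\nu_r$) count in the uncapacitated system and that every route uses at least one link. The helper names `H` and `loss_probability` are mine, and the common factor $e^{-\sum_r\nu_r}$ is dropped because it cancels in the ratio.

```python
from itertools import product
from math import factorial

def H(A, nu, c):
    """Unnormalised P(An <= c) for independent Poisson(nu_r) counts n_r.
    The common factor exp(-sum(nu)) is omitted since it cancels in ratios."""
    if any(cj < 0 for cj in c):
        return 0.0
    J, R = len(A), len(nu)
    # n_r can never exceed the smallest capacity among the links on route r
    # (assumes every route uses at least one link).
    bounds = [min(c[j] for j in range(J) if A[j][r]) for r in range(R)]
    total = 0.0
    for n in product(*(range(b + 1) for b in bounds)):
        if all(sum(A[j][r] * n[r] for r in range(R)) <= c[j] for j in range(J)):
            term = 1.0
            for r in range(R):
                term *= nu[r] ** n[r] / factorial(n[r])
            total += term
    return total

def loss_probability(A, nu, c, r):
    """L_r = 1 - H(c - A e_r) / H(c)."""
    reduced = [c[j] - A[j][r] for j in range(len(c))]
    return 1.0 - H(A, nu, reduced) / H(A, nu, c)

# A single link of capacity 3 carrying one route with arrival rate 2
# recovers the single-link Erlang model.
print(loss_probability([[1]], nu=[2.0], c=[3], r=0))

# Hypothetical two-link network: routes 0 and 1 use one link each,
# route 2 uses both, so route 2 is blocked at least as often as route 0.
A = [[1, 0, 1],
     [0, 1, 1]]
print([loss_probability(A, [1.0, 1.0, 1.0], [2, 2], r) for r in range(3)])
```

The enumeration grows exponentially in the number of routes, which is one way to see why approximations to the blocking probabilities are worth having.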

Approximating Blocking Probabilities

We want to calculate $B_j$, the equilibrium blocking probability that a given link j is full. We have two methods: firstly, to find the configuration $(n_r)$ with maximum probability, for which the blocking probabilities appear as shadow prices. And secondly, to make a reasonable approximation about blocking independence, and solve explicitly. We want to show that these methods give the same answers.

To maximise the probability $\pi(n)\propto \prod_r \frac{\nu_r^{n_r}}{n_r!}$ on $\{An\leq c\}$, we take logs and maximise using Stirling’s approximation, which is reasonable as we are implicitly working under a regime where the throughput tends to infinity while preserving ratios.
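As a quick numerical sanity check on the Stirling step, the following sketch compares $\log(\nu^n/n!)$ with the leading-order expression $n\log\nu-n\log n+n$ as n grows with $\nu/n$ held fixed. The ratio 1.2 and the function names are arbitrary choices of mine.

```python
from math import lgamma, log

def exact_log_weight(nu, n):
    """log(nu^n / n!), computed via the log-gamma function."""
    return n * log(nu) - lgamma(n + 1)

def stirling_log_weight(nu, x):
    """Leading-order Stirling approximation: x log nu - x log x + x."""
    return x * log(nu) - x * log(x) + x

for n in (10, 100, 1000, 10000):
    nu = 1.2 * n  # hypothetical arrival rate, scaled in proportion to n
    exact = exact_log_weight(nu, n)
    approx = stirling_log_weight(nu, n)
    # The gap grows only logarithmically in n, so the gap per call vanishes.
    print(n, approx - exact, (approx - exact) / n)
```

The gap between the two is of order $\log n$, which is negligible next to terms of order n in the large-throughput regime.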

The primal problem is

$\max\quad \sum_r(x_r\log \nu_r-x_r\log x_r+x_r),\quad\text{s.t. }Ax\leq c$

which, after introducing slack variables $z\geq 0$ with $Ax+z=c$, has Lagrangian

$L(x,y,z)=\sum_r x_r+\sum_r x_r(\log \nu_r-\log x_r-\sum_j y_jA_{jr})+\sum_j y_jc_j-\sum_j y_jz_j$

We observe that complementary slackness here has the form $y\cdot z=0$, and remember that by Strong Duality, which applies here because everything relevant is convex, this equality holds at the primal optimum. Differentiating the Lagrangian at the optimum allows us to specify the optimal x in terms of y:

$\bar{x}_r=\nu_r e^{-\sum_j y_jA_{jr}}$

The dual problem is then

$\min\quad \sum_r \nu_re^{-\sum_jy_jA_{jr}}+\sum_j y_jc_j,\quad\text{s.t. }y\geq 0$

At this point, we make the suggestive substitution $e^{-y_j}=1-B_j$, observing that this gives B non-negative by default since y is non-negative. After further work, we will deduce that these B do indeed have a sensible interpretation as blocking probabilities, but it should be stressed that this is in no way obvious yet. Now complementary slackness asserts:

$\sum_rA_{jr}\nu_r\prod_i(1-B_i)^{A_{ir}}\left\{\begin{array}{l l}=c_j& \quad B_j>0\\ \leq c_j & \quad B_j=0\\ \end{array} \right.$
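These conditions are easy to check numerically on a toy network. The sketch below minimises the dual objective over $y\geq 0$ by projected gradient descent and then reads off $B_j=1-e^{-y_j}$. The choice of solver, the step size, the iteration count and the example network are all my own assumptions rather than anything from the theory above, and the fixed step size would need shrinking for larger arrival rates.

```python
from math import exp

def solve_dual(A, nu, c, step=0.01, iters=20000):
    """Minimise sum_r nu_r exp(-sum_j y_j A_jr) + sum_j y_j c_j over y >= 0
    by projected gradient descent. Returns (B, x) with B_j = 1 - exp(-y_j)
    and x_r = nu_r exp(-sum_j y_j A_jr)."""
    J, R = len(c), len(nu)

    def flows(y):
        return [nu[r] * exp(-sum(y[j] * A[j][r] for j in range(J)))
                for r in range(R)]

    y = [0.0] * J
    for _ in range(iters):
        x = flows(y)
        # The partial derivative in y_j is c_j minus the load on link j.
        grad = [c[j] - sum(A[j][r] * x[r] for r in range(R)) for j in range(J)]
        # Gradient step, then project back onto the constraint y >= 0.
        y = [max(0.0, y[j] - step * grad[j]) for j in range(J)]
    B = [1.0 - exp(-yj) for yj in y]
    return B, flows(y)

# Hypothetical overloaded network: routes 0 and 1 use one link each,
# route 2 uses both links.
A = [[1, 0, 1],
     [0, 1, 1]]
nu = [4.0, 4.0, 4.0]
c = [5.0, 5.0]
B, x = solve_dual(A, nu, c)
loads = [sum(A[j][r] * x[r] for r in range(3)) for j in range(2)]
print(B, loads)  # both links have B_j > 0, so each load should equal c_j
```

For an underloaded link the iteration never leaves $y_j=0$, which gives $B_j=0$ and a load strictly below capacity, matching the second case of the condition.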

Note that the primal objective function is strictly concave so $\bar{x}$ as discussed is the unique optimum. The dual objective is strictly convex in $yA$, so if A has full rank J, this induces a unique optimum in terms of y. We assume A is full rank (since for example we can perturb slightly) and that there is no degeneracy in the blocking.

Now we consider a sequence of networks with proportionally increasing arrival rates and capacities.