# The Yule Process

The second problem sheet for classes on the Applied Probability course this term features a long question about the Yule process. This is probably the simplest example of a birth process. It’s named for the British statistician George Udny Yule, though some sources prefer to call it the Yule-Furry process for the American physicist Wendell Furry who used it as a model of a radioactive reaction.

The model is straightforward. At any time there is some number of individuals in the population, and each individual gives birth to an offspring at constant rate $\lambda$, independently from the rest of the population. After a birth has happened, the parent and child evolve independently. In the notation of general birth processes, the birth rate when there are n individuals is $\lambda_n=\lambda n$.

Note that if we start with two or more individuals, the sizes of the two or more families of descendents evolve as a continuous-time Polya’s urn. The arrivals process speeds up with time, but the jump chain is exactly Polya’s urn. Unsurprisingly, the Yule process can be found embedded in preferential attachment models, and other processes which are based around Polya’s urn with extra information.

This is a discrete, random version of exponential growth. Since the geometric distribution is the discrete analogue of the exponential distribution, we probably shouldn’t be surprised to learn that this is indeed the distribution of the process at some fixed time t, when it is started from a single original ancestor. This is all we care about, since the numbers of descendents from each different original ancestors are independent. In general, the distribution of the population size at some fixed time will be negative binomial, that is, a sum of IID geometric distributions.

The standard method here is to proceed using generating functions. Conditioning on the first splitting time gives two independent copies of the original process over a shorter time-scale. One derives an ODE in time for the generating function evaluated at any particular value z. This can be solved uniquely for each z, and patching together gives the generating function of the distribution at any specific time t, which can be seen to coincide with the corresponding generating function of the geometric distribution with parameter $e^{-\lambda t}$.

So we were trying to decide whether there might be a more heuristic argument for this geometric distribution. The method we came up with is not immediate, but does justify the geometric distribution in a couple of steps. First, we say that the birth times are $T_2,T_3,\ldots$, so between times $[T_n,T_{n+1})$ there are n individuals, with $T_1:=0$ for concreteness. Then by construction of the birth process, $T_{n+1}-T_n\stackrel{d}{=}\mathrm{Exp}(\lambda n)$.

We now look at these ‘inter-birth times’ backwards, starting from $T_{n+1}$. Note that $\mathrm{Exp}(\lambda n)$ is the distribution of the time for the first of n IID $\mathrm{Exp}(\lambda)$ clocks to ring. But then, looking backwards, the next inter-birth time is thus the distribution of the time for one of (n-1) IID $\mathrm{Exp}(\lambda)$ clocks to ring. So by memorylessness of the exponential distribution (discussed at great length on the first problem sheet), we can actually take these (n-1) clocks to be exactly those of the original n clocks which did not ring first. Continuing this argument, we can show that the first (in the original time direction) inter-birth time corresponds to the time spent waiting for the final clock to ring. Rewriting this observation formally:

$T_{n+1}\stackrel{d}{=}\max\{X_i : X_1,\ldots,X_n\stackrel{\text{iid}}{\sim}\mathrm{Exp}(\lambda)\}.$ (*)

To return to justifying the geometric form of the distribution, we need to clarify the easiest relationship between the population size at a fixed size and these birth times. As we are aiming for the geometric distribution, the probability of the event $\{X_t>n\}$ will be most useful. Clearly this event is the same as $\{T_{n+1}, and from the description involving maxima of IID exponentials, this is easy to compute as $(1-e^{-\lambda t})^n$, which is exactly what we want.

There are two interesting couplings hidden in these constructions. On closer inspection they turn out to be essentially the same from two different perspectives.

We have specified the distribution of $T_n$ at (*). Look at this distribution on the right hand side. There is a very natural way to couple these distributions for all n, namely to take some infinite sequence $X_1,X_2,\ldots$ of IID $\mathrm{Exp}(\lambda)$ random variables, then use initial sequences of these to generate each of the $T_n$s as described in (*).

Does this coupling correspond to the use of these IID RVs in the birth process? Well, in fact it doesn’t. Examining the argument, we can see that $X_1$ gives a different inter-birth time for each value of t in the correspondence proposed. Even more concretely, in the birth process, almost surely $T_{n+1}>T_n$ for each n. This is not true if we take the canonical coupling of (*). Here, if $X_n<\max\{X_1,\ldots,X_{n-1}\}$, which happens with high probability for large n, we have $T_{n+1}=T_n$ in the process of running maxima.

Perhaps more interestingly, we might observe that this birth process gives a coupling of the geometric distributions. If we want to recover the standard parameterisation of the geometric distribution, we should reparameterise time. [And thus generate an essentially inevitable temptation to make some joke about now having a Yule Log process.]

Let’s consider what the standard coupling might be. For a binomial random variable, either on [n] or some more exotic set, as in percolation, we can couple across all values of the parameter by constructing a family independent uniform random variables, and returning a 1 if $U_i>1-p$ and so on, where p is the parameter of a specific binomial realisation.

We can do exactly the same here. A geometric distribution can be justified as the first success in a sequence of Bernoulli trials, so again we can replace the relevant Bernoulli distribution with a uniform distribution. Take $U_1,U_2,\ldots$ to be IID U[0,1] random variables. Then, we have:

$X_t=\stackrel{d}{=}\bar X_t:= \max\{n: U_1,\ldots,U_{n-1}\ge e^{-\lambda t}\}.$

The equality in distribution holds for any particular value of t by constructing. But it certainly doesn’t hold uniformly in t. Note that if we define $\bar X_t$ as a process, then typically the jumps of this process will be greater than 1, which is forbidden in the Yule process.

So, we have seen that this Yule process, even though its distribution at a fixed time has a standard form, provides a coupling of such distributions that is perhaps slightly surprising.

# Dispersion in Social Networks

This post is based on a paper that appeared a couple of weeks ago on the Computer Science section of arXiv. You can find it here. I’m going to write a few things about the ideas behind the paper, and avoid pretty much entirely the convincing data the authors present, as well as questions of implementing the algorithms discussed.

The setting is a social network, which we can describe as a graph. Nodes stand for people, and an edge represents that the two associated people have some social connection. This paper focuses on edges corresponding to friendship in the Facebook graph.

A key empirical feature of the graph topology of such social networks as compared to most mathematical models of random graphs is the prevalence of short cycles, and so-called clustering. Loosely speaking, in an Erdos-Renyi random graph, any potential edges appear in the graph independently of the rest of the configuration. But this does not accord well with our experience of our own Facebook friend circle. In general, knowledge that A is friends with both B and C increases the likelihood that B and C are themselves friends. This effect appears to be more present in other models, such as Preferential Attachment and the Configuration Model, but that is really more a consequence of the degree sequence being less concentrated.

The reason for this phenomenon appearing in social networks is clear. People meet other people generally by sharing common activities, whether that be choice of school, job or hobbies. The question of how readily people choose to add others on Facebook is a worthwhile one, but not something I have the time or the sociological credibility to consider! In any case, it is not a controversial idea that for some typical activity, it is entirely possible that almost all the participants will end up as friends, leading to a large (almost-) ‘clique’ in the graph. Recall a clique is a copy of a complete graph embedded in a larger graph – that is, a set of nodes all of which are pairwise connected.

We could think of much of the structure of this sort of network as being generated in the following way. Suppose we were able to perform the very unlikely-sounding task of listing every conceivable activity or shared attribute that might engender a friendship. Each activity corresponds to a set of people. We then construct a graph on the set of people by declaring that a pair of nodes are connected by an edge precisely if the people corresponding to these nodes can both be found in some activity set.

Another way of thinking about this setup is to consider a bipartite graph, with people as one class of vertices, and activities as the other. Predictably, we join a person to an activity if they engage in that activity. The edges within the class of people are then induced by the bipartite edges. Obviously, under this interpretation, we could equally well construct a graph on the set of activities. Here, two activities would be joined if there is a person who does them both. Graphs formed in this way can be called Intersection Graphs, and there is lots of interest in investigating various models of Random Intersection Graphs.

The question addressed by the authors of the paper can be summarised as follows. A social network graph tells us whether two people are ‘friends’, but it does not directly tell us how close their relationship is. It is certainly an interesting question to ask whether the (local) network topology can give us a more quantitative measure of the strength of a friendship.

As the authors explain, a first approach might be to consider how many mutual friends two people have. (We consider only pairs of people who are actually friends. It seems reasonable to take this as a pre-requisite for a strong relationship among people who do actually use Facebook.) However, this can fail because of the way these social networks organise themselves around shared attributes and activities. The size of one of these cliques (which are termed social foci in parts of the literature) is not especially likely to be well correlated to the strengths of the friendships within the clique. In particular, the clique corresponding to someone’s workplace is likely to grow in size over time, especially when people grow towards an age where, on average, they move job much less. So it seems likely that, according to a naive examination of the number of mutual friends, we would predict that a person’s strongest friend is likely to be someone they work with, who perhaps by chance also does some other activity with that person.

The authors phrase this problem slightly differently. They examine algorithms for establishing a person’s spouse or long-term partner with good accuracy from only the local network structure.

Heuristically we might expect that a husband knows many of his wife’s work colleagues, and vice versa. Not all of these ties might be so strong that they actually lead to friendship, in the Boolean sense of Facebook, but we might expect that some noticeable proportion have this property. Naturally, there will be cliques to which both belong. One or more of these might be the reason they met in the first place, and others (eg parents at children’s schools) might have developed over the course of their relationship. However, as we’ve explained, this doesn’t narrow things down much.

(We need not be constrained by this heteronormative scenario. However, as the authors point out in a footnote, there are challenges in collecting data because of the large number of ironic relationship listings on Facebook, mainly among the undergraduate and younger community. This problem is particularly obstructive in the case of same-sex marriage, owing to the smaller numbers of genuine pairings, and larger numbers of false listings for this setting.)

The crucial observation is that if we look at the couple’s mutual friends, we expect to see large parts of the most important cliques from both husband and wife’s lives. Among these mutual friends, there will be some overlap, that is cliques of which both are an integral member. But among the rest, there will be a natural partition into friends who really originate from the husband, and friends who were introduced via the wife. So the induced graph on these mutual friends is likely to split into three classes of vertices, with very poor connectivity between two of them.

This is, up to sorting out scaling and so on, precisely the definition of dispersion, introduced by the authors. The dispersion between two vertices is high if the induced graph on their mutual neighbourhood has poor connectivity. Modulo exact choice of definition, they then exhibit data showing that this is indeed a good metric for determining marriages from the network topology, with success rate of around 50% over a wide range of users.

# Persistent Hubs

This post is based on the paper “Existence of a persistent hub in the convex preferential attachment model” which appeared on arXiv last week. It can be found here. My aim is to explain (again) the model; the application-based motivation for the result; and a heuristic for the result in a special case. In particular, I want to stress the relationship between PA models and urns.

The preferential attachment model attempts to describe the formation of large complex networks. It is constructed dynamically: vertices are introduced one at a time. For each new vertex, we have to choose which existing vertices to join to. This choice is random and reinforced. That is, the new vertex is more likely to join to an existing vertex with high degree than to an existing vertex with degree 1. It’s clear why this might correspond well to the evolution of, say, the world wide web. New webpages are much more likely to link to an established site, eg Wikipedia or Google, than to a uniformly randomly chosen page.

The model is motivated also by a desire to fit a common property of real-world networks that is not exhibited by, among others, the Erdos-Renyi random graph model. In such a network, we expect a few nodes to have much greater connectivity than average. In a sense these so-called hubs drive connectivity of the system. This makes sense in practice. If you are travelling by train around the South-East of England, it is very likely you will pass through at least one of Reading, East Croydon, or about five major terminus in London. It would be absurd for every station to be of equal significance to the network. By contrast, the typical vertex degree in the sparse Erdos-Renyi model is O(1), and has a limiting Poisson distribution, with a super-exponential tail.

So, this paper addresses the following question. We know that the PA model, when set up right, has power-law tails for the degree distribution, and so has a largest degree that is an order of magnitude larger than the average degree. Let’s call this the ‘hub’ for now. But the model is dynamic, so we should ask how this hub changes in time as we add extra vertices. In particular, is it the case that one vertex should grow so large that it remains as the dominant hub forever? This paper answers this question in the affirmative, for a certain class of preferential attachment schemes.

We assign a weighting system to possible degrees, that is a function from N to R+. In the case of proportional PA, this function could be f(n)=n. In general, we assume it is convex. Note that the more convex this weight function is, the stronger the preference a new vertex feels towards existing dominant vertices. Part of the author’s proof is a formalisation of this heuristic, which provides some machinery allowing us to treat only really the case f(n)=n. I will discuss only this case from now on.

I want to focus on the fact that we have another model which describes aspects of the degree evolution rather well. We consider some finite fixed collection of vertices at some time, and consider the evolution of their degrees. We will be interested in limiting properties, so the exact time doesn’t matter too much. We look instead at the jump chain, ie those times when one of the degrees changes. This happens when a new vertex joins to one of the chosen vertices. Given that the new vertex has joined one of the chosen vertices, the choice of which of the chosen vertices is still size-biased proportional to the current degrees. In other words, the jump chain of this degree sequence is precisely Polya’s Urn.

This is a powerful observation, as it allows us to make comments about the limiting behaviour of finite quantities almost instantly. In particular, we know that from any starting arrangement, Polya’s Urn converges almost surely. This is useful to the question of persistence for the following reason.

Recall that in the case of two colours, starting with one of each, we converge to the uniform distribution. We should view this as a special case of the Dirichlet distribution, which is supported on partitions into k intervals of [0,1]. In particular, for any fixed k, the probability that two of the intervals have the same size is 0, as the distribution is continuous. So, since the convergence of the proportions in Polya’s Urn is almost sure, with probability one all of the proportions are with $\epsilon>0$ of their limit, and so taking epsilon small enough, given the limit, which we are allowed to do, we can show that the colour which is largest in the limit is eventually the largest at finite times.

Unfortunately, we can’t mesh these together these finite-dimensional observations particularly nicely. What we require instead is a result showing that if a vertex has large enough degree, then it can never be overtaken by any new vertex. This proved via a direct calculation of the probability that a new vertex ‘catches up’ with a pre-existing vertex of some specified size.

That calculation is nice and not too complicated, but has slightly too many stages and factorial approximations to consider reproducing or summarising here. Instead, I offer the following heuristic for a bound on the probability that a new vertex will catch up with a pre-existing vertex of degree k. Let’s root ourselves in the urn interpretation for convenience.

If the initial configuration is (k,1), corresponding to k red balls and 1 blue, we should consider instead the proportion of red balls, which is k/k+1 obviously. Crucially (for proving convergence results if nothing else), this is a martingale, which is clearly bounded within [0,1]. So the expectation of the limiting proportion is also k/k+1. Let us consider the stopping time T at which the number of red balls is equal to the number of blue balls. We decompose the expectation by conditioning on whether T is finite.

$\mathbb{E}X_\infty=\mathbb{E}[X_\infty|T<\infty]\mathbb{P}(T<\infty)+\mathbb{E}[X_\infty|T=\infty]\mathbb{P}(T=\infty)$

$\leq \mathbb{E}[X_\infty | X_T,T<\infty]\mathbb{P}(T<\infty)+(1-\mathbb{P}(T=\infty))$

using that $X_\infty\leq 1$, regardless of the conditioning,

$= \frac12 \mathbb{P}(T<\infty) + (1-\mathbb{P}(T<\infty))$

$\mathbb{P}(T<\infty) \leq \frac{2}{k+1}.$

We really want this to be finite when we sum over k so we can use some kind of Borel-Cantelli argument. Indeed, Galashin gets a bound of $O(k^{-3/2})$ for this quantity. We should stress where we have lost information. We have made the estimate $\mathbb{E}[X_\infty|T=\infty]=1$ which is very weak. This is unsurprising. After all, the probability of this event is large, and shouldn’t really affect the limit that much when it does not happen. The conditioned process is repelled from 1/2, but that is of little relevance when starting from k/k+1. It seems likely this expectation is in fact $\frac{k}{k+1}+O(k^{-3/2})$, from which the result will follow.

# The Configuration Model

In the past, I’ve talked about limitations of the Erdos-Renyi model of homogeneous random graphs for applications in real-world networks. In a previous post, I’ve discussed a dynamic model, the Preferential Attachment mechanism, that ‘grows’ a graph dynamically by adding edges from new vertices preferentially to existing vertices with high degree. The purpose of this adjustment is to ensure that the distribution of the degrees is not concentrated around some fixed value (which would be c in G(n,c/n) ) but rather exhibits a power-law tail such as observed in many genuine examples.

In this post, we introduce some aspects of the configuration model, which achieves this property more directly. This idea probably first arose in the guise of regular graphs. Recall a regular graph has all degrees equal. How would we construct a random d-regular graph on a large number of vertices?

What we probably want to do is to choose uniformly at random from the set of such graphs, but it is not clear even how large this set is, let alone how one would order its elements to make it possible to make this uniform choice. Instead, we try the following. Assign to each vertex d so-called stubs, which will end up being ‘half-edges’. We then choose two stubs uniformly at random, and glue them together. More formally, we construct an edge between the host vertices, and then delete the chosen stubs. We then continue.

The construction makes no reference to the distribution of stubs, so we are free to choose this as we please. We could for example specify some sequence of degrees which approximates a power-law, so we could sample a random sequence of degrees in some way. So long as we have a sequence of stub set sizes before we start building the edges of the graph we will be able to use the above algorithm.

So what might go wrong? There seem to me to be three potential problems that might arise with this construction.

Firstly, there might be a stub left over, if the sum of the stub set sizes is odd. Recall that in a graph the sum of the degrees is twice the sum of the number of edges, and so in particular the sum of the degrees should be even. But this is a small problem. When the degree sequence is deterministic we can demand that it have even sum, and if it is random, we will typically be working in a large N regime, and so deleting the solitary stub, if such a thing exists, will not affect the sort of properties of the graph we are likely to be interested in.

The second and third objections are perhaps more serious. If we glue together stubs naively, we might end up with loops, that is, edges that ‘begin’ and ‘end’ at the same vertex. These are not allowed in the standard definition of a graph. Alternatively, we might end up with more than one edge between the same pair of vertices.

Our overall aim is that this mechanism gives a convenient way of simulating the uniform distribution on simple graphs with a given degree sequence. At present we have the uniform distribution on potential multigraphs, with a weighting of 1/k! for every multi-edge with multiplicity k, and a weighting of 1/2 for every loop. The latter can be seen because there is an initial probability proportional to $d(v_i)d(v_j)$ that vertices v_i and v_j will be joined, whereas a probability proportional (with the same constant) to $d(v_i)^2$ that v_i will receive a loop. The multi-edge weighting justification is similar.

However, conditional on getting a simple graph, the distribution is uniform on the set of simple graphs with that degree sequence. So it remains to investigate the probability that a graph generated in this way is simple. So long as this probability does not tend to 0 as n grows, we will probably be happy.

The strongest results on this topic are due to Janson. First observe that if the sum of the degrees grows faster than the number of vertices n, we fail to get a graph without loops with high probability. Heuristically, note that on the first pass, we are taking two picks from the set of vertices, biased by the number of stubs. By Cauchy-Schwarz, Rearrangement Inequality or just intuition, the probability of getting the same vertex is greater than if we picked uniformly from the set of vertices without biasing. So the probability of getting no loop on the first pass is $\le (1-\frac{1}{n})$. Take some function a(n) that grows faster than n, but slower than the sum of the degrees. Then after a(n) passes, the degree distribution is still roughly the same. In particular, the sum of the degrees is still an order of magnitude greater than n. So we obtain:

$\mathbb{P}(\text{no loops})\leq (1-\frac{1}{n})^{a(n)}\approx e^{-\frac{a(n)}{n}}\rightarrow 0.$

So, since isolated vertices have no effect on the simplicity or otherwise, we assume the sum of the degrees is $\Theta(n)$. Then, Janson shows that the further condition

$\sum_{i=1}^n d_i^2=O(n),$

is essentially necessary and sufficient for simplicity. We can see why this might be true by looking at the probability that the first edge added is a loop, which is roughly

$\frac{d_1^2+d_2^2+\ldots+d_n^2}{2(\sum d_i)^2}.$

We have to consider $O(\sum d_i)$ edges, so if the above expression is much larger than this, we can perform a similar exponential estimate to show that the probability there are no loops is o(1). The technical part is showing that this probability doesn’t change dramatically as the first few stubs disappear.

Note that in both cases, considering only loops is sufficient for simplicity. Although it looks like loop appearance is weaker than multiplicity of edges, in fact they have the same threshold. It should also be pointed out that, like the uniform random forests, an alternative approach is simply to count the number of simple graphs and multigraphs with a given degree sequence. Good asymptotics can then be found for the probability of simplicity.

In the case of G(n,c/n), we were particularly interested in the emergence of the giant component at time c=1. While first-moment methods can be very effective in demonstrating such results, a branching process local limit representation is probably easiest heuristic for this phase transition.

So long as the degree sequences converge in a natural way, we can apply a similar approach to this configuration model. Concretely, we assume that the proportion of vertices with degree i is $\lambda_i$ in the limit. Although the algebra might push through, we should be aware that this means we are not explicitly specifying how many vertices have degree, eg $\Theta(n^{1/2})$. For now assume the $\lambda_i$s sum to 1, so specify a probability distribution for degree induced by choosing a vertex uniformly at random.

So we start at a vertex, and look at its neighbours. The expected number of neighbours of this root vertex is $\sum i\lambda i$. Thereafter, when we consider a child vertex, based on how the stubs are paired up (and in particular the fact that the order of the operations does not matter – the choice of partner of a given stub is chosen uniformly at random), we are really choosing a stub uniformly at random. This corresponds to choosing a vertex at random, biased by the number of stubs available. The quantity of interest is how many additional stubs (other than the one that led to the vertex) are attached to this vertex. We assume we don’t need to worry too much about repeating vertices, in a similar way to G(n,c/n). So the expected number of additional stubs is

$\frac{1}{\sum i\lambda_i}\sum i\lambda_i(i-1).$

For an infinite component, we required the expectation to be > 1, which is equivalent to

$\sum \lambda_i i(i-2)>0.$

This was proven by Molloy and Reed (95), then with fewer conditions by Janson (07). The latter also shows how to use this construction to derive the giant component for G(n,c/n) result.

REFERENCES

Janson – A New Approach to the Giant Component Problem

Molloy, Reed – A Critical Point for Random Graphs with a Given Degree Sequence

Janson – The Probability that  Random Multigraph is Simple

# The Chinese Restaurant Process

A couple of months ago I wrote a post about Polya’s Urn, the simplest example of self-reinforcing process. Recall that we have a bag containing black and white balls, and sequentially we draw a ball then replace it together with an additional ball of the same colour. The process is self-reinforcing in the sense that if there is a surplus of black balls, the dynamics will reinforce this by adding more black balls than white balls. Alternatively, you can think of a natural limit process when the number of balls is large, for which any distribution is an invariant distribution. We have seen models such as the Preferential Attachment dynamics for network creation, where the degrees of vertices clearly have this self-reinforcing property. New vertices are more likely to join to existing vertices with large degrees.

One difference between the Polya Urn and some of the models we might be interested in for applications is that for the urn model, the number of classes (in this context colours of balls) is fixed. In many applications, we will want to allow new classes to appear. In the process which follows, we will allow this, and the new classes will have initial size equal to 1, so will be at a disadvantage for the self-reinforcing dynamics. Nonetheless, some will show up in a meaningful way in the limit. It is worth emphasising that Polya’s Urn gave us the Dirichlet distribution in the limit, and this can be thought of as a partition of [0,1]. These more general processes will give us a more interesting family of partitions, called the Poisson-Dirichlet distributions. These will turn up in a wide variety of contexts, and this is perhaps the friendliest way to introduce them.

The model is this. We start with a single diner who sits at the first table. Then whenever the (n+1)th diner arrives, they join a table with k diners already with probability k/n+1, and they start a new table with probability 1/n+1.

(Aside: I’m not exactly sure how this relates to a Chinese restaurant? It seems more reminiscent of a university dining hall during freshers’ week, but I guess that would be a less catchy name for a model.)

Anyway, the interest in this description lies not in organising seating arrangements. Consider choosing uniformly at random from the set of permutations on [n+1]. Suppose x maps to n+1 and n+1 maps to y. Consider taking the permutation of [n] formed by instead mapping x to y and ignoring n+1. This has the uniform distribution on the set of permutations of [n]. By reversing this procedure, we can construct a uniform permutation of [n+1] from a uniform permutation of [n]. When you do this as a process for n growing, observe that the orbits correspond exactly to tables in the Chinese Restaurant Process. If we wanted the CRP to give all the information about the permutation, we could specify the ordering round each table, by saying that with probability 1/n+1 the new diner sits to the left of any given existing diner.

As a starting point for why this is a useful description of the uniform permutation distribution, observe that the size of the component containing the element 1 evolves as a Polya Urn with initial vector (1,1). The second 1 in the initial vector corresponds to the possibility of starting a new table, which is maintained at every stage. This tells us immediately that as n grows to infinity, the proportion of elements in the same cycle as 1 in the uniform permutation converges in distribution to U[0,1]. The construction also allows for an easy proof that the expected number of cycles is roughly log n for large n, since on each pass of the process, the probability that there is a new cycle formed is 1/k.

In this case, the partition induced on [n] by the process is clearly exchangeable given our permutation description. However, this will turn out to hold in greater generality. Note also,, that conditional on the size of the cycle containing 1, the sizes of the remaining cycles are given by a uniform permutation on a smaller number of elements. So the limiting result holds jointly in the first k cycle sizes for all k. More precisely, if $(N_1,N_2,\ldots)$ are the cycle sizes ordered by least element, then the frequencies converge to:

$(U_1,(1-U_1)U_2,(1-U_1)(1-U_2)U_3,\ldots),$

where the Us are independent U[0,1] RVs. This is known as a stick-breaking procedure, where at each step we break off some proportion of the stick according to a fixed distribution, and assemble the pieces into a partition.

We generalise this process to get a two-parameter version. The standard notation for the parameters is $(\alpha,\theta)$. Then we amend the dynamics. We now have to take into account how many tables are occupied when the (n+1)th diner arrives. If k tables are occupied, and the ith table has $n_i$ diners, then the new one will join this table with probability $\frac{n_i-\alpha}{n+\theta}$, and will start a new table otherwise, so with probability $\frac{\theta+k\alpha}{n+\theta}$. The original process therefore corresponds to parameters (0,1).

First we examine which parameters are possible. If $\alpha<0$, and $m|\alpha|<\theta<(m+1)|\alpha|$, then with high probability the (m+1)th table will eventually be occupied, whereafter the probability of forming a further table will be negative. So we have to demand instead that $\theta$ is an integer multiple of $-\alpha$. Then the number of tables is bounded by this multiple, so for large n, the probability of joining one of the k (fixed) tables is roughly $\frac{n_i}{n}$, so this should behave roughly like the standard Polya Urn. And indeed, the induced frequencies do converge to the Dirichlet distribution with k equal parameters.

Obviously $\alpha$ cannot be greater than 1, otherwise the probability of the second diner joining the first table is negative. If it is equal to 1, then every diner starts a new table, which isn’t very interesting. So we care about $\alpha\in[0,1)$, and for the probability of the second diner starting a new table to be non-negative we require $\theta>-\alpha$.

It turns out that the partitions induced by this process are exchangeable also. We also have a stick-breaking construction, although now the broken proportions are not IID, but distributed as

$U_i\sim \mathrm{Beta}(1-\alpha,\theta+i\alpha),$

with the same notation otherwise. It turns out that under mild assumptions, these are all the infinite exchangeable random partitions with this stick-breaking property.

My initial struggle with this process was to understand what roles $(\alpha,\theta)$ played in a more precise way. It turns out this is best explained through the limit of the partition, but Pitman’s Exercise 3.2.2 does at least give an idea of how such a process with parameters (1/2,0) might naturally arise as a version of an urn model.

3.2.2. Let an urn initially contain two balls of different colours. Draw 1 is a simple draw from the urn with replacement. Thereafter, balls are drawn from the urn, with replacement of the ball drawn, and addition of two more balls as follows. If the ball drawn is of a colour never drawn before, it is replaced together with two additional balls of two distinct new colours, different to the colours of balls already in the urn. Whereas if the ball drawn is of a colour that has been drawn before, it is replaced together with two balls of its own colour.

Let $n_1$ be the number of times a ball of the colour of the first ball drawn (and replaced) is drawn. Let $n_2,n_3,\ldots$ be the number of times balls of each other colour are drawn. Suppose after n draws, we have drawn k colours. (There will be other colours in the bag not yet drawn.) Then, for each drawn colour i, there are $2n_i-1$ balls of that colour in the bag, giving 2n-k in total. But there should be 2n balls in total, so there are k other balls. Then the probability that we see a new colour is k/2n, and the probability that we see colour i again is $\latex \frac{2n_i-1}{2n}=\frac{n_i-1/2}{n}$, which exactly corresponds to the dynamics for PD(1/2,0).

The other question I was puzzled by initially is where does the dust come from in the limit? Recall that in an infinite exchangeable partition, the sum of the frequencies does not need to be 1. The difference between this sum and 1 gives the probability that an element is in a block by itself. Obviously, when the number of tables is bounded (as when $\alpha<0$) this is not an issue, but for positive $\alpha$, this won’t hold. So we need to account for these singletons. The temptation is to imagine that these correspond to tables which are started but never joined. But this use of ‘never’ is not ideal. For each k, the k-th table will eventually include arbitrarily large numbers of diners. But for any finite n, there will likely be some proportion of people dining alone, some in pairs, and so on. So the sum of all of these proportions in the limit gives this dust.

Generalising Polya’s Urn in another direction, if I have time, I might write something about a model which I recently read about on arXiv where the classes are vertices of a graph, and there is dependence between them based on the presence of edges. This might also be a good moment to explain some other generalisations and stochastic approximation methods used to treat them.

REFERENCES

This post is almost entirely a paraphrase of Sections 3.1 and 3.2 from Pitman’s Combinatorial Stochastic Processes, available online here.

# Preferential Attachment Models

I’ve just read a really interesting paper by Peter Morters and Maren Eckhoff that made me feel I should look up some of the background and write a quick post. I may get onto some of the results in the paper at the end of this post, but I want to start by saying a bit about the model itself. I’ve spoken about this briefly in a previous post about several descriptions of complex networks, but I think it’s worth having a second attempt.

We seek a model for random graphs that gives a distribution which exhibits some of the properties of the sort of complex networks seen in the real world. In particular, whereas the degree distribution is Poisson, and so concentrated with exponential tails for the Erdos-Renyi random graph, data indicates that a better model for most applications would have power law tails for this degree distribution.

Albert and Barabasi propose growing such a graph via a so-called preferential attachment scheme. We start with some small possibly empty graph, and add new vertices one at a time. For each new vertex, we add exactly M edges between the new vertex and the vertices already present. The choice of these M other vertices is given by weighting by the degree of the (pre-existing) vertices. That is, vertices with large degree are more likely to be joined to new vertices. This is obviously designed to replicate some of the behaviour seen in say the formation of the internet, where new sites are more likely to link to established and popular sites (Google, Youtube and so on) than a uniformly chosen site.

This model has a couple of problems. Firstly, it is not immediately obvious how to start it. Obviously we need M vertices present for the PA dynamics to start working. In fact, whether one starts with a empty graph or a complete graph on M vertices makes little difference to the large n behaviour. Trickier is the question of multiple edges, which may emerge if we define the PA dynamics in the natural way, that is for each of the M edges in turn. Overcoming this is likely to be annoying.

Bollobas and Riordan do indeed overcome this possible problems in a formal way, and prove that a version of this model does indeed have power law decay of the degree distribution, with exponent equal to 3. The model in the paper instead joins new vertex (n+1) to old vertex m with probability:

$\frac{f(\text{in-degree of n})}{n},$

where f is some function, which for now we assume has the form $f(k)=\gamma k+\beta$. Since the vertices are constructed one at a time, it is well-defined to orient these edges from new to old vertices, hence this notion of in-degree makes sense.

It was not obvious to me that this model was more general than the Bollobas/Riordan model, but we will explain this in a little while. First I want to explain why the Bollobas/Riordan model has power law tails, and how one goes about finding the exponent of this decay, since this was presented as obvious in most of the texts I read yet is definitely an important little calculation.

So let’s begin with the Bollobas/Riordan model. It makes sense to think of the process in terms of time t, so there are t – M vertices in the graph. But if t is large, this is essentially equal to t. We want to track the evolution of the degree of some fixed vertex v_i, the ith vertex to be formed. Say this degree is d(t) at time t. Then the total number of edges in the graph at time t is roughly tM. Therefore, the probability that a new vertex gets joined to vertex v is roughly $\frac{Md}{2Mt}$, where the M appears in the numerator because there are M fresh edges available. Note that we have ignored the possibility of trying to connect multiple edges from the new vertex to v, so this holds provided d is substantially smaller than t. With the boundary condition $d(i)=M$, this leads to the simple ODE

$\dot{d}=\frac{d}{2t}\quad \Rightarrow\quad d=M(\frac{t}{i})^{1/2}.$

To me at least it was not immediately clear why this implied that the tail of the degree distribution had exponent 3. The calculation works as follows. Let D be the degree of a vertex at large time t, chosen uniformly at random.

$d_i\propto (\frac{t}{i})^{1/2}$

$\Rightarrow\quad \mathbb{P}(D\geq d)=\frac{1}{t}|\{i:(\frac{t}{i})^{1/2}\geq d\}|=\frac{1}{t}|\{i:i\leq \frac{t}{d^2}\}|=\frac{1}{d^2}$

Now we consider the Eckhoff / Morters model. The main difference here is that instead of assuming that each new vertex comes with a fixed number of edges, instead the new vertex joins to each existing vertex independently with probability proportional to the degree of the existing vertex. More precisely, they assume that edges are directed from new vertices to old vertices, and then each new vertex n+1 is joined to vertex m<n+1 with probability $\frac{f(\text{indegree of }m\text{ at time }n)}{n}\wedge 1$, where $f(k)=\gamma k +\beta$, for $\gamma\in[0,1), \beta>0$.

I was stuck for a long time before I read carefully enough the assertion that $\beta>0$. Of course, if this doesn’t hold, then the graph won’t grow fast enough. For, since the function f is now linear, we can lift the statement about evolution of the degree of a vertex to a statement about the evolution of the total number of edges. Note that each edge contributes exactly one to the total number of in-degrees. So we obtain

$\dot{E}=\frac{\gamma E}{t}\quad\Rightarrow E(t)\propto t^\gamma.$

In particular, this is much less than t, so the majority of vertices have small degree. The answer is fairly clear in fact: since the preferential attachment mechanism depends only on in-degree, then if f(0)=0, since the in-degree of a new vertex will always be zero by construction, there is no way to get an additional edge to that vertex. So all the edges in the graph for large t will be incident to a vertex that had positive in-degree in the time 0 configuration. Hence we need $\beta>0$ for the model to be meaningful. Note that this means we effectively have a Erdos-Renyi type mechanism AND a preferential attachment evolution. As, for each new vertex, we add roughly $\beta$ edges to existing vertices chosen uniformly at random (rather than by a PA method) and also some assigned via PA. A previous paper by Dereich and Morters shows that the asymptotic degree distribution has a power law tail with exponent

$\tau:=\frac{\gamma+1}{\gamma}.$

Note that $\gamma=\frac12$ gives the same exponent (3) as the Bollobas / Riordan model.

We can apply a similar ODE approximation as above to estimate the likely large time behaviour of the number of edges:

$E'=\frac{\gamma E + \beta t}{t}.$

So since $E'\geq \beta$, we have $E\geq \beta t$ so defining F to be E(t)/t, we get:

$tF'(t)=\beta-(1-\gamma)F(t)$        (1)

Noting that F’ is positive when $F< \frac{\beta}{1-\gamma}$ and negative when $F>\frac{\beta}{1-\gamma}$ suggests that for large t, this is an equilibrium point for F and hence $E(t)\approx \frac{\beta t}{1-\gamma}$. Obviously, this is highly non-rigorous, as F’ can be very small and still satisfy the relation (1), so it is not clear that the ‘equilibrium’ for F is stable. Furthermore, one needs to check that the binomial variables that supply the randomness to this model are sufficiently concentrated that this approximation by expectation is reasonable.

Nonetheless, as a heuristic this is not completely unsatisfactory, and it leads to the conclusion that E(t) is a linear function of t, and so the distribution of the out-degrees for vertices formed at large times t is asymptotically Poisson, with parameter

$\lambda =\frac{\beta\gamma}{1-\gamma}+\beta=\frac{\beta}{1-\gamma}$.

Note that this is the same situation as in Erdos-Renyi. In particular, it shows that all the power tail behaviour comes from the in-degrees. In a way this is unsurprising, as these evolve in time, whereas the out-degree of vertex t does not change after time t. Dereich and Morters formalise this heuristic with martingale analysis.

The reason we are interested in this type of model is that it better reflects models seen in real life. Some of these networks are organic, and so there it is natural to consider some form of random destructive mechanism, for example lightning, that kills a vertex and all its edges. We have to compare this sort of mechanism, which chooses a vertex uniformly at random, against a targeted attack, which deletes the vertices with largest degree. Note that in Erdos-Renyi, the largest degree is not much larger than the size of the typical degree, because the degree distribution is asymptotically Poisson. We might imagine that this is not the case in some natural networks. For example, if one wanted to destroy the UK power network, it would make more sense to target a small number of sub-stations serving large cities, than, say, some individual houses. However, a random attack on a single vertex is unlikely to make much difference, since the most likely outcome by far is that we lose only a single house etc.

In Eckhoff / Morters’ model, the oldest vertices are by construction have roughly the largest degree, so it is clear what targeting the most significant $\epsilon n$ vertices means. They then show that these vertices include all the vertices that give the power law behaviour. In particular, if you remove all of these vertices and, obviously, the edges incident to them, you are left with a graph with exponential tail in the asymptotic degree distribution, with largest degree on the order of log n. It was shown in a previous paper that this type of network is not vulnerable to random removal of nodes. Perhaps most interestingly, these authors now prove that after removing the most significant $\epsilon n$ vertices, the network IS now vulnerable to random removal of nodes, leading to the conclusion that it is preferable to experience a random attack followed by a targeted attack than vice versa!

In a future (possibly distant) post, I want to say some slightly more concrete things about how these processes link to combinatorial stochastic processes I understand slightly better, in particular urn models. I might also discuss the configuration model, an alternative approach to generating complex random networks.

# Beyond Erdos-Renyi: more realistic models of networks

The claim is often made that the study of random graphs such as the Erdos-Renyi model is worthwhile because it gives us information about complex systems which exist in the real world. The internet or social networks provide the example du jour at the moment, but it’s equally plausible to think about traffic flows, electrical systems or interacting biological processes too.

If this were entirely true, it would be great for two reasons. Firstly, in my opinion at least, it is a beautiful subject in its own right, and to have a concrete applicable reason to continue studying it would make it even better. (Not to mention the dreaded competition for funding…) Secondly, Erdos-Renyi is so simple. After all, it involves little more than adding some simple topology to a collection of IID Bernoulli random variables, and so it would surely be possible to draw some significant conclusions about how complicated real-world objects interact without too much mathematical effort.

Unfortunately, but unsurprising, this simplicity is a drawback as far as applications go. It is fairly clear that most real-world systems cannot offer any property even approaching the niceness of the independent, same probability edges condition. But rather than consign E-R to the ‘pretty but useless’ category of mathematical structures, we should think carefully about exactly why it fails to be a good model for real-world networks, and see whether there are any small adjustments that could be made to improve it.

This is something I’ve been meaning to read up about for ages and ages. What follows is based heavily on the Albert and Barabasi 2002 review paper. I suspect that many of the open problems and intuitive calculations have since been finished and formalised, but for an overview I hope that doesn’t matter hugely. I’ve also leafed through the relevant chapters of Remco Van der Hofstad’s notes, but am setting the details and the exercises aside for the holidays when I have a bit more time!

Problems with Erdos-Renyi

Recall that G(n,p) takes n vertices, and adds edges between any pair of vertices independently with probability p.

One property shared by most real-world networks is the scale-free phenomenon, which says that the degree distribution has a power law tail. The Albert-Barabasi papers gives a comprehensive survey of data verifying this claim. By contrast, G(n,p) has degree distribution which is approximately Poisson as n grows. This is concentrated near the average degree with a thin exponential tail, so does not satisfy this requirement. I was and still am a bit confused by the term ‘scale-free’. The idea is certainly that the local structure is independent of the size of the system, which seems to be true for the degree distributions in sparse ER, that is where p = O(1/n). But I think the correct heuristic is that it doesn’t matter how far zoomed in you are – the macroscopic structure looks similar for n vertices as for n^2 vertices. This certainly fails to be true for ER, where no vertex has O(n) neighbours, whereas with a power law tail, this does hold.

The main consequence of this is that there are a few vertices with very high degree. These are often called ‘hubs’ and parallels are drawn to the internet, where key websites and servers connect lots of traffic and pages from different areas. The idea is that the hubs are almost certainly well-connected to each other, and this offers a step towards a small-world phenomenon, where the shortest path between any two vertices is very small relative to the size of the system. This notion was introduced to mainstream culture by Stanley Milgram’s ‘Six degrees of separation’ experiment in the 60s, where it became clear that subjects were able to deliver a package to a complete stranger on the other side of the USA, using only personal contacts, in about six stages. The graph theoretic notion for this is the diameter, defined as the maximal graph distance between two points. Here, the graph distance means the length of the shortest path between the points. This definition, with the max-min formalism looks rather complicated, but isn’t really. The diameter of an Erdos-Renyi graph for fixed p, increases like log n, which is small relative to n, and so this property holds.

A quick glance at your list of Facebook friends will confirm that the independent edges condition in an Erdos-Renyi random graph is not a plausible model for social networks. How many friends do you have? Let’s say about 1000, more to make the calculation easier than because you’re necessarily very popular. How many does your friend Tom have? Let’s say 1000 again. As was in the news a few months ago, there are now over a billion people on Facebook. Let’s say exactly a billion (that is 10^9 for these purposes). So both you and Tom are friends with 1/10^6 of the total membership of the network. So how large would you expect the overlap of your friendships to be, if they were all chosen independently at random? Well, the probability that you are both friends with Alice is 10^-12, and so the expected number of your mutual friends is 10^-12 x 10^9 = 10^-3 which is substantially less than 1. Yet I imagine if you substituted names suitably, you and Tom might well have over 50 mutual friends if you were, say, in the same year at school or niversity and haven’t yet purged your list.

We want a statistic that records this idea quantitatively. There are various candidates for such a clustering coefficient. The underlying notion that we might expect there to be greater connectivity between neighbours of some fixed point v than in the graph as a whole gives an intuition for a possible definition. Compare the proportion of triangles in the graph to the cube of the proportion of edges. When this ratio is large, then there is a lot of clustering. In the E-R case, we would expect these to be equal, as the probability of forming a triangle is equal to the cube of the probability of the presence of each of the three independent edges that make up the triangle.

So we have three properties of real networks that we would like to incorporate into a model: small diameter, power-law degree distribution, and high clustering. To avoid this turning into a book, I’m going to write a paragraph about each of the possibilities discussed by Albert and Barabasi.

Generalised Random Graph

The degree distribution will typically emerge as a consequence of the construction of a given model. The general idea here is to condition on the degree distribution having the form we want, and see what this does to the structure. Of course, the choice of how to do this conditioning is absolutely key. It certainly isn’t obvious what it means to ‘condition G(n,p) to have power-law distribution’, since the very idea of a power-law vs exponential tail requires the number of vertices to be large.

The first idea for achieving this gives the vertices ‘stubs’, which join up in pairs to form edges. We decide on the distribution of stubs according to this power law, then pair them up uniformly at random. Obviously, there is a possibility of getting some loops, but this is not going to happen so often as to be a genuine problem in the limit. This construction is similarly open to the branching process exploration ideas well covered for the E-R random graph, though we have to be careful to size-bias the degree distributions when necessary. There is still an underlying independence in the location of edges though, so it is reasonably clear that the amount of clustering may be closer to E-R than to the real examples cited.

The other possibility suggested is to retain the independent edge property, but give the vertices weights, and let the probability of an edge between two vertices be some sensible function of the weights. In the end it turns out to make little difference whether the weights are chosen deterministically or randomly, but by taking the weights i.i.d. with infinite mean, we can generate a so-called generalised random graph where the degree distribution has a power law.

Watts-Strogatz

In the WS model, the idea is to interpolate between a graph with maximal clustering and a random graph. A d-regular graph, say on a ring, where every vertex is connected to its d nearest neighbours has high clustering, but large diameter, as for example it takes roughly n/2d steps to get to the other side of the ring. Whereas in the standard E-R model we add edges with some fixed probability p, here we replace edges with some fixed probability p. That is, we take an edge in the regular graph and with some small probability we remove it and instead add an edge between two vertices chosen uniformly at random. The theoretical motivation is that removing a few edges doesn’t destroy the high clustering evident in the regular graph, but even a sparse random graph has small diameter, so adding a few ‘long-range’ edges should be enough to decrease the diameter significantly.

It obviously needs to be checked that a substantial drop in diameter occurs before a substantial decrease in clustering, and there is a calculation and diagram to support this intuitive idea in the paper. The one drawback of this model is that it fails to provide the power-law degree distributions we want. After all, an E-R graph has a concentrated degree distribution, and a d-regular graph has all degrees the same, so we would expect some interpolation between the two to have a concentrated distribution as well. Nonetheless, this model accords well with an idea of how complex networks might form, particularly if there is some underlying geometry. It is reasonable to assume that an initial setup for a network would be that people are connected to those closest to them, and then slowly acquire distant contacts as time progresses.

Preferential Attachment – Barabasi-Albert model

Most of our intuition for networks can be extended to an intuition for the formation of networks. The idea of prescribing a degree distribution is neat, but it doesn’t give any account to the mechanism of formation. Complexity emerges over time, and a good model should be able to describe why this happens. The Barabasi-Albert model takes this as its starting point, with the aim of producing a highly clustered system dynamically. Recall that we can describe G(n,p) as a process by coupling, then increasing p from 0 to 1, and seeing edges emerge. The independence assumption can be lifted through the coupling, and so which edge appears next is independent of the current state of the system.

This is what we need to relax. Recall the motivating idea of ‘hubs’, where a small collection of vertices have very high connectivity across the whole system, as observed in several real situations. A consequence of this is that new edge is more likely to be attached to a hub, than to a pair of poorly connected vertex elsewhere. But it turns out that this idea of preferential attachment isn’t enough by itself. Because as a network forms, it is not just the connectivity that increases, but also the size of the system itself. So in fact it makes sense to add vertices rather than edges, and join the new vertices to existing vertices in proportion to the degrees of the existing vertices. This combination of growth and preferential attachment is key to the scale-free graphs that this Barabasi-Albert model generates. Relaxing either mechanism returns us to the case of exponential tails. However, there are methods in the literature for generating such graphs without the need for a dynamic model, but they are harder to understand and describe. None I have seen so far has a high clustering coefficient.

Hubs are effectively a way to reduce the diameter. Recall the description of Milgram’s experiment where he encouraged randomly chosen people to send a package to Harvard. For the purposes of this model, an undergraduate from Wyoming or a husband from Alabama moving in with his wife in Boston are clear hubs, as for very many people near their previous home, they represent a good connection to Harvard. So it is unsurprising that BA, which reinforces hubs, has a sub-logarithmic small diameter.

Conclusions

I’m not entirely what conclusions I should draw from my reading. Probably the main one is that I should read more as there is plenty of interesting stuff going on in this area. Intuitively, it seems unlikely that there is going to be a single model which unites the descriptions of all relevant real-world networks. As ever, it is pleasant to find structures that are both mathematically interesting in their own right and relevant to applied problems. So it is reassuring to observe how similar many of the models discussed above are to the standard random graph.