Generating uniform trees

A long time ago, I wrote quite a few a things about uniform trees. That is, a uniform choice from the n^{n-2} unrooted trees with vertex set [n]. This enumeration, normally called Cayley’s formula, has several elegant arguments, including the classical Prufer bijection. But making a uniform choice from a large set is awkward, and so we seek more probabilistic methods to sample such a tree, which might also give insight into the structure of a ‘typical’ uniform tree.

In another historic post, I talked about the Aldous-Broder algorithm. Here’s a quick summary. We run a random walk on the complete graph K_n started from a uniformly-chosen vertex. Every time we arrive at a vertex we haven’t visited before, we record the edge just traversed. Eventually we have visited all n vertices, so have recorded n-1 edges. It’s easy enough to convince yourself that these n-1 edges form a tree (how could there be a cycle?) and a bit more complicated to decide that the distribution of this tree is uniform.

It’s worth noting that this algorithm works to construct a uniform spanning tree on any connected base graph.

This post is about a few alternative constructions and interpretations of the uniform random tree. The first construction uses a Galton-Watson process. We take a Galton-Watson process where the offspring distribution is Poisson(1), and condition that the total population size is n. The resulting random tree has a root but no labels, however if we assign labels in [n] uniformly at random, the resulting rooted tree has the uniform distribution among rooted trees on [n].

Proof

This is all about moving from ordered trees to non-ordered trees. That is, when setting up a Galton-Watson tree, we distinguish between the following two trees, drawn extremely roughly in Paint:

That is, it matters which of the first-generation vertices have three children. Anyway, for such a (rooted) ordered tree T with n vertices, the probability that the Galton-Watson process ends up equal to T is

\mathbb{P}(GW = T) = \prod_{v\in T} \frac{e^{-1}}{C(v)!} = e^{-n} \prod_{v\in T}\frac{1}{C(v)!},

where C(v) is the number of children of a vertex v\in T. Then, since \mathbb{P}( |GW|=n ) is a function of n, we find

\mathbb{P}(GW=T \,\big|\, |GW|=n) = f(n)\prod_{v\in T} \frac{1}{C(v)!},

where f(n) is a function of n alone (ie depends on T only through its size n).

But given an unordered rooted tree t, labelled by [n], there are \prod_{v \in t} C(v)! ordered trees associated to t in the natural way. Furthermore, if we take the Poisson Galton-Watson tree conditioned to have total population size n, and label uniformly at random with [n], we obtain any one of these ordered trees with probability \frac{f(n)}{n!} \prod_{v\in t} \frac{1}{C(v)!}. So the probability that we have t after we forget about the ordering is \frac{f(n)}{n!}, which is a function of n alone, and so the distribution is uniform among the set of rooted unordered trees labelled by [n], exactly as required.

Heuristic for Poisson offspring distribution

In this proof, the fact that \mathbb{P}(C(v)=k)\propto \frac{1}{k!} exactly balances the number of orderings of the k children explains why Poisson(1) works out. Indeed, you can see in the proof that Poisson(c) works equally well, though when c\ne 1, the event we are conditioning on (namely that the total population size is n) has probability decaying exponentially in n, whereas for c=1, the branching process is critical, and the probability decays polynomially.

We can provide independent motivation though, from the Aldous-Broder construction. Both the conditioned Galton-Watson construction and the A-B algorithm supply the tree with a root, so we’ll keep that, and look at the distribution of the degree of the root as constructed by A-B. Let \rho=v_1,v_2,v_3,\ldots be the vertices [n], ordered by their discovery during the construction. Then \rho is definitely connected by an edge to v_2, but thereafter it follows by an elementary check that the probability \rho is connected to v_m is \frac{1}{n-1}, independently across all m. In other words, the distribution of the degree of \rho in the tree as constructed by A-B is

1+ \mathrm{Bin}\left(n-2,\frac{1}{n-1}\right) \approx 1+\mathrm{Poisson}(1).

Now, in the Galton-Watson process, conditioning the tree to have fixed, large size changes the offspring distribution of the root. Conveniently though, in a limiting sense it’s the same change as conditioning the tree to have size at least n. Since these events are monotone in n, it’s possible to take a limit of the conditioning events, and interpret the result as the Galton-Watson tree conditioned to survive. It’s a beautiful result that this interpretation can be formalised as a local limit. The limiting spine decomposition consists of an infinite spine, where the offspring distribution is a size-biased version of the original offspring distribution (and so in particular, always has at least one child) and where non-spine vertices have the original distribution.

In particular, the number of the offspring of the root is size-biased, and it is well-known and not hard to check that size-biasing Poisson(c) gives 1+Poisson(c) ! So in fact we have, in an appropriate limiting sense in both objects, a match between the degree distribution of the root in the uniform tree, and in the conditioned Galton-Watson tree.

This isn’t supposed to justify why a conditioned Galton-Watson tree is relevant a priori (especially the unconditional independence of degrees), but it does explain why Poisson offspring distributions are relevant.

Construction via G(N,p) and the random cluster model

The main reason uniform trees were important to my thesis was their appearance in the Erdos-Renyi random graph G(N,p). The probability that vertices {1, …, n} form a tree component in G(N,p) with some particular structure is

p^{n-1} (1-p)^{\binom{n}{2}-(n-1)} \times (1-p)^{n(N-m)}.

Here, the first two terms give the probability that the graph structure on {1, …, n} is correct, and the the final term gives the probability of the (independent) event that these vertices are not connected to anything else in the graph. In particular, this has no dependence on the tree structure chosen on [n] (for example, whether it should be a path or a star – both examples of trees). So the conditional distribution is uniform among all trees.

If we work in some limiting regime, where pn\rightarrow 0 (for example if n is fixed and p=\frac{1}{N}\rightarrow 0), then we can get away asymptotically with less strong conditioning. Suppose we condition instead just that [n] form a component. Now, there are more ways to form a connected graph with one cycle on [n] than there are trees on [n], but the former all require an extra edge, and so the probability that a given one such tree-with-extra-edge appears as the restriction to [n] in G(N,p) is asymptotically negligible compared to the probability that the restriction to [n] of G(N,p) is a tree. Naturally, the local limit of components in G(N,c/N) is a Poisson(c) Galton-Watson branching process, and so this is all consistent with the original construction.

One slightly unsatisfying aspect to this construction is that we have to embed the tree of size [n] within a much larger graph on [N] to see uniform trees. We can’t choose a scaling p=p(n) such that G(n,p) itself concentrates on trees. To guarantee connectivity with high probability, we need to take p> \frac{\log n}{n}, but by this threshold, the graph has (many) cycles with high probability.

At this PIMS summer school in Vancouver, one of the courses is focusing on lattice spin models, including the random cluster model, which we now briefly define. We start with some underlying graph G. From a physical motivation, we might take G to be \mathbb{Z}^d or some finite subset of it, or a d-ary tree, or the complete graph K_N. As in classical bond percolation (note G(N,p) is bond percolation on K_N), a random subset of the edges of G are included, or declared open. The probability of a given configuration w, with e open edges is proportional to

p^e (1-p)^{|E(G)| - e} q^{k(w)}, (*)

where the edge-weight p\in(0,1) as usual, and cluster weight q\in (0,\infty), and k(w) counts the number of connected components in configuration w. When q=1, we recover classical bond percolation (including G(N,p) ), while for q>1, this cluster-reweighting favours having more components, and q<1 favours fewer components. Note that in the case q\ne 1, the normalising constant (or partition function) of (*) is generally intractable to calculate explicitly.

As in the Erdos-Renyi graph, consider fixing the underlying graph G, and taking p\rightarrow 0, but also taking \frac{q}{p}\rightarrow 0. So the resulting graph asymptotically ‘wants to have as few edges as possible, but really wants to have as few components as possible’. In particular, 1) all spanning trees of G are equally likely; 2) any configuration with more than one component has asymptotically negligible probability relative to any tree; 3) any graph with a cycle has #components + #edges greater than that of a tree, and so is asymptotically negligible probability relative to any tree.

In other words, the limit of the distribution is the uniform spanning tree of G, and so this (like Aldous-Broder) is a substantial generalisation, which constructs the uniform random tree in the special case where G=K_n.

 

Birthday Coincidences and Poisson Approximations

This morning, Facebook was extremely keen to remind me via every available medium that four of my friends celebrate their birthday today. My first thought was that I hope they all enjoy their day, and my second thought was to ask what the chance of this was. I have about 200 Facebook friends, and so this struck me as an unlikely occurrence. But this problem has form, and it felt worthwhile to try some calculations to see if my intuition was well-founded.

rainbowfishcake_compressed

Siméon Denis Poisson celebrated his 234th birthday on 21st June this year.

The classical birthday problem

The starting point is the question: how many friends do you have to have before you expect to start seeing anyone sharing a birthday? There are a ridiculous number of articles about this on the web already, so I will say little, except that I don’t want to call this the ‘birthday paradox’, because it’s not a paradox at all. At best it might be counter-intuitive, but then the moral should be to change our intuition for this type of problem.

Throughout, let’s discount February 29th, as this doesn’t add much. So then, to guarantee having a shared pair of birthdays, you need to have 366 friends. But if you have a mere 23 friends, then the probability of having some pair that share a birthday is slightly greater than a half. The disparity between these two numbers leads to the counter-intuition. Some people might find it helpful to think that instead of counting friends, we should instead be counting pairs of friends, but I don’t personally find this especially helpful.

For me, thinking about the calculation in very slightly more generality is helpful. Here, and throughout, let’s instead take N to be the number of days in a year, and K the number of friends, or kids in the class if you prefer. Then, as usual, it is easier to calculate the probability that no two share a birthday (that is, that all the birthdays are distinct) than the probability that some two share a birthday. We could think of the number of ways to pick the set of birthdays, or we could look at the kids one-at-a-time, and demand that their birthday is not one of those we’ve already seen. Naturally, we get the same answer, that is

\frac{^N P_K}{N^K} = 1\cdot \frac{N-1}{N}\cdot\ldots \frac{N-K+1}{N}.

We’ve assumed here that all birthdates are equally likely. We’ll come back to this assumption right at the end. For now, let’s assume that both N and K are large, and we’ll try to decide roughly how large K has to be in relation to N for this answer to be away from 0 and 1. If we pair opposite terms up, we might approximate this by

(\frac{N-\frac{K}{2}}{N})^K = (1-\frac{K}{2N})^K\approx e^{-K^2/2N}.

In fact, AM-GM says that this is an overestimate, and a bit more care can be used to show that this is a good-approximation to first order. So we see that if K=\Theta(\sqrt{N}) for large N, we get a non-trivial limit.

Challenges for four-way shared birthdays

So the original problem I posed is harder, because there isn’t (unless I’m missing something) a natural way to choose birthdays one-at-a-time, or describe the set of suitable birthday sets. There are two major obstacles in a calculation such as this. Firstly, the overlap of people, that is we might have five or more birthdays overlapping; secondly, the overlap of days, that is we might have several days with four (or more) birthdays. We’ll end up worrying more about the second situation.

We start by eliminating both problems, by asking for the probability that exactly four friends are born on January 1st. The general form of this probability is \frac{\binom{K}{4} }{N^4} \cdot (\frac{N-1}{N})^{K-4}. Now, if K\ll N, this final term should not be significant. Removing this is not exactly the same as specifying the probability that at least four birthdays are on January 1st. But in fact this removal turns a lower bound (because {exactly four}<{at least four}) into an upper (in fact a union) bound. So if the factor being removed is very close to one, we can use whichever expression is more convenient.

In the real life case of N=365, K=200, this term is not negligible. But accounting for this, we get that the probability of exactly four birthdays on 1st January is ~0.0021. Our upper bound on the probability of at least four is ~0.0036.

But now that we know the probability for a given day, can we calculate (1-0.0021)^{365} to estimate the probability that we never have four-overlap? When we did our previous iterative calculation, we were using independence of the different kids’ birthdays. But the event that we have four-overlap on January 1st is not quite independent of the event that we have four-overlap on January 2nd. Why? Well if we know at least four people were born on January 1st, there are fewer people left (potentially) to be born on January 2nd. But maybe this dependence is mild enough that we can ignore it?

We can, however, use some moment estimates. The expected number of days with four-overlap is 365\cdot 0.0021 \approx 0.77. So the probability that there is at least one day with four-overlap is at most ~0.77.

But we really want a lower bound. So, maybe we can borrow exactly the second-moment argument we tried (there for isolated vertices in the random graph) in the previous post? Here, the probability that both January 1st and January 2nd are four-overlapping is

\frac{\binom{K}{4}\binom{K-4}{4}}{N^8}\cdot (\frac{N-2}{N})^{K-8}\approx 4.3\times 10^{-6}.

From this, we can evaluate the expectation of the square of the number of days with four-overlap, and thus find that the variance is ~0.74. So we use Chebyshev, calling this number of days #D for now:

\mathbb{P}(\# D=0)\le \mathbb{P}(|\#D - \mathbb{E}\# D|^2 \ge (\mathbb{E}\# D)^2 ) \le \frac{\mathrm{Var} \# D}{(\mathbb{E} \#D)^2}.

In our case, this unfortunately gives us an upper bound greater than 1 on this probability, and thus a lower bound of zero on the probability that there is at least one day with four-overlap. Which isn’t especially interesting…

Fairly recently, I spoke about the Lovasz Local Lemma, which can be used to find lower bounds on the probabilities of intersections of events, many of which are independent (in a particular precise sense). Perhaps this might be useful here? The natural choice of ‘bad event’ is that particular 4-sets of people share a birthday. There are \binom{K}{4} such events, and each is independent of the collection of \binom{K-4}{4} disjoint events. Thus we can consider using LLL if e\cdot (\binom{K}{4}-\binom{K-4}{4})\cdot 0.0021 \le 1. Unfortunately, this difference of binomial coefficients is large in our example, and so in fact the LHS has order 10^3.

Random number of friends – coupling to a Poisson Process

All of these methods failed because without independence we had to use estimates which were really not tight at all. But we can re-introduce independence if we remove the constraints on the model. Suppose instead of demanding I have K friends, I instead demand that I have a random number of friends, with distribution Poisson(K). Now it is reasonable to assume that for each day, I have a Poisson(K/365) friends with that birthday, independently for each day.

If we end up having exactly K friends with this random choice, then the distribution of the number of 4-overlap days is exactly the same as in the original setup. However, crucially, if we end up having at most K friends with this random choice, the distribution of the number of 4-overlap days is stochastically dominated by the original distribution. So instead let’s assume we have Poisson(L) friends, where L<K, and see how well we can do. For definiteness, we’ll go back to N=365, K=200 now. Let’s say X is the distribution of birthdays in the original model, and \Xi for the distribution of birthdays in the model with a random number of friends

Then

\mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) = 1- \mathbb{P}(\mathrm{Po}(L/365)\le 3)^365. (*)

Now we can write the crucial domination relation as

\mathbb{P}(\exists \ge 4\text{-overlap in }X)\ge \mathbb{P}( \exists \ge 4\text{-overlap in }\Xi \,|\, |\Xi|\le 200),

and then use an inequality version of the law of total probability to bound further as

\ge \frac{ \mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) - \mathbb{P}(|\Xi|>200)}{\mathbb{P}(|\Xi|\le 200)}.

This is a function of L, and in principle we could find its maximum, perhaps as N\rightarrow\infty. Here, though, let’s just take L=365/2 and see what happens. For (*) we get ~0.472.

To estimate \mathbb{P}(\mathrm{Po}(365/2)>200), observe that this event corresponds to 1.4 standard deviations above the mean, so we can approximate using quantiles of the normal distribution, via the CLT. (Obviously this isn’t completely precise, but it could be made precise if we really wanted.) I looked up a table, and this probability is, conveniently for calculations, roughly 0.1. Thus we obtain a lower bound of \frac{0.472-0.1}{0.9}. Allowing for the fairly weak estimates at various points, we still get a lower bound of around 0.4. Which is good, because it shows that my intuition wasn’t right, but that I was in the right ball-park for it being a ‘middle-probability event’.

Remarks and References

– The reason for doing the upper bound for the probability of exact 4-overlap is that the same argument for at-least-4-overlap would have given an upper bound of 1. However, this Poisson Process coupling is also a much better method for obtaining an upper bound on either event.

– Birthdays are not uniformly distributed through the year. The deviation is strong enough that even from the set of birth frequencies (rather than the sequence of birth frequencies), we can reject a null hypothesis of uniformity. Early September is pretty close to the maximum. Two comments: 1) this is the time of year where small variations in birth date have a big effect on education, especially in primary school; 2) we are 37 weeks into the year…

– It is known that 187 friends is the first time the probability of having at-least-4-overlap is greater than ½. You can find the full sequence on OEIS as A014088. I used to have about 650 Facebook friends, before I decided that I’d prefer instead the pleasant surprise of finding out what old acquaintances were up to when I next spoke to them. In this case, the median of the distribution of the largest number sharing a birthday would be seven.

– Eric Weisstein’s article on Mathworld is, in my opinion, the best resource for a mathematician on the first two pages of Google hits by at least two orders of magnitude. In the notation of this article, we were calculating P_4(n=200,d=365). There are also some good general asymptotics, or at least recipes for asymptotics, in equations (17) and (18).

– The paper Methods for Studying Coincidences by Diaconis and Mosteller is, as one might expect, extremely readable, and summarises many results and applications, including several generalisations.

Guess 2/3 of the average

The following problem was doing the rounds on some email lists in Oxford this week:

“You and a group of colleagues are all invited to pick an integer between 0 and 100. The average x is calculated, and the person whose guess is unique and comes closest to 2x/3 (among the unique guesses) is declared the winner.”

Although it wasn’t specified in the setting, one assumes that if two unique guesses were equally close to 2/3 of the average, then the two participants shared the prize.

In any case, it is interesting to ask the question, what is the optimal strategy?

I don’t know very much about game theory, so I’m going to say some things that I hope are true. I don’t claim that this is the best or even a good way of thinking about the situation. If we break the problem down into two parts, we have two of the more famous problems in game theory. To consider the first part, we remove the requirement that the winner’s guess be unique, and allow the participants to pick any real number between 0 and 100.

It seems to me there are three ways to do this. Let’s say we have K possible winners who guessed closed to 2x/3. We could give each of them the fixed prize, say £1. Or we could give each of them £1/K, or we could choose one at random and give them £1. I claim that the latter two methods are essentially the same, and so will ignore the third possibility.

Note that 2x/3 will definitely be at most 200/3. Suppose I pick a real value y that is greater than 200/3. What would happen if instead I picked 200/3? Well the value of x would decrease slightly. How much it decreases would depend on the number of people playing the game. It is not hard to check that we end up closer to 2x/3 if we pick 200/3 instead of y. Let’s call \alpha=\frac{200}{3}.

We might say that the strategy “pick \alpha” dominates the strategy “pick y”, because we always do at least as well with the first strategy as the second, regardless of what the other players do, and indeed in some cases do strictly better. Now comes the potentially philosophical bit. We have concluded that it would be irrational for us to pick any number greater than \alpha. We also know that the email was sent to mathematicians in Oxford, whom we might hope we could safely assume were equally rational. So now we are playing the same game as before, only with the bounds that we should pick a number between 0 and \alpha. We can then apply exactly the same argument to show that it would be irrational to pick any number greater than \frac{2\alpha}{3}. This equally applies to all the other agents, and so we continue, showing that for any k, it is irrational to pick any number greater than (\frac23)^k\alpha. In conclusion, the only rational strategy is to pick 0, and hence everyone will do this.

For the prize structure suggested, this is a Nash equilibrium, since no-one can improve their winnings by changing strategy (with everyone else’s kept the same). This principle is in general referred to as iterated elimination of dominated strategies. The assumption that, not only are players rational, but that all players know that all other players are rational, and all players know that all players know the others rational and so on, is called Common Knowledge of Rationality. This doesn’t seem hugely controversial in this artificial setting, but it is not hard to think of examples where this might not hold. For example, some players might be facing the same situation, but have different utility functions, so especially if there is some randomness in the dynamics, people will have different ‘rational’ opinions concerning strategies, and unless you know their utility functions, it might seem like the other players are not acting rationally.

This argument works equally well with integers. We first eliminate all possibilities greater than 67, and then keep dividing by 2/3, but we have to take the ceiling function to get the new upper bound. As a result, after repeated application of this idea, we conclude that the only rational strategies are to pick 0 or 1.

It can be seen that neither strategy dominates the other. If more than ¾ of the agents choose 1, then choosing 1 is the optimal strategy; if less than ¾ choose 1, then choosing 0 is better. If exactly ¾ choose 1 then everyone wins. There are, however, only two Nash equilibrium, given by everyone choosing 1, and everyone choosing 0. It is tempting to argue that we might expect roughly ½ the agents to choose 1, and so our better option is to choose 0. But we are assuming the other agents are rational, and there is no a priori reason why picking 0 or 1 with probability ½ each should be the rational thing to do.

However, choosing randomly is a possible strategy. In this setting, it becomes harder to talk about Nash equilibria. When there are just two agents, the set of even random strategies for each agent is relatively manageable, but when there is some arbitrary number of players, the set of possible strategies might be very complicated indeed.

The final stage of this integer problem is morally equivalent to playing a game where everyone has to pick 0 or 1, and you win if you are in the majority. For this problem, it is clear that the only Nash equilibria are for everyone to pick either 0 or 1. If the winners share the prize-money equally, then this game has the interesting property that every configuration is Pareto optimal. This means that no-one can improve their winnings without causing someone else’s situation to decrease. When it comes to deciding what to do, with no knowledge of other people’s strategies, we have made little progress. At least here the symmetry between 0 and 1 makes it clear how little progress we can make. The only thing we can say is that if the utility is concave (a generally reasonable assumption), then by conditioning on what everyone else does, an optimal strategy is to pick 0 with probability 1/2.

The problem with saying anything sensible in these sorts of problem with a zero-sum condition (as we effectively have if the winnings are shared) is the symmetry between all the players. Given any strategy, if everyone used that strategy, then everyone would win 1/N in expectation. This applies equally well if we add the condition that the winner has to pick something unique.

The only question we can perhaps say something about occurs if we say that no-one wins the original game if there are no unique integer answers. Then it is worth considering which strategy could everyone adopt so as to minimise the probability of this happening. For the reasons given above, we should indeed force everyone to take the same strategy, which we can interpret as a distribution on the integers. So which distribution on the integers between 1 and k maximises the probability of having at least one unique value?

We also need to know how many people are playing, so let’s say this is N. I don’t know how to solve the general problem, but I can say something if we fix either k or N and then let the other one go to infinity.

If we fix the number of people and let k go to infinity, then we can get arbitrarily close to probability one by using the uniform distribution. Indeed, any sequence of distributions where the probabilities converge uniformly to zero ( ie \max_{x\in[k]} \mathbb{P}(x) \rightarrow 0 ) has this property.

The case where k is fixed and the number of players goes to infinity is a bit trickier. Essentially, if we choose any fixed distribution, the event we seek becomes less and less likely as the number of agents grows. It is equivalent to demanding that if we roll a dice a million times, we only see precisely one six. If we replace a million by larger numbers, this probability decays exponentially.

If we want to maximise the probability of getting precise one agent choosing value 1, we observe that the number of agents choosing that value is binomial(N,p), where p is the probability of that value. If p=o(1/N) then asymptotically, with high probability the number of 1s is 0. If p=w(1/N) then asymptotically, with high probability, the number of 1s is \sim pN. Taking p=\frac{\alpha}{N}, the asymptotic distribution of the number of 1s is Poisson with parameter \alpha. We can solve to see which value of \alpha maximises the probability of getting a single 1. It turns out, unsurprisingly after considering the expectation of the corresponding pre-limit binomial distribution, that this maximum is achieved at \alpha=1.

So note now that if we take the probabilities of picking 1 and 2 both to be 1/N, we get two Poisson random variables asymptotically. For a similar argument to the construction of the Poisson process from independent infinitesimal increments, the covariance of these tends to 0. So I conjecture, and I reckon it is probably not too hard to come up with a formal proof that for large N, the optimal distribution looks like:

(\frac{1}{N},\ldots,\frac{1}{N},1-\frac{k-1}{N}),

or some permutation of that.

Enhanced by Zemanta

Characterisations of Geometric Random Graphs

Continuing the LMS-EPSRC summer school on Random Graphs, Geometry and Asymptotic Structure, we’ve now had three of the five lectures by Mathew Penrose on Geometric Random Graphs.

The basic idea is that instead of viewing a graph entirely abstractly, we now place the vertices in the plane, or some other real space. In many network situations, we would expect connectivity to depend somehow on distance. Agents or sites which are close together might be considered more likely to have the sort of relationship indicated by being connected with an edge. In the model discussed in this course, this dependence is deterministic. We have some parameter r, and once we have chosen the location of all the vertices, we connect a pair of vertices if the distance between them is less than r.

For the purposes of this, we work in a compact space [0,1]^d, and we are interested in the limit as the number of vertices n grows to infinity. To avoid the graph getting too connected, as in the standard random graph model, we take r to be a decreasing function of n. Anyway, we place the n points into the unit hypercube uniformly at random, and then the edges are specified by the adjacency rule above. In general, because r_n will be o(1), we won’t have to worry too much above boundary effects. The number of vertices within r_n of the boundary of the cube will be o(1). For some results, this is a genuine problem, when it may be easier to work on the torus.

In G(n,p), the order of np in the limit determines the qualitative structure of the graph. This is the expected degree of a given fixed vertex. In this geometric model, the relevant parameter is nr_n^d, where d is the dimension of the hypercube. If this parameter tends to 0, we say the graph is sparse, and dense if it tends to infinity. The intermediate case is called a thermodynamic limit. Note that the definition of sparse here is slightly different from G(n,p).

Much of the content of the first three lectures has been verifying that the distributions of various quantities in the graph, for example the total number of edges, are asymptotically Poisson. Although sometimes arguments are applicable over a broad spectrum, we also sometimes have to use different calculations for different scaling windows. For example, it is possible to show convergence to a Poisson distribution for the number of edges in the sparse case, from which we get an asymptotic normal approximation almost for free. In the denser regimes, the argument is somewhat more technical, with some substantial moment calculations.

A useful tool in these calculations are some bounds derived via Stein’s method for sums of ‘almost independent’ random variables. For example, the presence or non-presence of an edge between two pairs of vertices are independent in this setting if the pairs are disjoint, and the dependence is still only mild if they share a vertex. An effective description is via a so-called dependency graph, where we view the random variables as the vertices of a graph, with an edge between them if there is some dependence. This description doesn’t have any power in itself, but it does provide a concise notation for what would otherwise be very complicated, and we are able to show versions of (Binomials converge to Poisson) and CLT via these that are exactly as required for this purpose.

In particular, we are able to show that if E_n is the total number of edges, under a broad set of scaling regimes, if \lambda_n is the expected total number of edges, then d_{TV}(E_n,\mathrm{Po}(\lambda_n))\rightarrow 0, as n grows. This convergence in total variation distance is as strong a result as one could hope for, and when the sequence of \lambda_n is O(1), we can derive a normal approximation as well.

At this point it is worth discussing an alternative specification of the model. Recall that for a standard homogenous random graph, we have the choice of G(n,m) and G(n,p) as definitions. G(n,m) is the finer measure, and G(n,p) can be viewed as a weighted mix of G(n,m). We can’t replicate this directly in the geometric setting because the edges and non-edges are a deterministic function of the vertex locations. What we can randomise is the number of vertices. Since we are placing the vertices uniformly at random, it makes sense to consider as an alternative a Poisson Point Process with intensity n. The number of vertices we get overall will be distributed as Po(n), which is concentrated near n, in the same manner as G(n,c/n).

As in G(n,p), this is a less basic model because it is a mixture of the fixed-vertex models. Let’s see if how we would go about extending the total variation convergence result to this slightly different setting without requiring a more general version of the Poisson Approximation Lemma. To avoid having to define everything again, we add a ‘ to indicate that we are talking about the Poisson Point Process case. Writing d(.,.) for total variation distance, the result we have is:

\lim_{n\rightarrow\infty} d(E_n,\mathrm{Po}(\lambda_n))=0.

We want to show that

\lim_{n\rightarrow\infty}d(E_n',\mathrm{Po}(\lambda_n'))=0,

which we can decompose in terms of expectations in the original model by conditioning on N_n

\leq \lim_{n\rightarrow\infty}\mathbb{E}\Big[\mathbb{E}[d(E_{N_n},\mathrm{Po}(\lambda_n')) | N_n]\Big],

where the outer expectation is over N. The observation here, is that the number of points given by the Poisson process induces a measure on distributions, the overwhelming majority of which look quite like Poisson distributions with parameter n. The reason we have a less than sign is that we are applying the triangle inequality in the sum giving total variation distance:

d(X,Y)=\sum_{k\geq 0}|\mathbb{P}(X=k)-\mathbb{P}(Y=k)|.

From this, we use the triangle inequality again:

\lim_{n\rightarrow\infty} \mathbb{E}\Big[\mathbb{E}[d(E_{N_n},\mathrm{Po}(\lambda_{N_n})) | N_n]\Big]

+\lim_{n\rightarrow\infty}\mathbb{E}\Big[\mathbb{E}[d(\mathrm{Po}(\lambda_{N_n}),\mathrm{Po}(\lambda_n')) | N_n]\Big].

Then, by a large deviations argument, we have that for any \epsilon>0, \mathbb{P}(|N_n-n|\geq \epsilon n)\rightarrow 0 exponentially in n. Also, total variation distance is, by definition, bounded above by 1. In the first term, the inner conditioning on N_n is irrelevant, and we have that E_{N_n} converges to the Poisson distribution for any fixed N_n\in (n(1-\epsilon),n(1+\epsilon)). Furthermore, we showed in the proof of the non-PPP result that this convergence is uniform in this interval. (This is not surprising – the upper bound is some well-behaved polynomial in 1/n.) So with probability 1- e^{-\Theta(n)} N_n is in the region where this convergence happens, and elsewhere, the expected TV distance is bounded below 1, so the overall expectation tends to 0. With a similar LD argument, for the second term it suffices to prove that when \lambda\rightarrow\mu, we must have d(\mathrm{Po}(\lambda),\mathrm{Po}(\mu))\rightarrow 0. This is ‘obviously’ true. Formally, it is probably easiest to couple the distributions \mathrm{Bin}(n,\lambda/n),\mathrm{Bin}(n,\mu/n) in the obvious way, and carry the convergence of TV distance as the parameter varies through the convergence in n.

That all sounded a little bit painful, but is really just the obvious thing to do with each term – it’s only the language that’s long-winded!

Anyway, I’m looking forward to seeing how the course develops. In particular, when you split the space into small blocks, the connectivity properties resemble those of (site) percolation, so I wonder whether there will be concrete parallels. Also, after reading about some recent results concerning the metric structure of the critical components in the standard random graph process, it will be interesting to see how these compare to the limit of a random graph process which comes equipped with metric structure for free!

Preferential Attachment Models

I’ve just read a really interesting paper by Peter Morters and Maren Eckhoff that made me feel I should look up some of the background and write a quick post. I may get onto some of the results in the paper at the end of this post, but I want to start by saying a bit about the model itself. I’ve spoken about this briefly in a previous post about several descriptions of complex networks, but I think it’s worth having a second attempt.

We seek a model for random graphs that gives a distribution which exhibits some of the properties of the sort of complex networks seen in the real world. In particular, whereas the degree distribution is Poisson, and so concentrated with exponential tails for the Erdos-Renyi random graph, data indicates that a better model for most applications would have power law tails for this degree distribution.

Albert and Barabasi propose growing such a graph via a so-called preferential attachment scheme. We start with some small possibly empty graph, and add new vertices one at a time. For each new vertex, we add exactly M edges between the new vertex and the vertices already present. The choice of these M other vertices is given by weighting by the degree of the (pre-existing) vertices. That is, vertices with large degree are more likely to be joined to new vertices. This is obviously designed to replicate some of the behaviour seen in say the formation of the internet, where new sites are more likely to link to established and popular sites (Google, Youtube and so on) than a uniformly chosen site.

This model has a couple of problems. Firstly, it is not immediately obvious how to start it. Obviously we need M vertices present for the PA dynamics to start working. In fact, whether one starts with a empty graph or a complete graph on M vertices makes little difference to the large n behaviour. Trickier is the question of multiple edges, which may emerge if we define the PA dynamics in the natural way, that is for each of the M edges in turn. Overcoming this is likely to be annoying.

Bollobas and Riordan do indeed overcome this possible problems in a formal way, and prove that a version of this model does indeed have power law decay of the degree distribution, with exponent equal to 3. The model in the paper instead joins new vertex (n+1) to old vertex m with probability:

\frac{f(\text{in-degree of n})}{n},

where f is some function, which for now we assume has the form f(k)=\gamma k+\beta. Since the vertices are constructed one at a time, it is well-defined to orient these edges from new to old vertices, hence this notion of in-degree makes sense.

It was not obvious to me that this model was more general than the Bollobas/Riordan model, but we will explain this in a little while. First I want to explain why the Bollobas/Riordan model has power law tails, and how one goes about finding the exponent of this decay, since this was presented as obvious in most of the texts I read yet is definitely an important little calculation.

So let’s begin with the Bollobas/Riordan model. It makes sense to think of the process in terms of time t, so there are t – M vertices in the graph. But if t is large, this is essentially equal to t. We want to track the evolution of the degree of some fixed vertex v_i, the ith vertex to be formed. Say this degree is d(t) at time t. Then the total number of edges in the graph at time t is roughly tM. Therefore, the probability that a new vertex gets joined to vertex v is roughly \frac{Md}{2Mt}, where the M appears in the numerator because there are M fresh edges available. Note that we have ignored the possibility of trying to connect multiple edges from the new vertex to v, so this holds provided d is substantially smaller than t. With the boundary condition d(i)=M, this leads to the simple ODE

\dot{d}=\frac{d}{2t}\quad \Rightarrow\quad d=M(\frac{t}{i})^{1/2}.

To me at least it was not immediately clear why this implied that the tail of the degree distribution had exponent 3. The calculation works as follows. Let D be the degree of a vertex at large time t, chosen uniformly at random.

d_i\propto (\frac{t}{i})^{1/2}

\Rightarrow\quad \mathbb{P}(D\geq d)=\frac{1}{t}|\{i:(\frac{t}{i})^{1/2}\geq d\}|=\frac{1}{t}|\{i:i\leq \frac{t}{d^2}\}|=\frac{1}{d^2}

Now we consider the Eckhoff / Morters model. The main difference here is that instead of assuming that each new vertex comes with a fixed number of edges, instead the new vertex joins to each existing vertex independently with probability proportional to the degree of the existing vertex. More precisely, they assume that edges are directed from new vertices to old vertices, and then each new vertex n+1 is joined to vertex m<n+1 with probability \frac{f(\text{indegree of }m\text{ at time }n)}{n}\wedge 1, where f(k)=\gamma k +\beta, for \gamma\in[0,1), \beta>0.

I was stuck for a long time before I read carefully enough the assertion that \beta>0. Of course, if this doesn’t hold, then the graph won’t grow fast enough. For, since the function f is now linear, we can lift the statement about evolution of the degree of a vertex to a statement about the evolution of the total number of edges. Note that each edge contributes exactly one to the total number of in-degrees. So we obtain

\dot{E}=\frac{\gamma E}{t}\quad\Rightarrow E(t)\propto t^\gamma.

In particular, this is much less than t, so the majority of vertices have small degree. The answer is fairly clear in fact: since the preferential attachment mechanism depends only on in-degree, then if f(0)=0, since the in-degree of a new vertex will always be zero by construction, there is no way to get an additional edge to that vertex. So all the edges in the graph for large t will be incident to a vertex that had positive in-degree in the time 0 configuration. Hence we need \beta>0 for the model to be meaningful. Note that this means we effectively have a Erdos-Renyi type mechanism AND a preferential attachment evolution. As, for each new vertex, we add roughly \beta edges to existing vertices chosen uniformly at random (rather than by a PA method) and also some assigned via PA. A previous paper by Dereich and Morters shows that the asymptotic degree distribution has a power law tail with exponent

\tau:=\frac{\gamma+1}{\gamma}.

Note that \gamma=\frac12 gives the same exponent (3) as the Bollobas / Riordan model.

We can apply a similar ODE approximation as above to estimate the likely large time behaviour of the number of edges:

E'=\frac{\gamma E + \beta t}{t}.

So since E'\geq \beta, we have E\geq \beta t so defining F to be E(t)/t, we get:

tF'(t)=\beta-(1-\gamma)F(t)        (1)

Noting that F’ is positive when F< \frac{\beta}{1-\gamma} and negative when F>\frac{\beta}{1-\gamma} suggests that for large t, this is an equilibrium point for F and hence E(t)\approx \frac{\beta t}{1-\gamma}. Obviously, this is highly non-rigorous, as F’ can be very small and still satisfy the relation (1), so it is not clear that the ‘equilibrium’ for F is stable. Furthermore, one needs to check that the binomial variables that supply the randomness to this model are sufficiently concentrated that this approximation by expectation is reasonable.

Nonetheless, as a heuristic this is not completely unsatisfactory, and it leads to the conclusion that E(t) is a linear function of t, and so the distribution of the out-degrees for vertices formed at large times t is asymptotically Poisson, with parameter

\lambda =\frac{\beta\gamma}{1-\gamma}+\beta=\frac{\beta}{1-\gamma}.

Note that this is the same situation as in Erdos-Renyi. In particular, it shows that all the power tail behaviour comes from the in-degrees. In a way this is unsurprising, as these evolve in time, whereas the out-degree of vertex t does not change after time t. Dereich and Morters formalise this heuristic with martingale analysis.

The reason we are interested in this type of model is that it better reflects models seen in real life. Some of these networks are organic, and so there it is natural to consider some form of random destructive mechanism, for example lightning, that kills a vertex and all its edges. We have to compare this sort of mechanism, which chooses a vertex uniformly at random, against a targeted attack, which deletes the vertices with largest degree. Note that in Erdos-Renyi, the largest degree is not much larger than the size of the typical degree, because the degree distribution is asymptotically Poisson. We might imagine that this is not the case in some natural networks. For example, if one wanted to destroy the UK power network, it would make more sense to target a small number of sub-stations serving large cities, than, say, some individual houses. However, a random attack on a single vertex is unlikely to make much difference, since the most likely outcome by far is that we lose only a single house etc.

In Eckhoff / Morters’ model, the oldest vertices are by construction have roughly the largest degree, so it is clear what targeting the most significant \epsilon n vertices means. They then show that these vertices include all the vertices that give the power law behaviour. In particular, if you remove all of these vertices and, obviously, the edges incident to them, you are left with a graph with exponential tail in the asymptotic degree distribution, with largest degree on the order of log n. It was shown in a previous paper that this type of network is not vulnerable to random removal of nodes. Perhaps most interestingly, these authors now prove that after removing the most significant \epsilon n vertices, the network IS now vulnerable to random removal of nodes, leading to the conclusion that it is preferable to experience a random attack followed by a targeted attack than vice versa!

In a future (possibly distant) post, I want to say some slightly more concrete things about how these processes link to combinatorial stochastic processes I understand slightly better, in particular urn models. I might also discuss the configuration model, an alternative approach to generating complex random networks.

Poisson Tails

I’ve had plenty of ideas for potential probability posts recently, but have been a bit too busy to write any of them up. I guess that’s a good thing in some sense. Anyway, this is a quick remark based on an argument I was thinking about yesterday. It combines Large Deviation theory, which I have spent a lot of time learning about this year, and the Poisson process, which I have spent a bit of time teaching.

Question

Does the Poisson distribution have an exponential tail? I ended up asking this question for two completely independent reasons yesterday. Firstly, I’ve been reading up about some more complex models of random networks. Specifically, the Erdos-Renyi random graph is interesting mathematical structure in its own right, but the independent edge condition results in certain regularity properties which are not seen in many real-world networks. In particular, the degree sequence of real-world networks typically follows an approximate power law. That is, the tail is heavy. This corresponds to our intuition that most networks contain ‘hubs’ which are connected to a large region of the network. Think about key servers or websites like Wikipedia and Google which are linked to by millions of other pages, or the social butterfly who will introduce friends from completely different circles. In any case, this property is not observed in an Erdos-Renyi graph, where the degrees are binomial, and in the sparse situation, rescale in the limit to a Poisson distribution. So, to finalise this observation, we want to be able to prove formally that the Poisson distribution has an exponential (so faster than power-law) tail.

The second occurrence of this question concerns large deviations for the exploration process of a random graph. This is a topic I’ve mentioned elsewhere (here for the exploration process, here for LDs) so I won’t recap extensively now. Anyway, the results we are interested in give estimates for the rate of decay in probability for the event that the path defined by the exploration process differs substantially from the expected path as n grows. A major annoyance in this analysis is the possibility of jumps. A jump occurs if a set of o(n) adjacent underlying random variables (here, the increments in the exploration process) have O(n) sum. A starting point might be to consider whether O(1) adjacent RVs can have O(n) sum, or indeed whether a single Poisson random variable can have sum of order n. In practice, this asks whether the probability \mathbb{P}(X>\alpha n) decays faster than exponentially in n. If it does, then this is dominated on a large deviations scale. If it decays exactly exponentially in n, then we have to consider such jumps in the analysis.

Approach

We can give a precise statement of the probabilities that a Po(\lambda) random variable X returns a given integer value:

\mathbb{P}(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}.

Note that these are the terms in the Taylor expansion of e^{\lambda} appropriately normalised. So, while it looks like it should be possible to evaluate

\mathbb{P}(X>\alpha n)=e^{-\lambda}\sum_{\alpha n}^\infty \frac{\lambda^k}{k!},

this seems impossible to do directly, and it isn’t even especially obvious what a sensible bounding strategy might be.

The problem of estimating the form of the limit in probability of increasing unlikely deviations from expected behaviour surely reminds us of Cramer’s theorem. But this and other LD theory is generally formulated in terms of n random variables displaying some collective deviation, rather than a single random variable, with the size of the deviation growing. But we can transform our problem into that form by appealing to the three equivalent definitions of the Poisson process.

Recall that the Poisson process is the canonical description of, say, an arrivals process, where events in disjoint intervals are independent, and the expected number of arrives in a fixed interval is proportional to the width of the interval, giving a well-defined notion of ‘rate’ as we would want. The two main ways to define the process are: 1) the times between arrivals are given by i.i.d. Exponential RVs with parameter \lambda equal to the rate; and 2) the number of arrivals in interval [s,t] is independent of all other times, and has distribution given by Po(\lambda(t-s)). The fact that this definition gives a well-defined process is not necessarily obvious, but let’s not discuss that further here.

So the key equivalence to be exploited is that the event X>n for X\sim \text{Po}(\lambda) is a statement that there are at least n arrivals by time 1. If we move to the exponential inter-arrival times definition, we can write this as:

\mathbb{P}(Z_1+\ldots+Z_n<1),

where the Z’s are the i.i.d. exponential random variables. But this is exactly what we are able to specify through Cramer’s theorem. Recall that the moment generating function of an exponential distribution is not finite everywhere, but that doesn’t matter as we construct our rate function by taking the supremum over some index t of:

I(x)=\sup_t (xt-\log \mathbb{E}e^{tZ_1})=\sup_t(xt-\log(\frac{\lambda}{\lambda-t})).

A simple calculation then gives

I(x)=\lambda x-1 - \log \lambda x.

\Rightarrow I(x)\uparrow \infty\text{ as }x\downarrow 0.

Note that I(1) is the same for both Exp(\lambda) and Po(\lambda), because of the PP equality of events:

\{Z_1+\ldots+Z_n\leq n\}=\{\text{Po}(\lambda n)=\text{Po}(\lambda)_1+\ldots+\text{Po}(\lambda)_n> n\},

similar to the previous argument. In particular, for all \epsilon>0,

\mathbb{P}(\text{Po}(\lambda)>n)=\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}<\frac{1}{n})<\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}<\epsilon),\text{ for large }n.

\mathbb{P}(\text{Po}(\lambda)>n)=O(e^{-nI(\epsilon)}),\text{ for all }\epsilon.

Since we can take I(\epsilon) as large as we want, we conclude that the probability decays faster than exponentially in n.