Lecture 10 – the configuration model

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we enter the final stages of the semester, I want to discuss some extensions to the standard Erdos-Renyi random graph which has been the focus of most of the course so far. Although we will not get far into the details during this course, the overall goal is to develop models which are close to Erdos-Renyi in terms of ease of analysis, while also allowing more of the features characteristic of networks observed in the real world.

One of the more obvious deficiencies of the sparse regime of Erdos-Renyi random graphs for modelling ‘real-world phenomena’ concerns the degree sequence. Indeed, the empirical degree distribution of G(n,c/n) converges to Poisson(c). By contrast, in real-world networks, a much wider range of degrees is typically observed, and in many cases it is felt that these should follow a power law, with a small number of very highly connected agents.

One way around this problem is to construct random graphs where we insist that the graph has a given sequence of degrees. The configuration model, which is the subject of this lecture and this post (and about which I’ve written before), offers one way to achieve this.

Definition and notes

Let n\ge 1 and let d=(d_1,d_2,\ldots,d_n) be a sequence of non-negative integers such that \sum_{i=1}^n d_i is even. Then the configuration model with degree sequence d is a random multigraph with vertex set [n], constructed as follows:

  • To each vertex i\in[n], assign d_i half-edges;
  • Then, take a uniform matching of these half-edges;
  • Finally, for each pair of half-edges in the matching, replace the two half-edges with a genuine edge, to obtain the multigraph CM_n(d), in which, by construction, vertex i has degree d_i.
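
For concreteness, here is a minimal Python sketch of this construction: a uniform shuffle of the half-edges, paired off consecutively, induces a uniform matching (vertices 0-indexed for convenience; the function name is mine).

import random

def configuration_model(d):
    """Sample CM_n(d) as a list of edges (u, v) on vertices 0, ..., n-1.
    Loops and multiple edges are allowed, as in the multigraph definition."""
    assert sum(d) % 2 == 0, "the total degree must be even"
    stubs = [v for v, deg in enumerate(d) for _ in range(deg)]
    random.shuffle(stubs)                      # uniform shuffle => uniform matching
    return list(zip(stubs[::2], stubs[1::2]))  # pair off consecutive half-edges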

One should note immediately that although the matching is uniform, the multigraph is not uniform amongst multigraphs with that degree sequence. Note also that the condition on the sums of the degrees is necessary for any graph, and in this context means that the number of half-edges is even, without which it would not be possible to construct a matching.

This effect is manifest in the simplest possible example, when n=2 and d=(3,3). There are two possible graphs, up to isomorphism, which are shown below:

For obvious reasons, we might refer to these as the handcuffs and the theta, respectively. It’s helpful if we, temporarily, assume the half-edges are distinguishable at the moment we join them up in the configuration model construction. Then there are 3×3=9 ways to join them up to form the handcuffs (think of which half-edge at each vertex ends up forming the edge between the two vertices), while there are 3!=6 ways to pair up the half-edges to form the theta.
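
This 9:6 split is easy to corroborate empirically, reusing the configuration_model sketch above (for d=(3,3), the theta is precisely the outcome with no loops):

def no_loops(edges):
    return all(u != v for u, v in edges)

trials = 100_000
theta = sum(no_loops(configuration_model([3, 3])) for _ in range(trials))
print(theta / trials)   # ≈ 6/15 = 0.4; the other ≈ 9/15 of samples are handcuffs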

In general, for multigraphs H with the correct degree sequence, we have

\mathbb{P}( CM_n(d)\simeq H) \propto \left( 2^{\# \text{loops}(H)} \prod_{e\in E(H)} \text{mult}(e)! \right)^{-1},

where \text{mult}(e) is the multiplicity with which a given edge e appears in H.

Note: it might seem counterintuitive that this procedure is biased against multiple edges and self-loops, but it is really just saying that there are more ways to form two distinct edges than to form two equal edges (ie a multiedge pair) when we view the half-edges as distinguishable. (See this post for further discussion of this aspect in the 3-regular setting.)

A consequence of this result is that if we condition on the event that CM_n(d) is simple, then the resulting random graph is uniform on the set of simple graphs with degree sequence d. Note that the same example as above shows that there’s no guarantee that a simple graph with a given degree sequence exists at all.

d-regular configuration model

In general, from a modelling point of view, we are particularly interested in simple, connected graphs, and so it is valuable to study whether large instances of the configuration model are likely to have these properties. In this lecture, I will mainly focus on the case where the multigraphs are d-regular, meaning that all the vertices have degree equal to d. For the purposes of this lecture, we denote by G^d(n) the d-regular configuration model CM_n(d,\ldots,d).

  • d=1: to satisfy the parity condition on the sums of degrees, we must have n even. But then G^1(n) will consist of n/2 disjoint edges.
  • d=2: G^2(n) will consist of some number of disjoint cycles, and it is a straightforward calculation to check that when n is large, with high probability the graph will be disconnected.

In particular, I will focus on the case when d=3, which is the first interesting case. Most of the results we prove here can be generalised (under various conditions) to more general examples of the configuration model. The main goal of the lecture is revision of some techniques of the course, plus one new one, in a fresh setting, and the strongest possible versions of many of these results can be found amongst the references listed at the end.

Connectedness

In the lecture, we showed that G^3(2n) is connected with high probability. This is in fact a rather weak result, since G^d(n) is d-connected with high probability for d\ge 3 [Bol81, Wor81]. Here, d-connected means that one must remove at least d vertices in order to disconnect the graph, or, equivalently, that there are d vertex-disjoint paths between any pair of vertices. Furthermore, Bollobas shows that for d\ge 3, G^d(n) is a (random) expander family [Bol88].

Anyway, for the purposes of this course, the main tool is direct enumeration. The matching number M_{2k} satisfies

M_{2k}=(2k-1)\times (2k-3)\times\ldots\times 3\times 1 = \frac{(2k)!}{2^k \cdot k!},

and so Stirling’s approximation gives the asymptotics

M_{2k} = (\sqrt{2}+o(1)) \left(\frac{2}{e}\right)^k k^k,

although it will be useful to use the true bounds

c \left(\frac{2}{e}\right)^k k^k \le M_{2k}\le C\left(\frac{2}{e}\right)^k k^k,\quad \forall k,
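
Both the exact formula and the asymptotics are easy to sanity-check numerically — a quick sketch:

import math

def matchings(two_k):
    """M_{2k}: the number of perfect matchings of 2k objects."""
    k = two_k // 2
    return math.factorial(two_k) // (2 ** k * math.factorial(k))

for k in (5, 10, 20, 40):
    exact = matchings(2 * k)
    approx = math.sqrt(2) * (2 / math.e) ** k * k ** k
    print(k, exact / approx)   # the ratio tends to 1 as k grows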

instead in some places. Anyway, in G^3(2n), there are 6n half-edges in total, and so the probability that the graph may be split into two parts consisting of 2\ell,2m vertices, with 2\ell+2m=2n, and with no edges between the classes is \frac{\binom{2n}{2\ell} M_{6\ell}M_{6m}}{M_{6n}}.


Lecture 2 – Connectivity threshold

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

The goal of the second lecture was to establish the sharp phase transition for the connectivity of the random graph G(n,p(n)) around the critical regime p(n)\sim \frac{\log n}{n}. In the end, we showed that when \omega(n) is any diverging sequence and p(n)=\frac{\log n-\omega(n)}{n}, the graph G(n,p(n)) is with high probability not connected.

In the next lecture, we will finish the classification by studying p(n)=\frac{\log n+\omega(n)}{n}, and show that for this range of p, the graph G(n,p(n)) is with high probability connected.

The details of the lecture, especially the calculation, are not presented fully here. In the lecture, I followed van der Hofstad’s recent book fairly closely, sometimes taking different approximations and routes through the algebra, though all versions remain fairly close to the original enumerations of Renyi.

Immediate remarks

  • One is allowed to be surprised that for almost all scalings of p(n), G(n,p) is either whp connected or whp not connected. The speed of the transition is definitely interesting.
  • As defined in lectures, the property that a graph is connected is an increasing property, meaning that it is preserved when you add additional edges to the graph.
  • Because of the natural coupling between G(n,p) and G(n,q), the fact that connectedness is an increasing property makes life easier. For example, we can insist temporarily that \omega(n)\ll \log n, or whatever scaling turns out to be convenient for the proof, but conclude the result for all diverging \omega(n). This avoids the necessity for an annoying case distinction.

Heuristics – Isolated vertices

It turns out that the ‘easiest’ way for such a graph to be disconnected is for it to have an isolated vertex. In determining that the graph has a cut into classes of sizes a and b with no edges between them, there is a trade-off between the number of ways to choose the partition (which increases with min(a,b) ) and the probabilistic penalty from banning the ab edges between the classes (which decreases with min(a,b) ). It turns out that the latter effect is slightly stronger, and so (1,n-1) dominates.

Method 1: second-moment method

In the case p(n)=\frac{\log n - \omega(n)}{n}, we use a second-moment method argument to establish that G(n,p) contains an isolated vertex with high probability. Note that a given vertex v is isolated precisely if n-1 edges are not present. Furthermore, two given vertices v,w are both isolated, precisely if 2n-3 edges are not present. So in fact, both the first moment and the second moment of the number of isolated vertices are straightforward to evaluate.

It turns out that the number of isolated vertices, Y_n, satisfies

\mathbb{E}[Y_n]= \exp(\omega(n)+o(1))\rightarrow\infty. (*)

As always, we have to eliminate the possibility that this divergent expectation is achieved by the graph typically having no isolated vertices, but occasionally having very many. So we turn to the second moment, and can show

\mathrm{Var}(Y_n)= (1+o(1))\mathbb{E}[Y_n],

and so by Chebyshev’s inequality, we have \mathbb{P}(Y_n=0)\rightarrow 0.
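
Both moments are easy to probe in simulation. A naive O(n^2) sketch, taking \omega(n)=\log\log n as an arbitrary slowly-diverging choice:

import math, random

def isolated_vertices(n, p):
    """Sample G(n,p) pair by pair; return the number of isolated vertices."""
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                deg[i] += 1
                deg[j] += 1
    return sum(d == 0 for d in deg)

n = 1000
omega = math.log(math.log(n))
p = (math.log(n) - omega) / n
samples = [isolated_vertices(n, p) for _ in range(100)]
mean = sum(samples) / len(samples)
var = sum((y - mean) ** 2 for y in samples) / len(samples)
print(mean, math.exp(omega), var)   # mean ≈ e^omega, and variance ≈ mean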

Method 2: first-moment method

Counter-intuitively, although the case p(n)=\frac{\log n + \omega(n)}{n} requires only a first-moment method, it is more technical because it involves the less obvious direction of the informal equivalence:

\text{Connected}\; ``\iff ''\; \text{no isolated vertices}.

At the time we showed (*), we also showed that for this regime of p(n), G(n,p) whp has no isolated vertices. It remains to show that it has no splits into (unions of) connected components of sizes k and n-k.

Kernels of critical graph components

This post is motivated by G(N,p), the classical Erdos-Renyi random graph, specifically its critical window, when p=p(N)=\frac{1}{N}(1+\lambda N^{-1/3}).

We start with the following observation, which makes no restriction on p. Suppose a component of G(N,p) is a tree. Then, the graph geometry of this component is that of a uniform random tree on the appropriate number of vertices. This is deliberately informal. To be formal, we’d have to say “condition on a particular subset of vertices forming a tree-component” and so on. But the formality is broadly irrelevant, because at the level of metric scaling limits, if we want to describe the structure of a tree component, it doesn’t matter whether it has \log N or \frac{1}{7}N vertices, because in both cases the tree structure is uniform. The only thing that changes is the scaling factor.

In general, when V vertices form a connected component of a graph with E edges, we define the excess to be E-V+1. So the excess is non-negative, and is zero precisely when the component is a tree. I’m reluctant to say that the excess counts the number of cycles in the component, but certainly it quantifies the amount of cyclic structure present. We will sometimes, in a mild abuse of notation, talk about excess edges. But note that for a connected component with positive excess, there is a priori no way to select which edges would be the excess edges. In a graph process, or when there is some underlying exploration of the component, there sometimes might be a canonical way to classify the excess edges, though it’s worth remarking that the risk of size-biasing errors is always extremely high in this sort of situation.

Returning to the random graph process, as so often there are big changes around criticality. In the subcritical regime, the components are small, and most of them, even the largest with high probability, are trees. In the supercritical regime, the giant component has excess \Theta(N), which is qualitatively very different.

It feels like every talk I’ve ever given has begun with an exposition of Aldous’s seminal paper [Al97] giving a distributional scaling limit of the sizes of critical components in the critical window, and a relation between the process on this time-scale and the multiplicative coalescent. And it remains relevant here, because the breadth-first exploration process can also be used to track the number of excess edges.

In a breadth-first exploration, we have a stack of vertices we are waiting to explore. We pick one and look at its neighbours restricted to the rest of the graph, that is without the vertices we have already fully explored, and also without the other vertices in the stack. That’s the easiest way to handle the total component size. But we can simultaneously track how many times we would have joined to a neighbour within the stack, which leads to an excess edge, and Aldous derives a joint distributional scaling limit for the sizes of the critical components and their excesses. (Note that in this case, there is a canonical notion of excess edge, but it depends not just on the graph structure, but also on the extra randomness of the ordering within the breadth-first search.)

Roughly speaking, we consider the reflected exploration process, and its scaling limit, which is a reflected parabolically-drifting Brownian motion (though the details of this are not important at this level of exposition, except that it’s a well-behaved non-negative process that hits zero often). The component sizes are given by the widths of the excursions above zero, scaled up by a factor N^{2/3}. Then conditional on the shape of the excursion, the excess is Poisson with parameter the area under the excursion, with no rescaling. That is, a critical component has \Theta(1) excess.

So, with Aldous’s result in the background, when we ask about the metric structure of these critical components, we are really asking: “what does a uniformly-chosen connected component with fixed excess look like when the number of vertices grows?”

I’ll try to keep notation light, but let’s say T(n,k) is a uniform choice from connected graphs on n vertices with excess k.

[Note, the separation of N and n is deliberate, because in the critical window, the connected components have size n = \Theta(N^{2/3}), so I want to distinguish the two problems.]

In this post, we will mainly address the question: “what does the cycle structure of T(n,k) look like for large n?” When k=0, we have a uniform tree, and the convergence of this to the Brownian CRT is now well-known [CRT2, LeGall]. We hope for results with a similar flavour for positive excess k.

2-cores and kernels

First, we have to give a precise statement of what it means to study just the cycle structure of a connected component. From now on I will assume we are always working with a connected graph.

There are several equivalent definitions of the 2-core C(G) of a graph G:

  • When the excess is positive, there are some cycles. The 2-core is the union of all edges which form part of some cycle, and any edges which lie on a path between two edges which both form part of some cycle.
  • C(G) is the maximal induced subgraph where all degrees are at least two.
  • If you remove all the leaves from the graph, then all the leaves from the remaining graph, and continue, the 2-core is the state you arrive at where there are no leaves.
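
The third description translates directly into an algorithm. A minimal sketch, assuming the (for clarity, simple) graph is stored as a dict mapping each vertex to the set of its neighbours:

def two_core(adj):
    """Repeatedly strip vertices of degree at most 1; return what survives."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    to_remove = [v for v in adj if degree[v] <= 1]
    removed = set()
    while to_remove:
        v = to_remove.pop()
        if v in removed:
            continue
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1
                if degree[u] <= 1:
                    to_remove.append(u)
    return {v: adj[v] - removed for v in adj if v not in removed}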

It’s very helpful to think of the overall structure of the graph as consisting of the 2-core, with pendant trees ‘hanging off’ the 2-core. That is, we can view every vertex of the 2-core as the root of a (possibly size 1) tree. This is particularly clear if we remove all the edges of the 2-core from the graph. What remains is a forest, with one tree for each vertex of the 2-core.

In general, the k-core is the maximal induced subgraph where all degrees are at least k. The core is generally taken to be something rather different. For this post (and any immediate sequels) I will never refer to the k-core for k>2, and certainly not to the traditional core. So I write ‘core’ for ‘2-core’.

As you can see in the diagram, the core consists of lots of paths, and topologically, the lengths of these paths are redundant. So we will often consider instead the kernel, K(G), which is constructed by taking the core and contracting all the paths between vertices of degree greater than 2. The resulting graph has minimal degree at least three. So far we’ve made no comment about the simplicity of the original graphs, but certainly the kernel need not be simple. It will regularly have loops and multiple edges. The kernel of the graph and core in the previous diagram is therefore this:
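
In the same spirit as the leaf-stripping sketch above, here is a rough sketch of the contraction step, assuming the core is stored as a multigraph adjacency dict in which a loop at v contributes two entries to adj[v]; we splice out degree-2 vertices one at a time:

def contract_to_kernel(adj):
    """Replace each path a - v - b through a degree-2 vertex v by an edge a - b.
    An isolated cycle (the ambiguous k=1 situation) is left untouched."""
    stack = [v for v in adj if len(adj[v]) == 2]
    while stack:
        v = stack.pop()
        if v not in adj or len(adj[v]) != 2:
            continue
        a, b = adj[v]
        if a == v:                  # adj[v] == [v, v]: v sits on its own loop
            continue
        adj[a].remove(v)
        adj[b].remove(v)
        adj[a].append(b)            # if a == b this correctly creates a loop,
        adj[b].append(a)            # recorded twice, as per the convention
        del adj[v]
        stack.extend(u for u in (a, b) if len(adj[u]) == 2)
    return adj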

Kernels of critical components

To recap, we can deconstruct a connected graph as follows. It has a kernel, and each edge of the kernel corresponds to a path of some length in the core. The rest of the graph consists of trees hanging off from the core vertices.

For now, we ask about the distribution of the kernel of T(n,k). You might notice that the case k=1 is slightly awkward, as when the core consists of a single cycle, it’s somewhat ambiguous how to define the kernel. Everything we do is easily fixable for k=1, but rather than carry separate cases, we handle the case k\ge 2.

We first observe that fixing k doesn’t confirm the number of vertices or edges in the kernel. For example, both of the following pictures could correspond to k=3:

However, with high probability the kernel is 3-regular, which suddenly makes the previous post relevant. As I said earlier, it can introduce size-biasing errors to add the excess edges one-at-a-time, but these should be constant factor errors, not scaling errors. So imagine the core of a large graph with excess k=2. For the sake of argument, assume the kernel has the dumbbell / handcuffs shape. Now add an extra edge somewhere. It’s asymptotically very unlikely that this is incident to one of the two vertices with degree three in the core. Note it would need to be incident to both to generate the right-hand picture above. Instead, the core will gain two new vertices of degree three.

Roughly equivalently, once the size of the core is fixed (and large) we have to make a uniform choice from connected graphs of this size where almost every vertex has degree 2, and \Theta(1) of the rest have degree 3 or higher. But the sum of the degrees is fixed, because the excess is fixed. If there are n vertices in the core, then there are a factor \Theta(n) more graphs where all the vertices have degree 2 or 3 than graphs where some vertex has degree at least 4. Let’s state this formally.

Proposition: The kernel of a uniform graph with n vertices and excess k\ge 2 is, with high probability as n\rightarrow\infty, 3-regular.

This is proved rather more formally as part of Theorem 7 of [JKLP], essentially as a corollary after some very comprehensive generating function setup; and in [LPW] with a more direct computation.

In the previous post, we introduced the configuration model as a method for constructing regular graphs (or any graphs with fixed degree sequence). We observe that, conditional on the event that the resulting graph is simple, it is in fact uniformly-distributed among simple graphs. When the graph is allowed to be a multigraph, this is no longer true. However, in many circumstances, as remarked in (1.1) of [JKLP], for most applications the configuration model measure on multigraphs is the most natural.

Given a 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges, and K a uniform choice from the configuration model with these parameters, we have

\mathbb{P}\left( K \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1},

where t(H) is the number of loops in H, and mult(e) the multiplicity of an edge e. This might seem initially counter-intuitive, because it looks as if we are biasing against graphs with multiple edges, when perhaps our intuition is that because there are more ways to form a set of multiple edges we should bias in favour of it.

I think it’s most helpful to look at a diagram of a multigraph as shown, and ask how to assign stubs to edges. At a vertex with degree three, all stub assignments are different, that is 3!=6 possibilities. At the multiple edge, however, we care which stubs match with which stubs, but we don’t care about the order within the multi-edge. Alternatively, there are three choices of how to divide each vertex’s stubs into (2 for the multi-edge, 1 for the rest), and then two choices for how to match up the multi-edge stubs, ie 18 in total = 36/2, and a discount factor of 2.

We mention this because in fact K(T(n,k)) converges in distribution to this uniform configuration model. Once you know that K(T(n,k)) is with high probability 3-regular, then again it’s probably easiest to think about the core, indeed you might as well condition on its total size and number of degree 3 vertices. It’s then not hard to convince yourself that a uniform choice induces a uniform choice of kernel. Again, let’s state that as a proposition.

Proposition: For any 3-regular labelled multigraph H with 2(k-1) vertices and 3(k-1) edges as before,

\lim_{n\rightarrow\infty}\mathbb{P}\left( K(T(n,k)) \equiv H \right) \propto \left(2^{t(H)} \prod_{e\in E(H)} \mathrm{mult}(e)! \right)^{-1}.

As we said before, the kernel describes the topology of the core. To reconstruct the graph, we need to know the lengths in the core, and then how to glue pendant trees onto the core. But this final stage depends on k only through the total length of paths in the core. Given that information, it’s a combinatorial problem, and while I’m not claiming it’s easy, it’s essentially the same as for the case with k=1, and is worth treating separately.

It is worth clarifying a couple of things first though. Even the outline of methods above relies on the fact that the size of the core diverges as n grows. Again, the heuristic is that up to size-biasing errors, T(n,k) looks like a uniform tree with some uniformly-chosen extra edges. But distances in T(n,k) scale like n^{1/2} (and thus in critical components of G(N,p) scale like N^{1/3}). And the core will be roughly the set of edges on paths between the uniformly-chosen pairs of vertices, and so will also have length \Theta(n^{1/2}).

Once you have conditioned on the kernel structure, and the (large) number of internal vertices on paths in the core (ie the length of the core), it is natural that the assignment of the degree-2 vertices to core paths / kernel edges is uniform. A consequence of this is that if you record (Y_1,\ldots,Y_m) the lengths of paths in the core, where m=3(k-1), then

\frac{(Y_1,\ldots,Y_m)}{\sum Y_i} \stackrel{d}\rightarrow \mathrm{Dirichlet}(1,1,\ldots,1).

This is stated formally as Corollary 7 b) of [ABG09]. It’s worth noting that this confirms that the lengths of core paths are bounded in probability away from zero after the appropriate rescaling. In seeking a metric scaling limit, this is convenient as it means there’s no danger that two of the degree-3 vertices end up in ‘the same place’ in the scaling limit object.

To recap, the only missing ingredients now to give a complete limiting metric description of T(n,k) are 1) a distributional limit of the total core length; 2) some appropriate description of the set of pendant trees conditional on the size of the pendant forest. [ABG09] show the first of these. As remarked before, all the content of the second of these is encoded in the unicyclic k=1 case, which I have written about before, albeit slightly sketchily, here. (Note that in that post we get around size-biasing by counting a slightly different object, namely unicyclic graphs with an identified cyclic edge.)

However, [ABG09] also propose an alternative construction, which you can think of as glueing CRTs directly onto the stubs of the kernel (with the same distribution as before). The proof that this construction works isn’t as painful as one might fear, and allows a lot of the other metric distributional results to be read off as corollaries.

References

[ABG09] – Addario-Berry, Broutin, Goldschmidt – Critical random graphs: limiting constructions and distributional properties

[CRT2] – Aldous – The continuum random tree: II

[Al97] – Aldous – Brownian excursions, critical random graphs and the multiplicative coalescent

[JKLP] – Janson, Knuth, Luczak, Pittel – The birth of the giant component

[LeGall] – Le Gall – Random trees and applications

[LPW] – Luczak, Pittel, Wierman – The structure of a random graph at the point of the phase transition

 

Random 3-regular graphs

A graph is d-regular if every vertex has degree d. Probably the easiest examples of d-regular graphs are the complete graph on (d+1) vertices, and the infinite d-regular tree. A less trivial example is the Petersen graph, which is 3-regular. 3-regular graphs will be the main focus for some of this post, but initially we lose nothing by considering general d.

Throughout, a necessary condition for the existence of a d-regular graph with N vertices is that at least one of d and N is even, as the sum of the degrees of a graph must be even. We will always assume that this holds, so that when d=3, we are always taking N to be even.

A natural pair of questions for a probabilist is ‘can we sample a d-regular graph with N vertices uniformly at random?’ and ‘what does a typical large d-regular graph look like?’

In a rather old post, I addressed some aspects of the first question, but revisit it briefly here. A good idea, due to Bollobas [B80], is to assign d stubs (or half-edges) to each vertex, and choose a matching of the Nd stubs uniformly at random. This works as a method to generate a random graph with any fixed degree sequence.

If you want your graphs to be simple, this can go wrong, because there’s a chance you get loops (that is, an edge from a vertex v to itself) and multiple edges between the same pair of vertices. It would be nice if the graph formed in this fashion were simple with high probability as N\rightarrow\infty. Unfortunately that’s not the case; however, the probability that the graph is simple remains asymptotically bounded away from 0 and 1. Indeed, because the presence of a loop / multiple edge is asymptotically independent of the presence of a loop / multiple edge elsewhere, it’s unsurprising that we have a Poisson limit for the number of such occurrences. So from a sampling point of view, it’s reasonable to sample a graph in this way until you find a simple one. This takes an expected O(1) number of attempts, and it’s O(N) steps to check whether a given multigraph is simple.
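
For d=3 the limiting probability of simplicity is e^{-2}\approx 0.135 (the Poisson parameters for the number of loops and for the number of pairs of parallel edges are both 1 in this case), which is easy to corroborate by simulation — a quick sketch:

import random

def random_3_regular_multigraph(N):
    stubs = [v for v in range(N) for _ in range(3)]
    random.shuffle(stubs)                      # uniform matching of the 3N stubs
    return list(zip(stubs[::2], stubs[1::2]))

def is_simple(edges):
    seen = set()
    for u, v in edges:
        if u == v:
            return False                       # a loop
        e = (min(u, v), max(u, v))
        if e in seen:
            return False                       # a multiple edge
        seen.add(e)
    return True

trials, N = 10_000, 200
print(sum(is_simple(random_3_regular_multigraph(N)) for _ in range(trials)) / trials)
# ≈ exp(-2) ≈ 0.135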

It’s clear that conditional on the graph generated in this fashion being simple, its distribution is uniform on the set of simple graphs with the correct degree distribution. If you are happy for your graphs to have loops, then it’s a little bit more complicated, because if an edge has multiplicity k, these can appear in k! ways in the configuration construction.

Other asymptotic properties

Loops and multiple edges can be thought of as cycles of length 1 and 2 respectively if you want. We might ask about other small cycles. A calculation in expectation is relatively straightforward. Given three vertices, the probability they form a triangle (in at least one way) is \Theta(N^{-3}), and there are \Theta(N^3) ways to choose three vertices. Thus the expected number of triangles is \Theta(1). Finally, the edge structure induced on disjoint triples is asymptotically independent, and hence we get a Poisson limit. (See [J06] for details, including more detail on the general configuration construction.) The same result holds, for the same reasons, for cycles of any fixed finite length.

We might also ask about connectivity. At a heuristic level, there are two ways for the graph to be disconnected: it could have some small components; or it could have two components of size \Theta(N). The smallest possible component is K_4, and an argument like the one for cycles above shows that the number of copies of K_4 vanishes in expectation. Now, consider having two components of size roughly N/2. There are \binom{N}{N/2} \sim 2^{N} ways to make this choice, up to a polynomial factor. However, given such a choice, we can handle the probability that all the stubs from one class match within that class by going through the class one stub at a time:

\frac{\frac{3N}{2}-1}{3N-1} \times \frac{\frac{3N}{2}-3}{3N-3} \times \cdots \times \frac{1}{\frac{3N}{2}+1}.

This product telescopes to \frac{M_{3N/2}^2}{M_{3N}}, where M_{2k}=\frac{(2k)!}{2^k\, k!} counts the perfect matchings of 2k stubs, and Stirling-type asymptotics for these matching numbers give

(\sqrt{2}+o(1))\, 2^{-3N/2},

and this dominates the number of choices powerfully enough that we might believe it remains valid for a broader range of class sizes. In fact we have a much stronger statement, namely that G(N,3) is 3-connected with high probability. This means that the graph cannot be disconnected by removing two vertices, or equivalently that there are three vertex-disjoint paths between any pair of vertices in the graph, essentially one emerging from each stub. See this note by David Ellis for a quick proof. We might return to this later.
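
The balanced-cut estimate above is also easy to sanity-check numerically, using M_{2k}=\frac{(2k)!}{2^k k!} for the number of matchings (a quick check, not a proof):

import math

def matchings(two_k):
    k = two_k // 2
    return math.factorial(two_k) // (2 ** k * math.factorial(k))

for N in (8, 16, 24, 32):
    # expected number of balanced vertex splits with no crossing edges
    splits = math.comb(N, N // 2) * matchings(3 * N // 2) ** 2 / matchings(3 * N)
    print(N, splits)   # decays like 2^(-N/2), up to polynomial factors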

You might ask about planarity. It’s clear from degree considerations that there are no copies of K_5 in any 3-regular graph, and a first-moment calculation in the same spirit as for short cycles shows that with high probability there are no copies of K_{3,3} either. However, there might be minors of this form. This seemed a good example of the Kuratowski criterion not actually being that useful, since I certainly don’t find the minors of a 3-regular graph an obvious structure to handle.

However, we can use Euler’s formula V – E + F = 2 for planar graphs. Here V = N and E = 3N/2, so a planar embedding would require F = N/2 + 2 faces. Faces are bounded by cycles, and there are asymptotically O(1) cycles of any fixed length, so all but O(1) of the faces would have to have length greater than any fixed \ell. But each edge borders at most two faces, so the total length of all the faces is at most 2E = 3N, giving F \le \frac{3N}{\ell} + O(1). Taking \ell = 7, say, this is incompatible with F = N/2 + 2, and so with high probability Euler’s formula can’t hold in G(N,3) for large N.

We can also ask about the local limit of G(N,3). Since the vertices are exchangeable, we don’t need to worry about whether we choose the root uniformly at random (often referred to as the Benjamini-Schramm sense) or by some other method.

The root has up to three neighbours, and with high probability it has exactly three neighbours. These neighbours have at most two other neighbours themselves. However, we’ve already seen that there are asymptotically O(1) cycles, and so with high probability there are no small cycles near a fixed root vertex. So the six neighbours-of-neighbours are with high probability different to the root and the root’s neighbours and to each other. We can make this argument at arbitrary finite radius from the root, to conclude that the local limit of G(N,3) is the infinite 3-regular tree.

Spectral expansion

[Caveat – this is something I read about and wanted to mention, but I really don’t know much at all about any of this theory, and it’s definitely not certain that what follows wouldn’t be better replaced by a set of links.]

This straightforward local limit offers good heuristics on some of the more global properties. Almost by definition, the d-regular tree expands as rapidly as possible away from the root among infinite d-regular graphs. There are a number of ways to measure the expansion of a graph, and some methods transfer better to the infinite setting than others. The adjacency matrix of an infinite graph can be defined similarly to that of a finite graph, and it remains possible to talk about eigenfunctions and spectrum. As in the finite setting, d is an eigenvalue because the tree is d-regular, and -d is an eigenvalue because it is also bipartite.

The next largest eigenvalue \lambda_2 governs the spectral gap d-\lambda_2 which is a measure of the expansion of a graph. A graph is a good (spectral) expander if all the non-trivial eigenvalues are close to zero. A priori, all we know is that |\lambda_2|\le d. For the infinite d-regular tree, we have \lambda_2 = 2\sqrt{d-1}. This blog post by Luca Trevisan gives a very readable proof.

A key result is that a finite graph can have \lambda_2 \le 2\sqrt{d-1}, but asymptotically this is the best possible. That is, taking N to be the number of vertices:

\lambda_2 \ge 2\sqrt{d-1} - o_N(1).

This is the content of the Alon-Boppana theorem [Al86]. In fact the error can be quantified as O(\frac{1}{\log N}) – the diameter of the graph is relevant here. A finite d-regular graph for which \lambda_2\le 2\sqrt{d-1} is called a Ramanujan graph. The existence of Ramanujan graphs has been much studied, and various constructions often rely on number theoretic properties of N, and lie at the interface of disparate branches of mathematics where my understanding is zero rather than epsilon.

Now return to our view of the d-ary tree as the local limit of a d-regular graph on N vertices for large N. We might expect from everything above that the uniform d-regular graph is a good expander. Bollobas shows that in the sense of edge-expansion, asymptotically almost all d-regular graphs have edge-expansion bounded away from zero. (See Section 2 of [Ell], including history of the d=3 case.) Friedman [Fri08] proves the conjecture of Alon that for every \epsilon>0, a.a.s. \lambda_2 for G(N,d) is at most 2\sqrt{d-1}+\epsilon. In this sense, G(N,d) is asymptotically ‘almost Ramanujan’. (See also [Bor17] for another proof and an introduction including history, context and references.)

Some other links: the Wikipedia page on expanders, which includes a discussion of the different descriptions of expansion, and the Cheeger inequalities and other relations between them; slides for a talk by Spielman on spectra and Ramanujan graphs; and a survey by Murty on Ramanujan graphs.

What next?

This post took a slightly different direction from what I had intended, and rather than make a halting U-turn back to my planned finale, I’ll postpone this. However, a short overture is that I’m interested in the structure of critical components of random graphs during the critical window. This is the window during which the largest components first have cycles with probability \Theta(1). Indeed, the critical components have size \Theta(N^{2/3}) and \Theta(1) surplus edges. Conditional on their size, and number of surplus edges, the choice of the graph structure on the component is uniform among such (connected) graphs.

Addario-Berry, Broutin and Goldschmidt [ABG09] study scaling limits of such components. Central to this analysis is the 2-core of such components, which can be described in terms of 3-regular (multi)graphs. Various processes we are now interested in running on the critical components of critical RGs can then be studied in terms of related processes on random 3-regular graphs.

References

[ABG09] – Addario-Berry, Broutin, Goldschmidt – Critical random graphs: limiting constructions and distributional properties

[Al86] – Alon – Eigenvalues and expanders

[B80] – Bollobas – A probabilistic proof of an asymptotic formula for the number of labelled regular graphs

[B88] – Bollobas – The isoperimetric number of random regular graphs

[Bor17] – Bordenave – A new proof of Friedman’s second eigenvalue theorem and its extension to random lifts. Arxiv.

[Ell] – Ellis – The expansion of random regular graphs

[Fri08] – Friedman – A proof of Alon’s second eigenvalue conjecture and related problems

[J06] – Janson – The probability that a random multigraph is simple

Generating uniform trees

A long time ago, I wrote quite a few things about uniform trees. That is, a uniform choice from the n^{n-2} unrooted trees with vertex set [n]. This enumeration, normally called Cayley’s formula, has several elegant arguments, including the classical Prufer bijection. But making a uniform choice from a large set is awkward, and so we seek more probabilistic methods to sample such a tree, which might also give insight into the structure of a ‘typical’ uniform tree.

In another historic post, I talked about the Aldous-Broder algorithm. Here’s a quick summary. We run a random walk on the complete graph K_n started from a uniformly-chosen vertex. Every time we arrive at a vertex we haven’t visited before, we record the edge just traversed. Eventually we have visited all n vertices, so have recorded n-1 edges. It’s easy enough to convince yourself that these n-1 edges form a tree (how could there be a cycle?) and a bit more complicated to decide that the distribution of this tree is uniform.

It’s worth noting that this algorithm works to construct a uniform spanning tree on any connected base graph.
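
On the complete graph, the walk is only a few lines of code. A minimal sketch (for a general connected base graph, the uniform proposal would be replaced by a uniform neighbour of the current vertex):

import random

def aldous_broder_uniform_tree(n):
    """Uniform spanning tree of K_n: random walk, keeping first-entrance edges."""
    current = random.randrange(n)
    visited = {current}
    edges = []
    while len(visited) < n:
        nxt = random.randrange(n - 1)      # uniform step to one of the other n-1 vertices
        if nxt >= current:
            nxt += 1
        if nxt not in visited:
            edges.append((current, nxt))   # record the edge of first discovery
            visited.add(nxt)
        current = nxt
    return edges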

This post is about a few alternative constructions and interpretations of the uniform random tree. The first construction uses a Galton-Watson process. We take a Galton-Watson process where the offspring distribution is Poisson(1), and condition that the total population size is n. The resulting random tree has a root but no labels, however if we assign labels in [n] uniformly at random, the resulting rooted tree has the uniform distribution among rooted trees on [n].
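
Before the proof, note that this description gives an immediate, if inefficient, rejection sampler — a sketch, with Knuth’s product-of-uniforms method standing in for a Poisson(1) generator. Since the process is critical, the acceptance probability decays only polynomially in n (see the remark after the proof):

import math, random

def poisson1():
    """Poisson(1) sample via Knuth's product-of-uniforms method."""
    L, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def conditioned_gw_tree(n):
    """Poisson(1) Galton-Watson tree conditioned to have total size n.
    Returns a parent list, with parent[0] = None for the root."""
    while True:
        parent, queue = [None], [0]
        while queue and len(parent) <= n:
            v = queue.pop()
            for _ in range(poisson1()):
                parent.append(v)
                queue.append(len(parent) - 1)
        if not queue and len(parent) == n:
            return parent   # now assign the labels [n] uniformly at random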

Proof

This is all about moving from ordered trees to non-ordered trees. That is, when setting up a Galton-Watson tree, we distinguish between the following two trees, drawn extremely roughly in Paint:

That is, it matters which of the first-generation vertices have three children. Anyway, for such a (rooted) ordered tree T with n vertices, the probability that the Galton-Watson process ends up equal to T is

\mathbb{P}(GW = T) = \prod_{v\in T} \frac{e^{-1}}{C(v)!} = e^{-n} \prod_{v\in T}\frac{1}{C(v)!},

where C(v) is the number of children of a vertex v\in T. Then, since \mathbb{P}( |GW|=n ) is a function of n, we find

\mathbb{P}(GW=T \,\big|\, |GW|=n) = f(n)\prod_{v\in T} \frac{1}{C(v)!},

where f(n) is a function of n alone (ie depends on T only through its size n).

But given an unordered rooted tree t, labelled by [n], there are \prod_{v \in t} C(v)! ordered trees associated to t in the natural way. Furthermore, if we take the Poisson Galton-Watson tree conditioned to have total population size n, and label uniformly at random with [n], we obtain any one of these ordered trees with probability \frac{f(n)}{n!} \prod_{v\in t} \frac{1}{C(v)!}. So the probability that we have t after we forget about the ordering is \frac{f(n)}{n!}, which is a function of n alone, and so the distribution is uniform among the set of rooted unordered trees labelled by [n], exactly as required.

Heuristic for Poisson offspring distribution

In this proof, the fact that \mathbb{P}(C(v)=k)\propto \frac{1}{k!} exactly balances the number of orderings of the k children explains why Poisson(1) works out. Indeed, you can see in the proof that Poisson(c) works equally well, though when c\ne 1, the event we are conditioning on (namely that the total population size is n) has probability decaying exponentially in n, whereas for c=1, the branching process is critical, and the probability decays polynomially.

We can provide independent motivation though, from the Aldous-Broder construction. Both the conditioned Galton-Watson construction and the A-B algorithm supply the tree with a root, so we’ll keep that, and look at the distribution of the degree of the root as constructed by A-B. Let v_1=\rho, v_2,v_3,\ldots be the vertices of [n], ordered by their discovery during the construction. Then \rho is definitely connected by an edge to v_2, but thereafter it follows by an elementary check that the probability \rho is connected to v_m is \frac{1}{n-1}, independently across all m. In other words, the distribution of the degree of \rho in the tree as constructed by A-B is

1+ \mathrm{Bin}\left(n-2,\frac{1}{n-1}\right) \approx 1+\mathrm{Poisson}(1).

Now, in the Galton-Watson process, conditioning the tree to have fixed, large size changes the offspring distribution of the root. Conveniently though, in a limiting sense it’s the same change as conditioning the tree to have size at least n. Since these events are monotone in n, it’s possible to take a limit of the conditioning events, and interpret the result as the Galton-Watson tree conditioned to survive. It’s a beautiful result that this interpretation can be formalised as a local limit. The limiting spine decomposition consists of an infinite spine, where the offspring distribution is a size-biased version of the original offspring distribution (and so in particular, always has at least one child) and where non-spine vertices have the original distribution.

In particular, the number of the offspring of the root is size-biased, and it is well-known and not hard to check that size-biasing Poisson(c) gives 1+Poisson(c) ! So in fact we have, in an appropriate limiting sense in both objects, a match between the degree distribution of the root in the uniform tree, and in the conditioned Galton-Watson tree.

This isn’t supposed to justify why a conditioned Galton-Watson tree is relevant a priori (especially the unconditional independence of degrees), but it does explain why Poisson offspring distributions are relevant.

Construction via G(N,p) and the random cluster model

The main reason uniform trees were important to my thesis was their appearance in the Erdos-Renyi random graph G(N,p). The probability that vertices {1, …, n} form a tree component in G(N,p) with some particular structure is

p^{n-1} (1-p)^{\binom{n}{2}-(n-1)} \times (1-p)^{n(N-n)}.

Here, the first two terms give the probability that the graph structure on {1, …, n} is correct, and the final term gives the probability of the (independent) event that these vertices are not connected to anything else in the graph. In particular, this has no dependence on the tree structure chosen on [n] (for example, whether it should be a path or a star – both examples of trees). So the conditional distribution is uniform among all trees.

If we work in some limiting regime, where pn\rightarrow 0 (for example if n is fixed and p=\frac{1}{N}\rightarrow 0), then we can get away asymptotically with less strong conditioning. Suppose we condition instead just that [n] form a component. Now, there are more ways to form a connected graph with one cycle on [n] than there are trees on [n], but the former all require an extra edge, and so the probability that a given one such tree-with-extra-edge appears as the restriction to [n] in G(N,p) is asymptotically negligible compared to the probability that the restriction to [n] of G(N,p) is a tree. Naturally, the local limit of components in G(N,c/N) is a Poisson(c) Galton-Watson branching process, and so this is all consistent with the original construction.

One slightly unsatisfying aspect to this construction is that we have to embed the tree on [n] within a much larger graph on [N] to see uniform trees. We can’t choose a scaling p=p(n) such that G(n,p) itself concentrates on trees. To guarantee connectivity with high probability, we need to take p> \frac{\log n}{n}, but by this threshold, the graph has (many) cycles with high probability.

At this PIMS summer school in Vancouver, one of the courses is focusing on lattice spin models, including the random cluster model, which we now briefly define. We start with some underlying graph G. From a physical motivation, we might take G to be \mathbb{Z}^d or some finite subset of it, or a d-ary tree, or the complete graph K_N. As in classical bond percolation (note G(N,p) is bond percolation on K_N), a random subset of the edges of G are included, or declared open. The probability of a given configuration w, with e open edges is proportional to

p^e (1-p)^{|E(G)| - e} q^{k(w)}, (*)

where the edge-weight p\in(0,1) as usual, and cluster weight q\in (0,\infty), and k(w) counts the number of connected components in configuration w. When q=1, we recover classical bond percolation (including G(N,p) ), while for q>1, this cluster-reweighting favours having more components, and q<1 favours fewer components. Note that in the case q\ne 1, the normalising constant (or partition function) of (*) is generally intractable to calculate explicitly.

As in the Erdos-Renyi graph, consider fixing the underlying graph G, and taking p\rightarrow 0, but also taking \frac{q}{p}\rightarrow 0. So the resulting graph asymptotically ‘wants to have as few edges as possible, but really wants to have as few components as possible’. In particular, 1) all spanning trees of G are equally likely; 2) any configuration with more than one component has asymptotically negligible probability relative to any tree; 3) any graph with a cycle has #components + #edges greater than that of a tree, and so is asymptotically negligible probability relative to any tree.

In other words, the limit of the distribution is the uniform spanning tree of G, and so this (like Aldous-Broder) is a substantial generalisation, which constructs the uniform random tree in the special case where G=K_n.

 

Antichains in the grid

In the previous post on this topic, we discussed Dilworth’s theorem on chains and antichains in a general partially ordered set. In particular, whatever the size of the largest antichain in a poset, it is possible to partition the poset into exactly that many chains. So for various specific posets, or the directed acyclic graphs associated to them, we are interested in the size of this largest antichain.

The following example turned out to be more interesting than I’d expected. At a conventional modern maths olympiad, there are typically three questions on each paper, and for reasons lost in the mists of time, each student receives an integer score between 0 and 7 per question. A natural question to ask is “how many students need to sit a paper before it’s guaranteed that one will score at least as highly as another on every question?” (I’m posing this as a straight combinatorial problem – the correlation between scores on different questions will be non-zero and presumably positive, but that is not relevant here.)

The set of outcomes is clearly \{0,1,\ldots,7\}^3, with the usual weak domination partial order inherited from \mathbb{R}^3. Then an antichain corresponds to a set of triples of scores such that no triple dominates another triple. So the answer to the question posed is: “the size of the largest antichain in this poset, plus one.”

In general, we might ask about \{1,2,\ldots,n\}^d, again with the weak domination ordering. This directed graph, which generalises the hypercube as well as our example, is called the grid.

Heuristics for the largest antichain

Retaining the language of test scores on multiple questions is helpful. In the previous post, we constructed a partition of the poset into antichains, indexed by the elements of some maximal chain, by starting with the sources, then looking at everything descended only from sources, and so on. (Recall that the statement that this is possible was referred to as the dual of Dilworth’s theorem.) In the grid, there’s a lot of symmetry (in particular under the mapping x\mapsto n+1-x in every coordinate), and so you end up with the same family of antichains whether you work upwards from the sources or downwards from the sinks. (Or vice versa depending on how you’ve oriented your diagram…) The layers of antichains also have a natural interpretation – each layer corresponds to a given total score. It’s clear a priori why each of these is an antichain. If A scores the same as B overall, but strictly more on the first question, this must be counterbalanced by a strictly lower score on another question.

So a natural guess for the largest antichain is the largest antichain corresponding to some fixed total score. Which total score should this be? It ought to be the middle layer, that is total score \frac{(n+1)d}{2}, or the two values directly on either side if this isn’t an integer. My intuition was probabilistic. The uniform distribution on the grid is achieved by IID uniform distributions in each coordinate, which you can think of as a random walk, especially if you subtract off the mean first. It feels that any symmetric random walk should have mode zero or next-to-zero. Certainly this works asymptotically in a rescaled sense by CLT, and in a slightly stronger sense by local CLT, but we don’t really want asymptotics here.

When I started writing the previous paragraph, I assumed there would be a simple justification for the claim that the middle layer(s) was largest, whether by straight enumeration, or some combinatorial argument, or even generating functions. Perhaps there is, and I didn’t spot it. Induction on d definitely works though, with a slightly stronger hypothesis that the layer sizes are symmetric around the median, and monotone on either side of the median. The details are simple and not especially interesting, so I won’t go into them.

From now on, the hypothesis is that this middle layer of the grid is the largest antichain. Why shouldn’t it, for example, be some mixture of middle-ish layers? (*) Well, heuristically, any score sequence in one layer removes several possibilities from a directly adjacent layer, and it seems unlikely that this effect is going to cancel out if you take some intermediate number of score sequences in the first layer. Also, the layers get smaller as you go away from the middle, so because of the large amount of symmetry (coordinates are exchangeable etc), it feels reasonable that there should be surjections between layers in the outward direction from the middle. The union of all these surjections gives a decomposition into chains.

This result is in fact true, and its proof by Bollobas and Leader, using shadows and compression, can be found in the very readable Sections 0 and 1 of [1].

Most of the key ideas to a compression argument are present in the case n=2, for which some notes by Leader can be found here, starting with Proof 1 of Theorem 3, the approach of which is developed over subsequent sections. We treat the case n=2, but focusing on a particularly slick approach that does not generalise as successfully. We also return to the original case d=3 without using anything especially exotic.

Largest antichain in the hypercube – Sperner’s Theorem

The hypercube \{0,1\}^d is the classical example. There is a natural correspondence between the vertices of the hypercube, and subsets of [d]. The ordering on the hypercube corresponds to the ordering given by containment on \mathcal{P}([d]). Almost by definition, the k-th layer corresponds to subsets of size k, and thus includes \binom{d}{k} subsets. The claim is that the size of the largest antichain is \binom{d}{\lfloor d/2 \rfloor}, corresponding to the middle layer if d is even, and one of the two middle layers if d is odd. This result is true, and is called Sperner’s theorem.

I know a few proofs of this from the Combinatorics course I attended in my final year at Cambridge. As explained, I’m mostly going to ignore the arguments using compression and shadows, even though these generalise better.

As in the previous post, one approach is to exhibit a covering family of exactly this number of disjoint chains. Indeed, this can be done layer by layer, working outwards from the middle layer(s). The tool here is Hall’s Marriage Theorem, and we verify the relevant condition by double-counting. Probably the hardest case is demonstrating the existence of a matching between the middle pair of layers when d is odd.

Take d odd, and let d':= \lfloor d/2\rfloor. Now consider any subset S of the d’-th layer \binom{[d]}{d'}. We now let the upper shadow of S be

\partial^+(S):= \{A\in \binom{[d]}{d'+1}\,:\, \exists B\in S, B\subset A\},

the sets in the (d’+1)-th layer which lie above some set in S. To apply Hall’s Marriage theorem, we have to show that |\partial^+(S)|\ge |S| for every choice of S.

We double-count the number of edges in the hypercube from S to \partial^+(S). Firstly, for every element B\in S, there are exactly d’+1 upward edges, all of which land in \partial^+(S). Secondly, for every element A\in\partial^+(S), there are exactly d’+1 edges down to \binom{[d]}{d'}, and so in particular there are at most d’+1 edges to elements of S. Thus

(d'+1) |S|=|\text{edges }S\leftrightarrow\partial^+(S)| \le (d'+1) |\partial^+(S)|,

which is exactly what we require for Hall’s MT. The argument for the matching between other layers is the same, with a bit more notation, but also more flexibility, since it isn’t a perfect matching.

The second proof looks at maximal chains. Recall, in this context, a maximal chain is a sequence \mathcal{C}=B_0\subset B_1\subset\ldots\subset B_d where each B_k\in \binom{[d]}{k}. We now consider some largest-possible antichain \mathcal{A}, and count how many maximal chains include an element A\in\mathcal{A}. If |A|=k, it’s easy to convince yourself that there are k!(d-k)! such maximal chains. However, given A\ne A'\in\mathcal{A}, the set of maximal chains containing A and the set of maximal chains containing A’ are disjoint, since \mathcal{A} is an antichain. From this, we obtain

\sum_{A\in\mathcal{A}} |A|!\,(d-|A|)! \le d!. (**)

Normally after a change of notation, so that we are counting the size of the intersection of the antichain with each layer, this is called the LYM inequality after Lubell, Yamamoto and Meshalkin. The heuristic is that the sum of the proportions of layers taken up by the antichain is at most one. This is essentially the same as earlier at (*). This argument can also be phrased probabilistically, by choosing a *random* maximal chain, and considering the probability that it intersects the proposed largest antichain, which is, naturally, at most one. Of course, the content is the same as this deterministic combinatorial argument.

Either way, from (**), the statement of Sperner’s theorem follows rapidly, since we know that \binom{d}{|A|}\le \binom{d}{\lfloor d/2\rfloor} for all A.

Largest antichain in the general grid

Instead of attempting a proof or even a digest of the argument in the general case, I’ll give a brief outline of why the previous arguments don’t transfer immediately. It’s pretty much the same reason for both approaches. In the hypercube, there is a lot of symmetry within each layer. Indeed, almost by definition, any vertex in the k-th layer can be obtained from any other vertex in the k-th layer just by permuting the labels (or permuting the coordinates if thinking as a vector).

The hypercube ‘looks the same’ from every vertex, but that is not true of the grid. Consider for clarity the n=8, d=3 case we discussed right at the beginning, and compare the scores (7,0,0) and (2,2,3). The number of maximal chains through (7,0,0) is \binom{14}{7}, while the number of maximal chains through (2,2,3) is \binom{7}{2, 2,3}\binom{14}{4,5,5}, and the latter is a lot larger, which means any attempt to use the second argument is going to be tricky, or at least require an extra layer of detail. Indeed, exactly the same problem arises when we try and use Hall’s condition to construct the optimal chain covering directly. In the double-counting section, it’s a lot more complicated than just multiplying by d’, as was the case in the middle of the hypercube.

Largest antichain in the d=3 grid

We can, however, do the d=3 case. As we will see, the main reason we can do the d=3 case is that the d=2 case is very tractable, and we have lots of choices for the chain coverings, and can choose one which is well-suited to the move to d=3. Indeed, when I set this problem to some students, an explicit listing of a maximal chain covering was the approach some of them went for, and the construction wasn’t too horrible to state.

[Another factor is that it is computationally feasible to calculate the size of the middle layer, which is much more annoying for d>3.]

[I’m redefining the grid here as \{0,1,\ldots,n-1\}^d rather than \{1,2,\ldots,n\}^d.]

The case distinction between n even and n odd is going to make both the calculation and the argument annoying, so I’m only going to treat the even case, since n=8 was the original problem posed. I should be honest and confess that I haven’t checked the n odd case, but I assume it’s similar.

So when n is even, there are two middle layers, namely \frac{3n}{2}-2 and \frac{3n}{2}-1 (corresponding to total scores 10 and 11 in the original problem). I calculated the number of elements in the \frac{3n}{2}-1 layer by splitting based on the value of the first coordinate. I found it helpful to decompose the resulting sum as

\sum_{k=0}^{n-1} = \sum_{k=0}^{\frac{n}{2}-1} + \sum_{k=\frac{n}{2}}^{n-1},

based on whether there is an upper bound, or a lower bound on the value taken by the second coordinate. This is not very interesting, and I obtained the answer \frac{3n^2}{4}, and of course this is an integer, since n is even.

Now to show that any antichain has size at most \frac{3n^2}{4}. Here we use our good control on the chain coverings in the case d=2. We note that there is a chain covering of the (n,d=2) grid where the chains have 2n-1, 2n-3,…, 3, 1 elements (%). We get this by starting with a maximal chain, then taking a maximal chain on what remains etc. It’s pretty much the first thing you’re likely to try.

Consider an antichain of size A in the (n,d=3) grid, and project onto the second and third coordinates. The image points are distinct, because two elements with the same image would differ only in the first coordinate, and so would form a chain. So we have A points in the (n,d=2) grid. How many can lie in each chain of the decomposition (%)? Well, if more than n points lie in a single chain of (%), then by pigeonhole two of them must be images of elements of the (n,d=3) grid with the same first coordinate, and these two elements would satisfy a containment relation. So in fact there are at most n image points in any of the chains of (%), and since (%) consists of n chains, we have a bound of n^2. But of course, some of the chains in (%) have length less than n, so we are throwing away information. Indeed, the number of image points in a given chain is at most

\min(n,\text{length of chain}),

and so the number of image points in total is bounded by

n+\ldots+n+ (n-1)+(n-3)+\ldots+1,

where there are n/2 copies of n in the first half of the sum. Since (n-1)+(n-3)+\ldots+1=(n/2)^2, evaluating this sum gives \frac{n^2}{2}+\frac{n^2}{4}=\frac{3n^2}{4}, exactly as we wanted.
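For completeness, here's a sketch (Python, with my own helper names) of the chain covering (%) of the (n,d=2) grid, obtained by repeatedly peeling off a maximal chain, together with the \min(n,\text{chain length}) bound from the projection argument:

```python
def hook_chain_cover(n):
    """Partition {0,...,n-1}^2 into n chains of sizes 2n-1, 2n-3, ..., 3, 1,
    by repeatedly peeling off a maximal chain of whatever grid remains:
    up the left-hand column, then along the top row."""
    chains = []
    for k in range(n):
        top = n - 1 - k
        chains.append([(k, j) for j in range(top + 1)]
                      + [(i, top) for i in range(k + 1, n)])
    return chains

def is_chain(pts):
    return all(a[0] <= b[0] and a[1] <= b[1] for a, b in zip(pts, pts[1:]))

n = 8
cover = hook_chain_cover(n)
assert all(is_chain(c) for c in cover)
assert sorted(len(c) for c in cover) == list(range(1, 2 * n, 2))
assert sorted(p for c in cover for p in c) == [(i, j) for i in range(n) for j in range(n)]

# The projection argument: each chain carries at most min(n, length) images.
print(sum(min(n, len(c)) for c in cover))  # 48 = 3 * 8^2 / 4
```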

References

[1] – Bollobás, Leader (1991) – Compressions and Isoperimetric Inequalities.

Chains and antichains

I’ve recently been at the UK-Hungary winter olympiad camp in Tata, for what is now my sixth time. As well as doing some of my own work, I have enjoyed the rare diversion of some deterministic combinatorics. It seems to be a local variant of the pigeonhole principle that, given six days at a mathematical event in Hungary, at least one element of {Ramsey theory, Erdos-Szekeres, antichains in the hypercube} will be discussed, with probability one. On this occasion, all were discussed, so I thought I’d write something about at least one of them.

Posets and directed acyclic graphs

This came up on the problem set constructed by the Hungarian leaders. The original formulation asked students to show that among any 17 positive integers, there are either five such that no one divides any other, or five such that among any pair, one divides the other.

It is fairly clear why number theory plays little role. We assign the given integers to the vertices of a graph, and whenever a divides b, we add a directed edge from the vertex corresponding to a to the vertex corresponding to b. Having translated the given situation into a purely combinatorial statement, fortunately we can translate the goal into the same language. If we can find a chain of four directed edges (hence five vertices – beware confusing use of the word ‘length’ here) then we have found the second possible option. Similarly, if we can find an antichain, a set of five vertices with no directed edges between them, then we have found the first possible option.

It’s worth noting that the directed graph we are working with is transitive. That is, whenever there is an edge a->b and an edge b->c, there is also an edge a->c. This follows immediately from the divisibility condition. There are also no directed cycles in the graph, since otherwise there would be a cycle of integers in which each divides its successor. But of course, when a divides b and these are distinct positive integers, b is strictly larger than a, and so this relation cannot cycle.

In fact, among a set of positive integers, divisibility defines a partial order, which we might choose to define as any ordering for which the associated directed graph is transitive and acyclic, although obviously we could use language more naturally associated with orderings. Either way, from now on we consider posets and the associated DAGs (directed acyclic graphs) interchangeably.

Dilworth’s theorem

In the original problem, we are looking for either a large chain or a large antichain. We are trying to prove that it’s not possible to have largest chain size at most four and largest antichain size at most four when there are 17 vertices, so we suspect there may be some underlying structure: in some sense perhaps the vertex set is the ‘product’ of a chain and an antichain, or at least there is a way of producing an antichain from each element of some chain.

Anyway, one statement of Dilworth’s theorem is as follows:

Statement 1: in a poset with nm+1 elements, there is either a chain of size n+1, or an antichain of size m+1.

Taking n=m=4 immediately finishes the original problem about families of divisors. While this is the most useful statement here, it’s probably not the original, which says the following:

Statement 2: in a poset, there exists a decomposition \mathcal{C} into chains, and an antichain A, such that |\mathcal{C}|=|A|.

Remark 1: Note that for any decomposition into chains and any antichain, we have |\mathcal{C}|\ge |A|, since you can’t have more than one representative from any chain in the antichain. So Statement 2 is saying that equality does actually hold.

Remark 2: Statement 1 follows immediately from Statement 2. If all antichains had size at most m, then there would be a decomposition into at most m chains. But each chain has size at most n, so the total size of the poset would be at most mn. Contradiction.
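As a sanity check on Statement 1 with n=m=4, here's a brute-force Python sketch on random 17-element sets of positive integers (helper names are mine; this is verification, not a proof):

```python
import random
from itertools import combinations

def longest_chain(nums):
    """Longest chain under divisibility, by DP over the sorted DAG."""
    nums = sorted(nums)
    dp = [1] * len(nums)
    for i, a in enumerate(nums):
        for j in range(i):
            if nums[j] != a and a % nums[j] == 0:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp)

def has_antichain(nums, size):
    """Is there a set of `size` elements, no one dividing another?"""
    comparable = lambda a, b: a % b == 0 or b % a == 0
    return any(all(not comparable(a, b) for a, b in combinations(sub, 2))
               for sub in combinations(nums, size))

# Statement 1 with n = m = 4: any 17 distinct positive integers contain
# a chain of five or an antichain of five.
for _ in range(100):
    nums = random.sample(range(1, 1000), 17)
    assert longest_chain(nums) >= 5 or has_antichain(nums, 5)
print("no counterexample found")
```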

Unsuccessful proof strategies for Dilworth

Since various smart young people who didn’t know the statement or proof of Dilworth’s theorem attempted to find it (in the form of Statement 1, and in a special case) under timed conditions, it’s easy to talk about what doesn’t work, and to try to gain intellectual value by qualifying why.

  • Forgetting directions: in general one might well attack a problem by asking whether we have more information than we need. But ignoring the directions of the edges is throwing away too much information. After doing this, antichains are fine, but maybe you need to exhibit some undirected ‘chains’. Unless these undirected chains are much longer than you are aiming for, you will struggle to reconstruct directed chains out of them.
  • Where can the final vertex go?: in a classic trope, one might exhibit a directed graph on nm vertices with neither a chain of size n+1 nor an antichain of size m+1. We attempt to argue that this construction is essentially unique, and that it goes wrong when we add an extra vertex. As a general point, it seems unlikely to be easier to prove that exactly one class of configurations has a given property in the nm case than to prove that no configuration has the same property in the nm+1 case. A standalone proof of uniqueness is likely to be hard, or a disguised rehash of an actual proof of the original statement.
  • Removing a chain: if you remove a chain of maximal length, which for contradiction we may assume has size at most n, then what you have left has at least n(m-1)+1 vertices. If you have a long chain left, then you’re done, although maximality has gone wrong somewhere. So by induction you have an antichain of size m in what remains. But it’s totally unclear why it should be possible to extend this antichain with one of the vertices you’ve just removed.

An actual proof of Dilworth (Statement 1), and two consequences

This isn’t really a proof, rather a way of classifying the vertices in the directed graph from which this version of Dilworth follows quickly. As we said earlier, we imagine there may be some product structure. In particular, we expect to be able to find a maximal chain, and a nice antichain associated to each element of the maximal chain.

We start by letting V_0 consist of all the vertices which are sources, that is, have zero indegree. These are minima in the partial ordering setting. Now let V_1 consist of all vertices whose in-neighbourhood is entirely contained in V_0, that is, they are descendants only of V_0. Then let V_2 consist of all remaining vertices whose in-neighbourhood is entirely contained in V_0\cup V_1 (but not entirely in V_0, otherwise they would already have been treated), and so on. We end up with what one might call an onion decomposition V_0,V_1,\ldots,V_k of the vertices, based on how far they are from the sources. We can then find a chain of size k+1 by starting with any vertex in V_k and constructing backwards towards a source. However, k+1 is also the largest possible size of a chain, because every time we move up a level in the chain, we must move from some V_i to some V_j with j>i.

It’s easy to check that each V_i is an antichain. So if k\ge n, we have a chain of size n+1, while if k\le n-1, then by pigeonhole some V_i contains at least \lceil \frac{nm+1}{n}\rceil = m+1 vertices, giving an antichain of size m+1. Thus we can read off Statement 1. A little more care, and probably an inductive argument, is required to settle Statement 2.

We have however proved what is often called the dual of Dilworth’s theorem, namely that in a poset there exists a chain C, and a decomposition into a collection \mathcal{A} of antichains, for which |C|=|\mathcal{A}|.
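Here is a minimal sketch of the onion decomposition in Python (the divisibility example and function names are my own). The number of levels equals the size of a longest chain, and each level is an antichain, which is exactly the dual statement:

```python
def onion_decomposition(vertices, less_than):
    """Peel the poset into levels: V_0 the minima, then V_{i+1} the elements
    all of whose predecessors already lie in V_0 u ... u V_i. Each level is
    an antichain, and the number of levels is the size of a longest chain."""
    placed, remaining, levels = set(), set(vertices), []
    while remaining:
        level = {v for v in remaining
                 if all(u in placed for u in vertices if less_than(u, v))}
        levels.append(level)
        placed |= level
        remaining -= level
    return levels

nums = [1, 2, 3, 4, 5, 6, 8, 12, 16, 18]
levels = onion_decomposition(nums, lambda a, b: a != b and b % a == 0)
print(levels)  # [{1}, {2, 3, 5}, {4, 6}, {8, 12, 18}, {16}] -- 5 levels,
               # matching the longest chain 1 | 2 | 4 | 8 | 16
```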

Finally, as promised, returning to Erdos-Szekeres, if not to positive integers. We apply Dilworth Statement 1 to a sequence of m^2+1 real numbers a_0,a_1,\ldots,a_{m^2}, with the partial ordering a_i\rightarrow a_j if i\le j and a_i\le a_j. Chains correspond to increasing subsequences, and antichains to decreasing subsequences, so we have shown that there is a monotone subsequence of length m+1.
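A quick check of this corollary (Python sketch, helpers mine): any m^2+1 distinct reals contain a monotone subsequence of length m+1.

```python
import random

def longest_increasing(seq):
    """Length of a longest increasing subsequence, by quadratic DP."""
    best = []
    for i, x in enumerate(seq):
        best.append(1 + max((best[j] for j in range(i) if seq[j] <= x), default=0))
    return max(best)

m = 5
for _ in range(1000):
    seq = [random.random() for _ in range(m * m + 1)]
    inc = longest_increasing(seq)
    dec = longest_increasing([-x for x in seq])  # decreasing = increasing of negation
    assert max(inc, dec) >= m + 1
print("monotone subsequence of length m+1 found every time")
```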

 

Turan’s Theorem

Turan’s theorem gives bounds on the number of edges required in a graph on a fixed number of vertices n to guarantee it contains a complete graph of size r+1. Equivalently, an upper bound on the number of edges in a K_{r+1}-free graph. For some of the applications and proofs, it may be more natural to look instead at the complement graph, for which the theorem becomes a statement about the existence or otherwise of an independent set of size r+1.

Rather than give an expression for the bound immediately, it is more natural to consider the Turan graph T(n,r), the maximal graph on n vertices without a copy of K_{r+1}. This is constructed by dividing the vertices into r classes with sizes as equal as possible. That is, some classes have size \lfloor \frac{n}{r}\rfloor and others have size \lfloor \frac{n}{r}\rfloor +1. Then connect any pair of vertices which are not in the same class by an edge. This gives a complete r-partite graph on these classes. Since any collection of r+1 vertices contains at least two in the same class, it can’t contain a K_{r+1}. Note that the complement of the complete r-partite graph is the union of r disjoint complete graphs on the classes.

There are a number of ways to enumerate the edges in T(n,r), and some can get quite complicated quite quickly. After a moderate amount of thought, this is my favourite. Let n=\ell r+k, so T(n,r) has k classes of size \ell+1 and (r-k) classes of size \ell. Pick an ordered pair of vertices uniformly at random, with repeats allowed, so there are n^2 such pairs, and each edge corresponds to exactly two of them. The probability that the two vertices lie in the same class is

\frac{k(\ell+1)^2+(r-k)\ell^2}{n^2}\ge \frac{1}{r},

where the inequality is Cauchy-Schwarz (or any root-mean-square type inequality), with equality precisely when the class sizes are all equal, that is when r divides n. So the probability that they lie in different classes is at most \frac{r-1}{r}, and since any pair with a repeated vertex certainly lies in the same class, the pairs in different classes double-count exactly the edges. We end up with the bound e(T(n,r))\le \left(1-\frac{1}{r}\right)\frac{n^2}{2}, with equality when r divides n.
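The count and the bound are easy to compare directly (a few lines of Python; the helper is mine):

```python
def turan_edges(n, r):
    """Edges of T(n,r), counted via ordered pairs in the same class."""
    sizes = [n // r + (1 if i < n % r else 0) for i in range(r)]
    same = sum(s * s for s in sizes)  # ordered pairs (repeats allowed) in one class
    return (n * n - same) // 2        # each edge is exactly two ordered pairs

for n, r in [(10, 5), (11, 5), (100, 4)]:
    print(n, r, turan_edges(n, r), (1 - 1 / r) * n * n / 2)
# 10 5 40 40.0     (equality when r divides n)
# 11 5 48 48.4     (strictly below the bound otherwise)
# 100 4 3750 3750.0
```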

A standard proof

For both proofs, I find it slightly easier to work in the complement graph, where we are aiming for the smallest number of edges subject to there being no independent set of size (r+1). Suppose we have a graph with the minimal number of edges such that there’s no independent set of the given size. Suppose also that there is an edge joining vertices v and w, such that d(v)> d(w). Then if we change v’s neighbourhood \Gamma(v) so that it becomes the same as \Gamma(w) (that is, we replace v with a copy of w, and maintain the original edge vw), then it is easily checked that we still do not have an independent set of that size, but we have fewer edges, contradicting minimality.

Note that by attempting to make the neighbourhoods of connected vertices equal, we are making the graph look more like a union of complete components. We can do a similar trick if we have three vertices u,v,w such that uv and vw are edges, but uw is not. Then we know the degrees of u,v,w are the same by the previous argument, and so it can again be checked that making \Gamma(u),\Gamma(w) the same as \Gamma(v), and adding the edge uw, reduces the number of edges, and maintains the non-existence of the independent set.

The consequence of this is that we’ve shown that the minimum can only be attained when presence of edges is an equivalence relation (ignoring reflexivity). Thus the minimum is only attained for a union of at most r complete graphs. Jensen (or any root-mean-square type inequality) will then confirm that the true minimum is attained when the sizes of the r components are as equal as possible.

A probabilistic proof

The following probabilistic proof is courtesy of Alon and Spencer. The motivation is that in the (equality) case of a union of complete graphs, however we try to build up a maximal independent set, we always succeed. That is, it doesn’t matter how we choose which vertex (unconnected to those we already have) to add next – we will always get a set of size r. This motivates a probabilistic proof, as an argument in expectation will have equality in the equality case, which is always good.

Anyway, we build up an independent set in a graph by repeatedly choosing uniformly at random a vertex which is not connected to any we have taken so far, until no such vertex remains. It makes sense to settle the randomness at the start, so give the vertices a uniformly random labelling from [n], and at each stage choose the eligible vertex with minimal label.

Thus, a vertex v will certainly be chosen for the independent set if it has a smaller label than all of its neighbours, which happens with probability \frac{1}{1+d(v)}. So the expected size of the independent set constructed in this fashion is at least

\sum_{v\in V(G)} \frac{1}{1+d(v)}\ge \frac{V}{1+\bar d} = \frac{V}{1+\frac{2E}{V}}.

One can chase through the expressions to get the bound we want back.
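Here's a simulation sketch of the argument (Python; the example graph, two 5-cliques joined by a single edge, is my own choice). The average size of the greedily-built independent set should dominate \sum_v \frac{1}{1+d(v)}:

```python
import random

def greedy_independent_set(adj):
    """Scan the vertices in a uniformly random order, taking a vertex
    whenever none of its neighbours has been taken already."""
    order = list(adj)
    random.shuffle(order)
    taken = set()
    for v in order:
        if not adj[v] & taken:
            taken.add(v)
    return taken

# Example graph: two 5-cliques joined by a single edge.
adj = {v: set() for v in range(10)}
def add_edge(u, v):
    adj[u].add(v)
    adj[v].add(u)
for block in ([0, 1, 2, 3, 4], [5, 6, 7, 8, 9]):
    for i in range(5):
        for j in range(i + 1, 5):
            add_edge(block[i], block[j])
add_edge(4, 5)

runs = 20_000
avg = sum(len(greedy_independent_set(adj)) for _ in range(runs)) / runs
bound = sum(1 / (1 + len(adj[v])) for v in adj)
print(f"average greedy size {avg:.3f} vs bound {bound:.3f}")  # 2.000 vs 1.933
```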

Olympiad example

The reason I was thinking about Turan’s theorem was a problem which the UK IMO squad was discussing. It comes from an American selection test (slightly rephrased): given 100 points in the plane, what is the largest number of pairs of points with \ell_1 distance in (1,2]?

The key step is to think about how large a collection of points can have this property pairwise. It is easy to come up with an example of four points which work, and seemingly impossible to come up with an example with five points. To prove this, I found it easiest to place a point at the origin, then explicitly work with coordinates relative to the basis (1,1),(1,-1), for fairly obvious reasons in this metric.

Anyway, once you are convinced that you can’t have five points with this property pairwise, you are ready to convert into a graph-theoretic statement. Vertices correspond to points, and edges link pairs of points whose distance is in (1,2] as required. We know from the previous paragraph that there is no copy of K_5 here, so Turan’s theorem bounds the number of edges, ie the number of suitable pairs.

It also tells us under what sort of circumstances the bound is attained, and from this, it’s natural to split the 100 points into four groups of 25, for example by taking four points which satisfy the condition pairwise (eg a diamond around the origin), and placing each group very near one of the points.

Extensions and other directions

The existence of a complete subgraph is reminiscent of Ramsey theory, which in one case is a symmetric version of Turan’s theorem. In Turan, we are adding enough edges to force a complete subgraph, while in the original form of Ramsey theory, we are asking how large the graph needs to be to ensure that for any edge configuration, either the original graph or the complement graph includes a complete subgraph. It makes a lot more sense to phrase this in terms of colours for the purpose of generalisation.

A natural extension is to ask about finding copies of fixed graphs H other than the complete graph. This is the content of the Erdos-Stone theorem. I’d prefer to say almost nothing rather than be vague, but the key difference is that the bound is asymptotic in the number of vertices rather than exact. Furthermore, the asymptotic proportion of edges depends on the chromatic number of H, which tells you how many classes r are required to embed H in a (large if necessary) complete r-partite graph. So it is perhaps unsurprising that the limiting proportions end up matching the proportions of edges in the Turan graphs, namely \frac{r-1}{r} as r varies, which leaves the exact scaling open to further investigation in the case where H is bipartite (hence has chromatic number 2).

Birthday Coincidences and Poisson Approximations

This morning, Facebook was extremely keen to remind me via every available medium that four of my friends celebrate their birthday today. My first thought was that I hope they all enjoy their day, and my second thought was to ask what the chance of this was. I have about 200 Facebook friends, and so this struck me as an unlikely occurrence. But this problem has form, and it felt worthwhile to try some calculations to see if my intuition was well-founded.

[Image: Siméon Denis Poisson celebrated his 234th birthday on 21st June this year.]

The classical birthday problem

The starting point is the question: how many friends do you have to have before you expect to start seeing anyone sharing a birthday? There are a ridiculous number of articles about this on the web already, so I will say little, except that I don’t want to call this the ‘birthday paradox’, because it’s not a paradox at all. At best it might be counter-intuitive, but then the moral should be to change our intuition for this type of problem.

Throughout, let’s discount February 29th, as this doesn’t add much. So then, to guarantee having a shared pair of birthdays, you need to have 366 friends. But if you have a mere 23 friends, then the probability of having some pair that share a birthday is slightly greater than a half. The disparity between these two numbers leads to the counter-intuition. Some people might find it helpful to count pairs of friends rather than friends themselves, but I don’t personally find this especially helpful.

For me, thinking about the calculation in very slightly more generality is helpful. Here, and throughout, let’s instead take N to be the number of days in a year, and K the number of friends, or kids in the class if you prefer. Then, as usual, it is easier to calculate the probability that no two share a birthday (that is, that all the birthdays are distinct) than the probability that some two share a birthday. We could think of the number of ways to pick the set of birthdays, or we could look at the kids one-at-a-time, and demand that their birthday is not one of those we’ve already seen. Naturally, we get the same answer, that is

\frac{{}^N P_K}{N^K} = 1\cdot \frac{N-1}{N}\cdots \frac{N-K+1}{N}.

We’ve assumed here that all birthdates are equally likely. We’ll come back to this assumption right at the end. For now, let’s assume that both N and K are large, and we’ll try to decide roughly how large K has to be in relation to N for this answer to be away from 0 and 1. If we pair opposite terms up, we might approximate this by

(\frac{N-\frac{K}{2}}{N})^K = (1-\frac{K}{2N})^K\approx e^{-K^2/2N}.

In fact, AM-GM says that this is an overestimate, and a bit more care can be used to show that this is a good approximation to first order. So we see that if K=\Theta(\sqrt{N}) for large N, we get a non-trivial limit.
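The exact product and the approximation are easy to compare numerically (a small Python sketch):

```python
from math import exp

def p_all_distinct(N, K):
    """Exact probability that K birthdays over N days are all distinct."""
    p = 1.0
    for i in range(K):
        p *= (N - i) / N
    return p

N = 365
for K in [10, 23, 32, 50]:
    print(K, round(p_all_distinct(N, K), 4), round(exp(-K * K / (2 * N)), 4))
# The exact value crosses 1/2 at K = 23, as promised; the approximation
# e^{-K^2/2N} tracks it closely throughout.
```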

Challenges for four-way shared birthdays

So the original problem I posed is harder, because there isn’t (unless I’m missing something) a natural way to choose birthdays one-at-a-time, or describe the set of suitable birthday sets. There are two major obstacles in a calculation such as this. Firstly, the overlap of people, that is we might have five or more birthdays overlapping; secondly, the overlap of days, that is we might have several days with four (or more) birthdays. We’ll end up worrying more about the second situation.

We start by eliminating both problems, by asking for the probability that exactly four friends are born on January 1st. The general form of this probability is \frac{\binom{K}{4} }{N^4} \cdot (\frac{N-1}{N})^{K-4}. Now, if K\ll N, this final term should not be significant. Removing this is not exactly the same as specifying the probability that at least four birthdays are on January 1st. But in fact this removal turns a lower bound (because \{\text{exactly four}\}\subset\{\text{at least four}\}) into an upper (in fact a union) bound. So if the factor being removed is very close to one, we can use whichever expression is more convenient.

In the real life case of N=365, K=200, this term is not negligible. But accounting for this, we get that the probability of exactly four birthdays on 1st January is ~0.0021. Our upper bound on the probability of at least four is ~0.0036.
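For the record, both numbers come straight from the formulas above (Python):

```python
from math import comb

N, K = 365, 200
exactly_four = comb(K, 4) / N**4 * ((N - 1) / N) ** (K - 4)
at_least_four_ub = comb(K, 4) / N**4  # union bound over 4-subsets
print(round(exactly_four, 4), round(at_least_four_ub, 4))  # 0.0021 0.0036
```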

But now that we know the probability for a given day, can we calculate (1-0.0021)^{365} to estimate the probability that we never have four-overlap? When we did our previous iterative calculation, we were using independence of the different kids’ birthdays. But the event that we have four-overlap on January 1st is not quite independent of the event that we have four-overlap on January 2nd. Why? Well if we know at least four people were born on January 1st, there are fewer people left (potentially) to be born on January 2nd. But maybe this dependence is mild enough that we can ignore it?

We can, however, use some moment estimates. The expected number of days with four-overlap is 365\cdot 0.0021 \approx 0.77. So the probability that there is at least one day with four-overlap is at most ~0.77.

But we really want a lower bound. So, maybe we can borrow exactly the second-moment argument we tried (there for isolated vertices in the random graph) in the previous post? Here, the probability that both January 1st and January 2nd are four-overlapping is

\frac{\binom{K}{4}\binom{K-4}{4}}{N^8}\cdot (\frac{N-2}{N})^{K-8}\approx 4.3\times 10^{-6}.

From this, we can evaluate the expectation of the square of the number of days with four-overlap, and thus find that the variance is ~0.74. So we use Chebyshev, calling this number of days #D for now:

\mathbb{P}(\# D=0)\le \mathbb{P}(|\#D - \mathbb{E}\# D|^2 \ge (\mathbb{E}\# D)^2 ) \le \frac{\mathrm{Var} \# D}{(\mathbb{E} \#D)^2}.

In our case, this unfortunately gives us an upper bound greater than 1 on this probability, and thus a lower bound of zero on the probability that there is at least one day with four-overlap. Which isn’t especially interesting…
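Chasing these numbers through (a Python sketch, using \mathbb{E}[D^2]=\mathbb{E}[D]+\mathbb{E}[D(D-1)] for a sum of indicators):

```python
from math import comb

N, K = 365, 200
p1 = comb(K, 4) / N**4 * ((N - 1) / N) ** (K - 4)                   # one given day
p2 = comb(K, 4) * comb(K - 4, 4) / N**8 * ((N - 2) / N) ** (K - 8)  # two given days

mean = N * p1                            # E[#D] ~ 0.78
var = mean + N * (N - 1) * p2 - mean**2  # Var(#D) ~ 0.74
print(round(mean, 2), round(var, 2), round(var / mean**2, 2))
# Chebyshev bound var / mean^2 ~ 1.22 > 1: no information.
```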

Fairly recently, I spoke about the Lovasz Local Lemma, which can be used to find lower bounds on the probabilities of intersections of events, many of which are independent (in a particular precise sense). Perhaps this might be useful here? The natural choice of ‘bad event’ is that a particular 4-set of people shares a birthday. There are \binom{K}{4} such events, and each is independent of the collection of events indexed by the \binom{K-4}{4} disjoint 4-sets. Thus we can consider using LLL if e\cdot (\binom{K}{4}-\binom{K-4}{4})\cdot 0.0021 \le 1. Unfortunately, this difference of binomial coefficients is large in our example (around 5\times 10^6), and so in fact the LHS has order 10^4.

Random number of friends – coupling to a Poisson Process

All of these methods failed because, without independence, we had to use estimates which were really not tight at all. But we can re-introduce independence if we relax the constraints on the model. Suppose that instead of demanding I have exactly K friends, I demand that I have a random number of friends, with distribution Poisson(K). Then it is a standard fact about Poisson thinning that, for each day, the number of friends with that birthday is Poisson(K/365), independently across days.

If we end up having exactly K friends with this random choice, then the distribution of the number of 4-overlap days is exactly the same as in the original setup. However, crucially, if we end up having at most K friends with this random choice, the distribution of the number of 4-overlap days is stochastically dominated by the original distribution. So instead let’s assume we have Poisson(L) friends, where L<K, and see how well we can do. For definiteness, we’ll go back to N=365, K=200 now. Let’s say X is the distribution of birthdays in the original model, and \Xi the distribution of birthdays in the model with a random number of friends.

Then

\mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) = 1- \mathbb{P}(\mathrm{Po}(L/365)\le 3)^{365}. (*)

Now we can write the crucial domination relation as

\mathbb{P}(\exists \ge 4\text{-overlap in }X)\ge \mathbb{P}( \exists \ge 4\text{-overlap in }\Xi \,|\, |\Xi|\le 200),

and then use an inequality version of the law of total probability to bound further as

\ge \frac{ \mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) - \mathbb{P}(|\Xi|>200)}{\mathbb{P}(|\Xi|\le 200)}.

This is a function of L, and in principle we could find its maximum, perhaps as N\rightarrow\infty. Here, though, let’s just take L=365/2 and see what happens. For (*) we get ~0.472.

To estimate \mathbb{P}(\mathrm{Po}(365/2)>200), observe that this event corresponds to about 1.3 standard deviations above the mean, so we can approximate using quantiles of the normal distribution, via the CLT. (Obviously this isn’t completely precise, but it could be made precise if we really wanted.) I looked up a table, and this probability is, conveniently for calculations, roughly 0.1. Thus we obtain a lower bound of \frac{0.472-0.1}{0.9}. Allowing for the fairly weak estimates at various points, we still get a lower bound of around 0.4. Which is good, because it shows that my intuition wasn’t right, but that I was in the right ball-park for it being a ‘middle-probability event’.
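The whole estimate fits in a few lines (a Python sketch; as in the text, I use the normal approximation for the Poisson tail):

```python
from math import exp
from statistics import NormalDist

def poisson_cdf(k, lam):
    """P(Po(lam) <= k), by summing the mass function."""
    term, total = exp(-lam), exp(-lam)
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

N, K, L = 365, 200, 365 / 2
p_star = 1 - poisson_cdf(3, L / N) ** N              # (*) ~ 0.472
tail = 1 - NormalDist(mu=L, sigma=L ** 0.5).cdf(K)   # P(Po(L) > K) ~ 0.1
print(round(p_star, 3), round(tail, 3), round((p_star - tail) / (1 - tail), 3))
# 0.472 0.098 0.415 -- a lower bound of roughly 0.4, as claimed.
```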

Remarks and References

– The reason for doing the upper bound for the probability of exact 4-overlap is that the same argument for at-least-4-overlap would have given an upper bound of 1. However, this Poisson Process coupling is also a much better method for obtaining an upper bound on either event.

– Birthdays are not uniformly distributed through the year. The deviation is strong enough that even from the set of birth frequencies (rather than the sequence of birth frequencies), we can reject a null hypothesis of uniformity. Early September is pretty close to the maximum. Two comments: 1) this is the time of year where small variations in birth date have a big effect on education, especially in primary school; 2) we are 37 weeks into the year…

– It is known that 187 friends is the first time the probability of having at-least-4-overlap is greater than ½. You can find the full sequence on OEIS as A014088. I used to have about 650 Facebook friends, before I decided that I’d prefer instead the pleasant surprise of finding out what old acquaintances were up to when I next spoke to them. In this case, the median of the distribution of the largest number sharing a birthday would be seven.

– Eric Weisstein’s article on Mathworld is, in my opinion, the best resource for a mathematician on the first two pages of Google hits by at least two orders of magnitude. In the notation of this article, we were calculating P_4(n=200,d=365). There are also some good general asymptotics, or at least recipes for asymptotics, in equations (17) and (18).

– The paper Methods for Studying Coincidences by Diaconis and Mosteller is, as one might expect, extremely readable, and summarises many results and applications, including several generalisations.

Lovasz Local Lemma

At our training and selection camp in Tonbridge in May, I gave a session about the use of probabilistic methods to solve olympiad combinatorics questions. Such an approach will normally be relevant to a problem where it is required to show the existence of a structure satisfying a particular property. We might consider constructing a structure randomly, and then try to show that the probability that our construction has the required property is non-zero.

There are several possible approaches to this, but often for the property we seek, we can describe a family of ‘bad events’ A_1,\ldots,A_n, and then we have the property if none of the bad events hold. That is, we are looking for general conditions under which we can show that \mathbb{P}(\bigcap_{i=1}^n A_i^c)>0.

We have two cases where this is easy to do.

1) If all the A_i are independent, then so long as each \mathbb{P}(A_i)<1, we have our result.

2) If the probabilities of the bad events sum to less than 1, then we can use a union bound

\mathbb{P}(\bigcup A_i)\le \sum_{i=1}^n \mathbb{P}(A_i) <1,

to conclude what we want.

In Tonbridge we also discussed first-moment methods, where we show that the expected number of bad events is less than 1, meaning there must be some elements of the probability space where the number of bad events is zero. In this article, we’re looking in a different direction. We’ll try to interpolate between situation 1) and 2), to make some concrete comment on the situation where the probabilities of the bad events are small, but not small enough to use the union bound, and where the events are not quite independent, but are in some sense almost independent.

The first thing we need to do is think about what it means for a family of events to be independent. Suppose we toss a fair coin twice, and consider the events:

A:= {first coin is H}, B:={second coin is T}, C:={exactly one H}.

So, it is easy to see that each pair of events is independent, but if we know that A and B hold, then also C holds (and indeed this stays true under cyclic re-ordering). So C is not independent of the family of events {A,B}. Rather than give a formal definition, I’ll say instead that an event B is said to be independent of the family of events \{A_1,\ldots,A_5\} if it is independent of

A_3

A_1\cap A_2

A_3\cap A_4^c\cap A_5^c,

and so on. I hope it’s clear from this what I mean. Slogan: no knowledge about which of the A_i do or don’t hold gives information about B.

Now we return to our original setup. We want that each A_i is independent of lots of the rest, and so we choose for each i\in[n] a dependency set D_i\subset [n] of indices, such that A_i is independent of the family of events \{A_j: j\in [n]\backslash D_i\}. It is tempting to interpret this immediately as saying that A_i depends on each event with index in D_i. This will normally be fine in practice, but we should emphasise that there is a lot of freedom to choose D_i, and the genuinely important condition is that A_i is independent of the family given by the rest of the indices.

[*Health warning*: The language I am using seems sensible for anyone learning about this for the first time, but is very slightly non-standard. Instead of dependency sets, the classical notation is to consider a dependency digraph on [n], with an edge i->j whenever j\in D_i.]

The symmetric version of the Lovasz Local Lemma says: suppose \mathbb{P}(A_i)\le p and we can choose D_i as described so that |D_i|\le d for each i. Then, if epd\le 1, we may conclude \mathbb{P}(\bigcap_{i=1}^n A_i^c)>0.

We’ll come back to the proof, which is best seen in a slightly more general formulation. First, let’s get to grips with the notation, by reassuring ourselves that this really does lie between the union bound and the independence case.

If the events are independent, then clearly we may take D_i=\{i\} for each i, that is d=1, so we may draw the desired conclusion so long as p\le 1/e, which is fine, but a lot less convincing than p<1. Similarly, for the union bound, we have said nothing about the dependency relationships, and so we have to take D_i=[n] for each i. So we may draw the desired conclusion provided p\le 1/ne, which is obviously again a factor of e less than what we would have had with a union bound itself.

Now we’ll see how this might be useful when applied, for example, to a probabilistic construction for the lower bound on the Ramsey number R(k). From Erdos’s argument, we know R(k)\ge (1+o(1)) 2^{k/2}, and we will earn an extra factor of k\sqrt{2}/e. An extra factor of k/e can also be obtained by an alteration argument, but this extra factor of \sqrt{2} makes this LLL method one of the strongest available for this problem.

Recall that for a lower bound, we are trying to find examples of 2-edge-colourings of a large complete graph K_n, such that there is no monochromatic copy of a K_k. We consider the uniform independent random edge-colouring. It makes sense that a bad event A_S should be the presence of a monochromatic complete graph induced on a set S\subset [n], of size k. Since there are two potential colours for the monochromatic induced K_k, we have \mathbb{P}(A_S)=2^{1-\binom{k}{2}}. Then we take the dependency set D_S of A_S to include all those k-sets which share an edge with S, that is |S\cap T|\ge 2. We think about which vertices might contribute to the shared edge, and which make up the remainder to conclude |D_S|\le \binom{k}{2}\binom{n-2}{k-2}.

So now, whenever e\cdot 2^{1-\binom{k}{2}}\binom{k}{2}\binom{n-2}{k-2}\le 1, as a consequence of LLL we can conclude that with positive probability the random colouring gives a suitable example, that is R(k)>n. After fiddling around with Stirling’s formula in a fashion that isn’t hugely interesting, we can conclude R(k)\ge (1+o(1)) \frac{\sqrt{2}k}{e} 2^{k/2}.
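Here's a sketch (Python) extracting the numerical lower bound from the LLL condition for small k; the asymptotic factor \frac{\sqrt{2}k}{e} emerges only slowly:

```python
from math import comb, e

def lll_ramsey_lower(k):
    """Largest n with e * 2^(1 - C(k,2)) * C(k,2) * C(n-2, k-2) <= 1;
    for such n, LLL shows R(k) > n."""
    n = k
    while e * 2 ** (1 - comb(k, 2)) * comb(k, 2) * comb(n - 2, k - 2) <= 1:
        n += 1
    return n - 1

for k in [10, 15, 20]:
    asymptotic = 2 ** 0.5 * k / e * 2 ** (k / 2)
    print(k, lll_ramsey_lower(k), round(asymptotic))
# The ratio of the two columns tends to 1, but only slowly in k.
```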

The prompt for this article was a discussion during our Malaysian training camp of methods applicable to IMO 2014 Q6. If you want to know just how applicable LLL is, I suggest you have a think yourself. It’s not a direct application – so some thought is involved. Maybe as an appetiser, here are some more elementary exercises, which I’ve borrowed from examples on Po-Shen Loh’s olympiad handouts, and Wikipedia, though I doubt the latter is the original source:

1) 11n points on a circle are coloured with n colours, such that each colour is used exactly 11 times. Show that it’s possible to choose one point of each colour such that no pair are adjacent.

2) A finite graph is given, and all degrees are between 50 and 100. Find some finite C such that you can always colour the vertices of such a graph with C colours so that the neighbourhood of any vertex includes at least 20 colours.

Finally, we discuss the more general statement of LLL, and explain how the proof works.

General Lovasz Local Lemma: Suppose there exist x_i\in [0,1) such that \mathbb{P}(A_i)\le x_i \prod_{j\in D_i\backslash\{i\}} (1-x_j) (*). Then \mathbb{P}(\bigcap A_i^c)\ge \prod (1-x_i)>0.

Deducing the symmetric form from the general form is not hard. Condition (*) is motivated by the proof. We want to be able to say that no matter which other events and their unions, complements etc we condition on, we still have an upper bound for the probability of A_i. This bound will be x_i. In particular, we want to show that the probability of bad event A_i does not get too high, even if we know that lots of other bad events have not occurred.

The proof proceeds by showing \mathbb{P}(A_i | \bigcap_{j\in S}A_j^c)\le x_i for all i, by induction on |S|. For the inductive step, you split S=S_1\cup S_2, where S_1=S\cap D_i and S_2=S\cap D_i^c. If S_1=\varnothing, you are done straight away, by the assumption (*) and the independence of A_i from the events not indexed by D_i. Otherwise, you can use the inductive hypothesis on S_2, and repeated applications of Bayes’ theorem, to show what you want in a series of steps that involve a lot of notation, but aren’t hugely difficult.