Bijections, Prufer Codes and Cayley’s Formula

I’m currently at the training camp in Cambridge for this year’s UK IMO squad. This afternoon I gave a talk to some of the less experienced students about combinatorics. My aim was to cover as many useful tricks for calculating the sizes of combinatorial sets as I could in an hour and a half. We started by discussing binomial coefficients, which pleasingly turned out to be revision for the majority. But my next goal was to demonstrate that we are much more interested in the fact that we can calculate these if we want than in the actual expression for their values.

Put another way, my argument was that the interpretation of \binom{n}{m} as the number of ways to choose m objects from a collection of n, or the number of up-and-right paths from (0,0) to (m,n) is more useful than the fact that \binom{n}{m}=\frac{n!}{m!(n-m)!}. The opening gambit was to prove the fundamental result underlying the famous construction of Pascal’s triangle that

\binom{n+1}{m+1}=\binom{n}{m}+\binom{n}{m+1}.

This is not a hard result to prove by manipulating factorials, but it is a very easy result to prove in the path-counting setting, for example.

So it turned out that the goal of my session, as further supported by some unsubtly motivated problems from the collection, was to convince the students to use bijections as much as possible. That is, if you have to count something awkward, show that counting the awkward thing is equivalent to counting something more manageable, then count that instead. For many simpler questions, this equivalence is often drawn implicitly using words (“each of the n objects can be in any subset of the collection of bags so we multiply…” etc), but it is always worth having in mind the formal bijective approach. Apart from anything else, asking the question “is this bijection so obvious I don’t need to prove it” is often a good starting-point for assessing whether the argument is in fact correct!

Anyway, I really wanted to show my favouriite bijection argument, but there wasn’t time, and I didn’t want to spoil other lecturers’ thunder by defining a graph and a tree and so forth. The exploration process encoding of trees is a strong contender, but today I want to define quickly the Prufer coding for trees, and use it to prove a famous result I’ve been using a lot recently, Cayley’s formula for the number of spanning trees on the complete graph with n vertices, n^{n-2}.

We are going to count rooted trees instead. Since we can choose any vertex to be the root, there are n^{n-1} rooted trees on n vertices. The description of the Prufer code is relatively simple. Take a rooted tree with vertices labelled by [n]. A leaf is a vertex with degree 1, other than the root. Find the leaf with the largest label. Write down the label of the single vertex to which this leaf is connected, then delete the leaf. Now repeat the procedure, writing down the label of the vertex connected to the leaf now with the largest label, until there are only two vertices remaining, when you delete the non-root vertex, and write down the label of the root. We get a string of (n-1) labels. We want to show that this mapping is a bijection from the set of rooted trees with vertices labelled by [n] to [n]^{n-1}.

Let’s record informally how we would recover a tree from the Prufer code. First, observe that the label of any vertex which is not a leaf must appear in the code. Why? Well, the root label appears right at the end, if not earlier, and every vertex must be deleted. But a vertex cannot be deleted until it has degree one, so the neighbours further from the root (or ancestors) of the vertex must be removed first, and so by construction the label appears. So know what the root is, and what the leaves are straight away.

In fact we can say slightly more than this. The number of times the root label appears is the degree of the root, while the number of times any other label appears is the degree of the corresponding vertex minus one. Call this sequence the Prufer degrees.

So we construct the tree backwards from the leaves towards the root. We add edges one at a time, with the k-th edge joining the vertex with the k-th label to some other vertex. For k=1, this other vertex is the leaf with maximum label. In general, let G_k be the graph formed after the addition of k-1 edges, so G_1 is empty, and G_n is the full tree. Define T_k to be the set of vertices such that their degree in G_k is exactly one less than their Prufer degree. Note that T_1 is therefore the set of leaves suggested by the Prufer code. So we form G_{k+1} by adding an edge between the vertex with label appearing at position k+1 in the Prufer sequence and the vertex of T_k with maximum label.

Proving that this is indeed the inverse is a bit fiddly, more because of notation than any actual mathematics. You probably want to show injectivity by an extremal argument, taking the closest vertex to the root that is different in two trees with the same Prufer code. I hope it isn’t a complete cop out to swerve around presenting this in full technical detail, as I feel I’ve achieved by main goal of explaining why bijection arguments can reduce a counting problem that was genuinely challenging to an exercise in choosing sensible notation for proving a fairly natural bijection.

Mixing Times 6 – Aldous-Broder Algorithm and Cover Times

In several previous posts, I’ve discussed the Uniform Spanning Tree. The definition is straightforward: we choose uniformly at random from the set of trees which span a fixed underlying graph. But for a dense underlying graph, there are a very large number spanning trees. Cayley’s formula says that the complete graph K_n has n^{n-2} spanning trees, so to select from this list is impractical.

We seek a better algorithm. In a post about a year ago, I presented the result that the path between two fixed points x and y in the UST is distributed as the path generated by Loop-Erased Random Walk, for which we start at x and delete cycles as they appear. An initial problem might be that this only gives us a single path, which might be enough in some contexts, but in general we will want to specify the whole tree. Wilson’s Algorithm is an unsurprising but useful extension to this equivalence which does just that. You start by constructing the LERW between two vertices, then you add the LERW which connects some other vertex to the path you already have. Then you take a further vertex not currently explored and start LERW there, continuing until you hit the tree that you already have. Iterate this process, which must terminate after at most n steps when there are no vertices which to start from. The tree thus obtained is the UST. The tricky part is proving that the method for selecting which unused vertices to start from has no effect on the distribution of paths between two fixed points.

I want to consider a different algorithm, discovered roughly simultaneously by Aldous and Broder. Start a random walk on the underlying graph at some particular vertex. Every time we traverse an edge which takes us to a vertex we haven’t yet explored, add this edge to the tree. For now I don’t want to give a proof that this algorithm works, but rather to talk about how fast it works, because it ties in nicely with something from the Mixing Times book we’ve been reading recently. It is clear that the algorithm terminates at the first time the random walk has visited every vertex. This is a stopping time, called the cover time of the Markov chain. If we are working with an underlying complete, then we notice that this is annoying, because it means that the cover time will increase like n.log n. That is, it will take an increasingly long time to gather the final few vertices into the tree. Perhaps some combination of Aldous-Broder initially then Wilson’s method for the final o(n) vertices might be preferable?

I want to discuss how to treat this cover time. Often we have information about the hitting times of states from other states \mathbb{E}_x T_y. A relationship between S, the hitting time, defined to be the maximum of the previous display over x and y, and the expected cover time would be useful, especially for a highly symmetric graph like the complete graph where the expected hitting times are all the same.

Matthews’ Method relates these two for an irreducible finite Markov chain on n states. It says:

t_{cov}\leq t_{hit}\left(1+\frac12+\ldots+\frac 1 n\right).

We first remark that this agrees with what we should get for the random walk on the complete graph. There, the hitting time of x from y is a geometric random variable with success probability 1/n, hence expectation is n. The cover time is the standard coupon collector problem, giving expectation n log n, and the sum of reciprocals factor is asymptotically a good approximation.

The intuition is that if we continue until we hit state 1, then reset and continue until we hit state 2, and so on, by the time we hit state n after (n-1) iterations, this is a very poor overestimate of the cover time, because we are actually likely to have hit most states many times. What we want to do really is say that after we’ve hit state 1, we continue until we hit state 2, unless we’ve already done so, in which case we choose a different state to aim for, one which we haven’t already visited. But this becomes complicated because we then need to know the precise conditional probabilities of visiting any site on the way between two other states, which will depend rather strongly on the exact structure of the chain.

Peres et al give a coupling proof in Chapter 11 of their book which I think can be made a bit shorter, at least informally. The key step is that we still consider hitting the sites in order, only now in a random order.

That is, we choose a permutation \sigma\in S_n uniformly at random, and we let T_k be the first time that states \sigma(1),\ldots,\sigma(k) have all been visited. This is a random time that is measurable in the product space, and for each \sigma it is a stopping time.

The key observation is that \mathbb{P}(T_{k+1}=T_k)=1-\frac{1}{k+1}. This holds conditional on any path of the Markov chain because the requirement for the event is that \sigma(k+1) is visited after \{\sigma(1),\ldots,\sigma(k)\}. The statement therefore holds as stated as well as just pathwise. Then, by the SMP, conditional on \{T_{k+1}>T_k\}, we have

T_{k+1}-T_k \leq_{st} t_{hit}.

Note that by the definition of t_{hit}, this bound on the hitting time T_{k+1} is unaffected by concerns about where the chain actually is at T_k (since it is not necessarily at \sigma(k)).

So, removing the conditioning, we have:

\mathbb{E}\Big[T_{k+1}-T_k\Big]\leq\frac{1}{k+1}t_{hit},

and so the telescoping sum gives us Matthews’ result.

One example is the cover time of random walk on the n x n torus, which turns out to be

O(n^2(\log n)^2).

If anyone remembers that Microsoft screensaver from many years ago which started with a black screen and a snake leaving a trail of white pixels as it negotiated the screen, this will be familiar. The last few black bits take a frustratingly long while to disappear. Obviously that isn’t quite a random walk, but it perhaps diminishes the surprise that it should take this long to find the cover time.

There are a couple of interesting things I wanted to say about electrical networks for Markov chains and analytic methods for mixing times, but the moment may have passed, so this is probably the last post about Mixing Times. Plans are in motion for a similar reading group next term, possible on Random Matrices.

Minimum Spanning Trees

In my last post, I discussed the Uniform Spanning Tree. To summarise very briefly, given a connected graph on n vertices, a tree is a subgraph, that is a subset of the edges, which is connected, but which contains no cycles. It turns out this requires the tree to have n-1 edges.

We are interested in natural mechanisms for generating randomly chosen spanning trees of a given graph. One way we can always do this is to choose uniformly at random from the set of possible trees. This UST is in some sense canonical, but it is worth knowing about some other measures on trees that might be of interest.

A family of natural problems in operations research concerns an arbitrary complex network, with some weight or cost associated to each connection. The question is how to perform some operation on the network so as to minimise the resulting cost. Perhaps the most famous such problem is that of the Travelling Salesman. The story is that a salesman needs to visit n locations and wants to do the trip as efficiently as possible. This might be thought of as some sort of financial or time cost, but proably the easiest way to set it up is to imagine he is trying to minimise the distance he has to travel. It is not hard to see why this problem might genuinely arise in plenty of real-world situations, where a organisation or agent is trying to be as efficient as possible.

It might be the case that it is not possible to travel between every pair of locations, but we needn’t assume that for now. So if he knows the distance between any pair of cities, he wants to know which of the possible routes gives the shortest overall distance. The problem is that there are n! routes, and this grows roughly like n^n, which is faster than exponential, so for as few as 20 cities it has turned into a comparison which is too large to compute.

There are various algorithms which reduce the number of routes that must be checked, and some approximation methods. But if you want the exact answer, it is not currently possible to calculate this in polynomial time.

Minimal Spanning Trees and Uniqueness

For the travelling salesman, we were looking for the minimal cost spanning path. In the case of the complete graph, this is the same as the minimal cost non-repeating path of length n-1. Such paths are a subset of the set of spanning trees on the underlying graph. So what if we look instead for the minimal cost spanning tree? This exists as after all, there are only finitely many spanning trees.

So far, this has been deterministic, but we were looking for a random spanning tree. We can achieve this by choosing the weights at random. Anything other than assigning the weights as an IID sequence seems likely to be complicated, but there isn’t a canonical choice of the distribution of the weights. Our first question will be whether the distribution of the weights affects the distribution of the induced MST. In fact it will turn out that so long as the distribution is continuous, it has no effect on the distribution of the MST. The continuous condition might seem odd, but it is present only to ensure that the weights almost certainly end up generating a unique MST.

It turns out that there is a straightforward greedy algorithm to find the MST once the weights are known. We will examine some consequences of this algorithm in the random setting. First we check uniqueness. The condition required for uniqueness is that the weights be distinct. Note that this is slightly weaker than the statement that all of sums of (n-1)-tuples be distinct, which immediately implies a unique MST.

We now prove this condition. Suppose we have distinct weights, and an associated MST. If the underlying graph is a tree, then the result is clear. Otherwise, add some extra edge e, with weight w(e). By the definition of a tree, this generates exctly one cycle. Consider the other edges, say e_1,\ldots,e_k in this cycle. If any of w(e_i)>w(e) then we can replace e_i with e to get a spanning tree with smaller weight, a contradiction of the claim that we started with an MST. So by distinctness of weights, we conclude that w(e)>w(e_i) for all i.

Conversely, suppose we remove some edge e which IS in the MST. We end up with exactly two connected components. Consider all the edges in the underlying graph between the two components, and suppose that one of these f satisfies w(f)<w(e). Then if we add in edge f, which is by construction not in the original MST, we end up with a smaller total weight than we started with, a further contradiction.

We can summarise this in a neat form. Given an edge e between x and y, consider the set of all edges in the underlying graph with weight LESS THAN w(e). Then if x and y are in different components, the edge e must be in the MST. Since we have an explicit description of which edges are present, it follows that the MST is unique. The problem is that working out the component structure of the graph with higher weights removed is computationally rather intensive. We want a slightly faster algorithm.

Kruskal’s Algorithm

Several rather similar algorithms were developed roughly simultaneously. Prim’s algorithm is a slight generalisation of what we will discuss. Anyway, for now we consider Kruskal’s algorithm which has the advantage that it can be described without really needing to draw a diagram.

We start by ordering the weights. Without loss of generality, we might as well relabel the edges so that

w(e_1)< w(e_2)<\ldots< w(e_{|E|}).

Now, by the condition derived in the argument for uniqueness, we must have e_1 and e_2 in any MST. Now consider e_3. Unless doing so would create a cycle, add e_3. Then, unless doing so would create a cycle, add e_4. Continue. It is clear that the result of this procedure is acyclic. To check it is actually a spanning tree, we show that it is also connected. Suppose not, and two of the components are A and B. Let e be the edge between A and B with minimal weight. According to the algorithm, we should have included e in our MST because at no point would adding it possibly have created a cycle. So we have proved that this greedy algorithm does indeed give the (unique) MST.

A useful consequence of this is that we know the two edges with overall minimum weight are definitely in the MST. In the search for a random measure on spanning trees, what is most important is that we didn’t use the actual values of the weights in this construction, only the order. In other words, we might as well have assumed the weights were a random permutation from S_{|E|}. This now answers our original question about how the random weight MST depends in distribution on the underlying edge weight distribution. So long as with probability one the weights are distinct (which holds if the distribution is continuous), then the distribution of the resulting spanning tree is constant.

It’s not too hard to show this isn’t the same as UST: n=4 suffices as a counterexample. But the difference in asymptotic behaviour of properties such as the diameter is of interest, and will be explored in the next post.

Uniform Spanning Trees

For applications to random graphs, the local binomial structure and independence means that the Galton-Watson branching process is a useful structure to consider embedding in the graph. In several previous posts, I have shown how we can set up the so-called exploration process which visits the sites in a component as if the component were actually a tree. The typical degree is O(1), and so in particular small components will be trees with high probability in the limit. In the giant component for a supercritical graph, this is not the case, but it doesn’t matter, as we ignore vertices we have already explored in our exploration process. We can consider the excess edges separately by ‘sprinkling’ them back in once we have the tree-like backbone of all the components. Again, independence is crucial here.

I am now thinking about a new model. We take an Erdos-Renyi process as before, with edges arriving at some fixed rate, but whenever a cycle appears, we immediately delete all the edges that make up the cycle. Thus at all times the system consists of a collection (or forest) of trees on the n vertices. So initially this process will look exactly like the normal E-R process, but as soon as the components start getting large, we start getting excess edges which destroy the cycles and make everything small again. The question to ask is: if we run the process for long enough, roughly how large are all the components? It seems unlikely that the splitting mechanism is so weak that we will get true giant components forming, ie O(n) sizes, so we might guess that, in common with some other split-merge models of this type, we end up with components of size n^{2/3}, as in the critical window for the E-R process.

In any case, the scaling limit process is likely to have components whose sizes grow with n, so we will have a class of trees larger than those we have considered previously, which have typically been O(1). So it’s worth thinking about some ways to generate random trees on a fixed number of vertices.

Conditioned Galton-Watson

Our favourite method of creating trees is inductive. We take a root and connect the root to a number of offspring given by a fixed distribution, and each of these some offspring given by an independent sample from the same distribution and so on. The natural formulation gives no control over the size of the tree. This is a random variable whose distribution depends on the offspring distribution, and which in some circumstances be computed explicitly, for example when the offspring distribution is geometric. In other cases, it is easier to make recourse to generating functions or to a random walk analogue as described in the exploration process discussion.

Of course, there is nothing to stop us conditioning on the total size of the population. This is equivalent to conditioning on the hitting time of -1 for the corresponding random walk, and Donsker’s theorem gives several consequences of a convergence relation towards a rescaled Brownian excursion. Note that there is no a priori labelling for the resulting tree. This will have to be supplied later, with breadth-first and depth-first the most natural choices, which might cause annoyance if you actually want to use it. In particular, it is not obvious, and probably not true unless you are careful, that the distribution is invariant under permuting the labels (having initially assumed 1 is the root etc) which is not ideal if you are embedding into the complete graph.

However, we would like to have some more direct constructions of random trees on n vertices. We now consider perhaps the two best known such methods. These are of particular interest as they are applicable to finding random spanning trees embedded in any graph, rather than just the complete graph.

Uniform Spanning Tree

Given a connected graph, consider the set of all subgraphs which are trees and span the vertex set of the original graph. An element of this set is called a spanning tree. A uniform spanning tree is chosen uniformly at random from the set of spanning trees on the complex graph on n vertices. A famous result of Arthur Cayley says that the number of such spanning trees is n^{n-2}. There are various neat proofs, many of which consider a mild generalisation which gives us a more natural framework for using induction. This might be a suitable subject for a subsequent post.

While there is no objective answer to the question of what is the right model for random trees on n vertices, this is what you get from the Erdos-Renyi process. Formally, conditional on the sizes of the (tree) components, the structures of the tree components are given by UST.

To see why this is the case, observe that when we condition that a component has m vertices and is a tree, we are demanding that it be connected and have m-1 edges. Since the probability of a particular configuration appearing in G(n,p) is a function only of the number of edges in the configuration, it follows that the probability of each spanning tree on the m vertices in question is equal.

Interesting things happen when you do this dynamically. That is, if we have two USTs of sizes m and n at some time t, and condition that the next edge to be added in the process joins them, then the resulting component is not a UST on m+n vertices. To see why, consider the probability of a ‘star’, that is a tree with a single distinguished vertex to which every other vertex is joined. Then the probability that the UST on m vertices is a star is \frac{m}{m^{m-2}}=m^{-(m-3)}. By contrast, it is not possible to obtain a star on m+n vertices by joining a tree on m vertices and a tree on n vertices with an additional edge.

However, I think the UST property is preserved by the cycle deletion mechanism mentioned at the very start of this post. My working has been very much of the back of the envelope variety, but I am fairly convinced that once you have taken a UST and conditioned on the sizes of the smaller trees which result from cycle deletion. My argument is that you might as well fix the cycle to be deleted, then condition on how many vertices are in each of the trees coming off this cycle. Now the choice of each of these trees is clearly uniform among spanning trees on the correct number of vertices.

However, it is my current belief that the combination of these two mechanisms does not give UST-like trees even after conditioning on the sizes at fixed time.

Random Walks and Spanning Trees

Introduction

In this post, I’m going to talk about probability distributions on graphs. In particular, consider paths on a graph, and how to assign a distribution to these in a natural way. One way is to consider a standard random walk, treating the vertices as states of a discrete Markov chain. But, in many situations, this isn’t really enough. We might want paths, that is, walks which are self-avoiding. But even in regular lattices like \mathbb{Z}^d, it is hard to say a great deal about the set of self-avoiding walks (SAWs) of length n. We would prefer a distribution which has a natural product form, or which at least we can sample from without large combinatorial calculations.

A spanning tree is a connected graph without cycles. The set of edges is maximal, in the sense that adding any further edge creates a cycle, and also minimal, as removing one will disconnect the graph. Counting the number of spanning trees on labelled vertices is harder than one might suspect, and is possibly worth a post by itself. However, in general the uniform distribution on spanning trees is a useful object to consider. Any spanning tree contains a unique path between two vertices of the graph, and so a Uniform Spanning Tree (UST) induces a distribution on paths.

An alternative construction is to take a random walk and remove the cycles. This is not well-defined unless you specify a canonical order in which to remove them, and the obvious option is to remove the cycles in the order in which they appear. That is, every time you end up at a vertex which you have already visited, you remove all the edges traversed since you were last at v. This gives a Loop-Erased Random Walk (LERW), and from this another measure on paths between two vertices (subject to connectivity conditions which guarantee that the LERW almost surely hits the target vertex eventually).

At an informal level, the difference between these measures is significant. Inducing a measure down from a natural but potentially unwieldy uniform distribution is theoretically natural, but hard to work with. On the other hand a computer could sample from LERW, just by performing a random walk and storing the history suitably, which immediately makes it useful. So, the following theorem is elegant in its own right, but in particular as a bridge between these two frameworks.

Theorem: The measures on paths from to in a finite graph G induced by UST and generated by LERW are the same.

Proof: Some books give proofs which involve coupling a LERW construction with a sequence of STs directly, driven by an underlying random walk on the graph. The difficulty in this approach is that to prove that the uniform distribution on spanning trees is stationary often requires the random walk to be in equilibrium, a criterion which causes difficulties when it come to starting at x later in the proof. Essentially, the difficulty in many LERW constructions is that the edges being removed are incident to the wrong vertex in the random walk, and so RW equilibrium is required to show that the spanning tree transitions are reversible. Instead, we proceed by an argument motivated by the fact that LERWs appear ‘backwards’ in UST constructions. Continue reading