Antichains in the grid

In the previous post on this topic, we discussed Dilworth’s theorem on chains and antichains in a general partially ordered set. In particular, whatever the size of the largest antichain in a poset, it is possible to partition the poset into exactly that many chains. So for various specific posets, or the directed acyclic graphs associated to them, we are interested in the size of this largest antichain.

The following example turned out to be more interesting than I’d expected. At a conventional modern maths olympiad, there are typically three questions on each paper, and for reasons lost in the mists of time, each student receives an integer score between 0 and 7 per question. A natural question to ask is “how many students need to sit a paper before it’s guaranteed that one will scores at least as highly as another on every question?” (I’m posing this as a straight combinatorial problem – the correlation between scores on different questions will be non-zero and presumably positive, but that is not relevant here.)

The set of outcomes is clearly \{0,1,\ldots,7\}^3, with the usual weak domination partial order inherited from \mathbb{R}^3. Then an antichain corresponds to a set of triples of scores such that no triple dominates another triple. So the answer to the question posed is: “the size of the largest antichain in this poset, plus one.”

In general, we might ask about \{1,2,\ldots,n\}^d, again with the weak domination ordering. This directed graph, which generalises the hypercube as well as our example, is called the grid.

Heuristics for the largest antichain

Retaining the language of test scores on multiple questions is helpful. In the previous post, we constructed a partition of the poset into antichains, indexed by the elements of some maximal chain, by starting with the sources, then looking at everything descended only from sources, and so on. (Recall that the statement that this is possible was referred to as the dual of Dilworth’s theorem.) In the grid, there’s a lot of symmetry (in particular under the mapping x\mapsto n+1-x in every coordinate), and so you end up with the same family of antichains whether you work upwards from the sources or downwards from the sinks. (Or vice versa depending on how you’ve oriented your diagram…) The layers of antichains also have a natural interpretation – each layer corresponds to a given total score. It’s clear a priori why each of these is an antichain. If A scores the same as B overall, but strictly more on the first question, this must be counterbalanced by a strictly lower score on another question.

So a natural guess for the largest antichain is the largest antichain corresponding to some fixed total score. Which total score should this be? It ought to be the middle layer, that is total score \frac{(n+1)d}{2}, or the two values directly on either side if this isn’t an integer. My intuition was probabilistic. The uniform distribution on the grid is achieved by IID uniform distributions in each coordinate, which you can think of as a random walk, especially if you subtract off the mean first. It feels that any symmetric random walk should have mode zero or next-to-zero. Certainly this works asymptotically in a rescaled sense by CLT, and in a slightly stronger sense by local CLT, but we don’t really want asymptotics here.

When I started writing the previous paragraph, I assumed there would be a simple justification for the claim that the middle layer(s) was largest, whether by straight enumeration, or some combinatorial argument, or even generating functions. Perhaps there is, and I didn’t spot it. Induction on d definitely works though, with a slightly stronger hypothesis that the layer sizes are symmetric around the median, and monotone on either side of the median. The details are simple and not especially interesting, so I won’t go into them.

From now on, the hypothesis is that this middle layer of the grid is the largest antichain. Why shouldn’t it, for example, be some mixture of middle-ish layers? (*) Well, heuristically, any score sequence in one layer removes several possibilities from a directly adjacent layer, and it seems unlikely that this effect is going to cancel out if you take some intermediate number of score sequences in the first layer. Also, the layers get smaller as you go away from the middle, so because of the large amount of symmetry (coordinates are exchangeable etc), it feels reasonable that there should be surjections between layers in the outward direction from the middle. The union of all these surjections gives a decomposition into chains.

This result is in fact true, and its proof by Bollobas and Leader, using shadows and compression can be found in the very readable Sections 0 and 1 of [1].

Most of the key ideas to a compression argument are present in the case n=2, for which some notes by Leader can be found here, starting with Proof 1 of Theorem 3, the approach of which is developed over subsequent sections. We treat the case n=2, but focusing on a particularly slick approach that does not generalise as successfully. We also return to the original case d=3 without using anything especially exotic.

Largest antichain in the hypercube – Sperner’s Theorem

The hypercube \{0,1\}^d is the classical example. There is a natural correspondence between the vertices of the hypercube, and subsets of [d]. The ordering on the hypercube corresponds to the ordering given by containment on \mathcal{P}([d]). Almost by definition, the k-th layer corresponds to subsets of size k, and thus includes \binom{d}{k} subsets. The claim is that the size of the largest antichain is \binom{d}{\lfloor d/2 \rfloor}, corresponding to the middle layer if d is even, and one of the two middle layers if d is odd. This result is true, and is called Sperner’s theorem.

I know a few proofs of this from the Combinatorics course I attended in my final year at Cambridge. As explained, I’m mostly going to ignore the arguments using compression and shadows, even though these generalise better.

As in the previous post, one approach is to exhibit a covering family of exactly this number of disjoint chains. Indeed, this can be done layer by layer, working outwards from the middle layer(s). The tool here is Hall’s Marriage Theorem, and we verify the relevant condition by double-counting. Probably the hardest case is demonstrating the existence of a matching between the middle pair of layers when d is odd.

Take d odd, and let d':= \lfloor d/2\rfloor. Now consider any subset S of the d’-th layer \binom{[d]}{d'}. We now let the upper shadow of S be

\partial^+(S):= \{A\in \binom{[d]}{d'+1}\,:\, \exists B\in S, B\subset A\},

the sets in the (d’+1)-th layer which lie above some set in S. To apply Hall’s Marriage theorem, we have to show that |\partial^+(S)|\ge |S| for all choice of S.

We double-count the number of edges in the hypercube from S to \partial^+(S). Firstly, for every element B\in S, there are exactly d’ relevant edges. Secondly, for every element A\in\partial^+(S), there are exactly d’ edges to some element of \binom{[d]}{d'}, and so in particular there are at most d’ edges to elements of S. Thus

d' |S|=|\text{edges }S\leftrightarrow\partial^+(S)| \le d' |\partial^+(S)|,

which is exactly what we require for Hall’s MT. The argument for the matching between other layers is the same, with a bit more notation, but also more flexibility, since it isn’t a perfect matching.

The second proof looks at maximal chains. Recall, in this context, a maximal chain is a sequence \mathcal{C}=B_0\subset B_1\subset\ldots\subset B_d where each B_k:= \binom{[d]}{k}. We now consider some largest-possible antichain \mathcal{A}, and count how many maximal chains include an element A\in\mathcal{A}. If |A|=k, it’s easy to convince yourself that there are \binom{d}{r} such maximal chains. However, given A\ne A'\in\mathcal{A}, the set of maximal chains containing A and the set of maximal chains containing A’ are disjoint, since \mathcal{A} is an antichain. From this, we obtain

\sum_{A\in\mathcal{A}} \binom{d}{|A|} \le d!. (**)

Normally after a change of notation, so that we are counting the size of the intersection of the antichain with each layer, this is called the LYM inequality after Lubell, Yamamoto and Meshalkin. The heuristic is that the sum of the proportions of layers taken up by the antichain is at most one. This is essentially the same as earlier at (*). This argument can also be phrased probabilistically, by choosing a *random* maximal chain, and considering the probability that it intersects the proposed largest antichain, which is, naturally, at most one. Of course, the content is the same as this deterministic combinatorial argument.

Either way, from (**), the statement of Sperner’s theorem follows rapidly, since we know that \binom{d}{|A|}\le \binom{d}{\lfloor d/2\rfloor} for all A.

Largest antichain in the general grid

Instead of attempting a proof or even a digest of the argument in the general case, I’ll give a brief outline of why the previous arguments don’t transfer immediately. It’s pretty much the same reason for both approaches. In the hypercube, there is a lot of symmetry within each layer. Indeed, almost by definition, any vertex in the k-th layer can be obtained from any other vertex in the k-th layer just by permuting the labels (or permuting the coordinates if thinking as a vector).

The hypercube ‘looks the same’ from every vertex, but that is not true of the grid. Consider for clarity the n=8, d=3 case we discussed right at the beginning, and compare the scores (7,0,0) and (2,2,3). The number of maximal chains through (7,0,0) is \binom{14}{7}, while the number of maximal chains through (2,2,3) is \binom{7}{2, 2,3}\binom{14}{4,5,5}, and the latter is a lot larger, which means any attempt to use the second argument is going to be tricky, or at least require an extra layer of detail. Indeed, exactly the same problem arises when we try and use Hall’s condition to construct the optimal chain covering directly. In the double-counting section, it’s a lot more complicated than just multiplying by d’, as was the case in the middle of the hypercube.

Largest antichain in the d=3 grid

We can, however, do the d=3 case. As we will see, the main reason we can do the d=3 case is that the d=2 case is very tractable, and we have lots of choices for the chain coverings, and can choose one which is well-suited to the move to d=3. Indeed, when I set this problem to some students, an explicit listing of a maximal chain covering was the approach some of them went for, and the construction wasn’t too horrible to state.

[Another factor is that it computationally feasible to calculate the size of the middle layer, which is much more annoying in d>3.]

[I’m redefining the grid here as \{0,1,\ldots,n-1\}^d rather than \{1,2,\ldots,n\}^d.]

The case distinction between n even and n odd is going to make both the calculation and the argument annoying, so I’m only going to treat the even case, since n=8 was the original problem posed. I should be honest and confess that I haven’t checked the n odd case, but I assume it’s similar.

So when n is even, there are two middle layers namely \frac{3n}{2}-2, \frac{3n}{2}-1 (corresponding to total score 10 and total score eleven in the original problem). I calculated the number of element in the \frac{3n}{2}-1 layer by splitting based on the value of the first coordinate. I found it helpful to decompose the resulting sum as

\sum_{k=0}^{n-1} = \sum_{k=0}^{\frac{n}{2}-1} + \sum_{k=\frac{n}{2}}^{n-1},

based on whether there is an upper bound, or a lower bound on the value taken by the second coordinate. This is not very interesting, and I obtained the answer \frac{3n^2}{4}, and of course this is an integer, since n is even.

Now to show that any antichain has size at most \frac{3n^2}{4}. Here we use our good control on the chain coverings in the case d=2. We note that there is a chain covering of the (n,d=2) grid where the chains have 2n-1, 2n-3,…, 3, 1 elements (%). We get this by starting with a maximal chain, then taking a maximal chain on what remains etc. It’s pretty much the first thing you’re likely to try.

Consider an antichain with size A in the (n,d=3) grid, and project into the second and third coordinates. The image sets are distinct, because otherwise a non-trivial pre-image would be a chain. So we have A sets in the (n,d=2) grid. How many can be in each chain in the decomposition (%). Well, if there are more than n in any chain in (%), then two must have been mapped from elements of the (n,d=3) grid with the same first coordinate, and so satisfy a containment relation. So in fact there are at most n image points in any of the chains of (%). So we now have a bound of n^2. But of course, some of the chains in (%) have length less than n, so we are throwing away information. Indeed, the number of images points in a given chain is at most

\max(n,\text{length of chain}),

and so the number of image points in total is bounded by

n+\ldots+n+ (n-1)+(n-3)+\ldots+1,

where there are n/2 copies of n in the first half of the sum. Evaluating this sum gives \frac{3n^2}{4}, exactly as we wanted.


[1] – Bollobas, Leader (1991) – Compressions and Isoperimetric Inequalities. Available open-access here.

Chains and antichains

I’ve recently been at the UK-Hungary winter olympiad camp in Tata, for what is now my sixth time. As well as doing some of my own work, have enjoyed the rare diversion of some deterministic combinatorics. It seems to be a local variant of the pigeonhole principle that given six days at a mathematical event in Hungary, at least one element from {Ramsay theory, Erdos-Szekeres, antichains in the hypercube} will be discussed, with probability one. On this occasion, all were discussed, so I thought I’d write something about at least one of them.

Posets and directed acyclic graphs

This came up on the problem set constructed by the Hungarian leaders. The original formulation asked students to show that among any 17 positive integers, there are either five such that no one divides any other, or five such that among any pair, one divides the other.

It is fairly clear why number theory plays little role. We assign the given integers to the vertices of a graph, and whenever a divides b, we add a directed edge from the vertex corresponding to a to the vertex corresponding to b. Having translated the given situation into a purely combinatorial statement, fortunately we can translate the goal into the same language. If we can find a chain of four directed edges (hence five vertices – beware confusing use of the word ‘length’ here) then we have found the second possible option. Similarly, if we can find an antichain, a set of five vertices with no directed edges between them, then we have found the first possible option.

It’s worth noting that the directed graph we are working with with is transitive. That is, whenever there is an edge a->b and b->c, there will also be an edge a->c. This follows immediately from the divisibility condition. There are also no directed cycles in the graph, since otherwise there would be a cycle of integers where each divided its successor. But of course, when a divides b and these are distinct positive integers, this means that b is strictly larger than a, and so this relation cannot cycle.

In fact, among a set of positive integers, divisibility defines a partial order, which we might choose to define as any ordering whether the associated directed graph is transitive and acyclic, although obviously we could use language more naturally associated with orderings. Either way, from now on we consider posets and the associated DAGs (directed acyclic graphs) interchangeably.

Dilworth’s theorem

In the original problem, we are looking for either a large chain, or a large antichain. We are trying to prove that it’s not possible to have largest chain size at most four, and largest antichain size at most four when there are 17 vertices, so we suspect there may some underlying structure: in some sense perhaps the vertex set is the ‘product’ of a chain and an antichain, or at least a method of producing antichains from a single vertex.

Anyway, one statement of Dilworth’s theorem is as follows:

Statement 1: in a poset with nm+1 elements, there is either a chain of size n+1, or an antichain of size m+1.

Taking n=m=4 immediately finishes the original problem about families of divisors. While this is the most useful statement here, it’s probably not the original, which says the following:

Statement 2: in a poset, there exists \mathcal{C} a decomposition into chains, and an antichain A such that |\mathcal{C}|=|A|.

Remark 1: Note that for any decomposition into chains and any antichain, we have |\mathcal{C}|\ge |A|, since you can’t have more than one representative from any chain in the antichain. So Statement 2 is saying that equality does actually hold.

Remark 2: Statement 1 follows immediately from Statement 2. If all antichains had size at most m, then there’s a decomposition into at most m chains. But each chain has size n, so the total size of the graph is at most mn. Contradiction.

Unsuccessful proof strategies for Dilworth

Since various smart young people who didn’t know the statement or proof of Dilworth’s theorem attempted to find it (in the form of Statement 1, and in a special case) in finite time conditions, it’s easy to talk about what doesn’t work, and try to gain intellectual value by qualifying why.

  • Forgetting directions: in general one might well attack a problem by asking whether we have more information than we need. But ignoring the directions of the edges is throwing away too much information. After doing this, antichains are fine, but maybe you need to exhibit some undirected ‘chains’. Unless these undirected chains are much longer than you are aiming for, you will struggle to reconstruct directed chains out of them.
  • Where can the final vertex go?: in a classic trope, one might exhibit a directed graph on nm vertices with neither a chain of size n+1 nor an antichain of size m+1. We attempt to argue that this construction is essentially unique, and that it goes wrong when we add an extra vertex. As a general point, it seems unlikely to be easier to prove that exactly one class of configurations has a given property in the nm case, than to prove no configurations has the same property in the nm+1 case. A standalone proof of uniqueness is likely to be hard, or a disguised rehash of an actual proof of the original statement.
  • Removing a chain: If you remove a chain of maximal length, then, for contradiction, what you have left is m(n-1)+1 vertices. If you have a long chain left, then you’re done, although maximality has gone wrong somewhere. So you have an antichain size n in what remains. But it’s totally unclear why it should be possible to extend the antichain with one of the vertices you’ve just removed.

An actual proof of Dilworth (Statement 1), and two consequences

This isn’t really a proof, instead a way of classifying the vertices in the directed graph so that this version of Dilworth. As we said earlier, we imagine there may be some product structure. In particular, we expect to be able to find a maximal chain, and a nice antichain associated to each element of the maximal chain.

dilworth-thmWe start by letting V_0 consist of all the vertices which are sources, that is, have zero indegree. These are minima in the partial ordering setting. Now let V_1 consist of all vertices whose in-neighbourhood is entirely contained in V_0, that is they are descendents only of V_0. Then let V_2 consist of all remaining vertices whose in-neighourhood is entirely contained in V_0\cup V_1 (but not entirely in V_0, otherwise it would have already been treated), and so on. We end up with what one might call an onion decomposition of the vertices based on how far they are from the sources. We end up with V_0,V_1,\ldots,V_k, and then we can find a chain of size k+1 by starting with any vertex in V_k and constructing backwards towards the source. However, this is also the largest possible size of a chain, because every time we move up a level in the chain, we must move from V_i to V_j where j>i.

It’s easy to check that each V_i is an antichain, and thus we can read off Statement 1. A little more care, and probably an inductive argument is required to settle Statement 2.

We have however proved what is often called the dual of Dilworth’s theorem, namely that in a poset there exists a chain C, and a decomposition into a collection \mathcal{A} of antichains, for which |C|=|\mathcal{A}|.

Finally, as promised returning to Erdos-Szekeres, if not to positive integers. We apply Dilworth Statement 1 to a sequence of m^2+1 real numbers a_0,a_1,\ldots,a_{m^2}, with the ordering a_i\rightarrow a_j if i\le j and a_i\le a_j. Chains correspond to increasing subsequences, and antichains to decreasing subsequences, so we have shown that there is either a monotone subsequence of length m+1.


EGMO 2016 Paper I

We’ve just our annual selection and training camp for the UK IMO team in Cambridge, and I hope it was enjoyed by all. I allotted myself the ‘graveyard slot’ at 5pm on the final afternoon (incidentally, right in the middle of this, but what England fan could have seen that coming in advance?) and talked about random walks on graphs and the (discrete) heat equation. More on that soon perhaps.

The UK has a team competing in the 5th European Girls Mathematical Olympiad (hereafter EGMO 2016) right now in Busteni, Romania. The first paper was sat yesterday, and the second paper is being sat as I write this. Although we’ve already sent a team to the Romania this year (where they did rather well indeed! I blame the fact that I wasn’t there.), this feels like the start of the olympiad ‘season’. It also coincides well with Oxford holidays, when, though thesis deadlines loom, I have a bit more free time for thinking about these problems. Anyway, last year I wrote a summary of my thoughts and motivations when trying the EGMO problems, and this seemed to go down well, so I’m doing the same this year. My aim is not to offer official solutions, or even outlines of good solutions, but rather to talk about ideas, and how and why I decided whether they did or didn’t work. I hope some of it is interesting.

You can find the paper in many languages on the EGMO 2016 website. I have several things to say about the geometry Q2, but I have neither enough time nor geometric diagram software this morning, so will only talk about questions 1 and 3. If you are reading this with the intention of trying the problems yourself at some point, you probably shouldn’t keep reading, in the nicest possible way.

Question 1

[Slightly paraphrased] Let n be an odd positive integer and x_1,\ldots,x_n\ge 0. Show that

\min_{i\in[n]} \left( x_i^2+x_{i+1}^2\right) \le \max_{j\in[n]} 2x_jx_{j+1},

where we define x_{n+1}=x_1 cyclically in the natural way.

Thought 1: this is a very nice statement. Obviously when i and j are equal, the inequality holds the other way round, and so it’s interesting and surprising that constructing a set of pairs of inequalities in the way suggested gives a situation where the ‘maximum minimum’ is at least the ‘minimum maximum’.

Thought 2: what happens if n is actually even? Well, you can kill the right-hand-side by taking at least every other term to be zero. And if n is even, you really can take every other term to be even, while leaving the remaining terms positive. So then the RHS is zero and the LHS is positive.

The extension to this thought is that the statement is in danger of not holding if there’s a lot of alternating behaviour. Maybe we’ll use that later.

Idea 1: We can write

2(x_i^2+x_{i+1}^2)=(x_i+x_{i+1})^2 + |x_i-x_{i+1}|^2, \quad 4x_ix_{i+1}=(x_i+x_{i+1})^2 - |x_i-x_{i+1}|^2,

which gives insight into ‘the problem multiplied by 2’. This was an ‘olympiad experience’ idea. These transformations between various expressions involving sums of squares turn out to be useful all the time. Cf BMO2 2016 question 4, and probably about a million other examples. As soon as you see these expressions, your antennae start twitching. Like when you notice a non-trivial parallelogram in a geometry problem, but I digress. I’m not sure why I stuck in the absolute value signs.

This was definitely a good idea, but I couldn’t find a way to make useful deductions from it especially easily. I tried converting the RHS expression for i (where LHS attains minimum) into the RHS expression for any j by adding on terms, but I couldn’t think of a good way to get any control over these terms, so I moved on.

Idea 2: An equality case is when they are all equal. I didn’t investigate very carefully at this point whether this might be the only equality case. I started thinking about what happens if you start with an ‘equal-ish’ sequence where the inequality holds, then fiddle with one of the values. If you adjust exactly one value, then both sides might stay constant. It seemed quite unlikely that both would vary, but I didn’t really follow this up. In any case, I didn’t feel like I had very good control over the behaviour of the two sides if I started from equality and built up to the general case by adjusting individual values. Or at least, I didn’t have a good idea for a natural ordering to do this adjustment so that I would have good control.

Idea 3: Now I thought about focusing on where the LHS attains this minimum. Somewhere, there are values (x,y) next to each other such that x^2+y^2 is minimal. Let’s say x\le y. Therefore we know that the element before x is at least y, and vice versa, ie we have

\ldots, \ge y, x, y, \ge x,\ldots.

and this wasn’t helpful, because I couldn’t take this deduction one step further on the right. However, once you have declared the minimum of the LHS, you are free to make all the other values of x_i smaller, so long as they don’t break this minimum. Why? Because the LHS stays the same, and the RHS gets smaller. So if you can prove the statement after doing this, then the statement was also true before doing this. So after thinking briefly, this means that you can say that for every i, either x_{i-1}^2+x_i^3 or x_i^2+x_{i+1}^2 attains this minimum.

Suddenly this feels great, because once we know at least one of the pairs corresponding to i attains the minimum, this is related to parity of n, which is in the statement. At this point, I was pretty confident I was done. Because you can’t partition odd [n] into pairs, there must be some i which achieves a minimum on both sides. So focus on that.

Let’s say the values are (x,y,x) with x\le y. Now when we try to extend in both directions, we actually can do this, because the values alternate with bounds in the right way. This key is to use the fact that the minimum x^2+y^2 must be attained at least every other pair. (*) So we get

\ldots, \le x,\ge y,x,y,x,\ge y,\le x,\ldots.

But it’s cyclic, so the ‘ends’ of this sequence join up. If n\equiv 1 modulo 4, we get \ge y,\ge y next to each other, which means the RHS of the statement is indeed at least the LHS. If n\equiv 3 modulo 4, then we get \le x,\le x next to each other, which contradicts minimality of x^2+y^2 unless x=y. Then we chase equality cases through the argument (*) and find that they must all be equal. So (after checking that the case x\ge y really is the same), we are done.

Thought 3: This really is the alternating thought 2 in action. I should have probably stayed with the idea a bit longer, but this plan of reducing values so that equality was achieved often came naturally out of the other ideas.

Thought 4: If I had to do this as an official solution, I imagine one can convert this into a proof by contradiction and it might be slightly easier, or at least easier to follow. If you go for contradiction, you are forcing local alternating behaviour, and should be able to derive a contradiction when your terms match up without having to start by adjusting them to achieve equality everywhere.

Question 3

Let m be a positive integer. Consider a 4m x 4m grid, where two cells are related to each other if they are different but share a row or a column. Some cells are coloured blue, such that every cell is related to at least two blue cells. Determine the minimum number of blue cells.

Thought 1: I spent the majority of my time on this problem working with the idea that the answer was 8m. Achieved by taking two in each row or column in pretty much any fashion, eg both diagonals. This made me uneasy because the construction didn’t take advantage of the fact that the grid size was divisible by 4. I also couldn’t prove it.

Thought 2: bipartite graphs are sometimes useful to describe grid problems. Edges correspond to cells and each vertex set to row labels or column labels.

Idea 1: As part of an attempt to find a proof, I was thinking about convexity, and why having exactly two in every row was best, so I wrote down the following:

Claim A: No point having three in a row.

Claim B: Suppose a row has only one in it + previous claim => contradiction.

In Cambridge, as usual I organised a fairly comprehensive discussion of how to write up solutions to olympiad problems. The leading-order piece of advice is to separate your argument into small pieces, which you might choose to describe as lemmas or claims, or just separate implicitly by spacing. This is useful if you have to do an uninteresting calculation in the middle of a proof and don’t want anyone to get distracted, but mostly it’s useful for the reader because it gives an outline of your argument.

My attempt at this problem illustrates an example of the benefit of doing this even in rough. If your claim is a precise statement, then that’s a prompt to go back and separately decide whether it is actually true or not. I couldn’t prove it, so started thinking about whether it was true.

Idea 2: Claim A is probably false. This was based on my previous intuition, and the fact that I couldn’t prove it or get any handle on why it might be true. I’d already tried the case m=1, but I decided I must have done it wrong so tried it again. I had got it wrong, because 6 is possible, and it wasn’t hard from here (now being quite familiar with the problem) to turn this into a construction for 6m in the general case.

Idea 3: This will be proved by some sort of double-counting argument. Sometimes these arguments turn on a convexity approach, but when the idea is that a few rows have three blue cells, and the rest have one, this now seemed unlikely.

Subthought: Does it make sense for a row to have more than three blue cells? No. Why not? Note that as soon as we have three in a row, all the cells in that row are fine, irrespective of the rest of the grid. If we do the problem the other way round, and have some blues, and want to fill out legally the largest possible board, why would we put six in one row, when we could add an extra row, have three in each (maintaining column structure) and be better off than we were before. A meta-subthought is that this will be impossible to turn into an argument, but we should try to use it to inform our setup.

Ages and ages ago, I’d noticed that you could permute the rows and columns without really affecting anything, so now seemed a good time to put all the rows with exactly one blue cell at the top (having previously established that rows with no blue cell were a disaster for achieving 6m), and all the columns with one blue cell at the left. I said there were r_1,c_1 such rows and columns. Then, I put all the columns which had a blue cell in common with the r_1 rows next to the c_1 columns I already had. Any such column has at least three blues in it, so I said there were c_3 of these, and similarly r_3 rows. The remaining columns and rows might as well be r_0,c_0 and hopefully won’t matter too much.

From here, I felt I had all the ingredients, and in fact I did, though some of the book-keeping got a bit fiddly. Knowing what you are aiming for and what you have means there’s only one way to proceed: first expressions in terms of these which are upper bounds for the number of columns (or twice the number of columns = rows if you want to keep symmetry), and lower bounds in terms of these for the number of blue cells. I found a noticeable case-distinction depending on whether r_1\le 3c_3 and c_1\le 3r_3. If both held or neither held, it was quite straightforward, and if exactly one held, it got messy, probably because I hadn’t set things up optimally. Overall, fiddling about with these expressions occupied slightly more time than actually working out the answer was 6m, so I don’t necessarily have a huge number of lessons to learn, except be more organised.

Afterthought 2: Thought 2 said to consider bipartite graphs. I thought about this later while cycling home, because one can’t (or at least, I can’t) manipulate linear inequalities in my head while negotiating Oxford traffic and potholes. I should have thought about it earlier. The equality case is key. If you add in the edges corresponding to blue cells, you get a series of copies of K_{1,3}, that is, one vertex with three neighbours. Thus you have three edges for every four vertices, and everything’s a tree. This is a massively useful observation for coming up with a very short proof. You just need to show that there can’t be components of size smaller than 4. Also, I bet this is how the problem-setter came up with it…

Lovasz Local Lemma

At our training and selection camp in Tonbridge in May, I gave a session about the use of probabilistic methods to solve olympiad combinatorics questions. Such an approach will normally be relevant to a problem where it is required to show the existence of a structure satisfying a particular property. We might consider constructing a structure randomly, and then try to show that the probability that our construction has the required property is non-zero.

There are several possible approaches to this, but often for the property we seek, we can describe a family of ‘bad events’ A_1,\ldots,A_n, and then we have the property if none of the bad events hold. That is, we are looking for general conditions under which we can show that \mathbb{P}(\Cap_{i=1}^n A_i^c)>0.

We have two cases where this is easy to do.

1) If all the A_is are independent, then so long as all \mathbb{P}(A_i)<1, we have our result.

2) If the probability of the bad events have sum less than 1, then we can use a union bound

\mathbb{P}(\cup A_i)\le \sum_{i=1}^n \mathbb{P}(A_i) <1,

to conclude what we want.

In Tonbridge we also discussed first-moment methods, where we show that the expected number of bad events is less than 1, meaning there must be some elements of the probability space where the number of bad events is zero. In this article, we’re looking in a different direction. We’ll try to interpolate between situation 1) and 2), to make some concrete comment on the situation where the probabilities of the bad events are small, but not small enough to use the union bound, and where the events are not quite independent, but are in some sense almost independent.

The first thing we need to do is think about what it means for a family of events to be independent. Suppose we toss a fair coin twice, and consider the events:

A:= {first coin is H}, B:={second coin is T}, C:={exactly one H}.

So, it is easy to see that each pair of events is independent, but if we know that A and B hold, then also C holds (and indeed this stays true under cyclic re-ordering). So C is not independent of the family of events {A,B}. Rather than give a formal definition, I’ll say instead that an event B is said to be independent of the family of events \{A_1,\ldots,A_5\} if it is independent of


A_1\cap A_2

A_3\cap A_4^c\cap A_5^c,

and so on. I hope it’s clear from this what I mean. Slogan: no knowledge about which of the A_i do or don’t hold gives information about B.

Now we return to our original setup. We want that each A_i is independent of lots of the rest, and so we choose for each i\in[n] a dependency set D_i\subset [n] of indices, such that A_i is independent of the family of events \{A_j: j\in [n]\backslash D_i\}. It is tempting to interpret this immediately as saying that A_i depends on each event with index in D_i. This will normally be fine in practice, but we should emphasise that there is a lot of freedom to choose D_i, and the genuinely important condition is that A_i is independent of the family given by the rest of the indices.

[*Health warning*: The language I am using seems sensible for anyone learning about this for the first time, but is very slightly non-standard. Instead of dependency sets, the classical notation is to consider a dependency digraph on [n], with an edge i->j whenever j\in D_i.]

The symmetric version of the Lovasz Local Lemma says: suppose \mathbb{P}(A_i)\le p and we can choose D_i as described so that |D_i|\le d for each I. Then, if epd\le 1, we may conclude \mathbb{P}(\Cap_{i=1}^n A_i^c)>0.

We’ll come back to the proof, which is best seen in a slightly more general formulation. First, let’s get to grips with the notation, by reassuring ourselves that this really does lie between the union bound and the independence case.

If the events are independent, then clearly we may take D_i=\{i\} for each i, that is d=1, so we may draw the desired conclusion so long as p\le 1/e, which is fine, but a lot less convincing than p<1. Similarly, for the union bound, we have said nothing about the dependency relationships, and so we have to take D_i=[n] for each i. So we may draw the desired conclusion provided p\le 1/ne, which is obviously again a factor of e less than what we would have had with a union bound itself.

Now we’ll see how this might be useful when applied, for example, to a probabilistic construction for the lower bound on the Ramsey number R(k). From Erdos’s argument, we know R(k)\ge (1+o(1)) 2^{k/2}, and we will earn an extra factor of k\sqrt{2}/e. An extra factor of k/e can also be obtained by an alteration argument, but this extra factor of \sqrt{2} makes this LLL method one of the strongest available for this problem.

Recall that for a lower bound, we are trying to find examples of 2-edge-colourings of a large complete graph K_n, such that there is no monochromatic copy of a K_k. We consider the uniform independent random edge-colouring. It makes sense that a bad event A_S should be the presence of a monochromatic complete graph induced on a set S\subset [n], of size k. Since there are two potential colours for the monochromatic induced K_k, we have \mathbb{P}(A_S)=2^{1-\binom{k}{2}}. Then we take the dependency set D_S of A_S to include all those k-sets which share an edge with S, that is |S\cap T|\ge 2. We think about which vertices might contribute to the shared edge, and which make up the remainder to conclude |D_S|\le \binom{k}{2}\binom{n-2}{k-2}.

So now, whenever e\cdot 2^{1-\binom{k}{2}}\binom{k}{2}\binom{n-2}{k-2}\le 1, as a consequence of LLL we can conclude that with positive probability the random colouring gives a suitable example, that is R(k)>n. After fiddling around with Stirling’s formula in a fashion that isn’t hugely interesting, we can conclude R(k)\ge (1+o(1)) \frac{k\sqrt{2}}{2} 2^{k/2}.

The prompt for this article was a discussion during our Malaysian training camp of methods applicable to IMO 2014 Q6. If you want to know just how applicable LLL is, I suggest you have a think yourself. It’s not a direct application – so some thought is involved. Maybe as an appetiser, here are some more elementary exercises, which I’ve borrowed from examples on Po-Shen Loh’s olympiad handouts, and Wikipedia, though I doubt the latter is the original source:

1) 11n points on a circle are coloured with n colours, such that each colour is used exactly 11 times. Show that it’s possible to choose one point of each colour such that no pair are adjacent.

2) A finite graph is given, and all degrees are between 50 and 100. Find some finite C such that you can always colour the vertices of such a graph so that the neighbourhood of any vertex includes at least 20 colours.

Finally, we discuss the more general statement of LLL, and explain how the proof works.

General Lovasz Local Lemma: Suppose there exist x_i\in [0,1) such that \mathbb{P}(A_i)\le x_i \prod_{j\in D_i\backslash\{i\}} (1-x_j) (*). Then \mathbb{P}(\Cap A_i^c)\ge \prod (1-x_i)>0.

Deducing the symmetric form from the general form is not hard. Condition (*) is motivated by the proof. We want to be able to say that no matter which other events and their unions, complements etc we condition on, we still have an upper bound for the probability of A_i. This bound will be x_i. In particular, we want to show that the probability of bad event A_i does not get too high, even if we know that lots of other bad events have not occurred.

The proof proceeds by showing \mathbb{P}(A_i | \Cap_{j\in S}A_j^c)\le x_i for all i, by induction on |S|. For the inductive step, you split S=S_1\cup S_2 where S_1=S\cap D_i, S_2=S\cap D_i^c. If S_1=\varnothing, you are done straight away, by the assumption (*) and independence of A_i and the events not indexed by D_i. Otherwise, you can use the inductive hypothesis on S_2, and repeated Bayes’ theorem to show what you want in a series of steps that have a lot of notation, but aren’t hugely difficult.

Local to Global in Point Set Combinatorics


I’ve spent the last few days in Cambridge, helping to run the first camp to select the UK team for the International Mathematical Olympiad. We’ve had two 4.5 hour exams, replicating what six of the students will face in the final competition, which this year is held in Cape Town, and also lots of problem-solving sessions, introducing the students to different areas of mathematics relevant to the olympiad. For various reasons, I ended up choosing to discuss combinatorial problems involving points and lines with both sets of students. This is an area which is very popular with problem setters (witness for example Question Two at last year’s IMO in Colombia) but which, because of a lack of any real ‘theory’, is often not treated at these camps.

Anyway, 90 minutes is not a lot of time to solve a large number of problems, so I was interested in the notion of using the time to try to teach the students how to have the right sort of ideas to deal with these. Obviously if there were an easy way to teach people to have good ideas, mathematically or otherwise, life would be a whole lot more straightforward. So, given we can’t do that, the next best thing is to have lots of ideas of any kind, and to get practice at choosing which ones seem best reliably and quickly.

We have two problems to consider that are very similar. I’ll state the first one:

Q1: a strip of width w is the set of all points which lie on or between two parallel lines that are a distance w apart. Let S be a set of n>2 points such that any three can be covered by a strip of width 1. Prove that S can be covered by a strip of width 2.

This sort of thing comes up all the time in olympiad combinatorics and beyond. We have some condition that is local, in the sense that it only affects triples of points, but we want to show that the same condition works for all the points together.

The first step is to be clear about what we are looking for. We are looking for a strip, but we have also been given lots of strips, namely the strips covering each triple of points. So the natural thing to do is to construct the strip we are looking for from the strips we have.

The next observation offered was that we might want to combine two strips of width 1 to get a strip of width 2. This is certainly an idea, but a quick check makes you realise it isn’t a very promising idea. Why? Well unless two strips are parallel, it takes an infinite-width strip to cover their union. Ok, so although that wasn’t useful in itself, it now encourages us to focus on constructing the strip we are looking for from a single one of the strips we are given.

It’s not yet clear exactly how we are going to construct the desired strip from the strip we are given. We should be alive to the possibility that 2 might be a weak bound, and actually 1+\epsilon would suffice. Or we might have tried some examples. One thing we might have is a strong example of where a strip of width 2 is necessary, and where we can’t add any points which don’t fit in this strip without breaking the first condition.

Maybe this doesn’t help so much. So perhaps we should think not about how to construct the big strip from the little strip, but instead, how to choose the little strip from the large set we have at our disposal. It is at this stage that I feel it is really important to use all the power at our disposal. In many combinatorics problems we have dealing with a discrete set like [n], where the roles of the labels are interchangeable, and so it can sometimes to be difficult to pick out a particular element. For example, it doesn’t matter for induction whether we take [n-1] then add {n} or start with [2,n] and add {1}. When the construction is geometric, we should use the geometry to make life easier in the construction. For example, if we want to induct, we could add the left-most point in some co-ordinate axes, or some point on the convex hull.

In our particular example, we are still trying to decide which strip to work with. If we consider the triple of points which is in some sense hardest to cover with a strip, this might be a sensible thing to work with. Now we need to decide what it means to be hard to cover a triangle with a strip. We might observe that if the triangle has an altitude of height h, we can certainly cover the triangle with a strip of width h. Further playing around convinces us that the minimum width of a strip required to cover a triangle is given by the shortest altitude. So our intuition that it’s hard to cover a triangle if it is large has been made a bit more concrete. We might also observe that we don’t have much flexibility for a strip if one of the lengths of the triangle is very long. For fixed strip width, as the length of the triangle gets large, the range of angles the strip could sit at (if it is possible at all) gets small.

This motivates considering the least convenient triangle as the one with maximal width and maximal height of altitude. We might try both orders, but certainly considering the triangle ABC such that d(A,B) is maximal among the set of points, and d(AB, C) is maximal once A,B are determined, is useful. The fact that we can cover ABC with one of the strips tells us from the above that d(AB,C)\le 1, and suddenly we are almost done, as we now have a strong clue for what the strip we are looking for should be. If we take all the points x such that d(AB,x)\le 1, we have a strip of width 2, and by construction of C we must have included all points of the set.

The key ingredient here was using what we already had, realising it probably wasn’t going to be helpful to attempt to use *all* of what we already had, then coming up with a sensible way to choose which pre-existing strip to develop.

With this in mind, the next question seems a lot easier:

Q2: Given n points in the plane such that any three lie within some circle of radius 1, prove that *all* the points lie within some circle of radius 1.

Circulating round the room, there were lots of pairs of compasses out and various diagrams of interlocking circles snaking around pieces of A4. But after discussing Q1, suddenly all seems a lot more clear. Similarly to the previous case, we have a lot of circles, and we are looking for a circle. Conveniently, the circle we are looking for is even the same size as the circles we have.

So we need to choose which circle we already have is likely to be most useful. Suppose we have three points very close together. Then we know very little about the circle containing those points – it could be almost anything. By contrast, if we have three points which form a triangle with circumradius equal to 1, then there is in fact no choice about what the covering circle could be. So this is most likely to be useful. So we conjecture that we should consider the triangle with largest circumradius. In fact, this is a problem: if the triangle is obtuse, then the circumradius might actually be greater than 1, but the points still satisfy the condition. In that case, it makes sense to consider the circle with diameter given by the longest side of the triangle.

With that clarification, we now have a collection of circles with radii all \le 1, which cover the triangles. We conjecture that the circle with largest radius might cover all the points. This turns out to be correct, and not too difficult to prove. Both these problems are in some sense extremal. Only here the tricky part was deciding what to extremise. But all were agreed that with some meta-thinking, we were able to decide fairly accurately which ideas were leading towards progress, and following them led fairly rapidly to this extremal property.

Enhanced by Zemanta

The Combinatorial Nullstellensatz

I’ve been taking a TCC course this term on Additive Combinatorics, delivered via video link from Bristol by Julia Wolf. At some point once the dust of this term has settled, I might write some things about the ideas of the course I’ve found most interesting, in particular the tools of discrete Fourier analysis to get a hold on some useful combinatorial properties of subsets of \mathbb{Z}/n\mathbb{Z} for example.

For this post, I want to talk instead about a topic that was merely mentioned in passing, the Combinatorial Nullstellensatz. The majority of this post is based on Alon’s original paper, which can be found here, and Chapter 9 of Tao and Vu’s book Additive Combinatorics. My aim is to motivate the theorem, give a proof, introduce one useful application from additive combinatorics, and solve Q6 from IMO 2007 as a direct corollary.

What does Nullstellensatz mean? Roughly speaking, it seems to mean ‘a theorem specifying the zeros’. We will be specifying the zeros of a polynomial. We are comfortable with how the zeros of a complex-valued polynomial of one variable behave. The number of zeros is given precisely by the degree of the polynomial (allowing appropriately for multiplicity). It is generally less clear how we might treat the zeros of a polynomial of many variables. The zero set is likely to be some surface, perhaps of dimension one less than the number of variables. In particular, it no longer really makes sense to talk about whether this set is finite or not. The Combinatorial Nullstellensatz gives us some control over the structure of this set of zeros.

The idea behind the generalisation is to view the Fundamental Theorem of Algebra as a statement not about existence of roots, but rather about (combinatorial) existence of non-roots. That is, given a polynomial P(x) of degree n, for any choice of (n+1) complex numbers, at least one of them is not a root of P. This may look like a very weak statement in this context, where we only expect finitely many roots anyway, but in a multivariate setting it is much more intuitively powerful.

Recall that the degree of a monomial is given by the sum of the exponents of the variables present. So the degree of 4x^2 y^3 z is 6. The degree of a polynomial is then given by the largest degree of a monomial in that polynomial. A polynomial P(x_1,\ldots,x_n) over a field F with degree d might have lots of monomial terms of degree d. Suppose one of these monomials is x_1^{d_1}\ldots x_n^{d_n}, where \sum d_i=d. Then one version of the Combinatorial Nullstellensatz asserts that whenever you take subsets of the base field S_i\subset F with |S_i|\ge d_i+1, then there is a point with x_i\in S_i such that P(x_1,\ldots,x_n)=0.

In other words, you can’t have a box (ie product of sets) of dimension d_1+1 \times d_2+1 \times\ldots\times d_n+1 on which the polynomial is zero.

Unsurprisingly, the proof proceeds by induction on the number of variables. Alon’s result proceeds via a more general theorem giving information about the possibility of writing multinomial polynomials as linear combinations of polynomials in one variable.

We would like to start this induction by fixing the x_n co-ordinate, then viewing P as a polynomial in x_1,\ldots,x_{n-1} only. One problem with this approach is that the largest degree monomials in P are not necessarily still the largest degree monomials in P with x_n fixed. So we need to apply a division algorithm argument.

I’m going to miss some steps so as to keep this of suitable blog post length. The key idea is to apply the division algorithm to P with respect to the simplest polynomial that is zero on all of S_n, which we define as:

g(x_n)=\prod_{s_n\in S_n}(x_n-s_n).

We can decompose as


So now we ask where the term x_1^{d_1}\ldots x_n^{d_n} is coming from, bearing in mind that d_n<|S_n|. The lower order terms in g cannot contribute to this, as  they cannot be of maximal degree. Also, the first term in q_n(\mathbf{x})g(x_n) cannot contribute as the exponent of x_n is too large. So the term in question must be coming from r_{n,d_n}(x_1,\ldots,x_{n-1})x_n^{d_n}. So now we can apply the induction hypothesis to the polynomial r_{n,d_n} to find $x_1\in S_1,\ldots, x_{n-1}\in S_{n-1}$ such that r_{n,d_n}(x_1,\ldots,x_{n-1} is non-zero. With these values, we can view the remainder as a polynomial in x_n of degree |S_n|>d_n, and so there is an x_n\in S_n such that

\sum_{j=0}^{|S_n|}r_{n,j}(x_1,\ldots,x_{n-1})x_n^j)\neq 0.

This concludes the proof by induction.

I want to discuss two relatively simple applications. The first is the Cauchy-Davenport Theorem, which one might view as the first non-trivial theorem in additive combinatorics, giving a bound on the size of a sumset.

Theorem (Cauchy-Davenport): Given A, B non-empty subsets of Z_p for p a prime, then

|A+B|\geq \min\{p,|A|+|B|-1\}.

( A+B:=\{c: c=a+b,a\in A,b\in B\} )

Note that the result isn’t especially surprising. Providing some sort of ordering to the elements of A and B might be a sensible way to proceed. Certainly if they were sets in \mathbb{Z}, this would give a proof immediately.

Proof: Only the case |A|+|B| <= p is interesting. Following Alon’s argument, suppose that |A+B| <= |A|+|B|-2, and let C=A+B. Set f(x,y)=\prod_{c\in C}(x+y-c), so f(a,b)=0 for all a\in A,b\in B.

Then the coefficient of x^{|A|-1}y^{|B|-1} in f is \binom{|A|+|B|-2}{|A|-1} as we have to choose which of the terms in the product supply an x and which supply a y. This is non-zero (in Z_p recall) since the upper integer is less than p. The Combinatorial Nullstellensatz then gives a contradiction.

My second example is from the IMO in Vietnam which I attended. I spent a lot of time thinking about this problem, but made no progress.

IMO 2007 Question 6: Let n be a positive integer. Consider

S=\{(x,y,z) | x,y,z\in \{0,1,\ldots,n\}, x+y+z>0\}

as a set of (n+1)^3-1 points in 3D space. Determine the smallest number of planes, the union of which contains S but does not include (0,0,0).

Answer: 3n. Consider the planes x+y+z = k for k varying between 1 and 3n. The aim is to prove that you cannot do it with fewer.

To prove this, suppose we can do with fewer planes, say k. We write the equation of a plane as


Note that the d’s are non-zero as (0,0,0) must not be a solution. Then take the product of all these degree one polynomials together and subtract a multiple of

\prod_{i=1}^n (x-i)(y-i)(z-i),

with the multiple chosen so the resulting polynomial has a root at (0,0,0). (This constant must be non-zero to cancel the non-zero product of the d’s.) This resulting polynomial is degree 3n by construction, and x^ny^nz^n has a non-zero coefficient, but it is zero on the box [0,n]^3, which contradicts Combinatorial Nullstellensatz.

Bijections, Prufer Codes and Cayley’s Formula

I’m currently at the training camp in Cambridge for this year’s UK IMO squad. This afternoon I gave a talk to some of the less experienced students about combinatorics. My aim was to cover as many useful tricks for calculating the sizes of combinatorial sets as I could in an hour and a half. We started by discussing binomial coefficients, which pleasingly turned out to be revision for the majority. But my next goal was to demonstrate that we are much more interested in the fact that we can calculate these if we want than in the actual expression for their values.

Put another way, my argument was that the interpretation of \binom{n}{m} as the number of ways to choose m objects from a collection of n, or the number of up-and-right paths from (0,0) to (m,n) is more useful than the fact that \binom{n}{m}=\frac{n!}{m!(n-m)!}. The opening gambit was to prove the fundamental result underlying the famous construction of Pascal’s triangle that


This is not a hard result to prove by manipulating factorials, but it is a very easy result to prove in the path-counting setting, for example.

So it turned out that the goal of my session, as further supported by some unsubtly motivated problems from the collection, was to convince the students to use bijections as much as possible. That is, if you have to count something awkward, show that counting the awkward thing is equivalent to counting something more manageable, then count that instead. For many simpler questions, this equivalence is often drawn implicitly using words (“each of the n objects can be in any subset of the collection of bags so we multiply…” etc), but it is always worth having in mind the formal bijective approach. Apart from anything else, asking the question “is this bijection so obvious I don’t need to prove it” is often a good starting-point for assessing whether the argument is in fact correct!

Anyway, I really wanted to show my favouriite bijection argument, but there wasn’t time, and I didn’t want to spoil other lecturers’ thunder by defining a graph and a tree and so forth. The exploration process encoding of trees is a strong contender, but today I want to define quickly the Prufer coding for trees, and use it to prove a famous result I’ve been using a lot recently, Cayley’s formula for the number of spanning trees on the complete graph with n vertices, n^{n-2}.

We are going to count rooted trees instead. Since we can choose any vertex to be the root, there are n^{n-1} rooted trees on n vertices. The description of the Prufer code is relatively simple. Take a rooted tree with vertices labelled by [n]. A leaf is a vertex with degree 1, other than the root. Find the leaf with the largest label. Write down the label of the single vertex to which this leaf is connected, then delete the leaf. Now repeat the procedure, writing down the label of the vertex connected to the leaf now with the largest label, until there are only two vertices remaining, when you delete the non-root vertex, and write down the label of the root. We get a string of (n-1) labels. We want to show that this mapping is a bijection from the set of rooted trees with vertices labelled by [n] to [n]^{n-1}.

Let’s record informally how we would recover a tree from the Prufer code. First, observe that the label of any vertex which is not a leaf must appear in the code. Why? Well, the root label appears right at the end, if not earlier, and every vertex must be deleted. But a vertex cannot be deleted until it has degree one, so the neighbours further from the root (or ancestors) of the vertex must be removed first, and so by construction the label appears. So know what the root is, and what the leaves are straight away.

In fact we can say slightly more than this. The number of times the root label appears is the degree of the root, while the number of times any other label appears is the degree of the corresponding vertex minus one. Call this sequence the Prufer degrees.

So we construct the tree backwards from the leaves towards the root. We add edges one at a time, with the k-th edge joining the vertex with the k-th label to some other vertex. For k=1, this other vertex is the leaf with maximum label. In general, let G_k be the graph formed after the addition of k-1 edges, so G_1 is empty, and G_n is the full tree. Define T_k to be the set of vertices such that their degree in G_k is exactly one less than their Prufer degree. Note that T_1 is therefore the set of leaves suggested by the Prufer code. So we form G_{k+1} by adding an edge between the vertex with label appearing at position k+1 in the Prufer sequence and the vertex of T_k with maximum label.

Proving that this is indeed the inverse is a bit fiddly, more because of notation than any actual mathematics. You probably want to show injectivity by an extremal argument, taking the closest vertex to the root that is different in two trees with the same Prufer code. I hope it isn’t a complete cop out to swerve around presenting this in full technical detail, as I feel I’ve achieved by main goal of explaining why bijection arguments can reduce a counting problem that was genuinely challenging to an exercise in choosing sensible notation for proving a fairly natural bijection.