The Rearrangement Inequality

A favourite result of many students doing olympiad inequality problems is the so-called Rearrangement Inequality. This is a mathematical formulation of the idea well-known to even the smallest of child that if you prefer cakes to carrots then if you are offered two of one and one of the other, you should take two of the one you prefer!

At a more formal level, it says that given two strings of non-negative numbers

a_1\le a_2,\le \ldots\le a_n, \quad b_1\le b_2\le \ldots\le b_n,

if you want to form a sum of products of pairs, like


you get the largest result if you take


Formally, for any permutation \sigma \in S_n,

a_1b_1+\ldots+a_nb_n\ge a_1b_{\sigma(1)}+\ldots+a_nb_{\sigma(n)}\ge a_1b_n+\ldots+a_nb_1.

That is, you multiply the largest terms in each sequence together.

The notation to describe to equality case is a bit annoying. Essentially, the sums are equal if and only if the summands exactly correspond. If the sequences are strictly increasing, then equality holds only if the permutation \sigma=\text{id}.

This result is nice because, although it is rarely explicitly useful, it goes in a different direction from the standard scheme of results strengthening AM-GM, Cauchy-Schwarz and so on, and is in some sense more intuitive than these more well-known inequalities, at least in the form presented in an olympiad context.

I was thinking about this partly because it’s a nice result in its own right, but also because it came up in a research problem to do with comparing the expected likelihood of different tree isomorphism classes arising in an inhomogeneous, but relatively well-behaved, random graph model. The probability of forming a given tree is a homogeneous multivariate polynomial in the ages of the vertices that would form the tree. It is then necessary to integrate over the joint distribution (which fortunately is a product in the limit) of the ages of the vertices. I was playing around with this by considering what seemed to be the extreme cases: the star and the path. I was working with the relatively simple case n=4, and it struck me that perhaps the polynomial for the star was always at least as large as that for the path. This would be convenient as it would avoid the need for a horrific-looking integral calculation. This turned out to be true. My first method was a heavy but uncontroversial convexity and stationary point argument, but I found a pair of vectors embedded in the desired inequality on which I could deploy rearrangement.

Anyway, I thought I should be able to come up with a nice proof, and I think this is one. I think this is particularly nice because it is a demonstration that one can do a proof by induction without explicitly inducting on the natural numbers.

We begin with a base case, which is the theorem for n=2, even though we will not be doing induction in the canonical way. We are required to prove that given

a_1\le a_2,\quad b_1\le b_2,


a_1b_1+a_2b_2\ge a_1b_2+a_2b_1,

since these are the only available permutations. Moving some terms around gives

(a_2-a_1)(b_2-b_1)\ge 0,

which is true by construction, and so the n=2 result follows.

We now move straight to the general n case. We focus on the left of the two inequalities in the statement of the result, since the other will follow by an identical method, applied in reverse. We consider the case where \sigma is a transposition. For example, we might consider 12435. When we write out the result we want:

a_1b_1+a_2b_2+a_3b_3+a_4b_4+a_5b_5\ge a_1b_1+a_2b_2+a_3b_4+a_4b_3+a_5b_5,

we realise that many of the terms cancel, and the content of the theorem reduces to the n=2 case we have already dealt with. Obviously, this holds equally well whenever \sigma is a transposition. Similarly, if \sigma is a product of two disjoint transpositions, which means that two disjoint pairs of elements are interchanged, we can apply the n=2 case twice, then add on the extra terms to get the result.

In fact, we can do much better than this, by using the fact that any permutation can be expressed as a product of transpositions. We need to be careful about the risk of asserting that every time we multiply the permutation \sigma by a transposition, the value of the associated sum-product expression gets smaller. While the idea is correct, this cannot be generally true. After all, applying the same transposition twice returns us to the identity permutation!

We can nonetheless say something useful. If we start with a permutation


and we interchange the ith and jth elements, to get,


then the product sum corresponding to \tau is less than or equal to the product sum corresponding to \sigma if $\sigma(i)\leq sigma(j)$, under the implicit assumption that i<j. In other words, we can prove the rearrangement inequality for any permutation \sigma that can be obtained from the identity by repeatedly interchanging elements that are initially in increasing order. Essentially, we have defined a partial ordering on the set of permutations.

It suffices to check that all permutations have this property. In fact, this is relatively easy. We can move element n to its required position in \sigma by successively swapping with (n-1), (n-2), etc. If we set this up as an inductive argument, we can finish by applying the hypothesis to the remaining (n-1) elements, which are in the same order as the identity permutation on [n-1].

So we have proved the left-hand side of the Rearrangement Inequality. In fact, this partial ordering framework makes it clear how to prove the right-hand side. By an identical argument, we can get from any permutation to the reverse identity by a similar set of operations.

The Top-to-Random Shuffle II

In the last post, I introduced the top-to-random shuffle. In particular, we considered why this sort of procedure was important as an alternative to choosing an ordering afresh, and how to we would go about measuring how close we had got to randomness.

In this post, I want to develop the second of these points. The intuition might be that we can get very close to uniform randomness if we repeat the shuffle often enough. Recall this means that even if we choose our bet in a really complicated and careful way, we still couldn’t make much profit by knowing the actual distribution of the ordering. But we might also suspect that the pack will never be exactly random, in the same way that the distribution of the proportion of heads seen on a repeatedly-flipped coin will eventually get very close to 1/2, but will not be exactly 1/2.

This intuition is extremely sensible, and in general is true. It is a nice fact, however, that it fails for the top-to-random shuffle, where we do in fact get to a uniformly random deck. Recall that we approximated how long it would take to get to a state that was roughly random by calculating the time taken for the original bottom card to rise to the top of the deck. This time was:


where n is the total number of cards. A further shuffle is required to send the original card back into the pack somewhere. The claim is that the pack is now uniformly random. Note that if we ever actually use this, we have to be careful because although we can calculate the average time at which this event happens, the time itself is random. Rather than worry about that, let’s see why it is true.

As a motivating example, let’s assume that the pack was originally in the order 1,2,3,…,n, and consider the final relative order of the cards numbered 1 and 2. There is some small probability that on the first go (and then again on the second go possibly also) that the card number 1 stays put. Let’s ignore that possibility and progress to the first time that the 1 has been moved into the interior of the pack, so the 2 is now on top. When we do this, we are choosing a number between 2 and n uniformly at random, and moving the card numbered 1 to this position. Let’s call this number X.

Now, when we try to shuffle the 2, we are choosing a number between 1 and n uniformly at random, and moving the card numbered 2 to this position. Let’s call this number Y. Under exactly what circumstances does 2 end up above 1? The clearest example is if Y=X. Then card 2 has been moved to the position previously occupied by card 1. So card 1 moves up a position (since the card 2 is no longer on the top). The final configuration therefore includes card 1 directly above card 2. So we can say

\mathbb{P}(\text{2 above 1})=\mathbb{P}(Y<X),

where the fact that the inequality in the second probability is strict is important. To calculate this second probability, we want to exploit the symmetry of the situation. The only problems are that the case X=Y is not symmetric as then 1 ends up above 2 as described above, and also that X cannot be 1. So we account for these separately. Note

\mathbb{P}(Y=1)=\frac{1}{n},\quad \mathbb{P}(Y=X)=\frac{1}{n}.

The second result holds overall since it holds whenever we condition on a particular value of X. These events are also disjoint. Then

\mathbb{P}(Y<X)=\mathbb{P}(Y=1)+\mathbb{P}(Y<X | Y>1, X\neq Y)\mathbb{P}(Y>1,X\neq Y)

= \frac{1}{n}+\frac{1}{2}(1-\mathbb{P}(Y>1,Y\neq X))

=\frac{1}{n}+\frac{1}{2}(1-\frac{1}{n}-\frac{1}{n}) = \frac{1}{2}.

In summary, the cards 1 and 2 will be in uniformly random order. We might like to extend this idea, but it will get complicated when we add card 3 to the mix, as it is possible (if unlikely) that 1 and 2 will be further mixed before this. This shouldn’t affect the result much, but it will get complicated to define the notation required to carry this sort of argument all the way up to the nth card.

Even using induction is not going to make life substantially easier. Knowing that once we have inserted cards 1,2,…,k into the pack they are in uniformly random order is not enough to make inference about what happens once we put k+1 into the pack. We have to know something about the current positions of 1,2,…,k. For example, if one of these cards is definitely on the bottom of the pack, then the probability that k+1 ends up last among 1,2,…,k+1 is 1/n rather than 1/k+1 as it should be. So in fact we would have to control an annoying amount of information jointly.

In the argument we attempted above, we were looking at the first times some card k got folded back into the pack. Note that this division of time is different to the one we were using for the coupon collector approach to the mixing time in the previous post. Let’s try to use that instead here.

Now we consider the times at which a card is moved below card n. We deliberately decline to say what these cards are. But rather, we want to prove that, conditional on the cards below n being A_k=\{a_1,\ldots,a_k\}, the ordering of these is uniform on S_{A_k}, that is, every possibility is equally likely. Now this is easy to prove by induction. For, by conditioning on A_k and a_{k+1} being the new card to be moved below n, we are conditioning on the set of cards below n now being A_{k+1}=A_k\cup\{a_{k+1}\}. The position of the new card is uniformly random within this, by construction of the top-to-random shuffle, and so the new arrangement is uniformly random on the (k+1)! possibilities.

To see why we have proved the original result we wanted, note that this argument works at the time when the original bottom card is now at the top. So the remaining cards are uniformly randomly ordered. Inserting card n at random gives an arrangement that is uniformly random overall. So as we suggested before, working out how long it takes to get close to randomness in this case reduces to working out how long it is before the original bottom card hits the top and is re-inserted, as at that point, the pack genuinely is uniformly random.

The Contour Process

As I explained in my previous post, I haven’t been reading around as much as I would generally like to recently. A few days in London staying with my parents and catching up with some friends has therefore been a good chance to get back into the habit of leafing through papers and Pitman’s book among other things.

This morning’s post should be a relatively short one. I’m going to define the contour process, a function of a (random or deterministic) tree, related to the exploration process which I have mentioned a few times previously. I will then use this to prove a simple but cute result equating in distribution the sizes of two different branching processes via a direct bijection.

The Contour Process

To start with, we have to have a root, and from that root we label the tree with a depth-first labelling. An example of this is given below. It is helpful at this stage to conceive this process as an explorer walking on the tree, and turning back on themselves only when there is no option to visit a vertex they haven’t already seen. So in the example tree shown, the depth-first exploration visits vertex V_2 exactly four times. Note that with this description, it is clear that the exploration traverses every edge exactly twice, and so the length of the sequence is 2n-1, where n is the number of vertices in the tree since obviously, we start and end at the root.

Another common interpretation of this depth-first exploration is to take some planar realisation of the tree. (Note trees are always planar – proof via induction after removing a leaf.) Then if you treat the tree as a hedge and starting at the root walk along, following the outer boundary with your right hand, this exactly recreates the process.

The height of a tree at a particular vertex is simply the graph distance between that vertex and the root. So when we move from one vertex to an adjacent vertex, the height must increase or decrease by 1.

The contour process is the sequence of heights seen along the depth-first exploration. It is therefore a sequence:

0=h_0,h_1,\ldots,h_{2n-1}=0,\quad h_i\geq 0,

and such that |h_{i+1}-h_i|=1.

Note that though the contour process uniquely determines the tree structure, the choice of depth-first labelling is a priori non-canonical. For example, in the display above, V_3 might have been explored before V_2. Normally this is resolved by taking the suitable vertex with the smallest label in the original tree to be next. It makes little difference to any analysis to choose the ordering of descendents of some vertex in a depth-first labelling randomly. Note that this explains why it is rather hard to recover Cayley’s theorem about the number of rooted trees on n vertices from this characterisation. Although the number of suitable contour functions is possible to calculate, we would require a complicated multiplicative correction for labelling if we wanted to recover the number of trees.

The only real observation about the uses of the contour process at this stage is that it is not in general a random walk with IID increments for a Galton-Watson branching process. This equivalence is what made the exploration process so useful. In particular, it made it straightforward, at least heuristically, to see why large trees might have a limit interpretation through Brownian excursions. If for example, the offspring distribution is bounded above, say by M, then the contour process certainly cannot be a random walk, as if we have visited a particular vertex exactly M+1 times, then it cannot have another descendent, and so we must return closer to the root at the next step.

I want to mention that in fact Aldous showed his results on scaling limits towards the Continuum Random Tree through the contour process rather than the exploration process. However, I don’t want to say any more about that right now.

A Neat Equivalence

What I do want to talk about is the following distribution on the positive integers. This comes up in Balazs Rath and Balint Toth’s work on forest-fires on the complete graph that I have been reading about recently. The role of this distribution is a conjectured equilibrium distribution for component size in a version of the Erdos-Renyi process where components are deleted (or ‘struck by lightning’) at a rate tuned so that giant components ‘just’ never emerge.

This distribution has the possibly useful property that it is the distribution of the total population size in a Galton-Watson process with Geom(1/2) offspring distribution. It is also the distribution of the total number of leaves in a critical binary branching process, where every vertex has either two descendents or zero descendents, each with probability 1/2. Note that both of these tree processes are critical, as the expected number of offspring is 1 in each case. This is a good start, as it suggests that the relevant equilibrium distribution should also have the power-law tail that is found in these critical branching processes. This would confirm that the forest-fire model exhibits self-organised criticality.

Anyway, as a sanity check, I tried to find a reason why, ignoring the forest-fires for now, these two distributions should be the same. One can argue using generating functions, but there is also the following nice bijective argument.

We focus first on the critical Geometric branching process. We examine its contour function. As explained above, the contour process is not in general a random walk with IID increments. However, for this particular case, it is. The geometric distribution should be viewed as the family of discrete memoryless distributions.

This is useful for the contour process. Note that if we are at vertex V for the (m+1)th time, that is we have already explored m of the edges out of V, then the probability that there is at least one further edge is 1/2, independently of the history of the exploration, as the offspring distribution is Geometric(1/2), which we can easily think of as adding edges one at a time based on independent fair coin tosses until we see a tail for example. The contour process for this random tree is therefore a simple symmetric random walk on Z. Note that this will hit -1 at some point, and the associated contour process is the RW up to the final time it hits 0 before hitting -1. We can check that this obeys the clear rule that with probability 1/2 the tree is a single vertex.

Now we consider the other model, the Galton-Watson process with critical binary branching mechanism. We should consider the exploration process. Recall that the increments in this process are given by the offspring distribution minus one. So this random sequence also behaves as a simple symmetric random walk on Z, again stopped when we hit -1.

To complete the bijective argument, we have to relate leaves in the binary process to vertices in the geometric one. A vertex is a leaf if it has no offspring, so the number of leaves is the number of times before the hitting time of -1 that the exploration process decreases by 1. (*)

Similarly for the contour process. Note that there is bijection between the set of vertices that aren’t the root and the set of edges. The contour process explores every edge exactly twice, once giving an increase of 1 and once giving a decrease of 1. So there is a bijection between the times that the contour process decreases by 1 and the non-root vertices. But the contour process was defined only up to the time we return to the root. This is fine if we know in advance how large the tree is, but we don’t know which return to the root is the final return to the root. So if we extend the random walk to the first time it hits -1, the portion up until the last increment is the contour process, and the final increment must be a decrease by 1, hence there is a bijection between the number of vertices in the Geom(1/2) G-W tree and the number of times that the contour process decreases by 1 before the hitting time of -1. Comparing with (*) gives the result.

Mixing Times 6 – Aldous-Broder Algorithm and Cover Times

In several previous posts, I’ve discussed the Uniform Spanning Tree. The definition is straightforward: we choose uniformly at random from the set of trees which span a fixed underlying graph. But for a dense underlying graph, there are a very large number spanning trees. Cayley’s formula says that the complete graph K_n has n^{n-2} spanning trees, so to select from this list is impractical.

We seek a better algorithm. In a post about a year ago, I presented the result that the path between two fixed points x and y in the UST is distributed as the path generated by Loop-Erased Random Walk, for which we start at x and delete cycles as they appear. An initial problem might be that this only gives us a single path, which might be enough in some contexts, but in general we will want to specify the whole tree. Wilson’s Algorithm is an unsurprising but useful extension to this equivalence which does just that. You start by constructing the LERW between two vertices, then you add the LERW which connects some other vertex to the path you already have. Then you take a further vertex not currently explored and start LERW there, continuing until you hit the tree that you already have. Iterate this process, which must terminate after at most n steps when there are no vertices which to start from. The tree thus obtained is the UST. The tricky part is proving that the method for selecting which unused vertices to start from has no effect on the distribution of paths between two fixed points.

I want to consider a different algorithm, discovered roughly simultaneously by Aldous and Broder. Start a random walk on the underlying graph at some particular vertex. Every time we traverse an edge which takes us to a vertex we haven’t yet explored, add this edge to the tree. For now I don’t want to give a proof that this algorithm works, but rather to talk about how fast it works, because it ties in nicely with something from the Mixing Times book we’ve been reading recently. It is clear that the algorithm terminates at the first time the random walk has visited every vertex. This is a stopping time, called the cover time of the Markov chain. If we are working with an underlying complete, then we notice that this is annoying, because it means that the cover time will increase like n.log n. That is, it will take an increasingly long time to gather the final few vertices into the tree. Perhaps some combination of Aldous-Broder initially then Wilson’s method for the final o(n) vertices might be preferable?

I want to discuss how to treat this cover time. Often we have information about the hitting times of states from other states \mathbb{E}_x T_y. A relationship between S, the hitting time, defined to be the maximum of the previous display over x and y, and the expected cover time would be useful, especially for a highly symmetric graph like the complete graph where the expected hitting times are all the same.

Matthews’ Method relates these two for an irreducible finite Markov chain on n states. It says:

t_{cov}\leq t_{hit}\left(1+\frac12+\ldots+\frac 1 n\right).

We first remark that this agrees with what we should get for the random walk on the complete graph. There, the hitting time of x from y is a geometric random variable with success probability 1/n, hence expectation is n. The cover time is the standard coupon collector problem, giving expectation n log n, and the sum of reciprocals factor is asymptotically a good approximation.

The intuition is that if we continue until we hit state 1, then reset and continue until we hit state 2, and so on, by the time we hit state n after (n-1) iterations, this is a very poor overestimate of the cover time, because we are actually likely to have hit most states many times. What we want to do really is say that after we’ve hit state 1, we continue until we hit state 2, unless we’ve already done so, in which case we choose a different state to aim for, one which we haven’t already visited. But this becomes complicated because we then need to know the precise conditional probabilities of visiting any site on the way between two other states, which will depend rather strongly on the exact structure of the chain.

Peres et al give a coupling proof in Chapter 11 of their book which I think can be made a bit shorter, at least informally. The key step is that we still consider hitting the sites in order, only now in a random order.

That is, we choose a permutation \sigma\in S_n uniformly at random, and we let T_k be the first time that states \sigma(1),\ldots,\sigma(k) have all been visited. This is a random time that is measurable in the product space, and for each \sigma it is a stopping time.

The key observation is that \mathbb{P}(T_{k+1}=T_k)=1-\frac{1}{k+1}. This holds conditional on any path of the Markov chain because the requirement for the event is that \sigma(k+1) is visited after \{\sigma(1),\ldots,\sigma(k)\}. The statement therefore holds as stated as well as just pathwise. Then, by the SMP, conditional on \{T_{k+1}>T_k\}, we have

T_{k+1}-T_k \leq_{st} t_{hit}.

Note that by the definition of t_{hit}, this bound on the hitting time T_{k+1} is unaffected by concerns about where the chain actually is at T_k (since it is not necessarily at \sigma(k)).

So, removing the conditioning, we have:


and so the telescoping sum gives us Matthews’ result.

One example is the cover time of random walk on the n x n torus, which turns out to be

O(n^2(\log n)^2).

If anyone remembers that Microsoft screensaver from many years ago which started with a black screen and a snake leaving a trail of white pixels as it negotiated the screen, this will be familiar. The last few black bits take a frustratingly long while to disappear. Obviously that isn’t quite a random walk, but it perhaps diminishes the surprise that it should take this long to find the cover time.

There are a couple of interesting things I wanted to say about electrical networks for Markov chains and analytic methods for mixing times, but the moment may have passed, so this is probably the last post about Mixing Times. Plans are in motion for a similar reading group next term, possible on Random Matrices.

How to Prove Fermat’s Little Theorem

The following article was prompted by a question from one of my mentees on the Senior Mentoring Scheme. A pdf version is also available.

Background Ramble

When students first meet problems in number theory, it often seems rather different from other topics encountered at a similar time. For example, in Euclidean geometry, we immediately meet the criteria for triangle similarity or congruence, and various circle theorems. Similarly, in any introduction to inequalities, you will see AM-GM, Cauchy-Schwarz, and after a quick online search it becomes apparent that these are merely the tip of the iceberg for the bewildering array of useful results that a student could add to their toolkit.

Initially, number theory lacks such milestones. In this respect, it is rather like combinatorics. However, bar one or two hugely general ideas, a student gets better at olympiad combinatorics questions by trying lots of olympiad combinatorics questions.

I don’t think this is quite the case for the fledgling number theorist. For them, a key transition is to become comfortable with some ideas and notation, particularly modular arithmetic, which make it possible to express natural properties rather neatly. The fact that multiplication is well-defined modulo n is important, but not terribly surprising. The Chinese Remainder Theorem is a `theorem’ only in that it is useful and requires proof. When you ask a capable 15-year-old why an arithmetic progression with common difference 7 must contain multiples of 3, they will often say exactly the right thing. Many will even give an explanation for the regularity in occurrence of these which is precisely the content of the theorem. The key to improving number theory problem solving skills is to take these ideas, which are probably obvious, but sitting passively at the back of your mind, and actively think to deploy them in questions.

Fermat’s Little Theorem

It can therefore come as a bit of a shock to meet your first non-obvious (by which I mean, the first result which seems surprising, even after you’ve thought about it for a while) theorem, which will typically be Fermat’s Little Theorem. This states that:

\text{For a prime }p,\text{ and }a\text{ any integer:}\quad a^p\equiv a\mod p. (1)


  • Students are typically prompted to check this result for the small cases p=3, 5 and 7. Trying p=9 confirms that we do really need the condition that p be prime. This appears on the 2012 November Senior Mentoring problem sheet and is a very worthwhile exercise in recently acquired ideas, so I will say no more about it here.
  • Note that the statement of FLT is completely obvious when a is a multiple of p. The rest of the time, a is coprime to p, so we can divide by a to get the equivalent statement:

\text{For a prime }p,\text{ and }a\text{ any integer coprime to }p:\quad a^{p-1}\equiv 1\mod p. (2)

  • Sometimes it will be easier to prove (2) than (1). More importantly, (2) is sometimes easier to use in problems. For example, to show a^{p^2}\equiv a \mod p, it suffices to write as:

a^{p^2}\equiv a^{(p-1)(p+1)+1}\equiv (a^{p-1})^{p+1}\times a\equiv 1^{p+1}\times a \equiv a.

  • A word of warning. FLT is one of those theorems which it is tempting to use on every problem you meet, once you know the statement. Try to resist this temptation! Also check the statement with small numbers (eg p=3 ,a=2) the first few times you use it, as with any new theorem. You might be surprised how often solutions contain assertions along the lines of

a^p\equiv p \mod (a-1).


I must have used FLT dozens of times (or at least tried to use it – see the previous remark), before I really got to grips with a proof. I think I was daunted by the fact that the best method for, say, p=7, a careful systematic check, would clearly not work in the general case. FLT has several nice proofs, and is well worth thinking about for a while before reading what follows. However, I hope these hints provide a useful prompt towards discovering some of the more interesting arguments.

Induction on a to prove (1)

  • Suppose a^p\equiv a\mod p. Now consider (a+1)^p modulo p.
  • What happens to each of the (p+1) terms in the expansions?
  • If necessary, look at the expansion in the special case p=5 or 7, formulate a conjecture, then prove it for general p.

Congruence classes modulo p to prove (2)

  • Consider the set \{a,2a,3a,\ldots,(p-1)a\} modulo p.
  • What is this set? If the answer seems obvious, think about what you would have to check for a formal proof.
  • What could you do now to learn something about a^{p-1}?

Combinatorics to prove (1)

  • Suppose I want a necklace with p beads, and I have a colours for these beads. We count how many arrangements are possible.
  • Initially, I have the string in a line, so there are p labelled places for beads. How many arrangements?
  • Join the two ends. It is now a circle, so we don’t mind where the labelling starts: Red-Green-Blue is the same as Green-Blue-Red.
  • So, we’ve counted some arrangements more than once. How many have we counted exactly once?
  • How many have we counted exactly p times? Have we counted any arrangements some other number of times?

Group Theory to prove (2)

This is mainly for the interest of students who have seen some of the material for FP3, or some other introduction to groups.

  • Can we view multiplication modulo p as a group? Which elements might we have to ignore to ensure that we have inverses?
  • What is \{1,a,a^2,a^3,\ldots\} in this context? Which axiom is hardest to check?
  • How is the size of the set of powers of a modulo p related to the size of the whole group of congruences?
  • Which of the previous three proofs is this argument is most similar to this one?
  • Can you extend this to show the Fermat-Euler Theorem:

\text{For any integer }n,\text{ and }a\text{ coprime to }n:\quad a^{\phi(n)}\equiv 1 \mod n,

where \phi(n) counts how many integers between 1 and n are coprime to n.

Branching Processes and Dwass’s Theorem

This is something I had to think about when writing my Part III essay, and it turns out to be relevant to some of the literature I’ve been reading this week. The main result is hugely helpful for reducing a potentially complicated combinatorial object to a finite sum of i.i.d. random variables, which in general we do know quite a lot about. I was very pleased with the proof I came up with while writing the essay, even if in the end it turned out to have appeared elsewhere before. (Citation at end)

Galton-Watson processes

A Galton-Watson process is a stochastic process describing a simple model for evolution of a population. At each stage of the evolution, a new generation is created as every member of the current generation produces some number of `offspring’ with identical and independent (both across all generations and within generations) distributions. Such processes were introduced by Galton and Watson to examine the evolution of surnames through history.

More precisely, we specify an offspring distribution, a probability distribution supported on \mathbb{N}_0. Then define a sequence of random variables (Z_n,n\in\mathbb{N}) by:


where (Y_k^n,k\geq 1,n\geq 0) is a family of i.i.d. random variables with the offspring distribution Y. We say Z_n is the size of the nth generation. From now on, assume Z_0=1 and then we call (Z_n,n\geq 0) a Galton-Watson process. We also define the total population size to be


noting that this might be infinite. We refer to the situation where X<\infty finite as extinction, and can show that extinction occurs almost surely when \mathbb{E}Y\leq 1, excepting the trivial case Y=\delta_1. The strict inequality parts are as you would expect. We say the process is critical if \mathbb{E}Y=1, and this is less obvious to visualise, but works equally well in the proof, which is usually driven using generating functions.

Total Population Size and Dwass’s Theorem

Of particular interest is X, the total population size, and its distribution. The following result gives us a precise and useful result linking the probability of the population having size n and the distribution of the sum of n RVs with the relevant offspring distribution. Among the consequences are that we can conclude immediately, by CLT and Cramer’s Large Deviations Theorem, that the total population size distribution has power-law decay in the critical case, and exponential decay otherwise.

Theorem (Dwass (1)): For a general branching process with a single time-0 ancestor and offspring distribution $Y$ and total population size $X$:

\mathbb{P}(X=k)=\frac{1}{k}\mathbb{P}(Y^1+\ldots+ Y^k=k-1),\quad k\geq 1

where Y^1,\ldots,Y^k are independent copies of Y.

We now give a proof via a combinatorial argument. The approach is similar to that given in (2). Much of the literature gives a proof using generating functions.

Proof: For motivation, consider the following. It is natural to consider a branching process as a tree, with the time-0 ancestor as the root. Suppose the event \{X=k\} in holds, which means that the tree has k vertices. Now consider the numbers of offspring of each vertex in the tree. Since every vertex except the root has exactly one parent, and there are no vertices outside the tree, we must have Y^1+\ldots+Y^k=k-1 where Y^1,\ldots,Y^k are the offspring numbers in some order. However, observe that this is not sufficient. For example, if Y^1 is the number of offspring of the root, and k\geq 2, then we must have Y^1\geq 1. Continue reading

The Sample Space for a Die

Also featuring: a non-Lebesgue measurable set and Dynkin’s Lemma.

At the National Maths Summer School last week, the senior students and I spent a while talking about probability space, and in particular, when it was reasonable to assign a probability to a potential event. We considered rolling a standard die, and the probabilities \mathbb{P}(D\in\{\}), the empty event, and \mathbb{P}(D=7). Though it is tempting to conclude that the latter must be zero, in the end we decided that it should not actually be defined at all.

Why? Well, if we accept \mathbb{P}(D=7)=0, then by extension we must accept \mathbb{P}(D=137) and \mathbb{P}(D=\$1.50) also both exist and are zero. What have we gained? In reality very little. But the cost is this: we might define an event to be any subset of the sample space. Before, our sample space was \Omega=\{1,2,3,4,5,6\}, and so there are exactly 64 events, including the possibly counter-intuitive empty event \{\}. This is finite, which is always nice. With the extra events, however, we must extend the sample space to \Omega=\{1,2,3,4,5,6,7,\ldots, where “…” means “the rest of the universe”. This is a fairly exotic mathematical object, and really has no place in any sensible discussion.

This reminded me of one of my favourite results from Part II Probability and Measure. Of course, for uncountable sample spaces, we cannot necessarily assume all subsets of \Omega are measurable. Instead we build up a sigma-algebra of measurable sets, most importantly for Lebesgue measure on \mathbb{R}. An immediate question to ask is: are all subsets of \mathbb{R} Lebesgue-measurable?

And the answer is ‘no’. Why? The standard counterexample is as follows. Consider Lebesgue measure on the unit interval U=\mathbb{R}/\mathbb{Z}, with endpoints identified. Now consider the rationals in U, which are actually a subgroup \mathbb{Q}\leq U, with uncountably many cosets. Pick an element from each coset (*). Call this sets A. Then, working modulo 1, U=\cup_{q\in\mathbb{Q}\cap U}A+q. If A is Lebesgue-measurable, then so is A+q, and \mu(A+q)=\mu(A) (**). Combining these two results, using countable additivity:

\mu(U)= (0 if \mu(A)=0, \infty otherwise).

This is a contradiction, and hence we conclude that A is not Lebesgue-measurable.

Remark on (*): This relies on the Axiom of Choice. In fact, the existence of non-Lebesgue measurable sets MAY be equivalent to AC.

Remark on (**): I was suddenly unsure that this was obvious. I mean, this is such a weird set that it is in fact not measurable: why should its hypothetical measure be translation invariant? It is tempting to argue vaguely, by saying that the construction of Lebesgue measure is invariant under translation at all steps. As so often with elementary measure theory, recourse to Dynkin’s Lemma is more reliable.

Let D be the collection of measurable sets whose measure is invariant under translation. By definition, D is invariant under translation (of its elements). D certainly contains all intervals in U, which is a pi-system generating \mathcal{B}([0,1]). But, in a classic proof by suggestive notation, we can check that D is a d-system. The presence of the empty sets is clear. If we have B\subset A, both in D, then also A\backslash B\in D, as the translates of x\in A\backslash B must be in A, but cannot be in B, as B is translation invariant. Finally, given A_1\subset A_2\subset\ldots\subset D, then x\in\cup A_i\Rightarrow x\in A_n for some n, so all of x’s translates are in A_n, and hence in \cup A_i.

Now we can deploy Dynkin’s Lemma. D must be the sigma-algebra of all measurable sets, as we wanted.

NMSS 2012 – Strong Law of Large Numbers for a Coin Flip

The 2012 National Mathematics Summer School, held at Queens’ College affiliated to the University of Birmingham, and run by the United Kingdom Mathematics Trust, is drawing to a close today. I gave a problem-based talk on Probability to two groups of 20 junior students (15/16 year olds selected based on strong performance in national competitions for their agegroups), and a lecture to the six senior students (some of 2011’s strongest and most enthusiastic junior students) on the SLLN for the simplest non-trivial random variable imaginable: a coin flip.

In case any of the students, or indeed anyone else, is interested, a text of the problems, and the worked solutions that took up the majority of the lecture will be available here for a short while. Do email me if there are any questions!

Senior Probability Solutions. [Link removed. Email me if interested]

Remarkable fact about Brownian Motion #4: The Dirichlet Problem

So this property of Brownian Motion is so elegant, in my opinion, that when I was recently asked what my ‘favourite theorem’ was, I suggested this. With this result, we can use this probabilistic structure to specify solutions to an important PDE, with boundary conditions, over a large class of domains.

Given a domain D, Laplace’s equation is: \Delta u=0 on D, and u=f on the boundary dD, where f is any continuous function defined there. This PDE arises wherever the notion of potentials is defined, for example electromagnetism, fluids and thermodynamics.

Theorem: Given suitable regularity conditions on D to be discussed later, Laplace’s equation has a unique solution, given by:


Notation: First, what does this mean? Define T_D:=\inf\{t:B_t\not\in D\}, to be the time at which a Brownian Motion leaves the domain D. This is a stopping time, and so will be suitable for application of the Strong Markov Property. \mathbb{E}_x means that we are taking expectation with respect to a BM started at x. So informally, we are defining u(x) as: start a BM at x; see where it hits the boundary of D; record the value of f at that point. Then set u(x) to be the expected value of this process.

Existence: First, we are going to check that the solution conjectured is a solution. We will need a lemma:

Lemma: A locally-bounded function u satisfies \Delta u=0 on a domain D if and only if it has the property that for every closed ball \bar{B(x,r)}\subset D we have:


where \sigma_{x,r} is the surface area measure on the boundary S(x,r) of the ball radius r centred on x. Essentially, this says that u(x) is equal to the average value of u on a ball around x.

Proof of Theorem: First, existence. Set u as specified in the statement of the theorem. Given a Brownian Motion started at x, we have stopping times T_r<T_D corresponding to the hitting times of the ball radius r around x and the boundary dD. The domination condition holds by continuity provided B(x,r) is contained within D. So we may apply the Strong Markov Property:


By definition, the left hand expression is u(x). But also, because the distribution of B_{T_r} is uniform on S(x,r), the right hand side is equal to:


and so by the lemma, this guarantees that the function u is harmonic on the interior of D.

The lemma can also be used to show uniqueness. Continue reading

Uncountable Infinity and Cantor’s Diagonal Argument

In my last post, I talked about why infinity shouldn’t seem terrifying, and some of the interesting aspects you can consider without recourse to philosophy or excessive technicalities. Today, I’m going to explore the fact that there are different kinds of infinity. For this, we’ll use what is in my opinion one of the coolest proofs of all time, originally due to Cantor in the 19th century.


Last time we talked about sets which have infinite size. In particular, a set S is called countable, if its elements can be paired up with the set of positive integers. This is equivalent to saying that you can find a way to list the elements of S, in such a way that every element comes up eventually. It might initially seem as if every set should be countable, but consider the reals between 0 and 1. If we were to put them in a list, what should the first element in the list be; and what should the second be; and so on? Is it obvious that we can make this work? But if we can’t make it work, how can possibly prove that it is impossible, because there are so many potential ways to list the elements?

The Rationals

As a warm up, consider the rationals: \mathbb{Q}=\{\frac{p}{q}:p,q\text{ integers with no common factor}\}. Is this set countable? For simplicity, let’s just think about the rationals between 0 and 1. What is the most sensible way to list them? What about starting with 0 and 1, then 1/2, then the ones with 3 as denominator, ie 1/3 and 2/3. Can we carry on with strategy? First we need to check exactly what we are doing: we are ordering the rationals by increasing size of the denominator, then within that, by increasing size of the numerator. What about 2/4? That isn’t a rational as we’ve described them, so we can just ignore it.

With this algorithm is that every such rational will turn up eventually. We’ve therefore shown that this set of rationals is countable. It’s hard to give a precise ‘formula’ for how the list works. To work out where 2/11 appears, you need to know about the prime factors of the integers less than 11, and keep track of lots of things. It turns out that 2/11 appears 35th in the list, but the easiest way to deduce this is to check up to that point in the list. The important fact is that it must appear at some point, and only once.

The Reals

Theorem (Cantor): The set of real numbers between 0 and 1 is not countable.

Proof: This will be a proof by contradiction. That means, we will assume that the set is countable, then derive a false statement. From this, we can deduce that the set cannot be countable.

Suppose it is countable. This means we can write all the reals in a list. Write them in binary, so it probably looks something like this:


and so on. Continue reading