# Balkan MO 2017 – Qs 1, 3 and 4

The UK is normally invited to participate as a guest team at the Balkan Mathematical Olympiad, an annual competition between eleven countries from South-Eastern Europe. I got to take part in Rhodes almost exactly ten years ago, and this year the competition was held in Ohrid, in Macedonia. There’s one paper, comprising four questions, normally one from each of the agreed olympiad topic areas, with 4.5 hours for students to address them. The contest was sat this morning, and I’m going to say quite a bit about the geometric Q2, and a little bit about Qs 1 and 3 also. In all cases, this discussion will include most of a solution, with some commentary, so don’t read these if you are planning to try the problems yourself.

I’m not saying anything about Q4, because I haven’t solved it. (Edit: I have solved it now, so will postpone Q2 until later today.)

Question One

Find all ordered pairs of positive integers (x,y) such that

$x^3+y^3=x^2+42xy+y^2.$

The first thought is that if either of x or y is ‘large’, then the LHS is bigger than the RHS, and so equality can’t hold. That is, there are only finitely many solutions. The smallest possible value of y is, naturally, 1, and substituting y=1 is convenient as then $y^2=y^3$, and it’s straightforward to derive $x=7$ as a solution.

Regarding the non-existence of large solutions, you can make this precise by factorising the LHS as

$(x+y)(x^2-xy+y^2) = x^2+42xy+y^2.$

There are 44 terms of degree two on the RHS, and one term of degree in the second bracket on the LHS. With a bit of AM-GM, you can see then that if $x+y>44$, you get a contradiction, as the LHS will be greater than the RHS. But that’s still a lot of possibilities to check.

It struck me that I could find ways to reduce the burden by reducing modulo various primes. 2, 3 and 7 all divide 42, and furthermore cubes are nice modulo 7 and squares are nice modulo 3, so maybe that would bring the number of possibilities down. But my instinct was that this wasn’t the right way to use the fact that we were solving over positive integers.

The second bracket in the factorisation looks enough like the RHS, that it’s worth exploring. If we move $x^2-xy+y^2$ from the right to the left, we get

$(x+y-1)(x^2-xy+y^2) = 43xy.$ (1.1)

Now it suddenly does look useful that we are solving over positive integers, because 43 is a prime, so has to appear as a factor somewhere on the LHS. But it’s generally quite restrictive that $x^2-xy+y^2 | 43xy$. This definitely looks like something that won’t hold often. If x and y are coprime, then certainly $x^2-xy+y^2$ and $y$ are coprime also. But actually if x and y have a non-trivial common factor d, we can divide both sides by $d^2$, and it still holds. Let’s write

$x=dm,\quad y=dn,\quad\text{where }d=\mathrm{gcd}(x,y).$

Then $m^2 -mn+n^2$ really does divide 43, since it is coprime to both m and n. This is now very restrictive indeed, since it requires that $m^2-mn+n^2$ be equal to 1 or 43. A square-sandwiching argument gives $m^2-mn+n^2=1$ iff $m=n=1$. 43 requires a little bit more work, with (at least as I did it) a few cases to check by hand, but again only has one solution, namely $m=7, n=1$ and vice versa.

We now need to add the common divisor d back into the mix. In the first case, (1.1) reduces to $(2d-1)=43$, which gives $(x,y)=(22,22)$. In the second case, after cancelling a couple of factors, (1.1) reduces to $(8d-1)=7$, from which $(x,y)=(7,1),(1,7)$ emerges, and these must be all the solutions.

The moral here seemed to be that divisibility was a stronger tool than case-reduction. But that was just this question. There are other examples where case-reduction is probably more useful than chasing divisibility.

Question Three

Find all functions $f:\mathbb{N}\rightarrow\mathbb{N}$ such that

$n+f(m) \,\big|\, f(n)+nf(m)$

for all $m,n\in\mathbb{N}$.

What would be useful here? There are two variables, and a function. It would be useful if we could reduce the number of variables, or the number of occurences of f. We can reduce the number of variables by taking m=n, to get

$n+f(n) \,\big|\, f(n) [1+n].$ (3.1)

From this, we might observe that $f(n)\equiv 1$ is a solution. Of course we could analyse this much more, but this doesn’t look like a 10/10 insight, so I tried other things first.

In general, the statement that $a\,|\,b$ also tells us that $a\,|\, b-ka$. That is, we can subtract arbitrary multiples of the divisor, and the result is still true. A recurring trope is that the original b is elegant, but an adjusted b-ka is useful. I don’t think we can do the latter, but by subtracting $n^2 +nf(m)$ from the problem statement, we get

$n+f(m) \,\big|\, n^2-f(n).$ (3.2)

There’s now no m on the RHS, but this relation has to hold for all m. One option is that $f(n)=n^2$ everywhere, then what we’ve deduced always holds since the RHS is zero. But if there’s a value of n for which $f(n)\ne n^2$, then (3.2) is a very useful statement. From now on, we assume this. Because then as we fix n and vary m, we need $n+f(m)$ to remain a divisor of the RHS, which is fixed, and so has finitely many divisors. So $f(m)$ takes only finitely many values, and in particular is bounded.

This ties to the observation that $f\equiv 1$ is a solution, which we made around (3.1), so let’s revisit that: (Note, there might be more elegant ways to finish from here, but this is what I did. Also note, n is no longer fixed as in previous paragraph.)

$n+f(n) \,\big|\, f(n) [1+n].$ (3.1)

Just to avoid confusion between the function itself, and one of the finite collection of values it might take, let’s say b is a value taken by f. So there are values of n for which

$n+b \,\big|\, b(1+n).$

By thinking about linear equations, you might be able to convince yourself that there are only finitely many solutions (in n) to this relation. There are certainly only finitely many solutions where LHS=RHS (well, at most one solution), and only finitely many where 2xLHS=RHS etc etc. But why do something complicated, when we can actually repeat the trick from the beginning, and subtract $b(n+b)$, to obtain

$n+b \,\big|\, b^2-b.$

For similar reasons to before, this is a great deduction, because it means if $b\ne 1$, then the RHS is positive, which means only finitely many n can satisfy this relation. Remember we’re trying to show that no n can satisfy this relation if $b\ne 1$, so this is definitely massive progress!

If any of what’s already happened looked like magic, I hope we can buy into the idea that subtracting multiples of the divisor from the RHS is the only tool we used, and that making the RHS fixed gives a lot of information about the LHS as the free variable varies. The final step is not magic either. We know that f is eventually 1. If you prefer “for large enough n, $f(n)=1$,” since all other values appear only finitely often. I could write this with quantifiers, but I don’t want to, because that makes it seem more complicated than it is. We genuinely don’t care when the last non-1 value appears.

Anyway, since we’ve deduced this, we absolutely have to substitute this into something we already have. Why not the original problem statement? Fix m, then for all large enough n

$n+f(m) \,\big|\, 1+nf(m).$ (3.3)

To emphasise, (3.3) has to hold for all large enough n. Is it possible that f(m)=2? Again, it’s easy to convince yourself not. But, yet again, why not use the approach we’ve used so profitably before to clear the RHS? In fact, we already did this, and called it (3.2), and we can make that work [3.4], but in this setting, because f(m) is fixed and we’re working with variable large n, it’s better to eliminate n, to get

$n+f(m)\,\big|\, f(m)^2-1,$

again for all large enough n. By the same size argument as before, this is totally impossible unless f(m)=1. Which means that in fact $f(m)=1$ for all m. Remember ages ago we assumed that f(n) was not $n^2$ everywhere, so this gives our two solutions: $f(n)=1,\, f(n)=n^2$.

Moral: choosing carefully which expression to work with can make life much more interesting later. Eliminating as many variables or difficult things from one side is a good choice. Playing with small values can help you understand the problem, but here you need to think about soft properties of the expression, in particular what happens when you take one variable large while holding another fixed.

[3.4] – if you do use the original approach, you get $n^2-1$ on the RHS. There’s then the temptation to kill the divisibility by taking n to be the integer in the middle of a large twin prime pair. Unfortunately, the existence of such an n is still just a conjecture

Question Four

(Statement copied from Art of Problem Solving. I’m unsure whether this is the exact wording given to the students in the contest.)

On a circular table sit n>2 students. First, each student has just one candy. At each step, each student chooses one of the following actions:

(A) Gives a candy to the student sitting on his left or to the student sitting on his right.

(B) Separates all its candies in two, possibly empty, sets and gives one set to the student sitting on his left and the other to the student sitting on his right.

At each step, students perform the actions they have chosen at the same time. A distribution of candy is called legitimate if it can occur after a finite number of steps.
Find the number of legitimate distributions.

My moral for this question is this: I’m glad I thought about this on the bus first. What I found hardest here was getting the right answer. My initial thoughts:

• Do I know how to calculate the total number of possibilities, irrespective of the algorithm? Fortunately yes I do. Marbles-in-urns = barriers between marbles on a line (maybe add one extra marble per urn first). [4.1]
• What happens if you just use technique a)? Well first you can get into trouble because what happens if you have zero sweets? But fine, let’s temporarily say you can have a negative number of sweets. If n is even, then there’s a clear parity situation developing, as if you colour the children red and blue alternately, at every stage you have n/2 sweets moving from red children to blue and vice versa, so actually the total number of sweets among the red children is constant through the process.
• What happens if you just use technique b)? This felt much more promising.
• Can you get all the sweets to one child? I considered looking at the child directly opposite (or almost-directly opposite) and ‘sweeping’ all the sweets away from them. It felt like this would work, except if for some parity reason we couldn’t prevent the final child having one (or more, but probably exactly one) sweets at the crucial moment when all the other sweets got passed to him.

Then I got home, and with some paper, I felt I could do all possibilities with n=5, and all but a few when n=6. My conjecture was that all are possible with n odd, and all are possible with n even, except those when none of the red kids or none of the kids get a sweet. I tried n=8, and there were a few more that I couldn’t construct, but this felt like my failure to be a computer rather than a big problem. Again there’s a trade-off between confirming your answer, and trying to prove it.

Claim: If n is even, you can’t achieve the configurations where either the red children or the blue children have no sweets.

Proof: Suppose you can. That means there’s a first time that all the sweets were on one colour. Call this time T. Without loss of generality, all the sweets are on red at T. Where could the sweets have been at time T-1? I claim they must all have been on blue, which contradicts minimality. Why? Because if at least one red child had at least one sweet, they must have passed at least one sweet to a blue neighbour.

Now it remains to give a construction for all other cases. In the end, my proof has two stages:

Step One: Given a configuration, in two steps, you can move a candy two places to the right, leaving everything else unchanged.

This is enough to settle the n odd case. For the even case, we need an extra step, which really corresponds to an initial phase of the construction.

Step Two: We can make some version of the ‘sweeping’ move precise, to end up in some configuration where the red number of children have any number of sweets except 0 or n.

Step one is not so hard. Realising that step one would be a useful tool to have was probably the one moment where I shifted from feeling like I hadn’t got into the problem to feeling that I’d mostly finished it. As ever in constructions, working out how to do a small local adjustment, which you plan to do lots of times to get a global effect, is great. (Think of how you solve a Rubik’s cube for example.)

Step two is notationally fiddly, and I would think very carefully before writing it up. In the end I didn’t use the sweeping move. Instead, with the observation that you can take an adjacent pair and continually swap their sweets it’s possible to set up an induction.

Actual morals: Observing the possibility to make a small change in a couple of moves (Step one above) was crucial. My original moral does still hold slightly. Writing lots of things down didn’t make life easier, and in the end the ideas on the bus were pretty much everything I needed.

[4.1] – one session to a group of 15 year olds is enough to teach you that the canon is always ‘marbles in urns’ never ‘balls’ nor ‘bags’, let alone both.

Advertisements

# Branching Random Walk and Amenability

This post is about some of the things I learned in an interesting given by Elisabetta Candellero in Oxford last week, based on joint work with Matt Roberts. The paper on which this is based can be found here. The main thing I want to talk about are some properties of graphs which were mentioned near the beginning which I hadn’t heard about before.

Branching Random Walk (hereafter BRW) is a model to which much attention has been paid, because of its natural applications in a range of physical and genetic settings. As with many of the best models, the definition is pretty much in the title. We take the ingredients for a random walk on a graph, which is a graph, and a transition matrix P on that graph. For most of the time we will consider simple random walk, so the graph G exactly specifies P. This requires the additional condition that the graph G is locally finite. We will introduce a branching mechanism, so at discrete times {0,1,2,…} we will track both the number of particles, and their current locations. We start at time 0 with a single particle at some vertex. Then at each time-step, all the vertices present die, and each gives birth independently to some number of offspring according to a fixed probability distribution $\mu$. These offspring then perform one move according to transition matrix P. Note that if you want the system to carry the appearance of having no death, then taking the support of the offspring distribution to be {1,2,3,…} achieves precisely this. The properties we consider will not be very interesting unless G is infinite, so assume that from now on.

There are almost limitless ways we could think of to generalise these dynamics. The offspring distribution could be allowed to depend on the vertex the particle is occupying. The joint transition probabilities of the offspring at a vertex could be biased in favour or against the offspring moving to the same site next. The environment could be chosen in advance before the process starts, but random.

The classical question about BRW is that of recurrence and transience. The definition extends naturally from that of a Markov chain (which any non-branching random walk on a graph is). As in that setting, we say a BRW is recurrent if every vertex is almost surely visited infinitely often by particles of the graph.

Heuristically, we should observe that in some sense, it is quite difficult for simple random walk on an infinite graph to be recurrent. We have examples in $\mathbb{Z},\mathbb{Z}^2$, but these are about as ‘small’ as an infinite graph can be. An idea might be that if the number of sites some distance away from where we start grows rapidly as the distance grows, then there isn’t enough ‘pull’ back to visit the sites near where we start infinitely often. Extending this argument, it is easier for a BRW to be recurrent, as we have the option to make the branching rate large, which means that there are lots of particles at large times, hence more possibility for visiting everywhere. Note that if the offspring distribution is subcritical, we don’t stand a chance of having interesting properties. If we ignore the random walk part, we just have a subcritical Galton-Watson process, which dies out almost surely.

We need a measure of the concept discussed in the heuristic for how fast the number of vertices in the graph grows as we consider bands of vertices further and further away from the starting vertex. The standard measure for this is the spectral radius, which is defined not in terms of number of vertices, but through the limiting probability of returning to a fixed vertex at large time n. Precisely

$\rho:= \limsup \mathbb{P}_i(X_n=i)^{1/n},$

so in some approximation sense

$\mathbb{P}_i(X_n=i)\sim \rho^{n},$

which explains why $\rho\le 1$. Note that by considering the sum of such terms, if simple random walk on G is recurrent, then $\rho=1$, but the converse does not hold. (Consider SRW on $\mathbb{Z}^3$ for example.)

It’s also worth remarking that $\rho$ is a class property. In particular, for a connected graph, the value of $\rho$ is independent of i. This is not surprising, as if d is the graph distance between vertices i and j, then

$p_{ii}^{(n)}\ge p_{ij}^{(d)}p_{jj}^{(n-2d)}p_{ji}^{(d)},$

and vice versa, which enables us to sandwich usefully for the limits.

Really, $\rho$ is a function of the transition matrix P. In fact, we can be more specific, by considering diagonalising P. The only case we care about is when P is infinite, so this is not especially nice, but it makes it clear why $p_{ii}^{(n)}$ decays like $|\rho|^n$ where $\rho$ is the largest eigenvalue of P. Indeed this is an alternative definition of the spectral radius. Note that Perron-Frobenius theory (which seems to keep coming up on the blog this week…) says that since $|\rho|\le 1$, then if $|\rho|=1$, we must have $\rho=1$. So the spectral radius being 1 is precisely equivalent to having an invariant measure. We don’t know whether we can normalise it, but P-F guarantees the relevant left-eigenvector is non-negative, and hence a measure.

Next we give this situation a name. Say that a random walk is amenable if $\rho(P)=1$. We can extend this property to say that a graph is amenable if SRW on it is amenable.

This is not the standard definition of amenability. This property is originally defined (by von Neumann) in the context of groups. A group G is said to be amenable if there exists a left-invariant probability measure on G, ie $\mu$ such that

$\forall A\subset G, \forall g\in G, \mu(gA)=A.$

The uniform distribution shows that any finite group is amenable.

It turns out that in general there are several conditions for a group which are equivalent to amenability. One is that, given G finitely generated by B, the Cayley graph for G with edges given by elements of B does not satisfy a strong isoperimetric inequality. Such an inequality is an alternative way of saying that the graph grows rapidly. It says that the size of the boundary of a subset of the vertices is uniformly large relative to the size of the set. Precisely, there exists a constant c>0 such that whenever U is a finite subset of the vertices, we have $|\partial U|\ge c|U|$. (Note that finiteness of U is important – we would not expect results like this to hold for very large subsets.)

Kesten proved that it is further equivalent to the statement that simple random walk on $Cay(G,B)$ is amenable in our original sense. This technical and important result links the two definitions.

We finish by declaring the main classical result in BRW, which is a precise condition for transience. As motivated earlier, the rate of branching and the spectral radius have opposing effects on whether the system is recurrent or transient. Note that at some large time, the expected number of particles which have returned to the starting vertex is given by the expected number of particles in the system multiplied by the probability that any one of them is back at its origin, ie $\sim \mu^n\rho^n$. So the probability that there is a particle back at the origin at this time is (crudely transferring from expectation to probability) $1\wedge (\mu \rho)^n$. We can conclude that the chain is recurrent if $\mu > \rho^{-1}$ and transient if $\mu<\rho^{-1}$. This result is due to Benjamini and Peres.

The remaining case, when $\mu=\rho^{-1}$ is called, unsurprisingly, critical BRW. It was proved in ’06 by Gantert and Muller that, in fact, all critical BRWs are transient too. This must exclude the amenable case, as we could think of SRW on $\mathbb{Z}$ as a critical BRW by taking the branching distribution to be identically one, as the spectral radius is also 1.

In the end, the material in this post is rather preliminary to the work presented in EC’s talk, which concerned the trace of BRW, and whether there are infinitely many essentially different paths to infinity taken by the particles of the BRW. They show that this holds in a broad class of graphs with symmetric properties.

# The Rearrangement Inequality

A favourite result of many students doing olympiad inequality problems is the so-called Rearrangement Inequality. This is a mathematical formulation of the idea well-known to even the smallest of child that if you prefer cakes to carrots then if you are offered two of one and one of the other, you should take two of the one you prefer!

At a more formal level, it says that given two strings of non-negative numbers

$a_1\le a_2,\le \ldots\le a_n, \quad b_1\le b_2\le \ldots\le b_n,$

if you want to form a sum of products of pairs, like

$a_1b_4+a_2b_1+a_3b_3+\ldots,$

you get the largest result if you take

$a_1b_1+a_2b_2+\ldots+a_nb_n.$

Formally, for any permutation $\sigma \in S_n$,

$a_1b_1+\ldots+a_nb_n\ge a_1b_{\sigma(1)}+\ldots+a_nb_{\sigma(n)}\ge a_1b_n+\ldots+a_nb_1.$

That is, you multiply the largest terms in each sequence together.

The notation to describe to equality case is a bit annoying. Essentially, the sums are equal if and only if the summands exactly correspond. If the sequences are strictly increasing, then equality holds only if the permutation $\sigma=\text{id}$.

This result is nice because, although it is rarely explicitly useful, it goes in a different direction from the standard scheme of results strengthening AM-GM, Cauchy-Schwarz and so on, and is in some sense more intuitive than these more well-known inequalities, at least in the form presented in an olympiad context.

I was thinking about this partly because it’s a nice result in its own right, but also because it came up in a research problem to do with comparing the expected likelihood of different tree isomorphism classes arising in an inhomogeneous, but relatively well-behaved, random graph model. The probability of forming a given tree is a homogeneous multivariate polynomial in the ages of the vertices that would form the tree. It is then necessary to integrate over the joint distribution (which fortunately is a product in the limit) of the ages of the vertices. I was playing around with this by considering what seemed to be the extreme cases: the star and the path. I was working with the relatively simple case n=4, and it struck me that perhaps the polynomial for the star was always at least as large as that for the path. This would be convenient as it would avoid the need for a horrific-looking integral calculation. This turned out to be true. My first method was a heavy but uncontroversial convexity and stationary point argument, but I found a pair of vectors embedded in the desired inequality on which I could deploy rearrangement.

Anyway, I thought I should be able to come up with a nice proof, and I think this is one. I think this is particularly nice because it is a demonstration that one can do a proof by induction without explicitly inducting on the natural numbers.

We begin with a base case, which is the theorem for n=2, even though we will not be doing induction in the canonical way. We are required to prove that given

$a_1\le a_2,\quad b_1\le b_2,$

that

$a_1b_1+a_2b_2\ge a_1b_2+a_2b_1,$

since these are the only available permutations. Moving some terms around gives

$(a_2-a_1)(b_2-b_1)\ge 0,$

which is true by construction, and so the n=2 result follows.

We now move straight to the general n case. We focus on the left of the two inequalities in the statement of the result, since the other will follow by an identical method, applied in reverse. We consider the case where $\sigma$ is a transposition. For example, we might consider 12435. When we write out the result we want:

$a_1b_1+a_2b_2+a_3b_3+a_4b_4+a_5b_5\ge a_1b_1+a_2b_2+a_3b_4+a_4b_3+a_5b_5,$

we realise that many of the terms cancel, and the content of the theorem reduces to the n=2 case we have already dealt with. Obviously, this holds equally well whenever $\sigma$ is a transposition. Similarly, if $\sigma$ is a product of two disjoint transpositions, which means that two disjoint pairs of elements are interchanged, we can apply the n=2 case twice, then add on the extra terms to get the result.

In fact, we can do much better than this, by using the fact that any permutation can be expressed as a product of transpositions. We need to be careful about the risk of asserting that every time we multiply the permutation $\sigma$ by a transposition, the value of the associated sum-product expression gets smaller. While the idea is correct, this cannot be generally true. After all, applying the same transposition twice returns us to the identity permutation!

We can nonetheless say something useful. If we start with a permutation

$\sigma(1),\sigma(2),\ldots,\sigma(n),$

and we interchange the ith and jth elements, to get,

$\tau=\sigma(1),\ldots,\sigma(i-1),\sigma(j),\sigma(i+1),\ldots,\sigma(j-1),\sigma(i),\sigma(j+1),\ldots,\sigma(n),$

then the product sum corresponding to $\tau$ is less than or equal to the product sum corresponding to $\sigma$ if $\sigma(i)\leq sigma(j)$, under the implicit assumption that i<j. In other words, we can prove the rearrangement inequality for any permutation $\sigma$ that can be obtained from the identity by repeatedly interchanging elements that are initially in increasing order. Essentially, we have defined a partial ordering on the set of permutations.

It suffices to check that all permutations have this property. In fact, this is relatively easy. We can move element n to its required position in $\sigma$ by successively swapping with (n-1), (n-2), etc. If we set this up as an inductive argument, we can finish by applying the hypothesis to the remaining (n-1) elements, which are in the same order as the identity permutation on [n-1].

So we have proved the left-hand side of the Rearrangement Inequality. In fact, this partial ordering framework makes it clear how to prove the right-hand side. By an identical argument, we can get from any permutation to the reverse identity by a similar set of operations.

# The Top-to-Random Shuffle III

This post concludes my non-exhaustive list of things I think are interesting about the top-to-random shuffle. In previous posts I have talked about the construction and correct sense of convergence to randomness, and that this algorithm does genuinely achieve uniform randomness at some hitting time which is easy to specify. Promising that posts will be short hasn’t worked in the past so I won’t do that again now, but the idea of this post is brief:

When we specified the dynamics of the top-to-random shuffle, we insisted that the top card card could be placed anywhere in the deck with equal probability including back on top. This appears to be doing nothing except slowing down the shuffling process. Why is this important for convergence to randomness?

Fortunately the answer is short: if we do not let the top card be inserted back onto the top, allowing the configuration to stay the same, then we can divide up the set of orderings into two classes, and the pack will alternate between them.

Why is this a problem? Suppose the classes are called X and Y, and X is the class that contains the original ordering 1,2,…,n. Then after k shuffles, the ordering of the deck will be in X if k is even and in Y if k is odd. Remember our definition of ‘close to randomness’ will be the greatest difference in probability of an event between the actual distribution and the uniform distribution. As before, you can think of this by a betting analogy – what proportion profit can you make again someone who thinks it’s uniform by knowing the true distribution?

Well, it will turn out that the sets X and Y have the same size, so in the uniform distribution, the probability that an ordering is in X is 1/2. Whereas if the pack alternates, then so long as we know how many shuffles have occurred, this probability is either 0 or 1. In particular this is far from 1/2. We should remark that if we introduce the notion of sampling at a random time, or taking an average over all large times in some sense, such problems may disappear, but the result obtained may be less useful. See this post on Cesaro Mixing for details presented in a more rigorous style.

So it remains to see why this is true. First a definition. A transposition is when two elements in a permutation are exchanged. Eg 31452 -> 35412 by transposing 1 and 5. It makes sense intuitively that we can get from any permutation to any other permutation by making successive transpositions. Indeed, this is precisely what is happening in the top-to-random shuffle. To avoid continually having to write it out, we call the original permutation 1,2,…,n the identity permutation.

Then the idea is that X is the set of permutations we can obtain by starting with the identity and applying an even number of transpositions, while Y is the set obtained by applying an odd number of transpositions. For this to work, we will need to show that these sets are disjoint. That is, no permutation can be generated by both an odd number and an even number of transpositions. This is important, as a permutation can certainly be generated from transpositions in multiple ways. For example, if the elements are 1,2,3, we can obtain the permutation 2,1,3 by transposing 1 and 2, obviously. However, we could alternatively start by transposing 2 and 3 to get 1,3,2, then 1 and 3 to get 3,1,2, then 2 and 3 again to get 2,1,3. Note that both of these require an odd number of transpositions.

We will call a permutation even if it is generated by an even number of transpositions, and odd otherwise. We also say that its sign (alternatively signature, parity) is +1 or -1 respectively. To prove this is well-defined, we really want to find a different property that is easier to track.

A useful trick is to count how many pairs of elements are not in the correct order. Let’s do this for our previous example: 31452. There are 5 elements so 5 x 4 / 2 = 10 pairs of elements. We list them:

• 1 and 2 are in the correct order.
• 1 and 3 are not, as 3 comes before 1 in this permutation.
• 1 and 4 are correct.
• 1 and 5 are correct.
• 2 and 3 are not.
• 2 and 4 are not.
• 2 and 5 are not.
• 3 and 4 are correct.
• 3 and 5 are correct.
• 4 and 5 are not.

So 5 pairs are not in the correct order. Since 5 is odd, the claim is that this means 31452 is an odd permutation. To check this, and to confirm that the sign is well-defined, it suffices to check that the number of so-called inversions, or pairs in the wrong order, changes parity every time we apply a transposition.

This is clearly true if we transpose adjacent elements. Then the orderings of all pairs remain the same, apart from the pair we transposed, which changes. Then, if the elements are not adjacent, instead of transposing them directly, we can perform a succession of transpositions of adjacent elements. The easiest way to describe this is again by example. Suppose we want to transpose 3 and 5 in 31452.

31452 -> 13452 -> 14352 -> 14532 -> 15432 -> 51432.

Note that the middle transposition is actually transposing 3 and 5, and the others are symmetric about this middle operation. In particular, there is an odd number of transpositions in total. So we have proved the result for general transpositions, and thus we now know that the sign of a permutation is well-defined. Note also that there are an equal number of odd and even permutations of every n=>2. For every odd permutation, transposing 1 and 2 gives an even permutation, and vice versa, uniquely, giving a bijection.

What’s really going on is that we are able to multiply permutations, by doing one after the other. Unlike multiplying real numbers, the order in which we do this now matters. In this context, the set of permutations is an example of a general structure called a group. The idea of partitioning a group into subsets which are in some sense symmetric and where some other operation jumps between the subsets is a useful motivation point for a whole avenue of interesting theory. Not to be explored now unfortunately…

# Mixing Times 4 – Avoiding Periodicity

A Markov chain is periodic if you can partition the state space such it is possible to be in a particular class only at certain, periodic times. Concretely, suppose we can find a decomposition into classes $\Omega=V_1\cup\ldots\cup V_k$ such that conditional on $X_t\in V_i$, we have $\mathbb{P}(X_{t+1}\in V_{i+1})=1$, where the indices of the Vs are taken modulo k. Such a chain is called periodic with period k. In most cases, we would want to define the period to be the maximal such k.

Why is periodicity a problem? It prevents convergence to equilibrium. The distribution at time t has some fairly strong dependence on the initial distribution. For example, if the initial distribution is entirely supported on $V_1$ as defined above, then the distribution at time t will be entirely supported on $V_i$, where $i\equiv t \mod k$. In particular, this cannot converge to some equilibrium.

Aperiodicity thus becomes a necessary condition in any theorem on convergence to equilibrium. Note that by construction this is only relevant for chains in discrete time. In an first account of Markov chains, most of the examples will either have a small state space, for which the transition matrix will have to contain lots of zeros before it stands a chance of being periodic, or obviously aperiodic birth-death or queue type processes. But some of the combinatorially motivated chains we consider for interesting mixing properties are more likely to be periodic. In particular, for a random walk on a group say, the generator measure may well be supported only on a small subset of the whole group, which is completely natural (eg transpositions as a subset of the symmetric group). Then it becomes more plausible that periodicity might arise because of some underlying regularity or symmetry in the group structure.

My first claim is that periodicity is not a disaster for convergence properties of Markov chains. Firstly, by the definition above, $P^k(x,y)$ for $x,y\in V_1$ is an irreducible (aperiodic if k is maximal) transition matrix on $V_1$, and so we have convergence to some equilibrium distribution on $V_1$ of $(X_{kt+a})$ or similar. An initial distribution mixed between classes gives a mix of such equilibria. Alternatively, we could think about large-time ergodic properties. By taking an average over all distributions up to some large t, the periodic problems get smoothed out. So, for mixing on a periodic chain, it might be possible to make headway with Cesaro mixing, which looks at the speed of convergence of the ergodic average distribution.

In most cases, though, we prefer to alter the chain directly to remove periodicity, or even any chance of periodicity. The preferred method in many contexts is to replace the transition matrix P with $\frac12 (P+I)$. This says that at every time t, we toss an independent fair coin, and with probability 1/2 make the transition suggested by P, and with probability 1/2 we stay where we are. Note that if a chain is irreducible, and some P(x,x)>0, then it is definitely aperiodic, as x cannot be in more than one class as per the definition of periodicity.

If you want to know about the mixing time of the original chain, note that this so-called lazy chain moves at half the speed of the original, so to get exact asymptotics (eg in the case of cutoff, that is mixing speed faster than the scale of the mixing time) you must multiply by 2. Also note that all of the eigenvalues of $\frac12 (P+I)$ are non-negative, and in fact, the eigenvalues are subject to a linear transform in the construction of the lazy transition matrix $\lambda\mapsto \frac12(1+\lambda)$.

Note that choosing 1/2 as the parameter is unnecessary. Firstly, it would suffice to take some $P(x,x)=\epsilon$ and rescale the rest of the row appropriately. Also, in some cases, a different constant gives a more natural interpretation of the underlying mechanism. For example, one model worth considering is the Random Transposition Random Walk on the symmetric group, where at time t we multiply (ie compose) with a transposition chosen uniformly at random. This model is interesting partly because the orbits of an element resemble, at least initially, the component size process of a Erdos-Renyi random graph, on the grounds that when the number of transpositions is small, they don’t interact too much, so can be viewed as independent edges. Anyway, some form of laziness is needed in RTRW, otherwise the chain will alternative between odd and even permutations. In this case, 1/2 is not the most natural choice. The most sensible way to sample random transpositions is to select the two elements of [n] to be transposed uniformly and independently at random. Thus each transposition is selected with probability $\frac{2}{n^2}$, while the identity, which corresponds to ‘lazily’ staying at the current state in the random walk, is selected with probability 1/n.

The lazy chain is also useful when the original chain has a lot of symmetry involved. In particular, if the original chain involves ‘switching’ say one coordinate. The best example is the random walk on the vertices of the n-hypercube, but there are others. Here, the most helpful way to visualise the configuration is to choose a coordinate uniformly at random and then flip its value (from 0 to 1 or 1 to 0). Now the lazy chain can be viewed similarly, but note that the dependence on the current value of the coordinate is suppressed. That is, having chosen the coordinate to be affected, we set it to be 0 with probability 1/2 and 1 with probability 1/2, irrespective of the prior value at that coordinate. Thus instead of viewing the action on coordinates to be ‘stay or switch’, we can view the action on the randomly chosen coordinate to be ‘randomly resample’, to use statistical terminology. This is ideal for coupling, because from the time coordinate j is first selected, the value at that coordinate is independent of the past, and in particular, the initial value or distribution. So we can couple arbitrary initial configurations or distributions, and we know that as soon as all coordinates have been selected (a time that can be described as a coupon collector problem), the chains are well coupled, that is, the values are the same.

Note that one way we definitely get periodicity is if the increment distribution for random walk on a group is supported entirely in a single coset of a normal subgroup. Why? Well if we take $H\lhd G$ to be the normal subgroup, and gH to be the relevant coset, then $P^t(g',\cdot)$ is supported entirely on $g^tg'H$, so is periodic with period equal to the order of gH in the quotient group G/H. Note that if the coset is the normal subgroup itself, then it might well include support on the identity, which immediately makes the chain aperiodic. However, there will be then be no transitions between cosets, so the chain is not irreducible on G.

The previous paragraph is the content of Remark 8.3 in the book we are reading. My final comment is that normality is precisely what is needed for this to hold. The key idea is that the set of subsets {gH, gHgH, gHgHgH, … } forms a partition of the group. This is certainly true if H is normal and gH generates G/H. If the latter statement is not true, then the set of subsets still forms a partition, but of some subset of G. The random walk is then neither irreducible nor aperiodic on the reduced state space. If H is not normal, then there are no such restrictions. For example, gHgH might be equal to the whole group G. Then the random walk is aperiodic, as this would imply we can move between any pair of states in two steps, and so by extension between any pair of states in three steps. (2,3)=1, hence the chain is aperiodic. As a concrete example, consider

$\tau=\langle (1 2)\rangle \leq S_3,$

the simplest example of a non-normal subgroup. Part of the problem is that cosets are different in the left-case and the right-case. Consider the left coset of $\tau$ given by $\sigma\tau=\{(1 2 3),(2 3)\}$. These elements have order three and two respectively, and so by a similar argument to the general one above, this random walk is aperiodic.