BMO2 2017

The second round of the British Mathematical Olympiad was taken yesterday by about 100 invited participants, and about the same number of open entries. To qualify at all for this stage is worth celebrating. For the majority of the contestants, this might be the hardest exam they have ever sat, indeed relative to current age and experience it might well be the hardest exam they ever sit. And so I thought it was particularly worth writing about this year’s set of questions. Because at least in my opinion, the gap between finding every question very intimidating, and solving two or three is smaller, and more down to mindset, than one might suspect.

A key over-arching point at this kind of competition is the following: the questions have been carefully chosen, and carefully checked, to make sure they can be solved, checked and written up by school students in an hour. That’s not to say that many, or indeed any, will take that little time, but in principle it’s possible. That’s also not to say that there aren’t valid but more complicated routes to solutions, but in general people often spend a lot more time writing than they should, and a bit less time thinking. Small insights along the lines of “what’s really going on here?” often get you a lot further into the problem than complicated substitutions or lengthy calculations at this level.

So if some of the arguments below feel slick, then I guess that’s intentional. When I received the paper and had a glance in my office, I was only looking for slick observations, partly because I didn’t have time for detailed analysis, but also because I was confident that there were slick observations to be made, and I felt it was just my task to find them.

Anyway, these are the questions: (note that the copyright to these is held by BMOS – reproduced here with permission.)

Question One

2017-bmo2-q1I immediately tried the example where the perpendicular sides are parallel to the coordinate axes, and found that I could generate all multiples of 3 in this way. This seemed a plausible candidate for an answer, so I started trying to find a proof. I observed that if you have lots of integer points on one of the equal sides, you have lots of integer points on the corresponding side, and these exactly match up, and then you also have lots of integer points on the hypotenuse too. In my first example, these exactly matched up too, so I became confident I was right.

Then I tried another example ( (0,0), (1,1), (-1,1) ) which has four integer points, and could easily be generalised to give any multiple of four as the number of integer points. But I was convinced that this matching up approach had to be the right thing, and so I continued, trusting that I’d see where this alternate option came in during the proof.

Good setup makes life easy. The apex of the isosceles triangle might as well be at the origin, and then your other vertices can be (m,n), (n,-m) or similar. Since integral points are preserved under the rotation which takes equal side to the other, the example I had does generalise, but we really need to start enumerating. The number of integer points on the side from (0,0) to (m,n) is G+1, where G is the greatest common divisor of m and n. But thinking about the hypotenuse as a vector (if you prefer, translate it so one vertex is at the origin), the number of integral points on this line segment must be \mathrm{gcd}(m+n,m-n) +1.

To me, this felt highly promising, because this is a classic trope in olympiad problem-setting. Even without this experience, we know that this gcd is equal to G if m and n have different parities (ie one odd, one even) and equal to 2G if m and n have the same parity.

So we’re done. Being careful not to double-count the vertices, we have 3G integral points if m and n have opposite parities, and 4G integral points if m and n have the same parity, which exactly fits the pair of examples I had. But remember that we already had a pair of constructions, so (after adjusting the hypothesis to allow the second example!) all we had to prove was that the number of integral points is divisible by at least one of 3 and 4. And we’ve just done that. Counting how many integers less than 2017 have this property can be done easily, checking that we don’t double-count multiples of 12, and that we don’t accidentally include or fail to include zero as appropriate, which would be an annoying way to perhaps lose a mark after totally finishing the real content of the problem.

Question Two

2017-bmo2-q2(Keen observers will note that this problem first appeared on the shortlist for IMO 2006 in Slovenia.)

As n increases, obviously \frac{1}{n} decreases, but the bracketed expression increases. Which of these effects is more substantial? Well \lfloor \frac{n}{k}\rfloor is the number of multiples of k which are at most n, and so as a function of n, this increases precisely when n is a multiple of k. So, we expect the bracketed expression to increase substantially when n has lots of factors, and to increase less substantially when n has few factors. An extreme case of the former might be when n is a large factorial, and certainly the extreme case of the latter is n a prime.

It felt easier to test a calculation on the prime case first, even though this was more likely to lead to an answer for b). When n moves from p-1 to p, the bracketed expression goes up by exactly two, as the first floor increases, and there is a new final term. So, we start with a fraction, and then increase the numerator by two and the denominator by one. Provided the fraction was initially greater than two, it stays greater than two, but decreases. This is the case here (for reasons we’ll come back to shortly), and so we’ve done part b). The answer is yes.

Then I tried to do the calculation when n was a large factorial, and I found I really needed to know the approximate value of the bracketed expression, at least for this value of n. And I do know that when n is large, the bracketed expression should be approximately n\log n, with a further correction of size at most n to account for the floor functions, but I wasn’t sure whether I was allowed to know that.

But surely you don’t need to engage with exactly how large the correction due to the floors is in various cases? This seemed potentially interesting (we are after all just counting factors), but also way too complicated. An even softer version of what I’ve just said is that the harmonic function (the sum of the first n reciprocals) diverges faster than n. So in fact we have all the ingredients we need. The bracketed expression grows faster than n, (you might want to formalise this by dividing by n before analysing the floors) and so the a_ns get arbitrarily large. Therefore, there must certainly be an infinite number of points of increase.

Remark: a few people have commented to me that part a) can be done easily by treating the case n=2^k-1, possibly after some combinatorial rewriting of the bracketed expression. I agree that this works fine. Possibly this is one of the best examples of the difference between doing a problem leisurely as a postgraduate, and actually under exam pressure as a teenager. Thinking about the softest possible properties of a sequence (roughly how quickly does it grow, in this case) is a natural first thing to do in all circumstances, especially if you are both lazy and used to talking about asymptotics, and certainly if you don’t have paper.

Question 3

2017-bmo2-q3I only drew a very rough diagram for this question, and it caused no problems whatsoever, because there aren’t really that many points, and it’s fairly easy to remember what their properties are. Even in the most crude diagram, we see R and S lie on AC and AD respectively, and so the conclusion about parallel lines is really about similarity of triangles ARS and ACD. This will follow either from some equal angles, or by comparing ratios of lengths.

bmo2-2017-q3Since angle bisectors by definition involve equal angles, the first attack point seems promising. But actually the ratios of lengths is better, provided we know the angle bisector theorem, which is literally about ratios of lengths in the angle bisector diagram. Indeed

\frac{AR}{RC}=\frac{AQ}{CQ},\quad \frac{AS}{SD}=\frac{AP}{PD},     (1)

and so it only remains to show that these quantities are in fact all equal. Note that there’s some anti-symmetry here – none of these expressions use B at all! We could for example note that AP/PD = BP/PC, from which

\left(\frac{AS}{SD}\right)^2 = \frac{AP.BP}{PC.PD},     (2)

and correspondingly for R and Q, and work with symmetric expressions. I was pretty sure that there was a fairly well-known result that in a cyclic quadrilateral, where P is the intersection of the diagonals

\frac{AP}{PC} = \frac{AD.AB}{DC.BC},     (3)

(I was initially wondering whether there was a square on the LHS, but an example diagram makes the given expression look correct.)

There will be a corresponding result for Q, and then we would be almost done by decomposing (2) slightly differently, and once we’d proved (3) of course. But doing this will turn out to be much longer than necessary. The important message from (3) is that in a very simple diagram (only five points), we have a result which is true, but which is not just similar triangles. There are two pairs of similar triangles in the diagram, but they aren’t in the right places to get this result. What you do have is some pairs of triangles with one pair of equal angles, and one pair of complementary angles (that is, \theta in one, and 180-\theta in the other). This is a glaring invitation to use the sine rule, since the sines of complementary angles are equal.

But, this is also the easiest way to prove the angle bisector theorem. So maybe we should just try this approach directly on the original ratio-of-lengths statement that we decided at (1) was enough, namely \frac{AQ}{CQ}=\frac{AP}{PD}. And actually it drops out rapidly. Using natural but informal language referencing my diagram

\frac{AP}{PD} = \frac{\sin(\mathrm{Green})}{\sin(\mathrm{Pink})},\quad\text{and}\quad \frac{AQ}{CQ}= \frac{\sin(\mathrm{Green})}{\sin(180-\mathrm{Pink})}

and we are done. But whatever your motivation for moving to the sine rule, this is crucial. Unless you construct quite a few extra cyclic quadrilaterals, doing this with similar triangles and circle theorems alone is going to be challenging.

Remark: If you haven’t seen the angle bisector theorem before, that’s fine. Both equalities in (1) are a direct statement of the theorem. It’s not an intimidating statement, and it would be a good exercise to prove either of these statements in (1). Some of the methods just described will be useful here too!

Question 4

2017-bmo2-q4You might as well start by playing around with methodical strategies. My first try involved testing 000, 111, … , 999. After this, you know which integers appear as digits. Note that at this stage, it’s not the same as the original game with only three digits, because we can test using digits which we know are wrong, so that answers are less ambiguous. If the three digits are different, we can identify the first digit in two tests, and then the second in a further test, and so identify the third by elimination. If only two integers appear as digits, we identify each digit separately, again in three tests overall. If only one integer appears, then we are already done. So this is thirteen tests, and I was fairly convinced that this wasn’t optimal, partly because it felt like testing 999 was a waste. But even with lots of case tries I couldn’t do better. So I figured I’d try to prove some bound, and see where I got.

A crucial observation is the following: when you run a test, the outcome eliminates some possibilities. One of the outcomes eliminates at least half the codes, and the other outcome eliminates at most half the codes. So, imagining you get unlucky every time, after k tests, you might have at least 1000\times 2^{-k} possible codes remaining. From this, we know that we need at least 9 tests.

For this bound to be tight, each test really does need to split the options roughly in two. But this certainly isn’t the case for the first test, which splits the options into 729 (no digit agreements) and 271 (at least one agreement). Suppose the first test reduces it to 729 options, then by the same argument as above, we still need 9 tests. We now know we need at least 10 tests, and so the original guess of 13 is starting to come back into play.

We now have to make a meta-mathematical decision about what to do next. We could look at how many options might be left after the second test, which has quite a large number of cases (depending on how much overlap there is between the first test number and the second test number). It’s probably going to be less than 512 in at least one of the cases, so this won’t get us to a bound of 11 unless we then consider the third test too. This feels like a poor route to take for now, as the tree of options has branching at rate 3 (or 4 if you count obviously silly things) per turn, so gets unwieldy quickly. Another thought is that this power of two argument is strong when the set of remaining options is small, so it’s easier for a test to split the field roughly in two.

Now go back to our proposed original strategy. When does the strategy work faster than planned? It works faster than planned if we find all the digits early (eg if they are all 6 or less). So the worst case scenario is if we find the correct set of digits fairly late. But the fact that we were choosing numbers of the form aaa is irrelevant, as the digits are independent (consider adding 3 to the middle digit modulo 10 at all times in any strategy – it still works!).

This is key. For k\le 9, after k tests, it is possible that we fail every test, which means that at least (10-k) options remain for each digit, and so at least (10-k)^3 options in total. [(*) Note that it might actually be even worse if eg we get a ‘close’ on exactly one test, but we are aiming for a lower bound, so at this stage considering an outcome sequence which is tractable is more important than getting the absolute worst case outcome sequence if it’s more complicated.] Bearing in mind that I’d already tried finishing from the case of reduction to three possibilities, and I’d tried hard to sneak through in one fewer test, and failed, it seemed sensible to try k=7.

After 7 tests, we have at least 27 options remaining, which by the powers-of-two argument requires at least 5 further tests to separate. So 12 in total, which is annoying, because now I need to decide whether this is really the answer and come up a better construction, or enhance the proof.

Clearly though, before aiming for either of these things, I should actually try some other values of k, since this takes basically no time at all. And k=6 leaves 64 options, from which the power of two argument is tight; and k=5 leaves 125, which is less tight. So attacking k=6 is clearly best. We just need to check that the 7th move can’t split the options exactly into 32 + 32. Note that in the example, where we try previously unseen digits in every position, we split into 27 + 37 [think about (*) again now!]. Obviously, if we have more than four options left for any digit, we are done as then we have strictly more than 4x4x4=64 options. So it remains to check the counts if we try previously unseen digits in zero, one or two positions. Zero is silly (gives no information), and one and two can be calculated, and don’t give 32 + 32.

So this is a slightly fiddly end to the solution, and relies upon having good control over what you’re trying to do, and what tools you currently have. The trick to solving this is resisting calculations and case divisions that are very complicated. In the argument I’ve proposed, the only real case division is right at the end, by which point we are just doing an enumeration in a handful of cases, which is not really that bad.

Chains and antichains

I’ve recently been at the UK-Hungary winter olympiad camp in Tata, for what is now my sixth time. As well as doing some of my own work, have enjoyed the rare diversion of some deterministic combinatorics. It seems to be a local variant of the pigeonhole principle that given six days at a mathematical event in Hungary, at least one element from {Ramsay theory, Erdos-Szekeres, antichains in the hypercube} will be discussed, with probability one. On this occasion, all were discussed, so I thought I’d write something about at least one of them.

Posets and directed acyclic graphs

This came up on the problem set constructed by the Hungarian leaders. The original formulation asked students to show that among any 17 positive integers, there are either five such that no one divides any other, or five such that among any pair, one divides the other.

It is fairly clear why number theory plays little role. We assign the given integers to the vertices of a graph, and whenever a divides b, we add a directed edge from the vertex corresponding to a to the vertex corresponding to b. Having translated the given situation into a purely combinatorial statement, fortunately we can translate the goal into the same language. If we can find a chain of four directed edges (hence five vertices – beware confusing use of the word ‘length’ here) then we have found the second possible option. Similarly, if we can find an antichain, a set of five vertices with no directed edges between them, then we have found the first possible option.

It’s worth noting that the directed graph we are working with with is transitive. That is, whenever there is an edge a->b and b->c, there will also be an edge a->c. This follows immediately from the divisibility condition. There are also no directed cycles in the graph, since otherwise there would be a cycle of integers where each divided its successor. But of course, when a divides b and these are distinct positive integers, this means that b is strictly larger than a, and so this relation cannot cycle.

In fact, among a set of positive integers, divisibility defines a partial order, which we might choose to define as any ordering whether the associated directed graph is transitive and acyclic, although obviously we could use language more naturally associated with orderings. Either way, from now on we consider posets and the associated DAGs (directed acyclic graphs) interchangeably.

Dilworth’s theorem

In the original problem, we are looking for either a large chain, or a large antichain. We are trying to prove that it’s not possible to have largest chain size at most four, and largest antichain size at most four when there are 17 vertices, so we suspect there may some underlying structure: in some sense perhaps the vertex set is the ‘product’ of a chain and an antichain, or at least a method of producing antichains from a single vertex.

Anyway, one statement of Dilworth’s theorem is as follows:

Statement 1: in a poset with nm+1 elements, there is either a chain of size n+1, or an antichain of size m+1.

Taking n=m=4 immediately finishes the original problem about families of divisors. While this is the most useful statement here, it’s probably not the original, which says the following:

Statement 2: in a poset, there exists \mathcal{C} a decomposition into chains, and an antichain A such that |\mathcal{C}|=|A|.

Remark 1: Note that for any decomposition into chains and any antichain, we have |\mathcal{C}|\ge |A|, since you can’t have more than one representative from any chain in the antichain. So Statement 2 is saying that equality does actually hold.

Remark 2: Statement 1 follows immediately from Statement 2. If all antichains had size at most m, then there’s a decomposition into at most m chains. But each chain has size n, so the total size of the graph is at most mn. Contradiction.

Unsuccessful proof strategies for Dilworth

Since various smart young people who didn’t know the statement or proof of Dilworth’s theorem attempted to find it (in the form of Statement 1, and in a special case) in finite time conditions, it’s easy to talk about what doesn’t work, and try to gain intellectual value by qualifying why.

  • Forgetting directions: in general one might well attack a problem by asking whether we have more information than we need. But ignoring the directions of the edges is throwing away too much information. After doing this, antichains are fine, but maybe you need to exhibit some undirected ‘chains’. Unless these undirected chains are much longer than you are aiming for, you will struggle to reconstruct directed chains out of them.
  • Where can the final vertex go?: in a classic trope, one might exhibit a directed graph on nm vertices with neither a chain of size n+1 nor an antichain of size m+1. We attempt to argue that this construction is essentially unique, and that it goes wrong when we add an extra vertex. As a general point, it seems unlikely to be easier to prove that exactly one class of configurations has a given property in the nm case, than to prove no configurations has the same property in the nm+1 case. A standalone proof of uniqueness is likely to be hard, or a disguised rehash of an actual proof of the original statement.
  • Removing a chain: If you remove a chain of maximal length, then, for contradiction, what you have left is m(n-1)+1 vertices. If you have a long chain left, then you’re done, although maximality has gone wrong somewhere. So you have an antichain size n in what remains. But it’s totally unclear why it should be possible to extend the antichain with one of the vertices you’ve just removed.

An actual proof of Dilworth (Statement 1), and two consequences

This isn’t really a proof, instead a way of classifying the vertices in the directed graph so that this version of Dilworth. As we said earlier, we imagine there may be some product structure. In particular, we expect to be able to find a maximal chain, and a nice antichain associated to each element of the maximal chain.

dilworth-thmWe start by letting V_0 consist of all the vertices which are sources, that is, have zero indegree. These are minima in the partial ordering setting. Now let V_1 consist of all vertices whose in-neighbourhood is entirely contained in V_0, that is they are descendents only of V_0. Then let V_2 consist of all remaining vertices whose in-neighourhood is entirely contained in V_0\cup V_1 (but not entirely in V_0, otherwise it would have already been treated), and so on. We end up with what one might call an onion decomposition of the vertices based on how far they are from the sources. We end up with V_0,V_1,\ldots,V_k, and then we can find a chain of size k+1 by starting with any vertex in V_k and constructing backwards towards the source. However, this is also the largest possible size of a chain, because every time we move up a level in the chain, we must move from V_i to V_j where j>i.

It’s easy to check that each V_i is an antichain, and thus we can read off Statement 1. A little more care, and probably an inductive argument is required to settle Statement 2.

We have however proved what is often called the dual of Dilworth’s theorem, namely that in a poset there exists a chain C, and a decomposition into a collection \mathcal{A} of antichains, for which |C|=|\mathcal{A}|.

Finally, as promised returning to Erdos-Szekeres, if not to positive integers. We apply Dilworth Statement 1 to a sequence of m^2+1 real numbers a_0,a_1,\ldots,a_{m^2}, with the ordering a_i\rightarrow a_j if i\le j and a_i\le a_j. Chains correspond to increasing subsequences, and antichains to decreasing subsequences, so we have shown that there is either a monotone subsequence of length m+1.

 

Skorohod embedding

Background

Suppose we are given a standard Brownian motion (B_t), and a stopping time T. Then, so long as T satisfies one of the regularity conditions under which the Optional Stopping Theorem applies, we know that \mathbb{E}[B_T]=0. (See here for a less formal introduction to OST.) Furthermore, since B_t^2-t is a martingale, \mathbb{E}[B_T^2]=\mathbb{E}[T], so if the latter is finite, so is the former.

Now, using the strong Markov property of Brownian motion, we can come up with a sequence of stopping times 0=T_0, T_1, T_2,\ldots such that the increments T_k-T_{k-1} are IID with the same distribution as T. Then 0,B_{T_1},B_{T_2},\ldots is a centered random walk. By taking T to be the hitting time of \{-1,+1\}, it is easy to see that we can embed simple random walk in a Brownian motion using this approach.

p1020956_compressedEmbedding simple random walk in Brownian motion.

The Skorohod embedding question asks: can all centered random walks be constructed in this fashion, by stopping Brownian motion at a sequence of stopping time? With the strong Markov property, it immediately reduces the question of whether all centered finite-variance distributions X can be expressed as B_T for some integrable stopping time T.

The answer to this question is yes, and much of what follows is drawn from, or at least prompted by Obloj’s survey paper which details the problem and rich history of the many approaches to its solution over the past seventy years.

Applications and related things

The relationship between random walks and Brownian motion is a rich one. Donsker’s invariance principle asserts that Brownian motion appears as the scaling limit of a random walk. Indeed, one can construct Brownian motion itself as the limit of a sequence of consistent random walks with normal increments on an increasingly dense set of times. Furthermore, random walks are martingales, and we know that continuous, local martingales can be expressed as a (stochastically) time-changed Brownian motion, from the Dubins-Schwarz theorem.

The Skorohod embedding theorem can be used to prove results about random walks with general distribution by proving the corresponding result for Brownian motion, and checking that the construction of the sequence of stopping times has the right properties to allow the result to be carried back to the original setting. It obviously also gives a coupling between a individual random walk and a Brownian motion which may be useful in some contexts, as well as a coupling between any pair of random walks. This is useful in proving results for random walks which are much easier for special cases of the distribution. For example, when the increments are Gaussian, or when there are combinatorial approaches to a problem about simple random walk. At the moment no aspect of this blog schedule is guaranteed, but I plan to talk about the law of the iterated logarithm shortly, whose proof is approachable in both of these settings, as well as for Brownian motion, and Skorohod embedding provides the route to the general proof.

At the end, we will briefly compare some other ways to couple a random walk and a Brownian motion.

Adding extra randomness

One thing we could do is sample a copy of X independently from the Brownian motion, then declare T= \tau_{X}:= \inf\{t\ge 0: B_t=X\}, the hitting time of (random value) X. But recall that unfortunately \tau_x has infinite expectation for all non-zero x, so this doesn’t fit the conditions required to use OST.

Skorohod’s original method is described in Section 3.1 of Obloj’s notes linked above. The method is roughly to pair up positive values taken by X appropriately with negative values taken by X in a clever way. If we have a positive value b and a negative value a, then \tau_{a,b}, the first hitting time of \mathbb{R}\backslash (a,b) is integrable. Then we choose one of these positive-negative pairs according to the projection of the distribution of X onto the pairings, and let T be the hitting time of this pair of values. The probability of hitting b conditional on hitting {a,b} is easy to compute (it’s \frac{-a}{b-a}) so we need to have chosen our pairs so that the ‘probability’ of hitting b (ie the density) comes out right. In particular, this method has to start from continuous distributions X, and treat atoms in the distribution of X separately.

The case where the distribution X is symmetric (that is X\stackrel{d}=-X) is particularly clear, as then the pairs should be (-x,x).

However, it feels like there is enough randomness in Brownian motion already, and subsequent authors showed that indeed it wasn’t necessary to introduce extra randomness to provide a solution.

One might ask whether it’s possible to generate the distribution on the set of pairs (as above) out of the Brownian motion itself, but independently from all the hitting times. It feels like it might be possible to make the distribution on the pairs measurable with respect to

\mathcal{F}_{0+} = \bigcap\limits_{t>0} \mathcal{F}_t,

the sigma-algebra of events determined by limiting behaviour as t\rightarrow 0 (which is independent of hitting times). But of course, unfortunately \mathcal{F}_{0+} has a zero-one law, so it’s not possible to embed non-trivial distributions there.

Dubins solution

The exemplar for solutions without extra randomness is due to Dubins, shortly after Skorohod’s original argument. The idea is to express the distribution X as the almost sure limit of a martingale. We first use the hitting time of a pair of points to ‘decide’ whether we will end up positive or negative, and then given this information look at the hitting time (after this first time) of two subsequent points to ‘decide’ which of four regions of the real interval we end up in.

I’m going to use different notation to Obloj, corresponding more closely with how I ended up thinking about this method. We let

a_+:= \mathbb{E}[X \,|\, X>0], \quad a_- := \mathbb{E}[X\,|\, X<0], (*)

and take T_1 = \tau_{\{a_-,a_+\}}. We need to check that

\mathbb{P}\left( B_{T_1}=a_+\right) = \mathbb{P}\left(X>0\right),

for this to have a chance of working. But we know that

\mathbb{P}\left( B_{T_1}=a_+\right) = \frac{a_+}{a_+-a_-},

and we can also attack the other side using (*) and the fact that \mathbb{E}[X]=0, using the law of total expectation:

0=\mathbb{E}[X]=\mathbb{E}[X\,|\, X>0] \mathbb{P}(X>0) + \mathbb{E}[X\,|\,X<0]\mathbb{P}(X<0) = a_+ \mathbb{P}(X>0) + a_- \left(1-\mathbb{P}(X>0) \right),

\Rightarrow\quad \mathbb{P}(X>0)=\frac{a_+}{a_+-a_-}.

Now we define

a_{++}=\mathbb{E}[X \,|\, X>a_+],\quad a_{+-}=\mathbb{E}[X\,|\, 0<X<a_+],

and similarly a_{-+},a_{--}. So then, conditional on B_{T_1}=a_+, we take

T_2:= \inf_{t\ge T_1}\left\{ B_t\not\in (a_{+-},a_{++})  \right\},

and similarly conditional on B_{T_1}=a_-. By an identical argument to the one we have just deployed, we have \mathbb{E}\left[B_{T_2} \,|\,\mathcal{F}_{T_1} \right] = B_{T_1} almost surely. So, although the a_{+-+} notation now starts to get very unwieldy, it’s clear we can keep going in this way to get a sequence of stopping times 0=T_0,T_1,T_2,\ldots where B_{T_n} determines which of the 2^n regions of the real line any limit \lim_{m\rightarrow\infty} B_{T_m} should lie in.

A bit of work is required to check that the almost sure limit T_n\rightarrow T is almost surely finite, but once we have this, it is clear that B_{T_n}\rightarrow B_T almost surely, and B_T has the distribution required.

Komlos, Major, Tusnady coupling

We want to know how close we can make this coupling between a centered random walk with variance 1, and a standard Brownian motion. Here, ‘close’ means uniformly close in probability. For large times, the typical difference between one of the stopping times 0,T_1,T_2,\ldots in the Skorohod embedding and its expectation (recall \mathbb{E}[T_k]=k) is \sqrt{n}. So, constructing the random walk S_0,S_1,S_2,\ldots from the Brownian motion via Skorohod embedding leads to

\left |S_k - B_k \right| = \omega(n^{1/4}),

for most values of k\le n. Strassen (1966) shows that the true scale of the maximum

\max_{k\le n} \left| S_k - B_k \right|

is slightly larger than this, with some extra powers of \log n and \log\log n as one would expect.

The Komlos-Major-Tusnady coupling is a way to do a lot better than this, in the setting where the distribution of the increments has a finite MGF near 0. Then, there exists a coupling of the random walk and the Brownian motion such that

\max_{k\le n}\left|S_k- B_k\right| = O(\log n).

That is, there exists C such that

\left[\max_{k\le n} \left |S_k-B_k\right| - C\log n\right] \vee 0

is a tight family of distributions, indeed with uniform exponential tail. To avoid digressing infinitely far from my original plan to discuss the proof of the law of iterated logarithm for general distributions, I’ll stop here. I found it hard to find much coverage of the KMT result apart from the challenging original paper, and many versions expressed in the language of empirical processes, which are similar to random walks in many ways relevant to convergence and this coupling, but not for Skorohod embedding. So, here is a link to some slides from a talk by Chatterjee which I found helpful in getting a sense of the history, and some of the modern approaches to this type of normal approximation problem.

BMO1 2016 – the non-geometry

Here’s a link to yesterday’s BMO1 paper, and the video solutions for all the problems. I gave the video solution to the geometric Q5, and discuss aspects of this at some length in the previous post.

In these videos, for obvious educational reasons, there’s a requirement to avoid referencing theory and ideas that aren’t standard on the school curriculum or relatively obvious directly from first principles. Here, I’ve written down some of my own thoughts on the other problems in a way that might add further value for those students who are already have some experience at olympiads and these types of problems. In particular, on problems you can do, it’s worth asking what you can learn from how you did them that might be applicable generally, and obviously for some of the harder problems, it’s worth knowing about solutions that do use a little bit of theory. Anyway, I hope it’s of interest to someone.

bmo1-2016-q1Obviously we aren’t going to write out the whole list, but there’s a trade-off in time between coming up with neat ideas involving symmetry, and just listing and counting things. Any idea is going to formalise somehow the intuitive statement ‘roughly half the digits are odd’. The neat ideas involve formalising the statement ‘if we add leading zeros, then roughly half the digits are odd’. The level of roughness required is less in the first statement than the second statement.

Then there’s the trade-off. Trying to come up with the perfect general statement that is useful and true might lead to something like the following:

‘If we write the numbers from 0000 to N, with leading zeros, and all digits of N+1 are even, then half the total digits, ie 2N of them, are odd.’

This is false, and maybe the first three such things you try along these lines are also false. What you really want to do is control the numbers from 0000 to 1999, for which an argument by matching is clear, and gives you 2000 x 4 / 2 = 4000 odd digits. You can exploit the symmetry by matching k with 1999-k, or do it directly first with the units, then with the tens and so on.

The rest (that is, 2000 to 2016) can be treated by listing and counting. Of course, the question wants an actual answer, so we should be wary of getting it wrong by plus or minus one in some step. A classic error of this kind is that the number of integers between 2000 and 2016 inclusive is 17, not 16. I don’t know why the memory is so vivid, but I recall being upset in Year 2 about erring on a problem of this kind involving fences and fenceposts.

bmo1-2016-q2As with so many new types of equation, the recipe is to reduce to a type of equation you already know how to solve. Here, because {x} has a different form on different ranges, it makes sense to consider the three ranges

x\in[0,1/25],\, x\in[1/25,1/8],\, x\in [1/8,\infty),

as for each of these ranges, we can rewrite 5y\{8y\}\{25y\} in terms of standard functions without this bracket notation. On each range we can solve the corresponding equation. We then have to check that each solution does actually lie in the appropriate range, and in two cases it does, and in one case it doesn’t.

bmo1-2016-q3Adding an appropriately-chosen value to each side allows you to factorise the quadratics. This might be very useful. But is it an invitation to do number theory and look at coprime factors and so on, or is a softer approach more helpful?

The general idea is that the set of values taken by any quadratic sequence with integer coefficients and leading coefficient one looks from a distance like the set of squares, or the set \{m(m+1), \,m\in\mathbb{N}\}, which you might think of as ‘half-squares’ or ‘double triangle numbers’ as you wish. And by, ‘from a distance’ I mean ‘up to an additive constant’. If you care about limiting behaviour, then of course this additive constant might well not matter, but if you care about all solutions, you probably do care. To see why this holds, note that

n^2+2n = (n+1)^2 - 1,

so indeed up to an additive constant, the quadratic on the LHS gives the squares, and similarly

n^2 - 7n = (n-4)(n-3)-12,

and so on. To solve the equation n^2=m^2+6, over the integers, one can factorise, but another approach is to argue that the distance between adjacent squares is much more than 6 in the majority of cases, which leaves only a handful of candidates for n and m to check.

The same applies at this question. Adding on 9 gives

n^2-6n+9 = m^2 + m -1,

which is of course the same as

(n-3)^2 = m(m+1)-1.

Now, since we now that adjacent squares and ‘half-squares’ are more than one apart in all but a couple of cases, we know why there should only be a small number of solutions. I would call a method of this kind square-sandwiching, but I don’t see much evidence from Google that this term is generally used, except on this blog.

Of course, we have to be formal in an actual solution, and the easiest way to achieve this is to sandwich m(m+1)-1 between adjacent squares m^2 and (m+1)^2, since it is very much clear-cut that the only squares which differ by one are zero and one itself.

bmo1-2016-q4I really don’t have much to say about this. It’s not on the school curriculum so the official solutions are not allowed to say this, but you have to use that all integers except those which are 2 modulo 4 can be written as a difference of two squares. The easiest way to show this is by explicitly writing down the appropriate squares, treating the cases of odds and multiples of four separately.

So you lose if after your turn the running total is 2 modulo 4. At this point, the combinatorics isn’t too hard, though as in Q1 one has to be mindful that making an odd number of small mistakes will lead to the wrong answer! As in all such problems, it’s best to try and give a concrete strategy for Naomi. And it’s best if there’s something inherent in the strategy which makes it clear that it’s actually possible to implement. (Eg, if you claim she should choose a particular number, ideally it’s obvious that number is available to choose.)

One strategy might be: Naomi starts by choosing a multiple of four. Then there are an even number of multiples of four, so Naomi’s strategy is:

  • whenever Tom chooses a multiple of four, Naomi may choose another multiple of four;
  • whenever Tom chooses a number which is one (respectively three) modulo 4, Naomi may choose another which is three (respectively one) modulo 4.

Note that Naomi may always choose another multiple of four precisely because we’ve also specified the second condition. If sometimes Tom chooses an odd number and Naomi responds with a multiple of four out an idle and illogical sense of caprice, then the first bullet point would not be true. One can avoid this problem by being more specific about exactly what the algorithm is, though there’s a danger that statements like ‘whenever Tom chooses k, Naomi should choose 100-k’ can introduce problems about avoiding the case k=50.

bmo1-2016-q6I started this at the train station in Balatonfured with no paper and so I decided to focus on the case of just m, m+1 and n, n+2. This wasn’t a good idea in my opinion because it was awkward but guessable, and so didn’t give too much insight into actual methods. Also, it didn’t feel like inducting on the size of the sequences in question was likely to be successful.

If we know about the Chinese Remainder Theorem, we should know that we definitely want to use it here in some form. Here are some clearly-written notes about CRT with exercises and hard problems which a) I think are good; b) cite this blog in the abstract. (I make no comment on correlation or causality between a) and b)…)

CRT is about solutions to sets of congruence equations modulo various bases. There are two aspects to this , and it feels to me like a theorem where students often remember one aspect, and forget the other one, in some order. Firstly, the theorem says that subject to conditions on the values modulo any non-coprime bases, there exist solutions. In many constructive problems, especially when the congruences are not explicit, this is useful enough by itself.

But secondly, the theorem tells us what all the solutions are. There are two stages to this: finding the smallest solution, then finding all the solutions. Three comments: 1) the second of these is easy – we just add on all multiples of the LCM of the bases; 2) we don’t need to find the smallest solution – any solution will do; 3) if you understand CRT, you might well comment that the previous two comments are essentially the same. Anyway, finding the smallest solution, or any solution is often hard. When you give students an exercise sheet on CRT, finding an integer which is 3 mod 5, 1 mod 7 and 12 mod 13 is the hard part. Even if you’re given the recipe for the algorithm, it’s the kind of computation that’s more appealing if you are an actual computer.

Ok, so returning to this problem, the key step is to phrase everything in a way which makes the application of CRT easy. We observe that taking n=2m satisfies the statement – the only problem of course is that 2m is not odd. But CRT then tells us what all solutions for n are, and it’s clear that 2m is the smallest, so we only need to add on the LCM (which is odd) to obtain the smallest odd solution.

BMO1 2016 Q5 – from areas to angles

For the second year in a row Question 5 has been a geometry problem; and for the second year in a row I presented the video solution; and the for the second year in a row I received the question(s) while I was abroad. You can see the video solutions for all the questions here (for now). I had a think about Q5 and Q6 on the train back from a day out at Lake Balaton in Western Hungary, so in keeping with last year’s corresponding post, here are some photos from those sunnier days.

bmo1-2016-q5aI didn’t enjoy this year’s geometry quite as much as last year’s, but I still want to say some things about it. At the time of writing, I don’t know who proposed Q5, but in contrast to most geometry problems, where you can see how the question might have emerged by tweaking a standard configuration, I don’t have a good intuition for what’s really going on here. I can, however, at least offer some insight into why the ‘official’ solution I give on the video has the form that it does.

bmo1-2016-q5bThe configuration given is very classical, with only five points, and lots of equal angles. The target statement is also about angles, indeed we have to show that a particular angle is a right-angle. So we might suspect that the model approach might well involve showing some other tangency relation, where one of the lines AC and BC is a radius and the other a tangent to a relevant circle. I think it’s worth emphasising that throughout mathematics, the method of solving a problem is likely to involve similar objects to the statement of the problem itself. And especially so in competition problems – it seemed entirely reasonable that the setter might have found a configuration with two corresponding tangency relations and constructed a problem by essentially only telling us the details of one of the relations.

There’s the temptation to draw lots of extra points or lots of extra lines to try and fit the given configuration into a larger configuration with more symmetry, or more suggestive similarity [1]. But, at least for my taste, you can often make a lot of progress just by thinking about what properties you want the extra lines and points to have, rather than actually drawing them. Be that as it may, for this question, I couldn’t initially find anything suitable along these lines [2]. So we have to think about the condition.

But then the condition we’ve been given involves areas, which feels at least two steps away from giving us lots of information about angles. It doesn’t feel likely that we are going to be able to read off some tangency conditions immediately from the area equality we’ve been given. So before thinking about the condition too carefully, it makes sense to return to the configuration and think in very loose terms about how we might prove the result.

How do we actually prove that an angle is a right-angle? (*) I was trying to find some tangency condition, but it’s also obviously the angle subtending by the diameter of a circle. You could aim for the Pythagoras relation on a triangle which includes the proposed right-angle, or possibly it might be easier to know one angle and two side-lengths in such a triangle, and conclude with some light trigonometry? We’ve been given a condition in terms of areas, so perhaps we can use the fact that the area of a right-angled triangle is half the product of the shorter side-lengths? Getting more exotic, if the configuration is suited to description via vectors, then a dot product might be useful, but probably this configuration isn’t.

The conclusion should be that it’s not obvious what sort of geometry we’re going to need to do to solve the problem. Maybe everything will come out from similar triangles with enough imagination, but maybe it won’t. So that’s why in the video, I split the analysis into an analysis of the configuration itself, and then an analysis of the area condition. What really happens is that we play with the area condition until we get literally anything that looks at all like one of the approaches discussed in paragraph (*). To increase our chances, we need to know as much about the configuration as possible, so any deductions from the areas are strong.

The configuration doesn’t have many points, so there’s not much ambiguity about what we could do. There are two tangents to the circle. We treat APC with equal tangents and the alternate segment theorem to show the triangle is isosceles and that the base angles are equal to the angle at B in ABC. Then point Q is ideally defined in terms of ABC to use power of a point, and add some further equal angles into the diagram. (Though it turns out we don’t need the extra equal angle except through power of a point.)

So we have some equal angles, and also some length relations. One of the length relations is straightforward (AP=CP) and the other less so (power of a point CQ^2 = AQ\cdot BQ). The really key observation is that the angle-chasing has identified

\angle PAQ = 180 - \angle \hat C,

which gives us an alternative goal: maybe it will be easier to show that PAQ is a right-angle.

Anyway, that pretty much drinks the configuration dry, and we have to use the area condition. I want to emphasise how crucial this phase in for this type of geometry problem. Thinking about how to prove the goal, and getting a flavour for the type of relation that comes out of the configuration is great, but now we need to watch like a hawk when we play with the area condition for relations which look similar to what we have, and where we might be going, as that’s very likely to be the key to the problem.

We remarked earlier that we’re aiming for angles, and are given areas. A natural middle ground is lengths. All the more so since the configuration doesn’t have many points, and so several of the triangles listed as having the same area also have the same or similar bases. You might have noticed that ABC and BCQ share height above line AQ, from which we deduce AB=BQ. It’s crucial then to identify that this is useful because it supports the power of a point result from the configuration itself. It’s also crucial to identify that we are doing a good job of relating lots of lengths in the diagram. We have two pairs of equal lengths, and (through Power of a Point) a third length which differs from one of them by a factor of \sqrt{2}.

If we make that meta-mathematical step, we are almost home. We have a relation between a triple of lengths, and between a pair of lengths. These segments make up the perimeter of triangle APQ. So if we can relate one set of lengths and the other set of lengths, then we’ll know the ratios of the side lengths of APQ. And this is excellent, since much earlier we proposed Pythagoras as a possible method for establish an angle is a right-angle, and this is exactly the information we’d need for that approach.

Can we relate the two sets of lengths? We might guess yes, that with a different comparison of triangles areas (since we haven’t yet used the area of APC) we can find a further relation. Indeed, comparing APC and APQ gives CQ = 2PC by an identical argument about heights above lines.

bmo1-2016-q5cNow we know all the ratios, it really is just a quick calculation…

[1] – I discussed the notion of adding extra points when the scripts for the recording were being shared around. It was mentioned that for some people, the requirement to add extra points (or whatever) marks a hard division between ‘problems they can do’ and ‘problem they can’t do’. While I didn’t necessarily follow this practice while I was a contestant myself, these days the first thing I do when I see any angles or an angle condition in a problem is to think about whether there’s a simple way to alter the configuration so the condition is more natural. Obviously this doesn’t always work (see [2]), but it’s on my list of ‘things to try during initial thinking’, and certainly comes a long way before approaches like ‘place in a Cartesian coordinate system’.

[2] – Well, I could actually find something suitable, but I couldn’t initially turn it into a solution. The most natural thing is to reflect P in AC to get P’, and Q in BC to get Q’. The area conditions [AP’C]=[ABC]=[BCQ’] continue to hold, but now P’ and B are on the same side of AC, hence P’B || AC. Similarly AQ’ || BC. I see no reason not to carry across the equal length deductions from the original diagram, and we need to note that angles P’AC, ACP’, CBA are equal and angles Q’AB and BAC are equal. In the new diagram, there are many things it would suffice to prove, including that CP’Q’ are collinear. Note that unless you draw the diagram deliberately badly, it’s especially easy accidentally to assume that CP’Q’ are collinear while playing around, so I wasted quite a bit of time. Later, while writing up this post, I could finish it [3].

[3] – In the double-reflected diagram, BCQ’ is similar to P’BA, and since Q’C=2P’C = P’A, and Q’B=AB, you can even deduce that the scale factor is \sqrt{2}. There now seemed two options:

  • focus on AP’BC, where we now three of the lengths, and three of the angles are equal, so we can solve for the measure of this angle. I had to use a level of trigonometry rather more exotic than the Pythagoras of the original solution, so this doesn’t really serve purpose.
  • Since BCQ’ is similar to P’BA and ABQ’ similar to CP’A, we actually have Q’BCA similar to AP’BC. In particular, \angle CBP' = \angle ACB, and thus both are 90. Note that for this, we only needed the angle deductions in the original configuration, and the pair of equal lengths.
  • There are other ways to hack this final stage, including showing that BP’ meets AQ’ at the latter’s midpoint, to give CP’Q’ collinear.

DGFF 3 – Gibbs-Markov property for entropic repulsion

In the previous post, we saw that it isn’t much extra effort to define the DGFF with non-zero boundary conditions, by adding onto the zero-BC DGFF the unique (deterministic) harmonic function which extends the boundary values into the domain. We also saw how a Gibbs-Markov property applies, whereby the values taken by the field on some sub-region A\subset D depend on the values taken on D\backslash A only through values taken on \partial A.

In this post, we look at how this property and some other methods are applied by Deuschel [1] to study the probability that the DGFF on a large box in \mathbb{Z}^d is positive ‘everywhere’. This event can be interpreted in a couple of ways, all of which are referred to there as entropic repulsion. Everything which follows is either taken directly or paraphrased directly from [1]. I have tried to phrase this in a way which avoids repeating most of the calculations, instead focusing on the methods and the motivation for using them.

Fix dimension d\ge 2 throughout. We let P^0_N be the law of the DGFF on V_N:=[-N,N]^d\subset \mathbb{Z}^d with zero boundary conditions. Then for any subset A\subset \mathbb{Z}^d, in an intuitively-clear abuse of notation, we let

\Omega^+(A):= \{ h_x\ge 0, x\in A\},

be the event that some random field h takes only non-negative values on A. The goal is to determine P^0_N ( \Omega^+(V_N)). But for the purposes of this post, we will focus on showing bounds on the probability that the field is non-negative on a thin annulus near the boundary of V_N, since this is a self-contained step in the argument which contains a blog-friendly number of ideas.

We set (L_N) to be a sequence of integers greater than one (to avoid dividing by zero in the statement), for which \frac{L_N}{N}\rightarrow 0. We now define for each N, the annulus

W_N = \{v\in V_N: L_N\le d_{\mathbb{Z}^d}(v, V_N^c)\le 2L_N \}

with radius L_N set a distance L_N inside the box V_N. We aim to control P^N_0 (\Omega^+(W_N)). This forms middle steps of Deuschel’s Propositions 2.5 and 2.9, which discuss P^N_0(\Omega^+(V_{N-L_N})). Clearly there is the upper bound

P^N_0(\Omega^+(V_{N-L_N})) \le P^N_0(\Omega^+(W_N)) (1)

and a lower bound on P^N_0(\Omega^+(V_{N-L_N})) is obtained in the second proposition by considering the box as a union of annuli then combining the bounds on each annulus using the FKG inequality.

Upper bound via odds and evens

After removing step (1), this is Proposition 2.5:

\limsup_{N\rightarrow \infty} \frac{L_N}{N^{d-1} \log L_N} \log P^N_0(\Omega^+(W_N)) < 0. (2)

This is giving a limiting upper bound on the probability of the form L_N^{-CN^{d-1}/L_N}, though as with all LDP estimates, the form given at (2) is more instructive.

Morally, the reason why it is unlikely that the field should be non-negative everywhere within the annulus is that the distribution at each location is centred, and even though any pair of values are positively correlated, this correlation is not strong enough to avoid this event being unlikely. But this is hard to corral into an upper bound argument directly. In many circumstances, we want to prove upper bounds for complicated multivariate systems by projecting to get an unlikely event for a one-dimensional random variable, or a family of independent variables, even if we have to throw away some probability. We have plenty of tools for tail probabilities in both of these settings. Since the DGFF is normal, a one-dimensional RV that is a linear combination (eg the sum) of all the field heights is a natural candidate. But in this case we would have thrown away too much probability, since the only way we could dominate is to demand that the sum \sum_{x\in W_N}h^N_x\ge 0, which obviously has probability 1/2 by symmetry. (3)

So Deuschel splits W_N into W_N^o,W_N^e, where the former includes all vertices with odd total parity in W_N and the latter includes all the vertices with even total parity in the interior of W_N. (Recall that \mathbb{Z}^d is bipartite in exactly this fashion). The idea is to condition on h^N\big|_{W^o_N}. But obviously each even vertex is exactly surrounded by odd vertices. So by the Gibbs-Markov property, conditional on the odd vertices, the values of the field at the even vertices are independent. Indeed, if for each v\in W_N^e we define \bar h_v to be the average of its neighbours (which is measurable w.r.t to the sigma-algebra generated by the odd vertices), then

\{h_v: v\in W_N^e \,\big|\, \sigma(h_w: w\in W_N^o)\},

is a collection of independent normals with variance one, and where the mean of h_v is \bar h_v.

To start finding bounds, we fix some threshold m=m_N\gg 1 to be determined later, and consider the odd-measurable event A_N that at most half of the even vertices v have \bar h_v\ge m. So A_N^c\cap \Omega^+(W_N) says that all the odd vertices are non-negative and many are quite large. This certainly feels like a low-probability event, and unlike at (3), we might be able to obtain good tail bounds by projection into one dimension.

In the other case, conditional on A_N, there are a large number of even vertices with conditional mean at most m, and so we can control the probability that at least one is negative as a product

(1-\varphi(m))^{\frac12 |W_N^e|}. (4)

Note that for this upper bound, we can completely ignore the other even vertices (those with conditional mean greater than m).

So we’ll go back to A_N^c \cap \Omega^+(W_N). For computations, the easiest one-dimensional variable to work with is probably the mean of the \bar h_vs across v\in W_N^e, since on A_N^c\cap \Omega^+(W_N) this is at least \frac{m}{2}. Rather than focus on the calculations themselves involving

\bar S^e_N:= \frac{1}{|W_N^e|} \sum\limits_{v\in W_N^e} \bar h_v,

let us remark that it is certainly normal and centered, and so there are many methods to bound its tail, for example

P^0_N \left( \bar S^e_N \ge \frac{m}{2} \right) \le \exp\left( \frac{-m^2}{8\mathrm{Var}(\bar S^e_N)} \right), (5)

as used by Deuschel just follows from an easy comparison argument within the integral of the pdf. We can tackle the variance using the Green’s function for the random walk (recall the first post in this set). But before that, it’s worth making an observation which is general and useful, namely that \bar S^e_N is the expectation of

S^e_N:= \sum{1}{|W_N^e|}\sum\limits_{v\in W_N^e} h_v

conditional on the odds. Directly from the law of total variance, the variance of any random variable X is always larger than the variance of \mathbb{E}[X|Y].

So in this case, we can replace \mathrm{Var}(\bar S^e_N) in (5) with \mathrm{Var}(S^e_N), which can be controlled via the Green’s function calculation.

Finally, we choose m_N so that the probability at (4) matches the probability at (5) in scale, and this choice leads directly to (2).

In summary, we decomposed the event that everything is non-negative into two parts: either there are lots of unlikely local events in the field between an even vertex and its odd neighbours, or the field has to be atypically large at the odd sites. Tuning the parameter m_N allows us to control both of these probabilities in the sense required.

Lower bound via a sparse sub-lattice

To get a lower bound on the probability that the field is non-negative on the annulus, we need to exploit the positive correlations in the field. We use a similar idea to the upper bound. If we know the field is positive and fairly large in many places, then it is increasingly likely that it is positive everywhere. The question is how many places to choose?

We are going to consider a sub-lattice that lives in a slightly larger region than W_N itself, and condition the field to be larger than m=m_N everywhere on this lattice. We want the lattice to be sparse enough that even if we ignore positive correlations, the chance of this happening is not too small. But we also want the lattice to be dense enough that, conditional on this event, the chance that the field is actually non-negative everywhere in W_N is not too small either.

To achieve this, Deuschel chooses a sub-lattice of width \lfloor\epsilon L_N^{2/d}\rfloor, and sets \Lambda_N(\epsilon) to be the intersection of this with the annulus with radii [N-\frac{5}{2}L_N, N-\frac{1}{2}L_N], to ensure it lives in a slightly larger region than W_N itself. The scaling of this sub-lattice density is such that when a random walk is started at any v\in W_N, the probability that the RW hits \Lambda_N(\epsilon) before \partial V_N is asymptotically in (0,1). (Ie, not asymptotically zero or one – this requires some definitely non-trivial calculations.) In particular, for appropriate (ie large enough) choice of \epsilon, this probability is at least 1/2 for all v\in W_N. This means that after conditioning on event B_N:=\{h_v\ge m : v\in \Lambda_N(\epsilon)\}, the conditional expectation of h_w is at least \frac{m}{2} for all w\in W_N\backslash \Lambda_N(\epsilon). Again this uses the Gibbs-Markov property and the Gaussian nature of the field. In particular, this conditioning means we are left with the DGFF on V_N\backslash \Lambda_N(\epsilon), ie with boundary \partial V_N\cup \Lambda_N(\epsilon), and then by linearity, the mean at non-boundary points is given by the harmonic extension, which is linear (and so increasing) in the boundary values.

At this point, the route through the calculations is fairly clear. Since we are aiming for a lower bound on the probability of the event \Omega^+(W_N), it’s enough to find a lower bound on P^0_N(\Omega^+(W_N)\cap B).

Now, by positive correlation (or, formally, the FKG inequality) we can control P^0_N(B) just as a product of the probabilities that the field exceeds the threshold at each individual site in \Lambda_N(\epsilon). Since the value of the field at each site is normal with variance at least 1 (by definition), this is straightforward.

Finally, we treat P^0_N(\Omega^+(W_N) \,\big|\, B). We’ve established that, conditional on B, the mean at each point of W_N\backslash \Lambda_N(\epsilon) is at least \frac{m}{2}, and we can bound the variance above too. Again, this is a conditional variance, and so is at most the corresponding original variance, which is bounded above by \sigma_N^2:=\mathrm{Var}(h^N_0). (This fact that the variance is maximised at the centre is intuitively clear when phrased in terms of occupation times, but the proof is non-obvious, or at least non-obvious to me.)

Since each of the event h_v^N\ge 0 for v\in W_N\backslash \Lambda_N(\epsilon) is positively correlated with B, we can bound the probability it holds for all v by the product of the probabilities that it holds for each v. But having established that the conditional mean is at least \frac{m_N}{2} for each v, and the variance is uniformly bounded above (including in N), this gives an easy tail bound of the form we require.

Again it just remains to choose the sequence of thresholds m_N to maximise the lower bound on the probability that we’ve found in this way. In both cases, it turns out that taking m_N= \sqrt{C\log N} is sensible, and this turns out to be linked to the scaling of the maximum of the DGFF, which we will explore in the future.

References

[1] – J-D Deuschel, Entropic Repulsion of the Lattice Free Field, II. The 0-Boundary Case. Available at ProjectEuclid.

DGFF 2 – Boundary conditions and Gibbs-Markov property

In the previous post, we defined the Discrete Gaussian Free Field, and offered some motivation via the discrete random walk bridge. In particular, when the increments of the random walk are chosen to be Gaussian, many natural calculations are straightforward, since Gaussian processes are well-behaved under conditioning and under linear transformations.

Non-zero boundary conditions

In the definition of the DFGG given last time, we demanded that h\equiv 0 on \partial D. But the model is perfectly well-defined under more general boundary conditions.

It’s helpful to recall again the situation with random walk and Brownian bridge. If we want a Brownian motion which passes through (0,0) and (1,s), we could repeat one construction for Brownian bridge, by taking a standard Brownian motion and conditioning (modulo probability zero technicalities) on passing through level s at time 1. But alternatively, we could set

B^{\mathrm{drift-br}}(t) = B(t)+ t(s-B(1)),\quad t\in[0,1],

or equivalently

B^{\mathrm{drift-br}}(t)=B^{\mathrm{br}}(t)+ st, \quad t\in[0,1].

That is, a Brownian bridge with drift can be obtain from a centered Brownian bridge by a linear transformation, and so certainly remains a Gaussian process. And exactly the same holds for a discrete Gaussian bridge: if we want non-zero values at the endpoints, we can obtain this distribution by taking the standard centred bridge and applying a linear transformation.

We can see how this works directly at the level of density functions. If we take 0=Z_0,Z_1,\ldots,Z_{N-1},Z_N=0 a centred Gaussian bridge, then the density of Z=\mathbf{z}\in \mathbb{R}^{N+1} is proportional to

\mathbf{1}\{z_0=z_N=0\}\exp\left( -\frac12 \sum_{i=1}^N (z_i-z_{i-1})^2 \right). (3)

So rewriting z_i= y_i- ki (where we might want k=s/N to fit the previous example), the sum within the exponent rearranges as

-\frac12 \sum_{i=1}^N (y_i-y_{i-1} - k)^2 = -\frac12 \sum_{i=1}^N (y_i-y_{i-1})^2 - 2k(y_N-y_0)+ Nk^2.

So when the values at the endpoints z_0,z_n,y_0,y_N are fixed, this middle term is a constant, as is the final term, and thus the density of the linearly transformed bridge has exactly the same form as the original one.

In two or more dimensions, the analogue of adding a linear function is to add a harmonic function. First, some notation. Let \varphi be any function on \partial D. Then there is a unique harmonic extension of \varphi, for which \nabla \varphi=0 everywhere on D, the interior of the domain. Recall that \nabla is the discrete graph Laplacian defined up to a constant by

(\nabla \varphi) _x = \sum\limits_{x\sim y} \varphi_x - \varphi_y.

If we want h^D instead to have boundary values \varphi, it’s enough to replace h^D with h^D+\varphi. Then, in the density for the DGFF ( (1) in the previous post), the term in the exponential becomes (ignoring the \frac{1}{4d} )

-\sum\limits_{x\sim y} \left[ (h^D_x-h^D_y)^2 + (\varphi_x-\varphi_y)^2 +2(h^D_x - h^D_y)(\varphi_x-\varphi_y)\right].

For each x\in D, on taking this sum over its neighbours y\in \bar D, the final term vanishes (since \varphi is harmonic), while the second term is just a constant. So the density of the transformed field, which we’ll call h^{D,\varphi} is proportional to (after removing the constant arising from the second term above)

\mathbf{1}\left\{h^{D,\varphi}_x = \varphi_x,\, x\in\partial D\right\} \exp\left( -\frac{1}{4d} \sum\limits_{x\sim y} \left( h^{D,\varphi}_x - h^{D,\varphi}_y \right)^2 \right).

So h^{D,\varphi}:= h^D + \varphi satisfies the conditions for the DGFF on D with non-zero boundary conditions \varphi.

Harmonic functions and RW – a quick review

Like the covariances in DGFF, harmonic functions on D are related to simple random walk on D stopped on \partial D. (I’m not claiming a direct connection right now.) We can define the harmonic extension \varphi to an interior point x by taking \mathbb{P}_x to be the law of SRW x=Z_0,Z_1,Z_2,\ldots started from x, and then setting

\varphi(x):= \mathbb{E}\left[ \varphi_{\tau_{\partial d}} \right],

where \tau_{\partial D} is the first time that the random walk hits the boundary.

Inverse temperature – a quick remark

In the original definition of the density of the DGFF, there is the option to add a constant \beta>0 within the exponential term so the density is proportional to

\exp\left(-\beta \sum\limits_{x\sim y} (h_x-h_y)^2 \right).

With zero boundary conditions, the effect of this is straightforward, as varying \beta just rescales the values taken by the field. But with non-zero boundary conditions, the effect is instead to vary the magnitude of the fluctuations of the values of the field around the (unique) harmonic function on the domain with those BCs. In particular, when \beta\rightarrow \infty, the field is ‘reluctant to be far from harmonic’, and so h^D \Rightarrow \varphi.

This parameter \beta is called inverse temperature. So low temperature corresponds to high \beta, and high stability, which fits some physical intuition.

A Markov property

For a discrete (Gaussian) random walk, the Markov property says that conditional on a given value at a given time, the trajectory of the process before this time is independent of the trajectory afterwards. The discrete Gaussian bridge is similar. Suppose we have as before 0=Z_0,Z_1,\ldots, Z_N=0 a centred Gaussian bridge, and condition that Z_k=y, for k\in\{1,\ldots,N-1\}, and y\in\mathbb{R}. With this conditioning, the density (3) splits as a product

\mathbf{1}\{z_0=z_N=0, z_k=y\}\exp\left(-\frac12 \sum\limits_{i=1}^N (z_i-z_{i-1})^2 \right) =

\mathbf{1}\{z_0=0,z_k=y\} \exp\left(-\frac12 \sum\limits_{i=1}^k (z_i-z_{i-1})^2 \right) \cdot \mathbf{1}\{z_k=y,z_N=0\} \exp\left(-\frac12 \sum\limits_{i=k+1}^N (z_i-z_{i-1})^2 \right).

Therefore, with this conditioning, the discrete Gaussian bridge splits into a pair of independent discrete Gaussian bridges with drift. (The same would hold if the original process had drift too.)

The situation for the DGFF is similar, though rather than focusing on the condition, it makes sense to start by focusing on the sub-domain of interest. Let A\subset D, and take B=\bar D\backslash A. So in particular \partial A\subset B.

img_20161106_194123472_compressedThen we have that conditional on h^D\big|_{\partial A}, the restricted fields h^D\big|_{B\backslash \partial A} and h^D\big|_A are independent. Furthermore, h^D\big|_A has the distribution of the DGFF on A, with boundary condition given by h^D\big|_{\partial A}. As in the discrete bridge, this follows just by splitting the density. Every gradient term corresponds to an edge in the underlying graph that lies either entirely inside \bar A or entirely inside B. This holds for a general class of Gibbs models where the Hamiltonian depends only on the sum of some function of the heights (taken to be constant in this ‘free’ model) and the sum of some function of their nearest-neighbour gradients.

One additional and useful interpretation is that if we only care about the field on the restricted region A, the dependence of h^D\big|_A on h^D\big|_{D\backslash A} comes only through h^D\big|_{\partial A}. But more than that, it comes only through the (random) harmonic function which extends the (random) values taken on the boundary of A to the whole of A. So, if h^A is an independent DGFF on A with zero boundary conditions, we can construct the DGFF h^D from its value on D\backslash A via

h^D_x \stackrel{d}= h^A_x + \varphi^{h^D\big|_{\partial A}},

where \varphi^{h^D\big|_{\partial A}} is the unique harmonic extension of the (random) values taken by h^D on \partial A to \bar A.

This Markov property is crucial to much of the analysis to come. There are several choices of the restricted domain which come up repeatedly. In the next post we’ll look at how much one can deduce by taking A to be the even vertices in D (recalling that every integer lattice \mathbb{Z}^d is bipartite), and then taking A to be a finer sublattice within D. We’ll use this to get some good bounds on the probability that the DGFF is positive on the whole of D. Perhaps later we’ll look at a ring decomposition of \mathbb{Z}^d consisting of annuli spreading out from a fixed origin. Then the distribution of the field at this origin can be considered, via the final idea discussed above, as the limit of an infinite sequence of random harmonic functions given by the values taken by the field at increasingly large radius from the origin. Defining the DGFF on the whole lattice depends on the existence or otherwise of this local limit.

DGFF 1 – The discrete Gaussian free field from scratch

I’ve moved to Haifa in northern Israel to start a post-doc in the probability group at the Technion, and now that my thesis is finished I want to start blogging again. The past couple of weeks have been occupied with finding an apartment and learning about the Discrete Gaussian Free Field. All questions about the apartment are solved, but fortunately lots remain open about the DGFF, so I thought I’d write some background about this object and methods which have been used to study it.

Background – Random walk bridge

When we think of a random walk, we usually think of the index as time, normally going forwards. So for a random walk bridge, we might assume Z_0=0, and then condition on Z_N=0, thinking of this as a demand that the process has returned to zero at the future time. In some applications, this is the ideal intuition, but in others, it is more useful to think of the random walk bridge

(0=Z_0,Z_1,\ldots,Z_{N-1},Z_N=0),

as a random height function indexed by [0,N], where the probability of a given path decomposes naturally into a product depending on the N increments, up to a normalising constant.

Naturally, we are interested in the asymptotic behaviour of such a random walk bridge when N\rightarrow\infty. So long as the step distribution has finite variance, a conditioned version of Donsker’s theorem shows that the rescaled random walk bridge converges in distribution to Brownian bridge. Note that Brownian bridge

(B^{\mathrm{br}}_t, t\in[0,1])

can be constructed either by conditioning a standard Brownian motion B to return to zero at time one (modulo some technicalities – this event has zero probability), or by applying an appropriate (random) linear shift

B^{\mathrm{br}}(t):= B(t) - tB(1). (*)

It is not too hard to calculate the distribution of B^{\mathrm{br}}(t) for each t\in[0,1], and with a bit more work, one can calculate the joint distribution of (B^{\mathrm{br}}(s),B^{\mathrm{br}}(t)). In particular, the joint distribution is multivariate Gaussian, and so everything depends on the covariance ‘matrix’ (which here is indexed by [0,1]).

So if we return to a random walk bridge what should the step distribution be? Simple symmetric RW is a natural choice, as then lots of the quantities we might want to consider boil down to combinatorial calculations. Cleverness and Stirling’s formula can often get us useful asymptotics. But there are lots of inconveniences, not least the requirement to be careful about parity (N has to be even for a start unless you make the walk lazy, in which case the combinatorics becomes harder), and even if these can be overcome in a given calculation, it would be better not to have this.

The claim is that the random walk with Gaussian increments is by far the easiest to analyse asymptotically. As a further heuristic, think about the statement of the central limit theorem in the case where the underlying distribution is normal: it’s true but obvious. [Indeed, it’s my favourite piece of advice to anyone taking second year probability exams to check that your proposed statement of CLT does actually work for N(\mu,\sigma^2)…] More concretely, if a RW has Gaussian increments, then the path (Z_1,\ldots,Z_N) is a multivariate normal, or a Gaussian process with finite index set. In particular, covariances define the distribution. It remains a Gaussian process after conditioning on Z_N=0, and the linear tilting argument at (*) remains true here, and can indeed be applied to turn any boundary conditions into any other boundary conditions.

The discrete Gaussian free field

We know how to generalise the domain of a random walk to higher dimensions. But what generalising the index to higher dimension? So now there is definitely no arrow of time, and the notion of a random height function above \mathbb{Z}^2 (or a subset of it) is helpful, for which a scaling limit might be a random surface rather than Brownian motion.

Because we can’t well-order \mathbb{Z}^d, it’s harder to define any such random object on the entire lattice immediately, so we start with compact connected subsets, with zero boundary conditions, as in the one-dimensional case of random walk bridge. Formally, let D be a finite subset of \mathbb{Z}^d, and the boundary \partial D those elements of D^c which are adjacent to an element of D, and let \bar D:= D\cup \partial D.

Then, the discrete Gaussian free field on D is a random real vector h^D=(h^D_x: x\in \bar D), with probability density proportional to

\mathbf{1}\{h^D_x=0, x\in\partial D\}\exp\left ( - \frac{1}{4d} \sum_{x\sim y}(h^D_x - h^D_y)^2 \right), (1)

where we write x\sim y if that x,y are adjacent in \bar D. We won’t at any stage worry much about the partition function which normalises this pdf. Note also that \frac{1}{4d} is just a convenient choice of constant, which corresponds to one of the canonical choices for the discrete Laplacian. Adjusting this constant is the same as uniformly rescaling the values taken by the field.

The immediate interpretation of (1) is that the values taken by the field at vertices which are close to each other are positively correlated. Furthermore, the form of the density is Gaussian. Concretely, if the values of h^D are fixed everywhere except one vertex x\in D, then the conditional distribution of h^D_x is Gaussian. Later, or in subsequent posts, we will heavily develop this idea. Alternatively, we could if we really wanted describe the model in terms of independent Gaussians describing the ‘increment’ along each edge in D (which we should direct), subject to a very large number of conditions, namely that the sum of increments along any directed cycle is zero. This latter description might be more useful if you wanted to define a DGFF on a more sparse graph, but won’t be useful in what follows.

Note that we can rearrange the Laplacian in (1) in terms of the transition kernel p( ) of the simple random walk of D to obtain

\exp\left( -\frac12 (h^D)^T (\mathbf{P}-\mathbf{1})h^D \right),

where P_{x,y}=p(y-x) is the transition matrix of SRW on D. In particular, this means that the free field is Gaussian, and we can extract the covariances via

\mathrm{Cov}(h^D_x,h^D_y) = \left[ (\mathbf{1}-\mathbf{P})^{-1}\right]_{x,y}

= \left[\sum_{n\ge 0} \mathbf{P}^n\right]_{x,y} = \sum_{n\ge 0} \mathbb{P}_x\left[X_n=y,\tau_{\partial D}>n\right],

where, under \mathbb{P}_x, (X_0,X_1,\ldots) is simple random walk started from x.

This final quantity records the expected number of visits to y before leaving the domain D, for a random walk started at x, and is called the Green’s function.

In summary, the DGFF on D is the centred Gaussian random vector indexed by \bar D with covariance given by the Green’s function G_D(x,y).

How many of these equivalences carries over to more general D-indexed random fields is discussed in the survey paper by Velenik. But it’s worth emphasising that having the covariance given by the Green’s function as in the definition we’ve just given is a very nice property, as there are lots of pre-existing tools for calculating these. By contrast, it’s hard to think of a natural model for an integer-valued surface of this kind, as an analogue to SRW.

[Though definitely not impossible. The nicest example I’ve heard of is for height functions of large uniform domino tilings within their ‘arctic circle’, which have GFF asymptotics. See this paper by Kenyon.]

A continuous limit?

We motivated the discussion of random walk bridge by the limit object, namely Brownian bridge. Part of the reason why the DGFF is more interesting than Gaussian random walk bridge, is that the limit object, the (continuum) Gaussian free field is hard to define classically in two dimensions.

We might suppose that the DGFF in V_N, the square box of width N has some scaling limit as N\rightarrow\infty. However, for fixed x,y\in [0,1]^2, (and taking integer parts component-wise), well-known asymptotics for SRW in a large square lattice (more on this soon hopefully) assert that

\mathrm{Cov}(h^{V_N}_{\lfloor Nx \rfloor},h^{V_N}_{\lfloor Ny\rfloor}) \sim \log |x-y|, (2)

and so any scaling limit will rescale only the square domain, not the height (since there is no N on the RHS of (2)). However, then the variance of the proposed limit is infinite everywhere.

So the GFF does not exist as a random height function on [0,1]^2, with the consequence that a) more care is needed over its abstract definition; b) the DGFF in 2D on a large square is an interesting object, since it does exist in this sense.

What makes it ‘free’?

This seemed like a natural question to ask, but I’ve received various answers. Some sources seem to suggest that having zero boundary condition is free. Other sources refer to the Hamiltonian (that is the term inside the exponential function at (1) ) as free since it depends only on the increments between values. If the Hamiltonian also depends on the heights themselves, for example via the addition of a \sum_{x} \Psi(h^D_x) term, then for suitable choice of function \Psi, this is interpreted as a model where the particles have mass. The physical interpretation of these more general Gibbs measures is discussed widely, and I’m not very comfortable with it all at the moment, but aim to come back to it later, when hopefully I will be more comfortable.

The Envelope ‘Paradox’

At the recent IMO in Hong Kong, there were several moments where the deputy leaders had to hang around, and I spent some of these moments discussing the following problem with Stephen Mackereth, my counterpart from New Zealand. He’s a mathematically-trained philosopher, so has a similar level of skepticism to me, but for different reasons, regarding supposed paradoxes in probability. Because, as we will see shortly, I don’t think this is a paradox in even the slightest fashion, I think there’s probably too much written about this on the internet already. So I’m aware that contributing further to this oeuvre is hypocritical, but we did the thinking in HKUST’s apparently famous Einstein Cafe, so it makes sense to write down the thoughts.

[And then forget about it for eight weeks. Oops.]

The ‘Paradox’

Here’s the situation. A cryptic friend gives you an envelope containing some sum of money, and shows you a second envelope. They then inform you that one of the envelopes contains twice as much money as the other. It’s implicit in this that the choice of which is which is uniform. You have the option to switch envelopes. Should you?

The supposed paradox arises by considering the amount in your envelope, say X. In the absence of further information, it is equally likely that the other envelope contains X/2 as 2X. Therefore, the average value of the other envelope is

\frac12 \left(\frac{X}{2}+2X \right)= \frac54 X > X.

So you should switch, since on average you gain money. But this is paradoxical, since the assignment of larger and smaller sums was uniform, so switching envelope should make no difference.

Probabilistic setup

This is not supposed to be a problem on a first-year probability exercise sheet. It’s supposed to be conducive to light discussion. So saying “I won’t engage with this problem until you tell me what the probability space is” doesn’t go down terribly well. But it is important to work out what is random, and what isn’t.

There are two sources of randomness, or at least ignorance. Firstly, there is the pair of values contained in the envelopes. Secondly, there is the assignment of this pair of values to the two envelopes. The second is a source of randomness, and this problem is founded on the premise that this second stage is ‘symmetric enough’ to smooth over any complications in the first stage. If we think that probability isn’t broken (and that’s what I think), then the answer is probably that the second stage isn’t symmetric enough.

Or, that the first stage isn’t very well-defined. In what follows, I’m going to make the second stage very symmetric, at the expense of setting up the first stage in what seems to me a reasonable way using the conventional language of probability theory to record our ignorance about the values in play.

So what’s the first stage? We must have a set of possible pairs of values taken by the envelopes. Let’s call this A, so

A\subset \mathbb{A}:=\{(x,2x)\,:\, x\in (0,\infty)\}.

Maybe we know what A is, but maybe we don’t, in which we should take A=\mathbb{A}, on the grounds that any pair is possible. Suppose that your friend has chosen the pair of values according to some distribution on \mathbb{A}, which we’ll assume has a density f, which is known by you. Maybe this isn’t the actual density, but it serves perfectly well if you treat it as *your* opinion on the likelihood. Then this actually does reduce to a problem along the lines of first-year probability, whether or not you get to see the amount in your envelope.

Suppose first that you do get to see the amount, and that it is x. Then the conditional probabilities that the pair is (x/2,x) or (x,2x) are, respectively

\frac{f(x/2,x)}{f(x/2,x)+f(x,2x)},\quad \frac{f(x,2x)}{f(x/2,x)+f(x,2x)}.

So you can work out your expected gain by switching, and decide accordingly. If you don’t know the value in your envelope, you can still work out the probability that it is better (in expectation) to switch, but this isn’t really a hugely meaningful measure, unless it is zero or one.

It’s worth noting that if you can view inside your envelope, and you know A has a particular form, then the game becomes determined. For example, if

A\subset \{(n,2n), n\text{ an odd integer}\},

then life is very easy. If you open your envelope and see an odd integer, you should switch, and if you see an even integer you shouldn’t.

We’ll return at the end to discuss a case where it is always better to switch, and why this isn’t actually a paradox.

Improper prior and paradox of resampling when \mathbb{E}=\infty

For now though, let’s assume that we don’t know anything about the amounts of money in the envelopes. Earlier, we said that “in the absence of further information, it is equally likely that the other envelope contains X/2 as 2X”. In the language of a distribution on \mathbb{A}, we are taking the uniform measure. Of course this not a distribution, in the same way that there isn’t a uniform distribution on the positive reals.

However, if this is your belief about the values in the pair of envelopes, what do you think is the mean value of the content of your envelope? Well, you think all values are equally likely. So, even though this isn’t a distribution, you pretty much think the value of your envelope has infinite expectation.

[This is where the philosophy comes in I guess. Is (expressing uniform ignorance about the content of the other envelope given knowledge of your own) the same as (expressing uniform ignorance of both envelopes at the beginning)? I think it is, even though it has a different consequence here, since the former can be turned into a proper distribution, whereas the latter cannot.]

Let’s briefly consider an alternative example. It’s fairly easy to conjure up distributions which are almost surely finite but which have infinite expectation. For example \mathbb{P}(X=2^k)=2^{-k} for k=1,2,…, which is the content of the *St. Petersburg paradox*, another supposed paradox in probability, but one whose resolution is a bit more clear.

Anyway, let X and Y be independent copies of such a distribution. Now suppose your friend offers you an envelope containing amount X. You look at the value, and then you are offered the opportunity to switch to an envelope containing amount Y. Should you?

Well, if expectation is what you care about, then you definitely should. Because with probability one, you are staring at a finite value in your envelope, whereas the other unknown envelope promises infinite expectation, which is certainly larger than the value that you’re looking at.

Is this also a paradox? I definitely don’t think it is. The expectation of the content of your envelope is infinite, the expected gain is infinite with probability one, which is consistent with the expected content of the other envelope being infinite. [Note that you don’t want to be claiming that the expectation of X-Y is zero.]

An example density function

As an exercise that isn’t necessarily hugely interesting, let’s assume that f, the distribution of the smaller of the pair, is \mathrm{Exp}(\lambda). So the mean of this smaller number is 1/\lambda. Then, conditional on seeing x in my envelope, the expected value of the number in the other envelope is

\frac{\frac{x}{2} e^{-\lambda x/2} + 2x e^{-\lambda x}}{e^{-\lambda x/2}+ e^{-\lambda x}}. (*)

Some straightforward manipulation shows that this quantity is at least x (implying it’s advantageous to switch) precisely when

e^{-\lambda x/2}\ge \frac12.

That is, when x\le \frac{2\log 2}{\lambda}. The shape of this interval should fit our intuition, namely that the optimal strategy should be to switch if the value in your envelope is small enough.

The point of doing this calculation is to emphasise that it ceases to be an interesting problem, and certainly ceases to be a paradox of any kind, once we specify f concretely. It doesn’t matter whether this is some true distribution (ie the friend is genuinely sampling the values somehow at random), or rather a perceived likelihood (that happens to be normalisable).

What if you should always switch?

The statement of the paradox only really has any bite if the suggestion is that we should always switch. Earlier, we discussed potential objections to considering the uniform prior in this setting, but what about other possible distributions f which might lead to this conclusion?

As at (*), we can conclude that when f(x)+f(x/2)>0, we should switch on seeing x precisely if

f(x)\ge 2f\left(\frac{x}{2}\right).

Therefore, partitioning the support of f into a collection of geometric sequences with exponent 2, it is clear that the mean of f is infinite if everything is integer-valued. If f is real-valued, there are some complications, but so long as everything is measurable, the same conclusion will hold.

So the you-should-switch-given-x strategy can only hold for all values of x if f has infinite mean. This pretty much wraps up my feelings. If the mean isn’t infinite, the statement of the paradox no longer holds, and if it is infinite, then the paradox dissolves into a statement about trying to order various expectations, all of which are infinite.

Conclusions

Mathematical summary: it’s Bayes. Things may be exchangeable initially, but not once you condition on the value of one of them! Well, not unless you have a very specific prior.

Philosophical summary: everything in my argument depends on the premise that one can always describe the protagonist’s prior opinion on the contents of the pair of envelopes with a (possibly degenerate) distribution. I feel this is reasonable. As soon as you write down \frac12 \cdot\frac{x}{2} + \frac12 \cdot2x, you are doing a conditional expectation, and it’s got to be conditional with respect to something. Here it’s the uniform prior, or at least the uniform prior restricted to the set of values that are now possible given the revelation of your number.

Second mathematical summary: once you are working with the uniform prior, or any measure with infinite mean, there’s no reason why

\mathbb{E}\left[X|Y\right]>Y,

with probability one (in terms of Y) should be surprising, since the LHS is (almost-surely) infinite while the RHS is almost surely finite, despite having infinite mean itself.

Questionable Statistics

My four years in the Statistics Department have been characterised by a continual feeling that I don’t know enough about statistics. That said, I can’t help but notice how statistics are used in non-specialist contexts, and of course there are times when eyebrows might reasonably be raised. Andrew Gelman (who gave a very enjoyable departmental seminar on this topic in May) has a notable blog critiquing poor statistical methodology, especially in social science; and the blogosphere is packed with compilations of egregious statistical oxymorons (“All schools *must* be above average” etc).

I’m aiming for more of a middle ground here. I’ve picked three examples of articles I found interesting at some level over the past few months. In each of them, some kind of data presentation or experiment design arose, and in all cases, I think it resulted in more questions rather than more answers, not really in a good way, since in all cases I did actually want to know more about what was going on.

Extreme rounding errors

This Guardian article about inequality in career trajectories for teachers in schools serving more advantaged and more deprived areas raises a lot of interesting questions, and actually proposes answers to several of them. The problem is the double bar chart halfway down.

First a superficial moan. Rounding to the nearest integer is fine, indeed sensible in most contexts, but not when you are treating small numbers. Comparing the extreme bars, we could have 11.5 v 12.5 or we could have 10.5 v 13.5. The headlines would probably read “N% more teachers leave disadvantaged schools”. In the first case N is 9, in the second it’s 29. So it’s not really a superficial moan. Data is supposed to offer a measure of the effect under consideration, and a discussion of how treat this effect requires a veneer of reliability about the size of the effect. We don’t just need to be asking “how can we solve this”; we also need to be asking “is this worth solving?” I’m not saying the answer is no, but based on this graph alone the answer is not ‘definitely yes’. [1]

A question to ask in the context of this report is “why do more teachers leave the profession from schools in deprived areas?” But I think most people could speculate a broadly sensible answer to this. The data raises the more subtle question “why are the rates of leaving the profession so similar across all types of school?” Essentially I think this survey is a negative result – the effect just isn’t as strong as I’d have suspected, nor I imagine the authors nor the Guardian education editorship.

By contrast, even with the amateurish presentation, the rates of moving school clearly are significant and worth talking about based on this survey. Significant data demands further information though. Once an effect is significant, we need to know more details, like to what extent the teachers are moving to schools in the same band or to mostly towards the left end of the graph. Again though, there’s no comment on the magnitude of this effect. Assuming the rate is given per year [2], then among teachers who do not leave the profession, the average tenure of a given job is more than ten years, even in the most deprived category. Maybe this is all very heavy-tailed, and so the average isn’t a good thing to be using. But this doesn’t seem like an unreasonable number to me? If the claim is that this is a major problem for the teaching profession, then we should be told what the numbers are in vaguely comparable industries.

The final thing is pretty bad. Three paragraphs under the graph, it is claimed that the rate of leaving is 70% higher in most- than least-deprived categories. We decided that N was somewhere between 9 and 29. But not 70. I can only assume that this came from the ‘moving’ category instead…

[1] I’m assuming the sample pool was large enough that the effect is statistically significant, because there isn’t a direct link to the source data or even the source report unfortunately.

[2] and is it plausible that this is rate per year? Does the profession really experience ten percent annual turnover? Is it even plausible that the scale is the same? Maybe rate of leaving really is twice rate of moving school, and I have no idea about anything, but this seems questionable.

Many, many categories

The bank Halifax commissions an annual survey about pocket money, and this year’s received widespread coverage. My attention was caught by this article on BBC news. The short summary is that they surveyed about 1,200 children from about 600 families (so in many cases a direct comparison between siblings could have been possible), and it sounds like they restricted to the age range 8-15, and asked how much pocket money they received, and whether they wanted more.

The clickbait summary was that boys receive about 13% more than girls. While it doesn’t give the exact numbers, it also mentions that more boys than girls thought they deserved more money. A psychologist conjectures that this latter effect might explain the former effect, and indeed this seems plausible. The problem is that elsewhere in the article, it says that in the corresponding survey in 2015, boys receive on average only 2% more than girls.

We therefore have to consider a couple of things. 1) Is it plausible that the actual average pocket money rates (among the whole population) have fluctuated in such a gender-dependent way in 12 months? 2) If not, then can we still say something useful, even if our confidence in the methodology of the survey is reduced?

I think the answer to the first question is definitely ‘no’. So we come to the second question. Firstly, how could this have arisen? Well, there’s no indication how the children were chosen, but equally it’s hard to think of a way of choosing them that would increase artificially the boys’ pocket money. In this last sentence, I’m deliberately suggesting the null hypothesis that boys and girls are the same. This is misleading. For now, our null hypothesis is that 2015 should be the same as 2016. Indeed, it’s equally plausible that this year’s summary data was ‘more correct’, and so we are asking why the difference was excessively compressed in 2015.

This is entirely a matter of effect size, and it’s to the article’s credit that so much information is given further down the page that we can actually make the following remarks quantitatively. At the bottom, they show the average pay (across genders) in different regions. There is a lot of variation here. The rate of pay in East Anglia is only 60% what it is in London. At some level, I’m surprised it isn’t even higher. The article tells us that children between the ages of eight and fifteen were surveyed. By what rate did your pocket money change in that interval? Surely it doubled at the very least? Weekly pocket money at age eight is for sweets and an occasional packet of Pokemon cards (substitute 21st century equivalent freely…). Teenagers need to pay for outings and clothes etc, so there’s no reason at all why this should be comparable. For the purposes of definiteness in the next calculation, let’s say that fifteen-year-olds get paid four times as much as eight-year-olds on average, which I think is rather conservative.

So here’s a plausible reason for the fluctuations. Suppose between 2015 and 2016, the survey substituted 10 eight-year-old boys from the Midlands for 10 fifteen-year-old boys from London. The expectation change in the average amongst the 600 boys surveyed is

\frac{10}{600} \times (\sim 6.5) \times (4 \times (0.6)^{-1}) \approx 0.7,

in pounds, which is on the order of the actual change from 2015 to 2016. Choosing the participants in a non-uniform way seems like too basic a mistake to make, but mild fluctuations in the distribution of ages and locations, as well as the occasional outlier (they want to use the mean, so one oligarch’s daughter could make a noticeable difference in a sample size of 600) seems more plausible as an explanation to me than a general change in the population rates. Choosing participants in a uniform way is just hard when there are loads and loads of categories.

I’m not really saying that this survey is bad – they have to report what they found, and without a lot more information, I have no idea whether this year’s is more or less plausible than last year’s. But if you add the phrase “especially in 2016” to everything the psychologist says, it suddenly seems a bit ridiculous. So it would be worth making the remark that even if this effect is statistically significant, that doesn’t mean the effect size is large relative to lots of other less controversial effect sizes visible in the data.

Comparing precious metals

I’ve recently returning from this year’s International Mathematical Olympiad, and now we are well into the swing of its sweatier namesake in Rio. At both events, there have been many opportunities to observe how different people with different ambitions and levels of extroversion display different levels of pleasure at the same medal outcomes. In the light of this, I was drawn to this article, which has been getting quite a lot of traction online.

The basic premise is simple. Silver medallists are less happy than bronze medallists (when – unlike at the IMO – there is only one of each medal), and it’s not hard to come up with a cheap narrative: silver-winners are disappointed to have missed out on gold; bronze-winners are glad to have got a medal at all. This is all pretty convincing, so maybe there’s no need actually to collect any data, especially since asking a crowd of undergraduates to rate faces on the 1-10 agony-ecstacy scale sounds like a bit of an effort. Let’s read a couple of sentences, describing a follow-up study on judo medallists at the Athens Olympics:

Altogether, they found that thirteen of the fourteen gold medal winners smiled immediately after they completed their winning match, while eighteen of the twenty-six bronze medalists smiled. However, none of the silver medalists smiled immediately after their match ended.

You might wonder how come there are almost twice as many bronzes as golds. Well, very full details about the judo repechage structure can be found here, but the key information is that all bronze-medallists won their final match, as, naturally, did the gold-medallists. The silver-medallists lost their final match, ie the gold-medal final. So this study is literally reporting that highly-driven competitive sportsmen are happier straight after they win, than straight after they lose. This doesn’t have much to do with medals, and isn’t very exciting, in my opinion. I would, however, be interested to know what was eating the one gold medallist who didn’t smile immediately.

This isn’t really a gripe about the statistics, more about the writing. They are testing for an effect which seems highly plausible, but is branded as surprising. The study for which they give more methodological details in fact seems to be testing for a related effect, which is not just plausible, but essentially self-evident. So naturally I want to know more about the original study, which is the only place left for there to be any interesting effects, for example if they looked at athletics events which don’t have a binary elimination structure. But we aren’t told enough to find these interesting effects, if they exist. How annoying. The only thing left is to think about is my bronze medal at IMO 2008. They said eight years would be enough for the wounds to heal but I’m still not smiling.