# BMO2 2018

The second round of the British Mathematical Olympiad was taken yesterday by the 100 or so top scoring eligible participants from the first round, as well as some open entries. Qualifying for BMO2 is worth celebrating in its own right. The goal of the setters is to find the sweet spot of difficult but stimulating for the eligible participants, which ultimately means it’s likely to be the most challenging exam many of the candidates sit while in high school, at least in mathematics.

I know that lots of students view BMO2 as something actively worth preparing for. As with everything, this is a good attitude in moderation. Part of the goal in writing about the questions at such length (and in particular not just presenting direct solutions) is because I think at this level it’s particularly easy to devote more time than needed to preparation, and use it poorly.

All these questions could be solved by able children. In fact, each could be solved by able children in less than an hour. You definitely count as an able child if you qualified or if your teacher allowed you to make an open entry! Others count too naturally. But most candidates won’t in fact solve all the questions, and many won’t solve any. And I think candidates often come up with the wrong reasons why they didn’t solve problems. “I didn’t know the right theorems” is very very rarely the reason. Olympiad problems have standard themes and recurring tropes, but the task is not to look at the problem and decide that it is an example of Olympiad technique #371. The task is actually to have as many ideas as possible, and eliminate the ones that don’t work as quickly as possible.

The best way to realise that an idea works is to solve the problem immediately. For the majority of occasions when we’re not lucky enough for that to happen, the second-best way to realise that an idea works is to see that it makes the problem look a bit more like something familiar. Conversely, the best way to realise that an idea doesn’t work is to observe that if it worked it would solve a stronger but false problem too. (Eg Fermat’s Last Theorem *does* have solutions over the reals…) The second-best way to realise that an idea doesn’t work is to have the confidence that you’ve tried it enough and you’ve only made the problem harder, or less familiar.

Both of these second-best ideas do require a bit of experience, but I will try to explain why none of the ideas I needed for various solutions this year required any knowledge beyond the school syllabus, some similarities to recent BMOs, and a small bit of creativity.

As usual, the caveat that these are not really solutions, and certainly not official solutions, but they are close enough to spoil the problems for anyone who hasn’t tried them by themselves already. Of course, the copyright for the problems is held by BMOS, and reproduced here with permission.

Question One

I wrote this question. Perhaps as a focal point of the renaissance of my interest in geometry, or at least my interest in teaching geometry, I have quite a lot to say about the problem, its solutions, its origin story, the use of directed angles, the non-use of coordinate methods and so on. In an ideal world I would write a book about this sort of thing, but for now, a long and separate post is the answer.

This will be available once I’ve successfully de-flooded my apartment.

Question Two

I also wrote this problem, though I feel it’s only fair to show the version I submitted to the BMO committee. All the credit for the magical statement that appears above lies with them. There is a less magical origin story as well, but hopefully with some interesting combinatorial probability, which is postponed until the end of this post.One quick observation is that in my version Joe / Hatter gets to keep going forever. As we shall see, all the business happens in the first N steps, but a priori one doesn’t know that, and in my version it forces you to strategise slightly differently for Neel / Alice. In the competition version, we know Alice is done as soon as she visits a place for a second time, but not in the original. So in the original we only have to consider ‘avoid one place’ rather than the multiple possibilities now of ‘avoid one place’ or ‘visit a place again’.

But I think the best idea is to get Alice to avoid one particular place $c\not\equiv 0$ whenever possible. At all times she has two possible options for where to go next, lets say $b_k+a_k, b_k-a_k$ in the language of the original statement. We lose nothing by assuming $-N/2 < a_k\le N/2$, and certainly it would be ridiculous for Joe / Hatter ever to choose $a_k=0$. The only time Alice’s strategy doesn’t work is when both of these are congruent to $c$, which implies $N\,|\, 2a_k$, and thus we must have $N= 2a_k$. In other words, Alice’s strategy will always work if N is odd.

I think it’s really worth noticing that the previous argument is weak. We certainly did not show that N must be odd for Alice to win. We showed that Alice can avoid a congruence class modulo an odd integer. We didn’t really need that odd integer to be N for this to work. In particular, if N has an odd factor p (say a prime), then the same argument works to show that we can avoid visiting any site with label congruent to 1 modulo p.

It’s actually very slightly more complicated. In the original argument, we didn’t need to use any property of $b_k$. But obviously here, if $b_k\equiv 1$ modulo p and $p\,|\,a_k$, then certainly $b_{k+1}\equiv 1$ modulo p. So we have to prove instead that Alice can ensure she never ‘visits 1 modulo p for the first time’. Which is fine, by the same argument.

So, we’ve shown that Neel / Alice wins if N is odd, or has an odd factor. The only values that remain are powers of 2. I should confess that I was genuinely a little surprised that Joe / Hatter wins in the power of 2 case. You can find a construction fairly easily for N=2 and N=4, but I suspected that might be a facet of small numbers. Why? Because it still felt we could avoid a particular site. In order for Alice’s strategy to fail, we have to end up exactly opposite the particular site at exactly the time when the next $a_k=N/2$, and so maybe we could try to avoid that second site as well, and so on backwards?

But that turned out to be a good example of something that got very complicated quite quickly with little insight. And, as discussed at the beginning, that’s often a sign in a competition problem that your idea isn’t so good. (Obviously, when composing a problem, that’s no guarantee at all. Sometimes things are true but no good ideas work.) So we want other ideas. Note that for N=4, the sequence (2,1,2) works for Joe / Hatter, because that forces Alice / Neel to visit either (0,2,1,3) or (0,2,3,1). In particular, this strategy gave Alice no control on the first step nor the last step, and the consequence is that we force her to visit the evens first, then transfer to an odd, and then force her to visit the other odd.

We might play around with N=8, or we might proceed directly to a general extension. If we have a Joe / Hatter strategy for N, then by doubling all the $a_k$s, we have a strategy for 2N which visits all the even sites in the first N steps. But then we can move to an odd site eg by taking $a_N=1$. Just as in the N=4 case, it doesn’t matter which odd site we start from, since if we again double all the $a_k$s, we will visit all the other odd sites. This gives us an inductive construction of a strategy for powers of two. To check it’s understood, the sequence for N=8 is (4,2,4,1,4,2,4).

Although we don’t use it, note that this strategy takes Alice on a tour of sites described by decreasing order of largest power of two dividing the label of the site.

Question Three

I have a theory that the average marks on Q1, Q2 and Q3 on this year’s paper will be in ascending order rather than, as one might expect, descending order. I think my theory will fail because it’s an unavoidable fact of life that in any exam, candidates normally start at the beginning, and don’t move to the middle until making earlier progress. But I think that’s the only reason my theory will fail.

Like kitchen cleanliness or children’s character flaws, it’s hard to compare one’s own problem proposals with others’ rationally. But I felt that, allowing for general levels of geometry non-preference, Q3 was more approachable than Q2, especially to any candidate who’d prepared by looking at some past papers.

I’m in no way a number theorist, but I know three or four common themes when one is asked to prove that a certain sequence contains no squares, or almost no squares. [3a]

• Number theoretic properties of the sequence of squares. Squares cannot be 3 modulo 4 for example. They also cannot be 2 modulo 4, and thus they also cannot be $2^{k-1}$ modulo $2^k$ for any even k. This first observation was essentially the body of most solutions to Q4 of BMO1 2016, among many others.
• Soft properties of the sequence of squares. The sequence of squares grows quadratically. Sometimes we can show a quadratic sequence will have no overlap with some other sequence for basic reasons. This is especially common if the second sequence is also quadratic or similar. For example, the expression $n^2+3n-4$ is typically not a square because

$(n+1)^2 = n^2+2n+1 < n^2 + 3n - 4 < n^2+4n+4 = (n+2)^2,$

• when n is large. In fact the right hand inequality is always true, and the left hand inequality is true for $n\ge 6$, which doesn’t leave too many cases to check (and n=5 does actually give a square). This type of argument has been quite common on BMO recently, directly on Q1 of BMO1 2011 and also Q3 of BMO1 2016. An example in a more abstract setting is Q3 of Balkan MO 2007, which I greatly enjoyed at the time…
• Number theoretic properties of the definition of a square. A square is the product of an integer with itself, and so if we want the product of two or more integers to be a square, then this imposes conditions on the shared factors of the two integers. I’ll cite some examples shortly.
• Huge theorems. Some old paper which I encountered as a child asked us to find all solutions to $x^2-1=2^y$. Or similar – I can’t find it now – but Q2 of BMO2 2006 is close enough to the sensible approach to the problem. I think it’s more helpful to think about this as proving that a particular sequence rarely includes powers of two than that a particular sequence rarely includes squares. But either way, one could in principle use the Catalan conjecture, which controls all non-trivial solutions to $a^p - b^q=1$. Fortunately, the Catalan conjecture was proved, by Mihailescu (readable blog about it), between the paper being set, and me attempting it a few years later. I’m being flippant. This is not a standard trope in solving these questions. For very obvious reasons. If it can be killed by direct reference to a known theorem, it won’t be set.

Anyway, those references (and more to follow) are to illuminate why I thought this question was not too hard. Indeed, I feel one can make substantial meta-progress in your head. The given information is interesting, but for the purpose of this question is just a black box. By subtracting the expression for m from the expression for 2m, we can derive an expression for the required sum. It’ll be a quartic in m, because the leading terms won’t cancel.

This leaves all three of the methods above very accessible. Unfortunately m=0 would be a square were it not excluded specifically, so a modular arithmetic approach is unlikely to work directly. Bounding between two quadratics is entirely plausible, as is factorising and comparing number theoretic properties of the factors. I thought the second one seemed more promising, but either way, having two potentially good ideas based only on recent BMO problems before even writing anything down is a good opening.

We do have to calculate the sum, and I make it $\frac{1}{4}m^2(5m+3)(3m+1)$. Now I’m not so sure how to bound this between two quadratics, because the leading coefficient is 15/4, which is not the square of a rational. But the factor analysis approach is definitely on.

Let’s review this generally. Throughout, suppose m,n are positive integers.

Claim 1: if mn is a square, then m and n are squares too.

Claim 2: if mn is a square, then m=n.

Both of these claims are false. However, a version of Claim 1 is true.

Claim 1′: if mn is a square, and m,n are coprime, then each is a square.

Even though this isn’t a named theorem, it is true, and well-known and can be used without proof. One way to prove it is to write m,n as products of primes, and show that since the primes are disjoint, the exponents must all be even. Most other methods will be equivalent to this, maybe with less notation.

What is good about Claim 1′ is that more complicated versions are true for for essentially similar reasons. For example

Claim 3: if mn is $6k^2$, and m,n are coprime, then either one is a square and the other is six times a square; or one is two times a square, and the other is three times a square.

Claim 4: if mn is a square, and the greatest common divisor (m,n) is either 5 or 1, then either each is a square, or each is five times a square.

I cited some examples of the other methods I proposed. Here are some examples of this sort of thing in recent BMOs:

• Q4 of BMO2 2016. Even the statement is suggestive. There are more complicated routes, but showing that $(2p-u-v)(2p+u+v)$ is a square is one way to proceed, and then Claim 4 directly applies after checking a gcd.
• Q2 of BMO1 2014 is similar, but it is much more explicit that this is the correct approach. Expose $p^2$ then use a (correct) version of Claim 2.
• Q1 of BMO2 2009. Show that a and b must each be a square times 41 for rationality reasons.
• Q6 of BMO1 2006. After sensible focused substitutions, obtain $3n^2=q(q-1)$. Rather than try to ‘solve’ this, extract the key properties along the lines of Claim 3, eliminate one of the cases by modular arithmetic, and return to the required statement.
• Q3 of BMO2 2010 requires the student to reproduce the essentials of the arguments above in the case of a particular degree six polynomial with a tractable factorisation, along with some mild square-sandwiching or bounding arguments as discussed earlier.

In conclusion, I’m trying to say that if I claim I am confident I can find all integers m such that $\frac14 m^2(5m+3)(3m+1)$ is a square, this is not based on complicated adult experience, but rather on recent problems at a similar sensible level. And I still don’t think it counts as Olympiad technique #371 – thinking about divisibility of factors is a good thing to do when talking about integers, and so it’s just a natural entry point into problems about squares. Plenty of problems might have this sort of thing as a starting point or an ending point.

For this problem we need a different ending point. To be brief, the factors (5m+3) and (3m+1) cannot both be squares because 5m+3 is never a square. So since the gcd of these factors is 1, 2 or 4, the only other option is that they are both squares times 2. And because -1 is not a square modulo 3, so 1 is not a (square times 2) modulo 3, and we are done. Note that this was a literal example of the first technique for proving something is not a square, proposed all the way back at the start of this section.

Footnotes

[3a] – some common themes for proving that sequences do include squares might be comparison with Pell’s Equations, or comparison with the explicit construction of solutions to Pythagoras’s equation.

Question Four

An example of an absorbing function is $f(x)=\lfloor x\rfloor$. One challenge is thinking of many other examples. This one is fine, but it’s true under replacing 2018 by 1 in the statement, and so it doesn’t really capture the richness of the situation.

Notation: the pre-image of a function is the language used to describe the inverse of a function which doesn’t have a uniquely-defined inverse. That is, if f is not injective, and multiple arguments have the output. We write $f^{-1}(y)=\{x: f(x)=y\}$. In particular, this is a set of values, not necessarily a single value. We also use $\mathbb{Z}$ to denote the integers. We can apply pre-images to sets as well. So for example $f^{-1}(\mathbb{Z})=\{x : f(x)\in \mathbb{Z}\}$.

This question is tricky, and I will be surprised to see many full solutions from the eligible candidates. It rewards the sort of organisation and clear-thinking that is easier said than done in a time-pressured contest environment. There are also many many possible things to consider, and so is particularly challenging in the short timeframe of BMO2 as opposed to, for example, appearing as the middle question on a 4.5 hour international-level paper.

At a meta-level we are being asked to confirm or deny the existence of absorbing functions where $f^{-1}(\mathbb{Z})$ is small in some sense, firstly when actually having finite size, secondly when, although infinite, being a small sort of infinite, namely spread out in a sparse, well-ordered way (you might say countable if familiar with that language). The general idea is presumably that it’s hard to be absorbing if the pre-image of the integers is small, and so it’s reasonable to assume that it’s too hard if this is finite; but perhaps not quite too hard if it’s merely countable. So (no, yes) is a sensible guess at the answer to the question, though (no, no) might also fit, maybe with a harder argument for the second no.

Ok, instead of trying a) or b), just play with the configuration. Let $A=f^{-1}(\mathbb{Z})$. We will use this frequently. In the picture below, f maps the real line on top to the real line below. If two reals get mapped to the same image, then whether or not the image is an integer, the whole (closed) interval bounded by the two reals also gets mapped to the same image. This is because f is weakly increasing.

This means that A consists of various intervals (which include single points). But in both a) and b) we know that A is ‘small’, and so it cannot contain any intervals of positive length. So in fact A is a set of separated real values. In the case of a) it’s a finite set.

Do we want to try and iterate this, and look at $f^{-1}(A)$? Well maybe, but we don’t know much about about pre-images of A, only about pre-images of $\mathbb{Z}$.

But note that the pre-image of the pre-image of the … of the pre-image [2017 times] of A must be the whole real line, so at some point, some value has a pre-image that is an interval. So if we’re guessing that the answer to b) is yes, then we need to give a construction.

$\mathbb{R} \stackrel{f}\longrightarrow ?? \stackrel{f}\longrightarrow\quad\ldots\quad \stackrel{f}\longrightarrow ??\stackrel{f}\longrightarrow A \stackrel{f}\longrightarrow f(A)\subset \mathbb{Z}.$

If you play around for a bit, it seems very unlikely to be absorbing if the integers don’t get mapped to the integers. You can try to prove this, but at the moment we’re just aiming for a construction, so let’s assume $f(\mathbb{Z})\subset \mathbb{Z}$. It would be convenient if f(n)=n for all $n\in \mathbb{Z}$, but we already know that this won’t work because then the pre-image of the pre-image of the… of $\mathbb{Z}$ is always $\mathbb{Z}$, but we need it to be $\mathbb{R}$.

The ideal situation would be if $A= \mathbb{Z}\cup \{\ldots, a'_{-1},a'_0,a'_1,\ldots\}$, where the pre-image of $\{\ldots, a'_{-1},a_0,a'_1,\ldots\}$ is pretty much everything.

Informally, we are specifically banned from mapping intervals directly onto an integer. So have an intermediate set, and try to map almost everything (except the integers and the set itself) onto that set, so and map that set into the integers.

At this point, you really just have to have the right idea and finish it. Many things will work, but this seems the easiest to me. Let the set A consist of the integers and the (integers plus 1/2). And for $x\in A$, f(x)=2x. This is what f looks like so far.

Here the black crosses are integers, and the purple crosses are (integers plus 1/2). But now we need to make as many reals as possible in the top row map to a purple cross (which is allowed, because purple crosses aren’t integers), but we need also to preserve the weakly increasing property. Fortunately, we can exactly do that. Each cross of either colour in the top row maps to a black cross in the middle row (ie an integer), so we can map the open interval between crosses in the top row to a purple cross in the middle row. As shown in red:

Note that this is consistent. The fact that I haven’t drawn in the red cones into the bottom row is only because I didn’t use the bottom row to motivate doing this. I’ve shown a consistent definition of f that maps all the reals onto the integers in two steps. If it’s an integer to begin with, that was great; if it was an (integer plus 1/2) to begin with then it becomes an integer in one step and stays an integer; and otherwise it first maps to an (integer plus 1/2), and then to an integer in the second step.

To check you’ve understood, try to write down a standalone definition of this function.

I’ve therefore solved part b) with the alternative condition $\ldots a_{-1} which isn’t exactly as required. It requires one small and simple idea to convert to a solution to the actual statement. See if you can find it yourself!

I think part a) is harder, not because the solution will look more complicated, but because there are so many potential partial results you could try to prove, because there are so many sets you could consider. To name a few: the image of f, the image of f intersected with $\mathbb{Z}$, the image of $\mathbb{Z}$, the 2018-composition image $f^{2018}(\mathbb{R})$, the 2018-composition image $f^{2018}(\mathbb{Z})$ and so on and so forth. You might have good insight into the wrong things.

For me, the crucial observation (which you can see from the figure in the b) construction) is that when composing an increasing function with itself, the ‘trajectories’ are either increasing or decreasing. That is, if $x\le f(x)$ (respectively, $x\ge f(x)$), then $x\le f(x)\le f^2(x)\le f^3(x)\le\ldots$ (respectively $x\ge f(x)\ge f^2(x)\ge \ldots$). Again, you can think of this as Olympiad technique #371 if you insist, but I don’t think that’s helpful. There are lots of things one could try to say here, and this turns out to be natural, true and useful, but you can’t know it’s useful until you play with it.

Anyway, we’re playing with part a), and we know that $f^k(x)$ is an integer for all large enough k, and that $f^{k+1}(x)$ is also an integer, so $f^k(x)$ is one of a finite set of integers because of the condition on A. But we’ve seen the sequence $x,f(x),f^2(x),\ldots$ is weakly increasing or weakly decreasing, and so if we also know it’s eventually bounded (because eventually it’s in this finite set) then it must eventually be constant. And this constant is one of the integers, say n. But unless we started from n, this means that f(n)=n, but also f(x)=n for some other real value x. And so exactly as at the very very beginning, that’s bad, because then the whole interval [x,n] gets mapped to n, which is a contradiction.

Question Two – Origin story

The origin story for Q2 started in a talk I heard by Renan Gross at Weizmann, who referenced some of the history of Scenery Reconstruction. Roughly speaking, we colour the integers (say with two colours), and then let loose a random walker, who tells us the sequence of colours she observes during her walk, but no other information about the walk itself.

How much information can we recover about the colouring? Obviously, the best we can hope for is to recover the colouring, up to translations and reflection, since for every possible random walk trajectory, the exact reflection is equally probable, and we are given no information about the starting point.

Since lots of the transitions between recoverable and unrecoverable depend on the periodicity of the colouring, a reasonable toy model is to do it on a cycle. Note that the Strong Law of Large Numbers tells us that we almost surely recover the number of black sites and white sites from an the infinite trajectory of the random walk. Of course it’s possible that there are only two black vertices, and they are adjacent, and the walker oscillates between them, thus seeing BBBBBB… But this is extremely unlikely. You could think of this in Bayesian terms as strongly increasing the prior on the whole cycle being black, but I think initially it’s best to do this as an infinite-time, SLLN problem not as finite time WLLN/CLT reweightings of anything.

But what more? It’s clear that the lengths of all black substrings should follow some mixed geometric-ish distribution, and this distribution will almost surely wash out as the empirical distribution in an SLLN sense. But it’s tricky to justify why such a mixed geometric-ish distribution should be uniquely determined by the lengths of black arcs in the cycle. But it does definitely feel like we should have enough information to reconstruct the colouring up to reflection/rotation with probability one. For example, analogously to the number of black vertices and the number of white vertices, we should be able to recover the number of adjacent black vertices, the number of adjacent white vertices, and the number of black-white adjacent vertices, and so on.

Anyway, this can be done, and it follows as a consequence of various authors’ work answering some more general conjectures of Benjamini and, separately, of den Hollander and Keane. Douglas Howard [DH] shows a handful of generalisations of this, as do Benjamini and Kesten [BK]. Most of this work is focused on sceneries on $\mathbb{Z}$, but periodic sceneries are often used as a basis, and of course, the only difference between periodic sceneries on $\mathbb{Z}$ and sceneries on the N-cycle are whether you know the period in advance. [BK] show that ‘almost all’ sceneries are distinguishable in a particular sense, in response to which Lindenstrauss [L99] exhibits a large family of sceneries which are not distinguishable. A readable but technical review is [ML].

So Renan’s talk was about the similar problem (and generalisations) on the hypercube [GG]. Rather than paraphrase the main differences badly, you can read his own excellent blog post about the work.

On the train back to Haifa from Rehovot, I was thinking a bit about the cycle case, and what happens if you generalise the random walk with varying jump lengths, or indeed introduce a demon walker, whose goal is to make it as hard as possible for the reviewer to deduce the colouring. One way this can certainly happen is if the walker can avoid visiting some particular site, as then how could one possibly deduce the colour of the never-visited site? And so we get to the statement posed.

References

[BK] – Benjamini, Kesten, 1996 – Distinguishing sceneries by observing the scenery along a random walk path

[dH] – den Hollander, 1988 – Mixing properties for random walk in random scenery

[DH] – Douglas Howard, 1996 – Detecting defects in periodic scenery by random walks on Z

[GG] – Grupel, Gross, 2017 – Indistinguishable sceneries on the Boolean hypercube

[L99] – Lindenstrauss, 1999 – Indistinguishable sceneries

[ML] – Matzinger, Lember, 2003 – Scenery reconstruction: an overview [link]

# Gaussian tail bounds and a word of caution about CLT

The first part is more of an aside. In a variety of contexts, whether for testing Large Deviations Principles or calculating expectations by integrating over the tail, it is useful to know good approximations to the tail of various distributions. In particular, the exact form of the tail of a standard normal distribution is not particularly tractable. The following upper bound is therefore often extremely useful, especially because it is fairly tight, as we will see.

Let $Z\sim N(0,1)$ be a standard normal RV. We are interested in the tail probability $R(x)=\mathbb{P}(Z\geq x)$. The density function of a normal RV decays very rapidly, as the exponential of a quadratic function of x. This means we might expect that conditional on $\{Z\geq x\}$, with high probability Z is in fact quite close to x. This concentration of measure property would suggest that the tail probability decays at a rate comparable to the density function itself. In fact, we can show that:

$\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.$

It is in fact relatively straightforward:

$\mathbb{P}(Z>x)=\frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-u^2/2}du< \frac{1}{\sqrt{2\pi}}\int_x^\infty \frac{u}{x}e^{-u^2/2}du=\frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.$

Just by comparing derivatives, we can also show that this bound is fairly tight. In particular:

$\frac{1}{\sqrt{2\pi}}\frac{x}{x^2+1}e^{-x^2/2}<\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.$

—-

Now for the second part about CLT. The following question is why I started thinking about various interpretations of CLT in the previous post. Suppose we are trying to prove the Strong Law of Large Numbers for a random variable with 0 mean and unit variance. Suppose we try to use an argument via Borel-Cantelli:

$\mathbb{P}(\frac{S_n}{n}>\epsilon) = \mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n})\stackrel{\text{CLT}}{\approx}\mathbb{P}(Z>\epsilon\sqrt{n}).$

Now we can use our favourite estimate on the tail of a normal distribution.

$\mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\epsilon\sqrt{n}\sqrt{2\pi}}e^{-n/2}$

$\Rightarrow \sum_n \mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\sqrt{2\pi}}(e^{-1/2})^n=\frac{1}{\sqrt{2\pi}(e^{1/2}-1)}<\infty.$

By Borel-Cantelli, we conclude that with probability 1, eventually $\frac{S_n}{n}<\epsilon$. This holds for all $\epsilon>0$, and a symmetric result for the negative case. We therefore obtain the Strong Law of Large Numbers.

The question is: was that application of CLT valid? It certainly looks ok, but I claim not. The main problem is that the deviations under discussion fall outside the remit of discussion. CLT gives a limiting expression for deviations on the $\sqrt{n}$ scale.

Let’s explain this another way. Let’s take $\epsilon=10^{-2}$. CLT says that as n becomes very large

$\mathbb{P}(\frac{S_n}{\sqrt{n}}>1000)\approx \mathbb{P}(Z>1000).$

But we don’t know how large n has to be before this approximation is vaguely accurate. If in fact it only becomes accurate for $n=10^{12}$, then it is not relevant for estimating

$\mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n}).$

This looks like an artificial example, but the key is that this problem becomes worse as n grows, (or as we increase the number which currently reads as 1000), and certainly is invalid in the limit. [I find the original explanation about scale of deviation treated by CLT more manageable, but hopefully this further clarifies.]

One solution might be to find some sort of uniform convergence criterion for CLT, ie a (hopefully rapidly decreasing) function $f(n)$ such that

$\sup_{x\in\mathbb{R}}|\mathbb{P}(\frac{S_n}{\sqrt{n}}>x)-\mathbb{P}(Z>x)|\leq f(n).$

This is possible, as given by the Berry-Esseen theorem, but even the most careful refinements in the special case where the third moment is bounded fail to give better bounds than

$f(n)\sim \frac{1}{\sqrt{n}}.$

Adding this error term will certainly destroy any hope we had of the sum being finite. Of course, part of the problem is that the supremum in the above definition is certainly not going to be attained at any point under discussion in these post-$\sqrt{n}$ deviations. We really want to take a supremum over larger-than-usual deviations if this is to work out.

By this stage, however, I hope it is clear what the cautionary note is, even if the argument could potentially be patched. CLT is a theorem about standard deviations. Separate principles are required to deal with the case of large deviations. This feels like a strangely ominous note on which to end, but I don’t think there’s much more to say. Do comment below if you think there’s a quick fix to the argument for SLLN presented above.

# Large Deviations and the CLT

Taking a course on Large Deviations has forced me to think a bit more carefully about what happens when you have large collections of IID random variables. I guess the first thing think to think about is ‘What is a Large Deviation‘? In particular, how large or deviant does it have to be?

Of primary interest is the tail of the distribution function of $S_n=X_1+\ldots+X_n$, where the $X_i$ are independent and identically distributed as $X$. As we can always negate everything later if necessary, we typically consider the probability of events of the type:

$\mathbb{P}(S_n\geq \theta(n))$

where $\theta(n)$ is some function which almost certainly increases fairly fast with $n$. More pertinently, if we are looking for some limit which corresponds to an actual random variable, we perhaps want to look at lots of related $\theta(n)$s simultaneously. More concretely, we should fix $\theta$ and consider the probabilities

$\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha).$ (*)

Throughout, we lose no generality by assuming that $\mathbb{E}X=0$. Of course, it is possible that this expectation does not exist, but that is certainly a question for another post!

Now let’s consider the implications of our choice of $\theta(n)$. If this increases with $n$ too slowly, and the likely deviation of $S_n$ is greater than $\theta(n)$, then the event might not be a large deviation at all. In fact, the difference between this event and the event ($S_n$ is above 0, that is, its mean) becomes negligible, and so the probability at (*) might be 1/2 or whatever, regardless of the value of $\alpha$. So object $\lim \frac{S_n}{\theta(n)}$ whatever that means, certainly cannot be a proper random variable, as if we were to have convergence in distribution, this would imply that the limit RV consisted of point mass at each of $\{+\infty, -\infty\}$.

On the other hand, if $\theta(n)$ increases rapidly with $n$, then the probabilities at (*) might become very small indeed when $\alpha>0$. For example, we might expect:

$\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha)=\begin{cases}0& \alpha>0\\1&\alpha<0.\end{cases}$

and more information to be required when $\alpha=0$. This is what we mean by a large deviation event. Although we always have to define everything concretely in terms of some finite sum $S_n$, we are always thinking about the behaviour in the limit. A large deviation principle exists in an enormous range of cases to show that these probabilities in fact decay exponentially. Again, that is the subject for another post, or indeed the lecture course I’m attending.

Instead, I want to return to the Central Limit Theorem. I first encountered this result in popular science books in a vague “the histogram of boys’ heights looks like a bell” kind of way, then, once a normal random variable had been to some extent defined, it returned in A-level statistics courses in a slightly more fleshed out form. As an undergraduate, you see it in several forms, including as a corollary following from Levy’s convergence theorem.

In all applications though, it is generally used as a method of calculating good approximations. It is not uncommon to see it presented as:

$\mathbb{P}(a\sigma\sqrt{n}+\mu n\leq S_n\leq b\sigma\sqrt{n}+\mu n)\approx \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}dx.$

Although in many cases that is the right way to think use it, it isn’t the most interesting aspect of the theorem itself. CLT says that the correct scaling of $\theta(n)$ so that the deviation probabilities lie between the two cases outline above is the same (that is, $\theta(n)=O(\sqrt{n})$ in some sense) for an enormous class of distributions, and in particular, most distributions that one might encounter in practice (ie finite mean, finite variance). There is even greater universality, as furthermore the limit distribution at this interface has the same form (some appropriate normal distribution) whenever $X$ is in this class of distributions. I think that goes a long way to explaining why we should care about the theorem. It also immediately prompts several questions:

• What happens for less regular distributions? It is now more clear what the right question to ask in this setting might be. What is the appropriate scaling for $\theta(n)$ in this case, if such a scaling exists? Is there a similar universality property for suitable classes of distributions?
• What is special about the normal distribution? The theorem itself shows us that it appears as a universal scaling limit in distribution, but we might reasonably ask what properties such a distribution should have, as perhaps this will offer a clue to a version of CLT or LLNs for less regular distributions.
• We can see that the Weak Law of Large Numbers follows immediately from CLT. In fact we can say more, perhaps a Slightly Less Weak LLN, that

$\frac{S_n-\mu n}{\sigma \theta(n)}\stackrel{d}{\rightarrow}0$

• whenever $\sqrt{n}<<\theta(n)$. But of course, we also have a Strong Law of Large Numbers, which asserts that the empirical mean converges almost surely. What is the threshhold for almost sure convergence, because there is no a priori reason why it should be $\theta(n)=n$?

To be continued next time.