BMO1 2018

The first round of the British Mathematical Olympiad was sat yesterday. The paper can be found here, and video solutions here. Copyright for the questions is held by BMOS. They are reproduced here with permission.

I hope any students who sat the paper enjoyed at least some of the questions, and found it challenging! The following commentaries on the problems are not official solutions, and are not always full solutions at all, but contain significant steps of solutions, so would be best saved until after you have attempted the problems, if you are planning to do so. I’ve written quite a lot about Q5 because I found it hard (or at least time-consuming) and somewhat atypical, and I’ve written a lot about Q6 because there was a lot to say. I hope at least some of this is interesting to some readers of all levels of olympiad experience.

Question 1

A list of five two-digit positive integers is written in increasing order on a blackboard. Each of the five integers is a multiple of 3, and each digit {0,1,…,9} appears exactly once on the blackboard. In how many ways can this be done? (Note that a two-digit number cannot begin with zero.)

It’s a trope of BMO1 that the first question must be doable by some sort of exhaustive calculation or listing exercise. Of course, that is rarely the most efficient solution.

However, there is normally a trade-off between eliminating all listing, and reducing to a manageable task.

The key observation here is that writing the integers in increasing order is really just a way to indicate that the order of the choices doesn’t matter, even if that seems counter-intuitive at first. The question wants to know how many ways there are to choose these five numbers. The order of choice doesn’t matter, since we’re going to put them in ascending order on the blackboard anyway.

You want to make your choices with as much independence as possible. So it would, for example, be a bad idea to choose the smallest number first. How many possibilities are there where the smallest number is 24? What about 42? What about 69? These counts are all different, and some are zero, so this approach makes the computation very taxing.

However, you might notice that the digits {0,3,6,9} have to go together to form two numbers, and the rest have to pair up with one digit from {1,4,7} and one from {2,5,8}. You might know that an integer is divisible by 3 precisely if its digit sum is divisible by 3, but in this context you wouldn’t lose too much time by simply listing everything! These tasks are now completely separate, so you can take the number of ways to pair up {0,3,6,9} and multiply by the number of ways to pair up {1,4,7} and {2,5,8}. You need to take care over the ordering. It does (obviously) matter which is the first digit and which is the second digit in a number!
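If you want to check the final count, a brute-force enumeration is quick to code. The following sketch is mine, not an official solution: it lists every two-digit multiple of 3 and counts the sets of five whose ten digits are exactly {0,1,…,9}.

```python
from itertools import combinations

# All two-digit multiples of 3 (no leading zero, since the smallest is 12).
multiples = list(range(12, 100, 3))

count = 0
for numbers in combinations(multiples, 5):
    digits = "".join(f"{n:02d}" for n in numbers)
    # Valid exactly when the ten digits of the five numbers are 0,1,...,9,
    # each appearing once.
    if len(set(digits)) == 10:
        count += 1

print(count)
```

The total should agree with the product of the two separate pairing counts described above.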


Extreme Value Theory

This is something interesting which came up on the first problem sheet for the Part A Statistics course. The second question introduced the Weibull distribution, defined in terms of parameters \alpha,\lambda>0 through the distribution function:

F(x)=\begin{cases}0 & x<0\\ 1-\exp(-(\frac{x}{\lambda})^\alpha) & x\geq 0.\end{cases}

As mentioned in the statement of the question, this distribution is “typically used in industrial reliability studies in situations where failure of a system comprising many similar components occurs when the weakest component fails”. Why could that be? Expressed more theoretically, the lifetimes of various components might reasonably be assumed to behave like i.i.d. random variables in many contexts. Then the failure time of the system is given by the minimum of the constituent random variables.
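One quick calculation, which is my addition rather than part of the problem sheet, makes this modelling choice plausible: the Weibull family is closed under taking minima of i.i.d. copies. If X_1,\ldots,X_n are i.i.d. with the distribution function above, then

\mathbb{P}(\min_i X_i>x)=\left(\exp(-(\tfrac{x}{\lambda})^\alpha)\right)^n=\exp\left(-\left(\frac{x}{\lambda n^{-1/\alpha}}\right)^\alpha\right),\quad x\geq 0,

so the minimum of n components is again Weibull, with the same shape parameter \alpha and scale parameter \lambda n^{-1/\alpha}.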

So this raises the question: what does the distribution of the minimum of a collection of i.i.d. random variables look like? First, we need to think about why there should be an answer at all. I mean, it would not be unreasonable to assume that this would depend rather strongly on the underlying distribution. But of course, we might say the same thing about sums of i.i.d. random variables, and yet there is the Central Limit Theorem. Phrased in a way that is deliberately vague, this says that subject to some fairly mild conditions on the underlying distribution (finite variance in this case), the sum of n i.i.d. RVs looks like a normal distribution for large n. Here we know what ‘looks like’ means, since we have a notion of a family of normal distributions. Formally, though, we might say that ‘looks like’ means that the image of the distribution under some linear transformation, where the coefficients are possibly functions of n, converges to the distribution N(0,1) as n grows.

The technical term for this is to say that the underlying RV we are considering, which in this case would be X_1+\ldots+X_n, is in the domain of attraction of N(0,1). Note that other distributions in the family of normals are also in the domain of attraction of N(0,1), and vice versa, so this induces an equivalence relation on the space of distributions, though this observation is not hugely helpful since most interesting statements involve some sort of limit.

Anyway, with that perspective, it is perhaps more reasonable to imagine that the minimum of a collection of i.i.d. RVs might have some limit distribution. Because we typically feel more comfortable thinking about right-tails rather than left-tails of probability distributions, this problem is more often considered for the maximum of i.i.d. RVs. The Fisher-Tippett-Gnedenko theorem, proved in various forms in the first half of the 20th century, asserts that again under mild regularity assumptions, the maximum of such a collection does lie in the domain of attraction of one of a small set of distributions. The Weibull distribution as defined above is one of these. (Note that if we are considering domains of attraction, then scaling x by a constant is of no consequence, so we can drop the parameterisation by \lambda.)

This is considered the first main theorem of Extreme Value Theory, which addresses precisely this sort of problem. It is not hard to see why this area is of interest. To decide how much liquidity they require, an insurance company needs to know the likely size of the maximum claim over the lifetime of the policy. Similarly, the designer of a sea-wall doesn’t care about the average wave-height – what matters is how strong the once-in-a-century storm which threatens the town might be. A good answer might also explain how to resolve the apparent contradiction that most human characteristics are distributed roughly normally across the population. Normal distributions are unbounded, yet physiological constraints enable us to state with certainty that there will never be twelve foot tall men (or women). In some sense, EVT is a cousin of Large Deviation theory, the difference being that unlikely events in a large family of i.i.d. RVs are considered on a local scale rather than globally. Note that large deviations for Cramer’s theorem in the case where the underlying distribution has a heavy tail are driven by a single very deviant value, rather than by lots of slightly deviant data, so in this case the theories are comparable, though generally analysed from different perspectives.

In fact, we have to consider the reversed Weibull distribution for a maximum, which is supported on (-\infty,0]. This is one of three possible distribution families for the limit of a maximum. The other two are the Gumbel distribution

F(x)=e^{-e^{-x}},

and the Frechet distribution

F(x)=\exp(-x^{-\alpha}),\quad x>0.

Note that \alpha is a positive parameter in both the Frechet and reversed Weibull distributions; the Gumbel distribution has no such shape parameter. These three distributions can be expressed as a single one-parameter family, the Generalised Extreme Value distribution.
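For reference, since the post does not spell it out: the standard form of the Generalised Extreme Value distribution function, with shape parameter \xi, is

G_\xi(x)=\exp\left(-(1+\xi x)^{-1/\xi}\right),\quad 1+\xi x>0,

with the Gumbel case recovered as the limit \xi\rightarrow 0, the Frechet case corresponding to \xi>0 (where \alpha=1/\xi), and the reversed Weibull case to \xi<0.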

The differences between them lie in the tail behaviour. The reversed Weibull distribution has an actual upper bound, the Gumbel an exponential, fast-decaying tail, and the Frechet a polynomial ‘fat’ tail. It is not completely obvious that these properties are inherited from the original distribution. After all, to get from the original distribution to extreme value distribution, we are taking the maximum, then rescaling and translating in a potentially quite complicated way. However, it is perhaps reasonable to see that the property of the underlying distribution having an upper bound is preserved through this process. Obviously, the bound itself is not preserved – after all, we are free to apply arbitrary linear transformations to the distributions!

In any case, it does turn out that the maximum of U[0,1] samples converges to a reversed Weibull; the exponential tails of the Exp(1) and N(0,1) distributions lead to a Gumbel limit; and the fat-tailed Pareto distribution gives the Frechet limit. The calculations are reasonably straightforward, especially once the correct rescaling is known. See this article from Duke for an excellent overview and the details of the examples I have just cited. These notes discuss further properties of these limiting distributions, including the unsurprising fact that their form is preserved under taking the maximum of i.i.d. copies. This is analogous to the fact that the family of normal distributions is preserved under taking arbitrary finite sums.
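As a quick illustration (my own sketch, not taken from the article or notes linked above), the Exp(1) case is easy to simulate: if M_n is the maximum of n i.i.d. Exp(1) variables, then M_n-\log n should be approximately Gumbel for large n.

```python
import numpy as np

# Sketch: for Exp(1) samples, max_n - log(n) is approximately Gumbel for large n,
# i.e. P(max_n - log(n) <= x) is close to exp(-exp(-x)).
rng = np.random.default_rng(0)
n, trials = 1000, 4000
maxima = rng.exponential(1.0, size=(trials, n)).max(axis=1) - np.log(n)

for x in (-1.0, 0.0, 1.0, 2.0):
    empirical = np.mean(maxima <= x)
    gumbel = np.exp(-np.exp(-x))
    print(f"x = {x:+.1f}: empirical {empirical:.3f}, Gumbel {gumbel:.3f}")
```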

From a statistical point of view, devising a good test for which class of extreme value distribution a particular set of data obeys is of great interest. Why? Well mainly because of the applications, some of which were suggested above. But also because of the general statistical principle that it is unwise to extrapolate beyond the range of the available data. Yet that is precisely what we need to do if we are considering extreme values. After all, the designer of that sea-wall can’t necessarily rely on the largest storm in the future being roughly the same as the biggest storm in the past. Since the EVT theorem gives a clear description of the limiting distribution, which is where the truly large extremes might occur, it suffices to find a good test for the form of that limit – that is, which of the three possibilities is relevant, and what the parameter should be. This seems to be fairly hard in general. I didn’t understand much of it, but this paper provided an interesting review.

Anyway, that was something interesting I didn’t know about (for the record, I also now know how to construct a sensible Q-Q plot for the Weibull distribution!), though I am assured that EVT was a core element of the mainstream undergraduate mathematics syllabus forty years ago.

Gaussian tail bounds and a word of caution about CLT

The first part is more of an aside. In a variety of contexts, whether for testing Large Deviations Principles or calculating expectations by integrating over the tail, it is useful to know good approximations to the tail of various distributions. In particular, the exact form of the tail of a standard normal distribution is not particularly tractable. The following upper bound is therefore often extremely useful, especially because it is fairly tight, as we will see.

Let Z\sim N(0,1) be a standard normal RV. We are interested in the tail probability R(x)=\mathbb{P}(Z\geq x). The density function of a normal RV decays very rapidly, as the exponential of a quadratic function of x. This means we might expect that conditional on \{Z\geq x\}, with high probability Z is in fact quite close to x. This concentration of measure property would suggest that the tail probability decays at a rate comparable to the density function itself. In fact, we can show that:

\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.

It is in fact relatively straightforward:

\mathbb{P}(Z>x)=\frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-u^2/2}du< \frac{1}{\sqrt{2\pi}}\int_x^\infty \frac{u}{x}e^{-u^2/2}du=\frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.

Just by comparing derivatives, we can also show that this bound is fairly tight. In particular:

\frac{1}{\sqrt{2\pi}}\frac{x}{x^2+1}e^{-x^2/2}<\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.
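These bounds are easy to check numerically; the following sketch (mine, using scipy rather than anything from the post) compares them with the exact tail for a few values of x.

```python
from math import exp, pi, sqrt

from scipy.stats import norm

# Sketch: compare the exact Gaussian tail P(Z > x) = norm.sf(x) with the
# lower and upper bounds displayed above.
for x in (1.0, 2.0, 3.0, 5.0):
    upper = exp(-x ** 2 / 2) / (x * sqrt(2 * pi))
    lower = (x / (x ** 2 + 1)) * exp(-x ** 2 / 2) / sqrt(2 * pi)
    print(f"x = {x}: {lower:.3e} <= {norm.sf(x):.3e} <= {upper:.3e}")
```

Already at x=3 the two bounds agree to within about ten percent.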

—-

Now for the second part about CLT. The following question is why I started thinking about various interpretations of CLT in the previous post. Suppose we are trying to prove the Strong Law of Large Numbers for a random variable with 0 mean and unit variance, writing S_n for the sum of n i.i.d. copies. Suppose we try to use an argument via Borel-Cantelli:

\mathbb{P}(\frac{S_n}{n}>\epsilon) = \mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n})\stackrel{\text{CLT}}{\approx}\mathbb{P}(Z>\epsilon\sqrt{n}).

Now we can use our favourite estimate on the tail of a normal distribution.

\mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\epsilon\sqrt{n}\sqrt{2\pi}}e^{-\epsilon^2 n/2}

\Rightarrow \sum_n \mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\epsilon\sqrt{2\pi}}\sum_n\left(e^{-\epsilon^2/2}\right)^n=\frac{1}{\epsilon\sqrt{2\pi}\,(e^{\epsilon^2/2}-1)}<\infty.

By Borel-Cantelli, we conclude that with probability 1, eventually \frac{S_n}{n}<\epsilon. This holds for every \epsilon>0 (it suffices to take a countable sequence \epsilon\downarrow 0), and a symmetric argument handles deviations below -\epsilon. We therefore obtain the Strong Law of Large Numbers.

The question is: was that application of CLT valid? It certainly looks ok, but I claim not. The main problem is that the deviations under discussion fall outside the remit of the theorem. CLT gives a limiting expression for deviations on the \sqrt{n} scale.

Let’s explain this another way. Let’s take \epsilon=10^{-2}. CLT says that as n becomes very large

\mathbb{P}(\frac{S_n}{\sqrt{n}}>1000)\approx \mathbb{P}(Z>1000).

But we don’t know how large n has to be before this approximation is vaguely accurate. With \epsilon=10^{-2}, the threshold \epsilon\sqrt{n} reaches 1000 precisely when n=10^{10}; so if the approximation at this level only becomes accurate for n=10^{12}, it is of no use for estimating

\mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n}).

This looks like an artificial example, but the key is that this problem becomes worse as n grows (or as we increase the number which currently reads as 1000), and the approximation is certainly invalid in the limit. [I find the original explanation about the scale of deviations treated by CLT more manageable, but hopefully this further clarifies.]

One solution might be to find some sort of uniform convergence criterion for CLT, i.e. a (hopefully rapidly decreasing) function f(n) such that

\sup_{x\in\mathbb{R}}|\mathbb{P}(\frac{S_n}{\sqrt{n}}>x)-\mathbb{P}(Z>x)|\leq f(n).

This is possible, as given by the Berry-Esseen theorem, but even the most careful refinements in the special case where the third moment is bounded fail to give better bounds than

f(n)\sim \frac{1}{\sqrt{n}}.

Adding this error term will certainly destroy any hope we had of the sum being finite. Of course, part of the problem is that the supremum in the above definition is certainly not attained at the values of x relevant to these beyond-\sqrt{n} deviations. We really want a bound that is uniform over larger-than-usual deviations if this is to work out.
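To get a feel for the Berry-Esseen rate f(n)\sim\frac{1}{\sqrt{n}}, here is a small sketch (my own, not from the post) computing the Kolmogorov distance between \frac{S_n}{\sqrt{n}} and N(0,1) exactly, in the case where the steps are fair \pm 1 coin flips.

```python
import numpy as np
from scipy.stats import binom, norm

# Sketch: Kolmogorov distance between S_n / sqrt(n) and N(0,1) for sums of
# fair +/-1 coin flips, to illustrate the ~1/sqrt(n) Berry-Esseen rate.
for n in (10, 100, 1000, 10000):
    k = np.arange(n + 1)
    atoms = (2 * k - n) / np.sqrt(n)                     # support of S_n / sqrt(n)
    cdf_right = binom.cdf(k, n, 0.5)                     # F_n at each atom
    cdf_left = np.concatenate(([0.0], cdf_right[:-1]))   # F_n just below each atom
    phi = norm.cdf(atoms)
    dist = np.max(np.maximum(np.abs(cdf_right - phi), np.abs(cdf_left - phi)))
    print(f"n = {n:5d}: sup-distance {dist:.4f},  1/sqrt(n) = {1 / np.sqrt(n):.4f}")
```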

By this stage, however, I hope it is clear what the cautionary note is, even if the argument could potentially be patched. CLT is a theorem about standard deviations. Separate principles are required to deal with the case of large deviations. This feels like a strangely ominous note on which to end, but I don’t think there’s much more to say. Do comment below if you think there’s a quick fix to the argument for SLLN presented above.

CLT and Stable Distributions

One of the questions I posed at the end of the previous post about the Central Limit Theorem was this: what is special about the normal distribution?

More precisely, for a large class of variables (those with finite variance) the limit in distribution of S_n after a natural rescaling is distributed as N(0,1). As a starting point for investigating similar results for a more general class of underlying distributions, it is worth considering what properties we might require of a distribution if it is to appear as a limit in distribution of sums of IID RVs, rescaled if necessary.

The property required is that the distribution is stable. In the rest of the post I am going to give an informal precis of the content of the relevant chapter of Feller.

Throughout, we assume a collection of IID RVs, X,X_1,X_2,\ldots, with the partial sums S_n:=X_1+\ldots+X_n. Then we say X is stable in the broad sense if

S_n\stackrel{d}{=}c_nX+\gamma_n,

for some deterministic parameters c_n,\gamma_n for every n. If in fact \gamma_n=0 then we say X is stable in the strict sense. I’m not sure if this division into strict and broad is still widely drawn, but anyway. One interpretation might be that a collection of distributions is stable if they form a non-trivial subspace of the vector space of random variables and also form a subgroup under the operation of adding independent RVs. I’m not sure that this is hugely useful either though. One observation is that if \mathbb{E}X exists and is 0, then so are all the \gamma_n.
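As a sanity check on the definition (my example, not Feller’s), the normal family is stable in the broad sense: if X\sim N(\mu,\sigma^2) then S_n\sim N(n\mu,n\sigma^2), and so

S_n\stackrel{d}{=}\sqrt{n}X+(n-\sqrt{n})\mu,

giving c_n=\sqrt{n} and \gamma_n=(n-\sqrt{n})\mu. Centred normal distributions are therefore stable in the strict sense.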

The key result to be shown is that

c_n=n^{1/\alpha} for some 0<\alpha\leq 2.
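Another example worth keeping in mind (my aside, not part of Feller’s argument) is the standard Cauchy distribution, which is strictly stable with \alpha=1, so that S_n\stackrel{d}{=}nX. This is easy to check by simulation:

```python
import numpy as np
from scipy.stats import cauchy

# Sketch: the standard Cauchy distribution is strictly stable with alpha = 1,
# so S_n / n has exactly the same distribution as a single Cauchy sample.
rng = np.random.default_rng(1)
n, trials = 50, 20_000
scaled_sums = cauchy.rvs(size=(trials, n), random_state=rng).sum(axis=1) / n

for x in (1.0, 5.0, 20.0):
    print(f"P(S_n/n > {x:4.1f}): empirical {np.mean(scaled_sums > x):.4f}, "
          f"Cauchy tail {cauchy.sf(x):.4f}")
```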

Relevant though the observation about means is, a more useful one is this. The stability property is retained if we replace the distribution of X with the distribution of X_1-X_2 (independent copies naturally!). The behaviour of c_n is also preserved. Now we can work with an underlying distribution that is symmetric about 0, rather than merely centred. The deduction that \gamma_n=0 still holds now, whether or not X has a mean.

Now we proceed with the proof. All equalities are taken to be in distribution unless otherwise specified. By splitting into two smaller sums, we deduce that

c_{m+n}X=S_{m+n}=c_mX_1+c_nX_2.

Extending this idea, we have

c_{kr}X=S_{kr}=S_k^{(1)}+\ldots+S_k^{(r)}=c_kX_1+\ldots+c_kX_r=c_kS_r=c_kc_rX.

Note that it is not even obvious yet that the c_ns are increasing. To get a bit more control, we proceed as follows. Set v=m+n, and express

X=\frac{c_m}{c_v}X_1+\frac{c_n}{c_v}X_2,

from which we can make the deduction

\mathbb{P}(X>t)\geq \mathbb{P}(X_1>0,X_2>t\frac{c_v}{c_n})=\frac12\mathbb{P}(X_2>t\frac{c_v}{c_n}). (*)

So most importantly, by taking t>>0 in the above, and using that X is symmetric, we can obtain an upper bound

\mathbb{P}(X_2>t\frac{c_v}{c_n})\leq \delta<\frac12,

in fact for any \delta<\frac12 if we take t large enough. But since

\mathbb{P}(X_2>0)=\frac12(1-\mathbb{P}(X_2=0)),

(which should in most cases be \frac12), this implies that \frac{c_v}{c_n} cannot be very close to 0. In other words, \frac{c_n}{c_v} is bounded above. This is in fact enough regularity to deduce that c_n=n^{1/\alpha} from the multiplicative Cauchy-type functional equation c_{kr}=c_kc_r established above.

It remains to check that \alpha\leq 2. Note that this equality case \alpha=2 corresponds exactly to the \frac{1}{\sqrt{n}} scaling we saw for the normal distribution, in the context of the CLT. This motivates the proof. If \alpha>2, we will show that the variance of X is finite, so CLT applies. This gives some control over c_n in an n\rightarrow\infty limit, which is plenty to ensure a contradiction.

To show the variance is finite, we use the definition of stability (since S_n\stackrel{d}{=}c_nX, we have \mathbb{P}(S_n>tc_n)=\mathbb{P}(X>t)) to check that there is a value of t such that

\mathbb{P}(S_n>tc_n)<\frac14\,\forall n.

Now consider the event that the maximum of the X_i is >tc_n and that the sum of the rest is non-negative. This has, by independence and symmetry, at least half the probability of the event demanding just that the maximum be bounded below, and furthermore is contained within the event with probability <\frac14 shown above. So if we set

z(n)=n\mathbb{P}(X>tc_n)

we then have

\frac14>\mathbb{P}(S_n>tc_n)\geq\frac12\mathbb{P}(\max X_i>tc_n)=\frac12\left[1-\left(1-\frac{z(n)}{n}\right)^n\right]

\Rightarrow 1-e^{-z(n)}<\frac12,\quad\text{since }\left(1-\frac{z(n)}{n}\right)^n\leq e^{-z(n)}.

So, z(n)=n(1-F(tc_n)) is bounded as n varies. Rescaling suitably, this gives that

x^\alpha(1-F(x))<M\,\forall x,\,\text{for some }M<\infty.

This is exactly what we need to control the variance, as:

\mathbb{E}X^2=\int_0^\infty \mathbb{P}(X^2>t)dt=\int_0^\infty \mathbb{P}(X^2>u^2)2udu

=\int_0^\infty 4u\mathbb{P}(X>u)du\leq \int_0^\infty \left(2u\wedge\frac{4M}{u^{\alpha-1}}\right)du<\infty,

using that X is symmetric (so \mathbb{P}(X>u)\leq\frac12), the tail bound above, and that \alpha>2 for the final inequality. But we know from CLT that if the variance is finite, we must have \alpha=2, contradicting \alpha>2.
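The layer-cake identity used in the first equality above can be sanity-checked numerically; this sketch (mine, not part of the argument) does so for a standard normal, where \mathbb{E}X^2=1.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Sketch: check E[X^2] = integral over t of P(X^2 > t) for X ~ N(0,1),
# using P(X^2 > t) = 2 * P(X > sqrt(t)) by symmetry.
integral, _ = quad(lambda t: 2 * norm.sf(np.sqrt(t)), 0, np.inf)
print(integral)  # should be close to 1.0
```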

All that remains is to mention how stable distributions fit into the context of limits in distribution of RVs. This is little more than a definition.

We say F is in the domain of attraction of a broadly stable distribution R if

\exists a_n>0,b_n,\quad\text{s.t.}\quad \frac{S_n-b_n}{a_n}\stackrel{d}{\rightarrow}R.

The role of b_n is not hugely important, as a broadly stable distribution is in the domain of attraction of the corresponding strictly stable distribution.

The natural question to ask is: do the domains of attraction of stable distributions (for 0<\alpha\leq 2) partition the space of probability distributions, or is some extra condition required?

Next time I will talk about stable distributions in a more analytic context, and in particular how a discussion of their properties is motivated by the construction of Levy processes.