The Envelope ‘Paradox’

At the recent IMO in Hong Kong, there were several moments where the deputy leaders had to hang around, and I spent some of these moments discussing the following problem with Stephen Mackereth, my counterpart from New Zealand. He’s a mathematically-trained philosopher, so has a similar level of skepticism to me, but for different reasons, regarding supposed paradoxes in probability. Because, as we will see shortly, I don’t think this is a paradox in even the slightest fashion, I think there’s probably too much written about this on the internet already. So I’m aware that contributing further to this oeuvre is hypocritical, but we did the thinking in HKUST’s apparently famous Einstein Cafe, so it makes sense to write down the thoughts.

[And then forget about it for eight weeks. Oops.]

The ‘Paradox’

Here’s the situation. A cryptic friend gives you an envelope containing some sum of money, and shows you a second envelope. They then inform you that one of the envelopes contains twice as much money as the other. It’s implicit in this that the choice of which is which is uniform. You have the option to switch envelopes. Should you?

The supposed paradox arises by considering the amount in your envelope, say X. In the absence of further information, it is equally likely that the other envelope contains X/2 as 2X. Therefore, the average value of the other envelope is

\frac12 \left(\frac{X}{2}+2X \right)= \frac54 X > X.

So you should switch, since on average you gain money. But this is paradoxical, since the assignment of larger and smaller sums was uniform, so switching envelope should make no difference.
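
To make the resolution concrete, here is a minimal simulation sketch: fix any distribution for the pair (the Exp(1) smaller value below is my choice, purely for illustration), assign the two amounts to the envelopes uniformly at random, and compare always keeping with always switching.

```python
import random

def sample_game(rng):
    """One round: smaller amount ~ Exp(1), the pair assigned to envelopes uniformly."""
    small = rng.expovariate(1.0)
    mine, other = rng.sample((small, 2 * small), 2)
    return mine, other

rng = random.Random(0)
trials = 10 ** 6
keep = switch = 0.0
for _ in range(trials):
    mine, other = sample_game(rng)
    keep += mine
    switch += other

# Both averages estimate the same quantity (3/2 times the mean smaller amount),
# so blindly switching gains nothing on average.
print(keep / trials, switch / trials)
```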

Probabilistic setup

This is not supposed to be a problem on a first-year probability exercise sheet. It’s supposed to be conducive to light discussion. So saying “I won’t engage with this problem until you tell me what the probability space is” doesn’t go down terribly well. But it is important to work out what is random, and what isn’t.

There are two sources of randomness, or at least ignorance. Firstly, there is the pair of values contained in the envelopes. Secondly, there is the assignment of this pair of values to the two envelopes. The second is a source of randomness, and this problem is founded on the premise that this second stage is ‘symmetric enough’ to smooth over any complications in the first stage. If we think that probability isn’t broken (and that’s what I think), then the answer is probably that the second stage isn’t symmetric enough.

Or, that the first stage isn’t very well-defined. In what follows, I’m going to make the second stage very symmetric, at the expense of setting up the first stage in what seems to me a reasonable way using the conventional language of probability theory to record our ignorance about the values in play.

So what’s the first stage? We must have a set of possible pairs of values taken by the envelopes. Let’s call this A, so

A\subset \mathbb{A}:=\{(x,2x)\,:\, x\in (0,\infty)\}.

Maybe we know what A is, but maybe we don’t, in which case we should take A=\mathbb{A}, on the grounds that any pair is possible. Suppose that your friend has chosen the pair of values according to some distribution on \mathbb{A}, which we’ll assume has a density f, which is known to you. Maybe this isn’t the actual density, but it serves perfectly well if you treat it as *your* opinion on the likelihood. Then this actually does reduce to a problem along the lines of first-year probability, whether or not you get to see the amount in your envelope.

Suppose first that you do get to see the amount, and that it is x. Then the conditional probabilities that the pair is (x/2,x) or (x,2x) are, respectively

\frac{f(x/2,x)}{f(x/2,x)+f(x,2x)},\quad \frac{f(x,2x)}{f(x/2,x)+f(x,2x)}.

So you can work out your expected gain by switching, and decide accordingly. If you don’t know the value in your envelope, you can still work out the probability that it is better (in expectation) to switch, but this isn’t really a hugely meaningful measure, unless it is zero or one.
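
As a sketch of that calculation in code (parameterising pairs by their smaller element; the Exp(1) density at the end is my own illustrative choice of f, not part of the problem), the conditional probabilities and the expected value of the other envelope given an observed amount x are:

```python
import math

def posterior_and_expected_other(f, x):
    """Given a density f for the smaller element of the pair and an observed
    amount x, return the conditional probabilities that the pair is (x/2, x)
    or (x, 2x), and the conditional expectation of the other envelope."""
    w_lower, w_upper = f(x / 2), f(x)   # weights of the pairs (x/2, x) and (x, 2x)
    total = w_lower + w_upper
    if total == 0:
        raise ValueError("x is inconsistent with the prior f")
    p_lower, p_upper = w_lower / total, w_upper / total
    return p_lower, p_upper, p_lower * (x / 2) + p_upper * (2 * x)

# Illustration with an Exp(1) density for the smaller value (my choice, not the post's).
f = lambda y: math.exp(-y) if y > 0 else 0.0
for x in (0.5, 1.0, 2.0):
    print(x, posterior_and_expected_other(f, x))
```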

It’s worth noting that if you can view inside your envelope, and you know A has a particular form, then the game becomes determined. For example, if

A\subset \{(n,2n), n\text{ an odd integer}\},

then life is very easy. If you open your envelope and see an odd integer, you should switch, and if you see an even integer you shouldn’t.

We’ll return at the end to discuss a case where it is always better to switch, and why this isn’t actually a paradox.

Improper prior and paradox of resampling when \mathbb{E}=\infty

For now though, let’s assume that we don’t know anything about the amounts of money in the envelopes. Earlier, we said that “in the absence of further information, it is equally likely that the other envelope contains X/2 as 2X”. In the language of a distribution on \mathbb{A}, we are taking the uniform measure. Of course this is not a distribution, in the same way that there isn’t a uniform distribution on the positive reals.

However, if this is your belief about the values in the pair of envelopes, what do you think is the mean value of the content of your envelope? Well, you think all values are equally likely. So, even though this isn’t a distribution, you pretty much think the value of your envelope has infinite expectation.

[This is where the philosophy comes in I guess. Is (expressing uniform ignorance about the content of the other envelope given knowledge of your own) the same as (expressing uniform ignorance of both envelopes at the beginning)? I think it is, even though it has a different consequence here, since the former can be turned into a proper distribution, whereas the latter cannot.]

Let’s briefly consider an alternative example. It’s fairly easy to conjure up distributions which are almost surely finite but which have infinite expectation. For example \mathbb{P}(X=2^k)=2^{-k} for k=1,2,…, which is the content of the *St. Petersburg paradox*, another supposed paradox in probability, but one whose resolution is a bit more clear.
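
Here is a quick simulation sketch of this distribution: the running sample mean never settles down, which is the practical face of the infinite expectation.

```python
import random

def sample_heavy(rng):
    """Return 2^k with probability 2^{-k}, k = 1, 2, ... (flip a fair coin until heads)."""
    k = 1
    while rng.random() < 0.5:
        k += 1
    return 2 ** k

rng = random.Random(1)
total, n = 0, 0
for checkpoint in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    while n < checkpoint:
        total += sample_heavy(rng)
        n += 1
    # the running mean keeps drifting upwards rather than converging
    print(n, total / n)
```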

Anyway, let X and Y be independent copies of such a distribution. Now suppose your friend offers you an envelope containing amount X. You look at the value, and then you are offered the opportunity to switch to an envelope containing amount Y. Should you?

Well, if expectation is what you care about, then you definitely should. Because with probability one, you are staring at a finite value in your envelope, whereas the other unknown envelope promises infinite expectation, which is certainly larger than the value that you’re looking at.

Is this also a paradox? I definitely don’t think it is. The expectation of the content of your envelope is infinite, and conditional on its value, the expected gain from switching is infinite with probability one, which is consistent with the expected content of the other envelope being infinite. [Note that you don’t want to be claiming that the expectation of X-Y is zero.]

An example density function

As an exercise that isn’t necessarily hugely interesting, let’s now take f to be the density of the smaller value of the pair, and assume it is \mathrm{Exp}(\lambda). So the mean of this smaller number is 1/\lambda. Then, conditional on seeing x in my envelope, the expected value of the amount in the other envelope is

\frac{\frac{x}{2} e^{-\lambda x/2} + 2x e^{-\lambda x}}{e^{-\lambda x/2}+ e^{-\lambda x}}. (*)

Some straightforward manipulation shows that this quantity is at least x (implying it’s advantageous to switch) precisely when

e^{-\lambda x/2}\ge \frac12.

That is, when x\le \frac{2\log 2}{\lambda}. The shape of this interval should fit our intuition, namely that the optimal strategy should be to switch if the value in your envelope is small enough.
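
Here is a small numerical sketch of this threshold, evaluating (*) directly and also checking it by simulation (the window width and trial count below are arbitrary choices of mine):

```python
import math, random

lam = 1.0
threshold = 2 * math.log(2) / lam   # roughly 1.386

def expected_other(x):
    """Formula (*): E[other envelope | mine contains x], smaller value ~ Exp(lam)."""
    w_small, w_large = math.exp(-lam * x / 2), math.exp(-lam * x)
    return ((x / 2) * w_small + 2 * x * w_large) / (w_small + w_large)

for x in (0.5, 1.0, threshold, 2.0, 3.0):
    print(x, expected_other(x))   # exceeds x exactly when x is below the threshold

def mc_gain(x0, width=0.05, trials=10 ** 6, seed=2):
    """Monte Carlo estimate of the average gain from switching, conditional on
    the observed amount lying within `width` of x0."""
    rng = random.Random(seed)
    gain, count = 0.0, 0
    for _ in range(trials):
        small = rng.expovariate(lam)
        mine, other = rng.sample((small, 2 * small), 2)
        if abs(mine - x0) < width:
            gain += other - mine
            count += 1
    return gain / count

print(mc_gain(0.5), mc_gain(3.0))   # positive below the threshold, negative above
```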

The point of doing this calculation is to emphasise that it ceases to be an interesting problem, and certainly ceases to be a paradox of any kind, once we specify f concretely. It doesn’t matter whether this is some true distribution (ie the friend is genuinely sampling the values somehow at random), or rather a perceived likelihood (that happens to be normalisable).

What if you should always switch?

The statement of the paradox only really has any bite if the suggestion is that we should always switch. Earlier, we discussed potential objections to considering the uniform prior in this setting, but what about other possible distributions f which might lead to this conclusion?

As at (*), we can conclude that when f(x)+f(x/2)>0, we should switch on seeing x precisely if

2f(x)\ge f\left(\frac{x}{2}\right).

Therefore, partition the support of f into a collection of geometric sequences with common ratio 2. If switching is optimal at every value, then along any such sequence x, 2x, 4x,\ldots we have f(2^k x)\ge 2^{-k}f(x), so each term 2^k x\, f(2^k x) is at least x f(x), and the contribution of the sequence to the mean diverges. So the mean of f is infinite if everything is integer-valued. If f is real-valued, there are some complications, but so long as everything is measurable, the same conclusion will hold.

So the you-should-switch-given-x strategy can only hold for all values of x if f has infinite mean. This pretty much wraps up my feelings. If the mean isn’t infinite, the statement of the paradox no longer holds, and if it is infinite, then the paradox dissolves into a statement about trying to order various expectations, all of which are infinite.

Conclusions

Mathematical summary: it’s Bayes. Things may be exchangeable initially, but not once you condition on the value of one of them! Well, not unless you have a very specific prior.

Philosophical summary: everything in my argument depends on the premise that one can always describe the protagonist’s prior opinion on the contents of the pair of envelopes with a (possibly degenerate) distribution. I feel this is reasonable. As soon as you write down \frac12 \cdot\frac{x}{2} + \frac12 \cdot2x, you are doing a conditional expectation, and it’s got to be conditional with respect to something. Here it’s the uniform prior, or at least the uniform prior restricted to the set of values that are now possible given the revelation of your number.

Second mathematical summary: once you are working with the uniform prior, or any measure with infinite mean, there’s no reason why

\mathbb{E}\left[X|Y\right]>Y,

with probability one (in terms of Y) should be surprising, since the LHS is (almost-surely) infinite while the RHS is almost surely finite, despite having infinite mean itself.

Crocodiles, Zebras, and Hard Exams

Educational and mathematical social media have recently been alive with discussion of a problem which appeared in this year’s Scottish Higher exam.

[Background: for anyone reading this from outside the UK, this exam is taken at the end of the final year of school, by which stage most students (including almost all those taking this paper) will have specialised to about four subjects. Achieving particular results in these exams will be the condition set by many universities for candidates to take up places.]

The paper under discussion can be found here. It’s the second paper in the set, where calculators are permitted. The crocodile question that has attracted particular attention is Q8, and I hope it is ok to reprint it here, though of course copyright lies with the SQA:

[Image: the crocodile question, Q8]

I’ll come back to the rest of the paper later, but for now let’s talk about some of the reasons why candidates found this question challenging under time pressure.

The structure of the argument prompted by the question is the following. We are modelling a situation which might arise in the real world involving crocodiles, or in a variety of different contexts. There are a number of ways to get between two places, involving travelling through two different types of routes, at different speeds. Which is the fastest possible route? We describe the routes with a parameter, and after thinking briefly about the range of parameter, we can calculate (in this case using Pythagoras – the river has constant width presumably – and the now very familiar distance-speed-time relation) the time taken for each route. Finally, finding the fastest possible route is an exercise in minimising this function, for which we differentiate.

I’m sure the well-prepared candidates had seen all parts of this argument in various contexts during class and revision. What I imagine made this question seem confusing was that the examiners actually did the first step for the candidates, by telling them how the time taken depended on x, without expecting them to engage with any of the geometry or dynamics implicit in the answer. So all that is left for the candidates is to substitute in two values into the expression, and then differentiate to find the minimum, possibly with a small additional argument to say why this should be a minimum.
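
For concreteness, here is how that final step might look in code. The question itself isn't reproduced in text above, so the time function used below, T(x) = 5\sqrt{36+x^2} + 4(20-x), is the commonly quoted form from that paper and should be treated as my assumption rather than something established here.

```python
import math

# Assumed form of the time function (in tenths of a second): 5*sqrt(36 + x^2)
# for the swim plus 4*(20 - x) for the run along the bank. This exact formula
# is an assumption; the question is not reproduced in the text above.
def T(x):
    return 5 * math.sqrt(36 + x ** 2) + 4 * (20 - x)

# Part a): the two special routes picked out by the wording.
print(T(20))   # swimming all the way corresponds to x = 20
print(T(0))    # minimising the time in the water corresponds to x = 0

# Part b): T'(x) = 5x / sqrt(36 + x^2) - 4 = 0 gives 9x^2 = 576, so x = 8.
print(T(8))    # the minimum time
```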

Obviously, if they had just said “Find T(0) and T(20)”, then part a) would not be meaningful in any way as a test of ability. So you get two marks for correctly interpreting the ‘word problem’: travelling only in the water corresponds to x=20, and minimising the time in the water corresponds to x=0. Now the paper is out of 70, so this is fine, but you can also now see why this caused confusion. Is x actually a variable according to the preamble? Look at Q2 (which is all about domains and so on); students are expected at this stage to be familiar with the technicalities of defining functions and their arguments. Well, the phrase “particular point” is used, so if we are going to be literal, then x is not a variable, T(x) is just a single number, and we don’t have enough information to do part a).

Of course this is deliberately artificial. I did not think what I have just described, and nor would any experienced mathematician. But maybe this contributed to the confusion?

Applied Problems

On a friend’s Facebook post referencing this question, someone had commented, loosely paraphrased: “This sort of question will put students off maths because they tell you the answer. This will make students feel that the maths they have learned isn’t useful.”

I disagree with this sentiment on several levels. Firstly, while no-one wants exams to be crushing for the candidates, nor a tedious regurgitation of routine material, the time to become convinced of the usefulness, relevance and interest of the course material is, well, during the course. It’s hard enough to write questions that test understanding and competence at core material, without adding the constraint that it has to make integration seem important. Hopefully some questions may seem interesting and relevant, but this is neither objective, nor the central ambition of such an exam.

Secondly, interpreting intermediate steps which you are given is a hugely important skill, and in this particular example, the fast-forward mechanism of the question makes it even more clear why the maths they have learned is useful: I know how to differentiate things with square roots, so I can use that new skill to solve problems about crocodiles. While I wonder whether the original plan had been to get candidates to derive T(x), I’m not sure whether this could really be said to be a useful application of maths that they’d learned, since they learned it several years previously.

In some other areas of commentary, it was raised that this is not a realistic model of crocodile mechanics, because it fails to take into account an almost infinite number of other factors. At least the ones that said the zebra would have just run away were joking. Others were not joking, but I feel it is outside the remit of this short post to discuss why simple mathematical models can still be valuable.

‘Too hard’? ‘Grade review needed’?

I did think that the exam as a whole was pretty tricky for this level. Q1, for example, requires quite a lot of careful calculation, and it’s a short exam. The question about frogs iteratively climbing up a wall genuinely requires some thought. Perhaps this is a standard type of problem introduced in the syllabus, but in either case it’s a nice question. Thinking about functions and their fixed points and monotonicity, and convergence (or not) towards fixed points is a really useful thing to think about, much more so than calculations like eigenvectors for a 3×3 matrix which certainly used to be popular on some exam boards.

So the crocodile question needs to be seen in the context of a paper where even the strongest candidates would be under a lot of time pressure once they reached the final problems. In other words, it was a hard paper, and I personally believe it was hard in a good way. Despite this mild confusion for some about what they needed to do on the crocodile question, in general both competence with the course material and creativity in problem solving should be rewarded by this paper.

So it’s a shame to see all the usual clichés about needing to re-think the grades coming out. I should admit that I haven’t checked the full story, but my impression from the quotations given by the SQA is that they wanted a harder exam, not fewer top grades. I don’t see any reason why performance on this harder exam would be less strongly correlated with a candidate’s understanding of the course material and general problem-solving ability than performance on a more routine exam, such as many were clearly expecting. In particular, I doubt many students performed dramatically differently (in relative terms) on the harder exam than they would have done on the paper they were expecting. Maybe the variance is slightly higher, since maybe some questions have more of a you-either-see-it-or-you-don’t flavour? But I don’t see any real case for this, and it’s somewhat counter-balanced by the greater variety of questions (if the implication is that you don’t have to do them all to do very well).

I can also sympathise with students who need a top grade to go to university. Traditionally, they would complete most of the questions on the paper because they had prepared carefully and had a good idea what sort of things might come up. Concerns about having done ‘well enough’ would still exist, but at least they knew what the parameters were, roughly, and could be fairly confident they’d done ‘enough’.

However, that’s not a good enough reason not to set harder problems. No test can be a perfect measure of ability, since essentially by definition no such thing exists. I’m afraid it’s automatic that for all the students surprised by the change of tone of some of these questions, there is an equal and opposite set of students who, though perhaps they didn’t realise it at the time, actually thrived on the chance to address more demanding material. And their achievement should be celebrated, and stand as a message to next year’s students and beyond that a few hard questions don’t mean everyone must fail, either to get the grade they want, or to engage with good mathematics.

Birthday Coincidences and Poisson Approximations

This morning, Facebook was extremely keen to remind me via every available medium that four of my friends celebrate their birthday today. My first thought was that I hope they all enjoy their day, and my second thought was to ask what the chance of this was. I have about 200 Facebook friends, and so this struck me as an unlikely occurrence. But this problem has form, and it felt worthwhile to try some calculations to see if my intuition was well-founded.


Siméon Denis Poisson celebrated his 234th birthday on 21st June this year.

The classical birthday problem

The starting point is the question: how many friends do you have to have before you expect to start seeing anyone sharing a birthday? There are a ridiculous number of articles about this on the web already, so I will say little, except that I don’t want to call this the ‘birthday paradox’, because it’s not a paradox at all. At best it might be counter-intuitive, but then the moral should be to change our intuition for this type of problem.

Throughout, let’s discount February 29th, as this doesn’t add much. So then, to guarantee having a shared pair of birthdays, you need to have 366 friends. But if you have a mere 23 friends, then the probability of having some pair that share a birthday is slightly greater than a half. The disparity between these two numbers leads to the counter-intuition. Some people might find it helpful to think that instead of counting friends, we should instead be counting pairs of friends, but I don’t personally find this especially helpful.

For me, thinking about the calculation in very slightly more generality is helpful. Here, and throughout, let’s instead take N to be the number of days in a year, and K the number of friends, or kids in the class if you prefer. Then, as usual, it is easier to calculate the probability that no two share a birthday (that is, that all the birthdays are distinct) than the probability that some two share a birthday. We could think of the number of ways to pick the set of birthdays, or we could look at the kids one-at-a-time, and demand that their birthday is not one of those we’ve already seen. Naturally, we get the same answer, that is

\frac{^N P_K}{N^K} = 1\cdot \frac{N-1}{N}\cdots \frac{N-K+1}{N}.

We’ve assumed here that all birthdates are equally likely. We’ll come back to this assumption right at the end. For now, let’s assume that both N and K are large, and we’ll try to decide roughly how large K has to be in relation to N for this answer to be away from 0 and 1. If we pair opposite terms up, we might approximate this by

\left(\frac{N-\frac{K-1}{2}}{N}\right)^K = \left(1-\frac{K-1}{2N}\right)^K\approx e^{-K^2/2N}.

In fact, AM-GM says that this is an overestimate, and a bit more care can be used to show that it is a good approximation to first order. So we see that if K=\Theta(\sqrt{N}) for large N, we get a non-trivial limit.
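
As a quick sketch comparing the exact product with this heuristic (for N = 365):

```python
import math

def p_all_distinct(K, N=365):
    """Exact probability that K people have K distinct birthdays."""
    p = 1.0
    for j in range(K):
        p *= (N - j) / N
    return p

for K in (10, 23, 50):
    exact_shared = 1 - p_all_distinct(K)
    approx_shared = 1 - math.exp(-K ** 2 / (2 * 365))
    print(K, exact_shared, approx_shared)
# K = 23 is the first value for which the shared-birthday probability exceeds 1/2,
# and the e^{-K^2/2N} heuristic is in the right ballpark throughout.
```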

Challenges for four-way shared birthdays

So the original problem I posed is harder, because there isn’t (unless I’m missing something) a natural way to choose birthdays one-at-a-time, or describe the set of suitable birthday sets. There are two major obstacles in a calculation such as this. Firstly, the overlap of people, that is we might have five or more birthdays overlapping; secondly, the overlap of days, that is we might have several days with four (or more) birthdays. We’ll end up worrying more about the second situation.

We start by eliminating both problems, by asking for the probability that exactly four friends are born on January 1st. The general form of this probability is \frac{\binom{K}{4} }{N^4} \cdot (\frac{N-1}{N})^{K-4}. Now, if K\ll N, this final factor should not be significant. Removing it does not give exactly the probability that at least four birthdays fall on January 1st, but it does turn a lower bound (since {exactly four}\subset{at least four}) into an upper bound (in fact a union bound) for that event. So if the factor being removed is very close to one, we can use whichever expression is more convenient.

In the real life case of N=365, K=200, this term is not negligible. But accounting for this, we get that the probability of exactly four birthdays on 1st January is ~0.0021. Our upper bound on the probability of at least four is ~0.0036.
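
These numbers are quick to reproduce directly; the exact at-least-four probability, which sits between the two, is included as a sanity check.

```python
from math import comb

N, K = 365, 200

# Probability that exactly four of the K friends are born on a fixed day,
# and the union-bound upper estimate obtained by dropping the last factor.
p_exactly_4 = comb(K, 4) * (1 / N) ** 4 * ((N - 1) / N) ** (K - 4)
p_union_bound = comb(K, 4) * (1 / N) ** 4
p_at_least_4 = sum(comb(K, j) * (1 / N) ** j * ((N - 1) / N) ** (K - j)
                   for j in range(4, K + 1))

print(p_exactly_4, p_at_least_4, p_union_bound)   # roughly 0.0021, 0.0024, 0.0036
```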

But now that we know the probability for a given day, can we calculate (1-0.0021)^{365} to estimate the probability that we never have four-overlap? When we did our previous iterative calculation, we were using independence of the different kids’ birthdays. But the event that we have four-overlap on January 1st is not quite independent of the event that we have four-overlap on January 2nd. Why? Well if we know at least four people were born on January 1st, there are fewer people left (potentially) to be born on January 2nd. But maybe this dependence is mild enough that we can ignore it?

We can, however, use some moment estimates. The expected number of days with four-overlap is 365\cdot 0.0021 \approx 0.77. So, by Markov’s inequality, the probability that there is at least one day with four-overlap is at most ~0.77.

But we really want a lower bound. So, maybe we can borrow exactly the second-moment argument we tried (there for isolated vertices in the random graph) in the previous post? Here, the probability that both January 1st and January 2nd are four-overlapping is

\frac{\binom{K}{4}\binom{K-4}{4}}{N^8}\cdot (\frac{N-2}{N})^{K-8}\approx 4.3\times 10^{-6}.

From this, we can evaluate the expectation of the square of the number of days with four-overlap, and thus find that the variance is ~0.74. So we use Chebyshev, calling this number of days #D for now:

\mathbb{P}(\# D=0)\le \mathbb{P}(|\#D - \mathbb{E}\# D|^2 \ge (\mathbb{E}\# D)^2 ) \le \frac{\mathrm{Var} \# D}{(\mathbb{E} \#D)^2}.

In our case, this unfortunately gives us an upper bound greater than 1 on this probability, and thus a lower bound of zero on the probability that there is at least one day with four-overlap. Which isn’t especially interesting…
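
For completeness, here is a sketch of those second-moment numbers computed directly from the per-day probabilities above:

```python
from math import comb

N, K = 365, 200
p1 = comb(K, 4) * (1 / N) ** 4 * ((N - 1) / N) ** (K - 4)                     # one fixed day
p2 = comb(K, 4) * comb(K - 4, 4) * (1 / N) ** 8 * ((N - 2) / N) ** (K - 8)    # two fixed days

mean = N * p1                                   # E[#D], roughly 0.78
variance = mean + N * (N - 1) * p2 - mean ** 2  # roughly 0.74
print(mean, variance, variance / mean ** 2)     # the Chebyshev bound exceeds 1, so it is vacuous
```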

Fairly recently, I spoke about the Lovasz Local Lemma, which can be used to find lower bounds on the probabilities of intersections of events, many of which are independent (in a particular precise sense). Perhaps this might be useful here? The natural choice of ‘bad event’ is that a particular 4-set of people shares a birthday. There are \binom{K}{4} such events, and each is independent of the collection of \binom{K-4}{4} events indexed by disjoint 4-sets. Thus we can consider using LLL if e\cdot (\binom{K}{4}-\binom{K-4}{4})\cdot 0.0021 \le 1. Unfortunately, this difference of binomial coefficients is large in our example, and so in fact the LHS has order 10^4.

Random number of friends – coupling to a Poisson Process

All of these methods failed because without independence we had to use estimates which were really not tight at all. But we can re-introduce independence if we remove the constraints on the model. Suppose that instead of demanding I have K friends, I demand that I have a random number of friends, with distribution Poisson(K). Now it is reasonable to assume that, for each day, the number of friends with that birthday is Poisson(K/365), independently across days.

If we end up having exactly K friends with this random choice, then the distribution of the number of 4-overlap days is exactly the same as in the original setup. However, crucially, if we end up having at most K friends with this random choice, the distribution of the number of 4-overlap days is stochastically dominated by the original distribution. So instead let’s assume we have Poisson(L) friends, where L<K, and see how well we can do. For definiteness, we’ll go back to N=365, K=200 now. Let’s write X for the distribution of birthdays in the original model, and \Xi for the distribution of birthdays in the model with a random number of friends.

Then

\mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) = 1- \mathbb{P}(\mathrm{Po}(L/365)\le 3)^{365}. (*)

Now we can write the crucial domination relation as

\mathbb{P}(\exists \ge 4\text{-overlap in }X)\ge \mathbb{P}( \exists \ge 4\text{-overlap in }\Xi \,|\, |\Xi|\le 200),

and then use an inequality version of the law of total probability to bound further as

\ge \frac{ \mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) - \mathbb{P}(|\Xi|>200)}{\mathbb{P}(|\Xi|\le 200)}.

This is a function of L, and in principle we could find its maximum, perhaps as N\rightarrow\infty. Here, though, let’s just take L=365/2 and see what happens. For (*) we get ~0.472.

To estimate \mathbb{P}(\mathrm{Po}(365/2)>200), observe that this event corresponds to about 1.3 standard deviations above the mean, so we can approximate using quantiles of the normal distribution, via the CLT. (Obviously this isn’t completely precise, but it could be made precise if we really wanted.) I looked up a table, and this probability is, conveniently for calculations, roughly 0.1. Thus we obtain a lower bound of \frac{0.472-0.1}{0.9}. Allowing for the fairly weak estimates at various points, we still get a lower bound of around 0.4. Which is good, because it shows that my intuition wasn’t right, but that I was in the right ball-park for it being a ‘middle-probability event’.
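
Here is a sketch of the whole bound computed directly, using exact Poisson tails (summed on the log scale to avoid overflow) instead of the table lookup:

```python
import math

def pois_cdf(k, lam):
    """P(Poisson(lam) <= k), summing terms on the log scale to avoid overflow."""
    return sum(math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
               for j in range(k + 1))

N, K, L = 365, 200, 365 / 2

p_star = 1 - pois_cdf(3, L / N) ** N   # (*): roughly 0.47
p_gt_K = 1 - pois_cdf(K, L)            # P(Po(182.5) > 200), close to the quoted 0.1
lower_bound = (p_star - p_gt_K) / (1 - p_gt_K)
print(p_star, p_gt_K, lower_bound)     # the bound comes out a little above 0.4
```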

Remarks and References

– The reason for doing the upper bound for the probability of exact 4-overlap is that the same argument for at-least-4-overlap would have given an upper bound of 1. However, this Poisson Process coupling is also a much better method for obtaining an upper bound on either event.

– Birthdays are not uniformly distributed through the year. The deviation is strong enough that even from the set of birth frequencies (rather than the sequence of birth frequencies), we can reject a null hypothesis of uniformity. Early September is pretty close to the maximum. Two comments: 1) this is the time of year where small variations in birth date have a big effect on education, especially in primary school; 2) we are 37 weeks into the year…

– It is known that 187 friends is the first time the probability of having at-least-4-overlap is greater than ½. You can find the full sequence on OEIS as A014088. I used to have about 650 Facebook friends, before I decided that I’d prefer instead the pleasant surprise of finding out what old acquaintances were up to when I next spoke to them. In this case, the median of the distribution of the largest number sharing a birthday would be seven.

– Eric Weisstein’s article on Mathworld is, in my opinion, the best resource for a mathematician on the first two pages of Google hits by at least two orders of magnitude. In the notation of this article, we were calculating P_4(n=200,d=365). There are also some good general asymptotics, or at least recipes for asymptotics, in equations (17) and (18).

– The paper Methods for Studying Coincidences by Diaconis and Mosteller is, as one might expect, extremely readable, and summarises many results and applications, including several generalisations.

EGMO 2015

It’s been a while since I last wrote anything substantial. There have been some posts in the pipeline, but mainly I’ve been working hard on technical things that don’t translate very well into blog posts, and when I haven’t been working on those, have felt like doing non-maths.

Anyway, among other things which happened recently were the UK’s IMO selection camp in Cambridge during the last week of March, and the fourth European Girls’ Mathematical Olympiad in Belarus this past weekend. At the former, I was quite busy organising various things, and so gave a session based on some of my favourite problems about points and lines that I’ve used in the past. My hope was that with each problem in turn the students would actively invest time in thinking about whether the ideas each other were having seemed likely to be profitable. The aim was that being critical about your approach is a necessary skill once you start having lots of ideas for approaches.

This is hard to practise at school just by reading around, whether regarding competition material or generally. In the competition world, official solutions often contain unmotivated magic. This is reasonable, since they are supposed to be a minimal elementary demonstration of the problem’s validity. Motivation is a luxury which space and time sometimes doesn’t permit. And the solutions you find on, for example, Mathlinks often give the impression that the solvers know how to do 25,000 specific types of problem, and the sole task is to identify which type the latest one belongs to. Relating problems to ones you’ve seen before is important, but can hide, or at least dilute the actual ideas in some cases. Knowing that a specific inequality is a special case of a big machine allows you to claim a ‘solution’ but doesn’t help you identify the relevant ideas.

Later in the camp, a conversation arose about whether the younger staff, having now experienced various levels of higher education in mathematics, found these elementary methods and problems easier than they had at school. It’s a complicated question, and I don’t think there’s a simple answer. I think the students might suspect that doing a university degree teaches you ‘advanced techniques’ which immediately simplify some of these problems. In rare examples this can be the case, but the majority of the time, I personally feel the only advantage I have is perhaps better instincts for whether something is or isn’t going to work.

Probably a good test would be Euclidean geometry. Adult olympiad staff typically come in two flavours: those who used to be good at geometry and feel out of practice; and those who weren’t good at geometry and certainly had no inclination to remedy this after they left school. I’m probably closer to the first category and I definitely feel out of practice, but also have minimal inclination to remedy this. Nonetheless, on the rare occasions I try these questions (and it’s not going to be for 4.5 hours at this stage…) my progress rate is probably comparable to when I was in sixth form. I’ve no idea how to turn this into a more concrete assessment, but there must be something about doing abstract maths all the time that prevents you forgetting how to do this, so presumably it should be at least as helpful in the types of problem with non-zero overlap with things I actually do. I won’t discuss geometry in the rest of this post, but I did also enjoy the geometry questions – it’s simply that I feel anything I have to say about them would be less useful than saying nothing.

In any case, I enjoyed trying the problems from Belarus in between bouts of typing, and thought I would write some brief comments about how I decided whether my ideas were good or not. To achieve this, I’ve looked at my rough, and will tell you the ideas which didn’t work, as well as some of the ones which did. I’ve paraphrased the statements slightly to avoid having too many LaTeX tags.

WARNING: what follows will spoil questions {2,4,5} if you haven’t already looked at them, but would like to.

Question 2 – A domino is a 2 × 1 or 1 × 2 tile. Determine in how many ways exactly n^2 dominoes can be placed without overlapping on a 2n × 2n chessboard so that every 2 × 2 square contains at least two uncovered unit squares which lie in the same row or column.

The wording of the question prompted me to consider splitting the board naturally into n^2 2 x 2 squares. I then thought about this ‘at least’ in the question, and concluded that for these 2 x 2 squares, it should be ‘exactly’.

I tried doing an unusual colouring, in which I coloured half the black squares green and half blue, and tried to show that either only greens or only blues were covered, but this was clearly not true, or fixable. I then tried to do something similar for the other set of 2 x 2 squares (those whose vertices have (odd, odd) coordinates). Roughly speaking, if too few of the outer cells on the original board are covered, you can’t achieve the bounds on these odd inner squares. But this didn’t really give any useful information.

However, it did get me thinking about how what happens in the far top-left affects the rest of the board, and from there most of the ideas came quickly. I described a 2 x 2 square as N, E, S, W depending on which ‘half’ of the square was covered. Then if a square is N, all the squares directly above it must also be N (*).

I then made two mistakes, and if we’re looking for reasons where experience is helpful, it was probably here, as I spotted them fairly quickly, rather than wasting ages and ages.

First, I decided that either all squares were {N,E} or all were {S,W}, which seemed plausible when I had been focusing on the top-left. This gave an answer of 2 \binom{2n}{n}, but after a bit more thought this is clearly not symmetric enough.

Second, I thought condition (*) might be the only constraint, along with similar ones for E, S, W naturally too. I tried to count these using a similar method of enumerating lines between these regions, and I realised I didn’t know how to do these sensibly, for example if it looked like this:

[EGMO 2015 Q2 diagram]

This led to another bit of meta-maths that I probably wouldn’t have considered if I was 16. Namely, that the idea of counting these monotone lines across the 2n x 2n grid was too nice not to be useful. Also, if I couldn’t see a way to adapt it for this more complicated setup, my setup was probably wrong. This thought was useful, and then by thinking about the interface between {N,E} and {S,W}, then the other way round, it made sense that the configuration could be parameterised by two monotone lines between different pairs of corners, giving an answer of \binom{2n}{n}^2.

So, if it’s possible to give a reason why I could do this, it’s probably because I had an arsenal of ways to check an answer for plausibility, which saved wasting time on something wrong, and also because I trusted that the answer would be nice, which saved wasting time on something which was also wrong, and would probably have been very complicated to resolve.
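
As a sanity check on the answer, here is a minimal brute-force count for small n, under the reading that ‘every 2 x 2 square’ means every contiguous 2 x 2 block of the board:

```python
from itertools import combinations
from math import comb

def count_configs(n):
    """Count placements of n^2 non-overlapping dominoes on a 2n x 2n board such that
    every contiguous 2 x 2 block has two uncovered cells in a common row or column."""
    size = 2 * n
    dominoes = []
    for r in range(size):
        for c in range(size):
            if c + 1 < size:
                dominoes.append(((r, c), (r, c + 1)))
            if r + 1 < size:
                dominoes.append(((r, c), (r + 1, c)))

    def valid(covered):
        for r in range(size - 1):
            for c in range(size - 1):
                unc = [(r + dr, c + dc) for dr in (0, 1) for dc in (0, 1)
                       if (r + dr, c + dc) not in covered]
                if not any(a[0] == b[0] or a[1] == b[1]
                           for i, a in enumerate(unc) for b in unc[i + 1:]):
                    return False
        return True

    count = 0
    for combo in combinations(dominoes, n * n):
        covered = {cell for d in combo for cell in d}
        if len(covered) == 2 * n * n and valid(covered):   # no overlapping dominoes
            count += 1
    return count

for n in (1, 2):   # larger n is far too slow for this naive enumeration
    print(n, count_configs(n), comb(2 * n, n) ** 2)   # the two counts should agree
```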

Question 4 – Determine whether there exists an infinite sequence a_1,a_2,\ldots of positive integers satisfying a_{n+2}=a_{n+1}+\sqrt{a_{n+1}+a_n}.

So, my first reaction was ‘no way’. Why? Because everything’s determined by the first two terms, and I couldn’t think of any reason why a cool choice of the first two terms would force all of the sums a_{n+1}+a_n to be perfect squares. It seemed just about possible that we could have arbitrarily long finite sequences with this property. (Though this also turns out to be impossible.)

I imagine many contestants might have felt similarly and spent a while playing round with quadratic residues to get a sense for exactly how hard it is to make this work for the initial terms. But I was suspicious of the form of the recurrence. We know that if it had been defined linearly, then the terms would grow exponentially, but it’s natural to ask roughly how fast they grow in this example, even relaxing the restriction that the terms be integers.

The first observation was that they are indeed (strictly) growing! But how fast? Are there actually enough squares that every a_{n+1}+a_n can be a different one? Note that the squares themselves satisfy a similar recurrence relation (N+1)^2 = N^2 + 2\sqrt{N^2} + 1 > N^2 + \sqrt{ N^2 + N^2}. So this seemed like a very good idea, and my instinct was that this should work, and I felt glad that I hadn’t pursued the quadratic residues approach.

From here we were basically home. I asked whether the sequence grows faster than the sequence (\frac{n^2}{2}), and the answer was no as

a_{n+2}\le a_{n+1}+ \sqrt{2 a_{n+1}} \le (\sqrt{a_{n+1}} + \frac{1}{\sqrt 2})^2,

so if (after translating indices) a_n = \frac{x^2}{2}, we have a_{n+1}\le \frac{(x+1)^2}{2}. This is clearly a problem, or at best a very tight constraint, if all the \{a_n+a_{n+1}\} have to be perfect squares, so even though we aren’t completely finished, I am confident I can be in one or two lines, with a bit of care. Fiddling with small values of n looked like it would work; alternatively, looking far enough along the sequence that the effect of the first two terms has dissipated, we must by the pigeonhole principle eventually have a_{n+2}+a_{n+1}=(k+1)^2,\, a_{n+1}+a_n=k^2, which is enough by a parity argument, using the original statement.

This final bit looks and feels a bit messy, but by this stage it really is just finding any argument to justify why a sequence which grows at most as fast as n^2 can’t actually be n^2 eventually.
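
As a numerical sketch of that growth claim (the starting values below are arbitrary choices of mine, and the integrality condition is deliberately ignored):

```python
import math

# Iterate the recurrence with real values to see the quadratic growth rate.
a = [5.0, 7.0]   # arbitrary starting values
for _ in range(10 ** 4):
    a.append(a[-1] + math.sqrt(a[-1] + a[-2]))

n = len(a)
print(a[-1], n * n / 2, a[-1] / (n * n / 2))
# The ratio is close to 1: the sequence grows like n^2/2 up to lower-order terms,
# which is what makes demanding a perfect square at every step so restrictive.
```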

Probably the reason I did this quickly was because I trusted my instinct for ‘no’, and also because there seemed a quick way to quantify *roughly* how this sequence grew. Sensibly approximating things, and deciding whether it’s appropriate to approximate them, is definitely something I feel I’ve got better at during my PhD, so I guess this was helpful; the important condition that the elements were integers could then be thrown back in at the very end.

Question 5 – Anastasia partitions the integers [1,2m] into pairs. Boris tries to choose one integer from each pair so that the sum is n. Given n and m, prove Anastasia can select pairs so that Boris can’t succeed.

So I wasted some thought time by imagining that n was fixed and trying to come up with ideas for the choice of pairs which might avoid n. It struck me that there was no reason why a typical (uniformly chosen) pairing should avoid any particular n unless this value was particularly big or small.

How big or how small? Well, Boris can always choose the bigger element of a pair, so the way to make his maximum achievable sum as small as possible is to pair as (1,2), (3,4), … , (2m-1,2m). Conveniently, this also makes his minimum achievable sum as large as possible. These extremes can be calculated as m(m+1) and m^2 respectively. Suddenly this seems great, because we’ve actually solved the problem for a huge range of n, ie everything not between these extrema.

The key step here was to start considering special pairings, chosen to give a reduced set of possible sums. Once we’ve had this idea, it makes sense to consider other different special pairings. The reason we got a small set of possible sums is that there’s lots of overlap. We can achieve lots of overlap by looking at the difference between elements in a pair, and making as many of these the same as possible. For, consider pairs (a,a+d), (b,b+d). Then it’s the same to take a and b+d as to take a+d and b, so we get the same sums in lots of different ways.

The other way to have as many differences the same as possible is to go for (1,m+1), (2,m+2), … , (m,2m). Conveniently, we can parameterise the sums now because at each choice, we decide whether to add an extra m or not, so the sum must be 1+2+…+m = \binom{m+1}{2}, plus some multiple of m. So we can defeat Boris, except when n is of the form \binom{m+1}{2}+km.

This is a good point to stop because what followed was basically book-keeping. We only have to consider a couple of specific cases when m is odd, and one when m is even, that happen not to fall into either of the categories we can now deal with. It wasn’t too hard to corrupt the examples we already have to deal with these.

The key here was responding to initial failure by trying to find any control at all over n. Perhaps given enough time I would have started trying special pairings? Equally, I might have tried small examples, which would not really have been hugely helpful for this question. In any case, trying to eliminate very small or very large n luckily worked well, as a) I didn’t need to use the word ‘very’ twice in the end; and b) the idea of looking at choices of pairings to minimise the number of sums quickly gave other useful deductions.
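
A brute-force check of the full statement for small m is easy to write and reassuring (the values of m iterated over below are my choice; the enumeration becomes slow quickly as m grows):

```python
def all_pairings(items):
    """All ways to split a list of even length into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        for sub in all_pairings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + sub

def achievable_sums(pairing):
    """All sums Boris can reach by taking one element from each pair."""
    sums = {0}
    for a, b in pairing:
        sums = {s + a for s in sums} | {s + b for s in sums}
    return sums

def anastasia_wins_for_every_n(m):
    """Check that for every target n, some pairing of [1, 2m] avoids the sum n."""
    sums_per_pairing = [achievable_sums(p)
                        for p in all_pairings(list(range(1, 2 * m + 1)))]
    # outside [m^2, m(m+1)] the pairing (1,2),(3,4),... already blocks Boris
    return all(any(n not in sums for sums in sums_per_pairing)
               for n in range(m * m, m * (m + 1) + 1))

for m in (2, 3, 4, 5):
    print(m, anastasia_wins_for_every_n(m))   # expect True in each case
```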

I also really enjoyed Question 3, though was suspicious that I’d used a bound a great deal weaker than one in the question. Along the way, I was trying something confusing and not especially useful that led me into some interesting theory about Latin squares, which I might write something about later in the week.

 

100k Views

When I started this blog, I implicitly made a promise to myself that I would be aiming for quality of posts rather than quantity of posts. Evaluating quality is hard, whereas one always feels some vague sense of purpose by seeing some measure of output increase. Nonetheless, I feel I have mostly kept this promise, and haven’t written that much solely for the sake of getting out another post. This post is something of an exception, since I noticed in the office on Friday that this blog was closing in on 100,000 page views. Some of these were not actually just me pressing F5 repeatedly. Obviously this is no more relevant an integer to be a threshold than any other, and one shouldn’t feel constrained by base 10, but it still feels like a good moment for a quick review.

Here are some interesting statistics over the (3+\epsilon) years of this website’s existence.

  • 175 posts, not including this one, or the three which are currently in draft status. This works out at about 4.7 posts a month. By some margin the most prolific period was May 2012, when I was revising for my Part III exams in Cambridge, and a series of posts about the fiddliest parts of stochastic calculus and network theory seemed a good way to consolidate this work. I’ve learned recently that PhDs are hard, and in fact it’s been a year since I last produced at least five posts in a month, if you discount the series of olympiad reports, which though enjoyable, don’t exactly require a huge amount of mathematical engagement.
  • By at least one order of magnitude, the most viewed day was 17th August 2014, when several sources simultaneously linked to the third part of my report on IMO 2014 in Cape Town. An interesting point to note is that WordPress counts image views separately to page views, so the rare posts which have a gallery attached count well in a high risk / high return sense. In any case, the analytics consider that this day resulted in 2,366 views by 345 viewers. During a typical period, the number of views per visitor fluctuates between roughly 1.5 and 1.8, so clearly uploading as many photos of maths competitions as possible is the cheap route to lots of hits, at least by the WordPress metric.

[Chart: monthly page views since late 2012]

  • One might well expect the distributions involved in such a setup to follow a power-law. It’s not particularly clear from the above data about views per month since late 2012 whether this holds. One anomalously large data point (corresponding to the interest in IMO 2014) does not indicate a priori a power law tail… In addition, there is a general upward trend. Since a substantial proportion of traffic arrives from Google, one might naively assume that traffic rate might be proportional to the amount of content, which naturally will grow in time, though it seems impractical to test this. One might also expect more recent posts to be more popular, though in practice this seems not to have been the case.
  • The variance in popularity of the various posts has been surprisingly large. At some level, I guess I thought there would be more viewers who browse through lots of things, but such people would probably do so from the home page, so it doesn’t show up as viewing lots of different articles. There is some internal linking between some pages, but not enough to be a major effect.
  • At either end of the spectrum, a post about coupling and the FKG inequality has received only 16 views in 2.5 years, while a guide to understanding the Levy-Khintchine formula has, in slightly less time, racked up 2,182 hits. There are direct reasons for this. Try googling Levy-Khintchine formula and see what comes up. In a sense, this is not enough, since you also need people to be googling the term in question, and picking topics that are hard but not super-hard at a masters level is probably maximising interest. But I don’t have a good underlying reason for why some posts should end up being more Google-friendly than others.
  • In particular, quality of article seems at best loosely correlated with number of views. This is perhaps worrying, both for my reputation, and for the future of written media, but we will see. Indeed, two articles on the Dubins-Schwarz theorem and a short crib sheet for convergence of random variables, both get a regular readership, despite seeming to have been written (in as much as a blog post can be) on the back of an envelope. I also find it amusing that the Dubins-Schwarz theorem is always viewed at the same time of the year, roughly mid-February, as presumably it comes up in the second term of masters courses, just like it did in my own.
  • By contrast, I remain quite pleased with the Levy-Khintchine article. It’s the sort of topic which is perfectly suited to this medium, since most books on Levy processes seem to assume implicitly that their readers will already be familiar with this statement. So it seemed like a worthwhile enterprise to summarise this derivation, and it’s nice to see that others clearly feel the same, and indeed I still find some posts of this flavour useful as revision for myself.

[Chart: log of page view counts for the mathematical articles with at least 200 hits]

  • This seemed like a particularly good data set in which to go hunting for power-laws. I appreciate that taking a print-screen of an Excel chart will horrify many viewers, but never mind. The above plot shows the log of page view values for those mathematical articles with at least 200 hits. You can see the Levy-Khintchine as a mild anomaly at the far left. While I haven’t done any serious analysis, this looks fairly convincing.
  • I haven’t engaged particularly seriously in interaction with other blogs and other websites. Perhaps I should have done? I enjoy reading such things, but networking in this fashion seems like a zero-sum game overall except for a few particularly engaged people, even if one gets a pleasing spike in views from a reciprocal tweet somewhere. As a result, the number of comments and out-going clicks here is essentially negligible.
  • Views from the UK narrowly outnumber views from the US, but at present rates this will be reversed very soon. I imagine if I discounted the olympiad posts, which are sometimes linked from UK social media, this would have happened already.
  • From purely book-keeping curiosity, WordPress currently thinks the following countries (and territories – I’m not sure how the division occurs…) have recorded exactly one view: Aaland Islands, Afghanistan, Belize, Cuba, Djibouti, El Salvador, Fiji, Guatemala, Guernsey, Laos, Martinique, Mozambique, New Caledonia, Tajikistan, US Virgin Islands, and Zambia. Visiting all of those would be a fun post-viva trip…

Conclusion

As I said, we all know that 100,000 is just a number, but taking half an hour to write this has been a good chance to reflect on what I’ve written here in the past three years. People often ask whether I would recommend that they start something similar. My answer is ‘probably yes’, so long as the writer is getting something out of most posts they produce in real time. When writing about anything hard and technical, you have to accept that until you become very famous, interest in what you produce is always going to be quite low, so the satisfaction has to be gained from the process of understanding and digesting the mathematics itself. None of us will be selling the musical rights any time soon.

I have two pieces of advice to anyone in such a position. 1) Wait until you’ve written five posts before publishing any of them. This is a good test of whether you actually want to do it, and you’ll feel much more plausible linking to a website with more than two articles on it. 2) Don’t set monthly post count targets. Tempting though it is to try this to prevent your blog dying, it doesn’t achieve anything in the long term. If you have lots to say, say lots; if you don’t, occasionally saying something worthwhile feels a lot better when you look back on it than producing your target number of articles which later feel underwhelming.

I don’t know whether this will make it to 6+2\epsilon years, but for now, I’m still enjoying the journey through mathematics.

Grade Reparameterisation – A Free Lunch?

The debate has been hotting up this week in the discussion of the Scottish independence referendum, and with it has come the inevitable barrage of questionable arguments, questionable statistics and questionable logic. Nonetheless, at least lots of people seem to be questioning all of the above fairly carefully. Instead I want to question some of the logic and philosophy of a BBC report, which discusses a change to the grade divisions for GCSEs in some subjects, to be introduced from 2017. The article can be found here.

First a caveat. Picking logical holes in arguments for educational reform is like taking candy from a baby in the sense that it’s normally quite easy, and not usually entertaining or instructive for either party. This article is at least a long way short of the infamous comment from someone at the Institute of Physics that “nearly half of [schools] are actually doing worse than average” (source). In that case, a discussion of gender inequality ends up more like a discussion of Markov’s inequality.

The exact nature of the grade reparameterisation is not hugely important. Instead of indexing grade classes by the set \{A*,A,\ldots,G,U\}, the set is to be \{1,2,\ldots,9\}, with 9 the top grade. In the sequel, I’m going to talk about A-levels, because I know very slightly more about that, but everything relevant applies equally well to GCSEs. I want to discuss the request by the general secretary of the Association of School and College Leaders that “students must not be disadvantaged by the change in grading.” My claim is that this is not possible under any change in grading.

We need to consider what the value of a grade is. If different grades make no difference to a student, then by definition that student can’t be disadvantaged by a change in grading procedure. The grade gives an approximate measure of the student’s ability in the subject in question. There are several reasons why it is approximate.

Firstly, marks in exams are not a perfect measure of ability, if there even is an absolute underlying idea of ‘ability’. Some people will under-perform in a given exam, and some people will over-perform. Note if someone feels they always under-perform, a law of large numbers suggests that in fact it is their prediction skills that are below average. This is a fundamentally different question to students who don’t prosper in exam conditions (or prosper excessively!). This is a problem, but not a problem that anyone expects to be solvable by grade reparameterisation.

Secondly, a grade is an approximate measure because it represents a range of possible marks. If you believe you are the best student at a particular subject in the country, you are probably misguided, but in any case, you cannot demonstrate this quality in your GCSE grade. The best you can hope for is to get the top grade, and thus be considered first equal with tens of thousands of others. In conclusion the importance of your grade is entirely a function of how it is interpreted.


Anyway, suppose we are organising a shift in a grade boundary. To make life easy, we only adjust the boundary between the top two grades, which are called Red and Blue in accordance with available board markers. We are describing exam score as a number between 0 and 1, rather than an integer between 0 and 72 or whatever. We focus on students achieving the top Red grade. Who has been disadvantaged in the change of grading? Well naturally the portion of students who are now scoring Blue, whereas previously they were Red. Admittedly they were on the border before, and they are still on the border, just on the side with everyone who scored lower this time. So it looks like no-one has gained from this.

But this is patently false. We’ve said that value of the Red grade is a function of how it is interpreted. To put this into a more realistic context, imagine an employer or, whisper it softly, a university admissions officer, looking at a CV announcing a Red grade in maths. The employer has to make a judgement on the student’s true maths ability (whatever that means) based on this Red grade. What factors are relevant?

  • How hard is the exam? Scoring top marks on GCSE maths and scoring top marks on the International Maths Olympiad are both excellent achievements, but in some contexts, the latter would stand out more. The value of the grade is an increasing function of the difficulty of the exam. (I’m assuming there is no rescaling in this toy model, but the logic is preserved under rescaling.)
  • How wide is the grade interval? If any mark gave you a Red grade, it wouldn’t signify anything. If Red indicates 70%-100%, you have to assume that the candidate will on average be somewhere in the middle. This is not uncharitable, merely realistic given an absence of more precise information. The result of this is that the value of a Red grade is a decreasing function of the width of the grade interval.

Thus, so long as the difficulty of the exam remains constant (which is up for discussion, but not here), all the students who get a Red grade under the new regime have gained an advantage from the change. In conclusion, this is a zero-sum game. Some students will benefit, others will be at a disadvantage, and so my claim holds, at least within the framework of this reasonable model.

I have interpreted literally a statement that was probably not intended to be interpreted literally. But this is not about point-scoring pedantry. It’s more that this sort of vagueness on matters that could be precise distracts from more useful statements. The person I quoted finishes by saying:

“What is important is that Ofqual sets out very clearly to teachers and students what is needed to achieve a specific grade. This is not the same as simply describing what statistical proportion of pupils will achieve a grade. Employers need a clear message that if a student has achieved a particular grade, it means that they have a certain skill or knowledge level.”

Let’s see how this works in practice at the moment by looking up the OCR Mathematics A-level specification. Examine pages 31 and 32. This describes verbally the standard expected of students to achieve grades A, C and E. I feel I am not being uncharitable if I say that the algorithm to get from grade A to the others is to replace the words “almost all” and “high” with “most” and “reasonable”, and to drop in the adverbs “usually” or “sometimes” fairly freely.

This illustrates the underlying problem. This phenomenon is probably more apparent in maths than in subjects where quality of writing is a factor. Ultimately the difference between an A and an E is that the students were taught the same material, but one found at least some parts of it more challenging, or at least was slower and less accurate at answering questions about it in the exams. The notion that a grade might give a clear qualitative indication of ability at a more sophisticated level than the above is both very challenging to realise, and almost completely independent of where the grade boundaries sit. If this is genuinely what is desired, it makes sense to focus on this rather than fiddling with boundaries, or complaining about other people fiddling with boundaries.

A Month in Vancouver

I’ve spent the past ~3 weeks at the University of British Columbia in Vancouver, working on some projects related to forest fires and frozen percolation with Balazs Rath. In advance, I’d imagined that this might be a prolific time for new articles on this blog, but in the end it was more like the exact opposite. After spending all of every day playing with technical details on our models (which do not lend themselves naturally to informal writing…) if I felt like doing maths in the evenings, it was more important to keep working on the questions in hand.

Anyway, I had a great time mathematically and otherwise, so I feel a shameless tourism post is justified to report what I enjoyed most about Vancouver.

Asian Food

It sounds ridiculous to return from a trip and reference this as the highlight, but the most noticeable long-term result of a month in Vancouver is that I will probably never be able to order sushi in the UK again. The typical house roll cost about £3 and contained so much seafood that it was essentially impossible to manipulate. Or maybe that was just my (lack of) chopstick skills? Anyway, as I’ve observed in other parts of Canada, groceries are bizarrely expensive relative to eating out, at least by comparison with Europe. By my third week I had basically given up and found a different East Asian cuisine every other night. I ate Japanese, Chinese, Korean, Thai, Malaysian, Taiwanese, Vietnamese and others that didn’t feel constrained by a single national identity. All were uniformly excellent, unfailingly welcoming, and made commendable efforts to include smoked salmon in every single dish.

Outdoor Pursuits

For a country where a donut shop is an integral part of the national identity, the total absence of fat people was immediately noticeable. In their place was a city where everyone seemed to go running, hiking, skiing, quad-biking, skidoo-ing and so on at the slightest opportunity. I haven’t seen that much lycra since breakfast in college on the morning of May Bumps…

For a major world city, the wilderness was tantalisingly close, and remarkably accessible. The public transit system spread far into the mountains and up the coast, and offered you any 90 minutes’ worth of travel for $2. A breezy ferry trip across the sound to Bowen Island offered the chance to hike up Mt Gardner, an unrelenting and totally solitary trek up through the pine forest. I was conscious that if I slipped somewhere, there wouldn’t be much left to find once the bears had got to me, but the views over the inlet from the summit made it all worthwhile. A trip to Mt Seymour on Victoria Day, then down past the vertigo-inducing Lynn Suspension bridge and a trail back towards civilisation was equally successful. On both occasions, I felt I’d very much earned that donut.

UBC

Of course, the main reason for the trip was to do worthwhile mathematics, and I certainly haven’t experienced a nicer university campus on which to do exactly that. A five-minute walk and 500 steps away from the wisteria-covered maths department is the 7km long Wreck Beach, a sunny haven for Vancouver’s hipsters and nudists, protected from direct exposure to the Pacific by Vancouver Island. Of more interest to me were the 30km or so of trails through the forest in the Pacific Spirit Regional Park, directly across the road from where I was staying in Wesbrook Village at the south end of the campus.

Most importantly, I’m extremely grateful for the invitation to work at UBC for these past few weeks. It’s been productive, and an excellent opportunity to meet some new people and learn about some interesting aspects of probability. I very much hope an opportunity will arise to go back soon!

The Top-to-Random Shuffle

This article is based on a talk I gave to the Maths Society at St Paul’s School on Monday. It may turn into a short series if I have time before I go to ALEA in Luminy near Marseille on Saturday.

My original plan had been to talk about riffle-shuffling, and some of the interesting mixing time themed results one can obtain. As a motivating example, I began by discussing the simpler top-to-random shuffle, and this proved sufficiently interesting to occupy the time I had been allowed (and mea culpa a bit more). It therefore seems worth writing a hopefully moderately accessible blog post on the subject. The aim of this post at least is to discuss the idea that repeatedly shuffling brings a pack of cards close to randomness. We have to settle on a definition of ‘close to randomness’, and find some ways to calculate this.

Suppose we are playing some bizarre card game in which three cards labelled, uncontroversially, 1, 2 and 3 need to be placed in a random order. If we are organised, we can write down all the ways to do this in a list:

123, 132, 213, 231, 312, 321.

We want to select each of these with equal probability. We could for example use a dice. Most relevantly, even a computer as ancient as my laptop is very happy simulating a random choice from this set. (Now is not the time to talk about exactly how pseudo-random or otherwise this choice would be.)
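For what it’s worth, here is the one-line version of that simulation in Python, just picking uniformly from the list above:

```python
import random

arrangements = ["123", "132", "213", "231", "312", "321"]
print(random.choice(arrangements))   # each arrangement appears with probability 1/6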

Of course, when we play a sensible card game we have not three cards, but fifty-two. So the approach described above still works in theory, but no longer in practice, as the list of possible arrangements now has size 52!. Recall this is defined to be

52!=1\times 2 \times\ldots \times 52.

The reason we get this particular expression is that when we are choosing the first card, we have 52 possible choices. Then, regardless of what this first card actually is, there are precisely 51 cards left from which to choose the second card. So there are 52×51 ways to pick the first two cards in the arrangement, and so on, giving the answer. We can approximate how large 52! is by counting powers of ten rather crudely: it turns out to be about 10^{68}. Note that the number of atoms in the universe is *only* about 10^{80}, so if we are going to write down this list, we had better have very compact handwriting! But being serious, this number is way too large to realistically compute with, so we have to come up with some cleverer methods.
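If you don’t trust the crude counting, Python is happy to check the size of 52! directly, since it handles big integers natively:

```python
import math

print(math.factorial(52))                 # a 68-digit number
print(math.log10(math.factorial(52)))     # about 67.9, so 52! is roughly 10^68
```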

One way is to spread the cards out on a table then pick them up one at a time, ensuring at all times that the choice of card is uniform among those currently present, and not related to any of the past choices. This is relatively easy for a computer, but hard for a human, and certainly deeply tedious for anyone waiting to receive their hand!
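As a sketch of that approach, here is the ‘pick them up one at a time’ method, choosing uniformly from the cards still on the table at each step. (The library functions random.sample and random.shuffle do essentially the same job for you.)

```python
import random

def uniform_arrangement(cards):
    """Build a uniformly random arrangement by repeatedly picking a remaining card."""
    remaining = list(cards)
    arrangement = []
    while remaining:
        # choose uniformly among the cards currently left on the table
        arrangement.append(remaining.pop(random.randrange(len(remaining))))
    return arrangement

print(uniform_arrangement(range(1, 53)))
```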

So we seek a different approach, namely an algorithm for shuffling. Our aim is to introduce overall randomness by repeatedly applying some simple but random process. Note we have to be careful about our definition of ‘random’ here. The permutation 123456 is just as ‘random’ as the permutation 361524. That is, if they are fixed, then they are not random at all. Just because it is easier to describe one of them verbally does not mean it is less random. For example, if I were trying to cheat at poker, I could do so very effectively if I knew the exact order of the cards in the pack before the dealer dealt. It wouldn’t matter what that order was. I would have to adjust my strategy based on the order, but it wouldn’t affect the fact that I had a massive advantage!

The shuffling algorithm to be discussed here is the top-to-random shuffle. Like all the best things in life, this does exactly what it says on the tin. At each step, we remove the current top card of the deck, and insert it at a uniformly chosen position in the deck. This could be at the bottom, and it could also be back on top. It feels like the possibility of the deck remaining unchanged can’t possibly help us, but later we will discuss why we need it.
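Here is a minimal sketch of a single top-to-random move, assuming we store the deck as a Python list with index 0 as the top card:

```python
import random

def top_to_random(deck):
    """Remove the top card and reinsert it at a uniformly chosen position."""
    deck = deck[:]                            # work on a copy
    top = deck.pop(0)                         # take the top card
    position = random.randint(0, len(deck))   # 0 = back on top, len(deck) = bottom
    deck.insert(position, top)
    return deck

print(top_to_random([1, 2, 3, 4, 5, 6, 7, 8]))
```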

In any case, it feels natural that if we keep applying this procedure, the arrangement of the deck should get more and more random, in the sense that knowing the original arrangement will tell us less and less about the current arrangement as time progresses. But we need to find a way to quantify this if we are to do any mathematics.

When we are talking about real numbers, it is fairly clear what it means if I say that the numbers 2, 1.1, 1.01, 1.001 and so on are getting closer and closer to 1. Indeed we can measure the distance along the number line between each term and 1, using the absolute difference. It is not so clear how to compute the distance between two probability distributions. Bearing in mind that a distribution on the set of arrangements of the cards is defined to be a set of 52! probabilities that sum to 1, such distributions live in a (52!-1)-dimensional space (eg the plane is two-dimensional, the world is three-dimensional, *and so on* – whatever that means), where we already have a nice distance formula.

But this is not what we will choose to use. Rather we return to the cheating-at-poker analogy. Suppose I am playing some sort of game involving the pack of cards with my enemy. He or she thinks the deck is perfectly random, but I know the actual distribution. How big a profit can I make by exploiting this knowledge? This will be our measure of how far a distribution is from uniform. It turns out that this will coincide precisely with the formal definition of total variation distance, but that language belongs to a different level of rigour and is not relevant here.
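For a very small deck we can compute this distance exactly. The sketch below, under the same top-card-first convention as the earlier snippets, tracks the exact distribution of a 3-card deck after repeated top-to-random shuffles, and measures the distance from uniform as half the sum of the absolute differences in probabilities, which coincides with the ‘biggest possible betting advantage’ description above.

```python
from itertools import permutations

def top_to_random_outcomes(deck):
    """All n equally likely decks obtainable by one top-to-random move."""
    top, rest = deck[0], deck[1:]
    return [rest[:i] + (top,) + rest[i:] for i in range(len(deck))]

n = 3
states = list(permutations(range(1, n + 1)))
uniform = 1 / len(states)

# Start from a point mass on the ordered deck and shuffle repeatedly.
dist = {s: 0.0 for s in states}
dist[tuple(range(1, n + 1))] = 1.0

for step in range(1, 11):
    new_dist = {s: 0.0 for s in states}
    for s, p in dist.items():
        for t in top_to_random_outcomes(s):
            new_dist[t] += p / n
    dist = new_dist
    tv = 0.5 * sum(abs(p - uniform) for p in dist.values())
    print(f"after {step} shuffles, distance from uniform = {tv:.4f}")
```

The printed distances shrink towards zero, which is the precise sense in which the deck ‘approaches randomness’.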

What is relevant is an explanatory example. Suppose we start with the arrangement 12345678. We are now going to perform one iteration of the top-to-random shuffle. The outcome might, for example, be 23456178, if we insert the 1 between the 6 and the 7. Note there were 8 places for the card to go, so the probability of this particular outcome is 1/8. Now let’s see how I might use my knowledge of the distribution to my advantage. Suppose I suggest the bet that the bottom card is an 8. My enemy thinks the stack is uniformly randomly arranged, so the probability of this is 1/8. On the other hand, I know that the only way the 8 might disappear from the bottom is if I place the 1 under it, which happens with probability 1/8. So in fact, I know the probability of this event is 7/8, which gives me an advantage of 3/4. In fact, I could come up with bets that do even better than this, but they are less simple to describe verbally.
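A rough Monte Carlo check of that 7/8, reusing the same one-step move sketched earlier:

```python
import random

def top_to_random(deck):
    deck = deck[:]
    top = deck.pop(0)
    deck.insert(random.randint(0, len(deck)), top)
    return deck

trials = 100_000
# Count how often the bottom card of 1..8 is still the 8 after one shuffle.
hits = sum(top_to_random(list(range(1, 9)))[-1] == 8 for _ in range(trials))
print(hits / trials)   # typically prints something close to 7/8 = 0.875
```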

At what point do I lose this advantage? Well, we said that the probability that the 8 leaves the bottom of the stack is 1/8. And it will continue to be 1/8 on every turn where it is at the bottom. Recalling that the outcomes of successive shuffles are independent, note this is reminiscent of rolling a dice until a six comes up. The number of rolls required to get the six is an example of a geometric random variable. I don’t want to spoil S1 (or whichever module) by going into too much detail, but it turns out that if the probability of an event happening on a single go is p, then the average time we have to wait is 1/p. So 1/(1/8)=8 of course, and this is how long we typically have to wait before the bet I placed before becomes much less effective.
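And a quick sanity check of the 1/p rule, if you don’t believe it: with success probability 1/8 on each go, the average wait should come out at about 8.

```python
import random

def wait_for_success(p):
    """Number of independent goes, each succeeding with probability p, until the first success."""
    goes = 1
    while random.random() >= p:
        goes += 1
    return goes

trials = 100_000
print(sum(wait_for_success(1 / 8) for _ in range(trials)) / trials)   # roughly 8
```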

Now seems like a good time to stop talking about 8 cards and start talking about n cards. Obviously, in practice, we will want n to be 52. Anyway, by the same argument as before, it takes on average n iterations before the bottom card leaves the bottom. This is important, because after then, my bet that the bottom card is n is no longer so effective. However, I could equally place a bet that one of the bottom *two* cards is n.

So we consider how long it takes before n is no longer one of the bottom two cards. Well certainly we need to wait until it is no longer *the* bottom card, which takes time n on average. Then, once it is second from bottom, there is a 2/n chance on each shuffle that the inserted card goes below it, so by the same argument as before, the average time for this to happen is n/2. If we want this effect to disappear entirely, we have to wait until the original bottom card is in fact at the top of the pile for the first time, and by extending our previous argument, the average time for this is

n+\frac{n}{2}+\frac{n}{3}+\ldots+\frac{n}{n-1}.

Fortunately, we have tools for approximating this sort of sum, in particular integration, which is the practice of finding the area under certain curves. It turns out that the answer is roughly n log n. You can think of log n as a measure of the number of digits required to write out n. (This is not the exact definition but it will do for now. In any case, log n gets larger as n gets larger, but not very fast.) There’s a lot more about this in my previous post on the coupon collector problem, from a more technical point of view.
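To see the estimate in action, here is a small simulation under the same conventions as before: run the top-to-random shuffle on n = 52 cards until the original bottom card first reaches the top, and compare the average with n(1 + 1/2 + … + 1/(n-1)) and with n log n.

```python
import math
import random

def time_until_bottom_card_on_top(n):
    """Number of top-to-random moves until the card that started at the bottom is on top."""
    deck = list(range(1, n + 1))      # card n starts at the bottom
    steps = 0
    while deck[0] != n:
        top = deck.pop(0)
        deck.insert(random.randint(0, len(deck)), top)
        steps += 1
    return steps

n, trials = 52, 2_000
average = sum(time_until_bottom_card_on_top(n) for _ in range(trials)) / trials
predicted = n * sum(1 / k for k in range(1, n))   # the sum displayed above, about 235
print(average, predicted, n * math.log(n))        # n log n is about 205
```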

The next question will be to prove that the deck actually is quite well shuffled by this time, but that’s for another post. The other question to ask is whether this is satisfactory overall. For n=52, the number of operations we have to perform is about 230, which is fine for a computer, but deeply tedious for anyone sitting at a casino table waiting for the next hand. So next time we’ll talk about the riffle shuffle, which seems to introduce a lot of randomness in each go, but we’ll also see that we have to be careful, because the randomness may not be as great as our intuition suggests.

Responding to the Monty Hall Problem

It’s been a while since I got round to writing anything, so I’m easing myself back in by avoiding anything too mathematically strenuous. Shortly after my last post, there was a nice clip on Horizon featuring Marcus du Sautoy explaining the famous Monty Hall Problem to the ever-incredulous Alan Davies. The BBC article here shows the clip, and then an excellent account by probabilist John Moriarty explaining the supposed paradox.

Recall the situation. A game show is hosted by Monty Hall. At some point, to decide the grand prize, or whatever, a contestant plays the following lottery. There are three doors, and three prizes, one behind each. One of the prizes is worthwhile, normally a car, while the other two are rubbish. For reasons I don’t understand, it is normally said that behind the other two doors are goats. Anyway, MH knows where the car is from the beginning. Then the contestant picks a door, but doesn’t open it yet. MH then dramatically opens a different door, revealing a goat. The contestant then has the option to stick with his original choice, or switch to the third door. What should he or she do?

So the natural error to make is to reason that since one possibility has been eliminated, and the remaining two were a priori equally likely, they should still be equally likely, and hence each occurs with probability 1/2. So it makes no difference whether you stay or switch. But, of course, as well as removing information (ruling out one door), we have also gained information. The fact that MH chose to reveal that door, rather than the potential third door, is important.

In my opinion, the best way to think about this is to imagine you are the host. The nice thing about being in this position is that you can imagine you know everything about the configuration in advance, so are less likely to fall foul of counter-intuitive conditional probabilities. Anyway, there’s loads of symmetry, so from my point of view as the host, I only care whether the contestant initially selects the right door or the wrong door. If he selects a wrong door, then I need to open the ONLY OTHER wrong door. In other words, I have no choice in what I do next: I open the other wrong door, and in this case the contestant should switch. If the contestant in fact selects the right door initially, I then have a choice about which door to open. However, it doesn’t really matter which I pick, by symmetry, and in either case the contestant would do better to stay with his original choice. Note that we have determined the outcome entirely by the initial choice. And as the host, I knew where the car was at the start, so from my point of view, unless he had somehow cheated, the contestant always had a 1/3 chance of guessing correctly. So switching wins 2/3 of the time, and staying wins only 1/3 of the time.
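If a probability table isn’t to hand, a quick Monte Carlo simulation does the job too. A minimal sketch, with the host required to open a door that is neither the contestant’s choice nor the car:

```python
import random

def play(switch):
    doors = [0, 1, 2]
    car = random.choice(doors)
    choice = random.choice(doors)
    # the host knowingly opens a door hiding a goat, avoiding the contestant's door
    opened = random.choice([d for d in doors if d != choice and d != car])
    if switch:
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == car

trials = 100_000
print(sum(play(True) for _ in range(trials)) / trials)    # about 2/3
print(sum(play(False) for _ in range(trials)) / trials)   # about 1/3
```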

The BBC’s weekly collation feature, The Loop, printed a couple of responses to the article. This is one:

“I believe that the problem is a game of two halves. In the first instance you have a 1:3 chance of getting the prize. Once you’ve made your initial choice, even though the ‘box’ contents remain the same, one ‘false’ choice is removed and you are then offered the chance to choose again. There is a play on words here, perhaps, with the option to ‘switch’. Effectively you are being asked to make a choice of 1:2. Now, if this is always going to happen then the odds of you winning the prize in the first instance is 1:2, as it doesn’t really matter whether or not you choose correctly the first time – one ‘false’ is going to be removed and you’ll be presented with just two options. It’s mind trickery. The first choice is totally irrelevant because you will always end up having to pick one option from two in order to attain the prize. In my humble opinion.”

I highlight this not to inspire contempt, since it is an easy mistake to make, even after reading as nice an explanation as Moriarty’s. Rather it is an excellent example of how using too many words can convince you an argument is complete before you’ve really made any concrete statement at all. Note that the sentences beginning ‘Effectively…’, ‘Now, if this…’ and ‘The first choice…’ all say exactly the same thing, which is, to use inappropriately formal terminology, exactly what the writer is trying to prove. The padding of ‘mind trickery’ and ‘humble opinions’ is as close as a maths argument can ever really come to an ad hominem.

So I was trying to come to a clean conclusion about exactly why the Monty Hall problem does such a good job at generating confusion. I think the following aspects of the set-up are important:

1) In general, people often get confused about the meaning of ‘random’. In some sense, the sequence 416235 looks more random than the sequence 123456. However, typically, we use ‘random’ to mean ‘not determined in advance’. The key is that the bit where MH reveals a goat is not a random action. Whatever you do initially, you know that this will happen. So the fact that it does actually happen should not add extra information.

2) This fact is compounded by the time-steps at which decisions have to be made. It feels like the arc of the narrative is directing everything towards this final choice, which feels very separate from the initial choice. You can imagine the dramatic music that undoubtedly plays as MH opens the appropriate door, and the camera pans back to the contestant making their decision. All of which is an excellent way to suggest that something complicated has happened, when in fact in terms of what we are interested in overall, nothing has happened.

A point I hadn’t thought about before was that it is important that MH knows what happens in advance. Let’s develop this point and consider what happens if MH in fact has no knowledge of where the car is, and chooses a door at random. We will still assume that he must choose a different door to the one that the contestant originally chose.

Then, 1/3 of the time the contestant picks a goat, and MH picks the car, so he gets a goat whether he stays or switches. 1/3 of the time, the contestant picks a goat, and MH picks the other goat, so he should switch. And 1/3 of the time, the contestant originally picks the car, and then by default MH picks a goat, so he should stay. So, overall, it is equally preferable to stay as to switch.
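The earlier sketch adapts easily to this ignorant-host variant. Here I condition on the games in which the host happens to reveal a goat, discarding the 1/3 of games where the car is accidentally revealed; staying and switching then each win about half the time, in line with the argument above.

```python
import random

def play_ignorant(switch):
    doors = [0, 1, 2]
    car = random.choice(doors)
    choice = random.choice(doors)
    opened = random.choice([d for d in doors if d != choice])   # host picks blindly
    if opened == car:
        return None                     # car revealed by accident: discard this game
    if switch:
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == car

for switch in (True, False):
    results = [play_ignorant(switch) for _ in range(100_000)]
    wins = [r for r in results if r is not None]
    print(switch, sum(wins) / len(wins))   # both come out at about 0.5
```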

I guess the morals of this story are: “be careful with conditional probabilities; sometimes apparently new information doesn’t change anything; and that it’s probably not a good idea to get into a discussion about the Monty Hall problem with a skeptic unless you have a piece of paper to hand on which to draw a probability table.”

PS. A final moral is that, based on the links WordPress suggested, at least 10% of the internet is devoted to explanations of the Monty Hall problem…

Nested Closed Intervals

The UK team for this year’s International Mathematical Olympiad in Santa Marta, Colombia, has just been selected. For details, see the BMOC website.

During the selection camp, which was hosted at Oundle School near Peterborough, we spent a while discussing analytic questions that typically lie outside the scope of the olympiad syllabus. In particular, incorrect treatment of, for example, the exact conditions for a stationary point to be a global maximum is likely to incur very heavy penalties if a candidate has attempted a solution using, say, Lagrange multipliers. As a result we have a dilemma about how much analysis to teach during the training process: we want the students to be able to use sophisticated methods if necessary; but we don’t want to spoil the experience of learning this theory in a serious step-by-step manner as first year undergraduates.

This post does not present a solution to this dilemma. Rather, I want to discuss one question that arose on the last day of exams. Because the statement of the question is currently classified, I will have to be oblique in discussion of the solution, but this shouldn’t distract from the maths I actually want to talk about.

The setup is as follows. We have a sequence of nested closed intervals in the reals, that is:

[a_1,b_1]\supset [a_2,b_2]\supset [a_3,b_3]\supset\ldots

We want to demonstrate that there is some real number that lies in all of the intervals, that is \cap_{n\geq 1}[a_n,b_n]\neq \varnothing. This feels intuitively obvious, but some form of proof is required.

First, what does closed mean? Well a closed interval is a closed set, but what about, for example, [a,\infty)\subset\mathbb{R}? It turns out that it is most convenient to define an open set, and then take a closed set to be the complement of an open set.

The best way of thinking about an open set is to say that it does not contain its boundary. This is certainly the case for an open interval or an open ball. It is not immediately clear how to extend this to a general definition. But note that if no point in the set X can be on the boundary, then in all the natural examples to consider there must be some positive distance between any point x\in X and the boundary. So in particular there is a small open ball centred on x that is entirely contained within X. This is then the definition of an open set in a metric space (in particular in \mathbb{R}^d).

Note that it would not be a sensible definition to say that a set is closed if around each of its points there is a closed ball contained within the set. Any open set has this property too! For if there is an open ball of radius R around a point x contained in the set, then the closed ball of radius R/2 around the same point is also contained in the set. So we really do have to say that a set is closed if its complement is open. Note that in \mathbb{R}, a closed interval is closed, and a finite union of closed intervals is closed, though not necessarily a countable union, as:

(0,1]=\cup_{n\geq 1}[\frac{1}{n},1].

Now we know what a closed set is, we can start thinking about the question.

First we remark that it is not true if we allow the interval to be unbounded. Obviously \cap_n [n,\infty)=\varnothing. Note that even though it appears that these sets have an open upper boundary, they are closed because the complement (-\infty,n) is open. This will not be a problem in our question because everything is contained within the first interval [a_1,b_1] and so is bounded.

Secondly we remark that the result is not true if we move to a general host set. For instance, it makes sense to consider open and closed sets in the rationals: the open ball of radius 1 around 1/2 is just all the rationals strictly between -1/2 and 3/2. We could write this as (-\frac12,\frac32)\cap\mathbb{Q}. But then note that

\cap_{n\geq 1}[\pi-\frac{1}{n},\pi+\frac{1}{n}]\cap\mathbb{Q}

cannot contain any rational elements, so must be empty.

There was various talk that this result might in general be a consequence of the Baire Category Theorem. I think this is overkill. Firstly, there is a straightforward proof for the reals which I will give at the end. Secondly, if anything it is a lemma required for the proof of BCT. This result is often called Cantor’s Lemma.

Lemma (Cantor): Let X be a complete metric space, and let F_1\supset F_2\supset\ldots be a nested sequence of non-empty closed subsets of X with \text{diam}(F_n)\rightarrow 0. Then there exists x\in X such that

\cap F_n=\{x\}.

Translation: for ‘complete metric space’ think ‘the reals’ or \mathbb{R}^d. The diameter is, unsurprisingly, the largest distance between two points in the set. For reasons that I won’t go into, the argument for the olympiad question gave \text{diam}(F_{n+1})\leq \frac12\text{diam}(F_n) so this certainly holds here.

Proof: With reference to AC if necessary, pick an element x_n\in F_n for each n. Note that by nesting, x_m\in F_n\;\forall n\leq m. As a result, for m>n the distance d(x_n,x_m)\leq \text{diam}(F_n), which tends to 0 as n grows, so the sequence (x_n) is Cauchy. The definition of completeness is precisely that such a sequence has a limit point x\in X.

Informally, completeness says that if the points of a sequence get increasingly close together, they must tend towards a point in the set. This is why it doesn’t work for the rationals. You can have a sequence of rationals that get very close together, but approach a point not in the set, eg an irrational. We use the definition of closed sets in terms of sequences: if the sequence is within a closed set, then the limit is too. This could only go wrong if we ‘leaked onto the boundary’ in the limit, but for a closed set, the boundary is in the set. This shows that x\in F_n for each n, and so x\in\cap_n F_n. But if there is another point in \cap_n F_n, then the distance between them is strictly positive, contradicting the claim that the diameter tends to 0. This ends the proof.

Olympiad-friendly version: I think the following works fine as a proof that is essentially free of topological definitions. Consider the sequence of left-boundaries

a_1\leq a_2\leq a_3\leq \ldots <b_1.

This sequence is non-decreasing and bounded, so it has a well-defined limit. Why? Consider the supremum. We can’t exceed the sup, but we must eventually get arbitrarily close, by definition of supremum and because the sequence is non-decreasing. Call this limit a. Then do the same for the upper boundaries to get limit b.

If a>b, then since a_n\rightarrow a and b_n\rightarrow b, for large enough n we would have a_n>b_n, which is absurd. So a\leq b, and the non-empty interval [a,b] is contained in every [a_n,b_n], hence in the intersection. Consideration of the diameter again gives that this must be a single point.