# Questionable Statistics

My four years in the Statistics Department have been characterised by a continual feeling that I don’t know enough about statistics. That said, I can’t help but notice how statistics are used in non-specialist contexts, and of course there are times when eyebrows might reasonably be raised. Andrew Gelman (who gave a very enjoyable departmental seminar on this topic in May) has a notable blog critiquing poor statistical methodology, especially in social science; and the blogosphere is packed with compilations of egregious statistical oxymorons (“All schools *must* be above average” etc).

I’m aiming for more of a middle ground here. I’ve picked three examples of articles I found interesting at some level over the past few months. In each of them, some kind of data presentation or experiment design arose, and in all cases, I think it resulted in more questions rather than more answers, not really in a good way, since in all cases I did actually want to know more about what was going on.

Extreme rounding errors

This Guardian article about inequality in career trajectories for teachers in schools serving more advantaged and more deprived areas raises a lot of interesting questions, and actually proposes answers to several of them. The problem is the double bar chart halfway down.

First a superficial moan. Rounding to the nearest integer is fine, indeed sensible in most contexts, but not when you are treating small numbers. Comparing the extreme bars, we could have 11.5 v 12.5 or we could have 10.5 v 13.5. The headlines would probably read “N% more teachers leave disadvantaged schools”. In the first case N is 9, in the second it’s 29. So it’s not really a superficial moan. Data is supposed to offer a measure of the effect under consideration, and a discussion of how treat this effect requires a veneer of reliability about the size of the effect. We don’t just need to be asking “how can we solve this”; we also need to be asking “is this worth solving?” I’m not saying the answer is no, but based on this graph alone the answer is not ‘definitely yes’. [1]

A question to ask in the context of this report is “why do more teachers leave the profession from schools in deprived areas?” But I think most people could speculate a broadly sensible answer to this. The data raises the more subtle question “why are the rates of leaving the profession so similar across all types of school?” Essentially I think this survey is a negative result – the effect just isn’t as strong as I’d have suspected, nor I imagine the authors nor the Guardian education editorship.

By contrast, even with the amateurish presentation, the rates of moving school clearly are significant and worth talking about based on this survey. Significant data demands further information though. Once an effect is significant, we need to know more details, like to what extent the teachers are moving to schools in the same band or to mostly towards the left end of the graph. Again though, there’s no comment on the magnitude of this effect. Assuming the rate is given per year [2], then among teachers who do not leave the profession, the average tenure of a given job is more than ten years, even in the most deprived category. Maybe this is all very heavy-tailed, and so the average isn’t a good thing to be using. But this doesn’t seem like an unreasonable number to me? If the claim is that this is a major problem for the teaching profession, then we should be told what the numbers are in vaguely comparable industries.

The final thing is pretty bad. Three paragraphs under the graph, it is claimed that the rate of leaving is 70% higher in most- than least-deprived categories. We decided that N was somewhere between 9 and 29. But not 70. I can only assume that this came from the ‘moving’ category instead…

[1] I’m assuming the sample pool was large enough that the effect is statistically significant, because there isn’t a direct link to the source data or even the source report unfortunately.

[2] and is it plausible that this is rate per year? Does the profession really experience ten percent annual turnover? Is it even plausible that the scale is the same? Maybe rate of leaving really is twice rate of moving school, and I have no idea about anything, but this seems questionable.

Many, many categories

The bank Halifax commissions an annual survey about pocket money, and this year’s received widespread coverage. My attention was caught by this article on BBC news. The short summary is that they surveyed about 1,200 children from about 600 families (so in many cases a direct comparison between siblings could have been possible), and it sounds like they restricted to the age range 8-15, and asked how much pocket money they received, and whether they wanted more.

The clickbait summary was that boys receive about 13% more than girls. While it doesn’t give the exact numbers, it also mentions that more boys than girls thought they deserved more money. A psychologist conjectures that this latter effect might explain the former effect, and indeed this seems plausible. The problem is that elsewhere in the article, it says that in the corresponding survey in 2015, boys receive on average only 2% more than girls.

We therefore have to consider a couple of things. 1) Is it plausible that the actual average pocket money rates (among the whole population) have fluctuated in such a gender-dependent way in 12 months? 2) If not, then can we still say something useful, even if our confidence in the methodology of the survey is reduced?

I think the answer to the first question is definitely ‘no’. So we come to the second question. Firstly, how could this have arisen? Well, there’s no indication how the children were chosen, but equally it’s hard to think of a way of choosing them that would increase artificially the boys’ pocket money. In this last sentence, I’m deliberately suggesting the null hypothesis that boys and girls are the same. This is misleading. For now, our null hypothesis is that 2015 should be the same as 2016. Indeed, it’s equally plausible that this year’s summary data was ‘more correct’, and so we are asking why the difference was excessively compressed in 2015.

This is entirely a matter of effect size, and it’s to the article’s credit that so much information is given further down the page that we can actually make the following remarks quantitatively. At the bottom, they show the average pay (across genders) in different regions. There is a lot of variation here. The rate of pay in East Anglia is only 60% what it is in London. At some level, I’m surprised it isn’t even higher. The article tells us that children between the ages of eight and fifteen were surveyed. By what rate did your pocket money change in that interval? Surely it doubled at the very least? Weekly pocket money at age eight is for sweets and an occasional packet of Pokemon cards (substitute 21st century equivalent freely…). Teenagers need to pay for outings and clothes etc, so there’s no reason at all why this should be comparable. For the purposes of definiteness in the next calculation, let’s say that fifteen-year-olds get paid four times as much as eight-year-olds on average, which I think is rather conservative.

So here’s a plausible reason for the fluctuations. Suppose between 2015 and 2016, the survey substituted 10 eight-year-old boys from the Midlands for 10 fifteen-year-old boys from London. The expectation change in the average amongst the 600 boys surveyed is

$\frac{10}{600} \times (\sim 6.5) \times (4 \times (0.6)^{-1}) \approx 0.7,$

in pounds, which is on the order of the actual change from 2015 to 2016. Choosing the participants in a non-uniform way seems like too basic a mistake to make, but mild fluctuations in the distribution of ages and locations, as well as the occasional outlier (they want to use the mean, so one oligarch’s daughter could make a noticeable difference in a sample size of 600) seems more plausible as an explanation to me than a general change in the population rates. Choosing participants in a uniform way is just hard when there are loads and loads of categories.

I’m not really saying that this survey is bad – they have to report what they found, and without a lot more information, I have no idea whether this year’s is more or less plausible than last year’s. But if you add the phrase “especially in 2016” to everything the psychologist says, it suddenly seems a bit ridiculous. So it would be worth making the remark that even if this effect is statistically significant, that doesn’t mean the effect size is large relative to lots of other less controversial effect sizes visible in the data.

Comparing precious metals

I’ve recently returning from this year’s International Mathematical Olympiad, and now we are well into the swing of its sweatier namesake in Rio. At both events, there have been many opportunities to observe how different people with different ambitions and levels of extroversion display different levels of pleasure at the same medal outcomes. In the light of this, I was drawn to this article, which has been getting quite a lot of traction online.

The basic premise is simple. Silver medallists are less happy than bronze medallists (when – unlike at the IMO – there is only one of each medal), and it’s not hard to come up with a cheap narrative: silver-winners are disappointed to have missed out on gold; bronze-winners are glad to have got a medal at all. This is all pretty convincing, so maybe there’s no need actually to collect any data, especially since asking a crowd of undergraduates to rate faces on the 1-10 agony-ecstacy scale sounds like a bit of an effort. Let’s read a couple of sentences, describing a follow-up study on judo medallists at the Athens Olympics:

Altogether, they found that thirteen of the fourteen gold medal winners smiled immediately after they completed their winning match, while eighteen of the twenty-six bronze medalists smiled. However, none of the silver medalists smiled immediately after their match ended.

You might wonder how come there are almost twice as many bronzes as golds. Well, very full details about the judo repechage structure can be found here, but the key information is that all bronze-medallists won their final match, as, naturally, did the gold-medallists. The silver-medallists lost their final match, ie the gold-medal final. So this study is literally reporting that highly-driven competitive sportsmen are happier straight after they win, than straight after they lose. This doesn’t have much to do with medals, and isn’t very exciting, in my opinion. I would, however, be interested to know what was eating the one gold medallist who didn’t smile immediately.

This isn’t really a gripe about the statistics, more about the writing. They are testing for an effect which seems highly plausible, but is branded as surprising. The study for which they give more methodological details in fact seems to be testing for a related effect, which is not just plausible, but essentially self-evident. So naturally I want to know more about the original study, which is the only place left for there to be any interesting effects, for example if they looked at athletics events which don’t have a binary elimination structure. But we aren’t told enough to find these interesting effects, if they exist. How annoying. The only thing left is to think about is my bronze medal at IMO 2008. They said eight years would be enough for the wounds to heal but I’m still not smiling.

# 100k Views

When I started this blog, I implicitly made a promise to myself that I would be aiming for quality of posts rather than quantity of posts. Evaluating quality is hard, whereas one always feels some vague sense of purpose by seeing some measure of output increase. Nonetheless, I feel I have mostly kept this promise, and haven’t written that much solely for the sake of getting out another post. This post is something of an exception, since I noticed in the office on Friday that this blog was closing in on 100,000 page views. Some of these were not actually just me pressing F5 repeatedly. Obviously this is no more relevant an integer to be a threshold as any other, and one shouldn’t feel constrained by base 10, but it still feels like a good moment for a quick review.

Here are some interesting statistics over the $(3+\epsilon)$ years of this website’s existence.

• 175 posts, not including this one, or the three which are currently in draft status. This works out at about 4.7 posts a month. By some margin the most prolific period was May 2012, when I was revising for my Part III exams in Cambridge, and a series of posts about the fiddliest parts of stochastic calculus and network theory seemed a good way to consolidate this work. I’ve learned recently that PhDs are hard, and in fact it’s been a year since I last produced at least five posts in a month, if you discount the series of olympiad reports, which though enjoyable, don’t exactly require a huge amount of mathematical engagement.
• By at least one order of magnitude, the most viewed day was 17th August 2014, when several sources simultaneously linked to the third part of my report on IMO 2014 in Cape Town. An interesting point to note is that WordPress counts image views separately to page views, so the rare posts which have a gallery attached count well in a high risk / high return sense. In any case, the analytics consider that this day resulted in 2,366 views by 345 viewers. During a typical period, the number of views per visitor fluctuates between roughly 1.5 and 1.8, so clearly uploading as many photos of maths competitions as possible is the cheap route to lots of hits, at least by the WordPress metric.

• One might well expect the distributions involved in such a setup to follow a power-law. It’s not particularly clear from the above data about views per month since late 2012 whether this holds. One anomalously large data point (corresponding to the interest in IMO 2014) does not indicate a priori a power law tail… In addition, there is a general upward trend. Since a substantial proportion of traffic arrives from Google, one might naively assume that traffic rate might be proportion to amount of content, which naturally will grow in time, though it seems impractical to test this. One might also expect more recent posts to be more popular, though in practice this seems not to have been the case.
• The variance in popularity of the various posts has been surprisingly large. At some level, I guess I thought there would be more viewers who browse through lots of things, but such people would probably do so from the home page, so it doesn’t show up as viewing lots of different articles. There is some internal linking between some pages, but not enough to be a major effect.
• At either end of the spectrum, a post about coupling and the FKG inequality has received only 16 views in 2.5 years, while a guide to understanding the Levy-Khintchine formula has, in slightly less time, racked up 2,182 hits. There are direct reasons for this. Try googling Levy-Khintchine formula and see what comes up. In a sense, this is not enough, since you also need people to be googling the term in question, and picking topics that are hard but not super-hard at a masters level is probably maximising interest. But I don’t have a good underlying reason for why some posts should end up being more Google-friendly than others.
• In particular, quality of article seems at best loosely correlated with number of views. This is perhaps worrying, both for my reputation, and for the future of written media, but we will see. Indeed, two articles on the Dubins-Schwarz theorem and a short crib sheet for convergence of random variables, both get a regular readership, despite seeming to have been written (in as much as a blog post can be) on the back of an envelope. I also find it amusing that the Dubins-Schwarz theorem is always viewed at the same time of the year, roughly mid-February, as presumably it comes up in the second term of masters courses, just like it did in my own.
• By contrast, I remain quite pleased with the Levy-Khintchine article. It’s the sort of topic which is perfectly suited to this medium, since most books on Levy processes seem to assume implicit that their readers will already be familiar with this statement. So it seemed like a worthwhile enterprise to summarise this derivation, and it’s nice to see that others clearly feel the same, and indeed I still find some posts of this flavour useful as revision for myself.

• This seemed like a particularly good data set in which to go hunting for power-laws. I appreciate that taking a print-screen of an Excel chart will horrify many viewers, but never mind. The above plot shows the log of page view values for those mathematical articles with at least 200 hits. You can see the Levy-Khintchine as a mild anomaly at the far left. While I haven’t done any serious analysis, this looks fairly convincing.
• I haven’t engaged particularly seriously in interaction with other blogs and other websites. Perhaps I should have done? I enjoy reading such things, but networking in this fashion seems like a zero-sum game overall except for a few particularly engaged people, even if one gets a pleasing spike in views from a reciprocal tweet somewhere. As a result, the numbers of comments and out-going clicks here is essentially negligible.
• Views from the UK narrowly outnumber views from the US, but at present rates this will be reversed very soon. I imagine if I discounted the olympiad posts, which are sometimes linked from UK social media, this would have happened already.
• From purely book-keeping curiosity, WordPress currently thinks the following countries (and territories – I’m not sure how the division occurs…) have recorded exactly one view: Aaland Islands, Afghanistan, Belize, Cuba, Djibouti, El Salvador, Fiji, Guatemala, Guernsey, Laos, Martinique, Mozambique, New Caledonia, Tajikistan, US Virgin Islands, and Zambia. Visiting all of those would be a fun post-viva trip…

Conclusion

As I said, we all know that 100,000 is just a number, but taking half an hour to write this has been a good chance to reflect on what I’ve written here in the past three years. People often ask whether I would recommend that they start something similar. My answer is ‘probably yes’, so long as the writer is getting something out of most posts they produce in real time. When writing about anything hard and technical, you have to accept that until you become very famous, interest in what you produce is always going to be quite low, so the satisfaction has to be gained from the process of understanding and digesting the mathematics itself. None of us will be selling the musical rights any time soon.

I have two pieces of advice to anyone in such a position. 1) Wait until you’ve written five posts before publishing any of them. This is a good test of whether you actually want to do it, and you’ll feel much more plausible linking to a website with more than two articles on it. 2) Don’t set monthly post count targets. Tempting though it is to try this to prevent your blog dying, it doesn’t achieve anything in the long term. If you have lots to say, say lots; if you don’t, occasionally saying something worthwhile feels a lot better when you look back on it than producing your target number of articles which later feel underwhelming.

I don’t know whether this will make it to $6+2\epsilon$ years, but for now, I’m still enjoying the journey through mathematics.

# The Fisher Information and Cramer-Rao Bound

Given that I’d done this twice before, and was giving the same tutorial five times this week, I was surprised at the extent to which the definition of the Fisher Information caused me to confuse both myself and the students. I thought it would be worth summarising some of the main ways to get confused, and talking about one genuine, quantitative use of the Fisher Information.

Recall we are in a frequentist model, where there is an unknown parameter $\theta$ controlling the distribution of some observable random variable X. Our aim is to make inference about $\theta$ using the value of X we observe. We use a lower case x to indicate a value we actually observe (ie a variable as opposed to a random variable). For each value of $\theta$, there is a density function $f_\theta(x)$ controlling X. We summarise this in the likelihood function $L(x,\theta)$.

The important thing to remember is that there are two unknowns here. X is unknown because it is genuinely a random variable. Whereas $\theta$ is unknown because that is how the situation has been established. $\theta$ is fixed, but we are ignorant of its value. If we knew $\theta$ we wouldn’t need to be doing any statistical inference to find out about it! A useful thing to keep in mind in everything that follows is: “Is this quantity a RV or not?” This is equivalent to “Is this a function of X or not?”, but the original form is perhaps clearer.

For some value of X=x, we define the maximum likelihood estimator to be

$\hat\theta(X):=\text{argmax}_\theta L(X,\theta).$

In words, given some data, $\hat\theta$ is the parameter under which this data is most likely. Note that $L(x,\theta)$ is a probability density function for fixed $\theta$, but NOT for fixed x. (This changes in a Bayesian framework.) For example, there might well be values of x for which $L(x,\theta)=0\,\forall \theta\in\Theta$.

Note also that we are only interested in the relative values of $L(x,\theta_1), L(x,\theta_2)$. So it doesn’t matter if we rescale L by a constant factor (although this means the marginal in x is no longer a pdf). We typically consider the log-likelihood $l(x,\theta)=\log L(x,\theta)$ instead, as this has a more manageable form when the underlying RV is an IID sequence. Anyway, since we are interested in the ratio of different values of L, we are interested in the difference between values of the log-likelihood l.

Now we come to various versions of the information. Roughly speaking, this is a measure of how good the estimator is. We define the observed information:

$J(\theta):=-\frac{d^2 l(\theta)}{d\theta^2}.$

This is an excellent example of the merits of considering the question I suggested earlier. Here, J is indeed a random variable. The abbreviated notation being used can lead one astray. Of course, $l(\theta)=l(X,\theta)$, and so it must be random. The second question is: “where are we evaluating this second derivative?”

For this, we should be considering what our aim is. We know we are defining the MLE by maximising the likelihood function for fixed x. We have said that the difference between values of l gives a measure of relative likelihood. So if the likelihood function has a sharp peak at $\hat\theta$, then this gives us more confidence than if the peak is very shallow. (I am using ‘confidence’ in a non-technical. Confidence intervals are related, but I am not considering that now.) The absolute value second derivative is precisely a measure of this property.

Ok, but the information does not evaluate this second derivative at $\hat\theta$, it evaluates it at $\theta$. The key point is that it is still a good measure if it evaluates the second derivative at a point close to $\hat\theta$. And if $\hat\theta$ is a good estimator, which it typically will be, especially when we have an IID sequence and the number of terms grows large, then $\theta$ and $\hat\theta$ will be close together, and so it remains a plausible measure.

This idea is particularly important when we come to consider the Fisher InformationThis is defined as

$I(\theta):= \mathbb{E}J(\theta)=\mathbb{E} -\frac{d^2 l(\theta)}{d\theta^2}.$

The cause for confusion is exactly what is mean by this expectation. It is not implausible that this is present, since we have already explained why $J(\theta)$ is a random variable. But we need to decide what distribution we are to integrate with respect to. After all, we don’t actually know the distribution of X. If we did, we wouldn’t be doing statistical inference on it!

So the key thing to remember is that in $I(\theta)$, the value $\theta$ plays two roles. First, it gives the distribution of X with respect to which we integrate. Also, it tells us where to evaluate this second derivative. This makes sense overall. If the distribution we are considering is $l(\cdot,\theta)$, then we expect $\hat\theta$ to be close to the true value $\theta$, and so it makes sense to evaluate it there.

Now we deduce the Cramer-Rao bound, which says that for any unbiased estimator $\hat\theta$ of $\theta$, we have

$\text{Var}(\hat\theta)\ge \frac{1}{I(\theta)}.$

First we explain that unbiased means that $\mathbb{E}\hat\theta=\theta$. This is a property that we would like any estimator to have, though often we have to settle for this property asymptotically. Again, we should be careful about the role of $\theta$. Here we mean that given some parameter $\theta$, $\hat\theta$ is then a RV depending onto the actual data, and so has a variance, which happens to be bounded below by a function of the Fisher Information.

So let’s prove this. First we need a quick result about the score, which is defined as:

$U(\theta)=\frac{dl(\theta)}{d\theta}.$

Again, this is a random variable. We want to show that $\mathbb{E}U(\theta)=0$. This is not difficult. Writing $f(x)=L(x,\theta)$, we have

$\mathbb{E}U(\theta)=\int f(x)\frac{\partial}{\partial\theta}\log f(x)dx$

$= \int \frac{\partial}{\partial\theta} L(x,\theta)dx=\frac{d}{d\theta}\int f(x)dx=\frac{d}{d\theta}1=0,$

as required. Next we consider the covariance of U and $\hat\theta$. Since we have established that $\mathbb{E}U=0$, this is simply $\mathbb{E}[U\hat\theta]$.

$\text{Cov}(U,\hat\theta)=\int \hat\theta(x)f(x) \frac{d \log f(x)}{d\theta}dx=\int \hat\theta(x)f(x)\cdot \frac{\frac{\partial f(x)}{\partial \theta}}{f(x)} dx$

$= \int \hat\theta(x)\frac{\partial f(x)}{\partial \theta}=\frac{\partial}{\partial \theta}\int \hat\theta(x)f(x)$

$= \frac{\partial}{\partial\theta} \mathbb{E}\hat\theta=\frac{d\theta}{d\theta}=1,$

as we assumed at the beginning that $\hat\theta$ was unbiased. Then, from Cauchy-Schwarz, we obtain

$\text{Var}(U)\text{Var}(\hat\theta)\ge \text{Cov}(U,\hat\theta)=1.$

So it suffices to prove that $\text{Var}(U)=I(\theta)$. This is a very similar integral rearrangement to what has happened earlier, so I will leave it as an exercise (possibly an exercise in Googling).

Note a good example of this is question 4 on the sheet. At any rate, this is where we see the equality case. We are finding the MLE for $\theta$ given an observation from $\text{Bin}(n,\theta)$. Unsurprisingly, $\hat\theta=\frac{X}{m}$. We know from our knowledge of the binomial distribution that the variance of this is $\frac{\theta(1-\theta)}{n}$, and indeed it turns out that the Fisher Information is precisely the reciprocal of this.

The equality case must happen when the score is proportional to the observed value. I don’t have a particularly strong intuition for when and why this should happen.

In any case, I hope this was helpful and interesting in some way!

# Means and Markov’s Inequality

The first time we learn what a mean is, it is probably called an average. The first time we meet it in a maths lesson, it is probably defined as follows: given a list of values, or possibilities, the mean is the sum of all the values divided by the number of such values.

This can be seen as both a probabilistic and a statistical statement. Ideally, these things should not be different, but at a primary school level (and some way beyond), there is a distinction to be drawn between the mean of a set of data values, say the heights of children in the class, and the mean outcome of rolling a dice. The latter is the mean of something random, while the former is the mean of something fixed and determined.

The reason that the same method works for both of these situations is that the distribution for the outcome of rolling a dice is uniform on the set of possible values. Though this is unlikely to be helpful to many, you could think of this as a consequence of the law of large numbers. The latter, performed jointly in all possible values says that you expect to have roughly equal numbers of each value when you take a large number of samples. If we refer to the strong law, this says that in fact we see this effect in the limit as we take increasingly large samples with probability one. Note that it is not trivial to apply LLN jointly to all values for a general continuous random variable. The convergence of sample distribution functions to the cdf of the underlying distribution is the content of the Glivenko-Cantelli Theorem.

In any case, this won’t work when there isn’t this symmetry where all values are equally likely. So in general, we have to define the mean of a discrete random variable as

$\mu=\sum k\mathbb{P}(X=k).$

In other words, we are taking a sum of values multiplied by probabilities. By taking a suitable limit, a sum weighted by discrete probabilities converges to an integral weighted by a pdf. So this is a definition that will easily generalise.

Anyway, typically the next stage is to discuss the median. In the setting where we can define the mean directly as a sum of values, we must be given some list of values, which we can therefore write in ascending order. It’s then easy to define the median as the middle value in this ordered list. If the number of elements is odd, this is certainly well-defined. If the number is even, it is less clear. A lot of time at school was spent addressing this question, and the generally-agreed answer seemed to be that the mean of the middle two elements would do nicely. We shouldn’t waste any further time addressing this, as we are aiming for the continuous setting, where in general there won’t be discrete gaps between values in the support.

This led onwards to the dreaded box-and-whisker diagrams, which represent the min, lower quartile, median, upper quartile, and max in order. The diagram is structured to draw attention to the central points in the distribution, as these are in many applications of greater interest. The question of how to define the quartiles if the number of data points is not 3 modulo 4 is of exponentially less interest than the question of how to define the median for an even number of values, in my opinion. What is much more interesting is to note that the middle box of such a diagram would be finite for many continuous distributions with infinite support, such as the exponential distribution and the normal distribution.

Note that it is possible to construct any distribution as a function of a U[0,1] distribution by inverting the cdf. The box-and-whisker diagram essentially gives five points in this identification scheme.

Obviously, the ordered list definition fails to work for such distributions. So we need a better definition of median, which generalises. We observe that half the values are greater than the median, and so in a probabilistic setting, we say that the probability of being less than the median is equal to the probability of being greater. So we want to define it implicitly as:

$\mathbb{P}(X>M)=\mathbb{P}(X

So for a continuous distribution without atoms,

$\mathbb{P}(X>M)=\frac12,$

and this uniquely defines M.

The natural question to start asking is how this compares to the mean. In particular, we want to discuss the relative sizes. Any result about the possible relative values of the mean and median can be reversed by considering the negation of the random variable, so we focus on continuous random variables with non-negative support. If nothing else, these are the conditions for lots of data we might be interested in sampling in the ‘real world’.

It’s worth having a couple of questions to clarify what we are interested in. How about: is it possible for the mean to be 1000 times larger than the median; and is it possible for the median to be 1000 times larger than the mean?

The latter is easier to address. If the median is 1000 and the mean is 1, then with probability ½ the random variable X is at least 1000. So these values make a contribution to the mean of at least 500, while the other values make a contribution of at least zero (since we’ve demanded the RV be positive). This is a contradiction.

The former question turns out to be possible. The motivation should come from our box-and-whisker diagram! Once we have fixed the middle box, the median and quartiles are fixed, but we are free to fiddle with the outer regions as much as we like, so by making the max larger and larger, we can increase the mean freely without affecting the median. Perhaps it is clearest to view a discrete example: 1, 2, N. The median will always be 2, so we can increase N as much as desired to get a large mean.

The first answer is in a way more interesting, because it generalises to give a result about the tail of distributions. Viewing the median as the ½-quantile, we are saying that it cannot be too large relative to the mean. Markov’s inequality provides an identical statement about the general quantile. Instead of thinking about the constant a in an a-quantile, we look at values in the support.

Suppose we want a bound on $\mathbb{P}(X>a)$ for some positive a. Then if we define the function f by

$f(x)=a \textbf{1}_{\{x\ge a\}},$

so $f(x)\le x$ for all values. Hence the mean of f(X) is at most the mean of X. But the mean of f(X) can be calculated as

$a\mathbb{P}(X>a),$

and so we conclude that

$\mathbb{P}(X>a)\leq \frac{\mu}{a},$

which is Markov’s Inequality.

It is worth remarking that this is trivially true when $a\le \mu$, since probabilities are always at most 1 anyway. Even beyond this region, it is generally quite weak. Note that it becomes progressively stronger if the contribution to the mean from terms greater than a is mainly driven by the contribution from terms close to a. So the statement is strong if the random variable has a light tail.

This motivates considering deviations from the mean, rather than the random variable itself. And to lighten the tail, we can square, for example, to consider the square distance from the mean. This version is Chebyshev’s Inequality:

$\mathbb{P}(|X-\mu|^2>a\sigma^2)\le \frac{1}{a}.$

Applying Markov an exponential function of a random variable is called a Chernoff Bound, and gives in some sense the bound on tails of a distribution obtained in this way.

# Bayesian Inference and the Jeffreys Prior

Last term I was tutoring for the second year statistics course in Oxford. This post is about the final quarter of the course, on the subject of Bayesian inference, and in particular on the Jeffreys prior.

There are loads and loads of articles sitting around on the web contributing the debate about the relative merits of Bayesian and frequentist methods. I do not want to continue that debate here, partly because I don’t have a strong opinion, but mainly because I don’t really understand that much about the underlying issues.

What I will say is that after a few months of working fairly intensively with various complicated stochastic processes, I am starting to feel fairly happy throwing about conditional probability rather freely. When discussing some of the more combinatorial models for example, quite often we have no desire to compute or approximate complication normalising constants, and so instead talk about ‘weights’. And a similar idea underlies Bayesian inference. As in frequentist methods we have an unknown parameter, and we observe some data. Furthermore, we know the probability that such data might have arisen under any value of the parameter. We want to make inference about the value of the parameter given the data, so it makes sense to multiply the probability that the data emerged as a result of some parameter value by some weighting on the set of parameter values.

In summary, we assign a prior distribution representing our initial beliefs about the parameter before we have seen any data, then we update this by weighting by the likelihood that the observed data might have arisen from a particular parameter. We often write this as:

$\pi(\theta| x)\propto f(x|\theta)\pi(\theta)$,

or say that posterior = likelihood x prior. Note that in many applications it won’t be necessary to work out what the normalising constant on the distribution ought to be.

That’s the setup for Bayesian methods. I think the general feeling about the relative usefulness of such an approach is that it all depends on the prior. Once we have the prior, everything is concrete and unambiguously determined. But how should we choose the prior?

There are two cases worth thinking about. The first is where we have a lot of information about the problem already. This might well be the case in some forms of scientific research, where future analysis aims to build on work already completed. It might also be the case that we have already performed some Bayesian calculations, so our current prior is in fact the posterior from a previous set of experiments. In any case, if we have such an ‘informative prior’, it makes sense to use it in some circumstances.

Alternatively, it might be the case that for some reason we care less about the actual prior than about the mathematical convenience of manipulating it. In particular, certain likelihood functions give rise to conjugate priors, where the form of the posterior is the same as the form of the prior. For example, a normal likelihood function admits a normal conjugate prior, and a binomial likelihood function gives a Beta conjugate prior.

In general though, it is entirely possible that neither of these situations will hold but we still want to try Bayesian analysis. The ideal situation would be if the choice of prior had no effect on the analysis, but if that were true, then we couldn’t really be doing any Bayesian analysis. The Jeffreys prior is one natural candidate because it removes a specific problem with choosing a prior to express ignorance.

It sounds reasonable to say that if we have total ignorance about the parameter, then we should take the prior to be uniform on the set of possible values taken by the parameter. There are two potential objections to this. The first is that if the parameter could take any real value, then the prior will not be a distribution as the uniform distribution on the reals is not normalisable. Such a prior is called improper. This isn’t a huge problem really though. For making inference we are only interested in the posterior distribution, and so if the posterior turns out to be normalisable we are probably fine.

The second problem is more serious. Even though we want to express ignorance of the parameter, is there a canonical choice for THE parameter? An example will make this objection more clear. Suppose we know nothing about the parameter T except that it lies in [0,1]. Then the uniform distribution on [0,1] seems like the natural candidate for the prior. But what if we considered T^100 to be the parameter instead? Again if we have total ignorance we should assign T^100 the uniform distribution on its support, which is again [0,1]. But if T^100 is uniform on [0,1], then T is massively concentrated near 1, and in particular cannot also be uniformly distributed on [0,1]. So as a minimum requirement for expressing ignorance, we want a way of generating a prior that doesn’t depend on the choice of parameterisation.

The Jeffreys prior has this property. Note that there may be separate problems with making such an assumption, but this prior solves this particular objection. We define it to be $\pi(\theta)\propto [I(\theta)]^{1/2}$ where I is the Fisher information, defined as

$I(\theta)=-\mathbb{E}_\theta\Big[\frac{\partial^2 l(X_1,\theta)}{\partial \theta^2}\Big],$

where the expectation is over the data X_1 for fixed $\theta$, and l is the log-likelihood. Proving that this has the property that it is invariant under reparameterisation requires demonstrating that the Jeffreys prior corresponding to $g(\theta)$ is the same as applying a change of measure to the Jeffreys prior for $\theta$. The proof is a nice exercise in the chain rule, and I don’t want to reproduce it here.

For a Binomial likelihood function, we find that the Jeffreys prior is Beta(1/2,1/2), which has density that looks roughly like a bucket suspended above [0,1]. It is certainly worth asking why the ‘natural’ choice for prior might put lots of mass at the edge of the domain for the parameter.

I don’t have a definitive answer, but I do have an intuitive idea which comes from the meaning of the Fisher information. As the second derivative of the log-likelihood, a large Fisher information means that with high probability we will see data for which the likelihood changes substantially if we vary the parameter. In particular, this means that the posterior probability of a parameter close to 0 will be eliminated more quickly by the data if the true parameter is different.

If the variance is small, as it is for parameter near 0, then the data generated by this parameter will have the greatest effect on the posterior, since the likelihood will be small almost everywhere except near the parameter. We see the opposite effect if the variance is large. So it makes sense to compensate for this by placing extra prior mass at parameter values where the data has the strongest effect. Note that in the previous example, the Jeffreys prior is in fact exactly inversely proportional to the standard deviation. For the above argument to make sense, we need it to be monotonic with respect to SD, and it just happens that in this case, being 1/SD is precisely the form required to be invariant under reparameterisation.

Anyway, I thought that was reasonably interesting, as indeed was the whole course. I feel reassured that I can justify having my work address as the Department of Statistics since I now know at least epsilon about statistics!

# Increments of Random Partitions

The following is problem 2.1.4. from Combinatorial Stochastic Processes:

Let $X_i$ be the indicator of the event that i the least element of some block of an exchangeable random partition $\Pi_n$ of [n]. Show that the joint law of the $(X_i,1\leq i\leq n)$ determines the law of $\Pi_n$.

As Pitman says, this is a result by Serban Nacu, the paper for which can be found here. In this post I’m going to explain what an exchangeable random partition is, how to prove the result, and a couple of consequences.

The starting point is the question ‘what is an exchangeable random partition?’ The most confusing aspect is that there are multiple definitions depending on whether the blocks of the partition are sets or just integers corresponding to a size. Eg, {1,2,4} u {3} is a partition of [4], corresponding to the partition 3+1 of 4. Obviously one induces the other, and in an exchangeable setting the laws of one may determine the laws of the other.

In the second case, we assume 3+1 is the same partition as 1+3. If order does matter then we call it a composition instead. This gets a bit annoying for set partitions, as we don’t want these to be ordered either. But if we want actually to talk about the sets in question we have to give them labels, which becomes an ordering, so we need some canonical way to assign these labels. Typically we will say $\Pi_n=\{A_1,\ldots,A_k\}$, where the curly brackets indicate that we don’t care about order, and we choose the labels by order of appearance, so by increasing order of least elements.

We say that a random partition $\Pi_n$ of [n] is exchangeable if its distribution is invariant the action on partitions induced by the symmetric group. That is, relabelling doesn’t change probabilities. We can express this functionally by saying

$\mathbb{P}(\Pi_n=\{A_1,\ldots,A_k\})=p(|A_1|,\ldots,|A_k|),$

for p a symmetric function. This function is then called the exchangeable partition probability function (EPPF) by Pitman.

Consider a partition of 4 into sets of sizes 3 and 1. There is a danger that this definition looks like it might be saying that the probability that A_1 is the set of size 3 is the same as the probability that A_1 is the set of size 1. This would be a problem because we expect to see some size-biasing to the labelling. Larger sets are more likely to contain small elements, merely because they contain more elements. Fortunately the definition is not broken after all. The statement above makes no reference to the probabilities of seeing various sizes for A_1 etc. For that, we would have to sum over all partitions with that property. It merely says that the partitions:

$\{1,2,3\}\cup\{4\},\quad \{1,2,4\}\cup\{3\},\quad\{1,3,4\}\cup\{2\},\quad \{2,3,4\}\cup\{1\}$

have respective probabilities:

$p(3,1),\quad p(3,1),\quad p(3,1),\quad p(1,3),$

and furthermore these are equal.

Anyway, now let’s turn to the problem. The key idea is that we want to be looking at strings of 0s and 1s that can only arise in one way. For example, the string 10…01 can only arise corresponding to the partitions {1,2,…,n-1} u {n} and {1,2,…,n-2,n} u {n-1}. So now we know p(n-1,1) and so also p(1,n-1). Furthermore, note that 10…0 and 11…1 give the probabilities of 1 block of size n and n blocks of size 1 respectively at once.

So then the string 10…010 can only arise from partitions {1,2,…,n-2,n} u {n-1} or {1,2,…,n-2} u {n-1,n}. We can calculate the probability that it came from the former using the previously found value of p(n-1,1) and a combinatorial weighting, so the remaining probability is given by p(2,n-2). Keep going. It is clear what ‘keep going’ means in the case of p(a,b) but for partitions with more than two blocks it seems a bit more complicated.

Let’s fix k the number of blocks in partitions under consideration, and start talking about compositions, that is $a_1+\ldots+a_k=n$. The problem we might face in trying to generalise the previous argument is that potentially lots of compositions might generate the same sequence of 0s and 1s, so the ‘first time’ we consider a composition might be the same for more than one composition. Trying it out in the case k=3 makes it clear that this is not going to happen, but we need some partial ordering structure to explain why this is the case.

Recall that a composition with k blocks is a sequence $a=(a_1,\ldots,a_k)$ which sums to n. Let’s say a majorizes b if all its partial sums are at least as large. That is $a_1+\ldots+a_l\geq b_1+\ldots+b_l$ for all $1\leq l \leq k$. We say this is strict if at least one of the inequalities is strict. It is not hard to see that if a majorizes b then this is strict unless a = b.

Since we don’t care about ordering, we assume for now that all compositions are arranged in non-increasing order. So we find a partition corresponding to some such composition $a_1,\ldots,a_k$. The partition is:

$\{1,\ldots,a_1\}\cup\{a_1+1,\ldots,a_1+a_2\}\cup\{a_1+a_2+1,\ldots,a_1+a_2+a_3\}\cup\ldots\cup\{n-a_k,\ldots,n\}.$

This generates a sequence of 0s and 1s as describe above, with $a_i-1$ 0s between the i’th 1 and the (i+1)th 1. The claim is that given some composition which admits a partition with this same corresponding sequence, that composition must majorize a. Proof by induction on l. So in fact we can prove Nacu’s result inductively down the partial ordering described. We know the probability of the sequence of 0s and 1s corresponding to the partition of [n] described by assumption. We know the probability of any partition corresponding to a composition which majorizes a by induction, and we know how many partitions with this sequence each such composition generates. Combining all of this, we can find the probability corresponding to a.

Actually I’m not going to say much about consequences of this except to paraphrase very briefly what Nacu says in the paper. One of the neat consequences of this result is that it allows us to prove in a fairly straightforward way that the only infinite family of exchangeable random partitions with independent increments is the so-called Chinese Restaurant process.

Instead of attempting to prove this, I will explain what all the bits mean. First, the Chinese Restaurant process is the main topic of the next chapter of the book, so I won’t say any more about it right now, except that its definition is almost exact what is required to make this particular result true.

We can’t extend the definition of exchangeable to infinite partitions immediately, because considering invariance under the symmetric group on the integers is not very nice, in particular because there’s a danger all the probabilities will end up being zero. Instead, we consider restrictions of the partition to $[n]\subset\mathbb{N}$, and demand that these nest appropriately, and are exchangeable.

Independent increments is a meaningful thing to consider since one way to construct a partition, infinite or otherwise, is to consider elements one at a time in the standard ordering, either adding the new element to an already present block, or starting block. Since 0 or 1 in the increment sequence corresponds precisely to these events, it is meaningful to talk about independent increments.

# Poisson Tails

I’ve had plenty of ideas for potential probability posts recently, but have been a bit too busy to write any of them up. I guess that’s a good thing in some sense. Anyway, this is a quick remark based on an argument I was thinking about yesterday. It combines Large Deviation theory, which I have spent a lot of time learning about this year, and the Poisson process, which I have spent a bit of time teaching.

Question

Does the Poisson distribution have an exponential tail? I ended up asking this question for two completely independent reasons yesterday. Firstly, I’ve been reading up about some more complex models of random networks. Specifically, the Erdos-Renyi random graph is interesting mathematical structure in its own right, but the independent edge condition results in certain regularity properties which are not seen in many real-world networks. In particular, the degree sequence of real-world networks typically follows an approximate power law. That is, the tail is heavy. This corresponds to our intuition that most networks contain ‘hubs’ which are connected to a large region of the network. Think about key servers or websites like Wikipedia and Google which are linked to by millions of other pages, or the social butterfly who will introduce friends from completely different circles. In any case, this property is not observed in an Erdos-Renyi graph, where the degrees are binomial, and in the sparse situation, rescale in the limit to a Poisson distribution. So, to finalise this observation, we want to be able to prove formally that the Poisson distribution has an exponential (so faster than power-law) tail.

The second occurrence of this question concerns large deviations for the exploration process of a random graph. This is a topic I’ve mentioned elsewhere (here for the exploration process, here for LDs) so I won’t recap extensively now. Anyway, the results we are interested in give estimates for the rate of decay in probability for the event that the path defined by the exploration process differs substantially from the expected path as n grows. A major annoyance in this analysis is the possibility of jumps. A jump occurs if a set of o(n) adjacent underlying random variables (here, the increments in the exploration process) have O(n) sum. A starting point might be to consider whether O(1) adjacent RVs can have O(n) sum, or indeed whether a single Poisson random variable can have sum of order n. In practice, this asks whether the probability $\mathbb{P}(X>\alpha n)$ decays faster than exponentially in n. If it does, then this is dominated on a large deviations scale. If it decays exactly exponentially in n, then we have to consider such jumps in the analysis.

Approach

We can give a precise statement of the probabilities that a Po($\lambda$) random variable X returns a given integer value:

$\mathbb{P}(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}.$

Note that these are the terms in the Taylor expansion of $e^{\lambda}$ appropriately normalised. So, while it looks like it should be possible to evaluate

$\mathbb{P}(X>\alpha n)=e^{-\lambda}\sum_{\alpha n}^\infty \frac{\lambda^k}{k!},$

this seems impossible to do directly, and it isn’t even especially obvious what a sensible bounding strategy might be.

The problem of estimating the form of the limit in probability of increasing unlikely deviations from expected behaviour surely reminds us of Cramer’s theorem. But this and other LD theory is generally formulated in terms of n random variables displaying some collective deviation, rather than a single random variable, with the size of the deviation growing. But we can transform our problem into that form by appealing to the three equivalent definitions of the Poisson process.

Recall that the Poisson process is the canonical description of, say, an arrivals process, where events in disjoint intervals are independent, and the expected number of arrives in a fixed interval is proportional to the width of the interval, giving a well-defined notion of ‘rate’ as we would want. The two main ways to define the process are: 1) the times between arrivals are given by i.i.d. Exponential RVs with parameter $\lambda$ equal to the rate; and 2) the number of arrivals in interval [s,t] is independent of all other times, and has distribution given by Po($\lambda(t-s)$). The fact that this definition gives a well-defined process is not necessarily obvious, but let’s not discuss that further here.

So the key equivalence to be exploited is that the event $X>n$ for $X\sim \text{Po}(\lambda)$ is a statement that there are at least n arrivals by time 1. If we move to the exponential inter-arrival times definition, we can write this as:

$\mathbb{P}(Z_1+\ldots+Z_n<1),$

where the Z’s are the i.i.d. exponential random variables. But this is exactly what we are able to specify through Cramer’s theorem. Recall that the moment generating function of an exponential distribution is not finite everywhere, but that doesn’t matter as we construct our rate function by taking the supremum over some index t of:

$I(x)=\sup_t (xt-\log \mathbb{E}e^{tZ_1})=\sup_t(xt-\log(\frac{\lambda}{\lambda-t})).$

A simple calculation then gives

$I(x)=\lambda x-1 - \log \lambda x.$

$\Rightarrow I(x)\uparrow \infty\text{ as }x\downarrow 0.$

Note that I(1) is the same for both Exp($\lambda$) and Po($\lambda$), because of the PP equality of events:

$\{Z_1+\ldots+Z_n\leq n\}=\{\text{Po}(\lambda n)=\text{Po}(\lambda)_1+\ldots+\text{Po}(\lambda)_n> n\},$

similar to the previous argument. In particular, for all $\epsilon>0$,

$\mathbb{P}(\text{Po}(\lambda)>n)=\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}<\frac{1}{n})<\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}<\epsilon),\text{ for large }n.$

$\mathbb{P}(\text{Po}(\lambda)>n)=O(e^{-nI(\epsilon)}),\text{ for all }\epsilon.$

Since we can take $I(\epsilon)$ as large as we want, we conclude that the probability decays faster than exponentially in n.

# Extreme Value Theory

This is something interesting which came up on the first problem sheet for the Part A Statistics course. The second question introduced the Weibull distribution, defined in terms of parameters $\alpha,\lambda>0$ through the distribution function:

$F(x)=\begin{cases}0 & x<0\\ 1-\exp(-(\frac{x}{\lambda})^\alpha) & x\geq 0.\end{cases}$

As mentioned in the statement of the question, this distribution is “typically used in industrial reliability studies in situations where failure of a system comprising many similar components occurs when the weakest component fails”. Why could that be? Expressed more theoretically, the lifetimes of various components might reasonably be assumed to behave like i.i.d. random variables in many contexts. Then the failure time of the system is given by the minimum of the constituent random variables.

So this begs the question: what does the distribution of minimum of a collection of i.i.d. random variables look like? First, we need to think why there should be an answer at all. I mean, it would not be unreasonable to assume that this would depend rather strongly on the underlying distribution. But of course, we might say the same thing about sums of i.i.d. random variables, but there is the Central Limit Theorem. Phrased in a way that is deliberately vague, this says that subject to some fairly mild conditions on the underlying distribution (finite variance in this case), the sum of n i.i.d. RVs look like a normal distribution for large n. Here we know what ‘looks like’ means, since we have a notion of a family of normal distributions. Formally, though, we might say that ‘looks like’ means that the image of the distribution under some linear transformation, where the coefficients are possibly functions of n, converges to the distribution N(0,1) as n grows.

The technical term for this is to say that the underlying RV we are considering, which in this case would be $X_1+\ldots +X_n)$ is in the domain of attraction of N(0,1). Note that other distributions in the family of normals are also in the domain of attraction of N(0,1), and vice versa, so this forms an equivalence relation on the space of distributions, though this is not hugely helpful since most interesting statements involve some sort of limit.

Anyway, with that perspective, it is perhaps more reasonable to imagine that the minimum of a collection of i.i.d. RVs might have some limit distribution. Because we typically feel more comfortable thinking about right-tails rather than left-tails of probability distributions, this problem is more often considered for the maximum of i.i.d. RVs. The Fisher-Tippett-Gnedenko theorem, proved in various forms in the first half of the 20th century, asserts that again under mild regularity assumptions, the maximum of such a collection does lie in the domain of attraction of one of a small set of distributions. The Weibull distribution as defined above is one of these. (Note that if we are considering domains of attraction, then scaling x by a constant is of no consequence, so we can drop the parameterisation by $\lambda$.)

This is considered the first main theorem of Extreme Value Theory, which addresses precisely this sort of problem. It is not hard to consider why this area is of interest. To decide how much liquidity they require, an insurance company needs to know the likely size of the maximum claim during the policy. Similarly, the designer of a sea-wall doesn’t care about the average wave-height – what matters is how strong the once-in-a-century storm which threatens the town might be. A good answer might also explain how to resolve the apparent contradiction that most human characteristics are distributed roughly normally across the population. Normal distributions are unbounded, yet physiological constraints enable us to state with certainty that there will never be twelve foot tall men (or women). In some sense, EVT is a cousin of Large Deviation theory, the difference being that unlikely events in a large family of i.i.d. RVs are considered on a local scale rather than globally. Note that large deviations for Cramer’s theorem in the case where the underlying distribution has a heavy tail are driven by a single very deviant value, rather than by lots of slightly deviant data, so in this case the theories are comparable, though generally analysed from different perspectives.

In fact, we have to consider the reversed Weibull distribution for a maximum, which is supported on $(-\infty,0]$. This is one of three possibly distribution families for the limit of a maximum. The other two are the Gumbel distribution

$F(x)=e^{-e^{-x}},$

and the Frechet distribution

$F(x)=\exp(-x^{-\alpha}),\quad x>0.$

Note that $\alpha$ is a positive parameter in both the Frechet and Gumbel distributions. These three distributions can be expressed as a single one parameter family, the Generalised Extreme Value distribution.

The differences between them lie in the tail behaviour. The reversed Weibull distribution has an actual upper bound, the Gumbel an exponential, fast-decaying tail, and the Frechet a polynomial ‘fat’ tail. It is not completely obvious that these properties are inherited from the original distribution. After all, to get from the original distribution to extreme value distribution, we are taking the maximum, then rescaling and translating in a potentially quite complicated way. However, it is perhaps reasonable to see that the property of the underlying distribution having an upper bound is preserved through this process. Obviously, the bound itself is not preserved – after all, we are free to apply arbitrary linear transformations to the distributions!

In any case, it does turn out to be the case that the U[0,1] distribution has maximum converging to a reversed Weibull; the exponential tails of the Exp(1) and N(0,1) distributions lead to a Gumbel limit; and the fat-tailed Pareto distribution gives the Frechet limit. The calculations are reasonably straightforward, especially once the correct rescaling is known. See this article from Duke for an excellent overview and the details for these examples I have just cited. These notes discuss further properties of these limiting distributions, including the unsurprising fact that their form is preserved under taking the maximum of i.i.d. copies. This is analogous to the fact that the family of normal distributions is preserved under taking arbitrary finite sums.

From a statistical point of view, devising a good statistical test for what class of extreme value distribution a particular set of data obeys is of great interest. Why? Well mainly because of the applications, some of which were suggested above. But also because of the general statistical principle that it is unwise to extrapolate beyond the range of the available data. But that is precisely what we need to do if we are considering extreme values. After all, the designer of that sea-wall can’t necessarily rely on the largest storm in the future being roughly the same as the biggest storm in the past. So because the EVT theorem gives a clear description of the distribution, to find the limiting properties, which is where the truly large extremes might occur, it suffices to find a good test for the form of the limit distribution – that is, which of the three possibilities is relevant, and what the parameter should be. This seems to be fairly hard in general. I didn’t understand much of it, but this paper provided an interesting review.

Anyway, that was something interesting I didn’t know about (for the record, I also now know how to construct a sensible Q-Q plot for the Weibull distribution!), though I am assured that EVT was a core element of the mainstream undergraduate mathematics syllabus forty years ago.

# Gaussian tail bounds and a word of caution about CLT

The first part is more of an aside. In a variety of contexts, whether for testing Large Deviations Principles or calculating expectations by integrating over the tail, it is useful to know good approximations to the tail of various distributions. In particular, the exact form of the tail of a standard normal distribution is not particularly tractable. The following upper bound is therefore often extremely useful, especially because it is fairly tight, as we will see.

Let $Z\sim N(0,1)$ be a standard normal RV. We are interested in the tail probability $R(x)=\mathbb{P}(Z\geq x)$. The density function of a normal RV decays very rapidly, as the exponential of a quadratic function of x. This means we might expect that conditional on $\{Z\geq x\}$, with high probability Z is in fact quite close to x. This concentration of measure property would suggest that the tail probability decays at a rate comparable to the density function itself. In fact, we can show that:

$\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.$

It is in fact relatively straightforward:

$\mathbb{P}(Z>x)=\frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-u^2/2}du< \frac{1}{\sqrt{2\pi}}\int_x^\infty \frac{u}{x}e^{-u^2/2}du=\frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.$

Just by comparing derivatives, we can also show that this bound is fairly tight. In particular:

$\frac{1}{\sqrt{2\pi}}\frac{x}{x^2+1}e^{-x^2/2}<\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.$

—-

Now for the second part about CLT. The following question is why I started thinking about various interpretations of CLT in the previous post. Suppose we are trying to prove the Strong Law of Large Numbers for a random variable with 0 mean and unit variance. Suppose we try to use an argument via Borel-Cantelli:

$\mathbb{P}(\frac{S_n}{n}>\epsilon) = \mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n})\stackrel{\text{CLT}}{\approx}\mathbb{P}(Z>\epsilon\sqrt{n}).$

Now we can use our favourite estimate on the tail of a normal distribution.

$\mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\epsilon\sqrt{n}\sqrt{2\pi}}e^{-n/2}$

$\Rightarrow \sum_n \mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\sqrt{2\pi}}(e^{-1/2})^n=\frac{1}{\sqrt{2\pi}(e^{1/2}-1)}<\infty.$

By Borel-Cantelli, we conclude that with probability 1, eventually $\frac{S_n}{n}<\epsilon$. This holds for all $\epsilon>0$, and a symmetric result for the negative case. We therefore obtain the Strong Law of Large Numbers.

The question is: was that application of CLT valid? It certainly looks ok, but I claim not. The main problem is that the deviations under discussion fall outside the remit of discussion. CLT gives a limiting expression for deviations on the $\sqrt{n}$ scale.

Let’s explain this another way. Let’s take $\epsilon=10^{-2}$. CLT says that as n becomes very large

$\mathbb{P}(\frac{S_n}{\sqrt{n}}>1000)\approx \mathbb{P}(Z>1000).$

But we don’t know how large n has to be before this approximation is vaguely accurate. If in fact it only becomes accurate for $n=10^{12}$, then it is not relevant for estimating

$\mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n}).$

This looks like an artificial example, but the key is that this problem becomes worse as n grows, (or as we increase the number which currently reads as 1000), and certainly is invalid in the limit. [I find the original explanation about scale of deviation treated by CLT more manageable, but hopefully this further clarifies.]

One solution might be to find some sort of uniform convergence criterion for CLT, ie a (hopefully rapidly decreasing) function $f(n)$ such that

$\sup_{x\in\mathbb{R}}|\mathbb{P}(\frac{S_n}{\sqrt{n}}>x)-\mathbb{P}(Z>x)|\leq f(n).$

This is possible, as given by the Berry-Esseen theorem, but even the most careful refinements in the special case where the third moment is bounded fail to give better bounds than

$f(n)\sim \frac{1}{\sqrt{n}}.$

Adding this error term will certainly destroy any hope we had of the sum being finite. Of course, part of the problem is that the supremum in the above definition is certainly not going to be attained at any point under discussion in these post-$\sqrt{n}$ deviations. We really want to take a supremum over larger-than-usual deviations if this is to work out.

By this stage, however, I hope it is clear what the cautionary note is, even if the argument could potentially be patched. CLT is a theorem about standard deviations. Separate principles are required to deal with the case of large deviations. This feels like a strangely ominous note on which to end, but I don’t think there’s much more to say. Do comment below if you think there’s a quick fix to the argument for SLLN presented above.

# Large Deviations and the CLT

Taking a course on Large Deviations has forced me to think a bit more carefully about what happens when you have large collections of IID random variables. I guess the first thing think to think about is ‘What is a Large Deviation‘? In particular, how large or deviant does it have to be?

Of primary interest is the tail of the distribution function of $S_n=X_1+\ldots+X_n$, where the $X_i$ are independent and identically distributed as $X$. As we can always negate everything later if necessary, we typically consider the probability of events of the type:

$\mathbb{P}(S_n\geq \theta(n))$

where $\theta(n)$ is some function which almost certainly increases fairly fast with $n$. More pertinently, if we are looking for some limit which corresponds to an actual random variable, we perhaps want to look at lots of related $\theta(n)$s simultaneously. More concretely, we should fix $\theta$ and consider the probabilities

$\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha).$ (*)

Throughout, we lose no generality by assuming that $\mathbb{E}X=0$. Of course, it is possible that this expectation does not exist, but that is certainly a question for another post!

Now let’s consider the implications of our choice of $\theta(n)$. If this increases with $n$ too slowly, and the likely deviation of $S_n$ is greater than $\theta(n)$, then the event might not be a large deviation at all. In fact, the difference between this event and the event ($S_n$ is above 0, that is, its mean) becomes negligible, and so the probability at (*) might be 1/2 or whatever, regardless of the value of $\alpha$. So object $\lim \frac{S_n}{\theta(n)}$ whatever that means, certainly cannot be a proper random variable, as if we were to have convergence in distribution, this would imply that the limit RV consisted of point mass at each of $\{+\infty, -\infty\}$.

On the other hand, if $\theta(n)$ increases rapidly with $n$, then the probabilities at (*) might become very small indeed when $\alpha>0$. For example, we might expect:

$\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha)=\begin{cases}0& \alpha>0\\1&\alpha<0.\end{cases}$

and more information to be required when $\alpha=0$. This is what we mean by a large deviation event. Although we always have to define everything concretely in terms of some finite sum $S_n$, we are always thinking about the behaviour in the limit. A large deviation principle exists in an enormous range of cases to show that these probabilities in fact decay exponentially. Again, that is the subject for another post, or indeed the lecture course I’m attending.

Instead, I want to return to the Central Limit Theorem. I first encountered this result in popular science books in a vague “the histogram of boys’ heights looks like a bell” kind of way, then, once a normal random variable had been to some extent defined, it returned in A-level statistics courses in a slightly more fleshed out form. As an undergraduate, you see it in several forms, including as a corollary following from Levy’s convergence theorem.

In all applications though, it is generally used as a method of calculating good approximations. It is not uncommon to see it presented as:

$\mathbb{P}(a\sigma\sqrt{n}+\mu n\leq S_n\leq b\sigma\sqrt{n}+\mu n)\approx \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}dx.$

Although in many cases that is the right way to think use it, it isn’t the most interesting aspect of the theorem itself. CLT says that the correct scaling of $\theta(n)$ so that the deviation probabilities lie between the two cases outline above is the same (that is, $\theta(n)=O(\sqrt{n})$ in some sense) for an enormous class of distributions, and in particular, most distributions that one might encounter in practice (ie finite mean, finite variance). There is even greater universality, as furthermore the limit distribution at this interface has the same form (some appropriate normal distribution) whenever $X$ is in this class of distributions. I think that goes a long way to explaining why we should care about the theorem. It also immediately prompts several questions:

• What happens for less regular distributions? It is now more clear what the right question to ask in this setting might be. What is the appropriate scaling for $\theta(n)$ in this case, if such a scaling exists? Is there a similar universality property for suitable classes of distributions?
• What is special about the normal distribution? The theorem itself shows us that it appears as a universal scaling limit in distribution, but we might reasonably ask what properties such a distribution should have, as perhaps this will offer a clue to a version of CLT or LLNs for less regular distributions.
• We can see that the Weak Law of Large Numbers follows immediately from CLT. In fact we can say more, perhaps a Slightly Less Weak LLN, that

$\frac{S_n-\mu n}{\sigma \theta(n)}\stackrel{d}{\rightarrow}0$

• whenever $\sqrt{n}<<\theta(n)$. But of course, we also have a Strong Law of Large Numbers, which asserts that the empirical mean converges almost surely. What is the threshhold for almost sure convergence, because there is no a priori reason why it should be $\theta(n)=n$?

To be continued next time.