Questionable Statistics

My four years in the Statistics Department have been characterised by a continual feeling that I don’t know enough about statistics. That said, I can’t help but notice how statistics are used in non-specialist contexts, and of course there are times when eyebrows might reasonably be raised. Andrew Gelman (who gave a very enjoyable departmental seminar on this topic in May) has a notable blog critiquing poor statistical methodology, especially in social science; and the blogosphere is packed with compilations of egregious statistical oxymorons (“All schools *must* be above average” etc).

I’m aiming for more of a middle ground here. I’ve picked three examples of articles I found interesting at some level over the past few months. Each involves some kind of data presentation or experiment design, and in all cases I think it raised more questions than answers – not really in a good way, since in each case I did actually want to know more about what was going on.

Extreme rounding errors

This Guardian article about inequality in career trajectories for teachers in schools serving more advantaged and more deprived areas raises a lot of interesting questions, and actually proposes answers to several of them. The problem is the double bar chart halfway down.

First a superficial moan. Rounding to the nearest integer is fine, indeed sensible in most contexts, but not when you are dealing with small numbers. Comparing the extreme bars, we could have 11.5 v 12.5 or we could have 10.5 v 13.5. The headlines would probably read “N% more teachers leave disadvantaged schools”. In the first case N is 9, in the second it’s 29. So it’s not really a superficial moan. Data is supposed to offer a measure of the effect under consideration, and a discussion of how to treat this effect requires a veneer of reliability about the size of the effect. We don’t just need to be asking “how can we solve this”; we also need to be asking “is this worth solving?” I’m not saying the answer is no, but based on this graph alone the answer is not ‘definitely yes’. [1]
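To make the rounding point concrete, here is a minimal sketch in Python (assuming, as the arithmetic above suggests, that the extreme bars were rounded to 11% and 13%; the true unrounded values are not published):

```python
# Rounded bar heights of 11% and 13% are each consistent with a half-unit
# range either side, so the headline figure "N% more teachers leave
# disadvantaged schools" is only pinned down very loosely.
least_deprived = (10.5, 11.5)   # possible true values behind a bar rounded to 11
most_deprived = (12.5, 13.5)    # possible true values behind a bar rounded to 13

n_min = (most_deprived[0] / least_deprived[1] - 1) * 100   # 12.5 vs 11.5
n_max = (most_deprived[1] / least_deprived[0] - 1) * 100   # 13.5 vs 10.5
print(f"N could plausibly be anywhere from {n_min:.0f}% to {n_max:.0f}%")  # ~9% to ~29%
```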

A question to ask in the context of this report is “why do more teachers leave the profession from schools in deprived areas?” But I think most people could speculate a broadly sensible answer to this. The data raises the more subtle question “why are the rates of leaving the profession so similar across all types of school?” Essentially I think this survey is a negative result – the effect just isn’t as strong as I’d have suspected, nor, I imagine, as the authors or the Guardian education editors would have.

By contrast, even with the amateurish presentation, the rates of moving school clearly are significant and worth talking about based on this survey. Significant data demands further information though. Once an effect is significant, we need to know more details, such as to what extent the teachers are moving to schools in the same band, or mostly towards the left end of the graph. Again though, there’s no comment on the magnitude of this effect. Assuming the rate is given per year [2], then among teachers who do not leave the profession, the average tenure of a given job is more than ten years, even in the most deprived category. Maybe this is all very heavy-tailed, and so the average isn’t a good thing to be using. But this doesn’t seem like an unreasonable number to me? If the claim is that this is a major problem for the teaching profession, then we should be told what the numbers are in vaguely comparable industries.
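As a rough check on that claim, here is a sketch under a crude model of my own: if the moving rate is an annual probability of a little under 10% (a hypothetical reading of the most deprived band), and each year is independent, then tenure is geometric and its mean is just the reciprocal of the rate:

```python
# Hypothetical annual rate of moving school, roughly read off the chart's
# most deprived band (an assumption; the source data isn't linked).
annual_move_rate = 0.09

# Crude geometric model: a teacher moves each year independently with this
# probability, so the expected tenure in a given job is 1/rate.
expected_tenure_years = 1 / annual_move_rate
print(f"Expected tenure: about {expected_tenure_years:.0f} years")  # roughly 11 years
```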

The final thing is pretty bad. Three paragraphs under the graph, it is claimed that the rate of leaving is 70% higher in most- than least-deprived categories. We decided that N was somewhere between 9 and 29. But not 70. I can only assume that this came from the ‘moving’ category instead…

[1] I’m assuming the sample pool was large enough that the effect is statistically significant, because there isn’t a direct link to the source data or even the source report unfortunately.

[2] and is it plausible that this is rate per year? Does the profession really experience ten percent annual turnover? Is it even plausible that the scale is the same? Maybe rate of leaving really is twice rate of moving school, and I have no idea about anything, but this seems questionable.

Many, many categories

The bank Halifax commissions an annual survey about pocket money, and this year’s received widespread coverage. My attention was caught by this article on BBC news. The short summary is that they surveyed about 1,200 children from about 600 families (so in many cases a direct comparison between siblings would have been possible); it sounds like they restricted to the age range 8-15, and asked how much pocket money the children received, and whether they wanted more.

The clickbait summary was that boys receive about 13% more than girls. While it doesn’t give the exact numbers, it also mentions that more boys than girls thought they deserved more money. A psychologist conjectures that this latter effect might explain the former effect, and indeed this seems plausible. The problem is that elsewhere in the article, it says that in the corresponding survey in 2015, boys received on average only 2% more than girls.

We therefore have to consider a couple of things. 1) Is it plausible that the actual average pocket money rates (among the whole population) have fluctuated in such a gender-dependent way in 12 months? 2) If not, then can we still say something useful, even if our confidence in the methodology of the survey is reduced?

I think the answer to the first question is definitely ‘no’. So we come to the second question. Firstly, how could this have arisen? Well, there’s no indication of how the children were chosen, but equally it’s hard to think of a way of choosing them that would artificially inflate the boys’ pocket money. In this last sentence, I’m deliberately suggesting the null hypothesis that boys and girls are the same. This is misleading. For now, our null hypothesis is that 2015 should be the same as 2016. Indeed, it’s equally plausible that this year’s summary data was ‘more correct’, and so we are asking why the difference was excessively compressed in 2015.

This is entirely a matter of effect size, and it’s to the article’s credit that so much information is given further down the page that we can actually make the following remarks quantitatively. At the bottom, they show the average pay (across genders) in different regions. There is a lot of variation here. The rate of pay in East Anglia is only 60% of what it is in London. At some level, I’m surprised it isn’t even higher. The article tells us that children between the ages of eight and fifteen were surveyed. By what factor did your pocket money change over that interval? Surely it doubled at the very least? Weekly pocket money at age eight is for sweets and an occasional packet of Pokemon cards (substitute 21st century equivalent freely…). Teenagers need to pay for outings and clothes etc, so there’s no reason at all why this should be comparable. For the purposes of definiteness in the next calculation, let’s say that fifteen-year-olds get paid four times as much as eight-year-olds on average, which I think is rather conservative.

So here’s a plausible reason for the fluctuations. Suppose between 2015 and 2016, the survey substituted 10 eight-year-old boys from the Midlands for 10 fifteen-year-old boys from London. The expected change in the average amongst the 600 boys surveyed is roughly

\frac{10}{600} \times (\sim 6.5) \times (4 \times (0.6)^{-1}) \approx 0.7,

in pounds, which is on the order of the actual change from 2015 to 2016. Choosing the participants in a non-uniform way seems like too basic a mistake to make, but mild fluctuations in the distribution of ages and locations, as well as the occasional outlier (they want to use the mean, so one oligarch’s daughter could make a noticeable difference in a sample size of 600), seem more plausible to me as an explanation than a general change in the population rates. Choosing participants in a uniform way is just hard when there are loads and loads of categories.
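For what it’s worth, here is the same back-of-envelope calculation spelled out, with the inputs flagged as the rough assumptions they are (an overall average of about £6.50 a week, an age factor of 4, and a London-versus-cheaper-region factor of 1/0.6):

```python
# Back-of-envelope check of the displayed figure. All inputs are rough
# assumptions taken from the article and the discussion above.
n_boys = 600              # boys surveyed
n_swapped = 10            # hypothetical boys swapped between the two years
avg_pocket_money = 6.5    # assumed overall average weekly amount, in pounds
age_factor = 4            # fifteen-year-olds vs eight-year-olds (assumed)
region_factor = 1 / 0.6   # London vs a cheaper region (from the regional table)

change_per_boy = avg_pocket_money * age_factor * region_factor
shift_in_average = (n_swapped / n_boys) * change_per_boy
print(f"Shift in the boys' average: about £{shift_in_average:.2f}")  # ~ £0.72
```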

I’m not really saying that this survey is bad – they have to report what they found, and without a lot more information, I have no idea whether this year’s is more or less plausible than last year’s. But if you add the phrase “especially in 2016” to everything the psychologist says, it suddenly seems a bit ridiculous. So it would be worth making the remark that even if this effect is statistically significant, that doesn’t mean the effect size is large relative to lots of other less controversial effect sizes visible in the data.

Comparing precious metals

I’ve recently returned from this year’s International Mathematical Olympiad, and now we are well into the swing of its sweatier namesake in Rio. At both events, there have been many opportunities to observe how different people with different ambitions and levels of extroversion display different levels of pleasure at the same medal outcomes. In the light of this, I was drawn to this article, which has been getting quite a lot of traction online.

The basic premise is simple. Silver medallists are less happy than bronze medallists (when – unlike at the IMO – there is only one of each medal), and it’s not hard to come up with a cheap narrative: silver-winners are disappointed to have missed out on gold; bronze-winners are glad to have got a medal at all. This is all pretty convincing, so maybe there’s no need actually to collect any data, especially since asking a crowd of undergraduates to rate faces on the 1-10 agony-ecstasy scale sounds like a bit of an effort. Let’s read a couple of sentences, describing a follow-up study on judo medallists at the Athens Olympics:

Altogether, they found that thirteen of the fourteen gold medal winners smiled immediately after they completed their winning match, while eighteen of the twenty-six bronze medalists smiled. However, none of the silver medalists smiled immediately after their match ended.

You might wonder why there are almost twice as many bronzes as golds. Well, very full details about the judo repechage structure can be found here, but the key information is that all bronze-medallists won their final match, as, naturally, did the gold-medallists. The silver-medallists lost their final match, ie the gold-medal final. So this study is literally reporting that highly-driven competitive sportsmen are happier straight after they win than straight after they lose. This doesn’t have much to do with medals, and isn’t very exciting, in my opinion. I would, however, be interested to know what was eating the one gold medallist who didn’t smile immediately.

This isn’t really a gripe about the statistics, more about the writing. They are testing for an effect which seems highly plausible, but is branded as surprising. The study for which they give more methodological details in fact seems to be testing for a related effect, which is not just plausible, but essentially self-evident. So naturally I want to know more about the original study, which is the only place left for there to be any interesting effects, for example if they looked at athletics events which don’t have a binary elimination structure. But we aren’t told enough to find these interesting effects, if they exist. How annoying. The only thing left to think about is my bronze medal at IMO 2008. They said eight years would be enough for the wounds to heal but I’m still not smiling.


100k Views

When I started this blog, I implicitly made a promise to myself that I would be aiming for quality of posts rather than quantity of posts. Evaluating quality is hard, whereas one always feels some vague sense of purpose by seeing some measure of output increase. Nonetheless, I feel I have mostly kept this promise, and haven’t written that much solely for the sake of getting out another post. This post is something of an exception, since I noticed in the office on Friday that this blog was closing in on 100,000 page views. Some of these were not actually just me pressing F5 repeatedly. Obviously this is no more relevant an integer to be a threshold than any other, and one shouldn’t feel constrained by base 10, but it still feels like a good moment for a quick review.

Here are some interesting statistics over the (3+\epsilon) years of this website’s existence.

  • 175 posts, not including this one, or the three which are currently in draft status. This works out at about 4.7 posts a month. By some margin the most prolific period was May 2012, when I was revising for my Part III exams in Cambridge, and a series of posts about the fiddliest parts of stochastic calculus and network theory seemed a good way to consolidate this work. I’ve learned recently that PhDs are hard, and in fact it’s been a year since I last produced at least five posts in a month, if you discount the series of olympiad reports, which, though enjoyable, don’t exactly require a huge amount of mathematical engagement.
  • By at least one order of magnitude, the most viewed day was 17th August 2014, when several sources simultaneously linked to the third part of my report on IMO 2014 in Cape Town. An interesting point to note is that WordPress counts image views separately to page views, so the rare posts which have a gallery attached count well in a high risk / high return sense. In any case, the analytics consider that this day resulted in 2,366 views by 345 viewers. During a typical period, the number of views per visitor fluctuates between roughly 1.5 and 1.8, so clearly uploading as many photos of maths competitions as possible is the cheap route to lots of hits, at least by the WordPress metric.

[Chart: monthly page views on this blog since late 2012]

  • One might well expect the distributions involved in such a setup to follow a power-law. It’s not particularly clear from the above data about views per month since late 2012 whether this holds. One anomalously large data point (corresponding to the interest in IMO 2014) does not indicate a priori a power law tail… In addition, there is a general upward trend. Since a substantial proportion of traffic arrives from Google, one might naively assume that traffic rate might be proportional to the amount of content, which naturally will grow in time, though it seems impractical to test this. One might also expect more recent posts to be more popular, though in practice this seems not to have been the case.
  • The variance in popularity of the various posts has been surprisingly large. At some level, I guess I thought there would be more viewers who browse through lots of things, but such people would probably do so from the home page, so it doesn’t show up as viewing lots of different articles. There is some internal linking between some pages, but not enough to be a major effect.
  • At either end of the spectrum, a post about coupling and the FKG inequality has received only 16 views in 2.5 years, while a guide to understanding the Levy-Khintchine formula has, in slightly less time, racked up 2,182 hits. There are direct reasons for this. Try googling Levy-Khintchine formula and see what comes up. In a sense, this is not enough, since you also need people to be googling the term in question, and picking topics that are hard but not super-hard at a masters level is probably what maximises interest. But I don’t have a good underlying reason for why some posts should end up being more Google-friendly than others.
  • In particular, quality of article seems at best loosely correlated with number of views. This is perhaps worrying, both for my reputation, and for the future of written media, but we will see. Indeed, two articles on the Dubins-Schwarz theorem and a short crib sheet for convergence of random variables, both get a regular readership, despite seeming to have been written (in as much as a blog post can be) on the back of an envelope. I also find it amusing that the Dubins-Schwarz theorem is always viewed at the same time of the year, roughly mid-February, as presumably it comes up in the second term of masters courses, just like it did in my own.
  • By contrast, I remain quite pleased with the Levy-Khintchine article. It’s the sort of topic which is perfectly suited to this medium, since most books on Levy processes seem to assume implicitly that their readers will already be familiar with this statement. So it seemed like a worthwhile enterprise to summarise this derivation, and it’s nice to see that others clearly feel the same, and indeed I still find some posts of this flavour useful as revision for myself.

[Plot: log of page views for mathematical articles with at least 200 hits]

  • This seemed like a particularly good data set in which to go hunting for power-laws. I appreciate that taking a print-screen of an Excel chart will horrify many viewers, but never mind. The above plot shows the log of page view values for those mathematical articles with at least 200 hits. You can see the Levy-Khintchine as a mild anomaly at the far left. While I haven’t done any serious analysis, this looks fairly convincing.
  • I haven’t engaged particularly seriously in interaction with other blogs and other websites. Perhaps I should have done? I enjoy reading such things, but networking in this fashion seems like a zero-sum game overall except for a few particularly engaged people, even if one gets a pleasing spike in views from a reciprocal tweet somewhere. As a result, the number of comments and outgoing clicks here is essentially negligible.
  • Views from the UK narrowly outnumber views from the US, but at present rates this will be reversed very soon. I imagine if I discounted the olympiad posts, which are sometimes linked from UK social media, this would have happened already.
  • From purely book-keeping curiosity, WordPress currently thinks the following countries (and territories – I’m not sure how the division occurs…) have recorded exactly one view: Aaland Islands, Afghanistan, Belize, Cuba, Djibouti, El Salvador, Fiji, Guatemala, Guernsey, Laos, Martinique, Mozambique, New Caledonia, Tajikistan, US Virgin Islands, and Zambia. Visiting all of those would be a fun post-viva trip…

Conclusion

As I said, we all know that 100,000 is just a number, but taking half an hour to write this has been a good chance to reflect on what I’ve written here in the past three years. People often ask whether I would recommend that they start something similar. My answer is ‘probably yes’, so long as the writer is getting something out of most posts they produce in real time. When writing about anything hard and technical, you have to accept that until you become very famous, interest in what you produce is always going to be quite low, so the satisfaction has to be gained from the process of understanding and digesting the mathematics itself. None of us will be selling the musical rights any time soon.

I have two pieces of advice to anyone in such a position. 1) Wait until you’ve written five posts before publishing any of them. This is a good test of whether you actually want to do it, and a website with more than two articles on it is a much more plausible thing to link to. 2) Don’t set monthly post count targets. Tempting though it is to try this to prevent your blog dying, it doesn’t achieve anything in the long term. If you have lots to say, say lots; if you don’t, occasionally saying something worthwhile feels a lot better when you look back on it than producing your target number of articles which later feel underwhelming.

I don’t know whether this will make it to 6+2\epsilon years, but for now, I’m still enjoying the journey through mathematics.

The Fisher Information and Cramer-Rao Bound

Given that I’d done this twice before, and was giving the same tutorial five times this week, I was surprised at the extent to which the definition of the Fisher Information caused me to confuse both myself and the students. I thought it would be worth summarising some of the main ways to get confused, and talking about one genuine, quantitative use of the Fisher Information.

Recall we are in a frequentist model, where there is an unknown parameter \theta controlling the distribution of some observable random variable X. Our aim is to make inference about \theta using the value of X we observe. We use a lower case x to indicate a value we actually observe (ie a variable as opposed to a random variable). For each value of \theta, there is a density function f_\theta(x) controlling X. We summarise this in the likelihood function L(x,\theta).

The important thing to remember is that there are two unknowns here. X is unknown because it is genuinely a random variable, whereas \theta is unknown because that is how the situation has been established. \theta is fixed, but we are ignorant of its value. If we knew \theta we wouldn’t need to be doing any statistical inference to find out about it! A useful thing to keep in mind in everything that follows is: “Is this quantity a RV or not?” This is equivalent to “Is this a function of X or not?”, but the original form is perhaps clearer.

Given the data X, we define the maximum likelihood estimator to be

\hat\theta(X):=\text{argmax}_\theta L(X,\theta).

In words, given some data, \hat\theta is the parameter under which this data is most likely. Note that L(x,\theta) is a probability density function for fixed \theta, but NOT for fixed x. (This changes in a Bayesian framework.) For example, there might well be values of x for which L(x,\theta)=0\,\forall \theta\in\Theta.

Note also that we are only interested in the relative values of L(x,\theta_1), L(x,\theta_2). So it doesn’t matter if we rescale L by a constant factor (although this means the marginal in x is no longer a pdf). We typically consider the log-likelihood l(x,\theta)=\log L(x,\theta) instead, as this has a more manageable form when the underlying RV is an IID sequence. Anyway, since we are interested in the ratio of different values of L, we are interested in the difference between values of the log-likelihood l.
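As a concrete illustration of working with the log-likelihood (a sketch of my own, not anything from the tutorial sheet), here is the MLE for the rate of an i.i.d. exponential sample, found by maximising l numerically and compared with the closed form 1/\bar{x}:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=1000)   # i.i.d. Exp(2.5) sample

def neg_log_likelihood(theta):
    # l(x, theta) = n*log(theta) - theta*sum(x) for an Exp(theta) sample
    return -(len(x) * np.log(theta) - theta * x.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100), method="bounded")
print(result.x, 1 / x.mean())   # numerical MLE vs the closed-form MLE 1/xbar
```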

Now we come to various versions of the information. Roughly speaking, this is a measure of how good the estimator is. We define the observed information:

J(\theta):=-\frac{d^2 l(\theta)}{d\theta^2}.

This is an excellent example of the merits of considering the question I suggested earlier. Here, J is indeed a random variable. The abbreviated notation being used can lead one astray. Of course, l(\theta)=l(X,\theta), and so it must be random. The second question is: “where are we evaluating this second derivative?”

For this, we should be considering what our aim is. We know we are defining the MLE by maximising the likelihood function for fixed x. We have said that the difference between values of l gives a measure of relative likelihood. So if the likelihood function has a sharp peak at \hat\theta, then this gives us more confidence than if the peak is very shallow. (I am using ‘confidence’ in a non-technical sense. Confidence intervals are related, but I am not considering them now.) The absolute value of the second derivative is precisely a measure of this property.

Ok, but the information does not evaluate this second derivative at \hat\theta, it evaluates it at \theta. The key point is that it is still a good measure if it evaluates the second derivative at a point close to \hat\theta. And if \hat\theta is a good estimator, which it typically will be, especially when we have an IID sequence and the number of terms grows large, then \theta and \hat\theta will be close together, and so it remains a plausible measure.

This idea is particularly important when we come to consider the Fisher Information. This is defined as

I(\theta):= \mathbb{E}J(\theta)=\mathbb{E} -\frac{d^2 l(\theta)}{d\theta^2}.

The cause for confusion is exactly what is meant by this expectation. It is not implausible that an expectation should be present, since we have already explained why J(\theta) is a random variable. But we need to decide what distribution we are to integrate with respect to. After all, we don’t actually know the distribution of X. If we did, we wouldn’t be doing statistical inference on it!

So the key thing to remember is that in I(\theta), the value \theta plays two roles. First, it gives the distribution of X with respect to which we integrate. Also, it tells us where to evaluate this second derivative. This makes sense overall. If the distribution we are considering is l(\cdot,\theta), then we expect \hat\theta to be close to the true value \theta, and so it makes sense to evaluate it there.
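A quick simulation makes the two roles of \theta explicit. This is just a sketch using a binomial model for illustration: the same \theta both generates the data over which we average and is the point at which the second derivative is evaluated.

```python
import numpy as np

theta = 0.3        # plays both roles: generates X, and is where J is evaluated
n_trials = 20      # X ~ Bin(n_trials, theta)
rng = np.random.default_rng(1)

def observed_information(x, t):
    # J(t) = -d^2/dt^2 [ x*log(t) + (n - x)*log(1 - t) ]
    return x / t**2 + (n_trials - x) / (1 - t)**2

samples = rng.binomial(n_trials, theta, size=200_000)
monte_carlo_I = observed_information(samples, theta).mean()
exact_I = n_trials / (theta * (1 - theta))   # Fisher information for Bin(n, theta)
print(monte_carlo_I, exact_I)                # the two should agree closely
```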

Now we deduce the Cramer-Rao bound, which says that for any unbiased estimator \hat\theta of \theta, we have

\text{Var}(\hat\theta)\ge \frac{1}{I(\theta)}.

First we explain that unbiased means that \mathbb{E}\hat\theta=\theta. This is a property that we would like any estimator to have, though often we have to settle for this property asymptotically. Again, we should be careful about the role of \theta. Here we mean that given some parameter \theta, \hat\theta is then a RV depending on the actual data, and so has a variance, which happens to be bounded below by a function of the Fisher Information.

So let’s prove this. First we need a quick result about the score, which is defined as:

U(\theta)=\frac{dl(\theta)}{d\theta}.

Again, this is a random variable. We want to show that \mathbb{E}U(\theta)=0. This is not difficult. Writing f(x)=L(x,\theta), we have

\mathbb{E}U(\theta)=\int f(x)\frac{\partial}{\partial\theta}\log f(x)dx

= \int \frac{\partial}{\partial\theta} L(x,\theta)dx=\frac{d}{d\theta}\int f(x)dx=\frac{d}{d\theta}1=0,

as required. Next we consider the covariance of U and \hat\theta. Since we have established that \mathbb{E}U=0, this is simply \mathbb{E}[U\hat\theta].

\text{Cov}(U,\hat\theta)=\int \hat\theta(x)f(x) \frac{d \log f(x)}{d\theta}dx=\int \hat\theta(x)f(x)\cdot \frac{\frac{\partial f(x)}{\partial \theta}}{f(x)} dx

= \int \hat\theta(x)\frac{\partial f(x)}{\partial \theta}dx=\frac{\partial}{\partial \theta}\int \hat\theta(x)f(x)dx

= \frac{\partial}{\partial\theta} \mathbb{E}\hat\theta=\frac{d\theta}{d\theta}=1,

as we assumed at the beginning that \hat\theta was unbiased. Then, from Cauchy-Schwarz, we obtain

\text{Var}(U)\text{Var}(\hat\theta)\ge \text{Cov}(U,\hat\theta)^2=1.

So it suffices to prove that \text{Var}(U)=I(\theta). This is a very similar integral rearrangement to what has happened earlier, so I will leave it as an exercise (possibly an exercise in Googling).

Note a good example of this is question 4 on the sheet. At any rate, this is where we see the equality case. We are finding the MLE for \theta given an observation from \text{Bin}(n,\theta). Unsurprisingly, \hat\theta=\frac{X}{n}. We know from our knowledge of the binomial distribution that the variance of this is \frac{\theta(1-\theta)}{n}, and indeed it turns out that the Fisher Information is precisely the reciprocal of this.
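Here is a minimal numerical statement of that equality (a sketch, not question 4 itself): for X \sim \text{Bin}(n,\theta) and the unbiased estimator X/n, the variance equals the reciprocal of the Fisher information exactly.

```python
n, theta = 40, 0.35
var_estimator = theta * (1 - theta) / n         # Var(X/n) for X ~ Bin(n, theta)
fisher_information = n / (theta * (1 - theta))  # I(theta) for the binomial model
print(var_estimator, 1 / fisher_information)    # identical: the bound is attained
```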

The equality case must happen when the score is proportional to \hat\theta-\theta, viewed as a function of the data. I don’t have a particularly strong intuition for when and why this should happen.

In any case, I hope this was helpful and interesting in some way!


Means and Markov’s Inequality

The first time we learn what a mean is, it is probably called an average. The first time we meet it in a maths lesson, it is probably defined as follows: given a list of values, or possibilities, the mean is the sum of all the values divided by the number of such values.

This can be seen as both a probabilistic and a statistical statement. Ideally, these things should not be different, but at a primary school level (and some way beyond), there is a distinction to be drawn between the mean of a set of data values, say the heights of children in the class, and the mean outcome of rolling a dice. The latter is the mean of something random, while the former is the mean of something fixed and determined.

The reason that the same method works for both of these situations is that the distribution for the outcome of rolling a dice is uniform on the set of possible values. Though this is unlikely to be helpful to many, you could think of this as a consequence of the law of large numbers. The latter, applied jointly to all possible values, says that you expect to see roughly equal numbers of each value when you take a large number of samples. If we refer to the strong law, this says that with probability one we see this effect in the limit as we take increasingly large samples. Note that it is not trivial to apply the LLN jointly to all values for a general continuous random variable. The convergence of sample distribution functions to the cdf of the underlying distribution is the content of the Glivenko-Cantelli Theorem.

In any case, this won’t work when there isn’t this symmetry where all values are equally likely. So in general, we have to define the mean of a discrete random variable as

\mu=\sum k\mathbb{P}(X=k).

In other words, we are taking a sum of values multiplied by probabilities. By taking a suitable limit, a sum weighted by discrete probabilities converges to an integral weighted by a pdf. So this is a definition that will easily generalise.
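For example, for a fair die the weighted sum recovers the familiar answer (a trivial sketch):

```python
from fractions import Fraction

# Mean of a fair six-sided die: sum of value times probability
mean = sum(Fraction(k, 6) for k in range(1, 7))
print(mean)   # 7/2
```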

Anyway, typically the next stage is to discuss the median. In the setting where we can define the mean directly as a sum of values, we must be given some list of values, which we can therefore write in ascending order. It’s then easy to define the median as the middle value in this ordered list. If the number of elements is odd, this is certainly well-defined. If the number is even, it is less clear. A lot of time at school was spent addressing this question, and the generally-agreed answer seemed to be that the mean of the middle two elements would do nicely. We shouldn’t waste any further time addressing this, as we are aiming for the continuous setting, where in general there won’t be discrete gaps between values in the support.

This led onwards to the dreaded box-and-whisker diagrams, which represent the min, lower quartile, median, upper quartile, and max in order. The diagram is structured to draw attention to the central points in the distribution, as these are in many applications of greater interest. The question of how to define the quartiles if the number of data points is not 3 modulo 4 is of exponentially less interest than the question of how to define the median for an even number of values, in my opinion. What is much more interesting is to note that the middle box of such a diagram would be finite for many continuous distributions with infinite support, such as the exponential distribution and the normal distribution.

Note that it is possible to construct any distribution as a function of a U[0,1] distribution by inverting the cdf. The box-and-whisker diagram essentially gives five points in this identification scheme.

Obviously, the ordered list definition fails to work for such distributions. So we need a better definition of median, which generalises. We observe that half the values are greater than the median, and so in a probabilistic setting, we say that the probability of being less than the median is equal to the probability of being greater. So we want to define it implicitly as:

\mathbb{P}(X>M)=\mathbb{P}(X<M).

So for a continuous distribution without atoms,

\mathbb{P}(X>M)=\frac12,

and, provided the distribution function is strictly increasing there, this uniquely defines M.

The natural question to start asking is how this compares to the mean. In particular, we want to discuss the relative sizes. Any result about the possible relative values of the mean and median can be reversed by considering the negation of the random variable, so we focus on continuous random variables with non-negative support. If nothing else, these are the conditions for lots of data we might be interested in sampling in the ‘real world’.

It’s worth having a couple of questions to clarify what we are interested in. How about: is it possible for the mean to be 1000 times larger than the median; and is it possible for the median to be 1000 times larger than the mean?

The latter is easier to address. If the median is 1000 and the mean is 1, then with probability ½ the random variable X is at least 1000. So these values make a contribution to the mean of at least 500, while the other values make a contribution of at least zero (since we’ve demanded the RV be positive). This is a contradiction.

The former question turns out to be possible. The motivation should come from our box-and-whisker diagram! Once we have fixed the middle box, the median and quartiles are fixed, but we are free to fiddle with the outer regions as much as we like, so by making the max larger and larger, we can increase the mean freely without affecting the median. Perhaps it is clearest to view a discrete example: 1, 2, N. The median will always be 2, so we can increase N as much as desired to get a large mean.
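The same discrete example in code (a trivial sketch), showing the median stuck at 2 while the mean grows without bound:

```python
import statistics

for N in (3, 100, 10**6):
    data = [1, 2, N]
    print(N, statistics.median(data), statistics.mean(data))
# the median is always 2, while the mean is roughly N/3 for large N
```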

The first answer is in a way more interesting, because it generalises to give a result about the tail of distributions. Viewing the median as the ½-quantile, we are saying that it cannot be too large relative to the mean. Markov’s inequality provides an analogous statement about general quantiles. Instead of thinking about the constant a in an a-quantile, we look at values in the support.

Suppose we want a bound on \mathbb{P}(X\ge a) for some positive a. If we define the function f by

f(x)=a \textbf{1}_{\{x\ge a\}},

then f(x)\le x for all values in the (non-negative) support. Hence the mean of f(X) is at most the mean of X. But the mean of f(X) can be calculated as

a\mathbb{P}(X\ge a),

and so we conclude that

\mathbb{P}(X\ge a)\leq \frac{\mu}{a},

which is Markov’s Inequality.

It is worth remarking that this is trivially true when a\le \mu, since probabilities are always at most 1 anyway. Even beyond this region, it is generally quite weak. Note that it becomes progressively stronger if the contribution to the mean from terms greater than a is mainly driven by the contribution from terms close to a. So the statement is strong if the random variable has a light tail.
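A quick numerical illustration of how weak the bound typically is, sketched for an Exp(1) random variable (mean 1, light tail):

```python
import numpy as np

mu = 1.0                    # mean of an Exp(1) random variable
for a in (2, 5, 10):
    markov_bound = mu / a
    true_tail = np.exp(-a)  # P(X >= a) for Exp(1)
    print(a, markov_bound, true_tail)
# the bound 1/a is far larger than the true tail e^(-a) for this light-tailed RV
```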

This motivates considering deviations from the mean, rather than the random variable itself. Applying Markov to the squared distance from the mean, which penalises large deviations more heavily, gives Chebyshev’s Inequality:

\mathbb{P}(|X-\mu|^2>a\sigma^2)\le \frac{1}{a}.

Applying Markov to an exponential function of a random variable is called a Chernoff bound, and, after optimising over the exponent, this gives in some sense the best bound on the tails of a distribution obtainable in this way.

Bayesian Inference and the Jeffreys Prior

Last term I was tutoring for the second year statistics course in Oxford. This post is about the final quarter of the course, on the subject of Bayesian inference, and in particular on the Jeffreys prior.

There are loads and loads of articles sitting around on the web contributing to the debate about the relative merits of Bayesian and frequentist methods. I do not want to continue that debate here, partly because I don’t have a strong opinion, but mainly because I don’t really understand that much about the underlying issues.

What I will say is that after a few months of working fairly intensively with various complicated stochastic processes, I am starting to feel fairly happy throwing about conditional probability rather freely. When discussing some of the more combinatorial models for example, quite often we have no desire to compute or approximate complicated normalising constants, and so instead talk about ‘weights’. And a similar idea underlies Bayesian inference. As in frequentist methods, we have an unknown parameter, and we observe some data. Furthermore, we know the probability that such data might have arisen under any value of the parameter. We want to make inference about the value of the parameter given the data, so it makes sense to multiply the probability that the data emerged as a result of some parameter value by some weighting on the set of parameter values.

In summary, we assign a prior distribution representing our initial beliefs about the parameter before we have seen any data, then we update this by weighting by the likelihood that the observed data might have arisen from a particular parameter. We often write this as:

\pi(\theta| x)\propto f(x|\theta)\pi(\theta),

or say that posterior = likelihood x prior. Note that in many applications it won’t be necessary to work out what the normalising constant on the distribution ought to be.

That’s the setup for Bayesian methods. I think the general feeling about the relative usefulness of such an approach is that it all depends on the prior. Once we have the prior, everything is concrete and unambiguously determined. But how should we choose the prior?

There are two cases worth thinking about. The first is where we have a lot of information about the problem already. This might well be the case in some forms of scientific research, where future analysis aims to build on work already completed. It might also be the case that we have already performed some Bayesian calculations, so our current prior is in fact the posterior from a previous set of experiments. In any case, if we have such an ‘informative prior’, it makes sense to use it in some circumstances.

Alternatively, it might be the case that for some reason we care less about the actual prior than about the mathematical convenience of manipulating it. In particular, certain likelihood functions give rise to conjugate priors, where the form of the posterior is the same as the form of the prior. For example, a normal likelihood function admits a normal conjugate prior, and a binomial likelihood function gives a Beta conjugate prior.
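For instance, here is a minimal sketch of the binomial/Beta update (the prior hyperparameters and the data are arbitrary illustrative choices):

```python
# Conjugacy: a Beta(a, b) prior with a Bin(n, theta) likelihood and x observed
# successes gives a Beta(a + x, b + n - x) posterior, with no need to compute
# the normalising constant.
a, b = 2, 2          # arbitrary prior hyperparameters
n, x = 20, 14        # hypothetical data: 14 successes in 20 trials

posterior = (a + x, b + n - x)
print("Posterior:", posterior)                    # Beta(16, 8)
print("Posterior mean:", (a + x) / (a + b + n))   # 16/24, about 0.67
```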

In general though, it is entirely possible that neither of these situations will hold but we still want to try Bayesian analysis. The ideal situation would be if the choice of prior had no effect on the analysis, but if that were true, then we couldn’t really be doing any Bayesian analysis. The Jeffreys prior is one natural candidate because it removes a specific problem with choosing a prior to express ignorance.

It sounds reasonable to say that if we have total ignorance about the parameter, then we should take the prior to be uniform on the set of possible values taken by the parameter. There are two potential objections to this. The first is that if the parameter could take any real value, then the prior will not be a distribution as the uniform distribution on the reals is not normalisable. Such a prior is called improper. This isn’t a huge problem really though. For making inference we are only interested in the posterior distribution, and so if the posterior turns out to be normalisable we are probably fine.

The second problem is more serious. Even though we want to express ignorance of the parameter, is there a canonical choice for THE parameter? An example will make this objection more clear. Suppose we know nothing about the parameter T except that it lies in [0,1]. Then the uniform distribution on [0,1] seems like the natural candidate for the prior. But what if we considered T^100 to be the parameter instead? Again if we have total ignorance we should assign T^100 the uniform distribution on its support, which is again [0,1]. But if T^100 is uniform on [0,1], then T is massively concentrated near 1, and in particular cannot also be uniformly distributed on [0,1]. So as a minimum requirement for expressing ignorance, we want a way of generating a prior that doesn’t depend on the choice of parameterisation.

The Jeffreys prior has this property. Note that there may be separate problems with making such an assumption, but this prior solves this particular objection. We define it to be \pi(\theta)\propto [I(\theta)]^{1/2} where I is the Fisher information, defined as

I(\theta)=-\mathbb{E}_\theta\Big[\frac{\partial^2 l(X_1,\theta)}{\partial \theta^2}\Big],

where the expectation is over the data X_1 for fixed \theta, and l is the log-likelihood. Proving that this has the property that it is invariant under reparameterisation requires demonstrating that the Jeffreys prior corresponding to g(\theta) is the same as applying a change of measure to the Jeffreys prior for \theta. The proof is a nice exercise in the chain rule, and I don’t want to reproduce it here.

For a Binomial likelihood function, we find that the Jeffreys prior is Beta(1/2,1/2), which has density that looks roughly like a bucket suspended above [0,1]. It is certainly worth asking why the ‘natural’ choice for prior might put lots of mass at the edge of the domain for the parameter.
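The binomial case is short enough to check symbolically. The following sketch (my own, using sympy, and substituting \mathbb{E}X=n\theta by hand for the expectation) recovers a prior density proportional to \theta^{-1/2}(1-\theta)^{-1/2}, i.e. Beta(1/2,1/2):

```python
import sympy as sp

theta, n, x = sp.symbols('theta n x', positive=True)

# Binomial log-likelihood, up to terms constant in theta
log_lik = x * sp.log(theta) + (n - x) * sp.log(1 - theta)

# Observed information, then substitute E[X] = n*theta for the Fisher information
observed_info = -sp.diff(log_lik, theta, 2)
fisher_info = sp.simplify(observed_info.subs(x, n * theta))

print(fisher_info)            # n/(theta*(1 - theta))
print(sp.sqrt(fisher_info))   # Jeffreys prior proportional to 1/sqrt(theta*(1-theta))
```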

I don’t have a definitive answer, but I do have an intuitive idea which comes from the meaning of the Fisher information. As minus the expected second derivative of the log-likelihood, a large Fisher information means that with high probability we will see data for which the likelihood changes substantially if we vary the parameter. In particular, this means that the posterior probability of a parameter close to 0 will be eliminated more quickly by the data if the true parameter is different.

If the variance is small, as it is for parameter near 0, then the data generated by this parameter will have the greatest effect on the posterior, since the likelihood will be small almost everywhere except near the parameter. We see the opposite effect if the variance is large. So it makes sense to compensate for this by placing extra prior mass at parameter values where the data has the strongest effect. Note that in the previous example, the Jeffreys prior is in fact exactly inversely proportional to the standard deviation. For the above argument to make sense, we need it to be monotonic with respect to SD, and it just happens that in this case, being 1/SD is precisely the form required to be invariant under reparameterisation.

Anyway, I thought that was reasonably interesting, as indeed was the whole course. I feel reassured that I can justify having my work address as the Department of Statistics since I now know at least epsilon about statistics!

Extreme Value Theory

This is something interesting which came up on the first problem sheet for the Part A Statistics course. The second question introduced the Weibull distribution, defined in terms of parameters \alpha,\lambda>0 through the distribution function:

F(x)=\begin{cases}0 & x<0\\ 1-\exp(-(\frac{x}{\lambda})^\alpha) & x\geq 0.\end{cases}

As mentioned in the statement of the question, this distribution is “typically used in industrial reliability studies in situations where failure of a system comprising many similar components occurs when the weakest component fails”. Why could that be? Expressed more theoretically, the lifetimes of various components might reasonably be assumed to behave like i.i.d. random variables in many contexts. Then the failure time of the system is given by the minimum of the constituent random variables.
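One way to see why the Weibull is a sensible choice here is that it is closed under taking minima: the minimum of n i.i.d. Weibull lifetimes is again Weibull, with the same \alpha and scale \lambda n^{-1/\alpha}. The following sketch checks this by simulation (the parameter values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, lam, n_components = 1.5, 2.0, 50

# Component lifetimes ~ Weibull(alpha, scale lam); the system fails at the minimum.
lifetimes = lam * rng.weibull(alpha, size=(100_000, n_components))
system_failure = lifetimes.min(axis=1)

# Closure under minima: the minimum is Weibull with scale lam * n^(-1/alpha).
predicted_scale = lam * n_components ** (-1 / alpha)
predicted_median = predicted_scale * np.log(2) ** (1 / alpha)
print(np.median(system_failure), predicted_median)   # should agree closely
```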

So this begs the question: what does the distribution of minimum of a collection of i.i.d. random variables look like? First, we need to think why there should be an answer at all. I mean, it would not be unreasonable to assume that this would depend rather strongly on the underlying distribution. But of course, we might say the same thing about sums of i.i.d. random variables, but there is the Central Limit Theorem. Phrased in a way that is deliberately vague, this says that subject to some fairly mild conditions on the underlying distribution (finite variance in this case), the sum of n i.i.d. RVs looks like a normal distribution for large n. Here we know what ‘looks like’ means, since we have a notion of a family of normal distributions. Formally, though, we might say that ‘looks like’ means that the image of the distribution under some linear transformation, where the coefficients are possibly functions of n, converges to the distribution N(0,1) as n grows.

The technical term for this is to say that the underlying RV we are considering, which in this case would be X_1+\ldots +X_n, is in the domain of attraction of N(0,1). Note that other distributions in the family of normals are also in the domain of attraction of N(0,1), and vice versa, so this forms an equivalence relation on the space of distributions, though this is not hugely helpful since most interesting statements involve some sort of limit.

Anyway, with that perspective, it is perhaps more reasonable to imagine that the minimum of a collection of i.i.d. RVs might have some limit distribution. Because we typically feel more comfortable thinking about right-tails rather than left-tails of probability distributions, this problem is more often considered for the maximum of i.i.d. RVs. The Fisher-Tippett-Gnedenko theorem, proved in various forms in the first half of the 20th century, asserts that again under mild regularity assumptions, the maximum of such a collection does lie in the domain of attraction of one of a small set of distributions. The Weibull distribution as defined above is one of these. (Note that if we are considering domains of attraction, then scaling x by a constant is of no consequence, so we can drop the parameterisation by \lambda.)

This is considered the first main theorem of Extreme Value Theory, which addresses precisely this sort of problem. It is not hard to consider why this area is of interest. To decide how much liquidity they require, an insurance company needs to know the likely size of the maximum claim during the policy. Similarly, the designer of a sea-wall doesn’t care about the average wave-height – what matters is how strong the once-in-a-century storm which threatens the town might be. A good answer might also explain how to resolve the apparent contradiction that most human characteristics are distributed roughly normally across the population. Normal distributions are unbounded, yet physiological constraints enable us to state with certainty that there will never be twelve foot tall men (or women). In some sense, EVT is a cousin of Large Deviation theory, the difference being that unlikely events in a large family of i.i.d. RVs are considered on a local scale rather than globally. Note that large deviations for Cramer’s theorem in the case where the underlying distribution has a heavy tail are driven by a single very deviant value, rather than by lots of slightly deviant data, so in this case the theories are comparable, though generally analysed from different perspectives.

In fact, we have to consider the reversed Weibull distribution for a maximum, which is supported on (-\infty,0]. This is one of three possible distribution families for the limit of a maximum. The other two are the Gumbel distribution

F(x)=e^{-e^{-x}},

and the Frechet distribution

F(x)=\exp(-x^{-\alpha}),\quad x>0.

Note that \alpha is a positive parameter in both the Frechet and reversed Weibull distributions, while the Gumbel has no shape parameter. These three distributions can be expressed as a single one-parameter family, the Generalised Extreme Value distribution.

The differences between them lie in the tail behaviour. The reversed Weibull distribution has an actual upper bound, the Gumbel an exponential, fast-decaying tail, and the Frechet a polynomial ‘fat’ tail. It is not completely obvious that these properties are inherited from the original distribution. After all, to get from the original distribution to extreme value distribution, we are taking the maximum, then rescaling and translating in a potentially quite complicated way. However, it is perhaps reasonable to see that the property of the underlying distribution having an upper bound is preserved through this process. Obviously, the bound itself is not preserved – after all, we are free to apply arbitrary linear transformations to the distributions!

In any case, it does turn out to be the case that the U[0,1] distribution has maximum converging to a reversed Weibull; the exponential tails of the Exp(1) and N(0,1) distributions lead to a Gumbel limit; and the fat-tailed Pareto distribution gives the Frechet limit. The calculations are reasonably straightforward, especially once the correct rescaling is known. See this article from Duke for an excellent overview and the details for these examples I have just cited. These notes discuss further properties of these limiting distributions, including the unsurprising fact that their form is preserved under taking the maximum of i.i.d. copies. This is analogous to the fact that the family of normal distributions is preserved under taking arbitrary finite sums.
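Here is a simulation sketch of the first two of those limits, using the standard normalisations for the uniform and exponential cases (the sample sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 1000, 10_000

# Uniform case: n*(max - 1) should be approximately reversed Weibull (alpha = 1),
# i.e. have distribution function exp(x) on (-infinity, 0].
u_max = rng.uniform(size=(reps, n)).max(axis=1)
print(np.mean(n * (u_max - 1) <= -1), np.exp(-1))   # both approximately 0.368

# Exponential case: max - log(n) should be approximately Gumbel,
# i.e. have distribution function exp(-exp(-x)).
e_max = rng.exponential(size=(reps, n)).max(axis=1)
print(np.mean(e_max - np.log(n) <= 0), np.exp(-1))  # both approximately 0.368
```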

From a statistical point of view, devising a good statistical test for what class of extreme value distribution a particular set of data obeys is of great interest. Why? Well mainly because of the applications, some of which were suggested above. But also because of the general statistical principle that it is unwise to extrapolate beyond the range of the available data. But that is precisely what we need to do if we are considering extreme values. After all, the designer of that sea-wall can’t necessarily rely on the largest storm in the future being roughly the same as the biggest storm in the past. So because the EVT theorem gives a clear description of the possible limit distributions, to understand the limiting behaviour, which is where the truly large extremes might occur, it suffices to find a good test for the form of the limit distribution – that is, which of the three possibilities is relevant, and what the parameter should be. This seems to be fairly hard in general. I didn’t understand much of it, but this paper provided an interesting review.

Anyway, that was something interesting I didn’t know about (for the record, I also now know how to construct a sensible Q-Q plot for the Weibull distribution!), though I am assured that EVT was a core element of the mainstream undergraduate mathematics syllabus forty years ago.

Simpson’s Paradox – making sense of statistical confusion

                     Test 1                               Test 2
            Success  Failure  Success (%)       Success  Failure  Success (%)
Drug          30       70         30               30       20        60
Placebo       10       40         20              100      100        50

Consider the introduction of a new drug, which needs to be tested against a placebo. A first test suggests that the drug is an improvement, in so far as its effects are better than doing nothing. As is usual practice, data only becomes respected when its conclusions are replicated, so a second test, with fresh subjects, is performed, and a similar improvement is seen. Flushed with success, the scientist writes up his findings, combining the results of the two tests for maximum impact.

            Success  Failure  Success (%)
Drug          60       90         40
Placebo      110      140         44

But how can this have happened? Whereas the drug was clearly significantly better than the placebo in both of the tests, when the results are combined, it has performed worse. This is often called Simpson’s Paradox, but like most paradoxes, it points to a flaw in the mathematician rather than in the maths. Let’s think about some interpretations of this confusing phenomenon.

Firstly, even the most casual second glance at the original data table makes it seem less surprising that this problem has arisen. The performance of both drug and placebo was significantly better in the second test. However, in the second test, there were loads of placebo subjects, and many fewer drug subjects. A sensible thing to consider is to take the situation to a natural extreme. Suppose the second test had only one subject on the drug, and it had been a success. Then, we would have claimed 100% performance on that test, but this wouldn’t have influenced the overall performance across both surveys enormously. If they had only tested one person, this would have immediately flagged up suspicions when we glanced at the table, but in the information given, we were lulled into a false sense of security because the data sizes didn’t seem too weird.
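The reversal really is just arithmetic about weights, as a few lines of code on the tables above confirm:

```python
# (successes, failures) for each arm in each of the two tests, from the tables above
drug = [(30, 70), (30, 20)]
placebo = [(10, 40), (100, 100)]

def rate(successes, failures):
    return successes / (successes + failures)

for i in range(2):
    print(f"Test {i + 1}: drug {rate(*drug[i]):.0%}, placebo {rate(*placebo[i]):.0%}")

def pooled(arm):
    return rate(sum(s for s, f in arm), sum(f for s, f in arm))

print(f"Combined: drug {pooled(drug):.0%}, placebo {pooled(placebo):.0%}")
# the drug wins each test (30% vs 20%, 60% vs 50%) but loses overall (40% vs 44%)
```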

Secondly, when thinking about any kind of statistical analysis, data needs to be taken in the context in which it is provided. However, this ‘paradox’ is entirely arithmetic. There is no explicit reference to the pharmaceutical nature of the situation in the data. Suppose instead we had presented the figures in a sporting context.

                     Australia                          Bangladesh
            Innings   Runs   Average            Innings   Runs   Average
Cook           2        60      30                  1        60      60
Strauss        1        20      20                  4       200      50

This table presents essentially the same numerical information in the context of two batsmen’s performances against two different countries. Against both Australia and Bangladesh, Cook has a higher average than Strauss, but as before, when the total average is calculated, Strauss with 44 emerges above Cook with 40. But in this setting, it all makes a lot more sense. Of course Strauss’ average will be relatively inflated because he has spent much more time playing against easier opposition.

Perhaps the most intelligent remark to make is that given the original data, to add the results of the two tests is absurd. The success or otherwise of a drug should be reasonably constant (to within whatever we mean by reasonable statistical error) across two tests. If we get wildly different figures, especially for the placebo data (!), then the only conclusion can be that there are some other factors affecting the patients which we haven’t accounted for, but which are having a much greater effect than the drug! (Note that this doesn’t apply in the cricket framework, because there is no reason to suppose that a batsman’s performance in one series should be similar to any other. But in this case, we have previously justified this adding of contrasting data, so all is fine.)

Most articles about a paradox would conclude by giving general advice about avoiding the pitfalls. And I guess in this situation it’s not too tricky: relative sample sizes are a factor in determining net averages; and adding test results which are completely different can be dangerous even if they both give the same crude conclusion. Perhaps the best moral might be that, while school statistics courses are often forced to encourage an attitude of ‘if in doubt, draw a histogram’, having a think about the underlying situation can be useful before getting deep into some number-crunching.