| Test 1 | | | Test 2 | | |
| Success | Failure | Success (%) | Success | Failure | Success (%) |
Consider the introduction of a new drug, which needs to be tested against a placebo. A first test suggests that the drug is an improvement, insofar as its effects are better than doing nothing. As is usual practice, data only becomes respected when its conclusions are replicated, so a second test, with fresh subjects, is performed, and a similar improvement is seen. Flushed with success, the scientist writes up his findings, combining the results of the two tests for maximum impact.
But how can this have happened? Although the drug was clearly better than the placebo in each of the two tests, once the results are combined it performs worse. This is often called Simpson’s Paradox, but like most paradoxes, it points to a flaw in the mathematician rather than in the maths. Let’s think about some interpretations of this confusing phenomenon.
Firstly, even the most casual second glance at the original data table makes it seem less surprising that this problem has arisen. The performance of both drug and placebo was significantly better in the second test. However, in the second test, there were loads of placebo subjects, and many fewer drug subjects. A sensible thing to do is to take the situation to a natural extreme. Suppose the second test had only one subject on the drug, and it had been a success. Then we would have claimed 100% performance on that test, but this wouldn’t have influenced the overall performance across both tests enormously. If they had only tested one person, this would have immediately raised suspicions when we glanced at the table, but in the information given, we were lulled into a false sense of security because the sample sizes didn’t seem too weird.
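To see the arithmetic at work, here is a minimal Python sketch. The counts are hypothetical, chosen only to mimic the shape described above (both arms do better in the second test, which has many placebo subjects and few drug subjects); they are not the figures from the table. The names `rate`, `combined`, `trials` and `extreme` are likewise just for illustration.

```python
from fractions import Fraction


def rate(successes, failures):
    """Success rate as an exact fraction of all subjects."""
    return Fraction(successes, successes + failures)


def combined(trials, arm):
    """Pool the raw counts for one arm across all tests, then take the rate."""
    successes = sum(test[arm][0] for test in trials.values())
    failures = sum(test[arm][1] for test in trials.values())
    return rate(successes, failures)


# Hypothetical (successes, failures) counts, NOT the article's data.
trials = {
    "Test 1": {"drug": (60, 140), "placebo": (10, 40)},
    "Test 2": {"drug": (45, 5), "placebo": (160, 40)},
}

# The extreme case: a single drug subject in the second test, who succeeds.
extreme = {
    "Test 1": trials["Test 1"],
    "Test 2": {"drug": (1, 0), "placebo": (160, 40)},
}

if __name__ == "__main__":
    for name, test in trials.items():
        print(name, float(rate(*test["drug"])), float(rate(*test["placebo"])))
    print("Combined", float(combined(trials, "drug")),
          float(combined(trials, "placebo")))
    print("Extreme case, combined drug rate:",
          float(combined(extreme, "drug")))
```

With these made-up counts the drug wins 30% to 20% in the first test and 90% to 80% in the second, yet pooling gives 42% for the drug against 68% for the placebo. In the extreme case the drug scores 100% in the second test, but its pooled rate is 61/201, barely above its first-test figure.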
Secondly, when thinking about any kind of statistical analysis, data needs to be taken in the context in which it is provided. However, this ‘paradox’ is entirely arithmetic. There is no explicit reference to the pharmaceutical nature of the situation in the data. Suppose instead we had presented the figures in a sporting context.
This table presents essentially the same numerical information in the context of two batsmen’s performances against two different countries. Against both Australia and Bangladesh, Cook has a higher average than Strauss, but as before, when the total average is calculated, Strauss with 44 emerges above Cook with 40. But in this setting, it all makes a lot more sense. Of course Strauss’ average will be relatively inflated because he has spent much more time playing against easier opposition.
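The weighting can be made explicit with a short identity. It assumes the usual definition of a batting average as runs scored divided by dismissals; the symbols $r_A, r_B$ (runs against each country), $d_A, d_B$ (dismissals) and $a_A, a_B$ (series averages) are notation introduced here, not taken from the table.

\[
a_{\text{total}} = \frac{r_A + r_B}{d_A + d_B}
= \frac{d_A}{d_A + d_B}\,a_A + \frac{d_B}{d_A + d_B}\,a_B,
\qquad a_A = \frac{r_A}{d_A},\quad a_B = \frac{r_B}{d_B}.
\]

So each batsman’s overall average is a weighted mean of his two series averages, and the weights are his own. A player whose dismissals came mostly against the weaker side has his total pulled towards the higher figure he recorded there, which is how the ordering can flip.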
Perhaps the most intelligent remark to make is that given the original data, to add the results of the two tests is absurd. The success or otherwise of a drug should be reasonably constant (to within whatever we mean by reasonable statistical error) across two tests. If we get wildly different figures, especially for the placebo data (!), then the only conclusion can be that there are some other factors affecting the patients which we haven’t accounted for, but which are having a much greater effect than the drug! (Note that this doesn’t apply in the cricket framework, because there is no reason to suppose that a batsman’s performance in one series should be similar to any other. But in this case, we have previously justified this adding of contrasting data, so all is fine.)
Most articles about a paradox would conclude by giving general advice about avoiding the pitfalls. And I guess in this situation it’s not too tricky: relative sample sizes are a factor in determining net averages; and adding test results which are completely different can be dangerous even if they both give the same crude conclusion. Perhaps the best moral might be that, while school statistics courses are often forced to encourage an attitude of ‘if in doubt, draw a histogram’, having a think about the underlying situation can be useful before getting deep into some number-crunching.