Gaussian tail bounds and a word of caution about CLT

The first part is more of an aside. In a variety of contexts, whether for testing Large Deviations Principles or calculating expectations by integrating over the tail, it is useful to know good approximations to the tail of various distributions. In particular, the exact form of the tail of a standard normal distribution is not particularly tractable. The following upper bound is therefore often extremely useful, especially because it is fairly tight, as we will see.

Let Z\sim N(0,1) be a standard normal RV. We are interested in the tail probability R(x)=\mathbb{P}(Z\geq x). The density function of a normal RV decays very rapidly, as the exponential of a quadratic function of x. This means we might expect that conditional on \{Z\geq x\}, with high probability Z is in fact quite close to x. This concentration of measure property would suggest that the tail probability decays at a rate comparable to the density function itself. In fact, we can show that:

\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}\quad\text{for all }x>0.

The proof is in fact relatively straightforward: since u\geq x>0 throughout the range of integration, we may insert a factor \frac{u}{x}\geq 1 into the integrand:

\mathbb{P}(Z>x)=\frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-u^2/2}du< \frac{1}{\sqrt{2\pi}}\int_x^\infty \frac{u}{x}e^{-u^2/2}du=\frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.

Just by comparing derivatives, we can also show that this bound is fairly tight. In particular:

\frac{1}{\sqrt{2\pi}}\frac{x}{x^2+1}e^{-x^2/2}<\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.
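As a quick sanity check (not needed for the argument), here is a short Python sketch comparing the two bounds with the exact tail, computed via the complementary error function; the sandwich tightens rapidly as x grows.

```python
# Numerical sanity check of the two-sided tail bound above.
# Exact Gaussian tail: P(Z > x) = erfc(x / sqrt(2)) / 2.
import math

def normal_tail(x):
    return 0.5 * math.erfc(x / math.sqrt(2))

for x in [1.0, 2.0, 3.0, 5.0]:
    phi = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # density at x
    lower = phi * x / (x * x + 1)                        # lower bound
    upper = phi / x                                      # upper bound
    print(f"x={x}: {lower:.3e} < {normal_tail(x):.3e} < {upper:.3e}")
```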

---

Now for the second part about CLT. The following question is why I started thinking about various interpretations of CLT in the previous post. Suppose we are trying to prove the Strong Law of Large Numbers for IID random variables with mean 0 and unit variance, and suppose we try to use an argument via Borel-Cantelli:

\mathbb{P}(\frac{S_n}{n}>\epsilon) = \mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n})\stackrel{\text{CLT}}{\approx}\mathbb{P}(Z>\epsilon\sqrt{n}).

Now we can use our favourite estimate on the tail of a normal distribution.

\mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\epsilon\sqrt{2\pi n}}e^{-\epsilon^2 n/2}

\Rightarrow \sum_n \mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\epsilon\sqrt{2\pi}}\sum_{n\geq 1}\left(e^{-\epsilon^2/2}\right)^n=\frac{1}{\epsilon\sqrt{2\pi}\left(e^{\epsilon^2/2}-1\right)}<\infty,

where we have bounded \frac{1}{\sqrt{n}}\leq 1 and summed the geometric series.

By Borel-Cantelli, we conclude that with probability 1, eventually \frac{S_n}{n}<\epsilon. This holds for all \epsilon>0, and a symmetric argument handles deviations below -\epsilon. We therefore obtain the Strong Law of Large Numbers.
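As a quick numerical sketch of the bookkeeping (with the arbitrary choice \epsilon=0.5), the following checks the Gaussian tail bound term by term and confirms that the bounding series has a finite sum. Note that this only verifies the Gaussian tail arithmetic, not the CLT step itself.

```python
# Minimal check of the Gaussian bookkeeping above, for the arbitrary choice eps = 0.5:
# each tail P(Z > eps*sqrt(n)) is compared with the bound exp(-eps^2 n/2)/(eps*sqrt(2*pi*n)),
# and the bounds are summed (the geometric tail beyond n = 200 is negligible here).
import math

eps = 0.5

def normal_tail(x):
    return 0.5 * math.erfc(x / math.sqrt(2))

total = 0.0
for n in range(1, 201):
    tail = normal_tail(eps * math.sqrt(n))
    bound = math.exp(-eps**2 * n / 2) / (eps * math.sqrt(2 * math.pi * n))
    assert tail <= bound
    total += bound

print(f"sum of bounds over n <= 200: {total:.4f}")
```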

The question is: was that application of CLT valid? It certainly looks ok, but I claim not. The main problem is that the deviations under discussion fall outside the remit of the CLT: the theorem gives a limiting expression for deviations on the \sqrt{n} scale, whereas here we are asking about deviations on the scale of n itself.

Let’s explain this another way, taking \epsilon=10^{-2} for concreteness. CLT says that as n becomes very large

\mathbb{P}(\frac{S_n}{\sqrt{n}}>1000)\approx \mathbb{P}(Z>1000).

But we don’t know how large n has to be before this approximation is even vaguely accurate. Note that with \epsilon=10^{-2}, we have \epsilon\sqrt{n}=1000 precisely when n=10^{10}. So if the approximation only becomes accurate for n=10^{12}, then it is of no use for estimating

\mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n}).

This looks like an artificial example, but the key point is that the problem only gets worse as n grows (or as we increase the threshold which currently reads 1000), and the approximation certainly cannot be trusted in the limit. [I find the original explanation in terms of the scale of deviation treated by CLT easier to work with, but hopefully this further clarifies.]
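To make the point slightly more concrete, here is a small numerical sketch (an illustration with arbitrary choices, not a proof): take S_n to be a sum of n=400 centred \pm 1 coin flips, which have mean 0 and unit variance, and compare the exact tail \mathbb{P}(S_n/\sqrt{n}>t) with the CLT proxy \mathbb{P}(Z>t). For moderate t the two agree reasonably well, but the relative error degenerates rapidly as t grows, which is precisely the regime the argument above needs.

```python
# Illustration with arbitrary parameters: S_n is a sum of n = 400 centred +/-1 coin
# flips, so S_n = 2*B - n with B ~ Binomial(n, 1/2). We compare the exact tail
# P(S_n/sqrt(n) > t) with the normal proxy P(Z > t) as the threshold t grows.
import math
from scipy.stats import binom  # assumes scipy is available

n = 400

def normal_tail(x):
    # P(Z > x) for standard normal Z, via the complementary error function
    return 0.5 * math.erfc(x / math.sqrt(2))

for t in [1, 2, 4, 8, 16]:
    # S_n > t*sqrt(n)  iff  B > (n + t*sqrt(n)) / 2
    exact = binom.sf((n + t * math.sqrt(n)) / 2, n, 0.5)
    proxy = normal_tail(t)
    print(f"t={t:>2}: exact={exact:.3e}  normal={proxy:.3e}  ratio={exact / proxy:.3e}")
```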

One solution might be to find some sort of uniform convergence criterion for CLT, ie a (hopefully rapidly decreasing) function f(n) such that

\sup_{x\in\mathbb{R}}|\mathbb{P}(\frac{S_n}{\sqrt{n}}>x)-\mathbb{P}(Z>x)|\leq f(n).

This is possible, and is given by the Berry-Esseen theorem, but even the most careful refinements in the special case where the third moment is finite fail to give better bounds than

f(n)\sim \frac{1}{\sqrt{n}}.

Adding this error term will certainly destroy any hope we had of the sum being finite, since \sum_n n^{-1/2} diverges. Of course, part of the problem is that the supremum in the above definition is certainly not going to be attained anywhere near the post-\sqrt{n} deviations we actually care about. For this approach to work, we would really want a bound on the error restricted to these larger-than-usual deviations.

By this stage, however, I hope it is clear what the cautionary note is, even if the argument could potentially be patched. CLT is a theorem about standard deviations. Separate principles are required to deal with the case of large deviations. This feels like a strangely ominous note on which to end, but I don’t think there’s much more to say. Do comment below if you think there’s a quick fix to the argument for SLLN presented above.

Large Deviations and the CLT

Taking a course on Large Deviations has forced me to think a bit more carefully about what happens when you have large collections of IID random variables. I guess the first thing to think about is ‘What is a Large Deviation’? In particular, how large or deviant does it have to be?

Of primary interest is the tail of the distribution function of S_n=X_1+\ldots+X_n, where the X_i are independent and identically distributed as X. As we can always negate everything later if necessary, we typically consider the probability of events of the type:

\mathbb{P}(S_n\geq \theta(n))

where \theta(n) is some function which almost certainly increases fairly fast with n. More pertinently, if we are looking for some limit which corresponds to an actual random variable, we perhaps want to look at lots of related \theta(n)s simultaneously. More concretely, we should fix \theta and consider the probabilities

\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha). (*)

Throughout, we lose no generality by assuming that \mathbb{E}X=0. Of course, it is possible that this expectation does not exist, but that is certainly a question for another post!

Now let’s consider the implications of our choice of \theta(n). If this increases too slowly with n, so that the typical fluctuation of S_n is larger than \theta(n), then the event might not be a large deviation at all. In fact, the difference between this event and the event \{S_n\geq 0\}, that is, that S_n exceeds its mean, becomes negligible, and so the probability at (*) might be close to 1/2, regardless of the value of \alpha. So the object \lim \frac{S_n}{\theta(n)}, whatever that means, certainly cannot be a proper random variable: if we did have convergence in distribution, the limit would have to consist of point masses at each of \{+\infty, -\infty\}.

On the other hand, if \theta(n) increases rapidly with n, then the probabilities at (*) might become very small indeed when \alpha>0. For example, we might expect:

\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha)=\begin{cases}0& \alpha>0\\1&\alpha<0.\end{cases}

and more information to be required when \alpha=0. This is what we mean by a large deviation event. Although we always have to define everything concretely in terms of some finite sum S_n, we are always thinking about the behaviour in the limit. A large deviation principle exists in an enormous range of cases to show that these probabilities in fact decay exponentially. Again, that is the subject for another post, or indeed the lecture course I’m attending.
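To illustrate what that exponential decay looks like in a concrete case (a sketch with arbitrary choices: fair \pm 1 coin flips and \alpha=0.2, neither of which comes from the discussion above), one can compute -\frac{1}{n}\log\mathbb{P}(S_n/n\geq\alpha) exactly via binomial tails and watch it settle towards a positive constant, the large deviation rate.

```python
# Sketch with arbitrary choices (fair +/-1 coin flips, alpha = 0.2): the quantity
# -(1/n) * log P(S_n/n >= alpha) settles towards a positive constant -- the large
# deviation rate -- which is what "the probabilities decay exponentially" means.
import math
from scipy.stats import binom  # assumes scipy is available

alpha = 0.2

for n in [100, 1000, 10000]:
    # S_n = 2*B - n with B ~ Binomial(n, 1/2), so S_n/n >= alpha iff B >= n*(1 + alpha)/2
    k = math.ceil(n * (1 + alpha) / 2)
    prob = binom.sf(k - 1, n, 0.5)  # P(B >= k)
    print(f"n={n:>6}: P={prob:.3e}  -log(P)/n={-math.log(prob) / n:.5f}")
```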

Instead, I want to return to the Central Limit Theorem. I first encountered this result in popular science books in a vague “the histogram of boys’ heights looks like a bell” kind of way, then, once a normal random variable had been to some extent defined, it returned in A-level statistics courses in a slightly more fleshed out form. As an undergraduate, you see it in several forms, including as a corollary of Lévy’s convergence theorem.

In all applications though, it is generally used as a method of calculating good approximations. It is not uncommon to see it presented as:

\mathbb{P}(a\sigma\sqrt{n}+\mu n\leq S_n\leq b\sigma\sqrt{n}+\mu n)\approx \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}dx.
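A typical instance of this use, as a rough sketch with arbitrary numbers (n=1000 Bernoulli(0.3) trials and the interval [280,320], none of which come from the discussion above): approximate a binomial probability by the corresponding normal integral and compare with the exact value.

```python
# Sketch of the standard "calculating approximations" use of the CLT, on an arbitrary
# example: S_n ~ Binomial(1000, 0.3), so per-trial mean mu = 0.3 and sigma = sqrt(0.3*0.7).
# We approximate P(280 <= S_n <= 320) by the normal integral over the matching [a, b].
import math
from scipy.stats import binom  # assumes scipy is available

n, p = 1000, 0.3
mu, sigma = p, math.sqrt(p * (1 - p))

def normal_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2))

lo, hi = 280, 320
a = (lo - mu * n) / (sigma * math.sqrt(n))  # lower limit on the Z scale
b = (hi - mu * n) / (sigma * math.sqrt(n))  # upper limit on the Z scale
approx = normal_cdf(b) - normal_cdf(a)
exact = binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)
print(f"CLT approximation: {approx:.4f}, exact binomial: {exact:.4f}")
```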

Although in many cases that kind of approximation is the right way to use it, it isn’t the most interesting aspect of the theorem itself. CLT says that the correct scaling of \theta(n), so that the deviation probabilities lie between the two cases outlined above, is the same (namely \theta(n)=O(\sqrt{n}) in some sense) for an enormous class of distributions, and in particular, most distributions that one might encounter in practice (ie finite mean, finite variance). There is even greater universality: the limiting distribution at this interface has the same form (an appropriately scaled normal distribution) whenever X lies in this class of distributions. I think that goes a long way to explaining why we should care about the theorem. It also immediately prompts several questions:

  • What happens for less regular distributions? It is now clearer what the right question to ask in this setting might be. What is the appropriate scaling for \theta(n) in this case, if such a scaling exists? Is there a similar universality property for suitable classes of distributions?
  • What is special about the normal distribution? The theorem itself shows us that it appears as a universal scaling limit in distribution, but we might reasonably ask what properties such a distribution should have, as perhaps this will offer a clue to a version of CLT or LLNs for less regular distributions.
  • We can see that the Weak Law of Large Numbers follows immediately from CLT. In fact we can say more, perhaps a Slightly Less Weak LLN, that

\frac{S_n-\mu n}{\sigma \theta(n)}\stackrel{d}{\rightarrow}0

  • whenever \sqrt{n}\ll\theta(n). But of course, we also have a Strong Law of Large Numbers, which asserts that the empirical mean converges almost surely. What is the threshold for almost sure convergence, given that there is no a priori reason why it should be \theta(n)=n?

To be continued next time.