Avoiding Mistakes in Probability Exams

Over the past week, I’ve given several tutorials to second year undergraduates preparing for upcoming papers on probability and statistics. In particular, I’ve now seen a lot of solutions to a lot of past papers and specimen questions, and it’s worthwhile to consider some of the typical mistakes students can make on these questions. Of course, as with any maths exam, there’s always the possibility of a particularly subtle or involved question coming up, but if the following three common areas of difficulty can be avoided, you’re on track for doing well.

Jacobians

In a previous course, a student will have learned how to calculate the pdf of a function of a random variable. Here, we move on to the more interesting and useful case of finding the (joint) density of functions of two or more random variables. The key thing to remember is that manipulating pdfs is not a strange arbitrary exercise – it is just integration. It is rarely of interest to consider the value of a pdf at a single point. We draw meaningful conclusions from a pdf, or from comparison of two pdfs, by integrating them.

Then the question of substituting for new random variables is precisely integration by substitution, which we are totally happy with in the one-dimensional case, and should be fairly happy with in the two-dimensional case. To get from one joint density to another, we multiply by the absolute value of the Jacobian. To ensure you get it right, it makes sense to write out the informal infinitesimal relation

f_{U,V}(u,v) du dv = f_{X,Y}(x,y)dx dy.

This becomes a genuine statement about integrals once we put integral signs in front of both sides, and it explains why you obtain f_{U,V} = \left|\frac{\partial(x,y)}{\partial(u,v)}\right| f_{X,Y} rather than the other way round. Note though that if \frac{\partial(u,v)}{\partial(x,y)} is easier to calculate for some reason, then you can evaluate this and take the reciprocal, as your functions will almost certainly be locally bijective almost everywhere.

It is important to take the modulus of the Jacobian, since densities cannot be negative! If this looks like a fudge, consider the situation in one dimension. If we substitute x\mapsto g(x)=1-x, then g' is obviously negative, but we also end up reversing the order of the bounds of the integral: eg [1/3, 3/4] becomes [2/3, 1/4]. So we have a negative integrand (after multiplying by g'(x)) but bounds in the ‘wrong’ order. These two factors of -1 cancel, so it suffices just to multiply by |g'(x)| at that stage. It is harder to express in words, but a similar relation works for the Jacobian substitution.

You also need to check where the new joint density is non-zero. Suppose X, Y are supported on [0,1]; then when we write f_{X,Y}(x,y) we should indicate that it is 0 off this region, either by splitting into cases, or by adding the indicator function 1_{\{x,y\in[0,1]\}} as a factor. This is even more important after substitutions, as the range of the resulting random variables might be less obvious than that of the originals. For example, with X,Y as above, and U=X^2, V=X/Y, the resulting pdf will be non-zero only when u\in[0,1], v\ge \sqrt{u}. Failing to account for this will often lead to ludicrous answers. A general rule is that you can always check that any distribution you’ve produced does actually integrate to one.
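The U=X^2, V=X/Y example above can be sanity-checked numerically. The Jacobian calculation for X, Y iid Uniform[0,1] gives f_{U,V}(u,v) = 1/(2v^2) on the region u\in(0,1], v\ge\sqrt{u}, and the sketch below (standard-library Python; the particular test region P(U ≤ 1/2, V ≤ 1) is my choice for illustration, not from the text) compares the analytic probability of that region against a Monte Carlo estimate.

```python
import math
import random

# Check the substitution U = X^2, V = X/Y with X, Y iid Uniform[0,1].
# The Jacobian gives f_{U,V}(u,v) = 1/(2 v^2) on {0 < u <= 1, v >= sqrt(u)}.
# Analytic probability of the test region {U <= 1/2, V <= 1}:
#   int_0^{1/2} int_{sqrt(u)}^1 1/(2 v^2) dv du = sqrt(1/2) - 1/4
analytic = math.sqrt(0.5) - 0.25

random.seed(0)
n = 200_000
hits = 0
for _ in range(n):
    x, y = random.random(), random.random()
    u, v = x * x, x / y          # transformed variables
    if u <= 0.5 and v <= 1:
        hits += 1
empirical = hits / n

print(analytic, empirical)  # should agree to roughly two decimal places
```

The same comparison on a few different test regions (or a numerical check that the density integrates to 1 over the whole support) is a quick way to catch a missing indicator function or a wrong Jacobian.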

Convergence using MGFs

There are two main reasons to use MGFs and PGFs. The first is that they behave nicely when applied to (possibly random) sums of independent random variables. The independence property is crucial to allow splitting of the MGF of the sum into the product of MGFs of the summands. Of course, implicit in this argument is that MGFs determine distributions.

The second is convergence: a key theorem of the course is that the correspondence between MGFs and distributions works even in the limit, so you can use MGFs to show convergence in distribution of a family of distributions. For this, you need to show that the MGFs converge pointwise on some interval [-a,a] around 0. (Note that the moments of the distribution are given by the family of derivatives at 0, as motivation for why this condition might be necessary.) Normally for such questions, you will have been asked to define the MGF earlier in the question, and will probably have found the MGF of a particular distribution or family of distributions, which might well end up appearing as the final answer.

Sometimes such an argument might involve substituting in something unusual, like t/N, rather than t, into a known MGF. Normally a Taylor series can be used to show the final convergence result. If you have a fraction, try to cancel terms so that you only have to evaluate one Taylor series, rather than lots.
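A standard instance of this kind of substitution (my choice of example, with an arbitrary rate λ = 2; the exam question in view may differ) is the Poisson limit of the binomial: substituting p = λ/N into the Binomial(N, p) MGF and letting N grow gives the Poisson(λ) MGF pointwise.

```python
import math

# Pointwise convergence of MGFs: if X_N ~ Binomial(N, lam/N), then
#   M_{X_N}(t) = (1 - lam/N + (lam/N) e^t)^N  ->  exp(lam (e^t - 1)),
# which is the MGF of Poisson(lam).

lam = 2.0  # arbitrary illustrative rate

def mgf_binomial(t, N):
    p = lam / N
    return (1 - p + p * math.exp(t)) ** N

def mgf_poisson(t):
    return math.exp(lam * (math.exp(t) - 1))

for t in (-0.5, 0.1, 0.5):
    print(t, mgf_binomial(t, 10_000), mgf_poisson(t))
```

This is exactly the (1 + a/N)^N → e^a Taylor-expansion step described above, made numerical: the error vanishes like 1/N for each fixed t.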

Using the Markov Property

The Markov property is initially confusing, but once we become comfortable with the statement, it is increasingly irritating to have to answer the question: “show that this process has the Markov property.” It is irritating because in most cases we want to answer: “because it obviously does!” This is compelling, but unlikely to be considered satisfactory in a mathematics exam. Normally we observe that the random dynamics of the next step are a function only of the present location. Looking for the word ‘independent’ in the statement of the process under discussion is a good place to start for any argument along these lines.

The most developed example of a Markov process in this course is the Poisson process. I’ve written far too much about this before, so I won’t do so again, except to say this. When we think of the Poisson process, we generally have two thoughts going through our minds, namely the equivalent definitions of IID exponential inter-arrival times, and stationary, Poisson increments (or the infinitesimal version). If we draw a sketch of a sample trajectory of this process, we can label everything up and it is clear how it all fits together. But if you are asked to give a definition of the Poisson process (N_t), it is inappropriate to talk about inter-arrival times unless you define them in terms of N_t, since that is the process you are actually trying to define! It is fine to write out

T_k:=\min\{t: N_t=k\},\quad N_t=\max\{k: Y_1+Y_2+\ldots+Y_k\le t\}\quad(\text{where } Y_k:=T_k-T_{k-1}\text{ are the inter-arrival times}),

but the relation between the two characterisations of the process is not obvious. That is why it is a theorem of the course.

We have to be particularly careful of the difference in definition when we are calculating probabilities of various events. A classic example is this: find the distribution of N_2, conditional on T_3=1. It’s very tempting to come up with some massive waffle to argue that the answer is 3+Po(1). The most streamlined observation is that the problem is easy if we are conditioning instead on N_1=3: we just use the independent Poisson increments definition of (N_t), with no reference to inter-arrival times required. But then the Markov property applied at time 1 says that the distribution of N_2 depends only on the value of N_1, not on the process on the interval [0,1). In a sense, the condition that T_3=1 is giving us extra information on the behaviour of the process up to time 1, and the Markov property, which we know holds for the Poisson process, asserts precisely that this extra information doesn’t matter.
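This example can be simulated directly. Conditional on T_3 = 1, the memoryless property means the process restarts at time 1 with fresh Exp(1) inter-arrival times (rate 1, as implied by the Po(1) increment in the text), so N_2 should be 3 + Po(1), with mean 4 and variance 1. A sketch:

```python
import random

# Simulate N_2 conditional on T_3 = 1 for a rate-1 Poisson process.
# By the Markov/memoryless property, arrivals after time 1 restart as
# iid Exp(1) inter-arrival times, so N_2 - 3 should be Poisson(1).

random.seed(1)

def sample_N2_given_T3_is_1():
    t, count = 1.0, 3                     # third arrival exactly at time 1
    while True:
        t += random.expovariate(1.0)      # fresh Exp(1) inter-arrival time
        if t > 2:
            return count                  # no more arrivals by time 2
        count += 1

n = 100_000
samples = [sample_N2_given_T3_is_1() for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
print(mean, var)  # should be close to 4 and 1 respectively
```

Note that the simulation never needs the history on [0,1): that is the Markov property doing its work.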

Large Deviations and the CLT

Taking a course on Large Deviations has forced me to think a bit more carefully about what happens when you have large collections of IID random variables. I guess the first thing to think about is: what is a large deviation? In particular, how large or deviant does it have to be?

Of primary interest is the tail of the distribution function of S_n=X_1+\ldots+X_n, where the X_i are independent and identically distributed as X. As we can always negate everything later if necessary, we typically consider the probability of events of the type:

\mathbb{P}(S_n\geq \theta(n))

where \theta(n) is some function which almost certainly increases fairly fast with n. More pertinently, if we are looking for some limit which corresponds to an actual random variable, we perhaps want to look at lots of related \theta(n)s simultaneously. More concretely, we should fix \theta and consider the probabilities

\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha). (*)

Throughout, we lose no generality by assuming that \mathbb{E}X=0. Of course, it is possible that this expectation does not exist, but that is certainly a question for another post!

Now let’s consider the implications of our choice of \theta(n). If this increases with n too slowly, and the likely deviation of S_n is greater than \theta(n), then the event might not be a large deviation at all. In fact, the difference between this event and the event that S_n is above 0 (that is, above its mean) becomes negligible, and so the probability at (*) might be close to 1/2, regardless of the value of \alpha. So the object \lim \frac{S_n}{\theta(n)}, whatever that means, certainly cannot be a proper random variable: if we were to have convergence in distribution, the limit would have to consist of point masses at each of \{+\infty,-\infty\}.

On the other hand, if \theta(n) increases rapidly with n, then the probabilities at (*) might become very small indeed when \alpha>0. For example, we might expect:

\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha)=\begin{cases}0& \alpha>0\\1&\alpha<0.\end{cases}

and more information to be required when \alpha=0. This is what we mean by a large deviation event. Although we always have to define everything concretely in terms of some finite sum S_n, we are always thinking about the behaviour in the limit. A large deviation principle exists in an enormous range of cases to show that these probabilities in fact decay exponentially. Again, that is the subject for another post, or indeed the lecture course I’m attending.
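The exponential decay can be seen exactly in the simplest case. For fair ±1 coin flips and the event \{S_n\ge\alpha n\}, Cramér's theorem (not derived in the text; the rate function below is the standard one for this distribution, stated here as an assumption) says -\frac{1}{n}\log\mathbb{P}(S_n\ge\alpha n)\to I(\alpha) with I(a)=\frac{1+a}{2}\log(1+a)+\frac{1-a}{2}\log(1-a). Since the binomial tail can be computed exactly, no simulation is needed:

```python
import math

# Exponential decay of large-deviation probabilities, computed exactly:
# X_i = +/-1 fair coin flips, event {S_n >= alpha n} with alpha = 0.2.
# Cramér rate function for this distribution:
#   I(a) = ((1+a)/2) log(1+a) + ((1-a)/2) log(1-a).

alpha = 0.2
rate = ((1 + alpha) / 2) * math.log(1 + alpha) \
     + ((1 - alpha) / 2) * math.log(1 - alpha)

def tail_prob(n):
    # S_n = 2B - n with B ~ Binomial(n, 1/2), so S_n >= alpha n iff B >= 3n/5.
    k0 = (3 * n + 4) // 5          # ceil(3n/5) in exact integer arithmetic
    return sum(math.comb(n, k) for k in range(k0, n + 1)) / 2 ** n

for n in (50, 200, 1000):
    print(n, -math.log(tail_prob(n)) / n, rate)
```

The printed values of -\log\mathbb{P}/n approach the rate from above as n grows; the gap of order \log n / n is the polynomial prefactor that the exponential-scale statement ignores.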

Instead, I want to return to the Central Limit Theorem. I first encountered this result in popular science books in a vague “the histogram of boys’ heights looks like a bell” kind of way, then, once a normal random variable had been to some extent defined, it returned in A-level statistics courses in a slightly more fleshed out form. As an undergraduate, you see it in several forms, including as a corollary following from Levy’s convergence theorem.

In all applications though, it is generally used as a method of calculating good approximations. It is not uncommon to see it presented as:

\mathbb{P}(a\sigma\sqrt{n}+\mu n\leq S_n\leq b\sigma\sqrt{n}+\mu n)\approx \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}dx.
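This approximation is easy to check numerically. A sketch (standard-library Python; the choice of Uniform[0,1] summands, n = 100 and [a,b] = [-1,1] is mine, for illustration) compares the empirical frequency against the normal integral, evaluated via the error function:

```python
import math
import random

# Check the CLT approximation: S_n a sum of n Uniform[0,1] variables,
# so mu = 1/2 and sigma^2 = 1/12; compare
#   P(a <= (S_n - mu n)/(sigma sqrt(n)) <= b)  vs  Phi(b) - Phi(a).

random.seed(2)

def Phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, trials = 100, 50_000
mu, sigma = 0.5, math.sqrt(1 / 12)
a, b = -1.0, 1.0

hits = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    z = (s - mu * n) / (sigma * math.sqrt(n))
    if a <= z <= b:
        hits += 1

print(hits / trials, Phi(b) - Phi(a))  # both near 0.68
```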

Although in many cases that is the right way to use it, it isn’t the most interesting aspect of the theorem itself. The CLT says that the correct scaling of \theta(n), so that the deviation probabilities lie between the two cases outlined above, is the same (that is, \theta(n) of order \sqrt{n}) for an enormous class of distributions, and in particular most distributions that one might encounter in practice (ie those with finite mean and finite variance). There is even greater universality: furthermore, the limit distribution at this interface has the same form (an appropriate normal distribution) whenever X is in this class of distributions. I think that goes a long way to explaining why we should care about the theorem. It also immediately prompts several questions:

  • What happens for less regular distributions? It is now clearer what the right question to ask in this setting might be. What is the appropriate scaling for \theta(n) in this case, if such a scaling exists? Is there a similar universality property for suitable classes of distributions?
  • What is special about the normal distribution? The theorem itself shows us that it appears as a universal scaling limit in distribution, but we might reasonably ask what properties such a distribution should have, as perhaps this will offer a clue to a version of CLT or LLNs for less regular distributions.
  • We can see that the Weak Law of Large Numbers follows immediately from CLT. In fact we can say more, perhaps a Slightly Less Weak LLN, that

\frac{S_n-\mu n}{\sigma \theta(n)}\stackrel{d}{\rightarrow}0

  • whenever \sqrt{n}\ll\theta(n). But of course, we also have a Strong Law of Large Numbers, which asserts that the empirical mean converges almost surely. What is the threshold for almost sure convergence? There is no a priori reason why it should be \theta(n)=n.
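The ‘slightly less weak LLN’ in the last bullet is easy to see in simulation. A sketch (my choice of \theta(n)=n^{3/4}, which sits strictly between \sqrt{n} and n, with Uniform[0,1] summands so \mu=1/2):

```python
import random

# With theta(n) = n^{3/4}, growing faster than sqrt(n), the quantity
# (S_n - mu n) / theta(n) should concentrate at 0 as n grows.

random.seed(3)

def scaled_deviation(n):
    s = sum(random.random() for _ in range(n))   # Uniform[0,1] summands
    return (s - 0.5 * n) / n ** 0.75

spreads = []
for n in (100, 10_000, 100_000):
    samples = [scaled_deviation(n) for _ in range(100)]
    spreads.append(sum(abs(x) for x in samples) / len(samples))

print(spreads)  # mean absolute deviation shrinks roughly like n^{-1/4}
```

Since the standard deviation of the scaled quantity is \sigma\sqrt{n}/n^{3/4}=\sigma n^{-1/4}, convergence in distribution to the constant 0 is visible directly.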

To be continued next time.

Levy’s Convergence Theorem

We consider some of the theory of Weak Convergence from the Part III course Advanced Probability. It has previously been seen, or at least discussed, that characteristic functions uniquely determine the laws of random variables. We will show Levy’s theorem, which equates weak convergence of random variables with pointwise convergence of their characteristic functions.

We have to start with the most important theorem about weak convergence, which is essentially a version of Bolzano-Weierstrass for measures on a metric space M. We say that a sequence of measures (\mu_n) is tight if for any \epsilon>0, there exists a compact K_\epsilon such that \sup_n\mu_n(M\backslash K_\epsilon)\leq \epsilon. Informally, each measure is concentrated compactly, and this property is uniform across all the measures. We can now state and prove a result of Prohorov:

Theorem (Prohorov): Let (\mu_n) be a tight sequence of probability measures. Then there exists a subsequence (n_k) and a probability measure \mu such that \mu_{n_k}\Rightarrow \mu.

Summary of proof in the case M=\mathbb{R}: by countability of \mathbb{Q}, we can use Bolzano-Weierstrass and a standard diagonal argument to find a subsequence such that the distribution functions

F_{n_k}(x)\rightarrow F(x)\quad\forall x\in\mathbb{Q}

Then extend F to the whole real line by taking a downward rational limit, which ensures that F is cadlag. Convergence of the distribution functions then holds at all points of continuity of F, by monotonicity and approximating by rationals from above. It only remains to check that F(-\infty)=0, F(\infty)=1, which follows from tightness. Specifically, monotonicity guarantees that F has at most countably many points of discontinuity, so we can choose some large N such that both N and -N are points of continuity, and exploit that, by tightness,

\inf_n \mu_n([-N,N])\geq 1-\epsilon

We can define the limit (Borel) measure from the distribution function by taking the obvious definition F(b)-F(a) on intervals, then lifting to the Borel sigma-algebra by Caratheodory’s extension theorem.

Theorem (Levy): X_n,X random variables in \mathbb{R}^d. Then:

L(X_n)\rightarrow L(X)\quad\iff\quad \phi_{X_n}(z)\rightarrow \phi_X(z)\quad \forall z\in\mathbb{R}^d

The direction \Rightarrow is easy: x\mapsto e^{i\langle z,x\rangle} is continuous and bounded.

In the other direction, we can in fact show a stronger constructive result. Precisely, if there exists \psi:\mathbb{R}^d\rightarrow \mathbb{C}, continuous at 0 with \psi(0)=1 (*), such that \phi_{X_n}(z)\rightarrow \psi(z)\quad \forall z\in\mathbb{R}^d, then \psi=\phi_X is the characteristic function of some random variable X, and L(X_n)\rightarrow L(X). Note that the conditions (*) are the minimal ones under which \psi could be a characteristic function.

We now proceed with the proof. We apply a lemma whose proof is essentially a calculation, and which we do not repeat here.

\mathbb{P}(||X_n||_\infty>K)\stackrel{\text{Lemma}}{<}C_dK^d\int_{[-\frac{1}{K},\frac{1}{K}]^d}(1-\Re \phi_{X_n}(u))du\stackrel{\text{DOM}}{\rightarrow}C_dK^d\int_{[-\frac{1}{K},\frac{1}{K}]^d} (1-\Re \psi(u))du

where we apply that the integrand is dominated by 2. From the conditions on \psi, this is <\epsilon for large enough K. This bound is of course also uniform in n, and so the random variables are tight. Prohorov then gives a convergent subsequence, and so a limit random variable exists.

Suppose the whole sequence does not converge to X. Then there is a subsequence staying outside some weak neighbourhood of L(X), and by Prohorov (tightness having been established above) it has a further subsequence which converges to some Y, say. By the direction of Levy already proved, the characteristic functions converge along this subsequence, so \phi_Y=\psi=\phi_X. But characteristic functions determine law, so L(Y)=L(X), which is a contradiction.
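The easy direction of Levy’s theorem can be illustrated numerically. A sketch (my choice of example: standardised sums of Uniform[0,1] variables, which converge weakly to a standard normal by the CLT, so the empirical characteristic function should approach e^{-z^2/2} pointwise):

```python
import cmath
import math
import random

# Empirical characteristic function of
#   X_n = (U_1 + ... + U_n - n/2) / sqrt(n/12),  U_i iid Uniform[0,1],
# compared with the standard normal characteristic function exp(-z^2/2).

random.seed(4)

def empirical_cf(z, n, trials=20_000):
    total = 0j
    for _ in range(trials):
        s = sum(random.random() for _ in range(n))
        x = (s - 0.5 * n) / math.sqrt(n / 12)
        total += cmath.exp(1j * z * x)      # average of e^{izX}
    return total / trials

for z in (0.5, 1.0, 2.0):
    print(z, empirical_cf(z, 50), math.exp(-z * z / 2))
```

The imaginary parts hover near 0 (the limit is symmetric) while the real parts track the Gaussian characteristic function, exactly as the continuous-and-bounded test function argument predicts.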