An obvious remark

An obvious remark:

If a sequence of independent random variables X_n converges almost surely to some limit X, then this limit must be a constant (almost surely).

I’ve been thinking about the Central Limit Theorem and related Large Deviations results this afternoon, and wasted almost an hour worrying about situations which were effectively well-disguised special cases of the above.

Why is it true? Write \mathcal{F}_n=\sigma(X_1,\ldots,X_n). Since X is the almost sure limit of the sequence, for each n it is (up to null sets) measurable with respect to \sigma(X_{n+1},X_{n+2},\ldots), which by independence is independent of \mathcal{F}_n. So X is independent of \mathcal{F}_n for every n, and thus independent of \sigma(\cup_n\mathcal{F}_n)=\mathcal{F}\supset \sigma(X). If X is independent of itself, it must be almost surely constant.
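
To spell out the final step (a quick check, writing F_X for the distribution function of X; this notation is mine): if X is independent of itself, then for every x,

\mathbb{P}(X\leq x)=\mathbb{P}(X\leq x, X\leq x)=\mathbb{P}(X\leq x)^2,

so F_X(x)\in\{0,1\} for every x, and X is almost surely equal to the constant \inf\{x: F_X(x)=1\}.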

The Poisson Process – A Third Characterisation

There remains the matter of the distribution of the number of people to arrive in a fixed, non-infinitesimal time interval. Consider the time interval [0,1], which we divide into n smaller intervals of equal width. As n grows large, the probability that two arrivals occur in the same small interval tends to zero (as this is \leq n\cdot o(\frac{1}{n})), so we can treat the small intervals as a sequence of iid Bernoulli random variables as before. So

\mathbb{P}(N_1=k)=\binom{n}{k}(\frac{\lambda}{n})^k(1-\frac{\lambda}{n})^{n-k}
\approx \frac{n^k}{k!} \frac{\lambda^k}{n^k}(1-\frac{\lambda}{n})^n\approx \frac{\lambda^k}{k!}e^{-\lambda}.

We recognise this as belonging to a Poisson (hence the name of the process!) random variable. We can repeat this for a general time interval and obtain N_t\sim \text{Po}(\lambda t).
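
For completeness, the repetition for a general interval [0,t] might look like this (the algebra here is mine, under the same implicit assumptions):

\mathbb{P}(N_t=k)=\binom{n}{k}(\frac{\lambda t}{n})^k(1-\frac{\lambda t}{n})^{n-k}\approx \frac{n^k}{k!}\frac{(\lambda t)^k}{n^k}(1-\frac{\lambda t}{n})^n\approx \frac{(\lambda t)^k}{k!}e^{-\lambda t},

which is exactly the \text{Po}(\lambda t) mass function.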

Note that we implicitly assumed that, in the infinitesimal case at least, behaviour in disjoint intervals was independent. We would hope that this would lift immediately to the large intervals, but it is not immediately obvious how to make this work. This property of independent increments is one of the key defining properties of a Levy process, of which the Poisson process is one of the two canonical examples (the other being Brownian motion).

As before, if we can show that the implication goes both ways (and in this case it is not hard – letting t\rightarrow 0 clearly recovers the infinitesimal construction), we can prove results about Poisson random variables with ease, for example that the sum of independent \text{Po}(\lambda) and \text{Po}(\mu) random variables is \text{Po}(\lambda+\mu).
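
As a sanity check, the additivity claim can also be verified directly by convolution. Writing X\sim\text{Po}(\lambda) and Y\sim\text{Po}(\mu) for independent random variables (notation introduced here), one possible calculation is:

\mathbb{P}(X+Y=k)=\sum_{j=0}^k \frac{\lambda^j e^{-\lambda}}{j!}\cdot\frac{\mu^{k-j}e^{-\mu}}{(k-j)!}=\frac{e^{-(\lambda+\mu)}}{k!}\sum_{j=0}^k\binom{k}{j}\lambda^j\mu^{k-j}=\frac{(\lambda+\mu)^k e^{-(\lambda+\mu)}}{k!}.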

This pretty much concludes the construction of the Poisson process. We have three characterisations:
1) The holding times between arrivals are iid \text{Exp}(\lambda) random variables.
2) The infinitesimal construction as before, with independence.
3) The number of arrivals in a time interval of width t is distributed as \text{Po}(\lambda t). (This is sometimes called the stationary increments property.) Furthermore, we have independent increments.

A formal derivation of the equivalence of these forms is important but technical, and so not really worth attempting here. See James Norris’s book, for example, for a fuller exposition.

The final remark is that the Poisson process has the Markov property. Recall that this says that, conditional on the present, future behaviour is independent of the past. Without getting into too much detail, we might like to prove this using the independent increments property. But remember that for a continuous-time process, it is too much information to keep track of all the distributions at once. It is sufficient to track only the finite-dimensional marginal distributions, provided the process is cadlag, which the Poisson process is, assuming we deal with the discontinuities in the right way. Alternatively, the exponential random variable is memoryless, a property that can be lifted, albeit with some technical difficulties, to give the Markov property.
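
For reference, the memoryless property mentioned above is the one-line calculation (with S\sim\text{Exp}(\lambda) denoting a generic holding time; the notation is mine):

\mathbb{P}(S>s+t\,|\,S>s)=\frac{\mathbb{P}(S>s+t)}{\mathbb{P}(S>s)}=\frac{e^{-\lambda(s+t)}}{e^{-\lambda s}}=e^{-\lambda t}=\mathbb{P}(S>t).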

The Poisson Process – Distributions of Holding Times

So, I was planning to conclude my lecture course on Markov Chains at the Cambridge-Linyi summer school with an hour devoted to the Poisson Process. Unfortunately, working through the discrete time material and examples took longer than I’d expected, so we never got round to it. As I’d put a fair bit of work into the preparation, I figured I might as well write it up here instead.

We need a plausible mathematical model for a situation where people or events arrive randomly in continuous time, in a manner that is as close as possible to the notion of iid random variables. In discrete time, a natural candidate would be a set of iid Bernoulli random variables. For example, with probability p a single bus arrives in a given time interval of a minute; with probability 1-p, no bus arrives. We might have some logistical motivation for why it is not possible that two or more arrive in a given interval, or we could instead choose a more complicated distribution.

One way to proceed would be to specify the distribution of the times between arrivals. These should be independent and identically distributed, at least intuitively. However, although we might be able to give a sensible guess right now, it is not immediately clear what this distribution should be. For now, we merely remark that the arrival times are called X_1,X_2,\ldots, and the holding times between arrivals are defined by
S_1=X_1, S_n=X_n-X_{n-1},n\geq 2.

In fact the motivating discrete example gives us much of the machinery we will actually need. Recall that when we define probability distributions for continuously-valued random variables we need a different plan of attack than for discrete RVs. Whereas for the discrete case, it is enough to specify the probability of each outcome, for a continuous random variable, we have to specify the probabilities of intervals, and take care that they have the obvious additive and nesting properties that we want. Taking the integral (whether Riemannian or Lebesgue) of a so-called density function is a natural way to do this.

Similarly here, we build up from small time intervals. The first remark is this: it is natural that the expected number of arrivals in the first minute is equal to the expected number of arrivals in the second minute. After all, we are considering the most general process possible. If there are an infinite number of potential arriving agents, then behaviour in the first minute should be independent and equal (in distribution) to behaviour in the second minute. We can naturally extend this idea to a linear relation. If N_s is the number of arrivals in the time [0,s], then we should have \mathbb{E}N_s=\lambda s, where \lambda is some constant of proportionality, equal to \mathbb{E}N_1.

The key remark is that as s\rightarrow 0, \mathbb{P}(N_s=1) becomes small, and \mathbb{P}(N_s\geq 2) becomes very small. In fact it suffices that \mathbb{P}(N_s \geq 2)=o(s), as this implies:

\mathbb{P}(N_s=0)=1-\lambda s+o(s),\quad \mathbb{P}(N_s=1)=\lambda s+o(s).
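
One way to see this (under the implicit, mild extra assumption that the contribution of the event \{N_s\geq 2\} to the mean is also negligible, i.e. \mathbb{E}[N_s1(N_s\geq 2)]=o(s)):

\lambda s=\mathbb{E}N_s=\mathbb{P}(N_s=1)+\mathbb{E}[N_s1(N_s\geq 2)]=\mathbb{P}(N_s=1)+o(s),

so \mathbb{P}(N_s=1)=\lambda s+o(s), and then \mathbb{P}(N_s=0)=1-\mathbb{P}(N_s=1)-\mathbb{P}(N_s\geq 2)=1-\lambda s+o(s).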

Note that we are not currently attempting a formal construction of the process. As always, finding a probability space with enough freedom to support a continuous-time process is a fiddly task. We are for now just trying to work out what the distribution of the holding times between arrivals should be. There are obvious advantages to defining the process as a collection of iid random variables, for example that we can construct it on a product space.

To work out this distribution, we split the time interval [0,1] into blocks

[0,1]=[0,\frac{1}{n}]\cup[\frac{1}{n},\frac{2}{n}]\cup\ldots\cup[\frac{n-1}{n},1]

The probability that someone arrives in the interval [\frac{k}{n},\frac{k+1}{n}] is roughly \frac{\lambda}{n}, and behaviour in the n small intervals is assumed independent, so

\mathbb{P}(\text{no-one arrives in time }[0,1])=(1-\frac{\lambda}{n})^n\approx e^{-\lambda}.
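
The approximation in the last step is the standard exponential limit; one quick justification (my addition) is

n\log(1-\frac{\lambda}{n})=-\lambda+O(\frac{1}{n})\quad\Rightarrow\quad (1-\frac{\lambda}{n})^n\rightarrow e^{-\lambda}\text{ as }n\rightarrow\infty.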

As we are working in an n\rightarrow\infty regime, we can replace 1 by general time t, to obtain:

\mathbb{P}(\text{no-one arrives in time }[0,t])=(1-\frac{\lambda t}{n})^n\approx e^{-\lambda t}.

So the distribution function of the first arrival time is F(t)=1-e^{-\lambda t} in the conventional notation.
Thus X_1\sim \text{Exp}(\lambda).

However, to emphasise how useful the infinitesimal definition is for actual problems, consider these examples.

1) If we have two arrival processes feeding into the same object, for example people arriving from the left and from the right at the same shop, with rates \lambda,\mu say, then we want to show that the first arrival time is still exponential. Because of the linearity of expectation, it is clear from the definition that the combined arrivals process is Poisson with rate \lambda+\mu, and so the result follows. Showing this by examining the joint distribution of two exponential random variables is also possible, but much less elegant.

2) Similarly, if we have two shops, A and B, and each arriving person chooses one at random, then the first arrival time at A is: with probability 1/2 distributed as \text{Exp}(\lambda), with probability 1/4 as \Gamma(2,\lambda), and so on. A fairly non-trivial calculation is required to show that this is the same as \text{Exp}(\frac{1}{2}\lambda), whereas it follows almost instantly using the infinitesimal definition.
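
For the curious, the non-trivial calculation might run as follows. Writing T for the first arrival time at A (notation mine), its density is the geometric mixture of Gamma densities described above:

f_T(t)=\sum_{n\geq 1}\frac{1}{2^n}\cdot\frac{\lambda^n t^{n-1}e^{-\lambda t}}{(n-1)!}=\frac{\lambda}{2}e^{-\lambda t}\sum_{n\geq 1}\frac{(\lambda t/2)^{n-1}}{(n-1)!}=\frac{\lambda}{2}e^{-\lambda t}e^{\lambda t/2}=\frac{\lambda}{2}e^{-\lambda t/2},

which is indeed the \text{Exp}(\frac{1}{2}\lambda) density.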

Moral: with the infinitesimal definition, the difficulties are all in the construction of the probability space. However, once we have settled those problems, everything else is pleasant, since the key property is linear. Whereas for a construction via iid holding times, the existence and well-definedness of the process are clear, but even for a distribution as tractable as the exponential, manipulations can be tricky.

Levy’s Convergence Theorem

We consider some of the theory of Weak Convergence from the Part III course Advanced Probability. It has previously been seen, or at least discussed, that characteristic functions uniquely determine the laws of random variables. We will show Levy’s theorem, which equates weak convergence of random variables with pointwise convergence of their characteristic functions.

We have to start with the most important theorem about weak convergence, which is essentially a version of Bolzano-Weierstrass for measures on a metric space M. We say that a sequence of measures (\mu_n) is tight if for any \epsilon>0, there exists a compact set K_\epsilon such that \sup_n\mu_n(M\backslash K_\epsilon)\leq \epsilon. Informally, each measure is concentrated on a compact set, and this concentration is uniform across the sequence. We can now state and prove a result of Prohorov:
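
A quick example of how tightness can fail (this example is mine, not from the original): on M=\mathbb{R}, take \mu_n=\delta_n. For any compact set K we have

\mu_n(\mathbb{R}\backslash K)=1\quad\text{whenever }n\notin K,

so no single compact set works uniformly in n, and indeed no subsequence of (\delta_n) converges weakly to a probability measure: the mass escapes to infinity.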

Theorem (Prohorov): Let (\mu_n) be a tight sequence of probability measures. Then there exists a subsequence (n_k) and a probability measure \mu such that \mu_{n_k}\Rightarrow \mu.

Summary of proof in the case M=\mathbb{R}: by countability of \mathbb{Q}, we can use Bolzano-Weierstrass and a standard diagonal argument to find a subsequence such that the distribution functions satisfy

F_{n_k}(x)\rightarrow F(x)\quad\forall x\in\mathbb{Q}

Then extend F to the whole real line by taking a limit from above through the rationals, which ensures that F is cadlag. Convergence of the distribution functions then holds at all points of continuity of F, by monotonicity and approximating by rationals from above. It only remains to check that F(-\infty)=0,F(\infty)=1, which follows from tightness. Specifically, monotonicity guarantees that F has only countably many points of discontinuity, so we can choose some large N such that both N and -N are points of continuity, and exploit the fact that, by tightness,

\inf_n \mu_n([-N,N])\geq 1-\epsilon.

We can define the limit (Borel) measure from the distribution function by taking the obvious definition F(b)-F(a) on intervals (a,b], then lifting to the Borel sigma-algebra by Caratheodory’s extension theorem.

Theorem (Levy): Let X_n,X be random variables in \mathbb{R}^d. Then:

L(X_n)\rightarrow L(X)\quad\iff\quad \phi_{X_n}(z)\rightarrow \phi_X(z)\quad \forall z\in\mathbb{R}^d

The direction \Rightarrow is easy: x\mapsto e^{i\langle z,x\rangle} is continuous and bounded.
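
Spelling this out (a one-line check): weak convergence means \mathbb{E}f(X_n)\rightarrow\mathbb{E}f(X) for every bounded continuous f, and applying this to the real and imaginary parts of x\mapsto e^{i\langle z,x\rangle} gives

\phi_{X_n}(z)=\mathbb{E}[e^{i\langle z,X_n\rangle}]\rightarrow \mathbb{E}[e^{i\langle z,X\rangle}]=\phi_X(z).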

In the other direction, we can in fact show a stronger constructive result. Precisely, if there exists \psi:\mathbb{R}^d\rightarrow \mathbb{C}, continuous at 0 with \psi(0)=1 (*), such that \phi_{X_n}(z)\rightarrow \psi(z) for all z\in\mathbb{R}^d, then \psi=\phi_X is the characteristic function of some random variable X, and L(X_n)\rightarrow L(X). Note that the conditions (*) are the minimal ones under which \psi could be a characteristic function.

We now proceed with the proof. We apply a lemma that is basically a calculation that we don’t repeat here.

\mathbb{P}(||X_n||_\infty>K)\stackrel{\text{Lemma}}{\leq}C_dK^d\int_{[-\frac{1}{K},\frac{1}{K}]^d}(1-\Re \phi_{X_n}(u))du\stackrel{\text{DOM}}{\rightarrow}C_dK^d\int_{[-\frac{1}{K},\frac{1}{K}]^d} (1-\Re \psi(u))du

where we use that the integrand is dominated by 2. From the conditions on \psi, the right-hand side is <\epsilon for large enough K. Hence, for all sufficiently large n, the left-hand side is at most 2\epsilon say, and the finitely many remaining variables are tight individually, so the sequence (X_n) is tight. Prohorov then gives a convergent subsequence, and so a limit random variable exists.

Suppose the whole sequence does not converge weakly to X. Then there is a bounded continuous f and a subsequence along which \mathbb{E}f(X_{n_k}) stays away from \mathbb{E}f(X). By Prohorov applied to this subsequence, there is a further subsequence which converges weakly to some Y, so by the direction of Levy already proved, \phi_Y=\psi=\phi_X. But characteristic functions determine the law, so L(Y)=L(X), contradicting the choice of f.

An Exchangeable Law of Large Numbers

In the proof of De Finetti’s Theorem in my last post, I got to a section where I needed to show a particular convergence property of a sequence of exchangeable random variables. For independent identically distributed RVs, we have Kolmogorov’s 0-1 law, and in particular a strong law of large numbers. Does a version of this result hold for exchangeable sequences? As these represent only a mild generalisation of iid sequences, we might hope so. The following argument demonstrates that this is true, as well as providing a natural general proof of De Finetti.

Define \mathcal{E}_n=\sigma(\{f(X_1,X_2,\ldots): f\text{ Borel, symmetric in its first }n\text{ arguments}\}), the sigma-field of events invariant under permutations of the first n RVs. Note that \mathcal{E}_1\supset\mathcal{E}_2\supset\ldots\supset \mathcal{E}=\cap_n\mathcal{E}_n, the exchangeable sigma-field.

So now take g(X) symmetric in the first n variables, where X denotes the whole sequence. By exchangeability, E[\frac{1}{n}\sum_1^n f(X_j)g(X)]=E[f(X_1)g(X)]. Now set g=1_A for A\in\mathcal{E}_n; since Z_n:=\frac{1}{n}\sum_1^n f(X_j) is \mathcal{E}_n-measurable, this says precisely that Z_n=E[f(X_1)|\mathcal{E}_n]. So (Z_n) is a backwards martingale.
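
To unpack the exchangeability step (a small check): for each j\leq n, transposing coordinates 1 and j leaves the law of the sequence unchanged, and g is unaffected by this transposition, so

E[f(X_j)g(X)]=E[f(X_1)g(X)]\quad\text{for each }j\leq n,

and averaging over j=1,\ldots,n gives the identity above.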

We have a convergence theorem for backwards martingales, which tells us that \lim_n \frac{1}{n}\sum_1^n f(X_j) exists, and in fact equals E[f(X_1)|\mathcal{E}] almost surely. Setting f(X)=1(X\leq x) gives \lim_n\frac{\#\{i\leq n: X_i\leq x\}}{n}=F(x):=P(X_1\leq x|\mathcal{E}). We now perform a similar procedure for functions defined on the first k RVs, in an attempt to demonstrate independence.

For f:\mathbb{R}^k\rightarrow\mathbb{R}, we again seek a backwards martingale, so we average over the n^{(k)}=n(n-1)\ldots(n-k+1) ordered ways to choose k distinct indices from the first n. That is,

\frac{1}{n(n-1)\ldots(n-k+1)}\sum_{i_1,\ldots,i_k\leq n\text{ distinct}} f(X_{i_1},\ldots,X_{i_k})

is a backwards martingale, and hence E[f(X_1,\ldots,X_k)|\mathcal{E}]=\lim_n \frac{1}{n^{(k)}}\sum f(X_{i_1},\ldots,X_{i_k}). As before, set f(y_1,\ldots,y_k)=1(y_1\leq x_1)\ldots 1(y_k\leq x_k). Crucially, we can replace the falling factorial by n^k, since the tuples with repeated indices number only O(n^{k-1}) and so contribute nothing in the limit; exchanging summation (everything is non-negative) then gives

E[f(X_1,\ldots,X_k)|\mathcal{E}]=\lim_n(\frac{1}{n}\sum_{i\leq n} 1(X_i\leq x_1))\ldots(\frac{1}{n}\sum_{i\leq n} 1(X_i\leq x_k))=F(x_1)\ldots F(x_k),

thus demonstrating independence of (X_n) conditional on \mathcal{E}.

So what have we done? Well, we’ve certainly proven de Finetti in the most general case, and we have in addition demonstrated the existence of a Strong Law of Large Numbers for exchangeable sequences, where the limit variable is \mathcal{E}-measurable.

Convergence of Random Variables

The relationship between the different modes of convergence of random variables is one of the more important topics in any introduction to probability theory. For some reason, many of the textbooks leave the proofs as exercises, so it seems worthwhile to present a sketched but comprehensive summary.

Almost sure convergence: X_n\rightarrow X\;\mathbb{P}-a.s. if \mathbb{P}(X_n\rightarrow X)=1.

Convergence in Probability: X_n\rightarrow X in \mathbb{P}-probability if \mathbb{P}(|X_n-X|>\epsilon)\rightarrow 0 for any \epsilon>0.

Convergence in Distribution: X_n\stackrel{d}{\rightarrow} X if \mathbb{E}f(X_n)\rightarrow \mathbb{E}f(X) for any bounded, continuous function f. Note that this definition is valid for RVs defined on any metric space. When they are real-valued, this is equivalent to the condition that F_{X_n}(x)\rightarrow F_X(x) for every point x\in \mathbb{R} at which F_X is continuous. It is further equivalent (by Levy’s Convergence Theorem) to its own special case, convergence of characteristic functions: \phi_{X_n}(u)\rightarrow \phi_X(u) for all u\in\mathbb{R}.

Note: In contrast to the other conditions for convergence, convergence in distribution (also known as weak convergence) doesn’t require the RVs to be defined on the same probability space. This thought can be useful when constructing counterexamples.
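
For instance (an example of mine, not from the original): take X\sim N(0,1) and set X_n=-X for every n. Then each X_n has the same law as X, so trivially X_n\stackrel{d}{\rightarrow}X, but

\mathbb{P}(|X_n-X|>\epsilon)=\mathbb{P}(2|X|>\epsilon)\not\rightarrow 0,

so the sequence does not converge to X in probability.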

L^p-convergence: X_n\rightarrow X in L^p if ||X_n-X||_p\rightarrow 0; that is, \mathbb{E}|X_n-X|^p\rightarrow 0.

Uniform Integrability: Informally, a collection of RVs is UI if their integrals over small sets tend to zero uniformly. Formally: (X_n) is UI if \sup\{\mathbb{E}[|X_n|1(A)]: n\in\mathbb{N}, A\in\mathcal{F}, \mathbb{P}(A)\leq \delta\}\rightarrow 0 as \delta\rightarrow 0.

Note: In particular, a single integrable RV is UI, as is any collection of identically distributed integrable RVs, or any collection dominated by a fixed integrable RV. If X\sim U[0,1] and X_n=n1(X\leq \frac{1}{n}), then the collection (X_n) is not UI.
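
To check the final claim directly (a short calculation): taking A_n=\{X\leq\frac{1}{n}\}, we have \mathbb{P}(A_n)=\frac{1}{n}\rightarrow 0, but

\mathbb{E}[|X_n|1(A_n)]=n\,\mathbb{P}(X\leq \frac{1}{n})=1\quad\text{for every }n,

so the supremum in the definition cannot tend to zero as \delta\rightarrow 0.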
