# Birthday Coincidences and Poisson Approximations

This morning, Facebook was extremely keen to remind me via every available medium that four of my friends celebrate their birthday today. My first thought was that I hope they all enjoy their day, and my second thought was to ask what the chance of this was. I have about 200 Facebook friends, and so this struck me as an unlikely occurrence. But this problem has form, and it felt worthwhile to try some calculations to see if my intuition was well-founded.

Siméon Denis Poisson celebrated his 234th birthday on 21st June this year.

The classical birthday problem

The starting point is the question: how many friends do you have to have before you expect to start seeing anyone sharing a birthday? There are a ridiculous number of articles about this on the web already, so I will say little, except that I don’t want to call this the ‘birthday paradox’, because it’s not a paradox at all. At best it might be counter-intuitive, but then the moral should be to change our intuition for this type of problem.

Throughout, let’s discount February 29th, as this doesn’t add much. So then, to guarantee having a shared pair of birthdays, you need to have 366 friends. But if you have a mere 23 friends, then the probability of having some pair that share a birthday is slightly greater than a half. The disparity between these two numbers leads to the counter-intuition. Some people might find it helpful to think that instead of counting friends, we should instead be counting pairs of friends, but I don’t personally find this especially helpful.

For me, thinking about the calculation in very slightly more generality is helpful. Here, and throughout, let’s instead take N to be the number of days in a year, and K the number of friends, or kids in the class if you prefer. Then, as usual, it is easier to calculate the probability that no two share a birthday (that is, that all the birthdays are distinct) than the probability that some two share a birthday. We could think of the number of ways to pick the set of birthdays, or we could look at the kids one-at-a-time, and demand that their birthday is not one of those we’ve already seen. Naturally, we get the same answer, that is

$\frac{^N P_K}{N^K} = 1\cdot \frac{N-1}{N}\cdot\ldots \frac{N-K+1}{N}.$

We’ve assumed here that all birthdates are equally likely. We’ll come back to this assumption right at the end. For now, let’s assume that both N and K are large, and we’ll try to decide roughly how large K has to be in relation to N for this answer to be away from 0 and 1. If we pair opposite terms up, we might approximate this by

$(\frac{N-\frac{K}{2}}{N})^K = (1-\frac{K}{2N})^K\approx e^{-K^2/2N}.$

In fact, AM-GM says that this is an overestimate, and a bit more care can be used to show that this is a good-approximation to first order. So we see that if $K=\Theta(\sqrt{N})$ for large N, we get a non-trivial limit.

Challenges for four-way shared birthdays

So the original problem I posed is harder, because there isn’t (unless I’m missing something) a natural way to choose birthdays one-at-a-time, or describe the set of suitable birthday sets. There are two major obstacles in a calculation such as this. Firstly, the overlap of people, that is we might have five or more birthdays overlapping; secondly, the overlap of days, that is we might have several days with four (or more) birthdays. We’ll end up worrying more about the second situation.

We start by eliminating both problems, by asking for the probability that exactly four friends are born on January 1st. The general form of this probability is $\frac{\binom{K}{4} }{N^4} \cdot (\frac{N-1}{N})^{K-4}$. Now, if $K\ll N$, this final term should not be significant. Removing this is not exactly the same as specifying the probability that at least four birthdays are on January 1st. But in fact this removal turns a lower bound (because {exactly four}<{at least four}) into an upper (in fact a union) bound. So if the factor being removed is very close to one, we can use whichever expression is more convenient.

In the real life case of N=365, K=200, this term is not negligible. But accounting for this, we get that the probability of exactly four birthdays on 1st January is ~0.0021. Our upper bound on the probability of at least four is ~0.0036.

But now that we know the probability for a given day, can we calculate $(1-0.0021)^{365}$ to estimate the probability that we never have four-overlap? When we did our previous iterative calculation, we were using independence of the different kids’ birthdays. But the event that we have four-overlap on January 1st is not quite independent of the event that we have four-overlap on January 2nd. Why? Well if we know at least four people were born on January 1st, there are fewer people left (potentially) to be born on January 2nd. But maybe this dependence is mild enough that we can ignore it?

We can, however, use some moment estimates. The expected number of days with four-overlap is $365\cdot 0.0021 \approx 0.77$. So the probability that there is at least one day with four-overlap is at most ~0.77.

But we really want a lower bound. So, maybe we can borrow exactly the second-moment argument we tried (there for isolated vertices in the random graph) in the previous post? Here, the probability that both January 1st and January 2nd are four-overlapping is

$\frac{\binom{K}{4}\binom{K-4}{4}}{N^8}\cdot (\frac{N-2}{N})^{K-8}\approx 4.3\times 10^{-6}.$

From this, we can evaluate the expectation of the square of the number of days with four-overlap, and thus find that the variance is ~0.74. So we use Chebyshev, calling this number of days #D for now:

$\mathbb{P}(\# D=0)\le \mathbb{P}(|\#D - \mathbb{E}\# D|^2 \ge (\mathbb{E}\# D)^2 ) \le \frac{\mathrm{Var} \# D}{(\mathbb{E} \#D)^2}.$

In our case, this unfortunately gives us an upper bound greater than 1 on this probability, and thus a lower bound of zero on the probability that there is at least one day with four-overlap. Which isn’t especially interesting…

Fairly recently, I spoke about the Lovasz Local Lemma, which can be used to find lower bounds on the probabilities of intersections of events, many of which are independent (in a particular precise sense). Perhaps this might be useful here? The natural choice of ‘bad event’ is that particular 4-sets of people share a birthday. There are $\binom{K}{4}$ such events, and each is independent of the collection of $\binom{K-4}{4}$ disjoint events. Thus we can consider using LLL if $e\cdot (\binom{K}{4}-\binom{K-4}{4})\cdot 0.0021 \le 1$. Unfortunately, this difference of binomial coefficients is large in our example, and so in fact the LHS has order $10^3$.

Random number of friends – coupling to a Poisson Process

All of these methods failed because without independence we had to use estimates which were really not tight at all. But we can re-introduce independence if we remove the constraints on the model. Suppose instead of demanding I have K friends, I instead demand that I have a random number of friends, with distribution Poisson(K). Now it is reasonable to assume that for each day, I have a Poisson(K/365) friends with that birthday, independently for each day.

If we end up having exactly K friends with this random choice, then the distribution of the number of 4-overlap days is exactly the same as in the original setup. However, crucially, if we end up having at most K friends with this random choice, the distribution of the number of 4-overlap days is stochastically dominated by the original distribution. So instead let’s assume we have Poisson(L) friends, where L<K, and see how well we can do. For definiteness, we’ll go back to N=365, K=200 now. Let’s say X is the distribution of birthdays in the original model, and $\Xi$ for the distribution of birthdays in the model with a random number of friends

Then

$\mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) = 1- \mathbb{P}(\mathrm{Po}(L/365)\le 3)^365.$ (*)

Now we can write the crucial domination relation as

$\mathbb{P}(\exists \ge 4\text{-overlap in }X)\ge \mathbb{P}( \exists \ge 4\text{-overlap in }\Xi \,|\, |\Xi|\le 200),$

and then use an inequality version of the law of total probability to bound further as

$\ge \frac{ \mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) - \mathbb{P}(|\Xi|>200)}{\mathbb{P}(|\Xi|\le 200)}.$

This is a function of L, and in principle we could find its maximum, perhaps as $N\rightarrow\infty$. Here, though, let’s just take L=365/2 and see what happens. For (*) we get ~0.472.

To estimate $\mathbb{P}(\mathrm{Po}(365/2)>200)$, observe that this event corresponds to 1.4 standard deviations above the mean, so we can approximate using quantiles of the normal distribution, via the CLT. (Obviously this isn’t completely precise, but it could be made precise if we really wanted.) I looked up a table, and this probability is, conveniently for calculations, roughly 0.1. Thus we obtain a lower bound of $\frac{0.472-0.1}{0.9}$. Allowing for the fairly weak estimates at various points, we still get a lower bound of around 0.4. Which is good, because it shows that my intuition wasn’t right, but that I was in the right ball-park for it being a ‘middle-probability event’.

Remarks and References

– The reason for doing the upper bound for the probability of exact 4-overlap is that the same argument for at-least-4-overlap would have given an upper bound of 1. However, this Poisson Process coupling is also a much better method for obtaining an upper bound on either event.

– Birthdays are not uniformly distributed through the year. The deviation is strong enough that even from the set of birth frequencies (rather than the sequence of birth frequencies), we can reject a null hypothesis of uniformity. Early September is pretty close to the maximum. Two comments: 1) this is the time of year where small variations in birth date have a big effect on education, especially in primary school; 2) we are 37 weeks into the year…

– It is known that 187 friends is the first time the probability of having at-least-4-overlap is greater than ½. You can find the full sequence on OEIS as A014088. I used to have about 650 Facebook friends, before I decided that I’d prefer instead the pleasant surprise of finding out what old acquaintances were up to when I next spoke to them. In this case, the median of the distribution of the largest number sharing a birthday would be seven.

– Eric Weisstein’s article on Mathworld is, in my opinion, the best resource for a mathematician on the first two pages of Google hits by at least two orders of magnitude. In the notation of this article, we were calculating $P_4(n=200,d=365)$. There are also some good general asymptotics, or at least recipes for asymptotics, in equations (17) and (18).

– The paper Methods for Studying Coincidences by Diaconis and Mosteller is, as one might expect, extremely readable, and summarises many results and applications, including several generalisations.

# Avoiding Mistakes in Probability Exams

Over the past week, I’ve given several tutorials to second year undergraduates preparing for upcoming papers on probability and statistics. In particular, I’ve now seen a lot of solutions to a lot of past papers and specimen questions, and it’s worthwhile to consider some of the typical mistakes students can make on these questions. Of course, as with any maths exam, there’s always the possibility of a particularly subtle or involved question coming up, but if the following three common areas of difficulty can be avoided, you’re on track for doing well.

Jacobians

In a previous course, a student will learn how to calculate the pdf of a function of a random variable. Here, we move onto the more interesting and useful case of finding the (joint) density of function(s) of two or more random variables. The key thing to remember here is that manipulating pdfs is not a strange arbitrary exercise – it is just integration. It is rarely of interest to consider the value of a pdf at a single point. We can draw meaningful conclusions from a pdf or from comparison of two pdfs by integrating them.

Then the question of substituting for new random variables is precisely integration by substitution, which we are totally happy with in the one-dimensional case, and should be fairly happy with in the two-dimensional case. To get from one joint density to another, we multiply by the absolute value of the Jacobian. To ensure you get it right, it makes sense to write out the informal infinitesimal relation

$f_{U,V}(u,v) du dv = f_{X,Y}(x,y)dx dy.$

This is certainly relevant if we put integral signs in front of both sides, and explains why you obtain $f_{U,V} = \frac{d(x,y)}{d(u,v)} f_{X,Y}$ rather than the other way round. Note though that if $\frac{d(u,v)}{d(x,y)}$ is easier to calculate for some reason, then you can evaluate this and take the inverse, as your functions will almost certainly be locally bijective almost everywhere.

It is important to take the modulus of the Jacobian, since densities cannot be negative! If this looks like a fudge, then consider the situation in one dimension. If we substitute for $x\mapsto f(x)=1-x$, then f’ is obviously negative, BUT we also end up reversing the order of the bounds of the integral, eg [1/3, ¾] will become [2/3,1/4]. So we have a negative integrand (after multiplying by f'(x)) but bounds in the ‘wrong’ order. These two factors of -1 will obviously cancel, so it suffices just to multiply by |f'(x)| at that stage. It is harder to express in words, but a similar relation works for the Jacobian substitution.

You also need to check where the new joint density is non-zero. Suppose X, Y are supported on [0,1], then when we write $f_{X,Y}(x,y)$ we should indicate that it is 0 off this region, either by splitting into cases, or adding the indicator function $1_{\{x,y\in[0,1]\}}$ as a factor. This is even more important after substitutions, as the range of the resulting random variables might be less obvious than the originals. Eg with X,Y as above, and $U=X^2, V=X/Y$, the resulting pdf will be non-zero only when $u\in[0,1], v\ge \sqrt{u}$. Failing to account for this will often lead to ludicrous answers. A general rule is that you can always check that any distribution you’ve produced does actually integrate to one.

Convergence using MGFs

There are two main reasons to use MGFs and PGFs. The first is that they behave nicely when applied to (possibly random) sums of independent random variables. The independence property is crucial to allow splitting of the MGF of the sum into the product of MGFs of the summands. Of course, implicit in this argument is that MGFs determine distributions.

A key theorem of the course is that this works even in the limit, so you can use MGFs to show convergence in distribution of a family of distributions. For this, you need to show that the MGFs converge pointwise on some interval [-a,a] around 0. (Note that the moments of the distribution are given by the family of derivatives at 0, as motivation for why this condition might be necessary.) Normally for such questions, you will have been asked to define the MGF earlier in the question, and probably will have found the MGF of a particular distribution or family of distributions, which might well end up appearing as the final answer.

Sometimes such an argument might involve substituting in something unusual, like t/N, rather than t, into a known MGF. Normally a Taylor series can be used to show the final convergence result. If you have a fraction, try to cancel terms so that you only have to evaluate one Taylor series, rather than lots.

Using the Markov Property

The Markov property is initially confusing, but once we become comfortable with the statement, it is increasingly irritating to have to answer the question: “show that this process has the Markov property.” This question is irritating because in most cases we want to answer: “because it obviously does!” Which is compelling, but unlikely to be considered satisfactory in a mathematics exam. Normally we observe that the random dynamics of the next step are a function only of the present location. Looking for the word ‘independent’ in the statement of the process under discussion is a good place to start for any argument along these lines.

The most developed example of a Markov process in this course is the Poisson process. I’ve written far too much about this before, so I won’t do so again, except to say this. When we think of the Poisson process, we generally have two thoughts going through our minds, namely the equivalent definitions of IID exponential inter-arrival times, and stationary, Poisson increments (or the infinitesimal version). If we draw a sketch of a sample trajectory of this process, we can label everything up and it is clear how it all fits together. But if you are asked to give a definition of the Poisson process $(N_t)$, it is inappropriate to talk about inter-arrival times unless you define them in terms of $N_t$, since that is the process you are actually trying to define! It is fine to write out

$T_k:=\min\{t: N_t=k\},\quad N_t=\max\{k: Y_1+Y_2+\ldots+Y_k\le t\}$

but the relation between the two characterisations of the process is not obvious. That is why it is a theorem of the course.

We have to be particularly careful of the difference in definition when we are calculating probabilities of various events. A classic example is this. Find the distribution of $N_2$, conditional on $T_3=1$. It’s very tempting to come up with some massive waffle to argue that the answer is 3+Po(1). The most streamlined observation is that the problem is easy if we are conditioning instead on $N_1=3$. We just use the independent Poisson increments definition of $(N_t)$, with no reference to inter-arrival times required. But then the Markov property applied at time 1 says that the distribution of $(N_2)$ depends only on the value of $N_1$, not on the process on the interval [0,1). In a sense, the condition that $T_3=1$ is giving us extra information on the behaviour of the process up to time 1, and the Markov property, which we know holds for the Poisson process, asserts precisely that the extra information doesn’t matter.

# Poisson Tails

I’ve had plenty of ideas for potential probability posts recently, but have been a bit too busy to write any of them up. I guess that’s a good thing in some sense. Anyway, this is a quick remark based on an argument I was thinking about yesterday. It combines Large Deviation theory, which I have spent a lot of time learning about this year, and the Poisson process, which I have spent a bit of time teaching.

Question

Does the Poisson distribution have an exponential tail? I ended up asking this question for two completely independent reasons yesterday. Firstly, I’ve been reading up about some more complex models of random networks. Specifically, the Erdos-Renyi random graph is interesting mathematical structure in its own right, but the independent edge condition results in certain regularity properties which are not seen in many real-world networks. In particular, the degree sequence of real-world networks typically follows an approximate power law. That is, the tail is heavy. This corresponds to our intuition that most networks contain ‘hubs’ which are connected to a large region of the network. Think about key servers or websites like Wikipedia and Google which are linked to by millions of other pages, or the social butterfly who will introduce friends from completely different circles. In any case, this property is not observed in an Erdos-Renyi graph, where the degrees are binomial, and in the sparse situation, rescale in the limit to a Poisson distribution. So, to finalise this observation, we want to be able to prove formally that the Poisson distribution has an exponential (so faster than power-law) tail.

The second occurrence of this question concerns large deviations for the exploration process of a random graph. This is a topic I’ve mentioned elsewhere (here for the exploration process, here for LDs) so I won’t recap extensively now. Anyway, the results we are interested in give estimates for the rate of decay in probability for the event that the path defined by the exploration process differs substantially from the expected path as n grows. A major annoyance in this analysis is the possibility of jumps. A jump occurs if a set of o(n) adjacent underlying random variables (here, the increments in the exploration process) have O(n) sum. A starting point might be to consider whether O(1) adjacent RVs can have O(n) sum, or indeed whether a single Poisson random variable can have sum of order n. In practice, this asks whether the probability $\mathbb{P}(X>\alpha n)$ decays faster than exponentially in n. If it does, then this is dominated on a large deviations scale. If it decays exactly exponentially in n, then we have to consider such jumps in the analysis.

Approach

We can give a precise statement of the probabilities that a Po($\lambda$) random variable X returns a given integer value:

$\mathbb{P}(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}.$

Note that these are the terms in the Taylor expansion of $e^{\lambda}$ appropriately normalised. So, while it looks like it should be possible to evaluate

$\mathbb{P}(X>\alpha n)=e^{-\lambda}\sum_{\alpha n}^\infty \frac{\lambda^k}{k!},$

this seems impossible to do directly, and it isn’t even especially obvious what a sensible bounding strategy might be.

The problem of estimating the form of the limit in probability of increasing unlikely deviations from expected behaviour surely reminds us of Cramer’s theorem. But this and other LD theory is generally formulated in terms of n random variables displaying some collective deviation, rather than a single random variable, with the size of the deviation growing. But we can transform our problem into that form by appealing to the three equivalent definitions of the Poisson process.

Recall that the Poisson process is the canonical description of, say, an arrivals process, where events in disjoint intervals are independent, and the expected number of arrives in a fixed interval is proportional to the width of the interval, giving a well-defined notion of ‘rate’ as we would want. The two main ways to define the process are: 1) the times between arrivals are given by i.i.d. Exponential RVs with parameter $\lambda$ equal to the rate; and 2) the number of arrivals in interval [s,t] is independent of all other times, and has distribution given by Po($\lambda(t-s)$). The fact that this definition gives a well-defined process is not necessarily obvious, but let’s not discuss that further here.

So the key equivalence to be exploited is that the event $X>n$ for $X\sim \text{Po}(\lambda)$ is a statement that there are at least n arrivals by time 1. If we move to the exponential inter-arrival times definition, we can write this as:

$\mathbb{P}(Z_1+\ldots+Z_n<1),$

where the Z’s are the i.i.d. exponential random variables. But this is exactly what we are able to specify through Cramer’s theorem. Recall that the moment generating function of an exponential distribution is not finite everywhere, but that doesn’t matter as we construct our rate function by taking the supremum over some index t of:

$I(x)=\sup_t (xt-\log \mathbb{E}e^{tZ_1})=\sup_t(xt-\log(\frac{\lambda}{\lambda-t})).$

A simple calculation then gives

$I(x)=\lambda x-1 - \log \lambda x.$

$\Rightarrow I(x)\uparrow \infty\text{ as }x\downarrow 0.$

Note that I(1) is the same for both Exp($\lambda$) and Po($\lambda$), because of the PP equality of events:

$\{Z_1+\ldots+Z_n\leq n\}=\{\text{Po}(\lambda n)=\text{Po}(\lambda)_1+\ldots+\text{Po}(\lambda)_n> n\},$

similar to the previous argument. In particular, for all $\epsilon>0$,

$\mathbb{P}(\text{Po}(\lambda)>n)=\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}<\frac{1}{n})<\mathbb{P}(\frac{Z_1+\ldots+Z_n}{n}<\epsilon),\text{ for large }n.$

$\mathbb{P}(\text{Po}(\lambda)>n)=O(e^{-nI(\epsilon)}),\text{ for all }\epsilon.$

Since we can take $I(\epsilon)$ as large as we want, we conclude that the probability decays faster than exponentially in n.

# The Inspection Paradox and related topics

In the final class for Applied Probability, we discussed the so-called Inspection Paradox for an arrivals process. We assume that buses, sat, arrive as a Poisson process with rate 1, and consider the size of the interval (between buses) containing some fixed time T. The ‘paradox’ is that the size of this interval is larger in expectation than the average time between buses, which of course is given by an exponential random variable.

As with many paradoxes, it isn’t really that surprising after all. Perhaps what is more surprising is that the difference between the expected observed interval and the expected actual interval time is quite small here. There are several points of interest:

1) The Markov property of the Poisson process is key. In particular, this says that the expectation (and indeed the distribution) of the waiting time for a given customer arriving at T is not dependent on T, even if T is a random variable (or rather, a class of random variables, the stopping times). So certainly the inspection paradox property will hold whenever the process has the Markov property, because the inspected interval contains the inspected waiting time, which is equal in distribution to any fixed interval.

2) Everything gets slightly complicated by the fact that the Poisson process is defined to begin at 0. In particular, it is not reversible. Under the infinitesimal (or even the independent Poisson increments) definition, we can view the Poisson process not as a random non-decreasing function of time, but rather as a random collection of points on the positive reals. With this setup, it is clearly no problem to define instead a random collection of points on all the reals. [If we consider this instead as a random collection of point masses, then this is one of the simplest examples of a random measure, but that’s not hugely relevant here.]

We don’t need to worry too much about what value the Poisson process takes at any given time if we are only looking at increments, but if it makes you more comfortable, you can still safely assume that it is 0 at time 0. Crucially, the construction IS now reversible. The number of points in the interval [s,t] has distribution parameterised by t-s, so we it doesn’t matter which direction we are moving in down the real line. In this case, A_t, the time since the previous arrival, and E_t, the waiting time until the next arrival, are both Exp(1) RVs, as the memorylessness property applies in each time direction.

For the original Poisson process, we actually have A_t stochastically dominated by an Exp(1) distribution, because it is conditioned to be less than or equal to t. So in this case, the expected interval time is some complicated function of t, lying strictly between 1 and 2. In our process extended to the whole real line, the expected interval time is exactly 2.

This idea of extending onto the whole real line explains why we mainly consider delayed renewal processes rather than just normal renewal processes. The condition that we start a holding time at 0 is often not general enough, particularly when the holding times are not exponential and so the process is not memoryless.

3) There is a general size-biasing principle in action here. Roughly speaking, we are more likely to arrive in a large interval than in a small interval. The scaling required is proportional to the length of the interval. Given a density function f(x) of X, we define the size-biased density function to be xf(x). We need to normalise to give a probability distribution, and dividing by the expectation EX is precisely what is needed. Failure to account for when an observation should have the underlying distribution or the size-biased distribution is a common cause of supposed paradoxes. A common example is ‘on average my friends have more friends than I do’. Obviously, various assumption on me and my friends, and how we are drawn from the set of people (and the distribution of number of friends) is required that might not necessarily be entirely accurate in all situations.

In the Poisson process example above, the holding times have density function $e^{-x}$, so the size-biased density function if $xe^{-x}$. This corresponds to a $\Gamma(2,1)$ distribution, which may be written as the sum of two independent Exp(1) RVs as suggested above.

4) A further paradox mentioned on the sheet is the waiting time paradox. This says that the expected waiting time is longer if you arrive at a random time than if you just miss a bus. This is not too surprising: consider at least the stereotypical complaint about buses in this country arriving in threes, at least roughly. Then if you just miss a bus, there is a 2/3 chance that another will be turning up essentially immediately. On the sheet, we showed that the $\Gamma(\alpha,1)$ distribution has this property also, provided $\alpha<1$.

We can probably do slightly better than this. The memoryless property of the exponential distribution says that:

$\mathbb{P}(Z>t+s|Z>t)=\mathbb{P}(Z>s).$

In general, for the sort of holding times we might see at a bus stop, we might expect it to be the case that if we have waited a long time already, then we are less likely relatively to have to wait a long time more, that is:

$\mathbb{P}(Z>t+s|Z>t)\leq\mathbb{P}(Z>s),$

and furthermore this will be strict if neither s nor t is 0. I see no reason not to make up a definition, and call this property supermemorylessness. However, for the subclass of Gamma distributions described above, we have the opposite property:

$\mathbb{P}(Z>t+s|Z>t)\geq\mathbb{P}(Z>s).$

Accordingly, let’s call this submemorylessness. If this is strict, then it says that we are more likely to have to wait a long time if we have already been waiting a long time. This seems contrary to most real-life distributions, but it certainly explains the paradox. If we arrive at a random time, then the appropriate holding time has been ‘waiting’ for a while, so is more likely to require a longer observed wait than if I had arrived as the previous bus departed.

In conclusion, before you think of something as a paradox, think about whether the random variables being compared are being generated in the same way, in particular whether one is size-biased, and whether there are effects due to non-memorylessness.

# The Levy-Khintchine Formula

Because of a string of coincidences involving my choice of courses for Part III and various lecturers’ choices about course content, I didn’t learn what a Levy process until a few weeks’ ago. Trying to get my head around the Levy-Khintchine formula took a little while, so the following is what I would have liked to have been able to find back then.

A Levy process is an adapted stochastic process started from 0 at time zero, and with stationary, independent increments. This is reminiscent, indeed a generalisation, of the definition of Brownian motion. In that case, we were able to give a concrete description of the distribution of $X_1$. For a general Levy process, we have

$X_1=X_{1/n}+(X_{2/n}-X_{1/n})+\ldots+(X_1-X_{1-1/n}).$

So the distribution of $X_1$ is infinitely divisible, that is, can be expressed as the distribution of the sum n iid random variables for all n. Viewing this definition in terms of convolutions of distributions may be more helpful, especially as we will subsequently consider characteristic functions. If this is the first time you have seen this property, note that it is not a universal property. For example, it is not clear how to write a U[0,1] random variable as a convolution of two iid RVs. Note that exactly the same argument suffices to show that the distribution of $X_t$ is infinitely divisible.

It will be most convenient to work with the characteristic functions

$\mathbb{E}\exp(i\langle \lambda,X_t\rangle).$

By stationarity of increments, we can show that this is equal to

$\exp(-\Psi(\lambda)t)\quad\text{where}\quad \mathbb{E}\exp(i\langle \lambda,X_1\rangle)=:\exp(-\Psi(\lambda)).$

This function $\Psi(\lambda)$ is called the characteristic exponent. The argument resembles that used for Cauchy’s functional equations, by dealing first with the rationals using stationarity of increments, then lifting to the reals by the (right-)continuity of

$t\mapsto \mathbb{E}\exp(i\langle \lambda,X_t\rangle).$

As ever, $\Psi(\lambda)$ uniquely determines the distribution of $X_1$, and so it also uniquely determines the distribution of Levy process. The only condition on $\Psi$ is that it be the characteristic function of an infinitely divisible distribution. This condition is given explicitly by the Levy-Khintchine formula.

Levy-Khintchine

$\Psi(\lambda)$ is the characteristic function of an infinitely divisible distribution iff

$\Psi(\lambda)=i\langle a,\lambda\rangle +\frac12 Q(\lambda)+\int_{\mathbb{R}^d}(1-e^{i\langle \lambda,x\rangle}+i\langle \lambda,x\rangle 1_{|x|<1})\Pi(dx).$

for $a\in\mathbb{R}^d$, Q a quadratic form on $\mathbb{R}^d$, and $\Pi$ a so-called Levy measure satisfying $\int (1\wedge |x|^2)\Pi(dx)<\infty$.

This looks a bit arbitrary, so first let’s explain what each of these terms ‘means’.

• $i\langle a,\lambda\rangle$ comes from a drift of $-a$. Note that a deterministic linear function is a (not especially interesting) Levy process.
• $\frac12Q(\lambda)$ comes from a Brownian part $\sqrt{Q}B_t$.

The rest corresponds to the jump part of the process. Note that a Poisson process is an example of a Levy process, hence why we might consider thinking about jumps in the first place. The reason why there is an indicator function floating around is that we have to think about two regimes separately, namely large and small jumps. Jumps of size bounded below cannot happen too often as otherwise the process might explode off to infinity in finite time with positive probability. On the other hand, infinitesimally small jumps can happen very often (say on a dense set) so long as everything is controlled to prevent an explosion on the macroscopic scale.

There is no canonical choice for where the divide between these regimes happens, but conventionally this is taken to be at $|x|=1$. The restriction on the Levy measure near 0 ensures that the sum of the squares all jumps up some finite time converges absolutely.

• $\Pi\cdot 1_{|x|\geq 1}$ gives the intensity of a standard compound Poisson process. The jumps are well-spaced, and so it is a relatively simple calculation to see that the characteristic function is

$\int_{\mathbb{R}^d}(1-e^{i\langle \lambda,x\rangle})1_{|x|\geq 1}\Pi(dx).$

The intensity $\Pi\cdot 1_{|x|<1}$ gives infinitely many hits in finite time, so if the expectation of this measure is not 0, we explode immediately. We compensate by drifting away from this at rate

$\int_{\mathbb{R}^d}x1_{|x|<1}\Pi(dx).$

To make this more rigorous, we should really consider $1_{\epsilon<|x|<1}$ then take a limit, but this at least explains where all the terms come from. Linearity allows us to interchange integrals and inner products, to get the term

$\int_{\mathbb{R}^d}(1-e^{-i\langle \lambda,x\rangle}+i\langle\lambda,x\rangle 1_{|x|<1})\Pi(dx).$

If the process has bounded variation, then we must have Q=0, and also

$\int (1\wedge |x|)\Pi(dx)<\infty,$

that is, not too many jumps on an |x| scale. In this case, then this drift component is well-defined and linear $\lambda$, so can be incorporated with the drift term at the beginning of the Levy-Khintchine expression. If not, then there are some $\lambda$ for which it does not exist.

There are some other things to be said about Levy processes, including

• Stable Levy processes, where $\Psi(k\lambda)=k^\alpha \Psi(\lambda)$, which induces the rescaling-invariance property: $k^{-1/\alpha}X_{kt}\stackrel{d}{=}X$. The distribution of each $X_t$ is then also a stable distribution.
• Resolvents, where instead of working with the process itself, we work with the distribution of the process at a random exponential time.

# Markovian Excursions

In the previous post, I talked about the excursions of a Brownian motion. Today I’m thinking about how to extend these ideas to more general Markov chains. First we want to rule out some situations. In particular, we aren’t hugely interested in discrete time Markov chains. The machinery is fairly well established for excursions, whether or not the chain is transient. Furthermore, if the state space is discrete, as for a Poisson process for example, the discussion is not hugely interesting either. Remember that the technical challenges in the constructions of local time arise because of the Blumenthal 0-1 law property that Brownian motion visits 0 infinitely often in the small window after the start time. We therefore want the process to be regular at the point of the state space under discussion. This is precisely the condition described above for BM about 0.

Why is it harder in general?

The informal notion of a local time should transfer to a more general Markov chain, but there are some problems. Firstly, to define something in terms of an integral is not general enough. The state space E needs some topological structure, but any meaningful definition must be in terms of functions from E into the reals. There were also all sorts of special properties of Brownian motion, including the canonical time-space rescaling that came in handy in that particular case. It turns out to be easiest to consider the excursion measure on a general Markov chain through its Laplace transform.

Definition and Probabilistic interpretations

The resolvent is the Laplace transform of the transition probability $P_t(x,A)$, viewed as an operator on functions $f:E\rightarrow \mathbb{R}$.

$R_\lambda f(x):=\mathbb{E}_x\left[\int_0^\infty e^{-\lambda t}f(X_t)dt\right]=\int_0^\infty e^{-\lambda t}P_tf(x)dt$.

We can interpret this in terms of the original process in a couple of ways which may be useful later. The motivation is that we would like to specify a Poisson process of excursions, for which we need to know the rate. We hope that the rate will in fact be constant, so it will in fact to suffice to work out the properties of the expected number of excursions (or whatever) up to various random times, in particular those given by exponential RVs.

So, we take $\zeta\sim\text{Exp}(\lambda)$ independent of everything else, and assume that we ‘kill the chain’ at time $\zeta$. Then, by shuffling expectations in and out of the integral and separating independent bits, we get:

$R_\lambda f(x)=\mathbb{E}_x\int_0^\zeta f(X_s)ds = \frac{1}{\lambda}\mathbb{E}_xf(X_\zeta)$.

As in the Brownian local time description

$R_\lambda 1_A(x)=\mathbb{E}(\text{time spent in }A\text{ before death at time }\zeta_\lambda)$.

Markovian property

We want to show that excursions are Markov, once we’ve sorted out what an ‘excursion’ actually means in this context. We do know how to deal with the Markovian property once we are already on an excursion. It is relatively straightforward to define an extension of the standard transition probability operator, to include a condition that the chain should not hit point a during the transition. That is

$_aP_t(x,A):= \mathbb{P}_x(X_t\in A\cap H_a>t)$.

This will suffice to define the behaviour once an excursion has started. The more complicated bit will be the entrance law $n_t(A)$, being the probability of arriving at A after time t of an excursion. To summarise, as with BM, all the technical difficulties with an excursion happen at the beginning, ie bouncing around the start point, but once it is ‘up-and-running’, its path is Markovian and controlled by $_aP_t$.

Marking

The link between the resolvent and the excursions, is provided as in the Brownian case, by supplying a PPP of marks at uniform rate $\lambda$ to real time. This induces a mark process on excursions, weighted by an (exponential) function of excursion length. We make no distinction between an excursion including one mark or many marks. Then the measure on marked excursion is, in a mild abuse of notation:

$n_\lambda=(1-e^{-\lambda\zeta(f)})\cdot n.$

We compare with the Laplace transform $n_\lambda(dx)=\int_0^\infty e^{-\lambda t}dtn_t(dx)$ using a probabilistic argument.

We can apply the measure to a function in the usual way: $\lambda n_\lambda(1_A)$ is the measure of those excursions for which the first mark occurs in $A$. So by taking $A=E$, we get

$\lambda n_\lambda(1)=\text{ Excursion measure }=\int_U n(df)(1-e^{-\lambda\zeta(f)}).$

We have therefore linked the exponential mark process on excursion measure with the Laplace transform of the entrance law. So in particular:

$\frac{\lambda n_\lambda(A)}{\lambda n_\lambda(1)}=\mathbb{P}(\text{first mark when in }A)=\int_0^\infty \lambda e^{-\lambda t}P_t(0,A)dt=\lambda R_\lambda 1_A(0)$.

The resolvent is relatively easy to calculate explicitly, and so we can find the Laplace transform $n_\lambda(A)$. From this it is generally possible to invert the transform to find the entrance law $n_t$.

References

A Guided Tour Through Excursions – L. C. G. Rogers.

This pair of posts is very much a paraphrase of chapters 3 and 4 of this excellent text. The original can be found here (possibly not open access?)

# The Poisson Process – A Third Characteristion

There remains the matter of the distribution of the number of people to arrive in a fixed non-infinitissimal time interval. Consider the time interval [0,1], which we divide into n smaller intervals of equal width. As n grows large enough that we know the probability that two arrivals occur in the same interval tends to zero (as this is $\leq no(\frac{1}{n})$), we can consider this as a sequence of iid Bernoulli random variables as before. So

$\mathbb{P}(N_1=k)=\binom{n}{k}(\frac{\lambda}{n})^k(1-\frac{\lambda}{n})^{n-k}$
$\approx \frac{n^k}{k!} \frac{\lambda^k}{n^k}(1-\frac{\lambda}{n})^n\approx \frac{\lambda^k}{k!}e^{-\lambda}.$

We recognise this as belonging to a Poisson (hence the name of the process!) random variable. We can repeat this for a general time interval and obtain $N_t\sim \text{Po}(\lambda t)$.

Note that we implicitly assumed that, in the infinitissimal case at least, behaviour in disjoint intervals was independent. We would hope that this would lift immediately to the large intervals, but it is not immediately obvious how to make this work. This property of independent increments is one of the key definitions of a Levy Process, of which the Poisson process is one of the two canonical examples (the other is Brownian Motion).

As before, if we can show that the implication goes both ways (and for this case it is not hard – letting $t\rightarrow 0$ clearly gives the infinitissimal construction), we can prove results about Poisson random variables with ease, for example $\text{Po}(\lambda)+\text{Po}(\mu)=\text{Po}(\lambda+\mu)$.

This pretty much concludes the construction of the Poisson process. We have three characterisations:
1) $X_n\sim\text{Exp}(\lambda)$ all iid.
2) The infinitissimal construction as before, with independence.
3) The number of arrivals in a time interval of width t $\sim \text{Po}(\lambda t)$. (This is sometimes called a stationary increments property.) Furthermore, we have independent increments.

A formal derivation of the equivalence of these forms is important but technical, and so not really worth attempting here. See James Norris’s book for example for a fuller exposition.

The final remark is that the Poisson Process has the Markov property. Recall that this says that conditional on the present, future behaviour is independent of the past. Without getting into too much detail, we might like to prove this by using the independent increments property. But remember that for a continuous process, it is too much information to keep track of all the distributions at once. It is sufficient to track only the finite marginal distributions, provided the process is cadlag, which the Poisson process is, assuming we deal with the discontinuities in the right way. Alternatively, the exponential random variable is memoryless, a property that can be lifted, albeit with some technical difficulties, to show the Markov property.

# The Poisson Process – Distributions of Holding Times

So, I was planning to conclude my lecture course on Markov Chains at the Cambridge-Linyi summer school with an hour devoted to the Poisson Process. Unfortunately, working through the discrete time material and examples took longer than I’d expected, so we never got round to it. As I’d put a fair bit of work into the preparation, I figured I might as well write it up here instead.

We need a plausible mathematical model for a situation where people or events arrive randomly in continuous time, in a manner that is as close as possible to the notion of iid random variables. In discrete time, a natural candidate would be a set of iid Bernoulli random variables. For example, with probability p a single bus will arrive in an time interval of a minute. With probability 1-p, no bus will arrive. We might have some logistical motivation for why it is not possible that two or more arrive in a given interval, or we could instead choose a more complicated distribution.

One way to proceed would be to specify the distribution of the times between arrivals. These should be independent and identically distributed, at least intuitively. However, although we might be able to give a sensible guess right now, it is not immediately clear what this distribution should be. For now, we merely remark that the arrival times are called $X_1,X_2,\ldots$, and the holding times between arrivals are defined by
$S_1=X_1, S_n=X_n-X_{n-1},n\geq 2$.

In fact the motivating discrete example gives us much of the machinery we will actually need. Recall that when we define probability distributions for continuously-valued random variables we need a different plan of attack than for discrete RVs. Whereas for the discrete case, it is enough to specify the probability of each outcome, for a continuous random variable, we have to specify the probabilities of intervals, and take care that they have the obvious additive and nesting properties that we want. Taking the integral (whether Riemannian or Lebesgue) of a so-called density function is a natural way to do this.

Similarly here, we build up from small time intervals. The first remark is this: it is natural that the expected number of arrivals in the first minute is equal to the expected number of arrivals in the second minute. After all, we are considering the most general process possible. If there are an infinite number of potential arriving agents, then behaviour in the first minute should be independent and equal (in distribution) to behaviour in the second minute. We can naturally extend this idea to a linear relation. If $N_s$ is the number of arrivals in the time [0,s], then we should have $\mathbb{E}N_s=\lambda s$, where $\lambda$ is some constant of proportionality, equal to $\mathbb{E}N_1$.

The key remark is that as $s\rightarrow 0$, $\mathbb{P}(N_s=1)$ becomes small, and $\mathbb{P}(N_s\geq 2)$ becomes very small. In fact it suffices that $\mathbb{P}(N_s \geq 2)=o(s)$, as this implies:

$\mathbb{P}(N_s=0)=1-\lambda s+o(s),\quad \mathbb{P}(N_s=1)=\lambda s+o(s).$

Note that we are not currently attempting a formal construction of the process. As always, finding a probability space with enough freedom to equip a continuous process is a fiddly task. We are for now just trying to work out what the distribution of the holding times between arrivals should be. There are obvious advantages to defining the process as a collection of iid random variables, for example that we can construct it on a product space.

To do this, we split the time interval [0,1] into blocks

$[0,1]=[0,\frac{1}{n}]\cup[\frac{1}{n},\frac{2}{n}]\cup\ldots\cup[\frac{n-1}{n},1]$

So the probability that someone arrives in the time $[\frac{k}{n},\frac{k+1}{n}]$ is $\frac{\lambda}{n}$. So

$\mathbb{P}(\text{no-one arrives in time }[0,1])=(1-\frac{\lambda}{n})^n\approx e^{-\lambda}.$

As we are working in an $n\rightarrow\infty$ regime, we can replace 1 by general time t, to obtain:

$\mathbb{P}(\text{no-one arrives in time }[0,t])=(1-\frac{\lambda t}{n})^n\approx e^{-\lambda t}.$

So the distribution function of the first arrival time is $F(t)=1-e^{-\lambda t}$ in the conventional notation.
Thus $X_1\sim \text{Exp}(\lambda)$.

However, to emphasis how useful the infinitissimal definition is for actual problems, consider these examples.

1) If we have two arrivals processes at the same object, for example arriving from the left and the right at the same shop, say with rates $\lambda,\mu$, then we want to show that the first arrival time is still exponential. Because of the linearity of expectation property, it is clearly from the definition that the total arrivals process is Poisson with rate $\lambda+\mu$, and so the result follows. Showing this by examining the joint distribution of two exponential random variables is also possible, but much less elegant.

2) Similarly, if we have two shops, A and B, and each arriving person chooses one at random, then the first arrival time at A is: with probability 1/2 distributed as Exp($\lambda$), with probability 1/4 as $\Gamma(2,\lambda)$, and so on. A fairly non-trivial calculation is required to show that this is the same as $Exp(\frac{1}{2}\lambda)$, whereas this follows almost instantly using the the infinitissimal definition.

Moral: with the infinitissimal case, the difficulties are all in the probability space. However, once we have settled those problems, everything else is nice as the key property is linear. Whereas for a construction by iid jump times, the existence and well-definedness is clear, but even for a distribution as tractable as the exponential random variable, manipulation can be tricky.

# Queues and Migration Processes

Simple Queues

A queue generally has the form of a countable state-space Markov Chain. Here we assume that customers are served in the order they arrive (often referred to as: FIFO – First In First Out). The standard Kendall notation is M/M/C/K. Here the Ms stand for Markov (or memoryless) arrivals, and service times respectively, possibly replaced by G if the process admits more general distributions. C is the number of servers, and K the capacity of the system.

The first example is a M/M/C/C queue, motivated by a telephone network. Here, there are c lines, and an arriving call is lost if all lines are busy at the arrival time. We assume arrivals are a PP($\lambda$) process, and the call times are iid Exp($\mu$) RVs. We record the number of busy lines, which is a continuous-time Markov chain on state-space [0,c]. As usual, we assume the system operates in equilibrium, and so in particular we must have $\lambda<\mu c$. It is easy to find the equilibrium distribution. The only non-zero transition probabilities are

$q(i-1,i)=\lambda, \quad q(i,i-1)=\mu i\quad i=1,\ldots,c$

and so can that that $\pi_i=\frac{\nu_i}{i!}\pi_0$ satisfies the DBEs where $\nu:=\frac{\lambda}{\mu}$, sometimes called the traffic intensity. This gives

$\pi_c=\mathbb{P}(c\text{ lines busy})=\frac{\nu^c}{c!}\left(\sum_{k=0}^c \frac{\nu^k}{k!}\right)^{-1}$

and we define this to be $E(\nu,c)$, Erlang’s formula.

Note that if $(X(t),t\in\mathbb{R})$ is a MC in equilibrium, $(X(-t),t\in\mathbb{R})$ is a MC with the same ED, modulo style of discontinuities (ie whether transitions are left-continuous or right-continuous). Therefore, in any queue where all customers get served, eg an M/M/1 queue, for which $\lambda<\mu$ (otherwise the MC is transient, so no ED!), the departure process is the same (in distribution) as the arrivals process.

We can check that this holds for a series of M/M/1 queues, and that in equilibrium, the sizes of the queues are independent. This is merely an extension of the observation that future arrivals for a given queue are independent of the present, and likewise past departures are independent of the future, but the argument is immediately obvious.

Migration Processes

We consider a closed migration process on J colonies, with populations described by a vector $n=(n_1,\ldots,n_J)$. We say that the instantaneous rate of movement of an individual from colony j to colony k is $q(n,T_{jk}n)=\lambda_{jk}\phi_j(n_j)$, for some functions $\phi_j(0)=0$. We can describe the ED of this system in terms of the distribution obtained when a single individual performs a random walk through the state space, with $\phi_j(1)$ taken to be 1 for each j. We call the DBEs for this case the traffic equations, with solutions $(\alpha_j,j\in J)$. Then, by checking the PBEs, it is clear that the equilibrium distribution of the original migration process satisfies

$\pi(n)\propto \prod_{j=1}^J \frac{\alpha_j^{n_j}}{\prod_{r=1}^{n_j}\phi_j(r)}$

This is important, as it would have been computationally difficult to solve the original equations for an ED.

The same result holds for an open migration process, where individuals can enter and leave the system, arriving at colony j at rate $\nu_j$, and leaving at rate $\mu_k\phi_k(n_k)$. Note that this has the same form as if each colony was served by a PP($\alpha_j\lambda_j$), with departures at rate $\lambda_j\phi_j(n_j):=(\mu_j+\sum_k\lambda_{jk})\phi_j(n_j)$. But (obviously) this interpretation is not equivalent to the model

The time reversal is also an OMP. One has to check that the transition rates have the correct form, and so the exit process from each colony (in equilibrium, naturally), is PP($\alpha_j\mu_j$)\$. Most importantly, given a communicating class, we can think of the restriction to this class as an OMP, so in particular, rates of transition between classes are Poisson.

Little’s Law

Informally, the mean number of customers should be equal to the arrival rate multiplied by the mean sojourn time in the system of a customer. This is easiest to formalise by taking an expectation up to a regeneration time. This is T, the first time the system returns to its original state (assumed to be 0 customers), an a.s. finite stopping time.

Set $L:=\mathbb{E}\frac{1}{T}\int_0^Tn(s)ds$, the average number of customers, and $W:=\frac{\mathbb{E}\sum_1^N W_n}{\mathbb{E}N}$ where N is the number of customers arriving in [0,T], and $W_n$ is the waiting time of the n-th customer.

Little’s Law asserts that $L=\lambda W$. Note that for a Markov Chain in equilibrium, can define L more simply as $\mathbb{E}n$ and similarly for W.

It is easiest proved by considering the area between the arrivals process and the departure process in two ways: integrating over height and width. Note that working up to a regeneration time is convenient because at that time the processes are equal.

The migration processes above are said to be linear if $\phi_j(n_j)=n_j$. This process has the form of a Markov chain, and so even in a more general state space, the distribution of points in disjoint subsets of the state-space are independent Poisson processes.

Often though, we start with no individuals in the system, but still the distribution is given by a time-inhomogenous Poisson random measure. The mean is specified by

$M(t,E)=\nu\int_0^t P(u,E)du$

where $\nu$ is the net arrival rate, and $P(u,E)$ is the probability that an individual is in E, a time interval of u after arriving.

As one would suspect, this is easiest to check through generating functions, since independence has a straightforward generating function analogue, and the expression for a Poisson RV is manageable.