The Inspection Paradox and related topics

In the final class for Applied Probability, we discussed the so-called Inspection Paradox for an arrivals process. We assume that buses, sat, arrive as a Poisson process with rate 1, and consider the size of the interval (between buses) containing some fixed time T. The ‘paradox’ is that the size of this interval is larger in expectation than the average time between buses, which of course is given by an exponential random variable.

As with many paradoxes, it isn’t really that surprising after all. Perhaps what is more surprising is that the difference between the expected observed interval and the expected actual interval time is quite small here. There are several points of interest:

1) The Markov property of the Poisson process is key. In particular, this says that the expectation (and indeed the distribution) of the waiting time for a given customer arriving at T is not dependent on T, even if T is a random variable (or rather, a class of random variables, the stopping times). So certainly the inspection paradox property will hold whenever the process has the Markov property, because the inspected interval contains the inspected waiting time, which is equal in distribution to any fixed interval.

2) Everything gets slightly complicated by the fact that the Poisson process is defined to begin at 0. In particular, it is not reversible. Under the infinitesimal (or even the independent Poisson increments) definition, we can view the Poisson process not as a random non-decreasing function of time, but rather as a random collection of points on the positive reals. With this setup, it is clearly no problem to define instead a random collection of points on all the reals. [If we consider this instead as a random collection of point masses, then this is one of the simplest examples of a random measure, but that’s not hugely relevant here.]

We don’t need to worry too much about what value the Poisson process takes at any given time if we are only looking at increments, but if it makes you more comfortable, you can still safely assume that it is 0 at time 0. Crucially, the construction IS now reversible. The number of points in the interval [s,t] has distribution parameterised by t-s, so we it doesn’t matter which direction we are moving in down the real line. In this case, A_t, the time since the previous arrival, and E_t, the waiting time until the next arrival, are both Exp(1) RVs, as the memorylessness property applies in each time direction.

For the original Poisson process, we actually have A_t stochastically dominated by an Exp(1) distribution, because it is conditioned to be less than or equal to t. So in this case, the expected interval time is some complicated function of t, lying strictly between 1 and 2. In our process extended to the whole real line, the expected interval time is exactly 2.

This idea of extending onto the whole real line explains why we mainly consider delayed renewal processes rather than just normal renewal processes. The condition that we start a holding time at 0 is often not general enough, particularly when the holding times are not exponential and so the process is not memoryless.

3) There is a general size-biasing principle in action here. Roughly speaking, we are more likely to arrive in a large interval than in a small interval. The scaling required is proportional to the length of the interval. Given a density function f(x) of X, we define the size-biased density function to be xf(x). We need to normalise to give a probability distribution, and dividing by the expectation EX is precisely what is needed. Failure to account for when an observation should have the underlying distribution or the size-biased distribution is a common cause of supposed paradoxes. A common example is ‘on average my friends have more friends than I do’. Obviously, various assumption on me and my friends, and how we are drawn from the set of people (and the distribution of number of friends) is required that might not necessarily be entirely accurate in all situations.

In the Poisson process example above, the holding times have density function e^{-x}, so the size-biased density function if xe^{-x}. This corresponds to a \Gamma(2,1) distribution, which may be written as the sum of two independent Exp(1) RVs as suggested above.

4) A further paradox mentioned on the sheet is the waiting time paradox. This says that the expected waiting time is longer if you arrive at a random time than if you just miss a bus. This is not too surprising: consider at least the stereotypical complaint about buses in this country arriving in threes, at least roughly. Then if you just miss a bus, there is a 2/3 chance that another will be turning up essentially immediately. On the sheet, we showed that the \Gamma(\alpha,1) distribution has this property also, provided \alpha<1.

We can probably do slightly better than this. The memoryless property of the exponential distribution says that:


In general, for the sort of holding times we might see at a bus stop, we might expect it to be the case that if we have waited a long time already, then we are less likely relatively to have to wait a long time more, that is:


and furthermore this will be strict if neither s nor t is 0. I see no reason not to make up a definition, and call this property supermemorylessness. However, for the subclass of Gamma distributions described above, we have the opposite property:


Accordingly, let’s call this submemorylessness. If this is strict, then it says that we are more likely to have to wait a long time if we have already been waiting a long time. This seems contrary to most real-life distributions, but it certainly explains the paradox. If we arrive at a random time, then the appropriate holding time has been ‘waiting’ for a while, so is more likely to require a longer observed wait than if I had arrived as the previous bus departed.

In conclusion, before you think of something as a paradox, think about whether the random variables being compared are being generated in the same way, in particular whether one is size-biased, and whether there are effects due to non-memorylessness.


The Poisson Process – A Third Characteristion

There remains the matter of the distribution of the number of people to arrive in a fixed non-infinitissimal time interval. Consider the time interval [0,1], which we divide into n smaller intervals of equal width. As n grows large enough that we know the probability that two arrivals occur in the same interval tends to zero (as this is \leq no(\frac{1}{n})), we can consider this as a sequence of iid Bernoulli random variables as before. So

\approx \frac{n^k}{k!} \frac{\lambda^k}{n^k}(1-\frac{\lambda}{n})^n\approx \frac{\lambda^k}{k!}e^{-\lambda}.

We recognise this as belonging to a Poisson (hence the name of the process!) random variable. We can repeat this for a general time interval and obtain N_t\sim \text{Po}(\lambda t).

Note that we implicitly assumed that, in the infinitissimal case at least, behaviour in disjoint intervals was independent. We would hope that this would lift immediately to the large intervals, but it is not immediately obvious how to make this work. This property of independent increments is one of the key definitions of a Levy Process, of which the Poisson process is one of the two canonical examples (the other is Brownian Motion).

As before, if we can show that the implication goes both ways (and for this case it is not hard – letting t\rightarrow 0 clearly gives the infinitissimal construction), we can prove results about Poisson random variables with ease, for example \text{Po}(\lambda)+\text{Po}(\mu)=\text{Po}(\lambda+\mu).

This pretty much concludes the construction of the Poisson process. We have three characterisations:
1) X_n\sim\text{Exp}(\lambda) all iid.
2) The infinitissimal construction as before, with independence.
3) The number of arrivals in a time interval of width t \sim \text{Po}(\lambda t). (This is sometimes called a stationary increments property.) Furthermore, we have independent increments.

A formal derivation of the equivalence of these forms is important but technical, and so not really worth attempting here. See James Norris’s book for example for a fuller exposition.

The final remark is that the Poisson Process has the Markov property. Recall that this says that conditional on the present, future behaviour is independent of the past. Without getting into too much detail, we might like to prove this by using the independent increments property. But remember that for a continuous process, it is too much information to keep track of all the distributions at once. It is sufficient to track only the finite marginal distributions, provided the process is cadlag, which the Poisson process is, assuming we deal with the discontinuities in the right way. Alternatively, the exponential random variable is memoryless, a property that can be lifted, albeit with some technical difficulties, to show the Markov property.

The Poisson Process – Distributions of Holding Times

So, I was planning to conclude my lecture course on Markov Chains at the Cambridge-Linyi summer school with an hour devoted to the Poisson Process. Unfortunately, working through the discrete time material and examples took longer than I’d expected, so we never got round to it. As I’d put a fair bit of work into the preparation, I figured I might as well write it up here instead.

We need a plausible mathematical model for a situation where people or events arrive randomly in continuous time, in a manner that is as close as possible to the notion of iid random variables. In discrete time, a natural candidate would be a set of iid Bernoulli random variables. For example, with probability p a single bus will arrive in an time interval of a minute. With probability 1-p, no bus will arrive. We might have some logistical motivation for why it is not possible that two or more arrive in a given interval, or we could instead choose a more complicated distribution.

One way to proceed would be to specify the distribution of the times between arrivals. These should be independent and identically distributed, at least intuitively. However, although we might be able to give a sensible guess right now, it is not immediately clear what this distribution should be. For now, we merely remark that the arrival times are called X_1,X_2,\ldots, and the holding times between arrivals are defined by
S_1=X_1, S_n=X_n-X_{n-1},n\geq 2.

In fact the motivating discrete example gives us much of the machinery we will actually need. Recall that when we define probability distributions for continuously-valued random variables we need a different plan of attack than for discrete RVs. Whereas for the discrete case, it is enough to specify the probability of each outcome, for a continuous random variable, we have to specify the probabilities of intervals, and take care that they have the obvious additive and nesting properties that we want. Taking the integral (whether Riemannian or Lebesgue) of a so-called density function is a natural way to do this.

Similarly here, we build up from small time intervals. The first remark is this: it is natural that the expected number of arrivals in the first minute is equal to the expected number of arrivals in the second minute. After all, we are considering the most general process possible. If there are an infinite number of potential arriving agents, then behaviour in the first minute should be independent and equal (in distribution) to behaviour in the second minute. We can naturally extend this idea to a linear relation. If N_s is the number of arrivals in the time [0,s], then we should have \mathbb{E}N_s=\lambda s, where \lambda is some constant of proportionality, equal to \mathbb{E}N_1.

The key remark is that as s\rightarrow 0, \mathbb{P}(N_s=1) becomes small, and \mathbb{P}(N_s\geq 2) becomes very small. In fact it suffices that \mathbb{P}(N_s \geq 2)=o(s), as this implies:

\mathbb{P}(N_s=0)=1-\lambda s+o(s),\quad \mathbb{P}(N_s=1)=\lambda s+o(s).

Note that we are not currently attempting a formal construction of the process. As always, finding a probability space with enough freedom to equip a continuous process is a fiddly task. We are for now just trying to work out what the distribution of the holding times between arrivals should be. There are obvious advantages to defining the process as a collection of iid random variables, for example that we can construct it on a product space.

To do this, we split the time interval [0,1] into blocks


So the probability that someone arrives in the time [\frac{k}{n},\frac{k+1}{n}] is \frac{\lambda}{n}. So

\mathbb{P}(\text{no-one arrives in time }[0,1])=(1-\frac{\lambda}{n})^n\approx e^{-\lambda}.

As we are working in an n\rightarrow\infty regime, we can replace 1 by general time t, to obtain:

\mathbb{P}(\text{no-one arrives in time }[0,t])=(1-\frac{\lambda t}{n})^n\approx e^{-\lambda t}.

So the distribution function of the first arrival time is F(t)=1-e^{-\lambda t} in the conventional notation.
Thus X_1\sim \text{Exp}(\lambda).

However, to emphasis how useful the infinitissimal definition is for actual problems, consider these examples.

1) If we have two arrivals processes at the same object, for example arriving from the left and the right at the same shop, say with rates \lambda,\mu, then we want to show that the first arrival time is still exponential. Because of the linearity of expectation property, it is clearly from the definition that the total arrivals process is Poisson with rate \lambda+\mu, and so the result follows. Showing this by examining the joint distribution of two exponential random variables is also possible, but much less elegant.

2) Similarly, if we have two shops, A and B, and each arriving person chooses one at random, then the first arrival time at A is: with probability 1/2 distributed as Exp(\lambda), with probability 1/4 as \Gamma(2,\lambda), and so on. A fairly non-trivial calculation is required to show that this is the same as Exp(\frac{1}{2}\lambda), whereas this follows almost instantly using the the infinitissimal definition.

Moral: with the infinitissimal case, the difficulties are all in the probability space. However, once we have settled those problems, everything else is nice as the key property is linear. Whereas for a construction by iid jump times, the existence and well-definedness is clear, but even for a distribution as tractable as the exponential random variable, manipulation can be tricky.

Models for Random Access Channels

To describe a situation where, for example, mobile phones compete for the use of a satellite for transmitting their messages, we consider a random access network. Here, we assume packets arrive at (necessarily different) stations at Poisson rate \nu, and times of transmissions are discrete. A station will always attempt to transmit a packet which has just arrived, then, if that fails, will retransmit after a time given by a independent random variables. Transmissions fail if at least two stations attempt to use the same window. For the ALOHA protocol, we assume these RVs are geometric with mean f^{-1}. We hope that the backlog, that is N_t the number of stations with packets to retransmit at time t eventually clears. Say Z_t is the number of attempts in (t,t+1)$, proportional to Y_t+\text{Bin}(N_t,f) where Y is the number of arrivals in (t,t+1), and then N_{t+1}=N_t+Y_t-1_{Z_t=1}.

Then the drift in N is positive once N gets large. This is, however, not enough to prove that the system is transient. Instead we compute p(n), the probability that the channel unjams (that is, an instance where no transmissions are attempted) before the backlog increases, given that the backlog is currently n. We can calculate that this probability is \sim n(1-f)^{n-1} and so \sum p(n)<\infty. Applying Borel-Cantelli to the record process, that is the times at which the backlog increases, it can be shown that almost surely, the channel unjams only finitely many times.

To maximise throughput, we would take f=\frac{1}{n}, but of course the individual stations don’t know what is. However, they could guess. Suppose they view the process and keep a counter:


Then transmission happens with probability f=\frac{1}{S_t}, with the idea that S tracks N well, especially when N is large. Observe that

\mathbb{E}[S_{t+1}-S_t|N_t=n,S_t=s]\rightarrow (a-c)e^{-k}+(b-c)ke^{-k}+c

where k=\frac{n}{s}. So (a,b,c)=(2-e,0,1) for example gives negative S drift when k>1, and positive drift when k<1, which is required to draw S towards N. But S might still diverge asymptotically away from N? We don’t want N to increase too fast when k>1, but

\mathbb{E}[N_{t+1}-N_t|N_t=n,S_t=s]\rightarrow \nu-ke^{-k}

so demanding \nu-ke^{-k}<k is what we require when \nu<e^{-1}, which is satisfied by the example suggested.

If a station only knows its own transmission attempts, we can use Ethernet, where after r unsuccessful attempts, wait for a time B_r chosen uniformly from [1,\ldots,2^r]. This is called binary exponential backoff. When \nu<\log 2, get infinitely many successful transmissions almost surely. But for positive recurrence, would have an equilibrium distribution, so by Little’s Law, we can equate arrival and departure rates. Considering the probabilities of 0 and 1 transmissions, obtain

\nu=\pi_1e^{-\nu}+\pi_0\nu e^{-\nu}\quad\Rightarrow\quad \nu\leq e^{-\nu}\quad\Rightarrow\quad \nu<0.567

So can be non-recurrent, but have infinitely many successful transmissions!

Queues and Migration Processes

Simple Queues

A queue generally has the form of a countable state-space Markov Chain. Here we assume that customers are served in the order they arrive (often referred to as: FIFO – First In First Out). The standard Kendall notation is M/M/C/K. Here the Ms stand for Markov (or memoryless) arrivals, and service times respectively, possibly replaced by G if the process admits more general distributions. C is the number of servers, and K the capacity of the system.

The first example is a M/M/C/C queue, motivated by a telephone network. Here, there are c lines, and an arriving call is lost if all lines are busy at the arrival time. We assume arrivals are a PP(\lambda) process, and the call times are iid Exp(\mu) RVs. We record the number of busy lines, which is a continuous-time Markov chain on state-space [0,c]. As usual, we assume the system operates in equilibrium, and so in particular we must have \lambda<\mu c. It is easy to find the equilibrium distribution. The only non-zero transition probabilities are

q(i-1,i)=\lambda, \quad q(i,i-1)=\mu i\quad i=1,\ldots,c

and so can that that \pi_i=\frac{\nu_i}{i!}\pi_0 satisfies the DBEs where \nu:=\frac{\lambda}{\mu}, sometimes called the traffic intensity. This gives

\pi_c=\mathbb{P}(c\text{ lines busy})=\frac{\nu^c}{c!}\left(\sum_{k=0}^c \frac{\nu^k}{k!}\right)^{-1}

and we define this to be E(\nu,c), Erlang’s formula.

Note that if (X(t),t\in\mathbb{R}) is a MC in equilibrium, (X(-t),t\in\mathbb{R}) is a MC with the same ED, modulo style of discontinuities (ie whether transitions are left-continuous or right-continuous). Therefore, in any queue where all customers get served, eg an M/M/1 queue, for which \lambda<\mu (otherwise the MC is transient, so no ED!), the departure process is the same (in distribution) as the arrivals process.

We can check that this holds for a series of M/M/1 queues, and that in equilibrium, the sizes of the queues are independent. This is merely an extension of the observation that future arrivals for a given queue are independent of the present, and likewise past departures are independent of the future, but the argument is immediately obvious.

Migration Processes

We consider a closed migration process on J colonies, with populations described by a vector n=(n_1,\ldots,n_J). We say that the instantaneous rate of movement of an individual from colony j to colony k is q(n,T_{jk}n)=\lambda_{jk}\phi_j(n_j), for some functions \phi_j(0)=0. We can describe the ED of this system in terms of the distribution obtained when a single individual performs a random walk through the state space, with \phi_j(1) taken to be 1 for each j. We call the DBEs for this case the traffic equations, with solutions (\alpha_j,j\in J). Then, by checking the PBEs, it is clear that the equilibrium distribution of the original migration process satisfies

\pi(n)\propto \prod_{j=1}^J \frac{\alpha_j^{n_j}}{\prod_{r=1}^{n_j}\phi_j(r)}

This is important, as it would have been computationally difficult to solve the original equations for an ED.

The same result holds for an open migration process, where individuals can enter and leave the system, arriving at colony j at rate \nu_j, and leaving at rate \mu_k\phi_k(n_k). Note that this has the same form as if each colony was served by a PP(\alpha_j\lambda_j), with departures at rate \lambda_j\phi_j(n_j):=(\mu_j+\sum_k\lambda_{jk})\phi_j(n_j). But (obviously) this interpretation is not equivalent to the model

The time reversal is also an OMP. One has to check that the transition rates have the correct form, and so the exit process from each colony (in equilibrium, naturally), is PP(\alpha_j\mu_j)$. Most importantly, given a communicating class, we can think of the restriction to this class as an OMP, so in particular, rates of transition between classes are Poisson.

Little’s Law

Informally, the mean number of customers should be equal to the arrival rate multiplied by the mean sojourn time in the system of a customer. This is easiest to formalise by taking an expectation up to a regeneration time. This is T, the first time the system returns to its original state (assumed to be 0 customers), an a.s. finite stopping time.

Set L:=\mathbb{E}\frac{1}{T}\int_0^Tn(s)ds, the average number of customers, and W:=\frac{\mathbb{E}\sum_1^N W_n}{\mathbb{E}N} where N is the number of customers arriving in [0,T], and W_n is the waiting time of the n-th customer.

Little’s Law asserts that L=\lambda W. Note that for a Markov Chain in equilibrium, can define L more simply as \mathbb{E}n and similarly for W.

It is easiest proved by considering the area between the arrivals process and the departure process in two ways: integrating over height and width. Note that working up to a regeneration time is convenient because at that time the processes are equal.

The migration processes above are said to be linear if \phi_j(n_j)=n_j. This process has the form of a Markov chain, and so even in a more general state space, the distribution of points in disjoint subsets of the state-space are independent Poisson processes.

Often though, we start with no individuals in the system, but still the distribution is given by a time-inhomogenous Poisson random measure. The mean is specified by

M(t,E)=\nu\int_0^t P(u,E)du

where \nu is the net arrival rate, and P(u,E) is the probability that an individual is in E, a time interval of u after arriving.

As one would suspect, this is easiest to check through generating functions, since independence has a straightforward generating function analogue, and the expression for a Poisson RV is manageable.

One queue or many queues?

You have counters serving customers. What is a better arrangement? To have queues, one for each counter, as in a supermarket (in the heady days before self-checkout, when ‘something unexpected in the bagging area’ might necessitate a trip to the doctor); or a single queue feeding to counters when they become free, as at a train station, or airport passport control?

Queueing is a psychologically intense process, at least in this country. Here are some human points:

– When you have to choose your line, you might, if you have nothing better to do, spend your time in the queue worrying about whether you chose optimally; whether Fortune has given a customer who arrived after you an advantage. (*)

– A single queue feels long, but conversely gives its users the pleasing feeling of almost constant progress. A particularly difficult customer doesn’t have as great an effect.

Interestingly, we can qualify these remarks in purely mathematical terms. We shall see that the mean waiting times are the same, but that, as we might expect, the variance is less for the single queue.

So to begin, we assume that the arrival process is Poisson, and the service times are exponential. These assumptions are very much not required, but it makes it easier to set up the model as a Markov chain. We do need that \lambda, the arrival rate, is less than \mu N, where \mu is the service rate. Why? Well, otherwise the number of customers waiting to be served will almost surely grow to infinity. When supply is greater than demand, as we specify, we know that there will (almost surely) be infinitely many times when there are no customers in the system. This shows that the Markov chain is recurrent, and so it is meaningful to talk about a mean waiting time.

We assume that in the N-queues arrangement, if a counter is free but the corresponding queue is empty, then a customer from a non-zero queue will take the place if possible. We do not worry about how this customer is chosen. Indeed, we do not concern ourselves with how the customers choose a queue, except to ask that the choice algorithm is previsible. That is, they cannot look into the future and optimise (or otherwise) accordingly. This assumption is standard and reasonable by considering the real-world situation.

Now if we set things up in the right way, we can couple the two processes. Precisely, we start with no customers, and define


to be the sequence of arrival times (note that almost surely no two customers arrive at the same time). Now say that


are the times at which a customer starts being served, that is, moves from the queue to the counter. Say S_1,S_2,\ldots are the service times of the customers, where S_i is the service time of the customer who leaves the queue at time D_i.

The important point is that service times are independent of the queueing process, and are i.i.d. This is where we use the fact that the queue choice strategies are previsible. So, in fact all we need to examine is how customers spend in the queue. Concretely, take the queueing time, and S the service time. These are independent so:


and \text{Var} (Q+S)=\text{Var } Q+\text{Var } S.

Observe that A_i,S_i have the same meanings and distributions in both queue models. It is not obvious that D_i does. We notice that for both models, if we define

C_n(t)=\#\{i:D_i+S_i\geq t\} on [D_n,\infty),

the number of the first n customers being served at time t, we have

D_{n+1}=A_{n+1}\vee \inf\{t\geq D_n: C_n(t)\leq N-1\}.

Therefore, by induction on n, we have that (D_i) is well-defined as the same function of (A_i,S_i) for both models. This fact was given as implicitly obvious by many of the remarks on this topic which I found. In my opinion, it is no more obvious than the solution to the original problem.

We can also define, for both models, the number of customers served before the queue is empty again as:

\inf\{N:\exists \epsilon>0\text{ s.t. }c_N(A_{N+1}-\epsilon)=0\},

ie, the least N for which the queue is empty just before the N+1-th customer arrives. As previously discussed, N is a.s. finite. The processes are Markov chains, the first time the queue is empty is a stopping time, and it is meaningful to consider the average waiting time of the N customers before the queue is empty again. There is an expectation over N implicit here too, but this won’t be necessary for our result.

We can now finish the proof. This mean waiting time is


for the single queue. For multiple queues, suppose the i-th customer to arrive is the \sigma(i)-th to be served, where \sigma\in S_N is a permutation. Then the waiting time is

\frac{1}{N}\sum_{i=1}^N (D_{\sigma(i)}-A_i),

which is clearly equal to the previous expression as we can reorder a finite sum.

For variance, we have to consider

\frac{1}{N}\sum_{i=1}^N (D_{\sigma(i)}-A_i)^2.$

Observe that for a\leq b\leq c\leq d, we have 0\leq(d-c)(b-a) and so

(c-a)^2+(d-b)^2\leq (d-a)^2+(c-b)^2,

which we can then apply repeatedly to show that


with equality precisely when \sigma=\text{id}. So we conclude that the variance is in general higher when there are queues. And the reason is not that different to what we had originally suspected (*), namely that customers do not necessarily get served in the same order as they arrive when there are multiple queues, and so the variance of waiting time is greater.


This problem comes from Examples Sheet 1 of the Part III course Stochastic Networks. An official solution of the same flavour as this is also linked from the course page.