How to Prove Fermat’s Little Theorem

The following article was prompted by a question from one of my mentees on the Senior Mentoring Scheme. A pdf version is also available.

Background Ramble

When students first meet problems in number theory, it often seems rather different from other topics encountered at a similar time. For example, in Euclidean geometry, we immediately meet the criteria for triangle similarity or congruence, and various circle theorems. Similarly, in any introduction to inequalities, you will see AM-GM, Cauchy-Schwarz, and after a quick online search it becomes apparent that these are merely the tip of the iceberg for the bewildering array of useful results that a student could add to their toolkit.

Initially, number theory lacks such milestones. In this respect, it is rather like combinatorics. However, bar one or two hugely general ideas, a student gets better at olympiad combinatorics questions by trying lots of olympiad combinatorics questions.

I don’t think this is quite the case for the fledgling number theorist. For them, a key transition is to become comfortable with some ideas and notation, particularly modular arithmetic, which make it possible to express natural properties rather neatly. The fact that multiplication is well-defined modulo n is important, but not terribly surprising. The Chinese Remainder Theorem is a ‘theorem’ only in that it is useful and requires proof. When you ask a capable 15-year-old why an arithmetic progression with common difference 7 must contain multiples of 3, they will often say exactly the right thing. Many will even give an explanation for the regularity with which these occur, which is precisely the content of the theorem. The key to improving number theory problem solving skills is to take these ideas, which are probably obvious, but sitting passively at the back of your mind, and actively think to deploy them in questions.

Fermat’s Little Theorem

It can therefore come as a bit of a shock to meet your first non-obvious theorem (by which I mean, the first result which seems surprising even after you’ve thought about it for a while), which will typically be Fermat’s Little Theorem. This states that:

\text{For a prime }p,\text{ and }a\text{ any integer:}\quad a^p\equiv a\mod p. (1)

Remarks

  • Students are typically prompted to check this result for the small cases p=3, 5 and 7. Trying p=9 confirms that we do really need the condition that p be prime. This appears on the 2012 November Senior Mentoring problem sheet and is a very worthwhile exercise in recently acquired ideas, so I will say no more about it here.
  • Note that the statement of FLT is completely obvious when a is a multiple of p. The rest of the time, a is coprime to p, so we can divide by a to get the equivalent statement:

\text{For a prime }p,\text{ and }a\text{ any integer coprime to }p:\quad a^{p-1}\equiv 1\mod p. (2)

  • Sometimes it will be easier to prove (2) than (1). More importantly, (2) is sometimes easier to use in problems. For example, to show a^{p^2}\equiv a \mod p, it suffices to write:

a^{p^2}\equiv a^{(p-1)(p+1)+1}\equiv (a^{p-1})^{p+1}\times a\equiv 1^{p+1}\times a \equiv a.

  • A word of warning. FLT is one of those theorems which it is tempting to use on every problem you meet, once you know the statement. Try to resist this temptation! Also check the statement with small numbers (e.g. p=3, a=2) the first few times you use it, as with any new theorem – a quick computational check of this kind is sketched below these remarks. You might be surprised how often solutions contain assertions along the lines of

a^p\equiv p \mod (a-1).
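As an illustration of the checking habit, which would catch assertions like the one above, here is a minimal Python sketch (any language would do) verifying (1) for p=3, 5, 7 over all residues, and exhibiting the failure at the composite value 9:

    # Check a^p = a (mod p) for every residue a; report a counterexample if any.
    for p in [3, 5, 7, 9]:
        witnesses = [a for a in range(p) if pow(a, p, p) != a % p]
        print(p, "holds" if not witnesses else f"fails, e.g. a={witnesses[0]}")

Running this prints that 3, 5 and 7 hold, while 9 fails (a=2 is a witness, since 2^9=512\equiv 8\mod 9).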

Proofs

I must have used FLT dozens of times (or at least tried to use it – see the previous remark), before I really got to grips with a proof. I think I was daunted by the fact that the best method for, say, p=7, a careful systematic check, would clearly not work in the general case. FLT has several nice proofs, and is well worth thinking about for a while before reading what follows. However, I hope these hints provide a useful prompt towards discovering some of the more interesting arguments.

Induction on a to prove (1)

  • Suppose a^p\equiv a\mod p. Now consider (a+1)^p modulo p.
  • What happens to each of the (p+1) terms in the expansion?
  • If necessary, look at the expansion in the special case p=5 or 7, formulate a conjecture, then prove it for general p.
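If you want to experiment numerically before committing to a conjecture, the following sketch (using math.comb from the Python standard library) prints the binomial coefficients in the expansion, reduced mod p:

    from math import comb

    # The coefficients of (a+1)^p are C(p,0), C(p,1), ..., C(p,p); reduce them mod p.
    for p in [5, 7, 11]:
        print(p, [comb(p, k) % p for k in range(p + 1)])

The pattern in the output should suggest exactly what to prove about \binom{p}{k} for 0<k<p.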

Congruence classes modulo p to prove (2)

  • Consider the set \{a,2a,3a,\ldots,(p-1)a\} modulo p.
  • What is this set? If the answer seems obvious, think about what you would have to check for a formal proof.
  • What could you do now to learn something about a^{p-1}?
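A numerical experiment may again be suggestive here. A minimal sketch, for a few choices of p and a coprime to p:

    # List {a, 2a, ..., (p-1)a} mod p, sorted, for comparison with {1, ..., p-1}.
    for p, a in [(7, 2), (7, 3), (11, 5)]:
        print(p, a, sorted(k * a % p for k in range(1, p)))

Of course, the output only illustrates the claim; the formal proof requires checking that no two of the multiples can coincide mod p.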

Combinatorics to prove (1)

  • Suppose I want a necklace with p beads, and I have a colours for these beads. We count how many arrangements are possible.
  • Initially, I have the string in a line, so there are p labelled places for beads. How many arrangements?
  • Join the two ends. It is now a circle, so we don’t mind where the labelling starts: Red-Green-Blue is the same as Green-Blue-Red.
  • So, we’ve counted some arrangements more than once. How many have we counted exactly once?
  • How many have we counted exactly p times? Have we counted any arrangements some other number of times?
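For small parameters, the whole count can be done by brute force. This sketch (the function name is my own) groups all a^p coloured strings into rotation classes and tallies the class sizes:

    from itertools import product

    # Group the strings of length p over a colours into rotation (necklace) classes.
    def orbit_sizes(p, a):
        seen, sizes = set(), []
        for s in product(range(a), repeat=p):
            if s in seen:
                continue
            orbit = {s[i:] + s[:i] for i in range(p)}
            seen |= orbit
            sizes.append(len(orbit))
        return sizes

    p, a = 5, 3
    sizes = orbit_sizes(p, a)
    print(sizes.count(1), sizes.count(p))  # expect a classes of size 1, (a^p-a)/p of size p

For p=5, a=3 this prints 3 and 48, consistent with (3^5-3)/5=48, which is exactly the divisibility FLT asserts.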

Group Theory to prove (2)

This is mainly for the interest of students who have seen some of the material for FP3, or some other introduction to groups.

  • Can we view multiplication modulo p as a group? Which elements might we have to ignore to ensure that we have inverses?
  • What is \{1,a,a^2,a^3,\ldots\} in this context? Which axiom is hardest to check?
  • How is the size of the set of powers of a modulo p related to the size of the whole group of congruences?
  • Which of the previous three proofs is this argument most similar to?
  • Can you extend this to show the Fermat-Euler Theorem:

\text{For any integer }n,\text{ and }a\text{ coprime to }n:\quad a^{\phi(n)}\equiv 1 \mod n,

where \phi(n) counts how many integers between 1 and n are coprime to n.
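The Fermat-Euler Theorem is also easy to test numerically. A minimal sketch with a brute-force \phi:

    from math import gcd

    # Brute-force Euler phi, then check a^phi(n) = 1 (mod n) for all a coprime to n.
    def phi(n):
        return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

    for n in [9, 10, 12, 15]:
        assert all(pow(a, phi(n), n) == 1 for a in range(1, n) if gcd(a, n) == 1)
    print("Fermat-Euler verified for n = 9, 10, 12, 15")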

From G(n,p) to G(n,m)

In the previous posts about random graphs, I was focusing on the model G(n,p). Here, we have n vertices, and we insert an edge between any pair of vertices independently with probability p. In particular, the number of edges which appear in a realisation of G(n,p) is a random variable, distributed as \text{Bin}(\frac{n(n-1)}{2},p).

The original model examined by Erdos and Renyi, after whom the random graph described above was named, was slightly different. Still with n vertices, they specified how many edges m they wanted in the graph, and chose uniformly at random from the set of graphs with this number of edges. This model is usually denoted G(n,m). Normally we can tell them apart by context. Obviously, p is a probability so lies in [0,1], whereas m is a positive integer, so there isn’t much room for ambiguity.

The key observation is that, if H is some graph with n vertices and m edges, then the probability that it appears as G(n,p) is

p^{E(H)}(1-p)^{\binom{n}{2}-E(H)}.

This is constant if we vary H while fixing m. In other words, G(n,p) conditioned to have m edges is G(n,m). So, via some sort of law of total probability, we can construct G(n,p) by taking m to be distributed as \text{Bin}(\frac{n(n-1)}{2},p), and conditional on that, sampling from G(n,m). (*)

We can couple G(n,p) for all p, by assigning iid uniform [0,1] random variables to each pair of vertices, then including the edge in G(n,p) only if the value of the RV is greater than 1-p. Similarly, it is often helpful to think of G(n,m) as m varies as a random process, where edges are added one at a time, and at each stage the next edge is chosen uniformly at random from those not currently present. Perhaps because of this, it is sometimes easier to prove results for G(n,p) than for G(n,m) so we want to develop a framework to move between the two.
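Both constructions are easy to realise in code, and seeing them side by side makes the coupling concrete. A minimal Python sketch (the helper names are my own):

    import random
    from itertools import combinations

    n = 6
    pairs = list(combinations(range(n), 2))

    # Coupling across all p: one uniform per pair; the edge is in G(n,p) iff U > 1-p.
    U = {e: random.random() for e in pairs}
    def gnp(p):
        return [e for e in pairs if U[e] > 1 - p]

    # G(n,m) as a process: a uniform shuffle of the possible edges; the first m
    # edges of the shuffle give a sample of G(n,m), and m -> m+1 adds one edge.
    order = pairs[:]
    random.shuffle(order)
    def gnm(m):
        return order[:m]

    print(set(gnp(0.3)) <= set(gnp(0.6)))  # monotone coupling: always True
    print(gnm(4))                          # a uniformly chosen graph with 4 edges

The point of the first construction is exactly the monotonicity: increasing p only ever adds edges.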

The decomposition (*) gives a relatively straightforward way to move from a result in G(n,m) to a result in G(n,p). By the Central Limit Theorem, the number of edges in G(n,p) is \binom{n}{2}p+O(n) with high probability, and so if a result holds with high probability in G(n,m) for all m in some interval, say [\binom{n}{2}p-Kn,\binom{n}{2}p+Kn] for some large K, then the law of total probability shows that the property holds with high probability in G(n,p).

In general, we get more interesting properties when p is a function of n. As discussed in previous posts, the scaling p=\frac{\lambda}{n} is particularly worth studying. The CLT now shows that G(n,\frac{\lambda}{n}) has \frac{\lambda n}{2}+O(\sqrt{n}) edges with high probability. If you are confused why you can’t just substitute this value for p into the previous expression, note that p(1-p) does appear in the general asymptotic variance, but this gets absorbed into the “big O” notation when p is constant.

More importantly, many properties that we might want to consider are not in general affected in the limit by adding or removing O(\sqrt{n}) edges. For example, with high probability, G(n,m) has largest component of size (\zeta_\lambda+o(1)) n whenever \lambda>1 and m\in [\frac{\lambda n}{2}-O(\sqrt{n}),\frac{\lambda n}{2}+O(\sqrt{n})]. Some of this notation would need to be made a bit more precise in a formal argument, but for now, let’s take that as given. This then implies that with high probability, G(n,\frac{\lambda}{n}) has largest component of size (\zeta_\lambda+o(1))n also.

Of course, from the logical structure of this blog, this deduction is a bit bogus, because we have only just introduced G(n,m), and have no idea about the properties of its giant components yet. We seek instead an argument to deduce facts about G(n,m) from facts about G(n,p). Because G(n,m) cannot obviously be written as some conditioned combination of G(n,p)s, this instinctively seems harder. Bollobas gives various general conditions to carry results over between the two regimes in his Part III course notes, but I feel that an example would be the easiest way to explain the ideas.

The size of the largest component is such an important quantity that we might as well consider that, in the subcritical case. We work with G(n,\frac{\lambda}{n}), for which we have the result:

\mathbb{P}_{\lambda}(C_1\geq a\log n)=O(n^{-\delta}),

for some \delta>0, whenever a>I_\lambda^{-1}, where I_\lambda is the rate function at 1 of the total population size of a Poisson(\lambda) branching process. For now, that doesn’t matter too much, except that it is continuous as a function of \lambda. We want to show that G(n,\frac{\lambda n}{2}) has the same property.

The trick is to consider G(n,\frac{\lambda+\epsilon}{n}) instead. Let E_a be the event described above. By the law of total probability and the decomposition mentioned above, we have:

\mathbb{P}_{\lambda+\epsilon}(E_a)=\sum_{m=0}^{\binom{n}{2}}\mathbb{P}(\text{Bin}(\frac{n(n-1)}{2},\frac{\lambda+\epsilon}{n})=m)\mathbb{P}_{n,m}(E_a).

We are going to split this sum into

\sum_{m=0}^{\frac{(\lambda+\epsilon )n}{2}-n^{3/4}}+\sum_{m=\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}^{\binom{n}{2}}.

On the first of these sums, we simply bound the summands below by zero, and on the second, we use that \mathbb{P}_{n,m}(E_a) is an increasing function of m. This property is special to the event we are considering – in general one might have to be a bit more clever, perhaps using some form of continuity of \mathbb{P}_{n,m}(E) in m as n grows. Anyway, this enables us to bound:

\mathbb{P}_{\lambda+\epsilon}(E_a)\geq \mathbb{P}_{\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}(E_a)\,\mathbb{P}(\text{Bin}(\frac{n(n-1)}{2},\frac{\lambda+\epsilon}{n})\geq\frac{(\lambda+\epsilon)n}{2}-n^{3/4}).

By the Central Limit Theorem, the Binomial probability in this bound tends to 1, as the threshold is about n^{1/4} standard deviations below the mean. We therefore have, for large n:

\mathbb{P}_{\lambda+\epsilon}(E_a)\geq (1-o(1))\,\mathbb{P}_{\frac{(\lambda+\epsilon)n}{2}-n^{3/4}}(E_a).

We demanded that a>I_\lambda^{-1}, and mentioned that this function was continuous, so since we have total freedom over \epsilon, in particular, we can choose \epsilon>0 such that a>I_{\lambda+\epsilon}^{-1}. By the work on G(n,p), we have \mathbb{P}_{\lambda+\epsilon}(E_a)=O(n^{-\delta}), and for large enough n, we have \frac{(\lambda+\epsilon)n}{2}-n^{3/4}\geq\frac{\lambda n}{2}, and so, using monotonicity in m once more, the result \mathbb{P}_{\frac{\lambda n}{2}}(E_a)=O(n^{-\delta}) follows.

Analytic vs Probabilistic Arguments for a Supercritical BP

This follows on directly from the previous post. I was originally going to talk only about what follows, but I got rather carried away with the branching process account. I was stuck on a particular exercise, and we ended up coming up with two arguments: one analytic and one probabilistic. Since the typical flavour of this blog is to present problems which show the advantage of the probabilistic approach, it seems only fair to remark on this case, where the analytic method was less interesting, but much simpler.

Recall that we have a supercritical random graph G(n,\frac{\lambda}{n}), \lambda>1, and we are considering the rescaled exploration process \frac{1}{n}S_{nt}, which has asymptotic mean \mu_t=1-t-e^{-\lambda t}. We can similarly calculate an expression for the asymptotic variance

\frac{\text{Var}(S_{nt})}{n}\rightarrow v_t=e^{-\lambda t}(1-e^{-\lambda t}).

To use this to verify the result about the size of the giant component, we check that \mu_t is negative just after t=\zeta_\lambda (that is, at times \zeta_\lambda+x/\sqrt{n}), and that the variance there is small, which would confirm that the giant component has size at most n(\zeta_\lambda+o(1)) with high probability. A similar argument is required for the lower bound. The variance is a separate matter, but it is therefore necessary that \mu_t should be decreasing at t=\zeta_\lambda, that is \mu'_{\zeta_\lambda}=\lambda e^{-\lambda \zeta_\lambda}-1<0. This is what we try to prove in the remainder of this post. Recall that in the previous post we checked that \mu_t is equal to zero at t=\zeta_\lambda.

Heuristic Explanation

\mu_t has been rescaled from the original definition of the exploration process in both size and time-scale so some care is needed to see why this should hold in the limit. Remember that all components apart from the giant component are of size O(log n). So immediately after exhausting the giant component, you are likely to be visiting components of size roughly log n. A time interval of dt for \mu corresponds to ndt for S, during which S will visit some components of size log n and some of O(1) and some in between. In particular, some fixed proportion of vertices are isolated, that is, in a component of size 1.

There is then a complicated size-biasing train of thought. A component of size log n is more likely to come up than an isolated vertex, but there are not as many of them. The log n components push the derivative \mu_t' towards zero, because S_t decreases by 1 over a time-interval of length log n, which gives a gradient of zero in the limit. However, the isolated vertices give a gradient of -1, because S_t decreases by 1 over a time interval of 1. Despite the fact that log n components are likely to appear earlier, it still remains the case that after exhausting a component (in particular, at time t=\zeta_\lambda, after exhausting the giant component), you will choose an isolated vertex next with probability bounded below away from zero. The component size only affects the time-scale if it is O(n), which none of the remaining components are, so the derivative \mu_{\zeta_\lambda}' is some complicated weighted mean of 0 and -1. In particular, it is negative.

Analytic solution

Obviously, that won’t do in practice. Suppressing lambdas for ease of notation, the key fact is: e^{-\lambda \zeta}=1-\zeta. We want to show that \lambda e^{-\lambda \zeta}<1. Substituting

\lambda=-\frac{\log(1-\zeta)}{\zeta},

means that it is required to show:

-\frac{1-\zeta}{\zeta}\log(1-\zeta)<1.

Differentiating the left hand side gives:

\frac{\log(1-\zeta)+\zeta}{\zeta^2}<0,

since of course -\log(1-\zeta)=\zeta+\frac{\zeta^2}{2}+\frac{\zeta^3}{3}+\dots. So it suffices to check the result for small \zeta. But, again using a Taylor series:

-\frac{1-\zeta}{\zeta}\log(1-\zeta)=1-\frac12\zeta+O(\zeta^2)<1,

for small \zeta. This gives the required result.
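Both sides of the inequality are easy to evaluate numerically, which makes a good sanity check on the algebra. A minimal sketch, finding \zeta_\lambda by fixed-point iteration of \zeta\mapsto 1-e^{-\lambda\zeta} (which converges for \lambda>1 precisely because the slope at the fixed point is \lambda(1-\zeta)<1, the very fact under discussion):

    from math import exp

    # Solve 1 - z = exp(-lam * z) for the positive root z = zeta_lam.
    def zeta(lam):
        z = 0.5
        for _ in range(200):
            z = 1 - exp(-lam * z)
        return z

    for lam in [1.1, 1.5, 2.0, 5.0]:
        z = zeta(lam)
        print(lam, round(z, 6), lam * exp(-lam * z) < 1)  # final entry: always True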

Probabilistic Interpretation and Solution

First, we observe that \lambda e^{-\lambda\zeta}=\lambda(1-\zeta) is the expected number of vertices in the first generation of a \text{Po}(\lambda) branching process whose progeny become extinct. This motivates considering the canonical decomposition of a supercritical branching process Z into the skeleton process and the dual process. The skeleton Z^+ consists of all vertices which have infinitely many successors. It is relatively easy to show that this is a branching process with offspring distribution \text{Po}(\lambda\zeta) conditioned on being positive. The dual process Z^* is a G-W branching process with offspring distribution \text{Po}(\lambda) conditioned on dying. This is the same as a branching process with offspring distribution \text{Po}(\lambda(1-\zeta)), by a thinning argument, which says that if we begin with a Poisson number of things, then remove each one independently with some fixed probability, the remaining number of things is Poisson also.

We can construct the original branching process by

  • With probability \zeta, take the skeleton, and affix independent copies of Z^* at every vertex in the skeleton.
  • With probability 1-\zeta, just take a copy of Z^*.

It is immediately clear that \lambda(1-\zeta)\leq 1. After all, the dual process is almost surely finite, so the offspring distribution cannot have expectation greater than 1. Checking that the inequality is strict is more fiddly. The best way I have come up with is to examine the tail of the distribution of the total population size of the original branching process.

The total population size T of a branching process has an exponential tail if the offspring distribution is subcritical. It isn’t hugely surprising that this behaves like a large deviation for iid RVs, since in the limit such an event requires a lot of the offspring counts to deviate substantially from the mean. The same holds in the supercritical case, with the additional complication that though the finite tail decays exponentially, there is positive probability that the total size will be infinite. In the critical case, however, there is power-law decay. This is not hugely surprising, as it marks the threshold for the appearance of the infinite population, just as in a multiplicative coalescent at time 1, we have a load of very large components just about to form a giant component. The tool for all of these results is Dwass’s Theorem, which says:

\mathbb{P}(T=n)=\frac{1}{n}\mathbb{P}(X_1+\ldots+X_n=n-1),

where the X_i are iid with the offspring distribution. When \mathbb{E}X_1\neq 1, this is a large deviation event, for which Cramer’s theorem applies (assuming, as is the case for the Poisson distribution, that the offspring distribution has exponential moments). When \mathbb{E}X_1=1, the Central Limit Theorem says that with high probability,

X_1+\ldots+X_n\in [n-n^{3/4},n+n^{3/4}],

so, skating over the details of whether everything is exactly uniform within this CLT scaling window,

\mathbb{P}(T=n)\geq \frac{1}{n}\cdot\frac{1}{2n^{3/4}}.

The true power-law decay is substantially slower than this (the exponent is 3/2 when the offspring variance is finite), but the above argument works as a back-of-the-envelope bound.
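In the Poisson case, Dwass’s Theorem is easy to check by simulation, since X_1+\ldots+X_n\sim \text{Po}(n\lambda). A rough sketch for the critical case \lambda=1 (the sampler is my own, since the Python standard library has no Poisson generator):

    import random
    from math import exp, factorial

    def poisson(lam):
        # Knuth's method: count uniforms until their product drops below exp(-lam).
        L, k, p = exp(-lam), 0, 1.0
        while p > L:
            k += 1
            p *= random.random()
        return k - 1

    def total_size(cap=1000):
        # Total population of a Po(1) branching process, capped to avoid long runs.
        alive, total = 1, 1
        while alive > 0 and total < cap:
            alive = sum(poisson(1.0) for _ in range(alive))
            total += alive
        return total

    runs, n = 20000, 3
    emp = sum(1 for _ in range(runs) if total_size() == n) / runs
    dwass = exp(-n) * n ** (n - 1) / factorial(n - 1) / n  # (1/n) P(Po(n) = n-1)
    print(emp, dwass)  # both should be close to 0.0747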

In particular, if the dual process has mean 1, then the population size of the original branching process is given by taking a distribution with exponential tail with some probability and a distribution with power-law tail with some probability. Obviously the power-law will dominate, which contradicts the fact that the original branching process is supercritical, and so has an exponential tail on the event that its total size is finite.

Exploring the Supercritical Random Graph

I’ve spent a bit of time this week reading and doing all the exercises from some notes by van der Hofstad about random graphs. I think they are absolutely excellent and would not be surprised if they become the standard text for an introduction to probabilistic combinatorics. You can find them hosted on the author’s website. I’ve been reading chapters 4 and 5, which approach the properties of phase transitions in G(n,p) by formalising the analogy between component sizes and population sizes in a binomial branching process. When I met this sort of material for the first time during Part III, the proofs generally relied on careful first and second moment bounds, which is fine in many ways, but I enjoyed vdH’s (perhaps more modern?) approach, as it seems to give a more accurate picture of what is actually going on. In this post, I am going to talk about using the branching process picture to explain why the giant component emerges when it does, and how to get a grip on how large it is at any time after it has emerged.

Background

A quick tour through the background, and in particular the notation, will be required. At some point I will write a post about this topic in a more digestible format, but for now I want to move on as quickly as possible.

We are looking at the sparse random graph G(n,\frac{\lambda}{n}), in the super-critical phase \lambda>1. With high probability (that is, with probability tending to 1 as n grows), we have a so-called giant component, with O(n) vertices.

Because all the edges in the graph are independent, we can view the component containing a fixed vertex as a branching process. Given vertex v(1), the number of neighbours is distributed like \text{Bin}(n-1,\frac{\lambda}{n}). The number of neighbours of each of these which we haven’t already considered is then \text{Bin}(n-k,\frac{\lambda}{n}), conditional on k, the number of vertices we have already discounted. After any finite number of steps, k=o(n), and so it is fairly reasonable to approximate this just by \text{Bin}(n,\frac{\lambda}{n}). Furthermore, as n grows, this distribution converges to \text{Po}(\lambda), and so it is natural to expect that the probability that the fixed vertex lies in a giant component is equal to the survival probability \zeta_\lambda (that is, the probability that the process is infinite) of a branching process with \text{Po}(\lambda) offspring distribution. Note that given a graph, the probability of a fixed vertex lying in a giant component is equal to the fraction of vertices in the giant component. At this point it is clear why the emergence of the giant component must happen at \lambda=1, because we require \mathbb{E}\text{Po}(\lambda)>1 for the survival probability to be non-zero. Obviously, all of this needs to be made precise and rigorous, and this is treated in sections 4.3 and 4.4 of the notes.

Exploration Process

A common functional of a rooted branching process to consider is the following. This is called in various places an exploration process, a depth-first process or a Lukasiewicz path. We take a depth-first labelling of the tree v(0), v(1), v(2),… , and define c(k) to be the number of children of vertex v(k). We then define the exploration process by:

S(0)=0,\quad S(k+1)=S(k)+c(k)-1.

By far the best way to think of this is to imagine we are making the depth-first walk on the tree. S(k) records how many vertices we have seen (because they are connected by an edge to a vertex we have visited) but have not yet visited. To clarify understanding of the definition, note that when you arrive at a vertex with no children, this should decrease by one, as you can see no new vertices, but have visited an extra one.

This exploration process is useful to consider for a couple of reasons. Firstly, you can reconstruct the branching process directly from it. Secondly, while other functionals (eg the height, or contour process) look like random walks, the exploration process genuinely is a random walk. The distribution of the number of children of the next vertex we arrive at is independent of everything we have previously seen in the tree, and is the same for every vertex. If we were looking at branching processes in a different context, we might observe that this gives some information in a suitably-rescaled limit, as rescaled random walks converge to Brownian motion if the variance of the (offspring) distribution is finite. (This is Donsker’s result, which I should write something about soon…)

The most important property is that the exploration process returns to 0 precisely when we have exhausted all the vertices in a component. At that point, we have seen exactly the vertices which we have explored. There is no reason not to extend the definition to forests, that is, a union of trees. The depth-first exploration is the same – but when we have exhausted one component, we move onto another component, chosen according to some labelling property. Then, running minima of the exploration process (ie times when it is smaller than it has been before) correspond to jumping between components, and thus excursions above the minimum to components themselves. The running minimum will be non-positive, with absolute value equal to the number of components already exhausted.

Although the exploration process was defined with reference to trees, as the realisation of a branching process, this is not necessary. With some vertex denoted as the root, we can construct a depth-first labelling of a general graph, and the exploration process follows exactly as before. Note that we end up ignoring all edges except a set that forms a forest. This is what we will apply to G(n,p).
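The definition translates directly into code. Here is a hedged sketch (the names are my own, and the traversal order within a component is one valid depth-first choice among several) which computes the exploration process of a graph given by adjacency lists, and reads off component sizes from the new running minima:

    # Exploration (Lukasiewicz) path of a graph; c(k) counts newly seen neighbours.
    def exploration(adj):
        n = len(adj)
        seen, S, path = [False] * n, 0, [0]
        for root in range(n):
            if seen[root]:
                continue
            seen[root], stack = True, [root]
            while stack:
                v = stack.pop()
                children = [w for w in adj[v] if not seen[w]]
                for w in children:
                    seen[w] = True
                S += len(children) - 1
                path.append(S)
                stack.extend(children)
        return path

    # Component sizes are the gaps between successive new running minima.
    def component_sizes(path):
        sizes, last, cur_min = [], 0, 0
        for k in range(1, len(path)):
            if path[k] < cur_min:
                cur_min = path[k]
                sizes.append(k - last)
                last = k
        return sizes

    adj = [[1, 2], [0], [0], [4], [3]]  # components {0,1,2} and {3,4}
    path = exploration(adj)
    print(path, component_sizes(path))  # [0, 1, 0, -1, -1, -2] [3, 2]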

Exploring G(n,p)

When we jump between components in the exploration process on a supercritical (that is \lambda>1) random graph, we move to a component chosen randomly with size-biased distribution. If there is a giant component, as we know there is in the supercritical case, then this will dominate the size-biased distribution. Precisely, if the giant component takes up a fraction H of the vertices, then the number of components to be explored before we get to the giant component is geometrically distributed with parameter H. All other components have size O(log n), so the expected number of vertices explored before we get to the giant component is O(log n)/H = o(n), and so in the limit, we explore the giant component immediately.

The exploration process therefore gives good control on the giant component in the limit, as roughly speaking the first time it returns to 0 is the size of the giant component. Fortunately, we can also control the distribution of S_t, the exploration process at time t. We have that:

S_t+(t-1)\sim \text{Bin}(n-1,1-(1-p)^t).

This is not too hard to see. S_t+(t-1) is the number of vertices we have explored or seen, ie are connected to a vertex we have explored. Suppose the remaining vertices are called unseen, and we began the exploration at vertex 1. Then any vertex with label in {2,…,n} is unseen if it successively avoids being in the neighbourhood of v(1), v(2), …, v(t). This happens with probability (1-p)^t, and so the probability of being an explored or seen vertex is the complement of this.

In the supercritical case, we are taking p=\frac{\lambda}{n} with \lambda>1, and we also want to speed up S, so that all the exploration processes are defined on [0,1], and rescale the sizes by n, so that it records the fraction of the graph rather than the number of vertices. So we consider the rescaling \frac{1}{n}S_{nt}.

It is straightforward to use the distribution of S_t to deduce that the asymptotic mean is \mathbb{E}\frac{1}{n}S_{nt}\rightarrow\mu_t = 1-t-e^{-\lambda t}.

Now we are in a position to provide more concrete motivation for the claim that the proportion of vertices in the giant component is \zeta_\lambda, the survival probability of a branching process with \text{Po}(\lambda) offspring distribution. It helps to consider instead the extinction probability 1-\zeta_\lambda. We have:

1-\zeta_\lambda=\sum_{k\geq 0}\mathbb{P}(\text{Po}(\lambda)=k)(1-\zeta_\lambda)^k=e^{-\lambda\zeta_\lambda},

where the second equality is a consequence of the simple form for the probability generating function of the Poisson distribution.

As a result, we have that \mu_{\zeta_\lambda}=0. In fact we also have a central limit theorem for S_t, which enables us to deduce that \frac{1}{n}S_{n\zeta_\lambda}=o(1) with high probability, as well as in expectation, which is precisely what is required to prove that the giant component of G(n,\frac{\lambda}{n}) has size n(\zeta_\lambda+o(1)).
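As a crude check on this chain of reasoning, one can compare the largest component fraction of a simulated G(n,\frac{\lambda}{n}) with a numerical solution of 1-\zeta=e^{-\lambda\zeta}. A minimal sketch using union-find:

    import random
    from math import exp
    from collections import Counter

    def largest_component_fraction(n, lam):
        # Union-find over the vertices; union across each present edge.
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        p = lam / n
        for i in range(n):
            for j in range(i + 1, n):
                if random.random() < p:
                    parent[find(i)] = find(j)
        return max(Counter(find(v) for v in range(n)).values()) / n

    def zeta(lam):
        z = 0.5
        for _ in range(200):
            z = 1 - exp(-lam * z)  # fixed point of the extinction relation
        return z

    lam, n = 2.0, 2000
    print(largest_component_fraction(n, lam), zeta(lam))  # both near 0.797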

The Inspection Paradox and related topics

In the final class for Applied Probability, we discussed the so-called Inspection Paradox for an arrivals process. We assume that buses, say, arrive as a Poisson process with rate 1, and consider the size of the interval (between buses) containing some fixed time T. The ‘paradox’ is that the size of this interval is larger in expectation than the average time between buses, which of course is given by an exponential random variable.

As with many paradoxes, it isn’t really that surprising after all. Perhaps what is more surprising is that the difference between the expected observed interval and the expected actual interval time is quite small here. There are several points of interest:

1) The Markov property of the Poisson process is key. In particular, this says that the expectation (and indeed the distribution) of the waiting time for a given customer arriving at T is not dependent on T, even if T is a random variable (or rather, a class of random variables, the stopping times). So certainly the inspection paradox property will hold whenever the process has the Markov property, because the inspected interval contains the inspected waiting time, which is equal in distribution to any fixed interval.

2) Everything gets slightly complicated by the fact that the Poisson process is defined to begin at 0. In particular, it is not reversible. Under the infinitesimal (or even the independent Poisson increments) definition, we can view the Poisson process not as a random non-decreasing function of time, but rather as a random collection of points on the positive reals. With this setup, it is clearly no problem to define instead a random collection of points on all the reals. [If we consider this instead as a random collection of point masses, then this is one of the simplest examples of a random measure, but that’s not hugely relevant here.]

We don’t need to worry too much about what value the Poisson process takes at any given time if we are only looking at increments, but if it makes you more comfortable, you can still safely assume that it is 0 at time 0. Crucially, the construction IS now reversible. The number of points in the interval [s,t] has distribution parameterised by t-s, so it doesn’t matter which direction we are moving in down the real line. In this case, A_t, the time since the previous arrival, and E_t, the waiting time until the next arrival, are both Exp(1) RVs, as the memorylessness property applies in each time direction.

For the original Poisson process, we actually have A_t stochastically dominated by an Exp(1) distribution, because it is conditioned to be less than or equal to t. So in this case, the expected interval time is some complicated function of t, lying strictly between 1 and 2. In our process extended to the whole real line, the expected interval time is exactly 2.

This idea of extending onto the whole real line explains why we mainly consider delayed renewal processes rather than just normal renewal processes. The condition that we start a holding time at 0 is often not general enough, particularly when the holding times are not exponential and so the process is not memoryless.

3) There is a general size-biasing principle in action here. Roughly speaking, we are more likely to arrive in a large interval than in a small interval. The scaling required is proportional to the length of the interval. Given a density function f(x) of X, we define the size-biased density function to be xf(x). We need to normalise to give a probability distribution, and dividing by the expectation EX is precisely what is needed. Failure to account for whether an observation should have the underlying distribution or the size-biased distribution is a common cause of supposed paradoxes. A common example is ‘on average my friends have more friends than I do’. Obviously, various assumptions on me and my friends, and on how we are drawn from the set of people (and on the distribution of the number of friends), are required, and these might not be entirely accurate in all situations.

In the Poisson process example above, the holding times have density function e^{-x}, so the size-biased density function is xe^{-x}. This corresponds to a \Gamma(2,1) distribution, which may be written as the sum of two independent Exp(1) RVs, as suggested above.
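These calculations are easy to corroborate by simulation. A minimal sketch (rate-1 arrivals on a long horizon, with the inspection time placed in the middle to avoid edge effects):

    import random

    # Length of the inter-arrival interval containing time T, for a rate-1
    # Poisson process built from iid Exp(1) gaps.
    def observed_interval(T=500.0, horizon=1000.0):
        prev, t = 0.0, 0.0
        while t < horizon:
            prev, t = t, t + random.expovariate(1.0)
            if prev <= T < t:
                return t - prev
        return float("nan")

    samples = [observed_interval() for _ in range(20000)]
    print(sum(samples) / len(samples))  # close to 2 = E Gamma(2,1), not E Exp(1) = 1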

4) A further paradox mentioned on the sheet is the waiting time paradox. This says that the expected waiting time is longer if you arrive at a random time than if you just miss a bus. This is not too surprising: consider the stereotypical complaint about buses in this country arriving, roughly, in threes. Then if you just miss a bus, there is a 2/3 chance that another will be turning up essentially immediately. On the sheet, we showed that the \Gamma(\alpha,1) distribution has this property also, provided \alpha<1.

We can probably do slightly better than this. The memoryless property of the exponential distribution says that:

\mathbb{P}(Z>t+s|Z>t)=\mathbb{P}(Z>s).

In general, for the sort of holding times we might see at a bus stop, we might expect that if we have waited a long time already, then we are relatively less likely to have to wait a long time more, that is:

\mathbb{P}(Z>t+s|Z>t)\leq\mathbb{P}(Z>s),

and furthermore this will be strict if neither s nor t is 0. I see no reason not to make up a definition, and call this property supermemorylessness. However, for the subclass of Gamma distributions described above, we have the opposite property:

\mathbb{P}(Z>t+s|Z>t)\geq\mathbb{P}(Z>s).

Accordingly, let’s call this submemorylessness. If this is strict, then it says that we are more likely to have to wait a long time if we have already been waiting a long time. This seems contrary to most real-life distributions, but it certainly explains the paradox. If we arrive at a random time, then the appropriate holding time has been ‘waiting’ for a while, so is more likely to require a longer observed wait than if I had arrived as the previous bus departed.

In conclusion, before you think of something as a paradox, think about whether the random variables being compared are being generated in the same way, in particular whether one is size-biased, and whether there are effects due to non-memorylessness.

Skorohod Space

The following is a summary of a chapter from Billingsley’s Convergence of Probability Measures. The ideas are easy to explain heuristically, but this was the first text I could find which explained how to construct Skorohod space for functions on the whole of the non-negative reals in enough stages that it was easily digestible.

It is relatively straightforward to define a topology on C[0,1], as we can induce it from the most sensible metric. In this topology, functions f and g are close together if

\sup_{t\in[0,1]} |f(t)-g(t)| is small.

For cadlag functions, things are a bit more complicated. Two functions might be very similar, but have a discontinuity of similar magnitude at slightly different places. The sup norm of the difference is therefore macroscopically large. So we want a metric that also allows uniformly small deformations of the time scale.

We define the Skorohod (or Skorokhod depending on your transliteration preferences) metric d on D[0,1] as follows. Let \Lambda be the family of continuous, strictly increasing functions from [0,1] to [0,1] which map 0 to 0 and 1 to 1. This will be our family of suitable reparameterisations of the time scale (or abscissa – a new word I learned today. The other axis in a co-ordinate pair is called the ordinate). Anyway, we now say that

d(f,g)<\epsilon\quad\text{if }\exists \lambda\in\Lambda\text{ s.t. }

||\lambda - \text{id}||_\infty<\epsilon\quad\text{and}\quad ||f-g\circ\lambda||_\infty<\epsilon.

In other words, after reparameterising the time scale for g, without moving any time by more than epsilon, the functions are within epsilon in the sup metric.
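As a quick worked example (with my own choice of jump locations): take f=1_{[0,\frac12)} and g=1_{[0,\frac12+\epsilon)}. In the uniform metric these are at distance 1, since they disagree on [\frac12,\frac12+\epsilon). But choosing \lambda\in\Lambda piecewise linear with \lambda(\frac12)=\frac12+\epsilon, we get g\circ\lambda=1_{[0,\frac12)}=f exactly, while ||\lambda-\text{id}||_\infty=\epsilon, so d(f,g)\leq\epsilon.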

Weak Convergence

We have the following condition: if \{P_n\} is a tight sequence of probability measures, and

P_n\pi_{t_1,\ldots,t_k}^{-1}\Rightarrow P\pi_{t_1,\ldots,t_k}^{-1}\quad\forall t_1,\ldots,t_k\in[0,1],

then P_n\Rightarrow P,

where \pi_{t_1,\ldots,t_k} is the projection onto a finite-dimensional set. This is a suitable condition for C[0,1]. For D[0,1], we have the additional complication that these projections might not be continuous everywhere.

We can get over this problem. For a measure P, set T_P to be the set of t\in[0,1] such that \pi_t is continuous P-almost everywhere (ie for all f\in D apart from a collection with P-measure = 0). Then, for all P, it is not hard to check that 0,1\in T_P and [0,1]\backslash T_P is countable.

The tightness condition requires two properties:

1) \lim_{K\rightarrow\infty} \limsup_{n}P_n[f:||f||\geq K]=0.

2) \forall \epsilon>0:\,\lim_\delta\limsup_n P_n[f:w_f'(\delta)\geq\epsilon]=0.

These say, respectively, that the measure of ||f|| doesn’t escape to \infty, and there is no mass given in the limit to functions which ‘wiggle with infinite frequency on an epsilon scale of amplitude’.

D_\infty=D[0,\infty)

Our earlier definition of the Skorohod metric could have been written:

d(f,g)=\inf_{\lambda\in\Lambda}\{||\lambda-\text{id}||\vee||f-g\circ\lambda||\}.

From a topological convergence point of view, there’s no need to use the sup norm on \lambda - \text{id}. We want to regulate smoothness of the reparameterisation, so we could use the norm:

||\lambda||^\circ=\sup_{s<t}|\log\frac{\lambda(t)-\lambda(s)}{t-s}|,

that is, the slope is uniformly close to 1 if ||\lambda||^\circ is small. The advantage of this choice of norm is that an extension to D[0,\infty) is immediate. Also, the induced metric

d^\circ(f,g)=\inf_{\lambda\in\Lambda} \{||\lambda - \text{id}||^\circ \vee||f-g\circ\lambda||\}

is complete. This gives us a few problems, as for example

d^\circ(1_{[0,1)},1_{[0,1-\frac{1}{n})})=1,

as you can’t reparameterise over the jump in a way that ensures the log of the gradient is relatively small. (In particular, to keep the sup norm less than 1, we would need \lambda to send 1-\frac{1}{n}\mapsto 1, and so ||\lambda||^\circ=\infty by definition.)

So we can’t immediately define Skorohod convergence on D_\infty by demanding convergence on any restriction to [0,t]. We overcome this in a similar way to convergence of distribution functions.

Lemma: If d_t^\circ (f_n,f)\rightarrow_n 0, then for any s<t at which f is continuous, d_s^\circ(f_n,f)\rightarrow_n 0.

So this says that the functions converge in Skorohod space if for arbitrarily large times T where the limit function is continuous, the restrictions to [0,T] converge. (Note that cadlag functions have at most countably many discontinuities, so this is fine.)

A metric for D_\infty

If we want to specify an actual metric d_\infty^\circ, the usual tools for specifying a countable product metric will do here:

d_\infty^\circ(f,g)=\sum_{m\geq 1}2^{-m}[1\wedge d_m^\circ(f^m,g^m)],

where f^m is the restriction of f to [0,m], with the potential discontinuity at m smoothed out:

f^m(t)=\begin{cases}f(t)&t\leq m-1\\ (m-t)f(t)&t\in[m-1,m]\\ 0&t\geq m.\end{cases}

In particular, d_\infty^\circ(f,g)=0\Rightarrow f^m=g^m\,\forall m.

It can be checked that:

Theorem: d_\infty^\circ(f_n,f)\rightarrow 0 in D_\infty if and only if

\exists \lambda_n\in\Lambda_\infty\text{ s.t. }||\lambda_n-\text{id}||\rightarrow 0

\text{and }\sup_{t\leq m}|f_n\circ\lambda_n-f|\rightarrow_n 0,\,\forall m,

and that d_\infty^\circ (f_n,f)\rightarrow 0 \Rightarrow d_t^\circ(f_n,f)\rightarrow 0 for every point of continuity t of f.

Similarly weak convergence and tightness properties are available, roughly as you might expect. It is probably better to reference Billingsley’s book or similar sources rather than further attempting to summarise them here.

Supremum of Brownian Motion

We define the supremum process of Brownian Motion by:

S_t:=\sup_{0\leq s\leq t}B_s.

Here are two facts about Brownian Motion. Firstly, the Reflection Principle:

\mathbb{P}(S_t\geq b,B_t\leq a)=\mathbb{P}(B_t\geq 2b-a),

which we motivate by ‘stopping’ at the first time the Brownian Motion hits b, and reflecting the remainder of the path, using the Strong Markov Property at this hitting time. By setting a=b, we get:

\mathbb{P}(S_t\geq b)=\mathbb{P}(S_t\geq b,B_t\leq b)+\mathbb{P}(B_t\geq b)=2\mathbb{P}(B_t\geq b)=\mathbb{P}(|B_t|\geq b),

and conclude that

S_t\stackrel{d}{=}|B_t|\quad\text{for each }t\geq 0.

The second fact comes from the decomposition of BM into local times and excursions:

(S_t,S_t-B_t)_{t\geq 0}\stackrel{d}{=}(L_t,|B_t|)_{t\geq 0},

where L is the local time process at 0, and this equality in distribution holds for the processes. See the previous post on excursion theory for explanation of what local times mean.

In particular, combining these two facts gives:

S_t\stackrel{d}{=}S_t-B_t\quad\text{for every }t\geq 0.

I thought that was rather surprising, and wanted to think of a straightforward reason why this should be true. I think the following works:

Brownian motion is time-reversible. In particular, as processes, we have

(B_s)_{0\leq s\leq t}\stackrel{d}{=}(B_{t-s}-B_t)_{0\leq s\leq t}

\Rightarrow \sup_{0\leq r\leq t}B_r\stackrel{d}{=}\sup_{0\leq r\leq t}(B_{t-r}-B_t)

\Rightarrow S_t\stackrel{d}{=}S_t-B_t.
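All three equalities in distribution are easy to test against a random walk approximation of Brownian Motion. A rough sketch at t=1, with Gaussian increments; each empirical mean should approach \mathbb{E}|B_1|=\sqrt{2/\pi}\approx 0.7979:

    import random

    # Approximate (B_1, S_1) by a Gaussian random walk with N steps.
    def sample(N=1000):
        B, S = 0.0, 0.0
        for _ in range(N):
            B += random.gauss(0.0, 1.0 / N ** 0.5)
            S = max(S, B)
        return B, S

    runs = [sample() for _ in range(20000)]
    print(sum(S for _, S in runs) / len(runs))       # E S_1
    print(sum(S - B for B, S in runs) / len(runs))   # E (S_1 - B_1)
    print(sum(abs(B) for B, _ in runs) / len(runs))  # E |B_1|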

Subordinators and the Arcsine rule

After the general discussion of Levy processes in the previous post, we now discuss a particular class of such processes. The majority of content and notation below is taken from chapters 1-3 of Jean Bertoin’s Saint-Flour notes.

We say X_t is a subordinator if:

  • It is a right-continuous adapted stochastic process, started from 0.
  • It has stationary, independent increments.
  • It is increasing.

Note that the first two conditions are precisely those required for a Levy process. We could also allow the process to take the value \infty, where the hitting time of infinity represents ‘killing’ the subordinator in some sense. If this hitting time is almost surely infinite, we say it is a strict subordinator. There is little to be gained right now from considering anything other than strict subordinators.

Examples

  • A compound Poisson process, with finite jump measure supported on [0,\infty). Hereafter we exclude this case, as it is better dealt with in other languages.
  • A so-called stable Levy process, where \Phi(\lambda)=\lambda^\alpha, for some \alpha\in(0,1). (I’ll define \Phi very soon.) Note that checking that the sample paths are increasing requires only that X_1\geq 0 almost surely.
  • The hitting time process for Brownian Motion. Note that this does indeed have jumps as we would need. (This has \Phi(\lambda)=\sqrt{2\lambda}.)

Properties

  • In general, we describe Levy processes by their characteristic exponent. As a subordinator takes values in [0,\infty), we can use the Laplace exponent instead:

\mathbb{E}\exp(-\lambda X_t)=:\exp(-t\Phi(\lambda)).

  • We can refine the Levy-Khintchine formula:

\Phi(\lambda)=k+d\lambda+\int_{[0,\infty)}(1-e^{-\lambda x})\Pi(dx),

  • where k is the kill rate (in the non-strict case). Because the process is increasing, it must have bounded variation, and so the quadratic part vanishes, and we have a stronger condition on the Levy measure: \int(1\wedge x)\Pi(dx)<\infty.
  • The expression \bar{\Pi}(x):=k+\Pi((x,\infty)) for the tail of the Levy measure is often more useful in this setting.
  • We can think of this decomposition as the sum of a drift, and a PPP with characteristic measure \Pi+k\delta_\infty. As we said above, we do not want to consider the case that X is a step process, so either d>0 or \Pi((0,\infty))=\infty is enough to ensure this.

Analytic Methods

We give a snapshot of a couple of observations which make these nice to work with. Define the renewal measure U(dx) by:

\int_{[0,\infty)}f(x)U(dx)=\mathbb{E}\left(\int_0^\infty f(X_t)dt\right).

If we want to know the distribution function of this U, it will suffice to consider the indicator function f(y)=1_{\{y\leq x\}} in the above.

The reason to exclude step processes specifically is to ensure that X has a continuous inverse:

L_x=\sup\{t\geq 0:X_t\leq x\} so U(x)=\mathbb{E}L_x is continuous.

In fact, this renewal measure characterises the subordinator uniquely, as we see by taking the Laplace transform:

\mathcal{L}U(\lambda)=\int_{[0,\infty)}e^{-\lambda x}U(dx)=\mathbb{E}\int e^{-\lambda X_t}dt

=\int \mathbb{E}e^{-\lambda X_t}dt=\int\exp(-t\Phi(\lambda))dt=\frac{1}{\Phi(\lambda)}.

The Arcsine Law

X is Markov, which induces a so-called regenerative property on the range of X, \mathcal{R}. Formally, given s, we do not always have s\in\mathcal{R} (as the process might jump over s), but we can define D_s=\inf\{t>s:t\in\mathcal{R}\}. Then

\{v\geq 0:v+D_s\in\mathcal{R}\}\stackrel{d}{=}\mathcal{R}.

In fact, the converse holds as well. Any random set with this regenerative property is the range of some subordinator. Note that D_s is some kind of dual to X, since it is increasing, and the regenerative property induces some Markovian properties.

In particular, we consider the last passage time g_t=\sup\{s<t:s\in\mathcal{R}\}, in the case of a stable subordinator with \Phi(\lambda)=\lambda^\alpha. Here, \mathcal{R} is self-similar with scaling exponent \alpha. The distribution of \frac{g_t}{t} is thus independent of t. In this situation, we can derive the generalised arcsine rule for the distribution of g_1:

\mathbb{P}(g_1\in ds)=\frac{\sin \alpha\pi}{\pi}s^{\alpha-1}(1-s)^{-\alpha}ds.

The most natural application of this is to the hitting time process of Brownian Motion, which is stable with \alpha=\frac12. Then g_1=\sup\{t<1:B_t=S_t\}, in the usual notation for the supremum process. Furthermore, we have equality in distribution of the processes (see previous posts on excursion theory and the short aside which follows):

(S_t-B_t)_{t\geq 0}\stackrel{d}{=}(|B_t|)_{t\geq 0}.

So g_1 gives the time of the last zero of BM before time 1, and the arcsine law shows that its distribution is given by:

\mathbb{P}(g_1\leq t)=\frac{2}{\pi}\text{arcsin}\sqrt{t}.
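This is easy to check against a discretisation. A hedged sketch: simulate a Gaussian random walk approximation to BM on [0,1], record the last (approximate) zero, and compare the empirical distribution function with \frac{2}{\pi}\arcsin\sqrt{t}:

    import random
    from math import asin, pi

    # Last sign change of a Gaussian walk on [0,1], a proxy for the last zero of BM.
    def last_zero(N=1000):
        B, g = 0.0, 0.0
        for k in range(1, N + 1):
            B_next = B + random.gauss(0.0, 1.0 / N ** 0.5)
            if B * B_next <= 0:
                g = k / N
            B = B_next
        return g

    samples = [last_zero() for _ in range(20000)]
    for t in [0.1, 0.5, 0.9]:
        emp = sum(1 for g in samples if g <= t) / len(samples)
        print(t, emp, 2 / pi * asin(t ** 0.5))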

The Levy-Khintchine Formula

Because of a string of coincidences involving my choice of courses for Part III and various lecturers’ choices about course content, I didn’t learn what a Levy process was until a few weeks ago. Trying to get my head around the Levy-Khintchine formula took a little while, so the following is what I would have liked to have been able to find back then.

A Levy process is an adapted stochastic process started from 0 at time zero, and with stationary, independent increments. This is reminiscent, indeed a generalisation, of the definition of Brownian motion. In that case, we were able to give a concrete description of the distribution of X_1. For a general Levy process, we have

X_1=X_{1/n}+(X_{2/n}-X_{1/n})+\ldots+(X_1-X_{1-1/n}).

So the distribution of X_1 is infinitely divisible, that is, can be expressed as the distribution of the sum of n iid random variables, for every n. Viewing this definition in terms of convolutions of distributions may be more helpful, especially as we will subsequently consider characteristic functions. If this is the first time you have seen this property, note that it is not a universal property. For example, it is not clear how to write a U[0,1] random variable as a convolution of two iid RVs. Note that exactly the same argument suffices to show that the distribution of X_t is infinitely divisible.

It will be most convenient to work with the characteristic functions

\mathbb{E}\exp(i\langle \lambda,X_t\rangle).

By stationarity of increments, we can show that this is equal to

\exp(-\Psi(\lambda)t)\quad\text{where}\quad \mathbb{E}\exp(i\langle \lambda,X_1\rangle)=:\exp(-\Psi(\lambda)).

This function \Psi(\lambda) is called the characteristic exponent. The argument resembles that used for Cauchy’s functional equation, by dealing first with the rationals using stationarity of increments, then lifting to the reals by the (right-)continuity of

t\mapsto \mathbb{E}\exp(i\langle \lambda,X_t\rangle).

As ever, \Psi(\lambda) uniquely determines the distribution of X_1, and so it also uniquely determines the distribution of the Levy process. The only condition on \Psi is that it be the characteristic exponent of an infinitely divisible distribution. This condition is given explicitly by the Levy-Khintchine formula.

Levy-Khintchine

\Psi(\lambda) is the characteristic exponent of an infinitely divisible distribution iff

\Psi(\lambda)=i\langle a,\lambda\rangle +\frac12 Q(\lambda)+\int_{\mathbb{R}^d}(1-e^{i\langle \lambda,x\rangle}+i\langle \lambda,x\rangle 1_{|x|<1})\Pi(dx).

for a\in\mathbb{R}^d, Q a quadratic form on \mathbb{R}^d, and \Pi a so-called Levy measure satisfying \int (1\wedge |x|^2)\Pi(dx)<\infty.

This looks a bit arbitrary, so first let’s explain what each of these terms ‘means’.

  • i\langle a,\lambda\rangle comes from a drift of -a. Note that a deterministic linear function is a (not especially interesting) Levy process.
  • \frac12Q(\lambda) comes from a Brownian part \sqrt{Q}B_t.

The rest corresponds to the jump part of the process. Note that a Poisson process is an example of a Levy process, hence why we might consider thinking about jumps in the first place. The reason why there is an indicator function floating around is that we have to think about two regimes separately, namely large and small jumps. Jumps of size bounded below cannot happen too often as otherwise the process might explode off to infinity in finite time with positive probability. On the other hand, infinitesimally small jumps can happen very often (say on a dense set) so long as everything is controlled to prevent an explosion on the macroscopic scale.

There is no canonical choice for where the divide between these regimes happens, but conventionally this is taken to be at |x|=1. The restriction on the Levy measure near 0 ensures that the sum of the squares of all jumps up to some finite time converges absolutely.

  • \Pi\cdot 1_{|x|\geq 1} gives the intensity of a standard compound Poisson process. The jumps are well-spaced, and so it is a relatively simple calculation to see that the contribution to the characteristic exponent is

\int_{\mathbb{R}^d}(1-e^{i\langle \lambda,x\rangle})1_{|x|\geq 1}\Pi(dx).

The intensity \Pi\cdot 1_{|x|<1} gives infinitely many hits in finite time, so if the expectation of this measure is not 0, we explode immediately. We compensate by drifting away from this at rate

\int_{\mathbb{R}^d}x1_{|x|<1}\Pi(dx).

To make this more rigorous, we should really consider 1_{\epsilon<|x|<1} then take a limit, but this at least explains where all the terms come from. Linearity allows us to interchange integrals and inner products, to get the term

\int_{\mathbb{R}^d}(1-e^{i\langle \lambda,x\rangle}+i\langle\lambda,x\rangle 1_{|x|<1})\Pi(dx).

If the process has bounded variation, then we must have Q=0, and also

\int (1\wedge |x|)\Pi(dx)<\infty,

that is, not too many jumps on an |x| scale. In this case, this drift component is well-defined and linear in \lambda, so it can be incorporated with the drift term at the beginning of the Levy-Khintchine expression. If not, then there are some \lambda for which it does not exist.

There are some other things to be said about Levy processes, including

  • Stable Levy processes, where \Psi(k\lambda)=k^\alpha \Psi(\lambda), which induces the rescaling-invariance property: k^{-1/\alpha}X_{kt}\stackrel{d}{=}X. The distribution of each X_t is then also a stable distribution.
  • Resolvents, where instead of working with the process itself, we work with the distribution of the process at a random exponential time.