I’ve had plenty of ideas for potential probability posts recently, but have been a bit too busy to write any of them up. I guess that’s a good thing in some sense. Anyway, this is a quick remark based on an argument I was thinking about yesterday. It combines Large Deviation theory, which I have spent a lot of time learning about this year, and the Poisson process, which I have spent a bit of time teaching.

**Question**

Does the Poisson distribution have an exponential tail? I ended up asking this question for two completely independent reasons yesterday. Firstly, I’ve been reading up about some more complex models of random networks. Specifically, the Erdos-Renyi random graph is an interesting mathematical structure in its own right, but the independent edge condition results in certain regularity properties which are not seen in many real-world networks. In particular, the degree sequence of real-world networks typically follows an approximate power law. That is, the tail is heavy. This corresponds to our intuition that most networks contain ‘hubs’ which are connected to a large region of the network. Think about key servers or websites like Wikipedia and Google which are linked to by millions of other pages, or the social butterfly who will introduce friends from completely different circles. In any case, this property is not observed in an Erdos-Renyi graph, where the degrees are binomial, and in the sparse situation, converge in the limit to a Poisson distribution. So, to formalise this observation, we want to be able to prove that the Poisson distribution has an exponential (so faster than power-law) tail.
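As a quick sanity check of that sparse-limit claim, here is a simulation sketch (the sizes n, the rate $\lambda$ and the trial count are arbitrary choices of mine): the degree of a fixed vertex in G(n, $\lambda$/n) is Binomial(n-1, $\lambda$/n), and its empirical distribution already sits close to the Po($\lambda$) mass function.

```python
import math
import random

random.seed(0)

# Illustrative sizes (arbitrary): n vertices, target mean degree lam.
n, lam, trials = 1000, 3.0, 3000
p = lam / n  # sparse Erdos-Renyi edge probability

# The degree of a fixed vertex in G(n, p) is Binomial(n - 1, p); sample
# many independent copies rather than building whole graphs.
samples = [sum(random.random() < p for _ in range(n - 1)) for _ in range(trials)]

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Po(lam)
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Empirical degree frequencies against the Poisson(lam) limit.
for k in range(8):
    print(k, samples.count(k) / trials, round(poisson_pmf(k, lam), 4))
```

The agreement is already good at n = 1000; the binomial-to-Poisson convergence is fast in the sparse regime.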

The second occurrence of this question concerns large deviations for the exploration process of a random graph. This is a topic I’ve mentioned elsewhere (here for the exploration process, here for LDs) so I won’t recap extensively now. Anyway, the results we are interested in give estimates for the rate of decay in probability of the event that the path defined by the exploration process differs substantially from the expected path as n grows. A major annoyance in this analysis is the possibility of jumps. A jump occurs if a set of o(n) adjacent underlying random variables (here, the increments in the exploration process) has O(n) sum. A starting point might be to consider whether O(1) adjacent RVs can have O(n) sum, or indeed whether a single Poisson random variable can take a value of order n. In practice, this asks whether that probability decays faster than exponentially in n. If it does, then such events are dominated on a large deviations scale. If it decays exactly exponentially in n, then we have to consider such jumps in the analysis.

**Approach**

We can give a precise statement of the probability that a $\mathrm{Po}(\lambda)$ random variable X returns a given integer value:

$\mathbb{P}(X=k)=e^{-\lambda}\frac{\lambda^k}{k!}.$

Note that these are the terms in the Taylor expansion of $e^\lambda$, appropriately normalised. So, while it looks like it should be possible to evaluate

$\mathbb{P}(X\geq n)=e^{-\lambda}\left(\frac{\lambda^n}{n!}+\frac{\lambda^{n+1}}{(n+1)!}+\dots\right),$

this seems impossible to do directly, and it isn’t even especially obvious what a sensible bounding strategy might be.
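For what it’s worth, the tail is at least easy to compute numerically, even without a closed form. A minimal sketch (the rate $\lambda=2$ and the truncation length are arbitrary choices): sum the series term by term, using the ratio $\lambda/(k+1)$ to move between consecutive summands.

```python
import math

lam = 2.0  # an illustrative rate

def poisson_tail(n, lam, terms=200):
    # P(X >= n) = e^{-lam} * sum_{k >= n} lam^k / k!, truncated after
    # `terms` summands; they decay superexponentially, so the error is tiny.
    total = 0.0
    term = lam ** n / math.factorial(n)  # first summand, lam^n / n!
    for k in range(n, n + terms):
        total += term
        term *= lam / (k + 1)  # turn lam^k / k! into lam^{k+1} / (k+1)!
    return math.exp(-lam) * total

for n in [1, 5, 10, 20]:
    print(n, poisson_tail(n, lam))
```

Watching these values shrink already suggests much faster than exponential decay, which is what the rest of the post proves.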

The problem of estimating the form of the limit in probability of increasingly unlikely deviations from expected behaviour surely reminds us of Cramer’s theorem. But this and other LD theory is generally formulated in terms of n random variables displaying some collective deviation, rather than a single random variable with the size of the deviation growing. But we can transform our problem into that form by appealing to the equivalent definitions of the Poisson process.

Recall that the Poisson process is the canonical description of, say, an arrivals process, where events in disjoint intervals are independent, and the expected number of arrivals in a fixed interval is proportional to the width of the interval, giving a well-defined notion of ‘rate’ as we would want. The two main ways to define the process are: 1) the times between arrivals are given by i.i.d. Exponential RVs with parameter equal to the rate; and 2) the number of arrivals in the interval [s,t] is independent of the process at all other times, and has distribution given by $\mathrm{Po}(\lambda(t-s))$. The fact that these definitions give the same well-defined process is not necessarily obvious, but let’s not discuss that further here.
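A quick way to convince yourself that definition 1) produces counts with the distribution in 2) is to simulate: build arrival times from i.i.d. Exp($\lambda$) gaps, count the arrivals in [0,1], and check the Po($\lambda$) signature that mean and variance agree. A rough sketch, with the rate and sample size chosen arbitrarily:

```python
import math
import random

random.seed(1)
lam = 4.0       # illustrative rate
trials = 20000  # number of independent runs of the process

def arrivals_by_time_one(lam):
    # Accumulate i.i.d. Exp(lam) inter-arrival gaps; count arrivals <= 1.
    t, count = 0.0, 0
    while True:
        t += random.expovariate(lam)
        if t > 1.0:
            return count
        count += 1

counts = [arrivals_by_time_one(lam) for _ in range(trials)]
mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / trials
print(mean, var)  # both should be close to lam, as for Po(lam)
```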

So the key equivalence to be exploited is that the event $\{X\geq n\}$ for $X\sim\mathrm{Po}(\lambda)$ is a statement that there are at least n arrivals by time 1 in a rate-$\lambda$ Poisson process. If we move to the exponential inter-arrival times definition, we can write this as:

$\mathbb{P}(X\geq n)=\mathbb{P}(Z_1+\dots+Z_n\leq 1),$

where the Z’s are the i.i.d. $\mathrm{Exp}(\lambda)$ inter-arrival random variables. But this is exactly what we are able to specify through Cramer’s theorem. Recall that the moment generating function of an exponential distribution, $\mathbb{E}e^{tZ_1}=\frac{\lambda}{\lambda-t}$ for $t<\lambda$, is not finite everywhere, but that doesn’t matter, as we construct our rate function by taking the supremum over some index t of:

$xt-\log\frac{\lambda}{\lambda-t}.$

A simple calculation (the supremum is attained at $t=\lambda-\frac{1}{x}$) then gives

$I(x)=\lambda x-1-\log\lambda x.$
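This calculation is easy to check numerically: maximise $xt-\log\frac{\lambda}{\lambda-t}$ over a fine grid of t and compare with the closed form. A small sketch (the rate $\lambda=2$, the grid, and the test points x are my arbitrary choices):

```python
import math

lam = 2.0  # illustrative rate

def I_closed(x):
    # Claimed rate function for Exp(lam): I(x) = lam*x - 1 - log(lam*x).
    return lam * x - 1 - math.log(lam * x)

def I_numeric(x, grid=200000):
    # Brute-force the Legendre transform sup_{t < lam} [x*t - log(lam/(lam-t))].
    lo, hi = -20.0, lam - 1e-9
    best = -float("inf")
    for i in range(grid + 1):
        t = lo + (hi - lo) * i / grid
        best = max(best, x * t - math.log(lam / (lam - t)))
    return best

for x in [0.05, 0.2, 0.5]:
    print(x, I_numeric(x), I_closed(x))
```

Note that $I(1/\lambda)=0$, as it must be: the rate function vanishes at the mean of the exponential distribution.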

Note that I(1) is the same for both $\mathrm{Exp}(\lambda)$ and $\mathrm{Po}(\lambda)$, because of the Poisson process equality of events:

$\{Z_1+\dots+Z_n\leq n\}=\{N_n\geq n\},$

where $N_n\sim\mathrm{Po}(n\lambda)$ counts the arrivals by time n, similar to the previous argument. In particular, for all $\epsilon>0$, eventually $\frac{1}{n}<\epsilon$, and so

$\mathbb{P}(Z_1+\dots+Z_n\leq 1)\leq\mathbb{P}(Z_1+\dots+Z_n\leq \epsilon n)\approx e^{-nI(\epsilon)}.$

Since $I(\epsilon)\to\infty$ as $\epsilon\to 0$, we can take the exponential rate as large as we want, and we conclude that the probability decays faster than exponentially in n.
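This superexponential decay can also be seen directly from the exact distribution: if the tail were exactly exponential, $-\frac{1}{n}\log\mathbb{P}(X\geq n)$ would stay bounded, but computing it in log space shows it growing without bound. A sketch with the arbitrary choice $\lambda=1$:

```python
import math

lam = 1.0  # illustrative rate

def log_tail(n, terms=300):
    # log P(X >= n) for X ~ Po(lam), computed stably in log space by
    # factoring out the first summand e^{-lam} * lam^n / n!.
    log_first = -lam + n * math.log(lam) - math.lgamma(n + 1)
    ratio_sum, term = 0.0, 1.0
    for k in range(terms):
        ratio_sum += term
        term *= lam / (n + 1 + k)  # ratio of consecutive summands
    return log_first + math.log(ratio_sum)

# If the tail were exactly exponential, these values would level off;
# instead they keep growing, roughly like log n.
for n in [10, 100, 1000]:
    print(n, -log_tail(n) / n)
```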

###### Related articles

- Large Deviations 1 – Motivation and Cramer’s Theorem (eventuallyalmosteverywhere.wordpress.com)
- Large Deviations 2 – LDPs, Rate Functions and Lower Semi-Continuity (eventuallyalmosteverywhere.wordpress.com)
- Large Deviations 3 – Gartner-Ellis Theorem: Where do the all terms come from? (eventuallyalmosteverywhere.wordpress.com)


In fact, it is not hard to see that the sum is asymptotically equivalent to its largest term (that is, the first term in this case), which decays like $\exp(-n \log n + \dots)$ by Stirling’s formula. Just note how much smaller the rest of the terms are…
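This dominance of the first term is easy to observe numerically: the ratio of the tail to its first summand is $1+\frac{\lambda}{n+1}+\frac{\lambda^2}{(n+1)(n+2)}+\dots$, which tends to 1. A small sketch (the choice $\lambda=2$ is arbitrary):

```python
lam = 2.0  # illustrative rate

def tail_over_first_term(n, terms=200):
    # P(X >= n) / P(X = n) = 1 + lam/(n+1) + lam^2/((n+1)(n+2)) + ...
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= lam / (n + 1 + k)  # multiply in the next factor of the product
    return total

for n in [5, 20, 100]:
    print(n, tail_over_first_term(n))  # ratio tends to 1: first term dominates
```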

This is a very general phenomenon, which underlies what is called Laplace’s method: in certain kinds of sums and integrals, what is really going on is that the “mass” concentrates around the place where the maximum is achieved, and you can study this concentration to the point that you can actually write down the asymptotics.
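A classical instance of Laplace’s method is Stirling’s formula, obtained from the Gaussian approximation of the integrand of $n!=\int_0^\infty x^n e^{-x}\,dx$ around its peak at x = n. A quick numerical check, done in log space to avoid overflow:

```python
import math

# Laplace's method applied to n! = \int_0^\infty x^n e^{-x} dx: the integrand
# peaks at x = n, and a Gaussian approximation around the peak gives
# Stirling's formula n! ~ sqrt(2 pi n) (n/e)^n. Compare in log space:
for n in [10, 50, 200]:
    log_stirling = 0.5 * math.log(2 * math.pi * n) + n * (math.log(n) - 1)
    print(n, math.lgamma(n + 1) - log_stirling)  # gap ~ 1/(12n), shrinking to 0
```

The gap between the two log values is approximately 1/(12n), the first correction term in the full Stirling series.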