The Coupon Collector Problem

This post talks about a problem which I’ve mentioned in passing a few times previously. A research problem I’ve been thinking about in the past week involves some careful calculations for a related setup, so I thought it would be worth writing some short notes about the classical problem. As an child who has tried to collect football cards or the sets of toys you find inside cereal packets will know, it always takes ages to pick up the last few cards, since you will mainly be picking up types that you already have.

The classical coupon collector problem is exactly a mathematical statement of this situation. We have a finite set, that we might as well take to be [n]. Then we have a collection of IID random variables which are uniformly distributed on [n]. Let’s call these X_1,X_2,\ldots. We are interested in how many samples we need before we have seen each element of [n]. We could define this finishing time as


 We are interested in the distribution of this random variable T. The key initial observation is that we can specify the distribution of the number of samples needed between seeing new objects as geometric. For example, given that we have seen k possible values, the number of Xs until we see a new value is distributed as Geom(n-k/n). This of course has expectation n/(n-k), and so the expectation of T is


We can immediately do a crude approximation of this as

n\int_1^n \frac{dx}{x}=\log n.

We should ask just how approximate this is, and so shortly we will move to a discussion of the harmonic numbers


First though, we should settle the concentration question. Note that this is concretely different to the question of approximating the harmonic numbers. Concentration asks how tightly the distribution is packed around its mean. Later, we will ask exactly how close the mean is to n log n.

The most crude measure is variance. Note that in the above calculation we used linearity of expectation, which holds generally. We also have a result concerning linearity of variance, which Wikipedia asserts is commonly called Bienayme’s formula, provided the summand random variables are independent. And that is the case here. Recall that the variance of a Geom(p) random variable is \frac{q}{p^2}, where p+q=1. So we obtain

\mathrm{Var}(T)= \sum_{k=1}^n \frac{n(n-k)}{k^2}

\le \sum_{k=1}^n \frac{n^2}{k^2}\le n^2\frac{\pi^2}{6}.

So the standard deviation is an order of magnitude less than the expectation, hence by Chebyshev or similar, we can get some concentration bounds.

Harmonic Numbers

We return now to the question of how good an estimate log n is for


First, we note that log n must be an underestimate. In the following picture

DSC_1814 - Copy

log n is the area under the black curve between 1 and n, whereas H_n also includes the block between n and n+1, and the red area above the curve. So, in the limit, the estimate is off by at least the total red area (as the first error term decays to zero). The value of this area is defined to be the Euler-Mascheroni constant, a name attributed to the two men who first made a serious attempt to approximate it.

Why should this be finite? Well note that if we translated each red bit so they lay one on top of the other, they would not overlap (apart from boundaries) and be contained within the block with width [1,2] and height 1. So the constant is bounded above by 1. Since the function 1/x is strictly convex, approximating each red sector by a triangle with the same boundary points would be an underestimate, so the constant \gamma>1/2. The value turns out to be about 0.577.

So thus far we have the approximation

H_n\approx \log n + \gamma.

In fact a strong asymptotic statement is

H_n=\log n +\gamma+\frac{1}{2n}+o\left(\frac{1}{n}\right).

Rather than address that, I will show that, asymptotically:

\log n+\gamma< H_n<\log n +\gamma+\frac{1}{n}.

We treat the upper bound first. Note that \log(n+1)=\log n + \log (1+\frac{1}{n})=\log n +\frac{1}{n}+O(\frac{1}{n^2}). So, just by bounding with areas, noting that we need to take the upper limit of the integral to be (n+1), we get:

H_n < \int_1^{n+1} \frac{dx}{x} + \gamma=\log(n+1)+\gamma\leq \log n + \frac{1}{n}+\gamma.

For the lower bound, it is required to show that


is strictly greater than the red area contributing to \gamma that lies to the right of x=n+1 But for this we can use the same argument as before, by translating all the red areas as far to the left as possible so they all lie on top of each other, and in particular within a box of dimensions 1 x 1/n+1, which is less than the integral.

We conclude this article by relating these harmonic numbers to the Stirling numbers of the first kind, which are a particular example of Bell polynomials. The easiest way to state the result is

H_n=\frac{1}{n!}\#\{\text{permutations of }[n+1]\text{ with 2 cycles}\}.

To see why this might be, consider the number of permutations of [n+1] with two cycles of length k and n+1-k, where for now we assume these are not equal. Then the number of such permutations is

\binom{n+1}{k}(k-1)! (n-k)!=\frac{(n+1)!}{k\cdot(n-k+1)}=n!\left[\frac{n+1}{k\cdot(n-k+1)}\right]= n!\left[\frac{1}{k}+\frac{1}{n-k+1}\right].

After checking the case k=n+1/2 if applicable, and making sure all the bounds match up correctly, the result follows.