Hoeffding’s inequality and convex ordering

A number of previous posts on this blog have discussed Markov’s inequality, and the control one can obtain on a distribution by applying this to functions of the distribution, or functions of random variables with this distribution. In particular, studying the square of the deviation from the mean \left(X-\mathbb{E}X\right)^2 leads to Chebyshev’s inequality, while even stronger bounds can sometimes be obtained using e^{tX}, for some appropriate value of t\in\mathbb{R}. Sometimes this final method, which relies on the finiteness of \mathbb{E}[e^{tX}] is referred to as a Chernoff bound, especially when an attempt is made to choose the optimal value of t for the estimate under consideration.

The case of a random walk, that is the sum of IID increments, is particularly well-studied. In [Ho63], Hoeffding studies a generalisation where the increments are merely independent, and outlines a further generalisation to discrete-time martingales, which is developed by Azuma in [Az67], leading to the famous Azuma-Hoeffding inequality. A key feature of this theorem is that we require bounded increments, which leads to tight choices of t_i in the Chernoff bound for each increment.

In addition, [Ho63] addresses the alternative model where the increments of a random walk are chosen uniformly without replacement from a particular set. The potted summary is that the sum of random increments chosen without replacement has the same mean, but is more concentrated that the corresponding sum of random increments chosen with replacement. This means that any of the concentration results proved in the earlier sections of [Ho63] for the latter situation apply equally to the setting without replacement.

Various versions of this result are referred to as Hoeffding’s inequality by the literature, and the main goal of this short article is to list some results which are definitely true about this setting!

(Note, in some contexts, ‘Hoeffding’s inequality’ will mean the concentration bound for either the with-replacement or the without-replacement setting. In this blog post, we are discussing inequalities comparing these two settings.)

Starting point – second moments in ‘vanilla’ Hoeffding

We won’t use this notation heavily, but for now let’s assume that we are studying a (multi)set of values \mathcal{X}=\{x_1,\ldots,x_N\}, and will study a sequence of m elements (X_1,\ldots,X_m) from this set with replacement (ie chosen in an IID fashion), and a sequence of m elements (Y_1,\ldots,Y_m) chosen from this set without replacement.

Some preliminary remarks:

  • The Y_is are still chosen uniformly which can be interpreted as
    • Any permutation of the multiset \{x_1,\ldots,x_n\} is equally likely to appear as (Y_1,\ldots,Y_n);
    • If determining the sequence one element at a time, then in a conditional sense, the choice of the ‘next’ element is uniform among those elements which have not yet been chosen.
  • In particular, the sequence (Y_1,\ldots,Y_m) is exchangeable (though not knowing what that means will not be a significant impediment to what follows).
  • It is clear that \mathbb{E}[X_1+\ldots+X_m]=\mathbb{E}[Y_1+\ldots+Y_m], so concentration around this common mean is the next question of interest.
  • Intuitively, when m\ll N, we wouldn’t expect there to be much difference between sampling with or without replacement.
  • When sampling without replacement, the cases m and N-m are symmetric with respect to the sum of \mathcal{X} and, in particular, have the same concentration.

As a brief warmup, we establish that

\mathrm{Var}(Y_1+\ldots+Y_m)\le \mathrm{Var}(X_1+\ldots+X_m).

Note that the RHS is m\sigma^2, where \sigma^2<\infty is the variance of a single uniform sample from \mathcal{X}. Since these sums have the same mean, it suffices to show that

\mathbb{E}\left[ (Y_1+\ldots+Y_m)^2\right] \le \mathbb{E}\left[ (X_1+\ldots+X_m)^2\right].

But this really has nothing to do with probability now, since these quantities are just symmetric polynomials in \mathcal{X}. That said, we gain a little bit if we use the exchangeability property to rewrite the LHS and the RHS as

n\mathbb{E}[Y_1^2] + n(n-1)\mathbb{E}[Y_1Y_2] \le n\mathbb{E}[X_1^2] + n(n-1)\mathbb{E}[X_1X_2].

Since Y_1\stackrel{d}= X_1, it remains to handle the cross term, which involves showing that

\frac{1}{N(N-1)}\sum_{i\ne j} x_ix_j \le \frac{1}{N^2} \sum_{i,j}x_ix_j,

which really is an algebra exercise, and can be demonstrated by the rearrangement inequality, or Chebyshev’s inequality (the other one – ie sequence-valued FKG not the probabilistic second-moment bound one), or by clever pre-multiplying and cancelling to reduce to Cauchy-Schwarz.

In [BPS18], which we will turn to in much greater detail shortly, the authors assert in the first sentence that Hoeffding treats also the setting of weighted sums, that is a comparison of

\alpha_1X_1+\ldots+\alpha_mX_m,\quad \alpha_1Y_1+\ldots+\alpha_mY_m.

A casual reader of [Ho63] may struggle initially to identify that this has indeed happened. At the level of second-moment comparison though, the argument given above carries over essentially immediately. We have

\mathbb{E}\left[ (\alpha_1Y_1+\ldots+\alpha_n Y_n)^2\right] = \left(\alpha_1^2+\ldots+\alpha_n^2\right) \mathbb{E}[Y_1^2] + \left(\sum_{i\ne j} \alpha_i\alpha_j \right)\mathbb{E}\left[Y_1Y_2\right],

and of course the corresponding decomposition for (X_i), from which the variance comparison inequality follows as before.

Convex ordering

For many applications, variance is sufficient as a measure of concentration, but one can say more. As a prelude, note that one characterisation of stochastic ordering (which I have sometimes referred to as stochastic domination on this blog) is that

X\le_{st}Y \;\iff\;\mathbb{E}[f(X)] \le \mathbb{E}[f(Y)],

whenever f is a non-decreasing function. Unsurprisingly, it’s the \Leftarrow direction which is harder to prove, though often the \Rightarrow direction is the most useful in applications.

[To inform what follows, I used a number of references, in particular [BM05].]

A natural extension to handle concentration is (stochastic) convex ordering, which is defined as

X\le_{cx} Y\;\iff \; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],

whenever f is a convex function. Note immediately that this definition can only ever apply to random variables for which \mathbb{E}X=\mathbb{E}Y. Otherwise, consider the functions f(x)=x, f(x)=-x, both of which are convex.

This sort of comparison has plenty of application in economics and actuarial fields, where there is a notion of risk-aversion, and convexity is the usual criterion to define risk (in the sense that the danger of a significant loss is viewed as more significant than a corresponding possibility of a significant gain). Also note that in this context, X is preferred to Y by a risk-averse observer.

In fact, it can be shown [Oh69] that a sufficient condition for X\le_{cx} Y is the existence of \alpha\in\mathbb{R} such that

F_X(x)\le F_Y(x)\;\iff\; x\ge \alpha,

where F_X(x)=\mathbb{P}(X\le x) is the usual distribution function of X. That is, the distribution functions ‘cross exactly once’. Note that in general \alpha\ne \mathbb{E}X=\mathbb{E}Y.

We will return later to the question of adapting this measure to handle distributions with different expectations, but for now, we will return to the setting of sampling with and without replacement and show that

Y_1+\ldots+Y_m\le_{cx}X_1+\ldots+X_m.

This is the result shown by Hoeffding, but by a more complicated method. We use the observations of [BPS18] which are stated in a more complicated setting which we may return to later. For notational convenience, let’s assume the elements of \mathcal{X} are distinct so that ‘repetitions of x_i‘ has an unambiguous meaning.

We set up a natural coupling of (X_i),(Y_i) as follows. Let (X_i,i\ge 1) be IID uniform samples from \mathcal{X}, and generate (Y_1,\ldots,Y_N) by removing any repetitions from (X_1,X_2,\ldots).

Note then that given the sequence (Y_1,\ldots,Y_m), the multiset of values \{X_1,\ldots,X_m\} definitely includes Y_1 at least once, but otherwise includes each Y_i either zero, one or more times, in a complicated way. However, this distribution doesn’t depend on the values taken by (Y_i), since it is based on uniform sampling. In particular, if we condition instead on the set \{Y_1,\ldots,Y_m\}, we find that

\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, \{Y_1,\ldots,Y_m\}\right] = Y_1+\ldots+Y_m,

and so we may coarsen the conditioning to obtain

\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, Y_1+\ldots+Y_m\right] = Y_1+\ldots+Y_m.

This so-called martingale coupling is very useful for proving convex ordering. You can think of \mathbb{E}[X|Y]=Y as saying “X is generated by Y plus additional randomness”, which is consistent with the notion that Y is more concentrated than X. In any case, it meshes well with Jensen’s inequality, since if f is convex, we obtain

\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] = \mathbb{E}\left[ f(Y)\right].

So we have shown that sampling without replacement is greater than sampling with replacement in the convex ordering, if we consider sums of the samples.

If we want to handle weighted sums, and tried to reproduce the previous argument directly, it would fail, since

\mathbb{E}\left[\alpha_1X_1+\ldots+\alpha_nX_n\,\big|\,\{Y_1,\ldots,Y_m\}\right]=(\alpha_1+\ldots+\alpha_m)(Y_1+\ldots+Y_m),

which is not in general equal to \alpha_1Y_1+\ldots+\alpha_mY_m.

Non-uniform weighted sampling

Of course, it’s not always the case that we want to sample according to the uniform distribution. My recent personal motivation for investigating this setup has been the notion of exploring a graph generated by the configuration model, for which vertices appear for the first time in a random order that is size-biased by their degrees.

In general we can think of each element of \mathcal{X} as having some weight w(i), and at each step, we pick x_i from the remaining elements with probability proportion to w(i). This procedure is known in other contexts as successive sampling. Note that the sequence (Y_1,\ldots,Y_m) is no longer exchangeable. This is clear enough when there are two elements in \mathcal{X}, and w(1)\gg w(2). So it is much more likely that (Y_1,Y_2)=(x_1,x_2) than (Y_1,Y_2)=(x_2,x_1).

In addition, it will in general not be the case that

\mathbb{E}[Y_1+\ldots+Y_m]=\mathbb{E}[X_1+\ldots+X_m],

and this is easily seen again in the above example, where \mathbb{E}[Y_1+Y_2]=x_1+x_2 while \mathbb{E}[X_1+X_2]\approx 2x_1.

However, one can still compare such distributions, with the notion of increasing convex ordering, which is defined as:

X\le_{icx} Y\;\iff\; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],

for all convex non-decreasing functions f. From an economic point view, increasing convex ordering becomes a comment on the relative contribution of losses above some threshold between the two distributions (and that X is better or worse than Y for all such thresholds), and is sometimes called stop-loss order. More formally, as in [BM05] we have:

X\le_{icx}Y\;\iff\; \mathbb{E}[\max(X-t,0)]\le \mathbb{E}[\max(Y-t,0)]\,\forall t\in\mathbb{R}.

In [BPS18], the authors study the case of successive sampling where the weights are monotone-increasing in the values, that is w(i)\ge w(j)\;\iff\; x_i\ge x_j, of which size-biased sampling is a special case. With this setup, the heuristic is that ‘large values are more likely to appear earlier’ in the sample without replacement, and making this notion precise underpins all the work.

The coupling introduced above is particularly useful in this more exotic setting. The authors show that conditional on the event A of the form \{Y_1,\ldots,Y_m\}=\{x_{i_1}\le x_{i_2}\le\ldots\le x_{i_m}\}, one has

\mathbb{P}(X_1=x_{i_j}\,\big|\,\mathbf{A}) \le \mathbb{P}(X_1=x_{i_k}\,\big|\,\mathbf{A}),\text{ precisely when }j\le k.

This is achieved by comparing the explicit probabilities of pairs of orderings for (Y_1,\ldots,Y_m) with x_{i_j},x_{i_k} exchanged. See [BPS18] for this short calculation – a link to the Arxiv version is in the references. So to calculate \mathbb{E}[X_1\,\big|\, \mathbf{A}], we have the setup for the rearrangement inequality / Chebyshev again, as

\mathbb{E}[X_1\,\big|\, \mathbf{A}] = \sum_{j=1}^m x_{i_j}\mathbb{P}(X_1=x_{i_j}) \le \frac{1}{n}\left(\sum_{j=1}^m x_{i_j}\right)\left(\sum \mathbb{P}\right) = \frac{\left(Y_1+\ldots+Y_m\,\big|\,\mathbf{A}\right)}{m}.

As before, letting X,Y be the sums of the samples with and without replacement, respectively, we have \mathbb{E}[X\,\big|\, Y] \ge Y, which the authors term a submartingale coupling. In particular, the argument

\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] \ge \mathbb{E}\left[ f(Y)\right],

works exactly as before to establish convex ordering.

References

[Az67] – Azuma – 1967 – Weighted sums of certain dependent random variables

[BM05] – Bauerle, Muller – 2005 – Stochastic orders and risk measures: consistency and bounds. (Online version here)

[BPS18] – Ben Hamou, Peres, Salez – 2018 – Weighted sampling without replacement (ArXiv version here)

[Ho63] – Hoeffding – 1963 – Probability inequalities for sums of bounded random variables

[Oh69] – Ohlin – 1969 – On a class of measures of dispersion with application to optimal reinsurance

Advertisement

Lecture 8 – Bounds in the critical window

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

Preliminary – positive correlation, Harris inequality

I wrote about independence, association, and the FKG property a long time ago, while I was still an undergraduate taking a first course on Percolation in Cambridge. That post is here. In the lecture I discussed the special case of the FKG inequality applied in the setting of product measure setting, of which the Erdos-Renyi random graph is an example, and which is sometimes referred to as the Harris inequality.

Given two increasing events A and B, say for graphs on [n], then if \mathbb{P} is product measure on the edge set, we have

\mathbb{P}(A\cap B)\ge \mathbb{P}(A)\mathbb{P}(B).

Intuitively, since both A and B are ‘positively-correlated’ with the not-rigorous notion of ‘having more edges’, then are genuinely positively-correlated with each other. We will use this later in the post, in the form \mathbb{E}[X|A]\ge \mathbb{E}[X], whenever X is an increasing RV and A is an increasing event.

The critical window

During the course, we’ve discussed separately the key qualitative features of the random graph G(n,\frac{\lambda}{n}) in the

  • subcritical regime when \lambda<1, for which we showed that all the components are small, in the sense that \frac{1}{n}|L_1| \stackrel{\mathbb{P}}\rightarrow 0, although the same argument would also give |L_1|\le K\log n with high probability if we used stronger Chernoff bounds;
  • supercritical regime when \lambda>1, for which there is a unique giant component , ie that \frac{1}{n}|L_1|\stackrel{\mathbb{P}}\rightarrow \zeta_\lambda>0, the survival probability of a Galton-Watson branching process with Poisson(\lambda) offspring distribution. Arguing for example by a duality argument shows that with high probability all other components are small in the same sense as in the subcritical regime.

In between, of course we should study G(n,\frac{1}{n}), for which it was known that L_1\stackrel{d}\sim n^{2/3},\, L_2\stackrel{d}\sim n^{2/3},\ldots. (*) That is, the largest components are on the scale n^{2/3}, and there are lots of such critical components.

In the early work on random graphs, the story ended roughly there. But in the 80s, these questions were revived, and considerable work by Bollobas and Luczak, among many others, started investigating the critical setting in more detail. In particular, between the subcritical and the supercritical regimes, the ratio \frac{|L_2|}{|L_1|} between the sizes of the largest and second-largest components goes from ‘concentrated on 1’ to ‘concentrated on 0’. So it is reasonable to ask what finer scaling of the edge probability p(n) around \frac{1}{n} should be chosen to see this transition happen.

Critical window

In this lecture, we studied the critical window, describing sequences of probabilities of the form

p(n)=\frac{1+\lambda n^{-1/3}}{n},

where \lambda\in(-\infty,+\infty). (Obviously, this is a different use of \lambda to previous lectures.)

It turns out that as we move \lambda from -\infty to +\infty, this window gives exactly the right scaling to see the transition of \frac{|L_2|}{|L_1|} described above. Work by Bollobas and Luczak and many co-authors and others in the 80s establish a large number of results in this window, but for the purposes of this course, this can be summarised as saying that the critical window has the same scaling behaviour as p(n)=1/n, with a large number of components on the scale \sim n^{2/3} (see (*) earlier), but different scaling limits.

Note: Earlier in the course, we have discussed local limits, in particular for G(n,\lambda/n), where the local limit is a Galton-Watson branching process tree with offspring distribution \mathrm{Poisson}(\lambda). Such local properties are not sufficient to distinguish between different probabilities within the critical window. Although there are lots of critical components, it remains the case that asymptotically almost all vertices are in ‘small components’.

The precise form of the scaling limit for

\frac{1}{n^{2/3}} \left( |L_1|, |L_2|, |L_3|,\ldots \right)

as n\rightarrow\infty was shown by Aldous in 1997, by lifting a scaling limit result for the exploration process, which was discussed in this previous lecture and this one too. Since Brownian motion lies outside the assumed background for this course, we can’t discuss that, so this lecture establishes upper bounds on the correct scale of |L_1| in the critical window. Continue reading

Lecture 7 – The giant component

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we edge into the second half of the course, we are now in a position to return to the question of the phase transition between the subcritical regime \lambda<1 and the supercritical regime \lambda>1 concerning the size of the largest component L_1(G(n,\lambda/n)).

In Lecture 3, we used the exploration process to give upper bounds on the size of this largest component in the subcritical regime. In particular, we showed that

\frac{1}{n}\big| L_1(G(n,\lambda/n)) \big| \stackrel{\mathbb{P}}\rightarrow 0.

If we used slightly stronger random walk concentration estimates (Chernoff bounds rather than 2nd-moment bounds from Chebyshev’s inequality), we could in fact have shown that with high probability the size of this largest component was at most some logarithmic function of n.

In this lecture, we turn to the supercritical regime. In the previous lecture, we defined various forms of weak local limit, and asserted (without attempting the notationally-involved combinatorial calculation) that the random graph G(n,\lambda/n) converges locally weakly in probability to the Galton-Watson tree with \text{Poisson}(\lambda) offspring distribution, as we’ve used informally earlier in the course.

Of course, when \lambda>1, this branching process has strictly positive survival probability \zeta_\lambda>0. At a heuristic level, we imagine that all vertices whose local neighbourhood is ‘infinite’ are in fact part of the same giant component, which should occupy (\zeta_\lambda+o_{\mathbb{P}}(1))n vertices. In its most basic form, the result is

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big|\;\stackrel{\mathbb{P}}\longrightarrow\; \zeta_\lambda,\quad \frac{1}{n}\big|L_2(G(n,\lambda/n))\big| \;\stackrel{\mathbb{P}}\longrightarrow\; 0, (*)

where the second part is a uniqueness result for the giant component.

The usual heuristic for proving this result is that all ‘large’ components must in fact be joined. For example, if there are two giant components, with sizes \approx \alpha n,\approx \beta n, then each time we add a new edge (such an argument is often called ‘sprinkling‘), the probability that these two components are joined is \approx 2ab, and so if we add lots of edges (which happens as we move from edge probability \lambda-\epsilon\mapsto \lambda ) then with high probability these two components get joined.

It is hard to make this argument rigorous, and the normal approach is to show that with high probability there are no components with sizes within a certain intermediate range (say between \Theta(\log n) and n^\alpha) and then show that all larger components are the same by a joint exploration process or a technical sprinkling argument. Cf the books of Bollobas and of Janson, Luczak, Rucinski. See also this blog post (and the next page) for a readable online version of this argument.

I can’t find any version of the following argument, which takes the weak local convergence as an assumption, in the literature, but seems appropriate to this course. It is worth noting that, as we shall see, the method is not hugely robust to adjustments in case one is, for example, seeking stronger estimates on the giant component (eg a CLT).

Anyway, we proceed in three steps:

Step 1: First we show, using the local limit, that for any \epsilon>0,

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \le \zeta_\lambda+\epsilon, with high probability as n\rightarrow\infty.

Step 2: Using a lower bound on the exploration process, for \epsilon>0 small enough

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \ge \epsilon, with high probability.

Step 3: Motivated by duality, we count isolated vertices to show

\mathbb{P}(\epsilon n\le |L_1| \le (\zeta_\lambda-\epsilon)n) \rightarrow 0.

We will return to uniqueness at the end.

Step 1

This step is unsurprising. The local limit gives control on how many vertices are in small components of various sizes, and so gives control on how many vertices are in small components of all finite sizes (taking limits in the right order). This gives a bound on how many vertices can be in the giant component. Continue reading

Lecture 3 – Couplings, comparing distributions

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

In this third lecture, we made our first foray into the scaling regime for G(n,p) which will be the main focus of the course, namely the sparse regime when p=\frac{\lambda}{n}. The goal for today was to give a self-contained proof of the result that in the subcritical setting \lambda<1, there is no giant component, that is, a component supported on a positive proportion of the vertices, with high probability as n\rightarrow\infty.

More formally, we showed that the proportion of vertices contained within the largest component of G(n,\frac{\lambda}{n}) vanishes in probability:

\frac{1}{n} \left| L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right| \stackrel{\mathbb{P}}\longrightarrow 0.

The argument for this result involves an exploration process of a component of the graph. This notion will be developed more formally in future lectures, aiming for good approximation rather than bounding arguments.

But for now, the key observation is that when we ‘explore’ the component of a uniformly chosen vertex v\in[n] outwards from v, at all times the number of ‘children’ of v which haven’t already been considered is ‘at most’ \mathrm{Bin}(n-1,\frac{\lambda}{n}). Since, for example, if we already know that eleven vertices, including the current one w are in C(v), then the distribution of the number of new vertices to be added to consideration because they are directly connected to w has conditional distribution \mathrm{Bin}(n-11,\frac{\lambda}{n}).

Firstly, we want to formalise the notion that this is ‘less than’ \mathrm{Bin}(n,\frac{\lambda}{n}), and also that, so long as we don’t replace 11 by a linear function of n, that \mathrm{Bin}(n-11,\frac{\lambda}{n})\stackrel{d}\approx \mathrm{Poisson}(\lambda).

Couplings to compare distributions

A coupling of two random variables (or distributions) X and Y is a realisation (\hat X,\hat Y) on the same probability space with correct marginals, that is

\hat X\stackrel{d}=X,\quad \hat Y\stackrel{d}=Y.

We saw earlier in the course that we could couple G(n,p) and G(n,q) by simulating both from the same family of uniform random variables, and it’s helpful to think of this in general: ‘constructing the distributions from the same source of randomness’.

Couplings are a useful notion to digest at this point, as they embody a general trend in discrete probability theory. Wherever possible, we try to do as we can with the random objects, before starting any calculations. Think about the connectivity property of G(n,p) as discussed in the previous lecture. This can be expressed directly as a function of p in terms of a large sum, but showing it is an increasing function of p is essentially impossible by computation, whereas this is very straightforward using the coupling.

We will now review how to use couplings to compare distributions.

For a real-valued random variable X, with distribution function F_X, we always have the option to couple with a uniform U(0,1) random variable. That is, when U\sim U[0,1], we have (F_X^{-1}(U)\stackrel{d}= X, where the inverse of the distribution function is defined (in the non-obvious case of atoms) as

F_X^{-1}(u)=\inf\left\{ x\in\mathbb{R}\,:\, F(x)\ge u\right\}.

Note that when the value taken by U increases, so does the value taken by F_X^{-1}(U). This coupling can be used simultaneously on two random variables X and Y, as (F_X^{-1}(U),F_Y^{-1}(U)), to generate a coupling of X and Y.

The total variation distance between two probability measures is

d_{\mathrm{TV}}(\mu,\nu):= \sup_{A}|\mu(A)-\nu(A)|,

with supremum taken over all events in the joint support S of \mu,\nu. This is particularly clear in the case of discrete measures, as then

d_{\mathrm{TV}}(\mu,\nu)=\frac12 \sum_{x\in S} \left| \mu\left(\{x\}\right) - \nu\left(\{x\}\right) \right|.

(Think of the difference in heights between the bars, when you plot \mu,\nu simultaneously as a bar graph…)

The total variation distances records how well we can couple two distributions, if we want them to be equal as often as possible. It is therefore a bad measure of distributions with different support. For example, the distributions \delta_0 and \delta_{1/n} are distance 1 apart (the maximum) for all values of n. Similarly, the uniform distribution on [0,1] and the uniform distribution on \{0,1/n,2/n,\ldots, n-1/n, 1\} are also distance 1 apart.

When there is more overlap, the following result is useful.

Proposition: Any coupling (\hat X,\hat Y) of X\sim \mu,\,Y\sim \nu satisfies \mathbb{P}(X=Y)\le 1-d_{\mathrm{TV}}(\mu,\nu), and there exists a coupling such that equality is achieved. Continue reading

Birthday Coincidences and Poisson Approximations

This morning, Facebook was extremely keen to remind me via every available medium that four of my friends celebrate their birthday today. My first thought was that I hope they all enjoy their day, and my second thought was to ask what the chance of this was. I have about 200 Facebook friends, and so this struck me as an unlikely occurrence. But this problem has form, and it felt worthwhile to try some calculations to see if my intuition was well-founded.

rainbowfishcake_compressed

Siméon Denis Poisson celebrated his 234th birthday on 21st June this year.

The classical birthday problem

The starting point is the question: how many friends do you have to have before you expect to start seeing anyone sharing a birthday? There are a ridiculous number of articles about this on the web already, so I will say little, except that I don’t want to call this the ‘birthday paradox’, because it’s not a paradox at all. At best it might be counter-intuitive, but then the moral should be to change our intuition for this type of problem.

Throughout, let’s discount February 29th, as this doesn’t add much. So then, to guarantee having a shared pair of birthdays, you need to have 366 friends. But if you have a mere 23 friends, then the probability of having some pair that share a birthday is slightly greater than a half. The disparity between these two numbers leads to the counter-intuition. Some people might find it helpful to think that instead of counting friends, we should instead be counting pairs of friends, but I don’t personally find this especially helpful.

For me, thinking about the calculation in very slightly more generality is helpful. Here, and throughout, let’s instead take N to be the number of days in a year, and K the number of friends, or kids in the class if you prefer. Then, as usual, it is easier to calculate the probability that no two share a birthday (that is, that all the birthdays are distinct) than the probability that some two share a birthday. We could think of the number of ways to pick the set of birthdays, or we could look at the kids one-at-a-time, and demand that their birthday is not one of those we’ve already seen. Naturally, we get the same answer, that is

\frac{^N P_K}{N^K} = 1\cdot \frac{N-1}{N}\cdot\ldots \frac{N-K+1}{N}.

We’ve assumed here that all birthdates are equally likely. We’ll come back to this assumption right at the end. For now, let’s assume that both N and K are large, and we’ll try to decide roughly how large K has to be in relation to N for this answer to be away from 0 and 1. If we pair opposite terms up, we might approximate this by

(\frac{N-\frac{K}{2}}{N})^K = (1-\frac{K}{2N})^K\approx e^{-K^2/2N}.

In fact, AM-GM says that this is an overestimate, and a bit more care can be used to show that this is a good-approximation to first order. So we see that if K=\Theta(\sqrt{N}) for large N, we get a non-trivial limit.

Challenges for four-way shared birthdays

So the original problem I posed is harder, because there isn’t (unless I’m missing something) a natural way to choose birthdays one-at-a-time, or describe the set of suitable birthday sets. There are two major obstacles in a calculation such as this. Firstly, the overlap of people, that is we might have five or more birthdays overlapping; secondly, the overlap of days, that is we might have several days with four (or more) birthdays. We’ll end up worrying more about the second situation.

We start by eliminating both problems, by asking for the probability that exactly four friends are born on January 1st. The general form of this probability is \frac{\binom{K}{4} }{N^4} \cdot (\frac{N-1}{N})^{K-4}. Now, if K\ll N, this final term should not be significant. Removing this is not exactly the same as specifying the probability that at least four birthdays are on January 1st. But in fact this removal turns a lower bound (because {exactly four}<{at least four}) into an upper (in fact a union) bound. So if the factor being removed is very close to one, we can use whichever expression is more convenient.

In the real life case of N=365, K=200, this term is not negligible. But accounting for this, we get that the probability of exactly four birthdays on 1st January is ~0.0021. Our upper bound on the probability of at least four is ~0.0036.

But now that we know the probability for a given day, can we calculate (1-0.0021)^{365} to estimate the probability that we never have four-overlap? When we did our previous iterative calculation, we were using independence of the different kids’ birthdays. But the event that we have four-overlap on January 1st is not quite independent of the event that we have four-overlap on January 2nd. Why? Well if we know at least four people were born on January 1st, there are fewer people left (potentially) to be born on January 2nd. But maybe this dependence is mild enough that we can ignore it?

We can, however, use some moment estimates. The expected number of days with four-overlap is 365\cdot 0.0021 \approx 0.77. So the probability that there is at least one day with four-overlap is at most ~0.77.

But we really want a lower bound. So, maybe we can borrow exactly the second-moment argument we tried (there for isolated vertices in the random graph) in the previous post? Here, the probability that both January 1st and January 2nd are four-overlapping is

\frac{\binom{K}{4}\binom{K-4}{4}}{N^8}\cdot (\frac{N-2}{N})^{K-8}\approx 4.3\times 10^{-6}.

From this, we can evaluate the expectation of the square of the number of days with four-overlap, and thus find that the variance is ~0.74. So we use Chebyshev, calling this number of days #D for now:

\mathbb{P}(\# D=0)\le \mathbb{P}(|\#D - \mathbb{E}\# D|^2 \ge (\mathbb{E}\# D)^2 ) \le \frac{\mathrm{Var} \# D}{(\mathbb{E} \#D)^2}.

In our case, this unfortunately gives us an upper bound greater than 1 on this probability, and thus a lower bound of zero on the probability that there is at least one day with four-overlap. Which isn’t especially interesting…

Fairly recently, I spoke about the Lovasz Local Lemma, which can be used to find lower bounds on the probabilities of intersections of events, many of which are independent (in a particular precise sense). Perhaps this might be useful here? The natural choice of ‘bad event’ is that particular 4-sets of people share a birthday. There are \binom{K}{4} such events, and each is independent of the collection of \binom{K-4}{4} disjoint events. Thus we can consider using LLL if e\cdot (\binom{K}{4}-\binom{K-4}{4})\cdot 0.0021 \le 1. Unfortunately, this difference of binomial coefficients is large in our example, and so in fact the LHS has order 10^3.

Random number of friends – coupling to a Poisson Process

All of these methods failed because without independence we had to use estimates which were really not tight at all. But we can re-introduce independence if we remove the constraints on the model. Suppose instead of demanding I have K friends, I instead demand that I have a random number of friends, with distribution Poisson(K). Now it is reasonable to assume that for each day, I have a Poisson(K/365) friends with that birthday, independently for each day.

If we end up having exactly K friends with this random choice, then the distribution of the number of 4-overlap days is exactly the same as in the original setup. However, crucially, if we end up having at most K friends with this random choice, the distribution of the number of 4-overlap days is stochastically dominated by the original distribution. So instead let’s assume we have Poisson(L) friends, where L<K, and see how well we can do. For definiteness, we’ll go back to N=365, K=200 now. Let’s say X is the distribution of birthdays in the original model, and \Xi for the distribution of birthdays in the model with a random number of friends

Then

\mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) = 1- \mathbb{P}(\mathrm{Po}(L/365)\le 3)^365. (*)

Now we can write the crucial domination relation as

\mathbb{P}(\exists \ge 4\text{-overlap in }X)\ge \mathbb{P}( \exists \ge 4\text{-overlap in }\Xi \,|\, |\Xi|\le 200),

and then use an inequality version of the law of total probability to bound further as

\ge \frac{ \mathbb{P}(\exists \ge 4\text{-overlap in }\Xi) - \mathbb{P}(|\Xi|>200)}{\mathbb{P}(|\Xi|\le 200)}.

This is a function of L, and in principle we could find its maximum, perhaps as N\rightarrow\infty. Here, though, let’s just take L=365/2 and see what happens. For (*) we get ~0.472.

To estimate \mathbb{P}(\mathrm{Po}(365/2)>200), observe that this event corresponds to 1.4 standard deviations above the mean, so we can approximate using quantiles of the normal distribution, via the CLT. (Obviously this isn’t completely precise, but it could be made precise if we really wanted.) I looked up a table, and this probability is, conveniently for calculations, roughly 0.1. Thus we obtain a lower bound of \frac{0.472-0.1}{0.9}. Allowing for the fairly weak estimates at various points, we still get a lower bound of around 0.4. Which is good, because it shows that my intuition wasn’t right, but that I was in the right ball-park for it being a ‘middle-probability event’.

Remarks and References

– The reason for doing the upper bound for the probability of exact 4-overlap is that the same argument for at-least-4-overlap would have given an upper bound of 1. However, this Poisson Process coupling is also a much better method for obtaining an upper bound on either event.

– Birthdays are not uniformly distributed through the year. The deviation is strong enough that even from the set of birth frequencies (rather than the sequence of birth frequencies), we can reject a null hypothesis of uniformity. Early September is pretty close to the maximum. Two comments: 1) this is the time of year where small variations in birth date have a big effect on education, especially in primary school; 2) we are 37 weeks into the year…

– It is known that 187 friends is the first time the probability of having at-least-4-overlap is greater than ½. You can find the full sequence on OEIS as A014088. I used to have about 650 Facebook friends, before I decided that I’d prefer instead the pleasant surprise of finding out what old acquaintances were up to when I next spoke to them. In this case, the median of the distribution of the largest number sharing a birthday would be seven.

– Eric Weisstein’s article on Mathworld is, in my opinion, the best resource for a mathematician on the first two pages of Google hits by at least two orders of magnitude. In the notation of this article, we were calculating P_4(n=200,d=365). There are also some good general asymptotics, or at least recipes for asymptotics, in equations (17) and (18).

– The paper Methods for Studying Coincidences by Diaconis and Mosteller is, as one might expect, extremely readable, and summarises many results and applications, including several generalisations.

The Inspection Paradox and related topics

In the final class for Applied Probability, we discussed the so-called Inspection Paradox for an arrivals process. We assume that buses, sat, arrive as a Poisson process with rate 1, and consider the size of the interval (between buses) containing some fixed time T. The ‘paradox’ is that the size of this interval is larger in expectation than the average time between buses, which of course is given by an exponential random variable.

As with many paradoxes, it isn’t really that surprising after all. Perhaps what is more surprising is that the difference between the expected observed interval and the expected actual interval time is quite small here. There are several points of interest:

1) The Markov property of the Poisson process is key. In particular, this says that the expectation (and indeed the distribution) of the waiting time for a given customer arriving at T is not dependent on T, even if T is a random variable (or rather, a class of random variables, the stopping times). So certainly the inspection paradox property will hold whenever the process has the Markov property, because the inspected interval contains the inspected waiting time, which is equal in distribution to any fixed interval.

2) Everything gets slightly complicated by the fact that the Poisson process is defined to begin at 0. In particular, it is not reversible. Under the infinitesimal (or even the independent Poisson increments) definition, we can view the Poisson process not as a random non-decreasing function of time, but rather as a random collection of points on the positive reals. With this setup, it is clearly no problem to define instead a random collection of points on all the reals. [If we consider this instead as a random collection of point masses, then this is one of the simplest examples of a random measure, but that’s not hugely relevant here.]

We don’t need to worry too much about what value the Poisson process takes at any given time if we are only looking at increments, but if it makes you more comfortable, you can still safely assume that it is 0 at time 0. Crucially, the construction IS now reversible. The number of points in the interval [s,t] has distribution parameterised by t-s, so we it doesn’t matter which direction we are moving in down the real line. In this case, A_t, the time since the previous arrival, and E_t, the waiting time until the next arrival, are both Exp(1) RVs, as the memorylessness property applies in each time direction.

For the original Poisson process, we actually have A_t stochastically dominated by an Exp(1) distribution, because it is conditioned to be less than or equal to t. So in this case, the expected interval time is some complicated function of t, lying strictly between 1 and 2. In our process extended to the whole real line, the expected interval time is exactly 2.

This idea of extending onto the whole real line explains why we mainly consider delayed renewal processes rather than just normal renewal processes. The condition that we start a holding time at 0 is often not general enough, particularly when the holding times are not exponential and so the process is not memoryless.

3) There is a general size-biasing principle in action here. Roughly speaking, we are more likely to arrive in a large interval than in a small interval. The scaling required is proportional to the length of the interval. Given a density function f(x) of X, we define the size-biased density function to be xf(x). We need to normalise to give a probability distribution, and dividing by the expectation EX is precisely what is needed. Failure to account for when an observation should have the underlying distribution or the size-biased distribution is a common cause of supposed paradoxes. A common example is ‘on average my friends have more friends than I do’. Obviously, various assumption on me and my friends, and how we are drawn from the set of people (and the distribution of number of friends) is required that might not necessarily be entirely accurate in all situations.

In the Poisson process example above, the holding times have density function e^{-x}, so the size-biased density function if xe^{-x}. This corresponds to a \Gamma(2,1) distribution, which may be written as the sum of two independent Exp(1) RVs as suggested above.

4) A further paradox mentioned on the sheet is the waiting time paradox. This says that the expected waiting time is longer if you arrive at a random time than if you just miss a bus. This is not too surprising: consider at least the stereotypical complaint about buses in this country arriving in threes, at least roughly. Then if you just miss a bus, there is a 2/3 chance that another will be turning up essentially immediately. On the sheet, we showed that the \Gamma(\alpha,1) distribution has this property also, provided \alpha<1.

We can probably do slightly better than this. The memoryless property of the exponential distribution says that:

\mathbb{P}(Z>t+s|Z>t)=\mathbb{P}(Z>s).

In general, for the sort of holding times we might see at a bus stop, we might expect it to be the case that if we have waited a long time already, then we are less likely relatively to have to wait a long time more, that is:

\mathbb{P}(Z>t+s|Z>t)\leq\mathbb{P}(Z>s),

and furthermore this will be strict if neither s nor t is 0. I see no reason not to make up a definition, and call this property supermemorylessness. However, for the subclass of Gamma distributions described above, we have the opposite property:

\mathbb{P}(Z>t+s|Z>t)\geq\mathbb{P}(Z>s).

Accordingly, let’s call this submemorylessness. If this is strict, then it says that we are more likely to have to wait a long time if we have already been waiting a long time. This seems contrary to most real-life distributions, but it certainly explains the paradox. If we arrive at a random time, then the appropriate holding time has been ‘waiting’ for a while, so is more likely to require a longer observed wait than if I had arrived as the previous bus departed.

In conclusion, before you think of something as a paradox, think about whether the random variables being compared are being generated in the same way, in particular whether one is size-biased, and whether there are effects due to non-memorylessness.

Independence and Association

Back when we did GCSE probability, we gave a definition of independent events as:

A and B are said to be independent if \mathbb{P}(A)\mathbb{P}(B)=\mathbb{P}(A\cap B).

We might also apply Bayes’ definition of conditional probability to say

\mathbb{P}(A|B)=\mathbb{P}(A)\quad\iff\quad A,B\text{ independent}\quad\iff\quad\mathbb{P}(B|A)=\mathbb{P}(B)

provided all the terms exist. (Eg the definition of \mathbb{P}(B|A) is at the very least non-obvious if the probability of A is 0.) In my opinion, this is a more naturally intuitive definition. For example, I think that when you toss two coins, the fact that the probability of the second coin being a tail is unaffected by whether the first is heads is more naturally ‘obvious’ than the fact that the joint probability of the two events is 1/4.

But, before getting too into anything philosophical, it is worth thinking about an equivalent situation for non-independent events. We remark that by an identical argument to above:

\mathbb{P}(A|B)\geq\mathbb{P}(A)\quad\iff\quad \mathbb{P}(A\cap B)\geq\mathbb{P}(A)\mathbb{P}(B)\quad\iff\quad\mathbb{P}(B|A)\geq\mathbb{P}(B)

Informally, this says that if we know A occurs, it increases the likelihood of B occuring. If we were talking about two random variables, we might say that they were positively correlated. But of course, by considering RVs 1_A,1_B, the result above is precisely the statement that the indicator functions have positive correlation.

Aim: To find a sufficient condition for positive correlation of random variables in a product measure.

Consider the following. Suppose A is an event which is positively correlated with the appearance of each edge. We might suspect that two such events A and B would be positively correlated. Instead, we consider a more concrete description. Recall that an event A is a subset of \Omega=\{0,1\}^E. Given w\in\Omega,e\in E, we say w^e\in\Omega defined by taking w and setting edge e to be open (note it may be open already). Now, we say event A is increasing, if

\forall w\in\Omega,\forall e\in E: w\in A\Rightarrow w^e\in A.

Note that this certainly implies the property previously mentioned, but the converse is not necessarily true.

Anyway, our revised aim will be to show that increasing events A and B are positively correlated for product measure.

For now, we approach the problem from the other direction, namely we attempt to find which measures on \{0,1\}^E have the property that A and B are positively correlated for all increasing A, B. Note that as before, we can think of this as \mathbb{E}1_A1_B\geq\mathbb{E}1_A\mathbb{E}1_B, and again here it is useful to rephrase our framework in terms of random variables. There is a natural (product) partial ordering of \Omega=\{0,1\}^E, and from this there is an easy notion of increasing random variables. Recall a random variable is defined as a measurable map \Omega\rightarrow\mathbb{R} so no further work is required.

X is increasing if w\geq w'\Rightarrow X(w)\geq X(w').

So we clarify our aim, which is to find a condition on the measure \mu such that \mu(XY)\geq \mu(X)\mu(Y) for all increasing X, Y. When this occurs, we say \mu is positively associated. Note that this is equivalent to \mu(A\cap B)\geq \mu(A)\mu(B) for all increasing events A, B. Why? We can build up X and Y from increasing indicator functions like \{X\geq x\} in a usual monotone class argument.

On the way, we need a partial ordering on the set of probability measures. Obviously, if \mu(A)\leq \nu(A) for all events A, then in fact \mu=\nu! So instead we say \mu\leq_{st}\nu if \mu(A)\leq \nu(A) for all increasing A. This is called the stochastic ordering, and there is a technical result of Strassen, proving the intuitively obvious claim that if \mu_1\leq \mu_2, then we can couple the measures in a natural way. Formally:

Theorem: \mu_1\leq\mu_2 \iff \exists a probability measure \nu on \Omega^2 such that the marginals are \mu_1,\mu_2 and

\nu(\{(w_1,w_2):w_1\leq w_2\})=1.

Our main result will be the FKG inequality which asserts that when \mu satisfies the following FKG lattice property

\mu(w_1\vee w_2)\mu(w_1\wedge w_2)\geq \mu(w_1)\mu(w_2),\quad\forall w_1,w_2\in\Omega

then \mu is positively associated. We will prove the case |E|<\infty.

We proceed by showing that \mu_1\leq\mu_2\propto Y\mu_1, rescaled, for Y an increasing RV. [Note that we are now suppressing the ‘st’ subscript, as context makes the use clear.]

To show this, we prove the more general Holley’s Theorem:

This states that if two positive probability measures satisfy a related lattice condition:

\mu_2(w_1\vee w_2)\mu_1(w_1\wedge w_2)\geq \mu_1(w_1)\mu_2(w_2)\quad\forall w_1,w_2\in\Omega

then we have the stochastic domination result: \mu_1\leq \mu_2.

Note that the lattice condition states, very informally, that adding edges results in a greater relative increase with respect to the measure \mu_2, which has a natural similarity to the definition of stochastic domination.

We prove this, perhaps unexpectedly, by resorting to a Markov chain. We note that there is a Markov chain on \Omega with equilibrium distribution given by \mu_1. This is simple: the non-zero transition rates are those given by the addition or removal of a single edge. Assume that edges are added at unit rate, and that edges are removed with rate: G(w^e,w_e)=\frac{\mu_1(w_e)}{\mu_1(w^e)}.

Similarly, we can construct a Markov chain on state space \Omega^2, where non-zero transitions are given by the addition of an edge to both states in the pair, the removal of an edge from both states in the pair, and the removal of an edge from only the first edge in the pair. Note that, as before, we may be ‘adding’ an edge which is already present. Assuming we start in this set, this choice means that we are restricting the sample space to \{(\pi,w):\pi\leq w\}. We need the transition rate of the third type of transition to have the form: \frac{\mu_1(\pi_e)}{\mu_1(\pi^e)}-\frac{\mu_2(w_e)}{\mu_2(w^e)}. So the lattice condition precisely confirms that this is non-negative, and thus we have a well-constructed Markov chain. The marginals have equilibrium distributions \mu_1,\mu_2 by construction, and by the general theory of Markov chains, there is an equilibrium distribution, and this leaves us in precisely the right position to apply Strassen to conclude the result.#

Summary of consequences: We have demonstrated that product measure is positively associated, as it certainly satisfies the FKG condition. Recall that this is what we had suspected intuitively for reasons given at the start of this account. Next time, I will talk about the most natural companion result, the BK inequality, and the stronger Reimer’s Inequality.

References: Both the motivation and the material is derived from Prof. Grimmett’s Part III course, Percolation and Related Topics, which was one of the mathematical highlights of the year. This account of the subject is a paraphrase of his lecture notes, which were themselves based on his book Probability on Graphs. Mistakes, naturally, are mine. Background on the course, and an online source of the book can be found on the course website here.