# Hoeffding’s inequality and convex ordering

A number of previous posts on this blog have discussed Markov’s inequality, and the control one can obtain on a distribution by applying this to functions of the distribution, or functions of random variables with this distribution. In particular, studying the square of the deviation from the mean $\left(X-\mathbb{E}X\right)^2$ leads to Chebyshev’s inequality, while even stronger bounds can sometimes be obtained using $e^{tX}$, for some appropriate value of $t\in\mathbb{R}$. Sometimes this final method, which relies on the finiteness of $\mathbb{E}[e^{tX}]$ is referred to as a Chernoff bound, especially when an attempt is made to choose the optimal value of t for the estimate under consideration.

The case of a random walk, that is the sum of IID increments, is particularly well-studied. In [Ho63], Hoeffding studies a generalisation where the increments are merely independent, and outlines a further generalisation to discrete-time martingales, which is developed by Azuma in [Az67], leading to the famous Azuma-Hoeffding inequality. A key feature of this theorem is that we require bounded increments, which leads to tight choices of $t_i$ in the Chernoff bound for each increment.

In addition, [Ho63] addresses the alternative model where the increments of a random walk are chosen uniformly without replacement from a particular set. The potted summary is that the sum of random increments chosen without replacement has the same mean, but is more concentrated that the corresponding sum of random increments chosen with replacement. This means that any of the concentration results proved in the earlier sections of [Ho63] for the latter situation apply equally to the setting without replacement.

Various versions of this result are referred to as Hoeffding’s inequality by the literature, and the main goal of this short article is to list some results which are definitely true about this setting!

(Note, in some contexts, ‘Hoeffding’s inequality’ will mean the concentration bound for either the with-replacement or the without-replacement setting. In this blog post, we are discussing inequalities comparing these two settings.)

Starting point – second moments in ‘vanilla’ Hoeffding

We won’t use this notation heavily, but for now let’s assume that we are studying a (multi)set of values $\mathcal{X}=\{x_1,\ldots,x_N\}$, and will study a sequence of m elements $(X_1,\ldots,X_m)$ from this set with replacement (ie chosen in an IID fashion), and a sequence of m elements $(Y_1,\ldots,Y_m)$ chosen from this set without replacement.

Some preliminary remarks:

• The $Y_i$s are still chosen uniformly which can be interpreted as
• Any permutation of the multiset $\{x_1,\ldots,x_n\}$ is equally likely to appear as $(Y_1,\ldots,Y_n)$;
• If determining the sequence one element at a time, then in a conditional sense, the choice of the ‘next’ element is uniform among those elements which have not yet been chosen.
• In particular, the sequence $(Y_1,\ldots,Y_m)$ is exchangeable (though not knowing what that means will not be a significant impediment to what follows).
• It is clear that $\mathbb{E}[X_1+\ldots+X_m]=\mathbb{E}[Y_1+\ldots+Y_m]$, so concentration around this common mean is the next question of interest.
• Intuitively, when $m\ll N$, we wouldn’t expect there to be much difference between sampling with or without replacement.
• When sampling without replacement, the cases m and N-m are symmetric with respect to the sum of $\mathcal{X}$ and, in particular, have the same concentration.

As a brief warmup, we establish that

$\mathrm{Var}(Y_1+\ldots+Y_m)\le \mathrm{Var}(X_1+\ldots+X_m).$

Note that the RHS is $m\sigma^2$, where $\sigma^2<\infty$ is the variance of a single uniform sample from $\mathcal{X}$. Since these sums have the same mean, it suffices to show that

$\mathbb{E}\left[ (Y_1+\ldots+Y_m)^2\right] \le \mathbb{E}\left[ (X_1+\ldots+X_m)^2\right].$

But this really has nothing to do with probability now, since these quantities are just symmetric polynomials in $\mathcal{X}$. That said, we gain a little bit if we use the exchangeability property to rewrite the LHS and the RHS as

$n\mathbb{E}[Y_1^2] + n(n-1)\mathbb{E}[Y_1Y_2] \le n\mathbb{E}[X_1^2] + n(n-1)\mathbb{E}[X_1X_2].$

Since $Y_1\stackrel{d}= X_1$, it remains to handle the cross term, which involves showing that

$\frac{1}{N(N-1)}\sum_{i\ne j} x_ix_j \le \frac{1}{N^2} \sum_{i,j}x_ix_j,$

which really is an algebra exercise, and can be demonstrated by the rearrangement inequality, or Chebyshev’s inequality (the other one – ie sequence-valued FKG not the probabilistic second-moment bound one), or by clever pre-multiplying and cancelling to reduce to Cauchy-Schwarz.

In [BPS18], which we will turn to in much greater detail shortly, the authors assert in the first sentence that Hoeffding treats also the setting of weighted sums, that is a comparison of

$\alpha_1X_1+\ldots+\alpha_mX_m,\quad \alpha_1Y_1+\ldots+\alpha_mY_m.$

A casual reader of [Ho63] may struggle initially to identify that this has indeed happened. At the level of second-moment comparison though, the argument given above carries over essentially immediately. We have

$\mathbb{E}\left[ (\alpha_1Y_1+\ldots+\alpha_n Y_n)^2\right] = \left(\alpha_1^2+\ldots+\alpha_n^2\right) \mathbb{E}[Y_1^2] + \left(\sum_{i\ne j} \alpha_i\alpha_j \right)\mathbb{E}\left[Y_1Y_2\right],$

and of course the corresponding decomposition for $(X_i)$, from which the variance comparison inequality follows as before.

Convex ordering

For many applications, variance is sufficient as a measure of concentration, but one can say more. As a prelude, note that one characterisation of stochastic ordering (which I have sometimes referred to as stochastic domination on this blog) is that

$X\le_{st}Y \;\iff\;\mathbb{E}[f(X)] \le \mathbb{E}[f(Y)],$

whenever f is a non-decreasing function. Unsurprisingly, it’s the $\Leftarrow$ direction which is harder to prove, though often the $\Rightarrow$ direction is the most useful in applications.

[To inform what follows, I used a number of references, in particular [BM05].]

A natural extension to handle concentration is (stochastic) convex ordering, which is defined as

$X\le_{cx} Y\;\iff \; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],$

whenever f is a convex function. Note immediately that this definition can only ever apply to random variables for which $\mathbb{E}X=\mathbb{E}Y$. Otherwise, consider the functions $f(x)=x, f(x)=-x$, both of which are convex.

This sort of comparison has plenty of application in economics and actuarial fields, where there is a notion of risk-aversion, and convexity is the usual criterion to define risk (in the sense that the danger of a significant loss is viewed as more significant than a corresponding possibility of a significant gain). Also note that in this context, X is preferred to Y by a risk-averse observer.

In fact, it can be shown [Oh69] that a sufficient condition for $X\le_{cx} Y$ is the existence of $\alpha\in\mathbb{R}$ such that

$F_X(x)\le F_Y(x)\;\iff\; x\ge \alpha,$

where $F_X(x)=\mathbb{P}(X\le x)$ is the usual distribution function of X. That is, the distribution functions ‘cross exactly once’. Note that in general $\alpha\ne \mathbb{E}X=\mathbb{E}Y$.

We will return later to the question of adapting this measure to handle distributions with different expectations, but for now, we will return to the setting of sampling with and without replacement and show that

$Y_1+\ldots+Y_m\le_{cx}X_1+\ldots+X_m.$

This is the result shown by Hoeffding, but by a more complicated method. We use the observations of [BPS18] which are stated in a more complicated setting which we may return to later. For notational convenience, let’s assume the elements of $\mathcal{X}$ are distinct so that ‘repetitions of $x_i$‘ has an unambiguous meaning.

We set up a natural coupling of $(X_i),(Y_i)$ as follows. Let $(X_i,i\ge 1)$ be IID uniform samples from $\mathcal{X}$, and generate $(Y_1,\ldots,Y_N)$ by removing any repetitions from $(X_1,X_2,\ldots)$.

Note then that given the sequence $(Y_1,\ldots,Y_m)$, the multiset of values $\{X_1,\ldots,X_m\}$ definitely includes $Y_1$ at least once, but otherwise includes each $Y_i$ either zero, one or more times, in a complicated way. However, this distribution doesn’t depend on the values taken by $(Y_i)$, since it is based on uniform sampling. In particular, if we condition instead on the set $\{Y_1,\ldots,Y_m\}$, we find that

$\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, \{Y_1,\ldots,Y_m\}\right] = Y_1+\ldots+Y_m,$

and so we may coarsen the conditioning to obtain

$\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, Y_1+\ldots+Y_m\right] = Y_1+\ldots+Y_m.$

This so-called martingale coupling is very useful for proving convex ordering. You can think of $\mathbb{E}[X|Y]=Y$ as saying “X is generated by Y plus additional randomness”, which is consistent with the notion that Y is more concentrated than X. In any case, it meshes well with Jensen’s inequality, since if f is convex, we obtain

$\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] = \mathbb{E}\left[ f(Y)\right].$

So we have shown that sampling without replacement is greater than sampling with replacement in the convex ordering, if we consider sums of the samples.

If we want to handle weighted sums, and tried to reproduce the previous argument directly, it would fail, since

$\mathbb{E}\left[\alpha_1X_1+\ldots+\alpha_nX_n\,\big|\,\{Y_1,\ldots,Y_m\}\right]=(\alpha_1+\ldots+\alpha_m)(Y_1+\ldots+Y_m),$

which is not in general equal to $\alpha_1Y_1+\ldots+\alpha_mY_m$.

Non-uniform weighted sampling

Of course, it’s not always the case that we want to sample according to the uniform distribution. My recent personal motivation for investigating this setup has been the notion of exploring a graph generated by the configuration model, for which vertices appear for the first time in a random order that is size-biased by their degrees.

In general we can think of each element of $\mathcal{X}$ as having some weight w(i), and at each step, we pick $x_i$ from the remaining elements with probability proportion to w(i). This procedure is known in other contexts as successive sampling. Note that the sequence $(Y_1,\ldots,Y_m)$ is no longer exchangeable. This is clear enough when there are two elements in $\mathcal{X}$, and $w(1)\gg w(2)$. So it is much more likely that $(Y_1,Y_2)=(x_1,x_2)$ than $(Y_1,Y_2)=(x_2,x_1)$.

In addition, it will in general not be the case that

$\mathbb{E}[Y_1+\ldots+Y_m]=\mathbb{E}[X_1+\ldots+X_m],$

and this is easily seen again in the above example, where $\mathbb{E}[Y_1+Y_2]=x_1+x_2$ while $\mathbb{E}[X_1+X_2]\approx 2x_1$.

However, one can still compare such distributions, with the notion of increasing convex ordering, which is defined as:

$X\le_{icx} Y\;\iff\; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],$

for all convex non-decreasing functions f. From an economic point view, increasing convex ordering becomes a comment on the relative contribution of losses above some threshold between the two distributions (and that X is better or worse than Y for all such thresholds), and is sometimes called stop-loss order. More formally, as in [BM05] we have:

$X\le_{icx}Y\;\iff\; \mathbb{E}[\max(X-t,0)]\le \mathbb{E}[\max(Y-t,0)]\,\forall t\in\mathbb{R}.$

In [BPS18], the authors study the case of successive sampling where the weights are monotone-increasing in the values, that is $w(i)\ge w(j)\;\iff\; x_i\ge x_j$, of which size-biased sampling is a special case. With this setup, the heuristic is that ‘large values are more likely to appear earlier’ in the sample without replacement, and making this notion precise underpins all the work.

The coupling introduced above is particularly useful in this more exotic setting. The authors show that conditional on the event A of the form $\{Y_1,\ldots,Y_m\}=\{x_{i_1}\le x_{i_2}\le\ldots\le x_{i_m}\}$, one has

$\mathbb{P}(X_1=x_{i_j}\,\big|\,\mathbf{A}) \le \mathbb{P}(X_1=x_{i_k}\,\big|\,\mathbf{A}),\text{ precisely when }j\le k.$

This is achieved by comparing the explicit probabilities of pairs of orderings for $(Y_1,\ldots,Y_m)$ with $x_{i_j},x_{i_k}$ exchanged. See [BPS18] for this short calculation – a link to the Arxiv version is in the references. So to calculate $\mathbb{E}[X_1\,\big|\, \mathbf{A}]$, we have the setup for the rearrangement inequality / Chebyshev again, as

$\mathbb{E}[X_1\,\big|\, \mathbf{A}] = \sum_{j=1}^m x_{i_j}\mathbb{P}(X_1=x_{i_j}) \le \frac{1}{n}\left(\sum_{j=1}^m x_{i_j}\right)\left(\sum \mathbb{P}\right) = \frac{\left(Y_1+\ldots+Y_m\,\big|\,\mathbf{A}\right)}{m}.$

As before, letting $X,Y$ be the sums of the samples with and without replacement, respectively, we have $\mathbb{E}[X\,\big|\, Y] \ge Y$, which the authors term a submartingale coupling. In particular, the argument

$\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] \ge \mathbb{E}\left[ f(Y)\right],$

works exactly as before to establish convex ordering.

References

[Az67] – Azuma – 1967 – Weighted sums of certain dependent random variables

[BM05] – Bauerle, Muller – 2005 – Stochastic orders and risk measures: consistency and bounds. (Online version here)

[BPS18] – Ben Hamou, Peres, Salez – 2018 – Weighted sampling without replacement (ArXiv version here)

[Ho63] – Hoeffding – 1963 – Probability inequalities for sums of bounded random variables

[Oh69] – Ohlin – 1969 – On a class of measures of dispersion with application to optimal reinsurance

At the recent IMO in Hong Kong, there were several moments where the deputy leaders had to hang around, and I spent some of these moments discussing the following problem with Stephen Mackereth, my counterpart from New Zealand. He’s a mathematically-trained philosopher, so has a similar level of skepticism to me, but for different reasons, regarding supposed paradoxes in probability. Because, as we will see shortly, I don’t think this is a paradox in even the slightest fashion, I think there’s probably too much written about this on the internet already. So I’m aware that contributing further to this oeuvre is hypocritical, but we did the thinking in HKUST’s apparently famous Einstein Cafe, so it makes sense to write down the thoughts.

[And then forget about it for eight weeks. Oops.]

Here’s the situation. A cryptic friend gives you an envelope containing some sum of money, and shows you a second envelope. They then inform you that one of the envelopes contains twice as much money as the other. It’s implicit in this that the choice of which is which is uniform. You have the option to switch envelopes. Should you?

The supposed paradox arises by considering the amount in your envelope, say X. In the absence of further information, it is equally likely that the other envelope contains X/2 as 2X. Therefore, the average value of the other envelope is

$\frac12 \left(\frac{X}{2}+2X \right)= \frac54 X > X.$

So you should switch, since on average you gain money. But this is paradoxical, since the assignment of larger and smaller sums was uniform, so switching envelope should make no difference.

Probabilistic setup

This is not supposed to be a problem on a first-year probability exercise sheet. It’s supposed to be conducive to light discussion. So saying “I won’t engage with this problem until you tell me what the probability space is” doesn’t go down terribly well. But it is important to work out what is random, and what isn’t.

There are two sources of randomness, or at least ignorance. Firstly, there is the pair of values contained in the envelopes. Secondly, there is the assignment of this pair of values to the two envelopes. The second is a source of randomness, and this problem is founded on the premise that this second stage is ‘symmetric enough’ to smooth over any complications in the first stage. If we think that probability isn’t broken (and that’s what I think), then the answer is probably that the second stage isn’t symmetric enough.

Or, that the first stage isn’t very well-defined. In what follows, I’m going to make the second stage very symmetric, at the expense of setting up the first stage in what seems to me a reasonable way using the conventional language of probability theory to record our ignorance about the values in play.

So what’s the first stage? We must have a set of possible pairs of values taken by the envelopes. Let’s call this A, so

$A\subset \mathbb{A}:=\{(x,2x)\,:\, x\in (0,\infty)\}.$

Maybe we know what A is, but maybe we don’t, in which we should take $A=\mathbb{A}$, on the grounds that any pair is possible. Suppose that your friend has chosen the pair of values according to some distribution on $\mathbb{A}$, which we’ll assume has a density f, which is known by you. Maybe this isn’t the actual density, but it serves perfectly well if you treat it as *your* opinion on the likelihood. Then this actually does reduce to a problem along the lines of first-year probability, whether or not you get to see the amount in your envelope.

Suppose first that you do get to see the amount, and that it is x. Then the conditional probabilities that the pair is (x/2,x) or (x,2x) are, respectively

$\frac{f(x/2,x)}{f(x/2,x)+f(x,2x)},\quad \frac{f(x,2x)}{f(x/2,x)+f(x,2x)}.$

So you can work out your expected gain by switching, and decide accordingly. If you don’t know the value in your envelope, you can still work out the probability that it is better (in expectation) to switch, but this isn’t really a hugely meaningful measure, unless it is zero or one.

It’s worth noting that if you can view inside your envelope, and you know A has a particular form, then the game becomes determined. For example, if

$A\subset \{(n,2n), n\text{ an odd integer}\},$

then life is very easy. If you open your envelope and see an odd integer, you should switch, and if you see an even integer you shouldn’t.

We’ll return at the end to discuss a case where it is always better to switch, and why this isn’t actually a paradox.

Improper prior and paradox of resampling when $\mathbb{E}=\infty$

For now though, let’s assume that we don’t know anything about the amounts of money in the envelopes. Earlier, we said that “in the absence of further information, it is equally likely that the other envelope contains X/2 as 2X”. In the language of a distribution on $\mathbb{A}$, we are taking the uniform measure. Of course this not a distribution, in the same way that there isn’t a uniform distribution on the positive reals.

However, if this is your belief about the values in the pair of envelopes, what do you think is the mean value of the content of your envelope? Well, you think all values are equally likely. So, even though this isn’t a distribution, you pretty much think the value of your envelope has infinite expectation.

[This is where the philosophy comes in I guess. Is (expressing uniform ignorance about the content of the other envelope given knowledge of your own) the same as (expressing uniform ignorance of both envelopes at the beginning)? I think it is, even though it has a different consequence here, since the former can be turned into a proper distribution, whereas the latter cannot.]

Let’s briefly consider an alternative example. It’s fairly easy to conjure up distributions which are almost surely finite but which have infinite expectation. For example $\mathbb{P}(X=2^k)=2^{-k}$ for k=1,2,…, which is the content of the *St. Petersburg paradox*, another supposed paradox in probability, but one whose resolution is a bit more clear.

Anyway, let X and Y be independent copies of such a distribution. Now suppose your friend offers you an envelope containing amount X. You look at the value, and then you are offered the opportunity to switch to an envelope containing amount Y. Should you?

Well, if expectation is what you care about, then you definitely should. Because with probability one, you are staring at a finite value in your envelope, whereas the other unknown envelope promises infinite expectation, which is certainly larger than the value that you’re looking at.

Is this also a paradox? I definitely don’t think it is. The expectation of the content of your envelope is infinite, the expected gain is infinite with probability one, which is consistent with the expected content of the other envelope being infinite. [Note that you don’t want to be claiming that the expectation of X-Y is zero.]

An example density function

As an exercise that isn’t necessarily hugely interesting, let’s assume that f, the distribution of the smaller of the pair, is $\mathrm{Exp}(\lambda)$. So the mean of this smaller number is $1/\lambda$. Then, conditional on seeing x in my envelope, the expected value of the number in the other envelope is

$\frac{\frac{x}{2} e^{-\lambda x/2} + 2x e^{-\lambda x}}{e^{-\lambda x/2}+ e^{-\lambda x}}.$ (*)

Some straightforward manipulation shows that this quantity is at least x (implying it’s advantageous to switch) precisely when

$e^{-\lambda x/2}\ge \frac12.$

That is, when $x\le \frac{2\log 2}{\lambda}$. The shape of this interval should fit our intuition, namely that the optimal strategy should be to switch if the value in your envelope is small enough.

The point of doing this calculation is to emphasise that it ceases to be an interesting problem, and certainly ceases to be a paradox of any kind, once we specify f concretely. It doesn’t matter whether this is some true distribution (ie the friend is genuinely sampling the values somehow at random), or rather a perceived likelihood (that happens to be normalisable).

What if you should always switch?

The statement of the paradox only really has any bite if the suggestion is that we should always switch. Earlier, we discussed potential objections to considering the uniform prior in this setting, but what about other possible distributions f which might lead to this conclusion?

As at (*), we can conclude that when $f(x)+f(x/2)>0$, we should switch on seeing x precisely if

$f(x)\ge 2f\left(\frac{x}{2}\right).$

Therefore, partitioning the support of f into a collection of geometric sequences with exponent 2, it is clear that the mean of f is infinite if everything is integer-valued. If f is real-valued, there are some complications, but so long as everything is measurable, the same conclusion will hold.

So the you-should-switch-given-x strategy can only hold for all values of x if f has infinite mean. This pretty much wraps up my feelings. If the mean isn’t infinite, the statement of the paradox no longer holds, and if it is infinite, then the paradox dissolves into a statement about trying to order various expectations, all of which are infinite.

Conclusions

Mathematical summary: it’s Bayes. Things may be exchangeable initially, but not once you condition on the value of one of them! Well, not unless you have a very specific prior.

Philosophical summary: everything in my argument depends on the premise that one can always describe the protagonist’s prior opinion on the contents of the pair of envelopes with a (possibly degenerate) distribution. I feel this is reasonable. As soon as you write down $\frac12 \cdot\frac{x}{2} + \frac12 \cdot2x$, you are doing a conditional expectation, and it’s got to be conditional with respect to something. Here it’s the uniform prior, or at least the uniform prior restricted to the set of values that are now possible given the revelation of your number.

Second mathematical summary: once you are working with the uniform prior, or any measure with infinite mean, there’s no reason why

$\mathbb{E}\left[X|Y\right]>Y,$

with probability one (in terms of Y) should be surprising, since the LHS is (almost-surely) infinite while the RHS is almost surely finite, despite having infinite mean itself.

# The Chinese Restaurant Process

A couple of months ago I wrote a post about Polya’s Urn, the simplest example of self-reinforcing process. Recall that we have a bag containing black and white balls, and sequentially we draw a ball then replace it together with an additional ball of the same colour. The process is self-reinforcing in the sense that if there is a surplus of black balls, the dynamics will reinforce this by adding more black balls than white balls. Alternatively, you can think of a natural limit process when the number of balls is large, for which any distribution is an invariant distribution. We have seen models such as the Preferential Attachment dynamics for network creation, where the degrees of vertices clearly have this self-reinforcing property. New vertices are more likely to join to existing vertices with large degrees.

One difference between the Polya Urn and some of the models we might be interested in for applications is that for the urn model, the number of classes (in this context colours of balls) is fixed. In many applications, we will want to allow new classes to appear. In the process which follows, we will allow this, and the new classes will have initial size equal to 1, so will be at a disadvantage for the self-reinforcing dynamics. Nonetheless, some will show up in a meaningful way in the limit. It is worth emphasising that Polya’s Urn gave us the Dirichlet distribution in the limit, and this can be thought of as a partition of [0,1]. These more general processes will give us a more interesting family of partitions, called the Poisson-Dirichlet distributions. These will turn up in a wide variety of contexts, and this is perhaps the friendliest way to introduce them.

The model is this. We start with a single diner who sits at the first table. Then whenever the (n+1)th diner arrives, they join a table with k diners already with probability k/n+1, and they start a new table with probability 1/n+1.

(Aside: I’m not exactly sure how this relates to a Chinese restaurant? It seems more reminiscent of a university dining hall during freshers’ week, but I guess that would be a less catchy name for a model.)

Anyway, the interest in this description lies not in organising seating arrangements. Consider choosing uniformly at random from the set of permutations on [n+1]. Suppose x maps to n+1 and n+1 maps to y. Consider taking the permutation of [n] formed by instead mapping x to y and ignoring n+1. This has the uniform distribution on the set of permutations of [n]. By reversing this procedure, we can construct a uniform permutation of [n+1] from a uniform permutation of [n]. When you do this as a process for n growing, observe that the orbits correspond exactly to tables in the Chinese Restaurant Process. If we wanted the CRP to give all the information about the permutation, we could specify the ordering round each table, by saying that with probability 1/n+1 the new diner sits to the left of any given existing diner.

As a starting point for why this is a useful description of the uniform permutation distribution, observe that the size of the component containing the element 1 evolves as a Polya Urn with initial vector (1,1). The second 1 in the initial vector corresponds to the possibility of starting a new table, which is maintained at every stage. This tells us immediately that as n grows to infinity, the proportion of elements in the same cycle as 1 in the uniform permutation converges in distribution to U[0,1]. The construction also allows for an easy proof that the expected number of cycles is roughly log n for large n, since on each pass of the process, the probability that there is a new cycle formed is 1/k.

In this case, the partition induced on [n] by the process is clearly exchangeable given our permutation description. However, this will turn out to hold in greater generality. Note also,, that conditional on the size of the cycle containing 1, the sizes of the remaining cycles are given by a uniform permutation on a smaller number of elements. So the limiting result holds jointly in the first k cycle sizes for all k. More precisely, if $(N_1,N_2,\ldots)$ are the cycle sizes ordered by least element, then the frequencies converge to:

$(U_1,(1-U_1)U_2,(1-U_1)(1-U_2)U_3,\ldots),$

where the Us are independent U[0,1] RVs. This is known as aÂ stick-breaking procedure, where at each step we break off some proportion of the stick according to a fixed distribution, and assemble the pieces into a partition.

We generalise this process to get a two-parameter version. The standard notation for the parameters is $(\alpha,\theta)$. Then we amend the dynamics. We now have to take into account how many tables are occupied when the (n+1)th diner arrives. If k tables are occupied, and theÂ ith table has $n_i$ diners, then the new one will join this table with probability $\frac{n_i-\alpha}{n+\theta}$, and will start a new table otherwise, so with probability $\frac{\theta+k\alpha}{n+\theta}$. The original process therefore corresponds to parameters (0,1).

First we examine which parameters are possible. If $\alpha<0$, and $m|\alpha|<\theta<(m+1)|\alpha|$, then with high probability the (m+1)th table will eventually be occupied, whereafter the probability of forming a further table will be negative. So we have to demand instead that $\theta$ is an integer multiple of $-\alpha$. Then the number of tables is bounded by this multiple, so for large n, the probability of joining one of the k (fixed) tables is roughly $\frac{n_i}{n}$, so this should behave roughly like the standard Polya Urn. And indeed, the induced frequencies do converge to the Dirichlet distribution with k equal parameters.

Obviously $\alpha$ cannot be greater than 1, otherwise the probability of the second diner joining the first table is negative. If it is equal to 1, then every diner starts a new table, which isn’t very interesting. So we care about $\alpha\in[0,1)$, and for the probability of the second diner starting a new table to be non-negative we require $\theta>-\alpha$.

It turns out that the partitions induced by this process are exchangeable also. We also have a stick-breaking construction, although now the broken proportions are not IID, but distributed as

$U_i\sim \mathrm{Beta}(1-\alpha,\theta+i\alpha),$

with the same notation otherwise. It turns out that under mild assumptions, these are all the infinite exchangeable random partitions with this stick-breaking property.

My initial struggle with this process was to understand what roles $(\alpha,\theta)$ played in a more precise way. It turns out this is best explained through the limit of the partition, but Pitman’s Exercise 3.2.2 does at least give an idea of how such a process with parameters (1/2,0) might naturally arise as a version of an urn model.

3.2.2. Let an urn initially contain two balls of different colours. Draw 1 is a simple draw from the urn with replacement. Thereafter, balls are drawn from the urn, with replacement of the ball drawn, and addition of two more balls as follows. If the ball drawn is of a colour never drawn before, it is replaced together with two additional balls of two distinct new colours, different to the colours of balls already in the urn. Whereas if the ball drawn is of a colour that has been drawn before, it is replaced together with two balls of its own colour.

Let $n_1$ be the number of times a ball of the colour of the first ball drawn (and replaced) is drawn. Let $n_2,n_3,\ldots$ be the number of times balls of each other colour are drawn. Suppose after n draws, we have drawn k colours. (There will be other colours in the bag not yet drawn.) Then, for each drawn colour i, there are $2n_i-1$ balls of that colour in the bag, giving 2n-k in total. But there should be 2n balls in total, so there are k other balls. Then the probability that we see a new colour is k/2n, and the probability that we see colour i again is $\latex \frac{2n_i-1}{2n}=\frac{n_i-1/2}{n}$, which exactly corresponds to the dynamics for PD(1/2,0).

The other question I was puzzled by initially is where does the dust come from in the limit? Recall that in an infinite exchangeable partition, the sum of the frequencies does not need to be 1. The difference between this sum and 1 gives the probability that an element is in a block by itself. Obviously, when the number of tables is bounded (as when $\alpha<0$) this is not an issue, but for positive $\alpha$, this won’t hold. So we need to account for these singletons. The temptation is to imagine that these correspond to tables which are started but never joined. But this use of ‘never’ is not ideal. For each k, the k-th table will eventually include arbitrarily large numbers of diners. But for any finite n, there will likely be some proportion of people dining alone, some in pairs, and so on. So the sum of all of these proportions in the limit gives this dust.

Generalising Polya’s Urn in another direction, if I have time, I might write something about a model which I recently read about on arXiv where the classes are vertices of a graph, and there is dependence between them based on the presence of edges. This might also be a good moment to explain some other generalisations and stochastic approximation methods used to treat them.

REFERENCES

This post is almost entirely a paraphrase of Sections 3.1 and 3.2 from Pitman’s Combinatorial Stochastic Processes, available online here.

# Increments of Random Partitions

The following is problem 2.1.4. from Combinatorial Stochastic Processes:

Let $X_i$ be the indicator of the event that i the least element of some block of an exchangeable random partition $\Pi_n$ of [n]. Show that the joint law of the $(X_i,1\leq i\leq n)$ determines the law of $\Pi_n$.

As Pitman says, this is a result by Serban Nacu, the paper for which can be found here. In this post I’m going to explain what an exchangeable random partition is, how to prove the result, and a couple of consequences.

The starting point is the question ‘what is an exchangeable random partition?’ The most confusing aspect is that there are multiple definitions depending on whether the blocks of the partition are sets or just integers corresponding to a size. Eg, {1,2,4} u {3} is a partition of [4], corresponding to the partition 3+1 of 4. Obviously one induces the other, and in an exchangeable setting the laws of one may determine the laws of the other.

In the second case, we assume 3+1 is the same partition as 1+3. If order does matter then we call it a composition instead. This gets a bit annoying for set partitions, as we don’t want these to be ordered either. But if we want actually to talk about the sets in question we have to give them labels, which becomes an ordering, so we need some canonical way to assign these labels. Typically we will say $\Pi_n=\{A_1,\ldots,A_k\}$, where the curly brackets indicate that we don’t care about order, and we choose the labels by order of appearance, so by increasing order of least elements.

We say that a random partition $\Pi_n$ of [n] is exchangeable if its distribution is invariant the action on partitions induced by the symmetric group. That is, relabelling doesn’t change probabilities. We can express this functionally by saying

$\mathbb{P}(\Pi_n=\{A_1,\ldots,A_k\})=p(|A_1|,\ldots,|A_k|),$

for p a symmetric function. This function is then called the exchangeable partition probability function (EPPF) by Pitman.

Consider a partition of 4 into sets of sizes 3 and 1. There is a danger that this definition looks like it might be saying that the probability that A_1 is the set of size 3 is the same as the probability that A_1 is the set of size 1. This would be a problem because we expect to see some size-biasing to the labelling. Larger sets are more likely to contain small elements, merely because they contain more elements. Fortunately the definition is not broken after all. The statement above makes no reference to the probabilities of seeing various sizes for A_1 etc. For that, we would have to sum over all partitions with that property. It merely says that the partitions:

$\{1,2,3\}\cup\{4\},\quad \{1,2,4\}\cup\{3\},\quad\{1,3,4\}\cup\{2\},\quad \{2,3,4\}\cup\{1\}$

have respective probabilities:

$p(3,1),\quad p(3,1),\quad p(3,1),\quad p(1,3),$

and furthermore these are equal.

Anyway, now let’s turn to the problem. The key idea is that we want to be looking at strings of 0s and 1s that can only arise in one way. For example, the string 10…01 can only arise corresponding to the partitions {1,2,…,n-1} u {n} and {1,2,…,n-2,n} u {n-1}. So now we know p(n-1,1) and so also p(1,n-1). Furthermore, note that 10…0 and 11…1 give the probabilities of 1 block of size n and n blocks of size 1 respectively at once.

So then the string 10…010 can only arise from partitions {1,2,…,n-2,n} u {n-1} or {1,2,…,n-2} u {n-1,n}. We can calculate the probability that it came from the former using the previously found value of p(n-1,1) and a combinatorial weighting, so the remaining probability is given by p(2,n-2). Keep going. It is clear what ‘keep going’ means in the case of p(a,b) but for partitions with more than two blocks it seems a bit more complicated.

Let’s fix k the number of blocks in partitions under consideration, and start talking about compositions, that is $a_1+\ldots+a_k=n$. The problem we might face in trying to generalise the previous argument is that potentially lots of compositions might generate the same sequence of 0s and 1s, so the ‘first time’ we consider a composition might be the same for more than one composition. Trying it out in the case k=3 makes it clear that this is not going to happen, but we need some partial ordering structure to explain why this is the case.

Recall that a composition with k blocks is a sequence $a=(a_1,\ldots,a_k)$ which sums to n. Let’s say a majorizes b if all its partial sums are at least as large. That is $a_1+\ldots+a_l\geq b_1+\ldots+b_l$ for all $1\leq l \leq k$. We say this is strict if at least one of the inequalities is strict. It is not hard to see that if a majorizes b then this is strict unless a = b.

Since we don’t care about ordering, we assume for now that all compositions are arranged in non-increasing order. So we find a partition corresponding to some such composition $a_1,\ldots,a_k$. The partition is:

$\{1,\ldots,a_1\}\cup\{a_1+1,\ldots,a_1+a_2\}\cup\{a_1+a_2+1,\ldots,a_1+a_2+a_3\}\cup\ldots\cup\{n-a_k,\ldots,n\}.$

This generates a sequence of 0s and 1s as describe above, with $a_i-1$ 0s between the i’th 1 and the (i+1)th 1. The claim is that given some composition which admits a partition with this same corresponding sequence, that composition must majorize a. Proof by induction on l. So in fact we can prove Nacu’s result inductively down the partial ordering described. We know the probability of the sequence of 0s and 1s corresponding to the partition of [n] described by assumption. We know the probability of any partition corresponding to a composition which majorizes a by induction, and we know how many partitions with this sequence each such composition generates. Combining all of this, we can find the probability corresponding to a.

Actually I’m not going to say much about consequences of this except to paraphrase very briefly what Nacu says in the paper. One of the neat consequences of this result is that it allows us to prove in a fairly straightforward way that the only infinite family of exchangeable random partitions with independent increments is the so-called Chinese Restaurant process.

Instead of attempting to prove this, I will explain what all the bits mean. First, the Chinese Restaurant process is the main topic of the next chapter of the book, so I won’t say any more about it right now, except that its definition is almost exact what is required to make this particular result true.

We can’t extend the definition of exchangeable to infinite partitions immediately, because considering invariance under the symmetric group on the integers is not very nice, in particular because there’s a danger all the probabilities will end up being zero. Instead, we consider restrictions of the partition to $[n]\subset\mathbb{N}$, and demand that these nest appropriately, and are exchangeable.

Independent increments is a meaningful thing to consider since one way to construct a partition, infinite or otherwise, is to consider elements one at a time in the standard ordering, either adding the new element to an already present block, or starting block. Since 0 or 1 in the increment sequence corresponds precisely to these events, it is meaningful to talk about independent increments.

# Hewitt-Savage Theorem

This final instalment in my exploration of exchangeability gives a stronger version of Kolmogorov’s 0-1 law, and suggests some applications. It is easy to see that the tail sigma-field is a subset of the exchangeable sigma-field. For, if A is a tail event, then it is independent of the first n random variables in the underlying sequence, so in particular, is invariant under permutations of initial segments of the sequence.

Kolmogorov’s 0-1 Law: $(X_n)$ a sequence of independent (not necessarily iid) random variables in some probability space. Define the tail sigma-field $\tau=\cap_n \sigma(X_{n+1},X_{n+2},\ldots)$. Then $\tau$ is trivial; that is, $\forall A\in\tau\; P(A)\in\{0,1\}$.

Proof: Set $\tau_n=\sigma(X_{n+1},X_{n+2},\ldots), F_n=\sigma(X_1,\ldots,X_n)$. Then $F_n$ is independent of $\tau_m$ whenever $m\geq n$. So $F_n$ is independent of $\tau$ for all n, hence so is $\cup_n F_n$, which generates the entire sigma-field $F_\infty$, so this is independent of $\tau$ also. Since $A\in \tau\Rightarrow A\in F_\infty$ trivially, the independence criterion gives $P(A)=P(A\cap A)=P(A)P(A)$, and hence $P(A)\in\{0,1\}$.

Hewitt-Savage 0-1 Law:Â $(X_n)$ a sequence of iid random variables. Then the sigma field of exchangeable events $\mathcal{E}$ is trivial.

Proof:Â Take $A\in\mathcal{E}$, and approximate by $A_n\in F_n, P(A\triangle A_n)\rightarrow 0$ which is possible, since $\cup F_n$ generates the whole sigma-field. Write $A_n=\{(X_1,\ldots,X_n)\in B_n\}$ for later ease of notation. To exploit exchangeability, set $\tilde{A}_n=\{X_{n+1},\ldots,X_{2n}\in B_n\}$, as the permutation of RVs that sends $A_n\mapsto \tilde{A}_n$ leaves A invariant. So $P(\tilde{A}_n\triangle A)=P(A_n\triangle A)\rightarrow 0\Rightarrow P(A_n\cap \tilde{A}_n)\rightarrow P(A)$. But because (X) is iid (Note, this is where we use identical distributions), $P(A_n\cap \tilde{A}_n)=P(A_n)P(\tilde{A}_n)=P(A_n)^2\rightarrow P(A)^2$. Hence $P(A)\in\{0,1\}$.

Application: Given a stochastic process with iid increments, the event that a state is visited infinitely often is in the tail space of the process, however it is not in the tail space of the increments, so Kolmogorov does not apply. It is however an exchangeable event, and so occurs with probability 0 or 1.

References (for this and the related two previous posts):

Kingman – Uses of Exchangeability (1978)

Breiman – Probability, Chapter 3

Zitkovic – Theory of Probability Lecture Notes

# An Exchangeable Law of Large Numbers

In the proof of De Finetti’s Theorem in my last post, I got to a section where I needed to show a particular convergence property of a sequence of exchangeable random variables. For independent identically distributed RVs, we have Kolmogorov’s 0-1 law, and in particular a strong law of large numbers. Does a version of this result hold for exchangeable sequences? As these represent only a mild generalisation of iid sequences, we might hope so. The following argument demonstrates that this is true, as well as providing a natural general proof of De Finetti.

Define $\mathcal{E}_n=\sigma(\{f(X_1,\ldots,X_n): f\text{ symmetric, Borel}\})$, the smallest sigma-field wrt which the first n RVs are exchangeable. Note that $\mathcal{E}_1\supset\mathcal{E}_2\supset\ldots\supset \mathcal{E}=\cap_n\mathcal{E}_n$, the exchangeable sigma-field.

So now take g(X) symmetric in the first n variables. By exchangeability $E[\frac{1}{n}\sum_1^n f(X_j)g(X)]=E[f(X_1)g(X)]$. Now set $g=1_A$, for $A\in\mathcal{E}_n$, and so because the LHS integrand is $\mathcal{E}_n$-meas. we have $Z_n=\frac{1}{n}\sum_1^n f(X_j)=E[f(X_1)|\mathcal{E}_n]$. So Z is a backwards martingale.

We have a convergence theorem for backwards martingales, which tells us that $\lim_n n^{-1}\sum^n f(X_j)$ exists, and in fact $= E[f(X_1)|\mathcal{E}]$ almost surely. Setting $f(X)=1(X\leq x)$ gives that $\lim_n\frac{\#\{X_i\leq x: i\leq n\}}{n}=F(x):=P(X_1\leq x|\mathcal{E})$. We now perform a similar procedure for functions defined on the first k RVs, in an attempt to demonstrate independence.

For $f:\mathbb{R}^k\rightarrow\mathbb{R}$, we seek a backwards martingale, so we take sums over the $n^{(k)}$ ways to choose k of the first n RVs. So $\frac{1}{n(n-1)\ldots(n-k+1)}\sum_{I\subset[n]} f(X_{i_1},\ldots,X_{i_k})$ is a backwards martingale, and hence $E[f(X_1,\ldots,X_k)|\mathcal{E}]=\lim_n \frac{1}{n(n-1)\ldots(n-k+1)}\sum f(-)$. As before, set $f(y_1,\ldots,y_k)=1(y_1\leq x_1)\ldots 1(y_k\leq x_k)$. Crucially, we can replace the falling factorial term with $n^{-k}$ as we are only considering the limit, then exchange summation as everything is positive and nice to get: $E[f(X_1,\ldots,X_k)|\mathcal{E}]=\lim(\frac{1}{n}\sum 1(X_1\leq x_1))\ldots(\frac{1}{n}\sum 1(X_k\leq x_k))$ thus demonstrating independence of $(X_n)$ conditional on $\mathcal{E}$.

So what have we done? Well, we’ve certainly proven de Finetti in the most general case, and we have in addition demonstrated the existence of a Strong Law of Large Numbers for exchangeable sequences, where the limit variable is $\mathcal{E}$-measurable.

# Exchangeability and De Finetti’s Theorem

ExchangeabilityÂ generalises the notion of a sequence of random variables being iid. Essentially, the motivation is that in frequentist statistics data is assumed to be generated by a series of iid RVs with distribution parameterised by some unknown p. The theory for sequences of iid RVs is rich, with laws of large numbers, and limit theorems. However, from a Bayesian perspective, the parameter p has some prior distribution, so the random variables which give the data are no longer independent. That is, each random variable has non-trivial dependence on p, so in general will have non-trivial dependence on each other.

We say a sequence of random variables $X=(X_1,X_2,\ldots)$ is exchangeableÂ if the law of X is invariant under finite permutation of the indices of the sequence. Formally, if for any $\sigma\in S_n, (X_1,\ldots,X_n)\stackrel{d}{=}(X_{\sigma(1)},\ldots,X_{\sigma_n)}$. Note that permutations with non-trivial action on an infinite subset of N are not considered in this definition, as the law of the entire sequence of RVs is generated by the laws of finite subsets of the sequence. For example, take $Y,Y_1,Y_2,\ldots$ iid, and set $X_n=Y+Y_n$. Provided Y has some non-trivial distribution, the sequence X is not iid, but it is exchangeable. Note that, conditional on Y, the sequence X is iid. This is the exact situation as in the Bayesian inference framework, where the RVs are iid conditional on some underlying random parameter. De Finetti’s Theorem gives that this in fact holds for any exchangeable sequence.

Theorem (De Finetti): $X=(X_1,X_2,\ldots)$ an exchangeable sequence of random variables. Then there exists a random probability measure $\mu$ (that is, a RV taking values in the space of probability measures) such that conditional on $\mu,\; X_1,X_2\ldots \stackrel{iid}{\sim}\mu$. Continue reading