# Hoeffding’s inequality and convex ordering

A number of previous posts on this blog have discussed Markov’s inequality, and the control one can obtain on a distribution by applying it to functions of the distribution, or to functions of random variables with this distribution. In particular, studying the squared deviation from the mean $\left(X-\mathbb{E}X\right)^2$ leads to Chebyshev’s inequality, while even stronger bounds can sometimes be obtained using $e^{tX}$, for some appropriate value of $t\in\mathbb{R}$. This final method, which relies on the finiteness of $\mathbb{E}[e^{tX}]$, is sometimes referred to as a Chernoff bound, especially when an attempt is made to choose the optimal value of $t$ for the estimate under consideration.

The case of a random walk, that is the sum of IID increments, is particularly well-studied. In [Ho63], Hoeffding studies a generalisation where the increments are merely independent, and outlines a further generalisation to discrete-time martingales, which is developed by Azuma in [Az67], leading to the famous Azuma-Hoeffding inequality. A key feature of this theorem is that we require bounded increments, which leads to tight choices of $t_i$ in the Chernoff bound for each increment.

In addition, [Ho63] addresses the alternative model where the increments of a random walk are chosen uniformly without replacement from a particular set. The potted summary is that the sum of random increments chosen without replacement has the same mean, but is more concentrated than the corresponding sum of random increments chosen with replacement. This means that any of the concentration results proved in the earlier sections of [Ho63] for the latter situation apply equally in the setting without replacement.

Various versions of this result are referred to as Hoeffding’s inequality in the literature, and the main goal of this short article is to list some results which are definitely true about this setting!

(Note, in some contexts, ‘Hoeffding’s inequality’ will mean the concentration bound for either the with-replacement or the without-replacement setting. In this blog post, we are discussing inequalities comparing these two settings.)

## Starting point – second moments in ‘vanilla’ Hoeffding

We won’t use this notation heavily, but for now let’s assume that we are studying a (multi)set of values $\mathcal{X}=\{x_1,\ldots,x_N\}$, and will study a sequence of m elements $(X_1,\ldots,X_m)$ chosen from this set with replacement (ie in an IID fashion), and a sequence of m elements $(Y_1,\ldots,Y_m)$ chosen from this set without replacement.

Some preliminary remarks:

• The $Y_i$s are still chosen uniformly, which can be interpreted in two ways:
  • any permutation of the multiset $\{x_1,\ldots,x_N\}$ is equally likely to appear as $(Y_1,\ldots,Y_N)$;
  • if determining the sequence one element at a time, then in a conditional sense, the choice of the ‘next’ element is uniform among those elements which have not yet been chosen.
• In particular, the sequence $(Y_1,\ldots,Y_m)$ is exchangeable (though not knowing what that means will not be a significant impediment to what follows).
• It is clear that $\mathbb{E}[X_1+\ldots+X_m]=\mathbb{E}[Y_1+\ldots+Y_m]$, so concentration around this common mean is the next question of interest.
• Intuitively, when $m\ll N$, we wouldn’t expect there to be much difference between sampling with or without replacement.
• When sampling without replacement, the samples of sizes m and N-m are complementary: their sums add up to the (deterministic) total of $\mathcal{X}$, so in particular they have the same concentration.

As a brief warmup, we establish that

$\mathrm{Var}(Y_1+\ldots+Y_m)\le \mathrm{Var}(X_1+\ldots+X_m).$

Note that the RHS is $m\sigma^2$, where $\sigma^2<\infty$ is the variance of a single uniform sample from $\mathcal{X}$. Since these sums have the same mean, it suffices to show that

$\mathbb{E}\left[ (Y_1+\ldots+Y_m)^2\right] \le \mathbb{E}\left[ (X_1+\ldots+X_m)^2\right].$

But this really has nothing to do with probability now, since these quantities are just symmetric polynomials in $\mathcal{X}$. That said, we gain a little bit if we use the exchangeability property to rewrite the LHS and the RHS as

$m\mathbb{E}[Y_1^2] + m(m-1)\mathbb{E}[Y_1Y_2] \le m\mathbb{E}[X_1^2] + m(m-1)\mathbb{E}[X_1X_2].$

Since $Y_1\stackrel{d}= X_1$, it remains to handle the cross term, which involves showing that

$\frac{1}{N(N-1)}\sum_{i\ne j} x_ix_j \le \frac{1}{N^2} \sum_{i,j}x_ix_j,$

which really is an algebra exercise, and can be demonstrated by the rearrangement inequality, or Chebyshev’s sum inequality (the other one – ie the sequence-valued FKG inequality, not the probabilistic second-moment bound), or by clever pre-multiplying and cancelling to reduce to Cauchy-Schwarz.
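As a quick numerical sanity check (the values below are arbitrary), one can verify both the cross-term inequality and the full variance comparison by exact enumeration for a small multiset, sketched here in Python:

```python
import itertools
import statistics

# A small illustrative multiset of values (any numbers would do).
xs = [1.0, 2.0, 2.0, 5.0, -3.0]
N = len(xs)

# Cross terms: without replacement averages x_i * x_j over ordered pairs
# with i != j; with replacement averages over all ordered pairs (i == j allowed).
cross_without = sum(xs[i] * xs[j]
                    for i in range(N) for j in range(N) if i != j) / (N * (N - 1))
cross_with = sum(xs) ** 2 / N ** 2

# Exact check of the variance comparison for m = 3, by enumerating all
# equally likely outcomes: N^m sequences with replacement, N! orderings without.
m = 3
sums_with = [sum(t) for t in itertools.product(xs, repeat=m)]
sums_without = [sum(p[:m]) for p in itertools.permutations(xs)]
var_with = statistics.pvariance(sums_with)
var_without = statistics.pvariance(sums_without)
```

Since everything is enumerated exactly, no Monte Carlo error enters the comparison: the two means agree, and the without-replacement variance is at most the with-replacement one.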

In [BPS18], which we will turn to in much greater detail shortly, the authors assert in the first sentence that Hoeffding also treats the setting of weighted sums, that is a comparison of

$\alpha_1X_1+\ldots+\alpha_mX_m,\quad \alpha_1Y_1+\ldots+\alpha_mY_m.$

A casual reader of [Ho63] may struggle initially to identify that this has indeed happened. At the level of second-moment comparison though, the argument given above carries over essentially immediately. We have

$\mathbb{E}\left[ (\alpha_1Y_1+\ldots+\alpha_m Y_m)^2\right] = \left(\alpha_1^2+\ldots+\alpha_m^2\right) \mathbb{E}[Y_1^2] + \left(\sum_{i\ne j} \alpha_i\alpha_j \right)\mathbb{E}\left[Y_1Y_2\right],$

and of course the corresponding decomposition for $(X_i)$, from which the variance comparison inequality follows as before, at least whenever $\sum_{i\ne j}\alpha_i\alpha_j\ge 0$ (in particular, for non-negative weights).
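The weighted comparison can be checked by the same exact enumeration; here is a sketch with arbitrarily chosen non-negative weights $\alpha_i$ (so that the cross-term sum $\sum_{i\ne j}\alpha_i\alpha_j$ is automatically non-negative):

```python
import itertools
import statistics

# Same illustrative multiset as before, now with (arbitrary) non-negative weights.
xs = [1.0, 2.0, 2.0, 5.0, -3.0]
alphas = [0.5, 1.0, 2.0]
m = len(alphas)

# Weighted sums over all equally likely outcomes of each sampling scheme.
wsums_with = [sum(a * v for a, v in zip(alphas, t))
              for t in itertools.product(xs, repeat=m)]
wsums_without = [sum(a * v for a, v in zip(alphas, p[:m]))
                 for p in itertools.permutations(xs)]
wvar_with = statistics.pvariance(wsums_with)
wvar_without = statistics.pvariance(wsums_without)
```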

## Convex ordering

For many applications, variance is sufficient as a measure of concentration, but one can say more. As a prelude, note that one characterisation of stochastic ordering (which I have sometimes referred to as stochastic domination on this blog) is that

$X\le_{st}Y \;\iff\;\mathbb{E}[f(X)] \le \mathbb{E}[f(Y)],$

whenever f is a non-decreasing function. Unsurprisingly, it’s the $\Leftarrow$ direction which is harder to prove, though often the $\Rightarrow$ direction is the most useful in applications.

[To inform what follows, I used a number of references, in particular [BM05].]

A natural extension to handle concentration is (stochastic) convex ordering, which is defined as

$X\le_{cx} Y\;\iff \; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],$

whenever f is a convex function. Note immediately that this definition can only ever apply to random variables for which $\mathbb{E}X=\mathbb{E}Y$. Otherwise, consider the functions $f(x)=x, f(x)=-x$, both of which are convex.

This sort of comparison has plenty of applications in economics and actuarial fields, where there is a notion of risk-aversion, and convexity is the usual criterion to define risk (in the sense that the danger of a significant loss is viewed as more significant than a corresponding possibility of a significant gain). Note also that in this context, X is preferred to Y by a risk-averse observer.

In fact, it can be shown [Oh69] that a sufficient condition for $X\le_{cx} Y$ is the existence of $\alpha\in\mathbb{R}$ such that

$F_X(x)\ge F_Y(x)\;\iff\; x\ge \alpha,$

where $F_X(x)=\mathbb{P}(X\le x)$ is the usual distribution function of X. That is, the distribution functions ‘cross exactly once’. Note that in general $\alpha\ne \mathbb{E}X=\mathbb{E}Y$.
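To illustrate the single-crossing criterion, here is a toy check (the two distributions are invented for the example): a three-point X and a two-point Y with equal means, whose distribution functions cross exactly once, at $\alpha=0$.

```python
import math

# X is uniform on {-1, 0, 1}; Y is +/-1 with probability 1/2 each.
# Both have mean 0, and F_X crosses F_Y exactly once (at alpha = 0),
# so the single-crossing criterion suggests X <=_cx Y.
X = {-1: 1/3, 0: 1/3, 1: 1/3}
Y = {-1: 1/2, 1: 1/2}

def expect(f, dist):
    """Expectation of f(Z) for a finite distribution {value: probability}."""
    return sum(p * f(v) for v, p in dist.items())

# Check E[f(X)] <= E[f(Y)] for a handful of convex test functions.
convex_tests = [lambda x: x * x, abs, math.exp, lambda x: max(x - 0.5, 0.0)]
cx_holds = all(expect(f, X) <= expect(f, Y) for f in convex_tests)
```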

We will return later to the question of adapting this measure to handle distributions with different expectations, but for now, we will return to the setting of sampling with and without replacement and show that

$Y_1+\ldots+Y_m\le_{cx}X_1+\ldots+X_m.$

This is the result shown by Hoeffding, but by a more complicated method. We use the observations of [BPS18], which are stated in a more complicated setting which we may return to later. For notational convenience, let’s assume the elements of $\mathcal{X}$ are distinct, so that ‘repetitions of $x_i$’ has an unambiguous meaning.

We set up a natural coupling of $(X_i),(Y_i)$ as follows. Let $(X_i,i\ge 1)$ be IID uniform samples from $\mathcal{X}$, and generate $(Y_1,\ldots,Y_N)$ by removing any repetitions from $(X_1,X_2,\ldots)$, so that $(Y_1,\ldots,Y_m)$ consists of the first m distinct values to appear in the IID sequence.

Note then that given the sequence $(Y_1,\ldots,Y_m)$, the multiset of values $\{X_1,\ldots,X_m\}$ certainly includes $Y_1$ at least once, but otherwise includes each $Y_i$ zero, one or more times, in a complicated way. However, the pattern of multiplicities doesn’t depend on the values taken by $(Y_i)$, since the underlying sampling is uniform. In particular, if we condition instead on the set $\{Y_1,\ldots,Y_m\}$, each $X_i$ is uniformly distributed on this set, and we find that

$\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, \{Y_1,\ldots,Y_m\}\right] = Y_1+\ldots+Y_m,$

and so we may coarsen the conditioning to obtain

$\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, Y_1+\ldots+Y_m\right] = Y_1+\ldots+Y_m.$

This so-called martingale coupling is very useful for proving convex ordering. You can think of $\mathbb{E}[X|Y]=Y$ as saying “X is generated by Y plus additional randomness”, which is consistent with the notion that Y is more concentrated than X. In any case, it meshes well with Jensen’s inequality, since if f is convex, we obtain

$\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] = \mathbb{E}\left[ f(Y)\right].$

So we have shown that, considering sums of the samples, sampling without replacement is dominated by sampling with replacement in the convex ordering.
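The coupling is easy to simulate. The following Monte Carlo sketch (the value set, sample size and seed are arbitrary choices of mine) builds $(X_i)$ and $(Y_i)$ exactly as described, checks the structural facts of the coupling on each run, and compares the empirical concentration of the two sums:

```python
import random
import statistics

rng = random.Random(0)       # fixed seed for reproducibility

xs = list(range(1, 8))       # distinct values, so removing repetitions is unambiguous
m = 4

def coupled_sample():
    """One draw from the coupling: an IID stream (X_i), with (Y_1, ..., Y_m)
    read off as the first m distinct values to appear in the stream."""
    stream, distinct = [], []
    while len(distinct) < m:
        x = rng.choice(xs)
        stream.append(x)
        if x not in distinct:
            distinct.append(x)
    return stream[:m], distinct   # (X_1, ..., X_m) and (Y_1, ..., Y_m)

sim_with, sim_without = [], []
for _ in range(20000):
    X, Y = coupled_sample()
    # structural facts of the coupling: X_1 = Y_1 and {X_i} is contained in {Y_i}
    assert X[0] == Y[0] and set(X) <= set(Y)
    sim_with.append(sum(X))
    sim_without.append(sum(Y))

sim_var_with = statistics.pvariance(sim_with)
sim_var_without = statistics.pvariance(sim_without)
```

The empirical means agree up to sampling error, while the without-replacement variance comes out markedly smaller, as the convex ordering predicts.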

If we want to handle weighted sums, reproducing the previous argument directly fails, since

$\mathbb{E}\left[\alpha_1X_1+\ldots+\alpha_mX_m\,\big|\,\{Y_1,\ldots,Y_m\}\right]=\frac{(\alpha_1+\ldots+\alpha_m)(Y_1+\ldots+Y_m)}{m},$

which is not in general equal to $\alpha_1Y_1+\ldots+\alpha_mY_m$.

## Non-uniform weighted sampling

Of course, it’s not always the case that we want to sample according to the uniform distribution. My recent personal motivation for investigating this setup has been the notion of exploring a graph generated by the configuration model, for which vertices appear for the first time in a random order that is size-biased by their degrees.

In general we can think of each element of $\mathcal{X}$ as having some weight w(i), and at each step, we pick $x_i$ from the remaining elements with probability proportional to w(i). This procedure is known in other contexts as successive sampling. Note that the sequence $(Y_1,\ldots,Y_m)$ is no longer exchangeable. This is clear enough when there are two elements in $\mathcal{X}$, and $w(1)\gg w(2)$: then it is much more likely that $(Y_1,Y_2)=(x_1,x_2)$ than $(Y_1,Y_2)=(x_2,x_1)$.
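A minimal sketch of successive sampling (the helper name and the toy values and weights are my own), including the two-element illustration of non-exchangeability:

```python
import random

rng = random.Random(1)

def successive_sample(values, weights, m, rng):
    """Draw m elements without replacement: at each step, pick one of the
    remaining elements with probability proportional to its weight."""
    remaining = list(range(len(values)))
    out = []
    for _ in range(m):
        idx = rng.choices(remaining, weights=[weights[i] for i in remaining])[0]
        remaining.remove(idx)
        out.append(values[idx])
    return out

# Two-element illustration from the text: with w(1) >> w(2), the ordering
# (x_1, x_2) appears far more often than (x_2, x_1).
values, weights = [10.0, 1.0], [100.0, 1.0]
n_runs = 10000
count_12 = sum(successive_sample(values, weights, 2, rng) == [10.0, 1.0]
               for _ in range(n_runs))
```

With these weights the heavy element is chosen first with probability $100/101$, so `count_12` should be close to `n_runs`.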

In addition, it will in general not be the case that

$\mathbb{E}[Y_1+\ldots+Y_m]=\mathbb{E}[X_1+\ldots+X_m],$

and this is easily seen again in the above example, where $\mathbb{E}[Y_1+Y_2]=x_1+x_2$ while $\mathbb{E}[X_1+X_2]\approx 2x_1$.

However, one can still compare such distributions, with the notion of increasing convex ordering, which is defined as:

$X\le_{icx} Y\;\iff\; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],$

for all convex non-decreasing functions f. From an economic point of view, increasing convex ordering becomes a comment on the relative contribution of losses above any given threshold between the two distributions (X being preferable to Y for every such threshold), and is sometimes called stop-loss order. More formally, as in [BM05] we have:

$X\le_{icx}Y\;\iff\; \mathbb{E}[\max(X-t,0)]\le \mathbb{E}[\max(Y-t,0)]\,\forall t\in\mathbb{R}.$

In [BPS18], the authors study the case of successive sampling where the weights are monotone-increasing in the values, that is $w(i)\ge w(j)\;\iff\; x_i\ge x_j$, of which size-biased sampling is a special case. With this setup, the heuristic is that ‘large values are more likely to appear earlier’ in the sample without replacement, and making this notion precise underpins all the work.

The coupling introduced above is particularly useful in this more exotic setting. The authors show that, conditional on an event $\mathbf{A}$ of the form $\{Y_1,\ldots,Y_m\}=\{x_{i_1},\ldots,x_{i_m}\}$, where $x_{i_1}\le x_{i_2}\le\ldots\le x_{i_m}$, one has

$\mathbb{P}(X_1=x_{i_j}\,\big|\,\mathbf{A}) \le \mathbb{P}(X_1=x_{i_k}\,\big|\,\mathbf{A}),\text{ precisely when }j\le k.$

This is achieved by comparing the explicit probabilities of pairs of orderings for $(Y_1,\ldots,Y_m)$ with $x_{i_j},x_{i_k}$ exchanged. See [BPS18] for this short calculation – a link to the arXiv version is in the references. So to calculate $\mathbb{E}[X_1\,\big|\, \mathbf{A}]$, we have the setup for the rearrangement inequality / Chebyshev again, as

$\mathbb{E}[X_1\,\big|\, \mathbf{A}] = \sum_{j=1}^m x_{i_j}\mathbb{P}(X_1=x_{i_j}\,\big|\,\mathbf{A}) \ge \frac{1}{m}\left(\sum_{j=1}^m x_{i_j}\right)\left(\sum_{j=1}^m \mathbb{P}(X_1=x_{i_j}\,\big|\,\mathbf{A})\right) = \frac{Y_1+\ldots+Y_m}{m},$

As before, letting $X,Y$ be the sums of the samples with and without replacement, respectively, we have $\mathbb{E}[X\,\big|\, Y] \ge Y$, which the authors term a submartingale coupling. In particular, for convex non-decreasing f, the argument

$\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] \ge \mathbb{E}\left[ f(Y)\right],$

works almost exactly as before (the final inequality now uses both that $\mathbb{E}[X|Y]\ge Y$ and that f is non-decreasing) to establish the increasing convex ordering $Y\le_{icx}X$.
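This conclusion can be sanity-checked exactly on a toy instance (the values, size-biased weights and grid of thresholds below are my own choices): enumerate both sampling distributions for N = 5, m = 3, and compare stop-loss transforms across a grid of thresholds.

```python
import itertools

# Toy instance: size-biased weights, which are monotone in the values.
values = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = values
W = sum(weights)
m = 3

# Exact distribution of the with-replacement sum S_X.
dist_X = {}
for trip in itertools.product(range(len(values)), repeat=m):
    p = 1.0
    for i in trip:
        p *= weights[i] / W
    s = sum(values[i] for i in trip)
    dist_X[s] = dist_X.get(s, 0.0) + p

# Exact distribution of the successive-sampling sum S_Y.
dist_Y = {}
for trip in itertools.permutations(range(len(values)), m):
    p, rem = 1.0, W
    for i in trip:
        p *= weights[i] / rem
        rem -= weights[i]
    s = sum(values[i] for i in trip)
    dist_Y[s] = dist_Y.get(s, 0.0) + p

def stop_loss(dist, t):
    """E[max(S - t, 0)] for a finite distribution {value: probability}."""
    return sum(p * max(s - t, 0.0) for s, p in dist.items())

# S_Y <=_icx S_X iff the stop-loss transforms are ordered for every threshold.
thresholds = [t / 2 for t in range(31)]
icx_holds = all(stop_loss(dist_Y, t) <= stop_loss(dist_X, t) + 1e-12
                for t in thresholds)
```

Note that at $t=0$ the stop-loss transform is just the mean, so the check also confirms that successive sampling here has strictly smaller expected sum than its with-replacement counterpart.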

## References

[Az67] – Azuma – 1967 – Weighted sums of certain dependent random variables

[BM05] – Bauerle, Muller – 2005 – Stochastic orders and risk measures: consistency and bounds. (Online version here)

[BPS18] – Ben Hamou, Peres, Salez – 2018 – Weighted sampling without replacement (ArXiv version here)

[Ho63] – Hoeffding – 1963 – Probability inequalities for sums of bounded random variables

[Oh69] – Ohlin – 1969 – On a class of measures of dispersion with application to optimal reinsurance