Hoeffding’s inequality and convex ordering

A number of previous posts on this blog have discussed Markov’s inequality, and the control one can obtain on a distribution by applying this to functions of the distribution, or functions of random variables with this distribution. In particular, studying the square of the deviation from the mean \left(X-\mathbb{E}X\right)^2 leads to Chebyshev’s inequality, while even stronger bounds can sometimes be obtained using e^{tX}, for some appropriate value of t\in\mathbb{R}. Sometimes this final method, which relies on the finiteness of \mathbb{E}[e^{tX}] is referred to as a Chernoff bound, especially when an attempt is made to choose the optimal value of t for the estimate under consideration.

The case of a random walk, that is the sum of IID increments, is particularly well-studied. In [Ho63], Hoeffding studies a generalisation where the increments are merely independent, and outlines a further generalisation to discrete-time martingales, which is developed by Azuma in [Az67], leading to the famous Azuma-Hoeffding inequality. A key feature of this theorem is that we require bounded increments, which leads to tight choices of t_i in the Chernoff bound for each increment.

In addition, [Ho63] addresses the alternative model where the increments of a random walk are chosen uniformly without replacement from a particular set. The potted summary is that the sum of random increments chosen without replacement has the same mean, but is more concentrated that the corresponding sum of random increments chosen with replacement. This means that any of the concentration results proved in the earlier sections of [Ho63] for the latter situation apply equally to the setting without replacement.

Various versions of this result are referred to as Hoeffding’s inequality by the literature, and the main goal of this short article is to list some results which are definitely true about this setting!

(Note, in some contexts, ‘Hoeffding’s inequality’ will mean the concentration bound for either the with-replacement or the without-replacement setting. In this blog post, we are discussing inequalities comparing these two settings.)

Starting point – second moments in ‘vanilla’ Hoeffding

We won’t use this notation heavily, but for now let’s assume that we are studying a (multi)set of values \mathcal{X}=\{x_1,\ldots,x_N\}, and will study a sequence of m elements (X_1,\ldots,X_m) from this set with replacement (ie chosen in an IID fashion), and a sequence of m elements (Y_1,\ldots,Y_m) chosen from this set without replacement.

Some preliminary remarks:

  • The Y_is are still chosen uniformly which can be interpreted as
    • Any permutation of the multiset \{x_1,\ldots,x_n\} is equally likely to appear as (Y_1,\ldots,Y_n);
    • If determining the sequence one element at a time, then in a conditional sense, the choice of the ‘next’ element is uniform among those elements which have not yet been chosen.
  • In particular, the sequence (Y_1,\ldots,Y_m) is exchangeable (though not knowing what that means will not be a significant impediment to what follows).
  • It is clear that \mathbb{E}[X_1+\ldots+X_m]=\mathbb{E}[Y_1+\ldots+Y_m], so concentration around this common mean is the next question of interest.
  • Intuitively, when m\ll N, we wouldn’t expect there to be much difference between sampling with or without replacement.
  • When sampling without replacement, the cases m and N-m are symmetric with respect to the sum of \mathcal{X} and, in particular, have the same concentration.

As a brief warmup, we establish that

\mathrm{Var}(Y_1+\ldots+Y_m)\le \mathrm{Var}(X_1+\ldots+X_m).

Note that the RHS is m\sigma^2, where \sigma^2<\infty is the variance of a single uniform sample from \mathcal{X}. Since these sums have the same mean, it suffices to show that

\mathbb{E}\left[ (Y_1+\ldots+Y_m)^2\right] \le \mathbb{E}\left[ (X_1+\ldots+X_m)^2\right].

But this really has nothing to do with probability now, since these quantities are just symmetric polynomials in \mathcal{X}. That said, we gain a little bit if we use the exchangeability property to rewrite the LHS and the RHS as

n\mathbb{E}[Y_1^2] + n(n-1)\mathbb{E}[Y_1Y_2] \le n\mathbb{E}[X_1^2] + n(n-1)\mathbb{E}[X_1X_2].

Since Y_1\stackrel{d}= X_1, it remains to handle the cross term, which involves showing that

\frac{1}{N(N-1)}\sum_{i\ne j} x_ix_j \le \frac{1}{N^2} \sum_{i,j}x_ix_j,

which really is an algebra exercise, and can be demonstrated by the rearrangement inequality, or Chebyshev’s inequality (the other one – ie sequence-valued FKG not the probabilistic second-moment bound one), or by clever pre-multiplying and cancelling to reduce to Cauchy-Schwarz.

In [BPS18], which we will turn to in much greater detail shortly, the authors assert in the first sentence that Hoeffding treats also the setting of weighted sums, that is a comparison of

\alpha_1X_1+\ldots+\alpha_mX_m,\quad \alpha_1Y_1+\ldots+\alpha_mY_m.

A casual reader of [Ho63] may struggle initially to identify that this has indeed happened. At the level of second-moment comparison though, the argument given above carries over essentially immediately. We have

\mathbb{E}\left[ (\alpha_1Y_1+\ldots+\alpha_n Y_n)^2\right] = \left(\alpha_1^2+\ldots+\alpha_n^2\right) \mathbb{E}[Y_1^2] + \left(\sum_{i\ne j} \alpha_i\alpha_j \right)\mathbb{E}\left[Y_1Y_2\right],

and of course the corresponding decomposition for (X_i), from which the variance comparison inequality follows as before.

Convex ordering

For many applications, variance is sufficient as a measure of concentration, but one can say more. As a prelude, note that one characterisation of stochastic ordering (which I have sometimes referred to as stochastic domination on this blog) is that

X\le_{st}Y \;\iff\;\mathbb{E}[f(X)] \le \mathbb{E}[f(Y)],

whenever f is a non-decreasing function. Unsurprisingly, it’s the \Leftarrow direction which is harder to prove, though often the \Rightarrow direction is the most useful in applications.

[To inform what follows, I used a number of references, in particular [BM05].]

A natural extension to handle concentration is (stochastic) convex ordering, which is defined as

X\le_{cx} Y\;\iff \; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],

whenever f is a convex function. Note immediately that this definition can only ever apply to random variables for which \mathbb{E}X=\mathbb{E}Y. Otherwise, consider the functions f(x)=x, f(x)=-x, both of which are convex.

This sort of comparison has plenty of application in economics and actuarial fields, where there is a notion of risk-aversion, and convexity is the usual criterion to define risk (in the sense that the danger of a significant loss is viewed as more significant than a corresponding possibility of a significant gain). Also note that in this context, X is preferred to Y by a risk-averse observer.

In fact, it can be shown [Oh69] that a sufficient condition for X\le_{cx} Y is the existence of \alpha\in\mathbb{R} such that

F_X(x)\le F_Y(x)\;\iff\; x\ge \alpha,

where F_X(x)=\mathbb{P}(X\le x) is the usual distribution function of X. That is, the distribution functions ‘cross exactly once’. Note that in general \alpha\ne \mathbb{E}X=\mathbb{E}Y.

We will return later to the question of adapting this measure to handle distributions with different expectations, but for now, we will return to the setting of sampling with and without replacement and show that

Y_1+\ldots+Y_m\le_{cx}X_1+\ldots+X_m.

This is the result shown by Hoeffding, but by a more complicated method. We use the observations of [BPS18] which are stated in a more complicated setting which we may return to later. For notational convenience, let’s assume the elements of \mathcal{X} are distinct so that ‘repetitions of x_i‘ has an unambiguous meaning.

We set up a natural coupling of (X_i),(Y_i) as follows. Let (X_i,i\ge 1) be IID uniform samples from \mathcal{X}, and generate (Y_1,\ldots,Y_N) by removing any repetitions from (X_1,X_2,\ldots).

Note then that given the sequence (Y_1,\ldots,Y_m), the multiset of values \{X_1,\ldots,X_m\} definitely includes Y_1 at least once, but otherwise includes each Y_i either zero, one or more times, in a complicated way. However, this distribution doesn’t depend on the values taken by (Y_i), since it is based on uniform sampling. In particular, if we condition instead on the set \{Y_1,\ldots,Y_m\}, we find that

\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, \{Y_1,\ldots,Y_m\}\right] = Y_1+\ldots+Y_m,

and so we may coarsen the conditioning to obtain

\mathbb{E}\left[X_1+\ldots+X_m\,\big|\, Y_1+\ldots+Y_m\right] = Y_1+\ldots+Y_m.

This so-called martingale coupling is very useful for proving convex ordering. You can think of \mathbb{E}[X|Y]=Y as saying “X is generated by Y plus additional randomness”, which is consistent with the notion that Y is more concentrated than X. In any case, it meshes well with Jensen’s inequality, since if f is convex, we obtain

\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] = \mathbb{E}\left[ f(Y)\right].

So we have shown that sampling without replacement is greater than sampling with replacement in the convex ordering, if we consider sums of the samples.

If we want to handle weighted sums, and tried to reproduce the previous argument directly, it would fail, since

\mathbb{E}\left[\alpha_1X_1+\ldots+\alpha_nX_n\,\big|\,\{Y_1,\ldots,Y_m\}\right]=(\alpha_1+\ldots+\alpha_m)(Y_1+\ldots+Y_m),

which is not in general equal to \alpha_1Y_1+\ldots+\alpha_mY_m.

Non-uniform weighted sampling

Of course, it’s not always the case that we want to sample according to the uniform distribution. My recent personal motivation for investigating this setup has been the notion of exploring a graph generated by the configuration model, for which vertices appear for the first time in a random order that is size-biased by their degrees.

In general we can think of each element of \mathcal{X} as having some weight w(i), and at each step, we pick x_i from the remaining elements with probability proportion to w(i). This procedure is known in other contexts as successive sampling. Note that the sequence (Y_1,\ldots,Y_m) is no longer exchangeable. This is clear enough when there are two elements in \mathcal{X}, and w(1)\gg w(2). So it is much more likely that (Y_1,Y_2)=(x_1,x_2) than (Y_1,Y_2)=(x_2,x_1).

In addition, it will in general not be the case that

\mathbb{E}[Y_1+\ldots+Y_m]=\mathbb{E}[X_1+\ldots+X_m],

and this is easily seen again in the above example, where \mathbb{E}[Y_1+Y_2]=x_1+x_2 while \mathbb{E}[X_1+X_2]\approx 2x_1.

However, one can still compare such distributions, with the notion of increasing convex ordering, which is defined as:

X\le_{icx} Y\;\iff\; \mathbb{E}[f(X)]\le \mathbb{E}[f(Y)],

for all convex non-decreasing functions f. From an economic point view, increasing convex ordering becomes a comment on the relative contribution of losses above some threshold between the two distributions (and that X is better or worse than Y for all such thresholds), and is sometimes called stop-loss order. More formally, as in [BM05] we have:

X\le_{icx}Y\;\iff\; \mathbb{E}[\max(X-t,0)]\le \mathbb{E}[\max(Y-t,0)]\,\forall t\in\mathbb{R}.

In [BPS18], the authors study the case of successive sampling where the weights are monotone-increasing in the values, that is w(i)\ge w(j)\;\iff\; x_i\ge x_j, of which size-biased sampling is a special case. With this setup, the heuristic is that ‘large values are more likely to appear earlier’ in the sample without replacement, and making this notion precise underpins all the work.

The coupling introduced above is particularly useful in this more exotic setting. The authors show that conditional on the event A of the form \{Y_1,\ldots,Y_m\}=\{x_{i_1}\le x_{i_2}\le\ldots\le x_{i_m}\}, one has

\mathbb{P}(X_1=x_{i_j}\,\big|\,\mathbf{A}) \le \mathbb{P}(X_1=x_{i_k}\,\big|\,\mathbf{A}),\text{ precisely when }j\le k.

This is achieved by comparing the explicit probabilities of pairs of orderings for (Y_1,\ldots,Y_m) with x_{i_j},x_{i_k} exchanged. See [BPS18] for this short calculation – a link to the Arxiv version is in the references. So to calculate \mathbb{E}[X_1\,\big|\, \mathbf{A}], we have the setup for the rearrangement inequality / Chebyshev again, as

\mathbb{E}[X_1\,\big|\, \mathbf{A}] = \sum_{j=1}^m x_{i_j}\mathbb{P}(X_1=x_{i_j}) \le \frac{1}{n}\left(\sum_{j=1}^m x_{i_j}\right)\left(\sum \mathbb{P}\right) = \frac{\left(Y_1+\ldots+Y_m\,\big|\,\mathbf{A}\right)}{m}.

As before, letting X,Y be the sums of the samples with and without replacement, respectively, we have \mathbb{E}[X\,\big|\, Y] \ge Y, which the authors term a submartingale coupling. In particular, the argument

\mathbb{E}[f(X)] = \mathbb{E}\left[ \mathbb{E}[f(X)|Y] \right] \ge \mathbb{E}\left[f\left(\mathbb{E}[X|Y] \right)\right] \ge \mathbb{E}\left[ f(Y)\right],

works exactly as before to establish convex ordering.

References

[Az67] – Azuma – 1967 – Weighted sums of certain dependent random variables

[BM05] – Bauerle, Muller – 2005 – Stochastic orders and risk measures: consistency and bounds. (Online version here)

[BPS18] – Ben Hamou, Peres, Salez – 2018 – Weighted sampling without replacement (ArXiv version here)

[Ho63] – Hoeffding – 1963 – Probability inequalities for sums of bounded random variables

[Oh69] – Ohlin – 1969 – On a class of measures of dispersion with application to optimal reinsurance

Advertisement

Lecture 7 – The giant component

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we edge into the second half of the course, we are now in a position to return to the question of the phase transition between the subcritical regime \lambda<1 and the supercritical regime \lambda>1 concerning the size of the largest component L_1(G(n,\lambda/n)).

In Lecture 3, we used the exploration process to give upper bounds on the size of this largest component in the subcritical regime. In particular, we showed that

\frac{1}{n}\big| L_1(G(n,\lambda/n)) \big| \stackrel{\mathbb{P}}\rightarrow 0.

If we used slightly stronger random walk concentration estimates (Chernoff bounds rather than 2nd-moment bounds from Chebyshev’s inequality), we could in fact have shown that with high probability the size of this largest component was at most some logarithmic function of n.

In this lecture, we turn to the supercritical regime. In the previous lecture, we defined various forms of weak local limit, and asserted (without attempting the notationally-involved combinatorial calculation) that the random graph G(n,\lambda/n) converges locally weakly in probability to the Galton-Watson tree with \text{Poisson}(\lambda) offspring distribution, as we’ve used informally earlier in the course.

Of course, when \lambda>1, this branching process has strictly positive survival probability \zeta_\lambda>0. At a heuristic level, we imagine that all vertices whose local neighbourhood is ‘infinite’ are in fact part of the same giant component, which should occupy (\zeta_\lambda+o_{\mathbb{P}}(1))n vertices. In its most basic form, the result is

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big|\;\stackrel{\mathbb{P}}\longrightarrow\; \zeta_\lambda,\quad \frac{1}{n}\big|L_2(G(n,\lambda/n))\big| \;\stackrel{\mathbb{P}}\longrightarrow\; 0, (*)

where the second part is a uniqueness result for the giant component.

The usual heuristic for proving this result is that all ‘large’ components must in fact be joined. For example, if there are two giant components, with sizes \approx \alpha n,\approx \beta n, then each time we add a new edge (such an argument is often called ‘sprinkling‘), the probability that these two components are joined is \approx 2ab, and so if we add lots of edges (which happens as we move from edge probability \lambda-\epsilon\mapsto \lambda ) then with high probability these two components get joined.

It is hard to make this argument rigorous, and the normal approach is to show that with high probability there are no components with sizes within a certain intermediate range (say between \Theta(\log n) and n^\alpha) and then show that all larger components are the same by a joint exploration process or a technical sprinkling argument. Cf the books of Bollobas and of Janson, Luczak, Rucinski. See also this blog post (and the next page) for a readable online version of this argument.

I can’t find any version of the following argument, which takes the weak local convergence as an assumption, in the literature, but seems appropriate to this course. It is worth noting that, as we shall see, the method is not hugely robust to adjustments in case one is, for example, seeking stronger estimates on the giant component (eg a CLT).

Anyway, we proceed in three steps:

Step 1: First we show, using the local limit, that for any \epsilon>0,

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \le \zeta_\lambda+\epsilon, with high probability as n\rightarrow\infty.

Step 2: Using a lower bound on the exploration process, for \epsilon>0 small enough

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \ge \epsilon, with high probability.

Step 3: Motivated by duality, we count isolated vertices to show

\mathbb{P}(\epsilon n\le |L_1| \le (\zeta_\lambda-\epsilon)n) \rightarrow 0.

We will return to uniqueness at the end.

Step 1

This step is unsurprising. The local limit gives control on how many vertices are in small components of various sizes, and so gives control on how many vertices are in small components of all finite sizes (taking limits in the right order). This gives a bound on how many vertices can be in the giant component. Continue reading

Azuma-Hoeffding Inequality

It’s (probably) my last Michaelmas term in Oxford, at least for the time being, and so also the last time giving tutorials on either of the probability courses that students take in their first two years. This time, I’m teaching the second years, and as usual the aim of the majority of the first half of the course is to acquire as sophisticated an understanding as possible of the Central Limit Theorem. I feel a key step is appreciating that CLT tells you about the correct scaling for the deviations from the mean of these partial sums of IID random variables. The fact that these deviations on this correct scaling converge in law to a normal distribution, irrespective (apart from mild conditions) on the underlying distribution, is interesting, but should be viewed as a secondary, bonus, property.

Emphasising the scaling of deviations in CLT motivates the next sections of this (or any) course. We develop tools like Markov’s inequality to control the probability that a random variable is much larger than its expectation, and experiment with applying this to various functions of the random variable to get stronger bounds. When the moment generating function exists, this is an excellent choice for this analysis. We end up with a so-called Chernoff bound. For example, we might consider the probability that when we toss N coins, at least a proportion ¾ are Heads. A Chernoff bound says that this probability decays exponentially in N.

One direction to take is to ask how to control precisely the parameter of this exponential decay, which leads to Cramer’s theorem and the basis of the theory of Large Deviations. An alternative direction is to observe that the signed difference between the partial sums of independent random variables and their means is an example of a martingale, albeit not a very interesting one, since in general the increments of a martingale are not independent. So we might ask: under what circumstances can we show exponential tail bounds on the deviation of a martingale from its mean (that is, its initial value) at a fixed (perhaps large) time?

Azuma-Hoeffding inequality

The following result was derived and used by various authors in the 60s, including Azuma and Hoeffding (separately), but also others.

Let X_0,X_1,X_2,\ldots be a martingale with respect to some filtration, and we assume that the absolute value of each increment |X_i-X_{i-1}| is bounded almost surely by some c_i<\infty. Then, recalling that \mathbb{E}[X_n|\mathcal{F}_0]=X_0, we have

\mathbb{P}(X_n \ge X_0+t) \le \exp\left( -\frac{t^2}{2\sum_{i=1}^n c_i^2}\right).

Proof

We apply a Chernoff argument to each increment. First, observe that for Y a distribution supported on [-1,1] with mean zero, by convexity \mathbb{E}[e^{tY}] is maximised by taking Y equal to +1 and -1 each with probability ½. Thus

\mathbb{E}[e^{tY}]\le \frac12 e^t + \frac 12 e^{-t}=\cosh(t) \le e^{-t^2/2},

where the final inequality follows by directly comparing the Taylor series.

We’ll use this shortly. Before that, we start the usual argument for a Chernoff bound on X_n-X_0.

\mathbb{P}(X_n-X_0\ge t) = \mathbb{P}(e^{\theta(X_n-X_0)}\ge e^{\theta t})\le e^{-\theta t} \mathbb{E}[e^{\theta(X_n-X_0)}]

= e^{-\theta t} \mathbb{E}[\mathbb{E}[e^{\theta((X_n-X_{n-1}) +X_{n-1}-X_0)} | \mathcal{F}_{n-1}]]

= e^{-\theta t} \mathbb{E}[e^{\theta(X_{n-1}-X_0)} \mathbb{E}[e^{\theta(X_n-X_{n-1})}|\mathcal{F}_{n-1}] ],

and our preliminary result allows us to control this inner expectation

\le e^{-\theta t} e^{\theta^2c_n^2/2} \mathbb{E}[e^{\theta(X_{n-1}-X_0)}].

So now we can apply this inductively to obtain

\mathbb{P}(X_n-X_0\ge t) \le e^{-\theta t+ \theta^2 \sum_{i=1}^n c_i^2}.

Finally, as usual in such an argument, we need to choose a sensible value of the free parameter \theta, and naturally we want to choose it to make this RHS as small as possible, which is achieved when \theta = \frac{t}{\sum_{i=1}^n c_i^2}, and leads exactly to the statement of the inequality.

Applications

Unsurprisingly, we can easily apply this to the process of partial sums of IID random variables with mean zero and bounded support, to recover a Chernoff bound.

A more interesting example involves revealing the state (ie open or closed) of the edges of an Erdos-Renyi graph one at a time. We need to examine some quantitative property of the graph which can’t ever be heavily influenced by the presence or non-presence of a single given edge. The size of the largest clique, or the largest cut, are good examples. Adding or removing an edge can change these quantities by at most one.

So if we order the edges, and let the filtration \mathcal{F}_k be generated by the state of the first k edges in this ordering, then X_k=\mathbb{E}[\text{max cut}| \mathcal{F}_k] is a martingale. (A martingale constructed backwards in this fashion by conditioning a final state on a filtration is sometimes called a Doob martingale.) Using A-H on this shows that the deviations from the mean are of order \sqrt{N}, where N is the size of the graph. In the sparse case, it can be justified fairly easily that the maximum cut has size \Theta(N), since for example there will always be some positive proportion of isolated vertices. However, accurate asymptotics for the mean of this quantity seem (at least after a brief search of the literature – please do correct me if this is wrong!) to be unknown. So this might be an example of the curious situation where we can control the deviations around the mean better than the mean itself!

Beyond bounded increments

One observation we might make about the proof is that it is tight only if all the increments X_i-X_{i-1} are supported on \{-c_i,+c_i\}, which is stronger than demanding that the absolute value is bounded. If in fact we have X_i-X_{i-1}\in[-d_i,c_i] almost surely, then, with a more detailed preliminary lemma, we can have instead a bound of \exp\left( -\frac{2t^2}{\sum_{i=1}^n (c_i+d_i)^2} \right).

While it isn’t a problem in these examples, in many settings the restriction to bounded increments is likely to be the obstacle to applying A-H. Indeed, in the technical corner of my current research problem, this is exactly the challenge I faced. Fortunately, at least in principle, all is not necessarily lost. We might, for example, be able to establish bounds (c_i) as described, such that the probability that any |X_i-X_{i-1}| exceeds its c_i is very small. You could then construct a coupled process (Y_i), that is equal to X_i whenever the increments are within the given range, and something else otherwise. For Y to fit the conditions of A-H, the challenge is to ensure we can do this such that the increments remain bounded (ie the ‘something else’ also has to be within [-c_i,c_i] ) and also that Y remains a martingale. This total probability of a deviation is bounded above by the probability of Y experiencing that deviation, plus the probability of Y and X decoupling. To comment on the latter probability is hard in general without saying a bit more about the dependence structure in X itself.

Means and Markov’s Inequality

The first time we learn what a mean is, it is probably called an average. The first time we meet it in a maths lesson, it is probably defined as follows: given a list of values, or possibilities, the mean is the sum of all the values divided by the number of such values.

This can be seen as both a probabilistic and a statistical statement. Ideally, these things should not be different, but at a primary school level (and some way beyond), there is a distinction to be drawn between the mean of a set of data values, say the heights of children in the class, and the mean outcome of rolling a dice. The latter is the mean of something random, while the former is the mean of something fixed and determined.

The reason that the same method works for both of these situations is that the distribution for the outcome of rolling a dice is uniform on the set of possible values. Though this is unlikely to be helpful to many, you could think of this as a consequence of the law of large numbers. The latter, performed jointly in all possible values says that you expect to have roughly equal numbers of each value when you take a large number of samples. If we refer to the strong law, this says that in fact we see this effect in the limit as we take increasingly large samples with probability one. Note that it is not trivial to apply LLN jointly to all values for a general continuous random variable. The convergence of sample distribution functions to the cdf of the underlying distribution is the content of the Glivenko-Cantelli Theorem.

In any case, this won’t work when there isn’t this symmetry where all values are equally likely. So in general, we have to define the mean of a discrete random variable as

\mu=\sum k\mathbb{P}(X=k).

In other words, we are taking a sum of values multiplied by probabilities. By taking a suitable limit, a sum weighted by discrete probabilities converges to an integral weighted by a pdf. So this is a definition that will easily generalise.

Anyway, typically the next stage is to discuss the median. In the setting where we can define the mean directly as a sum of values, we must be given some list of values, which we can therefore write in ascending order. It’s then easy to define the median as the middle value in this ordered list. If the number of elements is odd, this is certainly well-defined. If the number is even, it is less clear. A lot of time at school was spent addressing this question, and the generally-agreed answer seemed to be that the mean of the middle two elements would do nicely. We shouldn’t waste any further time addressing this, as we are aiming for the continuous setting, where in general there won’t be discrete gaps between values in the support.

This led onwards to the dreaded box-and-whisker diagrams, which represent the min, lower quartile, median, upper quartile, and max in order. The diagram is structured to draw attention to the central points in the distribution, as these are in many applications of greater interest. The question of how to define the quartiles if the number of data points is not 3 modulo 4 is of exponentially less interest than the question of how to define the median for an even number of values, in my opinion. What is much more interesting is to note that the middle box of such a diagram would be finite for many continuous distributions with infinite support, such as the exponential distribution and the normal distribution.

Note that it is possible to construct any distribution as a function of a U[0,1] distribution by inverting the cdf. The box-and-whisker diagram essentially gives five points in this identification scheme.

Obviously, the ordered list definition fails to work for such distributions. So we need a better definition of median, which generalises. We observe that half the values are greater than the median, and so in a probabilistic setting, we say that the probability of being less than the median is equal to the probability of being greater. So we want to define it implicitly as:

\mathbb{P}(X>M)=\mathbb{P}(X<M).

So for a continuous distribution without atoms,

\mathbb{P}(X>M)=\frac12,

and this uniquely defines M.

The natural question to start asking is how this compares to the mean. In particular, we want to discuss the relative sizes. Any result about the possible relative values of the mean and median can be reversed by considering the negation of the random variable, so we focus on continuous random variables with non-negative support. If nothing else, these are the conditions for lots of data we might be interested in sampling in the ‘real world’.

It’s worth having a couple of questions to clarify what we are interested in. How about: is it possible for the mean to be 1000 times larger than the median; and is it possible for the median to be 1000 times larger than the mean?

The latter is easier to address. If the median is 1000 and the mean is 1, then with probability ½ the random variable X is at least 1000. So these values make a contribution to the mean of at least 500, while the other values make a contribution of at least zero (since we’ve demanded the RV be positive). This is a contradiction.

The former question turns out to be possible. The motivation should come from our box-and-whisker diagram! Once we have fixed the middle box, the median and quartiles are fixed, but we are free to fiddle with the outer regions as much as we like, so by making the max larger and larger, we can increase the mean freely without affecting the median. Perhaps it is clearest to view a discrete example: 1, 2, N. The median will always be 2, so we can increase N as much as desired to get a large mean.

The first answer is in a way more interesting, because it generalises to give a result about the tail of distributions. Viewing the median as the ½-quantile, we are saying that it cannot be too large relative to the mean. Markov’s inequality provides an identical statement about the general quantile. Instead of thinking about the constant a in an a-quantile, we look at values in the support.

Suppose we want a bound on \mathbb{P}(X>a) for some positive a. Then if we define the function f by

f(x)=a \textbf{1}_{\{x\ge a\}},

so f(x)\le x for all values. Hence the mean of f(X) is at most the mean of X. But the mean of f(X) can be calculated as

a\mathbb{P}(X>a),

and so we conclude that

\mathbb{P}(X>a)\leq \frac{\mu}{a},

which is Markov’s Inequality.

It is worth remarking that this is trivially true when a\le \mu, since probabilities are always at most 1 anyway. Even beyond this region, it is generally quite weak. Note that it becomes progressively stronger if the contribution to the mean from terms greater than a is mainly driven by the contribution from terms close to a. So the statement is strong if the random variable has a light tail.

This motivates considering deviations from the mean, rather than the random variable itself. And to lighten the tail, we can square, for example, to consider the square distance from the mean. This version is Chebyshev’s Inequality:

\mathbb{P}(|X-\mu|^2>a\sigma^2)\le \frac{1}{a}.

Applying Markov an exponential function of a random variable is called a Chernoff Bound, and gives in some sense the bound on tails of a distribution obtained in this way.

Large Deviations 1 – Motivation and Cramer’s Theorem

I’ve been doing a lot of thinking about Large Deviations recently, in particular how to apply the theory to random graphs and related models. I’ve just writing an article about some of the more interesting aspects, so thought it was probably worth turning it into a few posts.

Motivation

Given X_1,X_2,\ldots i.i.d. real-valued random variables with finite expectation, and S_n:=X_1+\ldots+X_n, the Weak Law of Large Numbers asserts that the empirical mean \frac{S_n}{n} converges in distribution to \mathbb{E}X_1. So \mathbb{P}(S_n\geq n(\mathbb{E}X_1+\epsilon))\rightarrow 0. In fact, if \mathbb{E}X_1^2<\infty, we have the Central Limit Theorem, and a consequence is that \mathbb{P}(S_n\geq n\mathbb{E}X_1+n^\alpha)\rightarrow 0 whenever \alpha>\frac12.

In a concrete example, if we toss a coin some suitably large number of times, the probability that the proportion of heads will be substantially greater or smaller than \frac12 tends to zero. So the probability that at least \frac34 of the results are heads tends to zero. But how fast? Consider first four tosses, then eight. A quick addition of the relevant terms in the binomial distribution gives:

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of four tosses are heads}\right)=\frac{1}{16}+\frac{4}{16}=\frac{5}{16},

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of twelve tosses are heads}\right)=\frac{1}{2^{12}}+\frac{12}{2^{12}}+\frac{66}{2^{12}}+\frac{220}{2^{12}}=\frac{299}{2^{12}}.

There are two observations to be made. The first is that the second is substantially smaller than the first – the decay appears to be relatively fast. The second observation is that \frac{220}{2^{12}} is substantially larger than the rest of the sum. So by far the most likely way for at least \tfrac34 out of twelve tosses to be heads is if exactly \tfrac34 are heads. Cramer’s theorem applies to a general i.i.d. sequence of RVs, provided the tail is not too heavy. It show that the probability of any such large deviation event decays exponentially with n, and identifies the exponent.

Theorem (Cramer): Let (X_i) be i.i.d. real-valued random variables which satisfy \mathbb{E}e^{tX_1}<\infty for every t\in\mathbb{R}. Then for any a>\mathbb{E}X_1,

\lim_{n\rightarrow \infty}\frac{1}{n}\log\mathbb{P}(S_n\geq an)=-I(a),

\text{where}\quad I(z):=\sup_{t\in\mathbb{R}}\left[zt-\log\mathbb{E}e^{tX_1}\right].

Remarks

  • So, informally, \mathbb{P}(S_n\geq an)\sim e^{-nI(a)}.
  • I(z) is called the Fenchel-Legendre transform (or convex conjugate) of \log\mathbb{E}e^{tX_1}.
  • Considering t=0 confirms that I(z)\in[0,\infty].
  • In their extremely useful book, Dembo and Zeitouni present this theorem in greater generality, allowing X_i to be supported on \mathbb{R}^d, considering a more general set of large deviation events, and relaxing the requirement for finite mean, and thus also the finite moment generating function condition. All of this will still be a special case of the Gartner-Ellis theorem, which will be examined in a subsequent post, so we make do with this form of Cramer’s result for now.

The proof of Cramer’s theorem splits into an upper bound and a lower bound. The former is relatively straightforward, applying Markov’s inequality to e^{tS_n}, then optimising over the choice of t. This idea is referred to by various sources as the exponential Chebyshev inequality or a Chernoff bound. The lower bound is more challenging. We reweight the distribution function F(x) of X_1 by a factor e^{tx}, then choose t so that the large deviation event is in fact now within the treatment of the CLT, from which suitable bounds are obtained.

To avoid overcomplicating this initial presentation, some details have been omitted. It is not clear, for example, whether I(x) should be finite whenever x is in the support of X_1. (It certainly must be infinite outside – consider the probability that 150% or -40% of coin tosses come up heads!) In order to call this a Large Deviation Principle, we also want some extra regularity on I(x), not least to ensure it is unique. This will be discussed in the next posts.

Effective Bandwidth

Here, devices have fixed capacity, but packet sizes are random. So, we still have a capacity constraint for the links, but we accept that it won’t be possible to ensure that we stay within those limits all the time, and seek instead to minimise the probability that the limits are exceeded, while keeping throughput as high as possible.

An important result is Chernoff’s Bound: \mathbb{P}(Y\geq 0)\leq \inf_{s\geq 0}\mathbb{E}e^{sY}. The proof is very straightforward: apply Markov’s inequality to the non-negative random variable e^{SY}. So in particular \frac{1}{n}\log\mathbb{P}(X_1+\ldots+X_n\geq 0)\leq \inf M(s), where M(s)=\log\mathbb{E}e^{sX}, and Cramer’s Theorem asserts that after taking a limit in n on the LHS, equality holds, provided \mathbb{E}X<0,\mathbb{P}(X>0)>0.

We assume that the traffic has the form S=\sum_{j=1}^J\sum_{i=1}^{n_j}X_{ji}, where these summands are iid, interpreted as one of the n_j loads used on source j. We have

\log\mathbb{P}(S>c)\leq\log \mathbb{E}[e^{s(S-C)}]=\sum_{j=1}^Jn_jM_j(s)-sC

so \inf(\sum n_jM_j(s)-sC)\leq -\gamma\quad\Rightarrow\quad \mathbb{P}(s\geq C)\leq e^{-\gamma}

so we want this to hold for large \gamma.

We might then choose to restrict attention to

A=\{n:\sum n_jM_j-sC\leq-\gamma,\text{ some }s\geq 0\}

So, when operating near capacity, say with call profile n* on (ie near) the boundary of A, with s* the argmin of the above. Then the tangent plane is \sum n_jM_j(s^*)-s^*C=-\gamma, and since A’s complement is convex, it suffices to stay on the ‘correct’ side (ie halfspace) of this tangent plane.

We can rewrite as \sum n_jM_j(S^*)\leq C-\frac{\gamma}{s^*}. Note that this is reasonable since s* is fixed, and we call \frac{M_j(s)}{s}=:\alpha_j(s), the effective bandwidth. It is with respect to this average that we are bounding probabilities, hence ‘effective’.

Observe that \alpha_j(s) is increasing by Jensen as (\mathbb{E}e^X)^t\leq \mathbb{E}e^{tX} for t>1 implies that for t>s, (\mathbb{E}e^{sX})^t\leq(\mathbb{E}e^{tX})^s.

In particular,

\mathbb{E}X\leq \alpha_j(s)\leq \text{ess sup}X