Lecture 7 – The giant component

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we edge into the second half of the course, we are now in a position to return to the question of the phase transition between the subcritical regime \lambda<1 and the supercritical regime \lambda>1 concerning the size of the largest component L_1(G(n,\lambda/n)).

In Lecture 3, we used the exploration process to give upper bounds on the size of this largest component in the subcritical regime. In particular, we showed that

\frac{1}{n}\big| L_1(G(n,\lambda/n)) \big| \stackrel{\mathbb{P}}\rightarrow 0.

If we used slightly stronger random walk concentration estimates (Chernoff bounds rather than 2nd-moment bounds from Chebyshev’s inequality), we could in fact have shown that with high probability the size of this largest component was at most some logarithmic function of n.

In this lecture, we turn to the supercritical regime. In the previous lecture, we defined various forms of weak local limit, and asserted (without attempting the notationally-involved combinatorial calculation) that the random graph G(n,\lambda/n) converges locally weakly in probability to the Galton-Watson tree with \text{Poisson}(\lambda) offspring distribution, as we’ve used informally earlier in the course.

Of course, when \lambda>1, this branching process has strictly positive survival probability \zeta_\lambda>0. At a heuristic level, we imagine that all vertices whose local neighbourhood is ‘infinite’ are in fact part of the same giant component, which should occupy (\zeta_\lambda+o_{\mathbb{P}}(1))n vertices. In its most basic form, the result is

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big|\;\stackrel{\mathbb{P}}\longrightarrow\; \zeta_\lambda,\quad \frac{1}{n}\big|L_2(G(n,\lambda/n))\big| \;\stackrel{\mathbb{P}}\longrightarrow\; 0, (*)

where the second part is a uniqueness result for the giant component.
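Before starting the proof, here is a quick numerical sanity check of (*) — a minimal sketch, assuming the networkx library is available, with \zeta_\lambda computed from the fixed-point equation \zeta=1-e^{-\lambda\zeta} for the survival probability of the \text{Poisson}(\lambda) branching process.

```python
# A quick numerical sanity check of (*), assuming networkx is available.
import networkx as nx
from math import exp

def survival_probability(lam, iterations=200):
    """Solve zeta = 1 - exp(-lam * zeta) by fixed-point iteration (lam > 1)."""
    zeta = 0.5
    for _ in range(iterations):
        zeta = 1 - exp(-lam * zeta)
    return zeta

n, lam = 20000, 1.5
G = nx.fast_gnp_random_graph(n, lam / n, seed=42)
sizes = sorted((len(c) for c in nx.connected_components(G)), reverse=True)
print("zeta_lambda      :", survival_probability(lam))
print("|L_1|/n, |L_2|/n :", sizes[0] / n, sizes[1] / n)
```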

The usual heuristic for proving this result is that all ‘large’ components must in fact be joined. For example, if there are two giant components, with sizes \approx \alpha n,\approx \beta n, then each time we add a new edge (such an argument is often called ‘sprinkling’), the probability that this edge joins the two components is \approx 2\alpha\beta, and so if we add lots of edges (which happens as we move from edge probability \lambda-\epsilon\mapsto \lambda) then with high probability these two components get joined.

It is hard to make this argument rigorous, and the normal approach is to show that with high probability there are no components with sizes in a certain intermediate range (say between \Theta(\log n) and n^\alpha), and then to show that all larger components coincide, via a joint exploration process or a technical sprinkling argument. Cf the books of Bollobas and of Janson, Luczak, Rucinski. See also this blog post (and the next page) for a readable online version of this argument.

I can’t find any version of the following argument, which takes the weak local convergence as an assumption, in the literature, but it seems appropriate to this course. It is worth noting that, as we shall see, the method is not hugely robust to adjustments if one is seeking, for example, stronger estimates on the giant component (eg a CLT).

Anyway, we proceed in three steps:

Step 1: First we show, using the local limit, that for any \epsilon>0,

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \le \zeta_\lambda+\epsilon, with high probability as n\rightarrow\infty.

Step 2: Using a lower bound on the exploration process, for \epsilon>0 small enough

\frac{1}{n}\big|L_1(G(n,\lambda/n))\big| \ge \epsilon, with high probability.

Step 3: Motivated by duality, we count isolated vertices to show

\mathbb{P}(\epsilon n\le |L_1| \le (\zeta_\lambda-\epsilon)n) \rightarrow 0.

We will return to uniqueness at the end.

Step 1

This step is unsurprising. The local limit gives control on how many vertices are in small components of various sizes, and so gives control on how many vertices are in small components of all finite sizes (taking limits in the right order). This gives a bound on how many vertices can be in the giant component.

(Note: parts of this argument appear in the text and exercises of Section 1.4 in the draft of Volume II of van der Hofstad’s notes, which can be found here.)

We can proceed in greater generality, by considering a sequence of random graphs G_n which converge locally weakly in probability to T, a random tree, with survival probability \zeta=\mathbb{P}(|T|=\infty)>0. We will show that:

Proposition: \mathbb{P}(L_1(G_n)\ge (\zeta+\epsilon)n) \rightarrow 0, for each \epsilon>0.

As a preliminary, note that for every k\in\mathbb{N}, there are finitely many connected rooted graphs (H,\rho_H) of size k. We can also identify whether the component of a given vertex has size k by looking at the ball of radius r>k around that vertex. In particular, by summing over all such graphs of size k, the weak local limit implies:

\frac{1}{n}\sum_{v\in[n]} \mathbf{1}_{\{|C^{G_n}(v)|=k\}} = \frac{1}{n} \sum_{|V(H)|=k} \sum_{v\in[n]} \mathbf{1}_{\{B_r^{G_n}(v)\simeq (H,\rho_H)\}}

\stackrel{\mathbb{P}}\longrightarrow \;\sum_{|V(H)|=k} \mathbb{P}(B_r^T(\rho)\simeq (H,\rho_H)) = \mathbb{P}(|T|=k).

Furthermore, we can then control the tail as

\frac{1}{n}\sum_{v\in[n]} \mathbf{1}_{\{|C^{G_n}(v)|\ge k\}}\;\stackrel{\mathbb{P}}\longrightarrow \mathbb{P}(|T|\ge k).

(Recall that the LHS of this statement is the proportion of vertices in components of size at least k.)
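As an aside, for the \text{Poisson}(\lambda) GW tree both sides of this convergence are explicit, since the total progeny follows the Borel distribution, \mathbb{P}(|T|=k)=e^{-\lambda k}(\lambda k)^{k-1}/k! — a standard fact not proved in this course. Here is a short numerical comparison (a sketch, again assuming networkx):

```python
# Comparing both sides of the convergence above for G(n, lam/n).
# Borel distribution: P(|T| = k) = e^{-lam*k} (lam*k)^{k-1} / k!
import networkx as nx
from math import exp, factorial

def borel_pmf(lam, k):
    return exp(-lam * k) * (lam * k) ** (k - 1) / factorial(k)

n, lam, k = 20000, 1.5, 10
G = nx.fast_gnp_random_graph(n, lam / n, seed=0)
lhs = sum(len(c) for c in nx.connected_components(G) if len(c) >= k) / n
rhs = 1 - sum(borel_pmf(lam, j) for j in range(1, k))  # P(|T| >= k), incl. survival
print("proportion with |C(v)| >= k:", lhs, "  P(|T| >= k):", rhs)
```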

We will make the trivial but useful observation that in any graph the largest component has size at least k precisely if at least k vertices are in components of size at least k (!). That is,

|L_1(G)|\ge k\quad\iff\quad \sum_{v\in[n]} \mathbf{1}_{\{|C^G(v)|\ge k\}} \ge k.

Returning now to the problem at hand, we have \mathbb{P}(|T|\ge k)\downarrow\zeta as k\rightarrow\infty, so we may pick k such that \mathbb{P}(|T|\ge k)<\zeta+\epsilon.

But then, using our ‘trivial but useful’ observation:

\mathbb{P}(L_1(G_n)\ge (\zeta+\epsilon)n) = \mathbb{P}(\sum_{v\in[n]} \mathbf{1}_{\{|C^{G_n}(v)|\ge (\zeta+\epsilon)n \}} \ge (\zeta+\epsilon)n)

\le \mathbb{P}(\frac{1}{n}\sum_{v\in[n]} \mathbf{1}_{\{|C^{G_n}(v)|\ge k\}} \ge \zeta+\epsilon). (**)

Note that we have replaced (\zeta+\epsilon)n by k in this final step for a bound. However, the random quantity inside the probability is known to converge in probability to \mathbb{P}(|T|\ge k)<\zeta+\epsilon. So in fact this probability (**) vanishes as n\rightarrow\infty.

Step 2

Remember the exploration process, where v_1,v_2,\ldots,v_n is a labelling of the vertices of G(n,\lambda/n) in breadth-first order. Defining

X_i:= \#\{w\in[n]\,:\, w\in\Gamma(v_i),\,w\not\in \Gamma(v_j),\,j\in[i-1]\},

the number of children of vertex v_i, we set

S_0:=0,\quad S_i:=S_{i-1}+(X_i-1),\; i\ge 1,

to be (a version of) the exploration process. It will be useful to study

H_0:=0,\quad H_k:=\min\{i\,:\, S_i=-k\},

the hitting times of (-k), as then \{v_{H_{k-1}+1},\ldots,v_{H_k}\} is the kth component to be explored.

Unlike for a tree, we have multiple components, and essentially the process decreases by one each time we start a new component, which means that the current value no longer describes the number of vertices on the stack. In general, this is given by S_i - \min_{0\le j\le i}S_j, and so

X_i\stackrel{d}= \text{Bin}(n-i-(S_{i-1}-\min_{0\le j\le i-1}S_j),\, \frac{\lambda}{n}),

which we may stochastically bound below by

\ge_{st} \text{Bin}(n-2i-S_{i-1},\,\frac{\lambda}{n}),

noting that this is extremely crude.
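For concreteness, here is a minimal simulation of this exploration process, written directly from the definitions above (so a sketch, not code from the course):

```python
# A sketch of the exploration process for G(n, lam/n), tracking S_i.
import random

def explore(n, lam, seed=None):
    """Return the walk (S_i) for a breadth-first exploration of G(n, lam/n)."""
    rng = random.Random(seed)
    p = lam / n
    unseen = set(range(1, n))          # vertices not yet placed on the stack
    queue = [0]                        # the stack: reached but unexplored vertices
    S, s = [0], 0
    while queue or unseen:
        if not queue:                  # component finished: start a new one
            queue.append(unseen.pop())
        queue.pop(0)                   # explore the next vertex in BFS order
        children = [w for w in unseen if rng.random() < p]
        unseen.difference_update(children)
        queue.extend(children)
        s += len(children) - 1         # S_i = S_{i-1} + (X_i - 1)
        S.append(s)
    return S

n = 2000
S = explore(n, 1.5, seed=1)
print("max_i S_i / n =", max(S) / n)  # a positive proportion when lam > 1
```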

We want to study whether S_i ever exceeds \epsilon n, for some \epsilon>0 to be determined later.

For reasons that will become clear in the following deduction, it’s convenient to fix \alpha>0 small such that \lambda(1-2\alpha)>1, and then choose \epsilon>0 such that

\alpha\left[\lambda(1-2\alpha-\epsilon)-1\right]>\epsilon.

(which is possible by continuity since the given relation holds when \epsilon=0.)
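For example (with hypothetical concrete values), take \lambda=1.5 and \alpha=0.1, so that \lambda(1-2\alpha)=1.2>1; a quick search then shows \epsilon=0.01 satisfies the displayed condition while larger values need not:

```python
# Hypothetical concrete values: lam = 1.5, alpha = 0.1 gives lam*(1-2*alpha) = 1.2 > 1.
lam, alpha = 1.5, 0.1
for eps in [0.05, 0.02, 0.01]:
    print(eps, alpha * (lam * (1 - 2 * alpha - eps) - 1) > eps)
# 0.05 False, 0.02 False, 0.01 True: eps = 0.01 works here
```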

Now, when i\le \alpha n and S_{i-1}\le \epsilon n, we have

X_i\ge_{st} \text{Bin}(n(1-2\alpha-\epsilon),\frac{\lambda}{n}).

The following argument requires some kind of submartingale approach (involving coupling with a simpler process at the stopping time) to make rigorous, which is beyond the scope of this course’s prerequisites.

However, informally, if we assume that \max_{i\le \alpha n} S_i\le \epsilon n, ‘then’

\frac{1}{n}S_{\alpha n}\ge_{st} \frac{1}{n}\text{Bin}(\alpha n\cdot n(1-2\alpha-\epsilon),\,\frac{\lambda}{n}) - \alpha.

But this distribution is concentrated around a value which, by our choice of \alpha and \epsilon, is >\epsilon: indeed, the mean of the RHS is \alpha\lambda(1-2\alpha-\epsilon)-\alpha=\alpha\left[\lambda(1-2\alpha-\epsilon)-1\right]>\epsilon. This contradicts the assumption on the maximum, and thus we conclude that

\mathbb{P}(\max_{i\le \alpha n} S_i\le \epsilon n)\rightarrow 0, as n\rightarrow\infty.

We conclude that \max_{i\le \alpha n} S_i\ge\epsilon n holds with high probability. But remember that S_{i+1}\ge S_i -1 so if S_i\ge \epsilon n, then all of S_i,S_{i+1},\ldots,S_{i+\lfloor \epsilon n\rfloor} are non-negative, and so certainly v_i,v_{i+1},\ldots,v_{i+\lfloor \epsilon n\rfloor} are in the same component of the graph, and L_1(G(n,\lambda/n))\ge \epsilon n with high probability.

Step 3

The motivation for this section is duality. Recall (from Lecture 5) that if we condition a supercritical Poisson GW tree on extinction, we obtain the distribution of a dual subcritical Poisson GW tree. This relation moves across to the world of the sparse Erdos-Renyi random graph. If you exclude the giant component, you are left with a subcritical random graph (on a smaller vertex set), and this applies equally well to the local limits. Essentially, if we exclude a component, and take the local limit of what remains, we get the wrong answer unless the component we excluded was a giant component with size \approx \zeta_\lambda n, or was small.

As we shall see, this effect is captured sufficiently by counting isolated vertices.

First, we state a Fact: when 1-\zeta_\lambda <y<1, then ye^{-\lambda y}>e^{-\lambda}. This is easily checked by comparing derivatives: the function y\mapsto ye^{-\lambda y} increases then decreases on this interval (its maximum is at y=1/\lambda\in(1-\zeta_\lambda,1)), and equals e^{-\lambda} at both endpoints, using 1-\zeta_\lambda=e^{-\lambda\zeta_\lambda} at the left one. This will be useful shortly.
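A quick numerical check of the Fact (a sketch, with \lambda=1.5 as an arbitrary supercritical choice, and \zeta_\lambda computed by fixed-point iteration as before):

```python
# Checking the Fact numerically for lam = 1.5.
from math import exp

lam, zeta = 1.5, 0.5
for _ in range(200):                   # fixed point of zeta = 1 - exp(-lam * zeta)
    zeta = 1 - exp(-lam * zeta)

ys = [1 - zeta + i * zeta / 10 for i in range(1, 10)]   # interior of (1-zeta, 1)
assert all(y * exp(-lam * y) > exp(-lam) for y in ys)
# equality holds at both endpoints y = 1 - zeta and y = 1:
print((1 - zeta) * exp(-lam * (1 - zeta)), exp(-lam))
```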

Now, we study I_n, the number of isolated vertices in G_n, under the conditioning that \{1,2,\ldots,k\} is a component, for various values of k. Note that unless k=1, we have

\mathbb{E}[I_n\,\big|\, \{1,2,\ldots,k\}\text{ a cpt}] = (n-k)(1-\frac{\lambda}{n})^{n-k-1},

for exactly the same reason as when we did this calculation for the original graph several lectures back. We will consider k in the range \epsilon n \le k\le (\zeta_\lambda - \epsilon) n.

We can take a limit of this expectation, uniformly in the appropriate range of k, using the Fact above, since the function ye^{-\lambda y} is suitably well-behaved, to obtain

\liminf_{n\rightarrow\infty} \frac{1}{n}\min_{\epsilon n\le k\le (\zeta-\epsilon)n} (n-k)(1-\frac{\lambda}{n})^{n-k-1}

\ge \min_{\epsilon \le x \le \zeta-\epsilon} (1-x)e^{-\lambda(1-x)}\ge e^{-\lambda}+\epsilon',

where \epsilon'>0. So

\liminf_{n\rightarrow\infty} \min_{\epsilon n\le k\le (\zeta-\epsilon)n} \frac{1}{n}\mathbb{E}\left[ I_n\,\big|\, \{1,\ldots,k\}\text{ a cpt}\right] \ge e^{-\lambda}+\epsilon'.

But \frac{I_n}{n} is bounded above (by 1, of course), and so a lower bound on its expectation gives a lower bound on its upper tail probabilities, leading to

\liminf_{n\rightarrow\infty}\min_{\epsilon n\le k\le (\zeta-\epsilon)n} \mathbb{P}\left( \frac{I_n}{n}\ge e^{-\lambda}+\frac{\epsilon'}{2}\,\big|\, \{1,\ldots,k\}\text{ a cpt} \right) >0.

However, we know \frac{I_n}{n}\stackrel{\mathbb{P}}\rightarrow e^{-\lambda} (for example by local convergence…), so the unconditional probability of the event \{\frac{I_n}{n}\ge e^{-\lambda}+\frac{\epsilon'}{2}\} vanishes. For this to happen, the probability of the conditioning event in question must also vanish, ie

\mathbb{P}\left(\epsilon n\le |C^{G_n}(1)|\le (\zeta_\lambda-\epsilon)n\right)\rightarrow 0.

Finally, since vertex 1 lies in the largest component with conditional probability \frac{|L_1|}{n}\ge\epsilon on the event in question, we have

\mathbb{P}\left(\epsilon n \le |L_1(G_n)|\le (\zeta_\lambda-\epsilon)n\right) \le \frac{1}{\epsilon}\,\mathbb{P}\left(\epsilon n\le |C^{G_n}(1)|\le (\zeta_\lambda-\epsilon)n\right),

and so the corresponding result holds for the largest component, not just the component of a fixed vertex.

Uniqueness and overall comments

Uniqueness can be obtained by a slight adjustment of Step 1. Morally, Step 1 is saying that a proportion asymptotically at most \zeta_\lambda of the vertices are in large components, so it is possible (and an exercise in the course) to adjust the argument to show

\frac{1}{n}|L_1|+\frac{1}{n}|L_2| \le \zeta_\lambda+\epsilon, with high probability,

from which the uniqueness result follows immediately.

In particular, it’s worth noting that this is an example of a bootstrapping argument, where we show a weak version of our goal result (in Step 2), but then use this to show the full result.

Note also that we can use the duality principle to show logarithmic bounds on the size of the second-largest component in exactly the same way that we showed logarithmic bounds on the size of the largest component in the subcritical regime. The whole point of duality is that these are the same problem!

Azuma-Hoeffding Inequality

It’s (probably) my last Michaelmas term in Oxford, at least for the time being, and so also the last time giving tutorials on either of the probability courses that students take in their first two years. This time, I’m teaching the second years, and as usual the aim of the majority of the first half of the course is to acquire as sophisticated an understanding as possible of the Central Limit Theorem. I feel a key step is appreciating that the CLT tells you the correct scaling for the deviations from the mean of partial sums of IID random variables. The fact that these deviations, on this correct scaling, converge in law to a normal distribution, irrespective (apart from mild conditions) of the underlying distribution, is interesting, but should be viewed as a secondary, bonus, property.

Emphasising the scaling of deviations in CLT motivates the next sections of this (or any) course. We develop tools like Markov’s inequality to control the probability that a random variable is much larger than its expectation, and experiment with applying this to various functions of the random variable to get stronger bounds. When the moment generating function exists, this is an excellent choice for this analysis. We end up with a so-called Chernoff bound. For example, we might consider the probability that when we toss N coins, at least a proportion ¾ are Heads. A Chernoff bound says that this probability decays exponentially in N.

One direction to take is to ask how to control precisely the parameter of this exponential decay, which leads to Cramer’s theorem and the basis of the theory of Large Deviations. An alternative direction is to observe that the signed difference between the partial sums of independent random variables and their means is an example of a martingale, albeit a not very interesting one, since its increments are independent, whereas in general the increments of a martingale need not be. So we might ask: under what circumstances can we show exponential tail bounds on the deviation of a martingale from its mean (that is, its initial value) at a fixed (perhaps large) time?

Azuma-Hoeffding inequality

The following result was derived and used, in various forms, by several authors in the 1960s, including Azuma and Hoeffding (separately), among others.

Let X_0,X_1,X_2,\ldots be a martingale with respect to some filtration (\mathcal{F}_i), and assume that each increment is bounded almost surely: |X_i-X_{i-1}|\le c_i<\infty. Then, recalling that \mathbb{E}[X_n|\mathcal{F}_0]=X_0, we have

\mathbb{P}(X_n \ge X_0+t) \le \exp\left( -\frac{t^2}{2\sum_{i=1}^n c_i^2}\right).

Proof

We apply a Chernoff argument to each increment. First, observe that for Y a random variable supported on [-1,1] with mean zero, by convexity \mathbb{E}[e^{tY}] is maximised by taking Y equal to +1 and -1 each with probability ½. Thus

\mathbb{E}[e^{tY}]\le \frac12 e^t + \frac 12 e^{-t}=\cosh(t) \le e^{t^2/2},

where the final inequality follows by directly comparing the Taylor series.

We’ll use this shortly. Before that, we start the usual argument for a Chernoff bound on X_n-X_0.

\mathbb{P}(X_n-X_0\ge t) = \mathbb{P}(e^{\theta(X_n-X_0)}\ge e^{\theta t})\le e^{-\theta t} \mathbb{E}[e^{\theta(X_n-X_0)}]

= e^{-\theta t} \mathbb{E}[\mathbb{E}[e^{\theta((X_n-X_{n-1}) +X_{n-1}-X_0)} | \mathcal{F}_{n-1}]]

= e^{-\theta t} \mathbb{E}[e^{\theta(X_{n-1}-X_0)} \mathbb{E}[e^{\theta(X_n-X_{n-1})}|\mathcal{F}_{n-1}] ],

and our preliminary result, applied to Y=\frac{X_n-X_{n-1}}{c_n} with parameter \theta c_n, allows us to control this inner expectation

\le e^{-\theta t} e^{\theta^2c_n^2/2} \mathbb{E}[e^{\theta(X_{n-1}-X_0)}].

So now we can apply this inductively to obtain

\mathbb{P}(X_n-X_0\ge t) \le e^{-\theta t+ \frac{\theta^2}{2} \sum_{i=1}^n c_i^2}.

Finally, as usual in such an argument, we need to choose a sensible value of the free parameter \theta, and naturally we want to choose it to make this RHS as small as possible, which is achieved when \theta = \frac{t}{\sum_{i=1}^n c_i^2}, and leads exactly to the statement of the inequality.

Applications

Unsurprisingly, we can easily apply this to the process of partial sums of IID random variables with mean zero and bounded support, to recover a Chernoff bound.
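For instance, here is a minimal simulation sketch for the simplest bounded-increment martingale, partial sums of IID \pm 1 steps (so c_i=1 and X_0=0), comparing the empirical tail with the Azuma-Hoeffding bound:

```python
# Partial sums of IID +-1 steps: empirical tail vs the Azuma-Hoeffding bound.
import random
from math import exp

def empirical_tail(n, t, samples=20000, seed=0):
    rng = random.Random(seed)
    hits = sum(sum(rng.choice((-1, 1)) for _ in range(n)) >= t
               for _ in range(samples))
    return hits / samples

n, t = 100, 30
print("empirical P(X_n >= t) :", empirical_tail(n, t))
print("Azuma-Hoeffding bound :", exp(-t ** 2 / (2 * n)))
```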

A more interesting example involves revealing the state (ie open or closed) of the edges of an Erdos-Renyi graph one at a time. We need to examine some quantitative property of the graph which can’t ever be heavily influenced by the presence or non-presence of a single given edge. The size of the largest clique, or the largest cut, are good examples. Adding or removing an edge can change these quantities by at most one.

So if we order the edges, and let the filtration \mathcal{F}_k be generated by the state of the first k edges in this ordering, then X_k=\mathbb{E}[\text{max cut}| \mathcal{F}_k] is a martingale. (A martingale constructed in this fashion, by conditioning a final quantity on an increasing filtration, is sometimes called a Doob martingale.) Using A-H on this shows that the deviations from the mean are of order \sqrt{N}, where N is the size of the graph. In the sparse case, it can be justified fairly easily that the maximum cut has size \Theta(N): the number of edges is \Theta(N) with high probability, any graph has a cut containing at least half of its edges, and no cut can exceed the total number of edges. However, accurate asymptotics for the mean of this quantity seem (at least after a brief search of the literature – please do correct me if this is wrong!) to be unknown. So this might be an example of the curious situation where we can control the deviations around the mean better than the mean itself!
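As a very crude illustration of this concentration (a sketch assuming a brute-force max cut, so only tiny N is feasible; this shows the phenomenon rather than the \sqrt{N} asymptotics):

```python
# Sampling sparse graphs and observing the spread of the max cut around its mean.
import itertools, random, statistics

def max_cut(n, edges):
    best = 0
    for mask in range(1 << (n - 1)):        # fix the last vertex's side by symmetry
        side = [(mask >> v) & 1 for v in range(n - 1)] + [0]
        best = max(best, sum(side[u] != side[v] for u, v in edges))
    return best

N, lam = 12, 2.0
rng = random.Random(0)
cuts = []
for _ in range(100):
    edges = [e for e in itertools.combinations(range(N), 2)
             if rng.random() < lam / N]
    cuts.append(max_cut(N, edges))
print("mean max cut:", statistics.mean(cuts), " std dev:", statistics.pstdev(cuts))
```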

Beyond bounded increments

One observation we might make about the proof is that it is tight only if all the increments X_i-X_{i-1} are supported on \{-c_i,+c_i\}, which is stronger than demanding that the absolute value is bounded. If in fact we have X_i-X_{i-1}\in[-d_i,c_i] almost surely, then, with a more detailed preliminary lemma, we can have instead a bound of \exp\left( -\frac{2t^2}{\sum_{i=1}^n (c_i+d_i)^2} \right).

While it isn’t a problem in these examples, in many settings the restriction to bounded increments is likely to be the obstacle to applying A-H. Indeed, in the technical corner of my current research problem, this is exactly the challenge I faced. Fortunately, at least in principle, all is not necessarily lost. We might, for example, be able to establish bounds (c_i) as described, such that the probability that any |X_i-X_{i-1}| exceeds its c_i is very small. You could then construct a coupled process (Y_i) that is equal to X_i whenever the increments are within the given range, and something else otherwise. For Y to fit the conditions of A-H, the challenge is to ensure we can do this such that the increments remain bounded (ie the ‘something else’ also has to be within [-c_i,c_i]) and also that Y remains a martingale. The total probability of a deviation of X is then bounded above by the probability of Y experiencing that deviation, plus the probability of X and Y decoupling. It is hard to comment on the latter probability in general without saying a bit more about the dependence structure in X itself.

Means and Markov’s Inequality

The first time we learn what a mean is, it is probably called an average. The first time we meet it in a maths lesson, it is probably defined as follows: given a list of values, or possibilities, the mean is the sum of all the values divided by the number of such values.

This can be seen as both a probabilistic and a statistical statement. Ideally, these things should not be different, but at a primary school level (and some way beyond), there is a distinction to be drawn between the mean of a set of data values, say the heights of children in the class, and the mean outcome of rolling a die. The latter is the mean of something random, while the former is the mean of something fixed and determined.

The reason that the same method works for both of these situations is that the distribution of the outcome of rolling a die is uniform on the set of possible values. Though this is unlikely to be helpful to many, you could think of this as a consequence of the law of large numbers. The latter, applied jointly to all possible values, says that you expect to have roughly equal numbers of each value when you take a large number of samples. If we refer to the strong law, this says that in fact, with probability one, we see this effect in the limit as we take increasingly large samples. Note that it is not trivial to apply the LLN jointly to all values for a general continuous random variable. The convergence of sample distribution functions to the cdf of the underlying distribution is the content of the Glivenko-Cantelli Theorem.

In any case, this won’t work when there isn’t this symmetry where all values are equally likely. So in general, we have to define the mean of a discrete random variable as

\mu=\sum k\mathbb{P}(X=k).

In other words, we are taking a sum of values multiplied by probabilities. By taking a suitable limit, a sum weighted by discrete probabilities converges to an integral weighted by a pdf. So this is a definition that will easily generalise.

Anyway, typically the next stage is to discuss the median. In the setting where we can define the mean directly as a sum of values, we must be given some list of values, which we can therefore write in ascending order. It’s then easy to define the median as the middle value in this ordered list. If the number of elements is odd, this is certainly well-defined. If the number is even, it is less clear. A lot of time at school was spent addressing this question, and the generally-agreed answer seemed to be that the mean of the middle two elements would do nicely. We shouldn’t waste any further time addressing this, as we are aiming for the continuous setting, where in general there won’t be discrete gaps between values in the support.

This led onwards to the dreaded box-and-whisker diagrams, which represent the min, lower quartile, median, upper quartile, and max in order. The diagram is structured to draw attention to the central points in the distribution, as these are in many applications of greater interest. The question of how to define the quartiles if the number of data points is not 3 modulo 4 is of exponentially less interest than the question of how to define the median for an even number of values, in my opinion. What is much more interesting is to note that the middle box of such a diagram would be finite for many continuous distributions with infinite support, such as the exponential distribution and the normal distribution.

Note that it is possible to construct any distribution as a function of a U[0,1] distribution by inverting the cdf. The box-and-whisker diagram essentially gives five points in this identification scheme.
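For instance, here is inverse-transform sampling in a couple of lines, for the Exp(1) distribution, whose cdf inverts explicitly (a minimal sketch):

```python
# Inverse-transform sampling for Exp(1): F(x) = 1 - e^{-x}, so F^{-1}(u) = -log(1-u).
import random
from math import log

rng = random.Random(0)
samples = sorted(-log(1 - rng.random()) for _ in range(100001))
print("sample median:", samples[50000], " vs theoretical median log(2) =", log(2))
```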

Obviously, the ordered list definition fails to work for such distributions. So we need a better definition of median, which generalises. We observe that half the values are greater than the median, and so in a probabilistic setting, we say that the probability of being less than the median is equal to the probability of being greater. So we want to define it implicitly as:

\mathbb{P}(X>M)=\mathbb{P}(X<M).

So for a continuous distribution without atoms,

\mathbb{P}(X>M)=\frac12,

and this uniquely defines M.

The natural question to start asking is how this compares to the mean. In particular, we want to discuss the relative sizes. Any result about the possible relative values of the mean and median can be reversed by considering the negation of the random variable, so we focus on continuous random variables with non-negative support. If nothing else, these are the conditions for lots of data we might be interested in sampling in the ‘real world’.

It’s worth having a couple of questions to clarify what we are interested in. How about: is it possible for the mean to be 1000 times larger than the median; and is it possible for the median to be 1000 times larger than the mean?

The latter is easier to address. If the median is 1000 and the mean is 1, then with probability ½ the random variable X is at least 1000. So these values make a contribution to the mean of at least 500, while the other values make a contribution of at least zero (since we’ve demanded the RV be positive). This is a contradiction.

The former question turns out to be possible. The motivation should come from our box-and-whisker diagram! Once we have fixed the middle box, the median and quartiles are fixed, but we are free to fiddle with the outer regions as much as we like, so by making the max larger and larger, we can increase the mean freely without affecting the median. Perhaps it is clearest to view a discrete example: 1, 2, N. The median will always be 2, so we can increase N as much as desired to get a large mean.

The answer to the second question is in a way more interesting, because it generalises to give a result about the tails of distributions. Viewing the median as the ½-quantile, we are saying that it cannot be too large relative to the mean. Markov’s inequality provides an analogous statement about a general quantile. Instead of thinking about the constant a in an a-quantile, we look at values in the support.

Suppose we want a bound on \mathbb{P}(X>a) for some positive a. Define the function f by

f(x)=a \mathbf{1}_{\{x\ge a\}},

so that f(x)\le x for all x\ge 0, and hence the mean of f(X) is at most the mean of X. But the mean of f(X) can be calculated as

a\mathbb{P}(X>a),

and so we conclude that

\mathbb{P}(X>a)\leq \frac{\mu}{a},

which is Markov’s Inequality.
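As a quick numerical check, for X\sim\text{Exp}(1) (so \mu=1) the exact tail is \mathbb{P}(X>a)=e^{-a}, which we can compare with the bound \mu/a:

```python
# Markov's inequality for Exp(1): exact tail e^{-a} versus the bound mu/a.
from math import exp

mu = 1.0
for a in [0.5, 1, 2, 5, 10]:
    print(f"a={a}: P(X>a) = {exp(-a):.5f} <= mu/a = {mu / a:.5f}")
```

Notice how much slack there is for large a here: the exponential tail is far lighter than the 1/a decay the bound provides, which is the weakness discussed next.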

It is worth remarking that this is trivially true when a\le \mu, since probabilities are always at most 1 anyway. Even beyond this region, it is generally quite weak. Note that it becomes progressively stronger if the contribution to the mean from terms greater than a is mainly driven by the contribution from terms close to a. So the statement is strong if the random variable has a light tail.

This motivates considering deviations from the mean, rather than the random variable itself. And to lighten the tail, we can square, for example, to consider the square distance from the mean. This version is Chebyshev’s Inequality:

\mathbb{P}(|X-\mu|^2>a\sigma^2)\le \frac{1}{a}.

Applying Markov’s inequality to an exponential function of a random variable is called a Chernoff Bound, and gives, in some sense, the best bound on the tails of a distribution obtainable in this way.

Large Deviations 1 – Motivation and Cramer’s Theorem

I’ve been doing a lot of thinking about Large Deviations recently, in particular how to apply the theory to random graphs and related models. I’ve just written an article about some of the more interesting aspects, so thought it was probably worth turning it into a few posts.

Motivation

Given X_1,X_2,\ldots i.i.d. real-valued random variables with finite expectation, and S_n:=X_1+\ldots+X_n, the Weak Law of Large Numbers asserts that the empirical mean \frac{S_n}{n} converges in probability to \mathbb{E}X_1. So \mathbb{P}(S_n\geq n(\mathbb{E}X_1+\epsilon))\rightarrow 0. In fact, if \mathbb{E}X_1^2<\infty, we have the Central Limit Theorem, and a consequence is that \mathbb{P}(S_n\geq n\mathbb{E}X_1+n^\alpha)\rightarrow 0 whenever \alpha>\frac12.

In a concrete example, if we toss a coin some suitably large number of times, the probability that the proportion of heads will be substantially greater or smaller than \frac12 tends to zero. So the probability that at least \frac34 of the results are heads tends to zero. But how fast? Consider first four tosses, then twelve. A quick addition of the relevant terms in the binomial distribution gives:

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of four tosses are heads}\right)=\frac{1}{16}+\frac{4}{16}=\frac{5}{16},

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of twelve tosses are heads}\right)=\frac{1}{2^{12}}+\frac{12}{2^{12}}+\frac{66}{2^{12}}+\frac{220}{2^{12}}=\frac{299}{2^{12}}.

There are two observations to be made. The first is that the second probability is substantially smaller than the first – the decay appears to be relatively fast. The second observation is that \frac{220}{2^{12}} is substantially larger than the rest of the sum. So by far the most likely way for at least \tfrac34 out of twelve tosses to be heads is if exactly \tfrac34 are heads. Cramer’s theorem applies to a general i.i.d. sequence of RVs, provided the tail is not too heavy. It shows that the probability of any such large deviation event decays exponentially with n, and identifies the exponent.

Theorem (Cramer): Let (X_i) be i.i.d. real-valued random variables which satisfy \mathbb{E}e^{tX_1}<\infty for every t\in\mathbb{R}. Then for any a>\mathbb{E}X_1,

\lim_{n\rightarrow \infty}\frac{1}{n}\log\mathbb{P}(S_n\geq an)=-I(a),

\text{where}\quad I(z):=\sup_{t\in\mathbb{R}}\left[zt-\log\mathbb{E}e^{tX_1}\right].

Remarks

  • So, informally, \mathbb{P}(S_n\geq an)\sim e^{-nI(a)}.
  • I(z) is called the Fenchel-Legendre transform (or convex conjugate) of \log\mathbb{E}e^{tX_1}.
  • Considering t=0 confirms that I(z)\in[0,\infty].
  • In their extremely useful book, Dembo and Zeitouni present this theorem in greater generality, allowing X_i to be supported on \mathbb{R}^d, considering a more general set of large deviation events, and relaxing the requirement for finite mean, and thus also the finite moment generating function condition. All of this will still be a special case of the Gartner-Ellis theorem, which will be examined in a subsequent post, so we make do with this form of Cramer’s result for now.
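Returning to the coin-tossing example: for Bernoulli(½) summands the supremum defining I can be computed explicitly, giving I(a)=\log 2+a\log a+(1-a)\log(1-a), and the exact binomial tails (of which the n=4 and n=12 values above are the first instances) can be compared with the predicted rate. A quick sketch:

```python
# Exact binomial tails versus the Cramer rate function, for a = 3/4.
# I(a) = log 2 + a log a + (1-a) log(1-a) for Bernoulli(1/2) summands.
from math import ceil, comb, log

def rate(a):
    return log(2) + a * log(a) + (1 - a) * log(1 - a)

a = 0.75
for n in [4, 12, 100, 1000]:
    tail = sum(comb(n, k) for k in range(ceil(a * n), n + 1)) / 2 ** n
    print(n, round(-log(tail) / n, 4))      # decreases towards I(3/4)
print("I(3/4) =", round(rate(a), 4))
```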

The proof of Cramer’s theorem splits into an upper bound and a lower bound. The former is relatively straightforward, applying Markov’s inequality to e^{tS_n}, then optimising over the choice of t. This idea is referred to by various sources as the exponential Chebyshev inequality or a Chernoff bound. The lower bound is more challenging. We reweight the distribution function F(x) of X_1 by a factor e^{tx}, then choose t so that the large deviation event is in fact now within the treatment of the CLT, from which suitable bounds are obtained.

To avoid overcomplicating this initial presentation, some details have been omitted. It is not clear, for example, whether I(x) should be finite whenever x is in the support of X_1. (It certainly must be infinite outside – consider the probability that 150% or -40% of coin tosses come up heads!) In order to call this a Large Deviation Principle, we also want some extra regularity on I(x), not least to ensure it is unique. This will be discussed in the next posts.

Effective Bandwidth

Here, devices have fixed capacity, but packet sizes are random. So, we still have a capacity constraint for the links, but we accept that it won’t be possible to ensure that we stay within those limits all the time, and seek instead to minimise the probability that the limits are exceeded, while keeping throughput as high as possible.

An important result is Chernoff’s Bound: \mathbb{P}(Y\geq 0)\leq \inf_{s\geq 0}\mathbb{E}e^{sY}. The proof is very straightforward: apply Markov’s inequality to the non-negative random variable e^{sY}. So in particular \frac{1}{n}\log\mathbb{P}(X_1+\ldots+X_n\geq 0)\leq \inf_{s\ge 0} M(s), where M(s)=\log\mathbb{E}e^{sX}, and Cramer’s Theorem asserts that equality holds after taking a limit in n on the LHS, provided \mathbb{E}X<0,\,\mathbb{P}(X>0)>0.

We assume that the traffic has the form S=\sum_{j=1}^J\sum_{i=1}^{n_j}X_{ji}, where the summands are independent, and (X_{ji})_i are iid for each j, interpreted as the loads of the n_j calls using source j. We have

\log\mathbb{P}(S\geq C)\leq\log \mathbb{E}[e^{s(S-C)}]=\sum_{j=1}^Jn_jM_j(s)-sC,

so that \inf_{s\ge 0}\Big(\sum n_jM_j(s)-sC\Big)\leq-\gamma\quad\Rightarrow\quad \mathbb{P}(S\geq C)\leq e^{-\gamma},

and we want this to hold for large \gamma.

We might then choose to restrict attention to

A=\{n\,:\,\sum n_jM_j(s)-sC\leq-\gamma,\text{ for some }s\geq 0\}.

Suppose we are operating near capacity, say with call profile n^* on (ie near) the boundary of A, and let s^* be the minimising value of s above. Then the tangent plane at n^* is \sum n_jM_j(s^*)-s^*C=-\gamma, and since the complement of A is convex, it suffices to stay on the ‘correct’ side (ie halfspace) of this tangent plane.

We can rewrite this constraint as \sum n_j\frac{M_j(s^*)}{s^*}\leq C-\frac{\gamma}{s^*}. Note that this is reasonable since s^* is fixed, and we call \alpha_j(s):=\frac{M_j(s)}{s} the effective bandwidth. It is with respect to this average that we are bounding probabilities, hence ‘effective’.

Observe that \alpha_j(s) is increasing, by Jensen: (\mathbb{E}e^X)^t\leq \mathbb{E}e^{tX} for t>1 implies that, for t>s>0, (\mathbb{E}e^{sX})^t\leq(\mathbb{E}e^{tX})^s.

In particular,

\mathbb{E}X\leq \alpha_j(s)\leq \text{ess sup}\,X, with these bounds attained in the limits s\rightarrow 0 and s\rightarrow\infty respectively.
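For example, for \text{Bernoulli}(p) loads the effective bandwidth is explicit, and we can watch it increase from \mathbb{E}X=p towards \text{ess sup}\,X=1 (a quick sketch, with p=0.3 as a hypothetical value):

```python
# Effective bandwidth of a Bernoulli(p) load: alpha(s) = log E[e^{sX}] / s.
from math import exp, log

def alpha(s, p=0.3):
    return log(1 - p + p * exp(s)) / s

for s in [0.01, 0.1, 1, 10, 100]:
    print(f"s = {s}: alpha = {alpha(s):.4f}")
# increases from near 0.3 (= E[X]) towards 1 (= ess sup X)
```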