Lecture 3 – Couplings, comparing distributions

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

In this third lecture, we made our first foray into the scaling regime for G(n,p) which will be the main focus of the course, namely the sparse regime when $p=\frac{\lambda}{n}$. The goal for today was to give a self-contained proof of the result that in the subcritical setting $\lambda<1$, there is no giant component, that is, a component supported on a positive proportion of the vertices, with high probability as $n\rightarrow\infty$.

More formally, we showed that the proportion of vertices contained within the largest component of $G(n,\frac{\lambda}{n})$ vanishes in probability:

$\frac{1}{n} \left| L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right| \stackrel{\mathbb{P}}\longrightarrow 0.$

The argument for this result involves an exploration process of a component of the graph. This notion will be developed more formally in future lectures, aiming for good approximation rather than bounding arguments.

But for now, the key observation is that when we ‘explore’ the component of a uniformly chosen vertex $v\in[n]$ outwards from v, at all times the number of ‘children’ of v which haven’t already been considered is ‘at most’ $\mathrm{Bin}(n-1,\frac{\lambda}{n})$. Since, for example, if we already know that eleven vertices, including the current one w are in C(v), then the distribution of the number of new vertices to be added to consideration because they are directly connected to w has conditional distribution $\mathrm{Bin}(n-11,\frac{\lambda}{n})$.

Firstly, we want to formalise the notion that this is ‘less than’ $\mathrm{Bin}(n,\frac{\lambda}{n})$, and also that, so long as we don’t replace 11 by a linear function of n, that $\mathrm{Bin}(n-11,\frac{\lambda}{n})\stackrel{d}\approx \mathrm{Poisson}(\lambda)$.

Couplings to compare distributions

coupling of two random variables (or distributions) X and Y is a realisation $(\hat X,\hat Y)$ on the same probability space with correct marginals, that is

$\hat X\stackrel{d}=X,\quad \hat Y\stackrel{d}=Y.$

We saw earlier in the course that we could couple G(n,p) and G(n,q) by simulating both from the same family of uniform random variables, and it’s helpful to think of this in general: ‘constructing the distributions from the same source of randomness’.

Couplings are a useful notion to digest at this point, as they embody a general trend in discrete probability theory. Wherever possible, we try to do as we can with the random objects, before starting any calculations. Think about the connectivity property of G(n,p) as discussed in the previous lecture. This can be expressed directly as a function of p in terms of a large sum, but showing it is an increasing function of p is essentially impossible by computation, whereas this is very straightforward using the coupling.

We will now review how to use couplings to compare distributions.

For a real-valued random variable X, with distribution function $F_X$, we always have the option to couple with a uniform U(0,1) random variable. That is, when $U\sim U[0,1]$, we have $(F_X^{-1}(U)\stackrel{d}= X$, where the inverse of the distribution function is defined (in the non-obvious case of atoms) as

$F_X^{-1}(u)=\inf\left\{ x\in\mathbb{R}\,:\, F(x)\ge u\right\}.$

Note that when the value taken by U increases, so does the value taken by $F_X^{-1}(U)$. This coupling can be used simultaneously on two random variables X and Y, as $(F_X^{-1}(U),F_Y^{-1}(U))$, to generate a coupling of X and Y.

The total variation distance between two probability measures is

$d_{\mathrm{TV}}(\mu,\nu):= \sup_{A}|\mu(A)-\nu(A)|$,

with supremum taken over all events in the joint support S of $\mu,\nu$. This is particularly clear in the case of discrete measures, as then

$d_{\mathrm{TV}}(\mu,\nu)=\frac12 \sum_{x\in S} \left| \mu\left(\{x\}\right) - \nu\left(\{x\}\right) \right|.$

(Think of the difference in heights between the bars, when you plot $\mu,\nu$ simultaneously as a bar graph…)

The total variation distances records how well we can couple two distributions, if we want them to be equal as often as possible. It is therefore a bad measure of distributions with different support. For example, the distributions $\delta_0$ and $\delta_{1/n}$ are distance 1 apart (the maximum) for all values of n. Similarly, the uniform distribution on [0,1] and the uniform distribution on $\{0,1/n,2/n,\ldots, n-1/n, 1\}$ are also distance 1 apart.

When there is more overlap, the following result is useful.

Proposition: Any coupling $(\hat X,\hat Y)$ of $X\sim \mu,\,Y\sim \nu$ satisfies $\mathbb{P}(X=Y)\le 1-d_{\mathrm{TV}}(\mu,\nu)$, and there exists a coupling such that equality is achieved.

Proof: The first part of the resultis shown by

$\mathbb{P}\left(\hat X=\hat Y = x\right)\le \mathbb{P}\left(\hat X=x\right) = \mu\left(\{x\}\right),$

and so we may also bound by $\nu\left(\{x\}\right)$. Thus

$\mathbb{P}\left(\hat X=\hat Y\right) = \sum_x \min\left( \mu\left(\{x\}\right),\, \nu\left(\{x\}\right) \right),$

$\mathbb{P}\left(\hat X \ne\hat Y\right) = 1 - \sum_x \min\left(\mu\left(\{x\}\right),\, \nu\left(\{x\}\right) \right)$

$= \sum_x \left[ \frac12 \mu(\{x\}) + \frac12 \nu(\{x\}) - \min\left( \mu(\{x\}), \,\nu(\{x\})\right) \right] = \frac12 \sum_x \left| \mu(\{x\}) - \nu(\{x\}) \right|.$

The second part of the theorem is more of an exercise in notation, and less suitable for this blog article.

Crucially, observe that the canonical coupling discussed above is definitely not necessarily the best coupling for total variation distance. For example, the uniform distribution on $\{0,1,2,\ldots, n-1\}$ and the uniform distribution on $\{1,2,\ldots, n-1, n\}$ are very close in the total variation distance, but under the canonical coupling involving the distribution function, they are never actually equal!

Stochastic domination

Given two random variables X and Y, we say that Y stochastically dominates X if

$\mathbb{P}(X\ge x)\le \mathbb{P}(Y\ge x),\quad \forall x\in\mathbb{R}$,

and write $X\le_{st}Y$. If we are aiming for results in probability, then it is easy to read results in probability for X off from results in probability for Y, at least in one direction. For this relation, it turns out that the canonical coupling is particularly useful, unsurprisingly so since it is defined in terms of the distribution function, which precisely describes the tail probabilities. In fact the reverse implication is true too, as stated in the following theorem.

Theorem (Strassen): $X\le_{st} Y$ if and only if there exists a coupling $(\hat X,\hat Y)$ such that $\mathbb{P}(\hat X\le \hat Y)=1$.

Proof: The reverse implication is clear since

$\mathbb{P}\left(\hat X \ge x\right) = \mathbb{P}\left( \hat Y\ge \hat X\ge x\right) \le \mathbb{P}\left( \hat Y \ge x\right),$

and $\hat X,\,\hat Y$ have the correct marginals.

Examples: In many situations, the coupling description makes it much easier to verify the stochastic domination relation.

• $\lambda\le \mu$ implies $\mathrm{Poisson}(\lambda)\le_{st} \mathrm{Poisson}(\mu)$. Checking the tail of the Poisson mass function would be a bit of a nightmare, but since we have the superposition coupling $\mathrm{Po}(\mu)\stackrel{d}= \mathrm{Po}(\lambda) + \mathrm{Po}(\mu-\lambda)$, for independent RVs on the RHS, we can read it off.
• Similarly $\mathrm{Bin}(n,p)\le \mathrm{Bin}(m,q)$ when $n\le m,\, p\le q$. Again, checking the tail would be computationally-annoying.
• One of the most useful examples is the size-biased distribution, obtained from a distribution X by $\mathbb{P}(X_{sb}=x) \propto x\mathbb{P}(X=x)$, when X is a non-negative RV. As discussed on one of the exercises, |C(v)|, which we study repeatedly, can be expressed as the size-biased version of a uniform choice from amongst the list of component sizes in the graph. We have $X\le_{st} X_{sb}$, which is intuitively reasonable since the size-biased distribution ‘biases in favour of larger values’.

1) The fact that we study the component $C(v)$ containing a uniformly chosen vertex $v\in[n]$ is another example of ‘revealing’ the total randomness in a helpful order. Note that $|C(v)|$ requires extra randomness on top of the randomness of the graph itself. When analysing $C(v)$ it was convenient to sample the graph section-by-section having settled on v. However, for the final comparison of $|C(v)|$ and $|L_1(G)|$, the opposite order is helpful.

Since, whenever there is a component of size at least $\epsilon n$, the probability that we pick v within that component is at least $\frac{\epsilon n}{n}$. Thus we can conclude that

$\mathbb{P}\left(|C(v)|\ge \epsilon n\right) \ge \frac{\epsilon n}{n}\mathbb{P}\left( \left|L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right| \ge \epsilon n\right).$

Here, since we are aiming for convergence in probability, $\epsilon>0$ is fixed, and

$\frac{1}{n}\left|L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right|\stackrel{\mathbb{P}}\longrightarrow 0,$

follows from

$\frac{1}{n}\left| C(v) \right| \stackrel{\mathbb{P}}\longrightarrow 0,$

which we have shown using the exploration process.

2) In fact, the final step of the argument can be strengthened to show that

$\lim_{K\rightarrow\infty}\limsup_{n\rightarrow\infty}\mathbb{P}\left(\left|L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right)\ge K\log n\right)=0,$

though this requires familiarity with Chernoff bounds / Large Deviations estimates, which would have been a distraction to have introduced at exactly this point in the course. Showing that the size of the largest component is concentrated in probability on the scale of $\log n$ is also possible, though for this we obviously need to track exact errors, rather than just use bounding arguments.

3) There are many more ways to compare distributions, with and without couplings, some of which have been of central importance in studying the age distributions corresponding to mean field forest fires, as introduced in a recent paper with Ed Crane and Balazs Rath, which can, as of this morning, be found on Arxiv. In the interests of space, further discussion of more exotic measures on distributions is postponed to a separate post.