# Lecture 3 – Couplings, comparing distributions

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

In this third lecture, we made our first foray into the scaling regime for G(n,p) which will be the main focus of the course, namely the sparse regime when $p=\frac{\lambda}{n}$. The goal for today was to give a self-contained proof of the result that in the subcritical setting $\lambda<1$, there is no giant component, that is, a component supported on a positive proportion of the vertices, with high probability as $n\rightarrow\infty$.

More formally, we showed that the proportion of vertices contained within the largest component of $G(n,\frac{\lambda}{n})$ vanishes in probability:

$\frac{1}{n} \left| L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right| \stackrel{\mathbb{P}}\longrightarrow 0.$

The argument for this result involves an exploration process of a component of the graph. This notion will be developed more formally in future lectures, aiming for good approximation rather than bounding arguments.

But for now, the key observation is that when we ‘explore’ the component of a uniformly chosen vertex $v\in[n]$ outwards from v, at all times the number of ‘children’ of the current vertex which haven’t already been considered is ‘at most’ $\mathrm{Bin}(n-1,\frac{\lambda}{n})$. For example, if we already know that eleven vertices, including the current one w, are in C(v), then the number of new vertices to be added to consideration because they are directly connected to w has conditional distribution $\mathrm{Bin}(n-11,\frac{\lambda}{n})$.

Firstly, we want to formalise the notion that this is ‘less than’ $\mathrm{Bin}(n,\frac{\lambda}{n})$, and also that, so long as we don’t replace 11 by a linear function of n, $\mathrm{Bin}(n-11,\frac{\lambda}{n})\stackrel{d}\approx \mathrm{Poisson}(\lambda)$.
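To get a feel for the Poisson approximation, here is a minimal numerical sketch that computes the total variation distance between $\mathrm{Bin}(n-k,\frac{\lambda}{n})$ and $\mathrm{Poisson}(\lambda)$. The function names and the choices $k=11$, $\lambda=0.8$ are my own illustrative assumptions, and the mass functions are evaluated in log space purely to avoid overflow.

```python
import math

def log_binom_pmf(m, p, j):
    """log P(Bin(m, p) = j), for 0 < p < 1, computed in log space."""
    log_choose = math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)
    return log_choose + j * math.log(p) + (m - j) * math.log1p(-p)

def log_poisson_pmf(lam, j):
    """log P(Poisson(lam) = j), for lam > 0."""
    return -lam + j * math.log(lam) - math.lgamma(j + 1)

def tv_binomial_poisson(n, k, lam):
    """Total variation distance between Bin(n - k, lam / n) and Poisson(lam)."""
    m = n - k
    p = lam / n
    l1_difference = 0.0
    poisson_mass_seen = 0.0
    for j in range(m + 1):
        b = math.exp(log_binom_pmf(m, p, j))
        q = math.exp(log_poisson_pmf(lam, j))
        l1_difference += abs(b - q)
        poisson_mass_seen += q
    # Poisson mass above m has no binomial counterpart.
    l1_difference += max(0.0, 1.0 - poisson_mass_seen)
    return 0.5 * l1_difference

for n in (100, 1000, 10000):
    print(n, tv_binomial_poisson(n, 11, 0.8))
```

The printed distances should shrink as n grows with k fixed, which is the sense in which replacing n by n-11 does not matter in the limit.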

Couplings to compare distributions

A coupling of two random variables (or distributions) X and Y is a realisation $(\hat X,\hat Y)$ on the same probability space with correct marginals, that is

$\hat X\stackrel{d}=X,\quad \hat Y\stackrel{d}=Y.$

We saw earlier in the course that we could couple G(n,p) and G(n,q) by simulating both from the same family of uniform random variables, and it’s helpful to think of this in general: ‘constructing the distributions from the same source of randomness’.

Couplings are a useful notion to digest at this point, as they embody a general trend in discrete probability theory. Wherever possible, we try to do as much as we can with the random objects, before starting any calculations. Think about the connectivity property of G(n,p) as discussed in the previous lecture. This can be expressed directly as a function of p in terms of a large sum, but showing it is an increasing function of p is essentially impossible by computation, whereas this is very straightforward using the coupling.
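As a concrete sketch of ‘constructing from the same source of randomness’, the following samples G(n,p) and G(n,q) from one family of uniforms, one per potential edge. The function names and parameter values are illustrative assumptions of mine. Under this coupling, for $p\le q$ the first graph is always a subgraph of the second, so connectivity of the first forces connectivity of the second.

```python
import random
from itertools import combinations

def coupled_graphs(n, p, q, rng):
    """Sample G(n,p) and G(n,q) from the same uniforms, one per potential edge."""
    uniforms = {edge: rng.random() for edge in combinations(range(n), 2)}
    graph_p = {edge for edge, u in uniforms.items() if u < p}
    graph_q = {edge for edge, u in uniforms.items() if u < q}
    return graph_p, graph_q

def is_connected(n, edges):
    """Depth-first search from vertex 0 on the vertex set {0, ..., n-1}."""
    neighbours = {v: [] for v in range(n)}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in neighbours[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

rng = random.Random(0)
for _ in range(200):
    g_p, g_q = coupled_graphs(12, 0.2, 0.35, rng)
    # Edge sets are nested, so connectivity can only be gained, never lost.
    assert g_p <= g_q
    assert (not is_connected(12, g_p)) or is_connected(12, g_q)
```

No calculation with the explicit connectivity probability is needed: monotonicity in p is read off directly from the nesting of the edge sets.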

We will now review how to use couplings to compare distributions.

For a real-valued random variable X, with distribution function $F_X$, we always have the option to couple with a uniform U(0,1) random variable. That is, when $U\sim U[0,1]$, we have $F_X^{-1}(U)\stackrel{d}= X$, where the inverse of the distribution function is defined (in the non-obvious case of atoms) as

$F_X^{-1}(u)=\inf\left\{ x\in\mathbb{R}\,:\, F_X(x)\ge u\right\}.$

Note that when the value taken by U increases, so does the value taken by $F_X^{-1}(U)$. This coupling can be used simultaneously on two random variables X and Y, as $(F_X^{-1}(U),F_Y^{-1}(U))$, to generate a coupling of X and Y.
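Here is a small sketch of this quantile coupling for two binomial distributions, with names and parameters chosen by me for illustration. Since $\mathrm{Bin}(5,0.3)$ is stochastically smaller than $\mathrm{Bin}(8,0.3)$, feeding the same uniform into both generalised inverses should produce a pair which is ordered for every value of the uniform.

```python
import math
import random

def binomial_pmf(m, p):
    """Mass function of Bin(m, p) as a list indexed by 0, 1, ..., m."""
    return [math.comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(m + 1)]

def quantile(pmf, u):
    """Generalised inverse inf{x : F(x) >= u} for a pmf on 0, 1, ..., len(pmf) - 1."""
    cumulative = 0.0
    for x, mass in enumerate(pmf):
        cumulative += mass
        if cumulative >= u:
            return x
    # Guard against floating-point rounding of the total mass.
    return len(pmf) - 1

pmf_x = binomial_pmf(5, 0.3)
pmf_y = binomial_pmf(8, 0.3)

rng = random.Random(1)
for _ in range(5):
    u = rng.random()
    print(u, quantile(pmf_x, u), quantile(pmf_y, u))
```

Each printed pair uses a single uniform, so the two coordinates are far from independent, but each has the correct marginal law.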

The total variation distance between two probability measures is

$d_{\mathrm{TV}}(\mu,\nu):= \sup_{A}|\mu(A)-\nu(A)|$,

with supremum taken over all events in the joint support S of $\mu,\nu$. This is particularly clear in the case of discrete measures, as then

$d_{\mathrm{TV}}(\mu,\nu)=\frac12 \sum_{x\in S} \left| \mu\left(\{x\}\right) - \nu\left(\{x\}\right) \right|.$

(Think of the difference in heights between the bars, when you plot $\mu,\nu$ simultaneously as a bar graph…)

The total variation distance records how well we can couple two distributions, if we want them to be equal as often as possible. It is therefore a bad measure of distributions with different support. For example, the distributions $\delta_0$ and $\delta_{1/n}$ are distance 1 apart (the maximum) for all values of n. Similarly, the uniform distribution on [0,1] and the uniform distribution on $\{0,1/n,2/n,\ldots, (n-1)/n, 1\}$ are also distance 1 apart.

When there is more overlap, the following result is useful.

Proposition: Any coupling $(\hat X,\hat Y)$ of $X\sim \mu,\,Y\sim \nu$ satisfies $\mathbb{P}(\hat X=\hat Y)\le 1-d_{\mathrm{TV}}(\mu,\nu)$, and there exists a coupling such that equality is achieved.
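For discrete measures, one standard way to achieve equality is to put as much mass as possible on the diagonal, and then spread the leftover mass as a product. The following is a sketch of that construction with exact rational arithmetic. The function names and the two example measures are my own assumptions, not taken from the lecture.

```python
from fractions import Fraction

def tv_distance(mu, nu):
    """Half the L1 distance between two mass functions given as dicts."""
    support = set(mu) | set(nu)
    return sum(abs(mu.get(x, 0) - nu.get(x, 0)) for x in support) / 2

def maximal_coupling(mu, nu):
    """Joint mass function {(x, y): mass} with marginals mu and nu.

    The diagonal carries min(mu(x), nu(x)); the excess masses are paired
    as a product, normalised by the total variation distance.
    """
    support = sorted(set(mu) | set(nu))
    d = tv_distance(mu, nu)
    overlap = {x: min(mu.get(x, 0), nu.get(x, 0)) for x in support}
    joint = {(x, x): overlap[x] for x in support if overlap[x] > 0}
    if d > 0:
        for x in support:
            excess_mu = mu.get(x, 0) - overlap[x]
            for y in support:
                excess_nu = nu.get(y, 0) - overlap[y]
                if excess_mu > 0 and excess_nu > 0:
                    joint[(x, y)] = excess_mu * excess_nu / d
    return joint

mu = {0: Fraction(1, 2), 1: Fraction(1, 2)}
nu = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}
joint = maximal_coupling(mu, nu)
agree = sum(mass for (x, y), mass in joint.items() if x == y)
print(tv_distance(mu, nu), agree)
```

In this example the two printed values sum to 1, as the proposition predicts for a coupling achieving equality.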

# Non-separable Skorohod Representations

In the previous post, I discussed the statement and proof of the Skorohod representation theorem. This concerns the conditions under which it is possible to couple distributions which converge in law, to obtain a family of random variables on a possibly very large probability space, which converge almost surely. The condition for the theorem to hold is that the base space, or at least the support of the limiting distribution should be a separable metric space. Skorohod’s original proof concerned the case where all the distributions were supported on a complete, separable metric space (Polish space), but this extension is not particularly involved, and was proven not long after the original result.

It is natural to ask exactly what goes wrong in non-separable or non-metrizable spaces. Recall a space is separable if it contains a countable dense subset. Obviously, finite or countable sets are by definition separable with any metric. Considering the points with rational coordinates shows that $\mathbb{R}^d$ is separable for each d, and the Stone-Weierstrass theorem shows that continuous functions on a bounded interval are also separable with the uniform topology, as they can be approximated uniformly well by polynomials with rational coefficients. One heuristic is that a separable space does not have ‘too many’ open sets.

There are references (for example, see [2]) to examples of Skorohod non-representation in non-metrizable topological spaces, which are ‘big’ enough to allow convergence in distribution with respect to a particular class of test functions, but where the distributions are not uniformly tight, so cannot converge almost surely. However, I don’t really understand this well at all, and have struggled to chase the references, some of which are unavailable, and some in French.

Instead, I want to talk about an example given in [1] of a family of distributions on a non-separable space, which cannot be coupled to converge almost surely. The space is (0,1) equipped with the discrete metric, which says that $d(x,y)=1$ whenever $x\ne y$. Note that it is very hard to have even deterministic convergence in this space, since the only way to be close to an element of the space is indeed to be equal to that element. We will construct random variables and it will be unsurprising that they cannot possibly converge almost surely in any coupling, but the exact nature of the construction will lead to convergence in distribution.

Based on what we proved last time, the support of the limiting distribution will be non-separable. It turns out that the existence of such a distribution is equiconsistent in the sense of formal logic with the existence of an extension of Lebesgue measure to the whole power set of (0,1). This is not allowed under the Axiom of Choice, but is consistent under the slightly weaker Axiom of Dependent Choice (DC). This weaker condition says, translated into language more familiar to me, that every directed graph with arbitrary (and in particular, potentially uncountable) vertex set, and with all out-degrees at least 1 contains an infinite directed path. This seems obvious when viewed through the typically countable context of graph theory. But the natural construction is to start somewhere and ‘just keep going’ wherever possible, which involves making a choice from the out-neighbourhood at lots of vertices. Thus it is clear why this is weaker than AC. Anyway, in the sequel, we assume that this extension of Lebesgue measure exists.

Example (from [1]): We take $(X_n)_{n\ge 1}$ to be an IID sequence of non-negative RVs defined on the probability space $((0,1),\mathcal{B}(0,1),\mathrm{Leb})$, with expectation under Lebesgue measure equal to 1. It is not obvious how to do this, with the restriction on the probability space. One example might be to write $\omega\in(0,1)$ as $\overline{\omega_1\omega_2\ldots}$, the binary expansion, and then set $X_n=2\omega_n$. We will later require that $X_n$ is not identically 1, which certainly holds in this example just given.

Let $\mu$ be the extension of Lebesgue measure to the power set $\mathcal{P}=\mathcal{P}(0,1)$. Now define the measures:

$\mu_n(B)=\mathbb{E}_\mu(X_n \mathbf{1}_B),\quad \forall B\in\mathcal{P}.$

To clarify, we are defining a family of measures which also are defined for all elements of the power set. We have defined them in a way that is by definition a coupling. This will make it possible to show convergence in distribution, but they will not converge almost surely in this coupling, or, in fact, under any coupling. Now consider a restricted class of sets, namely $B\in \sigma(X_1,\ldots,X_k)$, the class of sets distinguishable by the outcomes of the first k RVs.

[Caution: the interpretation of this increasing filtration is a bit different to the standard setting with for example Markov processes, as the sets under consideration are actually subsets of the probability space on which everything is defined. In particular, there is no notion that a ‘fixed deterministic set’ lies in all the layers of the filtration.]

Anyway, by independence, when n>k, we have

$\mu_n(B)=\mathbb{E}_\mu(X_n\mathbf{1}_B)=\mathbb{E}_\mu(X_n)\mathbb{E}_\mu(\mathbf{1}_B)=\mu(B).$
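For the binary-digit example $X_n=2\omega_n$, this identity can be checked exactly on sets determined by finitely many digits, since such sets are unions of dyadic intervals. The sketch below does this with rational arithmetic. The helper names, the depth 6, and the particular set B (first digit 1, second digit 0) are my own illustrative choices.

```python
from fractions import Fraction

def digit(j, n, depth):
    """n-th binary digit (n = 1 is most significant) of the dyadic interval with index j."""
    return (j >> (depth - n)) & 1

def digits_of(j, depth):
    return tuple(digit(j, i, depth) for i in range(1, depth + 1))

def lebesgue(event, depth):
    """Lebesgue measure of a set described by a predicate on the first `depth` digits."""
    hits = sum(1 for j in range(2**depth) if event(digits_of(j, depth)))
    return Fraction(hits, 2**depth)

def mu_n(n, event, depth):
    """mu_n(B) = E[X_n 1_B] with X_n = 2 * omega_n; requires n <= depth."""
    total = Fraction(0)
    for j in range(2**depth):
        digits = digits_of(j, depth)
        if event(digits):
            total += Fraction(2 * digits[n - 1], 2**depth)
    return total

def in_b(digits):
    """B is determined by the first k = 2 digits: omega_1 = 1 and omega_2 = 0."""
    return digits[0] == 1 and digits[1] == 0

print(lebesgue(in_b, 6))
print([mu_n(n, in_b, 6) for n in range(1, 7)])
```

For n equal to 1 or 2 the value of $\mu_n(B)$ differs from $\mu(B)$, but from n=3 onwards the two agree, as the independence argument says they must.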

So whenever $B\in\mathcal{F}:=\bigcup_k \sigma(X_1,\ldots,X_k)$, $\lim_n \mu_n(B)=\mu(B)$. By MCT, we can extend this convergence to any bounded $\mathcal F$-measurable function.

This is the clever bit. We want to show that $\mu_n(B)\rightarrow\mu(B)$ for all $B\in\mathcal P$, but we only have it so far for $B\in\mathcal F$. But since $\mathcal{F}\subset \mathcal P$, which is the base field of the probability space under the (non-AC) assumption, we can take conditional expectations. In particular for any $B\in\mathcal P$, $\mathbb{E}_\mu[\mathbf{1}_B | \mathcal{F}]$ is a bounded, $\mathcal F$-measurable function. Hence, by definition of $\mu_n$ and the extended MCT result:

$\mu_n(B)=\mathbb{E}_\mu[X_n\mathbb{E}_\mu[\mathbf{1}_B|\mathcal F]]=\mathbb{E}_{\mu_n}[\mathbb{E}_\mu[\mathbf{1}_B|\mathcal F]] \rightarrow \mathbb{E}_\mu [\mathbb{E}_\mu[\mathbf{1}_B |\mathcal{F}]].$

Now, since by definition $\mathbf{1}_B$ is $\mathcal{P}$-measurable, applying the tower law gives that this is equal to $\mu(B)$. So we have

$\mu_n(B)\rightarrow \mu(B),\quad \forall B\in\mathcal{P}.$ (*)

This gives weak convergence $\mu_n\Rightarrow \mu$. At first glance it might look like we have proved a much stronger condition than we need. But recall that in any set equipped with the discrete topology, any set is both open and closed, and so to use the portmanteau lemma, (*) really is required.

Now we have to check that we can’t have almost sure convergence in any coupling of these measures. Suppose that we have a probability space with random variables $Y,(Y_n)$ satisfying $\mathcal L(Y)=\mu, \mathcal L(Y_n)=\mu_n$. But citing the example I gave of $X_n$ satisfying the conditions, the only values taken by $Y_n$ are 0 and 2, and irrespective of the coupling,

$\mathbb{P}(Y_n=2\text{ infinitely often})>0.$

So it is impossible that $Y_n$ can converge almost surely to any random variable supported on (0,1).

References

[1] Berti, Pratelli, Rigo – Skorohod Representation and Disintegrability (here – possibly not open access)

[2] Jakubowski – The almost sure Skorokhod representation for subsequences in non-metric spaces.

# Skorohod Representation Theorem

Continuing the theme of revising theory in the convergence of random processes that I shouldn’t have forgotten so rapidly, today we consider the Skorohod Representation Theorem. Recall from the standard discussion of the different modes of convergence of random variables that almost sure convergence is among the strongest since it implies convergence in probability and thus convergence in distribution. (But not convergence in $L_1$. For example, take U uniform on [0,1], and $X_n=n\mathbf{1}_{\{U<\frac{1}{n}\}}$.)
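The counterexample in brackets is easy to check by hand, and the following sketch just makes it concrete. The function names and the sample point $u=3/10$ are my own choices. For any fixed $u>0$ the path $X_n(u)$ is eventually zero, while the expectation stays equal to 1 for every n.

```python
from fractions import Fraction

def x_n(n, u):
    """X_n = n * 1{U < 1/n}, evaluated at the outcome U = u."""
    return n if u < Fraction(1, n) else 0

def expectation(n):
    """E[X_n] = n * P(U < 1/n) for U uniform on [0,1]."""
    return n * Fraction(1, n)

u = Fraction(3, 10)
path = [x_n(n, u) for n in range(1, 9)]
print(path)                                   # [1, 2, 3, 0, 0, 0, 0, 0]
print([expectation(n) for n in range(1, 9)])  # every entry equals 1
```

So $X_n\rightarrow 0$ almost surely, but $\mathbb{E}|X_n-0|=1$ does not tend to zero.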

Almost sure convergence is therefore in some sense the most useful form of convergence to have. However, it comes with a strong prerequisite, that the random variables be defined on the same probability space, which is not required for convergence in distribution. Indeed, one can set up weak versions of convergence in distribution which do not even require the convergents to be random variables. The Skorohod representation theorem gives a partial converse to this result. It states some conditions under which random variables which converge in distribution can be coupled on some larger probability space to obtain almost sure convergence.

Skorohod’s original proof dealt with convergence of distributions defined on complete, separable metric spaces (Polish spaces). The version discussed here is from Chapter 5 of Billingsley [1], and assumes the limiting distribution has separable support. More recent authors have considered stronger convergence conditions (convergence in total variation or Wasserstein distance, for example) with weaker topological requirements, and convergence of random variables defined in non-metrizable spaces.

Theorem (Skorohod representation theorem): Suppose that distributions $P_n\Rightarrow P$, where P is a distribution with separable support. Then we can define a probability space $(\Omega,\mathcal{F},\mathbb{P})$ and random variables $X,(X_n)_{n\ge 1}$ on this space such that the laws of $X,X_n$ are $P,P_n$ respectively and $X_n(\omega)\rightarrow X(\omega)$ for all $\omega\in\Omega$.

NB. We are proving ‘sure convergence’ rather than merely almost sure convergence! It is not surprising that this is possible, since changing the value of all the $X_n$s on a set with measure zero doesn’t affect the conditions for convergence in distribution.

Applications: Before going through the Billingsley proof, we consider one simple application of this result. Let S be a separable metric space containing the support of X, and g a continuous function $S\rightarrow S'$. Then

$X_n\stackrel{a.s.}{\rightarrow}X\quad\Rightarrow\quad g(X_n)\stackrel{a.s.}{\rightarrow}g(X).$

So, by applying the Skorohod representation theorem once, and the result that almost sure convergence implies convergence in distribution, we have shown that

$X_n\stackrel{d}{\rightarrow}X\quad\Rightarrow\quad g(X_n)\stackrel{d}{\rightarrow}g(X),$

subject to these conditions on the space supporting X. And we have avoided the need to be careful about exactly which class of functions determine convergence in distribution, as would be required for a direct argument.

Proof (from [1]): Unsurprisingly, the idea is to construct realisations of the $(X_n)$ from a realisation of X. We take X, and a partition of the support of X into small measurable sets, chosen so that the probability of lying in a particular set is almost the same for $X_n$ as for X, for large n. Then, the $X_n$ are constructed so that for large n, with limitingly high probability $X_n$ lies in the same small set as X.

Constructing the partition is the first step. For each $x\in S:=\mathrm{supp}(X)$, there must be some radius $\frac{\epsilon}{4}<r_x<\frac{\epsilon}{2}$ such that $P(\partial B(x,r_x))=0$. This is where we use separability. Since every point in the space is within $\frac{\epsilon}{4}$ of some element of a countable sequence of elements of the space, we can take a countable subset of these open balls $B(x,r_x)$ which cover the space. Furthermore, we can take a finite subset of the balls which cover all of the space apart from a set of measure at most $\epsilon$. We want the sets to be disjoint, and we can achieve this by removing the intersections inductively in the obvious way. We end up with a collection $B_0,B_1,\ldots,B_k$, where $B_0$ is the leftover space, such that

• $P(B_0)<\epsilon$
• $P(\partial B_i)=0,\quad i=0,1,\ldots,k$
• $\mathrm{diam}(B_i)<\epsilon,\quad i=1,\ldots,k$.

Now suppose for each m, we take such a partition $B^m_0,B^m_1,\ldots,B^m_{k_m}$, for which $\epsilon_m=\frac{1}{2^m}$. Unsurprisingly, this scaling of $\epsilon$ is chosen so as to use Borel-Cantelli at the end. Then, from convergence in distribution, there exists an integer $N_m$ such that for $n\ge N_m$, we have

$P_n(B^m_i)\ge (1-\epsilon_m)P(B^m_i),\quad i=0,1,\ldots,k_m.$ (*)

Now, for $N_m\le n<N_{m+1}$, for each $B^m_i$ with non-zero probability under P, take $Y_{n,i}$ to be independent random variables with law $P_n(\cdot | B^m_i)$ equal to the restriction onto the set. Now take $\xi\sim U[0,1]$ independent of everything so far. Now we make concrete the heuristic for constructing $X_n$ from X. We define:

$X_n=\sum_{i=0}^{k_m}\mathbf{1}_{\{\xi\le 1-\epsilon_m, X\in B^m_i\}} Y_{n,i} + \mathbf{1}_{\{\xi>1-\epsilon_m\}}Z_n.$

We haven’t defined $Z_n$ yet. But, from (*), there is a unique distribution such that taking $Z_n$ to be independent of everything so far, with this distribution, we have $\mathcal{L}(X_n)=P_n$. Note that by iteratively defining random variables which are independent of everything previously defined, our resulting probability space $\Omega$ will be a large product space.

Note that $\xi$ controls whether the $X_n$ follow the law we have good control over, and we also want to avoid the set $B^m_0$. So define $E_m:=\{X\not \in B^m_0, \xi\le 1-\epsilon_m\}$. Then, $P(E_m^c)<2\epsilon_m=2^{-(m-1)}$, and so by Borel-Cantelli, with probability 1, $E_m$ holds for all m larger than some threshold. Let us call this $\liminf_m E_m=: E$, and on this event E, we have by definition $X_n \rightarrow X$. So we have almost sure convergence. But we can easily convert this to sure convergence by removing all $\omega\in\Omega$ for which $\xi(\omega)=1$ and setting $X_n\equiv X$ on $E^c$, as this does not affect the distributions.

Omissions:

• Obviously, I have omitted the exact construction of the distribution of $Z_n$. This can be reverse reconstructed very easily, but requires more notation than is ideal for this medium.
• It is necessary to remove any sets $B^m_i$ with zero measure under P for the conditioning to make sense. These can be added to $B^m_0$ without changing any of the required conditions.
• We haven’t dealt with any $X_n$ for $n<N_1$.

The natural question to ask is what happens if we remove the restriction that the space be separable. There are indeed counterexamples to the existence of a Skorohod representation. The clearest example I’ve found so far is supported on (0,1) with a metric inducing the discrete topology. If time allows, I will explain this construction in a post shortly.

References

[1] – Billingsley – Convergence of Probability Measures, 2nd edition (1999)

# Local Limits

In several previous posts, I have talked about scaling limits of various random graphs. Typically in this situation we are interested in convergence of large-scale properties of the graph as the size grows to some limit. These properties will normally be metric in flavour: diameter, component size and so on. To describe convergence of these properties, we divide by the relevant scale, which will often be some simple function of n. If we are looking to find an actual limit object, this is even more important. This is rather similar to describing properties of centred random walks. There, if we run the walk for time n, we have to rescale by $\frac{1}{\sqrt{n}}$ to see the fluctuations on a finite positive scale.

One of the best examples is Aldous’ Continuum Random Tree which we can view as the limit of a Galton-Watson tree conditioned to have total size n, as n tends to infinity. Because of the exploration process or contour process interpretation, where these functions behave rather like a random walk, the correct scaling in this context is again $\frac{1}{\sqrt{n}}$. The point about this convergence is that it is realised entirely as a convergence of some function that represents the tree. For each finite n, it is clear that the tree with n vertices is a graph, but this is neither clear nor true for the limit object. Although it does indeed have no cycles, if nothing else, if the CRT were a graph it would have [0,1] as vertex set and then it would be highly non-obvious how to define the edges.

Local limits aim to give convergence towards a (discrete) infinite graph. The sort of properties we are looking for are now local properties such as degrees and correlations of degrees. These don’t require knowledge of the whole graph, only of some finite subset. First consider the possibility that the sequence of deterministic graphs has the property:

$G_1\leq G_2\leq G_3\leq\ldots$

where $\leq$ denotes an induced subgraph. Then it is relatively clear what the limit should be, as it is well-defined to take a union. This won’t work directly for a limit of random graphs, because the above relation in probability doesn’t even really make sense if we have a different probability space for each finite graph. This is a general clue that we should be looking to use convergence in distribution rather than anything stronger.

In the previous example, suppose the first finite graph $G_1$ consists of a single vertex v. If the limit graph (remember this is just the union, since that is well-defined) has bounded degrees, then there is some N such that $G_N$ contains all the information we might want about the limiting neighbourhood of vertex v. For some larger N, $G_N$ contains all the vertices and edges within distance r from our starting vertex v that appear in the limit graph.

This is all the motivation we require for a genuine definition. We will define our limit in terms of neighbourhoods, so we need some mechanism to choose the central vertex of such a neighbourhood. The answer is to consider rooted graphs, that is, a graph with an identified vertex. We can introduce randomness by specifying a random graph, or by giving a distribution for the choice of root. If G is finite, the canonical choice is to choose the root uniformly from the set of vertices. This isn’t an option for an infinite graph, so we define the system as (G, p) where G is a (for now deterministic) graph, and p is a probability measure on V(G).

We say that the limit of finite $(G_n)$ is the random rooted infinite graph (G, p) if the neighbourhoods of $G_n$ around a randomly chosen vertex converge in distribution to the neighbourhoods of G around p. Formally, say $(G_n)[U_n]\stackrel{d}{\rightarrow} (G,p)$ if for all r>0, for any finite rooted graph (H,w), the probability that (H,w) is isomorphic to the ball of radius r in $G_n$ centred at randomly chosen $v_n$ converges to the probability that (H,w) is isomorphic to the ball of radius r around v in (G,v), where v is distributed according to measure p.

Informally, we might say that if we zoom in on an average vertex in $G_n$ for large n, the neighbourhood looks the same as the neighbourhood around the root in (G, p). We now consider three examples.

1) When we talk about approximating the component size in a sparse Erdos-Renyi random graph by a $\text{Po}(\lambda)$ branching process, this is exactly the limit sense we mean. The approximation fails if we fix n and take the neighbourhood size very large (eg radius n), but for finite neighbourhoods, or any radius growing more slowly than n, the approximation is good.

2) To emphasise why rooting the finite graphs makes a difference, consider the full binary tree with n levels (so $2^n-1$ vertices). If we fix the root, then the limit is the infinite-level binary tree, though this isn’t especially surprising or interesting.

Things get a bit more complicated if we root randomly. Remember that the motivation for random rooting is that we want to know the local structure around a vertex chosen at random in many applications. If we definitely know what vertex we are going to choose, we know the local structure a priori. Note that in an n-level binary tree, $2^{n-1}$ vertices are leaves, not counting the base of the tree, and $2^{n-2}$ are distance 1 from a leaf, and $2^{n-3}$ are distance 2 from a leaf and so on.

This gives us a precise description of the limiting local neighbourhood structure. The resulting limiting object is called the canopy tree. One picture of this can be found on page 6 of this paper. A verbal description is also possible. Consider the set of non-negative integers, arranged in the usual manner on the real line, with edges between adjacent elements. The distribution of the root will be supported on this set of vertices, corresponding to the distance from the leaves in the pre-limit graph. So we have mass 1/2 at 0, 1/4 at 1, 1/8 at 2 and so on. We then connect each vertex k to a full k-level binary tree. The resulting canopy tree looks like an infinite-level full binary tree, viewed from the leaves, which is of course a reasonable heuristic, since that is where the mass is concentrated if we randomly root.
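The root distribution of the canopy tree can be read off from the finite trees directly. The sketch below (with a function name of my own choosing) computes, for a uniformly chosen vertex of the n-level binary tree, the exact law of its distance to the leaves, which should be close to the geometric masses 1/2, 1/4, 1/8, … for large n.

```python
from fractions import Fraction

def height_distribution(n):
    """Law of the distance to the leaves for a uniform vertex of the n-level binary tree.

    There are 2^(n-1-k) vertices at distance k from the leaves, out of 2^n - 1 in total.
    """
    total = 2**n - 1
    return [Fraction(2 ** (n - 1 - k), total) for k in range(n)]

for n in (3, 10, 20):
    law = height_distribution(n)
    print(n, [float(mass) for mass in law[:4]])
```

The error in the k-th mass is exactly $2^{-(k+1)}/(2^n-1)$, so the convergence to the limiting root distribution is very fast.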

3) In particular, the limit is not the infinite-level binary tree. The canopy tree and the infinite-level binary tree have qualitatively different properties. Simple random walk on the canopy tree is recurrent for example. In fact, a result of Benjamini and Schramm, as explained in this review by Curien, says that any local limit of uniformly bounded degree, uniformly rooted, planar graphs is recurrent for SRW. The infinite-level binary tree can be expressed as a local limit if we choose the root distribution sensibly, using large random 3-regular graphs. The previous result does not apply because the random 3-regular graphs are not almost surely planar.

REFERENCES:

– As well as the review paper linked above, these notes by David Aldous were very useful.

# Mixing Times 3 – Convex Functions on the Space of Measures

The meat of this course covers rate of convergence of the distribution of Markov chains. In particular, we want to be thinking about lots of distributions simultaneously, so we really need to be comfortable working with the space of measures on a (for now) finite state space. This is not really too bad actually, since we can embed it in a finite-dimensional real vector space.

$\mathcal{M}_1(E)=\{(x_v:v\in\Omega),x_v\geq 0, \sum x_v=1\}\subset \mathbb{R}^\Omega.$

Since most operations we might want to apply to distributions are linear, it doesn’t make much sense to inherit the usual Euclidean metric. In the end, the metric we use is the same as the $L_1$ metric, but the motivation is worth exploring. Typically, the size of $|\Omega|$ will be a function of n, a parameter which will tend to infinity. So we do not want to be too rooted in the actual set $\Omega$ for what will follow.

Perhaps the best justification for total variation distance is from a gambling viewpoint. Suppose your opinion for the distribution of some outcome is $\mu$, and a bookmaker has priced their odds according to their evaluation of the outcome as $\nu$. You want to make the most money, assuming that your opinion of the distribution is correct (which in your opinion, of course it is!). So assuming the bookmaker will accept an arbitrarily complicated (but finite obviously, since there are only $|\Omega|$ possible outcomes) bet, you want to place money on whichever event evinces the greatest disparity between your measure of likeliness and the bookmaker’s. If you can find an event which you think is very likely, and which the bookmaker thinks is unlikely, you are (again, according to your own opinion of the measure) on for a big profit. This difference is the total variation distance $||\mu-\nu||_{TV}$.

Formally, we define:

$||\mu-\nu||_{TV}:=\max_{A\subset\Omega}|\mu(A)-\nu(A)|.$

Note that if this maximum is achieved at A, it is also achieved at $A^c$, and so we might as well go with the original intuition of

$||\mu-\nu||_{TV}=\max_{A\subset\Omega} \left[\mu(A)-\nu(A)\right].$

If we decompose $\mu(A)=\sum_{x\in A}\mu(x)$, and similarly for $A^c$, then add the results, we obtain:

$||\mu-\nu||_{TV}=\frac12\sum_{x\in\Omega}|\mu(x)-\nu(x)|.$
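On a small state space, the agreement between the maximum over events and half the sum of $|\mu(x)-\nu(x)|$ can be checked by brute force. Here is a minimal sketch with two made-up distributions on four states. All names here are hypothetical.

```python
from fractions import Fraction
from itertools import chain, combinations

def tv_by_events(mu, nu):
    """Maximum of |mu(A) - nu(A)| over all subsets A of the state space."""
    states = list(mu)
    events = chain.from_iterable(combinations(states, r) for r in range(len(states) + 1))
    return max(abs(sum(mu[x] for x in A) - sum(nu[x] for x in A)) for A in events)

def tv_by_half_l1(mu, nu):
    """Half the sum over states of |mu(x) - nu(x)|."""
    return sum(abs(mu[x] - nu[x]) for x in mu) / 2

mu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4), "d": Fraction(0)}
nu = {x: Fraction(1, 4) for x in "abcd"}

print(tv_by_events(mu, nu), tv_by_half_l1(mu, nu))
```

In the gambling picture, the maximising event is the one consisting of exactly those outcomes you consider more likely than the bookmaker does.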

There are plenty of other interesting interpretations of total variation distance, but I don’t want to get bogged down right now. We are interested in the rate of convergence of distributions of Markov chains. Given some initial distribution $\lambda$ of $X_0$, we are interested in $||\lambda P^t-\pi||_{TV}$. The problem is that doing everything in terms of some general $\lambda$ is really annoying, at the very least for notational reasons. So really we want to investigate

$d(t)=\max_{\lambda\in\mathcal{M}_1(E)}||\lambda P^t-\pi||_{TV},$

the worst-case scenario, where we choose the initial distribution that mixes the slowest, at least judging at time t. Now, here’s where the space of measures starts to come in useful. For now, we relax the requirement that measures must be probability distributions. In fact, we allow them to be negative as well. Then $\lambda P^t-\pi$ is some signed measure on $\Omega$ with zero total mass.

But although I haven’t yet been explicit about this, it is easy to see that $||\cdot||_{TV}$ is a norm on this space. In fact, it is (equivalent to, since multiplying by 1/2 makes no difference!) the $L_1$ norm as defined before. Recall that norms are convex functions. This is an immediate consequence of the triangle inequality. The set of suitable distributions $\lambda$ is convex, because a convex combination of probability distributions is another probability distribution.

Then, we know from linear optimisation theory, that convex functions on a compact convex set achieve their maxima at extreme (boundary) points. And the extreme points for this definition of $\lambda\in\mathcal{M}_1(E)$, are precisely the delta-measures at some point of the state space $\delta_v$. So in fact, we can replace our definition of d(t) by:

$d(t)=\max_{x\in\Omega}||P^t(x,\cdot)-\pi||_{TV},$

where $P^t(x,\cdot)$ is the same as $(\delta_x P^t)(\cdot)$. Furthermore, we can immediately apply this idea to get a second result for free. In some problems, particularly those with neat couplings across all initial distributions, it is easier to work with a larger class of transition probabilities, rather than the actual equilibrium distribution, so we define:

$\bar{d}(t):=\max_{x,y\in\Omega}||P^t(x,\cdot)-P^t(y,\cdot)||_{TV}.$

The triangle inequality gives $\bar{d}(t)\leq 2d(t)$ immediately. But we want to show $d(t)\leq \bar{d}(t)$, and we can do that as before, by considering

$\max_{\lambda,\mu\in\mathcal{M}_1(E)}||\lambda P^t-\mu P^t||_{TV}.$

The function we are maximising is a convex function on $\mathcal{M}_1(E)^2$, and so it attains its maximum at a boundary point, which must be $\lambda=\delta_x,\mu=\delta_y$. Hence $\bar{d}(t)$ is equal to the displayed expression above, which is certainly greater than or equal to the original formulation of d(t), as this is the maximum of the same expression over a strict subset.
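As a sanity check of these inequalities, here is a sketch for a small chain of my own invention: a lazy walk on three states with a symmetric transition matrix, so that the uniform distribution is stationary. None of this is taken from the book. It computes d(t) and $\bar{d}(t)$ exactly, and also checks that a mixed starting distribution does no worse than the worst point mass.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Hypothetical example chain: symmetric, hence doubly stochastic, so pi is uniform.
P = [[half, half, 0], [half, 0, half], [0, half, half]]
pi = [Fraction(1, 3)] * 3

def step(dist, P):
    """One step of the chain: the row vector dist multiplied by P."""
    size = len(P)
    return [sum(dist[x] * P[x][y] for x in range(size)) for y in range(size)]

def evolve(dist, P, t):
    for _ in range(t):
        dist = step(dist, P)
    return dist

def tv(mu, nu):
    return sum(abs(a - b) for a, b in zip(mu, nu)) / 2

def point_mass(x, size):
    return [Fraction(int(i == x)) for i in range(size)]

def d(t):
    """Worst-case distance to equilibrium over point-mass starts."""
    return max(tv(evolve(point_mass(x, 3), P, t), pi) for x in range(3))

def d_bar(t):
    """Worst-case distance between two point-mass starts."""
    rows = [evolve(point_mass(x, 3), P, t) for x in range(3)]
    return max(tv(r, s) for r in rows for s in rows)

mixed_start = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
for t in range(1, 7):
    print(t, d(t), d_bar(t), tv(evolve(mixed_start, P, t), pi))
```

Each printed row should satisfy $d(t)\le\bar{d}(t)\le 2d(t)$, with the final column never exceeding d(t).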

I’m not suggesting this method is qualitatively different to that proposed by the authors of the book. However, I think this is very much the right way to be thinking about these matters of maximising norms over a space of measures. Partly this is good because it gives an easy ‘sanity check’ for any idea. But also because it gives some idea of whether it will or won’t be possible to extend the ideas to the case where the state space is infinite, which will be of interest much later.

# Weak Convergence and the Portmanteau Lemma

Much of the theory of Large Deviations splits into separate treatment of open and closed sets in the rescaled domains. Typically we seek upper bounds for the rate function on closed sets, and lower bounds for the rate function on open sets. When things are going well, these turn out to be the same, and so we can get on with some applications and pretty much forget about the topology underlying the construction. Many sources make a comment along the lines of “this is natural, by analogy with weak convergence”.

Weak convergence is a topic I learned about in Part III Advanced Probability. I fear it may have been one of those things that leaked out of my brain shortly after the end of the exam season… Anyway, this feels like a good time to write down what it is all about a bit more clearly. (I’ve slightly cheated, and chosen definitions and bits of the portmanteau lemma which look maximally similar to the Large Deviation material, which I’m planning on writing a few posts about over the next week.)

The motivation is that we want to extend the notion of convergence in distribution of random variables to general measures. There are several ways to define convergence in distribution, so accordingly there are several ways to generalise it. Much of what follows will be showing that these are equivalent.

We work in a metric space (X,d) and have a sequence $(\mu_n)$ of (Borel) probability measures, together with a further (Borel) probability measure $\mu$. We say that $(\mu_n)$ converges weakly to $\mu$, or $\mu_n\Rightarrow\mu$ if:

$\mu_n(f)\rightarrow\mu(f), \quad\forall f\in\mathcal{C}_b(X).$

So the test functions required for this result are the class of bounded, continuous functions on X. We shall see presently that it suffices to check a smaller class, eg bounded Lipschitz functions. Indeed the key result, which is often called the portmanteau lemma, gives a set of alternative conditions for weak convergence. We will prove the equivalence cyclically.

Portmanteau Lemma

The following are equivalent.

a) $\mu_n\Rightarrow \mu$.

b) $\mu_n(f)\rightarrow\mu(f)$ for all bounded Lipschitz functions f.

c) $\limsup_n \mu_n(F)\leq \mu(F)$ for all closed sets F. Note that we demanded that all the measures be Borel, so there is no danger of $\mu(F)$ not being defined.

d) $\liminf_n \mu_n(G)\geq \mu(G)$ for all open sets G.

e) $\lim_n \mu_n(A)=\mu(A)$ whenever $\mu(\partial A)=0$. Such an A is called a continuity set.

Remarks

a) All of these statements are well-defined if X is a general topological space. I can’t think of any particular examples where we want to use measures on a non-metrizable space (eg C[0,1] with topology induced by pointwise convergence), but there seem to be a few references (such as the one cited here) implying that the results continue to hold in this case provided X is locally compact Hausdorff. This seems like an interesting thing to think about, but perhaps not right now.

b1) This doesn’t strike me as hugely surprising. I want to say that any bounded continuous function can be uniformly approximated almost everywhere by bounded Lipschitz functions. Even if that isn’t true, I am still not surprised.

b2) In fact this condition could be replaced by several alternatives. In the proof that follows, we only use one type of function, so any subset of $\mathcal{C}_b(X)$ that contains the ones we use will be sufficient to determine weak convergence.

c) and d) Why should the sign be this way round? The canonical example to have in mind is some sequence of point masses $\delta_{x_n}$ where $x_n\rightarrow x$ in some non-trivial way. Then there is some open set eg $X\backslash\{x\}$ such that $\mu_n(X\backslash\{x\})=1$ but $\mu(X\backslash\{x\})=0$. Informally, we might say that in the limit, some positive mass could ‘leak out’ into the boundary of an open set.

e) is then not surprising, as the condition of being a continuity set precisely prohibits the above situation from happening.

Proof

a) to b) is genuinely trivial. For b) to c), fix $\epsilon>0$ and find some open set F’ containing F, say $F'=\{x:d(x,F)<\delta\}$ for small enough $\delta$, such that $\mu(F')-\mu(F)\leq\epsilon$. Then find a Lipschitz function f, taking values in [0,1], which is 0 outside F’ and 1 on F. We obtain

$\limsup_n \mu_n(F)\leq \limsup_n \mu_n(f)=\mu(f)\leq \mu(F').$

But $\epsilon$ was arbitrary, so the result follows as it tends to zero. c) and d) are equivalent after taking $F^c=G$. If we assume c) and d) and apply them to $A^\circ, \bar{A}$, then e) follows.
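For concreteness, one possible choice of the Lipschitz function in the step b) to c) is the following (a sketch, assuming F is non-empty and writing $d(x,F)=\inf_{y\in F}d(x,y)$; the notation $f_\delta, F_\delta$ is mine):

$f_\delta(x)=\left(1-\frac{d(x,F)}{\delta}\right)^+,\quad F_\delta=\{x:d(x,F)<\delta\}.$

This takes values in [0,1], is Lipschitz with constant $\frac{1}{\delta}$, equals 1 on F, and vanishes outside $F_\delta$. Since F is closed, $F_\delta\downarrow F$ as $\delta\downarrow 0$, so $\mu(F_\delta)\downarrow\mu(F)$, which is what allows $\epsilon$ to be taken arbitrarily small.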

e) to a) is a little trickier. Given a bounded continuous function f, assume WLOG that it takes values in [0,1]. At most countably many events $\{f=a\}$ have positive mass under each of $\mu, (\mu_n)$. So given $M>0$, we can choose a sequence

$-1=a_0<a_1<\ldots<a_K$ with $a_K>1$, such that $|a_{k+1}-a_k|<\frac{1}{M}$,

and $\mu(f=a_k)=\mu_n(f=a_k)=0$ for all k,n. Now it is clear what to do. $\{f\in[a_k,a_{k+1}]\}$ is a continuity set, so we can apply e), then patch everything together. There are slightly too many Ms and $\epsilon$s to do this sensibly in WordPress, so I will leave it at that.
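For what it is worth, here is one way the patching might go, as a sketch rather than a polished proof. Suppose f takes values in [0,1], the levels $a_0<a_1<\ldots<a_K$ have $a_0<0$, $a_K>1$ and gaps smaller than $\frac{1}{M}$, and write $A_k=\{a_k\leq f<a_{k+1}\}$ (the names $A_k$ and $\nu$ are mine). The sets $A_k$ partition X, and by continuity of f each boundary $\partial A_k$ is contained in $\{f=a_k\}\cup\{f=a_{k+1}\}$, so each $A_k$ is a continuity set for $\mu$. For any probability measure $\nu$,

$\sum_k a_k\nu(A_k)\leq \nu(f)\leq \sum_k a_k\nu(A_k)+\frac{1}{M}.$

Applying this with $\nu=\mu_n$ and $\nu=\mu$ gives

$|\mu_n(f)-\mu(f)|\leq \frac{1}{M}+\sum_k |a_k|\,|\mu_n(A_k)-\mu(A_k)|.$

The sum is finite and each term vanishes as $n\rightarrow\infty$ by e), so $\limsup_n|\mu_n(f)-\mu(f)|\leq\frac{1}{M}$, and M was arbitrary.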

I will conclude by writing down a combination of c) and d) that will look very familiar soon.

$\mu(A^\circ)\leq \liminf_n \mu_n(A)\leq \limsup_n\mu_n(A)\leq \mu(\bar{A}).$

References

Apart from the Part III Advanced Probability course, this article was prompted by various books on Large Deviations, including those by Frank den Hollander and Ellis / Dupuis. I’ve developed the proof above from the hints given in the appendix of these very comprehensible notes by Rassoul-Agha and Seppäläinen.

# Exploring the Supercritical Random Graph

I’ve spent a bit of time this week reading and doing all the exercises from some excellent notes by van der Hofstad about random graphs. I would not be surprised if they become the standard text for an introduction to probabilistic combinatorics. You can find them hosted on the author’s website. I’ve been reading chapters 4 and 5, which approach the properties of phase transitions in G(n,p) by formalising the analogy between component sizes and population sizes in a binomial branching process. When I met this sort of material for the first time during Part III, the proofs generally relied on careful first and second moment bounds, which is fine in many ways, but I enjoyed vdH’s (perhaps more modern?) approach, as it seems to give a more accurate picture of what is actually going on. In this post, I am going to talk about using the branching process picture to explain why the giant component emerges when it does, and how to get a grip on how large it is at any time after it has emerged.

Background

A quick tour through the background, and in particular the notation will be required. At some point I will write a post about this topic in a more digestible format, but for now I want to move on as quickly as possible.

We are looking at the sparse random graph $G(n,\frac{\lambda}{n})$, in the super-critical phase $\lambda>1$. With high probability (that is, with probability tending to 1 as n grows), we have a so-called giant component, with $\Theta(n)$ vertices.

Because all the edges in the configuration are independent, we can view the component containing a fixed vertex as a branching process. Given vertex v(1), the number of neighbours is distributed like $\text{Bi}(n-1,\frac{\lambda}{n})$. The number of neighbours of each of these which we haven’t already considered is then $\text{Bi}(n-k,\frac{\lambda}{n})$, conditional on k, the number of vertices we have already discounted. After any finite number of steps, k=o(n), and so it is fairly reasonable to approximate this just by $\text{Bi}(n,\frac{\lambda}{n})$. Furthermore, as n grows, this distribution converges to $\text{Po}(\lambda)$, and so it is natural to expect that the probability that the fixed vertex lies in a giant component is equal to the survival probability $\zeta_\lambda$ (that is, the probability that the total population is infinite) of a branching process with $\text{Po}(\lambda)$ offspring distribution. Note that given a graph, the probability of a uniformly chosen vertex lying in a giant component is equal to the fraction of the vertices in the giant component. At this point it is clear why the emergence of the giant component must happen at $\lambda=1$, because we require $\mathbb{E}\text{Po}(\lambda)>1$ for the survival probability to be non-zero. Obviously, all of this needs to be made precise and rigorous, and this is treated in sections 4.3 and 4.4 of the notes.

Exploration Process

A common functional of a rooted branching process to consider is the following. This is called in various places an exploration process, a depth-first process or a Lukasiewicz path. We take a depth-first labelling of the tree v(0), v(1), v(2),… , and define c(k) to be the number of children of vertex v(k). We then define the exploration process by:

$S(0)=0,\quad S(k+1)=S(k)+c(k)-1.$

By far the best way to think of this is to imagine we are making the depth-first walk on the tree. S(k) records how many vertices, not counting v(k) itself, we have seen (because they are connected by an edge to a vertex we have visited) but have not yet visited. To clarify understanding of the definition, note that when you arrive at a vertex with no children, this should decrease by one, as you can see no new vertices, but have visited an extra one.

This exploration process is useful to consider for a couple of reasons. Firstly, you can reconstruct the branching process directly from it. Secondly, while other functionals (eg the height, or contour process) look like random walks, the exploration process genuinely is a random walk. The distribution of the number of children of the next vertex we arrive at is independent of everything we have previously seen in the tree, and is the same for every vertex. If we were looking at branching processes in a different context, we might observe that this gives some information in a suitably-rescaled limit, as rescaled random walks converge to Brownian motion if the variance of the (offspring) distribution is finite. (This is Donsker’s result, which I should write something about soon…)

The most important property is that the exploration process hits a new minimum precisely when we have exhausted all the vertices in a component: started from S(0)=0 it first reaches -1 at that time (if we instead start from 1, counting the root as seen but not yet visited, this is the first return to 0). At that point, we have seen exactly the vertices which we have explored. There is no reason not to extend the definition to forests, that is a union of trees. The depth-first exploration is the same – but when we have exhausted one component, we move onto another component, chosen according to some labelling property. Then, running minima of the exploration process (ie times when it is smaller than it has been before) correspond to jumping between components, and thus excursions above the minimum to components themselves. The running minimum will be non-positive, with absolute value equal to the number of components already exhausted.

Although the exploration process was defined with reference to and in the language of trees, the result of a branching process, this is not necessary. With some vertex denoted as the root, we can construct a depth-first labelling of a general graph, and the exploration process follows exactly as before. Note that we end up ignoring all edges except a set that forms a forest. This is what we will apply to G(n,p).
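To make the construction on a general graph concrete, here is a minimal sketch of the exploration process, using the convention S(0)=0 and choosing the next component by smallest label. The function name, the adjacency-dictionary representation and the small example graph are all my own assumptions for illustration.

```python
def exploration_process(adj):
    """Depth-first exploration of a graph given as {vertex: set_of_neighbours}.

    Returns (S, sizes): S is the walk with S[0] = 0 and S[k+1] = S[k] + c(k) - 1,
    where c(k) is the number of previously unseen neighbours of the k-th vertex
    visited; sizes lists the component sizes in the order they are exhausted.
    """
    seen = set()
    S = [0]
    sizes = []
    for root in sorted(adj):          # labelling rule for choosing the next component
        if root in seen:
            continue
        seen.add(root)
        stack = [root]
        size = 0
        while stack:
            v = stack.pop()           # visit the most recently seen vertex
            size += 1
            new = [w for w in sorted(adj[v]) if w not in seen]
            seen.update(new)          # these become the 'children' of v
            stack.extend(new)
            S.append(S[-1] + len(new) - 1)
        sizes.append(size)
    return S, sizes

# Hypothetical example: a triangle {0,1,2}, an isolated vertex 3, and an edge {4,5}.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set(), 4: {5}, 5: {4}}
S, sizes = exploration_process(adj)
print(S)      # the exploration process
print(sizes)  # component sizes, in order of exploration
```

In this example the edge between 1 and 2 is never used, since both are already seen as children of 0, which is the sense in which we only keep a forest. The gaps between successive new minima of `S` are the component sizes.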

Exploring G(n,p)

When we jump between components in the exploration process on a supercritical (that is $\lambda>1$) random graph, we move to a component chosen randomly with size-biased distribution. If there is a giant component, as we know there is in the supercritical case, then this will dominate the size-biased distribution. Precisely, if the giant component takes up a fraction H of the vertices, then the number of components to be explored before we get to the giant component is geometrically distributed with parameter H. All other components have size O(log n), so the expected number of vertices explored before we get to the giant component is O(log n)/H = o(n), and so in the limit, we explore the giant component immediately.

The exploration process therefore gives good control on the giant component in the limit, as roughly speaking the first time it returns to 0 is the size of the giant component. Fortunately, we can also control the distribution of S_t, the exploration process at time t. We have that:

$S_t+(t-1)\sim \text{Bi}(n-1,1-(1-p)^t).$

This is not too hard to see. $S_t+(t-1)$ is the number of vertices, other than the starting vertex, that we have explored or seen, ie are connected to a vertex we have explored. (Here the exploration is started from $S_0=1$, so that the starting vertex counts as seen.) Suppose the remaining vertices are called unseen, and we began the exploration at vertex 1. Then any vertex with label in {2,…,n} is unseen if it successively avoids being in the neighbourhood of v(1), v(2), … v(t). This happens with probability $(1-p)^t$, independently for each such vertex, and so the probability of being an explored or seen vertex is the complement of this.

In the supercritical case, we are taking $p=\frac{\lambda}{n}$ with $\lambda>1$, and we also want to speed up S, so that all the exploration processes are defined on [0,1], and rescale the sizes by n, so that it records the fraction of the graph rather than the number of vertices. So we consider the rescaling $\frac{1}{n}S_{nt}$.

It is straightforward to use the distribution of S_t to deduce that the asymptotic mean is $\lim_{n\rightarrow\infty}\mathbb{E}\frac{1}{n}S_{nt}=\mu_t := 1-t-e^{-\lambda t}$.
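The computation can be sketched as follows, taking the binomial law for $S_t+(t-1)$ stated above as given, and treating nt as an integer. With $p=\frac{\lambda}{n}$,

$\mathbb{E}S_{nt}=(n-1)\left[1-\left(1-\frac{\lambda}{n}\right)^{nt}\right]-(nt-1).$

Dividing by n and letting $n\rightarrow\infty$ with t fixed, and using $(1-\frac{\lambda}{n})^{nt}\rightarrow e^{-\lambda t}$, we get

$\frac{1}{n}\mathbb{E}S_{nt}\rightarrow 1-e^{-\lambda t}-t=\mu_t.$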

Now we are in a position to provide more concrete motivation for the claim that the proportion of vertices in the giant component is $\zeta_\lambda$, the survival probability of a branching process with $\text{Po}(\lambda)$ offspring distribution. It helps to consider instead the extinction probability $1-\zeta_\lambda$. We have:

$1-\zeta_\lambda=\sum_{k\geq 0}\mathbb{P}(\text{Po}(\lambda)=k)(1-\zeta_\lambda)^k=e^{-\lambda\zeta_\lambda},$

where the second equality is a consequence of the simple form for the probability generating function of the Poisson distribution, $\mathbb{E}s^{\text{Po}(\lambda)}=e^{\lambda(s-1)}$.

As a result, we have that $\mu_{\zeta_\lambda}=0$. In fact we also have a central limit theorem for S_t, which enables us to deduce that $\frac{1}{n}S_{n\zeta_\lambda}$ is close to 0 with high probability, as well as in expectation, which is precisely what is required to prove that the giant component of $G(n,\frac{\lambda}{n})$ has size $n(\zeta_\lambda+o(1))$.
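The claim that $\mu_t$ vanishes at $t=\zeta_\lambda$ is easy to check numerically. The sketch below finds the extinction probability by fixed-point iteration of $q\mapsto e^{-\lambda(1-q)}$ starting from 0, which is one standard way to do it but not the only one. The function names and the value $\lambda=2$ are my own choices for illustration.

```python
import math

def survival_probability(lam, iterations=200):
    """Survival probability zeta of a Poisson(lam) branching process, for lam > 1.

    Iterates the extinction-probability map q -> exp(-lam * (1 - q)) from q = 0,
    which increases to the smallest fixed point in [0, 1].
    """
    q = 0.0
    for _ in range(iterations):
        q = math.exp(-lam * (1.0 - q))
    return 1.0 - q

def mu(t, lam):
    """Asymptotic mean of the rescaled exploration process at time t."""
    return 1.0 - t - math.exp(-lam * t)

lam = 2.0  # illustrative value only
zeta = survival_probability(lam)
print(zeta, mu(zeta, lam))
```

For times strictly between 0 and $\zeta_\lambda$ the function `mu` is positive, matching the picture of a single long excursion of the exploration process while it is inside the giant component.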

# CLT and Stable Distributions

One of the questions I posed at the end of the previous post about the Central Limit Theorem was this: what is special about the normal distribution?

More precisely, for a large class of variables (those with finite variance) the limit in distribution of $S_n$ after a natural rescaling is distributed as N(0,1). As a starting point for investigating similar results for a more general class of underlying distributions, it is worth considering what properties we might require of a distribution if it is to appear as a limit in distribution of sums of IID RVs, rescaled if necessary.

The property required is that the distribution is stable. In the rest of the post I am going to give an informal precis of the content of the relevant chapter of Feller.

Throughout, we assume a collection of IID RVs, $X,X_1,X_2,\ldots$, with the partial sums $S_n:=X_1+\ldots+X_n$. Then we say $X$ is stable in the broad sense if

$S_n\stackrel{d}{=}c_nX+\gamma_n,$

for some deterministic parameters $c_n,\gamma_n$ for every n. If in fact $\gamma_n=0$ then we say $X$ is stable in the strict sense. I’m not sure if this division into strict and broad is still widely drawn, but anyway. One interpretation might be that a collection of distributions is stable if they form a non-trivial subspace of the vector space of random variables and also form a subgroup under the operation of adding independent RVs. I’m not sure that this is hugely useful either though. One observation is that if $\mathbb{E}X$ exists and is 0, then so are all the $\gamma_n$s.

The key result to be shown is that

$c_n=n^{1/\alpha}$ for some $0<\alpha\leq 2$.
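Two standard examples may help to fix ideas (these are well-known facts that I am adding for illustration, rather than part of the argument). If $X\sim N(0,1)$, then $S_n\sim N(0,n)$, so

$S_n\stackrel{d}{=}\sqrt{n}X,\quad c_n=n^{1/2},\quad \alpha=2.$

If instead X has the standard Cauchy distribution, with characteristic function $e^{-|u|}$, then $S_n$ has characteristic function $e^{-n|u|}$, so

$S_n\stackrel{d}{=}nX,\quad c_n=n,\quad \alpha=1.$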

Relevant though the observation about means is, a more useful one is this. The stability property is retained if we replace the distribution of $X$ with the distribution of $X_1-X_2$ (independent copies naturally!). The behaviour of $c_n$ is also preserved. Now we can work with an underlying distribution that is symmetric about 0, rather than merely centred. The deduction that $\gamma_n=0$ still holds now, whether or not X has a mean.

Now we proceed with the proof. All equalities are taken to be in distribution unless otherwise specified. By splitting into two smaller sums, we deduce that

$c_{m+n}X=S_{m+n}=c_mX_1+c_nX_2.$

Extending this idea, we have

$c_{kr}X=S_{kr}=S_k^{(1)}+\ldots+S_k^{(r)}=c_kX_1+\ldots+c_kX_r=c_kS_r=c_kc_rX.$

Note that it is not even obvious yet that the $c_n$s are increasing. To get a bit more control, we proceed as follows. Set $v=m+n$, and express

$X=\frac{c_m}{c_v}X_1+\frac{c_n}{c_v}X_2,$

from which we can make the deduction

$\mathbb{P}(X>t)\geq \mathbb{P}(X_1\geq 0,X_2>t\frac{c_v}{c_n})\geq\frac12\mathbb{P}(X_2>t\frac{c_v}{c_n}).$ (*)

So most importantly, by taking $t\gg 0$ in the above, and using that X is symmetric, we can obtain an upper bound

$\mathbb{P}(X_2>t\frac{c_v}{c_n})\leq \delta<\frac12,$

in fact for any $\delta<\frac12$ if we take $t$ large enough. But since

$\mathbb{P}(X_2>0)=\frac12(1-\mathbb{P}(X_2=0)),$

(which should in most cases be $\frac12$), this implies that $\frac{c_v}{c_n}$ cannot be very close to 0. In other words, $\frac{c_n}{c_v}$ is bounded above. This is in fact regularity enough to deduce that $c_n=n^{1/\alpha}$ from the Cauchy-type functional equation $c_{kr}=c_kc_r$ obtained above.

It remains to check that $\alpha\leq 2$. Note that this equality case $\alpha=2$ corresponds exactly to the $\frac{1}{\sqrt{n}}$ scaling we saw for the normal distribution, in the context of the CLT. This motivates the proof. If $\alpha>2$, we will show that the variance of X is finite, so CLT applies. This gives some control over $c_n$ in an $n\rightarrow\infty$ limit, which is plenty to ensure a contradiction.

To show the variance is finite, we use the definition of stable to check that there is a value of t such that

$\mathbb{P}(S_n>tc_n)<\frac14\,\forall n.$

Now consider the event that the maximum of the $X_i$s is $>tc_n$ and that the sum of the rest is non-negative. This has, by independence and symmetry, at least half the probability of the event demanding just that the maximum be bounded below, and furthermore is contained within the event with probability $<\frac14$ shown above. So if we set

$z(n)=n\mathbb{P}(X>tc_n)$

we then have

$\frac14>\mathbb{P}(S_n>tc_n)\geq\frac12\mathbb{P}(\max X_i>tc_n)=\frac12\left[1-\left(1-\frac{z(n)}{n}\right)^n\right]$

$\Rightarrow 1-e^{-z(n)}\leq 1-\left(1-\frac{z(n)}{n}\right)^n<\frac12\text{ for all }n.$

So, $z(n)=n(1-R(tc_n))$ is bounded as $n$ varies, where $R$ is the distribution function of $X$. Rescaling suitably, this gives that

$x^\alpha(1-R(x))<M$ for all $x>0$, for some constant $M<\infty$.

This is exactly what we need to control the variance, as:

$\mathbb{E}X^2=\int_0^\infty \mathbb{P}(X^2>t)dt=\int_0^\infty \mathbb{P}(X^2>u^2)2udu$

$=\int_0^\infty 4u\mathbb{P}(X>u)du\leq \int_0^\infty 4u\wedge\frac{4M}{u^{\alpha-1}}du<\infty,$

using that X is symmetric and that $\alpha>2$ for the final steps. But we know from CLT that if the variance is finite, we must have $\alpha=2$.

All that remains is to mention how stable distributions fit into the context of limits in distribution of RVs. This is little more than a definition.

We say F is in the domain of attraction of a broadly stable distribution R if

$\exists a_n>0,b_n,\quad\text{s.t.}\quad \frac{S_n-b_n}{a_n}\stackrel{d}{\rightarrow}R.$

The role of $b_n$ is not hugely important, as a broadly stable distribution is in the domain of attraction of the corresponding strictly stable distribution.

The natural question to ask is: do the domains of attraction of stable distributions (for $0<\alpha\leq 2$) partition the space of probability distributions, or is some extra condition required?

Next time I will talk about stable distributions in a more analytic context, and in particular how a discussion of their properties is motivated by the construction of Levy processes.

# Large Deviations and the CLT

Taking a course on Large Deviations has forced me to think a bit more carefully about what happens when you have large collections of IID random variables. I guess the first thing to think about is ‘What is a Large Deviation‘? In particular, how large or deviant does it have to be?

Of primary interest is the tail of the distribution function of $S_n=X_1+\ldots+X_n$, where the $X_i$ are independent and identically distributed as $X$. As we can always negate everything later if necessary, we typically consider the probability of events of the type:

$\mathbb{P}(S_n\geq \theta(n))$

where $\theta(n)$ is some function which almost certainly increases fairly fast with $n$. More pertinently, if we are looking for some limit which corresponds to an actual random variable, we perhaps want to look at lots of related $\theta(n)$s simultaneously. More concretely, we should fix $\theta$ and consider the probabilities

$\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha).$ (*)

Throughout, we lose no generality by assuming that $\mathbb{E}X=0$. Of course, it is possible that this expectation does not exist, but that is certainly a question for another post!

Now let’s consider the implications of our choice of $\theta(n)$. If this increases with $n$ too slowly, and the likely deviation of $S_n$ is greater than $\theta(n)$, then the event might not be a large deviation at all. In fact, the difference between this event and the event ($S_n$ is above 0, that is, its mean) becomes negligible, and so the probability at (*) might be 1/2 or whatever, regardless of the value of $\alpha$. So the object $\lim \frac{S_n}{\theta(n)}$, whatever that means, certainly cannot be a proper random variable, as if we were to have convergence in distribution, this would imply that the limit RV consisted of point mass at each of $\{+\infty, -\infty\}$.

On the other hand, if $\theta(n)$ increases rapidly with $n$, then the probabilities at (*) might become very small indeed when $\alpha>0$. For example, we might expect:

$\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha)=\begin{cases}0& \alpha>0\\1&\alpha<0.\end{cases}$

and more information to be required when $\alpha=0$. This is what we mean by a large deviation event. Although we always have to define everything concretely in terms of some finite sum $S_n$, we are always thinking about the behaviour in the limit. A large deviation principle exists in an enormous range of cases to show that these probabilities in fact decay exponentially. Again, that is the subject for another post, or indeed the lecture course I’m attending.

Instead, I want to return to the Central Limit Theorem. I first encountered this result in popular science books in a vague “the histogram of boys’ heights looks like a bell” kind of way, then, once a normal random variable had been to some extent defined, it returned in A-level statistics courses in a slightly more fleshed out form. As an undergraduate, you see it in several forms, including as a corollary following from Levy’s convergence theorem.

In all applications though, it is generally used as a method of calculating good approximations. It is not uncommon to see it presented as:

$\mathbb{P}(a\sigma\sqrt{n}+\mu n\leq S_n\leq b\sigma\sqrt{n}+\mu n)\approx \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}dx.$
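As a small illustration of how good this approximation is, here is a sketch comparing the two sides exactly in the special case of fair coin flips, so $X\in\{0,1\}$ with $\mu=\sigma=\frac12$. The choice of distribution and the function names are assumptions of mine, purely for illustration.

```python
import math

def exact_probability(n, a, b):
    """P(a*sigma*sqrt(n) + mu*n <= S_n <= b*sigma*sqrt(n) + mu*n) for n fair coin flips."""
    mu, sigma = 0.5, 0.5
    lo = math.ceil(a * sigma * math.sqrt(n) + mu * n)
    hi = math.floor(b * sigma * math.sqrt(n) + mu * n)
    # Count favourable outcomes with exact integer arithmetic, dividing only at the end.
    favourable = sum(math.comb(n, k) for k in range(max(lo, 0), min(hi, n) + 1))
    return favourable / 2 ** n

def normal_probability(a, b):
    """The integral of the standard normal density over [a, b]."""
    def Phi(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return Phi(b) - Phi(a)

for n in (100, 1000, 10000):
    print(n, exact_probability(n, -1, 1), normal_probability(-1, 1))
```

The discrepancy that remains for moderate n is mostly a lattice effect, since $S_n$ only takes integer values.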

Although in many cases that is the right way to use it, it isn’t the most interesting aspect of the theorem itself. CLT says that the correct scaling of $\theta(n)$ so that the deviation probabilities lie between the two cases outlined above is the same (that is, $\theta(n)=O(\sqrt{n})$ in some sense) for an enormous class of distributions, and in particular, most distributions that one might encounter in practice (ie finite mean, finite variance). There is even greater universality, as furthermore the limit distribution at this interface has the same form (some appropriate normal distribution) whenever $X$ is in this class of distributions. I think that goes a long way to explaining why we should care about the theorem. It also immediately prompts several questions:

• What happens for less regular distributions? It is now more clear what the right question to ask in this setting might be. What is the appropriate scaling for $\theta(n)$ in this case, if such a scaling exists? Is there a similar universality property for suitable classes of distributions?
• What is special about the normal distribution? The theorem itself shows us that it appears as a universal scaling limit in distribution, but we might reasonably ask what properties such a distribution should have, as perhaps this will offer a clue to a version of CLT or LLNs for less regular distributions.
• We can see that the Weak Law of Large Numbers follows immediately from CLT. In fact we can say more, perhaps a Slightly Less Weak LLN, that $\frac{S_n-\mu n}{\sigma \theta(n)}\stackrel{d}{\rightarrow}0$ whenever $\sqrt{n}\ll\theta(n)$. But of course, we also have a Strong Law of Large Numbers, which asserts that the empirical mean converges almost surely. What is the threshold for almost sure convergence, because there is no a priori reason why it should be $\theta(n)=n$?

To be continued next time.

# Convergence of Random Variables

The relationship between the different modes of convergence of random variables is one of the more important topics in any introduction to probability theory. For some reason, many of the textbooks leave the proofs as exercises, so it seems worthwhile to present a sketched but comprehensive summary.

Almost sure convergence: $X_n\rightarrow X\;\mathbb{P}$-a.s. if $\mathbb{P}(X_n\rightarrow X)=1$.

Convergence in Probability: $X_n\rightarrow X$ in $\mathbb{P}$-probability if $\mathbb{P}(|X_n-X|>\epsilon)\rightarrow 0$ for any $\epsilon>0$.

Convergence in Distribution: $X_n\stackrel{d}{\rightarrow} X$ if $\mathbb{E}f(X_n)\rightarrow \mathbb{E}f(X)$ for any bounded, continuous function f. Note that this definition is valid for RVs defined on any metric space. When they are real-valued, this is equivalent to the condition that $F_{X_n}(x)\rightarrow F_X(x)$ for every point $x\in \mathbb{R}$ where $F_X$ is continuous. It is further equivalent (by Levy’s Convergence Theorem) to its own special case, convergence of characteristic functions: $\phi_{X_n}(u)\rightarrow \phi_X(u)$ for all $u\in\mathbb{R}$.

Note: In contrast to the other conditions for convergence, convergence in distribution (also known as weak convergence) doesn’t require the RVs to be defined on the same probability space. This thought can be useful when constructing counterexamples.

$L^p$-convergence: $X_n\rightarrow X$ in $L^p$ if $||X_n-X||_p\rightarrow 0$; that is, $\mathbb{E}|X_n-X|^p\rightarrow 0$.

Uniform Integrability: Informally, a set of RVs is UI if the integrals over small sets tend to zero uniformly. Formally: $(X_n)$ is UI if $\sup_{n,A\in\mathcal{F}}\{\mathbb{E}[|X_n|1(A)]\,:\,\mathbb{P}(A)\leq \delta\}\rightarrow 0$ as $\delta\rightarrow 0$.

Note: In particular, a single integrable RV, and a collection of identically distributed integrable RVs, are UI. If X~U[0,1] and $X_n=n1(X\leq \frac{1}{n})$, then the collection is not UI.
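To see why this last collection fails to be UI, take the sets $A_n=\{X\leq \frac{1}{n}\}$ in the definition (the notation $A_n$ is mine). Then

$\mathbb{P}(A_n)=\frac{1}{n}\rightarrow 0,\quad\text{but}\quad \mathbb{E}[|X_n|1(A_n)]=n\cdot\frac{1}{n}=1\text{ for every }n,$

so the supremum in the definition is at least 1 for every $\delta>0$, and cannot tend to zero.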