Non-separable Skorohod Representations

In the previous post, I discussed the statement and proof of the Skorohod representation theorem. This concerns the conditions under which it is possible to couple distributions which converge in law, to obtain a family of random variable on a possibly very large probability space, which converge almost surely. The condition for the theorem to hold is that the base space, or at least the support of the limiting distribution should be a separable metric space. Skorohod’s original proof concerned the case where all the distributions were supported on a complete, separable metric space (Polish space), but this extension is not particularly involved, and was proven not long after the original result.

It is natural to ask exactly what goes wrong in non-separable or non-metrizable spaces. Recall a space is separable if it contains a countable dense subset. Obviously, finite or countable sets are by definition separable with any metric. Considering the points with rational coordinates shows that \mathbb{R}^d is separable for each d, and the Stone-Weierstrass theorem shows that continuous functions with on a bounded interval are also separable with the uniform topology, as they can be approximated uniformly well by polynomials with rational coefficients. One heuristic is that a separable space does not have ‘too many’ open sets.

There are references (for example, see [2]) to examples of Skorohod non-representation in non-metrizable topological spaces, which are ‘big’ enough to allow convergence in distribution with respect to a particular class of test functions, but where the distributions are not uniformly tight, so cannot converge almost surely. However, I don’t really understand this well at all, and have struggled to chase the references, some of which are unavailable, and some in French.

Instead, I want to talk about an example given in [1] of a family of distributions on a non-separable space, which cannot be coupled to converge almost surely. The space is (0,1) equipped with the discrete metric, which says that d(x,y)=1 whenever x\ne y. Note that it is very hard to have even deterministic convergence in this space, since the only way to be close to a element of the space is indeed to be equal to that element. We will construct random variables and it will unsurprising that they cannot possibly converge almost surely in any coupling, but the exact nature of the construction will lead to convergence in distribution.

Based on what we proved last time, the support of the limiting distribution will be non-separable. It turns out that the existence of such a distribution is equiconsistent in the sense of formal logic with the existence of an extension of Lebesgue measure to the whole power set of (0,1). This is not allowed under the Axiom of Choice, but is consistent under the slightly weaker Axiom of Dependent Choice (AC). This weaker condition says, translated into language more familiar to me, that every directed graph with arbitrary (and in particular, potentially uncountable) vertex set, and with all out-degrees at least 1 contains an infinite directed path. This seems obvious when viewed through the typically countable context of graph theory. But the natural construction is to start somewhere and ‘just keep going’ wherever possible, which involves making a choice from the out-neighbourhood at lots of vertices. Thus it is clear why this is weaker than AC. Anyway, in the sequel, we assume that this extension of Lebesgue measure exists.

Example (from [1]): We take (X_n)_{n\ge 1} to be an IID sequence of non-negative RVs defined on the probability space ((0,1),\mathcal{B}(0,1),\mathrm{Leb}), with expectation under Lebesgue measure equal to 1. It is not obvious how to do this, with the restriction on the probability space. One example might be to write \omega\in(0,1) as \overline{\omega_1\omega_2\ldots}, the binary expansion, and then set X_n=2\omega_n. We will later require that X_n is not identically 1, which certainly holds in this example just given.

Let \mu be the extension of Lebesgue measure to the power set \mathcal{P}=\mathcal{P}(0,1). Now define the measures:

\mu_n(B)=\mathbb{E}_\mu(X_n \mathbf{1}_B),\quad \forall B\in\mathcal{P}.

To clarify, we are defining a family of measures which also are defined for all elements of the power set. We have defined them in a way that is by definition a coupling. This will make it possible to show convergence in distribution, but they will not converge almost surely in this coupling, or, in fact, under any coupling. Now consider a restricted class of sets, namely B\in \sigma(X_1,\ldots,X_k), the class of sets distinguishable by the outcomes of the first k RVs.

[Caution: the interpretation of this increasing filtration is a bit different to the standard setting with for example Markov processes, as the sets under consideration are actually subsets of the probability space on which everything is defined. In particular, there is no notion that a ‘fixed deterministic set’ lies in all the layers of the filtration.]

Anyway, by independence, when n>k, by independence, we have

\mu_n(B)=\mathbb{E}_\mu(X_n\mathbf{1}_B)=\mathbb{E}_\mu(X_n)\mathbb{E}_\mu(\mathbf{1}_B)=\mu(B).

So whenever B\in\mathcal{F}\bigcup_k \sigma(X_1,\ldots,X_k), \lim_n \mu_n(B)=\mu(B). By MCT, we can extend this convergence to any bounded \mathcal F-measurable function.

This is the clever bit. We want to show that \mu_n(B)\rightarrow\mu(B) for all B\in\mathcal P, but we only have it so far for B\in\mathcal F. But since \mathcal{F}\subset \mathcal P, which is the base field of the probability space under the (non-AC) assumption, we can take conditional expectations. In particular for any B\in\mathcal P, \mathbb{E}_\mu[\mathbf{1}_B | \mathcal{F}] is a bounded, \mathcal F-measurable function. Hence, by definition of \mu_n and the extended MCT result:

\mu_n(B)=\mathbb{E}_\mu[X_n\mathbb{E}_\mu[\mathbf{1}_B|\mathcal F]]=\mathbf{E}_{\mu_n}[\mathbb{E}_\mu[\mathbf{1}_B|\mathcal F]] \rightarrow \mathbb{E}_\mu [\mathbb{E}_\mu[\mathbf{1}_B |\mathcal{F}]].

Now, since by definition \mathbf{1}_B is \mathcal{P}-measurable, applying the tower law gives that this is equal to \mu(B). So we have

\mu_n(B)\rightarrow \mu(B),\quad \forall B\in\mathcal{P}. (*)

This gives weak convergence \mu_n\Rightarrow \mu. At first glance it might look like we have proved a much stronger condition than we need. But recall that in any set equipped with the discrete topology, any set is both open and closed, and so to use the portmanteau lemma, (*) really is required.

Now we have to check that we can’t have almost sure convergence in any coupling of these measures. Suppose that we have a probability space with random variables Y,(Y_n) satisfying \mathcal L(Y)=\mu, \mathcal L(Y_n)=\mu_n. But citing the example I gave of X_n satisfying the conditions, the only values taken by Y_n are 0 and 2, and irrespective of the coupling,

\mathbb{P}(Y_n=2\text{ infinitely often})>0.

So it is impossible that Y_n can converge almost surely to any supported on [0,1].

References

[1] Berti, Pratelli, Rigo – Skorohod Representation and Disintegrability (here – possibly not open access)

[2] Jakubowski – The almost sure Skorokhod representation for subsequences in non-metric spaces.

Advertisement

Skorohod Representation Theorem

Continuing the theme of revising theory in the convergence of random processes that I shouldn’t have forgotten so rapidly, today we consider the Skorohod Representation Theorem. Recall from the standard discussion of the different modes of convergence of random variables that almost sure convergence is among the strongest since it implies convergence in probability and thus convergence in distribution. (But not convergence in L_1. For example, take U uniform on [0,1], and X_n=n\mathbf{1}_{\{U<\frac{1}{n}\}}.)

Almost sure convergence is therefore in some sense the most useful form of convergence to have. However, it comes with a strong prerequisite, that the random variables be defined on the same probability space, which is not required for convergence in distribution. Indeed, one can set up weak versions of convergence in distribution which do not even require the convergents to be random variables. The Skorohod representation theorem gives a partial converse to this result. It states some conditions under which random variables which converge in distribution can be coupled on some larger probability space to obtain almost sure convergence.

Skorohod’s original proof dealt with convergence of distributions defined on complete, separable metric spaces (Polish spaces). The version discussed here is from Chapter 5 of Billingsley [1], and assumes the limiting distribution has separable support. More recent authors have considered stronger convergence conditions (convergence in total variation or Wasserstein distance, for example) with weaker topological requirements, and convergence of random variables defined in non-metrizable spaces.

Theorem (Skorohod representation theorem): Suppose that distributions P_n\Rightarrow P, where P is a distribution with separable support. Then we can define a probability space (\Omega,\mathcal{F},\mathbb{P}) and random variables X,(X_n)_{n\ge 1} on this space such that the laws of X,X_n are P,P_n respectively and X_n(\omega)\rightarrow X(\omega) for all \omega\in\Omega.

NB. We are proving ‘sure convergence’ rather than merely almost sure convergence! It is not surprising that this is possible, since changing the value of all the X_ns on a set with measure zero doesn’t affect the conditions for convergence in distribution.

Applications: Before going through the Billingsley proof, we consider one simple application of this result. Let S be a separable metric space containing the support of X, and g a continuous function S\rightarrow S'. Then

X_n\stackrel{a.s.}{\rightarrow}X\quad\Rightarrow\quad g(X_n)\stackrel{a.s.}{\rightarrow}g(X).

So, by applying the Skorohod representation theorem once, and the result that almost sure convergence implies convergence in distribution, we have shown that

X_n\stackrel{d}{\rightarrow}X\quad\Rightarrow\quad g(X_n)\stackrel{d}{\rightarrow}g(X),

subject to these conditions on the space supporting X. And we have avoided the need to be careful about exactly which class of functions determine convergence in distribution, as would be required for a direct argument.

Proof (from [1]): Unsurprisingly, the idea is to construct realisations of the (X_n) from a realisation of X. We take X, and a partition of the support of X into small measurable sets, chosen so that the probability of lying in a particular set is almost the same for X_n as for X, for large n. Then, the X_n are constructed so that for large n, with limitingly high probability X_n lies in the same small set as X.

Constructing the partition is the first step. For each x\in S:=\mathrm{supp}(X), there must be some radius \frac{\epsilon}{4}<r_x<\frac{\epsilon}{2} such that P(\partial B(x,r_x)=0. This is where we use separability. Since every point in the space is within \frac{\epsilon}{4} of some element of a countable sequence of elements of the space, we can take a countable subset of these open balls B(x,r_x) which cover the space. Furthermore, we can take a finite subset of the balls which cover all of the space apart from a set of measure at most \epsilon. We want the sets to be disjoint, and we can achieve this by removing the intersections inductively in the obvious way. We end up with a collection B_0,B_1,\ldots,B_k, where B_0 is the leftover space, such that

  • P(B_0)<\epsilon
  • P(\partial B_i)=0,\quad i=0,1,\ldots,k
  • \mathrm{diam}(B_i)<\epsilon,\quad i=1\ldots,k.

Now suppose for each m, we take such a partition B^m_0,B^m_1,\ldots,B^m_{k_m}, for which \epsilon_m=\frac{1}{2^m}. Unsurprisingly, this scaling of \epsilon is chosen so as to use Borel-Cantelli at the end. Then, from convergence in distribution, there exists an integer N_m such that for n\ge N_m, we have

P_n(B^m_i)\ge (1-\epsilon_m)P(B^m_i),\quad i=0,1,\ldots,k_m. (*)

Now, for N_m\le n <N_{m+1}, for each B^m_i with non-zero probability under P, take Y_{n,i} to be independent random variables with law P_n(\cdot | B^m_i) equal to the restriction onto the set. Now take \xi\sim U[0,1] independent of everything so far. Now we make concrete the heuristic for constructing X_n from X. We define:

X_n=\sum_{i=0}^{k_m}\mathbf{1}_{\{\xi\le 1-\epsilon_m, X\in B^m_i\}} Y_{n,i} + \mathbf{1}_{\{\xi>1-\epsilon_m\}}Z_n.

We haven’t defined Z_n yet. But, from (*), there is a unique distribution such that taking Z_n to be independent of everything so far, with this distribution, we have \mathcal{L}(X_n)=P_n. Note that by iteratively defining random variables which are independent of everything previously defined, our resulting probability space \Omega will be a large product space.

Note that \xi controls whether the X_n follow the law we have good control over, and we also want to avoid the set B^m_0. So define E_m:=\{X\not \in B^m_0, \xi\le 1-\epsilon_m\}. Then, P(E_m)<2\epsilon_m=2^{-(m-1)}, and so by Borel-Cantelli, with probability 1, E_m holds for all m larger than some threshold. Let us call this \liminf_m E_m=: E, and on this event E, we have by definition X_n \rightarrow X. So we have almost sure convergence. But we can easily convert this to sure convergence by removing all \omega\in\Omega for which \xi(\omega)=1 and setting X_n\equiv X on E^c, as this does not affect the distributions.

Omissions: 

  • Obviously, I have omitted the exact construction of the distribution of Z_n. This can be reverse reconstructed very easily, but requires more notation than is ideal for this medium.
  • It is necessary to remove any sets B^m_i with zero measure under P for the conditioning to make sense. These can be added to B^m_0 without changing any of the required conditions.
  • We haven’t dealt with any X_n for n<N_1.

The natural question to ask is what happens if we remove the restriction that the space be separable. There are indeed counterexamples to the existence of a Skorohod representation. The clearest example I’ve found so far is supported on (0,1) with a metric inducing the discrete topology. If time allows, I will explain this construction in a post shortly.

References

[1] – Billingsley – Convergence of Probability Measures, 2nd edition (1999)

Persistent Hubs

This post is based on the paper “Existence of a persistent hub in the convex preferential attachment model” which appeared on arXiv last week. It can be found here. My aim is to explain (again) the model; the application-based motivation for the result; and a heuristic for the result in a special case. In particular, I want to stress the relationship between PA models and urns.

The preferential attachment model attempts to describe the formation of large complex networks. It is constructed dynamically: vertices are introduced one at a time. For each new vertex, we have to choose which existing vertices to join to. This choice is random and reinforced. That is, the new vertex is more likely to join to an existing vertex with high degree than to an existing vertex with degree 1. It’s clear why this might correspond well to the evolution of, say, the world wide web. New webpages are much more likely to link to an established site, eg Wikipedia or Google, than to a uniformly randomly chosen page.

The model is motivated also by a desire to fit a common property of real-world networks that is not exhibited by, among others, the Erdos-Renyi random graph model. In such a network, we expect a few nodes to have much greater connectivity than average. In a sense these so-called hubs drive connectivity of the system. This makes sense in practice. If you are travelling by train around the South-East of England, it is very likely you will pass through at least one of Reading, East Croydon, or about five major terminus in London. It would be absurd for every station to be of equal significance to the network. By contrast, the typical vertex degree in the sparse Erdos-Renyi model is O(1), and has a limiting Poisson distribution, with a super-exponential tail.

So, this paper addresses the following question. We know that the PA model, when set up right, has power-law tails for the degree distribution, and so has a largest degree that is an order of magnitude larger than the average degree. Let’s call this the ‘hub’ for now. But the model is dynamic, so we should ask how this hub changes in time as we add extra vertices. In particular, is it the case that one vertex should grow so large that it remains as the dominant hub forever? This paper answers this question in the affirmative, for a certain class of preferential attachment schemes.

We assign a weighting system to possible degrees, that is a function from N to R+. In the case of proportional PA, this function could be f(n)=n. In general, we assume it is convex. Note that the more convex this weight function is, the stronger the preference a new vertex feels towards existing dominant vertices. Part of the author’s proof is a formalisation of this heuristic, which provides some machinery allowing us to treat only really the case f(n)=n. I will discuss only this case from now on.

I want to focus on the fact that we have another model which describes aspects of the degree evolution rather well. We consider some finite fixed collection of vertices at some time, and consider the evolution of their degrees. We will be interested in limiting properties, so the exact time doesn’t matter too much. We look instead at the jump chain, ie those times when one of the degrees changes. This happens when a new vertex joins to one of the chosen vertices. Given that the new vertex has joined one of the chosen vertices, the choice of which of the chosen vertices is still size-biased proportional to the current degrees. In other words, the jump chain of this degree sequence is precisely Polya’s Urn.

This is a powerful observation, as it allows us to make comments about the limiting behaviour of finite quantities almost instantly. In particular, we know that from any starting arrangement, Polya’s Urn converges almost surely. This is useful to the question of persistence for the following reason.

Recall that in the case of two colours, starting with one of each, we converge to the uniform distribution. We should view this as a special case of the Dirichlet distribution, which is supported on partitions into k intervals of [0,1]. In particular, for any fixed k, the probability that two of the intervals have the same size is 0, as the distribution is continuous. So, since the convergence of the proportions in Polya’s Urn is almost sure, with probability one all of the proportions are with \epsilon>0 of their limit, and so taking epsilon small enough, given the limit, which we are allowed to do, we can show that the colour which is largest in the limit is eventually the largest at finite times.

Unfortunately, we can’t mesh these together these finite-dimensional observations particularly nicely. What we require instead is a result showing that if a vertex has large enough degree, then it can never be overtaken by any new vertex. This proved via a direct calculation of the probability that a new vertex ‘catches up’ with a pre-existing vertex of some specified size.

That calculation is nice and not too complicated, but has slightly too many stages and factorial approximations to consider reproducing or summarising here. Instead, I offer the following heuristic for a bound on the probability that a new vertex will catch up with a pre-existing vertex of degree k. Let’s root ourselves in the urn interpretation for convenience.

If the initial configuration is (k,1), corresponding to k red balls and 1 blue, we should consider instead the proportion of red balls, which is k/k+1 obviously. Crucially (for proving convergence results if nothing else), this is a martingale, which is clearly bounded within [0,1]. So the expectation of the limiting proportion is also k/k+1. Let us consider the stopping time T at which the number of red balls is equal to the number of blue balls. We decompose the expectation by conditioning on whether T is finite.

\mathbb{E}X_\infty=\mathbb{E}[X_\infty|T<\infty]\mathbb{P}(T<\infty)+\mathbb{E}[X_\infty|T=\infty]\mathbb{P}(T=\infty)

\leq \mathbb{E}[X_\infty | X_T,T<\infty]\mathbb{P}(T<\infty)+(1-\mathbb{P}(T=\infty))

using that X_\infty\leq 1, regardless of the conditioning,

= \frac12 \mathbb{P}(T<\infty) + (1-\mathbb{P}(T<\infty))

\mathbb{P}(T<\infty) \leq \frac{2}{k+1}.

We really want this to be finite when we sum over k so we can use some kind of Borel-Cantelli argument. Indeed, Galashin gets a bound of O(k^{-3/2}) for this quantity. We should stress where we have lost information. We have made the estimate \mathbb{E}[X_\infty|T=\infty]=1 which is very weak. This is unsurprising. After all, the probability of this event is large, and shouldn’t really affect the limit that much when it does not happen. The conditioned process is repelled from 1/2, but that is of little relevance when starting from k/k+1. It seems likely this expectation is in fact \frac{k}{k+1}+O(k^{-3/2}), from which the result will follow.

Large Deviations and the CLT

Taking a course on Large Deviations has forced me to think a bit more carefully about what happens when you have large collections of IID random variables. I guess the first thing think to think about is ‘What is a Large Deviation‘? In particular, how large or deviant does it have to be?

Of primary interest is the tail of the distribution function of S_n=X_1+\ldots+X_n, where the X_i are independent and identically distributed as X. As we can always negate everything later if necessary, we typically consider the probability of events of the type:

\mathbb{P}(S_n\geq \theta(n))

where \theta(n) is some function which almost certainly increases fairly fast with n. More pertinently, if we are looking for some limit which corresponds to an actual random variable, we perhaps want to look at lots of related \theta(n)s simultaneously. More concretely, we should fix \theta and consider the probabilities

\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha). (*)

Throughout, we lose no generality by assuming that \mathbb{E}X=0. Of course, it is possible that this expectation does not exist, but that is certainly a question for another post!

Now let’s consider the implications of our choice of \theta(n). If this increases with n too slowly, and the likely deviation of S_n is greater than \theta(n), then the event might not be a large deviation at all. In fact, the difference between this event and the event (S_n is above 0, that is, its mean) becomes negligible, and so the probability at (*) might be 1/2 or whatever, regardless of the value of \alpha. So object \lim \frac{S_n}{\theta(n)} whatever that means, certainly cannot be a proper random variable, as if we were to have convergence in distribution, this would imply that the limit RV consisted of point mass at each of \{+\infty, -\infty\}.

On the other hand, if \theta(n) increases rapidly with n, then the probabilities at (*) might become very small indeed when \alpha>0. For example, we might expect:

\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha)=\begin{cases}0& \alpha>0\\1&\alpha<0.\end{cases}

and more information to be required when \alpha=0. This is what we mean by a large deviation event. Although we always have to define everything concretely in terms of some finite sum S_n, we are always thinking about the behaviour in the limit. A large deviation principle exists in an enormous range of cases to show that these probabilities in fact decay exponentially. Again, that is the subject for another post, or indeed the lecture course I’m attending.

Instead, I want to return to the Central Limit Theorem. I first encountered this result in popular science books in a vague “the histogram of boys’ heights looks like a bell” kind of way, then, once a normal random variable had been to some extent defined, it returned in A-level statistics courses in a slightly more fleshed out form. As an undergraduate, you see it in several forms, including as a corollary following from Levy’s convergence theorem.

In all applications though, it is generally used as a method of calculating good approximations. It is not uncommon to see it presented as:

\mathbb{P}(a\sigma\sqrt{n}+\mu n\leq S_n\leq b\sigma\sqrt{n}+\mu n)\approx \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}dx.

Although in many cases that is the right way to think use it, it isn’t the most interesting aspect of the theorem itself. CLT says that the correct scaling of \theta(n) so that the deviation probabilities lie between the two cases outline above is the same (that is, \theta(n)=O(\sqrt{n}) in some sense) for an enormous class of distributions, and in particular, most distributions that one might encounter in practice (ie finite mean, finite variance). There is even greater universality, as furthermore the limit distribution at this interface has the same form (some appropriate normal distribution) whenever X is in this class of distributions. I think that goes a long way to explaining why we should care about the theorem. It also immediately prompts several questions:

  • What happens for less regular distributions? It is now more clear what the right question to ask in this setting might be. What is the appropriate scaling for \theta(n) in this case, if such a scaling exists? Is there a similar universality property for suitable classes of distributions?
  • What is special about the normal distribution? The theorem itself shows us that it appears as a universal scaling limit in distribution, but we might reasonably ask what properties such a distribution should have, as perhaps this will offer a clue to a version of CLT or LLNs for less regular distributions.
  • We can see that the Weak Law of Large Numbers follows immediately from CLT. In fact we can say more, perhaps a Slightly Less Weak LLN, that

\frac{S_n-\mu n}{\sigma \theta(n)}\stackrel{d}{\rightarrow}0

  • whenever \sqrt{n}<<\theta(n). But of course, we also have a Strong Law of Large Numbers, which asserts that the empirical mean converges almost surely. What is the threshhold for almost sure convergence, because there is no a priori reason why it should be \theta(n)=n?

To be continued next time.