Trap models and laws of not-so-large numbers

Posted on August 4, 2017 by dominicyeo

I’m back in Rio, this time for the Brazilian Probability School, which this year is being held in parallel with the Brazilian Mathematical Colloquium, so there’s a lot of possible lectures to be attending across a wide range of topics. I’ve been paying particular to a course by Veronique Gayrard concerning the phenomenon of aging, as seen in various spin-glass and trap models. [Lecture notes exist, but haven’t yet been put online.]

I want to write something about the setup for one of these models. It took me quite a long time to settle on a title for this post, and as you can see I’ve hedged. At least in this post, I’m not so interested in the model (and don’t want to try and offer a physical motivation at this point) but rather in talking about the natural model-independent problem it reduces to.

Motivation

Let $X_1,X_2,X_3,\ldots$ be IID random variables which take some fixed value K>0 with probability 1/K, and otherwise take the value zero. The law of large numbers says that for large m, the rescaled partial sum process $\frac{1}{m}(X_1+\ldots+X_m) \approx 1$ . The weak LLN makes this precise in the sense of convergence in distribution, and the strong LLN gives almost sure convergence.

But the speed of convergence is obviously not uniform over all distributions of the underlying IID random variables. This is particularly clear in the setup I’ve outlined, in the regime where $K\rightarrow\infty$ . Certainly if $1\ll m\ll K$ , then we have $\frac{1}{m}(X_1+\ldots+X_m) \ge \frac{K}{m}$ with probability $\approx \frac{m}{K}$ and otherwise $\frac{1}{m}(X_1+\ldots+X_m) = 0$ . So if we let K and m diverge together with scaling as given, the only version of a LLN we can write down is

$\frac{1}{m}(X_1+\ldots+X_m)\stackrel{d}\rightarrow 0,$

which is obviously different to the original version for fixed K and diverging m.

If we take $m=\Theta(K)$ , then the rescaled partial sum process converges in distribution to a scaled Poisson process. Of course, the Poisson process obeys it’s own law of large numbers (or law of large times), but on this scale the first-order behaviour is random.

At a more general level, what we are doing in the previous examples is looking at a process which converges to equilibrium, but studying it on a faster timescale than the timescale of this convergence. The REM-like trap model, which will be the eventual focus of this post, does exactly this to a continuous-time Markov chain, with the additional factor that the holding rates are random and heavy-tailed.

The mean-field REM-like trap model

This REM-like trap model is defined as follows. We have N sites, and for these sites we sample an IID collection of holding rates $(\tau_N(1),\ldots,\tau_N(N))$ according to some distribution. We then choose a sequence of a IID uniform samples from {1,…,N}, labelled $(J_N(1),J_N(2),\ldots)$ . We think of this as recording an itinerary of visits to the sites, where the jth site we visit is $J_N(j)$ . (Though notice that under this definition, it’s possible that the jth site we visit and the j+1st site we visit are the same.) We wait at each site for an exponential holding time, with parameter $\tau_N(j)$ if we are at site j, and these holding times are independent of the other holding times, and independent of the trajectory, all conditional on $(\tau_N(1),\ldots,\tau_N(N))$ .

You can think of this as a continuous-time RW on the complete graph $K_N$ (with self-loops), where the jump chain is uniform, and the holding rates are given by $(\tau_N(1),\ldots,\tau_N(N))$ . This explains the notation, and how you’d construct a similar model on a different underlying graph.

The general of a trap model is a random walk with very inhomogeneous speed, for example because some holding times have very large expectation. In a setting with more inbuilt geometry, for example on a lattice, we can imagine the RW getting trapped in regions associated with atypically low speeds. We might therefore think of a site with very long holding times as being deep, in the sense that the chain might get stuck there.

This will be most interesting if we allow an extreme range of values taken by $\tau_N$ , and so the best choice is a distribution in the domain of attraction of an $\alpha$ -stable law with parameter $\alpha\in(0,1)$ . That is $\mathbb{P}(\tau \ge u) = u^{-\alpha}L(u)$ , where L is a slowly-varying function at $+\infty$ .

This distribution has infinite mean, and so we couldn’t apply either LLN to a sequence of copies of $\tau$ . However, obviously the sequence $(\tau_N(1),\ldots,\tau_N(N))$ almost surely does have finite mean, since each entry is finite! So for each N, the trap model will have a LLN on large timescales, but we will investigate at faster timescales.

The clock process

At least for the purpose of this post, we will focus on the clock process, which records the (continuous) time which elapses before we arrive at the *k*th state of the jump chain.

That is,

$S_N(0)=0, \quad S_N(k) = \sum_{i=0}^k \mathrm{Exp}\left(1/\tau_N(J_N(i))\right),$

where the exponential random variables are independent except through their parameters. This can be made even more clear if we take advantage of the method to write a general exponential distribution as a multiple of a exponential distribution with parameter 1. Let $e_0,e_1,\ldots$ be IID exponential RVs independent of $(\tau_N(1),\ldots,\tau_N(N))$ and the jump chain. Then

$S_N(k)=\sum_{i=0}^k \tau_N(J_N(i)) e_i.$

Let’s briefly pause to apply the LLN to $S_N$ for fixed N. It matters whether we consider the quenched or annealed settings here. As usual, quenched means we fix a realisation of the random environment, and draw all conclusions in terms of that environment (think of conditional expectations). And annealed means that we also include the randomness of the environment. This is notationally annoying, so as a shorthand we write $\mathbb{E}_{\tau_N}$ for quenched expectations $\mathbb{E}[\cdot \,|\, \tau_N(1),\ldots,\tau_N(N)]$ , and $\mathbb{E}$ for an expectation over all randomness.

Then the quenched rate of growth of $S_N$ is given by

$\mathbb{E}_{\tau_N}\left[ \frac{S_N(k)}{k}\right] = \frac{\tau_N(1)+\ldots +\tau_N(N)}{N},$

and so the annealed rate

$\mathbb{E}\left[\frac{S_N(k)}{k}\right] = \infty,$

since $\mathbb{E}[\tau_N(1)]=\infty$ . But as in the introduction, these rates are only relevant to laws of large numbers when k grows on a large enough timescale, and we will consider smaller scales of k.

Timescales of the clock process

We’re going to look for scaling limits of the clock process. The increments are ‘sort of IID’ and ‘sort of heavy-tailed’ (we’ll clarify these sort ofs when we need to) so it wouldn’t be surprising if the scaling limits are Levy processes. The clock process is increasing, so in fact the scaling limits should be subordinators, and it wouldn’t be surprising if under some circumstances they turned out to be stable subordinators.

There is flexibility about how to do the rescaling. From now on, we are working in a $N\rightarrow\infty$ regime. Let’s assume we look at $t a_N$ steps of the jump chain, where $(a_N)$ is some divergent sequence. A property of large sums of IID stable distributions with parameter $\alpha\in(0,1)$ is that the scaling of the value of the sum is comparable to the scale of the largest summand. That is, the partial sum is dominated by its largest summands. Compare with the standard case for non-negative RVs, where for k summands, the sum is $\Theta(k)$ , while the largest summand is $O(\log k)$ .

So to identify the scale of the clock process after $a_N$ steps of the jump chain, it’s sufficient to identify the scale of its expected largest holding time. All of this is vague at the level of constants, so we choose a divergent sequence $(c_N)$ for which

$\mathbb{P}(\tau_N(1) \ge c_N) = \Theta\left(\frac{1}{a_N}\right).$

Note 1: this means that the number of holding times among the first $a_N$ which are at least $c_N$ is binomial with $\Theta(1)$ expectation. The fact that is well-approximated by a Poisson distribution will be relevant shortly.

Note 2: because we already insisted that $\tau(x)$ had a slowly-varying tail, this gives control of the $\mathbb{P}(\tau_N(1)\ge 5c_N)$ etc as well.

We expect that $S_N(t a_N) = \Theta(c_N)$ , and so we consider scaling limits of the process

$\tilde S_N(t):= \frac{1}{c_N} S_N(\lfloor t a_N \rfloor),$

as usual. [Note I am using the opposite convention to VG’s notes, where ~ denotes the unrescaled clock process.]

Scaling limits

We identify two types of scaling limit, depending on whether $a_N\ll N$ or $a_N = \Theta(N)$ . The former is called an intermediate timescale, while the latter is an extreme timescale. After this long motivation and notational preliminary section, my goal is to explain (partly to myself) why these scaling limits are different.

First, we state the result for intermediate timescales. Let $S^{\mathrm{int}}$ be the stable subordinator with parameter $\alpha$ , that is with Levy measure $\alpha \Gamma(\alpha)u^{-\alpha}$ . Then $\tilde S_N \Rightarrow S^{\mathrm{int}}$ , in the Skorohod topology. We need to be clear about the sense of convergence, and the role of the random environment. It turns out that if in addition $a_N\ll \frac{N}{\log N}$ , then this convergence holds for almost all realisations of the random environment. That is, the laws of the processes (with respect to the randomness of the jump chain / holding times etc) converge. When $a_N$ is only $\ll N$ , then the convergence holds in probability with respect to the environment. It took me a while to parse what this means. It means that for large N, the probability that the random environment induces a law of $\tilde S_N$ which is far from the law of $S^{\mathrm{int}}$ tends to zero.

The exact Levy triple of the limit process is not the important message here, and if that’s unfamiliar, then it isn’t a problem. The point is that you would also get this limiting Levy process if you took the sum process of genuinely IID random variables with the same $\alpha$ -tail. And this is not surprising. Since recall that in the intermediate timescale $a_N\ll N$ , so during the first $t a_N$ steps of the jump chain, we do not typically visit many sites more than once. Indeed, if $a_N\ll \sqrt{N}$ , then this is the birthday problem, and we typically visit no site more than once. However, even in the weaker setting $a_N\ll N$ , look at the deepest 1000 sites we visit during the first $ta_N$ steps. We can compute that, in expectation, we visit essentially zero of these more than once. But these 1000 sites dominate the clock process at $ta_N$ . So from the point of view of the clock process, since we hardly ever visit relevant sites twice, the depths $\tau_N(J_N(1)),\tau_N(J_N(2)),\ldots$ are essentially independent, and so it’s unsurprising that we get the scaling limit corresponding to IID partial sums.

For extreme timescales, by contrast, this fails. If we take $a_N=1000 N$ , we expect to visit each site roughly 1000 times, indeed the number of visits to a given site will be approximately $\mathrm{Poisson}(1000)$ . But it’s still the case that the scaling limit will be dominated by the deepest sites. In particular, at some point on this timescale we will visit the deepest site, and indeed we will visit it multiple times if we look at $ta_N$ for large t. So the jumps of any scaling limit are not independent any more unless we condition on all the depths $\tau_N$ .

However, all is not lost, since we can show that the point process of rescaled depths $\sum \delta_{\tau_N(i)/c_N}$ converges to a Poisson random measure on $[0,\infty)$ . The candidate for the scaling limit of the clock process is then the subordinator whose Levy measure is this Poisson random measure. This isn’t itself a Levy process, but it is a mixture of Levy processes, reflecting that on extreme timescales the quenched and annealed viewpoints are different since there is enough time to visit the whole landscape.

Heuristically, the extreme timescale is the entry point for convergence to equilibrium. Indeed, taking $t\rightarrow\infty$ , the number of visits to each of the 1000 top sites converge to their expectation, corresponding to convergence of the clock process to equilibrium, since these holding times continue to dominate the sum. The clock process therefore starts to feel the finiteness of the state space, which introduces dependence between the most relevant holding times, which was not the close on intermediate timescales.

In the next post, I’m going to try and summarise VG’s descriptions of taking this model beyond the mean-field setting, where the range of possibilities becomes much much richer. I’m also going to try and say something and glassy dynamics and ageing, and why the physical motivation justifies considering these particular models and scalings.

Convergence of Transition Probabilities

Posted on January 31, 2014 by dominicyeo

As you can see, I haven’t got round to writing a post for a while. Some of my reasons for this have been good, and some have not. One reason has been that I’ve had to give a large number of tutorials for the fourth quarter of the second year probability course here in Oxford. The second half of this course concerns discrete-time Markov chains, and the fourth problem sheet discusses various modes of convergence for such models, as well as a brief tangent onto Poisson Processes. I’ve written more about Poisson Processes than perhaps was justifiable in the past, so I thought I’d say some words about convergence of transition probabilities in discrete-time Markov chains.

Just to be concrete, let’s assume the state space K is finite, and labelled {1,2,…,k}, so that it becomes meaningful to discuss

$p_{12}^{(n)}:=\mathbb{P}(X_n=2|X_0=1).$

That is, the probability that if we start at state 1, then after n ‘moves’ we are at state 2. We are interested in the circumstances under which this converges to the stationary distribution. The heuristic is that we can view a time-step of a Markov chain as an operation on the space of distributions on K. Note that this operation is deterministic. If this sounds complicated, what we mean is that we specify an initial distribution, that is the distribution of $X_0$ . If we consider the distribution of $X_1$ , this is given by $\lambda P$ , where $\lambda$ is the initial distribution, and P the transition matrix.

Anyway, the heuristic is that the stationary distribution is the unique fixed point of this operation on the space of distributions. It is therefore not unreasonable to assume that unless there are some periodic effects, we expect repeated use of this operation to move us closer to this fixed point.

We can further clarify this by considering the matrix form. Note that a transition P always has an eigenvalue equal to 1. This is equivalent to say that there is a solution to $\pi P=\pi$ . Note it is not immediately equivalent to saying that P has a stationary distribution, as the latter must be non-negative and have elements summing to one. Only the first property is difficult, and relies on some theory or cleverness to prove. It can also be shown that all eigenvalues satisfy $|\lambda|\le 1$ , and in general, there will be a single eigenvalue (ie dimension 1 eigenspace) with $|\lambda|=1$ , and the rest satisfies $|\lambda|<1$ . Then, if we diagonalise P, it is clear why $\pi P^n$ converges entry-wise, as $\pi UP^n U^{-1}$ converges. In the latter, only the entries in the row corresponding to $\lambda=1$ converge to something non-zero.

In summary, there is a strong heuristic for why in general, the transition probabilities should converge, and if they converge, that they should converge to the stationary distribution. In fact, we can prove that for any finite Markov chain, $p_{ij}^{(n)}\rightarrow \pi_j$ , provided we two conditions hold. The conditions are that the chain is irreducible and aperiodic.

In the rest of this post, I want to discuss what might go wrong when these conditions are not satisfied. We begin with irreducibility. A chain is irreducible if it has precisely one communicating class. That means that we can get from any state to any other state, not necessarily in one step, with positive probability. One obvious reason why the statement of the theorem cannot hold in this setting is that $\pi$ is not uniquely defined when the chain is not irreducible. Suppose, for example, that we have two closed communicating classes A and B. Then, supported on each of them is an invariant distribution $\pi^A$ and $\pi^B$ , so any affine combination of the two $\lambda \pi^A+(1-\lambda) \pi^B$ will give a stationary distribution for the whole chain.

In fact, the solution to this problem is not too demanding. If we are considering $p_{ij}^{(n)}$ for $i\in A$ a closed communicating class, then we know that $p_{ij}^{(n)}=0$ whenever $j\not\in A$ . For the remaining j, we can use the theorem in its original form on the Markov chain, with state space reduced to A. Here, it is now irreducible.

The only case left to address is if i is in an open communicating class. In that case, it suffices to work out the hitting probabilities starting from i of each of the closed communicating classes. Provided these classes themselves satisfy the requirements of the theorem, we can write

$p_{ij}^{(n)}\rightarrow h_i^A \pi^A_j,\quad i\not\in A, j\in A.$

To prove this, we need to show that as the number of steps grows to infinity, the probability that we are in closed class A converges to $h_i^A$ . Then, we decompose this large number of steps so to say that not only have we entered A with roughly the given probability, but in fact with roughly the given probability we entered A a long time in the past, and so there has been enough time for the original convergence result to hold in A.

Now we turn to periodicity. If a chain has period k, this says that we can split the state space into k classes $A_1,\ldots,A_k$ , such that $p_{ij}^{(n)}=0$ whenever $n\not\equiv j-i \mod k$ . Equivalently, the directed graph describing the possible transitions of the chain is k-partite. This definition makes it immediately clear that $p_{ij}^{(n)}$ cannot converge in this case. However, it is possible that $p_{ij}^{(kn)}$ will converge. Indeed, to verify this, we would need to consider the Markov chain with transition matrix $P^k$ . Note that this is no longer irreducible, as it there are no transitions allowed between classes $A_1,\ldots,A_k$ . Indeed, a more formal definition of the period, in terms of the lcd of possible return times allows us to conclude that there is no finer reducibility structure. That is, $A_1,\ldots,A_k$ genuinely are the closed classes when we consider the chain with matrix $P^k$ . And so the Markov chain with transition matrix $P^k$ restricted to any of the $A_i$ s satisfies the conditions of the theorem.

There remains one case which I’ve casually brushed over. When we were discussing the irreducible case, I said that if we had at least one communicating classes, then we could work out the limiting transition probabilities from a state in an open class to a state in a closed class by calculating the hitting probability of that closed class, then applying the standard version of the theorem to that closed class. This relies on the closed class being aperiodic.

Suppose otherwise that the destination closed class A has period k as before. If it were to be the case that the number of steps required to arrive at A had some fixed value mod k, or modulo a non-trivial divisor of k, then we certainly wouldn’t have convergence, for the same reasons as in the globally periodic case. However, we should ask whether we can ever have convergence?

In fact, the answer is yes. For concreteness, and because it’s easier to write ‘odd’ and ‘even’ than $m \mod k$ , let’s assume A has size 2 and period 2. That is, once we arrive in A, thereafter we alternate deterministically between the two states. Anyway, for some large time n, we can write $p_{ca}^{(n)}$ for $a\in A, c\not\in A$ as:

$p_{ca}^{(n)}=h_i^A(n),$

where the latter term is the probability that we arrive in A at a time-step which has the same parity as n. It’s not terribly hard to come up with an example where this holds, and this idea holds in greater generality, where A has period k (and not necessarily just k states), we have to demand that the probability of arriving at a time which is a mod k is equal for all a in [0,k-1].

Of course, for applications, we don’t normally care much about irreducible chains, and we can easily remove periodicity by introducing so-called laziness, whereby on each time-step we flip a coin (biased if necessary) and stay put if it comes up heads, and apply the transition matrix if it comes up tails. Then it’s possible to get from any state to itself in one step, and so we are by construction aperiodic.

Recent Progress and Gromov-Hausdorff Convergence

Posted on June 27, 2013 by dominicyeo

For the past few weeks I’ve been working on the problem of Cycle-Induced Forest Fires, which I’ve referred to in passing in some recent posts. The aim has been to find a non-contrived process which exhibits self-organised criticality, that is, where the process displays critical characteristics (scaling laws, multiple components at the largest order of magnitude) forever. Note that this is in contrast to the conventional Erdos-Renyi graph process, which is only critical at a single time n/2.

The conjecture is that the largest component in equilibrium typically has size on a scale of n^2/3. An argument based on the equilibrium proportion of isolated vertices gives an upper bound on this exponent. The working argument I have for the lower bound at the moment can comfortably fit on the back of a napkin, with perhaps some context provided verbally. Of course, the current full text is very much larger than that, mainly because the napkin would feature assertions like “event A happens at time $O(n^\beta)$ “; whereas the more formal argument has to go like:

“With high probability as $n\rightarrow\infty$ , event A happens between times $n^{\beta-\epsilon},n^{\beta+\epsilon},$ for any suitably small $\epsilon>0.$ Furthermore, the probability that A happens after this upper threshold decays exponentially with n for fixed $\epsilon$ , and the probability that A happens before the lower threshold is at most $n^{-\epsilon}$ . Finally, this is under the implicit assumption that there will be no fragmentations before event A, and this holds with probability $1-o(1)$ etc.”

It’s got to the point where I’ve exhausted the canonical set of symbols for small quantities: $\epsilon,\delta,(\eta ?)$ .

This has been a very long way of setting up what was going to be my main point, which is that at many points during undergraduate mathematics, colleagues (and occasionally to be honest, probably myself too) have complained that they “don’t want to have anything to do with analysis. They just want to focus on algebra / number theory / statistics / fluids…” Anyway, the point of this ramble was that I think I’ve realised that it is very hard to think about any sort of open problem without engaging with the sort of ideas that a few years ago I would have thought of (and possibly dismissed) as ‘analysis’.

Much of my working on this problem has been rather from first principles, so haven’t been thinking much about any neat less elementary theory recently.

Ok, so on with the actual post now.

Last month I talked about local limits of graphs, which describe convergence in distribution of (local) neighbourhood structure about a ‘typical’ vertex. This is the correct context in which to make claims like “components of $G(n,\frac{\lambda}{n})$ look like Galton-Watson trees with offspring distribution $\text{Po}(\lambda)$ “.

Even from this example, we can see a couple of drawbacks and omissions from this limiting picture. In the sub-critical regime, this G-W tree will be almost surely finite, but the number of vertices in the graph is going to infinity. More concretely, the limit description only tells us about a single component. If we wanted to know about a second component, in this case, it would be roughly independent of the size of the first component, and with the same distribution, but if we wanted to know about all components, it would get much more complicated.

Similarly, this local limit description isn’t particularly satisfactory in the supercritical regime. When the component in question is finite, this description is correct, but with high probability we have a giant component, and so the ‘typical’ vertex is with some positive probability in the giant component. This is reflected by the fact that the G-W tree with supercritical offspring distribution is infinite with some positive probability. However, the giant component does not look like a $\text{Po}(\lambda)$ G-W tree. As we exhaust O(n) vertices, the offspring distribution decreases, in expectation at least. In fact, without the assumption that the giant component is with high probability unique (so $\frac{L_1}{n}=1-\mathbb{P}(|C(v)|<\infty$ ), we can’t even deduce the expected size of the giant component from the local limit result.

This is all unsurprising. By definition a local limit describes the structure near some vertex. How near? Well, finitely near. It can be arbitrarily large, but still finite, so in particular, the change in the offspring distribution after O(n) vertices as mentioned above will not be covered.

So, if we want to learn more about the global structure of a large discrete object, we need to consider a different type of limit. In particular, the limit will not necessarily be a graph. Rather than try to define a priori a ‘continuum’ version of a graph, it is sensible to generalise from the idea that a graph is a discrete object and instead consider it as a metric space.

In this article, I don’t want to spend much time at all thinking about how to encode a finite graph as a metric space. We have a natural notion of graph distance between vertices, and it is not hard to extend this to points on edges. Alternatively, for sparse graphs, we have an encoding through various functions, which live in some (metric) function space.

However, in general, the graph will be a metric object itself, rather than necessarily a subset of a global metric space. We will be interested in convergence, so we need a suitable style of convergence of different metric spaces.

The natural candidate for this is the Gromov-Hausdorff metric, and the corresponding Gromov-Hausdorff convergence.

The Hausdorff distance between two subsets X, Y of a metric space is defined as follows. Informally, we say that $d_H(X,Y)<\epsilon$ if any point of X is within distance $\epsilon$ from some point of Y, in the sense of the original metric. Formally

$d_H(X,Y):=\max \{\sup_{x\in X}\inf_{y\in Y}d(x,y), \sup_{y\in Y}\inf_{x\in X}d(x,y)\}.$

It is not particularly illuminating to prove that this is in fact a metric. In fact, it isn’t a metric as the definition stands, but rather a pseudo-metric, which is exactly the same, only allowing d(X,Y)=0 when X and Y are not equal. Note that

$d(X^\circ,\bar X)=0,$

for any set X, so this gives an example, provided X is not both open and closed. Furthermore, if the underlying metric space is unbounded, then the Hausdorff distance between two sets might be infinite. For example in $\mathbb{R}$ ,

$d_H(\mathbb{R}_{<0},\mathbb{R}_{>0})=\infty.$

We can overcome this pair of objections by restricting attention to closed, bounded sets. In practice, many spaces under consideration will be real in flavour, so it makes sense to define this for compact sets when appropriate.

But this still leaves the underlying problem, which is how to define a distance function on metric spaces. If two metric spaces X and Y were both subspaces of some larger metric space then it would be easy, as we now have the Hausdorff distance. So this is in fact how we proceed in general. We don’t need any knowledge of this covering space a priori, we can just choose the one which minimises the resulting Hausdorff distance. That is

$d_{GH}(X,Y)=\inf\{d_H(\phi(X),\psi(Y))\},$

where the infimum is taken over all metric spaces (E,d), and isometric embeddings $\phi: X\rightarrow E, \psi: Y\rightarrow E$ .

The first observation is that this will again be a pseudometric unless we demand that X, Y be closed and bounded. The second is that this index set is not a set. Fortunately, this is quickly rectified. Instead consider all metrics on the disjoint union of sets X and Y, which is set, and contains the subset of those metrics which restrict to the correct metric on each of X and Y. It can be checked that this forms a true metric on the set of compact metric spaces up to isometry.

We have an alternative characterisation. Given compact sets X and Y, a correspondence between X and Y is a set of pairs in $X\times Y$ , such that both projection maps are surjective. Ie for any x in X, there is some pair (x,y) in the correspondence. Let $\mathcal{C}(X,Y)$ be the set of such correspondences. We then define the distortion of correspondence $\mathcal{R}$ by:

$\text{dis}(\mathcal{R}):=\sup\{|d_X(x_1,x_2)-d_2(y_1,y_2)|: (x_i,y_i)\in\mathcal{R}\}.$

Then

$d_{GH}(X,Y)=\frac{1}{2}\inf_{\mathcal{R}\in\mathcal{C}(X,Y)}\text{dis}(\mathcal{R}).$

In particular, this gives another reason why we don’t have to worry about taking an infimum over a proper class.

Gromov-Hausdorff convergence then has the natural definition. Note that this does not respect topological equivalence, ie homeomorphism. For example,

$\bar{B(0,\frac{1}{n})}\stackrel{GH}{\rightarrow} \{0\},$

where the latter has the trivial metric. In particular, although all the closed balls are homeomorphic, the G-H limit is not.

A final remark is that the trees we might be looking at are not necessarily compact, so it is useful to have a notion of how this might be extended to non-compact spaces. The answer is to borrow the idea from local limits of considering large finite balls around a fixed central point. In the case of trees, this is particularly well-motivated, as it is often quite natural to have a canonical choice for the ‘root’.

Then with identified points $p_n\in X_n$ , say $(X_n,p_n)\rightarrow (X,p)$ if for any R>0 the R-ball around p_n in X_n converges to the R-ball around p in X. We adjust the definition of distortion to include the condition that the infimum be over correspondences for which $(p_X,p_Y)$ is an element.

REFERENCES

This article was based on some lecture notes by Jean-Francois Le Gall from the Clay Institute Summer School which can be found on the author’s website here (about halfway down the page). This material is in chapter 3. I also used Nicolas Curien’s tutorials on this chapter to inform some of the examples. The resolution of the proper class problem was mentioned by several sources I examined. These notes by Jan Christina were among the best.

Graph matching & Levenshtein distance (p1) (alikhuram.wordpress.com)
Visibility of unrectifiable planar sets (ilaba.wordpress.com)
Littlewood’s Three Principles (shinyetemon.wordpress.com)
Almost all continuous functions are nowhere differentiable (chiasme.wordpress.com)
Definability of Truth in Probabilistic Logic (qmaurmann.wordpress.com)
Cantor-Bendixson rank in topology: Sierpinski-Mazurkiewicz theorem (chiasme.wordpress.com)

Skorohod Space

Posted on December 10, 2012 by dominicyeo

The following is a summary of a chapter from Billingsley’s Convergence of Probability Measures. The ideas are easy to explain heuristically, but this was the first text I could find which explained how to construct Skorohod space for functions on the whole of the non-negative reals in enough stages that it was easily digestible.

It is relatively straightforward to define a topology on C[0,1], as we can induce from the most sensible metric. In this topology, functions f and g are close together if

$\sup_{t\in[0,1]} |f(t)-g(t)|$ is small.

For cadlag functions, things are a bit more complicated. Two functions might be very similar, but have a discontinuity of similar magnitude at slightly different places. The sup norm of the difference is therefore macroscopically large. So we want a metric that also allows uniformly small deformations of the time scale.

We define the Skorohod (or Skorokhod depending on your transliteration preferences) metric d on D[0,1] as follows. Let $\Lambda$ be the family of continuous, strictly increasing functions from [0,1] to [0,1] which map 0 to 0 and 1 to 1. This will be our family of suitable reparameterisations of the time scale (or abscissa – a new word I learned today. The other axis in a co-ordinate pair is called the ordinate). Anyway, we now say that

$d(x,y)<\epsilon\quad\text{if }\exists \lambda\in\Lambda\text{ s.t. }$

$||\lambda - id||_\infty<\epsilon\quad\text{and}\quad ||f-\lambda\circ g||_\infty<\epsilon.$

In other words, after reparameterising the time scale for g, without moving any time by more than epsilon, the functions are within epsilon in the sup metric.

Weak Convergence

We have the condition: if $\{P_n\}$ is a tight sequence of probability measures and we have

$\text{If }P_n\pi_{t_1,\ldots,t_k}^{-1}\Rightarrow P\pi_{t_1,\ldots,t_k}^{-1}\quad\forall t_1,\ldots,t_k\in[0,1],\quad\text{then }P_n\Rightarrow P,$

where $\pi_{t_1,\ldots,t_k}$ is the projection onto a finite-dimensional set. This is a suitable condition for C[0,1]. For D[0,1], we have the additional complication that these projections might not be continuous everywhere.

We can get over this problem. For a measure P, set $T_P$ to be the set of $t\in[0,1]$ such that $\pi_t$ is continuous P-almost everywhere (ie for all $f\in D$ apart from a collection with P-measure = 0). Then, for all P, it is not hard to check that $0,1\in T_P$ and $[0,1]\backslash T_P$ is countable.

The tightness condition requires two properties:

1) $\lim_{K\rightarrow\infty} \limsup_{n}P_n[f:||f||\geq K]=0.$

2) $\forall \epsilon>0:\,\lim_\delta\limsup_n P_n[f:w_f'(\delta)\geq\epsilon]=0.$

These say, respectively, that the measure of $||f||$ doesn’t escape to $\infty$ , and there is no mass given in the limit to functions which ‘wiggle with infinite frequency on an epsilon scale of amplitude’.

$D_\infty=D[0,\infty)$

Our earlier definition of the Skorohod metric could have been written:

$d(f,g)=\inf_{\lambda\in\Lambda}\{||\lambda-\text{id}||\vee||f-\lambda\circ g||\}.$

From a topological convergence point of view, there’s no need to use the sup norm on $\lambda - \text{id}$ . We want to regulate smoothness of the reparameterisation, so we could use the norm:

$||\lambda||^\circ=\sup_{s<t}|\log\frac{\lambda(t)-\lambda(s)}{t-s}|,$

that is, the slope is uniformly close to 1 if $||\lambda||^\circ$ is small. The advantage of this choice of norm is that an extension to $D[0,\infty)$ is immediate. Also, the induced product norm

$d^\circ(f,g)=\inf_{\lambda\in\Lambda} \{||\lambda - \text{id}||^\circ \vee||x-\lambda\circ y||\}$

is complete. This gives us a few problems, as for example

$d_\circ(1_{[0,1)},1_{[0,1-\frac{1}{n})})=1,$

as you can’t reparameterise over the jump in a way that ensures the log of the gradient is relatively small. (In particular, to keep the sup norm less than 1, we would need $\lambda$ to send $[1-\frac{1}{n}]\mapsto 1$ , and so $||\lambda||^\circ=\infty$ by definition.)

So we can’t immediately define Skorohod convergence on $D_\infty$ by demanding convergence on any restriction to [0,t]. We overcome this in a similar way to convergence of distribution functions.

Lemma: If $d_t^\circ (f_n,f)\rightarrow_n 0$ then for any s<t with f cts at s, then $d_s^\circ(f_n,f)\rightarrow_n 0$ .

So this says that the functions converge in Skorohod space if for arbitrarily large times T where the limit function is continuous, the restrictions to [0,T] converge. (Note that cadlag functions have at most countably many discontinuities, so this is fine.)

A metric for $D_\infty$

If we want to specify an actual metric $d_\infty^\circ$ , the usual tools for specifying a countable product metric will do here:

$d_\infty^\circ(f,g)=\sum_{m\geq 1}2^{-m}[1\wedge d_m^\circ(f^m,g^m)],$

where $f^m$ is the restriction of f to [0,m], with the potential discontinuity at m smoothed out:

$f^m(t)=\begin{cases}t&t\leq m-1\\ (m-t)f(t)&t\in[m-1,m]\\ 0&t\geq m.\end{cases}$

In particular, $d_\infty^\circ(f,g)=0\Rightarrow f^m=g^m\,\forall m$ .

It can be checked that:

Theorem: $d_\infty^\circ(f_n,f)\rightarrow 0$ in $D_\infty$ if and only iff

$\exists \lambda_n\in\Lambda_\infty\text{ s.t. }||\lambda_n-\text{id}||\rightarrow 0$

$\text{and }\sup_{t\leq m}|\lambda_n\circ f_n-f|\rightarrow_n 0,\,\forall m,$

and that $d_\infty^\circ (f_n,f)\rightarrow 0 \Rightarrow d_t^\circ(f_n,f)\rightarrow 0$ for every point of continuity t of f.

Similarly weak convergence and tightness properties are available, roughly as you might expect. It is probably better to reference Billingsley’s book or similar sources rather than further attempting to summarise them here.

An obvious remark

Aside

An obvious remark:

If a sequence of independent random variables $X_n$ converge almost surely to some limit $X$ , this limit must be a constant (almost surely).

I’ve been thinking about the Central Limit Theorem about related Large Deviations results this afternoon, and wasted almost an hour worrying about situations which were effectively well-disguised special cases of the above.

Why is it true? Well, suppose each $X_n$ is $\mathcal{F}_n$ -measurable. But by independence, we might as well take $\mathcal{F}_n=\sigma(X_n)$ . Then the limit variable $X$ is independent of $\mathcal{F}_n$ for all $n$ , and thus independent of $\cup F_n=\mathcal{F}\supset \sigma(X)$ . If $X$ is independent of itself, it must be almost surely constant.

Weak Law of Large Numbers and Central Limit Theorem via the Levy’s continuity theorem. (maikolsolis.wordpress.com)
republicans reject the central limit theorem? (orgtheory.wordpress.com)
Poisson processes appropriate for today (gottwurfelt.wordpress.com)

Martingale Convergence

Posted on April 20, 2012 by dominicyeo

I continue the theme of explaining important bookwork from the Part III course, Advanced Probability, as succinctly as possible. In this post, we consider the convergence properties of discrete time martingales.

1) Theorem: Assume $X$ is a martingale bounded in $L^1$ . Then

$X_n\rightarrow X_\infty\in L^1(\mathcal{F}_\infty)$ a.s.

Remark: The theorem and proof works equally well for $X$ a supermartingale.

Proof: Essentially, we want to reduce convergence of the random variables which make up the martingale to a countable collection of events. We do this by considering upcrossings, which counts the number of times the process alternates from less than a given interval to greater than a given interval. The formal definition will be too wide for this format, so we summarise as

$N_n([a,b],X)=$ the number of disjoint time intervals up to time n in which X goes from $[-\infty a)$ to $(b,\infty]$ . Define $N([a,b],X)$ to be the limit as n increases to infinity.

It is a genuinely easy check that a sequence converges (possibly to $\infty$ ) iff this number of upcrossings of any interval with rational bounds is finite. We will show that the martingale almost surely has this property. The key lemma is a bound due to Doob:

Lemma: $(b-a)\mathbb{E}[N_n([a,b],X)]\leq \mathbb{E}[(X_n-a)^-]$

Proof: Say $S_1<T_1<S_2<T_2,\ldots$ are the successive hitting times of $[-\infty,a),(b,\infty]$ respectively. So $N_n=\inf\{k:T_k\leq n\}$ . We decompose, abbreviating the number of upcrossings as N.

$\sum_{k=1}^n(X_{T_k\wedge n}-X_{S_k\wedge n})=\sum_{k=1}^N(X_{T_k}-X_{S_k})+(X_n-X_{S_{N+1}})1_{\{S_{N+1}\leq n\}}$

Now take an expectation of both sides, applying the Optional Stopping Theorem to the bounded stopping times on the LHS. (If we are working with a supermartingale, formally we need to take $\mathbb{E}[\cdot|\mathcal{F}_{S_k}]$ of each summand on LHS to show that they are non-negative, and so taking a further expectation over $\mathcal{F}_{S_k}$ gives the required result.) We obtain:

$0\geq (b-a)\mathbb{E}N-\mathbb{E}(X_n-X_{S_{N+1}})1_{\{S_{N+1}\leq n\}}$

If $S_{N+1}>n$ then both $1_{\{S_{N+1}\leq n\}}=(X_n-a)^-=0$ . Otherwise $(X_n-a)^-\geq X_{S_{N+1}}-X_n$ . This complete the proof of the lemma.

Since $\mathbb{E}[(X_n-a)^-]\leq \mathbb{E}|X_n|+a<\infty$ , where this last bound is uniform in n by assumption, applying monotone convergence, we get that $N([a,b],X)$ is almost surely finite for every pair $a<b\in\mathbb{Q}$ . Because this set is countable, we can deduce that this holds almost surely for every pair simultaneously. We therefore define $X_\infty(w)=\lim X_n(w)$ when this limit exists, and 0 otherwise. With probability one the limit exists. Fatou’s lemma confirms that $X_\infty\in L^1(\mathcal{F}_\infty)$ .

2) We often want to have convergence in $L^1$ as well. Recall for Part II Probability and Measure (or elsewhere) that

UI + Convergence almost surely is necessary and sufficient for convergence in $L^1$ .

This applies equally well to this situation. Note that for a martingale, this condition is often convenient, because, for example, we know that the set $\{\mathbb{E}[X_\infty|\mathcal{F}_n],n\}$ is UI for any integrable $X_\infty$ .

3) Convergence in $L^p$ is easier to guarantee.

Theorem: i) X a martingale bounded in $L^p$ iff ii) $X_n\rightarrow X_\infty$ in $L^p$ and almost surely iff iii) $\exists Z\in L^p$ s.t. $X_n=\mathbb{E}[Z|\mathcal{F}_n]$ a.s.

Remark: As part of the proof, we will show, as expected, that $X_\infty,Z$ are the same.

Proof: i)->ii) Almost sure convergence follows from the above result applied to the p-th power process. We apply Doob’s inequality about running maxima in a martingale process:

$||X_n^*||_p:=||\sup_{m\leq n}X_m||_p\leq \frac{p}{p-1}||X_n||_p$

Using this, we see that $X_n^*\uparrow X_\infty^*:=\sup|X_k|$ . Now consider $|X_n-X_\infty|\leq 2X_\infty^*\in L^p$ and use Dominated Convergence to confirm convergence in $L^p$ .

Note that Doob’s $L^p$ inequality can be proven using the same author’s Maximal inequality and Holder.

ii)->iii) As we suspected, we show $Z=X_\infty$ is suitable.

$||X_n-\mathbb{E}[X_\infty|\mathcal{F}_n]||_p\stackrel{\text{large }m}{=}||\mathbb{E}[X_m-X_\infty|\mathcal{F}_n]||_p\stackrel{\text{Jensen}}{\leq}||X_m-X_\infty||_p\rightarrow 0$

iii)->i) is easy. Z bounded in $L^p$ implies X bounded by a simple application of the triangle inequality in the definition of conditional expectation.

Convergence of Random Variables

Posted on January 15, 2012 by dominicyeo

The relationship between the different modes of convergence of random variables is one of the more important topics in any introduction to probability theory. For some reason, many of the textbooks leave the proofs as exercises, so it seems worthwhile to present a sketched but comprehensive summary.

Almost sure convergence: $X_n\rightarrow X\;\mathbb{P}$ -a.s. if $\mathbb{P}(X_n\rightarrow X)=1$ .

Convergence in Probability: $X_n\rightarrow X$ in $\mathbb{P}$ -probability if $\mathbb{P}(|X_n-X|>\epsilon)\rightarrow 0$ for any $\epsilon>0$ .

Convergence in Distribution: $X_n\stackrel{d}{\rightarrow} X$ if $\mathbb{E}f(X_n)\rightarrow \mathbb{E}f(X)$ for any bounded, continuous function f. Note that this definition is valid for RVs defined on any metric space. When they are real-valued, this is equivalent to the condition that $F_{X_n}(x)\rightarrow F_X(x)$ for every point $x\in \mathbb{R}$ where $F_X$ is continuous. It is further equivalent (by Levy’s Convergence Theorem) to its own special case, convergence of characteristic functions: $\phi_{X_n}(u)\rightarrow \phi_X(U)$ for all $u\in\mathbb{R}$ .

Note: In contrast to the other conditions for convergence, convergence in distribution (also known as weak convergence) doesn’t require the RVs to be defined on the same probability space. This thought can be useful when constructing counterexamples.

$L^p$ -convergence: $X_n\rightarrow X$ in $L^p$ if $||X_n-X||_p\rightarrow 0$ ; that is, $\mathbb{E}|X_n-X|^p\rightarrow 0$ .

Uniform Integrability: Informally, a set of RVs is UI if the integrals over small sets tend to zero uniformly. Formally: $(X_n)$ is UI if $\sup_{n,A\in\mathcal{F}}\{\mathbb{E}[|X_n|1(A)]|\mathbb{P}(A)\leq \delta\}\rightarrow 0$ as $\delta\rightarrow 0$ .

Note: In particular, a single RV, and a collection of independent RVs are UI. If X~U[0,1] and $X_n=n1(X\leq \frac{1}{n})$ , then the collection is not UI.

Continue reading →

Eventually Almost Everywhere

A blog about probability and olympiads by Dominic Yeo

Tag Archives: convergence

Trap models and laws of not-so-large numbers

Convergence of Transition Probabilities

Recent Progress and Gromov-Hausdorff Convergence

Skorohod Space

An obvious remark

Aside

Martingale Convergence

Convergence of Random Variables

Related articles

Related articles

Related articles