Gromov-Hausdorff Distance and Correspondences

This term, some members of the probability group have been reading Probability and Real Trees, by Steve Evans based on his Saint-Flour course in 2005. A pdf can be found here. This morning was my turn to present, and I gave a precis of Chapter 4, which concerns metrics on metric spaces, a family of tools which will be essential for later chapters which discuss convergence of trees, viewed as metric objects. Hausdorff Distance We start by considering a metric on subsets of a given base space X. The Hausdorff distance between two sets A, B is defined as

d_H(A,B):=\inf\{ r>0: A\subset U_r(B), B\subset U_r(A)\},

where U_r(A):=\{x\in X, d(x,A)< r\} consists of set A, and all the points within r of set A. So the Hausdorff distance measures how much we have to fatten each set before it contains the other. Note that if we have a giant set next to a tiny one, we will have to fatten the tiny one a great deal more to achieve this. Sometimes it will be more helpful to think of the following alternative characterisation

d_H(A,B) = \max\left \{ \sup_{a\in A}\inf_{b\in B} d(a,b), \inf_{a\in A}\sup_{b\in B} d(a,b)\right\}.

In words, we measure how far away is the point in A farthest from B, and vice versa, and take the larger of the two. The presence of the sups and infs indicates that the inclusion or otherwise of the boundaries of A,B does not affect this distance. In particular, this means that d_H(A,\bar A) = d_H(A,A^{\circ}), and so to allow us to call Hausdorff distance a metric, we restrict attention to the closed subsets of X, M(X).

We also observe immediately that if X does not have bounded diameter, then it is possible for the Hausdorff distance between two sets to be infinite. We won’t worry about that for now, because we will mainly be considering host spaces which are compact, for which the following results will be useful.

Lemma 4.4 – If (X,d) is complete, then (M(X),d_H) is also complete.

Proof: Assume we have a sequence of closed sets S_1,S_2,\ldots \subset X which have the Cauchy property under d_H. I’m going to do this slightly differently to Evans, because it’s not the case you can immediately choose x_n\in S_n for each n, such that d(x_n,x_m)\le d_H(S_n,S_m) for all m,n. For an explicit counterexample, see comment from Ed Crane below the article.

Note that if a subsequence of a Cauchy sequence converges to a limit, then the whole sequence converges to that same limit. So we can WLOG replace S_1,S_2,\ldots by some subsequence such that d_H(S_n,S_{n+1})\le 2^{-n}. Now it is clear that for any x_n\in S_n, there is a choice of x_{n+1}\in S_{n+1} such that d(x_n,x_{n+1})\le 2^{-n} (*). Starting from arbitrary x_1\in S_1, we can construct in this manner a sequence x_1,x_2,\ldots,\in X that is Cauchy, and thus has a limit point x\in X.

Let \mathcal{X} be the set of sequences (x_m,x_{m+1},\ldots) for some m, with x_n\in S_n\,\forall n\ge m, satisfying (*). Now let S be the closure of the set of limit points of this family of sequences, which we have shown is non-empty.

Then for any n, and any x_n\in S_n, we can construct such a sequence, and its limit point x, and the triangle inequality down the path x_n,x_{n+1},\ldots gives d(x_n,S)\le 2^{-(n-1)}. Furthermore, by construction S\subset U_{2^{-(n-1)}}(S_n), hence it follows that S_n \stackrel{d_H}\rightarrow S.

Lemma 4.5 – Given (X,d) compact, (M(X),d_H) is also compact.

Sketch proof: We are working with metric spaces, for which the existence of a finite \epsilon-net for every \epsilon>0 is equivalent to compactness. [An \epsilon-net is a set S of points in the space such that every point of the space lies within \epsilon of an element of S. Thinking another way, it is the set of centres of the balls in a finite covering of the space by \epsilon-balls.] It is not too hard to check that if S_\epsilon is an \epsilon-net for (X,d), then \mathcal{P}(S_\epsilon) is finite, and an \epsilon-net for (M(X),d_H).

Gromov-Hausdorff Distance

So far this is fine, but won’t be useful by itself for comparing how similar two trees are as metric spaces, because we can’t be sure a priori that we can embed them in a common host space. To resolve this, we consider instead the Gromov-Hausdorff distance, which will serve as a distance between metric spaces, even when they are not canonically defined as subsets of a common metric space.

Given X, Y metric spaces, we define

d_{GH}(X,Y)=\inf_Z \left\{ d_H(X',Y') \, : \, X',Y' \subset (Z,d)\text{ a metric space }, X'\simeq X, Y'\simeq Y\right\}.

In words, the Gromov-Hausdorff distance between two metric spaces examines the ways to embed them isometrically into a common larger metric space, and gives the minimal Hausdorff distance between them under the class of such embeddings. One issue is that the collection of all metric spaces is not a set. For example, given any set, we can define a metric via the discrete metric, so the collection of metric spaces is at least as large as the collection of all sets, which is not a set. Fortunately, all is not broken, as when we consider a general metric space Z in which we might embed copies of X and Y we are wasting lots of the perhaps very complicated space, because we only need to compare the subsets which are isometric copies of X and Y. So in fact, we lose nothing if we assume that Z is a disjoint union of copies of X and Y, with a metric chosen appropriately. So

d_{GH}=\inf\left\{ d_H(X,Y) : d\text{ a metric on }X\coprod Y\text{ restricting to }d_X \text{ on }X,\, d_Y\text{ on }Y \right\}.

In practice though, this is difficult to compute, since the set of things we have to minimize over is complicated. It turns out we can find an equivalent characterisation which will be easier to use in a number of examples, including the case of real trees which is the whole point of the course.

Correspondence and Distortion

We define a correspondence from X to Y to be

\mathcal{R}\subset X\times Y\text{ s.t. } \pi_X(\mathcal {R}) = X, \, \pi_Y(\mathcal {R}) = Y,

where \pi_X,\pi_Y are the canonical projection maps from X\times Y into X,Y respectively. So we can think of a correspondence as being something like a matching from X to Y. In a matching, we insist that the projection maps into each set are injections, ie that each element of X (resp Y) can appear in at most one pair, whereas for a correspondence, we demand that the projection maps are surjections, ie that each element of X appears in at least one pair.

Then the distortion of a correspondence

\mathrm{dis}(\mathcal{R}):= \sup\left\{ |d_X(x,x') - d_Y(y,y')| \,;\, (x,y),(x',y')\in \mathcal{R} \right\}.

In words, if two sets are non-isomorphic, then a correspondence can’t describe an isometry between the sets, and the distortion is a measure of how far from being an isometry the correspondence is. That is, given a pair of pairs in the correspondence, for an isometry, the distance between the X-elements would be equal to the distance between the Y-elements, and the distortion measures the largest discrepancy between such pairs of pairs over the correspondence.

Theorem 4.11 d_{GH}(X,Y) = \frac12 \inf_{\mathcal{R}} (\mathrm{dis}\mathcal R), where the infimum is taken over all correspondences \mathcal{R} X to Y.

Remark: The RHS of this equivalence can be thought of as the set coupling between X and Y such that the pairs have as equal distances as possible.

Proof: Given an embedding into X\coprod Y with d_H(X,Y)<r, we have \mathcal{R} with \mathrm{dis}\mathcal {R}<2r, by taking:

\mathcal{R}=\{(x,y): d(x,y)<r\}.

From the definition of Hausdorff distance, it follows that the for every x, there is a y with d(x,y)<r, and hence the appropriate projection maps are projections.

So it remains to prove that d_{GH}(X,Y)\le \frac12 \mathrm{dis}\mathcal{R}. We can define a metric on X\times Y by

d(x,y)=\int\left\{ d_X(x,x')+d_Y(y,y') + r \,:\, (x',y')\in \mathcal{R} \right\}.

Then for any x\in X, there is (x,y)\in\mathcal{R}, and thus d(x,y)\le r, and vice versa hence d_H(X,Y)\le r.

It only remains to check that this is actually a metric. Let’s take x,\bar x \in X, and so

d(\bar x,y)\le \inf\{ d_X(x,\bar x) + d_X(x,x')+d_Y(y,y')+r \,: \, (x',y')\in\mathcal{R}\},

so taking d_X(x,\bar x) outside the brackets gives one form of the triangle inequality. We have to check the ‘other combination’ of the triangle inequality. We assume that the infima for (x,y), (\bar x,y) are attained at (x',y'),(\bar x',\bar y') respectively.

d(x,y)+d(\bar x,y)= 2r+ d_X(x,x')+d_X(\bar x,\bar x') + d_Y(y,y')+d_Y(d,\bar y').

But we also have d_X(x',\bar x')-d_Y(y',\bar y')\ge -r from the definition of distortion, and so adding these gives the triangle inequality we want, and completes the proof of this theorem.corc


Noise Sensitivity and Influence 1

Last week, Simon gave a junior probability seminar about noise sensitivity, and its applications in particular to dynamic critical percolation, so I thought I would write something about this topic based on his talk, and some other sources.

The motivation is critical site percolation on the triangular lattice. This has critical probability p_c=\frac12, and, at this critical probability, almost surely there is no infinite component. But it turns out that there is ‘almost’ an infinite component, in the following technical sense. Dynamical site percolation is given by switching the state of each site according to independent Poisson processes with the same parameter. Then, by construction, if we start at time 0 with the critical probability, then at any fixed time T>0, the configuration is distributed as critical site percolation, and hence almost surely has no infinite component. If, however, we consider the interval of times [0,T], then almost surely there *is* a time when there is an infinite component, and indeed the set of such times has positive Hausdorff dimension.


Noise sensitivity addresses the question of how likely some function of a product measure is to change under small perturbations to the state. To give a precise definition, we take a function f_n:\{0,1\}^n \rightarrow \{0,1\} on a product space, with product measure \mu. Then given \omega\in\Omega_n=\{0,1\}^n, define w^\epsilon to be the outcome when each component of \omega is resampled with probability \epsilon. Note that it matters quantitatively whether we consider ‘resampled’ or ‘swapped’, but it doesn’t affect any of the results qualitatively.

Then the noise stability of f at level \epsilon is defined to be


where \omega \sim \mu. Note that \omega^\epsilon\sim \mu also, but obviously \omega,\omega^\epsilon are not independent. Noise stability measures the magnitude of this dependence. The sequence of functions (f_n) is said to be noise-sensitive if S_\epsilon(f_n)\rightarrow 0 as n\rightarrow\infty.

A related concept is influence. This should be viewed as a local version of noise sensitivity, where we consider the likelihood that the function will change as a result of swapping a single fixed component of the state. Perhaps abusing standard notation, define \omega^{(i)} to be the result of swapping the ith component of \omega. Then the influence of the ith coordinate is

I_i(f) := \mu(\{\omega\in\Omega: f(\omega)\ne f(\omega^{(i)})\}).

Examples and Results

As in so many models, we are interested in the relationship between the local setting and the global setting. Benjamini, Kalai and Schramm (99) showed that, in the case of the uniform product measure,

\sum_{i\in[n]} I_i(f_n)^2 \stackrel{n\rightarrow\infty}\rightarrow 0,

is a sufficient condition for noise-sensitivity of (f_n). Note the converse is trivially false, by considering the function f(w) that records the parity of the number of 1s in w. Here, the influence is 1, since changing one component by definition changes the value of the function. But the function is noise sensitive, as changing a proportion \epsilon of the components gives a roughly 1/2 probability for large n that the value of the function changes.

However, if we restrict to the class of monotone increasing functions, where changing a 0 to a 1 can never cause f to decrease, the converse is true. Outside the setting of uniform product measure, lots of versions of this theorem are open.

This perhaps seems counter-intuitive. After all, we would expect high global sensitivity to follow from high local sensitivity. I don’t have a good counter-intuition for why this should be true. The original authors themselves describe this as ‘paradoxical’ in [1].

It is natural to ask which component has the largest influence, and in particular whether we can guarantee the existence of a component with influence at least some particular value. Obviously, if the Boolean function is constant, all the influences are zero, and other examples, such as the indicator function of a specific state, will also have very small influences. But such functions also have small variance (under the uniform product measure), so we need to make comparison to this. We can show that

\text{Var}(f)\le \frac14 \sum_{i\in[n]}I_i(f),

which is a (discrete) example of a Poincare inequality, where deviations of a function from its mean on a particular space in L^p are bounded in terms of the derivative of the function. Here, influence is a discrete analogue of derivative.

Proof: We observe first that if \mathbb{P}(f(w)=1)=p, then \mathbb{E}[f(w)]=p, and

\text{Var}(f)=p(1-p)^2+(1-p)p^2=p(1-p)=\frac12 \mathbb{P}(f(w)\ne f(\bar w)),

where w,\bar w are independent realisations of the underlying measure. We move from w\mapsto \bar w one component at a time, examining the influence at each step. Let v^{(i)}\in \{0,1\}^n be given by the first i components of \bar w and the final (n-i) components of w. Then v^{(0)}=w, v^{(n)}=\bar w. Finally, define

J:=\{i:w_i\ne \bar w_i\}.

We also say that a state is i-pivotal, if changing the ith component changes the value of the function. Note that if f(w)\ne f(\bar w), then there exists some i<n such that f(v^{(i)})\ne f(v^{(i+1)}). In particular:

\{f(w)\ne f(\bar w)\}\subset \{1\in J, v^{(0)}\text{ 1-pivotal}\}\cup

\{2\in J, v^{(1)}\text{ 2-pivotal}\}\cup\ldots\cup\{n\in J, v^{(n-1)}\text{ n-pivotal}\}.

Note now that for any i<n, v^{(i-1)} has uniform product measure on \Omega, since it is an interpolation of sorts between independent realisations of the uniform product measure. Furthermore, since we haven’t interpolated at component i yet, it is independent of the event \{i \in J\}. Thus

\mathbb{P}(i\in J, v^{(i-1)}\text{ i-pivotal})=\frac12 I_i(f).

Putting everything together with a union bound gives

\text{Var}(f)=\frac12 \mathbb{P}(f(w)\ne f(\bar w))\le \frac14 \sum_{i\in [n]}I_i(f),

as desired.

We might ask when this result is tight. It is certainly not tight if we during the interpolation procedure, the value of f changes lots of times. Since a consequence of this Poincare inequality is that there is some component i such that I_i(f)\ge \frac{1}{4n}, we might conjecture that there is a better lower bound.

It turns out, as shown in [2], that we can add an extra logarithmic factor. Precisely, for some absolute constant c>0, there exists an i such that:

I_i(f)\ge c \text{Var}(f)\frac{\log n}{n}.

Indeed, there’s an example which shows that this is optimal. Consider partitioning the n components into groups of size roughly log n – log log n, all in base 2. Then the function is 1, if there exists at least one group, which we call tribes, where all the components are 1. So a particular component is pivotal precisely when all the other components in its tribe are already 1. Thus

I_i(f)\approx \left(\frac{1}{2}\right)^{\log n - \log\log n}= \frac{\log n}{n}.

One consequence of this is that by BKS, we learn that the function is noise-sensitive. Is this clear directly? Note that if w contains a tribe of 1s, then with high probability w^{(\epsilon)} for large n, roughly \epsilon(\log n-\log \log n) of these get changed. In particular, it is very unlikely that the same tribe will be a tribe of 1s in w^\epsilon. So heuristically, the presence of a tribe in w^\epsilon feels almost independent of the presence of a tribe in w. But because of the (approximate) symmetry of the setup, it is much easier to work with influences, and use BKS.

I will continue the discussion of the Fourier techniques used to derive some of these results in a new post soon.


[1] – Benjamini, Kalai, Schramm (1999) – Noise Sensitivity of Boolean Functions and Applications to Percolation. (arXiv)

[2] – Kahn, Kalai, Linial (1988) – The influence of variables on Boolean Functions. (link)

Garban, Steif – Noise sensitivity of Boolean functions and Percolation. (arXiv) I have borrowed heavily from these excellent notes. The discrete Poincare inequality appears in Section 1.2, with an outline of the solution in Exercise 1.5.

Coupling from the Past

In a long series of previous posts I have talked about mixing times for Markov chains. We consider how long it takes for the distribution of a particular Markov chain to approach equilibrium. We are particularly interested in the asymptotics when some parameter of the model grows, such as the size of the state space, grows to infinity.

But why are we interested in the underlying problem? The idea of Markov Chain Monte Carlo methods is to sample from an intractable distribution by instead sampling from a Markov chain which approximates the distribution well at large times. A distribution might be intractable because it is computationally demanding to work out the normalising constant, or it might be distributed uniformly on a complicated combinatorial set. If, however, the distribution is the equilibrium distribution of some Markov chain, then we know how to at least sample from a distribution which is close to the one we want. But we need to know how long to run the process. We will typically tolerate some small error in approximating the distribution (whether we measure this in terms of total variation distance or some other metric doesn’t really matter at this heuristic level), but we need to know how it scale. If we double the size of the system, do we need to double the number of iterations of the chain, or square it. This is really important if we are going to use this for large real-world models with finite computing power!

Sometimes though, an approximation is not enough. If we want an exact sample from the equilibrium distribution, Markov chains typically will not help us as it is only in very artificial examples that the distribution after some finite time is actually the equilibrium distribution. One thing that we might use is a stationary time, which is a stopping time T, for which X_T\stackrel{d}{=}\pi. Note that there is one trivial way to do this. We can sample Y from distribution \pi before starting the process, then stop X at the first time T for which X_T=Y. But this is no help really, as we need to have Y in the first place!

So we are really interested in less trivial stationary times. Perhaps the best example is the top-to-random shuffle. Here we are given a pack of labelled cards, WLOG initially in descending order at each step we move the top card in the pile to a randomly-chosen location in the pile (which includes back onto the top). Then it turns out that the first time we move the card originally at the bottom from the top to somewhere is a strong stationary time. This is fairly natural, as by this time, every card has been involved in at least one randomising event.

Anyway, so this gives a somewhat artificial way to sample from the uniform distribution on a pack of cards. This strong stationary time is almost surely finite, with distribution given by the coupon collector problem, for which the expectation grows as n\log n, where n is the number of cards.

The problem with this method is that it is not easy in general to come up with a non-contrived stationary time such as this one. The idea of coupling from the past, discussed by some previous authors but introduced in this context by Propp and Wilson in the mid ’90s, is another method to achieve perfect sampling from the equilibrium distribution of a Markov chain. The idea here is to work backwards rather than forwards. The rest of this post, which discusses this idea, is based on the talk given at the Junior Probability Seminar by Irene, and on the chapter in the Levin, Peres, Wilmer book.

The key to the construction is a coupling of the transitions of a Markov chain. In the setting of a simple random walk, we have by construction a coupling of the transitions. It doesn’t matter which state we are at: we toss a coin to decide whether to move up or down, and we can do this without reference to our current position. Levin, Peres and WIlmer call this a random mapping representation in general, and it is yet another concept that is less scary than its definition might suggest.

Given a transition matrix P on state space S, such a representation is a function

\phi: S\times[0,1]\rightarrow S,\text{ s.t. }\mathbb{P}(\phi(i,U)=j)=p_{ij},

where U is a U(0,1) random variable independent of choice of i. In particular, once we have the random value of u, we can consider \phi(i,u) as i varies, to obtain a random map S\rightarrow S. Crucially, this map is not necessarily a bijection.

Note first that there are many possibilities for constructing the representation \phi. For some chains, and some representations, in particular random walks on vertex-transitive graphs (such as SRW – only for now we are restricting attention to finite state spaces) it is possible to choose \phi so that it always gives a bijection, but it is also always possible to choose it so that there is some probability it doesn’t give a bijection.

Let U_1,U_2,\ldots be an IID sequence of U[0,1] random variables, and write \phi_i for the random map induced by U_i. Then consider the sequence of iterated maps:

\phi_1, \phi_1\circ \phi_2, \ldots, \phi_1\circ\ldots\circ\phi_n,

and let T be the (random) smallest time such that the image of \phi_1\circ\ldots\circ \phi_T is a single state. Ie, as we go backwards in time through the maps \phi_i, we are gradually losing various states, corresponding to the maps not being bijections. Since the state space is finite, and the probability of not being a bijection is positive, it can be shown that T is almost surely finite. The claim then is that

Y=\text{Im}(\phi_1\circ\ldots\circ \phi_T)

is distributed as the equilibrium distribution of the chain. We finish by proving this.

Proof: Since the algorithm terminates after finite time almost surely, given any \epsilon>0, we can choose N such that the probability the algorithm stops in at most N steps is greater than 1-\epsilon.

Now run the Markov chain from time -N, started in the equilibrium distribution, with the transition from time -t to -(t-1) given by the random mapping driven by U_t. Thus at time 0, the distribution of the chain is still the equilibrium distribution. But if we condition on the event that T\le N, then X_0=\phi_1\circ \ldots \circ\phi_n(X_{-N})=Y regardless of the initial value. So \mathbb{P}(X_0\ne Y)<\epsilon, and hence the result follows, since \epsilon>0 was arbitrary.

What makes this easier than strong stationary times is that we don’t have to be clever to come up with the stopping time. It is however still important to know how long on average it takes to run the algorithm. At the end of her talk, Irene showed how to adapt this algorithm to deal with Probabilistic Cellular Automata. Roughly speaking, these are a sequence of infinite strings of 0s and 1s. The value of some element is determined randomly as a function of the values in the row underneath, say the element directly underneath and the two either side. In that setting, if you start with a finite subsequence and couple from the past by looking down to lower rows, each time you drop down a row you consider one further element, so in fact the coupling from the past algorithm has to eliminate possibilities fast enough to make up for this, if we want to terminate almost surely in finite time.

Here’s a link to the paper which discusses this in fuller detail.

Enhanced by Zemanta

Duality for Stochastic Processes

In the past couple of weeks, we’ve launched a new junior probability seminar in Oxford. (If you’re reading this and would like to come, please drop me an email!) The first session featured Daniel Straulino talking about his work on the Spatial Lambda Fleming-Viot process, a model for the evolution of populations allowing for selection and geometry. A lively discussion of duality broke out halfway through, following which it seemed reasonable to devote another session to this topic. Luke Miller presented some classical and less classical examples of the theory this afternoon. I found all of this interesting and relevant, and thought it would be worth writing some things down, and tying it in with one of the courses on this subject that we attended at ALEA in Luminy last October.

The majority of this article is based on Luke’s talk. Errors, omissions and over-simplifications are of course my own.

The setup is that we have two stochastic processes X_t\subset R, Y_t\subset S. For now we make no assertion about whether the two state spaces R and S are the same or related, and we make no comment on the dependence relationship between X and Y. Let P_x,Q_y be the respective probability measures, representing starting from x and y respectively. Then given a bivariate, measurable function h(.,.) on R x S, such that:

E^P_x h(X_t,y)=E^Q_y h(x,Y_t),\quad \forall x,y\quad\forall t,

then we say X and Y are dual with respect to h.

The interpretation should be that X is a process forwards in time, and Y is a process backwards in time. So X_t, Y_0 represent the present, while X_0, Y_t represent the past, which is the initial time for original process X. The fact that the result holds for all times t allows us to carry the equality through a derivative, to obtain an equality of generators:

\mathcal{G}^X h(x,y)=\mathcal{G}^Y h(x,y),\quad \forall x,y.

On the LHS, the generator acts on x, while on the RHS it acts on y. Although it still isn’t obvious (at least to me) when a pair of processes might have this property, especially for an arbitrary function, this seems the more natural definition to think about.

Note that this does indeed require a specific function h. There were murmurings in our meeting about the possibility of a two processes having a strong duality property, where this held for all h in some broad class of test functions. On more reflection, which may nonetheless be completely wrong, this seems unlikely to happen very often, except in some obviously degenerate cases, such as h constant. If this holds, then as the set of expectations of a class of functions of a random variable determines the distribution, we find that the instantaneous behaviour of Y is equal in distribution to the instantaneous behaviour of X when started from fixed (x,y). It seems unlikely that you might get many examples of this that are not deterministic or independent (eg two Brownian motions, or other space-, time-homogeneous Markov process).

Anyway, a canonical example of this is the Wright-Fisher diffusion, which provides a simple model for a population which evolves in discrete-time generations. We assume that there are two types in the population: {A,a} seems to be the standard notation. Children choose their type randomly from the distribution of types in the previous generation. In other words, if there are N individuals at all times, and X_k is the number of type A individuals, then:

X_{k+1} | X_k \stackrel{d}{=} \mathrm{Bin}(N, \frac{X_k}{N}).

It is not hard to see that in a diffusion limit as the number of individuals tends to infinity, the proportion of type A individuals is a martingale, and so the generator for this process will not depend on f’. In fact by checking a Taylor series, we can show that:

\mathcal{G}_{WF}f(x)=\frac{1}{2} x(1-x)f''(x),

for all f subject to mild regularity conditions. In particular, we can show that for f_n(x)=x^n, we have:

\mathcal{G}_{WF} f_n(x)=\binom{n}{2}(f_{n-1}(x)-f_n(x))

after some rearranging. This looks like the generator of a jump process, indeed a jump process where all the increments are -1. This suggests there might be a coalescent as the dual process, and indeed it turns out that Kingman’s coalescent, where any pair of blocks coalesce at uniform rate, is the dual. We have the relation in expectation:

\mathbb{E}_x[X_t^n]= \mathbb{E}_n[x^{N_t}],

where the latter term is the moment generating function of the number of blocks at time t of Kingman’s coalescent started from n blocks.

In particular, we can control the moments of the Wright-Fisher diffusion using the mgf of the Kingman’s coalescent, which might well be easier to work with.

That’s all very elegant, but we were talking about why this might be useful in a broader setting. In the context of this question, there seems to be an obstacle towards applying this idea above more generally. This is an underlying idea in population genetics models that as well as the forward process, there is also a so-called ancestral process looking backwards in time, detailing how time t individuals are related historically. It would be convenient if this process, which we might expect to be some nice coalescent, was the dual of the forward process.

But this seems to be a problem, as duals are a function of the duality function, so do we have uniqueness? It would not be very satisfying if there were several coalescents processes that could all be the dual of the forward process. Though some worked examples suggest this might not happen, because a dual and its duality function has to satisfy too many constaints, there seems no a priori reason why not. It seems that the strength of the results you can derive from the duality relation is only as strong as the duality relation itself. This is not necessarily a problem from the point of view of applications, so long as the duality function is something it might actually be useful to work with.

It’s getting late and this text is getting long, so I shall postpone a discussion of duality for interacting particle systems until tomorrow. In summary, by restricting to a finite state space, we can allow ourselves the option of a more algebraic approach, from which some direct uses of duality can be read off. I will also mention a non-technical but I feel helpful way to view many examples of duality in interacting particle systems as deterministic forward and backwards walks through a random environment, in what might be considering an extreme example of coupling.

Enhanced by Zemanta

The Top-to-Random Shuffle III

This post concludes my non-exhaustive list of things I think are interesting about the top-to-random shuffle. In previous posts I have talked about the construction and correct sense of convergence to randomness, and that this algorithm does genuinely achieve uniform randomness at some hitting time which is easy to specify. Promising that posts will be short hasn’t worked in the past so I won’t do that again now, but the idea of this post is brief:

When we specified the dynamics of the top-to-random shuffle, we insisted that the top card card could be placed anywhere in the deck with equal probability including back on top. This appears to be doing nothing except slowing down the shuffling process. Why is this important for convergence to randomness?

Fortunately the answer is short: if we do not let the top card be inserted back onto the top, allowing the configuration to stay the same, then we can divide up the set of orderings into two classes, and the pack will alternate between them.

Why is this a problem? Suppose the classes are called X and Y, and X is the class that contains the original ordering 1,2,…,n. Then after k shuffles, the ordering of the deck will be in X if k is even and in Y if k is odd. Remember our definition of ‘close to randomness’ will be the greatest difference in probability of an event between the actual distribution and the uniform distribution. As before, you can think of this by a betting analogy – what proportion profit can you make again someone who thinks it’s uniform by knowing the true distribution?

Well, it will turn out that the sets X and Y have the same size, so in the uniform distribution, the probability that an ordering is in X is 1/2. Whereas if the pack alternates, then so long as we know how many shuffles have occurred, this probability is either 0 or 1. In particular this is far from 1/2. We should remark that if we introduce the notion of sampling at a random time, or taking an average over all large times in some sense, such problems may disappear, but the result obtained may be less useful. See this post on Cesaro Mixing for details presented in a more rigorous style.

So it remains to see why this is true. First a definition. A transposition is when two elements in a permutation are exchanged. Eg 31452 -> 35412 by transposing 1 and 5. It makes sense intuitively that we can get from any permutation to any other permutation by making successive transpositions. Indeed, this is precisely what is happening in the top-to-random shuffle. To avoid continually having to write it out, we call the original permutation 1,2,…,n the identity permutation.

Then the idea is that X is the set of permutations we can obtain by starting with the identity and applying an even number of transpositions, while Y is the set obtained by applying an odd number of transpositions. For this to work, we will need to show that these sets are disjoint. That is, no permutation can be generated by both an odd number and an even number of transpositions. This is important, as a permutation can certainly be generated from transpositions in multiple ways. For example, if the elements are 1,2,3, we can obtain the permutation 2,1,3 by transposing 1 and 2, obviously. However, we could alternatively start by transposing 2 and 3 to get 1,3,2, then 1 and 3 to get 3,1,2, then 2 and 3 again to get 2,1,3. Note that both of these require an odd number of transpositions.

We will call a permutation even if it is generated by an even number of transpositions, and odd otherwise. We also say that its sign (alternatively signature, parity) is +1 or -1 respectively. To prove this is well-defined, we really want to find a different property that is easier to track.

A useful trick is to count how many pairs of elements are not in the correct order. Let’s do this for our previous example: 31452. There are 5 elements so 5 x 4 / 2 = 10 pairs of elements. We list them:

  • 1 and 2 are in the correct order.
  • 1 and 3 are not, as 3 comes before 1 in this permutation.
  • 1 and 4 are correct.
  • 1 and 5 are correct.
  • 2 and 3 are not.
  • 2 and 4 are not.
  • 2 and 5 are not.
  • 3 and 4 are correct.
  • 3 and 5 are correct.
  • 4 and 5 are not.

So 5 pairs are not in the correct order. Since 5 is odd, the claim is that this means 31452 is an odd permutation. To check this, and to confirm that the sign is well-defined, it suffices to check that the number of so-called inversions, or pairs in the wrong order, changes parity every time we apply a transposition.

This is clearly true if we transpose adjacent elements. Then the orderings of all pairs remain the same, apart from the pair we transposed, which changes. Then, if the elements are not adjacent, instead of transposing them directly, we can perform a succession of transpositions of adjacent elements. The easiest way to describe this is again by example. Suppose we want to transpose 3 and 5 in 31452.

31452 -> 13452 -> 14352 -> 14532 -> 15432 -> 51432.

Note that the middle transposition is actually transposing 3 and 5, and the others are symmetric about this middle operation. In particular, there is an odd number of transpositions in total. So we have proved the result for general transpositions, and thus we now know that the sign of a permutation is well-defined. Note also that there are an equal number of odd and even permutations of every n=>2. For every odd permutation, transposing 1 and 2 gives an even permutation, and vice versa, uniquely, giving a bijection.

What’s really going on is that we are able to multiply permutations, by doing one after the other. Unlike multiplying real numbers, the order in which we do this now matters. In this context, the set of permutations is an example of a general structure called a group. The idea of partitioning a group into subsets which are in some sense symmetric and where some other operation jumps between the subsets is a useful motivation point for a whole avenue of interesting theory. Not to be explored now unfortunately…

The Top-to-Random Shuffle II

In the last post, I introduced the top-to-random shuffle. In particular, we considered why this sort of procedure was important as an alternative to choosing an ordering afresh, and how to we would go about measuring how close we had got to randomness.

In this post, I want to develop the second of these points. The intuition might be that we can get very close to uniform randomness if we repeat the shuffle often enough. Recall this means that even if we choose our bet in a really complicated and careful way, we still couldn’t make much profit by knowing the actual distribution of the ordering. But we might also suspect that the pack will never be exactly random, in the same way that the distribution of the proportion of heads seen on a repeatedly-flipped coin will eventually get very close to 1/2, but will not be exactly 1/2.

This intuition is extremely sensible, and in general is true. It is a nice fact, however, that it fails for the top-to-random shuffle, where we do in fact get to a uniformly random deck. Recall that we approximated how long it would take to get to a state that was roughly random by calculating the time taken for the original bottom card to rise to the top of the deck. This time was:


where n is the total number of cards. A further shuffle is required to send the original card back into the pack somewhere. The claim is that the pack is now uniformly random. Note that if we ever actually use this, we have to be careful because although we can calculate the average time at which this event happens, the time itself is random. Rather than worry about that, let’s see why it is true.

As a motivating example, let’s assume that the pack was originally in the order 1,2,3,…,n, and consider the final relative order of the cards numbered 1 and 2. There is some small probability that on the first go (and then again on the second go possibly also) that the card number 1 stays put. Let’s ignore that possibility and progress to the first time that the 1 has been moved into the interior of the pack, so the 2 is now on top. When we do this, we are choosing a number between 2 and n uniformly at random, and moving the card numbered 1 to this position. Let’s call this number X.

Now, when we try to shuffle the 2, we are choosing a number between 1 and n uniformly at random, and moving the card numbered 2 to this position. Let’s call this number Y. Under exactly what circumstances does 2 end up above 1? The clearest example is if Y=X. Then card 2 has been moved to the position previously occupied by card 1. So card 1 moves up a position (since the card 2 is no longer on the top). The final configuration therefore includes card 1 directly above card 2. So we can say

\mathbb{P}(\text{2 above 1})=\mathbb{P}(Y<X),

where the fact that the inequality in the second probability is strict is important. To calculate this second probability, we want to exploit the symmetry of the situation. The only problems are that the case X=Y is not symmetric as then 1 ends up above 2 as described above, and also that X cannot be 1. So we account for these separately. Note

\mathbb{P}(Y=1)=\frac{1}{n},\quad \mathbb{P}(Y=X)=\frac{1}{n}.

The second result holds overall since it holds whenever we condition on a particular value of X. These events are also disjoint. Then

\mathbb{P}(Y<X)=\mathbb{P}(Y=1)+\mathbb{P}(Y<X | Y>1, X\neq Y)\mathbb{P}(Y>1,X\neq Y)

= \frac{1}{n}+\frac{1}{2}(1-\mathbb{P}(Y>1,Y\neq X))

=\frac{1}{n}+\frac{1}{2}(1-\frac{1}{n}-\frac{1}{n}) = \frac{1}{2}.

In summary, the cards 1 and 2 will be in uniformly random order. We might like to extend this idea, but it will get complicated when we add card 3 to the mix, as it is possible (if unlikely) that 1 and 2 will be further mixed before this. This shouldn’t affect the result much, but it will get complicated to define the notation required to carry this sort of argument all the way up to the nth card.

Even using induction is not going to make life substantially easier. Knowing that once we have inserted cards 1,2,…,k into the pack they are in uniformly random order is not enough to make inference about what happens once we put k+1 into the pack. We have to know something about the current positions of 1,2,…,k. For example, if one of these cards is definitely on the bottom of the pack, then the probability that k+1 ends up last among 1,2,…,k+1 is 1/n rather than 1/k+1 as it should be. So in fact we would have to control an annoying amount of information jointly.

In the argument we attempted above, we were looking at the first times some card k got folded back into the pack. Note that this division of time is different to the one we were using for the coupon collector approach to the mixing time in the previous post. Let’s try to use that instead here.

Now we consider the times at which a card is moved below card n. We deliberately decline to say what these cards are. But rather, we want to prove that, conditional on the cards below n being A_k=\{a_1,\ldots,a_k\}, the ordering of these is uniform on S_{A_k}, that is, every possibility is equally likely. Now this is easy to prove by induction. For, by conditioning on A_k and a_{k+1} being the new card to be moved below n, we are conditioning on the set of cards below n now being A_{k+1}=A_k\cup\{a_{k+1}\}. The position of the new card is uniformly random within this, by construction of the top-to-random shuffle, and so the new arrangement is uniformly random on the (k+1)! possibilities.

To see why we have proved the original result we wanted, note that this argument works at the time when the original bottom card is now at the top. So the remaining cards are uniformly randomly ordered. Inserting card n at random gives an arrangement that is uniformly random overall. So as we suggested before, working out how long it takes to get close to randomness in this case reduces to working out how long it is before the original bottom card hits the top and is re-inserted, as at that point, the pack genuinely is uniformly random.

The Top-to-Random Shuffle

This article is based on a talk I gave to the Maths Society at St Paul’s School on Monday. It may turn into a short series if I have time before I go to ALEA in Luminy near Marseille on Saturday.

My original plan had been to talk about riffle-shuffling, and some of the interesting mixing time themed results one can obtain. As a motivating example, I began by discussing the simpler top-to-random shuffle, and this proved sufficiently interesting to occupy the time I had been allowed (and mea culpa a bit more). It therefore seems worth writing a hopefully moderately accessible blog post on the subject. The aim of this post at least is to discuss the idea that repeatedly shuffling brings a pack of cards close to randomness. We have to settle on a definition of ‘close to randomness’, and find some ways to calculate this.

Suppose we are playing some bizarre card game where it is necessary that three cards labelled, uncontroversially, 1, 2 and 3 need to be placed in a random order. If we are organised, we can write down all the ways to do this in a list:

123, 132, 213, 231, 312, 321.

We want to select each of these with equal probability. We could for example use a dice. Most relevantly, even a computer as ancient as my laptop is very happy simulating a random choice from this set. (Now is not the time to talk about exactly how pseudo-random or otherwise this choice would be.)

Of course, when we play a sensible card game we have not three cards, but fifty-two. So the approach described above still works in theory, but no longer in practice, as the list of possible arrangements now has size 52!. Recall this is defined to be

52!=1\times 2 \times\ldots \times 52.

The reason we get this particular expression is that when we are choosing the first card, we have 52 possible choices. Then, regardless of what this first card actually is, there are precisely 51 cards left from which to choose the second card. So there are 52×51 ways to pick the first two cards in the arrangement, and so on, giving the answer. We can approximate how large 52! is by counting powers of ten rather crudely. It seems reasonable that it should be about 10^{65}. Note that the number of atoms in the universe is *only* about 10^{80}, so if we are going to write down this list, we better have very compact handwriting! But being serious, this number is way too large to realistically compute with, so we have to come up with some cleverer methods.

One way is to spread the cards out on a table then pick them up one at a time, ensuring at all times that the choice of card is uniform among those currently present, and not related to any of the past choices. This is relatively easy for a computer, but hard for a human, and certainly deeply tedious for anyone waiting to receive their hand!

So we seek a different approach, namely an algorithm for shuffling. Our aim is to introduce overall randomness by repeatedly applying some simple but random process. Note we have to be careful about our definition of ‘random’ here. The permutation 123456 is just as ‘random’ as the permutation 361524. That is, if they are fixed, then they are not random at all. Just because it is easier to decribe one of them verbally does not mean it is less random. For example, if I am trying to cheat at poker, then I might be able to if I knew the exact order of the cards in the pack before the dealer dealt. It wouldn’t matter what that order was. I would have to adjust my strategy based on the order, but it wouldn’t affect the fact that I had a massive advantage!

The shuffling algorithm to be discussed here is the top-to-random shuffle. Like all the best things in life, this does exactly what it says on the tin. At a given time, we remove the top card from the deck at present, and insert it at a randomly chosen point in the deck. This could be on the bottom, and it could also be back on the top. It feels like this possibility to remain constant can’t possibly help us, but later we will discuss why we need this.

In any case, it feels natural that if we keep applying this procedure, the arrangement of the deck should start to get more and more random, in the sense that knowing the original arrangement will tell us successively little about the current arrangement as time progresses. But we need to find a way to quantify this if we are to do any mathematics.

When we are talking about real numbers, it is fairly clear what it means if I say that the numbers 2, 1.1, 1.01, 1.001 and so on are getting closer and closer to 1. Indeed we can measure the distance along the number line between each term and 1, using the absolute difference. It is not so clear how to compute the distance between two probability distributions. Bearing in mind the fact that a distribution on the set of permutations of cards is defined to be a set of 52! probabilities that sum to 1, there will be a 52!-1 dimensional space (eg the plane is two-dimensional, the world is three-dimensional, *and so on* – whatever that means) where we have a nice distance formula already.

But this is not what we will choose to use. Rather we return to the cheating-at-poker analogy. Suppose I am playing some sort of game involving the pack of cards with my enemy. He or she thinks the deck is perfectly random, but I know the actual distribution. How big a profit can I make by exploiting this knowledge? This will be our measure of how far a distribution is from uniform. It turns out that this will coincide precisely with the formal definition of total variation distance, but that language belongs to a different level of rigour and is not relevant here.

What is relevant is an explanatory example. Suppose we start with the arrangement 12345678. We are now going to perform one iteration of the top-to-random shuffle. The outcome might, for example, be 23456178, if we insert the 1 between the 6 and the 7. Note there were 8 places for the card to go, so the probability of this particular outcome is 1/8. Now let’s see how I might use my knowledge of the distribution to my advantage. Suppose I suggest the bet that the bottom card is an 8. My enemy thinks the stack is uniformly randomly arranged, so the probability of this is 1/8. On the other hand, I know that the only way the 8 might disappear from the bottom is if I place the 1 under it, which happens with probability 1/8. So in fact, I know the probability of this event is 7/8, which gives me an advantage of 3/4. In fact, I could come up with bets that do even better than this, but they are less simple to describe verbally.

At what point do I lose this advantage? Well, we said that the probability that the 8 leaves the bottom of the stack is 1/8. And it will continue to be 1/8 on every turn where it is at the bottom. Recalling that the outcomes of successive shuffles are independent, note this is reminiscent of rolling a dice until a six comes up. The number of rolls required to get the six is an example of a geometric random variable. I don’t want to spoil S1 (or whichever module) by going into too much detail, but it turns out that if the probability of an event happening on a single go is p, then the average time we have to wait is 1/p. So 1/(1/8)=8 of course, and this is how long we typically have to wait before the bet I placed before becomes much less effective.

Now seems like a good time to stop talking about 8 cards and start talking about n cards. Obviously, in practice, we will want n to be 52. Anyway, by the same argument as before, it takes on average n iterations before the bottom card leaves the bottom. This is important, because after then, my bet that the bottom card is n is no longer so effective. However, I could equally place a bet that one of the bottom *two* cards is n.

So we consider how long it takes before n is no longer one of the bottom two cards. Well certainly we need to wait until it is no long *the* bottom card, which takes time n on average. Then, once it is second bottom, there is now a 2/n chance that we move the previously top card below it, so by the same argument as before, the time for this to happen is n/2 on average. If we want this effect to disappear, we have to wait until the original bottom card is in fact at the top of the pile for the first time, and by extending our previous argument, the average time for this is


Fortunately, we have tools for approximating this sort of sum, in particular integration, which is the practice of finding the area under certain curves. It turns out that the answer is roughly n log n. You can think of as log n a measure of the number of digits required to write out n. (This is not the exact definition but it will do for now. In any case, log n gets larger as n gets larger, but not very fast.) There’s a lot more about this in my previous post on the coupon collector problem, from a more technical point of view.

The next question will be to prove that it is actually quite well shuffled by this time, but that’s for another post. The other question to ask is whether this is satisfactory overall? For n=52, the number of operations we have to perform is about 230, which is fine for a computer, but deeply tedious for anyone sitting at a casino table waiting for the next hand. So next time we’ll talk about the riffle shuffle, which seems to introduce a lot of randomness in each go, but we’ll also see that we have to be careful, because the randomness may not be as great as our intuition suggests.

Delayed Connectivity in Random Graphs


I presented a poster at the Oxford SIAM Student Chapter Conference on Friday. It was nice to win the prize for best poster, but mainly I enjoyed putting it together. For once it meant ignoring the technical details and anything requiring displayed formulae, and focusing only on aspects that could be conveyed with bullet points and images. Anyway, this is what I came up with. The real thing is sitting safely in a tube in my office, ready for the next time it is needed in a hurry!

Delayed Connectivity Poster

The Poisson Process – Distributions of Holding Times

So, I was planning to conclude my lecture course on Markov Chains at the Cambridge-Linyi summer school with an hour devoted to the Poisson Process. Unfortunately, working through the discrete time material and examples took longer than I’d expected, so we never got round to it. As I’d put a fair bit of work into the preparation, I figured I might as well write it up here instead.

We need a plausible mathematical model for a situation where people or events arrive randomly in continuous time, in a manner that is as close as possible to the notion of iid random variables. In discrete time, a natural candidate would be a set of iid Bernoulli random variables. For example, with probability p a single bus will arrive in an time interval of a minute. With probability 1-p, no bus will arrive. We might have some logistical motivation for why it is not possible that two or more arrive in a given interval, or we could instead choose a more complicated distribution.

One way to proceed would be to specify the distribution of the times between arrivals. These should be independent and identically distributed, at least intuitively. However, although we might be able to give a sensible guess right now, it is not immediately clear what this distribution should be. For now, we merely remark that the arrival times are called X_1,X_2,\ldots, and the holding times between arrivals are defined by
S_1=X_1, S_n=X_n-X_{n-1},n\geq 2.

In fact the motivating discrete example gives us much of the machinery we will actually need. Recall that when we define probability distributions for continuously-valued random variables we need a different plan of attack than for discrete RVs. Whereas for the discrete case, it is enough to specify the probability of each outcome, for a continuous random variable, we have to specify the probabilities of intervals, and take care that they have the obvious additive and nesting properties that we want. Taking the integral (whether Riemannian or Lebesgue) of a so-called density function is a natural way to do this.

Similarly here, we build up from small time intervals. The first remark is this: it is natural that the expected number of arrivals in the first minute is equal to the expected number of arrivals in the second minute. After all, we are considering the most general process possible. If there are an infinite number of potential arriving agents, then behaviour in the first minute should be independent and equal (in distribution) to behaviour in the second minute. We can naturally extend this idea to a linear relation. If N_s is the number of arrivals in the time [0,s], then we should have \mathbb{E}N_s=\lambda s, where \lambda is some constant of proportionality, equal to \mathbb{E}N_1.

The key remark is that as s\rightarrow 0, \mathbb{P}(N_s=1) becomes small, and \mathbb{P}(N_s\geq 2) becomes very small. In fact it suffices that \mathbb{P}(N_s \geq 2)=o(s), as this implies:

\mathbb{P}(N_s=0)=1-\lambda s+o(s),\quad \mathbb{P}(N_s=1)=\lambda s+o(s).

Note that we are not currently attempting a formal construction of the process. As always, finding a probability space with enough freedom to equip a continuous process is a fiddly task. We are for now just trying to work out what the distribution of the holding times between arrivals should be. There are obvious advantages to defining the process as a collection of iid random variables, for example that we can construct it on a product space.

To do this, we split the time interval [0,1] into blocks


So the probability that someone arrives in the time [\frac{k}{n},\frac{k+1}{n}] is \frac{\lambda}{n}. So

\mathbb{P}(\text{no-one arrives in time }[0,1])=(1-\frac{\lambda}{n})^n\approx e^{-\lambda}.

As we are working in an n\rightarrow\infty regime, we can replace 1 by general time t, to obtain:

\mathbb{P}(\text{no-one arrives in time }[0,t])=(1-\frac{\lambda t}{n})^n\approx e^{-\lambda t}.

So the distribution function of the first arrival time is F(t)=1-e^{-\lambda t} in the conventional notation.
Thus X_1\sim \text{Exp}(\lambda).

However, to emphasis how useful the infinitissimal definition is for actual problems, consider these examples.

1) If we have two arrivals processes at the same object, for example arriving from the left and the right at the same shop, say with rates \lambda,\mu, then we want to show that the first arrival time is still exponential. Because of the linearity of expectation property, it is clearly from the definition that the total arrivals process is Poisson with rate \lambda+\mu, and so the result follows. Showing this by examining the joint distribution of two exponential random variables is also possible, but much less elegant.

2) Similarly, if we have two shops, A and B, and each arriving person chooses one at random, then the first arrival time at A is: with probability 1/2 distributed as Exp(\lambda), with probability 1/4 as \Gamma(2,\lambda), and so on. A fairly non-trivial calculation is required to show that this is the same as Exp(\frac{1}{2}\lambda), whereas this follows almost instantly using the the infinitissimal definition.

Moral: with the infinitissimal case, the difficulties are all in the probability space. However, once we have settled those problems, everything else is nice as the key property is linear. Whereas for a construction by iid jump times, the existence and well-definedness is clear, but even for a distribution as tractable as the exponential random variable, manipulation can be tricky.

NMSS 2012 – Strong Law of Large Numbers for a Coin Flip

The 2012 National Mathematics Summer School, held at Queens’ College affiliated to the University of Birmingham, and run by the United Kingdom Mathematics Trust, is drawing to a close today. I gave a problem-based talk on Probability to two groups of 20 junior students (15/16 year olds selected based on strong performance in national competitions for their agegroups), and a lecture to the six senior students (some of 2011’s strongest and most enthusiastic junior students) on the SLLN for the simplest non-trivial random variable imaginable: a coin flip.

In case any of the students, or indeed anyone else, is interested, a text of the problems, and the worked solutions that took up the majority of the lecture will be available here for a short while. Do email me if there are any questions!

Senior Probability Solutions. [Link removed. Email me if interested]