# The reflection principle and conditioned RWs

I haven’t published a post about probability for far too long. Several are queued, so perhaps this will be the start of a deluge.

Anyway, with my advisor at Technion, I’m still working on some problems concerning Gaussian random walk subject to some conditioning which is complicated, but in practice (we hope) only mildly different to conditioning the walk to stay positive. Our conditioning at step n depends on some external randomness, but also on the future trajectory of the walk (related to the embedding of the walk in a 2D DGFF), thus ruining the possibility of applying the Markov property in any proof without significant preliminary work.

It seemed worth writing briefly about some of these results in a slightly simpler setting. The goal is to assemble many of the ingredients required to prove a local limit for Gaussian random walk conditioned to stay positive, in a sense which will be clarified towards the end. This is not the best way to go about getting scaling limits (as discussed briefly here, and for which see references [Ig74] and [Bo76]), and it’s probably not the best way to get local limits in the simplest setting, but it’s the method we are currently working to generalise, and follows the outline of [B-JD05], but in much less technical detail.

Probabilities via the reflection principle

We start with Brownian motion. The reflection principle, as described briefly in this post from the depths of history, is a classical technique for studying the maximum of Brownian motion. Roughly speaking, we exploit the fact that $(W_t,t\ge 0)\stackrel{d}=(-W_t,t\ge 0)$, but we then apply this at the hitting time of a particular positive value, using the Strong Markov Property.

Let $S_t=\max_{0\le s\le t}W_s$ be the running maximum of the Brownian motion $W_t$, and $\tau_b$ the hitting time of b. Then

$\mathbb{P}(S_t\ge b, B_t\le a)=\mathbb{P}(\tau_b

which, by SMP at $\tau_b$ and the reflection invariance of a standard BM, is equal to

$\mathbb{P}(\tau_b

This obviously assumed $b\ge a$, but if we set $b=a$, we find

$\mathbb{P}(S_t\ge b)=\mathbb{P}(B_t>b)+\mathbb{P}(S_t\ge b,B_t\le b)=2\mathbb{P}(B_t\ge b).$

Or, in other words, $S_t\stackrel{d}=|B_t|$.

While we can’t derive such nice equalities in distribution, the reflection principle is robust enough to treat more complicated settings, such as Brownian bridge.

We might want to ask about the maximum of a standard Brownian bridge, but we can be more general, and ask about the maximum of a Brownian bridge with drift (let’s say general bridge here). It’s important to remember that a general Brownian bridge has the same distribution as a linear transformation of a standard Brownian bridge. Everything is Gaussian, after all. So asking whether the maximum of a general Brownian bridge is less than a particular value is equivalent to asking whether a standard Brownian bridge lies below a fixed line. Wherever possible, we make such a transformation at the start and perform the simplest version of the required calculation.

So, suppose we have a bridge B from (0,0) to (t,a), and we want to study $\max_{s\in[0,t]} B_s$. Fix some $b>a$, and work with a standard Brownian motion $W_s$. By a similar argument to before,

$\mathbb{P}(\tau_b\le t, W_t\in[a,a+\mathrm{d}x]) = \mathbb{P}(W_t\in [2b-a-\mathrm{d}x,2b-a]) = \frac{\mathrm{d}x}{\sqrt{2\pi t}}e^{-(2b-a)^2/2t},$

and

$\mathbb{P}(W_t\in[a,a+\mathrm{d}x])=\frac{\mathrm{d}x}{\sqrt{2\pi t}}e^{-a^2/2t}.$

So

$\mathbb{P}(\max_{s\in[0,t]}B_t\ge b) = \exp\left(\frac{-(2b-a)^2 + a^2}{2t}\right).$

Random walk conditioned to stay positive

Our main concern is conditioning to stay above zero. Let $\mathbb{P}_{0,x}^{t,y}$ be some complete if cumbersome notation for a Brownian bridge B from (0,x) to (t,y). Then another simple transformation of the previous result gives

$\mathbb{P}_{0,x}^{t,y}(B_s\ge 0,\,s\in[0,t])=1-\exp\left( \frac{-(x+y)^2 + (x-y)^2}{2t} \right)= 1-\exp\left(-\frac{2xy}{t}\right).$

Then, if $xy\ll t$, we can approximate this by $\frac{2xy}{t}$. (*)

Extend the notation so $\mathbb{P}_{0,x}$ describes Brownian motion started from (0,x). Then integrating over y, gives

$\mathbb{P}_{0,x}(B_s\ge 0,\, s\in[0,t] ) = \frac{x}{t}\mathbb{E}[B_t\vee 0] = \sqrt{\frac{2}{\pi}} \frac{x}{\sqrt{t}}.$

(It might appear that we have integrated the approximation (*) over parts of the range where it is not a valid approximation, but the density of $B_t=\Theta(t)$ vanishes exponentially fast, and so actually it’s not a problem.)

We now want to extend this to random walks. Some remarks:

• We used the Gaussian property of Brownian motion fairly heavily throughout this derivation. In general random walks are not Gaussian, though we can make life easier by focusing on this case.
• We also used the continuity property of Brownian motion when we applied the reflection principle. For a general random walk, it’s hopeless to consider the hitting times of individual values. We have to consider instead the hitting times of regions $\tau_{(-\infty,b]}$, and so on. One can still apply SMP and a reflection principle, but this gives bounds rather than equalities. (The exception is simple random walk, for which other more combinatorial methods may be available anyway.)
• On the flip side, if we are interested in Brownian motion/bridge staying positive, we can’t start it from zero, as then the probability of this event is zero, by Blumenthal’s 0-1 law. By contrast, we can certainly ask about random walk staying positive when started from zero without taking a limit.

A useful technique will be the viewpoint of random walk as the values taken by Brownian motion at a sequence of stopping times. This Skorohod embedding is slightly less obvious when considering a general random walk bridge inside a general Brownian bridge, but is achievable. We want to study quantities like

$\mathbb{P}(S_k\ge 0,\, k=1,\ldots,n \big| S_0=x,S_n=y),$

where for simplicity let’s just take $(S_k,k\ge 0)$ to be a random walk with standard Gaussian increments. It’s possible we might want to take a scaling limit in x and y as functions of n. But first if we take x,y fixed, and embed the random walk bridge with these endpoints into the corresponding Brownian bridge with $t\approx n$, we are then faced with the question:

What’s the probability that the Brownian bridge goes below zero, but the embedded RW with n steps does not?

If the Brownian bridge conditioned to go below zero spends time $\Theta_p(n)$ below zero, then for large n it’s asymptotically very unlikely that the n places at which we embed the random walk avoids this set of intervals.

Several technical estimates are required to make this analysis rigorous. The conclusion is that there exists a function $f(x)$ for which $f(x)=x(1+o(1))$ as $x\rightarrow\infty$, such that

$q_n(x,y):=\mathbb{P}(S_k\ge 0,\, k=0,1,\ldots,n \,\big|\, S_0=x,S_n=y) \sim \frac{2f(x)f(y)}{n},$

$\text{and}\quad q_n(x):=\mathbb{P}(S_k\ge 0,\,k=0,1,\ldots,n\,\big|\, S_0=x)\sim \sqrt{\frac{2}{\pi}}\frac{f(x)}{\sqrt{n}}.$

As earlier, the second is obtained from the first by integrating over suitable y. This function $f$ has to account for the extra complications when either end-point is near zero, for which the event where the Brownian motion goes negative without the random walk going negative requires additional care.

Limits for the conditioned random walk

In the previous post on this topic, we addressed scaling limits in space and time for conditioned random walks. But we don’t have to look at the classical Donsker scaling to see the effects of conditioning to stay positive. In our setting, we are interested in studying the distribution of $S_m$ conditional on the event $(S_1\ge 0,S_2\ge 0,\ldots, S_n\ge 0)$, with limits taken in the order $n\rightarrow\infty$ and then $m\rightarrow\infty$.

(At a more general level, it is meaningful to describe the random walk conditioned on staying positive forever. Although this would a priori require conditioning on an event of probability zero, it can be handled formally as an example of an h-transform.)

As explained in that previous post, the scaling invariance of the Bessel process $W^+$ (which it’s not unreasonable to think of as ‘Brownian motion conditioned to stay non-negative’) suggests that this limit should exist, and be given by the entrance law of $W^+$. But this is hard to extract from a scaling limit.

However, we can use the previous estimate to start a direct calculation.

$\mathbb{P}(S_m\in \mathrm{d}y \,\big|\, S_k\ge 0,\, k=1,\ldots,n) = \frac{q_m(0,y) q_{n-m}(y) \mathbb{P}(S_m\in\mathrm{d}y)}{q_n(0)}.$

Here, we used the Markov property at time m to split the event that $S_m=y$ and the walk stays positive into two time-intervals. We will later take m large, so we may approximate as

$\frac{2f(0)f(y)/m \times \sqrt{\frac{2}{\pi}}f(y)/\sqrt{n-m}\times \mathbb{P}(S_m\in\mathrm{d}y) } { \sqrt{\frac{2}{\pi}}f(0)/\sqrt{n}}\stackrel{n\rightarrow\infty}=\frac{2f(y)^2}{m}\mathbb{P}(S_m\in\mathrm{d}y).$

This final probability emphasises that as $m\rightarrow\infty$ we only really have to consider $y=\Theta(\sqrt{m})$, so set $y=z\sqrt{m}$, and we obtain

$\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_m}{\sqrt{m}}\in\mathrm{d}z\,\big|\, S_k\ge 0,\,k=1,\ldots,n)$

$\sim \sqrt{m}\cdot\frac{2z^2m}{m}\cdot \frac{1}{\sqrt{2\pi}}\frac{1}{\sqrt{m}}e^{-z^2/2} = \sqrt{\frac{2}{\pi}}z^2 e^{-z^2/2}.$

This is precisely the entrance law of the 3-dimensional Bessel process, usually denoted $R$. This process is invariant under time-rescaling in the same fashion as Brownian motion. Indeed, one representation of R is as the radial part of a three-dimensional Brownian motion, given by independent BMs in each coordinate. (See [Pi75] for explanation of the relation to ‘BM conditioned to stay non-negative’.) We could complete the analogy by showing that $q_n(x,y)$ converges to the transition density of R as well. (Cf the prelude to Theorem 2.2 of [B-JD05].)

Final remarks

The order of taking limits is truly crucial. We can also obtain a distributional scaling limit at time n under conditioning to stay non-negative up to time n. But then this is the size-biased normal distribution $\sim ze^{-z^2/2}$ (the Rayleigh distribution), rather than the square-size-biased normal distribution we say in this setting. And we can exactly see why. Relative to the normal distribution which applies in the absence of conditioning, we require size-biasing to account for the walk staying positive up to time m, and then also size-biasing to account for the walk staying positive for the rest of time (or up to n in the $n\rightarrow\infty$ limit if you prefer).

The asymptotics for $q_n(x,y)$ were the crucial step, for which only heuristics are present in this post. It remains the case that estimates of this kind form the crucial step in other more exotic conditioning scenarios. This is immediately visible (even if the random walk notation is rather exotic) in, for example, Proposition 2.2 of [CHL17], of which we currently require a further level of generalisation.

References

[Bo76] – Bolthausen – On a functional central limit theorem for random walks conditioned to stay positive

[B-JD05] – Bryn-Jones, Doney – A functional limit theorem for random walk conditioned to stay non-negative

[CHL17] – Cortines, Hartung, Louidor – The structure of extreme level sets in branching Brownian motion

[Ig74] – Iglehart – Functional central limit theorems for random walks conditioned to stay positive

[Pi75] – Pitman – One-dimensional Brownian motion and the three-dimensional Bessel process

Advertisements

# DGFF 4 – Properties of the Green’s function

I’m at UBC this month for the PIMS probability summer school. One of the long courses is being given by Marek Biskup about the Discrete Gaussian Free Field (notes and outline here) so this seems like a good moment to revive the sequence of posts about the DGFF. Here’s DGFF1, DGFF2, DGFF3 from November.

The first draft of this post was about the maximum of the DGFF in a large box $V_N$, and also about the Green’s function $G^{V_N}(x,y)$, which specifies the covariance structure of the DGFF. This first draft also became too long, so I’m splitting it into two somewhat shorter ones. As we’ll see, some understanding and standard estimates of the Green’s function is enough to say quite a bit about the maximum. In this first post, we’ll explore some ‘low-hanging fruit’ concerning the Green’s function, as defined through a simple random walk, which are useful, but rarely explained in the DGFF literature.

Symmetry of Green’s function

We start with one of these low-hanging fruit. If $G^{V_N}$ is to be a covariance matrix, it has to be symmetric. In the first post, showing that the definition of the DGFF as a random field with given Hamiltonian is equivalent to $\mathcal{N}(0,G^{V_N})$ certainly can be viewed as a proof of symmetry. However, it would be satisfying if there was a direct argument in the language of the definition of the Green’s function.

To make this self-contained, recall the random walk definition of $G^{V_N}(x,y)$. Let $(S_m)_{m\ge 0}$ be simple random walk on $V_N$, and $\mathbb{P}_x,\,\mathbb{E}_x$ denote starting the random walk at $x\in V_N$. As usual, let $\tau_y,\,\tau_A$ denote the hitting time of a vertex y or a set A respectively. Then

$G^{V_N}(x,y):= \mathbb{E}_x \left[ \sum_{m=0}^{\tau_{\partial V_N}}1_{(S_m=y) }\right].$

That is, $G^{V_N}(x,y)$ is the expected number of visits to y by a random walk from x, before it exits $V_N$.

Let’s drop the superscript for now, as everything should hold for a more general subset of the lattice. I don’t think it’s immediately obvious at the level of Markov chains why G(x,y)=G(y,x). In particular, it’s not the case that

$\mathbb{P}_x(\tau_y < \tau_{D^c}) = \mathbb{P}_y(\tau_x <\tau_{D^c}),$

and it feels that we can’t map between paths $x \to \partial D$ and $y\to \partial D$ in a way that preserves the number of visits to y and x, respectively. However, we can argue that for any m

$\mathbb{P}_x(S_m=y, \tau_{D^c}>m) = \mathbb{P}_y(S_m=x, \tau_{D^c}>m),$

by looking at the suitable paths of $(S_m)$. That is, if we have a path $x=S_0,S_1,\ldots,S_m=y$ that stays within D, then the probability of seeing this path starting from x and its reverse direction starting from y are equal. Why? Because

$\mathbb{P}_x(S_0=x,S_1=v_1,\ldots,S_{m-1}=v_{m-1},S_m=y) = \prod_{\ell=0}^{m-1} \frac{1}{\mathrm{deg}(v_\ell)},$

and

$\mathbb{P}_y(S_0=y,S_1=v_{m-1},\ldots,S_{m-1}=v_1, S_m=x) = \prod_{\ell=0}^{m-1} \frac{1}{\mathrm{deg}(v_{m-\ell})} = \prod_{\ell=1}^m \frac{1}{\mathrm{deg}(v_\ell)}.$

Since $D\subset \mathbb{Z}^d$ and x,y are in the interior of D, we must have $\mathrm{deg}(x)=\mathrm{deg}(y)$, and so these two expressions are equal. Summing over all such two-way paths, and then all m gives the result.

Fixing one argument

We now focus on $G^D(\cdot,y)$, where the second argument is fixed. This is the solution to the Poisson equation

$\Delta G^D(\cdot,y) = -\delta_y(\cdot),\quad G^D(x,y)=0,\; \forall x\in \partial D.$

To see this, can use a standard hitting probability argument (as here) with the Markov property. This is harmonic in $D\backslash \{y\}$, and since we know

$G^D(y,y)= \frac{1}{\mathbb{P}_y(\text{RW hits }\partial D\text{ before returning to }y)},$

this uniquely specifies $G^D(\cdot,y)$. Anyway, since harmonic functions achieve their maxima at the boundary, we have $G(y,y)\ge G(x,y)$ for all $x\in D$. We can also see this from the SRW definition as

$G(x,y)=G(y,x) = \mathbb{P}_y (\tau_x < \tau_{\partial D} ) G(x,x) \le G(x,x).$

Changing the domain

Now we want to consider nested domains $D\subset E$, and compare $G^D(\cdot,\cdot)$ and $G^E(\cdot,\cdot)$ on DxD. The idea is that for SRW started from $x\in D$, we have $\tau_{\partial D}\le \tau_{\partial E}$, since one boundary is contained within the other. From this, we get

$G^D(x,y)\le G^E(x,y),\quad \forall x,y\in D,$

and we will use the particular case y=x.

For example, if $x\in V_N$, the box with width N, then the box with width 2N centred on x contains the whole of $V_N$. So, if we set $\bar {V}_{2N}:= [-N,N]^d$, then with reference to the diagram, we have

$G^{V_N}(x,x)\le G^{\bar{V}_{2N}}(0,0),\quad x\in V_N.$

As we’ll see when we study the maximum of the DGFF on $V_N$, uniform control over the pointwise variance will be a useful tool.

Maximising the Green’s function

The idea of bounding $G^{V_N}(x,x)$ by $G^{\bar V_{2N}}(0,0)$ for any $x\in V_N$ is clever and useful. But a more direct approach would be to find the value of x that maximises $G^{V_N}(x,x)$. We would conjecture that when $V_N$ has a central vertex, then this is the maximiser.

We can prove this directly from the definition of the Green’s function in terms of random walk occupation times. Let’s assume we are working with $\bar{V}_N$ for even N, so that 0 is the central vertex. Again, since

$G^D(x,x)=\frac{1}{\mathbb{P}_x(\text{RW hits }\partial D\text{ before returning to }x)},$ (*)

it would suffice to show that this probability is minimised when x=0. This feels right, since 0 is furthest from the boundary. Other points are closer to the boundary in some directions but further in others, so we can’t condition on the maximum distance from its start point achieved by an excursion of SRW (we’re vertex-transitive, so these look the same from all starting points), as even allowing for the four possible rotations, for an excursion of diameter slightly larger than N, starting at the centre is maximally bad.

However, intuitively it does feel as if being closer to the boundary makes you more likely to escape earlier. In fact, with a bit more care, we can couple the SRW started from 0 and the SRW started from $r=(r^x,r^y)\ne 0$ such that the latter always exits first. For convenience we’ll assume also that $r^x,r^y$ are both even.

I couldn’t find any reference to this, so I don’t know whether it’s well-known or not. The following argument involves projecting into each axis, and doing separate couplings for transitions in the x-direction and transitions in the y-direction. We assume WLOG that x is in the upper-right quadrant as shown. Then, let $0=S_0,S_1,S_2,\ldots$ be SRW started from 0, and we will construct $r=R_0,R_1,R_2,\ldots$ on the same probability space as $(S_m)_{m\ge 0}$ as follows. For every m, we set the increment $R_{m+1}-R_m$ to be $\pm(S_{m+1}-S_m)$. It remains to specify the sign, which will be determined by the direction of the S-increment, and a pair of stopping times. The marginal is therefore again an SRW, started from r. Temporarily, we use the unusual notation $S_m= (S^x_m,S^y_m)$ for the coordinates of $S_m$.

So, if $S_{m+1}-S_m=(1,0), (-1,0)$, ie S moves left or right, then we set

$R_{m+1}-R_m = \begin{cases} -(S_{m+1}-S_m) &\quad \text{if }mT^x.\end{cases}$ (*)

where $T^x:= \min\{m\,:\, R^x_m=S^x_m\}$. That is, $R^x$ moves in the opposing direction to $S^x$ until the first time when they are equal (hence the parity requirement), and then they move together. WLOG assume that $r^x>0$. Then suppose $S^x_m=\pm N$ and such m is minimal. Then by construction, if $m\ge T^x$, then $R^x_m=\pm N$ also. If $m, then we must have $S^x_m=-N$, and so since $R^x$‘s trajectory is a mirror image of $S^x$‘s, in fact $R^x_m = N+r^x>N$, so $R^x$ hit +N first. In both cases, we see that $R^x$ hits $\pm N$ at the same time or before $S^x$.

In other words, when $S^x_m$ has non-negative x coordinate, the lazy random walk $R^x$ follows the same trajectory as $S^x$, and when it has negative x coordinate, the $R^x$ mirrors $S^x$. At some time, it may happen that $S^x_m= R^x_m=0$ (recall the parity condition on r). Call this time $T^x$. We then adjust the description of the coupling so that (*) is the mechanism for $m, and then for $m\ge T^x$, we take $S^x_m=R^x_m$.

Similarly, if $S_{m+1}-S_m =(0,1), (0,-1)$, ie S moves up or down, then we set

$R_{m+1}-R_m = \begin{cases} -(S_{m+1}-S_m)&\quad \text{ if }m

with corresponding definition of the stopping time $T^y$.

This completes the coupling, and by considering $T^x\wedge T^y$, we have shown what that the exit time for the walk started from zero dominates the exit time for walk started from r. Recall that so far we are in the case where the box has even width and $r=(r^x,r^y)$ has even coordinates.

This exit time comparison isn’t exactly what we need to compare $G^N(0,0)$ and $G^N(x,x)$. It’s worth remarking at this stage that if all we cared about was the Green’s function on the integer line [-N,N], we would have an easier argument, as by the harmonic property of $G(\cdot,y)$

$G^{[-N,N]}(0,r)=\frac{N-r}{N}G^{[-N,N]}(0,0),$

$G^{[-N,N]}(r,0) = \frac{N}{N+r}G^{[-N,N]}(r,r),$

and so $G(0,0)>G(r,r)$ follows by symmetry. To lift from 1D to 2D directly, we need a bit more than this. It’s possible that S returns in both x- and y- coordinates more often than R, but never at the same time. Fortunately, the coupling we defined slightly earlier does give us a bit more control.

Let $\tau^x(S), \tau^x(R)$ be the first times that $S^x, R^x$ hit $\pm N$. Under this coupling, for any $m\ge 0$

$\mathbb{P}(S^x_m=0, m

since these events are literally equal. Since we showed that $\tau^x(R)\le \tau^x(S)$ almost surely, we can further deduce

$\mathbb{P}(S^x_m=0,m

$=\mathbb{P}(R^x_m=r^x, m

To address the corresponding events for which $m\ge T^x$, we apply the strong Markov property at $T^x$, to obtain SRW $Z_m$ started from r/2, and let $\tau_{-N},\tau_{+N}$ be the hitting times of $-N,+N$ respectively and $\tau_{\pm N}=\tau_{-N}\wedge \tau_{+N}$. It will now suffice to prove that

$\mathbb{P}(Z_m=0, m< \tau_{\pm N}) \ge \mathbb{P}(Z_m=r,m<\tau_{\pm N}),$ (**)

as then we can apply the law of total probability and sum over values of $T^x$ and $m\ge 0$.

To prove this result, we consider the following bijection between trajectories of length m from r/2 to {0,r}. We decompose the trajectories into excursions away from r/2, and then a final meander from r/2 to {0,r} that stays on the same side of r/2. We construct the new trajectory by preserving all the initial excursions, but reversing all the steps of the final meander. So if the original trajectory ended up at 0, the image ends up at r. Trivially, the initial excursions in the image only hit $\pm N$ if the excursions in the original trajectory did this too. But it’s also easy to see, by a similar argument to the coupling at the start of this section, that if the original trajectory ends at r and does not hit $\pm N$, then so does the image. However, the converse is not true. So we conclude (**), and thus

$\mathbb{P}(S_m^x=0) \ge \mathbb{P}(R_m^x=0)$

for all m by combining everything we have seen so far. And so we can now lift to a statement about $S_m$ itself, that is considering both coordinates separately.

The remaining cases for r require a little more care over the definition of $T^x$, though the same projection argument works, for fundamentally the same reason. (Note that in the above argument, if $S^x_m=-N$ and $m, then in fact $R^x_m\ge N+2$, and so it’s not hard to convince yourself that a sensible adjustment to the stopping time will allow a corresponding result with $R^x_m\ge N+1$ in the odd $r^x$ case.) The case for N odd is harder, since in one dimension there are two median sites, and it’s clear by symmetry that we can’t couple them such that RW from one always exits at least as early as RW from the other. However, the distributions of exit times started from these two sites are the same (by symmetry), and so although we can’t find a coupling, we can use similar stopping times to obtain a result in probability.

In the next post, we’ll see how to apply this uniform bound on $G^{V_N}(x,x)$ to control the maximum of the DGFF on $V_N$. In particular, we address how the positive correlations of DGFF influence the behaviour of the maximum by comparison with independent Gaussians at each site.

# Skorohod embedding

Background

Suppose we are given a standard Brownian motion $(B_t)$, and a stopping time T. Then, so long as T satisfies one of the regularity conditions under which the Optional Stopping Theorem applies, we know that $\mathbb{E}[B_T]=0$. (See here for a less formal introduction to OST.) Furthermore, since $B_t^2-t$ is a martingale, $\mathbb{E}[B_T^2]=\mathbb{E}[T]$, so if the latter is finite, so is the former.

Now, using the strong Markov property of Brownian motion, we can come up with a sequence of stopping times $0=T_0, T_1, T_2,\ldots$ such that the increments $T_k-T_{k-1}$ are IID with the same distribution as T. Then $0,B_{T_1},B_{T_2},\ldots$ is a centered random walk. By taking T to be the hitting time of $\{-1,+1\}$, it is easy to see that we can embed simple random walk in a Brownian motion using this approach.

Embedding simple random walk in Brownian motion.

The Skorohod embedding question asks: can all centered random walks be constructed in this fashion, by stopping Brownian motion at a sequence of stopping time? With the strong Markov property, it immediately reduces the question of whether all centered finite-variance distributions X can be expressed as $B_T$ for some integrable stopping time T.

The answer to this question is yes, and much of what follows is drawn from, or at least prompted by Obloj’s survey paper which details the problem and rich history of the many approaches to its solution over the past seventy years.

Applications and related things

The relationship between random walks and Brownian motion is a rich one. Donsker’s invariance principle asserts that Brownian motion appears as the scaling limit of a random walk. Indeed, one can construct Brownian motion itself as the limit of a sequence of consistent random walks with normal increments on an increasingly dense set of times. Furthermore, random walks are martingales, and we know that continuous, local martingales can be expressed as a (stochastically) time-changed Brownian motion, from the Dubins-Schwarz theorem.

The Skorohod embedding theorem can be used to prove results about random walks with general distribution by proving the corresponding result for Brownian motion, and checking that the construction of the sequence of stopping times has the right properties to allow the result to be carried back to the original setting. It obviously also gives a coupling between a individual random walk and a Brownian motion which may be useful in some contexts, as well as a coupling between any pair of random walks. This is useful in proving results for random walks which are much easier for special cases of the distribution. For example, when the increments are Gaussian, or when there are combinatorial approaches to a problem about simple random walk. At the moment no aspect of this blog schedule is guaranteed, but I plan to talk about the law of the iterated logarithm shortly, whose proof is approachable in both of these settings, as well as for Brownian motion, and Skorohod embedding provides the route to the general proof.

At the end, we will briefly compare some other ways to couple a random walk and a Brownian motion.

Adding extra randomness

One thing we could do is sample a copy of X independently from the Brownian motion, then declare $T= \tau_{X}:= \inf\{t\ge 0: B_t=X\}$, the hitting time of (random value) X. But recall that unfortunately $\tau_x$ has infinite expectation for all non-zero x, so this doesn’t fit the conditions required to use OST.

Skorohod’s original method is described in Section 3.1 of Obloj’s notes linked above. The method is roughly to pair up positive values taken by X appropriately with negative values taken by X in a clever way. If we have a positive value b and a negative value a, then $\tau_{a,b}$, the first hitting time of $\mathbb{R}\backslash (a,b)$ is integrable. Then we choose one of these positive-negative pairs according to the projection of the distribution of X onto the pairings, and let T be the hitting time of this pair of values. The probability of hitting b conditional on hitting {a,b} is easy to compute (it’s $\frac{-a}{b-a}$) so we need to have chosen our pairs so that the ‘probability’ of hitting b (ie the density) comes out right. In particular, this method has to start from continuous distributions X, and treat atoms in the distribution of X separately.

The case where the distribution X is symmetric (that is $X\stackrel{d}=-X$) is particularly clear, as then the pairs should be $(-x,x)$.

However, it feels like there is enough randomness in Brownian motion already, and subsequent authors showed that indeed it wasn’t necessary to introduce extra randomness to provide a solution.

One might ask whether it’s possible to generate the distribution on the set of pairs (as above) out of the Brownian motion itself, but independently from all the hitting times. It feels like it might be possible to make the distribution on the pairs measurable with respect to

$\mathcal{F}_{0+} = \bigcap\limits_{t>0} \mathcal{F}_t,$

the sigma-algebra of events determined by limiting behaviour as $t\rightarrow 0$ (which is independent of hitting times). But of course, unfortunately $\mathcal{F}_{0+}$ has a zero-one law, so it’s not possible to embed non-trivial distributions there.

Dubins solution

The exemplar for solutions without extra randomness is due to Dubins, shortly after Skorohod’s original argument. The idea is to express the distribution X as the almost sure limit of a martingale. We first use the hitting time of a pair of points to ‘decide’ whether we will end up positive or negative, and then given this information look at the hitting time (after this first time) of two subsequent points to ‘decide’ which of four regions of the real interval we end up in.

I’m going to use different notation to Obloj, corresponding more closely with how I ended up thinking about this method. We let

$a_+:= \mathbb{E}[X \,|\, X>0], \quad a_- := \mathbb{E}[X\,|\, X<0],$ (*)

and take $T_1 = \tau_{\{a_-,a_+\}}$. We need to check that

$\mathbb{P}\left( B_{T_1}=a_+\right) = \mathbb{P}\left(X>0\right),$

for this to have a chance of working. But we know that

$\mathbb{P}\left( B_{T_1}=a_+\right) = \frac{a_+}{a_+-a_-},$

and we can also attack the other side using (*) and the fact that $\mathbb{E}[X]=0$, using the law of total expectation:

$0=\mathbb{E}[X]=\mathbb{E}[X\,|\, X>0] \mathbb{P}(X>0) + \mathbb{E}[X\,|\,X<0]\mathbb{P}(X<0) = a_+ \mathbb{P}(X>0) + a_- \left(1-\mathbb{P}(X>0) \right),$

$\Rightarrow\quad \mathbb{P}(X>0)=\frac{a_+}{a_+-a_-}.$

Now we define

$a_{++}=\mathbb{E}[X \,|\, X>a_+],\quad a_{+-}=\mathbb{E}[X\,|\, 0

and similarly $a_{-+},a_{--}$. So then, conditional on $B_{T_1}=a_+$, we take

$T_2:= \inf_{t\ge T_1}\left\{ B_t\not\in (a_{+-},a_{++}) \right\},$

and similarly conditional on $B_{T_1}=a_-$. By an identical argument to the one we have just deployed, we have $\mathbb{E}\left[B_{T_2} \,|\,\mathcal{F}_{T_1} \right] = B_{T_1}$ almost surely. So, although the $a_{+-+}$ notation now starts to get very unwieldy, it’s clear we can keep going in this way to get a sequence of stopping times $0=T_0,T_1,T_2,\ldots$ where $B_{T_n}$ determines which of the $2^n$ regions of the real line any limit $\lim_{m\rightarrow\infty} B_{T_m}$ should lie in.

A bit of work is required to check that the almost sure limit $T_n\rightarrow T$ is almost surely finite, but once we have this, it is clear that $B_{T_n}\rightarrow B_T$ almost surely, and $B_T$ has the distribution required.

Komlos, Major, Tusnady coupling

We want to know how close we can make this coupling between a centered random walk with variance 1, and a standard Brownian motion. Here, ‘close’ means uniformly close in probability. For large times, the typical difference between one of the stopping times $0,T_1,T_2,\ldots$ in the Skorohod embedding and its expectation (recall $\mathbb{E}[T_k]=k$) is $\sqrt{n}$. So, constructing the random walk $S_0,S_1,S_2,\ldots$ from the Brownian motion via Skorohod embedding leads to

$\left |S_k - B_k \right| = \omega(n^{1/4}),$

for most values of $k\le n$. Strassen (1966) shows that the true scale of the maximum

$\max_{k\le n} \left| S_k - B_k \right|$

is slightly larger than this, with some extra powers of $\log n$ and $\log\log n$ as one would expect.

The Komlos-Major-Tusnady coupling is a way to do a lot better than this, in the setting where the distribution of the increments has a finite MGF near 0. Then, there exists a coupling of the random walk and the Brownian motion such that

$\max_{k\le n}\left|S_k- B_k\right| = O(\log n).$

That is, there exists C such that

$\left[\max_{k\le n} \left |S_k-B_k\right| - C\log n\right] \vee 0$

is a tight family of distributions, indeed with uniform exponential tail. To avoid digressing infinitely far from my original plan to discuss the proof of the law of iterated logarithm for general distributions, I’ll stop here. I found it hard to find much coverage of the KMT result apart from the challenging original paper, and many versions expressed in the language of empirical processes, which are similar to random walks in many ways relevant to convergence and this coupling, but not for Skorohod embedding. So, here is a link to some slides from a talk by Chatterjee which I found helpful in getting a sense of the history, and some of the modern approaches to this type of normal approximation problem.

# Hitting Probabilities for Markov Chains

This continues my previous post on popular questions in second year exams. In the interest of keeping it under 2,500 words I’m starting a new article.

In a previous post I’ve spoken about the two types of Markov chain convergence, in particular, considering when they apply. Normally the ergodic theorem can be used to treat the case where the chain is periodic, so the transition probabilities do not converge to a stationary distribution, but do have limit points – one at zero corresponding to the off-period transitions, and one non-zero. With equal care, the case where the chain is not irreducible can also be treated.

A favourite question for examiners concerns hitting probabilities and expected hitting times of a set A. Note these are unlikely to come up simultaneously. Unless the hitting probability is 1, the expected hitting time is infinite! In both cases, we use the law of total probability to derive a family of equations satisfied by the probabilities/times. The only difference is that for hitting times, we add +1 on the right hand side, as we advance one time-step to use the law of total probability.

The case of hitting probabilities is perhaps more interesting. We have:

$h_i^A = 1,\; i\in A, \quad h_i^A=\sum_{j\in S}p_{ij}h_j^A,\; i\not\in A.$

There are two main cases of interest: where the chain is finite but has multiple closed communicating classes, and where the chain is infinite, so even though it is irreducible, a trajectory might diverge before hitting 0.

For the case of a finite non-irreducible Markov chain, this is fairly manageable, by solving backwards from states where we know the values. Although of course you could ask about the hitting probability of an open state, the most natural question is to consider the probability of ending up in a particular closed class. Then we know that the hitting probability starting from site in the closed class A is 1, and the probability starting from any site in a different closed class is 0. To find the remaining values, we can work backwards one step at a time if the set of possible transitions is sparse enough, or just solve the simultaneous equations for $\{h_i^A: i\text{ open}\}$.

We therefore care mainly about an infinite state-space that might be transient. Typically this might be some sort of birth-and-death chain on the positive integers. In many cases, the hitting probability equations can be reduced to a quadratic recurrence relation which can be solved, normally ending up with the form

$h_i=A+B\lambda^i$,

where $\lambda$ might well be q/p or similar if the chain is symmetric. If the chain is bounded, typically you might know $h_0=1, h_N=0$ or similar, and so you can solve two simultaneous equations to find A and B. For the unbounded case you might often only have one condition, so you have to rely instead on the result that the hitting probabilities are the minimal solution to the family of equations. Note that you will always have $h^i_i=1$, but with no conditions, $h^i_j\equiv 1$ is always a family of solutions.

It is not clear a priori what it means to be a minimal solution. Certainly it is not clear why one solution might be pointwise smaller than another, but in the case given above, it makes sense. Supposing that $\lambda<1$, and A+B=1 say, then as we vary the parameters, the resulting set of ‘probabilities’ does indeed vary monotically pointwise.

Why is this true? Why should the minimum solution give the true hitting probability values? To see this, take the equations, and every time an $h_i^A$ appears on the right-hand side, substitute in using the equations. So we obtain, for $i\not\in A$,

$h_i^A=\sum_{j\in A}p_{ij}+\sum_{j\not\in A} p_{ij}h_j^A,$

and after a further iteration

$h_i^A=\sum_{j_1\in A}p_{ij_1}+\sum_{j_1\not\in A, j_2\in A}p_{ij_1}p_{j_1j_2}+\sum_{j_1,j_2\not\in A}p_{ij_1}p_{j_1j_2}h_{j_2}^A.$

So we see on the RHS the probability of getting from i to A in one step, and in two steps, and if keep iterating, we will get a large sum corresponding to the probability of getting from i to A in 1 or 2 or … or N steps, plus an extra term. Note that the extra term does not have to correspond to the probability of not hitting A by time N. After all, we do not yet know that $(h_{i}^A)$ as defined by the equations gives the hitting probabilities. However, we know that the probability of hitting A within N steps converges to the probability of hitting A at all, since the sequence is increasing and bounded, so if we take a limit of both sides, we get $h_i^A$ on the left, and something at least as large as the hitting probability starting from i on the right, because of the extra positive term. The result therefore follows.

It is worth looking out for related problems that look like a hitting probability calculation. There was a nice example on one of the past papers. Consider a simple symmetric random walk on the integers modulo n, arranged clockwise in a circle. Given that you start at state 0, what is the probability that your first return to state 0 involves a clockwise journey round the circle?

Because the system is finite and irreducible, it is not particularly interesting to consider the actual hitting probabilities. Also, note that if it is convenient to do so, we can immediately reduce the problem when n is even. In two steps, the chain moves from j to j+2 and j-2 with probability ¼ each, and stays at j with probability ½. So the two step chain is exactly equivalent to the lazy version of the same dynamics on n/2.

Anyway, even though the structure is different, our approach should be the same as for the hitting probability question, which is to look one step into the future. For example, to stand a chance of working, our first two moves must both be clockwise. Thereafter, we are allowed to move anticlockwise. There is nothing special about starting at 0 in defining the original probability. We could equally well ask for the probability that starting from j, the first time we hit 0 we have moved clockwise round the circle.

The only thing that is now not obvious is how to define moving clockwise round the circle, since it is not the case that all the moves have to be clockwise to have experienced a generally clockwise journey round the circle, but we definitely don’t want to get into anything complicated like winding numbers! In fact, the easiest way to make the definition is that given the hitting time of 0 is T, we demand that the chain was at state n at time T-1.

For convenience (ie to make the equations consistent) we take $h_0=0, h_n=1$ in an obvious abuse of notation, and then

$h_j=\frac12h_{j-1}+\frac12 h_{j+1},$

from which we get

$h_j=a+bj \Rightarrow h_j=\frac{j}{n}.$

Of course, once we have this in mind, we realise that we could have cut the circle at 0 (also known as n) and unfolded it to reduce the problem precisely to symmetric gambler’s ruin. In particular, the answer to the original problem is 1/2n, which is perhaps just a little surprising – maybe by thinking about the BM approximation to simple random walk, and that BM started from zero almost certainly crosses zero infinitely many times near we might have expected this probability to decay faster. But once it is unfolded into gambler’s ruin, we have the optimal stopping martingale motivation to reassure us that this indeed looks correct.

# The Top-to-Random Shuffle II

In the last post, I introduced the top-to-random shuffle. In particular, we considered why this sort of procedure was important as an alternative to choosing an ordering afresh, and how to we would go about measuring how close we had got to randomness.

In this post, I want to develop the second of these points. The intuition might be that we can get very close to uniform randomness if we repeat the shuffle often enough. Recall this means that even if we choose our bet in a really complicated and careful way, we still couldn’t make much profit by knowing the actual distribution of the ordering. But we might also suspect that the pack will never be exactly random, in the same way that the distribution of the proportion of heads seen on a repeatedly-flipped coin will eventually get very close to 1/2, but will not be exactly 1/2.

This intuition is extremely sensible, and in general is true. It is a nice fact, however, that it fails for the top-to-random shuffle, where we do in fact get to a uniformly random deck. Recall that we approximated how long it would take to get to a state that was roughly random by calculating the time taken for the original bottom card to rise to the top of the deck. This time was:

$n+\frac{n}{2}+\frac{n}{3}+\ldots+\frac{n}{n-1},$

where n is the total number of cards. A further shuffle is required to send the original card back into the pack somewhere. The claim is that the pack is now uniformly random. Note that if we ever actually use this, we have to be careful because although we can calculate the average time at which this event happens, the time itself is random. Rather than worry about that, let’s see why it is true.

As a motivating example, let’s assume that the pack was originally in the order 1,2,3,…,n, and consider the final relative order of the cards numbered 1 and 2. There is some small probability that on the first go (and then again on the second go possibly also) that the card number 1 stays put. Let’s ignore that possibility and progress to the first time that the 1 has been moved into the interior of the pack, so the 2 is now on top. When we do this, we are choosing a number between 2 and n uniformly at random, and moving the card numbered 1 to this position. Let’s call this number X.

Now, when we try to shuffle the 2, we are choosing a number between 1 and n uniformly at random, and moving the card numbered 2 to this position. Let’s call this number Y. Under exactly what circumstances does 2 end up above 1? The clearest example is if Y=X. Then card 2 has been moved to the position previously occupied by card 1. So card 1 moves up a position (since the card 2 is no longer on the top). The final configuration therefore includes card 1 directly above card 2. So we can say

$\mathbb{P}(\text{2 above 1})=\mathbb{P}(Y

where the fact that the inequality in the second probability is strict is important. To calculate this second probability, we want to exploit the symmetry of the situation. The only problems are that the case X=Y is not symmetric as then 1 ends up above 2 as described above, and also that X cannot be 1. So we account for these separately. Note

$\mathbb{P}(Y=1)=\frac{1}{n},\quad \mathbb{P}(Y=X)=\frac{1}{n}.$

The second result holds overall since it holds whenever we condition on a particular value of X. These events are also disjoint. Then

$\mathbb{P}(Y1, X\neq Y)\mathbb{P}(Y>1,X\neq Y)$

$= \frac{1}{n}+\frac{1}{2}(1-\mathbb{P}(Y>1,Y\neq X))$

$=\frac{1}{n}+\frac{1}{2}(1-\frac{1}{n}-\frac{1}{n}) = \frac{1}{2}.$

In summary, the cards 1 and 2 will be in uniformly random order. We might like to extend this idea, but it will get complicated when we add card 3 to the mix, as it is possible (if unlikely) that 1 and 2 will be further mixed before this. This shouldn’t affect the result much, but it will get complicated to define the notation required to carry this sort of argument all the way up to the nth card.

Even using induction is not going to make life substantially easier. Knowing that once we have inserted cards 1,2,…,k into the pack they are in uniformly random order is not enough to make inference about what happens once we put k+1 into the pack. We have to know something about the current positions of 1,2,…,k. For example, if one of these cards is definitely on the bottom of the pack, then the probability that k+1 ends up last among 1,2,…,k+1 is 1/n rather than 1/k+1 as it should be. So in fact we would have to control an annoying amount of information jointly.

In the argument we attempted above, we were looking at the first times some card k got folded back into the pack. Note that this division of time is different to the one we were using for the coupon collector approach to the mixing time in the previous post. Let’s try to use that instead here.

Now we consider the times at which a card is moved below card n. We deliberately decline to say what these cards are. But rather, we want to prove that, conditional on the cards below n being $A_k=\{a_1,\ldots,a_k\}$, the ordering of these is uniform on $S_{A_k}$, that is, every possibility is equally likely. Now this is easy to prove by induction. For, by conditioning on $A_k$ and $a_{k+1}$ being the new card to be moved below n, we are conditioning on the set of cards below n now being $A_{k+1}=A_k\cup\{a_{k+1}\}$. The position of the new card is uniformly random within this, by construction of the top-to-random shuffle, and so the new arrangement is uniformly random on the (k+1)! possibilities.

To see why we have proved the original result we wanted, note that this argument works at the time when the original bottom card is now at the top. So the remaining cards are uniformly randomly ordered. Inserting card n at random gives an arrangement that is uniformly random overall. So as we suggested before, working out how long it takes to get close to randomness in this case reduces to working out how long it is before the original bottom card hits the top and is re-inserted, as at that point, the pack genuinely is uniformly random.

# Mixing Times 6 – Aldous-Broder Algorithm and Cover Times

In several previous posts, I’ve discussed the Uniform Spanning Tree. The definition is straightforward: we choose uniformly at random from the set of trees which span a fixed underlying graph. But for a dense underlying graph, there are a very large number spanning trees. Cayley’s formula says that the complete graph K_n has $n^{n-2}$ spanning trees, so to select from this list is impractical.

We seek a better algorithm. In a post about a year ago, I presented the result that the path between two fixed points x and y in the UST is distributed as the path generated by Loop-Erased Random Walk, for which we start at x and delete cycles as they appear. An initial problem might be that this only gives us a single path, which might be enough in some contexts, but in general we will want to specify the whole tree. Wilson’s Algorithm is an unsurprising but useful extension to this equivalence which does just that. You start by constructing the LERW between two vertices, then you add the LERW which connects some other vertex to the path you already have. Then you take a further vertex not currently explored and start LERW there, continuing until you hit the tree that you already have. Iterate this process, which must terminate after at most n steps when there are no vertices which to start from. The tree thus obtained is the UST. The tricky part is proving that the method for selecting which unused vertices to start from has no effect on the distribution of paths between two fixed points.

I want to consider a different algorithm, discovered roughly simultaneously by Aldous and Broder. Start a random walk on the underlying graph at some particular vertex. Every time we traverse an edge which takes us to a vertex we haven’t yet explored, add this edge to the tree. For now I don’t want to give a proof that this algorithm works, but rather to talk about how fast it works, because it ties in nicely with something from the Mixing Times book we’ve been reading recently. It is clear that the algorithm terminates at the first time the random walk has visited every vertex. This is a stopping time, called the cover time of the Markov chain. If we are working with an underlying complete, then we notice that this is annoying, because it means that the cover time will increase like n.log n. That is, it will take an increasingly long time to gather the final few vertices into the tree. Perhaps some combination of Aldous-Broder initially then Wilson’s method for the final o(n) vertices might be preferable?

I want to discuss how to treat this cover time. Often we have information about the hitting times of states from other states $\mathbb{E}_x T_y$. A relationship between S, the hitting time, defined to be the maximum of the previous display over x and y, and the expected cover time would be useful, especially for a highly symmetric graph like the complete graph where the expected hitting times are all the same.

Matthews’ Method relates these two for an irreducible finite Markov chain on n states. It says:

$t_{cov}\leq t_{hit}\left(1+\frac12+\ldots+\frac 1 n\right).$

We first remark that this agrees with what we should get for the random walk on the complete graph. There, the hitting time of x from y is a geometric random variable with success probability 1/n, hence expectation is n. The cover time is the standard coupon collector problem, giving expectation n log n, and the sum of reciprocals factor is asymptotically a good approximation.

The intuition is that if we continue until we hit state 1, then reset and continue until we hit state 2, and so on, by the time we hit state n after (n-1) iterations, this is a very poor overestimate of the cover time, because we are actually likely to have hit most states many times. What we want to do really is say that after we’ve hit state 1, we continue until we hit state 2, unless we’ve already done so, in which case we choose a different state to aim for, one which we haven’t already visited. But this becomes complicated because we then need to know the precise conditional probabilities of visiting any site on the way between two other states, which will depend rather strongly on the exact structure of the chain.

Peres et al give a coupling proof in Chapter 11 of their book which I think can be made a bit shorter, at least informally. The key step is that we still consider hitting the sites in order, only now in a random order.

That is, we choose a permutation $\sigma\in S_n$ uniformly at random, and we let $T_k$ be the first time that states $\sigma(1),\ldots,\sigma(k)$ have all been visited. This is a random time that is measurable in the product space, and for each $\sigma$ it is a stopping time.

The key observation is that $\mathbb{P}(T_{k+1}=T_k)=1-\frac{1}{k+1}$. This holds conditional on any path of the Markov chain because the requirement for the event is that $\sigma(k+1)$ is visited after $\{\sigma(1),\ldots,\sigma(k)\}$. The statement therefore holds as stated as well as just pathwise. Then, by the SMP, conditional on $\{T_{k+1}>T_k\}$, we have

$T_{k+1}-T_k \leq_{st} t_{hit}.$

Note that by the definition of $t_{hit}$, this bound on the hitting time $T_{k+1}$ is unaffected by concerns about where the chain actually is at $T_k$ (since it is not necessarily at $\sigma(k)$).

So, removing the conditioning, we have:

$\mathbb{E}\Big[T_{k+1}-T_k\Big]\leq\frac{1}{k+1}t_{hit},$

and so the telescoping sum gives us Matthews’ result.

One example is the cover time of random walk on the n x n torus, which turns out to be

$O(n^2(\log n)^2).$

If anyone remembers that Microsoft screensaver from many years ago which started with a black screen and a snake leaving a trail of white pixels as it negotiated the screen, this will be familiar. The last few black bits take a frustratingly long while to disappear. Obviously that isn’t quite a random walk, but it perhaps diminishes the surprise that it should take this long to find the cover time.

There are a couple of interesting things I wanted to say about electrical networks for Markov chains and analytic methods for mixing times, but the moment may have passed, so this is probably the last post about Mixing Times. Plans are in motion for a similar reading group next term, possible on Random Matrices.

# Mixing Times 5 – Cesaro Mixing

We have just finished discussing chapters 11 and 12 of Markov Chains and Mixing Times, the end of the ‘core material’. I thought that, rather than addressing some of the more interesting but technical spectral methods that have just arisen, it would be a good subject for a quick post to collate some of the information about Cesaro mixing, which is spread throughout this first section.

Idea

A main result in the introductory theory of Markov chains is that for an irreducible aperiodic chain X, the distribution of $X_t\rightarrow \pi$, the (unique) equilibrium distribution. The mixing time gives the rate at which this first mode of convergence takes place. We have freedom over the initial state, so we typically consider the ‘worst case scenario’, ie the slowest convergence. The most appropriate metric is given by the total variation distance, which is defined in previous posts. The most important point to note is that the mixing time should be thought of as the correct timescale for convergence, rather than some threshold. In particular, the time at which the chain is within 1/4 of the equilibrium distribution in the TV metric has the same order magnitude (in n, some parameter controlling the number of states) as the time at which it is within 1/20 of the equilibrium distribution.

But this isn’t the only result about convergence in distribution of functionals of Markov chains. Perhaps more intuitive is the ergodic theorem which asserts that the proportion of time spent in a particular state also converges to the equilibrium probability as time advances. We might write:

$\frac{1}{t}\sum_t \mathbf{1}(X_t=x)\rightarrow \pi(x),\quad \forall x\in \Omega.$

Note that if the state space $\Omega$ is finite, then we can also assume that this occurs uniformly in x. We can also think of the LHS of this convergence as a measure on $\Omega$ varying in time

$\frac{1}{t}\sum_t \mathbf{1}(X_t=\cdot),$

and the mixing time for this sequence of measures is defined as for the conventional mixing time, and is called the Cesaro mixing time, at least in the Levin / Peres / Wilmer text.

Advantages

There are some obvious advantages to considering Cesaro mixing. Principally, a main drawback of conventional mixing is that we are unable to consider periodic chains. This property was the main content of the previous post, but to summarise, if a chain switches between various classes in a partition of the state space in a deterministic periodic way, then the distribution does not necessarily converge to equilibrium. The previous post discusses several ways of resolving this problem in specific cases. Note that this problem does not affect Cesaro mixing as the ergodic theorem continues to hold in the periodic case. Indeed the form of the distribution (which we might call an occupation measure in some contexts) confirms the intuition about viewing global mixing as a sort of sum over mixing modulo k in time.

Other advantages include the fact that the dependence on the initial state is weaker. For instance, consider the occupation measures for a chain started at x which moves first to y, versus a chain which starts at y then proceeds as the original. It requires very little thought to see that for O(1) values of t, this difference in occupation measures between these chains becomes small.

Another bonus is that we can use so-called stationary times to control Cesaro mixing. A stationary time is a stopping time such that $X_\tau\stackrel{d}{=}\pi$. It is clear that if we wait until $\tau$, then run the chain for a further $\alpha\tau$, the chain will have spent $\frac{\alpha}{1+\alpha}$ of its duration in the equilibrium distribution, and so using Markov’s inequality and bounding the total variation distance between occupation measure and $\pi$ by 1 up until the stationary time, we can get good bounds for the Cesaro mixing time in terms of $\mathbb{E}\tau$.

Why does this fail to work for normal mixing? The key to the above argument was that by taking an average over time up to some $T>>\tau$, the dependence on the actual value of $X_\tau$ was suppressed. Consider the deterministic walk on the cycle $\mathbb{Z}_n$, which advances by 1 modulo n on each go. Now sample independently a random variable Z distributed uniformly on $\mathbb{Z}_n$. By definition, the random hitting time $\tau_Z$ is a stationary time, but in fact the chain’s distribution does not converge. The condition we actually require for normal mixing is that $\tau$ be a strong stationary time, meaning that $X_\tau\stackrel{d}{=}\pi$, and the value of $X_\tau$ is independent of $\tau$. With this definition we can proceed with a similar result for normal mixing. An example of a strong stationary time would be for shuffling a pack of cards by repeatedly inserting the top card into a random place in the rest of the pile. Then consider the moment at which the original bottom card first reaches the top of the pile. It does not take too much to reassure oneself that after the next move, we have a strong stationary time, since every card has been randomised at least once, and the position of the other cards is independent of how long it took the original bottom card to rise to the top.

Disadvantages

So why do we not consider Cesaro mixing rather than the conventional variety? Well, mainly because of how we actually use mixing times. The Metropolis algorithm gives a way to generate chains with a particular equilibrium distribution, including ones for which it is hard to sample directly. Mixing time theory then gives a quantitative answer to the question of how long it is necessary to run such a chain for before it gives a good estimate to the equilibrium distribution. In many cases, such a random walk on a large unknown network, the main aim when applying such Monte Carlo procedures is to minimise the difficulty of calculation. For Cesaro mixing, you have to store all the information about path states, while for conventional mixing you only care about your current location.

The other phenomenon that is lacking in Cesaro mixing is cutoff. This is where the total variation distance

$d_n(t)=||P^t(x,\cdot)-\pi||_{TV}$

converges to 0 suddenly. More formally, there is some timescale f(n) such that

$\lim_{n\rightarrow\infty}d_n(cf(n))=\begin{cases}1& c<1\& c>1,\end{cases}$

so in the n-limit, the graph of d looks like a step-function. Several of the shuffling chains exhibit this property, leading to the statements like “7 shuffles are required to mix a standard pack of cards”. Cesaro mixing smooths out this effect on an f(n) timescale.

A Further Example

Perhaps the best example where Cesaro mixing happens faster than normal mixing is in the case of a lazy biased random walk on $\mathbb{Z}_n$. (Ex. 4.10 in MTMC) Here, we stay put with probability 1/2, otherwise move clockwise with probability p>1/4 and counter-clockwise with probability 1/2 – p < 1/4. This chain is not reversible, as we can determine the direction (or arrow) of time by examining a path. Roughly speaking, the chain will drift clockwise at rate 2p – 1/2 > 0. In particular, at some time Kn, where K is large, we would expect to have completed ~ $\frac{K}{2p-\frac12}$ circuits of the vertices, and so the occupation measure will be close to the uniform equilibrium distribution if we choose K large enough.

On the other hand, the distribution of X at time Kn is still fairly concentrated. If we assume we are instead performing the random walk on $\mathbb{Z}$, the distribution after Kn i.i.d. increments is

$X_t\sim N(Kn(2p-\frac12), Kn\sigma^2).$

That is, the standard deviation is $O(\sqrt{n})<. So, even once we return to considering the random walk on the cyclic group, if we view it as a circle, we expect most of the probability mass to be concentrated near $Kn(2p-\frac12) \mod n$. By an identical heuristic argument, we see that the mixing time is achieved when the variance has order n, that is when time has order $n^2$.