# DGFF 4 – Properties of the Green’s function

I’m at UBC this month for the PIMS probability summer school. One of the long courses is being given by Marek Biskup about the Discrete Gaussian Free Field (notes and outline here) so this seems like a good moment to revive the sequence of posts about the DGFF. Here’s DGFF1, DGFF2, DGFF3 from November.

The first draft of this post was about the maximum of the DGFF in a large box $V_N$, and also about the Green’s function $G^{V_N}(x,y)$, which specifies the covariance structure of the DGFF. This first draft also became too long, so I’m splitting it into two somewhat shorter ones. As we’ll see, some understanding and standard estimates of the Green’s function is enough to say quite a bit about the maximum. In this first post, we’ll explore some ‘low-hanging fruit’ concerning the Green’s function, as defined through a simple random walk, which are useful, but rarely explained in the DGFF literature.

Symmetry of Green’s function

We start with one of these low-hanging fruit. If $G^{V_N}$ is to be a covariance matrix, it has to be symmetric. In the first post, showing that the definition of the DGFF as a random field with given Hamiltonian is equivalent to $\mathcal{N}(0,G^{V_N})$ certainly can be viewed as a proof of symmetry. However, it would be satisfying if there was a direct argument in the language of the definition of the Green’s function.

To make this self-contained, recall the random walk definition of $G^{V_N}(x,y)$. Let $(S_m)_{m\ge 0}$ be simple random walk on $V_N$, and $\mathbb{P}_x,\,\mathbb{E}_x$ denote starting the random walk at $x\in V_N$. As usual, let $\tau_y,\,\tau_A$ denote the hitting time of a vertex y or a set A respectively. Then

$G^{V_N}(x,y):= \mathbb{E}_x \left[ \sum_{m=0}^{\tau_{\partial V_N}}1_{(S_m=y) }\right].$

That is, $G^{V_N}(x,y)$ is the expected number of visits to y by a random walk from x, before it exits $V_N$.

Let’s drop the superscript for now, as everything should hold for a more general subset of the lattice. I don’t think it’s immediately obvious at the level of Markov chains why G(x,y)=G(y,x). In particular, it’s not the case that

$\mathbb{P}_x(\tau_y < \tau_{D^c}) = \mathbb{P}_y(\tau_x <\tau_{D^c}),$

and it feels that we can’t map between paths $x \to \partial D$ and $y\to \partial D$ in a way that preserves the number of visits to y and x, respectively. However, we can argue that for any m

$\mathbb{P}_x(S_m=y, \tau_{D^c}>m) = \mathbb{P}_y(S_m=x, \tau_{D^c}>m),$

by looking at the suitable paths of $(S_m)$. That is, if we have a path $x=S_0,S_1,\ldots,S_m=y$ that stays within D, then the probability of seeing this path starting from x and its reverse direction starting from y are equal. Why? Because

$\mathbb{P}_x(S_0=x,S_1=v_1,\ldots,S_{m-1}=v_{m-1},S_m=y) = \prod_{\ell=0}^{m-1} \frac{1}{\mathrm{deg}(v_\ell)},$

and

$\mathbb{P}_y(S_0=y,S_1=v_{m-1},\ldots,S_{m-1}=v_1, S_m=x) = \prod_{\ell=0}^{m-1} \frac{1}{\mathrm{deg}(v_{m-\ell})} = \prod_{\ell=1}^m \frac{1}{\mathrm{deg}(v_\ell)}.$

Since $D\subset \mathbb{Z}^d$ and x,y are in the interior of D, we must have $\mathrm{deg}(x)=\mathrm{deg}(y)$, and so these two expressions are equal. Summing over all such two-way paths, and then all m gives the result.

Fixing one argument

We now focus on $G^D(\cdot,y)$, where the second argument is fixed. This is the solution to the Poisson equation

$\Delta G^D(\cdot,y) = -\delta_y(\cdot),\quad G^D(x,y)=0,\; \forall x\in \partial D.$

To see this, can use a standard hitting probability argument (as here) with the Markov property. This is harmonic in $D\backslash \{y\}$, and since we know

$G^D(y,y)= \frac{1}{\mathbb{P}_y(\text{RW hits }\partial D\text{ before returning to }y)},$

this uniquely specifies $G^D(\cdot,y)$. Anyway, since harmonic functions achieve their maxima at the boundary, we have $G(y,y)\ge G(x,y)$ for all $x\in D$. We can also see this from the SRW definition as

$G(x,y)=G(y,x) = \mathbb{P}_y (\tau_x < \tau_{\partial D} ) G(x,x) \le G(x,x).$

Changing the domain

Now we want to consider nested domains $D\subset E$, and compare $G^D(\cdot,\cdot)$ and $G^E(\cdot,\cdot)$ on DxD. The idea is that for SRW started from $x\in D$, we have $\tau_{\partial D}\le \tau_{\partial E}$, since one boundary is contained within the other. From this, we get

$G^D(x,y)\le G^E(x,y),\quad \forall x,y\in D,$

and we will use the particular case y=x.

For example, if $x\in V_N$, the box with width N, then the box with width 2N centred on x contains the whole of $V_N$. So, if we set $\bar {V}_{2N}:= [-N,N]^d$, then with reference to the diagram, we have

$G^{V_N}(x,x)\le G^{\bar{V}_{2N}}(0,0),\quad x\in V_N.$

As we’ll see when we study the maximum of the DGFF on $V_N$, uniform control over the pointwise variance will be a useful tool.

Maximising the Green’s function

The idea of bounding $G^{V_N}(x,x)$ by $G^{\bar V_{2N}}(0,0)$ for any $x\in V_N$ is clever and useful. But a more direct approach would be to find the value of x that maximises $G^{V_N}(x,x)$. We would conjecture that when $V_N$ has a central vertex, then this is the maximiser.

We can prove this directly from the definition of the Green’s function in terms of random walk occupation times. Let’s assume we are working with $\bar{V}_N$ for even N, so that 0 is the central vertex. Again, since

$G^D(x,x)=\frac{1}{\mathbb{P}_x(\text{RW hits }\partial D\text{ before returning to }x)},$ (*)

it would suffice to show that this probability is minimised when x=0. This feels right, since 0 is furthest from the boundary. Other points are closer to the boundary in some directions but further in others, so we can’t condition on the maximum distance from its start point achieved by an excursion of SRW (we’re vertex-transitive, so these look the same from all starting points), as even allowing for the four possible rotations, for an excursion of diameter slightly larger than N, starting at the centre is maximally bad.

However, intuitively it does feel as if being closer to the boundary makes you more likely to escape earlier. In fact, with a bit more care, we can couple the SRW started from 0 and the SRW started from $r=(r^x,r^y)\ne 0$ such that the latter always exits first. For convenience we’ll assume also that $r^x,r^y$ are both even.

I couldn’t find any reference to this, so I don’t know whether it’s well-known or not. The following argument involves projecting into each axis, and doing separate couplings for transitions in the x-direction and transitions in the y-direction. We assume WLOG that x is in the upper-right quadrant as shown. Then, let $0=S_0,S_1,S_2,\ldots$ be SRW started from 0, and we will construct $r=R_0,R_1,R_2,\ldots$ on the same probability space as $(S_m)_{m\ge 0}$ as follows. For every m, we set the increment $R_{m+1}-R_m$ to be $\pm(S_{m+1}-S_m)$. It remains to specify the sign, which will be determined by the direction of the S-increment, and a pair of stopping times. The marginal is therefore again an SRW, started from r. Temporarily, we use the unusual notation $S_m= (S^x_m,S^y_m)$ for the coordinates of $S_m$.

So, if $S_{m+1}-S_m=(1,0), (-1,0)$, ie S moves left or right, then we set

$R_{m+1}-R_m = \begin{cases} -(S_{m+1}-S_m) &\quad \text{if }mT^x.\end{cases}$ (*)

where $T^x:= \min\{m\,:\, R^x_m=S^x_m\}$. That is, $R^x$ moves in the opposing direction to $S^x$ until the first time when they are equal (hence the parity requirement), and then they move together. WLOG assume that $r^x>0$. Then suppose $S^x_m=\pm N$ and such m is minimal. Then by construction, if $m\ge T^x$, then $R^x_m=\pm N$ also. If $m, then we must have $S^x_m=-N$, and so since $R^x$‘s trajectory is a mirror image of $S^x$‘s, in fact $R^x_m = N+r^x>N$, so $R^x$ hit +N first. In both cases, we see that $R^x$ hits $\pm N$ at the same time or before $S^x$.

In other words, when $S^x_m$ has non-negative x coordinate, the lazy random walk $R^x$ follows the same trajectory as $S^x$, and when it has negative x coordinate, the $R^x$ mirrors $S^x$. At some time, it may happen that $S^x_m= R^x_m=0$ (recall the parity condition on r). Call this time $T^x$. We then adjust the description of the coupling so that (*) is the mechanism for $m, and then for $m\ge T^x$, we take $S^x_m=R^x_m$.

Similarly, if $S_{m+1}-S_m =(0,1), (0,-1)$, ie S moves up or down, then we set

$R_{m+1}-R_m = \begin{cases} -(S_{m+1}-S_m)&\quad \text{ if }m

with corresponding definition of the stopping time $T^y$.

This completes the coupling, and by considering $T^x\wedge T^y$, we have shown what that the exit time for the walk started from zero dominates the exit time for walk started from r. Recall that so far we are in the case where the box has even width and $r=(r^x,r^y)$ has even coordinates.

This exit time comparison isn’t exactly what we need to compare $G^N(0,0)$ and $G^N(x,x)$. It’s worth remarking at this stage that if all we cared about was the Green’s function on the integer line [-N,N], we would have an easier argument, as by the harmonic property of $G(\cdot,y)$

$G^{[-N,N]}(0,r)=\frac{N-r}{N}G^{[-N,N]}(0,0),$

$G^{[-N,N]}(r,0) = \frac{N}{N+r}G^{[-N,N]}(r,r),$

and so $G(0,0)>G(r,r)$ follows by symmetry. To lift from 1D to 2D directly, we need a bit more than this. It’s possible that S returns in both x- and y- coordinates more often than R, but never at the same time. Fortunately, the coupling we defined slightly earlier does give us a bit more control.

Let $\tau^x(S), \tau^x(R)$ be the first times that $S^x, R^x$ hit $\pm N$. Under this coupling, for any $m\ge 0$

$\mathbb{P}(S^x_m=0, m

since these events are literally equal. Since we showed that $\tau^x(R)\le \tau^x(S)$ almost surely, we can further deduce

$\mathbb{P}(S^x_m=0,m

$=\mathbb{P}(R^x_m=r^x, m

To address the corresponding events for which $m\ge T^x$, we apply the strong Markov property at $T^x$, to obtain SRW $Z_m$ started from r/2, and let $\tau_{-N},\tau_{+N}$ be the hitting times of $-N,+N$ respectively and $\tau_{\pm N}=\tau_{-N}\wedge \tau_{+N}$. It will now suffice to prove that

$\mathbb{P}(Z_m=0, m< \tau_{\pm N}) \ge \mathbb{P}(Z_m=r,m<\tau_{\pm N}),$ (**)

as then we can apply the law of total probability and sum over values of $T^x$ and $m\ge 0$.

To prove this result, we consider the following bijection between trajectories of length m from r/2 to {0,r}. We decompose the trajectories into excursions away from r/2, and then a final meander from r/2 to {0,r} that stays on the same side of r/2. We construct the new trajectory by preserving all the initial excursions, but reversing all the steps of the final meander. So if the original trajectory ended up at 0, the image ends up at r. Trivially, the initial excursions in the image only hit $\pm N$ if the excursions in the original trajectory did this too. But it’s also easy to see, by a similar argument to the coupling at the start of this section, that if the original trajectory ends at r and does not hit $\pm N$, then so does the image. However, the converse is not true. So we conclude (**), and thus

$\mathbb{P}(S_m^x=0) \ge \mathbb{P}(R_m^x=0)$

for all m by combining everything we have seen so far. And so we can now lift to a statement about $S_m$ itself, that is considering both coordinates separately.

The remaining cases for r require a little more care over the definition of $T^x$, though the same projection argument works, for fundamentally the same reason. (Note that in the above argument, if $S^x_m=-N$ and $m, then in fact $R^x_m\ge N+2$, and so it’s not hard to convince yourself that a sensible adjustment to the stopping time will allow a corresponding result with $R^x_m\ge N+1$ in the odd $r^x$ case.) The case for N odd is harder, since in one dimension there are two median sites, and it’s clear by symmetry that we can’t couple them such that RW from one always exits at least as early as RW from the other. However, the distributions of exit times started from these two sites are the same (by symmetry), and so although we can’t find a coupling, we can use similar stopping times to obtain a result in probability.

In the next post, we’ll see how to apply this uniform bound on $G^{V_N}(x,x)$ to control the maximum of the DGFF on $V_N$. In particular, we address how the positive correlations of DGFF influence the behaviour of the maximum by comparison with independent Gaussians at each site.

# Random walks conditioned to stay positive

In this post, I’m going to discuss some of the literature concerning the question of conditioning a simple random walk to lie above a line with fixed gradient. A special case of this situation is conditioning to stay non-negative. Some notation first. Let $(S_n)_{n\ge 0}$ be a random walk with IID increments, with distribution X. Take $\mu$ to be the expectation of these increments, and we’ll assume that the variance $\sigma^2$ is finite, though at times we may need to enforce slightly stronger regularity conditions.

(Although simple symmetric random walk is a good example for asymptotic heuristics, in general we also assume that if the increments are discrete they don’t have parity-based support, or any other arithmetic property that prevents local limit theorems holding.)

We will investigate the probability that $S_n\ge 0$ for n=0,1,…,N, particularly for large N. For ease of notation we write $T=\inf\{n\ge 0\,:\, S_n<0\}$ for the hitting time of the negative half-plane. Thus we are interested in $S_n$ conditioned on T>N, or T=N, mindful that these might not be the same. We will also discuss briefly to what extent we can condition on $T=\infty$.

In the first paragraph, I said that this is a special case of conditioning SRW to lie above a line with fixed gradient. Fortunately, all the content of the general case is contained in the special case. We can repose the question of $S_n$ conditioned to stay above $n\alpha$ until step N by the question of $S_n-n\alpha$ (which, naturally, has drift $\mu-\alpha$) conditioned to stay non-negative until step N, by a direct coupling.

Applications

Simple random walk is a perfectly interesting object to study in its own right, and this is a perfectly natural question to ask about it. But lots of probabilistic models can be studied via naturally embedded SRWs, and it’s worth pointing out a couple of applications to other probabilistic settings (one of which is the reason I was investigating this literature).

In many circumstances, we can desribe random trees and random graphs by an embedded random walk, such as an exploration process, as described in several posts during my PhD, such as here and here. The exploration process of a Galton-Watson branching tree is a particularly good example, since the exploration process really is simple random walk, unlike in, for example, the Erdos-Renyi random graph G(N,p), where the increments are only approximately IID. In this setting, the increments are given by the offspring distribution minus one, and the hitting time of -1 is the total population size of the branching process. So if the expectation of the offspring distribution is at most 1, then the event that the size of the tree is large is an atypical event, corresponding to delayed extinction. Whereas if the expectation is greater than one, then it is an event with limiting positive probability. Indeed, with positive probability the exploration process never hits -1, corresponding to survival of the branching tree. There are plenty of interesting questions about the structure of a branching process tree conditional on having atypically large size, including the spine decomposition of Kesten [KS], but the methods described in this post can be used to quantify the probability, or at least the scale of the probability of this atypical event.

In my current research, I’m studying a random walk embedded in a construction of the infinite-volume DGFF pinned at zero, as introduced by Biskup and Louidor [BL]. The random walk controls the gross behaviour of the field on annuli with dyadically-growing radii. Anyway, in this setting the random walk has Gaussian increments. (In fact, there is a complication because the increments aren’t exactly IID, but that’s definitely not a problem at this level of exposition.) The overall field is decomposed as a sum of the random walk, plus independent DGFFs with Dirichlet boundary conditions on each of the annuli, plus asymptotically negligible corrections from a ‘binding field’. Conditioning that this pinned field be non-negative up to the Kth annulus corresponds to conditioning the random walk to stay above the magnitude of the minimum of each successive annular DGFF. (These minima are random, but tightly concentrated around their expectations.)

Conditioning on $\{T > N\}$

When we condition on $\{T>N\}$, obviously the resulting distribution (of the process) is a mixture of the distributions we obtain by conditioning on each of $\{T=N+1\}, \{T=N+2\},\ldots$. Shortly, we’ll condition on $\{T=N\}$ itself, but first it’s worth establishing how to relate the two options. That is, conditional on $\{T>N\}$, what is the distribution of T?

Firstly, when $\mu>0$, this event always has positive probability, since $\mathbb{P}(T=\infty)>0$. So as $N\rightarrow\infty$, the distribution of the process conditional on $\{T>N\}$ converges to the distribution of the process conditional on survival. So we’ll ignore this for now.

In the case $\mu\le 0$, everything is encapsulated in the tail of the probabilities $\mathbb{P}(T=N)$, and these tails are qualitatively different in the cases $\mu=0$ and $\mu<0$.

When $\mu=0$, then $\mathbb{P}(T=N)$ decays polynomially in N. In the special case where $S_n$ is simple symmetric random walk (and N has the correct parity), we can check this just by an application of Stirling’s formula to count paths with this property. By contrast, when $\mu<0$, even demanding $S_N=-1$ is a large deviations event in the sense of Cramer’s theorem, and so the probability decays exponentially with N. Mogulskii’s theorem gives a large deviation principle for random walks to lie above a line defined on the scale N. The crucial fact here is that the probabilistic cost of staying positive until N has the same exponent as the probabilistic cost of being positive at N. Heuristically, we think of spreading the non-expected behaviour of the increments uniformly through the process, at only polynomial cost once we’ve specified the multiset of values taken by the increments. So, when $\mu<0$, we have

$\mathbb{P}(T\ge(1+\epsilon)N) \ll \mathbb{P}(T= N).$

Therefore, conditioning on $\{T\ge N\}$ in fact concentrates T on N+o(N). Whereas by contrast, when $\mu=0$, conditioning on $\{T\ge N\}$ gives a nontrivial limit in distribution for T/N, supported on $[1,\infty)$.

A related problem is the value taken by $S_N$, conditional on {T>N}. It’s a related problem because the event {T>N} depends only on the process up to time N, and so given the value of $S_N$, even with the conditioning, after time N, the process is just an unconditioned RW. This is a classic application of the Markov property, beloved in several guises by undergraduate probability exam designers.

Anyway, Iglehart [Ig2] shows an invariance principle for $S_N | T>N$ when $\mu<0$, without scaling. That is $S_N=\Theta(1)$, though the limiting distribution depends on the increment distribution in a sense that is best described through Laplace transforms. If we start a RW with negative drift from height O(1), then it hits zero in time O(1), so in fact this shows that conditonal on $\{T\ge N\}$, we have T= N +O(1) with high probability. When $\mu=0$, we have fluctuations on a scale $\sqrt{N}$, as shown earlier by Iglehart [Ig1]. Again, thinking about the central limit theorem, this fits the asymptotic description of T conditioned on T>N.

Conditioning on $T=N$

In the case $\mu=0$, conditioning on T=N gives

$\left[\frac{1}{\sqrt{N}}S(\lfloor Nt\rfloor ) ,t\in[0,1] \right] \Rightarrow W^+(t),$ (*)

where $W^+$ is a standard Brownian excursion on [0,1]. This is shown roughly simultaneously in [Ka] and [DIM]. This is similar to Donsker’s theorem for the unconditioned random walk, which converges after rescaling to Brownian motion in this sense, or Brownian bridge if you condition on $S_N=0$. Skorohod’s proof for Brownian bridge [Sk] approximates the event $\{S_N=0\}$ by $\{S_N\in[-\epsilon \sqrt{N},+\epsilon \sqrt{N}]\}$, since the probability of this event is bounded away from zero. Similarly, but with more technicalities, a proof of convergence conditional on T=N can approximate by $\{S_m\ge 0, m\in[\delta N,(1-\delta)N], S_N\in [-\epsilon \sqrt{N},+\epsilon\sqrt{N}]\}$. The technicalities here emerge since T, the first return time to zero, is not continuous as a function of continuous functions. (Imagine a sequence of processes $f^N$ for which $f^N(x)\ge 0$ on [0,1] and $f^N(\frac12)=\frac{1}{N}$.)

Once you condition on $T=N$, the mean $\mu$ doesn’t really matter for this scaling limit. That is, so long as variance is finite, for any $\mu\in\mathbb{R}$, the same result (*) holds, although a different proof is in general necessary. See [BD] and references for details. However, this is particularly clear in the case where the increments are Gaussian. In this setting, we don’t actually need to take a scaling limit. The distribution of Gaussian *random walk bridge* doesn’t depend on the mean of the increments. This is related to the fact that a linear transformation of a Gaussian is Gaussian, and can be seen by examining the joint density function directly.

Conditioning on $T=\infty$

When $\mu>0$, the event $\{T=\infty\}$ occurs with positive probability, so it is well-defined to condition on it. When $\mu\le 0$, this is not the case, and so we have to be more careful.

First, an observation. Just for clarity, let’s take $\mu<0$, and condition on $\{T>N\}$, and look at the distribution of $S_{\epsilon N}$, where $\epsilon>0$ is small. This is approximately given by

$\frac{S_{\epsilon N}}{\sqrt{N}}\stackrel{d}{\approx}W^+(\epsilon).$

Now take $\epsilon\rightarrow\infty$ and consider the RHS. If instead of the Brownian excursion $W^+$, we instead had Brownian motion, we could specify the distribution exactly. But in fact, we can construct Brownian excursion as the solution to an SDE:

$\mathrm{d}W^+(t) = \left[\frac{1}{W^+(t)} - \frac{W^+(t)}{1-t}\right] \mathrm{d}t + \mathrm{d}B(t),\quad t\in(0,1)$ (**)

for B a standard Brownian motion. I might return in the next post to why this is valid. For now, note that the first drift term pushes the excursion away from zero, while the second term brings it back to zero as $t\rightarrow 1$.

From this, the second drift term is essentially negligible if we care about scaling $W^+(\epsilon)$ as $\epsilon\rightarrow 0$, and we can say that $W^+(\epsilon)=\Theta(\sqrt{\epsilon})$.

So, returning to the random walk, we have

$\frac{S_{\epsilon N}}{\sqrt{\epsilon N}}\stackrel{d}{\approx} \frac{W^+(\epsilon)}{\sqrt{\epsilon}} = \Theta(1).$

At a heuristic level, it’s tempting to try ‘taking $N\rightarrow\infty$ while fixing $\epsilon N$‘, to conclude that there is a well-defined scaling limit for the RW conditioned to stay positive forever. But we came up with this estimate by taking $N\rightarrow\infty$ and then $\epsilon\rightarrow 0$ in that order. So while the heuristic might be convincing, this is not the outline of a valid argument in any way. However, the SDE representation of $W^+$ in the $\epsilon\rightarrow 0$ regime is useful. If we drop the second drift term in (**), we define the three-dimensional Bessel process, which (again, possibly the subject of a new post) is the correct scaling limit we should be aiming for.

Finally, it’s worth observing that the limit $\{T=\infty\}=\lim_{N\rightarrow\infty} \{T>N\}$ is a monotone limit, and so further tools are available. In particular, if we know that the trajectories of the random walk satisfy the FKG property, then we can define this limit directly. It feels intuitively clear that random walks should satisfy the FKG inequality (in the sense that if a RW is large somewhere, it’s more likely to be large somewhere else). You can do a covariance calculation easily, but a standard way to show the FKG inequality applies is by verifying the FKG lattice condition, and unless I’m missing something, this is clear (though a bit annoying to check) when the increments are Gaussian, but not in general. Even so, defining this monotone limit does not tell you that it is non-degenerate (ie almost-surely finite), for which some separate estimates would be required.

A final remark: in a recent post, I talked about the Skorohod embedding, as a way to construct any centered random walk where the increments have finite variance as a stopped Brownian motion. One approach to conditioning a random walk to lie above some discrete function is to condition the corresponding Brownian motion to lie above some continuous extension of that function. This is a slightly stronger conditioning, and so any approach of this kind must quantify how much stronger. In Section 4 of [BL], the authors do this for the random walk associated with the DGFF conditioned to lie above a polylogarithmic curve.

References

[BD] – Bertoin, Doney – 1994 – On conditioning a random walk to stay nonnegative

[BL] – Biskup, Louidor – 2016 – Full extremal process, cluster law and freezing for two-dimensional discrete Gaussian free field

[DIM] – Durrett, Iglehart, Miller – 1977 – Weak convergence to Brownian meander and Brownian excursion

[Ig1] – Iglehart – 1974 – Functional central limit theorems for random walks conditioned to stay positive

[Ig2] – Iglehart – 1974 – Random walks with negative drift conditioned to stay positive

[Ka] – Kaigh – 1976 – An invariance principle for random walk conditioned by a late return to zero

[KS] – Kesten, Stigum – 1966 – A limit theorem for multidimensional Galton-Watson processes

[Sk] – Skorohod – 1955 – Limit theorems for stochastic processes with independent increments

# Skorohod embedding

Background

Suppose we are given a standard Brownian motion $(B_t)$, and a stopping time T. Then, so long as T satisfies one of the regularity conditions under which the Optional Stopping Theorem applies, we know that $\mathbb{E}[B_T]=0$. (See here for a less formal introduction to OST.) Furthermore, since $B_t^2-t$ is a martingale, $\mathbb{E}[B_T^2]=\mathbb{E}[T]$, so if the latter is finite, so is the former.

Now, using the strong Markov property of Brownian motion, we can come up with a sequence of stopping times $0=T_0, T_1, T_2,\ldots$ such that the increments $T_k-T_{k-1}$ are IID with the same distribution as T. Then $0,B_{T_1},B_{T_2},\ldots$ is a centered random walk. By taking T to be the hitting time of $\{-1,+1\}$, it is easy to see that we can embed simple random walk in a Brownian motion using this approach.

Embedding simple random walk in Brownian motion.

The Skorohod embedding question asks: can all centered random walks be constructed in this fashion, by stopping Brownian motion at a sequence of stopping time? With the strong Markov property, it immediately reduces the question of whether all centered finite-variance distributions X can be expressed as $B_T$ for some integrable stopping time T.

The answer to this question is yes, and much of what follows is drawn from, or at least prompted by Obloj’s survey paper which details the problem and rich history of the many approaches to its solution over the past seventy years.

Applications and related things

The relationship between random walks and Brownian motion is a rich one. Donsker’s invariance principle asserts that Brownian motion appears as the scaling limit of a random walk. Indeed, one can construct Brownian motion itself as the limit of a sequence of consistent random walks with normal increments on an increasingly dense set of times. Furthermore, random walks are martingales, and we know that continuous, local martingales can be expressed as a (stochastically) time-changed Brownian motion, from the Dubins-Schwarz theorem.

The Skorohod embedding theorem can be used to prove results about random walks with general distribution by proving the corresponding result for Brownian motion, and checking that the construction of the sequence of stopping times has the right properties to allow the result to be carried back to the original setting. It obviously also gives a coupling between a individual random walk and a Brownian motion which may be useful in some contexts, as well as a coupling between any pair of random walks. This is useful in proving results for random walks which are much easier for special cases of the distribution. For example, when the increments are Gaussian, or when there are combinatorial approaches to a problem about simple random walk. At the moment no aspect of this blog schedule is guaranteed, but I plan to talk about the law of the iterated logarithm shortly, whose proof is approachable in both of these settings, as well as for Brownian motion, and Skorohod embedding provides the route to the general proof.

At the end, we will briefly compare some other ways to couple a random walk and a Brownian motion.

One thing we could do is sample a copy of X independently from the Brownian motion, then declare $T= \tau_{X}:= \inf\{t\ge 0: B_t=X\}$, the hitting time of (random value) X. But recall that unfortunately $\tau_x$ has infinite expectation for all non-zero x, so this doesn’t fit the conditions required to use OST.

Skorohod’s original method is described in Section 3.1 of Obloj’s notes linked above. The method is roughly to pair up positive values taken by X appropriately with negative values taken by X in a clever way. If we have a positive value b and a negative value a, then $\tau_{a,b}$, the first hitting time of $\mathbb{R}\backslash (a,b)$ is integrable. Then we choose one of these positive-negative pairs according to the projection of the distribution of X onto the pairings, and let T be the hitting time of this pair of values. The probability of hitting b conditional on hitting {a,b} is easy to compute (it’s $\frac{-a}{b-a}$) so we need to have chosen our pairs so that the ‘probability’ of hitting b (ie the density) comes out right. In particular, this method has to start from continuous distributions X, and treat atoms in the distribution of X separately.

The case where the distribution X is symmetric (that is $X\stackrel{d}=-X$) is particularly clear, as then the pairs should be $(-x,x)$.

However, it feels like there is enough randomness in Brownian motion already, and subsequent authors showed that indeed it wasn’t necessary to introduce extra randomness to provide a solution.

One might ask whether it’s possible to generate the distribution on the set of pairs (as above) out of the Brownian motion itself, but independently from all the hitting times. It feels like it might be possible to make the distribution on the pairs measurable with respect to

$\mathcal{F}_{0+} = \bigcap\limits_{t>0} \mathcal{F}_t,$

the sigma-algebra of events determined by limiting behaviour as $t\rightarrow 0$ (which is independent of hitting times). But of course, unfortunately $\mathcal{F}_{0+}$ has a zero-one law, so it’s not possible to embed non-trivial distributions there.

Dubins solution

The exemplar for solutions without extra randomness is due to Dubins, shortly after Skorohod’s original argument. The idea is to express the distribution X as the almost sure limit of a martingale. We first use the hitting time of a pair of points to ‘decide’ whether we will end up positive or negative, and then given this information look at the hitting time (after this first time) of two subsequent points to ‘decide’ which of four regions of the real interval we end up in.

I’m going to use different notation to Obloj, corresponding more closely with how I ended up thinking about this method. We let

$a_+:= \mathbb{E}[X \,|\, X>0], \quad a_- := \mathbb{E}[X\,|\, X<0],$ (*)

and take $T_1 = \tau_{\{a_-,a_+\}}$. We need to check that

$\mathbb{P}\left( B_{T_1}=a_+\right) = \mathbb{P}\left(X>0\right),$

for this to have a chance of working. But we know that

$\mathbb{P}\left( B_{T_1}=a_+\right) = \frac{a_+}{a_+-a_-},$

and we can also attack the other side using (*) and the fact that $\mathbb{E}[X]=0$, using the law of total expectation:

$0=\mathbb{E}[X]=\mathbb{E}[X\,|\, X>0] \mathbb{P}(X>0) + \mathbb{E}[X\,|\,X<0]\mathbb{P}(X<0) = a_+ \mathbb{P}(X>0) + a_- \left(1-\mathbb{P}(X>0) \right),$

$\Rightarrow\quad \mathbb{P}(X>0)=\frac{a_+}{a_+-a_-}.$

Now we define

$a_{++}=\mathbb{E}[X \,|\, X>a_+],\quad a_{+-}=\mathbb{E}[X\,|\, 0

and similarly $a_{-+},a_{--}$. So then, conditional on $B_{T_1}=a_+$, we take

$T_2:= \inf_{t\ge T_1}\left\{ B_t\not\in (a_{+-},a_{++}) \right\},$

and similarly conditional on $B_{T_1}=a_-$. By an identical argument to the one we have just deployed, we have $\mathbb{E}\left[B_{T_2} \,|\,\mathcal{F}_{T_1} \right] = B_{T_1}$ almost surely. So, although the $a_{+-+}$ notation now starts to get very unwieldy, it’s clear we can keep going in this way to get a sequence of stopping times $0=T_0,T_1,T_2,\ldots$ where $B_{T_n}$ determines which of the $2^n$ regions of the real line any limit $\lim_{m\rightarrow\infty} B_{T_m}$ should lie in.

A bit of work is required to check that the almost sure limit $T_n\rightarrow T$ is almost surely finite, but once we have this, it is clear that $B_{T_n}\rightarrow B_T$ almost surely, and $B_T$ has the distribution required.

We want to know how close we can make this coupling between a centered random walk with variance 1, and a standard Brownian motion. Here, ‘close’ means uniformly close in probability. For large times, the typical difference between one of the stopping times $0,T_1,T_2,\ldots$ in the Skorohod embedding and its expectation (recall $\mathbb{E}[T_k]=k$) is $\sqrt{n}$. So, constructing the random walk $S_0,S_1,S_2,\ldots$ from the Brownian motion via Skorohod embedding leads to

$\left |S_k - B_k \right| = \omega(n^{1/4}),$

for most values of $k\le n$. Strassen (1966) shows that the true scale of the maximum

$\max_{k\le n} \left| S_k - B_k \right|$

is slightly larger than this, with some extra powers of $\log n$ and $\log\log n$ as one would expect.

The Komlos-Major-Tusnady coupling is a way to do a lot better than this, in the setting where the distribution of the increments has a finite MGF near 0. Then, there exists a coupling of the random walk and the Brownian motion such that

$\max_{k\le n}\left|S_k- B_k\right| = O(\log n).$

That is, there exists C such that

$\left[\max_{k\le n} \left |S_k-B_k\right| - C\log n\right] \vee 0$

is a tight family of distributions, indeed with uniform exponential tail. To avoid digressing infinitely far from my original plan to discuss the proof of the law of iterated logarithm for general distributions, I’ll stop here. I found it hard to find much coverage of the KMT result apart from the challenging original paper, and many versions expressed in the language of empirical processes, which are similar to random walks in many ways relevant to convergence and this coupling, but not for Skorohod embedding. So, here is a link to some slides from a talk by Chatterjee which I found helpful in getting a sense of the history, and some of the modern approaches to this type of normal approximation problem.

# DGFF 2 – Boundary conditions and Gibbs-Markov property

In the previous post, we defined the Discrete Gaussian Free Field, and offered some motivation via the discrete random walk bridge. In particular, when the increments of the random walk are chosen to be Gaussian, many natural calculations are straightforward, since Gaussian processes are well-behaved under conditioning and under linear transformations.

Non-zero boundary conditions

In the definition of the DGFF given last time, we demanded that $h\equiv 0$ on $\partial D$. But the model is perfectly well-defined under more general boundary conditions.

It’s helpful to recall again the situation with random walk and Brownian bridge. If we want a Brownian motion which passes through (0,0) and (1,s), we could repeat one construction for Brownian bridge, by taking a standard Brownian motion and conditioning (modulo probability zero technicalities) on passing through level s at time 1. But alternatively, we could set

$B^{\mathrm{drift-br}}(t) = B(t)+ t(s-B(1)),\quad t\in[0,1],$

or equivalently

$B^{\mathrm{drift-br}}(t)=B^{\mathrm{br}}(t)+ st, \quad t\in[0,1].$

That is, a Brownian bridge with drift can be obtain from a centered Brownian bridge by a linear transformation, and so certainly remains a Gaussian process. And exactly the same holds for a discrete Gaussian bridge: if we want non-zero values at the endpoints, we can obtain this distribution by taking the standard centred bridge and applying a linear transformation.

We can see how this works directly at the level of density functions. If we take $0=Z_0,Z_1,\ldots,Z_{N-1},Z_N=0$ a centred Gaussian bridge, then the density of $Z=\mathbf{z}\in \mathbb{R}^{N+1}$ is proportional to

$\mathbf{1}\{z_0=z_N=0\}\exp\left( -\frac12 \sum_{i=1}^N (z_i-z_{i-1})^2 \right).$ (3)

So rewriting $z_i= y_i- ki$ (where we might want $k=s/N$ to fit the previous example), the sum within the exponent rearranges as

$-\frac12 \sum_{i=1}^N (y_i-y_{i-1} - k)^2 = -\frac12 \sum_{i=1}^N (y_i-y_{i-1})^2 - 2k(y_N-y_0)+ Nk^2.$

So when the values at the endpoints $z_0,z_n,y_0,y_N$ are fixed, this middle term is a constant, as is the final term, and thus the density of the linearly transformed bridge has exactly the same form as the original one.

In two or more dimensions, the analogue of adding a linear function is to add a harmonic function. First, some notation. Let $\varphi$ be any function on $\partial D$. Then there is a unique harmonic extension of $\varphi$, for which $\nabla \varphi=0$ everywhere on D, the interior of the domain. Recall that $\nabla$ is the discrete graph Laplacian defined up to a constant by

$(\nabla \varphi) _x = \sum\limits_{x\sim y} \varphi_x - \varphi_y.$

If we want $h^D$ instead to have boundary values $\varphi$, it’s enough to replace $h^D$ with $h^D+\varphi$. Then, in the density for the DGFF ( (1) in the previous post), the term in the exponential becomes (ignoring the $\frac{1}{4d}$ )

$-\sum\limits_{x\sim y} \left[ (h^D_x-h^D_y)^2 + (\varphi_x-\varphi_y)^2 +2(h^D_x - h^D_y)(\varphi_x-\varphi_y)\right].$

For each $x\in D$, on taking this sum over its neighbours $y\in \bar D$, the final term vanishes (since $\varphi$ is harmonic), while the second term is just a constant. So the density of the transformed field, which we’ll call $h^{D,\varphi}$ is proportional to (after removing the constant arising from the second term above)

$\mathbf{1}\left\{h^{D,\varphi}_x = \varphi_x,\, x\in\partial D\right\} \exp\left( -\frac{1}{4d} \sum\limits_{x\sim y} \left( h^{D,\varphi}_x - h^{D,\varphi}_y \right)^2 \right).$

So $h^{D,\varphi}:= h^D + \varphi$ satisfies the conditions for the DGFF on D with non-zero boundary conditions $\varphi$.

Harmonic functions and RW – a quick review

Like the covariances in DGFF, harmonic functions on D are related to simple random walk on D stopped on $\partial D$. (I’m not claiming a direct connection right now.) We can define the harmonic extension $\varphi$ to an interior point x by taking $\mathbb{P}_x$ to be the law of SRW $x=Z_0,Z_1,Z_2,\ldots$ started from x, and then setting

$\varphi(x):= \mathbb{E}\left[ \varphi_{\tau_{\partial d}} \right],$

where $\tau_{\partial D}$ is the first time that the random walk hits the boundary.

Inverse temperature – a quick remark

In the original definition of the density of the DGFF, there is the option to add a constant $\beta>0$ within the exponential term so the density is proportional to

$\exp\left(-\beta \sum\limits_{x\sim y} (h_x-h_y)^2 \right).$

With zero boundary conditions, the effect of this is straightforward, as varying $\beta$ just rescales the values taken by the field. But with non-zero boundary conditions, the effect is instead to vary the magnitude of the fluctuations of the values of the field around the (unique) harmonic function on the domain with those BCs. In particular, when $\beta\rightarrow \infty$, the field is ‘reluctant to be far from harmonic’, and so $h^D \Rightarrow \varphi$.

This parameter $\beta$ is called inverse temperature. So low temperature corresponds to high $\beta$, and high stability, which fits some physical intuition.

A Markov property

For a discrete (Gaussian) random walk, the Markov property says that conditional on a given value at a given time, the trajectory of the process before this time is independent of the trajectory afterwards. The discrete Gaussian bridge is similar. Suppose we have as before $0=Z_0,Z_1,\ldots, Z_N=0$ a centred Gaussian bridge, and condition that $Z_k=y$, for $k\in\{1,\ldots,N-1\}$, and $y\in\mathbb{R}$. With this conditioning, the density (3) splits as a product

$\mathbf{1}\{z_0=z_N=0, z_k=y\}\exp\left(-\frac12 \sum\limits_{i=1}^N (z_i-z_{i-1})^2 \right) =$

$\mathbf{1}\{z_0=0,z_k=y\} \exp\left(-\frac12 \sum\limits_{i=1}^k (z_i-z_{i-1})^2 \right) \cdot \mathbf{1}\{z_k=y,z_N=0\} \exp\left(-\frac12 \sum\limits_{i=k+1}^N (z_i-z_{i-1})^2 \right).$

Therefore, with this conditioning, the discrete Gaussian bridge splits into a pair of independent discrete Gaussian bridges with drift. (The same would hold if the original process had drift too.)

The situation for the DGFF is similar, though rather than focusing on the condition, it makes sense to start by focusing on the sub-domain of interest. Let $A\subset D$, and take $B=\bar D\backslash A$. So in particular $\partial A\subset B$.

Then we have that conditional on $h^D\big|_{\partial A}$, the restricted fields $h^D\big|_{B\backslash \partial A}$ and $h^D\big|_A$ are independent. Furthermore, $h^D\big|_A$ has the distribution of the DGFF on A, with boundary condition given by $h^D\big|_{\partial A}$. As in the discrete bridge, this follows just by splitting the density. Every gradient term corresponds to an edge in the underlying graph that lies either entirely inside $\bar A$ or entirely inside B. This holds for a general class of Gibbs models where the Hamiltonian depends only on the sum of some function of the heights (taken to be constant in this ‘free’ model) and the sum of some function of their nearest-neighbour gradients.

One additional and useful interpretation is that if we only care about the field on the restricted region A, the dependence of $h^D\big|_A$ on $h^D\big|_{D\backslash A}$ comes only through $h^D\big|_{\partial A}$. But more than that, it comes only through the (random) harmonic function which extends the (random) values taken on the boundary of A to the whole of A. So, if $h^A$ is an independent DGFF on A with zero boundary conditions, we can construct the DGFF $h^D$ from its value on $D\backslash A$ via

$h^D_x \stackrel{d}= h^A_x + \varphi^{h^D\big|_{\partial A}},$

where $\varphi^{h^D\big|_{\partial A}}$ is the unique harmonic extension of the (random) values taken by $h^D$ on $\partial A$ to $\bar A$.

This Markov property is crucial to much of the analysis to come. There are several choices of the restricted domain which come up repeatedly. In the next post we’ll look at how much one can deduce by taking A to be the even vertices in D (recalling that every integer lattice $\mathbb{Z}^d$ is bipartite), and then taking A to be a finer sublattice within D. We’ll use this to get some good bounds on the probability that the DGFF is positive on the whole of D. Perhaps later we’ll look at a ring decomposition of $\mathbb{Z}^d$ consisting of annuli spreading out from a fixed origin. Then the distribution of the field at this origin can be considered, via the final idea discussed above, as the limit of an infinite sequence of random harmonic functions given by the values taken by the field at increasingly large radius from the origin. Defining the DGFF on the whole lattice depends on the existence or otherwise of this local limit.

# Parking on a ring, linear hashing

I’ve spent most of my doctorate trying to analyse how adding destructive dynamics affects the behaviour of a particular random growth process, the classical random graph. In this post I’m going to talk about another random growth process, which is slightly less natural, but for which one can show some similar qualitative properties.

The model, and the additive coalescent

Consider m places arranged in a circle, and for consistency of analogy we think of these as parking spaces. Some number n of cars will arrive one at a time. Each car will arrive at a space chosen uniformly at random. If it is empty they will park in it, otherwise they will look clockwise until they find an empty space, and park there. For now we are only interested in growth, so we assume cars never leave. We are interested in the sizes of blocks of consecutively parked cars.

The reason to consider this slightly unnatural statement is its equivalence to the problem of hashing with linear probing, apparently a key topic in computer science, which I won’t pretend that I know anything about. In any case, it’s a nice model, and it seems reasonable that it would have a basis in more realistic search algorithms.

So, how does the sequence of sizes of blocks of consecutively parked cars grow? Well, given the sequence of block sizes, it is reasonably easy to convince yourself that the order of the blocks around the circle is uniformly random, and the number of empty spaces between adjacent blocks is also uniformly random.

Assume for now that there are at least three blocks. A block of size x can merge with a block of size y with the arrival of the next car only if the blocks are adjacent, with exactly one empty space between them. The chance of this is uniform among all pairs of blocks. Now suppose this is the case, and that the block of size y lies clockwise from the block of size x. Then they will merge precisely if the next car arrives at any of the x occupied spaces in that block, or at the empty space between the pair of blocks. This has probability $\frac{x+1}{m}$. There’s also the opposite ordering to consider, where the block of size x lies clockwise from the other. The total probability of this merge $\{x,y\}\mapsto \{x+y+1\}$ is therefore proportional to (x+y+2).

So the process of block sizes looks a bit like the additive coalescent, at least for large blocks. This is in contrast to the random graph process, where the sequence of component sizes behaves exactly like a multiplicative coalescent, where blocks merge at a rate proportional to the product of their sizes.

Asymptotics

As in the random graph process, it’s interesting to ask roughly how large the largest block will be in such a configuration. Pittel [3] considers the case where the number of empty places $\ell = m-n \approx \beta m$, for some $\beta\in (0,1)$.

A less interesting model would be to choose the positions of the n cars uniformly at random. But then the size of a block is roughly geometric with parameter $\beta$, and there are $\Theta(m)$ blocks with high probability. Relatively straightforward calculations in extreme value theory suggest that the largest block is likely to have size on the order of $\log m$ in this setting.

Of course, the actual model is slightly more complicated, because the size of a block is self-reinforcing, since larger blocks are more likely to grow than smaller blocks. However, we can still get somewhere with naïve estimates. Let’s label the places clockwise. Then in order for there to be a block starting at 0 and stretching beyond $\alpha \log m$, a necessary condition is that at least $\alpha \log m$ cars arrive at those places. The number of cars which arrive at those places is binomial, since there are n cars, and each arrives at a place chosen uniformly, and independently of the other cars. So this event corresponds to

$\mathrm{Bin}(n,\frac{\alpha \log m}{m}) \ge \alpha \log m.$

Then, since $n\approx (1-\beta)n$, this event corresponds approximately to

$\mathrm{Po}((1-\beta)\alpha \log m) \ge \alpha \log m.$

The probability that a Poisson RV is at least a constant multiple larger than its mean decays exponentially with the mean, hence in this case the probability is asymptotically some negative power of m, depending on the value of $\alpha$. But there are $O(m)$ possible places for such a block to start, so whether we can apply a union bound usefully or not depends on whether the power of m is strictly less than -1.

Since all of this depends on $\alpha$, it is reasonable that everything is fine, and the largest block does have size at least $\alpha \log m$ when $\alpha$ is small, and very unlikely when $\alpha$ is large. This heuristic argument fits with Pittel’s theorem. Indeed, his result shows much stronger concentration: that the fluctuations of the size of the largest block are O(1).

Critical regime and empirical processes

The following is a paraphrase of the introduction and some methods from [2].

Obviously, once m=m cars have arrived, there’s no room for manoeuvre and definitely all the places are taken in one giant block. But it’s not obvious in general what scaling for the number of gaps will give rise to giant blocks of $\Theta(m)$ cars.

As for the random graph, we can find a process similar to the exploration process of a (random) graph which encodes much of the information we care about. Let $Y_k$ be the number of cars which arrive at place k. So the sum of the $Y_k$s will be n, the total number of cars. Now consider the process

$C_0=0, \ldots, C_{k+1}=C_k + Y_{k+1}-1.$

A block has the property that the number of arrivals within that set of places is equal to the number of places. So every time this *empirical process* C drops below its previous running minimum, this indicates the end of a block. To make this equivalence precise, we need to be a bit careful about where we start counting. It works exactly if we start at the beginning of a block. If not, it might introduce some unwanted divisions within the first block.

What we have is a process that looks roughly like a random walk that is constrained to pass through the point (m,n-m), which is equal to (m,-l). Even if we aren’t totally precise about how this is like a random walk, we would expect to see Brownian fluctuations after rescaling. Indeed, we might expect to see a Brownian bridge added to a deterministic linear function with negative gradient. But this is only meaningful if the random part is at least as large as the deterministic part, and since the fluctuations have order $\sqrt{m}$, if l is much larger than this, the rescaled empirical process is essentially deterministic, so we won’t see any macroscopic excursions above the minimum.

If l is substantially smaller than $\sqrt{m}$, then there is no real difference between (m,-l) and (m,0), and what we see is just a Brownian bridge. At this point, where we choose to start the process is actually important. If we were to start it at the minimum of the Brownian bridge instead, we would have seen a Brownian excursion, which corresponds to one block occupying (almost) all of the places.

Unsurprisingly, the story is completed by considering $\ell=\Theta(\sqrt{m})$, where the rescaled empirical process looks like a slanted Brownian bridge, that is Brownian motion conditioned to pass through $(1,-\frac{\ell}{\sqrt{m})$. There isn’t an obvious fix to the question of where to start the process, but it turns out that the correct way is now adding a Brownian excursion onto the deterministic linear function with gradient $- \frac{\ell}{\sqrt{m}}$. It’s now reasonable that the excursions above the minimum should macroscopic.

This scaling limit works dynamically as well, where the same Brownian excursion is used for different gradients of the deterministic line, corresponding to $\ell$ moving through the critical window $m-\Theta(\sqrt{m})$. Finally, a direction to Bertoin’s recent paper [1] for the model with an additional destructive property. Analogous to the forest fire, blocks of cars are removed at a rate proportional to their size (as a result, naturally, of ‘Molotov cocktails’…). Similar effects of self-organised criticality are seen when the rate of bombs is scaled appropriately.

References

[1] – Bertoin – Burning cars in a parking lot (paper / slides)

[2] – Chassaing + Louchard – Phase transition for parking blocks, Brownian excursion and coalescence (arXiv)

[3] – Pittel – Linear probing: the probable largest search time grows logarithmically with the number of records

# Fair games and the martingale strategy III

Gambler’s Ruin

Continuing directly from the previous post, the nicest example of the optional stopping theorem we developed there is to example a simple random walk constrained between two values, say 0 and N. This represents an idealised gambling situation, where the gambler stops playing either when they reach some pre-agreed profit, or when they go bankrupt. We assume that we start at level k, for k = 1,2,…,N-1.

Naturally, we want to know the probabilities of winning (ie getting to N) and losing (ie going bankrupt). We could set this up by conditioning on the first step. Let $p_k$ be the probability of winning starting from level k. Then we must have

$p_k= \frac12 p_{k+1}+\frac12 p_{k-1},\quad k=1,\ldots,N-1,$ (*)

with the obvious boundary conditions $p_0=0, p_N=1$. In an ideal world, we just know how to solve second order difference equations like (*). Well, actually it isn’t too hard, because we can see from (*) directly that

$p_{k+1}-p_k = p_k-p_{k-1},$

and so $p_k$ is a linear function of k, and so $p_k = k/N$ follows pretty much immediately.

But, we can also use OST profitably. Let T be the time at which we first hit 0 or N. It’s intuitively clear that this should have finite expectation, since the problems you might encounter with just the hitting time of a single level shouldn’t apply. Or you can consider the expected number of steps before you see N ups or downs in a row, which certainly provides an upper bound on T. This random number of steps is sort of geometric (at least, can be upper bounded by a geometric RV) and so has finite expectation. So can apply OST to X at T, and we have

$\mathbb{E}[X_T] = N\cdot \mathbb{P}(X_T=N) + 0 \cdot \mathbb{P}(X_T=0) = \mathbb{E}[X_0]=k,$

from which we also derive $p_k=k/N$.

The reason we talk about gambler’s ruin is by considering the limit $N\rightarrow\infty$ with k fixed. After a moment’s thought, it’s clear we can’t really talk about stopping the process when we hit infinity, since that won’t happen at any finite time. But we can ask what’s the probability that we eventually hit zero. Then, if we imagine a barrier at level N, the probability that we hit 0 at some point is bounded below by the probability that we hit 0 before we hit level N (given that we know we hit either zero or level N with probability one), and this is $\frac{N-k}{N}$, and by choosing N large enough, we can make this as close to 1 as we want. So the only consistent option is that the probability of hitting 0 at some point is one. Hence gambler’s ruin. With probability one, ruin will occur. There’s probably a moral lesson hiding there not especially subtly.

So the deal here seems to be that if you just care about your average, it doesn’t matter how to choose to play a sequence of fair games. But what if you care about something other than your average? In any real setting, we maybe care about slightly more than this. Suppose I offer you a bet on a coin toss: you get £3 if it comes up heads, and I get £1 if it comes up tails. Sounds like a good bet, since on average you gain a pound. But what about if you get £10,003 if it comes up heads and I get £10,001 if it comes up tails? I’m guessing you’re probably not quite so keen now.

But if you were an international bank, you might have fewer reservations about the second option. My intention is not to discuss whether our valuation of money is linear here, but merely to offer motivation for the financial option I’m about to propose. The point is that we are generally risk-averse (well, most of us, most of the time) and so we are scared of possible large losses, even when there is the possibility of large profits to balance it out.

Let’s assume we have our simple random walk, and for definiteness let’s say it starts at £1. Suppose (eg as a very niche birthday present) we have the following opportunity: at any point between now and time t=5, we have the right to buy one unit of the stock for £2.

We want to work out how much this opportunity, which from now on I’m going to call an option, is worth on average. Note that now it does seem that when we choose to cash in the option will have an effect on our return, and so we will have to include this in the analysis.

Note that, once we’ve bought a unit of the stock, we have an asset which is following a simple random walk (ie sequential fair games) and so from this point on its expected value remains unchanged. So in terms of expectation, we might as well sell the stock at the same moment we buy it. So if we cash in the option when the stock is currently worth £X, we will on average have a return of £(X-2). This means that we’ll only ever consider exercising our option if the current value of the stock is greater than £2. This narrows down our strategy slightly.

This sort of option minimises the risk of a large loss, since the worst thing that happens is that you never choose to exercise your option. So if you actually paid for the right to have this option, that cost is the largest amount you can lose. In the trading world, this type of opportunity is called an American option.

The trick here is to work backwards in time, thinking about strategies. If at time t=4, the stock is worth £1, then the best that can happen is that it’s worth £2 at time t=5, and this still gains you no wealth overall. Similarly if it’s worth £0 at time t=3. So we’ve identified a region where, if the stock value enters this region, we might as well rip up our contract, because we definitely aren’t going to gain anything. Remember now that we’ve also said you won’t ever cash in if the stock’s value is at most £2, because you don’t gain anything on average.

Now suppose that the stock has value £3 at time t=4. There’s no danger of it ever getting back below £2 during the lifetime of the option, so from now on your potential return is following the trajectory of a simple random walk, ie a fair game. So on average, it makes no difference whether you cash in now, or wait until t=5, or some combination of the two. The same argument holds if the stock has value £4 at time t=3 or time t=4, and so we can identify a region where you might as well cash in.

What about the final region? If the stock value is greater than £2, but not yet in the definitely-cash-in area, what should you do? Well, if you think about it, the value of the stock is a fair game. But your return should be better than that, because the stock price doesn’t take account of the fact that you wouldn’t buy in (and make a loss overall) if the value drops below £2. So at this stage, your future options are better than playing a fair game, and so it doesn’t make sense (in terms of maximising your *average*) to cash in.

Now we can actually work backwards in time to establish how much any starting value is worth under this optimal strategy. We can fill in the values in the ‘doomed’ area (ie all zeros) and on the ‘cash in now’ area (ie current value minus 2), and construct backwards using the fact that we have a random walk.

The final answer ends up being 7/16 if the stock had value £1 at time 0. Note that the main point here is that working out the qualitative form of the strategy was the non-trivial part. Once we’d done that, everything was fairly straightforward. I claim that this was a reasonably fun adjustment to the original problem, but have minimal idea whether pricing options is in general an interesting thing to do.

Anyway, I hope that provided an interesting overview to some of the topics of interest within the question of how to choose strategies for games based on random processes.

# Fair games and the martingale strategy I

I went back to my school a couple of weeks ago and gave a talk. I felt I’d given various incarnations of a talk on card-shuffling too many times, so it was time for a new topic. The following post (and time allowing, one or two more) is pretty much what I said.

The Martingale Strategy

Suppose we bet repeatedly on the outcome of tossing a fair coin. Since it’s November, my heart is set on buying an ice cream that costs £1, so my aim is to win this amount from our game. My strategy is this:

First, I bet £1. If I win, then that’s great, because I now have made exactly enough profit to buy the ice cream. If I lose, then I play again, and this time I bet £2. Again, if I win, then my total profit is £2-£1 = £1, so I stop playing and buy the ice cream. If I lose, then I play a third time, again doubling my stake. So if I win for the first time on the seventh go, my overall profit will be

£64 – (£1+£2+£4+£8+£16+£32) = £1,

and it’s clear that this can be continued and I will eventually win a round, and at this point my total profit will be £1. So I will always eventually be able to buy my ice cream.

But, there’s nothing special about the value £1, so I could replace the words ‘ice cream’ with ‘private tropical island’, so why am I still here in the UK on a wet Monday when I could be on my beach lounger?

There are some fairly obvious reasons why the strategy I’ve described is not actually a fail-safe way to make a profit. For a start, although with probability one a head will come up eventually, there is a small positive chance that the first 200 rolls will all be tails. At this point, I would have accrued a debt of roughly $2^{200}$ pounds, and this is slightly more than the number of atoms in the universe. All this for an ice cream?

So there are major problems carrying out this strategy in a finite world. And of course, it’s no good if we stop after a very large but finite number of turns, because then there’s always this very small chance that we’ve made a very large loss, which is bad, partly because we can’t have the ice cream, but also because it exactly cancels out the chance of making our £1 profit, and so our overall average profit is exactly zero.

Though I’ve set this up in an intentionally glib fashion, as so often is the case, we might have stumbled across an interesting mathematical idea. That is, if we play a fair game a finite number of times, we have a fair game overall, meaning our overall average profit is zero. But if we are allowed to play a potentially infinite number of times, then it’s not clear how to define our overall ‘average’ profit, since we feel it ought to be zero, as an extension of the finite case, but also might be positive, because it ends up being £1 with probability one.

It’s tempting at this stage to start writing statements like

$1 \times 1 + (-\infty) \times 0=0 ,$

to justify why this might have come about, where we consider the infinitely unlikely event that is infinitely costly. But this is only convincing at the most superficial level, and so it makes more sense to think a bit more carefully about under exactly what circumstances we can extend our observation about the overall fairness of a finite sequence of individual fair games.

A second example

The previous example was based upon a series of coin tosses, and we can use exactly the same source of randomness to produce a simple random walk. This is a process that goes up or down by 1 in each time step, where each option happens with probability ½, independently of the history.

We could avoid the requirement to deal with very large bets by always staking £1, and then cashing in the first time we have a profit of £1. Then, if we start the random walk at zero, it models our profit, and we stop the first time it gets to 1. It’s not obvious whether we hit 1 with probability one. Let’s show this.

In order to hit some positive value k, the random walk must pass through 1, 2, and so on, up to (k-1) and then finally k. So $\mathbb{P}(\text{hit k}) = [\mathbb{P}(\text{hit 1})]^k$. And similarly for negative values. Also, the probability that we return to zero is the same as the probability that we ever hit 1, since after one time-step they are literally the same problem (after symmetry). So, if the probability of hitting 1 is p<1, then the number of visits to zero is geometric (supported on 1,2,3,…) with parameter p, and so

$\mathbb{E}[\text{visits to k}] = \mathbb{E}[\text{visits to zero}] \times \mathbb{P}(\text{hit k})=(1+1/p) \times p^{|k|} = (p+1)p^{|k|-1}.$

Thus, when we sum over all values of k, we are summing a pair of geometric series with exponent <1, and so we get a finite answer. But if the expected number of visits to anywhere (ie the sum across all places) is finite, this is clearly ridiculous, since we are running the process for an infinite time, and at each time-step we must be somewhere! So we must in fact have p=1, and thus another potential counter-example to the claim that a sequence of fair games can sometimes be unfair.

We might have exactly the same set of practical objections, such as this method requiring arbitrarily large liquidity (even though it doesn’t grow exponentially fast so doesn’t seem so bad).

What will actually turn out to be useful is that although the bets are now small, the average time until we hit 1 is actually infinite. Remember that, even though most things we see in real life don’t have this property, it is completely possible for a random variable to take finite values yet have infinite expectation.

Notes on the Martingale Strategy

There’s no reason why the originally proposed strategy had to be based upon fair coin tosses. This strategy might work in a more general setting, where the chance of winning on a given turn is not ½, or is not even constant. So long as at each stage you bet exactly enough that, if you win, you recoup all your losses so far, and one extra pound, this has the same overall effect.

Of course, we need to check that we do eventually win a round, which is not guaranteed if the probability of winning (conditional on not having yet won) decays sufficiently fast. If we let $p_k$ be the probability of winning on turn k, given that we haven’t previously won, then we require that the probability of never winning $\prod_{k\ge 1}(1-p_k)=0$. By taking logs and taking care of the approximations, it can be seen that the divergence or otherwise of $\sum p_k$ determines which way this falls.

In the next post, we’ll talk about how the two problems encountered here, namely allowing large increments, and considering a stopping time with infinite expectation are exactly the two cases where something can go wrong. We’ll also talk about a slightly different setting, where the choice of when to stop playing becomes a bit more dynamic and complicated.

# When is a Markov chain a Markov chain?

I’ve been taking tutorials on the third quarter of the second-year probability course, in which the student have met discrete-time Markov chains for the first time. The hardest aspect of this introduction (apart from the rapid pace – they cover only slightly less material than I did in Cambridge, but in half the time) is, in my opinion, choosing which definition of the Markov property is most appropriate to use in a given setting.

We have the wordy “conditional on the present, the future is independent of the past”, which is probably too vague for any precise application. Then you can ask more formally that the transition probabilities are the same under two types of conditioning, that is conditioning on the whole history, and conditioning on just the current value

$\mathbb{P}(X_{n+1}=i_{n+1} \,\big|\, X_n=i_n,\ldots,X_0=i_0) = \mathbb{P}(X_{n+1}=i_{n+1} \,\big |\, X_n=i_n),$ (*)

and furthermore this must hold for all sets of values $(i_{n+1},\ldots,i_0)$ and if we want time-homogeneity (as is usually assumed at least implicitly when we use the word ‘chain’), then these expressions should be functions of $(i_n,i_{n+1})$ but not n.

Alternatively, one can define everything in terms of the probability of seeing a given path:

$\mathbb{P}(X_0=i_0,\ldots,X_n=i_n)= \lambda_{i_0}p_{i_0,i_1}\ldots p_{i_{n-1}i_n},$

where $\lambda$ is the initial distribution, and the $p_{i,j}$s are the entries of the transition matrix P.

Fortunately, these latter two definitions are equivalent, but it can be hard to know how to proceed when you’re asked to show that a given process is a Markov chain. I think this is partly because this is one of the rare examples of a concept that students meet, then immediately find it hard to think of any examples of similar processes which are not Markov chains. The only similar concept I can think of are vector spaces, which share this property mainly because almost everything in first-year mathematics is linear in some regard.

Non-examples of Markov chains

Anyway, during the tutorials I was asking for some suggestions of discrete-time processes on a countable or finite state space which are not Markov chains. Here are some things we came up with:

• Consider a bag with a finite collection of marbles of various colours. Record the colours of marbles sampled repeatedly without replacement. Then the colour of the next marble depends on the set you’ve already seen, not on the current colour. And of course, the process terminates.
• Non-backtracking random walk. Suppose you are on a graph where every vertex has degree at least 2, and in a step you move to an adjacent vertex, chosen uniformly among the neighbours, apart from the one from which you arrived.
• In a more applied setting, it’s reasonable to assume that if we wanted to know the chance it will rain tomorrow, this will be informed by the weather over the past week (say) rather than just today.

Showing a process is a Markov chain

We often find Markov chains embedded in other processes, for example a sequence of IID random variables $X_1,X_2,\ldots$. Let’s consider the random walk $S_n=\sum_{i=1}^n X_i$, where each $X_i =\pm 1$ with probability p and (1-p). Define the running maximum $M_n=\max_{m\le n}S_m$, and then we are interested in $Y_n:=M_n-S_n$, which we claim is a Markov chain, and we will use this as an example for our recipe to show this in general.

We want to show (*) for the process $Y_n$. We start with the LHS of (*)

$\mathbb{P}(Y_{n+1}=i_{n+1} \,\big|\, Y_n=i_n,\ldots,Y_0=i_0),$

and then we rewrite $Y_{n+1}$ as much as possible in terms of previous and current values of Y, and quantities which might be independent of previous values of Y. At this point it’s helpful to split into the cases $i_n=0$ and $i_n\ne 0$. We’ll treat the latter for now. Then

$Y_{n+1}=Y_n+X_{n+1},$

so we rewrite as

$=\mathbb{P}(X_{n+1}=i_{n+1}-i_n \, \big |\, Y_n=i_n,\ldots, Y_0=i_0),$

noting that we substitute $i_n$ for $Y_n$ since that’s in the conditioning. But this is now ideal, since $X_{n+1}$ is actually independent of everything in the conditioning. So we could get rid of all the conditioning. But we don’t really want to do that, because we want to have conditioning on $Y_n$ left. So let’s get rid of everything except that:

$=\mathbb{P}(X_{n+1}=i_{n+1}-i_n\, \big |\, Y_n=i_n).$

Now we can exactly reverse all of the other steps to get back to

$= \mathbb{P}(Y_{n+1}=i_{n+1} \,\big|\, Y_n=i_n),$

which is exactly what we required.

The key idea is that we stuck to the definition in terms of Y, and held all the conditioning in terms of Y, since that what actually determines the Markov property for Y, rearranging the event until it’s in terms of one of the underlying Xs, at which point it’s easy to use independence.

Showing a process is not a Markov chain

Let’s show that $M_n$ is not a Markov chain. The classic mistake to make here is to talk about possible paths the random walk S could take, which is obviously relevant, but won’t give us a clear reason why M is not Markov. What we should instead do is suggest two paths taken by M, which have the same ‘current’ value, but induce transition probabilities, because they place different restrictions on the possible paths taken by S.

In both diagrams, the red line indicates a possible path taken by $(M_0,M_1,\ldots,M_4)$, and the blue lines show possible paths of S which could induce these.

In the left diagram, clearly there’s only one such path that S could take, and so we know immediately what happens next. Either $X_5=+1$ (with probability p) in which case $M_5=S_5=3$, otherwise it’s -1, in which case $M_5=2$.

In the right diagram, there are two possibilities. In the case that $S_4=0$, clearly there’s no chance of the maximum increasing. So in the absence of other information, for $M_5=3$, we must have $X_4=X_5=+1$, and so the chance of this is $p^2$.

So although the same transitions are possible, they have different probabilities with different information about the history, and so the Markov property does not hold here.

# Azuma-Hoeffding Inequality

It’s (probably) my last Michaelmas term in Oxford, at least for the time being, and so also the last time giving tutorials on either of the probability courses that students take in their first two years. This time, I’m teaching the second years, and as usual the aim of the majority of the first half of the course is to acquire as sophisticated an understanding as possible of the Central Limit Theorem. I feel a key step is appreciating that CLT tells you about the correct scaling for the deviations from the mean of these partial sums of IID random variables. The fact that these deviations on this correct scaling converge in law to a normal distribution, irrespective (apart from mild conditions) on the underlying distribution, is interesting, but should be viewed as a secondary, bonus, property.

Emphasising the scaling of deviations in CLT motivates the next sections of this (or any) course. We develop tools like Markov’s inequality to control the probability that a random variable is much larger than its expectation, and experiment with applying this to various functions of the random variable to get stronger bounds. When the moment generating function exists, this is an excellent choice for this analysis. We end up with a so-called Chernoff bound. For example, we might consider the probability that when we toss N coins, at least a proportion ¾ are Heads. A Chernoff bound says that this probability decays exponentially in N.

One direction to take is to ask how to control precisely the parameter of this exponential decay, which leads to Cramer’s theorem and the basis of the theory of Large Deviations. An alternative direction is to observe that the signed difference between the partial sums of independent random variables and their means is an example of a martingale, albeit not a very interesting one, since in general the increments of a martingale are not independent. So we might ask: under what circumstances can we show exponential tail bounds on the deviation of a martingale from its mean (that is, its initial value) at a fixed (perhaps large) time?

Azuma-Hoeffding inequality

The following result was derived and used by various authors in the 60s, including Azuma and Hoeffding (separately), but also others.

Let $X_0,X_1,X_2,\ldots$ be a martingale with respect to some filtration, and we assume that the absolute value of each increment $|X_i-X_{i-1}|$ is bounded almost surely by some $c_i<\infty$. Then, recalling that $\mathbb{E}[X_n|\mathcal{F}_0]=X_0$, we have

$\mathbb{P}(X_n \ge X_0+t) \le \exp\left( -\frac{t^2}{2\sum_{i=1}^n c_i^2}\right).$

Proof

We apply a Chernoff argument to each increment. First, observe that for Y a distribution supported on [-1,1] with mean zero, by convexity $\mathbb{E}[e^{tY}]$ is maximised by taking Y equal to +1 and -1 each with probability ½. Thus

$\mathbb{E}[e^{tY}]\le \frac12 e^t + \frac 12 e^{-t}=\cosh(t) \le e^{-t^2/2},$

where the final inequality follows by directly comparing the Taylor series.

We’ll use this shortly. Before that, we start the usual argument for a Chernoff bound on $X_n-X_0$.

$\mathbb{P}(X_n-X_0\ge t) = \mathbb{P}(e^{\theta(X_n-X_0)}\ge e^{\theta t})\le e^{-\theta t} \mathbb{E}[e^{\theta(X_n-X_0)}]$

$= e^{-\theta t} \mathbb{E}[\mathbb{E}[e^{\theta((X_n-X_{n-1}) +X_{n-1}-X_0)} | \mathcal{F}_{n-1}]]$

$= e^{-\theta t} \mathbb{E}[e^{\theta(X_{n-1}-X_0)} \mathbb{E}[e^{\theta(X_n-X_{n-1})}|\mathcal{F}_{n-1}] ],$

and our preliminary result allows us to control this inner expectation

$\le e^{-\theta t} e^{\theta^2c_n^2/2} \mathbb{E}[e^{\theta(X_{n-1}-X_0)}].$

So now we can apply this inductively to obtain

$\mathbb{P}(X_n-X_0\ge t) \le e^{-\theta t+ \theta^2 \sum_{i=1}^n c_i^2}.$

Finally, as usual in such an argument, we need to choose a sensible value of the free parameter $\theta$, and naturally we want to choose it to make this RHS as small as possible, which is achieved when $\theta = \frac{t}{\sum_{i=1}^n c_i^2}$, and leads exactly to the statement of the inequality.

Applications

Unsurprisingly, we can easily apply this to the process of partial sums of IID random variables with mean zero and bounded support, to recover a Chernoff bound.

A more interesting example involves revealing the state (ie open or closed) of the edges of an Erdos-Renyi graph one at a time. We need to examine some quantitative property of the graph which can’t ever be heavily influenced by the presence or non-presence of a single given edge. The size of the largest clique, or the largest cut, are good examples. Adding or removing an edge can change these quantities by at most one.

So if we order the edges, and let the filtration $\mathcal{F}_k$ be generated by the state of the first k edges in this ordering, then $X_k=\mathbb{E}[\text{max cut}| \mathcal{F}_k]$ is a martingale. (A martingale constructed backwards in this fashion by conditioning a final state on a filtration is sometimes called a Doob martingale.) Using A-H on this shows that the deviations from the mean are of order $\sqrt{N}$, where N is the size of the graph. In the sparse case, it can be justified fairly easily that the maximum cut has size $\Theta(N)$, since for example there will always be some positive proportion of isolated vertices. However, accurate asymptotics for the mean of this quantity seem (at least after a brief search of the literature – please do correct me if this is wrong!) to be unknown. So this might be an example of the curious situation where we can control the deviations around the mean better than the mean itself!

Beyond bounded increments

One observation we might make about the proof is that it is tight only if all the increments $X_i-X_{i-1}$ are supported on $\{-c_i,+c_i\}$, which is stronger than demanding that the absolute value is bounded. If in fact we have $X_i-X_{i-1}\in[-d_i,c_i]$ almost surely, then, with a more detailed preliminary lemma, we can have instead a bound of $\exp\left( -\frac{2t^2}{\sum_{i=1}^n (c_i+d_i)^2} \right)$.

While it isn’t a problem in these examples, in many settings the restriction to bounded increments is likely to be the obstacle to applying A-H. Indeed, in the technical corner of my current research problem, this is exactly the challenge I faced. Fortunately, at least in principle, all is not necessarily lost. We might, for example, be able to establish bounds $(c_i)$ as described, such that the probability that any $|X_i-X_{i-1}|$ exceeds its $c_i$ is very small. You could then construct a coupled process $(Y_i)$, that is equal to $X_i$ whenever the increments are within the given range, and something else otherwise. For Y to fit the conditions of A-H, the challenge is to ensure we can do this such that the increments remain bounded (ie the ‘something else’ also has to be within $[-c_i,c_i]$ ) and also that Y remains a martingale. This total probability of a deviation is bounded above by the probability of Y experiencing that deviation, plus the probability of Y and X decoupling. To comment on the latter probability is hard in general without saying a bit more about the dependence structure in X itself.