DGFF 4 – Properties of the Green’s function

I’m at UBC this month for the PIMS probability summer school. One of the long courses is being given by Marek Biskup about the Discrete Gaussian Free Field (notes and outline here) so this seems like a good moment to revive the sequence of posts about the DGFF. Here’s DGFF1, DGFF2, DGFF3 from November.

The first draft of this post was about the maximum of the DGFF in a large box V_N, and also about the Green’s function G^{V_N}(x,y), which specifies the covariance structure of the DGFF. This first draft also became too long, so I’m splitting it into two somewhat shorter ones. As we’ll see, some understanding and standard estimates of the Green’s function is enough to say quite a bit about the maximum. In this first post, we’ll explore some ‘low-hanging fruit’ concerning the Green’s function, as defined through a simple random walk, which are useful, but rarely explained in the DGFF literature.

Symmetry of Green’s function

We start with one of these low-hanging fruit. If G^{V_N} is to be a covariance matrix, it has to be symmetric. In the first post, showing that the definition of the DGFF as a random field with given Hamiltonian is equivalent to \mathcal{N}(0,G^{V_N}) certainly can be viewed as a proof of symmetry. However, it would be satisfying if there was a direct argument in the language of the definition of the Green’s function.

To make this self-contained, recall the random walk definition of G^{V_N}(x,y). Let (S_m)_{m\ge 0} be simple random walk on V_N, and \mathbb{P}_x,\,\mathbb{E}_x denote starting the random walk at x\in V_N. As usual, let \tau_y,\,\tau_A denote the hitting time of a vertex y or a set A respectively. Then

G^{V_N}(x,y):= \mathbb{E}_x \left[ \sum_{m=0}^{\tau_{\partial V_N}}1_{(S_m=y) }\right].

That is, G^{V_N}(x,y) is the expected number of visits to y by a random walk from x, before it exits V_N.

Let’s drop the superscript for now, as everything should hold for a more general subset of the lattice. I don’t think it’s immediately obvious at the level of Markov chains why G(x,y)=G(y,x). In particular, it’s not the case that

\mathbb{P}_x(\tau_y < \tau_{D^c}) = \mathbb{P}_y(\tau_x <\tau_{D^c}),

and it feels that we can’t map between paths x \to \partial D and y\to \partial D in a way that preserves the number of visits to y and x, respectively. However, we can argue that for any m

\mathbb{P}_x(S_m=y, \tau_{D^c}>m) = \mathbb{P}_y(S_m=x, \tau_{D^c}>m),

by looking at the suitable paths of (S_m). That is, if we have a path x=S_0,S_1,\ldots,S_m=y that stays within D, then the probability of seeing this path starting from x and its reverse direction starting from y are equal. Why? Because

\mathbb{P}_x(S_0=x,S_1=v_1,\ldots,S_{m-1}=v_{m-1},S_m=y) = \prod_{\ell=0}^{m-1} \frac{1}{\mathrm{deg}(v_\ell)},

and

\mathbb{P}_y(S_0=y,S_1=v_{m-1},\ldots,S_{m-1}=v_1, S_m=x) = \prod_{\ell=0}^{m-1} \frac{1}{\mathrm{deg}(v_{m-\ell})} = \prod_{\ell=1}^m \frac{1}{\mathrm{deg}(v_\ell)}.

Since D\subset \mathbb{Z}^d and x,y are in the interior of D, we must have \mathrm{deg}(x)=\mathrm{deg}(y), and so these two expressions are equal. Summing over all such two-way paths, and then all m gives the result.

Fixing one argument

We now focus on G^D(\cdot,y), where the second argument is fixed. This is the solution to the Poisson equation

\Delta G^D(\cdot,y) = -\delta_y(\cdot),\quad G^D(x,y)=0,\; \forall x\in \partial D.

To see this, can use a standard hitting probability argument (as here) with the Markov property. This is harmonic in D\backslash \{y\}, and since we know

G^D(y,y)= \frac{1}{\mathbb{P}_y(\text{RW hits }\partial D\text{ before returning to }y)},

this uniquely specifies G^D(\cdot,y). Anyway, since harmonic functions achieve their maxima at the boundary, we have G(y,y)\ge G(x,y) for all x\in D. We can also see this from the SRW definition as

G(x,y)=G(y,x) = \mathbb{P}_y (\tau_x < \tau_{\partial D} ) G(x,x) \le G(x,x).

Changing the domain

Now we want to consider nested domains D\subset E, and compare G^D(\cdot,\cdot) and G^E(\cdot,\cdot) on DxD. The idea is that for SRW started from x\in D, we have \tau_{\partial D}\le \tau_{\partial E}, since one boundary is contained within the other. From this, we get

G^D(x,y)\le G^E(x,y),\quad \forall x,y\in D,

and we will use the particular case y=x.

For example, if x\in V_N, the box with width N, then the box with width 2N centred on x contains the whole of V_N. So, if we set \bar {V}_{2N}:= [-N,N]^d, then with reference to the diagram, we have

G^{V_N}(x,x)\le G^{\bar{V}_{2N}}(0,0),\quad x\in V_N.

As we’ll see when we study the maximum of the DGFF on V_N, uniform control over the pointwise variance will be a useful tool.

Maximising the Green’s function

The idea of bounding G^{V_N}(x,x) by G^{\bar V_{2N}}(0,0) for any x\in V_N is clever and useful. But a more direct approach would be to find the value of x that maximises G^{V_N}(x,x). We would conjecture that when V_N has a central vertex, then this is the maximiser.

We can prove this directly from the definition of the Green’s function in terms of random walk occupation times. Let’s assume we are working with \bar{V}_N for even N, so that 0 is the central vertex. Again, since

G^D(x,x)=\frac{1}{\mathbb{P}_x(\text{RW hits }\partial D\text{ before returning to }x)}, (*)

it would suffice to show that this probability is minimised when x=0. This feels right, since 0 is furthest from the boundary. Other points are closer to the boundary in some directions but further in others, so we can’t condition on the maximum distance from its start point achieved by an excursion of SRW (we’re vertex-transitive, so these look the same from all starting points), as even allowing for the four possible rotations, for an excursion of diameter slightly larger than N, starting at the centre is maximally bad.

However, intuitively it does feel as if being closer to the boundary makes you more likely to escape earlier. In fact, with a bit more care, we can couple the SRW started from 0 and the SRW started from r=(r^x,r^y)\ne 0 such that the latter always exits first. For convenience we’ll assume also that r^x,r^y are both even.

I couldn’t find any reference to this, so I don’t know whether it’s well-known or not. The following argument involves projecting into each axis, and doing separate couplings for transitions in the x-direction and transitions in the y-direction. We assume WLOG that x is in the upper-right quadrant as shown. Then, let 0=S_0,S_1,S_2,\ldots be SRW started from 0, and we will construct r=R_0,R_1,R_2,\ldots on the same probability space as (S_m)_{m\ge 0} as follows. For every m, we set the increment R_{m+1}-R_m to be \pm(S_{m+1}-S_m). It remains to specify the sign, which will be determined by the direction of the S-increment, and a pair of stopping times. The marginal is therefore again an SRW, started from r. Temporarily, we use the unusual notation S_m= (S^x_m,S^y_m) for the coordinates of S_m.

So, if S_{m+1}-S_m=(1,0), (-1,0), ie S moves left or right, then we set

R_{m+1}-R_m = \begin{cases} -(S_{m+1}-S_m) &\quad \text{if }m<T^x\\ S_{m+1}-S_m&\quad \text{if }m>T^x.\end{cases} (*)

where T^x:= \min\{m\,:\, R^x_m=S^x_m\}. That is, R^x moves in the opposing direction to S^x until the first time when they are equal (hence the parity requirement), and then they move together. WLOG assume that r^x>0. Then suppose S^x_m=\pm N and such m is minimal. Then by construction, if m\ge T^x, then R^x_m=\pm N also. If m<T^x, then we must have S^x_m=-N, and so since R^x‘s trajectory is a mirror image of S^x‘s, in fact R^x_m = N+r^x>N, so R^x hit +N first. In both cases, we see that R^x hits \pm N at the same time or before S^x.

In other words, when S^x_m has non-negative x coordinate, the lazy random walk R^x follows the same trajectory as S^x, and when it has negative x coordinate, the R^x mirrors S^x. At some time, it may happen that S^x_m= R^x_m=0 (recall the parity condition on r). Call this time T^x. We then adjust the description of the coupling so that (*) is the mechanism for m<T^x, and then for m\ge T^x, we take S^x_m=R^x_m.

Similarly, if S_{m+1}-S_m =(0,1), (0,-1), ie S moves up or down, then we set

R_{m+1}-R_m = \begin{cases} -(S_{m+1}-S_m)&\quad \text{ if }m<T^y\\  S_{m+1}-S_m&\quad \text{if }m\le T^y,\end{cases}

with corresponding definition of the stopping time T^y.

This completes the coupling, and by considering T^x\wedge T^y, we have shown what that the exit time for the walk started from zero dominates the exit time for walk started from r. Recall that so far we are in the case where the box has even width and r=(r^x,r^y) has even coordinates.

This exit time comparison isn’t exactly what we need to compare G^N(0,0) and G^N(x,x). It’s worth remarking at this stage that if all we cared about was the Green’s function on the integer line [-N,N], we would have an easier argument, as by the harmonic property of G(\cdot,y)

G^{[-N,N]}(0,r)=\frac{N-r}{N}G^{[-N,N]}(0,0),

G^{[-N,N]}(r,0) = \frac{N}{N+r}G^{[-N,N]}(r,r),

and so G(0,0)>G(r,r) follows by symmetry. To lift from 1D to 2D directly, we need a bit more than this. It’s possible that S returns in both x- and y- coordinates more often than R, but never at the same time. Fortunately, the coupling we defined slightly earlier does give us a bit more control.

Let \tau^x(S), \tau^x(R) be the first times that S^x, R^x hit \pm N. Under this coupling, for any m\ge 0

\mathbb{P}(S^x_m=0, m<T^x) = \mathbb{P}(R^x_m=r^x, m<T^x)

since these events are literally equal. Since we showed that \tau^x(R)\le \tau^x(S) almost surely, we can further deduce

\mathbb{P}(S^x_m=0,m<T^x\wedge \tau^x(S)) \ge \mathbb{P}(S^x_m=0,m<T^x\wedge \tau^x(R))

=\mathbb{P}(R^x_m=r^x, m <T^x \wedge \tau^x(R)).

To address the corresponding events for which m\ge T^x, we apply the strong Markov property at T^x, to obtain SRW Z_m started from r/2, and let \tau_{-N},\tau_{+N} be the hitting times of -N,+N respectively and \tau_{\pm N}=\tau_{-N}\wedge \tau_{+N}. It will now suffice to prove that

\mathbb{P}(Z_m=0, m< \tau_{\pm N}) \ge \mathbb{P}(Z_m=r,m<\tau_{\pm N}), (**)

as then we can apply the law of total probability and sum over values of T^x and m\ge 0.

To prove this result, we consider the following bijection between trajectories of length m from r/2 to {0,r}. We decompose the trajectories into excursions away from r/2, and then a final meander from r/2 to {0,r} that stays on the same side of r/2. We construct the new trajectory by preserving all the initial excursions, but reversing all the steps of the final meander. So if the original trajectory ended up at 0, the image ends up at r. Trivially, the initial excursions in the image only hit \pm N if the excursions in the original trajectory did this too. But it’s also easy to see, by a similar argument to the coupling at the start of this section, that if the original trajectory ends at r and does not hit \pm N, then so does the image. However, the converse is not true. So we conclude (**), and thus

\mathbb{P}(S_m^x=0) \ge \mathbb{P}(R_m^x=0)

for all m by combining everything we have seen so far. And so we can now lift to a statement about S_m itself, that is considering both coordinates separately.

 

The remaining cases for r require a little more care over the definition of T^x, though the same projection argument works, for fundamentally the same reason. (Note that in the above argument, if S^x_m=-N and m<T^x, then in fact R^x_m\ge N+2, and so it’s not hard to convince yourself that a sensible adjustment to the stopping time will allow a corresponding result with R^x_m\ge N+1 in the odd r^x case.) The case for N odd is harder, since in one dimension there are two median sites, and it’s clear by symmetry that we can’t couple them such that RW from one always exits at least as early as RW from the other. However, the distributions of exit times started from these two sites are the same (by symmetry), and so although we can’t find a coupling, we can use similar stopping times to obtain a result in probability.

In the next post, we’ll see how to apply this uniform bound on G^{V_N}(x,x) to control the maximum of the DGFF on V_N. In particular, we address how the positive correlations of DGFF influence the behaviour of the maximum by comparison with independent Gaussians at each site.

Generating uniform trees

A long time ago, I wrote quite a few a things about uniform trees. That is, a uniform choice from the n^{n-2} unrooted trees with vertex set [n]. This enumeration, normally called Cayley’s formula, has several elegant arguments, including the classical Prufer bijection. But making a uniform choice from a large set is awkward, and so we seek more probabilistic methods to sample such a tree, which might also give insight into the structure of a ‘typical’ uniform tree.

In another historic post, I talked about the Aldous-Broder algorithm. Here’s a quick summary. We run a random walk on the complete graph K_n started from a uniformly-chosen vertex. Every time we arrive at a vertex we haven’t visited before, we record the edge just traversed. Eventually we have visited all n vertices, so have recorded n-1 edges. It’s easy enough to convince yourself that these n-1 edges form a tree (how could there be a cycle?) and a bit more complicated to decide that the distribution of this tree is uniform.

It’s worth noting that this algorithm works to construct a uniform spanning tree on any connected base graph.

This post is about a few alternative constructions and interpretations of the uniform random tree. The first construction uses a Galton-Watson process. We take a Galton-Watson process where the offspring distribution is Poisson(1), and condition that the total population size is n. The resulting random tree has a root but no labels, however if we assign labels in [n] uniformly at random, the resulting rooted tree has the uniform distribution among rooted trees on [n].

Proof

This is all about moving from ordered trees to non-ordered trees. That is, when setting up a Galton-Watson tree, we distinguish between the following two trees, drawn extremely roughly in Paint:

That is, it matters which of the first-generation vertices have three children. Anyway, for such a (rooted) ordered tree T with n vertices, the probability that the Galton-Watson process ends up equal to T is

\mathbb{P}(GW = T) = \prod_{v\in T} \frac{e^{-1}}{C(v)!} = e^{-n} \prod_{v\in T}\frac{1}{C(v)!},

where C(v) is the number of children of a vertex v\in T. Then, since \mathbb{P}( |GW|=n ) is a function of n, we find

\mathbb{P}(GW=T \,\big|\, |GW|=n) = f(n)\prod_{v\in T} \frac{1}{C(v)!},

where f(n) is a function of n alone (ie depends on T only through its size n).

But given an unordered rooted tree t, labelled by [n], there are \prod_{v \in t} C(v)! ordered trees associated to t in the natural way. Furthermore, if we take the Poisson Galton-Watson tree conditioned to have total population size n, and label uniformly at random with [n], we obtain any one of these ordered trees with probability \frac{f(n)}{n!} \prod_{v\in t} \frac{1}{C(v)!}. So the probability that we have t after we forget about the ordering is \frac{f(n)}{n!}, which is a function of n alone, and so the distribution is uniform among the set of rooted unordered trees labelled by [n], exactly as required.

Heuristic for Poisson offspring distribution

In this proof, the fact that \mathbb{P}(C(v)=k)\propto \frac{1}{k!} exactly balances the number of orderings of the k children explains why Poisson(1) works out. Indeed, you can see in the proof that Poisson(c) works equally well, though when c\ne 1, the event we are conditioning on (namely that the total population size is n) has probability decaying exponentially in n, whereas for c=1, the branching process is critical, and the probability decays polynomially.

We can provide independent motivation though, from the Aldous-Broder construction. Both the conditioned Galton-Watson construction and the A-B algorithm supply the tree with a root, so we’ll keep that, and look at the distribution of the degree of the root as constructed by A-B. Let \rho=v_1,v_2,v_3,\ldots be the vertices [n], ordered by their discovery during the construction. Then \rho is definitely connected by an edge to v_2, but thereafter it follows by an elementary check that the probability \rho is connected to v_m is \frac{1}{n-1}, independently across all m. In other words, the distribution of the degree of \rho in the tree as constructed by A-B is

1+ \mathrm{Bin}\left(n-2,\frac{1}{n-1}\right) \approx 1+\mathrm{Poisson}(1).

Now, in the Galton-Watson process, conditioning the tree to have fixed, large size changes the offspring distribution of the root. Conveniently though, in a limiting sense it’s the same change as conditioning the tree to have size at least n. Since these events are monotone in n, it’s possible to take a limit of the conditioning events, and interpret the result as the Galton-Watson tree conditioned to survive. It’s a beautiful result that this interpretation can be formalised as a local limit. The limiting spine decomposition consists of an infinite spine, where the offspring distribution is a size-biased version of the original offspring distribution (and so in particular, always has at least one child) and where non-spine vertices have the original distribution.

In particular, the number of the offspring of the root is size-biased, and it is well-known and not hard to check that size-biasing Poisson(c) gives 1+Poisson(c) ! So in fact we have, in an appropriate limiting sense in both objects, a match between the degree distribution of the root in the uniform tree, and in the conditioned Galton-Watson tree.

This isn’t supposed to justify why a conditioned Galton-Watson tree is relevant a priori (especially the unconditional independence of degrees), but it does explain why Poisson offspring distributions are relevant.

Construction via G(N,p) and the random cluster model

The main reason uniform trees were important to my thesis was their appearance in the Erdos-Renyi random graph G(N,p). The probability that vertices {1, …, n} form a tree component in G(N,p) with some particular structure is

p^{n-1} (1-p)^{\binom{n}{2}-(n-1)} \times (1-p)^{n(N-m)}.

Here, the first two terms give the probability that the graph structure on {1, …, n} is correct, and the the final term gives the probability of the (independent) event that these vertices are not connected to anything else in the graph. In particular, this has no dependence on the tree structure chosen on [n] (for example, whether it should be a path or a star – both examples of trees). So the conditional distribution is uniform among all trees.

If we work in some limiting regime, where pn\rightarrow 0 (for example if n is fixed and p=\frac{1}{N}\rightarrow 0), then we can get away asymptotically with less strong conditioning. Suppose we condition instead just that [n] form a component. Now, there are more ways to form a connected graph with one cycle on [n] than there are trees on [n], but the former all require an extra edge, and so the probability that a given one such tree-with-extra-edge appears as the restriction to [n] in G(N,p) is asymptotically negligible compared to the probability that the restriction to [n] of G(N,p) is a tree. Naturally, the local limit of components in G(N,c/N) is a Poisson(c) Galton-Watson branching process, and so this is all consistent with the original construction.

One slightly unsatisfying aspect to this construction is that we have to embed the tree of size [n] within a much larger graph on [N] to see uniform trees. We can’t choose a scaling p=p(n) such that G(n,p) itself concentrates on trees. To guarantee connectivity with high probability, we need to take p> \frac{\log n}{n}, but by this threshold, the graph has (many) cycles with high probability.

At this PIMS summer school in Vancouver, one of the courses is focusing on lattice spin models, including the random cluster model, which we now briefly define. We start with some underlying graph G. From a physical motivation, we might take G to be \mathbb{Z}^d or some finite subset of it, or a d-ary tree, or the complete graph K_N. As in classical bond percolation (note G(N,p) is bond percolation on K_N), a random subset of the edges of G are included, or declared open. The probability of a given configuration w, with e open edges is proportional to

p^e (1-p)^{|E(G)| - e} q^{k(w)}, (*)

where the edge-weight p\in(0,1) as usual, and cluster weight q\in (0,\infty), and k(w) counts the number of connected components in configuration w. When q=1, we recover classical bond percolation (including G(N,p) ), while for q>1, this cluster-reweighting favours having more components, and q<1 favours fewer components. Note that in the case q\ne 1, the normalising constant (or partition function) of (*) is generally intractable to calculate explicitly.

As in the Erdos-Renyi graph, consider fixing the underlying graph G, and taking p\rightarrow 0, but also taking \frac{q}{p}\rightarrow 0. So the resulting graph asymptotically ‘wants to have as few edges as possible, but really wants to have as few components as possible’. In particular, 1) all spanning trees of G are equally likely; 2) any configuration with more than one component has asymptotically negligible probability relative to any tree; 3) any graph with a cycle has #components + #edges greater than that of a tree, and so is asymptotically negligible probability relative to any tree.

In other words, the limit of the distribution is the uniform spanning tree of G, and so this (like Aldous-Broder) is a substantial generalisation, which constructs the uniform random tree in the special case where G=K_n.

 

Random walks conditioned to stay positive

In this post, I’m going to discuss some of the literature concerning the question of conditioning a simple random walk to lie above a line with fixed gradient. A special case of this situation is conditioning to stay non-negative. Some notation first. Let (S_n)_{n\ge 0} be a random walk with IID increments, with distribution X. Take \mu to be the expectation of these increments, and we’ll assume that the variance \sigma^2 is finite, though at times we may need to enforce slightly stronger regularity conditions.

(Although simple symmetric random walk is a good example for asymptotic heuristics, in general we also assume that if the increments are discrete they don’t have parity-based support, or any other arithmetic property that prevents local limit theorems holding.)

We will investigate the probability that S_n\ge 0 for n=0,1,…,N, particularly for large N. For ease of notation we write T=\inf\{n\ge 0\,:\, S_n<0\} for the hitting time of the negative half-plane. Thus we are interested in S_n conditioned on T>N, or T=N, mindful that these might not be the same. We will also discuss briefly to what extent we can condition on T=\infty.

In the first paragraph, I said that this is a special case of conditioning SRW to lie above a line with fixed gradient. Fortunately, all the content of the general case is contained in the special case. We can repose the question of S_n conditioned to stay above n\alpha until step N by the question of S_n-n\alpha (which, naturally, has drift \mu-\alpha) conditioned to stay non-negative until step N, by a direct coupling.

Applications

Simple random walk is a perfectly interesting object to study in its own right, and this is a perfectly natural question to ask about it. But lots of probabilistic models can be studied via naturally embedded SRWs, and it’s worth pointing out a couple of applications to other probabilistic settings (one of which is the reason I was investigating this literature).

In many circumstances, we can desribe random trees and random graphs by an embedded random walk, such as an exploration process, as described in several posts during my PhD, such as here and here. The exploration process of a Galton-Watson branching tree is a particularly good example, since the exploration process really is simple random walk, unlike in, for example, the Erdos-Renyi random graph G(N,p), where the increments are only approximately IID. In this setting, the increments are given by the offspring distribution minus one, and the hitting time of -1 is the total population size of the branching process. So if the expectation of the offspring distribution is at most 1, then the event that the size of the tree is large is an atypical event, corresponding to delayed extinction. Whereas if the expectation is greater than one, then it is an event with limiting positive probability. Indeed, with positive probability the exploration process never hits -1, corresponding to survival of the branching tree. There are plenty of interesting questions about the structure of a branching process tree conditional on having atypically large size, including the spine decomposition of Kesten [KS], but the methods described in this post can be used to quantify the probability, or at least the scale of the probability of this atypical event.

In my current research, I’m studying a random walk embedded in a construction of the infinite-volume DGFF pinned at zero, as introduced by Biskup and Louidor [BL]. The random walk controls the gross behaviour of the field on annuli with dyadically-growing radii. Anyway, in this setting the random walk has Gaussian increments. (In fact, there is a complication because the increments aren’t exactly IID, but that’s definitely not a problem at this level of exposition.) The overall field is decomposed as a sum of the random walk, plus independent DGFFs with Dirichlet boundary conditions on each of the annuli, plus asymptotically negligible corrections from a ‘binding field’. Conditioning that this pinned field be non-negative up to the Kth annulus corresponds to conditioning the random walk to stay above the magnitude of the minimum of each successive annular DGFF. (These minima are random, but tightly concentrated around their expectations.)

Conditioning on \{T > N\}

When we condition on \{T>N\}, obviously the resulting distribution (of the process) is a mixture of the distributions we obtain by conditioning on each of \{T=N+1\}, \{T=N+2\},\ldots. Shortly, we’ll condition on \{T=N\} itself, but first it’s worth establishing how to relate the two options. That is, conditional on \{T>N\}, what is the distribution of T?

Firstly, when \mu>0, this event always has positive probability, since \mathbb{P}(T=\infty)>0. So as N\rightarrow\infty, the distribution of the process conditional on \{T>N\} converges to the distribution of the process conditional on survival. So we’ll ignore this for now.

In the case \mu\le 0, everything is encapsulated in the tail of the probabilities \mathbb{P}(T=N), and these tails are qualitatively different in the cases \mu=0 and \mu<0.

When \mu=0, then \mathbb{P}(T=N) decays polynomially in N. In the special case where S_n is simple symmetric random walk (and N has the correct parity), we can check this just by an application of Stirling’s formula to count paths with this property. By contrast, when \mu<0, even demanding S_N=-1 is a large deviations event in the sense of Cramer’s theorem, and so the probability decays exponentially with N. Mogulskii’s theorem gives a large deviation principle for random walks to lie above a line defined on the scale N. The crucial fact here is that the probabilistic cost of staying positive until N has the same exponent as the probabilistic cost of being positive at N. Heuristically, we think of spreading the non-expected behaviour of the increments uniformly through the process, at only polynomial cost once we’ve specified the multiset of values taken by the increments. So, when \mu<0, we have

\mathbb{P}(T\ge(1+\epsilon)N) \ll \mathbb{P}(T= N).

Therefore, conditioning on \{T\ge N\} in fact concentrates T on N+o(N). Whereas by contrast, when \mu=0, conditioning on \{T\ge N\} gives a nontrivial limit in distribution for T/N, supported on [1,\infty).

A related problem is the value taken by S_N, conditional on {T>N}. It’s a related problem because the event {T>N} depends only on the process up to time N, and so given the value of S_N, even with the conditioning, after time N, the process is just an unconditioned RW. This is a classic application of the Markov property, beloved in several guises by undergraduate probability exam designers.

Anyway, Iglehart [Ig2] shows an invariance principle for S_N | T>N when \mu<0, without scaling. That is S_N=\Theta(1), though the limiting distribution depends on the increment distribution in a sense that is best described through Laplace transforms. If we start a RW with negative drift from height O(1), then it hits zero in time O(1), so in fact this shows that conditonal on \{T\ge N\}, we have T= N +O(1) with high probability. When \mu=0, we have fluctuations on a scale \sqrt{N}, as shown earlier by Iglehart [Ig1]. Again, thinking about the central limit theorem, this fits the asymptotic description of T conditioned on T>N.

Conditioning on T=N

In the case \mu=0, conditioning on T=N gives

\left[\frac{1}{\sqrt{N}}S(\lfloor Nt\rfloor ) ,t\in[0,1] \right] \Rightarrow W^+(t), (*)

where W^+ is a standard Brownian excursion on [0,1]. This is shown roughly simultaneously in [Ka] and [DIM]. This is similar to Donsker’s theorem for the unconditioned random walk, which converges after rescaling to Brownian motion in this sense, or Brownian bridge if you condition on S_N=0. Skorohod’s proof for Brownian bridge [Sk] approximates the event \{S_N=0\} by \{S_N\in[-\epsilon \sqrt{N},+\epsilon \sqrt{N}]\}, since the probability of this event is bounded away from zero. Similarly, but with more technicalities, a proof of convergence conditional on T=N can approximate by \{S_m\ge 0, m\in[\delta N,(1-\delta)N], S_N\in [-\epsilon \sqrt{N},+\epsilon\sqrt{N}]\}. The technicalities here emerge since T, the first return time to zero, is not continuous as a function of continuous functions. (Imagine a sequence of processes f^N for which f^N(x)\ge 0 on [0,1] and f^N(\frac12)=\frac{1}{N}.)

Once you condition on T=N, the mean \mu doesn’t really matter for this scaling limit. That is, so long as variance is finite, for any \mu\in\mathbb{R}, the same result (*) holds, although a different proof is in general necessary. See [BD] and references for details. However, this is particularly clear in the case where the increments are Gaussian. In this setting, we don’t actually need to take a scaling limit. The distribution of Gaussian *random walk bridge* doesn’t depend on the mean of the increments. This is related to the fact that a linear transformation of a Gaussian is Gaussian, and can be seen by examining the joint density function directly.

Conditioning on T=\infty

When \mu>0, the event \{T=\infty\} occurs with positive probability, so it is well-defined to condition on it. When \mu\le 0, this is not the case, and so we have to be more careful.

First, an observation. Just for clarity, let’s take \mu<0, and condition on \{T>N\}, and look at the distribution of S_{\epsilon N}, where \epsilon>0 is small. This is approximately given by

\frac{S_{\epsilon N}}{\sqrt{N}}\stackrel{d}{\approx}W^+(\epsilon).

Now take \epsilon\rightarrow\infty and consider the RHS. If instead of the Brownian excursion W^+, we instead had Brownian motion, we could specify the distribution exactly. But in fact, we can construct Brownian excursion as the solution to an SDE:

\mathrm{d}W^+(t) = \left[\frac{1}{W^+(t)} - \frac{W^+(t)}{1-t}\right] \mathrm{d}t + \mathrm{d}B(t),\quad t\in(0,1) (**)

for B a standard Brownian motion. I might return in the next post to why this is valid. For now, note that the first drift term pushes the excursion away from zero, while the second term brings it back to zero as t\rightarrow 1.

From this, the second drift term is essentially negligible if we care about scaling W^+(\epsilon) as \epsilon\rightarrow 0, and we can say that W^+(\epsilon)=\Theta(\sqrt{\epsilon}).

So, returning to the random walk, we have

\frac{S_{\epsilon N}}{\sqrt{\epsilon N}}\stackrel{d}{\approx} \frac{W^+(\epsilon)}{\sqrt{\epsilon}} = \Theta(1).

At a heuristic level, it’s tempting to try ‘taking N\rightarrow\infty while fixing \epsilon N‘, to conclude that there is a well-defined scaling limit for the RW conditioned to stay positive forever. But we came up with this estimate by taking N\rightarrow\infty and then \epsilon\rightarrow 0 in that order. So while the heuristic might be convincing, this is not the outline of a valid argument in any way. However, the SDE representation of W^+ in the \epsilon\rightarrow 0 regime is useful. If we drop the second drift term in (**), we define the three-dimensional Bessel process, which (again, possibly the subject of a new post) is the correct scaling limit we should be aiming for.

Finally, it’s worth observing that the limit \{T=\infty\}=\lim_{N\rightarrow\infty} \{T>N\} is a monotone limit, and so further tools are available. In particular, if we know that the trajectories of the random walk satisfy the FKG property, then we can define this limit directly. It feels intuitively clear that random walks should satisfy the FKG inequality (in the sense that if a RW is large somewhere, it’s more likely to be large somewhere else). You can do a covariance calculation easily, but a standard way to show the FKG inequality applies is by verifying the FKG lattice condition, and unless I’m missing something, this is clear (though a bit annoying to check) when the increments are Gaussian, but not in general. Even so, defining this monotone limit does not tell you that it is non-degenerate (ie almost-surely finite), for which some separate estimates would be required.

A final remark: in a recent post, I talked about the Skorohod embedding, as a way to construct any centered random walk where the increments have finite variance as a stopped Brownian motion. One approach to conditioning a random walk to lie above some discrete function is to condition the corresponding Brownian motion to lie above some continuous extension of that function. This is a slightly stronger conditioning, and so any approach of this kind must quantify how much stronger. In Section 4 of [BL], the authors do this for the random walk associated with the DGFF conditioned to lie above a polylogarithmic curve.

References

[BD] – Bertoin, Doney – 1994 – On conditioning a random walk to stay nonnegative

[BL] – Biskup, Louidor – 2016 – Full extremal process, cluster law and freezing for two-dimensional discrete Gaussian free field

[DIM] – Durrett, Iglehart, Miller – 1977 – Weak convergence to Brownian meander and Brownian excursion

[Ig1] – Iglehart – 1974 – Functional central limit theorems for random walks conditioned to stay positive

[Ig2] – Iglehart – 1974 – Random walks with negative drift conditioned to stay positive

[Ka] – Kaigh – 1976 – An invariance principle for random walk conditioned by a late return to zero

[KS] – Kesten, Stigum – 1966 – A limit theorem for multidimensional Galton-Watson processes

[Sk] – Skorohod – 1955 – Limit theorems for stochastic processes with independent increments

Doob inequalities and Doob-Meyer decomposition

The first post I wrote on this blog was about martingales, way back in 2012 at a time when I had known what a martingale was for about a month. I now don’t have this excuse. So I’m going to write about a couple of properties of (discrete-time) martingales that came up while adjusting a proof which my thesis examiners suggested could be made much shorter as part of their corrections.

Doob’s submartingale inequality

When we prove that some sequence of processes converges to some other process, we typically want to show that this holds in some sense uniformly over a time-interval, rather than just at some fixed time. We don’t lose much at this level of vagueness by taking the limit process to be identically zero. Then, if the convergent processes are martingales or closely similar, we want to be able to bound \sup_{k\le n} |Z_k| in some sense.

Doob’s submartingale inequality allows us to do this. Recall that a submartingale has almost-surely non-negative conditional increments. You might think of it heuristically as ‘more increasing than a martingale’. If Z_n is a martingale, then |Z_n| is a submartingale. This will be useful almost immediately.

The statement is that for (Z_n) a non-negative submartingale,

\mathbb{P}\left( \sup_{k\le n} Z_k \ge \lambda\right) \le \frac{\mathbb{E}\left[Z_n\right]}{\lambda}.

The similarity of the statement to the statement of Markov’s inequality is no accident. Indeed the proof is very similar. We consider whether the event in question happens, and find lower bounds on the expectation of Z_n under both possibilities.

Formally, for ease of notation, let Z_n^* be the running maximum \sup_{k\le n}Z_k. Then, we let T:= n\wedge \inf\{k\le n, M_j\ge \lambda\} and apply the optional stopping theorem for submartingales at T, which is by construction at most n. That is

\mathbb{E}[Z_n]\ge \mathbb{E}[Z_T]=\mathbb{E}\left[Z_T\mathbf{1}_{Z_n^*<\lambda}\right] + \mathbb{E}\left[Z_T \mathbf{1}_{Z_n^*\ge \lambda}\right].

The first of these summands is positive, and the second is at least \lambda \mathbb{P}\left( Z_N^* \ge \lambda \right), from which the result follows.

We’ve already said that for any martingale Z_n, |Z_n| is a submartingale, but in fact f(Z_n) is a submartingale whenever f is convex, and \mathbb{E}|f(Z_n)|<\infty for each n. Naturally, this continues to hold when Z_n is itself a submartingale.

[Note that Z_n^* is also a submartingale, but this probably isn’t as interesting.]

A particularly relevant such function f is f(x)=x^p, for p>1. If we take Z_n a non-negative submartingale which is uniformly bounded in L^p, then by applying Holder’s inequality and this submartingale inequality, we obtain

\mathbb{E}\left( \sup_{k\le n}Z_n^p \right) \le \left(\frac{p}{p-1}\right)^p \mathbb{E}\left[ Z_n^p \right].

Since Z_n^p is a submartingale, then a limit in n on the RHS is monotone, and certainly a limit in n on the LHS is monotone, so we can extend to

\mathbb{E}\left( \sup_{k\le n}Z_\infty^p \right) \le \left(\frac{p}{1-p}\right)^p \mathbb{E}\left[ Z_\infty^p \right].

Initially, we have to define \mathbb{E}\left[ Z_\infty^p \right] through this limit, but in fact this result, Doob’s L^p inequality, shows that Z_\infty:= \lim Z_n exists almost surely as well.

Naturally, we will often apply this in the case p=2, and in the third of these three sections, we will see why it might be particularly straightforward to calculate \mathbb{E}\left[Z_\infty^2\right].

Remark: as in the case of Markov’s inequality, it’s hard to say much if the submartingale is not taken to be non-negative. Indeed, this effect can be seen even if the process is only defined for a single time step, for which the statement really is then Markov’s inequality.

Doob-Meyer decomposition

Unfortunately, most processes are not martingales. Given an discrete-time process X_n adapted to \mathcal{F}=(\mathcal{F}_n), it is a martingale if the conditional expectations of the increments are all almost surely zero. But given a general adapted process X_n which is integrable (so the increments have well-defined finite expectation), we can iteratively construct a new process M_n, where the increments are centred versions of X_n‘s increments. That is,

M_{n+1}-M_n:= X_{n+1}-X_n - \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right] = X_{n+1}-\mathbb{E}\left[X_{n+1} \,\big|\, \mathcal{F}_n\right]. (*)

Then it’s immediately clear from the definition that M_n is a martingale.

There’s a temptation to tie oneself up in knots with the dependence. We might have that increments of the original process X_n depend on the current value of the process. And is it necessarily clear that we can recover the current value of the original process from the current value of M_n? Well, this is why we demand that everything be adapted, rather than just Markov. It’s not the case that M_n should be Markov, but it clearly is adapted.

Now we look at the middle expression in (*), and in particular the term we are subtracting, namely the conditional expectation. If we define, in the standard terminology, A_0=0 and

A_{n+1}-A_n:= \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right],

then we have decomposed the original process X_n as the sum of a martingale M_n, and this new process A_n. In particular, note that the increment A_{n+1}-A_n given above is adapted to \mathcal{F}_n, which is a stronger condition than being adapted to \mathcal{F}_{n+1} as we would expect a priori. This property of the process (A_n) is called predictability (or possibly previsibility).

This decomposition X_n=X_0+M_n+A_n as just defined is called the Doob-Meyer decomposition, and there is a unique such decomposition where M_n is a martingale, and A_n is predictable. The proof of uniqueness is very straightforward. We look at the equalities given above as definitions of M_n,A_n, but then work in the opposite direction to show that they must hold if the decomposition holds.

I feel a final heuristic is worthwhile, using the term drift, more normally encountered in the continuous-time setting to describe infinitissimal expected increments. The increments of A_n represent the drift of X_n, and the increments of M_n are what remains from X_n after subtracting the drift. In general, the process to be subtracted to turn a non-martingale into a martingale is called a compensator, and the existence or otherwise of such processes is important but challenging for some classes of continuous-time processes.

In particular, note that when X_n is itself a martingale, then A_n\equiv 0. However, probably the most useful case is when X_n is a submartingale, as then the drift is always non-negative, and so A_n is almost surely increasing. The converse holds too.

This is relevant because this Doob-Meyer decomposition is obviously only a useful tool for treating X_n if we can handle the two processes M_n,A_n easily. We have tools to bound the martingale term, but this previsible term might in general be tricky, and so the case where X_n is a submartingale is good, as increasing processes are much easier than general processes, since bounding the whole process might involve only bounding the final term in many contexts.

Predictable quadratic variation

A particularly relevant example is the square of a martingale, that is X_n=M_n^2, where M_n is a martingale. By the convexity condition discussed earlier, X_n is a submartingale (provided it is integrable, ie M_n is square-integrable), and so the process A_n in its Doob-Meyer decomposition is increasing. This is often called the (predictable) quadratic variation of (X_n).

This predictable quadratic variation is sometimes denoted \langle X_n\rangle. This differs from the (regular) quadratic variation which is defined as the sum of the squares of the increments, that is [X_n]:= \sum_{k=0}^{n-1} (X_{k+1}-X_k)^2. Note that this is adapted, but obviously not previsible. The distinction between these two processes is more important in continuous time. There, they are almost surely equal for a continuous local martingale, but not for eg a Poisson process. (For a Poisson process, the PQV is deterministic, indeed linear, while the (R)QV is almost surely equal to the Poisson process itself.) In the discrete time setting, the regular quadratic variation is not relevant very often, while the predictable quadratic variation is useful, precisely because of this decomposition.

Whenever we have random variables which we then centre, there is a standard trick to apply when treating their variance. That is

A_{n+1}-A_n= \mathbb{E}\left[ M^2_{n+1}-M^2_n \,\big|\, \mathcal{F}_n\right]
= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n^2 +M_n^2
= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n \mathbb{E}\left[ M_{n+1}\,\big|\, \mathcal{F}_n\right] + M_n^2
= \mathbb{E}\left[ \left(M_{n+1}-M_n\right)^2\,\big|\, \mathcal{F}_n\right].

One consequence is seen by taking an ‘overall’ expectation. Because M_n^2-A_n is a martingale,

\mathbb{E}\left[M_n^2\right] = \mathbb{E}\left[A_n\right] = \mathbb{E}\left[M_0^2\right] + \sum_{k=0}^{n-1} \mathbb{E}\left[A_{k+1}-A_k\right]
= \mathbb{E}\left[ M_0^2\right] + \sum_{k=0}^{n-1}\mathbb{E}\left[ \left(M_{k+1}-M_k\right)^2 \right]. (**)

This additive (Pythagorean) property of the square of a martingale is useful in applications where there is reasonably good control on each increment separately.

We can also see this final property without the Doob-Meyer decomposition. For a martingale it is not the case that the increments on disjoint intervals are independent. However, following Williams 12.1 [1], disjoint intervals are orthogonal, in the sense that

\mathbb{E}\left[(M_t-M_s)(M_v-M_u)\right]=0,

whenever s\le t\le u\le v. Then, when we square the expression M_n=M_0+\sum M_{k+1}-M_k, and take expectations, all the cross terms vanish, leaving precisely (*).

References

[1] Williams – Probability with Martingales

I also followed the notes I made in 2011/12 while attending Perla Sousi’s course on Advanced Probability, and Arnab Sen’s subsequent course on Stochastic Calculus, though I can’t find any evidence online for the latter now.

Skorohod embedding

Background

Suppose we are given a standard Brownian motion (B_t), and a stopping time T. Then, so long as T satisfies one of the regularity conditions under which the Optional Stopping Theorem applies, we know that \mathbb{E}[B_T]=0. (See here for a less formal introduction to OST.) Furthermore, since B_t^2-t is a martingale, \mathbb{E}[B_T^2]=\mathbb{E}[T], so if the latter is finite, so is the former.

Now, using the strong Markov property of Brownian motion, we can come up with a sequence of stopping times 0=T_0, T_1, T_2,\ldots such that the increments T_k-T_{k-1} are IID with the same distribution as T. Then 0,B_{T_1},B_{T_2},\ldots is a centered random walk. By taking T to be the hitting time of \{-1,+1\}, it is easy to see that we can embed simple random walk in a Brownian motion using this approach.

p1020956_compressedEmbedding simple random walk in Brownian motion.

The Skorohod embedding question asks: can all centered random walks be constructed in this fashion, by stopping Brownian motion at a sequence of stopping time? With the strong Markov property, it immediately reduces the question of whether all centered finite-variance distributions X can be expressed as B_T for some integrable stopping time T.

The answer to this question is yes, and much of what follows is drawn from, or at least prompted by Obloj’s survey paper which details the problem and rich history of the many approaches to its solution over the past seventy years.

Applications and related things

The relationship between random walks and Brownian motion is a rich one. Donsker’s invariance principle asserts that Brownian motion appears as the scaling limit of a random walk. Indeed, one can construct Brownian motion itself as the limit of a sequence of consistent random walks with normal increments on an increasingly dense set of times. Furthermore, random walks are martingales, and we know that continuous, local martingales can be expressed as a (stochastically) time-changed Brownian motion, from the Dubins-Schwarz theorem.

The Skorohod embedding theorem can be used to prove results about random walks with general distribution by proving the corresponding result for Brownian motion, and checking that the construction of the sequence of stopping times has the right properties to allow the result to be carried back to the original setting. It obviously also gives a coupling between a individual random walk and a Brownian motion which may be useful in some contexts, as well as a coupling between any pair of random walks. This is useful in proving results for random walks which are much easier for special cases of the distribution. For example, when the increments are Gaussian, or when there are combinatorial approaches to a problem about simple random walk. At the moment no aspect of this blog schedule is guaranteed, but I plan to talk about the law of the iterated logarithm shortly, whose proof is approachable in both of these settings, as well as for Brownian motion, and Skorohod embedding provides the route to the general proof.

At the end, we will briefly compare some other ways to couple a random walk and a Brownian motion.

Adding extra randomness

One thing we could do is sample a copy of X independently from the Brownian motion, then declare T= \tau_{X}:= \inf\{t\ge 0: B_t=X\}, the hitting time of (random value) X. But recall that unfortunately \tau_x has infinite expectation for all non-zero x, so this doesn’t fit the conditions required to use OST.

Skorohod’s original method is described in Section 3.1 of Obloj’s notes linked above. The method is roughly to pair up positive values taken by X appropriately with negative values taken by X in a clever way. If we have a positive value b and a negative value a, then \tau_{a,b}, the first hitting time of \mathbb{R}\backslash (a,b) is integrable. Then we choose one of these positive-negative pairs according to the projection of the distribution of X onto the pairings, and let T be the hitting time of this pair of values. The probability of hitting b conditional on hitting {a,b} is easy to compute (it’s \frac{-a}{b-a}) so we need to have chosen our pairs so that the ‘probability’ of hitting b (ie the density) comes out right. In particular, this method has to start from continuous distributions X, and treat atoms in the distribution of X separately.

The case where the distribution X is symmetric (that is X\stackrel{d}=-X) is particularly clear, as then the pairs should be (-x,x).

However, it feels like there is enough randomness in Brownian motion already, and subsequent authors showed that indeed it wasn’t necessary to introduce extra randomness to provide a solution.

One might ask whether it’s possible to generate the distribution on the set of pairs (as above) out of the Brownian motion itself, but independently from all the hitting times. It feels like it might be possible to make the distribution on the pairs measurable with respect to

\mathcal{F}_{0+} = \bigcap\limits_{t>0} \mathcal{F}_t,

the sigma-algebra of events determined by limiting behaviour as t\rightarrow 0 (which is independent of hitting times). But of course, unfortunately \mathcal{F}_{0+} has a zero-one law, so it’s not possible to embed non-trivial distributions there.

Dubins solution

The exemplar for solutions without extra randomness is due to Dubins, shortly after Skorohod’s original argument. The idea is to express the distribution X as the almost sure limit of a martingale. We first use the hitting time of a pair of points to ‘decide’ whether we will end up positive or negative, and then given this information look at the hitting time (after this first time) of two subsequent points to ‘decide’ which of four regions of the real interval we end up in.

I’m going to use different notation to Obloj, corresponding more closely with how I ended up thinking about this method. We let

a_+:= \mathbb{E}[X \,|\, X>0], \quad a_- := \mathbb{E}[X\,|\, X<0], (*)

and take T_1 = \tau_{\{a_-,a_+\}}. We need to check that

\mathbb{P}\left( B_{T_1}=a_+\right) = \mathbb{P}\left(X>0\right),

for this to have a chance of working. But we know that

\mathbb{P}\left( B_{T_1}=a_+\right) = \frac{a_+}{a_+-a_-},

and we can also attack the other side using (*) and the fact that \mathbb{E}[X]=0, using the law of total expectation:

0=\mathbb{E}[X]=\mathbb{E}[X\,|\, X>0] \mathbb{P}(X>0) + \mathbb{E}[X\,|\,X<0]\mathbb{P}(X<0) = a_+ \mathbb{P}(X>0) + a_- \left(1-\mathbb{P}(X>0) \right),

\Rightarrow\quad \mathbb{P}(X>0)=\frac{a_+}{a_+-a_-}.

Now we define

a_{++}=\mathbb{E}[X \,|\, X>a_+],\quad a_{+-}=\mathbb{E}[X\,|\, 0<X<a_+],

and similarly a_{-+},a_{--}. So then, conditional on B_{T_1}=a_+, we take

T_2:= \inf_{t\ge T_1}\left\{ B_t\not\in (a_{+-},a_{++})  \right\},

and similarly conditional on B_{T_1}=a_-. By an identical argument to the one we have just deployed, we have \mathbb{E}\left[B_{T_2} \,|\,\mathcal{F}_{T_1} \right] = B_{T_1} almost surely. So, although the a_{+-+} notation now starts to get very unwieldy, it’s clear we can keep going in this way to get a sequence of stopping times 0=T_0,T_1,T_2,\ldots where B_{T_n} determines which of the 2^n regions of the real line any limit \lim_{m\rightarrow\infty} B_{T_m} should lie in.

A bit of work is required to check that the almost sure limit T_n\rightarrow T is almost surely finite, but once we have this, it is clear that B_{T_n}\rightarrow B_T almost surely, and B_T has the distribution required.

Komlos, Major, Tusnady coupling

We want to know how close we can make this coupling between a centered random walk with variance 1, and a standard Brownian motion. Here, ‘close’ means uniformly close in probability. For large times, the typical difference between one of the stopping times 0,T_1,T_2,\ldots in the Skorohod embedding and its expectation (recall \mathbb{E}[T_k]=k) is \sqrt{n}. So, constructing the random walk S_0,S_1,S_2,\ldots from the Brownian motion via Skorohod embedding leads to

\left |S_k - B_k \right| = \omega(n^{1/4}),

for most values of k\le n. Strassen (1966) shows that the true scale of the maximum

\max_{k\le n} \left| S_k - B_k \right|

is slightly larger than this, with some extra powers of \log n and \log\log n as one would expect.

The Komlos-Major-Tusnady coupling is a way to do a lot better than this, in the setting where the distribution of the increments has a finite MGF near 0. Then, there exists a coupling of the random walk and the Brownian motion such that

\max_{k\le n}\left|S_k- B_k\right| = O(\log n).

That is, there exists C such that

\left[\max_{k\le n} \left |S_k-B_k\right| - C\log n\right] \vee 0

is a tight family of distributions, indeed with uniform exponential tail. To avoid digressing infinitely far from my original plan to discuss the proof of the law of iterated logarithm for general distributions, I’ll stop here. I found it hard to find much coverage of the KMT result apart from the challenging original paper, and many versions expressed in the language of empirical processes, which are similar to random walks in many ways relevant to convergence and this coupling, but not for Skorohod embedding. So, here is a link to some slides from a talk by Chatterjee which I found helpful in getting a sense of the history, and some of the modern approaches to this type of normal approximation problem.

DGFF 3 – Gibbs-Markov property for entropic repulsion

In the previous post, we saw that it isn’t much extra effort to define the DGFF with non-zero boundary conditions, by adding onto the zero-BC DGFF the unique (deterministic) harmonic function which extends the boundary values into the domain. We also saw how a Gibbs-Markov property applies, whereby the values taken by the field on some sub-region A\subset D depend on the values taken on D\backslash A only through values taken on \partial A.

In this post, we look at how this property and some other methods are applied by Deuschel [1] to study the probability that the DGFF on a large box in \mathbb{Z}^d is positive ‘everywhere’. This event can be interpreted in a couple of ways, all of which are referred to there as entropic repulsion. Everything which follows is either taken directly or paraphrased directly from [1]. I have tried to phrase this in a way which avoids repeating most of the calculations, instead focusing on the methods and the motivation for using them.

Fix dimension d\ge 2 throughout. We let P^0_N be the law of the DGFF on V_N:=[-N,N]^d\subset \mathbb{Z}^d with zero boundary conditions. Then for any subset A\subset \mathbb{Z}^d, in an intuitively-clear abuse of notation, we let

\Omega^+(A):= \{ h_x\ge 0, x\in A\},

be the event that some random field h takes only non-negative values on A. The goal is to determine P^0_N ( \Omega^+(V_N)). But for the purposes of this post, we will focus on showing bounds on the probability that the field is non-negative on a thin annulus near the boundary of V_N, since this is a self-contained step in the argument which contains a blog-friendly number of ideas.

We set (L_N) to be a sequence of integers greater than one (to avoid dividing by zero in the statement), for which \frac{L_N}{N}\rightarrow 0. We now define for each N, the annulus

W_N = \{v\in V_N: L_N\le d_{\mathbb{Z}^d}(v, V_N^c)\le 2L_N \}

with radius L_N set a distance L_N inside the box V_N. We aim to control P^N_0 (\Omega^+(W_N)). This forms middle steps of Deuschel’s Propositions 2.5 and 2.9, which discuss P^N_0(\Omega^+(V_{N-L_N})). Clearly there is the upper bound

P^N_0(\Omega^+(V_{N-L_N})) \le P^N_0(\Omega^+(W_N)) (1)

and a lower bound on P^N_0(\Omega^+(V_{N-L_N})) is obtained in the second proposition by considering the box as a union of annuli then combining the bounds on each annulus using the FKG inequality.

Upper bound via odds and evens

After removing step (1), this is Proposition 2.5:

\limsup_{N\rightarrow \infty} \frac{L_N}{N^{d-1} \log L_N} \log P^N_0(\Omega^+(W_N)) < 0. (2)

This is giving a limiting upper bound on the probability of the form L_N^{-CN^{d-1}/L_N}, though as with all LDP estimates, the form given at (2) is more instructive.

Morally, the reason why it is unlikely that the field should be non-negative everywhere within the annulus is that the distribution at each location is centred, and even though any pair of values are positively correlated, this correlation is not strong enough to avoid this event being unlikely. But this is hard to corral into an upper bound argument directly. In many circumstances, we want to prove upper bounds for complicated multivariate systems by projecting to get an unlikely event for a one-dimensional random variable, or a family of independent variables, even if we have to throw away some probability. We have plenty of tools for tail probabilities in both of these settings. Since the DGFF is normal, a one-dimensional RV that is a linear combination (eg the sum) of all the field heights is a natural candidate. But in this case we would have thrown away too much probability, since the only way we could dominate is to demand that the sum \sum_{x\in W_N}h^N_x\ge 0, which obviously has probability 1/2 by symmetry. (3)

So Deuschel splits W_N into W_N^o,W_N^e, where the former includes all vertices with odd total parity in W_N and the latter includes all the vertices with even total parity in the interior of W_N. (Recall that \mathbb{Z}^d is bipartite in exactly this fashion). The idea is to condition on h^N\big|_{W^o_N}. But obviously each even vertex is exactly surrounded by odd vertices. So by the Gibbs-Markov property, conditional on the odd vertices, the values of the field at the even vertices are independent. Indeed, if for each v\in W_N^e we define \bar h_v to be the average of its neighbours (which is measurable w.r.t to the sigma-algebra generated by the odd vertices), then

\{h_v: v\in W_N^e \,\big|\, \sigma(h_w: w\in W_N^o)\},

is a collection of independent normals with variance one, and where the mean of h_v is \bar h_v.

To start finding bounds, we fix some threshold m=m_N\gg 1 to be determined later, and consider the odd-measurable event A_N that at most half of the even vertices v have \bar h_v\ge m. So A_N^c\cap \Omega^+(W_N) says that all the odd vertices are non-negative and many are quite large. This certainly feels like a low-probability event, and unlike at (3), we might be able to obtain good tail bounds by projection into one dimension.

In the other case, conditional on A_N, there are a large number of even vertices with conditional mean at most m, and so we can control the probability that at least one is negative as a product

(1-\varphi(m))^{\frac12 |W_N^e|}. (4)

Note that for this upper bound, we can completely ignore the other even vertices (those with conditional mean greater than m).

So we’ll go back to A_N^c \cap \Omega^+(W_N). For computations, the easiest one-dimensional variable to work with is probably the mean of the \bar h_vs across v\in W_N^e, since on A_N^c\cap \Omega^+(W_N) this is at least \frac{m}{2}. Rather than focus on the calculations themselves involving

\bar S^e_N:= \frac{1}{|W_N^e|} \sum\limits_{v\in W_N^e} \bar h_v,

let us remark that it is certainly normal and centered, and so there are many methods to bound its tail, for example

P^0_N \left( \bar S^e_N \ge \frac{m}{2} \right) \le \exp\left( \frac{-m^2}{8\mathrm{Var}(\bar S^e_N)} \right), (5)

as used by Deuschel just follows from an easy comparison argument within the integral of the pdf. We can tackle the variance using the Green’s function for the random walk (recall the first post in this set). But before that, it’s worth making an observation which is general and useful, namely that \bar S^e_N is the expectation of

S^e_N:= \sum{1}{|W_N^e|}\sum\limits_{v\in W_N^e} h_v

conditional on the odds. Directly from the law of total variance, the variance of any random variable X is always larger than the variance of \mathbb{E}[X|Y].

So in this case, we can replace \mathrm{Var}(\bar S^e_N) in (5) with \mathrm{Var}(S^e_N), which can be controlled via the Green’s function calculation.

Finally, we choose m_N so that the probability at (4) matches the probability at (5) in scale, and this choice leads directly to (2).

In summary, we decomposed the event that everything is non-negative into two parts: either there are lots of unlikely local events in the field between an even vertex and its odd neighbours, or the field has to be atypically large at the odd sites. Tuning the parameter m_N allows us to control both of these probabilities in the sense required.

Lower bound via a sparse sub-lattice

To get a lower bound on the probability that the field is non-negative on the annulus, we need to exploit the positive correlations in the field. We use a similar idea to the upper bound. If we know the field is positive and fairly large in many places, then it is increasingly likely that it is positive everywhere. The question is how many places to choose?

We are going to consider a sub-lattice that lives in a slightly larger region than W_N itself, and condition the field to be larger than m=m_N everywhere on this lattice. We want the lattice to be sparse enough that even if we ignore positive correlations, the chance of this happening is not too small. But we also want the lattice to be dense enough that, conditional on this event, the chance that the field is actually non-negative everywhere in W_N is not too small either.

To achieve this, Deuschel chooses a sub-lattice of width \lfloor\epsilon L_N^{2/d}\rfloor, and sets \Lambda_N(\epsilon) to be the intersection of this with the annulus with radii [N-\frac{5}{2}L_N, N-\frac{1}{2}L_N], to ensure it lives in a slightly larger region than W_N itself. The scaling of this sub-lattice density is such that when a random walk is started at any v\in W_N, the probability that the RW hits \Lambda_N(\epsilon) before \partial V_N is asymptotically in (0,1). (Ie, not asymptotically zero or one – this requires some definitely non-trivial calculations.) In particular, for appropriate (ie large enough) choice of \epsilon, this probability is at least 1/2 for all v\in W_N. This means that after conditioning on event B_N:=\{h_v\ge m : v\in \Lambda_N(\epsilon)\}, the conditional expectation of h_w is at least \frac{m}{2} for all w\in W_N\backslash \Lambda_N(\epsilon). Again this uses the Gibbs-Markov property and the Gaussian nature of the field. In particular, this conditioning means we are left with the DGFF on V_N\backslash \Lambda_N(\epsilon), ie with boundary \partial V_N\cup \Lambda_N(\epsilon), and then by linearity, the mean at non-boundary points is given by the harmonic extension, which is linear (and so increasing) in the boundary values.

At this point, the route through the calculations is fairly clear. Since we are aiming for a lower bound on the probability of the event \Omega^+(W_N), it’s enough to find a lower bound on P^0_N(\Omega^+(W_N)\cap B).

Now, by positive correlation (or, formally, the FKG inequality) we can control P^0_N(B) just as a product of the probabilities that the field exceeds the threshold at each individual site in \Lambda_N(\epsilon). Since the value of the field at each site is normal with variance at least 1 (by definition), this is straightforward.

Finally, we treat P^0_N(\Omega^+(W_N) \,\big|\, B). We’ve established that, conditional on B, the mean at each point of W_N\backslash \Lambda_N(\epsilon) is at least \frac{m}{2}, and we can bound the variance above too. Again, this is a conditional variance, and so is at most the corresponding original variance, which is bounded above by \sigma_N^2:=\mathrm{Var}(h^N_0). (This fact that the variance is maximised at the centre is intuitively clear when phrased in terms of occupation times, but the proof is non-obvious, or at least non-obvious to me.)

Since each of the event h_v^N\ge 0 for v\in W_N\backslash \Lambda_N(\epsilon) is positively correlated with B, we can bound the probability it holds for all v by the product of the probabilities that it holds for each v. But having established that the conditional mean is at least \frac{m_N}{2} for each v, and the variance is uniformly bounded above (including in N), this gives an easy tail bound of the form we require.

Again it just remains to choose the sequence of thresholds m_N to maximise the lower bound on the probability that we’ve found in this way. In both cases, it turns out that taking m_N= \sqrt{C\log N} is sensible, and this turns out to be linked to the scaling of the maximum of the DGFF, which we will explore in the future.

References

[1] – J-D Deuschel, Entropic Repulsion of the Lattice Free Field, II. The 0-Boundary Case. Available at ProjectEuclid.

DGFF 2 – Boundary conditions and Gibbs-Markov property

In the previous post, we defined the Discrete Gaussian Free Field, and offered some motivation via the discrete random walk bridge. In particular, when the increments of the random walk are chosen to be Gaussian, many natural calculations are straightforward, since Gaussian processes are well-behaved under conditioning and under linear transformations.

Non-zero boundary conditions

In the definition of the DGFF given last time, we demanded that h\equiv 0 on \partial D. But the model is perfectly well-defined under more general boundary conditions.

It’s helpful to recall again the situation with random walk and Brownian bridge. If we want a Brownian motion which passes through (0,0) and (1,s), we could repeat one construction for Brownian bridge, by taking a standard Brownian motion and conditioning (modulo probability zero technicalities) on passing through level s at time 1. But alternatively, we could set

B^{\mathrm{drift-br}}(t) = B(t)+ t(s-B(1)),\quad t\in[0,1],

or equivalently

B^{\mathrm{drift-br}}(t)=B^{\mathrm{br}}(t)+ st, \quad t\in[0,1].

That is, a Brownian bridge with drift can be obtain from a centered Brownian bridge by a linear transformation, and so certainly remains a Gaussian process. And exactly the same holds for a discrete Gaussian bridge: if we want non-zero values at the endpoints, we can obtain this distribution by taking the standard centred bridge and applying a linear transformation.

We can see how this works directly at the level of density functions. If we take 0=Z_0,Z_1,\ldots,Z_{N-1},Z_N=0 a centred Gaussian bridge, then the density of Z=\mathbf{z}\in \mathbb{R}^{N+1} is proportional to

\mathbf{1}\{z_0=z_N=0\}\exp\left( -\frac12 \sum_{i=1}^N (z_i-z_{i-1})^2 \right). (3)

So rewriting z_i= y_i- ki (where we might want k=s/N to fit the previous example), the sum within the exponent rearranges as

-\frac12 \sum_{i=1}^N (y_i-y_{i-1} - k)^2 = -\frac12 \sum_{i=1}^N (y_i-y_{i-1})^2 - 2k(y_N-y_0)+ Nk^2.

So when the values at the endpoints z_0,z_n,y_0,y_N are fixed, this middle term is a constant, as is the final term, and thus the density of the linearly transformed bridge has exactly the same form as the original one.

In two or more dimensions, the analogue of adding a linear function is to add a harmonic function. First, some notation. Let \varphi be any function on \partial D. Then there is a unique harmonic extension of \varphi, for which \nabla \varphi=0 everywhere on D, the interior of the domain. Recall that \nabla is the discrete graph Laplacian defined up to a constant by

(\nabla \varphi) _x = \sum\limits_{x\sim y} \varphi_x - \varphi_y.

If we want h^D instead to have boundary values \varphi, it’s enough to replace h^D with h^D+\varphi. Then, in the density for the DGFF ( (1) in the previous post), the term in the exponential becomes (ignoring the \frac{1}{4d} )

-\sum\limits_{x\sim y} \left[ (h^D_x-h^D_y)^2 + (\varphi_x-\varphi_y)^2 +2(h^D_x - h^D_y)(\varphi_x-\varphi_y)\right].

For each x\in D, on taking this sum over its neighbours y\in \bar D, the final term vanishes (since \varphi is harmonic), while the second term is just a constant. So the density of the transformed field, which we’ll call h^{D,\varphi} is proportional to (after removing the constant arising from the second term above)

\mathbf{1}\left\{h^{D,\varphi}_x = \varphi_x,\, x\in\partial D\right\} \exp\left( -\frac{1}{4d} \sum\limits_{x\sim y} \left( h^{D,\varphi}_x - h^{D,\varphi}_y \right)^2 \right).

So h^{D,\varphi}:= h^D + \varphi satisfies the conditions for the DGFF on D with non-zero boundary conditions \varphi.

Harmonic functions and RW – a quick review

Like the covariances in DGFF, harmonic functions on D are related to simple random walk on D stopped on \partial D. (I’m not claiming a direct connection right now.) We can define the harmonic extension \varphi to an interior point x by taking \mathbb{P}_x to be the law of SRW x=Z_0,Z_1,Z_2,\ldots started from x, and then setting

\varphi(x):= \mathbb{E}\left[ \varphi_{\tau_{\partial d}} \right],

where \tau_{\partial D} is the first time that the random walk hits the boundary.

Inverse temperature – a quick remark

In the original definition of the density of the DGFF, there is the option to add a constant \beta>0 within the exponential term so the density is proportional to

\exp\left(-\beta \sum\limits_{x\sim y} (h_x-h_y)^2 \right).

With zero boundary conditions, the effect of this is straightforward, as varying \beta just rescales the values taken by the field. But with non-zero boundary conditions, the effect is instead to vary the magnitude of the fluctuations of the values of the field around the (unique) harmonic function on the domain with those BCs. In particular, when \beta\rightarrow \infty, the field is ‘reluctant to be far from harmonic’, and so h^D \Rightarrow \varphi.

This parameter \beta is called inverse temperature. So low temperature corresponds to high \beta, and high stability, which fits some physical intuition.

A Markov property

For a discrete (Gaussian) random walk, the Markov property says that conditional on a given value at a given time, the trajectory of the process before this time is independent of the trajectory afterwards. The discrete Gaussian bridge is similar. Suppose we have as before 0=Z_0,Z_1,\ldots, Z_N=0 a centred Gaussian bridge, and condition that Z_k=y, for k\in\{1,\ldots,N-1\}, and y\in\mathbb{R}. With this conditioning, the density (3) splits as a product

\mathbf{1}\{z_0=z_N=0, z_k=y\}\exp\left(-\frac12 \sum\limits_{i=1}^N (z_i-z_{i-1})^2 \right) =

\mathbf{1}\{z_0=0,z_k=y\} \exp\left(-\frac12 \sum\limits_{i=1}^k (z_i-z_{i-1})^2 \right) \cdot \mathbf{1}\{z_k=y,z_N=0\} \exp\left(-\frac12 \sum\limits_{i=k+1}^N (z_i-z_{i-1})^2 \right).

Therefore, with this conditioning, the discrete Gaussian bridge splits into a pair of independent discrete Gaussian bridges with drift. (The same would hold if the original process had drift too.)

The situation for the DGFF is similar, though rather than focusing on the condition, it makes sense to start by focusing on the sub-domain of interest. Let A\subset D, and take B=\bar D\backslash A. So in particular \partial A\subset B.

img_20161106_194123472_compressedThen we have that conditional on h^D\big|_{\partial A}, the restricted fields h^D\big|_{B\backslash \partial A} and h^D\big|_A are independent. Furthermore, h^D\big|_A has the distribution of the DGFF on A, with boundary condition given by h^D\big|_{\partial A}. As in the discrete bridge, this follows just by splitting the density. Every gradient term corresponds to an edge in the underlying graph that lies either entirely inside \bar A or entirely inside B. This holds for a general class of Gibbs models where the Hamiltonian depends only on the sum of some function of the heights (taken to be constant in this ‘free’ model) and the sum of some function of their nearest-neighbour gradients.

One additional and useful interpretation is that if we only care about the field on the restricted region A, the dependence of h^D\big|_A on h^D\big|_{D\backslash A} comes only through h^D\big|_{\partial A}. But more than that, it comes only through the (random) harmonic function which extends the (random) values taken on the boundary of A to the whole of A. So, if h^A is an independent DGFF on A with zero boundary conditions, we can construct the DGFF h^D from its value on D\backslash A via

h^D_x \stackrel{d}= h^A_x + \varphi^{h^D\big|_{\partial A}},

where \varphi^{h^D\big|_{\partial A}} is the unique harmonic extension of the (random) values taken by h^D on \partial A to \bar A.

This Markov property is crucial to much of the analysis to come. There are several choices of the restricted domain which come up repeatedly. In the next post we’ll look at how much one can deduce by taking A to be the even vertices in D (recalling that every integer lattice \mathbb{Z}^d is bipartite), and then taking A to be a finer sublattice within D. We’ll use this to get some good bounds on the probability that the DGFF is positive on the whole of D. Perhaps later we’ll look at a ring decomposition of \mathbb{Z}^d consisting of annuli spreading out from a fixed origin. Then the distribution of the field at this origin can be considered, via the final idea discussed above, as the limit of an infinite sequence of random harmonic functions given by the values taken by the field at increasingly large radius from the origin. Defining the DGFF on the whole lattice depends on the existence or otherwise of this local limit.

DGFF 1 – The discrete Gaussian free field from scratch

I’ve moved to Haifa in northern Israel to start a post-doc in the probability group at the Technion, and now that my thesis is finished I want to start blogging again. The past couple of weeks have been occupied with finding an apartment and learning about the Discrete Gaussian Free Field. All questions about the apartment are solved, but fortunately lots remain open about the DGFF, so I thought I’d write some background about this object and methods which have been used to study it.

Background – Random walk bridge

When we think of a random walk, we usually think of the index as time, normally going forwards. So for a random walk bridge, we might assume Z_0=0, and then condition on Z_N=0, thinking of this as a demand that the process has returned to zero at the future time. In some applications, this is the ideal intuition, but in others, it is more useful to think of the random walk bridge

(0=Z_0,Z_1,\ldots,Z_{N-1},Z_N=0),

as a random height function indexed by [0,N], where the probability of a given path decomposes naturally into a product depending on the N increments, up to a normalising constant.

Naturally, we are interested in the asymptotic behaviour of such a random walk bridge when N\rightarrow\infty. So long as the step distribution has finite variance, a conditioned version of Donsker’s theorem shows that the rescaled random walk bridge converges in distribution to Brownian bridge. Note that Brownian bridge

(B^{\mathrm{br}}_t, t\in[0,1])

can be constructed either by conditioning a standard Brownian motion B to return to zero at time one (modulo some technicalities – this event has zero probability), or by applying an appropriate (random) linear shift

B^{\mathrm{br}}(t):= B(t) - tB(1). (*)

It is not too hard to calculate the distribution of B^{\mathrm{br}}(t) for each t\in[0,1], and with a bit more work, one can calculate the joint distribution of (B^{\mathrm{br}}(s),B^{\mathrm{br}}(t)). In particular, the joint distribution is multivariate Gaussian, and so everything depends on the covariance ‘matrix’ (which here is indexed by [0,1]).

So if we return to a random walk bridge what should the step distribution be? Simple symmetric RW is a natural choice, as then lots of the quantities we might want to consider boil down to combinatorial calculations. Cleverness and Stirling’s formula can often get us useful asymptotics. But there are lots of inconveniences, not least the requirement to be careful about parity (N has to be even for a start unless you make the walk lazy, in which case the combinatorics becomes harder), and even if these can be overcome in a given calculation, it would be better not to have this.

The claim is that the random walk with Gaussian increments is by far the easiest to analyse asymptotically. As a further heuristic, think about the statement of the central limit theorem in the case where the underlying distribution is normal: it’s true but obvious. [Indeed, it’s my favourite piece of advice to anyone taking second year probability exams to check that your proposed statement of CLT does actually work for N(\mu,\sigma^2)…] More concretely, if a RW has Gaussian increments, then the path (Z_1,\ldots,Z_N) is a multivariate normal, or a Gaussian process with finite index set. In particular, covariances define the distribution. It remains a Gaussian process after conditioning on Z_N=0, and the linear tilting argument at (*) remains true here, and can indeed be applied to turn any boundary conditions into any other boundary conditions.

The discrete Gaussian free field

We know how to generalise the domain of a random walk to higher dimensions. But what generalising the index to higher dimension? So now there is definitely no arrow of time, and the notion of a random height function above \mathbb{Z}^2 (or a subset of it) is helpful, for which a scaling limit might be a random surface rather than Brownian motion.

Because we can’t well-order \mathbb{Z}^d, it’s harder to define any such random object on the entire lattice immediately, so we start with compact connected subsets, with zero boundary conditions, as in the one-dimensional case of random walk bridge. Formally, let D be a finite subset of \mathbb{Z}^d, and the boundary \partial D those elements of D^c which are adjacent to an element of D, and let \bar D:= D\cup \partial D.

Then, the discrete Gaussian free field on D is a random real vector h^D=(h^D_x: x\in \bar D), with probability density proportional to

\mathbf{1}\{h^D_x=0, x\in\partial D\}\exp\left ( - \frac{1}{4d} \sum_{x\sim y}(h^D_x - h^D_y)^2 \right), (1)

where we write x\sim y if that x,y are adjacent in \bar D. We won’t at any stage worry much about the partition function which normalises this pdf. Note also that \frac{1}{4d} is just a convenient choice of constant, which corresponds to one of the canonical choices for the discrete Laplacian. Adjusting this constant is the same as uniformly rescaling the values taken by the field.

The immediate interpretation of (1) is that the values taken by the field at vertices which are close to each other are positively correlated. Furthermore, the form of the density is Gaussian. Concretely, if the values of h^D are fixed everywhere except one vertex x\in D, then the conditional distribution of h^D_x is Gaussian. Later, or in subsequent posts, we will heavily develop this idea. Alternatively, we could if we really wanted describe the model in terms of independent Gaussians describing the ‘increment’ along each edge in D (which we should direct), subject to a very large number of conditions, namely that the sum of increments along any directed cycle is zero. This latter description might be more useful if you wanted to define a DGFF on a more sparse graph, but won’t be useful in what follows.

Note that we can rearrange the Laplacian in (1) in terms of the transition kernel p( ) of the simple random walk of D to obtain

\exp\left( -\frac12 (h^D)^T (\mathbf{P}-\mathbf{1})h^D \right),

where P_{x,y}=p(y-x) is the transition matrix of SRW on D. In particular, this means that the free field is Gaussian, and we can extract the covariances via

\mathrm{Cov}(h^D_x,h^D_y) = \left[ (\mathbf{1}-\mathbf{P})^{-1}\right]_{x,y}

= \left[\sum_{n\ge 0} \mathbf{P}^n\right]_{x,y} = \sum_{n\ge 0} \mathbb{P}_x\left[X_n=y,\tau_{\partial D}>n\right],

where, under \mathbb{P}_x, (X_0,X_1,\ldots) is simple random walk started from x.

This final quantity records the expected number of visits to y before leaving the domain D, for a random walk started at x, and is called the Green’s function.

In summary, the DGFF on D is the centred Gaussian random vector indexed by \bar D with covariance given by the Green’s function G_D(x,y).

How many of these equivalences carries over to more general D-indexed random fields is discussed in the survey paper by Velenik. But it’s worth emphasising that having the covariance given by the Green’s function as in the definition we’ve just given is a very nice property, as there are lots of pre-existing tools for calculating these. By contrast, it’s hard to think of a natural model for an integer-valued surface of this kind, as an analogue to SRW.

[Though definitely not impossible. The nicest example I’ve heard of is for height functions of large uniform domino tilings within their ‘arctic circle’, which have GFF asymptotics. See this paper by Kenyon.]

A continuous limit?

We motivated the discussion of random walk bridge by the limit object, namely Brownian bridge. Part of the reason why the DGFF is more interesting than Gaussian random walk bridge, is that the limit object, the (continuum) Gaussian free field is hard to define classically in two dimensions.

We might suppose that the DGFF in V_N, the square box of width N has some scaling limit as N\rightarrow\infty. However, for fixed x,y\in [0,1]^2, (and taking integer parts component-wise), well-known asymptotics for SRW in a large square lattice (more on this soon hopefully) assert that

\mathrm{Cov}(h^{V_N}_{\lfloor Nx \rfloor},h^{V_N}_{\lfloor Ny\rfloor}) \sim \log |x-y|, (2)

and so any scaling limit will rescale only the square domain, not the height (since there is no N on the RHS of (2)). However, then the variance of the proposed limit is infinite everywhere.

So the GFF does not exist as a random height function on [0,1]^2, with the consequence that a) more care is needed over its abstract definition; b) the DGFF in 2D on a large square is an interesting object, since it does exist in this sense.

What makes it ‘free’?

This seemed like a natural question to ask, but I’ve received various answers. Some sources seem to suggest that having zero boundary condition is free. Other sources refer to the Hamiltonian (that is the term inside the exponential function at (1) ) as free since it depends only on the increments between values. If the Hamiltonian also depends on the heights themselves, for example via the addition of a \sum_{x} \Psi(h^D_x) term, then for suitable choice of function \Psi, this is interpreted as a model where the particles have mass. The physical interpretation of these more general Gibbs measures is discussed widely, and I’m not very comfortable with it all at the moment, but aim to come back to it later, when hopefully I will be more comfortable.

The Envelope ‘Paradox’

At the recent IMO in Hong Kong, there were several moments where the deputy leaders had to hang around, and I spent some of these moments discussing the following problem with Stephen Mackereth, my counterpart from New Zealand. He’s a mathematically-trained philosopher, so has a similar level of skepticism to me, but for different reasons, regarding supposed paradoxes in probability. Because, as we will see shortly, I don’t think this is a paradox in even the slightest fashion, I think there’s probably too much written about this on the internet already. So I’m aware that contributing further to this oeuvre is hypocritical, but we did the thinking in HKUST’s apparently famous Einstein Cafe, so it makes sense to write down the thoughts.

[And then forget about it for eight weeks. Oops.]

The ‘Paradox’

Here’s the situation. A cryptic friend gives you an envelope containing some sum of money, and shows you a second envelope. They then inform you that one of the envelopes contains twice as much money as the other. It’s implicit in this that the choice of which is which is uniform. You have the option to switch envelopes. Should you?

The supposed paradox arises by considering the amount in your envelope, say X. In the absence of further information, it is equally likely that the other envelope contains X/2 as 2X. Therefore, the average value of the other envelope is

\frac12 \left(\frac{X}{2}+2X \right)= \frac54 X > X.

So you should switch, since on average you gain money. But this is paradoxical, since the assignment of larger and smaller sums was uniform, so switching envelope should make no difference.

Probabilistic setup

This is not supposed to be a problem on a first-year probability exercise sheet. It’s supposed to be conducive to light discussion. So saying “I won’t engage with this problem until you tell me what the probability space is” doesn’t go down terribly well. But it is important to work out what is random, and what isn’t.

There are two sources of randomness, or at least ignorance. Firstly, there is the pair of values contained in the envelopes. Secondly, there is the assignment of this pair of values to the two envelopes. The second is a source of randomness, and this problem is founded on the premise that this second stage is ‘symmetric enough’ to smooth over any complications in the first stage. If we think that probability isn’t broken (and that’s what I think), then the answer is probably that the second stage isn’t symmetric enough.

Or, that the first stage isn’t very well-defined. In what follows, I’m going to make the second stage very symmetric, at the expense of setting up the first stage in what seems to me a reasonable way using the conventional language of probability theory to record our ignorance about the values in play.

So what’s the first stage? We must have a set of possible pairs of values taken by the envelopes. Let’s call this A, so

A\subset \mathbb{A}:=\{(x,2x)\,:\, x\in (0,\infty)\}.

Maybe we know what A is, but maybe we don’t, in which we should take A=\mathbb{A}, on the grounds that any pair is possible. Suppose that your friend has chosen the pair of values according to some distribution on \mathbb{A}, which we’ll assume has a density f, which is known by you. Maybe this isn’t the actual density, but it serves perfectly well if you treat it as *your* opinion on the likelihood. Then this actually does reduce to a problem along the lines of first-year probability, whether or not you get to see the amount in your envelope.

Suppose first that you do get to see the amount, and that it is x. Then the conditional probabilities that the pair is (x/2,x) or (x,2x) are, respectively

\frac{f(x/2,x)}{f(x/2,x)+f(x,2x)},\quad \frac{f(x,2x)}{f(x/2,x)+f(x,2x)}.

So you can work out your expected gain by switching, and decide accordingly. If you don’t know the value in your envelope, you can still work out the probability that it is better (in expectation) to switch, but this isn’t really a hugely meaningful measure, unless it is zero or one.

It’s worth noting that if you can view inside your envelope, and you know A has a particular form, then the game becomes determined. For example, if

A\subset \{(n,2n), n\text{ an odd integer}\},

then life is very easy. If you open your envelope and see an odd integer, you should switch, and if you see an even integer you shouldn’t.

We’ll return at the end to discuss a case where it is always better to switch, and why this isn’t actually a paradox.

Improper prior and paradox of resampling when \mathbb{E}=\infty

For now though, let’s assume that we don’t know anything about the amounts of money in the envelopes. Earlier, we said that “in the absence of further information, it is equally likely that the other envelope contains X/2 as 2X”. In the language of a distribution on \mathbb{A}, we are taking the uniform measure. Of course this not a distribution, in the same way that there isn’t a uniform distribution on the positive reals.

However, if this is your belief about the values in the pair of envelopes, what do you think is the mean value of the content of your envelope? Well, you think all values are equally likely. So, even though this isn’t a distribution, you pretty much think the value of your envelope has infinite expectation.

[This is where the philosophy comes in I guess. Is (expressing uniform ignorance about the content of the other envelope given knowledge of your own) the same as (expressing uniform ignorance of both envelopes at the beginning)? I think it is, even though it has a different consequence here, since the former can be turned into a proper distribution, whereas the latter cannot.]

Let’s briefly consider an alternative example. It’s fairly easy to conjure up distributions which are almost surely finite but which have infinite expectation. For example \mathbb{P}(X=2^k)=2^{-k} for k=1,2,…, which is the content of the *St. Petersburg paradox*, another supposed paradox in probability, but one whose resolution is a bit more clear.

Anyway, let X and Y be independent copies of such a distribution. Now suppose your friend offers you an envelope containing amount X. You look at the value, and then you are offered the opportunity to switch to an envelope containing amount Y. Should you?

Well, if expectation is what you care about, then you definitely should. Because with probability one, you are staring at a finite value in your envelope, whereas the other unknown envelope promises infinite expectation, which is certainly larger than the value that you’re looking at.

Is this also a paradox? I definitely don’t think it is. The expectation of the content of your envelope is infinite, the expected gain is infinite with probability one, which is consistent with the expected content of the other envelope being infinite. [Note that you don’t want to be claiming that the expectation of X-Y is zero.]

An example density function

As an exercise that isn’t necessarily hugely interesting, let’s assume that f, the distribution of the smaller of the pair, is \mathrm{Exp}(\lambda). So the mean of this smaller number is 1/\lambda. Then, conditional on seeing x in my envelope, the expected value of the number in the other envelope is

\frac{\frac{x}{2} e^{-\lambda x/2} + 2x e^{-\lambda x}}{e^{-\lambda x/2}+ e^{-\lambda x}}. (*)

Some straightforward manipulation shows that this quantity is at least x (implying it’s advantageous to switch) precisely when

e^{-\lambda x/2}\ge \frac12.

That is, when x\le \frac{2\log 2}{\lambda}. The shape of this interval should fit our intuition, namely that the optimal strategy should be to switch if the value in your envelope is small enough.

The point of doing this calculation is to emphasise that it ceases to be an interesting problem, and certainly ceases to be a paradox of any kind, once we specify f concretely. It doesn’t matter whether this is some true distribution (ie the friend is genuinely sampling the values somehow at random), or rather a perceived likelihood (that happens to be normalisable).

What if you should always switch?

The statement of the paradox only really has any bite if the suggestion is that we should always switch. Earlier, we discussed potential objections to considering the uniform prior in this setting, but what about other possible distributions f which might lead to this conclusion?

As at (*), we can conclude that when f(x)+f(x/2)>0, we should switch on seeing x precisely if

f(x)\ge 2f\left(\frac{x}{2}\right).

Therefore, partitioning the support of f into a collection of geometric sequences with exponent 2, it is clear that the mean of f is infinite if everything is integer-valued. If f is real-valued, there are some complications, but so long as everything is measurable, the same conclusion will hold.

So the you-should-switch-given-x strategy can only hold for all values of x if f has infinite mean. This pretty much wraps up my feelings. If the mean isn’t infinite, the statement of the paradox no longer holds, and if it is infinite, then the paradox dissolves into a statement about trying to order various expectations, all of which are infinite.

Conclusions

Mathematical summary: it’s Bayes. Things may be exchangeable initially, but not once you condition on the value of one of them! Well, not unless you have a very specific prior.

Philosophical summary: everything in my argument depends on the premise that one can always describe the protagonist’s prior opinion on the contents of the pair of envelopes with a (possibly degenerate) distribution. I feel this is reasonable. As soon as you write down \frac12 \cdot\frac{x}{2} + \frac12 \cdot2x, you are doing a conditional expectation, and it’s got to be conditional with respect to something. Here it’s the uniform prior, or at least the uniform prior restricted to the set of values that are now possible given the revelation of your number.

Second mathematical summary: once you are working with the uniform prior, or any measure with infinite mean, there’s no reason why

\mathbb{E}\left[X|Y\right]>Y,

with probability one (in terms of Y) should be surprising, since the LHS is (almost-surely) infinite while the RHS is almost surely finite, despite having infinite mean itself.

Turan’s Theorem

Turan’s theorem gives bounds on the number of edges required in a graph on a fixed number of vertices n to guarantee it contains a complete graph of size r+1. Equivalently, an upper bound on the number of edges in a K_{r+1}-free graph. For some of the applications and proofs, it may be more natural to look instead at the complement graph, for which the theorem becomes a statement about the existence or otherwise of an independent set of size r+1.

Rather than give an expression for the bound immediately, it is more natural to consider the Turan graph T(n,r), the maximal graph on n vertices without a copy of K_{r+1}. This is constructed by dividing the vertices into r classes with ‘as equal size as possible’. That is, some classes have size \lfloor \frac{n}{r}\rfloor and others have size \lfloor \frac{n}{r}\rfloor +1. Then connect any pair of vertices which are not in the same class by an edge. This gives a complete r-partite graph on these classes. Since any collection of r+1 vertices contains at least two in the same class, it can’t contain a K_{r+1}. Note that the complement of the complete r-partite graph is the union of r disjoint complete graphs on the classes.

There are a number of ways to enumerate the edges in T(n,r), and some can get quite complicated quite quickly. After a moderate amount of thought, this is my favourite. Let n=\ell r+k, so T(n,r) has k classes of size (l+1) and (r-k) classes of size l. Pick an ordered pair of vertices uniformly at random. (So picking the same vertices is indeed an option, and is counted twice.) Then the probability they are the same class is

\frac{k}{r}\cdot\frac{\ell+1}{n}+\frac{r-k}{r}\cdot \frac{\ell}{n} = \frac{1}{r}.

So the probability they are in different classes is \frac{r-1}{r}, and we can treat all of the 2n^2 ordered pairs in this way, noting a) that we count everything twice; and b) we know a priori that we don’t have loops, so the fact that we’ve included these in the count doesn’t matter. We end up with the enumeration (1-\frac{1}{r})\frac{n^2}{2} for the edges in T(n,r).

A standard proof

For both proofs, I find it slightly easier to work in the complement graph, where we are aiming for the largest number of edges with an independent set of size (r+1). Suppose we have a graph with the minimal number of vertices such that there’s no independent set of given size. Suppose also that there is an edge joining vertices v and w, such that d(v)> d(w). Then if we change v’s neighbourhood \Gamma(v) so that it becomes the same as \Gamma(w), (that is, we replace v with a copy of w, and maintain the original edge vw), then it is easily checked that we still do not have an independent set of that size, but fewer edges.

Note that by attempting to make the neighbourhoods of connected vertices equal, we are making the graph look more like a union of complete components. We can do a similar trick if we have three vertices u,v,w such that there are edges between u and v and v and w, but not u and w. Then we know the degrees of u,v,w are the same by the previous argument, and so it can again be checked that making \Gamma(u),\Gamma(w) the same as \Gamma(v), and adding the edge uw reduces the number of edges, and maintains the non-existence of the independent set.

The consequence of this is that we’ve shown that the minimum can only be attained when presence of edges is an equivalence relation (ignoring reflexivity). Thus the minimum is only attained for a union of at most r complete graphs. Jensen (or any root-mean-square type inequality) will then confirm that the true minimum is attained when the sizes of the r components are as equal as possible.

A probabilistic proof

The following probabilistic proof is courtesy of Alon and Spencer. The motivation is that in the (equality) case of a union of complete graphs, however we try to build up a maximal independent set, we always succeed. That is, it doesn’t matter how we choose which vertex (unconnected to those we already have) to add next – we will always get a set of size r. This motivates a probabilistic proof, as an argument in expectation will have equality in the equality case, which is always good.

Anyway, we build up an independent set in a graph by choosing uniformly at random a vertex which is not connected to any we have so far, until this set of vertices is empty. It makes sense to settle the randomness at the start, so give the vertices a uniform random labelling on [n], and at each stage, choose the independent vertex with minimal label.

Thus, a vertex v will be chosen for the independent set if, and only if, it has a smaller label than all of its neighbours, that is, with probability \frac{1}{1+d(v)}. So the expected size of the independent set constructed in this fashion is

\sum_{v\in V(G)} \frac{1}{1+d(v)}\ge \frac{V}{1+\bar d} = \frac{V}{1+\frac{2E}{V}}.

One can chase through the expressions to get the bound we want back.

Olympiad example

The reason I was thinking about Turan’s theorem was a problem which the UK IMO squad was discussing. It comes from an American selection test (slightly rephrased): given 100 points in the plane, what is the largest number of pairs of points with \ell_1 distance in (1,2]?

The key step is to think about how large a collection of points can have this property pairwise. It is easy to come up with an example of four points which work, and seemingly impossible to come up with an example with five points. To prove this, I found it easiest to place a point at the origin, then explicitly work with coordinates relative the basis (1,1),(1,-1) for fairly obvious reasons in this metric.

Anyway, once you are convinced that you can’t have five points with this property pairwise, you are ready to convert into a graph-theoretic statement. Vertices correspond to points, and edges link pairs of points whose distance is in (1,2] as required. We know from the previous paragraph that there is no copy of K_5 here, so Turan’s theorem bounds the number of edges, ie the number of suitable pairs.

It also tells us under what sort of circumstances the bound is attained, and from this, it’s natural to split the 100 points into four groups of 25, for example by taking four points which satisfy the condition pairwise (eg a diamond around the origin), and placing each group very near one of the points.

Extensions and other directions

The existence of a complete subgraph is reminiscent of Ramsey theory, which in one case is a symmetric version of Turan’s theorem. In Turan, we are adding enough edges to force a complete subgraph, while in the original form of Ramsey theory, we are asking how large the graph needs to be to ensure that for any edge configuration, either the original graph or the complement graph includes a complete subgraph. It makes a lot more sense to phrase this in terms of colours for the purpose of generalisation.

A natural extension is to ask about finding copies of fixed graphs H other than the complete graph. This is the content of the Erdos-Stone theorem. I’d prefer to say almost nothing rather than be vague, but the key difference is that the bound is asymptotic in the number of vertices rather than exact. Furthermore, the asymptotic proportion of vertices depends on the chromatic number of H, which tells you how many classes r are required to embed H in a (large if necessary) r-partite graph. So it is perhaps unsurprising that the limiting proportions end up matching the proportion of edges in the Turan graphs, namely r-1/r as r varies, which leaves the exact scaling open to further investigation in the case where H is bipartite (hence has chromatic number 2).