Random walks conditioned to stay positive

In this post, I’m going to discuss some of the literature concerning the question of conditioning a simple random walk to lie above a line with fixed gradient. A special case of this situation is conditioning to stay non-negative. Some notation first. Let (S_n)_{n\ge 0} be a random walk with IID increments, with distribution X. Take \mu to be the expectation of these increments, and we’ll assume that the variance \sigma^2 is finite, though at times we may need to enforce slightly stronger regularity conditions.

(Although simple symmetric random walk is a good example for asymptotic heuristics, in general we also assume that if the increments are discrete they don’t have parity-based support, or any other arithmetic property that prevents local limit theorems holding.)

We will investigate the probability that S_n\ge 0 for n=0,1,…,N, particularly for large N. For ease of notation we write T=\inf\{n\ge 0\,:\, S_n<0\} for the hitting time of the negative half-line. Thus we are interested in S_n conditioned on T>N, or T=N, mindful that these might not be the same. We will also discuss briefly to what extent we can condition on T=\infty.

In the first paragraph, I said that this is a special case of conditioning SRW to lie above a line with fixed gradient. Fortunately, all the content of the general case is contained in the special case. We can rephrase the question of S_n conditioned to stay above n\alpha until step N as the question of S_n-n\alpha (which, naturally, has drift \mu-\alpha) conditioned to stay non-negative until step N, by a direct coupling.


Simple random walk is a perfectly interesting object to study in its own right, and this is a perfectly natural question to ask about it. But lots of probabilistic models can be studied via naturally embedded SRWs, and it’s worth pointing out a couple of applications to other probabilistic settings (one of which is the reason I was investigating this literature).

In many circumstances, we can describe random trees and random graphs by an embedded random walk, such as an exploration process, as described in several posts during my PhD. The exploration process of a Galton-Watson branching tree is a particularly good example, since the exploration process really is a random walk with IID increments, unlike in, for example, the Erdős–Rényi random graph G(N,p), where the increments are only approximately IID. In this setting, the increments are given by the offspring distribution minus one, and the hitting time of -1 is the total population size of the branching process. So if the expectation of the offspring distribution is at most 1, then the event that the size of the tree is large is an atypical event, corresponding to delayed extinction. Whereas if the expectation is greater than one, then it is an event with limiting positive probability. Indeed, with positive probability the exploration process never hits -1, corresponding to survival of the branching tree. There are plenty of interesting questions about the structure of a branching process tree conditional on having atypically large size, including the spine decomposition of Kesten [KS], but the methods described in this post can be used to quantify the probability, or at least the scale of the probability, of this atypical event.
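To make the correspondence concrete, here is a minimal sketch in Python. The Poisson offspring law and all function names are my own illustrative choices, not anything from the literature above. It computes the total population twice from the same stream of offspring counts: once as the hitting time of -1 by the exploration walk, and once by exploring the tree directly in breadth-first order.

```python
import math
import random
from collections import deque


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) variate by multiplying uniforms (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_offspring(lam, seed):
    """Endless stream of IID Poisson(lam) offspring counts (an illustrative choice)."""
    rng = random.Random(seed)
    while True:
        yield sample_poisson(lam, rng)


def size_via_walk(offspring, max_steps):
    """First n with S_n = -1, where S_n = sum_{i<=n} (offspring_i - 1); None if not hit."""
    height = 0
    for n in range(1, max_steps + 1):
        height += next(offspring) - 1
        if height == -1:
            return n
    return None


def size_via_queue(offspring, max_steps):
    """Count individuals directly by breadth-first exploration of the tree."""
    waiting = deque([None])  # the root is waiting to be explored
    explored = 0
    while waiting and explored < max_steps:
        waiting.popleft()
        explored += 1
        waiting.extend([None] * next(offspring))  # its children join the queue
    return explored if not waiting else None
```

For instance, size_via_walk(iter([2, 0, 0]), 10) returns 3: a root with two childless children. The two functions agree because, after n individuals have been explored, the queue length is exactly S_n+1.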

In my current research, I’m studying a random walk embedded in a construction of the infinite-volume DGFF pinned at zero, as introduced by Biskup and Louidor [BL]. The random walk controls the gross behaviour of the field on annuli with dyadically-growing radii. Anyway, in this setting the random walk has Gaussian increments. (In fact, there is a complication because the increments aren’t exactly IID, but that’s definitely not a problem at this level of exposition.) The overall field is decomposed as a sum of the random walk, plus independent DGFFs with Dirichlet boundary conditions on each of the annuli, plus asymptotically negligible corrections from a ‘binding field’. Conditioning that this pinned field be non-negative up to the Kth annulus corresponds to conditioning the random walk to stay above the magnitude of the minimum of each successive annular DGFF. (These minima are random, but tightly concentrated around their expectations.)

Conditioning on \{T > N\}

When we condition on \{T>N\}, obviously the resulting distribution (of the process) is a mixture of the distributions we obtain by conditioning on each of \{T=N+1\}, \{T=N+2\},\ldots. Shortly, we’ll condition on \{T=N\} itself, but first it’s worth establishing how to relate the two options. That is, conditional on \{T>N\}, what is the distribution of T?

Firstly, when \mu>0, this event always has positive probability, since \mathbb{P}(T=\infty)>0. So as N\rightarrow\infty, the distribution of the process conditional on \{T>N\} converges to the distribution of the process conditional on survival. So we’ll ignore this for now.

In the case \mu\le 0, everything is encapsulated in the tail of the probabilities \mathbb{P}(T=N), and these tails are qualitatively different in the cases \mu=0 and \mu<0.

When \mu=0, the probability \mathbb{P}(T=N) decays polynomially in N. In the special case where S_n is simple symmetric random walk (and N has the correct parity), we can check this just by an application of Stirling’s formula to count paths with this property. By contrast, when \mu<0, even demanding S_N=-1 is a large deviations event in the sense of Cramér’s theorem, and so the probability decays exponentially with N. Mogulskii’s theorem gives a large deviation principle for random walks to lie above a line defined on the scale N. The crucial fact here is that the probabilistic cost of staying positive until N has the same exponent as the probabilistic cost of being positive at N. Heuristically, we think of spreading the non-expected behaviour of the increments uniformly through the process, at only polynomial cost once we’ve specified the multiset of values taken by the increments. So, when \mu<0, we have

\mathbb{P}(T\ge(1+\epsilon)N) \ll \mathbb{P}(T= N).

Therefore, conditioning on \{T\ge N\} in fact concentrates T on N+o(N). Whereas by contrast, when \mu=0, conditioning on \{T\ge N\} gives a nontrivial limit in distribution for T/N, supported on [1,\infty).
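The contrast between polynomial and exponential decay can be checked exactly for nearest-neighbour walks, without Stirling’s formula. The following is a sketch under assumptions of my own: steps are ±1 with up-probability p_up (so p_up=0.5 is the driftless case and p_up=0.4 is a hypothetical negative-drift example), and the function names are made up for illustration. It computes \mathbb{P}(T=n) by dynamic programming over the height of the surviving walk.

```python
import math


def first_passage_probs(p_up, n_max):
    """Return probs with probs[n] = P(T = n), n = 0..n_max, for a +-1 walk from 0.

    T is the first time the walk is negative; p_up is the chance of a +1 step.
    """
    # alive[h] = P(S_k = h and the walk has stayed >= 0 up to time k)
    alive = [0.0] * (n_max + 2)
    alive[0] = 1.0
    probs = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        # The walk first goes negative at time n by stepping down from height 0.
        probs[n] = alive[0] * (1.0 - p_up)
        updated = [0.0] * (n_max + 2)
        for h in range(n):  # heights reachable at time n - 1
            mass = alive[h]
            if mass:
                updated[h + 1] += mass * p_up
                if h >= 1:
                    updated[h - 1] += mass * (1.0 - p_up)
        alive = updated
    return probs


def local_exponent(probs, n1, n2):
    """Slope of -log P(T = n) against log n between n1 and n2."""
    return -math.log(probs[n2] / probs[n1]) / math.log(n2 / n1)
```

With p_up=0.5, local_exponent(first_passage_probs(0.5, 401), 201, 401) is close to 3/2, consistent with \mathbb{P}(T=N) of order N^{-3/2} along odd N. With p_up=0.4 the same quantity keeps growing with N, since the probabilities now carry an exponential factor (4 p_up (1-p_up))^{N/2}.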

A related problem is the value taken by S_N, conditional on {T>N}. It’s a related problem because the event {T>N} depends only on the process up to time N, and so given the value of S_N, even with the conditioning, after time N, the process is just an unconditioned RW. This is a classic application of the Markov property, beloved in several guises by undergraduate probability exam designers.

Anyway, Iglehart [Ig2] shows an invariance principle for S_N | T>N when \mu<0, without scaling. That is, S_N=\Theta(1), though the limiting distribution depends on the increment distribution in a sense that is best described through Laplace transforms. If we start a RW with negative drift from height O(1), then it hits zero in time O(1), so in fact this shows that conditional on \{T\ge N\}, we have T= N +O(1) with high probability. When \mu=0, we have fluctuations on a scale \sqrt{N}, as shown earlier by Iglehart [Ig1]. Again, thinking about the central limit theorem, this fits the asymptotic description of T conditioned on T>N.
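For the driftless case, the \sqrt{N} scale is easy to see numerically. Here is a rough sketch, assuming simple symmetric random walk and plain rejection sampling (both my choices for illustration, with hypothetical function names): we keep only those walks that stay non-negative up to time N and record S_N.

```python
import math
import random


def sample_conditioned_endpoint(n_steps, rng):
    """Return S_N for simple symmetric random walk, conditional on T > N.

    Plain rejection sampling: restart whenever the walk goes negative.
    """
    while True:
        height = 0
        for _ in range(n_steps):
            height += 1 if rng.random() < 0.5 else -1
            if height < 0:
                break  # this attempt violates the conditioning; discard it
        else:
            return height


def conditioned_mean(n_steps, n_samples, seed=0):
    """Monte Carlo estimate of E[S_N | T > N]."""
    rng = random.Random(seed)
    total = sum(sample_conditioned_endpoint(n_steps, rng) for _ in range(n_samples))
    return total / n_samples
```

For this particular walk there is an exact benchmark. Since the walk stopped at T is a martingale with S_T=-1, optional stopping gives \mathbb{E}[S_N | T>N] = 1/\mathbb{P}(T>N) - 1, and for even N we have \mathbb{P}(T>N)=\binom{N}{N/2}2^{-N}. For N=100 this is about 11.56, of order \sqrt{N} as expected, and conditioned_mean(100, 2000) should land near it up to Monte Carlo error. Rejection sampling is only practical here because \mathbb{P}(T>N) decays polynomially.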

Conditioning on T=N

In the case \mu=0, conditioning on T=N gives

\left[\frac{1}{\sqrt{N}}S(\lfloor Nt\rfloor ) ,t\in[0,1] \right] \Rightarrow W^+(t), (*)

where W^+ is a standard Brownian excursion on [0,1]. This is shown roughly simultaneously in [Ka] and [DIM]. This is similar to Donsker’s theorem for the unconditioned random walk, which converges after rescaling to Brownian motion in this sense, or Brownian bridge if you condition on S_N=0. Skorohod’s proof for Brownian bridge [Sk] approximates the event \{S_N=0\} by \{S_N\in[-\epsilon \sqrt{N},+\epsilon \sqrt{N}]\}, since the probability of this event is bounded away from zero. Similarly, but with more technicalities, a proof of convergence conditional on T=N can approximate by \{S_m\ge 0, m\in[\delta N,(1-\delta)N], S_N\in [-\epsilon \sqrt{N},+\epsilon\sqrt{N}]\}. The technicalities here emerge since T, the first return time to zero, is not continuous as a function of continuous functions. (Imagine a sequence of processes f^N for which f^N(x)\ge 0 on [0,1] and f^N(\frac12)=\frac{1}{N}.)

Once you condition on T=N, the mean \mu doesn’t really matter for this scaling limit. That is, so long as variance is finite, for any \mu\in\mathbb{R}, the same result (*) holds, although a different proof is in general necessary. See [BD] and references for details. However, this is particularly clear in the case where the increments are Gaussian. In this setting, we don’t actually need to take a scaling limit. The distribution of Gaussian *random walk bridge* doesn’t depend on the mean of the increments. This is related to the fact that a linear transformation of a Gaussian is Gaussian, and can be seen by examining the joint density function directly.
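To spell out the density calculation (a short sketch, writing x_1,\ldots,x_N for the increments and \sigma^2 for their variance), the joint density of the increments is proportional to

\prod_{i=1}^N \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) = \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N x_i^2\right)\exp\left(\frac{\mu}{\sigma^2}\sum_{i=1}^N x_i - \frac{N\mu^2}{2\sigma^2}\right).

On the event \{S_N=s\} we have \sum_i x_i=s, so the second factor is the constant \exp\left(\frac{\mu s}{\sigma^2}-\frac{N\mu^2}{2\sigma^2}\right), which cancels on normalising. The conditional density of (x_1,\ldots,x_N) given S_N=s is therefore proportional to \exp\left(-\frac{1}{2\sigma^2}\sum_i x_i^2\right) on the hyperplane \sum_i x_i=s, with no dependence on \mu.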

Conditioning on T=\infty

When \mu>0, the event \{T=\infty\} occurs with positive probability, so it is well-defined to condition on it. When \mu\le 0, this is not the case, and so we have to be more careful.

First, an observation. Just for clarity, let’s take \mu<0, and condition on \{T>N\}, and look at the distribution of S_{\epsilon N}, where \epsilon>0 is small. This is approximately given by

\frac{S_{\epsilon N}}{\sqrt{N}}\stackrel{d}{\approx}W^+(\epsilon).

Now take \epsilon\rightarrow 0 and consider the RHS. If instead of the Brownian excursion W^+ we had Brownian motion, we could specify the distribution exactly. But in fact, we can construct Brownian excursion as the solution to an SDE:

\mathrm{d}W^+(t) = \left[\frac{1}{W^+(t)} - \frac{W^+(t)}{1-t}\right] \mathrm{d}t + \mathrm{d}B(t),\quad t\in(0,1) (**)

for B a standard Brownian motion. I might return in the next post to why this is valid. For now, note that the first drift term pushes the excursion away from zero, while the second term brings it back to zero as t\rightarrow 1.

From this, the second drift term is essentially negligible if we care about scaling W^+(\epsilon) as \epsilon\rightarrow 0, and we can say that W^+(\epsilon)=\Theta(\sqrt{\epsilon}).

So, returning to the random walk, we have

\frac{S_{\epsilon N}}{\sqrt{\epsilon N}}\stackrel{d}{\approx} \frac{W^+(\epsilon)}{\sqrt{\epsilon}} = \Theta(1).

At a heuristic level, it’s tempting to try ‘taking N\rightarrow\infty while fixing \epsilon N‘, to conclude that there is a well-defined scaling limit for the RW conditioned to stay positive forever. But we came up with this estimate by taking N\rightarrow\infty and then \epsilon\rightarrow 0 in that order. So while the heuristic might be convincing, this is not the outline of a valid argument in any way. However, the SDE representation of W^+ in the \epsilon\rightarrow 0 regime is useful. If we drop the second drift term in (**), we define the three-dimensional Bessel process, which (again, possibly the subject of a new post) is the correct scaling limit we should be aiming for.
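As a small illustration of the limiting object, here is a sketch that samples the three-dimensional Bessel process on a time grid. It uses the standard fact that this process has the law of the Euclidean norm of a three-dimensional Brownian motion, rather than discretising the SDE, whose drift 1/R is singular at zero. The function name and grid parameters are my own choices.

```python
import math
import random


def bessel3_path(n_steps, dt, rng):
    """Sample R_t = |(B1_t, B2_t, B3_t)| on a grid of mesh dt, started from R_0 = 0."""
    position = [0.0, 0.0, 0.0]
    radii = [0.0]
    step_sd = math.sqrt(dt)
    for _ in range(n_steps):
        # Each coordinate is an independent Brownian motion.
        position = [x + rng.gauss(0.0, step_sd) for x in position]
        radii.append(math.sqrt(sum(x * x for x in position)))
    return radii
```

A quick sanity check is that \mathbb{E}[R_t^2]=3t, so R_t=\Theta(\sqrt{t}), matching the scaling W^+(\epsilon)=\Theta(\sqrt{\epsilon}) above. Averaging R_1^2 over a few thousand calls to bessel3_path(20, 0.05, rng) should give a value close to 3.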

Finally, it’s worth observing that the limit \{T=\infty\}=\lim_{N\rightarrow\infty} \{T>N\} is a monotone limit, and so further tools are available. In particular, if we know that the trajectories of the random walk satisfy the FKG property, then we can define this limit directly. It feels intuitively clear that random walks should satisfy the FKG inequality (in the sense that if a RW is large somewhere, it’s more likely to be large somewhere else). You can do a covariance calculation easily, but a standard way to show the FKG inequality applies is by verifying the FKG lattice condition, and unless I’m missing something, this is clear (though a bit annoying to check) when the increments are Gaussian, but not in general. Even so, defining this monotone limit does not tell you that it is non-degenerate (ie almost-surely finite), for which some separate estimates would be required.

A final remark: in a recent post, I talked about the Skorohod embedding, as a way to construct any centered random walk where the increments have finite variance as a stopped Brownian motion. One approach to conditioning a random walk to lie above some discrete function is to condition the corresponding Brownian motion to lie above some continuous extension of that function. This is a slightly stronger conditioning, and so any approach of this kind must quantify how much stronger. In Section 4 of [BL], the authors do this for the random walk associated with the DGFF conditioned to lie above a polylogarithmic curve.


[BD] – Bertoin, Doney – 1994 – On conditioning a random walk to stay nonnegative

[BL] – Biskup, Louidor – 2016 – Full extremal process, cluster law and freezing for two-dimensional discrete Gaussian free field

[DIM] – Durrett, Iglehart, Miller – 1977 – Weak convergence to Brownian meander and Brownian excursion

[Ig1] – Iglehart – 1974 – Functional central limit theorems for random walks conditioned to stay positive

[Ig2] – Iglehart – 1974 – Random walks with negative drift conditioned to stay positive

[Ka] – Kaigh – 1976 – An invariance principle for random walk conditioned by a late return to zero

[KS] – Kesten, Stigum – 1966 – A limit theorem for multidimensional Galton-Watson processes

[Sk] – Skorohod – 1955 – Limit theorems for stochastic processes with independent increments

Reflected Brownian Motion

A standard Brownian motion is space-homogeneous, meaning that the behaviour of B_{T+t}-B_T does not depend on the value of B_T. By Donsker’s Theorem, such a Brownian motion is also the limit in a process space of any homogeneous random walk with zero-drift and constant variance, after suitable rescaling.

In many applications, however, we are interested in real-valued continuous-time Markov processes that are defined not on the whole of the real line, but on the half-line \mathbb{R}_{\ge 0}. So as BM is the fundamental real-valued continuous-time Markov process, we should ask how we might adjust it so that it stays non-negative. In particular, we want to clarify uniqueness, or at least be sure we have found all the sensible ways to make this adjustment, and also to consider how Donsker’s Theorem might work in this setting.

We should consider what properties we want this non-negative BM to have. Obviously, it should be non-negative, but it is also reasonable to demand that it looks exactly like BM everywhere except near 0. But since BM has a scale-invariance property, it is essentially meaningless to say ‘near 0’, so we instead demand that it looks exactly like BM everywhere except at 0. Apart from this, the only properties we want are that it is Markov and has continuous sample paths.

A starting point is so-called reflected Brownian motion, defined by X_t:=|B_t|. This is very natural and very convenient for analysis, but there are some problems. Firstly, this has the property that it looks like Brownian motion everywhere except 0 only because BM is not only space-homogeneous but also symmetric, in the sense that B_t\stackrel{d}{=}-B_t. This will be untrue for essentially any other process, so as a general method for how to keep stochastic processes positive, this will be useless. My second objection is a bit more subtle. If we consider this as an SDE, we get

dX_t=\mathrm{sgn}(B_t)dB_t+dL_t,

where L is the local time of B at zero (this is Tanaka’s formula). This is a perfectly reasonable SDE but it is undesirable, because we have a function of B as coefficient on the RHS. Ideally, increments of X would be a function of X, and the increments of B, rather than the values of B. That is, we would expect X_{t+\delta t}-X_t to depend on X_t and on (B_{t+s}-B_t, 0\le s\le \delta t), but not on B_t itself, as that means we have to keep track of extra information while constructing X.

So we need an alternative method. One idea might be to add some non-negative process to the BM so that the sum stays non-negative. If this process is deterministic and finite, then there is some positive probability that the sum will eventually be negative, so this won’t do. We are therefore looking for a process which depends on the BM. Obviously we could take \max(-B_t,0), but this sum would then spend macroscopic intervals of time at 0, and these intervals would have the Rayleigh distribution (for Brownian excursions) rather than the exponential distribution, hence the process given by the sum would not be memoryless and Markov.

The natural alternative is to look for an increasing process A_t, and then it makes sense to talk about the minimal increasing process that has the desired property. A moment’s thought suggests that A_t=-\min_{s\le t}B_s satisfies this. So we have the decomposition

B_t=S_t-A_t,

where S_t is the height of B above its running minimum. So S is an ideal alternative definition of reflecting BM. In particular, when B is away from its minimum, dB_t=dS_t, so this has the property that it evolves exactly as the driving Brownian motion.
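The decomposition is easy to compute on a discretised path. Here is a minimal sketch, using a Gaussian random walk on a fine grid as a stand-in for B (the grid and the function names are my own choices for illustration). It returns S and A and lets us check the defining properties: S is non-negative, A is non-decreasing, and A only increases at times when S is zero.

```python
import math
import random


def skorohod_reflect(path):
    """Split a path B with B_0 = 0 into (S, A): A_t = -min_{s<=t} B_s, S_t = B_t + A_t."""
    running_min = 0.0
    reflected, pushed = [], []
    for b in path:
        running_min = min(running_min, b)
        pushed.append(-running_min)         # A_t: how far we have had to push up
        reflected.append(b - running_min)   # S_t: height of B above its running minimum
    return reflected, pushed


def random_walk_path(n_steps, dt, rng):
    """Gaussian random walk on a grid of mesh dt, a discrete stand-in for BM."""
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path
```

Note that the construction only ever uses the running minimum, which can be updated from the increments of B and the current value of S, in line with the requirement above that we should not need to track B_t itself.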

What we have done is to decompose a general continuous process into the sum of a decreasing continuous process and a non-negative process. This is known as the Skorohod problem, and was the subject of much interest, even in the deterministic case. Note that process A has the property that it is locally constant almost everywhere, and is continuous, yet non-constant. Unsurprisingly, since A only changes when the reflected process S is at 0, A is closely related to the local time process at 0. In fact, by comparison with the construction by direct reflection, A plays the role of the local time at zero of the reflected Brownian motion.

One alternative approach is to look instead at the generator of the process. Recall that the generator of a process is an operator on some space of functions, with \mathcal{L}f giving the infinitesimal drift of f(X_t). In the case of Brownian motion, the generator is (\mathcal{L}f)(x)=\frac12 f''(x) for bounded smooth functions f. This is equivalent to saying that

f(X_t)-f(X_0)-\int_0^t \frac12 f''(X_s)ds (*)

is a martingale. This must hold also for reflected Brownian motion, whenever x is greater than 0. Alternatively, if the function f is zero in a small neighbourhood of 0, it should have the same generator with respect to reflected BM. Indeed, for a general smooth bounded function f, we can still consider the expression (*) with respect to reflected BM. We know this expression behaves as a martingale except when X is zero. If f'(0)>0, and T is some hitting time of 0, then f(X_{T+\delta T})-f(X_T)\ge 0, hence the expression (*) is a submartingale. So if we restrict attention to functions with f'(0)=0, the generator remains the same. Indeed, by patching together all such intervals, it can be argued that even if f'(0) is not zero,

f(X_t)-f(X_0)-\int_0^t \frac12 f''(X_s)ds - f'(0)A_t

is a martingale, where A is the local time process at zero.

I was aware when I started reading about this that there was another family of processes called ‘Sticky Brownian Motion’ that shared properties with Reflected BM, in that it behaves like standard BM away from zero, but is also constrained to the non-negative reals. I think this will get too long if I also talk about that here, so that can be postponed, and for now we consider reflected BM as a limit of reflected (or other) random walks, bearing in mind that there is at least one other candidate to be the limit.

Unsurprisingly, if we have a family of random walks constrained to the non-negative reals, that are zero-drift unit-variance away from 0, then if they converge as processes, the limit is Brownian away from zero, and non-negative. Note that “away from 0” means after rescaling. So the key aspect is behaviour near zero.

What is the drift of reflected BM at 0? We might suspect it is infinite because of the form of the generator, but we can calculate it directly. Given X_0=0, we have:

\frac{1}{t}\mathbb{E}[X_t]=\frac{1}{t}\mathbb{E}|B_t|=\frac{1}{t}\sqrt{\frac{2t}{\pi}}=\sqrt{\frac{2}{\pi t}},

so letting t\rightarrow 0, we see indeed that the drift is infinite at 0.

For convergence of discrete processes, we really need the generators to converge. Typically we index the discrete-time processes by the time unit h, which tends to 0, and b_h(x),a_h(x) are the rescaled drift and square-drift from x. We assume that we don’t see macroscopic jumps in the limit. For the case of simple random walk reflected at 0, it doesn’t matter exactly how we construct the joint limit in h and x, as the drift is uniform on x>0, but in general this does matter. I don’t want to discuss sticky BM right now, so it’s probably easiest to be vague and say that the discrete Markov processes converge to reflected BM so long as they don’t spend more time than expected near 0 in the limit, as the title ‘sticky’ might suggest.

The two ways in which this can happen are if the volatility term a_h(x) is too small, in which case the process looks almost deterministic near 0, or if the drift doesn’t increase fast enough. And indeed, this leads to two conditions. The first is straightforward: if a_h(x) is bounded below, in the sense that \liminf_{h,x\rightarrow 0} a_h(x)\ge C>0, then we have convergence to reflected BM. Alternatively, the only danger can arise down those subsequences where a_h(x)\rightarrow 0, so if we have that b_h(x)\rightarrow +\infty whenever h,x,a_h(x)\rightarrow 0, then this convergence also holds.

Next time I’ll discuss what sticky BM means, what it doesn’t mean, why it isn’t easy to double the local time, and how to obtain sticky BM as a limit of discrete random walks in a similar way to the above.


S. Varadhan – Chapter 16 from a Lecture Course at NYU can be found here.


SDEs and L-Diffusions

The motivation for the development of differential calculus by Newton et al. was to enable us to deduce concrete properties of, say, particle motion defined implicitly through ODEs. And we proceed similarly for the stochastic differential. Having defined all the terms through Ito’s formula, and concluded that BM is in some sense the canonical stochastic process, we seek to solve so-called stochastic differential equations of the form:

dX_t=\sigma(X_t)dB_t+b(X_t)dt.

While there is no reason not to consider processes in \mathbb{R}^d, it is reasonably interesting to consider processes in one dimension. As with normal ODEs and PDEs, we have some intuitive notion that if we specify some initial conditions, we should be able to set the differential equation up and ‘let it go’ like a functional clockwork mouse. Of course, we are conscious of the potential problems with uniqueness of solutions, stability of solutions, and general mathematical awkwardness that derives from the fact that we can’t treat all DEs as physical systems, with all the luxuries of definiteness that the physical world automatically affords. To establish some concreteness, we set up some definitions.

  • For a solution to the SDE, E(\sigma,b), we require a nice filtration \mathcal{F} and a BM adapted to that filtration to drive the process X_t, which satisfies X_t=X_0+\int_0^t\sigma(X_s)dB_s+\int_0^tb(X_s)ds, and we require this for each x_0\in\mathbb{R}^d s.t. X_0=x_0 a.s.
  • Uniqueness in law: all solutions to E(\sigma,b) starting from each x have the same law. Obviously, this places no restriction on the underlying probability space and filtration.
  • A stronger condition is Pathwise uniqueness: Given the filtration, solutions are almost surely indistinguishable (that is, paths are equal everywhere).
  • We have not specified any conditions on the filtration \mathcal{F}. It would be natural to consider only the minimal such filtration that works. If we can take \mathcal{F}=\mathcal{F}^B, the natural filtration of the driving BM, we say the solution is strong. If every solution is strong, then we have pathwise uniqueness, otherwise we would have a solution where we could choose which path to follow independently of the BM.

The key theorem here is Yamada-Watanabe: If there exist solutions and we have pathwise uniqueness, then we have uniqueness in law. Then for every (\mathcal{F}_t), and \mathcal{F}_t-BM, the unique solution is strong.

Existence of solutions is particularly tractable when \sigma,b are Lipschitz, as this opens the way for implicit constructions as the fixed points of contracting mappings. We make particular use of Gronwall’s lemma, which confirms an intuitive thought that differential inequalities have solutions bounded by solutions to the corresponding ODE. Concretely, for f\geq 0 satisfying f(t)\leq a+b\int_0^tf(s)ds for 0\leq t\leq T, the lemma states that f(t)\leq a\exp(bt). The case a=0 is obviously of particular interest for demonstrating convergence results. We deploy this method to show that when \sigma,b are Lipschitz, the SDE dX_t=\sigma(X_t)dB_t+b(X_t)dt has pathwise uniqueness and for any triple of filtration (\mathcal{F}_t), \mathcal{F}_t-adapted BM, and starting point x, there is a strong solution. Uniqueness in law then follows by Yamada-Watanabe, but we knew this anyway by composing measurable maps.
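A numerical counterpart to this picture is the Euler-Maruyama scheme, which I am bringing in here purely as an illustration; it is not the fixed-point construction used in the proof. The sketch below takes the Brownian increments as an explicit argument, to mirror the idea that a strong solution is a function of the driving BM. The function names and the Ornstein-Uhlenbeck example (\sigma\equiv 1, b(x)=-x, both Lipschitz) are my own choices.

```python
import math
import random


def euler_maruyama(sigma, b, x0, increments, dt):
    """Euler-Maruyama path for dX = sigma(X) dB + b(X) dt, given Brownian increments."""
    path = [x0]
    for db in increments:
        x = path[-1]
        path.append(x + sigma(x) * db + b(x) * dt)
    return path


def brownian_increments(n_steps, dt, rng):
    """IID N(0, dt) increments of the driving Brownian motion."""
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]


def ou_endpoint_mean(x0, n_steps, dt, n_paths, seed=0):
    """Average of X at the final time for dX = dB - X dt, over independent paths."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        noise = brownian_increments(n_steps, dt, rng)
        total += euler_maruyama(lambda x: 1.0, lambda x: -x, x0, noise, dt)[-1]
    return total / n_paths
```

Feeding the same increments in twice gives the same path, which is the discrete shadow of pathwise uniqueness. For the Ornstein-Uhlenbeck example the exact mean is \mathbb{E}[X_t]=x_0e^{-t}, so ou_endpoint_mean(2.0, 100, 0.01, 2000) should be close to 2e^{-1}, up to discretisation and Monte Carlo error.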

Now, given L, an operator on C^2 functions by:

Lf(x)=\frac12\sum_{i,j}a_{i,j}(x)\frac{\partial^2 f}{\partial x^i\partial x^j}+\sum_i b_i(x)\frac{\partial f}{\partial x^i}

We define X to be an L-diffusion if X’s local behaviour is specified (in distribution) by L(X). The first sum in the expression for L corresponds to diffusivity, while the second corresponds to (deterministic) drift. Formally, for bounded a and b, we say X_t is an L-diffusion if, for all f\in C_b^2,

M_t^f:=f(X_t)-f(X_0)-\int_0^t Lf(X_s)ds is a martingale.

Alternatively, we can relax the boundedness condition, and require M_t^f\in\mathcal{M}_{c,loc}. To make a link to SDEs, define a=\sigma\sigma^T (so in one dimension a=\sigma^2), then solutions to dX_t=\sigma(X_t)dB_t+b(X_t)dt are L-diffusions if boundedness conditions are met. Recall that Lipschitz coefficients guarantee that solutions to the SDE exist. The result then follows directly from Ito’s formula.
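In one dimension, the Ito’s formula step can be sketched as follows, assuming f\in C_b^2 and that X solves the SDE. We have

df(X_t)=f'(X_t)\sigma(X_t)dB_t+\left[b(X_t)f'(X_t)+\frac12\sigma(X_t)^2f''(X_t)\right]dt,

and the bracket is exactly Lf(X_t) when a=\sigma^2. Hence

M_t^f=f(X_t)-f(X_0)-\int_0^tLf(X_s)ds=\int_0^tf'(X_s)\sigma(X_s)dB_s,

which is a continuous local martingale, and a true martingale when f' and \sigma are bounded.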