Random walks conditioned to stay positive

In this post, I’m going to discuss some of the literature concerning the question of conditioning a simple random walk to lie above a line with fixed gradient. A special case of this situation is conditioning to stay non-negative. Some notation first. Let (S_n)_{n\ge 0} be a random walk with IID increments, with distribution X. Take \mu to be the expectation of these increments, and we’ll assume that the variance \sigma^2 is finite, though at times we may need to enforce slightly stronger regularity conditions.

(Although simple symmetric random walk is a good example for asymptotic heuristics, in general we also assume that if the increments are discrete they don’t have parity-based support, or any other arithmetic property that prevents local limit theorems holding.)

We will investigate the probability that S_n\ge 0 for n=0,1,…,N, particularly for large N. For ease of notation we write T=\inf\{n\ge 0\,:\, S_n<0\} for the hitting time of the negative half-plane. Thus we are interested in S_n conditioned on T>N, or T=N, mindful that these might not be the same. We will also discuss briefly to what extent we can condition on T=\infty.

In the first paragraph, I said that this is a special case of conditioning SRW to lie above a line with fixed gradient. Fortunately, all the content of the general case is contained in the special case. We can repose the question of S_n conditioned to stay above n\alpha until step N by the question of S_n-n\alpha (which, naturally, has drift \mu-\alpha) conditioned to stay non-negative until step N, by a direct coupling.


Simple random walk is a perfectly interesting object to study in its own right, and this is a perfectly natural question to ask about it. But lots of probabilistic models can be studied via naturally embedded SRWs, and it’s worth pointing out a couple of applications to other probabilistic settings (one of which is the reason I was investigating this literature).

In many circumstances, we can desribe random trees and random graphs by an embedded random walk, such as an exploration process, as described in several posts during my PhD, such as here and here. The exploration process of a Galton-Watson branching tree is a particularly good example, since the exploration process really is simple random walk, unlike in, for example, the Erdos-Renyi random graph G(N,p), where the increments are only approximately IID. In this setting, the increments are given by the offspring distribution minus one, and the hitting time of -1 is the total population size of the branching process. So if the expectation of the offspring distribution is at most 1, then the event that the size of the tree is large is an atypical event, corresponding to delayed extinction. Whereas if the expectation is greater than one, then it is an event with limiting positive probability. Indeed, with positive probability the exploration process never hits -1, corresponding to survival of the branching tree. There are plenty of interesting questions about the structure of a branching process tree conditional on having atypically large size, including the spine decomposition of Kesten [KS], but the methods described in this post can be used to quantify the probability, or at least the scale of the probability of this atypical event.

In my current research, I’m studying a random walk embedded in a construction of the infinite-volume DGFF pinned at zero, as introduced by Biskup and Louidor [BL]. The random walk controls the gross behaviour of the field on annuli with dyadically-growing radii. Anyway, in this setting the random walk has Gaussian increments. (In fact, there is a complication because the increments aren’t exactly IID, but that’s definitely not a problem at this level of exposition.) The overall field is decomposed as a sum of the random walk, plus independent DGFFs with Dirichlet boundary conditions on each of the annuli, plus asymptotically negligible corrections from a ‘binding field’. Conditioning that this pinned field be non-negative up to the Kth annulus corresponds to conditioning the random walk to stay above the magnitude of the minimum of each successive annular DGFF. (These minima are random, but tightly concentrated around their expectations.)

Conditioning on \{T > N\}

When we condition on \{T>N\}, obviously the resulting distribution (of the process) is a mixture of the distributions we obtain by conditioning on each of \{T=N+1\}, \{T=N+2\},\ldots. Shortly, we’ll condition on \{T=N\} itself, but first it’s worth establishing how to relate the two options. That is, conditional on \{T>N\}, what is the distribution of T?

Firstly, when \mu>0, this event always has positive probability, since \mathbb{P}(T=\infty)>0. So as N\rightarrow\infty, the distribution of the process conditional on \{T>N\} converges to the distribution of the process conditional on survival. So we’ll ignore this for now.

In the case \mu\le 0, everything is encapsulated in the tail of the probabilities \mathbb{P}(T=N), and these tails are qualitatively different in the cases \mu=0 and \mu<0.

When \mu=0, then \mathbb{P}(T=N) decays polynomially in N. In the special case where S_n is simple symmetric random walk (and N has the correct parity), we can check this just by an application of Stirling’s formula to count paths with this property. By contrast, when \mu<0, even demanding S_N=-1 is a large deviations event in the sense of Cramer’s theorem, and so the probability decays exponentially with N. Mogulskii’s theorem gives a large deviation principle for random walks to lie above a line defined on the scale N. The crucial fact here is that the probabilistic cost of staying positive until N has the same exponent as the probabilistic cost of being positive at N. Heuristically, we think of spreading the non-expected behaviour of the increments uniformly through the process, at only polynomial cost once we’ve specified the multiset of values taken by the increments. So, when \mu<0, we have

\mathbb{P}(T\ge(1+\epsilon)N) \ll \mathbb{P}(T= N).

Therefore, conditioning on \{T\ge N\} in fact concentrates T on N+o(N). Whereas by contrast, when \mu=0, conditioning on \{T\ge N\} gives a nontrivial limit in distribution for T/N, supported on [1,\infty).

A related problem is the value taken by S_N, conditional on {T>N}. It’s a related problem because the event {T>N} depends only on the process up to time N, and so given the value of S_N, even with the conditioning, after time N, the process is just an unconditioned RW. This is a classic application of the Markov property, beloved in several guises by undergraduate probability exam designers.

Anyway, Iglehart [Ig2] shows an invariance principle for S_N | T>N when \mu<0, without scaling. That is S_N=\Theta(1), though the limiting distribution depends on the increment distribution in a sense that is best described through Laplace transforms. If we start a RW with negative drift from height O(1), then it hits zero in time O(1), so in fact this shows that conditonal on \{T\ge N\}, we have T= N +O(1) with high probability. When \mu=0, we have fluctuations on a scale \sqrt{N}, as shown earlier by Iglehart [Ig1]. Again, thinking about the central limit theorem, this fits the asymptotic description of T conditioned on T>N.

Conditioning on T=N

In the case \mu=0, conditioning on T=N gives

\left[\frac{1}{\sqrt{N}}S(\lfloor Nt\rfloor ) ,t\in[0,1] \right] \Rightarrow W^+(t), (*)

where W^+ is a standard Brownian excursion on [0,1]. This is shown roughly simultaneously in [Ka] and [DIM]. This is similar to Donsker’s theorem for the unconditioned random walk, which converges after rescaling to Brownian motion in this sense, or Brownian bridge if you condition on S_N=0. Skorohod’s proof for Brownian bridge [Sk] approximates the event \{S_N=0\} by \{S_N\in[-\epsilon \sqrt{N},+\epsilon \sqrt{N}]\}, since the probability of this event is bounded away from zero. Similarly, but with more technicalities, a proof of convergence conditional on T=N can approximate by \{S_m\ge 0, m\in[\delta N,(1-\delta)N], S_N\in [-\epsilon \sqrt{N},+\epsilon\sqrt{N}]\}. The technicalities here emerge since T, the first return time to zero, is not continuous as a function of continuous functions. (Imagine a sequence of processes f^N for which f^N(x)\ge 0 on [0,1] and f^N(\frac12)=\frac{1}{N}.)

Once you condition on T=N, the mean \mu doesn’t really matter for this scaling limit. That is, so long as variance is finite, for any \mu\in\mathbb{R}, the same result (*) holds, although a different proof is in general necessary. See [BD] and references for details. However, this is particularly clear in the case where the increments are Gaussian. In this setting, we don’t actually need to take a scaling limit. The distribution of Gaussian *random walk bridge* doesn’t depend on the mean of the increments. This is related to the fact that a linear transformation of a Gaussian is Gaussian, and can be seen by examining the joint density function directly.

Conditioning on T=\infty

When \mu>0, the event \{T=\infty\} occurs with positive probability, so it is well-defined to condition on it. When \mu\le 0, this is not the case, and so we have to be more careful.

First, an observation. Just for clarity, let’s take \mu<0, and condition on \{T>N\}, and look at the distribution of S_{\epsilon N}, where \epsilon>0 is small. This is approximately given by

\frac{S_{\epsilon N}}{\sqrt{N}}\stackrel{d}{\approx}W^+(\epsilon).

Now take \epsilon\rightarrow\infty and consider the RHS. If instead of the Brownian excursion W^+, we instead had Brownian motion, we could specify the distribution exactly. But in fact, we can construct Brownian excursion as the solution to an SDE:

\mathrm{d}W^+(t) = \left[\frac{1}{W^+(t)} - \frac{W^+(t)}{1-t}\right] \mathrm{d}t + \mathrm{d}B(t),\quad t\in(0,1) (**)

for B a standard Brownian motion. I might return in the next post to why this is valid. For now, note that the first drift term pushes the excursion away from zero, while the second term brings it back to zero as t\rightarrow 1.

From this, the second drift term is essentially negligible if we care about scaling W^+(\epsilon) as \epsilon\rightarrow 0, and we can say that W^+(\epsilon)=\Theta(\sqrt{\epsilon}).

So, returning to the random walk, we have

\frac{S_{\epsilon N}}{\sqrt{\epsilon N}}\stackrel{d}{\approx} \frac{W^+(\epsilon)}{\sqrt{\epsilon}} = \Theta(1).

At a heuristic level, it’s tempting to try ‘taking N\rightarrow\infty while fixing \epsilon N‘, to conclude that there is a well-defined scaling limit for the RW conditioned to stay positive forever. But we came up with this estimate by taking N\rightarrow\infty and then \epsilon\rightarrow 0 in that order. So while the heuristic might be convincing, this is not the outline of a valid argument in any way. However, the SDE representation of W^+ in the \epsilon\rightarrow 0 regime is useful. If we drop the second drift term in (**), we define the three-dimensional Bessel process, which (again, possibly the subject of a new post) is the correct scaling limit we should be aiming for.

Finally, it’s worth observing that the limit \{T=\infty\}=\lim_{N\rightarrow\infty} \{T>N\} is a monotone limit, and so further tools are available. In particular, if we know that the trajectories of the random walk satisfy the FKG property, then we can define this limit directly. It feels intuitively clear that random walks should satisfy the FKG inequality (in the sense that if a RW is large somewhere, it’s more likely to be large somewhere else). You can do a covariance calculation easily, but a standard way to show the FKG inequality applies is by verifying the FKG lattice condition, and unless I’m missing something, this is clear (though a bit annoying to check) when the increments are Gaussian, but not in general. Even so, defining this monotone limit does not tell you that it is non-degenerate (ie almost-surely finite), for which some separate estimates would be required.

A final remark: in a recent post, I talked about the Skorohod embedding, as a way to construct any centered random walk where the increments have finite variance as a stopped Brownian motion. One approach to conditioning a random walk to lie above some discrete function is to condition the corresponding Brownian motion to lie above some continuous extension of that function. This is a slightly stronger conditioning, and so any approach of this kind must quantify how much stronger. In Section 4 of [BL], the authors do this for the random walk associated with the DGFF conditioned to lie above a polylogarithmic curve.


[BD] – Bertoin, Doney – 1994 – On conditioning a random walk to stay nonnegative

[BL] – Biskup, Louidor – 2016 – Full extremal process, cluster law and freezing for two-dimensional discrete Gaussian free field

[DIM] – Durrett, Iglehart, Miller – 1977 – Weak convergence to Brownian meander and Brownian excursion

[Ig1] – Iglehart – 1974 – Functional central limit theorems for random walks conditioned to stay positive

[Ig2] – Iglehart – 1974 – Random walks with negative drift conditioned to stay positive

[Ka] – Kaigh – 1976 – An invariance principle for random walk conditioned by a late return to zero

[KS] – Kesten, Stigum – 1966 – A limit theorem for multidimensional Galton-Watson processes

[Sk] – Skorohod – 1955 – Limit theorems for stochastic processes with independent increments

Ornstein-Uhlenbeck Process

A large part of my summer has been spent proving some technical results pertaining to the convergence of some functionals of a critical Frozen Percolation process. This has been worthwhile, but hasn’t involved a large amount of reading around anything in particular, which has probably contributed to the lack of posts in recent months. Perhaps a mixture of that and general laziness?

Anyway, it turns out that the limit of the discrete processes under consideration is the Ornstein-Uhlenbeck process. The sense in which this limit holds (or at least, for now, is conjectured to hold) is something for another article. However, I thought it would be worth writing a bit about this particular process and why it is interesting.

The O-U process is described by the SDE

dX_t=-\beta (X_t-\mu)dt+\sigma dW_t,

where W is a standard Brownian motion. We think of \mu as the ‘mean’. The extent to which this behaves as a mean will be discussed shortly. The process is then mean-reverting, in the sense that the drift is directed against deviations of the process away from this mean. The parameter \beta measures the extent of this mean reversion, while as usual \sigma controls the magnitude of the Brownian noise.

The motivation for considering mean-reverting processes is considerable. One measure of this is how many equations with articles on Wikipedia turn out to be precisely this Ornstein-Uhlenbeck process with different context or notation. In most cases, the motivation arises because Brownian motion is for some reason unsuitable to take as a canonical random process. We will see why the O-U process is somehow the next most canonical choice for a random process.

In physics, it is sometimes unsatisfactory to model the trajectory of a particle with Brownian motion (even though this motivated the name…) as the velocities are undefined (see this post from ages ago), or infinite, depending on your definition of velocity. Using the Ornstein-Uhlenbeck process to model the velocity of a particle is often a satisfactory alternative. It is not unreasonable that there should be a mean velocity, presumably zero. The mean reversion models a frictional force from the underlying medium, while the Brownian noise describes random collisions with similar particles.

In financial applications, the Ornstein-Uhlenbeck model has been applied, apparently under the title of the Vasicek model since the 70s to describe quantities such as interest rates where there is some underlying reason to ban indefinite growth, and require mean reversion. Another setting might be a commodity which, because of external driving factors, has over the relevant time-scale well-defined mean value, around which mean-reverting fluctuations on the observed time-scale can be described. As with other financial models, it is undesirable for a process to take negative values. This can be fixed by taking a positive mean, then setting the volatility to be state dependent, decaying to zero as the state tends to zero, so for small values, the positive drift dominates. I don’t fully understand why patching this aspect is significantly more important than patching any other non-realistic properties of the model, but the resulting SDE is, at least in one particular case where the volatility is \sqrt{X_t}, called the Cox-Ingersoll-Ross model.

Anyway, a mathematical reason to pay particular attention to this Ornstein-Uhlenbeck process is the following. It is the unique family of continuous Markov processes to have a stationary Gaussian distribution. It is the mean-reverting property that is key. There is no chance of Brownian motion having any stationary distribution, let alone a Gaussian one. If this isn’t clear, you can convince yourself by thinking of the stationary distribution of SRW on \mathbb{Z}. Since the process is space-homogeneous, the only stationary measure is the uniform measure.

I want to focus on one particular property of the O-U process, through which some other aspects will be illuminated. If we take \sigma=\beta and let \beta\rightarrow\infty, then the stationary processes converge to white noise.

First though, we should note this is perhaps the easiest SDE to solve explicitly. We consider X_t e^{\theta t}, and applying Ito’s lemma rapidly gives

X_t=\mu + (X_0-\mu)e^{-\beta t}+\sigma\int_0^t e^{-\beta(t-s)}dW_s.

W is Gaussian so the distribution of X_t conditional on X_0=x_0 is also Gaussian, and since W is centred we can read off the expectation. Applying the Ito isometry then gives the variance. In conclusion:

X_t\stackrel{d}{=}\mathcal{N}(\mu+(x_0-\mu)e^{-\beta t}, \frac{\sigma^2}{\beta}(1-e^{-2\beta t})).

In particular, note that the variation has no dependence on x_0. So as t grows to infinity, this converges to \mathcal{N}(\mu, \frac{\sigma^2}{\beta}). This is, unsurprisingly, the stationary distribution of the process.

To address the white noise convergence, we need to consider \text{Cov}(X_0,X_t) in stationarity. Let’s assume WLOG that \mu=0 so most of the expectations will vanish. We obtain

\text{Cov}(X_0,X_t)=\mathbb{E}[X_0X_t]=\mathbb{E}_{x_0}\left[\mathbb{E}[X_t| X_0=x_0]\right]=\mathbb{E}[X_0^2 e^{-\beta t}]= \frac{sigma^2}{2\beta}2^{-\beta t}.

If we want, the Chapman-Kolmogorov equations work particularly nicely here, and we are able to derive a PDE for the evolution of the density function, though obviously this is very related to the result above. This PDE is known as the Fokker-Planck equation.

So, in particular, when \sigma=\beta\rightarrow \infty, this covariance tends to 0. I’m not purporting that this constitutes a proof that the Ornstein-Uhlenbeck processes converge as processes to white noise. It’s not obvious how to define process convergence, not least because there’s flexibility about how to view white noise as a process. One doesn’t really want to define the value of white noise at a particular time, but you can consider the covariance of integrals of white noise over disjoint intervals as a limit, in similar way to convergence of finite dimensional distributions.

The fact that taking \beta=0 gives Brownian motion, and this case gives white noise, intermediate versions of the Ornstein-Uhlenbeck process are sometimes referred to as coloured noise.

Finally, the Ornstein-Uhlenbeck process emerges as the scaling limit of mean-reverting discrete Markov chains, analogous to Brownian motion as the scaling limit of simple random walk. One particularly nice example is the Ehrenfest Urn model. We have two urns, and 2N balls. In each time step one of the 2N balls is chosen uniformly at random, and it is moved to the other urn. So a ball is more likely to be removed from an urn with more than N balls. We can view this as a model for molecules in, say a room, with a slightly porous division between them, eg a small hole. More complicated interface models in higher dimensions lead to fascinating PDEs, such as the famous KPZ equation, which are the subject of much ongoing interest in this area.

This result can be an application of the theory of convergence of Markov chains to SDEs pioneered by Stroock and Varadhan, about which more may follow very soon. In any case, it turns out that the fluctuations in the Ehrenfest Urn model are on the scale of \sqrt{n}, unsurprisingly, and are given by a centred Ornstein-Uhlenbeck process.

Investigating this has reminded me how much I’ve forgotten, or perhaps how little I ever knew, about the technicalities of stochastic processes are their convergence results, so next up will probably be a summary of all the useful definitions and properties for this sort of analysis.

SDEs and L-Diffusions

The motivation for the development of differential calculus by Newton et al. was to enable us to deduce concrete properties of, say, particle motion defined implicitly through ODEs. And we proceed similarly for the stochastic differential. Having defined all the terms through Ito’s formula, and concluded that BM is in some sense the canonical stochastic process, we seek to solve so-called stochastic differential equations of the form:


While there is no reason not to consider processes in \mathbb{R}^d, it is reasonable interesting to consider processes in one dimension. As with normal ODEs and PDEs, we have some intuitive notion if we specify some initial conditions, we should be able to set the differential equation up and ‘let it go’ like a functional clockwork mouse. Of course, we are conscious of the potential problems with uniqueness of solutions, stability of solutions, and general mathematical awkwardness that derives from the fact that we can’t treat all DEs as physical systems, with all the luxuries of definiteness that the physical world automatically affords. To establish some concreteness, we set up some definitions.

  • For a solution to the SDE, E(\sigma,b), we require a nice filtration \mathcal{F} and a BM adapted to that filtration to drive the process X_t, which satisfies X_t=X_0+\int_0^t\sigma(X_s)dB_s+\int_0^tb(X_s)ds, and we require this for each x_0\in\mathbb{R}^d s.t. X_0=x_0 a.s.
  • Uniqueness in law: all solutions to E(\sigma,b) starting from each x have the same law. Obviously, this places no restriction on the underlying probability space and filtration.
  • A stronger condition is Pathwise uniqueness: Given the filtration, solutions are almost surely indistinguishable (that is, paths are equal everywhere).
  • We have not specified any conditions on the filtration \mathcal{F}. It would be natural to consider only the minimal such filtration that works. If we can take \mathcal{F}=\mathcal{F}^B, the natural filtration of the driving BM, we say the solutions is strong. If every solution is strong, then we have pathwise uniqueness, otherwise we would have a solution where we could choose which path to follow independently of the BM.

The key theorem here is Yamada-Watanabe: If there exist solutions and we have pathwise uniqueness, then we have uniqueness in law. Then for every (\mathcal{F}_t), and \mathcal{F}_t-BM, the unique solution is strong.

Existence of solutions is particularly tractable when \sigma,b are Lipschitz, as this opens the way for implicit constructions as the fixed points of contracting mappings. We make particular use of Gronwall’s lemma, which confirms an intuitive thought that differential inequalities have solutions bounded by solutions to the corresponding ODE. Concretely, for $latex f\geq 0,\,f(t)\leq a+b\int_0^tf(s)ds,\quad 0\leq t\leq T$, the lemma states that f(t)\leq a\exp(bt). The case a=0 is obviously of particular interest for demonstrating convergence results. We deploy this method to show that when \sigma,b are Lipschitz, the SDE dX_t=\sigma(X_t)dB_t+b(X_t)dt has pathwise uniqueness and for any triple of filtration (\mathcal{F}_t), \mathcal{F}_t-adapted BM, and starting point x, there is a strong solution. Uniqueness in law then follows by Yamada-Watanabe, but we knew this anyway by composing measurable maps.

Now, given L, an operator on C^2 functions by:

Lf(x)=\frac12\sum_{i,j}a_{i,j}(x)\frac{\partial^2 f}{\partial x^i\partial x^j}+\sum_i b_i(x)\frac{\partial f}{\partial x^i}

We define X to be an L-diffusion if X’s local behaviour is specified (in distribution) by L(X). The first sum in the expression for L corresponds to diffusivity, while the second corresponds to (deterministic) drift. Formally, for a, b, bounded X_t a L-diffusion is \forall f\in C_b^2:

M_t^f:=f(X_t)-f(X_0)-\int_0^t Lf(X_s)ds is a martingale.

Alternatively, can relax boundedness condition, and require M_t^f\in\mathcal{M}_{c,loc}. To make a link to SDEs, define a=\sigma\sigma^T (so in one dimension a=\sqrt{\sigma}), then solutions to dX_t=\sigma(X_t)dB_t+b(X_t)dt are L-diffusions if boundedness conditions are met. Remember bounded implies Lipschitz implies solutions to SDEs. The result then follows directly from Ito’s formula.