The reflection principle and conditioned RWs

I haven’t published a post about probability for far too long. Several are queued, so perhaps this will be the start of a deluge.

Anyway, with my advisor at Technion, I’m still working on some problems concerning Gaussian random walk subject to some conditioning which is complicated, but in practice (we hope) only mildly different to conditioning the walk to stay positive. Our conditioning at step n depends on some external randomness, but also on the future trajectory of the walk (related to the embedding of the walk in a 2D DGFF), thus ruining the possibility of applying the Markov property in any proof without significant preliminary work.

It seemed worth writing briefly about some of these results in a slightly simpler setting. The goal is to assemble many of the ingredients required to prove a local limit for Gaussian random walk conditioned to stay positive, in a sense which will be clarified towards the end. This is not the best way to go about getting scaling limits (as discussed briefly here, and for which see references [Ig74] and [Bo76]), and it’s probably not the best way to get local limits in the simplest setting, but it’s the method we are currently working to generalise, and follows the outline of [B-JD05], but in much less technical detail.

Probabilities via the reflection principle

We start with Brownian motion. The reflection principle, as described briefly in this post from the depths of history, is a classical technique for studying the maximum of Brownian motion. Roughly speaking, we exploit the fact that (W_t,t\ge 0)\stackrel{d}=(-W_t,t\ge 0), but we then apply this at the hitting time of a particular positive value, using the Strong Markov Property.

Let S_t=\max_{0\le s\le t}W_s be the running maximum of the Brownian motion W_t, and \tau_b the hitting time of b. Then

\mathbb{P}(S_t\ge b, W_t\le a)=\mathbb{P}(\tau_b<t\text{ and }W_t-W_{\tau_b}\le a-b),

which, by SMP at \tau_b and the reflection invariance of a standard BM, is equal to

\mathbb{P}(\tau_b<t\text{ and }W_t-W_{\tau_b}\ge b-a) = \mathbb{P}(W_t\ge 2b-a).

This obviously assumed b\ge a, but if we set b=a, we find

\mathbb{P}(S_t\ge b)=\mathbb{P}(W_t>b)+\mathbb{P}(S_t\ge b,W_t\le b)=2\mathbb{P}(W_t\ge b).

Or, in other words, S_t\stackrel{d}=|W_t|.
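As a quick numerical sanity check (my addition, not part of the argument), here is a minimal Monte Carlo sketch comparing \mathbb{P}(S_t\ge b) with 2\mathbb{P}(W_t\ge b) and \mathbb{P}(|W_t|\ge b), using a discretised Brownian motion; the step count and values of b are arbitrary choices, and the discretisation slightly undershoots the true running maximum.

```python
import numpy as np

rng = np.random.default_rng(0)

def running_max_and_endpoint(t=1.0, n_steps=2_000, n_paths=50_000):
    """Simulate discretised BM paths; return running maxima S_t and endpoints W_t."""
    dt = t / n_steps
    w = np.zeros(n_paths)      # current value of each path
    s = np.zeros(n_paths)      # running maximum of each path
    for _ in range(n_steps):
        w += rng.normal(0.0, np.sqrt(dt), size=n_paths)
        np.maximum(s, w, out=s)
    return s, w

S, W = running_max_and_endpoint()
for b in [0.5, 1.0, 1.5]:
    # The three columns should agree up to Monte Carlo and discretisation error.
    print(b, (S >= b).mean(), 2 * (W >= b).mean(), (np.abs(W) >= b).mean())
```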

While we can’t derive such nice equalities in distribution, the reflection principle is robust enough to treat more complicated settings, such as Brownian bridge.

We might want to ask about the maximum of a standard Brownian bridge, but we can be more general, and ask about the maximum of a Brownian bridge with drift (let’s say general bridge here). It’s important to remember that a general Brownian bridge has the same distribution as a linear transformation of a standard Brownian bridge. Everything is Gaussian, after all. So asking whether the maximum of a general Brownian bridge is less than a particular value is equivalent to asking whether a standard Brownian bridge lies below a fixed line. Wherever possible, we make such a transformation at the start and perform the simplest version of the required calculation.

So, suppose we have a bridge B from (0,0) to (t,a), and we want to study \max_{s\in[0,t]} B_s. Fix some b>a, and work with a standard Brownian motion W_s. By a similar argument to before,

\mathbb{P}(\tau_b\le t, W_t\in[a,a+\mathrm{d}x]) = \mathbb{P}(W_t\in [2b-a-\mathrm{d}x,2b-a]) = \frac{\mathrm{d}x}{\sqrt{2\pi t}}e^{-(2b-a)^2/2t},

and

\mathbb{P}(W_t\in[a,a+\mathrm{d}x])=\frac{\mathrm{d}x}{\sqrt{2\pi t}}e^{-a^2/2t}.

So

\mathbb{P}(\max_{s\in[0,t]}B_s\ge b) = \exp\left(\frac{-(2b-a)^2 + a^2}{2t}\right).
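To spell out the step from the two displayed densities to this formula: the bridge B is just the Brownian motion W conditioned on its endpoint, so

\mathbb{P}\Big(\max_{s\in[0,t]}B_s\ge b\Big) = \frac{\mathbb{P}(\tau_b\le t,\, W_t\in[a,a+\mathrm{d}x])}{\mathbb{P}(W_t\in[a,a+\mathrm{d}x])} = \frac{e^{-(2b-a)^2/2t}}{e^{-a^2/2t}},

which is the expression above.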

Random walk conditioned to stay positive

Our main concern is conditioning to stay above zero. Let \mathbb{P}_{0,x}^{t,y} be some complete if cumbersome notation for a Brownian bridge B from (0,x) to (t,y). Then another simple transformation of the previous result gives

\mathbb{P}_{0,x}^{t,y}(B_s\ge 0,\,s\in[0,t])=1-\exp\left( \frac{-(x+y)^2 + (x-y)^2}{2t} \right)= 1-\exp\left(-\frac{2xy}{t}\right).

Then, if xy\ll t, we can approximate this by \frac{2xy}{t}. (*)

Extend the notation so \mathbb{P}_{0,x} describes Brownian motion started from (0,x). Then integrating over y gives

\mathbb{P}_{0,x}(B_s\ge 0,\, s\in[0,t] ) \approx \frac{2x}{t}\mathbb{E}[B_t\vee 0] = \sqrt{\frac{2}{\pi}} \frac{x}{\sqrt{t}}.

(It might appear that we have integrated the approximation (*) over parts of the range where it is not valid, but the density of B_t at values of order t vanishes exponentially fast, so this region contributes negligibly.)
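For completeness, here is the integration step written out, using the approximation (*) and treating x as fixed:

\mathbb{P}_{0,x}(B_s\ge 0,\,s\in[0,t]) \approx \int_0^{\infty} \frac{2xy}{t}\cdot \frac{1}{\sqrt{2\pi t}}e^{-(y-x)^2/2t}\,\mathrm{d}y = \frac{2x}{t}\mathbb{E}_x[B_t\vee 0] \approx \frac{2x}{t}\sqrt{\frac{t}{2\pi}} = \sqrt{\frac{2}{\pi}}\frac{x}{\sqrt{t}},

where in the final approximation we use \mathbb{E}_x[B_t\vee 0]\approx \mathbb{E}_0[B_t\vee 0]=\sqrt{t/2\pi} for fixed x and large t.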

We now want to extend this to random walks. Some remarks:

  • We used the Gaussian property of Brownian motion fairly heavily throughout this derivation. In general random walks are not Gaussian, though we can make life easier by focusing on this case.
  • We also used the continuity property of Brownian motion when we applied the reflection principle. For a general random walk, it’s hopeless to consider the hitting times of individual values. We have to consider instead the hitting times of regions \tau_{(-\infty,b]}, and so on. One can still apply SMP and a reflection principle, but this gives bounds rather than equalities. (The exception is simple random walk, for which other more combinatorial methods may be available anyway.)
  • On the flip side, if we are interested in Brownian motion/bridge staying positive, we can’t start it from zero, as then the probability of this event is zero, by Blumenthal’s 0-1 law. By contrast, we can certainly ask about random walk staying positive when started from zero without taking a limit.

A useful technique will be the viewpoint of random walk as the values taken by Brownian motion at a sequence of stopping times. This Skorohod embedding is slightly less obvious when considering a general random walk bridge inside a general Brownian bridge, but is achievable. We want to study quantities like

\mathbb{P}(S_k\ge 0,\, k=1,\ldots,n \big| S_0=x,S_n=y),

where for simplicity let’s just take (S_k,k\ge 0) to be a random walk with standard Gaussian increments. It’s possible we might want to take a scaling limit in x and y as functions of n. But first if we take x,y fixed, and embed the random walk bridge with these endpoints into the corresponding Brownian bridge with t\approx n, we are then faced with the question:

What’s the probability that the Brownian bridge goes below zero, but the embedded RW with n steps does not?

If the Brownian bridge conditioned to go below zero spends time \Theta_p(n) below zero, then for large n it’s asymptotically very unlikely that all n times at which we embed the random walk avoid this set of intervals.

Several technical estimates are required to make this analysis rigorous. The conclusion is that there exists a function f(x) for which f(x)=x(1+o(1)) as x\rightarrow\infty, such that

q_n(x,y):=\mathbb{P}(S_k\ge 0,\, k=0,1,\ldots,n \,\big|\, S_0=x,S_n=y) \sim \frac{2f(x)f(y)}{n},

\text{and}\quad q_n(x):=\mathbb{P}(S_k\ge 0,\,k=0,1,\ldots,n\,\big|\, S_0=x)\sim \sqrt{\frac{2}{\pi}}\frac{f(x)}{\sqrt{n}}.

As earlier, the second is obtained from the first by integrating over suitable y. This function f has to account for the extra complications when either end-point is near zero, for which the event where the Brownian motion goes negative without the random walk going negative requires additional care.
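These asymptotics are easy to test numerically. The sketch below (my addition, with arbitrary parameter choices) estimates q_n(x) for a walk with standard Gaussian increments by direct simulation, and compares with \sqrt{2/\pi}\,x/\sqrt{n}; the agreement should be decent once x is moderately large (so that f(x)\approx x) while still satisfying x\ll\sqrt{n}.

```python
import numpy as np

rng = np.random.default_rng(1)

def q_n_estimate(x, n, n_paths=50_000):
    """Estimate P(S_k >= 0, k=0,...,n | S_0=x) for a walk with standard Gaussian steps."""
    s = np.full(n_paths, float(x))
    ok = np.ones(n_paths, dtype=bool)
    for _ in range(n):
        s += rng.normal(size=n_paths)
        ok &= (s >= 0)
    return ok.mean()

for x, n in [(3, 400), (5, 2500)]:
    print(f"x={x}, n={n}: simulated {q_n_estimate(x, n):.4f}, "
          f"asymptotic with f(x) ~ x gives {np.sqrt(2 / np.pi) * x / np.sqrt(n):.4f}")
```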

Limits for the conditioned random walk

In the previous post on this topic, we addressed scaling limits in space and time for conditioned random walks. But we don’t have to look at the classical Donsker scaling to see the effects of conditioning to stay positive. In our setting, we are interested in studying the distribution of S_m conditional on the event (S_1\ge 0,S_2\ge 0,\ldots, S_n\ge 0), with limits taken in the order n\rightarrow\infty and then m\rightarrow\infty.

(At a more general level, it is meaningful to describe the random walk conditioned on staying positive forever. Although this would a priori require conditioning on an event of probability zero, it can be handled formally as an example of an h-transform.)

As explained in that previous post, the scaling invariance of the Bessel process W^+ (which it’s not unreasonable to think of as ‘Brownian motion conditioned to stay non-negative’) suggests that this limit should exist, and be given by the entrance law of W^+. But this is hard to extract from a scaling limit.

However, we can use the previous estimate to start a direct calculation.

\mathbb{P}(S_m\in \mathrm{d}y \,\big|\, S_k\ge 0,\, k=1,\ldots,n) = \frac{q_m(0,y) q_{n-m}(y) \mathbb{P}(S_m\in\mathrm{d}y)}{q_n(0)}.

Here, we used the Markov property at time m to split the event that S_m=y and the walk stays positive into two time-intervals. We will later take m large, so we may approximate as

\frac{2f(0)f(y)/m \times \sqrt{\frac{2}{\pi}}f(y)/\sqrt{n-m}\times \mathbb{P}(S_m\in\mathrm{d}y) } { \sqrt{\frac{2}{\pi}}f(0)/\sqrt{n}}\stackrel{n\rightarrow\infty}=\frac{2f(y)^2}{m}\mathbb{P}(S_m\in\mathrm{d}y).

This final probability emphasises that as m\rightarrow\infty we only really have to consider y=\Theta(\sqrt{m}), so set y=z\sqrt{m}, and we obtain

\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_m}{\sqrt{m}}\in\mathrm{d}z\,\big|\, S_k\ge 0,\,k=1,\ldots,n)

\sim \sqrt{m}\cdot\frac{2z^2m}{m}\cdot \frac{1}{\sqrt{2\pi}}\frac{1}{\sqrt{m}}e^{-z^2/2} = \sqrt{\frac{2}{\pi}}z^2 e^{-z^2/2}.

This is precisely the entrance law of the 3-dimensional Bessel process, usually denoted R. This process is invariant under time-rescaling in the same fashion as Brownian motion. Indeed, one representation of R is as the radial part of a three-dimensional Brownian motion, given by independent BMs in each coordinate. (See [Pi75] for explanation of the relation to ‘BM conditioned to stay non-negative’.) We could complete the analogy by showing that q_n(x,y) converges to the transition density of R as well. (Cf the prelude to Theorem 2.2 of [B-JD05].)
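A crude way to see this numerically (again my addition, not part of the argument) is rejection sampling: simulate Gaussian walks from 0, keep those which stay non-negative up to time n, and compare the empirical law of S_m/\sqrt{m} on the surviving paths with the claimed entrance law \sqrt{2/\pi}z^2e^{-z^2/2}. The values n=400, m=50 below are only roughly in the asymptotic regime 1\ll m\ll n, so expect qualitative rather than perfect agreement.

```python
import numpy as np
from math import erf

rng = np.random.default_rng(2)
n, m, n_paths = 400, 50, 200_000

s = np.zeros(n_paths)
low = np.full(n_paths, np.inf)              # running minimum of S_1,...,S_k
s_at_m = None
for k in range(1, n + 1):
    s += rng.normal(size=n_paths)
    np.minimum(low, s, out=low)
    if k == m:
        s_at_m = s.copy()

z = s_at_m[low >= 0] / np.sqrt(m)           # S_m / sqrt(m) on the surviving walks
print(f"{z.size} walks out of {n_paths} stayed non-negative up to time {n}")

def entrance_mass(lo, hi):
    """Integral of sqrt(2/pi) t^2 exp(-t^2/2) over [lo, hi], via its antiderivative."""
    g = lambda t: -np.sqrt(2 / np.pi) * t * np.exp(-t ** 2 / 2) + erf(t / np.sqrt(2))
    return g(hi) - g(lo)

for lo, hi in [(0.0, 0.5), (0.5, 1.0), (1.0, 1.5), (1.5, 3.0)]:
    emp = ((z >= lo) & (z < hi)).mean()
    print(f"[{lo},{hi}): empirical {emp:.3f}, entrance law {entrance_mass(lo, hi):.3f}")
```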

Final remarks

The order of taking limits is truly crucial. We can also obtain a distributional scaling limit at time n under conditioning to stay non-negative up to time n. But then this is the size-biased normal distribution \sim ze^{-z^2/2} (the Rayleigh distribution), rather than the square-size-biased normal distribution we see in this setting. And we can exactly see why. Relative to the normal distribution which applies in the absence of conditioning, we require size-biasing to account for the walk staying positive up to time m, and then also size-biasing to account for the walk staying positive for the rest of time (or up to n in the n\rightarrow\infty limit if you prefer).

The asymptotics for q_n(x,y) were the crucial step, for which only heuristics are present in this post. It remains the case that estimates of this kind form the crucial step in other more exotic conditioning scenarios. This is immediately visible (even if the random walk notation is rather exotic) in, for example, Proposition 2.2 of [CHL17], of which we currently require a further level of generalisation.

References

[Bo76] – Bolthausen – On a functional central limit theorem for random walks conditioned to stay positive

[B-JD05] – Bryn-Jones, Doney – A functional limit theorem for random walk conditioned to stay non-negative

[CHL17] – Cortines, Hartung, Louidor – The structure of extreme level sets in branching Brownian motion

[Ig74] – Iglehart – Functional central limit theorems for random walks conditioned to stay positive

[Pi75] – Pitman – One-dimensional Brownian motion and the three-dimensional Bessel process


Skorohod embedding

Background

Suppose we are given a standard Brownian motion (B_t), and a stopping time T. Then, so long as T satisfies one of the regularity conditions under which the Optional Stopping Theorem applies, we know that \mathbb{E}[B_T]=0. (See here for a less formal introduction to OST.) Furthermore, since B_t^2-t is a martingale, \mathbb{E}[B_T^2]=\mathbb{E}[T], so if the latter is finite, so is the former.

Now, using the strong Markov property of Brownian motion, we can come up with a sequence of stopping times 0=T_0, T_1, T_2,\ldots such that the increments T_k-T_{k-1} are IID with the same distribution as T. Then 0,B_{T_1},B_{T_2},\ldots is a centered random walk. By taking T to be the hitting time of \{-1,+1\}, it is easy to see that we can embed simple random walk in a Brownian motion using this approach.
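As a small illustration of these optional stopping facts (a sketch I’m adding, with an arbitrary asymmetric exit interval and a coarse time step, so expect a small discretisation bias): for T the first exit time of (-1,2) from 0, the theory gives \mathbb{P}(B_T=2)=1/3, \mathbb{E}[B_T]=0 and \mathbb{E}[T]=\mathbb{E}[B_T^2]=2.

```python
import numpy as np

rng = np.random.default_rng(3)

def exit_samples(n_paths=20_000, a=-1.0, b=2.0, dt=1e-3):
    """Discretised BM paths from 0, frozen when they first leave (a, b)."""
    x = np.zeros(n_paths)
    t = np.zeros(n_paths)
    alive = np.ones(n_paths, dtype=bool)
    while alive.any():
        x[alive] += rng.normal(0.0, np.sqrt(dt), size=alive.sum())
        t[alive] += dt
        alive &= (x > a) & (x < b)
    return t, x

T, BT = exit_samples()
print("E[B_T]   ≈", BT.mean(), "(theory 0)")
print("P(B_T=2) ≈", (BT >= 2).mean(), "(theory 1/3)")
print("E[T]     ≈", T.mean(), " E[B_T^2] ≈", (BT ** 2).mean(), "(theory 2 for both)")
```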

[Figure: embedding simple random walk in Brownian motion.]

The Skorohod embedding question asks: can all centered random walks be constructed in this fashion, by stopping Brownian motion at a sequence of stopping times? With the strong Markov property, the question immediately reduces to whether all centered, finite-variance distributions X can be expressed as B_T for some integrable stopping time T.

The answer to this question is yes, and much of what follows is drawn from, or at least prompted by, Obloj’s survey paper, which details the problem and the rich history of the many approaches to its solution over the past seventy years.

Applications and related things

The relationship between random walks and Brownian motion is a rich one. Donsker’s invariance principle asserts that Brownian motion appears as the scaling limit of a random walk. Indeed, one can construct Brownian motion itself as the limit of a sequence of consistent random walks with normal increments on an increasingly dense set of times. Furthermore, centered random walks are martingales, and we know that continuous local martingales can be expressed as (stochastically) time-changed Brownian motions, by the Dubins-Schwarz theorem.

The Skorohod embedding theorem can be used to prove results about random walks with general distribution by proving the corresponding result for Brownian motion, and checking that the construction of the sequence of stopping times has the right properties to allow the result to be carried back to the original setting. It obviously also gives a coupling between an individual random walk and a Brownian motion which may be useful in some contexts, as well as a coupling between any pair of random walks. This is useful in proving results for random walks which are much easier for special cases of the distribution, for example when the increments are Gaussian, or when there are combinatorial approaches to a problem about simple random walk. At the moment no aspect of this blog schedule is guaranteed, but I plan to talk about the law of the iterated logarithm shortly, whose proof is approachable in both of these settings, as well as for Brownian motion, and Skorohod embedding provides the route to the general proof.

At the end, we will briefly compare some other ways to couple a random walk and a Brownian motion.

Adding extra randomness

One thing we could do is sample a copy of X independently from the Brownian motion, then declare T= \tau_{X}:= \inf\{t\ge 0: B_t=X\}, the hitting time of (random value) X. But recall that unfortunately \tau_x has infinite expectation for all non-zero x, so this doesn’t fit the conditions required to use OST.

Skorohod’s original method is described in Section 3.1 of Obloj’s notes linked above. The method is roughly to pair up positive values taken by X with negative values taken by X in a clever way. If we have a positive value b and a negative value a, then \tau_{a,b}, the first hitting time of \mathbb{R}\backslash (a,b), is integrable. Then we choose one of these positive-negative pairs according to the projection of the distribution of X onto the pairings, and let T be the hitting time of this pair of values. The probability of hitting b conditional on hitting {a,b} is easy to compute (it’s \frac{-a}{b-a}) so we need to have chosen our pairs so that the ‘probability’ of hitting b (ie the density) comes out right. In particular, this method has to start from continuous distributions X, and treat atoms in the distribution of X separately.

The case where the distribution X is symmetric (that is X\stackrel{d}=-X) is particularly clear, as then the pairs should be (-x,x).

However, it feels like there is enough randomness in Brownian motion already, and subsequent authors showed that indeed it wasn’t necessary to introduce extra randomness to provide a solution.

One might ask whether it’s possible to generate the distribution on the set of pairs (as above) out of the Brownian motion itself, but independently from all the hitting times. It feels like it might be possible to make the distribution on the pairs measurable with respect to

\mathcal{F}_{0+} = \bigcap\limits_{t>0} \mathcal{F}_t,

the sigma-algebra of events determined by limiting behaviour as t\rightarrow 0 (which is independent of hitting times). But of course, unfortunately \mathcal{F}_{0+} has a zero-one law, so it’s not possible to embed non-trivial distributions there.

Dubins solution

The exemplar for solutions without extra randomness is due to Dubins, shortly after Skorohod’s original argument. The idea is to express the distribution X as the almost sure limit of a martingale. We first use the hitting time of a pair of points to ‘decide’ whether we will end up positive or negative, and then given this information look at the hitting time (after this first time) of two subsequent points to ‘decide’ which of four regions of the real interval we end up in.

I’m going to use different notation to Obloj, corresponding more closely with how I ended up thinking about this method. We let

a_+:= \mathbb{E}[X \,|\, X>0], \quad a_- := \mathbb{E}[X\,|\, X<0], (*)

and take T_1 = \tau_{\{a_-,a_+\}}. We need to check that

\mathbb{P}\left( B_{T_1}=a_+\right) = \mathbb{P}\left(X>0\right),

for this to have a chance of working. But we know that

\mathbb{P}\left( B_{T_1}=a_+\right) = \frac{-a_-}{a_+-a_-},

and we can also attack the other side using (*) and the fact that \mathbb{E}[X]=0, using the law of total expectation:

0=\mathbb{E}[X]=\mathbb{E}[X\,|\, X>0] \mathbb{P}(X>0) + \mathbb{E}[X\,|\,X<0]\mathbb{P}(X<0) = a_+ \mathbb{P}(X>0) + a_- \left(1-\mathbb{P}(X>0) \right),

\Rightarrow\quad \mathbb{P}(X>0)=\frac{-a_-}{a_+-a_-}.

Now we define

a_{++}=\mathbb{E}[X \,|\, X>a_+],\quad a_{+-}=\mathbb{E}[X\,|\, 0<X<a_+],

and similarly a_{-+},a_{--}. So then, conditional on B_{T_1}=a_+, we take

T_2:= \inf\left\{ t\ge T_1: B_t\not\in (a_{+-},a_{++})  \right\},

and similarly conditional on B_{T_1}=a_-. By an identical argument to the one we have just deployed, we have \mathbb{E}\left[B_{T_2} \,|\,\mathcal{F}_{T_1} \right] = B_{T_1} almost surely. So, although the a_{+-+} notation now starts to get very unwieldy, it’s clear we can keep going in this way to get a sequence of stopping times 0=T_0,T_1,T_2,\ldots where B_{T_n} determines which of the 2^n regions of the real line any limit \lim_{m\rightarrow\infty} B_{T_m} should lie in.

A bit of work is required to check that the almost sure limit T_n\rightarrow T is almost surely finite, but once we have this, it is clear that B_{T_n}\rightarrow B_T almost surely, and B_T has the distribution required.
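Here is a sketch of the first two levels of this construction (my own, for a finitely-supported centred distribution with no atoms at the target levels): it computes the levels a_\pm, a_{\pm\pm} as nested conditional means, and checks that the gambler's-ruin hitting probabilities at each stage reproduce the correct conditional probabilities for X.

```python
from fractions import Fraction as F

# A centred distribution with finite support: value -> probability.
X = {F(-2): F(1, 4), F(-1): F(1, 4), F(1): F(1, 4), F(2): F(1, 4)}
INF = F(10**9)                     # stand-in for +/- infinity on this bounded support

def cond_mean(dist, lo, hi):
    """E[X | lo < X < hi] for the finitely-supported distribution `dist`."""
    mass = sum(p for v, p in dist.items() if lo < v < hi)
    return sum(v * p for v, p in dist.items() if lo < v < hi) / mass

def hit_prob(start, lo, hi):
    """P(BM from `start` hits `hi` before `lo`), by the gambler's-ruin formula."""
    return (start - lo) / (hi - lo)

a_plus, a_minus = cond_mean(X, 0, INF), cond_mean(X, -INF, 0)

# Stage 1: from 0, hit {a_-, a_+}.
print("P(B_{T1} = a_+) =", hit_prob(0, a_minus, a_plus),
      " vs P(X > 0) =", sum(p for v, p in X.items() if v > 0))

# Stage 2, positive branch: from a_+, hit {a_{+-}, a_{++}}.
a_pp, a_pm = cond_mean(X, a_plus, INF), cond_mean(X, 0, a_plus)
p_top = sum(p for v, p in X.items() if v > a_plus) / sum(p for v, p in X.items() if v > 0)
print("P(B_{T2} = a_{++} | positive branch) =", hit_prob(a_plus, a_pm, a_pp),
      " vs P(X > a_+ | X > 0) =", p_top)
```

The agreement at the second stage is exactly the martingale identity \mathbb{E}[B_{T_2}\,|\,\mathcal{F}_{T_1}]=B_{T_1} in disguise, and the same check would pass at every subsequent level.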

Komlos, Major, Tusnady coupling

We want to know how close we can make this coupling between a centered random walk with variance 1, and a standard Brownian motion. Here, ‘close’ means uniformly close in probability. For large times, the typical difference between one of the stopping times 0,T_1,T_2,\ldots in the Skorohod embedding and its expectation (recall \mathbb{E}[T_k]=k) is \sqrt{n}. So, constructing the random walk S_0,S_1,S_2,\ldots from the Brownian motion via Skorohod embedding leads to

\left |S_k - B_k \right| = \Theta(n^{1/4}),

for most values of k\le n. Strassen (1966) shows that the true scale of the maximum

\max_{k\le n} \left| S_k - B_k \right|

is slightly larger than this, with some extra powers of \log n and \log\log n as one would expect.

The Komlos-Major-Tusnady coupling is a way to do a lot better than this, in the setting where the distribution of the increments has a finite MGF near 0. Then, there exists a coupling of the random walk and the Brownian motion such that

\max_{k\le n}\left|S_k- B_k\right| = O(\log n).

That is, there exists C such that

\left[\max_{k\le n} \left |S_k-B_k\right| - C\log n\right] \vee 0

is a tight family of distributions, indeed with uniform exponential tail. To avoid digressing infinitely far from my original plan to discuss the proof of the law of iterated logarithm for general distributions, I’ll stop here. I found it hard to find much coverage of the KMT result apart from the challenging original paper, and many versions expressed in the language of empirical processes, which are similar to random walks in many ways relevant to convergence and this coupling, but not for Skorohod embedding. So, here is a link to some slides from a talk by Chatterjee which I found helpful in getting a sense of the history, and some of the modern approaches to this type of normal approximation problem.

DGFF 2 – Boundary conditions and Gibbs-Markov property

In the previous post, we defined the Discrete Gaussian Free Field, and offered some motivation via the discrete random walk bridge. In particular, when the increments of the random walk are chosen to be Gaussian, many natural calculations are straightforward, since Gaussian processes are well-behaved under conditioning and under linear transformations.

Non-zero boundary conditions

In the definition of the DGFF given last time, we demanded that h\equiv 0 on \partial D. But the model is perfectly well-defined under more general boundary conditions.

It’s helpful to recall again the situation with random walk and Brownian bridge. If we want a Brownian motion which passes through (0,0) and (1,s), we could repeat one construction for Brownian bridge, by taking a standard Brownian motion and conditioning (modulo probability zero technicalities) on passing through level s at time 1. But alternatively, we could set

B^{\mathrm{drift-br}}(t) = B(t)+ t(s-B(1)),\quad t\in[0,1],

or equivalently

B^{\mathrm{drift-br}}(t)=B^{\mathrm{br}}(t)+ st, \quad t\in[0,1].

That is, a Brownian bridge with drift can be obtained from a centered Brownian bridge by a linear transformation, and so certainly remains a Gaussian process. And exactly the same holds for a discrete Gaussian bridge: if we want non-zero values at the endpoints, we can obtain this distribution by taking the standard centred bridge and applying a linear transformation.

We can see how this works directly at the level of density functions. If we take 0=Z_0,Z_1,\ldots,Z_{N-1},Z_N=0 a centred Gaussian bridge, then the density of Z=\mathbf{z}\in \mathbb{R}^{N+1} is proportional to

\mathbf{1}\{z_0=z_N=0\}\exp\left( -\frac12 \sum_{i=1}^N (z_i-z_{i-1})^2 \right). (3)

So rewriting z_i= y_i- ki (where we might want k=s/N to fit the previous example), the sum within the exponent rearranges as

-\frac12 \sum_{i=1}^N (y_i-y_{i-1} - k)^2 = -\frac12\left[ \sum_{i=1}^N (y_i-y_{i-1})^2 - 2k(y_N-y_0)+ Nk^2\right].

So when the values at the endpoints z_0,z_N,y_0,y_N are fixed, this middle term is a constant, as is the final term, and thus the density of the linearly transformed bridge has exactly the same form as the original one.

In two or more dimensions, the analogue of adding a linear function is to add a harmonic function. First, some notation. Let \varphi be any function on \partial D. Then there is a unique harmonic extension of \varphi, for which \nabla \varphi=0 everywhere on D, the interior of the domain. Recall that \nabla is the discrete graph Laplacian defined up to a constant by

(\nabla \varphi) _x = \sum\limits_{y\sim x} \left(\varphi_x - \varphi_y\right).

If we want h^D instead to have boundary values \varphi, it’s enough to replace h^D with h^D+\varphi. Then, in the density for the DGFF ( (1) in the previous post), the term in the exponential becomes (ignoring the \frac{1}{4d} )

-\sum\limits_{x\sim y} \left[ (h^D_x-h^D_y)^2 + (\varphi_x-\varphi_y)^2 +2(h^D_x - h^D_y)(\varphi_x-\varphi_y)\right].

For each x\in D, on taking this sum over its neighbours y\in \bar D, the final term vanishes (since \varphi is harmonic), while the second term is just a constant. So the density of the transformed field, which we’ll call h^{D,\varphi} is proportional to (after removing the constant arising from the second term above)

\mathbf{1}\left\{h^{D,\varphi}_x = \varphi_x,\, x\in\partial D\right\} \exp\left( -\frac{1}{4d} \sum\limits_{x\sim y} \left( h^{D,\varphi}_x - h^{D,\varphi}_y \right)^2 \right).

So h^{D,\varphi}:= h^D + \varphi satisfies the conditions for the DGFF on D with non-zero boundary conditions \varphi.
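To make this concrete, here is a small sampling sketch (my addition). I take the density proportional to \exp(-\tfrac12\sum_{x\sim y}(h_x-h_y)^2) over unordered edges, i.e. I absorb the \frac{1}{4d} and any inverse temperature into the scale of the field, which only rescales it. With that normalisation the zero-boundary field on the interior of a box is a centred Gaussian vector whose precision matrix is the interior graph Laplacian, and a field with boundary data \varphi is obtained by adding the discrete harmonic extension of \varphi, exactly as above. The boundary data \varphi(x,y)=x is chosen because its harmonic extension is known exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10                                    # interior is an N x N grid inside the box {0,...,N+1}^2

def idx(i, j):                            # label interior vertices 0,...,N*N-1
    return i * N + j

phi = lambda x, y: float(x)               # boundary data; a coordinate function is harmonic

# Interior graph Laplacian L (full degree minus interior adjacency), and the vector b
# collecting the boundary data seen by each interior vertex.
L = np.zeros((N * N, N * N))
b = np.zeros(N * N)
for i in range(N):
    for j in range(N):
        L[idx(i, j), idx(i, j)] = 4       # degree in Z^2
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            ii, jj = i + di, j + dj
            if 0 <= ii < N and 0 <= jj < N:
                L[idx(i, j), idx(ii, jj)] -= 1
            else:
                b[idx(i, j)] += phi(ii + 1, jj + 1)   # this neighbour lies on the boundary

harmonic_ext = np.linalg.solve(L, b)      # discrete harmonic extension of phi to the interior

# Zero-boundary field: centred Gaussian with precision L, i.e. covariance L^{-1}.
h0 = np.linalg.cholesky(np.linalg.inv(L)) @ rng.normal(size=N * N)
h = h0 + harmonic_ext                     # one sample of the field with boundary condition phi

print("harmonic extension of phi(x,y)=x recovers the x-coordinate:",
      np.allclose(harmonic_ext, np.repeat(np.arange(1.0, N + 1), N)))
print("sample values of h at the first few interior vertices:", np.round(h[:5], 3))
```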

Harmonic functions and RW – a quick review

Like the covariances in DGFF, harmonic functions on D are related to simple random walk on D stopped on \partial D. (I’m not claiming a direct connection right now.) We can define the harmonic extension \varphi to an interior point x by taking \mathbb{P}_x to be the law of SRW x=Z_0,Z_1,Z_2,\ldots started from x, and then setting

\varphi(x):= \mathbb{E}_x\left[ \varphi\left(Z_{\tau_{\partial D}}\right) \right],

where \tau_{\partial D} is the first time that the random walk hits the boundary.
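As a quick illustration of this formula (a sketch of my own, with boundary data for which the answer is known exactly): on a box in \mathbb{Z}^2 with \varphi equal to the first coordinate, the harmonic extension is that coordinate itself, and the Monte Carlo estimate of \mathbb{E}_x[\varphi(Z_{\tau_{\partial D}})] should recover it, since the first coordinate of the walk is a bounded martingale.

```python
import numpy as np

rng = np.random.default_rng(5)
M = 12                                    # the domain is the box {0,...,M}^2
phi = lambda x, y: x                      # boundary data; its harmonic extension is x itself

def harmonic_mc(x0, y0, n_walks=20_000):
    """Estimate the harmonic extension at (x0, y0) by running SRWs until they hit the boundary."""
    steps = np.array([(1, 0), (-1, 0), (0, 1), (0, -1)])
    total = 0.0
    for _ in range(n_walks):
        x, y = x0, y0
        while 0 < x < M and 0 < y < M:    # still in the interior
            dx, dy = steps[rng.integers(4)]
            x, y = x + dx, y + dy
        total += phi(x, y)
    return total / n_walks

print("Monte Carlo estimate at (3, 7):", harmonic_mc(3, 7), " exact value: 3")
```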

Inverse temperature – a quick remark

In the original definition of the density of the DGFF, there is the option to add a constant \beta>0 within the exponential term so the density is proportional to

\exp\left(-\beta \sum\limits_{x\sim y} (h_x-h_y)^2 \right).

With zero boundary conditions, the effect of this is straightforward, as varying \beta just rescales the values taken by the field. But with non-zero boundary conditions, the effect is instead to vary the magnitude of the fluctuations of the values of the field around the (unique) harmonic function on the domain with those BCs. In particular, when \beta\rightarrow \infty, the field is ‘reluctant to be far from harmonic’, and so h^D \Rightarrow \varphi.

This parameter \beta is called inverse temperature. So low temperature corresponds to high \beta, and high stability, which fits some physical intuition.

A Markov property

For a discrete (Gaussian) random walk, the Markov property says that conditional on a given value at a given time, the trajectory of the process before this time is independent of the trajectory afterwards. The discrete Gaussian bridge is similar. Suppose we have as before 0=Z_0,Z_1,\ldots, Z_N=0 a centred Gaussian bridge, and condition that Z_k=y, for k\in\{1,\ldots,N-1\}, and y\in\mathbb{R}. With this conditioning, the density (3) splits as a product

\mathbf{1}\{z_0=z_N=0, z_k=y\}\exp\left(-\frac12 \sum\limits_{i=1}^N (z_i-z_{i-1})^2 \right) =

\mathbf{1}\{z_0=0,z_k=y\} \exp\left(-\frac12 \sum\limits_{i=1}^k (z_i-z_{i-1})^2 \right) \cdot \mathbf{1}\{z_k=y,z_N=0\} \exp\left(-\frac12 \sum\limits_{i=k+1}^N (z_i-z_{i-1})^2 \right).

Therefore, with this conditioning, the discrete Gaussian bridge splits into a pair of independent discrete Gaussian bridges with drift. (The same would hold if the original process had drift too.)

The situation for the DGFF is similar, though rather than focusing on the condition, it makes sense to start by focusing on the sub-domain of interest. Let A\subset D, and take B=\bar D\backslash A. So in particular \partial A\subset B.

Then we have that conditional on h^D\big|_{\partial A}, the restricted fields h^D\big|_{B\backslash \partial A} and h^D\big|_A are independent. Furthermore, h^D\big|_A has the distribution of the DGFF on A, with boundary condition given by h^D\big|_{\partial A}. As in the discrete bridge, this follows just by splitting the density. Every gradient term corresponds to an edge in the underlying graph that lies either entirely inside \bar A or entirely inside B. This holds for a general class of Gibbs models where the Hamiltonian depends only on the sum of some function of the heights (taken to be constant in this ‘free’ model) and the sum of some function of their nearest-neighbour gradients.

One additional and useful interpretation is that if we only care about the field on the restricted region A, the dependence of h^D\big|_A on h^D\big|_{D\backslash A} comes only through h^D\big|_{\partial A}. But more than that, it comes only through the (random) harmonic function which extends the (random) values taken on the boundary of A to the whole of A. So, if h^A is an independent DGFF on A with zero boundary conditions, we can construct the DGFF h^D from its value on D\backslash A via

h^D_x \stackrel{d}= h^A_x + \varphi^{h^D\big|_{\partial A}},

where \varphi^{h^D\big|_{\partial A}} is the unique harmonic extension of the (random) values taken by h^D on \partial A to \bar A.

This Markov property is crucial to much of the analysis to come. There are several choices of the restricted domain which come up repeatedly. In the next post we’ll look at how much one can deduce by taking A to be the even vertices in D (recalling that every integer lattice \mathbb{Z}^d is bipartite), and then taking A to be a finer sublattice within D. We’ll use this to get some good bounds on the probability that the DGFF is positive on the whole of D. Perhaps later we’ll look at a ring decomposition of \mathbb{Z}^d consisting of annuli spreading out from a fixed origin. Then the distribution of the field at this origin can be considered, via the final idea discussed above, as the limit of an infinite sequence of random harmonic functions given by the values taken by the field at increasingly large radius from the origin. Defining the DGFF on the whole lattice depends on the existence or otherwise of this local limit.

When is a Markov chain a Markov chain?

I’ve been taking tutorials on the third quarter of the second-year probability course, in which the students have met discrete-time Markov chains for the first time. The hardest aspect of this introduction (apart from the rapid pace – they cover only slightly less material than I did in Cambridge, but in half the time) is, in my opinion, choosing which definition of the Markov property is most appropriate to use in a given setting.

We have the wordy “conditional on the present, the future is independent of the past”, which is probably too vague for any precise application. Then you can ask more formally that the transition probabilities are the same under two types of conditioning, that is conditioning on the whole history, and conditioning on just the current value

\mathbb{P}(X_{n+1}=i_{n+1} \,\big|\, X_n=i_n,\ldots,X_0=i_0) = \mathbb{P}(X_{n+1}=i_{n+1} \,\big |\, X_n=i_n), (*)

and furthermore this must hold for all sets of values (i_{n+1},\ldots,i_0) and if we want time-homogeneity (as is usually assumed at least implicitly when we use the word ‘chain’), then these expressions should be functions of (i_n,i_{n+1}) but not n.

Alternatively, one can define everything in terms of the probability of seeing a given path:

\mathbb{P}(X_0=i_0,\ldots,X_n=i_n)= \lambda_{i_0}p_{i_0,i_1}\ldots p_{i_{n-1},i_n},

where \lambda is the initial distribution, and the p_{i,j}s are the entries of the transition matrix P.

Fortunately, these latter two definitions are equivalent, but it can be hard to know how to proceed when you’re asked to show that a given process is a Markov chain. I think this is partly because this is one of the rare examples of a concept that students meet and then immediately find it hard to think of any similar processes which are not examples. The only similar concept I can think of is vector spaces, which share this property mainly because almost everything in first-year mathematics is linear in some regard.

Non-examples of Markov chains

Anyway, during the tutorials I was asking for some suggestions of discrete-time processes on a countable or finite state space which are not Markov chains. Here are some things we came up with:

  • Consider a bag with a finite collection of marbles of various colours. Record the colours of marbles sampled repeatedly without replacement. Then the colour of the next marble depends on the set you’ve already seen, not on the current colour. And of course, the process terminates.
  • Non-backtracking random walk. Suppose you are on a graph where every vertex has degree at least 2, and in a step you move to an adjacent vertex, chosen uniformly among the neighbours, apart from the one from which you arrived.
  • In a more applied setting, it’s reasonable to assume that if we wanted to know the chance it will rain tomorrow, this will be informed by the weather over the past week (say) rather than just today.

Showing a process is a Markov chain

We often find Markov chains embedded in other processes, for example a sequence of IID random variables X_1,X_2,\ldots. Let’s consider the random walk S_n=\sum_{i=1}^n X_i, where each X_i =\pm 1 with probabilities p and 1-p respectively. Define the running maximum M_n=\max_{m\le n}S_m, and then we are interested in Y_n:=M_n-S_n, which we claim is a Markov chain, and we will use this as an example for our recipe to show this in general.

We want to show (*) for the process Y_n. We start with the LHS of (*)

\mathbb{P}(Y_{n+1}=i_{n+1} \,\big|\, Y_n=i_n,\ldots,Y_0=i_0),

and then we rewrite Y_{n+1} as much as possible in terms of previous and current values of Y, and quantities which might be independent of previous values of Y. At this point it’s helpful to split into the cases i_n=0 and i_n\ne 0. We’ll treat the latter for now. Then

Y_{n+1}=Y_n+X_{n+1},

so we rewrite as

=\mathbb{P}(X_{n+1}=i_{n+1}-i_n \, \big |\, Y_n=i_n,\ldots, Y_0=i_0),

noting that we substitute i_n for Y_n since that’s in the conditioning. But this is now ideal, since X_{n+1} is actually independent of everything in the conditioning. So we could get rid of all the conditioning. But we don’t really want to do that, because we want to have conditioning on Y_n left. So let’s get rid of everything except that:

=\mathbb{P}(X_{n+1}=i_{n+1}-i_n\, \big |\, Y_n=i_n).

Now we can exactly reverse all of the other steps to get back to

= \mathbb{P}(Y_{n+1}=i_{n+1} \,\big|\, Y_n=i_n),

which is exactly what we required.

The key idea is that we stuck to the definition in terms of Y, and held all the conditioning in terms of Y, since that is what actually determines the Markov property for Y, rearranging the event until it’s in terms of one of the underlying Xs, at which point it’s easy to use independence.

Showing a process is not a Markov chain

Let’s show that M_n is not a Markov chain. The classic mistake to make here is to talk about possible paths the random walk S could take, which is obviously relevant, but won’t give us a clear reason why M is not Markov. What we should instead do is suggest two paths taken by M, which have the same ‘current’ value, but induce different transition probabilities, because they place different restrictions on the possible paths taken by S.

[Figure: IsMaxMarkov – two possible paths of (M_0,\ldots,M_4), with the compatible paths of S.]

In both diagrams, the red line indicates a possible path taken by (M_0,M_1,\ldots,M_4), and the blue lines show possible paths of S which could induce these.

In the left diagram, clearly there’s only one such path that S could take, and so we know immediately what happens next. Either X_5=+1 (with probability p) in which case M_5=S_5=3, otherwise it’s -1, in which case M_5=2.

In the right diagram, there are two possibilities. In the case that S_4=0, clearly there’s no chance of the maximum increasing. So in the absence of other information, for M_5=3, we must have X_4=X_5=+1, and so the chance of this is p^2.

So although the same transitions are possible, they have different probabilities with different information about the history, and so the Markov property does not hold here.
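Both claims are easy to check by exhaustive enumeration for small n (a sketch of my own, with p=0.6 as an arbitrary asymmetric choice): group the possible histories of the derived process by its current value, and count how many distinct conditional laws for the next value appear within each group.

```python
from itertools import product
from collections import defaultdict

p = 0.6  # P(X_i = +1); an arbitrary asymmetric choice

def distinct_laws(value_fn, n=4):
    """For histories of length n of the derived process, group by current value and count
    how many distinct conditional laws of the next value occur within each group."""
    table = defaultdict(lambda: defaultdict(float))
    for steps in product([1, -1], repeat=n + 1):
        prob, s, m, path = 1.0, 0, 0, []
        for x in steps:
            prob *= p if x == 1 else 1 - p
            s += x
            m = max(m, s)
            path.append(value_fn(s, m))
        table[tuple(path[:n])][path[n]] += prob
    by_current = defaultdict(set)
    for hist, dist in table.items():
        total = sum(dist.values())
        law = tuple(sorted((v, round(q / total, 6)) for v, q in dist.items()))
        by_current[hist[-1]].add(law)
    return {cur: len(laws) for cur, laws in sorted(by_current.items())}

# For M some current value admits more than one conditional law (not Markov);
# for Y every current value admits exactly one (consistent with the Markov property).
print("M_n:", distinct_laws(lambda s, m: m))
print("Y_n:", distinct_laws(lambda s, m: m - s))
```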

Avoiding Mistakes in Probability Exams

Over the past week, I’ve given several tutorials to second year undergraduates preparing for upcoming papers on probability and statistics. In particular, I’ve now seen a lot of solutions to a lot of past papers and specimen questions, and it’s worthwhile to consider some of the typical mistakes students can make on these questions. Of course, as with any maths exam, there’s always the possibility of a particularly subtle or involved question coming up, but if the following three common areas of difficulty can be avoided, you’re on track for doing well.

Jacobians

In a previous course, a student will learn how to calculate the pdf of a function of a random variable. Here, we move on to the more interesting and useful case of finding the (joint) density of function(s) of two or more random variables. The key thing to remember here is that manipulating pdfs is not a strange arbitrary exercise – it is just integration. It is rarely of interest to consider the value of a pdf at a single point. We can draw meaningful conclusions from a pdf or from comparison of two pdfs by integrating them.

Then the question of substituting for new random variables is precisely integration by substitution, which we are totally happy with in the one-dimensional case, and should be fairly happy with in the two-dimensional case. To get from one joint density to another, we multiply by the absolute value of the Jacobian. To ensure you get it right, it makes sense to write out the informal infinitesimal relation

f_{U,V}(u,v)\,\mathrm{d}u\,\mathrm{d}v = f_{X,Y}(x,y)\,\mathrm{d}x\,\mathrm{d}y.

This is certainly relevant if we put integral signs in front of both sides, and explains why you obtain f_{U,V} = \frac{\partial(x,y)}{\partial(u,v)} f_{X,Y} rather than the other way round. Note though that if \frac{\partial(u,v)}{\partial(x,y)} is easier to calculate for some reason, then you can evaluate this and take the inverse, as your functions will almost certainly be locally bijective almost everywhere.

It is important to take the modulus of the Jacobian, since densities cannot be negative! If this looks like a fudge, then consider the situation in one dimension. If we substitute for x\mapsto f(x)=1-x, then f' is obviously negative, BUT we also end up reversing the order of the bounds of the integral, e.g. [1/3, 3/4] will become [2/3, 1/4]. So we have a negative integrand (after multiplying by f'(x)) but bounds in the ‘wrong’ order. These two factors of -1 will obviously cancel, so it suffices just to multiply by |f'(x)| at that stage. It is harder to express in words, but a similar relation works for the Jacobian substitution.

You also need to check where the new joint density is non-zero. Suppose X, Y are supported on [0,1], then when we write f_{X,Y}(x,y) we should indicate that it is 0 off this region, either by splitting into cases, or adding the indicator function 1_{\{x,y\in[0,1]\}} as a factor. This is even more important after substitutions, as the range of the resulting random variables might be less obvious than the originals. Eg with X,Y as above, and U=X^2, V=X/Y, the resulting pdf will be non-zero only when u\in[0,1], v\ge \sqrt{u}. Failing to account for this will often lead to ludicrous answers. A general rule is that you can always check that any distribution you’ve produced does actually integrate to one.
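To illustrate with that example: with U=X^2, V=X/Y we invert as x=\sqrt{u}, y=\sqrt{u}/v, so

\frac{\partial(x,y)}{\partial(u,v)} = \det\begin{pmatrix} \frac{1}{2\sqrt{u}} & 0 \\ \frac{1}{2v\sqrt{u}} & -\frac{\sqrt{u}}{v^2}\end{pmatrix} = -\frac{1}{2v^2},

and hence, remembering the modulus and the support,

f_{U,V}(u,v) = \frac{1}{2v^2}\, f_{X,Y}\left(\sqrt{u},\,\frac{\sqrt{u}}{v}\right)\mathbf{1}\{u\in[0,1],\, v\ge \sqrt{u}\}.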

Convergence using MGFs

There are two main reasons to use MGFs and PGFs. The first is that they behave nicely when applied to (possibly random) sums of independent random variables. The independence property is crucial to allow splitting of the MGF of the sum into the product of MGFs of the summands. Of course, implicit in this argument is that MGFs determine distributions.

A key theorem of the course is that this works even in the limit, so you can use MGFs to show convergence in distribution of a family of distributions. For this, you need to show that the MGFs converge pointwise on some interval [-a,a] around 0. (Note that the moments of the distribution are given by the family of derivatives at 0, as motivation for why this condition might be necessary.) Normally for such questions, you will have been asked to define the MGF earlier in the question, and probably will have found the MGF of a particular distribution or family of distributions, which might well end up appearing as the final answer.

Sometimes such an argument might involve substituting in something unusual, like t/N, rather than t, into a known MGF. Normally a Taylor series can be used to show the final convergence result. If you have a fraction, try to cancel terms so that you only have to evaluate one Taylor series, rather than lots.
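A standard example of this pattern (a generic illustration, not tied to any particular paper): if X_1,X_2,\ldots are IID with mean 0, variance 1 and MGF M_X finite near 0, and S_n=X_1+\cdots+X_n, then for fixed t,

M_{S_n/\sqrt{n}}(t) = \left(M_X\left(\tfrac{t}{\sqrt n}\right)\right)^n = \left(1 + \frac{t^2}{2n} + o\left(\tfrac1n\right)\right)^n \longrightarrow e^{t^2/2},

which is the MGF of a standard normal, recovering the CLT for this class of distributions.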

Using the Markov Property

The Markov property is initially confusing, but once we become comfortable with the statement, it is increasingly irritating to have to answer the question: “show that this process has the Markov property.” This question is irritating because in most cases we want to answer: “because it obviously does!” Which is compelling, but unlikely to be considered satisfactory in a mathematics exam. Normally we observe that the random dynamics of the next step are a function only of the present location. Looking for the word ‘independent’ in the statement of the process under discussion is a good place to start for any argument along these lines.

The most developed example of a Markov process in this course is the Poisson process. I’ve written far too much about this before, so I won’t do so again, except to say this. When we think of the Poisson process, we generally have two thoughts going through our minds, namely the equivalent definitions of IID exponential inter-arrival times, and stationary, Poisson increments (or the infinitesimal version). If we draw a sketch of a sample trajectory of this process, we can label everything up and it is clear how it all fits together. But if you are asked to give a definition of the Poisson process (N_t), it is inappropriate to talk about inter-arrival times unless you define them in terms of N_t, since that is the process you are actually trying to define! It is fine to write out

T_k:=\min\{t: N_t=k\},\quad N_t=\max\{k: Y_1+Y_2+\ldots+Y_k\le t\}

but the relation between the two characterisations of the process is not obvious. That is why it is a theorem of the course.

We have to be particularly careful of the difference in definition when we are calculating probabilities of various events. A classic example is this. Find the distribution of N_2, conditional on T_3=1. It’s very tempting to come up with some massive waffle to argue that the answer is 3+Po(1). The most streamlined observation is that the problem is easy if we are conditioning instead on N_1=3. We just use the independent Poisson increments definition of (N_t), with no reference to inter-arrival times required. But then the Markov property applied at time 1 says that the distribution of (N_2) depends only on the value of N_1, not on the process on the interval [0,1). In a sense, the condition that T_3=1 is giving us extra information on the behaviour of the process up to time 1, and the Markov property, which we know holds for the Poisson process, asserts precisely that the extra information doesn’t matter.

Diameters of Trees and Cycle Deletion

In the past two posts, we introduced two models of random trees. The Uniform Spanning Tree chooses uniformly at random from the set of spanning trees for a given underlying graph. The Minimum Spanning Tree assigns IID weights to each edge in the underlying graph, then chooses the spanning tree with minimum induced total weight. We are interested to know whether these are in fact the same distribution, and if they are not, what properties can be used to distinguish them asymptotically.

While investigating my current research problem, I was interested in the diameter of large random trees under various models. Specifically, I am considering what happens if you take a standard Erdos-Renyi process on n vertices, where edges appear at constant rate between pairs of vertices chosen uniformly at random, and add an extra mechanism to prevent the components becoming too large. For this particular model, our mechanism consists of removing any cycles as they are formed. Thus all the components remain trees as time advances, so it is not unreasonable to think that there might be some sort of equilibrium distribution.

Now, by definition, any tree formed by the Erdos-Renyi process is a uniform tree. Why? Well, the probability of a configuration is determined entirely by the number of edges present, so once we condition that a particular set of vertices are the support of a tree, all possible tree structures are equally likely. Note that this relies on sampling at a single fixed time. If we know the full history of the process, then it is no longer uniform. For example, define a k-star to be a tree on k vertices where one ‘centre’ vertex has degree k-1. The probability that a uniform tree on k vertices is a k-star is \frac{k}{k^{k-2}}=k^{-(k-3)}. But a star can only be formed by successively adding single vertices to an existing star. That is, we cannot join a 3-tree and a 4-tree with an edge to get a 7-star. So it is certainly not immediately clear that once we’ve incorporated the cycle deletion mechanism, the resulting trees will be uniform once we condition on their size.

In fact, the process of component sizes is not itself Markovian. For a concrete example, observe first that there is, up to isomorphism, only one tree on any of {0,1,2,3} vertices, so the first possible counterexample will be splitting up a tree on four vertices. Note that cycle deletion always removes at least three edges (ie a triangle), so the two possibilities for breaking a 4-tree are:

(4) -> (2,1,1) and (4) -> (1,1,1,1)

I claim that the probabilities of each of these are different in the two cases: a) (4) is formed from (2,2) and b) (4) is formed from (3,1). This is precisely a counterexample to the Markov property.

In the case where (4) is formed from (2,2), the 4-tree is certainly a path on four vertices. Therefore, with probability 1/3, the next edge added creates a 4-cycle, which is deleted to leave components (1,1,1,1). In the case where (4) is formed from (3,1), then with probability 2/3 it is a path on four vertices and with probability 1/3 it is a 4-star (a ‘T’ shape). In this second case, no edge can be added to make a 4-cycle, so after cycle deletion the only possibility is (2,1,1). Thus the probability of getting (1,1,1,1) is 2/9 in this case, confirming that the process is non-Markovian. However, we might remark that we are unlikely to have O(n) vertices involved in fragmentations until at least the formation of the giant component in the underlying E-R process, so it is possible that the cycle deletion process is ‘almost Markov’ for any property we might actually be interested in.

When we delete a cycle, how many vertices do we lose? Well, for a large tree on n vertices, the edge added which creates the cycle is chosen uniformly at random from the pairs of vertices which are not currently joined by an edge. Assuming that n is \omega(1), that is we are thinking about a limit of fairly large trees, then the number of edges present is much smaller than the number of possible edges. So we might as well assume we are choosing uniformly from the possible edges, rather than just the possible edges which aren’t already present.

If we choose to add an edge between vertices x and y in the tree, then a cycle is formed and immediately deleted. So the number of edges lost is precisely the length of the path between x and y in the original tree. We are interested to know the asymptotics for this length when x and y are chosen at random. The largest path in a graph is called the diameter, and in practice if we are just interested in orders of magnitude, we might as well assume diameter and expected path length are the same.

So we want to know the asymptotic diameter of a UST on n vertices for large n. This is generally taken to be n^{1/2}. Here’s a quick but very informal argument that did genuinely originate on the back of a napkin. I’m using the LERW (loop-erased random walk) definition of the UST. Let’s start at vertex x and perform LERW, and record how long the resultant path is as time t advances. This is a Markov chain: call the path length at time t X_t.

Then if X_t=k, with probability 1-\frac k n we get X_{t+1}=k+1, and for each j in {0,…,k-1}, with probability 1/n we have X_{t+1}=j, as this corresponds to hitting a vertex we have already visited. So

\mathbb{E}\Big[X_{t+1}|X_t=k\Big]=k+1-\frac{k^2+3k}{2n}\approx k+1-\frac{k^2}{2n}.

Note that this drift is positive for k\ll \sqrt n and negative for k\gg \sqrt n, so we would expect n^{1/2} to be the correct scale for the path length in equilibrium. And the expected hitting time of vertex y is n, by a geometric distribution argument, so in fact we would expect this Markov chain to be well into the equilibrium window at this n^{1/2} scale by the time this occurs. As a result, we expect the length of the x to y path to have magnitude n^{1/2}, and assume that the diameter is similar.

So this will be helpful for calculations in the cycle deletion model, provided that the trees look like uniform trees. But does that even matter? Do all sensible models of random trees have diameter going like n^{1/2}? Well, a recent paper of Addario-Berry, Broutin and Reed shows that this is not the case for the minimum spanning tree. They demonstrate that the diameter in this case is n^{1/3}. I found this initially surprising, so tried a small example to see if that shed any light on the situation.

The underlying claim is that MSTs are more likely to be ‘star-like’ than USTs, a term I am not going to define. Let’s consider n=4. Specifically, consider the 4-star with centre labelled as 1. There are six possible edges in K_4 and we want to see how many of the 6! weight allocations lead to this star. If the three edges into vertex 1 have weights 1, 2 and 3 then we certainly get the star, but we can also get this star if the edges have weights 1, 2 and 4, and the edge with weight 3 lies between the edges with weights 1 and 2. So the total number of possibilities for this is 3! x 3! + 3! x 2! = 48. Whereas to get a 4-path, you can assign weights 1, 2 and 3 to the edges of the path, or weights 1, 2 and 4 provided the 4 is not in the middle, and then you have the 3 joining up the triangle formed by 1 and 2. So the number of possibilities for this is 3! x 3! + 4 x 2! = 44.
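These counts can be double-checked by brute force (a short script of my own, enumerating all 6! orderings of the edge weights on K_4 and running Kruskal's algorithm):

```python
from itertools import permutations, combinations

vertices = [1, 2, 3, 4]
edges = list(combinations(vertices, 2))          # the 6 edges of K_4

def mst_edges(order):
    """Kruskal: add edges in increasing weight order unless they would close a cycle."""
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    chosen = set()
    for e in order:                              # order lists the edges from weight 1 up to 6
        a, b = find(e[0]), find(e[1])
        if a != b:
            parent[a] = b
            chosen.add(e)
    return chosen

star = {(1, 2), (1, 3), (1, 4)}                  # the 4-star centred at vertex 1
path = {(1, 2), (2, 3), (3, 4)}                  # a fixed 4-path

star_count = path_count = 0
for order in permutations(edges):                # all 6! = 720 weight assignments
    tree = mst_edges(order)
    star_count += (tree == star)
    path_count += (tree == path)

print("orderings giving the star:", star_count)  # expect 48
print("orderings giving the path:", path_count)  # expect 44
```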

To summarise in a highly informal way, in a star-like tree, you can ‘hide’ some fairly low-scoring weights on edges that aren’t in the tree, so long as they join up very low-scoring edges that are in the tree. Obviously, this is a long way from getting any formal results on asymptotics, but it does at least show that we need to be careful about diameters if we don’t know exactly what mechanism is generating the tree!

Mixing Times 1 – Reversing Markov Chains

A small group of us have started meeting to discuss Levin, Peres and Wilmer’s book on Markov Chains and Mixing Times. (The plan is to cover a couple of chapters every week, then discuss points of interest and some of the exercises – if anyone is reading this and fancies joining, let me know!) Anyway, this post is motivated by something we discussed in our first session.

Here are two interesting facts about Markov Chains. 1) The Markov property can be defined in terms of products of transition probabilities giving the probability of a particular initial sequence. However, a more elegant and general formulation is to say that, conditional on the present, the past and the future are independent. 2) All transition matrices have at least one equilibrium distribution. In fact, irreducible Markov Chains have precisely one equilibrium distribution. Then, if we start with any distribution (and the chain is also aperiodic), the distribution of the chain at time t converges to the equilibrium distribution.

But hang on. This might be a fairly serious problem. On the one hand we have given a definition of the Markov property that is symmetric in time, in the sense that it remains true whether we are working forwards or backwards. While, on the other hand, the convergence to equilibrium is very much not time-symmetric: we move from disorder to order as time advances. What has gone wrong here?

We examine each of the properties in turn, then consider how to make them fit together in a non-contradictory way.

Markov Property

As many of the students in the Applied Probability course learned the hard way, there are many ways to define the Markov property depending on context, and some are much easier to work with than others. For a Markov chain, you can find a way to say that the transition probability \mathbb{P}(X_{n+1}=x_{n+1}\,|\,X_n=x_n,\ldots,X_0=x_0) is independent of x_0,\ldots,x_{n-1}. Alternatively, you can use this to give an inductive specification for the probability of the first n values of X being some sequence.

It requires a moment’s checking to see that the earlier definition of past/future independence is consistent with this. Let’s first check that we haven’t messed up a definition somewhere, and that the time-reversal of a general Markov chain does have the Markov property, even as defined in the context of a Markov chain.

For clarity, consider X_0,X_1,\ldots, X_N a Markov chain on some finite state space, with N some fixed finite end time. We aren’t losing anything by reversing over a finite time interval – after all, we need to know how to do it over a finite time interval before it could possibly make sense to do it over (-\infty,\infty). We examine (Y_n)_{n=0}^N defined by Y_n:= X_{N-n}.

\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1},\ldots,X_N=x_N)=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1})

is the statement of the Markov property for (Y_n). We rearrange the left hand side to obtain:

=\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1},\ldots,X_N=x_N)}{\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_N=x_N)}

=\frac{\mathbb{P}(X_N=x_N|X_n=x_n,\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_n=x_n,\ldots,X_{N-1}=x_{N-1})}{\mathbb{P}(X_N=x_N|X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})}.

Now, by the standard Markov property on the original chain (X_n), the first factor in the numerator is equal to the first factor in the denominator. This leaves us with exactly the same form of expression as before, but with one fewer term in the probability. So we can iterate until we end up with

\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1})}{\mathbb{P}(X_{n+1}=x_{n+1})}=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1}),

as required.

So there’s nothing wrong with the definition. The reversed chain Y genuinely does have this property, regardless of the initial distribution of X.

In particular, if our original Markov chain starts at a particular state with probability 1, and we run it up to time N, then saying that the time-reversal is a Markov chain too is making a claim that we have a non-trivial chain that converges from some general distribution at time 0 to a distribution concentrated at a single point by time N. This seems to contradict everything we know about these chains.

Convergence to Equilibrium – Markov Property vs Markov Chains

It took us a while to come up with a reasonable explanation for this apparent discrepancy. In the end, we come to the conclusion that Markov chains are a strict subset of stochastic processes with the Markov property.

The key thing to notice is that a Markov chain has even more regularity than the definition above implies. The usual description via a transition matrix says that the probability of moving to state y at time t+1 given that you are at state x at time t is some function of x and y. The Markov property says that this probability is independent of the behaviour up until time t. But we also have that the probability is independent of t. The transition matrix P has no dependence on time t – for example in a random walk we do not have to specify the time to know what happens next. This is the property that fails for the non-stationary time-reversal.

In the most extreme example, we say X_0=x_0 with probability 1. So in the time reversal, \mathbb{P}(Y_N=x_0|Y_{N-1}=y_{N-1})=1 for all y_{N-1}. But it will obviously not be the case in general that \mathbb{P}(Y_n=x_0|Y_{n-1}=y_{n-1})=1 for all y_{n-1}, as this would mean the chain Y would be absorbed after one step at state x_0, which is obviously not how the reversal of X should behave.
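To see this numerically (a toy example of my own, with an arbitrary two-state transition matrix): start the chain from a point mass, compute the forward marginals, and then the backward transition probabilities \mathbb{P}(X_t=x_0\,|\,X_{t+1}=y) via Bayes. The first backward step is deterministic, as noted above, but the later ones are not, and they visibly depend on t: the reversed process has the Markov property but is not a time-homogeneous chain.

```python
import numpy as np

P = np.array([[0.9, 0.1],          # transition matrix of a two-state chain
              [0.5, 0.5]])
N = 5
mu = np.zeros((N + 1, 2))
mu[0] = [1.0, 0.0]                 # X_0 = state 0 with probability 1
for t in range(N):
    mu[t + 1] = mu[t] @ P          # forward marginals

# Backward transition probabilities P(X_t = 0 | X_{t+1} = y), by Bayes' rule.
for t in range(N):
    for y in (0, 1):
        back = mu[t, 0] * P[0, y] / mu[t + 1, y]
        print(f"P(X_{t}=0 | X_{t+1}={y}) = {back:.3f}")
```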

Perhaps the best way to reconcile this difference is to consider this example where you definitely start from x_0. Then, a Markov chain in general can be thought of as a measure on paths, that is \Omega^N, with non-trivial but regular correlations between adjacent components. (In the case of stationarity, all the marginals are equal to the stationary distribution – a good example of identically distributed but not independent RVs.) This is indexed by the transition matrix and the initial distribution. If the initial distribution is a single point mass, then this can be viewed as a restriction to a smaller set of possible paths, with measures rescaled appropriately.

What have we learned?

Well, mainly to be careful about assuming extra structure with the Markov property. Markov Chains are nice because there is a transition matrix which is constant in time. Other processes, such as Brownian motion, are space-homogeneous, where the transitions, or increments in this context, are independent of time and space. However, neither of these properties is true for a general process with the Markov property. Indeed, we have seen in a post from a long time ago that there are Markov processes which do not have the Strong Markov Property, which seems unthinkable if we limit our attention to chain-like processes.

Most importantly, we have clarified the essential point that reversing a Markov Chain only really makes sense in equilibrium. It is perfectly possible to define the reversal of a chain not started at a stationary distribution, but lots of unwelcome information from the forward chain ends up in the reversed chain. In particular, the theory of Markov Chains is not broken, which is good.

The Inspection Paradox and related topics

In the final class for Applied Probability, we discussed the so-called Inspection Paradox for an arrivals process. We assume that buses, say, arrive as a Poisson process with rate 1, and consider the size of the interval (between buses) containing some fixed time T. The ‘paradox’ is that this interval is larger in expectation than a typical time between buses, which is of course an Exp(1) random variable with mean 1.

As with many paradoxes, it isn’t really that surprising after all. Perhaps what is more surprising is that the difference between the expected observed interval and the expected actual interval time is quite small here. There are several points of interest:

1) The Markov property of the Poisson process is key. In particular, this says that the expectation (and indeed the distribution) of the waiting time for a customer arriving at time T does not depend on T, even if T is random (provided it belongs to the class of stopping times). So certainly the inspection paradox will hold whenever the process has the Markov property, because the inspected interval contains the inspected waiting time, which is equal in distribution to any fixed inter-arrival time.

2) Everything gets slightly complicated by the fact that the Poisson process is defined to begin at 0. In particular, it is not reversible. Under the infinitesimal (or even the independent Poisson increments) definition, we can view the Poisson process not as a random non-decreasing function of time, but rather as a random collection of points on the positive reals. With this setup, it is clearly no problem to define instead a random collection of points on all the reals. [If we consider this instead as a random collection of point masses, then this is one of the simplest examples of a random measure, but that’s not hugely relevant here.]

We don’t need to worry too much about what value the Poisson process takes at any given time if we are only looking at increments, but if it makes you more comfortable, you can still safely assume that it is 0 at time 0. Crucially, the construction IS now reversible. The number of points in the interval [s,t] has a distribution parameterised only by t-s, so it doesn’t matter in which direction we move along the real line. In this case, A_t, the time since the previous arrival, and E_t, the waiting time until the next arrival, are both Exp(1) RVs, as the memorylessness property applies in each time direction.

For the original Poisson process, A_t is in fact stochastically dominated by an Exp(1) distribution, because it is truncated at t (there may be no arrival at all in [0,t]). So in this case, the expected interval length is some complicated function of t, lying strictly between 1 and 2. In the process extended to the whole real line, the expected interval length is exactly 2.
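Here is a quick simulation sketch of the inspection paradox (my own, not from the class): for a rate-1 Poisson process we record the length of the inter-arrival interval containing a fixed time T, with T large compared to the mean gap so that the effect of starting at 0 is negligible.

import numpy as np

rng = np.random.default_rng(0)

def inspected_interval(T=100.0):
    # Length of the inter-arrival interval containing time T, for a
    # rate-1 Poisson process started at 0.
    t, prev = 0.0, 0.0
    while t <= T:
        prev = t
        t += rng.exponential(1.0)
    return t - prev                    # the interval straddling T

samples = [inspected_interval() for _ in range(20_000)]
print(np.mean(samples))                # close to 2, not 1

The sample mean comes out close to 2, the mean of the size-biased \Gamma(2,1) interval discussed in point 3 below, rather than 1, the mean of a typical inter-arrival time.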

This idea of extending onto the whole real line explains why we mainly consider delayed renewal processes rather than just normal renewal processes. The condition that we start a holding time at 0 is often not general enough, particularly when the holding times are not exponential and so the process is not memoryless.

3) There is a general size-biasing principle in action here. Roughly speaking, we are more likely to arrive in a large interval than in a small interval, and the required weighting is proportional to the length of the interval. Given a density function f(x) of X, we define the size-biased density function to be proportional to xf(x); to normalise it into a probability density, dividing by the expectation EX is precisely what is needed. Failure to account for whether an observation should have the underlying distribution or the size-biased distribution is a common cause of supposed paradoxes. A standard example is ‘on average my friends have more friends than I do’. Obviously, various assumptions about me, my friends, and how we are drawn from the set of people (and about the distribution of the number of friends) are required, and these might not be entirely accurate in all situations.
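As a toy numerical check of the ‘friends of friends’ claim (my own illustration, with a completely made-up friendship graph), we can compare the average number of friends of a person with the average number of friends of a uniformly chosen friend of a uniformly chosen person; the latter is the size-biased quantity.

import numpy as np

# Hypothetical friendship graph given as an adjacency list.
friends = {
    0: [1, 2, 3, 4],                   # one popular person
    1: [0],
    2: [0],
    3: [0, 4],
    4: [0, 3],
}
degrees = {v: len(fs) for v, fs in friends.items()}

mean_degree = np.mean(list(degrees.values()))
# Average degree of a random friend of a random person: popular people
# are over-represented, which is exactly the size-biasing effect.
friend_degree = np.mean([np.mean([degrees[f] for f in friends[v]])
                         for v in friends])
print(mean_degree, friend_degree)      # 2.0 versus 3.1 for this graph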

In the Poisson process example above, the holding times have density function e^{-x}, so the size-biased density function is xe^{-x}. This is the density of a \Gamma(2,1) distribution, which can be written as the sum of two independent Exp(1) RVs, as suggested above.

4) A further paradox mentioned on the sheet is the waiting time paradox. This says that the expected waiting time can be longer if you arrive at a random time than if you arrive just as a bus departs. This is not too surprising: consider the stereotypical complaint about buses in this country arriving in threes. If you just miss a bus, then there is roughly a 2/3 chance that another will turn up essentially immediately. On the sheet, we showed that the \Gamma(\alpha,1) distribution also has this property, provided \alpha<1.

We can probably do slightly better than this. The memoryless property of the exponential distribution says that:

\mathbb{P}(Z>t+s|Z>t)=\mathbb{P}(Z>s).

In general, for the sort of holding times we might see at a bus stop, we might expect that if we have already waited a long time, then we are relatively less likely to have to wait much longer, that is:

\mathbb{P}(Z>t+s|Z>t)\leq\mathbb{P}(Z>s),

and furthermore this will be strict if neither s nor t is 0. I see no reason not to make up a definition, and call this property supermemorylessness. However, for the subclass of Gamma distributions described above, we have the opposite property:

\mathbb{P}(Z>t+s|Z>t)\geq\mathbb{P}(Z>s).

Accordingly, let’s call this submemorylessness. If the inequality is strict, it says that we are more likely to have to wait a long time if we have already been waiting a long time. This seems contrary to most real-life distributions, but it certainly explains the paradox: if we arrive at a random time, then the relevant holding time has already been ‘waiting’ for a while, so it is more likely to require a long observed wait than if we had arrived just as the previous bus departed.
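As a small sanity check of these two inequalities (my own sketch, using scipy's survival functions and arbitrary choices of s and t):

from scipy.stats import expon, gamma

s, t = 1.0, 2.0
for name, dist in [("Gamma(0.5,1)", gamma(a=0.5)), ("Exp(1)", expon())]:
    conditional = dist.sf(t + s) / dist.sf(t)    # P(Z > t+s | Z > t)
    print(name, conditional, dist.sf(s))         # compare with P(Z > s)
# For Gamma(0.5,1), with alpha < 1, the conditional probability exceeds
# P(Z > s) (submemorylessness); for Exp(1) the two agree exactly.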

In conclusion, before you think of something as a paradox, think about whether the random variables being compared are being generated in the same way, in particular whether one is size-biased, and whether there are effects due to non-memorylessness.

The Poisson Process – A Third Characterisation

There remains the matter of the distribution of the number of arrivals in a fixed, non-infinitesimal time interval. Consider the time interval [0,1], which we divide into n smaller intervals of equal width. As n grows large, the probability that two arrivals occur in the same small interval tends to zero (as it is at most n\cdot o(\frac{1}{n})), so we can treat the counts in the small intervals as a sequence of iid Bernoulli random variables as before. So

\mathbb{P}(N_1=k)=\binom{n}{k}(\frac{\lambda}{n})^k(1-\frac{\lambda}{n})^{n-k}
\approx \frac{n^k}{k!} \frac{\lambda^k}{n^k}(1-\frac{\lambda}{n})^n\approx \frac{\lambda^k}{k!}e^{-\lambda}.

We recognise this as the mass function of a Poisson (hence the name of the process!) random variable. We can repeat this for a general time interval and obtain N_t\sim \text{Po}(\lambda t).
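A quick numerical check of the limit above (my own, with an arbitrary choice of \lambda=3 and k=2):

from math import comb, exp, factorial

lam, k = 3.0, 2
for n in (10, 100, 10_000):
    p = lam / n
    print(n, comb(n, k) * p**k * (1 - p)**(n - k))   # Binomial(n, lam/n) pmf at k
print("limit:", lam**k / factorial(k) * exp(-lam))   # Poisson(lam) pmf at k

As n grows, the binomial probabilities settle down to the Poisson value, as the approximation suggests.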

Note that we implicitly assumed that, in the infinitesimal case at least, behaviour in disjoint intervals was independent. We would hope that this would lift immediately to the large intervals, but it is not immediately obvious how to make this work. This property of independent increments is one of the defining properties of a Levy process, of which the Poisson process is one of the two canonical examples (the other is Brownian Motion).

As before, if we can show that the implication goes both ways (and for this case it is not hard – letting t\rightarrow 0 clearly recovers the infinitesimal construction), we can prove results about Poisson random variables with ease, for example that the sum of independent \text{Po}(\lambda) and \text{Po}(\mu) random variables is \text{Po}(\lambda+\mu).

This pretty much concludes the construction of the Poisson process. We have three characterisations:
1) The holding times X_n\sim\text{Exp}(\lambda), all iid.
2) The infinitesimal construction as before, with independence.
3) The number of arrivals in a time interval of width t is \sim \text{Po}(\lambda t). (This is sometimes called a stationary increments property.) Furthermore, we have independent increments.

A formal derivation of the equivalence of these forms is important but technical, and so not really worth attempting here. See James Norris’s book for example for a fuller exposition.
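Not a substitute for that derivation, but here is a simulation sketch (mine) tying characterisation 1) to characterisation 3): build the process from iid Exp(\lambda) holding times, count the arrivals in [0,t], and check that the mean and variance of the count both come out near \lambda t, as they should for a Po(\lambda t) random variable.

import numpy as np

rng = np.random.default_rng(1)
lam, t, trials = 2.0, 5.0, 50_000

def count_arrivals():
    # Characterisation 1: sum iid Exp(lam) holding times until we pass t.
    total, n = 0.0, 0
    while True:
        total += rng.exponential(1 / lam)
        if total > t:
            return n
        n += 1

counts = np.array([count_arrivals() for _ in range(trials)])
print(counts.mean(), counts.var())     # both should be close to lam * t = 10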

The final remark is that the Poisson Process has the Markov property. Recall that this says that, conditional on the present, future behaviour is independent of the past. Without getting into too much detail, we might like to prove this using the independent increments property. But remember that for a continuous-time process, it is too much information to keep track of all the distributions at once; it is sufficient to track only the finite-dimensional marginal distributions, provided the process is cadlag, which the Poisson process is, assuming we deal with the discontinuities in the right way. Alternatively, the exponential random variable is memoryless, a property that can be lifted, albeit with some technical difficulties, to give the Markov property.

Strong Markov Property for BM

The Strong Markov Property is the most important result to demonstrate for any Markov process, such as Brownian Motion. It is also probably the most widely requested item of bookwork on the Part III Advanced Probability exam. I feel it is therefore worth practising writing as quickly as possible.

Theorem (SMP): Take (B_t) a standard (\mathcal{F}_t)-BM, and T an a.s. finite stopping time. Then (B_{T+t}-B_T,t\geq 0) is a standard BM independent of \mathcal{F}_T.

Proof: We write B_t^{(T)}=B_{T+t}-B_T for ease of notation. We will show that for any A\in\mathcal{F}_T, any times 0\le t_1<\cdots<t_n, and any F bounded and continuous (the case of general bounded measurable F then follows by a standard approximation argument):

\mathbb{E}[1_AF(B_{T+t_1}-B_T,\ldots,B_{T+t_n}-B_T)]=\mathbb{P}(A)\mathbb{E}F(B_{t_1},\ldots,B_{t_n})

This will suffice to establish independence, and taking A=\Omega\in\mathcal{F}_T shows that B_t^{(T)} is a standard BM, since (Levy) BM is uniquely characterised by its finite-dimensional joint distributions.

To prove the result, we approximate discretely, and apply the Markov property.

\mathbb{E}[1_AF(B_{t_1}^{(T)},\ldots)]=\lim_{m\rightarrow\infty}\sum_{k=1}^\infty \mathbb{E}[1_{A\cap\{T\in((k-1)2^{-m},k2^{-m}]\}}F(B_{t_1}^{(k2^{-m})},\ldots)]

by bounded convergence, using continuity of F, right-continuity of B, and that T<\infty a.s. (so that 1_A=\sum_k 1_{A\cap \{T\in((k-1)2^{-m},k2^{-m}]\}}). Note also that A\cap\{T\in((k-1)2^{-m},k2^{-m}]\}\in\mathcal{F}_{k2^{-m}} since A\in\mathcal{F}_T, which is what allows the weak Markov property to be applied at time k2^{-m} in the next step:

\stackrel{\text{WMP}}{=}\lim_{m\rightarrow\infty}\sum_{k=1}^\infty \mathbb{P}[A\cap\{T\in((k-1)2^{-m},k2^{-m}]\}]\mathbb{E}F(B_{t_1},\ldots,B_{t_n})

\stackrel{\text{DOM}}{=}\mathbb{P}(A)\mathbb{E}F(B_{t_1},\ldots,B_{t_n})

which is exactly what we required.

Remarks: 1) We only used right-continuity of the process, and characterisation by joint marginals, so the proof works equally well for Levy processes.

2) We can in fact show that the post-T process is independent of \mathcal{F}_T^+, by considering T+\frac{1}{n}, which is still a stopping time, and then also taking a limit in n in the above proof. For details of a similar result, see my post on Blumenthal’s 0-1 Law.
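Finally, for intuition rather than rigour, here is a crude simulation sketch (my own, with arbitrary discretisation parameters): take T to be the hitting time of level 1 for a discretised Brownian motion and check that B_{T+t}-B_T has mean roughly 0 and variance roughly t, as the SMP predicts.

import numpy as np

rng = np.random.default_rng(2)
dt, t_extra = 0.01, 1.0

def post_hitting_increment(level=1.0, horizon=20.0):
    # Run a discretised BM until it first exceeds `level` (or give up at
    # `horizon`), then return B_{T + t_extra} - B_T.
    b, t = 0.0, 0.0
    while b < level:
        b += np.sqrt(dt) * rng.standard_normal()
        t += dt
        if t > horizon:
            return None
    start = b
    for _ in range(int(t_extra / dt)):
        b += np.sqrt(dt) * rng.standard_normal()
    return b - start

samples = [x for x in (post_hitting_increment() for _ in range(2000)) if x is not None]
print(np.mean(samples), np.var(samples))   # roughly 0 and t_extra = 1

The horizon is just a practical cap: \tau_1 is a.s. finite but has infinite mean, so some runs are abandoned and discarded.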