Hitting Probabilities for Markov Chains

This continues my previous post on popular questions in second year exams. In the interest of keeping it under 2,500 words I’m starting a new article.

In a previous post I’ve spoken about the two types of Markov chain convergence, in particular, considering when they apply. Normally the ergodic theorem can be used to treat the case where the chain is periodic, so the transition probabilities do not converge to a stationary distribution, but do have limit points – one at zero corresponding to the off-period transitions, and one non-zero. With equal care, the case where the chain is not irreducible can also be treated.

A favourite question for examiners concerns hitting probabilities and expected hitting times of a set A. Note these are unlikely to come up simultaneously. Unless the hitting probability is 1, the expected hitting time is infinite! In both cases, we use the law of total probability to derive a family of equations satisfied by the probabilities/times. The only difference is that for hitting times, we add +1 on the right hand side, as we advance one time-step to use the law of total probability.

The case of hitting probabilities is perhaps more interesting. We have:

h_i^A = 1,\; i\in A, \quad h_i^A=\sum_{j\in S}p_{ij}h_j^A,\; i\not\in A.

There are two main cases of interest: where the chain is finite but has multiple closed communicating classes, and where the chain is infinite, so even though it is irreducible, a trajectory might diverge before hitting 0.

For the case of a finite non-irreducible Markov chain, this is fairly manageable, by solving backwards from states where we know the values. Although of course you could ask about the hitting probability of an open state, the most natural question is to consider the probability of ending up in a particular closed class. Then we know that the hitting probability starting from site in the closed class A is 1, and the probability starting from any site in a different closed class is 0. To find the remaining values, we can work backwards one step at a time if the set of possible transitions is sparse enough, or just solve the simultaneous equations for \{h_i^A: i\text{ open}\}.

We therefore care mainly about an infinite state-space that might be transient. Typically this might be some sort of birth-and-death chain on the positive integers. In many cases, the hitting probability equations can be reduced to a quadratic recurrence relation which can be solved, normally ending up with the form


where \lambda might well be q/p or similar if the chain is symmetric. If the chain is bounded, typically you might know h_0=1, h_N=0 or similar, and so you can solve two simultaneous equations to find A and B. For the unbounded case you might often only have one condition, so you have to rely instead on the result that the hitting probabilities are the minimal solution to the family of equations. Note that you will always have h^i_i=1, but with no conditions, h^i_j\equiv 1 is always a family of solutions.

It is not clear a priori what it means to be a minimal solution. Certainly it is not clear why one solution might be pointwise smaller than another, but in the case given above, it makes sense. Supposing that \lambda<1, and A+B=1 say, then as we vary the parameters, the resulting set of ‘probabilities’ does indeed vary monotically pointwise.

Why is this true? Why should the minimum solution give the true hitting probability values? To see this, take the equations, and every time an h_i^A appears on the right-hand side, substitute in using the equations. So we obtain, for i\not\in A,

h_i^A=\sum_{j\in A}p_{ij}+\sum_{j\not\in A} p_{ij}h_j^A,

and after a further iteration

h_i^A=\sum_{j_1\in A}p_{ij_1}+\sum_{j_1\not\in A, j_2\in A}p_{ij_1}p_{j_1j_2}+\sum_{j_1,j_2\not\in A}p_{ij_1}p_{j_1j_2}h_{j_2}^A.

So we see on the RHS the probability of getting from i to A in one step, and in two steps, and if keep iterating, we will get a large sum corresponding to the probability of getting from i to A in 1 or 2 or … or N steps, plus an extra term. Note that the extra term does not have to correspond to the probability of not hitting A by time N. After all, we do not yet know that (h_{i}^A) as defined by the equations gives the hitting probabilities. However, we know that the probability of hitting A within N steps converges to the probability of hitting A at all, since the sequence is increasing and bounded, so if we take a limit of both sides, we get h_i^A on the left, and something at least as large as the hitting probability starting from i on the right, because of the extra positive term. The result therefore follows.

It is worth looking out for related problems that look like a hitting probability calculation. There was a nice example on one of the past papers. Consider a simple symmetric random walk on the integers modulo n, arranged clockwise in a circle. Given that you start at state 0, what is the probability that your first return to state 0 involves a clockwise journey round the circle?

Because the system is finite and irreducible, it is not particularly interesting to consider the actual hitting probabilities. Also, note that if it is convenient to do so, we can immediately reduce the problem when n is even. In two steps, the chain moves from j to j+2 and j-2 with probability ¼ each, and stays at j with probability ½. So the two step chain is exactly equivalent to the lazy version of the same dynamics on n/2.

Anyway, even though the structure is different, our approach should be the same as for the hitting probability question, which is to look one step into the future. For example, to stand a chance of working, our first two moves must both be clockwise. Thereafter, we are allowed to move anticlockwise. There is nothing special about starting at 0 in defining the original probability. We could equally well ask for the probability that starting from j, the first time we hit 0 we have moved clockwise round the circle.

The only thing that is now not obvious is how to define moving clockwise round the circle, since it is not the case that all the moves have to be clockwise to have experienced a generally clockwise journey round the circle, but we definitely don’t want to get into anything complicated like winding numbers! In fact, the easiest way to make the definition is that given the hitting time of 0 is T, we demand that the chain was at state n at time T-1.

For convenience (ie to make the equations consistent) we take h_0=0, h_n=1 in an obvious abuse of notation, and then

h_j=\frac12h_{j-1}+\frac12 h_{j+1},

from which we get

h_j=a+bj \Rightarrow h_j=\frac{j}{n}.

Of course, once we have this in mind, we realise that we could have cut the circle at 0 (also known as n) and unfolded it to reduce the problem precisely to symmetric gambler’s ruin. In particular, the answer to the original problem is 1/2n, which is perhaps just a little surprising – maybe by thinking about the BM approximation to simple random walk, and that BM started from zero almost certainly crosses zero infinitely many times near we might have expected this probability to decay faster. But once it is unfolded into gambler’s ruin, we have the optimal stopping martingale motivation to reassure us that this indeed looks correct.

Sticky Brownian Motion

This follows on pretty much directly from the previous post about reflected Brownian motion. Recall that this is a process defined on the non-negative reals which looks like Brownian motion away from 0. We consider whether RBM is the only such process, and how any alternative might be constructed as a limit of discrete-time Markov processes.

One of the alternatives is called Sticky Brownian motion. This process spends more time at 0 than reflected Brownian motion. In fact it spends some positive proportion of time at 0. My main aim here is to explain why some intuitive ideas I had about how this might arise are wrong.

The first thought was to ensure that each visit to 0 last some positive measure of time. This could be achieved by staying at 0 for an Exp(1) duration, every time the process visited it. It doesn’t seem unreasonable that this might appear as the limit of a standard SRW adjusted so that on each visit to 0 the walker waits for a time given by independent geometric distributions. These distributions are memoryless, so that is fine, but by Blumenthal’s 0-1 Law, starting from 0 a Brownian motion hits zero infinitely many times before any small time t. So in fact the process described above would be identically zero as before it gets anywhere it would have to spend some amount of time at 0 given by an infinite sum of Exp(1) RVs.

We will return later to the question of why the proposed discrete-time model will still converge to reflected BM rather than anything more exotic. First though, we should discount the possibility of any intermediate level of stickiness, where the set of times spent at 0 still has measure zero, but the local time at 0 grows faster than for standard reflected BM. We can define the local time at 0 through a limit

L_t=\lim_{\epsilon\downarrow 0}\frac{1}{2\epsilon}\text{Leb}(\{0\le s \le t: |B_t|<\epsilon\})

of the measure of time spent very near 0, rescaled appropriately. So if the measure of the times when the process is at 0 is zero, then the local time is determined by behaviour near zero rather than by behaviour at zero. More precisely, on the interval [-\epsilon,\epsilon], the process behaves like Brownian motion, except on a set of measure zero, so the local time process should look the same as that of BM itself. Note I don’t claim this as a formal proof, but I hope it is a helpful heuristic for why you can’t alter the local time process without altering the whole process.

At this stage, it seems sensible to define Sticky Brownian motion. For motivation, note that we are looking for a process which spends a positive measure of time at 0. So let’s track this as a process, say C_t. Then the set of times when C is increasing is sparse, as it coincides with the process being 0, but we know we cannot wait around at 0 for some interval of time without losing the Markov property. So C shares properties with the local time of a reflected BM. The only difference is that the measure of times when C is increasing is positive here, but zero for the local time.

So it makes sense to construct the extra time spent at zero from the local time of a standard reflected BM. The heuristic is that we slow down the process whenever it is at 0, so that local time becomes real time. We can also control the factor by which this slowing-down happens, so define

\sigma(s)=\rho L(s)+s,

where L is the local time process of an underlying reflected BM, and \rho>0 is a constant. So \sigma is a map giving a random time-change. Unsurprisingly, we now define Sticky BM as the reflected BM with respect to this time-change. To do this formally, it is easiest to define a family of stopping times \{\tau_t\}, such that \sigma(\tau_t)=t, \tau_{\sigma(s)}=s, then if X is the reflected BM, define Y_t=X_{\tau_t} for the sticky BM.

It is worth thinking about what the generator of this process should be. In particular, why should it be different to reflected BM? The key observation is that the drift of the underlying reflected BM is essentially infinite at 0. By slowing down the process at 0, this drift becomes finite. So the martingale associated with sticky BM is precisely a time-changed version of the martingale associated with the underlying reflected BM, but this time-change is precisely what is required to give a generator. We get:

(\mathcal{L}f)(x)=\begin{cases}\frac12f''(x)&\quad x>0\\ \rho^{-1}f'(0) &\quad x=0.\end{cases}

Now that we have the generator, it starts to become apparent how sticky BM might appear as a limit of discrete-time walks. The process must look like mean-zero, unit-variance RW everywhere except near 0, where the limiting drift should be \rho^{-1}. Note that when considering the limiting drift near zero, we are taking a joint limit in x and h. The order of this matters. As explained at the end of the previous article, we only need to worry about the limiting drift along sequences of x,h such that a_h(x)\rightarrow 0. If no such sequences exist, or the limiting drift along any of these is infinite, then we actually have a reflected boundary condition.

This highlights one confusing matter about convergence of reflected processes. The boundary of the discrete-time process should converge to the boundary of the reflected process, but we also have to consider where reflective behaviour happens. Can we get sticky BM with reflection only at the boundary in the discrete-time processes? The answer turns out to be no. At the start of this article, I proposed a model of SRW with geometric waiting times whenever the origin was visiting. What is the limit of this?

The trick is to consider how long the discrete process spends near 0, after rescaling. It will spend a multiple 1/p more time at 0 itself, where p is the parameter of the geometric distribution, but no more time than expected at any point x\in(0,\epsilon). But time spent in (0,\epsilon) dominates time spent at 0 before this adjustment, so must also dominate it after the adjustment, so in the limit, the proportion of time spent at 0 is unchanged, and so in particular it cannot be positive.

Because of all of this, in practice it seems that most random walks we might be interested in converge (if they converge to a process at all) to a reflected SDE/diffusion etc, rather than one with sticky boundary conditions. I feel I’ve been talking a lot about Markov processes converging, so perhaps next, or at least soon, I’ll write some more technical things about exactly what conditions and methods are required to prove this.


S. Varadhan – Chapter 16 from a Lecture Course at NYU can be found here.

Enhanced by Zemanta

Markovian Excursions

In the previous post, I talked about the excursions of a Brownian motion. Today I’m thinking about how to extend these ideas to more general Markov chains. First we want to rule out some situations. In particular, we aren’t hugely interested in discrete time Markov chains. The machinery is fairly well established for excursions, whether or not the chain is transient. Furthermore, if the state space is discrete, as for a Poisson process for example, the discussion is not hugely interesting either. Remember that the technical challenges in the constructions of local time arise because of the Blumenthal 0-1 law property that Brownian motion visits 0 infinitely often in the small window after the start time. We therefore want the process to be regular at the point of the state space under discussion. This is precisely the condition described above for BM about 0.

Why is it harder in general?

The informal notion of a local time should transfer to a more general Markov chain, but there are some problems. Firstly, to define something in terms of an integral is not general enough. The state space E needs some topological structure, but any meaningful definition must be in terms of functions from E into the reals. There were also all sorts of special properties of Brownian motion, including the canonical time-space rescaling that came in handy in that particular case. It turns out to be easiest to consider the excursion measure on a general Markov chain through its Laplace transform.

Definition and Probabilistic interpretations

The resolvent is the Laplace transform of the transition probability P_t(x,A), viewed as an operator on functions f:E\rightarrow \mathbb{R}.

R_\lambda f(x):=\mathbb{E}_x\left[\int_0^\infty e^{-\lambda t}f(X_t)dt\right]=\int_0^\infty e^{-\lambda t}P_tf(x)dt.

We can interpret this in terms of the original process in a couple of ways which may be useful later. The motivation is that we would like to specify a Poisson process of excursions, for which we need to know the rate. We hope that the rate will in fact be constant, so it will in fact to suffice to work out the properties of the expected number of excursions (or whatever) up to various random times, in particular those given by exponential RVs.

So, we take \zeta\sim\text{Exp}(\lambda) independent of everything else, and assume that we ‘kill the chain’ at time \zeta. Then, by shuffling expectations in and out of the integral and separating independent bits, we get:

R_\lambda f(x)=\mathbb{E}_x\int_0^\zeta f(X_s)ds = \frac{1}{\lambda}\mathbb{E}_xf(X_\zeta).

As in the Brownian local time description

R_\lambda 1_A(x)=\mathbb{E}(\text{time spent in }A\text{ before death at time }\zeta_\lambda).

Markovian property

We want to show that excursions are Markov, once we’ve sorted out what an ‘excursion’ actually means in this context. We do know how to deal with the Markovian property once we are already on an excursion. It is relatively straightforward to define an extension of the standard transition probability operator, to include a condition that the chain should not hit point a during the transition. That is

_aP_t(x,A):= \mathbb{P}_x(X_t\in A\cap H_a>t).

This will suffice to define the behaviour once an excursion has started. The more complicated bit will be the entrance law n_t(A), being the probability of arriving at A after time t of an excursion. To summarise, as with BM, all the technical difficulties with an excursion happen at the beginning, ie bouncing around the start point, but once it is ‘up-and-running’, its path is Markovian and controlled by _aP_t.


The link between the resolvent and the excursions, is provided as in the Brownian case, by supplying a PPP of marks at uniform rate \lambda to real time. This induces a mark process on excursions, weighted by an (exponential) function of excursion length. We make no distinction between an excursion including one mark or many marks. Then the measure on marked excursion is, in a mild abuse of notation:

n_\lambda=(1-e^{-\lambda\zeta(f)})\cdot n.

We compare with the Laplace transform n_\lambda(dx)=\int_0^\infty e^{-\lambda t}dtn_t(dx) using a probabilistic argument.

We can apply the measure to a function in the usual way: \lambda n_\lambda(1_A) is the measure of those excursions for which the first mark occurs in A. So by taking A=E, we get

\lambda n_\lambda(1)=\text{ Excursion measure }=\int_U n(df)(1-e^{-\lambda\zeta(f)}).

We have therefore linked the exponential mark process on excursion measure with the Laplace transform of the entrance law. So in particular:

\frac{\lambda n_\lambda(A)}{\lambda n_\lambda(1)}=\mathbb{P}(\text{first mark when in }A)=\int_0^\infty \lambda e^{-\lambda t}P_t(0,A)dt=\lambda R_\lambda 1_A(0).

The resolvent is relatively easy to calculate explicitly, and so we can find the Laplace transform n_\lambda(A). From this it is generally possible to invert the transform to find the entrance law n_t.


A Guided Tour Through Excursions – L. C. G. Rogers.

This pair of posts is very much a paraphrase of chapters 3 and 4 of this excellent text. The original can be found here (possibly not open access?)

Recurrence and Transience of BM

In this post, we consider Brownian motion as a Markov process, and consider the recurrence and transience properties in several dimensions. As motivation, observe from Question 5 of this exam paper that it is a very much non-trivial operation to show that Brownian motion in two-dimensions almost surely has zero Lebesgue measure. We would expect this to be true by default, as we visualise BM as a curve. So it is interesting to see how much we can deduce without significant technical analysis. We will make use of Ito’s formula. This material comes from the Part III course Advanced Probability, which doesn’t explicitly mention Ito’s result, and instead proves the result required separately, making use of the nature of solutions to the diffusion equation. In this context we assume that for f\in C_b^{1,2}:

M_t:=f(t,B_t)-f(0,B_0)-\int_0^t(\frac{\partial}{\partial t}+\frac12\Delta)f(s,B_s)ds

is a martingale. Of course, precisely, from Ito’s formula, this can be expressed as the stochastic integral of a bounded function with respect to Brownian motion, which is therefore a (continuous local, but bounded) martingale.

d=1: In one dimension, BM is point-recurrent. This means that almost surely, BM returns to zero infinitely many times. This is easiest shown by using the time-reversal equivalence to deduce that \lim\sup B_t=-\lim\inf B_t=\infty.

d=2: BM is two dimensions is point-transient. That means that the probability of returning to a given point is <1. In fact it is 0, as one might suspect from the fact that BM is space-invariant and, intuitively at least, has measure 0. However, it is neighbourhood-recurrent, meaning that it almost surely returns to a ball around a given point infinitely often. We discuss small balls around 0, but obviously the conclusions apply equally well elsewhere.

The aim is to choose a function so that the expression in Ito’s formula as above is as simple as possible. Taking f a function of space alone and harmonic causes the integral term to vanish. In this case, f(y)=\log|y| will suffice. Obviously we have to restrict attention to \epsilon\leq |y|\leq R. We stop M at T_\epsilon\wedge T_R, that is the first time that the BM hits the boundary of the annulus on which f is defined, and apply OST, since \log|B_t| is bounded here and the stopping time is a.s. finite. We obviously have to assume the BM starts in this annulus, but then we obtain:

\mathbb{E}_x\log|B_{T_\epsilon\wedge T_R}|=\log|x|

and so we can consider the two possibilities for B_{T_\epsilon\wedge T_R} to deduce:

\mathbb{P}_x(T_\epsilon<T_R)=\frac{\log R-\log|x|}{\log R-\log\epsilon}

Now let \epsilon\downarrow 0 to see that \mathbb{P}_x(B_t=0,\text{ some }t>0)=0. Now apply the (weak) Markov property at a small fixed time a, to deduce, with a mild abuse of notation:

\mathbb{P}_0(B_t=0,\text{ some }t>a)=\int \mathbb{P}_x(B_t=0,t>0)\mathbb{P}_0(B_a=dx)=0

as the first term in the integral we have shown to be 0 B_a-a.e. Then let a\downarrow 0 to obtain the result about point-transience.

For neighbourhood recurrence, instead let R\uparrow\infty, so \mathbb{P}_x(T_\epsilon<\infty)=1. As before, can integrate over law of B_n to obtain

\mathbb{P}_0(|B_t|\leq \epsilon,\text{ some }t\geq n)=1

which is precisely what we require for transience.

d=>3: BM is transient. That is, |B_t|\rightarrow\infty a.s. Note that for d>3, the first three components have the same distribution as BM in three dimensions, and so it suffices to consider the case d=3.

Here, the correct choice of harmonic function is f(y)=\frac{1}{|y|}, so conclude as before that


From this, we can take a limit to see that

\mathbb{P}_x(T_\epsilon<\infty)\leq \frac{\epsilon}{|x|}

We deploy a neat trick to lift this result to a global statement about transience. Define the events that the modulus never returns to n after hitting n^3

A_n:=\{|B_t|>n\quad \forall t\geq T_{n^3}\}



Applying Borel-Cantelli 1, A_n eventually holds almost surely, which certainly implies the desired result.

Feller Processes and the Strong Markov Property

Markov Property

We go way back to the Part 1B short course Markov Chains. In the first lecture of this course, we met discrete time Markov Chains. A definition was given, in terms of conditional single period transition probabilities, and it was immediately proved that general transition probabilities are specified by sums of products of entries in a so-called transition matrix. This proves very useful for performing calculations. But the question will inevitably be asked: “Prove that this is a Markov process.” And the answer “because it obviously is” isn’t good enough.

The point is that all of the above is relevant to the setting, but the ideal definition of a Markov process is something like the very general statement:

Conditional on the present, the past and the future are independent.

This opens up the possibility of a large class of processes being Markov processes. A technical definition would be that for s<t and a measurable subset of the state space.

\mathbb{P}(X_t\in A|\mathcal{F}_s)=\mathbb{P}(X_t\in A|\sigma(X_s)).

It is easy to check that this is equivalent to the original definition in the that context.

Strong Markov Property

SMP states that given a stopping time T, conditional on the event \{T<\infty\}:

(X_{T+t}-X_T, t\geq 0)\stackrel{d}{=}(X_t^0,t\geq 0),

that is, the process started at time T has the same distribution as the original process started from 0 (in space as well as time). Furthermore, it is independent of \mathcal{F}_T, which requires technical definition, but which informally is the sigma field of events defined by what happens up to time T.

For a discrete time Markov chain, prove SMP by pre-multiplying by the indicator function 1(T=n), which reduces SMP to the normal Markov property. Then take the countable sum over (which is permissible) to get SMP. For Brownian Motion in one dimension, make dyadic approximations to the stopping time from above. SMP applies to these approximations, and measure theoretic machinery and the (almost sure) continuity of paths allows the equivalence of distributions to hold in the limit. Independence follows by expressing \mathcal{F}_T=\cap \mathcal{F}_{T_n} as the intersection of sigma fields corresponding to the approximations.

In both of these cases, an implicit countable substructure (discrete time and continuity respectively) have been required to deduce SMP. This suggests that there are Markov processes which do not have SMP.

Motivating Counterexample

Take B to be a Brownian Motion in one dimension, with B_0 a RV which contains 0 in its support. Now define the the process:

X_t=B_t1_{B_0\neq 0}.

Then X is certainly not Strong Markov, by considering the hitting time of 0. Then the process started there is either identically 0, or a standard BM, but which is determined by time 0 properties rather than time T properties.

But is Markov. Take s<t and A Borel. Then:

\mathbb{P}(X_t\in A|\mathcal{F}_s)=\mathbb{E}[1(X_t\in A\cap X_0\neq 0)+1(X_t\in A\cap X_0=0)|\mathcal{F}_s]

=1(X_0\neq 0)\int_A \frac{1}{\sqrt{2\pi(t-s)}}\exp(-\frac{(X_s-y)^2}{2(t-s)})dy+1(X_0=0)1(0\in A)

1(X_s\neq 0)\int_A(\ldots)dy + 1(X_s=0,X_0\neq 0)\int_A(\ldots)dy + 1(0\in A)[1(X_s=0)-1(X_s=0,X_0\neq 0)]

Now 1(X_s=0, X_0\neq 0)=0 a.s. so

= \mathbb{E}[1(X_t\in A)|X_s], which is precisely the Markov property.

Feller Property

In general, it hard to verify the Strong Markov Property for a given stochastic process. Instead, we consider a property which is stronger, but easier to check. Continue reading