Skorohod embedding

Background

Suppose we are given a standard Brownian motion (B_t), and a stopping time T. Then, so long as T satisfies one of the regularity conditions under which the Optional Stopping Theorem applies, we know that \mathbb{E}[B_T]=0. (See here for a less formal introduction to OST.) Furthermore, since B_t^2-t is a martingale, \mathbb{E}[B_T^2]=\mathbb{E}[T], so if the latter is finite, so is the former.

Now, using the strong Markov property of Brownian motion, we can come up with a sequence of stopping times 0=T_0, T_1, T_2,\ldots such that the increments T_k-T_{k-1} are IID with the same distribution as T. Then 0,B_{T_1},B_{T_2},\ldots is a centered random walk. By taking T to be the hitting time of \{-1,+1\}, it is easy to see that we can embed simple random walk in a Brownian motion using this approach.

Figure: Embedding simple random walk in Brownian motion.
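To make the embedding concrete, here is a minimal simulation sketch. It assumes we approximate the Brownian motion by Gaussian increments on a grid of mesh dt (so the exit from (-1,+1) is only detected at grid times and slightly overshoots); the function name and parameters are mine, chosen for illustration.

```python
import random


def embed_simple_random_walk(n_steps, dt=1e-3, seed=0):
    """Approximate 0, B_{T_1}, B_{T_2}, ... where T_k is the first time after
    T_{k-1} that the Brownian motion has moved by +1 or -1.

    The Brownian motion is replaced by a Gaussian random walk with time step
    dt, which is an assumption of this sketch rather than an exact construction.
    """
    rng = random.Random(seed)
    sd = dt ** 0.5
    walk, times = [0], [0.0]
    t = 0.0
    for _ in range(n_steps):
        offset = 0.0  # B_t - B_{T_{k-1}}, restarted by the strong Markov property
        while abs(offset) < 1.0:
            offset += rng.gauss(0.0, sd)
            t += dt
        # Record which of the two levels was hit, ignoring the small overshoot.
        walk.append(walk[-1] + (1 if offset > 0 else -1))
        times.append(t)
    return walk, times
```

The returned walk moves by exactly one unit per step by construction, and the gaps between successive entries of times are (approximately) IID copies of the exit time of (-1,+1), which has mean 1.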

The Skorohod embedding question asks: can all centered random walks be constructed in this fashion, by stopping Brownian motion at a sequence of stopping times? By the strong Markov property, this immediately reduces to the question of whether every centered finite-variance distribution X can be expressed as B_T for some integrable stopping time T.

The answer to this question is yes, and much of what follows is drawn from, or at least prompted by, Obloj’s survey paper, which details the problem and the rich history of the many approaches to its solution over the past seventy years.

Applications and related things

The relationship between random walks and Brownian motion is a rich one. Donsker’s invariance principle asserts that Brownian motion appears as the scaling limit of a random walk. Indeed, one can construct Brownian motion itself as the limit of a sequence of consistent random walks with normal increments on an increasingly dense set of times. Furthermore, random walks are martingales, and we know that continuous, local martingales can be expressed as a (stochastically) time-changed Brownian motion, from the Dubins-Schwarz theorem.

The Skorohod embedding theorem can be used to prove results about random walks with general distribution by proving the corresponding result for Brownian motion, and checking that the construction of the sequence of stopping times has the right properties to allow the result to be carried back to the original setting. It obviously also gives a coupling between an individual random walk and a Brownian motion which may be useful in some contexts, as well as a coupling between any pair of random walks. This is useful in proving results for random walks which are much easier for special cases of the distribution, for example when the increments are Gaussian, or when there are combinatorial approaches to a problem about simple random walk. At the moment no aspect of this blog schedule is guaranteed, but I plan to talk about the law of the iterated logarithm shortly, whose proof is approachable in both of these settings, as well as for Brownian motion, and Skorohod embedding provides the route to the general proof.

At the end, we will briefly compare some other ways to couple a random walk and a Brownian motion.

Adding extra randomness

One thing we could do is sample a copy of X independently from the Brownian motion, then declare T= \tau_{X}:= \inf\{t\ge 0: B_t=X\}, the hitting time of (random value) X. But recall that unfortunately \tau_x has infinite expectation for all non-zero x, so this doesn’t fit the conditions required to use OST.

Skorohod’s original method is described in Section 3.1 of Obloj’s notes linked above. The method is roughly to pair up positive values taken by X appropriately with negative values taken by X in a clever way. If we have a positive value b and a negative value a, then \tau_{a,b}, the first hitting time of \mathbb{R}\backslash (a,b) is integrable. Then we choose one of these positive-negative pairs according to the projection of the distribution of X onto the pairings, and let T be the hitting time of this pair of values. The probability of hitting b conditional on hitting {a,b} is easy to compute (it’s \frac{-a}{b-a}) so we need to have chosen our pairs so that the ‘probability’ of hitting b (ie the density) comes out right. In particular, this method has to start from continuous distributions X, and treat atoms in the distribution of X separately.

The case where the distribution X is symmetric (that is X\stackrel{d}=-X) is particularly clear, as then the pairs should be (-x,x).

However, it feels like there is enough randomness in Brownian motion already, and subsequent authors showed that indeed it wasn’t necessary to introduce extra randomness to provide a solution.

One might ask whether it’s possible to generate the distribution on the set of pairs (as above) out of the Brownian motion itself, but independently from all the hitting times. It feels like it might be possible to make the distribution on the pairs measurable with respect to

\mathcal{F}_{0+} = \bigcap\limits_{t>0} \mathcal{F}_t,

the sigma-algebra of events determined by limiting behaviour as t\rightarrow 0 (which is independent of hitting times). But of course, unfortunately \mathcal{F}_{0+} has a zero-one law, so it’s not possible to embed non-trivial distributions there.

Dubins’ solution

The exemplar for solutions without extra randomness is due to Dubins, shortly after Skorohod’s original argument. The idea is to express the distribution X as the almost sure limit of a martingale. We first use the hitting time of a pair of points to ‘decide’ whether we will end up positive or negative, and then given this information look at the hitting time (after this first time) of two subsequent points to ‘decide’ which of four regions of the real interval we end up in.

I’m going to use different notation to Obloj, corresponding more closely with how I ended up thinking about this method. We let

a_+:= \mathbb{E}[X \,|\, X>0], \quad a_- := \mathbb{E}[X\,|\, X<0], (*)

and take T_1 = \tau_{\{a_-,a_+\}}. We need to check that

\mathbb{P}\left( B_{T_1}=a_+\right) = \mathbb{P}\left(X>0\right),

for this to have a chance of working. But we know that

\mathbb{P}\left( B_{T_1}=a_+\right) = \frac{-a_-}{a_+-a_-},

and we can also attack the other side using (*) and the fact that \mathbb{E}[X]=0, using the law of total expectation (assuming for simplicity that X has no atom at zero):

0=\mathbb{E}[X]=\mathbb{E}[X\,|\, X>0] \mathbb{P}(X>0) + \mathbb{E}[X\,|\,X<0]\mathbb{P}(X<0) = a_+ \mathbb{P}(X>0) + a_- \left(1-\mathbb{P}(X>0) \right),

\Rightarrow\quad \mathbb{P}(X>0)=\frac{-a_-}{a_+-a_-}.
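As a sanity check on this first step, here is a small exact computation. It assumes X has finite support with no atom at zero, and compares \mathbb{P}(X>0) with the probability -a_-/(a_+-a_-) that Brownian motion started from 0 leaves (a_-,a_+) at the top, which is the formula \frac{-a}{b-a} quoted earlier. The example law is my own choice.

```python
from fractions import Fraction


def first_dubins_step(law):
    """law maps each value of X to its probability; X is centered, no atom at 0.

    Returns (a_plus, a_minus, P(X > 0), exit-at-top probability).
    """
    p_pos = sum(p for x, p in law.items() if x > 0)
    p_neg = sum(p for x, p in law.items() if x < 0)
    a_plus = sum(x * p for x, p in law.items() if x > 0) / p_pos
    a_minus = sum(x * p for x, p in law.items() if x < 0) / p_neg
    # Probability that Brownian motion from 0 hits a_plus before a_minus.
    exit_up = -a_minus / (a_plus - a_minus)
    return a_plus, a_minus, p_pos, exit_up


# A centered example: -3/6 - 1/2 + 3/3 = 0.
example_law = {
    Fraction(-3): Fraction(1, 6),
    Fraction(-1): Fraction(1, 2),
    Fraction(3): Fraction(1, 3),
}
a_plus, a_minus, p_pos, exit_up = first_dubins_step(example_law)
```

For this example a_+ = 3 and a_- = -3/2, and both probabilities come out as 1/3.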

Now we define

a_{++}=\mathbb{E}[X \,|\, X>a_+],\quad a_{+-}=\mathbb{E}[X\,|\, 0<X<a_+],

and similarly a_{-+},a_{--}. So then, conditional on B_{T_1}=a_+, we take

T_2:= \inf\left\{ t\ge T_1 \,:\, B_t\not\in (a_{+-},a_{++})  \right\},

and similarly conditional on B_{T_1}=a_-. By an identical argument to the one we have just deployed, we have \mathbb{E}\left[B_{T_2} \,|\,\mathcal{F}_{T_1} \right] = B_{T_1} almost surely. So, although the a_{+-+} notation now starts to get very unwieldy, it’s clear we can keep going in this way to get a sequence of stopping times 0=T_0,T_1,T_2,\ldots where B_{T_n} determines which of the 2^n regions of the real line any limit \lim_{m\rightarrow\infty} B_{T_m} should lie in.

A bit of work is required to check that the almost sure limit T_n\rightarrow T is almost surely finite, but once we have this, it is clear that B_{T_n}\rightarrow B_T almost surely, and B_T has the distribution required.
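The repeated splitting is easy to carry out exactly when X has finite support. The sketch below is my own illustration, not Dubins’ or Obloj’s formulation: it assumes no atom of X ever coincides with one of the conditional means (otherwise it gives up), and it returns the law of B_{T_n}, namely the conditional mean of each of the (at most) 2^n cells together with the mass of that cell.

```python
from fractions import Fraction


def dubins_level(law, n):
    """Law of B_{T_n} for a finitely supported, centered law of X.

    law maps values to probabilities. Each cell of atoms is split at its
    conditional mean; a cell containing a single atom is left alone.
    """
    cells = [sorted(law.items())]
    for _ in range(n):
        next_cells = []
        for cell in cells:
            if len(cell) == 1:
                next_cells.append(cell)
                continue
            mass = sum(p for _, p in cell)
            mean = sum(x * p for x, p in cell) / mass
            if any(x == mean for x, _ in cell):
                raise ValueError("atom at a conditional mean: not handled here")
            next_cells.append([(x, p) for x, p in cell if x < mean])
            next_cells.append([(x, p) for x, p in cell if x > mean])
        cells = next_cells
    out = {}
    for cell in cells:
        mass = sum(p for _, p in cell)
        mean = sum(x * p for x, p in cell) / mass
        out[mean] = out.get(mean, 0) + mass
    return out


example_law = {
    Fraction(-3): Fraction(1, 6),
    Fraction(-1): Fraction(1, 2),
    Fraction(2): Fraction(1, 6),
    Fraction(4): Fraction(1, 6),
}
```

With this example, level 0 is a point mass at 0, level 1 puts mass 2/3 at -3/2 and mass 1/3 at 3, and level 2 already recovers the law of X. Every level has mean zero, as the martingale property demands.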

Komlos, Major, Tusnady coupling

We want to know how close we can make this coupling between a centered random walk with variance 1, and a standard Brownian motion. Here, ‘close’ means uniformly close in probability. For large times, the typical difference between one of the stopping times T_k in the Skorohod embedding and its expectation (recall \mathbb{E}[T_k]=k) is of order \sqrt{n} when k is of order n. Since Brownian motion typically moves a distance of order \sqrt{s} over a time interval of length s, constructing the random walk S_0,S_1,S_2,\ldots from the Brownian motion via Skorohod embedding leads to a typical discrepancy

\left |S_k - B_k \right| = \Theta(n^{1/4}),

for most values of k\le n. Strassen (1966) shows that the true scale of the maximum

\max_{k\le n} \left| S_k - B_k \right|

is slightly larger than this, with some extra powers of \log n and \log\log n as one would expect.

The Komlos-Major-Tusnady coupling is a way to do a lot better than this, in the setting where the distribution of the increments has a finite MGF near 0. Then, there exists a coupling of the random walk and the Brownian motion such that

\max_{k\le n}\left|S_k- B_k\right| = O(\log n).

That is, there exists C such that

\left[\max_{k\le n} \left |S_k-B_k\right| - C\log n\right] \vee 0

is a tight family of distributions, indeed with uniform exponential tail. To avoid digressing infinitely far from my original plan to discuss the proof of the law of iterated logarithm for general distributions, I’ll stop here. I found it hard to find much coverage of the KMT result apart from the challenging original paper, and many versions expressed in the language of empirical processes, which are similar to random walks in many ways relevant to convergence and this coupling, but not for Skorohod embedding. So, here is a link to some slides from a talk by Chatterjee which I found helpful in getting a sense of the history, and some of the modern approaches to this type of normal approximation problem.

Fair games and the martingale strategy III

Gambler’s Ruin

Continuing directly from the previous post, the nicest example of the optional stopping theorem we developed there is to examine a simple random walk constrained between two values, say 0 and N. This represents an idealised gambling situation, where the gambler stops playing either when they reach some pre-agreed profit, or when they go bankrupt. We assume that we start at level k, for k = 1,2,…,N-1.

Naturally, we want to know the probabilities of winning (ie getting to N) and losing (ie going bankrupt). We could set this up by conditioning on the first step. Let p_k be the probability of winning starting from level k. Then we must have

p_k= \frac12 p_{k+1}+\frac12 p_{k-1},\quad k=1,\ldots,N-1, (*)

with the obvious boundary conditions p_0=0, p_N=1. In an ideal world, we just know how to solve second order difference equations like (*). Well, actually it isn’t too hard, because we can see from (*) directly that

p_{k+1}-p_k = p_k-p_{k-1},

and so p_k is a linear function of k, and so p_k = k/N follows pretty much immediately.

But, we can also use OST profitably. Let T be the time at which we first hit 0 or N. It’s intuitively clear that this should have finite expectation, since the problems you might encounter with just the hitting time of a single level shouldn’t apply. Or you can consider the expected number of steps before you see N ups or downs in a row, which certainly provides an upper bound on T. This random number of steps is sort of geometric (at least, can be upper bounded by a geometric RV) and so has finite expectation. So we can apply OST to X at T, and we have

\mathbb{E}[X_T] = N\cdot \mathbb{P}(X_T=N) + 0 \cdot \mathbb{P}(X_T=0) = \mathbb{E}[X_0]=k,

from which we also derive p_k=k/N.
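If you’d rather not trust either argument, the recurrence (*) can be solved exactly by machine. This is a minimal sketch using a shooting method of my own choosing: fix p_0=0, guess p_1=1, extend using p_{k+1}=2p_k-p_{k-1}, and then rescale so that p_N=1.

```python
from fractions import Fraction


def win_probabilities(N):
    """Solve p_k = (p_{k+1} + p_{k-1}) / 2 with p_0 = 0 and p_N = 1 exactly."""
    trial = [Fraction(0), Fraction(1)]  # p_0 = 0 and a trial value for p_1
    for _ in range(N - 1):
        # Rearranged recurrence: p_{k+1} = 2 p_k - p_{k-1}.
        trial.append(2 * trial[-1] - trial[-2])
    # The equations are linear, so rescaling enforces the boundary value p_N = 1.
    return [p / trial[N] for p in trial]
```

The output agrees with p_k = k/N, and hence with the OST identity N\cdot p_k = k.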

The name gambler’s ruin comes from considering the limit N\rightarrow\infty with k fixed. After a moment’s thought, it’s clear we can’t really talk about stopping the process when we hit infinity, since that won’t happen at any finite time. But we can ask what’s the probability that we eventually hit zero. Then, if we imagine a barrier at level N, the probability that we hit 0 at some point is bounded below by the probability that we hit 0 before we hit level N (given that we know we hit either zero or level N with probability one), and this is \frac{N-k}{N}, and by choosing N large enough, we can make this as close to 1 as we want. So the only consistent option is that the probability of hitting 0 at some point is one. Hence gambler’s ruin. With probability one, ruin will occur. There’s probably a moral lesson hiding there not especially subtly.

A problem about pricing options

So the deal here seems to be that if you just care about your average, it doesn’t matter how you choose to play a sequence of fair games. But what if you care about something other than your average? In any real setting, we maybe care about slightly more than this. Suppose I offer you a bet on a coin toss: you get £3 if it comes up heads, and I get £1 if it comes up tails. Sounds like a good bet, since on average you gain a pound. But what about if you get £10,003 if it comes up heads and I get £10,001 if it comes up tails? I’m guessing you’re probably not quite so keen now.

But if you were an international bank, you might have fewer reservations about the second option. My intention is not to discuss whether our valuation of money is linear here, but merely to offer motivation for the financial option I’m about to propose. The point is that we are generally risk-averse (well, most of us, most of the time) and so we are scared of possible large losses, even when there is the possibility of large profits to balance it out.

Let’s assume we have our simple random walk, and for definiteness let’s say it starts at £1. Suppose (eg as a very niche birthday present) we have the following opportunity: at any point between now and time t=5, we have the right to buy one unit of the stock for £2.

We want to work out how much this opportunity, which from now on I’m going to call an option, is worth on average. Note that now it does seem that when we choose to cash in the option will have an effect on our return, and so we will have to include this in the analysis.

Note that, once we’ve bought a unit of the stock, we have an asset which is following a simple random walk (ie sequential fair games) and so from this point on its expected value remains unchanged. So in terms of expectation, we might as well sell the stock at the same moment we buy it. So if we cash in the option when the stock is currently worth £X, we will on average have a return of £(X-2). This means that we’ll only ever consider exercising our option if the current value of the stock is greater than £2. This narrows down our strategy slightly.

This sort of option minimises the risk of a large loss, since the worst thing that happens is that you never choose to exercise your option. So if you actually paid for the right to have this option, that cost is the largest amount you can lose. In the trading world, this type of opportunity is called an American option.

The trick here is to work backwards in time, thinking about strategies. If at time t=4, the stock is worth £1, then the best that can happen is that it’s worth £2 at time t=5, and this still gains you no wealth overall. Similarly if it’s worth £0 at time t=3. So we’ve identified a region where, if the stock value enters this region, we might as well rip up our contract, because we definitely aren’t going to gain anything. Remember now that we’ve also said you won’t ever cash in if the stock’s value is at most £2, because you don’t gain anything on average.

Now suppose that the stock has value £3 at time t=4. There’s no danger of it ever getting back below £2 during the lifetime of the option, so from now on your potential return is following the trajectory of a simple random walk, ie a fair game. So on average, it makes no difference whether you cash in now, or wait until t=5, or some combination of the two. The same argument holds if the stock has value £4 at time t=3 or time t=4, and so we can identify a region where you might as well cash in.

Figure: American Option 1

What about the final region? If the stock value is greater than £2, but not yet in the definitely-cash-in area, what should you do? Well, if you think about it, the value of the stock is a fair game. But your return should be better than that, because the stock price doesn’t take account of the fact that you wouldn’t buy in (and make a loss overall) if the value drops below £2. So at this stage, your future options are better than playing a fair game, and so it doesn’t make sense (in terms of maximising your *average*) to cash in.

Now we can actually work backwards in time to establish how much any starting value is worth under this optimal strategy. We can fill in the values in the ‘doomed’ area (ie all zeros) and on the ‘cash in now’ area (ie current value minus 2), and construct backwards using the fact that we have a random walk.

Figure: American Option 2

The final answer ends up being 7/16 if the stock had value £1 at time 0. Note that the main point here is that working out the qualitative form of the strategy was the non-trivial part. Once we’d done that, everything was fairly straightforward. I claim that this was a reasonably fun adjustment to the original problem, but have minimal idea whether pricing options is in general an interesting thing to do.
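The backwards construction is short enough to write out as code. This is a sketch under the assumptions I have been using: the stock moves up or down by £1 with probability ½ at each integer time, is allowed to go negative, and the option to buy at £2 expires at t=5. The names STRIKE, EXPIRY and option_value are just labels for this illustration.

```python
from fractions import Fraction
from functools import lru_cache

STRIKE = 2   # price at which the option lets us buy the stock
EXPIRY = 5   # last time at which the option may be exercised


@lru_cache(maxsize=None)
def option_value(t, price):
    """Expected return under the optimal strategy, given the stock price at time t."""
    exercise_now = Fraction(max(price - STRIKE, 0))
    if t == EXPIRY:
        return exercise_now
    # Value of waiting: average over the two equally likely next prices.
    wait = (option_value(t + 1, price + 1) + option_value(t + 1, price - 1)) / 2
    return max(exercise_now, wait)
```

Calling option_value(0, 1) reproduces the value 7/16, and intermediate values such as option_value(4, 3) = 1 match the ‘cash in now’ region, where waiting and exercising are equally good.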

Anyway, I hope that provided an interesting overview to some of the topics of interest within the question of how to choose strategies for games based on random processes.

Fair games and the martingale strategy II

Optional Stopping

We continue directly from the end of the last post, where I was talking about how to play sequences of fair games, and whether by playing cunningly (including choosing when to stop playing cunningly) you can end up with an ‘unfair’ game overall. (Ie where you make a profit or a loss on average.) We gave two examples. First, the martingale strategy, where on a sequence of fair games you double your stake each time you lose. The result is that you win back your original stake at some point with probability one, but possibly accumulate huge temporary losses along the way. In the second game, you follow the path of a simple random walk from zero until it hits one, and then cash in. Here we observe that the time until this happens is almost surely finite, but has infinite expectation.

There’s another possible problem. It seems ridiculous, but suppose we could look into the future. Then our strategy for the random walk might be something like: check in advance what will happen in the first ten steps, and stop playing whenever we get to the moment which we know is the maximum value the walk will attain. Well then, sometimes the walk will never go above zero, in which case we will stop playing at the very start, and sometimes the walk will go above zero, in which case we make a positive amount. So overall, our mean return must be positive. Obviously if we have the option to adjust our stakes, this is completely ridiculous, because we would bet high (infinitely high?) if we knew we were about to win, and nothing if we were about to lose. So, obvious though it seems, we should emphasise that we mustn’t be allowed to look into the future!

The optional stopping theorem says that looking into the future, and these two problems already mentioned are essentially all that can go wrong. To say anything more interesting, at this point we really do need a little bit of notation.

In general, a sequence of fair games of this kind is called a martingale. The origin of the word is fairly unclear – see this unexpectedly comprehensive article. The martingale will be something like X_0,X_1,X_2,\ldots, representing the wealth (or whatever) at discrete time-steps. The key property is the fair game property, which says that whatever has happened up to time k, the next game is fair. That is:

\mathbb{E}[X_{k+1}-X_k \,|\,\text{any event involving }X_0,\ldots,X_k] = 0. (*)

Note that in any of the situations we are describing, X should describe our wealth, rather than the underlying process. In the random walk example, these are the same, but in the martingale strategy suggestion, X_k should be our wealth after the kth game, rather than anything directly recording the outcomes of the coin tosses.

If we allow X_0 to be random (and of course, being always equal to zero is a special case of being random…) we can then add up an initial sequence of such equations to obtain

\mathbb{E}[X_k]=\mathbb{E}[X_k-X_{k-1}] + \ldots + \mathbb{E}[X_1-X_0] + \mathbb{E}[X_0]=\mathbb{E}[X_0]. (**)

That is, if we play any sequence of fair games a fixed finite number of times, we have overall a fair game. (In the original strategy, we have a martingale, irrespective of the precise rule we use to choose how much we invest on each coin toss.) But what if we stop the process at a time determined by the current knowledge? (ie without looking into the future.)

Let’s call an example of such a random time T, and this property that we aren’t allowed to look into the future is described technically as the stopping time property. A proper setup would use more notation and fewer words at (*), but even without sigma-algebras, we can say that T is a stopping time if deciding whether T=k depends only on X_0,X_1,\ldots,X_k, and not on later actions.

Informal Proof

To show the optional stopping theorem, the key idea is that if you want to stop at time T, one option is to keep playing beyond time T with zero stakes. Thus we have a fair game at all times, even after T. We write this as X_{T\wedge k}, where \wedge means ‘minimum’, so that if k>T, the process stays constant.

Since X_{T\wedge k} is a martingale, we can invoke (**),

\mathbb{E}[X_{T\wedge k}] = \mathbb{E}[X_0].
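We can watch this identity hold in a concrete case. In the sketch below I take X to be simple random walk and T to be the first time it hits 0 or N (a choice made purely for illustration), and compute the law of X_{T\wedge k} exactly by pushing probability mass forward one step at a time.

```python
from fractions import Fraction


def stopped_walk_law(start, N, k):
    """Law of X_{min(T, k)} for simple random walk from start, T = hitting time of {0, N}."""
    law = {start: Fraction(1)}
    for _ in range(k):
        new_law = {}
        for x, p in law.items():
            if x in (0, N):
                # Already stopped: keep playing with zero stakes.
                new_law[x] = new_law.get(x, 0) + p
            else:
                for y in (x - 1, x + 1):
                    new_law[y] = new_law.get(y, 0) + p / 2
        law = new_law
    return law


def mean(law):
    return sum(x * p for x, p in law.items())
```

For example, mean(stopped_walk_law(3, 10, k)) is exactly 3 for every k, even though the law itself keeps changing as more mass gets absorbed at 0 and 10.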

Now what happens if we take k to be very large? How well does this truncated average approximate \mathbb{E}[X_T] itself?

This is where we want to return to our assumptions about what might make this go wrong. Let’s say that T has finite expectation, and that there is some absolute bound on how large an increment can be, say C. Then, whenever T\le k, we have X_T=X_{T\wedge k}. And when T>k, we have

|X_T - X_{T\wedge k}| = |X_T-X_k| \le C(T-k).

Therefore

|\mathbb{E}[X_T]-\mathbb{E}[X_0]|= |\mathbb{E}[X_T] - \mathbb{E}[X_{T\wedge k}] | \le C \mathbb{E}[(T-k)\vee 0], (***)

where we take the final expectation only across T-k when this quantity is positive, since this is the only case which contributes to the left hand side.

Now we need to show that by choosing k large enough, we can make the RHS very small. Obviously we don’t have a chance of doing this if C is not finite! With a bit of thought, we can see that \mathbb{E}[(T-k)\vee 0]\ge \mathbb{E}[T] - k, and so we also don’t have a chance of doing this if \mathbb{E}[T]=\infty. But if \mathbb{E}[T]<\infty, then \sum_{\ell\ge 1} \ell \mathbb{P}(T=\ell) <\infty, and so

\sum_{\ell \ge k} \ell \mathbb{P}(T=\ell)\rightarrow 0,\quad \text{as }k\rightarrow\infty,

and so certainly

\mathbb{E}[(T-k)\vee 0] = \sum_{\ell \ge k}(\ell -k)\mathbb{P}(T=\ell) \rightarrow 0.

But (***) holds for all values of k, and so the only consistent option is that

\mathbb{E}[X_T]=\mathbb{E}[X_0].

There are a couple more combinations of conditions (mostly involving relaxing one of these slightly, and substantially strengthening the other) which also work, but this seems like the more natural form. For a full formal statement, there are many resources available, and the Wikipedia page, for example, isn’t too bad. In the mists of history, I wrote about some of these topics more formally, but maybe less helpfully, since I’d known the theory myself for about a week.

Fair games and the martingale strategy I

I went back to my school a couple of weeks ago and gave a talk. I felt I’d given various incarnations of a talk on card-shuffling too many times, so it was time for a new topic. The following post (and time allowing, one or two more) is pretty much what I said.

The Martingale Strategy

Suppose we bet repeatedly on the outcome of tossing a fair coin. Since it’s November, my heart is set on buying an ice cream that costs £1, so my aim is to win this amount from our game. My strategy is this:

First, I bet £1. If I win, then that’s great, because I now have made exactly enough profit to buy the ice cream. If I lose, then I play again, and this time I bet £2. Again, if I win, then my total profit is £2-£1 = £1, so I stop playing and buy the ice cream. If I lose, then I play a third time, again doubling my stake. So if I win for the first time on the seventh go, my overall profit will be

£64 – (£1+£2+£4+£8+£16+£32) = £1,

and it’s clear that this can be continued and I will eventually win a round, and at this point my total profit will be £1. So I will always eventually be able to buy my ice cream.

But, there’s nothing special about the value £1, so I could replace the words ‘ice cream’ with ‘private tropical island’, so why am I still here in the UK on a wet Monday when I could be on my beach lounger?

There are some fairly obvious reasons why the strategy I’ve described is not actually a fail-safe way to make a profit. For a start, although with probability one a head will come up eventually, there is a small positive chance that the first 200 tosses will all be tails. At this point, I would have accrued a debt of roughly 2^{200}\approx 1.6\times 10^{60} pounds, and this is comfortably more than the number of atoms in the Earth. All this for an ice cream?

So there are major problems carrying out this strategy in a finite world. And of course, it’s no good if we stop after a very large but finite number of turns, because then there’s always this very small chance that we’ve made a very large loss, which is bad, partly because we can’t have the ice cream, but also because it exactly cancels out the chance of making our £1 profit, and so our overall average profit is exactly zero.
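The cancellation is exact, and can be checked with a few lines of arithmetic. The sketch below assumes we follow the doubling strategy but give up after n_rounds consecutive losses; the function name is mine.

```python
from fractions import Fraction


def truncated_doubling_law(n_rounds):
    """Profit distribution when we stop at the first win, or after n_rounds losses."""
    law = {}
    lost_so_far = 0
    for r in range(1, n_rounds + 1):
        stake = 2 ** (r - 1)
        prob_first_win_here = Fraction(1, 2 ** r)  # r - 1 losses, then a win
        profit = stake - lost_so_far               # always equal to 1
        law[profit] = law.get(profit, 0) + prob_first_win_here
        lost_so_far += stake
    # The remaining case: every one of the n_rounds tosses was a loss.
    law[-lost_so_far] = Fraction(1, 2 ** n_rounds)
    return law
```

With n_rounds = 7 we win £1 with probability 127/128 and lose £127 with probability 1/128, so the average profit is exactly zero, and the same holds for any finite cut-off.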

Though I’ve set this up in an intentionally glib fashion, as so often is the case, we might have stumbled across an interesting mathematical idea. That is, if we play a fair game a finite number of times, we have a fair game overall, meaning our overall average profit is zero. But if we are allowed to play a potentially infinite number of times, then it’s not clear how to define our overall ‘average’ profit, since we feel it ought to be zero, as an extension of the finite case, but also might be positive, because it ends up being £1 with probability one.

It’s tempting at this stage to start writing statements like

1 \times 1 + (-\infty) \times 0=0 ,

to justify why this might have come about, where we consider the infinitely unlikely event that is infinitely costly. But this is only convincing at the most superficial level, and so it makes more sense to think a bit more carefully about under exactly what circumstances we can extend our observation about the overall fairness of a finite sequence of individual fair games.

A second example

The previous example was based upon a series of coin tosses, and we can use exactly the same source of randomness to produce a simple random walk. This is a process that goes up or down by 1 in each time step, where each option happens with probability ½, independently of the history.

We could avoid the requirement to deal with very large bets by always staking £1, and then cashing in the first time we have a profit of £1. Then, if we start the random walk at zero, it models our profit, and we stop the first time it gets to 1. It’s not obvious whether we hit 1 with probability one. Let’s show this.

In order to hit some positive value k, the random walk must pass through 1, 2, and so on, up to (k-1) and then finally k. So \mathbb{P}(\text{hit }k) = [\mathbb{P}(\text{hit 1})]^k. And similarly for negative values. Also, the probability that we return to zero is the same as the probability that we ever hit 1, since after one time-step they are literally the same problem (after symmetry). So, if the probability of hitting 1 is p<1, then after each visit to zero we return again with probability p, so the number of visits to zero is geometric (supported on 1,2,3,…) with mean 1/(1-p), and so

\mathbb{E}[\text{visits to }k] = \mathbb{E}[\text{visits to zero}] \times \mathbb{P}(\text{hit }k)=\frac{1}{1-p} \times p^{|k|}.

Thus, when we sum over all values of k, we are summing a pair of geometric series with common ratio p<1, and so we get a finite answer. But if the expected number of visits to anywhere (ie the sum across all places) is finite, this is clearly ridiculous, since we are running the process for an infinite time, and at each time-step we must be somewhere! So we must in fact have p=1, and thus another potential counter-example to the claim that a sequence of fair games is always fair overall.

We might have exactly the same set of practical objections, such as this method requiring arbitrarily large liquidity (even though it doesn’t grow exponentially fast so doesn’t seem so bad).

What will actually turn out to be useful is that although the bets are now small, the average time until we hit 1 is actually infinite. Remember that, even though most things we see in real life don’t have this property, it is completely possible for a random variable to take finite values yet have infinite expectation.
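Both facts can be seen numerically. The sketch below computes \mathbb{P}(T>n) exactly, where T is the first time the walk from zero hits 1, by tracking only the paths that have not yet reached 1. The truncated mean \mathbb{E}[\min(T,n)] is then the sum of the first n of these survival probabilities. The function names are mine.

```python
from fractions import Fraction


def survival_probabilities(n_max):
    """Return [P(T > 0), ..., P(T > n_max)], T = first hitting time of 1 from 0."""
    alive = {0: Fraction(1)}  # law of the walk restricted to paths that have not hit 1
    survival = [Fraction(1)]
    for _ in range(n_max):
        new_alive = {}
        for x, p in alive.items():
            for y in (x - 1, x + 1):
                if y < 1:  # paths that step up to 1 are removed
                    new_alive[y] = new_alive.get(y, 0) + p / 2
        alive = new_alive
        survival.append(sum(alive.values(), Fraction(0)))
    return survival


def truncated_mean(survival, n):
    """E[min(T, n)] = P(T > 0) + ... + P(T > n - 1)."""
    return sum(survival[:n])
```

The survival probabilities decrease towards zero (so we do hit 1 eventually), but slowly enough that the truncated means keep growing as n increases, roughly doubling each time n is multiplied by four.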

Notes on the Martingale Strategy

There’s no reason why the originally proposed strategy had to be based upon fair coin tosses. This strategy might work in a more general setting, where the chance of winning on a given turn is not ½, or is not even constant. So long as at each stage you bet exactly enough that, if you win, you recoup all your losses so far, and one extra pound, this has the same overall effect.

Of course, we need to check that we do eventually win a round, which is not guaranteed if the probability of winning (conditional on not having yet won) decays sufficiently fast. If we let p_k be the probability of winning on turn k, given that we haven’t previously won, then we require that the probability of never winning \prod_{k\ge 1}(1-p_k)=0. By taking logs and taking care of the approximations, it can be seen that the divergence or otherwise of \sum p_k determines which way this falls.
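Two examples of my own choosing illustrate the dichotomy. With p_k = 1/(k+1) the sum diverges and the probability of still not having won decays to zero, whereas with p_k = 1/(k+1)^2 the sum converges and the product stays bounded away from zero.

```python
from fractions import Fraction


def prob_no_win_yet(p, n):
    """Probability of losing each of the first n rounds, given the function k -> p_k."""
    out = Fraction(1)
    for k in range(1, n + 1):
        out *= 1 - p(k)
    return out


def harmonic(k):
    return Fraction(1, k + 1)          # sum of p_k diverges


def inverse_square(k):
    return Fraction(1, (k + 1) ** 2)   # sum of p_k converges
```

Both products telescope: the first equals 1/(n+1), which tends to zero, while the second equals (n+2)/(2(n+1)), which tends to 1/2, so in the second case there is a 50% chance that we never win at all.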

In the next post, we’ll talk about how the two problems encountered here, namely allowing large increments, and considering a stopping time with infinite expectation are exactly the two cases where something can go wrong. We’ll also talk about a slightly different setting, where the choice of when to stop playing becomes a bit more dynamic and complicated.

When is a Markov chain a Markov chain?

I’ve been taking tutorials on the third quarter of the second-year probability course, in which the students have met discrete-time Markov chains for the first time. The hardest aspect of this introduction (apart from the rapid pace – they cover only slightly less material than I did in Cambridge, but in half the time) is, in my opinion, choosing which definition of the Markov property is most appropriate to use in a given setting.

We have the wordy “conditional on the present, the future is independent of the past”, which is probably too vague for any precise application. Then you can ask more formally that the transition probabilities are the same under two types of conditioning, that is conditioning on the whole history, and conditioning on just the current value

\mathbb{P}(X_{n+1}=i_{n+1} \,\big|\, X_n=i_n,\ldots,X_0=i_0) = \mathbb{P}(X_{n+1}=i_{n+1} \,\big |\, X_n=i_n), (*)

and furthermore this must hold for all sets of values (i_{n+1},\ldots,i_0) and if we want time-homogeneity (as is usually assumed at least implicitly when we use the word ‘chain’), then these expressions should be functions of (i_n,i_{n+1}) but not n.

Alternatively, one can define everything in terms of the probability of seeing a given path:

\mathbb{P}(X_0=i_0,\ldots,X_n=i_n)= \lambda_{i_0}p_{i_0,i_1}\ldots p_{i_{n-1}i_n},

where \lambda is the initial distribution, and the p_{i,j}s are the entries of the transition matrix P.
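In code, this definition is just a running product. The sketch below assumes a finite state space labelled 0,1,…, with the initial distribution as a list and the transition matrix as a list of rows; the two-state example is made up for illustration.

```python
from fractions import Fraction
from itertools import product


def path_probability(initial, P, path):
    """P(X_0 = i_0, ..., X_n = i_n) = lambda_{i_0} p_{i_0 i_1} ... p_{i_{n-1} i_n}."""
    prob = initial[path[0]]
    for i, j in zip(path, path[1:]):
        prob *= P[i][j]
    return prob


# A made-up two-state chain.
initial = [Fraction(1, 2), Fraction(1, 2)]
P = [
    [Fraction(1, 3), Fraction(2, 3)],
    [Fraction(1, 4), Fraction(3, 4)],
]
```

For instance, the path 0,1,1,0 has probability ½ × ⅔ × ¾ × ¼ = 1/16, and summing over all paths of a given length gives 1, as it must.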

Fortunately, these latter two definitions are equivalent, but it can be hard to know how to proceed when you’re asked to show that a given process is a Markov chain. I think this is partly because this is one of the rare examples of a concept that students meet, then immediately find it hard to think of any examples of similar processes which are not Markov chains. The only similar concept I can think of is that of vector spaces, which share this property mainly because almost everything in first-year mathematics is linear in some regard.

Non-examples of Markov chains

Anyway, during the tutorials I was asking for some suggestions of discrete-time processes on a countable or finite state space which are not Markov chains. Here are some things we came up with:

  • Consider a bag with a finite collection of marbles of various colours. Record the colours of marbles sampled repeatedly without replacement. Then the colour of the next marble depends on the set you’ve already seen, not on the current colour. And of course, the process terminates.
  • Non-backtracking random walk. Suppose you are on a graph where every vertex has degree at least 2, and in a step you move to an adjacent vertex, chosen uniformly among the neighbours, apart from the one from which you arrived.
  • In a more applied setting, it’s reasonable to assume that if we wanted to know the chance it will rain tomorrow, this will be informed by the weather over the past week (say) rather than just today.

Showing a process is a Markov chain

We often find Markov chains embedded in other processes, for example a sequence of IID random variables X_1,X_2,\ldots. Let’s consider the random walk S_n=\sum_{i=1}^n X_i, where each X_i =\pm 1 with probability p and (1-p). Define the running maximum M_n=\max_{m\le n}S_m, and then we are interested in Y_n:=M_n-S_n, which we claim is a Markov chain, and we will use this as an example for our recipe to show this in general.

We want to show (*) for the process Y_n. We start with the LHS of (*)

\mathbb{P}(Y_{n+1}=i_{n+1} \,\big|\, Y_n=i_n,\ldots,Y_0=i_0),

and then we rewrite Y_{n+1} as much as possible in terms of previous and current values of Y, and quantities which might be independent of previous values of Y. At this point it’s helpful to split into the cases i_n=0 and i_n\ne 0. We’ll treat the latter for now. Since S_n<M_n in this case, the maximum cannot change at the next step, so M_{n+1}=M_n and

Y_{n+1}=Y_n-X_{n+1},

so we rewrite as

=\mathbb{P}(X_{n+1}=i_n-i_{n+1} \, \big |\, Y_n=i_n,\ldots, Y_0=i_0),

noting that we substitute i_n for Y_n since that’s in the conditioning. But this is now ideal, since X_{n+1} is actually independent of everything in the conditioning. So we could get rid of all the conditioning. But we don’t really want to do that, because we want to have conditioning on Y_n left. So let’s get rid of everything except that:

=\mathbb{P}(X_{n+1}=i_n-i_{n+1}\, \big |\, Y_n=i_n).

Now we can exactly reverse all of the other steps to get back to

= \mathbb{P}(Y_{n+1}=i_{n+1} \,\big|\, Y_n=i_n),

which is exactly what we required.

The key idea is that we stuck to the definition in terms of Y, and held all the conditioning in terms of Y, since that is what actually determines the Markov property for Y, rearranging the event until it’s in terms of one of the underlying Xs, at which point it’s easy to use independence.
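For small n, the claim can also be checked by brute force. The sketch below enumerates every path of S up to time n+1, groups the paths by the history (Y_0,\ldots,Y_n), and checks that the conditional law of Y_{n+1} is the same for all histories sharing the same value of Y_n. The helper names transition_table and is_markov are hypothetical, and this is a sanity check for one n and one p rather than a proof.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def transition_table(n, p, observe):
    """Map each history (obs_0, ..., obs_n) to the conditional law of obs_{n+1}.

    `observe(s, m)` turns the walk's value s and running maximum m into the
    observed process, e.g. m - s for Y.
    """
    joint = defaultdict(lambda: defaultdict(Fraction))
    for steps in product((1, -1), repeat=n + 1):
        weight, s, m = Fraction(1), 0, 0
        obs = [observe(0, 0)]
        for x in steps:
            weight *= p if x == 1 else 1 - p
            s += x
            m = max(m, s)
            obs.append(observe(s, m))
        joint[tuple(obs[:-1])][obs[-1]] += weight
    table = {}
    for history, row in joint.items():
        total = sum(row.values())
        table[history] = {value: w / total for value, w in row.items()}
    return table


def is_markov(table):
    """True if the law of the next value depends on the history only via its last entry."""
    law_given_current = {}
    for history, law in table.items():
        if law_given_current.setdefault(history[-1], law) != law:
            return False
    return True


p = Fraction(1, 3)  # arbitrary illustrative value
print(is_markov(transition_table(4, p, lambda s, m: m - s)))
```

Using exact fractions rather than floats means that two conditional laws are compared exactly, so a failure would indicate a genuine dependence on the history rather than rounding.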

Showing a process is not a Markov chain

Let’s show that M_n is not a Markov chain. The classic mistake to make here is to talk about possible paths the random walk S could take, which is obviously relevant, but won’t give us a clear reason why M is not Markov. What we should instead do is suggest two paths taken by M, which have the same ‘current’ value, but induce different transition probabilities, because they place different restrictions on the possible paths taken by S.

[Figure: IsMaxMarkov]

In both diagrams, the red line indicates a possible path taken by (M_0,M_1,\ldots,M_4), and the blue lines show possible paths of S which could induce these.

In the left diagram, clearly there’s only one such path that S could take, and so we know immediately what happens next. Either X_5=+1 (with probability p) in which case M_5=S_5=3, otherwise it’s -1, in which case M_5=2.

In the right diagram, there are two possibilities. In the case that S_4=0, clearly there’s no chance of the maximum increasing. So in the absence of other information, for M_5=3, we must have X_4=X_5=+1, and so the chance of this is p^2.

So although the same transitions are possible, they have different probabilities with different information about the history, and so the Markov property does not hold here.
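The two conditional probabilities can be computed by enumerating the paths of S that are consistent with a given history of M. In the sketch below, the history (0,1,2,2,2) matches the description of the right diagram; the history (0,0,0,1,2) is my assumption for a history with a single compatible path, since the original picture is not reproduced here. The function name prob_max_increases is hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob_max_increases(m_history, p):
    """P(M_{n+1} = M_n + 1 | (M_0, ..., M_n) = m_history), by enumerating paths of S."""
    n = len(m_history) - 1
    consistent = Fraction(0)   # total weight of S-paths producing this M-history
    increasing = Fraction(0)   # weight of those which also raise the maximum at step n+1
    for steps in product((1, -1), repeat=n):
        weight, s, m, maxima = Fraction(1), 0, 0, [0]
        for x in steps:
            weight *= p if x == 1 else 1 - p
            s += x
            m = max(m, s)
            maxima.append(m)
        if tuple(maxima) == tuple(m_history):
            consistent += weight
            if s == m:         # S sits at its maximum, so an up-step raises M
                increasing += weight * p
    return increasing / consistent


p = Fraction(1, 3)  # arbitrary illustrative value
print(prob_max_increases((0, 0, 0, 1, 2), p))  # single compatible path
print(prob_max_increases((0, 1, 2, 2, 2), p))  # two compatible paths
```

Both histories have current value 2, so any difference between the two printed values is exactly the failure of the Markov property described above.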

The Yule Process

The second problem sheet for classes on the Applied Probability course this term features a long question about the Yule process. This is probably the simplest example of a birth process. It’s named for the British statistician George Udny Yule, though some sources prefer to call it the Yule-Furry process for the American physicist Wendell Furry who used it as a model of a radioactive reaction.

The model is straightforward. At any time there is some number of individuals in the population, and each individual gives birth to an offspring at constant rate \lambda, independently from the rest of the population. After a birth has happened, the parent and child evolve independently. In the notation of general birth processes, the birth rate when there are n individuals is \lambda_n=\lambda n.

Note that if we start with two or more individuals, the sizes of the two or more families of descendants evolve as a continuous-time Polya’s urn. The arrivals process speeds up with time, but the jump chain is exactly Polya’s urn. Unsurprisingly, the Yule process can be found embedded in preferential attachment models, and other processes which are based around Polya’s urn with extra information.

This is a discrete, random version of exponential growth. Since the geometric distribution is the discrete analogue of the exponential distribution, we probably shouldn’t be surprised to learn that this is indeed the distribution of the process at some fixed time t, when it is started from a single original ancestor. This is all we care about, since the numbers of descendants of the different original ancestors are independent. In general, the distribution of the population size at some fixed time will be negative binomial, that is, the distribution of a sum of IID geometric random variables.

The standard method here is to proceed using generating functions. Conditioning on the first splitting time gives two independent copies of the original process over a shorter time-scale. One derives an ODE in time for the generating function evaluated at any particular value z. This can be solved uniquely for each z, and patching together gives the generating function of the distribution at any specific time t, which can be seen to coincide with the corresponding generating function of the geometric distribution with parameter e^{-\lambda t}.

So we were trying to decide whether there might be a more heuristic argument for this geometric distribution. The method we came up with is not immediate, but does justify the geometric distribution in a couple of steps. First, we say that the birth times are T_2,T_3,\ldots, so between times [T_n,T_{n+1}) there are n individuals, with T_1:=0 for concreteness. Then by construction of the birth process, T_{n+1}-T_n\stackrel{d}{=}\mathrm{Exp}(\lambda n).

We now look at these ‘inter-birth times’ backwards, starting from T_{n+1}. Note that \mathrm{Exp}(\lambda n) is the distribution of the time for the first of n IID \mathrm{Exp}(\lambda) clocks to ring. But then, looking backwards, the next inter-birth time is thus the distribution of the time for one of (n-1) IID \mathrm{Exp}(\lambda) clocks to ring. So by memorylessness of the exponential distribution (discussed at great length on the first problem sheet), we can actually take these (n-1) clocks to be exactly those of the original n clocks which did not ring first. Continuing this argument, we can show that the first (in the original time direction) inter-birth time corresponds to the time spent waiting for the final clock to ring. Rewriting this observation formally:

T_{n+1}\stackrel{d}{=}\max\{X_i : X_1,\ldots,X_n\stackrel{\text{iid}}{\sim}\mathrm{Exp}(\lambda)\}. (*)

To return to justifying the geometric form of the distribution, we need to clarify the easiest relationship between the population size at a fixed time and these birth times. As we are aiming for the geometric distribution, the probability of the event \{X_t>n\} will be most useful. Clearly this event is the same as \{T_{n+1}<t\}, and from the description involving maxima of IID exponentials, this is easy to compute as (1-e^{-\lambda t})^n, which is exactly what we want.
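As a rough numerical check of this tail formula, one can simulate the birth process directly from its \mathrm{Exp}(\lambda n) inter-birth times and compare the empirical frequency of \{X_t>n\} with (1-e^{-\lambda t})^n. The sketch below does this; the function names, the sample size and the seed are arbitrary choices of mine.

```python
import math
import random


def yule_population(rate, t, rng):
    """Population size at time t of a Yule process started from one individual."""
    n, clock = 1, 0.0
    while True:
        clock += rng.expovariate(rate * n)   # wait Exp(rate * n) for the next birth
        if clock > t:
            return n
        n += 1


def tail_estimate(rate, t, n, samples=20000, seed=0):
    """Monte Carlo estimate of P(X_t > n)."""
    rng = random.Random(seed)
    hits = sum(yule_population(rate, t, rng) > n for _ in range(samples))
    return hits / samples


rate, t, n = 1.0, 1.0, 2
print(tail_estimate(rate, t, n), (1 - math.exp(-rate * t)) ** n)
```

The two printed numbers should agree up to Monte Carlo error, which for this sample size is of order 0.005.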

There are two interesting couplings hidden in these constructions. On closer inspection they turn out to be essentially the same from two different perspectives.

We have specified the distribution of T_n at (*). Look at this distribution on the right hand side. There is a very natural way to couple these distributions for all n, namely to take some infinite sequence X_1,X_2,\ldots of IID \mathrm{Exp}(\lambda) random variables, then use initial sequences of these to generate each of the T_ns as described in (*).

Does this coupling correspond to the use of these IID RVs in the birth process? Well, in fact it doesn’t. Examining the argument, we can see that X_1 gives a different inter-birth time for each value of t in the correspondence proposed. Even more concretely, in the birth process, almost surely T_{n+1}>T_n for each n. This is not true if we take the canonical coupling of (*). Here, if X_n<\max\{X_1,\ldots,X_{n-1}\}, which happens with high probability for large n, we have T_{n+1}=T_n in the process of running maxima.

Perhaps more interestingly, we might observe that this birth process gives a coupling of the geometric distributions. If we want to recover the standard parameterisation of the geometric distribution, we should reparameterise time. [And thus generate an essentially inevitable temptation to make some joke about now having a Yule Log process.]

Let’s consider what the standard coupling might be. For a binomial random variable, either on [n] or some more exotic set, as in percolation, we can couple across all values of the parameter by constructing a family of independent uniform random variables, and returning a 1 if U_i>1-p and so on, where p is the parameter of a specific binomial realisation.

We can do exactly the same here. A geometric distribution can be justified as the first success in a sequence of Bernoulli trials, so again we can replace the relevant Bernoulli distribution with a uniform distribution. Take U_1,U_2,\ldots to be IID U[0,1] random variables. Then, we have:

X_t\stackrel{d}{=}\bar X_t:= \max\{n: U_1,\ldots,U_{n-1}\ge e^{-\lambda t}\}.

The equality in distribution holds for any particular value of t by construction. But it certainly doesn’t hold uniformly in t. Note that if we define \bar X_t as a process, then typically the jumps of this process will be greater than 1, which is forbidden in the Yule process.
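To see the jumps concretely, here is a small sketch of the process \bar X_t driven by a fixed, finite list of uniforms. The list of values is made up for illustration, and truncating to finitely many uniforms is my simplification: the function simply returns one more than the length of the list if every uniform passes the threshold.

```python
import math


def geometric_from_uniforms(uniforms, rate, t):
    """max{n : U_1, ..., U_{n-1} >= exp(-rate * t)}, truncated to the supplied uniforms."""
    threshold = math.exp(-rate * t)
    n = 1
    for u in uniforms:
        if u < threshold:     # first failure: stop counting
            break
        n += 1
    return n


# Hypothetical realisation of U_1, ..., U_5.
uniforms = [0.9, 0.5, 0.7, 0.2, 0.95]
for threshold in (0.95, 0.6, 0.4, 0.1):
    t = -math.log(threshold)  # the time at which exp(-t) equals this threshold (rate 1)
    print(threshold, geometric_from_uniforms(uniforms, 1.0, t))
```

As the threshold e^{-\lambda t} drops past U_2=0.5, the value moves from 2 directly to 4, because U_3 was already above the threshold. This is the kind of jump of size greater than 1 that cannot occur in the Yule process.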

So, we have seen that this Yule process, even though its distribution at a fixed time has a standard form, provides a coupling of such distributions that is perhaps slightly surprising.

Hitting Probabilities for Markov Chains

This continues my previous post on popular questions in second year exams. In the interest of keeping it under 2,500 words I’m starting a new article.

In a previous post I’ve spoken about the two types of Markov chain convergence, in particular, considering when they apply. Normally the ergodic theorem can be used to treat the case where the chain is periodic, so the transition probabilities do not converge to a stationary distribution, but do have limit points – one at zero corresponding to the off-period transitions, and one non-zero. With equal care, the case where the chain is not irreducible can also be treated.

A favourite question for examiners concerns hitting probabilities and expected hitting times of a set A. Note these are unlikely to come up simultaneously. Unless the hitting probability is 1, the expected hitting time is infinite! In both cases, we use the law of total probability to derive a family of equations satisfied by the probabilities/times. The only difference is that for hitting times, we add +1 on the right hand side, as we advance one time-step to use the law of total probability.

The case of hitting probabilities is perhaps more interesting. We have:

h_i^A = 1,\; i\in A, \quad h_i^A=\sum_{j\in S}p_{ij}h_j^A,\; i\not\in A.

There are two main cases of interest: where the chain is finite but has multiple closed communicating classes, and where the chain is infinite, so even though it is irreducible, a trajectory might diverge before hitting 0.

For the case of a finite non-irreducible Markov chain, this is fairly manageable, by solving backwards from states where we know the values. Although of course you could ask about the hitting probability of an open state, the most natural question is to consider the probability of ending up in a particular closed class. Then we know that the hitting probability starting from any site in the closed class A is 1, and the probability starting from any site in a different closed class is 0. To find the remaining values, we can work backwards one step at a time if the set of possible transitions is sparse enough, or just solve the simultaneous equations for \{h_i^A: i\text{ open}\}.

We therefore care mainly about an infinite state-space that might be transient. Typically this might be some sort of birth-and-death chain on the positive integers. In many cases, the hitting probability equations can be reduced to a second-order linear recurrence relation which can be solved, normally ending up with the form

h_i=A+B\lambda^i,

where \lambda might well be q/p or similar if the chain is spatially homogeneous. If the chain is bounded, typically you might know h_0=1, h_N=0 or similar, and so you can solve two simultaneous equations to find A and B. For the unbounded case you might often only have one condition, so you have to rely instead on the result that the hitting probabilities are the minimal solution to the family of equations. Note that you will always have h^i_i=1, but with no conditions, h^i_j\equiv 1 is always a family of solutions.

It is not clear a priori what it means to be a minimal solution. Certainly it is not clear why one solution might be pointwise smaller than another, but in the case given above, it makes sense. Supposing that \lambda<1, and A+B=1 say, then as we vary the parameters, the resulting set of ‘probabilities’ does indeed vary monotonically pointwise.

Why is this true? Why should the minimum solution give the true hitting probability values? To see this, take the equations, and every time an h_i^A appears on the right-hand side, substitute in using the equations. So we obtain, for i\not\in A,

h_i^A=\sum_{j\in A}p_{ij}+\sum_{j\not\in A} p_{ij}h_j^A,

and after a further iteration

h_i^A=\sum_{j_1\in A}p_{ij_1}+\sum_{j_1\not\in A, j_2\in A}p_{ij_1}p_{j_1j_2}+\sum_{j_1,j_2\not\in A}p_{ij_1}p_{j_1j_2}h_{j_2}^A.

So we see on the RHS the probability of getting from i to A in one step, and in two steps, and if we keep iterating, we will get a large sum corresponding to the probability of getting from i to A in 1 or 2 or … or N steps, plus an extra term. Note that the extra term does not have to correspond to the probability of not hitting A by time N. After all, we do not yet know that (h_{i}^A) as defined by the equations gives the hitting probabilities. However, we know that the probability of hitting A within N steps converges to the probability of hitting A at all, since the sequence is increasing and bounded, so if we take a limit of both sides, we get h_i^A on the left, and something at least as large as the hitting probability starting from i on the right, because of the extra positive term. The result therefore follows.
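The same iteration can be run numerically. Starting from the indicator of A and repeatedly applying the equations gives the probability of hitting A within k steps, which increases to the minimal solution. The sketch below does this for a walk on the non-negative integers that steps up with probability p and down with probability 1-p, with A=\{0\}. Truncating the state space at a large level and treating that level as a state from which A is never hit is an approximation I have introduced to keep the computation finite; the function name is hypothetical.

```python
def minimal_hitting_probabilities(p, size, sweeps):
    """Iterate h <- P h, starting from the indicator of A = {0}.

    States are 0..size. State `size` is treated as absorbing with h = 0,
    which is an assumption used to truncate the infinite chain.
    After k sweeps, h[i] is the probability of reaching 0 within k steps.
    """
    q = 1 - p
    h = [1.0] + [0.0] * size
    for _ in range(sweeps):
        new = h[:]
        for i in range(1, size):
            new[i] = q * h[i - 1] + p * h[i + 1]
        h = new
    return h


p = 2 / 3  # illustrative value with upward drift, so q/p = 1/2
h = minimal_hitting_probabilities(p, size=60, sweeps=2000)
print(h[1], h[2], h[3])
```

For this choice of p the values should be close to (q/p)^i, the minimal solution of the form A+B\lambda^i with A=0 and B=1, rather than the constant solution h\equiv 1.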

It is worth looking out for related problems that look like a hitting probability calculation. There was a nice example on one of the past papers. Consider a simple symmetric random walk on the integers modulo n, arranged clockwise in a circle. Given that you start at state 0, what is the probability that your first return to state 0 involves a clockwise journey round the circle?

Because the system is finite and irreducible, it is not particularly interesting to consider the actual hitting probabilities. Also, note that if it is convenient to do so, we can immediately reduce the problem when n is even. In two steps, the chain moves from j to j+2 and j-2 with probability ¼ each, and stays at j with probability ½. So the two step chain is exactly equivalent to the lazy version of the same dynamics on the integers modulo n/2.

Anyway, even though the structure is different, our approach should be the same as for the hitting probability question, which is to look one step into the future. For example, to stand a chance of working, our first two moves must both be clockwise. Thereafter, we are allowed to move anticlockwise. There is nothing special about starting at 0 in defining the original probability. We could equally well ask for the probability that starting from j, the first time we hit 0 we have moved clockwise round the circle.

The only thing that is now not obvious is how to define moving clockwise round the circle, since it is not the case that all the moves have to be clockwise to have experienced a generally clockwise journey round the circle, but we definitely don’t want to get into anything complicated like winding numbers! In fact, the easiest way to make the definition is that given the hitting time of 0 is T, we demand that the chain was at state n-1 at time T-1.

For convenience (ie to make the equations consistent) we take h_0=0, h_n=1 in an obvious abuse of notation, and then

h_j=\frac12h_{j-1}+\frac12 h_{j+1},

from which we get

h_j=a+bj \Rightarrow h_j=\frac{j}{n}.

Of course, once we have this in mind, we realise that we could have cut the circle at 0 (also known as n) and unfolded it to reduce the problem precisely to symmetric gambler’s ruin. In particular, the answer to the original problem is 1/(2n), which is perhaps just a little surprising – maybe by thinking about the BM approximation to simple random walk, and that BM started from zero almost certainly crosses zero infinitely many times near time zero, we might have expected this probability to decay faster. But once it is unfolded into gambler’s ruin, we have the optional stopping martingale motivation to reassure us that this indeed looks correct.
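The unfolded problem is small enough to solve exactly by machine. The sketch below uses a simple shooting argument: run the recurrence h_{j+1}=2h_j-h_{j-1} from h_0=0 with a trial value h_1=1, then rescale so that h_n=1. The final answer multiplies h_1 by the probability ½ that the first step from 0 is clockwise. The function names are my own.

```python
from fractions import Fraction


def clockwise_probabilities(n):
    """Solve h_j = (h_{j-1} + h_{j+1}) / 2 for 0 < j < n with h_0 = 0 and h_n = 1."""
    g = [Fraction(0), Fraction(1)]           # trial solution with g_1 = 1
    for j in range(1, n):
        g.append(2 * g[j] - g[j - 1])        # rearranged recurrence
    return [x / g[n] for x in g]             # rescale so the value at n is 1


def clockwise_first_return(n):
    """Probability that the first return to 0 completes a clockwise circuit."""
    return Fraction(1, 2) * clockwise_probabilities(n)[1]


print(clockwise_probabilities(6))
print(clockwise_first_return(6))
```

Exact rational arithmetic avoids any rounding, so the output can be compared directly with j/n.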

Avoiding Mistakes in Probability Exams

Over the past week, I’ve given several tutorials to second year undergraduates preparing for upcoming papers on probability and statistics. In particular, I’ve now seen a lot of solutions to a lot of past papers and specimen questions, and it’s worthwhile to consider some of the typical mistakes students can make on these questions. Of course, as with any maths exam, there’s always the possibility of a particularly subtle or involved question coming up, but if the following three common areas of difficulty can be avoided, you’re on track for doing well.

Jacobians

In a previous course, a student will learn how to calculate the pdf of a function of a random variable. Here, we move onto the more interesting and useful case of finding the (joint) density of function(s) of two or more random variables. The key thing to remember here is that manipulating pdfs is not a strange arbitrary exercise – it is just integration. It is rarely of interest to consider the value of a pdf at a single point. We can draw meaningful conclusions from a pdf or from comparison of two pdfs by integrating them.

Then the question of substituting for new random variables is precisely integration by substitution, which we are totally happy with in the one-dimensional case, and should be fairly happy with in the two-dimensional case. To get from one joint density to another, we multiply by the absolute value of the Jacobian. To ensure you get it right, it makes sense to write out the informal infinitesimal relation

f_{U,V}(u,v) du dv = f_{X,Y}(x,y)dx dy.

This is certainly relevant if we put integral signs in front of both sides, and explains why you obtain f_{U,V} = \frac{d(x,y)}{d(u,v)} f_{X,Y} rather than the other way round. Note though that if \frac{d(u,v)}{d(x,y)} is easier to calculate for some reason, then you can evaluate this and take the inverse, as your functions will almost certainly be locally bijective almost everywhere.

It is important to take the modulus of the Jacobian, since densities cannot be negative! If this looks like a fudge, then consider the situation in one dimension. If we substitute for x\mapsto f(x)=1-x, then f’ is obviously negative, BUT we also end up reversing the order of the bounds of the integral, eg [1/3, ¾] will become [2/3,1/4]. So we have a negative integrand (after multiplying by f'(x)) but bounds in the ‘wrong’ order. These two factors of -1 will obviously cancel, so it suffices just to multiply by |f'(x)| at that stage. It is harder to express in words, but a similar relation works for the Jacobian substitution.

You also need to check where the new joint density is non-zero. Suppose X, Y are supported on [0,1], then when we write f_{X,Y}(x,y) we should indicate that it is 0 off this region, either by splitting into cases, or adding the indicator function 1_{\{x,y\in[0,1]\}} as a factor. This is even more important after substitutions, as the range of the resulting random variables might be less obvious than the originals. Eg with X,Y as above, and U=X^2, V=X/Y, the resulting pdf will be non-zero only when u\in[0,1], v\ge \sqrt{u}. Failing to account for this will often lead to ludicrous answers. A general rule is that you can always check that any distribution you’ve produced does actually integrate to one.
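As a worked version of this example, suppose in addition that X and Y are independent and uniform on [0,1] (the independence and uniformity are my assumptions, made so that f_{X,Y}=1 on the unit square). Inverting the substitution, computing the Jacobian, and checking the normalisation gives:

```latex
x=\sqrt{u},\qquad y=\frac{\sqrt{u}}{v},

\frac{d(x,y)}{d(u,v)}
=\det\begin{pmatrix}\frac{1}{2\sqrt{u}} & 0\\[2pt] \frac{1}{2v\sqrt{u}} & -\frac{\sqrt{u}}{v^2}\end{pmatrix}
=-\frac{1}{2v^2},

f_{U,V}(u,v)=\frac{1}{2v^2}\,1_{\{u\in[0,1],\,v\ge\sqrt{u}\}},

\int_0^1\int_{\sqrt{u}}^{\infty}\frac{1}{2v^2}\,dv\,du
=\int_0^1\frac{1}{2\sqrt{u}}\,du=1.
```

Dropping the indicator and integrating over all v>0 would give a divergent integral, which is the sort of ludicrous answer the support check is there to prevent.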

Convergence using MGFs

There are two main reasons to use MGFs and PGFs. The first is that they behave nicely when applied to (possibly random) sums of independent random variables. The independence property is crucial to allow splitting of the MGF of the sum into the product of MGFs of the summands. Of course, implicit in this argument is that MGFs determine distributions.

A key theorem of the course is that this works even in the limit, so you can use MGFs to show convergence in distribution of a family of distributions. For this, you need to show that the MGFs converge pointwise on some interval [-a,a] around 0. (Note that the moments of the distribution are given by the family of derivatives at 0, as motivation for why this condition might be necessary.) Normally for such questions, you will have been asked to define the MGF earlier in the question, and probably will have found the MGF of a particular distribution or family of distributions, which might well end up appearing as the final answer.

Sometimes such an argument might involve substituting in something unusual, like t/N, rather than t, into a known MGF. Normally a Taylor series can be used to show the final convergence result. If you have a fraction, try to cancel terms so that you only have to evaluate one Taylor series, rather than lots.
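As a small illustration of substituting t/N, take the mean of N IID \mathrm{Exp}(1) random variables (my choice of example). Its MGF is (1-t/N)^{-N} for t<N, and the Taylor expansion -N\log(1-t/N)=t+t^2/(2N)+\ldots shows that this converges to e^t, the MGF of the constant 1. The sketch below checks this numerically; the function name is hypothetical.

```python
import math


def mgf_mean_of_exponentials(t, n):
    """MGF at t of the mean of n IID Exp(1) variables, valid for t < n."""
    return (1 - t / n) ** (-n)


t = 0.5
for n in (10, 1000, 10 ** 6):
    print(n, mgf_mean_of_exponentials(t, n), math.exp(t))
```

The gap to e^t should shrink roughly like t^2e^t/(2N), in line with the second term of the Taylor series.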

Using the Markov Property

The Markov property is initially confusing, but once we become comfortable with the statement, it is increasingly irritating to have to answer the question: “show that this process has the Markov property.” This question is irritating because in most cases we want to answer: “because it obviously does!” Which is compelling, but unlikely to be considered satisfactory in a mathematics exam. Normally we observe that the random dynamics of the next step are a function only of the present location. Looking for the word ‘independent’ in the statement of the process under discussion is a good place to start for any argument along these lines.

The most developed example of a Markov process in this course is the Poisson process. I’ve written far too much about this before, so I won’t do so again, except to say this. When we think of the Poisson process, we generally have two thoughts going through our minds, namely the equivalent definitions of IID exponential inter-arrival times, and stationary, Poisson increments (or the infinitesimal version). If we draw a sketch of a sample trajectory of this process, we can label everything up and it is clear how it all fits together. But if you are asked to give a definition of the Poisson process (N_t), it is inappropriate to talk about inter-arrival times unless you define them in terms of N_t, since that is the process you are actually trying to define! It is fine to write out

T_k:=\min\{t: N_t=k\},\quad N_t=\max\{k: Y_1+Y_2+\ldots+Y_k\le t\}

but the relation between the two characterisations of the process is not obvious. That is why it is a theorem of the course.

We have to be particularly careful of the difference in definition when we are calculating probabilities of various events. A classic example is this. Find the distribution of N_2, conditional on T_3=1. It’s very tempting to come up with some massive waffle to argue that the answer is 3+Po(1). The most streamlined observation is that the problem is easy if we are conditioning instead on N_1=3. We just use the independent Poisson increments definition of (N_t), with no reference to inter-arrival times required. But then the Markov property applied at time 1 says that the distribution of (N_2) depends only on the value of N_1, not on the process on the interval [0,1). In a sense, the condition that T_3=1 is giving us extra information on the behaviour of the process up to time 1, and the Markov property, which we know holds for the Poisson process, asserts precisely that the extra information doesn’t matter.

Coupling from the Past

In a long series of previous posts I have talked about mixing times for Markov chains. We consider how long it takes for the distribution of a particular Markov chain to approach equilibrium. We are particularly interested in the asymptotics when some parameter of the model, such as the size of the state space, grows to infinity.

But why are we interested in the underlying problem? The idea of Markov Chain Monte Carlo methods is to sample from an intractable distribution by instead sampling from a Markov chain which approximates the distribution well at large times. A distribution might be intractable because it is computationally demanding to work out the normalising constant, or it might be distributed uniformly on a complicated combinatorial set. If, however, the distribution is the equilibrium distribution of some Markov chain, then we know how to at least sample from a distribution which is close to the one we want. But we need to know how long to run the process. We will typically tolerate some small error in approximating the distribution (whether we measure this in terms of total variation distance or some other metric doesn’t really matter at this heuristic level), but we need to know how it scales. If we double the size of the system, do we need to double the number of iterations of the chain, or square it? This is really important if we are going to use this for large real-world models with finite computing power!

Sometimes though, an approximation is not enough. If we want an exact sample from the equilibrium distribution, Markov chains typically will not help us as it is only in very artificial examples that the distribution after some finite time is actually the equilibrium distribution. One thing that we might use is a stationary time, which is a stopping time T, for which X_T\stackrel{d}{=}\pi. Note that there is one trivial way to do this. We can sample Y from distribution \pi before starting the process, then stop X at the first time T for which X_T=Y. But this is no help really, as we need to have Y in the first place!

So we are really interested in less trivial stationary times. Perhaps the best example is the top-to-random shuffle. Here we are given a pack of labelled cards, WLOG initially in descending order. At each step we move the top card in the pile to a randomly-chosen location in the pile (which includes back onto the top). Then it turns out that the first time we move the card originally at the bottom from the top to somewhere is a strong stationary time. This is fairly natural, as by this time, every card has been involved in at least one randomising event.

Anyway, so this gives a somewhat artificial way to sample from the uniform distribution on a pack of cards. This strong stationary time is almost surely finite, with distribution given by the coupon collector problem, for which the expectation grows as n\log n, where n is the number of cards.
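Here is a minimal simulation sketch of the shuffle run up to this stopping time. With k cards beneath the original bottom card, the top card is inserted below it with probability (k+1)/n, so I would expect the mean stopping time to be n(1+\frac12+\ldots+\frac1{n-1})+1, which is the coupon-collector-type growth mentioned above. The function names, sample size and seed are arbitrary choices.

```python
import random


def top_to_random_time(n, rng):
    """Shuffle until the original bottom card has been moved from the top.

    Returns the number of steps taken and the resulting deck.
    """
    deck = list(range(n))          # deck[0] is the top; card n-1 starts at the bottom
    bottom_card = n - 1
    steps = 0
    while True:
        steps += 1
        card = deck.pop(0)                       # take the top card
        deck.insert(rng.randrange(n), card)      # n possible slots, including the top
        if card == bottom_card:
            return steps, deck


def mean_time(n, samples=5000, seed=0):
    """Monte Carlo estimate of the expected stopping time."""
    rng = random.Random(seed)
    return sum(top_to_random_time(n, rng)[0] for _ in range(samples)) / samples


n = 10
expected = n * sum(1 / k for k in range(1, n)) + 1
print(mean_time(n), expected)
```

The deck returned at the stopping time is the exact uniform sample; the step count is what one would study to understand the cost.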

The problem with this method is that it is not easy in general to come up with a non-contrived stationary time such as this one. The idea of coupling from the past, discussed by some previous authors but introduced in this context by Propp and Wilson in the mid ’90s, is another method to achieve perfect sampling from the equilibrium distribution of a Markov chain. The idea here is to work backwards rather than forwards. The rest of this post, which discusses this idea, is based on the talk given at the Junior Probability Seminar by Irene, and on the chapter in the Levin, Peres, Wilmer book.

The key to the construction is a coupling of the transitions of a Markov chain. In the setting of a simple random walk, we have by construction a coupling of the transitions. It doesn’t matter which state we are at: we toss a coin to decide whether to move up or down, and we can do this without reference to our current position. Levin, Peres and Wilmer call this a random mapping representation in general, and it is yet another concept that is less scary than its definition might suggest.

Given a transition matrix P on state space S, such a representation is a function

\phi: S\times[0,1]\rightarrow S,\text{ s.t. }\mathbb{P}(\phi(i,U)=j)=p_{ij},

where U is a U(0,1) random variable independent of choice of i. In particular, once we have the random value of u, we can consider \phi(i,u) as i varies, to obtain a random map S\rightarrow S. Crucially, this map is not necessarily a bijection.

Note first that there are many possibilities for constructing the representation \phi. For some chains, and some representations, in particular random walks on vertex-transitive graphs (such as SRW – only for now we are restricting attention to finite state spaces) it is possible to choose \phi so that it always gives a bijection, but it is also always possible to choose it so that there is some probability it doesn’t give a bijection.

Let U_1,U_2,\ldots be an IID sequence of U[0,1] random variables, and write \phi_i for the random map induced by U_i. Then consider the sequence of iterated maps:

\phi_1, \phi_1\circ \phi_2, \ldots, \phi_1\circ\ldots\circ\phi_n,

and let T be the (random) smallest time such that the image of \phi_1\circ\ldots\circ \phi_T is a single state. Ie, as we go backwards in time through the maps \phi_i, we are gradually losing various states, corresponding to the maps not being bijections. Since the state space is finite, and the probability of not being a bijection is positive, it can be shown that T is almost surely finite. The claim then is that

Y=\text{Im}(\phi_1\circ\ldots\circ \phi_T)

is distributed as the equilibrium distribution of the chain. We finish by proving this.

Proof: Since the algorithm terminates after finite time almost surely, given any \epsilon>0, we can choose N such that the probability the algorithm stops in at most N steps is greater than 1-\epsilon.

Now run the Markov chain from time -N, started in the equilibrium distribution, with the transition from time -t to -(t-1) given by the random mapping driven by U_t. Thus at time 0, the distribution of the chain is still the equilibrium distribution. But if we condition on the event that T\le N, then X_0=\phi_1\circ \ldots \circ\phi_N(X_{-N})=Y regardless of the initial value. So \mathbb{P}(X_0\ne Y)<\epsilon, and hence the result follows, since \epsilon>0 was arbitrary.
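The construction is short enough to sketch in code. Below, random_map is one possible random mapping representation (inverting the cumulative row sums of P at the same u for every state), and coupling_from_the_past composes \phi_1\circ\ldots\circ\phi_k until the image is a single state. The three-state chain is a made-up example whose equilibrium distribution is (¼, ½, ¼), and all function names are my own; this is a naive version that tracks every state, not an efficient implementation.

```python
import random


def random_map(P, u):
    """One random mapping representation: state i goes to the first j whose
    cumulative row probability exceeds u."""
    mapping = []
    for row in P:
        cumulative = 0.0
        for j, pij in enumerate(row):
            cumulative += pij
            if u < cumulative:
                break
        mapping.append(j)
    return mapping


def coupling_from_the_past(P, rng):
    """Return the single state in the image of phi_1 o ... o phi_T."""
    states = range(len(P))
    composed = list(states)                        # identity map
    while len(set(composed)) > 1:
        phi = random_map(P, rng.random())          # next map, one step further into the past
        composed = [composed[phi[i]] for i in states]   # apply phi first, then the earlier maps
    return composed[0]


def sample_frequencies(P, samples, seed=0):
    """Empirical distribution of repeated, independent runs of the algorithm."""
    rng = random.Random(seed)
    counts = [0] * len(P)
    for _ in range(samples):
        counts[coupling_from_the_past(P, rng)] += 1
    return [c / samples for c in counts]


# Hypothetical three-state chain, chosen only for illustration.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
print(sample_frequencies(P, 20000))
```

Note the order of composition: each new map is applied before the ones already drawn. Composing in the other order (running forwards until coalescence) would not in general give a sample from the equilibrium distribution.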

What makes this easier than strong stationary times is that we don’t have to be clever to come up with the stopping time. It is however still important to know how long on average it takes to run the algorithm. At the end of her talk, Irene showed how to adapt this algorithm to deal with Probabilistic Cellular Automata. Roughly speaking, these are a sequence of infinite strings of 0s and 1s. The value of some element is determined randomly as a function of the values in the row underneath, say the element directly underneath and the two either side. In that setting, if you start with a finite subsequence and couple from the past by looking down to lower rows, each time you drop down a row you consider one further element, so in fact the coupling from the past algorithm has to eliminate possibilities fast enough to make up for this, if we want to terminate almost surely in finite time.

Here’s a link to the paper which discusses this in fuller detail.


Lamperti Walks


The theory of simple random walks on the integer lattice is a classical topic in probability theory. Polya proved in the 1920s that such a SRW on \mathbb{Z}^d is recurrent only for d=1 or 2. The argument is essentially combinatorial. We count the number of possible paths from 0 back to itself and show that this grows fast enough that even with the probabilistic penalty of having a particular long path we will still repeatedly see this event happening. In larger dimensions there is essentially ‘more space’ at large distances, at least comparatively, so a typical walk is more likely to escape into this space.

As Kakutani (of the product martingale theorem) said, and was subsequently quoted as the dedication on every undergraduate pdf about random walks: “A drunk man will find his way home, whereas a drunk bird may get lost forever.”

But transience is in some sense a long-distance property. We can fiddle with the transition rates near zero and, so long as we don’t make anything deterministic, this shouldn’t affect transience properties. Obviously if we have a (space-)homogeneous nearest-neighbour random walk on the integers with non-zero drift the process will be transient: it drifts towards positive infinity if the drift is positive. But can we have a random walk with non-zero drift, but where the drift tends to zero at large distances fast enough, and the process is still recurrent? What is the correct scaling for the decay of the drift to see interesting effects?

The answers to these questions are seen in the so-called Lamperti random walks, which were a recurring theme of the meeting on Aspects of Random Walks held in Durham this week. Thanks to the organisers for putting on such an excellent meeting. I hadn’t known much about this topic before, so thought it might be worth writing a short note.

As explained above, we consider time-homogeneous random walks. It will turn out that the exact distribution of the increments is not hugely important. Most of the properties we might care about will be determined only by the first two moments, which we define as:

\mu_1(x)=\mathbb{E}[X_{t+1}-X_t | X_t=x],

\mu_2(x)=\mathbb{E}[(X_{t+1}-X_t)^2 | X_t=x].

Note that because the drift will be asymptotically zero, the second term is asymptotically equal to the variance of the increment. It will also turn out that the correct scaling for \mu_1 to see a phase transition is \mu_1(x)\sim \frac{c}{x}.

We begin by seeing how this works in the simplest possible example, from Harris (1952). Let’s restrict attention to a random walk on the non-negative integers, and impose the further condition that increments are +1 or -1. In the notation of a birth-and-death process from a first course on Markov chains, we can set:

p_j:=\mathbb{P}(X_{t+1}=j+1| X_t=j), \quad q_j=1-p_j.

We will set p_j=\frac12 + \frac{c}{2j}. Then a condition for transience is that

1+\frac{q_1}{p_1}+\frac{q_1q_2}{p_1p_2}+\ldots <\infty.

In our special case:

\frac{q_1\ldots q_r}{p_1\ldots p_r}\approx\frac{(r-2c)(r-1-2c)(r-2-2c)\ldots}{r!}\approx \frac{1}{r^{2c}}.

So we can deduce that this sum converges if c>1/2, giving transience. A similar, but slightly more complicated calculation specifies the two regimes of recurrence. If -1/2\le c\le 1/2 then the chain is null-recurrent, meaning that the expected time to return to any given state is infinite. If c<-1/2, then it is positive recurrent.
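As a sanity check on the exponent, here is a small numerical sketch. The function names are hypothetical, and it assumes |c|<1 so that every p_j lies strictly between 0 and 1. It estimates the decay exponent of the products q_1\ldots q_r/(p_1\ldots p_r) from the slope of their logarithm against \log r, which should be close to 2c.

```python
import math

def log_ratio_products(c, r_max):
    """Return [log(q_1...q_r / (p_1...p_r)) for r = 1..r_max], with p_j = 1/2 + c/(2j)."""
    logs, total = [], 0.0
    for j in range(1, r_max + 1):
        p = 0.5 + c / (2 * j)
        q = 1.0 - p
        total += math.log(q / p)
        logs.append(total)
    return logs

def estimated_exponent(c, r1=1000, r2=100000):
    """Estimate a such that the product decays like r**(-a), using two values of r."""
    logs = log_ratio_products(c, r2)
    slope = (logs[r2 - 1] - logs[r1 - 1]) / (math.log(r2) - math.log(r1))
    return -slope
```

For instance, estimated_exponent(0.75) should come out near 1.5, so the terms are summable, while estimated_exponent(0.25) should come out near 0.5, so they are not.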

In general, we assume \mu_1(x)\sim \frac{c}{x} and \mu_2(x)\approx s^2. In the case above, obviously s^2=1. The general result is that under mild assumptions on the increment distributions, for instance a (2+\epsilon)-moment, if we define r=-\frac{2c}{s^2}, then the RW is transient if r<-1, positive-recurrent if r>1, and null-recurrent otherwise. This is the main result of Lamperti.

To explain why we have parameterised exactly like this, it makes sense to talk about the more general proof methods, as obviously the direct Markov chain calculation won’t work in general. The motivating idea is that we can deal well with the situation where the drift is zero, so let’s transform the random walk so that the drift becomes zero. A function of a Markov chain that is more stable (in some sense) than the original MC, for analysis at least, is sometimes called a Lyapunov function. Here, the sensible thing is to consider Y_t=X_t^\gamma, for some exponent \gamma>0.

So long as our distributions are fairly well-behaved (e.g. a finite (2+\epsilon)-moment), we can Taylor expand (x+\Delta)^\gamma to second order in the increment \Delta and calculate the drift of Y as

\mathbb{E}[Y_{t+1}-Y_t| X_t=x]=\frac{\gamma}{2}x^{\gamma-2}(2c+(\gamma-1)s^2) +o(x^{\gamma-2}).

In particular, taking \gamma=1+r results in a random walk that is ‘almost’ a martingale. Note that the original RW was almost a martingale, in the sense that the drift is asymptotically zero, but now it is zero to second order as well.
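For the nearest-neighbour example with p_x=\frac12+\frac{c}{2x}, the drift of Y can be computed exactly, so the claim is easy to check numerically. The sketch below (scaled_drift is a hypothetical helper name) divides the exact drift by x^{\gamma-2}. With \gamma=1 this recovers the original drift constant c, while with \gamma=1+r=1-2c the scaled drift should be close to zero for large x.

```python
def scaled_drift(x, c, gamma):
    """Exact one-step drift of Y = X**gamma at X = x, divided by x**(gamma - 2).

    Assumes the +/-1 walk with P(up from x) = 1/2 + c/(2x), so s^2 = 1.
    """
    p = 0.5 + c / (2 * x)
    q = 1.0 - p
    drift = p * (x + 1) ** gamma + q * (x - 1) ** gamma - x ** gamma
    return drift / x ** (gamma - 2)

c = 0.3
print(scaled_drift(1000, c, 1.0))        # close to c
print(scaled_drift(1000, c, 1 - 2 * c))  # close to 0
```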

To draw any rigorous conclusions, we need to be careful about exactly how precise this approximation is, but we won’t worry about that now. In particular, we need to know whether the approximation survives an application of the optional stopping theorem, as this allows us to say:

\mathbb{P}(X\text{ hits }x\text{ before 0})=\mathbb{P}(Y\text{ hits }x^\gamma\text{ before 0})\sim x^{-\gamma}.

This is particularly useful for working out the expected excursion time away from 0, which precisely leads to the condition for null-recurrence.

In his talk, Ostap Hryniv showed that this Lyapunov function analysis can be taken much further, to derive much more precise results about excursions, maxima and ergodicity. Results of Menshikov and Popov from the 90s further specify the asymptotics for the invariant distribution, if it exists, in terms of r.

One cautionary remark I should make is that earlier I implied that once we know the drift of such a random walk is zero, we have recurrence. This is true on \mathbb{Z} with very mild restrictions, but is not necessarily true in higher dimensions. For example, consider the random walk on \mathbb{R}^2, started away from the origin, where conditional on X_t, the increment X_{t+1}-X_t is of length 1 and perpendicular to the vector X_t. The two possible directions are equally likely. The drift is therefore 0 everywhere, and the second moment is also well-behaved, but note that ||X_{t+1}||^2=||X_t||^2+1, and hence ||X_t||^2=||X_0||^2+t, just by considering Pythagoras. So the walk escapes to infinity deterministically. In higher dimensions, then, we have to be a bit more careful, and put restrictions on the covariance structure of the increment distributions.
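A quick simulation makes the Pythagoras argument visible. This is only a sketch: perpendicular_walk is a hypothetical name, and the walk must be started at a non-zero point so that the perpendicular direction is defined.

```python
import random

def perpendicular_walk(x0, steps, seed=None):
    """Simulate the walk on R^2 and return the squared norms ||X_t||^2 for t = 0..steps."""
    rng = random.Random(seed)
    x, y = x0
    sq_norms = [x * x + y * y]
    for _ in range(steps):
        norm = (x * x + y * y) ** 0.5
        sign = rng.choice((-1, 1))  # the two perpendicular directions are equally likely
        # (-y, x) / norm is a unit vector perpendicular to (x, y).
        x, y = x - sign * y / norm, y + sign * x / norm
        sq_norms.append(x * x + y * y)
    return sq_norms

sq = perpendicular_walk((1.0, 0.0), 1000, seed=1)
print(sq[0], sq[10], sq[1000])  # the squared norm grows by 1 at every step
```

Whatever signs are chosen, the squared norm increases by exactly 1 per step (up to floating-point error), so the randomness plays no role in the distance from the origin.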

As a final comment, note that from Lamperti’s result, we can re-derive Polya’s result about SRW in higher dimensions. If we have X_t an SRW on \mathbb{Z}^d, then consider Y_t=||X_t||. By considering a couple of examples in two dimensions, it is clear that this is not Markov. But the methods we considered above for the Lamperti walks were really martingale methods rather than Markov chain methods. And indeed this process Y has asymptotically zero drift with the right scaling. Here,

c=\frac{1}{2}(1-\frac{1}{d}),\quad s^2=\frac{1}{d},

and so r=-\frac{2c}{s^2}=1-d. Since the walk is transient when r<-1, this leads to exactly the result we know to be true, that the SRW is transient precisely in three dimensions and higher.
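The values of c and s^2 can be checked by brute force, since from any lattice point there are only 2d possible steps. The sketch below (norm_moments is a hypothetical name) computes the exact first two moments of the increment of ||X_t|| at a point far from the origin, and then plugs them into r=-\frac{2c}{s^2} as defined earlier.

```python
import math

def norm_moments(x):
    """Estimate (c, s^2) for Y = ||X|| from the 2d equally likely SRW steps at X = x.

    Returns (||x|| * E[increment of Y], E[(increment of Y)^2]).
    """
    d = len(x)
    norm = math.sqrt(sum(v * v for v in x))
    m1 = m2 = 0.0
    for i in range(d):
        for step in (-1, 1):
            y = list(x)
            y[i] += step
            inc = math.sqrt(sum(v * v for v in y)) - norm
            m1 += inc / (2 * d)
            m2 += inc * inc / (2 * d)
    return norm * m1, m2

c_est, s2_est = norm_moments((1000, 2000, -1500))
r_est = -2 * c_est / s2_est
print(c_est, s2_est, r_est)  # compare with c = 1/3, s^2 = 1/3 for d = 3
```

The starting point here is arbitrary. Any point far from the origin should give values close to the asymptotic constants.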

REFERENCES

Harris – First Passage and Recurrence Distributions (1952)

The slides from Ostap Hryniv’s talk, on which this was based, can be found here.
