Doob inequalities and Doob-Meyer decomposition

The first post I wrote on this blog was about martingales, way back in 2012 at a time when I had known what a martingale was for about a month. I now don’t have this excuse. So I’m going to write about a couple of properties of (discrete-time) martingales that came up while adjusting a proof which my thesis examiners suggested could be made much shorter as part of their corrections.

Doob’s submartingale inequality

When we prove that some sequence of processes converges to some other process, we typically want to show that this holds in some sense uniformly over a time-interval, rather than just at some fixed time. We don’t lose much at this level of vagueness by taking the limit process to be identically zero. Then, if the convergent processes are martingales or closely similar, we want to be able to bound \sup_{k\le n} |Z_k| in some sense.

Doob’s submartingale inequality allows us to do this. Recall that a submartingale has almost-surely non-negative conditional increments. You might think of it heuristically as ‘more increasing than a martingale’. If Z_n is a martingale, then |Z_n| is a submartingale. This will be useful almost immediately.

The statement is that for (Z_n) a non-negative submartingale,

\mathbb{P}\left( \sup_{k\le n} Z_k \ge \lambda\right) \le \frac{\mathbb{E}\left[Z_n\right]}{\lambda}.

The similarity of the statement to the statement of Markov’s inequality is no accident. Indeed the proof is very similar. We consider whether the event in question happens, and find lower bounds on the expectation of Z_n under both possibilities.

Formally, for ease of notation, let Z_n^* be the running maximum \sup_{k\le n}Z_k. Then, we let T:= n\wedge \inf\{k\le n, M_j\ge \lambda\} and apply the optional stopping theorem for submartingales at T, which is by construction at most n. That is

\mathbb{E}[Z_n]\ge \mathbb{E}[Z_T]=\mathbb{E}\left[Z_T\mathbf{1}_{Z_n^*<\lambda}\right] + \mathbb{E}\left[Z_T \mathbf{1}_{Z_n^*\ge \lambda}\right].

The first of these summands is positive, and the second is at least \lambda \mathbb{P}\left( Z_N^* \ge \lambda \right), from which the result follows.

We’ve already said that for any martingale Z_n, |Z_n| is a submartingale, but in fact f(Z_n) is a submartingale whenever f is convex, and \mathbb{E}|f(Z_n)|<\infty for each n. Naturally, this continues to hold when Z_n is itself a submartingale.

[Note that Z_n^* is also a submartingale, but this probably isn’t as interesting.]

A particularly relevant such function f is f(x)=x^p, for p>1. If we take Z_n a non-negative submartingale which is uniformly bounded in L^p, then by applying Holder’s inequality and this submartingale inequality, we obtain

\mathbb{E}\left( \sup_{k\le n}Z_n^p \right) \le \left(\frac{p}{p-1}\right)^p \mathbb{E}\left[ Z_n^p \right].

Since Z_n^p is a submartingale, then a limit in n on the RHS is monotone, and certainly a limit in n on the LHS is monotone, so we can extend to

\mathbb{E}\left( \sup_{k\le n}Z_\infty^p \right) \le \left(\frac{p}{1-p}\right)^p \mathbb{E}\left[ Z_\infty^p \right].

Initially, we have to define \mathbb{E}\left[ Z_\infty^p \right] through this limit, but in fact this result, Doob’s L^p inequality, shows that Z_\infty:= \lim Z_n exists almost surely as well.

Naturally, we will often apply this in the case p=2, and in the third of these three sections, we will see why it might be particularly straightforward to calculate \mathbb{E}\left[Z_\infty^2\right].

Remark: as in the case of Markov’s inequality, it’s hard to say much if the submartingale is not taken to be non-negative. Indeed, this effect can be seen even if the process is only defined for a single time step, for which the statement really is then Markov’s inequality.

Doob-Meyer decomposition

Unfortunately, most processes are not martingales. Given an discrete-time process X_n adapted to \mathcal{F}=(\mathcal{F}_n), it is a martingale if the conditional expectations of the increments are all almost surely zero. But given a general adapted process X_n which is integrable (so the increments have well-defined finite expectation), we can iteratively construct a new process M_n, where the increments are centred versions of X_n‘s increments. That is,

M_{n+1}-M_n:= X_{n+1}-X_n - \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right] = X_{n+1}-\mathbb{E}\left[X_{n+1} \,\big|\, \mathcal{F}_n\right]. (*)

Then it’s immediately clear from the definition that M_n is a martingale.

There’s a temptation to tie oneself up in knots with the dependence. We might have that increments of the original process X_n depend on the current value of the process. And is it necessarily clear that we can recover the current value of the original process from the current value of M_n? Well, this is why we demand that everything be adapted, rather than just Markov. It’s not the case that M_n should be Markov, but it clearly is adapted.

Now we look at the middle expression in (*), and in particular the term we are subtracting, namely the conditional expectation. If we define, in the standard terminology, A_0=0 and

A_{n+1}-A_n:= \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right],

then we have decomposed the original process X_n as the sum of a martingale M_n, and this new process A_n. In particular, note that the increment A_{n+1}-A_n given above is adapted to \mathcal{F}_n, which is a stronger condition than being adapted to \mathcal{F}_{n+1} as we would expect a priori. This property of the process (A_n) is called predictability (or possibly previsibility).

This decomposition X_n=X_0+M_n+A_n as just defined is called the Doob-Meyer decomposition, and there is a unique such decomposition where M_n is a martingale, and A_n is predictable. The proof of uniqueness is very straightforward. We look at the equalities given above as definitions of M_n,A_n, but then work in the opposite direction to show that they must hold if the decomposition holds.

I feel a final heuristic is worthwhile, using the term drift, more normally encountered in the continuous-time setting to describe infinitissimal expected increments. The increments of A_n represent the drift of X_n, and the increments of M_n are what remains from X_n after subtracting the drift. In general, the process to be subtracted to turn a non-martingale into a martingale is called a compensator, and the existence or otherwise of such processes is important but challenging for some classes of continuous-time processes.

In particular, note that when X_n is itself a martingale, then A_n\equiv 0. However, probably the most useful case is when X_n is a submartingale, as then the drift is always non-negative, and so A_n is almost surely increasing. The converse holds too.

This is relevant because this Doob-Meyer decomposition is obviously only a useful tool for treating X_n if we can handle the two processes M_n,A_n easily. We have tools to bound the martingale term, but this previsible term might in general be tricky, and so the case where X_n is a submartingale is good, as increasing processes are much easier than general processes, since bounding the whole process might involve only bounding the final term in many contexts.

Predictable quadratic variation

A particularly relevant example is the square of a martingale, that is X_n=M_n^2, where M_n is a martingale. By the convexity condition discussed earlier, X_n is a submartingale (provided it is integrable, ie M_n is square-integrable), and so the process A_n in its Doob-Meyer decomposition is increasing. This is often called the (predictable) quadratic variation of (X_n).

This predictable quadratic variation is sometimes denoted \langle X_n\rangle. This differs from the (regular) quadratic variation which is defined as the sum of the squares of the increments, that is [X_n]:= \sum_{k=0}^{n-1} (X_{k+1}-X_k)^2. Note that this is adapted, but obviously not previsible. The distinction between these two processes is more important in continuous time. There, they are almost surely equal for a continuous local martingale, but not for eg a Poisson process. (For a Poisson process, the PQV is deterministic, indeed linear, while the (R)QV is almost surely equal to the Poisson process itself.) In the discrete time setting, the regular quadratic variation is not relevant very often, while the predictable quadratic variation is useful, precisely because of this decomposition.

Whenever we have random variables which we then centre, there is a standard trick to apply when treating their variance. That is

A_{n+1}-A_n= \mathbb{E}\left[ M^2_{n+1}-M^2_n \,\big|\, \mathcal{F}_n\right]
= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n^2 +M_n^2
= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n \mathbb{E}\left[ M_{n+1}\,\big|\, \mathcal{F}_n\right] + M_n^2
= \mathbb{E}\left[ \left(M_{n+1}-M_n\right)^2\,\big|\, \mathcal{F}_n\right].

One consequence is seen by taking an ‘overall’ expectation. Because M_n^2-A_n is a martingale,

\mathbb{E}\left[M_n^2\right] = \mathbb{E}\left[A_n\right] = \mathbb{E}\left[M_0^2\right] + \sum_{k=0}^{n-1} \mathbb{E}\left[A_{k+1}-A_k\right]
= \mathbb{E}\left[ M_0^2\right] + \sum_{k=0}^{n-1}\mathbb{E}\left[ \left(M_{k+1}-M_k\right)^2 \right]. (**)

This additive (Pythagorean) property of the square of a martingale is useful in applications where there is reasonably good control on each increment separately.

We can also see this final property without the Doob-Meyer decomposition. For a martingale it is not the case that the increments on disjoint intervals are independent. However, following Williams 12.1 [1], disjoint intervals are orthogonal, in the sense that

\mathbb{E}\left[(M_t-M_s)(M_v-M_u)\right]=0,

whenever s\le t\le u\le v. Then, when we square the expression M_n=M_0+\sum M_{k+1}-M_k, and take expectations, all the cross terms vanish, leaving precisely (*).

References

[1] Williams – Probability with Martingales

I also followed the notes I made in 2011/12 while attending Perla Sousi’s course on Advanced Probability, and Arnab Sen’s subsequent course on Stochastic Calculus, though I can’t find any evidence online for the latter now.

Skorohod embedding

Background

Suppose we are given a standard Brownian motion (B_t), and a stopping time T. Then, so long as T satisfies one of the regularity conditions under which the Optional Stopping Theorem applies, we know that \mathbb{E}[B_T]=0. (See here for a less formal introduction to OST.) Furthermore, since B_t^2-t is a martingale, \mathbb{E}[B_T^2]=\mathbb{E}[T], so if the latter is finite, so is the former.

Now, using the strong Markov property of Brownian motion, we can come up with a sequence of stopping times 0=T_0, T_1, T_2,\ldots such that the increments T_k-T_{k-1} are IID with the same distribution as T. Then 0,B_{T_1},B_{T_2},\ldots is a centered random walk. By taking T to be the hitting time of \{-1,+1\}, it is easy to see that we can embed simple random walk in a Brownian motion using this approach.

p1020956_compressedEmbedding simple random walk in Brownian motion.

The Skorohod embedding question asks: can all centered random walks be constructed in this fashion, by stopping Brownian motion at a sequence of stopping time? With the strong Markov property, it immediately reduces the question of whether all centered finite-variance distributions X can be expressed as B_T for some integrable stopping time T.

The answer to this question is yes, and much of what follows is drawn from, or at least prompted by Obloj’s survey paper which details the problem and rich history of the many approaches to its solution over the past seventy years.

Applications and related things

The relationship between random walks and Brownian motion is a rich one. Donsker’s invariance principle asserts that Brownian motion appears as the scaling limit of a random walk. Indeed, one can construct Brownian motion itself as the limit of a sequence of consistent random walks with normal increments on an increasingly dense set of times. Furthermore, random walks are martingales, and we know that continuous, local martingales can be expressed as a (stochastically) time-changed Brownian motion, from the Dubins-Schwarz theorem.

The Skorohod embedding theorem can be used to prove results about random walks with general distribution by proving the corresponding result for Brownian motion, and checking that the construction of the sequence of stopping times has the right properties to allow the result to be carried back to the original setting. It obviously also gives a coupling between a individual random walk and a Brownian motion which may be useful in some contexts, as well as a coupling between any pair of random walks. This is useful in proving results for random walks which are much easier for special cases of the distribution. For example, when the increments are Gaussian, or when there are combinatorial approaches to a problem about simple random walk. At the moment no aspect of this blog schedule is guaranteed, but I plan to talk about the law of the iterated logarithm shortly, whose proof is approachable in both of these settings, as well as for Brownian motion, and Skorohod embedding provides the route to the general proof.

At the end, we will briefly compare some other ways to couple a random walk and a Brownian motion.

Adding extra randomness

One thing we could do is sample a copy of X independently from the Brownian motion, then declare T= \tau_{X}:= \inf\{t\ge 0: B_t=X\}, the hitting time of (random value) X. But recall that unfortunately \tau_x has infinite expectation for all non-zero x, so this doesn’t fit the conditions required to use OST.

Skorohod’s original method is described in Section 3.1 of Obloj’s notes linked above. The method is roughly to pair up positive values taken by X appropriately with negative values taken by X in a clever way. If we have a positive value b and a negative value a, then \tau_{a,b}, the first hitting time of \mathbb{R}\backslash (a,b) is integrable. Then we choose one of these positive-negative pairs according to the projection of the distribution of X onto the pairings, and let T be the hitting time of this pair of values. The probability of hitting b conditional on hitting {a,b} is easy to compute (it’s \frac{-a}{b-a}) so we need to have chosen our pairs so that the ‘probability’ of hitting b (ie the density) comes out right. In particular, this method has to start from continuous distributions X, and treat atoms in the distribution of X separately.

The case where the distribution X is symmetric (that is X\stackrel{d}=-X) is particularly clear, as then the pairs should be (-x,x).

However, it feels like there is enough randomness in Brownian motion already, and subsequent authors showed that indeed it wasn’t necessary to introduce extra randomness to provide a solution.

One might ask whether it’s possible to generate the distribution on the set of pairs (as above) out of the Brownian motion itself, but independently from all the hitting times. It feels like it might be possible to make the distribution on the pairs measurable with respect to

\mathcal{F}_{0+} = \bigcap\limits_{t>0} \mathcal{F}_t,

the sigma-algebra of events determined by limiting behaviour as t\rightarrow 0 (which is independent of hitting times). But of course, unfortunately \mathcal{F}_{0+} has a zero-one law, so it’s not possible to embed non-trivial distributions there.

Dubins solution

The exemplar for solutions without extra randomness is due to Dubins, shortly after Skorohod’s original argument. The idea is to express the distribution X as the almost sure limit of a martingale. We first use the hitting time of a pair of points to ‘decide’ whether we will end up positive or negative, and then given this information look at the hitting time (after this first time) of two subsequent points to ‘decide’ which of four regions of the real interval we end up in.

I’m going to use different notation to Obloj, corresponding more closely with how I ended up thinking about this method. We let

a_+:= \mathbb{E}[X \,|\, X>0], \quad a_- := \mathbb{E}[X\,|\, X<0], (*)

and take T_1 = \tau_{\{a_-,a_+\}}. We need to check that

\mathbb{P}\left( B_{T_1}=a_+\right) = \mathbb{P}\left(X>0\right),

for this to have a chance of working. But we know that

\mathbb{P}\left( B_{T_1}=a_+\right) = \frac{a_+}{a_+-a_-},

and we can also attack the other side using (*) and the fact that \mathbb{E}[X]=0, using the law of total expectation:

0=\mathbb{E}[X]=\mathbb{E}[X\,|\, X>0] \mathbb{P}(X>0) + \mathbb{E}[X\,|\,X<0]\mathbb{P}(X<0) = a_+ \mathbb{P}(X>0) + a_- \left(1-\mathbb{P}(X>0) \right),

\Rightarrow\quad \mathbb{P}(X>0)=\frac{a_+}{a_+-a_-}.

Now we define

a_{++}=\mathbb{E}[X \,|\, X>a_+],\quad a_{+-}=\mathbb{E}[X\,|\, 0<X<a_+],

and similarly a_{-+},a_{--}. So then, conditional on B_{T_1}=a_+, we take

T_2:= \inf_{t\ge T_1}\left\{ B_t\not\in (a_{+-},a_{++})  \right\},

and similarly conditional on B_{T_1}=a_-. By an identical argument to the one we have just deployed, we have \mathbb{E}\left[B_{T_2} \,|\,\mathcal{F}_{T_1} \right] = B_{T_1} almost surely. So, although the a_{+-+} notation now starts to get very unwieldy, it’s clear we can keep going in this way to get a sequence of stopping times 0=T_0,T_1,T_2,\ldots where B_{T_n} determines which of the 2^n regions of the real line any limit \lim_{m\rightarrow\infty} B_{T_m} should lie in.

A bit of work is required to check that the almost sure limit T_n\rightarrow T is almost surely finite, but once we have this, it is clear that B_{T_n}\rightarrow B_T almost surely, and B_T has the distribution required.

Komlos, Major, Tusnady coupling

We want to know how close we can make this coupling between a centered random walk with variance 1, and a standard Brownian motion. Here, ‘close’ means uniformly close in probability. For large times, the typical difference between one of the stopping times 0,T_1,T_2,\ldots in the Skorohod embedding and its expectation (recall \mathbb{E}[T_k]=k) is \sqrt{n}. So, constructing the random walk S_0,S_1,S_2,\ldots from the Brownian motion via Skorohod embedding leads to

\left |S_k - B_k \right| = \omega(n^{1/4}),

for most values of k\le n. Strassen (1966) shows that the true scale of the maximum

\max_{k\le n} \left| S_k - B_k \right|

is slightly larger than this, with some extra powers of \log n and \log\log n as one would expect.

The Komlos-Major-Tusnady coupling is a way to do a lot better than this, in the setting where the distribution of the increments has a finite MGF near 0. Then, there exists a coupling of the random walk and the Brownian motion such that

\max_{k\le n}\left|S_k- B_k\right| = O(\log n).

That is, there exists C such that

\left[\max_{k\le n} \left |S_k-B_k\right| - C\log n\right] \vee 0

is a tight family of distributions, indeed with uniform exponential tail. To avoid digressing infinitely far from my original plan to discuss the proof of the law of iterated logarithm for general distributions, I’ll stop here. I found it hard to find much coverage of the KMT result apart from the challenging original paper, and many versions expressed in the language of empirical processes, which are similar to random walks in many ways relevant to convergence and this coupling, but not for Skorohod embedding. So, here is a link to some slides from a talk by Chatterjee which I found helpful in getting a sense of the history, and some of the modern approaches to this type of normal approximation problem.

Fair games and the martingale strategy III

Gambler’s Ruin

Continuing directly from the previous post, the nicest example of the optional stopping theorem we developed there is to example a simple random walk constrained between two values, say 0 and N. This represents an idealised gambling situation, where the gambler stops playing either when they reach some pre-agreed profit, or when they go bankrupt. We assume that we start at level k, for k = 1,2,…,N-1.

Naturally, we want to know the probabilities of winning (ie getting to N) and losing (ie going bankrupt). We could set this up by conditioning on the first step. Let p_k be the probability of winning starting from level k. Then we must have

p_k= \frac12 p_{k+1}+\frac12 p_{k-1},\quad k=1,\ldots,N-1, (*)

with the obvious boundary conditions p_0=0, p_N=1. In an ideal world, we just know how to solve second order difference equations like (*). Well, actually it isn’t too hard, because we can see from (*) directly that

p_{k+1}-p_k = p_k-p_{k-1},

and so p_k is a linear function of k, and so p_k = k/N follows pretty much immediately.

But, we can also use OST profitably. Let T be the time at which we first hit 0 or N. It’s intuitively clear that this should have finite expectation, since the problems you might encounter with just the hitting time of a single level shouldn’t apply. Or you can consider the expected number of steps before you see N ups or downs in a row, which certainly provides an upper bound on T. This random number of steps is sort of geometric (at least, can be upper bounded by a geometric RV) and so has finite expectation. So can apply OST to X at T, and we have

\mathbb{E}[X_T] = N\cdot \mathbb{P}(X_T=N) + 0 \cdot \mathbb{P}(X_T=0) = \mathbb{E}[X_0]=k,

from which we also derive p_k=k/N.

The reason we talk about gambler’s ruin is by considering the limit N\rightarrow\infty with k fixed. After a moment’s thought, it’s clear we can’t really talk about stopping the process when we hit infinity, since that won’t happen at any finite time. But we can ask what’s the probability that we eventually hit zero. Then, if we imagine a barrier at level N, the probability that we hit 0 at some point is bounded below by the probability that we hit 0 before we hit level N (given that we know we hit either zero or level N with probability one), and this is \frac{N-k}{N}, and by choosing N large enough, we can make this as close to 1 as we want. So the only consistent option is that the probability of hitting 0 at some point is one. Hence gambler’s ruin. With probability one, ruin will occur. There’s probably a moral lesson hiding there not especially subtly.

A problem about pricing options

So the deal here seems to be that if you just care about your average, it doesn’t matter how to choose to play a sequence of fair games. But what if you care about something other than your average? In any real setting, we maybe care about slightly more than this. Suppose I offer you a bet on a coin toss: you get £3 if it comes up heads, and I get £1 if it comes up tails. Sounds like a good bet, since on average you gain a pound. But what about if you get £10,003 if it comes up heads and I get £10,001 if it comes up tails? I’m guessing you’re probably not quite so keen now.

But if you were an international bank, you might have fewer reservations about the second option. My intention is not to discuss whether our valuation of money is linear here, but merely to offer motivation for the financial option I’m about to propose. The point is that we are generally risk-averse (well, most of us, most of the time) and so we are scared of possible large losses, even when there is the possibility of large profits to balance it out.

Let’s assume we have our simple random walk, and for definiteness let’s say it starts at £1. Suppose (eg as a very niche birthday present) we have the following opportunity: at any point between now and time t=5, we have the right to buy one unit of the stock for £2.

We want to work out how much this opportunity, which from now on I’m going to call an option, is worth on average. Note that now it does seem that when we choose to cash in the option will have an effect on our return, and so we will have to include this in the analysis.

Note that, once we’ve bought a unit of the stock, we have an asset which is following a simple random walk (ie sequential fair games) and so from this point on its expected value remains unchanged. So in terms of expectation, we might as well sell the stock at the same moment we buy it. So if we cash in the option when the stock is currently worth £X, we will on average have a return of £(X-2). This means that we’ll only ever consider exercising our option if the current value of the stock is greater than £2. This narrows down our strategy slightly.

This sort of option minimises the risk of a large loss, since the worst thing that happens is that you never choose to exercise your option. So if you actually paid for the right to have this option, that cost is the largest amount you can lose. In the trading world, this type of opportunity is called an American option.

The trick here is to work backwards in time, thinking about strategies. If at time t=4, the stock is worth £1, then the best that can happen is that it’s worth £2 at time t=5, and this still gains you no wealth overall. Similarly if it’s worth £0 at time t=3. So we’ve identified a region where, if the stock value enters this region, we might as well rip up our contract, because we definitely aren’t going to gain anything. Remember now that we’ve also said you won’t ever cash in if the stock’s value is at most £2, because you don’t gain anything on average.

Now suppose that the stock has value £3 at time t=4. There’s no danger of it ever getting back below £2 during the lifetime of the option, so from now on your potential return is following the trajectory of a simple random walk, ie a fair game. So on average, it makes no difference whether you cash in now, or wait until t=5, or some combination of the two. The same argument holds if the stock has value £4 at time t=3 or time t=4, and so we can identify a region where you might as well cash in.

American Option 1

What about the final region? If the stock value is greater than £2, but not yet in the definitely-cash-in area, what should you do? Well, if you think about it, the value of the stock is a fair game. But your return should be better than that, because the stock price doesn’t take account of the fact that you wouldn’t buy in (and make a loss overall) if the value drops below £2. So at this stage, your future options are better than playing a fair game, and so it doesn’t make sense (in terms of maximising your *average*) to cash in.

Now we can actually work backwards in time to establish how much any starting value is worth under this optimal strategy. We can fill in the values in the ‘doomed’ area (ie all zeros) and on the ‘cash in now’ area (ie current value minus 2), and construct backwards using the fact that we have a random walk.

American Option 2

The final answer ends up being 7/16 if the stock had value £1 at time 0. Note that the main point here is that working out the qualitative form of the strategy was the non-trivial part. Once we’d done that, everything was fairly straightforward. I claim that this was a reasonably fun adjustment to the original problem, but have minimal idea whether pricing options is in general an interesting thing to do.

Anyway, I hope that provided an interesting overview to some of the topics of interest within the question of how to choose strategies for games based on random processes.

Fair games and the martingale strategy II

Optional Stopping

We continue directly from the end of the last post, where I was talking about how to play sequences of fair games, and whether by playing cunningly (including choosing when to stop playing cunningly) you can end up with an ‘unfair’ game overall. (Ie where you make a profit or a loss on average.) We gave two examples. First, the martingale strategy, where on a sequence of fair games you double your stake each time you lose. The result is that you win back your original stake at some point with probability one, but possibly accumulate huge temporary losses along the way. In the second game, you follow the path of a simple random walk from zero until it hits one, and then cash in. Here we observe that the time until this happens is almost surely finite, but has infinite expectation.

There’s another possible problem. It seems ridiculous, but suppose we could look into the future. Then our strategy for the random walk might be something like: check in advance what will happen in the first ten steps, and stop playing whenever we get to the moment which we know is the maximum value the walk will attain. Well then, sometimes the walk will never go above zero, in which case we will stop playing at the very start, and sometimes the walk will go above zero, in which case we make a positive amount. So overall, our mean return must be positive. Obviously if we have the option to adjust our stakes, this is completely ridiculous, because we would bet high (infinitely high?) if we knew we were about to win, and nothing if we were about to lose. So, obvious though it seems, we should emphasise that we mustn’t be allowed to look into the future!

The optional stopping theorem says that looking into the future, and these two problems already mentioned are essentially all that can go wrong. To say anything more interesting, at this point we really do need a little bit of notation.

In general, a sequence of fair games of this kind is called a martingale. The origin of the word is fairly unclear – see this unexpectedly comprehensive article. The martingale will be something like X_0,X_1,X_2,\ldots, representing the wealth (or whatever) at discrete time-steps. The key property is the fair game property, which says that whatever has happened up to time k, the next game is fair. That is:

\mathbb{E}[X_{k+1}-X_k \,|\,\text{any event involving }X_0,\ldots,X_k] = 0. (*)

Note that in any of the situations we are describing, X should describe our wealth, rather than the underlying process. In the random walk example, these are the same, but in the martingale strategy suggestion, X_k should be our wealth after the kth game, rather than anything directly recording the outcomes of the coin tosses.

If we allow X_0 to be random (and of course, being always equal to zero is a special case of being random…) we can then add up an initial sequence of such equations to obtain

\mathbb{E}[X_k]=\mathbb{E}[X_k-X_{k_1}] + \ldots + \mathbb{E}[X_1-X_0] + \mathbb{E}[X_0]=\mathbb{E}[X_0]. (**)

That is, if we play any sequence of fair games a fixed finite number of times, we have overall a fair game. (In the original strategy, we have a martingale, irrespective of the precise rule we use to choose how much we invest on each coin toss.) But what if we stop the process at a time determined by the current knowledge? (ie without looking into the future.)

Let’s call an example of such a random time T, and this property that we aren’t allowed to look into the future is described technically as the stopping time property. A proper setup would use more notation and fewer words at (*), but even without sigma-algebras, we can say that T is a stopping time if deciding whether T=k depends only on X_0,X_1,\ldots,X_k, and not on later actions.

Informal Proof

To show the optional stopping theorem, the key idea is that if you want to stop at time T, one option is to keep playing beyond time T with zero stakes. Thus we have a fair game at all times, even after T. We write this as X_{T\wedge k}, where \wedge means ‘minimum’, so that if k>T, the process stays constant.

Since X_{T\wedge k} is a martingale, we can invoke (**),

\mathbb{E}[X_{T\wedge k}] = \mathbb{E}[X_0].

Now what happens if we take k to be very large? How well does this truncated average approximate \mathbb{E}[X_T] itself?

This is where we want to return to our assumptions about what might make this go wrong. Let’s say that T has finite expectation, and that there is some absolute bound on how large an increment can be, say C. Then, whenever T\le k, we have X_T=X_{T\wedge k}. And when T>k, we have

|X_T - X_{T\wedge k}| = |X_T-X_k| \le C(T-k).

Therefore

|\mathbb{E}[X_T]-\mathbb{E}[X_0]|= |\mathbb{E}[X_T] - \mathbb{E}[X_{T\wedge k}] | \le C \mathbb{E}[(T-k)\vee 0], (***)

where we take the final expectation only across T-k when this quantity is positive, since this is the only case which contributes to the left hand side.

Now we need to show that by choosing k large enough, we can make the RHS very small. Obviously we don’t have a chance of doing this if C is not finite! With a bit of thought, we can see that \mathbb{E}[(T-k)\vee 0]\ge \mathbb{E}[T] - k, and so we also don’t have a chance of doing this if \mathbb{E}[T]=\infty. But if \mathbb{E}[T]<\infty, then \sum_{\ell\ge 1} \ell \mathbb{P}(T=\ell) <\infty, and so

\sum_{\ell \ge k} \ell \mathbb{P}(T=\ell)\rightarrow 0,\quad \text{as }k\rightarrow\infty,

and so certainly

\mathbb{E}[(T-k)\vee 0] = \sum_{\ell \ge k}(\ell -k)\mathbb{P}(T=\ell) \rightarrow 0.

But (***) holds for all values of k, and so the only consistent option is that

\mathbb{E}[X_T]=\mathbb{E}[X_0].

There are a couple more combinations of conditions (mostly involving relaxing one of these slightly, and substantially strengthening the other) which also work, but this seems like the more natural form. For a full formal statement, there are many resources available, and the Wikipedia page, for example, isn’t too bad. In the mists of history, I wrote about some of these topics more formally, but maybe less helpfully, since I’d known the theory myself for about a week.

Fair games and the martingale strategy I

I went back to my school a couple of weeks ago and gave a talk. I felt I’d given various incarnations of a talk on card-shuffling too many times, so it was time for a new topic. The following post (and time allowing, one or two more) is pretty much what I said.

The Martingale Strategy

Suppose we bet repeatedly on the outcome of tossing a fair coin. Since it’s November, my heart is set on buying an ice cream that costs £1, so my aim is to win this amount from our game. My strategy is this:

First, I bet £1. If I win, then that’s great, because I now have made exactly enough profit to buy the ice cream. If I lose, then I play again, and this time I bet £2. Again, if I win, then my total profit is £2-£1 = £1, so I stop playing and buy the ice cream. If I lose, then I play a third time, again doubling my stake. So if I win for the first time on the seventh go, my overall profit will be

£64 – (£1+£2+£4+£8+£16+£32) = £1,

and it’s clear that this can be continued and I will eventually win a round, and at this point my total profit will be £1. So I will always eventually be able to buy my ice cream.

But, there’s nothing special about the value £1, so I could replace the words ‘ice cream’ with ‘private tropical island’, so why am I still here in the UK on a wet Monday when I could be on my beach lounger?

There are some fairly obvious reasons why the strategy I’ve described is not actually a fail-safe way to make a profit. For a start, although with probability one a head will come up eventually, there is a small positive chance that the first 200 rolls will all be tails. At this point, I would have accrued a debt of roughly 2^{200} pounds, and this is slightly more than the number of atoms in the universe. All this for an ice cream?

So there are major problems carrying out this strategy in a finite world. And of course, it’s no good if we stop after a very large but finite number of turns, because then there’s always this very small chance that we’ve made a very large loss, which is bad, partly because we can’t have the ice cream, but also because it exactly cancels out the chance of making our £1 profit, and so our overall average profit is exactly zero.

Though I’ve set this up in an intentionally glib fashion, as so often is the case, we might have stumbled across an interesting mathematical idea. That is, if we play a fair game a finite number of times, we have a fair game overall, meaning our overall average profit is zero. But if we are allowed to play a potentially infinite number of times, then it’s not clear how to define our overall ‘average’ profit, since we feel it ought to be zero, as an extension of the finite case, but also might be positive, because it ends up being £1 with probability one.

It’s tempting at this stage to start writing statements like

1 \times 1 + (-\infty) \times 0=0 ,

to justify why this might have come about, where we consider the infinitely unlikely event that is infinitely costly. But this is only convincing at the most superficial level, and so it makes more sense to think a bit more carefully about under exactly what circumstances we can extend our observation about the overall fairness of a finite sequence of individual fair games.

A second example

The previous example was based upon a series of coin tosses, and we can use exactly the same source of randomness to produce a simple random walk. This is a process that goes up or down by 1 in each time step, where each option happens with probability ½, independently of the history.

We could avoid the requirement to deal with very large bets by always staking £1, and then cashing in the first time we have a profit of £1. Then, if we start the random walk at zero, it models our profit, and we stop the first time it gets to 1. It’s not obvious whether we hit 1 with probability one. Let’s show this.

In order to hit some positive value k, the random walk must pass through 1, 2, and so on, up to (k-1) and then finally k. So \mathbb{P}(\text{hit k}) = [\mathbb{P}(\text{hit 1})]^k. And similarly for negative values. Also, the probability that we return to zero is the same as the probability that we ever hit 1, since after one time-step they are literally the same problem (after symmetry). So, if the probability of hitting 1 is p<1, then the number of visits to zero is geometric (supported on 1,2,3,…) with parameter p, and so

\mathbb{E}[\text{visits to k}] = \mathbb{E}[\text{visits to zero}] \times \mathbb{P}(\text{hit k})=(1+1/p) \times p^{|k|} = (p+1)p^{|k|-1}.

Thus, when we sum over all values of k, we are summing a pair of geometric series with exponent <1, and so we get a finite answer. But if the expected number of visits to anywhere (ie the sum across all places) is finite, this is clearly ridiculous, since we are running the process for an infinite time, and at each time-step we must be somewhere! So we must in fact have p=1, and thus another potential counter-example to the claim that a sequence of fair games can sometimes be unfair.

We might have exactly the same set of practical objections, such as this method requiring arbitrarily large liquidity (even though it doesn’t grow exponentially fast so doesn’t seem so bad).

What will actually turn out to be useful is that although the bets are now small, the average time until we hit 1 is actually infinite. Remember that, even though most things we see in real life don’t have this property, it is completely possible for a random variable to take finite values yet have infinite expectation.

Notes on the Martingale Strategy

There’s no reason why the originally proposed strategy had to be based upon fair coin tosses. This strategy might work in a more general setting, where the chance of winning on a given turn is not ½, or is not even constant. So long as at each stage you bet exactly enough that, if you win, you recoup all your losses so far, and one extra pound, this has the same overall effect.

Of course, we need to check that we do eventually win a round, which is not guaranteed if the probability of winning (conditional on not having yet won) decays sufficiently fast. If we let p_k be the probability of winning on turn k, given that we haven’t previously won, then we require that the probability of never winning \prod_{k\ge 1}(1-p_k)=0. By taking logs and taking care of the approximations, it can be seen that the divergence or otherwise of \sum p_k determines which way this falls.

In the next post, we’ll talk about how the two problems encountered here, namely allowing large increments, and considering a stopping time with infinite expectation are exactly the two cases where something can go wrong. We’ll also talk about a slightly different setting, where the choice of when to stop playing becomes a bit more dynamic and complicated.

Azuma-Hoeffding Inequality

It’s (probably) my last Michaelmas term in Oxford, at least for the time being, and so also the last time giving tutorials on either of the probability courses that students take in their first two years. This time, I’m teaching the second years, and as usual the aim of the majority of the first half of the course is to acquire as sophisticated an understanding as possible of the Central Limit Theorem. I feel a key step is appreciating that CLT tells you about the correct scaling for the deviations from the mean of these partial sums of IID random variables. The fact that these deviations on this correct scaling converge in law to a normal distribution, irrespective (apart from mild conditions) on the underlying distribution, is interesting, but should be viewed as a secondary, bonus, property.

Emphasising the scaling of deviations in CLT motivates the next sections of this (or any) course. We develop tools like Markov’s inequality to control the probability that a random variable is much larger than its expectation, and experiment with applying this to various functions of the random variable to get stronger bounds. When the moment generating function exists, this is an excellent choice for this analysis. We end up with a so-called Chernoff bound. For example, we might consider the probability that when we toss N coins, at least a proportion ¾ are Heads. A Chernoff bound says that this probability decays exponentially in N.

One direction to take is to ask how to control precisely the parameter of this exponential decay, which leads to Cramer’s theorem and the basis of the theory of Large Deviations. An alternative direction is to observe that the signed difference between the partial sums of independent random variables and their means is an example of a martingale, albeit not a very interesting one, since in general the increments of a martingale are not independent. So we might ask: under what circumstances can we show exponential tail bounds on the deviation of a martingale from its mean (that is, its initial value) at a fixed (perhaps large) time?

Azuma-Hoeffding inequality

The following result was derived and used by various authors in the 60s, including Azuma and Hoeffding (separately), but also others.

Let X_0,X_1,X_2,\ldots be a martingale with respect to some filtration, and we assume that the absolute value of each increment |X_i-X_{i-1}| is bounded almost surely by some c_i<\infty. Then, recalling that \mathbb{E}[X_n|\mathcal{F}_0]=X_0, we have

\mathbb{P}(X_n \ge X_0+t) \le \exp\left( -\frac{t^2}{2\sum_{i=1}^n c_i^2}\right).

Proof

We apply a Chernoff argument to each increment. First, observe that for Y a distribution supported on [-1,1] with mean zero, by convexity \mathbb{E}[e^{tY}] is maximised by taking Y equal to +1 and -1 each with probability ½. Thus

\mathbb{E}[e^{tY}]\le \frac12 e^t + \frac 12 e^{-t}=\cosh(t) \le e^{-t^2/2},

where the final inequality follows by directly comparing the Taylor series.

We’ll use this shortly. Before that, we start the usual argument for a Chernoff bound on X_n-X_0.

\mathbb{P}(X_n-X_0\ge t) = \mathbb{P}(e^{\theta(X_n-X_0)}\ge e^{\theta t})\le e^{-\theta t} \mathbb{E}[e^{\theta(X_n-X_0)}]

= e^{-\theta t} \mathbb{E}[\mathbb{E}[e^{\theta((X_n-X_{n-1}) +X_{n-1}-X_0)} | \mathcal{F}_{n-1}]]

= e^{-\theta t} \mathbb{E}[e^{\theta(X_{n-1}-X_0)} \mathbb{E}[e^{\theta(X_n-X_{n-1})}|\mathcal{F}_{n-1}] ],

and our preliminary result allows us to control this inner expectation

\le e^{-\theta t} e^{\theta^2c_n^2/2} \mathbb{E}[e^{\theta(X_{n-1}-X_0)}].

So now we can apply this inductively to obtain

\mathbb{P}(X_n-X_0\ge t) \le e^{-\theta t+ \theta^2 \sum_{i=1}^n c_i^2}.

Finally, as usual in such an argument, we need to choose a sensible value of the free parameter \theta, and naturally we want to choose it to make this RHS as small as possible, which is achieved when \theta = \frac{t}{\sum_{i=1}^n c_i^2}, and leads exactly to the statement of the inequality.

Applications

Unsurprisingly, we can easily apply this to the process of partial sums of IID random variables with mean zero and bounded support, to recover a Chernoff bound.

A more interesting example involves revealing the state (ie open or closed) of the edges of an Erdos-Renyi graph one at a time. We need to examine some quantitative property of the graph which can’t ever be heavily influenced by the presence or non-presence of a single given edge. The size of the largest clique, or the largest cut, are good examples. Adding or removing an edge can change these quantities by at most one.

So if we order the edges, and let the filtration \mathcal{F}_k be generated by the state of the first k edges in this ordering, then X_k=\mathbb{E}[\text{max cut}| \mathcal{F}_k] is a martingale. (A martingale constructed backwards in this fashion by conditioning a final state on a filtration is sometimes called a Doob martingale.) Using A-H on this shows that the deviations from the mean are of order \sqrt{N}, where N is the size of the graph. In the sparse case, it can be justified fairly easily that the maximum cut has size \Theta(N), since for example there will always be some positive proportion of isolated vertices. However, accurate asymptotics for the mean of this quantity seem (at least after a brief search of the literature – please do correct me if this is wrong!) to be unknown. So this might be an example of the curious situation where we can control the deviations around the mean better than the mean itself!

Beyond bounded increments

One observation we might make about the proof is that it is tight only if all the increments X_i-X_{i-1} are supported on \{-c_i,+c_i\}, which is stronger than demanding that the absolute value is bounded. If in fact we have X_i-X_{i-1}\in[-d_i,c_i] almost surely, then, with a more detailed preliminary lemma, we can have instead a bound of \exp\left( -\frac{2t^2}{\sum_{i=1}^n (c_i+d_i)^2} \right).

While it isn’t a problem in these examples, in many settings the restriction to bounded increments is likely to be the obstacle to applying A-H. Indeed, in the technical corner of my current research problem, this is exactly the challenge I faced. Fortunately, at least in principle, all is not necessarily lost. We might, for example, be able to establish bounds (c_i) as described, such that the probability that any |X_i-X_{i-1}| exceeds its c_i is very small. You could then construct a coupled process (Y_i), that is equal to X_i whenever the increments are within the given range, and something else otherwise. For Y to fit the conditions of A-H, the challenge is to ensure we can do this such that the increments remain bounded (ie the ‘something else’ also has to be within [-c_i,c_i] ) and also that Y remains a martingale. This total probability of a deviation is bounded above by the probability of Y experiencing that deviation, plus the probability of Y and X decoupling. To comment on the latter probability is hard in general without saying a bit more about the dependence structure in X itself.

Reflected Brownian Motion

A standard Brownian motion is space-homogeneous, meaning that the behaviour of B_{T+t}-B_T does not depend on the value of B_T. By Donsker’s Theorem, such a Brownian motion is also the limit in a process space of any homogeneous random walk with zero-drift and constant variance, after suitable rescaling.

In many applications, however, we are interested in real-valued continuous-time Markov processes that are defined not on the whole of the real line, but on the half-line \mathbb{R}_{\ge 0}. So as BM is the fundamental real-valued continuous-time Markov process, we should ask how we might adjust it so that it stays non-negative. In particular, we want to clarify uniqueness, or at least be sure we have found all the sensible ways to make this adjustment, and also to consider how Donsker’s Theorem might work in this setting.

We should consider what properties we want this non-negative BM to have. Obviously, it should be non-negative, but it is also reasonable to demand that it looks exactly like BM everywhere except near 0. But since BM has a scale-invariance property, it is essentially meaningful to say ‘near 0’, so we instead demand that it looks exactly like BM everywhere except at 0. Apart from this, the only properties we want are that it is Markov and has continuous sample paths.

A starting point is so-called reflected Brownian motion, defined by X_t:=|B_t|. This is very natural and very convenient for analysis, but there are some problems. Firstly, this has the property that it looks like Brownian motion everywhere except 0 only because BM is space-homogeneous but also symmetric, in the sense that B_t\stackrel{d}{=}-B_t. This will be untrue for essentially any other process, so as a general method for how to keep stochastic processes positive, this will be useless. My second objection is a bit more subtle. If we consider this as an SDE, we get

dX_t=\text{sign}(B_t)dB_t.

This is a perfectly reasonable SDE but it is undesirable, because we have a function of B as coefficient on the RHS. Ideally, increments of X would be a function of X, and the increments of B, rather than the values of B. That is, we would expect X_{t+\delta t}-X_t to depend on X_t and on (B_{t+s}-B_t, 0\le s\le \delta t), but not on B_t itself, as that means we have to keep track of extra information while constructing X.

So we need an alternative method. One idea might be to add some non-negative process to the BM so that the sum stays non-negative. If this process is deterministic and finite, there there is some positive probability that the sum will eventually be negative, so this won’t do. We are looking therefore so a process which depends on the BM. Obviously we could take \max(-B_t,0), but this sum would then spend macroscopic intervals of time at 0, and these intervals would have the Raleigh distribution (for Brownian excursions) rather than the exponential distribution, hence the process given by the sum would not be memoryless and Markov.

The natural alternative is to look for an increasing process A_t, and then it makes sense to talk about the minimal increasing process that has the desired property. A moment’s thought suggests that A_t=-min_{s\le t}B_t satisfies this. So we have the decomposition

B_t=-A_t+S_t,

where S_t is the height of B above its running minimum. So S is an ideal alternative definition of reflecting BM. In particular, when B is away from its minimum, dB_t=dS_t, so this has the property that it evolves exactly as the driving Brownian motion.

What we have done is to decompose a general continuous process into the sum of a decreasing continuous process and a non-negative process. This is known as the Skorohod problem, and was the subject of much interest, even in the deterministic case. Note that process A has the property that it is locally constant almost everywhere, and is continuous, yet non-constant. Unsurprisingly, since A only changes when the underlying BM is 0, A is continuous with respect to the local time process at 0. In fact, A is the local time process of the underlying Brownian motion, by comparison with the construction by direct reflection.

One alternative approach is to look instead at the generator of the process. Recall that the generator of a process is an operator on some space of functions, with \mathcal{L}f giving the infinitissimal drift of f(X_t). In the case of Brownian motion, the generator (\mathcal{L}f)(x)=\frac12 f''(x) for bounded smooth functions f. This is equivalent to saying that

f(X_t)-f(X_0)-\int_0^t \frac12 f''(X_s)ds (*)

is a martingale. This must hold also for reflected Brownian motion, whenever x is greater than 0. Alternatively, if the function f is zero in a small neighbourhood of 0, it should have the same generator with respect to reflected BM. Indeed, for a general smooth bounded function f, we can still consider the expression (*) with respect to reflected BM. We know this expression behaves as a martingale except when X is zero. If f'(0)>0, and T is some hitting time of 0, then f(X_{T+\delta T})-f(X_T)\ge 0, hence the expression (*) is a submartingale. So if we restrict attention to functions with f'(0)=0, the generator remains the same. Indeed, by patching together all such intervals, it can be argued that even if f'(0) is not zero,

f(X_t)-f(X_0)-\int_0^t \frac12 f''(X_s)ds - f'(0)A_t

is a martingale, where A is the local time process at zero.

I was aware when I started reading about this that there was another family of processes called ‘Sticky Brownian Motion’ that shared properties with Reflected BM, in that it behaves like standard BM away from zero, but is also constrained to the non-negative reals. I think this will get too long if I also talk about that here, so that can be postponed, and for now we consider reflected BM as a limit of reflected (or other) random walks, bearing in mind that there is at least one other candidate to be the limit.

Unsurprisingly, if we have a family of random walks constrained to the non-negative reals, that are zero-drift unit-variance away from 0, then if they converge as processes, the limit is Brownian away from zero, and non-negative. Note that “away from 0” means after rescaling. So the key aspect is behaviour near zero.

What is the drift of reflected BM at 0? We might suspect it is infinite because of the form of the generator, but we can calculate it directly. Given X_0=0, we have:

\frac{\mathbb{E}X_t}{t}=\frac{\mathbb{E}|B_t|}{t}=\frac{\sqrt{t}\mathbb{E}|B_1|}{t},

so letting t\rightarrow 0, we see indeed that the drift is infinite at 0.

For convergence of discrete processes, we really need the generators to converge. Typically we index the discrete-time processes by the time unit h, which tends to 0, and b_h(x),a_h(x) are the rescaled drift and square-drift from x. We assume that we don’t see macroscopic jumps in the limit. For the case of simple random walk reflected at 0, it doesn’t matter exactly how we construct the joint limit in h and x, as the drift is uniform on x>0, but in general this does matter. I don’t want to discuss sticky BM right now, so it’s probably easiest to be vague and say that the discrete Markov processes converge to reflected BM so long they don’t spend more time than expected near 0 in the limit, as the title ‘sticky’ might suggest.

The two ways in which this can happen is if the volatility term a_h(x) is too small, in which case the process looks almost deterministic near 0, or if the drift doesn’t increase fast enough. And indeed, this leads to two conditions. The first is straightforward, if a_h(x) is bounded below, in the sense that \liminf_{h,x\rightarrow 0} a_h(x)\ge C>0, then we have convergence to reflected BM. Alternatively, the only danger can arise down those subsequences where a_h(x)\rightarrow 0, so if we have that b_h(x)\rightarrow +\infty whenever h,x,a_h(x)\rightarrow 0, then this convergence also holds.

Next time I’ll discuss what sticky BM means, what it doesn’t mean, why it isn’t easy to double the local time, and how to obtain sticky BM as a limit of discrete random walks in a similar way to the above.

REFERENCES

S. Varadhan – Chapter 16 from a Lecture Course at NYU can be found here.

Enhanced by Zemanta

Lamperti Walks

DSC_2604

The theory of simple random walks on the integer lattice is a classical topic in probability theory. Polya proved in the 1920s that such a SRW on \mathbb{Z}^d is recurrent only for d=1 or 2. The argument is essentially combinatorial. We count the number of possible paths from 0 back to itself and show that this grows fast enough that even with the probabilistic penalty of having a particular long path we will still repeatedly see this event happening. In larger dimensions there is essentially ‘more space’ at large distances, at least comparatively, so a typical walk is more likely to escape into this space.

As Kakutani (of the product martingale theorem) said, and was subsequently quoted as the dedication on every undergraduate pdf about random walks: “A drunk man will find his way home, whereas a drunk bird may get lost forever.”

But transience in some sense a long-distance property. We can fiddle with the transition rates near zero and, so long as we don’t make anything deterministic this shouldn’t affect transience properties. Obviously if we have a (space-)homogeneous nearest-neighbour random walk on the integers with non-zero drift the process will be transient: it drifts towards positive infinity if the drift is positive. But can we have a random walk with non-zero drift, but where the drift tends to zero at large distances fast enough, and the process is still recurrent? What is the correct scaling for the decay of the drift to see interesting effects?

The answers to these questions is seen in the so-called Lamperti random walks, which were a recurring theme of the meeting on Aspects of Random Walks held in Durham this week. Thanks to the organisers for putting on such an excellent meeting. I hadn’t known much about this topic before, so thought it might be worth writing a short note.

As explained above, we consider time-homogeneous random walks. It will turn out that the exact distributions of the increments is not hugely important. Most of the properties we might care about will be determined only by the first two moments, which we define as:

\mu_1(x)=\mathbb{E}[X_{t+1}-X_t | X_t=x],

\mu_2=\mathbb{E}[(X_{t+1}-X_t)^2 | X_t=x].

Note that because the drift will be asymptotically zero, the second term is asymptotically equal to the variance of the increment. It will also turn out that the correct scaling for \mu_1 to see a phase transition is \mu_1(x)\sim \frac{c}{x}.

We begin by seeing how this works in the simplest possible example, from Harris (1952). Let’s restrict attention to a random walk on the non-negative integers, and impose the further condition that increments are +1 or -1. In the notation of a birth-and-death process from a first course on Markov chains, we can set:

p_j:=\mathbb{P}(X_{t+1}=j+1| X_t=j), \quad q_j=1-p_j.

We will set p_j=\frac12 + \frac{c}{2j}. Then a condition for transience is that

1+\frac{q_1}{p_1}+\frac{q_1q_2}{p_1p_2}+\ldots <\infty.

In our special case:

\frac{q_1\ldots q_r}{p_1\ldots p_r}\approx\frac{(r-2c)(r-1-2c)(r-2-2c)\ldots}{r!}\approx \frac{1}{r^{2c}}.

So we can deduce that this sum converges if c>1/2, giving transience. A similar, but slightly more complicated calculation specifies the two regimes of recurrence. If -1/2<=c<=1/2 then the chain is null-recurrent, meaning that the expected time to return to any given state is infinite. If c<-1/2, then it is positive recurrent.

In general, we assume \mu_1(x)\sim \frac{c}{x} and \mu_2(x)\approx s^2. In the case above, obviously s^2=1. The general result is that under mild assumptions on the increment distributions, for instance a (2+\epsilon)-moment, if we define r=-\frac{2c}{s^2}, then the RW is transient if r<-1, positive-recurrent if r>1, and null-recurrent otherwise. This is the main result of Lamperti.

To explain why we have parameterised exactly like this, it makes sense to talk about the more general proof methods, as obviously the direct Markov chain calculation won’t work in general. The motivating idea is that we can deal well with the situation where the drift is zero, so let’s transform the random walk so that the drift becomes zero. A function of a Markov chain that is more stable (in some sense) that the original MC, for analysis at least, is sometimes called a Lyapunov function. Here, the sensible thing is to consider Y_t=X_t^\gamma, for some exponent \gamma>0.

So long as our distributions are fairly well-behaved (eg a finite 2+\epsilon-moment), we can calculate the drift of Y as

\mathbb{E}[Y_{t+1}-Y_t| X_t=x]=\frac{\gamma}{2}x^{\gamma-2}(2c+(1-\gamma)s^2) +o(x^{\gamma-2}).

In particular, taking \gamma=1+r results in a random walk that is ‘almost’ a martingale. Note that the original RW was almost a martingale, in the sense that the drift is asymptotically zero, but now it is zero to second order as well.

To draw any rigorous conclusions, we need to be careful about exactly how precise this approximation is, but we won’t worry about that now. In particular, we need to know whether we can take this approximation over the optional stopping theorem, as this allows us to say:

\mathbb{P}(X\text{ hits }x\text{ before 0})=\mathbb{P}(Y\text{ hits }x^\gamma\text{ before 0})\sim x^{-\gamma}.

This is particularly useful for working out the expected excursion time away from 0, which precisely leads to the condition for null-recurrence.

In his talk, Ostap Hryniv showed that this Lyapunov function analysis can be taken much further, to derive much more precise results about excursions, maxima and ergodicity. Results of Menshikov and Popov from the 90s further specify the asymptotics for the invariant distribution, if it exists, in terms of r.

One cautionary remark I should make is that earlier I implied that once we know the drift of such a random walk is zero, we have recurrence. This is true on \mathbb{Z} with very mild restrictions, but is not necessarily true in higher dimensions. For example, consider the random walk on \mathbb{R}^2, where conditional on X_t, the increment is X_{t+1}-X_t is of length 1 and perpendicular to the vector X_t. The two possible directions are equally likely. The drift is therefore 0 everything, and the second moment is also well-behaved, but note that ||X_t||^2=t^2, just by considering Pythagoras. So in higher dimensions, we have to be a bit more careful, and put restrictions on the covariance structure of the increment distributions.

As a final comment, note that from Lamperti’s result, we can re-derive Polya’s result about SRW in higher dimensions. If we have X_t an SRW on \mathbb{Z}^d, then consider Y_t=||X_t||. By considering a couple of examples in two-dimensions, it is clear that this is not Markov. But the methods we considered above for the Lamperti walks were really martingale methods rather than Markov chain methods. And indeed this process Y has asymptotically zero drift with the right scaling. Here,

c=\frac{1}{2}(1-\frac{1}{d}),\quad s^2=\frac{1}{d},

and so r=d-1, leading to exactly the result we know to be true, that the SRW is transient precisely in three dimensions and higher.

REFERENCES

Harris – First Passage and Recurrence Distributions (1952)

The slides from Ostap Hryniv’s talk, on which this was based, can be found here.

Enhanced by Zemanta

Martingale Convergence

I continue the theme of explaining important bookwork from the Part III course, Advanced Probability, as succinctly as possible. In this post, we consider the convergence properties of discrete time martingales.

1) Theorem: Assume X is a martingale bounded in L^1. Then

X_n\rightarrow X_\infty\in L^1(\mathcal{F}_\infty) a.s.

Remark: The theorem and proof works equally well for X a supermartingale.

Proof: Essentially, we want to reduce convergence of the random variables which make up the martingale to a countable collection of events. We do this by considering upcrossings, which counts the number of times the process alternates from less than a given interval to greater than a given interval. The formal definition will be too wide for this format, so we summarise as

N_n([a,b],X)= the number of disjoint time intervals up to time n in which X goes from [-\infty a) to (b,\infty]. Define N([a,b],X) to be the limit as n increases to infinity.

It is a genuinely easy check that a sequence converges (possibly to \infty) iff this number of upcrossings of any interval with rational bounds is finite. We will show that the martingale almost surely has this property. The key lemma is a bound due to Doob:

Lemma: (b-a)\mathbb{E}[N_n([a,b],X)]\leq \mathbb{E}[(X_n-a)^-]

 Proof: Say S_1<T_1<S_2<T_2,\ldots are the successive hitting times of [-\infty,a),(b,\infty] respectively. So N_n=\inf\{k:T_k\leq n\}. We decompose, abbreviating the number of upcrossings as N.

\sum_{k=1}^n(X_{T_k\wedge n}-X_{S_k\wedge n})=\sum_{k=1}^N(X_{T_k}-X_{S_k})+(X_n-X_{S_{N+1}})1_{\{S_{N+1}\leq n\}}

Now take an expectation of both sides, applying the Optional Stopping Theorem to the bounded stopping times on the LHS. (If we are working with a supermartingale, formally we need to take \mathbb{E}[\cdot|\mathcal{F}_{S_k}] of each summand on LHS to show that they are non-negative, and so taking a further expectation over \mathcal{F}_{S_k} gives the required result.) We obtain:

0\geq (b-a)\mathbb{E}N-\mathbb{E}(X_n-X_{S_{N+1}})1_{\{S_{N+1}\leq n\}}

If S_{N+1}>n then both 1_{\{S_{N+1}\leq n\}}=(X_n-a)^-=0. Otherwise (X_n-a)^-\geq X_{S_{N+1}}-X_n. This complete the proof of the lemma.

Since \mathbb{E}[(X_n-a)^-]\leq \mathbb{E}|X_n|+a<\infty, where this last bound is uniform in by assumption, applying monotone convergence, we get that N([a,b],X) is almost surely finite for every pair a<b\in\mathbb{Q}. Because this set is countable, we can deduce that this holds almost surely for every pair simultaneously. We therefore define X_\infty(w)=\lim X_n(w) when this limit exists, and 0 otherwise. With probability one the limit exists. Fatou’s lemma confirms that X_\infty\in L^1(\mathcal{F}_\infty).

2) We often want to have convergence in L^1 as well. Recall for Part II Probability and Measure (or elsewhere) that

UI + Convergence almost surely is necessary and sufficient for convergence in L^1.

This applies equally well to this situation. Note that for a martingale, this condition is often convenient, because, for example, we know that the set \{\mathbb{E}[X_\infty|\mathcal{F}_n],n\} is UI for any integrable X_\infty.

3) Convergence in L^p is easier to guarantee.

Theorem: i) X a martingale bounded in L^p iff ii) X_n\rightarrow X_\infty in L^p and almost surely iff iii) \exists Z\in L^p s.t. X_n=\mathbb{E}[Z|\mathcal{F}_n] a.s.

Remark: As part of the proof, we will show, as expected, that X_\infty,Z are the same.

Proof: i)->ii) Almost sure convergence follows from the above result applied to the p-th power process. We apply Doob’s inequality about running maxima in a martingale process:

||X_n^*||_p:=||\sup_{m\leq n}X_m||_p\leq \frac{p}{p-1}||X_n||_p

Using this, we see that X_n^*\uparrow X_\infty^*:=\sup|X_k|. Now consider |X_n-X_\infty|\leq 2X_\infty^*\in L^p and use Dominated Convergence to confirm convergence in L^p.

Note that Doob’s L^p inequality can be proven using the same author’s Maximal inequality and Holder.

ii)->iii) As we suspected, we show Z=X_\infty is suitable.

||X_n-\mathbb{E}[X_\infty|\mathcal{F}_n]||_p\stackrel{\text{large }m}{=}||\mathbb{E}[X_m-X_\infty|\mathcal{F}_n]||_p\stackrel{\text{Jensen}}{\leq}||X_m-X_\infty||_p\rightarrow 0

iii)->i) is easy. Z bounded in L^p implies X bounded by a simple application of the triangle inequality in the definition of conditional expectation.

Martingales

Definition

X_n is a stochastic process, integrable and adapted to filtration (\Omega,\mathcal{F},(\mathcal{F}_n),\mathbb{P}). Then X is a martingale if \mathbb{E}[X_n|\mathcal{F}_m]=X_m almost surely whenever n\geq m.

It is natural to think about a martingale as defined in the context of a process evolving in time. Then this definition is very reasonable:

  • Integrable: the entire construction is about taking expectations. So these need to exist.
  • Adapted: X_n can’t be affected by what happens after time n. It must be defined by what has happened up to time n.
  • Expectation condition: if you look at the process at a given time m, the best estimate for what X will be in the future is in fact what it is now. As you might expect, it is sufficient that \mathbb{E}[X_{n+1}|\mathcal{F}_n]=X_n almost surely for each n.

Motivation

There are many situations where the expected change in a variable over a time period is zero, whatever the value of the variable is at the start. For example, gambling. For illustration, assume we are speculating on the outcomes of tossing a coin repeatedly. You might have a complicated strategy, for example ‘double your stake when you lose’ (the so-called martingale strategy), or anything else. But ultimately, you can’t see into the future. So before every coin toss, you have to decide your stake, based on what’s happened up until now, and you will win or lose with equal probability, so your expected gain is 0, regardless of how you make your stake choice. Thus under any strategy determined without looking into the future (called previsible), the process recording your winnings is a martingale. Continue reading