# Doob inequalities and Doob-Meyer decomposition

The first post I wrote on this blog was about martingales, way back in 2012 at a time when I had known what a martingale was for about a month. I now don’t have this excuse. So I’m going to write about a couple of properties of (discrete-time) martingales that came up while adjusting a proof which my thesis examiners suggested could be made much shorter as part of their corrections.

Doob’s submartingale inequality

When we prove that some sequence of processes converges to some other process, we typically want to show that this holds in some sense uniformly over a time-interval, rather than just at some fixed time. We don’t lose much at this level of vagueness by taking the limit process to be identically zero. Then, if the convergent processes are martingales or closely similar, we want to be able to bound $\sup_{k\le n} |Z_k|$ in some sense.

Doob’s submartingale inequality allows us to do this. Recall that a submartingale has almost-surely non-negative conditional increments. You might think of it heuristically as ‘more increasing than a martingale’. If $Z_n$ is a martingale, then $|Z_n|$ is a submartingale. This will be useful almost immediately.

The statement is that for $(Z_n)$ a non-negative submartingale,

$\mathbb{P}\left( \sup_{k\le n} Z_k \ge \lambda\right) \le \frac{\mathbb{E}\left[Z_n\right]}{\lambda}.$

The similarity of the statement to the statement of Markov’s inequality is no accident. Indeed the proof is very similar. We consider whether the event in question happens, and find lower bounds on the expectation of $Z_n$ under both possibilities.

Formally, for ease of notation, let $Z_n^*$ be the running maximum $\sup_{k\le n}Z_k$. Then, we let $T:= n\wedge \inf\{k\le n, M_j\ge \lambda\}$ and apply the optional stopping theorem for submartingales at T, which is by construction at most n. That is

$\mathbb{E}[Z_n]\ge \mathbb{E}[Z_T]=\mathbb{E}\left[Z_T\mathbf{1}_{Z_n^*<\lambda}\right] + \mathbb{E}\left[Z_T \mathbf{1}_{Z_n^*\ge \lambda}\right].$

The first of these summands is positive, and the second is at least $\lambda \mathbb{P}\left( Z_N^* \ge \lambda \right)$, from which the result follows.

We’ve already said that for any martingale $Z_n$, $|Z_n|$ is a submartingale, but in fact $f(Z_n)$ is a submartingale whenever f is convex, and $\mathbb{E}|f(Z_n)|<\infty$ for each n. Naturally, this continues to hold when $Z_n$ is itself a submartingale.

[Note that $Z_n^*$ is also a submartingale, but this probably isn’t as interesting.]

A particularly relevant such function f is $f(x)=x^p$, for p>1. If we take $Z_n$ a non-negative submartingale which is uniformly bounded in $L^p$, then by applying Holder’s inequality and this submartingale inequality, we obtain

$\mathbb{E}\left( \sup_{k\le n}Z_n^p \right) \le \left(\frac{p}{p-1}\right)^p \mathbb{E}\left[ Z_n^p \right].$

Since $Z_n^p$ is a submartingale, then a limit in n on the RHS is monotone, and certainly a limit in n on the LHS is monotone, so we can extend to

$mathbb{E}\left( \sup_{k\le n}Z_\infty^p \right) \le \left(\frac{p}{1-p}\right)^p \mathbb{E}\left[ Z_\infty^p \right].$

Initially, we have to define $\mathbb{E}\left[ Z_\infty^p \right]$ through this limit, but in fact this result, Doob’s $L^p$ inequality, shows that $Z_\infty:= \lim Z_n$ exists almost surely as well.

Naturally, we will often apply this in the case p=2, and in the third of these three sections, we will see why it might be particularly straightforward to calculate $\mathbb{E}\left[Z_\infty^2\right].$

Remark: as in the case of Markov’s inequality, it’s hard to say much if the submartingale is not taken to be non-negative. Indeed, this effect can be seen even if the process is only defined for a single time step, for which the statement really is then Markov’s inequality.

Doob-Meyer decomposition

Unfortunately, most processes are not martingales. Given an discrete-time process $X_n$ adapted to $\mathcal{F}=(\mathcal{F}_n)$, it is a martingale if the conditional expectations of the increments are all almost surely zero. But given a general adapted process $X_n$ which is integrable (so the increments have well-defined finite expectation), we can iteratively construct a new process $M_n$, where the increments are centred versions of $X_n$‘s increments. That is,

$M_{n+1}-M_n:= X_{n+1}-X_n - \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right] = X_{n+1}-\mathbb{E}\left[X_{n+1} \,\big|\, \mathcal{F}_n\right].$ (*)

Then it’s immediately clear from the definition that $M_n$ is a martingale.

There’s a temptation to tie oneself up in knots with the dependence. We might have that increments of the original process $X_n$ depend on the current value of the process. And is it necessarily clear that we can recover the current value of the original process from the current value of $M_n$? Well, this is why we demand that everything be adapted, rather than just Markov. It’s not the case that $M_n$ should be Markov, but it clearly is adapted.

Now we look at the middle expression in (*), and in particular the term we are subtracting, namely the conditional expectation. If we define, in the standard terminology, $A_0=0$ and

$A_{n+1}-A_n:= \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right],$

then we have decomposed the original process $X_n$ as the sum of a martingale $M_n$, and this new process $A_n$. In particular, note that the increment $A_{n+1}-A_n$ given above is adapted to $\mathcal{F}_n$, which is a stronger condition than being adapted to $\mathcal{F}_{n+1}$ as we would expect a priori. This property of the process $(A_n)$ is called predictability (or possibly previsibility).

This decomposition $X_n=X_0+M_n+A_n$ as just defined is called the Doob-Meyer decomposition, and there is a unique such decomposition where $M_n$ is a martingale, and $A_n$ is predictable. The proof of uniqueness is very straightforward. We look at the equalities given above as definitions of $M_n,A_n$, but then work in the opposite direction to show that they must hold if the decomposition holds.

I feel a final heuristic is worthwhile, using the term drift, more normally encountered in the continuous-time setting to describe infinitissimal expected increments. The increments of $A_n$ represent the drift of $X_n$, and the increments of $M_n$ are what remains from $X_n$ after subtracting the drift. In general, the process to be subtracted to turn a non-martingale into a martingale is called a compensator, and the existence or otherwise of such processes is important but challenging for some classes of continuous-time processes.

In particular, note that when $X_n$ is itself a martingale, then $A_n\equiv 0$. However, probably the most useful case is when $X_n$ is a submartingale, as then the drift is always non-negative, and so $A_n$ is almost surely increasing. The converse holds too.

This is relevant because this Doob-Meyer decomposition is obviously only a useful tool for treating $X_n$ if we can handle the two processes $M_n,A_n$ easily. We have tools to bound the martingale term, but this previsible term might in general be tricky, and so the case where $X_n$ is a submartingale is good, as increasing processes are much easier than general processes, since bounding the whole process might involve only bounding the final term in many contexts.

A particularly relevant example is the square of a martingale, that is $X_n=M_n^2$, where $M_n$ is a martingale. By the convexity condition discussed earlier, $X_n$ is a submartingale (provided it is integrable, ie $M_n$ is square-integrable), and so the process $A_n$ in its Doob-Meyer decomposition is increasing. This is often called the (predictable) quadratic variation of $(X_n)$.

This predictable quadratic variation is sometimes denoted $\langle X_n\rangle$. This differs from the (regular) quadratic variation which is defined as the sum of the squares of the increments, that is $[X_n]:= \sum_{k=0}^{n-1} (X_{k+1}-X_k)^2$. Note that this is adapted, but obviously not previsible. The distinction between these two processes is more important in continuous time. There, they are almost surely equal for a continuous local martingale, but not for eg a Poisson process. (For a Poisson process, the PQV is deterministic, indeed linear, while the (R)QV is almost surely equal to the Poisson process itself.) In the discrete time setting, the regular quadratic variation is not relevant very often, while the predictable quadratic variation is useful, precisely because of this decomposition.

Whenever we have random variables which we then centre, there is a standard trick to apply when treating their variance. That is

$A_{n+1}-A_n= \mathbb{E}\left[ M^2_{n+1}-M^2_n \,\big|\, \mathcal{F}_n\right]$
$= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n^2 +M_n^2$
$= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n \mathbb{E}\left[ M_{n+1}\,\big|\, \mathcal{F}_n\right] + M_n^2$
$= \mathbb{E}\left[ \left(M_{n+1}-M_n\right)^2\,\big|\, \mathcal{F}_n\right].$

One consequence is seen by taking an ‘overall’ expectation. Because $M_n^2-A_n$ is a martingale,

$\mathbb{E}\left[M_n^2\right] = \mathbb{E}\left[A_n\right] = \mathbb{E}\left[M_0^2\right] + \sum_{k=0}^{n-1} \mathbb{E}\left[A_{k+1}-A_k\right]$
$= \mathbb{E}\left[ M_0^2\right] + \sum_{k=0}^{n-1}\mathbb{E}\left[ \left(M_{k+1}-M_k\right)^2 \right].$ (**)

This additive (Pythagorean) property of the square of a martingale is useful in applications where there is reasonably good control on each increment separately.

We can also see this final property without the Doob-Meyer decomposition. For a martingale it is not the case that the increments on disjoint intervals are independent. However, following Williams 12.1 [1], disjoint intervals are orthogonal, in the sense that

$\mathbb{E}\left[(M_t-M_s)(M_v-M_u)\right]=0,$

whenever $s\le t\le u\le v$. Then, when we square the expression $M_n=M_0+\sum M_{k+1}-M_k$, and take expectations, all the cross terms vanish, leaving precisely (*).

References

[1] Williams – Probability with Martingales

I also followed the notes I made in 2011/12 while attending Perla Sousi’s course on Advanced Probability, and Arnab Sen’s subsequent course on Stochastic Calculus, though I can’t find any evidence online for the latter now.

# Skorohod embedding

Background

Suppose we are given a standard Brownian motion $(B_t)$, and a stopping time T. Then, so long as T satisfies one of the regularity conditions under which the Optional Stopping Theorem applies, we know that $\mathbb{E}[B_T]=0$. (See here for a less formal introduction to OST.) Furthermore, since $B_t^2-t$ is a martingale, $\mathbb{E}[B_T^2]=\mathbb{E}[T]$, so if the latter is finite, so is the former.

Now, using the strong Markov property of Brownian motion, we can come up with a sequence of stopping times $0=T_0, T_1, T_2,\ldots$ such that the increments $T_k-T_{k-1}$ are IID with the same distribution as T. Then $0,B_{T_1},B_{T_2},\ldots$ is a centered random walk. By taking T to be the hitting time of $\{-1,+1\}$, it is easy to see that we can embed simple random walk in a Brownian motion using this approach.

Embedding simple random walk in Brownian motion.

The Skorohod embedding question asks: can all centered random walks be constructed in this fashion, by stopping Brownian motion at a sequence of stopping time? With the strong Markov property, it immediately reduces the question of whether all centered finite-variance distributions X can be expressed as $B_T$ for some integrable stopping time T.

The answer to this question is yes, and much of what follows is drawn from, or at least prompted by Obloj’s survey paper which details the problem and rich history of the many approaches to its solution over the past seventy years.

Applications and related things

The relationship between random walks and Brownian motion is a rich one. Donsker’s invariance principle asserts that Brownian motion appears as the scaling limit of a random walk. Indeed, one can construct Brownian motion itself as the limit of a sequence of consistent random walks with normal increments on an increasingly dense set of times. Furthermore, random walks are martingales, and we know that continuous, local martingales can be expressed as a (stochastically) time-changed Brownian motion, from the Dubins-Schwarz theorem.

The Skorohod embedding theorem can be used to prove results about random walks with general distribution by proving the corresponding result for Brownian motion, and checking that the construction of the sequence of stopping times has the right properties to allow the result to be carried back to the original setting. It obviously also gives a coupling between a individual random walk and a Brownian motion which may be useful in some contexts, as well as a coupling between any pair of random walks. This is useful in proving results for random walks which are much easier for special cases of the distribution. For example, when the increments are Gaussian, or when there are combinatorial approaches to a problem about simple random walk. At the moment no aspect of this blog schedule is guaranteed, but I plan to talk about the law of the iterated logarithm shortly, whose proof is approachable in both of these settings, as well as for Brownian motion, and Skorohod embedding provides the route to the general proof.

At the end, we will briefly compare some other ways to couple a random walk and a Brownian motion.

One thing we could do is sample a copy of X independently from the Brownian motion, then declare $T= \tau_{X}:= \inf\{t\ge 0: B_t=X\}$, the hitting time of (random value) X. But recall that unfortunately $\tau_x$ has infinite expectation for all non-zero x, so this doesn’t fit the conditions required to use OST.

Skorohod’s original method is described in Section 3.1 of Obloj’s notes linked above. The method is roughly to pair up positive values taken by X appropriately with negative values taken by X in a clever way. If we have a positive value b and a negative value a, then $\tau_{a,b}$, the first hitting time of $\mathbb{R}\backslash (a,b)$ is integrable. Then we choose one of these positive-negative pairs according to the projection of the distribution of X onto the pairings, and let T be the hitting time of this pair of values. The probability of hitting b conditional on hitting {a,b} is easy to compute (it’s $\frac{-a}{b-a}$) so we need to have chosen our pairs so that the ‘probability’ of hitting b (ie the density) comes out right. In particular, this method has to start from continuous distributions X, and treat atoms in the distribution of X separately.

The case where the distribution X is symmetric (that is $X\stackrel{d}=-X$) is particularly clear, as then the pairs should be $(-x,x)$.

However, it feels like there is enough randomness in Brownian motion already, and subsequent authors showed that indeed it wasn’t necessary to introduce extra randomness to provide a solution.

One might ask whether it’s possible to generate the distribution on the set of pairs (as above) out of the Brownian motion itself, but independently from all the hitting times. It feels like it might be possible to make the distribution on the pairs measurable with respect to

$\mathcal{F}_{0+} = \bigcap\limits_{t>0} \mathcal{F}_t,$

the sigma-algebra of events determined by limiting behaviour as $t\rightarrow 0$ (which is independent of hitting times). But of course, unfortunately $\mathcal{F}_{0+}$ has a zero-one law, so it’s not possible to embed non-trivial distributions there.

Dubins solution

The exemplar for solutions without extra randomness is due to Dubins, shortly after Skorohod’s original argument. The idea is to express the distribution X as the almost sure limit of a martingale. We first use the hitting time of a pair of points to ‘decide’ whether we will end up positive or negative, and then given this information look at the hitting time (after this first time) of two subsequent points to ‘decide’ which of four regions of the real interval we end up in.

I’m going to use different notation to Obloj, corresponding more closely with how I ended up thinking about this method. We let

$a_+:= \mathbb{E}[X \,|\, X>0], \quad a_- := \mathbb{E}[X\,|\, X<0],$ (*)

and take $T_1 = \tau_{\{a_-,a_+\}}$. We need to check that

$\mathbb{P}\left( B_{T_1}=a_+\right) = \mathbb{P}\left(X>0\right),$

for this to have a chance of working. But we know that

$\mathbb{P}\left( B_{T_1}=a_+\right) = \frac{a_+}{a_+-a_-},$

and we can also attack the other side using (*) and the fact that $\mathbb{E}[X]=0$, using the law of total expectation:

$0=\mathbb{E}[X]=\mathbb{E}[X\,|\, X>0] \mathbb{P}(X>0) + \mathbb{E}[X\,|\,X<0]\mathbb{P}(X<0) = a_+ \mathbb{P}(X>0) + a_- \left(1-\mathbb{P}(X>0) \right),$

$\Rightarrow\quad \mathbb{P}(X>0)=\frac{a_+}{a_+-a_-}.$

Now we define

$a_{++}=\mathbb{E}[X \,|\, X>a_+],\quad a_{+-}=\mathbb{E}[X\,|\, 0

and similarly $a_{-+},a_{--}$. So then, conditional on $B_{T_1}=a_+$, we take

$T_2:= \inf_{t\ge T_1}\left\{ B_t\not\in (a_{+-},a_{++}) \right\},$

and similarly conditional on $B_{T_1}=a_-$. By an identical argument to the one we have just deployed, we have $\mathbb{E}\left[B_{T_2} \,|\,\mathcal{F}_{T_1} \right] = B_{T_1}$ almost surely. So, although the $a_{+-+}$ notation now starts to get very unwieldy, it’s clear we can keep going in this way to get a sequence of stopping times $0=T_0,T_1,T_2,\ldots$ where $B_{T_n}$ determines which of the $2^n$ regions of the real line any limit $\lim_{m\rightarrow\infty} B_{T_m}$ should lie in.

A bit of work is required to check that the almost sure limit $T_n\rightarrow T$ is almost surely finite, but once we have this, it is clear that $B_{T_n}\rightarrow B_T$ almost surely, and $B_T$ has the distribution required.

We want to know how close we can make this coupling between a centered random walk with variance 1, and a standard Brownian motion. Here, ‘close’ means uniformly close in probability. For large times, the typical difference between one of the stopping times $0,T_1,T_2,\ldots$ in the Skorohod embedding and its expectation (recall $\mathbb{E}[T_k]=k$) is $\sqrt{n}$. So, constructing the random walk $S_0,S_1,S_2,\ldots$ from the Brownian motion via Skorohod embedding leads to

$\left |S_k - B_k \right| = \omega(n^{1/4}),$

for most values of $k\le n$. Strassen (1966) shows that the true scale of the maximum

$\max_{k\le n} \left| S_k - B_k \right|$

is slightly larger than this, with some extra powers of $\log n$ and $\log\log n$ as one would expect.

The Komlos-Major-Tusnady coupling is a way to do a lot better than this, in the setting where the distribution of the increments has a finite MGF near 0. Then, there exists a coupling of the random walk and the Brownian motion such that

$\max_{k\le n}\left|S_k- B_k\right| = O(\log n).$

That is, there exists C such that

$\left[\max_{k\le n} \left |S_k-B_k\right| - C\log n\right] \vee 0$

is a tight family of distributions, indeed with uniform exponential tail. To avoid digressing infinitely far from my original plan to discuss the proof of the law of iterated logarithm for general distributions, I’ll stop here. I found it hard to find much coverage of the KMT result apart from the challenging original paper, and many versions expressed in the language of empirical processes, which are similar to random walks in many ways relevant to convergence and this coupling, but not for Skorohod embedding. So, here is a link to some slides from a talk by Chatterjee which I found helpful in getting a sense of the history, and some of the modern approaches to this type of normal approximation problem.

# Fair games and the martingale strategy II

Optional Stopping

We continue directly from the end of the last post, where I was talking about how to play sequences of fair games, and whether by playing cunningly (including choosing when to stop playing cunningly) you can end up with an ‘unfair’ game overall. (Ie where you make a profit or a loss on average.) We gave two examples. First, the martingale strategy, where on a sequence of fair games you double your stake each time you lose. The result is that you win back your original stake at some point with probability one, but possibly accumulate huge temporary losses along the way. In the second game, you follow the path of a simple random walk from zero until it hits one, and then cash in. Here we observe that the time until this happens is almost surely finite, but has infinite expectation.

There’s another possible problem. It seems ridiculous, but suppose we could look into the future. Then our strategy for the random walk might be something like: check in advance what will happen in the first ten steps, and stop playing whenever we get to the moment which we know is the maximum value the walk will attain. Well then, sometimes the walk will never go above zero, in which case we will stop playing at the very start, and sometimes the walk will go above zero, in which case we make a positive amount. So overall, our mean return must be positive. Obviously if we have the option to adjust our stakes, this is completely ridiculous, because we would bet high (infinitely high?) if we knew we were about to win, and nothing if we were about to lose. So, obvious though it seems, we should emphasise that we mustn’t be allowed to look into the future!

The optional stopping theorem says that looking into the future, and these two problems already mentioned are essentially all that can go wrong. To say anything more interesting, at this point we really do need a little bit of notation.

In general, a sequence of fair games of this kind is called a martingale. The origin of the word is fairly unclear – see this unexpectedly comprehensive article. The martingale will be something like $X_0,X_1,X_2,\ldots$, representing the wealth (or whatever) at discrete time-steps. The key property is the fair game property, which says that whatever has happened up to time k, the next game is fair. That is:

$\mathbb{E}[X_{k+1}-X_k \,|\,\text{any event involving }X_0,\ldots,X_k] = 0.$ (*)

Note that in any of the situations we are describing, X should describe our wealth, rather than the underlying process. In the random walk example, these are the same, but in the martingale strategy suggestion, $X_k$ should be our wealth after the kth game, rather than anything directly recording the outcomes of the coin tosses.

If we allow $X_0$ to be random (and of course, being always equal to zero is a special case of being random…) we can then add up an initial sequence of such equations to obtain

$\mathbb{E}[X_k]=\mathbb{E}[X_k-X_{k_1}] + \ldots + \mathbb{E}[X_1-X_0] + \mathbb{E}[X_0]=\mathbb{E}[X_0].$ (**)

That is, if we play any sequence of fair games a fixed finite number of times, we have overall a fair game. (In the original strategy, we have a martingale, irrespective of the precise rule we use to choose how much we invest on each coin toss.) But what if we stop the process at a time determined by the current knowledge? (ie without looking into the future.)

Let’s call an example of such a random time T, and this property that we aren’t allowed to look into the future is described technically as the stopping time property. A proper setup would use more notation and fewer words at (*), but even without sigma-algebras, we can say that T is a stopping time if deciding whether T=k depends only on $X_0,X_1,\ldots,X_k$, and not on later actions.

Informal Proof

To show the optional stopping theorem, the key idea is that if you want to stop at time T, one option is to keep playing beyond time T with zero stakes. Thus we have a fair game at all times, even after T. We write this as $X_{T\wedge k}$, where $\wedge$ means ‘minimum’, so that if k>T, the process stays constant.

Since $X_{T\wedge k}$ is a martingale, we can invoke (**),

$\mathbb{E}[X_{T\wedge k}] = \mathbb{E}[X_0].$

Now what happens if we take k to be very large? How well does this truncated average approximate $\mathbb{E}[X_T]$ itself?

This is where we want to return to our assumptions about what might make this go wrong. Let’s say that T has finite expectation, and that there is some absolute bound on how large an increment can be, say C. Then, whenever $T\le k$, we have $X_T=X_{T\wedge k}$. And when T>k, we have

$|X_T - X_{T\wedge k}| = |X_T-X_k| \le C(T-k).$

Therefore

$|\mathbb{E}[X_T]-\mathbb{E}[X_0]|= |\mathbb{E}[X_T] - \mathbb{E}[X_{T\wedge k}] | \le C \mathbb{E}[(T-k)\vee 0],$ (***)

where we take the final expectation only across T-k when this quantity is positive, since this is the only case which contributes to the left hand side.

Now we need to show that by choosing k large enough, we can make the RHS very small. Obviously we don’t have a chance of doing this if C is not finite! With a bit of thought, we can see that $\mathbb{E}[(T-k)\vee 0]\ge \mathbb{E}[T] - k$, and so we also don’t have a chance of doing this if $\mathbb{E}[T]=\infty$. But if $\mathbb{E}[T]<\infty$, then $\sum_{\ell\ge 1} \ell \mathbb{P}(T=\ell) <\infty$, and so

$\sum_{\ell \ge k} \ell \mathbb{P}(T=\ell)\rightarrow 0,\quad \text{as }k\rightarrow\infty,$

and so certainly

$\mathbb{E}[(T-k)\vee 0] = \sum_{\ell \ge k}(\ell -k)\mathbb{P}(T=\ell) \rightarrow 0.$

But (***) holds for all values of k, and so the only consistent option is that

$\mathbb{E}[X_T]=\mathbb{E}[X_0].$

There are a couple more combinations of conditions (mostly involving relaxing one of these slightly, and substantially strengthening the other) which also work, but this seems like the more natural form. For a full formal statement, there are many resources available, and the Wikipedia page, for example, isn’t too bad. In the mists of history, I wrote about some of these topics more formally, but maybe less helpfully, since I’d known the theory myself for about a week.

# Fair games and the martingale strategy I

I went back to my school a couple of weeks ago and gave a talk. I felt I’d given various incarnations of a talk on card-shuffling too many times, so it was time for a new topic. The following post (and time allowing, one or two more) is pretty much what I said.

The Martingale Strategy

Suppose we bet repeatedly on the outcome of tossing a fair coin. Since it’s November, my heart is set on buying an ice cream that costs £1, so my aim is to win this amount from our game. My strategy is this:

First, I bet £1. If I win, then that’s great, because I now have made exactly enough profit to buy the ice cream. If I lose, then I play again, and this time I bet £2. Again, if I win, then my total profit is £2-£1 = £1, so I stop playing and buy the ice cream. If I lose, then I play a third time, again doubling my stake. So if I win for the first time on the seventh go, my overall profit will be

£64 – (£1+£2+£4+£8+£16+£32) = £1,

and it’s clear that this can be continued and I will eventually win a round, and at this point my total profit will be £1. So I will always eventually be able to buy my ice cream.

But, there’s nothing special about the value £1, so I could replace the words ‘ice cream’ with ‘private tropical island’, so why am I still here in the UK on a wet Monday when I could be on my beach lounger?

There are some fairly obvious reasons why the strategy I’ve described is not actually a fail-safe way to make a profit. For a start, although with probability one a head will come up eventually, there is a small positive chance that the first 200 rolls will all be tails. At this point, I would have accrued a debt of roughly $2^{200}$ pounds, and this is slightly more than the number of atoms in the universe. All this for an ice cream?

So there are major problems carrying out this strategy in a finite world. And of course, it’s no good if we stop after a very large but finite number of turns, because then there’s always this very small chance that we’ve made a very large loss, which is bad, partly because we can’t have the ice cream, but also because it exactly cancels out the chance of making our £1 profit, and so our overall average profit is exactly zero.

Though I’ve set this up in an intentionally glib fashion, as so often is the case, we might have stumbled across an interesting mathematical idea. That is, if we play a fair game a finite number of times, we have a fair game overall, meaning our overall average profit is zero. But if we are allowed to play a potentially infinite number of times, then it’s not clear how to define our overall ‘average’ profit, since we feel it ought to be zero, as an extension of the finite case, but also might be positive, because it ends up being £1 with probability one.

It’s tempting at this stage to start writing statements like

$1 \times 1 + (-\infty) \times 0=0 ,$

to justify why this might have come about, where we consider the infinitely unlikely event that is infinitely costly. But this is only convincing at the most superficial level, and so it makes more sense to think a bit more carefully about under exactly what circumstances we can extend our observation about the overall fairness of a finite sequence of individual fair games.

A second example

The previous example was based upon a series of coin tosses, and we can use exactly the same source of randomness to produce a simple random walk. This is a process that goes up or down by 1 in each time step, where each option happens with probability ½, independently of the history.

We could avoid the requirement to deal with very large bets by always staking £1, and then cashing in the first time we have a profit of £1. Then, if we start the random walk at zero, it models our profit, and we stop the first time it gets to 1. It’s not obvious whether we hit 1 with probability one. Let’s show this.

In order to hit some positive value k, the random walk must pass through 1, 2, and so on, up to (k-1) and then finally k. So $\mathbb{P}(\text{hit k}) = [\mathbb{P}(\text{hit 1})]^k$. And similarly for negative values. Also, the probability that we return to zero is the same as the probability that we ever hit 1, since after one time-step they are literally the same problem (after symmetry). So, if the probability of hitting 1 is p<1, then the number of visits to zero is geometric (supported on 1,2,3,…) with parameter p, and so

$\mathbb{E}[\text{visits to k}] = \mathbb{E}[\text{visits to zero}] \times \mathbb{P}(\text{hit k})=(1+1/p) \times p^{|k|} = (p+1)p^{|k|-1}.$

Thus, when we sum over all values of k, we are summing a pair of geometric series with exponent <1, and so we get a finite answer. But if the expected number of visits to anywhere (ie the sum across all places) is finite, this is clearly ridiculous, since we are running the process for an infinite time, and at each time-step we must be somewhere! So we must in fact have p=1, and thus another potential counter-example to the claim that a sequence of fair games can sometimes be unfair.

We might have exactly the same set of practical objections, such as this method requiring arbitrarily large liquidity (even though it doesn’t grow exponentially fast so doesn’t seem so bad).

What will actually turn out to be useful is that although the bets are now small, the average time until we hit 1 is actually infinite. Remember that, even though most things we see in real life don’t have this property, it is completely possible for a random variable to take finite values yet have infinite expectation.

Notes on the Martingale Strategy

There’s no reason why the originally proposed strategy had to be based upon fair coin tosses. This strategy might work in a more general setting, where the chance of winning on a given turn is not ½, or is not even constant. So long as at each stage you bet exactly enough that, if you win, you recoup all your losses so far, and one extra pound, this has the same overall effect.

Of course, we need to check that we do eventually win a round, which is not guaranteed if the probability of winning (conditional on not having yet won) decays sufficiently fast. If we let $p_k$ be the probability of winning on turn k, given that we haven’t previously won, then we require that the probability of never winning $\prod_{k\ge 1}(1-p_k)=0$. By taking logs and taking care of the approximations, it can be seen that the divergence or otherwise of $\sum p_k$ determines which way this falls.

In the next post, we’ll talk about how the two problems encountered here, namely allowing large increments, and considering a stopping time with infinite expectation are exactly the two cases where something can go wrong. We’ll also talk about a slightly different setting, where the choice of when to stop playing becomes a bit more dynamic and complicated.

# One queue or many queues?

You have counters serving customers. What is a better arrangement? To have queues, one for each counter, as in a supermarket (in the heady days before self-checkout, when ‘something unexpected in the bagging area’ might necessitate a trip to the doctor); or a single queue feeding to counters when they become free, as at a train station, or airport passport control?

Queueing is a psychologically intense process, at least in this country. Here are some human points:

– When you have to choose your line, you might, if you have nothing better to do, spend your time in the queue worrying about whether you chose optimally; whether Fortune has given a customer who arrived after you an advantage. (*)

– A single queue feels long, but conversely gives its users the pleasing feeling of almost constant progress. A particularly difficult customer doesn’t have as great an effect.

Interestingly, we can qualify these remarks in purely mathematical terms. We shall see that the mean waiting times are the same, but that, as we might expect, the variance is less for the single queue.

So to begin, we assume that the arrival process is Poisson, and the service times are exponential. These assumptions are very much not required, but it makes it easier to set up the model as a Markov chain. We do need that $\lambda$, the arrival rate, is less than $\mu N$, where $\mu$ is the service rate. Why? Well, otherwise the number of customers waiting to be served will almost surely grow to infinity. When supply is greater than demand, as we specify, we know that there will (almost surely) be infinitely many times when there are no customers in the system. This shows that the Markov chain is recurrent, and so it is meaningful to talk about a mean waiting time.

We assume that in the N-queues arrangement, if a counter is free but the corresponding queue is empty, then a customer from a non-zero queue will take the place if possible. We do not worry about how this customer is chosen. Indeed, we do not concern ourselves with how the customers choose a queue, except to ask that the choice algorithm is previsible. That is, they cannot look into the future and optimise (or otherwise) accordingly. This assumption is standard and reasonable by considering the real-world situation.

Now if we set things up in the right way, we can couple the two processes. Precisely, we start with no customers, and define

$0

to be the sequence of arrival times (note that almost surely no two customers arrive at the same time). Now say that

$0

are the times at which a customer starts being served, that is, moves from the queue to the counter. Say $S_1,S_2,\ldots$ are the service times of the customers, where $S_i$ is the service time of the customer who leaves the queue at time $D_i$.

The important point is that service times are independent of the queueing process, and are i.i.d. This is where we use the fact that the queue choice strategies are previsible. So, in fact all we need to examine is how customers spend in the queue. Concretely, take the queueing time, and S the service time. These are independent so:

$\mathbb{E}[Q+S]=\mathbb{E}Q+\mathbb{E}S$

and $\text{Var} (Q+S)=\text{Var } Q+\text{Var } S$.

Observe that $A_i,S_i$ have the same meanings and distributions in both queue models. It is not obvious that $D_i$ does. We notice that for both models, if we define

$C_n(t)=\#\{i:D_i+S_i\geq t\}$ on $[D_n,\infty)$,

the number of the first n customers being served at time t, we have

$D_{n+1}=A_{n+1}\vee \inf\{t\geq D_n: C_n(t)\leq N-1\}.$

Therefore, by induction on n, we have that $(D_i)$ is well-defined as the same function of $(A_i,S_i)$ for both models. This fact was given as implicitly obvious by many of the remarks on this topic which I found. In my opinion, it is no more obvious than the solution to the original problem.

We can also define, for both models, the number of customers served before the queue is empty again as:

$\inf\{N:\exists \epsilon>0\text{ s.t. }c_N(A_{N+1}-\epsilon)=0\},$

ie, the least N for which the queue is empty just before the N+1-th customer arrives. As previously discussed, N is a.s. finite. The processes are Markov chains, the first time the queue is empty is a stopping time, and it is meaningful to consider the average waiting time of the N customers before the queue is empty again. There is an expectation over N implicit here too, but this won’t be necessary for our result.

We can now finish the proof. This mean waiting time is

$\frac{1}{N}\sum_{i=1}^N(D_i-A_i)$

for the single queue. For multiple queues, suppose the i-th customer to arrive is the $\sigma(i)$-th to be served, where $\sigma\in S_N$ is a permutation. Then the waiting time is

$\frac{1}{N}\sum_{i=1}^N (D_{\sigma(i)}-A_i)$,

which is clearly equal to the previous expression as we can reorder a finite sum.

For variance, we have to consider

$\frac{1}{N}\sum_{i=1}^N (D_{\sigma(i)}-A_i)^2.$\$

Observe that for $a\leq b\leq c\leq d$, we have $0\leq(d-c)(b-a)$ and so

$(c-a)^2+(d-b)^2\leq (d-a)^2+(c-b)^2$,

which we can then apply repeatedly to show that

$\sum_{i=1}^N(D_{\sigma(i)}-A_i)^2\geq\sum_{i=1}^N(D_i-A_i)^2,$

with equality precisely when $\sigma=\text{id}$. So we conclude that the variance is in general higher when there are queues. And the reason is not that different to what we had originally suspected (*), namely that customers do not necessarily get served in the same order as they arrive when there are multiple queues, and so the variance of waiting time is greater.

References

This problem comes from Examples Sheet 1 of the Part III course Stochastic Networks. An official solution of the same flavour as this is also linked from the course page.

# Martingales

Definition

$X_n$ is a stochastic process, integrable and adapted to filtration $(\Omega,\mathcal{F},(\mathcal{F}_n),\mathbb{P})$. Then $X$ is a martingale if $\mathbb{E}[X_n|\mathcal{F}_m]=X_m$ almost surely whenever $n\geq m$.

It is natural to think about a martingale as defined in the context of a process evolving in time. Then this definition is very reasonable:

• Integrable: the entire construction is about taking expectations. So these need to exist.
• Adapted: $X_n$ can’t be affected by what happens after time n. It must be defined by what has happened up to time n.
• Expectation condition: if you look at the process at a given time m, the best estimate for what $X$ will be in the future is in fact what it is now. As you might expect, it is sufficient that $\mathbb{E}[X_{n+1}|\mathcal{F}_n]=X_n$ almost surely for each n.

Motivation

There are many situations where the expected change in a variable over a time period is zero, whatever the value of the variable is at the start. For example, gambling. For illustration, assume we are speculating on the outcomes of tossing a coin repeatedly. You might have a complicated strategy, for example ‘double your stake when you lose’ (the so-called martingale strategy), or anything else. But ultimately, you can’t see into the future. So before every coin toss, you have to decide your stake, based on what’s happened up until now, and you will win or lose with equal probability, so your expected gain is 0, regardless of how you make your stake choice. Thus under any strategy determined without looking into the future (called previsible), the process recording your winnings is a martingale. Continue reading