Doob inequalities and Doob-Meyer decomposition

The first post I wrote on this blog was about martingales, way back in 2012 at a time when I had known what a martingale was for about a month. I now don’t have this excuse. So I’m going to write about a couple of properties of (discrete-time) martingales that came up while adjusting a proof which my thesis examiners suggested could be made much shorter as part of their corrections.

Doob’s submartingale inequality

When we prove that some sequence of processes converges to some other process, we typically want to show that this holds in some sense uniformly over a time-interval, rather than just at some fixed time. We don’t lose much at this level of vagueness by taking the limit process to be identically zero. Then, if the convergent processes are martingales or closely similar, we want to be able to bound \sup_{k\le n} |Z_k| in some sense.

Doob’s submartingale inequality allows us to do this. Recall that a submartingale has almost-surely non-negative conditional increments. You might think of it heuristically as ‘more increasing than a martingale’. If Z_n is a martingale, then |Z_n| is a submartingale. This will be useful almost immediately.

The statement is that for (Z_n) a non-negative submartingale,

\mathbb{P}\left( \sup_{k\le n} Z_k \ge \lambda\right) \le \frac{\mathbb{E}\left[Z_n\right]}{\lambda}.

The similarity of the statement to the statement of Markov’s inequality is no accident. Indeed the proof is very similar. We consider whether the event in question happens, and find lower bounds on the expectation of Z_n under both possibilities.

Formally, for ease of notation, let Z_n^* be the running maximum \sup_{k\le n}Z_k. Then, we let T:= n\wedge \inf\{k\le n \,:\, Z_k\ge \lambda\} and apply the optional stopping theorem for submartingales at T, which is by construction at most n. That is

\mathbb{E}[Z_n]\ge \mathbb{E}[Z_T]=\mathbb{E}\left[Z_T\mathbf{1}_{Z_n^*<\lambda}\right] + \mathbb{E}\left[Z_T \mathbf{1}_{Z_n^*\ge \lambda}\right].

The first of these summands is non-negative, and the second is at least \lambda \mathbb{P}\left( Z_n^* \ge \lambda \right), since Z_T\ge \lambda on the event \{Z_n^*\ge \lambda\}. The result follows.
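
As a sanity check rather than a proof, here is a minimal simulation sketch of the inequality. It assumes the submartingale is Z_k=|S_k| for a simple symmetric random walk S, and the parameter values and function name are purely illustrative.

```python
import random


def doob_maximal_check(n=100, lam=20.0, trials=2000, seed=0):
    """Compare P(max_{k<=n} |S_k| >= lam) with E|S_n| / lam by simulation.

    S is a simple symmetric random walk started at 0 (an assumption made
    for illustration), so |S| is a non-negative submartingale.
    """
    rng = random.Random(seed)
    hits = 0          # number of paths whose running maximum reaches lam
    total_abs = 0.0   # accumulates |S_n| to estimate E|S_n|
    for _ in range(trials):
        s = 0
        running_max = 0
        for _ in range(n):
            s += rng.choice((-1, 1))
            running_max = max(running_max, abs(s))
        hits += running_max >= lam
        total_abs += abs(s)
    return hits / trials, (total_abs / trials) / lam


prob, bound = doob_maximal_check()
print(f"empirical P(max >= lam) = {prob:.3f}, estimated bound = {bound:.3f}")
```

The empirical probability should sit comfortably below the estimated bound, which reflects that the inequality is far from sharp in this example.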

We’ve already said that for any martingale Z_n, |Z_n| is a submartingale, but in fact f(Z_n) is a submartingale whenever f is convex, and \mathbb{E}|f(Z_n)|<\infty for each n. Naturally, this continues to hold when Z_n is itself a submartingale.

[Note that Z_n^* is also a submartingale, but this probably isn’t as interesting.]

A particularly relevant such function f is f(x)=x^p, for p>1. If we take Z_n a non-negative submartingale which is uniformly bounded in L^p, then by applying Hölder’s inequality and this submartingale inequality, we obtain

\mathbb{E}\left( \sup_{k\le n}Z_k^p \right) \le \left(\frac{p}{p-1}\right)^p \mathbb{E}\left[ Z_n^p \right].

Since Z_n^p is a submartingale, then a limit in n on the RHS is monotone, and certainly a limit in n on the LHS is monotone, so we can extend to

\mathbb{E}\left( \sup_{k}Z_k^p \right) \le \left(\frac{p}{p-1}\right)^p \mathbb{E}\left[ Z_\infty^p \right].

Initially, we have to define \mathbb{E}\left[ Z_\infty^p \right] through this limit, but in fact this result, Doob’s L^p inequality, shows that Z_\infty:= \lim Z_n exists almost surely as well.

Naturally, we will often apply this in the case p=2, and in the third of these three sections, we will see why it might be particularly straightforward to calculate \mathbb{E}\left[Z_\infty^2\right].
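
For p=2 the constant is (2/(2-1))^2=4, and a quick simulation can illustrate the finite-n inequality. This is a sketch under the assumption that Z_k=|S_k| for a simple symmetric random walk S; the function name and parameters are illustrative only.

```python
import random


def doob_l2_check(n=100, trials=2000, seed=1):
    """Estimate E[max_{k<=n} S_k^2] and 4 * E[S_n^2] for a simple random walk."""
    rng = random.Random(seed)
    sum_max_sq = 0.0    # accumulates max_k S_k^2 over paths
    sum_final_sq = 0.0  # accumulates S_n^2 over paths
    for _ in range(trials):
        s = 0
        max_sq = 0
        for _ in range(n):
            s += rng.choice((-1, 1))
            max_sq = max(max_sq, s * s)
        sum_max_sq += max_sq
        sum_final_sq += s * s
    return sum_max_sq / trials, 4 * sum_final_sq / trials


lhs, rhs = doob_l2_check()
print(f"E[max S_k^2] ~ {lhs:.1f}, 4 E[S_n^2] ~ {rhs:.1f}")
```

Here \mathbb{E}[S_n^2]=n exactly, so the right-hand estimate should be close to 4n.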

Remark: as in the case of Markov’s inequality, it’s hard to say much if the submartingale is not taken to be non-negative. Indeed, this effect can be seen even if the process is only defined for a single time step, for which the statement really is then Markov’s inequality.

Doob-Meyer decomposition

Unfortunately, most processes are not martingales. Given a discrete-time process X_n adapted to \mathcal{F}=(\mathcal{F}_n), it is a martingale if the conditional expectations of the increments are all almost surely zero. But given a general adapted process X_n which is integrable (so the increments have well-defined finite expectation), we can iteratively construct a new process M_n, where the increments are centred versions of X_n’s increments. That is,

M_{n+1}-M_n:= X_{n+1}-X_n - \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right] = X_{n+1}-\mathbb{E}\left[X_{n+1} \,\big|\, \mathcal{F}_n\right]. (*)

Then it’s immediately clear from the definition that M_n is a martingale.

There’s a temptation to tie oneself up in knots with the dependence. We might have that increments of the original process X_n depend on the current value of the process. And is it necessarily clear that we can recover the current value of the original process from the current value of M_n? Well, this is why we demand that everything be adapted, rather than just Markov. It’s not the case that M_n should be Markov, but it clearly is adapted.

Now we look at the middle expression in (*), and in particular the term we are subtracting, namely the conditional expectation. If we define, in the standard terminology, A_0=0 and

A_{n+1}-A_n:= \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right],

then we have decomposed the original process X_n as the sum of a martingale M_n, and this new process A_n. In particular, note that the increment A_{n+1}-A_n given above is measurable with respect to \mathcal{F}_n, which is a stronger condition than being measurable with respect to \mathcal{F}_{n+1} as we would expect a priori. This property of the process (A_n) is called predictability (or possibly previsibility).

This decomposition X_n=X_0+M_n+A_n as just defined is called the Doob-Meyer decomposition, and there is a unique such decomposition where M_n is a martingale, and A_n is predictable. The proof of uniqueness is very straightforward. We look at the equalities given above as definitions of M_n,A_n, but then work in the opposite direction to show that they must hold if the decomposition holds.

I feel a final heuristic is worthwhile, using the term drift, more normally encountered in the continuous-time setting to describe infinitesimal expected increments. The increments of A_n represent the drift of X_n, and the increments of M_n are what remains from X_n after subtracting the drift. In general, the process to be subtracted to turn a non-martingale into a martingale is called a compensator, and the existence or otherwise of such processes is important but challenging for some classes of continuous-time processes.

In particular, note that when X_n is itself a martingale, then A_n\equiv 0. However, probably the most useful case is when X_n is a submartingale, as then the drift is always non-negative, and so A_n is almost surely increasing. The converse holds too.

This is relevant because this Doob-Meyer decomposition is obviously only a useful tool for treating X_n if we can handle the two processes M_n,A_n easily. We have tools to bound the martingale term, but this previsible term might in general be tricky, and so the case where X_n is a submartingale is good, as increasing processes are much easier than general processes, since bounding the whole process might involve only bounding the final term in many contexts.
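
To make the construction concrete, here is a minimal sketch of the decomposition applied to a single path. The function `doob_decompose` and its `drift` argument are hypothetical names introduced for illustration; the conditional expected increment has to be supplied analytically, since it cannot be read off one sample path.

```python
def doob_decompose(path, drift):
    """Split a path X_0, ..., X_n into martingale and predictable parts.

    `drift(history)` is a hypothetical user-supplied function returning
    E[X_{k+1} - X_k | F_k] given the path so far.
    Returns lists M and A with M_0 = A_0 = 0 and X_k = X_0 + M_k + A_k.
    """
    M, A = [0], [0]
    for k in range(len(path) - 1):
        d = drift(path[: k + 1])  # uses X_0..X_k only, so A is predictable
        A.append(A[-1] + d)
        M.append(M[-1] + (path[k + 1] - path[k]) - d)
    return M, A


# Example: X_k = S_k^2 for a simple symmetric random walk S, where
# E[X_{k+1} - X_k | F_k] = 1, so A_k = k and M_k = S_k^2 - k.
steps = [1, -1, -1, -1, 1, -1]
S = [0]
for step in steps:
    S.append(S[-1] + step)
X = [s * s for s in S]
M, A = doob_decompose(X, lambda history: 1)
print(A)  # [0, 1, 2, 3, 4, 5, 6]
print(M)
```

In this example X is a submartingale, and the predictable part A is indeed increasing.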

Predictable quadratic variation

A particularly relevant example is the square of a martingale, that is X_n=M_n^2, where M_n is a martingale. By the convexity condition discussed earlier, X_n is a submartingale (provided it is integrable, ie M_n is square-integrable), and so the process A_n in its Doob-Meyer decomposition is increasing. This is often called the (predictable) quadratic variation of (M_n).

This predictable quadratic variation is sometimes denoted \langle M\rangle_n. This differs from the (regular) quadratic variation which is defined as the sum of the squares of the increments, that is [M]_n:= \sum_{k=0}^{n-1} (M_{k+1}-M_k)^2. Note that this is adapted, but obviously not previsible. The distinction between these two processes is more important in continuous time. There, they are almost surely equal for a continuous local martingale, but not for eg a Poisson process. (For a Poisson process, the PQV is deterministic, indeed linear, while the (R)QV is almost surely equal to the Poisson process itself.) In the discrete time setting, the regular quadratic variation is not relevant very often, while the predictable quadratic variation is useful, precisely because of this decomposition.

Whenever we have random variables which we then centre, there is a standard trick to apply when treating their variance. That is

A_{n+1}-A_n= \mathbb{E}\left[ M^2_{n+1}-M^2_n \,\big|\, \mathcal{F}_n\right]
= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n^2 +M_n^2
= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n \mathbb{E}\left[ M_{n+1}\,\big|\, \mathcal{F}_n\right] + M_n^2
= \mathbb{E}\left[ \left(M_{n+1}-M_n\right)^2\,\big|\, \mathcal{F}_n\right].

One consequence is seen by taking an ‘overall’ expectation. Because M_n^2-A_n is a martingale,

\mathbb{E}\left[M_n^2\right] = \mathbb{E}\left[M_0^2\right] + \mathbb{E}\left[A_n\right] = \mathbb{E}\left[M_0^2\right] + \sum_{k=0}^{n-1} \mathbb{E}\left[A_{k+1}-A_k\right]
= \mathbb{E}\left[ M_0^2\right] + \sum_{k=0}^{n-1}\mathbb{E}\left[ \left(M_{k+1}-M_k\right)^2 \right]. (**)

This additive (Pythagorean) property of the square of a martingale is useful in applications where there is reasonably good control on each increment separately.

We can also see this final property without the Doob-Meyer decomposition. For a martingale, the increments on disjoint intervals need not be independent. However, following Williams 12.1 [1], for a square-integrable martingale the increments on disjoint intervals are orthogonal, in the sense that

\mathbb{E}\left[ (M_t-M_s)(M_v-M_u) \right]=0,

whenever s\le t\le u\le v. Then, when we square the expression M_n=M_0+\sum_{k=0}^{n-1} (M_{k+1}-M_k), and take expectations, all the cross terms vanish, leaving precisely (**).
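
Both the orthogonality of increments and the additive identity for \mathbb{E}[M_n^2] can be checked exactly on a small example by enumerating all paths. The martingale below, with M_0=1 and M_{k+1}=M_k+\xi_{k+1}(1+|M_k|) for independent fair signs \xi, is an illustrative choice of mine: its increments depend on the current value, so they are not independent.

```python
from fractions import Fraction
from itertools import product


def martingale_paths(n):
    """All 2^n equally likely paths of M_{k+1} = M_k + xi_{k+1} * (1 + |M_k|)."""
    paths = []
    for signs in product((-1, 1), repeat=n):
        m = [1]  # M_0 = 1 (an arbitrary starting value)
        for xi in signs:
            m.append(m[-1] + xi * (1 + abs(m[-1])))
        paths.append(m)
    return paths


def expectation(paths, g):
    """Exact expectation of g(path) when all paths are equally likely."""
    return Fraction(sum(g(p) for p in paths), len(paths))


n = 6
paths = martingale_paths(n)

# Orthogonality of increments over [s, t] and [u, v] with s <= t <= u <= v.
s, t, u, v = 0, 2, 3, 5
cross = expectation(paths, lambda p: (p[t] - p[s]) * (p[v] - p[u]))

# Additive identity: E[M_n^2] = E[M_0^2] + sum of expected squared increments.
lhs = expectation(paths, lambda p: p[n] ** 2)
rhs = expectation(paths, lambda p: p[0] ** 2) + sum(
    expectation(paths, lambda p, k=k: (p[k + 1] - p[k]) ** 2) for k in range(n)
)
print(cross, lhs == rhs)  # 0 True
```

Exact rational arithmetic is used so that the check is an equality rather than an approximate comparison.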


[1] Williams – Probability with Martingales

I also followed the notes I made in 2011/12 while attending Perla Sousi’s course on Advanced Probability, and Arnab Sen’s subsequent course on Stochastic Calculus, though I can’t find any evidence online for the latter now.

The Envelope ‘Paradox’

At the recent IMO in Hong Kong, there were several moments where the deputy leaders had to hang around, and I spent some of these moments discussing the following problem with Stephen Mackereth, my counterpart from New Zealand. He’s a mathematically-trained philosopher, so has a similar level of skepticism to me, but for different reasons, regarding supposed paradoxes in probability. Because, as we will see shortly, I don’t think this is a paradox in even the slightest fashion, I think there’s probably too much written about this on the internet already. So I’m aware that contributing further to this oeuvre is hypocritical, but we did the thinking in HKUST’s apparently famous Einstein Cafe, so it makes sense to write down the thoughts.

[And then forget about it for eight weeks. Oops.]

The ‘Paradox’

Here’s the situation. A cryptic friend gives you an envelope containing some sum of money, and shows you a second envelope. They then inform you that one of the envelopes contains twice as much money as the other. It’s implicit in this that the choice of which is which is uniform. You have the option to switch envelopes. Should you?

The supposed paradox arises by considering the amount in your envelope, say X. In the absence of further information, it is equally likely that the other envelope contains X/2 as 2X. Therefore, the average value of the other envelope is

\frac12 \left(\frac{X}{2}+2X \right)= \frac54 X > X.

So you should switch, since on average you gain money. But this is paradoxical, since the assignment of larger and smaller sums was uniform, so switching envelope should make no difference.

Probabilistic setup

This is not supposed to be a problem on a first-year probability exercise sheet. It’s supposed to be conducive to light discussion. So saying “I won’t engage with this problem until you tell me what the probability space is” doesn’t go down terribly well. But it is important to work out what is random, and what isn’t.

There are two sources of randomness, or at least ignorance. Firstly, there is the pair of values contained in the envelopes. Secondly, there is the assignment of this pair of values to the two envelopes. The second is a source of randomness, and this problem is founded on the premise that this second stage is ‘symmetric enough’ to smooth over any complications in the first stage. If we think that probability isn’t broken (and that’s what I think), then the answer is probably that the second stage isn’t symmetric enough.

Or, that the first stage isn’t very well-defined. In what follows, I’m going to make the second stage very symmetric, at the expense of setting up the first stage in what seems to me a reasonable way using the conventional language of probability theory to record our ignorance about the values in play.

So what’s the first stage? We must have a set of possible pairs of values taken by the envelopes. Let’s call this A, so

A\subset \mathbb{A}:=\{(x,2x)\,:\, x\in (0,\infty)\}.

Maybe we know what A is, but maybe we don’t, in which case we should take A=\mathbb{A}, on the grounds that any pair is possible. Suppose that your friend has chosen the pair of values according to some distribution on \mathbb{A}, which we’ll assume has a density f, which is known by you. Maybe this isn’t the actual density, but it serves perfectly well if you treat it as *your* opinion on the likelihood. Then this actually does reduce to a problem along the lines of first-year probability, whether or not you get to see the amount in your envelope.

Suppose first that you do get to see the amount, and that it is x. Then the conditional probabilities that the pair is (x/2,x) or (x,2x) are, respectively

\frac{f(x/2,x)}{f(x/2,x)+f(x,2x)},\quad \frac{f(x,2x)}{f(x/2,x)+f(x,2x)}.

So you can work out your expected gain by switching, and decide accordingly. If you don’t know the value in your envelope, you can still work out the probability that it is better (in expectation) to switch, but this isn’t really a hugely meaningful measure, unless it is zero or one.
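
The expected gain can be sketched as a short function of the observed amount and the prior. For concreteness I assume a discrete prior, and index each pair (s,2s) by its smaller value s; both choices, and the function name, are conventions for this sketch only.

```python
from fractions import Fraction


def expected_gain_from_switching(x, f):
    """Expected gain from switching after seeing amount x in your envelope.

    `f(s)` is the weight given to the pair (s, 2s), indexed by its smaller
    value s. f need not be normalised, since only ratios matter.
    """
    w_larger = f(x / 2)  # pair is (x/2, x): you hold the larger amount
    w_smaller = f(x)     # pair is (x, 2x): you hold the smaller amount
    total = w_larger + w_smaller
    if total == 0:
        raise ValueError("x is impossible under f")
    expected_other = (w_larger * (x / 2) + w_smaller * (2 * x)) / total
    return expected_other - x


# Hypothetical prior: the smaller value is 1, 2 or 4, each equally likely.
def f(s):
    return Fraction(1, 3) if s in (1, 2, 4) else Fraction(0)


for x in (1, 2, 4, 8):
    print(x, expected_gain_from_switching(Fraction(x), f))
```

Under this prior you should switch on seeing 1, 2 or 4, and keep your envelope on seeing 8, since 8 can only be the larger amount.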

It’s worth noting that if you can view inside your envelope, and you know A has a particular form, then the game becomes determined. For example, if

A\subset \{(n,2n), n\text{ an odd integer}\},

then life is very easy. If you open your envelope and see an odd integer, you should switch, and if you see an even integer you shouldn’t.

We’ll return at the end to discuss a case where it is always better to switch, and why this isn’t actually a paradox.

Improper prior and paradox of resampling when \mathbb{E}=\infty

For now though, let’s assume that we don’t know anything about the amounts of money in the envelopes. Earlier, we said that “in the absence of further information, it is equally likely that the other envelope contains X/2 as 2X”. In the language of a distribution on \mathbb{A}, we are taking the uniform measure. Of course this is not a distribution, in the same way that there isn’t a uniform distribution on the positive reals.

However, if this is your belief about the values in the pair of envelopes, what do you think is the mean value of the content of your envelope? Well, you think all values are equally likely. So, even though this isn’t a distribution, you pretty much think the value of your envelope has infinite expectation.

[This is where the philosophy comes in I guess. Is (expressing uniform ignorance about the content of the other envelope given knowledge of your own) the same as (expressing uniform ignorance of both envelopes at the beginning)? I think it is, even though it has a different consequence here, since the former can be turned into a proper distribution, whereas the latter cannot.]

Let’s briefly consider an alternative example. It’s fairly easy to conjure up distributions which are almost surely finite but which have infinite expectation. For example \mathbb{P}(X=2^k)=2^{-k} for k=1,2,…, which is the content of the *St. Petersburg paradox*, another supposed paradox in probability, but one whose resolution is a bit more clear.

Anyway, let X and Y be independent copies of such a distribution. Now suppose your friend offers you an envelope containing amount X. You look at the value, and then you are offered the opportunity to switch to an envelope containing amount Y. Should you?

Well, if expectation is what you care about, then you definitely should. Because with probability one, you are staring at a finite value in your envelope, whereas the other unknown envelope promises infinite expectation, which is certainly larger than the value that you’re looking at.

Is this also a paradox? I definitely don’t think it is. The expectation of the content of your envelope is infinite, the expected gain is infinite with probability one, which is consistent with the expected content of the other envelope being infinite. [Note that you don’t want to be claiming that the expectation of X-Y is zero.]
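
For the distribution \mathbb{P}(X=2^k)=2^{-k}, both features can be seen exactly in a few lines: the mean truncated at 2^K grows without bound, while the chance that the other envelope beats a given observed value is small. This is a sketch; the function names are mine.

```python
from fractions import Fraction


def truncated_mean(K):
    """E[X; X <= 2^K] when P(X = 2^k) = 2^{-k} for k = 1, 2, ..."""
    return sum(Fraction(2) ** k * Fraction(1, 2 ** k) for k in range(1, K + 1))


def prob_other_is_larger(j):
    """P(Y > 2^j) for an independent copy Y with the same distribution."""
    return 1 - sum(Fraction(1, 2 ** k) for k in range(1, j + 1))


for K in (5, 10, 20):
    print(K, truncated_mean(K))  # each term contributes 1, so this equals K
print(prob_other_is_larger(3))   # 1/8
```

So after seeing 2^3 you improve by switching with probability only 1/8, yet the expected content of the other envelope is still infinite.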

An example density function

As an exercise that isn’t necessarily hugely interesting, let’s assume that f, the distribution of the smaller of the pair, is \mathrm{Exp}(\lambda). So the mean of this smaller number is 1/\lambda. Then, conditional on seeing x in my envelope, the expected value of the number in the other envelope is

\frac{\frac{x}{2} e^{-\lambda x/2} + 2x e^{-\lambda x}}{e^{-\lambda x/2}+ e^{-\lambda x}}. (*)

Some straightforward manipulation shows that this quantity is at least x (implying it’s advantageous to switch) precisely when

e^{-\lambda x/2}\ge \frac12.

That is, when x\le \frac{2\log 2}{\lambda}. The shape of this interval should fit our intuition, namely that the optimal strategy should be to switch if the value in your envelope is small enough.
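
A short numerical sketch of this threshold, evaluating the displayed ratio directly. The value \lambda=1 and the function name are arbitrary choices for illustration.

```python
import math


def expected_other(x, lam):
    """Conditional expected content of the other envelope, per the ratio above."""
    w_half = math.exp(-lam * x / 2)  # weight of the pair (x/2, x)
    w_full = math.exp(-lam * x)      # weight of the pair (x, 2x)
    return (0.5 * x * w_half + 2 * x * w_full) / (w_half + w_full)


lam = 1.0
threshold = 2 * math.log(2) / lam
for x in (0.5 * threshold, threshold, 2 * threshold):
    print(f"x = {x:.3f}: expected other = {expected_other(x, lam):.3f}")
```

Below the threshold the expected content of the other envelope exceeds x, at the threshold the two agree, and above it switching loses on average.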

The point of doing this calculation is to emphasise that it ceases to be an interesting problem, and certainly ceases to be a paradox of any kind, once we specify f concretely. It doesn’t matter whether this is some true distribution (ie the friend is genuinely sampling the values somehow at random), or rather a perceived likelihood (that happens to be normalisable).

What if you should always switch?

The statement of the paradox only really has any bite if the suggestion is that we should always switch. Earlier, we discussed potential objections to considering the uniform prior in this setting, but what about other possible distributions f which might lead to this conclusion?

As at (*), we can conclude that when f(x)+f(x/2)>0, we should switch on seeing x precisely if

2f(x)\ge f\left(\frac{x}{2}\right).

Therefore, if we should switch at every x, then x f(x) is non-decreasing along each geometric sequence x, 2x, 4x, \ldots, since 2x f(2x)\ge x f(x). Partitioning the support of f into a collection of such sequences with common ratio 2, it is clear that the mean of f is infinite if everything is integer-valued. If f is real-valued, there are some complications, but so long as everything is measurable, the same conclusion will hold.

So the you-should-switch-given-x strategy can only hold for all values of x if f has infinite mean. This pretty much wraps up my feelings. If the mean isn’t infinite, the statement of the paradox no longer holds, and if it is infinite, then the paradox dissolves into a statement about trying to order various expectations, all of which are infinite.
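
A concrete instance may help. Assume (this specific prior is my choice, not something fixed by the argument above) that the smaller amount is 2^k with probability \frac13 (2/3)^k for k=0,1,2,\ldots. This is a proper distribution, switching is favourable at every observed value, and the mean is infinite.

```python
from fractions import Fraction


def f(k):
    """P(smaller amount = 2^k) = (1/3) * (2/3)^k for k = 0, 1, 2, ..."""
    return Fraction(1, 3) * Fraction(2, 3) ** k if k >= 0 else Fraction(0)


def expected_other_over_x(k):
    """E[other envelope | you see x = 2^k], divided by x."""
    x = Fraction(2) ** k
    w_larger, w_smaller = f(k - 1), f(k)  # pairs (x/2, x) and (x, 2x)
    expected_other = (w_larger * x / 2 + w_smaller * 2 * x) / (w_larger + w_smaller)
    return expected_other / x


def partial_mean(K):
    """Sum of 2^k * f(k) over k < K: the mean of the smaller amount, truncated."""
    return sum(Fraction(2) ** k * f(k) for k in range(K))


print([expected_other_over_x(k) for k in range(4)])
print([float(partial_mean(K)) for K in (5, 10, 20)])
```

For every k\ge 1 the ratio is 11/10, so switching always looks 10% better, while the truncated means equal (4/3)^K-1 and diverge.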


Mathematical summary: it’s Bayes. Things may be exchangeable initially, but not once you condition on the value of one of them! Well, not unless you have a very specific prior.

Philosophical summary: everything in my argument depends on the premise that one can always describe the protagonist’s prior opinion on the contents of the pair of envelopes with a (possibly degenerate) distribution. I feel this is reasonable. As soon as you write down \frac12 \cdot\frac{x}{2} + \frac12 \cdot2x, you are doing a conditional expectation, and it’s got to be conditional with respect to something. Here it’s the uniform prior, or at least the uniform prior restricted to the set of values that are now possible given the revelation of your number.

Second mathematical summary: once you are working with the uniform prior, or any measure with infinite mean, there’s no reason why

\mathbb{E}\left[X \,\big|\, Y\right] > Y

with probability one (in terms of Y) should be surprising, since the LHS is (almost-surely) infinite while the RHS is almost surely finite, despite having infinite mean itself.