Doob inequalities and Doob-Meyer decomposition

The first post I wrote on this blog was about martingales, way back in 2012 at a time when I had known what a martingale was for about a month. I now don’t have this excuse. So I’m going to write about a couple of properties of (discrete-time) martingales that came up while adjusting a proof which my thesis examiners suggested could be made much shorter as part of their corrections.

Doob’s submartingale inequality

When we prove that some sequence of processes converges to some other process, we typically want to show that this holds in some sense uniformly over a time-interval, rather than just at some fixed time. We don’t lose much at this level of vagueness by taking the limit process to be identically zero. Then, if the convergent processes are martingales or closely similar, we want to be able to bound \sup_{k\le n} |Z_k| in some sense.

Doob’s submartingale inequality allows us to do this. Recall that a submartingale has almost-surely non-negative conditional increments. You might think of it heuristically as ‘more increasing than a martingale’. If Z_n is a martingale, then |Z_n| is a submartingale. This will be useful almost immediately.

The statement is that for (Z_n) a non-negative submartingale,

\mathbb{P}\left( \sup_{k\le n} Z_k \ge \lambda\right) \le \frac{\mathbb{E}\left[Z_n\right]}{\lambda}.

The similarity of the statement to the statement of Markov’s inequality is no accident. Indeed the proof is very similar. We consider whether the event in question happens, and find lower bounds on the expectation of Z_n under both possibilities.

Formally, for ease of notation, let Z_n^* be the running maximum \sup_{k\le n}Z_k. Then, we let T:= n\wedge \inf\{k\le n, M_j\ge \lambda\} and apply the optional stopping theorem for submartingales at T, which is by construction at most n. That is

\mathbb{E}[Z_n]\ge \mathbb{E}[Z_T]=\mathbb{E}\left[Z_T\mathbf{1}_{Z_n^*<\lambda}\right] + \mathbb{E}\left[Z_T \mathbf{1}_{Z_n^*\ge \lambda}\right].

The first of these summands is positive, and the second is at least \lambda \mathbb{P}\left( Z_N^* \ge \lambda \right), from which the result follows.

We’ve already said that for any martingale Z_n, |Z_n| is a submartingale, but in fact f(Z_n) is a submartingale whenever f is convex, and \mathbb{E}|f(Z_n)|<\infty for each n. Naturally, this continues to hold when Z_n is itself a submartingale.

[Note that Z_n^* is also a submartingale, but this probably isn’t as interesting.]

A particularly relevant such function f is f(x)=x^p, for p>1. If we take Z_n a non-negative submartingale which is uniformly bounded in L^p, then by applying Holder’s inequality and this submartingale inequality, we obtain

\mathbb{E}\left( \sup_{k\le n}Z_n^p \right) \le \left(\frac{p}{p-1}\right)^p \mathbb{E}\left[ Z_n^p \right].

Since Z_n^p is a submartingale, then a limit in n on the RHS is monotone, and certainly a limit in n on the LHS is monotone, so we can extend to

\mathbb{E}\left( \sup_{k\le n}Z_\infty^p \right) \le \left(\frac{p}{1-p}\right)^p \mathbb{E}\left[ Z_\infty^p \right].

Initially, we have to define \mathbb{E}\left[ Z_\infty^p \right] through this limit, but in fact this result, Doob’s L^p inequality, shows that Z_\infty:= \lim Z_n exists almost surely as well.

Naturally, we will often apply this in the case p=2, and in the third of these three sections, we will see why it might be particularly straightforward to calculate \mathbb{E}\left[Z_\infty^2\right].

Remark: as in the case of Markov’s inequality, it’s hard to say much if the submartingale is not taken to be non-negative. Indeed, this effect can be seen even if the process is only defined for a single time step, for which the statement really is then Markov’s inequality.

Doob-Meyer decomposition

Unfortunately, most processes are not martingales. Given an discrete-time process X_n adapted to \mathcal{F}=(\mathcal{F}_n), it is a martingale if the conditional expectations of the increments are all almost surely zero. But given a general adapted process X_n which is integrable (so the increments have well-defined finite expectation), we can iteratively construct a new process M_n, where the increments are centred versions of X_n‘s increments. That is,

M_{n+1}-M_n:= X_{n+1}-X_n - \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right] = X_{n+1}-\mathbb{E}\left[X_{n+1} \,\big|\, \mathcal{F}_n\right]. (*)

Then it’s immediately clear from the definition that M_n is a martingale.

There’s a temptation to tie oneself up in knots with the dependence. We might have that increments of the original process X_n depend on the current value of the process. And is it necessarily clear that we can recover the current value of the original process from the current value of M_n? Well, this is why we demand that everything be adapted, rather than just Markov. It’s not the case that M_n should be Markov, but it clearly is adapted.

Now we look at the middle expression in (*), and in particular the term we are subtracting, namely the conditional expectation. If we define, in the standard terminology, A_0=0 and

A_{n+1}-A_n:= \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right],

then we have decomposed the original process X_n as the sum of a martingale M_n, and this new process A_n. In particular, note that the increment A_{n+1}-A_n given above is adapted to \mathcal{F}_n, which is a stronger condition than being adapted to \mathcal{F}_{n+1} as we would expect a priori. This property of the process (A_n) is called predictability (or possibly previsibility).

This decomposition X_n=X_0+M_n+A_n as just defined is called the Doob-Meyer decomposition, and there is a unique such decomposition where M_n is a martingale, and A_n is predictable. The proof of uniqueness is very straightforward. We look at the equalities given above as definitions of M_n,A_n, but then work in the opposite direction to show that they must hold if the decomposition holds.

I feel a final heuristic is worthwhile, using the term drift, more normally encountered in the continuous-time setting to describe infinitissimal expected increments. The increments of A_n represent the drift of X_n, and the increments of M_n are what remains from X_n after subtracting the drift. In general, the process to be subtracted to turn a non-martingale into a martingale is called a compensator, and the existence or otherwise of such processes is important but challenging for some classes of continuous-time processes.

In particular, note that when X_n is itself a martingale, then A_n\equiv 0. However, probably the most useful case is when X_n is a submartingale, as then the drift is always non-negative, and so A_n is almost surely increasing. The converse holds too.

This is relevant because this Doob-Meyer decomposition is obviously only a useful tool for treating X_n if we can handle the two processes M_n,A_n easily. We have tools to bound the martingale term, but this previsible term might in general be tricky, and so the case where X_n is a submartingale is good, as increasing processes are much easier than general processes, since bounding the whole process might involve only bounding the final term in many contexts.

Predictable quadratic variation

A particularly relevant example is the square of a martingale, that is X_n=M_n^2, where M_n is a martingale. By the convexity condition discussed earlier, X_n is a submartingale (provided it is integrable, ie M_n is square-integrable), and so the process A_n in its Doob-Meyer decomposition is increasing. This is often called the (predictable) quadratic variation of (X_n).

This predictable quadratic variation is sometimes denoted \langle X_n\rangle. This differs from the (regular) quadratic variation which is defined as the sum of the squares of the increments, that is [X_n]:= \sum_{k=0}^{n-1} (X_{k+1}-X_k)^2. Note that this is adapted, but obviously not previsible. The distinction between these two processes is more important in continuous time. There, they are almost surely equal for a continuous local martingale, but not for eg a Poisson process. (For a Poisson process, the PQV is deterministic, indeed linear, while the (R)QV is almost surely equal to the Poisson process itself.) In the discrete time setting, the regular quadratic variation is not relevant very often, while the predictable quadratic variation is useful, precisely because of this decomposition.

Whenever we have random variables which we then centre, there is a standard trick to apply when treating their variance. That is

A_{n+1}-A_n= \mathbb{E}\left[ M^2_{n+1}-M^2_n \,\big|\, \mathcal{F}_n\right]
= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n^2 +M_n^2
= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n \mathbb{E}\left[ M_{n+1}\,\big|\, \mathcal{F}_n\right] + M_n^2
= \mathbb{E}\left[ \left(M_{n+1}-M_n\right)^2\,\big|\, \mathcal{F}_n\right].

One consequence is seen by taking an ‘overall’ expectation. Because M_n^2-A_n is a martingale,

\mathbb{E}\left[M_n^2\right] = \mathbb{E}\left[A_n\right] = \mathbb{E}\left[M_0^2\right] + \sum_{k=0}^{n-1} \mathbb{E}\left[A_{k+1}-A_k\right]
= \mathbb{E}\left[ M_0^2\right] + \sum_{k=0}^{n-1}\mathbb{E}\left[ \left(M_{k+1}-M_k\right)^2 \right]. (**)

This additive (Pythagorean) property of the square of a martingale is useful in applications where there is reasonably good control on each increment separately.

We can also see this final property without the Doob-Meyer decomposition. For a martingale it is not the case that the increments on disjoint intervals are independent. However, following Williams 12.1 [1], disjoint intervals are orthogonal, in the sense that

\mathbb{E}\left[(M_t-M_s)(M_v-M_u)\right]=0,

whenever s\le t\le u\le v. Then, when we square the expression M_n=M_0+\sum M_{k+1}-M_k, and take expectations, all the cross terms vanish, leaving precisely (*).

References

[1] Williams – Probability with Martingales

I also followed the notes I made in 2011/12 while attending Perla Sousi’s course on Advanced Probability, and Arnab Sen’s subsequent course on Stochastic Calculus, though I can’t find any evidence online for the latter now.

Advertisements

Popoviciu’s Inequality

I’ve just returned to the UK after an excellent stay at the University of British Columbia. More about that will follow in some posts which are being queued. Anyway, I flew back in time to attend the last day of the camp held at Oundle School to select the UK team for this year’s International Mathematical Olympiad, to be held in Cape Town in early July. I chose to give a short session on inequalities, which is a topic I did not enjoy as a student and do not enjoy now, but perhaps that makes it a particularly suitable choice?

We began with a discussion of convexity. Extremely occasionally in olympiads, and merely slightly occasionally in real life, an inequality arises which can be proved by showing that a given function is convex in all its arguments, hence its maximum must be attained at a boundary value in each variable.

In general though, our main experience of convexity will be through the medium of Jensen’s inequality. A worthwhile check is to consider one form of the statement of Jensen’s inequality, with two arguments. We are always given a convex function f defined on an interval I=[a,b], and x,y\in I, and weights \alpha,\beta which sum to 1. Then

\alpha f(x)+\beta f(y)\ge f(\alpha x+\beta y).

How do we prove this? Well, in fact this is the natural definition of convexity for a function. There had initially been vague murmurings that convexity should be defined as a property of the second derivative of the function. But this is somewhat unsatisfactory, as the function f(x)=|x| is certainly convex, but the second derivative does not exist at x=0. One could argue that the second derivative may not be finite at x=0, but is nonetheless positive by defining it as a limit which happens to be infinite in this case. However, I feel it is uncontroversial to take the case of Jensen given above as the definition of convexity. It is after all a geometric property, so why raise objections to a geometric definition?

The general statement of Jensen’s inequality, with the natural definitions, is

\sum_{i} \alpha_i f(x_i)\ge f(\sum_{i}\alpha_ix_i).

This is sometimes called Weighted Jensen in the olympiad community, with ‘ordinary’ Jensen following when the weights are all 1/n. In a probabilistic context, we write

\mathbb{E}[f(X)]\ge f(\mathbb{E}X),

for X any random variable supported on the domain of f. Naturally, X can be continuous as well as discrete, giving an integral version of the discretely weighted statement.

Comparing ‘ordinary’ Jensen and ‘weighted’ Jensen, we see an example of the situation where the more general result is easier to prove. As is often the case in these situations, this arises because the more general conditions allow more ‘elbow room’ to perform an inductive argument. A stronger statement means that assuming the induction hypothesis is more useful! Anyway, I won’t digress too far onto the proof of discrete ‘weighted’ Jensen as it is a worthwhile exercise for olympiad students.

What I wanted to discuss principally was an inequality due to Tiberiu Popoviciu:

\frac13[f(x)+f(y)+f(z)]+f(\frac{x+y+z}{3})\ge \frac23[f(\frac{x+y}{2})+f(\frac{y+z}{2})+f(\frac{z+x}{2})].

We might offer the following highly vague intuition. Jensen asserts that for sums of the form \sum f(x_i), you get larger sums if the points are more spread out. The effect of taking the mean is immediately to bring all the points as close together as possible. But Popoviciu says that this effect is so pronounced that even with only half the weight on the outer points (and the rest as close together as possible), it still dominates a system with the points twice as close together.

So how to prove it? I mentioned that there is, unsurprisingly, a weighted version of this result, which was supposed to act as a hint to avoid getting too hung up about midpoints. One can draw nice diagrams with a triangle of points (x,f(x)), (y,f(y)), (z,f(z)) and draw midpoints, medians and centroids, but the consensus seemed to be that this didn’t help much.

I had tried breaking up the LHS into three symmetric portions, and using weighted Jensen to obtain terms on the RHS, but this also didn’t yield much, so I warned the students against this approach unless they had a specific reason to suppose it might succeed.

Fortunately, several of the students decided to ignore this advice, and though most fell into a similar problem I had experienced, Joe found that by actively avoiding symmetry, a decomposition into two cases of Jensen could be made. First we assume WLOG that x\le y \le z, and so by standard Jensen, we have

\frac13[f(x)+f(y)]\ge \frac23 f(\frac{x+y}{2}).

It remains to show

\frac13 f(z)+f(\frac{x+y+z}{3})\ge \frac23[f(\frac{x+z}{2})+f(\frac{y+z}{2})].

If we multiply by ¾, then we have an expression on each side that looks like the LHS of Weighted Jensen. At this point, it is worth getting geometric again. One way to visualise Jensen is that for a convex function, a chord between two points on the function lies above the function. (For standard Jensen with two variables, in particular the midpoint lies above the function.) But indeed, suppose we have values x_1<x_2<y_2<y_1, then the chord between f(x_1),f(y_1) lies strictly above the chord between f(x_2),f(y_2). Making precisely such a comparison gives the result required above. If you want to be more formal about it, you could consider replacing the values of f between x_2,y_2 with a straight line, then applying Jensen to this function. Linearity allows us to move the weighting in and out of the brackets on the right hand side, whenever the mean lies in this straight line interval.

Convex function chordsHopefully the diagram above helps. Note that we can compare the heights of the blue points (with the same abscissa), but obviously not the red points!

In any case, I was sceptical about whether this method would work for the weighted version of Popoviciu’s inequality

\alpha f(x)+\beta f(y) + \gamma f(z)+f(\alpha x+\beta y+\gamma z)\ge

(\alpha+\beta)f(\frac{\alpha x+\beta y}{\alpha+\beta})+(\beta+\gamma)f(\frac{\beta y + \gamma z}{\beta+\gamma})+(\gamma+\alpha)f(\frac{\gamma z+\alpha x}{\gamma+\alpha}).

It turns out though, that it works absolutely fine. I would be interested to see a solution to the original statement making use of the medians and centroid, as then by considering general Cevians the more general inequality might follow.

That’s all great, but my main aim had been to introduce one trick which somewhat trivialises the problem. Note that in the original statement of Popoviciu, we have a convex function, but we only evaluate it at seven points. So for given x,y,z, it makes no difference if we replace the function f with a piece-wise linear function going through the correct seven points. This means that if we can prove that the inequality for any convex piece-wise linear function with at most eight linear parts then we are done.

(There’s a subtlety here. Note that we will prove the inequality for all such functions and all x,y,z, but we will only use this result when x,y,z and their means are the points where the function changes gradient.)

So we consider the function

g_a(x)=\begin{cases}0& x\le 0\\ ax & x\ge 0\end{cases}

for some positive value of a. It is not much effort to check that this satisfies Popoviciu. It is also easy to check that the constant function, and the linear function g(x)=bx also satisfy the inequality. We now prove that we can write the piece-wise linear function as a sum of functions which satisfy the inequality, and hence the piece-wise linear function satisfies the inequality.

Suppose we have a convex piecewise linear function h(x) where x_1<x_2<\ldots<x_n are the points where the derivative changes. We write

a_i=h'(x_i+)-h'(x_i'),\quad a_0=h'(x_1-),

for the change in gradient of h around point x_i. Crucially, because h is convex, we have a_i\ge 0. Then we can write h as

h(x)=C+ a_0x+g_{a_1}(x-x_1)+\ldots+g_{a_{n}}(x-x_n),

for a suitable choice of the constant C. This result comes according to [1] as an intermediate step in a short paper of Hardy, Littlewood and Polya, which I can’t currently find online. Note that inequalities are preserved under addition (but not under subtraction) so it follows that h satisfies Popoviciu, and so the original function f satisfies it too for the values of x,y,z chosen. These were arbitrary (but were used to construct h), and hence f satisfies the inequality for all x,y,z.

Some further generalisations can be found in [1]. With more variables, there are more interesting combinatorial aspects that must be checked, regarding the order of the various weighted means.

[1] – D. Grinberg – Generalizations of Popoviciu’s Inequality. arXiv

Enhanced by Zemanta

Large Deviations 5 – Stochastic Processes and Mogulskii’s Theorem

Motivation

In the previous posts about Large Deviations, most of the emphasis has been on the theory. To summarise briefly, we have a natural idea that for a family of measures supported on the same metric space, increasingly concentrated as some index grows, we might expect the probability of seeing values in a set not containing the limit in distribution to grow exponentially. The canonical example is the sample mean of a family of IID random variables, as treated by Cramer’s theorem.

It becomes apparent that it will not be enough to specify the exponent for a given large deviation event just by taking the infimum of the rate function, so we have to define an LDP topologically, with different behaviour on open and closed sets. Now we want to find some LDPs for more complicated measures, but which will have genuinely non-trivial applications. The key idea in all of this is that the infimum present in the definition of an LDP doesn’t just specify the rate function, it also might well give us some information about the configurations or events that lead to the LDP.

The slogan for the LDP as in Frank den Hollander’s excellent book is: “A large deviation event will happen in the least unlikely of all the unlikely ways.” This will be useful when our underlying space is a bit more complicated.

Setup

As a starting point, consider the set-up for Cramer’s theorem, with IID X_1,\ldots,X_n. But instead of investigating LD behaviour for the sample mean, we investigate LD behaviour for the whole set of RVs. There is a bijection between sequences and the partial sums process, so we investigate the partial sums process, rescaled appropriately. For the moment this is a sequence not a function or path (continuous or otherwise), but in the limit it will be, and furthermore it won’t make too much difference whether we interpolate linearly or step-wise.

Concretely, we consider the rescaled random walk:

Z_n(t):=\tfrac{1}{n}\sum_{i=1}^{[nt]}X_i,\quad t\in[0,1],

with laws \mu_n supported on L_\infty([0,1]). Note that the expected behaviour is a straight line from (0,0) to (1,\mathbb{E}X_1). In fact we can say more than that. By Donsker’s theorem we have a functional version of a central limit theorem, which says that deviations from this expected behaviour are given by suitably scaled Brownian motion:

\sqrt{n}\left(\frac{Z_n(t)-t\mathbb{E}X}{\sqrt{\text{Var}(X_1)}}\right)\quad\stackrel{d}{\rightarrow}\quad B(t),\quad t\in[0,1].

This is what we expect ‘standard’ behaviour to look like:

mog1 - Copy

The deviations from a straight line are on a scale of \sqrt{n}. Here are two examples of potential large deviation behaviour:

mog2 - Copy

Or this:

mog3 - Copy

Note that these are qualitatively different. In the first case, the first half of the random variables are in general much larger than the second half, which appear to have empirical mean roughly 0. In the second case, a large deviation in overall mean is driven by a single very large value. It is obviously of interest to find out what the probabilities of each of these possibilities are.

We can do this via an LDP for (\mu_n). Now it is really useful to be working in a topological context with open and closed sets. It will turn out that the rate function is supported on absolutely continuous functions, whereas obviously for finite n, none of the sample paths are continuous!

We assume that \Lambda(\lambda) is the logarithmic moment generating function of X_1 as before, with \Lambda^*(x) the Fenchel-Legendre transform. Then the key result is:

Theorem (Mogulskii): The measures (\mu_n) satisfy an LDP on L_\infty([0,1]) with good rate function:

I(\phi)=\begin{cases}\int_0^1 \Lambda^*(\phi'(t))dt,&\quad \text{if }\phi\in\mathcal{AC}, \phi(0)=0,\\ \infty&\quad\text{otherwise,}\end{cases}

where AC is the space of absolutely continuous functions on [0,1]. Note that AC is dense in L_\infty([0,1]), so any open set contains a \phi for which I(\phi) is at least in principle finite. (Obviously, if \Lambda^* is not finite everywhere, then extra restrictions of \phi' are required.)

The following picture may be helpful at providing some motivation:

CptPath2

So what is going on is that if we take a path and zoom in on some small interval around a point, note first that behaviour on this interval is independent of behaviour everywhere else. Then the gradient at the point is the local empirical mean of the random variables around this point in time. The probability that this differs from the actual mean is given by Cramer’s rate function applied to the empirical mean, so we obtain the rate function for the whole path by integrating.

More concretely, but still very informally, suppose there is some \phi'(t)\neq \mathbb{E}X, then this says that:

Z_n(t+\delta t)-Z_n(t)=\phi'(t)\delta t+o(\delta t),

\Rightarrow\quad \mu_n\Big(\phi'(t)\delta t+o(\delta t)=\frac{1}{n}\sum_{i=nt+1}^{n(t+\delta t)}X_i\Big),

= \mu_n\Big( \phi'(t)+o(1)=\frac{1}{n\delta t}\sum_{i=1}^{n\delta t}X_i\Big)\sim e^{-n\delta t\Lambda^*(\phi'(t))},

by Cramer. Now we can use independence:

\mu_n(Z_n\approx \phi)=\prod_{\delta t}e^{-n\delta t \Lambda^*(\phi'(t))}=e^{-\sum_{\delta t}n\delta t \Lambda^*(\phi'(t))}\approx e^{-n\int_0^1 \Lambda^*(\phi'(t))dt},

as in fact is given by Mogulskii.

Remarks

1) The absolutely continuous requirement is useful. We really wouldn’t want to be examining carefully the tail of the underlying distribution to see whether it is possible on an exponential scale that o(n) consecutive RVs would have sum O(n).

2) In general \Lambda^*(x) will be convex, which has applications as well as playing a useful role in the proof. Recalling den Hollander’s mantra, we are interested to see where infima hold for LD sets in the host space. So for the event that the empirical mean is greater than some threshold larger than the expectation, Cramer’s theorem told us that this is exponentially the same as same the empirical mean is roughly equal to the threshold. Now Mogulskii’s theorem says more. By convexity, we know that the integral functional for the rate function is minimised by straight lines. So we learn that the contributions to the large deviation are spread roughly equally through the sample. Note that this is NOT saying that all the random variables will have the same higher than expected value. The LDP takes no account of fluctuations in the path on a scale smaller than n. It does however rule out both of the situations pictured a long way up the page. We should expect to see roughly a straight line, with unexpectedly steep gradient.

3) The proof as given in Dembo and Zeitouni is quite involved. There are a few stages, the first and simplest of which is to show that it doesn’t matter on an exponential scale whether we interpolate linearly or step-wise. Later in the proof we will switch back and forth at will. The next step is to show the LDP for the finite-dimensional problem given by evaluating the path at finitely many points in [0,1]. A careful argument via the Dawson-Gartner theorem allows lifting of the finite-dimensional projections back to the space of general functions with the topology of pointwise convergence. It remains to prove that the rate function is indeed the supremum of the rate functions achieved on projections. Convexity of \Lambda^*(x) is very useful here for the upper bound, and this is where it comes through that the rate function is infinite when the comparison path is not absolutely continuous. To lift to the finer topology of L_\infty([0,1]) requires only a check of exponential tightness in the finer space, which follows from Arzela-Ascoli after some work.

In conclusion, it is fairly tricky to prove even this most straightforward case, so unsurprisingly it is hard to extend to the natural case where the distributions of the underlying RVs (X) change continuously in time, as we will want for the analysis of more combinatorial objects. Next time I will consider why it is hard but potentially interesting to consider with adaptations of these techniques an LDP for the size of the largest component in a sparse random graph near criticality.

Mixing Times 3 – Convex Functions on the Space of Measures

The meat of this course covers rate of convergence of the distribution of Markov chains. In particular, we want to be thinking about lots of distributions simultaneously, so we really to be comfortable working with the space of measures on a (for now) finite state space. This is not really too bad actually, since we can embed it in a finite-dimensional real vector space.

\mathcal{M}_1(E)=\{(x_v:v\in\Omega),x_v\geq 0, \sum x_v=1\}\subset \mathbb{R}^\Omega.

Since most operations we might want to apply to distributions are linear, it doesn’t make much sense to inherit the usual Euclidean metric. In the end, the metric we use is the same as the L_1 metric, but the motivation is worth exploring. Typically, the size of |\Omega| will be function of n, a parameter which will tend to infinity. So we do not want to be too rooted in the actual set \Omega for what will follow.

Perhaps the best justification for total variation distance is from a gambling viewpoint. Suppose your opinion for the distribution of some outcome is \mu, and a bookmaker has priced their odds according to their evaluation of the outcome as \nu. You want to make the most money, assuming that your opinion of the distribution is correct (which in your opinion, of course it is!). So assuming the bookmaker will accept an arbitrarily complicated (but finite obviously, since there are only |\Omega| possible outcomes) bet, you want to place money on whichever event evinces the greatest disparity between your measure of likeliness and the bookmaker’s. If you can find an event which you think is very likely, and which the bookmaker thinks is unlikely, you are (again, according to your own opinion of the measure) on for a big profit. This difference is the total variation distance ||\mu-\nu||_{TV}.

Formally, we define:

||\mu-\nu||_{TV}:=\max_{A\subset\Omega}|\mu(A)-\nu(A)|.

Note that if this maximum is achieved at A, it is also achieved at A^c, and so we might as well go with the original intuition of

||\mu-\nu||_{TV}=\max_{A\subset\Omega} \left[\mu(A)-\nu(A)\right].

If we decompose \mu(A)=\sum_{x\in A}\mu(x), and similarly for A^c, then add the results, we obtain:

||\mu-\nu||_{TV}=\frac12\sum_{x\in\Omega}|\mu(A)-\nu(A)|.

There are plenty of other interesting interpretations of total variation distance, but I don’t want to get bogged down right now. We are interested in the rate of convergence of distributions of Markov chains. Given some initial distribution \lambda of X_0, we are interested in ||\lambda P^t-\pi||_{TV}. The problem is that doing everything in terms of some general \lambda is really annoying, at the very least for notational reasons. So really we want to investigate

d(t)=\max_{\lambda\in\mathcal{M}_1(E)}||\lambda P^t-\pi||_{TV},

the worst-case scenario, where we choose the initial distribution that mixes the slowest, at least judging at time t. Now, here’s where the space of measures starts to come in useful. For now, we relax the requirement that measures must be probability distributions. In fact, we allow them to be negative as well. Then \lambda P^t-\pi is some signed measure on \Omega with zero total mass.

But although I haven’t yet been explicit about this, it is easy to see that ||\cdot||_{TV} is a norm on this space. In fact, it is (equivalent to – dividing by 1/2 makes no difference!) the product norm of the L_1 norm as defined before. Recall the norms are convex functions. This is an immediate consequence of the triangle inequality. The set of suitable distributions \lambda is affine, because an affine combination of probability distributions is another probability distribution.

Then, we know from linear optimisation theory, that convex functions on an affine space achieve their maxima at boundary points. And the boundary points for this definition of \lambda\in\mathcal{M}_1(E), are precisely the delta-measures at some point of the state space \delta_v. So in fact, we can replace our definition of d(t) by:

d(t)=\max_{x\in\Omega}||P^t(x,\cdot)-\pi||_{TV},

where P^t(x,\cdot) is the same as (\delta_x P^t)(\cdot). Furthermore, we can immediately apply this idea to get a second result for free. In some problems, particularly those with neat couplings across all initial distributions, it is easier to work with a larger class of transition probabilities, rather than the actual equilibrium distribution, so we define:

\bar{d}(t):=\max_{x,y\in\Omega}||P^t(x,\cdot)-P^t(y,\cdot)||_{TV}.

The triangle inequality gives \bar{d}(t)\leq 2d(t) immediately. But we want to show d(t)\leq \bar{d}(t), and we can do that as before, by considering

\max_{\lambda,\mu\in\mathcal{M}_1(E)}||\lambda P^t-\mu P^t||_{TV}.

The function we are maximising is a convex function on \mathcal{M}_1(E)^2, and so it attains its maximum at a boundary point, which must be \lambda=\delta_x,\mu=\delta_y. Hence \bar{d}(t) is equal to the displayed expression above, which is certainly greater than or equal to the original formulation of d(t), as this is the maximum of the same expression over a strict subset.

I’m not suggesting this method is qualitatively different to that proposed by the authors of the book. However, I think this is very much the right way to be thinking about these matters of maximising norms over a space of measures. Partly this is good because it gives an easy ‘sanity check’ for any idea. But also because it gives some idea of whether it will or won’t be possible to extend the ideas to the case where the state space is infinite, which will be of interest much later.