The meat of this course covers rate of convergence of the distribution of Markov chains. In particular, we want to be thinking about lots of distributions simultaneously, so we really to be comfortable working with the space of measures on a (for now) finite state space. This is not really too bad actually, since we can embed it in a finite-dimensional real vector space.
Since most operations we might want to apply to distributions are linear, it doesn’t make much sense to inherit the usual Euclidean metric. In the end, the metric we use is the same as the metric, but the motivation is worth exploring. Typically, the size of
will be function of n, a parameter which will tend to infinity. So we do not want to be too rooted in the actual set
for what will follow.
Perhaps the best justification for total variation distance is from a gambling viewpoint. Suppose your opinion for the distribution of some outcome is , and a bookmaker has priced their odds according to their evaluation of the outcome as
. You want to make the most money, assuming that your opinion of the distribution is correct (which in your opinion, of course it is!). So assuming the bookmaker will accept an arbitrarily complicated (but finite obviously, since there are only
possible outcomes) bet, you want to place money on whichever event evinces the greatest disparity between your measure of likeliness and the bookmaker’s. If you can find an event which you think is very likely, and which the bookmaker thinks is unlikely, you are (again, according to your own opinion of the measure) on for a big profit. This difference is the total variation distance
.
Formally, we define:
Note that if this maximum is achieved at A, it is also achieved at , and so we might as well go with the original intuition of
If we decompose , and similarly for
, then add the results, we obtain:
There are plenty of other interesting interpretations of total variation distance, but I don’t want to get bogged down right now. We are interested in the rate of convergence of distributions of Markov chains. Given some initial distribution of
, we are interested in
. The problem is that doing everything in terms of some general
is really annoying, at the very least for notational reasons. So really we want to investigate
the worst-case scenario, where we choose the initial distribution that mixes the slowest, at least judging at time t. Now, here’s where the space of measures starts to come in useful. For now, we relax the requirement that measures must be probability distributions. In fact, we allow them to be negative as well. Then is some signed measure on
with zero total mass.
But although I haven’t yet been explicit about this, it is easy to see that is a norm on this space. In fact, it is (equivalent to – dividing by 1/2 makes no difference!) the product norm of the
norm as defined before. Recall the norms are convex functions. This is an immediate consequence of the triangle inequality. The set of suitable distributions
is affine, because an affine combination of probability distributions is another probability distribution.
Then, we know from linear optimisation theory, that convex functions on an affine space achieve their maxima at boundary points. And the boundary points for this definition of , are precisely the delta-measures at some point of the state space
. So in fact, we can replace our definition of d(t) by:
where is the same as
. Furthermore, we can immediately apply this idea to get a second result for free. In some problems, particularly those with neat couplings across all initial distributions, it is easier to work with a larger class of transition probabilities, rather than the actual equilibrium distribution, so we define:
The triangle inequality gives immediately. But we want to show
, and we can do that as before, by considering
The function we are maximising is a convex function on , and so it attains its maximum at a boundary point, which must be
. Hence
is equal to the displayed expression above, which is certainly greater than or equal to the original formulation of d(t), as this is the maximum of the same expression over a strict subset.
I’m not suggesting this method is qualitatively different to that proposed by the authors of the book. However, I think this is very much the right way to be thinking about these matters of maximising norms over a space of measures. Partly this is good because it gives an easy ‘sanity check’ for any idea. But also because it gives some idea of whether it will or won’t be possible to extend the ideas to the case where the state space is infinite, which will be of interest much later.
Related articles
- Mixing Times 1- Reversing Markov Chains
- Mixing Times 2 – Metropolis Chains
- Split-Bregman for total variation: parameter choice (mirror2image.wordpress.com)
- Hormander-Minlin (nucaltiado.wordpress.com)