# Doob inequalities and Doob-Meyer decomposition

The first post I wrote on this blog was about martingales, way back in 2012 at a time when I had known what a martingale was for about a month. I now don’t have this excuse. So I’m going to write about a couple of properties of (discrete-time) martingales that came up while adjusting a proof which my thesis examiners suggested could be made much shorter as part of their corrections.

Doob’s submartingale inequality

When we prove that some sequence of processes converges to some other process, we typically want to show that this holds in some sense uniformly over a time-interval, rather than just at some fixed time. We don’t lose much at this level of vagueness by taking the limit process to be identically zero. Then, if the convergent processes are martingales or closely similar, we want to be able to bound $\sup_{k\le n} |Z_k|$ in some sense.

Doob’s submartingale inequality allows us to do this. Recall that a submartingale has almost-surely non-negative conditional increments. You might think of it heuristically as ‘more increasing than a martingale’. If $Z_n$ is a martingale, then $|Z_n|$ is a submartingale. This will be useful almost immediately.

The statement is that for $(Z_n)$ a non-negative submartingale,

$\mathbb{P}\left( \sup_{k\le n} Z_k \ge \lambda\right) \le \frac{\mathbb{E}\left[Z_n\right]}{\lambda}.$

The similarity of the statement to the statement of Markov’s inequality is no accident. Indeed the proof is very similar. We consider whether the event in question happens, and find lower bounds on the expectation of $Z_n$ under both possibilities.

Formally, for ease of notation, let $Z_n^*$ be the running maximum $\sup_{k\le n}Z_k$. Then, we let $T:= n\wedge \inf\{k\le n, M_j\ge \lambda\}$ and apply the optional stopping theorem for submartingales at T, which is by construction at most n. That is

$\mathbb{E}[Z_n]\ge \mathbb{E}[Z_T]=\mathbb{E}\left[Z_T\mathbf{1}_{Z_n^*<\lambda}\right] + \mathbb{E}\left[Z_T \mathbf{1}_{Z_n^*\ge \lambda}\right].$

The first of these summands is positive, and the second is at least $\lambda \mathbb{P}\left( Z_N^* \ge \lambda \right)$, from which the result follows.

We’ve already said that for any martingale $Z_n$, $|Z_n|$ is a submartingale, but in fact $f(Z_n)$ is a submartingale whenever f is convex, and $\mathbb{E}|f(Z_n)|<\infty$ for each n. Naturally, this continues to hold when $Z_n$ is itself a submartingale.

[Note that $Z_n^*$ is also a submartingale, but this probably isn’t as interesting.]

A particularly relevant such function f is $f(x)=x^p$, for p>1. If we take $Z_n$ a non-negative submartingale which is uniformly bounded in $L^p$, then by applying Holder’s inequality and this submartingale inequality, we obtain

$\mathbb{E}\left( \sup_{k\le n}Z_n^p \right) \le \left(\frac{p}{p-1}\right)^p \mathbb{E}\left[ Z_n^p \right].$

Since $Z_n^p$ is a submartingale, then a limit in n on the RHS is monotone, and certainly a limit in n on the LHS is monotone, so we can extend to

$\mathbb{E}\left( \sup_{k\le n}Z_\infty^p \right) \le \left(\frac{p}{1-p}\right)^p \mathbb{E}\left[ Z_\infty^p \right].$

Initially, we have to define $\mathbb{E}\left[ Z_\infty^p \right]$ through this limit, but in fact this result, Doob’s $L^p$ inequality, shows that $Z_\infty:= \lim Z_n$ exists almost surely as well.

Naturally, we will often apply this in the case p=2, and in the third of these three sections, we will see why it might be particularly straightforward to calculate $\mathbb{E}\left[Z_\infty^2\right].$

Remark: as in the case of Markov’s inequality, it’s hard to say much if the submartingale is not taken to be non-negative. Indeed, this effect can be seen even if the process is only defined for a single time step, for which the statement really is then Markov’s inequality.

Doob-Meyer decomposition

Unfortunately, most processes are not martingales. Given an discrete-time process $X_n$ adapted to $\mathcal{F}=(\mathcal{F}_n)$, it is a martingale if the conditional expectations of the increments are all almost surely zero. But given a general adapted process $X_n$ which is integrable (so the increments have well-defined finite expectation), we can iteratively construct a new process $M_n$, where the increments are centred versions of $X_n$‘s increments. That is,

$M_{n+1}-M_n:= X_{n+1}-X_n - \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right] = X_{n+1}-\mathbb{E}\left[X_{n+1} \,\big|\, \mathcal{F}_n\right].$ (*)

Then it’s immediately clear from the definition that $M_n$ is a martingale.

There’s a temptation to tie oneself up in knots with the dependence. We might have that increments of the original process $X_n$ depend on the current value of the process. And is it necessarily clear that we can recover the current value of the original process from the current value of $M_n$? Well, this is why we demand that everything be adapted, rather than just Markov. It’s not the case that $M_n$ should be Markov, but it clearly is adapted.

Now we look at the middle expression in (*), and in particular the term we are subtracting, namely the conditional expectation. If we define, in the standard terminology, $A_0=0$ and

$A_{n+1}-A_n:= \mathbb{E}\left[ X_{n+1}-X_n \,\big|\, \mathcal{F}_n\right],$

then we have decomposed the original process $X_n$ as the sum of a martingale $M_n$, and this new process $A_n$. In particular, note that the increment $A_{n+1}-A_n$ given above is adapted to $\mathcal{F}_n$, which is a stronger condition than being adapted to $\mathcal{F}_{n+1}$ as we would expect a priori. This property of the process $(A_n)$ is called predictability (or possibly previsibility).

This decomposition $X_n=X_0+M_n+A_n$ as just defined is called the Doob-Meyer decomposition, and there is a unique such decomposition where $M_n$ is a martingale, and $A_n$ is predictable. The proof of uniqueness is very straightforward. We look at the equalities given above as definitions of $M_n,A_n$, but then work in the opposite direction to show that they must hold if the decomposition holds.

I feel a final heuristic is worthwhile, using the term drift, more normally encountered in the continuous-time setting to describe infinitissimal expected increments. The increments of $A_n$ represent the drift of $X_n$, and the increments of $M_n$ are what remains from $X_n$ after subtracting the drift. In general, the process to be subtracted to turn a non-martingale into a martingale is called a compensator, and the existence or otherwise of such processes is important but challenging for some classes of continuous-time processes.

In particular, note that when $X_n$ is itself a martingale, then $A_n\equiv 0$. However, probably the most useful case is when $X_n$ is a submartingale, as then the drift is always non-negative, and so $A_n$ is almost surely increasing. The converse holds too.

This is relevant because this Doob-Meyer decomposition is obviously only a useful tool for treating $X_n$ if we can handle the two processes $M_n,A_n$ easily. We have tools to bound the martingale term, but this previsible term might in general be tricky, and so the case where $X_n$ is a submartingale is good, as increasing processes are much easier than general processes, since bounding the whole process might involve only bounding the final term in many contexts.

A particularly relevant example is the square of a martingale, that is $X_n=M_n^2$, where $M_n$ is a martingale. By the convexity condition discussed earlier, $X_n$ is a submartingale (provided it is integrable, ie $M_n$ is square-integrable), and so the process $A_n$ in its Doob-Meyer decomposition is increasing. This is often called the (predictable) quadratic variation of $(X_n)$.

This predictable quadratic variation is sometimes denoted $\langle X_n\rangle$. This differs from the (regular) quadratic variation which is defined as the sum of the squares of the increments, that is $[X_n]:= \sum_{k=0}^{n-1} (X_{k+1}-X_k)^2$. Note that this is adapted, but obviously not previsible. The distinction between these two processes is more important in continuous time. There, they are almost surely equal for a continuous local martingale, but not for eg a Poisson process. (For a Poisson process, the PQV is deterministic, indeed linear, while the (R)QV is almost surely equal to the Poisson process itself.) In the discrete time setting, the regular quadratic variation is not relevant very often, while the predictable quadratic variation is useful, precisely because of this decomposition.

Whenever we have random variables which we then centre, there is a standard trick to apply when treating their variance. That is

$A_{n+1}-A_n= \mathbb{E}\left[ M^2_{n+1}-M^2_n \,\big|\, \mathcal{F}_n\right]$
$= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n^2 +M_n^2$
$= \mathbb{E}\left[ M^2_{n+1}\,\big|\, \mathcal{F}_n\right] - 2M_n \mathbb{E}\left[ M_{n+1}\,\big|\, \mathcal{F}_n\right] + M_n^2$
$= \mathbb{E}\left[ \left(M_{n+1}-M_n\right)^2\,\big|\, \mathcal{F}_n\right].$

One consequence is seen by taking an ‘overall’ expectation. Because $M_n^2-A_n$ is a martingale,

$\mathbb{E}\left[M_n^2\right] = \mathbb{E}\left[A_n\right] = \mathbb{E}\left[M_0^2\right] + \sum_{k=0}^{n-1} \mathbb{E}\left[A_{k+1}-A_k\right]$
$= \mathbb{E}\left[ M_0^2\right] + \sum_{k=0}^{n-1}\mathbb{E}\left[ \left(M_{k+1}-M_k\right)^2 \right].$ (**)

This additive (Pythagorean) property of the square of a martingale is useful in applications where there is reasonably good control on each increment separately.

We can also see this final property without the Doob-Meyer decomposition. For a martingale it is not the case that the increments on disjoint intervals are independent. However, following Williams 12.1 [1], disjoint intervals are orthogonal, in the sense that

$\mathbb{E}\left[(M_t-M_s)(M_v-M_u)\right]=0,$

whenever $s\le t\le u\le v$. Then, when we square the expression $M_n=M_0+\sum M_{k+1}-M_k$, and take expectations, all the cross terms vanish, leaving precisely (*).

References

[1] Williams – Probability with Martingales

I also followed the notes I made in 2011/12 while attending Perla Sousi’s course on Advanced Probability, and Arnab Sen’s subsequent course on Stochastic Calculus, though I can’t find any evidence online for the latter now.

# Azuma-Hoeffding Inequality

It’s (probably) my last Michaelmas term in Oxford, at least for the time being, and so also the last time giving tutorials on either of the probability courses that students take in their first two years. This time, I’m teaching the second years, and as usual the aim of the majority of the first half of the course is to acquire as sophisticated an understanding as possible of the Central Limit Theorem. I feel a key step is appreciating that CLT tells you about the correct scaling for the deviations from the mean of these partial sums of IID random variables. The fact that these deviations on this correct scaling converge in law to a normal distribution, irrespective (apart from mild conditions) on the underlying distribution, is interesting, but should be viewed as a secondary, bonus, property.

Emphasising the scaling of deviations in CLT motivates the next sections of this (or any) course. We develop tools like Markov’s inequality to control the probability that a random variable is much larger than its expectation, and experiment with applying this to various functions of the random variable to get stronger bounds. When the moment generating function exists, this is an excellent choice for this analysis. We end up with a so-called Chernoff bound. For example, we might consider the probability that when we toss N coins, at least a proportion ¾ are Heads. A Chernoff bound says that this probability decays exponentially in N.

One direction to take is to ask how to control precisely the parameter of this exponential decay, which leads to Cramer’s theorem and the basis of the theory of Large Deviations. An alternative direction is to observe that the signed difference between the partial sums of independent random variables and their means is an example of a martingale, albeit not a very interesting one, since in general the increments of a martingale are not independent. So we might ask: under what circumstances can we show exponential tail bounds on the deviation of a martingale from its mean (that is, its initial value) at a fixed (perhaps large) time?

Azuma-Hoeffding inequality

The following result was derived and used by various authors in the 60s, including Azuma and Hoeffding (separately), but also others.

Let $X_0,X_1,X_2,\ldots$ be a martingale with respect to some filtration, and we assume that the absolute value of each increment $|X_i-X_{i-1}|$ is bounded almost surely by some $c_i<\infty$. Then, recalling that $\mathbb{E}[X_n|\mathcal{F}_0]=X_0$, we have

$\mathbb{P}(X_n \ge X_0+t) \le \exp\left( -\frac{t^2}{2\sum_{i=1}^n c_i^2}\right).$

Proof

We apply a Chernoff argument to each increment. First, observe that for Y a distribution supported on [-1,1] with mean zero, by convexity $\mathbb{E}[e^{tY}]$ is maximised by taking Y equal to +1 and -1 each with probability ½. Thus

$\mathbb{E}[e^{tY}]\le \frac12 e^t + \frac 12 e^{-t}=\cosh(t) \le e^{-t^2/2},$

where the final inequality follows by directly comparing the Taylor series.

We’ll use this shortly. Before that, we start the usual argument for a Chernoff bound on $X_n-X_0$.

$\mathbb{P}(X_n-X_0\ge t) = \mathbb{P}(e^{\theta(X_n-X_0)}\ge e^{\theta t})\le e^{-\theta t} \mathbb{E}[e^{\theta(X_n-X_0)}]$

$= e^{-\theta t} \mathbb{E}[\mathbb{E}[e^{\theta((X_n-X_{n-1}) +X_{n-1}-X_0)} | \mathcal{F}_{n-1}]]$

$= e^{-\theta t} \mathbb{E}[e^{\theta(X_{n-1}-X_0)} \mathbb{E}[e^{\theta(X_n-X_{n-1})}|\mathcal{F}_{n-1}] ],$

and our preliminary result allows us to control this inner expectation

$\le e^{-\theta t} e^{\theta^2c_n^2/2} \mathbb{E}[e^{\theta(X_{n-1}-X_0)}].$

So now we can apply this inductively to obtain

$\mathbb{P}(X_n-X_0\ge t) \le e^{-\theta t+ \theta^2 \sum_{i=1}^n c_i^2}.$

Finally, as usual in such an argument, we need to choose a sensible value of the free parameter $\theta$, and naturally we want to choose it to make this RHS as small as possible, which is achieved when $\theta = \frac{t}{\sum_{i=1}^n c_i^2}$, and leads exactly to the statement of the inequality.

Applications

Unsurprisingly, we can easily apply this to the process of partial sums of IID random variables with mean zero and bounded support, to recover a Chernoff bound.

A more interesting example involves revealing the state (ie open or closed) of the edges of an Erdos-Renyi graph one at a time. We need to examine some quantitative property of the graph which can’t ever be heavily influenced by the presence or non-presence of a single given edge. The size of the largest clique, or the largest cut, are good examples. Adding or removing an edge can change these quantities by at most one.

So if we order the edges, and let the filtration $\mathcal{F}_k$ be generated by the state of the first k edges in this ordering, then $X_k=\mathbb{E}[\text{max cut}| \mathcal{F}_k]$ is a martingale. (A martingale constructed backwards in this fashion by conditioning a final state on a filtration is sometimes called a Doob martingale.) Using A-H on this shows that the deviations from the mean are of order $\sqrt{N}$, where N is the size of the graph. In the sparse case, it can be justified fairly easily that the maximum cut has size $\Theta(N)$, since for example there will always be some positive proportion of isolated vertices. However, accurate asymptotics for the mean of this quantity seem (at least after a brief search of the literature – please do correct me if this is wrong!) to be unknown. So this might be an example of the curious situation where we can control the deviations around the mean better than the mean itself!

Beyond bounded increments

One observation we might make about the proof is that it is tight only if all the increments $X_i-X_{i-1}$ are supported on $\{-c_i,+c_i\}$, which is stronger than demanding that the absolute value is bounded. If in fact we have $X_i-X_{i-1}\in[-d_i,c_i]$ almost surely, then, with a more detailed preliminary lemma, we can have instead a bound of $\exp\left( -\frac{2t^2}{\sum_{i=1}^n (c_i+d_i)^2} \right)$.

While it isn’t a problem in these examples, in many settings the restriction to bounded increments is likely to be the obstacle to applying A-H. Indeed, in the technical corner of my current research problem, this is exactly the challenge I faced. Fortunately, at least in principle, all is not necessarily lost. We might, for example, be able to establish bounds $(c_i)$ as described, such that the probability that any $|X_i-X_{i-1}|$ exceeds its $c_i$ is very small. You could then construct a coupled process $(Y_i)$, that is equal to $X_i$ whenever the increments are within the given range, and something else otherwise. For Y to fit the conditions of A-H, the challenge is to ensure we can do this such that the increments remain bounded (ie the ‘something else’ also has to be within $[-c_i,c_i]$ ) and also that Y remains a martingale. This total probability of a deviation is bounded above by the probability of Y experiencing that deviation, plus the probability of Y and X decoupling. To comment on the latter probability is hard in general without saying a bit more about the dependence structure in X itself.

# Popoviciu’s Inequality

I’ve just returned to the UK after an excellent stay at the University of British Columbia. More about that will follow in some posts which are being queued. Anyway, I flew back in time to attend the last day of the camp held at Oundle School to select the UK team for this year’s International Mathematical Olympiad, to be held in Cape Town in early July. I chose to give a short session on inequalities, which is a topic I did not enjoy as a student and do not enjoy now, but perhaps that makes it a particularly suitable choice?

We began with a discussion of convexity. Extremely occasionally in olympiads, and merely slightly occasionally in real life, an inequality arises which can be proved by showing that a given function is convex in all its arguments, hence its maximum must be attained at a boundary value in each variable.

In general though, our main experience of convexity will be through the medium of Jensen’s inequality. A worthwhile check is to consider one form of the statement of Jensen’s inequality, with two arguments. We are always given a convex function f defined on an interval I=[a,b], and $x,y\in I$, and weights $\alpha,\beta$ which sum to 1. Then

$\alpha f(x)+\beta f(y)\ge f(\alpha x+\beta y).$

How do we prove this? Well, in fact this is the natural definition of convexity for a function. There had initially been vague murmurings that convexity should be defined as a property of the second derivative of the function. But this is somewhat unsatisfactory, as the function $f(x)=|x|$ is certainly convex, but the second derivative does not exist at x=0. One could argue that the second derivative may not be finite at x=0, but is nonetheless positive by defining it as a limit which happens to be infinite in this case. However, I feel it is uncontroversial to take the case of Jensen given above as the definition of convexity. It is after all a geometric property, so why raise objections to a geometric definition?

The general statement of Jensen’s inequality, with the natural definitions, is

$\sum_{i} \alpha_i f(x_i)\ge f(\sum_{i}\alpha_ix_i).$

This is sometimes called Weighted Jensen in the olympiad community, with ‘ordinary’ Jensen following when the weights are all 1/n. In a probabilistic context, we write

$\mathbb{E}[f(X)]\ge f(\mathbb{E}X),$

for X any random variable supported on the domain of f. Naturally, X can be continuous as well as discrete, giving an integral version of the discretely weighted statement.

Comparing ‘ordinary’ Jensen and ‘weighted’ Jensen, we see an example of the situation where the more general result is easier to prove. As is often the case in these situations, this arises because the more general conditions allow more ‘elbow room’ to perform an inductive argument. A stronger statement means that assuming the induction hypothesis is more useful! Anyway, I won’t digress too far onto the proof of discrete ‘weighted’ Jensen as it is a worthwhile exercise for olympiad students.

What I wanted to discuss principally was an inequality due to Tiberiu Popoviciu:

$\frac13[f(x)+f(y)+f(z)]+f(\frac{x+y+z}{3})\ge \frac23[f(\frac{x+y}{2})+f(\frac{y+z}{2})+f(\frac{z+x}{2})].$

We might offer the following highly vague intuition. Jensen asserts that for sums of the form $\sum f(x_i)$, you get larger sums if the points are more spread out. The effect of taking the mean is immediately to bring all the points as close together as possible. But Popoviciu says that this effect is so pronounced that even with only half the weight on the outer points (and the rest as close together as possible), it still dominates a system with the points twice as close together.

So how to prove it? I mentioned that there is, unsurprisingly, a weighted version of this result, which was supposed to act as a hint to avoid getting too hung up about midpoints. One can draw nice diagrams with a triangle of points (x,f(x)), (y,f(y)), (z,f(z)) and draw midpoints, medians and centroids, but the consensus seemed to be that this didn’t help much.

I had tried breaking up the LHS into three symmetric portions, and using weighted Jensen to obtain terms on the RHS, but this also didn’t yield much, so I warned the students against this approach unless they had a specific reason to suppose it might succeed.

Fortunately, several of the students decided to ignore this advice, and though most fell into a similar problem I had experienced, Joe found that by actively avoiding symmetry, a decomposition into two cases of Jensen could be made. First we assume WLOG that $x\le y \le z$, and so by standard Jensen, we have

$\frac13[f(x)+f(y)]\ge \frac23 f(\frac{x+y}{2}).$

It remains to show

$\frac13 f(z)+f(\frac{x+y+z}{3})\ge \frac23[f(\frac{x+z}{2})+f(\frac{y+z}{2})].$

If we multiply by ¾, then we have an expression on each side that looks like the LHS of Weighted Jensen. At this point, it is worth getting geometric again. One way to visualise Jensen is that for a convex function, a chord between two points on the function lies above the function. (For standard Jensen with two variables, in particular the midpoint lies above the function.) But indeed, suppose we have values $x_1, then the chord between $f(x_1),f(y_1)$ lies strictly above the chord between $f(x_2),f(y_2)$. Making precisely such a comparison gives the result required above. If you want to be more formal about it, you could consider replacing the values of f between $x_2,y_2$ with a straight line, then applying Jensen to this function. Linearity allows us to move the weighting in and out of the brackets on the right hand side, whenever the mean lies in this straight line interval.

Hopefully the diagram above helps. Note that we can compare the heights of the blue points (with the same abscissa), but obviously not the red points!

In any case, I was sceptical about whether this method would work for the weighted version of Popoviciu’s inequality

$\alpha f(x)+\beta f(y) + \gamma f(z)+f(\alpha x+\beta y+\gamma z)\ge$

$(\alpha+\beta)f(\frac{\alpha x+\beta y}{\alpha+\beta})+(\beta+\gamma)f(\frac{\beta y + \gamma z}{\beta+\gamma})+(\gamma+\alpha)f(\frac{\gamma z+\alpha x}{\gamma+\alpha}).$

It turns out though, that it works absolutely fine. I would be interested to see a solution to the original statement making use of the medians and centroid, as then by considering general Cevians the more general inequality might follow.

That’s all great, but my main aim had been to introduce one trick which somewhat trivialises the problem. Note that in the original statement of Popoviciu, we have a convex function, but we only evaluate it at seven points. So for given x,y,z, it makes no difference if we replace the function f with a piece-wise linear function going through the correct seven points. This means that if we can prove that the inequality for any convex piece-wise linear function with at most eight linear parts then we are done.

(There’s a subtlety here. Note that we will prove the inequality for all such functions and all x,y,z, but we will only use this result when x,y,z and their means are the points where the function changes gradient.)

So we consider the function

$g_a(x)=\begin{cases}0& x\le 0\\ ax & x\ge 0\end{cases}$

for some positive value of a. It is not much effort to check that this satisfies Popoviciu. It is also easy to check that the constant function, and the linear function g(x)=bx also satisfy the inequality. We now prove that we can write the piece-wise linear function as a sum of functions which satisfy the inequality, and hence the piece-wise linear function satisfies the inequality.

Suppose we have a convex piecewise linear function h(x) where $x_1 are the points where the derivative changes. We write

$a_i=h'(x_i+)-h'(x_i'),\quad a_0=h'(x_1-),$

for the change in gradient of h around point $x_i$. Crucially, because h is convex, we have $a_i\ge 0$. Then we can write h as

$h(x)=C+ a_0x+g_{a_1}(x-x_1)+\ldots+g_{a_{n}}(x-x_n),$

for a suitable choice of the constant C. This result comes according to [1] as an intermediate step in a short paper of Hardy, Littlewood and Polya, which I can’t currently find online. Note that inequalities are preserved under addition (but not under subtraction) so it follows that h satisfies Popoviciu, and so the original function f satisfies it too for the values of x,y,z chosen. These were arbitrary (but were used to construct h), and hence f satisfies the inequality for all x,y,z.

Some further generalisations can be found in [1]. With more variables, there are more interesting combinatorial aspects that must be checked, regarding the order of the various weighted means.

[1] – D. Grinberg – Generalizations of Popoviciu’s Inequality. arXiv

# Large Deviations 5 – Stochastic Processes and Mogulskii’s Theorem

Motivation

In the previous posts about Large Deviations, most of the emphasis has been on the theory. To summarise briefly, we have a natural idea that for a family of measures supported on the same metric space, increasingly concentrated as some index grows, we might expect the probability of seeing values in a set not containing the limit in distribution to grow exponentially. The canonical example is the sample mean of a family of IID random variables, as treated by Cramer’s theorem.

It becomes apparent that it will not be enough to specify the exponent for a given large deviation event just by taking the infimum of the rate function, so we have to define an LDP topologically, with different behaviour on open and closed sets. Now we want to find some LDPs for more complicated measures, but which will have genuinely non-trivial applications. The key idea in all of this is that the infimum present in the definition of an LDP doesn’t just specify the rate function, it also might well give us some information about the configurations or events that lead to the LDP.

The slogan for the LDP as in Frank den Hollander’s excellent book is: “A large deviation event will happen in the least unlikely of all the unlikely ways.” This will be useful when our underlying space is a bit more complicated.

Setup

As a starting point, consider the set-up for Cramer’s theorem, with IID $X_1,\ldots,X_n$. But instead of investigating LD behaviour for the sample mean, we investigate LD behaviour for the whole set of RVs. There is a bijection between sequences and the partial sums process, so we investigate the partial sums process, rescaled appropriately. For the moment this is a sequence not a function or path (continuous or otherwise), but in the limit it will be, and furthermore it won’t make too much difference whether we interpolate linearly or step-wise.

Concretely, we consider the rescaled random walk:

$Z_n(t):=\tfrac{1}{n}\sum_{i=1}^{[nt]}X_i,\quad t\in[0,1],$

with laws $\mu_n$ supported on $L_\infty([0,1])$. Note that the expected behaviour is a straight line from (0,0) to (1,$\mathbb{E}X_1$). In fact we can say more than that. By Donsker’s theorem we have a functional version of a central limit theorem, which says that deviations from this expected behaviour are given by suitably scaled Brownian motion:

$\sqrt{n}\left(\frac{Z_n(t)-t\mathbb{E}X}{\sqrt{\text{Var}(X_1)}}\right)\quad\stackrel{d}{\rightarrow}\quad B(t),\quad t\in[0,1].$

This is what we expect ‘standard’ behaviour to look like:

The deviations from a straight line are on a scale of $\sqrt{n}$. Here are two examples of potential large deviation behaviour:

Or this:

Note that these are qualitatively different. In the first case, the first half of the random variables are in general much larger than the second half, which appear to have empirical mean roughly 0. In the second case, a large deviation in overall mean is driven by a single very large value. It is obviously of interest to find out what the probabilities of each of these possibilities are.

We can do this via an LDP for $(\mu_n)$. Now it is really useful to be working in a topological context with open and closed sets. It will turn out that the rate function is supported on absolutely continuous functions, whereas obviously for finite n, none of the sample paths are continuous!

We assume that $\Lambda(\lambda)$ is the logarithmic moment generating function of X_1 as before, with $\Lambda^*(x)$ the Fenchel-Legendre transform. Then the key result is:

Theorem (Mogulskii): The measures $(\mu_n)$ satisfy an LDP on $L_\infty([0,1])$ with good rate function:

$I(\phi)=\begin{cases}\int_0^1 \Lambda^*(\phi'(t))dt,&\quad \text{if }\phi\in\mathcal{AC}, \phi(0)=0,\\ \infty&\quad\text{otherwise,}\end{cases}$

where AC is the space of absolutely continuous functions on [0,1]. Note that AC is dense in $L_\infty([0,1])$, so any open set contains a $\phi$ for which $I(\phi)$ is at least in principle finite. (Obviously, if $\Lambda^*$ is not finite everywhere, then extra restrictions of $\phi'$ are required.)

The following picture may be helpful at providing some motivation:

So what is going on is that if we take a path and zoom in on some small interval around a point, note first that behaviour on this interval is independent of behaviour everywhere else. Then the gradient at the point is the local empirical mean of the random variables around this point in time. The probability that this differs from the actual mean is given by Cramer’s rate function applied to the empirical mean, so we obtain the rate function for the whole path by integrating.

More concretely, but still very informally, suppose there is some $\phi'(t)\neq \mathbb{E}X$, then this says that:

$Z_n(t+\delta t)-Z_n(t)=\phi'(t)\delta t+o(\delta t),$

$\Rightarrow\quad \mu_n\Big(\phi'(t)\delta t+o(\delta t)=\frac{1}{n}\sum_{i=nt+1}^{n(t+\delta t)}X_i\Big),$

$= \mu_n\Big( \phi'(t)+o(1)=\frac{1}{n\delta t}\sum_{i=1}^{n\delta t}X_i\Big)\sim e^{-n\delta t\Lambda^*(\phi'(t))},$

by Cramer. Now we can use independence:

$\mu_n(Z_n\approx \phi)=\prod_{\delta t}e^{-n\delta t \Lambda^*(\phi'(t))}=e^{-\sum_{\delta t}n\delta t \Lambda^*(\phi'(t))}\approx e^{-n\int_0^1 \Lambda^*(\phi'(t))dt},$

as in fact is given by Mogulskii.

Remarks

1) The absolutely continuous requirement is useful. We really wouldn’t want to be examining carefully the tail of the underlying distribution to see whether it is possible on an exponential scale that o(n) consecutive RVs would have sum O(n).

2) In general $\Lambda^*(x)$ will be convex, which has applications as well as playing a useful role in the proof. Recalling den Hollander’s mantra, we are interested to see where infima hold for LD sets in the host space. So for the event that the empirical mean is greater than some threshold larger than the expectation, Cramer’s theorem told us that this is exponentially the same as same the empirical mean is roughly equal to the threshold. Now Mogulskii’s theorem says more. By convexity, we know that the integral functional for the rate function is minimised by straight lines. So we learn that the contributions to the large deviation are spread roughly equally through the sample. Note that this is NOT saying that all the random variables will have the same higher than expected value. The LDP takes no account of fluctuations in the path on a scale smaller than n. It does however rule out both of the situations pictured a long way up the page. We should expect to see roughly a straight line, with unexpectedly steep gradient.

3) The proof as given in Dembo and Zeitouni is quite involved. There are a few stages, the first and simplest of which is to show that it doesn’t matter on an exponential scale whether we interpolate linearly or step-wise. Later in the proof we will switch back and forth at will. The next step is to show the LDP for the finite-dimensional problem given by evaluating the path at finitely many points in [0,1]. A careful argument via the Dawson-Gartner theorem allows lifting of the finite-dimensional projections back to the space of general functions with the topology of pointwise convergence. It remains to prove that the rate function is indeed the supremum of the rate functions achieved on projections. Convexity of $\Lambda^*(x)$ is very useful here for the upper bound, and this is where it comes through that the rate function is infinite when the comparison path is not absolutely continuous. To lift to the finer topology of $L_\infty([0,1])$ requires only a check of exponential tightness in the finer space, which follows from Arzela-Ascoli after some work.

In conclusion, it is fairly tricky to prove even this most straightforward case, so unsurprisingly it is hard to extend to the natural case where the distributions of the underlying RVs (X) change continuously in time, as we will want for the analysis of more combinatorial objects. Next time I will consider why it is hard but potentially interesting to consider with adaptations of these techniques an LDP for the size of the largest component in a sparse random graph near criticality.