Tightness in Skorohod Space

This post continues the theme of revising topics in the analytic toolkit relevant to proving convergence of stochastic processes. Of particular interest is the question of how to prove that families of Markov chains might have a process scaling limit converging to a solution of some stochastic differential equation, in a generalisation of Donsker’s theorem for Brownian motion. In this post, however, we address more general aspects of convergence of stochastic processes, with particular reference to Skorohod space.

Topological Background

I’ve discussed Skorohod space in a previous post. For now, we focus attention on cadlag functions defined on a compact time interval, D[0,T]. Some of what follows extends easily to the infinite-time setting, and some requires more work. Although we can define a metric on the space of cadlag functions in lots of ways, it is more useful to think topologically, or at least with a more flexible sense of metric. We say two cadlag functions are close to one another if there is a reparameterisation of the time-axis (a function from [0,T] to itself) that is uniformly close to the identity, and which, when applied to one of the cadlag functions, brings it close to the other. Heuristically, two cadlag functions are close if their large jumps are close to one another and of similar size, and if they are uniformly close elsewhere. It is worth remembering that a cadlag function, even on an unbounded interval, can have only countably many jumps, and only finitely many with magnitude greater than any given threshold on any compact interval.

For much of the theory one would like to use, it is useful for the spaces under investigation to be separable. Recall a topological space is separable if it has a countable dense subset. Note in particular that D[0,T] is not separable under the uniform metric, since we can define f_x(\cdot)=\mathbf{1}_{(\cdot \ge x)} for each x\in[0,T], and then ||f_x-f_y||_\infty=1 whenever x\ne y. In particular, we have an uncountable collection of disjoint open sets given by the balls \mathcal{B}(f_x,\frac12), and so the space cannot be separable. Similarly, C[0,\infty) is not separable. A counterexample is given by considering continuous functions which take prescribed values in \{0,1\} at the integers, interpolating linearly in between. This gives a map \{0,1\}^{\mathbb{N}}\rightarrow C[0,\infty) whose image points lie at uniform distance at least one from each other, so the open balls of radius 1/2 around each image point give the same contradiction as before. However, the Stone-Weierstrass theorem shows that C[0,T] is separable, as we can approximate any such function uniformly well by a polynomial, and thus uniformly well by a polynomial with rational coefficients.

In any case, it can be shown that D[0,T] is separable with respect to the natural choice of metric. It can also be shown that there is a metric which gives the same open sets (hence is a topologically equivalent metric) under which D[0,T] is complete, and hence a Polish space.

Compactness in C[0,T] and D[0,T]

We are interested in tightness of measures on D[0,T], so first we need to address compactness for sets of deterministic functions in D[0,T]. First, we consider C[0,T]. Here, the conditions for a set of functions to be compact are given by the celebrated Arzela-Ascoli theorem. We are really interested in compactness as a property of size, so we consider instead relative compactness. A set is relatively compact (sometimes pre-compact) if its closure is compact. For the existence of subsequential limits, this is just as good as compactness, only now we allow the possibility of the limit point lying outside the set.

We note that the map C[0,T]\rightarrow \mathbb{R} given by f\mapsto ||f||_\infty is continuous, and hence uniform boundedness is certainly a necessary condition for relative compactness in C[0,T]. Arzela-Ascoli states that uniform boundedness plus equicontinuity is sufficient for a set of such functions to be relatively compact. Equicontinuity should be thought of as uniform continuity that is uniform across all the functions in the set, rather than just within the argument of each individual function.

For identical reasons, we need uniform boundedness for relative compactness in D[0,T], but obviously equicontinuity won’t work as a criterion for discontinuous functions! We seek some analogue of the modulus of continuity that ignores jumps. We define

\omega'_\delta(f):=\inf_{\{t_i\}} \max_i \sup_{s,t\in[t_{i-1},t_i)} |f(s)-f(t)|,

where the infimum is taken over all meshes 0=t_0<t_1<\ldots<t_r=T with t_i-t_{i-1}>\delta. Note that as \delta\downarrow 0, we can, if we want, place the t_i so that the large jumps of the function f occur at the boundaries between adjacent cells of the mesh. In particular, for a fixed cadlag function, it can be shown fairly easily that \omega'_\delta(f)\downarrow 0 as \delta\rightarrow 0. Then, unsurprisingly, in a similar fashion to the Arzela-Ascoli theorem, it turns out that a set of functions A\subset D[0,T] is relatively compact if it is uniformly bounded, and

\lim_{\delta\rightarrow 0} \sup_{f\in A}\omega'_\delta(f)=0.

Note that this ‘modulus of continuity’ needs to decay uniformly across the set of functions, but that we do not need to choose the mesh at level \delta uniformly across all functions. Requiring a common mesh would obviously not work: the functions \mathbf{1}_{(\cdot\ge x_n)} for a sequence x_n\rightarrow x would then fail to form a relatively compact set, yet they clearly converge in Skorohod space!
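To make this concrete, here is a minimal Python sketch (function and variable names are my own) which approximates \omega'_\delta(f) for a function sampled on a finite grid, using dynamic programming over admissible meshes. It only sees the sampled values and can only place mesh points on the grid, but for step functions whose jumps sit on grid points it computes the quantity exactly.

```python
import numpy as np

def cadlag_modulus(values, times, delta):
    """Approximate omega'_delta(f) for f sampled on a grid: values[i] = f(times[i]).
    We optimise over meshes 0 = t_0 < ... < t_r = T drawn from the grid, with every
    cell wider than delta; the cell ending at index j uses samples i, ..., j-1,
    mimicking the half-open intervals [t_{i-1}, t_i)."""
    n = len(times) - 1                      # times[0] = 0, times[n] = T
    best = [float("inf")] * (n + 1)         # best[j] = min over valid meshes of [0, times[j]]
    best[0] = 0.0                           #           of the maximal cell oscillation
    for i in range(n):                      # i = left endpoint of a cell
        lo = hi = values[i]
        for j in range(i + 1, n + 1):       # j = right endpoint of the cell
            lo, hi = min(lo, values[j - 1]), max(hi, values[j - 1])
            if times[j] - times[i] > delta:
                best[j] = min(best[j], max(best[i], hi - lo))
    return best[n]

# a single unit jump at time 0.3: the modulus ignores the jump once delta < 0.3
times = np.linspace(0, 1, 1001)
values = (times >= 0.3).astype(float)
for delta in (0.5, 0.2, 0.05):
    print(delta, cadlag_modulus(values, times, delta))
```

For \delta=0.5 no admissible mesh can separate the jump from the rest of the interval, so the output is 1; for the smaller values of \delta a mesh point can be placed at the jump time and the output drops to 0, illustrating \omega'_\delta(f)\downarrow 0.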

Tightness in C[0,T] and D[0,T]

Naturally, we are mainly interested in (probability) measures on D[0,T], and in particular in conditions for tightness on this space. Recall a family of measures \Pi is tight if for any \epsilon>0, there exists some compact set A such that

\pi(A)>1-\epsilon,\quad \forall \pi\in\Pi.

So, for measures (\mu_n) on D[0,T], the sequence is tight precisely if for any \epsilon>0, there exist M,\delta>0 and some N such that for all n>N, both

\mu_n(||f||_\infty >M)\le \epsilon,\quad \mu_n(\omega'_\delta(f)>\epsilon)\le \epsilon

hold. In fact, the second condition controls the oscillation sufficiently strongly that we can replace the first condition with

\mu_n(|f(0)|>M)\le \epsilon.

Often we might be taking some sort of scaling limit of these processes in D[0,T], where the jumps become so small in the limit that we expect the limit process to be continuous, perhaps the solution to an SDE or a diffusion. If we can replace \omega'_\delta by \omega_\delta, the standard modulus of continuity, then we have the additional conclusion that any weak limit lies in C[0,T].

In general, to prove convergence of some family of stochastic processes, we will want to show that the processes are tight, by demonstrating the properties above, or something equivalent. Then Prohorov’s theorem (which I tend to think of as a probabilistic functional version of Bolzano-Weierstrass) asserts that the family of processes has a weak subsequential limit. Typically, one then shows that any weak subsequential limit must have the law of some particular random process. Normally this is achieved by establishing some martingale property (eg for an SDE) in the limit, often via the Skorohod representation theorem, which lets us work with almost sure subsequential convergence rather than merely weak convergence. Then one argues that there is a unique process with this property and a given initial distribution. Since all weak subsequential limits are this given process, in fact the whole family has a weak limit.

Large Deviations 2 – LDPs, Rate Functions and Lower Semi-Continuity

Remarks from Cramer’s Theorem

So in the previous post we discussed Cramer’s theorem on large deviations for means of i.i.d. random variables. It’s worth stepping back and thinking more abstractly about what we showed. Each empirical mean S_n has some law, which we think of as a measure on \mathbb{R}, though this could equally well live on some other space, depending on where the random variables are supported. The law of large numbers asserts that as n\rightarrow\infty, these measures become increasingly concentrated at a single point in \mathbb{R}, which in this case is \mathbb{E}X_1. Cramer’s theorem then asserts that the measure of certain sets not containing this point of concentration decays exponentially in n, and quantifies the exponent, a so-called rate function, via a Legendre transform of the log moment generating function of the underlying distribution.

One key point is that we considered only certain sets [a,\infty),\,a>\mathbb{E}X_1, though we could equally well have considered (-\infty,a],\,a<\mathbb{E}X_1. What would happen if we wanted to consider an interval, say [a,b],\,\mathbb{E}X_1<a<b? Well, \mu_n([a,b])=\mu_n([a,\infty))-\mu_n((b,\infty)), and we might as well assume that \mu_n is sufficiently continuous, at least in the limit, that we can replace the open half-line with a closed one. Then Cramer’s theorem asserts, written informally, that \mu_n([a,\infty))\sim e^{-nI(a)}, and similarly for [b,\infty). So provided I(a)<I(b), we have

\mu_n([a,b])\sim e^{-nI(a)}-e^{-nI(b)}\sim e^{-nI(a)}.

In order to accord with our intuition, we would like I(x) to be increasing for x>\mathbb{E}X_1, and decreasing for x<\mathbb{E}X_1. Also, we want I(\mathbb{E}X_1)=0, to account for the fact that \mu_n([\mathbb{E}X_1,\infty)) remains of order 1. For example, consider a sequence of fair coin tosses: the probability that the observed proportion of heads is in [\frac12,1] should be roughly 1/2 for all n.
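As a quick numerical sanity check (my own snippet; the function names are invented for this illustration), we can compare the exact tail exponent -\frac{1}{n}\log\mu_n([a,\infty)) for the proportion of heads in n fair coin tosses with the Legendre-transform rate function, which for fair coins works out as I(a)=a\log(2a)+(1-a)\log(2(1-a)).

```python
import math

def rate_fair_coin(a):
    # Legendre transform of the log-mgf of a Bernoulli(1/2) variable:
    # I(a) = a*log(2a) + (1-a)*log(2(1-a)) for a in (0, 1)
    return a * math.log(2 * a) + (1 - a) * math.log(2 * (1 - a))

def tail_exponent(n, a):
    # -(1/n) * log P(proportion of heads >= a), computed exactly for Bin(n, 1/2)
    k0 = math.ceil(n * a)
    tail = sum(math.comb(n, k) for k in range(k0, n + 1))
    return -(math.log(tail) - n * math.log(2)) / n

for n in (10**2, 10**3, 10**4):
    print(n, tail_exponent(n, 0.6), rate_fair_coin(0.6))
```

The empirical exponent approaches I(0.6)\approx 0.0201 as n grows, though slowly (the discrepancy is roughly of order \frac{\log n}{n}), which is worth bearing in mind when using these asymptotics at moderate n.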

Note that in the previous displayed equation for \mu_n([a,b]), the right hand side has no dependence on b. Informally, this means that any event at least as unlikely as a deviation to a will, in the limit, occur in the most likely of the unlikely ways, which in this case is a deviation to (just beyond) a, because of the relative domination of exponential functions. So if, rather than just half-lines and intervals, we wanted to consider more general sets, we might conjecture a result of the form:

\mu_n(\Gamma)\sim e^{-n\inf_{z\in\Gamma}I(z)},

with the approximation defined formally as in the statement of Cramer’s theorem. What can go wrong?

Large Deviations Principles

Well, if the set \Gamma=\{\gamma\} is a single point, and the underlying distribution is continuous, then we would expect \mu_n(\{\gamma\})=0 for all n, which is incompatible with the conjecture whenever I(\gamma)<\infty. Similarly, we would expect \mu_n((\mathbb{E}X_1,\infty)) to remain of order 1, but there is no a priori reason why I(z) should be continuous at \mathbb{E}X_1 (and for general rate functions it need not be), so taking \Gamma=(\mathbb{E}X_1,\infty) can again give a contradiction.

So we need something a bit more precise. The problem here is that measure (in this case, measure of likeliness on an exponential scale) can leak into open sets through the boundary in the limit, and also the rate function requires some sort of neighbourhood to make sense for continuous RVs, so boundaries of closed sets may give an overestimate. This is reminiscent of weak convergence, and motivated by this, the appropriate general definition of a Large Deviation Principle is:

A sequence of measures (\mu_n) on some space E satisfies an LDP with rate function I and speed n if, for all \Gamma\in \mathcal{B}(E):

-\inf_{x\in\Gamma^\circ}I(x)\leq \liminf \frac{1}{n}\log\mu_n(\Gamma)\leq \limsup\frac{1}{n}\log\mu_n(\Gamma)\leq -\inf_{x\in \bar{\Gamma}}I(x).

Although this might look very technical, you might as well think of it as nothing more than the previous conjecture for general sets, with the two problems that we mentioned now taken care of.

So, we need to define a rate function. I: E\rightarrow[0,\infty] is a rate function if it is not identically infinite. We also demand that it is lower semi-continuous, or equivalently that it has closed level sets \Psi_I^\alpha:=\{x\in E: I(x)\leq\alpha\}; these two conditions are indeed equivalent. I will say what lower semi-continuity is in a moment. Some authors also demand that the level sets be compact; others call such a function a good rate function, or similar. The advantage of this extra condition is that infima of I over closed sets are attained.

It is also possible to specify a different speed, which governs how fast the exponential decay occurs: the factor \frac{1}{n} can be replaced by any function converging to 0, and indeed the parameter can vary continuously rather than through the integers.

Lower Semi-Continuity

A function f is lower semi-continuous if

f(x)\leq \liminf f(x_n),\text{ for all sequences }x_n\rightarrow x.

One way of thinking about this definition is to say that the function cannot jump upwards as it reaches a boundary; it can only jump downwards (or not jump at all). The Wikipedia article on semi-continuity has a picture illustrating how a lower semi-continuous function must behave at a discontinuity: the value of f there can be anything less than or equal to the smaller of the one-sided limits. It is reasonably clear why this definition is equivalent to having closed level sets.

So the question to ask is: why should rate functions be lower semi-continuous? Rather than proceeding directly, we argue by uniqueness. Given a function on \mathbb{R} with discontinuities, we can turn it into a cadlag function, or a caglad function by fiddling with the values taken at points of discontinuity. We can do a similar thing to turn any function into a lower semi-continuous function. Given f, we define

f_*(x):=\liminf_{x_n\rightarrow x}f(x_n)=\sup\{\inf_G f: x\in G, G \text{ open}\}.

The notes I borrowed this idea from described this as the maximal lower semi-continuous regularisation, which I think is quite a good explanation despite the long words.
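As a concrete example of this regularisation (my own, not from those notes): take f=\mathbf{1}_{[0,\infty)} on \mathbb{R}. Approaching 0 from the left gives \liminf_{x_n\rightarrow 0}f(x_n)=0<f(0), while at every other point f is locally constant, so

f_*=\mathbf{1}_{(0,\infty)},

which is lower semi-continuous, whereas f itself jumps upwards at 0 and is not.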

Anyway, the claim is that if an LDP holds with rate function I(x), then it also holds with rate function I_*(x). This needs to be checked, but it explains why we demand that the rate function be lower semi-continuous. We really want the rate function to be unique, and this is a good way to rule out an obvious cause of non-uniqueness. It then needs to be checked that the rate function actually is unique once we have this assumption, but that is relatively straightforward.

So, to check that the lower semi-continuous regularisation of I satisfies the LDP if I does, we observe that the upper bound is trivial, since I_*\leq I everywhere. Then, for every open set G and every x\in G, we have I_*(x)=\liminf_{x_n\rightarrow x}I(x_n), so we might as well consider only sequences within G, and hence I_*(x)\geq \inf_G I. So, since also I_*\leq I, it follows that

\inf_G I_*=\inf_G I,

and thus we get the lower bound for the LDP as well.

References

The motivation for this particular post was my own, but the set of notes here, as cited in the previous post, was very useful. So too were the Wikipedia page on semi-continuity, and Frank den Hollander’s book ‘Large Deviations’.

Skorohod Space

The following is a summary of a chapter from Billingsley’s Convergence of Probability Measures. The ideas are easy to explain heuristically, but this was the first text I could find which explained how to construct Skorohod space for functions on the whole of the non-negative reals in enough stages that it was easily digestible.

It is relatively straightforward to define a topology on C[0,1], as we can induce it from the most natural metric, the uniform one. In this topology, functions f and g are close together if

\sup_{t\in[0,1]} |f(t)-g(t)| is small.

For cadlag functions, things are a bit more complicated. Two functions might be very similar, but have a discontinuity of similar magnitude at slightly different places. The sup norm of the difference is therefore macroscopically large. So we want a metric that also allows uniformly small deformations of the time scale.

We define the Skorohod (or Skorokhod depending on your transliteration preferences) metric d on D[0,1] as follows. Let \Lambda be the family of continuous, strictly increasing functions from [0,1] to [0,1] which map 0 to 0 and 1 to 1. This will be our family of suitable reparameterisations of the time scale (or abscissa – a new word I learned today. The other axis in a co-ordinate pair is called the ordinate). Anyway, we now say that

d(f,g)<\epsilon\quad\text{if }\exists \lambda\in\Lambda\text{ s.t. }

||\lambda - \text{id}||_\infty<\epsilon\quad\text{and}\quad ||f-g\circ\lambda||_\infty<\epsilon.

In other words, after reparameterising the time scale for g, without moving any time by more than epsilon, the functions are within epsilon in the sup metric.
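As an illustration (a minimal numpy sketch of my own, not taken from Billingsley): any particular \lambda\in\Lambda evaluated on a grid gives an upper bound on d(f,g), so we can contrast the Skorohod and uniform metrics for two indicator functions with nearby jumps.

```python
import numpy as np

def skorohod_upper_bound(f, g, lam, grid):
    """Evaluate max(||lam - id||_inf, ||f - g o lam||_inf) on a grid of times.
    Any admissible lam gives an upper bound for the Skorohod distance d(f, g)."""
    lam_t = lam(grid)
    return max(np.max(np.abs(lam_t - grid)),
               np.max(np.abs(f(grid) - g(lam_t))))

# two cadlag step functions with unit jumps at nearby times a and b
a, b = 0.5, 0.52
f = lambda t: (t >= a).astype(float)
g = lambda t: (t >= b).astype(float)

# piecewise-linear reparameterisation of [0,1] sending a to b, so that g(lam(t)) jumps at a
def lam(t):
    return np.where(t <= a, t * b / a, b + (t - a) * (1 - b) / (1 - a))

grid = np.linspace(0.0, 1.0, 100001)
print("uniform distance:", np.max(np.abs(f(grid) - g(grid))))      # 1.0
print("Skorohod bound  :", skorohod_upper_bound(f, g, lam, grid))  # about |a - b| = 0.02
```

The uniform distance is 1 however close a and b are, whereas the explicit \lambda aligning the two jumps certifies that d(f,g)\leq|a-b|.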

Weak Convergence

We have the following condition: if \{P_n\} is a tight sequence of probability measures and

P_n\pi_{t_1,\ldots,t_k}^{-1}\Rightarrow P\pi_{t_1,\ldots,t_k}^{-1}\quad\forall t_1,\ldots,t_k\in[0,1],\quad\text{then }P_n\Rightarrow P,

where \pi_{t_1,\ldots,t_k} is the projection onto the finite collection of coordinates t_1,\ldots,t_k. This is a suitable condition for C[0,1]. For D[0,1], we have the additional complication that these projections might not be continuous everywhere.

We can get over this problem. For a measure P, set T_P to be the set of t\in[0,1] such that \pi_t is continuous P-almost everywhere (ie for all f\in D apart from a collection with P-measure = 0). Then, for all P, it is not hard to check that 0,1\in T_P and [0,1]\backslash T_P is countable.

The tightness condition requires two properties:

1) \lim_{K\rightarrow\infty} \limsup_{n}P_n[f:||f||\geq K]=0.

2) \forall \epsilon>0:\,\lim_{\delta\rightarrow 0}\limsup_n P_n[f:w_f'(\delta)\geq\epsilon]=0.

These say, respectively, that no mass escapes to \infty in the size of ||f||, and that there is no mass given in the limit to functions which ‘wiggle with infinite frequency on an epsilon scale of amplitude’.

D_\infty=D[0,\infty)

Our earlier definition of the Skorohod metric could have been written:

d(f,g)=\inf_{\lambda\in\Lambda}\{||\lambda-\text{id}||\vee||f-g\circ\lambda||\}.

From a topological convergence point of view, there’s no need to use the sup norm on \lambda - \text{id}. We want to regulate smoothness of the reparameterisation, so we could use the norm:

||\lambda||^\circ=\sup_{s<t}|\log\frac{\lambda(t)-\lambda(s)}{t-s}|,

that is, the slope is uniformly close to 1 if ||\lambda||^\circ is small. The advantage of this choice of norm is that an extension to D[0,\infty) is immediate. Also, the induced metric

d^\circ(f,g)=\inf_{\lambda\in\Lambda} \{||\lambda||^\circ \vee||f-g\circ\lambda||\}

is complete. This gives us a few problems, as for example

d^\circ(1_{[0,1)},1_{[0,1-\frac{1}{n})})=1,

as you can’t reparameterise over the jump in a way that keeps the log of the gradient small. (In particular, to make the sup norm less than 1, we would need \lambda to send 1-\frac{1}{n}\mapsto 1; since \lambda must also fix 1, its gradient on [1-\frac{1}{n},1] would have to vanish, or be arbitrarily small for admissible \lambda, and so ||\lambda||^\circ blows up.)
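To see the norm ||\lambda||^\circ in action, here is a small Python sketch (names my own). For a piecewise-linear \lambda the supremum of |\log\frac{\lambda(t)-\lambda(s)}{t-s}| is attained on one of the linear pieces, since any chord slope is a weighted average of the piece slopes.

```python
import math

def circ_norm_piecewise_linear(knots):
    """||lambda||^circ for an increasing, piecewise-linear lambda given by (t, lambda(t)) knots:
    the sup of |log slope| over chords is attained on one of the segments."""
    worst = 0.0
    for (t0, l0), (t1, l1) in zip(knots, knots[1:]):
        slope = (l1 - l0) / (t1 - t0)
        worst = max(worst, abs(math.log(slope)))
    return worst

# a gentle reparameterisation: all slopes close to 1, so the norm is small
print(circ_norm_piecewise_linear([(0, 0), (0.5, 0.52), (1, 1)]))          # about 0.04
# squeezing [1 - 1/n, 1] almost to a point forces a tiny slope and a huge norm
n = 100
print(circ_norm_piecewise_linear([(0, 0), (1 - 1/n, 1 - 1e-6), (1, 1)]))  # about 9.2
```

The second example is exactly the situation described above: pushing 1-\frac1n towards 1 makes ||\lambda||^\circ blow up, even though ||\lambda-\text{id}||_\infty stays small.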

So we can’t immediately define Skorohod convergence on D_\infty by demanding convergence on any restriction to [0,t]. We overcome this in a similar way to convergence of distribution functions.

Lemma: If d_t^\circ (f_n,f)\rightarrow_n 0, then for any s<t with f continuous at s, we also have d_s^\circ(f_n,f)\rightarrow_n 0.

So this says that the functions converge in Skorohod space if for arbitrarily large times T where the limit function is continuous, the restrictions to [0,T] converge. (Note that cadlag functions have at most countably many discontinuities, so this is fine.)

A metric for D_\infty

If we want to specify an actual metric d_\infty^\circ, the usual tools for specifying a countable product metric will do here:

d_\infty^\circ(f,g)=\sum_{m\geq 1}2^{-m}[1\wedge d_m^\circ(f^m,g^m)],

where f^m is the restriction of f to [0,m], with the potential discontinuity at m smoothed out:

f^m(t)=\begin{cases}f(t)&t\leq m-1\\ (m-t)f(t)&t\in[m-1,m]\\ 0&t\geq m.\end{cases}

In particular, d_\infty^\circ(f,g)=0\Rightarrow f^m=g^m\,\forall m, and hence f=g.
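A direct transcription of this truncation into Python (names mine), mainly to emphasise that f^m agrees with f on [0,m-1], is multiplied by the linear taper (m-t) on [m-1,m], and vanishes beyond m:

```python
def smooth_restriction(f, m):
    """The truncation f^m used in the definition of d_infty^circ."""
    def fm(t):
        if t <= m - 1:
            return f(t)            # untouched on [0, m-1]
        if t <= m:
            return (m - t) * f(t)  # tapered down to 0 on [m-1, m]
        return 0.0                 # identically 0 beyond m
    return fm
```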

It can be checked that:

Theorem: d_\infty^\circ(f_n,f)\rightarrow 0 in D_\infty if and only if

\exists \lambda_n\in\Lambda_\infty\text{ s.t. }||\lambda_n-\text{id}||\rightarrow 0

\text{and }\sup_{t\leq m}|f_n(\lambda_n(t))-f(t)|\rightarrow_n 0,\,\forall m,

and that d_\infty^\circ (f_n,f)\rightarrow 0 \Rightarrow d_t^\circ(f_n,f)\rightarrow 0 for every point of continuity t of f.

Similarly weak convergence and tightness properties are available, roughly as you might expect. It is probably better to reference Billingsley’s book or similar sources rather than further attempting to summarise them here.

The Poisson Process – A Third Characterisation

There remains the matter of the distribution of the number of people to arrive in a fixed, non-infinitesimal time interval. Consider the time interval [0,1], which we divide into n smaller intervals of equal width. As n grows large, the probability that two arrivals occur in the same small interval tends to zero (as it is \leq n\cdot o(\frac{1}{n})), so we can consider the arrivals as a sequence of iid Bernoulli random variables as before. So

\mathbb{P}(N_1=k)=\binom{n}{k}(\frac{\lambda}{n})^k(1-\frac{\lambda}{n})^{n-k}
\approx \frac{n^k}{k!} \frac{\lambda^k}{n^k}(1-\frac{\lambda}{n})^n\approx \frac{\lambda^k}{k!}e^{-\lambda}.

We recognise this as belonging to a Poisson (hence the name of the process!) random variable. We can repeat this for a general time interval and obtain N_t\sim \text{Po}(\lambda t).
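A quick numerical check of this binomial-to-Poisson limit (a throwaway Python snippet; the parameter values are arbitrary):

```python
import math

lam, n, k = 2.5, 10_000, 4
binom = math.comb(n, k) * (lam / n)**k * (1 - lam / n)**(n - k)  # P(Bin(n, lam/n) = k)
poisson = lam**k / math.factorial(k) * math.exp(-lam)            # P(Po(lam) = k)
print(binom, poisson)   # agree to roughly four decimal places
```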

Note that we implicitly assumed that, in the infinitesimal case at least, behaviour in disjoint intervals was independent. We would hope that this lifts immediately to large intervals, but it is not immediately obvious how to make this work. This property of independent increments is one of the defining properties of a Levy Process, of which the Poisson process is one of the two canonical examples (the other being Brownian Motion).

As before, if we can show that the implication goes both ways (and in this case it is not hard – letting t\rightarrow 0 clearly recovers the infinitesimal construction), we can prove results about Poisson random variables with ease, for example that the sum of independent \text{Po}(\lambda) and \text{Po}(\mu) random variables is \text{Po}(\lambda+\mu).

This pretty much concludes the construction of the Poisson process. We have three characterisations:
1) The inter-arrival times X_n\sim\text{Exp}(\lambda), all iid.
2) The infinitesimal construction as before, with independence.
3) The number of arrivals in a time interval of width t is \sim \text{Po}(\lambda t). (This is sometimes called a stationary increments property.) Furthermore, we have independent increments.

A formal derivation of the equivalence of these forms is important but technical, and so not really worth attempting here. See James Norris’s book for example for a fuller exposition.
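Short of a formal proof, a simulation can at least check that characterisations 1 and 3 are consistent. The sketch below (numpy, names mine) builds arrival times from iid \text{Exp}(\lambda) gaps and compares the empirical distribution of N_T with the \text{Po}(\lambda T) mass function.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
lam, T, trials = 3.0, 1.0, 100_000

# characterisation 1: iid Exp(lam) inter-arrival times; 30 gaps per path is
# comfortably enough here, since P(N_1 > 30) is negligible for lam = 3
gaps = rng.exponential(scale=1 / lam, size=(trials, 30))
arrival_times = np.cumsum(gaps, axis=1)
counts = np.sum(arrival_times <= T, axis=1)      # N_T for each simulated path

# characterisation 3 predicts N_T ~ Po(lam * T)
for k in range(6):
    exact = math.exp(-lam * T) * (lam * T)**k / math.factorial(k)
    print(k, round(float(np.mean(counts == k)), 4), round(exact, 4))
```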

The final remark is that the Poisson Process has the Markov property. Recall that this says that, conditional on the present, future behaviour is independent of the past. Without getting into too much detail, we might like to prove this by using the independent increments property. But remember that for a continuous-time process, it is too much information to keep track of all the joint distributions at once. It is sufficient to track only the finite-dimensional marginal distributions, provided the process is cadlag, which the Poisson process is, assuming we deal with the discontinuities in the right way. Alternatively, the exponential random variable is memoryless, a property that can be lifted, albeit with some technical difficulties, to give the Markov property.

Brownian Motion is not finite variation

There is a natural definition of ‘pathwise’ stochastic integrals of a certain type of ‘simple’ process with respect to cadlag non-decreasing processes. It can be shown that a function is of finite variation iff it can be expressed as the difference of two such non-decreasing functions. Hence, these finite variation processes can be used as integrators via the obvious linear extension. One direction of this result is obvious; the other is fiddly. To proceed, given a finite variation process a, we show that its total variation process v is cadlag (and, obviously, increasing), and then check that a'=\frac12(v+a),a''=\frac12(v-a) are processes satisfying the conditions of the result.

Our overall aim is to define integrals with respect to Brownian Motion since that is (in a sense to be made precise through the Dubins-Schwarz theorem later) the canonical non-trivial stochastic process with non-zero quadratic variation. The result we demonstrate shows that it is not possible to define the integral with respect to BM through pathwise finite variation integrals.

Theorem: Suppose M\in\mathcal{M}_{c,loc} with M_0=0 a.s. is of finite variation. Then M is indistinguishable from 0.

We will show this for M a bounded martingale with bounded variation. Why does this suffice? In general, let V be the total variation process of M, which is continuous, adapted and non-decreasing, and set S_n:=\inf\{t:V_t\geq n\}. If M^{S_n}\equiv 0\,\forall n, then we are done, since S_n\uparrow\infty. But M^{S_n} is a bounded martingale with bounded variation (both are bounded by n), so the special case suffices.

To prove this, we make use of the orthogonality relation which is a key trick for this sort of result: If M is a martingale, with M_s,M_t\in L^2, for s<t, then just by multiplying out:

\mathbb{E}[(M_t-M_s)^2|\mathcal{F}_s]=\mathbb{E}[M_t^2-M_s^2|\mathcal{F}_s] a.s.
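Spelling out the ‘multiplying out’ step: since M_s is \mathcal{F}_s-measurable and \mathbb{E}[M_t|\mathcal{F}_s]=M_s,

\mathbb{E}[(M_t-M_s)^2|\mathcal{F}_s]=\mathbb{E}[M_t^2|\mathcal{F}_s]-2M_s\mathbb{E}[M_t|\mathcal{F}_s]+M_s^2=\mathbb{E}[M_t^2|\mathcal{F}_s]-M_s^2=\mathbb{E}[M_t^2-M_s^2|\mathcal{F}_s].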

Now, for this particular result, we decompose

\mathbb{E}[M_t^2]=\mathbb{E}\left[\sum_{k=0}^{2^n-1}(M_{(k+1)2^{-n}t}^2-M_{k2^{-n}t}^2)\right]=\mathbb{E}\left[\sum_{k=0}^{2^n-1}(M_{(k+1)2^{-n}t}-M_{k2^{-n}t})^2\right]

and then, writing M_+:=M_{(k+1)2^{-n}t} and M_-:=M_{k2^{-n}t} for brevity, we bound this last term as

\leq \mathbb{E}\left[\sup_k |M_+-M_-|\sum_k |M_+-M_-|\right].

Now, as n\uparrow\infty, we have \sum_k |M_+-M_-|\uparrow V_t, which is bounded by assumption. Furthermore, M is almost surely continuous on [0,t] and so is in fact uniformly continuous there, which allows us to conclude that

\sup_k |M_+-M_-|\rightarrow 0.

By bounded convergence (the quantity inside the expectation is bounded, since both M and its total variation are), this limit holds under the expectation as well, and so we conclude that \mathbb{E}M_t^2=0 for each time t; hence for each time t the martingale is almost surely equal to 0. In the usual way, we can lift this to all rational times by countability, and then to all times by continuity.
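As a numerical aside (my own sketch, not part of the proof): simulating Brownian increments on ever finer dyadic grids shows the two variations behaving exactly as this result suggests, with the sum of squared increments settling near t while the sum of absolute increments blows up, so Brownian motion cannot be of finite variation.

```python
import numpy as np

rng = np.random.default_rng(1)
t = 1.0
for level in (8, 12, 16, 20):                    # dyadic refinement levels
    steps = 2**level
    # a fresh Brownian path at each level (enough for illustration): increments are
    # independent N(0, t/steps) variables
    dB = rng.normal(0.0, np.sqrt(t / steps), size=steps)
    print(steps,
          round(float(np.sum(np.abs(dB))), 2),   # variation along the grid: grows like 2^(level/2)
          round(float(np.sum(dB**2)), 4))        # quadratic variation along the grid: settles near t = 1
```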