Trap models and laws of not-so-large numbers

Posted on August 4, 2017 by dominicyeo

I’m back in Rio, this time for the Brazilian Probability School, which this year is being held in parallel with the Brazilian Mathematical Colloquium, so there are a lot of possible lectures to attend across a wide range of topics. I’ve been paying particular attention to a course by Veronique Gayrard concerning the phenomenon of aging, as seen in various spin-glass and trap models. [Lecture notes exist, but haven’t yet been put online.]

I want to write something about the setup for one of these models. It took me quite a long time to settle on a title for this post, and as you can see I’ve hedged. At least in this post, I’m not so interested in the model (and don’t want to try and offer a physical motivation at this point) but rather in talking about the natural model-independent problem it reduces to.

Motivation

Let $X_1,X_2,X_3,\ldots$ be IID random variables which take some fixed value K>1 with probability 1/K, and otherwise take the value zero. The law of large numbers says that for large m, the rescaled partial sum satisfies $\frac{1}{m}(X_1+\ldots+X_m) \approx 1$ . The weak LLN makes this precise in the sense of convergence in probability (equivalently, since the limit is a constant, convergence in distribution), and the strong LLN gives almost sure convergence.

But the speed of convergence is obviously not uniform over all distributions of the underlying IID random variables. This is particularly clear in the setup I’ve outlined, in the regime where $K\rightarrow\infty$ . Certainly if $1\ll m\ll K$ , then we have $\frac{1}{m}(X_1+\ldots+X_m) \ge \frac{K}{m}$ with probability $\approx \frac{m}{K}$ and otherwise $\frac{1}{m}(X_1+\ldots+X_m) = 0$ . So if we let K and m diverge together with scaling as given, the only version of a LLN we can write down is

$\frac{1}{m}(X_1+\ldots+X_m)\stackrel{d}\rightarrow 0,$

which is obviously different to the original version for fixed K and diverging m.

If we take $m=\Theta(K)$ , then the rescaled partial sum process converges in distribution to a scaled Poisson process. Of course, the Poisson process obeys its own law of large numbers (or law of large times), but on this scale the first-order behaviour is random.
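The regime $1\ll m\ll K$ is easy to see numerically. The following is a minimal sketch using only the standard library; the function names and the particular values of m, K and the seed are my own choices for illustration.

```python
import random

def rescaled_sum(m, K, rng):
    """Return (X_1 + ... + X_m) / m, where each X_i is K with probability 1/K, else 0."""
    hits = sum(1 for _ in range(m) if rng.random() < 1.0 / K)
    return hits * K / m

def fraction_zero(m, K, trials, seed=0):
    """Estimate the probability that the rescaled partial sum is exactly zero."""
    rng = random.Random(seed)
    zeros = sum(1 for _ in range(trials) if rescaled_sum(m, K, rng) == 0.0)
    return zeros / trials
```

With, say, `fraction_zero(100, 10_000, 2000)`, almost every sample is exactly zero, matching the exact value $(1-1/K)^m\approx 1-m/K$ , even though each rescaled sum has expectation 1.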

At a more general level, what we are doing in the previous examples is looking at a process which converges to equilibrium, but studying it on a faster timescale than the timescale of this convergence. The REM-like trap model, which will be the eventual focus of this post, does exactly this to a continuous-time Markov chain, with the additional factor that the holding rates are random and heavy-tailed.

The mean-field REM-like trap model

This REM-like trap model is defined as follows. We have N sites, and for these sites we sample an IID collection of holding rates $(\tau_N(1),\ldots,\tau_N(N))$ according to some distribution. We then choose a sequence of IID uniform samples from {1,…,N}, labelled $(J_N(1),J_N(2),\ldots)$ . We think of this as recording an itinerary of visits to the sites, where the jth site we visit is $J_N(j)$ . (Though notice that under this definition, it’s possible that the jth site we visit and the (j+1)st site we visit are the same.) We wait at each site for an exponential holding time, with mean $\tau_N(x)$ (that is, rate $1/\tau_N(x)$ ) if we are at site x, and these holding times are independent of the other holding times, and independent of the trajectory, all conditional on $(\tau_N(1),\ldots,\tau_N(N))$ .

You can think of this as a continuous-time RW on the complete graph $K_N$ (with self-loops), where the jump chain is uniform, and the holding rates are given by $(\tau_N(1),\ldots,\tau_N(N))$ . This explains the notation, and how you’d construct a similar model on a different underlying graph.

The general idea of a trap model is a random walk with very inhomogeneous speed, for example because some holding times have very large expectation. In a setting with more inbuilt geometry, for example on a lattice, we can imagine the RW getting trapped in regions associated with atypically low speeds. We might therefore think of a site with very long holding times as being deep, in the sense that the chain might get stuck there.

This will be most interesting if we allow an extreme range of values taken by $\tau_N$ , and so the best choice is a distribution in the domain of attraction of an $\alpha$ -stable law with parameter $\alpha\in(0,1)$ . That is $\mathbb{P}(\tau \ge u) = u^{-\alpha}L(u)$ , where L is a slowly-varying function at $+\infty$ .

This distribution has infinite mean, and so we couldn’t apply either LLN to a sequence of copies of $\tau$ . However, obviously the sequence $(\tau_N(1),\ldots,\tau_N(N))$ almost surely does have finite mean, since each entry is finite! So for each N, the trap model will have a LLN on large timescales, but we will investigate at faster timescales.
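To experiment with such depths, here is a small sketch which samples from the simplest distribution of this type. It assumes a pure Pareto tail, $\mathbb{P}(\tau\ge u)=u^{-\alpha}$ for $u\ge 1$ (that is, $L\equiv 1$ ), which is my simplification rather than part of the model; the function names are also my own.

```python
import random

def sample_depth(alpha, rng):
    """Sample tau with P(tau >= u) = u**(-alpha) for u >= 1, by inverse transform."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so the power below is always finite
    return u ** (-1.0 / alpha)

def tail_fraction(alpha, u, n, seed=0):
    """Empirical fraction of n samples which are at least u."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if sample_depth(alpha, rng) >= u) / n

def empirical_mean(alpha, n, seed=0):
    """Mean of n samples: finite for every n, but not stabilising as n grows."""
    rng = random.Random(seed)
    return sum(sample_depth(alpha, rng) for _ in range(n)) / n
```

Calling `empirical_mean(0.5, n)` for increasing n shows the point made above: each finite sample has a finite mean, but the values do not settle down as n grows.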

The clock process

At least for the purpose of this post, we will focus on the clock process, which records the (continuous) time which elapses before we arrive at the *k*th state of the jump chain.

That is,

$S_N(0)=0, \quad S_N(k) = \sum_{i=0}^{k-1} \mathrm{Exp}\left(1/\tau_N(J_N(i))\right),$

where the exponential random variables are independent except through their parameters, and $J_N(0)$ denotes the starting site, which we also take to be uniform. This can be made even more clear if we take advantage of the method to write a general exponential distribution as a multiple of an exponential distribution with parameter 1. Let $e_0,e_1,\ldots$ be IID exponential RVs with parameter 1, independent of $(\tau_N(1),\ldots,\tau_N(N))$ and the jump chain. Then

$S_N(k)=\sum_{i=0}^{k-1} \tau_N(J_N(i)) e_i.$
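The construction with standard exponentials translates directly into a simulation. This is a minimal sketch, assuming the depths are supplied as a Python list `taus` (sites are indexed from 0 here, purely for convenience); `clock_process` is a name I have made up.

```python
import random

def clock_process(taus, steps, rng):
    """Return [S(0), S(1), ..., S(steps)] for the trap model with depths `taus`."""
    N = len(taus)
    S = [0.0]
    for _ in range(steps):
        site = rng.randrange(N)            # uniform jump chain, repeats allowed
        e = rng.expovariate(1.0)           # standard exponential e_i
        S.append(S[-1] + taus[site] * e)   # holding time with mean taus[site]
    return S
```

For example, `clock_process([1.0, 2.0, 5.0], 50, random.Random(0))` returns an increasing list of 51 times, starting from 0.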

Let’s briefly pause to apply the LLN to $S_N$ for fixed N. It matters whether we consider the quenched or annealed settings here. As usual, quenched means we fix a realisation of the random environment, and draw all conclusions in terms of that environment (think of conditional expectations). And annealed means that we also include the randomness of the environment. This is notationally annoying, so as a shorthand we write $\mathbb{E}_{\tau_N}$ for quenched expectations $\mathbb{E}[\cdot \,|\, \tau_N(1),\ldots,\tau_N(N)]$ , and $\mathbb{E}$ for an expectation over all randomness.

Then the quenched rate of growth of $S_N$ is given by

$\mathbb{E}_{\tau_N}\left[ \frac{S_N(k)}{k}\right] = \frac{\tau_N(1)+\ldots +\tau_N(N)}{N},$

and so the annealed rate

$\mathbb{E}\left[\frac{S_N(k)}{k}\right] = \infty,$

since $\mathbb{E}[\tau_N(1)]=\infty$ . But as in the introduction, these rates are only relevant to laws of large numbers when k grows on a large enough timescale, and we will consider smaller scales of k.

Timescales of the clock process

We’re going to look for scaling limits of the clock process. The increments are ‘sort of IID’ and ‘sort of heavy-tailed’ (we’ll clarify these sort ofs when we need to) so it wouldn’t be surprising if the scaling limits are Levy processes. The clock process is increasing, so in fact the scaling limits should be subordinators, and it wouldn’t be surprising if under some circumstances they turned out to be stable subordinators.

There is flexibility about how to do the rescaling. From now on, we are working in an $N\rightarrow\infty$ regime. Let’s assume we look at $t a_N$ steps of the jump chain, where $(a_N)$ is some divergent sequence. A property of large sums of IID stable distributions with parameter $\alpha\in(0,1)$ is that the scaling of the value of the sum is comparable to the scale of the largest summand. That is, the partial sum is dominated by its largest summands. Compare with the standard light-tailed case, for example exponential RVs, where for k summands, the sum is $\Theta(k)$ , while the largest summand is of order $\log k$ .
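The contrast between the two cases shows up clearly in the ratio of the largest summand to the sum. Here is a rough sketch, assuming pure Pareto summands with $\mathbb{P}(\tau\ge u)=u^{-\alpha}$ for $u\ge 1$ as a stand-in for the stable case (they lie in the same domain of attraction); all names are my own.

```python
import random

def pareto(alpha):
    """Return a sampler for P(tau >= u) = u**(-alpha), u >= 1."""
    return lambda rng: (1.0 - rng.random()) ** (-1.0 / alpha)

def exponential(rng):
    """Sampler for a standard exponential, the light-tailed comparison."""
    return rng.expovariate(1.0)

def max_to_sum_ratio(sampler, k, rng):
    """Largest summand divided by the sum, for k IID samples."""
    xs = [sampler(rng) for _ in range(k)]
    return max(xs) / sum(xs)

def average_ratio(sampler, k, trials, seed=0):
    """Average of the max-to-sum ratio over independent trials."""
    rng = random.Random(seed)
    return sum(max_to_sum_ratio(sampler, k, rng) for _ in range(trials)) / trials
```

Comparing `average_ratio(pareto(0.5), 1000, 200)` with `average_ratio(exponential, 1000, 200)`, the first stays of order one while the second is tiny, of order $\frac{\log k}{k}$ .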

So to identify the scale of the clock process after $a_N$ steps of the jump chain, it’s sufficient to identify the scale of its expected largest holding time. All of this is vague at the level of constants, so we choose a divergent sequence $(c_N)$ for which

$\mathbb{P}(\tau_N(1) \ge c_N) = \Theta\left(\frac{1}{a_N}\right).$

Note 1: this means that the number of holding times among the first $a_N$ which are at least $c_N$ is binomial with $\Theta(1)$ expectation. The fact that this is well-approximated by a Poisson distribution will be relevant shortly.

Note 2: because we already insisted that $\tau$ had a regularly-varying tail (a power law times a slowly-varying function), this gives control of $\mathbb{P}(\tau_N(1)\ge 5c_N)$ etc. as well.
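In the pure Pareto case $\mathbb{P}(\tau\ge u)=u^{-\alpha}$ for $u\ge 1$ (again my simplifying assumption), we can solve for the scale exactly, $c_N=a_N^{1/\alpha}$ , and check Note 1 by simulation. The sketch below counts how many of $a_N$ IID depths exceed $c_N$ ; the function names are hypothetical.

```python
import random

def scale_c(a_N, alpha):
    """Solve P(tau >= c) = 1/a_N when P(tau >= u) = u**(-alpha), u >= 1."""
    return a_N ** (1.0 / alpha)

def exceedance_count(a_N, alpha, rng):
    """Number of a_N IID Pareto depths which are at least c_N."""
    c = scale_c(a_N, alpha)
    return sum(1 for _ in range(a_N)
               if (1.0 - rng.random()) ** (-1.0 / alpha) >= c)

def exceedance_summary(a_N, alpha, trials, seed=0):
    """Return (mean count, fraction of trials with no exceedance)."""
    rng = random.Random(seed)
    counts = [exceedance_count(a_N, alpha, rng) for _ in range(trials)]
    mean = sum(counts) / trials
    frac_zero = sum(1 for c in counts if c == 0) / trials
    return mean, frac_zero
```

For a Poisson(1) count we would expect a mean near 1 and a fraction of empty trials near $e^{-1}$ , and `exceedance_summary(500, 0.5, 1000)` should be close to both.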

We expect that $S_N(t a_N) = \Theta(c_N)$ , and so we consider scaling limits of the process

$\tilde S_N(t):= \frac{1}{c_N} S_N(\lfloor t a_N \rfloor),$

as usual. [Note I am using the opposite convention to VG’s notes, where ~ denotes the unrescaled clock process.]

Scaling limits

We identify two types of scaling limit, depending on whether $a_N\ll N$ or $a_N = \Theta(N)$ . The former is called an intermediate timescale, while the latter is an extreme timescale. After this long motivation and notational preliminary section, my goal is to explain (partly to myself) why these scaling limits are different.

First, we state the result for intermediate timescales. Let $S^{\mathrm{int}}$ be the stable subordinator with parameter $\alpha$ , that is, with Levy measure having density proportional to $u^{-1-\alpha}$ on $(0,\infty)$ . Then $\tilde S_N \Rightarrow S^{\mathrm{int}}$ , in the Skorohod topology. We need to be clear about the sense of convergence, and the role of the random environment. It turns out that if in addition $a_N\ll \frac{N}{\log N}$ , then this convergence holds for almost all realisations of the random environment. That is, the laws of the processes (with respect to the randomness of the jump chain / holding times etc) converge. When $a_N$ is only $\ll N$ , then the convergence holds in probability with respect to the environment. It took me a while to parse what this means. It means that for large N, the probability that the random environment induces a law of $\tilde S_N$ which is far from the law of $S^{\mathrm{int}}$ tends to zero.

The exact Levy triple of the limit process is not the important message here, and if that’s unfamiliar, then it isn’t a problem. The point is that you would also get this limiting Levy process if you took the sum process of genuinely IID random variables with the same $\alpha$ -tail. And this is not surprising. Recall that on the intermediate timescale we have $a_N\ll N$ , so during the first $t a_N$ steps of the jump chain, we do not typically visit many sites more than once. Indeed, if $a_N\ll \sqrt{N}$ , then this is the birthday problem, and we typically visit no site more than once. However, even in the weaker setting $a_N\ll N$ , look at the deepest 1000 sites we visit during the first $ta_N$ steps. We can compute that, in expectation, we visit essentially zero of these more than once. But these 1000 sites dominate the clock process at $ta_N$ . So from the point of view of the clock process, since we hardly ever visit relevant sites twice, the depths $\tau_N(J_N(1)),\tau_N(J_N(2)),\ldots$ are essentially independent, and so it’s unsurprising that we get the scaling limit corresponding to IID partial sums.
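Both computations in this paragraph are short enough to write down. The sketch below gives the exact birthday-problem probability, and a heuristic estimate for the number of deep sites which are revisited: d times the chance that a site visited at one given step is hit again during the other n-1 steps. The function names, and the reduction of the second computation to this per-site estimate, are my own.

```python
import math

def no_repeat_probability(n, N):
    """Probability that n independent uniform draws from N sites are all distinct."""
    p = 1.0
    for i in range(n):
        p *= 1.0 - i / N
    return p

def expected_revisited(d, n, N):
    """Heuristic expected number of the d deepest visited sites seen more than once."""
    # Chance that a fixed site is hit at least once in n - 1 further uniform steps.
    p_again = -math.expm1((n - 1) * math.log1p(-1.0 / N))
    return d * p_again
```

For instance, with $N=10^{12}$ sites and $n=10^6$ steps (so n is exactly $\sqrt{N}$ , where repeats among all visited sites are not negligible), `expected_revisited(1000, 10**6, 10**12)` is about $10^{-3}$ : the 1000 deepest visited sites are still essentially never revisited.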

For extreme timescales, by contrast, this fails. If we take $a_N=1000 N$ , we expect to visit each site roughly 1000 times, indeed the number of visits to a given site will be approximately $\mathrm{Poisson}(1000)$ . But it’s still the case that the scaling limit will be dominated by the deepest sites. In particular, at some point on this timescale we will visit the deepest site, and indeed we will visit it multiple times if we look at $ta_N$ for large t. So the jumps of any scaling limit are not independent any more unless we condition on all the depths $\tau_N$ .

However, all is not lost, since we can show that the point process of rescaled depths $\sum \delta_{\tau_N(i)/c_N}$ converges to a Poisson random measure on $[0,\infty)$ . The candidate for the scaling limit of the clock process is then the subordinator whose Levy measure is this Poisson random measure. This isn’t itself a Levy process, but it is a mixture of Levy processes, reflecting that on extreme timescales the quenched and annealed viewpoints are different since there is enough time to visit the whole landscape.

Heuristically, the extreme timescale is the entry point for convergence to equilibrium. Indeed, taking $t\rightarrow\infty$ , the number of visits to each of the 1000 top sites converges to its expectation, corresponding to convergence of the clock process to equilibrium, since these holding times continue to dominate the sum. The clock process therefore starts to feel the finiteness of the state space, which introduces dependence between the most relevant holding times, which was not the case on intermediate timescales.

In the next post, I’m going to try and summarise VG’s descriptions of taking this model beyond the mean-field setting, where the range of possibilities becomes much much richer. I’m also going to try and say something about glassy dynamics and ageing, and why the physical motivation justifies considering these particular models and scalings.

Subordinators and the Arcsine rule

Posted on December 5, 2012 by dominicyeo

After the general discussion of Levy processes in the previous post, we now discuss a particular class of such processes. The majority of content and notation below is taken from chapters 1-3 of Jean Bertoin’s Saint-Flour notes.

We say $X_t$ is a subordinator if:

It is a right-continuous adapted stochastic process, started from 0.
It has stationary, independent increments.
It is increasing.

Note that the first two conditions are precisely those required for a Levy process. We could also allow the process to take the value $\infty$ , where the hitting time of infinity represents ‘killing’ the subordinator in some sense. If this hitting time is almost surely infinite, we say it is a strict subordinator. There is little to be gained right now from considering anything other than strict subordinators.

Examples

A compound Poisson process, with finite jump measure supported on $[0,\infty)$ . Hereafter we exclude this case, as it is better dealt with in other languages.
A so-called stable Levy process, where $\Phi(\lambda)=\lambda^\alpha$ , for some $\alpha\in(0,1)$ . (I’ll define $\Phi$ very soon.) Note that checking that the sample paths are increasing requires only that $X_1\geq 0$ almost surely.
The hitting time process for Brownian Motion. Note that this does indeed have jumps as we would need. (This has $\Phi(\lambda)=\sqrt{2\lambda}$ .)

Properties

In general, we describe Levy processes by their characteristic exponent. As a subordinator takes values in $[0,\infty)$ , we can use the Laplace exponent instead:

$\mathbb{E}\exp(-\lambda X_t)=:\exp(-t\Phi(\lambda)).$

We can refine the Levy-Khintchine formula:

$\Phi(\lambda)=k+d\lambda+\int_{[0,\infty)}(1-e^{-\lambda x})\Pi(dx),$

where k is the kill rate (in the non-strict case). Because the process is increasing, it must have bounded variation, and so the quadratic part vanishes, and we have a stronger condition on the Levy measure: $\int(1\wedge x)\Pi(dx)<\infty$ .
The expression $\bar{\Pi}(x):=k+\Pi((x,\infty))$ for the tail of the Levy measure is often more useful in this setting.
We can think of this decomposition as the sum of a drift, and a PPP with characteristic measure $\Pi+k\delta_\infty$ . As we said above, we do not want to consider the case that X is a step process, so either d>0 or $\Pi((0,\infty))=\infty$ is enough to ensure this.
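As a sanity check on the refined formula, we can verify numerically that the stable example $\Phi(\lambda)=\lambda^\alpha$ corresponds to no killing, no drift, and Levy measure $\Pi(dx)=\frac{\alpha}{\Gamma(1-\alpha)}x^{-1-\alpha}dx$ . The sketch below substitutes $x=e^y$ and applies the trapezoid rule; the truncation `y_max` and the step size are my own choices, and would need enlarging for $\alpha$ close to 0 or 1.

```python
import math

def stable_laplace_exponent(lam, alpha, y_max=60.0, step=0.01):
    """Approximate the integral of (1 - exp(-lam*x)) Pi(dx) for the stable Levy measure."""
    const = alpha / math.gamma(1.0 - alpha)
    n = int(round(2 * y_max / step))
    total = 0.0
    for i in range(n + 1):
        y = -y_max + i * step
        # With x = exp(y): (1 - exp(-lam*x)) * x**(-1-alpha) dx = (...) * exp(-alpha*y) dy
        value = -math.expm1(-lam * math.exp(y)) * math.exp(-alpha * y)
        weight = 0.5 if i in (0, n) else 1.0   # trapezoid end-point weights
        total += weight * value
    return const * total * step
```

For example, `stable_laplace_exponent(2.0, 0.5)` should agree with $\sqrt{2}$ to many decimal places.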

Analytic Methods

We give a snapshot of a couple of observations which make these nice to work with. Define the renewal measure U(dx) by:

$\int_{[0,\infty)}f(x)U(dx)=\mathbb{E}\left(\int_0^\infty f(X_t)dt\right).$

If we want to know the distribution function of this U, it will suffice to consider the indicator function $f(y)=1_{y\leq x}$ in the above, which gives $U(x)=\mathbb{E}\int_0^\infty 1_{X_t\leq x}dt$ .

The reason to exclude step processes specifically is to ensure that X has a continuous inverse:

$L_x=\sup\{t\geq 0:X_t\leq x\}$ so $U(x)=\mathbb{E}L_x$ is continuous.

In fact, this renewal measure characterises the subordinator uniquely, as we see by taking the Laplace transform:

$\mathcal{L}U(\lambda)=\int_{[0,\infty)}e^{-\lambda x}U(dx)=\mathbb{E}\int e^{-\lambda X_t}dt$

$=\int \mathbb{E}e^{-\lambda X_t}dt=\int\exp(-t\Phi(\lambda))dt=\frac{1}{\Phi(\lambda)}.$
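As a worked example, take the stable subordinator with $\Phi(\lambda)=\lambda^\alpha$ from the list of examples. The Gamma integral gives

$\int_0^\infty e^{-\lambda x}\frac{x^{\alpha-1}}{\Gamma(\alpha)}dx=\lambda^{-\alpha}=\frac{1}{\Phi(\lambda)},$

so by uniqueness of Laplace transforms the renewal measure is $U(dx)=\frac{x^{\alpha-1}}{\Gamma(\alpha)}dx$ , with distribution function $U(x)=\frac{x^\alpha}{\Gamma(1+\alpha)}$ .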

The Arcsine Law

X is Markov, which induces a so-called regenerative property on the range of X, $\mathcal{R}$ . Formally, given s, we do not always have $s\in\mathcal{R}$ (as the process might jump over s), but we can define $D_s=\inf\{t>s:t\in\mathcal{R}\}$ . Then

$\{v\geq 0:v+D_s\in\mathcal{R}\}\stackrel{d}{=}\mathcal{R}.$

In fact, the converse holds as well. Any random set with this regenerative property is the range of some subordinator. Note that $D_s$ is some kind of dual to X, since it is increasing, and the regenerative property induces some Markovian properties.

In particular, we consider the last passage time $g_t=\sup\{s<t:s\in\mathcal{R}\}$ , in the case of a stable subordinator with $\Phi(\lambda)=\lambda^\alpha$ . Here, $\mathcal{R}$ is self-similar with scaling exponent $\alpha$ . The distribution of $\frac{g_t}{t}$ is thus independent of t. In this situation, we can derive the generalised arcsine rule for the distribution of $g_1$ :

$\mathbb{P}(g_1\in ds)=\frac{\sin \alpha\pi}{\pi}s^{\alpha-1}(1-s)^{-\alpha}ds.$

The most natural application of this is to the hitting time process of Brownian Motion, which is stable with $\alpha=\frac12$ . The (closure of the) range of this subordinator is the set of times at which B attains its running supremum S, so $g_1$ is the last time before 1 at which $S_t-B_t=0$ , in the usual notation for the supremum process. Furthermore, we have equality in distribution of the processes (see previous posts on excursion theory and the short aside which follows):

$(S_t-B_t)_{t\geq 0}\stackrel{d}{=}(|B_t|)_{t\geq 0}.$

So $g_1$ gives the time of the last zero of BM before time 1, and the arcsine law shows that its distribution is given by:

$\mathbb{P}(g_1\leq t)=\frac{2}{\pi}\text{arcsin}\sqrt{t}.$
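One way to see the arcsine law numerically is to replace Brownian motion by a simple random walk and record the time of its last visit to zero. This is only a sketch: the random walk stand-in, the number of steps and trials, and the function names are all my own choices, and the discretisation introduces a small bias.

```python
import random

def last_zero_fraction(steps, rng):
    """Return (time of last visit to 0) / steps for a simple random walk."""
    position, last_zero = 0, 0
    for n in range(1, steps + 1):
        position += 1 if rng.random() < 0.5 else -1
        if position == 0:
            last_zero = n
    return last_zero / steps

def empirical_cdf(t, steps=200, trials=3000, seed=0):
    """Fraction of simulated walks whose rescaled last zero is at most t."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if last_zero_fraction(steps, rng) <= t)
    return hits / trials
```

The values `empirical_cdf(0.25)` and `empirical_cdf(0.5)` should be close to $\frac{2}{\pi}\arcsin\sqrt{1/4}=\frac13$ and $\frac{2}{\pi}\arcsin\sqrt{1/2}=\frac12$ respectively.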

The Levy-Khintchine Formula

Posted on December 4, 2012 by dominicyeo

Because of a string of coincidences involving my choice of courses for Part III and various lecturers’ choices about course content, I didn’t learn what a Levy process was until a few weeks ago. Trying to get my head around the Levy-Khintchine formula took a little while, so the following is what I would have liked to have been able to find back then.

A Levy process is an adapted stochastic process started from 0 at time zero, and with stationary, independent increments. This is reminiscent, indeed a generalisation, of the definition of Brownian motion. In that case, we were able to give a concrete description of the distribution of $X_1$ . For a general Levy process, we have

$X_1=X_{1/n}+(X_{2/n}-X_{1/n})+\ldots+(X_1-X_{1-1/n}).$

So the distribution of $X_1$ is infinitely divisible, that is, can be expressed as the distribution of the sum of n iid random variables for all n. Viewing this definition in terms of convolutions of distributions may be more helpful, especially as we will subsequently consider characteristic functions. If this is the first time you have seen this property, note that it is not a universal property. For example, it is not clear how to write a U[0,1] random variable as a convolution of two iid RVs. Note that exactly the same argument suffices to show that the distribution of $X_t$ is infinitely divisible.
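A concrete example of infinite divisibility is the Poisson distribution: a Poisson variable with rate c has the same law as a sum of n iid Poisson variables with rate c/n. In terms of characteristic functions this is just the statement that the nth power of one equals the other, which the following sketch checks numerically (the function names are my own).

```python
import cmath

def poisson_cf(theta, rate):
    """Characteristic function E[exp(i*theta*N)] of N ~ Poisson(rate)."""
    return cmath.exp(rate * (cmath.exp(1j * theta) - 1.0))

def cf_of_iid_sum(theta, rate, n):
    """Characteristic function of a sum of n iid Poisson(rate/n) variables."""
    return poisson_cf(theta, rate / n) ** n
```

For any `theta`, `rate` and `n`, the two functions agree up to rounding error, for example `poisson_cf(0.7, 3.0)` and `cf_of_iid_sum(0.7, 3.0, 5)`.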

It will be most convenient to work with the characteristic functions

$\mathbb{E}\exp(i\langle \lambda,X_t\rangle).$

By stationarity of increments, we can show that this is equal to

$\exp(-\Psi(\lambda)t)\quad\text{where}\quad \mathbb{E}\exp(i\langle \lambda,X_1\rangle)=:\exp(-\Psi(\lambda)).$

This function $\Psi(\lambda)$ is called the characteristic exponent. The argument resembles that used for Cauchy’s functional equations, by dealing first with the rationals using stationarity of increments, then lifting to the reals by the (right-)continuity of

$t\mapsto \mathbb{E}\exp(i\langle \lambda,X_t\rangle).$

As ever, $\Psi(\lambda)$ uniquely determines the distribution of $X_1$ , and so it also uniquely determines the distribution of the Levy process. The only condition on $\Psi$ is that it be the characteristic exponent of an infinitely divisible distribution. This condition is given explicitly by the Levy-Khintchine formula.

Levy-Khintchine

$\Psi(\lambda)$ is the characteristic exponent of an infinitely divisible distribution iff

$\Psi(\lambda)=i\langle a,\lambda\rangle +\frac12 Q(\lambda)+\int_{\mathbb{R}^d}(1-e^{i\langle \lambda,x\rangle}+i\langle \lambda,x\rangle 1_{|x|<1})\Pi(dx).$

for $a\in\mathbb{R}^d$ , Q a quadratic form on $\mathbb{R}^d$ , and $\Pi$ a so-called Levy measure satisfying $\int (1\wedge |x|^2)\Pi(dx)<\infty$ .

This looks a bit arbitrary, so first let’s explain what each of these terms ‘means’.

$i\langle a,\lambda\rangle$ comes from a drift of $-a$ . Note that a deterministic linear function is a (not especially interesting) Levy process.
$\frac12Q(\lambda)$ comes from a Brownian part $\sqrt{Q}B_t$ .

The rest corresponds to the jump part of the process. Note that a Poisson process is an example of a Levy process, hence why we might consider thinking about jumps in the first place. The reason why there is an indicator function floating around is that we have to think about two regimes separately, namely large and small jumps. Jumps of size bounded below cannot happen too often as otherwise the process might explode off to infinity in finite time with positive probability. On the other hand, infinitesimally small jumps can happen very often (say on a dense set) so long as everything is controlled to prevent an explosion on the macroscopic scale.

There is no canonical choice for where the divide between these regimes happens, but conventionally this is taken to be at $|x|=1$ . The restriction on the Levy measure near 0 ensures that the sum of the squares of all jumps up to some finite time converges.

$\Pi\cdot 1_{|x|\geq 1}$ gives the intensity of a standard compound Poisson process. The jumps are well-spaced, and so it is a relatively simple calculation to see that the characteristic exponent is

$\int_{\mathbb{R}^d}(1-e^{i\langle \lambda,x\rangle})1_{|x|\geq 1}\Pi(dx).$

The intensity $\Pi\cdot 1_{|x|<1}$ gives infinitely many hits in finite time, so if the expectation of this measure is not 0, we explode immediately. We compensate by drifting away from this at rate

$\int_{\mathbb{R}^d}x1_{|x|<1}\Pi(dx).$

To make this more rigorous, we should really consider $1_{\epsilon<|x|<1}$ then take a limit, but this at least explains where all the terms come from. Linearity allows us to interchange integrals and inner products, to get the term

$\int_{\mathbb{R}^d}(1-e^{i\langle \lambda,x\rangle}+i\langle\lambda,x\rangle 1_{|x|<1})\Pi(dx).$

If the process has bounded variation, then we must have Q=0, and also

$\int (1\wedge |x|)\Pi(dx)<\infty,$

that is, not too many jumps on an |x| scale. In this case, this drift component is well-defined and linear in $\lambda$ , so can be incorporated with the drift term at the beginning of the Levy-Khintchine expression. If not, then there are some $\lambda$ for which it does not exist.

There are some other things to be said about Levy processes, including

Stable Levy processes, where $\Psi(k\lambda)=k^\alpha \Psi(\lambda)$ for $k>0$ , which induces the rescaling-invariance property: $(k^{-1/\alpha}X_{kt})_{t\geq 0}\stackrel{d}{=}(X_t)_{t\geq 0}$ . The distribution of each $X_t$ is then also a stable distribution.
Resolvents, where instead of working with the process itself, we work with the distribution of the process at a random exponential time.


The Poisson Process – A Third Characteristion

Posted on September 4, 2012 by dominicyeo

There remains the matter of the distribution of the number of people to arrive in a fixed non-infinitesimal time interval. Consider the time interval [0,1], which we divide into n smaller intervals of equal width. As n grows, the probability that two arrivals occur in the same small interval tends to zero (as this is $\leq n\, o(\frac{1}{n})$ ), so we can consider the arrivals as a sequence of iid Bernoulli random variables as before. So

$\mathbb{P}(N_1=k)=\binom{n}{k}(\frac{\lambda}{n})^k(1-\frac{\lambda}{n})^{n-k}$
$\approx \frac{n^k}{k!} \frac{\lambda^k}{n^k}(1-\frac{\lambda}{n})^n\approx \frac{\lambda^k}{k!}e^{-\lambda}.$

We recognise this as belonging to a Poisson (hence the name of the process!) random variable. We can repeat this for a general time interval and obtain $N_t\sim \text{Po}(\lambda t)$ .
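The quality of this approximation is easy to check directly. Here is a small sketch comparing the binomial probability with n subintervals against the Poisson limit; the function names are my own.

```python
import math

def binomial_pmf(k, n, lam):
    """P(exactly k successes) for n Bernoulli trials with success probability lam/n."""
    p = lam / n
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(N = k) for N ~ Po(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)
```

For fixed k and $\lambda$ , the gap between `binomial_pmf(k, n, lam)` and `poisson_pmf(k, lam)` shrinks as n grows, in line with the limit above.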

Note that we implicitly assumed that, in the infinitesimal case at least, behaviour in disjoint intervals was independent. We would hope that this would lift immediately to the large intervals, but it is not immediately obvious how to make this work. This property of independent increments is one of the key definitions of a Levy Process, of which the Poisson process is one of the two canonical examples (the other is Brownian Motion).

As before, if we can show that the implication goes both ways (and for this case it is not hard: letting $t\rightarrow 0$ clearly gives the infinitesimal construction), we can prove results about Poisson random variables with ease, for example $\text{Po}(\lambda)+\text{Po}(\mu)=\text{Po}(\lambda+\mu)$ for independent summands.

This pretty much concludes the construction of the Poisson process. We have three characterisations:
1) The inter-arrival times $X_n\sim\text{Exp}(\lambda)$ are iid.
2) The infinitesimal construction as before, with independence.
3) The number of arrivals in a time interval of width t is $\sim \text{Po}(\lambda t)$ . (This is sometimes called a stationary increments property.) Furthermore, we have independent increments.

A formal derivation of the equivalence of these forms is important but technical, and so not really worth attempting here. See James Norris’s book for example for a fuller exposition.
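While the formal equivalence is technical, it is easy to test one direction empirically: build the process from iid exponential inter-arrival times as in 1), and check that the number of arrivals in [0,t] has mean and variance close to $\lambda t$ , as 3) predicts. The following is only a sketch, with function names and parameters of my own choosing.

```python
import random

def arrivals_before(t, rate, rng):
    """Count arrivals in [0, t] when inter-arrival times are iid Exp(rate)."""
    count = 0
    clock = rng.expovariate(rate)
    while clock <= t:
        count += 1
        clock += rng.expovariate(rate)
    return count

def mean_and_variance(t, rate, trials, seed=0):
    """Empirical mean and variance of the arrival count over independent runs."""
    rng = random.Random(seed)
    counts = [arrivals_before(t, rate, rng) for _ in range(trials)]
    mean = sum(counts) / trials
    var = sum((c - mean) ** 2 for c in counts) / trials
    return mean, var
```

With $\lambda=3$ and $t=2$ , `mean_and_variance(2.0, 3.0, 5000)` should return two values close to 6.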

The final remark is that the Poisson Process has the Markov property. Recall that this says that conditional on the present, future behaviour is independent of the past. Without getting into too much detail, we might like to prove this by using the independent increments property. But remember that for a continuous process, it is too much information to keep track of all the distributions at once. It is sufficient to track only the finite marginal distributions, provided the process is cadlag, which the Poisson process is, assuming we deal with the discontinuities in the right way. Alternatively, the exponential random variable is memoryless, a property that can be lifted, albeit with some technical difficulties, to show the Markov property.

Strong Markov Property for BM

Posted on April 14, 2012 by dominicyeo

The Strong Markov Property is the most important result to demonstrate for any Markov process, such as Brownian Motion. It is also probably the most widely requested item of bookwork on the Part III Advanced Probability exam. I feel it is therefore worth practising writing as quickly as possible.

Theorem (SMP): Take $(B_t)$ a standard $(\mathcal{F}_t)$ -BM, and T an a.s. finite stopping time. Then $(B_{T+t}-B_T,t\geq 0)$ is a standard BM independent of $\mathcal{F}_T$ .

Proof: We write $B_t^{(T)}=B_{T+t}-B_T$ for ease of notation. We will show that for any $A\in\mathcal{F}_T$ and F bounded, measurable:

$\mathbb{E}[1_AF(B_{T+t_1}-B_T,\ldots,B_{T+t_n}-B_T)]=\mathbb{P}(A)\mathbb{E}F(B_{t_1},\ldots,B_{t_n})$

This will suffice to establish independence, and taking $A=\Omega\in\mathcal{F}_T$ shows that $B^{(T)}$ is a standard BM since (Levy) BM is uniquely characterised by its finite joint distributions.

To prove the result, we approximate T discretely, and apply the Markov property.

$\mathbb{E}[1_AF(B_{t_1}^{(T)},\ldots)]=\lim_{m\rightarrow\infty}\sum_{k=1}^\infty \mathbb{E}[1_{A\cap\{T\in((k-1)2^{-m},k2^{-m}]\}}F(B_{t_1}^{(k2^{-m})},\ldots)]$

by bounded convergence, using continuity of F, right-continuity of B, and that $T<\infty$ a.s. (so that $1_A=\sum 1_{A\cap \{T\in(-,-]\}}$ )

$\stackrel{\text{WMP}}{=}\lim_{m\rightarrow\infty}\sum_{k=1}^\infty \mathbb{P}[A\cap\{T\in((k-1)2^{-m},k2^{-m}]\}]\mathbb{E}F(B_{t_1},\ldots,B_{t_n})$

$\stackrel{\text{DOM}}{=}\mathbb{P}(A)\mathbb{E}F(B_{t_1},\ldots,B_{t_n})$

which is exactly what we required.

Remarks: 1) We only used right-continuity of the process, and characterisation by joint marginals, so the proof works equally well for Levy processes.

2) We can in fact show that it is independent of $\mathcal{F}_T^+$ , by considering $T+\frac{1}{n}$ which is still a stopping time, then taking a limit in this n as well in the above proof. For details of a similar result, see my post on Blumenthal’s 0-1 Law.

Feller Processes and the Strong Markov Property (eventuallyalmosteverywhere.wordpress.com)
Remarkable fact about Brownian Motion #2: Blumenthal’s 0-1 Law and its Consequences (eventuallyalmosteverywhere.wordpress.com)

Motivating Ito’s Formula

Posted on January 20, 2012 by dominicyeo

Ito’s formula, which characterises the stochastic differential, has been mentioned by various textbooks and courses, but now for the first time (after James Norris’s first lecture for the Stochastic Calculus course) I think I finally have a reasonable idea of what’s going on. The reasons I was initially confused help to explain what the motivation is:

What processes can we consider? Well, initially continuous time, time-homogeneous Markov processes in $\mathbb{R}^d$ with continuous paths. It could be space-homogeneous as well if desired. By the theory of decomposition of Levy processes (ie what we are considering), the continuous paths property gives that such a process must be a Brownian motion with drift. This has the property that $X_{t+dt}-X_t\sim N(b(X_t)dt,a(X_t)dt)$ where $a(X_t)$ is the diffusivity, that is, the intensity of the Brownian component, and $b(X_t)$ is the drift.
What is the stochastic differential? Well, for a process as above, we define $dX_t:= X_{t+dt}-X_t$ , so that $dX_t\sim N(b(X_t)dt,a(X_t)dt)$ . This is non-deterministic: that’s reasonable since X is a stochastic process. And, a normal differential is meaningful only when you integrate, so similarly the stochastic differential is only meaningful when you take an expectation.
Write $N_t$ for the Brownian noise. Then $\mathbb{E}[d(f(X_t))|\mathcal{F}_t]=\mathbb{E}[f(X_{t+dt})-f(X_t)|\mathcal{F}_t]$ , so by Taylor: $=\mathbb{E}[f'(X_t)(b(X_t)dt+N_t)+\frac12 f''(X_t)N_t^2 +o(dt)]$ , remembering that $N_t=O(\sqrt{dt})$ .
This is generally written as $\mathbb{E}[d(f(X_t))|\mathcal{F}_t]=Lf(X_t)dt$ where $Lf(x)=b(x)f'(x)+\frac12 a(x)f''(x)$ . Now note that $\mathbb{E}[dX_t|\mathcal{F}_t]=b(X_t)dt$ and $\mathbb{E}[dX_tdX_t|\mathcal{F}_t]=a(X_t)dt$ , so it is reasonable that we might ‘cancel the expectations’ to get: $d(f(X_t))=f'(X_t)dX_t+\frac12 f''(X_t)dX_tdX_t$ .
Use a suitable tensor product or $dX_tdX_t^T$ when d>1.
This is (a version of) Ito’s Lemma.
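As a quick check of the formula, take constant drift b and diffusivity a, and $f(x)=x^2$ . Then $d(X_t^2)=2X_tdX_t+dX_tdX_t$ , and taking expectations gives $\frac{d}{dt}\mathbb{E}[X_t^2]=2b\,\mathbb{E}[X_t]+a$ , so $\mathbb{E}[X_T^2]=(x_0+bT)^2+aT$ . The sketch below compares this with a simple Euler simulation; the constant coefficients, step counts and function names are my own choices for illustration.

```python
import math
import random

def simulate_endpoint(x0, b, a, T, steps, rng):
    """Euler scheme for dX = b dt + sqrt(a) dB with constant b and a."""
    dt = T / steps
    x = x0
    for _ in range(steps):
        x += b * dt + math.sqrt(a * dt) * rng.gauss(0.0, 1.0)
    return x

def mean_of_square(x0, b, a, T, steps=50, paths=4000, seed=0):
    """Monte Carlo estimate of E[X_T^2]."""
    rng = random.Random(seed)
    total = sum(simulate_endpoint(x0, b, a, T, steps, rng) ** 2 for _ in range(paths))
    return total / paths

def ito_prediction(x0, b, a, T):
    """E[X_T^2] obtained by integrating d E[X_t^2] = (2 b E[X_t] + a) dt."""
    return (x0 + b * T) ** 2 + a * T
```

Without the second-order term $\frac12 f''(X_t)dX_tdX_t$ , the prediction would miss the $aT$ contribution, which the simulation clearly picks up.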