Conditional Expectations

To define an expectation conditional on an event in a probability space is essentially no more than defining a conditional probability, then constructing the expectation as an integral with respect to this measure. At an abstract level, it is often more useful to define an expectation conditional on a sigma-algebra. Informally, we want to define the expectation conditional on every event in a sigma-algebra simultaneously. Most importantly, the result is measurable with respect to this sigma-algebra, so is a random variable itself under suitable conditions.

We want to proof that such a construction exists. We take X a \mathcal{F}-measurable random variable, and consider a sub-sigma-algebra \mathcal{G}\subset\mathcal{F}. We want to define Y=\mathbb{E}[X|\mathcal{G}], which is integrable, \mathcal{G}-measurable, and such that

\mathbb{E}X1_A=\mathbb{E}Y1_A, for all A\in\mathcal{G}.

We also want to show that this conditional expectation is, up to null events, unique. When \mathcal{G}=\sigma(B_i,i\in I) is generated by a countable collection of disjoint events, then we can define

Y:=\sum_{i\in I}\mathbb{E}[X|B_i]1_{B_i},

and verify that this satisfies the conditions.

We proceed in the general case. Uniqueness is easy. Suppose have Y,Y' satisfying the conditions. Then the event A=\{Y>Y'\}\in\mathcal{G}. Substituting into the definition gives:

\mathbb{E}[(Y-Y')1_A]=\mathbb{E}[X1_A]-\mathbb{E}[X1_A]=0

But (Y-Y')1_A\geq 0 and so we conclude that \mathbb{P}(A)=0, and Y'\geq Y almost surely. Of course the reverse argument applies also, and so Y=Y' a.s.

For existence, we exploit a property of Hilbert spaces. We initially assume X\in L^2. We can decompose the host space as

L^2(\mathcal{F})=L^2(\mathcal{G})+L^2(\mathcal{G})^{\perp} and $X=Y+Z$

in this orthogonal projection. The operator on this space is \langle X,Y\rangle:=\mathbb{E}XY, and so

1_A\in L^2(\mathcal{G})\Rightarrow \mathbb{E}[Z1_A]=0,

From this, we conclude that Y is suitable. For what follows, observe that \{Y<0\} is \mathcal{G}-measurable, and so by a similar argument to before, X\geq 0 a.s. implies Y\geq 0 a.s.

For general X\geq 0, set X_n:=X\wedge n\uparrow X, and Y_n:=\mathbb{E}[X_n|\mathcal{G}]. By our previous observation, everything relevant is almost surely non-negative, and we can apply monotone convergence to both sides of the relation

\mathbb{E}[X_n1_A]=\mathbb{E}[Y_n1_A]

to obtain the definition, and take A=\Omega to check integrability. Separating general X into positive and negative parts gives the result for generally supported random variables.

Advertisement

Martingale Convergence

I continue the theme of explaining important bookwork from the Part III course, Advanced Probability, as succinctly as possible. In this post, we consider the convergence properties of discrete time martingales.

1) Theorem: Assume X is a martingale bounded in L^1. Then

X_n\rightarrow X_\infty\in L^1(\mathcal{F}_\infty) a.s.

Remark: The theorem and proof works equally well for X a supermartingale.

Proof: Essentially, we want to reduce convergence of the random variables which make up the martingale to a countable collection of events. We do this by considering upcrossings, which counts the number of times the process alternates from less than a given interval to greater than a given interval. The formal definition will be too wide for this format, so we summarise as

N_n([a,b],X)= the number of disjoint time intervals up to time n in which X goes from [-\infty a) to (b,\infty]. Define N([a,b],X) to be the limit as n increases to infinity.

It is a genuinely easy check that a sequence converges (possibly to \infty) iff this number of upcrossings of any interval with rational bounds is finite. We will show that the martingale almost surely has this property. The key lemma is a bound due to Doob:

Lemma: (b-a)\mathbb{E}[N_n([a,b],X)]\leq \mathbb{E}[(X_n-a)^-]

 Proof: Say S_1<T_1<S_2<T_2,\ldots are the successive hitting times of [-\infty,a),(b,\infty] respectively. So N_n=\inf\{k:T_k\leq n\}. We decompose, abbreviating the number of upcrossings as N.

\sum_{k=1}^n(X_{T_k\wedge n}-X_{S_k\wedge n})=\sum_{k=1}^N(X_{T_k}-X_{S_k})+(X_n-X_{S_{N+1}})1_{\{S_{N+1}\leq n\}}

Now take an expectation of both sides, applying the Optional Stopping Theorem to the bounded stopping times on the LHS. (If we are working with a supermartingale, formally we need to take \mathbb{E}[\cdot|\mathcal{F}_{S_k}] of each summand on LHS to show that they are non-negative, and so taking a further expectation over \mathcal{F}_{S_k} gives the required result.) We obtain:

0\geq (b-a)\mathbb{E}N-\mathbb{E}(X_n-X_{S_{N+1}})1_{\{S_{N+1}\leq n\}}

If S_{N+1}>n then both 1_{\{S_{N+1}\leq n\}}=(X_n-a)^-=0. Otherwise (X_n-a)^-\geq X_{S_{N+1}}-X_n. This complete the proof of the lemma.

Since \mathbb{E}[(X_n-a)^-]\leq \mathbb{E}|X_n|+a<\infty, where this last bound is uniform in by assumption, applying monotone convergence, we get that N([a,b],X) is almost surely finite for every pair a<b\in\mathbb{Q}. Because this set is countable, we can deduce that this holds almost surely for every pair simultaneously. We therefore define X_\infty(w)=\lim X_n(w) when this limit exists, and 0 otherwise. With probability one the limit exists. Fatou’s lemma confirms that X_\infty\in L^1(\mathcal{F}_\infty).

2) We often want to have convergence in L^1 as well. Recall for Part II Probability and Measure (or elsewhere) that

UI + Convergence almost surely is necessary and sufficient for convergence in L^1.

This applies equally well to this situation. Note that for a martingale, this condition is often convenient, because, for example, we know that the set \{\mathbb{E}[X_\infty|\mathcal{F}_n],n\} is UI for any integrable X_\infty.

3) Convergence in L^p is easier to guarantee.

Theorem: i) X a martingale bounded in L^p iff ii) X_n\rightarrow X_\infty in L^p and almost surely iff iii) \exists Z\in L^p s.t. X_n=\mathbb{E}[Z|\mathcal{F}_n] a.s.

Remark: As part of the proof, we will show, as expected, that X_\infty,Z are the same.

Proof: i)->ii) Almost sure convergence follows from the above result applied to the p-th power process. We apply Doob’s inequality about running maxima in a martingale process:

||X_n^*||_p:=||\sup_{m\leq n}X_m||_p\leq \frac{p}{p-1}||X_n||_p

Using this, we see that X_n^*\uparrow X_\infty^*:=\sup|X_k|. Now consider |X_n-X_\infty|\leq 2X_\infty^*\in L^p and use Dominated Convergence to confirm convergence in L^p.

Note that Doob’s L^p inequality can be proven using the same author’s Maximal inequality and Holder.

ii)->iii) As we suspected, we show Z=X_\infty is suitable.

||X_n-\mathbb{E}[X_\infty|\mathcal{F}_n]||_p\stackrel{\text{large }m}{=}||\mathbb{E}[X_m-X_\infty|\mathcal{F}_n]||_p\stackrel{\text{Jensen}}{\leq}||X_m-X_\infty||_p\rightarrow 0

iii)->i) is easy. Z bounded in L^p implies X bounded by a simple application of the triangle inequality in the definition of conditional expectation.

Strong Markov Property for BM

The Strong Markov Property is the most important result to demonstrate for any Markov process, such as Brownian Motion. It is also probably the most widely requested item of bookwork on the Part III Advanced Probability exam. I feel it is therefore worth practising writing as quickly as possible.

Theorem (SMP): Take (B_t) a standard (\mathcal{F}_t)-BM, and T an a.s. finite stopping time. Then (B_{T+t}-B_T,t\geq 0) is a standard BM independent of \mathcal{F}_T.

Proof: We write B_t^{(T)}=B_{T+t}-B_T for ease of notation. We will show that for any A\in\mathcal{F}_T and F bounded, measurable:

\mathbb{E}[1_AF(B_{T+t_1}-B_T,\ldots,B_{T+t_n}-B_T)]=\mathbb{P}(A)\mathbb{E}F(B_{t_1},\ldots,B_{t_n})

This will suffice to establish independence, and taking A=\Omega\in\mathcal{F}_t shows that B_t^T is a standard BM since (Levy), BM is uniquely characterised by its finite joint distributions.

To prove the result, we approximate discretely, and apply the Markov property.

\mathbb{E}[1_AF(B_{t_1}^{(T)},\ldots)]=\lim_{m\rightarrow\infty}\sum_{k=1}^\infty \mathbb{E}[1_{A\cap\{T\in((k-1)2^{-m},k2^{-m}]\}}F(B_{t_1}^{(k2^{-m})},\ldots)]

by bounded convergence, using continuity of F, right-continuity of B, and that T<\infty a.s. (so that 1_A=\sum 1_{A\cap \{T\in(-,-]\}})

\stackrel{\text{WMP}}{=}\lim_{m\rightarrow\infty}\sum_{k=1}^\infty \mathbb{P}[A\cap\{T\in((k-1)2^{-m},k2^{-m}]\}]\mathbb{E}F(B_{t_1},\ldots,B_{t_n})

\stackrel{\text{DOM}}{=}\mathbb{P}(A)\mathbb{E}F(B_{t_1},\ldots,B_{t_n})

which is exactly what we required.

Remarks: 1) We only used right-continuity of the process, and characterisation by joint marginals, so the proof works equally well for Levy processes.

2) We can in fact show that it is independent of \mathcal{F}_T^+, by considering T+\frac{1}{n} which is still a stopping time, then taking a limit in this as well in the above proof. For details of a similar result, see my post on Blumenthal’s 0-1 Law.

One queue or many queues?

You have counters serving customers. What is a better arrangement? To have queues, one for each counter, as in a supermarket (in the heady days before self-checkout, when ‘something unexpected in the bagging area’ might necessitate a trip to the doctor); or a single queue feeding to counters when they become free, as at a train station, or airport passport control?

Queueing is a psychologically intense process, at least in this country. Here are some human points:

– When you have to choose your line, you might, if you have nothing better to do, spend your time in the queue worrying about whether you chose optimally; whether Fortune has given a customer who arrived after you an advantage. (*)

– A single queue feels long, but conversely gives its users the pleasing feeling of almost constant progress. A particularly difficult customer doesn’t have as great an effect.

Interestingly, we can qualify these remarks in purely mathematical terms. We shall see that the mean waiting times are the same, but that, as we might expect, the variance is less for the single queue.

So to begin, we assume that the arrival process is Poisson, and the service times are exponential. These assumptions are very much not required, but it makes it easier to set up the model as a Markov chain. We do need that \lambda, the arrival rate, is less than \mu N, where \mu is the service rate. Why? Well, otherwise the number of customers waiting to be served will almost surely grow to infinity. When supply is greater than demand, as we specify, we know that there will (almost surely) be infinitely many times when there are no customers in the system. This shows that the Markov chain is recurrent, and so it is meaningful to talk about a mean waiting time.

We assume that in the N-queues arrangement, if a counter is free but the corresponding queue is empty, then a customer from a non-zero queue will take the place if possible. We do not worry about how this customer is chosen. Indeed, we do not concern ourselves with how the customers choose a queue, except to ask that the choice algorithm is previsible. That is, they cannot look into the future and optimise (or otherwise) accordingly. This assumption is standard and reasonable by considering the real-world situation.

Now if we set things up in the right way, we can couple the two processes. Precisely, we start with no customers, and define

0<A_1<A_2<\ldots

to be the sequence of arrival times (note that almost surely no two customers arrive at the same time). Now say that

0<D_1<D_2<\ldots

are the times at which a customer starts being served, that is, moves from the queue to the counter. Say S_1,S_2,\ldots are the service times of the customers, where S_i is the service time of the customer who leaves the queue at time D_i.

The important point is that service times are independent of the queueing process, and are i.i.d. This is where we use the fact that the queue choice strategies are previsible. So, in fact all we need to examine is how customers spend in the queue. Concretely, take the queueing time, and S the service time. These are independent so:

\mathbb{E}[Q+S]=\mathbb{E}Q+\mathbb{E}S

and \text{Var} (Q+S)=\text{Var } Q+\text{Var } S.

Observe that A_i,S_i have the same meanings and distributions in both queue models. It is not obvious that D_i does. We notice that for both models, if we define

C_n(t)=\#\{i:D_i+S_i\geq t\} on [D_n,\infty),

the number of the first n customers being served at time t, we have

D_{n+1}=A_{n+1}\vee \inf\{t\geq D_n: C_n(t)\leq N-1\}.

Therefore, by induction on n, we have that (D_i) is well-defined as the same function of (A_i,S_i) for both models. This fact was given as implicitly obvious by many of the remarks on this topic which I found. In my opinion, it is no more obvious than the solution to the original problem.

We can also define, for both models, the number of customers served before the queue is empty again as:

\inf\{N:\exists \epsilon>0\text{ s.t. }c_N(A_{N+1}-\epsilon)=0\},

ie, the least N for which the queue is empty just before the N+1-th customer arrives. As previously discussed, N is a.s. finite. The processes are Markov chains, the first time the queue is empty is a stopping time, and it is meaningful to consider the average waiting time of the N customers before the queue is empty again. There is an expectation over N implicit here too, but this won’t be necessary for our result.

We can now finish the proof. This mean waiting time is

\frac{1}{N}\sum_{i=1}^N(D_i-A_i)

for the single queue. For multiple queues, suppose the i-th customer to arrive is the \sigma(i)-th to be served, where \sigma\in S_N is a permutation. Then the waiting time is

\frac{1}{N}\sum_{i=1}^N (D_{\sigma(i)}-A_i),

which is clearly equal to the previous expression as we can reorder a finite sum.

For variance, we have to consider

\frac{1}{N}\sum_{i=1}^N (D_{\sigma(i)}-A_i)^2.$

Observe that for a\leq b\leq c\leq d, we have 0\leq(d-c)(b-a) and so

(c-a)^2+(d-b)^2\leq (d-a)^2+(c-b)^2,

which we can then apply repeatedly to show that

\sum_{i=1}^N(D_{\sigma(i)}-A_i)^2\geq\sum_{i=1}^N(D_i-A_i)^2,

with equality precisely when \sigma=\text{id}. So we conclude that the variance is in general higher when there are queues. And the reason is not that different to what we had originally suspected (*), namely that customers do not necessarily get served in the same order as they arrive when there are multiple queues, and so the variance of waiting time is greater.

References

This problem comes from Examples Sheet 1 of the Part III course Stochastic Networks. An official solution of the same flavour as this is also linked from the course page.

Feller Processes and the Strong Markov Property

Markov Property

We go way back to the Part 1B short course Markov Chains. In the first lecture of this course, we met discrete time Markov Chains. A definition was given, in terms of conditional single period transition probabilities, and it was immediately proved that general transition probabilities are specified by sums of products of entries in a so-called transition matrix. This proves very useful for performing calculations. But the question will inevitably be asked: “Prove that this is a Markov process.” And the answer “because it obviously is” isn’t good enough.

The point is that all of the above is relevant to the setting, but the ideal definition of a Markov process is something like the very general statement:

Conditional on the present, the past and the future are independent.

This opens up the possibility of a large class of processes being Markov processes. A technical definition would be that for s<t and a measurable subset of the state space.

\mathbb{P}(X_t\in A|\mathcal{F}_s)=\mathbb{P}(X_t\in A|\sigma(X_s)).

It is easy to check that this is equivalent to the original definition in the that context.

Strong Markov Property

SMP states that given a stopping time T, conditional on the event \{T<\infty\}:

(X_{T+t}-X_T, t\geq 0)\stackrel{d}{=}(X_t^0,t\geq 0),

that is, the process started at time T has the same distribution as the original process started from 0 (in space as well as time). Furthermore, it is independent of \mathcal{F}_T, which requires technical definition, but which informally is the sigma field of events defined by what happens up to time T.

For a discrete time Markov chain, prove SMP by pre-multiplying by the indicator function 1(T=n), which reduces SMP to the normal Markov property. Then take the countable sum over (which is permissible) to get SMP. For Brownian Motion in one dimension, make dyadic approximations to the stopping time from above. SMP applies to these approximations, and measure theoretic machinery and the (almost sure) continuity of paths allows the equivalence of distributions to hold in the limit. Independence follows by expressing \mathcal{F}_T=\cap \mathcal{F}_{T_n} as the intersection of sigma fields corresponding to the approximations.

In both of these cases, an implicit countable substructure (discrete time and continuity respectively) have been required to deduce SMP. This suggests that there are Markov processes which do not have SMP.

Motivating Counterexample

Take B to be a Brownian Motion in one dimension, with B_0 a RV which contains 0 in its support. Now define the the process:

X_t=B_t1_{B_0\neq 0}.

Then X is certainly not Strong Markov, by considering the hitting time of 0. Then the process started there is either identically 0, or a standard BM, but which is determined by time 0 properties rather than time T properties.

But is Markov. Take s<t and A Borel. Then:

\mathbb{P}(X_t\in A|\mathcal{F}_s)=\mathbb{E}[1(X_t\in A\cap X_0\neq 0)+1(X_t\in A\cap X_0=0)|\mathcal{F}_s]

=1(X_0\neq 0)\int_A \frac{1}{\sqrt{2\pi(t-s)}}\exp(-\frac{(X_s-y)^2}{2(t-s)})dy+1(X_0=0)1(0\in A)

1(X_s\neq 0)\int_A(\ldots)dy + 1(X_s=0,X_0\neq 0)\int_A(\ldots)dy + 1(0\in A)[1(X_s=0)-1(X_s=0,X_0\neq 0)]

Now 1(X_s=0, X_0\neq 0)=0 a.s. so

= \mathbb{E}[1(X_t\in A)|X_s], which is precisely the Markov property.

Feller Property

In general, it hard to verify the Strong Markov Property for a given stochastic process. Instead, we consider a property which is stronger, but easier to check. Continue reading

SLE revision 2: Loewner’s Differential Equation

Last time I set up the geometric notions of probability that will be needed to proceed with the course material. Now we consider the deterministic differential equation due to Loewner (1923) which he used to make progress on the Bieberbach Conjecture, but which will also underpin the construction of SLE. This proof is adapted for this specific case from the slightly more general argument in Duren’s Univalent Functions (Section 3.3). Because in that setting the result concerns an infinite domain, readers should beware that though I am using identical notation, in about half the cases, my function are the inverse and my sets the complement of what they are in Duren.

To explain the construction, as with so many things, a picture speaks a thousand words. Unfortunately I have neither the software nor, right now, the time to produce the necessary diagrams, so the following will have to suffice.

Consider a deterministic simple curve in the unit disc, (\gamma(t): t\in[0,\infty)). Removing initial segments of the curve gives the nested simply-connected regions:

U_t:=\mathbb{U}\backslash \gamma[0,t].

Then define as in the previous post, the unique conformal map

f_t: U_s\rightarrow \mathbb{U} such that f_t(0)=0, f_t'(0)\in\mathbb{R}^+,

and furthermore set \xi_t to be the image of \gamma(t) under this map. (Note that though the conformal map is not defined on the boundary, it must extend continuously).

f_t'(0) is increasing.

Very informally, this derivative records how much twisting is required at the origin to turn the slit domain into the open disk. Extending the path will demand further twisting. More rigorously, set:

g_t=f_t^{-1}:\mathbb{U}\rightarrow U_t.

Then the (g_t) are injective functions from the unit disc to itself which preserve the origin, so Schwarz’s lemma applies. They are clearly not rotations, so

|g_t'(0)|<1.

By the inverse function theorem, |f_t'(0)|>1 (*). Now, given t>s, can decompose:

f_t=f_s\circ \tilde{f},

and \tilde{f} has this useful Schwarz property (*) also. By applying the chain rule, noting that f_s(0)=0, we deduce that

|f_t'(0)|>|f_s'(0)|.

This means we are free to demand that the curve has time parameter such that

|f_t'(0)|=e^t.

A reminder of the statement of Schwarz’s Lemma: Continue reading

Coalescence 2: Marcus-Lushnikov Processes and a neat equivalence

Last time, discussed the Smoluchowski equations which define an infinite volume mean-field model for coalescence. Now we consider a stochastic coalescent model, the Marcus-Lushnikov process. Here, we start with a finite collection of finite mass particles, with total mass N. Then we define a process by demanding that given particles with masses and coalesce into a single particle with mass x + y at rate K(x,y)/N.

This can be formalised as a continuous-time Markov chain in different ways. The underlying state space consists of the set of unordered multisets of positive integers which sum to N. But rather than considering the configurations of masses themselves, it is sometimes more convenient to take the state space to be:

\{n=(n_1,\ldots,n_N): \sum xn_x=N\}.

Here n_x records the number of particles with mass x, and the constraint controls conservation of mass. Writing out a transition of this Markov chain is notationally annoying, but they have simple transition rates. The rate of transition from (n_1,\ldots,n_N) to the state where masses x and have coalesced is given by

N^{-1}K(x,y)n_xn_y.

Therefore, the Marcus-Lushnikov process ML^{(N)}_t is the process of component sizes in the finite particle setting of coalescence. The existence and mode of convergence of these processes to the deterministic solutions to Smoluchowski’s equation are of particular interest in many models.

As discussed in the previous post, there is an obvious heuristic link between the multiplicative coalescent and random graph processes. An easy but interesting explicit equivalence can be drawn between the Marcus-Lushnikov process with kernel K(x,y) = xy and monodisperse (that is, starting with unit mass particles) initial conditions and a specific random graph structure.

Proposition: The process ML^{(N)}_t with the conditions above is equivalent to the process of component sizes in \mathcal{G}(N,1-e^{-t/N}).

Proof: First, observe that we can couple the latter process in the obvious way by associating a U[0,1] random variable U_e with each of the \binom{N}{2} edges. The edge e is included at time iff U_e\leq 1-e^{-t/N}. In this process, the appearances of different edges are independent and

\mathbb{P}(\text{edge \emph{e} appears after \emph{t}})=e^{-t/N}.

Therefore the waiting times for a given edge to appear are independent \exp(1/N) RVs. In particular, the edge process is memoryless, hence the component size process is Markov. Then it is genuinely obvious that the rate at which an edge joining given distinct components of sizes and appears is N^{-1}xy. So the evolution is exactly the same as the Marcus-Lushnikov process, and the initial configuration is monodisperse.