Antichains in the grid

In the previous post on this topic, we discussed Dilworth’s theorem on chains and antichains in a general partially ordered set. In particular, whatever the size of the largest antichain in a poset, it is possible to partition the poset into exactly that many chains. So for various specific posets, or the directed acyclic graphs associated to them, we are interested in the size of this largest antichain.

The following example turned out to be more interesting than I’d expected. At a conventional modern maths olympiad, there are typically three questions on each paper, and for reasons lost in the mists of time, each student receives an integer score between 0 and 7 per question. A natural question to ask is “how many students need to sit a paper before it’s guaranteed that one will scores at least as highly as another on every question?” (I’m posing this as a straight combinatorial problem – the correlation between scores on different questions will be non-zero and presumably positive, but that is not relevant here.)

The set of outcomes is clearly \{0,1,\ldots,7\}^3, with the usual weak domination partial order inherited from \mathbb{R}^3. Then an antichain corresponds to a set of triples of scores such that no triple dominates another triple. So the answer to the question posed is: “the size of the largest antichain in this poset, plus one.”

In general, we might ask about \{1,2,\ldots,n\}^d, again with the weak domination ordering. This directed graph, which generalises the hypercube as well as our example, is called the grid.

Heuristics for the largest antichain

Retaining the language of test scores on multiple questions is helpful. In the previous post, we constructed a partition of the poset into antichains, indexed by the elements of some maximal chain, by starting with the sources, then looking at everything descended only from sources, and so on. (Recall that the statement that this is possible was referred to as the dual of Dilworth’s theorem.) In the grid, there’s a lot of symmetry (in particular under the mapping x\mapsto n+1-x in every coordinate), and so you end up with the same family of antichains whether you work upwards from the sources or downwards from the sinks. (Or vice versa depending on how you’ve oriented your diagram…) The layers of antichains also have a natural interpretation – each layer corresponds to a given total score. It’s clear a priori why each of these is an antichain. If A scores the same as B overall, but strictly more on the first question, this must be counterbalanced by a strictly lower score on another question.

So a natural guess for the largest antichain is the largest antichain corresponding to some fixed total score. Which total score should this be? It ought to be the middle layer, that is total score \frac{(n+1)d}{2}, or the two values directly on either side if this isn’t an integer. My intuition was probabilistic. The uniform distribution on the grid is achieved by IID uniform distributions in each coordinate, which you can think of as a random walk, especially if you subtract off the mean first. It feels that any symmetric random walk should have mode zero or next-to-zero. Certainly this works asymptotically in a rescaled sense by CLT, and in a slightly stronger sense by local CLT, but we don’t really want asymptotics here.

When I started writing the previous paragraph, I assumed there would be a simple justification for the claim that the middle layer(s) was largest, whether by straight enumeration, or some combinatorial argument, or even generating functions. Perhaps there is, and I didn’t spot it. Induction on d definitely works though, with a slightly stronger hypothesis that the layer sizes are symmetric around the median, and monotone on either side of the median. The details are simple and not especially interesting, so I won’t go into them.

From now on, the hypothesis is that this middle layer of the grid is the largest antichain. Why shouldn’t it, for example, be some mixture of middle-ish layers? (*) Well, heuristically, any score sequence in one layer removes several possibilities from a directly adjacent layer, and it seems unlikely that this effect is going to cancel out if you take some intermediate number of score sequences in the first layer. Also, the layers get smaller as you go away from the middle, so because of the large amount of symmetry (coordinates are exchangeable etc), it feels reasonable that there should be surjections between layers in the outward direction from the middle. The union of all these surjections gives a decomposition into chains.

This result is in fact true, and its proof by Bollobas and Leader, using shadows and compression can be found in the very readable Sections 0 and 1 of [1].

Most of the key ideas to a compression argument are present in the case n=2, for which some notes by Leader can be found here, starting with Proof 1 of Theorem 3, the approach of which is developed over subsequent sections. We treat the case n=2, but focusing on a particularly slick approach that does not generalise as successfully. We also return to the original case d=3 without using anything especially exotic.

Largest antichain in the hypercube – Sperner’s Theorem

The hypercube \{0,1\}^d is the classical example. There is a natural correspondence between the vertices of the hypercube, and subsets of [d]. The ordering on the hypercube corresponds to the ordering given by containment on \mathcal{P}([d]). Almost by definition, the k-th layer corresponds to subsets of size k, and thus includes \binom{d}{k} subsets. The claim is that the size of the largest antichain is \binom{d}{\lfloor d/2 \rfloor}, corresponding to the middle layer if d is even, and one of the two middle layers if d is odd. This result is true, and is called Sperner’s theorem.

I know a few proofs of this from the Combinatorics course I attended in my final year at Cambridge. As explained, I’m mostly going to ignore the arguments using compression and shadows, even though these generalise better.

As in the previous post, one approach is to exhibit a covering family of exactly this number of disjoint chains. Indeed, this can be done layer by layer, working outwards from the middle layer(s). The tool here is Hall’s Marriage Theorem, and we verify the relevant condition by double-counting. Probably the hardest case is demonstrating the existence of a matching between the middle pair of layers when d is odd.

Take d odd, and let d':= \lfloor d/2\rfloor. Now consider any subset S of the d’-th layer \binom{[d]}{d'}. We now let the upper shadow of S be

\partial^+(S):= \{A\in \binom{[d]}{d'+1}\,:\, \exists B\in S, B\subset A\},

the sets in the (d’+1)-th layer which lie above some set in S. To apply Hall’s Marriage theorem, we have to show that |\partial^+(S)|\ge |S| for all choice of S.

We double-count the number of edges in the hypercube from S to \partial^+(S). Firstly, for every element B\in S, there are exactly d’ relevant edges. Secondly, for every element A\in\partial^+(S), there are exactly d’ edges to some element of \binom{[d]}{d'}, and so in particular there are at most d’ edges to elements of S. Thus

d' |S|=|\text{edges }S\leftrightarrow\partial^+(S)| \le d' |\partial^+(S)|,

which is exactly what we require for Hall’s MT. The argument for the matching between other layers is the same, with a bit more notation, but also more flexibility, since it isn’t a perfect matching.

The second proof looks at maximal chains. Recall, in this context, a maximal chain is a sequence \mathcal{C}=B_0\subset B_1\subset\ldots\subset B_d where each B_k:= \binom{[d]}{k}. We now consider some largest-possible antichain \mathcal{A}, and count how many maximal chains include an element A\in\mathcal{A}. If |A|=k, it’s easy to convince yourself that there are \binom{d}{r} such maximal chains. However, given A\ne A'\in\mathcal{A}, the set of maximal chains containing A and the set of maximal chains containing A’ are disjoint, since \mathcal{A} is an antichain. From this, we obtain

\sum_{A\in\mathcal{A}} \binom{d}{|A|} \le d!. (**)

Normally after a change of notation, so that we are counting the size of the intersection of the antichain with each layer, this is called the LYM inequality after Lubell, Yamamoto and Meshalkin. The heuristic is that the sum of the proportions of layers taken up by the antichain is at most one. This is essentially the same as earlier at (*). This argument can also be phrased probabilistically, by choosing a *random* maximal chain, and considering the probability that it intersects the proposed largest antichain, which is, naturally, at most one. Of course, the content is the same as this deterministic combinatorial argument.

Either way, from (**), the statement of Sperner’s theorem follows rapidly, since we know that \binom{d}{|A|}\le \binom{d}{\lfloor d/2\rfloor} for all A.

Largest antichain in the general grid

Instead of attempting a proof or even a digest of the argument in the general case, I’ll give a brief outline of why the previous arguments don’t transfer immediately. It’s pretty much the same reason for both approaches. In the hypercube, there is a lot of symmetry within each layer. Indeed, almost by definition, any vertex in the k-th layer can be obtained from any other vertex in the k-th layer just by permuting the labels (or permuting the coordinates if thinking as a vector).

The hypercube ‘looks the same’ from every vertex, but that is not true of the grid. Consider for clarity the n=8, d=3 case we discussed right at the beginning, and compare the scores (7,0,0) and (2,2,3). The number of maximal chains through (7,0,0) is \binom{14}{7}, while the number of maximal chains through (2,2,3) is \binom{7}{2, 2,3}\binom{14}{4,5,5}, and the latter is a lot larger, which means any attempt to use the second argument is going to be tricky, or at least require an extra layer of detail. Indeed, exactly the same problem arises when we try and use Hall’s condition to construct the optimal chain covering directly. In the double-counting section, it’s a lot more complicated than just multiplying by d’, as was the case in the middle of the hypercube.

Largest antichain in the d=3 grid

We can, however, do the d=3 case. As we will see, the main reason we can do the d=3 case is that the d=2 case is very tractable, and we have lots of choices for the chain coverings, and can choose one which is well-suited to the move to d=3. Indeed, when I set this problem to some students, an explicit listing of a maximal chain covering was the approach some of them went for, and the construction wasn’t too horrible to state.

[Another factor is that it computationally feasible to calculate the size of the middle layer, which is much more annoying in d>3.]

[I’m redefining the grid here as \{0,1,\ldots,n-1\}^d rather than \{1,2,\ldots,n\}^d.]

The case distinction between n even and n odd is going to make both the calculation and the argument annoying, so I’m only going to treat the even case, since n=8 was the original problem posed. I should be honest and confess that I haven’t checked the n odd case, but I assume it’s similar.

So when n is even, there are two middle layers namely \frac{3n}{2}-2, \frac{3n}{2}-1 (corresponding to total score 10 and total score eleven in the original problem). I calculated the number of element in the \frac{3n}{2}-1 layer by splitting based on the value of the first coordinate. I found it helpful to decompose the resulting sum as

\sum_{k=0}^{n-1} = \sum_{k=0}^{\frac{n}{2}-1} + \sum_{k=\frac{n}{2}}^{n-1},

based on whether there is an upper bound, or a lower bound on the value taken by the second coordinate. This is not very interesting, and I obtained the answer \frac{3n^2}{4}, and of course this is an integer, since n is even.

Now to show that any antichain has size at most \frac{3n^2}{4}. Here we use our good control on the chain coverings in the case d=2. We note that there is a chain covering of the (n,d=2) grid where the chains have 2n-1, 2n-3,…, 3, 1 elements (%). We get this by starting with a maximal chain, then taking a maximal chain on what remains etc. It’s pretty much the first thing you’re likely to try.

Consider an antichain with size A in the (n,d=3) grid, and project into the second and third coordinates. The image sets are distinct, because otherwise a non-trivial pre-image would be a chain. So we have A sets in the (n,d=2) grid. How many can be in each chain in the decomposition (%). Well, if there are more than n in any chain in (%), then two must have been mapped from elements of the (n,d=3) grid with the same first coordinate, and so satisfy a containment relation. So in fact there are at most n image points in any of the chains of (%). So we now have a bound of n^2. But of course, some of the chains in (%) have length less than n, so we are throwing away information. Indeed, the number of images points in a given chain is at most

\max(n,\text{length of chain}),

and so the number of image points in total is bounded by

n+\ldots+n+ (n-1)+(n-3)+\ldots+1,

where there are n/2 copies of n in the first half of the sum. Evaluating this sum gives \frac{3n^2}{4}, exactly as we wanted.

References

[1] – Bollobas, Leader (1991) – Compressions and Isoperimetric Inequalities. Available open-access here.

Advertisement

Independence and Association

Back when we did GCSE probability, we gave a definition of independent events as:

A and B are said to be independent if \mathbb{P}(A)\mathbb{P}(B)=\mathbb{P}(A\cap B).

We might also apply Bayes’ definition of conditional probability to say

\mathbb{P}(A|B)=\mathbb{P}(A)\quad\iff\quad A,B\text{ independent}\quad\iff\quad\mathbb{P}(B|A)=\mathbb{P}(B)

provided all the terms exist. (Eg the definition of \mathbb{P}(B|A) is at the very least non-obvious if the probability of A is 0.) In my opinion, this is a more naturally intuitive definition. For example, I think that when you toss two coins, the fact that the probability of the second coin being a tail is unaffected by whether the first is heads is more naturally ‘obvious’ than the fact that the joint probability of the two events is 1/4.

But, before getting too into anything philosophical, it is worth thinking about an equivalent situation for non-independent events. We remark that by an identical argument to above:

\mathbb{P}(A|B)\geq\mathbb{P}(A)\quad\iff\quad \mathbb{P}(A\cap B)\geq\mathbb{P}(A)\mathbb{P}(B)\quad\iff\quad\mathbb{P}(B|A)\geq\mathbb{P}(B)

Informally, this says that if we know A occurs, it increases the likelihood of B occuring. If we were talking about two random variables, we might say that they were positively correlated. But of course, by considering RVs 1_A,1_B, the result above is precisely the statement that the indicator functions have positive correlation.

Aim: To find a sufficient condition for positive correlation of random variables in a product measure.

Consider the following. Suppose A is an event which is positively correlated with the appearance of each edge. We might suspect that two such events A and B would be positively correlated. Instead, we consider a more concrete description. Recall that an event A is a subset of \Omega=\{0,1\}^E. Given w\in\Omega,e\in E, we say w^e\in\Omega defined by taking w and setting edge e to be open (note it may be open already). Now, we say event A is increasing, if

\forall w\in\Omega,\forall e\in E: w\in A\Rightarrow w^e\in A.

Note that this certainly implies the property previously mentioned, but the converse is not necessarily true.

Anyway, our revised aim will be to show that increasing events A and B are positively correlated for product measure.

For now, we approach the problem from the other direction, namely we attempt to find which measures on \{0,1\}^E have the property that A and B are positively correlated for all increasing A, B. Note that as before, we can think of this as \mathbb{E}1_A1_B\geq\mathbb{E}1_A\mathbb{E}1_B, and again here it is useful to rephrase our framework in terms of random variables. There is a natural (product) partial ordering of \Omega=\{0,1\}^E, and from this there is an easy notion of increasing random variables. Recall a random variable is defined as a measurable map \Omega\rightarrow\mathbb{R} so no further work is required.

X is increasing if w\geq w'\Rightarrow X(w)\geq X(w').

So we clarify our aim, which is to find a condition on the measure \mu such that \mu(XY)\geq \mu(X)\mu(Y) for all increasing X, Y. When this occurs, we say \mu is positively associated. Note that this is equivalent to \mu(A\cap B)\geq \mu(A)\mu(B) for all increasing events A, B. Why? We can build up X and Y from increasing indicator functions like \{X\geq x\} in a usual monotone class argument.

On the way, we need a partial ordering on the set of probability measures. Obviously, if \mu(A)\leq \nu(A) for all events A, then in fact \mu=\nu! So instead we say \mu\leq_{st}\nu if \mu(A)\leq \nu(A) for all increasing A. This is called the stochastic ordering, and there is a technical result of Strassen, proving the intuitively obvious claim that if \mu_1\leq \mu_2, then we can couple the measures in a natural way. Formally:

Theorem: \mu_1\leq\mu_2 \iff \exists a probability measure \nu on \Omega^2 such that the marginals are \mu_1,\mu_2 and

\nu(\{(w_1,w_2):w_1\leq w_2\})=1.

Our main result will be the FKG inequality which asserts that when \mu satisfies the following FKG lattice property

\mu(w_1\vee w_2)\mu(w_1\wedge w_2)\geq \mu(w_1)\mu(w_2),\quad\forall w_1,w_2\in\Omega

then \mu is positively associated. We will prove the case |E|<\infty.

We proceed by showing that \mu_1\leq\mu_2\propto Y\mu_1, rescaled, for Y an increasing RV. [Note that we are now suppressing the ‘st’ subscript, as context makes the use clear.]

To show this, we prove the more general Holley’s Theorem:

This states that if two positive probability measures satisfy a related lattice condition:

\mu_2(w_1\vee w_2)\mu_1(w_1\wedge w_2)\geq \mu_1(w_1)\mu_2(w_2)\quad\forall w_1,w_2\in\Omega

then we have the stochastic domination result: \mu_1\leq \mu_2.

Note that the lattice condition states, very informally, that adding edges results in a greater relative increase with respect to the measure \mu_2, which has a natural similarity to the definition of stochastic domination.

We prove this, perhaps unexpectedly, by resorting to a Markov chain. We note that there is a Markov chain on \Omega with equilibrium distribution given by \mu_1. This is simple: the non-zero transition rates are those given by the addition or removal of a single edge. Assume that edges are added at unit rate, and that edges are removed with rate: G(w^e,w_e)=\frac{\mu_1(w_e)}{\mu_1(w^e)}.

Similarly, we can construct a Markov chain on state space \Omega^2, where non-zero transitions are given by the addition of an edge to both states in the pair, the removal of an edge from both states in the pair, and the removal of an edge from only the first edge in the pair. Note that, as before, we may be ‘adding’ an edge which is already present. Assuming we start in this set, this choice means that we are restricting the sample space to \{(\pi,w):\pi\leq w\}. We need the transition rate of the third type of transition to have the form: \frac{\mu_1(\pi_e)}{\mu_1(\pi^e)}-\frac{\mu_2(w_e)}{\mu_2(w^e)}. So the lattice condition precisely confirms that this is non-negative, and thus we have a well-constructed Markov chain. The marginals have equilibrium distributions \mu_1,\mu_2 by construction, and by the general theory of Markov chains, there is an equilibrium distribution, and this leaves us in precisely the right position to apply Strassen to conclude the result.#

Summary of consequences: We have demonstrated that product measure is positively associated, as it certainly satisfies the FKG condition. Recall that this is what we had suspected intuitively for reasons given at the start of this account. Next time, I will talk about the most natural companion result, the BK inequality, and the stronger Reimer’s Inequality.

References: Both the motivation and the material is derived from Prof. Grimmett’s Part III course, Percolation and Related Topics, which was one of the mathematical highlights of the year. This account of the subject is a paraphrase of his lecture notes, which were themselves based on his book Probability on Graphs. Mistakes, naturally, are mine. Background on the course, and an online source of the book can be found on the course website here.

SLE Revision 4: The Gaussian Free Field and SLE4

I couldn’t resist breaking the order of my revision notes in order that the title might be self-referential. Anyway, it’s the night before my exam on Conformal Invariance and Randomness, and I’m practising writing this in case of an essay question about the Gaussian Free Field and its relation to the SLE objects discussed in the course.

What is a Gaussian Free Field?

The most natural definition is too technical for this context. Instead, recall that we could very informally consider a Poisson random measure to have the form of a series of Poisson random variables placed at each point in the domain, weighted infinitissimely so that the integrals over an area give a Poisson random variable with mean proportional to the measure of the area, and so that different areas are independent. Here we do a similar thing only for infinitesimal centred Gaussians. We have to specify the covariance structure.

We define the Green’s function on a domain D, which has a resonance with PDE theory, by:

G_D(x,y)=\lim_{\epsilon\rightarrow 0}\mathbb{E}[\text{time spent in }B(y,\epsilon)\text{ by BM started at }x\text{ stopped at }T_D]

We want the covariance structure of the hypothetical infinitesimal Gaussians to be given by \mathbb{E}(g(x)g(y))=G_D(x,y). So formally, we define (\Gamma(A),A\subset D) for open A, by (\Gamma(A_1),\ldots,\Gamma(A_n)) a centred Gaussian RV with covariance \mathbb{E}(\Gamma(A_1)\Gamma(A_2))=\int_{A_1\times A_2}dxdyG_D(x,y).

The good news is that we have a nice expression G_U(0,x)=\log\frac{1}{|x|}, and the Green’s functions are conformally invariant in the sense that G_{\phi(D)}(\phi(x),\phi(y))=G_D(x,y), following directly for conformality of Brownian Motion.

The bad news is that the existence is not clear. The motivation for this is the following though. We have a so-called excursion measure for BMs in a domain D. There isn’t time to discuss this now: it is infinite, and invariant under translations of the boundary (assuming the boundary is \mathbb{R}\subset \bar{\mathbb{H}}, which is fine after taking a conformal map). Then take a Poisson Point Process on the set of Brownian excursions with this measure. Now define a function f on the boundary of the domain dD, and define \Gamma_f(A) to be the sum of the values of f at the starting point of BMs in the PPP passing through A, weighted by the time spent in A. We have a universality relation given by the central limit theorem: if we define h to be (in a point limit) the expected value of this variable, and we take n independent copies, we have:

\frac{1}{\sqrt{n}}\left['\Gamma_f^1(A)+\ldots+\Gamma_f^n(A)-n\int_Ah(x)dx\right]\rightarrow\Gamma(A)

where this limiting random variable is Gaussian.

For now though, we assume existence without full proof.

SLE_4

We consider chordal SLE_k, which has the form of a curve \gamma[0,\infty] from 0 to \infty in H. The g_t the regularising function as normal, consider \tilde{X}_t=X_t-W_t:=g_t(x)-\sqrt{\kappa}\beta_t for some fixed x. We are interested in the  evolution of the function arg x. Note that conditional on the (almost sure for K<=4) event that x does not lie on the curve, arg x will converge either to 0 or pi almost surely, depending on whether the curve passes to the left or the right (respectively) of x.

By Loewner’s DE for the upper half-plane and Ito’s formula:

d\bar{X}_t=\sqrt{\kappa}d\beta_t,\quad d\log\bar{X}_t=(2-\frac{\kappa}{2})\frac{dt}{\bar{X}_t^2}+\frac{\sqrt{\kappa}}{\bar{X}_t}d\beta_t

So, when K=4, the dt terms vanish, which gives that log X is a local martingale, and so

d\theta_t=\Im(\frac{2}{\bar{X}_t}d\beta_t

is a true martingale since it is bounded. Note that

\theta_t=\mathbb{E}[\pi1(x\text{ on right of }\gamma)|\mathcal{F}_t]

Note that also:

\mathbb{P}(\text{BM started at }x\text{ hits }\gamma[0,t]\cup\mathbb{R}\text{ to the right of }\gamma(t)|\gamma[0,t])=\frac{\theta_t}{\pi} also.

SLE_4 and the Gaussian Free Field on H

We will show that this chordal SLE_4 induces a conformal Markov type of property in Gaussian Free Fields constructed on the slit-domain. Precisely, we will show that if \Gamma_T is a GFF on H_T=\mathbb{H}\backslash\gamma[0,T], then \Gamma_T+ch_T(\cdot)=\Gamma_0+ch_0(\cdot), where c is a constant to be determined, and h_t(x)=\theta_t(x) in keeping with the lecturer’s notation!

It will suffice to check that for all fixed p with compact support \Gamma_T(p)+c(h_T(p)-h_0(p)) is a centred Gaussian with variance \int dxdyG_H(x,y)p(x)p(y).

First, applying Ito and conformal invariance of the Green’s functions under the maps g_t,

dG_{H_t}(x,y)=cd[h(x),h(y)]_t

The details are not particularly illuminating, but exploit the fact that Green’s function on H has a reasonable nice form \log\left|\frac{x-\bar{y}}{x-y}\right|. We are also being extremely lax with constants, but we have plenty of freedom there.

After applying Ito and some (for now unjustified) Fubini:

dh_t(p)=\left(\int c.p(x)\Im(\frac{1}{\bar{X}_t})dx\right)d\beta_t

and so as we would have suspected (since h(x) was), this is a local martingale. We now deploy Dubins-Schwarz:

h_T(p)-h_T(0)\stackrel{d}{=}B_{\sigma(T)} for B an independent BM and

\sigma(T)=\int_0^Tdt(\int c.p(x)\Im(\frac{1}{\tilde{X}_t})dx)^2

So conditional on (h_T(p),t\in[0,T]), we want to make up the difference to \Gamma_0. Add to h_T(p)-h_0(p) an independent random variable distribution as N(0,s-\sigma(T)), where

s=\int dxdyp(x)p(y)G(x,y)\quad =:\Gamma_0(p)

Then

s-\sigma(T)=\int p(x)p(y)[G(x,y)-c\int_0^Tdt\Im(\frac{1}{X_t})\Im(\frac{1}{Y_t})]dxdy=\int p(x)p(y)G_t(x,y)dxdy as desired.

Why is this important?

This is important, or at least interesting, because we can use it to reverse engineer the SLE. Informally, we let T\rightarrow\infty in the previous result. This states that taking a GFF in the domain left by removing the whole of the SLE curve (whatever that means) then adding \pi at points on the left of the curve, which is the limit \lim_T h_T is the same as a normal GFF on the upper half plane added to the argument function. It is reasonable to conjecture that a GFF in a non-connected domain has the same structure as taking independent GFFs in each component, and this gives an interesting invariance condition on GFFs. It can also be observed (Schramm-Sheffield) that SLE_4 arises by reversing the argument – take an appropriate conditioned GFF on H and look for the interface between it being ‘large’ and ‘small’ (Obviously this is a ludicrous simplification). This interface is then, under a suitable limit, SLE_4.

SDEs and L-Diffusions

The motivation for the development of differential calculus by Newton et al. was to enable us to deduce concrete properties of, say, particle motion defined implicitly through ODEs. And we proceed similarly for the stochastic differential. Having defined all the terms through Ito’s formula, and concluded that BM is in some sense the canonical stochastic process, we seek to solve so-called stochastic differential equations of the form:

dX_t=b(X_t)dt+\sigma(X_t)dB_t

While there is no reason not to consider processes in \mathbb{R}^d, it is reasonable interesting to consider processes in one dimension. As with normal ODEs and PDEs, we have some intuitive notion if we specify some initial conditions, we should be able to set the differential equation up and ‘let it go’ like a functional clockwork mouse. Of course, we are conscious of the potential problems with uniqueness of solutions, stability of solutions, and general mathematical awkwardness that derives from the fact that we can’t treat all DEs as physical systems, with all the luxuries of definiteness that the physical world automatically affords. To establish some concreteness, we set up some definitions.

  • For a solution to the SDE, E(\sigma,b), we require a nice filtration \mathcal{F} and a BM adapted to that filtration to drive the process X_t, which satisfies X_t=X_0+\int_0^t\sigma(X_s)dB_s+\int_0^tb(X_s)ds, and we require this for each x_0\in\mathbb{R}^d s.t. X_0=x_0 a.s.
  • Uniqueness in law: all solutions to E(\sigma,b) starting from each x have the same law. Obviously, this places no restriction on the underlying probability space and filtration.
  • A stronger condition is Pathwise uniqueness: Given the filtration, solutions are almost surely indistinguishable (that is, paths are equal everywhere).
  • We have not specified any conditions on the filtration \mathcal{F}. It would be natural to consider only the minimal such filtration that works. If we can take \mathcal{F}=\mathcal{F}^B, the natural filtration of the driving BM, we say the solutions is strong. If every solution is strong, then we have pathwise uniqueness, otherwise we would have a solution where we could choose which path to follow independently of the BM.

The key theorem here is Yamada-Watanabe: If there exist solutions and we have pathwise uniqueness, then we have uniqueness in law. Then for every (\mathcal{F}_t), and \mathcal{F}_t-BM, the unique solution is strong.

Existence of solutions is particularly tractable when \sigma,b are Lipschitz, as this opens the way for implicit constructions as the fixed points of contracting mappings. We make particular use of Gronwall’s lemma, which confirms an intuitive thought that differential inequalities have solutions bounded by solutions to the corresponding ODE. Concretely, for $latex f\geq 0,\,f(t)\leq a+b\int_0^tf(s)ds,\quad 0\leq t\leq T$, the lemma states that f(t)\leq a\exp(bt). The case a=0 is obviously of particular interest for demonstrating convergence results. We deploy this method to show that when \sigma,b are Lipschitz, the SDE dX_t=\sigma(X_t)dB_t+b(X_t)dt has pathwise uniqueness and for any triple of filtration (\mathcal{F}_t), \mathcal{F}_t-adapted BM, and starting point x, there is a strong solution. Uniqueness in law then follows by Yamada-Watanabe, but we knew this anyway by composing measurable maps.

Now, given L, an operator on C^2 functions by:

Lf(x)=\frac12\sum_{i,j}a_{i,j}(x)\frac{\partial^2 f}{\partial x^i\partial x^j}+\sum_i b_i(x)\frac{\partial f}{\partial x^i}

We define X to be an L-diffusion if X’s local behaviour is specified (in distribution) by L(X). The first sum in the expression for L corresponds to diffusivity, while the second corresponds to (deterministic) drift. Formally, for a, b, bounded X_t a L-diffusion is \forall f\in C_b^2:

M_t^f:=f(X_t)-f(X_0)-\int_0^t Lf(X_s)ds is a martingale.

Alternatively, can relax boundedness condition, and require M_t^f\in\mathcal{M}_{c,loc}. To make a link to SDEs, define a=\sigma\sigma^T (so in one dimension a=\sqrt{\sigma}), then solutions to dX_t=\sigma(X_t)dB_t+b(X_t)dt are L-diffusions if boundedness conditions are met. Remember bounded implies Lipschitz implies solutions to SDEs. The result then follows directly from Ito’s formula.

Dubins-Schwarz Theorem

In developing the stochastic integral, much of our motivation has come from considering integrals with respect to Brownian Motion. In this section, we develop some results which justify that Brownian Motion is the canonical stochastic process with non-zero quadratic variation (which is related, but not directly equivalent to the property of infinite total variation). In particular, we shall observe the Dubins-Schwarz theorem, which shows that martingales with unbounded (as time \rightarrow\infty) quadratic variation ARE Brownian Motion, up to a (stochastic) time change.

Recall Levy’s characterisation of a d-dimensional BM, which allows us to avoid considering independent normal increments. Given X^1,\ldots,X^d\in\mathcal{M}_{c,loc}:

X=(X^1,\ldots,X^d) a BM iff [X^i,X^j]_t=\delta_{ij}t

Obviously, one direction has been shown as part of the construction and properties of quadratic variation. For the other direction,, because laws are precisely defined by characteristic functions, it suffices to show that

\mathbb{E}\left[\exp(i\langle \theta,X_t-X_s\rangle)|\mathcal{F}_s\right]=\exp(-\frac12||\theta||^2(t-s))

We set Y_t:=\langle \theta,X_t\rangle, and deduce [Y]=t||\theta||^2 and Z=\mathcal{E}(iY)=\exp(iY_t+\frac12[Y]_t)\in\mathcal{M}_{c,loc}, and furthermore is bounded on compact [0,t], hence is a true martingale. So \mathbb{E}\left(\frac{Z_t}{Z_s}|\mathcal{F}_s\right)=1 which is pretty much what was required.

Now, Dubins-Schwarz states

Theorem: Given M\in\mathcal{M}_{c,loc}, M_0=0, [M]_\infty=\infty almost surely, if we set \tau_s:=\inf\{t:[M]_t>s\}, then B_s:=M_{\tau_s} is a (\mathcal{F}_{\tau_s})-BM, with M_t=B_{[M]_t}.

This final result is clear if [M]_t is almost surely strictly increasing in t: just take s=[M]_t in the definition.

We know B is cadlag: we first show B as defined is almost surely continuous. It remains to show B_{s-}=B_s\,\forall s>0\iff M_{\tau_{s-}}=M_{\tau_s}, noting that \tau_{s-}=\inf\{t\geq 0:[M]_t=s\} (by continuity) is a stopping time also.

The only interesting case is if \tau_{s-}<\tau_s, for which need to show [M] is constant. This is intuitively obvious, but formally, we must appeal to (M^2-[M])^{\tau_s} which is UI, since \mathbb{E}[M^{\tau_s}]_\infty<\infty. Now may apply OST to obtain \mathbb{E}[M_{\tau_s}^2-M_{\tau_{s-}}^2|\mathcal{F}_{\tau_{s-}}]=\mathbb{E}[(M_{\tau_s}-M_{\tau_{s-}})^2|\mathcal{F}_{\tau_{s-}}]=0 which implies M is almost surely constant on [\tau_{s-},\tau_s]. We need to lift this to the case where it holds for all s simultaneously almost surely. Note that cadlag almost surely plus almost surely continuous at each point does not implies almost surely continuous everywhere (eg consider H(U(0,1)) with H the Heaviside function and U a uniform distribution). Instead, we record intervals of constancy of both M_t,[M]_t. That is, we set

T_r=\inf\{t>r:M_t\neq M_r\},\quad S_r=\inf\{t>r:[M]_t\neq [M]_r\}

Then these are cadlag, and by above T_r=S_r\,\forall r\in\mathbb{Q}^+ a.s. therefore T_r=S_r\,\forall r almost surely. Thus M, [M] are constant on the same intervals.

We also check B is adapted to \mathcal{G}_t=\mathcal{F}_{\tau_t}. STP X_T1_{\{T<\infty\}} is \mathcal{F}_T-measurable for X cadlag adapted. Approximating T discretely from above gives the result, exploiting that the result is clear if T has countable support. Now, obtain M^{\tau_s}\in\mathcal{M}_c^2, so M_{t\wedge \tau_s} UI by Doob, so by OST, get \mathbb{E}[M_{\tau_s}|\mathcal{F}_{\tau_s}]=M_{\tau_r}, to get B a martingale. The finally:

\mathbb{E}[B_s^2-s|\mathcal{G}_r]=\mathbb{E}[(M^2-[M])_{\tau_s}|\mathcal{F}_{\tau_s}]=M_{\tau_r}^2-[M]_{\tau_r}=B_r^2-r

And so we can apply Levy’s characterisation to finish the result.

Working towards the Ito Isometry

We have defined the stochastic integral (H\cdot A)_t for a finite variation, continuous adapted process A, and H a previsible process. We want to extend this to integrals with respect to continuous local martingales. Analogous to defining the Lebesgue integral by extending linearly from indicator functions, we define integrals of simple processes with respect to martingales. A simple process is a disjointly (time-)supported linear combination of the canonical previsible process Z_s1_{(s,t]}. It will be convenient to demand that the variables of integration are in \mathcal{M}^2. Then the space \mathcal{M}^2 is preserved under this integral operation, that is H\cdot M\in\mathcal{M}^2, and in particular

\mathbb{E}(H\cdot M)_\infty^2\leq ||H||_\infty^2\mathbb{E}(M_\infty-M_0)^2

By Doob’s L^2 inequality, X\in\mathcal{M}^2\Rightarrow ||\sup_t|X_t|||_2<\infty, so it makes sense to define \mathcal{C}^2 as the set of cadlag adapted processes s.t. |||X|||:=||\sup_t|X_t|||<\infty. Of particular interest is that \mathcal{C}^2 is complete, by finding limits for subsequences and lifting to the whole sequence by Fatou.

We aim to construct integrals over \mathcal{M}_{c,loc} by taking limits of integrals over \mathcal{M}_c^2. For this space, we have that M^2-[M] is not just \in\mathcal{M}_{c,loc} but in fact a UI true martingale. Then, as a natural extension of the orthogonality trick (remembering that M^2-[M]\in\mathcal{M}_{c,loc}), we obtain

\mathbb{E}[M]_\infty=\mathbb{E}(M_\infty-M_0)^2

We then define for M\in\mathcal{M}_c^2 the norm ||\cdot||_M by ||H||_M^2:=\mathbb{E}\left(\int_0^tH_s^2d[M]_s\right) with L^2(M) the space of H s.t. that this is finite. Then for H\in \mathcal{S}, we have the Ito isometry:

I: (L^2(M),||\cdot||_M)\rightarrow(M_c^2,||\cdot||) with I(H)=H\cdot M

Now, we also have S dense in L^2(M). Why? We know linear combinations of 1_A, A\in\mathcal{P} are dense, so it STP A\in\mathcal{P}\Rightarrow A\in\bar{S}. Use a typical Dynkin’s lemma argument on \mathcal{A}=\{A\in\mathcal{P}:1_A\in \bar{S}\}, which contains \pi-system generating \mathcal{P}. So extend to L^2(M) generally. We recover the original motivating result that [H\cdot M]=H^2\cdot[M]. Take a sequence of stopping times reducing both H and M to force boundedness of integrand, and M\in\mathcal{M}_c^2. Stopping the integral is equivalent to stopping the integrand, and checking limits in stopping times allows us to lift the Ito Isometry result to this one about quadratic variations.

Brownian Motion is not finite variation

There is a natural definition of ‘pathwise’ stochastic integrals of a certain type of ‘simple’ process with respect to cadlag non-decreasing processes. It can be a shown that a function is of finite variation iff it can be expressed as the difference of two such functions. Hence, these finite variation processes can be used as variable of integration via an obvious linear extension. One direction of this result is obvious; the other is fiddly. To proceed, we show that the total valuation process is cadlag (and, obviously, increasing), and then check that a'=\frac12(v+a),a''=\frac12(v-a) are processes satisfying the conditions of the result.

Our overall aim is to define integrals with respect to Brownian Motion since that is (in a sense to be made precise through the Dubins-Schwarz theorem later) the canonical non-trivial stochastic process with non-zero quadratic variation. The result we demonstrate shows that it is not possible to define the integral with respect to BM through pathwise finite variation integrals.

Theorem: M\in\mathcal{M}_{c,loc},M_0=0 a.s. is of finite variation. Then M is indistinguishable from 0.

We will show this for M a bounded martingale with bounded variation. Why does this suffice? In general, set S_n:=\inf\{t,V_t\leq n\}, noting that V is continuous adapted non-decreasing. If M^{S_n}\equiv 0\,\forall n, then we are done, as the S_ns are increasing. But this is a bounded martingale with bounded variation.

To prove this, we make use of the orthogonality relation which is a key trick for this sort of result: If M is a martingale, with M_s,M_t\in L^2, for s<t, then just by multiplying out:

\mathbb{E}[(M_t-M_s)^2|\mathcal{F}_s]=\mathbb{E}[M_t^2-M_s^2|\mathcal{F}_s] a.s.

Now, for this particular result, we decompose

\mathbb{E}[M_t^2]=\mathbb{E}\left[\sum_{k=0}^{2^n-1}(M_{(k+1)2^{-n}t}^2-M_{k2^{-n}t}^2)\right]=\mathbb{E}[\sum (M_{(k+1)2^{-n}t}-M_{k2^{-n}t})^2]

and then we bound this last term as

\leq \mathbb{E}\left[\sup_k [M_+-M_-]\sum_k |M_+-M_-|\right]

Now, as n\uparrow\infty, we have \sum_k |M_+-M_-|\uparrow V_t\leq N by the boundedness assumption. Furthermore, M is almost surely continuous on [0,t] and so it is in fact uniformly continuous, which allows us to conclude that

\sup_k |M_+-M_-|\downarrow 0

By bounded convergence, this limit applies equally under the expectation, and so conclude that \mathbb{E}M_t^2=0 for each time t, and so for each time t the martingale is almost surely equal to 0. In the usual, can lift this to rational points by countability, then to all points by continuity.

Loss Networks and Erlang’s Fixed Point

Loss Networks

In Erlang’s telephone line model discussed in the previous post, we considered users competing for a finite set of resources. When insufficient resources are available, the call is lost. A loss network generalises this situation to more complicated resource configurations. We think of links 1, …, J, each with some integer capacity c_j. Each incoming call requires 1 units of each link in some subset of the set of links, and lasts for a time distributed as an exponential random variable with parameter 1, independent of everything else in the model. We call this subset the route, and denote by A_{jr} the incidence matrix of links on routes. Calls arrive as PP(\nu_r)$ independently for each route: no queueing occurs – a call is lost if some link required is operating at full capacity. We call the probability of this event L_r, the loss probability. Observe that (n_r), the number of calls on each route r, is a Markov chain on the truncated space \{An\leq c\}.

By checking the DBEs, it is clear that an ED for this Markov chain is proportional to the ED for the MC without the capacity constraint, with state-space restricted to the truncated space. But without capacity constraints, the system is a linear migration process, for which we discovered the form of the ED in the previous section. If we write H(c)=\mathbb{P}(An\leq c) in the linear migration process, we can compute the acceptance probability for the finite capacity system as:

1-L_r=\frac{H(C-Ae_r)}{H(C)}

Approximating Blocking Probabilities

We want to calculate B_j, the equilibrium blocking probability, that a given link j is full. We have two methods: firstly, to find the distribution for (n_r) with maximum probability, for which the blocking probabilities appear as shadow prices. And secondly, to make a reasonable approximation about blocking independence, and solve explicitly. We want to show that these methods give the same answers.

To maximise the probability \pi(n)\propto \prod_r \frac{\nu_r^{n_r}}{n_r!} on \{An\leq c\}, we take logs and maximise using Stirling’s approximation, which is reasonable as we are implicitly working under a regime where the throughput tends to infinity while preserving ratios.

The primal problem is

\max\quad \sum_r(x_r\log \nu_r-x_r\log x_r+x_r),\quad\text{s.t. }Ax\leq c

which has Lagrangian

L(x,y,z)=\sum_r x_r+\sum_r x_r(\log \nu_r-\log x_r-\sum_j y_jA_{jr})+\sum_j y_jc_j-\sum_j y_jc_j

We observe that complementary slackness here has the form y.z=0, and remember that by Strong Duality, which applies here because everything relevant is convex, this equality holds at the primal optimum. Differentiating the Lagrangian at the optimum allows us to specify the optimal x in terms of y:

\bar{x}_r=\nu_r e^{-\sum y_jA_{jr}}

The dual problem is then to minimise

\min\quad \sum_r \nu_re^{-\sum_jy_jA_{jr}}+\sum_j y_jc_j

At this point, we make the suggestive substitution e^{-y_j}=1-B_j, observing that this gives B non-negative by default since y is non-negative. After further work, we will deduce that these B do indeed have a sensible interpretation as blocking probabilities, but it should be stressed that this is in no way obvious yet. Now complementary slackness asserts:

\sum_rA_{jr}\nu_r\prod_i(1-B_i)^{A_{ir}}\left\{\begin{array}{l l}=c_j& \quad B_j>0\\ \leq c_j & \quad B_j=0\\ \end{array} \right.

Note that the primal objective function is strictly convex so \bar{x} as discussed is the unique optimum. The dual objective is strictly convex in yA, so if A has full rank J, this induces a unique optimum in terms of y. We assume A is full rank (since for example we can perturb slightly) and that there is no degeneracy in the blocking.

Now we consider a sequence of networks with proportionally increasing arrival rates and capacities Continue reading

Elastic Networks and TCP

Now, suppose each route is given a rate x_r, and there is a utility function U_r(x_r). Links have capacities as before, so seek to solve the optimisation problem

\max\quad \sum_r U_r(x_r)\quad\text{s.t.}\quad Ax\leq c

This is called the SYSTEM problem. However, the system might not know the utility functions U, or the incidence matrix A. We can alternatively consider the USER problem, where for a given route we assume the cost of the rate is linear, and maximise (utility – cost). That is:

\max\quad U_r(\frac{w_r}{\lambda_r})-w_r,\quad w_r\geq 0

where the constant of linearity has been absorbed into the utility function. For the NETWORK problem, we essentially repeat the system problem, only under the assumption that utility functions are w_r\log x_r for fixed w. That is

\max\quad \sum_r w_r\log x_r,\quad\text{s.t.}\quad Ax\leq c

These optimisation problems are linked formally by the following theorem:

Theorem: We can find \lambda,w,x that solve USER and NETWORK, for which w_r=\lambda_rx_r. Furthermore, x then solves SYSTEM.

To prove this result, write down SYSTEM’s Lagrangian L(x,z,\mu)=\sum_r U_r(x_r)+\mu^T(c-Ax-z). The Lagrangian of NETWORK is similar, and at the optimum of USER, these Lagrangians are in fact the same, with \mu_j,\lambda_r shadow prices for resources and routes respectively.

We discuss two regimes for fairness of allocations:

1) (x_r) is max-min fair if it is a feasible flow, and if the smallest flow component is as large as possible. That is, if y is another feasible flow, then \exists r such that y_r>x_r\Rightarrow\exists s: y_s<x_s\leq x_r, ie you can’t increase a flow component without taking away from some other component which is smaller to begin with. Such a vector exists if the feasible region is convex, which in all practical examples it ought to be.

2) (x_r) is proportionally fair if it is feasible, and if for any other feasible vector, the change in the sum, weighted by x is non-positive, that is:

\sum_r \frac{x_r-y_r}{x_r}\leq 0

Alternatively, can think of this as saying that the sum of the percentage changes in each component is non-positive. We say the distribution is weighted proportionally fair with respect to w if

\sum_r w_r\frac{x_r-y_r}{x_r}\leq 0

Then it can be seen by considering behaviour in a neighbourhood of the optimum of the Lagrangian, that x solves NETWORK iff it weighted proportionally fair.

TCP and Lyapunov functions

Assume rates are controlled by the ODE

\frac{d}{dt}x_r(t)=k_r(w_r-x_r(t)\sum_{j\in r}\mu_j(t))

where \mu_j(t)=p_j(\sum_{s\ni j}x_s(t)). The motivation is that the increase in flow is linear (we increase it whenever it appears possible) and the decrease is multiplicative (that is, after negative feedback – a sign of failure to transmit a packet – the flow is halved). We seek a so-called Lyapunov function, which is a function of the system monotone along trajectories. In this case, a suitable example is

U(x)=\sum w_r\log x_r -\sum_j \int_0^{\sum_{s\ni j}x_s}p_j(y)dy

Such a function is differentiable, and has a unique maximum. Bounding the derivative away from 0 ensures that the maximum is actually attained in every case.

Effective Bandwidth

Here, devices have fixed capacity, but packet sizes are random. So, we still have a capacity constraint for the links, but we accept that it won’t be possible to ensure that we stay within those limits all the time, and seek instead to minimise the probability that the limits are exceeded, while keeping throughput as high as possible.

An important result is Chernoff’s Bound: \mathbb{P}(Y\geq 0)\leq \inf_{s\geq 0}\mathbb{E}e^{sY}. The proof is very straightforward: apply Markov’s inequality to the non-negative random variable e^{SY}. So in particular \frac{1}{n}\log\mathbb{P}(X_1+\ldots+X_n\geq 0)\leq \inf M(s), where M(s)=\log\mathbb{E}e^{sX}, and Cramer’s Theorem asserts that after taking a limit in n on the LHS, equality holds, provided \mathbb{E}X<0,\mathbb{P}(X>0)>0.

We assume that the traffic has the form S=\sum_{j=1}^J\sum_{i=1}^{n_j}X_{ji}, where these summands are iid, interpreted as one of the n_j loads used on source j. We have

\log\mathbb{P}(S>c)\leq\log \mathbb{E}[e^{s(S-C)}]=\sum_{j=1}^Jn_jM_j(s)-sC

so \inf(\sum n_jM_j(s)-sC)\leq -\gamma\quad\Rightarrow\quad \mathbb{P}(s\geq C)\leq e^{-\gamma}

so we want this to hold for large \gamma.

We might then choose to restrict attention to

A=\{n:\sum n_jM_j-sC\leq-\gamma,\text{ some }s\geq 0\}

So, when operating near capacity, say with call profile n* on (ie near) the boundary of A, with s* the argmin of the above. Then the tangent plane is \sum n_jM_j(s^*)-s^*C=-\gamma, and since A’s complement is convex, it suffices to stay on the ‘correct’ side (ie halfspace) of this tangent plane.

We can rewrite as \sum n_jM_j(S^*)\leq C-\frac{\gamma}{s^*}. Note that this is reasonable since s* is fixed, and we call \frac{M_j(s)}{s}=:\alpha_j(s), the effective bandwidth. It is with respect to this average that we are bounding probabilities, hence ‘effective’.

Observe that \alpha_j(s) is increasing by Jensen as (\mathbb{E}e^X)^t\leq \mathbb{E}e^{tX} for t>1 implies that for t>s, (\mathbb{E}e^{sX})^t\leq(\mathbb{E}e^{tX})^s.

In particular,

\mathbb{E}X\leq \alpha_j(s)\leq \text{ess sup}X