BMO2 2018

The second round of the British Mathematical Olympiad was taken yesterday by the 100 or so top scoring eligible participants from the first round, as well as some open entries. Qualifying for BMO2 is worth celebrating in its own right. The goal of the setters is to find the sweet spot of difficult but stimulating for the eligible participants, which ultimately means it’s likely to be the most challenging exam many of the candidates sit while in high school, at least in mathematics.

I know that lots of students view BMO2 as something actively worth preparing for. As with everything, this is a good attitude in moderation. Part of the reason for writing about the questions at such length (and in particular not just presenting direct solutions) is that I think at this level it’s particularly easy to devote more time than needed to preparation, and to use it poorly.

All these questions could be solved by able children. In fact, each could be solved by able children in less than an hour. You definitely count as an able child if you qualified or if your teacher allowed you to make an open entry! Others count too, naturally. But most candidates won’t in fact solve all the questions, and many won’t solve any. And I think candidates often come up with the wrong reasons why they didn’t solve problems. “I didn’t know the right theorems” is very very rarely the reason. Olympiad problems have standard themes and recurring tropes, but the task is not to look at the problem and decide that it is an example of Olympiad technique #371. The task is actually to have as many ideas as possible, and eliminate the ones that don’t work as quickly as possible.

The best way to realise that an idea works is to solve the problem immediately. For the majority of occasions when we’re not lucky enough for that to happen, the second-best way to realise that an idea works is to see that it makes the problem look a bit more like something familiar. Conversely, the best way to realise that an idea doesn’t work is to observe that if it worked it would solve a stronger but false problem too. (Eg Fermat’s Last Theorem *does* have solutions over the reals…) The second-best way to realise that an idea doesn’t work is to have the confidence that you’ve tried it enough and you’ve only made the problem harder, or less familiar.

Both of these second-best ideas do require a bit of experience, but I will try to explain why none of the ideas I needed for various solutions this year required any knowledge beyond the school syllabus, some similarities to recent BMOs, and a small bit of creativity.

As usual, the caveat that these are not really solutions, and certainly not official solutions, but they are close enough to spoil the problems for anyone who hasn’t tried them by themselves already. Of course, the copyright for the problems is held by BMOS, and they are reproduced here with permission.

Question One

I wrote this question. Perhaps as a focal point of the renaissance of my interest in geometry, or at least my interest in teaching geometry, I have quite a lot to say about the problem, its solutions, its origin story, the use of directed angles, the non-use of coordinate methods and so on. In an ideal world I would write a book about this sort of thing, but for now, a long and separate post is the answer.

This will be available once I’ve successfully de-flooded my apartment.

Question Two

I also wrote this problem, though I feel it’s only fair to show the version I submitted to the BMO committee. All the credit for the magical statement that appears above lies with them. There is a less magical origin story as well, but hopefully with some interesting combinatorial probability, which is postponed until the end of this post.

One quick observation is that in my version Joe / Hatter gets to keep going forever. As we shall see, all the business happens in the first N steps, but a priori one doesn’t know that, and in my version it forces you to strategise slightly differently for Neel / Alice. In the competition version, we know Alice is done as soon as she visits a place for a second time, but not in the original. So in the original we only have to consider ‘avoid one place’ rather than the multiple possibilities now of ‘avoid one place’ or ‘visit a place again’.

But I think the best idea is to get Alice to avoid one particular place c\not\equiv 0 whenever possible. At all times she has two possible options for where to go next, let’s say b_k+a_k, b_k-a_k in the language of the original statement. We lose nothing by assuming -N/2 < a_k\le N/2, and certainly it would be ridiculous for Joe / Hatter ever to choose a_k=0. The only time Alice’s strategy doesn’t work is when both of these are congruent to c, which implies N\,|\, 2a_k, and thus we must have N= 2a_k. In other words, Alice’s strategy will always work if N is odd.

I think it’s really worth noticing that the previous argument is weak. We certainly did not show that N must be odd for Alice to win. We showed that Alice can avoid a congruence class modulo an odd integer. We didn’t really need that odd integer to be N for this to work. In particular, if N has an odd factor p (say a prime), then the same argument works to show that we can avoid visiting any site with label congruent to 1 modulo p.

It’s actually very slightly more complicated. In the original argument, we didn’t need to use any property of b_k. But obviously here, if b_k\equiv 1 modulo p and p\,|\,a_k, then certainly b_{k+1}\equiv 1 modulo p. So we have to prove instead that Alice can ensure she never ‘visits 1 modulo p for the first time’. Which is fine, by the same argument.
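If you want to see the strategy in action, here is a minimal Python sketch, under my reading of the setup (Alice starts at 0, and the k-th move takes her from b_k to b_k\pm a_k modulo N); all the names are mine.

```python
import random

def alice_next(b, a, N, p):
    """Greedy choice from the argument above: pick whichever of b+a, b-a (mod N)
    avoids the residue class 1 mod p; for odd p at least one of them always does."""
    for candidate in (b + a, b - a):
        if candidate % p != 1 % p:
            return candidate % N
    raise AssertionError("both moves forbidden -- cannot happen for odd p")

def simulate(N, p, steps=10_000, seed=0):
    rng = random.Random(seed)
    b, visited = 0, {0}
    for _ in range(steps):
        a = rng.randrange(1, N)          # Joe's move; here chosen at random
        b = alice_next(b, a, N, p)
        visited.add(b)
    return visited

# N = 12 has odd factor 3, so the sites congruent to 1 mod 3 are never visited
unvisited = set(range(12)) - simulate(12, 3)
print(sorted(unvisited))                  # contains at least 1, 4, 7, 10
```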

So, we’ve shown that Neel / Alice wins if N is odd, or has an odd factor. The only values that remain are powers of 2. I should confess that I was genuinely a little surprised that Joe / Hatter wins in the power of 2 case. You can find a construction fairly easily for N=2 and N=4, but I suspected that might be a facet of small numbers. Why? Because it still felt we could avoid a particular site. In order for Alice’s strategy to fail, we have to end up exactly opposite the particular site at exactly the time when the next a_k=N/2, and so maybe we could try to avoid that second site as well, and so on backwards?

But that turned out to be a good example of something that got very complicated quite quickly with little insight. And, as discussed at the beginning, that’s often a sign in a competition problem that your idea isn’t so good. (Obviously, when composing a problem, that’s no guarantee at all. Sometimes things are true but no good ideas work.) So we want other ideas. Note that for N=4, the sequence (2,1,2) works for Joe / Hatter, because that forces Alice / Neel to visit either (0,2,1,3) or (0,2,3,1). In particular, this strategy gave Alice no control on the first step nor the last step, and the consequence is that we force her to visit the evens first, then transfer to an odd, and then force her to visit the other odd.

We might play around with N=8, or we might proceed directly to a general extension. If we have a Joe / Hatter strategy for N, then by doubling all the a_ks, we have a strategy for 2N which visits all the even sites in the first N steps. But then we can move to an odd site eg by taking a_N=1. Just as in the N=4 case, it doesn’t matter which odd site we start from, since if we again double all the a_ks, we will visit all the other odd sites. This gives us an inductive construction of a strategy for powers of two. To check it’s understood, the sequence for N=8 is (4,2,4,1,4,2,4).

Although we don’t use it, note that this strategy takes Alice on a tour of sites described by decreasing order of largest power of two dividing the label of the site.
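Here is a quick sketch of the inductive construction just described, together with a brute-force check that the resulting sequence really does force Alice to visit every site whatever choices she makes. The names are mine, and the check is exponential in the number of moves, so only sensible for small powers of two.

```python
def hatter_strategy(N):
    """The recursive construction above: a sequence of moves forcing a visit to
    every site of the N-cycle, for N a power of two."""
    assert N >= 2 and N & (N - 1) == 0
    if N == 2:
        return [1]
    doubled = [2 * a for a in hatter_strategy(N // 2)]
    return doubled + [1] + doubled

def forced_everywhere(N, moves):
    """Check, over all of Alice's 2^len(moves) choice sequences, that she visits every site."""
    def rec(pos, visited, i):
        if i == len(moves):
            return len(visited) == N
        return all(rec((pos + s * moves[i]) % N,
                       visited | {(pos + s * moves[i]) % N}, i + 1)
                   for s in (+1, -1))
    return rec(0, {0}, 0)

for N in (2, 4, 8, 16):
    moves = hatter_strategy(N)
    print(N, moves, forced_everywhere(N, moves))
# e.g. 8 [4, 2, 4, 1, 4, 2, 4] True
```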

Question Three

I have a theory that the average marks on Q1, Q2 and Q3 on this year’s paper will be in ascending order rather than, as one might expect, descending order. I think my theory will fail because it’s an unavoidable fact of life that in any exam, candidates normally start at the beginning, and don’t move to the middle until they have made progress on the earlier questions. But I think that’s the only reason my theory will fail.

Like kitchen cleanliness or children’s character flaws, it’s hard to compare one’s own problem proposals with others’ rationally. But I felt that, allowing for general levels of geometry non-preference, Q3 was more approachable than Q2, especially to any candidate who’d prepared by looking at some past papers.

I’m in no way a number theorist, but I know three or four common themes when one is asked to prove that a certain sequence contains no squares, or almost no squares. [3a]

  • Number theoretic properties of the sequence of squares. Squares cannot be 3 modulo 4 for example. They also cannot be 2 modulo 4, and thus they also cannot be 2^{k-1} modulo 2^k for any even k. This first observation was essentially the body of most solutions to Q4 of BMO1 2016, among many others.
  • Soft properties of the sequence of squares. The sequence of squares grows quadratically. Sometimes we can show a quadratic sequence will have no overlap with some other sequence for basic reasons. This is especially common if the second sequence is also quadratic or similar. For example, the expression n^2+3n-4 is typically not a square because

(n+1)^2 = n^2+2n+1 < n^2 + 3n - 4 < n^2+4n+4 = (n+2)^2,

    when n is large. In fact the right hand inequality is always true, and the left hand inequality is true for n\ge 6, which doesn’t leave too many cases to check (and n=5 does actually give a square). This type of argument has been quite common on BMO recently, directly on Q1 of BMO1 2011 and also Q3 of BMO1 2016. An example in a more abstract setting is Q3 of Balkan MO 2007, which I greatly enjoyed at the time…
  • Number theoretic properties of the definition of a square. A square is the product of an integer with itself, and so if we want the product of two or more integers to be a square, then this imposes conditions on the shared factors of the two integers. I’ll cite some examples shortly.
  • Huge theorems. Some old paper which I encountered as a child asked us to find all solutions to x^2-1=2^y. Or similar – I can’t find it now – but Q2 of BMO2 2006 is close enough to the sensible approach to the problem. I think it’s more helpful to think about this as proving that a particular sequence rarely includes powers of two than that a particular sequence rarely includes squares. But either way, one could in principle use the Catalan conjecture, which controls all non-trivial solutions to a^p - b^q=1. Fortunately, the Catalan conjecture was proved, by Mihailescu (readable blog about it), between the paper being set, and me attempting it a few years later. I’m being flippant. This is not a standard trope in solving these questions. For very obvious reasons. If it can be killed by direct reference to a known theorem, it won’t be set.

Anyway, those references (and more to follow) are to illuminate why I thought this question was not too hard. Indeed, I feel one can make substantial meta-progress in one’s head. The given information is interesting, but for the purpose of this question is just a black box. By subtracting the expression for m from the expression for 2m, we can derive an expression for the required sum. It’ll be a quartic in m, because the leading terms won’t cancel.

This leaves all three of the methods above very accessible. Unfortunately m=0 would be a square were it not excluded specifically, so a modular arithmetic approach is unlikely to work directly. Bounding between two quadratics is entirely plausible, as is factorising and comparing number theoretic properties of the factors. I thought the second one seemed more promising, but either way, having two potentially good ideas based only on recent BMO problems before even writing anything down is a good opening.

We do have to calculate the sum, and I make it \frac{1}{4}m^2(5m+3)(3m+1). Now I’m not so sure how to bound this between two quadratics, because the leading coefficient is 15/4, which is not the square of a rational. But the factor analysis approach is definitely on.
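Before diving into the factor analysis, it is cheap to check numerically that no small positive m gives a square, which is at least reassuring (using the expression exactly as stated above):

```python
from math import isqrt

def total(m):
    # the expression above; the numerator is always divisible by 4, so this is exact
    return m * m * (5 * m + 3) * (3 * m + 1) // 4

def is_square(n):
    return n >= 0 and isqrt(n) ** 2 == n

print([m for m in range(1, 10_000) if is_square(total(m))])   # prints [] in this range
```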

Let’s review this generally. Throughout, suppose m,n are positive integers.

Claim 1: if mn is a square, then m and n are squares too.

Claim 2: if mn is a square, then m=n.

Both of these claims are false. However, a version of Claim 1 is true.

Claim 1′: if mn is a square, and m,n are coprime, then each is a square.

Even though this isn’t a named theorem, it is true, and well-known and can be used without proof. One way to prove it is to write m,n as products of primes, and show that since the sets of primes dividing m and n are disjoint, the exponents must all be even. Most other methods will be equivalent to this, maybe with less notation.

What is good about Claim 1′ is that more complicated versions are true for essentially similar reasons. For example

Claim 3: if mn is 6k^2, and m,n are coprime, then either one is a square and the other is six times a square; or one is two times a square, and the other is three times a square.

Claim 4: if mn is a square, and the greatest common divisor (m,n) is either 5 or 1, then either each is a square, or each is five times a square.

I cited some examples of the other methods I proposed. Here are some examples of this sort of thing in recent BMOs:

  • Q4 of BMO2 2016. Even the statement is suggestive. There are more complicated routes, but showing that (2p-u-v)(2p+u+v) is a square is one way to proceed, and then Claim 4 directly applies after checking a gcd.
  • Q2 of BMO1 2014 is similar, but it is much more explicit that this is the correct approach. Expose p^2 then use a (correct) version of Claim 2.
  • Q1 of BMO2 2009. Show that a and b must each be a square times 41 for rationality reasons.
  • Q6 of BMO1 2006. After sensible focused substitutions, obtain 3n^2=q(q-1). Rather than try to ‘solve’ this, extract the key properties along the lines of Claim 3, eliminate one of the cases by modular arithmetic, and return to the required statement.
  • Q3 of BMO2 2010 requires the student to reproduce the essentials of the arguments above in the case of a particular degree six polynomial with a tractable factorisation, along with some mild square-sandwiching or bounding arguments as discussed earlier.

In conclusion, I’m trying to say that if I claim I am confident I can find all integers m such that \frac14 m^2(5m+3)(3m+1) is a square, this is not based on complicated adult experience, but rather on recent problems at a similar sensible level. And I still don’t think it counts as Olympiad technique #371 – thinking about divisibility of factors is a good thing to do when talking about integers, and so it’s just a natural entry point into problems about squares. Plenty of problems might have this sort of thing as a starting point or an ending point.

For this problem we need a different ending point. To be brief, the factors (5m+3) and (3m+1) cannot both be squares, because 5m+3 is never a square (it is 3 modulo 5, and squares are 0, 1 or 4 modulo 5). So since the gcd of these factors is 1, 2 or 4, the only other option is that they are both squares times 2. But because -1 is not a square modulo 3, a (square times 2) is never congruent to 1 modulo 3, whereas 3m+1 always is, and we are done. Note that this was a literal example of the first technique for proving something is not a square, proposed all the way back at the start of this section.

Footnotes

[3a] – some common themes for proving that sequences do include squares might be comparison with Pell’s Equations, or comparison with the explicit construction of solutions to Pythagoras’s equation.

Question Four

An example of an absorbing function is f(x)=\lfloor x\rfloor. One challenge is thinking of many other examples. This one is fine, but it’s true under replacing 2018 by 1 in the statement, and so it doesn’t really capture the richness of the situation.

Notation: the pre-image of a function is the language used to describe the inverse of a function which doesn’t have a uniquely-defined inverse. That is, if f is not injective, then multiple arguments may have the same output. We write f^{-1}(y)=\{x: f(x)=y\}. In particular, this is a set of values, not necessarily a single value. We also use \mathbb{Z} to denote the integers. We can apply pre-images to sets as well. So for example f^{-1}(\mathbb{Z})=\{x : f(x)\in \mathbb{Z}\}.

This question is tricky, and I will be surprised to see many full solutions from the eligible candidates. It rewards the sort of organisation and clear-thinking that is easier said than done in a time-pressured contest environment. There are also many many possible things to consider, and so it is particularly challenging in the short timeframe of BMO2 as opposed to, for example, appearing as the middle question on a 4.5 hour international-level paper.

At a meta-level we are being asked to confirm or deny the existence of absorbing functions where f^{-1}(\mathbb{Z}) is small in some sense, firstly when actually having finite size, secondly when, although infinite, being a small sort of infinite, namely spread out in a sparse, well-ordered way (you might say countable if familiar with that language). The general idea is presumably that it’s hard to be absorbing if the pre-image of the integers is small, and so it’s reasonable to assume that it’s too hard if this is finite; but perhaps not quite too hard if it’s merely countable. So (no, yes) is a sensible guess at the answer to the question, though (no, no) might also fit, maybe with a harder argument for the second no.

Ok, instead of trying a) or b), just play with the configuration. Let A=f^{-1}(\mathbb{Z}). We will use this frequently. In the picture below, f maps the real line on top to the real line below. If two reals get mapped to the same image, then whether or not the image is an integer, the whole (closed) interval bounded by the two reals also gets mapped to the same image. This is because f is weakly increasing.

This means that A consists of various intervals (which include single points). But in both a) and b) we know that A is ‘small’, and so it cannot contain any intervals of positive length. So in fact A is a set of separated real values. In the case of a) it’s a finite set.

Do we want to try and iterate this, and look at f^{-1}(A)? Well maybe, but we don’t know much about pre-images of A, only about pre-images of \mathbb{Z}.

But note that the pre-image of the pre-image of the … of the pre-image [2017 times] of A must be the whole real line, so at some point, some value has a pre-image that is an interval. So if we’re guessing that the answer to b) is yes, then we need to give a construction.

\mathbb{R} \stackrel{f}\longrightarrow ?? \stackrel{f}\longrightarrow\quad\ldots\quad \stackrel{f}\longrightarrow ??\stackrel{f}\longrightarrow A \stackrel{f}\longrightarrow f(A)\subset \mathbb{Z}.

If you play around for a bit, it seems very unlikely to be absorbing if the integers don’t get mapped to the integers. You can try to prove this, but at the moment we’re just aiming for a construction, so let’s assume f(\mathbb{Z})\subset \mathbb{Z}. It would be convenient if f(n)=n for all n\in \mathbb{Z}, but we already know that this won’t work because then the pre-image of the pre-image of the… of \mathbb{Z} is always \mathbb{Z}, but we need it to be \mathbb{R}.

The ideal situation would be if A= \mathbb{Z}\cup \{\ldots, a'_{-1},a'_0,a'_1,\ldots\}, where the pre-image of \{\ldots, a'_{-1},a'_0,a'_1,\ldots\} is pretty much everything.

Informally, we are specifically banned from mapping intervals directly onto an integer. So have an intermediate set, and try to map almost everything (except the integers and the set itself) onto that set, and then map that set into the integers.

At this point, you really just have to have the right idea and finish it. Many things will work, but this seems the easiest to me. Let the set A consist of the integers and the (integers plus 1/2). And for x\in A, f(x)=2x. This is what f looks like so far.

Here the black crosses are integers, and the purple crosses are (integers plus 1/2). But now we need to make as many reals as possible in the top row map to a purple cross (which is allowed, because purple crosses aren’t integers), but we need also to preserve the weakly increasing property. Fortunately, we can exactly do that. Each cross of either colour in the top row maps to a black cross in the middle row (ie an integer), so we can map the open interval between crosses in the top row to a purple cross in the middle row. As shown in red:

Note that this is consistent. The fact that I haven’t drawn the red cones into the bottom row is only because I didn’t use the bottom row to motivate doing this. I’ve shown a consistent definition of f that maps all the reals onto the integers in two steps. If it’s an integer to begin with, that was great; if it was an (integer plus 1/2) to begin with then it becomes an integer in one step and stays an integer; and otherwise it first maps to an (integer plus 1/2), and then to an integer in the second step.

To check you’ve understood, try to write down a standalone definition of this function.
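If you want something to compare against afterwards, here is one possible answer, as a tiny Python sketch (the choice of image for each open interval is mine, but any consistent weakly increasing choice works):

```python
from math import floor

def f(x):
    """A = Z ∪ (Z + 1/2) maps by x -> 2x; each open interval between consecutive
    points of A maps to the half-integer lying between the images of its endpoints."""
    if 2 * x == int(2 * x):          # x is an integer or an (integer plus 1/2)
        return 2 * x
    return floor(2 * x) + 0.5

# every real becomes an integer after two applications, as described above
for x in (0.0, 0.5, 0.3, -1.7, 4.25):
    print(x, f(x), f(f(x)))
```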

I’ve therefore solved part b) with the alternative condition \ldots a_{-1}<a_0<a_1<a_2<\ldots which isn’t exactly as required. It requires one small and simple idea to convert to a solution to the actual statement. See if you can find it yourself!

I think part a) is harder, not because the solution will look more complicated, but because there are so many potential partial results you could try to prove, because there are so many sets you could consider. To name a few: the image of f, the image of f intersected with \mathbb{Z}, the image of \mathbb{Z}, the 2018-composition image f^{2018}(\mathbb{R}), the 2018-composition image f^{2018}(\mathbb{Z}) and so on and so forth. You might have good insight into the wrong things.

For me, the crucial observation (which you can see from the figure in the b) construction) is that when composing an increasing function with itself, the ‘trajectories’ are either increasing or decreasing. That is, if x\le f(x) (respectively, x\ge f(x)), then x\le f(x)\le f^2(x)\le f^3(x)\le\ldots (respectively x\ge f(x)\ge f^2(x)\ge \ldots). Again, you can think of this as Olympiad technique #371 if you insist, but I don’t think that’s helpful. There are lots of things one could try to say here, and this turns out to be natural, true and useful, but you can’t know it’s useful until you play with it.

Anyway, we’re playing with part a), and we know that f^k(x) is an integer for all large enough k, and that f^{k+1}(x) is also an integer, so f^k(x) is one of a finite set of integers because of the condition on A. But we’ve seen the sequence x,f(x),f^2(x),\ldots is weakly increasing or weakly decreasing, and so if we also know it’s eventually bounded (because eventually it’s in this finite set) then it must eventually be constant. And this constant is one of the integers, say n. But unless we started from n, this means that f(n)=n, but also f(x)=n for some other real value x. And so exactly as at the very very beginning, that’s bad, because then the whole interval [x,n] gets mapped to n, which is a contradiction.

Question Two – Origin story

The origin story for Q2 started in a talk I heard by Renan Gross at Weizmann, who referenced some of the history of Scenery Reconstruction. Roughly speaking, we colour the integers (say with two colours), and then let loose a random walker, who tells us the sequence of colours she observes during her walk, but no other information about the walk itself.

How much information can we recover about the colouring? Obviously, the best we can hope for is to recover the colouring, up to translations and reflection, since for every possible random walk trajectory, the exact reflection is equally probable, and we are given no information about the starting point.

Since lots of the transitions between recoverable and unrecoverable depend on the periodicity of the colouring, a reasonable toy model is to do it on a cycle. Note that the Strong Law of Large Numbers tells us that we almost surely recover the number of black sites and white sites from the infinite trajectory of the random walk. Of course it’s possible that there are only two black vertices, and they are adjacent, and the walker oscillates between them, thus seeing BBBBBB… But this is extremely unlikely. You could think of this in Bayesian terms as strongly increasing the prior on the whole cycle being black, but I think initially it’s best to do this as an infinite-time, SLLN problem rather than as finite-time WLLN/CLT reweightings of anything.

But what more? It’s clear that the lengths of the observed black substrings should follow some mixed geometric-ish distribution, and this distribution will almost surely emerge as the limit of the empirical distribution, in an SLLN sense. But it’s tricky to justify why such a mixed geometric-ish distribution should determine the lengths of the black arcs in the cycle uniquely. Still, it does definitely feel like we should have enough information to reconstruct the colouring up to reflection/rotation with probability one. For example, analogously to the number of black vertices and the number of white vertices, we should be able to recover the number of adjacent black vertices, the number of adjacent white vertices, and the number of black-white adjacent vertices, and so on.

Anyway, this can be done, and it follows as a consequence of various authors’ work answering some more general conjectures of Benjamini and, separately, of den Hollander and Keane. Douglas Howard [DH] shows a handful of generalisations of this, as do Benjamini and Kesten [BK]. Most of this work is focused on sceneries on \mathbb{Z}, but periodic sceneries are often used as a basis, and of course, the only difference between periodic sceneries on \mathbb{Z} and sceneries on the N-cycle are whether you know the period in advance. [BK] show that ‘almost all’ sceneries are distinguishable in a particular sense, in response to which Lindenstrauss [L99] exhibits a large family of sceneries which are not distinguishable. A readable but technical review is [ML].

So Renan’s talk was about the similar problem (and generalisations) on the hypercube [GG]. Rather than paraphrase the main differences badly, you can read his own excellent blog post about the work.

On the train back to Haifa from Rehovot, I was thinking a bit about the cycle case, and what happens if you generalise the random walk to allow varying jump lengths, or indeed introduce a demon walker, whose goal is to make it as hard as possible for the observer to deduce the colouring. One way this can certainly happen is if the walker can avoid visiting some particular site, as then how could one possibly deduce the colour of the never-visited site? And so we get to the statement posed.

References

[BK] – Benjamini, Kesten, 1996 – Distinguishing sceneries by observing the scenery along a random walk path

[dH] – den Hollander, 1988 – Mixing properties for random walk in random scenery

[DH] – Douglas Howard, 1996 – Detecting defects in periodic scenery by random walks on Z

[GG] – Grupel, Gross, 2017 – Indistinguishable sceneries on the Boolean hypercube

[L99] – Lindenstrauss, 1999 – Indistinguishable sceneries

[ML] – Matzinger, Lember, 2003 – Scenery reconstruction: an overview [link]

 


Random transpositions

We study a procedure for generating a random sequence of permutations of [N]. We start with the identity permutation, and then in each step, we choose two elements uniformly at random, and swap them. We obtain a sequence of permutations, where each term is obtained from the previous one by multiplying by a uniformly-chosen transposition.

Some more formality and some technical remarks:

  • This is a Markov chain, and as often with Markov chains, it would be better if it were aperiodic. As described, the chain will alternate between odd and even permutations. So we allow the two elements chosen to be the same. This laziness slows down the chain by a factor (N-1)/N, but removes periodicity. We will work over timescales where this adjustment makes no practical difference.
  • Let \tau_1,\tau_2,\ldots be the sequence of transpositions. We could define the sequence of permutations by \pi_m= \tau_m\cdot\tau_{m-1}\cdot \ldots\cdot \tau_1. I find it slightly more helpful to think of swapping the elements in places i and j, rather than the elements i and j themselves, and so I’ll use this language, for which \pi_m = \tau_1\cdot \tau_2\cdot\ldots \cdot \tau_m is the appropriate description. Of course, transpositions and the identity are self-inverse permutations, so it makes no difference to anything we might discuss.
  • You can view this as lazy random walk on the Cayley graph of S_N generated by the set of transpositions. That is, the vertices of the graph are elements of S_N, and two are connected by an edge if one can be obtained from the other by multiplying by a transposition. Note this relation is symmetric. Hence random transposition random walk.
  • Almost everything under discussion would work in continuous time too.
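For readers who like to experiment, here is a minimal simulation sketch of the chain just described, together with the cycle structure of the resulting permutation (all names and parameters are mine):

```python
import random

def rtrw(N, steps, seed=0):
    """Lazy random transposition random walk: pi[i] is the label currently in place i."""
    rng = random.Random(seed)
    pi = list(range(N))                               # start from the identity
    history = [tuple(pi)]
    for _ in range(steps):
        i, j = rng.randrange(N), rng.randrange(N)     # i == j allowed: the lazy step
        pi[i], pi[j] = pi[j], pi[i]                   # swap the labels in places i and j
        history.append(tuple(pi))
    return history

def cycle_lengths(pi):
    """Cycle type of a permutation given in one-line notation."""
    seen, lengths = [False] * len(pi), []
    for start in range(len(pi)):
        if not seen[start]:
            length, x = 0, start
            while not seen[x]:
                seen[x] = True
                x = pi[x]
                length += 1
            lengths.append(length)
    return sorted(lengths, reverse=True)

N = 1000
walk = rtrw(N, N)                     # N transpositions: supercritical for the associated graph
print(cycle_lengths(walk[-1])[:5])    # typically a few macroscopic cycles have appeared
```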

At a very general level, this sort of model is interesting because sometimes the only practical way to introduce ‘global randomness’ is repeatedly to apply ‘local randomness’. This is not the case for permutations – it is not hard to sample uniformly from S_N. But it is a tractable model in which to study relevant questions about generating randomness on a complicated set through iterated local operations.

Since it is a Markov chain with a straightforward invariant distribution, we can ask about the mixing time. That is, the correct scaling for the number of moves before the random permutation is close in distribution (say in the sense of total variation distance) to the equilibrium distribution. See this series of posts for an odd collection of background material on the topic. Diaconis and Shahshahani [DS81] give an analytic argument for mixing around \frac{N\log N}{2} transpositions. Indeed the constant matters, because there is a sharp cutoff: the total variation distance drops from approximately 1 to approximately 0 in O(N) steps around this time.

Comparison with Erdos-Renyi random graph process

Given the previous result, one might observe that m=\frac{N\log N}{2} is also the threshold number of edges to guarantee connectivity of the Erdos-Renyi random graph G(N,m) with high probability [ER59]. Indeed, there is a sharp transition around this threshold in this setting too.

We explore this link further. We can construct a sequence of random graphs simultaneously with the random transposition random walk. When we multiply by transposition (i j), we add edge ij in the graph. Laziness of RTRW and the possibility of multiple edges mean this definition isn’t literally the same as the conventional definition of a discrete-time Erdos-Renyi random graph process, but again this is not a problem for any of the effects we seek to study.

The similarity between the constructions is clear. But what about the differences? For the RTRW, we need to track more information than the random graph. That is, we need to know what order the transpositions were added, rather than merely which edges were added. However, the trade-off is that a permutation is a simpler object than a graph in the following sense. A permutation can be described as a union of disjoint cycles. In an exchangeable setting, all the information about a random permutation is encoded in the lengths of these cycles. Whereas in a graph, geometry is important. It’s an elegant property of the Erdos-Renyi process that we can forget about the geometry and treat it as a process on component sizes (indeed, a multiplicative coalescent process), but there are other questions we might need to ask for which we do have to study the graph structure itself.

Within this analogy, unfortunately the word cycle means different things in the two different settings. In a permutation, a cycle is a directed orbit, while in a graph it has the usual definition. I’m going to write graph-cycle whenever relevant to avoid confusion.

A first observation is that, under this equivalence, the cycles of the permutation form a finer partition than the components of the graph. This is obvious. If we split the vertices into sets A and B, and there are no edges between them, then nothing in set A will ever get moved out of set A by a transposition. (Note that the slickness of this analogy is the advantage of viewing a transposition as swapping the elements in places i and j.)

However, we might then ask under what circumstances a cycle of the permutation is the same as a component of the graph (rather than a strict subset of it). A first answer is the following:

Lemma: [Den59] The permutation formed by multiplying, in any order, the transpositions corresponding to the edges of a tree in the graph has a single cycle.

We can treat this as a standalone problem and argue in the following predictable fashion. (Indeed, I was tempted to set this as a problem during selection for the UK team for IMO 2017 – it’s perfectly suitable in this context I think.) The first transposition corresponds to some edge say ab, and removing this edge divides the vertices into components A \ni a, B\ni b. Since no further transposition swaps between places in A and places in B, the final permutation maps a into B and b into A, and otherwise preserves A and B.

This argument extends to later transpositions too. Now, suppose there are multiple cycles. Colour one of them. So during the process, the coloured labels move around. At some point, we must swap a coloured label with an uncoloured label. Consider this edge, between places a and b as before, and indeed the same conclusion holds. WLOG we move the coloured label from a to b. But then at the end of the process (ie in the permutation) there are more coloured labels in B than initially. But the number of coloured labels should be the same, because they just cycle around in the final permutation.

We can learn a bit more by thinking about the action on cycles (in the permutation) of adding a transposition. In the following pair of diagrams, the black arrows represent the original permutation (note it’s not helpful to think of the directed edges as having anything to do with transpositions now), the dashed line represents a new transposition, and the new arrows describe the new permutation which results from this product.

It’s clear from this that adding a transposition between places corresponding to different cycles causes the cycles to merge, while adding a transposition between places already in the same cycle causes the cycle to split into two cycles. Furthermore the sizes of the two cycles formed are related to the distance in the cycle between the places defining the transposition.

This allows us to prove the lemma by adding the edges of the tree one-at-a-time and using induction. The inductive claim is that cycles of the permutation exactly correspond to components of the partially-built tree. Assuming this claim guarantees that the next step is definitely a merge, not a split (otherwise the edge corresponding to the next step would have to form a graph-cycle, which is impossible in a tree). If all N-1 steps are merges, then the number of cycles is reduced by one on each step, and so the final permutation must be a single cycle.
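The lemma is also easy to test empirically; here is a small sketch (the random tree generator and all names are mine):

```python
import random

def random_tree_edges(n, rng):
    """A random labelled tree on {0,...,n-1}: attach each new vertex to a uniform earlier one."""
    return [(v, rng.randrange(v)) for v in range(1, n)]

def apply_transpositions(n, edges):
    pi = list(range(n))
    for i, j in edges:
        pi[i], pi[j] = pi[j], pi[i]
    return pi

def num_cycles(pi):
    seen, count = [False] * len(pi), 0
    for start in range(len(pi)):
        if not seen[start]:
            count += 1
            x = start
            while not seen[x]:
                seen[x] = True
                x = pi[x]
    return count

rng = random.Random(1)
for _ in range(100):
    n = rng.randrange(2, 20)
    edges = random_tree_edges(n, rng)
    rng.shuffle(edges)                    # the lemma allows any order of the edges
    assert num_cycles(apply_transpositions(n, edges)) == 1
print("every product of tree-transpositions tested was a single cycle")
```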

Uniform split-merge

This gives another framework for thinking about the RTRW itself, entirely in terms of cycle lengths as a partition of [N]. That is, given a partition, we choose a pair of parts in a size-biased way. If they are different, we merge them; and if it is the same part, with size k, we split it into two parts, with sizes chosen uniformly from { (1,k-1), (2,k-2), …  (k-1,1) }.

What’s nice about this is that it’s easy to generalise to real-valued partitions, eg of [0,1]. Given a partition of [0,1], we sample two IID U[0,1] random variables U_1,U_2. If these correspond to different parts, we replace these parts by a single part with size given by the sum. If these correspond to the same part, with size \alpha, we split this part into two parts with sizes |U_1-U_2| and \alpha - |U_1-U_2|. This is equivalent in a distributional sense to sampling another U[0,1] variable U and replacing \alpha with (\alpha U, \alpha(1-U)). We probably want our partition to live in \ell^1_\searrow, so we might have to reorder the parts afterwards too.
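Here is a short sketch of these continuum dynamics, following the description above literally, with the parts laid out as consecutive subintervals of [0,1] for the sampling step (all names are mine):

```python
import random

def split_merge_step(parts, rng):
    """One step of the uniform split-merge dynamics on a partition of [0,1]."""
    u1, u2 = rng.random(), rng.random()

    def locate(u):
        acc = 0.0
        for idx, p in enumerate(parts):
            acc += p
            if u < acc:
                return idx
        return len(parts) - 1              # guard against rounding at the right end

    i, j = locate(u1), locate(u2)
    if i != j:                             # different parts: merge them
        merged = parts[i] + parts[j]
        parts = [p for k, p in enumerate(parts) if k not in (i, j)] + [merged]
    else:                                  # same part: split it at |u1 - u2|
        alpha = parts[i]
        x = min(abs(u1 - u2), alpha)       # min() only guards against rounding
        parts = [p for k, p in enumerate(parts) if k != i] + [x, alpha - x]
    return sorted(parts, reverse=True)     # keep the partition in decreasing order

rng = random.Random(0)
parts = [1.0]                              # start from the trivial partition
for _ in range(20_000):
    parts = split_merge_step(parts, rng)
print(parts[:5])                           # largest parts after many steps
```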

These uniform split-merge dynamics have a (unique) stationary distribution, the canonical Poisson-Dirichlet random partition, hereafter PD(0,1). This was first shown in [DMZZ04], and then in a framework more relevant to this post by Schramm [Sch08].

Conveniently, PD(0,1) is also the scaling limit of the cycle lengths in a uniform random permutation (scaled by N). The best way to see this is to start with the observation that the length of the cycle containing 1 in a permutation chosen uniformly from S_N has the uniform distribution on {1,…,N}. This matches up well with the uniform stick-breaking construction of PD(0,1), though other arguments are available too. Excellent background on Poisson-Dirichlet distributions and this construction and equivalence can be found in Chapter 3 of Pitman’s comprehensive St. Flour notes [CSP]. Also see this post, and the links within, with the caveat that my understanding of the topic was somewhat shaky then (as presently, for now).

However, Schramm says slightly more than this. As the Erdos-Renyi graph passes criticality, there is a well-defined (and whp unique) giant component including \Theta(N) vertices. It’s not clear that the corresponding permutation should have giant cycles. Indeed, whp the giant component has \Theta(N) surplus edges, so the process of cycle lengths will have undergone O(N) splits. Schramm shows that most of the labels within the giant component are contained in giant cycles in the permutation. Furthermore, the distribution of cycle lengths within the giant component, rescaled by the size of the giant component, converges in distribution to PD(0,1) at any supercritical time \frac{(1+\epsilon)N}{2}.

This is definitely surprising, since we already know that the whole permutation doesn’t look close to uniform until time \frac{N\log N}{2}. Essentially, even though the size of the giant component is non-constant (ie it’s gaining vertices), the uniform split-merge process is happening to the cycles within it at rate N. So heuristically, at the level of the largest cycles, at any supercritical time we have a non-trivial partition, and by any slightly later time (eg between \frac{(1+\epsilon/2)N}{2} and \frac{(1+\epsilon)N}{2}), mixing will have comfortably occurred, and so the distribution is close to PD(0,1).

This is explained very clearly in the introduction of [Ber10], in which the approach is extended to a random walk on S_N driven by a uniform choice from any conjugacy class.

So this really does tell us how the global uniform randomness emerges. As the random graph process passes criticality, we have a positive mass of labels in a collection of giant cycles which are effectively a continuous-space uniform split-merge model near equilibrium (and thus with PD(0,1) marginals). The remaining cycles are small, corresponding to small trees which make up the remaining (subcritical by duality) components of the ER graph. These cycles slowly get absorbed into the giant cycles, but on a timescale sufficiently slow relative to the split-merge dynamics that we do not need to think of a separate split-merge-with-immigration model. Total variation distance on permutations does feel the final few fixed points (corresponding to isolated vertices in the graph), hence the sharp cutoff corresponding to sharp transition in the number of isolated vertices.

References

[Ber10] – N. Berestycki – Emergence of giant cycles and slowdown transition in random transpositions and k-cycles. [arXiv version]

[CSP] – Pitman – Combinatorial stochastic processes. [pdf available]

[Den59] – Denes – the representation of a permutation as a product of a minimal number of transpositions, and its connection with the theory of graphs

[DS81] – Diaconis, Shahshahani – Generating a random permutation with random transpositions

[DMZZ04] – Diaconis, Mayer-Wolf, Zeitouni, Zerner – The Poisson-Dirichlet distribution is the unique invariant distribution for uniform split-merge transformations [link]

[ER59] – Erdos, Renyi – On random graphs I.

[Sch08] – Schramm – Compositions of random transpositions [book link]

DGFF 4 – Properties of the Green’s function

I’m at UBC this month for the PIMS probability summer school. One of the long courses is being given by Marek Biskup about the Discrete Gaussian Free Field (notes and outline here) so this seems like a good moment to revive the sequence of posts about the DGFF. Here’s DGFF1, DGFF2, DGFF3 from November.

The first draft of this post was about the maximum of the DGFF in a large box V_N, and also about the Green’s function G^{V_N}(x,y), which specifies the covariance structure of the DGFF. This first draft also became too long, so I’m splitting it into two somewhat shorter ones. As we’ll see, some understanding and standard estimates of the Green’s function are enough to say quite a bit about the maximum. In this first post, we’ll explore some ‘low-hanging fruit’ concerning the Green’s function, as defined through a simple random walk, which are useful, but rarely explained in the DGFF literature.

Symmetry of Green’s function

We start with one of these low-hanging fruit. If G^{V_N} is to be a covariance matrix, it has to be symmetric. In the first post, showing that the definition of the DGFF as a random field with given Hamiltonian is equivalent to \mathcal{N}(0,G^{V_N}) certainly can be viewed as a proof of symmetry. However, it would be satisfying if there was a direct argument in the language of the definition of the Green’s function.

To make this self-contained, recall the random walk definition of G^{V_N}(x,y). Let (S_m)_{m\ge 0} be simple random walk on V_N, and \mathbb{P}_x,\,\mathbb{E}_x denote starting the random walk at x\in V_N. As usual, let \tau_y,\,\tau_A denote the hitting time of a vertex y or a set A respectively. Then

G^{V_N}(x,y):= \mathbb{E}_x \left[ \sum_{m=0}^{\tau_{\partial V_N}}1_{(S_m=y) }\right].

That is, G^{V_N}(x,y) is the expected number of visits to y by a random walk from x, before it exits V_N.
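In finite volume we can of course just compute this matrix directly, which gives a useful sanity check for several statements below (the symmetry discussed next, and the fact that the diagonal is maximised at the central vertex). A small numpy sketch, with my own conventions for the box and its boundary:

```python
import numpy as np

def green_box(n):
    """Green's function of SRW on an n x n box, killed when it steps off the box,
    computed directly as (I - P)^{-1} for the killed transition matrix P."""
    sites = [(i, j) for i in range(n) for j in range(n)]
    index = {s: k for k, s in enumerate(sites)}
    P = np.zeros((len(sites), len(sites)))
    for (i, j), k in index.items():
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (i + di, j + dj) in index:              # steps leaving the box are killed
                P[k, index[(i + di, j + dj)]] = 0.25
    return np.linalg.inv(np.eye(len(sites)) - P), index

G, index = green_box(11)
print(np.allclose(G, G.T))                   # symmetry, as discussed next: True
centre, corner = index[(5, 5)], index[(0, 0)]
print(G[centre, centre], G[corner, corner])  # the diagonal is largest at the centre
```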

Let’s drop the superscript for now, as everything should hold for a more general subset of the lattice. I don’t think it’s immediately obvious at the level of Markov chains why G(x,y)=G(y,x). In particular, it’s not the case that

\mathbb{P}_x(\tau_y < \tau_{D^c}) = \mathbb{P}_y(\tau_x <\tau_{D^c}),

and it feels that we can’t map between paths x \to \partial D and y\to \partial D in a way that preserves the number of visits to y and x, respectively. However, we can argue that for any m

\mathbb{P}_x(S_m=y, \tau_{D^c}>m) = \mathbb{P}_y(S_m=x, \tau_{D^c}>m),

by looking at the suitable paths of (S_m). That is, if we have a path x=v_0,v_1,\ldots,v_m=y that stays within D, then the probability of seeing this path started from x and the probability of seeing its reversal started from y are equal. Why? Because

\mathbb{P}_x(S_0=x,S_1=v_1,\ldots,S_{m-1}=v_{m-1},S_m=y) = \prod_{\ell=0}^{m-1} \frac{1}{\mathrm{deg}(v_\ell)},

and

\mathbb{P}_y(S_0=y,S_1=v_{m-1},\ldots,S_{m-1}=v_1, S_m=x) = \prod_{\ell=0}^{m-1} \frac{1}{\mathrm{deg}(v_{m-\ell})} = \prod_{\ell=1}^m \frac{1}{\mathrm{deg}(v_\ell)}.

Since D\subset \mathbb{Z}^d and x,y are in the interior of D, we must have \mathrm{deg}(x)=\mathrm{deg}(y), and so these two expressions are equal. Summing over all such two-way paths, and then all m gives the result.

Fixing one argument

We now focus on G^D(\cdot,y), where the second argument is fixed. This is the solution to the Poisson equation

\Delta G^D(\cdot,y) = -\delta_y(\cdot),\quad G^D(x,y)=0,\; \forall x\in \partial D.

To see this, one can use a standard hitting probability argument (as here) with the Markov property. This is harmonic in D\backslash \{y\}, and since we know

G^D(y,y)= \frac{1}{\mathbb{P}_y(\text{RW hits }\partial D\text{ before returning to }y)},

this uniquely specifies G^D(\cdot,y). Anyway, since harmonic functions achieve their maxima at the boundary, we have G(y,y)\ge G(x,y) for all x\in D. We can also see this from the SRW definition as

G(x,y)=G(y,x) = \mathbb{P}_y (\tau_x < \tau_{\partial D} ) G(x,x) \le G(x,x).

Changing the domain

Now we want to consider nested domains D\subset E, and compare G^D(\cdot,\cdot) and G^E(\cdot,\cdot) on D\times D. The idea is that for SRW started from x\in D, we have \tau_{\partial D}\le \tau_{\partial E}, since the walk must leave D before it can leave E. From this, we get

G^D(x,y)\le G^E(x,y),\quad \forall x,y\in D,

and we will use the particular case y=x.

For example, if x\in V_N, the box with width N, then the box with width 2N centred on x contains the whole of V_N. So, if we set \bar {V}_{2N}:= [-N,N]^d, then with reference to the diagram, we have

G^{V_N}(x,x)\le G^{\bar{V}_{2N}}(0,0),\quad x\in V_N.

As we’ll see when we study the maximum of the DGFF on V_N, uniform control over the pointwise variance will be a useful tool.

Maximising the Green’s function

The idea of bounding G^{V_N}(x,x) by G^{\bar V_{2N}}(0,0) for any x\in V_N is clever and useful. But a more direct approach would be to find the value of x that maximises G^{V_N}(x,x). We would conjecture that when V_N has a central vertex, then this is the maximiser.

We can prove this directly from the definition of the Green’s function in terms of random walk occupation times. Let’s assume we are working with \bar{V}_N for even N, so that 0 is the central vertex. Again, since

G^D(x,x)=\frac{1}{\mathbb{P}_x(\text{RW hits }\partial D\text{ before returning to }x)}, (*)

it would suffice to show that this probability is minimised when x=0. This feels right, since 0 is furthest from the boundary. Other points are closer to the boundary in some directions but further in others, so we can’t condition on the maximum distance from its start point achieved by an excursion of SRW (we’re vertex-transitive, so these look the same from all starting points), as even allowing for the four possible rotations, for an excursion of diameter slightly larger than N, starting at the centre is maximally bad.

However, intuitively it does feel as if being closer to the boundary makes you more likely to escape earlier. In fact, with a bit more care, we can couple the SRW started from 0 and the SRW started from r=(r^x,r^y)\ne 0 such that the latter always exits first. For convenience we’ll assume also that r^x,r^y are both even.

I couldn’t find any reference to this, so I don’t know whether it’s well-known or not. The following argument involves projecting into each axis, and doing separate couplings for transitions in the x-direction and transitions in the y-direction. We assume WLOG that r is in the upper-right quadrant as shown. Then, let 0=S_0,S_1,S_2,\ldots be SRW started from 0, and we will construct r=R_0,R_1,R_2,\ldots on the same probability space as (S_m)_{m\ge 0} as follows. For every m, we set the increment R_{m+1}-R_m to be \pm(S_{m+1}-S_m). It remains to specify the sign, which will be determined by the direction of the S-increment, and a pair of stopping times. The marginal is therefore again an SRW, started from r. Temporarily, we use the unusual notation S_m= (S^x_m,S^y_m) for the coordinates of S_m.

So, if S_{m+1}-S_m=(1,0), (-1,0), ie S moves left or right, then we set

R_{m+1}-R_m = \begin{cases} -(S_{m+1}-S_m) &\quad \text{if }m<T^x\\ S_{m+1}-S_m&\quad \text{if }m\ge T^x.\end{cases} (*)

where T^x:= \min\{m\,:\, R^x_m=S^x_m\}. That is, R^x moves in the opposing direction to S^x until the first time when they are equal (hence the parity requirement), and then they move together. WLOG assume that r^x>0. Then suppose S^x_m=\pm N and such m is minimal. Then by construction, if m\ge T^x, then R^x_m=\pm N also. If m<T^x, then we must have S^x_m=-N, and so since R^x‘s trajectory is a mirror image of S^x‘s, in fact R^x_m = N+r^x>N, so R^x hit +N first. In both cases, we see that R^x hits \pm N at the same time or before S^x.

In other words, when S^x_m has non-negative x coordinate, the lazy random walk R^x follows the same trajectory as S^x, and when it has negative x coordinate, the R^x mirrors S^x. At some time, it may happen that S^x_m= R^x_m=0 (recall the parity condition on r). Call this time T^x. We then adjust the description of the coupling so that (*) is the mechanism for m<T^x, and then for m\ge T^x, we take S^x_m=R^x_m.

Similarly, if S_{m+1}-S_m =(0,1), (0,-1), ie S moves up or down, then we set

R_{m+1}-R_m = \begin{cases} -(S_{m+1}-S_m)&\quad \text{ if }m<T^y\\  S_{m+1}-S_m&\quad \text{if }m\ge T^y,\end{cases}

with corresponding definition of the stopping time T^y.

This completes the coupling, and by considering T^x\wedge T^y, we have shown that the exit time for the walk started from zero dominates the exit time for the walk started from r. Recall that so far we are in the case where the box has even width and r=(r^x,r^y) has even coordinates.

This exit time comparison isn’t exactly what we need to compare G^{\bar{V}_N}(0,0) and G^{\bar{V}_N}(r,r). It’s worth remarking at this stage that if all we cared about was the Green’s function on the integer line [-N,N], we would have an easier argument, as by the harmonic property of G(\cdot,y)

G^{[-N,N]}(0,r)=\frac{N-r}{N}G^{[-N,N]}(0,0),

G^{[-N,N]}(r,0) = \frac{N}{N+r}G^{[-N,N]}(r,r),

and so G(0,0)>G(r,r) follows by symmetry. To lift from 1D to 2D directly, we need a bit more than this. It’s possible that S returns in both x- and y- coordinates more often than R, but never at the same time. Fortunately, the coupling we defined slightly earlier does give us a bit more control.
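(Incidentally, the one-dimensional identities displayed above are easy to verify numerically, by computing the Green’s function on [-N,N] directly as (I-P)^{-1} for the walk killed at \pm N; the indexing below is my own.)

```python
import numpy as np

N, r = 10, 4
sites = list(range(-N + 1, N))               # interior of [-N, N]
idx = {s: k for k, s in enumerate(sites)}
P = np.zeros((len(sites), len(sites)))
for s, k in idx.items():
    for nb in (s - 1, s + 1):
        if nb in idx:                        # walk killed on hitting -N or +N
            P[k, idx[nb]] = 0.5
G = np.linalg.inv(np.eye(len(sites)) - P)

def g(x, y):
    return G[idx[x], idx[y]]

print(np.isclose(g(0, r), (N - r) / N * g(0, 0)))    # True
print(np.isclose(g(r, 0), N / (N + r) * g(r, r)))    # True
```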

Let \tau^x(S), \tau^x(R) be the first times that S^x, R^x hit \pm N. Under this coupling, for any m\ge 0

\mathbb{P}(S^x_m=0, m<T^x) = \mathbb{P}(R^x_m=r^x, m<T^x)

since these events are literally equal. Since we showed that \tau^x(R)\le \tau^x(S) almost surely, we can further deduce

\mathbb{P}(S^x_m=0,m<T^x\wedge \tau^x(S)) \ge \mathbb{P}(S^x_m=0,m<T^x\wedge \tau^x(R))

=\mathbb{P}(R^x_m=r^x, m <T^x \wedge \tau^x(R)).

To address the corresponding events for which m\ge T^x, we apply the strong Markov property at T^x, to obtain SRW Z_m started from r/2, and let \tau_{-N},\tau_{+N} be the hitting times of -N,+N respectively and \tau_{\pm N}=\tau_{-N}\wedge \tau_{+N}. It will now suffice to prove that

\mathbb{P}(Z_m=0, m< \tau_{\pm N}) \ge \mathbb{P}(Z_m=r,m<\tau_{\pm N}), (**)

as then we can apply the law of total probability and sum over values of T^x and m\ge 0.

To prove this result, we consider the following bijection between trajectories of length m from r/2 to {0,r}. We decompose the trajectories into excursions away from r/2, and then a final meander from r/2 to {0,r} that stays on the same side of r/2. We construct the new trajectory by preserving all the initial excursions, but reversing all the steps of the final meander. So if the original trajectory ended up at 0, the image ends up at r. Trivially, the initial excursions in the image only hit \pm N if the excursions in the original trajectory did this too. But it’s also easy to see, by a similar argument to the coupling at the start of this section, that if the original trajectory ends at r and does not hit \pm N, then so does the image. However, the converse is not true. So we conclude (**), and thus

\mathbb{P}(S_m^x=0) \ge \mathbb{P}(R_m^x=r^x)

for all m by combining everything we have seen so far. And so we can now lift to a statement about S_m itself, that is considering both coordinates separately.

 

The remaining cases for r require a little more care over the definition of T^x, though the same projection argument works, for fundamentally the same reason. (Note that in the above argument, if S^x_m=-N and m<T^x, then in fact R^x_m\ge N+2, and so it’s not hard to convince yourself that a sensible adjustment to the stopping time will allow a corresponding result with R^x_m\ge N+1 in the odd r^x case.) The case for N odd is harder, since in one dimension there are two median sites, and it’s clear by symmetry that we can’t couple them such that RW from one always exits at least as early as RW from the other. However, the distributions of exit times started from these two sites are the same (by symmetry), and so although we can’t find a coupling, we can use similar stopping times to obtain a result in probability.

In the next post, we’ll see how to apply this uniform bound on G^{V_N}(x,x) to control the maximum of the DGFF on V_N. In particular, we address how the positive correlations of DGFF influence the behaviour of the maximum by comparison with independent Gaussians at each site.

Random walks conditioned to stay positive

In this post, I’m going to discuss some of the literature concerning the question of conditioning a simple random walk to lie above a line with fixed gradient. A special case of this situation is conditioning to stay non-negative. Some notation first. Let (S_n)_{n\ge 0} be a random walk with IID increments, with distribution X. Take \mu to be the expectation of these increments, and we’ll assume that the variance \sigma^2 is finite, though at times we may need to enforce slightly stronger regularity conditions.

(Although simple symmetric random walk is a good example for asymptotic heuristics, in general we also assume that if the increments are discrete they don’t have parity-based support, or any other arithmetic property that prevents local limit theorems holding.)

We will investigate the probability that S_n\ge 0 for n=0,1,…,N, particularly for large N. For ease of notation we write T=\inf\{n\ge 0\,:\, S_n<0\} for the hitting time of the negative half-line. Thus we are interested in S_n conditioned on T>N, or T=N, mindful that these might not be the same. We will also discuss briefly to what extent we can condition on T=\infty.

In the first paragraph, I said that this is a special case of conditioning SRW to lie above a line with fixed gradient. Fortunately, all the content of the general case is contained in the special case. We can rephrase the question of S_n conditioned to stay above n\alpha until step N as the question of S_n-n\alpha (which, naturally, has drift \mu-\alpha) conditioned to stay non-negative until step N, by a direct coupling.

Applications

Simple random walk is a perfectly interesting object to study in its own right, and this is a perfectly natural question to ask about it. But lots of probabilistic models can be studied via naturally embedded SRWs, and it’s worth pointing out a couple of applications to other probabilistic settings (one of which is the reason I was investigating this literature).

In many circumstances, we can describe random trees and random graphs by an embedded random walk, such as an exploration process, as described in several posts during my PhD, such as here and here. The exploration process of a Galton-Watson branching tree is a particularly good example, since the exploration process really is a random walk with IID increments, unlike in, for example, the Erdos-Renyi random graph G(N,p), where the increments are only approximately IID. In this setting, the increments are given by the offspring distribution minus one, and the hitting time of -1 is the total population size of the branching process. So if the expectation of the offspring distribution is at most 1, then the event that the size of the tree is large is an atypical event, corresponding to delayed extinction. Whereas if the expectation is greater than one, then it is an event with limiting positive probability. Indeed, with positive probability the exploration process never hits -1, corresponding to survival of the branching tree. There are plenty of interesting questions about the structure of a branching process tree conditional on having atypically large size, including the spine decomposition of Kesten [KS], but the methods described in this post can be used to quantify the probability, or at least the scale of the probability of this atypical event.

In my current research, I’m studying a random walk embedded in a construction of the infinite-volume DGFF pinned at zero, as introduced by Biskup and Louidor [BL]. The random walk controls the gross behaviour of the field on annuli with dyadically-growing radii. Anyway, in this setting the random walk has Gaussian increments. (In fact, there is a complication because the increments aren’t exactly IID, but that’s definitely not a problem at this level of exposition.) The overall field is decomposed as a sum of the random walk, plus independent DGFFs with Dirichlet boundary conditions on each of the annuli, plus asymptotically negligible corrections from a ‘binding field’. Conditioning that this pinned field be non-negative up to the Kth annulus corresponds to conditioning the random walk to stay above the magnitude of the minimum of each successive annular DGFF. (These minima are random, but tightly concentrated around their expectations.)

Conditioning on \{T > N\}

When we condition on \{T>N\}, obviously the resulting distribution (of the process) is a mixture of the distributions we obtain by conditioning on each of \{T=N+1\}, \{T=N+2\},\ldots. Shortly, we’ll condition on \{T=N\} itself, but first it’s worth establishing how to relate the two options. That is, conditional on \{T>N\}, what is the distribution of T?

Firstly, when \mu>0, this event always has positive probability, since \mathbb{P}(T=\infty)>0. So as N\rightarrow\infty, the distribution of the process conditional on \{T>N\} converges to the distribution of the process conditional on survival. So we’ll ignore this for now.

In the case \mu\le 0, everything is encapsulated in the tail of the probabilities \mathbb{P}(T=N), and these tails are qualitatively different in the cases \mu=0 and \mu<0.

When \mu=0, then \mathbb{P}(T=N) decays polynomially in N. In the special case where S_n is simple symmetric random walk (and N has the correct parity), we can check this just by an application of Stirling’s formula to count paths with this property. By contrast, when \mu<0, even demanding S_N=-1 is a large deviations event in the sense of Cramer’s theorem, and so the probability decays exponentially with N. Mogulskii’s theorem gives a large deviation principle for random walks to lie above a line defined on the scale N. The crucial fact here is that the probabilistic cost of staying positive until N has the same exponent as the probabilistic cost of being positive at N. Heuristically, we think of spreading the non-expected behaviour of the increments uniformly through the process, at only polynomial cost once we’ve specified the multiset of values taken by the increments. So, when \mu<0, we have

\mathbb{P}(T\ge(1+\epsilon)N) \ll \mathbb{P}(T= N).

Therefore, conditioning on \{T\ge N\} in fact concentrates T on N+o(N). By contrast, when \mu=0, conditioning on \{T\ge N\} gives a nontrivial limit in distribution for T/N, supported on [1,\infty).
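To make the polynomial decay in the case \mu=0 concrete, here is the standard counting sketch for simple symmetric random walk (a classical identity, provable via the reflection principle, and not specific to anything above):

\mathbb{P}(T>2n) = \mathbb{P}\left(S_1\ge 0,\ldots,S_{2n}\ge 0\right) = \binom{2n}{n}2^{-2n} \sim \frac{1}{\sqrt{\pi n}},

by Stirling’s formula. So \mathbb{P}(T>N) decays like N^{-1/2}, and differencing (along the permitted parity) gives \mathbb{P}(T=N)=\Theta(N^{-3/2}).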

A related problem is the value taken by S_N, conditional on \{T>N\}. It’s a related problem because the event \{T>N\} depends only on the process up to time N, and so given the value of S_N, even with the conditioning, after time N, the process is just an unconditioned RW. This is a classic application of the Markov property, beloved in several guises by undergraduate probability exam designers.

Anyway, Iglehart [Ig2] shows an invariance principle for S_N | T>N when \mu<0, without scaling. That is, S_N=\Theta(1), though the limiting distribution depends on the increment distribution in a sense that is best described through Laplace transforms. If we start a RW with negative drift from height O(1), then it hits zero in time O(1), so in fact this shows that conditional on \{T\ge N\}, we have T= N +O(1) with high probability. When \mu=0, we have fluctuations on a scale \sqrt{N}, as shown earlier by Iglehart [Ig1]. Again, thinking about the central limit theorem, this fits the asymptotic description of T conditioned on T>N.

Conditioning on T=N

In the case \mu=0, conditioning on T=N gives

\left[\frac{1}{\sqrt{N}}S(\lfloor Nt\rfloor ) ,t\in[0,1] \right] \Rightarrow W^+(t), (*)

where W^+ is a standard Brownian excursion on [0,1]. This is shown roughly simultaneously in [Ka] and [DIM]. This is similar to Donsker’s theorem for the unconditioned random walk, which converges after rescaling to Brownian motion in this sense, or Brownian bridge if you condition on S_N=0. Skorohod’s proof for Brownian bridge [Sk] approximates the event \{S_N=0\} by \{S_N\in[-\epsilon \sqrt{N},+\epsilon \sqrt{N}]\}, since the probability of this event is bounded away from zero. Similarly, but with more technicalities, a proof of convergence conditional on T=N can approximate by \{S_m\ge 0, m\in[\delta N,(1-\delta)N], S_N\in [-\epsilon \sqrt{N},+\epsilon\sqrt{N}]\}. The technicalities here emerge since T, which in the continuum limit becomes the first return time to zero, is not a continuous function on the space of continuous paths. (Imagine a sequence of processes f^N for which f^N(x)\ge 0 on [0,1] and f^N(\frac12)=\frac{1}{N}.)

Once you condition on T=N, the mean \mu doesn’t really matter for this scaling limit. That is, so long as variance is finite, for any \mu\in\mathbb{R}, the same result (*) holds, although a different proof is in general necessary. See [BD] and references for details. However, this is particularly clear in the case where the increments are Gaussian. In this setting, we don’t actually need to take a scaling limit. The distribution of Gaussian *random walk bridge* doesn’t depend on the mean of the increments. This is related to the fact that a linear transformation of a Gaussian is Gaussian, and can be seen by examining the joint density function directly.

Conditioning on T=\infty

When \mu>0, the event \{T=\infty\} occurs with positive probability, so it is well-defined to condition on it. When \mu\le 0, this is not the case, and so we have to be more careful.

First, an observation. Just for clarity, let’s take \mu<0, and condition on \{T>N\}, and look at the distribution of S_{\epsilon N}, where \epsilon>0 is small. This is approximately given by

\frac{S_{\epsilon N}}{\sqrt{N}}\stackrel{d}{\approx}W^+(\epsilon).

Now take \epsilon\rightarrow 0 and consider the RHS. If instead of the Brownian excursion W^+ we had Brownian motion, we could specify the distribution exactly. But in fact, we can construct Brownian excursion as the solution to an SDE:

\mathrm{d}W^+(t) = \left[\frac{1}{W^+(t)} - \frac{W^+(t)}{1-t}\right] \mathrm{d}t + \mathrm{d}B(t),\quad t\in(0,1) (**)

for B a standard Brownian motion. I might return in the next post to why this is valid. For now, note that the first drift term pushes the excursion away from zero, while the second term brings it back to zero as t\rightarrow 1.

From this, the second drift term is essentially negligible if we care about scaling W^+(\epsilon) as \epsilon\rightarrow 0, and we can say that W^+(\epsilon)=\Theta(\sqrt{\epsilon}).

So, returning to the random walk, we have

\frac{S_{\epsilon N}}{\sqrt{\epsilon N}}\stackrel{d}{\approx} \frac{W^+(\epsilon)}{\sqrt{\epsilon}} = \Theta(1).

At a heuristic level, it’s tempting to try ‘taking N\rightarrow\infty while fixing \epsilon N‘, to conclude that there is a well-defined scaling limit for the RW conditioned to stay positive forever. But we came up with this estimate by taking N\rightarrow\infty and then \epsilon\rightarrow 0 in that order. So while the heuristic might be convincing, this is not the outline of a valid argument in any way. However, the SDE representation of W^+ in the \epsilon\rightarrow 0 regime is useful. If we drop the second drift term in (**), we define the three-dimensional Bessel process, which (again, possibly the subject of a new post) is the correct scaling limit we should be aiming for.

Finally, it’s worth observing that the limit \{T=\infty\}=\lim_{N\rightarrow\infty} \{T>N\} is a monotone limit, and so further tools are available. In particular, if we know that the trajectories of the random walk satisfy the FKG property, then we can define this limit directly. It feels intuitively clear that random walks should satisfy the FKG inequality (in the sense that if a RW is large somewhere, it’s more likely to be large somewhere else). You can do a covariance calculation easily, but a standard way to show the FKG inequality applies is by verifying the FKG lattice condition, and unless I’m missing something, this is clear (though a bit annoying to check) when the increments are Gaussian, but not in general. Even so, defining this monotone limit does not tell you that it is non-degenerate (ie almost-surely finite), for which some separate estimates would be required.

A final remark: in a recent post, I talked about the Skorohod embedding, as a way to construct any centered random walk where the increments have finite variance as a stopped Brownian motion. One approach to conditioning a random walk to lie above some discrete function is to condition the corresponding Brownian motion to lie above some continuous extension of that function. This is a slightly stronger conditioning, and so any approach of this kind must quantify how much stronger. In Section 4 of [BL], the authors do this for the random walk associated with the DGFF conditioned to lie above a polylogarithmic curve.

References

[BD] – Bertoin, Doney – 1994 – On conditioning a random walk to stay nonnegative

[BL] – Biskup, Louidor – 2016 – Full extremal process, cluster law and freezing for two-dimensional discrete Gaussian free field

[DIM] – Durrett, Iglehart, Miller – 1977 – Weak convergence to Brownian meander and Brownian excursion

[Ig1] – Iglehart – 1974 – Functional central limit theorems for random walks conditioned to stay positive

[Ig2] – Iglehart – 1974 – Random walks with negative drift conditioned to stay positive

[Ka] – Kaigh – 1976 – An invariance principle for random walk conditioned by a late return to zero

[KS] – Kesten, Stigum – 1966 – A limit theorem for multidimensional Galton-Watson processes

[Sk] – Skorohod – 1955 – Limit theorems for stochastic processes with independent increments

Skorohod embedding

Background

Suppose we are given a standard Brownian motion (B_t), and a stopping time T. Then, so long as T satisfies one of the regularity conditions under which the Optional Stopping Theorem applies, we know that \mathbb{E}[B_T]=0. (See here for a less formal introduction to OST.) Furthermore, since B_t^2-t is a martingale, \mathbb{E}[B_T^2]=\mathbb{E}[T], so if the latter is finite, so is the former.

Now, using the strong Markov property of Brownian motion, we can come up with a sequence of stopping times 0=T_0, T_1, T_2,\ldots such that the increments T_k-T_{k-1} are IID with the same distribution as T. Then 0,B_{T_1},B_{T_2},\ldots is a centered random walk. By taking T to be the hitting time of \{-1,+1\}, it is easy to see that we can embed simple random walk in a Brownian motion using this approach.

[Figure: Embedding simple random walk in Brownian motion.]
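For anyone who prefers to see this construction numerically, here is a minimal sketch. The Brownian path is crudely discretised, so the stopping times (and hence the embedded walk) are only approximate; none of the names below are standard.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 1e-4
increments = rng.normal(scale=np.sqrt(dt), size=500_000)
B = np.concatenate([[0.0], np.cumsum(increments)])    # discretised Brownian path

# Repeatedly stop at the (approximate) first exit of +/-1 around the previous stop.
walk, level, i = [0.0], 0.0, 0
while True:
    hit = np.flatnonzero(np.abs(B[i:] - level) >= 1.0)
    if len(hit) == 0:
        break                                          # ran out of simulated path
    i += hit[0]
    level += np.sign(B[i] - level)                     # embedded walk moves by +/-1
    walk.append(level)

print(walk[:20])    # reads like a simple random walk path
```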

The Skorohod embedding question asks: can all centered random walks be constructed in this fashion, by stopping Brownian motion at a sequence of stopping times? With the strong Markov property, this immediately reduces to the question of whether all centered finite-variance distributions X can be expressed as B_T for some integrable stopping time T.

The answer to this question is yes, and much of what follows is drawn from, or at least prompted by, Obloj’s survey paper, which details the problem and the rich history of the many approaches to its solution over the past seventy years.

Applications and related things

The relationship between random walks and Brownian motion is a rich one. Donsker’s invariance principle asserts that Brownian motion appears as the scaling limit of a random walk. Indeed, one can construct Brownian motion itself as the limit of a sequence of consistent random walks with normal increments on an increasingly dense set of times. Furthermore, random walks are martingales, and we know that continuous, local martingales can be expressed as a (stochastically) time-changed Brownian motion, from the Dubins-Schwarz theorem.

The Skorohod embedding theorem can be used to prove results about random walks with general distribution by proving the corresponding result for Brownian motion, and checking that the construction of the sequence of stopping times has the right properties to allow the result to be carried back to the original setting. It obviously also gives a coupling between an individual random walk and a Brownian motion which may be useful in some contexts, as well as a coupling between any pair of random walks. This is useful in proving results for random walks which are much easier for special cases of the distribution, for example when the increments are Gaussian, or when there are combinatorial approaches to a problem about simple random walk. At the moment no aspect of this blog schedule is guaranteed, but I plan to talk about the law of the iterated logarithm shortly, whose proof is approachable in both of these settings, as well as for Brownian motion, and Skorohod embedding provides the route to the general proof.

At the end, we will briefly compare some other ways to couple a random walk and a Brownian motion.

Adding extra randomness

One thing we could do is sample a copy of X independently from the Brownian motion, then declare T= \tau_{X}:= \inf\{t\ge 0: B_t=X\}, the hitting time of (random value) X. But recall that unfortunately \tau_x has infinite expectation for all non-zero x, so this doesn’t fit the conditions required to use OST.

Skorohod’s original method is described in Section 3.1 of Obloj’s notes linked above. The method is roughly to pair up positive values taken by X appropriately with negative values taken by X in a clever way. If we have a positive value b and a negative value a, then \tau_{a,b}, the first hitting time of \mathbb{R}\backslash (a,b) is integrable. Then we choose one of these positive-negative pairs according to the projection of the distribution of X onto the pairings, and let T be the hitting time of this pair of values. The probability of hitting b conditional on hitting {a,b} is easy to compute (it’s \frac{-a}{b-a}) so we need to have chosen our pairs so that the ‘probability’ of hitting b (ie the density) comes out right. In particular, this method has to start from continuous distributions X, and treat atoms in the distribution of X separately.

The case where the distribution X is symmetric (that is X\stackrel{d}=-X) is particularly clear, as then the pairs should be (-x,x).
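In the symmetric case, the extra-randomness construction is easy to simulate: sample a radius from the distribution of |X|, independently of the Brownian motion, and stop at the first exit of the symmetric interval. A hedged sketch (crude time-discretisation and a horizon cut-off, so only approximate):

```python
import numpy as np

rng = np.random.default_rng(2)

def embed_symmetric(radius, dt=1e-2, horizon=200.0):
    """Approximate B_T, where T is the hitting time of {-radius, +radius}."""
    b, t = 0.0, 0.0
    while abs(b) < radius and t < horizon:
        b += rng.normal(scale=np.sqrt(dt))
        t += dt
    return np.sign(b) * radius     # by symmetry, +/-radius with probability 1/2 each

# Target law: X symmetric with |X| uniform on {1, 2, 3}.
samples = [embed_symmetric(rng.choice([1.0, 2.0, 3.0])) for _ in range(2000)]
values, counts = np.unique(samples, return_counts=True)
print(dict(zip(values, counts)))   # roughly uniform over {-3,-2,-1,1,2,3}
```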

However, it feels like there is enough randomness in Brownian motion already, and subsequent authors showed that indeed it wasn’t necessary to introduce extra randomness to provide a solution.

One might ask whether it’s possible to generate the distribution on the set of pairs (as above) out of the Brownian motion itself, but independently from all the hitting times. It feels like it might be possible to make the distribution on the pairs measurable with respect to

\mathcal{F}_{0+} = \bigcap\limits_{t>0} \mathcal{F}_t,

the sigma-algebra of events determined by limiting behaviour as t\rightarrow 0 (which is independent of hitting times). But of course, unfortunately \mathcal{F}_{0+} has a zero-one law, so it’s not possible to embed non-trivial distributions there.

Dubins solution

The exemplar for solutions without extra randomness is due to Dubins, shortly after Skorohod’s original argument. The idea is to express the distribution X as the almost sure limit of a martingale. We first use the hitting time of a pair of points to ‘decide’ whether we will end up positive or negative, and then given this information look at the hitting time (after this first time) of two subsequent points to ‘decide’ which of four intervals of the real line we end up in.

I’m going to use different notation to Obloj, corresponding more closely with how I ended up thinking about this method. We let

a_+:= \mathbb{E}[X \,|\, X>0], \quad a_- := \mathbb{E}[X\,|\, X<0], (*)

and take T_1 = \tau_{\{a_-,a_+\}}. We need to check that

\mathbb{P}\left( B_{T_1}=a_+\right) = \mathbb{P}\left(X>0\right),

for this to have a chance of working. But we know that

\mathbb{P}\left( B_{T_1}=a_+\right) = \frac{-a_-}{a_+-a_-},

and we can attack the other side using (*), the fact that \mathbb{E}[X]=0, and the law of total expectation (assuming for simplicity that \mathbb{P}(X=0)=0):

0=\mathbb{E}[X]=\mathbb{E}[X\,|\, X>0] \mathbb{P}(X>0) + \mathbb{E}[X\,|\,X<0]\mathbb{P}(X<0) = a_+ \mathbb{P}(X>0) + a_- \left(1-\mathbb{P}(X>0) \right),

\Rightarrow\quad \mathbb{P}(X>0)=\frac{-a_-}{a_+-a_-}.

Now we define

a_{++}=\mathbb{E}[X \,|\, X>a_+],\quad a_{+-}=\mathbb{E}[X\,|\, 0<X<a_+],

and similarly a_{-+},a_{--}. So then, conditional on B_{T_1}=a_+, we take

T_2:= \inf\left\{t\ge T_1\,:\, B_t\not\in (a_{+-},a_{++})  \right\},

and similarly conditional on B_{T_1}=a_-. By an identical argument to the one we have just deployed, we have \mathbb{E}\left[B_{T_2} \,|\,\mathcal{F}_{T_1} \right] = B_{T_1} almost surely. So, although the a_{+-+} notation now starts to get very unwieldy, it’s clear we can keep going in this way to get a sequence of stopping times 0=T_0,T_1,T_2,\ldots where B_{T_n} determines which of the 2^n regions of the real line any limit \lim_{m\rightarrow\infty} B_{T_m} should lie in.

A bit of work is required to check that the almost sure limit T_n\rightarrow T is almost surely finite, but once we have this, it is clear that B_{T_n}\rightarrow B_T almost surely, and B_T has the distribution required.
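To see the scheme in action for a finitely-supported centered law, here is a sketch. Rather than simulating the Brownian path, it uses the exact exit probabilities of an interval (as computed above), and it lumps any atom sitting exactly at a conditional mean into the lower part, so the handling of atoms differs cosmetically from the prose. All names are my own.

```python
import numpy as np

rng = np.random.default_rng(3)

def dubins_sample(values, probs, depth=50):
    """Sample from the discrete law (values, probs) via Dubins' scheme:
    repeatedly split the current conditional law at its mean, and move to the
    conditional mean of the upper or lower part with Brownian exit probabilities."""
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    current = float(np.dot(values, probs))      # start from the overall mean (= 0)
    mask = np.ones(len(values), dtype=bool)     # support of the current conditional law
    for _ in range(depth):
        if mask.sum() <= 1:
            break
        upper, lower = mask & (values > current), mask & (values <= current)
        if not upper.any() or not lower.any():
            break
        a_plus = np.dot(values[upper], probs[upper]) / probs[upper].sum()
        a_minus = np.dot(values[lower], probs[lower]) / probs[lower].sum()
        # BM from `current` exits (a_minus, a_plus) at the top with this probability:
        if rng.random() < (current - a_minus) / (a_plus - a_minus):
            current, mask = a_plus, upper
        else:
            current, mask = a_minus, lower
    return current

vals, ps = [-2.0, 1.0, 3.0], [0.5, 0.25, 0.25]   # a centered three-point law
samples = np.array([dubins_sample(vals, ps) for _ in range(20_000)])
print({v: round(float(np.mean(np.isclose(samples, v))), 3) for v in vals})  # approx 0.5, 0.25, 0.25
```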

Komlos, Major, Tusnady coupling

We want to know how close we can make this coupling between a centered random walk with variance 1, and a standard Brownian motion. Here, ‘close’ means uniformly close in probability. For large n, the typical difference between the stopping time T_n in the Skorohod embedding and its expectation (recall \mathbb{E}[T_n]=n) is of order \sqrt{n}. So, constructing the random walk S_0,S_1,S_2,\ldots from the Brownian motion via Skorohod embedding leads to

\left |S_k - B_k \right| = \Theta(n^{1/4}),

for most values of k\le n. Strassen (1966) shows that the true scale of the maximum

\max_{k\le n} \left| S_k - B_k \right|

is slightly larger than this, with some extra powers of \log n and \log\log n as one would expect.

The Komlos-Major-Tusnady coupling is a way to do a lot better than this, in the setting where the distribution of the increments has a finite MGF near 0. Then, there exists a coupling of the random walk and the Brownian motion such that

\max_{k\le n}\left|S_k- B_k\right| = O(\log n).

That is, there exists C such that

\left[\max_{k\le n} \left |S_k-B_k\right| - C\log n\right] \vee 0

is a tight family of distributions, indeed with uniform exponential tail. To avoid digressing infinitely far from my original plan to discuss the proof of the law of iterated logarithm for general distributions, I’ll stop here. I found it hard to find much coverage of the KMT result apart from the challenging original paper, and many versions expressed in the language of empirical processes, which are similar to random walks in many ways relevant to convergence and this coupling, but not for Skorohod embedding. So, here is a link to some slides from a talk by Chatterjee which I found helpful in getting a sense of the history, and some of the modern approaches to this type of normal approximation problem.

DGFF 2 – Boundary conditions and Gibbs-Markov property

In the previous post, we defined the Discrete Gaussian Free Field, and offered some motivation via the discrete random walk bridge. In particular, when the increments of the random walk are chosen to be Gaussian, many natural calculations are straightforward, since Gaussian processes are well-behaved under conditioning and under linear transformations.

Non-zero boundary conditions

In the definition of the DGFF given last time, we demanded that h\equiv 0 on \partial D. But the model is perfectly well-defined under more general boundary conditions.

It’s helpful to recall again the situation with random walk and Brownian bridge. If we want a Brownian motion which passes through (0,0) and (1,s), we could repeat one construction for Brownian bridge, by taking a standard Brownian motion and conditioning (modulo probability zero technicalities) on passing through level s at time 1. But alternatively, we could set

B^{\mathrm{drift-br}}(t) = B(t)+ t(s-B(1)),\quad t\in[0,1],

or equivalently

B^{\mathrm{drift-br}}(t)=B^{\mathrm{br}}(t)+ st, \quad t\in[0,1].

That is, a Brownian bridge with drift can be obtained from a centered Brownian bridge by a linear transformation, and so certainly remains a Gaussian process. And exactly the same holds for a discrete Gaussian bridge: if we want non-zero values at the endpoints, we can obtain this distribution by taking the standard centred bridge and applying a linear transformation.

We can see how this works directly at the level of density functions. If we take 0=Z_0,Z_1,\ldots,Z_{N-1},Z_N=0 a centred Gaussian bridge, then the density of Z=\mathbf{z}\in \mathbb{R}^{N+1} is proportional to

\mathbf{1}\{z_0=z_N=0\}\exp\left( -\frac12 \sum_{i=1}^N (z_i-z_{i-1})^2 \right). (3)

So rewriting z_i= y_i- ki (where we might want k=s/N to fit the previous example), the sum within the exponent rearranges as

-\frac12 \sum_{i=1}^N (y_i-y_{i-1} - k)^2 = -\frac12 \left[ \sum_{i=1}^N (y_i-y_{i-1})^2 - 2k(y_N-y_0)+ Nk^2 \right].

So when the values at the endpoints z_0,z_N,y_0,y_N are fixed, this middle term is a constant, as is the final term, and thus the density of the linearly transformed bridge has exactly the same form as the original one.
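In code, the same point: the centred discrete Gaussian bridge can be built from a Gaussian walk by the usual linear projection, and the bridge with endpoint s is then just a linear shift. A small sketch in my own notation:

```python
import numpy as np

rng = np.random.default_rng(4)

def gaussian_bridge(N, s, rng):
    """Discrete Gaussian bridge from 0 to s in N steps.

    Z_i = W_i - (i/N) W_N is the centred bridge (the walk conditioned on W_N = 0,
    by a standard Gaussian projection), and adding the linear function (i/N) s
    gives the bridge with endpoints 0 and s, as in the density calculation above."""
    W = np.concatenate([[0.0], np.cumsum(rng.normal(size=N))])
    i = np.arange(N + 1)
    Z = W - (i / N) * W[-1]        # centred bridge: Z_0 = Z_N = 0
    return Z + (i / N) * s

path = gaussian_bridge(20, 3.0, rng)
print(path[0], path[-1])           # exactly 0.0 and 3.0
```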

In two or more dimensions, the analogue of adding a linear function is to add a harmonic function. First, some notation. Let \varphi be any function on \partial D. Then there is a unique harmonic extension of \varphi, for which \nabla \varphi=0 everywhere on D, the interior of the domain. Recall that \nabla is the discrete graph Laplacian defined up to a constant by

(\nabla \varphi) _x = \sum\limits_{x\sim y} \left( \varphi_x - \varphi_y \right).

If we want h^D instead to have boundary values \varphi, it’s enough to replace h^D with h^D+\varphi. Then, in the density for the DGFF ( (1) in the previous post), the term in the exponential becomes (ignoring the \frac{1}{4d} )

-\sum\limits_{x\sim y} \left[ (h^D_x-h^D_y)^2 + (\varphi_x-\varphi_y)^2 +2(h^D_x - h^D_y)(\varphi_x-\varphi_y)\right].

For each x\in D, on taking this sum over its neighbours y\in \bar D, the final term vanishes (since \varphi is harmonic), while the second term is just a constant. So the density of the transformed field, which we’ll call h^{D,\varphi} is proportional to (after removing the constant arising from the second term above)

\mathbf{1}\left\{h^{D,\varphi}_x = \varphi_x,\, x\in\partial D\right\} \exp\left( -\frac{1}{4d} \sum\limits_{x\sim y} \left( h^{D,\varphi}_x - h^{D,\varphi}_y \right)^2 \right).

So h^{D,\varphi}:= h^D + \varphi satisfies the conditions for the DGFF on D with non-zero boundary conditions \varphi.

Harmonic functions and RW – a quick review

Like the covariances in DGFF, harmonic functions on D are related to simple random walk on D stopped on \partial D. (I’m not claiming a direct connection right now.) We can define the harmonic extension \varphi to an interior point x by taking \mathbb{P}_x to be the law of SRW x=Z_0,Z_1,Z_2,\ldots started from x, and then setting

\varphi(x):= \mathbb{E}_x\left[ \varphi_{Z_{\tau_{\partial D}}} \right],

where \tau_{\partial D} is the first time that the random walk hits the boundary.
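This characterisation translates directly into a Monte Carlo sketch for the harmonic extension: run SRWs from x until they hit the boundary, and average the boundary values at the exit points. A minimal example (the function names and the toy domain are my own):

```python
import numpy as np

rng = np.random.default_rng(5)

def harmonic_extension_mc(x, boundary_value, in_interior, n_walks=2000):
    """Estimate the harmonic extension at interior point x of Z^2 by averaging
    the boundary value phi at the exit point of simple random walk from x."""
    total = 0.0
    for _ in range(n_walks):
        z = list(x)
        while in_interior(tuple(z)):
            step = rng.integers(4)                  # pick one of the 4 neighbours
            z[step // 2] += 1 if step % 2 else -1
        total += boundary_value(tuple(z))
    return total / n_walks

# Toy domain: interior of a 10x10 box; phi(y) = first coordinate on the boundary.
in_box = lambda y: 0 < y[0] < 10 and 0 < y[1] < 10
phi = lambda y: float(y[0])
print(harmonic_extension_mc((3, 7), phi, in_box))   # close to 3.0: phi is already harmonic
```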

Inverse temperature – a quick remark

In the original definition of the density of the DGFF, there is the option to add a constant \beta>0 within the exponential term so the density is proportional to

\exp\left(-\beta \sum\limits_{x\sim y} (h_x-h_y)^2 \right).

With zero boundary conditions, the effect of this is straightforward, as varying \beta just rescales the values taken by the field. But with non-zero boundary conditions, the effect is instead to vary the magnitude of the fluctuations of the values of the field around the (unique) harmonic function on the domain with those BCs. In particular, when \beta\rightarrow \infty, the field is ‘reluctant to be far from harmonic’, and so h^D \Rightarrow \varphi.

This parameter \beta is called inverse temperature. So low temperature corresponds to high \beta, and high stability, which fits some physical intuition.

A Markov property

For a discrete (Gaussian) random walk, the Markov property says that conditional on a given value at a given time, the trajectory of the process before this time is independent of the trajectory afterwards. The discrete Gaussian bridge is similar. Suppose we have as before 0=Z_0,Z_1,\ldots, Z_N=0 a centred Gaussian bridge, and condition that Z_k=y, for k\in\{1,\ldots,N-1\}, and y\in\mathbb{R}. With this conditioning, the density (3) splits as a product

\mathbf{1}\{z_0=z_N=0, z_k=y\}\exp\left(-\frac12 \sum\limits_{i=1}^N (z_i-z_{i-1})^2 \right) =

\mathbf{1}\{z_0=0,z_k=y\} \exp\left(-\frac12 \sum\limits_{i=1}^k (z_i-z_{i-1})^2 \right) \cdot \mathbf{1}\{z_k=y,z_N=0\} \exp\left(-\frac12 \sum\limits_{i=k+1}^N (z_i-z_{i-1})^2 \right).

Therefore, with this conditioning, the discrete Gaussian bridge splits into a pair of independent discrete Gaussian bridges with drift. (The same would hold if the original process had drift too.)

The situation for the DGFF is similar, though rather than focusing on the condition, it makes sense to start by focusing on the sub-domain of interest. Let A\subset D, and take B=\bar D\backslash A. So in particular \partial A\subset B.

Then we have that conditional on h^D\big|_{\partial A}, the restricted fields h^D\big|_{B\backslash \partial A} and h^D\big|_A are independent. Furthermore, h^D\big|_A has the distribution of the DGFF on A, with boundary condition given by h^D\big|_{\partial A}. As in the discrete bridge, this follows just by splitting the density. Every gradient term corresponds to an edge in the underlying graph that lies either entirely inside \bar A or entirely inside B. This holds for a general class of Gibbs models where the Hamiltonian depends only on the sum of some function of the heights (taken to be constant in this ‘free’ model) and the sum of some function of their nearest-neighbour gradients.

One additional and useful interpretation is that if we only care about the field on the restricted region A, the dependence of h^D\big|_A on h^D\big|_{D\backslash A} comes only through h^D\big|_{\partial A}. But more than that, it comes only through the (random) harmonic function which extends the (random) values taken on the boundary of A to the whole of A. So, if h^A is an independent DGFF on A with zero boundary conditions, we can construct the DGFF h^D from its value on D\backslash A via

h^D_x \stackrel{d}= h^A_x + \varphi^{h^D\big|_{\partial A}}_x,

where \varphi^{h^D\big|_{\partial A}} is the unique harmonic extension of the (random) values taken by h^D on \partial A to \bar A.

This Markov property is crucial to much of the analysis to come. There are several choices of the restricted domain which come up repeatedly. In the next post we’ll look at how much one can deduce by taking A to be the even vertices in D (recalling that every integer lattice \mathbb{Z}^d is bipartite), and then taking A to be a finer sublattice within D. We’ll use this to get some good bounds on the probability that the DGFF is positive on the whole of D. Perhaps later we’ll look at a ring decomposition of \mathbb{Z}^d consisting of annuli spreading out from a fixed origin. Then the distribution of the field at this origin can be considered, via the final idea discussed above, as the limit of an infinite sequence of random harmonic functions given by the values taken by the field at increasingly large radius from the origin. Defining the DGFF on the whole lattice depends on the existence or otherwise of this local limit.

Parking on a ring, linear hashing

I’ve spent most of my doctorate trying to analyse how adding destructive dynamics affects the behaviour of a particular random growth process, the classical random graph. In this post I’m going to talk about another random growth process, which is slightly less natural, but for which one can show some similar qualitative properties.

The model, and the additive coalescent

Consider m places arranged in a circle, and for consistency of analogy we think of these as parking spaces. Some number n of cars will arrive one at a time. Each car will arrive at a space chosen uniformly at random. If it is empty they will park in it, otherwise they will look clockwise until they find an empty space, and park there. For now we are only interested in growth, so we assume cars never leave. We are interested in the sizes of blocks of consecutively parked cars.

The reason to consider this slightly unnatural statement is its equivalence to the problem of hashing with linear probing, apparently a key topic in computer science, which I won’t pretend that I know anything about. In any case, it’s a nice model, and it seems reasonable that it would have a basis in more realistic search algorithms.

So, how does the sequence of sizes of blocks of consecutively parked cars grow? Well, given the sequence of block sizes, it is reasonably easy to convince yourself that the order of the blocks around the circle is uniformly random, and the number of empty spaces between adjacent blocks is also uniformly random.

Assume for now that there are at least three blocks. A block of size x can merge with a block of size y with the arrival of the next car only if the blocks are adjacent, with exactly one empty space between them. The chance of this is uniform among all pairs of blocks. Now suppose this is the case, and that the block of size y lies clockwise from the block of size x. Then they will merge precisely if the next car arrives at any of the x occupied spaces in that block, or at the empty space between the pair of blocks. This has probability \frac{x+1}{m}. There’s also the opposite ordering to consider, where the block of size x lies clockwise from the other. The total probability of this merge \{x,y\}\mapsto \{x+y+1\} is therefore proportional to (x+y+2).

So the process of block sizes looks a bit like the additive coalescent, at least for large blocks. This is in contrast to the random graph process, where the sequence of component sizes behaves exactly like a multiplicative coalescent, where blocks merge at a rate proportional to the product of their sizes.
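Here is a quick simulation of the parking process itself, just to see the block structure described above (my own minimal sketch, nothing to do with the references):

```python
import numpy as np

rng = np.random.default_rng(6)

def park(m, n, rng):
    """Park n < m cars on a ring of m spaces with clockwise probing;
    return the sizes of blocks of consecutively occupied spaces, largest first."""
    occupied = np.zeros(m, dtype=bool)
    for _ in range(n):
        s = rng.integers(m)
        while occupied[s]:
            s = (s + 1) % m                    # probe clockwise for a free space
        occupied[s] = True
    blocks, run = [], 0
    start = int(np.argmin(occupied))           # start the scan at an empty space
    for k in range(m):
        if occupied[(start + k) % m]:
            run += 1
        elif run:
            blocks.append(run)
            run = 0
    if run:
        blocks.append(run)
    return sorted(blocks, reverse=True)

print(park(1000, 800, rng)[:5])                # the largest few blocks when beta = 0.2
```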

Asymptotics

As in the random graph process, it’s interesting to ask roughly how large the largest block will be in such a configuration. Pittel [3] considers the case where the number of empty places \ell = m-n \approx \beta m, for some \beta\in (0,1).

A less interesting model would be to choose the positions of the n cars uniformly at random. But then the size of a block is roughly geometric with parameter \beta, and there are \Theta(m) blocks with high probability. Relatively straightforward calculations in extreme value theory suggest that the largest block is likely to have size on the order of \log m in this setting.

Of course, the actual model is slightly more complicated, because the size of a block is self-reinforcing, since larger blocks are more likely to grow than smaller blocks. However, we can still get somewhere with naïve estimates. Let’s label the places clockwise. Then in order for there to be a block starting at 0 and stretching beyond \alpha \log m, a necessary condition is that at least \alpha \log m cars arrive at those places. The number of cars which arrive at those places is binomial, since there are n cars, and each arrives at a place chosen uniformly, and independently of the other cars. So this event corresponds to

\mathrm{Bin}(n,\frac{\alpha \log m}{m}) \ge \alpha \log m.

Then, since n\approx (1-\beta)m, this event corresponds approximately to

\mathrm{Po}((1-\beta)\alpha \log m) \ge \alpha \log m.

The probability that a Poisson RV is at least a constant multiple larger than its mean decays exponentially with the mean, hence in this case the probability is asymptotically some negative power of m, depending on the value of \alpha. But there are O(m) possible places for such a block to start, so whether we can apply a union bound usefully or not depends on whether the power of m is strictly less than -1.
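To pin down that power of m, here is the Chernoff-type computation (purely heuristic, and certainly not a substitute for Pittel’s argument): writing \lambda=(1-\beta)\alpha\log m and a=\frac{1}{1-\beta}, we have

\mathbb{P}\left( \mathrm{Po}(\lambda) \ge a\lambda \right) \approx \exp\left( -\lambda\left( a\log a - a + 1 \right) \right) = m^{-\alpha\left( -\beta - \log(1-\beta)\right) + o(1)},

so the union bound over the O(m) possible starting places is useful precisely when \alpha\left( -\beta-\log(1-\beta)\right)>1, which suggests that the largest block has size of order \log m, with a constant depending on \beta.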

Since all of this depends on \alpha, it is reasonable that everything is fine, and the largest block does have size at least \alpha \log m when \alpha is small, and very unlikely when \alpha is large. This heuristic argument fits with Pittel’s theorem. Indeed, his result shows much stronger concentration: that the fluctuations of the size of the largest block are O(1).

Critical regime and empirical processes

The following is a paraphrase of the introduction and some methods from [2].

Obviously, once n=m cars have arrived, there’s no room for manoeuvre and definitely all the places are taken in one giant block. But it’s not obvious in general what scaling for the number of gaps will give rise to giant blocks of \Theta(m) cars.

As for the random graph, we can find a process similar to the exploration process of a (random) graph which encodes much of the information we care about. Let Y_k be the number of cars which arrive at place k. So the sum of the Y_ks will be n, the total number of cars. Now consider the process

C_0=0, \ldots, C_{k+1}=C_k + Y_{k+1}-1.

A block has the property that the number of arrivals within that set of places is equal to the number of places. So every time this *empirical process* C drops below its previous running minimum, this indicates the end of a block. To make this equivalence precise, we need to be a bit careful about where we start counting. It works exactly if we start at the beginning of a block. If not, it might introduce some unwanted divisions within the first block.
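As a sanity check, the correspondence between new minima of C and block boundaries is easy to see in simulation (a minimal sketch; recall the caveat about where we start counting):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 30, 24
arrivals = rng.integers(m, size=n)            # the place chosen by each car
Y = np.bincount(arrivals, minlength=m)        # Y_k = number of cars arriving at place k
C = np.concatenate([[0], np.cumsum(Y - 1)])   # C_0 = 0, C_{k+1} = C_k + Y_{k+1} - 1

running_min = np.minimum.accumulate(C)
# Places where C drops strictly below its previous running minimum mark block ends,
# exactly so if place 0 happens to start a block (otherwise the first block is split).
block_ends = np.flatnonzero(C[1:] < running_min[:-1]) + 1
print("C:", C)
print("block boundaries after places:", block_ends)
```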

What we have is a process that looks roughly like a random walk that is constrained to pass through the point (m,n-m), which is equal to (m,-l). Even if we aren’t totally precise about how this is like a random walk, we would expect to see Brownian fluctuations after rescaling. Indeed, we might expect to see a Brownian bridge added to a deterministic linear function with negative gradient. But this is only meaningful if the random part is at least as large as the deterministic part, and since the fluctuations have order \sqrt{m}, if l is much larger than this, the rescaled empirical process is essentially deterministic, so we won’t see any macroscopic excursions above the minimum.

If l is substantially smaller than \sqrt{m}, then there is no real difference between (m,-l) and (m,0), and what we see is just a Brownian bridge. At this point, where we choose to start the process is actually important. If we were to start it at the minimum of the Brownian bridge instead, we would have seen a Brownian excursion, which corresponds to one block occupying (almost) all of the places.

Unsurprisingly, the story is completed by considering \ell=\Theta(\sqrt{m}), where the rescaled empirical process looks like a slanted Brownian bridge, that is Brownian motion conditioned to pass through (1,-\frac{\ell}{\sqrt{m}}). There isn’t an obvious fix to the question of where to start the process, but it turns out that the correct way is now adding a Brownian excursion onto the deterministic linear function with gradient - \frac{\ell}{\sqrt{m}}. It’s now reasonable that the excursions above the minimum should be macroscopic.

This scaling limit works dynamically as well, where the same Brownian excursion is used for different gradients of the deterministic line, corresponding to \ell moving through the critical window m-\Theta(\sqrt{m}). Finally, a pointer to Bertoin’s recent paper [1] for the model with an additional destructive property. Analogous to the forest fire, blocks of cars are removed at a rate proportional to their size (as a result, naturally, of ‘Molotov cocktails’…). Similar effects of self-organised criticality are seen when the rate of bombs is scaled appropriately.

References

[1] – Bertoin – Burning cars in a parking lot (paper / slides)

[2] – Chassaing + Louchard – Phase transition for parking blocks, Brownian excursion and coalescence (arXiv)

[3] – Pittel – Linear probing: the probable largest search time grows logarithmically with the number of records

Fair games and the martingale strategy III

Gambler’s Ruin

Continuing directly from the previous post, the nicest example of the optional stopping theorem we developed there is to examine a simple random walk constrained between two values, say 0 and N. This represents an idealised gambling situation, where the gambler stops playing either when they reach some pre-agreed profit, or when they go bankrupt. We assume that we start at level k, for k = 1,2,…,N-1.

Naturally, we want to know the probabilities of winning (ie getting to N) and losing (ie going bankrupt). We could set this up by conditioning on the first step. Let p_k be the probability of winning starting from level k. Then we must have

p_k= \frac12 p_{k+1}+\frac12 p_{k-1},\quad k=1,\ldots,N-1, (*)

with the obvious boundary conditions p_0=0, p_N=1. In an ideal world, we just know how to solve second order difference equations like (*). Well, actually it isn’t too hard, because we can see from (*) directly that

p_{k+1}-p_k = p_k-p_{k-1},

and so p_k is a linear function of k, and so p_k = k/N follows pretty much immediately.

But, we can also use OST profitably. Let T be the time at which we first hit 0 or N. It’s intuitively clear that this should have finite expectation, since the problems you might encounter with just the hitting time of a single level shouldn’t apply. Or you can consider the expected number of steps before you see N ups or downs in a row, which certainly provides an upper bound on T. This random number of steps is sort of geometric (at least, can be upper bounded by a geometric RV) and so has finite expectation. So we can apply OST to the walk (X_n) at time T, and we have

\mathbb{E}[X_T] = N\cdot \mathbb{P}(X_T=N) + 0 \cdot \mathbb{P}(X_T=0) = \mathbb{E}[X_0]=k,

from which we also derive p_k=k/N.
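If you want to see both answers agree without any algebra, a throwaway Monte Carlo check (nothing clever, just simulation):

```python
import numpy as np

rng = np.random.default_rng(8)

def win_probability(k, N, trials=10_000):
    """Estimate the probability that simple random walk from k reaches N before 0."""
    wins = 0
    for _ in range(trials):
        x = k
        while 0 < x < N:
            x += 1 if rng.random() < 0.5 else -1
        wins += (x == N)
    return wins / trials

N = 10
print([round(win_probability(k, N), 3) for k in range(1, N)])   # close to k/N
```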

The reason we talk about gambler’s ruin is by considering the limit N\rightarrow\infty with k fixed. After a moment’s thought, it’s clear we can’t really talk about stopping the process when we hit infinity, since that won’t happen at any finite time. But we can ask what’s the probability that we eventually hit zero. Then, if we imagine a barrier at level N, the probability that we hit 0 at some point is bounded below by the probability that we hit 0 before we hit level N (given that we know we hit either zero or level N with probability one), and this is \frac{N-k}{N}, and by choosing N large enough, we can make this as close to 1 as we want. So the only consistent option is that the probability of hitting 0 at some point is one. Hence gambler’s ruin. With probability one, ruin will occur. There’s probably a moral lesson hiding there not especially subtly.

A problem about pricing options

So the deal here seems to be that if you just care about your average, it doesn’t matter how to choose to play a sequence of fair games. But what if you care about something other than your average? In any real setting, we maybe care about slightly more than this. Suppose I offer you a bet on a coin toss: you get £3 if it comes up heads, and I get £1 if it comes up tails. Sounds like a good bet, since on average you gain a pound. But what about if you get £10,003 if it comes up heads and I get £10,001 if it comes up tails? I’m guessing you’re probably not quite so keen now.

But if you were an international bank, you might have fewer reservations about the second option. My intention is not to discuss whether our valuation of money is linear here, but merely to offer motivation for the financial option I’m about to propose. The point is that we are generally risk-averse (well, most of us, most of the time) and so we are scared of possible large losses, even when there is the possibility of large profits to balance it out.

Let’s assume we have our simple random walk, and for definiteness let’s say it starts at £1. Suppose (eg as a very niche birthday present) we have the following opportunity: at any point between now and time t=5, we have the right to buy one unit of the stock for £2.

We want to work out how much this opportunity, which from now on I’m going to call an option, is worth on average. Note that now it does seem that when we choose to cash in the option will have an effect on our return, and so we will have to include this in the analysis.

Note that, once we’ve bought a unit of the stock, we have an asset which is following a simple random walk (ie sequential fair games) and so from this point on its expected value remains unchanged. So in terms of expectation, we might as well sell the stock at the same moment we buy it. So if we cash in the option when the stock is currently worth £X, we will on average have a return of £(X-2). This means that we’ll only ever consider exercising our option if the current value of the stock is greater than £2. This narrows down our strategy slightly.

This sort of option minimises the risk of a large loss, since the worst thing that happens is that you never choose to exercise your option. So if you actually paid for the right to have this option, that cost is the largest amount you can lose. In the trading world, this type of opportunity is called an American option.

The trick here is to work backwards in time, thinking about strategies. If at time t=4, the stock is worth £1, then the best that can happen is that it’s worth £2 at time t=5, and this still gains you no wealth overall. Similarly if it’s worth £0 at time t=3. So we’ve identified a region where, if the stock value enters this region, we might as well rip up our contract, because we definitely aren’t going to gain anything. Remember now that we’ve also said you won’t ever cash in if the stock’s value is at most £2, because you don’t gain anything on average.

Now suppose that the stock has value £3 at time t=4. There’s no danger of it ever getting back below £2 during the lifetime of the option, so from now on your potential return is following the trajectory of a simple random walk, ie a fair game. So on average, it makes no difference whether you cash in now, or wait until t=5, or some combination of the two. The same argument holds if the stock has value £4 at time t=3 or time t=4, and so we can identify a region where you might as well cash in.

[Figure: American Option 1]

What about the final region? If the stock value is greater than £2, but not yet in the definitely-cash-in area, what should you do? Well, if you think about it, the value of the stock is a fair game. But your return should be better than that, because the stock price doesn’t take account of the fact that you wouldn’t buy in (and make a loss overall) if the value drops below £2. So at this stage, your future options are better than playing a fair game, and so it doesn’t make sense (in terms of maximising your *average*) to cash in.

Now we can actually work backwards in time to establish how much any starting value is worth under this optimal strategy. We can fill in the values in the ‘doomed’ area (ie all zeros) and on the ‘cash in now’ area (ie current value minus 2), and construct backwards using the fact that we have a random walk.

[Figure: American Option 2]

The final answer ends up being 7/16 if the stock had value £1 at time 0. Note that the main point here is that working out the qualitative form of the strategy was the non-trivial part. Once we’d done that, everything was fairly straightforward. I claim that this was a reasonably fun adjustment to the original problem, but have minimal idea whether pricing options is in general an interesting thing to do.
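For completeness, the backward induction is a few lines of code. This is just a sketch of the recursion described above (take the better of exercising now and the average of the two continuation values); it reproduces the 7/16 quoted above.

```python
def american_option_value(start=1, strike=2, horizon=5):
    """Value of the right to buy at `strike` at any time up to `horizon`,
    when the stock follows simple random walk started at `start`."""
    # Option values at the final time: exercise if profitable, else worthless.
    values = {x: max(x - strike, 0) for x in range(start - horizon, start + horizon + 1)}
    # Work backwards: exercise now, or continue and average over the two moves.
    for t in range(horizon - 1, -1, -1):
        values = {x: max(x - strike, 0.5 * (values[x + 1] + values[x - 1]))
                  for x in range(start - t, start + t + 1)}
    return values[start]

print(american_option_value())    # 0.4375 = 7/16
```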

Anyway, I hope that provided an interesting overview to some of the topics of interest within the question of how to choose strategies for games based on random processes.

Fair games and the martingale strategy I

I went back to my school a couple of weeks ago and gave a talk. I felt I’d given various incarnations of a talk on card-shuffling too many times, so it was time for a new topic. The following post (and time allowing, one or two more) is pretty much what I said.

The Martingale Strategy

Suppose we bet repeatedly on the outcome of tossing a fair coin. Since it’s November, my heart is set on buying an ice cream that costs £1, so my aim is to win this amount from our game. My strategy is this:

First, I bet £1. If I win, then that’s great, because I now have made exactly enough profit to buy the ice cream. If I lose, then I play again, and this time I bet £2. Again, if I win, then my total profit is £2-£1 = £1, so I stop playing and buy the ice cream. If I lose, then I play a third time, again doubling my stake. So if I win for the first time on the seventh go, my overall profit will be

£64 – (£1+£2+£4+£8+£16+£32) = £1,

and it’s clear that this can be continued and I will eventually win a round, and at this point my total profit will be £1. So I will always eventually be able to buy my ice cream.

But, there’s nothing special about the value £1, so I could replace the words ‘ice cream’ with ‘private tropical island’, so why am I still here in the UK on a wet Monday when I could be on my beach lounger?

There are some fairly obvious reasons why the strategy I’ve described is not actually a fail-safe way to make a profit. For a start, although with probability one a head will come up eventually, there is a small positive chance that the first 200 tosses will all be tails. At this point, I would have accrued a debt of roughly 2^{200} pounds, which is vastly more than the number of atoms making up the Earth. All this for an ice cream?

So there are major problems carrying out this strategy in a finite world. And of course, it’s no good if we stop after a very large but finite number of turns, because then there’s always this very small chance that we’ve made a very large loss, which is bad, partly because we can’t have the ice cream, but also because it exactly cancels out the chance of making our £1 profit, and so our overall average profit is exactly zero.

Though I’ve set this up in an intentionally glib fashion, as so often is the case, we might have stumbled across an interesting mathematical idea. That is, if we play a fair game a finite number of times, we have a fair game overall, meaning our overall average profit is zero. But if we are allowed to play a potentially infinite number of times, then it’s not clear how to define our overall ‘average’ profit, since we feel it ought to be zero, as an extension of the finite case, but also might be positive, because it ends up being £1 with probability one.

It’s tempting at this stage to start writing statements like

1 \times 1 + (-\infty) \times 0=0 ,

to justify why this might have come about, where we consider the infinitely unlikely event that is infinitely costly. But this is only convincing at the most superficial level, and so it makes more sense to think a bit more carefully about under exactly what circumstances we can extend our observation about the overall fairness of a finite sequence of individual fair games.

A second example

The previous example was based upon a series of coin tosses, and we can use exactly the same source of randomness to produce a simple random walk. This is a process that goes up or down by 1 in each time step, where each option happens with probability ½, independently of the history.

We could avoid the requirement to deal with very large bets by always staking £1, and then cashing in the first time we have a profit of £1. Then, if we start the random walk at zero, it models our profit, and we stop the first time it gets to 1. It’s not obvious whether we hit 1 with probability one. Let’s show this.

In order to hit some positive value k, the random walk must pass through 1, 2, and so on, up to (k-1) and then finally k. So \mathbb{P}(\text{hit }k) = [\mathbb{P}(\text{hit }1)]^k. And similarly for negative values. Also, the probability that we return to zero is the same as the probability that we ever hit 1, since after one time-step they are literally the same problem (after symmetry). So, if the probability of hitting 1 is p<1, then the number of visits to zero is geometric (supported on 1,2,3,…) with parameter 1-p, and so

\mathbb{E}[\text{visits to }k] = \mathbb{E}[\text{visits to zero}] \times \mathbb{P}(\text{hit }k)=\frac{1}{1-p} \times p^{|k|}.

Thus, when we sum over all values of k, we are summing a pair of geometric series with ratio p<1, and so we get a finite answer. But if the expected number of visits to anywhere (ie the sum across all places) is finite, this is clearly ridiculous, since we are running the process for an infinite time, and at each time-step we must be somewhere! So we must in fact have p=1, and we have another potential counter-example to the claim that a sequence of fair games must be fair overall.

We might have exactly the same set of practical objections, such as this method requiring arbitrarily large liquidity (even though it doesn’t grow exponentially fast so doesn’t seem so bad).

What will actually turn out to be useful is that although the bets are now small, the average time until we hit 1 is actually infinite. Remember that, even though most things we see in real life don’t have this property, it is completely possible for a random variable to take finite values yet have infinite expectation.

Notes on the Martingale Strategy

There’s no reason why the originally proposed strategy had to be based upon fair coin tosses. This strategy might work in a more general setting, where the chance of winning on a given turn is not ½, or is not even constant. So long as at each stage you bet exactly enough that, if you win, you recoup all your losses so far, and one extra pound, this has the same overall effect.

Of course, we need to check that we do eventually win a round, which is not guaranteed if the probability of winning (conditional on not having yet won) decays sufficiently fast. If we let p_k be the probability of winning on turn k, given that we haven’t previously won, then we require that the probability of never winning \prod_{k\ge 1}(1-p_k)=0. By taking logs and taking care of the approximations, it can be seen that the divergence or otherwise of \sum p_k determines which way this falls.
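Concretely, the log computation is just

\prod_{k\ge 1}(1-p_k) = \exp\left( \sum_{k\ge 1} \log(1-p_k) \right),

which is zero precisely when \sum_k -\log(1-p_k)=\infty; and since p_k\le -\log(1-p_k)\le 2p_k whenever p_k\le \frac12, for p_k\rightarrow 0 this happens precisely when \sum_k p_k=\infty.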

In the next post, we’ll talk about how the two problems encountered here, namely allowing large increments, and considering a stopping time with infinite expectation are exactly the two cases where something can go wrong. We’ll also talk about a slightly different setting, where the choice of when to stop playing becomes a bit more dynamic and complicated.