The first round of the British Mathematical Olympiad (BMO1) takes place tomorrow. Last year I wrote a brief note to my mentoring students about the exam. Most of the advice is fairly obvious, but I guess it never does any harm to be reminded. In particular, while it is tempting to give lots of mathematical guidance, under exam pressure good deductive ideas either will or won’t come, and there’s relatively little to be done about it in advance to help. However, especially for students for whom this is their first experience of a long olympiad style paper, there are a few practical and general points to be made, so you have the best chance of turning good ideas into good solutions during the time allowed.

DON’T waste time. 3.5 hours is a long time, but it will pass quickly when you have lots to think about. Obviously, you will inevitably spend some time just thinking vaguely about the problems, or even daydreaming, just to give your brain a break. Don’t worry about that, but do try not to waste time pursuing methods which don’t look like they are working. If you have made 6 algebraic substitutions and the expression now takes up an entire line, ask yourself whether you’re going anywhere. If your geometrical diagram now has dozens of extra points, or if you are trying to solve a polynomial in n variables where n is large, question yourself. Maybe you’re missing something more obvious?

On the subject of time, DO flit between questions. The rubric says that full solutions are better than partial solutions. However, if moving to another question allows you to take a fresh stab at the first one in 15 minutes or whatever, that is a good thing.

Also, DO take food or drink (within reason and so long as whoever is invigilating doesn’t mind), if you think it will help. 3.5 hours of concentration can be draining! The 200g value pack of Dairy Milk was my preference back in the day…

On a more mathematical note, DON’T draw rubbish geometrical diagrams. DO use a compass and a ruler. These geometry problems normally want you to spot similar triangles or something like that. These will be much much easier to find if they actually look similar on your diagram! Markers also like seeing good diagrams.

DO write up relevant bits of your rough. It’s a good way to grab small marks, and you never know, you might have had all the right ideas, just missed the final crucial step. It sometimes says not to hand in rough: so make sure what you hand in looks vaguely neat, and has key steps or results you’ve proved underlined or in a box, so that they are as visible as possible to the marker. Checking small cases explicitly will be useful to your understanding of the problem, and so may gain credit.

DON’T wait until the end to write up bits of your rough. The temptation to keep working on them will be too strong, and you might have forgotten what seemed interesting an hour ago. Crucially, deciding carefully what the most important steps of your working are may very well help you to finish the problem.

DO read the question properly. Trying to prove something false will waste your time; trying to prove something simpler than the actual question will cost you marks. Things to consider include:

• If the question says ‘If and only if’, you have to prove it both ways. Similarly if it asks for a converse.
• Check what the domains are. Does n have to be an integer or is it a real number? Can it be zero?
• In a counting question, does order matter?
• Is the triangle allowed to be obtuse? Does this change anything important in the argument?

DON’T waffle. If you are writing a massive load of text, have a think about whether that’s a good idea. It is very easy, especially for fiddly combinatorics questions, for a simple equation to turn into a sprawling essay. Keeping sentences very short (no long subordinate clauses) and leaving space between displayed maths and words will help. Remember that whether or not you know what you are doing, you want to GIVE THE IMPRESSION that you know what you are doing!

DO be clever. Sometimes the questions are hard but routine, sometimes they require clever ideas. If your current method isn’t making any progress and you have a crazy idea, try it – it might be just the thing.

However, DON’T be too clever. It’s very tempting, especially to new mentoring students, to try to use every bit of theory you’ve recently learned. Remember that not every geometry question requires the Angle Bisector Theorem, and you don’t always need to deploy Fermat’s Little Theorem or even modular arithmetic on every problem about integers. In particular, avoid applying anything you don’t properly understand – under the pressure of an exam, it’s easy to forget the details, and end up assuming something that is false!

DO relax. I know that is easier said than done, but this is an academically stressful time of life, so enjoy the fact that this is a rare exam where doing well is not of huge importance to the rest of your life. I haven’t seen this year’s paper, but the questions are normally interesting, and should bring out the best in a strong young mathematician. As with many things, if you stop worrying about the outcome, you often do better than you might expect.

Best of luck to everyone sitting the exam tomorrow!

Gaussian tail bounds and a word of caution about CLT

The first part is more of an aside. In a variety of contexts, whether for testing Large Deviations Principles or calculating expectations by integrating over the tail, it is useful to know good approximations to the tail of various distributions. In particular, the exact form of the tail of a standard normal distribution is not particularly tractable. The following upper bound is therefore often extremely useful, especially because it is fairly tight, as we will see.

Let $Z\sim N(0,1)$ be a standard normal RV. We are interested in the tail probability $R(x)=\mathbb{P}(Z\geq x)$. The density function of a normal RV decays very rapidly, as the exponential of a quadratic function of x. This means we might expect that conditional on $\{Z\geq x\}$, with high probability Z is in fact quite close to x. This concentration of measure property would suggest that the tail probability decays at a rate comparable to the density function itself. In fact, we can show that:

$\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.$

The proof is in fact relatively straightforward:

$\mathbb{P}(Z>x)=\frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-u^2/2}du< \frac{1}{\sqrt{2\pi}}\int_x^\infty \frac{u}{x}e^{-u^2/2}du=\frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.$

Just by comparing derivatives, we can also show that this bound is fairly tight. In particular:

$\frac{1}{\sqrt{2\pi}}\frac{x}{x^2+1}e^{-x^2/2}<\mathbb{P}(Z>x)< \frac{1}{\sqrt{2\pi}}\frac{1}{x}e^{-x^2/2}.$
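Both bounds are easy to sanity-check numerically, using the exact identity $R(x)=\frac12\mathrm{erfc}(x/\sqrt{2})$. A minimal Python sketch (the helper names are mine):

```python
import math

def normal_tail(x):
    # Exact tail P(Z > x) via the complementary error function
    return 0.5 * math.erfc(x / math.sqrt(2))

def upper(x):
    # (1/sqrt(2*pi)) * (1/x) * exp(-x^2/2)
    return math.exp(-x * x / 2) / (math.sqrt(2 * math.pi) * x)

def lower(x):
    # (1/sqrt(2*pi)) * (x/(x^2+1)) * exp(-x^2/2)
    return x * math.exp(-x * x / 2) / (math.sqrt(2 * math.pi) * (x * x + 1))

for x in [0.5, 1, 2, 4, 8]:
    assert lower(x) < normal_tail(x) < upper(x)
```

The ratio of the upper to the lower bound is $1+1/x^2$, so the two bounds squeeze the true tail ever more tightly as $x$ grows.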

———

Now for the second part about CLT. The following question is why I started thinking about various interpretations of CLT in the previous post. Suppose we are trying to prove the Strong Law of Large Numbers for a random variable with 0 mean and unit variance. Suppose we try to use an argument via Borel-Cantelli:

$\mathbb{P}(\frac{S_n}{n}>\epsilon) = \mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n})\stackrel{\text{CLT}}{\approx}\mathbb{P}(Z>\epsilon\sqrt{n}).$

Now we can use our favourite estimate on the tail of a normal distribution.

$\mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\epsilon\sqrt{n}\sqrt{2\pi}}e^{-\epsilon^2 n/2}$

$\Rightarrow \sum_n \mathbb{P}(Z>\epsilon\sqrt{n})\leq \frac{1}{\epsilon\sqrt{2\pi}}\sum_n (e^{-\epsilon^2/2})^n=\frac{1}{\epsilon\sqrt{2\pi}(e^{\epsilon^2/2}-1)}<\infty.$

By Borel-Cantelli, we conclude that with probability 1, eventually $\frac{S_n}{n}<\epsilon$. This holds for all $\epsilon>0$, and a symmetric argument handles the negative case. We therefore obtain the Strong Law of Large Numbers.

The question is: was that application of CLT valid? It certainly looks ok, but I claim not. The main problem is that the deviations under discussion fall outside the remit of the theorem: CLT gives a limiting expression only for deviations on the $\sqrt{n}$ scale.

Let’s explain this another way, taking $\epsilon=10^{-2}$. CLT says that as n becomes very large

$\mathbb{P}(\frac{S_n}{\sqrt{n}}>1000)\approx \mathbb{P}(Z>1000).$

But we don’t know how large n has to be before this approximation is vaguely accurate. If in fact it only becomes accurate for $n=10^{12}$, then it is not relevant for estimating

$\mathbb{P}(\frac{S_n}{\sqrt{n}}>\epsilon\sqrt{n}).$

This looks like an artificial example, but the key is that this problem becomes worse as n grows (or as we increase the number which currently reads as 1000), and certainly is invalid in the limit. [I find the original explanation, in terms of the scale of deviations treated by CLT, more manageable, but hopefully this further clarifies.]

One solution might be to find some sort of uniform convergence criterion for CLT, ie a (hopefully rapidly decreasing) function $f(n)$ such that

$\sup_{x\in\mathbb{R}}|\mathbb{P}(\frac{S_n}{\sqrt{n}}>x)-\mathbb{P}(Z>x)|\leq f(n).$

This is possible, as given by the Berry-Esseen theorem, but even the most careful refinements in the special case where the third moment is bounded fail to give better bounds than

$f(n)\sim \frac{1}{\sqrt{n}}.$
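To get a feel for this rate, note that when the $X_i$ are fair $\pm 1$ coin flips, $S_n$ is a shifted binomial, so the supremum above can be computed exactly. A throwaway Python sketch (function name my own); quadrupling $n$ should roughly halve the distance:

```python
import math

def kolmogorov_distance(n):
    # sup_x |P(S_n/sqrt(n) <= x) - Phi(x)| for S_n a sum of n fair +-1 steps.
    # S_n = 2*B - n with B ~ Binomial(n, 1/2), so atoms sit at (2k - n)/sqrt(n).
    Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    total = 2 ** n
    cdf, best = 0.0, 0.0
    for k in range(n + 1):
        x = (2 * k - n) / math.sqrt(n)
        p = math.comb(n, k) / total
        # compare Phi just below and at each atom (the CDF jumps here)
        best = max(best, abs(cdf - Phi(x)), abs(cdf + p - Phi(x)))
        cdf += p
    return best

d100, d400 = kolmogorov_distance(100), kolmogorov_distance(400)
assert d400 < d100                # the distance shrinks with n
assert 1.7 < d100 / d400 < 2.3    # and roughly like 1/sqrt(n)
```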

Adding this error term will certainly destroy any hope we had of the sum being finite. Of course, part of the problem is that the supremum in the above definition is certainly not going to be attained at any point under discussion in these post-$\sqrt{n}$ deviations. We really want to take a supremum over larger-than-usual deviations if this is to work out.

By this stage, however, I hope it is clear what the cautionary note is, even if the argument could potentially be patched. CLT is a theorem about standard deviations. Separate principles are required to deal with the case of large deviations. This feels like a strangely ominous note on which to end, but I don’t think there’s much more to say. Do comment below if you think there’s a quick fix to the argument for SLLN presented above.

Modular arithmetic – Beyond the Definitions

Modular arithmetic is a relatively simple idea to define. The natural motivation is to consider a clock. The display of a standard analogue clock makes no distinction between 4am, 4pm, and 4pm next Thursday. This is a direct visualisation of the integers modulo 12. Instead of counting in the usual way, where each successive integer is different to all those considered previously, here, every time we get to a multiple of 12, we reset our count back to zero. As a result, this procedure is often referred to as ‘clock arithmetic’.

A common problem for good students, for example those starting the UKMT’s Senior Mentoring Scheme, is that the idea of modular arithmetic seems very simple, but it’s hard to work out how it might be useful in application to problems. My claim is that the language of modular arithmetic is often the best way to discuss divisibility properties in problems about whole numbers. In particular, the behaviour of powers (ie $m^n$ and so forth) is nice in this context, and the notation of modular arithmetic is the only sensible way to approach it. Anyway, let’s begin with a quick review of the definitions.

Definitions

We are interested in divisibility by some fixed integer $n\geq 2$, and the remainders given after we divide by n. Given an integer M, we can write it as:

$M=kn+a,\quad\text{ where }a=0,1,\ldots,n-1,$

and this decomposition is unique. We then say that M is congruent to a modulo n. Note that working modulo n means that we are interested in remainders after division by n (rather than by a or k or M or anything else). This has the feeling of a function or algorithm applied to M. We get told what M is, then work out the remainder after division by n, and say that this is ‘M mod n’.

This is fine, but it is very much worth bearing in mind a slightly different interpretation. Working modulo n is a way of saying that we aren’t interested in the exact value of an integer, only where it lies on the n-clock. In particular, this means we are viewing lots of integers as the same. The ‘sameness’ is actually more important in lots of arguments than the position on the n-clock.

More formally, we say that $a\equiv b$ or a is congruent to b modulo n, if they have the same remainder after division by n. Another way of writing this is that

$a\equiv b\quad \iff \quad n|a-b.$

Sets of integers which are equivalent are called congruence classes. For example $\{\ldots,-4,-1,2,5,8,\ldots\}$ is a congruence class modulo 3. Note that under the first definition, we consider all elements here to be congruent to 2, but in a particular question it may be more useful to consider elements congruent to -1, or some combination.

These definitions are equivalent, but it can be more useful to apply this second definition for proving things, rather than writing out $b=kn+a$ or whatever all the time.
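As a quick mechanical check that the two definitions agree (a throwaway Python sketch with hypothetical helper names):

```python
def same_remainder(a, b, n):
    # First definition: equal remainders after division by n
    return a % n == b % n

def divides_difference(a, b, n):
    # Second definition: n | a - b
    return (a - b) % n == 0

# The two definitions agree on a grid of test values
for a in range(-20, 20):
    for b in range(-20, 20):
        for n in [2, 3, 12]:
            assert same_remainder(a, b, n) == divides_difference(a, b, n)

# e.g. the congruence class of 2 modulo 3 from above:
assert all(same_remainder(x, 2, 3) for x in [-4, -1, 2, 5, 8])
```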

Everything has been set up in terms of addition, so it is easy to see that addition works well on congruence classes. That is:

$\text{If }a\equiv b,c\equiv d,\quad\text{then }a+c\equiv b+d.$

We could argue via a clock argument, but the second definition works very well here:

$\text{We have }n|a-b,n|c-d,\quad\text{and so }n|(a+c)-(b+d),\text{ exactly as required.}$

We would like a similar result for multiplication, and indeed this holds as well:

$\text{If }a\equiv b,c\equiv d,\quad\text{then }n|c(b-a)\text{ and }n|b(c-d).$

$\Rightarrow n|ac-bd,\text{ that is }ac\equiv bd.$

Let’s just recap what this means. If we want to know what $157\times 32$ is modulo 9, it suffices to say that $157\equiv 4$ and $32\equiv 5$, and so $157\times 32\equiv 4\times 5\equiv 2$. In a more abstract setting, hopefully the following makes sense:

$\text{ODD}\times\text{ODD}=\text{ODD};\quad \text{EVEN}\times\text{EVEN}=\text{EVEN};\quad \text{ODD}\times\text{EVEN}=\text{EVEN}.$

This is exactly the statement of the result in the case n=2, where the congruence classes are the odd integers and the even integers.
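Both rules, and the worked example above, are easy to verify mechanically; a small Python sketch:

```python
# Addition and multiplication both descend to congruence classes:
# replacing a number by anything congruent to it does not change the answer mod n.
n = 9
for a in range(n):
    for c in range(n):
        for k in [-2, 0, 5]:
            for m in [-1, 3]:
                b, d = a + k * n, c + m * n   # b ≡ a and d ≡ c (mod n)
                assert (a + c) % n == (b + d) % n
                assert (a * c) % n == (b * d) % n

# The worked example: 157 ≡ 4 and 32 ≡ 5 (mod 9), so 157 * 32 ≡ 4 * 5 ≡ 2.
assert 157 % 9 == 4 and 32 % 9 == 5
assert (157 * 32) % 9 == (4 * 5) % 9 == 2
```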

Powers

What happens if we try to extend to powers? Is it the case that

$\text{if }a\equiv b,c\equiv d,\quad\text{then }a^c\equiv b^d?$

DBEs and stationary distributions

Aside

The most recent Applied Probability assignment sheet featured various aspects of Detailed Balance Equations for continuous-time Markov chains. We discussed the advantages and disadvantages of using DBEs rather than solving for an equilibrium distribution directly. The equations used in this second case are often called Full Balance Equations.

Briefly, the advantages of DBEs are that they are easy to solve. After all, each one contains only two components of the equilibrium distribution, so generally you can solve one-at-a-time. The disadvantage is that an equilibrium distribution might not satisfy the DBEs. The deductive structure is:

$\text{Solves DBEs}\quad \stackrel{\Rightarrow}{\not\Leftarrow}\quad\text{Equilibrium distribution}$

Usually, the chain will be irreducible, so the equilibrium distribution is unique. This means that if we can solve the DBEs, the result is the unique equilibrium distribution.

The DBEs are soluble precisely when the chain is reversible. Solubility of the DBEs is probably the best working definition of reversibility to use in practice, but informally it means that the behaviour looks qualitatively the same if we reverse time. For example, as in Q1:

$Q=\begin{pmatrix}-1 &1&0\\ 0& -1&1\\1&0&-1\end{pmatrix},$

gives a Q-matrix whose equilibrium distribution $(\frac13,\frac13,\frac13)$ does not satisfy the DBEs. The chain is not reversible because sample paths always go clockwise, so if we reversed time they would go anti-clockwise (or vice-versa depending on how you’ve drawn the diagram).
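This failure is easy to verify mechanically; a small Python sketch of the cyclic example, with states relabelled 0, 1, 2:

```python
# Off-diagonal jump rates of the 3-state cyclic chain from Q1.
q = {(0, 1): 1, (1, 2): 1, (2, 0): 1}
pi = [1/3, 1/3, 1/3]

# pi is invariant: for each state j, inflow equals outflow (full balance).
for j in range(3):
    inflow = sum(pi[i] * r for (i, k), r in q.items() if k == j)
    outflow = pi[j] * sum(r for (i, k), r in q.items() if i == j)
    assert abs(inflow - outflow) < 1e-12

# ...but detailed balance fails: pi_i q_ij != pi_j q_ji on some edge.
def rate(i, j): return q.get((i, j), 0)
assert any(abs(pi[i] * rate(i, j) - pi[j] * rate(j, i)) > 1e-12
           for i in range(3) for j in range(3))
```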

What I wanted to say in the class, and made a mess of explaining, was the following, about why it is inappropriate to use DBEs to find stationary distributions in Q3d):

Reversibility is not just a function of the chain. It is a function of the chain AND the initial distribution. This is only a concern in practice when the chain is reducible, but in that case it really can lead you astray. Let’s consider an example, like

$Q=\begin{pmatrix}-3&2&0&0&1&0\\ 0&-4&3&1&0&0\\ 0&1&-4&3&0&0\\ 0&3&1&-4&0&0\\ 0&0&0&0&-5&5\\ 0&0&0&0&5&-5\end{pmatrix}.$

Then by solving as in the problem sheet, the invariant distributions are given by:

$\lambda(0,\frac13,\frac13,\frac13,0,0)+\mu(0,0,0,0,\frac12,\frac12),\quad \lambda+\mu=1.$

If you attempted to solve the DBEs, you would succeed, but the only solution would be

$(0,0,0,0,\frac12,\frac12).$

The explanation is fairly simple in the end. Reversibility is a class property, and only one of the communicating classes, $\{5,6\}$ in this example, admits a reversible initial distribution, so to solve the DBEs we must assign zero mass to the other class.
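To make this concrete, here is a quick Python check of the example above, with states relabelled 0 to 5:

```python
# The reducible Q-matrix from the example above.
Q = [[-3, 2, 0, 0, 1, 0],
     [0, -4, 3, 1, 0, 0],
     [0, 1, -4, 3, 0, 0],
     [0, 3, 1, -4, 0, 0],
     [0, 0, 0, 0, -5, 5],
     [0, 0, 0, 0, 5, -5]]

def is_invariant(pi):
    # Full balance: pi Q = 0
    return all(abs(sum(pi[i] * Q[i][j] for i in range(6))) < 1e-12
               for j in range(6))

def satisfies_dbes(pi):
    # Detailed balance: pi_i q_ij = pi_j q_ji for every pair
    return all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < 1e-12
               for i in range(6) for j in range(6) if i != j)

a = [0, 1/3, 1/3, 1/3, 0, 0]
b = [0, 0, 0, 0, 1/2, 1/2]
assert is_invariant(a) and is_invariant(b)
assert satisfies_dbes(b)       # mass only on the reversible class {5,6}
assert not satisfies_dbes(a)   # the other class is not reversible
```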

Anyway, I hope that clears up any residual confusion from the class.

CLT and Stable Distributions

One of the questions I posed at the end of the previous post about the Central Limit Theorem was this: what is special about the normal distribution?

More precisely, for a large class of variables (those with finite variance) the limit in distribution of $S_n$ after a natural rescaling is distributed as N(0,1). As a starting point for investigating similar results for a more general class of underlying distributions, it is worth considering what properties we might require of a distribution if it is to appear as a limit in distribution of sums of IID RVs, rescaled if necessary.

The property required is that the distribution is stable. In the rest of the post I am going to give an informal precis of the content of the relevant chapter of Feller.

Throughout, we assume a collection of IID RVs, $X,X_1,X_2,\ldots$, with the partial sums $S_n:=X_1+\ldots+X_n$. Then we say $X$ is stable in the broad sense if

$S_n\stackrel{d}{=}c_nX+\gamma_n,$

for some deterministic parameters $c_n,\gamma_n$ for every n. If in fact $\gamma_n=0$ then we say $X$ is stable in the strict sense. I’m not sure if this division into strict and broad is still widely drawn, but anyway. One interpretation might be that a collection of distributions is stable if they form a non-trivial subspace of the vector space of random variables and also form a subgroup under the operation of adding independent RVs. I’m not sure that this is hugely useful either though. One observation is that if $\mathbb{E}X$ exists and is 0, then so are all the $\gamma_n$s.

The key result to be shown is that

$c_n=n^{1/\alpha}$ for some $0<\alpha\leq 2$.
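Before the proof, a sanity check on a familiar example: the standard Cauchy distribution is strictly stable with $\alpha=1$, as is immediate from its characteristic function $\varphi(t)=e^{-|t|}$. A short Python sketch:

```python
import math

# Standard Cauchy characteristic function: phi(t) = exp(-|t|).
def phi(t):
    return math.exp(-abs(t))

# If X_1, ..., X_n are IID Cauchy, S_n has c.f. phi(t)^n = exp(-n|t|) = phi(n*t),
# i.e. S_n is distributed as n * X: strictly stable with c_n = n = n^(1/alpha).
for n in [2, 5, 10]:
    for t in [0.3, 1.0, 2.7]:
        assert abs(phi(t) ** n - phi(n * t)) < 1e-12
```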

Relevant though the observation about means is, a more useful one is this. The stability property is retained if we replace the distribution of $X$ with the distribution of $X_1-X_2$ (independent copies naturally!). The behaviour of $c_n$ is also preserved. Now we can work with an underlying distribution that is symmetric about 0, rather than merely centred. The deduction that $\gamma_n=0$ still holds now, whether or not X has a mean.

Now we proceed with the proof. All equalities are taken to be in distribution unless otherwise specified. By splitting into two smaller sums, we deduce that

$c_{m+n}X=S_{m+n}=c_mX_1+c_nX_2.$

Extending this idea, we have

$c_{kr}X=S_{kr}=S_k^{(1)}+\ldots+S_k^{(r)}=c_kX_1+\ldots+c_kX_r=c_kS_r=c_kc_rX.$

Note that it is not even obvious yet that the $c_n$s are increasing. To get a bit more control, we proceed as follows. Set $v=m+n$, and express

$X=\frac{c_m}{c_v}X_1+\frac{c_n}{c_v}X_2,$

from which we can make the deduction

$\mathbb{P}(X>t)\geq \mathbb{P}(X_1>0,X_2>t\frac{c_v}{c_n})=\frac12\mathbb{P}(X_2>t\frac{c_v}{c_n}).$ (*)

So most importantly, by taking $t\gg 0$ in the above, and using that X is symmetric, we can obtain an upper bound

$\mathbb{P}(X_2>t\frac{c_v}{c_n})\leq \delta<\frac12,$

in fact for any $\delta<\frac12$ if we take $t$ large enough. But since

$\mathbb{P}(X_2>0)=\frac12(1-\mathbb{P}(X_2=0)),$

(which should in most cases be $\frac12$), this implies that $\frac{c_v}{c_n}$ cannot be very close to 0. In other words, $\frac{c_n}{c_v}$ is bounded above. This is in fact enough regularity to deduce that $c_n=n^{1/\alpha}$ from the Cauchy-type functional equation (*).

It remains to check that $\alpha\leq 2$. Note that this equality case $\alpha=2$ corresponds exactly to the $\frac{1}{\sqrt{n}}$ scaling we saw for the normal distribution, in the context of the CLT. This motivates the proof. If $\alpha>2$, we will show that the variance of X is finite, so CLT applies. This gives some control over $c_n$ in an $n\rightarrow\infty$ limit, which is plenty to ensure a contradiction.

To show the variance is finite, we use the definition of stable to check that there is a value of t such that

$\mathbb{P}(S_n>tc_n)<\frac14\,\forall n.$

Now consider the event that the maximum of the $X_i$s is $>tc_n$ and that the sum of the rest is non-negative. This has, by independence, exactly half the probability of the event demanding just that the maximum be bounded below, and furthermore is contained within the event with probability $<\frac14$ shown above. So if we set

$z(n)=n\mathbb{P}(X>tc_n)$

we then have

$\frac14>\mathbb{P}(S_n>tc_n)\geq\frac12\mathbb{P}(\max X_i>tc_n)=\frac12[1-(1-\frac{z}{n})^n]$

$\Rightarrow\quad 1-e^{-z(n)}<\frac12,\quad\text{since }(1-\tfrac{z}{n})^n\leq e^{-z(n)}\text{ for all }n.$

So, $z(n)=n(1-F(tc_n))$ is bounded as $n$ varies. Rescaling suitably, this gives that

$x^\alpha(1-F(x))\leq M<\infty,\quad\text{for all sufficiently large }x.$

This is exactly what we need to control the variance, as:

$\mathbb{E}X^2=\int_0^\infty \mathbb{P}(X^2>t)dt=\int_0^\infty \mathbb{P}(X^2>u^2)2udu$

$=\int_0^\infty 4u\mathbb{P}(X>u)du\leq \int_0^\infty 1\wedge\frac{4M}{u^{\alpha-1}}du<\infty,$

using that X is symmetric and that $\alpha>2$ for the final inequalities. But we know from CLT that if the variance is finite, we must have $\alpha=2$, giving the required contradiction.

All that remains is to mention how stable distributions fit into the context of limits in distribution of RVs. This is little more than a definition.

We say F is in the domain of attraction of a broadly stable distribution R if

$\exists a_n>0,b_n,\quad\text{s.t.}\quad \frac{S_n-b_n}{a_n}\stackrel{d}{\rightarrow}R.$

The role of $b_n$ is not hugely important, as a broadly stable distribution is in the domain of attraction of the corresponding strictly stable distribution.

The natural question to ask is: do the domains of attraction of stable distributions (for $0<\alpha\leq 2$) partition the space of probability distributions, or is some extra condition required?

Next time I will talk about stable distributions in a more analytic context, and in particular how a discussion of their properties is motivated by the construction of Levy processes.

Large Deviations and the CLT

Taking a course on Large Deviations has forced me to think a bit more carefully about what happens when you have large collections of IID random variables. I guess the first thing to think about is: what is a ‘Large Deviation’? In particular, how large or deviant does it have to be?

Of primary interest is the tail of the distribution function of $S_n=X_1+\ldots+X_n$, where the $X_i$ are independent and identically distributed as $X$. As we can always negate everything later if necessary, we typically consider the probability of events of the type:

$\mathbb{P}(S_n\geq \theta(n))$

where $\theta(n)$ is some function which almost certainly increases fairly fast with $n$. More pertinently, if we are looking for some limit which corresponds to an actual random variable, we perhaps want to look at lots of related $\theta(n)$s simultaneously. More concretely, we should fix $\theta$ and consider the probabilities

$\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha).$ (*)

Throughout, we lose no generality by assuming that $\mathbb{E}X=0$. Of course, it is possible that this expectation does not exist, but that is certainly a question for another post!

Now let’s consider the implications of our choice of $\theta(n)$. If this increases with $n$ too slowly, and the likely deviation of $S_n$ is greater than $\theta(n)$, then the event might not be a large deviation at all. In fact, the difference between this event and the event that $S_n$ is above its mean, 0, becomes negligible, and so the probability at (*) might be 1/2 or whatever, regardless of the value of $\alpha$. So the object $\lim \frac{S_n}{\theta(n)}$, whatever that means, certainly cannot be a proper random variable: if we were to have convergence in distribution, the limit RV would consist of point masses at $+\infty$ and $-\infty$.

On the other hand, if $\theta(n)$ increases rapidly with $n$, then the probabilities at (*) might become very small indeed when $\alpha>0$. For example, we might expect:

$\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha)=\begin{cases}0& \alpha>0\\1&\alpha<0.\end{cases}$

and more information to be required when $\alpha=0$. This is what we mean by a large deviation event. Although we always have to define everything concretely in terms of some finite sum $S_n$, we are always thinking about the behaviour in the limit. A large deviation principle exists in an enormous range of cases to show that these probabilities in fact decay exponentially. Again, that is the subject for another post, or indeed the lecture course I’m attending.

Instead, I want to return to the Central Limit Theorem. I first encountered this result in popular science books in a vague “the histogram of boys’ heights looks like a bell” kind of way, then, once a normal random variable had been to some extent defined, it returned in A-level statistics courses in a slightly more fleshed out form. As an undergraduate, you see it in several forms, including as a corollary following from Levy’s convergence theorem.

In all applications though, it is generally used as a method of calculating good approximations. It is not uncommon to see it presented as:

$\mathbb{P}(a\sigma\sqrt{n}+\mu n\leq S_n\leq b\sigma\sqrt{n}+\mu n)\approx \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}dx.$
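As a concrete instance, when the summands are fair $\pm 1$ steps (so $\mu=0,\sigma=1$), the left-hand side can be computed exactly from binomial coefficients and compared with the normal integral; a Python sketch (the choice $n=2500$ is arbitrary):

```python
import math

# S_n = 2*B - n with B ~ Binomial(n, 1/2); count atoms with a*sqrt(n) <= S_n <= b*sqrt(n).
n, a, b = 2500, -1.0, 1.0
lo = math.ceil((a * math.sqrt(n) + n) / 2)    # smallest k with 2k - n >= a*sqrt(n)
hi = math.floor((b * math.sqrt(n) + n) / 2)   # largest k with 2k - n <= b*sqrt(n)
exact = sum(math.comb(n, k) for k in range(lo, hi + 1)) / 2 ** n

# Right-hand side: Phi(b) - Phi(a)
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
approx = Phi(b) - Phi(a)

assert abs(exact - approx) < 0.02
```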

Although in many cases that is the right way to use it, it isn’t the most interesting aspect of the theorem itself. CLT says that the correct scaling of $\theta(n)$, so that the deviation probabilities lie between the two cases outlined above, is the same (that is, $\theta(n)=O(\sqrt{n})$ in some sense) for an enormous class of distributions, and in particular, most distributions that one might encounter in practice (ie finite mean, finite variance). There is even greater universality: the limit distribution at this interface has the same form (some appropriate normal distribution) whenever $X$ is in this class of distributions. I think that goes a long way to explaining why we should care about the theorem. It also immediately prompts several questions:

• What happens for less regular distributions? It is now more clear what the right question to ask in this setting might be. What is the appropriate scaling for $\theta(n)$ in this case, if such a scaling exists? Is there a similar universality property for suitable classes of distributions?
• What is special about the normal distribution? The theorem itself shows us that it appears as a universal scaling limit in distribution, but we might reasonably ask what properties such a distribution should have, as perhaps this will offer a clue to a version of CLT or LLNs for less regular distributions.
• We can see that the Weak Law of Large Numbers follows immediately from CLT. In fact we can say more, perhaps a Slightly Less Weak LLN, that

$\frac{S_n-\mu n}{\sigma \theta(n)}\stackrel{d}{\rightarrow}0$

• whenever $\sqrt{n}\ll\theta(n)$. But of course, we also have a Strong Law of Large Numbers, which asserts that the empirical mean converges almost surely. What is the threshold for almost sure convergence, because there is no a priori reason why it should be $\theta(n)=n$?

To be continued next time.

The Perron-Frobenius Theorem for Stochastic Matrices

This article was prompted by a question asked by one of my students about 5 minutes before the end of the final lecture of my Markov Chains course in Linyi. Although I didn’t have time to think about it right then, the flight home offered plenty of opportunity for having a play around. I am well aware that the proof given below is not the best one, but the ideas of minimising quadrants of a matrix seemed fairly natural. Anyway, it’s been sitting on my desktop for over two months now, so I decided I should put it up.

———

Recall that the transition probabilities of a finite Markov chain are determined by a stochastic matrix P. That is, each row of P is a probability distribution. In both theory and applications, the existence and properties of an invariant distribution is of interest. This is a probability distribution $\pi$ satisfying the relation:

$\pi P=\pi.$ (1)

It is clear that $\bigg(\begin{smallmatrix}1\\ \vdots\\1\end{smallmatrix}\bigg)$ is a right- or column eigenvector of P, with eigenvalue 1. Since the spectrum of $P^T$ is the same as that of P, we conclude that 1 is a left-eigenvalue of P also. So we can be assured of the existence of a vector $\pi$ satisfying (1). What is unclear is that this eigenvector $\pi$ should be a probability distribution. Since at least one entry must be non-zero, it will suffice to show that every entry of $\pi$ is non-negative.

A necessary condition for the uniqueness of an invariant distribution is that the Markov chain be irreducible. This is best defined using the terminology of random walks: the chain is irreducible if for every pair of states $i,j\in I$, it is possible to move from i to j and back again. In terms of the transition matrix, P is irreducible if it is not block upper-triangular, up to reordering rows and columns.

We want to show that when P is irreducible, the (unique) 1-eigenvector is a probability distribution. The standard method proposed in this context is to exhibit the invariant distribution directly. For example, Norris’s Markov Chains defines

$\gamma_i^k=\text{ expected time spent in i between visits to k }=\mathbb{E}_k\sum_{n=0}^{T_k}1_{\{X_n=i\}},$

and shows that $(\gamma_i^k)_{i\in I}$ satisfies (1).

Nonetheless, the result required is clearly at least one step removed from the probabilistic interpretation, so it would be satisfying to find a direct proof of existence. Typically, one quotes the substantially more general theorem of Perron and Frobenius, the most relevant form of which is:

Theorem (Perron-Frobenius): Let A be a non-negative irreducible square matrix. Then there is a positive real eigenvalue $\lambda$ with multiplicity 1, such that all other eigenvalues have absolute value less than or equal to $\lambda$. Furthermore, the (unique up to scaling) left- and right-eigenvectors corresponding to $\lambda$ are positive.

Here we present a short proof of this result in the case where A is the (stochastic) transition matrix of a Markov chain.

Proposition: An irreducible stochastic matrix has a 1-eigenvector with all entries non-negative.

Proof: We show instead the contrapositive: that if a stochastic matrix has a 1-eigenvector with both negative and non-negative components, then it is reducible. The motivation is the third eigenvector given in example (2). Observe that the communicating classes are $\{1,2\}$ and $\{3\}$, and it is not hard to see that for any eigenvector with negative and non-negative components, the sign of a component is a class property.

Informally, given an $n\times n$ stochastic matrix P, and a 1-eigenvector $\pi$ with this property, we relabel the states so that the non-negative components, whose indices we call $A\subset I$, come first. That is, in a moderate abuse of notation:

$\pi=(\underbrace{\pi_A}_{\geq 0}\quad\underbrace{\pi_B}_{<0}).\quad\text{ If we write P as }\begin{pmatrix}P_{AA}&P_{AB}\\P_{BA}&P_{BB}\end{pmatrix},$

our aim is to show that the sub-matrices $P_{AB}$ and $P_{BA}$ are both zero. This implies that states in A and states in B do not communicate, showing that P is reducible. We can formulate this as a linear programming problem:

$\text{Maximise }\sum_{\substack{x\in A,y\in B\\x\in B, y\in A}}p_{xy}\text{ s.t. }\begin{cases}p_{xy}\geq 0&\forall x,y\in I\\p_{x1}+\ldots+p_{xn}=1&\forall x\in I\\\pi_1p_{1y}+\ldots+\pi_np_{ny}=\pi_y&\forall y\in I\end{cases}$

It suffices to show that this maximum is 0. Now, we take $|A|=i$, and assume that $1\leq i\leq n-1$, that is, there are a positive number of negative and non-negative components. Noting that the sum of the rows in a stochastic matrix is 1, we may consider instead:

$\text{Minimise }\sum_{\substack{x,y\in A\\x,y\in B}}p_{xy}\text{ s.t. }\begin{cases}p_{xy}\geq 0&\forall x,y\in I\\p_{x1}+\ldots+p_{xn}=1&\forall x\in I\\\pi_1p_{1y}+\ldots+\pi_np_{ny}=\pi_y&\forall y\in I\end{cases}$

and it will suffice to show that this minimum is n. To do this, we consider instead the dual:

$\text{Maximise }\lambda_1+\ldots+\lambda_n+\pi_1\mu_1+\ldots+\pi_n\mu_n,$

$\text{ s.t. }\lambda_x+\pi_y\mu_x\leq\begin{cases}1&\text{if }x,y\leq i\text{ or }x,y\geq i+1\\ 0&\text{otherwise.}\end{cases}$

The objective is certainly bounded by n. And in fact this is attainable, for example by taking:

$\lambda_1=\ldots=\lambda_i=1,\quad \lambda_{i+1}=\ldots=\lambda_n=0$

$\mu_1=\ldots=\mu_i=0,\quad \mu_{i+1}=-\frac{1}{\pi_{i+1}}, \ldots,\mu_n=-\frac{1}{\pi_n}.$

Applying strong duality for linear programming problems completes the proof.
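As a numerical sanity check of the proposition, power iteration on a small stochastic matrix recovers a non-negative 1-eigenvector; a Python sketch on a hypothetical 3-state example (not related to the matrices above):

```python
# A hypothetical irreducible stochastic matrix (each row a distribution):
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]

# Power iteration: repeatedly apply pi -> pi P. For an irreducible aperiodic
# chain this converges to the unique solution of (1).
pi = [1/3, 1/3, 1/3]
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

assert all(p > 0 for p in pi)          # entries positive, as claimed
assert abs(sum(pi) - 1) < 1e-9         # still a probability distribution
moved = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
assert all(abs(moved[j] - pi[j]) < 1e-9 for j in range(3))   # pi P = pi
# For this particular P the iteration converges to (1/4, 1/2, 1/4).
```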

An obvious remark

Aside

An obvious remark:

If a sequence of independent random variables $X_n$ converges almost surely to some limit $X$, this limit must be a constant (almost surely).

I’ve been thinking about the Central Limit Theorem and related Large Deviations results this afternoon, and wasted almost an hour worrying about situations which were effectively well-disguised special cases of the above.

Why is it true? Well, suppose each $X_n$ is $\mathcal{F}_n$-measurable. By independence, we might as well take $\mathcal{F}_n=\sigma(X_n)$. Then the limit variable $X$ is independent of $\mathcal{F}_n$ for all $n$, and thus independent of $\sigma\big(\bigcup_n \mathcal{F}_n\big)=\mathcal{F}\supset \sigma(X)$. If $X$ is independent of itself, it must be almost surely constant.