Lecture 4 – Hitting time theorem

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

This lecture consisted of revision of the most relevant theory of Galton-Watson trees, with a focus on the case where the offspring distribution is Poisson, since, as we have seen in previous lectures, this is a strong candidate to approximate the structure of G(n,\lambda/n). It makes sense to cover the theory of the trees before attempting to make rigorous the sense of approximation.

Given a Galton-Watson tree T, it is natural to label the vertices in a breadth-first order as \varnothing=v_1,v_2,\ldots,v_{|T|}. This is easiest if we have constructed the Galton-Watson tree as a subset of the infinite Ulam-Harris tree, where vertices have labels like (3,5,17,4), whose parent is (3,5,17). If this child vertex is part of the tree, then so are (3,5,17,1), (3,5,17,2), and (3,5,17,3). This means our breadth-first order is canonically well-defined, as we have a natural ordering of the children of each parent vertex.

Note: one advantage of using breadth-first order rather than depth-first order (which corresponds to the usual dictionary, or lexicographic ordering of the labels) is that if the tree is infinite, we don’t explore all of it during a depth-first search. (In the sense that there exist vertices which are never given a finite label.) For breadth-first search, a similar problem arises precisely when some vertex has infinitely many children. For a conventional Galton-Watson tree, the latter situation is much less of a problem than the infinite total population problem, which happens with positive probability whenever \mu=\mathbb{E}[X]>1.

Anyway, given the depth-first order, one can consider an exploration process S_0,S_1,S_2,\ldots,S_{|T|} given by

S_0=1,\quad S_i=S_{i-1}+(X_i-1),

where X_i is the number of children of v_i. In this way, we see that

S_i=\big| \Gamma(v_1)\cup\ldots\cup\Gamma(v_i)\backslash \{v_1,\ldots,v_i\}\big|,\quad i\ge 1,

records the number of vertices in some stack containing those which we have ‘seen but not explored’. Some authors prefer to start from 0, in which case one ends up with a similar but slightly different interpretation of the ‘stack’, but that’s fine since we aren’t going to define formally what ‘seen’ and ‘explored’ means in this post.

Essentially, we exhaust the vertices of the tree whenever S_t=0, and so the condition that |T|=n requires

S_n=0,\quad S_m\ge 1,\; m=0,1,\ldots,n-1.

Conveniently, so long as we have avoiding ordering ambiguity, for example by insisting that trees live within the Ulam-Harris tree, we can reconstruct T uniquely from (S_0,S_1,\ldots,S_{|T|}).

Furthermore, if T is a Galton-Watson process, then the numbers of children X_i are IID, and so in fact this exploration process is a random walk, and the size of the tree can be recovered as the hitting time of zero.

Note: making fully rigorous the argument that children in the GW tree are independent of the breadth-first walk fully rigorous is somewhat technical, and not to be dismissed lightly, though not of principle interest at the level of this topics course. See Proposition 1.5 in Section 1.2 of Le Gall’s notes or Section 1.2.2 of my doctoral thesis for further discussion and argument.

The hitting time theorem allows us to study the distribution of the hitting time of a random walk whose increments are bounded below by -1, in terms of the distribution of the value of the random walk.

Theorem: Let (S_n,\, n\ge 0) be a random walk with S_0=0 and IID increments (X_n,n\ge 1) satisfying \mathbb{P}(X_n\ge -1)=1. Let H_{-k}=\inf \left\{n\,:\, S_n=-k\right\} be the hitting time of -k.

Then \mathbb{P}\big( H_{-k}=n\big) = \frac{k}{n}\mathbb{P}\big(S_n=-k).

Commentary: there are local central limit theorem estimates and large deviation estimates that allow good control of the probability on the RHS for a rich class of contexts. So at a meta-level, the hitting time theorem allows us to reduce a complicated (though still classical) problem, to a real classical problem, which is particularly helpful when the LHS is a device for capturing relevant information about our random tree model.

Proof context/history: The case k=1, as relevant to the trees, is often referred to as Dwass’s theorem, though the combinatorial argument pre-dates this, sometimes in the form of the similar ballot theoremA complete reference list would be challenging, but it certainly appears in Feller, due to Spitzer. In his recent book, van der Hofstad uses an argument by induction on n, noting that the statement is obvious when n=0,1,…,k-1, and obvious for different reasons when n=k. This argument can also be found published here.

I’m going to give a combinatorial argument, similar to one I’ve given in a post from years ago (back then, for the case k=1), because there wasn’t time to include this in the lecture.

Proof: from the point of view of the course, the most important probabilistic take home message from this argument is that we get significant control by introducing the randomness in two stages, and conditioning on the first round of randomness.

Here, what we do is to condition on the increments (X_1,X_2,\ldots,X_n)_{\mathrm{cyc}} in cyclic order, ie without distinguishing between (x_1,x_2,\ldots,x_n) and (x_4,x_5,\ldots,x_n,x_1,x_2,x_3). We’ll declare which of these is the true ordering at the end. Anyway, the punchline is the following lemma.

Lemma: when x_1+\ldots +x_n=-k, and all x_i\ge -1, then exactly k of the n orderings corresponding to the cyclic ordering (x_1,\ldots,x_n)_{\mathrm{cyc}} have the property that x_1+\ldots+x_\ell \ge -k+1.

With this lemma, it’s clear why the hitting time theorem holds. We condition on (x_1,\ldots,x_n)_{\mathrm{cyc}}. If the sum of these elements is not -k, then there’s no chance that the hitting time is n. If the sum is -k, then when we choose a true ordering at random, we get the hitting time as n with conditional probability k/n.

Proof of lemma: At all times, take indices modulo n. We want labels \alpha satisfying:

x_\alpha,\; x_\alpha+x_{\alpha+1},\; \ldots\;, ;\ x_{\alpha}+\ldots+x_{\alpha+n-2}\text{ are all }\ge -(k-1). (*)

Step 1: Suppose there are \alpha_1<\alpha_2<\ldots<\alpha_{k+1} satisfying (*). It’s helpful to think about going round the circle of values. If we go all the way round the circle, we get -k, but if we go from x_{\alpha_{i+1}}\to x_{\alpha_i-1} then we have an initial segment from x_{\alpha_{i+1}}, and so

x_{\alpha_i}+\ldots+x_{\alpha_{i+1}-1} = -k - (x_{\alpha_{i+1}}+\ldots+x_{\alpha_i-1}) \le -k - \left(-(k-1)\right) = -1.

However, if we repeat this for the k+1 such blocks, we end up going all the way round the circle, and get

\left(x_{\alpha_1}+\ldots+x_{\alpha_2-1}\right) + \left(x_{\alpha_2}+\ldots+x_{\alpha_3-1}\right) + \ldots

\ldots+ \left(x_{\alpha_{k+1}} + \ldots+x_{\alpha_1-1}\right) \le -(k+1),

which is a contradiction since the total sum is k. So we have at most k such values of \alpha.

Step 2: Now we show that there is at least one such \alpha. Let

\beta=\mathrm{argmin}_{1\le m\le n} (x_1+\ldots+x_m),

and if there are multiple such m, let \beta be the minimal such. Now set \alpha=\beta+1, and we will show this satisfies the requirements.

  • Then x_\alpha+x_{\alpha+1}+\ldots+x_{\alpha+m}\ge 0, whenever \alpha+m\le n.
  • And when m<\beta, we have

x_{\alpha}+\ldots+x_n+x_1+\ldots+x_m = -k+(x_1+\ldots+x_m)-(x_1+\ldots+x_\beta).

And by assumption, the first bracket is strictly greater than the second bracket, so

x_\alpha + \ldots+x_n+x_1+\ldots+x_m \ge -(k-1).

Step 3: Now we have an \alpha which satisfies, we may assume WLOG that it is \alpha_0=1. Now we construct the hitting times:

\alpha_1=\min\big(m\,:\, x_1+\ldots+x_m=-1\big),

\alpha_{k-1}=\min\big(m\,:\, x_1+\ldots+x_m=-(k-1)\big).

We claim that all these also satisfy the requirements for \alpha. This is fairly clear from a diagram, or from the fact that when \alpha_i\le m<\alpha_{i+1}, then

x_{\alpha_k}+x_{\alpha_k+1}+\ldots+x_m\ge x_{\alpha_k}+\ldots+x_{\alpha_i}= i-k,

(or i-k+n if i<k).


Lecture 3 – Couplings, comparing distributions

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

In this third lecture, we made our first foray into the scaling regime for G(n,p) which will be the main focus of the course, namely the sparse regime when p=\frac{\lambda}{n}. The goal for today was to give a self-contained proof of the result that in the subcritical setting \lambda<1, there is no giant component, that is, a component supported on a positive proportion of the vertices, with high probability as n\rightarrow\infty.

More formally, we showed that the proportion of vertices contained within the largest component of G(n,\frac{\lambda}{n}) vanishes in probability:

\frac{1}{n} \left| L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right| \stackrel{\mathbb{P}}\longrightarrow 0.

The argument for this result involves an exploration process of a component of the graph. This notion will be developed more formally in future lectures, aiming for good approximation rather than bounding arguments.

But for now, the key observation is that when we ‘explore’ the component of a uniformly chosen vertex v\in[n] outwards from v, at all times the number of ‘children’ of v which haven’t already been considered is ‘at most’ \mathrm{Bin}(n-1,\frac{\lambda}{n}). Since, for example, if we already know that eleven vertices, including the current one w are in C(v), then the distribution of the number of new vertices to be added to consideration because they are directly connected to w has conditional distribution \mathrm{Bin}(n-11,\frac{\lambda}{n}).

Firstly, we want to formalise the notion that this is ‘less than’ \mathrm{Bin}(n,\frac{\lambda}{n}), and also that, so long as we don’t replace 11 by a linear function of n, that \mathrm{Bin}(n-11,\frac{\lambda}{n})\stackrel{d}\approx \mathrm{Poisson}(\lambda).

Couplings to compare distributions

coupling of two random variables (or distributions) X and Y is a realisation (\hat X,\hat Y) on the same probability space with correct marginals, that is

\hat X\stackrel{d}=X,\quad \hat Y\stackrel{d}=Y.

We saw earlier in the course that we could couple G(n,p) and G(n,q) by simulating both from the same family of uniform random variables, and it’s helpful to think of this in general: ‘constructing the distributions from the same source of randomness’.

Couplings are a useful notion to digest at this point, as they embody a general trend in discrete probability theory. Wherever possible, we try to do as we can with the random objects, before starting any calculations. Think about the connectivity property of G(n,p) as discussed in the previous lecture. This can be expressed directly as a function of p in terms of a large sum, but showing it is an increasing function of p is essentially impossible by computation, whereas this is very straightforward using the coupling.

We will now review how to use couplings to compare distributions.

For a real-valued random variable X, with distribution function F_X, we always have the option to couple with a uniform U(0,1) random variable. That is, when U\sim U[0,1], we have (F_X^{-1}(U)\stackrel{d}= X, where the inverse of the distribution function is defined (in the non-obvious case of atoms) as

F_X^{-1}(u)=\inf\left\{ x\in\mathbb{R}\,:\, F(x)\ge u\right\}.

Note that when the value taken by U increases, so does the value taken by F_X^{-1}(U). This coupling can be used simultaneously on two random variables X and Y, as (F_X^{-1}(U),F_Y^{-1}(U)), to generate a coupling of X and Y.

The total variation distance between two probability measures is

d_{\mathrm{TV}}(\mu,\nu):= \sup_{A}|\mu(A)-\nu(A)|,

with supremum taken over all events in the joint support S of \mu,\nu. This is particularly clear in the case of discrete measures, as then

d_{\mathrm{TV}}(\mu,\nu)=\frac12 \sum_{x\in S} \left| \mu\left(\{x\}\right) - \nu\left(\{x\}\right) \right|.

(Think of the difference in heights between the bars, when you plot \mu,\nu simultaneously as a bar graph…)

The total variation distances records how well we can couple two distributions, if we want them to be equal as often as possible. It is therefore a bad measure of distributions with different support. For example, the distributions \delta_0 and \delta_{1/n} are distance 1 apart (the maximum) for all values of n. Similarly, the uniform distribution on [0,1] and the uniform distribution on \{0,1/n,2/n,\ldots, n-1/n, 1\} are also distance 1 apart.

When there is more overlap, the following result is useful.

Proposition: Any coupling (\hat X,\hat Y) of X\sim \mu,\,Y\sim \nu satisfies \mathbb{P}(X=Y)\le 1-d_{\mathrm{TV}}(\mu,\nu), and there exists a coupling such that equality is achieved.

Proof: The first part of the resultis shown by

\mathbb{P}\left(\hat X=\hat Y = x\right)\le \mathbb{P}\left(\hat X=x\right) = \mu\left(\{x\}\right),

and so we may also bound by \nu\left(\{x\}\right). Thus

\mathbb{P}\left(\hat X=\hat Y\right) = \sum_x \min\left( \mu\left(\{x\}\right),\, \nu\left(\{x\}\right) \right),

leading to

\mathbb{P}\left(\hat X \ne\hat Y\right) = 1 - \sum_x \min\left(\mu\left(\{x\}\right),\, \nu\left(\{x\}\right) \right)

= \sum_x \left[ \frac12 \mu(\{x\}) + \frac12 \nu(\{x\}) - \min\left( \mu(\{x\}), \,\nu(\{x\})\right) \right] = \frac12 \sum_x \left| \mu(\{x\}) - \nu(\{x\}) \right|.

The second part of the theorem is more of an exercise in notation, and less suitable for this blog article.

Crucially, observe that the canonical coupling discussed above is definitely not necessarily the best coupling for total variation distance. For example, the uniform distribution on \{0,1,2,\ldots, n-1\} and the uniform distribution on \{1,2,\ldots, n-1, n\} are very close in the total variation distance, but under the canonical coupling involving the distribution function, they are never actually equal!

Stochastic domination

Given two random variables X and Y, we say that Y stochastically dominates X if

\mathbb{P}(X\ge x)\le \mathbb{P}(Y\ge x),\quad \forall x\in\mathbb{R},

and write X\le_{st}Y. If we are aiming for results in probability, then it is easy to read results in probability for X off from results in probability for Y, at least in one direction. For this relation, it turns out that the canonical coupling is particularly useful, unsurprisingly so since it is defined in terms of the distribution function, which precisely describes the tail probabilities. In fact the reverse implication is true too, as stated in the following theorem.

Theorem (Strassen): X\le{st} Y if and only if there exists a coupling (\hat X,\hat Y) such that \mathbb{P}(\hat X\le \hat Y)=1.

Proof: The reverse implication is clear since

\mathbb{P}\left(\hat X \ge x\right) = \mathbb{P}\left( \hat Y\ge \hat X\ge x\right) \le \mathbb{P}\left( \hat Y \ge x\right),

and \hat X,\,\hat Y have the correct marginals.

Examples: In many situations, the coupling description makes it much easier to verify the stochastic domination relation.

  • \lambda\le \mu implies \mathrm{Poisson}(\lambda)\le_{st} \mathrm{Poisson}(\mu). Checking the tail of the Poisson mass function would be a bit of a nightmare, but since we have the superposition coupling \mathrm{Po}(\mu)\stackrel{d}= \mathrm{Po}(\lambda) + \mathrm{Po}(\mu-\lambda), for independent RVs on the RHS, we can read it off.
  • Similarly \mathrm{Bin}(n,p)\le \mathrm{Bin}(m,q) when n\le m,\, p\le q. Again, checking the tail would be computationally-annoying.
  • One of the most useful examples is the size-biased distribution, obtained from a distribution X by \mathbb{P}(X_{sb}=x) \propto x\mathbb{P}(X=x), when X is a non-negative RV. As discussed on one of the exercises, |C(v)|, which we study repeatedly, can be expressed as the size-biased version of a uniform choice from amongst the list of component sizes in the graph. We have X\le_{st} X_{sb}, which is intuitively reasonable since the size-biased distribution ‘biases in favour of larger values’.


1) The fact that we study the component C(v) containing a uniformly chosen vertex v\in[n] is another example of ‘revealing’ the total randomness in a helpful order. Note that |C(v)| requires extra randomness on top of the randomness of the graph itself. When analysing C(v) it was convenient to sample the graph section-by-section having settled on v. However, for the final comparison of |C(v)| and |L_1(G)|, the opposite order is helpful.

Since, whenever there is a component of size at least \epsilon n, the probability that we pick v within that component is at least \frac{\epsilon n}{n}. Thus we can conclude that

\mathbb{P}\left(|C(v)|\ge \epsilon n\right) \ge \frac{\epsilon n}{n}\mathbb{P}\left( \left|L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right| \ge \epsilon n\right).

Here, since we are aiming for convergence in probability, \epsilon>0 is fixed, and

\frac{1}{n}\left|L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right|\stackrel{\mathbb{P}}\longrightarrow 0,

follows from

\frac{1}{n}\left| C(v) \right| \stackrel{\mathbb{P}}\longrightarrow 0,

which we have shown using the exploration process.

2) In fact, the final step of the argument can be strengthened to show that

\lim_{K\rightarrow\infty}\limsup_{n\rightarrow\infty}\mathbb{P}\left(\left|L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right)\ge K\log n\right)=0,

though this requires familiarity with Chernoff bounds / Large Deviations estimates, which would have been a distraction to have introduced at exactly this point in the course. Showing that the size of the largest component is concentrated in probability on the scale of \log n is also possible, though for this we obviously need to track exact errors, rather than just use bounding arguments.

3) There are many more ways to compare distributions, with and without couplings, some of which have been of central importance in studying the age distributions corresponding to mean field forest fires, as introduced in a recent paper with Ed Crane and Balazs Rath, which can, as of this morning, be found on Arxiv. In the interests of space, further discussion of more exotic measures on distributions is postponed to a separate post.

Lecture 2 – Connectivity threshold

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

The goal of the second lecture was to establish the sharp phase transition for the connectivity of the random graph G(n,p(n)) around the critical regime p(n)\sim \frac{\log n}{n}. In the end, we showed that when \omega(n) is any diverging sequence, and p(n)=\frac{\log n-\omega(n)}{n}, then we have that G(n,p(n)) is with high probability not connected.

In the next lecture, we will finish the classification by studying p(n)=\frac{\log n+\omega(n)}{n}, and show that for this range of p, the graph G(n,p(n)) is with high probability connected.

The details of the lecture, especially the calculation, are not presented fully here. There, I followed van der Hofstad’s recent book fairly closely, sometimes taking different approximations and routes through the algebra, though all versions remain fairly close to the original enumerations by Renyi.

Immediate remarks

  • One is allowed to be surprised that for almost all scalings of p(n), G(n,p) is either whp connected or whp not connected. The speed of the transition is definitely interesting.
  • As defined in lectures, the property that a graph is connected is an increasing property, meaning that it is preserved when you add additional edges to the graph.
  • Because of the natural coupling between G(n,p) and G(n,q), the fact that connectedness is an increasing property makes life easier. For example, we can insist temporarily that \omega(n)\ll \log n, or whatever scaling turns out to be convenient for the proof, but conclude the result for all diverging \omega(n). This avoids the necessity for an annoying case distinction.

Heuristics – Isolated vertices

It turns out that the ‘easiest’ way for such a graph to be disconnected is for it to have an isolated vertex. In determining that the graph has a cut into classes of sizes a and b with no edges between them, there is a trade-off between the number of ways to choose the partition (which increases with min(a,b) ) and the probabilistic penalty from banning the ab edges between the classes (which decreases with min(a,b) ). It turns out that the latter effect is slightly stronger, and so (1,n-1) dominates.

Method 1: second-moment method

In the case p(n)=\frac{\log n - \omega(n)}{n}, we use a second-moment method argument to establish that G(n,p) contains an isolated vertex with high probability. Note that a given vertex v is isolated precisely if n-1 edges are not present. Furthermore, two given vertices v,w are both isolated, precisely if 2n-3 edges are not present. So in fact, both the first moment and the second moment of the number of isolated vertices are straightforward to evaluate.

It turns out that the number of isolated vertices, Y_n, satisfies

\mathbb{E}[Y_n]= \exp(\omega(n)+o(1))\rightarrow\infty. (*)

As always, we have to eliminate the possibility that this divergent expectation is achieved by the graph typically having no isolated vertices, but occasionally having very many. So we turn to the second moment, and can show

\mathrm{Var}(Y_n)= (1+o(1))\mathbb{E}[Y_n],

and so by Chebyshev’s inequality, we have \mathbb{P}(Y_n=0)\rightarrow 0.

Method 2: first-moment method

Counter-intuitively, although the case p(n)=\frac{\log n + \omega(n)}{n} requires only a first-moment method, it is more technical because it involves the non-clear direction of the informal equivalence:

\text{Connected}\; ``\iff ''\; \text{no isolated vertices}.

At the time we showed (*), we also showed that for this regime of p(n), G(n,p) whp has no isolated vertices. It remains to show that it has no splits into (unions of) connected components of sizes k and n-k.

It turns out that it is annoying to count components of size k because when we take an expectation, we can’t study all connected components on k vertices equally, since the probability of a given component structure depends on the exact number of edges it contains (which must be at least k-1, but will be larger if it includes lots of cycles). It is much easier to study the number of spanning trees of components of size k. So if the component itself is a tree, it has exactly one spanning tree, and if it has more cyclic structure then it will have several spanning trees.

The point is that we expect the expected number of such components to be very very small when k\ge 2, so it doesn’t really matter if we overcount by some factor. Somewhat abusing notation, we define X_k to be the number of such spanning trees in G(n,p), so

\mathbb{E}[X_k]=\binom{n}{k} k^{k-2}p^{k-1}(1-p)^{k(n-k)}.

Here the terms determine, respectively:

  • the number of ways to choose the vertex set of the component;
  • the number of trees on that vertex set (Cayley’s formula);
  • the probability of E(G(n,p)) including the edges of the tree in question;
  • the probability that G(n,p) includes no edges between the vertex of the tree and the rest of the graph (so it is actually a component).

Applying Stirling’s formula, and bounding helpfully, we obtain

\mathbb{E}[X_k]\le n(npe)^k \exp(-k(n-k)p).

We have already seen (modulo the abuse of notation) that \mathbb{E}[X_1]\rightarrow 0. Now we need to show that

\mathbb{E}\left[ \sum_{k=2}^{\lfloor n/2\rfloor} X_k\right]\rightarrow 0.

At this point, we should optimise our approximations based on how easy they are to sum, rather than how strong they are. One weak option is

\mathbb{E}[X_k] \le n(npe^{1-np/2})^k,

but it turns out that this is perfectly strong enough, and gives a geometric series, which is easy to sum. The only issue is that we get

\mathbb{E}\left[\sum_{k=\ell}^{\lfloor n/2\rfloor} X_k\right] < n \left(\frac{\log n}{\sqrt{n}}\right)^\ell,

so the sum is only bounded as we want when \ell\ge 3. Fortunately, almost any argument will work to show \mathbb{E}[X_2]\rightarrow 0 separately.

Once we know that \mathbb{E}[\sum X_k] \rightarrow 0, we can use the fact that for non-negative-valued RVs X,

\mathbb{P}(X\ne 0)=\mathbb{P}(X\ge 1)\le \mathbb{E}[X],

which you can view as an example of Markov’s inequality if you want, but really is just one of the fundamental properties of integer-valued RVs that allows useful interchange of probabilities and expectations.

In any case, combining this shows that with high probability there is no cut into classes of size k and n-k for any k, which is the same as being connected.


  • The similarity to the famous coupon collector problem is unsurprising, modulo the heuristic about isolated vertices. Under the right setup (continuous-time, for example), whether the coupons have been collected is independent between different coupons. That’s not the case here, as can be seen in the second-moment calculation. But under any version of the coupon collector problem, the second-moment calculation setup is essentially the same as here.
  • It becomes clear in the proof that the threshold for the existence of cuts into k and n-k with high probability is p(n)\sim \frac{\log n}{kn} for fixed k. In particular for every k>1, we are working with a range of p much larger than this threshold in the calculation above, so it’s unsurprising that we could make weak assumptions and approximations, and still show that the expected number of such cuts vanishes.
  • Once one has a coupling of G(n,p) for all p, one can ask whether the graph becomes connected at exactly the same time / value of p as the last isolated vertex receives an incident edge. I will not discuss this more here, as it’s one of the extended exercises, but this is just the start of a family of questions concerning whether closely-related properties occur simultaneously or not.

Renyi’s study of G(n,m)

Renyi’s original argument concerned the related model G(n,m), that is, a uniform choice amongst the graphs on vertex set [n] with exactly m edges. There’s no right or wrong answer to the question of which is the better choice as the canonical model for a random graph. However, for this question, I guess it is natural to pose the question for G(n,m) since there’s an obvious extremal answer in terms of edges. That is, we require at least n-1 edges before there’s any chance that the graph is connected (cf the exercise about trees), so it’s natural to ask how many edges are required before there’s a strong chance that the G(n,m) is connected.

Clearly, conditioning on m edges being present in G(n,p) gives G(n,m). We can view G(n,p) as a mixture of distributions G(n,m) across values of m from 0 to \binom{n}{2}. Some graph properties are clearly badly focused under this mixture, for example the property A={the graph has an even number of edges}.

However, for increasing events (which are preserved under the addition of edges), proving that an event holds whp for G(n,p) given it holds whp for G(n,m), and vice versa, is normally straightforward.

For example, given a sequence p(n), take m(n) to be a sequence such that E(G(n,p))\ge m(n) with high probability. This is just about the tail of the binomial distribution, so for example taking m(n)=\binom{n}{2}p - \omega(\sqrt{n^2p}) suffices. Then if increasing property A holds for G(n,m) with high probability, it holds for G(n,p) with high probability too.

One can argue similarly in the other direction.

An interesting adjustment to this argument is required in the final section of my paper with James Martin in ALEA concerning a model of random forests which does not have the natural coupling property enjoyed by G(n,p) and G(n,q). This may be the subject of an upcoming post, time permitting.

Random Graphs – Lecture 1

My plan is to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics about the course can be found here.

In the first lecture, we revised some basic definitions about graphs, focusing on those which are most relevant to a first study of the Erdos-Renyi random graph G(n,p) which will be the focus of the lecture course. We discussed in abstract why the independence of the (potential) edges makes the model easier to analyse, but reduces its suitability as a direct model for lots of networks one might see in the real world, where knowledge that A is directly connected to both B and C affects the probability that B is directly connected to C, in either direction. Thinking about the Facebook friendship graph is one of the best examples, where in this case, we expect this extra information to increase the probability that B and C are connected. Even as the world moves away from heteronormativity, it realistically remains the case that in a graph of the dating history amongst a well-defined community we would likely observe the opposite effect.

All of these more complicated phenomena can be captured by various random graphs, but G(n,p) remains the corner stone, evinced by the >10^5 citations towards one of Erdos and Renyi’s original papers on the topic.

Somewhat paraphrasing, one of their (well, mostly Renyi’s) original questions was: when n is large, what should p be so that there’s a good chance that G(n,p) is connected?

The answer to this question lies in Lecture 2, but to cement understanding of the model, and explore some key methods for proofs in discrete probability (as well as play around with the big-O and little-o notation), we investigated the following two situations, which are very far from interesting as far as connectivity of G(n,p) is concerned.

Dense regime

When p is fixed, there are many interesting questions one could ask about the asymptotic properties of G(n,p), but connectivity is not one of them. In particular, for p\in(0,1) we claim:

Proposition: \mathrm{diam}(G(n,p)) \stackrel{\mathbb{P}}\rightarrow 2 as n\rightarrow\infty.

Note that if \mathrm{diam}(G(n,p))=1, then G(n,p)\simeq K_n, the complete graph on n vertices. In other words, every possible edge is actually present. But the probability of this event is p^{\binom{n}{2}}\rightarrow 0, so long as p<1.

It then suffices to prove that \mathbb{P}(G(n,p)>2) \rightarrow 0. We use a union bound, where we study the probability that the graph distance d_{G(n,p}(v,w)>2 for two fixed vertices v\ne w first, and then sum over all such pairs. Of course, there is a probability p that the two vertices are directly connected by an edge. Then, there are (n-2) other vertices with the potential to be a common neighbour of v and w, which would ensure that the graph distance between them is at most two. So

\mathbb{P}(d_{G(n,p)}(v,w)>2)=(1-p)[1-p^2]^{n-2} .

Note that we are using independence throughout this calculation. Then comes the union bound:

\mathbb{P}(\mathrm{diam}(G(n,p)) >2) \le \sum_{v\ne w \in[n]} \mathbb{P}(d_{G(n,P)}(v,w)>2)

\le \binom{n}{2} (1-p)[1-p^2]^{n-2} \rightarrow 0,

since exponential decay ‘kills’ polynomial growth.

Ultra-sparse regime

In general, we work in the setting where p=p(n) depends on n. If p(n) decays fast enough (see Exercise 2), then with high probability G(n,p) has no edges at all. However, when p(n)=o(n^{-3/2}) we have

Proposition: \mathbb{P}(\text{edges of }G(n,p)\text{ form a matching}) \rightarrow 1 as n\rightarrow\infty.

A matching is a collection of edges with no vertices in common. So if the edge set of the graph is a matching, we have essentially no interesting connectivity structure at all. The longest path has length one, for example.

To prove this, note that the edge set of the graph fails to be a matching precisely if one of the vertices has degree at least two. But since a vertex v is connected to each of the (n-1) other vertices in the graph independently with probability p, we have

\mathrm{deg}_{G(n,p)}(v) \sim \mathrm{Bin}(n-1,p),

and so we can directly make the crude approximation

\mathbb{P}(\mathrm{deg}_{G(n,p)}(v) =k) = \binom{n-1}{k}p^k(1-p)^{n-1-k}\le n^k p^k.

We’ve made this very weak bound to make life easier when we sum:

\mathbb{P}(\mathrm{deg}_{G(n,p)}(v) \ge 2) \le \sum_{k\ge 2}(np)^k = \frac{(np)^2}{1-np}.

Since p=o(n^{-3/2}), we have \frac{1}{1-np}=\frac{1}{1-o(1)}=1+o(1), and overall we obtain

\mathbb{P}(\mathrm{deg}_{G(n,p)}(v) \ge 2) = o(\frac{1}{n}).

Again, we finish with a union bound, considering this event across all vertices v\in[n].

\mathbb{P}(E(G(n,p))\text{ not a matching}) \le \sum_{v\in[n]} \mathbb{P}(\mathrm{deg}_{G(n,p)}(v)\ge 2)

=n\mathbb{P}(\mathrm{deg}_{G(n,p)}(1)\ge 2) = o(1),

as required.

Next time

In the next lecture, we’ll study the regime p(n)\sim \frac{\log n}{n}, where G(n,p) experiences a phase transition from probably not connected to probably connected. Part of this involves making the notion probably connected precise, which will be useful throughout the rest of the course, as well as establishing the language for comparing G(n,p) and G(n,q).

The proof itself requires some more sophisticated versions of calculations from Lecture 1, and more sophisticated probabilistic tools (first- and second-moment methods) to convert them into statements about convergence in probability. This will be an advertisement for the more classical enumerative methods that underpinned much of the early work on random graphs.

The rest of the course will exploit much more some comparisons and embeddings involving branching processes and exploration processes, so don’t worry – it won’t be 26 hours of counting trees!

Linear Algebra II: Eigenvectors and Diagonalisability

This post continues the discussion of the Oxford first-year course Linear Algebra II. We’ve moved on from determinants, and are now considering eigenvalues and eigenvectors of matrices and linear maps.

A good question to ask is: what’s the point of knowing about eigenvectors? I can think of a quick answer and a longer answer. The quick answer is that whenever we have a mapping of any kind, it is natural to ask about its fixed points. And since we are thinking about vector spaces and linear maps, if we can’t find any fixed points, we might nonetheless be able to find the best thing, some vectors whose direction is fixed by the map. In general, knowing about fixed points of a mapping might tell us other more qualitative properties, including the behaviour seen when you apply the map iteratively a large number of times. (Indeed a recent post discusses this exact problem for positive matrices in a context relevant to a chapter of my thesis…)

A more specific answer concerns bases. Recall that a linear map is defined independently of any basis: it’s just a map from the vector space to itself. We can express the linear map via a matrix with respect to some basis, but how to choose the basis? We could always choose the canonical basis in \mathbb{R}^n, since it’s easy to do vector and matrix calculations when most of the entries of all the vectors are zero. We also have a good visual idea (at least in up to three dimensions) of what a matrix might mean with respect to that basis. If we needed to divide the three-dimensional world around us into small volumes, we’d tend to describe it with small cubes rather than small arbitrary parallelopipeds.

But once we know something about the linear map, we might want to choose a basis of vectors on which the behaviour of the map is particularly easy to describe. And eigenvectors fulfil precisely this role. If we are able to choose a basis of eigenvectors, describing the map’s action, either abstractly, or via a (diagonal) matrix, is very straightforward. If we are given a matrix to begin with, we know how to do a change of basis, and changing to the basis of eigenvectors is precisely what’s going when we write A=P^{-1}DP, where D is a diagonal matrix. We construct P by taking its columns to be these eigenvectors. In particular, for a given vector x, y=Px is the vector giving the coefficients of x in the basis of eigenvectors.

So the case where we have a basis of eigenvectors is particularly useful, and in this case, we say the matrix or the map is diagonalisable. Remember how we find eigenvalues. If there exists a non-zero vector x satisfying Ax=\lambda x, then x is in the kernel of A-\lambda I. As we discussed last time, introducing the determinant gives a much more manageable way to verify which values of \lambda result in A-\lambda I having a non-trivial kernel. In particular, if non-zero x is in the kernel, we have \mathrm{det}(A-\lambda I)=0, and this leads to a polynomial of degree n (the dimensional of the vector space / size of the matrix) for \lambda, called the characteristic polynomial \chi_A(z), which has the eigenvalues as its roots.

If we agree to work over the complex field, then this is good, because it means we always have eigenvalues, and so it becomes sensible to talk about exactly how many eigenvalues and eigenvectors we have. Observe that if we restrict to real vector spaces, this might not be the case. In the plane, the rotation by \pi/2 for example has no fixed vectors.

Multiplicities of eigenvalues

We call the algebraic multiplicity \alpha(\lambda) of an eigenvalue \lambda to be the exponent of the factor (z-\lambda) in the factorisation of the characteristic polynomial. To define the geometric multiplicity, observe that all the eigenvectors with eigenvalue \lambda form a subspace, and so it is meaningful to talk about the dimension of this subspace (‘eigenspace’), which is the geometric multiplicity \gamma(\lambda). There are two facts that one needs to remember. The slightly less obvious one is that \gamma(\lambda)\le \alpha(\lambda) for all \lambda. One can see this by, for example, working in a basis that extends a basis of the \lambda-eigenspace. Observe at this stage that the sum of the algebraic multiplicities has to be n by definition, while the sum of geometric multiplicities is at most n. And this makes sense, because the space spanned by all the eigenvectors is a subspace, and so has dimension at most n.

The more obvious, but more frequently forgotten result is that

\alpha(\lambda)\ge 1 \quad \iff \quad \gamma(\lambda)\ge 1,

which is simply a consequence of the property discussed a few paragraphs previously concerning the kernel of A-\lambda I.

In particular, we might make the heuristic observation that ‘most’ polynomials of degree n have n distinct roots. This is certainly true for quadratics: there is only one value that the discriminant can take such that we see a repeated root. Alternatively, imagine shifting the quadratic up and down (in a complex way if necessary); again there is only one moment at which it might have a repeated root. This observation can be generalised easily to higher degree polynomials in a number of ways.

So if we lift this observation across to matrices, we see that most matrices have n distinct eigenvalues, and thus have n linearly independent eigenvectors which form a basis, hence the matrix is diagonalisable. I think it’s really worth reflecting on this, since much of a first exploration into linear algebra ends up treating exactly the case where the matrix is not diagonalisable.

The principal example of a non-diagonalisable matrix is \begin{pmatrix}2&1\\0&2\end{pmatrix}, where the 2s can be replaced by any value, and the 1 can be replaced by an non-zero value. There’s plenty to learn about to what extent versions of this matrix of higher size represent all non-diagonalisable matrices, but such an exposition of Jordan normal form comes next year for the students taking this course.

It probably is worth saying now though, that this example gives a good sanity check for whether a method is actually using diagonalisability correctly. For example, it is easily seen that elementary row operations to not preserve diagonalisability by starting from \begin{pmatrix}2&0\\0&2\end{pmatrix} and ending up at our counter-example. One could also argue from this that the set of non-diagonalisable matrices are dense within the set of matrices with a repeated eigenvalue. That is, having a repeated eigenvalue but full eigenspace is doubly-infinitely-unlikely.

Cayley-Hamilton theorem

Anyway, among other results, we also saw the Cayley-Hamilton theorem, which states that a matrix A satisfies its own characteristic equation. That is \chi_A(A)=0, where the zero on the right-hand side is the zero matrix. It’s tempting to substitute A into the expression \mathrm{det}(A-\lambda I), but of course this is not valid. Indeed imagine a typical eigenvalue determinant matrix with terms like (7-\lambda) on the diagonal; it doesn’t make sense to substitute a matrix for \lambda as one of the entries of the overall matrix!

Fortunately, we can argue convincingly in the case where A is a diagonalisable matrix. Remember that \chi_A(A) is a matrix. Now looki at the action of \chi_A(A) on any eigenvector v, corresponding to eigenvalue \lambda. Applying some power of A to v gives v multiplied by the same power of \lambda, and so we end up with

\chi_A(A)v = \chi_A(\lambda)v = 0.

This only worked when v was an eigenvector, but fortunately there is a basis of eigenvectors if A is diagonalisable, and so \chi_A(A)v=0 for all v, hence \chi_A(A)=0.

But \chi_A(A) is just a matrix-valued function of A. If you think about it, \chi_A is a monic polynomial, all of whose non-leading coefficients are multinomials of degree at most n-1 in the entries of A. Furthermore, these multinomials have (non-negative) integer coefficients. Therefore the entries of \chi_A(A) are multinomials of degree at most 2n-1 in the entries of A, and again have (non-negative) integer coefficients.

Even without the integrality of the coefficients, this says that, under any reasonable definition of continuity of matrices (which could be induced from any topology on \mathbb{R}^{n\times n}) the function \chi_A(A) should be continuous as a function of A. But we’ve shown \chi_A(A)=0 for all diagonalisable A, and also argued that most complex-valued matrices are diagonalisable. Turning this into a formal statement about denseness means that we’ve shown the Cayley-Hamilton theorem for non-diagonalisable matrices also. It feels that because the coefficients are non-negative integers, we might also have shown the result for other fields too, but I have minimal knowledge or recollection at the moment of the things one has to check for this sort of result.

It’s worth ending with the brief comment that Cayley-Hamilton is useful, among other reasons because it enables us to write the inverse of A as a polynomial of degree at most n-1 in terms of A. In many settings this is a lot easier to work with in terms of calculations than an argument with minors.

Linear Algebra II: Determinants 2

In the previous post, we introduced determinants of matrices (and by extension linear maps) via its multilinearity properties, and as the change-of-volume factor. We also discussed how to calculate them, via row operations, or Laplace expansion, or directly via a sum of products of entries over permutations.

The question of why this is ever a useful quantity to consider remains, and this post tries to answer it. We’ll start by seeing one example where this is a very natural quantity to consider, and then the main abstract setting, where the determinant is zero, and consider a particularly nice example of this.

Jacobeans as a determinant

We consider integration by substitution. Firstly, in one variable: when it comes to Riemann integration of a function g(x) with respect to x, we view dx as the width of a small column which approximates the function near x. Now, if we reparameterise, that is if we write x=f(y) for some well-behaved (in particular differentiable) function f, then the width of the column is dx= dy.(dx/dy)=f'(y) dy. This may be negative, if y is decreasing while x is increasing, but for now let’s not worry about this overly, for example by assuming the function g is non-negative. Thus if we want to integrate with y as the variable, we multiply the integrand by this factor |f'(y)|.

What about in higher dimensions? We have exactly the same situation, only instead of two-dimensional columns, we have (n+1)-dimensional columns. We then multiply the n-dimensional volume of the base by the height, again given by g(\mathbf{x}). If we have a similar transformation of the base variable \mathbf{x}=f(\mathbf{y}), we differentiate to get

\mathrm{d}x_i = \sum_{j=1}^n\frac{\mathrm{d}f_i}{\mathrm{d}y_j} \mathrm{d}y_j.

In other words

\mathrm{d}\mathbf{x}= J \mathrm{d}\mathbf{y},

where J is the Jacobean matrix of partial deriatives. In particular, we know how to relate the volume [0,\mathrm{d}x_1]\times\ldots\times [0,\mathrm{d}x_n] to the volume [0,\mathrm{d}y_1]\times \ldots\times [0,\mathrm{d}y_n]. It’s simply the determinant of the Jacobean J. So if we want to integrate with respect to \mathbf{y}, it only remains to pre-multiply the integrand by |\mathrm{det}J| and proceed otherwise as in the one-dimensional case.

Det A = 0

A first linear algebra course might well motivate the introducing matrices as a notational shortcut for solving families of linear equations, Ax=b. The main idea is that generally we can solve this equation uniquely. Almost all of the theory developed in such a first linear algebra course deals with the case when this fails to hold. In particular, there are many ways to characterise this case, and we list some of them now:

  • Ax=b has no solutions for some b;
  • A is not invertible;
  • A has non-trivial kernel, that is, with dimension at least one;
  • A does not have full rank, that is, the image has dimension less than n;
  • The columns (or indeed the rows) are linearly dependent;
  • The matrix can be row-reduced to a matrix with a row of zeroes.

It is useful that these are equivalent, as in abstract problems one can choose whichever interpretation from this list is most relevant. However, all of these are quite hard to check. Exhibiting a non-trivial kernel element is hard – one either has to do manual row-reduction, or the equivalent in the context of linear equations. But we can add the characterisation

  • det A = 0;

to the list. And this is genuinely much easier to check for specific examples, either abstract or numerical.

Let’s quickly convince ourselves of a couple of these equivalences. Determinant is invariant under row-reductions, and by multilinearity it is certainly the case that det A = 0 if A has a row of zeroes. We also said that A is the change-of-volume factor. Note that A is a map from the domain to its image, so if A has less than full rank, then any set in the image has zero volume.

The Vandermonde matrix

This is a good example of this theory in practice. Consider the Vandermonde matrix where each row is a geometric progression:

V=\begin{pmatrix}1&\alpha_1&\ldots&\alpha_1^{n-1}\\1&\alpha_2&\ldots&\alpha_2^{n-1}\\ \vdots&\vdots&\ddots&\vdots\\ 1&\alpha_n&\ldots&\alpha_n^{n-1}\end{pmatrix}.

Now suppose we attempt to solve

V\begin{pmatrix}a_0\\a_1\\ \vdots\\ a_{n-1}\end{pmatrix}=\begin{pmatrix}b_1\\b_2\\ \vdots \\ b_n\end{pmatrix}.

There’s a natural interpretation to this, that’s especially clear with this suggestive notation. Each row corresponds to a polynomial, where the coefficients are given by the (a_0,a_1,\ldots,a_{n-1}), and the argument is given by \alpha_i.

So if we try to solve for (a_0,a_1,\ldots,a_{n-1}), given (\alpha_1,\ldots,\alpha_n) and (b_1,\ldots,b_n), we are asking whether we can find a polynomial P with degree at most n-1 such that P(\alpha_i)=b_i for $i=1. Lagrange interpolation gives an argument where we just directly write down the relevant polynomial, but we can also deploy our linear algebraic arguments too.

The equivalence of all these statements means that to verify existence and uniqueness of such a polynomial, we only need to check that the Vandermonde matrix has non-zero determinant. And in fact there are a variety of methods to show that

\mathrm{det}V=\prod_{1\le i< j}\le n(\alpha_j-\alpha_i).

For the polynomial question to be meaningful, we would certainly demand that the (\alpha_i) are distinct, and so this determinant is non-zero, and we’ve shown that n points determine a degree (n-1) polynomial uniquely.

If we multiply on the left instead, suppose that we are considering a discrete probability distribution X that takes n known values (\alpha_1,\ldots,\alpha_n) with unknown probabilities (p_1,\ldots,p_n). Then we have

(p_1,\ldots,p_n) V = (1,\mathbb{E}X, \mathbb{E}[X^2],\ldots, \mathbb{E}[X^{n-1}]).

So, again by inverting the Vandermonde matrix (which is know is possible since its determinant is non-zero…) we can recover the distribution from the first (n-1) moments of the distribution.

A similar argument applies to show that the Discrete Fourier Transform is invertible, and in this case (where the \alpha_is are roots of unity), the expression for the Vandermonde determinant is particularly tractable.

Linear Algebra II: Determinants 1

This term, I’m giving tutorials on a course that’s new to me, the apparently notorious ‘Linear Algebra II’ for first year undergraduates. I can appreciate how it might have ended up with this reputation, but as always, every challenge is also an opportunity, So I’m going to (try to) write a short series of blog posts about what we’ve discussed in the tutorials.

The first problem sheet-and-a-half concerned determinants of matrices. There are three things worth addressing here:

  1. What are abstract definitions, and which is most useful in each setting?
  2. How to actually calculate them?
  3. What’s the overall point?

The answers are obviously not completely unrelated, but we’ll probably defer the third question to a second post.

The determinant is a map from the set of matrices \mathcal{M}_n to the base field (hereafter assumed to be \mathbb{R},\mathbb{C}). The Oxford course defines it through its properties:

  • Multilinear in the columns of the matrix.
  • Equal to zero if two columns are equal.
  • Equal to one if the matrix is the identity.

One then checks that there is a unique such map, and so from now on it’s reasonable to call it the determinant of the matrix. It will follow from pretty much any consequence that we can replace ‘columns’ with ‘rows’ throughout and get the same map.

Other definitions

We have a closed form expression for the determinant given via permutations of n

\mathrm{det}(A)=\sum_{\sigma\in \Sigma_n} \mathrm{sign}(\sigma) a_{1\sigma(1)}\ldots a_{n\sigma(n)}.

We’ll come back to a discussion of when this particular definition is useful. It can be derived by carefully transforming the identity matrix into A, using the operations which are mentioned in the original definition of the determinant, in particular, keeping track of the number of transpositions of columns.

It’s clear from any definition that the determinant is a polynomial of degree n in the entries of the matrix, but this definition will be useful if you want to make some more precise comment on the nature of this polynomial. For example, if entries of the matrix are polynomials in x of various degree (think of the eigenvalue equation for example) this allows you to control (or at least bound) the overall degree of the determinant as a polynomial in x.

The determinant is also the volume of the n-dimensional parallelopiped formed by the column vectors of the matrix. This is easy to check in two dimensions, for the matrix \begin{pmatrix}a&b\\c&d\end{pmatrix}:

20160225_172206To calculate the area of the central parallelogram, we have to subtract the area of two small rectangles and four small triangles from the outer rectangle, obtaining

(a+b)(c+d)-2bc - \frac12(ac+ac+bd+bd)=ad-bc,

as we expect. This calculation is harder to execute in higher dimensions, and certainly harder to visualise.

Maybe, though, we don’t have to, so long as we can reassure ourselves that this volume satisfies the implicit definition of the determinant map at the start. Multilinearity in the columns is not that hard to see. If we multiply the jth column by some constant, we are stretching the parallelopiped by the same constant factor in one direction, and so the volume grows appropriately. The additivity property can similarly be thought of as joining together two parallelopipeds at their common face (which is common since the other column vectors have to be constant in this construction). If two column vectors are equal, then clearly this volume actually has dimension at most n-1, and thus volume zero, so the final two conditions are genuinely easy to check.

The challenge here is that there is a direction involved. Determinants can be negative, but in our classical viewpoint, areas generally are not. In 2D, we can think of this as saying that the area is positive if the vector (b,d) lies anti-clockwise from (a,c) in the parallelogram, while is it negative otherwise. Again, this is harder to visualise in higher dimensions, but it is at least plausible that one could develop a similar decomposition. Ultimately, we are happy with the notion of directed lengths (ie vectors on the real line), and these are easy to add up without having to separate into cases, and the case holds for areas and higher-dimensional volumes.

Evaluating determinants

If we actually want to compute the determinant of a given matrix, the sum over permutations is intractable since it doesn’t have any natural splits into stages. The implicit definitions and this area consideration are clearly useless for all but the most special of examples.

The Laplace expansion is the usual algorithm to calculate the determinant of an n x n matrix. You pick a row (or a column), and evaluate the determinants of the (n-1) x (n-1) minor matrices given by deleting this row (or column), and each column (or row) in turn. This leaves us with n determinants of smaller matrices, which we pre-multiply by the entries in the original deleted row (or column), and add up in an alternating way (*). This is highly computationally intensive for large matrices, but for 3×3 and 4×4 can be done by hand with probability of an error bounded away from 1.

There is the flexibility to choose the reference row or column. Since the entries of these affect the sum through small products, it is highly convenient to choose a row or column with a lot of zeros. In particular, if there’s a row or column with exactly one non-zero entry, this is an ideal candidate.

The sum over permutations also works well when a lot of the entries are zero, because then a lot of the permutations give a summand which is zero. Upper-triangular matrices are a good example: only one permutation (the identity permutation) avoids all the zero elements underneath the diagonal.

One can also observe from the multilinearity property of the determinant map that there are lots of operations we can apply to the matrix which leave the determinant fixed. These are often called elementary row operations, though obviously we can apply them to the columns as well. To summarise, if we interchange two rows, the sign of the determinant is reversed. And if we add some multiple of one row to any other row, the determinant stays the same.

When matrices are not square, it’s quite important to be specific about exactly what form you can reduce a general matrix to via such row operations, but in this context, it’s not hugely important. Reduced echelon form (without the condition that leading coefficients will be one) is achievable, but this is a special case of an upper triangular matrix, for which the determinant is given by the product of the diagonal entries, ie is easy.

Whether this is substantially easier than Laplace expansion depends on the matrix itself and taste, both to do manually, and to code.

(*) I’m not a fan personally of this alternating definition. It seems to me much more natural to define the minor as

M^{i,j}=(a_{i+k,j_\ell})_{1\le k,\ell\le n-1},

with indices taken modulo n. Then you don’t have any \pm 1s in the Laplace expansion.

Using determinants in abstract problems

So the determinant gives directly the area of the image (under A) of the unit hypercube. By linearity (of A), it is easy to see that it also gives the scale factor of the area change (under A) of any hyper-cuboid, parallel to the conventional axes, anywhere in the space. Then, eg by approximating any sensible n-dimensional shape (*) as a union of such hyper-cuboids, we can show that in fact the area of any sensible shape increases by a factor (det A) under application of A.

This is a good thing to remember, because it is an excellent heuristic for seeing why the determinant of a linear map is basis-independent. It also gives a much easier proof of the key result

\mathrm{det}(AB)=\mathrm{det}(A) \mathrm{det}(B),

than that given by fixing B and viewing det(AB) as a map from matrices to the field, just like the original definition of determinant.

Some of the theory in the course is proved using elementary row operations. But these invite complicated notation, so are best used only in simple arguments, or when things are fairly explicit to begin with. Given an abstract problem about determinants of matrices, it is often tempting to induct on the size of the matrix in some way. I think it’s worth saying that even though the Laplace expansion is explicitly set up in this way, the notation involved is also likely to be annoying here, while permutations are easy to describe inductively: eg let \sigma(1)=k, then view the remainder of the permutation as a bijection \{2,3,\ldots,n\}\rightarrow [n]\backslash \{k\}.

Shortly, we’ll have a second post answering the final question: what’s the point of working with determinants? We’ve already seen half an answer, in that they describe the change-of-volume factor of a matrix (or linear map), but this can be substantially developed.

Hitting Probabilities for Markov Chains

This continues my previous post on popular questions in second year exams. In the interest of keeping it under 2,500 words I’m starting a new article.

In a previous post I’ve spoken about the two types of Markov chain convergence, in particular, considering when they apply. Normally the ergodic theorem can be used to treat the case where the chain is periodic, so the transition probabilities do not converge to a stationary distribution, but do have limit points – one at zero corresponding to the off-period transitions, and one non-zero. With equal care, the case where the chain is not irreducible can also be treated.

A favourite question for examiners concerns hitting probabilities and expected hitting times of a set A. Note these are unlikely to come up simultaneously. Unless the hitting probability is 1, the expected hitting time is infinite! In both cases, we use the law of total probability to derive a family of equations satisfied by the probabilities/times. The only difference is that for hitting times, we add +1 on the right hand side, as we advance one time-step to use the law of total probability.

The case of hitting probabilities is perhaps more interesting. We have:

h_i^A = 1,\; i\in A, \quad h_i^A=\sum_{j\in S}p_{ij}h_j^A,\; i\not\in A.

There are two main cases of interest: where the chain is finite but has multiple closed communicating classes, and where the chain is infinite, so even though it is irreducible, a trajectory might diverge before hitting 0.

For the case of a finite non-irreducible Markov chain, this is fairly manageable, by solving backwards from states where we know the values. Although of course you could ask about the hitting probability of an open state, the most natural question is to consider the probability of ending up in a particular closed class. Then we know that the hitting probability starting from site in the closed class A is 1, and the probability starting from any site in a different closed class is 0. To find the remaining values, we can work backwards one step at a time if the set of possible transitions is sparse enough, or just solve the simultaneous equations for \{h_i^A: i\text{ open}\}.

We therefore care mainly about an infinite state-space that might be transient. Typically this might be some sort of birth-and-death chain on the positive integers. In many cases, the hitting probability equations can be reduced to a quadratic recurrence relation which can be solved, normally ending up with the form


where \lambda might well be q/p or similar if the chain is symmetric. If the chain is bounded, typically you might know h_0=1, h_N=0 or similar, and so you can solve two simultaneous equations to find A and B. For the unbounded case you might often only have one condition, so you have to rely instead on the result that the hitting probabilities are the minimal solution to the family of equations. Note that you will always have h^i_i=1, but with no conditions, h^i_j\equiv 1 is always a family of solutions.

It is not clear a priori what it means to be a minimal solution. Certainly it is not clear why one solution might be pointwise smaller than another, but in the case given above, it makes sense. Supposing that \lambda<1, and A+B=1 say, then as we vary the parameters, the resulting set of ‘probabilities’ does indeed vary monotically pointwise.

Why is this true? Why should the minimum solution give the true hitting probability values? To see this, take the equations, and every time an h_i^A appears on the right-hand side, substitute in using the equations. So we obtain, for i\not\in A,

h_i^A=\sum_{j\in A}p_{ij}+\sum_{j\not\in A} p_{ij}h_j^A,

and after a further iteration

h_i^A=\sum_{j_1\in A}p_{ij_1}+\sum_{j_1\not\in A, j_2\in A}p_{ij_1}p_{j_1j_2}+\sum_{j_1,j_2\not\in A}p_{ij_1}p_{j_1j_2}h_{j_2}^A.

So we see on the RHS the probability of getting from i to A in one step, and in two steps, and if keep iterating, we will get a large sum corresponding to the probability of getting from i to A in 1 or 2 or … or N steps, plus an extra term. Note that the extra term does not have to correspond to the probability of not hitting A by time N. After all, we do not yet know that (h_{i}^A) as defined by the equations gives the hitting probabilities. However, we know that the probability of hitting A within N steps converges to the probability of hitting A at all, since the sequence is increasing and bounded, so if we take a limit of both sides, we get h_i^A on the left, and something at least as large as the hitting probability starting from i on the right, because of the extra positive term. The result therefore follows.

It is worth looking out for related problems that look like a hitting probability calculation. There was a nice example on one of the past papers. Consider a simple symmetric random walk on the integers modulo n, arranged clockwise in a circle. Given that you start at state 0, what is the probability that your first return to state 0 involves a clockwise journey round the circle?

Because the system is finite and irreducible, it is not particularly interesting to consider the actual hitting probabilities. Also, note that if it is convenient to do so, we can immediately reduce the problem when n is even. In two steps, the chain moves from j to j+2 and j-2 with probability ¼ each, and stays at j with probability ½. So the two step chain is exactly equivalent to the lazy version of the same dynamics on n/2.

Anyway, even though the structure is different, our approach should be the same as for the hitting probability question, which is to look one step into the future. For example, to stand a chance of working, our first two moves must both be clockwise. Thereafter, we are allowed to move anticlockwise. There is nothing special about starting at 0 in defining the original probability. We could equally well ask for the probability that starting from j, the first time we hit 0 we have moved clockwise round the circle.

The only thing that is now not obvious is how to define moving clockwise round the circle, since it is not the case that all the moves have to be clockwise to have experienced a generally clockwise journey round the circle, but we definitely don’t want to get into anything complicated like winding numbers! In fact, the easiest way to make the definition is that given the hitting time of 0 is T, we demand that the chain was at state n at time T-1.

For convenience (ie to make the equations consistent) we take h_0=0, h_n=1 in an obvious abuse of notation, and then

h_j=\frac12h_{j-1}+\frac12 h_{j+1},

from which we get

h_j=a+bj \Rightarrow h_j=\frac{j}{n}.

Of course, once we have this in mind, we realise that we could have cut the circle at 0 (also known as n) and unfolded it to reduce the problem precisely to symmetric gambler’s ruin. In particular, the answer to the original problem is 1/2n, which is perhaps just a little surprising – maybe by thinking about the BM approximation to simple random walk, and that BM started from zero almost certainly crosses zero infinitely many times near we might have expected this probability to decay faster. But once it is unfolded into gambler’s ruin, we have the optimal stopping martingale motivation to reassure us that this indeed looks correct.

Avoiding Mistakes in Probability Exams

Over the past week, I’ve given several tutorials to second year undergraduates preparing for upcoming papers on probability and statistics. In particular, I’ve now seen a lot of solutions to a lot of past papers and specimen questions, and it’s worthwhile to consider some of the typical mistakes students can make on these questions. Of course, as with any maths exam, there’s always the possibility of a particularly subtle or involved question coming up, but if the following three common areas of difficulty can be avoided, you’re on track for doing well.


In a previous course, a student will learn how to calculate the pdf of a function of a random variable. Here, we move onto the more interesting and useful case of finding the (joint) density of function(s) of two or more random variables. The key thing to remember here is that manipulating pdfs is not a strange arbitrary exercise – it is just integration. It is rarely of interest to consider the value of a pdf at a single point. We can draw meaningful conclusions from a pdf or from comparison of two pdfs by integrating them.

Then the question of substituting for new random variables is precisely integration by substitution, which we are totally happy with in the one-dimensional case, and should be fairly happy with in the two-dimensional case. To get from one joint density to another, we multiply by the absolute value of the Jacobian. To ensure you get it right, it makes sense to write out the informal infinitesimal relation

f_{U,V}(u,v) du dv = f_{X,Y}(x,y)dx dy.

This is certainly relevant if we put integral signs in front of both sides, and explains why you obtain f_{U,V} = \frac{d(x,y)}{d(u,v)} f_{X,Y} rather than the other way round. Note though that if \frac{d(u,v)}{d(x,y)} is easier to calculate for some reason, then you can evaluate this and take the inverse, as your functions will almost certainly be locally bijective almost everywhere.

It is important to take the modulus of the Jacobian, since densities cannot be negative! If this looks like a fudge, then consider the situation in one dimension. If we substitute for x\mapsto f(x)=1-x, then f’ is obviously negative, BUT we also end up reversing the order of the bounds of the integral, eg [1/3, ¾] will become [2/3,1/4]. So we have a negative integrand (after multiplying by f'(x)) but bounds in the ‘wrong’ order. These two factors of -1 will obviously cancel, so it suffices just to multiply by |f'(x)| at that stage. It is harder to express in words, but a similar relation works for the Jacobian substitution.

You also need to check where the new joint density is non-zero. Suppose X, Y are supported on [0,1], then when we write f_{X,Y}(x,y) we should indicate that it is 0 off this region, either by splitting into cases, or adding the indicator function 1_{\{x,y\in[0,1]\}} as a factor. This is even more important after substitutions, as the range of the resulting random variables might be less obvious than the originals. Eg with X,Y as above, and U=X^2, V=X/Y, the resulting pdf will be non-zero only when u\in[0,1], v\ge \sqrt{u}. Failing to account for this will often lead to ludicrous answers. A general rule is that you can always check that any distribution you’ve produced does actually integrate to one.

Convergence using MGFs

There are two main reasons to use MGFs and PGFs. The first is that they behave nicely when applied to (possibly random) sums of independent random variables. The independence property is crucial to allow splitting of the MGF of the sum into the product of MGFs of the summands. Of course, implicit in this argument is that MGFs determine distributions.

A key theorem of the course is that this works even in the limit, so you can use MGFs to show convergence in distribution of a family of distributions. For this, you need to show that the MGFs converge pointwise on some interval [-a,a] around 0. (Note that the moments of the distribution are given by the family of derivatives at 0, as motivation for why this condition might be necessary.) Normally for such questions, you will have been asked to define the MGF earlier in the question, and probably will have found the MGF of a particular distribution or family of distributions, which might well end up appearing as the final answer.

Sometimes such an argument might involve substituting in something unusual, like t/N, rather than t, into a known MGF. Normally a Taylor series can be used to show the final convergence result. If you have a fraction, try to cancel terms so that you only have to evaluate one Taylor series, rather than lots.

Using the Markov Property

The Markov property is initially confusing, but once we become comfortable with the statement, it is increasingly irritating to have to answer the question: “show that this process has the Markov property.” This question is irritating because in most cases we want to answer: “because it obviously does!” Which is compelling, but unlikely to be considered satisfactory in a mathematics exam. Normally we observe that the random dynamics of the next step are a function only of the present location. Looking for the word ‘independent’ in the statement of the process under discussion is a good place to start for any argument along these lines.

The most developed example of a Markov process in this course is the Poisson process. I’ve written far too much about this before, so I won’t do so again, except to say this. When we think of the Poisson process, we generally have two thoughts going through our minds, namely the equivalent definitions of IID exponential inter-arrival times, and stationary, Poisson increments (or the infinitesimal version). If we draw a sketch of a sample trajectory of this process, we can label everything up and it is clear how it all fits together. But if you are asked to give a definition of the Poisson process (N_t), it is inappropriate to talk about inter-arrival times unless you define them in terms of N_t, since that is the process you are actually trying to define! It is fine to write out

T_k:=\min\{t: N_t=k\},\quad N_t=\max\{k: Y_1+Y_2+\ldots+Y_k\le t\}

but the relation between the two characterisations of the process is not obvious. That is why it is a theorem of the course.

We have to be particularly careful of the difference in definition when we are calculating probabilities of various events. A classic example is this. Find the distribution of N_2, conditional on T_3=1. It’s very tempting to come up with some massive waffle to argue that the answer is 3+Po(1). The most streamlined observation is that the problem is easy if we are conditioning instead on N_1=3. We just use the independent Poisson increments definition of (N_t), with no reference to inter-arrival times required. But then the Markov property applied at time 1 says that the distribution of (N_2) depends only on the value of N_1, not on the process on the interval [0,1). In a sense, the condition that T_3=1 is giving us extra information on the behaviour of the process up to time 1, and the Markov property, which we know holds for the Poisson process, asserts precisely that the extra information doesn’t matter.

Generating Functions for Dice

So last week I was writing an article for Betting Expert about laws of large numbers, and I was trying to produce some representations of distributions to illustrate the Weak LLN and the Central Limit Theorem. Because tossing a coin feels too simplistic, and also because the natural state space for this random variable, at least verbally, is not a subset of the reals, I decided to go for dice instead. So it’s clear what the distribution of the outcome of a single dice roll is, and with a bit of thought or a 6×6 grid, we can work out the distribution of the average of two dice rolls. But what about 100 rolls? Obviously, we need large samples to illustrate the laws of large numbers! In this post, we discuss how to calculate the distribution of the sample mean of n dice rolls.

First we observe that the total set of outcomes of n dice rolls is 6^n. The sum of the outcomes must lie between n and 6n inclusive. The distribution of the sum and the distribution of the sample mean are equivalent up to dividing by n. The final observation is that because the total number of outcomes has a nice form, we shouldn’t expect it to make any difference to the method if we calculate the probability of a given sum, or the number of configurations giving rise to that sum.

Indeed, tying in nicely with the first year probability course, we are going to use generating functions, and there is no difference in practice between the probability generating function, and the combinatorial generating function, if the underlying mechanism is a uniform choice. Well, in practice, there is a small difference, namely a factor of 6 here. The motivation for using generating functions is clear: we are considering the distribution of a sum of independent random variables. This is pretty much exactly why we bother to set up the machinery for PGFs.

Anyway, since each of {1,2,…,6} is equally likely, the GF of a single dice roll is

x+x^2+\ldots+x^6=x\cdot \frac{1-x^6}{1-x}.

So, if we want the generating function of the sum of n independent dice rolls, we can obtain this by raising the above function to the power n. We obtain


Note the factor of x^n at the beginning arises because the minimum value of the sum is n. So to work out the number of configurations giving rise to sum k, we need to evaluate the coefficient of x^k. We can deal with (1-x^6)^n fairly straightforwardly, but some thought it required regarding whether it’s possible to do similar job on (1-x)^{-n}.

We have to engage briefly with what is meant by a binomial coefficient. Note that

\binom{x}{k}=\frac{x(x-1)\ldots(x-k+1)}{1\cdot\ldots\cdot k}

is a valid definition even when x is not a positive integer, as it is simply a degree k polynomial in x. This works if x is a general positive real, and indeed if x is a general negative real. At this stage, we do need to keep k a positive integer, but that’s not a problem for our applications.

So we need to engage with how the binomial theorem works for exponents that are not positive integers. The tricky part with the standard expression as

(a+b)^n=\binom{n}{0}a^n+\ldots + \binom{n}{n}b^n,

is that the attraction of this symmetry in a and b prompts us to work in more generality than is entirely necessary to state the result. Note if we instead write


we have unwittingly described this finite sum as an infinite series. It just happens that all the binomial coefficients apart from the first (n+1) are zero. The nice thing about this definition is that it might plausibly generalise to non-integer or negative values of n. And indeed it does. I don’t want to go into the details here, but it’s just a Taylor series really, and the binomial coefficients are set up with factorials in the right places to look like a Taylor series, so it all works out.

It is also worth remarking that it follows straight from the definition of a negative binomial coefficient, that

\binom{-n}{j}=(-1)^j \binom{n+j-1}{j}.

In any case, we can rewrite our expression for the generating function of the IID sum as

x^n\left[\sum_{k=0}^n \binom{n}{k}(-1)^k x^{6k}\right]\left[\sum_{j\ge 0} \binom{-n}{j}(-1)^j x^j\right]

By accounting for where we can gather exponents from each bracket, we can evaluate the coefficient of x^m as


Ie, k in the sum takes values in \{0,1,\ldots, \lfloor \frac{m+n}{6}\rfloor\}. At least in theory, this now gives us an explicit way to calculate the distribution of the average of multiple dice rolls. We have to be wary, however, that many compilers will not be happy dealing with large binomial coefficients, as the large factorials grow extremely rapidly. An approximation using logs is likely to be more tractable for larger settings.

Anyway, I leave you with the fruits of my labours.

dice30Related articles

Enhanced by Zemanta