# Lecture 8 – Bounds in the critical window

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

Preliminary – positive correlation, Harris inequality

I wrote about independence, association, and the FKG property a long time ago, while I was still an undergraduate taking a first course on Percolation in Cambridge. That post is here. In the lecture I discussed the special case of the FKG inequality applied in the setting of product measure setting, of which the Erdos-Renyi random graph is an example, and which is sometimes referred to as the Harris inequality.

Given two increasing events A and B, say for graphs on [n], then if $\mathbb{P}$ is product measure on the edge set, we have

$\mathbb{P}(A\cap B)\ge \mathbb{P}(A)\mathbb{P}(B).$

Intuitively, since both A and B are ‘positively-correlated’ with the not-rigorous notion of ‘having more edges’, then are genuinely positively-correlated with each other. We will use this later in the post, in the form $\mathbb{E}[X|A]\ge \mathbb{E}[X]$, whenever X is an increasing RV and A is an increasing event.

The critical window

During the course, we’ve discussed separately the key qualitative features of the random graph $G(n,\frac{\lambda}{n})$ in the

• subcritical regime when $\lambda<1$, for which we showed that all the components are small, in the sense that $\frac{1}{n}|L_1| \stackrel{\mathbb{P}}\rightarrow 0$, although the same argument would also give $|L_1|\le K\log n$ with high probability if we used stronger Chernoff bounds;
• supercritical regime when $\lambda>1$, for which there is a unique giant component , ie that $\frac{1}{n}|L_1|\stackrel{\mathbb{P}}\rightarrow \zeta_\lambda>0$, the survival probability of a Galton-Watson branching process with Poisson($\lambda$) offspring distribution. Arguing for example by a duality argument shows that with high probability all other components are small in the same sense as in the subcritical regime.

In between, of course we should study $G(n,\frac{1}{n})$, for which it was known that $L_1\stackrel{d}\sim n^{2/3},\, L_2\stackrel{d}\sim n^{2/3},\ldots$. (*) That is, the largest components are on the scale $n^{2/3}$, and there are lots of such critical components.

In the early work on random graphs, the story ended roughly there. But in the 80s, these questions were revived, and considerable work by Bollobas and Luczak, among many others, started investigating the critical setting in more detail. In particular, between the subcritical and the supercritical regimes, the ratio $\frac{|L_2|}{|L_1|}$ between the sizes of the largest and second-largest components goes from ‘concentrated on 1’ to ‘concentrated on 0’. So it is reasonable to ask what finer scaling of the edge probability $p(n)$ around $\frac{1}{n}$ should be chosen to see this transition happen.

Critical window

In this lecture, we studied the critical window, describing sequences of probabilities of the form

$p(n)=\frac{1+\lambda n^{-1/3}}{n},$

where $\lambda\in(-\infty,+\infty)$. (Obviously, this is a different use of $\lambda$ to previous lectures.)

It turns out that as we move $\lambda$ from $-\infty$ to $+\infty$, this window gives exactly the right scaling to see the transition of $\frac{|L_2|}{|L_1|}$ described above. Work by Bollobas and Luczak and many co-authors and others in the 80s establish a large number of results in this window, but for the purposes of this course, this can be summarised as saying that the critical window has the same scaling behaviour as $p(n)=1/n$, with a large number of components on the scale $\sim n^{2/3}$ (see (*) earlier), but different scaling limits.

Note: Earlier in the course, we have discussed local limits, in particular for $G(n,\lambda/n)$, where the local limit is a Galton-Watson branching process tree with offspring distribution $\mathrm{Poisson}(\lambda)$. Such local properties are not sufficient to distinguish between different probabilities within the critical window. Although there are lots of critical components, it remains the case that asymptotically almost all vertices are in ‘small components’.

The precise form of the scaling limit for

$\frac{1}{n^{2/3}} \left( |L_1|, |L_2|, |L_3|,\ldots \right)$

as $n\rightarrow\infty$ was shown by Aldous in 1997, by lifting a scaling limit result for the exploration process, which was discussed in this previous lecture and this one too. Since Brownian motion lies outside the assumed background for this course, we can’t discuss that, so this lecture establishes upper bounds on the correct scale of $|L_1|$ in the critical window. Continue reading

# Lecture 4 – Hitting time theorem

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

This lecture consisted of revision of the most relevant theory of Galton-Watson trees, with a focus on the case where the offspring distribution is Poisson, since, as we have seen in previous lectures, this is a strong candidate to approximate the structure of $G(n,\lambda/n)$. It makes sense to cover the theory of the trees before attempting to make rigorous the sense of approximation.

Given a Galton-Watson tree T, it is natural to label the vertices in a breadth-first order as $\varnothing=v_1,v_2,\ldots,v_{|T|}$. This is easiest if we have constructed the Galton-Watson tree as a subset of the infinite Ulam-Harris tree, where vertices have labels like (3,5,17,4), whose parent is (3,5,17). If this child vertex is part of the tree, then so are (3,5,17,1), (3,5,17,2), and (3,5,17,3). This means our breadth-first order is canonically well-defined, as we have a natural ordering of the children of each parent vertex.

Note: one advantage of using breadth-first order rather than depth-first order (which corresponds to the usual dictionary, or lexicographic ordering of the labels) is that if the tree is infinite, we don’t explore all of it during a depth-first search. (In the sense that there exist vertices which are never given a finite label.) For breadth-first search, a similar problem arises precisely when some vertex has infinitely many children. For a conventional Galton-Watson tree, the latter situation is much less of a problem than the infinite total population problem, which happens with positive probability whenever $\mu=\mathbb{E}[X]>1$.

Anyway, given the depth-first order, one can consider an exploration process $S_0,S_1,S_2,\ldots,S_{|T|}$ given by

$S_0=1,\quad S_i=S_{i-1}+(X_i-1),$

where $X_i$ is the number of children of $v_i$. In this way, we see that

$S_i=\big| \Gamma(v_1)\cup\ldots\cup\Gamma(v_i)\backslash \{v_1,\ldots,v_i\}\big|,\quad i\ge 1,$

records the number of vertices in some stack containing those which we have ‘seen but not explored’. Some authors prefer to start from 0, in which case one ends up with a similar but slightly different interpretation of the ‘stack’, but that’s fine since we aren’t going to define formally what ‘seen’ and ‘explored’ means in this post.

Essentially, we exhaust the vertices of the tree whenever $S_t=0$, and so the condition that $|T|=n$ requires

$S_n=0,\quad S_m\ge 1,\; m=0,1,\ldots,n-1.$

Conveniently, so long as we have avoiding ordering ambiguity, for example by insisting that trees live within the Ulam-Harris tree, we can reconstruct T uniquely from $(S_0,S_1,\ldots,S_{|T|})$.

Furthermore, if T is a Galton-Watson process, then the numbers of children $X_i$ are IID, and so in fact this exploration process is a random walk, and the size of the tree can be recovered as the hitting time of zero.

Note: making fully rigorous the argument that children in the GW tree are independent of the breadth-first walk fully rigorous is somewhat technical, and not to be dismissed lightly, though not of principle interest at the level of this topics course. See Proposition 1.5 in Section 1.2 of Le Gall’s notes or Section 1.2.2 of my doctoral thesis for further discussion and argument.

The hitting time theorem allows us to study the distribution of the hitting time of a random walk whose increments are bounded below by -1, in terms of the distribution of the value of the random walk.

Theorem: Let $(S_n,\, n\ge 0)$ be a random walk with $S_0=0$ and IID increments $(X_n,n\ge 1)$ satisfying $\mathbb{P}(X_n\ge -1)=1$. Let $H_{-k}=\inf \left\{n\,:\, S_n=-k\right\}$ be the hitting time of -k.

Then $\mathbb{P}\big( H_{-k}=n\big) = \frac{k}{n}\mathbb{P}\big(S_n=-k)$.

Commentary: there are local central limit theorem estimates and large deviation estimates that allow good control of the probability on the RHS for a rich class of contexts. So at a meta-level, the hitting time theorem allows us to reduce a complicated (though still classical) problem, to a real classical problem, which is particularly helpful when the LHS is a device for capturing relevant information about our random tree model.

# Lecture 3 – Couplings, comparing distributions

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

In this third lecture, we made our first foray into the scaling regime for G(n,p) which will be the main focus of the course, namely the sparse regime when $p=\frac{\lambda}{n}$. The goal for today was to give a self-contained proof of the result that in the subcritical setting $\lambda<1$, there is no giant component, that is, a component supported on a positive proportion of the vertices, with high probability as $n\rightarrow\infty$.

More formally, we showed that the proportion of vertices contained within the largest component of $G(n,\frac{\lambda}{n})$ vanishes in probability:

$\frac{1}{n} \left| L_1\left(G\left(n,\frac{\lambda}{n}\right)\right) \right| \stackrel{\mathbb{P}}\longrightarrow 0.$

The argument for this result involves an exploration process of a component of the graph. This notion will be developed more formally in future lectures, aiming for good approximation rather than bounding arguments.

But for now, the key observation is that when we ‘explore’ the component of a uniformly chosen vertex $v\in[n]$ outwards from v, at all times the number of ‘children’ of v which haven’t already been considered is ‘at most’ $\mathrm{Bin}(n-1,\frac{\lambda}{n})$. Since, for example, if we already know that eleven vertices, including the current one w are in C(v), then the distribution of the number of new vertices to be added to consideration because they are directly connected to w has conditional distribution $\mathrm{Bin}(n-11,\frac{\lambda}{n})$.

Firstly, we want to formalise the notion that this is ‘less than’ $\mathrm{Bin}(n,\frac{\lambda}{n})$, and also that, so long as we don’t replace 11 by a linear function of n, that $\mathrm{Bin}(n-11,\frac{\lambda}{n})\stackrel{d}\approx \mathrm{Poisson}(\lambda)$.

Couplings to compare distributions

coupling of two random variables (or distributions) X and Y is a realisation $(\hat X,\hat Y)$ on the same probability space with correct marginals, that is

$\hat X\stackrel{d}=X,\quad \hat Y\stackrel{d}=Y.$

We saw earlier in the course that we could couple G(n,p) and G(n,q) by simulating both from the same family of uniform random variables, and it’s helpful to think of this in general: ‘constructing the distributions from the same source of randomness’.

Couplings are a useful notion to digest at this point, as they embody a general trend in discrete probability theory. Wherever possible, we try to do as we can with the random objects, before starting any calculations. Think about the connectivity property of G(n,p) as discussed in the previous lecture. This can be expressed directly as a function of p in terms of a large sum, but showing it is an increasing function of p is essentially impossible by computation, whereas this is very straightforward using the coupling.

We will now review how to use couplings to compare distributions.

For a real-valued random variable X, with distribution function $F_X$, we always have the option to couple with a uniform U(0,1) random variable. That is, when $U\sim U[0,1]$, we have $(F_X^{-1}(U)\stackrel{d}= X$, where the inverse of the distribution function is defined (in the non-obvious case of atoms) as

$F_X^{-1}(u)=\inf\left\{ x\in\mathbb{R}\,:\, F(x)\ge u\right\}.$

Note that when the value taken by U increases, so does the value taken by $F_X^{-1}(U)$. This coupling can be used simultaneously on two random variables X and Y, as $(F_X^{-1}(U),F_Y^{-1}(U))$, to generate a coupling of X and Y.

The total variation distance between two probability measures is

$d_{\mathrm{TV}}(\mu,\nu):= \sup_{A}|\mu(A)-\nu(A)|$,

with supremum taken over all events in the joint support S of $\mu,\nu$. This is particularly clear in the case of discrete measures, as then

$d_{\mathrm{TV}}(\mu,\nu)=\frac12 \sum_{x\in S} \left| \mu\left(\{x\}\right) - \nu\left(\{x\}\right) \right|.$

(Think of the difference in heights between the bars, when you plot $\mu,\nu$ simultaneously as a bar graph…)

The total variation distances records how well we can couple two distributions, if we want them to be equal as often as possible. It is therefore a bad measure of distributions with different support. For example, the distributions $\delta_0$ and $\delta_{1/n}$ are distance 1 apart (the maximum) for all values of n. Similarly, the uniform distribution on [0,1] and the uniform distribution on $\{0,1/n,2/n,\ldots, n-1/n, 1\}$ are also distance 1 apart.

When there is more overlap, the following result is useful.

Proposition: Any coupling $(\hat X,\hat Y)$ of $X\sim \mu,\,Y\sim \nu$ satisfies $\mathbb{P}(X=Y)\le 1-d_{\mathrm{TV}}(\mu,\nu)$, and there exists a coupling such that equality is achieved. Continue reading

# Lecture 2 – Connectivity threshold

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

The goal of the second lecture was to establish the sharp phase transition for the connectivity of the random graph G(n,p(n)) around the critical regime $p(n)\sim \frac{\log n}{n}$. In the end, we showed that when $\omega(n)$ is any diverging sequence, and $p(n)=\frac{\log n-\omega(n)}{n}$, then we have that G(n,p(n)) is with high probability not connected.

In the next lecture, we will finish the classification by studying $p(n)=\frac{\log n+\omega(n)}{n}$, and show that for this range of p, the graph G(n,p(n)) is with high probability connected.

The details of the lecture, especially the calculation, are not presented fully here. There, I followed van der Hofstad’s recent book fairly closely, sometimes taking different approximations and routes through the algebra, though all versions remain fairly close to the original enumerations by Renyi.

Immediate remarks

• One is allowed to be surprised that for almost all scalings of p(n), G(n,p) is either whp connected or whp not connected. The speed of the transition is definitely interesting.
• As defined in lectures, the property that a graph is connected is an increasing property, meaning that it is preserved when you add additional edges to the graph.
• Because of the natural coupling between G(n,p) and G(n,q), the fact that connectedness is an increasing property makes life easier. For example, we can insist temporarily that $\omega(n)\ll \log n$, or whatever scaling turns out to be convenient for the proof, but conclude the result for all diverging $\omega(n)$. This avoids the necessity for an annoying case distinction.

Heuristics – Isolated vertices

It turns out that the ‘easiest’ way for such a graph to be disconnected is for it to have an isolated vertex. In determining that the graph has a cut into classes of sizes a and b with no edges between them, there is a trade-off between the number of ways to choose the partition (which increases with min(a,b) ) and the probabilistic penalty from banning the ab edges between the classes (which decreases with min(a,b) ). It turns out that the latter effect is slightly stronger, and so (1,n-1) dominates.

Method 1: second-moment method

In the case $p(n)=\frac{\log n - \omega(n)}{n}$, we use a second-moment method argument to establish that G(n,p) contains an isolated vertex with high probability. Note that a given vertex v is isolated precisely if n-1 edges are not present. Furthermore, two given vertices v,w are both isolated, precisely if 2n-3 edges are not present. So in fact, both the first moment and the second moment of the number of isolated vertices are straightforward to evaluate.

It turns out that the number of isolated vertices, $Y_n$, satisfies

$\mathbb{E}[Y_n]= \exp(\omega(n)+o(1))\rightarrow\infty.$ (*)

As always, we have to eliminate the possibility that this divergent expectation is achieved by the graph typically having no isolated vertices, but occasionally having very many. So we turn to the second moment, and can show

$\mathrm{Var}(Y_n)= (1+o(1))\mathbb{E}[Y_n],$

and so by Chebyshev’s inequality, we have $\mathbb{P}(Y_n=0)\rightarrow 0$.

Method 2: first-moment method

Counter-intuitively, although the case $p(n)=\frac{\log n + \omega(n)}{n}$ requires only a first-moment method, it is more technical because it involves the non-clear direction of the informal equivalence:

$\text{Connected}\; \iff ''\; \text{no isolated vertices}.$

At the time we showed (*), we also showed that for this regime of p(n), G(n,p) whp has no isolated vertices. It remains to show that it has no splits into (unions of) connected components of sizes k and n-k. Continue reading

# Random Graphs – Lecture 1

My plan is to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics about the course can be found here.

In the first lecture, we revised some basic definitions about graphs, focusing on those which are most relevant to a first study of the Erdos-Renyi random graph G(n,p) which will be the focus of the lecture course. We discussed in abstract why the independence of the (potential) edges makes the model easier to analyse, but reduces its suitability as a direct model for lots of networks one might see in the real world, where knowledge that A is directly connected to both B and C affects the probability that B is directly connected to C, in either direction. Thinking about the Facebook friendship graph is one of the best examples, where in this case, we expect this extra information to increase the probability that B and C are connected. Even as the world moves away from heteronormativity, it realistically remains the case that in a graph of the dating history amongst a well-defined community we would likely observe the opposite effect.

All of these more complicated phenomena can be captured by various random graphs, but G(n,p) remains the corner stone, evinced by the $>10^5$ citations towards one of Erdos and Renyi’s original papers on the topic.

Somewhat paraphrasing, one of their (well, mostly Renyi’s) original questions was: when n is large, what should p be so that there’s a good chance that G(n,p) is connected?

The answer to this question lies in Lecture 2, but to cement understanding of the model, and explore some key methods for proofs in discrete probability (as well as play around with the big-O and little-o notation), we investigated the following two situations, which are very far from interesting as far as connectivity of G(n,p) is concerned.

Dense regime

When p is fixed, there are many interesting questions one could ask about the asymptotic properties of G(n,p), but connectivity is not one of them. In particular, for $p\in(0,1)$ we claim:

Proposition: $\mathrm{diam}(G(n,p)) \stackrel{\mathbb{P}}\rightarrow 2$ as $n\rightarrow\infty$.

Note that if $\mathrm{diam}(G(n,p))=1$, then $G(n,p)\simeq K_n$, the complete graph on n vertices. In other words, every possible edge is actually present. But the probability of this event is $p^{\binom{n}{2}}\rightarrow 0$, so long as p<1.

It then suffices to prove that $\mathbb{P}(G(n,p)>2) \rightarrow 0$. We use a union bound, where we study the probability that the graph distance $d_{G(n,p}(v,w)>2$ for two fixed vertices $v\ne w$ first, and then sum over all such pairs. Of course, there is a probability p that the two vertices are directly connected by an edge. Then, there are (n-2) other vertices with the potential to be a common neighbour of v and w, which would ensure that the graph distance between them is at most two. So

$\mathbb{P}(d_{G(n,p)}(v,w)>2)=(1-p)[1-p^2]^{n-2} .$

Note that we are using independence throughout this calculation. Then comes the union bound:

$\mathbb{P}(\mathrm{diam}(G(n,p)) >2) \le \sum_{v\ne w \in[n]} \mathbb{P}(d_{G(n,P)}(v,w)>2)$

$\le \binom{n}{2} (1-p)[1-p^2]^{n-2} \rightarrow 0,$

since exponential decay ‘kills’ polynomial growth.

Ultra-sparse regime

In general, we work in the setting where p=p(n) depends on n. If p(n) decays fast enough (see Exercise 2), then with high probability G(n,p) has no edges at all. However, when $p(n)=o(n^{-3/2})$ we have

Proposition: $\mathbb{P}(\text{edges of }G(n,p)\text{ form a matching}) \rightarrow 1$ as $n\rightarrow\infty$.

A matching is a collection of edges with no vertices in common. So if the edge set of the graph is a matching, we have essentially no interesting connectivity structure at all. The longest path has length one, for example.

To prove this, note that the edge set of the graph fails to be a matching precisely if one of the vertices has degree at least two. But since a vertex v is connected to each of the (n-1) other vertices in the graph independently with probability p, we have

$\mathrm{deg}_{G(n,p)}(v) \sim \mathrm{Bin}(n-1,p),$

and so we can directly make the crude approximation

$\mathbb{P}(\mathrm{deg}_{G(n,p)}(v) =k) = \binom{n-1}{k}p^k(1-p)^{n-1-k}\le n^k p^k.$

We’ve made this very weak bound to make life easier when we sum:

$\mathbb{P}(\mathrm{deg}_{G(n,p)}(v) \ge 2) \le \sum_{k\ge 2}(np)^k = \frac{(np)^2}{1-np}.$

Since $p=o(n^{-3/2})$, we have $\frac{1}{1-np}=\frac{1}{1-o(1)}=1+o(1)$, and overall we obtain

$\mathbb{P}(\mathrm{deg}_{G(n,p)}(v) \ge 2) = o(\frac{1}{n}).$

Again, we finish with a union bound, considering this event across all vertices $v\in[n]$.

$\mathbb{P}(E(G(n,p))\text{ not a matching}) \le \sum_{v\in[n]} \mathbb{P}(\mathrm{deg}_{G(n,p)}(v)\ge 2)$

$=n\mathbb{P}(\mathrm{deg}_{G(n,p)}(1)\ge 2) = o(1),$

as required.

Next time

In the next lecture, we’ll study the regime $p(n)\sim \frac{\log n}{n}$, where G(n,p) experiences a phase transition from probably not connected to probably connected. Part of this involves making the notion probably connected precise, which will be useful throughout the rest of the course, as well as establishing the language for comparing G(n,p) and G(n,q).

The proof itself requires some more sophisticated versions of calculations from Lecture 1, and more sophisticated probabilistic tools (first- and second-moment methods) to convert them into statements about convergence in probability. This will be an advertisement for the more classical enumerative methods that underpinned much of the early work on random graphs.

The rest of the course will exploit much more some comparisons and embeddings involving branching processes and exploration processes, so don’t worry – it won’t be 26 hours of counting trees!

# Linear Algebra II: Eigenvectors and Diagonalisability

This post continues the discussion of the Oxford first-year course Linear Algebra II. We’ve moved on from determinants, and are now considering eigenvalues and eigenvectors of matrices and linear maps.

A good question to ask is: what’s the point of knowing about eigenvectors? I can think of a quick answer and a longer answer. The quick answer is that whenever we have a mapping of any kind, it is natural to ask about its fixed points. And since we are thinking about vector spaces and linear maps, if we can’t find any fixed points, we might nonetheless be able to find the best thing, some vectors whose direction is fixed by the map. In general, knowing about fixed points of a mapping might tell us other more qualitative properties, including the behaviour seen when you apply the map iteratively a large number of times. (Indeed a recent post discusses this exact problem for positive matrices in a context relevant to a chapter of my thesis…)

A more specific answer concerns bases. Recall that a linear map is defined independently of any basis: it’s just a map from the vector space to itself. We can express the linear map via a matrix with respect to some basis, but how to choose the basis? We could always choose the canonical basis in $\mathbb{R}^n$, since it’s easy to do vector and matrix calculations when most of the entries of all the vectors are zero. We also have a good visual idea (at least in up to three dimensions) of what a matrix might mean with respect to that basis. If we needed to divide the three-dimensional world around us into small volumes, we’d tend to describe it with small cubes rather than small arbitrary parallelopipeds.

But once we know something about the linear map, we might want to choose a basis of vectors on which the behaviour of the map is particularly easy to describe. And eigenvectors fulfil precisely this role. If we are able to choose a basis of eigenvectors, describing the map’s action, either abstractly, or via a (diagonal) matrix, is very straightforward. If we are given a matrix to begin with, we know how to do a change of basis, and changing to the basis of eigenvectors is precisely what’s going when we write $A=P^{-1}DP$, where D is a diagonal matrix. We construct P by taking its columns to be these eigenvectors. In particular, for a given vector x, y=Px is the vector giving the coefficients of x in the basis of eigenvectors.

So the case where we have a basis of eigenvectors is particularly useful, and in this case, we say the matrix or the map is diagonalisable. Remember how we find eigenvalues. If there exists a non-zero vector x satisfying $Ax=\lambda x$, then x is in the kernel of $A-\lambda I$. As we discussed last time, introducing the determinant gives a much more manageable way to verify which values of $\lambda$ result in $A-\lambda I$ having a non-trivial kernel. In particular, if non-zero x is in the kernel, we have $\mathrm{det}(A-\lambda I)=0$, and this leads to a polynomial of degree n (the dimensional of the vector space / size of the matrix) for $\lambda$, called the characteristic polynomial $\chi_A(z)$, which has the eigenvalues as its roots.

If we agree to work over the complex field, then this is good, because it means we always have eigenvalues, and so it becomes sensible to talk about exactly how many eigenvalues and eigenvectors we have. Observe that if we restrict to real vector spaces, this might not be the case. In the plane, the rotation by $\pi/2$ for example has no fixed vectors.

Multiplicities of eigenvalues

We call the algebraic multiplicity $\alpha(\lambda)$ of an eigenvalue $\lambda$ to be the exponent of the factor $(z-\lambda)$ in the factorisation of the characteristic polynomial. To define the geometric multiplicity, observe that all the eigenvectors with eigenvalue $\lambda$ form a subspace, and so it is meaningful to talk about the dimension of this subspace (‘eigenspace’), which is the geometric multiplicity $\gamma(\lambda)$. There are two facts that one needs to remember. The slightly less obvious one is that $\gamma(\lambda)\le \alpha(\lambda)$ for all $\lambda$. One can see this by, for example, working in a basis that extends a basis of the $\lambda$-eigenspace. Observe at this stage that the sum of the algebraic multiplicities has to be n by definition, while the sum of geometric multiplicities is at most n. And this makes sense, because the space spanned by all the eigenvectors is a subspace, and so has dimension at most n.

The more obvious, but more frequently forgotten result is that

$\alpha(\lambda)\ge 1 \quad \iff \quad \gamma(\lambda)\ge 1,$

which is simply a consequence of the property discussed a few paragraphs previously concerning the kernel of $A-\lambda I$.

In particular, we might make the heuristic observation that ‘most’ polynomials of degree n have n distinct roots. This is certainly true for quadratics: there is only one value that the discriminant can take such that we see a repeated root. Alternatively, imagine shifting the quadratic up and down (in a complex way if necessary); again there is only one moment at which it might have a repeated root. This observation can be generalised easily to higher degree polynomials in a number of ways.

So if we lift this observation across to matrices, we see that most matrices have n distinct eigenvalues, and thus have n linearly independent eigenvectors which form a basis, hence the matrix is diagonalisable. I think it’s really worth reflecting on this, since much of a first exploration into linear algebra ends up treating exactly the case where the matrix is not diagonalisable.

The principal example of a non-diagonalisable matrix is $\begin{pmatrix}2&1\\0&2\end{pmatrix}$, where the 2s can be replaced by any value, and the 1 can be replaced by an non-zero value. There’s plenty to learn about to what extent versions of this matrix of higher size represent all non-diagonalisable matrices, but such an exposition of Jordan normal form comes next year for the students taking this course.

It probably is worth saying now though, that this example gives a good sanity check for whether a method is actually using diagonalisability correctly. For example, it is easily seen that elementary row operations to not preserve diagonalisability by starting from $\begin{pmatrix}2&0\\0&2\end{pmatrix}$ and ending up at our counter-example. One could also argue from this that the set of non-diagonalisable matrices are dense within the set of matrices with a repeated eigenvalue. That is, having a repeated eigenvalue but full eigenspace is doubly-infinitely-unlikely.

Cayley-Hamilton theorem

Anyway, among other results, we also saw the Cayley-Hamilton theorem, which states that a matrix A satisfies its own characteristic equation. That is $\chi_A(A)=0$, where the zero on the right-hand side is the zero matrix. It’s tempting to substitute A into the expression $\mathrm{det}(A-\lambda I)$, but of course this is not valid. Indeed imagine a typical eigenvalue determinant matrix with terms like $(7-\lambda)$ on the diagonal; it doesn’t make sense to substitute a matrix for $\lambda$ as one of the entries of the overall matrix!

Fortunately, we can argue convincingly in the case where A is a diagonalisable matrix. Remember that $\chi_A(A)$ is a matrix. Now looki at the action of $\chi_A(A)$ on any eigenvector v, corresponding to eigenvalue $\lambda$. Applying some power of A to v gives v multiplied by the same power of $\lambda$, and so we end up with

$\chi_A(A)v = \chi_A(\lambda)v = 0.$

This only worked when v was an eigenvector, but fortunately there is a basis of eigenvectors if A is diagonalisable, and so $\chi_A(A)v=0$ for all v, hence $\chi_A(A)=0$.

But $\chi_A(A)$ is just a matrix-valued function of A. If you think about it, $\chi_A$ is a monic polynomial, all of whose non-leading coefficients are multinomials of degree at most n-1 in the entries of A. Furthermore, these multinomials have (non-negative) integer coefficients. Therefore the entries of $\chi_A(A)$ are multinomials of degree at most 2n-1 in the entries of A, and again have (non-negative) integer coefficients.

Even without the integrality of the coefficients, this says that, under any reasonable definition of continuity of matrices (which could be induced from any topology on $\mathbb{R}^{n\times n}$) the function $\chi_A(A)$ should be continuous as a function of A. But we’ve shown $\chi_A(A)=0$ for all diagonalisable A, and also argued that most complex-valued matrices are diagonalisable. Turning this into a formal statement about denseness means that we’ve shown the Cayley-Hamilton theorem for non-diagonalisable matrices also. It feels that because the coefficients are non-negative integers, we might also have shown the result for other fields too, but I have minimal knowledge or recollection at the moment of the things one has to check for this sort of result.

It’s worth ending with the brief comment that Cayley-Hamilton is useful, among other reasons because it enables us to write the inverse of A as a polynomial of degree at most n-1 in terms of A. In many settings this is a lot easier to work with in terms of calculations than an argument with minors.

# Linear Algebra II: Determinants 2

In the previous post, we introduced determinants of matrices (and by extension linear maps) via its multilinearity properties, and as the change-of-volume factor. We also discussed how to calculate them, via row operations, or Laplace expansion, or directly via a sum of products of entries over permutations.

The question of why this is ever a useful quantity to consider remains, and this post tries to answer it. We’ll start by seeing one example where this is a very natural quantity to consider, and then the main abstract setting, where the determinant is zero, and consider a particularly nice example of this.

Jacobeans as a determinant

We consider integration by substitution. Firstly, in one variable: when it comes to Riemann integration of a function g(x) with respect to x, we view dx as the width of a small column which approximates the function near x. Now, if we reparameterise, that is if we write x=f(y) for some well-behaved (in particular differentiable) function f, then the width of the column is dx= dy.(dx/dy)=f'(y) dy. This may be negative, if y is decreasing while x is increasing, but for now let’s not worry about this overly, for example by assuming the function g is non-negative. Thus if we want to integrate with y as the variable, we multiply the integrand by this factor $|f'(y)|$.

What about in higher dimensions? We have exactly the same situation, only instead of two-dimensional columns, we have (n+1)-dimensional columns. We then multiply the n-dimensional volume of the base by the height, again given by $g(\mathbf{x})$. If we have a similar transformation of the base variable $\mathbf{x}=f(\mathbf{y})$, we differentiate to get

$\mathrm{d}x_i = \sum_{j=1}^n\frac{\mathrm{d}f_i}{\mathrm{d}y_j} \mathrm{d}y_j.$

In other words

$\mathrm{d}\mathbf{x}= J \mathrm{d}\mathbf{y},$

where J is the Jacobean matrix of partial deriatives. In particular, we know how to relate the volume $[0,\mathrm{d}x_1]\times\ldots\times [0,\mathrm{d}x_n]$ to the volume $[0,\mathrm{d}y_1]\times \ldots\times [0,\mathrm{d}y_n]$. It’s simply the determinant of the Jacobean J. So if we want to integrate with respect to $\mathbf{y}$, it only remains to pre-multiply the integrand by $|\mathrm{det}J|$ and proceed otherwise as in the one-dimensional case.

Det A = 0

A first linear algebra course might well motivate the introducing matrices as a notational shortcut for solving families of linear equations, $Ax=b$. The main idea is that generally we can solve this equation uniquely. Almost all of the theory developed in such a first linear algebra course deals with the case when this fails to hold. In particular, there are many ways to characterise this case, and we list some of them now:

• Ax=b has no solutions for some b;
• A is not invertible;
• A has non-trivial kernel, that is, with dimension at least one;
• A does not have full rank, that is, the image has dimension less than n;
• The columns (or indeed the rows) are linearly dependent;
• The matrix can be row-reduced to a matrix with a row of zeroes.

It is useful that these are equivalent, as in abstract problems one can choose whichever interpretation from this list is most relevant. However, all of these are quite hard to check. Exhibiting a non-trivial kernel element is hard – one either has to do manual row-reduction, or the equivalent in the context of linear equations. But we can add the characterisation

• det A = 0;

to the list. And this is genuinely much easier to check for specific examples, either abstract or numerical.

Let’s quickly convince ourselves of a couple of these equivalences. Determinant is invariant under row-reductions, and by multilinearity it is certainly the case that det A = 0 if A has a row of zeroes. We also said that A is the change-of-volume factor. Note that A is a map from the domain to its image, so if A has less than full rank, then any set in the image has zero volume.

The Vandermonde matrix

This is a good example of this theory in practice. Consider the Vandermonde matrix where each row is a geometric progression:

$V=\begin{pmatrix}1&\alpha_1&\ldots&\alpha_1^{n-1}\\1&\alpha_2&\ldots&\alpha_2^{n-1}\\ \vdots&\vdots&\ddots&\vdots\\ 1&\alpha_n&\ldots&\alpha_n^{n-1}\end{pmatrix}.$

Now suppose we attempt to solve

$V\begin{pmatrix}a_0\\a_1\\ \vdots\\ a_{n-1}\end{pmatrix}=\begin{pmatrix}b_1\\b_2\\ \vdots \\ b_n\end{pmatrix}.$

There’s a natural interpretation to this, that’s especially clear with this suggestive notation. Each row corresponds to a polynomial, where the coefficients are given by the $(a_0,a_1,\ldots,a_{n-1})$, and the argument is given by $\alpha_i$.

So if we try to solve for $(a_0,a_1,\ldots,a_{n-1})$, given $(\alpha_1,\ldots,\alpha_n)$ and $(b_1,\ldots,b_n)$, we are asking whether we can find a polynomial P with degree at most n-1 such that $P(\alpha_i)=b_i$ for \$i=1. Lagrange interpolation gives an argument where we just directly write down the relevant polynomial, but we can also deploy our linear algebraic arguments too.

The equivalence of all these statements means that to verify existence and uniqueness of such a polynomial, we only need to check that the Vandermonde matrix has non-zero determinant. And in fact there are a variety of methods to show that

$\mathrm{det}V=\prod_{1\le i< j}\le n(\alpha_j-\alpha_i).$

For the polynomial question to be meaningful, we would certainly demand that the $(\alpha_i)$ are distinct, and so this determinant is non-zero, and we’ve shown that n points determine a degree (n-1) polynomial uniquely.

If we multiply on the left instead, suppose that we are considering a discrete probability distribution X that takes n known values $(\alpha_1,\ldots,\alpha_n)$ with unknown probabilities $(p_1,\ldots,p_n)$. Then we have

$(p_1,\ldots,p_n) V = (1,\mathbb{E}X, \mathbb{E}[X^2],\ldots, \mathbb{E}[X^{n-1}]).$

So, again by inverting the Vandermonde matrix (which is know is possible since its determinant is non-zero…) we can recover the distribution from the first (n-1) moments of the distribution.

A similar argument applies to show that the Discrete Fourier Transform is invertible, and in this case (where the $\alpha_i$s are roots of unity), the expression for the Vandermonde determinant is particularly tractable.

# Linear Algebra II: Determinants 1

This term, I’m giving tutorials on a course that’s new to me, the apparently notorious ‘Linear Algebra II’ for first year undergraduates. I can appreciate how it might have ended up with this reputation, but as always, every challenge is also an opportunity, So I’m going to (try to) write a short series of blog posts about what we’ve discussed in the tutorials.

The first problem sheet-and-a-half concerned determinants of matrices. There are three things worth addressing here:

1. What are abstract definitions, and which is most useful in each setting?
2. How to actually calculate them?
3. What’s the overall point?

The answers are obviously not completely unrelated, but we’ll probably defer the third question to a second post.

The determinant is a map from the set of matrices $\mathcal{M}_n$ to the base field (hereafter assumed to be $\mathbb{R},\mathbb{C}$). The Oxford course defines it through its properties:

• Multilinear in the columns of the matrix.
• Equal to zero if two columns are equal.
• Equal to one if the matrix is the identity.

One then checks that there is a unique such map, and so from now on it’s reasonable to call it the determinant of the matrix. It will follow from pretty much any consequence that we can replace ‘columns’ with ‘rows’ throughout and get the same map.

Other definitions

We have a closed form expression for the determinant given via permutations of n

$\mathrm{det}(A)=\sum_{\sigma\in \Sigma_n} \mathrm{sign}(\sigma) a_{1\sigma(1)}\ldots a_{n\sigma(n)}.$

We’ll come back to a discussion of when this particular definition is useful. It can be derived by carefully transforming the identity matrix into A, using the operations which are mentioned in the original definition of the determinant, in particular, keeping track of the number of transpositions of columns.

It’s clear from any definition that the determinant is a polynomial of degree n in the entries of the matrix, but this definition will be useful if you want to make some more precise comment on the nature of this polynomial. For example, if entries of the matrix are polynomials in x of various degree (think of the eigenvalue equation for example) this allows you to control (or at least bound) the overall degree of the determinant as a polynomial in x.

The determinant is also the volume of the n-dimensional parallelopiped formed by the column vectors of the matrix. This is easy to check in two dimensions, for the matrix $\begin{pmatrix}a&b\\c&d\end{pmatrix}$:

To calculate the area of the central parallelogram, we have to subtract the area of two small rectangles and four small triangles from the outer rectangle, obtaining

$(a+b)(c+d)-2bc - \frac12(ac+ac+bd+bd)=ad-bc,$

as we expect. This calculation is harder to execute in higher dimensions, and certainly harder to visualise.

Maybe, though, we don’t have to, so long as we can reassure ourselves that this volume satisfies the implicit definition of the determinant map at the start. Multilinearity in the columns is not that hard to see. If we multiply the jth column by some constant, we are stretching the parallelopiped by the same constant factor in one direction, and so the volume grows appropriately. The additivity property can similarly be thought of as joining together two parallelopipeds at their common face (which is common since the other column vectors have to be constant in this construction). If two column vectors are equal, then clearly this volume actually has dimension at most n-1, and thus volume zero, so the final two conditions are genuinely easy to check.

The challenge here is that there is a direction involved. Determinants can be negative, but in our classical viewpoint, areas generally are not. In 2D, we can think of this as saying that the area is positive if the vector (b,d) lies anti-clockwise from (a,c) in the parallelogram, while is it negative otherwise. Again, this is harder to visualise in higher dimensions, but it is at least plausible that one could develop a similar decomposition. Ultimately, we are happy with the notion of directed lengths (ie vectors on the real line), and these are easy to add up without having to separate into cases, and the case holds for areas and higher-dimensional volumes.

Evaluating determinants

If we actually want to compute the determinant of a given matrix, the sum over permutations is intractable since it doesn’t have any natural splits into stages. The implicit definitions and this area consideration are clearly useless for all but the most special of examples.

The Laplace expansion is the usual algorithm to calculate the determinant of an n x n matrix. You pick a row (or a column), and evaluate the determinants of the (n-1) x (n-1) minor matrices given by deleting this row (or column), and each column (or row) in turn. This leaves us with n determinants of smaller matrices, which we pre-multiply by the entries in the original deleted row (or column), and add up in an alternating way (*). This is highly computationally intensive for large matrices, but for 3×3 and 4×4 can be done by hand with probability of an error bounded away from 1.

There is the flexibility to choose the reference row or column. Since the entries of these affect the sum through small products, it is highly convenient to choose a row or column with a lot of zeros. In particular, if there’s a row or column with exactly one non-zero entry, this is an ideal candidate.

The sum over permutations also works well when a lot of the entries are zero, because then a lot of the permutations give a summand which is zero. Upper-triangular matrices are a good example: only one permutation (the identity permutation) avoids all the zero elements underneath the diagonal.

One can also observe from the multilinearity property of the determinant map that there are lots of operations we can apply to the matrix which leave the determinant fixed. These are often called elementary row operations, though obviously we can apply them to the columns as well. To summarise, if we interchange two rows, the sign of the determinant is reversed. And if we add some multiple of one row to any other row, the determinant stays the same.

When matrices are not square, it’s quite important to be specific about exactly what form you can reduce a general matrix to via such row operations, but in this context, it’s not hugely important. Reduced echelon form (without the condition that leading coefficients will be one) is achievable, but this is a special case of an upper triangular matrix, for which the determinant is given by the product of the diagonal entries, ie is easy.

Whether this is substantially easier than Laplace expansion depends on the matrix itself and taste, both to do manually, and to code.

(*) I’m not a fan personally of this alternating definition. It seems to me much more natural to define the minor as

$M^{i,j}=(a_{i+k,j_\ell})_{1\le k,\ell\le n-1},$

with indices taken modulo n. Then you don’t have any $\pm 1$s in the Laplace expansion.

Using determinants in abstract problems

So the determinant gives directly the area of the image (under A) of the unit hypercube. By linearity (of A), it is easy to see that it also gives the scale factor of the area change (under A) of any hyper-cuboid, parallel to the conventional axes, anywhere in the space. Then, eg by approximating any sensible n-dimensional shape (*) as a union of such hyper-cuboids, we can show that in fact the area of any sensible shape increases by a factor (det A) under application of A.

This is a good thing to remember, because it is an excellent heuristic for seeing why the determinant of a linear map is basis-independent. It also gives a much easier proof of the key result

$\mathrm{det}(AB)=\mathrm{det}(A) \mathrm{det}(B),$

than that given by fixing B and viewing det(AB) as a map from matrices to the field, just like the original definition of determinant.

Some of the theory in the course is proved using elementary row operations. But these invite complicated notation, so are best used only in simple arguments, or when things are fairly explicit to begin with. Given an abstract problem about determinants of matrices, it is often tempting to induct on the size of the matrix in some way. I think it’s worth saying that even though the Laplace expansion is explicitly set up in this way, the notation involved is also likely to be annoying here, while permutations are easy to describe inductively: eg let $\sigma(1)=k$, then view the remainder of the permutation as a bijection $\{2,3,\ldots,n\}\rightarrow [n]\backslash \{k\}$.

Shortly, we’ll have a second post answering the final question: what’s the point of working with determinants? We’ve already seen half an answer, in that they describe the change-of-volume factor of a matrix (or linear map), but this can be substantially developed.

# Hitting Probabilities for Markov Chains

This continues my previous post on popular questions in second year exams. In the interest of keeping it under 2,500 words I’m starting a new article.

In a previous post I’ve spoken about the two types of Markov chain convergence, in particular, considering when they apply. Normally the ergodic theorem can be used to treat the case where the chain is periodic, so the transition probabilities do not converge to a stationary distribution, but do have limit points – one at zero corresponding to the off-period transitions, and one non-zero. With equal care, the case where the chain is not irreducible can also be treated.

A favourite question for examiners concerns hitting probabilities and expected hitting times of a set A. Note these are unlikely to come up simultaneously. Unless the hitting probability is 1, the expected hitting time is infinite! In both cases, we use the law of total probability to derive a family of equations satisfied by the probabilities/times. The only difference is that for hitting times, we add +1 on the right hand side, as we advance one time-step to use the law of total probability.

The case of hitting probabilities is perhaps more interesting. We have:

$h_i^A = 1,\; i\in A, \quad h_i^A=\sum_{j\in S}p_{ij}h_j^A,\; i\not\in A.$

There are two main cases of interest: where the chain is finite but has multiple closed communicating classes, and where the chain is infinite, so even though it is irreducible, a trajectory might diverge before hitting 0.

For the case of a finite non-irreducible Markov chain, this is fairly manageable, by solving backwards from states where we know the values. Although of course you could ask about the hitting probability of an open state, the most natural question is to consider the probability of ending up in a particular closed class. Then we know that the hitting probability starting from site in the closed class A is 1, and the probability starting from any site in a different closed class is 0. To find the remaining values, we can work backwards one step at a time if the set of possible transitions is sparse enough, or just solve the simultaneous equations for $\{h_i^A: i\text{ open}\}$.

We therefore care mainly about an infinite state-space that might be transient. Typically this might be some sort of birth-and-death chain on the positive integers. In many cases, the hitting probability equations can be reduced to a quadratic recurrence relation which can be solved, normally ending up with the form

$h_i=A+B\lambda^i$,

where $\lambda$ might well be q/p or similar if the chain is symmetric. If the chain is bounded, typically you might know $h_0=1, h_N=0$ or similar, and so you can solve two simultaneous equations to find A and B. For the unbounded case you might often only have one condition, so you have to rely instead on the result that the hitting probabilities are the minimal solution to the family of equations. Note that you will always have $h^i_i=1$, but with no conditions, $h^i_j\equiv 1$ is always a family of solutions.

It is not clear a priori what it means to be a minimal solution. Certainly it is not clear why one solution might be pointwise smaller than another, but in the case given above, it makes sense. Supposing that $\lambda<1$, and A+B=1 say, then as we vary the parameters, the resulting set of ‘probabilities’ does indeed vary monotically pointwise.

Why is this true? Why should the minimum solution give the true hitting probability values? To see this, take the equations, and every time an $h_i^A$ appears on the right-hand side, substitute in using the equations. So we obtain, for $i\not\in A$,

$h_i^A=\sum_{j\in A}p_{ij}+\sum_{j\not\in A} p_{ij}h_j^A,$

and after a further iteration

$h_i^A=\sum_{j_1\in A}p_{ij_1}+\sum_{j_1\not\in A, j_2\in A}p_{ij_1}p_{j_1j_2}+\sum_{j_1,j_2\not\in A}p_{ij_1}p_{j_1j_2}h_{j_2}^A.$

So we see on the RHS the probability of getting from i to A in one step, and in two steps, and if keep iterating, we will get a large sum corresponding to the probability of getting from i to A in 1 or 2 or … or N steps, plus an extra term. Note that the extra term does not have to correspond to the probability of not hitting A by time N. After all, we do not yet know that $(h_{i}^A)$ as defined by the equations gives the hitting probabilities. However, we know that the probability of hitting A within N steps converges to the probability of hitting A at all, since the sequence is increasing and bounded, so if we take a limit of both sides, we get $h_i^A$ on the left, and something at least as large as the hitting probability starting from i on the right, because of the extra positive term. The result therefore follows.

It is worth looking out for related problems that look like a hitting probability calculation. There was a nice example on one of the past papers. Consider a simple symmetric random walk on the integers modulo n, arranged clockwise in a circle. Given that you start at state 0, what is the probability that your first return to state 0 involves a clockwise journey round the circle?

Because the system is finite and irreducible, it is not particularly interesting to consider the actual hitting probabilities. Also, note that if it is convenient to do so, we can immediately reduce the problem when n is even. In two steps, the chain moves from j to j+2 and j-2 with probability ¼ each, and stays at j with probability ½. So the two step chain is exactly equivalent to the lazy version of the same dynamics on n/2.

Anyway, even though the structure is different, our approach should be the same as for the hitting probability question, which is to look one step into the future. For example, to stand a chance of working, our first two moves must both be clockwise. Thereafter, we are allowed to move anticlockwise. There is nothing special about starting at 0 in defining the original probability. We could equally well ask for the probability that starting from j, the first time we hit 0 we have moved clockwise round the circle.

The only thing that is now not obvious is how to define moving clockwise round the circle, since it is not the case that all the moves have to be clockwise to have experienced a generally clockwise journey round the circle, but we definitely don’t want to get into anything complicated like winding numbers! In fact, the easiest way to make the definition is that given the hitting time of 0 is T, we demand that the chain was at state n at time T-1.

For convenience (ie to make the equations consistent) we take $h_0=0, h_n=1$ in an obvious abuse of notation, and then

$h_j=\frac12h_{j-1}+\frac12 h_{j+1},$

from which we get

$h_j=a+bj \Rightarrow h_j=\frac{j}{n}.$

Of course, once we have this in mind, we realise that we could have cut the circle at 0 (also known as n) and unfolded it to reduce the problem precisely to symmetric gambler’s ruin. In particular, the answer to the original problem is 1/2n, which is perhaps just a little surprising – maybe by thinking about the BM approximation to simple random walk, and that BM started from zero almost certainly crosses zero infinitely many times near we might have expected this probability to decay faster. But once it is unfolded into gambler’s ruin, we have the optimal stopping martingale motivation to reassure us that this indeed looks correct.

# Avoiding Mistakes in Probability Exams

Over the past week, I’ve given several tutorials to second year undergraduates preparing for upcoming papers on probability and statistics. In particular, I’ve now seen a lot of solutions to a lot of past papers and specimen questions, and it’s worthwhile to consider some of the typical mistakes students can make on these questions. Of course, as with any maths exam, there’s always the possibility of a particularly subtle or involved question coming up, but if the following three common areas of difficulty can be avoided, you’re on track for doing well.

Jacobians

In a previous course, a student will learn how to calculate the pdf of a function of a random variable. Here, we move onto the more interesting and useful case of finding the (joint) density of function(s) of two or more random variables. The key thing to remember here is that manipulating pdfs is not a strange arbitrary exercise – it is just integration. It is rarely of interest to consider the value of a pdf at a single point. We can draw meaningful conclusions from a pdf or from comparison of two pdfs by integrating them.

Then the question of substituting for new random variables is precisely integration by substitution, which we are totally happy with in the one-dimensional case, and should be fairly happy with in the two-dimensional case. To get from one joint density to another, we multiply by the absolute value of the Jacobian. To ensure you get it right, it makes sense to write out the informal infinitesimal relation

$f_{U,V}(u,v) du dv = f_{X,Y}(x,y)dx dy.$

This is certainly relevant if we put integral signs in front of both sides, and explains why you obtain $f_{U,V} = \frac{d(x,y)}{d(u,v)} f_{X,Y}$ rather than the other way round. Note though that if $\frac{d(u,v)}{d(x,y)}$ is easier to calculate for some reason, then you can evaluate this and take the inverse, as your functions will almost certainly be locally bijective almost everywhere.

It is important to take the modulus of the Jacobian, since densities cannot be negative! If this looks like a fudge, then consider the situation in one dimension. If we substitute for $x\mapsto f(x)=1-x$, then f’ is obviously negative, BUT we also end up reversing the order of the bounds of the integral, eg [1/3, ¾] will become [2/3,1/4]. So we have a negative integrand (after multiplying by f'(x)) but bounds in the ‘wrong’ order. These two factors of -1 will obviously cancel, so it suffices just to multiply by |f'(x)| at that stage. It is harder to express in words, but a similar relation works for the Jacobian substitution.

You also need to check where the new joint density is non-zero. Suppose X, Y are supported on [0,1], then when we write $f_{X,Y}(x,y)$ we should indicate that it is 0 off this region, either by splitting into cases, or adding the indicator function $1_{\{x,y\in[0,1]\}}$ as a factor. This is even more important after substitutions, as the range of the resulting random variables might be less obvious than the originals. Eg with X,Y as above, and $U=X^2, V=X/Y$, the resulting pdf will be non-zero only when $u\in[0,1], v\ge \sqrt{u}$. Failing to account for this will often lead to ludicrous answers. A general rule is that you can always check that any distribution you’ve produced does actually integrate to one.

Convergence using MGFs

There are two main reasons to use MGFs and PGFs. The first is that they behave nicely when applied to (possibly random) sums of independent random variables. The independence property is crucial to allow splitting of the MGF of the sum into the product of MGFs of the summands. Of course, implicit in this argument is that MGFs determine distributions.

A key theorem of the course is that this works even in the limit, so you can use MGFs to show convergence in distribution of a family of distributions. For this, you need to show that the MGFs converge pointwise on some interval [-a,a] around 0. (Note that the moments of the distribution are given by the family of derivatives at 0, as motivation for why this condition might be necessary.) Normally for such questions, you will have been asked to define the MGF earlier in the question, and probably will have found the MGF of a particular distribution or family of distributions, which might well end up appearing as the final answer.

Sometimes such an argument might involve substituting in something unusual, like t/N, rather than t, into a known MGF. Normally a Taylor series can be used to show the final convergence result. If you have a fraction, try to cancel terms so that you only have to evaluate one Taylor series, rather than lots.

Using the Markov Property

The Markov property is initially confusing, but once we become comfortable with the statement, it is increasingly irritating to have to answer the question: “show that this process has the Markov property.” This question is irritating because in most cases we want to answer: “because it obviously does!” Which is compelling, but unlikely to be considered satisfactory in a mathematics exam. Normally we observe that the random dynamics of the next step are a function only of the present location. Looking for the word ‘independent’ in the statement of the process under discussion is a good place to start for any argument along these lines.

The most developed example of a Markov process in this course is the Poisson process. I’ve written far too much about this before, so I won’t do so again, except to say this. When we think of the Poisson process, we generally have two thoughts going through our minds, namely the equivalent definitions of IID exponential inter-arrival times, and stationary, Poisson increments (or the infinitesimal version). If we draw a sketch of a sample trajectory of this process, we can label everything up and it is clear how it all fits together. But if you are asked to give a definition of the Poisson process $(N_t)$, it is inappropriate to talk about inter-arrival times unless you define them in terms of $N_t$, since that is the process you are actually trying to define! It is fine to write out

$T_k:=\min\{t: N_t=k\},\quad N_t=\max\{k: Y_1+Y_2+\ldots+Y_k\le t\}$

but the relation between the two characterisations of the process is not obvious. That is why it is a theorem of the course.

We have to be particularly careful of the difference in definition when we are calculating probabilities of various events. A classic example is this. Find the distribution of $N_2$, conditional on $T_3=1$. It’s very tempting to come up with some massive waffle to argue that the answer is 3+Po(1). The most streamlined observation is that the problem is easy if we are conditioning instead on $N_1=3$. We just use the independent Poisson increments definition of $(N_t)$, with no reference to inter-arrival times required. But then the Markov property applied at time 1 says that the distribution of $(N_2)$ depends only on the value of $N_1$, not on the process on the interval [0,1). In a sense, the condition that $T_3=1$ is giving us extra information on the behaviour of the process up to time 1, and the Markov property, which we know holds for the Poisson process, asserts precisely that the extra information doesn’t matter.