Enumerating Forests

I’ve just got back from a visit to Budapest University of Technology, where it was very pleasant to be invited to give a talk, as well as continuing the discussion our research programme with Balazs. My talk concerned a limit for the exploration process of an Erdos-Renyi random graph conditioned to have no cycles. Watch this space (hopefully very soon) for a fully rigorous account of this. In any case, my timings were not as slick as I would like, and I had to miss out a chunk I’d planned to say about a result of Britikov concerning enumerating unrooted forests. It therefore feels like an excellent time to write something again, and explain this paper, which you might be able to find here, if you have appropriate journal rights.

We are interested to calculate a_{n,m} the number of forests with vertex set [n] consisting of m unrooted trees. Recall that if we were interested in rooted trees, we could appeal to Prufer codes to show that there are m n^{n-m-1} such forests, and indeed results of Pitman give a coalescent/fragmentation scheme as m varies between 1 and n-1. It seems that there is no neat combinatorial re-interpretation of the unrooted case though, so Britikov uses an analytic method.

We know that

a_{n,m}= \frac{n!}{m!} \sum_{\substack{k_1+\ldots+k_m=n\\ k_i\ge 1}} \prod_{j=1}^m \frac{k_j^{k_j-2}}{k_j!}.

To see this, observe that the k_js correspond to the sizes of the m trees in the forest; \frac{n!}{\prod k_j!} gives the multinomial number of ways to assign vertices to the trees; given the labels for a tree of size k_j, there are k_j^{k_j-2} ways to make up the tree itself; and \frac{1}{m!} accounts for the fact that the trees have no order.

What we would really like to do is to take the uniform distribution on the set of all labelled trees, then simulate m IID copies of this distribution, and condition the union to contain precisely n vertices. But obviously this is an infinite set, so we cannot choose uniformly from it. Instead, we can tilt so that large trees are unlikely. In particular, for each x we define

\mathbb{P}(\xi=k) \propto \frac{k^{k-2} x^k}{k!},

and define the normalising constant

B(x):= \sum_{k\ge 1} \frac{k^{k-2}x^k}{k!},

whenever it exists. It turns out that x\le e^{-1} is precisely the condition for B(x)<\infty. Note now that if \xi_1,x_2,\ldots are IID copies of \xi, then

\mathbb{P}(\xi_1+\ldots+\xi_m=n) = \frac{x^n}{B(x)^m} \sum_{k_1+\ldots + k_m=n} \prod_{j=1}^m \frac{k_j^{k_j-2}}{k_j!},

and so we obtain

a_{n,m}= \frac{n!}{m!} \frac{B(x)^m}{x^n} \mathbb{P}(\xi_1+\ldots + \xi_m=n).

So asymptotics for a_{n,m} might follows from laws of large numbers of this distribution \xi.

So far, we haven’t said anything about how to choose this value x. But observe that if you want to have lots of trees in the forest, then the individual trees should generally be small, so we take x small to tilt away from a preference for large trees. It turns out that there is a similar interpretation of criticality for forests as for general graphs, and taking x equal to 1/e, its radius of convergence works well for this setting. If you want even fewer trees, there is no option to take x larger than 1/e, but instead one can use large deviations machinery rather than laws of large number asymptotics.

We will be interested in asymptotics of the characteristic function of \xi for x=1/e. In particular \mathbb{E}[e^{it\xi}]=\frac{B(xe^{it})}{B(x)}, and it will be enough to clarify the behaviour of this as t\rightarrow 0. It’s easier to work with a relation analytic function

\theta(x)=\sum_{k\ge 1} \frac{k^{k-1}x^k}{k!},

ie the integral of B. What now feels like a long time ago I wrote a masters’ thesis on the subject of multiplicative coalescence, and this shows up as the generating function of the solutions to Smoluchowski’s equations with monodisperse initial conditions, which are themselves closely related to the Borel distributions. In any case, several of the early papers on this topic made progress by establishing that the radius of convergence is 1/e, and that \theta(x)e^{-\theta(x)}=x everywhere where |x|\le 1/e. We want to consider x=1/e, for which \theta=1.

Note that \mathbb{E}\xi = \frac{\theta(x)}{B(x)}, so we will make progress by relating B(x),\theta(x) in two ways. One way involves playing around with contour integrals in a fashion that is clear in print, but involves quite a lot of notation. The second way is the Renyi relation which asserts that \theta(x)=B(x)+\frac{\theta(x)^2}{2}. We will briefly give a combinatorial proof. Observe that after multiplying through by factorials and interpreting the square of a generating function, this is equivalent to

k^{k-1} = k^{k-2} + \frac12 \sum_{\substack{l+m=k\\l,m\ge 1}} l^{l-1}m^{m-1}\binom{k}{l},

for all k. As we might expect from the appearance of this equality, we can prove it using a bijection on trees. Obviously on the LHS we have the size of the set of rooted trees on [k]. Now consider the set of pairs of disjoint rooted trees with vertex set [k]. This second term on the RHS is clearly the size of this set. Given an element of this set, join up the two roots, and choose whichever root was not initially in the same tree as 1 to be the new root. We claim this gives a bijection between this set, and the set of rooted trees on [k], for which 1 is not the root. Given the latter, the only pair of trees that leads to the right rooted tree on [k] under this mapping is given by cutting off the unique edge incident to the root that separates the root and vertex 1. In particular, since there is a canonical bijection between rooted trees for which 1 is the root, and unrooted trees (!), we can conclude the Renyi relation.

The Renyi relation now gives \mathbb{E}\xi = \frac{\theta(x)}{B(x)}=2 when x=1/e. If we wanted, we could show that the variance is infinite, which is not completely surprising, as the parameter x lies on the radius of convergence of the generating function.

Now, playing around with contour integrals, and being careful about which strands to take leads to the asymptotic as t\rightarrow 0

\mathbb{E}[ e^{it\xi}] = 1+2it + \frac{2}{3}i |2t|^{3/2} (i\mathrm{sign}(t))^{3/2} + o(|t|^{3/2}).

So from this, we can show that the characteristic function of the rescaled centred partial sum \frac{\xi_1+\ldots+\xi_N-2N}{bN^{2/3}} converges to \exp(-|t|^{3/2}\exp(\frac{i\pi}{4}\mathrm{sign} t)), where b= (32/9)^{1/3} is a constant arising out of the previous step.

We recognise this as the characteristic function of the stable distribution with parameters 3/2 and -1. In particular, we know now that \xi is in the domain of attraction for a stable-3/2 distribution. If we wanted a version of the central limit theorem for such partial sums, we could have that, but since we care about the partial sums of the \xi_is taking a specific value, rather than a range of values on the scale of the fluctuations, we actually need a local limit theorem.

To make this clear, let’s return to the simplest example of the CLT, with some random variables with mean \mu and variance \sigma^2<\infty. Then the partial sums satisfy

\mathbb{P}(\mu N + a\sigma\sqrt{N} \le S_N \le \mu_N+b\sigma\sqrt{N}) \rightarrow \int_a^b f_{\mathcal N}(x)dx,

as N\rightarrow\infty. But what about the probability of S_N taking a particular value m that lies between \mu N+a\sigma \sqrt{N} and \mu N + b\sigma \sqrt{N}? If the underlying distribution was continuous, this would be uncontroversial – considering the probability of lying in a range that is smaller than the scale of the CLT can be shown in a similar way to the CLT itself. A local limit theorem asserts that when the underlying distribution is supported on some lattice, mostly naturally the integers, then these probabilities are in the limit roughly the same whenever m is close to \mu N+a\sigma\sqrt{N}.

In this setting, a result of Ibragimov and Linnik that I have struggled to find anywhere in print (especially in English) gives us local limit theory for integer-supported distributions in the domain of attraction of a stable distribution. Taking p( ) to be the density of this distribution, we obtain

bm^{2/3}\mathbb{P}(\xi_1+\ldots+\xi_m=n) - p(\frac{n-2m}{b m^{2/3}}) \rightarrow 0

as n\rightarrow\infty, uniformly on any set of m for which z= \frac{n-2m}{bm^{2/3}} is bounded. Conveniently, the two occurrences of b clear, and Britikov obtains

a_{n,m} = (1+o(1)) \frac{\sqrt{2\pi} n^{n-1/6}}{2^{n-m}(n-m)!} p(\frac{n-2m}{n^{2/3}},

uniformly in the same sense as before.

Generating Functions for Dice

So last week I was writing an article for Betting Expert about laws of large numbers, and I was trying to produce some representations of distributions to illustrate the Weak LLN and the Central Limit Theorem. Because tossing a coin feels too simplistic, and also because the natural state space for this random variable, at least verbally, is not a subset of the reals, I decided to go for dice instead. So it’s clear what the distribution of the outcome of a single dice roll is, and with a bit of thought or a 6×6 grid, we can work out the distribution of the average of two dice rolls. But what about 100 rolls? Obviously, we need large samples to illustrate the laws of large numbers! In this post, we discuss how to calculate the distribution of the sample mean of n dice rolls.

First we observe that the total set of outcomes of n dice rolls is 6^n. The sum of the outcomes must lie between n and 6n inclusive. The distribution of the sum and the distribution of the sample mean are equivalent up to dividing by n. The final observation is that because the total number of outcomes has a nice form, we shouldn’t expect it to make any difference to the method if we calculate the probability of a given sum, or the number of configurations giving rise to that sum.

Indeed, tying in nicely with the first year probability course, we are going to use generating functions, and there is no difference in practice between the probability generating function, and the combinatorial generating function, if the underlying mechanism is a uniform choice. Well, in practice, there is a small difference, namely a factor of 6 here. The motivation for using generating functions is clear: we are considering the distribution of a sum of independent random variables. This is pretty much exactly why we bother to set up the machinery for PGFs.

Anyway, since each of {1,2,…,6} is equally likely, the GF of a single dice roll is

x+x^2+\ldots+x^6=x\cdot \frac{1-x^6}{1-x}.

So, if we want the generating function of the sum of n independent dice rolls, we can obtain this by raising the above function to the power n. We obtain

x^n(1-x^6)^n(1-x)^{-n}.

Note the factor of x^n at the beginning arises because the minimum value of the sum is n. So to work out the number of configurations giving rise to sum k, we need to evaluate the coefficient of x^k. We can deal with (1-x^6)^n fairly straightforwardly, but some thought it required regarding whether it’s possible to do similar job on (1-x)^{-n}.

We have to engage briefly with what is meant by a binomial coefficient. Note that

\binom{x}{k}=\frac{x(x-1)\ldots(x-k+1)}{1\cdot\ldots\cdot k}

is a valid definition even when x is not a positive integer, as it is simply a degree k polynomial in x. This works if x is a general positive real, and indeed if x is a general negative real. At this stage, we do need to keep k a positive integer, but that’s not a problem for our applications.

So we need to engage with how the binomial theorem works for exponents that are not positive integers. The tricky part with the standard expression as

(a+b)^n=\binom{n}{0}a^n+\ldots + \binom{n}{n}b^n,

is that the attraction of this symmetry in a and b prompts us to work in more generality than is entirely necessary to state the result. Note if we instead write

(1+x)^n=1+\binom{n}{1}x+\binom{n}{2}x^2+\ldots,

we have unwittingly described this finite sum as an infinite series. It just happens that all the binomial coefficients apart from the first (n+1) are zero. The nice thing about this definition is that it might plausibly generalise to non-integer or negative values of n. And indeed it does. I don’t want to go into the details here, but it’s just a Taylor series really, and the binomial coefficients are set up with factorials in the right places to look like a Taylor series, so it all works out.

It is also worth remarking that it follows straight from the definition of a negative binomial coefficient, that

\binom{-n}{j}=(-1)^j \binom{n+j-1}{j}.

In any case, we can rewrite our expression for the generating function of the IID sum as

x^n\left[\sum_{k=0}^n \binom{n}{k}(-1)^k x^{6k}\right]\left[\sum_{j\ge 0} \binom{-n}{j}(-1)^j x^j\right]

By accounting for where we can gather exponents from each bracket, we can evaluate the coefficient of x^m as

\sum_{6k+j=m+n}\binom{n}{k}\binom{n+j-1}{j}(-1)^k.

Ie, k in the sum takes values in \{0,1,\ldots, \lfloor \frac{m+n}{6}\rfloor\}. At least in theory, this now gives us an explicit way to calculate the distribution of the average of multiple dice rolls. We have to be wary, however, that many compilers will not be happy dealing with large binomial coefficients, as the large factorials grow extremely rapidly. An approximation using logs is likely to be more tractable for larger settings.

Anyway, I leave you with the fruits of my labours.

dice30Related articles

Enhanced by Zemanta

Large Deviations and the CLT

Taking a course on Large Deviations has forced me to think a bit more carefully about what happens when you have large collections of IID random variables. I guess the first thing think to think about is ‘What is a Large Deviation‘? In particular, how large or deviant does it have to be?

Of primary interest is the tail of the distribution function of S_n=X_1+\ldots+X_n, where the X_i are independent and identically distributed as X. As we can always negate everything later if necessary, we typically consider the probability of events of the type:

\mathbb{P}(S_n\geq \theta(n))

where \theta(n) is some function which almost certainly increases fairly fast with n. More pertinently, if we are looking for some limit which corresponds to an actual random variable, we perhaps want to look at lots of related \theta(n)s simultaneously. More concretely, we should fix \theta and consider the probabilities

\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha). (*)

Throughout, we lose no generality by assuming that \mathbb{E}X=0. Of course, it is possible that this expectation does not exist, but that is certainly a question for another post!

Now let’s consider the implications of our choice of \theta(n). If this increases with n too slowly, and the likely deviation of S_n is greater than \theta(n), then the event might not be a large deviation at all. In fact, the difference between this event and the event (S_n is above 0, that is, its mean) becomes negligible, and so the probability at (*) might be 1/2 or whatever, regardless of the value of \alpha. So object \lim \frac{S_n}{\theta(n)} whatever that means, certainly cannot be a proper random variable, as if we were to have convergence in distribution, this would imply that the limit RV consisted of point mass at each of \{+\infty, -\infty\}.

On the other hand, if \theta(n) increases rapidly with n, then the probabilities at (*) might become very small indeed when \alpha>0. For example, we might expect:

\lim_{n\rightarrow\infty}\mathbb{P}(\frac{S_n}{\theta(n)}\geq \alpha)=\begin{cases}0& \alpha>0\\1&\alpha<0.\end{cases}

and more information to be required when \alpha=0. This is what we mean by a large deviation event. Although we always have to define everything concretely in terms of some finite sum S_n, we are always thinking about the behaviour in the limit. A large deviation principle exists in an enormous range of cases to show that these probabilities in fact decay exponentially. Again, that is the subject for another post, or indeed the lecture course I’m attending.

Instead, I want to return to the Central Limit Theorem. I first encountered this result in popular science books in a vague “the histogram of boys’ heights looks like a bell” kind of way, then, once a normal random variable had been to some extent defined, it returned in A-level statistics courses in a slightly more fleshed out form. As an undergraduate, you see it in several forms, including as a corollary following from Levy’s convergence theorem.

In all applications though, it is generally used as a method of calculating good approximations. It is not uncommon to see it presented as:

\mathbb{P}(a\sigma\sqrt{n}+\mu n\leq S_n\leq b\sigma\sqrt{n}+\mu n)\approx \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}dx.

Although in many cases that is the right way to think use it, it isn’t the most interesting aspect of the theorem itself. CLT says that the correct scaling of \theta(n) so that the deviation probabilities lie between the two cases outline above is the same (that is, \theta(n)=O(\sqrt{n}) in some sense) for an enormous class of distributions, and in particular, most distributions that one might encounter in practice (ie finite mean, finite variance). There is even greater universality, as furthermore the limit distribution at this interface has the same form (some appropriate normal distribution) whenever X is in this class of distributions. I think that goes a long way to explaining why we should care about the theorem. It also immediately prompts several questions:

  • What happens for less regular distributions? It is now more clear what the right question to ask in this setting might be. What is the appropriate scaling for \theta(n) in this case, if such a scaling exists? Is there a similar universality property for suitable classes of distributions?
  • What is special about the normal distribution? The theorem itself shows us that it appears as a universal scaling limit in distribution, but we might reasonably ask what properties such a distribution should have, as perhaps this will offer a clue to a version of CLT or LLNs for less regular distributions.
  • We can see that the Weak Law of Large Numbers follows immediately from CLT. In fact we can say more, perhaps a Slightly Less Weak LLN, that

\frac{S_n-\mu n}{\sigma \theta(n)}\stackrel{d}{\rightarrow}0

  • whenever \sqrt{n}<<\theta(n). But of course, we also have a Strong Law of Large Numbers, which asserts that the empirical mean converges almost surely. What is the threshhold for almost sure convergence, because there is no a priori reason why it should be \theta(n)=n?

To be continued next time.

NMSS 2012 – Strong Law of Large Numbers for a Coin Flip

The 2012 National Mathematics Summer School, held at Queens’ College affiliated to the University of Birmingham, and run by the United Kingdom Mathematics Trust, is drawing to a close today. I gave a problem-based talk on Probability to two groups of 20 junior students (15/16 year olds selected based on strong performance in national competitions for their agegroups), and a lecture to the six senior students (some of 2011’s strongest and most enthusiastic junior students) on the SLLN for the simplest non-trivial random variable imaginable: a coin flip.

In case any of the students, or indeed anyone else, is interested, a text of the problems, and the worked solutions that took up the majority of the lecture will be available here for a short while. Do email me if there are any questions!

Senior Probability Solutions. [Link removed. Email me if interested]

An Exchangeable Law of Large Numbers

In the proof of De Finetti’s Theorem in my last post, I got to a section where I needed to show a particular convergence property of a sequence of exchangeable random variables. For independent identically distributed RVs, we have Kolmogorov’s 0-1 law, and in particular a strong law of large numbers. Does a version of this result hold for exchangeable sequences? As these represent only a mild generalisation of iid sequences, we might hope so. The following argument demonstrates that this is true, as well as providing a natural general proof of De Finetti.

Define \mathcal{E}_n=\sigma(\{f(X_1,\ldots,X_n): f\text{ symmetric, Borel}\}), the smallest sigma-field wrt which the first n RVs are exchangeable. Note that \mathcal{E}_1\supset\mathcal{E}_2\supset\ldots\supset \mathcal{E}=\cap_n\mathcal{E}_n, the exchangeable sigma-field.

So now take g(X) symmetric in the first n variables. By exchangeability E[\frac{1}{n}\sum_1^n f(X_j)g(X)]=E[f(X_1)g(X)]. Now set g=1_A, for A\in\mathcal{E}_n, and so because the LHS integrand is \mathcal{E}_n-meas. we have Z_n=\frac{1}{n}\sum_1^n f(X_j)=E[f(X_1)|\mathcal{E}_n]. So Z is a backwards martingale.

We have a convergence theorem for backwards martingales, which tells us that \lim_n n^{-1}\sum^n f(X_j) exists, and in fact = E[f(X_1)|\mathcal{E}] almost surely. Setting f(X)=1(X\leq x) gives that \lim_n\frac{\#\{X_i\leq x: i\leq n\}}{n}=F(x):=P(X_1\leq x|\mathcal{E}). We now perform a similar procedure for functions defined on the first k RVs, in an attempt to demonstrate independence.

For f:\mathbb{R}^k\rightarrow\mathbb{R}, we seek a backwards martingale, so we take sums over the n^{(k)} ways to choose k of the first n RVs. So \frac{1}{n(n-1)\ldots(n-k+1)}\sum_{I\subset[n]} f(X_{i_1},\ldots,X_{i_k}) is a backwards martingale, and hence E[f(X_1,\ldots,X_k)|\mathcal{E}]=\lim_n \frac{1}{n(n-1)\ldots(n-k+1)}\sum f(-). As before, set f(y_1,\ldots,y_k)=1(y_1\leq x_1)\ldots 1(y_k\leq x_k). Crucially, we can replace the falling factorial term with n^{-k} as we are only considering the limit, then exchange summation as everything is positive and nice to get: E[f(X_1,\ldots,X_k)|\mathcal{E}]=\lim(\frac{1}{n}\sum 1(X_1\leq x_1))\ldots(\frac{1}{n}\sum 1(X_k\leq x_k)) thus demonstrating independence of (X_n) conditional on \mathcal{E}.

So what have we done? Well, we’ve certainly proven de Finetti in the most general case, and we have in addition demonstrated the existence of a Strong Law of Large Numbers for exchangeable sequences, where the limit variable is \mathcal{E}-measurable.