# Urns and the Dirichlet Distribution

As I’ve explained in some posts from a while ago, I’ve been thinking about some models related to random graph processes, where we ensure the configuration stays critical by deleting any cycles as they appear. Under various assumptions, this behaves in the limit as the number of vertices grows to infinity, like a coagulation-fragmentation process, with multiplicative coalescence and quadratic fragmentation rate, where the fragmentation kernel is the Poisson-Dirichlet distribution, PD(1/2,1/2). I found it quite hard to find accessible notes on these, partly because the theory is still relatively recently, and also because it seems to be one of those topics where you can’t understand anything properly until you kind of understand everything.

This post was motivated and is based on chapter 3 of Pitman’s Combinatorial Stochastic Processes, and the opening pair of lectures from Pierre Tarres’s TCC course on Self-Interaction and Learning.

It makes sense to begin by discussing the Dirichlet distribution, and there to start with the most simple case, the Beta distribution. As we learned in the Part A Statistics course while trying some canonical examples of posterior distributions, it is convenient to ignore the normalising constants of various distributions until right at the end. This is particularly true of the Beta distribution, which is indeed often used as a prior in such situations. The density function of $\text{Beta}(\alpha,\beta)$ is $x^{\alpha-1}(1-x)^{\beta-1}$. If these are natural numbers, we have a quick proof by induction using integration by parts, otherwise a slightly longer but still elementary argument gives the normalising constant as

$\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}.$

(note that the ‘base case’ is the definition of the Gamma function.) For the generalisation we are about to make, it is helpful to think of this Beta density as a distribution not on [0,1], but on partitions of [0,1] into two parts. That is pairs (x,y) such that x+y=1. Why? Because then the density has the form $x^{\alpha-1}y^{\beta-1}$ and is clear how we might generalise this.

Indeed the Dirichlet distribution with parameters $(\alpha_1,\ldots,\alpha_m)$ is a random variable supported on the subset of $\mathbb{R}^m$ with $\sum p_k=1$ with density $\propto \prod p_k^{\alpha_k-1}$. For similar reasons, the correct normalising constant in the general case is

$\frac{\Gamma(\sum \alpha_k)}{\prod \Gamma(\alpha_k)}.$

You can prove this by inducting on the number of variables, using the Beta distribution as a base case.

In many situations, it is useful to be able to express some distribution as a function of IID random variables with a simpler distribution. We can’t quite do that for the Dirichlet distribution, but we can express it very simply as function of independent RVs from the same family. It turns out that the family of Gamma distributions is a wise choice. Recall that the gamma distribution with parameters $(\alpha,\beta)$ has density:

$\frac{1}{\beta^\alpha \Gamma(\alpha)}y^{\alpha-1}e^{-y/\beta},\quad y>0.$

Anyway, define independent RVs $Y_k$ as gamma distributions with parameters $(\alpha_k,1)$, then we can specify $(X_1,\ldots,X_m)$ as the Dirichlet distribution with these parameters by:

$(X_1,\ldots,X_m)\stackrel{d}{=}\frac{(Y_1,\ldots,Y_m)}{\sum Y_k}.$

In other words, the Dirichlet distribution gives the ratio between independent gamma RVs. Note the following:

– the sum of the gamma distributions, ie the factor we have to scale by to get back to a ratio, is a gamma distribution itself.

– If we wanted, we could define it in an identical way using Gamma with parameters $(\alpha_k,\beta)$ for some fixed $\beta$.

– More helpfully, because the gamma distribution is additive in the first argument, we can take a limit to construct a gamma process, where the increments have the form required. This will be a useful interpretation when we take a limit, as largest increments will correspond to largest jumps.

Polya’s Urn

This is one of the best examples of a self-reinforcing process, where an event which has happened in the past is more likely to happen again in the future.

The basic model is as follows. We start with one white ball and one black ball in a bag. We draw a ball from the bag uniformly at random then replace it along with an additional ball of the same colour. Repeat this procedure.

The first step is to look at the distribution at some time n, ie after n balls have been added, so there are n+2 in total. Note that there are exactly n+1 possibilities for the state of the bag at this time. We must have between 1 and n+1 black balls, and indeed all of these are possible. In general, part of the reason why this process is self-reinforcing is that any distribution is in some sense an equilibrium distribution.

What follows is a classic example of a situation which is a notational nightmare in general, but relatively straightforward for a fixed finite example.

Let’s example n=5, and consider the probability that the sequence of balls drawn is BBWBW. This probability is:

$\frac12\times \frac 23\times \frac14\times \frac 35\times \frac26.$

So far this isn’t especially illuminating, especially if we start trying to cancel these fractions. But note that the denominator of the product will clearly be 6!. What about the numerator? Well, the contribution to the numerator of the product from black balls is 1x2x3=3! while the contribution from white balls is 1×2=2!. In particular, the contribution to the numerator from each colour is independent of the order of whites and blacks. It depends only on the number of whites and blacks. So we can conclude that the probability that we end up a particular ordering of k+1 whites and (n-k)+1 blacks is

$\frac{k! (n-k)!}{(n+1)!},$

and so the probability that we end up with k+1 whites where we no longer care about ordering is

$\binom{n}{k}\frac{k!(n-k)!}{(n+1)!}=\frac{1}{n+1}.$

In other words, the distribution of the number of white balls in the bag after n balls have been added is uniform on [1, n+1].

That looks like it might be something of a neat trick, so the natural question to ask is what happens if we adjust the initial conditions. Suppose that instead we start with $a_1,\ldots,a_m$ balls of each of m colours. Obviously, this is going to turn into a proof by suggestive notation. In fact, the model doesn’t really rely on the $(a_i)$ being positive integers. Everything carries through with $a_i\in\mathbb{R}$ if we view the vector as the initial distribution.

As before, the order in which balls of various colours are drawn doesn’t matter hugely. Suppose that the first n balls drawn feature $n_i$ balls of colour i. The probability of this is:

$\binom{n}{n_1,\ldots,n_k}\frac{\prod_i \alpha_i(\alpha_i+1)\ldots (\alpha_i+n_i-1)}{\alpha(\alpha+1)\ldots(\alpha+(n-1))}$

where $\alpha=\sum_i \alpha_i$. Then for large n, assuming for now that the $\alpha_i\in\mathbb{N}$ we have

$\frac{\alpha(\alpha+1)\ldots(\alpha+n-1)}{n!}=\frac{[\alpha_i+(n_i-1)]!}{n_i! (\alpha_i-1)!}\approx \frac{n_i^{\alpha_i-1}}{(\alpha_i-1)!}.$

The denominator will just be a fixed constant, so we get that overall, the probability above is approximately

$\frac{\prod_i n_i^{\alpha_i-1}}{n^{\alpha-1}}=\prod (\frac{n_i}{n})^{\alpha_i-1},$

which we recall is the pdf of the distribution distribution with parameters $(\alpha_i)$ as telegraphed by our choice of notation. With some suitable martingale machinery, you can also prove that this convergence happens almost surely, for a suitable limit RV defined on the tail sigma algebra.

Next time I’ll introduce a more complicated family of self-reinforcing processes, and discuss some interesting limits of the Dirichlet distribution that relate to such processes.