The Chinese Restaurant Process

A couple of months ago I wrote a post about Polya’s Urn, the simplest example of a self-reinforcing process. Recall that we have a bag containing black and white balls, and sequentially we draw a ball, then replace it together with an additional ball of the same colour. The process is self-reinforcing in the sense that if there is a surplus of black balls, the dynamics will reinforce this by adding more black balls than white balls. Alternatively, you can think of a natural limit process when the number of balls is large, for which any distribution is an invariant distribution. We have seen models such as the Preferential Attachment dynamics for network creation, where the degrees of vertices clearly have this self-reinforcing property: new vertices are more likely to attach to existing vertices with large degrees.

One difference between the Polya Urn and some of the models we might be interested in for applications is that for the urn model, the number of classes (in this context colours of balls) is fixed. In many applications, we will want to allow new classes to appear. In the process which follows, we will allow this, and the new classes will have initial size equal to 1, so will be at a disadvantage for the self-reinforcing dynamics. Nonetheless, some will show up in a meaningful way in the limit. It is worth emphasising that Polya’s Urn gave us the Dirichlet distribution in the limit, and this can be thought of as a partition of [0,1]. These more general processes will give us a more interesting family of partitions, called the Poisson-Dirichlet distributions. These will turn up in a wide variety of contexts, and this is perhaps the friendliest way to introduce them.

The model is this. We start with a single diner who sits at the first table. Then whenever the (n+1)th diner arrives, they join a table which currently has k diners with probability \frac{k}{n+1}, and they start a new table with probability \frac{1}{n+1}.

(Aside: I’m not exactly sure how this relates to a Chinese restaurant? It seems more reminiscent of a university dining hall during freshers’ week, but I guess that would be a less catchy name for a model.)
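To make these dynamics concrete, here is a minimal simulation sketch (the code and names are mine, not from the post); it uses the fact that joining a table of size k with probability \frac{k}{n+1} is the same as picking uniformly from the n seated diners together with one extra ‘new table’ option.

```python
import random

def chinese_restaurant(n_diners, seed=0):
    """Simulate the basic CRP; return the list of table sizes.

    When m diners are already seated, the next diner joins a table of size k
    with probability k/(m+1) and starts a new table with probability 1/(m+1):
    equivalently, pick uniformly from the m existing diners plus one extra
    'new table' option.
    """
    rng = random.Random(seed)
    tables = []                          # tables[i] = number of diners at table i
    for m in range(n_diners):            # m diners already seated
        r = rng.randrange(m + 1)         # uniform on {0, 1, ..., m}
        if r == m:
            tables.append(1)             # the 'new table' option, probability 1/(m+1)
        else:
            # r indexes an existing diner; find their table and join it
            for i, size in enumerate(tables):
                if r < size:
                    tables[i] += 1
                    break
                r -= size
    return tables

print(sorted(chinese_restaurant(1000), reverse=True))
```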

Anyway, the interest in this description lies not in organising seating arrangements. Consider choosing uniformly at random from the set of permutations of [n+1]. Suppose x maps to n+1 and n+1 maps to y. Consider taking the permutation of [n] formed by instead mapping x to y and ignoring n+1. This has the uniform distribution on the set of permutations of [n]. By reversing this procedure, we can construct a uniform permutation of [n+1] from a uniform permutation of [n]. When you do this as a process for growing n, observe that the orbits correspond exactly to tables in the Chinese Restaurant Process. If we wanted the CRP to give all the information about the permutation, we could specify the ordering round each table, by saying that with probability \frac{1}{n+1} the new diner sits to the left of any given existing diner.
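Here is a sketch of this construction run forwards (again my own code): element m becomes a fixed point with probability \frac{1}{m}, and otherwise is spliced into the cycle of a uniformly chosen existing element x, by setting \sigma(x)=m and \sigma(m) equal to the old \sigma(x); the cycles of the resulting uniform permutation are exactly the CRP tables.

```python
import random

def uniform_permutation_and_cycles(n, seed=1):
    """Grow a uniform permutation of {1,...,n}: element m becomes a fixed point
    with probability 1/m, otherwise it is spliced into the cycle of a uniformly
    chosen existing element x (sigma(x) = m, sigma(m) = old sigma(x))."""
    rng = random.Random(seed)
    sigma = {}
    for m in range(1, n + 1):
        r = rng.randrange(m)             # uniform on {0, ..., m-1}
        if r == m - 1:
            sigma[m] = m                 # new cycle (= new table), probability 1/m
        else:
            x = r + 1                    # uniform over the m-1 existing elements
            sigma[m] = sigma[x]
            sigma[x] = m

    # recover the cycle (= table) sizes
    seen, sizes = set(), []
    for start in sigma:
        if start in seen:
            continue
        size, y = 0, start
        while y not in seen:
            seen.add(y)
            size += 1
            y = sigma[y]
        sizes.append(size)
    return sigma, sizes

perm, sizes = uniform_permutation_and_cycles(12)
print(perm, sizes)
```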

As a starting point for why this is a useful description of the uniform permutation distribution, observe that the size of the cycle containing the element 1 evolves as a Polya Urn with initial vector (1,1). The second 1 in the initial vector corresponds to the possibility of starting a new table, which is maintained at every stage. This tells us immediately that as n grows to infinity, the proportion of elements in the same cycle as 1 in the uniform permutation converges in distribution to U[0,1]. The construction also allows an easy proof that the expected number of cycles is roughly \log n for large n: when the kth element is added, the probability that it forms a new cycle is \frac{1}{k}, so the expected number of cycles is \sum_{k=1}^n \frac{1}{k}\approx \log n.

In this case, the partition induced on [n] by the process is clearly exchangeable, given our permutation description. However, this will turn out to hold in greater generality. Note also that, conditional on the size of the cycle containing 1, the sizes of the remaining cycles are given by a uniform permutation on a smaller number of elements. So the limiting result holds jointly in the first k cycle sizes for all k. More precisely, if (N_1,N_2,\ldots) are the cycle sizes ordered by least element, then the frequencies converge to:

(U_1,(1-U_1)U_2,(1-U_1)(1-U_2)U_3,\ldots),

where the U_i are independent U[0,1] RVs. This is known as a stick-breaking procedure, where at each step we break off some proportion of the remaining stick according to a fixed distribution, and assemble the pieces into a partition.
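As a quick illustration (the snippet is mine, not from Pitman), these limiting frequencies can be sampled directly:

```python
import random

def stick_breaking_uniform(n_pieces, seed=2):
    """First n_pieces frequencies (in order of least element) of the limit partition:
    break off a U[0,1] proportion of the remaining stick at each step."""
    rng = random.Random(seed)
    remaining, pieces = 1.0, []
    for _ in range(n_pieces):
        u = rng.random()                  # U_i ~ U[0,1]
        pieces.append(remaining * u)      # (1-U_1)...(1-U_{i-1}) U_i
        remaining *= 1.0 - u
    return pieces

print(stick_breaking_uniform(10))
```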

We generalise this process to get a two-parameter version. The standard notation for the parameters is (\alpha,\theta). Then we amend the dynamics. We now have to take into account how many tables are occupied when the (n+1)th diner arrives. If k tables are occupied, and the ith table has n_i diners, then the new one will join this table with probability \frac{n_i-\alpha}{n+\theta}, and will start a new table otherwise, so with probability \frac{\theta+k\alpha}{n+\theta}. The original process therefore corresponds to parameters (0,1).
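This is a minimal simulation sketch of the two-parameter dynamics (the code and function name are mine, and it assumes parameter values for which all the probabilities above are genuinely non-negative, which we discuss next):

```python
import random

def crp_two_parameter(n_diners, alpha, theta, seed=3):
    """Two-parameter (alpha, theta) Chinese Restaurant Process; returns table sizes.

    With n diners seated at k tables, the next diner joins the table with n_i diners
    with probability (n_i - alpha)/(n + theta), and starts a new table with the
    remaining probability (theta + k*alpha)/(n + theta)."""
    rng = random.Random(seed)
    tables = [1]                               # the first diner's table
    for n in range(1, n_diners):
        u = rng.random() * (n + theta)
        acc = 0.0
        for i, n_i in enumerate(tables):
            acc += n_i - alpha
            if u < acc:
                tables[i] += 1
                break
        else:
            tables.append(1)                   # leftover mass theta + k*alpha

    return tables

# (alpha, theta) = (0, 1) recovers the original process
print(sorted(crp_two_parameter(2000, 0.5, 0.0), reverse=True)[:10])
```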

First we examine which parameters are possible. If \alpha<0, and m|\alpha|<\theta<(m+1)|\alpha|, then almost surely the (m+1)th table will eventually be occupied, whereafter the probability of forming a further table would be negative. So we have to demand instead that \theta is an integer multiple of -\alpha, say \theta=m|\alpha|. Then the number of tables is bounded by m, so for large n the probability of joining the ith of these (eventually fixed) tables is roughly \frac{n_i}{n}, and the process should behave roughly like the standard Polya Urn. And indeed, the induced frequencies do converge to the Dirichlet distribution with m parameters all equal to |\alpha|.

Obviously \alpha cannot be greater than 1, otherwise the probability of the second diner joining the first table is negative. If it is equal to 1, then every diner starts a new table, which isn’t very interesting. So we care about \alpha\in[0,1), and for the probability of the second diner starting a new table to be non-negative we require \theta>-\alpha.

It turns out that the partitions induced by this process are exchangeable also. We also have a stick-breaking construction, although now the broken proportions are not IID, but distributed as

U_i\sim \mathrm{Beta}(1-\alpha,\theta+i\alpha),

with the same notation otherwise. It turns out that under mild assumptions, these are all the infinite exchangeable random partitions with this stick-breaking property.
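In code, this is a one-line change to the earlier stick-breaking sketch, using Python’s built-in betavariate (the resulting law of the frequencies in this order is often called the GEM(\alpha,\theta) distribution):

```python
import random

def stick_breaking_gem(n_pieces, alpha, theta, seed=4):
    """Stick-breaking with U_i ~ Beta(1 - alpha, theta + i*alpha), i = 1, 2, ...
    (the two-parameter analogue of the uniform stick-breaking above)."""
    rng = random.Random(seed)
    remaining, pieces = 1.0, []
    for i in range(1, n_pieces + 1):
        u = rng.betavariate(1.0 - alpha, theta + i * alpha)
        pieces.append(remaining * u)
        remaining *= 1.0 - u
    return pieces

print(stick_breaking_gem(10, 0.5, 0.0))
```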

My initial struggle with this process was to understand what roles (\alpha,\theta) played in a more precise way. It turns out this is best explained through the limit of the partition, but Pitman’s Exercise 3.2.2 does at least give an idea of how such a process with parameters (1/2,0) might naturally arise as a version of an urn model.

3.2.2. Let an urn initially contain two balls of different colours. Draw 1 is a simple draw from the urn with replacement. Thereafter, balls are drawn from the urn, with replacement of the ball drawn, and addition of two more balls as follows. If the ball drawn is of a colour never drawn before, it is replaced together with two additional balls of two distinct new colours, different to the colours of balls already in the urn. Whereas if the ball drawn is of a colour that has been drawn before, it is replaced together with two balls of its own colour.

Let n_1 be the number of times a ball of the colour of the first ball drawn (and replaced) is drawn, and let n_2,n_3,\ldots be the corresponding counts for the other colours that have been drawn. Suppose after n draws we have drawn k colours. (There will be other colours in the bag not yet drawn.) Then, for each drawn colour i, there are 2n_i-1 balls of that colour in the bag, giving 2n-k in total. But there are 2n balls in total, so there are k balls of colours not yet drawn. Then the probability that we see a new colour is \frac{k}{2n}=\frac{0+k\cdot\frac12}{n}, and the probability that we see colour i again is \frac{2n_i-1}{2n}=\frac{n_i-\frac12}{n}, which exactly corresponds to the dynamics of the process with parameters (1/2,0).
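If you want to check the correspondence numerically, here is a literal simulation of the urn in the exercise (the colour labels and bookkeeping are my own); the draw-counts n_i it returns should have the same distribution as the table sizes of a CRP with parameters (1/2,0) after the same number of arrivals.

```python
import random

def splitting_urn(n_draws, seed=5):
    """Pitman's exercise: an urn whose draw-counts evolve like a (1/2, 0) CRP.

    `urn` maps colour labels to ball counts; `drawn` counts how many times each
    colour has been drawn. Colours are integer labels, minted as needed."""
    rng = random.Random(seed)
    urn = {0: 1, 1: 1}            # two balls of different colours
    drawn = {}                    # colour -> number of times drawn
    next_colour = 2

    for t in range(n_draws):
        # draw a ball uniformly at random from the urn
        r = rng.randrange(sum(urn.values()))
        for colour, count in urn.items():
            if r < count:
                break
            r -= count

        if t == 0:
            drawn[colour] = 1     # draw 1 is a simple draw with replacement
            continue

        if colour in drawn:
            drawn[colour] += 1
            urn[colour] += 2      # replace, plus two more of the same colour
        else:
            drawn[colour] = 1     # replace, plus one ball each of two new colours
            urn[next_colour] = 1
            urn[next_colour + 1] = 1
            next_colour += 2

    return sorted(drawn.values(), reverse=True)

# draw-counts n_i, with the same law as table sizes in a (1/2, 0) CRP
print(splitting_urn(10000)[:10])
```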

The other question I was puzzled by initially is where does the dust come from in the limit? Recall that in an infinite exchangeable partition, the sum of the frequencies does not need to be 1. The difference between this sum and 1 gives the probability that an element is in a block by itself. Obviously, when the number of tables is bounded (as when \alpha<0) this is not an issue, but for positive \alpha, this won’t hold. So we need to account for these singletons. The temptation is to imagine that these correspond to tables which are started but never joined. But this use of ‘never’ is not ideal. For each k, the k-th table will eventually include arbitrarily large numbers of diners. But for any finite n, there will likely be some proportion of people dining alone, some in pairs, and so on. So the sum of all of these proportions in the limit gives this dust.

Generalising Polya’s Urn in another direction, if I have time, I might write something about a model which I recently read about on arXiv where the classes are vertices of a graph, and there is dependence between them based on the presence of edges. This might also be a good moment to explain some other generalisations and stochastic approximation methods used to treat them.

REFERENCES

This post is almost entirely a paraphrase of Sections 3.1 and 3.2 of Pitman’s Combinatorial Stochastic Processes, which is available online.


Urns and the Dirichlet Distribution

As I’ve explained in some posts from a while ago, I’ve been thinking about some models related to random graph processes, where we ensure the configuration stays critical by deleting any cycles as they appear. Under various assumptions, this behaves in the limit, as the number of vertices grows to infinity, like a coagulation-fragmentation process, with multiplicative coalescence and quadratic fragmentation rate, where the fragmentation kernel is the Poisson-Dirichlet distribution, PD(1/2,1/2). I found it quite hard to find accessible notes on these, partly because the theory is still relatively recent, and also because it seems to be one of those topics where you can’t understand anything properly until you kind of understand everything.

This post was motivated and is based on chapter 3 of Pitman’s Combinatorial Stochastic Processes, and the opening pair of lectures from Pierre Tarres’s TCC course on Self-Interaction and Learning.

It makes sense to begin by discussing the Dirichlet distribution, and there to start with the simplest case, the Beta distribution. As we learned in the Part A Statistics course while trying some canonical examples of posterior distributions, it is convenient to ignore the normalising constants of various distributions until right at the end. This is particularly true of the Beta distribution, which is indeed often used as a prior in such situations. The density function of \text{Beta}(\alpha,\beta) is proportional to x^{\alpha-1}(1-x)^{\beta-1}. If the parameters are natural numbers, we have a quick proof by induction using integration by parts; otherwise a slightly longer but still elementary argument gives the normalising constant as

\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}.

(Note that the ‘base case’ is the definition of the Gamma function.) For the generalisation we are about to make, it is helpful to think of this Beta density as a distribution not on [0,1], but on partitions of [0,1] into two parts, that is, pairs (x,y) such that x+y=1. Why? Because then the density has the form x^{\alpha-1}y^{\beta-1}, and it is clear how we might generalise this.

Indeed, the Dirichlet distribution with parameters (\alpha_1,\ldots,\alpha_m) is a random variable supported on the subset of \mathbb{R}^m with p_k\geq 0 and \sum p_k=1, with density \propto \prod p_k^{\alpha_k-1}. For similar reasons, the correct normalising constant in the general case is

\frac{\Gamma(\sum \alpha_k)}{\prod \Gamma(\alpha_k)}.

You can prove this by inducting on the number of variables, using the Beta distribution as a base case.
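To spell out the integration-by-parts step for the Beta case (this worked step is mine, with \alpha>0 and \beta>1 so that the boundary term vanishes):

B(\alpha,\beta)=\int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx=\left[\frac{x^\alpha}{\alpha}(1-x)^{\beta-1}\right]_0^1+\frac{\beta-1}{\alpha}\int_0^1 x^{\alpha}(1-x)^{\beta-2}\,dx=\frac{\beta-1}{\alpha}\,B(\alpha+1,\beta-1).

The base case is B(\alpha,1)=\int_0^1 x^{\alpha-1}\,dx=\frac{1}{\alpha}, and since \Gamma(\alpha+1)=\alpha\Gamma(\alpha), the candidate constant \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} satisfies exactly the same recursion, which gives the claimed normalisation for natural-number parameters.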

In many situations, it is useful to be able to express some distribution as a function of IID random variables with a simpler distribution. We can’t quite do that for the Dirichlet distribution, but we can express it very simply as a function of independent RVs from the same family. It turns out that the family of Gamma distributions is a wise choice. Recall that the Gamma distribution with parameters (\alpha,\beta) has density:

\frac{1}{\beta^\alpha \Gamma(\alpha)}y^{\alpha-1}e^{-y/\beta},\quad y>0.

Anyway, define independent RVs Y_k with Gamma(\alpha_k,1) distributions; then we can specify (X_1,\ldots,X_m), the Dirichlet distribution with these parameters, by:

(X_1,\ldots,X_m)\stackrel{d}{=}\frac{(Y_1,\ldots,Y_m)}{\sum Y_k}.

In other words, the Dirichlet distribution describes the relative proportions of independent Gamma RVs. Note the following:

– the sum of the Gamma RVs, i.e. the factor we divide by to normalise, is itself Gamma-distributed, with parameters (\sum_k \alpha_k,1).

– If we wanted, we could define it in an identical way using Gamma with parameters (\alpha_k,\beta) for some fixed \beta.

– More helpfully, because the Gamma distribution is additive in the first parameter, we can take a limit to construct a Gamma process, whose increments over disjoint intervals are independent and have the form required. This will be a useful interpretation when we take a limit, as the largest increments will correspond to the largest jumps.
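As a minimal sampling sketch of this construction (the function is mine; gammavariate(a, 1.0) is Python’s Gamma(a,1) sampler, in the same parameterisation as above):

```python
import random

def dirichlet_via_gammas(alphas, seed=6):
    """Sample Dirichlet(alpha_1,...,alpha_m) by normalising independent Gamma(alpha_k, 1) RVs."""
    rng = random.Random(seed)
    ys = [rng.gammavariate(a, 1.0) for a in alphas]    # independent Gamma(alpha_k, 1)
    total = sum(ys)                                    # itself Gamma(sum alpha_k, 1)
    return [y / total for y in ys]

print(dirichlet_via_gammas([0.5, 0.5, 2.0]))
```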

Polya’s Urn

This is one of the best examples of a self-reinforcing process, where an event which has happened in the past is more likely to happen again in the future.

The basic model is as follows. We start with one white ball and one black ball in a bag. We draw a ball from the bag uniformly at random then replace it along with an additional ball of the same colour. Repeat this procedure.

The first step is to look at the distribution at some time n, ie after n balls have been added, so there are n+2 in total. Note that there are exactly n+1 possibilities for the state of the bag at this time. We must have between 1 and n+1 black balls, and indeed all of these are possible. In general, part of the reason why this process is self-reinforcing is that any distribution is in some sense an equilibrium distribution.

What follows is a classic example of a situation which is a notational nightmare in general, but relatively straightforward for a fixed finite example.

Let’s take n=5, and consider the probability that the sequence of balls drawn is BBWBW. This probability is:

\frac12\times \frac 23\times \frac14\times \frac 35\times \frac26.

So far this isn’t especially illuminating, especially if we start trying to cancel these fractions. But note that the denominator of the product will clearly be 6!. What about the numerator? Well, the contribution to the numerator from the black balls is 1\times 2\times 3=3!, while the contribution from the white balls is 1\times 2=2!. In particular, the contribution to the numerator from each colour is independent of the order of whites and blacks; it depends only on the number of whites and blacks. So we can conclude that the probability of any particular ordering of draws with k whites and n-k blacks (leaving k+1 whites and (n-k)+1 blacks in the bag) is

\frac{k! (n-k)!}{(n+1)!},

and so the probability that we end up with k+1 whites where we no longer care about ordering is

\binom{n}{k}\frac{k!(n-k)!}{(n+1)!}=\frac{1}{n+1}.

In other words, the distribution of the number of white balls in the bag after n balls have been added is uniform on \{1,2,\ldots,n+1\}.
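A quick Monte Carlo check of this (the code is mine): simulate the urn repeatedly with n=5 added balls and tabulate the final number of white balls; each value 1,\ldots,6 should appear with probability close to 1/6.

```python
import random
from collections import Counter

def polya_whites(n_added, rng):
    """One white and one black ball to start; draw-and-reinforce n_added times;
    return the final number of white balls."""
    white, black = 1, 1
    for _ in range(n_added):
        if rng.randrange(white + black) < white:
            white += 1
        else:
            black += 1
    return white

rng = random.Random(7)
trials = 60000
freq = Counter(polya_whites(5, rng) for _ in range(trials))
print({w: round(c / trials, 3) for w, c in sorted(freq.items())})
# each of 1, ..., 6 white balls should appear with probability roughly 1/6
```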

That looks like it might be something of a neat trick, so the natural question to ask is what happens if we adjust the initial conditions. Suppose that instead we start with \alpha_1,\ldots,\alpha_k balls of each of k colours. Obviously, this is going to turn into a proof by suggestive notation. In fact, the model doesn’t really rely on the (\alpha_i) being positive integers. Everything carries through with \alpha_i\in(0,\infty) if we view the vector as giving the initial weight of each colour.

As before, the order in which balls of various colours are drawn doesn’t matter hugely. Suppose that the first n balls drawn feature n_i balls of colour i. The probability of this is:

\binom{n}{n_1,\ldots,n_k}\frac{\prod_i \alpha_i(\alpha_i+1)\ldots (\alpha_i+n_i-1)}{\alpha(\alpha+1)\ldots(\alpha+(n-1))}

where \alpha=\sum_i \alpha_i. Then for large n (with each n_i also large), assuming for now that the \alpha_i\in\mathbb{N}, we have

\frac{\alpha_i(\alpha_i+1)\ldots(\alpha_i+n_i-1)}{n_i!}=\frac{(\alpha_i+n_i-1)!}{n_i!\,(\alpha_i-1)!}\approx \frac{n_i^{\alpha_i-1}}{(\alpha_i-1)!},\qquad \frac{\alpha(\alpha+1)\ldots(\alpha+n-1)}{n!}\approx \frac{n^{\alpha-1}}{(\alpha-1)!}.

The factorials (\alpha_i-1)! and (\alpha-1)! are just fixed constants (in fact they combine to give exactly the Dirichlet normalising constant above), so we get that overall, the probability above is approximately proportional to

\frac{\prod_i n_i^{\alpha_i-1}}{n^{\alpha-1}}=\frac{1}{n^{k-1}}\prod_i \left(\frac{n_i}{n}\right)^{\alpha_i-1},

and the factor \frac{1}{n^{k-1}} is just the volume of a cell in our discretisation of the simplex. So the vector of proportions \left(\frac{n_1}{n},\ldots,\frac{n_k}{n}\right) has approximate density \propto \prod_i \left(\frac{n_i}{n}\right)^{\alpha_i-1}, which we recall is the pdf of the Dirichlet distribution with parameters (\alpha_i), as telegraphed by our choice of notation. With some suitable martingale machinery, you can also prove that this convergence happens almost surely, for a suitable limit RV defined on the tail sigma-algebra.
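Finally, a sketch of the general urn itself (my code, not from Pitman): a single long run produces approximately one sample of the limiting frequencies, i.e. approximately one draw from the Dirichlet distribution with the given parameters, so the output varies from run to run, while its average over independent runs approaches the Dirichlet mean (\alpha_i/\alpha).

```python
import random

def polya_urn_frequencies(alphas, n_draws, seed=8):
    """General Polya urn started from (possibly non-integer) weights alphas;
    returns the proportions of added balls of each colour after n_draws steps."""
    rng = random.Random(seed)
    weights = list(alphas)
    added = [0] * len(alphas)
    for _ in range(n_draws):
        u = rng.random() * sum(weights)
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if u < acc:
                weights[i] += 1          # replace, together with one extra ball of colour i
                added[i] += 1
                break
    return [a / n_draws for a in added]

# one long run: approximately a single sample from Dirichlet(1, 2, 3);
# averaging over many independent runs would recover the mean (1/6, 2/6, 3/6)
print(polya_urn_frequencies([1.0, 2.0, 3.0], 20000))
```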

Next time I’ll introduce a more complicated family of self-reinforcing processes, and discuss some interesting limits of the Dirichlet distribution that relate to such processes.
