Hall’s Marriage Theorem

Hall’s Marriage Theorem gives conditions on when the vertices of a bipartite graph can be paired up along disjoint edges so that every vertex in the smaller class is accounted for. Such a set of edges is called a matching. If the sizes of the vertex classes are equal, then the matching naturally induces a bijection between the classes, and such a matching is called a perfect matching.

The number of possible perfect matchings of K_{n,n} is n!, which is a lot to check. Since it’s useful to have bijections, it’s useful to have matchings, so we would like a simple way to check whether a bipartite graph has a matching. Hall’s Marriage Theorem gives a way to reduce the number of things to check to 2^n, which is still large. However, much more importantly, the condition for the existence of a matching has a form which is much easier to check in many applications. The statement is as follows:

Given a bipartite graph G with vertex classes X and Y, there is a matching of X into Y precisely when for every subset A\subset X, |\Gamma(A)|\ge |A|, where \Gamma(A) is the set of vertices joined to some vertex in A, called the neighbourhood of A.

Taking A=X, it is clear that |Y|\ge |X| is a necessary condition for the result to hold, unsurprisingly. Perhaps the most elementary standard proof proceeds by induction on the size of X, taking the smallest A to give a contradiction, then using the induction hypothesis to lift smaller matchings up to the original graph. This lifting is based on the idea that a subset relation between sets induces a subset relation between their neighbourhoods.

In this post, I want to consider this theorem as a special case of the Max-Flow-Min-Cut Theorem, as this will support useful generalisations much more easily. The latter theorem is a bit complicated notationally to set up, and I don’t want it to turn into the main point of this post, so I will summarise. The Wikipedia article, and lots of sets of lecture notes are excellent sources of more detailed definitions.

The setting is a weakly-connected directed graph, with two identified vertices, the source, with zero indegree, and the sink, with zero outdegree. Every other vertex lies on a path (not necessarily unique) between the source and the sink. Each edge has a positive capacity, which should be thought of as the maximum volume allowed to flow down the edge. A flow is a way of assigning values to each edge so that they do not exceed the capacity, and there is volume conservation at each interior vertex. That is, the flow into the vertex is equal to flow out of the vertex. The value of the flow is the sum of the flows out of the source, which is necessarily equal to the sum of flows into the sink.

A cut is a partition of the vertices into two classes, with the source in one and the sink in the other. The value of a cut is then the sum of the capacities of any edges going from the class containing the source to the class containing the sink. In most examples, the classes will be increasing, in the sense that any path from source to sink changes class exactly once.

The Max-Flow-Min-Cut Theorem asserts that the maximum value of a flow through the system is equal to the minimum value of a cut. The proof is elementary, though it relies on defining a sensible algorithm to construct a minimal cut from a maximal flow, which would not be interesting to explain without more precise notation available.

First we explain why Hall’s Marriage Theorem is a special case of this result. Suppose we are given the setup of HMT, with edges directed from X to Y with infinite capacity. We add edges of capacity 1 from a new vertex x_0 to each vertex of X, and from each vertex of Y to a new vertex y_0. The aim is to give necessary and sufficient conditions for the existence of a flow of value |X|. Note that one direction of HMT is genuinely trivial: if there is a matching, then the neighbourhood size condition must hold. We focus on the other direction. If the maximum flow is less than |X|, then there is a cut of this value as well. We can parameterise a cut by the class of vertices containing the source, say S. Let A=S\cap X. If \Gamma(A)\not\subset S, then there will be an infinite capacity edge in the cut. So if we are looking for a minimal cut, this should not happen, hence \Gamma(A)\subset S if S is minimal. Similarly, there cannot be any edges from X\A to \Gamma(A). The value of the cut can then be given by

|\Gamma(A)|+|X|-|A|, which is at least |X| if the neighbourhood size assumption is given. Note we can use the same method with the original edges having capacity one, but we have to track slightly more quantities.
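To make the connection concrete, here is a minimal sketch in Python of the standard augmenting-path algorithm for bipartite matching (the function names are my own); by the theorem, the matching it builds saturates X exactly when Hall’s condition holds.

```python
def max_matching(X, neighbours):
    """Match X into Y by repeatedly searching for augmenting paths.

    neighbours: dict mapping each x in X to a list of its neighbours in Y.
    Returns a dict {y: x}; the matching saturates X iff it has size |X|.
    """
    match = {}  # current partner in X of each matched vertex y

    def augment(x, seen):
        for y in neighbours[x]:
            if y not in seen:
                seen.add(y)
                # use y if it is free, or if its partner can be rematched
                if y not in match or augment(match[y], seen):
                    match[y] = x
                    return True
        return False

    for x in X:
        augment(x, set())
    return match

# Example: vertex 2 is joined only to 'a', so 1 gets pushed onto 'b'.
print(max_matching([1, 2], {1: ['a', 'b'], 2: ['a']}))  # {'a': 2, 'b': 1}
```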

This topic came up because I’ve been thinking about fragmentation chains over this holiday. I have a specific example concerning forests of unrooted trees in mind, but won’t go into details right now. The idea is that we often have distributions governing random partitions of some kind, let’s say of [n]. Conditioning on having a given number of classes might give a family of distributions P_{n,k} for the partitions of [n] into k parts. We would be interested to know how easy it is to couple these distributions in a nice way. One way would be via a coalescence or fragmentation process. In the latter, we start with [n] itself, then at each step, split one of the parts into two according to some (random, Markovian) rule. We are interested in finding out whether such a fragmentation process exists for a given distribution.

It suffices to split the problem into single steps. Can we get from P_{n,k} to P_{n,k+1}?

The point I want to make is that this is just a version of Hall’s Marriage Theorem again, at least in terms of proof method. We can take X to be the set of partitions of [n] into k parts, and Y the set of partitions into (k+1) parts. Then we add a directed edge with infinite capacity between x\in X and y\in Y if y can be constructed from x by breaking a part into two. Finally, we connect a fresh vertex x_0 to each vertex x\in X, only now we insist that the capacity of this edge is equal to P_{n,k}(x), and similarly we add an edge from each y to y_0 with capacity equal to P_{n,k+1}(y). The existence of a fragmentation chain over this step is then equivalent to the existence of a flow of value 1 in this network.
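For small n one can check this condition numerically. Below is a minimal sketch, assuming the networkx library (in which an edge without a capacity attribute is treated as having infinite capacity); the partition labels and distributions in the toy example are hypothetical.

```python
import networkx as nx

def fragmentation_step_exists(P_k, P_k1, splits, tol=1e-9):
    """Decide whether a one-step fragmentation coupling P_k to P_k1 exists.

    P_k, P_k1: dicts mapping partitions to probabilities (each summing to 1).
    splits: pairs (x, y) where y is obtained from x by splitting one part.
    """
    G = nx.DiGraph()
    for x, p in P_k.items():
        G.add_edge('x0', ('X', x), capacity=p)  # x_0 -> x, capacity P_{n,k}(x)
    for y, p in P_k1.items():
        G.add_edge(('Y', y), 'y0', capacity=p)  # y -> y_0, capacity P_{n,k+1}(y)
    for x, y in splits:
        G.add_edge(('X', x), ('Y', y))          # no attribute: infinite capacity
    value, _ = nx.maximum_flow(G, 'x0', 'y0')
    return value >= 1 - tol

# Toy example with partitions of [4]: two 2-part and three 3-part partitions.
P2 = {'12|34': 0.5, '13|24': 0.5}
P3 = {'1|2|34': 0.3, '12|3|4': 0.2, '13|2|4': 0.5}
splits = [('12|34', '1|2|34'), ('12|34', '12|3|4'), ('13|24', '13|2|4')]
print(fragmentation_step_exists(P2, P3, splits))  # True
```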

Although in many cases this remains challenging to work with, which I will explore in a future post perhaps, this is nonetheless a useful idea to have in mind when it comes to deciding on whether such a construction is possible for specific examples.


The Secretary Problem

This post explores a direction rather different from my research and the recent problems I’ve been talking about. We discuss a problem from optimal stopping theory. In less politically correct times, it went by various other names, including (according to Wikipedia) the marriage problem, the Sultan’s dowry problem and the fussy suitor problem. A much more appealing setting might be someone running an auction or selling a house, who wants the maximum bid, but has to accept or reject bids immediately. In any case, we shall quickly reduce it to a context-free setting.

The problem is as follows. A manager needs to hire a secretary and has n applicants, who are interviewed one at a time. I can’t think of a good reason why this should be the case in this context, but the manager’s only concern is to maximise the probability of picking the best of the n candidates. This would be easy if the choice were made at the end, but in fact the decision whether to accept or reject an applicant has to be made straight after the interview. What is the optimal strategy?

It must be clarified what is going on. In a practical setting, there would be some underlying numerical attribute which could be compared, and so the problem would reduce to what I’ve previously called the Passport Photograph problem. In this setting, after the kth candidate, we only know the relative ranks of the first k candidates to be seen. This in turn massively reduces the possibilities for a strategy. The strategy must be of the form:

Accept the kth candidate if they are the best candidate yet seen with some probability that is a function of k and n. Reject otherwise. In particular, this probability cannot depend on the history of the process – that is of basically no relevance at all since we only know the relative ranks at any given time.

There are two ideas that I want to justify informally for now. It seems reasonable to assume that this probability will either be 0 or 1 for each k and n. For example, one could work backwards in k, from n back to 1 working out the probabilities, and proving at each stage inductively that there is no advantage to a mixed strategy.

Secondly, suppose we decide we will accept the kth candidate if they are the best so far. Well if they are not the best, we pass to the next candidate. I claim that if we are using an optimal strategy, we should also accept the (k+1)st candidate if they are the best so far. This property is called monotonicity for the algorithm. In many senses this is the meat of the problem. If we can prove this, then everything else will be relatively straightforward. For now though, we assume it is true. This means that our optimal strategy for n is precisely defined by a threshold value of k. So we rewrite the strategy as follows:

We observe the first (k-1) candidates. After that, as soon as a candidate is the best so far, we accept them.

Note that this allows for the possibility that we do not choose anyone, so if this feels like a problem, we could always demand we choose the last candidate, even if we know they are not optimal. The question that remains is: what value should k take? Let’s make life easy and take n to be large, so we can assume k=cn for some constant c.

The probability that the second best was in the first cn, but the first was not is c(1-c). On that event, this strategy will definitely give us the best candidate. The probability that the third best was in the first cn, but the first and second were not is c(1-c)^2. Our strategy gives us the best candidate overall if and only if the first candidate appears before the second candidate in what remains of the process. Since we are just sampling from a random permutation, these are equally likely, so the probability that we get the best candidate is

\frac{1}{2} c(1-c)^2.

Continuing, the probability that the best candidate in the first cn is the fourth best overall, and the strategy gives us the best candidate is:

\frac{1}{3}c(1-c)^3.

So to find the optimal c, we have to maximise the sum

\sum_{k\geq 1}\frac{1}{k}c(1-c)^k

=-c\log c.

To find the maximum, we set the derivative equal to zero: the derivative of -c\log c is -\log c-1, so solving gives c=1/e.
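A quick simulation supports this; the sketch below (sample sizes arbitrary) applies the rule ‘observe the first n/e candidates, then take the first record’ to a uniformly random permutation of ranks.

```python
import math, random

def run_once(n):
    """One run of the 1/e rule; rank 0 denotes the best candidate."""
    ranks = list(range(n))
    random.shuffle(ranks)
    k = int(n / math.e)                  # length of the observation phase
    best_seen = min(ranks[:k], default=n)
    for r in ranks[k:]:
        if r < best_seen:                # first candidate beating all so far
            return r == 0                # success iff they are the overall best
    return False                         # nobody accepted

trials = 20000
p = sum(run_once(1000) for _ in range(trials)) / trials
print(p, 1 / math.e)                     # both approximately 0.368
```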

It turns out the most natural way to generalise this problem is to consider n independent random variables taking the values 0 and 1. The aim is to find a strategy that maximises the probability of picking the final 1. An informal analysis proceeds much as in the secretary problem. In this setting, we will work with fixed finite n rather than in some limit. Let us call the {0,1}-valued RVs, I_1,I_2,\ldots,I_n, and set

p_j=\mathbb{P}(I_j=1),\quad q_j=1-p_j,\quad r_j=\frac{p_j}{q_j},

in keeping with Bruss’s original paper dealing with this so-called ‘Odds problem’. Then, as before, we want to find the threshold for k. We should remark that the argument given below is shorter than Bruss’s proof, which uses generating functions, but again we will not address the matter of proving that the optimal strategy should have this threshold form.

First we explain why the secretary problem is a special case of this. Observe that, given no information about the first (k-1) candidates apart from their relative ranks, the probability that the kth candidate is the best so far is independent of the history of the process, and is equal to 1/k. So taking p_k=1/k in the odds problem precisely replicates the secretary problem, since the best candidate is, by construction, the final candidate who had the property of being the best so far.

Now we work out how to calculate the threshold for k. We could define k by:

k=\arg\max_{1\leq j \leq n}\left\{\mathbb{P}(I_j+\ldots+I_n=1)\right\}.

By summing over the possibilities for which of the I_j’s is 1, we get

\mathbb{P}(I_k+\ldots+I_n=1)=q_kq_{k+1}\ldots q_n\big[\frac{p_k}{q_k}+\ldots+\frac{p_n}{q_n}\big]=q_k\ldots q_n[r_k+\ldots+r_n].

So we ask, for which values of k is:

q_k\ldots q_n[r_k+\ldots+r_n]\geq q_{k+1}\ldots q_n[r_{k+1}+\ldots+r_n] ?

This is equivalent to

q_k\ldots q_n \cdot r_k \geq (1-q_k)q_{k+1}\ldots q_n[r_{k+1}+\ldots+r_n],

and we recall that (1-q_k)=p_k=q_kr_k, hence

\iff q_{k+1}\ldots q_n\ge q_{k+1}\ldots q_n[r_{k+1}+\ldots+r_n],

\iff 1\geq r_{k+1}+\ldots+r_n.

So we conclude, since the tail sums r_j+\ldots+r_n decrease in j, that the correct optimal value for the threshold k is the largest value of j such that

r_j+\ldots+r_n \geq 1,

with the convention that k=1 if even the full sum is below 1.
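In code, the whole rule is a single backwards scan of the odds; a sketch (the check at the end uses p_j=1/j, recovering the secretary threshold numerically before the asymptotic check below):

```python
def odds_threshold(p):
    """Bruss's odds algorithm for success probabilities p = [p_1, ..., p_n].

    Returns the 1-indexed threshold s: observe I_s, ..., I_n and stop at
    the first j >= s with I_j = 1.
    """
    tail = 0.0
    for j in range(len(p) - 1, -1, -1):     # scan the odds from the back
        tail += float('inf') if p[j] == 1 else p[j] / (1 - p[j])  # r_j
        if tail >= 1:                       # largest j with tail sum >= 1
            return j + 1                    # convert to 1-indexing
    return 1                                # total odds below 1

# Secretary problem: p_j = 1/j gives a threshold of roughly n/e.
n = 1000
print(odds_threshold([1 / j for j in range(1, n + 1)]))  # about 369 ~ n/e
```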

We can check that this agrees with our answer for the secretary problem since

\frac{1}{cn}+\frac{1}{cn+1}+\ldots+\frac{1}{n}\approx \int_{cn}^n \frac{dx}{x}=-\log c

which is approximately 1 precisely when c is roughly 1/e.

Finally a word on generalisations. Dendievel’s paper, which appeared on arXiv yesterday, considers a problem of Richard Weber where the RVs instead take values in {-1,0,1}, and you want a strategy that successfully selects either the last time the value -1 appears or the last time +1 appears. The case where the probabilities of the events 1 and -1 are 1) equal and 2) constant is relatively straightforward by backwards induction as here, and the author deals with the two natural generalisations by removing 1) and 2) in turn. It seems natural that such a method should adapt for RVs taking any finite number of values.

REFERENCES

Bruss – Sum the odds to one and stop (2000)

Bruss – A note on bounds for the odds theorem of optional stopping (2003)

Dendievel – Weber’s optimal stopping problem and generalisations (2013) http://arxiv.org/abs/1309.2860

Mixing Times 6 – Aldous-Broder Algorithm and Cover Times

In several previous posts, I’ve discussed the Uniform Spanning Tree. The definition is straightforward: we choose uniformly at random from the set of trees which span a fixed underlying graph. But for a dense underlying graph, there are a very large number of spanning trees. Cayley’s formula says that the complete graph K_n has n^{n-2} spanning trees, so selecting uniformly from this list directly is impractical.

We seek a better algorithm. In a post about a year ago, I presented the result that the path between two fixed points x and y in the UST is distributed as the path generated by Loop-Erased Random Walk, for which we start at x and delete cycles as they appear. An initial problem might be that this only gives us a single path, which might be enough in some contexts, but in general we will want to specify the whole tree. Wilson’s Algorithm is an unsurprising but useful extension to this equivalence which does just that. You start by constructing the LERW between two vertices, then you add the LERW which connects some other vertex to the path you already have. Then you take a further vertex not currently explored and start LERW there, continuing until you hit the tree that you already have. Iterate this process, which must terminate after at most n steps, when there are no vertices left to start from. The tree thus obtained is the UST. The tricky part is proving that the method for selecting which unused vertices to start from has no effect on the distribution of paths between two fixed points.
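Concretely, here is a sketch of Wilson’s algorithm in Python (names are my own), rooted at an arbitrary vertex. Overwriting the recorded successor whenever the walk revisits a vertex performs the loop erasure implicitly.

```python
import random

def wilson_ust(adj, root):
    """Sample a uniform spanning tree of a connected graph.

    adj: dict mapping each vertex to the list of its neighbours.
    Returns the tree as a set of frozenset edges.
    """
    in_tree = {root}
    tree_edges = set()
    for start in adj:                      # order of starts does not matter
        if start in in_tree:
            continue
        # Random walk from `start` until the current tree is hit, recording
        # the most recent exit edge from each vertex (this erases loops).
        succ = {}
        v = start
        while v not in in_tree:
            succ[v] = random.choice(adj[v])
            v = succ[v]
        # Retrace the loop-erased path and graft it onto the tree.
        v = start
        while v not in in_tree:
            in_tree.add(v)
            tree_edges.add(frozenset((v, succ[v])))
            v = succ[v]
    return tree_edges

# Example: a uniform spanning tree of the complete graph on 4 vertices.
K4 = {v: [w for w in range(4) if w != v] for v in range(4)}
print(wilson_ust(K4, root=0))
```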

I want to consider a different algorithm, discovered roughly simultaneously by Aldous and Broder. Start a random walk on the underlying graph at some particular vertex. Every time we traverse an edge which takes us to a vertex we haven’t yet explored, add this edge to the tree. For now I don’t want to give a proof that this algorithm works, but rather to talk about how fast it works, because it ties in nicely with something from the Mixing Times book we’ve been reading recently. It is clear that the algorithm terminates at the first time the random walk has visited every vertex. This is a stopping time, called the cover time of the Markov chain. If the underlying graph is complete, we notice that this is annoying, because it means that the cover time will grow like n\log n. That is, it will take an increasingly long time to gather the final few vertices into the tree. Perhaps some combination of Aldous-Broder initially, then Wilson’s method for the final o(n) vertices, might be preferable?
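The algorithm itself is only a few lines; a sketch under the same conventions as above:

```python
import random

def aldous_broder_ust(adj, start):
    """Sample a uniform spanning tree via the Aldous-Broder algorithm."""
    visited = {start}
    tree_edges = set()
    v = start
    while len(visited) < len(adj):   # run until the walk has covered the graph
        w = random.choice(adj[v])
        if w not in visited:         # first visit to w: keep the edge crossed
            visited.add(w)
            tree_edges.add(frozenset((v, w)))
        v = w
    return tree_edges
```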

I want to discuss how to treat this cover time. Often we have information about the hitting times of states from other states, \mathbb{E}_x T_y. A relationship between t_{hit}, the maximal hitting time, defined to be the maximum of the previous display over x and y, and the expected cover time would be useful, especially for a highly symmetric graph like the complete graph where the expected hitting times are all the same.

Matthews’ Method relates these two for an irreducible finite Markov chain on n states. It says:

t_{cov}\leq t_{hit}\left(1+\frac12+\ldots+\frac 1 n\right).

We first remark that this agrees with what we should get for the random walk on the complete graph. There, the hitting time of x from y is a geometric random variable with success probability 1/n, hence expectation n. The cover time is the standard coupon collector problem, giving expectation n\log n, and the sum-of-reciprocals factor is asymptotically a good approximation.
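As a quick numerical check, here is a sketch for the variant of the walk which jumps uniformly to one of the other n-1 vertices (so each hitting time is geometric with success probability 1/(n-1), and t_{hit}=n-1):

```python
import random

def cover_time_Kn(n):
    """One sample of the cover time of random walk on the complete graph K_n."""
    visited, v, t = {0}, 0, 0
    while len(visited) < n:
        v = random.choice([w for w in range(n) if w != v])  # uniform neighbour
        visited.add(v)
        t += 1
    return t

n, trials = 50, 2000
avg = sum(cover_time_Kn(n) for _ in range(trials)) / trials
bound = (n - 1) * sum(1 / k for k in range(1, n + 1))  # Matthews' bound
print(avg, bound)  # nearly equal: covering K_n is exactly coupon collecting
```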

The intuition is that if we continue until we hit state 1, then reset and continue until we hit state 2, and so on, by the time we hit state n after (n-1) iterations, this is a very poor overestimate of the cover time, because we are actually likely to have hit most states many times. What we want to do really is say that after we’ve hit state 1, we continue until we hit state 2, unless we’ve already done so, in which case we choose a different state to aim for, one which we haven’t already visited. But this becomes complicated because we then need to know the precise conditional probabilities of visiting any site on the way between two other states, which will depend rather strongly on the exact structure of the chain.

Peres et al give a coupling proof in Chapter 11 of their book which I think can be made a bit shorter, at least informally. The key step is that we still consider hitting the sites in order, only now in a random order.

That is, we choose a permutation \sigma\in S_n uniformly at random, and we let T_k be the first time that states \sigma(1),\ldots,\sigma(k) have all been visited. This is a random time that is measurable in the product space, and for each \sigma it is a stopping time.

The key observation is that \mathbb{P}(T_{k+1}=T_k)=1-\frac{1}{k+1}. This holds conditional on any path of the Markov chain because the requirement for the event is that \sigma(k+1) is visited after \{\sigma(1),\ldots,\sigma(k)\}. The statement therefore holds as stated as well as just pathwise. Then, by the SMP, conditional on \{T_{k+1}>T_k\}, we have

T_{k+1}-T_k \leq_{st} t_{hit}.

Note that by the definition of t_{hit}, this bound on the hitting time T_{k+1} is unaffected by concerns about where the chain actually is at T_k (since it is not necessarily at \sigma(k)).

So, removing the conditioning, we have:

\mathbb{E}\Big[T_{k+1}-T_k\Big]\leq\frac{1}{k+1}t_{hit},

and so the telescoping sum gives us Matthews’ result.
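Explicitly, since T_n is the cover time for the permuted order and \mathbb{E}[T_1]\leq t_{hit}, this reads:

\mathbb{E}[T_n]=\mathbb{E}[T_1]+\sum_{k=1}^{n-1}\mathbb{E}\big[T_{k+1}-T_k\big]\leq t_{hit}+\sum_{k=1}^{n-1}\frac{t_{hit}}{k+1}=t_{hit}\left(1+\frac12+\ldots+\frac1n\right).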

One example is the cover time of random walk on the n x n torus, which turns out to be

O(n^2(\log n)^2).

If anyone remembers that Microsoft screensaver from many years ago, which started with a black screen and a snake leaving a trail of white pixels as it negotiated the screen, this will be familiar. The last few black bits take a frustratingly long while to disappear. Obviously that isn’t quite a random walk, but it perhaps diminishes the surprise that the cover time should be this long.

There are a couple of interesting things I wanted to say about electrical networks for Markov chains and analytic methods for mixing times, but the moment may have passed, so this is probably the last post about Mixing Times. Plans are in motion for a similar reading group next term, possibly on Random Matrices.

Minimum Spanning Trees

In my last post, I discussed the Uniform Spanning Tree. To summarise very briefly: given a connected graph on n vertices, a spanning tree is a subgraph, that is a subset of the edges, which is connected, contains every vertex, and contains no cycles. It turns out this forces the tree to have exactly n-1 edges.

We are interested in natural mechanisms for generating randomly chosen spanning trees of a given graph. One way we can always do this is to choose uniformly at random from the set of possible trees. This UST is in some sense canonical, but it is worth knowing about some other measures on trees that might be of interest.

A family of natural problems in operations research concerns an arbitrary complex network, with some weight or cost associated to each connection. The question is how to perform some operation on the network so as to minimise the resulting cost. Perhaps the most famous such problem is that of the Travelling Salesman. The story is that a salesman needs to visit n locations and wants to do the trip as efficiently as possible. This might be thought of as some sort of financial or time cost, but probably the easiest way to set it up is to imagine he is trying to minimise the distance he has to travel. It is not hard to see why this problem might genuinely arise in plenty of real-world situations, where an organisation or agent is trying to be as efficient as possible.

It might be the case that it is not possible to travel between every pair of locations, but we needn’t assume that for now. So if he knows the distance between any pair of cities, he wants to know which of the possible routes gives the shortest overall distance. The problem is that there are n! routes, and this grows roughly like n^n, which is faster than exponential, so for as few as 20 cities the brute-force comparison is already too large to compute.

There are various algorithms which reduce the number of routes that must be checked, and some approximation methods. But if you want the exact answer, it is not currently possible to calculate this in polynomial time.

Minimal Spanning Trees and Uniqueness

For the travelling salesman, we were looking for the minimal cost spanning path. In the case of the complete graph, this is the same as the minimal cost non-repeating path of length n-1. Such paths form a subset of the set of spanning trees on the underlying graph. So what if we look instead for the minimal cost spanning tree? This certainly exists since, after all, there are only finitely many spanning trees.

So far, this has been deterministic, but we were looking for a random spanning tree. We can achieve this by choosing the weights at random. Anything other than assigning the weights as an IID sequence seems likely to be complicated, but there isn’t a canonical choice of the distribution of the weights. Our first question will be whether the distribution of the weights affects the distribution of the induced MST. In fact it will turn out that so long as the distribution is continuous, it has no effect on the distribution of the MST. The continuous condition might seem odd, but it is present only to ensure that the weights almost certainly end up generating a unique MST.

It turns out that there is a straightforward greedy algorithm to find the MST once the weights are known. We will examine some consequences of this algorithm in the random setting. First we check uniqueness. The condition required for uniqueness is that the weights be distinct. Note that this is slightly weaker than the statement that all sums of (n-1)-tuples of weights be distinct, which would immediately imply a unique MST.

We now prove this condition. Suppose we have distinct weights, and an associated MST. If the underlying graph is a tree, then the result is clear. Otherwise, add some extra edge e, with weight w(e). By the definition of a tree, this generates exactly one cycle. Consider the other edges, say e_1,\ldots,e_k, in this cycle. If any w(e_i)>w(e), then we can replace e_i with e to get a spanning tree with smaller weight, contradicting the claim that we started with an MST. So by distinctness of weights, we conclude that w(e)>w(e_i) for all i.

Conversely, suppose we remove some edge e which IS in the MST. We end up with exactly two connected components. Consider all the edges in the underlying graph between the two components, and suppose that one of these f satisfies w(f)<w(e). Then if we add in edge f, which is by construction not in the original MST, we end up with a smaller total weight than we started with, a further contradiction.

We can summarise this in a neat form. Given an edge e between x and y, consider the set of all edges in the underlying graph with weight LESS THAN w(e). Then if x and y are in different components, the edge e must be in the MST. Since we have an explicit description of which edges are present, it follows that the MST is unique. The problem is that working out the component structure of the graph with higher weights removed is computationally rather intensive. We want a slightly faster algorithm.

Kruskal’s Algorithm

Several rather similar algorithms were developed roughly simultaneously. Prim’s algorithm is a slight generalisation of what we will discuss. Anyway, for now we consider Kruskal’s algorithm which has the advantage that it can be described without really needing to draw a diagram.

We start by ordering the weights. Without loss of generality, we might as well relabel the edges so that

w(e_1)< w(e_2)<\ldots< w(e_{|E|}).

Now, by the condition derived in the argument for uniqueness, we must have e_1 and e_2 in any MST. Now consider e_3. Unless doing so would create a cycle, add e_3. Then, unless doing so would create a cycle, add e_4. Continue. It is clear that the result of this procedure is acyclic. To check it is actually a spanning tree, we show that it is also connected. Suppose not, and two of the components are A and B. Let e be the edge between A and B with minimal weight. According to the algorithm, we should have included e in our MST because at no point would adding it possibly have created a cycle. So we have proved that this greedy algorithm does indeed give the (unique) MST.
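A sketch of the algorithm in Python, using a union-find structure to detect cycles (the names are my own). Note that the weights enter only through the sort, an observation we exploit next.

```python
def kruskal_mst(n, weighted_edges):
    """Kruskal's algorithm on vertices 0..n-1.

    weighted_edges: list of (weight, u, v) tuples with distinct weights.
    Returns the list of MST edges.
    """
    parent = list(range(n))

    def find(x):                            # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(weighted_edges):  # edges in increasing order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge joins two components, so
            parent[ru] = rv                 # adding it cannot create a cycle
            mst.append((u, v))
    return mst

# Example on 4 vertices: the two lightest edges are always included.
print(kruskal_mst(4, [(0.1, 0, 1), (0.2, 1, 2), (0.7, 0, 2), (0.4, 2, 3)]))
```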

A useful consequence of this is that we know the two edges with overall minimum weight are definitely in the MST. In the search for a random measure on spanning trees, what is most important is that we didn’t use the actual values of the weights in this construction, only the order. In other words, we might as well have assumed the weights were a random permutation from S_{|E|}. This now answers our original question about how the random weight MST depends in distribution on the underlying edge weight distribution. So long as with probability one the weights are distinct (which holds if the distribution is continuous), then the distribution of the resulting spanning tree is constant.

It’s not too hard to show this isn’t the same as UST: n=4 suffices as a counterexample. But the difference in asymptotic behaviour of properties such as the diameter is of interest, and will be explored in the next post.

Mixing Times 3 – Convex Functions on the Space of Measures

The meat of this course covers rates of convergence of the distributions of Markov chains. In particular, we want to be thinking about lots of distributions simultaneously, so we really need to be comfortable working with the space of measures on a (for now) finite state space. This is not really too bad actually, since we can embed it in a finite-dimensional real vector space.

\mathcal{M}_1(E)=\{(x_v:v\in\Omega),x_v\geq 0, \sum x_v=1\}\subset \mathbb{R}^\Omega.

Since most operations we might want to apply to distributions are linear, it doesn’t make much sense to inherit the usual Euclidean metric. In the end, the metric we use is the same as the L_1 metric, but the motivation is worth exploring. Typically, the size |\Omega| will be a function of n, a parameter which will tend to infinity. So we do not want to be too rooted in the actual set \Omega for what will follow.

Perhaps the best justification for total variation distance is from a gambling viewpoint. Suppose your opinion for the distribution of some outcome is \mu, and a bookmaker has priced their odds according to their evaluation of the outcome as \nu. You want to make the most money, assuming that your opinion of the distribution is correct (which in your opinion, of course it is!). So assuming the bookmaker will accept an arbitrarily complicated (but finite obviously, since there are only |\Omega| possible outcomes) bet, you want to place money on whichever event evinces the greatest disparity between your measure of likeliness and the bookmaker’s. If you can find an event which you think is very likely, and which the bookmaker thinks is unlikely, you are (again, according to your own opinion of the measure) on for a big profit. This difference is the total variation distance ||\mu-\nu||_{TV}.

Formally, we define:

||\mu-\nu||_{TV}:=\max_{A\subset\Omega}|\mu(A)-\nu(A)|.

Note that if this maximum is achieved at A, it is also achieved at A^c, and so we might as well go with the original intuition of

||\mu-\nu||_{TV}=\max_{A\subset\Omega} \left[\mu(A)-\nu(A)\right].

If we decompose \mu(A)=\sum_{x\in A}\mu(x), and similarly for A^c, then add the results, we obtain:

||\mu-\nu||_{TV}=\frac12\sum_{x\in\Omega}|\mu(x)-\nu(x)|.

There are plenty of other interesting interpretations of total variation distance, but I don’t want to get bogged down right now. We are interested in the rate of convergence of distributions of Markov chains. Given some initial distribution \lambda of X_0, we are interested in ||\lambda P^t-\pi||_{TV}. The problem is that doing everything in terms of some general \lambda is really annoying, at the very least for notational reasons. So really we want to investigate

d(t)=\max_{\lambda\in\mathcal{M}_1(E)}||\lambda P^t-\pi||_{TV},

the worst-case scenario, where we choose the initial distribution that mixes the slowest, at least judging at time t. Now, here’s where the space of measures starts to come in useful. For now, we relax the requirement that measures must be probability distributions. In fact, we allow them to be negative as well. Then \lambda P^t-\pi is some signed measure on \Omega with zero total mass.

But although I haven’t yet been explicit about this, it is easy to see that ||\cdot||_{TV} is a norm on this space. In fact, it is (equivalent to – the factor of 1/2 makes no difference!) the L_1 norm as defined before. Recall that norms are convex functions; this is an immediate consequence of the triangle inequality. The set of suitable distributions \lambda is convex and compact, because a convex combination of probability distributions is another probability distribution.

Then, we know from linear optimisation theory that convex functions on a compact convex set achieve their maxima at extreme points. And the extreme points of this set \mathcal{M}_1(E) are precisely the delta-measures \delta_v at points of the state space. So in fact, we can replace our definition of d(t) by:

d(t)=\max_{x\in\Omega}||P^t(x,\cdot)-\pi||_{TV},

where P^t(x,\cdot) is the same as (\delta_x P^t)(\cdot). Furthermore, we can immediately apply this idea to get a second result for free. In some problems, particularly those with neat couplings across all initial distributions, it is easier to work with a larger class of transition probabilities, rather than the actual equilibrium distribution, so we define:

\bar{d}(t):=\max_{x,y\in\Omega}||P^t(x,\cdot)-P^t(y,\cdot)||_{TV}.

The triangle inequality gives \bar{d}(t)\leq 2d(t) immediately. But we want to show d(t)\leq \bar{d}(t), and we can do that as before, by considering

\max_{\lambda,\mu\in\mathcal{M}_1(E)}||\lambda P^t-\mu P^t||_{TV}.

The function we are maximising is a convex function on \mathcal{M}_1(E)^2, and so it attains its maximum at an extreme point, which must be of the form \lambda=\delta_x,\mu=\delta_y. Hence \bar{d}(t) is equal to the displayed expression above, which is certainly greater than or equal to the original formulation of d(t), as this is the maximum of the same expression over a strict subset.

I’m not suggesting this method is qualitatively different to that proposed by the authors of the book. However, I think this is very much the right way to be thinking about these matters of maximising norms over a space of measures. Partly this is good because it gives an easy ‘sanity check’ for any idea. But also because it gives some idea of whether it will or won’t be possible to extend the ideas to the case where the state space is infinite, which will be of interest much later.

Mixing Times 2 – Metropolis Chains

In our second reading group meeting for Mixing Times of Markov Chains, we reviewed chapters 3 and 4 of the Levin, Peres and Wilmer book. This post and the next contains a couple of brief thoughts about the ideas I found most interesting in each chapter.

Before reading chapter 3, the only thing I really knew about Monte Carlo methods was the slogan. If you want to sample from a probability distribution that you can’t describe explicitly, find a Markov chain which has that distribution as an equilibrium distribution, then run it for long enough starting from wherever you fancy. Then the convergence theorem for finite Markov chains means that the state of the chain after a long time approximates well the distribution you were originally looking for.

On the one previous occasion I had stopped and thought about this, I had two questions which I never really got round to answering. Firstly, what sort of distributions might you not be able to simulate directly? Secondly, and perhaps more fundamentally, how would you go about finding a Markov chain for which a given distribution is in equilibrium?

In the end, the second question is the one answered by this particular chapter. The method is called a Metropolis chain, and the basic idea is that you take ANY Markov chain with the appropriate state space, then fiddle with the transition probabilities slightly. The starting chain is called a base chain. It is completely possible to adjust the following algorithm for a general base chain, but for simplicity, let’s assume it is possible to take an irreducible chain for which the transition matrix is symmetric. By thinking about the detailed balance equations (DBEs), one sees that the uniform distribution is then the (unique) equilibrium distribution. Suppose the transition matrix is given by \Psi(x,y), to copy notation from the book. Then set:

P(x,y)=\begin{cases}\Psi(x,y)\left[1\wedge \frac{\pi(y)}{\pi(x)}\right]&y\neq x\\ 1-\sum_{z\neq x} \Psi(x,z)\left [1\wedge \frac{\pi(z)}{\pi(x)}\right]& y=x.\end{cases}

Note that this second case (y=x) is of essentially no importance. It just confirms that the rows of P add to 1. It is easy to check from the DBEs that \pi is the equilibrium distribution of the matrix P. One way to think of this algorithm is that we run the base chain as normal, but occasionally suppress transitions if they involve a move from a state which is likely (under \pi) to one which is less likely. This is done in proportion to the ratio, so it is unsurprising perhaps that the limit in distribution is \pi.
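As a minimal sketch in Python (assuming, so that the proposal is symmetric, that the base chain is random walk on a regular graph; all names are my own); note that only ratios of \pi are ever evaluated:

```python
import random

def metropolis_step(x, pi, neighbours):
    """One step of the Metropolis chain for target pi (unnormalised weights).

    neighbours: moves of the symmetric base chain Psi, here uniform over
    the neighbours in a regular graph.
    """
    y = random.choice(neighbours[x])               # propose a base-chain move
    if random.random() < min(1.0, pi[y] / pi[x]):  # accept w.p. 1 ^ pi(y)/pi(x)
        return y
    return x                                       # otherwise suppress the move

# Usage: weights 1..5 on a 5-cycle (2-regular, so the proposal is symmetric).
pi = {i: i + 1 for i in range(5)}
nbrs = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
x, counts = 0, [0] * 5
for _ in range(100_000):
    x = metropolis_step(x, pi, nbrs)
    counts[x] += 1
print([round(c / 100_000, 3) for c in counts])     # approaches (1,...,5)/15
```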

Conveniently, this algorithm also gives us some ideas for how to answer the first question. Note that at no point do we need to know \pi(x) for any state x. We only need to use the ratios of probabilities \frac{\pi(x)}{\pi(y)}. So this is perfect for distributions where there is a normalising constant which is computationally taxing to evaluate. For example, in the Ising model and similar statistical physics objects, probabilities are viewed more as weightings. There is a normalising constant, often called the partition function Z in this context, lying in the background, but especially if the underlying geometry is quite exotic, we definitely don’t want to have to worry about actually calculating Z. Thus we have a way to generate samples from such models. The other classic example is a random walk on a large, perhaps unknown graph. Then the equilibrium distribution at a vertex is proportional to the degree of that vertex, but again you might not have this information for the entire graph. It is reasonable to think of a situation where you might be able to take a random walk on a graph, say the connectivity graph of the internet, without knowing about all the edges at any one time. So, even though you potentially explore everywhere, you only need to know a small amount at any one time.

Of course, the drawback of both of these examples is that a lack of knowledge about the overall system means that it is hard in general to know how many steps the Metropolis chain must run before we can be sure that we are close to the equilibrium distribution it has been constructed to approach. So, while these chains are an excellent example to have in mind while thinking about mixing times, they are also a good motivation for the subject itself. General rules about speed of convergence to equilibrium are precisely what are required to make such implementations concrete.

The Perron-Frobenius Theorem for Stochastic Matrices

This article was prompted by a question asked by one of my students about 5 minutes before the end of the final lecture of my Markov Chains course in Linyi. Although I didn’t have time to think about it right then, the flight home offered plenty of opportunity for having a play around. I am well aware that the proof given below is not the best one, but the ideas of minimising quadrants of a matrix seemed fairly natural. Anyway, it’s been sitting on my desktop for over two months now, so I decided I should put it up.

———

Recall that the transition probabilities of a finite Markov chain are determined by a stochastic matrix P. That is, each row of P is a probability distribution. In both theory and applications, the existence and properties of an invariant distribution is of interest. This is a probability distribution \pi satisfying the relation:

\pi P=\pi. (1)

It is clear that \bigg(\begin{smallmatrix}1\\ \vdots\\1\end{smallmatrix}\bigg) is a right- or column eigenvector of P, with eigenvalue 1. Since the spectrum of P^T is the same as that of P, we conclude that 1 is a left-eigenvalue of P also. So we can be assured of the existence of a vector \pi satisfying (1). What is unclear is that this eigenvector \pi should be a probability distribution. Since at least one entry must be non-zero, it will suffice to show that every entry of \pi is non-negative.

A sufficient condition for the uniqueness of an invariant distribution is that the Markov chain be irreducible. This is best defined using the terminology of random walks: the chain is irreducible if for every pair of states i,j\in I, it is possible to move from i to j and back again. In terms of the transition matrix, P is irreducible if it is not block upper-triangular, up to reordering rows and columns.

We want to show that when P is irreducible, the (unique) 1-eigenvector is a probability distribution. The standard method proposed in this context is to exhibit the invariant distribution directly. For example, Norris’s Markov Chains defines

\gamma_i^k=\text{ expected time spent in i between visits to k }=\mathbb{E}_k\sum_{n=0}^{T_k}1_{\{X_n=i\}},

and shows that (\gamma_i^k)_{i\in I} satisfies (1).

Nonetheless, the result required is clearly at least one step removed from the probabilistic interpretation, so it would be satisfying to find a direct proof of existence. Typically, one quotes the substantially more general theorem of Perron and Frobenius, the most relevant form of which is:

Theorem (Perron-Frobenius): Let A be a non-negative and irreducible square matrix. Then there is a positive real eigenvalue \lambda with multiplicity 1, such that all other eigenvalues have absolute value less than or equal to \lambda. Moreover, the (unique up to scaling) left- and right-eigenvectors corresponding to \lambda are positive.

Here we present a short proof of this result in the case where A is the (stochastic) transition matrix of a Markov chain.

Proposition: An irreducible stochastic matrix has a 1-eigenvector with all entries non-negative.

Proof: We show instead the contrapositive: that if a stochastic matrix has a 1-eigenvector with both negative and non-negative components, then it is reducible. The motivation is the third eigenvector given in example (2). Observe that the communicating classes there are \{1,2\} and \{3\}, and it is not hard to see that for any eigenvector with negative and non-negative components, the sign of a component is a class property.

Informally, given an n\times n stochastic matrix P, and a 1-eigenvector \pi with this property, we relabel the states so that the non-negative components, whose index set we call A\subset I, come first. That is, in a moderate abuse of notation:

\pi=(\underbrace{\pi_A}_{\geq 0}\quad\underbrace{\pi_B}_{<0}).\quad\text{ If we write P as }\begin{pmatrix}P_{AA}&P_{AB}\\P_{BA}&P_{BB}\end{pmatrix},

our aim is to show that the sub-matrices P_{AB} and P_{BA} are both zero. This implies that states in A and states in B do not communicate, showing that P is reducible. We can formulate this as a linear programming problem:

\text{Maximise }\sum_{\substack{x\in A,y\in B\\x\in B, y\in A}}p_{xy}\text{ s.t. }\begin{cases}p_{xy}\geq 0&\forall x,y\in I\\p_{x1}+\ldots+p_{xn}=1&\forall x\in I\\\pi_1p_{1y}+\ldots+\pi_np_{ny}=\pi_y&\forall y\in I\end{cases}

It suffices to show that this maximum is 0. Now, we take |A|=i, and assume that 1\leq i\leq n-1, that is, there are a positive number of negative and non-negative components. Noting that the sum of the rows in a stochastic matrix is 1, we may consider instead:

\text{Minimise }\sum_{\substack{x,y\in A\\x,y\in B}}p_{xy}\text{ s.t. }\begin{cases}p_{xy}\geq 0&\forall x,y\in I\\p_{x1}+\ldots+p_{xn}=1&\forall x\in I\\\pi_1p_{1y}+\ldots+\pi_np_{ny}=\pi_y&\forall y\in I\end{cases}

and it will suffice to show that this minimum is n. To do this, we consider instead the dual:

\text{Maximise }\lambda_1+\ldots+\lambda_n+\pi_1\mu_1+\ldots+\pi_n\mu_n,

\text{ s.t. }\lambda_x+\pi_y\mu_x\leq\begin{cases}1&\text{if }x,y\leq i\text{ or }x,y\geq i+1\\ 0&\text{otherwise.}\end{cases}

The objective is certainly bounded by n. And in fact this is attainable, for example by taking:

\lambda_1=\ldots=\lambda_i=1,\quad \lambda_{i+1}=\ldots=\lambda_n=0

\mu_1=\ldots=\mu_i=0,\quad \mu_{i+1}=-\frac{1}{\pi_{i+1}}, \ldots,\mu_n=-\frac{1}{\pi_n}.

Applying strong duality for linear programming problems completes the proof.

How to take a Passport Photograph – Part 1

As seems to be the case whenever you start somewhere new, I’ve needed an almost infinite supply of passport-sized photographs recently. The university, my college, my department and of course the Chinese immigration authorities all wanted a record of my beautiful features. Anyway, as a result of all of this interest, I was in the do-it-yourself photo booth in WHSmiths in Wandsworth the other day getting some more. The first attempt looked fine, but the machine offered me the possibility of trying again, up to twice if I wanted. This seemed like a win-win situation, so I said yes, not realising that the one I already had would not be kept ‘in the bag’. The second attempt looked somewhat startled, a pose that runs in my family, but not wanting to risk the possibility of a disastrous third attempt (and the financial penalty of having to do the whole operation again) I confirmed that I was happy and made do with the result. Naturally, the question that struck me: what is the optimal strategy for such a situation? (Assuming that, unlike me, you knew the rules from the beginning.)

Mathematical model and choices

Let’s formulate this mathematically. Suppose there are n possible trials, corresponding to iid random variables X_1,\ldots,X_n. (Note that this assumes that your ‘performance’ does not improve or otherwise change during the process. Perhaps not a reasonable assumption in some contexts?) After trials X_1,\ldots,X_k have been observed, you have to choose whether to accept the value X_k as your ‘final answer’, or whether to continue.

The first key decision is: what distribution should the X_is have? Since in the original problem there isn’t a natural metric for quality, let’s assume that the X_is represent some well-defined quantitative utility, distributed as a uniform [0,1] random variable. Perhaps a normal random variable might be a more realistic model, but I can solve it in this case, so let’s stick to this for now. In addition, for the sake of making the eventual answer more simple, let’s say that 0 is the best quality and 1 is the worst. That is, we are looking for a strategy that stops the process at a time T so as to minimise \mathbb{E}X_T.

Finding an optimal strategy

The key observation is the following. In words, if we reject X_1, we can forget about its value as that is independent of \{X_2,\ldots,X_n\} which is now all that remains to base future judgments on. We return to the original problem with one fewer trial. In more mathematical notation, conditional on \{T\neq 1\}, T is independent of X_1.

The following argument assumes that an optimal strategy exists, which is not ideal, but can easily be justified. For now though, we proceed relatively informally by induction on n.

Let S be the stopping time for the optimal strategy on X_2,\ldots,X_n which we assume exists by induction. It is ‘obvious’ that the optimal strategy T for X_1,\ldots,X_n should be the following:

  • T=1 iff X_1<a(n), where a(n) is a deterministic quantity depending only on n.
  • Conditional on T\neq 1, take T=S.

From this alone, we can calculate \mathbb{E}X_T.

\mathbb{E}X_T=\mathbb{E}X_S1_{T\neq 1}+\mathbb{E}X_1 1_{T=1}

= (1-\mathbb{P}(T=1))\mathbb{E}X_S+\mathbb{E}X_1 1_{X_1<a(n)}

= (1-a(n))\mathbb{E}X_S +\frac12 a(n)^2.

This is minimised precisely when a(n)=\mathbb{E}X_S. We conclude that the optimal strategy, as of course we might well expect, is to take T=1 precisely if X_1 is less than the expected result of applying the optimal strategy to the remaining random variables.

By extension, we have a(n+1)=\mathbb{E}X_T, and so

a(n+1)=a(n)\left[1-\frac{a(n)}{2}\right]. (*)

The first few values are:

a(1)=1,\quad a(2)=\frac12,\quad a(3)=\frac38,\quad a(4)=\frac{39}{128},\quad\ldots
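Iterating (*) numerically takes a couple of lines (a sketch in Python, standing in for the MATLAB run mentioned below), and already hints at the decay rate discussed in the next section:

```python
a = 1.0
for n in range(1, 21):
    print(f"n={n:2d}   a(n)={a:.6f}   2/n={2 / n:.6f}")
    a *= 1 - a / 2   # the recursion (*): a(n+1) = a(n)[1 - a(n)/2]
```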

Behaviour of a(n)

The first question is: as n grows large, does a(n)\rightarrow 0? Well this isn’t too hard: the recursive definition (*) confirms that the sequence a(1),a(2),\ldots is (strictly) decreasing and bounded below by 0, and so has a limit, which must be a fixed point of the equation (*). The only such fixed point is 0.

The second question is: what is the asymptotic behaviour of a(n) for large n? A quick run on MATLAB, or examination of the equation (*) suggests that

a(n)\approx \frac{2}{n}

should describe the behaviour well for large n. My basic attempts to verify this were initially unsuccessful, but I felt fairly sure that this should be true in some metric sense because of the following highly non-rigorous but nonetheless convincing idea.

Claim: a_n=\frac{2}{n}+O(n^{-3}) satisfies (*).

Why? Well, then:

\frac{2}{n+1}-\left[(\frac{2}{n}+O(n^{-3}))-\frac12(\frac{2}{n}+O(n^{-3}))^2\right]

=\frac{2}{n+1}-\frac{2}{n}+\frac{2}{n^2}+O(n^{-3})

=\frac{2}{n^2(n+1)}+O(n^{-3})=O(n^{-3}).

This proves the claim, but none of the = signs are really especially meaningful here. Perhaps there is a really slick way to tie this up that I’ve missed? In any case, I will save my own slightly involved method for a new post.

Loss Networks and Erlang’s Fixed Point

Loss Networks

In Erlang’s telephone line model discussed in the previous post, we considered users competing for a finite set of resources. When insufficient resources are available, the call is lost. A loss network generalises this situation to more complicated resource configurations. We think of links 1, …, J, each with some integer capacity c_j. Each incoming call requires 1 unit of capacity on each link in some subset of the set of links, and lasts for a time distributed as an exponential random variable with parameter 1, independent of everything else in the model. We call this subset the route, and denote by A_{jr} the incidence matrix of links on routes. Calls arrive as PP(\nu_r) independently for each route; no queueing occurs – a call is lost if some required link is operating at full capacity. We call the probability of this event L_r, the loss probability. Observe that (n_r), the number of calls on each route r, is a Markov chain on the truncated space \{An\leq c\}.

By checking the DBEs, it is clear that an ED for this Markov chain is proportional to the ED for the MC without the capacity constraint, with state-space restricted to the truncated space. But without capacity constraints, the system is a linear migration process, for which we discovered the form of the ED in the previous section. If we write H(c)=\mathbb{P}(An\leq c) in the linear migration process, we can compute the acceptance probability for the finite capacity system as:

1-L_r=\frac{H(c-Ae_r)}{H(c)}

Approximating Blocking Probabilities

We want to calculate B_j, the equilibrium blocking probability, that a given link j is full. We have two methods: firstly, to find the distribution for (n_r) with maximum probability, for which the blocking probabilities appear as shadow prices. And secondly, to make a reasonable approximation about blocking independence, and solve explicitly. We want to show that these methods give the same answers.

To maximise the probability \pi(n)\propto \prod_r \frac{\nu_r^{n_r}}{n_r!} on \{An\leq c\}, we take logs and maximise using Stirling’s approximation, which is reasonable as we are implicitly working under a regime where the throughput tends to infinity while preserving ratios.

The primal problem is

\max\quad \sum_r(x_r\log \nu_r-x_r\log x_r+x_r),\quad\text{s.t. }Ax\leq c

which, introducing slack variables z=c-Ax\geq 0 for the capacity constraints, has Lagrangian

L(x,y,z)=\sum_r x_r+\sum_r x_r(\log \nu_r-\log x_r-\sum_j y_jA_{jr})+\sum_j y_jc_j-\sum_j y_jz_j

We observe that complementary slackness here has the form y.z=0, and remember that by Strong Duality, which applies here because everything relevant is convex, this equality holds at the primal optimum. Differentiating the Lagrangian at the optimum allows us to specify the optimal x in terms of y:

\bar{x}_r=\nu_r e^{-\sum y_jA_{jr}}

The dual problem is then to minimise

\min\quad \sum_r \nu_re^{-\sum_jy_jA_{jr}}+\sum_j y_jc_j

At this point, we make the suggestive substitution e^{-y_j}=1-B_j, observing that this gives B non-negative by default since y is non-negative. After further work, we will deduce that these B do indeed have a sensible interpretation as blocking probabilities, but it should be stressed that this is in no way obvious yet. Now complementary slackness asserts:

\sum_rA_{jr}\nu_r\prod_i(1-B_i)^{A_{ir}}\left\{\begin{array}{l l}=c_j& \quad B_j>0\\ \leq c_j & \quad B_j=0\\ \end{array} \right.

Note that the primal objective function is strictly convex so \bar{x} as discussed is the unique optimum. The dual objective is strictly convex in yA, so if A has full rank J, this induces a unique optimum in terms of y. We assume A is full rank (since for example we can perturb slightly) and that there is no degeneracy in the blocking.

Now we consider a sequence of networks with proportionally increasing arrival rates and capacities.

Braess’s Paradox and Traffic Networks

For a traffic network, we have routes and links as before, only here we specify the delay on a link as a function of the flow along that link D_j(y_j), where y_j=\sum_r A_{jr}x_r as before. Then a natural notion of an equilibrium is a distribution of flows such that no customer stands to gain by changing route. Formally, take S the set of source-destination pairs, H_{sr} the incidence matrix of pairs s and routes r, and S(r) the routes between the same source-destination pair as r.

Then define a Wardrop equilibrium to be a flow vector (x_r,r\in R) such that for any x_r>0,

\sum_j D_j(y_j)A_{jr}=\min_{r'\in S(r)}\sum_j D_j(y_j)A_{jr'}

that is, the delay on this route is less than or equal to the delay on any route between that pair of vertices.

A natural assumption would be that D_j(y_j) is increasing in y_j for each j, and if we further assume that each D_j is continuously differentiable, then we can prove that a Wardrop Equilibrium exists.

Why? Consider the optimisation problem

\min\quad\sum_j \int_0^{y_j}D_j(u)du,\quad x\geq 0,Hx=f,Ax=y

where f is the flow requirement. The feasible region is convex and the objective function is convex differentiable so the optimum exists. The Lagrangian is

L(x,y,\lambda,\mu)=\sum_j \int_0^{y_j}D_j(u)du+\lambda\cdot (f-Hx)-\mu\cdot(y-Ax)

and at the minimum

0=D_j(y_j)-\mu_j,\quad 0=-\lambda_{S(r)}+\sum_j \mu_jA_{jr}

So \mu_j is the delay on link j, and then

\lambda_{S(r)}\left\{\begin{array}{l l}=\sum_j\mu_jA_{jr}& \quad x_r>0\\ \leq \sum_j\mu_jA_{jr} & \quad x_r=0\\ \end{array} \right.

So we can interpret \lambda_{S(r)} as the minimum delay for this route, and so a solution to this problem is a Wardrop Equilibrium. If the D_j are strictly increasing, then we have uniqueness in (y_j) from the optimisation problem, but not in (x_r). For example, you could have two essentially identical routes from A to B, and from B to C, and it then doesn’t matter how you concatenate them to form a flow from A to C.

Note the counterintuitive Braess’s Paradox, where adding a link increases the delay at the Wardrop Equilibrium. The canonical example can be seen here, where the addition of the road across causes the delay to increase from 83 to 92. And indeed such effects have been observed in ‘real life’ as well.
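Since the linked diagram is not reproduced here, the following sketch works through the usual textbook instance instead (the numbers are illustrative assumptions, not those of the linked example): 4000 drivers travel from S to E via A or B, each route consists of one congestible link with delay y/100 and one fixed link with delay 45, and the added road is a zero-delay shortcut from A to B.

```python
drivers = 4000.0

# Without the shortcut, symmetry splits the flow evenly at equilibrium.
x = drivers / 2
delay_before = x / 100 + 45                       # 20 + 45 = 65 on each route

# With the zero-delay shortcut, S->A->B->E beats both old routes for every
# individual driver, so at the Wardrop equilibrium all the flow uses it.
delay_after = drivers / 100 + 0 + drivers / 100   # 40 + 0 + 40 = 80

print(delay_before, delay_after)                  # 65.0 80.0: delay increases
```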

Note that one explanation for why this seems counter-intuitive is that the objective function isn’t necessarily the most natural one. For a Wardrop equilibrium, the agents are acting entirely selfishly given present information. From the point of view of society, it might make more sense to minimise the average delay \min \sum_jy_jD_j(y_j). In this case the Lagrangian gives

0=D_j(y_j)+y_jD_j'(y_j)-\mu_j,\quad 0=-\lambda_{S(r)}+\sum_j\mu_jA_{jr}

So if add a congestion charge of y_jD_j'(y_j) to the delay function, the Wardrop Equilibrium becomes equal to the societal optimal. Note that this is NOT the societal optimal for (delay + charge). However, we have seen that a toll will enable selfish behaviour to produce a societal optimum. This is an example of mechanism design, which can be considered a branch of game theory, where the aim is to design systems so that agents work in a societally optimal manner.