The Secretary Problem

This post explores in a direction rather different to my research and recent problems I’ve been talking about. We discuss a problem of optimal stopping theory. In less politically correct times, there were various other names for this including (according to Wikipedia) the marriage problem, the Sultan’s dowry problem and the fussy suitor problem. A much more appealing situation might be someone running an auction or selling a house, who wants the maximum bid, but has to accept or reject bids immediately. In any case, we shall quickly reduce it to a context-free setting.

The problem is as follows. A manager needs to hire a secretary and has n applicants, who are interviewed one at a time. I can’t think of a good reason why this should be the case in this context, but the manager’s only concern is to maximise the probability of picking the best of the n candidates. This would be easy if the choice were made at the end, but in fact the decision whether to accept or reject an applicant has to be made straight after the interview. What is the optimal strategy?

It must be clarified what is going on. In a practical setting, there would be some underlying numerical attribute which could be compared, and so the problem would reduce to what I’ve previously called the Passport Photograph problem. In this setting, after the kth candidate, we only know the relative ranks of the first k candidates to be seen. This in turn massively reduces the possibilities for a strategy. The strategy must be of the form:

Accept the kth candidate if they are the best candidate yet seen with some probability that is a function of k and n. Reject otherwise. In particular, this probability cannot depend on the history of the process – that is of basically no relevance at all since we only know the relative ranks at any given time.

There are two ideas that I want to justify informally for now. It seems reasonable to assume that this probability will either be 0 or 1 for each k and n. For example, one could work backwards in k, from n back to 1 working out the probabilities, and proving at each stage inductively that there is no advantage to a mixed strategy.

Secondly, suppose we decide we will accept the kth candidate if they are the best so far. Well if they are not the best, we pass to the next candidate. I claim that if we are using an optimal strategy, we should also accept the (k+1)st candidate if they are the best so far. This property is called monotonicity for the algorithm. In many senses this is the meat of the problem. If we can prove this, then everything else will be relatively straightforward. For now though, we assume it is true. This means that our optimal strategy for n is precisely defined by a threshold value of k. So we rewrite the strategy as follows:

We observe the first (k-1) candidates. After that, as soon as a candidate is the best so far, we accept them.

Note that this allows for the possibility that we do not choose anyone, so if this feels like a problem, we could always demand we choose the last candidate, even if we know they are not optimal. The question that remains is: what value should k take? Let’s make life easy and take n to be large, so we can assume k=cn for some constant c.

The probability that the second best was in the first cn, but the first was not is c(1-c). On that event, this strategy will definitely give us the best candidate. The probability that the third best was in the first cn, but the first and second were not is c(1-c)^2. Our strategy gives us the best candidate overall if and only if the first candidate appears before the second candidate in what remains of the process. Since we are just sampling from a random permutation, these are equally likely, so the probability that we get the best candidate is

\frac{1}{2} c(1-c)^2.

Continuing, the probability that the best candidate in the first cn is the fourth best overall, and the strategy gives us the best candidate is:


So to find the optimal c, we have to maximise the sum

\sum_{k\geq 1}\frac{1}{k}c(1-c)^k

=-c\log c.

To find the maximum, we set the derivative equal to zero. Solving, we get c=1/e.

It turns out the most natural way to generalise this problem is to consider n independent random variables taking the values 0 and 1. The aim is to find a strategy that maximises the probability of picking the final 1. An informal analysis proceeds much as in the secretary problem. In this setting, we will work with fixed finite n rather than in some limit. Let us call the {0,1}-valued RVs, I_1,I_2,\ldots,I_n, and set

p_j=\mathbb{P}(I_j=1),\quad q_j=1-p_j,\quad r_j=\frac{p_j}{q_j},

in keeping with Bruss’s original paper dealing with this so-called ‘Odds problem’. Then, as before, we want to find the threshold for k. We should remark that the argument given below is shorter than Bruss’s proof, which uses generating functions, but again we will not address the matter of proving that the optimal strategy should have this threshold form.

First we explain why the secretary problem is a special case of this. Observe that, given no information about the first (k-1) candidates apart from their relative ranks, the probability that the kth candidate is the best so far is independent of the history of the process, and is equal to 1/k. So taking p_k=1/k in the odds problem precisely replicates the secretary problem, since the best candidate is, by construction, the final candidate who had the property of being the best so far.

Now we work out how to calculate the threshold for k. We could define k by:

k=\arg\max_{1\leq j \leq n}\left\{\mathbb{P}(I_k+\ldots+I_n=1)\right\}.

By summing over the possibilities for which of the I_j’s is 1, we get

\mathbb{P}(I_k+\ldots+I_n)=q_kq_{k+1}\ldots q_n\big[\frac{p_k}{q_k}+\ldots+\frac{p_n}{q_n}\big]=q_k\ldots q_n[r_k+\ldots+r_n].

So we ask, for which values of k is:

q_k\ldots q_n[r_k+\ldots+r_n]\geq q_{k+1}\ldots q_n[r_{k+1}+\ldots+r_n] ?

This is equivalent to

q_k\ldots q_n \cdot r_k \geq (1-q_k)q_{k+1}\ldots d_n[r_{k+1}+\ldots+r_n],

and we recall that (1-q_k)=p_k=q_kr_k, hence

\iff q_{k+1}\ldots q_n\ge q_{k+1}\ldots q_n[r_{k+1}+\ldots+r_n],

\iff 1\geq r_{k+1}+\ldots+r_n.

So we conclude that the correct optimal value for the threshold k is the largest value of j such that

r_j+\ldots+r_n < 1.

We can check that this agrees with our answer for the secretary problem since

\frac{1}{cn}+\frac{1}{cn+1}+\ldots+\frac{1}{n}\approx \int_{cn}^n \frac{dx}{x}=-\log c

which is approximately 1 precisely when c is roughly 1/e.

Finally a word on generalisations. Dendievel’s paper which appeared on arXiv yesterday considers a problem of Richard Weber where the RVs instead take values {-1,0,1}, and you want a strategy that successfully selects either the last time the value -1 appears or the last time +1 appears. The case where the probabilities of the event 1 and the event -1 are 1) equal and 2) constant is relatively straightforward by backwards induction as here, and the author deals with the two natural generalisations by removing 1) and 2) in turn. It seems natural that such a method should adapt for RVs taking any finite number of values.


Bruss – Sum the odds to one and stop (2000)

Bruss – A note on bounds for the odds theorem of optional stopping (2003)

Dendievel – Weber’s optimal stopping problem and generalisations (2013)


Random Mappings for Cycle Deletion

In previous posts here and here, I’ve talked about attempts to describe a cycle deleting process. We amend the dynamics of the standard random graph process by demanding that whenever a cycle is formed in the graph we delete all the edges that lie on the cycle. The aim of this is to prevent the system growing giant components, and perhaps give a system that displays the characteristics of self-organised criticality. In the posts linked to, we discuss the difficulties caused by the fact that the tree structure of components in such a process is not necessarily uniform.

Today we look in the opposite direction. It gives a perfectly reasonable model to take a multiplicative coalescent with quadratic fragmentation (this corresponds to cycle deletion, since there are O(n^2) edges which would give a cycle if added to a tree on n vertices) and a fragmentation kernel corresponding to adding an extra edge to a uniform spanning tree on n vertices then deleting the edges of the unique cycle. The focus of the rest of this post, we consider this fragmentation mechanism, in particular thinking about how we would sample from it most practically. Not least, without going through Prufer codes or some other clever machinery, it is not trivial to sample a uniform spanning tree.

First, we count the number of unicyclic graphs on n labelled vertices. If we know that the vertices on the cycle are v_1,\ldots,v_k, then the number of cycles with an identified edge is

u_1=1,\quad u_k=\frac{k!}{2},\, k\ge 2.

If we know that the tree coming off the cycle from vertex v_i has size m, say, then each of the possible rooted labelled trees with size m is equally likely. So taking w_j=j^{j-1}, the number of rooted trees on j labelled vertices, we get B_n(u_\bullet,w_\bullet) for the number of such unicyclic graphs on [n]. Recall B_n is the nth Bell polynomial, which gives the size of a compound combinatorial structure, where we have some structure on blocks and some other structure within blocks. Then the random partition of [n] given by the tree sizes has the distribution \text{Gibbs}_n(u_\bullet,w_\bullet).

Consider now a related object, the so-called random mapping digraph. What follows is taken from Chapter 9 of Combinatorial Stochastic Processes. We can view any mapping M_n:[n]\rightarrow[n] as a digraph where every vertex has out-degree 1. Each such digraph contains a collection of directed cycles, supported on those elements x for which M_n^k(x)=x for some k. Such an element x is called a cyclic point. Each cyclic point can be viewed as the root of a labelled tree.

In an identical manner to the unicyclic graph, the sizes of these directed trees in the digraph decomposition of a uniform random mapping is distributed as \text{Gibbs}_n(\bullet !,w_\bullet). So this is exactly the same as the cycle deletion kernel, apart from in the probability that the partition has precisely one block. In practice, for large n, the probability of this event is very small in both cases. And if we wanted to sample the cycle deletion kernel exactly, we could choose the trivial partition with some probability p, and otherwise sample from the random mapping kernel, where p is chosen such that

p+\frac{1-p}{B_n(\bullet !, w_\bullet)}=\frac{1}{B_n(u_\bullet,w_\bullet)}.

At least we know from the initial definition of a random mapping, that B_n(\bullet !,w_\bullet)=n^n. The number of unicyclic graphs with an identified edge is less clear. It turns out that the partition induced by the random mapping has a nice limit, after rescaling, as the lengths of excursions away from 0 in the standard Brownian bridge on [0,1].

The time for a fuller discussion of this sort of phenomenon is in the context of Poisson-Dirichlet distributions, as the above exchangeable partition turns out to be PD(1/2,1/2). However, for now we remark that the jumps of a subordinator give a partition after rescaling. The case of a stable subordinator is particularly convenient, as calculations are made easier by the Levy-Khintchine formula.

A notable example is the stable-1/2 subordinator, which can be realised as the inverse of the local time process at zero of a Brownian motion. The jumps of this process are then the excursion lengths of the original Brownian motion. A calculation involving the tail of the w_j’s indicates that 1/2 is the correct parameter for a subordinator to describe the random mappings. Note that the number of blocks in the partition corresponds to the local time at zero of the Brownian motion. (This is certainly not obvious, but it should at least be intuitively clear why a larger local time roughly indicates more excursions which indicates more blocks.)

So it turns out, after checking some of the technicalities, that it will suffice to show that the rescaled number of blocks in the random mapping partition \frac{|\Pi_n|}{\sqrt{n}} converges to the Raleigh density, which is a size-biased Normal random variable (hence effectively first conditioned to be positive), and which also is the distribution of the local time of the standard Brownian bridge.

After that very approximate description, we conclude by showing that the distribution of the number of blocks does indeed converge as we require. Recall Cayley’s formula kn^{n-k-1} for the number of labelled forests on [n] with a specified set of k roots. We also need to know how many labelled forests there are with any set of roots. Suppose we introduce an extra vertex, labelled 0, and connect it only to the roots of a rooted labelled forest on [n]. This gives a bijection between unlabelled trees on {0,1,…,n} and labelled forests with a specified set of roots on [n]. So we can use Cayley’s original formula to conclude there are (n+1)^{n-1} such forests. We can do a quick sanity check that these are the same, which is equivalent to showing

\sum_{k=1}^n k n^{-k-1}\binom{n}{k}=\frac{1}{n}(1+\frac{1}{n})^{n-1}.

This odd way of writing it is well-motivated. The form of the LHS is reminiscent of a generating function, and the additional k suggests taking a derivative. Indeed, the LHS is the derivative


evaluated at \frac{1}{n}. This is clearly the same as the RHS.

That said, having established that the random mapping partition is essentially the same, it is computationally more convenient to consider that instead. By the digraph analogy, we again need to count forests with k roots on n vertices, and multiply by the number of permutations of the roots. This gives:

\mathbb{P}(|\Pi_n|=k)=\frac{kn^{n-k-1}\cdot k! \binom{n}{k}}{n^n}=\frac{k}{n}\prod_{i=1}^{k-1}\left(1-\frac{i}{n}\right).

Now we can consider the limit. Being a bit casual with notation, we get:

\lim \mathbb{P}(\frac{|\Pi_n|}{\sqrt{n}}\in dl)\approx \sqrt{n}dl \mathbb{P}(|\Pi_n|=l\sqrt{n}).

Since the Raleigh distribution has density l\exp(-\frac12 l^2)dl, it suffices for this informal verification to check that

\prod_{i=1}^{l\sqrt{n}}(1-\frac{i}{n})\approx \exp(-\frac12 l^2). (*)

We take logs, so the LHS becomes:


If we view this as a function of l and differentiate, we get

d(LHS)=\sqrt{n}dl \log (1-\frac{l}{\sqrt{n}})\approx \sqrt{n}dl \left[-\frac{l}{\sqrt{n}}-\frac{l^2}{2n}\right]\approx -ldl.

When l is zero, the LHS should be zero, so we can obtain the desired result (*) by integrating then taking an exponential.