# Infinite Games

The standard example of a two-person game is the Prisoners’ Dilemma. There are various ways of setting up the story for the model. Essentially there are two prisoners being questioned simultaneously but separately by detectives following some crime which they committed together. They have to make a choice about whether to confess the details or not. If they both keep quiet, they will likely receive short sentences because of a lack of evidence. If they both confess, then obviously they each receive a full sentence. If only one prisoner confesses and provides evidence against the other, then the other receives an even harsher sentence (because he proclaimed innocence) while the confessor gets let off with only a caution as a reward. Generally we call keeping quiet ‘cooperating’ and informing on the other prisoner ‘defection’. Some potential numerical outcomes of the various pairs of strategies can be summarised as:

CC – (2,2), CD – (0,3)

DC – (3,0), DD – (1,1)

From now on, we forget all about the underlying story and just think about this as a standard two-person game, as it provides plenty of interest by itself.

In words, if both agents cooperate, they get a better result than if they both defect. However, both cooperating is not a Nash equilibrium, because each agent would do better if they unilaterally chose to defect. Indeed, the strategy of ‘cooperate’ is strongly dominated by the strategy of ‘defect’. That is, if you cooperate you will do worse than if you defect, irrespective of how the other agent acts. A key notion of elementary game theory is that we should (iteratively if necessary) reject weakly dominated strategies, as suggested by common sense. Why would you choose to do something if there is an alternative which is always at least as good and sometimes strictly better? So the only rational outcome for the Prisoners’ Dilemma is for both agents to defect.

This is an example where the Nash equilibrium is not Pareto optimal. A pair of pure strategies is Pareto optimal unless it is possible to find a different pair where both agents do better. For example, both cooperating is strictly better for each agent than both defecting in this case. The question we might want to ask is: will repeating the game encourage mutual cooperation?

This idea is tempting. Perhaps cooperating in early rounds will encourage the opponent to cooperate in future rounds, and thus a scheme of mutual cooperation will develop. Note that each agent’s aim is only to maximise his own return, rather than his return relative to the other agent. He doesn’t mind if his opponent does better so long as he does well.

Suppose that the game is played twice. The key observation is that we should consider the second game first. Irrespective of what has happened in the first game, it is still a Nash equilibrium for both agents to defect. Recall that defection strongly dominated cooperation, so what has happened before has no interest to either agent. Whatever happened in the past, and whatever the other agent does next, it is better for him to defect. Now we’ve established that they will both always defect in the second round, it is clear that there behaviour in the first round can have no consequences apart from the immediate one, and so again they should both defect in the first round too. This procedure is called backwards induction, and applies equally well to situations where the outcome of the first game might affect the range of options or outcomes available in the second game.

In particular, it is clear that however many times we play the Prisoners’ Dilemma, so long as the number of repeats is finite, the optimal strategy will always be to defect every time. But what if we repeat an infinite number of times?

The first thing to be sorted out is how to define the total reward for a strategy. If we just add everything up, all strategies give infinite reward which is not very useful for comparison. One option might be to take the limit of the mean reward ie

$\lim_{n\rightarrow\infty}\frac{\sum_{m=1}^n R_m}{n}$.

The problem with this is that its value is only dependent on eventual behaviour, in the technical sense of tail events and so forth. More simply, the outcome of the first game has no effect on the value of this expression. Nor the second, nor the third and so on. It feels mathematically impossible, and indeed pointless, to choose a strategy so as to optimise a function which is in fact independent of the choice of strategy.

Instead we introduce a discount factor $\delta<1$, and instead seek to maximise

$\sum_{m=1}^\infty \delta^m R_m$.

This feels unsatisfactory at some level, as we might want all of the games to be of equal numerical significance. However, we could partially resolve our fears by taking a $\delta\uparrow 1$ limit at the end of the analysis. More directly, in any genuine application, a situation cannot be repeated indefinitely, and so it makes sense to allow the process to terminate after each go with probability $1-\delta$, with $\delta$ chosen suitably small. Now, since the payoffs for each individual round are bounded, the total payoff is also bounded.

As with other time-homogeneous infinite horizon problems, the great advantage of this formulation is that the same game remains after one round has taken place as at the beginning. This motivates considering stationary strategies, where the rule for choosing an action is essentially the same for every round. That is, it is dependent on previous outcomes, but in a reasonably straightforward way. This definition is quite informal, but it seems that little is to be gained from making it too much tighter right now.

Some notable stationary strategies are: C, cooperate every time; D, defect every time; G, for grim, cooperate until the other agent defects, after which you defect forever; T, for tit-for-tat, initially cooperate, then do whatever the other agent did in the previous round. There is a vast body of literature, both in formal papers and in the so-called blogosphere (the reasons for this might well be an interesting subject for a separate post), concerning the relative merits of these strategies. Tit-for-tat in particular attracts much attention as a dominant strategy in a vast range of contexts. Among the features that can easily be incorporated to the model is a population evolution, where more successful strategies are more likely to propagate into subsequent generations, and noise, where with some small probability, on each turn an agent does the opposite of what they planned to do. In particular, computer simulation is easily applicable to all of these strategies, and there have even been competitions where teams submit strategies to ‘compete’ over 100,000 iterations of the Prisoners’ dilemma.

In general, we want to find Nash equilibria for the infinite game. In particular, as usual we want to reduce the number of candidates by restricting attention to subgame perfect Nash equilibria. (There are several criticisms to be made of this approach, but this is certainly a matter to be discussed elsewhere.) We say that a pair of strategies satisfy the one-stage deviation condition if neither player can improve their payoff by deviating from their strategy unilaterally for one round, then returning to that strategy. Perhaps the most interesting theorem of the book is the following: subgame perfect Nash equilibria and strategies satisfying the one-stage deviation condition are equivalent. The proof is only substantially more than definition chasing.

This enables us to show, for example, that the adjusted grim strategy where each player starts by always cooperating, until either player defects, at which point both defect forever, is a subgame perfect Nash equilibrium. (The apparently inconsequential change in definition from that of a grim strategy is required for subgame perfection.) We conclude that there is a natural framework in which mutual cooperation can exist as a stable solution to the infinite Prisoners’ Dilemma, as we had initially said we would like.

# Philosophy of Utility Maximisation

Over the weekend, I skimmed through the Springer Undergraduate Mathematics Series book ‘Game Theory‘ by James Webb. It’s a book that has been sitting on my bookcase for a while, and a topic which sits at only one step removed from what I am particularly interested in. But somehow, possibly because of its almost complete absence from the Cambridge maths course, I had never got round to reading the book or otherwise finding out much about the field. This has now changed, so I am writing three short remarks on some aspects I thought were of particular interest.

——–

Consider the following. I offer you the following bet. I toss a fair coin: if it comes up heads, you have to pay me £1, but if it comes up tails, I will give you £3. Would you take the bet? Obviously you are the only person who can actually answer that, but I’d expect the only obstacle to you taking the bet might be indifference. Why? Well your expected profit from this bet is £1.50, which is positive, and so it certainly seems that you are getting a better deal.

Now let’s up the stakes. If the coin comes up heads, you have to pay me £1,000,000, but if it comes up tails, you get £3m. Do you still take the bet? Well unless you are in a very fortunate financial state, you shouldn’t be taking this bet because you are not in a position to honour your commitment in the event (with probably 50%) that you lose.

But what about an intermediate bet? Suppose we change the stakes to £100 and £300. Or to £1,000 and £3,000. Would you take the bet? Without getting too bogged down in what the answer is, and what personal circumstances might lead to each possibility, suffice it to say this: each of the adjusted bets still offers a substantial positive expected profit, but the real risk of losing a large sum of money makes one less inclined to take up the offer.

Traditionally, we can describe this apparent logical problem like this. How much we value sums of money is not always proportional to the sum itself. Although some mathematicians might disagree, we don’t value numbers intrinsically, instead we value the effects that sums of money can have on our lives. More concretely, winning the lottery is awesome, and winning the lottery twice is even more awesome. However, it probably isn’t twice as good as winning it once, since many of the best things about winning the lottery, like retiring and paying off your mortgage or your student loan, can only be done once.

Mathematically, this is best represented by a utility function. This is a map from the set of outcomes of the bet or experiment to the reals, specifying the relative value the agent assigns to different outcomes (monetary or otherwise). We need the utility function to satisfy a few obvious rules. If it is a function of the profit or return, we expect it to be strictly increasing. That is, everyone places a higher utility on a larger sum of money. If we are considering utility as a function of (possibly non-numerical) outcomes, eg elements in a general probability space, then we need it to be transitive. That is, it describes a partial ordering on the set. We would also expect that all outcomes can be compared, so it is in fact a total ordering. The agent’s aim is then instead to maximise their expected utility $\mathbb{E}U(x)$.

There are a couple of questions that are natural to ask:

1) Can all factors influencing a decision really be expressed numerically?

2) Is maximising some utility function actually the right thing to do?

We ask question 1) because we might think that some factors that influence our decisions such as fear, superstition or some code of morality cannot be quantified explicitly. We ask question 2) because at first it doesn’t seem as if maximising expected utility is any different to maximising expected monetary value.

This needs some clarification. The best way to do this is through the notion of risk aversion. This is an alternative way to think about the original problem with the £1,000 vs £3,000 bet. The risk or uncertainty of a random variable corresponding to a financial return is described by the variance. In general, we assume that most investors are risk-averse, that is, they prefer a guaranteed return of £10 rather than a non-trivial random variable with expectation equal to £10. So instead of maximising expectation, we maximise

$\mathbb{E}X-\alpha \mathrm{Var}(X)$,

where $\alpha$ is some positive constant. Of course, if an agent for some reason prefers risky processes, then choose $\alpha<0$.

So the question we are asking is: might this be happening for utility maximisation as well? More precisely, perhaps we need to consider risk-aversion even for utilities?

The way to resolve the problem is to reverse how we are looking at the situation. Tempting though it is to define the utility as a relative value assigned to outcomes, this forces us into a position where we have to come up with some method for assigning these in specific cases, most of which suggest problems of some kind. Although in practice the previous definition is fine and easy to explain, what we actually want is the following. When we make a choice, we must be maximising something. Define the utility to be the function such that maximising the expectation of this function corresponds to the agent’s decision process.

This is much more satisfactory, provided such a function is actually well-defined. Von Neumann and Morgenstern showed that it is, provided the agent’s preferences have a little bit more structure than discussed above. The further conditions required are continuity and independence, and are completely natural, but I don’t want to define them now because I haven’t done much of the necessary notation here.

In many ways, this is much more satisfying. We don’t always want precise enumerations of all contributory factors, but it is reassuring to know that subject to some entirely reasonable conditions, there is some structure under which we are acting optimally whenever we make a decision.

# Invariant Distributions of Markov Chains

My lecture course in Linyi was all about Markov chains, and we spent much of the final two sessions discussing the properties of invariant distributions. I was not surprised, however, that none of the class chose this topic as the subject for a presentation to give after the end of the teaching week. One of the main problems is that so many rather similar properties are introduced roughly simultaneously. As we did in the class, I thought it was worth making some sort of executive summary, as a mixture of revision and interest.

Definition: $\pi$ is an invariant measure if $\pi P=\pi$. If in addition $\sum_{i\in I}\pi_i=1$, then we say it is an invariant distribution. Of course, if I is finite, then any invariant measure can be normalised to give an invariant distribution.

The key initial questions are about existence and uniqueness. First, if there are multiple communicating classes, then an invariant measure (resp. distribution) is a linear (resp. affine) combination of the invariant measures / distributions on each (closed) class. So we restrict attention to irreducible Markov chains.

In the finite case, P is a stochastic matrix so has a column eigenvector with eigenvalue 1, namely the vector with all entries equal to 1. Thus, by reference to general theory in linear algebra, P has a row eigenvector $\pi$ with eigenvalue 1. To paraphrase a remark made by one of my students, what is not clear is that this should be a measure. Demonstrating that this is true is rather non-trivial I think, normally done by reference to the rather more general Perron-Frobenius theorem, though on the flight home I came up with a short argument using Lagrangian duality. For now, we accept existence in the finite case, and note that we typically show existence by showing that the vector of expected time spent in each state between successive visits to a fixed reference state satisfies the properties of an invariant measure.

This is a good moment to note that recurrence is not a necessary condition for the existence of an invariant measure. For example, the random walk on $\mathbb{Z}^3$ is transient, but the uniform measure is invariant. However, it is not a sufficient condition for the existence of an invariant distribution either. (Of course, an irreducible finite chain is always recurrent, and always has an invariant distribution, so now we are considering only the infinite state space case.) The random walk on $\mathbb{Z}^2$ is recurrent, but the invariant measure is not normalisable.

The property we in fact need is positive recurrence. This says that the expected return time to each point is finite. Again, this is a class property. This is a common requirement in probabilistic arguments: almost surely finite is often not strong enough to show results if the expectation is infinite (see for example the various requirements for the optional stopping theorem). If this holds, then $\pi_i=\frac{1}{\mathbb{E}T_i}$, where $T_i$ is the the return time starting from some $i\in I$.

The final question is ‘Why are we interested?’ One of the best answers is to look at convergence properties. A simple suggestion is this: if we start in equilibrium, then $X_0,X_1,X_2,\ldots$ are all equal in distribution. Note that the dependence structure remains complicated, and much much more interesting than the individual distributions. Next, we observe that a calculation of n-step transition probabilities for a finite chain will typically involve a linear combination of nth powers of eigenvalues. One of the eigenvalues is 1, and the others lie strictly between -1 and 1. We observe in examples that the constant coefficient in $p_{ij}^{(n)}$ is generally a function of j alone, and so $p_{ij}^{(n)}\rightarrow\lambda_j$, some distribution on I. By considering $P^{n+1}=P\cdot P^n$, it is easy to see that if this converges, $(\lambda_j)_{j\in I}$ is an invariant distribution. The classic examples which do not work are

$P=\begin{pmatrix}0&1\\1&0\end{pmatrix}$ and $P=\begin{pmatrix}0&1&0\\ 0&0&1\\1&0&0\end{pmatrix}$,

as then the distribution of $X_n$ is a function of the remainder of n modulo 3 alone. With a little thought, we can give a precise classification of such chains which force you to be in particular proper subsets of the state space at regular times n. Chains without this property are called aperiodic, and we can show that distributions for such chains converge to the equilibrium distribution as $n\rightarrow\infty$.

# Teaching Probability in China

As I’ve mentioned in passing in a few recent posts, which have tended to focus on Markov chains more than might perhaps be expected, I’ve just finished a two-week trip teaching at Linyi University in Shandong province, about 600km south of Beijing. I’ve been describing some of my touristic and social observations of China on my personal blog, but it has been mainly a mathematical adventure, so I thought I would describe some of the main cultural differences in the approach to science here. I was surprised to find these a far larger obstacle than the more obvious language barrier in the classroom.

1) I think many of us were surprised by how different the gender divide was at Linyi to what we are accustomed to in Cambridge. Although of course it varies from year to year, from course to course, and in select cases from lecturer to lecturer, in general there is a significant male majority in lecture halls. In China, amongst my students (who had just finished their third year), the ratio seemed to have been reversed and then further extremised. I had about twelve students in attendance during the middle of the course, and only one was a boy.

2) Perhaps the most clear reason for this reversal of the Western trend became apparent when I asked some of my group what they were planning to do after leaving university. As most of them have just a year left, they all seemed to have a fairly concrete idea of what their plans were, and the answers were perhaps unexpected. One of the most able girls said she wanted to continue with postgraduate mathematics, but all of the rest wanted to become teachers. And not just high school either. In fact, all bar two or three planned to teach middle or even elementary school. I don’t want to get burned by talking about things I don’t much understand, but I understand this is very different to the UK. Here, I do not believe many maths graduates from top universities (in a national context) have ambitions in primary education. I do not know which way round the causality works. Are they encouraged towards a pure science degree as a route into schools, or are teaching jobs viewed the natural job to follow a hard quantitative course? In either case, it sounds like an excellent situation to be in from the point of mathematical education.

3) The role of the lecturer in China evidently carries a great deal more gravitas than comparable positions in England. Clearly his task is to stand at the board and write down the notes, which the audiences copies politely. So far, so like Cambridge Part III. But then it seems that the notion of interacting with the class doesn’t really apply here. Even an utterly uncontentious question such as ‘what is your name, and which year are you in?’ produced a flurry of nervousness, and then each student stood up stiffly in turn to answer. This continued even as the class size reduced, and my manner became more informal, for example starting to explain ideas while perching on a desk in the front row. I guess old habits die hard. It’s a bit of a shame because I feel the purpose of a lecture is not just to transcribe course material, but also to give motivation and a flavour of the mathematical landscape, as well as to inject some enthusiasm into an appreciation of the topic. And it’s very hard to start such ideas in a culture where it is totally unexpected.

4) This was even more evident in some of the research talks. Each member of the Cambridge group and various Linyi staff, postgraduates and masters students gave a talk outlining some aspect of their research interest. The motivation had been to try and recreate a seminar series, at least in environment if not in totality given the expansive range of areas of study represented. I was struck by the fact that what seemed like the entire maths undergraduate community had been forced to attend, but many of the talks were completely inaccessible to a non-specialist audience. A paper on a technical property of spectral theory read out at speaking pace from a powerpoint is not going to get younger students excited about research maths.

5) As suggested by their enthusiasm for PDEs and complicated graph theory construction problems in the research talks, the students seemed to relish calculation over theory. I found it very difficult to get across the idea that it was sometimes better to use the reuslt we had just shown rather than immediately attempting to diagonalise all visible matrices. I think some of the undergraduate teaching mirrors sections of Maths A level in this country, where the emphasis is on deploying solution techniques rather than ideas. For example, the class looked absolutely lost when I tried to use recurrence relations, to the extent that I decided just to start my explanation from the basics. It turned out that I had notated the general solution as $h_i=\alpha \lambda^i+\beta \mu^i$ whereas their teacher had used $h_i=\alpha \lambda^{i-1}+\beta \mu^{i-1}$, and this had caused the major confusion. The problem is compounded by the fact that despite constant encouragement, they feel mortified at the thought of asking a question or seeking a clarification. Again, it seems to be a matter of respect, but of course this spirit can remain in a more enquiring environment.

Overall, it was a very interesting insight into a completely alien mathematical culture. I know which one I prefer, but I think my perspective is richer for having experienced something of mathematics in China. And hopefully any teaching I do in the near future will seem straightforward by comparison, if only for the property of sharing a language with the students…!

# The Poisson Process – A Third Characteristion

There remains the matter of the distribution of the number of people to arrive in a fixed non-infinitissimal time interval. Consider the time interval [0,1], which we divide into n smaller intervals of equal width. As n grows large enough that we know the probability that two arrivals occur in the same interval tends to zero (as this is $\leq no(\frac{1}{n})$), we can consider this as a sequence of iid Bernoulli random variables as before. So

$\mathbb{P}(N_1=k)=\binom{n}{k}(\frac{\lambda}{n})^k(1-\frac{\lambda}{n})^{n-k}$
$\approx \frac{n^k}{k!} \frac{\lambda^k}{n^k}(1-\frac{\lambda}{n})^n\approx \frac{\lambda^k}{k!}e^{-\lambda}.$

We recognise this as belonging to a Poisson (hence the name of the process!) random variable. We can repeat this for a general time interval and obtain $N_t\sim \text{Po}(\lambda t)$.

Note that we implicitly assumed that, in the infinitissimal case at least, behaviour in disjoint intervals was independent. We would hope that this would lift immediately to the large intervals, but it is not immediately obvious how to make this work. This property of independent increments is one of the key definitions of a Levy Process, of which the Poisson process is one of the two canonical examples (the other is Brownian Motion).

As before, if we can show that the implication goes both ways (and for this case it is not hard – letting $t\rightarrow 0$ clearly gives the infinitissimal construction), we can prove results about Poisson random variables with ease, for example $\text{Po}(\lambda)+\text{Po}(\mu)=\text{Po}(\lambda+\mu)$.

This pretty much concludes the construction of the Poisson process. We have three characterisations:
1) $X_n\sim\text{Exp}(\lambda)$ all iid.
2) The infinitissimal construction as before, with independence.
3) The number of arrivals in a time interval of width t $\sim \text{Po}(\lambda t)$. (This is sometimes called a stationary increments property.) Furthermore, we have independent increments.

A formal derivation of the equivalence of these forms is important but technical, and so not really worth attempting here. See James Norris’s book for example for a fuller exposition.

The final remark is that the Poisson Process has the Markov property. Recall that this says that conditional on the present, future behaviour is independent of the past. Without getting into too much detail, we might like to prove this by using the independent increments property. But remember that for a continuous process, it is too much information to keep track of all the distributions at once. It is sufficient to track only the finite marginal distributions, provided the process is cadlag, which the Poisson process is, assuming we deal with the discontinuities in the right way. Alternatively, the exponential random variable is memoryless, a property that can be lifted, albeit with some technical difficulties, to show the Markov property.

# The Poisson Process – Distributions of Holding Times

So, I was planning to conclude my lecture course on Markov Chains at the Cambridge-Linyi summer school with an hour devoted to the Poisson Process. Unfortunately, working through the discrete time material and examples took longer than I’d expected, so we never got round to it. As I’d put a fair bit of work into the preparation, I figured I might as well write it up here instead.

We need a plausible mathematical model for a situation where people or events arrive randomly in continuous time, in a manner that is as close as possible to the notion of iid random variables. In discrete time, a natural candidate would be a set of iid Bernoulli random variables. For example, with probability p a single bus will arrive in an time interval of a minute. With probability 1-p, no bus will arrive. We might have some logistical motivation for why it is not possible that two or more arrive in a given interval, or we could instead choose a more complicated distribution.

One way to proceed would be to specify the distribution of the times between arrivals. These should be independent and identically distributed, at least intuitively. However, although we might be able to give a sensible guess right now, it is not immediately clear what this distribution should be. For now, we merely remark that the arrival times are called $X_1,X_2,\ldots$, and the holding times between arrivals are defined by
$S_1=X_1, S_n=X_n-X_{n-1},n\geq 2$.

In fact the motivating discrete example gives us much of the machinery we will actually need. Recall that when we define probability distributions for continuously-valued random variables we need a different plan of attack than for discrete RVs. Whereas for the discrete case, it is enough to specify the probability of each outcome, for a continuous random variable, we have to specify the probabilities of intervals, and take care that they have the obvious additive and nesting properties that we want. Taking the integral (whether Riemannian or Lebesgue) of a so-called density function is a natural way to do this.

Similarly here, we build up from small time intervals. The first remark is this: it is natural that the expected number of arrivals in the first minute is equal to the expected number of arrivals in the second minute. After all, we are considering the most general process possible. If there are an infinite number of potential arriving agents, then behaviour in the first minute should be independent and equal (in distribution) to behaviour in the second minute. We can naturally extend this idea to a linear relation. If $N_s$ is the number of arrivals in the time [0,s], then we should have $\mathbb{E}N_s=\lambda s$, where $\lambda$ is some constant of proportionality, equal to $\mathbb{E}N_1$.

The key remark is that as $s\rightarrow 0$, $\mathbb{P}(N_s=1)$ becomes small, and $\mathbb{P}(N_s\geq 2)$ becomes very small. In fact it suffices that $\mathbb{P}(N_s \geq 2)=o(s)$, as this implies:

$\mathbb{P}(N_s=0)=1-\lambda s+o(s),\quad \mathbb{P}(N_s=1)=\lambda s+o(s).$

Note that we are not currently attempting a formal construction of the process. As always, finding a probability space with enough freedom to equip a continuous process is a fiddly task. We are for now just trying to work out what the distribution of the holding times between arrivals should be. There are obvious advantages to defining the process as a collection of iid random variables, for example that we can construct it on a product space.

To do this, we split the time interval [0,1] into blocks

$[0,1]=[0,\frac{1}{n}]\cup[\frac{1}{n},\frac{2}{n}]\cup\ldots\cup[\frac{n-1}{n},1]$

So the probability that someone arrives in the time $[\frac{k}{n},\frac{k+1}{n}]$ is $\frac{\lambda}{n}$. So

$\mathbb{P}(\text{no-one arrives in time }[0,1])=(1-\frac{\lambda}{n})^n\approx e^{-\lambda}.$

As we are working in an $n\rightarrow\infty$ regime, we can replace 1 by general time t, to obtain:

$\mathbb{P}(\text{no-one arrives in time }[0,t])=(1-\frac{\lambda t}{n})^n\approx e^{-\lambda t}.$

So the distribution function of the first arrival time is $F(t)=1-e^{-\lambda t}$ in the conventional notation.
Thus $X_1\sim \text{Exp}(\lambda)$.

However, to emphasis how useful the infinitissimal definition is for actual problems, consider these examples.

1) If we have two arrivals processes at the same object, for example arriving from the left and the right at the same shop, say with rates $\lambda,\mu$, then we want to show that the first arrival time is still exponential. Because of the linearity of expectation property, it is clearly from the definition that the total arrivals process is Poisson with rate $\lambda+\mu$, and so the result follows. Showing this by examining the joint distribution of two exponential random variables is also possible, but much less elegant.

2) Similarly, if we have two shops, A and B, and each arriving person chooses one at random, then the first arrival time at A is: with probability 1/2 distributed as Exp($\lambda$), with probability 1/4 as $\Gamma(2,\lambda)$, and so on. A fairly non-trivial calculation is required to show that this is the same as $Exp(\frac{1}{2}\lambda)$, whereas this follows almost instantly using the the infinitissimal definition.

Moral: with the infinitissimal case, the difficulties are all in the probability space. However, once we have settled those problems, everything else is nice as the key property is linear. Whereas for a construction by iid jump times, the existence and well-definedness is clear, but even for a distribution as tractable as the exponential random variable, manipulation can be tricky.