# Infinite Games

The standard example of a two-person game is the Prisoners’ Dilemma. There are various ways of setting up the story for the model. Essentially there are two prisoners being questioned simultaneously but separately by detectives following some crime which they committed together. They have to make a choice about whether to confess the details or not. If they both keep quiet, they will likely receive short sentences because of a lack of evidence. If they both confess, then obviously they each receive a full sentence. If only one prisoner confesses and provides evidence against the other, then the other receives an even harsher sentence (because he proclaimed innocence) while the confessor gets let off with only a caution as a reward. Generally we call keeping quiet ‘cooperating’ and informing on the other prisoner ‘defection’. Some potential numerical outcomes of the various pairs of strategies can be summarised as:

CC – (2,2), CD – (0,3)

DC – (3,0), DD – (1,1)

From now on, we forget all about the underlying story and just think about this as a standard two-person game, as it provides plenty of interest by itself.

In words, if both agents cooperate, they get a better result than if they both defect. However, both cooperating is not a Nash equilibrium, because each agent would do better if they unilaterally chose to defect. Indeed, the strategy of ‘cooperate’ is strongly dominated by the strategy of ‘defect’. That is, if you cooperate you will do worse than if you defect, irrespective of how the other agent acts. A key notion of elementary game theory is that we should (iteratively if necessary) reject weakly dominated strategies, as suggested by common sense. Why would you choose to do something if there is an alternative which is always at least as good and sometimes strictly better? So the only rational outcome for the Prisoners’ Dilemma is for both agents to defect.

This is an example where the Nash equilibrium is not Pareto optimal. A pair of pure strategies is Pareto optimal unless it is possible to find a different pair where both agents do better. For example, both cooperating is strictly better for each agent than both defecting in this case. The question we might want to ask is: will repeating the game encourage mutual cooperation?

This idea is tempting. Perhaps cooperating in early rounds will encourage the opponent to cooperate in future rounds, and thus a scheme of mutual cooperation will develop. Note that each agent’s aim is only to maximise his own return, rather than his return relative to the other agent. He doesn’t mind if his opponent does better so long as he does well.

Suppose that the game is played twice. The key observation is that we should consider the second game first. Irrespective of what has happened in the first game, it is still a Nash equilibrium for both agents to defect. Recall that defection strongly dominated cooperation, so what has happened before has no interest to either agent. Whatever happened in the past, and whatever the other agent does next, it is better for him to defect. Now we’ve established that they will both always defect in the second round, it is clear that there behaviour in the first round can have no consequences apart from the immediate one, and so again they should both defect in the first round too. This procedure is called backwards induction, and applies equally well to situations where the outcome of the first game might affect the range of options or outcomes available in the second game.

In particular, it is clear that however many times we play the Prisoners’ Dilemma, so long as the number of repeats is finite, the optimal strategy will always be to defect every time. But what if we repeat an infinite number of times?

The first thing to be sorted out is how to define the total reward for a strategy. If we just add everything up, all strategies give infinite reward which is not very useful for comparison. One option might be to take the limit of the mean reward ie

$\lim_{n\rightarrow\infty}\frac{\sum_{m=1}^n R_m}{n}$.

The problem with this is that its value is only dependent on eventual behaviour, in the technical sense of tail events and so forth. More simply, the outcome of the first game has no effect on the value of this expression. Nor the second, nor the third and so on. It feels mathematically impossible, and indeed pointless, to choose a strategy so as to optimise a function which is in fact independent of the choice of strategy.

Instead we introduce a discount factor $\delta<1$, and instead seek to maximise

$\sum_{m=1}^\infty \delta^m R_m$.

This feels unsatisfactory at some level, as we might want all of the games to be of equal numerical significance. However, we could partially resolve our fears by taking a $\delta\uparrow 1$ limit at the end of the analysis. More directly, in any genuine application, a situation cannot be repeated indefinitely, and so it makes sense to allow the process to terminate after each go with probability $1-\delta$, with $\delta$ chosen suitably small. Now, since the payoffs for each individual round are bounded, the total payoff is also bounded.

As with other time-homogeneous infinite horizon problems, the great advantage of this formulation is that the same game remains after one round has taken place as at the beginning. This motivates considering stationary strategies, where the rule for choosing an action is essentially the same for every round. That is, it is dependent on previous outcomes, but in a reasonably straightforward way. This definition is quite informal, but it seems that little is to be gained from making it too much tighter right now.

Some notable stationary strategies are: C, cooperate every time; D, defect every time; G, for grim, cooperate until the other agent defects, after which you defect forever; T, for tit-for-tat, initially cooperate, then do whatever the other agent did in the previous round. There is a vast body of literature, both in formal papers and in the so-called blogosphere (the reasons for this might well be an interesting subject for a separate post), concerning the relative merits of these strategies. Tit-for-tat in particular attracts much attention as a dominant strategy in a vast range of contexts. Among the features that can easily be incorporated to the model is a population evolution, where more successful strategies are more likely to propagate into subsequent generations, and noise, where with some small probability, on each turn an agent does the opposite of what they planned to do. In particular, computer simulation is easily applicable to all of these strategies, and there have even been competitions where teams submit strategies to ‘compete’ over 100,000 iterations of the Prisoners’ dilemma.

In general, we want to find Nash equilibria for the infinite game. In particular, as usual we want to reduce the number of candidates by restricting attention to subgame perfect Nash equilibria. (There are several criticisms to be made of this approach, but this is certainly a matter to be discussed elsewhere.) We say that a pair of strategies satisfy the one-stage deviation condition if neither player can improve their payoff by deviating from their strategy unilaterally for one round, then returning to that strategy. Perhaps the most interesting theorem of the book is the following: subgame perfect Nash equilibria and strategies satisfying the one-stage deviation condition are equivalent. The proof is only substantially more than definition chasing.

This enables us to show, for example, that the adjusted grim strategy where each player starts by always cooperating, until either player defects, at which point both defect forever, is a subgame perfect Nash equilibrium. (The apparently inconsequential change in definition from that of a grim strategy is required for subgame perfection.) We conclude that there is a natural framework in which mutual cooperation can exist as a stable solution to the infinite Prisoners’ Dilemma, as we had initially said we would like.