### Zero-Determinant Strategies in the Iterated Prisoner’s Dilemma

#### Posted by Mike Shulman

*A note to those arriving from the article in the Chronicle of Higher Education: my opinion may not have been accurately represented in that article. Please read my whole post and judge for yourself.*

Usually I blog about things because I am excited about them, but today is a little different. Not that I’m *not* excited about this post, but recently I’ve been too busy with other things to do much blogging. However, what excitement can’t do, annoyance can, and I’m getting tired of seeing this (amazing, surprising, insightful, groundbreaking) paper subject to hype and (what seems to me to be) misinterpretation.

- William Press and Freeman Dyson,
*Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent*, PNAS open access, March 2012.

If you don’t know what the prisoner’s dilemma is, Wikipedia has a fairly good article. The important things to take away are that

- At first, it seems like defection is the only rational strategy, but
- When the game is played over and over again, there can be an advantage to cooperation.

The first point is easy to see: defection is a dominant strategy, and mutual defection is the only Nash equilibrium.

The second point can be made both *experimentally* and *theoretically*. On the experimental side, in computer tournaments of iterated prisoner’s dilemma, the strategy **TIT FOR TAT** (or a small variation thereof) almost invariably accumulates the highest total scores. TIT FOR TAT is a very simple strategy, which cooperates on the first round, and thereafter simply copies what its opponent did on the previous round — thereby rewarding cooperation with more cooperation and punishing defection with defection. If we add an “evolutionary” aspect to the tournament, with higher-scoring strategies reproducing more, then over time TIT FOR TAT and its variations come to dominate the landscape.

On the theoretical side, Robert Axelrod (who also performed the first computer tournaments, in the 1980s) proved various theorems. For instance, assuming the expected number of iterations is sufficiently large, TIT FOR TAT is *collectively stable*, meaning that no single mutant strategy in a population of TIT FOR TATs can average a better score than the wild types surrounding it. TIT FOR TAT is not unique in this way: for instance, the strategy ALWAYS DEFECT is also collectively stable. But TIT FOR TAT is furthermore collectively stable with respect to invasion by small *groups* of mutants, whereas a small group of TIT FOR TATs (who can cooperate with each other) can successfully invade a population of ALWAYS DEFECT. Axelrod wrote a classic book called *The Evolution of Cooperation* about the iterated prisoner’s dilemma and TIT FOR TAT, which is still a good read.

TIT FOR TAT does have a few drawbacks. For instance, imagine a “sneaky” strategy that tries defecting once or twice to see if its opponent is gullible, but otherwise plays TIT FOR TAT. An ordinary TIT FOR TAT playing against this strategy will end up in a cycle of mutual defection. The same thing happens in a “noisy” environment where occasional “random” defections happen even against the will of the players. In this sort of situation, slight modifications of TIT FOR TAT can do a bit better; one called GENEROUS TIT FOR TAT is like TIT FOR TAT but forgives its opponents’ defection about 1/10 of the time.

The iterated prisoner’s dilemma is of course a very simplistic model, but this makes it easy to analyze, and the impressive thing is that conclusions of this sort seem applicable to many situations in “real life”. Animals of various sorts have been observed seemingly “playing tit for tat”, as have corporations and nation-states (to say nothing of undergraduates in economics experiments). It has been widely hypothesized that this sort of mechanism underlies how cooperation between animals and human beings evolved in the first place.

Enter Press and Dyson. Here’s some of the hype I’ve collected on the Internet.

Robert Axelrod’s 1980 tournaments of iterated prisoner’s dilemma strategies have been condensed into the slogan, Don’t be too clever, don’t be unfair. Press and Dyson have shown that cleverness and unfairness triumph after all.

Press and Dyson have discovered strategies that trump Tit-for-Tat…

although “kind, non-envious, retaliatory, forgiving” are still effective strategies, the optimal strategy is one that removes agency from its opponent, requiring the opponent to cooperate with the whims of its ruler.

The new paper is undoubtedly a breakthrough, but as far as I can tell, it doesn’t justify these claims.

So what do Press and Dyson actually do in this 5-page paper? They start by supposing that both players are playing a probabilistic strategy which depends only on the previous round. (I’ll come back to that assumption later on.) Thus each player’s strategy is determined by four probabilities: the probabilities of cooperation given that the previous round’s outcome was CC (both cooperate), CD ($Y$ exploited $X$), DC ($X$ exploited $Y$), or DD (mutual defection). Putting together these probabilities for both players, we obtain a probability distribution for the outcome of each round, conditional on the outcome of the previous round — in other words, a Markov chain.

The stationary distribution $\mathbf{v}$ of this Markov chain is the probability distribution of outcomes for any given round (in the long run), and hence governs the long-run expected payoffs of each player by the dot product formulas

$s_X = \mathbf{v} \cdot \mathbf{S}_X$ $s_Y = \mathbf{v} \cdot \mathbf{S}_Y$

where $\mathbf{S}_X$ and $\mathbf{S}_Y$ are the vectors of payoffs for player $X$ and $Y$ respectively in the four possible outcomes. If we order the outcomes as (CC,CD,DC,DD), then with the common choice of payoffs (3-3 for mutual cooperation, 1-1 for mutual defection, 5-0 for exploitation) we have

$\mathbf{S}_X = (3,0,5,1)$ $\mathbf{S}_Y = (3,5,0,1).$

Now using some undergraduate linear algebra, Press and Dyson show that the dot product of the stationary distribution $\mathbf{v}$ with any vector $\mathbf{f}$ can be expressed as a $4\times 4$ determinant in which one column is $\mathbf{f}$, one column is entirely under the control of player $X$ (that is, it involves only the four probabilities that describe $X$’s strategy), and another column is entirely under the control of player $Y$.

This means that either $X$ or $Y$ can (potentially) independently force the dot product of $\mathbf{v}$ with some other chosen vector $\mathbf{f}$ to be zero, by choosing their strategy so as to make the column they control be proportional to $\mathbf{f}$. In particular, taking $\mathbf{f}= \alpha \mathbf{S}_X + \beta \mathbf{S}_Y + \gamma \mathbf{1}$, either player can (potentially) force a given linear relation to hold between the long-run scores of both players:

$\alpha s_X + \beta s_Y + \gamma = 0.$

I say “potentially” because it might turn out that for a given $\alpha,\beta,\gamma$ there is no choice of *probabilities* (which, of course, have to be between $0$ and $1$) that make the columns proportional. For instance, if $X$ chooses $\beta=0$, thereby trying to force her own score to be some fixed number, it turns out there are (unsurprisingly) no such solutions. But if $X$ chooses $\alpha=0$, there *are* solutions: $X$ can force $Y$’s score to be some fixed number (within reasonable bounds — between $1$ and $3$ for the standard payoffs) *independently of whatever strategy $Y$ might choose*. (Obviously $X$ cannot force $Y$’s score to be less than $1$, since $Y$ can guarantee himself an average payoff of at least $1$ by defecting all the time. Similarly, $Y$ can guarantee himself an average payoff of at most $3$ by cooperating all the time.)

To describe one such strategy, pick two small positive numbers $\varepsilon_1$ and $\varepsilon_2$. The strategy is (assuming I did the algebra correctly):

- If both players cooperated last round, cooperate with probability $1-\varepsilon_1$.
- If I exploited my opponent last round, cooperate with probability $\frac{1}{2}(\varepsilon_1 + 3\varepsilon_2)$.
- If my opponent exploited me last round, cooperate with probability $1 - 2 \varepsilon_1 - \varepsilon_2$.
- If both players defected last round, cooperate with probability $\varepsilon_2$.

The other linear relation that Press and Dyson consider is $(s_X - 1) = \chi(s_Y - 1)$: forcing a given ratio to hold between the two player’s scores (or more precisely, the scores minus the all-defection payoff). In this case there are solutions for *any* $\chi \ge 1$: player $X$ can “extort” a greater share of the payoffs *independently of whatever strategy $Y$ might choose*. One such strategy is (again, assuming I did the algebra correctly):

- If both players cooperated last round, cooperate with probability $1-\frac{2\chi-2}{4\chi + 1}$.
- If I exploited my opponent last round, cooperate with probability $\frac{\chi+4}{4\chi+1}$.
- If my opponent exploited me last round, defect.
- If both players defected last round, defect.

Of course, by defecting all the time, player $Y$ can guarantee $s_X = s_Y = 1$, which satisfies $(s_X - 1) = \chi(s_Y - 1)$ for any $\chi$. However, given that $X$ is playing a “extortionate” strategy of this sort, one can show that $Y$’s *own* score will be maximized if he cooperates all the time — in which case $X$ does in fact get a higher average payoff than $Y$. The limiting scores as $\chi\to\infty$, for the standard payoffs, are $s_Y \to 1$ (removing the incentive for $Y$ to cooperate at all) and $s_X \to 4 \frac{1}{3}$.

I find this paper simply astounding. I think it’s surprising in the extreme that such a simple analysis of the iterated prisoner’s dilemma, leading to such interesting conclusions, had not yet been done after so many decades of study. And the wider implications for theories of cooperation are also fascinating.

In their discussion, Press and Dyson focus on the results in terms of “evolutionary” players versus those with a “theory of mind”. In their parlance, an “evolutionary” player is one who attempts to maximize his *own* score using some optimization scheme. This is the sort of player against which you want to play an extortionate strategy, since his optimization will then lead him to cooperate, giving you an unequal share of the payoffs.

On the other hand, playing an extortionate strategy is not itself an “evolutionary” strategy — its attraction is based on the assumption that your opponent will *change his strategy* in response to it, learning to cooperate and putting the two of you into your preferred sweet spot in the long run. Press and Dyson say that a player has a “theory of mind” if she can adjust her strategy with the intent of making her opponent adjust *his* strategy in response. If you try playing an extortionate strategy *against* a player with a theory of mind, he may realize what you are up to and decide to defect all the time — thereby lowering his own scores in the short run, but hoping to convince you that your extortionate strategy is not working and that you should switch to a more equitable one. Thus, Press and Dyson say that the iterated prisoner’s dilemma between two players with a theory of mind becomes an ultimatum game.

An alternative way of slicing the pie which doesn’t bring in theories of mind is to talk about short-memory and long-memory strategies. Zero-determinant strategies are obviously very short-memory strategies, like TIT FOR TAT: they only make decisions based on the previous round. The behaviour of a player with a “theory of mind”, on the other hand, instead of being described as “switching strategies”, could be considered to be a single very long-memory strategy, which happens to mimic particular short-memory strategies over short time spans.

(In an appendix, Press and Dyson prove that although zero-determinant strategies are derived under the assumption that both players have a memory of only one round, in fact they still behave the same way as long as the other player has a memory of only *finitely* many rounds. (Or more precisely, I guess, a memory which is short relative to the total number of rounds being played.) The idea is simple: if $Y$ is playing a strategy with a memory of $N$ rounds, while $X$ is playing a strategy with a memory of only 1 round, then by averaging over the resulting probability distribution of all sequences of $N$ outcomes, we can produce an alternative strategy for $Y$, with a memory of only 1 round, whose long-run average scores against $X$’s strategy are identical.)

So what about the hype? Do the “zero-determinant strategies” of Press and Dyson overturn the dominance of TIT FOR TAT, showing that selfishness wins out after all? I think the answer is: not remotely! Zero-determinant strategies thrive in a very different environment than TIT FOR TAT.

Remember that TIT FOR TAT’s experimental and theoretical dominance is relative to a *large population* of many players, each with their own strategy. This is crucial to TIT FOR TAT’s success, which has the interesting attribute that it never “wins” a single match! In other words, when you play TIT FOR TAT against some other strategy repeatedly for some large number of rounds, TIT FOR TAT will never come out “ahead” in that its score is greater than that of the other strategy. (In fact, TIT FOR TAT is the limiting case $\chi = 1$ of the Press-Dyson extortionate strategies: it forces the two players’ long-run scores to be equal. In a finite game, of course, edge effects can make this inexact.)

The way that TIT FOR TAT thrives in Axelrod’s evolutionary (or, perhaps better, “ecological”) environment is by playing in a *large number* of *high-scoring games*. Namely, against any other strategy which is “nice” (which Axelrod defined to mean “never being the first to defect”), TIT FOR TAT accumulates high scores from constant mutual cooperation — as does the other player.

By contrast, non-nice strategies may “win” individual matches (scoring higher than their opponent does *in that match*), but those matches tend to have low total scores. (And against TIT FOR TAT, non-nice strategies generally tie at very low scores such as mutual defection.) Extortionate strategies (which must be non-nice to achieve $\chi\gt 1$) are no exception.

Where extortionate strategies *do* succeed is in the situation that Press and Dyson describe: when you are playing against a single fixed opponent only (and, preferably, an evolutionary one without a theory of mind). But even in this situation, extortionate strategies do not “defeat” TIT FOR TAT when played *against* it. TIT FOR TAT has no theory of mind, but neither is it evolutionary. Against an extortionate player, TIT FOR TAT’s simple rule leads to a cycle of mutual defection — exactly the response of the player *with* a theory of mind to an unfair ultimatum! And if the extortionate player reacts to the refusal of the ultimatum by switching to a fairer strategy, then TIT FOR TAT is forgiveness itself.

The only way that a zero-determinant strategy can “trump” TIT FOR TAT is that it can successfully take advantage of some opponents which TIT FOR TAT cannot. Of course, even ALWAYS DEFECT has this property: it can take advantage of ALWAYS COOPERATE, which TIT FOR TAT cannot. But extortionate zero-determinant strategies are cleverer than ALWAYS DEFECT, and can exploit a wider range of opponents (in particular, “evolutionary” ones). Thus if the goal is to “win” matches in this sense, rather than to accumulate points, then these extortionate strategies can be successful.

To test these conclusions, the authors of the following paper ran a computer tournament including both TIT FOR TAT and some zero-determinant strategies:

- Alexander Stewart and Joshua Plotkin,
*Extortion and cooperation in the Prisoner’s Dilemma*, PNAS (not open access, sadly).

I don’t know the names of all the 20 or so strategies that played, but in addition to TIT FOR TAT they included ALWAYS DEFECT, ALWAYS COOPERATE, GENEROUS TIT FOR TAT, TIT FOR TWO TATS (defect only if your opponent’s previous *two* moves were defections), and RANDOM (flip a coin), and probably some “sneakier” ones. The zero-determinant strategy EXTORT-2 which enforces $(s_X - 1) = 2 (s_Y - 1)$ came in second to last in the number of total *points*, but second to the top in the number of matches “won” — in both cases playing second fiddle to ALWAYS DEFECT. (The paper doesn’t give a lot of details, but I suspect that most of these “victories” were rather pyrrhic. For instance, if ALWAYS DEFECT plays TIT FOR TAT, then on the first round ALWAYS DEFECT will score 5 versus TIT FOR TAT’s 0, and then on all subsequent rounds they will both score 1. Technically, this is a “victory” for ALWAYS DEFECT.)

Stewart and Plotkin also considered another very interesting zero-determinant strategy, which Press and Dyson did not mention. This one tries to enforce the relationship $(s_X - 3) = \chi (s_Y - 3)$, where $3$ is the “mutual cooperation” reward score. One such strategy is (assuming I did the algebra right):

- If both players cooperated last round, cooperate.
- If I exploited my opponent last round, cooperate.
- If my opponent exploited me last round, cooperate with probability $\frac{\chi-1}{3\chi+2}$.
- If both players defected last round, cooperate with probability $\frac{2(\chi-1)}{3\chi+2}$.

Note that $\chi=1$ reproduces TIT FOR TAT again. For $\chi \gt 1$, the relationship $(s_X - 3) = \chi (s_Y - 3)$ looks like it is saying that if your score is more than $3$, mine is twice as much more than $3$. But actually, it’s impossible for *both* players’ long-run average scores to be strictly *greater* than $3$, so it would be more accurate to write this as $(3 - s_X) = \chi (3 - s_Y)$ — meaning that if your score is *less* than $3$, then mine is twice as much less!

Stewart and Plotkin chose $\chi=2$ and called the resulting strategy ZDGTFT-2, because it’s very similar to GENEROUS TIT FOR TAT (GTFT). (Actually, I can’t be sure whether their ZDGTFT-2 is this particular strategy or another one which enforces the same linear relationship; they don’t describe theirs explicitly.) The difference between ZDGTFT-2 and GTFT is that with ZDGTFT-2, the probability of forgiving the other player’s defection when I also defected last round is twice as large as it is when I cooperated last round.

In Stewart and Plotkin’s tournament, the overall score of ZDGTFT-2 was very slightly better than that of GTFT (by about one percentage point) — while GTFT did somewhat better than ordinary TIT FOR TAT (by about five percentage points). So can zero-determinant strategies beat TIT FOR TAT at its own game? One tournament isn’t conclusive, but it suggests that if so, their advantage is probably miniscule.

In conclusion, I think the unexpected existence of zero-determinant strategies is a discovery of the first order, with many fascinating implications that we can see already, and others that I’m sure are yet to come. It opens up entirely new ways to analyze evolutionary, social, and international relationships, potentially greatly expanding the (already surprisingly wide) applicability of the iterated prisoner’s dilemma. But as far as I can tell, the only way that extortionate strategies can “beat” TIT FOR TAT is if we change the rules, so that we are no longer playing a game that TIT FOR TAT ever *claimed* to be able to win.

## Re: Zero-Determinant Strategies in the Iterated Prisoner’s Dilemma

Thanks for this summary.

I am not by far an expert on this matter, most I know of it is via a friend’s Master Thesis many years back – which was a thesis in environmental sociology. Now reading your account here, reminds me of some questions I had and have.

What I find striking about the discussion about the prisonor dilemma is how on a very crude simplistic mathematical model a heap of sociological interpretation is stacked. In effect, to rules for just bit-flips one assigns words designed to describe behaviour of multicellular organisms in a complex environment.

My first question is: is there actual interest in the prisoner dilemma in pure mathematics, or is the interest it generates all qua its interpretation as a model in sociology or ecology?

You mention above

and so I am guessing that the answer to this question is “the second of the above”, but let me know. (I could probably find out by reading more of the sources, but allow me to ask anyway.)

My second question is: as a model in sociology and ecology, how useful has it been when compared with experiment?

There are plenty of examples of mathematically precise but hopelessly overly simplistic assumptions about behavious of humans – say in finance. Is the “prisoner dilemma” an exception?

In this respect finally a comment: you end by saying:

This leads me back to my question: are we intending to do math here, or model building in sociology? In the first case, changing the rules in the middle of the game is an error that invalidates the whole undertaking. But in the second case uncertainty in the rules along which to play is a hallmark of the whole subject. It’s what makes sociology (or whatever: ecology, finance) a non-exact science. If we regard the prisoner dilemma as actually being a model for some kind of complex systems, then we should be prepared to change its rules and be in fact centrally intersted in how the system changes as the rules are being changed.

Sorry if all these questions and comments maybe are tangential to the central point of your post. But it’s what came to my mind when reading this.