AlphaZero should've been called 'Go-NaD'
About six years ago, I got drunk while playing board games with some friends and subsequently made a fool of myself. I don't mean that I like had one-too-many buds and weighed in on Canadian immigration policy which I knew nothing about – though, make no mistake, this was not1 beneath me by any means. No, I mean I got prodigiously wasted and lost like seven straight games of Stratego to my friend Alex (henceforth to be referred to as 'Bastard Man') with increasingly comedic advantages being afforded to me to no avail.
This was years before the thought of chewing Feldspar2 ever crossed my mind, though I'm sure those bystanders who bore witness to the repeated Caepioan defeats at my command –the tactlessness of which was surely only matched by my incredulity– saw it in me far sooner than I was willing to admit to myself. Now, many are saying3 that the numerous posts dedicated to the subject of Game Theory form one massive cope to make up for the lack of "street smarts" or inability to employ them. Whatever, dude.
Today I'm deep diving into provectorum strategum ars belli.
Stratego
Inspired by the Napoleonic wars, Stratego is a partially observable, zero-sum, sequential (extensive-form) game.
Partially observable meaning that, unlike chess or Go, each player only knows a fraction of the true game state, based on: information that is private to them, information that is public to both players, and the inferences that the player is able to make about their opponent's private information.
The game takes place on a 10x10 board, with each player deploying 40 pieces of varying strengths in a secret configuration.
There are 12 piece ranks, with increasing rank corresponding to strength. Like any good game (I'm a Go truther), this rule of thumb is not without its twists:
- Bombs and Flags are immovable,
- 3s can "defuse" bombs,
- Spies can capture Marshals (10) when they are the aggressor,
- Scouts can move any number of spaces along an unobstructed vertical or horizontal path (the rest of the movable pieces are constricted to single moves in the four cardinal directions),
- Pieces cannot jump over one another, nor over the deadzone/lake obstacles in the neutral zone.
Additionally, to prevent degenerate gameplay4 there are two rules which force progression:
- The 2-square rule states that players aren't allowed to oscillate the same piece back and forth between the same two squares more than twice,
- the bbffr rule states that you can't chase a piece around the board that you'll never be able to attack (barring a misplay5).
The complete ruleset can be found online.6
Battle occurs when a player moves a piece into a square occupied by an opponent's piece. The piece with the lower rank is captured and removed from the board; ties result in both pieces being removed from the board (and occasionally placed in one's beverage a la "little man").
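If you prefer your rules as code, here's a minimal sketch of battle resolution under the numbering used above (Marshal = 10 down to Spy = 1, with Bombs and Flags as specials); this is my own toy encoding, not anything official:

```python
# Toy Stratego battle resolution (ranks: Marshal = 10 ... Scout = 2, Spy = 1).
BOMB, FLAG = "B", "F"

def resolve_battle(attacker, defender):
    """Return which piece(s) get removed from the board: 'attacker', 'defender', or 'both'."""
    if defender == FLAG:
        return "defender"                      # Flag captured: game over
    if defender == BOMB:
        return "defender" if attacker == 3 else "attacker"   # only Miners (3s) defuse Bombs
    if attacker == 1 and defender == 10:
        return "defender"                      # the Spy fells the Marshal, but only as the aggressor
    if attacker == defender:
        return "both"                          # ties: both pieces leave (or enter a beverage)
    return "defender" if attacker > defender else "attacker"

assert resolve_battle(3, BOMB) == "defender"   # Miner defuses the Bomb
assert resolve_battle(7, BOMB) == "attacker"   # everyone else goes boom
assert resolve_battle(1, 10) == "defender"     # Spy captures the Marshal...
assert resolve_battle(10, 1) == "defender"     # ...but dies if the Marshal attacks first
```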
The game ends either when one player captures their opponent's flag, or via attrition if no legal moves remain for one of the players (the losing player). Astute tacticians need not be reminded that draws are possible if both flags are surrounded with bombs, and both players have a singular non-Miner piece each.
Strategy
Modeling optimal strategy comes with all the expected difficulty of a game whose complexity class exceeds that of chess and even Go. The complexity of Stratego is augmented by the partial observability of the game, and the board size. It has a decision tree size of roughly $10^{535}$, whereas Go's is a "puny" $10^{360}$.7
Whereas the average chess game has a singular initial state, roughly $10^{120}$ possible games,8 and takes about 60 turns per player to determine a victor or stalemate, with a relatively low standard deviation, Stratego has roughly $10^{33}$ initial deployment possibilities (per player, so about $10^{66}$ total), and an average game length of about 400 moves, though it's not uncommon for games to span over a thousand moves, so the variance is much higher. As such, emergent strategy in Stratego is much harder to measure and quantify. There are a few sensible starting points or heuristics which we can observe, though.
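That deployment figure is just a multinomial coefficient; assuming the standard piece counts, you can sanity-check it in a couple of lines:

```python
from math import factorial, prod

# Standard Stratego army: Flag, Spy, 8 Scouts, 5 Miners, 4 Sergeants, 4 Lieutenants,
# 4 Captains, 3 Majors, 2 Colonels, General, Marshal, 6 Bombs (40 pieces total).
piece_counts = [1, 1, 8, 5, 4, 4, 4, 3, 2, 1, 1, 6]

# Distinct ways to arrange 40 pieces (with duplicates) across the 40 home squares.
per_player = factorial(40) // prod(factorial(c) for c in piece_counts)
print(f"{per_player:.2e} deployments per player")        # ~1.41e+33
print(f"{float(per_player) ** 2:.2e} for both players")  # ~1.99e+66
```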
Flag Position
Naturally, bomb economy and Bastard Man mind games indicate that it's more efficient to place your Flag in a corner, boxed in by only two Bombs, leaving a free Bomb to be placed somewhere out in the field (which is funny). But this is predictable, so it's also common to mix in other "conventional" strategies, such as a non-corner placement or surrounding the Flag with three Bombs. Additionally, decoy Flags (any piece that can defeat a Miner) can be surrounded with Bombs to give a false sense of victory to an opponent.
Marshal Maneuvering
It's also common to keep a Spy near the Marshal or General, in order to quickly retaliate if an enemy Marshal is encountered. Similarly, knowledge of this tactic can be exploited by giving the impression that any two pieces moving in lockstep might be the Spy/Marshal pair.
Scouts and Miners
Taking a page out of Sandusky's book, keeping ample amounts of Scouts and Miners in the reserves for mid to late game is essential for exploration and disarmament.
Information is King
And, of course, on top of all this, maintaining your "fog of war" (quantified by the entropy of your board state as seen from your opponent's side) is a vital axiom for Stratego dominance. In other words, moving a piece always communicates strictly more information than leaving an unmoved piece alone: at minimum, it reveals that the piece is neither a Bomb nor the Flag.
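To put a rough number on that (my toy numbers, nothing to do with DeepNash): consider a naive opponent whose belief about a single unmoved piece is proportional to the army's rank counts. The moment the piece moves, Bomb and Flag drop out of the distribution and its entropy falls:

```python
import math

counts = {"Flag": 1, "Spy": 1, "Scout": 8, "Miner": 5, "Sergeant": 4,
          "Lieutenant": 4, "Captain": 4, "Major": 3, "Colonel": 2,
          "General": 1, "Marshal": 1, "Bomb": 6}

def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

prior = normalize(counts)                                   # belief about an unmoved piece
posterior = normalize({k: v for k, v in counts.items()      # it moved, so it's
                       if k not in ("Bomb", "Flag")})       # not a Bomb or the Flag

print(f"before moving: {entropy(prior):.2f} bits")     # ~3.28 bits
print(f"after moving:  {entropy(posterior):.2f} bits") # ~3.03 bits
```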
DeepNash
Each year, the International Stratego Federation hosts a series of tournaments, and despite the relatively simple constraints laid out above, a few dozen players in the uppermost echelons of classic tournament play continue to dominate, indicating that the game is not reducible to mere chance.
Enter DeepNash,7 Google's Stratego bot, which boasts an 84% win rate against human experts. This is significant compared to e.g. the chess bot Stockfish, since it deviates from the obvious exploitation of a computer's ability to store effectively infinitely more arbitrary data about game state than a human opponent (don't get it twisted, DeepNash still does this, but it's ever so slightly more sophisticated). Memory hardware makes rapid evaluation of possible moves vastly more viable at scale relative to human (maybe piss-drunk human) evaluation, even with our lizard-brained fancy pruning heuristics.
Methods such as α-β pruning9 fail for Stratego almost purely because of the size of the requisite game tree, let alone the fact that the value function for a possible move given a board state is unknowable for most early-to-mid game states when public information is minimized.
So how, then, does DeepNash work? Perhaps unsurprisingly, the answer is Reinforcement Learning. Through many iterations of self-play from zero knowledge, DeepNash is able to converge on a policy approximating a Nash Equilibrium in the modified instance of a game induced by the following regularized reward function:

$$r^i_\eta(a^i, a^{-i}) = r^i(a^i, a^{-i}) - \eta \log\frac{\pi^i(a^i)}{\pi^i_{\text{reg}}(a^i)} + \eta \log\frac{\pi^{-i}(a^{-i})}{\pi^{-i}_{\text{reg}}(a^{-i})}$$

where:
- $i$ is the player index (and $-i$ denotes their opponent),
- $\eta$ is a regularization parameter akin to a step size in gradient descent,
- $\pi^i$ is that player's policy (a probability distribution over possible actions, weighted by fitness),
- $\pi^i_{\text{reg}}$ is a regularization policy which, roughly speaking for now, keeps the policy unpredictable,
- $r^i(a^i, a^{-i})$ measures how good a move is when player $i$ takes action $a^i$ and the opponent responds with $a^{-i}$.

Since it's incredibly difficult to determine the value of an individual move in a game spanning potentially thousands of such moves, or how taking a given action will impact the transition dynamics to the next state, DeepNash is a model-free approach, meaning that the underlying Markov Decision Process governing the transition dynamics between states is unknown (and never explicitly modeled).
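In code, that reward transformation is a one-liner. The sketch below is mine (policies as plain dicts, $\eta$ chosen arbitrarily), not DeepMind's:

```python
import math

def regularized_reward(r, a_i, a_opp, pi_i, pi_opp, pi_reg_i, pi_reg_opp, eta=0.2):
    """Transformed reward r_eta for player i, given both players' policies (dicts
    mapping actions to probabilities) and their regularization policies."""
    return (r(a_i, a_opp)
            - eta * math.log(pi_i[a_i] / pi_reg_i[a_i])
            + eta * math.log(pi_opp[a_opp] / pi_reg_opp[a_opp]))

# Matching pennies as a stand-in base game, uniform regularization policies:
r = lambda a, b: 1.0 if a == b else -1.0
uniform = {"H": 0.5, "T": 0.5}
skewed = {"H": 0.9, "T": 0.1}

# Playing the action your current policy over-favors gets discounted (about -1.12 vs. -1.0),
# while the opponent deviating from convention would boost the reward back up.
print(regularized_reward(r, "H", "T", skewed, uniform, uniform, uniform))
```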
Unlike model-based RL contexts where the transition dynamics are made explicit by the environment as SARSA tuples, the transition trajectories of this instance of an RL problem are modeled as:
where:
- $o_t$ is an observation of the complete state at time $t$,
- is the probability distribution of ,
- and determines which player's turn it is.
In the absence of explicit transition dynamics informed by a complete representation of the game state, it would seem that the best we can do to model reward is something like:
Furthermore, this implies that DeepNash doesn't even attempt to track a belief space approximating the opponent's board state. There are, however, some observations which can help inform the design of the reward function. For starters, Stratego is zero-sum, as explicated by the boundary conditions on terminal states laid out already – meaning that a move that is good for player $i$ is necessarily proportionately bad for player $-i$.
DeepNash employs the Regularized Nash Dynamics (R-NaD) algorithm in order to interpolate reward values for all those handwavey non-terminal states which comprise 99.999% of all moves. The algorithm for two player Normal Form Games10 is as follows (a toy sketch in code follows the list):
- Initialization: Start with an arbitrary regularization policy $\pi_{\text{reg}}$ (e.g. uniform random),
- Reward Transformation: Construct the transformed game using the regularized reward $r^i_\eta$ defined above,
- Dynamics: Run the replicator dynamics on the transformed game, with actions sampled from the current policy $\pi$, until convergence to a fixed policy $\pi_{\text{fix}}$,
- Update: Set the regularization policy for the next iteration to be the fixed policy from the current iteration, $\pi_{\text{reg}} \leftarrow \pi_{\text{fix}}$, and repeat until the regularization policy converges on the fixed policy.
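Here's a toy, Euler-discretized version of that loop on matching pennies. This is my own simplification for illustration (the real, extensive-form implementation lives in OpenSpiel, linked in the footnotes):

```python
import math

# Matching pennies: payoff[a][b] is player 1's reward; player 2 gets the negation.
payoff = [[ 1.0, -1.0],
          [-1.0,  1.0]]
ETA, DT = 0.2, 0.01   # regularization strength and Euler step size (arbitrary choices)

def transformed_q(my_payoff, pi_me, pi_opp, reg_me, reg_opp, eta=ETA):
    """Expected transformed reward Q(a) for each of my actions, i.e.
    E_{b ~ pi_opp}[ my_payoff[a][b] - eta*log(pi_me[a]/reg_me[a]) + eta*log(pi_opp[b]/reg_opp[b]) ]."""
    return [sum(pi_opp[b] * (my_payoff[a][b]
                             - eta * math.log(pi_me[a] / reg_me[a])
                             + eta * math.log(pi_opp[b] / reg_opp[b]))
                for b in range(len(pi_opp)))
            for a in range(len(pi_me))]

def replicator_step(pi, q, dt=DT):
    """One Euler step of the replicator dynamics, clipped away from zero and renormalized."""
    avg = sum(p * qa for p, qa in zip(pi, q))
    new = [max(p + dt * p * (qa - avg), 1e-9) for p, qa in zip(pi, q)]
    total = sum(new)
    return [p / total for p in new]

def rnad(payoff, outer_iters=5, inner_steps=3000):
    n = len(payoff)
    payoff_p2 = [[-payoff[a][b] for a in range(n)] for b in range(n)]  # player 2's view
    pi1, pi2 = [0.9, 0.1], [0.2, 0.8]      # arbitrary starting policies
    reg1, reg2 = [1 / n] * n, [1 / n] * n  # 1. initialization: uniform regularization policy
    for _ in range(outer_iters):
        for _ in range(inner_steps):       # 3. run the replicator dynamics on...
            q1 = transformed_q(payoff, pi1, pi2, reg1, reg2)      # 2. ...the transformed game
            q2 = transformed_q(payoff_p2, pi2, pi1, reg2, reg1)
            pi1, pi2 = replicator_step(pi1, q1), replicator_step(pi2, q2)
        reg1, reg2 = list(pi1), list(pi2)  # 4. update: next regularization policy = fixed policy
    return pi1, pi2

print(rnad(payoff))  # both policies end up near (0.5, 0.5), the Nash equilibrium
```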
R-NaD shows up in DeepNash's reward function as those two $\eta$-scaled log terms: one negative regularization term and one positive.
Negative Regularization
The term $-\eta \log\frac{\pi^i(a^i)}{\pi^i_{\text{reg}}(a^i)}$ discounts moves that are out of the ordinary (such as leaving a valuable piece unprotected). Note that this term just decreases the relative reward of a move with conventionally low fitness (but high Bastard Man fitness) under the current regularization policy; it does not disallow it outright.
Positive Regularization
Similarly, the positive term

$$+\,\eta \log\frac{\pi^{-i}(a^{-i})}{\pi^{-i}_{\text{reg}}(a^{-i})}$$

accounts for an opponent's deviation from conventional warfare. Together, these terms effectively penalize or boost the reward for player $i$'s decisioning and predictive ability relative to the regularization policies $\pi_{\text{reg}}$.
Convergence?
Transforming the rewards via these regularizations establishes a fixed point towards which convergence is guaranteed by an exponentially decreasing Lyapunov function: the divergence between the fixed-point policies and the current policies, $\sum_i \mathrm{KL}\!\left(\pi^i_{\text{fix}} \,\Vert\, \pi^i_t\right)$.
And, while no individual fixed point at an arbitrary iteration is guaranteed to be optimal, iterated self play forces exploration and emergent policy improvement towards stability.
Follow the Regularized Leader
According to a folk theorem, the average policies of no-regret algorithms in self-play or against best responses converge to a Nash equilibrium in two-player zero-sum games.
However, when a sequential game is flattened into normal form, the strategy space grows exponentially in the number of information sets, so game-specific regret-minimizing algorithms are needed for extensive-form games with sequential decisions.
These regularization updates are used to compute policy updates along the replicator gradient:

$$\frac{d\,\pi^i_t(a)}{dt} = \pi^i_t(a)\left[\,Q^i_{\pi_t}(a) - \sum_{a'} \pi^i_t(a')\, Q^i_{\pi_t}(a')\,\right]$$

where $Q^i_{\pi}(a)$ measures the quality of an action via its expected (transformed) reward:

$$Q^i_{\pi}(a) = \mathbb{E}_{a^{-i} \sim \pi^{-i}}\left[\,r^i_\eta(a, a^{-i})\,\right]$$

As such, the bracketed term measures the expected payoff for an action $a$ vs. that of the average over all actions $a'$ under the current policy.
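As a bare-bones illustration of that bracketed term (made-up payoffs, and using the raw rather than the regularized reward to keep it short):

```python
# A tiny 2x3 zero-sum game; payoff[a][b] is the row player's reward.
payoff = [[ 1.0, -2.0,  0.5],
          [-1.0,  3.0, -0.5]]
pi_opp = [0.2, 0.5, 0.3]   # opponent's mixed policy
pi_me  = [0.6, 0.4]        # my current policy

q = [sum(p_b * payoff[a][b] for b, p_b in enumerate(pi_opp)) for a in range(2)]
baseline = sum(p_a * q_a for p_a, q_a in zip(pi_me, q))

for a in range(2):
    print(f"action {a}: Q = {q[a]:+.2f}, advantage = {q[a] - baseline:+.2f}")
# Actions with positive advantage gain probability mass under the replicator update.
```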
DeepNash was trained by playing many games against current and previous iterations of itself, learning new policies while remaining defensive against old strategies as well. In doing so, it converges on a Mixed Nash Equilibrium, whereby it randomizes its moves based on a probability distribution over optimal moves given an optimal opponent.
Nash Equilibria are sets of policies (one per player) where no individual player can increase his expected utility by unilaterally deviating from his policy, assuming that the other player(s) keep theirs fixed. Every finite game (imperfect information included) possesses at least one such equilibrium, possibly mixed.
This approach makes DeepNash indifferent towards any sort of meta or strategic oversight, since its policy is entirely informed by self-play (pause).11
Arch
The neural network underpinning DeepNash is a Convolutional Neural Network which takes as input a tensor whose frames capture all the observable state for each player as well as a latent history of moves taken by each.
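For a sense of shape (and only that), here's what such a network might look like. The plane count, channel width, action-space size, and heads below are placeholder guesses for illustration, not DeepNash's actual architecture:

```python
import torch
import torch.nn as nn

# Illustrative only: plane count, widths, and heads are assumptions, not DeepNash's.
N_PLANES = 82            # observation planes: own pieces, revealed/unrevealed opponent
                         # pieces, lakes, and a short move-history encoding (assumed)
N_ACTIONS = 10 * 10 * 4 + 10 * 10 * 2   # hypothetical flat move encoding

class StrategoNet(nn.Module):
    def __init__(self, width=128):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Conv2d(N_PLANES, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(nn.Flatten(), nn.Linear(width * 10 * 10, N_ACTIONS))
        self.value_head = nn.Sequential(nn.Flatten(), nn.Linear(width * 10 * 10, 1), nn.Tanh())

    def forward(self, obs):                    # obs: (batch, N_PLANES, 10, 10)
        x = self.torso(obs)
        return self.policy_head(x), self.value_head(x)

logits, value = StrategoNet()(torch.zeros(1, N_PLANES, 10, 10))
print(logits.shape, value.shape)               # torch.Size([1, 600]) torch.Size([1, 1])
```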
Heuristics
While DeepNash is private,12 there are some VODs available, from which we can make some observations about "high level strategy":
- Information is King - As mentioned earlier, limiting which pieces get moved in order to protect the entropy of your board state from the opponent's perspective seems crucial.
- Scout aggressively – The first several moves of each game usually involved exploration by Scout pieces, and the next few hundred moves consisted of shuffling the same five or six pieces around at a time to exploit the information gathered by the suicide scout mission.
- Quality material >> raw material: The utility of each piece matters more than purely its rank. For example, two Miners are considered more valuable than two Lieutenants, especially late game.
- Protect Miners: Most of the Miners were kept in the back two rows, protected for the late game, and would never be used to attack unless in dire need of information.
- Divide and Conquer: the Marshal always appeared in the third row, typically near the Spy (unless the Spy was bluffing on the other side of the board with the General). The average strength of the other pieces in the first two rows ranged between 4.4–5 and 4.5–5.4, respectively (over the four available recorded games).
- Credible bluffing: e.g. protecting "revealed" pieces with non-Spy pieces so as to be able to use the Spy aggressively.
- Bombs: Bombs only ever appeared in the back row. About 50% of the time, the Flag was placed in a corner. "Having an unpredictable deployment is important for being unexploitable."
Anyways, watch your back, Alex
Footnotes
is not ↩
Q.V. Supra 302 ↩
Bastard Man is saying ↩
in the mathematical sense, not the teetotaling sense ↩
Unlike Bastard Man, the literal rulebook of Stratego assumes that opponents are rational agents. ↩
Perolat et al. "Mastering Stratego, the classic game of imperfect information." Google DeepMind, 1 December 2022. ↩ ↩2
Shannon's number, though more accurately according to John Tromp, via Haskell aka via God,13 the number of legal positions is closer to $4.8 \times 10^{44}$. ↩
In a two-player zero-sum normal-form game, both players simultaneously play actions sampled from their respective policies $\pi^1, \pi^2$, and receive opposite rewards: $r^1(a^1, a^2) = -r^2(a^1, a^2)$. However, this obviously won't work for our sequential game, hence the need for regularization ↩
The paper references "Neural Fictitious Self Play" which sounds like a brand of erotica ↩
https://github.com/google-deepmind/open_spiel/tree/master/open_spiel/python/algorithms/rnad ↩