Final project for the Reinforcement Learning course, developed together with Ludovica Bianchi.
This work implements what proposed in Equilibrium Selection for Multi-agent Reinforcement Learning: A Unified Framework, written by Runyu Zhang, Jeff Shamma, Na Li. The paper addresses the problem of equilibrium selection in multi-agent reinforcement learning (MARL).
While many MARL algorithms ensure convergence to some Nash equilibrium (NE), these equilibria can differ significantly in terms of rewards and social welfare. Building on insights from classical game theory, where decentralized learning can select potential-maximizing or Pareto-optimal NEs in normal-form games, we implement a unified framework for equilibrium selection in stochastic games. The framework is modular and adaptable, allowing established learning rules and selection results from normal-form games to be extended to the MARL setting.
We apply this framework to two normal-form games with two players and two stages
For equilibrium selection, we assume that the group of agents follows certain iterative learning rules, which describe how they respond to the reward outcome from the previous action. In particular this is represented through a transition kernel
where
For this learning rule, there is no need for the auxiliary variables,
For this learning rule, the hidden variable
Action dynamics:
Mood dynamics:
Notice: In the case of the Marden Mood learning rule, the term
This identical interest game captures the process of two players collaboratively digging treasure at two different locations 0 and 1. Each location has a shallow level and a deep level, and it requires the players to collaboratively dig at the same location to make progress.
- Location 0 has reward 1 at the shallow level and 0.5 at the deep level.
- Location 1 has reward 0 at the shallow level and 2 at the deep level.
The process can be summarized as a stochastic game. For h = 1, there’s only one state and thus the stage reward is given by a 2 by 2 reward matrix with nonzero reward of value 1 only when a = (0,0). The transition to the second stage follows the following transition rule:
- if both players 1 and 2 choose to dig at location 0 (
$a_1 = a_2 = 0$ ), then the game will transit to state$s = A$ , where they can arrive at the deep level for location 0 and shallow level at location 1; - if
$a_1 = a_2 = 1$ then the game will transit to$s = B$ , where they can arrive at the deep level for location 1 and shallow level at location 0; - if the players fail to agree on a location, ($a = (0,1) or (1,0)$), they make no progress (
$s = O$ ) and the reward will remain the same as stage$h = 1$ .
In this game, there are two strict NEs:
- agreeing twice on location 0, i.e.
$a_{i,h} = 0$ for$h = 1,2$ and$i = 1,2$ ; - agreeing twice on location 1, i.e.
$a_{i,h} = 1$ for$h = 1,2$ and$i = 1,2$ .
Notice that the second NE is also the global optimal policy that gives a total reward 2, while the first only gives reward 1.5.
We are going to show that for log-linear learning, the stochastically stable policy is the global optimal policy, thus it should converge to the second instead of the first NE.
This coordination game models the process of two players choosing whether to hunt a stag or a hare over two stages. Hunting the stag is harder, and thus requires both players to fully cooperate and to keep hunting persistently for two stages to success, but it has a high reward (7.5 allocated evenly to both players). Hunting the hare instead is easier and can be done individually with one stage, but also with lower payoff: the value is 2, eventually to be split if both players decide to catch the hare.
The process can be summarized as a stochastic game:
In this game, there are two strict Nash equilibria:
- Stag NE: both players choose Stag at both stages (
$a_{i,h} = 0$ for$h = 1,2$ and$i = 1,2$ ), yielding the Pareto-dominant outcome with total reward 7.5. - Hare NE: both players choose Hare at both stages (
$a_{i,h} = 1$ for$h = 1,2$ and$i = 1,2$ ), yielding the risk-dominant outcome with total reward 2.
While the Stag equilibrium gives the higher social welfare, the Hare equilibrium is risk-dominant. We are going to show that running log-linear learning in this setting leads to convergence to the Hare equilibrium, while the Marden Mood learning rule favors the Pareto-optimal Stag equilibrium.
We reproduced the results presented in the paper, as shown in the following plots.
Log-linear learning rule
Log-linear learning rule
Marden Mood learning rule
.
├── img/ # Images for documentation
├── notebooks/ # Jupyter notebooks for experimentation
│ ├── Equilibrium_selection_MARL.ipynb # Reproduction of paper's results
├── src/ # Source code for the project
│ ├── game.py # Games implementation
│ ├── learning_rule.py # Learning rules implementation
│ ├── main.py # Entry point for running the experiments
│ └── unified_learning.py # MARL framework implementation
├── slides.pdf # Project presentation slides
├── requirements.txt # List of required Python dependencies
└── README.md # You're here!
git clone https://github.com/silviacalabretta/marl-equilibrium-selection
cd marl-equilibrium-selectionpython -m venv .env
source .env/bin/activatepip install -r requirements.txtYou can run experiments by executing
python -m src.mainand providing the following optional arguments:
--iterations(int): total number of learning steps the simulation will run [default: 1000];--game(str):"treasure"or"staghunt"[default:"treasure"];--learning-rule(str):"loglinear"or"mardenmood"[default:"loglinear"];--rule-coeffs(float): parameters for the chosen learning rule [default: 0.01];--num-runs(int): the number of independent learning simulations to execute (each run executes the learning process for--iterationssteps) [default: 1];
Example:
python -m src.main --iterations 2000 --game staghunt --learning-rule mardenmood --rule-coeffs 0.01 2-
Zhang, R., Shamma, J., & Li, N. (2024). Equilibrium Selection for Multi-agent Reinforcement Learning: A Unified Framework. arXiv preprint, arXiv:2406.08844.
-
Blume, L. E. (1993). The Statistical Mechanics of Strategic Interaction. Games and Economic Behavior, 5(3), 387–424.
-
Marden, J. R., Young, H. P., & Pao, L. Y. (2012). Achieving Pareto Optimality Through Distributed Learning. In Proceedings of the 51st IEEE Conference on Decision and Control (CDC), 7419–7424. https://doi.org/10.1109/CDC.2012.6426834
-
Young, H. P. (1993). The Evolution of Conventions. Econometrica, 61(1), 57–84.
-
Marden, J. R., & Shamma, J. S. (2012). Revisiting Log-linear Learning: Asynchrony, Completeness and Payoff-based Implementation. Games and Economic Behavior, 75(2), 788–808.






