Optimal last-iterate convergence in matrix games with bandit feedback using the log-barrier

This Action Pack explores what it takes to achieve optimal last-iterate convergence in zero-sum matrix games where agents learn from bandit feedback using log-barrier regularization. It highlights the Ω(t^(-1/4)) lower bound on the exploitability gap that holds in this uncoupled (decentralized) learning setting.

advanced · 30 min · 4 steps

The play

Define Your Zero-Sum Game Matrix
Set up a payoff matrix for a zero-sum game. This matrix represents Player 1's rewards for each action pair, with Player 2's rewards being the negative of Player 1's. This forms the foundation for agent interaction.
Conceptualize Bandit Feedback
Understand that in this setting, agents only observe the reward for the specific actions they chose in a given round, not the full payoff matrix or gradients. This limited information dictates the learning approach.
Incorporate Log-Barrier for Strategy Optimization
Grasp how log-barrier regularization is used to optimize agent strategies. The simplex constraint keeps each strategy a valid probability distribution (non-negative entries that sum to 1); the log-barrier term, -Σ log x_i, adds a penalty that grows without bound as any action's probability approaches 0, so the iterates stay strictly inside the feasible region. This is crucial for stable learning.
Analyze Last-Iterate Convergence Challenges
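Acknowledge the difficulty of achieving last-iterate convergence (where the current strategy itself, rather than the time-average of past strategies, approaches equilibrium) in uncoupled bandit feedback settings. Understand the proven Ω(t^(-1/4)) lower bound on the exploitability gap: it is a fundamental limit on how fast equilibrium can be approached, so after t rounds no learner in this setting can guarantee a gap that shrinks faster than order t^(-1/4).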
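
The bandit-feedback and log-barrier steps above can be sketched in a few lines. This is a minimal illustration of a generic online mirror descent update with the regularizer -sum(log x_i) and an importance-weighted loss estimate; it is not necessarily the exact algorithm analyzed in the source. The helper names, the learning rate `eta`, and the number of rounds are all assumptions chosen for the example.

```python
import numpy as np

def sample_bandit_feedback(payoff, x, y, rng):
    """Sample one action per player; only the realized payoff is revealed."""
    i = rng.choice(len(x), p=x)
    j = rng.choice(len(y), p=y)
    return i, j, payoff[i, j]

def importance_weighted_loss(n_actions, action, loss, prob):
    """Unbiased loss estimate: loss / prob on the played action, 0 elsewhere."""
    estimate = np.zeros(n_actions)
    estimate[action] = loss / prob
    return estimate

def log_barrier_omd_step(x, loss_estimate, eta, iters=100):
    """One mirror descent step with regularizer -sum(log x_i) on the simplex.

    Optimality condition: 1/x_new_i = 1/x_i + eta * loss_i + lam, where lam is
    the multiplier that makes x_new sum to 1. It is found here by bisection.
    """
    a = 1.0 / x + eta * loss_estimate
    b = a - a.min()                # shift so the smallest entry is 0
    lo, hi = 0.0, float(len(x))    # sum(1/(b+mu)) > 1 near mu=0, <= 1 at mu=n
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.sum(1.0 / (b + mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    x_new = 1.0 / (b + hi)
    return x_new / x_new.sum()     # remove residual rounding error

# Illustrative run on rock-paper-scissors (eta and round count are assumptions).
rng = np.random.default_rng(0)
payoff = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
x = np.array([0.5, 0.3, 0.2])      # Player 1 (maximizes the payoff)
y = np.array([1 / 3, 1 / 3, 1 / 3])  # Player 2 (minimizes the payoff)
eta = 0.05
for _ in range(200):
    i, j, r = sample_bandit_feedback(payoff, x, y, rng)
    # Each player updates from its own action and the scalar payoff only.
    x = log_barrier_omd_step(x, importance_weighted_loss(3, i, -r, x[i]), eta)
    y = log_barrier_omd_step(y, importance_weighted_loss(3, j, r, y[j]), eta)
print("Player 1 strategy:", x)
print("Player 2 strategy:", y)
```

Because the update works on 1/x_i, every probability stays strictly positive, which is what keeps the division by `prob` in the loss estimate well defined.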

Starter code

import numpy as np

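# Rock-paper-scissors payoffs for Player 1 (rows: Player 1's action,
# columns: Player 2's action, in the order rock, paper, scissors).
# Player 2's payoff is the negative of each entry.
PAYOFF_MATRIX = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0]
])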

# Player 1's initial mixed strategy (e.g., uniform)
player1_strategy = np.array([1/3, 1/3, 1/3])

# Player 2's initial mixed strategy (e.g., uniform)
player2_strategy = np.array([1/3, 1/3, 1/3])

print("Defined Game Matrix:\n", PAYOFF_MATRIX)
print("Player 1 Strategy:\n", player1_strategy)
print("Player 2 Strategy:\n", player2_strategy)
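
To make the exploitability gap from the last step concrete, the sketch below computes it for a pair of mixed strategies in the starter game. It uses the standard definition for zero-sum matrix games (the payoff of Player 1's best response to Player 2's strategy, minus the payoff Player 2's best response holds Player 1 to); the function name `exploitability_gap` is a hypothetical helper, not something defined by the source.

```python
import numpy as np

def exploitability_gap(payoff, x, y):
    """Gap between what each player could gain by deviating unilaterally.

    payoff[i, j] is Player 1's reward; Player 1 maximizes, Player 2 minimizes.
    The gap is non-negative and equals 0 exactly at a Nash equilibrium.
    """
    best_for_p1 = np.max(payoff @ y)   # Player 1's best pure response to y
    best_for_p2 = np.min(x @ payoff)   # Player 2's best pure response to x
    return best_for_p1 - best_for_p2

payoff = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
uniform = np.array([1 / 3, 1 / 3, 1 / 3])
always_rock = np.array([1.0, 0.0, 0.0])

# Uniform play is the equilibrium of rock-paper-scissors: gap is 0 up to rounding.
print("Gap at uniform play:", exploitability_gap(payoff, uniform, uniform))
# Always playing rock can be exploited by paper: gap is 1.
print("Gap when Player 1 always plays rock:",
      exploitability_gap(payoff, always_rock, uniform))
```

Tracking this quantity for the current strategies at each round, rather than for their running average, is what last-iterate convergence refers to.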