uncategorized, machine-learning, game-theory, optimization, bandit-feedback, ai-agents

Optimal last-iterate convergence in matrix games with bandit feedback using the log-barrier

This Action Pack explores the challenges of achieving optimal last-iterate convergence in zero-sum matrix games where agents learn from bandit feedback and regularize their strategy updates with a log-barrier. It highlights the fundamental Ω(t^(-1/4)) lower bound on the exploitability gap in such decentralized learning settings.

advanced · 30 min · 4 steps
The play
  1. Define Your Zero-Sum Game Matrix
    Set up a payoff matrix for a zero-sum game. This matrix represents Player 1's rewards for each action pair, with Player 2's rewards being the negative of Player 1's. This forms the foundation for agent interaction.
  2. Conceptualize Bandit Feedback
    Understand that in this setting, agents only observe the reward for the specific actions they chose in a given round, not the full payoff matrix or gradients. This limited information dictates the learning approach.
  3. Incorporate Log-Barrier for Strategy Optimization
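    Grasp how a log-barrier regularizer is used to optimize agent strategies. The barrier grows without bound as any action probability approaches zero, so updates stay strictly inside the probability simplex: every probability remains positive, while the sum-to-1 constraint is enforced by the update itself. Staying away from the boundary is crucial for stable learning, since bandit estimators typically divide by action probabilities.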
  4. Analyze Last-Iterate Convergence Challenges
    Acknowledge the difficulty of achieving last-iterate convergence (where the strategy actually played at round t, rather than the average over all rounds, approaches equilibrium) in uncoupled bandit feedback settings. Understand the proven Ω(t^(-1/4)) lower bound on the exploitability gap: no algorithm in this setting can drive the gap to zero faster than order t^(-1/4), a fundamental limit on how fast equilibrium can be approached.
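Steps 2 and 3 can be sketched together as a single round of play. The code below is a minimal illustration under stated assumptions, not the algorithm analyzed in the source: it uses a generic online mirror descent update with the regularizer -Σ log x_i, an importance-weighted loss estimate, and a fixed step size `eta`. The helper names (`bandit_loss_estimate`, `log_barrier_step`, `play_round`), the rescaling of payoffs to losses in [0, 1], and the value `eta=0.05` are all hypothetical choices made for this example.

```python
import numpy as np

# Rock-paper-scissors payoffs for Player 1 (assumed here for illustration).
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)

def bandit_loss_estimate(strategy, action, loss):
    """Importance-weighted estimate: only the played action gets a nonzero entry."""
    estimate = np.zeros_like(strategy)
    estimate[action] = loss / strategy[action]
    return estimate

def log_barrier_step(strategy, loss_estimate, eta):
    """One mirror descent step with regularizer -sum(log x_i) on the simplex.

    The update satisfies 1/x_new[i] = 1/x[i] + eta * loss_estimate[i] + lam,
    where the scalar lam is chosen so that x_new sums to 1.
    """
    d = 1.0 / strategy + eta * loss_estimate
    n = len(strategy)
    # The root lies in [1 - min(d), n - min(d)], where every d + lam is >= 1.
    lo, hi = 1.0 - d.min(), n - d.min()
    for _ in range(60):  # bisection on the decreasing function sum(1 / (d + lam))
        mid = 0.5 * (lo + hi)
        if np.sum(1.0 / (d + mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    new_strategy = 1.0 / (d + hi)
    return new_strategy / new_strategy.sum()  # remove residual rounding error

def play_round(x, y, eta, rng):
    """Both players sample an action, observe one payoff, and update on their own."""
    i = rng.choice(len(x), p=x)
    j = rng.choice(len(y), p=y)
    payoff = A[i, j]              # the only feedback either player receives
    loss1 = (1.0 - payoff) / 2.0  # Player 1's loss, rescaled to [0, 1]
    loss2 = (1.0 + payoff) / 2.0  # Player 2's loss, rescaled to [0, 1]
    x_new = log_barrier_step(x, bandit_loss_estimate(x, i, loss1), eta)
    y_new = log_barrier_step(y, bandit_loss_estimate(y, j, loss2), eta)
    return x_new, y_new

rng = np.random.default_rng(0)
x = np.full(3, 1 / 3)
y = np.full(3, 1 / 3)
for _ in range(500):
    x, y = play_round(x, y, eta=0.05, rng=rng)
print("Player 1 strategy after 500 rounds:", np.round(x, 3))
print("Player 2 strategy after 500 rounds:", np.round(y, 3))
```

Each player only ever touches its own strategy and the single observed payoff, which is what makes the dynamics uncoupled. The barrier shows up in the reciprocal form of the update: because the new probabilities are reciprocals of positive quantities, they can shrink but never reach zero.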
Starter code
import numpy as np

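# Rock-paper-scissors payoffs for Player 1: entry [i, j] is the reward when
# Player 1 plays row i and Player 2 plays column j. Player 2 receives the negative.
PAYOFF_MATRIX = np.array([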
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0]
])

# Player 1's initial mixed strategy (e.g., uniform)
player1_strategy = np.array([1/3, 1/3, 1/3])

# Player 2's initial mixed strategy (e.g., uniform)
player2_strategy = np.array([1/3, 1/3, 1/3])

print("Defined Game Matrix:\n", PAYOFF_MATRIX)
print("Player 1 Strategy:\n", player1_strategy)
print("Player 2 Strategy:\n", player2_strategy)
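To make the exploitability gap from step 4 concrete, the sketch below measures it for a pair of mixed strategies. It assumes the usual duality-gap definition (the payoff Player 1 could secure by best-responding, minus the payoff Player 2 could hold Player 1 to by best-responding); the function name `exploitability` is a hypothetical helper, and the matrix is the same rock-paper-scissors matrix used in the starter code.

```python
import numpy as np

PAYOFF_MATRIX = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0]
])

def exploitability(payoff_matrix, p1_strategy, p2_strategy):
    """Duality gap of a strategy pair; zero exactly at a Nash equilibrium."""
    # Best payoff Player 1 can get with a pure action against Player 2's mix.
    p1_best_response_value = np.max(payoff_matrix @ p2_strategy)
    # Lowest payoff Player 2 can force with a pure action against Player 1's mix.
    p2_best_response_value = np.min(p1_strategy @ payoff_matrix)
    return p1_best_response_value - p2_best_response_value

uniform = np.array([1/3, 1/3, 1/3])
always_rock = np.array([1.0, 0.0, 0.0])

print("Gap at (uniform, uniform):", exploitability(PAYOFF_MATRIX, uniform, uniform))
print("Gap at (rock, uniform):   ", exploitability(PAYOFF_MATRIX, always_rock, uniform))
```

For rock-paper-scissors the uniform profile is the equilibrium, so the gap there is zero up to floating-point rounding, while a Player 1 who always plays the first action can be exploited for a full unit of payoff. Tracking this quantity for the strategies played at round t, rather than for their running average, is what a last-iterate guarantee refers to.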