Gaussian Approximation for Asynchronous Q-learning

This Action Pack applies theoretical insights from Gaussian approximation research to improve asynchronous Q-learning. Learn to implement a polynomial stepsize schedule ($k^{-\omega}$) to enhance training stability and convergence rates for your reinforcement learning agents.

intermediate · 30 min · 5 steps

The play

Understand Asynchronous Q-Learning
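Start from the Q-learning update mechanism: after each transition, the estimate for the visited state-action pair moves toward the bootstrapped target $R + \gamma \max_{a'} Q(s', a')$. In the stochastic-approximation literature, asynchronous Q-learning usually means that only the pair visited along a single trajectory is updated at each step, in contrast to synchronous Q-learning, which updates every pair in each iteration. The term is also used for setups in which multiple agents or threads update a shared Q-table or model. In both cases the updates are noisy and unevenly spread across pairs, which can destabilize training if the stepsize is not managed correctly.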
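One common reading of "asynchronous" in the Q-learning theory literature is that each step updates only the single state-action pair just visited. The sketch below illustrates that reading on a made-up 3-state, 2-action table. The function name `async_q_update` and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def async_q_update(q, s, a, r, s_next, alpha, gamma):
    """Update only the visited pair (s, a); every other entry stays untouched.

    Hypothetical helper for illustration, not taken from the source paper.
    """
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (td_target - q[s, a])
    return q

q = np.zeros((3, 2))  # made-up 3-state, 2-action table
q = async_q_update(q, s=0, a=1, r=1.0, s_next=2, alpha=0.5, gamma=0.9)

changed = np.argwhere(q != 0)  # indices of entries that moved
print("changed entries:", changed.tolist())
print("Q[0, 1] =", q[0, 1])
```

A synchronous variant would instead loop over every (s, a) pair in each iteration, which typically requires a simulator that can sample a transition from any pair on demand.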
Implement Polynomial Stepsize Schedule
Adopt a polynomial stepsize (learning rate) schedule of the form $k^{-\omega}$, where $k$ is the global step count and $\omega$ is the decay exponent. Under this schedule the learning rate decays gradually over time, which is crucial for convergence in stochastic approximation algorithms like Q-learning. The research suggests $\omega \in (0.5, 1]$; this is the range in which the stepsizes sum to infinity (so early errors can still be corrected) while their squares remain summable (so the noise averages out).
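The schedule can be sketched as a small helper, followed by a numerical look at how the partial sums of the stepsizes and of their squares behave for a few exponents. The function name `polynomial_stepsize`, the scaling factor `alpha0`, and the chosen exponents are assumptions made for illustration.

```python
import numpy as np

def polynomial_stepsize(k, omega, alpha0=1.0):
    """Return alpha0 * k**(-omega) for a 1-based step count k.

    Hypothetical helper; alpha0 is an assumed scaling factor.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    return alpha0 / (k ** omega)

# Partial sums over the first 100,000 steps for a few exponents.
ks = np.arange(1, 100_001, dtype=float)
for omega in (0.6, 0.75, 1.0):
    alphas = 1.0 / ks ** omega
    print(
        f"omega={omega}: sum(alpha)={alphas.sum():.1f}, "
        f"sum(alpha^2)={np.sum(alphas ** 2):.3f}"
    )
```

Rerunning with a longer horizon should show the first sum continuing to grow while the second levels off, which is the behavior the $(0.5, 1]$ range is meant to guarantee.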
Integrate Stepsize into Q-Update Rule
Modify your Q-learning update rule to use the dynamically calculated polynomial stepsize. Replace the fixed learning rate (alpha) with `current_learning_rate = initial_alpha / (k**omega)` in your Q-table update equation: `Q(s,a) = Q(s,a) + current_learning_rate * [R + gamma * max_a' Q(s',a') - Q(s,a)]`, where the maximum is taken over the actions `a'` available in the next state `s'`.
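A worked single update may make the arithmetic concrete. The sketch below assumes k = 16, omega = 0.75, gamma = 0.99, and made-up values for the reward and the next-state maximum. The function name `polynomial_q_update` is hypothetical.

```python
def polynomial_q_update(q_sa, reward, max_next_q, k,
                        omega=0.75, initial_alpha=1.0, gamma=0.99):
    """One Q-learning update with stepsize initial_alpha / k**omega.

    Hypothetical helper; returns the new Q-value and the stepsize used.
    """
    alpha_k = initial_alpha / (k ** omega)
    td_error = reward + gamma * max_next_q - q_sa
    return q_sa + alpha_k * td_error, alpha_k

# Made-up inputs: Q(s,a) = 0, R = 1, max_a' Q(s',a') = 2, step k = 16.
new_q, alpha_k = polynomial_q_update(q_sa=0.0, reward=1.0, max_next_q=2.0, k=16)

# 16 ** 0.75 = 8, so alpha_k = 1/8 = 0.125.
# TD error = 1 + 0.99 * 2 - 0 = 2.98, so the new value is 0.125 * 2.98 = 0.3725.
print(f"alpha_k = {alpha_k:.3f}, new Q = {new_q:.4f}")
```

The same transition at k = 1 would use a stepsize of 1 and overwrite the old estimate with the target entirely, which is why early updates dominate under this schedule.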
Monitor Learning Stability and Performance
Run your asynchronous Q-learning agent with the polynomial stepsize. Monitor key metrics such as average reward per episode, the size of Q-value changes between checkpoints, and whether the greedy policy stops changing. Compare these against a run with a fixed learning rate to check whether the decaying schedule gives smoother training and more stable final policies in your setting.
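Two of these metrics can be sketched with small helpers: a trailing moving average of episode rewards and the largest entry-wise change between two Q-table snapshots. The helper names and the sample data are made up for illustration and are not part of the starter code.

```python
import numpy as np

def moving_average(values, window):
    """Trailing mean over `window` entries; yields len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return np.convolve(values, np.ones(window) / window, mode="valid")

def max_q_change(q_before, q_after):
    """Largest absolute entry-wise change between two Q-table snapshots."""
    diff = np.asarray(q_after, dtype=float) - np.asarray(q_before, dtype=float)
    return float(np.max(np.abs(diff)))

# Made-up data standing in for values logged during training.
episode_rewards = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
smoothed = moving_average(episode_rewards, window=3)

q_old = np.zeros((2, 2))
q_new = np.array([[0.0, 0.5], [-0.25, 0.0]])
delta = max_q_change(q_old, q_new)

print("smoothed rewards:", smoothed)
print("max Q change:", delta)
```

In a training loop, one option is to copy the Q-table at the start of each episode with `q_table.copy()` and log `max_q_change` at the end, so that shrinking values indicate the estimates are settling.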
Tune the Omega ($\omega$) Parameter
Experiment with different values of $\omega$ within the recommended range $(0.5, 1]$. A higher $\omega$ makes the stepsize decay faster, which damps noise sooner but risks premature stagnation because updates become too small to correct the remaining error. A lower $\omega$ makes the stepsize decay more slowly, so the estimates keep adapting for longer but stay noisier. Note that $\omega$ controls the size of the updates, not exploration, which is governed by epsilon. Fine-tune $\omega$ for your specific environment and task.
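A sweep over a few exponents can be sketched on a toy problem whose optimal values are known in closed form. The 10-state chain below, the step budget, the seed, and the function name `run_chain_q_learning` are all assumptions made for illustration. The error is measured on state values (the maximum over actions) because rarely tried actions receive few updates.

```python
import numpy as np

def run_chain_q_learning(omega, num_steps=20_000, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 10-state chain.

    Returns the largest absolute error of the learned state values.
    """
    rng = np.random.default_rng(seed)
    num_states, num_actions = 10, 4
    terminal = num_states - 1
    q = np.zeros((num_states, num_actions))

    # Reward 1 on entering the terminal state, 0 otherwise, so the optimal
    # value of non-terminal state s is gamma ** (terminal - 1 - s).
    v_star = np.zeros(num_states)
    for s in range(terminal):
        v_star[s] = gamma ** (terminal - 1 - s)

    state = rng.integers(terminal)  # random non-terminal start
    for k in range(1, num_steps + 1):
        alpha = 1.0 / (k ** omega)  # single global step counter
        if rng.random() < epsilon:
            action = rng.integers(num_actions)
        else:
            action = int(np.argmax(q[state]))
        next_state = state + 1  # the action does not affect the transition
        done = next_state == terminal
        reward = 1.0 if done else 0.0
        # No bootstrapping from the terminal state.
        target = reward if done else reward + gamma * np.max(q[next_state])
        q[state, action] += alpha * (target - q[state, action])
        state = rng.integers(terminal) if done else next_state

    return float(np.max(np.abs(q.max(axis=1) - v_star)))

errors = {omega: run_chain_q_learning(omega) for omega in (0.6, 0.75, 0.9, 1.0)}
for omega, err in errors.items():
    print(f"omega={omega}: max |V - V*| = {err:.4f}")
```

The sketch uses one global counter for the stepsize. Counting visits per state-action pair is another common choice and may behave differently, so results from this toy sweep should be treated as a starting point rather than a recommendation.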

Starter code

import numpy as np

# Dummy Q-table and environment parameters
num_states = 10
num_actions = 4
q_table = np.zeros((num_states, num_actions))
learning_rate_initial = 1.0 # Initial learning rate scaling factor
omega = 0.75 # Recommended range (0.5, 1] from research
gamma = 0.99 # Discount factor
epsilon = 0.1 # Exploration rate

# Simulation parameters
num_episodes = 1000
steps_per_episode = 100

# Epsilon-greedy action selection
def get_action(state, q_table, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions) # Explore
    else:
        return np.argmax(q_table[state, :]) # Exploit

def get_next_state_reward(state, action):
    # Dummy environment: the action is ignored, the agent always moves to the
    # next state, and the reward is 1 on entering the last state (9), else 0
    next_state = (state + 1) % num_states
    reward = 1 if next_state == num_states - 1 else 0
    return next_state, reward

print("Starting Q-learning with polynomial stepsize...")

global_step_counter = 0 # To calculate k for k^-omega

for episode in range(num_episodes):
    current_state = np.random.randint(num_states - 1) # Start from a random non-terminal state
    
    for step_in_episode in range(steps_per_episode):
        global_step_counter += 1
        
        # Polynomial learning rate: alpha_k = learning_rate_initial * k^(-omega).
        # The counter is incremented before use, so k >= 1 and the division is safe.
        k = global_step_counter
        current_learning_rate = learning_rate_initial / (k**omega)
        
        action = get_action(current_state, q_table, epsilon)
        next_state, reward = get_next_state_reward(current_state, action)
        
        # Q-learning update rule
        old_q_value = q_table[current_state, action]
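        # The terminal state has no future value, so do not bootstrap from it
        max_next_q = 0.0 if next_state == num_states - 1 else np.max(q_table[next_state, :])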
        
        new_q_value = old_q_value + current_learning_rate * (reward + gamma * max_next_q - old_q_value)
        q_table[current_state, action] = new_q_value
        
        current_state = next_state
        
        if current_state == num_states - 1: # Reached terminal state
            break

print("Q-learning finished. Final Q-table (first 5 states, all actions):\n", q_table[:5, :])

Source

Paper: arxiv.org