Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Understand and apply the critical conditions and the practical 'recipe' for On-Policy Distillation (OPD) in Large Language Model (LLM) post-training. The aim is distilled LLMs that are more efficient, more robust, and cheaper to train and serve.

Intermediate | 1 hour | 4 steps

The play

Grasp On-Policy Distillation Fundamentals
Familiarize yourself with On-Policy Distillation (OPD) as a post-training technique for Large Language Models. Its goal is to transfer knowledge from a larger teacher model to a smaller student model. "On-policy" means the training sequences are sampled from the student's own current policy and then scored by the teacher, rather than drawn from a fixed dataset.
Identify Critical Success Conditions for OPD
Identify the two conditions that, according to the paper, govern whether OPD succeeds. They let you replace trial-and-error with principled distillation choices. (Note: the specific conditions are detailed in the full research paper and are not reproduced here.)
Implement the Practical OPD 'Recipe'
Apply the provided 'recipe' for effective OPD implementation. This recipe integrates the success conditions identified in step 2, guiding data sampling, loss function design, and training dynamics to optimize the distillation process.
Evaluate Distilled LLM Performance
Measure the impact of your OPD implementation. Assess the student model's performance on relevant benchmarks, comparing it against the teacher model and other distillation methods. Focus on metrics like accuracy, inference speed, and computational cost.
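
The evaluation step above can be sketched as a small comparison harness. This is a minimal illustration, not the paper's evaluation protocol: `TinyLM`, `next_token_accuracy`, `tokens_per_second`, and `parameter_count` are hypothetical names, the models are untrained stand-ins, next-token accuracy on random tokens stands in for a real benchmark score, and forward-pass throughput and parameter count are only rough proxies for inference speed and computational cost.

```python
import time
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical stand-in for an LLM: token ids -> next-token logits."""
    def __init__(self, vocab_size=100, hidden_size=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))  # (batch, seq_len, vocab_size)

@torch.no_grad()
def next_token_accuracy(model, input_ids):
    """Fraction of positions where the argmax prediction equals the next token."""
    logits = model(input_ids[:, :-1])
    predictions = logits.argmax(dim=-1)
    return (predictions == input_ids[:, 1:]).float().mean().item()

@torch.no_grad()
def tokens_per_second(model, input_ids, n_runs=5):
    """Forward-pass throughput (not autoregressive generation speed)."""
    model(input_ids)  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(n_runs):
        model(input_ids)
    elapsed = time.perf_counter() - start
    return n_runs * input_ids.numel() / elapsed

def parameter_count(model):
    """Total number of parameters, a crude proxy for compute cost."""
    return sum(p.numel() for p in model.parameters())

torch.manual_seed(0)
teacher = TinyLM(hidden_size=128)  # assumed "large" model
student = TinyLM(hidden_size=32)   # assumed "small" model
eval_batch = torch.randint(0, 100, (8, 64))  # placeholder for benchmark data

report = {
    name: {
        "accuracy": next_token_accuracy(model, eval_batch),
        "tokens_per_s": tokens_per_second(model, eval_batch),
        "params": parameter_count(model),
    }
    for name, model in [("teacher", teacher), ("student", student)]
}
for name, row in report.items():
    print(name, row)
```

With real models you would swap `eval_batch` for held-out benchmark data, score with the benchmark's own metric, and add rows for students trained with other distillation methods so all candidates are compared on the same inputs.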

Starter code

import torch
import torch.nn as nn
from torch.optim import Adam

# Placeholder for your pre-trained teacher and student models
class DummyModel(nn.Module):
    def __init__(self, vocab_size=1000, hidden_size=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.linear = nn.Linear(hidden_size, vocab_size)
    def forward(self, input_ids):
        # Simulate LLM output: map token ids to logits of shape (batch, seq_len, vocab_size).
        # The logits must depend on the parameters, otherwise loss.backward() has nothing to update.
        return self.linear(self.embed(input_ids))

teacher_model = DummyModel()
student_model = DummyModel()

# Assume you have a DataLoader for your training data
# For demonstration, create dummy data
dummy_inputs = torch.randint(0, 1000, (4, 128)) # batch_size=4, seq_len=128

optimizer = Adam(student_model.parameters(), lr=1e-5)
distillation_loss_fn = nn.KLDivLoss(reduction='batchmean')
temperature = 2.0

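# --- Start of conceptual distillation loop ---
# Note: this single step scores a fixed batch (dummy_inputs), so on its own it is
# ordinary logit distillation. In on-policy distillation the batch would instead be
# sampled from the student itself before the teacher scores it.
# The paper's "recipe" and "two crucial conditions" would guide how you
# prepare data, define the loss, and manage the training dynamics within this loop.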

# 1. Get teacher logits (no_grad for efficiency)
with torch.no_grad():
    teacher_logits = teacher_model(dummy_inputs)
    teacher_probs = nn.functional.softmax(teacher_logits / temperature, dim=-1)

# 2. Get student logits
student_logits = student_model(dummy_inputs)
student_log_probs = nn.functional.log_softmax(student_logits / temperature, dim=-1)

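# 3. Calculate distillation loss.
# KLDivLoss(input=log q, target=p) computes KL(p || q), here KL(teacher || student).
# Flatten to (batch * seq_len, vocab_size) so 'batchmean' averages per token, not per sequence.
vocab_size = student_log_probs.size(-1)
loss = distillation_loss_fn(
    student_log_probs.reshape(-1, vocab_size),
    teacher_probs.reshape(-1, vocab_size),
)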

# 4. Backpropagate and update student model
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(f"Conceptual distillation step complete. Loss: {loss.item()}")
# --- End of conceptual distillation loop ---

# To make this truly runnable, you'd replace DummyModel with actual LLMs (e.g., from Hugging Face)
# and dummy_inputs with actual tokenized text data from a DataLoader.
# The paper's contribution is *how* to optimize this process.
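
The starter code above scores a fixed batch. A minimal sketch of what makes a step on-policy is shown below: the student first samples continuations, then both models are evaluated on those samples. Everything here is an assumption for illustration and not the paper's recipe: `TinyLM`, `sample_from_student`, and `reverse_kl_on_rollout` are hypothetical names, the models are untrained stand-ins, and the per-token reverse KL, KL(student || teacher), is one common choice of on-policy objective that the source does not confirm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Hypothetical stand-in for a causal LM: token ids -> next-token logits."""
    def __init__(self, vocab_size=100, hidden_size=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids):
        hidden, _ = self.rnn(self.embed(input_ids))
        return self.head(hidden)  # (batch, seq_len, vocab_size)

@torch.no_grad()
def sample_from_student(student, prompts, max_new_tokens):
    """Autoregressively sample continuations from the student (the on-policy data)."""
    tokens = prompts
    for _ in range(max_new_tokens):
        next_logits = student(tokens)[:, -1, :]
        next_token = torch.multinomial(F.softmax(next_logits, dim=-1), num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens

def reverse_kl_on_rollout(student, teacher, tokens, prompt_len):
    """Mean per-token KL(student || teacher) over the generated positions."""
    # Logits at position t predict token t + 1, so the last token is not fed in.
    student_logp = F.log_softmax(student(tokens[:, :-1]), dim=-1)
    with torch.no_grad():  # the teacher is frozen
        teacher_logp = F.log_softmax(teacher(tokens[:, :-1]), dim=-1)
    # Full-vocabulary KL at each position: sum_v q(v) * (log q(v) - log p(v))
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    # Keep only the positions that predict student-generated tokens.
    return kl[:, prompt_len - 1:].mean()

torch.manual_seed(0)
teacher = TinyLM()
student = TinyLM()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
prompts = torch.randint(0, 100, (4, 8))  # placeholder for tokenized prompts

# 1. Student generates its own training sequences.
rollout = sample_from_student(student, prompts, max_new_tokens=16)
# 2. Teacher and student are both scored on those sequences.
loss = reverse_kl_on_rollout(student, teacher, rollout, prompt_len=prompts.shape[1])
# 3. Only the student is updated.
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(f"On-policy step complete. Reverse KL: {loss.item():.4f}")
```

In a real training loop, steps 1 to 3 repeat so that each batch is resampled from the updated student. That resampling is what distinguishes this from training on a fixed dataset.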

Source

Paper: arxiv.org