TIP: Token Importance in On-Policy Distillation

Improve On-Policy Knowledge Distillation (OPD) by identifying and prioritizing the token positions where teacher supervision is most valuable. Concentrating the distillation loss on those positions aims to get more student quality out of each unit of training compute, which in turn can lower training cost and shorten the path to deployment.

Advanced · 1-3 days · 5 steps

The play

Understand On-Policy Distillation (OPD) Basics
Familiarize yourself with the core mechanics of On-Policy Knowledge Distillation, where a student model learns from its own generated data, supervised by a teacher model's token-level guidance. Ensure your current distillation pipeline is functional.
Recognize Token Importance Variance
Acknowledge that not all token positions contribute equally to effective student learning. The goal is to identify which specific tokens within the teacher's guidance provide the most valuable learning signals.
Implement Token Importance Scoring
Develop or integrate a mechanism to score the importance of individual tokens during teacher supervision. This could involve analyzing teacher attention weights, gradient contributions, or other heuristics that indicate a token's impact on the teacher's output or the student's learning trajectory.
Apply Selective Supervision or Weighted Loss
Based on the token importance scores, modify your distillation loss function. You can implement selective supervision (e.g., only supervise tokens above a certain importance threshold) or dynamic weighting of token losses (e.g., assign higher weights to more important tokens).
Evaluate Impact on Student Model
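Train your student model using the refined distillation strategy. Evaluate it on downstream tasks, tracking task accuracy alongside training cost (for example, the number of steps or supervised tokens needed to reach a target score). Student size and inference latency are set by the architecture rather than by the loss, so hold them fixed across runs. Compare against a baseline student trained with uniform token supervision to quantify the benefit.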
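Step 3 leaves the scoring mechanism open. As one illustration of the "other heuristics" option, the sketch below scores each token by how strongly the teacher and student distributions disagree at that position. This is an assumption made for illustration, not the method from the source paper; the function name disagreement_importance and the per-sequence min-max scaling are hypothetical choices.

```python
import torch
import torch.nn.functional as F


def disagreement_importance(student_logits, teacher_logits, eps=1e-8):
    """Hypothetical heuristic: score tokens by teacher-student KL divergence.

    Inputs have shape (batch_size, seq_len, vocab_size). Returns scores of
    shape (batch_size, seq_len), min-max scaled to [0, 1] within each sequence.
    """
    with torch.no_grad():  # scores act as weights, so no gradient flows through them
        student_log_probs = F.log_softmax(student_logits, dim=-1)
        teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
        # KL(teacher || student) at each position, summed over the vocabulary
        kl = (teacher_log_probs.exp() * (teacher_log_probs - student_log_probs)).sum(dim=-1)
        lo = kl.min(dim=-1, keepdim=True).values
        hi = kl.max(dim=-1, keepdim=True).values
        return (kl - lo) / (hi - lo + eps)


torch.manual_seed(0)
student_logits = torch.randn(2, 5, 1000)
teacher_logits = torch.randn(2, 5, 1000)
scores = disagreement_importance(student_logits, teacher_logits)
print(scores.shape)  # torch.Size([2, 5])
```

Scores produced this way have the (batch_size, seq_len) shape and 0-to-1 range that the starter code expects for its importance_scores argument. Whether disagreement is a good proxy for importance is something to check in the evaluation step rather than assume.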

Starter code

import torch
import torch.nn.functional as F

# Example: Mock logits and importance scores
# student_logits: (batch_size, seq_len, vocab_size)
# teacher_logits: (batch_size, seq_len, vocab_size)
# importance_scores: (batch_size, seq_len) - values between 0 and 1

batch_size, seq_len, vocab_size = 2, 5, 1000
student_logits_example = torch.randn(batch_size, seq_len, vocab_size)
teacher_logits_example = torch.randn(batch_size, seq_len, vocab_size)
importance_scores_example = torch.rand(batch_size, seq_len)

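def weighted_distillation_loss(student_logits, teacher_logits, importance_scores, temperature=1.0):
    # F.kl_div expects log-probabilities as input and probabilities as target
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Per-token KL(teacher || student), summed over the vocabulary: (batch_size, seq_len)
    loss_per_token = F.kl_div(student_log_probs, teacher_probs, reduction='none').sum(dim=-1)
    # Weighted average over tokens; dividing by the total weight keeps the loss
    # scale comparable when a mask or threshold zeroes out most positions
    total_weight = importance_scores.sum().clamp_min(1e-8)
    return (loss_per_token * importance_scores).sum() / total_weight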

# Calculate the weighted loss
loss = weighted_distillation_loss(student_logits_example, teacher_logits_example, importance_scores_example)
print(f"Calculated weighted distillation loss: {loss.item():.4f}")

# For selective supervision, you might filter tokens:
# threshold = 0.7
# masked_importance = (importance_scores_example > threshold).float()
# selective_loss = weighted_distillation_loss(student_logits_example, teacher_logits_example, masked_importance)
# print(f"Calculated selective distillation loss: {selective_loss.item():.4f}")
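A fixed threshold such as 0.7 can select very different numbers of tokens from one batch to the next, depending on how the scores are distributed. As an alternative sketch for the selective supervision option in step 4, the helper below keeps a fixed fraction of the highest-scoring tokens in each sequence. The name top_fraction_mask and the default keep_fraction of 0.3 are hypothetical choices, not values taken from the source.

```python
import torch


def top_fraction_mask(importance_scores, keep_fraction=0.3):
    """Keep the highest-scoring tokens in each sequence (hypothetical helper).

    importance_scores: (batch_size, seq_len). Returns a float mask of the same
    shape with 1.0 at kept positions and 0.0 elsewhere.
    """
    seq_len = importance_scores.size(-1)
    k = max(1, int(round(keep_fraction * seq_len)))  # always keep at least one token
    top_idx = importance_scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(importance_scores)
    mask.scatter_(-1, top_idx, 1.0)
    return mask


torch.manual_seed(0)
scores = torch.rand(2, 10)
mask = top_fraction_mask(scores, keep_fraction=0.3)
print(mask.sum(dim=-1))  # tensor([3., 3.])
```

The resulting mask can be passed as the importance_scores argument of weighted_distillation_loss, in the same way as the thresholded mask in the commented example.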

Source

Paper: arxiv.org