Article
llm, machine-learning, knowledge-distillation, nlp, token-importance
TIP: Token Importance in On-Policy Distillation
Identify and quantify the importance of individual tokens during On-Policy Knowledge Distillation (OPD). This helps pinpoint high-signal learning cues from a teacher model, allowing for more efficient and targeted student model training.
intermediate · 30 min · 4 steps
The play
- Set Up Your OPD Environment: Prepare your OPD pipeline. This requires a pre-trained teacher model (e.g., a large LLM) and a student model (e.g., a smaller LLM) that generates its own sequences, which the teacher then supervises.
- Define Token Importance Metrics: Understand the different ways to quantify how 'important' a token is. Common metrics include the teacher logit difference (the gap between the teacher's top two logits, a proxy for confidence), the student-teacher distribution discrepancy (KL divergence), and the per-token contribution to the distillation loss.
- Implement Importance Calculation Function: Integrate a function into your codebase that computes token importance from your chosen metric. The function takes teacher and student logits (and optionally ground truth IDs) and returns one importance score per token position.
- Calculate Token Importance Scores: During your OPD training loop, pass the teacher's and student's logits for each generated sequence to your importance calculation function. Analyze the resulting scores to see which tokens are most critical for learning and where the student struggles.
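The four steps above can be sketched end to end with toy models. This is a minimal illustration under stated assumptions, not a reference implementation: `TinyLM`, `sample_from_student`, and the toy sizes are hypothetical stand-ins for a real teacher and student, per-token KL(teacher || student) is only one of the possible importance metrics, and weighting the loss by importance is one possible way to use the scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, HIDDEN = 50, 16  # hypothetical toy sizes


class TinyLM(nn.Module):
    """Hypothetical stand-in for a causal LM that emits next-token logits."""

    def __init__(self, vocab, hidden):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, ids):
        out, _ = self.rnn(self.embed(ids))
        return self.head(out)  # (batch, seq_len, vocab)


@torch.no_grad()
def sample_from_student(student, prompt_ids, new_tokens):
    """On-policy rollout: the student samples its own continuation."""
    ids = prompt_ids
    for _ in range(new_tokens):
        next_logits = student(ids)[:, -1, :]
        next_id = torch.multinomial(F.softmax(next_logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids


# Step 1: a teacher and a student (randomly initialized here for illustration)
teacher, student = TinyLM(VOCAB, HIDDEN), TinyLM(VOCAB, HIDDEN)
teacher.eval()

prompt = torch.randint(0, VOCAB, (2, 4))
rollout = sample_from_student(student, prompt, new_tokens=6)  # (2, 10)

# Both models score the student-generated sequence; only the student gets gradients.
with torch.no_grad():
    teacher_logits = teacher(rollout)
student_logits = student(rollout)

# Steps 2-3: per-position KL(teacher || student) as the importance metric
teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
student_log_probs = F.log_softmax(student_logits, dim=-1)
per_token_kl = (teacher_log_probs.exp() * (teacher_log_probs - student_log_probs)).sum(dim=-1)

# Step 4: detach the scores so they act as fixed weights, then use them
# to emphasize high-divergence positions in the loss (an assumed usage).
importance = per_token_kl.detach()
weights = importance / importance.sum().clamp(min=1e-8)
loss = (weights * per_token_kl).sum()
loss.backward()

print(importance.shape)  # one score per token position
```

Detaching the scores keeps the weighting itself out of the gradient, so the student is only pushed through the KL term at each position.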
Starter code
import torch
import torch.nn.functional as F


def calculate_token_importance(teacher_logits, student_logits, ground_truth_ids=None, method='logit_diff'):
    """Return per-token importance scores of shape (batch_size, seq_len).

    Both logit tensors have shape (batch_size, seq_len, vocab_size).
    """
    if method == 'logit_diff':
        # Teacher confidence: margin between its top-1 and top-2 logits.
        top2 = torch.topk(teacher_logits, k=2, dim=-1).values
        return top2[..., 0] - top2[..., 1]
    elif method == 'kl_div':
        # KL(teacher || student) at each position. F.kl_div expects the input
        # as log-probabilities and the target as probabilities.
        teacher_probs = F.softmax(teacher_logits, dim=-1)
        student_log_probs = F.log_softmax(student_logits, dim=-1)
        return F.kl_div(student_log_probs, teacher_probs, reduction='none').sum(dim=-1)
    elif method == 'loss_contribution' and ground_truth_ids is not None:
        # Per-token cross-entropy of the teacher's predictions against the
        # reference tokens (high values mark tokens the teacher finds surprising).
        batch_size, seq_len, vocab_size = teacher_logits.shape
        loss = F.cross_entropy(teacher_logits.reshape(-1, vocab_size), ground_truth_ids.reshape(-1), reduction='none')
        return loss.view(batch_size, seq_len)
    else:
        raise ValueError(f"Unknown method: {method} or missing ground_truth_ids for loss_contribution")
# --- Example Usage ---
# Mock data: batch_size=2, seq_len=3, vocab_size=10
batch_size, seq_len, vocab_size = 2, 3, 10
teacher_logits = torch.randn(batch_size, seq_len, vocab_size)
student_logits = torch.randn(batch_size, seq_len, vocab_size)
ground_truth_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
# Calculate importance using logit difference
logit_diff_importance = calculate_token_importance(teacher_logits, student_logits, method='logit_diff')
print(f"Logit Difference Importance:\n{logit_diff_importance}")
# Calculate importance using KL divergence
kl_div_importance = calculate_token_importance(teacher_logits, student_logits, method='kl_div')
print(f"\nKL Divergence Importance:\n{kl_div_importance}")
# Calculate importance using loss contribution
loss_contribution_importance = calculate_token_importance(teacher_logits, student_logits, ground_truth_ids, method='loss_contribution')
print(f"\nLoss Contribution Importance:\n{loss_contribution_importance}")
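Once scores are available, one way to make training more targeted is to keep only the highest-importance tokens in each sequence when averaging the loss. The sketch below assumes this selection strategy; `topk_token_mask`, `keep_fraction`, `valid_mask`, and the hand-written tensors are hypothetical names and values chosen for illustration, and padding positions are excluded via the mask.

```python
import torch


def topk_token_mask(importance, valid_mask, keep_fraction=0.5):
    """Boolean mask that keeps the highest-importance valid tokens per sequence.

    importance: (batch, seq_len) float scores
    valid_mask: (batch, seq_len) bool, False at padding positions
    """
    # Padding can never be selected: push its score to -inf.
    scores = importance.masked_fill(~valid_mask, float("-inf"))
    # Number of tokens to keep per sequence (at least one).
    n_valid = valid_mask.sum(dim=-1)
    k = (n_valid.float() * keep_fraction).ceil().long().clamp(min=1)
    # argsort of argsort gives each position's rank (0 = most important).
    ranks = scores.argsort(dim=-1, descending=True).argsort(dim=-1)
    return (ranks < k.unsqueeze(-1)) & valid_mask


# Hand-written example values (not real model outputs)
importance = torch.tensor([[0.1, 2.0, 0.5, 1.5],
                           [3.0, 0.2, 0.0, 0.0]])
valid_mask = torch.tensor([[True, True, True, True],
                           [True, True, False, False]])
per_token_loss = torch.tensor([[0.4, 1.2, 0.3, 0.9],
                               [2.0, 0.1, 0.0, 0.0]])

keep = topk_token_mask(importance, valid_mask, keep_fraction=0.5)
# Average the loss over the selected tokens only.
focused_loss = (per_token_loss * keep).sum() / keep.sum().clamp(min=1)

print(keep)
print(focused_loss)
```

A hard top-k mask discards the remaining tokens entirely; a softer alternative is to use normalized importance scores as per-token weights instead.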