TIP: Token Importance in On-Policy Distillation

Identify and quantify the importance of individual tokens during On-Policy Distillation (OPD). Scoring tokens this way helps pinpoint where the teacher model provides the strongest learning signal, so student training can be made more efficient and targeted.

intermediate | 30 min | 4 steps

The play

Set Up Your OPD Environment
Prepare your OPD pipeline. You need a pre-trained teacher model (e.g., a large language model) and a smaller student model; the student generates the sequences and the teacher supervises them.
Define Token Importance Metrics
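Choose how to quantify a token's importance. Common metrics include the margin between the teacher's top two logits (a proxy for teacher confidence), the per-token KL divergence between the teacher's and student's output distributions (where the two models disagree), and the per-token cross-entropy against reference token IDs (a loss-contribution signal).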
Implement Importance Calculation Function
Integrate a function into your codebase that computes token importance with your chosen metric. The function takes teacher and student logits of shape (batch_size, seq_len, vocab_size), plus optional ground-truth token IDs of shape (batch_size, seq_len), and returns one importance score per token position.
Calculate Token Importance Scores
During your OPD training loop, pass the teacher's and student's logits for each generated sequence to your importance calculation function. Then inspect the scores: a high logit-difference score marks a position where the teacher is confident, and a high KL score marks a position where the student diverges most from the teacher.
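
One way to inspect the scores is to rank token positions within each sequence. The sketch below is illustrative only: the helper name `top_k_important_positions`, the `attention_mask` argument, and the default `k` are assumptions, not part of the starter code. It masks out padding and returns the `k` highest-scoring positions per sequence.

```python
import torch

def top_k_important_positions(importance, attention_mask, k=3):
    """Return (positions, scores) of the k highest-importance tokens per sequence.

    importance:     (batch_size, seq_len) float scores
    attention_mask: (batch_size, seq_len), 1 for real tokens, 0 for padding
    """
    # Padding positions get -inf so they rank below every real token.
    masked = importance.masked_fill(attention_mask == 0, float("-inf"))
    # k cannot exceed the sequence length.
    k = min(k, masked.size(-1))
    scores, positions = torch.topk(masked, k, dim=-1)
    return positions, scores

# Hand-written scores so the ranking is easy to verify by eye.
importance = torch.tensor([[0.1, 2.0, 0.5, 9.0],
                           [1.5, 0.2, 0.3, 0.0]])
attention_mask = torch.tensor([[1, 1, 1, 0],   # last position is padding
                               [1, 1, 1, 1]])
positions, scores = top_k_important_positions(importance, attention_mask, k=2)
print(positions.tolist())  # [[1, 2], [0, 2]]
```

Note that the 9.0 in the first row is ignored because it sits on a padding position. If a sequence has fewer than `k` real tokens, the remaining slots are filled by padding positions with a score of -inf, so callers may want to filter on the returned scores.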

Starter code

import torch
import torch.nn.functional as F

def calculate_token_importance(teacher_logits, student_logits, ground_truth_ids=None, method='logit_diff'):
    """Return per-token importance scores of shape (batch_size, seq_len).

    teacher_logits, student_logits: (batch_size, seq_len, vocab_size)
    ground_truth_ids: (batch_size, seq_len), required for 'loss_contribution'
    """
    if method == 'logit_diff':
        # Margin between the teacher's top two logits: a large gap means a confident teacher.
        top2 = torch.topk(teacher_logits, k=2, dim=-1).values
        return top2[..., 0] - top2[..., 1]
    if method == 'kl_div':
        # Per-token KL(teacher || student), summed over the vocabulary.
        teacher_probs = F.softmax(teacher_logits, dim=-1)
        student_log_probs = F.log_softmax(student_logits, dim=-1)
        return F.kl_div(student_log_probs, teacher_probs, reduction='none').sum(dim=-1)
    if method == 'loss_contribution':
        if ground_truth_ids is None:
            raise ValueError("method='loss_contribution' requires ground_truth_ids")
        # Per-token cross-entropy of the teacher's logits against the given token IDs.
        batch_size, seq_len, vocab_size = teacher_logits.shape
        loss = F.cross_entropy(teacher_logits.reshape(-1, vocab_size), ground_truth_ids.reshape(-1), reduction='none')
        return loss.view(batch_size, seq_len)
    raise ValueError(f"Unknown method: {method}")

# --- Example Usage ---
# Mock data: batch_size=2, seq_len=3, vocab_size=10
batch_size, seq_len, vocab_size = 2, 3, 10
teacher_logits = torch.randn(batch_size, seq_len, vocab_size)
student_logits = torch.randn(batch_size, seq_len, vocab_size)
ground_truth_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Calculate importance using logit difference
logit_diff_importance = calculate_token_importance(teacher_logits, student_logits, method='logit_diff')
print(f"Logit Difference Importance:\n{logit_diff_importance}")

# Calculate importance using KL divergence
kl_div_importance = calculate_token_importance(teacher_logits, student_logits, method='kl_div')
print(f"\nKL Divergence Importance:\n{kl_div_importance}")

# Calculate importance using loss contribution
loss_contribution_importance = calculate_token_importance(teacher_logits, student_logits, ground_truth_ids, method='loss_contribution')
print(f"\nLoss Contribution Importance:\n{loss_contribution_importance}")
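
Once scores are available, one possible use is to weight the per-token distillation loss so that high-importance positions contribute more. The following is a minimal sketch under stated assumptions, not a prescribed OPD objective: the function name `importance_weighted_kd_loss`, the `eps` parameter, the choice to normalize weights so they sum to 1, and the use of the teacher's top-two logit margin as the importance signal are all illustrative.

```python
import torch
import torch.nn.functional as F

def importance_weighted_kd_loss(teacher_logits, student_logits, importance, eps=1e-8):
    """Importance-weighted average of per-token KL(teacher || student)."""
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # Per-token KL(teacher || student), shape (batch_size, seq_len).
    per_token_kl = F.kl_div(
        student_log_probs, teacher_log_probs, reduction="none", log_target=True
    ).sum(dim=-1)
    # Treat importance as fixed weights: detach so no gradient flows through them,
    # clamp negatives to zero, and normalize so the weights sum to 1.
    weights = importance.detach().clamp(min=0)
    weights = weights / (weights.sum() + eps)
    return (weights * per_token_kl).sum()

torch.manual_seed(0)
teacher_logits = torch.randn(2, 3, 10)
student_logits = torch.randn(2, 3, 10, requires_grad=True)

# Assumed importance signal: margin between the teacher's top two logits.
top2 = torch.topk(teacher_logits, k=2, dim=-1).values
importance = top2[..., 0] - top2[..., 1]

loss = importance_weighted_kd_loss(teacher_logits, student_logits, importance)
loss.backward()  # gradients reach the student logits only
print(f"Weighted distillation loss: {loss.item():.4f}")
```

Detaching the weights keeps them as a fixed per-token emphasis rather than something the student can reduce by changing its own outputs. Whether this weighting actually helps is an empirical question for your model pair and data.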