What do Language Models Learn and When? The Implicit Curriculum Hypothesis

Understand how Large Language Models (LLMs) acquire specific skills sequentially during pretraining. By mapping this 'implicit curriculum,' you can optimize training, diagnose issues, and build more robust and predictable AI systems.

Advanced | 1-3 hours (for initial setup and simulation) | 5 steps

The play

Identify Target Skills
Define the specific cognitive or linguistic capabilities (e.g., factual recall, logical reasoning, code generation) that are critical for your LLM's intended applications. Break down complex capabilities into measurable sub-skills.
Instrument Training Loop
Integrate specialized evaluation metrics and datasets for each target skill directly into your LLM's pretraining pipeline. Ensure these evaluations can be run periodically without significantly disrupting training.
Monitor Skill Emergence
At regular checkpoints (e.g., every N training steps or after processing X billion tokens), run the skill-specific evaluations on the saved checkpoint, ideally in a separate job so that training itself is not stalled. Record the performance scores for each skill at each checkpoint.
Map the Implicit Curriculum
Analyze the collected data to plot the performance trajectory of each skill over pretraining time. Identify the order in which skills emerge, mature, or plateau. Look for correlations between skill acquisition and training stages.
Optimize Training & Data
Use insights from the implicit curriculum map to refine your pretraining strategy. This might involve adjusting the data mix, introducing an explicit curriculum (e.g., starting with simpler tasks), modifying architectural choices, or tuning hyperparameters to accelerate or improve skill acquisition.
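
Steps 3 and 4 can be sketched in a few lines. This is a minimal sketch, not a definitive method: it assumes scores are recorded as percentages in a {step: {skill: score}} dictionary, and it treats a skill as "emerged" at the first checkpoint where its score reaches a threshold. The function names first_crossing and emergence_order, the 50% threshold, and the toy scores are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

ScoresByStep = Dict[int, Dict[str, float]]


def first_crossing(scores_by_step: ScoresByStep, skill: str, threshold: float) -> Optional[int]:
    """Return the earliest step at which `skill` scores >= threshold, or None."""
    for step in sorted(scores_by_step):
        if scores_by_step[step].get(skill, float("-inf")) >= threshold:
            return step
    return None


def emergence_order(scores_by_step: ScoresByStep, threshold: float = 50.0) -> List[Tuple[str, Optional[int]]]:
    """Rank skills by the step at which they first reach the threshold.

    Skills that never reach the threshold are placed last with a step of None.
    """
    skills = sorted({name for scores in scores_by_step.values() for name in scores})
    crossings = [(name, first_crossing(scores_by_step, name, threshold)) for name in skills]
    return sorted(
        crossings,
        key=lambda item: (item[1] is None, item[1] if item[1] is not None else 0, item[0]),
    )


if __name__ == "__main__":
    # Toy, made-up scores (percent) used only to exercise the functions.
    toy_scores = {
        1000: {"syntactic_parsing": 62.0, "factual_recall": 21.0, "reasoning": 12.0},
        5000: {"syntactic_parsing": 80.0, "factual_recall": 55.0, "reasoning": 18.0},
        10000: {"syntactic_parsing": 85.0, "factual_recall": 70.0, "reasoning": 30.0},
    }
    for name, step in emergence_order(toy_scores):
        print(f"{name}: {'not yet emerged' if step is None else f'emerged by step {step}'}")
```

A fixed threshold is only one possible definition of emergence. With noisy evaluations, a single lucky checkpoint can cross it early, so requiring the score to stay above the threshold for several consecutive checkpoints may give a more stable ordering.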

Starter code

import numpy as np
import json

def evaluate_skill(model_checkpoint_path: str, skill_eval_dataset: str) -> float:
    """
    Placeholder function to simulate evaluating a specific skill.
    In a real scenario, this would load a model, run inference on a
    specialized dataset, and return a specific skill metric (e.g., accuracy on reasoning tasks).
    """
    # Simulated score only: no model is loaded here. A real implementation
    # would load the checkpoint and run inference on the evaluation dataset.
    print(f"Evaluating skill for model at: {model_checkpoint_path} on {skill_eval_dataset}")
    # Parse the step from a filename ending in "_<step>.pt", e.g. "model_step_1000.pt"
    step_num = int(model_checkpoint_path.split('_')[-1].replace('.pt', ''))
    # Interpolate linearly from 20% at step 1000 to 90% at step 50000 (clamped outside that range)
    base_score = np.interp(step_num, [1000, 50000], [20, 90])
    # Add uniform noise in [-5, 5) and return a plain Python float
    return float(base_score + (np.random.rand() - 0.5) * 10)

def track_implicit_curriculum(training_steps: list, model_checkpoints: dict, skill_datasets: dict):
    """
    Simulates tracking the emergence of multiple skills over LLM pretraining steps.
    """
    skill_emergence_data = {}
    for step in training_steps:
        print(f"\n--- Training Step: {step} ---")
        current_model_path = model_checkpoints.get(step, f"model_at_step_{step}.pt")
        
        step_skills = {}
        for skill_name, dataset_path in skill_datasets.items():
            score = evaluate_skill(current_model_path, dataset_path)
            step_skills[skill_name] = score
            print(f"  Skill '{skill_name}' score: {score:.2f}%")
        skill_emergence_data[step] = step_skills
    return skill_emergence_data

if __name__ == "__main__":
    # Example usage:
    total_training_steps = [1000, 5000, 10000, 20000, 50000]
    
    # Simulate model checkpoints available at specific steps
    mock_model_checkpoints = {
        1000: "path/to/model_step_1000.pt",
        5000: "path/to/model_step_5000.pt",
        10000: "path/to/model_step_10000.pt",
        20000: "path/to/model_step_20000.pt",
        50000: "path/to/model_step_50000.pt"
    }

    # Define specific skill evaluation datasets
    mock_skill_datasets = {
        "factual_recall": "data/factual_qa.json",
        "reasoning": "data/logic_puzzles.json",
        "syntactic_parsing": "data/syntax_trees.json"
    }

    print("Starting implicit curriculum tracking simulation...")
    results = track_implicit_curriculum(total_training_steps, mock_model_checkpoints, mock_skill_datasets)
    print("\n--- Simulation Complete ---")
    print(json.dumps(results, indent=2))
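
In the starter code every skill follows the same simulated curve, so the printed results cannot show one skill emerging before another. One way to make the simulation illustrate an ordering is to give each skill its own curve. The sketch below is an assumption-laden variation, not part of the original code: the SKILL_MIDPOINTS values, the simulated_skill_score name, and the logistic shape in log-step space are all hypothetical choices, and they say nothing about when real models acquire these skills.

```python
import math

# Hypothetical midpoints: the training step at which each simulated skill
# is halfway between its starting and final score. Illustrative values only.
SKILL_MIDPOINTS = {
    "syntactic_parsing": 2000,
    "factual_recall": 10000,
    "reasoning": 30000,
}


def simulated_skill_score(step: int, skill_name: str, low: float = 20.0, high: float = 90.0) -> float:
    """Return a noise-free simulated score (percent) for a skill at a training step.

    The score follows a logistic curve in log(step), centred on the skill's
    hypothetical midpoint, rising from `low` towards `high`.
    """
    midpoint = SKILL_MIDPOINTS[skill_name]
    x = math.log(step) - math.log(midpoint)
    fraction = 1.0 / (1.0 + math.exp(-2.0 * x))
    return low + (high - low) * fraction


if __name__ == "__main__":
    for step in [1000, 5000, 10000, 20000, 50000]:
        row = ", ".join(
            f"{name}={simulated_skill_score(step, name):.1f}%" for name in SKILL_MIDPOINTS
        )
        print(f"step {step}: {row}")
```

Swapping a function like this in for the interpolation inside evaluate_skill (and adding noise back if desired) would make skills with smaller midpoints rise earlier in the simulated output.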

Source

Paper: arxiv.org