S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

S2D2 is a training-free self-speculation method for accelerating decoding in block-diffusion LLMs. Tokens within a block are proposed in parallel and then verified, so the model can emit several tokens per model call and aim for faster-than-autoregressive decoding without any retraining. This makes diffusion models more practical for real-time text generation.

advanced · 1 day · 5 steps

The play

Understand S2D2's Core Paradigm
Understand the S2D2 decoding loop: text is generated block by block. Within each block, parallel denoising proposes several candidate token sequences, and a verification step then keeps the most probable one. The speedup comes from producing many tokens per model call instead of one.
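To see when block-wise decoding can pay off, it helps to count model calls. The sketch below is a simplified cost model, not a measurement: it assumes every forward pass costs the same, and the function and parameter names (`denoise_steps`, `verify_passes`) are hypothetical and not taken from S2D2.

```python
import math


def forward_passes_autoregressive(num_tokens: int) -> int:
    """Token-by-token decoding: one model call per generated token."""
    return num_tokens


def forward_passes_blockwise(num_tokens: int, block_size: int,
                             denoise_steps: int, verify_passes: int = 1) -> int:
    """Block-wise decoding: each block costs its denoising steps plus verification."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * (denoise_steps + verify_passes)


if __name__ == "__main__":
    n = 256
    ar = forward_passes_autoregressive(n)
    bw = forward_passes_blockwise(n, block_size=32, denoise_steps=8)
    print(f"autoregressive: {ar} passes, block-wise: {bw} passes")
```

Under this model, block-wise decoding only needs fewer calls when `denoise_steps + verify_passes` is smaller than `block_size`, which is why reducing the number of denoising steps per block matters.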
Prepare Your Block-Diffusion LLM
Make sure you have access to a pre-trained block-diffusion LLM. If no such implementation is available, you can prototype the idea by adapting a text-diffusion model to generate text in distinct blocks, so that several tokens are predicted at once.
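A minimal sketch of the block bookkeeping this step implies, assuming a masked-diffusion setup in which each new block starts as a run of mask tokens. `MASK_ID`, `block_spans`, and `append_masked_block` are hypothetical names chosen for illustration; a real model defines its own mask token id.

```python
MASK_ID = -1  # hypothetical placeholder; use the real model's mask token id


def block_spans(prompt_len: int, gen_len: int, block_size: int) -> list:
    """Return (start, end) index pairs that cover gen_len new tokens after the prompt."""
    spans = []
    start = prompt_len
    end_total = prompt_len + gen_len
    while start < end_total:
        end = min(start + block_size, end_total)  # the last block may be shorter
        spans.append((start, end))
        start = end
    return spans


def append_masked_block(token_ids, block_len: int, mask_id: int = MASK_ID) -> list:
    """Extend the context with a fully masked block that the model will denoise."""
    return list(token_ids) + [mask_id] * block_len


if __name__ == "__main__":
    print(block_spans(prompt_len=3, gen_len=10, block_size=4))
    print(append_masked_block([5, 6], block_len=3))
```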
Implement Speculative Decoding Logic
Design the speculative decoding mechanism. During each block's denoising steps, generate several diverse candidate token sequences in parallel. This is the speculation step: the model proposes multiple possible continuations for the current block.
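One way to picture the proposal step is to sample every position of a block independently from per-position distributions, several times over. The sketch below does this with plain Python and toy logits. It stands in for a single denoising step only, and `sample_candidates` and its parameters are illustrative assumptions rather than part of S2D2.

```python
import math
import random


def softmax(logits, temperature: float = 1.0) -> list:
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_candidates(block_logits, num_candidates: int = 3,
                      temperature: float = 0.7, seed: int = 0) -> list:
    """Draw candidate blocks; block_logits holds one logit list per masked position."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    vocab = list(range(len(block_logits[0])))
    candidates = []
    for _ in range(num_candidates):
        # Positions are sampled independently of each other, as in one parallel step.
        tokens = [rng.choices(vocab, weights=softmax(pos, temperature))[0]
                  for pos in block_logits]
        candidates.append(tokens)
    return candidates


if __name__ == "__main__":
    toy_logits = [[2.0, 0.1, 0.1], [0.1, 3.0, 0.1]]  # 2 positions, vocabulary of 3
    for cand in sample_candidates(toy_logits, num_candidates=4):
        print(cand)
```

A lower `temperature` makes the candidates more alike, while a higher one increases diversity at the cost of proposing less likely tokens.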
Develop a Verification/Acceptance Mechanism
Create a verification step that scores each speculated token sequence, for example by its probability under the model. This mechanism should select the best candidate from the parallel proposals, preserving output quality while keeping the speed gains.
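The selection rule described here can be sketched as best-of-N by summed log-probability. This is a simplification under stated assumptions: `verifier_probs` is a fixed per-position probability table, whereas a real verifier would condition each position on the tokens before it. The names `sequence_log_prob` and `select_best` are hypothetical.

```python
import math


def sequence_log_prob(tokens, verifier_probs) -> float:
    """Sum the log-probabilities the verifier assigns to each proposed token.

    verifier_probs[i][t] is the verifier's probability of token t at position i.
    """
    # Clamp to avoid log(0) when the verifier gives a token zero probability.
    return sum(math.log(max(verifier_probs[i][t], 1e-12))
               for i, t in enumerate(tokens))


def select_best(candidates, verifier_probs):
    """Return the highest-scoring candidate block together with its score."""
    scores = [sequence_log_prob(c, verifier_probs) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


if __name__ == "__main__":
    probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    block, score = select_best([[0, 1], [1, 1], [2, 0]], probs)
    print(block, round(score, 3))
```

Summing log-probabilities instead of multiplying raw probabilities avoids numerical underflow on longer blocks.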
Integrate into Inference Pipeline
Embed the block-wise generation, speculation, and verification logic into your LLM's inference pipeline. This replaces token-by-token autoregressive generation with the block-based loop.
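Putting the pieces together, the outer loop might look like the sketch below. The `propose` and `score` callables are assumed interfaces that you would back with a real block-diffusion model. The toy versions here only count integers so that the loop can run without any model.

```python
from typing import Callable, List, Optional, Sequence


def decode_blockwise(prompt_ids: Sequence[int], max_new_tokens: int, block_size: int,
                     propose: Callable[[Sequence[int], int], List[List[int]]],
                     score: Callable[[Sequence[int], Sequence[int]], float],
                     eos_id: Optional[int] = None) -> List[int]:
    """Generate block by block: propose candidates, keep the best, then repeat."""
    output = list(prompt_ids)
    target_len = len(output) + max_new_tokens
    while len(output) < target_len:
        block_len = min(block_size, target_len - len(output))
        candidates = propose(output, block_len)
        best = max(candidates, key=lambda c: score(output, c))
        if eos_id is not None and eos_id in best:
            # Keep tokens up to and including end-of-sequence, then stop.
            output.extend(best[: best.index(eos_id) + 1])
            break
        output.extend(best)
    return output


def toy_propose(context, block_len):
    """Toy proposer: three candidates that count upward in steps of 1, 2, and 3."""
    last = context[-1]
    return [[last + k * (i + 1) for i in range(block_len)] for k in (1, 2, 3)]


def toy_score(context, block):
    """Toy verifier: penalize every token that is not its predecessor plus one."""
    prev, total = context[-1], 0.0
    for tok in block:
        total -= abs(tok - (prev + 1))
        prev = tok
    return total


if __name__ == "__main__":
    print(decode_blockwise([1, 2, 3], max_new_tokens=6, block_size=4,
                           propose=toy_propose, score=toy_score))
```

Passing the proposer and verifier as callables keeps the loop independent of any particular model, so the toy functions can later be swapped for real denoising and scoring code.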

Starter code

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

class ConceptualBlockDiffusionLLM:
    def __init__(self, model_name="t5-small", device="cuda" if torch.cuda.is_available() else "cpu"):
        print(f"Loading conceptual Diffusion LLM (using {model_name} as a base for tokenization/embedding)")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Using T5 as a placeholder for a text-generating model to demonstrate block generation
        self.model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
        self.device = device

    def generate_block_speculatively(self, prompt: str, block_size: int = 5, num_speculations: int = 3) -> list[str]:
        """
        Conceptually generates a block of text using a speculative-like approach.
        In a real S2D2, this would involve parallel denoising and a specific verification step
        for a Block-diffusion LLM. Here, we simulate generating multiple candidate continuations
        for a given block, which is the core idea of speculation.
        """
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        # Generate multiple sequences to simulate speculation within a block
        generated_ids = self.model.generate(
            **inputs,
            do_sample=True, # Enable sampling for diverse speculations
            num_return_sequences=num_speculations,
            max_new_tokens=block_size,
            top_k=50, # Control diversity
            top_p=0.95,
            temperature=0.7,
            pad_token_id=self.tokenizer.pad_token_id # T5 defines its own pad token; it pads samples that finish early
        )
        speculations = [self.tokenizer.decode(g, skip_special_tokens=True) for g in generated_ids]
        return speculations

# Example Usage:
if __name__ == "__main__":
    llm = ConceptualBlockDiffusionLLM()
    prompt = "The quick brown fox jumps over"
    print(f"Prompt: '{prompt}'")

    # Simulate S2D2's block generation with multiple speculations
    speculated_blocks = llm.generate_block_speculatively(prompt, block_size=5, num_speculations=3)
    print("\nSpeculated Continuations (conceptual S2D2 block proposals):")
    for i, block in enumerate(speculated_blocks):
        print(f"  Option {i+1}: '{block.strip()}'") # T5 is encoder-decoder, so the output already excludes the prompt

    print("\nIn a real S2D2 implementation, a verification step would select the best option from these speculations.")