PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

The Polynomial Mixer (PoM) replaces self-attention in transformer models with a token-mixing step whose cost grows linearly, rather than quadratically, with sequence length. It aggregates the input tokens into a compact polynomial representation, which makes longer sequences cheaper to process in LLMs.

intermediate | 1 hour | 5 steps

The play

Understand PoM's Core Principles
Learn how PoM works: it aggregates the input tokens into a compact polynomial representation that is then used for contextual retrieval. Token mixing therefore runs in time linear in sequence length, which avoids the quadratic scaling of self-attention.
Identify Self-Attention Layers
Locate existing self-attention modules (e.g., `nn.MultiheadAttention` in PyTorch or similar implementations) within your current transformer architecture that you intend to replace.
Acquire or Implement PoM Module
Obtain a `PolynomialMixer` implementation. This might involve sourcing it from an official research repository, a community-contributed library, or developing a custom module based on the original paper's specifications.
Replace Attention with PoM
Substitute the identified self-attention layers within your transformer blocks with the `PolynomialMixer` module, ensuring that the input and output tensor shapes are compatible with the surrounding architecture.
Train and Evaluate Modified Model
Fine-tune or retrain your model with the newly integrated PoM layers. Then evaluate it on relevant downstream tasks, checking both the efficiency gains and whether the model maintains or improves its contextual understanding relative to the attention baseline.
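
The "identify" and "replace" steps above can be sketched with standard PyTorch module traversal. This is a minimal sketch under stated assumptions, not a definitive recipe: `PlaceholderMixer`, `find_attention_layers`, `swap_attention_layers`, and `TinyBlock` are hypothetical names introduced here for illustration, and the placeholder does no real token mixing. It assumes your blocks hold `nn.MultiheadAttention` as a named submodule and unpack its tuple return value.

```python
import torch
import torch.nn as nn


class PlaceholderMixer(nn.Module):
    """Hypothetical stand-in for a real PoM module (does no token mixing)."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, query, key, value, **kwargs):
        # Return a tuple like nn.MultiheadAttention so call sites keep working.
        return self.proj(query), None


def find_attention_layers(model):
    """Return the dotted names of all nn.MultiheadAttention submodules."""
    return [
        name
        for name, module in model.named_modules()
        if name and isinstance(module, nn.MultiheadAttention)
    ]


def swap_attention_layers(model, make_mixer):
    """Replace each attention submodule with make_mixer(embed_dim, num_heads)."""
    replaced = find_attention_layers(model)
    for name in replaced:
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        old = getattr(parent, child_name)
        setattr(parent, child_name, make_mixer(old.embed_dim, old.num_heads))
    return replaced


class TinyBlock(nn.Module):
    """Minimal block that calls its attention layer the usual way."""

    def __init__(self, embed_dim=32, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        mixed, _ = self.attn(x, x, x, need_weights=False)
        return self.norm(x + mixed)


model = nn.Sequential(TinyBlock(), TinyBlock())
replaced = swap_attention_layers(model, PlaceholderMixer)
print(replaced)  # ['0.attn', '1.attn']
out = model(torch.randn(10, 2, 32))  # (seq_len, batch_size, embed_dim)
print(out.shape)
```

Built-in layers such as `nn.TransformerEncoderLayer` pass extra arguments to their attention module and may take internal fast paths, so a swap like this is easiest to reason about in blocks you define yourself.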

Starter code

import torch
import torch.nn as nn

# Dummy PolynomialMixer for structural replacement demonstration.
# In a real scenario, this class would contain the actual PoM logic
# as described in the research paper.
class DummyPolynomialMixer(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.embed_dim = embed_dim
        # Unused by this dummy; kept so the constructor mirrors nn.MultiheadAttention.
        self.num_heads = num_heads
        # A simple linear projection to maintain shape for demonstration.
        # Real PoM involves polynomial functions and aggregation mechanisms.
        self.linear_proj = nn.Linear(embed_dim, embed_dim)

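    def forward(self, query, key, value, key_padding_mask=None, need_weights=False, attn_mask=None):
        # For demonstration, only the query passes through a linear layer, so
        # no information moves between tokens. The actual PoM mechanism would
        # aggregate information from key/value using polynomial functions.
        # key_padding_mask and attn_mask are accepted for signature parity but ignored.
        x = query  # Assumes query, key, and value are the same tensor (self-mixing)
        output = self.linear_proj(x)
        if need_weights:
            return output, None  # PoM typically doesn't output attention weights
        # Note: nn.MultiheadAttention always returns a tuple (output, weights);
        # this dummy returns a bare tensor here, which the block below relies on.
        return output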

class TransformerBlockWithPoM(nn.Module):
    def __init__(self, embed_dim, num_heads, dim_feedforward, dropout=0.1):
        super().__init__()
        # Replace nn.MultiheadAttention with our DummyPolynomialMixer
        self.mixer = DummyPolynomialMixer(embed_dim, num_heads)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.dropout1 = nn.Dropout(dropout)

        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, dim_feedforward),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_feedforward, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x):
        # Input x shape: (seq_len, batch_size, embed_dim)
        # Apply PoM replacement
        # DummyPolynomialMixer expects query, key, value similar to MultiheadAttention
        mixer_out = self.mixer(x, x, x) # Pass x three times as q, k, v for self-mixing
        x = x + self.dropout1(mixer_out)
        x = self.norm1(x)

        # Apply Feed-Forward Network
        ffn_out = self.ffn(x)
        x = x + self.dropout2(ffn_out)
        x = self.norm2(x)
        return x

# Example Usage:
if __name__ == "__main__":
    seq_len = 100
    batch_size = 4
    embed_dim = 512
    num_heads = 8
    dim_feedforward = 2048

    # Input tensor (e.g., from an embedding layer or previous transformer block)
    input_tensor = torch.randn(seq_len, batch_size, embed_dim)

    # Initialize a transformer block with our PoM placeholder
    pom_block = TransformerBlockWithPoM(embed_dim, num_heads, dim_feedforward)

    # Pass input through the block
    output_tensor = pom_block(input_tensor)

    print(f"Input shape: {input_tensor.shape}")
    print(f"Output shape: {output_tensor.shape}")
    # Verify output shape is consistent with input
    assert input_tensor.shape == output_tensor.shape
    print("Dummy PoM block ran successfully, demonstrating structural replacement.")
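
The dummy above only preserves shapes. To make the idea of aggregating tokens into a compact polynomial representation more concrete, here is a simplified sketch of a linear-time mixer. It is an illustration under loud assumptions, not the paper's formulation: the class name `SimplePolyMixer`, the `degree` and `hidden_dim` parameters, the tanh bounding, the mean pooling, and the sigmoid gating are all choices made here for demonstration.

```python
import torch
import torch.nn as nn


class SimplePolyMixer(nn.Module):
    """Illustrative linear-time mixer (an assumption, not the official PoM)."""

    def __init__(self, embed_dim, degree=2, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim or embed_dim
        self.degree = degree
        self.feature_proj = nn.Linear(embed_dim, hidden_dim)
        self.gate_proj = nn.Linear(embed_dim, degree * hidden_dim)
        self.out_proj = nn.Linear(degree * hidden_dim, embed_dim)

    def forward(self, x):
        # x: (seq_len, batch_size, embed_dim)
        h = torch.tanh(self.feature_proj(x))  # bounded, so powers stay stable
        # Elementwise powers h, h^2, ..., h^degree, concatenated on the feature axis.
        powers = torch.cat([h ** p for p in range(1, self.degree + 1)], dim=-1)
        # One pass over the sequence builds a fixed-size summary: O(seq_len).
        state = powers.mean(dim=0, keepdim=True)  # (1, batch, degree * hidden)
        # Each token gates the shared summary to pick what it needs: O(seq_len).
        gate = torch.sigmoid(self.gate_proj(x))  # (seq, batch, degree * hidden)
        return self.out_proj(gate * state)


torch.manual_seed(0)
mixer = SimplePolyMixer(embed_dim=64, degree=3)
x = torch.randn(16, 2, 64)
y = mixer(x)
print(y.shape)

# Perturbing token 0 changes the output at token 5, so tokens do interact.
x_perturbed = x.clone()
x_perturbed[0] += 1.0
y_perturbed = mixer(x_perturbed)
print(torch.allclose(y[5], y_perturbed[5]))
```

Because the summary is a mean over the whole sequence, this sketch is not causal. An autoregressive model would need a causal variant (for example, a running mean over preceding tokens), which is outside the scope of this sketch.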