PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

Implement the Polynomial Mixer (PoM) as a replacement for self-attention in transformer models, bringing the cost of token mixing down from quadratic to linear in sequence length. This makes longer sequences cheaper to process, addressing a major bottleneck in current LLMs.

Intermediate | 30 min | 5 steps

The play

Review PoM's Core Mechanism
Understand how PoM replaces self-attention by aggregating input tokens into a compact polynomial representation and then retrieving contextual information from it, as described in the research paper.
Analyze Computational Benefits
Compare PoM's cost, which grows linearly with sequence length, against the quadratic growth of standard self-attention to see where the efficiency advantage matters most, especially at long sequence lengths.
Evaluate Integration Potential
Consider how PoM's design as a 'drop-in replacement' could simplify its integration into existing transformer architectures and frameworks.
Identify Key Use Cases
Pinpoint specific AI applications and scenarios (e.g., long document analysis, advanced LLMs, complex code comprehension) where PoM's efficiency for long contexts would deliver the most significant impact.
Monitor Research & Implementations
Stay updated on the official PoM research, future publications, and any reference implementations or integrations into popular machine learning libraries like Hugging Face Transformers.
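
The linear-versus-quadratic comparison in the steps above can be made concrete with a rough back-of-the-envelope cost model. This is a sketch under stated assumptions, not a measurement of PoM: the function names, the constant factors, and the choice to count only the token-mixing work (ignoring the input and output projections that both approaches share) are illustrative.

```python
def attention_mixing_cost(seq_len: int, dim: int) -> int:
    """Approximate multiply-adds for token mixing in self-attention.

    Assumption: one seq_len x seq_len score matrix (Q @ K^T) plus one
    weighted sum over V, each costing about seq_len * seq_len * dim.
    """
    return 2 * seq_len * seq_len * dim


def linear_mixer_cost(seq_len: int, dim: int, degree: int) -> int:
    """Approximate multiply-adds for a mixer doing O(degree * dim) work per token.

    Assumption: each token contributes `degree` element-wise powers of a
    dim-sized vector to a single pooled summary.
    """
    return seq_len * dim * degree


if __name__ == "__main__":
    dim, degree = 512, 4  # illustrative values only
    for n in (1_024, 8_192, 65_536):
        ratio = attention_mixing_cost(n, dim) / linear_mixer_cost(n, dim, degree)
        print(f"seq_len={n:>6}: attention / linear-mixer cost ratio ~ {ratio:,.0f}x")
```

Under this model the ratio simplifies to 2 * seq_len / degree, so the gap widens in direct proportion to the sequence length. Real speedups depend on hardware, kernels, and the shared projection costs left out here.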

Starter code

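import torch
import torch.nn as nn


class PolynomialMixer(nn.Module):
    """Simplified polynomial token mixer (illustrative sketch, linear in sequence length)."""

    def __init__(self, dim: int, degree: int):
        super().__init__()
        self.dim = dim
        self.degree = degree
        # One learnable per-channel coefficient vector for each power 0..degree.
        self.poly_weights = nn.Parameter(torch.randn(degree + 1, dim))
        self.linear_transform = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, sequence_length, dim)

        # Aggregate: a weighted sum of element-wise powers of the tokens, each
        # averaged over the sequence axis. One pass over the tokens, so the
        # cost grows linearly with sequence_length.
        aggregated_rep = torch.zeros_like(x[:, 0, :])  # (batch_size, dim)
        for d in range(self.degree + 1):
            # Out-of-place add avoids mutating a tensor that autograd tracks.
            aggregated_rep = aggregated_rep + self.poly_weights[d] * (x ** d).mean(dim=1)

        # Retrieve: every token receives the same pooled summary on top of its
        # own linear projection, so the output shape matches the input.
        contextual_x = self.linear_transform(x) + aggregated_rep.unsqueeze(1)

        return contextual_x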
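
To illustrate the "drop-in replacement" idea from the steps above, here is a minimal sketch of a pre-norm transformer-style block whose token mixer is pluggable. The names `MixerBlock`, `MeanPoolMixer`, and `mlp_ratio` are hypothetical and do not come from the paper or any library. The only assumption is that the mixer maps a `(batch, seq_len, dim)` tensor to a tensor of the same shape, which is the interface both self-attention and the `PolynomialMixer` starter code satisfy.

```python
import torch
import torch.nn as nn


class MeanPoolMixer(nn.Module):
    """Hypothetical stand-in mixer: projects each token and adds the sequence mean."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); the mean over seq_len broadcasts back to every token.
        return self.proj(x) + x.mean(dim=1, keepdim=True)


class MixerBlock(nn.Module):
    """Pre-norm residual block; `mixer` is any module mapping (B, N, D) -> (B, N, D)."""

    def __init__(self, dim: int, mixer: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = mixer  # the slot where self-attention would normally sit
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))  # token mixing with residual connection
        x = x + self.mlp(self.norm2(x))    # per-token feed-forward with residual
        return x


if __name__ == "__main__":
    block = MixerBlock(dim=64, mixer=MeanPoolMixer(64))
    tokens = torch.randn(2, 128, 64)
    print(block(tokens).shape)  # same shape as the input
```

Passing `PolynomialMixer(dim=64, degree=2)` from the starter code as `mixer` should work the same way, since it follows the same shape contract. Note that a mixer built on a mean over the whole sequence is not causal, so autoregressive use would need a prefix-only aggregation instead.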

Source

Paper: arxiv.org