AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding

AdaptToken uses an entropy-based mechanism to select the most relevant tokens from long videos for Multi-modal Large Language Models (MLLMs). By passing only those tokens to the model, it aims to work around memory and context-window limits and to make MLLMs more efficient and effective on long-video understanding tasks.

advanced | 1-2 hours | 4 steps

The play

1. Understand MLLM Video Processing Limitations
Recognize that conventional MLLM pipelines struggle with long videos: memory cost grows with the number of visual tokens and the context window is limited, so they often process only short, pre-defined clips.
2. Implement Entropy-Based Information Scoring
Develop or integrate a method that uses entropy to score the 'informativeness' of individual tokens or frames within a video segment. In this play, higher entropy is treated as a sign of more unique or critical information.
3. Apply Adaptive Cross-Clip Token Selection
Design an algorithm that compares and selects the most informative (high-entropy) tokens not just within a single video segment, but across multiple, potentially unrelated, video clips. Prioritize tokens that offer the most novel information.
4. Integrate Selected Tokens into MLLM Pipeline
Feed the adaptively selected, high-entropy tokens as input to your MLLM. This reduces the overall token count while retaining critical information, which lets the MLLM handle substantially longer videos within the same context budget.
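
The scoring and cross-clip selection steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: each token is assumed to be a plain list of feature values, its score is the entropy of a softmax over those values (a stand-in scoring rule chosen only for illustration), and the names `softmax_entropy`, `select_tokens_across_clips`, and `budget` are hypothetical.

```python
import math
from typing import List, Tuple

Token = List[float]   # assumption: one token = a list of feature values
Clip = List[Token]    # assumption: one clip = an ordered list of tokens


def softmax_entropy(features: Token) -> float:
    """Entropy (in bits) of the softmax distribution over one token's features."""
    peak = max(features)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in features]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(-p * math.log2(p) for p in probs if p > 0)


def select_tokens_across_clips(clips: List[Clip], budget: int) -> List[Tuple[int, int]]:
    """Keep the `budget` highest-entropy tokens pooled over ALL clips.

    Returns (clip_index, token_index) pairs in temporal order.
    """
    scored = []
    for clip_idx, clip in enumerate(clips):
        for tok_idx, feats in enumerate(clip):
            scored.append((softmax_entropy(feats), clip_idx, tok_idx))
    # Highest entropy first; ties go to the earlier position.
    scored.sort(key=lambda s: (-s[0], s[1], s[2]))
    kept = [(c, t) for _, c, t in scored[:budget]]
    return sorted(kept)  # restore clip/token order before feeding the MLLM


# Toy input: two clips with made-up feature vectors.
clips = [
    [[5.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
    [[0.9, 1.0, 1.1], [4.0, 0.0, 0.0], [2.0, 2.0, 0.0]],
]
selected = select_tokens_across_clips(clips, budget=3)
print(selected)
```

Because all tokens compete in one pool, a clip can contribute more or fewer tokens than its neighbours; in the toy input the second clip ends up with two of the three kept tokens. Only the kept tokens, in their original order, would then be passed to the MLLM.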

Starter code

import math
from collections import Counter

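def calculate_shannon_entropy(data):
    """Return the Shannon entropy (in bits) of the items in `data`."""
    # An empty sequence carries no information.
    if not data:
        return 0.0
    counts = Counter(data)  # occurrences of each distinct item
    total = len(data)
    entropy = 0.0
    for count in counts.values():
        probability = count / total  # empirical probability of this item
        entropy -= probability * math.log2(probability)
    return entropy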

# Example usage: Imagine 'tokens' are features or identifiers from video frames
video_segment_tokens_1 = ['sky', 'car', 'tree', 'car', 'sky', 'person']
video_segment_tokens_2 = ['dialogue', 'action', 'dialogue', 'explosion', 'action']

entropy_1 = calculate_shannon_entropy(video_segment_tokens_1)
entropy_2 = calculate_shannon_entropy(video_segment_tokens_2)

print(f"Entropy for segment 1: {entropy_1:.2f}")
print(f"Entropy for segment 2: {entropy_2:.2f}")

# In a real AdaptToken scenario, you'd apply this to extracted features or embeddings
# and then select tokens based on their individual or local entropy scores across clips.
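
One way to carry the same idea over to continuous features or embeddings is to bucket the values into bins first and then measure the entropy of the bin labels. The sketch below is an assumption-laden illustration: the function name `quantized_entropy`, the bin count, and the value range `[-1, 1]` are all hypothetical choices, and the frame values are made up.

```python
import math
from collections import Counter


def quantized_entropy(values, num_bins=8, lo=-1.0, hi=1.0):
    """Shannon entropy (bits) of `values` after equal-width binning over [lo, hi]."""
    if not values:
        return 0.0
    width = (hi - lo) / num_bins
    # Clamp bin indices so out-of-range values fall into the edge bins.
    bins = [min(num_bins - 1, max(0, int((v - lo) / width))) for v in values]
    total = len(bins)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(bins).values())


# Hypothetical per-frame feature values (assumed to be scaled to [-1, 1]).
frames = {
    "flat_frame": [0.10, 0.11, 0.12, 0.10],    # nearly constant features
    "varied_frame": [-0.9, -0.3, 0.2, 0.8],    # features spread over the range
}

# Rank frames from most to least informative under this heuristic.
ranked = sorted(frames, key=lambda name: quantized_entropy(frames[name]), reverse=True)
print(ranked)
```

The bin count controls how sensitive the score is: too few bins and most frames look identical, too many and every value lands in its own bin, so it is worth tuning on real features.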

Source

Paper: arxiv.org