Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Improve video world models by adding a 'Hybrid Memory' that keeps tracking dynamic subjects after they move out of sight, so they can re-emerge coherently. This targets simulation errors such as subjects freezing or vanishing, and supports more consistent and realistic world simulations for AI agents.

advanced · 2 hours · 5 steps

The play

Analyze Current Model Limitations
Evaluate your existing video world models for failure modes in dynamic environments, specifically when subjects temporarily leave the frame or become occluded. Identify instances of freezing, vanishing, or distortion upon re-emergence.
Define Hybrid Memory Requirements
Outline the functional specifications for a memory system that can robustly track objects even when they are not directly observable. Consider requirements for short-term (visible) and long-term (occluded) state management, and how to associate re-emerging objects with their past states.
Design Memory Architecture
Propose an architecture for a 'Hybrid Memory' component. This might involve a combination of explicit object tracking for visible entities and a more persistent, abstract representation for objects that have gone out of sight, along with mechanisms for re-identification.
Integrate with World Model
Determine how the designed Hybrid Memory will interact with your world model's perception, prediction, and latent state components. Map out the data flow for storing and retrieving object information to maintain world consistency across occlusions.
Develop Dynamic Evaluation Metrics
Establish specific metrics to assess the improved model's performance in dynamic scenarios. Focus on evaluating object persistence, re-identification accuracy, and simulation consistency when objects re-emerge after being out of sight.
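
The three quantities named in the last step (object persistence, re-identification accuracy, and consistency on re-emergence) can be sketched as simple aggregate functions. This is a minimal illustration, not a prescribed benchmark: the `ReemergenceEvent` record, its field names, and the choice of 2D position error as a consistency proxy are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import dist
from typing import Optional, Sequence, Tuple


@dataclass
class ReemergenceEvent:
    """One occlusion episode (hypothetical record format for this sketch)."""
    true_id: str                                  # ground-truth identity of the occluded object
    predicted_id: Optional[str]                   # identity assigned on re-emergence; None if it never reappeared
    expected_pos: Tuple[float, float]             # ground-truth position at re-emergence
    observed_pos: Optional[Tuple[float, float]]   # simulated position; None if it never reappeared


def persistence_rate(events: Sequence[ReemergenceEvent]) -> float:
    """Fraction of occluded objects that reappear at all (did not vanish)."""
    if not events:
        return 0.0
    return sum(e.predicted_id is not None for e in events) / len(events)


def reid_accuracy(events: Sequence[ReemergenceEvent]) -> float:
    """Among objects that reappeared, fraction that kept their original identity."""
    reappeared = [e for e in events if e.predicted_id is not None]
    if not reappeared:
        return 0.0
    return sum(e.predicted_id == e.true_id for e in reappeared) / len(reappeared)


def mean_reentry_error(events: Sequence[ReemergenceEvent]) -> float:
    """Mean Euclidean distance between expected and simulated re-entry positions."""
    errors = [dist(e.expected_pos, e.observed_pos)
              for e in events if e.observed_pos is not None]
    return sum(errors) / len(errors) if errors else 0.0


# Illustrative data only; these values are made up for the example.
events = [
    ReemergenceEvent("car", "car", (0.0, 0.0), (3.0, 4.0)),    # correct identity, off by 5 units
    ReemergenceEvent("dog", "cat", (1.0, 1.0), (1.0, 1.0)),    # identity switch, correct position
    ReemergenceEvent("bird", None, (2.0, 2.0), None),          # vanished during occlusion
    ReemergenceEvent("ball", "ball", (0.0, 0.0), (0.0, 1.0)),  # correct identity, off by 1 unit
]

print(persistence_rate(events))
print(reid_accuracy(events))
print(mean_reentry_error(events))
```

Reporting the three numbers separately keeps vanishing, identity switches, and positional drift distinguishable, since a single combined score would hide which failure mode dominates.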

Starter code

import torch

class HybridMemory:
    def __init__(self, memory_capacity=100):
        self.short_term_memory = {}
        self.long_term_memory = {}
        self.memory_capacity = memory_capacity  # Capacity limit; not enforced anywhere in this starter code
        self.next_id = 0

    def store_visible_object(self, obj_id, current_state):
        """Stores or updates the state of a currently visible object."""
        self.short_term_memory[obj_id] = current_state
        if obj_id not in self.long_term_memory:
            self.long_term_memory[obj_id] = current_state # Initialize long-term if new

    def object_occluded(self, obj_id):
        """Moves an object's last known state from short-term to long-term memory, marking it as occluded."""
        if obj_id in self.short_term_memory:
            # Overwrite the long-term entry with the last state seen before occlusion
            self.long_term_memory[obj_id] = self.short_term_memory.pop(obj_id)

    def retrieve_object_state(self, obj_id):
        """Retrieves the most current known state of an object, visible or occluded."""
        if obj_id in self.short_term_memory:
            return self.short_term_memory[obj_id]
        elif obj_id in self.long_term_memory:
            return self.long_term_memory[obj_id]
        return None # Object not found

    def re_identify_object(self, new_detection_state, potential_ids):
        """Conceptual method: matches a new detection to a known occluded object."""
        # This would involve feature matching, trajectory prediction, etc.
        for obj_id in potential_ids:
            if obj_id in self.long_term_memory: # Check if this ID is in long-term memory
                # Placeholder for actual re-identification logic
                # e.g., compare new_detection_state with self.long_term_memory[obj_id]
                # For now, just return the first potential match
                print(f"Re-identified object {obj_id} from long-term memory.")
                return obj_id
        return self.create_new_object(new_detection_state)

    def create_new_object(self, initial_state):
        """Creates a new entry for a newly appearing object."""
        new_obj_id = f"obj_{self.next_id}"
        self.next_id += 1
        self.short_term_memory[new_obj_id] = initial_state
        self.long_term_memory[new_obj_id] = initial_state
        return new_obj_id

# Example Usage (conceptual)
memory = HybridMemory()

# Frame 1: Object appears
obj_a_id = memory.create_new_object(torch.randn(1, 64)) # Latent state
print(f"Created new object: {obj_a_id}")

# Frame 2: Object still visible
memory.store_visible_object(obj_a_id, torch.randn(1, 64) * 1.1) # Updated state

# Frame 3: Object goes out of sight
memory.object_occluded(obj_a_id)
# The state has shape (1, 64), so index row 0 before taking the first two values
print(f"Object {obj_a_id} occluded. Current state: {memory.retrieve_object_state(obj_a_id)[0, :2]}...")

# Frame N: Object re-emerges, needs re-identification
re_emerging_state = torch.randn(1, 64) * 0.9 # New detection
re_identified_id = memory.re_identify_object(re_emerging_state, [obj_a_id])

if re_identified_id == obj_a_id:
    memory.store_visible_object(re_identified_id, re_emerging_state)
    # The state has shape (1, 64), so index row 0 before taking the first two values
    print(f"Successfully re-integrated object {re_identified_id} with new state: {memory.retrieve_object_state(re_identified_id)[0, :2]}...")
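
The `re_identify_object` placeholder above returns the first candidate without comparing states. One possible way to fill it in is nearest-neighbour matching on cosine similarity with a rejection threshold, sketched below. The helper names `cosine_similarity` and `match_occluded`, the plain-list state format, and the `threshold=0.8` default are assumptions for illustration; with tensor states, `torch.nn.functional.cosine_similarity` could play the same role.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def match_occluded(
    detection: Sequence[float],
    occluded_states: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed cutoff; tune on your own data
) -> Optional[str]:
    """Return the id of the most similar occluded object, or None if no
    candidate reaches the similarity threshold."""
    best_id, best_score = None, threshold
    for obj_id, state in occluded_states.items():
        score = cosine_similarity(detection, state)
        if score >= best_score:
            best_id, best_score = obj_id, score
    return best_id


# Toy 3-dimensional states standing in for latent vectors.
occluded = {"obj_0": [1.0, 0.0, 0.0], "obj_1": [0.0, 1.0, 0.0]}

print(match_occluded([0.9, 0.1, 0.0], occluded))  # close to obj_0's stored state
print(match_occluded([0.0, 0.0, 1.0], occluded))  # orthogonal to both stored states
```

When the function returns `None`, the caller can fall back to `create_new_object`, which mirrors the fallback already present in the starter code. Appearance similarity alone ignores how long the object was hidden and where it was heading, so a fuller version would likely combine it with trajectory prediction, as the placeholder's comment suggests.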

Source

Paper: arxiv.org