What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP Metrics

Use Vision-Language Models (VLMs) and NLP metrics to analyze what users see in eye-tracking data, not just where they look. The framework turns fixated image regions into text descriptions and scores how semantically similar the visual attention of different viewers is, which can inform analyses of user intent and the design of human-computer interaction.

Intermediate | 30 min | 6 steps

The play

Acquire Eye-Tracking Data
Obtain eye-tracking data in which each fixation is recorded as coordinates (ideally with timestamps or durations) on the image or visual stimulus that was shown. Ensure the data can be parsed so that every fixation maps to a specific image region.
Define Regions of Interest (ROIs)
For each fixation point or sequence of fixations, programmatically extract the corresponding image region (ROI). These regions will be the input for your Vision-Language Model.
Describe ROIs with VLMs
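Use a pre-trained Vision-Language Model (VLM) that can generate text, such as BLIP or LLaVA, to produce a concise natural language description for each extracted ROI. This converts visual information into semantic text. Note that CLIP on its own produces image and text embeddings rather than captions, so it fits this step only as a way to score an ROI against candidate labels.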
Generate Semantic Scanpaths
For each user's eye-tracking session, compile the sequence of VLM-generated descriptions corresponding to their scanpath. This forms a 'semantic scanpath' for each user.
Quantify Semantic Similarity
Apply Natural Language Processing (NLP) metrics to compare semantic scanpaths. Embed the VLM descriptions into a vector space (e.g., using Sentence Transformers) and calculate similarity scores (e.g., cosine similarity) between different scanpaths or segments.
Interpret and Apply Insights
Analyze the semantic similarity scores to understand user visual cognition, intent, and attention patterns. Use these insights to inform user experience design, personalize content, or improve human-AI interaction systems.
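
Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the method from the source paper: it assumes fixations arrive as (x, y) pixel coordinates on the stimulus and that a fixed-size square window around each fixation is an acceptable ROI. The names fixation_to_box and scanpath_to_boxes, the 128-pixel window, and the sample coordinates are all hypothetical.

```python
from typing import List, Sequence, Tuple

# (left, upper, right, lower): the box order that PIL's Image.crop expects
Box = Tuple[int, int, int, int]


def fixation_to_box(x: float, y: float,
                    image_size: Tuple[int, int],
                    roi_size: int = 128) -> Box:
    """Square ROI centred on a fixation, shifted so it stays inside the image.

    Assumption: a fixed window is a good enough proxy for what was seen;
    the 128-pixel default is arbitrary and should be tuned to the stimulus.
    """
    width, height = image_size
    side_w = min(roi_size, width)    # never wider than the image
    side_h = min(roi_size, height)   # never taller than the image
    left = int(round(x - side_w / 2))
    upper = int(round(y - side_h / 2))
    # Clamp so fixations near an edge still yield a full-size crop
    left = max(0, min(left, width - side_w))
    upper = max(0, min(upper, height - side_h))
    return (left, upper, left + side_w, upper + side_h)


def scanpath_to_boxes(fixations: Sequence[Tuple[float, float]],
                      image_size: Tuple[int, int],
                      roi_size: int = 128) -> List[Box]:
    """Map an ordered list of fixations to an ordered list of crop boxes."""
    return [fixation_to_box(x, y, image_size, roi_size) for x, y in fixations]


# Hypothetical fixations on a 640x480 stimulus
fixations = [(10.0, 10.0), (320.0, 240.0), (635.0, 475.0)]
boxes = scanpath_to_boxes(fixations, image_size=(640, 480), roi_size=128)
print(boxes)
```

Each box can be passed to raw_image.crop(box) to obtain the region that is then captioned in step 3. Keeping the boxes in fixation order preserves the scanpath order for the later steps.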

Starter code

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Load VLM and NLP models
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
nlp_model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Simulate image regions (replace with actual eye-tracking crops)
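# Download a sample image (swap in your own stimulus image)
image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_doc/resolve/main/image.png"
response = requests.get(image_url, stream=True, timeout=30)
response.raise_for_status()  # fail early on a bad download rather than with an unclear PIL error
raw_image = Image.open(response.raw).convert('RGB')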

# For demonstration, define two arbitrary regions within the image
region1 = raw_image.crop((0, 0, raw_image.width // 2, raw_image.height)) # Left half
region2 = raw_image.crop((raw_image.width // 2, 0, raw_image.width, raw_image.height)) # Right half

def describe_region(region):
    """Caption one cropped image region with BLIP and return the text."""
    inputs = processor(region, return_tensors="pt")
    # Cap the caption length; ROI descriptions are meant to be short
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# 3. Generate a VLM description for each region
print("--- Processing Region 1 ---")
description1 = describe_region(region1)
print(f"Description 1: {description1}")

print("\n--- Processing Region 2 ---")
description2 = describe_region(region2)
print(f"Description 2: {description2}")

print("\n--- Calculating Semantic Similarity ---")
# 4. Embed descriptions and calculate semantic similarity
embeddings = nlp_model.encode([description1, description2])
similarity_score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"Semantic Similarity between descriptions: {similarity_score:.4f}")
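
The starter code compares two single descriptions, whereas steps 4 and 5 compare whole scanpaths that may differ in length. One possible way to do that, shown here only as a sketch, is a symmetric mean best-match cosine similarity: each description in one scanpath is matched to its most similar description in the other, and the two directions are averaged. This particular scoring rule is an assumption chosen for illustration, not necessarily the metric used in the source paper, and it ignores fixation order. The function names and the toy 2-D vectors are hypothetical stand-ins for real sentence embeddings.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match_mean(path_a: Sequence[Vector], path_b: Sequence[Vector]) -> float:
    """Average, over path_a, of each item's best similarity to any item in path_b."""
    return sum(max(cosine(a, b) for b in path_b) for a in path_a) / len(path_a)


def scanpath_similarity(path_a: Sequence[Vector], path_b: Sequence[Vector]) -> float:
    """Symmetric best-match score between two embedded semantic scanpaths.

    Order-insensitive by design: it measures overlap in what was seen,
    not the sequence in which it was seen.
    """
    if len(path_a) == 0 or len(path_b) == 0:
        raise ValueError("both scanpaths need at least one embedded description")
    return 0.5 * (best_match_mean(path_a, path_b) + best_match_mean(path_b, path_a))


# Toy 2-D vectors standing in for sentence embeddings of ROI descriptions
user_a = [[1.0, 0.0], [0.0, 1.0]]
user_b = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]   # same content, different order and length
user_c = [[-1.0, 0.0]]                          # unrelated content

score_ab = scanpath_similarity(user_a, user_b)
score_ac = scanpath_similarity(user_a, user_c)
print(f"A vs B: {score_ab:.2f}")
print(f"A vs C: {score_ac:.2f}")
```

With real data, the rows returned by nlp_model.encode(list_of_descriptions) for each user can be passed in place of the toy vectors. If the order of fixations matters for the analysis, an order-aware alignment over the same cosine scores would be a more suitable choice than this best-match average.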

Source

Paper: arxiv.org