Paper·arxiv.org
research · machine-learning · llm · evaluation · context-engineering
What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP Metrics
Use Vision-Language Models (VLMs) and NLP metrics to analyze what users look at in eye-tracking data, not just where they look. The framework quantifies the semantic similarity of visual attention, which can give insight into user intent and inform human-computer interaction design.
intermediate · 30 min · 6 steps
The play
- Acquire Eye-Tracking Data: Obtain eye-tracking data that includes fixation points overlaid on images or other visual stimuli. Ensure the data can be parsed to identify the specific image regions corresponding to fixations.
- Define Regions of Interest (ROIs): For each fixation point or sequence of fixations, programmatically extract the corresponding image region (ROI). These regions will be the input for your Vision-Language Model.
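- Describe ROIs with VLMs: Use a pre-trained, caption-capable Vision-Language Model (VLM), such as BLIP or LLaVA, to generate a concise natural language description for each extracted ROI. This converts visual information into semantic text. Note that CLIP on its own produces image and text embeddings rather than captions, so it fits this step only when paired with a text decoder or used to score candidate descriptions.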
- Generate Semantic Scanpaths: For each user's eye-tracking session, compile the sequence of VLM-generated descriptions corresponding to their scanpath. This forms a 'semantic scanpath' for each user.
- Quantify Semantic Similarity: Apply Natural Language Processing (NLP) metrics to compare semantic scanpaths. Embed the VLM descriptions into a vector space (e.g., using Sentence Transformers) and calculate similarity scores (e.g., cosine similarity) between different scanpaths or segments.
- Interpret and Apply Insights: Analyze the semantic similarity scores to understand user visual cognition, intent, and attention patterns. Use these insights to inform user experience design, personalize content, or improve human-AI interaction systems.
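The ROI step can be sketched as a small helper that turns fixation coordinates into crop boxes. This is a minimal illustration, not the paper's procedure: the function names `fixation_to_box` and `scanpath_to_boxes`, the square window, and the `half_size` parameter are all assumptions made for the example. The boxes use the `(left, upper, right, lower)` convention that Pillow's `Image.crop` accepts.

```python
from typing import List, Sequence, Tuple

# (left, upper, right, lower), the box convention used by PIL's Image.crop
Box = Tuple[int, int, int, int]


def fixation_to_box(x: float, y: float, width: int, height: int,
                    half_size: int = 64) -> Box:
    """Square box centred on a fixation, clipped to the image bounds.

    half_size is a hypothetical window radius in pixels; a real study would
    derive it from the eye tracker's accuracy and the viewing distance.
    """
    cx, cy = int(round(x)), int(round(y))
    left = max(0, cx - half_size)
    upper = max(0, cy - half_size)
    right = min(width, cx + half_size)
    lower = min(height, cy + half_size)
    if left >= right or upper >= lower:
        raise ValueError(f"fixation ({x}, {y}) lies outside the image")
    return (left, upper, right, lower)


def scanpath_to_boxes(fixations: Sequence[Tuple[float, float]],
                      width: int, height: int,
                      half_size: int = 64) -> List[Box]:
    """One crop box per fixation, in scanpath order."""
    return [fixation_to_box(x, y, width, height, half_size)
            for x, y in fixations]


boxes = scanpath_to_boxes([(10, 10), (190, 90)], width=200, height=100,
                          half_size=32)
print(boxes)  # [(0, 0, 42, 42), (158, 58, 200, 100)]
```

Each box can then be passed to `image.crop(box)` and the resulting crop sent to the captioning model, which keeps the descriptions in the same order as the fixations.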
Starter code
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# 1. Load VLM and NLP models
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
nlp_model = SentenceTransformer('all-MiniLM-L6-v2')
# 2. Simulate image regions (replace with actual eye-tracking crops)
# Download a sample image
image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_doc/resolve/main/image.png"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
# For demonstration, define two arbitrary regions within the image
region1 = raw_image.crop((0, 0, raw_image.width // 2, raw_image.height)) # Left half
region2 = raw_image.crop((raw_image.width // 2, 0, raw_image.width, raw_image.height)) # Right half
print("--- Processing Region 1 ---")
# 3. Generate VLM description for Region 1
inputs1 = processor(region1, return_tensors="pt")
out1 = model.generate(**inputs1)
description1 = processor.decode(out1[0], skip_special_tokens=True)
print(f"Description 1: {description1}")
print("\n--- Processing Region 2 ---")
# 3. Generate VLM description for Region 2
inputs2 = processor(region2, return_tensors="pt")
out2 = model.generate(**inputs2)
description2 = processor.decode(out2[0], skip_special_tokens=True)
print(f"Description 2: {description2}")
print("\n--- Calculating Semantic Similarity ---")
# 4. Embed descriptions and calculate semantic similarity
embeddings = nlp_model.encode([description1, description2])
similarity_score = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
print(f"Semantic Similarity between descriptions: {similarity_score:.4f}")
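The starter code compares a single pair of descriptions, whereas a semantic scanpath is a sequence of them, and two users' sequences rarely have the same length. One possible way to compare whole sequences is dynamic time warping (DTW) over cosine distance between description embeddings. The sketch below is an assumption chosen for illustration, not necessarily the metric the paper uses; `dtw_scanpath_distance` is a hypothetical helper, and the toy 2-D vectors stand in for real sentence embeddings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dtw_scanpath_distance(path_a: Sequence[Vector],
                          path_b: Sequence[Vector]) -> float:
    """Mean cosine distance along the cheapest monotonic DTW alignment.

    Returns the total cost of the minimum-cost warping path divided by the
    number of aligned pairs on that path (0.0 means identical in meaning
    and order under this alignment).
    """
    n, m = len(path_a), len(path_b)
    if n == 0 or m == 0:
        raise ValueError("both scanpaths must contain at least one item")
    inf = float("inf")
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    steps: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = 1.0 - cosine_similarity(path_a[i - 1], path_b[j - 1])
            # Cheapest predecessor: skip in A, skip in B, or advance both.
            prev_cost, prev_steps = min(
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
            )
            cost[i][j] = dist + prev_cost
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]


# Toy embeddings: user B dwells on the first region for two fixations.
user_a = [[1.0, 0.0], [0.0, 1.0]]
user_b = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(dtw_scanpath_distance(user_a, user_b))
```

With real data, each inner list would be one row of `nlp_model.encode(descriptions)` converted with `.tolist()`. DTW tolerates repeated fixations on the same content, which is why it is used here instead of an element-by-element comparison; other sequence metrics may suit other study designs better.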