VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Improve the visual perception of Vision-Language Models (VLMs) by training them on synthetic images that carry explicit annotations. The method targets VLM weaknesses in spatial understanding and viewpoint recognition, where natural image datasets often leave the relevant ground truth implicit or absent.

Intermediate · 30 min · 5 steps

The play

Identify VLM Perception Gaps
Recognize specific limitations in your VLM's visual perception, particularly in spatial understanding, object relationships, or viewpoint recognition, which hinder performance on critical tasks.
Plan Synthetic Data Generation
Design a strategy to programmatically create synthetic images. Focus on generating diverse scenarios that explicitly target the identified visual perception weaknesses, ensuring precise control over object placement, lighting, and camera angles.
Generate Annotated Synthetic Data
Produce synthetic images, ensuring each image comes with precise, explicit ground truth annotations. These annotations should detail spatial relationships, object bounding boxes, 3D poses, and viewpoint information, which are often implicit or absent in natural datasets.
Train VLMs with Synthetic Data
Integrate the newly generated and explicitly annotated synthetic dataset into your VLM's training or fine-tuning pipeline. Prioritize training phases or modules that address low-level visual skills.
Evaluate Enhanced VLM Performance
Measure the improvement in your VLM's visual perception capabilities. Focus evaluation on tasks directly related to spatial understanding, object localization, and viewpoint recognition to confirm the effectiveness of the synthetic data augmentation.
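
The planning and generation steps above can be sketched as a small scene sampler whose annotations are derived from the sampled coordinates rather than written by hand. This is a minimal illustration under stated assumptions, not the paper's pipeline: the function names, the two-square scene layout, the 200-pixel canvas, and the relation vocabulary (`left_of`, `above`, and so on) are all hypothetical choices made for this example. It uses only the standard library and leaves rendering to a drawing routine such as the one in the starter code.

```python
import random

def sample_box(rng, canvas=200, size=40):
    """Sample an axis-aligned box (x0, y0, x1, y1) fully inside the canvas."""
    x0 = rng.randint(0, canvas - size)
    y0 = rng.randint(0, canvas - size)
    return (x0, y0, x0 + size, y0 + size)

def overlaps(a, b):
    """Return True if two (x0, y0, x1, y1) boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def horizontal_relation(a, b):
    """Relation of box a to box b along x, derived from the box centers."""
    ax, bx = (a[0] + a[2]) / 2, (b[0] + b[2]) / 2
    if ax < bx:
        return "left_of"
    if ax > bx:
        return "right_of"
    return "aligned"

def vertical_relation(a, b):
    """Relation of box a to box b along y (image y grows downward)."""
    ay, by = (a[1] + a[3]) / 2, (b[1] + b[3]) / 2
    if ay < by:
        return "above"
    if ay > by:
        return "below"
    return "aligned"

def sample_scene(seed, canvas=200, size=40):
    """Build one scene spec; the seed makes the scene reproducible."""
    rng = random.Random(seed)
    first = sample_box(rng, canvas, size)
    second = sample_box(rng, canvas, size)
    # Resample the second box until the two objects do not overlap.
    while overlaps(first, second):
        second = sample_box(rng, canvas, size)
    return {
        "image_id": f"scene_{seed:04d}",  # hypothetical naming scheme
        "objects": [
            {"label": "square", "color": "red", "bbox": list(first)},
            {"label": "square", "color": "blue", "bbox": list(second)},
        ],
        # Relations are computed from coordinates, so they cannot drift
        # out of sync with the object placement.
        "relations": [
            {"subject": 0, "object": 1,
             "relation": horizontal_relation(first, second)},
            {"subject": 0, "object": 1,
             "relation": vertical_relation(first, second)},
        ],
    }

scenes = [sample_scene(seed) for seed in range(5)]
```

Deriving each relation from the sampled boxes, instead of hard-coding it, is what keeps the ground truth exact as the number and variety of scenes grows.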

Starter code

from PIL import Image, ImageDraw
import json

# Define image dimensions
width, height = 200, 200
img = Image.new('RGB', (width, height), color='white')
d = ImageDraw.Draw(img)

# Define a synthetic object (e.g., a red square)
# Coordinates (x0, y0, x1, y1)
bbox_coords = (50, 50, 150, 150)
d.rectangle(bbox_coords, fill='red', outline='black')

# Create explicit annotation for the synthetic object
annotation = {
    "image_id": "synthetic_image_001",  # Matches the saved image filename stem
    "objects": [
        {
            "label": "square",
            "bbox": list(bbox_coords), # Convert tuple to list for JSON
            "color": "red",
            "spatial_relation": "center", # Example of explicit spatial info
            "viewpoint": "front" # Example of explicit viewpoint info
        }
    ]
}

# Save image and annotation
img_filename = "synthetic_image_001.png"
json_filename = "synthetic_annotation_001.json"
img.save(img_filename)
with open(json_filename, 'w') as f:
    json.dump(annotation, f, indent=4)

print(f"Generated {img_filename} and {json_filename}")
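
To connect the annotation above to the training and evaluation steps, one possible approach is to turn each annotation into question-answer pairs and then score predictions per skill. The sketch below is an assumption-laden illustration: the question templates, the `skill` tags, and the exact-match scoring are hypothetical and are not taken from the paper, and the predictions shown are placeholders rather than real model output.

```python
from collections import defaultdict

def annotation_to_qa(annotation):
    """Turn one annotation dict (same shape as the starter code) into QA pairs."""
    pairs = []
    for obj in annotation["objects"]:
        name = f'{obj["color"]} {obj["label"]}'
        pairs.append({
            "image_id": annotation["image_id"],
            "skill": "spatial",  # hypothetical skill tag
            "question": f"Where is the {name} in the image?",
            "answer": obj["spatial_relation"],
        })
        pairs.append({
            "image_id": annotation["image_id"],
            "skill": "viewpoint",  # hypothetical skill tag
            "question": f"From which viewpoint is the {name} shown?",
            "answer": obj["viewpoint"],
        })
    return pairs

def accuracy_by_skill(qa_pairs, predictions):
    """Exact-match accuracy per skill; predictions align index-wise with qa_pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair, pred in zip(qa_pairs, predictions):
        total[pair["skill"]] += 1
        if pred.strip().lower() == pair["answer"].lower():
            correct[pair["skill"]] += 1
    return {skill: correct[skill] / total[skill] for skill in total}

# Same structure as the annotation written by the starter code.
example = {
    "image_id": "synthetic_image_001",
    "objects": [
        {
            "label": "square",
            "bbox": [50, 50, 150, 150],
            "color": "red",
            "spatial_relation": "center",
            "viewpoint": "front",
        }
    ],
}

qa_pairs = annotation_to_qa(example)
# Placeholder predictions standing in for VLM answers (not real model output).
scores = accuracy_by_skill(qa_pairs, ["center", "side"])
```

Reporting accuracy separately for each skill makes it easier to see whether the synthetic data improved the specific perception gap it was designed for, rather than averaging it away in a single score.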

Source

Paper: arxiv.org