Paper·arxiv.org
machine-learning · research · data-pipelines · fine-tuning · embeddings · visionfoundry

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Improve Vision-Language Models (VLMs) by training them on synthetic images that carry explicit annotations. The method targets VLM weaknesses in spatial understanding and viewpoint recognition, where natural image datasets often leave the relevant ground truth implicit or absent.

intermediate · 30 min · 5 steps
The play
  1. Identify VLM Perception Gaps
    Recognize specific limitations in your VLM's visual perception, particularly in spatial understanding, object relationships, or viewpoint recognition, which hinder performance on critical tasks.
  2. Plan Synthetic Data Generation
    Design a strategy to programmatically create synthetic images. Focus on generating diverse scenarios that explicitly target the identified visual perception weaknesses, ensuring precise control over object placement, lighting, and camera angles.
  3. Generate Annotated Synthetic Data
    Produce synthetic images, ensuring each image comes with precise, explicit ground truth annotations. These annotations should detail spatial relationships, object bounding boxes, 3D poses, and viewpoint information, which are often implicit or absent in natural datasets.
  4. Train VLMs with Synthetic Data
    Integrate the newly generated and explicitly annotated synthetic dataset into your VLM's training or fine-tuning pipeline. Prioritize training phases or modules that address low-level visual skills.
  5. Evaluate Enhanced VLM Performance
    Measure the improvement in your VLM's visual perception capabilities. Focus evaluation on tasks directly related to spatial understanding, object localization, and viewpoint recognition to confirm the effectiveness of the synthetic data augmentation.
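Steps 2 and 3 can be sketched as a small scene sampler that places objects at random and derives the spatial relation from the coordinates instead of writing it by hand. This is a minimal illustration, not the paper's pipeline: the helper names (`sample_box`, `horizontal_relation`, `sample_scene`), the two-square scene, and the relation vocabulary are all assumptions made for this example.

```python
import random


def sample_box(rng, canvas=200, size=40):
    """Return a random axis-aligned box (x0, y0, x1, y1) inside the canvas."""
    x0 = rng.randint(0, canvas - size)
    y0 = rng.randint(0, canvas - size)
    return (x0, y0, x0 + size, y0 + size)


def horizontal_relation(box_a, box_b):
    """Describe where box_a sits relative to box_b along the x axis."""
    if box_a[2] <= box_b[0]:
        return "left_of"
    if box_a[0] >= box_b[2]:
        return "right_of"
    return "overlapping_x"


def sample_scene(seed, canvas=200, size=40):
    """Build one annotated scene; the seed makes the scene reproducible."""
    rng = random.Random(seed)
    red = sample_box(rng, canvas, size)
    blue = sample_box(rng, canvas, size)
    return {
        "image_id": f"scene_{seed:04d}",
        "objects": [
            {"label": "square", "color": "red", "bbox": list(red)},
            {"label": "square", "color": "blue", "bbox": list(blue)},
        ],
        # The relation is computed from the boxes, so it always matches the image.
        "relations": [
            {
                "subject": "red square",
                "object": "blue square",
                "relation": horizontal_relation(red, blue),
            }
        ],
    }


# Hypothetical usage: five reproducible scenes, ready to be rendered and saved.
scenes = [sample_scene(seed) for seed in range(5)]
```

Because the annotation is computed from the same coordinates that would be drawn, the label cannot drift from the image content, which is the main advantage of synthetic data that the steps above rely on. Each box can be rendered with `ImageDraw.rectangle`, as in the starter code.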
Starter code
from PIL import Image, ImageDraw
import json

# Define image dimensions
width, height = 200, 200
img = Image.new('RGB', (width, height), color='white')
d = ImageDraw.Draw(img)

# Define a synthetic object (e.g., a red square)
# Coordinates (x0, y0, x1, y1)
bbox_coords = (50, 50, 150, 150)
d.rectangle(bbox_coords, fill='red', outline='black')

# Create explicit annotation for the synthetic object
annotation = {
    "image_id": "synthetic_image_001",  # Matches the saved image filename
    "objects": [
        {
            "label": "square",
            "bbox": list(bbox_coords), # Convert tuple to list for JSON
            "color": "red",
            "spatial_relation": "center", # Example of explicit spatial info
            "viewpoint": "front" # Example of explicit viewpoint info
        }
    ]
}

# Save image and annotation
img_filename = "synthetic_image_001.png"
json_filename = "synthetic_annotation_001.json"
img.save(img_filename)
with open(json_filename, 'w') as f:
    json.dump(annotation, f, indent=4)

print(f"Generated {img_filename} and {json_filename}")
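One possible way to use such an annotation in steps 4 and 5 is to turn it into question-answer pairs for fine-tuning and then score model answers per skill. The sketch below assumes the annotation layout from the starter code. The function names, question templates, skill tags, and the placeholder predictions are hypothetical and stand in for whatever training format and model your pipeline actually uses.

```python
def annotation_to_qa(annotation):
    """Turn one annotation dict (starter-code layout) into question/answer pairs."""
    pairs = []
    for obj in annotation["objects"]:
        name = f'{obj["color"]} {obj["label"]}'
        pairs.append({
            "image_id": annotation["image_id"],
            "skill": "spatial",
            "question": f"Where is the {name} in the image?",
            "answer": obj["spatial_relation"],
        })
        pairs.append({
            "image_id": annotation["image_id"],
            "skill": "viewpoint",
            "question": f"From which viewpoint is the {name} shown?",
            "answer": obj["viewpoint"],
        })
    return pairs


def accuracy_by_skill(pairs, predictions):
    """Exact-match accuracy per skill; predictions must be aligned with pairs."""
    totals, correct = {}, {}
    for pair, pred in zip(pairs, predictions):
        skill = pair["skill"]
        totals[skill] = totals.get(skill, 0) + 1
        if pred.strip().lower() == pair["answer"].lower():
            correct[skill] = correct.get(skill, 0) + 1
    return {skill: correct.get(skill, 0) / totals[skill] for skill in totals}


# Same structure as the starter-code annotation (the id here is illustrative).
annotation = {
    "image_id": "example_001",
    "objects": [
        {
            "label": "square",
            "bbox": [50, 50, 150, 150],
            "color": "red",
            "spatial_relation": "center",
            "viewpoint": "front",
        }
    ],
}

pairs = annotation_to_qa(annotation)
predictions = ["center", "side"]  # Placeholder strings, not real model output
scores = accuracy_by_skill(pairs, predictions)
print(scores)
```

Reporting accuracy per skill rather than as a single number makes it easier to see whether the synthetic data improved the specific perception gap identified in step 1.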
Source
VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images — Action Pack