Paper·arxiv.org
machine-learning · research · data-pipelines · fine-tuning · embeddings · visionfoundry
VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
Enhance Vision-Language Models (VLMs) by generating synthetic images with explicit annotations. This method addresses VLM weaknesses in spatial understanding and viewpoint recognition, overcoming limitations of natural image datasets.
intermediate · 30 min · 5 steps
The play
- Identify VLM Perception Gaps: Recognize specific limitations in your VLM's visual perception, particularly in spatial understanding, object relationships, or viewpoint recognition, that hinder performance on critical tasks.
- Plan Synthetic Data Generation: Design a strategy to programmatically create synthetic images. Focus on generating diverse scenarios that explicitly target the identified visual perception weaknesses, with precise control over object placement, lighting, and camera angles.
- Generate Annotated Synthetic Data: Produce synthetic images, each with precise, explicit ground-truth annotations. These annotations should detail spatial relationships, object bounding boxes, 3D poses, and viewpoint information, which are often implicit or absent in natural datasets.
- Train VLMs with Synthetic Data: Integrate the newly generated, explicitly annotated synthetic dataset into your VLM's training or fine-tuning pipeline. Prioritize training phases or modules that address low-level visual skills.
- Evaluate Enhanced VLM Performance: Measure the improvement in your VLM's visual perception capabilities. Focus evaluation on tasks directly related to spatial understanding, object localization, and viewpoint recognition to confirm the effectiveness of the synthetic data augmentation.
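The planning and generation steps above can be sketched without any rendering: sample object placements programmatically, then derive the spatial label from the coordinates so the ground truth is correct by construction. This is a minimal illustration, not the paper's pipeline. The names `COLORS`, `make_scene`, `horizontal_relation`, and `make_qa` are hypothetical, and the two-square left/right task is an assumed example of a "perception gap".

```python
import random

COLORS = ["red", "blue", "green", "yellow"]  # hypothetical palette


def make_scene(rng, width=200, height=200, size=40):
    """Sample two same-sized squares whose x-ranges do not overlap."""
    while True:
        boxes = []
        for _ in range(2):
            x0 = rng.randint(0, width - size)
            y0 = rng.randint(0, height - size)
            boxes.append([x0, y0, x0 + size, y0 + size])
        # Reject layouts where "left" vs "right" would be ambiguous.
        if abs(boxes[0][0] - boxes[1][0]) >= size:
            break
    colors = rng.sample(COLORS, 2)  # two distinct colors
    return [{"label": "square", "color": c, "bbox": b}
            for c, b in zip(colors, boxes)]


def horizontal_relation(a, b):
    """Return 'left of' or 'right of' for a relative to b, from bbox centers."""
    center_a = (a["bbox"][0] + a["bbox"][2]) / 2
    center_b = (b["bbox"][0] + b["bbox"][2]) / 2
    return "left of" if center_a < center_b else "right of"


def make_qa(scene):
    """Build one question/answer pair whose answer comes from the coordinates."""
    a, b = scene
    question = (f"Is the {a['color']} square to the left of or to the right of "
                f"the {b['color']} square?")
    return {"question": question,
            "answer": horizontal_relation(a, b),
            "objects": scene}


rng = random.Random(0)  # fixed seed so the dataset is reproducible
sample = make_qa(make_scene(rng))
print(sample["question"], "->", sample["answer"])
```

Each `bbox` can be passed to a drawing call such as Pillow's `ImageDraw.rectangle` to render the matching image. Rejecting ambiguous layouts is one possible design choice; it trades some diversity for labels that are never borderline.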
Starter code
from PIL import Image, ImageDraw
import json
# Define image dimensions
width, height = 200, 200
img = Image.new('RGB', (width, height), color = 'white')
d = ImageDraw.Draw(img)
# Define a synthetic object (e.g., a red square)
# Coordinates (x0, y0, x1, y1)
bbox_coords = (50, 50, 150, 150)
d.rectangle(bbox_coords, fill='red', outline='black')
# Create explicit annotation for the synthetic object
annotation = {
    "image_id": "synthetic_square_001",
    "objects": [
        {
            "label": "square",
            "bbox": list(bbox_coords),  # Convert tuple to list for JSON
            "color": "red",
            "spatial_relation": "center",  # Example of explicit spatial info
            "viewpoint": "front"  # Example of explicit viewpoint info
        }
    ]
}
# Save image and annotation
img_filename = "synthetic_image_001.png"
json_filename = "synthetic_annotation_001.json"
img.save(img_filename)
with open(json_filename, 'w') as f:
    json.dump(annotation, f, indent=4)
print(f"Generated {img_filename} and {json_filename}")
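For the training and evaluation steps, an annotation like the one above could be turned into question/answer records and scored per skill. The sketch below assumes a hypothetical record layout (`image`, `skill`, `question`, `answer`) and exact-match scoring. Real fine-tuning frameworks each define their own data format, so treat `annotation_to_records` and `accuracy_by_skill` as illustrative names only.

```python
from collections import defaultdict


def annotation_to_records(annotation, image_path):
    """Turn one annotation dict into QA records, one per annotated attribute."""
    records = []
    for obj in annotation["objects"]:
        label = obj["label"]
        if "spatial_relation" in obj:
            records.append({
                "image": image_path,
                "skill": "spatial",
                "question": f"Where is the {label} in the image?",
                "answer": obj["spatial_relation"],
            })
        if "viewpoint" in obj:
            records.append({
                "image": image_path,
                "skill": "viewpoint",
                "question": f"From which viewpoint is the {label} shown?",
                "answer": obj["viewpoint"],
            })
    return records


def accuracy_by_skill(records, predictions):
    """Exact-match accuracy per skill; predictions align with records by index."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec, pred in zip(records, predictions):
        totals[rec["skill"]] += 1
        hits[rec["skill"]] += int(pred.strip().lower() == rec["answer"].lower())
    return {skill: hits[skill] / totals[skill] for skill in totals}


# Same structure as the annotation written by the starter code.
annotation = {
    "image_id": "synthetic_square_001",
    "objects": [{
        "label": "square",
        "bbox": [50, 50, 150, 150],
        "color": "red",
        "spatial_relation": "center",
        "viewpoint": "front",
    }],
}
records = annotation_to_records(annotation, "synthetic_image_001.png")

# Placeholder model outputs; in practice these would come from the VLM.
print(accuracy_by_skill(records, ["center", "side"]))
```

Reporting accuracy per skill rather than as one aggregate number makes it easier to see whether the synthetic data improved the specific weakness it was designed to target.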