
Practical Visual Question Answering with Vision Models

Learn to make AI agents answer questions about images. This pack covers basic prompting for Visual Question Answering (VQA), advanced techniques for reasoning, and how to handle common errors. Start building multimodal applications with GPT-4V.

Beginner | 15 min | 4 steps
The play
  1. Setup and Basic VQA Query
    First, ensure you have the OpenAI Python library installed (`pip install openai`) and your API key is set as an environment variable (`OPENAI_API_KEY`). Your first Visual Question Answering task is to ask a general question to understand the image's content.
  2. Ask Targeted Questions for Attributes
    Move beyond general descriptions. Effective Visual Question Answering involves asking for specific details. Prompt the model to identify attributes like color, texture, or the presence of specific objects.
  3. Perform Counting and Spatial Reasoning
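    Test the model's ability to count objects and understand their spatial relationships. These are common but challenging VQA tasks, so be specific: name the object to count and the reference point for any spatial question (for example, "How many mugs are on the top shelf?").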
  4. Use Chain-of-Thought for Complex Reasoning
    For questions requiring multiple steps of logic, instruct the model to "think step-by-step". This chain-of-thought prompting technique, a core part of advanced Visual Question Answering, can improve accuracy by having the model lay out its reasoning before it commits to a final answer.
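
Steps 2 to 4 can be sketched as a small set of prompt builders. This is a minimal illustration, not a prescribed method: the template wording and the names `PROMPT_TEMPLATES`, `COT_SUFFIX`, `build_question`, and `build_messages` are assumptions made for this example. Only the shape of the `messages` payload is taken from the starter code.

```python
from typing import Dict, List

# Hypothetical prompt templates, one per kind of question in steps 2 and 3.
# The wording is illustrative; adjust it to your own images and tasks.
PROMPT_TEMPLATES: Dict[str, str] = {
    "attribute": "What is the {attribute} of the {target} in this image? "
                 "Answer in one short phrase.",
    "count": "How many {target} are visible in this image? "
             "Reply with a number, then say where each one is.",
    "spatial": "Where is the {target} relative to the {reference}? "
               "Use terms such as left of, right of, above, below, in front of, or behind.",
}

# Hypothetical chain-of-thought instruction for step 4.
COT_SUFFIX = (
    " Think step-by-step: describe what you see first, then give the final "
    "answer on its own line prefixed with 'Answer:'."
)


def build_question(kind: str, step_by_step: bool = False, **fields: str) -> str:
    """Fill in a template and optionally append the chain-of-thought instruction."""
    question = PROMPT_TEMPLATES[kind].format(**fields)
    if step_by_step:
        question += COT_SUFFIX
    return question


def build_messages(question: str, image_url: str) -> List[dict]:
    """Build a `messages` payload in the same shape the starter code uses."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


if __name__ == "__main__":
    # Print one example of each kind of question.
    print(build_question("attribute", attribute="color", target="bowl"))
    print(build_question("count", target="vegetables"))
    print(build_question("spatial", target="knife", reference="cutting board",
                         step_by_step=True))
```

The list returned by `build_messages` can be passed as the `messages` argument of `client.chat.completions.create`. Keeping the prompts as plain strings makes it easy to compare different wordings on the same image.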
Starter code
# Save this as vqa_starter.py
# Run from your terminal: python vqa_starter.py
# Ensure you have OPENAI_API_KEY set in your environment.

from openai import OpenAI, OpenAIError

# Ensure you have the library installed: pip install openai

# The client automatically looks for the OPENAI_API_KEY environment variable.
# If you don't have it set, you can pass it manually: OpenAI(api_key="your-key")

try:
    client = OpenAI()
except OpenAIError as e:
    # Raised, for example, when no API key can be found.
    print(f"Error initializing OpenAI client: {e}")
    print("Please ensure your OPENAI_API_KEY environment variable is set correctly.")
    raise SystemExit(1)

# This is a public domain image of a kitchen scene.
IMAGE_URL = "https://upload.wikimedia.org/wikipedia/commons/thumb/2/2f/Kitchen_still_life.jpg/1024px-Kitchen_still_life.jpg"

# The question we want to ask the model about the image.
QUESTION = "How many distinct types of vegetables can you identify on the table? List them."

print(f"Sending Visual Question Answering request for image: {IMAGE_URL}")
print(f"Question: {QUESTION}\n")

try:
    response = client.chat.completions.create(
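        # Note: "gpt-4-vision-preview" has been deprecated by OpenAI. If the API
        # rejects it, substitute a current vision-capable model such as "gpt-4o".
        model="gpt-4-vision-preview",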
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": QUESTION
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": IMAGE_URL
                        }
                    },
                ],
            }
        ],
        max_tokens=400,  # Limit the length of the response
    )

    answer = response.choices[0].message.content
    print("--- AI Response ---")
    print(answer)
    print("-------------------")

except Exception as e:
    print(f"An error occurred during the API call: {e}")
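
The starter code catches every failure with a single `except Exception`. For transient failures such as rate limits or dropped connections, one common approach is to retry with exponential backoff. The helper below is a generic sketch using only the standard library. The name `call_with_retries` and its parameters are hypothetical and not part of the OpenAI library.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exception types with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    `sleep` is injectable so the helper can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller see the original error.
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            sleep(delay)
    # Only reached when max_attempts is less than 1 and the loop never runs.
    raise RuntimeError("max_attempts must be at least 1")
```

With the starter code, `fn` would be a zero-argument function that performs the `client.chat.completions.create` call, and `retry_on` could be a tuple such as `(openai.RateLimitError, openai.APIConnectionError)`. Which errors are worth retrying is a judgment call. Authentication failures and malformed requests will not succeed on a second attempt.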