Paper·arxiv.org
llm · machine-learning · fine-tuning · evaluation · research · context-engineering

Visual Preference Optimization with Rubric Rewards

Extend Direct Preference Optimization (DPO) for multimodal tasks with 'Rubric Rewards': structured, per-criterion feedback that captures subtle differences in visual quality which a single chosen/rejected label misses. The goal is better model alignment and fine-grained visual reasoning than coarse preference data alone provides.

advanced · 1 day · 5 steps
The play
  1. Identify Multimodal DPO Limitations
    Review your existing Direct Preference Optimization (DPO) setup for multimodal models. Pinpoint specific instances where current coarse preference data fails to capture fine-grained visual details, contextual nuances, or subtle quality differences.
  2. Design Granular Rubric Categories
    Create a structured rubric with specific, actionable criteria for evaluating multimodal outputs. Examples include 'visual fidelity', 'contextual relevance', 'object accuracy', 'style consistency', or 'absence of hallucinations'. Assign a scoring mechanism (e.g., 1-5 scale, binary pass/fail) for each criterion.
  3. Develop Data Annotation Protocol
    Establish a detailed protocol for generating rubric-based preference data. This involves either human annotators or an automated system applying your defined rubric to pairs of multimodal outputs, yielding fine-grained feedback for each specific criterion, not just an overall preference.
  4. Integrate Rubric Rewards into DPO Loss
    Modify your standard DPO loss function to incorporate the granular feedback from the rubric. This could involve creating a composite reward signal by weighting different rubric criteria, using a multi-objective loss, or directly feeding the structured rubric scores into the optimization process.
  5. Train and Evaluate with Enhanced DPO
    Apply the rubric-reward-enhanced DPO to fine-tune your multimodal model. Evaluate its performance using both traditional metrics and a new evaluation framework based on your defined rubric, specifically assessing improvements in fine-grained visual reasoning and the reduction of model 'hallucinations'.
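Step 4 leaves the exact integration open. As one possible reading, the sketch below collapses normalized rubric scores into a weighted composite reward and uses the reward gap between the two outputs as a margin inside the standard DPO logistic loss. The function names, the `weights` mapping, and `margin_scale` are illustrative assumptions, not the paper's formulation; log-probabilities are passed in as plain floats to keep the example framework-free.

```python
import math


def composite_reward(scores, weights):
    """Weighted average of rubric scores that are already normalized to [0, 1].

    `scores` and `weights` are dicts keyed by criterion name (hypothetical layout).
    """
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total_weight


def rubric_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp,
                    reward_chosen, reward_rejected,
                    beta=0.1, margin_scale=1.0):
    """DPO loss for one pair, with a rubric-derived margin (an assumed variant)."""
    # Standard DPO logit: beta times the difference of policy/reference log-ratios.
    logit = beta * ((policy_chosen_logp - ref_chosen_logp)
                    - (policy_rejected_logp - ref_rejected_logp))
    # Assumption: a larger rubric gap demands a larger separation between the pair.
    margin = margin_scale * (reward_chosen - reward_rejected)
    z = logit - margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Usage: the chosen output scores higher on the rubric than the rejected one.
weights = {"visual_fidelity": 2.0, "contextual_relevance": 1.0, "object_accuracy": 1.0}
r_chosen = composite_reward(
    {"visual_fidelity": 1.0, "contextual_relevance": 0.5, "object_accuracy": 1.0}, weights)
r_rejected = composite_reward(
    {"visual_fidelity": 0.5, "contextual_relevance": 0.75, "object_accuracy": 0.0}, weights)
loss = rubric_dpo_loss(-12.0, -15.0, -13.0, -14.0, r_chosen, r_rejected)
```

With `margin_scale=0.0` this reduces to the plain DPO loss, which makes it easy to compare against an existing baseline. A multi-objective loss or per-criterion weighting inside the loss, both mentioned in step 4, would be alternative designs.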
Starter code
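# Rubric template: each criterion declares a scale plus anchor descriptions.
# On the 1-5 scales only scores 1, 3, and 5 carry anchor text.
rubric_template = {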
    "visual_fidelity": {
        "description": "How accurately does the output reflect visual details?",
        "scale": "1-5",
        "criteria": {
            "1": "Major distortions/inaccuracies",
            "3": "Minor inaccuracies, generally acceptable",
            "5": "Perfectly faithful reproduction"
        }
    },
    "contextual_relevance": {
        "description": "How well does the output align with the given prompt/context?",
        "scale": "1-5",
        "criteria": {
            "1": "Completely irrelevant",
            "3": "Partially relevant, some inconsistencies",
            "5": "Highly relevant and coherent"
        }
    },
    "object_accuracy": {
        "description": "Are specific objects or entities correctly depicted and placed?",
        "scale": "binary",
        "criteria": {
            "true": "All specified objects are accurate",
            "false": "Errors in object depiction or placement"
        }
    }
}
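To connect a rubric like this to step 3, the following sketch normalizes raw annotations onto [0, 1] and turns two scored outputs into a preference record that keeps the per-criterion margins alongside the overall choice. It is a minimal illustration under stated assumptions: `example_rubric` mirrors only the `scale` fields of the starter template, criteria are weighted equally, and the record layout (`chosen`, `rejected`, `per_criterion_margin`) is hypothetical.

```python
def normalize(score, scale):
    """Map a raw rubric score to [0, 1] for the "1-5" and "binary" scales."""
    if scale == "binary":
        return 1.0 if score else 0.0
    if scale == "1-5":
        if not 1 <= score <= 5:
            raise ValueError(f"score {score!r} is outside the 1-5 scale")
        return (score - 1) / 4
    raise ValueError(f"unknown scale: {scale!r}")


def build_preference_pair(rubric, scores_a, scores_b):
    """Compare two annotated outputs; return None when the rubric sees a tie."""
    deltas = {}
    for name, spec in rubric.items():
        norm_a = normalize(scores_a[name], spec["scale"])
        norm_b = normalize(scores_b[name], spec["scale"])
        deltas[name] = norm_a - norm_b
    # Assumption: equal weight per criterion decides the overall preference.
    total = sum(deltas.values())
    if total == 0:
        return None
    sign = 1.0 if total > 0 else -1.0
    return {
        "chosen": "a" if total > 0 else "b",
        "rejected": "b" if total > 0 else "a",
        # Positive values favor the chosen output on that criterion.
        "per_criterion_margin": {name: sign * d for name, d in deltas.items()},
    }


# Scales copied from the starter template; descriptions omitted for brevity.
example_rubric = {
    "visual_fidelity": {"scale": "1-5"},
    "contextual_relevance": {"scale": "1-5"},
    "object_accuracy": {"scale": "binary"},
}
pair = build_preference_pair(
    example_rubric,
    {"visual_fidelity": 5, "contextual_relevance": 3, "object_accuracy": True},
    {"visual_fidelity": 3, "contextual_relevance": 4, "object_accuracy": False},
)
```

Here output `a` is chosen overall even though it loses on `contextual_relevance`. Keeping that negative margin in the record preserves the fine-grained signal that a single overall label would discard.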
Source
Visual Preference Optimization with Rubric Rewards — Action Pack