Visual Preference Optimization with Rubric Rewards

Enhance Direct Preference Optimization (DPO) for multimodal tasks with 'Rubric Rewards': structured, per-criterion feedback that captures subtle visual quality differences a single overall preference label misses. The aim is better model alignment and fine-grained visual reasoning than coarse preference data alone supports.

advanced · 1 day · 5 steps

The play

Identify Multimodal DPO Limitations
Review your existing Direct Preference Optimization (DPO) setup for multimodal models. Pinpoint specific instances where current coarse preference data fails to capture fine-grained visual details, contextual nuances, or subtle quality differences.
Design Granular Rubric Categories
Create a structured rubric with specific, actionable criteria for evaluating multimodal outputs. Examples include 'visual fidelity', 'contextual relevance', 'object accuracy', 'style consistency', or 'absence of hallucinations'. Assign a scoring mechanism (e.g., 1-5 scale, binary pass/fail) for each criterion.
Develop Data Annotation Protocol
Establish a detailed protocol for generating rubric-based preference data. This involves either human annotators or an automated system applying your defined rubric to pairs of multimodal outputs, yielding fine-grained feedback for each specific criterion, not just an overall preference.
Integrate Rubric Rewards into DPO Loss
Modify the standard DPO loss, which sees only which output of a pair was preferred, so that it also uses the granular feedback from the rubric. Options include building a composite reward signal that weights the rubric criteria, using a multi-objective loss, or feeding the structured rubric scores directly into the optimization process.
Train and Evaluate with Enhanced DPO
Apply the rubric-reward-enhanced DPO to fine-tune your multimodal model. Evaluate its performance using both traditional metrics and a new evaluation framework based on your defined rubric, specifically assessing improvements in fine-grained visual reasoning and the reduction of model 'hallucinations'.
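One way the loss change in step 4 could look is sketched below in plain Python for a single preference pair. It is an assumption, not the formulation from the source paper: the rubric enters as a margin that the policy's implicit reward gap has to clear. The names `rubric_dpo_loss`, `rubric_gap`, and `margin_scale` are hypothetical, and `rubric_gap` is assumed to be the chosen output's composite rubric score minus the rejected output's, both already normalized to [0, 1].

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def rubric_dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    rubric_gap: float,
    beta: float = 0.1,
    margin_scale: float = 1.0,
) -> float:
    """DPO loss for one pair with a rubric-derived margin (illustrative).

    With rubric_gap == 0 this reduces to the standard DPO loss:
        -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
    A larger rubric_gap demands a larger implicit reward gap before
    the loss becomes small. This margin form is an assumption.
    """
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    implicit_gap = beta * (chosen_log_ratio - rejected_log_ratio)
    return -log_sigmoid(implicit_gap - margin_scale * rubric_gap)


# Sequence log-probabilities below are made-up numbers for illustration.
plain = rubric_dpo_loss(-10.0, -12.0, -10.5, -11.5, rubric_gap=0.0)
with_margin = rubric_dpo_loss(-10.0, -12.0, -10.5, -11.5, rubric_gap=0.5)
print(f"standard DPO loss: {plain:.4f}")
print(f"with rubric margin: {with_margin:.4f}")
```

In a real training loop the same expression would be computed on batched tensors in your framework of choice. `margin_scale` controls how strongly the rubric influences the objective and would need tuning alongside `beta`.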

Starter code

rubric_template = {
    "visual_fidelity": {
        "description": "How accurately does the output reflect visual details?",
        "scale": "1-5",
        "criteria": {
            "1": "Major distortions/inaccuracies",
            "3": "Minor inaccuracies, generally acceptable",
            "5": "Perfectly faithful reproduction"
        }
    },
    "contextual_relevance": {
        "description": "How well does the output align with the given prompt/context?",
        "scale": "1-5",
        "criteria": {
            "1": "Completely irrelevant",
            "3": "Partially relevant, some inconsistencies",
            "5": "Highly relevant and coherent"
        }
    },
    "object_accuracy": {
        "description": "Are specific objects or entities correctly depicted and placed?",
        "scale": "binary",
        "criteria": {
            "true": "All specified objects are accurate",
            "false": "Errors in object depiction or placement"
        }
    }
}
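A template like the one above still has to be turned into numbers before it can drive training. The sketch below shows one possible way to normalize raw annotations to [0, 1], report a per-criterion gap between two outputs, and collapse the scores into a weighted composite. The helper names, the weights, and the example annotations are all hypothetical. The block carries its own compact copy of the `scale` fields so it runs standalone; passing `rubric_template` itself should work the same way, since only the `scale` key is read.

```python
def normalize_score(scale: str, value) -> float:
    """Map a raw rubric score to [0, 1] according to its scale string."""
    if scale == "binary":
        if not isinstance(value, bool):
            raise ValueError(f"binary criterion needs True/False, got {value!r}")
        return 1.0 if value else 0.0
    # Numeric scales are written as "low-high", e.g. "1-5".
    low, high = (int(part) for part in scale.split("-"))
    if isinstance(value, bool) or not low <= value <= high:
        raise ValueError(f"score {value!r} is outside scale {scale}")
    return (value - low) / (high - low)


def criterion_gaps(rubric: dict, chosen: dict, rejected: dict) -> dict:
    """Per-criterion normalized difference: chosen minus rejected."""
    return {
        name: normalize_score(spec["scale"], chosen[name])
        - normalize_score(spec["scale"], rejected[name])
        for name, spec in rubric.items()
    }


def composite_score(rubric: dict, scores: dict, weights: dict) -> float:
    """Weighted average of normalized criterion scores, in [0, 1]."""
    total_weight = sum(weights[name] for name in rubric)
    weighted = sum(
        weights[name] * normalize_score(spec["scale"], scores[name])
        for name, spec in rubric.items()
    )
    return weighted / total_weight


# Compact stand-in for rubric_template: only the "scale" field is used.
rubric = {
    "visual_fidelity": {"scale": "1-5"},
    "contextual_relevance": {"scale": "1-5"},
    "object_accuracy": {"scale": "binary"},
}
# Hypothetical weights and annotations for one preference pair.
weights = {"visual_fidelity": 0.5, "contextual_relevance": 0.3, "object_accuracy": 0.2}
chosen = {"visual_fidelity": 5, "contextual_relevance": 4, "object_accuracy": True}
rejected = {"visual_fidelity": 3, "contextual_relevance": 4, "object_accuracy": False}

gaps = criterion_gaps(rubric, chosen, rejected)
rubric_gap = composite_score(rubric, chosen, weights) - composite_score(rubric, rejected, weights)
print(gaps)
print(f"composite gap: {rubric_gap:.3f}")
```

Keeping the per-criterion gaps next to the composite preserves the fine-grained signal: the composite says how strongly one output is preferred, while the gaps say why.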

Source

Paper: arxiv.org