Paper·arxiv.org
machine-learning · evaluation · research · content-creation · llm · avgen-bench

AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

AVGen-Bench is a task-driven benchmark for Text-to-Audio-Video (T2AV) generation models. Its multi-granular evaluation framework assesses the joint correctness and coherence of the generated audio and video, rather than scoring each modality in isolation. The aim is to replace fragmented evaluation practice with a more robust assessment of T2AV models.

intermediate · 1 hour · 5 steps
The play
  1. Understand T2AV Evaluation Challenges
    Identify the limitations of current Text-to-Audio-Video (T2AV) model evaluation, specifically focusing on isolated audio/video assessments and coarse embedding similarities that miss fine-grained multimodal interactions.
  2. Grasp AVGen-Bench Principles
    Review the core concepts of AVGen-Bench: its task-driven approach, multi-granular evaluation, and emphasis on measuring the joint correctness and coherence between generated audio and video.
  3. Design Joint Coherence Metrics
    Develop or adapt evaluation metrics for your T2AV models that specifically assess the multimodal coherence and joint correctness between the audio and video components, moving beyond isolated quality checks.
  4. Integrate Benchmark Approach
    Incorporate the principles of AVGen-Bench into your model's evaluation pipeline. This may involve setting up specific tasks to test audio-video alignment and using more granular metrics to quantify performance.
  5. Compare and Refine Models
    Use the improved evaluation framework to rigorously compare your T2AV model's performance against baselines or other models, identifying areas for improvement in multimodal generation.
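
As a rough illustration of step 3, the sketch below scores audio-video synchrony for a single clip. It is not the metric defined by AVGen-Bench. It assumes you have already extracted two per-frame signals of equal length, an audio energy envelope and a video motion magnitude, and it reports the best Pearson correlation over a small range of frame lags. The names `pearson`, `sync_score`, and `max_lag` are hypothetical.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def sync_score(
    audio_energy: Sequence[float],
    video_motion: Sequence[float],
    max_lag: int = 3,
) -> Tuple[float, int]:
    """Return (best correlation, lag in frames) over lags in [-max_lag, max_lag].

    A positive lag means the audio trails the video by that many frames.
    This is an illustrative proxy for audio-video coherence, not a benchmark metric.
    """
    n = len(audio_energy)
    if n != len(video_motion):
        raise ValueError("signals must have the same number of frames")
    best = (-1.0, 0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, v = audio_energy[lag:], video_motion[: n - lag]
        else:
            a, v = audio_energy[:lag], video_motion[-lag:]
        if len(a) < 2:
            continue  # overlap too short to correlate
        r = pearson(a, v)
        if r > best[0]:
            best = (r, lag)
    return best


if __name__ == "__main__":
    # Toy signals: the audio events occur two frames after the visual events.
    motion = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
    energy = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
    print(sync_score(energy, motion))
```

On the toy signals the best lag is 2 frames, so the score exposes a timing offset that isolated audio or video quality checks would not see. A single correlation like this is still coarse, so for fine-grained evaluation you would likely pair it with task-specific checks.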
Starter code
avgen-bench evaluate \
  --model-outputs-dir /path/to/your/t2av_generations \
  --prompts-file /path/to/evaluation_prompts.json \
  --config-preset multimodal_coherence \
  --output-report /path/to/evaluation_report.json
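
For step 5, once each model has a score per prompt, a paired comparison is one way to judge whether a difference is consistent or noise. The sketch below is an assumption-laden example: the report schema is not specified here, so it takes plain Python lists of per-prompt scores (same prompts, same order) rather than reading the report file. The function name `paired_bootstrap` and its parameters are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean difference A - B, fraction of resamples in which A beats B).

    Scores must be aligned: index i in both lists refers to the same prompt.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the comparison is reproducible
    wins = 0
    for _ in range(n_resamples):
        # Resample prompts with replacement, keeping each A/B pair together.
        sample = [rng.choice(diffs) for _ in diffs]
        if mean(sample) > 0:
            wins += 1
    return mean(diffs), wins / n_resamples


if __name__ == "__main__":
    model_a = [0.8, 0.7, 0.9, 0.6]  # made-up per-prompt scores
    model_b = [0.5, 0.4, 0.6, 0.3]
    print(paired_bootstrap(model_a, model_b))
```

A win fraction near 1.0 suggests model A is consistently ahead on this prompt set, while a value near 0.5 suggests the gap is within resampling noise. Resampling prompts as pairs keeps prompt difficulty from confounding the comparison.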
Source
AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation — Action Pack