AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

AVGen-Bench is a task-driven benchmark for Text-to-Audio-Video (T2AV) generation models. It provides a multi-granular evaluation framework that assesses the joint correctness and coherence of audio and video outputs. The aim is to replace fragmented evaluation, in which audio and video are scored separately, with a more robust assessment of T2AV models.

Intermediate, 1 hour, 5 steps

The play

Understand T2AV Evaluation Challenges
Identify the limitations of current Text-to-Audio-Video (T2AV) model evaluation: audio and video are assessed in isolation, and coarse embedding similarities miss fine-grained multimodal interactions.
Grasp AVGen-Bench Principles
Review the core concepts of AVGen-Bench: its task-driven approach, multi-granular evaluation, and emphasis on measuring the joint correctness and coherence between generated audio and video.
Design Joint Coherence Metrics
Develop or adapt evaluation metrics for your T2AV models that specifically assess the multimodal coherence and joint correctness between the audio and video components, moving beyond isolated quality checks.
Integrate Benchmark Approach
Incorporate the principles of AVGen-Bench into your model's evaluation pipeline. This may involve setting up specific tasks to test audio-video alignment and using more granular metrics to quantify performance.
Compare and Refine Models
Use the improved evaluation framework to rigorously compare your T2AV model's performance against baselines or other models, identifying areas for improvement in multimodal generation.
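
As a rough illustration of the metric-design and integration steps, here is a minimal sketch of one possible joint coherence metric. It is not the AVGen-Bench implementation. It assumes you have already extracted two per-frame signals on the same time grid: an audio loudness or onset envelope and a video motion-energy curve. The function names (`pearson`, `sync_score`, `aggregate_by_task`), the lag convention, and the task labels are all hypothetical choices made for this example.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def sync_score(audio_env, motion_energy, max_lag=5, min_overlap=4):
    """Return (best_corr, best_lag) over lags in [-max_lag, max_lag].

    Assumed convention: a positive lag means the audio trails the video
    by that many frames.
    """
    if len(audio_env) != len(motion_energy):
        raise ValueError("signals must be sampled on the same frame grid")
    n = len(audio_env)
    if n < min_overlap:
        raise ValueError("signals are too short to correlate")
    best_corr, best_lag = -1.0, 0
    for lag in range(-max_lag, max_lag + 1):
        # Frame indices i for which both motion[i] and audio[i + lag] exist.
        lo, hi = max(0, -lag), min(n, n - lag)
        if hi - lo < min_overlap:
            continue
        corr = pearson(audio_env[lo + lag:hi + lag], motion_energy[lo:hi])
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return best_corr, best_lag


def aggregate_by_task(records):
    """Group (task_name, score) pairs into per-task means plus a macro average."""
    by_task = {}
    for task, score in records:
        by_task.setdefault(task, []).append(score)
    per_task = {task: mean(scores) for task, scores in by_task.items()}
    report = dict(per_task)
    report["overall_macro"] = mean(list(per_task.values()))
    return report


# Toy example: the audio envelope is the motion curve delayed by two frames.
motion = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
audio = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
corr, lag = sync_score(audio, motion)
print(f"best correlation {corr:.2f} at lag {lag} frames")
```

In the toy example the best alignment is found at a lag of two frames, which a per-modality quality score would not reveal. Reporting per-task means alongside the macro average is one simple way to keep results granular instead of collapsing them into a single number.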

Starter code

avgen-bench evaluate \
  --model-outputs-dir /path/to/your/t2av_generations \
  --prompts-file /path/to/evaluation_prompts.json \
  --config-preset multimodal_coherence \
  --output-report /path/to/evaluation_report.json
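
For the comparison step, one common option (an assumption here, not something the benchmark prescribes) is a paired bootstrap over per-prompt scores, which indicates whether a gap between your model and a baseline is larger than prompt-to-prompt noise. The sketch below assumes both models were scored on the same prompts in the same order; `paired_bootstrap` and its parameters are hypothetical names for this example.

```python
import random
from statistics import mean


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (mean_diff, ci_low, ci_high) for scores_a - scores_b.

    The interval is a 95% percentile bootstrap over prompts, resampling
    the per-prompt differences with replacement.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(diffs)
    resampled = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    low = resampled[int(0.025 * n_resamples)]
    high = resampled[min(n_resamples - 1, int(0.975 * n_resamples))]
    return mean(diffs), low, high


# Made-up per-prompt coherence scores for two models on the same four prompts.
model_scores = [0.8, 0.6, 0.9, 0.7]
baseline_scores = [0.7, 0.5, 0.8, 0.6]
diff, low, high = paired_bootstrap(model_scores, baseline_scores)
print(f"mean difference {diff:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the difference is unlikely to be an artifact of which prompts were sampled. With only a handful of prompts the interval is crude, so treat it as a sanity check, not a formal test.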

Source

Paper: arxiv.org