AVVA

Audio-Video Vector Alignment
Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio–Video Foundation Model

Ali Vosoughi*¹, Dimitra Emmanouilidou², Hannes Gamper²
¹University of Rochester  •  ²Microsoft Research
*Work completed during internship at Microsoft Research

Key Achievements

192 hours of curated training data
4.5× better video-to-audio retrieval
5 alignment scoring metrics
3 evaluation datasets

AVVA achieves significant improvements in video-to-audio retrieval across the AudioCaps, VALOR, and VGGSound datasets using only 192 hours of LLM-curated training data, demonstrating that data quality can effectively substitute for data quantity.

Architecture Overview

AVVA Framework Architecture

Figure 1: Overview of the proposed audiovisual alignment approach showing (a) data curation stage and (b) AVVA model training

AVVA pairs Whisper, a speech-based audio foundation model, with DINOv2 as the visual encoder in a dual-encoder architecture trained with contrastive learning on audio-video pairs, enabling direct audio-video alignment without textual mediation during training.
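A minimal sketch of this dual-encoder setup, assuming frozen Whisper and DINOv2 backbones with learned projection heads; the feature dimensions, CLIP-style temperature, and symmetric InfoNCE loss are illustrative assumptions, not the exact AVVA implementation:

```python
import torch
import torch.nn.functional as F

class DualEncoderHead(torch.nn.Module):
    """Projects frozen backbone features into a shared audio-video space."""
    def __init__(self, audio_dim=1280, video_dim=1024, embed_dim=512):
        super().__init__()
        self.audio_proj = torch.nn.Linear(audio_dim, embed_dim)  # on Whisper features
        self.video_proj = torch.nn.Linear(video_dim, embed_dim)  # on DINOv2 features
        self.logit_scale = torch.nn.Parameter(torch.tensor(2.659))  # log(1/0.07), CLIP-style

    def forward(self, audio_feats, video_feats):
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return a, v

def contrastive_loss(a, v, logit_scale):
    """Symmetric InfoNCE over a batch of paired audio-video clips."""
    logits = logit_scale.exp() * a @ v.t()          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```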

Performance Results

AudioCaps V2A Top-1: 6.23% (vs. 1.40% DenseAV baseline)
VALOR V2A Top-1: 7.75% (vs. 2.20% DenseAV baseline)
VGGSound V2A Top-1: 6.86% (vs. 1.60% DenseAV baseline)
Data Curation Benefit: data quality substitutes for data quantity
Performance vs Training Data

Figure 4: Audio-to-video model performance over hours of training data, showing consistent improvements with data curation across Top-k accuracies
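For reference, the Top-k retrieval numbers above are conventionally computed by ranking all candidate audio clips for each video embedding; a hedged sketch (the function name and inputs are illustrative):

```python
import torch

def topk_retrieval_accuracy(video_emb, audio_emb, ks=(1, 5, 10)):
    """Video-to-audio retrieval: for each video, rank all audio clips by
    cosine similarity and check whether the paired clip is in the top k."""
    v = torch.nn.functional.normalize(video_emb, dim=-1)
    a = torch.nn.functional.normalize(audio_emb, dim=-1)
    sims = v @ a.t()                               # (N, N); row i pairs with column i
    ranks = sims.argsort(dim=-1, descending=True)  # audio indices, best match first
    targets = torch.arange(v.size(0)).unsqueeze(1)
    hits = ranks == targets                        # one True per row, at the pair's rank
    return {k: hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
```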

Multimodal Reasoning Engine

AVVA's data curation uses LLMs to score audio-video alignment, effectively trading data quantity for data quality.

Multimodal Reasoning Engine Architecture

Figure 2: The MRE integrates audio-LLM and video-LLM outputs into Mistral LLM for audiovisual scene alignment scoring
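A minimal sketch of the MRE scoring step, assuming the audio-LLM and video-LLM each produce a textual description that is folded into a prompt for the scoring LLM; the prompt wording and the generic `llm_generate` callable are hypothetical stand-ins, not Mistral's actual API:

```python
import re

PROMPT = """You are scoring how well an audio clip matches a video clip.
Audio description: {audio_desc}
Video description: {video_desc}
Rate the overall audio-visual alignment on a 0-10 scale.
Answer with a single number."""

def score_alignment(llm_generate, audio_desc, video_desc):
    """Queries a scoring LLM (e.g., Mistral) and parses a numeric score.
    `llm_generate` is any callable mapping a prompt string to a completion."""
    reply = llm_generate(PROMPT.format(audio_desc=audio_desc,
                                       video_desc=video_desc))
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else None
```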

Temporal Alignment

How well do audio events match the timing of video events?

Example: A clap sound syncing with hands meeting in the video
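Temporal alignment is one of the five alignment scoring metrics; a sketch of how per-metric scores might be aggregated into a single curation score. Every field name except `temporal`, and the equal weighting, are assumptions for illustration:

```python
from dataclasses import dataclass, fields

@dataclass
class AlignmentScores:
    """One LLM score (0-10) per alignment criterion; only 'temporal' is
    named in the text, the remaining field names are illustrative."""
    temporal: float
    semantic: float = 0.0
    spatial: float = 0.0
    source: float = 0.0
    context: float = 0.0

    def overall(self):
        vals = [getattr(self, f.name) for f in fields(self)]
        return sum(vals) / len(vals)  # assumed equal weighting
```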

Open Source & Reproducible

AVVA implements a scoring mechanism for selecting well-aligned training segments, yielding significant improvements in top-k video-to-audio retrieval accuracy on AudioCaps, VALOR, and VGGSound compared to existing methods.
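As a closing illustration, a sketch of the selection step implied above: keep only segments whose alignment score clears a threshold, yielding the curated training subset (the threshold and data layout are illustrative, not the paper's exact values):

```python
def curate_segments(segments, min_score=7.0):
    """Keep audio-video segments whose LLM alignment score passes the cut.
    Each segment is a dict like {"path": ..., "score": float}; the 7.0
    threshold is an illustrative choice, not the paper's exact value."""
    kept = [s for s in segments if s["score"] >= min_score]
    print(f"Curated {len(kept)}/{len(segments)} segments "
          f"({100 * len(kept) / max(len(segments), 1):.1f}%)")
    return kept
```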