New frameworks and benchmarks advance audio-visual generation
ByPulseAugur Editorial·[14 sources]·
Researchers have introduced OmniCustom, a framework for customizing both video identity and audio timbre simultaneously from reference images and audio. This DiT-based model uses separate LoRA modules for identity and timbre control, enhanced by a contrastive learning objective. Separately, the NAVA framework offers native audio-visual alignment for joint generation, improving synchronization and timbre controllability with a 6.3B parameter model. Additionally, LongAV-Compass has been developed as a benchmark for evaluating minute-long audio-visual generation across various conditioning modalities, assessing consistency and alignment over extended durations.
AI
IMPACT
New models and benchmarks improve control and evaluation for audio-visual generation, pushing the boundaries of synchronized media synthesis.
RANK_REASON
Multiple research papers introducing new models and benchmarks for audio-visual generation.
arXiv:2602.12304v4 Announce Type: replace-cross Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, thi…
Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual c…
NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising.
Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5--10 second text-conditioned generation and rarely support unified eva…
LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences.
arXiv:2512.21094v2 Announce Type: replace Abstract: Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly s…
arXiv:2606.03183v1 Announce Type: cross Abstract: Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substant…
arXiv:2605.20183v2 Announce Type: replace Abstract: Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing b…
arXiv:2605.30339v1 Announce Type: new Abstract: Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical corr…
arXiv cs.CV
TIER_1English(EN)·Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He·
arXiv:2605.30073v1 Announce Type: new Abstract: Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fu…
Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this …
Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual c…
arXiv:2605.26244v1 Announce Type: new Abstract: Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5--10 second text-condi…