New frameworks and benchmarks advance audio-visual generation

By PulseAugur Editorial · [14 sources] · 2026-05-25 00:00

Researchers have introduced OmniCustom, a framework that customizes video identity and audio timbre simultaneously from reference images and reference audio. The DiT-based model uses separate LoRA modules for identity and timbre control, enhanced by a contrastive learning objective. Separately, the NAVA framework performs native audio-visual alignment for joint generation, improving synchronization and timbre controllability with a 6.3B-parameter model. Finally, LongAV-Compass is a benchmark for evaluating minute-long audio-visual generation under text, image, and video conditioning (T2AV, I2AV, V2AV), assessing quality, consistency, and alignment over extended durations.

IMPACT New models and benchmarks improve control over identity and timbre in joint audio-visual generation and extend evaluation from short clips to minute-long content.

RANK_REASON Multiple research papers introducing new models and benchmarks for audio-visual generation.

AI-generated summary · Google Gemini · from 14 sources.

COVERAGE [14]

arXiv cs.AI TIER_1 English(EN) · Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao · 2026-06-02 04:00

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

arXiv:2606.01031v1 Announce Type: cross Abstract: Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does…
arXiv cs.AI TIER_1 Italian(IT) · Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu · 2026-05-29 04:00

OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

arXiv:2602.12304v4 Announce Type: replace-cross Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, thi…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-28 15:21

Native Audio-Visual Alignment for Generation

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual c…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-28 00:00

Native Audio-Visual Alignment for Generation

NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising.
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-25 18:12

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5–10 second text-conditioned generation and rarely support unified eva…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-25 00:00

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences.
arXiv cs.CV TIER_1 English(EN) · Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu · 2026-06-03 04:00

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

arXiv:2512.21094v2 Announce Type: replace Abstract: Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly s…
arXiv cs.CV TIER_1 English(EN) · Jaemin Jung, Kyeongha Rho, Inkyu Shin, Joon Son Chung · 2026-06-03 04:00

Inference-Time Scaling for Joint Audio-Video Generation

arXiv:2606.03183v1 Announce Type: cross Abstract: Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substant…
arXiv cs.CV TIER_1 English(EN) · Yujie Wei, Yujin Han, Zhekai Chen, Yongming Li, Kaixun Jiang, Zhihang Liu, Quanhao Li, Zhiwu Qing, Xiang Wang, Zhen Xing, Ruihang Chu, Lingyi Hong, Yefei He, Junjie Zhou, Junqiu Yu, Yang Shi, Difan Zou, Kai Zhu, Shiwei Zhang, Yingya Zhang, Yu Liu, Xihui … · 2026-06-02 04:00

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

arXiv:2605.20183v2 Announce Type: replace Abstract: Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing b…
arXiv cs.CV TIER_1 English(EN) · Tingle Li, Siddharth Gururani, Kevin J. Shih, Gantavya Bhatt, Sang-gil Lee, Zhifeng Kong, Arushi Goel, Gopala Anumanchipalli, Ming-Yu Liu · 2026-05-29 04:00

Benchmarking Single-Factor Physical Video-to-Audio Generation

arXiv:2605.30339v1 Announce Type: new Abstract: Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical corr…
arXiv cs.CV TIER_1 English(EN) · Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He · 2026-05-29 04:00

Native Audio-Visual Alignment for Generation

arXiv:2605.30073v1 Announce Type: new Abstract: Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fu…
arXiv cs.CV TIER_1 English(EN) · Ming-Yu Liu · 2026-05-28 17:59

Benchmarking Single-Factor Physical Video-to-Audio Generation

Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this …
arXiv cs.CV TIER_1 English(EN) · Jingzhou He · 2026-05-28 15:21

Native Audio-Visual Alignment for Generation

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual c…
arXiv cs.CV TIER_1 English(EN) · Tengfei Liu, Yang Shi, Xuanyu Zhu, Jiafu Tang, Liu Yang, Qixun Wang, Zhuoran Zhang, Yuqi Tang, Fengxiang Wang, Yuhao Dong, Xinlong Chen, Bozhou Li, Bohan Zeng, Yue Ding, Xiaohan Zhang, Jialu Chen, Haotian Wang, Yuanxing Zhang, Pengfei Wan, Leye Wang · 2026-05-27 04:00

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

arXiv:2605.26244v1 Announce Type: new Abstract: Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5–10 second text-condi…
