Baidu ERNIE Team releases NAVA audio-visual generation model

By PulseAugur Editorial · [1 source] · 2026-05-29 02:17

Baidu's ERNIE Team has released NAVA, a 6.3-billion-parameter model that generates synchronized audio and video from a single text prompt. NAVA uses an Align-then-Fuse MMDiT architecture and achieves state-of-the-art performance on benchmarks such as Verse-Bench for audio-visual synchronization and video quality. The model can generate one minute of 720p video with synchronized audio in approximately one minute, and it offers features such as precise multi-timbre control and camera control specified through language descriptions.

IMPACT Sets a new state of the art on audio-visual synchronization benchmarks with a smaller parameter count (6.3 billion), potentially lowering the barrier to high-quality audio-visual generation.

RANK_REASON Model release from a significant AI lab (Baidu ERNIE Team) with accompanying paper and technical details.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Hugging Face Trending Models · ernie-research · 2026-05-29 02:17

ernie-research/NAVA

text-to-video · 104 downloads · 41 likes
