Baidu releases NAVA, a 6.3B-parameter audio-visual generation model

By PulseAugur Editorial · [1 source] · 2026-05-29 02:17

Baidu has released NAVA, a 6.3-billion-parameter model that generates synchronized audio and video from a single text prompt. The model uses an Align-then-Fuse MMDiT architecture and achieves state-of-the-art performance on audio-visual synchronization benchmarks. NAVA can produce one-minute 720p videos with stereo audio in approximately one minute, and it offers precise control over speaker voice timbre.

IMPACT Sets a new SOTA on audio-visual synchronization benchmarks with a significantly smaller parameter count.

RANK_REASON The cluster describes a new model release with a corresponding paper and benchmark results.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Hugging Face Trending Models TIER_1 Bahasa(ID) · baidu · 2026-05-29 02:17

baidu/NAVA

text-to-video · 159 downloads · 55 likes
