Nava - A 6.3B audio-video model

NAVA model generates synchronized audio and video from text prompts

By PulseAugur Editorial · [1 source] · 2026-05-29 18:35

A new 6.3-billion-parameter model named NAVA has been released. It generates synchronized audio and video from a single text prompt and supports multi-speaker speech control and image-conditioned continuations. NAVA uses an Align-then-Fuse MMDiT architecture that establishes audio-video correspondence before fusing context, achieving state-of-the-art results on the Verse-Bench benchmark with significantly fewer parameters than existing open-source models.
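The summary gives only a one-sentence description of Align-then-Fuse, so the following is a toy sketch of one possible reading, not NAVA's actual implementation. It assumes that "align" means each modality first cross-attends to the other, and that "fuse" means joint self-attention over the concatenated token sequence, as in MMDiT-style blocks. The function names (`attend`, `align_then_fuse`), the token counts, and the embedding width are all hypothetical, and learned projections, normalization, and timestep conditioning are omitted.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention on lists of equal-width vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def residual_add(xs, ys):
    """Element-wise sum of two equally shaped token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def align_then_fuse(audio, video):
    # Stage 1 (align, assumed): each stream attends only to the other
    # modality, so cross-modal correspondence is set up first.
    audio_aligned = residual_add(audio, attend(audio, video, video))
    video_aligned = residual_add(video, attend(video, audio, audio))
    # Stage 2 (fuse, assumed): joint self-attention over the
    # concatenated audio and video tokens.
    joint = audio_aligned + video_aligned
    fused = residual_add(joint, attend(joint, joint, joint))
    return fused[:len(audio)], fused[len(audio):]


random.seed(0)
dim = 4  # hypothetical embedding width
audio_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
video_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
audio_out, video_out = align_then_fuse(audio_tokens, video_tokens)
print(len(audio_out), len(video_out), len(audio_out[0]))
```

Each stream keeps its own length and width, so the two stages can be stacked; whether NAVA orders or parameterizes them this way is not stated in the source.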

IMPACT This model advances synchronized audio-video generation, potentially impacting content creation and media synthesis.



AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

r/StableDiffusion · /u/AgeNo5351 · 2026-05-29 18:35

Nava - A 6.3B audio-video model.

https://www.reddit.com/r/StableDiffusion/comments/1trb93v/nava_a_63b_audiovideo_model/
