baidu/NAVA
Baidu has released NAVA, a 6.3-billion-parameter model that generates synchronized audio and video from a single text prompt. The model uses an Align-then-Fuse MMDiT architecture and reports state-of-the-art results on audio-visual synchronization benchmarks. NAVA can produce a one-minute 720p video with stereo audio in approximately one minute, and it offers precise control over speaker voice timbre.
IMPACT Sets new SOTA on audio-visual synchronization benchmarks with a significantly smaller parameter count.