PulseAugur

Omni2Sound model unifies video, text to audio generation with new dataset

Researchers have developed Omni2Sound, a unified diffusion model that generates audio from video, text, or both together. The model addresses two foundational challenges, data scarcity and cross-task competition, by introducing SoundAtlas, a large-scale dataset with tightly aligned audio captions, and a novel three-stage progressive training schedule. Omni2Sound achieves state-of-the-art performance across video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation within a single model, demonstrating strong generalization.
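Neither the summary nor the excerpted abstract says how one set of weights serves all three tasks. A common pattern for this kind of unification is to make each conditioning modality optional and substitute a learned null embedding when it is absent; the sketch below illustrates that idea only. The class, dimensions, encoders, and token layout are assumptions, not Omni2Sound's actual architecture.

```python
import torch
import torch.nn as nn


class UnifiedAudioDiffusion(nn.Module):
    """Minimal sketch of one denoiser serving V2A, T2A, and VT2A by
    treating video and text conditioning as optional inputs. All names
    and dimensions here are illustrative assumptions."""

    def __init__(self, dim: int = 512, video_dim: int = 768, text_dim: int = 512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)  # per-frame features -> dim
        self.text_proj = nn.Linear(text_dim, dim)    # caption embedding -> dim
        # Learned "null" tokens stand in for an absent modality, the usual
        # classifier-free-guidance-style trick for multi-task conditioning.
        self.null_video = nn.Parameter(torch.zeros(1, 1, dim))
        self.null_text = nn.Parameter(torch.zeros(1, 1, dim))
        self.denoiser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )

    def forward(self, noisy_audio, video_feats=None, text_emb=None):
        # noisy_audio: (B, T, dim) latent audio tokens at some diffusion step.
        b, t, _ = noisy_audio.shape
        video = (self.video_proj(video_feats) if video_feats is not None
                 else self.null_video.expand(b, -1, -1))
        text = (self.text_proj(text_emb).unsqueeze(1) if text_emb is not None
                else self.null_text.expand(b, -1, -1))
        # Prepend conditioning tokens, denoise, keep only the audio positions.
        out = self.denoiser(torch.cat([video, text, noisy_audio], dim=1))
        return out[:, -t:]  # prediction for the audio tokens


model = UnifiedAudioDiffusion()
audio = torch.randn(2, 100, 512)
t2a = model(audio, text_emb=torch.randn(2, 512))                   # T2A
v2a = model(audio, video_feats=torch.randn(2, 16, 768))            # V2A
vt2a = model(audio, torch.randn(2, 16, 768), torch.randn(2, 512))  # VT2A
```

Under this setup the task is determined purely by which conditions are passed in, which is one way a single diffusion backbone can cover all three generation modes; the paper's actual mechanism may differ.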

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Introduces a unified model for multimodal audio generation, potentially simplifying workflows for content creators and researchers.

RANK_REASON This is a research paper introducing a new model and dataset for audio generation.

Read on arXiv cs.CV →

COVERAGE [1]

  1. arXiv cs.CV TIER_1 · Yusheng Dai, Zehua Chen, Yuxuan Jiang, Baolong Gao, Qiuhong Ke, Jianfei Cai, Jun Zhu

    Omni2Sound: Towards Unified Video-Text-to-Audio Generation

    arXiv:2601.02731v3 · Abstract: Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: …
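The abstract is cut off before the training details, but the summary above credits a three-stage progressive training schedule. Purely as a reading aid, here is a hypothetical sketch of what a progressive multi-task schedule can look like; the stage names, epoch counts, and task mixes below are invented for illustration and are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    epochs: int
    task_weights: dict[str, float]  # sampling probability per task


# Hypothetical three-stage progression: single-task warm-up, then a
# gradually richer task mix to soften cross-task competition.
SCHEDULE = [
    Stage("t2a_warmup", epochs=10, task_weights={"t2a": 1.0}),
    Stage("add_video", epochs=10, task_weights={"t2a": 0.4, "v2a": 0.6}),
    Stage("joint", epochs=20, task_weights={"t2a": 0.25, "v2a": 0.25, "vt2a": 0.5}),
]


def sample_task(stage: Stage) -> str:
    """Draw the task for the next batch according to the stage's mix."""
    tasks = list(stage.task_weights)
    weights = [stage.task_weights[t] for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]


for stage in SCHEDULE:
    for _ in range(stage.epochs):
        task = sample_task(stage)  # route the batch to the chosen task's loss
```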