Researchers have developed ScenA, a method for generating multi-speaker audio scenes from natural-language prompts and reference voices. Unlike previous systems that rely on structured supervision, ScenA builds on a text-to-audio foundation model trained on in-the-wild data, which lets it produce realistic audio with background noise, room acoustics, and overlapping dialogue. A key challenge it addresses is the "Reference Shortcut," in which the model bypasses the text prompt by matching acoustic similarity to the reference voice; this is mitigated by sampling training timesteps from a distribution biased toward high noise levels. On the CoVoMix2-Dialogue benchmark, ScenA outperforms existing systems in speaker binding and generates richer conversational audio.
AI IMPACT: This research could lead to more realistic and controllable AI-generated dialogue for applications such as virtual assistants and content creation.
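To make the idea of a high-noise-biased timestep distribution concrete, here is a minimal sketch. The summary does not say which distribution ScenA uses, so everything here is an assumption for illustration: the function name `sample_high_noise_timestep`, the `shift` parameter, the convention that t = 1 means pure noise, and the choice of a simple monotone shift of a uniform draw.

```python
import random


def sample_high_noise_timestep(rng: random.Random, shift: float = 3.0) -> float:
    """Draw a training timestep t in [0, 1], assuming t = 1 is pure noise.

    Hypothetical illustration only: the actual ScenA distribution is not
    given in the summary. A uniform draw u is remapped by
    t = shift * u / (1 + (shift - 1) * u), which pushes mass toward t = 1
    when shift > 1 and leaves u unchanged when shift == 1.
    """
    u = rng.random()
    return shift * u / (1.0 + (shift - 1.0) * u)


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_high_noise_timestep(rng) for _ in range(10_000)]
    # With shift > 1 the average timestep sits above the uniform mean of 0.5.
    print(f"mean timestep: {sum(draws) / len(draws):.3f}")
```

The intuition, under these assumptions, is that at high noise levels the noisy input carries little acoustic detail to match against the reference, so spending more training steps there would push the model to rely on the text prompt.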
AI-generated summary · Google Gemini · from 1 source.