Brief · PulseAugur

RESEARCH · arXiv cs.AI English(EN) · 14h · [2 sources]

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Researchers have developed ScenA, a method for generating multi-speaker audio scenes from natural-language descriptions and voice references. Unlike previous systems that rely on structured supervision, ScenA builds on a text-to-audio flow-matching foundation model pretrained on diverse, in-the-wild audio data, which lets it produce realistic ambient sounds, room acoustics, and overlapping dialogue. A key challenge it addresses is the "Reference Shortcut," in which the model may bypass the text prompt by relying solely on acoustic similarity; ScenA mitigates this with a training distribution biased toward high noise levels. On the CoVoMix2-Dialogue benchmark, ScenA outperforms existing systems in speaker binding and generates richer, more natural conversational audio.
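To make the idea of a high-noise-biased training distribution concrete, here is a minimal sketch of how noise levels might be drawn during flow-matching training. The brief does not state which distribution ScenA uses, so the Beta(3, 1) choice, the function names, and the convention that level 1.0 means pure noise are all illustrative assumptions, not the authors' implementation.

```python
import random


def sample_noise_levels(n, alpha=3.0, beta=1.0, seed=0):
    """Draw n noise levels in [0, 1]; alpha > beta skews them toward 1 (high noise).

    Beta(3, 1) is a hypothetical choice for illustration only.
    Beta(1, 1) is the uniform distribution, used here as the unbiased baseline.
    """
    rng = random.Random(seed)
    return [rng.betavariate(alpha, beta) for _ in range(n)]


def interpolate(clean, noise, level):
    """Linear flow-matching path (assumed convention): level=0 is clean audio, level=1 is pure noise."""
    return [(1.0 - level) * c + level * z for c, z in zip(clean, noise)]


# Compare the average noise level under the biased and uniform schedules.
biased = sample_noise_levels(10_000)
uniform = sample_noise_levels(10_000, alpha=1.0, beta=1.0)
mean_biased = sum(biased) / len(biased)
mean_uniform = sum(uniform) / len(uniform)
print(f"biased mean level:  {mean_biased:.3f}")
print(f"uniform mean level: {mean_uniform:.3f}")
```

Under this sketch, most training examples are heavily noised, so less of the target audio survives in the model's input than under a uniform schedule. How that interacts with the voice references depends on details of ScenA that the brief does not describe.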

IMPACT This research advances generative audio models by enabling more realistic and controllable multi-speaker scene creation, potentially impacting applications in virtual assistants and content creation.