PulseAugur

New AI model ScenA generates realistic multi-speaker audio scenes from text

Researchers have developed ScenA, a method for generating multi-speaker audio scenes from natural-language descriptions and voice references. Unlike previous systems that rely on structured supervision, ScenA builds on a text-to-audio flow-matching foundation model pretrained on diverse, in-the-wild audio data. This lets it include realistic ambient sounds, room acoustics, and overlapping dialogue. A key challenge it addresses is the "Reference Shortcut," in which the model bypasses the text prompt by relying solely on acoustic similarity; ScenA mitigates this with a high-noise-biased training distribution. In evaluations on the CoVoMix2-Dialogue benchmark, ScenA outperforms existing systems in speaker binding and generates richer, more natural conversational audio.
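
The summary does not say which distribution ScenA uses for its high-noise-biased training. As a rough illustration only, the sketch below shows one hypothetical way to bias flow-matching noise levels toward the noisy end: draw the level t from a Beta(3, 1) distribution instead of a uniform one. The function names, the Beta parameters, and the convention that t = 1 means pure noise are all assumptions made for this example, not details from the paper.

```python
import random


def sample_noise_level(rng, alpha=3.0, beta=1.0):
    """Draw a noise level t in [0, 1], skewed toward 1 when alpha > beta.

    Assumed convention (not from the paper): t = 1 is pure noise, t = 0 is clean audio.
    Beta(3, 1) is a hypothetical choice; its mean is alpha / (alpha + beta) = 0.75.
    """
    return rng.betavariate(alpha, beta)


def noisy_interpolant(clean, noise, t):
    """Linear flow-matching path: blend a clean sample with noise at level t."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


rng = random.Random(0)
num_draws = 10_000

# Baseline: uniform noise levels average about 0.5.
uniform_mean = sum(rng.random() for _ in range(num_draws)) / num_draws

# Biased: Beta(3, 1) noise levels concentrate near the high-noise end.
biased_mean = sum(sample_noise_level(rng) for _ in range(num_draws)) / num_draws

print(f"uniform mean t: {uniform_mean:.3f}")
print(f"biased mean t:  {biased_mean:.3f}")
```

Under this assumed setup, most training examples would be drawn at noise levels where little of the target audio remains visible. Whether ScenA uses a Beta distribution, a shifted logit-normal, or something else is not stated in the summary.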

IMPACT This research advances generative audio models by enabling more realistic and controllable multi-speaker scene creation, with potential applications in virtual assistants and content creation.

RANK_REASON The cluster contains an academic paper detailing a new AI model and methodology.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen ·

    Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

    arXiv:2606.19325v1 · Abstract: Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines th…

  2. arXiv cs.AI TIER_1 English(EN) · Yoav HaCohen ·

    Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

    Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambie…