PulseAugur

UniVoice model unifies speech and singing voice generation

Researchers have introduced UniVoice, a unified model that generates both speech and singing voices. It uses conditional flow matching with a Diffusion Transformer backbone and factorizes its conditioning into content, melody, and timbre. For speech, a null melody token lets the model infer natural prosody on its own; for singing, explicit MIDI note sequences control the melody. Trained on extensive speech and singing datasets, UniVoice performs competitively with specialized systems in both domains.
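
To make the conditioning idea concrete, here is a minimal sketch of how a frame-level melody condition and a flow matching training pair might be built. It is not taken from the paper: the `NULL_MELODY` id, the `Note` type, the pitch-to-token offset, the padding rule, and the linear interpolation path are all assumptions chosen for illustration, and the paper's actual layout may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Hypothetical reserved token id meaning "no melody given" (assumption).
NULL_MELODY = 0


@dataclass
class Note:
    midi_pitch: int  # MIDI note number, 0-127
    n_frames: int    # note duration in acoustic frames


def melody_tokens(n_frames: int,
                  notes: Optional[Sequence[Note]] = None) -> List[int]:
    """Build a frame-level melody condition of length n_frames.

    Speech: no notes are given, so every frame carries NULL_MELODY and
    the model is left to infer prosody by itself.
    Singing: each note is repeated for its duration. MIDI pitch p maps to
    token p + 1 so that it never collides with NULL_MELODY (assumption).
    """
    if notes is None:
        return [NULL_MELODY] * n_frames
    tokens: List[int] = []
    for note in notes:
        if not 0 <= note.midi_pitch <= 127:
            raise ValueError(f"invalid MIDI pitch: {note.midi_pitch}")
        tokens.extend([note.midi_pitch + 1] * note.n_frames)
    # Truncate to the requested length, then pad any uncovered frames
    # with the null token (assumption: uncovered frames have no melody).
    tokens = tokens[:n_frames]
    tokens += [NULL_MELODY] * (n_frames - len(tokens))
    return tokens


def flow_matching_pair(x0: Sequence[float],
                       x1: Sequence[float],
                       t: float) -> Tuple[List[float], List[float]]:
    """Return (x_t, target velocity) for a linear probability path.

    x0 is a noise sample, x1 is the data sample, and t lies in [0, 1].
    With x_t = (1 - t) * x0 + t * x1, the regression target is x1 - x0.
    """
    if len(x0) != len(x1):
        raise ValueError("x0 and x1 must have the same length")
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


speech_cond = melody_tokens(4)                                 # all null
singing_cond = melody_tokens(5, [Note(60, 2), Note(62, 2)])    # C4, D4, pad
x_t, v = flow_matching_pair([0.0, 2.0], [1.0, 4.0], t=0.5)
```

In a full system, a network would take `x_t`, `t`, and the content, melody, and timbre conditions and be trained to predict `v`; using one shared melody channel with a null token is what would let the same network serve both speech and singing.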

IMPACT This unified model could simplify the development of advanced voice synthesis tools for both spoken and sung content.

RANK_REASON This is a research paper detailing a new model architecture for audio generation.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Junjie Zheng, Huixin Xue, Shihong Ren, Chaofan Ding, Hao Liu, Zihao Chen

    UniVoice: A Unified Model for Speech and Singing Voice Generation

    arXiv:2606.05852v1 Announce Type: cross Abstract: Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-d…