UniVoice: A Unified Model for Speech and Singing Voice Generation
Researchers have developed UniVoice, a single model that generates both speech and singing voices. It uses conditional flow matching with a Diffusion Transformer backbone and factorizes its conditioning into three parts: content, melody, and timbre. For speech, a null melody token stands in for the melody condition, so the model infers natural prosody on its own; for singing, explicit MIDI note sequences control the melody. Trained on extensive speech and singing datasets, UniVoice performs competitively with specialized systems in both domains.
AI IMPACT: This unified model could simplify the development of advanced voice synthesis tools for both spoken and sung content.
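
To make the conditioning scheme in the summary more concrete, here is a minimal sketch of the two ideas it names: a melody condition that is either a null token (speech) or MIDI notes (singing), and a flow matching training pair. This is an illustration only, not the authors' implementation. The names `NULL_MELODY`, `MIDI_OFFSET`, `melody_tokens`, and `flow_matching_pair`, the frame-level token layout, and the straight-line path from noise to data are all assumptions that the summary does not confirm.

```python
import random

# Hypothetical ID reserved for "no melody given" (the speech case).
NULL_MELODY = 0
# Hypothetical shift so that MIDI note 0 does not collide with NULL_MELODY.
MIDI_OFFSET = 1


def melody_tokens(num_frames, midi_notes=None):
    """Return one melody token per frame.

    midi_notes is an optional list of (midi_pitch, start_frame, end_frame)
    tuples. With None (speech), every frame keeps NULL_MELODY. As a
    simplification, frames not covered by any note also keep NULL_MELODY.
    """
    tokens = [NULL_MELODY] * num_frames
    if midi_notes is not None:
        for pitch, start, end in midi_notes:
            for i in range(start, min(end, num_frames)):
                tokens[i] = pitch + MIDI_OFFSET
    return tokens


def flow_matching_pair(x1, t, rng):
    """Build one training pair on a straight path from noise x0 to data x1.

    Returns the interpolated point x_t = (1 - t) * x0 + t * x1 and the
    velocity target x1 - x0 that a model would be trained to predict,
    given x_t, t, and the content/melody/timbre conditions.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


rng = random.Random(0)
frame = [0.5, -1.2, 0.3, 2.0]  # stand-in for one acoustic feature frame

speech_melody = melody_tokens(8)                               # all null tokens
singing_melody = melody_tokens(8, [(60, 0, 4), (64, 4, 8)])    # C4 then E4
x_t, v = flow_matching_pair(frame, 0.3, rng)
```

In this sketch the same network input format serves both tasks: only the melody tokens differ between speech and singing, which is one plausible reading of how a null melody token lets a single model cover both.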