Brief · PulseAugur

RESEARCH · arXiv cs.AI · 3d · [3 sources]

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

Researchers have developed new unified models for generating human vocal audio, covering both speech and singing. UniVoice uses a conditional flow matching approach that separates content, melody, and timbre, allowing independent control over speech prosody and singing melody. UniSinger, built on a multimodal diffusion transformer, unifies speaker-cloning song generation and singing voice conversion, with accompaniment co-generation. Both models are reported to achieve state-of-the-art performance on their respective tasks, offering new possibilities for audio generation and music production.

IMPACT These models advance the state of the art in unified audio generation, with potential applications in music production and accessibility tools.

UniVoice
CosyVoice3
Vevo1.5
Diffusion Transformer
F5-TTS
MIDI
UniSinger
Text-to-speech
Conditional flow matching
Singing voice synthesis
Singing voice conversion