Researchers have developed UniSonate, a unified framework that generates speech, music, and sound effects from natural language instructions. It addresses the fragmentation of generative audio by reconciling structured semantic representations with unstructured acoustic textures. UniSonate combines a dynamic token injection mechanism with a Multimodal Diffusion Transformer (MM-DiT) to achieve precise duration control, reporting state-of-the-art results in text-to-speech and text-to-music and competitive performance in text-to-audio generation.
Impact: Unifies disparate audio generation tasks, potentially simplifying workflows for content creators and researchers.
Ranking rationale: Academic paper introducing a new unified audio generation model.
AI-generated summary · Google Gemini · from 2 sources.