X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

Mistral AI and X-Voice Advance Multilingual Voice Cloning with New Architectures

By the PulseAugur editorial team · [3 sources] · 2026-05-05 21:11

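Researchers have introduced X-Voice, a compact 0.4B-parameter model capable of zero-shot cross-lingual voice cloning in 30 languages. The model uses a two-stage training pipeline that combines a unified International Phonetic Alphabet (IPA) representation with open-source resources. Separately, Mistral AI has released Voxtral TTS, a larger 4B-parameter model that combines autoregressive and flow-matching architectures to address the "expressivity gap" in text-to-speech synthesis. Given a short audio prompt, Voxtral TTS generates natural, speaker-faithful speech in 9 languages and is reported to outperform existing systems.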

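Impact: New text-to-speech models from academia and commercial labs are improving the fidelity and multilingual reach of voice cloning, which could benefit voice assistants and audio content creation.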

Ranking rationale: This cluster contains two distinct research papers/releases that describe new text-to-speech models in detail.


AI-generated summary · Google Gemini · based on 3 sources.

Sources [3]

arXiv cs.AI TIER_1 English(EN) · Rixi Xu, Qingyu Liu, Haitao Li, Yushen Chen, Zhikang Niu, Yunting Yang, Jian Zhao, Ke Li, Berrak Sisman, Qinyuan Cheng, Xipeng Qiu, Kai Yu, Xie Chen · 2026-05-08 04:00

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

arXiv:2605.05611v1 Announce Type: cross Abstract: In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the Internat…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-07 02:57

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified represe…

MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-05-05 21:11

Closing the ‘Expressivity Gap’: How Mistral’s Voxtral TTS is Redefining Multilingual Voice Cloning with a Hybrid Autoregressive and Flow-Matching Architecture

Voice AI has a dirty secret. Most text-to-speech systems sound fine — until they don’t. They can read a sentence. What they cannot do is mean it. The rhythm is off. The emotion is flat. The speaker sounds like themselves for two seconds, then drifts into generic syntheti…
