PulseAugur

Mistral AI and X-Voice advance multilingual voice cloning with new architectures

Researchers have introduced X-Voice, a compact 0.4B-parameter model capable of zero-shot cross-lingual voice cloning in 30 languages. The model uses a two-stage training process with the International Phonetic Alphabet (IPA) as a unified representation, and its resources have been open-sourced. Separately, Mistral AI has released Voxtral TTS, a larger 4B-parameter model that combines autoregressive and flow-matching architectures to address the 'expressivity gap' in text-to-speech synthesis. Voxtral TTS generates natural, speaker-faithful speech in 9 languages from short audio prompts and demonstrates strong performance against existing systems.

Impact: New TTS models from academic and commercial labs are improving voice cloning fidelity and multilingual capabilities, potentially enhancing voice agents and audio content creation.

Ranking rationale: The cluster contains two distinct research papers/releases detailing new text-to-speech models.

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 3 sources. How we write summaries →

Sources [3]

  1. arXiv cs.AI · TIER_1 · English (EN) · Rixi Xu, Qingyu Liu, Haitao Li, Yushen Chen, Zhikang Niu, Yunting Yang, Jian Zhao, Ke Li, Berrak Sisman, Qinyuan Cheng, Xipeng Qiu, Kai Yu, Xie Chen

    X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

    arXiv:2605.05611v1 · Announce Type: cross · Abstract: In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the Internat…

  2. Hugging Face Daily Papers · TIER_1 · English (EN)

    X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

    In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified represe…

  3. MarkTechPost · TIER_1 · English (EN) · Asif Razzaq

    Closing the ‘Expressivity Gap’: How Mistral’s Voxtral TTS is Redefining Multilingual Voice Cloning with a Hybrid Autoregressive and Flow-Matching Architecture

    Voice AI has a dirty secret. Most text-to-speech systems sound fine, until they don’t. They can read a sentence. What they cannot do is mean it. The rhythm is off. The emotion is flat. The speaker sounds like themselves for two seconds, then drifts into generic syntheti…