Researchers have introduced X-Voice, a compact 0.4B-parameter model capable of zero-shot cross-lingual voice cloning in 30 languages. The model is trained in two stages on a unified International Phonetic Alphabet representation, and its resources are open-sourced. Separately, Mistral AI has released Voxtral TTS, a larger 4B-parameter model that combines autoregressive and flow-matching architectures to address the "expressivity gap" in text-to-speech synthesis. Voxtral TTS generates natural, speaker-faithful speech in 9 languages from short audio prompts and demonstrates strong performance against existing systems.
Impact: New TTS models from academic and commercial labs are improving voice cloning fidelity and multilingual capabilities, potentially enhancing voice agents and audio content creation.
Ranking rationale: The cluster contains two distinct research papers/releases detailing new text-to-speech models.
- ElevenLabs Flash v2.5
- Hugging Face
- International Phonetic Alphabet
- LEMAS-TTS
- Ministral 3B
- Mistral AI
- Qwen3-TTS
- Voxtral TTS
- Whisper
- X-Voice
AI-generated summary · Google Gemini · from 3 sources.