Researchers have introduced X-Voice, a compact 0.4B parameter model capable of zero-shot cross-lingual voice cloning in 30 languages. The model utilizes a two-stage training process with a unified International Phonetic Alphabet representation and open-sourced resources. Separately, Mistral AI has released Voxtral TTS, a larger 4B parameter model that combines autoregressive and flow-matching architectures to address the 'expressivity gap' in text-to-speech synthesis. Voxtral TTS generates natural, speaker-faithful speech in 9 languages from short audio prompts and demonstrates strong performance against existing systems. AI
Summary written by gemini-2.5-flash-lite from 3 sources. How we write summaries →
IMPACT New TTS models from academic and commercial labs are improving voice cloning fidelity and multilingual capabilities, potentially enhancing voice agents and audio content creation.
RANK_REASON The cluster contains two distinct research papers/releases detailing new text-to-speech models.