PulseAugur

New 2B-parameter TTS model dots.tts reports state-of-the-art results on Seed-TTS-Eval

Researchers have introduced dots.tts, a 2-billion-parameter continuous autoregressive text-to-speech model that models speech in a continuous latent space. Its main innovations are an AudioVAE that provides a structured speech representation, full-history conditioning for improved consistency, and self-corrective post-training for robustness. The model reports state-of-the-art results on benchmarks such as Seed-TTS-Eval and uses MeanFlow distillation for efficient, low-latency generation.

IMPACT Sets a new SOTA on multilingual TTS benchmarks, which could improve voice cloning and emotional expressiveness in AI applications.

RANK_REASON The cluster contains a technical report detailing a new text-to-speech model with performance benchmarks.


AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.AI TIER_1 English(EN) · Shi Lian, Changtao Li, Bohan Li, Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu

    dots.tts Technical Report

    arXiv:2606.07080v1. We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are …

  2. arXiv cs.AI TIER_1 English(EN) · Kai Yu

    dots.tts Technical Report

    We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multip…

  3. Hugging Face Daily Papers TIER_1 English(EN)

    dots.tts Technical Report

    A 2B-parameter continuous autoregressive text-to-speech model trained on a multilingual corpus achieves state-of-the-art performance on multiple benchmarks while enabling efficient low-latency speech generation through specialized distillation techniques.