PulseAugur

CPU TTS benchmark: Kokoro 82M leads in quality, Inflect-Nano-v1 in speed

A benchmark comparing three open-weight text-to-speech (TTS) models (Kokoro 82M, Supertonic 3, and Inflect-Nano-v1) on CPU found significant differences in speed and audio quality. Inflect-Nano-v1 has the smallest parameter count (4.6M) and the fastest real-time factor (RTF) at 0.1376, but UTMOS scoring overrates its audio, and it has a hard limit on output length. Supertonic 3 sits in the middle: its 5-step configuration achieved a MOS of 4.37 at an RTF of 0.3164. Kokoro 82M was the slowest, with RTFs between 0.5711 and 0.7865, but produced the most human-like audio.
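
To make the RTF figures concrete, here is a minimal sketch in Python. It assumes the common definition of RTF as synthesis time divided by the duration of the generated audio (lower is faster, below 1.0 is faster than real time); the summary does not spell out how the benchmark computed it. The function names and the 10-second clip length are hypothetical, chosen only for illustration; the RTF values are the ones quoted above.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed definition: compute time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_seconds_for(rtf: float, audio_seconds: float) -> float:
    """Estimated compute time needed to generate a clip at a given RTF."""
    return rtf * audio_seconds


# RTF values quoted in the summary (Kokoro 82M is reported as a range).
REPORTED_RTF = {
    "Inflect-Nano-v1": 0.1376,
    "Supertonic 3 (5-step)": 0.3164,
    "Kokoro 82M (fastest reported)": 0.5711,
    "Kokoro 82M (slowest reported)": 0.7865,
}

if __name__ == "__main__":
    clip_seconds = 10.0  # hypothetical clip length, not from the benchmark
    for model, rtf in REPORTED_RTF.items():
        cost = synthesis_seconds_for(rtf, clip_seconds)
        print(f"{model}: ~{cost:.2f} s of CPU time per {clip_seconds:.0f} s of audio")
```

Under this definition, an RTF of 0.1376 means about 1.4 seconds of compute for 10 seconds of speech, while all three models stay below 1.0 and therefore run faster than real time on the tested CPU.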

IMPACT Shows the trade-off between synthesis speed and audio quality for CPU-based TTS models, which can guide developers in model selection.

RANK_REASON The cluster details a benchmark comparing multiple open-weight TTS models, including performance metrics and quality assessments.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/gvij ·

    CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample

    https://www.reddit.com/r/LocalLLaMA/comments/1udg3rf/cpuonly_tts_benchmark_kokoro_82m_vs_supertonic_3/