PulseAugur

New TTS model Chatterbox-Flash uses block diffusion for streaming

Researchers have introduced Chatterbox-Flash, a zero-shot text-to-speech model built on prior-calibrated block diffusion. The approach fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder, which generates the tokens within each block in parallel while retaining streaming capability. To address quality degradation, the model applies two inference-time techniques: prior-calibrated scoring and an adaptive early-decoding schedule. The result is high-fidelity synthesis comparable to existing baselines, with improved streaming performance.
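
To make the block-wise idea concrete, here is a minimal sketch of generic block-by-block decoding with parallel, confidence-ordered unmasking inside each block. It is not the paper's algorithm: `toy_scorer`, `decode_blocks`, and all parameters are hypothetical names invented for illustration, the scorer is a random stand-in for a neural decoder, and neither prior-calibrated scoring nor the adaptive early-decoding schedule is modeled.

```python
import random
from typing import Callable, Iterator, List, Optional, Tuple

MASK = None  # marks a position in the block that has not been decoded yet

# Hypothetical scorer type: given the committed prefix and the current block
# (with MASK holes), return one (token, confidence) guess per block position.
Scorer = Callable[[List[int], List[Optional[int]]], List[Tuple[int, float]]]


def toy_scorer(prefix: List[int], block: List[Optional[int]]) -> List[Tuple[int, float]]:
    """Stand-in for a neural decoder: deterministic pseudo-random guesses."""
    filled = sum(t is not MASK for t in block)
    rng = random.Random(len(prefix) * 1000 + filled)
    return [(rng.randrange(100), rng.random()) for _ in block]


def decode_blocks(
    scorer: Scorer, num_blocks: int, block_size: int, steps_per_block: int
) -> Iterator[List[int]]:
    """Yield finished blocks one at a time, so a consumer can start streaming early."""
    prefix: List[int] = []
    per_step = -(-block_size // steps_per_block)  # ceiling division
    for _ in range(num_blocks):
        block: List[Optional[int]] = [MASK] * block_size
        while any(t is MASK for t in block):
            guesses = scorer(prefix, block)  # all positions scored in one pass
            masked = [i for i, t in enumerate(block) if t is MASK]
            # Commit the most confident masked positions first.
            masked.sort(key=lambda i: guesses[i][1], reverse=True)
            for i in masked[:per_step]:
                block[i] = guesses[i][0]
        prefix.extend(block)  # later blocks are conditioned on earlier ones
        yield list(block)


if __name__ == "__main__":
    for n, blk in enumerate(decode_blocks(toy_scorer, 3, 4, 2)):
        print(f"block {n}: {blk}")
```

Because each block is yielded as soon as it is complete, a downstream vocoder could begin producing audio before the full utterance is decoded, which is the streaming property the summary describes.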

IMPACT Introduces a zero-shot TTS method that keeps synthesis quality comparable to existing baselines while improving streaming performance.

RANK_REASON This is a research paper describing a new model and methodology.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Deokjin Seo, Gangin Park, Kihyun Nam

    Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

    arXiv:2605.30748v1 Announce Type: cross Abstract: We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining blo…