New TTS model Chatterbox-Flash uses block diffusion for streaming

By PulseAugur Editorial · [1 source] · 2026-06-01 04:00

Researchers have introduced Chatterbox-Flash, a zero-shot text-to-speech model built on prior-calibrated block diffusion. The approach fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder, which generates the tokens within each block in parallel while retaining streaming capability. To counter quality degradation, the model applies two inference-time techniques: prior-calibrated scoring and an adaptive early-decoding schedule. It reports high-fidelity synthesis comparable to existing baselines, with improved streaming performance.
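The block-wise decoding pattern described above can be illustrated with a toy sketch. This is not the paper's implementation: `toy_predict` is a hypothetical stand-in that returns random tokens and confidences, the names `decode_block` and `stream_blocks` are invented for illustration, and neither prior-calibrated scoring nor the adaptive early-decoding schedule is modeled. It only shows the general idea of filling several positions of a block in parallel per step, then emitting the finished block before starting the next one.

```python
import random

MASK = -1  # placeholder for a position that has not been decoded yet


def toy_predict(context, block, vocab_size, rng):
    """Hypothetical model stand-in: propose (token, confidence) for every
    masked position in the block. A real decoder would condition on
    `context` and on the already-committed tokens in `block`."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }


def decode_block(context, block_size, steps, vocab_size, rng):
    """Fill one block in at most `steps` parallel refinement steps."""
    block = [MASK] * block_size
    for step in range(steps):
        proposals = toy_predict(context, block, vocab_size, rng)
        if not proposals:
            break
        # Commit an even share of the remaining positions each step
        # (ceiling division), so the final step commits everything left.
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:k]
        for i in most_confident:
            block[i] = proposals[i][0]
    return block


def stream_blocks(num_blocks, block_size=4, steps=2, vocab_size=100, seed=0):
    """Yield blocks one at a time; each block sees all earlier blocks."""
    rng = random.Random(seed)
    context = []
    for _ in range(num_blocks):
        block = decode_block(context, block_size, steps, vocab_size, rng)
        context.extend(block)
        yield block  # a streaming consumer can use this block immediately


if __name__ == "__main__":
    for n, blk in enumerate(stream_blocks(3)):
        print(f"block {n}: {blk}")
```

In this sketch a block of 4 tokens is completed in 2 steps instead of 4 sequential ones, which is where the parallelism within a block would come from; streaming is preserved because each block is yielded as soon as it is complete.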

IMPACT Introduces a zero-shot TTS method that improves streaming performance while keeping synthesis quality comparable to existing baselines.

RANK_REASON This is a research paper describing a new model and methodology.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Deokjin Seo, Gangin Park, Kihyun Nam · 2026-06-01 04:00

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

arXiv:2605.30748v1 Announce Type: cross Abstract: We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining blo…
