Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS
Researchers have developed Chatterbox-Flash, a zero-shot text-to-speech model built on prior-calibrated block diffusion. The approach fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder, which generates the tokens within each block in parallel while retaining streaming capability. To address quality degradation, the model applies inference-time techniques such as prior-calibrated scoring and an adaptive early-decoding schedule, achieving high-fidelity synthesis comparable to existing baselines with improved streaming performance.
IMPACT Introduces a zero-shot TTS method that keeps synthesis quality comparable to existing baselines while improving streaming performance.
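
The block-wise decoding idea in the summary can be illustrated with a rough sketch. This is not the Chatterbox-Flash implementation: the summary does not specify how prior-calibrated scoring or the adaptive early-decoding schedule work, so neither is modeled here. Instead, the sketch swaps in a generic confidence-ranked unmasking loop with a fixed step count, and a random toy scorer (`toy_logprobs`, hypothetical) stands in for the fine-tuned decoder. All function names and parameters are assumptions made for illustration.

```python
import math
import random

MASK = -1  # placeholder id for a position that has not been decoded yet


def toy_logprobs(context, block, pos, vocab_size, rng):
    """Hypothetical stand-in for the decoder: random log-probs for one position.

    A real model would condition on `context` (earlier blocks) and on the
    already-decoded entries of `block`; this toy version ignores both.
    """
    logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
    log_z = math.log(sum(math.exp(x) for x in logits))
    return [x - log_z for x in logits]


def decode_block(context, block_size, vocab_size, steps, rng):
    """Fill one block over `steps` refinement steps (steps must be >= 1)."""
    block = [MASK] * block_size
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        # Score every masked position; conceptually a single parallel pass.
        candidates = []
        for i in masked:
            lp = toy_logprobs(context, block, i, vocab_size, rng)
            best = max(range(vocab_size), key=lp.__getitem__)
            candidates.append((lp[best], i, best))
        # Commit the most confident positions now; the last step commits the rest.
        n_commit = math.ceil(len(masked) / (steps - step))
        candidates.sort(reverse=True)
        for _, i, tok in candidates[:n_commit]:
            block[i] = tok
    return block


def stream_decode(num_blocks, block_size=4, vocab_size=8, steps=2, seed=0):
    """Yield blocks one at a time so a consumer can start before decoding ends."""
    rng = random.Random(seed)
    context = []
    for _ in range(num_blocks):
        block = decode_block(context, block_size, vocab_size, steps, rng)
        context.extend(block)  # later blocks see all earlier blocks
        yield block


if __name__ == "__main__":
    for n, blk in enumerate(stream_decode(num_blocks=3)):
        print(f"block {n}: {blk}")
```

With `steps` smaller than `block_size`, each block needs fewer scoring passes than token-by-token decoding would, and yielding block by block is what keeps the output streamable in this sketch.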