Researchers have developed Chatterbox-Flash, a zero-shot text-to-speech (TTS) model built on prior-calibrated block diffusion. The approach fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder, which generates the tokens inside each block in parallel while keeping the model usable for streaming. To address the quality degradation that comes with this change, the model applies two inference-time techniques: prior-calibrated scoring and an adaptive early-decoding schedule. The reported result is high-fidelity synthesis comparable to existing baselines, with improved streaming performance.
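The summary gives no implementation details, so the following is only a toy sketch of the general idea of block-wise parallel decoding, not the Chatterbox-Flash method. It assumes that positions in a block start out masked, that a score of "model log-probability minus log-prior" stands in for prior-calibrated scoring, and that committing every position whose score clears a threshold stands in for an adaptive early-decoding schedule. All names (`MASK`, `toy_probs`, `decode_block`, `stream_blocks`) and parameters are hypothetical, and the random `toy_probs` function replaces a real decoder.

```python
import math
import random
from typing import Iterator, List, Tuple

MASK = -1  # hypothetical marker for a position that is not decoded yet


def toy_probs(rng: random.Random, vocab_size: int) -> List[float]:
    """Stand-in for the decoder: a random distribution over the vocabulary.

    A real model would condition on the text, the speaker prompt, and all
    previously emitted blocks; this toy ignores context entirely.
    """
    weights = [rng.random() ** 4 + 1e-6 for _ in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def decode_block(block_size: int, vocab_size: int, prior: List[float],
                 threshold: float, rng: random.Random) -> Tuple[List[int], int]:
    """Fill one block by iterative parallel unmasking; returns (tokens, steps)."""
    block = [MASK] * block_size
    steps = 0
    while MASK in block:
        steps += 1
        candidates = []  # (calibrated score, position, token)
        for pos, tok in enumerate(block):
            if tok != MASK:
                continue
            probs = toy_probs(rng, vocab_size)
            # Assumed calibration: log-probability minus log-prior per token.
            scores = [math.log(p) - math.log(q) for p, q in zip(probs, prior)]
            best = max(range(vocab_size), key=scores.__getitem__)
            candidates.append((scores[best], pos, best))
        # Commit every position above the threshold in this step; always
        # commit the single best one so that the loop terminates.
        candidates.sort(reverse=True)
        accepted = [c for c in candidates if c[0] >= threshold] or candidates[:1]
        for _, pos, tok in accepted:
            block[pos] = tok
    return block, steps


def stream_blocks(num_blocks: int, block_size: int = 8, vocab_size: int = 16,
                  threshold: float = 1.0, seed: int = 0
                  ) -> Iterator[Tuple[List[int], int]]:
    """Yield blocks one at a time, so a consumer can start before the end."""
    rng = random.Random(seed)
    prior = [1.0 / vocab_size] * vocab_size  # uniform toy prior (assumption)
    for _ in range(num_blocks):
        yield decode_block(block_size, vocab_size, prior, threshold, rng)


if __name__ == "__main__":
    for tokens, steps in stream_blocks(num_blocks=3):
        print(f"{steps} step(s): {tokens}")
```

In this sketch a block never needs more steps than it has positions, and it needs fewer whenever several positions clear the threshold at once. Because blocks are yielded as soon as they are complete, downstream audio synthesis could begin before the full sequence exists.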
IMPACT Introduces a new method for zero-shot TTS that improves synthesis quality and streaming performance.
RANK_REASON This is a research paper describing a new model and methodology.
AI-generated summary · Google Gemini · from 1 source.