Reliable Neural-Codec Text-to-Speech by ASR Self-Verification and Distillation: Near-Zero Catastrophic Failures Across Models and Codecs
Researchers have developed a method that sharply reduces catastrophic failures in open autoregressive neural-codec text-to-speech (TTS) models. With Automatic Speech Recognition (ASR) self-verification, in which multiple ASR models assess the TTS output, failure rates can be driven to near zero. This robustness can then be distilled back into the TTS model, recovering much of the improvement at no additional inference cost. The approach is effective across several TTS systems and codecs, although one larger model resisted the improvements.
AI IMPACT: Enhances the reliability of TTS systems, making them more suitable for real-world applications by reducing unexpected output failures.
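The self-verification step summarized above can be illustrated with a minimal sketch. It assumes, beyond what the summary states, that verification selects among several sampled TTS candidates and that agreement is measured by word error rate (WER). The names `word_error_rate`, `pick_verified`, and `max_wer` are hypothetical, and the ASR models are passed in as plain callables rather than any specific library.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

Audio = TypeVar("Audio")  # placeholder for whatever the TTS model emits


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def pick_verified(
    text: str,
    candidates: Sequence[Audio],
    asr_models: Sequence[Callable[[Audio], str]],
    max_wer: float = 0.1,  # hypothetical acceptance threshold
) -> Tuple[Optional[Audio], float]:
    """Return the candidate whose worst WER across all ASR models is lowest.

    Returns (None, score) when even the best candidate exceeds max_wer,
    so the caller can resample or flag the utterance.
    """
    best, best_score = None, float("inf")
    for audio in candidates:
        # A candidate passes only if every ASR model agrees with the text.
        score = max(word_error_rate(text, asr(audio)) for asr in asr_models)
        if score < best_score:
            best, best_score = audio, score
    if best_score <= max_wer:
        return best, best_score
    return None, best_score
```

Under these assumptions, the (text, audio) pairs that pass verification could plausibly serve as training data for the distillation step, although the summary does not spell out that recipe.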