End-to-End Training for Discrete Token LLM based TTS System
Researchers have developed an end-to-end training framework for discrete-token Large Language Model (LLM) based Text-to-Speech (TTS) systems. Unlike previous cascaded systems, whose components are trained independently, this approach jointly trains the speech tokenizer, the LLM, a flow-matching model, and a reward model. The joint optimization encourages the discrete speech token space to better capture acoustic and semantic information, which improves TTS generation. In experiments, the end-to-end method achieves state-of-the-art results on the Seed-TTS-Eval benchmark with a significantly smaller LLM.
AI IMPACT: This unified training approach could lead to more efficient, higher-quality speech synthesis models.