FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
Researchers have developed FlowEdit, a framework that adapts frozen flow-matching text-to-speech (TTS) systems for lifelong pronunciation correction. Instead of retraining the model, FlowEdit learns pronunciation corrections as latent edits in the text embedding space. The edits are stored in a Modern Hopfield Network, which acts as an associative memory, and are retrieved at inference time through soft attention. The approach reduces pronunciation errors on proper nouns, with a reported 92.7% relative decrease in Phoneme Error Rate (PER) on a multilingual benchmark, while preserving overall speech quality.
IMPACT: This research could lead to more adaptable and accurate text-to-speech systems that can learn from user feedback without full retraining.
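
The retrieval step described in the summary (corrections stored in an associative memory and read out with soft attention) can be illustrated with a minimal sketch. This is not FlowEdit's implementation: the function names, the toy two-dimensional vectors, the inverse temperature `beta`, and the choice to add the retrieved edit directly to the embedding are all assumptions made for illustration. It uses the standard one-step modern Hopfield update, in which a query attends over stored keys and returns a softmax-weighted sum of the associated values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_edit(query, keys, edits, beta=8.0):
    """One-step modern Hopfield retrieval (hypothetical helper).

    Computes softmax(beta * <key_i, query>) and returns the weighted
    sum of the stored edit vectors.
    """
    weights = softmax([beta * dot(k, query) for k in keys])
    dim = len(edits[0])
    return [sum(w * e[i] for w, e in zip(weights, edits)) for i in range(dim)]


def apply_edit(embedding, query, keys, edits, beta=8.0):
    """Add the retrieved latent edit to a text embedding (assumed additive)."""
    delta = retrieve_edit(query, keys, edits, beta)
    return [x + d for x, d in zip(embedding, delta)]


# Toy memory: two stored corrections, keyed by made-up token embeddings.
keys = [[1.0, 0.0], [0.0, 1.0]]
edits = [[0.5, -0.5], [0.0, 0.2]]

# A query that matches the first key retrieves (almost exactly) the first edit.
delta = retrieve_edit([1.0, 0.0], keys, edits)
edited = apply_edit([1.0, 0.0], [1.0, 0.0], keys, edits)
```

A larger `beta` makes retrieval sharper, so the output approaches the single best-matching stored edit; a smaller `beta` blends several edits. Because softmax weights always sum to one, this sketch returns some edit for every query, including tokens that need no correction. How FlowEdit handles that case is not stated in the summary.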