ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis
Researchers have developed ZeSTA, a new framework for improving personalized speech synthesis using zero-shot text-to-speech (ZS-TTS) as a data augmentation source. The method addresses the common issue of speaker similarity degradation when mixing synthetic and real speech data during fine-tuning. ZeSTA employs a domain-conditioned training approach that distinguishes between real and synthetic speech, coupled with oversampling of real data to stabilize adaptation, particularly in low-resource scenarios.
AI IMPACT: This research could lead to more efficient and effective personalized voice generation, particularly in scenarios with limited training data.
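The data-mixing idea described above (tagging each utterance as real or synthetic, and repeating the scarce real utterances) can be illustrated with a minimal sketch. This is not ZeSTA's implementation: the names `Utterance`, `build_training_pool`, `sample_batch`, the label ids `REAL`/`SYNTHETIC`, and the `real_oversample` factor are all hypothetical, and the summary does not say how the domain label is fed to the model or what oversampling ratio is used.

```python
import random
from dataclasses import dataclass

# Hypothetical domain label ids; the summary does not specify the encoding.
REAL, SYNTHETIC = 0, 1


@dataclass(frozen=True)
class Utterance:
    text: str
    audio_path: str
    domain: int  # REAL or SYNTHETIC, intended as a conditioning input


def build_training_pool(real, synthetic, real_oversample=4):
    """Tag (text, path) pairs with a domain label and repeat the real ones.

    `real_oversample` is an assumed knob: each real utterance appears that
    many times in the pool, so it is drawn more often during fine-tuning.
    """
    if real_oversample < 1:
        raise ValueError("real_oversample must be at least 1")
    pool = [Utterance(t, p, REAL) for t, p in real] * real_oversample
    pool += [Utterance(t, p, SYNTHETIC) for t, p in synthetic]
    return pool


def sample_batch(pool, batch_size, rng):
    """Draw a batch without replacement from the mixed pool."""
    return rng.sample(pool, k=min(batch_size, len(pool)))


# Toy data: 2 real recordings, 8 ZS-TTS-generated utterances.
real = [("hello", "real_0.wav"), ("good morning", "real_1.wav")]
synthetic = [(f"sentence {i}", f"syn_{i}.wav") for i in range(8)]

pool = build_training_pool(real, synthetic, real_oversample=4)
batch = sample_batch(pool, batch_size=4, rng=random.Random(0))
```

In a real training loop, the `domain` field would be passed to the TTS model alongside the text, for example through a learned embedding, so that the model can separate characteristics of synthetic speech from those of the target speaker's real recordings. That wiring is an assumption here and is left out of the sketch.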