TOOL · arXiv cs.AI English(EN) · 10h

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

Researchers have developed ZeSTA, a framework that uses zero-shot text-to-speech (ZS-TTS) as a data augmentation source for personalized speech synthesis. The method targets a common problem: speaker similarity degrades when synthetic and real speech are mixed during fine-tuning. ZeSTA combines domain-conditioned training, which distinguishes real from synthetic speech, with oversampling of real data to stabilize adaptation, particularly in low-resource scenarios.
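
As a rough illustration of the two ideas named above (a domain tag on each utterance, and oversampling of the real data), the following minimal sketch builds a mixed fine-tuning set. It is not the paper's implementation: the `Utterance` type, the `REAL`/`SYNTHETIC` IDs, the `build_training_mix` helper, the oversampling factor of 4, and the file names are all hypothetical and chosen only for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical domain IDs; the brief does not say how domains are encoded.
REAL, SYNTHETIC = 0, 1


@dataclass(frozen=True)
class Utterance:
    text: str
    audio_path: str
    domain: int  # REAL or SYNTHETIC, passed to the model as a condition


def build_training_mix(real, synthetic, real_oversample=4, seed=0):
    """Tag each (text, path) pair with its domain and repeat the real ones.

    real_oversample is an assumed integer repeat factor, not a value
    taken from the paper.
    """
    mix = [Utterance(t, p, REAL) for t, p in real] * real_oversample
    mix += [Utterance(t, p, SYNTHETIC) for t, p in synthetic]
    random.Random(seed).shuffle(mix)  # fixed seed for a reproducible order
    return mix


# Toy data: 2 real recordings and 8 ZS-TTS-generated utterances.
real = [("hello", "real_0.wav"), ("goodbye", "real_1.wav")]
synthetic = [(f"sentence {i}", f"syn_{i}.wav") for i in range(8)]

mix = build_training_mix(real, synthetic, real_oversample=4)
n_real = sum(u.domain == REAL for u in mix)
n_syn = sum(u.domain == SYNTHETIC for u in mix)
```

With these toy inputs the two real recordings are each repeated four times, so real and synthetic utterances appear equally often in the mix despite the 1:4 imbalance in the raw data. How the domain ID is consumed by the TTS model during fine-tuning is not described in this brief.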

IMPACT This research could make personalized voice generation more data-efficient, particularly when only limited training data is available.

LibriTTS
ZS-TTS
ZeSTA
Youngwon Choi