From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data
Researchers have developed a voice conversion framework that uses K-Nearest Neighbors (KNN) retrieval on WavLM representations to align non-parallel speech data. The method constructs synthetic training pairs from non-parallel source and target audio, enabling supervised learning without explicit alignment or parallel corpora. The framework also incorporates a speaker loss to maintain consistent target-speaker identity, and it demonstrates high naturalness and speaker similarity across multiple languages, even when trained solely on English data.
AI IMPACT: This method could enable more accessible and multilingual voice conversion without requiring parallel datasets.
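
To make the KNN retrieval step concrete, here is a minimal sketch of how frame-level features from a source utterance might be matched to a pool of target-speaker frames to form synthetic training pairs. It is an illustration under stated assumptions, not the authors' implementation: the function names (`knn_match`, `build_training_pairs`) are hypothetical, the use of cosine similarity and of averaging the k nearest frames is assumed rather than taken from the summary, and short toy vectors stand in for real WavLM features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_match(source_frames, target_frames, k=4):
    """For each source frame, average its k most similar target frames.

    Assumption: similarity is cosine and the k neighbours are averaged;
    the summary does not specify either choice.
    """
    if not target_frames:
        raise ValueError("target_frames must not be empty")
    k = min(k, len(target_frames))
    matched = []
    for src in source_frames:
        # Rank every target frame by similarity to this source frame.
        ranked = sorted(
            target_frames,
            key=lambda tgt: cosine_similarity(src, tgt),
            reverse=True,
        )
        neighbours = ranked[:k]
        # Element-wise mean of the k nearest target frames.
        matched.append(
            [sum(n[d] for n in neighbours) / k for d in range(len(src))]
        )
    return matched


def build_training_pairs(source_frames, target_frames, k=4):
    """Pair each source frame with its retrieved target-side frame."""
    return list(zip(source_frames, knn_match(source_frames, target_frames, k)))


# Toy 2-dimensional "features" standing in for WavLM frame embeddings.
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
pairs = build_training_pairs(source, target, k=1)
```

Each resulting pair holds a source frame and a target-speaker frame with similar content, which is the kind of pseudo-parallel data a supervised conversion model could be trained on. A real system would use a vectorized nearest-neighbour search rather than this brute-force loop.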