An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis
Researchers have developed an emotional speech synthesis (ESS) system that integrates speaker embeddings and prosody bottlenecks into the FastSpeech 2 model. The system is designed to generate natural-sounding, humanlike speech with a desired emotional expression. It can produce emotional speech for a single speaker or transfer speaking styles between speakers while preserving the target speaker's identity.
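
To make the conditioning idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a common design in which a speaker embedding and a narrow prosody code are added to every text-encoder state before decoding. All names, sizes, and the random weights (`HIDDEN`, `PROSODY_IN`, `BOTTLENECK`, `condition`) are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical sizes, not taken from the study.
HIDDEN = 8       # width of each text-encoder state
PROSODY_IN = 6   # width of prosody features from a reference utterance
BOTTLENECK = 2   # deliberately narrow, so only coarse style information passes


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (stand-in for trained weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x, w):
    """Multiply a vector x of length n by an n x m matrix w."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def condition(encoder_states, speaker_vec, prosody_feats, w_down, w_up):
    """Add a speaker embedding and a bottlenecked prosody code to each encoder state."""
    code = linear(prosody_feats, w_down)   # squeeze prosody into a few dimensions
    prosody_hidden = linear(code, w_up)    # expand back to the encoder width
    conditioned = [
        [h + s + p for h, s, p in zip(state, speaker_vec, prosody_hidden)]
        for state in encoder_states
    ]
    return conditioned, code


rng = random.Random(0)
w_down = make_matrix(PROSODY_IN, BOTTLENECK, rng)
w_up = make_matrix(BOTTLENECK, HIDDEN, rng)

# Three stand-in encoder states for a three-phoneme input.
encoder_states = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(3)]
speaker_a = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
speaker_b = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
reference_prosody = [rng.uniform(-1, 1) for _ in range(PROSODY_IN)]

# Same reference prosody, two different target speakers.
out_a, code_a = condition(encoder_states, speaker_a, reference_prosody, w_down, w_up)
out_b, code_b = condition(encoder_states, speaker_b, reference_prosody, w_down, w_up)
```

In this sketch, style transfer corresponds to taking `reference_prosody` from one speaker's utterance and the speaker vector from another: the prosody code stays the same while the speaker term changes. In a trained system the weights would be learned jointly with the synthesizer rather than drawn at random.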