PulseAugur

New HPRO framework enhances emotional expressiveness in LLM-based TTS

Researchers have proposed HPRO (Hierarchical Progressive Reward Optimization), a framework for improving emotional expressiveness in large language model (LLM)-based text-to-speech (TTS) systems. HPRO targets two limitations of current methods, information conflict and scale gaps, by introducing the HD-Emo codec. The codec separates content tokens from emotional preference tokens, so emotional expression can be optimized on its own without degrading semantic content. The framework then progressively aligns objectives at the frame, word, and sentence levels to widen emotional range while maintaining intelligibility.
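
One way to picture "progressively aligning objectives" across the frame, word, and sentence levels is a training-time schedule that shifts emphasis from fine-grained to coarse-grained rewards. The sketch below is purely illustrative and is not taken from the paper: the function names `level_weights` and `combined_reward`, the triangular schedule, and the idea of blending per-level scalar rewards are all assumptions made for the example.

```python
# Illustrative sketch only: a hypothetical schedule that moves emphasis
# from frame-level to word-level to sentence-level rewards during training.
# None of these names or formulas come from the HPRO paper.

LEVEL_CENTERS = {"frame": 0.0, "word": 0.5, "sentence": 1.0}


def level_weights(step: int, total_steps: int) -> dict:
    """Return per-level weights that sum to 1 for the given training step."""
    # Map the step to a progress value in [0, 1].
    progress = step / max(total_steps - 1, 1)
    # Triangular weight that peaks when progress reaches a level's center.
    raw = {
        level: max(0.0, 1.0 - 2.0 * abs(progress - center))
        for level, center in LEVEL_CENTERS.items()
    }
    total = sum(raw.values())
    return {level: weight / total for level, weight in raw.items()}


def combined_reward(rewards: dict, step: int, total_steps: int) -> float:
    """Blend per-level rewards using the schedule above (assumed design)."""
    weights = level_weights(step, total_steps)
    return sum(weights[level] * rewards[level] for level in LEVEL_CENTERS)


if __name__ == "__main__":
    # Made-up reward values, used only to show the call pattern.
    example = {"frame": 0.2, "word": 0.6, "sentence": 0.9}
    for step in range(5):
        print(step, level_weights(step, 5), combined_reward(example, step, 5))
```

Under this assumed schedule, early steps weight only the frame-level reward, the midpoint weights only the word-level reward, and the final step weights only the sentence-level reward, with linear blending in between. The actual HPRO objectives and their staging may differ.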

IMPACT This research could lead to more emotionally nuanced and natural-sounding AI-generated speech, improving user experience in applications like virtual assistants and content creation.

RANK_REASON The cluster contains an academic paper detailing a new method for text-to-speech synthesis.


AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Sihang Nie, Xiaofen Xing, Rui Xing, Haoming Li, Ruitong Xiao, Jingyuan Xing, Baiji Liu, Xiangmin Xu

    HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

    arXiv:2606.28249v1 · Announce type: cross · Abstract: Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotion…

  2. arXiv cs.CL TIER_1 English(EN) · Xiangmin Xu

    HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

    Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotional expressiveness. While preference-driven optimiz…