Brief

last 24h

[2/2] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

RESEARCH · arXiv cs.AI English(EN) · 18h · [2 sources]

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

Researchers have developed a method for analyzing how natural language instructions influence the output of style-captioned text-to-speech (TTS) systems. By adapting the DAAM framework to speech diffusion models, the study traces how specific words in a style caption shape the generated waveform. The findings indicate that style tokens have lower temporal variance than content tokens, and that their influence peaks in the early stages of generation and in the deeper layers of the model.

IMPACT Provides a deeper understanding of controllability in expressive TTS systems, potentially leading to improved voice generation.

RESEARCH · arXiv cs.AI English(EN) · 18h · [2 sources]

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

Researchers have developed FlowEdit, a framework for adapting frozen flow-matching text-to-speech (TTS) systems to lifelong pronunciation correction. Instead of retraining the model, FlowEdit learns pronunciation adjustments as latent edits in the text embedding space. These corrections are stored in a Modern Hopfield Network, which acts as an associative memory, and are retrieved during inference using soft attention. The approach reduces pronunciation errors on proper nouns, achieving a 92.7% relative decrease in Phoneme Error Rate on a multilingual benchmark while preserving overall speech quality.

IMPACT This research could lead to more adaptable and accurate text-to-speech systems that can learn from user feedback without full retraining.
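
The retrieval step described in the FlowEdit summary can be illustrated with a minimal sketch of a Modern Hopfield style readout: a softmax over scaled key-query similarities, used to blend stored edit vectors. This is not the paper's implementation. The function names, the toy two-dimensional vectors, the `beta` value, and the additive way the edit is applied to the embedding are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_edit(query, keys, edits, beta=8.0):
    """Soft-attention readout: weights = softmax(beta * <key, query>),
    result = weighted sum of the stored edit vectors.
    beta (inverse temperature) is a hypothetical setting."""
    scores = [beta * sum(k * q for k, q in zip(key, query)) for key in keys]
    weights = softmax(scores)
    dim = len(edits[0])
    return [sum(w * edit[d] for w, edit in zip(weights, edits)) for d in range(dim)]


def apply_edit(embedding, edit):
    """Assumption: the retrieved edit is added to the text embedding."""
    return [e + d for e, d in zip(embedding, edit)]


# Toy memory: two stored keys (stand-ins for text embeddings) and their edits.
keys = [[1.0, 0.0], [0.0, 1.0]]
edits = [[0.5, -0.5], [-0.2, 0.3]]

query = [0.9, 0.1]  # closest to the first stored key
edit = retrieve_edit(query, keys, edits)
corrected = apply_edit(query, edit)
```

With a large `beta`, the readout is dominated by the closest stored key, so the retrieved edit is nearly the first stored edit. A smaller `beta` would blend several stored corrections.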
