Brief · PulseAugur

TOOL · arXiv cs.CL English(EN) · 5d

Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs

Researchers have developed a speech-to-LLM interface called Convex Gate (C-Gate) that constrains speech representations to the LLM's input embedding manifold. This keeps the representations compatible with the pretrained LLM while preserving continuous expressivity, whereas previous methods either lost paralinguistic information or allowed representations to drift off that manifold. C-Gate showed strong joint performance on automatic speech recognition and emotion recognition, reducing word error rate by up to 48.7% and matching the emotion-recognition accuracy of single-task models. The study suggests that the geometry of time-resolved trajectories in the embedding space, rather than discrete token identities, is what matters for multimodal integration in frozen LLMs.
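The summary does not spell out how the constraint is enforced. One plausible reading of "convex", sketched below purely as an assumption, is that each speech frame is mapped to a convex combination of the LLM's token embeddings, so the result stays inside their convex hull while still varying continuously. The function name `convex_gate`, the projection matrix `proj`, and all shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def convex_gate(speech_frames, proj, token_embeddings):
    """Hypothetical sketch: map each speech frame to a convex
    combination of token embeddings (not the paper's confirmed method)."""
    logits = speech_frames @ proj                        # (T, V) scores over the vocabulary
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)        # rows are non-negative and sum to 1
    gated = weights @ token_embeddings                   # (T, D) points inside the convex hull
    return gated, weights

# Toy sizes: T frames, H speech features, V vocabulary entries, D embedding dims.
rng = np.random.default_rng(0)
T, H, V, D = 5, 8, 16, 4
speech_frames = rng.normal(size=(T, H))
proj = rng.normal(size=(H, V))              # would be learned in practice
token_embeddings = rng.normal(size=(V, D))  # stands in for the frozen LLM embedding table

gated, weights = convex_gate(speech_frames, proj, token_embeddings)
```

Under this assumption, every output vector is a weighted average of real token embeddings, which would explain compatibility with a frozen LLM, while the soft weights let it carry information that a single discrete token could not.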

IMPACT Introduces a new method for integrating speech data into LLMs, potentially improving multimodal AI capabilities.

LLMs
arXiv
Convex Gate (C-Gate)