Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

Study Finds Speech Representation Affects LLM Reasoning

By the PulseAugur editorial team · [1 source] · 2026-06-10 15:19

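Researchers investigated how different speech representations affect the reasoning ability of spoken dialogue models. They found that a mismatch in temporal granularity between speech tokens and text tokens weakens reasoning, because speech tokens typically carry more temporal redundancy. To address this, they introduced a factorized audio language model head and explored a range of frame rates, identifying 4.17 Hz as the best rate for spoken question answering when aligned with mid-layer representations.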
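To make the frame-rate figure concrete, the short sketch below works out how many speech tokens an utterance produces at a given frame rate and how much audio each token covers. Only the 4.17 Hz value comes from the summary; the 10-second utterance length, the 50 Hz and 12.5 Hz comparison rates, and the helper names `tokens_for` and `ms_per_token` are assumptions chosen for illustration, not values reported by the paper.

```python
def tokens_for(duration_s: float, frame_rate_hz: float) -> int:
    """Approximate number of speech tokens for an utterance at a fixed frame rate."""
    return round(duration_s * frame_rate_hz)


def ms_per_token(frame_rate_hz: float) -> float:
    """Audio duration covered by one speech token, in milliseconds."""
    return 1000.0 / frame_rate_hz


if __name__ == "__main__":
    duration_s = 10.0  # hypothetical utterance length (assumption)
    # 4.17 Hz is the rate named in the summary; 50 Hz and 12.5 Hz are
    # illustrative comparison points (assumptions).
    for rate_hz in (50.0, 12.5, 4.17):
        n = tokens_for(duration_s, rate_hz)
        span = ms_per_token(rate_hz)
        print(f"{rate_hz:>5} Hz: {n:>3} tokens, about {span:.0f} ms of audio per token")
```

Under these assumptions, a lower frame rate yields a much shorter token sequence for the same audio, which is the sense in which it reduces the length gap between speech and text that the summary describes.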

Impact: The study examines how speech representations affect LLM reasoning, which could improve spoken dialogue systems.

Ranking rationale: This is a research paper detailing an investigation of speech representations for large language models.

AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.CL · Wei Xue · 2026-06-10 15:19

Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text und…