PulseAugur
English(EN) AuRA: Internalizing Audio Understanding into LLMs as LoRA

New methods improve LLMs' spatial and general audio understanding

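Researchers have developed two new methods, Spatial-Omni and AuRA, to strengthen the audio understanding of large language models (LLMs). Spatial-Omni integrates spatial audio cues into existing LLMs through first-order Ambisonics (FOA) encoding and introduces a new dataset and benchmark for spatial audio tasks. AuRA uses a distillation approach with LoRA adaptation to internalize audio encoding into the LLM itself, which enables efficient parallel inference and outperforms cascaded systems.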
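To make the phrase "first-order Ambisonics encoding" concrete, here is a minimal sketch of how a mono source at a known direction maps to the four FOA channels. It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D gains); the summary does not say which convention Spatial-Omni uses, and the function names `encode_foa` and `estimate_azimuth_deg` are hypothetical, not taken from the paper.

```python
import math

def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample as FOA channels (assumed: ACN order, SN3D gains)."""
    az = math.radians(azimuth_deg)    # counterclockwise from front, positive = left
    el = math.radians(elevation_deg)  # positive = up
    w = sample                                # omnidirectional component
    y = sample * math.sin(az) * math.cos(el)  # left-right axis
    z = sample * math.sin(el)                 # up-down axis
    x = sample * math.cos(az) * math.cos(el)  # front-back axis
    return (w, y, z, x)

def estimate_azimuth_deg(foa_frames):
    """Recover azimuth from summed W*X and W*Y products (single-source case only)."""
    ix = sum(w * x for w, _, _, x in foa_frames)
    iy = sum(w * y for w, y, _, _ in foa_frames)
    return math.degrees(math.atan2(iy, ix))

# A 10 ms, 440 Hz tone at 16 kHz, placed 30 degrees to the left at ear level.
signal = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
frames = [encode_foa(s, azimuth_deg=30.0, elevation_deg=0.0) for s in signal]
print(round(estimate_azimuth_deg(frames), 1))
```

The point of the sketch is only that direction is carried by the relative gains of the X, Y and Z channels, which is the information a monaural pipeline discards. How Spatial-Omni feeds these channels to the LLM is not described in the summary.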

Impact: These methods could lead to more sophisticated multimodal AI systems capable of richer audio scene analysis and interaction.

Ranking rationale: Two research papers introduce new methods for integrating audio understanding into LLMs.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. arXiv cs.AI TIER_1 English(EN) · Zhiyuan Zhu, Yixuan Chen, Yiwen Shao, Wenxiang Guo, Changhao Pan, Yu Zhang, Yuxiang Wang, Wei Liu, Houhua Zhang, Chengkuan Zeng, Wenbo Cheng, Yunxi Liu, Rui Yang, Steve Yves, Liefeng Bo, Zhou Zhao ·

    Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding

    arXiv:2606.10738v1 Announce Type: cross Abstract: Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. …

  2. arXiv cs.AI TIER_1 English(EN) · Bo Cheng, Lei Shi, Zhanyu Ma, Yuan Wu, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He ·

    AuRA: Internalizing Audio Understanding into LLMs as LoRA

    arXiv:2606.11033v1 Announce Type: cross Abstract: Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse…

  3. arXiv cs.CL TIER_1 English(EN) · Renqing He ·

    AuRA: Internalizing Audio Understanding into LLMs as LoRA

    Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speec…

  4. arXiv cs.AI TIER_1 English(EN) · Zhou Zhao ·

    Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding

    Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that…