PulseAugur

New Benchmarks and Methods Advance Multimodal Large Language Model Capabilities

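Researchers are developing new methods for multimodal large language models (MLLMs) to improve their understanding of sequential audio-video data and large-scale visual recognition. One method, DLLM-VSR, applies a diffusion model to visual speech recognition and achieves state-of-the-art results by iteratively denoising and decoding the transcript. Another paper introduces SONIC-O1, a benchmark for evaluating MLLMs on real-world audio-video understanding, which highlights performance disparities across demographic groups. Also under exploration are new techniques for efficient MLLM training and inference, including heterogeneous parallelism for training and a divide-and-conquer strategy for inference that addresses the performance degradation caused by an expanding label space.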

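Impact: Advances in multimodal large language models are expected to improve performance in audio-video understanding, speech recognition, and large-scale vision tasks.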

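Ranking rationale: Multiple research papers introduce new models, benchmarks, and training/inference techniques for multimodal large language models.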

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 7 sources.


Sources [7]

  1. arXiv cs.AI TIER_1 English(EN) · Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro ·

    Diffusion Large Language Models for Visual Speech Recognition

    arXiv:2605.28456v1 Announce Type: new Abstract: Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, t…

  2. arXiv cs.AI TIER_1 English(EN) · Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza ·

    SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

    arXiv:2601.21666v2 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. …

  3. arXiv cs.LG TIER_1 English(EN) · Yashaswi Karnati, Kamran Jafari, Akash Mehra, Li Ding, Pranav Prashant Thombre, Ali Roshan Ghias, Shifang Xu, Parth Mannan, Yu Yao, Hao Wu, Eric Harper, Ashwath Aithal, Nima Tajbakhsh ·

    Heterogeneous Parallelism for Multimodal Large Language Model Training

    arXiv:2605.27678v1 Announce Type: new Abstract: Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layo…

  4. arXiv cs.AI TIER_1 English(EN) · Zhipeng Ye, Jiaqi Huang, Feng Jiang, Qiufeng Wang, Yikang Duan, Dawei Wang, Xihang Zhou, Qian Qiao ·

    Divide-and-Conquer LLM Inference for Large-Scale Visual Recognition

    arXiv:2605.24799v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision language tasks. However, when applied to large scale image classification, their performance degrades significantly as th…

  5. arXiv cs.LG TIER_1 English(EN) · Yulin Yuan, Hongshuo Zhao, Xiangming Meng ·

    Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

    arXiv:2605.25820v1 Announce Type: new Abstract: Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not o…

  6. arXiv cs.LG TIER_1 English(EN) · Xiangming Meng ·

    Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

    Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation,…

  7. arXiv cs.CV TIER_1 English(EN) · Yong Man Ro ·

    Diffusion Large Language Models for Visual Speech Recognition

    Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the best of our knowledge, the first Diffusion…