PulseAugur

New benchmarks SpecVQA and M3-VQA challenge multimodal large language models on scientific and multi-hop reasoning

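Researchers have introduced M$^3$-VQA, a new benchmark designed to evaluate multimodal large language models (MLLMs) on complex reasoning tasks involving multiple entities and multi-hop reasoning. The benchmark challenges models to understand fine-grained details across visual and textual sources and requires both sequential and parallel reasoning. An initial evaluation of 16 leading MLLMs revealed significant limitations in knowledge acquisition and reasoning, although performance improved substantially when the models were given precise evidence.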

Impact: By highlighting current limitations, the benchmark will drive progress in multimodal reasoning for large language models.

Ranking rationale: Introduces a new benchmark for evaluating multimodal large language models.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 6 sources.

Sources [6]

  1. arXiv cs.AI TIER_1 English(EN) · Jialu Shen, Han Lyu, Suyang Zhong, Hanzheng Li, Haoyi Tao, Nan Wang, Changhong Chen, Xi Fang ·

    SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

    arXiv:2604.28039v1 Announce Type: new Abstract: Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-specific characteristics. Here we in…

  2. arXiv cs.AI TIER_1 English(EN) · Xi Fang ·

    SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

    Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-specific characteristics. Here we introduce SpecVQA, a professional scientific-image…

  3. arXiv cs.CL TIER_1 English(EN) · Boqi Chen, Xudong Liu, Yunke Ao, Jianing Qiu ·

    Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective

    arXiv:2604.23443v1 Announce Type: new Abstract: Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. Howeve…

  4. Hugging Face Daily Papers TIER_1 English(EN) ·

    M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

    We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus …

  5. arXiv cs.CV TIER_1 English(EN) · Jiatong Ma, Longteng Guo, Yuchen Liu, Zijia Zhao, Dongze Hao, Xuanxu Lin, Jing Liu ·

    M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

    arXiv:2604.25122v1 Announce Type: new Abstract: We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop rea…

  6. arXiv cs.CV TIER_1 English(EN) · Jing Liu ·

    M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

    We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus …