
New benchmarks SpecVQA and M$^3$-VQA challenge multimodal large language models on scientific and multi-hop reasoning

By the PulseAugur editorial team · [6 sources] · 2026-04-28 01:57

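Researchers have introduced M$^3$-VQA, a new benchmark designed to evaluate multimodal large language models (MLLMs) on complex reasoning tasks that involve multiple entities and multi-hop reasoning. The benchmark challenges models to understand fine-grained details spanning visual and textual sources, and it requires both sequential and parallel reasoning. An initial evaluation of 16 leading MLLMs showed significant limitations in knowledge acquisition and reasoning, although performance improved substantially when the models were given precise evidence.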

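Impact: By highlighting current limitations, the benchmark will drive progress in the multimodal reasoning of large language models.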

Ranking rationale: Introduces a new benchmark for evaluating multimodal large language models.


AI-generated summary · Google Gemini · based on 6 sources.

Sources [6]

arXiv cs.AI TIER_1 English(EN) · Jialu Shen, Han Lyu, Suyang Zhong, Hanzheng Li, Haoyi Tao, Nan Wang, Changhong Chen, Xi Fang · 2026-05-01 04:00

SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

arXiv:2604.28039v1. Abstract: Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-specific characteristics. Here we in…
arXiv cs.AI TIER_1 English(EN) · Xi Fang · 2026-04-30 15:51

SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-specific characteristics. Here we introduce SpecVQA, a professional scientific-image…
arXiv cs.CL TIER_1 English(EN) · Boqi Chen, Xudong Liu, Yunke Ao, Jianing Qiu · 2026-04-28 04:00

Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective

arXiv:2604.23443v1. Abstract: Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. Howeve…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-04-28 01:57

M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus …
arXiv cs.CV TIER_1 English(EN) · Jiatong Ma, Longteng Guo, Yuchen Liu, Zijia Zhao, Dongze Hao, Xuanxu Lin, Jing Liu · 2026-04-29 04:00

M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

arXiv:2604.25122v1. Abstract: We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop rea…
arXiv cs.CV TIER_1 English(EN) · Jing Liu · 2026-04-28 01:57

M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus …
