WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

New VQA benchmarks and methods address knowledge, adaptability, and grounding

By the PulseAugur editorial team · [9 sources] · 2026-05-20 03:44

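Researchers have introduced several new benchmarks and methods for visual question answering (VQA) systems. HyLoVQA proposes dynamic hypernetwork-generated low-rank adaptation for continual VQA, improving adaptability to new tasks and objects. WikiVQABench provides a knowledge-grounded VQA benchmark built from Wikipedia and Wikidata, designed to test models on questions that require external knowledge. In addition, UCSF-PDGM-VQA focuses on brain tumor MRI interpretation and highlights the limitations of current VLMs in clinical settings, RoboSurg-VQA addresses segmentation-aware VQA for surgery, and VISTAQA benchmarks both answer correctness and pixel-level evidence grounding.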

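Impact: These new benchmarks and adaptation techniques aim to improve the reliability and capability of Vision-Language Models in complex, real-world scenarios.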

Ranking rationale: Multiple research papers introduce new benchmarks and methods for visual question answering.


AI-generated summary · Google Gemini · from 9 sources.

Sources [9]

arXiv cs.AI TIER_1 English(EN) · Ojonugwa Oluwafemi Ejiga Peter, Frederick Akor Ejiga, Fahmi Khalifa, Md Mahmudur Rahman · 2026-05-26 04:00

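Parameter-Efficient Vision-Language Models for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering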

arXiv:2605.24792v1 Announce Type: cross Abstract: The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the succe…

arXiv cs.AI TIER_1 English(EN) · Basel Shbita, Pengyuan Li, Anna Lisa Gentile · 2026-05-22 04:00

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

arXiv:2605.21479v1 Announce Type: cross Abstract: Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observa…

arXiv cs.AI TIER_1 English(EN) · Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil · 2026-05-22 04:00

UCSF-PDGM-VQA: A Visual Question Answering Dataset for Brain Tumor MRI Interpretation

arXiv:2605.17140v2 Announce Type: replace-cross Abstract: Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process r…

arXiv cs.CL TIER_1 English(EN) · Yiran Wang, Chenyi Xiong, Ziyue Qin, Miao Zhang, Kui Xiao, Zhifei Li · 2026-05-22 04:00

HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

arXiv:2605.22035v1 Announce Type: cross Abstract: Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This ofte…

arXiv cs.CL TIER_1 English(EN) · Zhifei Li · 2026-05-21 06:12

HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This often leads to cross-level task interference, hinderin…

arXiv cs.AI TIER_1 English(EN) · Anna Lisa Gentile · 2026-05-20 17:58

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-20 17:58

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce…

arXiv cs.CV TIER_1 English(EN) · Chengyi Zhang, Zi Ye, Ziyang Wang · 2026-05-25 04:00

RoboSurg-VQA: A Multimodal Benchmark for Surgical Segmentation-Aware Visual Question Answering

arXiv:2605.23068v1 Announce Type: new Abstract: Reliable visual understanding in robot-assisted and minimally invasive surgery (RMIS/MIS) demands more than accurate masks: in clinical practice, clinicians pose language-like questions about procedural context, visibility, artefact…

arXiv cs.CV TIER_1 English(EN) · Krzysztof Czarnecki · 2026-05-20 03:44

VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing b…
