WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
New VQA benchmarks and methods tackle knowledge, adaptability, and grounding
By the PulseAugur editorial team · [8 sources]
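Researchers have introduced several new benchmarks and methods for visual question answering (VQA). HyLoVQA proposes low-rank adaptation generated dynamically by a hypernetwork for continual VQA, improving adaptation to new tasks and objects. WikiVQABench offers a knowledge-augmented VQA benchmark built from Wikipedia and Wikidata, designed to test models on questions that require external knowledge. UCSF-PDGM-VQA focuses on brain-tumor MRI interpretation and exposes the limitations of current VLMs in clinical settings, RoboSurg-VQA addresses segmentation-aware VQA for surgery, and VISTAQA benchmarks both answer correctness and pixel-level evidence grounding.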
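To make "hypernetwork-generated low-rank adaptation" concrete, here is a minimal sketch of the general technique, not HyLoVQA's actual architecture, which this summary does not describe. It assumes a single frozen linear layer `W_base`, a hypothetical per-task embedding, and a hypothetical linear hypernetwork `H` that maps that embedding to LoRA factors `A` and `B`. Only the generated low-rank update differs between tasks; the shared base weight is never overwritten.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): layer input/output dims, LoRA rank, task-embedding dim.
D_IN, D_OUT, RANK, TASK_DIM = 16, 8, 2, 4

# Frozen base weight shared by all tasks; it is never updated.
W_base = rng.standard_normal((D_OUT, D_IN)) * 0.1

# Hypothetical hypernetwork: one linear map from a task embedding to the
# flattened LoRA factors A (RANK x D_IN) and B (D_OUT x RANK).
N_PARAMS = RANK * D_IN + D_OUT * RANK
H = rng.standard_normal((N_PARAMS, TASK_DIM)) * 0.1


def lora_factors(task_emb):
    """Generate task-specific low-rank factors (A, B) from a task embedding."""
    flat = H @ task_emb
    A = flat[: RANK * D_IN].reshape(RANK, D_IN)
    B = flat[RANK * D_IN :].reshape(D_OUT, RANK)
    return A, B


def adapted_forward(x, task_emb, alpha=1.0):
    """Compute y = (W_base + alpha * B @ A) x; the update B @ A has rank <= RANK."""
    A, B = lora_factors(task_emb)
    delta = alpha * (B @ A)
    return (W_base + delta) @ x


# Usage: the same input gets a different task-specific output per task embedding.
x = rng.standard_normal(D_IN)
task_a = rng.standard_normal(TASK_DIM)
task_b = rng.standard_normal(TASK_DIM)
y_a = adapted_forward(x, task_a)
y_b = adapted_forward(x, task_b)
```

Because each task's update is produced from its embedding, adding a task means supplying a new embedding rather than rewriting `W_base`. How HyLoVQA trains its hypernetwork, and how it limits forgetting inside the hypernetwork itself, is not covered in the source.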
AI
arXiv:2605.22035v1 Announce Type: cross Abstract: Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This ofte…
arXiv cs.AI
Basel Shbita, Pengyuan Li, Anna Lisa Gentile
arXiv:2605.21479v1 Announce Type: cross Abstract: Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observa…
arXiv cs.AI
Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil
arXiv:2605.17140v2 Announce Type: replace-cross Abstract: Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process r…
Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This often leads to cross-level task interference, hinderin…
Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce…
arXiv cs.CV
Chengyi Zhang, Zi Ye, Ziyang Wang
arXiv:2605.23068v1 Announce Type: new Abstract: Reliable visual understanding in robot-assisted and minimally invasive surgery (RMIS/MIS) demands more than accurate masks: in clinical practice, clinicians pose language-like questions about procedural context, visibility, artefact…
Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing b…
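VISTAQA's exact scoring protocol is not given in this excerpt. As a hedged illustration of what jointly scoring "answer correctness and pixel-level evidence grounding" could look like, the sketch below counts a prediction only if the answer matches and the predicted evidence mask reaches an IoU threshold against the gold mask. The masks-as-pixel-sets representation, the `mask_iou` and `grounded_accuracy` helpers, the sample dictionary keys, and the 0.5 threshold are all hypothetical choices.

```python
def mask_iou(pred, gold):
    """Intersection-over-union of two binary masks given as sets of (row, col) pixels."""
    if not pred and not gold:
        return 1.0  # two empty masks agree trivially
    return len(pred & gold) / len(pred | gold)


def grounded_accuracy(samples, iou_threshold=0.5):
    """Fraction of samples whose answer is correct AND whose evidence mask
    overlaps the gold evidence mask with IoU >= iou_threshold."""
    if not samples:
        return 0.0
    hits = 0
    for s in samples:
        # Case- and whitespace-insensitive exact match (a simplifying assumption).
        answer_ok = s["pred_answer"].strip().lower() == s["gold_answer"].strip().lower()
        grounded = mask_iou(s["pred_mask"], s["gold_mask"]) >= iou_threshold
        if answer_ok and grounded:
            hits += 1
    return hits / len(samples)


# Toy example: a 2x2 gold evidence region (4 pixels).
square = {(r, c) for r in range(2) for c in range(2)}
samples = [
    # Right answer, evidence fully overlaps the gold region (IoU 1.0).
    {"pred_answer": "Cat", "gold_answer": "cat", "pred_mask": square, "gold_mask": square},
    # Right answer, but evidence covers only 1 of 4 gold pixels (IoU 0.25).
    {"pred_answer": "dog", "gold_answer": "dog", "pred_mask": {(0, 0)}, "gold_mask": square},
    # Wrong answer, even though the evidence mask is perfect.
    {"pred_answer": "bird", "gold_answer": "cat", "pred_mask": square, "gold_mask": square},
]
score = grounded_accuracy(samples)
```

In this toy example, answer-only accuracy would be 2/3, but the second answer is not backed by enough of the gold evidence region, so the grounded score drops to 1/3. That gap is the kind of mismatch an evaluation that does not enforce prediction-evidence alignment would miss.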