New VQA benchmarks and methods tackle knowledge, adaptation, and grounding
By PulseAugur Editorial · [9 sources]
Researchers have introduced several new benchmarks and methods for Visual Question Answering (VQA). HyLoVQA proposes a dynamic, hypernetwork-generated low-rank adaptation technique for continual VQA, intended to improve adaptation to new tasks and objects. WikiVQABench is a knowledge-grounded VQA benchmark built on Wikipedia and Wikidata, designed to test models on questions that require external knowledge. UCSF-PDGM-VQA focuses on brain tumor MRI interpretation and highlights the limitations of current vision-language models (VLMs) in clinical settings. RoboSurg-VQA addresses segmentation-aware VQA for surgery, and VISTAQA jointly benchmarks answer correctness and pixel-level evidence grounding.
AI
IMPACT
These new benchmarks and adaptation techniques aim to improve the reliability and capabilities of Vision-Language Models in complex, real-world scenarios.
RANK_REASON
Multiple research papers introducing new benchmarks and methods for Visual Question Answering.
arXiv:2605.24792v1 Announce Type: cross Abstract: The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the succe…
arXiv cs.AI
TIER_1 · English (EN) · Basel Shbita, Pengyuan Li, Anna Lisa Gentile
arXiv:2605.21479v1 Announce Type: cross Abstract: Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observa…
arXiv cs.AI
TIER_1 · English (EN) · Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil
arXiv:2605.17140v2 Announce Type: replace-cross Abstract: Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process r…
arXiv:2605.22035v1 Announce Type: cross Abstract: Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This ofte…
Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This often leads to cross-level task interference, hinderin…
Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce…
arXiv cs.CV
TIER_1 · English (EN) · Chengyi Zhang, Zi Ye, Ziyang Wang
arXiv:2605.23068v1 Announce Type: new Abstract: Reliable visual understanding in robot-assisted and minimally invasive surgery (RMIS/MIS) demands more than accurate masks: in clinical practice, clinicians pose language-like questions about procedural context, visibility, artefact…
Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing b…