New framework unifies segmentation and VQA for robotic surgery

By PulseAugur Editorial · [1 source] · 2026-06-16 04:00

Researchers have developed a framework that unifies pixel-level segmentation and visual question answering (VQA) for robotic surgery. A vision-language model (VLM) generates object tokens, which guide answer prediction and are decoded into segmentation masks by a SAM-based decoder. Because the object tokens are optimized with both segmentation and VQA objectives, the model learns spatially grounded representations that enhance reasoning and provide explicit pixel-level grounding. The authors report superior performance on the RAMIE and EndoVis18 datasets, with improved fine-grained surgical scene understanding.
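The joint optimization described in the summary can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: the summary does not state the loss functions, their weighting, or the decoder internals, so the Dice loss, the cross-entropy loss, the `seg_weight` parameter, the dot-product mask decoder (a stand-in for the SAM-based decoder), and all function names here are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mask_probs(object_token, pixel_embeddings):
    # Hypothetical stand-in for a SAM-based decoder: per-pixel foreground
    # probability from the dot product of the token with each pixel embedding.
    return [sigmoid(sum(t * p for t, p in zip(object_token, px)))
            for px in pixel_embeddings]

def dice_loss(probs, target, eps=1e-6):
    # Assumed segmentation objective (soft Dice); the paper may use another.
    inter = sum(p * t for p, t in zip(probs, target))
    denom = sum(probs) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def answer_logits(object_token, answer_weights):
    # Hypothetical linear answer head: one weight row per candidate answer.
    return [sum(t * w for t, w in zip(object_token, row))
            for row in answer_weights]

def cross_entropy(logits, label):
    # Assumed VQA objective over a fixed answer vocabulary.
    return -math.log(softmax(logits)[label])

def joint_loss(object_token, pixel_embeddings, target_mask,
               answer_weights, answer_label, seg_weight=1.0):
    # The same object token feeds both heads, so both objectives shape it.
    seg = dice_loss(mask_probs(object_token, pixel_embeddings), target_mask)
    vqa = cross_entropy(answer_logits(object_token, answer_weights), answer_label)
    return vqa + seg_weight * seg, seg, vqa

# Toy data: four "pixels" with 2-D embeddings, two candidate answers.
pixel_embeddings = [[2.0, 0.0], [0.0, 2.0], [1.5, -0.5], [-1.0, 1.0]]
target_mask = [1, 0, 1, 0]
answer_weights = [[1.0, 0.0], [0.0, 1.0]]
answer_label = 0

total, seg, vqa = joint_loss([1.0, -1.0], pixel_embeddings, target_mask,
                             answer_weights, answer_label)
total_flipped, _, _ = joint_loss([-1.0, 1.0], pixel_embeddings, target_mask,
                                 answer_weights, answer_label)
print(f"aligned token loss: {total:.3f}, flipped token loss: {total_flipped:.3f}")
```

In this toy setup, a token aligned with the target region and the correct answer scores a lower combined loss than its negation. It shows only the general idea that a single shared token can receive training signal from both tasks.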

IMPACT Enhances fine-grained surgical scene understanding and reasoning for robotic surgery applications.

RANK_REASON The cluster contains an academic paper detailing a new technical approach in computer vision.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Yiping Li, Ronald de Jong, Romy van Jaarsveld, Franco Badaloni, Gino Kuiper, Jelle Ruurda, Josien Pluim, Marcel Breeuwer · 2026-06-16 04:00

Object Tokens as a Bridge Between Segmentation and Visual Question Answering in Robotic Surgery

arXiv:2606.15861v1 Announce Type: new Abstract: Visual Question Answering (VQA) in robotic surgery, referred to as surgical VQA, requires high-level understanding of complex surgical scenes and the integration of visual perception with language reasoning, with the potential to su…
