Brief · PulseAugur

TOOL · arXiv cs.CV English(EN) · 8h

Object Tokens as a Bridge Between Segmentation and Visual Question Answering in Robotic Surgery

Researchers have developed a framework that unifies pixel-level segmentation and visual question answering (VQA) for robotic surgery. A vision-language model (VLM) generates object tokens that both guide answer prediction and drive a SAM-based decoder to produce segmentation masks. Because the object tokens are optimized with segmentation and VQA objectives together, the model learns spatially grounded representations that support reasoning and tie each answer to explicit pixel-level evidence. The method reportedly achieved superior performance on the RAMIE and EndoVis18 datasets, improving fine-grained surgical scene understanding.
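
To make the joint-optimization idea concrete, here is a minimal sketch of how a VQA loss and a segmentation loss could be combined into one training signal. It is an illustration only: the brief does not say which losses or weighting the authors use, so the softmax cross-entropy, the soft Dice loss, the `seg_weight` parameter, and all function names are assumptions.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one question over candidate answers."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def dice_loss(pred_probs, target_mask, eps=1e-6):
    """Soft Dice loss between flattened mask probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(pred_probs, target_mask))
    total = sum(pred_probs) + sum(target_mask)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def joint_loss(answer_logits, answer_index, mask_probs, mask_target, seg_weight=1.0):
    """Hypothetical combined objective: VQA loss plus weighted segmentation loss.

    In a real model both terms would backpropagate into the shared object
    tokens; here they are simply added to show the shape of the objective.
    """
    vqa_term = cross_entropy(answer_logits, answer_index)
    seg_term = dice_loss(mask_probs, mask_target)
    return vqa_term + seg_weight * seg_term


if __name__ == "__main__":
    # Toy example: three candidate answers and a four-pixel mask.
    loss = joint_loss([2.0, 0.5, -1.0], 0, [0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])
    print(f"joint loss: {loss:.4f}")
```

Setting `seg_weight` to 0 reduces the sketch to a VQA-only objective, which is one simple way to see what the segmentation term contributes.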

IMPACT Enhances fine-grained surgical scene understanding and reasoning for robotic surgery applications.

SAM
vision-language model
visual question answering
Computer vision and pattern recognition
Object Tokens
robot-assisted surgery
RAMIE
EndoVis18