Researchers have introduced X2SAM, a novel multimodal large language model designed to perform pixel-level segmentation across both images and videos. Unlike previous models that relied on low-level visual prompts or were specialized for a single media type, X2SAM can interpret complex conversational instructions and visual prompts to generate temporally consistent video masks. The work also introduces the Video Visual Grounded (V-VGD) segmentation benchmark, and X2SAM demonstrates strong performance on video segmentation while remaining competitive on image segmentation tasks.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a unified approach to segmentation across images and videos, potentially improving multimodal AI capabilities.
RANK_REASON This is a research paper describing a new model and benchmark.