An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation
Researchers have introduced a new task, Multi-temporal Referring Segmentation (MTRS), to evaluate how well Large Vision-Language Models (LVLMs) understand and segment language-described changes across multiple time-stamped images. They have also developed CRAFT-Agent, a pipeline for constructing a dataset named MTRefSeg-21K, which contains over 21,000 image-text-mask triplets. Because existing models perform poorly on this task, they propose MTRefSeg-R1, an LVLM framework trained in two stages: it first learns temporal change perception and is then fine-tuned for language-guided localization, which the authors report improves results.
AI IMPACT: Introduces a new benchmark and framework to advance LVLM capabilities in understanding temporal changes in images.