Brief · PulseAugur

TOOL · arXiv cs.AI · 9h

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Researchers have introduced a new task, Multi-temporal Referring Segmentation (MTRS), which evaluates how well Large Vision-Language Models (LVLMs) understand and segment language-described changes across multiple time-stamped images. They also developed CRAFT-Agent, a pipeline used to construct MTRefSeg-21K, a dataset of more than 21,000 image-text-mask triplets. Because existing models perform poorly on this task, they propose MTRefSeg-R1, an LVLM framework that first learns temporal change perception and is then fine-tuned for language-guided localization, and they report improved results.

IMPACT Introduces a new benchmark and framework intended to advance LVLMs' ability to understand temporal changes across images.

Large Vision-Language Models (LVLMs)
Multi-temporal Referring Segmentation (MTRS)
MTRefSeg-21K
MTRefSeg-R1
CRAFT-Agent