New MTRS benchmark and CRAFT-Agent tackle multi-temporal vision-language tasks

By PulseAugur Editorial · [1 sources] · 2026-06-02 04:00

Researchers have introduced Multi-temporal Referring Segmentation (MTRS), a task that evaluates how well Large Vision-Language Models (LVLMs) understand and segment language-described changes across multiple time-stamped images. They also developed CRAFT-Agent, a pipeline used to construct MTRefSeg-21K, a dataset of more than 21,000 image-text-mask triplets. Because existing models perform poorly on this task, they propose MTRefSeg-R1, an LVLM framework that first learns temporal change perception and is then fine-tuned for language-guided localization, and they report improved results.

IMPACT Introduces a new benchmark and framework to advance LVLM capabilities in understanding temporal changes in images.

RANK_REASON The cluster contains a research paper introducing a new task, dataset, and model framework.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao, Junyu Gao, Xuelong Li · 2026-06-02 04:00

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

arXiv:2606.00987v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce …
