Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning
Researchers have developed Rea2Seg, a two-stage framework for image segmentation that leverages multimodal large language models (MLLMs). The approach first identifies candidate masks from an MLLM's attention maps, then uses the MLLM to reason over these candidates and select the most accurate one. To evaluate and advance these capabilities, the authors also introduce a new benchmark, ReasonSeg-SGDR, which assesses perception, grounding, and reasoning abilities across several dimensions.
AI IMPACT Introduces a new method for improving MLLM-based image segmentation and a benchmark to evaluate these models.
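
The two-stage idea in the summary (discover candidates, then compare and pick one) can be illustrated with a toy sketch. This is not the Rea2Seg implementation: the summary does not say how masks are derived from attention maps or how the MLLM compares them. The sketch assumes, purely for illustration, that candidates come from thresholding a normalized attention map at several levels, and that a caller-supplied `score_fn` stands in for the MLLM's comparative judgment. The names `discover_candidates`, `select_candidate`, `iou`, and the threshold values are all hypothetical.

```python
from typing import Callable, List, Sequence

Mask = List[List[bool]]


def discover_candidates(
    attention: List[List[float]],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> List[Mask]:
    """Stage 1 (assumed): binarize a min-max normalized attention map
    at several thresholds, yielding one candidate mask per threshold."""
    flat = [v for row in attention for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat map
    return [
        [[(v - lo) / span >= t for v in row] for row in attention]
        for t in thresholds
    ]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized boolean masks."""
    inter = sum(x and y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x or y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def select_candidate(
    candidates: List[Mask], score_fn: Callable[[Mask], float]
) -> Mask:
    """Stage 2 (assumed): return the best-scoring candidate.
    score_fn is a placeholder for the MLLM's comparative reasoning."""
    return max(candidates, key=score_fn)


# Toy 6x6 attention map: a strong 2x2 core inside a weaker 4x4 region.
attention = [[0.0] * 6 for _ in range(6)]
for r in range(1, 5):
    for c in range(1, 5):
        attention[r][c] = 0.4
for r in range(2, 4):
    for c in range(2, 4):
        attention[r][c] = 1.0

candidates = discover_candidates(attention)

# Placeholder scorer: overlap with a known reference region. A real
# system would not have this reference; the MLLM would judge instead.
reference = [[1 <= r <= 4 and 1 <= c <= 4 for c in range(6)] for r in range(6)]
best = select_candidate(candidates, lambda m: iou(m, reference))
```

In this toy setup the lowest threshold recovers the full 4x4 region while the higher thresholds keep only the 2x2 core, so the selection step prefers the first candidate. The point is only the shape of the pipeline: several plausible masks are generated first, and a separate comparison step chooses among them.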