Brief · PulseAugur

TOOL · arXiv cs.CV English(EN) · 8h

SAMTok: Representing Any Mask with Two Words

Researchers have developed SAMTok, a novel method for integrating pixel-level understanding into multi-modal large language models (MLLMs). This technique converts any region mask into two discrete tokens, allowing standard MLLMs like QwenVL to process and generate masks without architectural changes. Trained on a large dataset of masks, SAMTok enables models to achieve state-of-the-art results in various region-based tasks, including captioning, visual question answering, and referring segmentation. AI

IMPACT Enables standard LLMs to perform complex pixel-level image manipulation tasks, potentially broadening their application in interactive AI systems.

Hugging Face
arXiv
DagsHub
SAM2
alphaXiv
ScienceCast
CatalyzeX
Gotit.pub
QwenVL
Yikang Zhou