Brief · PulseAugur

TOOL · arXiv cs.AI · 6h

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Researchers have introduced SAGA, a framework that uses a frozen multimodal large language model (MLLM) to improve visual embeddings for image retrieval. Where traditional methods supervise the encoder with a uniform scalar distance, SAGA derives attribute-specific gradients from the frozen MLLM, giving the encoder finer-grained supervision. This helps the encoder capture the attributes that differentiate images, and the authors report significant gains in zero-shot image retrieval on several benchmark datasets.
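
To make the contrast between a uniform scalar distance and attribute-specific supervision concrete, here is a toy sketch. It is not the SAGA implementation: the function names are hypothetical, and the hand-written `weights` vector is an assumed stand-in for whatever per-attribute signal a frozen MLLM would supply. It only shows that a plain squared distance pushes on every embedding dimension alike, while a per-attribute weighting can concentrate the gradient on the dimensions that matter.

```python
# Toy illustration only; NOT the method from the paper.
# Assumption: each embedding dimension stands for one "attribute", and
# `weights` is a hypothetical per-attribute signal (hand-written here).

def scalar_distance_grad(anchor, other):
    """Gradient of 0.5 * ||anchor - other||^2 with respect to anchor."""
    return [a - o for a, o in zip(anchor, other)]

def attribute_weighted_grad(anchor, other, attribute_weights):
    """Gradient of 0.5 * sum_k w_k * (anchor_k - other_k)^2 w.r.t. anchor."""
    return [w * (a - o) for a, o, w in zip(anchor, other, attribute_weights)]

anchor = [1.0, 2.0, 3.0]
other = [0.0, 2.0, 1.0]
weights = [2.0, 1.0, 0.0]  # hypothetical: attribute 0 matters most, attribute 2 not at all

uniform = scalar_distance_grad(anchor, other)
weighted = attribute_weighted_grad(anchor, other, weights)

print(uniform)   # [1.0, 0.0, 2.0]
print(weighted)  # [2.0, 0.0, 0.0]
```

In this toy setup the unweighted gradient is largest on the third dimension, while the weighted one ignores it and doubles the push on the first. How SAGA actually obtains and applies its attribute gradients is described in the paper, not here.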

IMPACT Provides attribute-aware supervision for visual embeddings, improving image retrieval over state-of-the-art baselines.

Group Relative Policy Optimization
SAGA
GRPO
CUB-200-2011
Cars-196
FGVC-Aircraft
iNaturalist Aves
Shubhang Bhatnagar