Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings
Researchers have introduced SAGA, a framework that uses multimodal large language models (MLLMs) to improve visual embeddings for image retrieval. Unlike traditional methods that rely on uniform scalar distances, SAGA derives attribute-specific gradients from a frozen MLLM to provide more nuanced supervision. This improves the encoder's ability to capture the attributes that differentiate images, which leads to significant improvements in zero-shot image retrieval performance across several benchmark datasets.
IMPACT: Enhances image retrieval by providing attribute-aware supervision for visual embeddings, outperforming state-of-the-art (SOTA) baselines.