ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval
Researchers have introduced ELVA, a framework designed to address "grain blindness" in Multimodal Large Language Models (MLLMs) used for Universal Multimodal Retrieval (UMR). Grain blindness occurs when a model treats all negative samples equally and therefore overlooks the nuanced information within complex queries. ELVA uses a rule-based Reinforcement Learning with Verifiable Rewards (RLVR) framework to differentiate negative samples by their similarity to the positive samples, which improves the model's ability to learn distinct grain information. The work also introduces MRBench, a new benchmark for evaluating multi-grain query scenarios. ELVA achieves state-of-the-art results on standard retrieval benchmarks and a 13.1% improvement on MRBench.
AI IMPACT: This research could lead to more nuanced and effective multimodal retrieval systems, improving how AI models understand and process complex queries across different data types.
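To make the idea of grading negatives concrete, here is a minimal Python sketch of a rule-based ranking reward. It is an illustration only, not ELVA's actual reward: the summary above does not specify the reward formula, so the pairwise-concordance rule, the function name `ranking_reward`, and the integer grades (2 for the positive, 1 for a negative that is close to the positive, 0 for an unrelated negative) are all assumptions made for this example.

```python
from itertools import combinations


def ranking_reward(predicted_order, grade):
    """Hypothetical rule-based reward: the fraction of candidate pairs with
    different grades that the predicted ranking puts in the right order.

    predicted_order: candidate ids, best first (the model's ranking).
    grade: dict mapping candidate id -> integer relevance grade
           (assumed here: 2 = positive, 1 = hard negative, 0 = easy negative).
    """
    # Candidates missing from the ranking are treated as ranked last.
    last = len(predicted_order)
    position = {cand: i for i, cand in enumerate(predicted_order)}

    concordant = 0
    total = 0
    for a, b in combinations(grade, 2):
        if grade[a] == grade[b]:
            continue  # pairs with equal grades carry no ordering signal
        total += 1
        higher, lower = (a, b) if grade[a] > grade[b] else (b, a)
        if position.get(higher, last) < position.get(lower, last):
            concordant += 1
    return concordant / total if total else 0.0


# Illustrative grades (assumed labels, not data from the paper).
grades = {"pos": 2, "hard_neg": 1, "easy_neg": 0}

perfect = ranking_reward(["pos", "hard_neg", "easy_neg"], grades)   # 1.0
swapped = ranking_reward(["pos", "easy_neg", "hard_neg"], grades)   # 2/3
reversed_ = ranking_reward(["easy_neg", "hard_neg", "pos"], grades)  # 0.0
```

A binary reward that only checks whether the positive is ranked first would score the first two rankings identically. The graded version penalizes placing the unrelated negative above the near-miss negative, which is the kind of distinction the summary describes as missing under grain blindness.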