A new research paper questions the effectiveness of current benchmarks for Composed Image Retrieval (CIR), a task that requires models to combine image and text information. The study found that many existing CIR benchmarks can be solved using only one modality, indicating that models exploit "unimodal shortcuts" rather than truly composing information. After auditing and validating queries, the researchers re-evaluated models on a cleaner subset; the results showed greater reliance on multimodal composition but lower accuracy, suggesting that current benchmarks overestimate model capabilities.
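The audit idea described above can be illustrated with a minimal sketch: a benchmark query has a "unimodal shortcut" if retrieval using only the text embedding, or only the reference-image embedding, already ranks the target first. The function names and the cosine-similarity scoring here are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top1_hit(query_emb, gallery_embs, target_idx):
    """True if the target gallery item is ranked first for this query."""
    scores = [cosine(query_emb, g) for g in gallery_embs]
    return scores.index(max(scores)) == target_idx

def is_unimodal_shortcut(text_emb, image_emb, gallery_embs, target_idx):
    """Flag a CIR query that either modality alone already solves."""
    return (top1_hit(text_emb, gallery_embs, target_idx)
            or top1_hit(image_emb, gallery_embs, target_idx))
```

Queries flagged by a check like this would be removed or revised to build the cleaner subset on which models are re-evaluated.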
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Reveals limitations in current AI evaluation methods for multimodal tasks, potentially guiding future benchmark development.
RANK_REASON Academic paper analyzing benchmark limitations for a specific AI task.