Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

CIR benchmarks overestimate AI models' compositional abilities

By PulseAugur Editorial · [1 source] · 2026-05-14 12:56

A new research paper questions the validity of current Composed Image Retrieval (CIR) benchmarks. CIR is a task that requires a model to combine information from an image and a piece of text. The study found that many existing CIR benchmarks can be solved using only one modality, indicating that models are exploiting "unimodal shortcuts" rather than truly composing information. After auditing and validating queries, the researchers re-evaluated models on a cleaner subset. On that subset, models relied more heavily on multimodal composition but were also less accurate, suggesting that current benchmarks overestimate model capabilities.

Impact: Reveals limitations in how AI is currently evaluated on multimodal tasks, and may guide future benchmark development.

Ranking rationale: Academic paper analyzing benchmark limitations for a specific AI task.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CV · Pasquale Minervini · 2026-05-14 12:56

Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal co…
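The idea of a "unimodal shortcut" can be illustrated with a small check: rank the gallery using only the reference image, only the text, and a combination of the two, then compare how often the target is retrieved. The following is a minimal sketch under stated assumptions, not the paper's audit procedure. The toy embeddings, the additive fusion, and the names `shortcut_report` and `shortcut_queries` are all hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, gallery, k):
    """Indices of the k gallery items most similar to the query."""
    order = sorted(range(len(gallery)),
                   key=lambda i: cosine(query, gallery[i]),
                   reverse=True)
    return order[:k]


def hits(queries, gallery, targets, k):
    """One boolean per query: is the target among the top-k results?"""
    return [t in top_k(q, gallery, k) for q, t in zip(queries, targets)]


def shortcut_report(image_q, text_q, gallery, targets, k=1):
    """Recall@k for image-only, text-only, and composed queries.

    Assumption: the composed query is the element-wise sum of the image
    and text embeddings. This is a stand-in for a real CIR model.
    """
    composed = [[a + b for a, b in zip(img, txt)]
                for img, txt in zip(image_q, text_q)]
    img_hits = hits(image_q, gallery, targets, k)
    txt_hits = hits(text_q, gallery, targets, k)
    cmp_hits = hits(composed, gallery, targets, k)
    n = len(targets)
    return {
        "image_only": sum(img_hits) / n,
        "text_only": sum(txt_hits) / n,
        "composed": sum(cmp_hits) / n,
        # Queries that a single modality already solves.
        "shortcut_queries": [i for i in range(n) if img_hits[i] or txt_hits[i]],
    }


# Toy data (hypothetical): four gallery items, two queries.
gallery = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
image_q = [[1, 0, 0], [1, 0, 0]]   # reference-image embeddings
text_q = [[0, 1, 0], [0, 0, 2]]    # text-modification embeddings
targets = [2, 3]                   # index of the target image per query

report = shortcut_report(image_q, text_q, gallery, targets, k=1)
print(report)
```

In this toy setup the first query is only solved by the composed representation, while the second is already solved by the text alone, so it is flagged in `shortcut_queries`. A benchmark dominated by queries of the second kind would report high accuracy without testing composition.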

