UniNote: A Unified Embedding Model for Multimodal Representation and Ranking

New frameworks advance multimodal retrieval for documents and images

By the PulseAugur editorial team · [15 sources] · 2026-05-28 03:11

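Researchers have introduced several new frameworks and benchmarks for multimodal retrieval tasks. Dynamic Adapter Routing (DAR) addresses continual multimodal retrieval by selecting adapters through prototype-based routing. V-SPLADE is an inference-free sparse retriever for visual documents that improves lexical grounding through caption-gated token supervision. HiKEY proposes a hierarchical retrieval framework for document question answering that uses document structure for better routing and evidence integration. In addition, DeepImageSearch frames image retrieval as an autonomous exploration task over visual histories and introduces a new benchmark (DISBench) to evaluate agentic reasoning.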
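As a rough illustration of what prototype-based routing between adapters can look like, here is a minimal sketch. It is not the DAR paper's method, which the summary does not describe in detail. It assumes each adapter is represented by one prototype vector (the mean of a few example embeddings) and that a query goes to the adapter whose prototype has the highest cosine similarity. The adapter names, the toy embeddings, and the helpers `build_prototypes` and `route` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_prototypes(task_embeddings):
    """Assumption: one prototype per adapter, the mean of its example embeddings."""
    return {name: mean_vector(vecs) for name, vecs in task_embeddings.items()}


def route(query, prototypes):
    """Return the name of the adapter whose prototype is most similar to the query."""
    return max(prototypes, key=lambda name: cosine(query, prototypes[name]))


# Hypothetical toy data: two adapters, each with two 3-dimensional example embeddings.
task_embeddings = {
    "adapter_docs": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "adapter_photos": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
prototypes = build_prototypes(task_embeddings)
chosen = route([0.7, 0.2, 0.1], prototypes)
print(chosen)
```

In a real system the prototypes would come from a learned encoder and the selected adapter would then be applied to the backbone model; both steps are outside this sketch.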

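Impact: These advances offer improved ways to search and understand complex multimodal data, and may accelerate research and application development in areas such as document analysis and visual question answering.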

Ranking rationale: Multiple research papers introduce new methods and benchmarks for multimodal retrieval tasks.

Read on arXiv cs.IR (Information Retrieval) →

AI-generated summary · Google Gemini · from 15 sources. How we write summaries →

Sources [15]

arXiv cs.CL TIER_1 English(EN) · Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales · 2026-06-05 04:00

To Be Multimodal or Not to Be Multimodal: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

arXiv:2606.05931v1 Announce Type: new Abstract: When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusin…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Mark Gales · 2026-06-04 09:33

To Be Multimodal or Not to Be Multimodal: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, …

arXiv cs.AI TIER_1 English(EN) · Jingbiao Mei · 2026-06-04 04:00

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

arXiv:2606.04240v1 Announce Type: cross Abstract: Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multi…

arXiv cs.AI TIER_1 English(EN) · Alicja Dobrzeniecka, Filip Szatkowski, Sebastian Cygert, Szymon Lukasik, Bartlomiej Twardowski · 2026-06-01 04:00

Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

arXiv:2605.31229v1 Announce Type: cross Abstract: While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-inc…

arXiv cs.AI TIER_1 English(EN) · Bartlomiej Twardowski · 2026-05-29 12:32

Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-incremental learning (CIL), evaluating both standard …

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Seung-won Hwang · 2026-05-29 07:01

Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at …

arXiv cs.AI TIER_1 English(EN) · Joongmin Shin, Gyuho Shim, Jeongbae Park, Jaehyung Seo, Heuiseok Lim · 2026-05-29 04:00

HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering

arXiv:2605.29606v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) for document-based Open-domain Question Answering (ODQA) on large-scale industrial corpora faces two critical bottlenecks: routing failure in locating the correct document and evidence fragmentat…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Zhou Zhao · 2026-05-28 14:50

DocRetriever: A Plug-and-Play Framework and Comprehensive Benchmark for Multimodal Document Retrieval

Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent li…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Heuiseok Lim · 2026-05-28 08:42

HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering

Retrieval-augmented generation (RAG) for document-based Open-domain Question Answering (ODQA) on large-scale industrial corpora faces two critical bottlenecks: routing failure in locating the correct document and evidence fragmentation in integrating scattered information. Existi…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Yao Hu · 2026-05-28 03:11

UniNote: A Unified Embedding Model for Multimodal Representation and Ranking

Item-to-Item (I2I) retrieval is a fundamental part of modern content platforms, supporting critical industrial workflows from recommendation engines to content auditing. While multimodal embedding methods have advanced general retrieval, they often falter in I2I scenarios due to …

arXiv cs.CV TIER_1 English(EN) · Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie · 2026-06-02 04:00

PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning

arXiv:2507.08064v3 Announce Type: replace-cross Abstract: As multimedia content expands, the demand for unified multimodal retrieval (UMR) in real-world applications increases. Recent work leverages multimodal large language models (MLLMs) to tackle this task. However, their larg…

arXiv cs.CV TIER_1 English(EN) · Prasanna Sridhar, Horace Lee, David M. S. Pinto, Andrew Zisserman, Abhishek Dutta · 2026-06-02 04:00

WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata

arXiv:2602.12819v2 Announce Type: replace-cross Abstract: In this paper, we present WISE, an open-source audiovisual search engine which integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. W…

arXiv cs.CV TIER_1 English(EN) · Gyu-Hwung Cho (NAVER Corp., Republic of Korea, Seoul National University, Republic of Korea), Youngjune Lee (NAVER Corp., Republic of Korea), Kiyoon Jeong (NAVER Corp., Republic of Korea), Siyoung Lee (NAVER Corp., Republic of Korea), Sanggyu Han (NAVER … · 2026-06-01 04:00

Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

arXiv:2605.30917v1 Announce Type: cross Abstract: As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual docu…

arXiv cs.CV TIER_1 English(EN) · Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou · 2026-06-01 04:00

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

arXiv:2602.10809v2 Announce Type: replace Abstract: Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, …

arXiv cs.CV TIER_1 English(EN) · Ruofan Hu, Menghui Zhu, Jieming Zhu, Bo Chen, Shengyang Xu, Minjie Hong, Xiaoda Yang, Sashuai Zhou, Li Tang, Tao Jin, Zhou Zhao · 2026-05-29 04:00

DocRetriever: A Plug-and-Play Framework and Comprehensive Benchmark for Multimodal Document Retrieval

arXiv:2605.30027v1 Announce Type: new Abstract: Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve…
