PulseAugur

New frameworks advance multimodal retrieval for documents and images

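Researchers have introduced several new frameworks and benchmarks for multimodal retrieval tasks. Dynamic Adapter Routing (DAR) addresses continual multimodal retrieval by selecting adapters through prototype-based routing. V-SPLADE is an inference-free sparse retriever for visual documents that improves lexical grounding through caption-gated token supervision. HiKEY proposes a hierarchical retrieval framework for document question answering that uses document structure to improve routing and evidence integration. In addition, DeepImageSearch treats image retrieval as an autonomous exploration task over visual histories and introduces a new benchmark (DISBench) to evaluate agentic reasoning.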

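Impact: These advances offer better ways to search and understand complex multimodal data, and could accelerate research and application development in areas such as document analysis and visual question answering.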

Ranking rationale: Multiple research papers introduce new methods and benchmarks for multimodal retrieval tasks.

Read on arXiv cs.IR (Information Retrieval) →

AI-generated summary · Google Gemini · from 15 sources. How we write summaries →

Sources [15]

  1. arXiv cs.CL TIER_1 English(EN) · Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales ·

    To Be Multimodal or Not: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

    arXiv:2606.05931v1 Announce Type: new Abstract: When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusin…

  2. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Mark Gales ·

    To Be Multimodal or Not: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

    When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, …

  3. arXiv cs.AI TIER_1 English(EN) · Jingbiao Mei ·

    Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

    arXiv:2606.04240v1 Announce Type: cross Abstract: Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The \emph{Multi…

  4. arXiv cs.AI TIER_1 English(EN) · Alicja Dobrzeniecka, Filip Szatkowski, Sebastian Cygert, Szymon Lukasik, Bartlomiej Twardowski ·

    Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

    arXiv:2605.31229v1 Announce Type: cross Abstract: While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-inc…

  5. arXiv cs.AI TIER_1 English(EN) · Bartlomiej Twardowski ·

    Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

    While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-incremental learning (CIL), evaluating both standard …

  6. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Seung-won Hwang ·

    Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

    As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at …

  7. arXiv cs.AI TIER_1 English(EN) · Joongmin Shin, Gyuho Shim, Jeongbae Park, Jaehyung Seo, Heuiseok Lim ·

    HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering

    arXiv:2605.29606v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) for document-based Open-domain Question Answering (ODQA) on large-scale industrial corpora faces two critical bottlenecks: routing failure in locating the correct document and evidence fragmentat…

  8. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Zhou Zhao ·

    DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

    Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent li…

  9. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Heuiseok Lim ·

    HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering

    Retrieval-augmented generation (RAG) for document-based Open-domain Question Answering (ODQA) on large-scale industrial corpora faces two critical bottlenecks: routing failure in locating the correct document and evidence fragmentation in integrating scattered information. Existi…

  10. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Yao Hu ·

    UniNote: A Unified Embedding Model for Multimodal Representation and Ranking

    Item-to-Item (I2I) retrieval is a fundamental part of modern content platforms, supporting critical industrial workflows from recommendation engines to content auditing. While multimodal embedding methods have advanced general retrieval, they often falter in I2I scenarios due to …

  11. arXiv cs.CV TIER_1 English(EN) · Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie ·

    PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning

    arXiv:2507.08064v3 Announce Type: replace-cross Abstract: As multimedia content expands, the demand for unified multimodal retrieval (UMR) in real-world applications increases. Recent work leverages multimodal large language models (MLLMs) to tackle this task. However, their larg…

  12. arXiv cs.CV TIER_1 English(EN) · Prasanna Sridhar, Horace Lee, David M. S. Pinto, Andrew Zisserman, Abhishek Dutta ·

    WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata

    arXiv:2602.12819v2 Announce Type: replace-cross Abstract: In this paper, we present WISE, an open-source audiovisual search engine which integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. W…

  13. arXiv cs.CV TIER_1 English(EN) · Gyu-Hwung Cho (NAVER Corp., Republic of Korea, Seoul National University, Republic of Korea), Youngjune Lee (NAVER Corp., Republic of Korea), Kiyoon Jeong (NAVER Corp., Republic of Korea), Siyoung Lee (NAVER Corp., Republic of Korea), Sanggyu Han (NAVER … ·

    Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

    arXiv:2605.30917v1 Announce Type: cross Abstract: As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual docu…

  14. arXiv cs.CV TIER_1 English(EN) · Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou ·

    DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

    arXiv:2602.10809v2 Announce Type: replace Abstract: Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, …

  15. arXiv cs.CV TIER_1 English(EN) · Ruofan Hu, Menghui Zhu, Jieming Zhu, Bo Chen, Shengyang Xu, Minjie Hong, Xiaoda Yang, Sashuai Zhou, Li Tang, Tao Jin, Zhou Zhao ·

    DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

    arXiv:2605.30027v1 Announce Type: new Abstract: Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve…