BM25
PulseAugur coverage of BM25 — every cluster mentioning BM25 across labs, papers, and developer communities, ranked by signal.
-
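AI system automates contract review with OCR, RAG, and LangGraph
This article details how to build an AI-powered contract intelligence system that automates the extraction of key clauses from a variety of document formats. The system combines optical character recognition (OCR) via PaddleOCR, hybrid retrieval methods such as FAISS and BM25, and a GPT-4o model inside a LangGraph pipeline. The approach aims to turn unstructured contract data into structured reports, addressing problems such as omissions, financial loss, and compliance risk.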
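For readers unfamiliar with the lexical half of such a hybrid retriever, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the article's implementation: the function name `bm25_scores`, the toy contract snippets, the whitespace tokenization, and the parameter defaults `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical contract snippets, tokenized by whitespace.
docs = [
    "termination clause notice period thirty days".split(),
    "payment terms net sixty days invoice".split(),
    "governing law and jurisdiction".split(),
]
scores = bm25_scores("termination notice".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the first snippet shares terms with the query, so it is the only one with a non-zero score. In a hybrid setup, a ranking like this would be combined with a vector-similarity ranking from an index such as FAISS.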
-
Study benchmarks 22 models on patent data tasks
A new study evaluated 22 different models, ranging from small encoders to large instruction-tuned LLMs, on their ability to process patent data for tasks like retrieval, classification, and clustering. The research foun…
-
RAG pipeline struggles with citations, developer proposes fix
A developer detailed a sophisticated Parent-Child RAG pipeline on GitHub, which, despite its advanced components like hybrid vector stores and LangGraph, suffered from inaccurate citations and hallucinations. The core i…
-
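AI embeddings explained: from meaning to vectors and RAG
Embeddings are a core concept in AI: they convert text and other data into numerical representations that capture meaning. These numerical vectors let AI models understand relationships between words and concepts, which enables capabilities such as semantic search and retrieval-augmented generation (RAG). Although vector databases such as Pinecone, Weaviate, and Chroma are commonly used to store and query these embeddings, alternatives such as BM25 retrieval in tools like Meilisearch can also be effective for specific use cases, offering simpler operation and lower cost.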
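As a rough illustration of how vectors can capture relationships between concepts, the sketch below ranks items by cosine similarity. The three-dimensional vectors are invented for the example and do not come from any real embedding model, which would typically emit hundreds of dimensions or more.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", hand-written for illustration only.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

query = vectors["cat"]
# Rank every other item by similarity to the query vector, most similar first.
ranked = sorted(
    (name for name in vectors if name != "cat"),
    key=lambda name: cosine_similarity(query, vectors[name]),
    reverse=True,
)
```

With these made-up values, "kitten" ranks above "invoice" because its vector points in nearly the same direction as the query. This is the basic operation a vector database performs at much larger scale.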
-
New tools Veles and Agent-native Git enhance AI coding workflows
Veles is a new open-source MCP server written in Rust that combines BM25 keyword search with semantic vector search. This hybrid approach aims to provide AI coding assistants like Claude and Cursor with more accurate co…
-
BM25 code retrieval improved with adaptive q-log odds
Researchers have developed a new method called adaptive q-log odds to improve the performance of BM25, a popular search algorithm, specifically for code retrieval tasks. This technique modifies the underlying mathematic…
-
KernelMind project details code retrieval improvements and evaluation methods
The KernelMind project is detailing its development process, focusing on improving its code retrieval and evaluation capabilities. Early versions struggled with subjective evaluation, prompting the creation of a benchma…
-
DocNest tool preserves PDF structure for better RAG performance
A developer has created DocNest, a tool designed to improve Retrieval-Augmented Generation (RAG) systems by focusing on document ingestion rather than just retrieval. DocNest preserves the structure of documents, includ…
-
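Ukrainian court citations show significant decay in co-citation predictability
Researchers have developed a new benchmark, UA-StatuteRetrieval, to evaluate how stable co-citation predictability remains over time in legal information systems. Analyzing 396 million Ukrainian court citations from 2007 to 2026, they found a significant decay in retrieval performance, with predictability dropping by as much as 47%. High-frequency articles and criminal procedure remained stable, but mid-frequency articles and civil law showed clear degradation, partly attributable to the 2017 judicial reform and a 4.3% semantic shift in article citation patterns.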
-
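Qwen 2.5-powered multi-turn retrieval system lands on SemEval leaderboard
Researchers have developed a three-stage retrieval system for multi-turn conversations that improves accuracy on information retrieval tasks. The system first uses a fine-tuned Qwen 2.5 7B model to rewrite context-dependent queries into standalone questions. It then runs a hybrid search that combines BM25 with dense vector retrieval, merged via Reciprocal Rank Fusion, and finally a cross-encoder model reranks the results to improve precision. The approach achieved a notable nDCG@5 score in a recent SemEval task, outperforming many other systems.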
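The fusion step can be sketched as follows. This is a generic Reciprocal Rank Fusion example, not the researchers' code: the document IDs and both input rankings are invented, and the constant `k=60` is a commonly used default that the summary does not confirm.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs of the two retrievers, best match first.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# fused == ["d1", "d3", "d4", "d2"]
```

Documents that appear in both lists ("d1" and "d3") rise to the top, which is why rank-based fusion is a common choice when BM25 and dense scores are not on comparable scales. A cross-encoder would then rerank the fused list.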
-
Agentic RAG empowers LLMs to retrieve information on demand
Agentic Retrieval-Augmented Generation (RAG) offers a more advanced approach to information retrieval than static RAG, which struggles with complex or time-sensitive queries. Agentic RAG empowers LLMs to decide when and…
-
New bilingual dataset and RAG system improve geospatial question answering
Researchers have developed a new bilingual dataset and a hybrid retrieval-augmented generation (RAG) system for answering geospatial questions about Tatarstan. The system integrates semantic search with geospatial filte…
-
New benchmark LIMIT+ reveals neural retrievers struggle with complex set-compositional queries
A new study published on arXiv investigates the performance of information retrieval systems when faced with complex, set-compositional queries. Researchers found that while neural retrieval methods significantly outper…
-
New framework benchmarks enterprise AI document processing pipelines
Researchers have developed EnterpriseDocBench, a new framework for evaluating the end-to-end performance of enterprise AI document processing pipelines. The framework assesses parsing fidelity, indexing efficiency, retr…
-
New RAG research tackles tabular data, cost, and cross-lingual knowledge
Several recent research papers explore advancements in Retrieval-Augmented Generation (RAG) systems. One paper introduces Orthogonal Subspace Decomposition (OSD) to separate task-specific behavior from document knowledg…
-
PostgreSQL extension adds BM25 relevance-ranked full-text search
A new open-source PostgreSQL extension, pg_textsearch, has been released, offering advanced BM25 relevance-ranked full-text search capabilities. This extension integrates seamlessly with PostgreSQL's existing text searc…
-
Apple Intelligence debuts, enhancing devices with generative AI and RAG
Apple has unveiled "Apple Intelligence," a new personal intelligence system integrating generative models into its devices like the iPhone, iPad, and Mac. This announcement was a key topic at WWDC 2024, highlighting App…
-
Eugene Yan explains how to bootstrap labels for search relevance
Eugene Yan's blog post addresses a reader's question about bootstrapping labels for semantic search systems without relying on expensive human annotators. Yan suggests starting with traditional lexical search methods li…