SoftMatcha 2 brings trillion-token search to under 0.3 seconds

By the PulseAugur editorial team · [1 source] · 2026-06-11 04:00

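Researchers have developed SoftMatcha 2, a new algorithm for fast, semantically flexible pattern matching over very large text datasets. The system can search a trillion-token corpus in under 0.3 seconds while tolerating variations on the query in the form of substitutions, insertions, and deletions. It achieves this efficiency through dynamic, corpus-aware pruning and a disk-oriented design. According to the paper, it outperforms existing methods on large corpora and proves useful for identifying benchmark contamination and for improving information retrieval.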
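To make "tolerating substitutions, insertions, and deletions" concrete, here is a minimal sketch of soft matching over tokens. It is not the SoftMatcha 2 algorithm: it is a brute-force approximate-substring scan (a Sellers-style edit-distance recurrence) with none of the pruning or disk-aware indexing the paper relies on. The function names `sub_cost` and `soft_find`, the toy `SIMILAR` table standing in for semantic similarity, and all cost values are assumptions made for illustration.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical stand-in for semantic similarity: pairs treated as near-synonyms.
SIMILAR: Set[FrozenSet[str]] = {
    frozenset(("quick", "fast")),
    frozenset(("jumps", "leaps")),
}


def sub_cost(a: str, b: str) -> float:
    """Cost of aligning query token a with corpus token b (assumed values)."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in SIMILAR:
        return 0.5  # cheap substitution for a near-synonym
    return 1.0      # full-price substitution


def soft_find(
    query: List[str],
    corpus: List[str],
    max_cost: float = 1.0,
    indel_cost: float = 1.0,
) -> List[Tuple[int, float]]:
    """Return (end_index, cost) for each corpus position where some span
    ending there aligns with the whole query at cost <= max_cost."""
    m = len(query)
    # prev[i]: best cost of aligning query[:i] with a span ending just before
    # the current corpus token. Initially the span is empty.
    prev = [i * indel_cost for i in range(m + 1)]
    hits: List[Tuple[int, float]] = []
    for j, tok in enumerate(corpus):
        cur = [0.0] * (m + 1)  # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i - 1] + sub_cost(query[i - 1], tok),  # match or substitution
                prev[i] + indel_cost,      # extra corpus token (insertion)
                cur[i - 1] + indel_cost,   # query token absent (deletion)
            )
        if cur[m] <= max_cost:
            hits.append((j, cur[m]))
        prev = cur
    return hits


if __name__ == "__main__":
    corpus = "the fast brown fox leaps over the lazy dog".split()
    print(soft_find("quick brown fox".split(), corpus, max_cost=0.5))
    # → [(3, 0.5)]
```

This scan costs time proportional to query length times corpus length, which is exactly what becomes infeasible at trillion-token scale and why an indexed, pruned approach such as the one summarized above is needed.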

Impact: The algorithm could significantly speed up data processing and analysis for large language models and other AI applications.

Ranking rationale: This is a research paper that details a new algorithm and its empirical evaluation.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv stat.ML TIER_1 English(EN) · Masataka Yoneda, Yusuke Matsushita, Go Kamoda, Kohei Suenaga, Takuya Akiba, Masaki Waga, Sho Yokoi · 2026-06-11 04:00

SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

arXiv:2602.10908v2 Announce Type: replace-cross Abstract: We present SoftMatcha 2, an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while allowing semantic variations in the form of substitution, ins…
