PulseAugur
DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

New DeepWeb-Bench tests how well frontier AI models carry out complex research tasks

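Researchers have introduced DeepWeb-Bench, a new benchmark for evaluating the deep research capabilities of frontier language models. It is more challenging than existing benchmarks, requiring extensive evidence collection, cross-source reconciliation, and multi-step reasoning. An initial evaluation of nine frontier models shows that reasoning and calibration failures, rather than retrieval problems, are the main bottleneck, accounting for more than 70% of all errors.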

Impact: The benchmark will push frontier models to improve at complex reasoning and evidence synthesis, beyond simple retrieval tasks.

Ranking rationale: This cluster describes a new academic benchmark paper posted on arXiv.


AI-generated summary · Google Gemini · based on 2 sources.

Sources [2]

  1. arXiv cs.AI TIER_1 English(EN) · Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong, Mugeng Liu, Chongyang Pan, Peilun Jia, Baoqing Sun, Xiang Jing, Yun Ma ·

    DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

    arXiv:2605.21482v1 Announce Type: new Abstract: Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existin…

  2. arXiv cs.AI TIER_1 English(EN) · Yun Ma ·

    DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

    Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish…