DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
Researchers have introduced DeepWeb-Bench, a new benchmark designed to evaluate the deep research capabilities of frontier language models. This benchmark is significantly more challenging than existing ones, requiring extensive evidence collection, cross-source reconciliation, and multi-step derivation. Initial evaluations on nine frontier models revealed that derivation and calibration failures, rather than retrieval issues, constitute the primary bottleneck, accounting for over 70% of errors.
AI IMPACT This benchmark will push frontier models to improve complex reasoning and evidence synthesis, moving beyond simple retrieval tasks.