Brief · PulseAugur

RESEARCH · Hugging Face Daily Papers English(EN) · 1w · [2 sources]

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Researchers have introduced ResearchClawBench, a new benchmark designed to evaluate the end-to-end autonomous research capabilities of AI agents. The benchmark comprises 40 tasks across 10 scientific domains, each based on real published papers. Current AI systems, including agents and large language models, show significant limitations in reliably re-discovering scientific findings, with the strongest systems achieving scores far below full re-discovery.

IMPACT: Highlights current limitations in AI's ability to perform autonomous scientific research, indicating a need for further development in reasoning and evidence synthesis.

Claude-Opus-4.7
Claude Code
ResearchClawBench