CORE-Bench
PulseAugur coverage of CORE-Bench: every cluster mentioning CORE-Bench across labs, papers, and developer communities, ranked by signal.
4 days with sentiment data
-
New benchmark approach evaluates AI agents beyond accuracy
A new research paper proposes moving beyond accuracy-centric evaluation for AI agents, even when benchmarks saturate. The study uses CORE-Bench Hard, a computational reproducibility benchmark, to demonstrate the value o…
-
New benchmark CORE-Bench tests AI agents' scientific reproducibility
Researchers have introduced CORE-Bench, a new benchmark designed to evaluate the ability of AI agents to perform computational reproducibility tasks. This benchmark comprises 270 tasks derived from 90 scientific papers …
-
Anthropic AI engineers ship code 8x faster with recursive self-improvement
Anthropic has released data indicating significant advancements in AI development, with their engineers now shipping code eight times faster than in a previous baseline period. The company's AI models, like Claude, are …
-
Anthropic details AI's growing role in its own development
Anthropic has published research indicating that AI systems are increasingly contributing to their own development, a trend they term "recursive self-improvement." This process, where AI assists in designing and develop…
-
AI agents struggle to reproduce research, new benchmarks reveal
Researchers have developed AutoReproduce, a multi-agent framework designed to automatically reproduce AI experiments from research papers. This system utilizes a "paper lineage" to mine implicit knowledge from cited lit…