UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

UK AI Security Institute: Benchmarks Underestimate AI Agent Capabilities

By the PulseAugur editorial team · [2 sources] · 2026-07-03 16:14

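The UK's AI Security Institute has found that current AI benchmarks often underestimate the real capabilities of AI agents. Its research shows that raising the compute budget, in particular the token limit, can markedly improve agents' success rates on tasks such as software engineering, by as much as 25%. This suggests that actual progress in AI may be considerably faster than earlier measurements indicated, with newer models showing the largest improvements.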

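Impact: Current AI benchmarks may need to be revised to reflect AI agents' capabilities accurately, which could speed up the perceived pace of AI development.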

Ranking rationale: This cluster reports a research institute's findings on AI benchmarking.

AI-generated summary · Google Gemini · based on 2 sources.

Sources [2]

The Decoder TIER_1 English(EN) · Matthias Bastian · 2026-07-03 16:14

UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

In a study covering seven benchmarks, the U…
Mastodon — mastodon.social TIER_1 Deutsch(DE) · aisyndicate · 2026-07-03 18:11

The British AI Security Institute proves: Standard benchmarks underestimate agents because they throttle the compute budget. Success rates increase for SWE tasks

The UK's AI Security Institute shows that standard benchmarks underestimate agents because they throttle the compute budget. On SWE tasks the success rate rises by 25 percent; anyone who evaluates agents only under budget constraints overlooks real capabilities. https://the-decoder.de/bri…

Link: the-decoder.de/britisches-ki-sicherheitsi…