One AI Agent Scored 16.9% Then 92.8% on the Same Task — the Fix Wasn't a Smarter Model

Anthropic's VirBench benchmark shows deterministic tools can improve AI agent accuracy

By the PulseAugur editorial team · [1 source] · 2026-07-03 11:33

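VirBench, a new benchmark developed by Anthropic, reveals significant inconsistency in AI agent performance even when the model and prompt are identical. The benchmark shows that an agent can produce drastically different outputs on the same task: Claude Sonnet 4 scored as low as 16.9% on a task where it later reached 92.8%. The key finding is that the fix was not a more advanced model but a simple, deterministic Python tool. With the tool integrated, Claude Sonnet 4's accuracy jumped to 92.8% and GPT-5.5 reached 99.7%, effectively eliminating the variability.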
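To make the idea of a "deterministic tool" concrete, here is a minimal Python sketch. It is not the tool from the article, which the summary does not describe. The `Record` type, the `filter_sequences` function, and the toy data are all hypothetical, invented for illustration. The point is only that a filter implemented as plain code returns the same answer on every call, whereas an agent filtering records in-context may not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical record type; the source does not describe VirBench's data."""
    seq_id: str
    host: str
    length: int


def filter_sequences(records, host, min_length):
    """Deterministic tool: identical inputs always yield the same sorted ID list."""
    matches = [r.seq_id for r in records if r.host == host and r.length >= min_length]
    return sorted(matches)


# Toy data, invented for illustration only.
records = [
    Record("s3", "human", 1200),
    Record("s1", "human", 900),
    Record("s2", "bat", 1500),
    Record("s4", "human", 1500),
]

# Calling the tool three times returns three identical answers.
runs = [filter_sequences(records, host="human", min_length=1000) for _ in range(3)]
print(runs[0])                          # ['s3', 's4']
print(len({tuple(r) for r in runs}))    # 1 distinct answer across 3 runs
```

In an agent setup, a function like this would be exposed as a callable tool so the model delegates the filtering step instead of reproducing it token by token.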

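Impact: Underscores the critical role of deterministic tools in agent engineering, suggesting that the focus for improving AI performance has shifted from model scale to system architecture.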

Ranking rationale: This item discusses a new benchmark (VirBench) and its findings on AI agent performance, a research-oriented topic.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

Towards AI · Chew Loong Nian - AI ENGINEER · 2026-07-03 11:33

One AI Agent Scored 16.9% Then 92.8% on the Same Task — the Fix Wasn't a Smarter Model

I gave a frontier agent the exact same question three times and got three different answers: 266 sequences it should have returned, then…

https://pub.towardsai.net/one-ai-a…
