The UK's AI Security Institute has found that current AI benchmarks often underestimate the true capabilities of AI agents. Its study showed that raising the compute budget, particularly the token limit, increases agents' success rates on tasks such as software engineering by up to 25%. This suggests that actual progress in AI development may be considerably faster than previously measured, with newer models showing the largest improvements.
IMPACT Current AI benchmarks may need revision to reflect AI agent capabilities accurately, which could show that AI development is moving faster than currently perceived.
RANK_REASON The cluster reports on findings from a research institute's study of AI benchmarks.
AI-generated summary · Google Gemini · from 2 sources.