Frontier AI models fail new IT benchmark, scoring below 50%

By PulseAugur Editorial · [1 source] · 2026-05-27 17:20

ITBench-AA, a new benchmark from Artificial Analysis and IBM, evaluates frontier AI models on enterprise IT tasks, with a focus on Site Reliability Engineering (SRE). In initial tests, even the most advanced models, including Claude Opus 4.7 and GPT-5.5, scored below 50% on diagnosing Kubernetes incidents. The results show that models struggle with root-cause analysis and that longer investigation trajectories do not necessarily improve accuracy: some models over-investigate and report false positives.

IMPACT The results highlight significant limitations of current frontier models in complex, real-world enterprise IT operations and point to a need for better reasoning and diagnostic capabilities.

RANK_REASON The cluster describes the release of a new benchmark for evaluating AI models on specific tasks, which falls under research.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Hugging Face Blog TIER_1 English(EN) · 2026-05-27 17:20

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
