Brief · PulseAugur

TOOL · Hugging Face Blog · English (EN) · 3h

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

A new benchmark, ITBench-AA, evaluates frontier AI models on enterprise IT tasks, focusing on Site Reliability Engineering (SRE). In initial tests, even the most advanced models, including Claude Opus 4.7 and GPT-5.5, scored below 50% on diagnosing Kubernetes incidents. The results show that models struggle with root-cause analysis and that longer investigation trajectories do not necessarily yield higher accuracy: some models over-investigate and report false positives.

IMPACT: Highlights significant limitations in current frontier models for complex, real-world enterprise IT operations, suggesting a need for improved reasoning and diagnostic capabilities.

GPT-5.5
Gemini 3.1 Pro Preview
Claude Opus 4.7
GLM-5.1
DeepSeek V4 Pro
Artificial Analysis
Kubernetes
Gemma 4 31B
IBM Research
Gemini 3.5 Flash
Qwen3.7 Max
ITBench-AA