Brief · PulseAugur

TOOL · r/cursor English(EN) · 6h

Unhinged results from UC Berkeley's new ALE benchmark of 55 different industries

A new benchmark from UC Berkeley, the ALE benchmark, reveals large cost and runtime disparities among AI models across 55 industries. It shows that custom harnesses can outperform commercial offerings such as Codex, and that Anthropic's Claude Opus 4.8 is significantly slower and more expensive than previous versions for similar results. The findings point to a highly variable, unoptimized AI market in which users need to benchmark directly to find the most cost-effective and efficient model for their specific workloads.

IMPACT Highlights extreme cost and runtime inefficiencies in current AI models; users need to run their own benchmarks to find the best fit for their workloads.

Gemini 3.1 Pro
Codex
Claude Code
Opus 4.7
Mimo v2.5
University of California, Berkeley
Grok 4.3
GPT 5.5 High
Cursor CLI
Composer 2.5
Qwen 3.7 Max
Opus 4.8
ALE benchmark