A new benchmark from UC Berkeley, the ALE benchmark, has revealed significant cost and runtime disparities among AI models across 55 industries. The benchmark shows that custom harnesses can outperform commercial products such as Codex, and that models such as Anthropic's Claude Opus 4.8 are significantly slower and more expensive than previous versions for similar results. The findings point to a highly variable, unoptimized AI market in which users need to benchmark directly to find the most cost-effective and efficient models for their specific workloads.
AI IMPACT Highlights extreme cost and runtime inefficiencies in current AI models, necessitating user-driven benchmarking for optimal workload performance.
RANK_REASON The cluster reports on the results of a new academic benchmark evaluating AI models across various industries.
- ALE benchmark
- Claude Code
- Codex
- Composer 2.5
- Cursor CLI
- Gemini 3.1 Pro
- GPT 5.5 High
- Grok 4.3
- Mimo v2.5
- Opus 4.7
- Opus 4.8
- Qwen 3.7 Max
- University of California, Berkeley
AI-generated summary · Google Gemini · from 1 source.