Anthropic's latest models, Opus 4.8 and Opus 4.7, were compared across ten tests. Both performed strongly, and Opus 4.8 showed a notable improvement on complex legal queries. However, Opus 4.8 also failed completely on certain legal questions, which points to areas for further development.
IMPACT Highlights potential improvements and limitations in LLM reasoning, particularly for specialized domains like legal applications.
RANK_REASON The cluster compares two versions of a model, detailing performance across various tests, which falls under research and development analysis.
Source: Mastodon (fosstodon.org)
AI-generated summary · Google Gemini · from 1 source.