A user ran an informal, non-scientific comparison of Claude Opus 4.6 and Claude Opus 4.8, using Codex 5.5 as the judge. Claude Opus 4.8 performed better overall at understanding the codebase and detecting risks, though it was slower and more verbose. Codex 5.5 also noted that, while Claude Opus 4.8 was the more thorough investigator, Codex's own output would have been more concise and efficient.
IMPACT Suggests an incremental improvement in codebase understanding and risk detection between versions, with trade-offs in verbosity and speed.
RANK_REASON User-conducted benchmark comparing two versions of a model.
AI-generated summary · Google Gemini · from 1 source.