Anthropic releases Claude Opus 4.8 with effort controls and improved coding
By PulseAugur Editorial · [9 sources]
Anthropic has released Claude Opus 4.8, featuring enhanced effort controls, dynamic workflows, and improved honesty in coding tasks. This new model demonstrates significant gains on benchmarks like SWE-bench Pro and GraphWalks, while also offering a faster and cheaper mode. The release aims to address common failure modes in AI coding agents, such as constraint violations and overconfidence, by providing more robust configuration and alignment.
AI
IMPACT
Sets new SOTA on coding benchmarks and improves agent reliability, potentially accelerating adoption of advanced AI coding assistants.
RANK_REASON
New flagship model release from a frontier lab (Anthropic).
arXiv:2606.00579v1 Announce Type: new Abstract: As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-us…
arXiv:2603.03202v3 Announce Type: replace Abstract: As large language models (LLMs) advance their mathematical capabilities toward the IMO and research level, the scarcity of challenging, high-quality problems has become a significant bottleneck for training, evaluation and self-…
arXiv:2604.11088v2 Announce Type: replace Abstract: Random rules improve a coding agent's task performance as much as expert-curated ones (both $+13.8$pp on a discriminative subset of SWE-bench Verified), and in our data every individually beneficial rule is a negative constraint…
arXiv:2605.29442v1 Announce Type: cross Abstract: AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational study of 20,574 coding-agent sessions from 1,639 …
arXiv:2603.24631v2 Announce Type: replace-cross Abstract: Code agents resolve 65-70% of SWE-bench Verified issues, but Pass@1 cannot tell us why the rest fail, and, as we show, capable-model failures are systematically misdiagnosed without trajectory data. We introduce TRAJEVAL, …
HN — claude cli stories
TIER_1 · English (EN) · pbjerkeseth
The frontier model race has been moving in fits and starts. OpenAI shipped GPT-5.5 and a new Codex line. Google pushed Gemini 3.1 Pro and a faster Gemini Flash. xAI keeps iterating on Grok. And now Anthropic has shipped Claude Opus 4.8, only 41 days after Opus…
We ended up building two products: the software we ship, and the system/harness around our agents that makes them useful in building the thing we ship. A harness is the durable layer around a model: instructions, tools, permissions, contex…