PulseAugur

Anthropic Releases Claude Opus 4.8 with Effort Controls and Improved Coding Capabilities

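Anthropic has released Claude Opus 4.8, which adds enhanced effort controls, dynamic workflows, and improved honesty on coding tasks. The new model posts significant gains on benchmarks such as SWE-bench Pro and GraphWalks, and also offers a faster, cheaper mode. The release aims to address common failure modes in AI coding agents, such as constraint violations and overconfidence, through more powerful configuration and alignment.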

Impact: Sets a new state of the art (SOTA) on coding benchmarks and improves agent reliability, which may accelerate adoption of advanced AI coding assistants.

Ranking rationale: New flagship model release from a frontier lab (Anthropic).

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 9 sources.

Sources [9]

  1. arXiv cs.CL TIER_1 English(EN) · Dongping Chen, Xuanao Huang, Zhihan Hu, Qingyuan Shi, Dianqi Li, Tianyi Zhou ·

    Sandboxed Coding Agents Are Competitive Omnimodal Task Solvers

    arXiv:2606.00579v1 Announce Type: new Abstract: As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-us…

  2. arXiv cs.CL TIER_1 English(EN) · Dadi Guo, Yuejin Xie, Qingyu Liu, Weixian Huang, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Jianjie Feng, Wenze Su, Yujiu Yang, Dongrui Liu, Yi R. Fung ·

    Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

    arXiv:2603.03202v3 Announce Type: replace Abstract: As large language models (LLMs) advance their mathematical capabilities toward the IMO and research level, the scarcity of challenging, high-quality problems has become a significant bottleneck for training, evaluation and self-…

  3. arXiv cs.AI TIER_1 English(EN) · Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He ·

    Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents

    arXiv:2604.11088v2 Announce Type: replace Abstract: Random rules improve a coding agent's task performance as much as expert-curated ones (both $+13.8$pp on a discriminative subset of SWE-bench Verified), and in our data every individually beneficial rule is a negative constraint…

  4. arXiv cs.AI TIER_1 English(EN) · Ningzhi Tang, Chaoran Chen, Gelei Xu, Yiyu Shi, Yu Huang, Collin McMillan, Tao Dong, Toby Jia-Jun Li ·

    How Coding Agents Fail Their Users: A Large-Scale Analysis of 20,574 Real-World Developer-Agent Misalignments

    arXiv:2605.29442v1 Announce Type: cross Abstract: AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational…

  5. Hugging Face Daily Papers TIER_1 English(EN) ·

    How Coding Agents Fail Their Users: A Large-Scale Analysis of 20,574 Real-World Developer-Agent Misalignments

    AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational study of 20,574 coding-agent sessions from 1,639 …

  6. arXiv cs.AI TIER_1 English(EN) · Myeongsoo Kim, Dingmin Wang, Siwei Cui, Farima Farmahinifarahani, Terry Yue Zhuo, Shweta Garg, Baishakhi Ray, Rajdeep Mukherjee, Varun Kumar ·

    Coherence Collapse: Diagnosing Why Code Agents Fail After Generating Correct Code

    arXiv:2603.24631v2 Announce Type: replace-cross Abstract: Code agents resolve 65-70% of SWE-bench Verified issues, but Pass@1 cannot tell us why the rest fail, and, as we show, capable-model failures are systematically misdiagnosed without trajectory data. We introduce TRAJEVAL, …

  7. HN — claude cli stories TIER_1 English(EN) · pbjerkeseth ·

    Show HN: Ouijit, an open-source task and terminal manager for coding agents

  8. dev.to — Anthropic tag TIER_1 English(EN) · Arindam Majumder ·

    Claude Opus 4.8: Effort Control, Dynamic Workflows, and a Coding Agent That Is Honest by Default

    The frontier model race has been moving in fits and starts. OpenAI shipped GPT-5.5 and a new Codex line. Google pushed Gemini 3.1 Pro and a faster Gemini Flash. xAI keeps iterating on Grok. And now Anthropic has shipped Claude Opus 4.8, only 41 days after Opus…

  9. r/ClaudeAI TIER_2 English(EN) · /u/StravuKarl ·

    Building a Harness for Our Coding Agents: Eight Failure Modes, Eight Pillars

    We ended up building two products: the software we ship, and the system/harness around our agents that makes them useful in building the thing we ship. A harness is the durable layer around a model: instructions, tools, permissions, contex…