SWE-bench Pro
PulseAugur coverage of SWE-bench Pro — every cluster mentioning SWE-bench Pro across labs, papers, and developer communities, ranked by signal.
-
Z.AI's GLM 5.1 model leads in long-horizon agentic tasks, outperforming rivals
Z.AI has released GLM 5.1, an open-source model designed for long-horizon agentic tasks and capable of running autonomously for up to 8 hours. This model reportedly outperforms GPT-5.4, Claude Opus 4.6, and Gemin…
-
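Poolside AI releases open-source Laguna XS.2 and M.1 coding models
Poolside AI has released two new agentic coding models, Laguna M.1 and Laguna XS.2, along with their agent training and runtime. Laguna M.1 is a large mixture-of-experts (MoE) model trained on 30T tokens on NVIDIA Hopper GPUs, while Laguna XS.2 is a smaller open-source model available under the Apache 2.0 license. The models are designed for long-horizon tasks and aim to enable more capable AI agents that can write and execute code.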
-
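Anthropic's 'Mythos' AI deemed too dangerous for public release
Anthropic has developed a new AI model called Claude Mythos, which shows significant gains in benchmark performance, particularly in identifying software vulnerabilities. Because of its advanced ability to find and exploit security vulnerabilities, Anthropic has chosen not to release Mythos publicly. Instead, the company is providing limited access to selected organizations through "Project Glasswing" to assist with cybersecurity research and vulnerability discovery, and is strongly supporting open-source security initiatives.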
-
OpenAI abandons SWE-bench Verified due to flawed tests and data contamination
OpenAI has announced it will no longer use SWE-bench Verified to evaluate the coding capabilities of frontier AI models. The benchmark has become contaminated, with models showing improved scores primarily due to exposu…