SWE Bench Pro

PulseAugur coverage of SWE Bench Pro — every cluster mentioning SWE Bench Pro across labs, papers, and developer communities, ranked by signal.

Total · 30 days

66 in 90 days

Releases · 30 days

0 in 90 days

Papers · 30 days

19 in 90 days

Tier distribution · 90 days

frontier release 9
significant 7
research 28
tool 19
commentary 3

Topics

Relationships

Timeline

2026-07-08 research_milestone OpenAI audited the SWE-Bench Pro coding benchmark and found it unreliable. Source
2026-07-08 research_milestone OpenAI audited SWE-Bench Pro and found it unreliable for measuring frontier coding capability. Source

Sentiment · 30 days

19 days with sentiment data

LAB BRAIN

hypothesis resolved confirmed confidence 0.70

Anthropic's focus on 'abstention' in Opus 4.8 will drive adoption for critical coding tasks

Opus 4.8's improved ability to abstain from answering when uncertain, rather than providing incorrect information, is a critical feature for complex coding tasks. This trait, highlighted in recent evidence, could lead to increased adoption of Claude Opus for high-stakes software development where accuracy and reliability are paramount.

observation resolved confirmed confidence 0.85

SWE-Bench Pro scores are rapidly increasing, with multiple models surpassing 50%

Recent evidence shows MiniMax's M3 model achieving 59% and Microsoft's MAI-Code-1-Flash achieving 51% on SWE-Bench Pro. This indicates a significant upward trend in AI coding benchmark performance, with several models now breaking the 50% barrier.

hypothesis resolved confirmed confidence 0.65

MiniMax M3 may become a leading open-source alternative for coding tasks

MiniMax's M3 model has demonstrated strong performance on SWE-Bench Pro (59%) and Terminal Bench 2 (66%), coupled with a 1M token context window. If its accessibility and performance remain competitive, it could emerge as a preferred open-source option for developers seeking advanced coding assistance, potentially challenging proprietary models.

Recent · page 1 of 4 · 66 items

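China's GLM-5.2 model outperforms GPT-5.5 on coding benchmarks at lower cost · tracking 1 source

Grok 4.5 claims to be 4.2x more efficient than Claude Opus 4.8, verified by a third party

OpenAI finds 30% of a popular AI coding benchmark is flawed

OpenAI withdraws its SWE-Bench Pro recommendation over a 30% task failure rate

SpaceXAI and Cursor partner to launch Grok 4.5, targeting complex tasks and better token efficiency · tracking 2 sources

OpenAI stops recommending SWE-Bench Pro

OpenAI finds the popular AI coding benchmark SWE-Bench Pro unreliable

xAI releases Grok 4.5, challenging GPT-5.5 and Claude at lower cost · tracking 10 sources

OpenAI flags reliability problems with a popular AI coding benchmark

Tencent's Hy3 model performs strongly on real-world tasks

Anthropic's Claude Sonnet 5 becomes the default model, outperforming Opus on key coding tasks

Tencent releases Hy3, an open 295B MoE model with 256K context

AI benchmark charts: how to spot saturation and contamination

Dockerless verifies AI coding agent patches without running tests

Fable 5's advanced AI capabilities integrated into a new persistent agent

SWE-Doctor agent uses runtime diagnostics to improve LLM patch generation

Anthropic releases Claude Sonnet 5, offering near-Opus performance at lower cost · tracking 5 sources

Zhipu AI releases GLM-5.2 with a 1M-token context window, challenging top closed-source models

Study finds coding AI benchmark scores are inflated by "reward hacking"

Sakana AI model surpasses Claude Opus and GPT-5.5 on SWE-Bench Pro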