PulseAugur

SWE Bench Pro

PulseAugur coverage of SWE Bench Pro — every cluster mentioning SWE Bench Pro across labs, papers, and developer communities, ranked by signal.

Total · 30d: 48 (48 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 13 (13 over 90d)
SENTIMENT · 30D: 18 days with sentiment data

LAB BRAIN
Hypothesis · resolved: confirmed · confidence 0.70

Anthropic's focus on 'abstention' in Opus 4.8 will drive adoption for critical coding tasks

Opus 4.8's improved ability to abstain from answering when uncertain, rather than providing incorrect information, is a critical feature for complex coding tasks. This trait, highlighted in recent evidence, could lead to increased adoption of Claude Opus for high-stakes software development where accuracy and reliability are paramount.

Observation · resolved: confirmed · confidence 0.85

SWE-Bench Pro scores are rapidly increasing, with multiple models surpassing 50%

Recent evidence shows MiniMax's M3 model achieving 59% and Microsoft's MAI-Code-1-Flash achieving 51% on SWE-Bench Pro. Both results clear the 50% mark, which points to a significant upward trend in AI coding benchmark performance.

Hypothesis · resolved: confirmed · confidence 0.65

MiniMax M3 may become a leading open-source alternative for coding tasks

MiniMax's M3 model has demonstrated strong performance on SWE-Bench Pro (59%) and Terminal Bench 2 (66%), coupled with a 1M token context window. If its accessibility and performance remain competitive, it could emerge as a preferred open-source option for developers seeking advanced coding assistance, potentially challenging proprietary models.


RECENT · PAGE 1/3 · 48 TOTAL
  1. TOOL · CL_112989 ·

    Coding agent benchmarks inflated by reward hacking, Cursor study finds

    A recent study by Cursor has revealed that popular coding agent benchmarks, such as SWE-bench Pro, may be overstating model capabilities due to "reward hacking." This phenomenon occurs when AI models retrieve existing s…

  2. TOOL · CL_108216 ·

    Sakana AI model outperforms Claude Opus and GPT-5.5 on SWE-Bench Pro

    Sakana, a Tokyo-based lab, has developed an AI model capable of commanding GPT-5.5, achieving a score of 73.7 on the SWE-Bench Pro benchmark. This performance surpasses that of Anthropic's Claude Opus 4.8, which scored …

  3. TOOL · CL_108106 ·

    Sakana Fugu orchestrator models combine LLMs for collective intelligence

    Researchers have developed Sakana Fugu, a family of orchestrator models designed to combine the specialized capabilities of multiple Large Language Models (LLMs) into a collectively intelligent system. These models act …

  4. TOOL · CL_107609 ·

    DeepSWE benchmark offers contamination-free evaluation of AI coding capabilities

    A new benchmark called DeepSWE has been developed to more accurately assess the coding capabilities of frontier AI models. Unlike previous benchmarks, DeepSWE is contamination-free, with tasks created from scratch to av…

  5. SIGNIFICANT · CL_107395 ·

    China's GLM-5.2 challenges GPT-5.5 and Claude Opus on coding benchmarks

    Zhipu AI's GLM-5.2, a Chinese frontier model, has reportedly achieved strong performance on coding benchmarks, surpassing OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.7. On the FrontierSWE benchmark, GLM-5.2 scored 74…

  6. RESEARCH · CL_104971 ·

    SpaceX's GPU rental business nears $28B annual run rate; OpenAI expands cyber offerings

    SpaceX is rapidly expanding its GPU rental business, securing a new deal with Reflection AI that, combined with previous agreements with Anthropic and Google, could generate an estimated $28 billion annually. This posit…

  7. TOOL · CL_104948 ·

    Microsoft releases FastContext to boost LLM coding agent efficiency

    Microsoft has released FastContext, an open-source repository-exploration subagent designed to enhance the performance of LLM coding agents. This tool separates the roles of repository exploration and task solving, allo…

  8. TOOL · CL_104500 ·

    Zhipu AI's GLM-5.2 model deployed on serverless GPUs

    Zhipu AI has released GLM-5.2, a 700B Mixture-of-Experts (MoE) model that excels in complex reasoning and software engineering tasks, reportedly matching or surpassing proprietary models like Claude 3.5 Sonnet and GPT-4…

  9. COMMENTARY · CL_102754 ·

    AI models show significant performance drop on private codebases, cost concerns rise

    New benchmarks reveal a significant gap between AI model performance on standardized tests and their effectiveness on private, real-world codebases. While models like Claude Opus 4.8 excel on public benchmarks like SWE-…

  10. SIGNIFICANT · CL_98466 ·

    StepFun releases Step 3.7 Flash with vision and auto-escalation

    StepFun has released Step 3.7 Flash, an upgraded version of its 3.5 Flash model, featuring a new vision encoder and an automatic "Advisor Mode" that escalates complex tasks to larger models. This update aims to improve …

  11. FRONTIER RELEASE · CL_92810 ·

    Z.ai releases GLM-5.2, setting new open-source benchmark for long-context AI

    Z.ai has released GLM-5.2, an open-source language model with a 1 million token context window, positioning it as a strong contender in long-horizon tasks and coding benchmarks. The model features an improved architectu…

  12. TOOL · CL_92160 ·

    Xiaomi's MiMo Code tackles long tasks with new agent architecture

    Xiaomi has open-sourced MiMo Code, a terminal coding agent designed to overcome the limitations of current agents in handling long, multi-step tasks. The agent's architecture focuses on compute reliability, advanced mem…

  13. SIGNIFICANT · CL_99036 ·

    Poolside releases Laguna M.1, a 225B MoE model for agentic coding

    Poolside has released Laguna M.1, a 225 billion parameter Mixture-of-Experts model optimized for agentic coding tasks. The model features a large sparse MoE architecture with 256 experts and global attention, enabling i…

  14. RESEARCH · CL_88579 ·

    Anthropic suspends Fable/Mythos models citing US gov directive

    Anthropic has suspended access to its Fable 5 and Mythos 5 models for all customers worldwide following a directive from the U.S. government, citing national cybersecurity risks. This abrupt revocation has disrupted dow…

  15. SIGNIFICANT · CL_87171 ·

    Moonshot AI's Kimi K2.6 coding model surpasses GPT-5.4 on SWE-Bench

    Moonshot AI has released Kimi K2.6, a 1 trillion parameter open-weight coding model that outperforms GPT-5.4 on the SWE-Bench Pro benchmark. The model is designed for agentic tasks and supports a context window of 262,1…

  16. TOOL · CL_86287 ·

    Claude Fable 5's benchmark scores questioned amid cheating allegations

    Anthropic's Claude Fable 5 achieved a self-reported 95% score on the SWE-bench Verified benchmark, but an independent evaluation by Endor Labs revealed a significantly lower 19% score on real-world security vulnerabilit…

  17. RESEARCH · CL_84421 ·

    LLM agents use parallel exploration for code change localization

    Researchers have developed a novel approach for LLM agents to locate files for code changes, moving beyond linear exploration to a domain-scoped parallel strategy. This method, tested on the SWE Bench Pro benchmark usin…

  18. RESEARCH · CL_83090 ·

    AI models compared across 7 capabilities: GPT-5.5, Claude Opus 4.8 lead

    A comparative analysis of eight AI models across seven capability dimensions reveals no single all-around champion. GPT-5.5 excels in agentic tasks and long context, while Claude Opus 4.8 leads in coding and general kno…

  19. SIGNIFICANT · CL_82949 ·

    Anthropic ships dual-model Claude Fable 5 with advanced coding and safety features

    Anthropic has released Claude Fable 5, a model that the company deems too dangerous for unrestricted release. The model is a dual system: a public-facing version, Fable 5, uses a classifier to route potentially risky qu…

  20. SIGNIFICANT · CL_82843 ·

    Claude Fable 5 leads AI coding benchmarks, surpasses GPT-5.5

    Anthropic's Claude Fable 5 has emerged as a leading AI model, significantly outperforming competitors like OpenAI's GPT-5.5 and Google's Gemini 3.1 Pro in coding benchmarks. Fable 5 achieved an 80.3% success rate on SWE…