SWE-bench Pro
PulseAugur coverage of SWE-bench Pro — every cluster mentioning SWE-bench Pro across labs, papers, and developer communities, ranked by signal.
-
Z.AI's GLM 5.1 model leads in long-horizon agentic tasks, outperforming rivals
Z.AI has released its GLM 5.1 model, an open-source option designed for long-horizon agentic tasks, capable of running autonomously for up to 8 hours. The model reportedly outperforms GPT-5.4, Claude Opus 4.6, and Gemin…
-
Poolside AI releases open-weight Laguna XS.2 and M.1 coding models
Poolside AI has released two new agentic coding models, Laguna M.1 and Laguna XS.2, along with their agent training and operation runtime. Laguna M.1 is a large Mixture of Experts (MoE) model trained on 30T tokens using…
-
Anthropic's Claude Mythos finds zero-days; GLM-5.1 targets long tasks
Anthropic's Claude Mythos Preview has demonstrated a significant capability in identifying zero-day vulnerabilities in critical software, leading to the formation of Project Glasswing to enhance cybersecurity. Meanwhile…
-
OpenAI abandons SWE-bench Verified due to flawed tests and data contamination
OpenAI has announced it will no longer use SWE-bench Verified to evaluate the coding capabilities of frontier AI models. The benchmark has become contaminated, with models showing improved scores primarily due to exposu…
-
Anthropic launches Claude Design, a visual creation tool powered by Claude Opus 4.7
Anthropic has launched Claude Design, a new product integrated with its Claude Opus 4.7 model that lets users collaborate on visual content creation. The tool supports the generation and refinement of designs, pro…
-
OpenAI launches GPT-5.5, boosting AI intelligence and speed for complex tasks
OpenAI has released GPT-5.5 and GPT-5.5 Pro, their latest and most intuitive models, designed for complex tasks and agentic capabilities. These models excel in areas like coding, data analysis, and operating software, o…