SWE-bench
PulseAugur coverage of SWE-bench — every cluster mentioning SWE-bench across labs, papers, and developer communities, ranked by signal.
- instance of DeepSeek V4-Pro 90%
- instance of HumanEval 70%
- instance of Terminal-Bench 70%
- used by Claude Fable-5 70%
- instance of Massive Multitask Language Understanding 70%
- competes with Terminal Bench 2.0 70%
- competes with DeepSeek V4-Pro 70%
- acquired by SWE-bench Verified 70%
- used by GPQA: A Graduate-Level Google-Proof Q&A Benchmark 70%
- instance of Generative Ai Interactive Agents 70%
- instance of arXiv 60%
- competes with HumanEval 60%
18 days with sentiment data
-
DeepSeek releases 1.6T open-weight V4-Pro model with MIT license · 1 source tracked
DeepSeek has released its V4 series of Mixture-of-Experts models, including V4-Pro (1.6T total parameters) and V4-Flash (284B total). Both models are released under the MIT license, offering full open weights and suppor…
-
Alibaba Qwen unveils AgentWorld language model for environment simulation
Alibaba's Qwen team has introduced Qwen-AgentWorld, a new language world model designed to simulate various agent environments. This model focuses on training LLMs to understand and predict environments, rather than jus…
-
OpenMythos benchmarks released, highlights Qwen 3.6 discrepancies
Benchmark results for the OpenMythos model have been released, covering SWE-bench Pro, CyberGym, and cybench. While the model performs well for its size and cybersecurity focus, there's potential for further…
-
Xiaomi launches MiMo Code with persistent memory, claims Claude Code advantage
Xiaomi has released MiMo Code, an open-source fork of the OpenCode terminal coding agent. This new version introduces a persistent memory system designed to handle long tasks, along with subagent orchestration and intel…
-
Anthropic's Claude Opus 4.8 claims AI crown as OpenAI retires GPT-4.5
OpenAI is retiring several of its older AI models, including GPT-4.5 and o3, with GPT-4.5 being removed from ChatGPT on June 27, 2026. This move is seen as a strategic shift ahead of potential IPO plans and the release …
-
New RAD method controls MoE language model reasoning without text analysis
Researchers have developed a new method called RAD (Routing Agreement Decoding) for controlling reasoning in sparse Mixture-of-Experts (MoE) language models. This technique leverages the internal routing states of MoE m…
-
DeepReinforce AI releases Ornith-1.0 family of open-source coding models
DeepReinforce AI has released the Ornith-1.0 family of open-source models, designed for agentic coding tasks. The models, available in various sizes including 9B, 35B, and 397B parameters, are built upon Gemma 4 and Qwe…
-
AI bug fixing costs plummet 75x, now cheaper than human developers
The cost of using frontier AI models to fix software bugs has fallen roughly 75-fold since March 2023. This reduction, which effectively halves the cost every 250 days, now makes AI bu…
-
OpenAI, Google, DeepSeek unveil major AI model updates in June 2026
The AI landscape is heating up in June 2026 with major advancements from OpenAI, Google DeepMind, and DeepSeek. OpenAI is reportedly in internal testing with GPT-5.6, showing significant reasoning improvements and lower…
-
AI agent monitors flawed by wall-clock calibration, study finds
A new research paper, "Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence," published on arXiv, identifies a critical flaw in runtime monitors for autonomous …
-
Frontier AI models show "prefill awareness," potentially impacting safety tests
A new paper explores the concept of "prefill awareness" in frontier AI models, investigating whether these models can distinguish between tampered and untampered content. Researchers Parv Mahajan and Andy Wang found tha…
-
Local LLMs poised to replace cloud coding assistants for 80% of tasks by 2026
The discussion around local Large Language Models (LLMs) for coding in 2026 suggests that these models are becoming capable of handling a significant portion of daily coding tasks, potentially replacing cloud-based solu…
-
Fireworks AI launches GLM-5.2 with 1M context, optimized for coding
Fireworks AI has launched GLM-5.2, a new frontier model with a 1 million token context window, optimized for coding tasks. The model has undergone independent validation on benchmarks including SWE-bench and GPQA. Firew…
-
GeneralVLA-2 enhances robot planning with improved 3D reconstruction and memory
Researchers have introduced GeneralVLA-2, an advancement in vision-language-action systems designed for robotic planning. The system incorporates GeoFuse-MV3D to enhance 3D reconstruction accuracy by leveraging geometry…
-
GeneralVLA-2 advances robot planning with improved 3D reconstruction and memory
Researchers have introduced GeneralVLA-2, an advancement in vision-language-action systems designed for robot planning. This system incorporates GeoFuse-MV3D for enhanced 3D reconstruction and an improved KnowledgeBank …
-
LLM benchmarks saturate quickly due to training data contamination
Public LLM benchmarks are saturating and becoming less useful for differentiating top-tier models because model training data inadvertently includes benchmark questions. This contamination issue, observed in benchmarks l…
-
DeepSeek V4 excels at coding but lags in general reasoning
DeepSeek V4's coding performance is exceptionally high, achieving top scores on benchmarks like SWE-bench and LiveCodeBench. However, evaluations by CAISI suggest its general reasoning and agentic capabilities lag signi…
-
Claude Fable 5 and Higgsfield MCP build $10K websites in 90 seconds
A developer has demonstrated a workflow for creating high-end 3D scroll websites in under 90 seconds using Anthropic's Claude Fable 5 and the Higgsfield MCP. This process leverages Claude Fable 5's coding and site-cloni…
-
Paper defines 'agent harness' for AI coding assistants
A new paper published on arXiv proposes a formal definition for "agent harness," a term used in software engineering for systems that wrap language models to create coding agents. The authors trace the term's origins an…
-
Anthropic AI engineers ship code 8x faster with recursive self-improvement
Anthropic has released data indicating significant advancements in AI development, with their engineers now shipping code eight times faster than in a previous baseline period. The company's AI models, like Claude, are …