SWE-bench

PulseAugur coverage of SWE-bench — every cluster mentioning SWE-bench across labs, papers, and developer communities, ranked by signal.

Total · 30d: 67 (67 over 90d)

Releases · 30d: 0 (0 over 90d)

Papers · 30d: 40 (40 over 90d)

TIER MIX · 90D

frontier release 5
significant 3
research 17
tool 34
commentary 8

SENTIMENT · 30D

18 days with sentiment data

RECENT · PAGE 1/4 · 67 TOTAL

SIGNIFICANT · CL_111948 · Jun 26 · 07:03

DeepSeek releases 1.6T open-weight V4-Pro model with MIT license · 1 source tracked

DeepSeek has released its V4 series of Mixture-of-Experts models, including V4-Pro (1.6T total parameters) and V4-Flash (284B total). Both models are released under the MIT license, offering full open weights and suppor…
FRONTIER RELEASE · CL_108496 · Jun 24 · 05:31

Alibaba Qwen unveils AgentWorld language model for environment simulation

Alibaba's Qwen team has introduced Qwen-AgentWorld, a new language world model designed to simulate various agent environments. This model focuses on training LLMs to understand and predict environments, rather than jus…
RESEARCH · CL_107144 · Jun 23 · 18:56

OpenMythos benchmarks released, highlighting Qwen 3.6 discrepancies

Benchmark results for the OpenMythos model have been released, covering SWE-bench Pro, CyberGym, and cybench. While the model performs well for its size and cybersecurity focus, there's potential for further…
TOOL · CL_105288 · Jun 23 · 07:00

Xiaomi launches MiMo Code with persistent memory, claims Claude Code advantage

Xiaomi has released MiMo Code, an open-source fork of the OpenCode terminal coding agent. This new version introduces a persistent memory system designed to handle long tasks, along with subagent orchestration and intel…
RESEARCH · CL_104214 · Jun 22 · 19:16

Anthropic's Claude Opus 4.8 claims AI crown as OpenAI retires GPT-4.5

OpenAI is retiring several of its older AI models, including GPT-4.5 and o3, with GPT-4.5 being removed from ChatGPT on June 27, 2026. This move is seen as a strategic shift ahead of potential IPO plans and the release …
TOOL · CL_105172 · Jun 22 · 03:17

New RAD method controls MoE language model reasoning without text analysis

Researchers have developed a new method called RAD (Routing Agreement Decoding) for controlling reasoning in sparse Mixture-of-Experts (MoE) language models. This technique leverages the internal routing states of MoE m…
FRONTIER RELEASE · CL_111214 · Jun 21 · 03:30

DeepReinforce AI releases Ornith-1.0 family of open-source coding models

DeepReinforce AI has released the Ornith-1.0 family of open-source models, designed for agentic coding tasks. The models, available in various sizes including 9B, 35B, and 397B parameters, are built upon Gemma 4 and Qwe…
TOOL · CL_101774 · Jun 20 · 13:25

AI bug fixing costs plummet 75x, now cheaper than human developers

The cost of using frontier AI models to fix software bugs has fallen roughly 75-fold since March 2023. This decline, equivalent to the cost halving about every 250 days, now makes AI bu…
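The halving figure describes exponential decay. If cost halves every T days, the total reduction after t days is 2^(t/T), so a 75x drop corresponds to log2(75) ≈ 6.2 halvings. Below is a minimal sketch that converts between the summary's two figures, assuming a clean exponential trend (real pricing data is noisier). The function names are illustrative and do not come from the underlying report.

```python
import math


def reduction_factor(elapsed_days: float, halving_days: float) -> float:
    """Total cost reduction after elapsed_days if cost halves every halving_days."""
    return 2 ** (elapsed_days / halving_days)


def days_for_reduction(factor: float, halving_days: float) -> float:
    """Days needed to reach a given reduction factor at a fixed halving time."""
    return halving_days * math.log2(factor)


def implied_halving_time(factor: float, elapsed_days: float) -> float:
    """Halving time implied by observing a `factor` reduction over elapsed_days."""
    return elapsed_days / math.log2(factor)


if __name__ == "__main__":
    # Figures taken from the summary: ~75x reduction, ~250-day halving time.
    print(f"{math.log2(75):.2f} halvings")
    print(f"{days_for_reduction(75, 250):.0f} days")
```

With the summary's numbers, a 75x reduction at a 250-day halving time takes about 1,557 days. `implied_halving_time` runs the same relation in the other direction for any elapsed span you measure yourself.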
SIGNIFICANT · CL_100532 · Jun 19 · 11:08

OpenAI, Google, DeepSeek unveil major AI model updates in June 2026

AI development is accelerating in June 2026, with major announcements from OpenAI, Google DeepMind, and DeepSeek. OpenAI is reportedly testing GPT-5.6 internally, with significant reasoning improvements and lower…
TOOL · CL_100092 · Jun 19 · 04:00

AI agent monitors flawed by wall-clock calibration, study finds

A new research paper, "Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence," published on arXiv, identifies a critical flaw in runtime monitors for autonomous …
TOOL · CL_97318 · Jun 17 · 17:41

Frontier AI models show "prefill awareness," potentially impacting safety tests

A new paper explores the concept of "prefill awareness" in frontier AI models, investigating whether these models can distinguish between tampered and untampered content. Researchers Parv Mahajan and Andy Wang found tha…
COMMENTARY · CL_95979 · Jun 17 · 05:12

Local LLMs poised to replace cloud coding assistants for 80% of tasks by 2026

The discussion around local Large Language Models (LLMs) for coding in 2026 suggests that these models are becoming capable of handling a significant portion of daily coding tasks, potentially replacing cloud-based solu…
FRONTIER RELEASE · CL_95424 · Jun 16 · 22:11

Fireworks AI launches GLM-5.2 with 1M context, optimized for coding

Fireworks AI has launched GLM-5.2, a new frontier model with a 1 million token context window, optimized for coding tasks. The model has undergone independent validation on benchmarks including SWE-bench and GPQA. Firew…
TOOL · CL_106548 · Jun 16 · 00:00

GeneralVLA-2 enhances robot planning with improved 3D reconstruction and memory

Researchers have introduced GeneralVLA-2, an advancement in vision-language-action systems designed for robotic planning. The system incorporates GeoFuse-MV3D to enhance 3D reconstruction accuracy by leveraging geometry…
RESEARCH · CL_96078 · Jun 16 · 00:00

GeneralVLA-2 advances robot planning with improved 3D reconstruction and memory

Researchers have introduced GeneralVLA-2, an advancement in vision-language-action systems designed for robot planning. This system incorporates GeoFuse-MV3D for enhanced 3D reconstruction and an improved KnowledgeBank …
TOOL · CL_85566 · Jun 11 · 13:00

LLM benchmarks saturate quickly due to training data contamination

Public LLM benchmarks are saturating and becoming less useful for differentiating top-tier models, because model training data inadvertently includes benchmark questions. This contamination issue, observed in benchmarks l…
TOOL · CL_84598 · Jun 11 · 03:25

DeepSeek V4 excels at coding but lags in general reasoning

DeepSeek V4's coding performance is exceptionally high, achieving top scores on benchmarks like SWE-bench and LiveCodeBench. However, evaluations by CAISI suggest its general reasoning and agentic capabilities lag signi…
TOOL · CL_84567 · Jun 11 · 03:16

Claude Fable 5 and Higgsfield MCP build $10K websites in 90 seconds

A developer has demonstrated a workflow for creating high-end 3D scroll websites in under 90 seconds using Anthropic's Claude Fable 5 and the Higgsfield MCP. This process leverages Claude Fable 5's coding and site-cloni…
TOOL · CL_82560 · Jun 10 · 04:00

Paper defines 'agent harness' for AI coding assistants

A new paper published on arXiv proposes a formal definition for "agent harness," a term used in software engineering for systems that wrap language models to create coding agents. The authors trace the term's origins an…
RESEARCH · CL_80489 · Jun 9 · 08:16

Anthropic AI engineers ship code 8x faster with recursive self-improvement

Anthropic has released data indicating significant advancements in AI development, with their engineers now shipping code eight times faster than in a previous baseline period. The company's AI models, like Claude, are …