SWE-bench Verified

PulseAugur coverage of SWE-bench Verified: every cluster mentioning SWE-bench Verified across labs, papers, and developer communities, ranked by signal.

Total · 30d

36 over 90d

Releases · 30d

0 over 90d

Papers · 30d

20 over 90d

TIER MIX · 90D

frontier release 2
significant 3
research 10
tool 19
commentary 2

SENTIMENT · 30D

13 day(s) with sentiment data

RECENT · PAGE 1/2 · 36 TOTAL

SIGNIFICANT · CL_110786 · Jun 25 · 17:11

DeepReinforce releases Ornith-1.0 open-source coding models that learn RL scaffolds

DeepReinforce has launched Ornith-1.0, a family of open-source coding models available under the MIT license. These models, built upon Gemma 4 and Qwen 3.5, are designed for agentic coding tasks and uniquely learn their…
SIGNIFICANT · CL_110172 · Jun 25 · 07:03

Alibaba's Qwen3-Coder-Next achieves 70.6% on SWE-bench Verified with efficient MoE architecture

Qwen3-Coder-Next, an 80-billion-parameter Mixture-of-Experts model from Alibaba's Qwen team, has demonstrated impressive efficiency by achieving 70.6% on the SWE-bench Verified benchmark with only approximatel…
RESEARCH · CL_107786 · Jun 23 · 17:05

New SHERLOC framework boosts LLM code repair efficiency and accuracy

Researchers have developed SHERLOC, a novel framework designed to improve the efficiency and accuracy of Large Language Model (LLM) agents in code repair tasks. This training-free framework utilizes a reasoning LLM with…
COMMENTARY · CL_102754 · Jun 21 · 15:37

AI models show significant performance drop on private codebases, cost concerns rise

New benchmarks reveal a significant gap between AI model performance on standardized tests and their effectiveness on private, real-world codebases. While models like Claude Opus 4.8 excel on public benchmarks like SWE-…
TOOL · CL_98376 · Jun 18 · 08:34

Users optimize Qwen3.6-27B for consumer GPUs with long context

Users are sharing optimized settings for running the Qwen3.6-27B large language model on consumer hardware, particularly focusing on maximizing performance with limited VRAM. Discussions cover various quantization metho…
RESEARCH · CL_97275 · Jun 17 · 19:59

Chinese AI labs release powerful open models, challenging US frontier AI

Chinese AI labs are rapidly advancing their open-weight models, with Z.ai's GLM-5.2 achieving impressive benchmark scores and a one million token context window, rivaling top closed models like Opus 4.8 and GPT-5.5 at a…
RESEARCH · CL_96671 · Jun 17 · 11:23

New tuning method boosts LLM coding agent performance

Researchers have developed a new method called probe-and-refine tuning to improve the performance of large language model (LLM) coding agents. This technique focuses on enhancing the guidance files that direct agents to…
TOOL · CL_93606 · Jun 16 · 04:00

HyDRA framework dynamically routes LLM queries, cutting costs and improving efficiency

Researchers have developed HyDRA, a novel framework for dynamically routing queries to heterogeneous pools of large language models. Unlike previous methods that make binary strong-vs-weak decisions or require retrainin…
TOOL · CL_93154 · Jun 16 · 04:00

New study reveals widespread reward hackability in code RL training environments

A new arXiv paper details how easily current code reinforcement learning (RL) training environments can be exploited. Researchers found that a significant percentage of tasks in SWE-bench Verified and R2E-Gym accep…
TOOL · CL_106548 · Jun 16 · 00:00

GeneralVLA-2 enhances robot planning with improved 3D reconstruction and memory

Researchers have introduced GeneralVLA-2, an advancement in vision-language-action systems designed for robotic planning. The system incorporates GeoFuse-MV3D to enhance 3D reconstruction accuracy by leveraging geometry…
RESEARCH · CL_96078 · Jun 16 · 00:00

GeneralVLA-2 advances robot planning with improved 3D reconstruction and memory

Researchers have introduced GeneralVLA-2, an advancement in vision-language-action systems designed for robot planning. This system incorporates GeoFuse-MV3D for enhanced 3D reconstruction and an improved KnowledgeBank …
RESEARCH · CL_93485 · Jun 16 · 00:00

New LLM techniques enhance reasoning via iterative refinement and optimized looping · 5 sources tracked

Researchers have developed new methods to improve the reasoning capabilities of large language models (LLMs) through test-time scaling. The REVES framework uses a two-stage iterative process to augment training data and…
SIGNIFICANT · CL_99036 · Jun 15 · 09:17

Poolside releases Laguna M.1, a 225B MoE model for agentic coding

Poolside has released Laguna M.1, a 225-billion-parameter Mixture-of-Experts model optimized for agentic coding tasks. The model features a large sparse MoE architecture with 256 experts and global attention, enabling i…
TOOL · CL_86287 · Jun 11 · 22:00

Claude Fable 5's benchmark scores questioned amid cheating allegations

Anthropic's Claude Fable 5 achieved a self-reported 95% score on the SWE-bench Verified benchmark, but an independent evaluation by Endor Labs revealed a significantly lower 19% score on real-world security vulnerabilit…
COMMENTARY · CL_84695 · Jun 11 · 04:30

Claude Code outperforms OpenAI Codex for production coding tasks

A team of 12 engineers has found Anthropic's Claude Code to be a better AI coding assistant than OpenAI's Codex for production development. Over three months and 50+ projects, they determined Claude Code is bet…
RESEARCH · CL_79494 · Jun 8 · 15:45

MetaAI Recursive Self-Design Framework Introduced with DGM Benchmark Results

A new research paper introduces the concept of "MetaAI Recursive Self-Design," defining it as an AI-assisted development pattern where the AI itself modifies its building and improvement mechanisms. The paper proposes a…
TOOL · CL_74420 · Jun 6 · 04:00

New method FuseSearch boosts code localization efficiency

Researchers have developed FuseSearch, a new method to improve code localization in automated software development. This approach reformulates the task as a joint quality-efficiency optimization, aiming to reduce redund…
RESEARCH · CL_72413 · Jun 4 · 09:26

New methods enhance AI agent reliability and safety

Researchers have developed new methods to improve the reliability and safety of AI agents. One approach, TRACE, focuses on monitoring long-horizon agent trajectories to detect malicious or unintended behaviors by analyz…
TOOL · CL_70242 · Jun 4 · 04:00

AI agent intervention timing proves unreliable, study finds

A new research paper explores the challenges of determining when to intervene in autonomous AI agents, particularly during long-horizon tasks. The study found that agents can enter a "saturation trap" where they show no…
TOOL · CL_62924 · Jun 1 · 04:00

CoMem framework decouples context management for faster AI agents

Researchers have developed CoMem, a new framework that separates context management from an agent's primary workflow, allowing these processes to run concurrently. This asynchronous approach uses a k-step-off pipeline t…
