SWE-bench Verified
PulseAugur coverage of SWE-bench Verified — every cluster mentioning SWE-bench Verified across labs, papers, and developer communities, ranked by signal.
-
DeepReinforce releases Ornith-1.0 open-source coding models that learn RL scaffolds
DeepReinforce has launched Ornith-1.0, a family of open-source coding models available under the MIT license. These models, built upon Gemma 4 and Qwen 3.5, are designed for agentic coding tasks and uniquely learn their…
-
Alibaba's Qwen3-Coder-Next achieves 70.6% on SWE-bench Verified with efficient MoE architecture
Qwen3-Coder-Next, an 80-billion-parameter Mixture-of-Experts model from Alibaba's Qwen team, has achieved 70.6% on the SWE-bench Verified benchmark with only approximatel…
-
New SHERLOC framework boosts LLM code repair efficiency and accuracy
Researchers have developed SHERLOC, a novel framework designed to improve the efficiency and accuracy of Large Language Model (LLM) agents in code repair tasks. This training-free framework utilizes a reasoning LLM with…
-
AI models show significant performance drop on private codebases, cost concerns rise
New benchmarks reveal a significant gap between AI model performance on standardized tests and their effectiveness on private, real-world codebases. While models like Claude Opus 4.8 excel on public benchmarks like SWE-…
-
Users optimize Qwen3.6-27B for consumer GPUs with long context
Users are sharing optimized settings for running the Qwen3.6-27B large language model on consumer hardware, particularly focusing on maximizing performance with limited VRAM. Discussions cover various quantization metho…
-
Chinese AI labs release powerful open models, challenging US frontier AI
Chinese AI labs are rapidly advancing their open-weight models, with Z.ai's GLM-5.2 achieving impressive benchmark scores and a one million token context window, rivaling top closed models like Opus 4.8 and GPT-5.5 at a…
-
New tuning method boosts LLM coding agent performance
Researchers have developed a new method called probe-and-refine tuning to improve the performance of large language model (LLM) coding agents. This technique focuses on enhancing the guidance files that direct agents to…
-
HyDRA framework dynamically routes LLM queries, cutting costs and improving efficiency
Researchers have developed HyDRA, a novel framework for dynamically routing queries to heterogeneous pools of large language models. Unlike previous methods that make binary strong-vs-weak decisions or require retrainin…
-
New study reveals widespread reward hackability in code RL training environments
A new arXiv paper details how easily current code reinforcement learning (RL) training environments can be exploited. Researchers found that a significant percentage of tasks in SWE-bench Verified and R2E-Gym accep…
-
GeneralVLA-2 enhances robot planning with improved 3D reconstruction and memory
Researchers have introduced GeneralVLA-2, an advancement in vision-language-action systems designed for robotic planning. The system incorporates GeoFuse-MV3D to enhance 3D reconstruction accuracy by leveraging geometry…
-
GeneralVLA-2 advances robot planning with improved 3D reconstruction and memory
Researchers have introduced GeneralVLA-2, an advancement in vision-language-action systems designed for robot planning. This system incorporates GeoFuse-MV3D for enhanced 3D reconstruction and an improved KnowledgeBank …
-
New LLM techniques enhance reasoning via iterative refinement and optimized looping · 5 sources tracked
Researchers have developed new methods to improve the reasoning capabilities of large language models (LLMs) through test-time scaling. The REVES framework uses a two-stage iterative process to augment training data and…
-
Poolside releases Laguna M.1, a 225B MoE model for agentic coding
Poolside has released Laguna M.1, a 225-billion-parameter Mixture-of-Experts model optimized for agentic coding tasks. The model features a large sparse MoE architecture with 256 experts and global attention, enabling i…
-
Claude Fable 5's benchmark scores questioned amid cheating allegations
Anthropic's Claude Fable 5 posted a self-reported 95% score on SWE-bench Verified, but an independent evaluation by Endor Labs found a significantly lower 19% score on real-world security vulnerabilit…
-
Claude Code outperforms OpenAI Codex for production coding tasks
A team of 12 engineers has found Anthropic's Claude Code to be a better AI coding assistant than OpenAI's Codex for production development. Over three months and 50+ projects, they determined Claude Code is bet…
-
MetaAI Recursive Self-Design framework introduced with DGM benchmark results
A new research paper introduces the concept of "MetaAI Recursive Self-Design," defining it as an AI-assisted development pattern where the AI itself modifies its building and improvement mechanisms. The paper proposes a…
-
New method FuseSearch boosts code localization efficiency
Researchers have developed FuseSearch, a new method to improve code localization in automated software development. This approach reformulates the task as a joint quality-efficiency optimization, aiming to reduce redund…
-
New methods enhance AI agent reliability and safety
Researchers have developed new methods to improve the reliability and safety of AI agents. One approach, TRACE, focuses on monitoring long-horizon agent trajectories to detect malicious or unintended behaviors by analyz…
-
AI agent intervention timing proves unreliable, study finds
A new research paper explores the challenges of determining when to intervene in autonomous AI agents, particularly during long-horizon tasks. The study found that agents can enter a "saturation trap" where they show no…
-
CoMem framework decouples context management for faster AI agents
Researchers have developed CoMem, a new framework that separates context management from an agent's primary workflow, allowing these processes to run concurrently. This asynchronous approach uses a k-step-off pipeline t…