SWE-bench

PulseAugur coverage of SWE-bench — every cluster mentioning SWE-bench across labs, papers, and developer communities, ranked by signal.

Total · 30d: 39 (39 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 26 (26 over 90d)
Sentiment · 30d: 13 days with sentiment data

RECENT · PAGE 2/2 · 39 TOTAL
  1. COMMENTARY · CL_20705 ·

    AI models: Choose benchmarks over hype for true performance

    A recent analysis highlights that tech companies often select AI models based on hype rather than performance on relevant benchmarks. The article emphasizes that benchmarks like SWE-bench for coding, Terminal-Bench for …

  2. TOOL · CL_20742 ·

    VCBench benchmark tests LLMs for venture capital founder success prediction

    Researchers have introduced VCBench, a novel benchmark designed to evaluate the capabilities of large language models in predicting founder success within the venture capital industry. This benchmark includes a dataset …

  3. RESEARCH · CL_20477 ·

    New RL method optimizes agent training by controlling rollout pass rates

    Researchers have developed a new technique called Prefix Sampling (PS) to improve the efficiency of reinforcement learning (RL) for AI agents. This method addresses wasted compute on rollout groups with skewed pass rate…

  4. TOOL · CL_19659 ·

    SubQuadratic's SSA offers linear scaling for LLMs, challenging AI industry's compute moat

    A new attention mechanism called Subquadratic Sparse Attention (SSA) has been developed, offering a linearly scaling solution for long-context retrieval and reasoning. This innovation promises significant speedups, with…

  5. TOOL · CL_19355 ·

    Subquadratic debuts 12M-token context window with linear scaling architecture

    Subquadratic, a startup with 11 PhD researchers, has launched a new model featuring its Subquadratic Selective Attention (SSA) architecture, which claims to scale linearly with context length. This innovation allows for…

  6. RESEARCH · CL_15893 ·

    MolViBench benchmark evaluates LLMs on molecular coding tasks for drug discovery

    Researchers have introduced MolViBench, a novel benchmark designed to evaluate the capabilities of large language models (LLMs) in molecular coding tasks. This benchmark addresses the gap left by existing evaluations, w…

  7. TOOL · CL_13981 ·

    DeepClaude slashes coding agent costs by 17x using DeepSeek V4 Pro

    An open-source tool called DeepClaude has gained significant traction by allowing developers to use the Claude Code agent loop with DeepSeek V4 Pro instead of Anthropic's models. This swap drastically reduces costs, wit…

  8. RESEARCH · CL_13613 ·

    Vintage AI trained on 1930s data learns to code and fix software bugs

    Researchers have fine-tuned a large language model, Talkie-1930-13B, trained only on data predating 1931, to perform software engineering tasks. Despite its limited knowledge base, the model successfully patched a bug i…

  9. RESEARCH · CL_11687 ·

    AI agent swarms may fail due to 'Inverse-Wisdom Law,' study finds

    A new paper introduces the Inverse-Wisdom Law, challenging the assumption that AI agent swarms benefit from the "Wisdom of the Crowd." The research demonstrates that these swarms can prioritize internal architectural ag…

  10. FRONTIER RELEASE · CL_17253 ·

    Mistral’s Model Lets You Vibe Long-Running Code in the Cloud

    Mistral AI has released Mistral Medium 3.5, a new 128 billion parameter model designed for extended coding tasks with a 256K context window. This model powers new remote coding agents within Mistral's Vibe platform, ena…

  11. RESEARCH · CL_07393 ·

    Qwen 3.6 Plus outperforms DeepSeek V4 Pro in price and quality benchmarks

    A recent battle test of six large language models (LLMs) released in April found that Qwen 3.6 Plus, released 22 days prior, outperformed the newer DeepSeek V4 Pro. Despite DeepSeek V4 Pro's advanced reasoning archi…

  12. RESEARCH · CL_06668 ·

    AgentEval framework improves AI agent workflow evaluation with DAG-based error tracking

    Researchers have developed AgentEval, a new framework for evaluating agentic workflows by representing them as directed acyclic graphs (DAGs). This approach allows for detailed step-level assessment and tracking of erro…

  13. RESEARCH · CL_04040 ·

    SWE-bench tests AI agents' real-world capability, showing 80% resolution rate

    Evaluating the real-world performance of AI agents is becoming critical as they transition from experimental stages to production environments. Traditional metrics like perplexity scores are insufficient for assessing a…

  14. RESEARCH · CL_17452 ·

    Public AI models replicate Anthropic's vulnerability discovery findings

    Researchers have successfully replicated Anthropic's Mythos findings using publicly available AI models like GPT-5.4 and Claude Opus 4.6. This suggests that advanced AI capabilities for discovering software vulnerabilit…

  15. COMMENTARY · CL_01323 ·

    How good are LLMs at fixing their mistakes? A chatbot arena experiment with Keras and TPUs

    Current methods for evaluating large language models, such as MMLU and HumanEval, may be insufficient as they do not capture the nuances of interactive, goal-oriented conversations. A more effective approach would invol…

  16. FRONTIER RELEASE · CL_00841 ·

    Cosine Genie leverages GPT-4o fine-tuning to become top coding agent

    Cosine has launched Genie, a coding agent that has achieved the top ranking on the SWE-bench benchmark, surpassing previous leaders by a significant margin. This success is attributed to fine-tuning OpenAI's GPT-4o mode…

  17. RESEARCH · CL_00777 ·

    OpenAI abandons SWE-bench Verified due to flawed tests and data contamination

    OpenAI has announced it will no longer use SWE-bench Verified to evaluate the coding capabilities of frontier AI models. The benchmark has become contaminated, with models showing improved scores primarily due to exposu…

  18. FRONTIER RELEASE · CL_00230 ·

    OpenAI releases GPT-4o with fine-tuning and enhanced multimodal capabilities

    OpenAI has released fine-tuning capabilities for its GPT-4o model, allowing developers to customize its performance and tone for specific applications. This feature, available on paid tiers, offers developers the chance…

  19. FRONTIER RELEASE · CL_02309 ·

    Introducing gpt-realtime and Realtime API updates

    OpenAI has released GPT-4.1, a new series of models for its API that offer significant improvements in coding, instruction following, and long context comprehension, outperforming previous models like GPT-4o. The compan…