PulseAugur

SWE-bench

PulseAugur coverage of SWE-bench — every cluster mentioning SWE-bench across labs, papers, and developer communities, ranked by signal.

Total · 30 days: 28 (90 days: 28)
Releases · 30 days: 0 (90 days: 0)
Papers · 30 days: 19 (90 days: 19)
[Charts: Tier distribution (90 days) · Relationships · Sentiment (30 days; 6 days with sentiment data)]

Recent · Page 1 of 2 · 28 items in total
  1. COMMENTARY · CL_47077

    AI benchmarks fail to measure real-world reliability, author warns

    The author argues that current AI benchmarks are misleading, as they fail to measure crucial aspects like factual accuracy and the tendency to hallucinate plausible but false information. Despite high scores on benchmar…

  2. COMMENTARY · CL_42389

    Coding agents need better failure analysis and recovery training

    Coding agents often fail not at the initial task understanding, but in the execution phase, making subtle errors that cascade into incorrect final outputs. Current training and evaluation methods, like SWE-bench, focus …

  3. TOOL · CL_39900

    Claude Opus 4.5 leads coding benchmarks; DeepSeek V4 excels at large refactors

    A comparison of Claude Opus 4.5 and DeepSeek V4 highlights their distinct strengths in coding tasks. Claude Opus 4.5 excels at precise, surgical fixes for production bugs and single-file issues, achieving a leading 80.9…

  4. SIGNIFICANT · CL_36099

    Anthropic releases Claude 4 Opus, claims world's best AI for coding

    Anthropic has released its new Claude 4 model series, featuring Claude Opus 4 and Claude Sonnet 4. The Opus 4 model is specifically highlighted as the leading AI for programming tasks, achieving a 72.5% score on the SWE…

  5. TOOL · CL_28290

    AI agents exhibit "Bystander Effect," sacrificing reasoning for conformity

    Researchers have identified a "Bystander Effect" in multi-agent systems where collaboration can lead to reduced reasoning quality, a phenomenon termed "cognitive loafing." Through analysis of 22,500 trajectories across …

  6. RESEARCH · CL_28293

    New LLM training methods boost efficiency and error recovery

    Researchers have developed new techniques for improving the efficiency of training large language models (LLMs). One method, Step Rejection Fine-Tuning (SRFT), leverages unsuccessful training trajectories by assessing t…

  7. TOOL · CL_25288

    AI coding benchmark scores may be misleading, analysis finds

    A recent analysis suggests that widely reported AI coding benchmark scores may be misleading. Models that achieve high scores on benchmarks like SWE-Bench when tested under specific conditions see a dramatic drop in per…

  8. SIGNIFICANT · CL_23645

    DeepSeek releases open-source coding model matching GPT-4o

    DeepSeek has released V3-0324, an open-source coding model that matches or surpasses leading models like GPT-4o and Claude 3.5 Sonnet in coding performance. This Mixture-of-Experts model, with 671 billion total paramete…

  9. COMMENTARY · CL_23256

    Jack Clark predicts 60% chance of automated AI R&D by 2028

    Jack Clark, co-founder of Anthropic, has predicted a 60% chance that AI research will be fully automated by the end of 2028, and a 30% chance by 2027. He bases this forecast on rapid advancements in AI capabilities acro…

  10. COMMENTARY · CL_20705

    AI models: Choose benchmarks over hype for true performance

    A recent analysis highlights that tech companies often select AI models based on hype rather than performance on relevant benchmarks. The article emphasizes that benchmarks like SWE-bench for coding, Terminal-Bench for …

  11. TOOL · CL_20742

    VCBench benchmark tests LLMs for venture capital founder success prediction

    Researchers have introduced VCBench, a novel benchmark designed to evaluate the capabilities of large language models in predicting founder success within the venture capital industry. This benchmark includes a dataset …

  12. RESEARCH · CL_20477

    New RL method optimizes agent training by controlling rollout pass rates

    Researchers have developed a new technique called Prefix Sampling (PS) to improve the efficiency of reinforcement learning (RL) for AI agents. This method addresses wasted compute on rollout groups with skewed pass rate…

  13. TOOL · CL_19659

    SubQuadratic's SSA offers linear scaling for LLMs, challenging AI industry's compute moat

    A new attention mechanism called Subquadratic Sparse Attention (SSA) has been developed, offering a linearly scaling solution for long-context retrieval and reasoning. This innovation promises significant speedups, with…

  14. TOOL · CL_19355

    Subquadratic debuts 12M-token context window with linear scaling architecture

    Subquadratic, a startup with 11 PhD researchers, has launched a new model featuring its Subquadratic Selective Attention (SSA) architecture, which claims to scale linearly with context length. This innovation allows for…

  15. RESEARCH · CL_15893

    MolViBench benchmark evaluates LLMs on molecular coding tasks for drug discovery

    Researchers have introduced MolViBench, a novel benchmark designed to evaluate the capabilities of large language models (LLMs) in molecular coding tasks. This benchmark addresses the gap left by existing evaluations, w…

  16. TOOL · CL_13981

    DeepClaude slashes coding agent costs by 17x using DeepSeek V4 Pro

    An open-source tool called DeepClaude has gained significant traction by allowing developers to use the Claude Code agent loop with DeepSeek V4 Pro instead of Anthropic's models. This swap drastically reduces costs, wit…

  17. RESEARCH · CL_13613

    Vintage AI trained on 1930s data learns to code and fix software bugs

    Researchers have fine-tuned a large language model, Talkie-1930-13B, trained only on data predating 1931, to perform software engineering tasks. Despite its limited knowledge base, the model successfully patched a bug i…

  18. RESEARCH · CL_11687

    AI agent swarms may fail due to 'Inverse-Wisdom Law,' study finds

    A new paper introduces the Inverse-Wisdom Law, challenging the assumption that AI agent swarms benefit from the "Wisdom of the Crowd." The research demonstrates that these swarms can prioritize internal architectural ag…

  19. FRONTIER RELEASE · CL_17253

    Mistral’s Model Lets You Vibe Long-Running Code in the Cloud

    Mistral AI has released Mistral Medium 3.5, a new 128 billion parameter model designed for extended coding tasks with a 256K context window. This model powers new remote coding agents within Mistral's Vibe platform, ena…

  20. RESEARCH · CL_07393

    Qwen 3.6 Plus outperforms DeepSeek V4 Pro in price and quality benchmarks

    A recent battle test of six April-released Large Language Models (LLMs) revealed that the Qwen 3.6 Plus, released 22 days prior, outperformed the newer DeepSeek V4 Pro. Despite DeepSeek V4 Pro's advanced reasoning archi…