Entity: SWE-bench

SWE-bench

PulseAugur coverage of SWE-bench — every cluster mentioning SWE-bench across labs, papers, and developer communities, ranked by signal.

Total · 30 days

28

28 in the last 90 days

Releases · 30 days

0

0 in the last 90 days

Papers · 30 days

19

19 in the last 90 days

Tier distribution · 90 days

frontier release 3
significant 2
research 6
tool 12
commentary 5

Relationships

Sentiment · 30 days

6 days with sentiment data

Recent · Page 2 of 2 · 28 items total

RESEARCH · CL_06668 · Apr 28 · 04:00

AgentEval framework improves AI agent workflow evaluation with DAG-based error tracking

Researchers have developed AgentEval, a new framework for evaluating agentic workflows by representing them as directed acyclic graphs (DAGs). This approach allows for detailed step-level assessment and tracking of erro…
RESEARCH · CL_04040 · Apr 26 · 09:53

SWE-bench tests AI agents' real-world capability, showing 80% resolution rate

Evaluating the real-world performance of AI agents is becoming critical as they transition from experimental stages to production environments. Traditional metrics like perplexity scores are insufficient for assessing a…
RESEARCH · CL_17452 · Apr 17 · 14:09

Public AI models replicate Anthropic's vulnerability discovery findings

Researchers have successfully replicated Anthropic's Mythos findings using publicly available AI models like GPT-5.4 and Claude Opus 4.6. This suggests that advanced AI capabilities for discovering software vulnerabilit…
COMMENTARY · CL_01323 · Sep 9 · 17:28

How good are LLMs at fixing their mistakes? A chatbot arena experiment with Keras and TPUs

Current methods for evaluating large language models, such as MMLU and HumanEval, may be insufficient as they do not capture the nuances of interactive, goal-oriented conversations. A more effective approach would invol…
FRONTIER RELEASE · CL_00841 · Aug 22 · 14:57

Cosine Genie leverages GPT-4o fine-tuning to become top coding agent

Cosine has launched Genie, a coding agent that has achieved the top ranking on the SWE-Bench benchmark, surpassing previous leaders by a significant margin. This success is attributed to fine-tuning OpenAI's GPT-4o mode…
RESEARCH · CL_00777 · Aug 13 · 10:00

OpenAI abandons SWE-bench Verified due to flawed tests and data contamination

OpenAI has announced it will no longer use SWE-bench Verified to evaluate the coding capabilities of frontier AI models. The benchmark has become contaminated, with models showing improved scores primarily due to exposu…
FRONTIER RELEASE · CL_00230 · May 13 · 10:05

OpenAI releases GPT-4o with fine-tuning and enhanced multimodal capabilities

OpenAI has released fine-tuning capabilities for its GPT-4o model, allowing developers to customize its performance and tone for specific applications. This feature, available on paid tiers, offers developers the chance…
FRONTIER RELEASE · CL_02309 · Aug 22 · 07:00

Introducing gpt-realtime and Realtime API updates

OpenAI has released GPT-4.1, a new series of models for its API that offer significant improvements in coding, instruction following, and long context comprehension, outperforming previous models like GPT-4o. The compan…