Brief · PulseAugur

TOOL · r/MachineLearning English(EN) · 2h

LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]

A new benchmark, CVE-Bench, evaluates how well LLM agents patch security vulnerabilities in Python projects. Across 18 projects and 20 real-world CVEs, the best-performing models fully patched the vulnerability only 50% of the time. Notably, even when a model's patch appeared to fix the bug and passed the regression tests, the vulnerability often remained open. This failure mode is dangerous because, without hidden security tests, such a patch is indistinguishable from a correct one.
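To make the failure mode concrete, here is a minimal sketch of a patch that passes regression tests while the vulnerability stays open. It is not taken from CVE-Bench: the path-traversal bug, the `BASE_DIR` value, and every function name below are hypothetical, chosen only to illustrate the idea.

```python
import posixpath

BASE_DIR = "/srv/files"  # hypothetical sandbox root, not from the benchmark


def is_inside_base(path):
    """True if an already-normalized path stays within BASE_DIR."""
    return path == BASE_DIR or path.startswith(BASE_DIR + "/")


def resolve_incomplete(user_path):
    # Hypothetical incomplete patch: strips "../" once, so the payload
    # "....//secret" collapses back into "../secret" after the replace.
    cleaned = user_path.replace("../", "")
    return posixpath.normpath(posixpath.join(BASE_DIR, cleaned))


def resolve_strict(user_path):
    # Hypothetical complete patch: normalize first, then check containment.
    resolved = posixpath.normpath(posixpath.join(BASE_DIR, user_path))
    if not is_inside_base(resolved):
        raise ValueError("path escapes base directory")
    return resolved


def regression_suite(resolve):
    # Functional tests only: ordinary inputs must keep working.
    return (resolve("report.txt") == "/srv/files/report.txt"
            and resolve("a/b.txt") == "/srv/files/a/b.txt")


def security_suite(resolve):
    # "Hidden" tests: exploit payloads must never resolve outside BASE_DIR.
    for payload in ("../secret", "....//secret"):
        try:
            if not is_inside_base(resolve(payload)):
                return False
        except ValueError:
            pass  # rejecting the payload outright counts as safe
    return True


for name, fn in (("incomplete", resolve_incomplete), ("strict", resolve_strict)):
    print(name, "regression:", regression_suite(fn), "security:", security_suite(fn))
```

Both patches pass the regression suite, and only the security suite tells them apart, which is the gap the brief describes.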

IMPACT LLM agents show significant limitations in reliably patching security vulnerabilities, indicating a need for more robust testing and development before deployment in security-critical applications.

GPT-5.5
GPT-5.4-nano
GPT-5.4-mini
Laguna-m.1
Laguna-xs.2
CVE-Bench