LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]
A new benchmark, CVE-Bench, was developed to evaluate LLM agents' ability to patch security vulnerabilities in Python projects. Across 18 projects and 20 real-world CVEs, the best performing models achieved only a 50% success rate in fully patching vulnerabilities. Notably, even when models appeared to fix a bug and pass regression tests, the vulnerability often remained, highlighting a dangerous failure mode where the fix is indistinguishable from a correct one without hidden security tests. AI
IMPACT LLM agents show significant limitations in reliably patching security vulnerabilities, indicating a need for more robust testing and development before deployment in security-critical applications.