SEC-bench Pro is a new benchmark that evaluates how well large language models (LLMs) handle complex, long-horizon software security tasks, such as finding vulnerabilities in real-world systems. It includes 183 validated vulnerabilities from V8 and SpiderMonkey, with a significant portion carrying substantial rewards from Google's Vulnerability Reward Program. Current frontier models succeed on fewer than 40% of these tasks, highlighting the limits of LLM-based bug hunting on intricate software security challenges.
IMPACT Highlights current LLM limitations in complex software security tasks, suggesting a need for improved agent capabilities in this domain.
RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating AI capabilities.
- ClaudeCode
- Codex
- Google Vulnerability Reward Program
- Kimi-K2.6
- Large language models
- SEC-bench Pro
- SpiderMonkey
- V8
AI-generated summary · Google Gemini · from 1 source.