Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?
A new benchmark, Snyk VulnBench JS 1.0, evaluates the repeatability of large language model (LLM) security reviews. It found that LLM findings can vary significantly between runs, while reference-matched findings are more stable. The research suggests that combining agentic LLM security review with deterministic static application security testing (SAST) tools such as Snyk Code is more robust than relying on either method alone.
AI IMPACT: Highlights the need for hybrid approaches in AI-assisted code security, combining LLMs with traditional SAST tools for improved reliability.