PulseAugur

LLMs Show Uneven Repeatability in Security Audits, Benchmark Reveals

A new benchmark, Snyk VulnBench JS 1.0, measures how repeatable large language model (LLM) security reviews are when the same JavaScript code, prompt, and benchmark harness are scanned repeatedly. Across 300 repeated vulnerability-finding scans, LLM findings varied significantly between runs, while reference-matched findings were more stable. The research suggests that combining agentic LLM security review with deterministic static application security testing (SAST) tools such as Snyk Code is more robust than relying on either method alone.

IMPACT Highlights the need for hybrid approaches to AI-assisted code security that pair LLM review with traditional SAST tools for more reliable results.

RANK_REASON The cluster contains an academic paper detailing a new benchmark and its findings.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Liran Tal, Johannes Kloos, Arsenii Rudich, Stephen Thoemmes, Manoj Nair

    Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?

    arXiv:2606.15762v1 · Announce Type: cross · Abstract: We ran 300 repeated vulnerability-finding scans to measure how repeatable agentic large language model (LLM) security review is on the same JavaScript code, prompt, and benchmark harness. The headline result is that LLM security f…
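
One way to make "repeatability" concrete is to compare the sets of findings produced by repeated scans of the same code. The sketch below is only an illustration, not the paper's method: the choice of mean pairwise Jaccard similarity as the metric, the function names, and all sample findings and the reference set are assumptions made up for this example. It shows how run-to-run agreement can look modest over all findings and much higher once findings are restricted to those matching a reference set.

```python
from itertools import combinations

Finding = tuple[str, str]  # (file, vulnerability class); hypothetical shape


def jaccard(a: set[Finding], b: set[Finding]) -> float:
    """Jaccard similarity of two finding sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs: list[set[Finding]]) -> float:
    """Average Jaccard similarity over every unordered pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure repeatability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical output of three scans of the same code with the same prompt.
runs = [
    {("app.js", "xss"), ("db.js", "sqli"), ("util.js", "path-traversal")},
    {("app.js", "xss"), ("db.js", "sqli")},
    {("app.js", "xss"), ("db.js", "sqli"), ("auth.js", "weak-hash")},
]

# Hypothetical reference (known) vulnerabilities for the scanned code.
reference = {("app.js", "xss"), ("db.js", "sqli")}

# Agreement over everything each run reported.
all_findings_score = mean_pairwise_jaccard(runs)

# Agreement over only the findings that match the reference set.
matched_score = mean_pairwise_jaccard([run & reference for run in runs])

print(f"all findings:      {all_findings_score:.3f}")
print(f"reference-matched: {matched_score:.3f}")
```

With this made-up data the unmatched extras differ between runs and pull the all-findings score down, while every run recovers the same reference vulnerabilities. Real benchmark results will depend on how findings are normalized and matched, which this sketch does not attempt to model.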