LLMs struggle to detect real-world code vulnerabilities, study finds

By PulseAugur Editorial · [1 source] · 2026-07-03 04:00

A new study posted to arXiv evaluates how well deep learning models and large language models detect vulnerabilities in real-world code. It finds that current models, including prominent LLMs such as Claude 3.5 Sonnet, GPT-4o, and GPT-5, struggle to generalize from benchmark datasets to real-world scenarios. When tested on a newly constructed dataset of recently fixed Linux kernel vulnerabilities, model performance dropped significantly, highlighting a gap between academic evaluations and practical application.

IMPACT Current LLMs show poor generalization for code vulnerability detection, indicating a need for more robust models and datasets for real-world security applications.

RANK_REASON The cluster contains a research paper evaluating existing models on a new dataset.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.LG TIER_1 English(EN) · Chaomeng Lu, Bert Lagaisse · 2026-07-03 04:00

From Lab to Reality: A Practical Evaluation of Deep Learning Models and LLMs for Vulnerability Detection

arXiv:2512.10485v2 · Abstract: Vulnerability detection methods based on deep learning (DL) have shown strong performance on benchmark datasets, yet their real-world effectiveness remains underexplored. Recent work suggests that both graph neural network…
