A new study published on arXiv evaluates the real-world effectiveness of deep learning models and large language models for detecting vulnerabilities in code. The research found that current models, including prominent LLMs like Claude 3.5 Sonnet, GPT-4o, and GPT-5, struggle to generalize from benchmark datasets to real-world scenarios. When tested on a newly constructed dataset of recently fixed Linux kernel vulnerabilities, model performance dropped significantly, highlighting a gap between academic evaluations and practical application.
AI IMPACT Current LLMs show poor generalization for code vulnerability detection, indicating a need for more robust models and datasets for real-world security applications.
AI-generated summary · Google Gemini · from 1 source.