Brief · PulseAugur

TOOL · arXiv cs.AI · 6h

DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents

DualGauge is a new framework that automatically benchmarks both the security and the functionality of code that LLMs and coding agents generate from specifications alone. The accompanying DualGauge-Bench dataset contains 307 tasks, each with paired functional and security tests. Evaluations across 10 LLMs and 3 coding agents showed that even the best models struggle to produce code that is both secure and functionally correct, with failures often occurring at output contract boundaries or stemming from insufficient guards. Neither model scale, quantization, nor iterative scaffolding reliably improved performance, indicating that secure and correct code generation is not an emergent property of general coding capability.
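
To make "joint security-functionality success" concrete, here is a minimal sketch of how such a score could be computed from per-task outcomes. It assumes a task counts as a joint success only when its functional tests and its security tests both pass; the brief does not give DualGauge's exact metric definition, and the names `TaskResult` and `pass_rates` are hypothetical, not part of the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure, for illustration)."""
    task_id: str
    functional_pass: bool  # all functional tests for the task passed
    security_pass: bool    # all security tests for the task passed


def pass_rates(results):
    """Return functional, security, and joint pass rates as fractions of tasks.

    Assumption: a task is a joint success only if both test suites pass.
    """
    total = len(results)
    if total == 0:
        raise ValueError("at least one task result is required")
    functional = sum(r.functional_pass for r in results) / total
    security = sum(r.security_pass for r in results) / total
    joint = sum(r.functional_pass and r.security_pass for r in results) / total
    return {"functional": functional, "security": security, "joint": joint}


# Made-up outcomes for four tasks; these are not results from the paper.
example = [
    TaskResult("t1", functional_pass=True, security_pass=True),
    TaskResult("t2", functional_pass=True, security_pass=False),
    TaskResult("t3", functional_pass=False, security_pass=True),
    TaskResult("t4", functional_pass=True, security_pass=True),
]
rates = pass_rates(example)
```

In this made-up example the functional and security rates are each 0.75 while the joint rate is 0.5, which shows how a model can look strong on either axis alone and still fall short on both together.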

IMPACT Reveals significant security and functionality gaps in LLM-generated code, suggesting current models are unreliable for security-critical applications.