Automated pipeline generates LLM code security benchmarks

By PulseAugur Editorial · [1 source] · 2026-05-22 04:00

Researchers have developed AutoBaxBuilder, an automated pipeline that generates code security benchmarks for large language models. The pipeline uses LLMs to write both the functional tests and the security exploits, cutting the manual effort and cost that benchmark creation typically requires, reportedly by a factor of 12. The resulting benchmark, AutoBaxBench, has been released publicly and evaluated on current LLMs.

IMPACT Automates the creation of security benchmarks for LLM-generated code, enabling more rigorous testing and faster iteration.

RANK_REASON The cluster contains an academic paper detailing a new method for generating code security benchmarks.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.LG TIER_1 English(EN) · Tobias von Arx, Niels Mündler, Mark Vero, Maximilian Baader, Martin Vechev · 2026-05-22 04:00

AutoBaxBuilder: Bootstrapping Code Security Benchmarking

arXiv:2512.21132v2 · Abstract: As large language models (LLMs) see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work showed that LLMs are prone to generat…
