New dataset aids LLM analysis of software vulnerabilities

By PulseAugur Editorial · [1 source] · 2026-05-22 04:00

Researchers have introduced ASSEMBLAGE-DEEPHISTORY, a dataset for analyzing software vulnerabilities across build configurations and historical versions. It contains over 73,000 binaries from 248 open-source projects, compiled with various compilers and operating systems, and its metadata links each binary to its source code, vulnerable functions, and package version. The authors demonstrate the dataset's utility with three analyses: an LLM benchmark for vulnerability detection, an embedding comparison for clustering, and a regression analysis of binary similarity.
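To make the metadata linkage concrete, here is a minimal Python sketch of how records of this kind could be grouped by release and filtered for labeled vulnerabilities. The `BinaryRecord` class, its field names, and the sample values (`libfoo`, `parse_header`) are hypothetical and chosen only for illustration; they are not the dataset's actual schema or contents.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryRecord:
    """Hypothetical metadata row; field names are illustrative assumptions."""
    project: str
    version: str
    compiler: str
    os: str
    vulnerable_functions: tuple = ()


def group_by_release(records):
    """Group builds of the same project version so cross-build variants sit together."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.project, rec.version)].append(rec)
    return dict(groups)


def vulnerable_releases(records):
    """Return sorted (project, version) keys where any build has a labeled vulnerable function."""
    return sorted(
        key
        for key, builds in group_by_release(records).items()
        if any(b.vulnerable_functions for b in builds)
    )


# Made-up sample rows: two builds of one release plus a later release with no label.
records = [
    BinaryRecord("libfoo", "1.0", "gcc", "linux", ("parse_header",)),
    BinaryRecord("libfoo", "1.0", "clang", "linux", ("parse_header",)),
    BinaryRecord("libfoo", "1.1", "gcc", "linux"),
]
```

Keying on (project, version) keeps the temporal axis (versions) separate from the build axis (compiler, operating system), which is the combination the summary describes.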

IMPACT Provides a new resource for training and evaluating AI models in identifying software vulnerabilities across diverse build environments.

RANK_REASON The cluster contains an academic paper detailing a new dataset and benchmark for software vulnerability analysis.

paper
safety

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.LG TIER_1 English(EN) · Chang Liu, Noah Fleischmann, Nicolò Altamura, Edward Raff, James Holt, Kristopher Micinski · 2026-05-22 04:00

ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage

arXiv:2605.21615v1 Announce Type: cross Abstract: Existing binary corpora typically capture only one or two axes of binary variation: they either provide cross-compiler builds without a temporal axis, or CVE labels for single-build binaries. None combine cross-build diversity, cr…
