PulseAugur

New dataset aids LLM analysis of software vulnerabilities

Researchers have introduced ASSEMBLAGE-DEEPHISTORY, a dataset for analyzing software vulnerabilities across build configurations and historical versions. It contains over 73,000 binaries from 248 open-source projects, built across various compilers and operating systems, and includes metadata linking each binary to its source code, vulnerable functions, and package versions. The authors demonstrate the dataset's utility with three analyses: an LLM benchmark for vulnerability detection, an embedding comparison for clustering, and a regression analysis of binary similarity.

IMPACT Provides a new resource for training and evaluating AI models that identify software vulnerabilities across diverse build environments.

RANK_REASON The cluster contains an academic paper detailing a new dataset and benchmark for software vulnerability analysis.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.LG TIER_1 English(EN) · Chang Liu, Noah Fleischmann, Nicolò Altamura, Edward Raff, James Holt, Kristopher Micinski

    ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage

    arXiv:2605.21615v1 Announce Type: cross Abstract: Existing binary corpora typically capture only one or two axes of binary variation: they either provide cross-compiler builds without a temporal axis, or CVE labels for single-build binaries. None combine cross-build diversity, cr…