New benchmark reveals LLMs struggle with evolving software APIs

By PulseAugur Editorial · [1 source] · 2026-06-24 04:58

Researchers have introduced LibEvoBench, a benchmark that evaluates how well code generation models handle APIs that change across software library versions. Using the benchmark and a new metric, the Software Evolution Understanding Score (SEUS), the researchers find that current state-of-the-art models struggle with temporal knowledge: they perform poorly on evolving APIs and show no improvement when a target version is specified. Providing relevant documentation significantly improves accuracy, however, which points to a need for training approaches that incorporate temporally grounded knowledge.
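To make the idea of version-specific API knowledge concrete, here is a minimal Python sketch of a check for whether generated code only calls functions that exist in a target library version. It is an illustration only and is not LibEvoBench's method or the SEUS metric: the library name `toylib`, its functions, the `API_BY_VERSION` table, and both helper functions are hypothetical and invented for this example.

```python
import ast

# Hypothetical API surface of an imaginary library "toylib" across two releases.
# These names are invented for illustration; they do not come from LibEvoBench.
API_BY_VERSION = {
    "1.0": {"load_table", "append_rows"},
    "2.0": {"load_table", "concat_rows"},  # append_rows no longer exists in 2.0
}


def called_toylib_functions(source: str) -> set[str]:
    """Return the names of functions called as toylib.<name>(...) in source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "toylib"
        ):
            names.add(node.func.attr)
    return names


def unavailable_calls(source: str, version: str) -> set[str]:
    """Return the toylib calls in source that do not exist in the given version."""
    return called_toylib_functions(source) - API_BY_VERSION[version]


# Stand-in for model output; the code is only parsed, never executed.
generated = (
    "import toylib\n"
    "t = toylib.load_table('a.csv')\n"
    "t = toylib.append_rows(t, [1, 2])\n"
)

print(unavailable_calls(generated, "1.0"))  # set()
print(unavailable_calls(generated, "2.0"))  # {'append_rows'}
```

The same snippet is valid against version 1.0 but calls a removed function under 2.0, which is the kind of version-dependent correctness the summary describes.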

IMPACT Highlights a critical limitation in LLM code generation, potentially driving new research into temporally aware models.

RANK_REASON The cluster contains a new academic paper introducing a novel benchmark and metric for evaluating LLMs.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Arie van Deursen · 2026-06-24 04:58

LibEvoBench: Probing Temporal Knowledge Stratification in Code Generation Models

Large software projects often depend on older versions of libraries, even as APIs continue to evolve across releases. This creates a challenge for LLMs: they must maintain knowledge of multiple API versions, not merely the latest or most common one. However, current LLMs are trai…
