Researchers have developed a new method that combines metamorphic testing with negative log-likelihood analysis to diagnose data leakage in large language models used for program repair. By creating variant benchmarks through semantics-preserving transformations, they observed significant drops in repair success rates across several LLMs, including GPT-4o and Llama-3.1. The study found a strong correlation between performance degradation on the transformed benchmarks and the models' likelihood of having memorized the original data, suggesting this combined approach offers a more reliable way to detect, and potentially mitigate, data leakage in LLM evaluations.
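To illustrate the idea of a semantics-preserving transformation, here is a minimal sketch in Python. It is not the paper's implementation; it simply renames identifiers in a snippet while keeping its behavior unchanged, which is the kind of variant a metamorphic benchmark could use. The naive renaming below touches every name, so it is only safe for self-contained code with no calls to external functions.

```python
import ast

class RenameVariables(ast.NodeTransformer):
    """Rename identifiers to fresh names, preserving program semantics.

    Naive: renames every Name and argument node, so it is only safe for
    snippets that reference no builtins or external functions.
    """
    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        # Reuse the same fresh name for repeated occurrences of an identifier.
        if name not in self.mapping:
            self.mapping[name] = f"var_{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        node.id = self._fresh(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

def transform(source: str) -> str:
    """Produce a semantics-preserving variant of `source` (Python 3.9+)."""
    tree = ast.parse(source)
    tree = RenameVariables().visit(tree)
    return ast.unparse(tree)

original = "def add(x, y):\n    total = x + y\n    return total\n"
variant = transform(original)
print(variant)
```

A leakage probe in the paper's spirit would then compare a model's repair success rate (or its negative log-likelihood) on `original` versus `variant`: a model that truly understands the code should behave similarly on both, while a model that memorized the benchmark degrades on the variant.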
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a more robust evaluation method for LLMs in software engineering, potentially leading to more reliable performance metrics.
RANK_REASON Academic paper introducing a new methodology for evaluating LLMs.