Researchers have developed a new method that combines metamorphic testing with negative log-likelihood analysis to diagnose data leakage in large language models used for program repair. By creating variant benchmarks through semantics-preserving transformations, they observed significant drops in repair success rates across several LLMs, including GPT-4o and Llama-3.1. The study found a strong correlation between performance degradation on the transformed benchmarks and the models' likelihood of having memorized the original data, suggesting this combined approach offers a more reliable way to detect, and potentially mitigate, data leakage in LLM evaluations.
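To illustrate the idea of a semantics-preserving transformation, here is a minimal sketch in Python. It is not the paper's implementation; it simply renames identifiers in a snippet while keeping its behavior unchanged, which is the kind of variant a metamorphic benchmark could use. The naive renaming below touches every name, so it is only safe for self-contained code with no calls to external functions.

```python
import ast

class RenameVariables(ast.NodeTransformer):
    """Rename identifiers to fresh names, preserving program semantics.

    Naive: renames every Name and argument node, so it is only safe for
    snippets that reference no builtins or external functions.
    """
    def __init__(self):
        self.mapping = {}

    def _fresh(self, name):
        # Reuse the same fresh name for repeated occurrences of an identifier.
        if name not in self.mapping:
            self.mapping[name] = f"var_{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        node.id = self._fresh(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._fresh(node.arg)
        return node

def transform(source: str) -> str:
    """Produce a semantics-preserving variant of `source` (Python 3.9+)."""
    tree = ast.parse(source)
    tree = RenameVariables().visit(tree)
    return ast.unparse(tree)

original = "def add(x, y):\n    total = x + y\n    return total\n"
variant = transform(original)
print(variant)
```

A leakage probe in the paper's spirit would then compare a model's repair success rate (or its negative log-likelihood) on `original` versus `variant`: a model that truly understands the code should behave similarly on both, while a model that memorized the benchmark degrades on the variant.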
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a more robust evaluation method for LLMs in software engineering, potentially leading to more reliable performance metrics.
RANK_REASON Academic paper introducing a new methodology for evaluating LLMs.