PulseAugur

LLMs show significant performance drops on transformed benchmarks, indicating memorization

Researchers have developed a method that combines metamorphic testing with negative log-likelihood analysis to diagnose data leakage in large language models used for program repair. By generating variant benchmarks through semantics-preserving transformations, they observed significant drops in repair success rates across several LLMs, including GPT-4o and Llama-3.1. Performance degradation on the transformed benchmarks correlated strongly with the models' likelihood of having memorized the original data, suggesting the combined approach offers a more reliable way to detect, and potentially mitigate, data leakage in LLM evaluations.
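The core idea of a semantics-preserving transformation can be sketched in a few lines. The example below is a toy illustration, not the paper's actual transformation set: it renames a local variable with Python's `ast` module, producing a textually different but behaviorally identical program. A memorization probe would then compare a model's repair success rate and negative log-likelihood on the original versus the variant; a large gap suggests the original was memorized.

```python
import ast


class RenameVars(ast.NodeTransformer):
    """Rename local variables per a mapping (a semantics-preserving
    transformation; the mapping here is purely illustrative)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Handles both loads and stores of the variable name.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node


def make_variant(src: str, mapping: dict) -> str:
    """Parse source, apply the rename, and unparse back to code."""
    tree = ast.parse(src)
    tree = RenameVars(mapping).visit(tree)
    return ast.unparse(tree)


original = "def add(a, b):\n    total = a + b\n    return total\n"
variant = make_variant(original, {"total": "acc"})
```

Both versions compute the same function, so any drop in a model's repair performance on the variant cannot be attributed to changed semantics.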

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Introduces a more robust evaluation method for LLMs in software engineering, potentially leading to more reliable performance metrics.

RANK_REASON Academic paper introducing a new methodology for evaluating LLMs.

Read on arXiv cs.AI →


COVERAGE [1]

  1. arXiv cs.AI TIER_1 · Annibale Panichella

    A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

    LLM-based automated program repair (APR) techniques have shown promising results in reducing debugging costs. However, prior results can be affected by data leakage: large language models (LLMs) may memorize bug fixes when evaluation benchmarks overlap with their pretraining data…