PulseAugur

LLMs show significant performance drops on transformed benchmarks, indicating memorization

Researchers have developed a method that combines metamorphic testing with negative log-likelihood (NLL) to diagnose data leakage in large language models used for program repair. They built variant benchmarks by applying semantics-preserving transformations to existing ones and observed significant drops in repair success rates across several LLMs, including GPT-4o and Llama-3.1. The study found a strong correlation between the performance degradation on the transformed benchmarks and the likelihood that a model had memorized the original data, suggesting that the combined approach offers a more reliable way to detect, and potentially mitigate, data leakage in LLM evaluations.
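To make the idea concrete, here is a minimal sketch of the two ingredients the summary names: a semantics-preserving transformation and an NLL score. It is an illustration only, not the paper's implementation. The summary does not say which transformations, benchmark languages, or NLL computation the authors used, so identifier renaming on a Python snippet, the helper names (`RenameVariables`, `transform`, `mean_nll`, `success_rate_drop`), and the per-token log-probability input are all assumptions.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Hypothetical transformation: rename identifiers using a fixed mapping.

    Renaming variables consistently changes the surface form of the code
    but not its behaviour, which is the property a metamorphic variant needs.
    """

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        self.generic_visit(node)
        return node


def transform(source, mapping):
    """Return a renamed variant of `source` (requires Python 3.9+ for ast.unparse)."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


def mean_nll(token_logprobs):
    """Average negative log-likelihood over a sequence of token log-probabilities.

    The log-probabilities are assumed to come from whatever model is being
    audited; a lower value means the model finds the text more predictable.
    """
    return -sum(token_logprobs) / len(token_logprobs)


def success_rate_drop(original_fixed, variant_fixed, total):
    """Difference in repair success rate between original and transformed benchmarks."""
    return original_fixed / total - variant_fixed / total
```

Under these assumptions, a bug would be presented to the model once in its original form and once as `transform(source, mapping)`. A large `success_rate_drop` on the variants, paired with a low `mean_nll` on the original benchmark code, is the kind of signal the summary describes as pointing to memorization.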

Impact: Introduces a more robust evaluation method for LLMs in software engineering, potentially leading to more reliable performance metrics.

Ranking rationale: Academic paper introducing a new methodology for evaluating LLMs.


AI-generated summary · Google Gemini · from 1 source.


Sources [1]

  1. arXiv cs.AI · TIER_1 · English (EN) · Annibale Panichella

    A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

    LLM-based automated program repair (APR) techniques have shown promising results in reducing debugging costs. However, prior results can be affected by data leakage: large language models (LLMs) may memorize bug fixes when evaluation benchmarks overlap with their pretraining data…