Researchers have proposed a method that combines metamorphic testing with negative log-likelihood (NLL) to diagnose data leakage in large language models used for automated program repair. They built variant benchmarks by applying semantics-preserving transformations to existing ones and observed significant drops in repair success rates across several LLMs, including GPT-4o and Llama-3.1. The study reports a strong correlation between the performance degradation on the transformed benchmarks and the models' likelihood of having memorized the original data, which suggests that the combined approach offers a more reliable way to detect, and potentially mitigate, data leakage in LLM evaluations.
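To make the two ingredients concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say which transformations or programming language the study used, so identifier renaming on a Python snippet stands in as one example of a semantics-preserving transformation. The names `make_variant` and `mean_nll` are hypothetical, and the token log-probabilities are assumed to come from whatever model is being audited.

```python
import ast
import math


class _Renamer(ast.NodeTransformer):
    """Consistently renames variables and parameters (hypothetical helper).

    Assumption: the mapping avoids name collisions and does not touch
    builtins or globals, so program behavior is unchanged.
    """

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def make_variant(source, mapping):
    """Return a semantically equivalent variant of `source` with renamed identifiers."""
    tree = _Renamer(mapping).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


def mean_nll(token_logprobs):
    """Average negative log-likelihood over a sequence of token log-probabilities."""
    return -sum(token_logprobs) / len(token_logprobs)


# Toy benchmark item and its transformed variant.
original = "def add(a, b):\n    total = a + b\n    return total\n"
variant = make_variant(original, {"a": "x", "b": "y", "total": "acc"})
```

A model that has memorized `original` would be expected to assign it an unusually low mean NLL while performing worse on `variant`, even though both programs behave identically. That gap is the kind of signal the summary describes.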
Impact: Introduces a more robust evaluation method for LLMs in software engineering, potentially leading to more reliable performance metrics.
Ranking rationale: Academic paper introducing a new methodology for evaluating LLMs.
AI-generated summary · Google Gemini · from 1 source.