Researchers have introduced LaRA, a novel framework designed to detect data contamination in large language models that have undergone reinforcement learning (RL) post-training. Unlike existing methods that rely on output-level signals, LaRA analyzes internal representations layer by layer. It employs three metrics—perturbation sensitivity, directional collapse, and local representation rigidity—to identify geometric deviations indicative of contamination. Experiments demonstrate that LaRA's protocol is more effective than traditional output-level baselines in identifying contamination in RL-trained reasoning models. AI
IMPACT Introduces a new method for ensuring the reliability and generalization of RL-trained LLMs by detecting data contamination.
RANK_REASON The cluster contains an academic paper detailing a new research methodology for detecting data contamination in LLMs.
Read on Hugging Face Daily Papers →
AI-generated summary · Google Gemini · from 2 sources. How we write summaries →