Researchers have identified a "Perplexity Paradox" in Large Language Models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR). The paradox arises when a model's performance improves even though it is trained on spurious or incorrect rewards, which indicates that the model has shifted from reasoning to memorization. The study traces this shortcut to a specific "Anchor-Adapter" circuit made up of functional anchors in the middle layers and structural adapters in the later layers. It also shows that scaling specific MLP keys within this circuit causally steers the model's behavior, which offers a way to identify and mitigate data contamination in RLVR-tuned models.
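The idea of scaling MLP keys can be illustrated with a toy example. The sketch below is not the paper's procedure, which this summary does not describe. It assumes the common key-value view of an MLP block, where each hidden neuron has a key (a row of the input projection) and a value (a row of the output projection), and it uses a plain ReLU instead of the gated activation found in models such as Qwen 2.5. The names `mlp_forward` and `scale_keys` and all weights are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def relu(z: float) -> float:
    return z if z > 0.0 else 0.0


def mlp_forward(x: Sequence[float], keys: Matrix, values: Matrix) -> Vector:
    """Key-value view of an MLP block: out = sum_i relu(k_i . x) * v_i."""
    out = [0.0] * len(values[0])
    for key, value in zip(keys, values):
        # How strongly this neuron's key matches the input.
        activation = relu(sum(k * xi for k, xi in zip(key, x)))
        # Add the neuron's value vector, weighted by that match.
        for j, vj in enumerate(value):
            out[j] += activation * vj
    return out


def scale_keys(keys: Matrix, neuron_ids: Sequence[int], alpha: float) -> Matrix:
    """Return a copy of `keys` with the selected rows multiplied by alpha."""
    chosen = set(neuron_ids)
    return [
        [alpha * k for k in row] if i in chosen else list(row)
        for i, row in enumerate(keys)
    ]


# Toy weights, made up for illustration: 2 hidden neurons, 2-d input and output.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
x = [2.0, 3.0]

baseline = mlp_forward(x, keys, values)
amplified = mlp_forward(x, scale_keys(keys, [0], 2.0), values)
suppressed = mlp_forward(x, scale_keys(keys, [0], 0.0), values)

print(baseline, amplified, suppressed)
```

In this toy setting, a scaling factor above 1 amplifies the contribution of the selected neuron and a factor of 0 removes it, while the other neuron is unaffected. That is the sense in which such an intervention can steer a layer's output up or down.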
IMPACT Provides a mechanistic understanding of how LLMs can be steered towards memorization over reasoning during RLVR training, potentially impacting future model alignment and safety research.
RANK_REASON The cluster contains an academic paper detailing a new mechanistic understanding of LLM behavior.
- Anchor-Adapter circuit
- LLMs
- Perplexity Paradox
- Qwen 2.5
- Reinforcement Learning with Verifiable Rewards (RLVR)
AI-generated summary · Google Gemini · from 1 source.