A new research paper introduces "Perturbation Probing," a diagnostic method for understanding the internal workings of large language models. The technique uses two forward passes per prompt to identify and analyze "behavioral circuits" within the models' feed-forward networks (FFNs). The study found two main circuit structures: opposition circuits, which emerge when reinforcement learning from human feedback (RLHF) modifies pre-training tendencies, and routing circuits, which are associated with pre-training behaviors distributed through attention mechanisms. The research demonstrates how these circuits can be manipulated to alter model behavior, such as controlling safety refusals or switching language output, and highlights variations in circuit topology across model architectures like Qwen and Gemma.
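The summary does not specify the exact perturbation used, but the "two forward passes per prompt" structure resembles standard activation ablation: run one clean pass, then a second pass with an FFN unit perturbed, and score the unit by how much the output moves. A minimal numpy sketch on a toy stand-in for a single FFN block (all weights and shapes here are hypothetical, not taken from the paper):

```python
import numpy as np

# Toy stand-in for one transformer FFN block; the real method targets
# FFN layers inside an LLM, which this sketch does not reproduce.
rng = np.random.default_rng(0)
W_in = rng.normal(size=(8, 32))   # hypothetical up-projection
W_out = rng.normal(size=(32, 8))  # hypothetical down-projection

def ffn(x, ablate=None):
    h = np.maximum(x @ W_in, 0.0)  # hidden FFN activations (ReLU)
    if ablate is not None:
        h = h.copy()
        h[ablate] = 0.0            # perturbation: zero one hidden unit
    return h @ W_out

x = rng.normal(size=8)
baseline = ffn(x)                  # pass 1: clean forward pass
# pass 2 (per candidate unit): perturbed forward pass, scored by output shift
effects = [np.abs(ffn(x, ablate=i) - baseline).sum() for i in range(32)]
top = int(np.argmax(effects))      # unit whose removal moves the output most
```

Units with outsized `effects` scores are candidate members of a behavioral circuit; a circuit-level probe would aggregate such scores over many prompts exhibiting the target behavior.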
Summary written by gemini-2.5-flash-lite from 3 sources.
IMPACT Provides a new toolkit for understanding and editing LLM behavior at a mechanistic level.
RANK_REASON Academic paper detailing a new diagnostic method for LLMs.