PulseAugur

New diagnostic tool probes LLM circuits for safety and behavior insights

A new research paper introduces "Perturbation Probing," a diagnostic method for understanding the internal workings of large language models. The technique uses two forward passes per prompt and no backpropagation to generate causal hypotheses about "behavioral circuits" in the models' feed-forward networks (FFNs), then tests them with a one-time intervention sweep of about 150 passes. The study reports two main circuit structures: opposition circuits, which emerge when reinforcement learning from human feedback (RLHF) modifies pre-training tendencies, and routing circuits, which are associated with pre-training behaviors distributed through attention mechanisms. The research shows how these circuits can be manipulated to alter model behavior, such as controlling safety refusals or switching the output language, and highlights differences in circuit topology across model families such as Qwen and Gemma.
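The summary does not spell out the paper's scoring rule, so the following is only a toy sketch of the general idea: compare one clean and one perturbed forward pass through an FFN, rank neurons by how much their contribution to a chosen output direction changes, then confirm the top candidate with an ablation. Everything here is an assumption made for illustration, including the hand-built weights, the input perturbation, the `readout` direction, and the function names `ffn_forward` and `rank_neurons`. It is not the authors' implementation.

```python
import numpy as np

def ffn_forward(x, w_in, w_out, scale=None):
    """Toy FFN: ReLU(x @ w_in) @ w_out. `scale` optionally rescales hidden neurons."""
    hidden = np.maximum(x @ w_in, 0.0)
    if scale is not None:
        hidden = hidden * scale  # intervention hook: 0 ablates a neuron
    return hidden, hidden @ w_out

def rank_neurons(x_clean, x_perturbed, w_in, w_out, readout):
    """Hypothetical scoring rule using exactly two forward passes, no gradients."""
    h_clean, _ = ffn_forward(x_clean, w_in, w_out)
    h_pert, _ = ffn_forward(x_perturbed, w_in, w_out)
    # Activation change times each neuron's write weight along `readout`.
    scores = (h_pert - h_clean) * (w_out @ readout)
    return np.argsort(-np.abs(scores)), scores

# Hand-built weights (assumed): neuron 3 reads input dim 0 and writes output dim 2.
w_in = np.zeros((4, 6))
w_in[1, 0] = 1.0
w_in[2, 1] = 1.0
w_in[0, 3] = 1.0
w_out = np.zeros((6, 4))
w_out[0, 2] = 0.5
w_out[1, 3] = 1.0
w_out[3, 2] = 1.0

readout = np.array([0.0, 0.0, 1.0, 0.0])      # output direction standing in for a behavior
x_clean = np.array([0.0, 1.0, 1.0, 1.0])
x_perturbed = x_clean + np.array([1.0, 0.0, 0.0, 0.0])

order, scores = rank_neurons(x_clean, x_perturbed, w_in, w_out, readout)
top = int(order[0])

# Intervention check: ablate the top-ranked neuron and measure the readout change.
scale = np.ones(6)
scale[top] = 0.0
_, y_base = ffn_forward(x_perturbed, w_in, w_out)
_, y_ablated = ffn_forward(x_perturbed, w_in, w_out, scale)
effect = float((y_base - y_ablated) @ readout)
print(top, effect)
```

In this toy setup the ranking step only proposes a candidate neuron, and the ablation is what supports a causal reading, which mirrors the summary's split between hypothesis generation and an intervention sweep.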

IMPACT Provides a new toolkit for understanding and editing LLM behavior at a mechanistic level.

RANK_REASON Academic paper detailing a new diagnostic method for LLMs.


AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.CL TIER_1 English(EN) · Hongliang Liu, Tung-Ling Li, Yuhao Wu

    Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    arXiv:2604.27401v1. Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amorti…

  2. arXiv cs.CL TIER_1 English(EN) · Yuhao Wu

    Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight …

  3. Hugging Face Daily Papers TIER_1 English(EN)

    Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight …