PulseAugur

New diagnostic tool probes LLM circuits for safety and behavior insights

A new research paper introduces "Perturbation Probing," a diagnostic method for understanding the internal workings of large language models. The technique uses two forward passes per prompt, with no backpropagation, to identify and analyze "behavioral circuits" in a model's feed-forward networks (FFNs). The study identifies two main circuit structures: opposition circuits, which emerge when reinforcement learning from human feedback (RLHF) modifies pre-training tendencies, and routing circuits, which carry pre-training behaviors distributed through attention mechanisms. The research demonstrates that these circuits can be manipulated to alter model behavior, for example controlling safety refusals or switching the output language, and shows that circuit topology varies across model families such as Qwen and Gemma.
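The two-pass idea described above can be sketched in a few lines: run the model on a prompt, run it again on a perturbed version, and rank FFN neurons by how much their activations shift. The following is a minimal illustrative sketch using simulated activations; the array shapes, planted neuron indices, and scoring rule are assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sketch only: real FFN activations would be captured with
# forward hooks inside a transformer; here they are simulated arrays.
rng = np.random.default_rng(0)
n_prompts, n_neurons = 10, 64

# Pass 1: FFN activations for the original prompts.
clean = rng.normal(size=(n_prompts, n_neurons))

# Pass 2: activations for a behaviorally perturbed version of each prompt.
# We plant neurons 3 and 17 as the ones driving the behavior change.
perturbed = clean + rng.normal(scale=0.1, size=clean.shape)
perturbed[:, [3, 17]] += 5.0

# Rank neurons by mean absolute activation shift across prompts. This
# yields causal *hypotheses* without any backpropagation.
scores = np.abs(perturbed - clean).mean(axis=0)
candidates = np.argsort(scores)[::-1][:5]
print(sorted(candidates[:2].tolist()))  # prints [3, 17]

# Intervention check: zero out a candidate neuron's activation and re-run
# the model to see whether the target behavior (e.g. a refusal) changes.
ablated = perturbed.copy()
ablated[:, candidates[0]] = 0.0
```

In the real method, the final ablation step would be a one-time sweep over all identified neurons (the abstract cites about 150 passes amortized across them) to confirm which hypotheses are genuinely causal.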

Summary written by gemini-2.5-flash-lite from 3 sources.

IMPACT Provides a new toolkit for understanding and editing LLM behavior at a mechanistic level.

RANK_REASON Academic paper detailing a new diagnostic method for LLMs.

Read on arXiv cs.CL →

COVERAGE [3]

  1. arXiv cs.CL TIER_1 · Hongliang Liu, Tung-Ling Li, Yuhao Wu

    Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    arXiv:2604.27401v1 Abstract: Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amorti…

  2. arXiv cs.CL TIER_1 · Yuhao Wu

    Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight …

  3. Hugging Face Daily Papers TIER_1

    Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    Perturbation probing generates task-specific causal hypotheses for FFN neurons in large language models using two forward passes per prompt and no backpropagation, followed by a one-time intervention sweep of about 150 passes amortized across all identified neurons. Across eight …