Brief · PulseAugur

RESEARCH · arXiv cs.CL · 4d · [2 sources]

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

Researchers have identified a shared latent mechanism across a range of backdoor attacks on large language models, challenging the view that such attacks are isolated trigger-response failures. Applying sparse autoencoders to model activations, they found a small set of features that activate consistently across different attack types, including jailbreaking and bias induction. These features were shown to be causal and to transfer across models such as Qwen3, Gemma 3, and Llama 3.1. Building on this, the work presents a mitigation technique called Concept Ablation Fine-Tuning (CAFT), which suppresses backdoor formation by ablating the shared subspace.
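To make "ablating a subspace" concrete, here is a minimal sketch of the projection step only. It assumes, purely for illustration, that the shared features are available as direction vectors in activation space (for example, sparse autoencoder decoder directions); the brief does not specify this. The names `orthonormalize`, `ablate`, and `feature_directions` are hypothetical and are not taken from the paper or any library. The fine-tuning procedure itself is not described in the brief and is not shown here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, eps=1e-10):
    """Gram-Schmidt: build an orthonormal basis for span(directions)."""
    basis = []
    for d in directions:
        r = list(d)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:  # skip directions already covered by the basis
            basis.append([ri / norm for ri in r])
    return basis


def ablate(activation, basis):
    """Remove the component of `activation` lying in span(basis).

    Assumes `basis` is orthonormal, so the result is h - Q Q^T h.
    """
    out = list(activation)
    for q in basis:
        c = dot(activation, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


# Hypothetical feature directions in a 4-dimensional activation space.
feature_directions = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormalize(feature_directions)

activation = [2.0, -1.0, 0.5, 3.0]
ablated = ablate(activation, basis)
```

After ablation, the activation has no component along either feature direction, while everything orthogonal to the subspace is left unchanged. In a real model the same projection would be applied to hidden states at a chosen layer, which is an assumption about how such a method might be wired up, not a detail from the brief.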

IMPACT Identifies a unified approach to detecting and mitigating various LLM backdoor attacks, potentially improving model security.

Llama 3.1
LLMs
Qwen3
Gemma 3
Omar Mohamed Ahmed Mahmoud