Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

Researchers Find a Shared Latent Mechanism Behind Backdoor Attacks on Large Language Models

By the PulseAugur editorial team · [2 sources] · 2026-06-06 03:41

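Researchers have identified a shared latent mechanism across diverse backdoor attacks on large language models, challenging the view that these attacks are isolated trigger-response failures. Using sparse autoencoders on model activations, they found that a small set of features activates consistently across different attack types, including jailbreaks and bias induction. These features are shown to be causal and to transfer across models such as Qwen3, Gemma 3, and Llama 3.1. That finding motivated a new mitigation technique, Concept Ablation Fine-Tuning (CAFT), which suppresses backdoor formation by ablating this shared subspace.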
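The summary does not describe how the ablation is implemented. As a rough illustration only, the sketch below assumes that "ablating a shared subspace" means projecting an activation vector onto the orthogonal complement of a few feature directions (for example, sparse-autoencoder decoder vectors). The function names `orthonormalize` and `ablate`, and the toy vectors, are hypothetical and do not come from the paper; the sketch also covers only the projection step, not the fine-tuning loop that CAFT would wrap around it.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions: List[Vector], eps: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: turn possibly non-orthogonal directions into an
    orthonormal basis for the same subspace (hypothetical helper)."""
    basis: List[Vector] = []
    for d in directions:
        v = list(d)
        # Remove the part of v already covered by earlier basis vectors.
        for b in basis:
            c = dot(v, b)
            v = [x - c * y for x, y in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:  # skip directions that are linearly dependent
            basis.append([x / norm for x in v])
    return basis


def ablate(activation: Vector, basis: List[Vector]) -> Vector:
    """Project the activation onto the orthogonal complement of the
    subspace spanned by `basis` (assumed to be orthonormal)."""
    out = list(activation)
    for b in basis:
        c = dot(out, b)
        out = [x - c * y for x, y in zip(out, b)]
    return out


# Toy example: two assumed "backdoor" directions in a 3-dimensional space.
shared_basis = orthonormalize([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
cleaned = ablate([3.0, 4.0, 5.0], shared_basis)
print(cleaned)  # [0.0, 0.0, 5.0]
```

After the projection, the activation has no component along any of the assumed directions, while everything orthogonal to them is left unchanged. A real pipeline would apply this to residual-stream activations inside the model during fine-tuning rather than to hand-written lists.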

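Impact: Identifies a unified approach to detecting and mitigating diverse backdoor attacks on large language models, with the potential to improve model security.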

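Ranking rationale: This cluster contains an academic paper detailing new research findings and methodology on large language model security.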


AI-generated summary · Google Gemini · based on 2 sources.

Sources [2]

arXiv cs.AI · TIER_1 · English (EN) · Omar Mahmoud, Aly M. Kassem, Thommen George Karimpanal, Buddhika Laknath Semage, Negar Rostamzadeh, Golnoosh Farnadi, Santu Rana · 2026-06-09 04:00

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

arXiv:2606.07963v1 Announce Type: new Abstract: Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behav…
arXiv cs.CL · TIER_1 · English (EN) · Santu Rana · 2026-06-06 03:41

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behaviors, we identify a shared latent mechanism that…
