Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks

New Patcher method defends LLMs against malicious finetuning attacks

By the PulseAugur editorial team · [2 sources] · 2026-06-06 04:04

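Researchers have developed a new method, Patcher, to defend open-weight large language models against malicious finetuning attacks. These attacks use poisoned datasets to compromise a model's safety during supervised finetuning. Inspired by adversarial training, Patcher scales up the number of optimization steps to produce model parameters that resist stronger, full-parameter finetuning attacks. Experiments indicate that Patcher improves model robustness across a range of attack scenarios and model sizes.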
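To make the threat model concrete, the following is a minimal toy sketch of a simulated full-parameter finetuning attack, written as plain gradient descent on a quadratic stand-in for the attacker's loss. It is not the Patcher algorithm and is not taken from the paper; the names `harmful_loss`, `harmful_grad`, `simulate_attack`, `defended`, and `poison_target` are hypothetical, and the three-element parameter vector merely stands in for model weights.

```python
# Toy illustration only: NOT the Patcher algorithm from the paper.
# All names and values below are hypothetical stand-ins.

def harmful_loss(theta, target):
    """Quadratic stand-in for the attacker's SFT loss on a poisoned dataset."""
    return 0.5 * sum((t - g) ** 2 for t, g in zip(theta, target))

def harmful_grad(theta, target):
    """Gradient of harmful_loss with respect to theta."""
    return [t - g for t, g in zip(theta, target)]

def simulate_attack(theta, target, steps, lr=0.1):
    """Simulated full-parameter attack: `steps` plain gradient-descent updates."""
    theta = list(theta)  # copy so the defended parameters are left untouched
    for _ in range(steps):
        grad = harmful_grad(theta, target)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

defended = [2.0, -1.0, 0.5]       # stand-in for the defended model's parameters
poison_target = [0.0, 0.0, 0.0]   # parameters the attacker is pulling toward

# Post-attack harmful loss for increasingly long simulated attacks.
post_attack = {
    k: harmful_loss(simulate_attack(defended, poison_target, k), poison_target)
    for k in (0, 5, 50)
}
for k, loss in post_attack.items():
    print(f"attack steps={k:3d}  post-attack harmful loss={loss:.6f}")
```

In this toy, the attacker's loss stays comparatively high after a short attack and falls close to zero after a long one, which is one way to see why the number of simulated attack steps matters when a defense is evaluated or trained against such an adversary.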

Impact: Strengthens LLM safety by providing a robust defense against adversarial finetuning.

Ranking rationale: This cluster contains an academic paper detailing a new method for LLM safety.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI · Haoming Wen, Shi Chen, Qingyu Shi, Siyuan Liu, Minrui Luo, Jingzhao Zhang, Tianxing He · 2026-06-09 04:00

Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks

arXiv:2606.07970v1 (announce type: cross). Abstract: Current open-weight large language models (LLMs) are prone to malicious finetuning attacks, which could compromise the safety alignment of LLMs with only a few steps of supervised finetuning (SFT) on poisoned datasets. Existing al…

arXiv cs.CL · Tianxing He · 2026-06-06 04:04

Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks

Current open-weight large language models (LLMs) are prone to malicious finetuning attacks, which could compromise the safety alignment of LLMs with only a few steps of supervised finetuning (SFT) on poisoned datasets. Existing alignment-stage defenses are primarily designed to d…
