One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

Study finds AI model defenses fail under adaptive attacks

By the PulseAugur editorial team · [1 source] · 2026-05-26 04:00

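A new research paper argues that current defenses against malicious fine-tuning of AI models are inadequate. The study analyzes 15 recent defense methods and finds that they mostly mask harmful behavior rather than remove it, which leaves them vulnerable to adaptive attacks. The researchers developed a unified adaptive attack that broke these defenses, indicating that current methods do not provide robust security and need further development before deployment.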

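Impact: Current defenses against malicious fine-tuning of AI models are inadequate; defenses need to be tested against adaptive attack strategies before they can be considered robust.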

Ranking rationale: Academic paper analyzing AI model vulnerabilities.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Itay Zloczower, Eyal Lenga, Gilad Gressel, Yisroel Mirsky · 2026-05-26 04:00

One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

arXiv:2605.14605v2. Abstract: Model providers increasingly release open weights or allow users to fine-tune foundation models through APIs. Although these models are safety-aligned before release, their safeguards can often be removed by fine-tuning on…