On the Hardness of Junking LLMs

New research explores "junking" large language models through natural backdoors

By the PulseAugur editorial team · [2 sources] · 2026-05-06 16:47

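Researchers examine the "junking problem": finding token sequences that occur naturally in a large language model and trigger harmful outputs without any explicit adversarial prompt. The study formalizes the problem and uses a greedy random search method to discover these "natural backdoors". Although the problem is harder than conventional jailbreaking, the proposed strategies achieve high success rates, which suggests that such backdoors exist and are easy to recover.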
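The greedy random search mentioned in the summary can be illustrated with a minimal, generic sketch. This is not the paper's algorithm: the summary does not describe the objective, the mutation rule, or any hyperparameters, so the function `greedy_random_search`, the scorer `toy_score`, the vocabulary, and the step count are all hypothetical stand-ins. The toy objective simply counts positions that match a fixed harmless target string.

```python
import random


def greedy_random_search(score, vocab, length, steps, seed=0):
    """Generic greedy random search over fixed-length token sequences.

    Each step resamples one random position with a random token and keeps
    the change only if the score does not decrease.
    """
    rng = random.Random(seed)
    current = [rng.choice(vocab) for _ in range(length)]
    best = score(current)
    for _ in range(steps):
        pos = rng.randrange(length)
        candidate = list(current)
        candidate[pos] = rng.choice(vocab)
        value = score(candidate)
        if value >= best:  # greedy acceptance: never accept a worse candidate
            current, best = candidate, value
    return current, best


# Hypothetical toy objective (an assumption, not from the paper):
# number of positions that agree with a fixed target string.
TARGET = list("hello")


def toy_score(tokens):
    return sum(a == b for a, b in zip(tokens, TARGET))


VOCAB = list("abcdefghijklmnopqrstuvwxyz")
sequence, value = greedy_random_search(toy_score, VOCAB, length=5, steps=5000)
print("".join(sequence), value)
```

The search only needs black-box access to a score, which is why this family of methods is commonly used when gradients are unavailable; how the paper scores candidate sequences is not stated in this summary.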

Impact: Identifies a new class of large language model vulnerabilities that may affect safety and alignment research.

Ranking rationale: An academic paper detailing a new method for identifying vulnerabilities in large language models.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.LG TIER_1 English(EN) · Marco Rando, Samuel Vaiter · 2026-05-07 04:00

On the Hardness of Junking LLMs

arXiv:2605.05116v1. Abstract: Large language models (LLMs) are known to be vulnerable to jailbreak attacks, which typically rely on carefully designed prompts containing explicit semantic structure. These attacks generally operate by fixing an adversarial instru…

arXiv cs.LG TIER_1 English(EN) · Samuel Vaiter · 2026-05-06 16:47

On the Hardness of Junking LLMs

Large language models (LLMs) are known to be vulnerable to jailbreak attacks, which typically rely on carefully designed prompts containing explicit semantic structure. These attacks generally operate by fixing an adversarial instruction and optimizing small adversarial component…