Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

New research explores advanced methods for detecting and mitigating jailbreaks of large language models (LLMs)

By the PulseAugur editorial team · [8 sources] · 2026-05-22 02:12

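Researchers are developing new methods to detect and mitigate jailbreak attacks on large language models (LLMs). One method, SelfGrader, uses anchored token-level log probabilities to assess whether a query is safe, with low latency and low overhead. Another study examines how different design paradigms for multimodal large language models (MLLMs), in particular explicit image-tool interaction, affect robustness against jailbreaks. A further paper proposes a framework called "behavioral geometry" for efficiently predicting susceptibility and transferring defenses across populations of models. Finally, one study shows that language and modality interact to shape the attack surface of MLLMs, which suggests that safety evaluations need to be run across languages and must account for these interactions.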
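The idea of grading a query from token-level scores can be illustrated with a minimal sketch. This is not the SelfGrader implementation, whose details are not given here; it only assumes that a model exposes next-token logits for a small set of anchor tokens. The anchor names (`safe`, `unsafe`), the function names, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math

def softmax_over_anchors(anchor_logits):
    """Normalize logits over the anchor tokens only (hypothetical helper)."""
    peak = max(anchor_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in anchor_logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def unsafe_score(anchor_logits, unsafe_anchors=("unsafe",)):
    """Probability mass assigned to the anchors treated as 'unsafe'."""
    probs = softmax_over_anchors(anchor_logits)
    return sum(probs[tok] for tok in unsafe_anchors if tok in probs)

def is_flagged(anchor_logits, unsafe_anchors=("unsafe",), threshold=0.5):
    """Flag the query when the unsafe mass reaches the assumed threshold."""
    return unsafe_score(anchor_logits, unsafe_anchors) >= threshold

# Illustrative logits; in practice they would come from a single forward pass.
example_logits = {"safe": 2.0, "unsafe": 0.0}
print(round(unsafe_score(example_logits), 4), is_flagged(example_logits))
```

Because the score is read from one next-token distribution rather than from a generated response, a check of this shape would need only a single forward pass, which is consistent with the low-latency claim in the summary.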

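Impact: The new research introduces advanced LLM safety techniques that could improve robustness against adversarial attacks and enable safer deployment of AI systems.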

Ranking rationale: Several arXiv papers present research on LLM safety and jailbreak mitigation techniques.


AI-generated summary · Google Gemini · based on 8 sources.

Sources [8]

arXiv cs.AI TIER_1 English(EN) · Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu · 2026-05-29 04:00

SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits

arXiv:2604.01473v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to det…

arXiv cs.AI TIER_1 English(EN) · Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong · 2026-05-28 04:00

When Think-with-Image Meets Safety: What Determines Robustness to Multimodal Jailbreaks?

arXiv:2605.27932v1 Announce Type: cross Abstract: Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct re…

arXiv cs.AI TIER_1 English(EN) · Hayden Helm, Xiaodong Liu, Weiwei Yang · 2026-05-27 04:00

Predicting and Mitigating Jailbreak Susceptibility via a Model's Behavioral Geometry

arXiv:2605.26409v1 Announce Type: cross Abstract: Evaluating and mitigating a generative system's susceptibility to jailbreak attacks is critical to its safe deployment. Given the number of deployable systems, full per-configuration evaluation and optimization is impractical. In …

arXiv cs.AI TIER_1 English(EN) · Mengqi He, Xinyu Tian, Xin Shen, Shu Zou, Jinhong Ni, Zhaoyuan Yang, Weikang Li, Xuesong Li, Jing Zhang · 2026-05-26 04:00

Break the Brakes, Not the Wheels: Untargeted Jailbreaks via Entropy Maximization

arXiv:2605.10764v2 Announce Type: replace-cross Abstract: Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. …

arXiv cs.AI TIER_1 English(EN) · Seokil Ham, Jaehyuk Jang, Wonjun Lee, Changick Kim · 2026-05-26 04:00

Jailbreak to Protect: Buffering and Hardening via Temporary Jailbreaks for Safe Fine-Tuning of Large Language Models

arXiv:2605.24550v1 Announce Type: new Abstract: Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety-alignment under harmful fine-tuning attacks. Recent work has shown that activating harmful-behavior modules during fin…

arXiv cs.AI TIER_1 English(EN) · Xiaodong Wu, Xiangman Li, Qi Li, Lingshuang Liu, Jianbing Ni · 2026-05-26 04:00

SoK: A Comprehensive Security Analysis of the Jailbreak Resilience of GPT and DeepSeek Models

arXiv:2506.18543v2 Announce Type: replace-cross Abstract: The rapid proliferation of Large Language Models (LLMs) has heightened concerns regarding their exposure to jailbreak attacks, which craft adversarial inputs designed to elicit unsafe content. Although proprietary models s…

arXiv cs.CL TIER_1 English(EN) · Casey Ford, Madison Van Doren, Sicheng Jin, Emily Dix · 2026-05-25 04:00

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

arXiv:2605.23157v1 Announce Type: new Abstract: The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study co…

arXiv cs.CL TIER_1 English(EN) · Emily Dix · 2026-05-22 02:12

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study comparing jailbreak vulnerability in US English (e…
