What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

New attack exploits blind spots of large language models in content moderation

By the PulseAugur Editorial Team · [2 sources] · 2026-06-08 16:21

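Researchers have developed a new method called the Human-Perceptible Adversarial Attack (HPAA), which exploits the gap between how humans and large language models (LLMs) perceive harmful content. Using typographic manipulations such as spacing and visual emphasis, these attacks keep harmful text easy for humans to recognize while evading detection by LLM-based content moderation systems. In testing, HPAA achieved a human recognition rate above 86% while moderation systems detected fewer than 1% of cases, exposing a significant vulnerability in current content moderation.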

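Impact: Highlights a critical vulnerability in LLM-based content moderation and the need for new approaches that align more closely with human perception.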
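The gap between token-level matching and visual reading can be shown with a toy example. The sketch below is not the paper's method: it assumes a hypothetical keyword blocklist as a stand-in for a token-based moderator (real LLM moderation is far more complex), uses a harmless placeholder term, and all function names are invented for illustration. It shows how letter spacing hides a word from exact matching, and how one simple normalization step, collapsing runs of spaced single letters, recovers it.

```python
import re

# Hypothetical blocklist containing a harmless placeholder term.
BLOCKLIST = {"forbidden"}

# Matches runs of three or more single letters separated by single spaces,
# e.g. "f o r b i d d e n". This pattern is an illustrative assumption.
SPACED_RUN = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def naive_flag(text: str) -> bool:
    """Flag text if any alphabetic word exactly matches a blocklist entry."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(word in BLOCKLIST for word in words)


def collapse_spaced_letters(text: str) -> str:
    """Join each run of spaced single letters back into one word."""
    return SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


plain = "this is forbidden content"
spaced = "this is f o r b i d d e n content"

print(naive_flag(plain))                            # exact match is found
print(naive_flag(spaced))                           # spacing hides the word
print(naive_flag(collapse_spaced_letters(spaced)))  # normalization recovers it
```

A normalization pass like this only addresses plain letter spacing. Other visual cues mentioned in the summary, such as emphasis, would need different handling, and the heuristic can wrongly merge legitimate sequences of single-letter words.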

Ranking rationale: This cluster contains an academic paper detailing a new adversarial attack method.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.LG TIER_1 English(EN) · Qin Yang, Lu Malloy, Joshua Lee, Xiaohan Chang, Meisam Mohammady, Doowon Kim, Yuan Hong · 2026-06-09 04:00

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

arXiv:2606.09700v1 Announce Type: cross Abstract: Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans…
arXiv cs.LG TIER_1 English(EN) · Yuan Hong · 2026-06-08 16:21

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans naturally rely on when interpreting content. We s…

