Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

New Chinese Toxicity Attack Framework Challenges LLM Defenses

By the PulseAugur editorial team · [2 sources] · 2026-05-21 10:01

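Researchers have developed a new framework, CITA, that generates more sophisticated Chinese toxicity attacks against large language models. The framework strengthens implicit toxicity and obfuscates wording, which makes detection harder. In testing, existing toxicity detectors failed at a significant rate, with an average attack success rate of 69.48%. The generated toxic data was also used to fine-tune a defense model, improving its robustness against these advanced attacks.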
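To make the headline metric concrete, here is a minimal sketch of how an average attack success rate could be computed. It assumes the common definition (the share of adversarial toxic samples that a detector fails to flag, averaged over detectors); the paper's exact protocol is not given in the summary. The function name, detector names, and flag values are hypothetical and purely illustrative.

```python
from statistics import mean


def attack_success_rate(detector_flags):
    """Return the fraction of adversarial toxic samples the detector missed.

    detector_flags: list of bools, True where the detector flagged a sample
    as toxic. Assumes every sample in the list is genuinely toxic.
    """
    if not detector_flags:
        raise ValueError("need at least one sample")
    missed = sum(1 for flagged in detector_flags if not flagged)
    return missed / len(detector_flags)


# Hypothetical detector outputs on rewritten toxic samples (made-up values,
# not results from the paper).
flags_by_detector = {
    "detector_a": [True, False, False, False],
    "detector_b": [False, False, True, True],
}

# Per-detector rate, then the unweighted average across detectors.
per_detector = {
    name: attack_success_rate(flags)
    for name, flags in flags_by_detector.items()
}
average_asr = mean(list(per_detector.values()))
print(per_detector, average_asr)
```

Under this definition, a higher value means the detectors miss more of the attack samples, so the reported 69.48% would indicate that most rewritten samples slipped past detection on average.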

Impact: Introduces a new method for red-teaming large language models, which may lead to more robust toxicity detection systems.

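Ranking rationale: The cluster contains an academic paper detailing a new method for generating adversarial attacks against large language models.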

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Jingyi Kang, Junyu Lu, Bo Xu, Hongbo Wang, Linlin Zong, Roy Ka-Wei Lee, Hongfei Lin · 2026-05-22 04:00

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

arXiv:2605.22258v1. Abstract: Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese …

arXiv cs.CL TIER_1 English(EN) · Hongfei Lin · 2026-05-21 10:01

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled re…