New Chinese Toxicity Attack Framework Challenges LLM Defenses

By PulseAugur Editorial · [2 sources] · 2026-05-21 10:01

Researchers have introduced Chinese Implicit Toxicity Attack (CITA), a framework for generating Chinese toxicity attacks against large language models. It makes toxic content more implicit and obfuscates the surface wording, so the result is harder to detect. Existing toxicity detectors frequently failed on the generated samples, with an average attack success rate of 69.48%. The generated data was also used to fine-tune a defense model, which improved its robustness against these attacks.
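For readers unfamiliar with the metric, the sketch below shows one common way an average attack success rate can be computed. It assumes the rate is the share of adversarial samples a detector fails to flag as toxic, averaged over detectors; the summary does not give the paper's exact definition, and the detector names and flags here are hypothetical.

```python
from statistics import mean


def attack_success_rate(detector_flags):
    """Return the fraction of adversarial samples the detector did NOT flag.

    Assumed definition: an attack succeeds when a toxic sample slips past
    the detector (flag is False).
    """
    if not detector_flags:
        raise ValueError("need at least one sample")
    missed = sum(1 for flagged in detector_flags if not flagged)
    return missed / len(detector_flags)


# Hypothetical results: True means the detector flagged the sample as toxic.
results = {
    "detector_a": [True, False, False, False],
    "detector_b": [False, False, True, True],
}

per_detector = {name: attack_success_rate(flags) for name, flags in results.items()}
average_asr = mean(per_detector.values())
print(per_detector, average_asr)
```

Under this definition, a higher value means the detectors miss more of the attack samples.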

IMPACT Introduces a novel method for red-teaming LLMs, potentially leading to more robust toxicity detection systems.

RANK_REASON The cluster contains an academic paper detailing a new method for generating adversarial attacks on LLMs.

paper
safety

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.CL TIER_1 English(EN) · Jingyi Kang, Junyu Lu, Bo Xu, Hongbo Wang, Linlin Zong, Roy Ka-Wei Lee, Hongfei Lin · 2026-05-22 04:00

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

arXiv:2605.22258v1 · Abstract: Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese …

arXiv cs.CL TIER_1 English(EN) · Hongfei Lin · 2026-05-21 10:01

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled re…
