PulseAugur

New adversarial attacks exploit LLM blind spots in content moderation

Researchers have developed a new class of adversarial attacks, Human-Perceptible Adversarial Attacks (HPAA), that exploit the gap between how humans and large language models (LLMs) perceive text. The attacks embed harmful expressions using visually salient typographic manipulations, such as spacing and emphasis, which humans recognize easily but LLMs often miss. In black-box settings, humans recognized the harmful content in over 86% of cases, while deployed moderation systems detected it less than 1% of the time, pointing to a significant vulnerability in current LLM-based content moderation.
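
To make the perception gap concrete, here is a minimal sketch from the defender's side. It is not the paper's method, and this summary does not describe the paper's actual manipulations in detail. It assumes a toy keyword filter standing in for a text-only moderation system, a benign placeholder blocklist, and a hypothetical normalization helper, `collapse_spaced_letters`, that rejoins letters separated by spaces or simple punctuation.

```python
import re

# Hypothetical blocklist for illustration only; a benign placeholder word is used.
BLOCKLIST = {"forbidden"}


def naive_flag(text: str) -> bool:
    """Toy stand-in for a text-only filter: flag if any alphabetic token is blocklisted."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(tok in BLOCKLIST for tok in tokens)


# Runs of three or more single letters separated by spaces, tabs, dots,
# underscores, asterisks, or hyphens (an assumed, deliberately simple pattern).
_SPACED_RUN = re.compile(r"\b(?:[A-Za-z][ \t._*-]+){2,}[A-Za-z]\b")


def collapse_spaced_letters(text: str) -> str:
    """Rejoin letter-by-letter runs so that the filter sees the word a human reads."""
    return _SPACED_RUN.sub(lambda m: re.sub(r"[^A-Za-z]", "", m.group(0)), text)


if __name__ == "__main__":
    plain = "this is forbidden here"
    spaced = "this is f o r b i d d e n here"
    print(naive_flag(plain))                             # filter sees the word
    print(naive_flag(spaced))                            # filter sees only single letters
    print(naive_flag(collapse_spaced_letters(spaced)))   # normalization restores the word
```

A human reads the spaced-out string as one word, while the filter sees nine unrelated single-letter tokens. Normalization of this kind only covers one manipulation and will produce false positives on legitimate letter sequences, so it illustrates the gap rather than closing it.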

IMPACT Highlights a critical vulnerability in LLM content moderation, potentially necessitating new approaches that better align with human perception.

RANK_REASON The cluster contains an academic paper detailing a new method for adversarial attacks on LLM-based content moderation systems.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.LG TIER_1 English(EN) · Qin Yang, Lu Malloy, Joshua Lee, Xiaohan Chang, Meisam Mohammady, Doowon Kim, Yuan Hong ·

    What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

    arXiv:2606.09700v1 (cross-listed). Abstract: Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans…

  2. arXiv cs.LG TIER_1 English(EN) · Yuan Hong ·

    What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

    Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans naturally rely on when interpreting content. We s…