What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks
Researchers have introduced Human-Perceptible Adversarial Attacks (HPAA), a method that exploits the gap between how humans and large language models (LLMs) perceive harmful content. Using typographic manipulations such as spacing and visual emphasis, the attacks keep harmful text easy for humans to recognize while evading LLM-based moderation systems. In tests, HPAA achieved a human recognition rate above 86% with a detection rate below 1% for moderation systems, exposing a significant vulnerability in current content moderation.
IMPACT: Highlights a critical vulnerability in LLM-based content moderation and the need for new approaches that better align with human perception.
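
To make the perception gap concrete, here is a toy Python sketch of why letter spacing can hide a string from a machine while a human still reads it at a glance. It is not the HPAA method and does not model the LLM-based moderation systems that were tested; the summary does not specify either. The substring check `naive_flag`, the placeholder term in `BLOCKLIST`, and the helper `collapse_letter_spacing` are all hypothetical names introduced only for illustration.

```python
import re

# Hypothetical placeholder term; a stand-in for whatever a filter looks for.
BLOCKLIST = {"blocked"}

def naive_flag(text: str) -> bool:
    """Toy stand-in for a moderation check: plain substring matching."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

# Runs of three or more single characters separated by one space or tab,
# e.g. "b l o c k e d". Ordinary words are left alone.
_SPACED_RUN = re.compile(r"(?<!\S)(?:\S[ \t]){2,}\S(?!\S)")

def collapse_letter_spacing(text: str) -> str:
    """Rejoin letter-spaced runs so the text is closer to what a reader sees."""
    return _SPACED_RUN.sub(lambda m: re.sub(r"[ \t]", "", m.group(0)), text)

spaced = "this is b l o c k e d text"
print(naive_flag(spaced))                           # spaced form slips past the toy check
print(naive_flag(collapse_letter_spacing(spaced)))  # normalized form is caught
```

A normalization pass like this only addresses one simple spacing trick. Visual emphasis and other typographic manipulations would need different handling, and a real moderation model may fail for reasons this sketch does not capture.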