Brief · PulseAugur

TOOL · arXiv cs.AI · 6d

Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

Researchers have developed CPD Online, a method for detecting adversarial prompts that attempt to jailbreak large language models. It frames detection as an online change-point detection problem, tracking sequential changes in the entropy of the model's token predictions. CPD Online is model-agnostic, requires no training, and can pinpoint where in a prompt the malicious portion begins. It outperforms existing perplexity-based detectors on various open-weight models.
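To make the change-point framing concrete, here is a minimal sketch of online detection over a stream of per-token entropies. It is not the paper's algorithm: it swaps in a generic one-sided CUSUM detector, assumes that the adversarial segment raises entropy relative to a baseline taken from the first few tokens, and uses hypothetical names and parameters (`token_entropy`, `cusum_onset`, `warmup`, `drift`, `threshold`) chosen only for illustration. In practice the entropies would come from a model's next-token distributions; here they are plain Python lists.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cusum_onset(entropies, warmup=5, drift=0.5, threshold=3.0):
    """One-sided CUSUM over a sequence of per-token entropies.

    Illustrative stand-in for an online change-point detector; all
    parameters are assumptions, not values from the paper.
    Returns the estimated onset index of an upward shift, or None.
    """
    if len(entropies) <= warmup:
        return None
    # Baseline: mean entropy of the first `warmup` tokens.
    baseline = sum(entropies[:warmup]) / warmup
    stat = 0.0    # cumulative evidence for an upward shift
    start = None  # index where the current excursion began
    for i in range(warmup, len(entropies)):
        stat = max(0.0, stat + entropies[i] - baseline - drift)
        if stat == 0.0:
            start = None      # evidence reset, forget the excursion
        elif start is None:
            start = i         # excursion begins at this token
        if stat > threshold:
            return start      # alarm: report where the excursion started
    return None


# Synthetic example: ten low-entropy tokens followed by ten high-entropy ones.
stream = [1.0] * 10 + [4.0] * 10
print(cusum_onset(stream))
```

Because the statistic is updated one token at a time, a detector of this shape can run while the prompt is still being read, and the stored excursion start gives a rough estimate of where the shift began.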

IMPACT This detection method could improve LLM safety by identifying malicious prompts so they can be mitigated, potentially reducing the need for extensive guardrail interventions.

LLMs
LLaMA-2-7B
LLaMA Guard
CPD Online
Mohammed Alshaalan