New method detects adversarial LLM prompts using sequential entropy changes

By PulseAugur Editorial · [1 source] · 2026-05-19 15:15

Researchers have developed a new method called CPD Online to detect adversarial prompts that attempt to jailbreak large language models. This technique treats prompt detection as an online change-point detection problem, analyzing sequential entropy changes in the model's token predictions. CPD Online is model-agnostic, requires no training, and can pinpoint the onset of malicious prompts, outperforming existing perplexity-based detectors on various open-weight models.
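To make the idea of online change-point detection over entropy concrete, here is a minimal sketch. It is not the paper's CPD Online algorithm: it substitutes a textbook two-sided CUSUM detector, and the names `shannon_entropy` and `cusum_onset`, along with the `warmup`, `drift`, and `threshold` parameters and their defaults, are hypothetical choices made for illustration. It assumes the per-token entropies have already been computed from a model's next-token distributions.

```python
import math
from typing import Optional, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cusum_onset(
    entropies: Sequence[float],
    warmup: int = 8,
    drift: float = 1.0,
    threshold: float = 5.0,
) -> Optional[int]:
    """Two-sided CUSUM over a stream of per-token entropies.

    ASSUMPTION: CUSUM is a generic stand-in here, not the paper's statistic.
    The first `warmup` values set the baseline mean and spread. Returns the
    estimated index where the shift began, or None if no alarm is raised.
    """
    if warmup < 1 or len(entropies) <= warmup:
        return None
    base = entropies[:warmup]
    mean = sum(base) / warmup
    # Population standard deviation of the baseline, floored to avoid dividing by zero.
    std = max(math.sqrt(sum((h - mean) ** 2 for h in base) / warmup), 1e-6)

    up = down = 0.0
    up_start = down_start = warmup
    for t in range(warmup, len(entropies)):
        z = (entropies[t] - mean) / std
        # Accumulate standardized deviations beyond the allowed drift.
        up = max(0.0, up + z - drift)
        down = max(0.0, down - z - drift)
        # When a statistic resets, the next token becomes the candidate onset.
        if up == 0.0:
            up_start = t + 1
        if down == 0.0:
            down_start = t + 1
        if up > threshold:
            return up_start
        if down > threshold:
            return down_start
    return None


# Synthetic entropies: a stable prefix followed by an abrupt shift at index 12.
benign = [2.0, 2.1, 1.9, 2.0] * 3
onset = cusum_onset(benign + [4.0] * 6)
```

Because the detector processes one entropy value at a time and keeps only two running sums, it needs no training data and can report where in the prompt the shift began, which is the property the summary attributes to the change-point framing.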

IMPACT This new detection method could enhance the safety of LLMs by identifying and mitigating malicious prompts, potentially reducing the need for extensive guardrail interventions.

RANK_REASON The cluster contains a new academic paper detailing a novel method for detecting adversarial prompts in LLMs.

paper
safety

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Miguel R. D. Rodrigues · 2026-05-19 15:15

Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening static and windowed perplexity-based detectors. We cast adversarial suffix detection as an online change-point detection problem over the token-level next-…
