Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 1d · [2 sources]

DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing

Researchers have developed DoubtProbe, a novel defense mechanism designed to counter jailbreaking attempts on large language models (LLMs) in black-box scenarios. This dual-branch framework combines structural verification with semantic auditing to identify inconsistencies in jailbreak prompts that evade safety alignments. When tested on models like Qwen2.5-72B and Llama 3.1 70B, DoubtProbe significantly reduced attack success rates while maintaining low false positive rates on benign requests. AI

IMPACT This research offers a new method for improving LLM safety by detecting and mitigating jailbreaking attempts through structural and semantic analysis.

Qwen2.5-72B
AlpacaEval
Llama 3.1 70B
DoubtProbe
CodeAttack