D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting
Researchers have developed a new defense mechanism called D-Judge to counter multi-turn jailbreak attacks on large language models. These attacks use feedback from auxiliary judge models to iteratively refine prompts towards harmful goals. D-Judge works by rewriting the victim LLM's responses before they are evaluated by the attacker's judge, thus misaligning the feedback signal without altering the response's meaning. This strategy derails the prompt-refinement process, leading to improved safety on benchmarks like HarmBench while maintaining performance on benign tasks.
AI IMPACT: Introduces a novel defense against sophisticated multi-turn jailbreaks, potentially enhancing LLM safety and reliability.