NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense
Recent research addresses both sides of prompt injection and jailbreak attacks on large language models. On the defense side, GuardNet uses an ensemble of shallow neural networks for efficient detection, and NeuroArmor is a runtime defense that compares prompts against safe variants to balance safety and helpfulness. On the attack side, SlotGCG optimizes where an attack is placed within a prompt to exploit positional vulnerabilities, and CRI is a framework that strengthens jailbreak attacks by leveraging compliance directions in the model's activation space.
AI IMPACT: These research efforts aim to improve the security and reliability of LLMs, making them safer for broader deployment and reducing the risk of malicious use.