PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations
Researchers have developed PsychoPass, a framework for detecting adversarial multi-turn conversations with large language models. The method models each conversation as a geometric path in embedding space and analyzes the trajectory as a whole rather than individual turns in isolation. PsychoPass extracts geometric features from that path to predict attacks early in the conversation; the authors report that it is robust across different encoders and outperforms baseline guardrails.
AI IMPACT: Introduces a novel approach to LLM safety based on conversation geometry, potentially enabling more robust real-time detection of adversarial attacks.
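
To make the idea of a conversation as a path in embedding space concrete, here is a minimal sketch in Python. It is not the PsychoPass implementation: the summary does not say which geometric features the framework uses, so the function name `trajectory_features` and the features chosen here (path length, net displacement, straightness, mean turning angle) are assumptions for illustration only. The input is assumed to be one embedding vector per turn, produced by any sentence encoder.

```python
import numpy as np


def trajectory_features(embeddings):
    """Illustrative geometric features of a conversation path.

    `embeddings` has shape (turns, dim): one vector per conversation turn.
    The feature set is a hypothetical choice, not the one used by PsychoPass.
    """
    points = np.asarray(embeddings, dtype=float)
    if points.ndim != 2 or points.shape[0] < 2:
        raise ValueError("expected at least two turn embeddings, shape (turns, dim)")

    # Displacement vectors between consecutive turns and their lengths.
    steps = np.diff(points, axis=0)
    step_lengths = np.linalg.norm(steps, axis=1)

    # Total distance travelled versus straight-line distance start to end.
    path_length = float(step_lengths.sum())
    displacement = float(np.linalg.norm(points[-1] - points[0]))
    straightness = displacement / path_length if path_length > 0 else 0.0

    # Angle between consecutive steps; zero-length steps are skipped.
    angles = []
    for i in range(len(steps) - 1):
        la, lb = step_lengths[i], step_lengths[i + 1]
        if la > 0 and lb > 0:
            cos = np.clip(np.dot(steps[i], steps[i + 1]) / (la * lb), -1.0, 1.0)
            angles.append(float(np.arccos(cos)))
    mean_turning_angle = float(np.mean(angles)) if angles else 0.0

    return {
        "path_length": path_length,
        "displacement": displacement,
        "straightness": straightness,
        "mean_turning_angle": mean_turning_angle,
    }


# Toy three-turn conversation in a 2-D embedding space.
toy = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
features = trajectory_features(toy)
```

Under these assumptions, early prediction would amount to computing the same features on a prefix of the conversation, for example `trajectory_features(embeddings[:k])` for the first `k` turns, and passing them to a classifier trained on labeled conversations.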