Researchers have developed a method that uses process mining to analyze how Large Language Models (LLMs) respond to red-teaming attacks. Rather than recording only whether an attack succeeds or fails, the approach examines the sequence of interactions during an attack. Experiments with GPT-OSS 120B and Llama 3.3 70B revealed distinct defense patterns: GPT-OSS quickly entered a refusal state, while Llama had multiple pathways to being jailbroken.
AI IMPACT Introduces a novel analytical framework for understanding LLM vulnerabilities beyond simple pass/fail metrics.
RANK_REASON The cluster contains an academic paper detailing a new methodology for evaluating LLM safety.
AI-generated summary · Google Gemini · from 1 source.