RouteScan: A Non-Intrusive Approach to Auditing MoE LLMs Safety via Expert Routing Telemetry
Researchers have developed RouteScan, a framework for auditing the safety of Mixture-of-Experts (MoE) large language models (LLMs) without access to sensitive user data. The non-intrusive method analyzes low-level GPU execution telemetry, specifically expert routing patterns, to detect harmful behavior. In evaluations on open-source MoE models, RouteScan generalizes well and remains accurate even on unseen harmful domains and novel jailbreak techniques, while offering a privacy advantage over content-based auditing.
AI IMPACT: Offers a privacy-preserving method for LLM safety auditing, potentially enabling broader deployment of MoE models.