Researchers have developed RouteScan, a framework for auditing the safety of Mixture-of-Experts (MoE) large language models (LLMs) without access to sensitive user data. The non-intrusive method analyzes low-level GPU execution telemetry, specifically expert-routing patterns, to detect harmful behavior. Evaluations on open-source MoE models show that RouteScan achieves high accuracy and generalizes to unseen harmful domains and novel jailbreak techniques, while offering a privacy advantage over content-based auditing.
AI IMPACT Offers a privacy-preserving method for LLM safety auditing, potentially enabling broader deployment of MoE models.
RANK_REASON The cluster contains a research paper detailing a new method for auditing LLMs.
AI-generated summary · Google Gemini · from 1 source.