Researchers have developed RouteHijack, an attack that targets Mixture-of-Experts (MoE) large language models (LLMs). The attack exploits the routing mechanism within MoE architectures, identifying and manipulating safety-critical experts to bypass alignment safeguards. RouteHijack demonstrated a significant attack success rate across various MoE models, including vision-language models, highlighting a fundamental vulnerability in sparse expert architectures.
Impact: Exposes a fundamental vulnerability in MoE architectures, necessitating new defense strategies beyond output-level alignment.
Ranking rationale: Academic paper detailing a novel attack method on MoE LLMs.
AI-generated summary · Google Gemini · from 1 source.