Mixtral MoE routing analyzed for safety under harmful prompts

By PulseAugur Editorial Team · [1 source] · 2026-05-26 04:00

Researchers analyzed the routing behavior of the Mixtral 8x7B-Instruct model on both benign and harmful prompts. They used activation-based and gradient-based signals to examine how the model's router selects experts for different types of input. The study found that most experts are shared between benign and harmful prompts, while a small subset shows distinct preferences. Suppressing these preferred experts reduced harmful responses, indicating that safety-relevant routing is subtle and distributed across layers.
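To make the idea of "preferred" experts and their suppression concrete, here is a minimal Python sketch. It is not the paper's method: the router logits are made-up toy values, and the helper names (`select_experts`, `usage_share`, `preferred_experts`) and the `margin` threshold are hypothetical. It only assumes the publicly known Mixtral layout of 8 experts per MoE layer with 2 selected per token.

```python
from collections import Counter

NUM_EXPERTS = 8  # Mixtral 8x7B has 8 experts per MoE layer
TOP_K = 2        # and routes each token to 2 of them


def select_experts(logits, suppressed=frozenset(), k=TOP_K):
    """Return the indices of the k highest-scoring experts, skipping suppressed ones."""
    allowed = [i for i in range(len(logits)) if i not in suppressed]
    return sorted(allowed, key=lambda i: logits[i], reverse=True)[:k]


def usage_share(token_logits, suppressed=frozenset()):
    """Fraction of routing slots each expert receives over a list of tokens."""
    counts = Counter()
    for logits in token_logits:
        counts.update(select_experts(logits, suppressed))
    total = sum(counts.values())
    return [counts[i] / total for i in range(NUM_EXPERTS)]


def preferred_experts(benign_logits, harmful_logits, margin=0.2):
    """Experts whose usage share on harmful tokens exceeds the benign share by `margin`.

    The margin is an arbitrary illustrative threshold, not a value from the paper.
    """
    benign = usage_share(benign_logits)
    harmful = usage_share(harmful_logits)
    return {i for i in range(NUM_EXPERTS) if harmful[i] - benign[i] > margin}


# Toy router logits (invented for illustration): one list of 8 scores per token.
benign_logits = [
    [2.0, 1.5, 0.1, 0.0, 0.3, 0.2, 0.1, 0.0],
    [1.8, 0.2, 1.6, 0.0, 0.1, 0.3, 0.2, 0.1],
]
harmful_logits = [
    [2.0, 0.3, 0.1, 0.0, 0.2, 1.9, 0.1, 0.0],
    [1.7, 0.2, 0.4, 0.0, 0.1, 2.1, 0.2, 0.1],
]

# Expert 0 is shared by both prompt classes; expert 5 only fires on harmful tokens.
preferred = preferred_experts(benign_logits, harmful_logits)

# Suppression: re-route harmful tokens with the preferred experts masked out.
rerouted = [select_experts(l, suppressed=preferred) for l in harmful_logits]
```

In this toy setup, masking an expert simply hands its routing slot to the next-best expert for that token. A real intervention would act on the router inside each MoE layer of the model.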

Impact: Provides insight into the internal workings of Mixture-of-Experts models, potentially informing future safety research and development.

Ranking rationale: Academic paper analyzing model behavior.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI TIER_1 English(EN) · Md Nurul Absar Siddiky · 2026-05-26 04:00

Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

arXiv:2605.24270v1 Announce Type: new Abstract: Sparse mixture-of-experts (MoE) language models activate only a small subset of parameters for each token, making router behavior a central part of model computation. This paper studies routing behavior of Mixtral 8x7B-Instruct unde…
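The sparse activation the abstract describes can be illustrated with a small Python sketch of top-k gating. It assumes the commonly described form in which a softmax is taken over only the k largest router logits. The toy experts and the names `top_k_gate` and `moe_output` are hypothetical and stand in for the real feed-forward experts.

```python
import math


def top_k_gate(logits, k=2):
    """Softmax over the k largest router logits; unselected experts get no weight."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


def moe_output(x, logits, experts, k=2):
    """Weighted sum over the selected experts only; the others are never evaluated."""
    gate = top_k_gate(logits, k)
    return sum(w * experts[i](x) for i, w in gate.items())


# Toy experts (illustrative only): expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(8)]
logits = [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

gate = top_k_gate(logits)             # experts 1 and 3 share the weight equally
y = moe_output(3.0, logits, experts)  # 0.5 * (2 * 3.0) + 0.5 * (4 * 3.0)
```

Because only the experts returned by `top_k_gate` are called, the per-token compute depends on k rather than on the total number of experts, which is why the router's choices are a central part of the computation.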
