Brief

last 24h

[2/2] 222 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · arXiv cs.CL Deutsch(DE) · 6h

Expert-Aware Refusal Steering

Researchers have developed new methods to steer refusal behavior in Mixture-of-Experts (MoE) large language models. They found that existing steering-vector techniques remain effective on MoE architectures. The proposed expert-aware methods strengthen this steering by exploiting expert routing patterns and expert-specific directions, showing that refusal signals can be controlled effectively at the level of individual experts.

IMPACT Introduces novel techniques for controlling LLM safety alignment, potentially improving robustness against harmful requests.
- arXiv
- Mixture-of-Experts (MoE) LLMs

TOOL · arXiv cs.CL English(EN) · 1w

RouteScan: A Non-Intrusive Approach to Auditing MoE LLMs Safety via Expert Routing Telemetry

Researchers have developed RouteScan, a framework for auditing the safety of Mixture-of-Experts (MoE) large language models (LLMs) without access to sensitive user data. The non-intrusive method analyzes low-level GPU execution telemetry, specifically expert routing patterns, to detect harmful behavior. In evaluations on open-source MoE models, RouteScan achieves high accuracy and generalizes to unseen harmful domains and novel jailbreak techniques, while offering a privacy advantage over content-based auditing.

IMPACT Offers a privacy-preserving method for LLM safety auditing, potentially enabling broader deployment of MoE models.
- RouteScan
- Mixture-of-Experts (MoE) LLMs
