Expert-Aware Refusal Steering
Researchers have developed methods for steering refusal behavior in Mixture-of-Experts (MoE) large language models. They found that existing steering-vector techniques remain effective on MoE architectures. The proposed expert-aware methods build on this by exploiting expert routing patterns and expert-specific directions, showing that refusal signals can be controlled at the level of individual experts.
AI IMPACT Introduces techniques for controlling LLM safety alignment, potentially improving robustness against harmful requests.
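
For readers unfamiliar with steering vectors, the general idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the method from the work summarized above: it uses the common difference-of-means recipe to estimate a "refusal direction" from activations, then adds or removes that direction. The function names, the toy activations, and the `expert_aware_steer` variant (which applies a direction only to tokens routed to a chosen expert) are all hypothetical and introduced here purely for illustration.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful, harmless):
    """Unit-length difference of mean activations (harmful minus harmless)."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [x / norm for x in diff]


def steer(hidden, direction, alpha):
    """Shift one hidden state by alpha along the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def ablate(hidden, direction):
    """Remove the component of a hidden state along a unit direction."""
    proj = sum(h * d for h, d in zip(hidden, direction))
    return [h - proj * d for h, d in zip(hidden, direction)]


def expert_aware_steer(hidden, expert_id, expert_dirs, alpha):
    """Hypothetical variant: steer only when the token's routed expert
    has a direction registered in expert_dirs (expert id -> unit vector)."""
    direction = expert_dirs.get(expert_id)
    if direction is None:
        return list(hidden)  # token routed elsewhere: leave unchanged
    return steer(hidden, direction, alpha)


# Toy activations (made up for illustration): the two sets differ
# only along the first coordinate.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = refusal_direction(harmful, harmless)
```

In a real model the vectors would be residual-stream or expert-output activations collected at a chosen layer, and the routing decision would come from the MoE router; how the summarized work selects experts and directions is not stated here.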