PulseAugur

New LoMC Framework Enhances Refusal Suppression in Routed Foundation Models

Researchers have proposed Localized Multidirectional Correction (LoMC), a framework for controlled post-training refusal suppression in routed Mixture-of-Experts (MoE) and hybrid-MoE foundation models. LoMC aims to increase non-refusal target responses while preserving general capability and keeping the intervention footprint compact. The method identifies an edit support, aggregates correction directions, and applies rank-one layer-wise corrections only within that support, which increases correction capacity without broadening the scope of the intervention. The authors report that experiments on safety benchmarks show the approach improving the targeted behavior across different routed model architectures.
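To make the three steps concrete, here is a minimal toy sketch of a rank-one correction restricted to a chosen set of components. It is not the paper's algorithm: the mean-and-normalize aggregation rule, the projection-style update W - alpha * d (d^T W), the (layer, expert) keying, and every function name below are assumptions made for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]

def aggregate_directions(directions):
    """Assumed aggregation rule: average the candidate directions, then normalize."""
    dim = len(directions[0])
    mean = [sum(d[i] for d in directions) / len(directions) for i in range(dim)]
    return normalize(mean)

def rank_one_correct(W, d, alpha=1.0):
    """Return W - alpha * d (d^T W), damping the output component along d."""
    rows, cols = len(W), len(W[0])
    # d^T W is a row vector with one entry per input column.
    dtW = [sum(d[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - alpha * d[i] * dtW[j] for j in range(cols)]
            for i in range(rows)]

def apply_localized_correction(weights, support, directions_by_layer, alpha=1.0):
    """Edit only the (layer, expert) matrices listed in `support` (hypothetical keying)."""
    edited = {}
    for key, W in weights.items():
        layer, _expert = key
        if key in support:
            d = aggregate_directions(directions_by_layer[layer])
            edited[key] = rank_one_correct(W, d, alpha)
        else:
            # Outside the support the weights are copied unchanged.
            edited[key] = [row[:] for row in W]
    return edited

# Toy usage: two experts in layer 0, only expert 0 is in the edit support.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {(0, 0): identity, (0, 1): identity}
support = {(0, 0)}
directions_by_layer = {0: [[1.0, 0.0], [1.0, 0.0]]}
edited = apply_localized_correction(weights, support, directions_by_layer)
```

In this toy run the matrix for expert (0, 0) loses its output component along the aggregated direction, while expert (0, 1) is left untouched. That is the sense in which a correction can be localized to a support rather than applied across the whole model.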

IMPACT Introduces a localized correction technique for controlling refusal behavior in routed MoE and hybrid-MoE foundation models.

RANK_REASON The cluster contains an academic paper detailing a new method for AI model safety.

Read on arXiv stat.ML →

AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

  1. arXiv stat.ML TIER_1 English(EN) · Yan Hong, Kedong Xiu, Wei Li, Jun Lan, Huijia Zhu, Shuheng Zhou, Zhongcai Lyu, Weiqiang Wang, Jianfu Zhang ·

    LoMC: Localized Multidirectional Correction for Refusal Suppression in Routed Foundation Models

    arXiv:2606.13709v1. Abstract: We study controlled post-training refusal suppression in routed MoE and hybrid-MoE foundation models, aiming to increase non-refusal target-response behavior while preserving general capability under a compact intervention footprint…

  2. arXiv stat.ML TIER_1 English(EN) · Jianfu Zhang ·

    LoMC: Localized Multidirectional Correction for Refusal Suppression in Routed Foundation Models

    We study controlled post-training refusal suppression in routed MoE and hybrid-MoE foundation models, aiming to increase non-refusal target-response behavior while preserving general capability under a compact intervention footprint. Existing broad direction-based edits can pertu…