PulseAugur

New 'apostate' operator reduces LLM refusal rates with minimal impact

A new "contrastive co-vector" operator has been added to the "apostate" tool. It aims to reduce refusal rates in language models while limiting the impact on harmless behavior. The method fits a predictor to reproduce harmless variance while explicitly suppressing harmful prompts. On the "granite-3.3-8b" model, the refusal rate fell from 96.0% to 5.0%, and the KL divergence on harmless prompts rose to 0.081 nats.
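For readers unfamiliar with the metric, the following is a minimal sketch of a KL divergence measured in nats between two next-token distributions. The function name `kl_nats`, the direction KL(original || edited), and the use of raw logits as inputs are assumptions for illustration; the summary does not say how the 0.081 figure was computed or averaged.

```python
import numpy as np

def kl_nats(logits_p: np.ndarray, logits_q: np.ndarray) -> float:
    """KL(P || Q) in nats for two categorical distributions given as logits.

    Hypothetical helper: P stands for the original model's next-token
    distribution and Q for the edited model's. Natural logarithms give nats.
    """
    # Log-softmax via a stable log-sum-exp over the vocabulary axis.
    logp = logits_p - np.logaddexp.reduce(logits_p)
    logq = logits_q - np.logaddexp.reduce(logits_q)
    return float(np.sum(np.exp(logp) * (logp - logq)))

# Toy two-token vocabulary, written as log-probabilities.
p = np.log(np.array([0.5, 0.5]))
q = np.log(np.array([0.25, 0.75]))
kl_example = kl_nats(p, q)  # equals 0.5 * ln(4/3)
```

A value of zero means the two distributions are identical, so a small value on harmless prompts indicates that the edit changed benign outputs only slightly.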

IMPACT This new operator could lead to more compliant and less restrictive AI models, improving user interaction and utility.

RANK_REASON The item describes a new technical method for modifying language models, including experimental results, which falls under research.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/AccountAntique9327 ·

    New ablation operator. (apostate)

    Today I added a new operator to apostate. This new operator is a contrastive co-vector edit E = I − R Dᵀ. Removing the refusal direction outright disturbs benign behavior, while naively preserving all harmless varian…
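    The algebra of an edit of the form E = I − R Dᵀ can be sketched as follows. This is not apostate's code: the excerpt is cut off before it explains how R and D are fitted, so here both are random placeholders, and the helper name `covector_edit` and the normalization Dᵀ R = I are assumptions. The sketch only shows two general properties: with D = R and orthonormal R the edit is the ordinary orthogonal projection that removes the direction outright, and with any D normalized so that Dᵀ R = I the edit still zeroes R while leaving every vector h with Dᵀ h = 0 untouched.

```python
import numpy as np

def covector_edit(R: np.ndarray, D: np.ndarray) -> np.ndarray:
    """Return E = I - R D^T for R and D of shape (d, k)."""
    d = R.shape[0]
    return np.eye(d) - R @ D.T

rng = np.random.default_rng(0)
d, k = 8, 2

# Placeholder "refusal" directions with orthonormal columns (not fitted to any model).
R, _ = np.linalg.qr(rng.normal(size=(d, k)))

# Case 1: D = R gives the orthogonal projector that removes span(R).
E_proj = covector_edit(R, R)

# Case 2: a different placeholder co-vector D, rescaled so that D^T R = I.
D = rng.normal(size=(d, k))
D = D @ np.linalg.inv(R.T @ D)
E = covector_edit(R, D)

# Build a vector that the co-vectors do not see (D^T h_perp = 0).
h = rng.normal(size=d)
h_perp = h - D @ np.linalg.solve(D.T @ D, D.T @ h)

removed = E @ R          # all zeros: the R directions are ablated
preserved = E @ h_perp   # equals h_perp: invisible to D, so left unchanged
```

    In this sketch the choice of D decides which inputs the edit leaves alone, which is the degree of freedom that plain direction removal (D = R) does not have.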