A new operator, "contrastive co-vector", has been developed for the "apostate" tool to reduce refusal rates in language models while minimizing the impact on harmless behavior. The method fits a predictor to reproduce harmless variance while explicitly suppressing harmful prompts. In testing on the "granite-3.3-8b" model, the refusal rate fell from 96.0% to 5.0%, while the harmless KL divergence rose only to 0.081 nats.
AI IMPACT This new operator could lead to more compliant and less restrictive AI models, improving user interaction and utility.
RANK_REASON The item describes a new technical method for modifying language models, including experimental results, which falls under research.
AI-generated summary · Google Gemini · from 1 source.