PulseAugur
research · [2 sources]

Nous Research steers LLM refusals by targeting 0.1% of neurons

Researchers at Nous Research have developed Contrastive Neuron Attribution (CNA), a method that identifies and steers the specific MLP neurons in large language models responsible for refusing harmful requests. Ablating just 0.1% of MLP activations cut refusal rates by more than 50% across Llama and Qwen models ranging from 1 billion to 72 billion parameters. Output quality scores stayed above 0.97, and unlike previous techniques the method requires no additional training, such as fitting a sparse autoencoder, and no weight modification.
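
To make the idea concrete, here is a minimal sketch of contrastive neuron selection and ablation on synthetic data. It is not Nous Research's implementation: the sources do not give the scoring rule, so the mean-activation gap used here is a stand-in assumption, and every name (`contrastive_scores`, `select_top_fraction`, `ablate`, `synthetic_activations`) is hypothetical. The activations are random numbers with two planted "refusal" neurons, not outputs of a real model.

```python
import random


def contrastive_scores(acts_a, acts_b):
    """Per-neuron gap between mean activations on two prompt sets.

    Each argument is a list of rows, one row of neuron activations per prompt.
    ASSUMPTION: a plain mean difference stands in for the real attribution score.
    """
    n = len(acts_a[0])
    mean_a = [sum(row[j] for row in acts_a) / len(acts_a) for j in range(n)]
    mean_b = [sum(row[j] for row in acts_b) / len(acts_b) for j in range(n)]
    return [a - b for a, b in zip(mean_a, mean_b)]


def select_top_fraction(scores, fraction=0.001):
    """Return indices of the `fraction` of neurons with the largest absolute gap."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
    return set(ranked[:k])


def ablate(row, selected):
    """Zero the selected neurons in one activation row; no weights are changed."""
    return [0.0 if j in selected else v for j, v in enumerate(row)]


def synthetic_activations(n_prompts, n_neurons, shifted, shift, rng):
    """Gaussian noise activations, with a constant shift on the `shifted` neurons."""
    return [
        [rng.gauss(0.0, 1.0) + (shift if j in shifted else 0.0) for j in range(n_neurons)]
        for _ in range(n_prompts)
    ]


rng = random.Random(0)
planted = {17, 1234}  # hypothetical "refusal" neurons in the toy data
refused = synthetic_activations(32, 2000, planted, 5.0, rng)
complied = synthetic_activations(32, 2000, set(), 0.0, rng)

scores = contrastive_scores(refused, complied)
selected = select_top_fraction(scores, fraction=0.001)  # 0.1% of 2000 = 2 neurons
steered = ablate(refused[0], selected)
print(sorted(selected))
```

In a real model the ablation step would presumably run at inference time on MLP activations, for example through a forward hook, which is consistent with the claim that no weights are modified; that wiring is left out of this sketch.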

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Provides a novel, efficient method for understanding and controlling LLM safety mechanisms without extensive retraining.

RANK_REASON Research paper detailing a new method for AI safety and interpretability.


COVERAGE [2]

  1. MarkTechPost TIER_1 · Asif Razzaq ·

    Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification

    Nous Research releases Contrastive Neuron Attribution (CNA), a method that identifies and ablates sparse MLP neuron circuits to steer LLM behavior: no sparse autoencoder training, no weight modification, and no degradation of general capability benchmarks.

  2. Mastodon — fosstodon.org TIER_1 · [email protected] ·

    Nous Research has released Contrastive Neuron Attribution (CNA), a method that identifies the specific MLP neurons controlling AI model refusal behaviour. By ablating just 0.1% of MLP activations, refusal rates drop by over 50% across Llama and Qwen models from 1B to 72B paramete…