research · [2 sources] · 2026-05-23 10:32

Nous Research steers LLM refusals by targeting 0.1% of neurons

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 2 sources

Researchers at Nous Research have developed a method called Contrastive Neuron Attribution (CNA) that identifies and steers the specific neurons in large language models responsible for refusing harmful requests. By targeting just 0.1% of MLP activations, CNA reduced refusal rates by more than 50% across Llama and Qwen models ranging from 1 billion to 72 billion parameters. The method kept output quality above 0.97 and, unlike previous techniques, requires no additional training or weight modification.
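To make the idea of contrasting activations and ablating a 0.1% slice concrete, here is a minimal sketch on synthetic data. It is not the Nous Research implementation: the summary does not give the attribution score, so the mean-activation gap used here is an assumed stand-in, zeroing is an assumed form of ablation, and the function names and the planted neuron indices are hypothetical.

```python
import random


def mean_per_neuron(acts):
    """Average each neuron's activation over a list of prompts (rows)."""
    n = len(acts)
    return [sum(col) / n for col in zip(*acts)]


def select_contrastive_neurons(acts_a, acts_b, fraction=0.001):
    """Pick the top `fraction` of neurons by absolute mean-activation gap.

    ASSUMPTION: the gap between the two prompt sets stands in for the
    attribution score, which the summary does not specify.
    """
    gaps = [abs(a - b) for a, b in zip(mean_per_neuron(acts_a), mean_per_neuron(acts_b))]
    k = max(1, round(fraction * len(gaps)))
    ranked = sorted(range(len(gaps)), key=gaps.__getitem__, reverse=True)
    return sorted(ranked[:k])


def ablate(acts, neuron_ids):
    """Return a copy of the activations with the selected neurons zeroed.

    ASSUMPTION: ablation means setting the activation to zero. The model's
    weights are not touched; only activations are edited.
    """
    drop = set(neuron_ids)
    return [[0.0 if j in drop else v for j, v in enumerate(row)] for row in acts]


# Synthetic stand-in for MLP activations: 32 prompts x 2000 neurons per set.
rng = random.Random(0)
N_NEURONS, N_PROMPTS = 2000, 32
PLANTED = [42, 1234]  # hypothetical "refusal" neurons, planted for the demo

harmless = [[rng.gauss(0, 1) for _ in range(N_NEURONS)] for _ in range(N_PROMPTS)]
harmful = [[rng.gauss(0, 1) for _ in range(N_NEURONS)] for _ in range(N_PROMPTS)]
for row in harmful:
    for j in PLANTED:
        row[j] += 5.0  # these neurons fire strongly only on "harmful" prompts

selected = select_contrastive_neurons(harmful, harmless)  # 0.1% of 2000 = 2 neurons
steered = ablate(harmful, selected)
```

In a real model the same two steps would presumably run on activations captured from the MLP layers, with the ablation applied at inference time rather than to stored arrays. The toy version only shows why a strong contrast between two prompt sets can single out a very small set of neurons.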


IMPACT Provides a novel, efficient method for understanding and controlling LLM safety mechanisms without extensive retraining.

RANK_REASON Research paper detailing a new method for AI safety and interpretability.


paper
safety


COVERAGE [2]

MarkTechPost TIER_1 · Asif Razzaq · 2026-05-23 10:32

Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification

Nous Research releases Contrastive Neuron Attribution (CNA), a method that identifies and ablates sparse MLP neuron circuits to steer LLM behavior: no sparse autoencoder training, no weight modification, and no degradation of general capability benchmarks.
Mastodon — fosstodon.org TIER_1 · [email protected] · 2026-05-23 10:52

Nous Research has released Contrastive Neuron Attribution (CNA), a method that identifies the specific MLP neurons controlling AI model refusal behaviour. By ablating just 0.1% of MLP activations, refusal rates drop by over 50% across Llama and Qwen models from 1B to 72B paramete…

LINKS marktechpost.com/…/nous-research-releases…
