
Nous Research's CNA method steers LLM refusal behavior by targeting 0.1% of neurons

Researchers at Nous Research have developed Contrastive Neuron Attribution (CNA), a method for identifying and manipulating the specific MLP neurons in large language models that control refusal behavior. Ablating just 0.1% of MLP activations reduces refusal rates on harmful requests by over 50% across Llama and Qwen models, with no reported degradation on general capability benchmarks. The technique requires no additional training (including no sparse autoencoder training) and no modification of model weights. It also indicates that the neural structures for distinguishing harmful from benign prompts exist even in base models, before alignment fine-tuning.
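The summary above does not spell out how neurons are scored, so the following is only a minimal sketch of the general idea under stated assumptions, not Nous Research's implementation. It assumes a neuron's contrastive score is the difference between its mean activation on harmful prompts and on benign prompts, selects the top 0.1% by absolute score, and zeroes them at inference time. All function names and the synthetic activations are hypothetical.

```python
import random

def contrastive_scores(harmful_acts, benign_acts):
    """Per-neuron score: mean activation on harmful prompts minus mean on benign.

    ASSUMPTION: this scoring rule is a stand-in; the source does not give CNA's.
    Each argument is a list of per-prompt activation vectors of equal length.
    """
    n = len(harmful_acts[0])
    mean_h = [sum(row[j] for row in harmful_acts) / len(harmful_acts) for j in range(n)]
    mean_b = [sum(row[j] for row in benign_acts) / len(benign_acts) for j in range(n)]
    return [h - b for h, b in zip(mean_h, mean_b)]

def top_fraction(scores, fraction=0.001):
    """Return the indices of the top `fraction` of neurons by absolute score."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
    return set(ranked[:k])

def ablate(activations, neuron_ids):
    """Zero the selected neurons in one activation vector; weights are untouched."""
    return [0.0 if j in neuron_ids else a for j, a in enumerate(activations)]

# Synthetic demo (hypothetical data): 10,000 neurons, 10 of which (0.1%)
# fire more strongly on "harmful" prompts.
rng = random.Random(0)
N_NEURONS = 10_000
PLANTED = set(range(0, N_NEURONS, 1000))

def fake_activations(harmful):
    return [rng.gauss(0.0, 1.0) + (8.0 if harmful and j in PLANTED else 0.0)
            for j in range(N_NEURONS)]

harmful = [fake_activations(True) for _ in range(20)]
benign = [fake_activations(False) for _ in range(20)]

selected = top_fraction(contrastive_scores(harmful, benign))
steered = ablate(harmful[0], selected)
print(len(selected), selected == PLANTED)
```

In a real model the activation vectors would come from the MLP layers during a forward pass, and the ablation would be applied at the same point on later prompts; the sketch only shows the select-then-zero logic.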

IMPACT Enables precise control over LLM safety mechanisms, potentially leading to more robust alignment techniques and a deeper understanding of model behavior.

RANK_REASON The cluster describes a new research paper detailing a novel method for analyzing and manipulating AI model behavior.


AI-generated summary · Google Gemini · from 4 sources.


COVERAGE [4]

  1. MarkTechPost TIER_1 English(EN) · Asif Razzaq ·

    Nous Research Releases Contrastive Neuron Attribution (CNA): Sparse MLP Circuit Steering Without SAE Training or Weight Modification

    Nous Research releases Contrastive Neuron Attribution (CNA), a method that identifies and ablates sparse MLP neuron circuits to steer LLM behavior: no sparse autoencoder training, no weight modification, and no degradation of general capability benchmarks. …

  2. Mastodon — sigmoid.social TIER_1 Polski(PL) · [email protected] ·

    Researchers from Nous Research developed the CNA method, which allows almost complete removal of security locks in Llama and Qwen models through an operation on just

    Researchers at Nous Research have developed the CNA method, which makes it possible to remove the safety guardrails in Llama and Qwen models almost entirely by operating on just 0.1% of their neurons. #si #ai #sztucznainteligencja #wiadomości #informacje #technologia https://aisight.pl/technol…

  3. Mastodon — sigmoid.social TIER_1 Polski(PL) · [email protected] ·

    ByteDance and HKUST researchers prove that traditional AI model training on OCR tasks hinders document work. Their MMProLong project shows that key

    Researchers from ByteDance and HKUST demonstrate that traditional training of AI models on OCR tasks hinders work with documents. Their MMProLong project shows that the key to success is not model size but replacing mechanical transcription with question-and-answer games. #si #ai #…

  4. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    Nous Research has released Contrastive Neuron Attribution (CNA), a method that identifies the specific MLP neurons controlling AI model refusal behaviour. By ab

    Nous Research has released Contrastive Neuron Attribution (CNA), a method that identifies the specific MLP neurons controlling AI model refusal behaviour. By ablating just 0.1% of MLP activations, refusal rates drop by over 50% across Llama and Qwen models from 1B to 72B paramete…