Brief · PulseAugur

TOOL · arXiv cs.LG English(EN) · 1d

NeST: Neuron Selective Tuning for LLM Safety

Researchers have introduced NeST (Neuron Selective Tuning), a framework for efficient post-hoc safety alignment of Large Language Models (LLMs). The method identifies safety-relevant neurons through activation probing and trains shared updates at the cluster level, which reduces the need for extensive fine-tuning. NeST generalizes to a range of jailbreaks without requiring attack-specific data, and it substantially reduces unsafe outputs across both text-only and multimodal models while using few trainable parameters and adding no inference-time overhead.
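The general idea of probing activations, selecting neurons, and sharing one update per cluster can be illustrated with a toy NumPy sketch. This is not NeST's actual procedure, which this brief does not spell out: the synthetic activations, the mean-difference score, the contiguous grouping used as "clustering", the hand-set update vectors, and every function name below are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden activations: rows are prompts, columns are neurons.
n_prompts, n_neurons = 200, 64
benign = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))
harmful = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))
# Assumption: neurons 0..7 respond differently to harmful prompts.
harmful[:, :8] += 3.0


def probe_scores(a, b):
    """Per-neuron separation score: |mean difference| / pooled std."""
    pooled = np.sqrt(0.5 * (a.var(axis=0) + b.var(axis=0))) + 1e-8
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled


def select_neurons(scores, k):
    """Indices of the k highest-scoring neurons, in ascending index order."""
    return np.sort(np.argsort(scores)[-k:])


def cluster_neurons(selected, n_clusters):
    """Toy clustering (assumption): split selected neurons into contiguous groups."""
    return np.array_split(selected, n_clusters)


def apply_shared_updates(weights, clusters, deltas):
    """Add one shared update vector per cluster to the weight rows of its neurons."""
    updated = weights.copy()
    for idx, delta in zip(clusters, deltas):
        updated[idx] += delta  # the same delta is broadcast to every neuron in the cluster
    return updated


scores = probe_scores(harmful, benign)
selected = select_neurons(scores, k=8)
clusters = cluster_neurons(selected, n_clusters=2)

# Hypothetical layer: one weight row of size d_in per neuron.
d_in = 16
weights = rng.normal(size=(n_neurons, d_in))
# Hand-set stand-ins for what would be learned update vectors, one per cluster.
deltas = [np.full(d_in, 0.1), np.full(d_in, -0.1)]
updated = apply_shared_updates(weights, clusters, deltas)

# One vector per cluster is "trainable" here, versus the full weight matrix.
trainable = sum(d.size for d in deltas)
print(selected, trainable, weights.size)
```

In this sketch the shared updates are folded directly into the weight matrix, so the layer keeps its original shape, which is one way a method of this kind could avoid extra cost at inference time.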

IMPACT NeST offers a more efficient and maintainable approach to LLM safety alignment, potentially reducing the computational cost and complexity of deploying safe AI systems.

Hugging Face
LLM
arXiv
LoRA
NeST
Lichao Wu