PulseAugur

Researchers develop SNMF for interpretable LLM feature analysis

Researchers have developed a method for interpreting large language models by decomposing their MLP activations. The technique, semi-nonnegative matrix factorization (SNMF), learns features that are sparse combinations of co-activated neurons and maps each feature to the inputs that activate it. In experiments on Llama 3.1, Gemma 2, and GPT-2, SNMF-derived features were more effective for causal steering than existing methods and revealed a hierarchical structure in the models' activation spaces.
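To make the decomposition concrete, here is a minimal sketch of a generic semi-nonnegative matrix factorization. It assumes the classic alternating scheme (a least-squares update for the unconstrained factor, a multiplicative update for the nonnegative one), which is not necessarily the training procedure used in the paper, and it runs on random synthetic data standing in for MLP activations. The names `semi_nmf`, `X`, `F`, and `G` are illustrative and do not come from the source.

```python
import numpy as np


def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor X (d x n, mixed sign) as F @ G.T with G >= 0.

    F (d x k) is unconstrained and plays the role of feature directions;
    G (n x k) is nonnegative and holds per-sample feature coefficients.
    This is an illustrative sketch, not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    G = rng.random((n, k)) + 0.1  # strictly positive initialisation
    errors = []

    def pos(M):
        return (np.abs(M) + M) / 2  # positive part of M

    def neg(M):
        return (np.abs(M) - M) / 2  # magnitude of the negative part of M

    for _ in range(n_iter):
        # Least-squares update for the unconstrained factor, given G.
        F = X @ G @ np.linalg.pinv(G.T @ G)

        # Multiplicative update for G; the ratio is nonnegative,
        # so G stays nonnegative throughout.
        XtF = X.T @ F
        FtF = F.T @ F
        numerator = pos(XtF) + G @ neg(FtF)
        denominator = neg(XtF) + G @ pos(FtF)
        G = G * np.sqrt((numerator + eps) / (denominator + eps))

        errors.append(np.linalg.norm(X - F @ G.T))

    return F, G, errors


# Synthetic stand-in for activations: 16 "neurons", 200 "tokens", 4 latent features.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.standard_normal((16, 4)) @ demo_rng.random((4, 200))
F_demo, G_demo, demo_errors = semi_nmf(X_demo, k=4, n_iter=100)
print("reconstruction error, first vs last iteration:", demo_errors[0], demo_errors[-1])
```

In this sketch the columns of `F` correspond to candidate features expressed as weighted groups of neurons, and each row of `G` gives the nonnegative strength with which those features are active on one input. Inspecting the inputs with the largest coefficient for a given column is one simple way to relate a feature back to what activates it.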

Impact: Introduces a novel, interpretable method for dissecting LLM internals, potentially improving model understanding and debugging.

Ranking rationale: This is a research paper detailing a new method for analyzing LLM activations.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →


Sources [1]

  1. arXiv cs.CL TIER_1 English(EN) · Or Shafran, Atticus Geiger, Mor Geva

    Constructing Interpretable Features from Compositional Neuron Groups

    arXiv:2506.10920v2. Abstract: A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that …