Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

New "Bag of Dims" method enables training-free Transformer interpretability

By the PulseAugur editorial team · [1 source] · 2026-06-12 04:00

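Researchers have developed a method called "Bag of Dims" that enables training-free mechanistic interpretability analysis of Transformer models. The method reads the sign of each individual dimension of a Transformer's hidden state as encoding semantic content, so that each dimension behaves like an independent binary register. Experiments across several model families, including Qwen 3.5-4B, Gemma 3-4B, and Mistral 7B, indicate that these sign patterns alone are highly predictive: they reach high accuracy on next-token prediction and surface a large number of semantic features without any additional training.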
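The "binary register" reading of hidden-state dimensions can be illustrated with a minimal sketch. This is not the paper's procedure: the function names `sign_pattern` and `agreement`, the token labels, and the 8-dimensional vectors are all hypothetical, chosen only to show how a hidden state reduces to one bit per dimension and how two such patterns might be compared.

```python
from typing import List, Sequence


def sign_pattern(hidden: Sequence[float]) -> List[int]:
    """Reduce a hidden-state vector to one bit per dimension (1 if positive, else 0)."""
    return [1 if x > 0 else 0 for x in hidden]


def agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of dimensions on which two sign patterns carry the same bit."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy vectors with made-up values; these are NOT real model activations.
# In practice the vectors would come from a Transformer layer's hidden state.
h_cat = [0.9, -1.2, 0.3, -0.1, 2.0, -0.7, 0.4, -0.5]
h_dog = [0.2, -0.4, 1.1, -2.3, 0.6, -0.1, -0.9, -0.3]
h_tax = [-0.8, 1.5, -0.2, 0.6, -1.1, 0.9, 0.3, -0.4]

p_cat = sign_pattern(h_cat)
p_dog = sign_pattern(h_dog)
p_tax = sign_pattern(h_tax)

print(p_cat)                    # [1, 0, 1, 0, 1, 0, 1, 0]
print(agreement(p_cat, p_dog))  # 0.875
print(agreement(p_cat, p_tax))  # 0.25
```

Magnitudes are discarded entirely here, so two vectors with very different norms can still share a pattern. The sketch involves no learned parameters, which is the sense in which such an analysis is training-free.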

Impact: This training-free interpretability method could significantly reduce the computational cost of understanding Transformer models.

Ranking rationale: This cluster contains a paper detailing a new method for analyzing Transformer models.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Varun Reddy Nalagatla · 2026-06-12 04:00

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

arXiv:2606.12629v1 Announce Type: cross Abstract: We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs and confidence via their magnitudes, …

