New SAE methods improve interpretability and stability

By PulseAugur Editorial · [10 sources] · 2026-05-29 12:44

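Researchers have reported several advances in sparse autoencoders (SAEs) aimed at improving their interpretability and stability. Concept-SAE provides a controllable interface for probing user-defined concepts in an SAE, strengthening its diagnostic capabilities. Subspace-Aware Sparse Autoencoders (SASA) address feature splitting by replacing single-vector decoders with learned decoder subspaces, yielding more coherent features and better efficiency. In addition, aligned training and mean-centering techniques have been proposed to address problems such as feature death and instability, making SAEs a more reliable tool for understanding deep neural networks.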
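To make the feature-death and mean-centering terms concrete, here is a toy sketch in NumPy. It is not the procedure from any of the listed papers: the data are synthetic, the weights are random and untrained, and every name (`sae_forward`, `dead_fraction`, the dimensions) is hypothetical. It only illustrates how a large shared offset in the activations can leave many ReLU latents inactive on every sample, and how subtracting the batch mean before encoding changes that.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Plain ReLU SAE: subtract b_dec, encode, apply ReLU, decode."""
    z = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    x_hat = z @ W_dec + b_dec
    return z, x_hat

def dead_fraction(z):
    """Fraction of latent features that never activate on this batch."""
    return float(np.mean(~(z > 0).any(axis=0)))

rng = np.random.default_rng(0)
d_model, d_sae, n = 16, 64, 2048  # assumed toy sizes

# Synthetic activations: unit-variance noise plus a large shared offset.
offset = 10.0 * rng.normal(size=d_model)
x = rng.normal(size=(n, d_model)) + offset

# Random, untrained weights (an assumption made for illustration only).
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()
b_enc = np.zeros(d_sae)

# Without centering: the decoder bias is zero, so the offset reaches the encoder.
z_raw, _ = sae_forward(x, W_enc, b_enc, W_dec, np.zeros(d_model))

# With mean-centering: the decoder bias is set to the batch mean.
z_ctr, _ = sae_forward(x, W_enc, b_enc, W_dec, x.mean(axis=0))

dead_raw = dead_fraction(z_raw)
dead_ctr = dead_fraction(z_ctr)
print(f"dead fraction without centering: {dead_raw:.2f}")
print(f"dead fraction with centering:    {dead_ctr:.2f}")
```

In this toy setup the uncentered run leaves a visible share of latents inactive on every sample, while the centered run does not. A trained SAE on real model activations may behave differently, so this should be read as intuition rather than evidence.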

Impact: These advances provide more robust and more interpretable tools for analyzing the inner workings of deep learning models.

Ranking rationale: Several arXiv papers propose new methods and analyses for sparse autoencoders.


AI-generated summary · Google Gemini · from 10 sources.

Sources [10]

arXiv cs.AI TIER_1 English(EN) · Chenhao Zhang, Chris Lin, Su-In Lee · 2026-06-08 04:00

A Geometric Perspective on Concept Learning and Neuron Interpretation in Sparse Autoencoders

arXiv:2606.07007v1 Announce Type: cross Abstract: We propose a unified mathematical framework for a geometric understanding of concept learning and neuron interpretation in sparse autoencoders (SAEs). While SAEs improve interpretability of neural networks by learning sparse featu…
arXiv cs.LG TIER_1 English(EN) · Su-In Lee · 2026-06-05 07:52

A Geometric Perspective on Concept Learning and Neuron Interpretation in Sparse Autoencoders

We propose a unified mathematical framework for a geometric understanding of concept learning and neuron interpretation in sparse autoencoders (SAEs). While SAEs improve interpretability of neural networks by learning sparse feature representations, a principled definition of ''c…
arXiv cs.LG TIER_1 English(EN) · Seyed Arshan Dalili, Mehrdad Mahdavi · 2026-06-05 04:00

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

arXiv:2606.06333v1 Announce Type: new Abstract: Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. …
arXiv cs.LG TIER_1 English(EN) · Jianrong Ding, Muxi Chen, Chenchen Zhao, Qiang Xu · 2026-06-05 04:00

Concept-SAE: A Controllable and Invertible Concept Interface for Sparse Autoencoders

arXiv:2509.22015v2 Announce Type: replace Abstract: Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model's learned features, providing a powerful lens for passive feature discovery. However, this passive nature makes it difficult to systematically eval…
arXiv cs.AI TIER_1 English(EN) · Mehrdad Mahdavi · 2026-06-04 16:08

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the…
arXiv cs.LG TIER_1 English(EN) · Michał Brzozowski, Neo Christopher Chung · 2026-06-03 04:00

Aligned Training: A Parameter-Free Method for Improving Feature Quality and Stability in Sparse Autoencoders (SAEs)

arXiv:2605.18629v2 Announce Type: replace Abstract: Sparse autoencoders (SAEs) are one of the main methods to interpret the inner workings of deep neural networks (DNNs), decomposing activations into higher-dimensional features. However, they exhibit critical shortcomings where a…
arXiv cs.LG TIER_1 English(EN) · Elana Simon, Etowah Adams, James Zou · 2026-06-01 04:00

The Relationship Between Activation Outliers and Feature Death in Sparse Autoencoders

arXiv:2605.31518v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) decompose neural network activations into interpretable features, but many learned features never activate, a problem called feature death that wastes dictionary capacity and can reintroduce superposition.…
arXiv cs.LG TIER_1 English(EN) · Walter Nelson, Theofanis Karaletsos, Francesco Locatello · 2026-06-01 04:00

Towards Identifiable Sparse Autoencoders

arXiv:2605.31245v1 Announce Type: new Abstract: Recently, sparse autoencoders (SAEs) have emerged as an attractive tool for interpreting and interacting with representations in practical neural networks. While it is common empirical folklore, we also show theoretically that SAEs …
arXiv cs.LG TIER_1 English(EN) · James Zou · 2026-05-29 16:36

The Relationship Between Activation Outliers and Feature Death in Sparse Autoencoders

Sparse autoencoders (SAEs) decompose neural network activations into interpretable features, but many learned features never activate, a problem called feature death that wastes dictionary capacity and can reintroduce superposition. Death rates vary dramatically between models: n…
arXiv cs.LG TIER_1 English(EN) · Francesco Locatello · 2026-05-29 12:44

Towards Identifiable Sparse Autoencoders

Recently, sparse autoencoders (SAEs) have emerged as an attractive tool for interpreting and interacting with representations in practical neural networks. While it is common empirical folklore, we also show theoretically that SAEs are highly unstable: different training runs are…
