Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

New methods improve the interpretability and stability of sparse autoencoders

By the PulseAugur editorial team · [6 sources] · 2026-05-06 04:00

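Researchers have developed new methods to address the limitations of sparse autoencoders (SAEs), which are used to interpret the internal representations of large language models. One paper introduces the adaptive elastic net SAE (AEN-SAE), a differentiable architecture that mitigates feature starvation and shrinkage bias without heuristic resampling. Another study proposes a pairwise matrix protocol for analyzing SAE features, showing that single-feature inspection can mislabel causal axes and that coherence loss is tied to directional patterns. A further paper proposes that adding local-order auxiliary losses, such as a finite-difference sign error, can improve autoencoder reconstruction accuracy beyond what standard mean-squared error achieves.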
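As a rough illustration of the local-order idea mentioned in the summary, the sketch below combines mean-squared error with a finite-difference sign error. It is one plausible reading of the term, not the paper's definition: the function names, the choice of first differences along the last axis, and the `weight` parameter are all assumptions made for this example. The sign comparison is also not differentiable, so this form would serve as an evaluation metric rather than a training loss.

```python
import numpy as np

def mse(x, x_hat):
    """Pointwise mean-squared error between a signal and its reconstruction."""
    return float(np.mean((x - x_hat) ** 2))

def fd_sign_error(x, x_hat):
    """Fraction of adjacent pairs whose first-difference sign disagrees.

    Assumed definition for illustration: compare sign(x[i+1] - x[i]) with
    sign(x_hat[i+1] - x_hat[i]) along the last axis.
    """
    dx = np.sign(np.diff(x, axis=-1))
    dx_hat = np.sign(np.diff(x_hat, axis=-1))
    return float(np.mean(dx != dx_hat))

def combined_loss(x, x_hat, weight=0.1):
    """MSE plus a weighted local-order term (weight is a hypothetical knob)."""
    return mse(x, x_hat) + weight * fd_sign_error(x, x_hat)

x = np.array([0.0, 1.0, 2.0, 1.0])
x_hat = np.array([0.0, 1.0, 0.5, 1.0])  # close pointwise, but the local order is flipped

print(mse(x, x_hat), fd_sign_error(x, x_hat), combined_loss(x, x_hat))
```

In this toy case only one sample is wrong, yet two of the three adjacent-pair orderings are reversed, which is the kind of structural error a pointwise objective does not penalize directly.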

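Impact: These advances in sparse autoencoder techniques could lead to more robust LLM interpretability tools that help researchers understand and debug complex models.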

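Ranking rationale: This cluster contains several academic papers detailing new research on improving sparse autoencoders and their interpretability.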


AI-generated summary · Google Gemini · based on 6 sources.

Sources [6]

arXiv cs.LG TIER_1 English(EN) · Faris Chaudhry, Keisuke Yano, Anthea Monod · 2026-05-08 04:00

Feature Starvation as Geometric Instability in Sparse Autoencoders

arXiv:2605.05341v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) are used to disentangle the dense, polysemantic internal representations of large language models (LLMs) into interpretable, monosemantic concepts. However, standard $\ell_1$-regularized SAEs suffer from f…
arXiv cs.LG TIER_1 English(EN) · Harvey Dam, Martin Burtscher, Tripti Agarwal, Ganesh Gopalakrishnan · 2026-05-08 04:00

Local-Order Auxiliary Losses Can Improve Autoencoder Reconstruction

arXiv:2504.04202v4 Announce Type: replace Abstract: Mean-squared error is the default objective for training autoencoders, yet compressed reconstructions often depend not only on pointwise accuracy but also on preserving local spatial order. We study whether structural auxiliary …
arXiv cs.AI TIER_1 English(EN) · Ruben Fernandez-Boullon, Pablo Magariños-Docampo, Javier Perez-Robles · 2026-05-08 04:00

From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

arXiv:2605.06494v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating to…
arXiv cs.AI TIER_1 English(EN) · Javier Perez-Robles · 2026-05-07 16:15

From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight vectors, leaving the…
arXiv cs.LG TIER_1 English(EN) · Michael A. Riegler, Birk Sebastian Frostelid Torpmann-Hagen · 2026-05-06 04:00

Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

arXiv:2605.03160v1 Announce Type: new Abstract: The standard sparse-autoencoder (SAE) interpretability protocol labels each feature from its top-activating contexts and validates by single-feature steering. We propose the pairwise matrix protocol, co-varying steering coefficient …
arXiv stat.ML TIER_1 English(EN) · Anthea Monod · 2026-05-06 18:11

Feature Starvation as Geometric Instability in Sparse Autoencoders

Sparse autoencoders (SAEs) are used to disentangle the dense, polysemantic internal representations of large language models (LLMs) into interpretable, monosemantic concepts. However, standard $\ell_1$-regularized SAEs suffer from feature starvation (dead neurons) and shrinkage b…
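To make the shrinkage-bias and dead-feature terms in the abstract above concrete, here is a minimal sketch using soft-thresholding, the standard proximal operator of an $\ell_1$ penalty. It is a generic illustration and not the method from the paper; the pre-activation values and the threshold `lam` are made up for the example.

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * ||z||_1: zero small entries, shrink the rest by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical pre-activation values for five latent features.
z = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
codes = soft_threshold(z, lam=0.5)

print(codes)                      # surviving entries are pulled toward zero by lam
print(np.count_nonzero(codes))    # number of features that remain active
```

Entries with magnitude below `lam` are set to zero, and every surviving entry has its magnitude reduced by exactly `lam`. That systematic underestimate is what is usually meant by shrinkage bias, and a feature whose pre-activation rarely exceeds the threshold never fires, which is one way dead features arise.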
