Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

New explainer method improves AI model interpretability under data shift

By PulseAugur Editorial · [2 sources] · 2026-05-21 00:46

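Researchers have developed a Geometry-Adaptive Explainer (GAE) that aims to keep dictionary-based interpretability methods faithful when a model encounters out-of-distribution (OOD) data. GAE targets a specific failure: distribution shift rotates the active subspace of the model's activations, which leaves the explainer's dictionary misaligned with it. GAE realigns the dictionary with the OOD active subspace using only unlabeled OOD data. This improves causal faithfulness without any gradient updates, and its performance matches or exceeds that of existing training-based methods.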
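The summary above does not spell out GAE's actual procedure, so the following is only an illustrative sketch of the general idea of gradient-free subspace realignment, not the authors' algorithm. It assumes the active subspace can be approximated by the top-k principal directions of the activations and that the alignment can be done with an orthogonal Procrustes fit. The names `top_subspace` and `realign_dictionary`, and the parameter `k`, are hypothetical.

```python
import numpy as np


def top_subspace(acts, k):
    """Return a (d x k) orthonormal basis for the top-k principal directions of acts (n x d)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def realign_dictionary(dictionary, id_acts, ood_acts, k):
    """Move the in-subspace part of each dictionary atom from the ID to the OOD subspace.

    dictionary: (m x d) array, one atom per row.
    id_acts, ood_acts: (n x d) unlabeled activations; no labels or gradients are used.
    This is an assumed stand-in for GAE, chosen for illustration only.
    """
    u_id = top_subspace(id_acts, k)
    u_ood = top_subspace(ood_acts, k)
    # Orthogonal Procrustes: r minimises ||u_id - u_ood @ r||_F over orthogonal r,
    # which fixes the arbitrary sign/rotation of the OOD basis relative to the ID basis.
    w, _, zt = np.linalg.svd(u_ood.T @ u_id)
    r = w @ zt
    coords = dictionary @ u_id                   # atom coordinates in the ID subspace
    off_subspace = dictionary - coords @ u_id.T  # component outside it, left unchanged
    return off_subspace + coords @ (u_ood @ r).T
```

In this sketch, atoms that lie entirely inside the estimated ID subspace keep their norm and end up inside the estimated OOD subspace. How the real method estimates the subspace, and what it does with components outside it, would have to be taken from the paper.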

Impact: Makes explanations of AI models more reliable when the models encounter new, unseen data, which is critical for safety and debugging.

Ranking rationale: This cluster contains an academic paper detailing a new method for AI model interpretability.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources. How we write summaries →

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Sungjun Lim, Heedong Kim, Andrew Lee, Kyungwoo Song · 2026-05-22 04:00

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

arXiv:2605.21849v1 Announce Type: cross Abstract: Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfu…
arXiv cs.CL TIER_1 English(EN) · Kyungwoo Song · 2026-05-21 00:46

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has re…
