Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

New method audits large language model features by aligning semantics and mechanisms

By PulseAugur Editorial · [2 sources] · 2026-06-06 15:46

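Researchers have developed a new unsupervised method that discovers features in large language models by aligning semantic content with the internal computational mechanisms that produce it. The method clusters model outputs by both their meaning and their underlying attribution signatures, without requiring predefined output targets. The resulting clusters reveal diverse continuation patterns that conventional methods may overlook, offering a scalable way to audit the internal workings of large language models.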
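The clustering idea described above can be illustrated with a toy sketch. This is not the paper's algorithm: the k-means step, the `alpha` weighting, the helper names, and the hand-made vectors are all assumptions chosen only to show how outputs with shared semantics can separate once an attribution signature is added to the clustering features.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so both views are comparable.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def joint_features(semantic, attribution, alpha=0.5):
    # Hypothetical combination rule: weighted concatenation of the two views.
    return np.hstack([alpha * l2_normalize(semantic),
                      (1.0 - alpha) * l2_normalize(attribution)])

def kmeans(x, k, n_iter=20):
    # Deterministic farthest-point initialisation, then standard Lloyd updates.
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

# Toy data (invented): six continuations.
# Items 0-3 share one meaning; items 4-5 share another.
semantic = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.05],
                     [0.95, 0.0], [0.0, 1.0], [0.1, 0.9]])
# Attribution signatures: items 0-1, 2-3, and 4-5 each use a different mechanism.
attribution = np.array([[1.0, 0.0, 0.0], [1.0, 0.1, 0.0], [0.0, 1.0, 0.0],
                        [0.0, 0.9, 0.1], [0.0, 0.0, 1.0], [0.1, 0.0, 1.0]])

sem_labels = kmeans(l2_normalize(semantic), k=2)
joint_labels = kmeans(joint_features(semantic, attribution), k=3)

print("semantic only:", sem_labels.tolist())
print("semantic + attribution:", joint_labels.tolist())
```

In this toy setup, clustering on meaning alone groups items 0 to 3 together, while the joint features split them into two groups that differ only in their attribution signature. A real pipeline would need actual sentence embeddings and attribution scores from the model under audit.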

Impact: Provides a novel way to audit the internal computations of large language models, strengthening model safety and interpretability.

Ranking rationale: This cluster contains an academic paper detailing a new research method.

AI-generated summary · Google Gemini · from 2 sources.

Coverage sources [2]

arXiv cs.LG TIER_1 English(EN) · Hyunjin Cho, Youngji Roh, Jaehyung Kim · 2026-06-09 04:00

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

arXiv:2606.08236v1 Announce Type: cross Abstract: As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central ap…

arXiv cs.CL TIER_1 English(EN) · Jaehyung Kim · 2026-06-06 15:46

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is …