The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology

Researchers use persistent homology to map how the latent spaces of large language models change under adversarial attack

By the PulseAugur editorial team · [1 source] · 2026-04-27 04:00

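Researchers have developed a new method that uses persistent homology to analyze the internal workings of large language models (LLMs). The technique characterizes how adversarial inputs alter the geometry and topology of an LLM's latent space. The study finds that, regardless of model architecture or attack type, adversarial attacks consistently induce topological compression: the latent space becomes simpler, and its features collapse into fewer, larger-scale ones.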
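For readers unfamiliar with persistent homology, the following is a minimal sketch of its simplest case: 0-dimensional persistence, which records the scales at which separate clusters of a point cloud merge. It is not the authors' pipeline, which this summary does not describe. The function name `h0_death_times` and the toy points are hypothetical, and the toy points merely stand in for latent activations. The sketch assumes a Vietoris-Rips filtration in which an edge enters at a scale equal to its Euclidean length.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Finite death times of 0-dimensional persistent homology.

    Illustrative helper (not from the paper): every point is born as its
    own connected component at scale 0, and a component dies when it
    merges with another one as the scale grows.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Two components merge at this scale, so one H0 class dies.
            parent[ri] = rj
            deaths.append(length)
    return deaths


# Toy cloud: three nearby points and one distant outlier.
toy = [(0, 0), (1, 0), (0, 1), (10, 0)]
print(h0_death_times(toy))  # [1.0, 1.0, 9.0]
```

In the toy output, the two short bars (1.0) reflect the tight cluster and the long bar (9.0) reflects the outlier joining it. A cloud of n points always yields n - 1 finite death times, plus one component that never dies. Higher-dimensional features such as loops require a dedicated library and are outside this sketch.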

Impact: Introduces a novel topological approach to understanding LLM vulnerabilities and internal representations.

Ranking rationale: An academic paper published on arXiv that details a new interpretability method for LLMs.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.LG TIER_1 · Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod · 2026-04-27 04:00

The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology

arXiv:2505.20435v3 Announce Type: replace Abstract: Existing interpretability methods for Large Language Models (LLMs) predominantly capture linear directions or isolated features. This overlooks the high-dimensional, relational, and nonlinear geometry of model representations. W…