Language-Switching Triggers Take a Latent Detour Through Language Models

New XAI Framework Explains Behavioral Shifts in Large Language Models for Governance

By the PulseAugur Editorial Team · [3 sources] · 2026-05-18 16:53

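A new paper introduces a comparative XAI (XAIΔ) framework for explaining how the behavior of large language models shifts after interventions such as fine-tuning or scaling. The authors argue that current methods fall short because they either treat models as static or merely compare explanations without describing the transition itself. The framework aims to provide a principled way to document the causal chain behind a model modification, which is critical for regulatory compliance.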

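Impact: Establishes a new standard for explaining changes in large language models, which is critical for regulatory compliance and model auditing.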

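Ranking rationale: The cluster contains an academic paper that proposes a new research framework for explaining behavioral shifts in large language models.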

AI-generated summary · Google Gemini · based on 3 sources.

Sources [3]

arXiv cs.CL TIER_1 English(EN) · Francis Kulumba, Wissam Antoun, Théo Lasnier, Benoît Sagot, Djamé Seddah · 2026-05-26 04:00

Language-Switching Triggers Take a Latent Detour Through Language Models

arXiv:2605.18646v2 Announce Type: replace Abstract: Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switch…

arXiv cs.AI TIER_1 English(EN) · Martino Ciaperoni, Marzio Di Vece, Roberto Pellungrini, Luca Pappalardo, Fosca Giannotti, Francesco Giannini · 2026-05-22 04:00

Comparing Explanations is Not Enough, Explain the Change: New Standards are Needed to Explain Behavioral Shifts in Large Language Models

arXiv:2602.02304v2 Announce Type: replace Abstract: Large-scale foundation models exhibit behavioral shifts when subjected to interventions such as scaling, fine-tuning, reinforcement learning with human feedback, or in-context learning. Current explainability methods are …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-18 16:53

Language-Switching Triggers Take a Latent Detour Through Language Models

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive langu…