What did "scheming" and "mech interp" mean pre-2023?

AI safety terms such as "scheming" and "mech interp" have evolved

By the PulseAugur editorial team · [1 source] · 2026-06-26 22:09

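The terminology used in AI safety discussions has evolved, particularly for concepts such as "scheming" and "mechanistic interpretability". Previously, "scheming" referred to training-gaming in pursuit of out-of-context goals, but it can now also describe in-context goal pursuit during testing or deployment, while "alignment faking" has emerged as a related but distinct term. Similarly, "mechanistic interpretability" originally focused on reverse-engineering a network's internal mechanisms, but has since broadened to include any technique that inspects model internals to understand behavior. As a result, older texts may use these terms with meanings that differ from current usage.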

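Impact: Understanding how AI safety terminology has evolved is essential for interpreting both past research and current discussions of alignment and model behavior.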

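Ranking rationale: This item discusses how terminology in the AI safety field has evolved and offers a view on how the meanings of terms such as "scheming" and "mechanistic interpretability" have changed over time.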


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

LessWrong (AI tag) · TIER_1 · English (EN) · Cleo Nardo · 2026-06-26 22:09

What did "scheming" and "mech interp" mean pre-2023?

This was too long to be a short-form, but it should really be a short-form. This notice is useful for people who've recently got into AI safety, who want to engage with the ancient texts (i.e. pre-2024). If you were around before 2023, then you …

