PulseAugur

AI safety terms like "scheming" and "mech interp" have evolved

The terminology used in AI safety discussions has shifted, particularly for "scheming" and "mechanistic interpretability." Previously, "scheming" referred to training-gaming for out-of-context goals; it can now also describe in-context goal pursuit during testing or deployment, with "alignment faking" emerging as a related but distinct term. Similarly, "mechanistic interpretability" initially meant reverse-engineering a network's internal mechanisms, but has broadened to cover any technique that examines model internals to understand behavior. As a result, older texts may use these terms with different implications than current usage.

IMPACT Understanding the evolution of AI safety terminology is crucial for interpreting past research and current discussions on alignment and model behavior.

RANK_REASON The item discusses evolving terminology within the AI safety field, offering an opinion on how terms like 'scheming' and 'mechanistic interpretability' have changed meaning over time.

Read on LessWrong (AI tag) →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. LessWrong (AI tag) · TIER_1 · English (EN) · Cleo Nardo

    What did "scheming" and "mech interp" mean pre-2023?

    This was too long to be a short-form, but it should really be a short-form.

    This notice is useful for people who've recently got into AI safety, who want to engage with the ancient texts (i.e. pre-2024). If you were around before 2023, then you …