Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

New pipeline uses interpretability to shape language model learning

By PulseAugur Editorial · [2 sources] · 2026-06-10 17:31

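Researchers have developed a new data-centric post-training pipeline for language models that uses interpretability to understand and shape the learning signal. The method lets practitioners inspect preference datasets before optimization, enabling fine-grained user feedback on desired behaviors. The pipeline can diagnose undesired signals in existing data, mitigate off-target learning, and amplify specific model properties such as safety measures and personality.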

Impact: Auditing the learning signal itself enables more controllable and transparent shaping of AI behavior.

Ranking rationale: This cluster contains an academic paper detailing a new method in AI research.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.LG TIER_1 English(EN) · Leon Bergen, Usha Bhalla, Sidharth Baskaran, Max Loeffler, Raphael Sarfati, Dhruvil Gala, Ryan Panwar, Santiago Aranguri, Thomas Fel, Atticus Geiger, Matthew Kowal, Siddharth Boppana, Daniel Balsam, Owen Lewis, Jack Merullo, Thomas McGrath, Ekdeep Singh … · 2026-06-11 04:00

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

arXiv:2606.12360v1 Announce Type: new Abstract: Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility…

arXiv cs.LG TIER_1 English(EN) · Ekdeep Singh Lubana · 2026-06-10 17:31

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, a…