Brief · PulseAugur

RESEARCH · arXiv cs.LG · 14h · [2 sources]

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

Researchers have developed a data-centric pipeline for post-training language models that uses interpretability to characterize and shape the learning signal. The method makes it possible to inspect preference datasets before optimization, so users can give fine-grained feedback on the behaviors they want. The pipeline can diagnose undesirable signals in existing data, mitigate off-target learning, and amplify specific model properties such as safeguards and personality.

IMPACT Enables more controlled and transparent shaping of AI behavior by auditing the learning signal itself.

Ekdeep Singh Lubana
arXiv