Brief · PulseAugur

TOOL · r/MachineLearning · 7h

Contrastive targeted SFT as a mechinterp method - has anyone mapped causal dependency interactions this way? [D]

A machine learning practitioner is exploring a method for understanding and controlling AI model behavior by mapping causal dependencies between capabilities. The approach uses contrastive supervised fine-tuning (SFT) to isolate specific circuits within a 31B-parameter model. By training variants that emphasize or de-emphasize certain dimensions and then ablating the circuits identified this way, the practitioner aims to build a causal dependency graph of model capabilities. That graph could then inform the order in which capabilities are trained in future models and improve behavioral control.
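The last step, turning ablation results into a dependency graph and a training order, can be sketched roughly as follows. This is only an illustration, not the poster's actual pipeline: the capability names, the `effect` scores, the 0.2 threshold, and both helper functions are hypothetical, and it assumes the ablation effects have already been measured and contain no circular dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical ablation results (made-up numbers, not from the post):
# effect[a][b] = fractional drop in capability b's score when the
# circuit isolated for capability a is ablated.
effect = {
    "syntax":    {"syntax": 0.90, "reasoning": 0.40, "style": 0.30},
    "reasoning": {"syntax": 0.02, "reasoning": 0.85, "style": 0.05},
    "style":     {"syntax": 0.01, "reasoning": 0.03, "style": 0.80},
}


def dependency_graph(effect, threshold=0.2):
    """Map each capability to the set of capabilities it depends on.

    Capability b is treated as depending on a when ablating a's circuit
    drops b's score by at least `threshold` (an assumed cutoff).
    """
    deps = {cap: set() for cap in effect}
    for ablated, drops in effect.items():
        for cap, drop in drops.items():
            if cap != ablated and drop >= threshold:
                deps[cap].add(ablated)
    return deps


def training_order(deps):
    """Return capabilities ordered so prerequisites come first.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


deps = dependency_graph(effect)
order = training_order(deps)
print(deps)
print(order)
```

With these made-up numbers, "reasoning" and "style" both depend on "syntax", so "syntax" comes first in the suggested order. A real analysis would also have to handle noisy measurements and mutual dependencies, which a plain topological sort rejects.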

IMPACT If the approach works, mapping internal causal dependencies could make AI model behavior more predictable and controllable.

Substantial_Diver469