Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
Researchers have introduced Self-Distillation Zero (SD-Zero), a method for improving the sample efficiency of language model training. It trains a single model to act as both a generator and a reviser, using self-revision to turn binary rewards into dense, token-level supervision. SD-Zero shows significant gains on math and code reasoning tasks, outperforming baselines such as Rejection Fine-Tuning and GRPO with a comparable training sample budget.
AI IMPACT: This method could make training of large language models more sample-efficient, potentially reducing the computational cost and time required for model development.
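To make the contrast between a binary reward and dense, token-level supervision concrete, here is a minimal Python sketch. It is not the SD-Zero algorithm, whose details are not given in this summary. The function names, the toy three-token vocabulary, the probability values, and the choice of a per-token KL divergence against a revised distribution are all assumptions made purely for illustration.

```python
import math

def sequence_level_signal(reward, num_tokens):
    """Broadcast one binary reward to every token (sparse supervision).

    Hypothetical helper: each position receives the same scalar, so the
    signal cannot say which tokens caused success or failure.
    """
    return [float(reward)] * num_tokens

def token_level_kl(generator_probs, reviser_probs):
    """Per-token KL(reviser || generator) over a toy vocabulary.

    Hypothetical helper: each position gets its own value, which is what
    "dense" supervision means in this sketch.
    """
    kls = []
    for gen, rev in zip(generator_probs, reviser_probs):
        kl = sum(r * math.log(r / g) for r, g in zip(rev, gen) if r > 0)
        kls.append(kl)
    return kls

# Toy next-token distributions at three positions (assumed values).
generator = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
# The assumed revised distributions differ only at the second position.
reviser = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]

sparse = sequence_level_signal(reward=0, num_tokens=3)
dense = token_level_kl(generator, reviser)

print("sequence-level signal:", sparse)
print("token-level signal:  ", [round(k, 4) for k in dense])
```

In this toy setup the sequence-level signal is identical at every position, while the token-level signal is nonzero only where the revised distribution disagrees with the original one, which is the kind of localized feedback that dense supervision is meant to provide.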