
TOOL · arXiv cs.CL

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Researchers have introduced Self-Distillation Zero (SD-Zero), a method for making language model training more sample-efficient. It trains a single model to act as both generator and reviser, so that self-revision turns a binary reward into dense, token-level supervision. The authors report that SD-Zero outperforms baselines such as Rejection Fine-Tuning and GRPO on math and code reasoning tasks with a comparable training sample budget.
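To make the contrast between a binary reward and token-level supervision concrete, here is a minimal sketch. It rests on assumptions that are not in the brief: that the reviser's per-token distribution serves as a target for the generator, and that the mismatch is measured with a per-token KL divergence. The function name `per_token_kl` and the toy distributions are hypothetical, chosen only for illustration; the paper's actual objective may differ.

```python
import math

def per_token_kl(teacher_probs, student_probs):
    """Return KL(teacher || student) at each token position.

    Assumption: the reviser plays the role of teacher and the generator
    the role of student. This is an illustrative choice, not taken from
    the brief.
    """
    losses = []
    for t_dist, s_dist in zip(teacher_probs, student_probs):
        kl = sum(p * math.log(p / q) for p, q in zip(t_dist, s_dist) if p > 0)
        losses.append(kl)
    return losses

# Hypothetical toy data: 3 token positions over a 3-word vocabulary.
# The generator (student) and reviser (teacher) disagree only at position 1.
student = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

# A binary reward is one scalar for the whole response.
binary_reward = 0

# A dense signal has one value per token, so it localizes the error.
dense_signal = per_token_kl(teacher, student)

print("binary reward:", binary_reward)
print("per-token signal:", [round(x, 4) for x in dense_signal])
```

In this toy case the scalar reward says only that the response failed, while the per-token values are zero wherever the two distributions agree and positive at the one position where the reviser would have chosen differently.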

IMPACT This method could lead to more sample-efficient training of large language models, potentially reducing the computational cost and time required for model development.

Qwen3-4B-Instruct
GRPO
Olmo-3-7B-Instruct
Self-Distillation Fine-Tuning
Yinghui He
Rejection Fine-Tuning
Self-Distillation Zero