Brief · PulseAugur

RESEARCH · arXiv cs.AI · English (EN) · 4d · [2 sources]

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Researchers have proposed a view of large language model post-training that focuses on the distribution of states a model is trained on, rather than on tokens alone. The study suggests that the source and locality of training states can matter as much as the supervision signal itself. In experiments with Qwen3-0.6B-Base, on-policy distillation from a weaker teacher model still improved performance across multiple benchmarks, and lightweight reinforcement learning improved a specific task while preserving retention.

IMPACT This research offers a new lens for understanding and improving LLM post-training, potentially leading to more efficient and effective fine-tuning techniques.

MMLU
GSM8K
TruthfulQA
Qwen3-0.6B-Base