Researchers have proposed a new perspective on large language model post-training: treating it as a process of shaping the distribution of states rather than focusing solely on tokens. They tested this state-distribution shaping view with Qwen3-0.6B-Base on the GSM8K, TruthfulQA, and MMLU benchmarks. The study found that supervised fine-tuning (SFT) can cause retention loss when overdone, while on-policy distillation and reinforcement learning (RL) can improve performance without sacrificing retention.
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT This research offers a new theoretical lens for understanding and potentially improving LLM post-training techniques like SFT, RL, and distillation.
RANK_REASON Academic paper proposing a new theoretical framework for LLM post-training methods.