Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 7h

Stage-1 Controls the Entropy Regime, Not the Outcome

A new research paper examines how the choice of Stage-1 training method affects vision-language models (VLMs). The study finds that supervised fine-tuning (SFT) and on-policy distillation (OPD) lead to similar in-domain performance, but the choice significantly influences the model's entropy regime. Specifically, OPD yields higher policy entropy and greater answer diversity than SFT, although these advantages diminish after the Stage-2 reinforcement learning phase.
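
For readers unfamiliar with the two quantities compared here, the following is a minimal sketch of how policy entropy and answer diversity could be measured. It is not taken from the paper: the function names, the toy probability distributions, and the definition of diversity as the fraction of distinct final answers are all assumptions made for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(step_distributions):
    """Average per-token entropy over the steps of one sampled response."""
    total = sum(token_entropy(dist) for dist in step_distributions)
    return total / len(step_distributions)

def answer_diversity(answers):
    """Fraction of distinct final answers among sampled responses.

    Assumed definition for illustration; the paper may use another metric.
    """
    return len(set(answers)) / len(answers)

# Hypothetical next-token distributions over a 4-token vocabulary.
peaked = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.10, 0.00, 0.00]]  # low-entropy policy
spread = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]  # high-entropy policy

print(mean_policy_entropy(peaked))
print(mean_policy_entropy(spread))
print(answer_diversity(["12", "12", "15", "12"]))
```

Under these assumed definitions, a policy whose probability mass is spread across more tokens scores higher on entropy, and a set of samples with more distinct final answers scores higher on diversity.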

IMPACT This research clarifies the role of early-stage training in VLM development: the Stage-1 method shapes model behavior such as entropy and answer diversity, but its effect on final performance may be limited.

MathVista
Qwen2.5-VL-7B
Geometry3K