Researchers have developed G2D, a three-stage pipeline that combines GRPO and DPO for more efficient offline preference optimization of language models. The method runs a brief GRPO warm-up, constructs a static preference dataset, and then fine-tunes with DPO. Experiments on Qwen2.5-7B and Llama-3.1-8B showed that G2D can match or exceed the performance of full online GRPO at significantly reduced computational cost. It does so by prioritizing the informativeness of the preference data over its quantity.
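The second and third stages can be illustrated with a minimal sketch. The summary does not say how G2D selects its preference pairs, so the best-versus-worst-by-reward rule, the `min_gap` filter, and the function names `build_preference_pairs` and `dpo_loss` below are assumptions made for illustration, not the authors' method. The loss itself is the standard DPO objective for a single pair.

```python
import math


def build_preference_pairs(samples, min_gap=0.0):
    """Turn reward-scored samples into a static list of preference pairs.

    `samples` maps each prompt to a list of (response, reward) tuples,
    e.g. drawn from a warmed-up policy. ASSUMPTION: a pair is kept only
    when the best and worst rewards differ by more than `min_gap`; this
    is a hypothetical stand-in for an informativeness criterion.
    """
    pairs = []
    for prompt, scored in samples.items():
        if len(scored) < 2:
            continue
        best = max(scored, key=lambda s: s[1])
        worst = min(scored, key=lambda s: s[1])
        if best[1] - worst[1] > min_gap:
            pairs.append((prompt, best[0], worst[0]))
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model. Returns -log(sigmoid(margin)).
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    samples = {
        "p": [("a", 1.0), ("b", 0.0)],   # informative: rewards differ
        "q": [("c", 0.5), ("d", 0.5)],   # uninformative: rewards tie
    }
    print(build_preference_pairs(samples))
    print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```

Because the pairs are built once and then frozen, the DPO stage needs no further sampling from the policy, which is where an offline pipeline of this shape would save compute relative to online RL.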
Impact: Offers a compute-efficient alternative to online RL for language model training by improving data informativeness.
Ranking rationale: The cluster contains an academic paper detailing a new method for language model optimization.
AI-generated summary · Google Gemini · from 2 sources.