New research from institutions including the Hong Kong University of Science and Technology (Guangzhou) identifies a flaw in the common post-training paradigm for multimodal large language models (MLLMs). The standard recipe of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) can inadvertently harm performance: SFT introduces distributional drift, so models superficially mimic correct answers rather than truly understand them. The issue is most pronounced in stronger models, where SFT can degrade capabilities before RL even begins. The proposed PRISM framework inserts a distribution alignment stage between SFT and RL and uses a novel mixture-of-experts discriminator to correct perceptual and reasoning errors separately, thereby improving overall model performance.
AI impact: This research suggests that multimodal LLM post-training can be improved by correcting a previously overlooked flaw in the SFT-to-RL pipeline, potentially leading to more robust and capable models.
Ranking rationale: The cluster describes a new research paper proposing a novel framework (PRISM) that improves the training of multimodal large language models by addressing issues in the SFT-to-RL pipeline.
- DAPO
- DeepSeek
- Gemini 3 Flash
- Hong Kong University of Science and Technology (Guangzhou)
- multimodal large language models
- Nanyang Technological University
- PRISM
- Qwen
- Supervised Fine-Tuning (SFT)
- Tsinghua University
- GRPO
AI-generated summary · Google Gemini · from 1 source.