Group Relative Policy Optimization (GRPO)
PulseAugur coverage of Group Relative Policy Optimization (GRPO): every cluster mentioning GRPO across labs, papers, and developer communities, ranked by signal.
4 days with sentiment data
-
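SafeDiffusion-R1 improves image model safety through online reward guidance
Researchers have developed SafeDiffusion-R1, a new framework for improving the safety of diffusion models. The method uses online reinforcement learning based on Group Relative Policy Optimization (GRPO) to steer the model away from generating unsafe content. By leveraging CLIP embeddings, it avoids the need for costly paired data or a dedicated reward model, significantly reducing the generation of inappropriate content while maintaining or improving overall image quality.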
-
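AI agents show promise in supply chains but face reliability risks
A new research paper examines the use of autonomous generative AI agents in supply chain management, evaluating their performance with the MIT Beer Game. The study finds that although advanced AI models can surpass human-level performance and cut costs by up to 67%, they also introduce significant reliability risks, termed the "agentic bullwhip effect." To mitigate these issues, the researchers propose using Group Relative Policy Optimization (GRPO), a reinforcement learning post-training framework, to improve the stability and reliability of these AI agents.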
-
New SLAS method enhances text-to-image model training
Researchers have developed a new method called Super-Linear Advantage Shaping (SLAS) to improve text-to-image models trained with reinforcement learning. This technique addresses reward hacking by reshaping the policy s…
-
LoRA rank allocation fails in RL fine-tuning, study finds
A new study on the Qwen 2.5 1.5B model reveals that adaptive rank allocation techniques, effective in supervised fine-tuning, do not translate to reinforcement learning with Group Relative Policy Optimization (GRPO). Re…
-
New SRPO method enhances multimodal reasoning in vision-language models
Researchers have introduced Structured Role-aware Policy Optimization (SRPO), a novel method to enhance the reasoning abilities of large vision-language models (LVLMs). SRPO addresses the limitation of current reinforce…