GRPO
PulseAugur coverage of GRPO: every cluster mentioning GRPO across labs, papers, and developer communities, ranked by signal.
6 days of sentiment data
GRPO to be integrated into Anyscale's LLM post-training automation
The recent Anyscale Agent Skill launch focuses on automating LLM post-training runs, while another cluster details GRPO's use in multi-agent LLM deferral to humans. Given GRPO's demonstrated ability to incorporate human expertise and Anyscale's push for automation, it's plausible GRPO will be integrated as a method within Anyscale's automated post-training workflows to enhance human-in-the-loop capabilities.
GROW framework to see adoption for VLM agent development beyond Minecraft
The GROW framework, leveraging adapted GRPO, has shown state-of-the-art performance on over 800 Minecraft tasks for VLM agents. This success in a complex, open-world environment suggests potential for broader application in other VLM agent development scenarios, such as robotics, simulation, or other interactive environments where multi-turn learning and handling long contexts are critical.
GRPO and its variants (HölderPO, GROW) are central to recent LLM policy optimization research
Multiple recent clusters highlight GRPO and its derivatives (HölderPO, GROW) as key advancements in LLM policy optimization. This indicates a strong research trend focusing on refining reinforcement learning techniques for LLMs, particularly in areas like multi-agent interaction, handling complex reward structures, and improving stability and adaptability in diverse tasks.
-
LLMs fine-tuned for traffic control with critic-guided reinforcement learning
Researchers have developed DGLight, a novel framework that fine-tunes large language models for traffic signal control. This approach utilizes a Deep Q-Network critic to guide the optimization process, enabling the mode…
-
New training methods boost VLM mobile agents' interactive and safety capabilities
Researchers have developed two new approaches for enhancing the capabilities of vision-language model (VLM)-based mobile agents. Mobile-R1 introduces a hierarchical curriculum to improve exploration and self-correction,…
-
SEVerA framework verifies self-evolving AI agents for safety and correctness
Researchers have introduced SEVerA, a framework designed to synthesize self-evolving AI agents with formal safety and correctness guarantees. This approach treats agentic code generation as a constrained learning proble…
-
New method uses hidden states to improve AI reasoning credit assignment
Researchers have developed a new method called Span-level Hidden state Enabled Advantage Reweighting (SHEAR) to improve credit assignment in reinforcement learning for language models. SHEAR leverages the Wasserstein di…
-
Researchers use SHAP and RL to improve robot generalization and affordance reasoning
Researchers have developed a framework using SHapley Additive exPlanations (SHAP) to analyze and improve the generalizability of reinforcement learning (RL) algorithms in robotics. This approach quantifies the impact of…
-
V-GRPO method enhances denoising generative models with faster, stable reinforcement learning
Researchers have introduced V-GRPO, a novel online reinforcement learning method designed to align denoising generative models with desired outcomes. This approach overcomes previous limitations by efficiently utilizing…
-
Controllable Spoken Dialogue Generation: An LLM-Driven Grading System for K-12 Non-Native English Learners
Researchers have developed a new LLM-driven framework to adapt spoken dialogue generation for K-12 English learners in non-native environments. This system uses China's national curriculum to control lexical complexity …
-
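DVPO and EVPO advance LLM post-training with novel RL optimization techniques
Researchers have introduced DVPO, a new reinforcement learning framework designed to improve the post-training of large language models (LLMs), particularly when supervision signals are noisy or incomplete. DVPO uses distributional value modeling and asymmetric risk regularization to balance robustness and generalization, aiming to avoid the overly conservative policies that existing methods can produce. Experiments on dialogue, mathematical reasoning, and scientific question-answering tasks show that DVPO outperforms standard methods such as PPO and GRPO under noisy conditions.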
-
Researchers propose Objective-aware Trajectory Credit Assignment for visual generation
Researchers have developed a new framework called Objective-aware Trajectory Credit Assignment (OTCA) to improve the training of visual generative models using reinforcement learning. Current methods often assign reward…
-
Kwai AI's SRPO achieves DeepSeek-R1-Zero performance with 10x fewer training steps
Researchers from Kuaishou's Kwaipilot team have developed a novel reinforcement learning framework called SRPO, designed to improve the efficiency and performance of large language models. This new method addresses limi…