GRPO
PulseAugur coverage of GRPO: every cluster mentioning GRPO across labs, papers, and developer communities, ranked by signal.
6 days with sentiment data
GRPO to be integrated into Anyscale's LLM post-training automation
The recent Anyscale Agent Skill launch focuses on automating LLM post-training runs, while another cluster details GRPO's use in multi-agent LLM deferral to humans. Given GRPO's demonstrated ability to incorporate human expertise and Anyscale's push for automation, it's plausible GRPO will be integrated as a method within Anyscale's automated post-training workflows to enhance human-in-the-loop capabilities.
GROW framework to see adoption for VLM agent development beyond Minecraft
The GROW framework, leveraging adapted GRPO, has shown state-of-the-art performance on over 800 Minecraft tasks for VLM agents. This success in a complex, open-world environment suggests potential for broader application in other VLM agent development scenarios, such as robotics, simulation, or other interactive environments where multi-turn learning and handling long contexts are critical.
GRPO and its variants (HölderPO, GROW) are central to recent LLM policy optimization research
Multiple recent clusters highlight GRPO and its derivatives (HölderPO, GROW) as key advancements in LLM policy optimization. This indicates a strong research trend focusing on refining reinforcement learning techniques for LLMs, particularly in areas like multi-agent interaction, handling complex reward structures, and improving stability and adaptability in diverse tasks.
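For readers new to the method these clusters build on, the "group-relative" part of GRPO can be sketched as follows: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group, so no separate value network is needed. This is a minimal illustration, not the code of any paper listed here; the function name, the `eps` constant, and the choice of population standard deviation are assumptions made for the example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds one scalar reward per sampled response to the SAME prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer gets the same reward carries no learning signal.
flat_advantages = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, while a group with identical rewards yields all zeros.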
-
Anyscale launches skill to automate LLM post-training runs
Anyscale has introduced a new Anyscale Agent Skill designed to simplify and automate the process of generating LLM post-training runs. This skill assists users in selecting the most appropriate post-training method, suc…
-
New GRTO framework unifies RL with differentiable tool use for segmentation
Researchers have developed a new framework called Group Relative Tool Optimization (GRTO) to improve referring segmentation tasks in computer vision. This method integrates reinforcement learning with differentiable too…
-
HölderPO unifies LLM policy optimization with Hölder mean
Researchers have introduced HölderPO, a novel framework for optimizing large language models by unifying token-level probability aggregation through the Hölder mean. This approach offers continuous control over the trad…
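As background, the Hölder (power) mean itself is a standard one-parameter family of averages: the exponent `p` moves continuously between the harmonic mean (`p = -1`), the geometric mean (the limit `p → 0`), and the arithmetic mean (`p = 1`). The sketch below shows only that textbook definition applied to a list of token probabilities; how HölderPO actually uses the mean inside its objective is not stated in the summary above, and the function name and sample probabilities are hypothetical.

```python
import math


def holder_mean(values, p):
    """Power mean M_p of positive values; p == 0 is treated as the geometric-mean limit."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty sequence of positive numbers")
    n = len(values)
    if p == 0:
        # Limit of the power mean as p -> 0: exp of the mean log.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical per-token probabilities for one sampled response.
token_probs = [0.9, 0.5, 0.1]

arithmetic = holder_mean(token_probs, 1)   # dominated by high-probability tokens
geometric = holder_mean(token_probs, 0)    # equivalent to averaging log-probabilities
harmonic = holder_mean(token_probs, -1)    # dominated by low-probability tokens
```

Lower values of `p` weight the least likely tokens more heavily, which is one way to picture the continuous control the summary refers to.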
-
New GROW framework boosts VLM agents with adapted GRPO
Researchers have introduced GROW, a novel reinforcement learning framework designed to enhance the capabilities of vision-language model (VLM) agents in open-world tasks. Unlike previous methods that relied heavily on s…
-
Study shows training data curriculum fine-tunes RL agent specialization
A new study on arXiv explores how different training data curricula impact the performance of reinforcement learning (RL) agents designed to work with large language models (LLMs) and external memory banks. The research…
-
Vector Policy Optimization trains LLMs for diverse outputs
Researchers have introduced Vector Policy Optimization (VPO), a novel reinforcement learning algorithm designed to enhance the diversity of language model outputs. Unlike traditional methods that optimize for a single s…
-
New F-TIS method enables heterogeneous models in GRPO training
Researchers have introduced Filtered Truncated Importance Sampling (F-TIS), a new training paradigm designed for Reinforcement Learning from Human Feedback (RLHF) methods like GRPO. F-TIS addresses the challenge of trai…
-
Multi-agent LLM learns to defer to humans using GRPO
Researchers have developed a multi-agent large language model that learns to defer to human input. The model is trained using GRPO on a reward system that accounts for costs, and each instance of deferral is used as sup…
-
New G2D pipeline optimizes language models with less compute
Researchers have developed G2D, a three-stage pipeline that combines GRPO and DPO for more efficient offline preference optimization in language models. This method involves a brief GRPO warm-up, followed by constructin…
-
New RLVR framework POW3R adapts rewards for faster learning
Researchers have developed a new framework called POW3R to improve reinforcement learning with verifiable rewards (RLVR). This method addresses the issue where static rubric rewards in RLVR may not effectively guide tra…
-
Reinforcement learning trains small models for text-to-SPARQL generation
Researchers have explored using reinforcement learning to train smaller language models for zero-shot Text-to-SPARQL generation, a task crucial for knowledge graph question answering. They applied Group-Relative Policy …
-
New self-distillation methods boost LLM performance on reasoning tasks
Researchers have developed new self-distillation techniques for large language models to improve their performance without relying on external feedback. AVSD (Adaptive-View Self-Distillation) balances consensus signals …
-
New methods enhance language model reasoning with pairwise advantage estimation
Researchers have introduced LamPO (Lambda Style Policy Optimization) and LambdaPO, novel methods for enhancing reasoning in language models. These approaches move beyond traditional group-relative objectives by using pa…
-
New AMR-SD method improves LLM reasoning by refining token-level credit assignment
Researchers have developed a new method called Asymmetric Meta-Reflective Self-Distillation (AMR-SD) to improve the alignment of Large Language Models (LLMs) for complex reasoning tasks. Traditional methods struggle wit…
-
New benchmarks test VLM spatial reasoning, robustness, and consistency
Researchers have developed new benchmarks to evaluate the spatial reasoning capabilities of vision-language models (VLMs). ArchSIBench focuses on architectural space understanding, while Flat-Pack Bench assesses spatio-…
-
New PRISM framework corrects SFT flaws in multimodal LLM training
New research from institutions including the Hong Kong University of Science and Technology (Guangzhou) reveals a critical flaw in the common post-training paradigm for multimodal large language models (MLLMs). The stan…
-
New method speeds up VLA RL by focusing gradient computation
Researchers have developed a new method called Probabilistic Chunk Masking (PCM) to make reinforcement learning for vision-language-action (VLA) policies more efficient. This technique focuses gradient computation on th…
-
New method enhances vision-language models with group revision
Researchers have introduced a new group-revision optimization paradigm to improve object-level grounding in large vision-language models. This method addresses the limitations of sparse, response-level rewards in existi…
-
New E²PO framework enhances generative model alignment with human preference
Researchers have introduced a new framework called Embedding-perturbed Exploration Preference Optimization (E²PO) to address limitations in aligning generative models with human intent using reinforcement learning. Exis…
-
GEPA optimizes AI prompts by analyzing failed trajectories
Researchers have developed GEPA, a new method for optimizing prompts in complex AI systems. GEPA analyzes failed execution paths and automatically refines the prompts of the specific modules responsible for the errors. …