DAPO++
PulseAugur coverage of DAPO++ — every cluster mentioning DAPO++ across labs, papers, and developer communities, ranked by signal.
-
New CFPO framework enhances multimodal reasoning in LVLMs
Researchers have introduced CounterFactual Policy Optimization (CFPO), a new framework designed to improve multimodal reasoning in Large Vision-Language Models (LVLMs). CFPO addresses grounding failures and hallucinatio…
-
New RLVR method ACPO enhances LLM reasoning capabilities
Researchers have analyzed Reinforcement Learning with Verifiable Rewards (RLVR) to understand its impact on large language model reasoning. Their theoretical analysis revealed that the degree of off-policy learning, inf…
-
New DUPL method boosts multimodal reasoning in LLMs
Researchers have introduced DUPL, a novel policy learning approach designed to enhance multimodal reasoning in large language models. This method specifically addresses the challenge of distinguishing between uncertaint…
-
New RL methods enhance LLM training stability and efficiency · 7 sources tracked
Researchers have developed several new methods to improve the stability and efficiency of reinforcement learning (RL) in large language models (LLMs). STARE addresses policy entropy collapse by reweighting token-level a…
-
New SAGC method boosts synchronous RL training efficiency
Researchers have developed a new method called Straggler-Aware Group Control (SAGC) to improve the efficiency of synchronous on-policy reinforcement learning. SAGC dynamically adjusts the training group size during oper…
-
New RLVR methods enhance LLM reasoning via first-token diversification and credit assignment
Two new research papers explore methods to improve Reinforcement Learning with Verifiable Rewards (RLVR) for training reasoning models. The first paper introduces REFT (Rollout Exploration with First-Token Diversificati…
-
New framework optimizes RL post-training for LLMs
A new framework called Pilot-Commit optimizes how compute is allocated during reinforcement learning post-training of large language models. This method addresses th…
-
New RLVR methods boost LLM training efficiency and data selection
Researchers are developing new methods to improve the efficiency and effectiveness of Reinforcement Learning with Verifiable Rewards (RLVR) for training Large Language Models (LLMs). Two papers introduce novel data sele…
-
Anyscale launches skill to automate LLM post-training runs
Anyscale has introduced a new Anyscale Agent Skill designed to simplify and automate the process of generating LLM post-training runs. This skill assists users in selecting the most appropriate post-training method, suc…
-
New method stabilizes LLM reasoning by rescuing near-boundary signals
Researchers have identified a key bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR) that hinders LLM reasoning optimization. The study pinpoints rigid clipping decisions in standard hard-clipping metho…
-
New PRISM framework corrects SFT flaws in multimodal LLM training
New research from institutions including the Hong Kong University of Science and Technology (Guangzhou) reveals a critical flaw in the common post-training paradigm for multimodal large language models (MLLMs). The stan…
-
IBM releases Granite 4.1 LLMs with 512K context and Apache 2.0 license
IBM has released the Granite 4.1 family of large language models, comprising 3B, 8B, and 30B parameter versions. These models were trained on approximately 15 trillion tokens through a five-stage pre-training process th…