HölderPO unifies LLM policy optimization with Hölder mean

By PulseAugur Editorial · [1 source] · 2026-05-22 04:00

Researchers have introduced HölderPO, a policy optimization framework for large language models that unifies token-level probability aggregation through the Hölder mean. The Hölder mean's parameter gives continuous control over the trade-off between gradient concentration and variance, addressing the limitations of fixed aggregation mechanisms, which can lead to training collapse or suboptimal performance. A dynamic annealing algorithm schedules this parameter across the training lifecycle, and the authors report improved stability and convergence. In their evaluations, HölderPO achieves state-of-the-art accuracy on mathematical benchmarks and a high success rate on ALFWorld.
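As a rough illustration of the aggregation idea, the sketch below computes the generic Hölder (power) mean of a list of per-token probabilities for a few exponents. It is not the paper's training objective, which the summary does not spell out: the function name `holder_mean`, the example `token_probs` values, and the exponents tried are all assumptions made here for illustration.

```python
import math


def holder_mean(values, p):
    """Generic Hölder (power) mean of positive values with exponent p.

    Illustrative helper only; not taken from the HölderPO paper.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    n = len(values)
    if p == 0:
        # The limit p -> 0 is the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    if math.isinf(p):
        # p -> +inf gives the maximum, p -> -inf gives the minimum.
        return max(values) if p > 0 else min(values)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical per-token probabilities for one sampled response.
token_probs = [0.9, 0.8, 0.05, 0.95]

for p in (-1.0, 0.0, 1.0):
    print(f"p = {p:+.1f}  mean = {holder_mean(token_probs, p):.4f}")
```

For positive inputs the power mean is non-decreasing in p, so lower exponents let the least likely tokens dominate the aggregate, while higher exponents shift weight toward the most likely ones. A single exponent therefore interpolates between aggregation rules such as the harmonic (p = -1), geometric (p = 0), and arithmetic (p = 1) means.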

IMPACT Introduces a new optimization framework that improves LLM stability and performance on mathematical and reasoning tasks.

RANK_REASON The cluster contains an academic paper detailing a new method for optimizing large language models.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.LG TIER_1 English(EN) · Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang · 2026-05-22 04:00

Hölder Policy Optimisation

arXiv:2605.12058v2. Abstract: Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating …

