Brief · PulseAugur

TOOL · arXiv cs.LG English(EN) · 3d

Hölder Policy Optimisation

Researchers have introduced HölderPO, a policy-optimization framework for large language models that unifies token-level probability aggregation under the Hölder mean. The Hölder mean parameter gives continuous control over the trade-off between gradient concentration and variance, addressing a limitation of fixed aggregation mechanisms, which can lead to training collapse or suboptimal performance. A dynamic annealing algorithm schedules this parameter over the course of training, which the authors report improves stability and convergence. In their evaluations, HölderPO reaches state-of-the-art accuracy on mathematical benchmarks and a high success rate on ALFWorld.
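To make the aggregation idea concrete, here is a minimal sketch of the standard Hölder (power) mean applied to a handful of token probabilities. It is not the paper's objective: the brief does not give HölderPO's loss or its annealing rule, so `holder_mean`, `annealed_p`, the linear schedule, and the sample values are all illustrative assumptions.

```python
import math


def holder_mean(values, p):
    """Power (Hölder) mean of strictly positive values with exponent p.

    p = 1 gives the arithmetic mean, p -> 0 the geometric mean,
    p = -1 the harmonic mean, and p -> +/-inf the max / min.
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    n = len(values)
    if p == 0:
        # Limit case: geometric mean, computed in log space.
        return math.exp(sum(math.log(v) for v in values) / n)
    if math.isinf(p):
        return max(values) if p > 0 else min(values)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def annealed_p(step, total_steps, p_start, p_end):
    """Hypothetical linear schedule for the exponent (an assumption,
    not the annealing algorithm from the paper)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


# Illustrative token probabilities for one sampled response.
token_probs = [0.9, 0.5, 0.1]

for step in (0, 5, 10):
    p = annealed_p(step, total_steps=10, p_start=1.0, p_end=-1.0)
    print(f"step={step:2d}  p={p:+.1f}  mean={holder_mean(token_probs, p):.4f}")
```

Lower exponents pull the aggregate toward the smallest token probability and higher exponents toward the largest, which is the single knob the summary describes as trading gradient concentration against variance.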

IMPACT Introduces a new optimization framework that improves LLM stability and performance on mathematical and reasoning tasks.

ALFWorld
GRPO
Yuxiang Chen
HölderPO