Hölder Policy Optimisation
Researchers have introduced HölderPO, a framework for optimizing large language models that unifies token-level probability aggregation through the Hölder mean. The Hölder mean parameter gives continuous control over the trade-off between gradient concentration and variance, addressing a limitation of fixed aggregation mechanisms, which can lead to training collapse or suboptimal performance. A dynamic annealing algorithm schedules the parameter over the course of training, and the authors report improved stability and convergence. In their evaluations, HölderPO achieves state-of-the-art accuracy on mathematical benchmarks and a high success rate on ALFWorld.
AI IMPACT: Introduces a new optimization framework that improves LLM stability and performance on mathematical and reasoning tasks.
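For readers unfamiliar with the Hölder (power) mean, the sketch below shows how a single exponent p interpolates between familiar ways of aggregating per-token probabilities: p = 1 gives the arithmetic mean, p approaching 0 gives the geometric mean, and p = -1 gives the harmonic mean. It is an illustration of the general formula only, not the paper's implementation. The function names, the linear schedule, and the start and end values of p are all assumptions made for this example; the summary does not say which schedule or direction HölderPO uses.

```python
import math


def holder_mean(values, p):
    """Hölder (power) mean of strictly positive values with exponent p.

    M_p(x) = (mean(x_i ** p)) ** (1 / p), with the p -> 0 limit
    equal to the geometric mean.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    n = len(values)
    if abs(p) < 1e-12:
        # Limit case p -> 0: geometric mean, computed in log space.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def annealed_p(step, total_steps, p_start, p_end):
    """Hypothetical linear schedule for p (an assumption, not the paper's)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


if __name__ == "__main__":
    # Made-up token probabilities for one sampled response.
    token_probs = [0.9, 0.5, 0.1]
    for step in (0, 50, 100):
        # Start and end values are arbitrary choices for illustration.
        p = annealed_p(step, 100, p_start=1.0, p_end=-1.0)
        print(f"step={step:3d}  p={p:+.1f}  mean={holder_mean(token_probs, p):.4f}")
```

The Hölder mean is non-decreasing in p, so lowering p pulls the aggregate toward the least likely token and raising p pulls it toward the most likely one. That single knob is what lets one formula cover several fixed aggregation rules.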