P^2O method enhances LLM reasoning by optimizing prompts and policies

作者 PulseAugur 编辑部 · [1 个来源] · 2026-05-08 04:00

Researchers have developed a new method called P^2O (Joint Policy and Prompt Optimization) to address the issue of advantage collapse in Reinforcement Learning with Verifiable Rewards (RLVR) for large language models. This technique alternates between continuous policy updates and discrete prompt evolution, using the GEPA algorithm to discover effective prompts for challenging samples. By distilling these prompts into the model's parameters, P^2O improves out-of-distribution generalization and achieves up to a 9.5% performance increase over existing methods. AI

影响 Introduces a novel approach to enhance LLM reasoning by combining prompt engineering with reinforcement learning, potentially improving performance on complex tasks.

排序理由 This is a research paper detailing a new method for improving LLM reasoning. [lever_c_demoted from research: ic=1 ai=1.0]

在 arXiv cs.LG 阅读 →

AI 生成摘要 · Google Gemini · 来自 1 个来源。我们如何撰写摘要 →

报道来源 [1]

arXiv cs.LG TIER_1 English(EN) · Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun · 2026-05-08 04:00

P^2O: Joint Policy and Prompt Optimization

arXiv:2603.21877v3 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) enhances Large Language Model (LLM) reasoning but suffers from advantage collapse on `"hard samples'' where all rollouts fail. This lack of variance eliminates crucial learni…

报道来源 [1]

P^2O: Joint Policy and Prompt Optimization

相关实体

相关话题