PulseAugur

New RLVR method enhances LLM reasoning with positive-negative prompt pairing

Researchers have developed a prompt-efficient method for reinforcement learning with verifiable rewards (RLVR) that improves the training of large language models on reasoning tasks. Instead of selecting prompts by variance, as earlier methods did, it selects prompts that supply both positive anchors and signals from rare failures. The method pairs hard-but-solvable prompts with easy-but-brittle ones and applies a weighting scheme that amplifies rare successes and rare failures. The result is better sample efficiency and significant performance gains on mathematical reasoning benchmarks.
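The summary does not state the paper's actual selection rule or weighting formula, so the following is only a minimal illustrative sketch of the two ideas described above. It assumes each prompt has an empirical pass rate from sampled rollouts, treats the 0.5 threshold as an arbitrary cut-off, and uses a simple inverse-frequency weighting as a stand-in for rare-event amplification. All names (`PromptStats`, `select_pair`, `rare_event_weights`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PromptStats:
    prompt_id: str
    pass_rate: float  # fraction of sampled rollouts the verifier accepted


def select_pair(stats, threshold=0.5):
    """Pick one hard-but-solvable and one easy-but-brittle prompt.

    Assumption: "hard-but-solvable" means a low but nonzero pass rate,
    and "easy-but-brittle" means a high pass rate that is still below 1.
    """
    solvable = [s for s in stats if 0.0 < s.pass_rate < threshold]
    brittle = [s for s in stats if threshold <= s.pass_rate < 1.0]
    if not solvable or not brittle:
        return None  # no valid pair in this pool
    hard = min(solvable, key=lambda s: s.pass_rate)  # rare successes
    easy = max(brittle, key=lambda s: s.pass_rate)   # rare failures
    return hard, easy


def rare_event_weights(rewards):
    """Signed per-rollout weights from binary rewards (1 = pass, 0 = fail).

    Assumption: inverse-frequency weighting, so the rarer outcome in the
    group gets the larger magnitude. Weights sum to zero within a group.
    """
    n = len(rewards)
    n_pos = sum(rewards)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return [0.0] * n  # all-pass or all-fail groups carry no contrast
    w_pos = n / (2 * n_pos)
    w_neg = n / (2 * n_neg)
    return [w_pos if r == 1 else -w_neg for r in rewards]


pool = [
    PromptStats("p0", 0.0),
    PromptStats("p1", 0.1),
    PromptStats("p2", 0.4),
    PromptStats("p3", 0.6),
    PromptStats("p4", 0.9),
    PromptStats("p5", 1.0),
]
hard, easy = select_pair(pool)
weights = rare_event_weights([1, 0, 0, 0])  # one rare success among four
```

In this sketch, prompts that always pass or always fail are excluded because they give no contrast between outcomes, and the single success in the example group receives a larger weight than any individual failure.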

Impact: Introduces a more sample-efficient training method for LLMs on reasoning tasks, potentially improving performance on complex problem-solving.

Ranking rationale: This is a research paper detailing a novel method for training large language models.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. arXiv cs.LG · TIER_1 · English (EN) · Yujuan Pang, Jiaxin Li, Xin Sheng, Ran Peng, Yong Ma

    Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

    arXiv:2602.03452v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) is effective for training large language models on deterministic outcome reasoning tasks. Prior work shows RLVR works with few prompts, but prompt selection is often based on…