PulseAugur
English(EN) Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Reasoning Arena improves LLM reasoning through trace tournaments

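Researchers have developed Reasoning Arena, a new framework for strengthening the reasoning ability of large language models. It addresses a limitation of reinforcement learning with verifiable rewards (RLVR): when different reasoning traces receive identical rewards, no gradient signal remains. Reasoning Arena turns these uninformative reward groups into useful training data by running trace tournaments of head-to-head comparisons, which yield a richer relative reward signal. The method improves training efficiency and benchmark performance, outperforming standard RLVR by 7.6% on average.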
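To make the missing-gradient problem concrete, here is a minimal sketch. It assumes a group-normalized advantage (rewards centered and scaled within one sampled group), which is a common RLVR baseline and not something the summary specifies; the function name `group_advantages` is hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one sampled group.

    Assumption: a group-normalized advantage; the summary does not
    state which RLVR baseline the paper compares against.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes carries a usable learning signal.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every trace gets the same reward carries none:
# every advantage is zero, so the policy-gradient update vanishes.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
```

In the uniform group every advantage is exactly zero, which is the case the summary describes as an uninformative reward group.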

Impact: Strengthens LLM reasoning by turning uninformative reward signals into useful training data, which could speed up development.

Ranking rationale: An academic paper detailing a new method for improving LLM reasoning.


AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

  1. arXiv cs.AI TIER_1 English(EN) · Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang ·

    Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

    arXiv:2606.09380v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become unin…

  2. arXiv cs.AI TIER_1 English(EN) · Albert Q. Jiang ·

    Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

    Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled tra…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

    Reasoning Arena improves reinforcement learning with verifiable rewards by using trace tournaments and Bradley-Terry models to generate meaningful gradients from non-diverse reward groups, resulting in faster training and better reasoning performance.
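
    The Bradley-Terry step mentioned above can be illustrated with a small sketch. It fits strengths from pairwise win counts using the standard minorization-maximization update and converts them into centered log-strength scores. How the paper judges matches, schedules the tournament, or maps strengths to rewards is not stated in these sources, so the win matrix, `fit_bradley_terry`, and `relative_rewards` are all hypothetical.

```python
import math


def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is how many times trace i beat trace j. Uses the
    classic MM update; assumes every trace has at least one win.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = n / sum(new_p)  # keep the strengths on a fixed scale
        p = [x * scale for x in new_p]
    return p


def relative_rewards(strengths):
    """Centered log-strengths: one possible relative reward signal."""
    logs = [math.log(s) for s in strengths]
    mu = sum(logs) / len(logs)
    return [value - mu for value in logs]


# Hypothetical tournament among three traces that all received the
# same verifiable reward: trace 0 mostly wins, trace 2 mostly loses.
wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
strengths = fit_bradley_terry(wins)
rewards = relative_rewards(strengths)
```

    Although the three traces tied on the verifiable reward, the fitted scores differ, so a group that previously contributed no gradient now ranks its traces.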