PulseAugur

New RL method teaches LLMs to self-correct answers

Researchers have developed SCoRe, a two-stage on-policy reinforcement learning technique that teaches language models to revise their own answers using only self-generated data. Applied to Gemini 1.5 Flash and 1.0 Pro, it gains 15.6 points on MATH and 9.1 on HumanEval over the base model. A separate study, Let's Verify Step by Step, compares process and outcome supervision on MATH: the process-reward model reaches 78.2% at best-of-1860 versus 72.4% for outcome supervision, though the gap narrows quickly at small N.

Impact: New self-correction techniques could enhance LLM reasoning capabilities and reduce the need for extensive human supervision in training.

Ranking rationale: The cluster contains two academic papers detailing new methods for improving language model reasoning and self-correction.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. Mastodon (fosstodon.org) · TIER_1 · English (EN) · [email protected]

    SCoRe is a two-stage on-policy RL recipe that teaches a language model to revise its own answers using only self-generated data. On Gemini 1.5 Flash and 1.0 Pro it gains 15.6 points on MATH and 9.1 on HumanEval over the base model. At matched inference budgets, sequential self-co…

  2. Mastodon (fosstodon.org) · TIER_1 · English (EN) · [email protected]

    Let's Verify Step by Step compares process and outcome supervision on MATH. The process-reward model reaches 78.2% best-of-1860 vs 72.4% for outcome. But that gap narrows fast at small N, where most deployments actually live. https://benjaminhan.net/posts/20260512-lets-verify-s…
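
The best-of-N comparison in the second source can be illustrated with a minimal sketch. It assumes a process-reward model scores a solution as the product of its per-step correctness probabilities, while an outcome-reward model assigns one score to the whole solution. The `Candidate` type, the helper names, and every number in the toy data are hypothetical and for illustration only; they are not taken from either paper.

```python
import math
from typing import Callable, Sequence, TypedDict


class Candidate(TypedDict):
    answer: str
    step_probs: Sequence[float]  # hypothetical per-step correctness scores
    outcome_prob: float          # hypothetical whole-solution score


def prm_score(c: Candidate) -> float:
    # Process supervision (assumed aggregation): probability that every
    # step is correct, taken as the product of per-step probabilities.
    return math.prod(c["step_probs"])


def orm_score(c: Candidate) -> float:
    # Outcome supervision: a single score for the final solution.
    return c["outcome_prob"]


def best_of_n(candidates: Sequence[Candidate],
              score: Callable[[Candidate], float]) -> Candidate:
    # Best-of-N: sample N candidates, keep the one the scorer ranks highest.
    return max(candidates, key=score)


# Toy data (made up): the second candidate has one weak intermediate step
# that the outcome score does not penalize.
CANDIDATES: list[Candidate] = [
    {"answer": "42", "step_probs": [0.9, 0.9, 0.9], "outcome_prob": 0.60},
    {"answer": "41", "step_probs": [0.95, 0.2, 0.9], "outcome_prob": 0.70},
]

if __name__ == "__main__":
    print("PRM picks:", best_of_n(CANDIDATES, prm_score)["answer"])
    print("ORM picks:", best_of_n(CANDIDATES, orm_score)["answer"])
```

In this toy case the two scorers disagree because a single low step probability drags the product down. With only a handful of candidates there are fewer opportunities for such disagreements, which is one plausible reading of why the reported gap narrows at small N.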