Researchers have developed SCoRe, a novel two-stage reinforcement learning technique that enables language models to refine their own responses using self-generated data. This method significantly improves performance on benchmarks like MATH and HumanEval when applied to models such as Gemini 1.5 Flash and 1.0 Pro. Additionally, a separate study explored process versus outcome supervision for mathematical reasoning, finding that process-reward models yield better results, though the advantage diminishes with fewer samples. AI
影响 New self-correction techniques could enhance LLM reasoning capabilities and reduce the need for extensive human supervision in training.
排序理由 The cluster contains two academic papers detailing new methods for improving language model reasoning and self-correction.
在 Mastodon — fosstodon.org 阅读 →
AI 生成摘要 · Google Gemini · 来自 2 个来源。 我们如何撰写摘要 →