Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

New Data Strategies Improve Reinforcement Learning Performance for Large Language Models

By the PulseAugur editorial team · [5 sources] · 2026-06-17 00:00

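Researchers have developed new methods for improving reinforcement learning (RL) for large language models (LLMs), focusing on data scheduling and curation. One approach, Adaptive Data Scheduling (ADS), organizes training data into semantic clusters and adaptively samples data at the policy's capability boundary, yielding a 5.2% accuracy improvement on reasoning benchmarks. Another data-centric approach uses a curated dataset of about 14,000 examples covering retrieval, synthesis, and reasoning tasks; it achieves significant gains on long-context benchmarks and also improves performance on agentic tasks.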
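The cluster-based sampling idea described above can be illustrated with a minimal sketch. The summary does not say how ADS forms its clusters or how it defines boundary data, so everything here is an assumption made for illustration: clusters are given as a plain dictionary, "boundary" is taken to mean a cluster whose current pass rate is near 0.5, and the names `boundary_weights`, `update_pass_rate`, and `sample_batch` are hypothetical. This is not the authors' algorithm.

```python
import math
import random


def boundary_weights(pass_rates, temperature=0.1):
    """Turn per-cluster pass rates into sampling weights.

    Assumption: a cluster is "at the boundary" when the policy solves
    about half of its prompts. p * (1 - p) peaks at p = 0.5 and is zero
    at p = 0 or p = 1, so it serves as a simple boundary score.
    """
    names = list(pass_rates)
    scores = [pass_rates[n] * (1.0 - pass_rates[n]) for n in names]
    top = max(scores)
    # Softmax over the scores; a lower temperature sharpens the preference.
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}


def update_pass_rate(old_rate, batch_rate, momentum=0.9):
    """Exponential moving average, so weights track the changing policy."""
    return momentum * old_rate + (1.0 - momentum) * batch_rate


def sample_batch(clusters, pass_rates, batch_size, rng):
    """Pick a cluster per slot by weight, then a prompt uniformly within it."""
    weights = boundary_weights(pass_rates)
    names = list(clusters)
    chosen = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [rng.choice(clusters[name]) for name in chosen]


# Hypothetical toy data: three semantic clusters with different pass rates.
clusters = {
    "easy": ["e1", "e2"],
    "boundary": ["b1", "b2"],
    "hard": ["h1", "h2"],
}
pass_rates = {"easy": 0.95, "boundary": 0.5, "hard": 0.05}
batch = sample_batch(clusters, pass_rates, batch_size=8, rng=random.Random(0))
```

In this sketch the "easy" and "hard" clusters receive equal, smaller weights, while the cluster near a 50% pass rate is sampled most often. Calling `update_pass_rate` after each training step would let the weights shift as the policy improves.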

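Impact: These data-centric approaches are expected to strengthen LLM reasoning, particularly on long-context tasks and in agentic applications, and may lead to more effective AI agents.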

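Ranking rationale: This cluster contains two academic papers that detail novel methods for improving LLM reinforcement learning through data scheduling and curation.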

AI-generated summary · Google Gemini · based on 5 sources.

Sources [5]

arXiv cs.CL TIER_1 English(EN) · Chenhao Dang, Jing Ma, Mingjie Liao · 2026-06-24 04:00

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

arXiv:2606.24133v1 Announce Type: cross Abstract: The composition of training data, governed by the diversity of sources and their mixing strategy, is a cornerstone of Large Language Model (LLM) pre-training. Online Data Mixing (ODM), the technique of adaptively adjusting data mi…

arXiv cs.CL TIER_1 English(EN) · Mingjie Liao · 2026-06-23 04:32

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

The composition of training data, governed by the diversity of sources and their mixing strategy, is a cornerstone of Large Language Model (LLM) pre-training. Online Data Mixing (ODM), the technique of adaptively adjusting data mixtures during training, has emerged as a promising…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-23 00:00

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

A novel online data mixing framework called Holistic Data Scheduler uses reinforcement learning with a multi-objective reward function to optimize large language model pre-training efficiency and performance.

arXiv cs.CL TIER_1 English(EN) · Vladimir Braverman · 2026-06-21 02:19

Learning at the Right Pace: Adaptive Data Scheduling Improves LLM Reinforcement Learning

Large Language Models (LLMs) achieve remarkable reasoning capabilities through reinforcement learning (RL) post-training. However, existing RL post-training commonly relies on uniform data sampling, which ignores the semantic structure of the training data and the changing capabi…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-17 00:00

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

A data-centric approach using curated datasets and a minimal GRPO setup significantly improves long-context reasoning in large language models, outperforming prior reinforcement learning methods.
