PulseAugur
English(EN) Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

New Data Strategies Improve Reinforcement Learning Performance for Large Language Models

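Researchers have developed new methods for improving reinforcement learning (RL) for large language models (LLMs), focusing on data scheduling and curation. One approach, Adaptive Data Scheduling (ADS), organizes training data into semantic clusters and adaptively samples policy-boundary data, yielding a 5.2% accuracy gain on reasoning benchmarks. Another, data-centric approach uses a curated dataset of about 14,000 examples covering retrieval, synthesis, and reasoning tasks; it achieves significant gains on long-context benchmarks and improves performance on agentic tasks.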
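The cluster-then-sample idea behind ADS can be illustrated with a small sketch. This is not the paper's algorithm: the summary does not define "policy-boundary data", so the code below assumes it means clusters the current policy solves about half the time, and scores each cluster with the hypothetical weight p(1 - p) on its running pass rate p. The names `ClusterSampler`, `boundary_weight`, `smoothing`, and `floor` are all invented for illustration.

```python
import random


def boundary_weight(pass_rate: float) -> float:
    """Assumed proxy for 'boundary' data: peaks at pass_rate 0.5, zero at 0 and 1."""
    return pass_rate * (1.0 - pass_rate)


class ClusterSampler:
    """Samples prompts from semantic clusters, favouring clusters near a 50% pass rate."""

    def __init__(self, clusters, smoothing=0.1, floor=0.01, seed=0):
        self.clusters = clusters                      # cluster id -> list of prompts
        self.pass_rate = {c: 0.5 for c in clusters}   # uninformed starting estimate
        self.smoothing = smoothing                    # step size of the moving average
        self.floor = floor                            # keeps every cluster reachable
        self.rng = random.Random(seed)

    def weights(self):
        """Unnormalised sampling weight per cluster."""
        return {c: boundary_weight(p) + self.floor for c, p in self.pass_rate.items()}

    def sample(self):
        """Pick a cluster in proportion to its weight, then a prompt uniformly inside it."""
        w = self.weights()
        ids = list(w)
        cid = self.rng.choices(ids, weights=[w[c] for c in ids], k=1)[0]
        return cid, self.rng.choice(self.clusters[cid])

    def update(self, cluster_id, solved: bool):
        """Fold one rollout outcome into the cluster's running pass rate."""
        old = self.pass_rate[cluster_id]
        self.pass_rate[cluster_id] = (1 - self.smoothing) * old + self.smoothing * float(solved)
```

In a training loop, each rollout would call `update` with the verifier's verdict, so clusters that become too easy or stay too hard gradually lose sampling mass while the `floor` term keeps them from disappearing entirely.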

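Impact: These data-centric methods could strengthen LLM reasoning, particularly on long-context tasks and in agentic applications, and may lead to more effective AI agents.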

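Ranking rationale: This cluster contains two academic papers detailing novel methods for improving LLM reinforcement learning through data scheduling and curation.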

AI-generated summary · Google Gemini · from 5 sources.

Sources [5]

  1. arXiv cs.CL TIER_1 English(EN) · Chenhao Dang, Jing Ma, Mingjie Liao ·

    Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

    arXiv:2606.24133v1 Announce Type: cross Abstract: The composition of training data, governed by the diversity of sources and their mixing strategy, is a cornerstone of Large Language Model (LLM) pre-training. Online Data Mixing (ODM), the technique of adaptively adjusting data mi…

  2. arXiv cs.CL TIER_1 English(EN) · Mingjie Liao ·

    Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

    The composition of training data, governed by the diversity of sources and their mixing strategy, is a cornerstone of Large Language Model (LLM) pre-training. Online Data Mixing (ODM), the technique of adaptively adjusting data mixtures during training, has emerged as a promising…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

    A novel online data mixing framework called Holistic Data Scheduler uses reinforcement learning with a multi-objective reward function to optimize large language model pre-training efficiency and performance.

  4. arXiv cs.CL TIER_1 English(EN) · Vladimir Braverman ·

    Learning at the Right Pace: Adaptive Data Scheduling Improves LLM Reinforcement Learning

    Large Language Models (LLMs) achieve remarkable reasoning capabilities through reinforcement learning (RL) post-training. However, existing RL post-training commonly relies on uniform data sampling, which ignores the semantic structure of the training data and the changing capabi…

  5. Hugging Face Daily Papers TIER_1 English(EN) ·

    Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

    Data-centric approach using curated datasets and minimal GRPO setup significantly improves long-context reasoning in large language models, outperforming prior reinforcement learning methods.
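For readers unfamiliar with GRPO, mentioned in the last entry, its core step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage follows; it is a generic illustration, not the setup used in the paper, and the function name, the `eps` constant, and the use of the population standard deviation are assumptions made here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward by the mean and standard deviation of its own group.

    `rewards` holds the scalar rewards of all responses sampled for one prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A group in which every response gets the same reward yields all-zero advantages and therefore no gradient signal, which is one reason the choice of training prompts matters under this objective.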