New RL techniques enhance LLM reasoning, safety, and efficiency · Tracking 8 sources

By PulseAugur Editorial Team · [15 sources] · 2026-06-18 00:00

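Researchers have introduced several new methods to improve reinforcement learning (RL) for large language models (LLMs), addressing challenges such as reward sparsity, credit assignment, and efficiency. Group-Graph Policy Optimization (G2PO) converts linear trajectories into a state-transition graph to improve credit assignment in long-horizon tasks. SingGuard provides a policy-adaptive multimodal safety guardrail that assesses safety in conversations and adapts to changing moderation rules. In addition, Adaptive Correct-Only Efficiency Reward (ACOER) stabilizes training by restricting the conciseness reward to correct completions, while Adaptive Data Scheduling (ADS) optimizes data sampling to improve LLM RL performance.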
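The summary states only that ACOER limits the conciseness reward to correct completions. The Python sketch below illustrates that single idea and nothing more. The names `Completion`, `correct_only_efficiency_reward`, `max_tokens`, and `brevity_weight`, as well as the linear bonus formula, are assumptions made for illustration and are not taken from the paper; the "adaptive" component is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    """Hypothetical container for one sampled model response."""
    text: str
    is_correct: bool
    num_tokens: int


def correct_only_efficiency_reward(
    completion: Completion,
    max_tokens: int = 1024,        # assumed length budget, not from the source
    brevity_weight: float = 0.2,   # assumed scale of the brevity bonus
) -> float:
    """Return 0.0 for an incorrect answer; for a correct answer return
    1.0 plus a bonus that shrinks linearly as the response gets longer."""
    if not completion.is_correct:
        # Incorrect responses never receive a brevity bonus, so a short
        # wrong answer cannot outscore a long correct one.
        return 0.0
    used_fraction = min(completion.num_tokens, max_tokens) / max_tokens
    return 1.0 + brevity_weight * (1.0 - used_fraction)


if __name__ == "__main__":
    samples = [
        Completion("short and right", True, 256),
        Completion("long and right", True, 1024),
        Completion("short and wrong", False, 16),
    ]
    for sample in samples:
        print(sample.text, correct_only_efficiency_reward(sample))
```

Under these assumptions, the brevity term can only reorder correct responses among themselves, which is one plausible reading of how isolating the reward might keep a length penalty from rewarding short but wrong outputs.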

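Impact: Advances in these RL techniques could lead to more capable, efficient, and safer AI agents, improving performance on complex reasoning tasks and long-horizon applications.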

Ranking rationale: Multiple research papers introduce new algorithms and frameworks for LLM reinforcement learning.

AI-generated summary · Google Gemini · from 15 sources.

Sources [15]

arXiv cs.AI TIER_1 English(EN) · Akshay V. Jagadeesh, Rahul K. Arora, Khaled Saab, Ali Malik, Mikhail Trofimov, Foivos Tsimpourlas, Johannes Heidecke, Karan Singhal · 2026-06-24 04:00

Reinforcement Learning Towards Broadly and Persistently Beneficial Models

arXiv:2606.24014v1 Announce Type: new Abstract: As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL), which …
arXiv cs.AI TIER_1 English(EN) · Bingnan Xiao, Chenhao Yang, Wei Ni, Xin Wang, Tony Q. S. Quek · 2026-06-24 04:00

Agentic AI for Bilevel Long-Term Optimization of Policy-Driven Physical Layer Systems

arXiv:2606.24416v1 Announce Type: new Abstract: Network operators' changing policies, service requirements, and stringent real-time constraints render existing methods designed with fixed objectives and constraints ineffective. This paper presents Agentic long-term performance op…
arXiv cs.AI TIER_1 English(EN) · Elias Bareinboim, Junzhe Zhang, Sanghack Lee · 2026-06-24 04:00

An Introduction to Causal Reinforcement Learning

arXiv:2606.24160v1 Announce Type: new Abstract: Causal inference provides a set of principles and tools that allow one to combine data and knowledge about an environment to reason with questions of counterfactual nature, i.e., what would have happened had reality been different, …
arXiv cs.AI TIER_1 English(EN) · Tianyuan Shi, Canbin Huang, Bei Li, Xin Chen, Xiaojun Quan, Jingang Wang, Qifan Wang · 2026-06-24 04:00

Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning

arXiv:2606.24064v1 Announce Type: new Abstract: Distilling reasoning capabilities from strong to weak language models typically involves imitating specific solution trajectories, effectively transferring what to answer rather than how to reason. This trajectory-level imitation en…
arXiv cs.LG TIER_1 English(EN) · Luca Viano, Till Freihaut, Emanuele Nevali, Volkan Cevher, Matthieu Geist, Giorgia Ramponi · 2026-06-24 04:00

Multi-agent imitation learning with function approximation: Linear Markov games and beyond

arXiv:2602.22810v2 Announce Type: replace Abstract: In this work, we present the first theoretical analysis of multi-agent imitation learning (MAIL) in linear Markov games where both the transition dynamics and each agent's reward function are linear in some given features. We de…
arXiv cs.AI TIER_1 English(EN) · Anurag Akula, Satheesh K. Perepu, Abhishek Sarkar, Kaushik Dey · 2026-06-24 04:00

ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning

arXiv:2606.24601v1 Announce Type: new Abstract: Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives. Prior work has investigated transfer learning between source and target domains…
arXiv cs.AI TIER_1 English(EN) · Kaushik Dey · 2026-06-23 14:03

ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives. Prior work has investigated transfer learning between source and target domains in MARL; however, the majority of existing appr…
arXiv cs.AI TIER_1 English(EN) · Tony Q. S. Quek · 2026-06-23 10:53

Agentic AI for Bilevel Long-Term Optimization of Policy-Driven Physical Layer Systems

Network operators' changing policies, service requirements, and stringent real-time constraints render existing methods designed with fixed objectives and constraints ineffective. This paper presents Agentic long-term performance optimization (Agentic-LTPO), a nested bilevel opti…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-23 05:28

An Introduction to Causal Reinforcement Learning

Causal inference provides a set of principles and tools that allow one to combine data and knowledge about an environment to reason with questions of counterfactual nature, i.e., what would have happened had reality been different, even when no data of this unrealized reality is …
arXiv cs.CL TIER_1 English(EN) · Karan Singhal · 2026-06-22 23:35

Reinforcement Learning Towards Broadly and Persistently Beneficial Models

As AI systems are deployed across increasingly diverse and high-stakes settings, model alignment must generalize beyond the tasks and domains seen during training. This is especially important for reinforcement learning (RL), which can introduce unexpected misalignment through re…
arXiv cs.CL TIER_1 English(EN) · Qi Zhang · 2026-06-22 08:12

Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning

Group-based Reinforcement Learning (RL) has significantly enhanced Large Language Models (LLMs) in agentic scenarios. To achieve finer-grained policy updates, recent agentic RL frameworks have shifted from trajectory-level to step-level training. However, long-horizon agentic RL …
arXiv cs.CL TIER_1 English(EN) · SingGuard Team · 2026-06-22 05:37

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering, assistant responses, and cross-modal composition, while mode…
arXiv cs.AI TIER_1 English(EN) · Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou · 2026-06-19 04:00

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

arXiv:2606.20002v1 Announce Type: cross Abstract: This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves…
arXiv cs.AI TIER_1 English(EN) · Jingren Zhou · 2026-06-18 09:38

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously explo…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-18 00:00

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Large language models can be trained through reinforcement learning to develop a meta-capability enabling continuous learning and adaptation across long sequences of tasks in dynamic environments.
