Entity: reinforcement learning

reinforcement learning

PulseAugur coverage of reinforcement learning — every cluster mentioning reinforcement learning across labs, papers, and developer communities, ranked by signal.

Total · 30 days

142

Within 90 days: 142

Releases · 30 days

Within 90 days: 0

Papers · 30 days

135

Within 90 days: 135

Tier distribution · 90 days

significant 2
research 54
tool 82
commentary 4

Relations

instance of SOFT ACTOR-CRITIC REINFORCEMENT LEARNING FOR ROBOTIC MANIPULATOR WITH HINDSIGHT EXPERIENCE REPLAY 95%
used by robotics 90%
used by large-language models 80%
used by GRPO 70%
used by supervised fine-tuning 70%
instance of robotics 70%
used by Group Relative Policy Optimization 70%
instance of Markov decision process 70%
used by vision-language model 70%
used by AlphaZero 70%
used by train of thought 70%
affiliated with model predictive control 70%

Timeline

2026-05-18 research_milestone A new paper proposes a reinforcement learning framework for modeling customer trajectories in retail.

Sentiment · 30 days

19 days with sentiment data

Recent · Page 2 of 8 · 142 items total

TOOL · CL_42508 · May 20 · 17:22

Robot Tactile Olympiad benchmark accelerates blind manipulation tasks

Researchers have introduced roto 2.0, a new benchmark for tactile-based reinforcement learning in robotics. This benchmark utilizes GPU parallelism and focuses on end-to-end "blind" manipulation tasks across four differ…
RESEARCH · CL_42473 · May 20 · 15:39

Reinforcement learning optimizes urban street design and traffic signals

Researchers have developed DeCoR, a novel reinforcement learning framework designed to optimize urban street design and traffic signal control. The system first learns to generate optimal crosswalk layouts by encoding p…
RESEARCH · CL_42477 · May 20 · 15:14

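New reinforcement learning policies improve efficiency through one-step generative control

Researchers have developed new reinforcement learning policy methods aimed at improving efficiency and expressiveness. One method, score-based one-step mean-flow policy optimization (SOM), uses Q-function scores and a probability-flow ODE to construct a target velocity field, achieving state-of-the-art performance in online reinforcement learning while reducing training and inference time. Another development, the stochastic mean-flow policy (SMFP), offers a one-step generative policy class that maps noise to actions via a mean-flow transformation, providing a unified objective for stable and exploratory policy improvement in off-policy settings.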
RESEARCH · CL_42482 · May 20 · 14:19

PREFINE method enhances AI safety alignment using preference tuning

Researchers have developed PREFINE, a novel method for adapting pre-trained reinforcement learning policies to incorporate safety constraints without full retraining. This technique leverages trajectory-level preference…
RESEARCH · CL_42484 · May 20 · 14:08

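Quantum reinforcement learning advances state preparation for variational quantum algorithms and process synthesis

Researchers have developed a new framework called CRiSP that uses reinforcement learning with a Transformer-based policy to improve initial state preparation for variational quantum algorithms (VQAs). The method aims to overcome limitations such as barren plateaus and local minima, and it outperforms existing Clifford initialization techniques on QAOA benchmarks. Separately, another study explores quantum reinforcement learning for process synthesis, proposing state-encoding algorithms to improve scalability and demonstrating performance competitive with classical reinforcement learning methods on flowsheet synthesis problems.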
RESEARCH · CL_42523 · May 20 · 14:07

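New YANN-RL method accelerates AI control of chemical processes

Researchers have developed a new reinforcement learning (RL) method called Y-wise Affine Neural Network (YANN-RL), designed for control in chemical process systems. The method aims to overcome the challenges of trust and long training times that RL typically faces in this field. By providing a confident and interpretable starting point for the control scheme, YANN-RL demonstrated shorter training times and reduced data requirements in case studies involving a CSTR, a four-tank system, and an extraction column.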
RESEARCH · CL_41847 · May 20 · 13:14

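AI research advances autonomous driving safety with new RL frameworks

Two new research papers explore advanced reinforcement learning techniques for safer autonomous driving. The first introduces a multi-agent reinforcement learning (MARL) approach in which autonomous vehicles and pedestrians are co-trained; by better anticipating unpredictable pedestrian behavior, it reduces collisions by 30% compared with baseline methods. The second proposes a cognitive-physical reinforcement learning (CoPhy) framework that integrates knowledge from vision-language models and uses a predictive world model to ensure safety and adherence to driving intent, achieving state-of-the-art results on benchmarks.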
RESEARCH · CL_41862 · May 20 · 10:36

New PG-DPO framework enhances reinforcement learning for non-exponential discounting

Researchers have developed a new framework called Pontryagin-Guided Direct Policy Optimization (PG-DPO) to address limitations in reinforcement learning methods. Traditional approaches using Bellman-style recursions str…
COMMENTARY · CL_40385 · May 20 · 09:26

AI models likely to develop power-seeking behavior with advanced training

Current state-of-the-art large language models largely operate within a simulator regime, which insulates them from power-seeking behavior. However, as these models are increasingly trained using long-horizon reinforcem…
TOOL · CL_41868 · May 20 · 08:15

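New CIG reward method enhances exploration in reinforcement learning

Researchers have introduced Conditional Information Gain (CIG), a novel reward mechanism for reinforcement learning designed to improve exploration strategies. CIG addresses the limitations of existing methods by providing a tractable alternative to trajectory-level information gain, allowing it to scale to high-dimensional state spaces. Tested on twelve tasks across discrete and continuous control environments, CIG showed competitive or superior performance compared with prior exploration techniques in the presence of stochastic distractors.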
RESEARCH · CL_41798 · May 20 · 03:07

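AI framework optimizes aero-engine pipe design for manufacturability

Researchers have developed FPRO, a novel reinforcement learning framework for optimizing the design and manufacture of free-form pipes in aero-engines. The method integrates domain-specific manufacturing knowledge into the reinforcement learning process as constraints. The collision-free, manufacturable pipe paths that FPRO generates can be converted directly into manufacturing instructions for a six-axis pipe-bending machine, and real-world validation demonstrated the approach's feasibility.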
RESEARCH · CL_42791 · May 20 · 00:33

Mahjong RL simulator Mahjax achieves 2M steps/sec on GPUs

Researchers have developed Mahjax, a new GPU-accelerated simulator for the complex game of Riichi Mahjong, implemented in JAX. This tool is designed to facilitate reinforcement learning research, particularly for agents…
TOOL · CL_39391 · May 19 · 17:30

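Reinforcement learning explained: policies, MDPs, and trajectories

This article explains how reinforcement learning agents make decisions by defining key concepts. It covers policies, Markov decision processes (MDPs), and trajectories. The series aims to lay the groundwork for understanding the Proximal Policy Optimization (PPO) algorithm.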
RESEARCH · CL_39995 · May 19 · 12:39

New research advances optimization and reinforcement learning theory

Researchers have developed new theoretical frameworks for optimizing decision-making processes in machine learning. One paper introduces regret-based stopping criteria for Bayesian optimization, ensuring solutions are w…
TOOL · CL_41182 · May 19 · 07:36

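New RL jailbreak method exploits LRM attention patterns

Researchers have developed a new jailbreak method specifically targeting large reasoning models (LRMs), which are known for their step-by-step problem-solving ability. The method uses reinforcement learning and incorporates the model's attention patterns into the reward function, because research shows that jailbreaks succeed more often when attention is misdirected. Augmented with diverse persuasion strategies, the approach significantly raises attack success rates across a range of benchmarks and models.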
RESEARCH · CL_39980 · May 19 · 03:33

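New research advances flow matching models for generative AI and robotics

Researchers have developed new methods to enhance flow matching models, a type of generative AI. One method, called "Precise", improves the reinforcement learning post-training stage by using SDE-consistent stochastic sampling for better alignment and faster optimization. Another paper explores "sparse compositional flow matching" for embodied AI trajectories, composing motion primitives directly in physical space to improve accuracy. A survey also reviews diffusion and flow matching models for tabular data, highlighting challenges and future directions, while another work studies "transition matching" as a potentially superior alternative to flow matching for certain distributions and introduces … for unsupervised…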
RESEARCH · CL_39989 · May 19 · 00:17

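Reinforcement learning optimizes physical activity to improve health biomarkers

Researchers have developed a novel offline reinforcement learning algorithm for creating personalized physical activity recommendations. The algorithm analyzes step-count data and health biomarkers from the All of Us Research Program to optimize daily step-count distributions and thereby lower cardiometabolic risk. Simulation studies show that the method outperforms existing continuous-action reinforcement learning methods and suggest that increased and more consistent physical activity leads to better health outcomes.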
TOOL · CL_38815 · May 18 · 16:46

Latent visual reasoning tokens prove non-essential for inference

Researchers have investigated the role of latent visual reasoning, a technique that incorporates visual evidence into multimodal reasoning by using continuous latent tokens before text generation. Their findings suggest…
TOOL · CL_38262 · May 18 · 15:01

DiPRL method learns discrete programmatic policies for reinforcement learning

Researchers have developed DiPRL, a novel method for learning discrete programmatic policies in reinforcement learning. This approach aims to overcome the performance degradation often seen when converting continuous pr…
TOOL · CL_38270 · May 18 · 14:17

Reinforcement learning models customer retail journeys for layout optimization

Researchers have developed a new reinforcement learning (RL) framework to model customer movement in retail environments, aiming to provide practical insights for store layout optimization. This approach treats customer…