Proximal Policy Optimization
PulseAugur coverage of Proximal Policy Optimization (PPO): every cluster that mentions it across labs, papers, and developer communities, ranked by signal.
- instance of reinforcement learning 90%
- instance of deep reinforcement learning 90%
- used by large language models 90%
- developed Advantage Actor-Critic 90%
- used by long short-term memory 90%
- used by reinforcement learning 70%
- developed GRPO 70%
- uses GRPO 70%
- used by reinforcement learning from human feedback 70%
- instance of Direct Preference Optimization 70%
- 2026-05-26 research milestone: A new method is proposed to stabilize reinforcement learning training by strategically dropping transitions.
-
AI policies learn cybersecurity penetration testing faster with history aggregation
Researchers have developed and evaluated reinforcement learning policies for penetration testing in cybersecurity scenarios with partial observability. They compared several Proximal Policy Optimization (PPO) variants, …
-
New research revisits action factorization for complex RL spaces · 2 sources tracked
A new research paper explores methods for handling complex action spaces in reinforcement learning, particularly those that combine discrete and continuous actions. The study analyzes various factorization techniques ac…
-
New model simulates tuberculosis spread in Mars colony
Researchers have developed a new model to simulate the spread of latent tuberculosis within a radiation-exposed Mars colony. The model links galactic cosmic radiation to immune competence, which in turn affects the reac…
-
New RL framework uses vision-language models for GUI agent supervision
Researchers have developed a new reinforcement learning framework for Computer-Use Agents (CUAs) that leverages autonomous vision-language evaluation for supervision. This approach addresses the challenge of obtaining s…
-
EMAgnet introduces adaptive regularization for policy gradient self-play
Researchers have developed EMAgnet, a novel parameter-space exponential moving average (EMA) regularization technique for policy gradient self-play in large games. Unlike previous methods that use a uniform distribution…
-
New research unifies PPO-Clip and KL-PPO algorithms
Researchers have demonstrated that the clipped surrogate gradient in Proximal Policy Optimization (PPO) can be precisely replicated by a Kullback-Leibler surrogate with a per-sample coefficient. This equivalence holds t…
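For readers unfamiliar with the clipped surrogate mentioned above, the following is a minimal sketch of the standard per-sample PPO-Clip objective and its derivative with respect to the probability ratio. It is not the paper's code: the function names and the default `eps = 0.2` are illustrative assumptions, and the KL-based equivalent the paper describes is not reproduced here.

```python
def ppo_clip_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO-Clip surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- new-policy probability divided by old-policy probability
    advantage -- advantage estimate for the sampled action
    eps       -- clip range (0.2 is an assumed, commonly used default)
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_clip_grad_wrt_ratio(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Derivative of the surrogate with respect to the ratio.

    The gradient is cut to zero once the ratio has already moved past the
    clip boundary in the direction the advantage favours; otherwise it is
    simply the advantage.
    """
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0
    return advantage


if __name__ == "__main__":
    # Positive advantage, ratio far above 1 + eps: objective is capped.
    print(ppo_clip_surrogate(1.5, 2.0), ppo_clip_grad_wrt_ratio(1.5, 2.0))
    # Positive advantage, ratio below 1 - eps: no clipping, gradient flows.
    print(ppo_clip_surrogate(0.5, 2.0), ppo_clip_grad_wrt_ratio(0.5, 2.0))
```

The sketch shows why the clipped gradient is piecewise: each sample either contributes its full advantage or nothing, depending on which side of the clip boundary the ratio sits.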
-
CoorDex enables humanoid robots to manipulate objects while walking
Researchers have developed CoorDex, a new learning pipeline designed to enable dexterous humanoid robots to perform manipulation tasks while in motion. This system converts high-dimensional body and hand control into co…
-
Humanoid robot 'cerebellum' gets GPT-style model with 2B frames of motion data
Researchers have introduced AstraBrain-WBC 0.5, a GPT-style foundation model designed for general-purpose "cerebellum" control of humanoid robots. The model leverages a massive dataset of 2 billion frames of human motion dat…
-
RLAIF and PPO: Key Techniques for Enhancing LLM Behavior
This article explores Reinforcement Learning from AI Feedback (RLAIF) and Proximal Policy Optimization (PPO) as key techniques for improving large language model behavior. It details how a combination of a reward model,…
-
New AI method optimizes additive manufacturing with attention-based RL
Researchers have developed a novel approach to optimize additive manufacturing processes by integrating a multi-head attention mechanism with the Soft Actor-Critic (SAC) algorithm. This method addresses limitations in t…
-
Graph RL router boosts quantum circuit fidelity using calibration data
Researchers have developed a new quantum circuit routing method using graph reinforcement learning that incorporates calibration data from quantum processors. This approach, trained with proximal policy optimization and…
-
New methods enhance VLA model efficiency and performance in robotics · 9 sources tracked
Researchers are developing new methods to improve the efficiency and performance of Vision-Language-Action (VLA) models in robotics. One approach, Flow Policy Optimization (FPO), uses reinforcement learning to fine-tune…
-
New research explores RL advancements for LLMs and AI agents · 8 sources tracked
Multiple research papers released on arXiv explore advancements in reinforcement learning (RL) for large language models (LLMs) and other AI agents. One paper introduces RiVER, a framework for training LLMs on score-bas…
-
Model-free RL controllers enhance cyber-physical system resilience against attacks · arXiv paper
A new research paper published on arXiv explores the effectiveness of model-free reinforcement learning (RL) controllers in enhancing the resilience of cyber-physical systems against cyberattacks. The study analyzes fou…
-
SIQ-1 fine-tune of Qwen3.6 shows Opus-like reasoning, beats GPT-5.5
A new model, SIQ-1, has been developed by fine-tuning Qwen-35B-A3 using PPO. This model demonstrates strong performance on autoresearch tasks, outperforming GLM-5.2 and Qwen-350B, with its generated ideas reportedly com…
-
Mamba and PPO achieve superior safety in spacecraft control
A new research paper explores the effectiveness of various recurrent neural network architectures and reinforcement learning algorithms for adaptive safety-critical control in spacecraft proximity operations. The study …
-
New pipeline enables humanoid robots to manipulate objects while walking
Researchers have developed CoorDex, a novel learning pipeline that enables humanoid robots to perform dexterous manipulation while in motion. This system converts high-dimensional body and hand control into coordinated …
-
New RL framework learns graph partitioning with structural priors
Researchers have developed RIDGECUT, a novel reinforcement learning framework designed for graph partitioning problems, specifically targeting the Normalized Cut problem. This method incorporates domain knowledge by con…
-
AI Alignment: RLHF, DPO, IPO, and KTO Tradeoffs Explored
The choice of AI model alignment method (RLHF, DPO, IPO, or KTO) significantly impacts project timelines and resource allocation. RLHF, a multi-stage process involving a reward model and PPO, is compute-intensive and can …
-
AI estimates food material properties using reinforcement learning
Researchers have developed a novel approach using latent space reinforcement learning to estimate material properties in food fracture simulations, specifically demonstrated with orange peeling. This method trains a goa…