reinforcement learning from human feedback
PulseAugur coverage of reinforcement learning from human feedback — every cluster mentioning reinforcement learning from human feedback across labs, papers, and developer communities, ranked by signal.
- instance of Reinforcement Learning From Human Feedback (RLHF) 95%
- instance of Dopravní podnik Ostrava 90%
- used by Direct Preference Optimization: Your Language Model is Secretly a Reward Model 80%
- used by Reward Models 80%
- used by large-language models 70%
- used by InstructGPT 70%
- competes with Direct Preference Optimization: Your Language Model is Secretly a Reward Model 70%
- instance of Direct Preference Optimization 70%
- other Direct Preference Optimization: Your Language Model is Secretly a Reward Model 60%
- competes with Direct Preference Optimization 60%
- affiliated with Reward Models 60%
- other supervised fine-tuning 60%
19 day(s) with sentiment data
-
AI agents struggle to deliberate like humans in jury simulation
Researchers have developed a novel benchmark using a multi-agent framework to evaluate large language model deliberation, inspired by the film '12 Angry Men'. The study tested GPT-4o and Llama-4-Scout, finding that most…
-
PERSA pipeline uses RLHF to align LLM feedback with instructor style
Researchers have developed PERSA, a novel approach using Reinforcement Learning from Human Feedback (RLHF) to adapt large language models for generating personalized educational feedback. This method specifically target…
-
New FPO method prevents alignment collapse in iterative RLHF models
Researchers have identified a phenomenon called alignment collapse in iterative Reinforcement Learning from Human Feedback (RLHF). This occurs when the AI policy exploits weaknesses in the reward model it is trained on,…
-
New Logit-Gap Steering method efficiently measures AI alignment robustness
Researchers have developed a new metric called the refusal-affirmation logit gap to quantify the safety margin of aligned language models. This metric, which measures the difference between refusal and affirmation token…
-
New research explores advanced reward modeling for LLMs and diffusion models
Several new research papers explore advancements in reward modeling for AI alignment, particularly for large language models and diffusion models. One paper introduces SelectiveRM, a framework using optimal transport to…
-
New research refines LLM alignment beyond DPO and RLHF
Researchers are exploring advanced methods for aligning large language models with human preferences, moving beyond traditional Reinforcement Learning from Human Feedback (RLHF). New approaches like Direct Preference Op…
-
New DoTS framework synthesizes SFT and RLVR LLM capabilities at inference time
Researchers have developed a novel post-hoc framework called Decoupled Test-time Synthesis (DoTS) to integrate Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) for large language models…
-
New statistical framework improves AI alignment with human feedback
Researchers have developed a new statistical framework for Reinforcement Learning from Human Feedback (RLHF) that improves how large models are aligned with human preferences. This method simultaneously handles online d…
-
New paper derives exponential family results from single KL identity
Researchers have identified a fundamental identity for exponential families, which are distributions crucial to modern machine learning techniques like softmax and Gaussian distributions. This identity simplifies the de…
-
AI research reframes clinician overrides as implicit preference signals for value-based care
Researchers have developed a new framework that treats clinician overrides of AI recommendations as implicit preference signals, similar to RLHF but with expert annotators and observable outcomes. This approach introduc…
-
New diagnostic tool probes LLM circuits for safety and behavior insights
A new research paper introduces "Perturbation Probing," a diagnostic method for understanding the internal workings of large language models. This technique uses two forward passes per prompt to identify and analyze "be…
-
Goblin Mode, 24 Hours Later
AI models, particularly GPT-5.5, have exhibited a peculiar behavior dubbed "goblin mode," characterized by an unusual fixation on goblin-related imagery and language. This phenomenon gained traction on AI Twitter, with …
-
Hugging Face paper explores three models for RLHF annotation
A new paper proposes three distinct models for understanding the role of human annotators in Reinforcement Learning from Human Feedback (RLHF) pipelines. These models are 'extension,' where annotators mirror designers' …
-
Paper distinguishes three models for RLHF annotation: extension, evidence, and authority
A new paper proposes three distinct models for how human annotator judgments shape large language model behavior through Reinforcement Learning from Human Feedback (RLHF). These models are 'extension,' where annotators …
-
LLMs know they're wrong and agree anyway, research finds
Researchers have developed two novel methods, BAL-A and BMP-A, to efficiently poison preference datasets used in offline Reinforcement Learning from Human Feedback (RLHF) pipelines like Direct Preference Optimization (D…
-
Frontier LLMs like GPT-5.4 and Claude Opus 4.7 show significant verbal tics
A new paper analyzes the prevalence of verbal tics, such as repetitive phrases and sycophantic openers, in eight leading large language models. Researchers developed a Verbal Tic Index (VTI) to quantify these tics, find…
-
AI coding agents reshape software quality expectations; new alignment theories emerge
Justine Moore suggests that advancements in AI coding agents are lowering tolerance for buggy or incomplete software, as these agents can quickly identify and fix issues. Separately, Jack Adler proposes that AI alignmen…
-
New 'Behavioral Canaries' audit LLM training data usage in RL fine-tuning
Researchers have developed a new auditing method called Behavioral Canaries to detect if large language models (LLMs) improperly use legally protected retrieved context during Reinforcement Learning from Human Feedback …
-
OpenAI explores weak-to-strong generalization for AI alignment
OpenAI has introduced a new research direction called weak-to-strong generalization, aiming to address the challenge of aligning future superintelligent AI systems with human supervision. Their initial experiments show …
-
OpenAI trains AI with human preference feedback; Chip Huyen proposes predictive model routing
OpenAI and DeepMind have developed a new algorithm that learns desired behaviors from human feedback, reducing the need for explicit goal functions. This method uses a three-step cycle where humans compare two agent beh…