reinforcement learning from human feedback

PulseAugur coverage of reinforcement learning from human feedback (RLHF): every cluster that mentions it across labs, papers, and developer communities, ranked by signal.

Total · 30d

60 over 90d

Releases · 30d

0 over 90d

Papers · 30d

46 over 90d

TIER MIX · 90D

research 22
tool 26
commentary 12

TOPICS

paper 46
safety 30
model release 17
other 13
opinion 6
product 4
infra 3
policy 1

RELATIONSHIPS

instance of Reinforcement Learning From Human Feedback (RLHF) 95%
instance of Dopravní podnik Ostrava 90%
used by Direct Preference Optimization: Your Language Model is Secretly a Reward Model 80%
used by Reward Models 80%
used by large-language models 70%
used by InstructGPT 70%
competes with Direct Preference Optimization: Your Language Model is Secretly a Reward Model 70%
instance of Direct Preference Optimization 70%
other Direct Preference Optimization: Your Language Model is Secretly a Reward Model 60%
competes with Direct Preference Optimization 60%
affiliated with Reward Models 60%
other supervised fine-tuning 60%

SENTIMENT · 30D

19 days with sentiment data

RECENT · PAGE 3/3 · 60 TOTAL

TOOL · CL_18567 · May 6 · 04:00

AI agents struggle to deliberate like humans in jury simulation

Researchers have developed a novel benchmark using a multi-agent framework to evaluate large language model deliberation, inspired by the film '12 Angry Men'. The study tested GPT-4o and Llama-4-Scout, finding that most…

TOOL · CL_18538 · May 6 · 04:00

PERSA pipeline uses RLHF to align LLM feedback with instructor style

Researchers have developed PERSA, a novel approach using Reinforcement Learning from Human Feedback (RLHF) to adapt large language models for generating personalized educational feedback. This method specifically target…

RESEARCH · CL_20269 · May 5 · 20:01

New FPO method prevents alignment collapse in iterative RLHF models

Researchers have identified a phenomenon called alignment collapse in iterative Reinforcement Learning from Human Feedback (RLHF). This occurs when the AI policy exploits weaknesses in the reward model it is trained on,…

TOOL · CL_15984 · May 5 · 04:00

New Logit-Gap Steering method efficiently measures AI alignment robustness

Researchers have developed a new metric called the refusal-affirmation logit gap to quantify the safety margin of aligned language models. This metric, which measures the difference between refusal and affirmation token…

RESEARCH · CL_15878 · May 3 · 11:45

New research explores advanced reward modeling for LLMs and diffusion models

Several new research papers explore advancements in reward modeling for AI alignment, particularly for large language models and diffusion models. One paper introduces SelectiveRM, a framework using optimal transport to…

RESEARCH · CL_15452 · May 3 · 04:45

New research refines LLM alignment beyond DPO and RLHF

Researchers are exploring advanced methods for aligning large language models with human preferences, moving beyond traditional Reinforcement Learning from Human Feedback (RLHF). New approaches like Direct Preference Op…

RESEARCH · CL_14206 · May 1 · 12:20

New DoTS framework synthesizes SFT and RLVR LLM capabilities at inference time

Researchers have developed a novel post-hoc framework called Decoupled Test-time Synthesis (DoTS) to integrate Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) for large language models…

RESEARCH · CL_11872 · May 1 · 04:00

New statistical framework improves AI alignment with human feedback

Researchers have developed a new statistical framework for Reinforcement Learning from Human Feedback (RLHF) that improves how large models are aligned with human preferences. This method simultaneously handles online d…

RESEARCH · CL_11524 · Apr 30 · 15:48

New paper derives exponential family results from single KL identity

Researchers have identified a fundamental identity for exponential families, which are distributions crucial to modern machine learning techniques like softmax and Gaussian distributions. This identity simplifies the de…

RESEARCH · CL_11482 · Apr 30 · 15:30

AI research reframes clinician overrides as implicit preference signals for value-based care

Researchers have developed a new framework that treats clinician overrides of AI recommendations as implicit preference signals, similar to RLHF but with expert annotators and observable outcomes. This approach introduc…

RESEARCH · CL_11458 · Apr 30 · 04:13

New diagnostic tool probes LLM circuits for safety and behavior insights

A new research paper introduces "Perturbation Probing," a diagnostic method for understanding the internal workings of large language models. This technique uses two forward passes per prompt to identify and analyze "be…

RESEARCH · CL_09174 · Apr 29 · 12:19

Goblin Mode, 24 Hours Later

AI models, particularly GPT-5.5, have exhibited a peculiar behavior dubbed "goblin mode," characterized by an unusual fixation on goblin-related imagery and language. This phenomenon gained traction on AI Twitter, with …

RESEARCH · CL_14658 · Apr 28 · 17:39

Hugging Face paper explores three models for RLHF annotation

A new paper proposes three distinct models for understanding the role of human annotators in Reinforcement Learning from Human Feedback (RLHF) pipelines. These models are 'extension,' where annotators mirror designers' …

RESEARCH · CL_08537 · Apr 28 · 17:39

Paper distinguishes three models for RLHF annotation: extension, evidence, and authority

A new paper proposes three distinct models for how human annotator judgments shape large language model behavior through Reinforcement Learning from Human Feedback (RLHF). These models are 'extension,' where annotators …

RESEARCH · CL_15418 · Apr 28 · 04:00

LLMs know they're wrong and agree anyway, research finds

Researchers have developed two novel methods, BAL-A and BMP-A, to efficiently poison preference datasets used in offline Reinforcement Learning from Human Feedback (RLHF) pipelines like Direct Preference Optimization (D…

RESEARCH · CL_06722 · Apr 28 · 04:00

Frontier LLMs like GPT-5.4 and Claude Opus 4.7 show significant verbal tics

A new paper analyzes the prevalence of verbal tics, such as repetitive phrases and sycophantic openers, in eight leading large language models. Researchers developed a Verbal Tic Index (VTI) to quantify these tics, find…

COMMENTARY · CL_05918 · Apr 27 · 22:44

AI coding agents reshape software quality expectations; new alignment theories emerge

Justine Moore suggests that advancements in AI coding agents are lowering tolerance for buggy or incomplete software, as these agents can quickly identify and fix issues. Separately, Jack Adler proposes that AI alignmen…

RESEARCH · CL_04993 · Apr 24 · 03:38

New 'Behavioral Canaries' audit LLM training data usage in RL fine-tuning

Researchers have developed a new auditing method called Behavioral Canaries to detect if large language models (LLMs) improperly use legally protected retrieved context during Reinforcement Learning from Human Feedback …

RESEARCH · CL_00955 · Dec 14 · 00:00

OpenAI explores weak-to-strong generalization for AI alignment

OpenAI has introduced a new research direction called weak-to-strong generalization, aiming to address the challenge of aligning future superintelligent AI systems with human supervision. Their initial experiments show …

RESEARCH · CL_02599 · Jun 13 · 07:00

OpenAI trains AI with human preference feedback; Chip Huyen proposes predictive model routing

OpenAI and DeepMind have developed a new algorithm that learns desired behaviors from human feedback, reducing the need for explicit goal functions. This method uses a three-step cycle where humans compare two agent beh…
