Entity: reinforcement learning from human feedback

reinforcement learning from human feedback

PulseAugur coverage of reinforcement learning from human feedback: every cluster that mentions it across labs, papers, and developer communities, ranked by signal.


Total · 30 days

38 in the last 90 days

Releases · 30 days

0 in the last 90 days

Papers · 30 days

30 in the last 90 days

Tier distribution · 90 days

research 14
tool 20
commentary 4

Relations

instance of Dopravní podnik Ostrava 90%
used by large-language models 70%
competes with Direct Preference Optimization: Your Language Model is Secretly a Reward Model 70%
instance of Direct Preference Optimization 70%
other large-language models 50%
other Direct Preference Optimization: Your Language Model is Secretly a Reward Model 50%
affiliated with Direct Preference Optimization 50%

Sentiment · 30 days

11 days with sentiment data

Recent · Page 2 of 2 · 38 items total

RESEARCH · CL_20269 · May 5 · 20:01

New FPO method prevents alignment collapse in iterative RLHF models

Researchers have identified a phenomenon called alignment collapse in iterative Reinforcement Learning from Human Feedback (RLHF). This occurs when the AI policy exploits weaknesses in the reward model it is trained on,…
TOOL · CL_15984 · May 5 · 04:00

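New Logit-Gap Steering method efficiently measures AI alignment robustness

Researchers have developed a new metric, the "refusal-affirmation logit gap," to quantify the safety margin of aligned language models. The metric measures the difference between the logits of refusal and affirmation tokens and can be computed efficiently with forward-pass diagnostics. The work also introduces logit-gap steering, a gradient-free method that finds short suffixes that close this safety gap, suggesting that current alignment margins may be thin and easy to manipulate.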
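As a rough illustration of the kind of quantity this summary describes, the sketch below computes a gap between refusal-token and affirmation-token logits from a single next-token logit vector. It is not the paper's definition: the function name `refusal_affirmation_gap`, the log-sum-exp pooling over each token set, and the toy logits and token ids are all assumptions made for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def refusal_affirmation_gap(logits, refusal_ids, affirmation_ids):
    """Hypothetical gap: pooled refusal logit minus pooled affirmation logit.

    A positive value means refusal tokens are favored at this position;
    a value near zero would indicate a thin margin.
    """
    refusal = logsumexp([logits[i] for i in refusal_ids])
    affirmation = logsumexp([logits[i] for i in affirmation_ids])
    return refusal - affirmation


# Toy next-token logits over a 4-token vocabulary (made-up numbers).
toy_logits = [2.0, 0.5, -1.0, 1.5]
gap = refusal_affirmation_gap(toy_logits, refusal_ids=[0], affirmation_ids=[3])
print(f"gap = {gap:.2f}")
```

In practice the logits would come from one forward pass of the model at the first response position, and the token sets would have to be chosen for the specific tokenizer.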
RESEARCH · CL_15878 · May 3 · 11:45

New research explores advanced reward modeling for LLMs and diffusion models

Several new research papers explore advancements in reward modeling for AI alignment, particularly for large language models and diffusion models. One paper introduces SelectiveRM, a framework using optimal transport to…
RESEARCH · CL_15452 · May 3 · 04:45

New research refines LLM alignment beyond DPO and RLHF

Researchers are exploring advanced methods for aligning large language models with human preferences, moving beyond traditional Reinforcement Learning from Human Feedback (RLHF). New approaches like Direct Preference Op…
RESEARCH · CL_14206 · May 1 · 12:20

New DoTS framework synthesizes SFT and RLVR LLM capabilities at inference time

Researchers have developed a novel post-hoc framework called Decoupled Test-time Synthesis (DoTS) to integrate Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) for large language models…
RESEARCH · CL_11872 · May 1 · 04:00

New statistical framework improves AI alignment with human feedback

Researchers have developed a new statistical framework for Reinforcement Learning from Human Feedback (RLHF) that improves how large models are aligned with human preferences. This method simultaneously handles online d…
RESEARCH · CL_11524 · Apr 30 · 15:48

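New paper derives exponential-family results from a single KL identity

Researchers have identified a fundamental identity for exponential families, a class of distributions central to modern machine learning (for example, softmax and Gaussian distributions). The identity simplifies the derivation of several key results in variational inference and reinforcement learning, including the Pythagorean theorem and the Gibbs variational principle. The results are presented in a self-contained note that offers a more streamlined way to understand these concepts.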
RESEARCH · CL_11482 · Apr 30 · 15:30

AI research reframes clinician overrides as implicit preference signals for value-based care

Researchers have developed a new framework that treats clinician overrides of AI recommendations as implicit preference signals, similar to RLHF but with expert annotators and observable outcomes. This approach introduc…
RESEARCH · CL_11458 · Apr 30 · 04:13

New diagnostic tool probes LLM circuits for safety and behavior insights

A new research paper introduces "Perturbation Probing," a diagnostic method for understanding the internal workings of large language models. This technique uses two forward passes per prompt to identify and analyze "be…
RESEARCH · CL_09174 · Apr 29 · 12:19

Goblin Mode, 24 Hours Later

AI models, particularly GPT-5.5, have exhibited a peculiar behavior dubbed "goblin mode," characterized by an unusual fixation on goblin-related imagery and language. This phenomenon gained traction on AI Twitter, with …
RESEARCH · CL_14658 · Apr 28 · 17:39

Hugging Face paper explores three models for RLHF annotation

A new paper proposes three distinct models for understanding the role of human annotators in Reinforcement Learning from Human Feedback (RLHF) pipelines. These models are 'extension,' where annotators mirror designers' …
RESEARCH · CL_08537 · Apr 28 · 17:39

Paper distinguishes three models for RLHF annotation: extension, evidence, and authority

A new paper proposes three distinct models for how human annotator judgments shape large language model behavior through Reinforcement Learning from Human Feedback (RLHF). These models are 'extension,' where annotators …
RESEARCH · CL_15418 · Apr 28 · 04:00

LLMs know they're wrong and agree anyway, research finds

Researchers have developed two novel methods, BAL-A and BMP-A, to efficiently poison preference datasets used in offline Reinforcement Learning from Human Feedback (RLHF) pipelines like Direct Preference Optimization (D…
RESEARCH · CL_06722 · Apr 28 · 04:00

Frontier LLMs like GPT-5.4 and Claude Opus 4.7 show significant verbal tics

A new paper analyzes the prevalence of verbal tics, such as repetitive phrases and sycophantic openers, in eight leading large language models. Researchers developed a Verbal Tic Index (VTI) to quantify these tics, find…
COMMENTARY · CL_05918 · Apr 27 · 22:44

AI coding agents reshape software quality expectations; new alignment theories emerge

Justine Moore suggests that advancements in AI coding agents are lowering tolerance for buggy or incomplete software, as these agents can quickly identify and fix issues. Separately, Jack Adler proposes that AI alignmen…
RESEARCH · CL_04993 · Apr 24 · 03:38

New 'Behavioral Canaries' audit LLM training data usage in RL fine-tuning

Researchers have developed a new auditing method called Behavioral Canaries to detect if large language models (LLMs) improperly use legally protected retrieved context during Reinforcement Learning from Human Feedback …
RESEARCH · CL_00955 · Dec 14 · 00:00

OpenAI explores weak-to-strong generalization for AI alignment

OpenAI has introduced a new research direction called weak-to-strong generalization, aiming to address the challenge of aligning future superintelligent AI systems with human supervision. Their initial experiments show …
RESEARCH · CL_02599 · Jun 13 · 07:00

OpenAI trains AI with human preference feedback; Chip Huyen proposes predictive model routing

OpenAI and DeepMind have developed a new algorithm that learns desired behaviors from human feedback, reducing the need for explicit goal functions. This method uses a three-step cycle where humans compare two agent beh…
