reinforcement learning from human feedback
PulseAugur coverage of reinforcement learning from human feedback — every cluster mentioning reinforcement learning from human feedback across labs, papers, and developer communities, ranked by signal.
- instance of Dopravní podnik Ostrava 90%
- used by large-language models 70%
- competes with Direct Preference Optimization: Your Language Model is Secretly a Reward Model 70%
- instance of Direct Preference Optimization 70%
- other large-language models 50%
- other Direct Preference Optimization: Your Language Model is Secretly a Reward Model 50%
- affiliated with Direct Preference Optimization 50%
11 days with sentiment data
-
New FPO method prevents alignment collapse in iterative RLHF models
Researchers have identified a phenomenon called alignment collapse in iterative Reinforcement Learning from Human Feedback (RLHF). This occurs when the AI policy exploits weaknesses in the reward model it is trained on,…
-
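New Logit-Gap Steering method effectively measures AI alignment robustness
Researchers have developed a new metric called the "refusal-affirmation logit gap" to quantify the safety margin of aligned language models. The metric measures the difference between the logits of refusal and affirmation tokens and can be computed efficiently with forward-pass diagnostics. The study also introduces logit-gap steering, a gradient-free method that finds short suffixes that close this safety gap, suggesting that current alignment margins may be thin and easy to manipulate.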
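As a rough illustration of the kind of quantity this item describes, the sketch below computes a gap between refusal-token and affirmation-token logits from a single vector of next-token logits. The function names, the token-id sets, the toy logits, and the choice to pool each group with log-sum-exp are all assumptions made for illustration; the paper's exact definition may differ.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def refusal_affirmation_gap(logits, refusal_ids, affirmation_ids):
    """Hypothetical gap: pooled refusal logit minus pooled affirmation logit.

    A positive value means the refusal tokens are favoured; a value near
    zero means the margin between refusing and complying is thin.
    """
    refusal = logsumexp([logits[i] for i in refusal_ids])
    affirmation = logsumexp([logits[i] for i in affirmation_ids])
    return refusal - affirmation


# Toy next-token logits over a 5-token vocabulary (made-up numbers).
logits = [2.0, 0.5, -1.0, 3.0, 0.0]

# Assume token 3 stands for a refusal opener and token 0 for an affirmation.
gap = refusal_affirmation_gap(logits, refusal_ids=[3], affirmation_ids=[0])
print(gap)
```

In a real setting the logits would come from one forward pass of the model on the prompt, which is why a diagnostic of this shape is cheap to compute.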
-
New research explores advanced reward modeling for LLMs and diffusion models
Several new research papers explore advancements in reward modeling for AI alignment, particularly for large language models and diffusion models. One paper introduces SelectiveRM, a framework using optimal transport to…
-
New research refines LLM alignment beyond DPO and RLHF
Researchers are exploring advanced methods for aligning large language models with human preferences, moving beyond traditional Reinforcement Learning from Human Feedback (RLHF). New approaches like Direct Preference Op…
-
New DoTS framework synthesizes SFT and RLVR LLM capabilities at inference time
Researchers have developed a novel post-hoc framework called Decoupled Test-time Synthesis (DoTS) to integrate Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) for large language models…
-
New statistical framework improves AI alignment with human feedback
Researchers have developed a new statistical framework for Reinforcement Learning from Human Feedback (RLHF) that improves how large models are aligned with human preferences. This method simultaneously handles online d…
-
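New paper derives exponential-family results from a single KL identity
Researchers have identified a fundamental identity for exponential families, the class of distributions behind core machine learning tools such as softmax and the Gaussian distribution. The identity simplifies the derivation of several key results in variational inference and reinforcement learning, including the Pythagorean theorem and the Gibbs variational principle. The findings are presented in a self-contained note and offer a more streamlined way to understand these mathematical concepts.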
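For readers unfamiliar with the Gibbs variational principle named in this item, its standard textbook statement is given below. This is the general form, not necessarily the notation or route used in the note; the symbols $p$, $q$, and $f$ are introduced here for illustration.

```latex
% Gibbs variational principle for a reference distribution p and a
% function f with E_p[e^{f(X)}] finite:
\log \mathbb{E}_{p}\!\left[e^{f(X)}\right]
  \;=\; \sup_{q \ll p}\,\Bigl\{\, \mathbb{E}_{q}[f(X)] \;-\; \mathrm{KL}(q \,\|\, p) \Bigr\},
% with the supremum attained by the exponentially tilted distribution
\qquad q^{*}(x) \;=\; \frac{p(x)\, e^{f(x)}}{\mathbb{E}_{p}\!\left[e^{f(X)}\right]} .
```

The same structure appears in KL-regularized RLHF objectives, where $p$ plays the role of the reference policy and $f$ a scaled reward.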
-
AI research reframes clinician overrides as implicit preference signals for value-based care
Researchers have developed a new framework that treats clinician overrides of AI recommendations as implicit preference signals, similar to RLHF but with expert annotators and observable outcomes. This approach introduc…
-
New diagnostic tool probes LLM circuits for safety and behavior insights
A new research paper introduces "Perturbation Probing," a diagnostic method for understanding the internal workings of large language models. This technique uses two forward passes per prompt to identify and analyze "be…
-
Goblin Mode, 24 Hours Later
AI models, particularly GPT-5.5, have exhibited a peculiar behavior dubbed "goblin mode," characterized by an unusual fixation on goblin-related imagery and language. This phenomenon gained traction on AI Twitter, with …
-
Hugging Face paper explores three models for RLHF annotation
A new paper proposes three distinct models for understanding the role of human annotators in Reinforcement Learning from Human Feedback (RLHF) pipelines. These models are 'extension,' where annotators mirror designers' …
-
Paper distinguishes three models for RLHF annotation: extension, evidence, and authority
A new paper proposes three distinct models for how human annotator judgments shape large language model behavior through Reinforcement Learning from Human Feedback (RLHF). These models are 'extension,' where annotators …
-
LLMs know they're wrong and agree anyway, research finds
Researchers have developed two novel methods, BAL-A and BMP-A, to efficiently poison preference datasets used in offline Reinforcement Learning from Human Feedback (RLHF) pipelines like Direct Preference Optimization (D…
-
Frontier LLMs like GPT-5.4 and Claude Opus 4.7 show significant verbal tics
A new paper analyzes the prevalence of verbal tics, such as repetitive phrases and sycophantic openers, in eight leading large language models. Researchers developed a Verbal Tic Index (VTI) to quantify these tics, find…
-
AI coding agents reshape software quality expectations; new alignment theories emerge
Justine Moore suggests that advancements in AI coding agents are lowering tolerance for buggy or incomplete software, as these agents can quickly identify and fix issues. Separately, Jack Adler proposes that AI alignmen…
-
New 'Behavioral Canaries' audit LLM training data usage in RL fine-tuning
Researchers have developed a new auditing method called Behavioral Canaries to detect if large language models (LLMs) improperly use legally protected retrieved context during Reinforcement Learning from Human Feedback …
-
OpenAI explores weak-to-strong generalization for AI alignment
OpenAI has introduced a new research direction called weak-to-strong generalization, aiming to address the challenge of aligning future superintelligent AI systems with human supervision. Their initial experiments show …
-
OpenAI trains AI with human preference feedback; Chip Huyen proposes predictive model routing
OpenAI and DeepMind have developed a new algorithm that learns desired behaviors from human feedback, reducing the need for explicit goal functions. This method uses a three-step cycle where humans compare two agent beh…