Direct Preference Optimization Beyond Chatbots

Research explores preference alignment and failure mitigation techniques for large language models

By PulseAugur Editorial · [41 sources] · 2026-05-25 04:00

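Researchers are exploring new methods to align large language models (LLMs) with human preferences and to mitigate specific failure modes. One approach uses Direct Preference Optimization (DPO) to turn a model's own failures into a training signal, reducing text degeneration in OCR models. Other work focuses on understanding and controlling time-preference reasoning in LLMs, developing lightweight local preference toolkits for personal agents, and building human-centered, preference-driven judgment frameworks. Techniques such as Inclusion-of-Thoughts and Critique-Driven Reasoning Alignment aim to improve the stability and interpretability of LLM decisions.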
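To make the DPO idea in the summary concrete, here is a minimal sketch of the standard per-pair DPO loss, computed from sequence log-probabilities. The function name `dpo_loss`, the example log-probability values, and the pairing of a clean transcription (chosen) with the model's own degenerate output (rejected) are illustrative assumptions, not details taken from the cited work.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# Assumed example: the rejected response is the model's own failed output.
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # policy equals reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)   # policy now favors the chosen response
print(baseline, improved)
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference model does, which is how a failure case can act as the negative half of a training pair.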

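Impact: New approaches to preference alignment and failure mitigation could lead to more reliable and more controllable large language models.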

Ranking rationale: Multiple arXiv papers present new research on LLM alignment and preference modeling.


AI-generated summary · Google Gemini · from 41 sources.

Sources [41]

Hugging Face Blog TIER_1 English(EN) · 2026-06-03 12:55

Direct Preference Optimization Beyond Chatbots
arXiv cs.AI TIER_1 English(EN) · Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi · 2026-06-12 04:00

Analyzing and Improving Fine-Grained Preference Optimization in Medical Large Language Models

arXiv:2606.12590v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Exi…
arXiv cs.AI TIER_1 English(EN) · Pengwei Sun · 2026-06-12 04:00

Enhancing Direct Preference Optimization with a Penalty Mechanism

arXiv:2606.12505v1 Announce Type: cross Abstract: Offline preference optimization has become a practical substitute for reinforcement learning from human feedback, but pairwise objectives such as Direct Preference Optimization (DPO) and its variants use only the chosen and reject…
arXiv cs.AI TIER_1 English(EN) · Masanari Oi, Mahiro Ukai, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue · 2026-06-11 04:00

Autoregressive Direct Preference Optimization

arXiv:2602.09533v2 Announce Type: replace Abstract: Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the response-level Bradley-Terry (BT) model may limit…
arXiv cs.AI TIER_1 English(EN) · Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu · 2026-06-10 04:00

Direct Preference Optimization: A Comprehensive Survey of Datasets, Theory, Variants, and Applications

arXiv:2410.15595v4 Announce Type: replace Abstract: With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, …
arXiv cs.LG TIER_1 English(EN) · Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen · 2026-06-08 04:00

Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

arXiv:2505.10892v2 Announce Type: replace Abstract: Post-training LLMs with RLHF and preference optimization methods (e.g., DPO, IPO) has greatly improved alignment, yet these approaches assume a single objective. In reality, humans express multiple, often conflicting objectives,…
arXiv cs.CL TIER_1 English(EN) · Julia Sepúlveda Coelho, Scott A. Hale · 2026-06-08 04:00

What Do People Really Want? An Analysis of the Diversity of AI Preferences

arXiv:2606.06674v1 Announce Type: new Abstract: Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting prefere…
arXiv cs.CL TIER_1 English(EN) · Zeyu Gan, Huayi Tang, Yong Liu · 2026-06-05 04:00

Statistical Priors for Implicit Preferences: Decoupling Skill Selection into Local Constraints in Personal Agents

arXiv:2606.05828v1 Announce Type: cross Abstract: As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling p…
arXiv cs.CL TIER_1 English(EN) · Ian Rios-Sialer, Shantanu Darveshi, Shuai Jiang, Avigya Paudel, Anastasiia Pronina, Ipshita Bandyopadhyay, Justin Shenk · 2026-06-05 04:00

Time Preference Concepts in Large Language Models and Their Function

arXiv:2606.05194v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly being deployed to make decisions that require trading off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these trade…
arXiv cs.CL TIER_1 English(EN) · Scott A. Hale · 2026-06-04 19:47

What Do People Really Want? An Analysis of the Diversity of AI Preferences

Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, …
arXiv cs.CL TIER_1 English(EN) · Yong Liu · 2026-06-04 08:07

Statistical Priors for Implicit Preferences: Decoupling Skill Selection into Local Constraints in Personal Agents

As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling personal agents to learn and adapt to implicit user…
arXiv cs.AI TIER_1 English(EN) · Mohammad Reza Ghasemi Madani, Soyeon Caren Han, Shuo Yang, Jey Han Lau · 2026-06-04 04:00

Inclusion-of-Thoughts: Mitigating Preference Instability by Purifying the Decision Space

arXiv:2604.04944v2 Announce Type: replace-cross Abstract: Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, r…
arXiv cs.AI TIER_1 English(EN) · Peiming Li, Zhiyuan Hu, Yang Tang, Shiyu Li, Xi Chen · 2026-06-04 04:00

Aligning Deep Implicit Preferences by Learning Defensive Reasoning

arXiv:2510.11194v3 Announce Type: replace Abstract: Personalized alignment is crucial for enabling Large Language Models (LLMs) to engage effectively in user-centric interactions. However, current methods face a dual challenge: they fail to infer users' deep implicit preferences …
arXiv cs.AI TIER_1 English(EN) · Yifan Wang, Jinyi Mu, Mayank Jobanputra, Yu Wang, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg · 2026-06-04 04:00

Sparse Mixture-of-Experts Reward Models Learn Interpretable Specialized Experts for Personalized Preference Modeling

arXiv:2606.04284v1 Announce Type: cross Abstract: Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function,…
arXiv cs.CL TIER_1 English(EN) · Yuehan Qin, Li Li, Linxin Song, Wei Yang, Jiate Li, Yuqing Yang, Yue Zhao · 2026-06-03 04:00

Memory Retrieval for Evolving Preferences

arXiv:2606.02976v1 Announce Type: new Abstract: Long-context dialogue systems must decide both when to access memory and which parts of the interaction history are relevant. Existing approaches typically rely on heuristic retrieval signals or always-on memory usage, failing to ac…
arXiv cs.CL TIER_1 English(EN) · Rui Li, Junfeng Liu, Xiangwen Kong, Linhai Xu, Zhifang Sui · 2026-06-03 04:00

SenseJudge: A Human-Centered, Preference-Driven Judgment Framework

arXiv:2606.03189v1 Announce Type: new Abstract: Large Language Models (LLMs) as judges across various scenarios such as assessing model responses is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on trained judgers using fixed prefere…
arXiv cs.AI TIER_1 (CA) · Edwin V. Bonilla, He Zhao, Daniel M. Steinberg · 2026-06-03 04:00

Causal Preference Elicitation

arXiv:2602.01483v2 Announce Type: replace-cross Abstract: We propose causal preference elicitation, a Bayesian framework for expert-in-the-loop causal discovery that actively queries local edge relations to concentrate a posterior over directed acyclic graphs (DAGs). From any bla…
arXiv cs.CL TIER_1 English(EN) · Zhifang Sui · 2026-06-02 05:48

SenseJudge: A Human-Centered, Preference-Driven Judgment Framework

Large Language Models (LLMs) as judges across various scenarios such as assessing model responses is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on trained judgers using fixed preference data, which tend to overlook diverse user pr…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 05:48

SenseJudge: A Human-Centered, Preference-Driven Judgment Framework

Large Language Models (LLMs) as judges across various scenarios such as assessing model responses is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on trained judgers using fixed preference data, which tend to overlook diverse user pr…
arXiv cs.LG TIER_1 English(EN) · Jing Dong, Yaoliang Yu, Pascal Poupart · 2026-06-02 04:00

The Representation-Interpretability Trade-off in Reward Learning

arXiv:2606.00291v1 Announce Type: cross Abstract: In RLHF, each training example contains a prompt $x$ and two candidate responses $y,y'$, and annotators provide pairwise preferences between these responses. The learning problem is to convert these heterogeneous pairwise judgment…
arXiv cs.AI TIER_1 English(EN) · Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause · 2026-06-02 04:00

ActiveUltraFeedback: Efficient Preference Data Generation Using Active Learning

arXiv:2603.09692v2 Announce Type: replace-cross Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resourc…
arXiv cs.AI TIER_1 English(EN) · Hyung Gyu Rho · 2026-06-02 04:00

Margin Adaptive DPO: Leveraging Reward Models for Fine-Grained Control in Preference Optimization

arXiv:2510.05342v2 Announce Type: replace-cross Abstract: Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preferenc…
arXiv cs.LG TIER_1 English(EN) · Jun-Jie Yang, Chia-Heng Hsu, Kui-Yuan Chen, Ping-Chun Hsieh · 2026-06-02 04:00

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

arXiv:2606.01123v1 Announce Type: new Abstract: Preference-based reinforcement learning (PbRL) avoids explicit reward engineering by learning from pairwise human preference feedback. Existing offline PbRL methods typically follow a two-stage pipeline, first learning a reward or p…
arXiv cs.LG TIER_1 English(EN) · Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar · 2026-06-01 04:00

What Can Preference Learning Recover from Pairwise Comparison Data?

arXiv:2602.10286v2 Announce Type: replace Abstract: Pairwise preference learning is central to machine learning, with recent applications in aligning language models with human preferences. A typical dataset consists of triplets $(x, y^+, y^-)$, where response $y^+$ is preferred …
arXiv cs.AI TIER_1 English(EN) · Christian Moya, Alex Semendinger, Guang Lin, Elliott Thornley · 2026-06-01 04:00

Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

arXiv:2605.11134v2 Announce Type: replace-cross Abstract: Preference learning methods like Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today's language models and potentially severe goal misg…
arXiv cs.AI TIER_1 English(EN) · Zhenyu Sun, Zheng Xu, Ermin Wei · 2026-05-29 04:00

Contextual Reward Adaptation for Robust Preference Modeling

arXiv:2605.30323v1 Announce Type: cross Abstract: Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward …
arXiv cs.AI TIER_1 English(EN) · Ermin Wei · 2026-05-28 17:56

Contextual Reward Adaptation for Robust Preference Modeling

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to gener…
arXiv cs.CL TIER_1 English(EN) · Shaobo Wang, Guo Chen, Ziyue Wang, Zhengyang Tang, Qingyang Liu, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang · 2026-05-28 04:00

The Missing Link in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates

arXiv:2605.28020v1 Announce Type: new Abstract: With the rapid progress of large language models (LLMs), reliably evaluating the capabilities of pre-trained LLMs has become increasingly important. The challenge is that base pre-trained models are optimized for next-token predicti…
arXiv cs.LG TIER_1 English(EN) · Yikang Gui, Prashant Doshi · 2026-05-28 04:00

Inverse Learning of Transferable Rewards via Abstract States

arXiv:2501.01669v4 Announce Type: replace Abstract: Inverse reinforcement learning (IRL) has progressed significantly toward accurately learning the underlying rewards in both discrete and continuous domains from behavior data. The next advance is to learn intrinsic prefere…
arXiv cs.LG TIER_1 English(EN) · Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue · 2026-05-27 04:00

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

arXiv:2605.26491v1 Announce Type: new Abstract: Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary …
arXiv cs.CL TIER_1 English(EN) · Jared Scott, Jesse Roberts · 2026-05-27 04:00

KARMA: Karma-Aligned Reward Model Adaptation

arXiv:2605.26738v1 Announce Type: new Abstract: Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framew…
arXiv cs.CL TIER_1 English(EN) · Jesse Roberts · 2026-05-26 09:12

KARMA: Karma-Aligned Reward Model Adaptation

Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framework for LLM learning of context-sensitive conver…
arXiv cs.AI TIER_1 English(EN) · Payel Bhattacharjee, Osvaldo Simeone, Ravi Tandon · 2026-05-26 04:00

MARS: Margin- and Semantics-Aware Data Augmentation for Reward Modeling

arXiv:2602.17658v2 Announce Type: replace-cross Abstract: Reward modeling is central to alignment pipelines such as RLHF, RLAIF, and PPO-based policy optimization, yet its reliability is constrained by limited and heterogeneous human preference data that are expensive to collect …
arXiv cs.AI TIER_1 English(EN) · Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo · 2026-05-25 04:00

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

arXiv:2602.11146v2 Announce Type: replace-cross Abstract: Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary rewar…
arXiv stat.ML TIER_1 English(EN) · Nathan Kallus · 2026-06-04 04:00

Semiparametric Preference Optimization: Your Language Model Is Actually a Single-Index Model

arXiv:2512.21917v3 Announce Type: replace-cross Abstract: Policy alignment to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., Bradley-Terry model / logistic link). Misspecification of this link can bias inferred rewar…
arXiv stat.ML TIER_1 English(EN) · Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar · 2026-06-01 04:00

Reward Learning from Best-of-$N$ Preference Data: Objectives, Trade-offs, and Design Principles

arXiv:2605.30619v1 Announce Type: new Abstract: Best-of-$N$ sampling is widely used to construct pairwise preference data: $N$ candidates are drawn from a base distribution, and the best is paired with a rejected response. Despite its widespread use, what Bradley-Terry (BT) rewa…
arXiv stat.ML TIER_1 English(EN) · Pradeep Ravikumar · 2026-05-28 22:15

Reward Learning from Best-of-$N$ Preference Data: Objectives, Trade-offs, and Design Principles

Best-of-$N$ sampling is widely used to construct pairwise preference data: $N$ candidates are drawn from a base distribution, and the best is paired with a rejected response. Despite its widespread use, what Bradley-Terry (BT) reward learning extracts from such data, and how to …
arXiv stat.ML TIER_1 English(EN) · Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Gabriele Farina, Sobhan Mohammadpour · 2026-05-28 04:00

Learning Correlated Reward Models: Statistical Barriers and Opportunities

arXiv:2510.15839v2 Announce Type: replace-cross Abstract: Random Utility Models (RUMs) are a classical framework for modeling user preferences and play a key role in reward modeling for Reinforcement Learning from Human Feedback (RLHF). However, a crucial shortcoming of many of t…
arXiv cs.CV TIER_1 English(EN) · Jaxon Zhang, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu · 2026-05-26 04:00

DRM: Diffusion-Based Reward Model with Step-Wise Guidance

arXiv:2605.25661v1 Announce Type: new Abstract: Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture the essential perceptual …
arXiv cs.CV TIER_1 English(EN) · Jing Lyu · 2026-05-25 10:11

DRM: Diffusion-Based Reward Model with Step-Wise Guidance

Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture the essential perceptual qualities, such as aesthetics, composition, and v…
dev.to — LLM tag TIER_1 English(EN) · pixelbank dev · 2026-06-06 23:10

Human Preference Data: Deep Dive + Problem: Separable Filter Optimization

A daily deep dive into LLM topics, coding problems, and platform features from PixelBank (https://pixelbank.dev). Topic Deep Dive: Human Preference Data. From the RLHF & Alignment chapter. Int…
