New DPO Methods Enhance LLM Alignment with Adaptive Techniques

By PulseAugur Editorial Team · [7 sources] · 2025-04-17 00:00

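Researchers have reported several advances in Direct Preference Optimization (DPO), a method for aligning large language models (LLMs) with human preferences. AdaDPO introduces adaptive coefficients that balance gradient updates, improving efficiency and mitigating length bias, and it outperforms standard DPO on benchmarks. Uni-DPO offers a unified dynamic framework that adaptively reweights samples according to data quality and model performance, with reported results that surpass Claude 3 Opus across a range of tasks. AttentionPO uses the LLM's own attention mechanism to weight tokens, which makes the objective content-aware and improves benchmark performance.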
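For readers who want the baseline that these papers modify, the standard DPO objective for a single preference pair can be sketched as follows. This is a minimal illustration of the widely published DPO loss, not code from any of the papers summarized here; the function names, the example log-probabilities, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; beta scales the implicit reward
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more likely the policy makes each response than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the chosen response gains ground on the rejected one.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen log-prob and lowering the rejected one reduces the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The adaptive methods described above can be read as changing how the two log-ratio terms, or individual samples and tokens, are weighted inside this expression.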

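Impact: These DPO advances offer more effective and efficient ways to align LLMs with human preferences, which could lead to more helpful and accurate AI assistants.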

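Ranking rationale: Multiple research papers introduce novel methods and improvements to Direct Preference Optimization (DPO) for LLM alignment.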

AI-generated summary · Google Gemini · from 7 sources.

Sources [7]

arXiv cs.CL TIER_1 English(EN) · Shaolong Chen, Madalina Ciobanu, Qingqing Mao, Ritankar Das · 2026-05-28 04:00

AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates

arXiv:2605.28440v1 Announce Type: new Abstract: DPO has become a widely adopted alternative to RLHF for aligning LLMs with human preferences, eliminating the need for a separate reward model or RL loop. Recent theoretical analysis uncovers an asymmetric gradient behavior in DPO: …
arXiv cs.CL TIER_1 English(EN) · Ritankar Das · 2026-05-27 13:05

AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates

DPO has become a widely adopted alternative to RLHF for aligning LLMs with human preferences, eliminating the need for a separate reward model or RL loop. Recent theoretical analysis uncovers an asymmetric gradient behavior in DPO: the loss suppresses dispreferred responses subst…
arXiv cs.AI TIER_1 English(EN) · Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang · 2026-05-26 04:00

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs

arXiv:2506.10054v4 Announce Type: replace-cross Abstract: Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based methods typically treat all preferenc…
arXiv cs.CL TIER_1 English(EN) · Xiaobo Wang, Zixia Jia, Jiaqi Li, Qi Liu, Zilong Zheng · 2026-05-26 04:00

Adaptive Preference Optimization with Uncertainty-aware Utility Anchor

arXiv:2509.10515v1 Announce Type: cross Abstract: Offline preference optimization methods are efficient for large language models (LLMs) alignment. Direct Preference optimization (DPO)-like learning, one of the most popular approaches, stands out for its efficiency in reward mode…
arXiv cs.CL TIER_1 English(EN) · Chengyu Huang, Zhuohang Li, Sheng-Yen Chou, Claire Cardie · 2026-05-22 04:00

Token-weighted Direct Preference Optimization with Attention

arXiv:2605.21883v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of indiv…
arXiv cs.CL TIER_1 English(EN) · Claire Cardie · 2026-05-21 01:43

Token-weighted Direct Preference Optimization with Attention

Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual tokens. Existing token-level PO methods co…
Together AI blog TIER_1 English(EN) · 2025-04-17 00:00

Direct Preference Optimization: A Technical Deep Dive

Together AI now supports DPO fine-tuning. Learn how Direct Preference Optimization aligns language models with human preferences — with code examples and technical details.
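
The token-weighting idea mentioned in the "Token-weighted Direct Preference Optimization with Attention" excerpt above can be illustrated generically. The sketch below is not the paper's method: the excerpt does not say how weights are derived from attention, so the weights here are caller-supplied placeholders, and every function name and the default `beta=0.1` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def normalize_weights(scores: Sequence[float]) -> List[float]:
    """Rescale non-negative scores so that they average to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s * len(scores) / total for s in scores]


def weighted_logratio(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted sum of per-token log-ratios between policy and reference."""
    if not (len(policy_logps) == len(ref_logps) == len(weights)):
        raise ValueError("all inputs must have one entry per token")
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))


# One response: (policy token log-probs, reference token log-probs, weights).
Response = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def token_weighted_dpo_loss(
    chosen: Response, rejected: Response, beta: float = 0.1
) -> float:
    """DPO-style loss in which each token's log-ratio carries its own weight."""
    margin = beta * (weighted_logratio(*chosen) - weighted_logratio(*rejected))
    return -log_sigmoid(margin)


# Placeholder weights of 1.0 per token; a real method would supply its own.
chosen = ([-1.0, -2.0], [-1.5, -2.5], [1.0, 1.0])
rejected = ([-3.0, -1.0], [-2.5, -0.5], [1.0, 1.0])
loss = token_weighted_dpo_loss(chosen, rejected)
```

With every weight set to 1, this reduces to the standard sequence-level DPO loss, which makes it easy to check a weighting scheme against the unweighted baseline.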
