Researchers explore preference alignment and failure mitigation techniques for LLMs

By PulseAugur Editorial · [41 sources] · 2026-05-25 04:00

Researchers are exploring new methods for aligning large language models (LLMs) with human preferences and mitigating specific failure modes. One approach uses Direct Preference Optimization (DPO) to reduce text degeneration in OCR models by leveraging the model's own failures as training signals. Other research focuses on understanding and controlling LLMs' temporal preference reasoning, developing lightweight local preference harnesses for personal agents, and creating frameworks for human-centric preference-driven judgment. Techniques like Inclusion-of-Thoughts and Critique-Driven Reasoning Alignment aim to improve LLM decision-making stability and interpretability.
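For readers unfamiliar with DPO, the standard per-pair objective can be sketched in a few lines. This is a minimal illustration of the generic DPO loss, not the method of any specific paper listed here; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for the example. In the OCR setting described above, the rejected response would presumably be one of the model's own degenerate outputs.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each full response under the
    policy being trained and under a frozen reference model. `beta` is an
    illustrative default, not a recommended setting.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss shrinks as the policy raises the chosen response relative to
# the rejected one, measured against the reference model.
loss_when_preferring_chosen = dpo_loss(-1.0, -2.0, -1.5, -1.5)
loss_when_preferring_rejected = dpo_loss(-2.0, -1.0, -1.5, -1.5)
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; training pushes the margin positive.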

IMPACT New methods for preference alignment and failure mitigation could lead to more reliable and controllable LLMs.

RANK_REASON Multiple arXiv papers present novel research on LLM alignment and preference modeling.


AI-generated summary · Google Gemini · from 41 sources.

COVERAGE [41]

Hugging Face Blog TIER_1 English(EN) · 2026-06-03 12:55

Direct Preference Optimization Beyond Chatbots
arXiv cs.AI TIER_1 English(EN) · Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi · 2026-06-12 04:00

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

arXiv:2606.12590v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Exi…
arXiv cs.AI TIER_1 English(EN) · Pengwei Sun · 2026-06-12 04:00

Boosting Direct Preference Optimization with Penalization

arXiv:2606.12505v1 Announce Type: cross Abstract: Offline preference optimization has become a practical substitute for reinforcement learning from human feedback, but pairwise objectives such as Direct Preference Optimization (DPO) and its variants use only the chosen and reject…
arXiv cs.AI TIER_1 English(EN) · Masanari Oi, Mahiro Ukai, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue · 2026-06-11 04:00

Autoregressive Direct Preference Optimization

arXiv:2602.09533v2 Announce Type: replace Abstract: Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the response-level Bradley-Terry (BT) model may limit…
arXiv cs.AI TIER_1 English(EN) · Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu · 2026-06-10 04:00

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

arXiv:2410.15595v4 Announce Type: replace Abstract: With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, …
arXiv cs.LG TIER_1 English(EN) · Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen · 2026-06-08 04:00

Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

arXiv:2505.10892v2 Announce Type: replace Abstract: Post-training LLMs with RLHF and preference optimization methods (e.g., DPO, IPO) has greatly improved alignment, yet these approaches assume a single objective. In reality, humans express multiple, often conflicting objectives,…
arXiv cs.CL TIER_1 English(EN) · Julia Sepúlveda Coelho, Scott A. Hale · 2026-06-08 04:00

What Do People Actually Want From AI? Mapping Preference Plurality

arXiv:2606.06674v1 Announce Type: new Abstract: Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting prefere…
arXiv cs.CL TIER_1 English(EN) · Zeyu Gan, Huayi Tang, Yong Liu · 2026-06-05 04:00

Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents

arXiv:2606.05828v1 Announce Type: cross Abstract: As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling p…
arXiv cs.CL TIER_1 English(EN) · Ian Rios-Sialer, Shantanu Darveshi, Shuai Jiang, Avigya Paudel, Anastasiia Pronina, Ipshita Bandyopadhyay, Justin Shenk · 2026-06-05 04:00

Temporal Preference Concepts and their Functions in a Large Language Model

arXiv:2606.05194v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly being deployed to make decisions that require trading off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these trade…
arXiv cs.CL TIER_1 English(EN) · Scott A. Hale · 2026-06-04 19:47

What Do People Actually Want From AI? Mapping Preference Plurality

Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, …
arXiv cs.CL TIER_1 English(EN) · Yong Liu · 2026-06-04 08:07

Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents

As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling personal agents to learn and adapt to implicit user…
arXiv cs.AI TIER_1 English(EN) · Mohammad Reza Ghasemi Madani, Soyeon Caren Han, Shuo Yang, Jey Han Lau · 2026-06-04 04:00

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

arXiv:2604.04944v2 Announce Type: replace-cross Abstract: Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, r…
arXiv cs.AI TIER_1 English(EN) · Peiming Li, Zhiyuan Hu, Yang Tang, Shiyu Li, Xi Chen · 2026-06-04 04:00

Aligning Deep Implicit Preferences by Learning to Reason Defensively

arXiv:2510.11194v3 Announce Type: replace Abstract: Personalized alignment is crucial for enabling Large Language Models (LLMs) to engage effectively in user-centric interactions. However, current methods face a dual challenge: they fail to infer users' deep implicit preferences …
arXiv cs.AI TIER_1 English(EN) · Yifan Wang, Jinyi Mu, Mayank Jobanputra, Yu Wang, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg · 2026-06-04 04:00

Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

arXiv:2606.04284v1 Announce Type: cross Abstract: Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function,…
arXiv cs.CL TIER_1 English(EN) · Yuehan Qin, Li Li, Linxin Song, Wei Yang, Jiate Li, Yuqing Yang, Yue Zhao · 2026-06-03 04:00

Memory Retrieval for Changing Preferences

arXiv:2606.02976v1 Announce Type: new Abstract: Long-context dialogue systems must decide both when to access memory and which parts of the interaction history are relevant. Existing approaches typically rely on heuristic retrieval signals or always-on memory usage, failing to ac…
arXiv cs.CL TIER_1 English(EN) · Rui Li, Junfeng Liu, Xiangwen Kong, Linhai Xu, Zhifang Sui · 2026-06-03 04:00

SenseJudge: Human-Centric Preference-Driven Judgment Framework

arXiv:2606.03189v1 Announce Type: new Abstract: Large Language Models (LLMs) as judges across various scenarios such as assessing model responses is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on trained judgers using fixed prefere…
arXiv cs.AI TIER_1 (CA) · Edwin V. Bonilla, He Zhao, Daniel M. Steinberg · 2026-06-03 04:00

Causal Preference Elicitation

arXiv:2602.01483v2 Announce Type: replace-cross Abstract: We propose causal preference elicitation, a Bayesian framework for expert-in-the-loop causal discovery that actively queries local edge relations to concentrate a posterior over directed acyclic graphs (DAGs). From any bla…
arXiv cs.CL TIER_1 English(EN) · Zhifang Sui · 2026-06-02 05:48

SenseJudge: Human-Centric Preference-Driven Judgment Framework

Large Language Models (LLMs) as judges across various scenarios such as assessing model responses is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on trained judgers using fixed preference data, which tend to overlook diverse user pr…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 05:48

SenseJudge: Human-Centric Preference-Driven Judgment Framework

Large Language Models (LLMs) as judges across various scenarios such as assessing model responses is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on trained judgers using fixed preference data, which tend to overlook diverse user pr…
arXiv cs.LG TIER_1 English(EN) · Jing Dong, Yaoliang Yu, Pascal Poupart · 2026-06-02 04:00

The Representation-Rationalizability Tradeoff in Reward Learning

arXiv:2606.00291v1 Announce Type: cross Abstract: In RLHF, each training example contains a prompt $x$ and two candidate responses $y,y'$, and annotators provide pairwise preferences between these responses. The learning problem is to convert these heterogeneous pairwise judgment…
arXiv cs.AI TIER_1 English(EN) · Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause · 2026-06-02 04:00

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

arXiv:2603.09692v2 Announce Type: replace-cross Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resourc…
arXiv cs.AI TIER_1 English(EN) · Hyung Gyu Rho · 2026-06-02 04:00

Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

arXiv:2510.05342v2 Announce Type: replace-cross Abstract: Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preferenc…
arXiv cs.LG TIER_1 English(EN) · Jun-Jie Yang, Chia-Heng Hsu, Kui-Yuan Chen, Ping-Chun Hsieh · 2026-06-02 04:00

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

arXiv:2606.01123v1 Announce Type: new Abstract: Preference-based reinforcement learning (PbRL) avoids explicit reward engineering by learning from pairwise human preference feedback. Existing offline PbRL methods typically follow a two-stage pipeline, first learning a reward or p…
arXiv cs.LG TIER_1 English(EN) · Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar · 2026-06-01 04:00

What Does Preference Learning Recover from Pairwise Comparison Data?

arXiv:2602.10286v2 Announce Type: replace Abstract: Pairwise preference learning is central to machine learning, with recent applications in aligning language models with human preferences. A typical dataset consists of triplets $(x, y^+, y^-)$, where response $y^+$ is preferred …
arXiv cs.AI TIER_1 English(EN) · Christian Moya, Alex Semendinger, Guang Lin, Elliott Thornley · 2026-06-01 04:00

Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

arXiv:2605.11134v2 Announce Type: replace-cross Abstract: Preference learning methods like Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today's language models and potentially severe goal misg…
arXiv cs.AI TIER_1 English(EN) · Zhenyu Sun, Zheng Xu, Ermin Wei · 2026-05-29 04:00

In-Context Reward Adaptation for Robust Preference Modeling

arXiv:2605.30323v1 Announce Type: cross Abstract: Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward …
arXiv cs.AI TIER_1 English(EN) · Ermin Wei · 2026-05-28 17:56

In-Context Reward Adaptation for Robust Preference Modeling

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to gener…
arXiv cs.CL TIER_1 English(EN) · Shaobo Wang, Guo Chen, Ziyue Wang, Zhengyang Tang, Qingyang Liu, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang · 2026-05-28 04:00

The Missing Piece in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates

arXiv:2605.28020v1 Announce Type: new Abstract: With the rapid progress of large language models (LLMs), reliably evaluating the capabilities of pre-trained LLMs has become increasingly important. The challenge is that base pre-trained models are optimized for next-token predicti…
arXiv cs.LG TIER_1 English(EN) · Yikang Gui, Prashant Doshi · 2026-05-28 04:00

Inversely Learning Transferable Rewards via Abstracted States

arXiv:2501.01669v4 Announce Type: replace Abstract: Inverse reinforcement learning (IRL) has progressed significantly toward accurately learning the underlying rewards in both discrete and continuous domains from behavior data. The next advance is to learn intrinsic prefere…
arXiv cs.LG TIER_1 English(EN) · Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue · 2026-05-27 04:00

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

arXiv:2605.26491v1 Announce Type: new Abstract: Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary …
arXiv cs.CL TIER_1 English(EN) · Jared Scott, Jesse Roberts · 2026-05-27 04:00

KARMA: Karma-Aligned Reward Model Adaptation

arXiv:2605.26738v1 Announce Type: new Abstract: Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framew…
arXiv cs.CL TIER_1 English(EN) · Jesse Roberts · 2026-05-26 09:12

KARMA: Karma-Aligned Reward Model Adaptation

Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framework for LLM learning of context-sensitive conver…
arXiv cs.AI TIER_1 English(EN) · Payel Bhattacharjee, Osvaldo Simeone, Ravi Tandon · 2026-05-26 04:00

MARS: Margin and Semantic-Aware Data Augmentation for Reward Modeling

arXiv:2602.17658v2 Announce Type: replace-cross Abstract: Reward modeling is central to alignment pipelines such as RLHF, RLAIF, and PPO-based policy optimization, yet its reliability is constrained by limited and heterogeneous human preference data that are expensive to collect …
arXiv cs.AI TIER_1 English(EN) · Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo · 2026-05-25 04:00

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

arXiv:2602.11146v2 Announce Type: replace-cross Abstract: Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary rewar…
arXiv stat.ML TIER_1 English(EN) · Nathan Kallus · 2026-06-04 04:00

Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

arXiv:2512.21917v3 Announce Type: replace-cross Abstract: Policy alignment to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., Bradley-Terry model / logistic link). Misspecification of this link can bias inferred rewar…
arXiv stat.ML TIER_1 English(EN) · Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar · 2026-06-01 04:00

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

arXiv:2605.30619v1 Announce Type: new Abstract: Best-of-$N$ sampling is widely used to construct pairwise preference data: $N$ candidates are drawn from a base distribution, and the best is paired with a rejected response. Despite its widespread use, what Bradley-Terry (BT) rewa…
arXiv stat.ML TIER_1 English(EN) · Pradeep Ravikumar · 2026-05-28 22:15

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

Best-of-$N$ sampling is widely used to construct pairwise preference data: $N$ candidates are drawn from a base distribution, and the best is paired with a rejected response. Despite its widespread use, what Bradley-Terry (BT) reward learning extracts from such data, and how to …
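The pair-construction step described in this abstract can be sketched as follows. This is an illustrative reading only: the abstract does not say how the rejected response is chosen, so picking it uniformly at random from the remaining candidates is an assumption, as are the names `best_of_n_pair` and `score`. The candidates are assumed to have already been drawn from the base distribution.

```python
import random

def best_of_n_pair(candidates, score, rng=None):
    """Build one (chosen, rejected) pair from N sampled candidates.

    `score` is a hypothetical quality function (for example a reward
    model or a human rating). The rejected response is drawn uniformly
    from the non-best candidates, which is an assumption of this sketch.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    rng = rng or random.Random(0)
    # Index of the highest-scoring candidate (first one wins ties).
    best_idx = max(range(len(candidates)), key=lambda i: score(candidates[i]))
    rest = [c for i, c in enumerate(candidates) if i != best_idx]
    return candidates[best_idx], rng.choice(rest)

# Toy usage: response length stands in for a quality score.
chosen, rejected = best_of_n_pair(["a", "bbb", "cc"], score=len)
```

Other rejection rules, such as always taking the lowest-scoring candidate, change which comparisons the data contains, which is the kind of design choice the paper's title refers to.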
arXiv stat.ML TIER_1 English(EN) · Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Gabriele Farina, Sobhan Mohammadpour · 2026-05-28 04:00

Learning Correlated Reward Models: Statistical Barriers and Opportunities

arXiv:2510.15839v2 Announce Type: replace-cross Abstract: Random Utility Models (RUMs) are a classical framework for modeling user preferences and play a key role in reward modeling for Reinforcement Learning from Human Feedback (RLHF). However, a crucial shortcoming of many of t…
arXiv cs.CV TIER_1 English(EN) · Jaxon Zhang, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu · 2026-05-26 04:00

DRM: Diffusion-based Reward Model With Step-wise Guidance

arXiv:2605.25661v1 Announce Type: new Abstract: Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture the essential perceptual …
arXiv cs.CV TIER_1 English(EN) · Jing Lyu · 2026-05-25 10:11

DRM: Diffusion-based Reward Model With Step-wise Guidance

Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture the essential perceptual qualities, such as aesthetics, composition, and v…
dev.to — LLM tag TIER_1 English(EN) · pixelbank dev · 2026-06-06 23:10

Human Preference Data — Deep Dive + Problem: Separable Filter Optimization

A daily deep dive into LLM topics, coding problems, and platform features from PixelBank (https://pixelbank.dev). Topic Deep Dive: Human Preference Data. From the RLHF & Alignment chapter. Int…
