Direct Preference Optimization Beyond Chatbots
Researchers are extending preference-alignment methods for large language models (LLMs) to target specific failure modes. One approach applies Direct Preference Optimization (DPO) to reduce text degeneration in OCR models, using the model's own failed outputs as training signals. Other work studies how to understand and control LLMs' temporal preference reasoning, builds lightweight local preference harnesses for personal agents, and proposes frameworks for human-centric, preference-driven judgment. Techniques such as Inclusion-of-Thoughts and Critique-Driven Reasoning Alignment aim to make LLM decision-making more stable and interpretable.
IMPACT: New methods for preference alignment and failure mitigation could lead to more reliable and controllable LLMs.
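For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO loss, not the implementation used in any of the work summarized here. It assumes each preference pair has a "chosen" output (for example, a clean transcription) and a "rejected" output (for example, a degenerate, repetitive one), and that summed log-probabilities for both are available from the policy being trained and from a frozen reference model. The function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    All *_logp arguments are summed token log-probabilities of a full
    response. beta (an assumed default) scales how strongly the policy is
    pushed away from the reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive margin means the policy favors the chosen response more
    # than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference on both responses, the margin is zero and the loss equals log 2. The loss falls as the policy assigns relatively more probability to the chosen output and rises when it favors the rejected one, which is how a model's own failures can serve as the "rejected" side of a training pair.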
- DiNa-LRM
- Vision-Language Models
- Gongye Liu
- Diffusion LAIR
- Large Language Models
- Mistral-7B
- Direct Preference Optimization
- OpenAI Gym
- Bradley-Terry model
- MARS
- Reinforcement Learning from Human Feedback
- Energy-Based Decoding
- Qwen3-8B-Base
- KARMA
- AssistiveGym
- SenseJudge
- DharmaOCR
- Sparse Mixture-of-Experts
- Qwen3-4B-Instruct-2507
- Critique-Driven Reasoning Alignment
- Inclusion-of-Thoughts
- Hugging Face