PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

PREFINE uses preference-based fine-tuning to strengthen AI safety alignment

By the PulseAugur editorial team · [2 sources] · 2026-05-20 14:19

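Researchers have developed PREFINE, a method for adapting a pre-trained reinforcement learning (RL) policy to incorporate safety constraints without retraining it from scratch. The technique fine-tunes the policy toward safer behavior using trajectory-level preferences, analogous to how Direct Preference Optimization (DPO) is applied to large language models (LLMs). PREFINE is reported to reduce constraint violations and failures by more than 60% while maintaining the original reward performance. Compared with traditional offline RL or imitation learning methods, the approach is described as more data-efficient and more compute-efficient.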
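To make the DPO analogy concrete, the following is a minimal sketch of a standard DPO-style loss applied to a pair of whole trajectories rather than a pair of text completions. It is an illustration only: the summary does not give PREFINE's actual objective, so the function names, the `beta` value, and the choice to score a trajectory by summing per-step action log-probabilities are all assumptions made here.

```python
import math


def trajectory_logprob(step_logprobs):
    """Sum per-step action log-probabilities into one trajectory log-probability.

    Assumption for this sketch: a trajectory is scored by the sum of the
    log-probabilities the policy assigns to each action taken along it.
    """
    return sum(step_logprobs)


def dpo_trajectory_loss(pi_safe, pi_unsafe, ref_safe, ref_unsafe, beta=0.1):
    """Generic DPO-style loss for one (preferred, dispreferred) trajectory pair.

    pi_*  : trajectory log-probability under the policy being fine-tuned
    ref_* : trajectory log-probability under the frozen pre-trained policy
    beta  : hypothetical temperature; not a value taken from the paper
    """
    # Implicit preference margin: how much more the fine-tuned policy favors
    # the safe trajectory over the unsafe one, relative to the reference.
    margin = beta * ((pi_safe - ref_safe) - (pi_unsafe - ref_unsafe))
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Before fine-tuning the policy equals the reference, so the margin is zero.
baseline = dpo_trajectory_loss(-3.0, -3.0, -3.0, -3.0)
# After shifting probability mass toward the safe trajectory, the loss drops.
improved = dpo_trajectory_loss(-2.0, -4.0, -3.0, -3.0)
```

In this sketch the loss starts at log 2 when the fine-tuned policy matches the pre-trained one and decreases as the policy assigns relatively more probability to the preferred (safer) trajectory. No explicit numeric cost function appears anywhere, which is the sense in which a preference-based objective treats reward and cost implicitly.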

Impact: Enhances AI safety by enabling cost-aware behavioral adaptation in pre-trained models, improving efficiency and reducing failures.

Ranking rationale: This cluster contains an academic paper detailing a new method for AI safety alignment.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Richa Verma, Bavish Kulur, Sanjay Chawla, Balaraman Ravindran · 2026-05-22 04:00

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

arXiv:2605.21225v1 Announce Type: cross Abstract: We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be numerically encoded, we assume a more genera…

arXiv cs.AI TIER_1 English(EN) · Balaraman Ravindran · 2026-05-20 14:19

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be numerically encoded, we assume a more general setting is when costs are provided as preference…
