PulseAugur
PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

PREFINE uses preference-based fine-tuning to strengthen AI safety alignment

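Researchers have developed PREFINE, a method for adapting a pre-trained reinforcement learning (RL) policy to safety constraints without retraining it from scratch. The technique fine-tunes the policy toward safer behavior using trajectory-level preferences, similar to how Direct Preference Optimization (DPO) is applied to large language models (LLMs). PREFINE is reported to cut constraint violations and failures by more than 60% while preserving the original reward performance. Compared with conventional offline RL or imitation learning approaches, it is described as more data-efficient and compute-efficient.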
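To make the DPO analogy concrete, here is a minimal sketch of a DPO-style loss applied to a pair of whole trajectories instead of a pair of text completions. It is an illustration under stated assumptions, not the paper's actual objective: the function names, the `beta` value, and the choice to score a trajectory by summing per-step action log-probabilities are all hypothetical.

```python
import math


def trajectory_logprob(step_logprobs):
    """Sum per-step log pi(a_t | s_t) values into one trajectory log-probability."""
    return sum(step_logprobs)


def dpo_trajectory_loss(policy_pref, policy_rej, ref_pref, ref_rej, beta=0.1):
    """DPO-style loss for one (preferred, rejected) trajectory pair.

    Each argument is a list of per-step action log-probabilities:
    policy_* come from the policy being fine-tuned, ref_* from the frozen
    pre-trained policy. beta (assumed value) scales the implicit reward.
    """
    # Implicit reward of a trajectory: log-ratio of fine-tuned vs. reference policy.
    reward_pref = trajectory_logprob(policy_pref) - trajectory_logprob(ref_pref)
    reward_rej = trajectory_logprob(policy_rej) - trajectory_logprob(ref_rej)
    z = beta * (reward_pref - reward_rej)
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical usage: the "preferred" trajectory is the safer one of the pair.
safe_policy, unsafe_policy = [-0.5, -0.5], [-3.0, -3.0]
safe_ref, unsafe_ref = [-1.0, -1.0], [-1.0, -1.0]
loss = dpo_trajectory_loss(safe_policy, unsafe_policy, safe_ref, unsafe_ref)
```

When the fine-tuned policy matches the reference, the loss equals log 2. It falls below that as the policy shifts probability toward the preferred (safer) trajectory relative to the reference, which is what a preference-based fine-tuning step would push for.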

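Impact: Enables cost-aware behavioral adaptation of pre-trained models, strengthening AI safety while improving efficiency and reducing failures.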

Ranking rationale: This cluster contains an academic paper detailing a new method for AI safety alignment.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.AI TIER_1 English(EN) · Richa Verma, Bavish Kulur, Sanjay Chawla, Balaraman Ravindran

    PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

    arXiv:2605.21225v1. Abstract: We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be numerically encoded, we assume a more genera…

  2. arXiv cs.AI TIER_1 English(EN) · Balaraman Ravindran

    PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

    We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be numerically encoded, we assume a more general setting is when costs are provided as preference…