OpenAI trains AI models to sustain beneficial behavior across domains

By the PulseAugur editorial team · [6 sources] · 2026-06-18 21:34

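OpenAI has released research on a new method for training AI models to keep beneficial traits across a wide range of situations and under adversarial pressure. The approach, called beneficial reinforcement learning (Beneficial RL), applies reinforcement learning to realistic conversations to instill qualities such as truthfulness, humility, and fairness. Early tests suggest that models trained this way show better alignment and safety across many domains, including domains not explicitly covered by the training data. They are also harder to steer toward harmful behavior with adversarial prompts. Against a compute-matched baseline, the trained model improved on 44 of 53 independent evaluations of alignment and benefits.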

Impact: This research could lead to more reliable and trustworthy AI systems that stay safe and beneficial in novel and challenging scenarios.

Why it ranks: an OpenAI research paper on a new AI training method.

AI-generated summary · Google Gemini · based on 6 sources.

Coverage sources [6]

X — OpenAI TIER_1 English(EN) · OpenAI · 2026-06-18 21:34

This is an early step toward more robustly beneficial and aligned models: training models to carry beneficial traits into new situations, so as AI becomes more capable, it also becomes more reliable, transparent, and helpful for people.

We also tested whether alignment persisted under pressure. The model was harder to steer toward harmful behavior with adversarial prompts, while remaining responsive to helpful instructions. We saw preliminary evidence of greater resistance to harmful fine-tuning. https://t.co…

The most interesting test was cross-domain transfer. When beneficial behavior training was limited to health conversations, the model still improved on non-health evaluations of misalignment, deception, and reward hacking—even though those tasks looked very different from the ht…

A small amount of this data produced broad gains beyond the training scenarios. Compared with a compute-matched baseline, the trained model improved on 44 of 53 independent evaluations of alignment and benefits, spanning deception, reward hacking, safety, health, and mental http…

We trained models with reinforcement learning on realistic conversations to reinforce beneficial traits like truthfulness, humility under uncertainty, openness to correction, fairness, and concern for human welfare, across 12 domains, including health, science, and education. htt…

As AI takes on longer, higher-stakes tasks, we want models to carry beneficial and safe behavior into new domains beyond their training—and maintain it under pressure. That’s the idea behind our new research on training models to be broadly and persistently beneficial.
