Entity: reward hacking

reward hacking

PulseAugur coverage of reward hacking — every cluster mentioning reward hacking across labs, papers, and developer communities, ranked by signal.

Total · 30 days

4

4 in 90 days

Releases · 30 days

0

0 in 90 days

Papers · 30 days

4

4 in 90 days


Sentiment · 30 days

3 days with sentiment data

Recent · Page 1 of 1 · 4 items

RESEARCH · CL_79580 · Jun 8 · 06:15

New framework unifies reward uncertainty in RLHF

Researchers have introduced a new framework to address reward hacking in Reinforcement Learning from Human Feedback (RLHF). The proposed method utilizes distributional reward models to quantify uncertainty, offering a u…

RESEARCH · CL_79881 · Jun 8 · 00:35

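Efficient Transformer encoder detects reward hacking

Researchers have developed a novel method that uses a small Transformer encoder to detect reward hacking in AI systems. The encoder maps trajectories into a space in which distances approximate differences in signal, and it identifies reward hacking with high accuracy. The method is significantly more cost-effective than using a large language model as a judge, and the results suggest that the encoder relies on more than natural language reasoning alone.

RESEARCH · CL_65748 · Jun 2 · 04:00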

New methods tackle reward hacking in AI training

Researchers are developing new methods to combat reward hacking in reinforcement learning from human feedback (RLHF) systems. Several papers introduce techniques to detect and mitigate scenarios where models exploit bia…

TOOL · CL_30564 · May 13 · 08:19

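New PG-OT framework improves text-to-image alignment and reduces reward hacking

Researchers have developed a new framework, Pareto Frontier-Guided Optimal Transport (PG-OT), to improve text-to-image generation models. It addresses the challenge of aligning a model across multiple, potentially conflicting reward signals and mitigates "reward hacking", in which a model's metric scores improve while perceptual quality declines. PG-OT constructs a prompt-specific Pareto frontier and uses optimal transport to steer dominated samples toward it. It outperforms existing methods and achieves high win rates in human evaluation.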