AI安全研究面临被破坏风险，审计员未能发现漏洞

作者 PulseAugur 编辑部 · [2 个来源] · 2026-04-28 04:00

研究人员开发了一个名为Auditing Sabotage Bench的新基准，用于测试AI模型和人类检测机器学习研究代码库中细微破坏的能力。该基准包含九个机器学习代码库，其中包含故意设计的有缺陷的变体，旨在产生误导性结果。在测试中，即使是Gemini 3.1 Pro等先进模型也难以可靠地识别这些破坏，检测准确率仅为77%，修复成功率仅为42%。 AI

影响该基准突显了AI驱动研究的潜在风险，以及确保AI安全需要强大的审计工具。

排序理由该集群描述了在arXiv上发布的一个新的学术基准和论文。

在 arXiv cs.AI 阅读 →

AI 生成摘要 · Google Gemini · 来自 2 个来源。我们如何撰写摘要 →

报道来源 [2]

Alignment Forum TIER_1 English(EN) · egan · 2026-04-30 00:31

Research Sabotage in ML Codebases

One of the main hopes for AI safety is using AIs to <a href="https://joecarlsmith.com/2025/03/14/ai-for-ai-safety/" rel="noopener noreferrer nofollow" target="_blank">automate AI safety research</a>. However, if models are misaligned, then they …
arXiv cs.AI TIER_1 English(EN) · Eric Gan, Aryan Bhatt, Buck Shlegeris, Julian Stastny, Vivek Hebbar · 2026-04-28 04:00

Auditing Sabotage Bench: A Benchmark for Detecting and Fixing Research Sabotage in ML Codebases

arXiv:2604.16286v2 Announce Type: replace Abstract: As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce Auditing Sabotage Bench, a benchmark for…

报道来源 [2]

Research Sabotage in ML Codebases

Auditing Sabotage Bench: A Benchmark for Detecting and Fixing Research Sabotage in ML Codebases

相关实体

相关话题