PulseAugur
(CA) Blind deep-deployment evals for control & sabotage

AI safety evals could improve with new 'blind deep-deployment' method

A proposal for "blind deep-deployment" evaluations aims to improve AI safety by allowing external auditors to specify control and sabotage tests without direct access to internal AI lab systems. Auditors would provide detailed prompts and code harnesses, which AI labs would then implement using their own resources and internal checkpoints. This method seeks to enhance the realism of safety evaluations and provide actionable insights to AI labs, even if the labs do not share proprietary information.

Impact: This evaluation method could improve the rigor of AI safety testing, potentially leading to more robust AI systems.

Ranking rationale: The item proposes a novel methodology for AI safety evaluation, akin to a research paper.

Read on LessWrong (AI tag) →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

Sources [1]

  1. LessWrong (AI tag) TIER_1 (CA) · Dylan Bowman

    Blind deep-deployment evaluations for control and sabotage

    Thanks to Ezra Newman (https://www.lesswrong.com/users/ezra-newman) for initial ideation and various people at Apollo Research for feedback. This short personal piece does not necessarily reflect the…