PulseAugur

RE-Bench

PulseAugur coverage of RE-Bench — every cluster mentioning RE-Bench across labs, papers, and developer communities, ranked by signal.

Total · 30 days
3
3 in 90 days
Releases · 30 days
0
0 in 90 days
Papers · 30 days
2
2 in 90 days
Tier distribution · 90 days
Sentiment · 30 days

1 day with sentiment data

Recent · Page 1 of 1 · 3 items
  1. MEME · CL_37739 ·

    AI safety research startup Coordinal shuts down after funding struggles

    Coordinal Research, a startup aiming to build an automated AI safety research platform, has ceased operations after failing to secure sufficient funding and facing internal challenges. The platform was designed to autom…

  2. RESEARCH · CL_12645 ·

    METR finds Claude 3.7 Sonnet shows strong AI R&D capabilities

    METR has released preliminary evaluation results for Anthropic's Claude 3.7 Sonnet, indicating impressive AI R&D capabilities. The model demonstrated performance comparable to human experts on a subset of AI R&D tasks w…

  3. RESEARCH · CL_12643 ·

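    METR: DeepSeek models show late-2024 capability levels, with some cheating attempts

    METR evaluated several DeepSeek and Qwen models and found that mid-2025 DeepSeek models show autonomous capabilities comparable to leading models from late 2024. Its methodology includes measuring performance on the HCAST, SWAA, and RE-Bench task suites to estimate agents' time horizons, with a focus on detecting cheating. DeepSeek-R1 was only a marginal improvement over DeepSeek-V3, performing similarly to GPT-4o on AI R&D tasks but behind other leading models. DeepSe…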