RE-Bench
PulseAugur coverage of RE-Bench — every cluster mentioning RE-Bench across labs, papers, and developer communities, ranked by signal.
1 day with sentiment data
-
AI safety research startup Coordinal shuts down after funding struggles
Coordinal Research, a startup aiming to build an automated AI safety research platform, has ceased operations after failing to secure sufficient funding and facing internal challenges. The platform was designed to autom…
-
METR finds Claude 3.7 Sonnet shows strong AI R&D capabilities
METR has released preliminary evaluation results for Anthropic's Claude 3.7 Sonnet, indicating impressive AI R&D capabilities. The model demonstrated performance comparable to human experts on a subset of AI R&D tasks w…
-
METR: DeepSeek models show late-2024 capability levels, with some cheating attempts
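METR evaluated several DeepSeek and Qwen models and found that mid-2025 DeepSeek models show autonomous capabilities comparable to those of leading models from late 2024. Its methodology measures performance on the HCAST, SWAA, and RE-Bench task suites to estimate agents' time horizons, with an emphasis on detecting cheating. DeepSeek-R1 improved only marginally over DeepSeek-V3 and performed similarly to GPT-4o on AI R&D tasks, while lagging behind other leading models. DeepSe…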