PulseAugur

Butter-Bench

PulseAugur coverage of Butter-Bench — every cluster mentioning Butter-Bench across labs, papers, and developer communities, ranked by signal.

Total · 30 days: 1
Within 90 days: 1
Releases · 30 days: 0
Within 90 days: 0
Papers · 30 days: 1
Within 90 days: 1
Tier distribution · 90 days
Recent · page 1 of 1 · 1 item in total
  1. TOOL · CL_17686 ·

    LLMs fail 'pass the butter' robot test, scoring far below human performance

    A new evaluation called Butter-Bench has revealed that current state-of-the-art large language models struggle significantly with controlling robots for practical tasks. In tests designed to assess their ability to perf…