PulseAugur

LLMs fail 'pass the butter' robot test, scoring far below human performance

A new evaluation called Butter-Bench finds that current state-of-the-art large language models struggle to control robots on practical tasks. In tests built around household chores such as passing the butter, the best-performing LLM achieved only a 40% completion rate, far below the 95% success rate of humans. Models such as Gemini 2.5 Pro and Claude Opus 4.1 showed limitations in spatial awareness and task execution, highlighting a gap between LLM reasoning capabilities and real-world robotic application.

Impact: Current LLMs show significant limitations in real-world robotic control, indicating a need for further development in spatial reasoning and task execution for practical applications.

Ranking rationale: The cluster describes a new benchmark and evaluation paper assessing LLM capabilities in robotics.

Read on HN — AI startup stories →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

Sources [1]

  1. HN — AI startup stories TIER_1 English(EN) · lukaspetersson

    Our LLM-controlled office robot can't pass butter