LLMs fail 'pass the butter' robot test, scoring far below human performance

By PulseAugur Editorial · [1 source] · 2025-10-28 14:13

A new evaluation called Butter-Bench finds that current state-of-the-art large language models struggle to control robots on practical tasks. In tests built around household chores such as passing the butter, the best-performing LLM completed only 40% of tasks, far below the 95% success rate of humans. Models such as Gemini 2.5 Pro and Claude Opus 4.1 showed weaknesses in spatial awareness and task execution, highlighting a gap between LLM reasoning capabilities and real-world robotic application.

IMPACT Current LLMs show significant limitations in real-world robotic control, indicating a need for further development in spatial reasoning and task execution for practical applications.

RANK_REASON The cluster describes a new benchmark and evaluation paper assessing LLM capabilities in robotics.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

HN — AI startup stories TIER_1 English(EN) · lukaspetersson · 2025-10-28 14:13

Our LLM-controlled office robot can't pass butter
