A new evaluation called Butter-Bench has revealed that current state-of-the-art large language models struggle significantly with controlling robots for practical tasks. In tests designed to assess their ability to perform household chores like passing the butter, the best-performing LLM achieved only a 40% completion rate, far below the 95% success rate of humans. Models like Gemini 2.5 Pro and Claude Opus 4.1 showed limitations in spatial awareness and task execution, highlighting a gap between LLM reasoning capabilities and real-world robotic application.
AI IMPACT Current LLMs show significant limitations in real-world robotic control, indicating a need for further development in spatial reasoning and task execution for practical applications.
RANK_REASON The cluster describes a new benchmark and evaluation paper assessing LLM capabilities in robotics.
- Butter-Bench
- Claude Opus 4.1
- Figure AI
- Gemini 2.5 Pro
- Google DeepMind
- GPT-5
- Grok 4
- Llama 4 Maverick
- LLM
- Nvidia
- Gemini 1.5
AI-generated summary · Google Gemini · from 1 source.