A new evaluation called Butter-Bench has found that current state-of-the-art large language models (LLMs) struggle significantly to control robots on practical tasks. In tests designed to assess their ability to perform household chores such as passing the butter, the best-performing LLM achieved only a 40% completion rate, far below the 95% success rate of humans. Models such as Gemini 2.5 Pro and Claude Opus 4.1 showed limitations in spatial awareness and task execution, highlighting a gap between LLM reasoning capabilities and real-world robotic application.
Impact: Current LLMs show significant limitations in real-world robotic control, indicating a need for further development in spatial reasoning and task execution for practical applications.
Ranking rationale: The cluster describes a new benchmark and evaluation paper assessing LLM capabilities in robotics.
Read on HN (AI startup stories)
- Butter-Bench
- Claude Opus 4.1
- Figure AI
- Gemini 2.5 Pro
- Google DeepMind
- GPT-5
- Grok 4
- Llama 4 Maverick
- LLM
- Nvidia
- Gemini 1.5
AI-generated summary · Google Gemini · from 1 source.