New benchmark reveals MLLMs struggle with physical tool use

By PulseAugur Editorial · [1 source] · 2026-06-09 12:49

Researchers have introduced PhysTool-Bench, a benchmark for evaluating how well multimodal large language models (MLLMs) understand and use physical tools. The benchmark includes over 2,500 queries covering nearly 2,700 real-world tools across various industries. Initial tests on 13 leading MLLMs revealed significant limitations: the top-performing model correctly identified only 58.7% of tools and completed only 21.0% of tasks. The results highlight a critical gap in the models' ability to perceive and functionally reason about physical objects in embodied AI applications.

IMPACT Highlights critical limitations in MLLMs' physical world interaction, indicating a need for improved perception and functional commonsense for embodied AI.

RANK_REASON The cluster contains a research paper introducing a new benchmark for evaluating MLLMs.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Wenjie Li · 2026-06-09 12:49

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLL…

