New benchmark reveals AI struggles with fine-grained hand-object interaction

By PulseAugur Editorial · [1 source] · 2026-06-16 04:00

Researchers have introduced HanDyVQA, a video question-answering benchmark designed to evaluate fine-grained understanding of hand-object interaction dynamics. The benchmark contains over 11,000 QA pairs across six question types, focusing on manipulation styles, motion, and part-level state changes. Even an advanced model such as Gemini 2.5 Pro reached only 73% average accuracy, well below the 97% achieved by humans, a gap that points to persistent weaknesses in reasoning about spatial relationships and geometry.
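To make the headline numbers concrete, the sketch below shows one way an average accuracy over question types could be computed, and the size of the reported model-to-human gap. It is a minimal illustration, not the paper's evaluation code: the function names, the question-type labels, and the toy records are all hypothetical, and the summary does not say whether the reported 73% is averaged per question type or per question.

```python
from collections import defaultdict

def accuracy_by_type(records):
    """Return accuracy per question type.

    records: iterable of (question_type, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for qtype, is_correct in records:
        totals[qtype] += 1
        correct[qtype] += int(is_correct)
    return {q: correct[q] / totals[q] for q in totals}

def macro_average(per_type):
    """Unweighted mean of the per-type accuracies (an assumed averaging scheme)."""
    return sum(per_type.values()) / len(per_type)

# Hypothetical toy records; the labels are illustrative, not the benchmark's own.
records = [
    ("motion", True),
    ("motion", False),
    ("state_change", True),
    ("state_change", True),
]
per_type = accuracy_by_type(records)
avg = macro_average(per_type)

# Gap between the reported human and model scores, in percentage points.
reported_model_pct = 73
reported_human_pct = 97
gap_points = reported_human_pct - reported_model_pct
```

With the reported figures, the gap works out to 24 percentage points.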

IMPACT Highlights limitations in current video foundation models for understanding complex human-object interactions, guiding future research.

RANK_REASON The cluster describes a new academic benchmark for evaluating AI models on a specific task, published on arXiv.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Masatoshi Tateno, Gido Kato, Hirokatsu Kataoka, Yoichi Sato, Takuma Yagi · 2026-06-16 04:00

HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

arXiv:2512.00885v2. Abstract: Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focused either on manipulation or on the resulti…
