PulseAugur

New benchmark HumanMoveVQA reveals MLLM struggles with human motion understanding

Researchers have introduced HumanMoveVQA, a benchmark designed to assess how well Multimodal Large Language Models (MLLMs) understand complex human motion in videos. Current MLLMs struggle with global trajectory and orientation reasoning, often reducing intricate movements to coarse semantic labels. HumanMoveVQA addresses this with over 10,000 question-answer pairs focused on motion aggregation, sequential ordering, and trajectory inference, built using a world-consistent 3D motion tracking pipeline. Evaluations reveal a significant performance gap even in state-of-the-art proprietary models, though fine-tuning with the benchmark's supervision shows promise for improvement.

IMPACT This benchmark could drive the development of more sophisticated video understanding models capable of nuanced human motion analysis.



AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

  1. arXiv cs.CV TIER_1 English(EN) · Pulkit Gera, Faegheh Sardari, Asmar Nadeem, Valentina Bono, Padraig Boulton, Adrian Hilton, Armin Mustafa

    HumanMoveVQA: Can Video MLLMs reason about human movement in videos?

    arXiv:2606.27999v1. Abstract: Despite the rapid advance of Multimodal Large Language Models (MLLMs) in high-level video understanding, a fundamental bottleneck remains: these models collapse complex human motion into coarse semantic labels. Existing benchmarks m…

  2. arXiv cs.CV TIER_1 English(EN) · Armin Mustafa

    HumanMoveVQA: Can Video MLLMs reason about human movement in videos?

    Despite the rapid advance of Multimodal Large Language Models (MLLMs) in high-level video understanding, a fundamental bottleneck remains: these models collapse complex human motion into coarse semantic labels. Existing benchmarks mostly focus on scene-centric events or local joi…