PulseAugur

Video-LLMs suffer from directional motion blindness, researchers find

Researchers have identified a limitation in current Video Large Language Models (Video-LLMs) that they call "directional motion blindness": the models struggle to perceive and state the direction in which an object moves. Motion direction information is present in the models' internal states, but a "direction binding gap" keeps it from being correctly associated with the verbal output. To address this, the researchers built MoDirect, a dataset for tuning and evaluation, and DeltaDirect, an objective function that raises motion direction accuracy from near chance to over 85% on synthetic benchmarks and by 21.9 points on real-world data.

IMPACT Identifies a critical perceptual flaw in Video-LLMs that may limit their reliability on tasks requiring fine-grained temporal understanding.

RANK_REASON Academic paper detailing a new diagnostic method and proposed solution for a specific failure mode in Video-LLMs.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CV TIER_1 English(EN) · Jongseo Lee, Hyuntak Lee, Sunghun Kim, Sooa Kim, Jihoon Chung, Jinwoo Choi

    Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

    arXiv:2605.22823v1. Abstract: Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, …

  2. arXiv cs.CV TIER_1 English(EN) · Jinwoo Choi

    Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

    Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near…
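
To make "signed image-plane motion direction" and "near chance" concrete, the following is a minimal sketch of a four-way direction task of the kind the abstract describes. It is an illustration only and is not MoDirect, DeltaDirect, or any code from the paper. The clip format (a list of object positions), the helper names `make_clip`, `direction_from_clip`, and `accuracy`, and the sample size are all assumptions. It compares a trivial displacement-based oracle against a random guesser, whose accuracy sits near the 25% chance level for four classes.

```python
import random

# Hypothetical illustration: NOT MoDirect, DeltaDirect, or code from the paper.
# Each direction maps to a (dx, dy) step; image y grows downward, so "up" is dy = -1.
DIRECTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}


def make_clip(direction, frames=8, size=16, step=1):
    """Return the (x, y) position of a single object in each frame of a toy clip."""
    dx, dy = DIRECTIONS[direction]
    x, y = size // 2, size // 2  # start at the center of the frame
    return [(x + dx * step * t, y + dy * step * t) for t in range(frames)]


def direction_from_clip(clip):
    """Recover the signed direction from the net displacement between first and last frame."""
    (x0, y0), (x1, y1) = clip[0], clip[-1]
    dx, dy = x1 - x0, y1 - y0
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"


def accuracy(predict, labels):
    """Fraction of labels for which predict(clip) returns the true direction."""
    hits = sum(predict(make_clip(label)) == label for label in labels)
    return hits / len(labels)


rng = random.Random(0)  # fixed seed so the run is repeatable
labels = [rng.choice(list(DIRECTIONS)) for _ in range(1000)]

oracle_acc = accuracy(direction_from_clip, labels)
guess_acc = accuracy(lambda clip: rng.choice(list(DIRECTIONS)), labels)
chance_level = 1 / len(DIRECTIONS)

print(f"oracle accuracy: {oracle_acc:.2f}")
print(f"random-guess accuracy: {guess_acc:.2f} (chance level {chance_level:.2f})")
```

The point of the sketch is the baseline: with four equally likely directions, a model that ignores motion entirely is expected to score about 25%, which is the "near chance" regime the summary refers to.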