Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
Researchers have identified a significant limitation in current Video Large Language Models (Video-LLMs), termed "directional motion blindness": the models struggle to perceive and state the direction in which objects move. Motion direction information is present in the models' internal states, but a "direction binding gap" prevents it from being correctly linked to their verbal outputs. To address this, the researchers developed MoDirect, a dataset for tuning and evaluation, and DeltaDirect, a novel objective function that raises motion direction accuracy from near chance to over 85% on synthetic benchmarks and improves it by 21.9 points on real-world data.
AI IMPACT: Identifies a critical perceptual flaw in Video-LLMs that may undermine their reliability on tasks requiring fine-grained temporal understanding.