MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models
Researchers have introduced MET-Bench, a new benchmark designed to evaluate the capabilities of vision-language models in tracking entities across both text and image modalities. The study found a significant performance gap between text-only and multimodal entity tracking, attributing this primarily to visual reasoning deficits rather than perceptual issues. While explicit text-based reasoning strategies showed improvement, long-horizon multimodal tasks remain challenging. Applying reinforcement learning to open-source VLMs yielded gains within modalities but did not effectively transfer across them, indicating a need for enhanced multimodal representations and reasoning techniques.
AI IMPACT: Highlights critical gaps in multimodal reasoning for current vision-language models, suggesting areas for future research and development.