New framework EyeVLM reveals VLM limitations in gaze understanding

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 1 source

Researchers have introduced EyeVLM, a new framework to evaluate how well vision-language models (VLMs) understand human gaze and attention. The framework assesses two key tasks: gaze following, which requires precise visual and spatial reasoning, and social gaze prediction, which relies more on semantic understanding of interactions. Initial results indicate that current VLMs struggle with accurate gaze understanding, even after fine-tuning, and still lag behind specialized visual models.

IMPACT Highlights current VLM shortcomings in understanding human attention, suggesting a need for improved multimodal reasoning beyond basic visual processing.

RANK_REASON The cluster describes a new academic paper introducing a framework and benchmark for evaluating specific capabilities of existing models.

COVERAGE [1]

arXiv cs.CV TIER_1 · Jean-Marc Odobez · 2026-05-19 13:50

Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires re…
