New research benchmarks and enhances VLM gaze understanding

By PulseAugur Editorial · [5 sources] · 2026-05-19 13:50

Two new papers evaluate and improve how vision models understand human gaze. One introduces EyeVLM, a framework for benchmarking vision-language models (VLMs) on gaze following and social gaze prediction, and finds that current models lack a precise understanding of gaze. The other proposes a training mechanism for vision foundation models (VFMs) that combines local LoRA with an out-of-cone penalty to strengthen gaze reasoning in gaze following, and reports state-of-the-art results.
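The summary does not say how the out-of-cone penalty is defined. As a rough illustration only, one plausible reading is a term that penalizes predicted gaze-target probability mass falling outside a cone around the person's gaze direction. The sketch below assumes exactly that; the function name `out_of_cone_penalty`, its parameters, and the 30-degree half-angle are hypothetical and are not taken from the paper.

```python
import numpy as np

def out_of_cone_penalty(heatmap, head_xy, gaze_dir, half_angle_deg=30.0):
    """Fraction of heatmap mass lying outside a 2D gaze cone.

    Hypothetical illustration, not the paper's formulation.
    heatmap:  (H, W) non-negative array of predicted gaze-target scores.
    head_xy:  (x, y) pixel position of the person's head (cone apex).
    gaze_dir: (dx, dy) gaze direction in image coordinates.
    """
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Vector from the head position to every pixel.
    dx = xs - head_xy[0]
    dy = ys - head_xy[1]
    dist = np.hypot(dx, dy)

    # Normalize the gaze direction to unit length.
    g = np.asarray(gaze_dir, dtype=float)
    g = g / np.linalg.norm(g)

    # Cosine of the angle between the gaze direction and each pixel vector.
    cos_angle = (dx * g[0] + dy * g[1]) / np.maximum(dist, 1e-8)
    inside = cos_angle >= np.cos(np.deg2rad(half_angle_deg))
    inside[dist == 0] = True  # the apex pixel itself counts as inside

    total = heatmap.sum()
    if total <= 0:
        return 0.0
    return float(heatmap[~inside].sum() / total)


# Usage: a person at the image center looking to the right.
pred = np.zeros((5, 5))
pred[2, 4] = 1.0  # mass in front of the person (inside the cone)
pred[2, 0] = 1.0  # mass behind the person (outside the cone)
penalty = out_of_cone_penalty(pred, head_xy=(2, 2), gaze_dir=(1, 0))
```

In a training setup, a term like this would typically be added to the main localization loss with a small weight, so that predictions behind or far to the side of the gaze direction are discouraged without overriding the scene-understanding signal.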

IMPACT New benchmarks and training techniques could lead to more sophisticated AI systems capable of understanding human attention and social cues.



AI-generated summary · Google Gemini · from 5 sources.

COVERAGE [5]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 15:21

Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. Ho…
arXiv cs.CV TIER_1 English(EN) · Hengfei Wang, Anshul Gupta, Pierre Vuillecard, Jean-Marc Odobez · 2026-05-25 04:00

Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

arXiv:2605.19859v2 (announce type: replace). Abstract: Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central t…
arXiv cs.CV TIER_1 English(EN) · Shijing Wang, Yaping Huang, Chaoqun Cui, David Wong, Yihua Cheng, Alexandros Neophytou, Hyung Jin Chang · 2026-05-22 04:00

Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

arXiv:2605.22607v1 (announce type: new). Abstract: Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler arc…
arXiv cs.CV TIER_1 English(EN) · Hyung Jin Chang · 2026-05-21 15:21

Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. Ho…
arXiv cs.CV TIER_1 English(EN) · Jean-Marc Odobez · 2026-05-19 13:50

Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires re…
