PulseAugur

New research benchmarks and enhances VLM gaze understanding

Two new papers evaluate and improve how vision models reason about human gaze. One introduces EyeVLM, a framework for benchmarking vision-language models (VLMs) on gaze following and social gaze prediction, and finds that current models lack precise gaze understanding. The other proposes a training mechanism that combines local LoRA with an out-of-cone penalty to strengthen gaze reasoning in vision foundation models (VFMs) for gaze following, and reports state-of-the-art results.

IMPACT New benchmarks and training techniques could lead to more sophisticated AI systems capable of understanding human attention and social cues.

RANK_REASON The cluster contains two academic papers detailing new benchmarks and methods for evaluating and improving vision-language models' understanding of human gaze.


AI-generated summary · Google Gemini · from 5 sources.

COVERAGE [5]

  1. Hugging Face Daily Papers TIER_1 English(EN) ·

    Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

    Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. Ho…

  2. arXiv cs.CV TIER_1 English(EN) · Hengfei Wang, Anshul Gupta, Pierre Vuillecard, Jean-Marc Odobez ·

    Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

    arXiv:2605.19859v2 · Announce Type: replace · Abstract: Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central t…

  3. arXiv cs.CV TIER_1 English(EN) · Shijing Wang, Yaping Huang, Chaoqun Cui, David Wong, Yihua Cheng, Alexandros Neophytou, Hyung Jin Chang ·

    Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

    arXiv:2605.22607v1 · Announce Type: new · Abstract: Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler arc…

  4. arXiv cs.CV TIER_1 English(EN) · Hyung Jin Chang ·

    Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

    Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. Ho…

  5. arXiv cs.CV TIER_1 English(EN) · Jean-Marc Odobez ·

    Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

    Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires re…