Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments
A new study published on arXiv compares the visual attention of large vision-language models (VLMs) with human gaze patterns in safety-critical environments. Researchers collected eye-tracking data from participants viewing risky scenes, then prompted models such as GPT-4o, Gemini Pro, Gemini Flash, and Claude to predict where humans would direct their attention. The findings indicate that the areas of interest identified by the VLMs broadly align with human visual focus, suggesting that such models could serve as scalable tools for approximating human attentional patterns without being explicitly trained on eye-tracking data.
AI IMPACT: Suggests that VLMs can approximate human attentional patterns, which could aid safety analysis and design.