Researchers have developed DeepGaze3.5-VL, a novel model that frames scanpath prediction as a discrete sequence modeling task using autoregressive token prediction. By mapping visual coordinates into a text vocabulary, the model leverages pretrained Vision-Language Models to capture diverse factors of variation, including personalized biases and task-specific objectives. This approach significantly improves predictive performance, achieving a new state-of-the-art on the MIT1003 dataset with a 46% improvement over its predecessor, DeepGaze III. The generative framework also offers a powerful tool for computational interventions and in-silico simulations of human visual attention. AI
IMPACT Establishes a new state-of-the-art in modeling human visual attention, with potential applications in interface design and cognitive state inference.
RANK_REASON The cluster describes a new research paper detailing a novel model and its performance on a benchmark dataset. [lever_c_demoted from research: ic=1 ai=1.0]
- alphaXiv
- arXiv
- CatalyzeX
- DagsHub
- DeepGaze3.5-VL
- DeepGaze III
- Gotit.pub
- Hugging Face
- MIT1003
- ScienceCast
- vision-language model
AI-generated summary · Google Gemini · from 2 sources. How we write summaries →