Researchers have developed a dual-branch gaze prediction framework for interpretable driver attention prediction in autonomous driving. The framework addresses limitations in existing datasets by constructing a new object-level driver attention dataset, G-W3DA, which uses a multimodal large language model and the Segment Anything Model 3 (SAM3) to decouple gaze into object-level masks. The proposed DualGaze-VLM architecture leverages this data to achieve intent-driven spatial anchoring, outperforming current state-of-the-art models on spatial alignment metrics and generating attention heatmaps that human evaluators judge authentic.
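The summary does not specify how gaze is "decoupled" into object-level masks. A minimal sketch of one plausible reading, assuming a segmenter such as SAM3 supplies binary object masks: aggregate a dense gaze heatmap over each mask to get per-object attention scores. The function name and all values here are illustrative, not from the paper.

```python
import numpy as np

def object_level_attention(heatmap, masks):
    """Aggregate a dense gaze heatmap into per-object attention scores.

    heatmap: (H, W) array of non-negative gaze density.
    masks:   list of (H, W) boolean object masks (e.g. from a segmenter).
    Returns one score per mask; gaze mass outside all masks is dropped.
    """
    total = heatmap.sum()
    if total == 0:
        return [0.0 for _ in masks]
    return [float(heatmap[m].sum() / total) for m in masks]

# Toy example: 4x4 heatmap with two disjoint single-pixel object masks.
heatmap = np.zeros((4, 4))
heatmap[0, 0] = 3.0   # most gaze mass falls on object A
heatmap[3, 3] = 1.0   # some falls on object B
mask_a = np.zeros((4, 4), dtype=bool); mask_a[0, 0] = True
mask_b = np.zeros((4, 4), dtype=bool); mask_b[3, 3] = True
scores = object_level_attention(heatmap, [mask_a, mask_b])
print(scores)  # → [0.75, 0.25]
```

Normalizing by the total heatmap mass makes the scores comparable across frames regardless of the absolute gaze density.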
IMPACT Enhances the interpretability of autonomous driving systems by providing more precise, object-level attention prediction.