New USS framework unifies spatial and semantic prompts for embodied visual tracking

By PulseAugur Editorial · [2 sources] · 2026-06-24 14:25

Researchers have introduced USS, a framework for Embodied Visual Tracking (EVT) that moves beyond text-only target indication to unified spatial-semantic prompting. It handles several prompt types (text, points, bounding boxes, and masks) within a single architecture. USS uses a latent world model to predict future representations, which improves temporal robustness. In real-world robot experiments, explicit spatial cues raised tracking success rates over text-only methods, especially in complex scenarios with distractors and in long-duration tasks.
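To make the idea of a single interface over several prompt types concrete, here is a minimal Python sketch. It is not the USS implementation, which the summary does not describe at this level. The class names (`TextPrompt`, `PointPrompt`, `BoxPrompt`, `MaskPrompt`) and the helper `spatial_extent` are hypothetical, and reducing each prompt to a bounding box is an assumption chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

# Hypothetical prompt types; these names do not come from the USS paper.
@dataclass(frozen=True)
class TextPrompt:
    text: str  # semantic description only, no location

@dataclass(frozen=True)
class PointPrompt:
    x: float
    y: float

@dataclass(frozen=True)
class BoxPrompt:
    x_min: float
    y_min: float
    x_max: float
    y_max: float

@dataclass(frozen=True)
class MaskPrompt:
    rows: Tuple[Tuple[int, ...], ...]  # binary mask, rows[y][x]

Prompt = Union[TextPrompt, PointPrompt, BoxPrompt, MaskPrompt]
Box = Tuple[float, float, float, float]

def spatial_extent(prompt: Prompt) -> Optional[Box]:
    """Reduce a prompt to (x_min, y_min, x_max, y_max), or None if it has no location.

    Illustrative assumption: a real system would more likely encode each
    prompt type with a learned encoder rather than collapse it to a box.
    """
    if isinstance(prompt, TextPrompt):
        return None  # text carries semantics but no spatial cue
    if isinstance(prompt, PointPrompt):
        return (prompt.x, prompt.y, prompt.x, prompt.y)  # degenerate box
    if isinstance(prompt, BoxPrompt):
        return (prompt.x_min, prompt.y_min, prompt.x_max, prompt.y_max)
    if isinstance(prompt, MaskPrompt):
        # Collect pixel coordinates of all non-zero mask entries.
        coords = [(x, y) for y, row in enumerate(prompt.rows)
                  for x, value in enumerate(row) if value]
        if not coords:
            return None  # empty mask gives no spatial cue
        xs = [x for x, _ in coords]
        ys = [y for _, y in coords]
        return (float(min(xs)), float(min(ys)), float(max(xs)), float(max(ys)))
    raise TypeError(f"unsupported prompt type: {type(prompt).__name__}")
```

Calling `spatial_extent` on any of the four prompt types returns either a box or `None`, so downstream tracking code can treat spatial and purely semantic prompts through one entry point.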

IMPACT This research could lead to more robust and precise embodied AI systems capable of complex navigation and object tracking in real-world environments.

RANK_REASON This is a research paper detailing a new framework for a computer vision task.


AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

arXiv cs.CV TIER_1 English(EN) · Yuchen Xie, Xinyu Zhou, Kuangji Zuo, Yanshuo Lu, Fengrui Huang, Boyu Ma, Jianfei Yang · 2026-06-25 04:00

USS: Unified Spatial-Semantic Prompts for Embodied Visual Tracking with Latent Dynamics Learning

arXiv:2606.25880v1 Announce Type: new Abstract: Embodied Visual Tracking (EVT) requires an agent to continuously follow a specified target while actively moving through dynamic environments. However, prevailing EVT paradigms predominantly rely on language-based target indication.…

arXiv cs.CV TIER_1 English(EN) · Jianfei Yang · 2026-06-24 14:25

USS: Unified Spatial-Semantic Prompts for Embodied Visual Tracking with Latent Dynamics Learning

Embodied Visual Tracking (EVT) requires an agent to continuously follow a specified target while actively moving through dynamic environments. However, prevailing EVT paradigms predominantly rely on language-based target indication. While language is expressive and convenient, cl…
