PROSE method uses vision-language models for egocentric scene registration

By PulseAugur Editorial · [2 sources] · 2026-06-15 11:11

Researchers have introduced PROSE, a training-free method for registering egocentric RGB sequences that needs no depth sensor. PROSE uses pre-trained vision-language models to build object-level 3D scene graphs and to match object instances across captures of the same space. On the Aria Digital Twin and Aria Everyday Activities benchmarks, it reportedly outperforms existing geometric and learned scene-graph methods.
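The summary gives no detail on PROSE's pipeline beyond object-level scene graphs and cross-capture instance matching, so what follows is only an illustrative sketch of that general idea, not the authors' method. It assumes each object node carries a descriptor vector and a 3D centroid, matches nodes by mutual nearest neighbour on cosine similarity, and then fits a rigid transform with the standard Kabsch algorithm. The function names, the `min_sim` threshold, and the toy data are all hypothetical.

```python
import numpy as np

def match_objects(emb_a, emb_b, min_sim=0.5):
    """Mutual-nearest-neighbour matching on cosine similarity (assumed scheme)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                      # sim[i, j]: object i in A vs object j in B
    best_b = sim.argmax(axis=1)        # best B candidate for each A object
    best_a = sim.argmax(axis=0)        # best A candidate for each B object
    return [(i, int(j)) for i, j in enumerate(best_b)
            if best_a[j] == i and sim[i, j] >= min_sim]

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Toy data (hypothetical): five objects seen in capture A ...
pts_a = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 1.5], [1, 1, 1]], dtype=float)
emb_a = np.eye(5, 8)                           # placeholder object descriptors

# ... and the same objects in capture B: rotated 30 degrees about z, shifted, reordered.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -1.0, 0.2])
perm = [2, 0, 4, 1, 3]
pts_b = (pts_a @ R_true.T + t_true)[perm]
emb_b = emb_a[perm]

matches = match_objects(emb_a, emb_b)
idx_a = [i for i, _ in matches]
idx_b = [j for _, j in matches]
R, t = kabsch(pts_a[idx_a], pts_b[idx_b])      # maps capture A into capture B's frame
```

In this toy setup the recovered `R` and `t` agree with `R_true` and `t_true` up to floating-point error. Real egocentric captures would need outlier handling, for example wrapping the fit in RANSAC, because object matches are rarely all correct.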

IMPACT This method could enable more robust spatial memory for robots and AR systems by improving egocentric scene registration.

RANK_REASON The cluster contains a research paper detailing a new method for scene registration using vision-language models.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.CV TIER_1 English(EN) · Zhiang Chen, Nahyuk Lee, Boyang Sun, Taein Kwon, Marc Pollefeys, Zuria Bauer, Sunghwan Hong · 2026-06-16 04:00

PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

arXiv:2606.16569v1. Abstract: Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. H…

arXiv cs.CV TIER_1 English(EN) · Sunghwan Hong · 2026-06-15 11:11

PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, p…
