PulseAugur

New benchmark tests foundation models' active 3D navigation

Researchers have introduced TVRBench, a benchmark that tests whether foundation models can actively navigate a 3D environment until their view matches the viewpoint of a target image. Current models struggle with the task, particularly when it requires body translation or processing multi-turn visual history. A unified post-training framework, especially visual-action supervised fine-tuning, yielded substantial improvements, raising a 9B model's success rate to over 50%. The benchmark aims to advance the development of models that can perceive and act within 3D spaces.

IMPACT Establishes a new benchmark for evaluating and training embodied spatial intelligence in foundation models, highlighting current limitations and potential training avenues.

RANK_REASON The cluster contains a research paper introducing a new benchmark for evaluating foundation models.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. Hugging Face Daily Papers TIER_1 English(EN)

    Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

    Target Viewpoint Reproduction task challenges foundation models to actively adjust 3D viewpoints to match target images, revealing limitations in visual history processing and embodied movement mapping, with a unified post-training framework improving success rates through variou…

  2. arXiv cs.CV TIER_1 English(EN) · Liyang Li, Muzhi Zhu, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Hao Chen, Chunhua Shen

    Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

    arXiv:2606.01247v1. Abstract: Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We in…