Researchers have introduced TVRBench, a benchmark that tests whether foundation models can actively navigate 3D environments to match the viewpoint of a target image. Current models struggle significantly with this task, particularly when it requires body translation or reasoning over multi-turn visual history. A unified post-training framework, particularly visual-action supervised fine-tuning, yielded substantial improvements, raising a 9B model's success rate to over 50%. The benchmark aims to advance the development of models that can perceive and act within 3D spaces.
IMPACT Establishes a new benchmark for evaluating and training embodied spatial intelligence in foundation models, highlighting current limitations and potential training avenues.
RANK_REASON The cluster contains a research paper introducing a new benchmark for evaluating foundation models.
AI-generated summary · Google Gemini · from 2 sources.