Researchers have introduced TVRBench, a new benchmark designed to test foundation models' ability to actively adjust their viewpoint in 3D environments to match target images. Current models struggle significantly with this task, particularly with using multi-turn visual history and with translating visual discrepancies into embodied movement. Post-training techniques, especially visual-action supervised fine-tuning (SFT), have shown promise in improving performance, with one model reaching over 50% success.
AI IMPACT Establishes a new benchmark for evaluating and training embodied spatial intelligence in foundation models, potentially driving progress in robotics and interactive AI.
AI-generated summary · Google Gemini · from 2 sources.