HAT-4D framework reconstructs 3D object interactions from single videos

By PulseAugur Editorial · [2 sources] · 2026-06-26 16:05

Researchers have introduced HAT-4D, an agentic framework that reconstructs the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single monocular video. It combines Vision-Language Models (VLMs) with a human-in-the-loop feedback mechanism to address depth ambiguities and occlusions in multi-object scenes. HAT-4D is intended as a scalable data engine for Embodied AI and for training VLAs, and it has been used to create MVOIK-4D, a new benchmark for monocular 4D interaction reconstruction.

IMPACT Enables more efficient data collection for Embodied AI and VLA training by reconstructing complex object interactions from single videos.

RANK_REASON The cluster describes a new research paper detailing a novel framework and benchmark for 4D reconstruction from video.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.AI TIER_1 English(EN) · Jiaxin Li, Yuxiang Wu, Zhenkai Zhang, Xinrui Shi, Haoyuan Wang, Yichen Zhao, Su Linxiang, Chenyang Yu, Mingyu Zhang, Yifan Ding, Boran Wen, Li Zhang, Ruiyang Liu, Yong-Lu Li · 2026-06-29 04:00

HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration

arXiv:2606.28215v1 · Extracting dynamic 4D object interactions from massive, in-the-wild monocular videos offers a highly efficient data collection pathway for scaling Embodied AI and training VLAs. However, existing monocular 4D reconstruction method…

arXiv cs.AI TIER_1 English(EN) · Yong-Lu Li · 2026-06-26 16:05

HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration

Extracting dynamic 4D object interactions from massive, in-the-wild monocular videos offers a highly efficient data collection pathway for scaling Embodied AI and training VLAs. However, existing monocular 4D reconstruction methods primarily focus on isolated objects, often faili…
