New Cross4D-JEPA method distills 2D models for 4D point cloud understanding

By PulseAugur Editorial · [2 sources] · 2026-07-01 06:49

Researchers have introduced Cross4D-JEPA, a self-supervised learning method for understanding dynamic 4D point clouds. It distills knowledge from 2D image or video foundation models, such as DINOv2 and V-JEPA 2, into a 4D point encoder. Using dense cross-modal correspondence, Cross4D-JEPA maps 3D points to teacher patch features and trains the student encoder to match those features, with no masking, negatives, or decoder required. On benchmarks such as MSR-Action3D and NTU RGB+D 60, the method outperforms intra-modal and global cross-modal baselines, highlighting the effectiveness of its fine-grained correspondence approach.
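To make the idea of dense correspondence concrete, here is a minimal, dependency-free sketch of how a 3D point might be paired with a teacher patch feature and scored against a student feature. It is not the paper's implementation: the pinhole projection, the intrinsics (`fx`, `fy`, `cx`, `cy`), the patch size, the cosine-distance objective, and the function names `point_to_patch` and `dense_distill_loss` are all assumptions chosen for illustration, and the sketch handles a single frame rather than a full sequence.

```python
import math

def point_to_patch(point, fx, fy, cx, cy, patch, grid_h, grid_w):
    """Return the flat index of the teacher patch a camera-frame 3D point lands in.

    Assumes a pinhole camera and z > 0 (hypothetical setup, not from the paper).
    """
    x, y, z = point
    u = fx * x / z + cx  # pixel column
    v = fy * y / z + cy  # pixel row
    # Clamp to the patch grid so points projecting off-image map to border patches.
    col = min(max(int(u // patch), 0), grid_w - 1)
    row = min(max(int(v // patch), 0), grid_h - 1)
    return row * grid_w + col

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def dense_distill_loss(student_feats, teacher_patches, patch_ids):
    """Mean cosine distance between each point's student feature and the
    teacher feature of its corresponding patch (assumed objective)."""
    losses = [
        1.0 - cosine(s, teacher_patches[i])
        for s, i in zip(student_feats, patch_ids)
    ]
    return sum(losses) / len(losses)

# Illustrative values only: a 56x56 image split into a 4x4 grid of 14-pixel patches.
points = [(0.0, 0.0, 1.0), (-0.2, -0.2, 1.0)]
patch_ids = [point_to_patch(p, 100.0, 100.0, 28.0, 28.0, 14, 4, 4) for p in points]
teacher_patches = [[float(i + 1), 1.0] for i in range(16)]  # stand-in teacher features
student_feats = [teacher_patches[i] for i in patch_ids]     # a student that matches exactly
loss = dense_distill_loss(student_feats, teacher_patches, patch_ids)
```

Note what the sketch leaves out: there is no masking step, no negative sampling, and no decoder, which matches the summary's description. The loss compares each point feature only with its own corresponding patch feature.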

IMPACT Enhances self-supervised learning for 4D point cloud analysis, potentially improving robotics and embodied perception.

RANK_REASON Academic paper introducing a new method and its evaluation.


AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

arXiv cs.AI TIER_1 English(EN) · Trung Thanh Nguyen, Hai Nguyen-Truong, Tu Vo, Hoang M. Truong, Tuan-Anh Vu · 2026-07-02 04:00

Cross4D-JEPA: Dense Cross-modal Correspondence Distillation for 4D Point Cloud Representation Learning

arXiv:2607.00514v1 Announce Type: cross Abstract: Automatic understanding of dynamic 4D point clouds, the 3D-point sequences captured over time by depth sensors and LiDAR, is central to robotics and embodied perception. Yet annotating them densely is expensive, making self-superv…
arXiv cs.AI TIER_1 English(EN) · Tuan-Anh Vu · 2026-07-01 06:49

Cross4D-JEPA: Dense Cross-modal Correspondence Distillation for 4D Point Cloud Representation Learning

Automatic understanding of dynamic 4D point clouds, the 3D-point sequences captured over time by depth sensors and LiDAR, is central to robotics and embodied perception. Yet annotating them densely is expensive, making self-supervised pretraining the natural route to transferable…

