A new research paper investigates whether video foundation models understand intuitive physics. The study probes the frozen representations of models such as V-JEPA, VideoMAE, and LTX-Video on benchmarks including IntPhys2 and Minimal Video Pairs. V-JEPA performs best, particularly with temporal dynamics probes; VideoMAE is competitive; and LTX-Video shows a weaker but still present signal. The study also finds that physics knowledge is more accessible in the intermediate to late layers of these models.
IMPACT Finds evidence of intuitive physics understanding in the frozen representations of video models, which could improve their real-world interaction capabilities.
RANK_REASON Research paper analyzing model capabilities.
AI-generated summary · Google Gemini · from 2 sources.