You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences
Researchers have introduced Temporal Difference in Vision (TDV), a novel self-supervised learning paradigm for video that minimizes reliance on strong inductive biases. Unlike existing methods, which often depend on augmentations, masking, or cropping, TDV rests on the causal assumption that the past influences the future. The system jointly trains an image encoder and a motion encoder, and predicts the next frame's representation from the current frame and the encoded motion. Experiments indicate that TDV achieves state-of-the-art performance on dense spatial tasks without these traditional biases, suggesting a path toward representation learning with fewer assumptions.
IMPACT: This research could lead to more efficient and scalable visual representation learning by reducing reliance on data augmentation and other strong assumptions.
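The prediction setup described in the summary can be illustrated with a toy sketch. This is not the authors' implementation: the linear "encoders", the choice to feed both frames to the motion encoder, the mean-squared-error loss, and every name here (`tdv_loss`, `w_image`, `w_motion`, `w_pred`) are assumptions made only to show the data flow. It also ignores how a real system would avoid representation collapse.

```python
import random


def linear(weights, vec):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def init(rows, cols, rng):
    """Small random weight matrix; stands in for a real network (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def tdv_loss(frame_t, frame_next, w_image, w_motion, w_pred):
    """Toy next-frame representation loss for one pair of frames."""
    # Image encoder applied to the current frame.
    z_t = linear(w_image, frame_t)
    # Target: the same image encoder applied to the next frame.
    z_next = linear(w_image, frame_next)
    # Motion encoder; assumed here to see both frames concatenated.
    motion = linear(w_motion, frame_t + frame_next)
    # Predictor maps (current representation, encoded motion) to the next one.
    z_pred = linear(w_pred, z_t + motion)
    # Mean squared error between prediction and target (assumed loss).
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_next)) / len(z_next)


rng = random.Random(0)
frame_dim, latent_dim, motion_dim = 4, 3, 2

w_image = init(latent_dim, frame_dim, rng)
w_motion = init(motion_dim, 2 * frame_dim, rng)
w_pred = init(latent_dim, latent_dim + motion_dim, rng)

# Two consecutive "frames", flattened to short vectors for illustration.
frame_t = [rng.random() for _ in range(frame_dim)]
frame_next = [rng.random() for _ in range(frame_dim)]

loss = tdv_loss(frame_t, frame_next, w_image, w_motion, w_pred)
print(f"toy next-frame representation loss: {loss:.6f}")
```

In this sketch the only supervisory signal is the next frame itself, which reflects the summary's point that no augmentation, masking, or cropping is involved.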