TAP-JEPA: Frozen Future-Latent Probing and Two-Stage Score Fusion for EPIC-KITCHENS-100 Action Anticipation
Researchers have developed TAP-JEPA, an action anticipation model that placed second in the EPIC-KITCHENS-100 challenge. The model builds on frozen V-JEPA 2.1 features: a ViT-G/384 encoder represents the observed video, and a latent predictor estimates the tokens of the future, not-yet-observed video. Attentive probes then fuse these predicted tokens with the observed context to predict the upcoming action as a verb, a noun, and a verb-noun pair. The submission reached a Mean Top-5 Recall of 27.91%, 0.04 percentage points behind first place.
AI IMPACT: This research advances action anticipation, with potential benefits for egocentric video analysis and human-computer interaction.
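For readers unfamiliar with the headline metric: in EPIC-KITCHENS-100 action anticipation, Mean Top-5 Recall is a class-mean measure. Top-5 recall is computed separately for each ground-truth class and then averaged over classes, so rare classes count as much as frequent ones. The following is a minimal sketch of that idea, not the challenge's official evaluation code. The function name `mean_topk_recall` and the input layout (one list of class scores per sample, plus integer labels) are assumptions made for illustration.

```python
from collections import defaultdict


def mean_topk_recall(scores, labels, k=5):
    """Class-mean top-k recall (illustrative helper, not the official scorer).

    scores: list of per-sample score lists, one score per class.
    labels: list of ground-truth class indices, same length as scores.
    """
    hits = defaultdict(int)    # per-class count of samples whose label is in the top k
    counts = defaultdict(int)  # per-class count of samples
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        counts[label] += 1
        if label in topk:
            hits[label] += 1
    if not counts:
        raise ValueError("at least one labelled sample is required")
    # Average the per-class recalls over the classes that occur in `labels`.
    return sum(hits[c] / counts[c] for c in counts) / len(counts)


# Toy data: 6 classes, 3 samples (made-up scores, for illustration only).
toy_scores = [
    [0.9, 0.1, 0.2, 0.3, 0.4, 0.5],  # label 0: inside the top 5
    [0.9, 0.1, 0.2, 0.3, 0.4, 0.5],  # label 1: lowest score, outside the top 5
    [0.1, 0.9, 0.2, 0.3, 0.4, 0.5],  # label 1: inside the top 5
]
toy_labels = [0, 1, 1]
toy_result = mean_topk_recall(toy_scores, toy_labels)
```

On the toy data, class 0 has recall 1/1 and class 1 has recall 1/2, so the class mean is 0.75. A plain per-sample top-5 accuracy on the same data would be 2/3, which shows how the two measures can diverge.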