Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [R]
Researchers have developed an adaptive video tokenization method that allocates tokens according to visual complexity. Working in the latent space of a frozen video tokenizer, it identifies and discards redundant spatial positions, so the amount of compression is driven by the content. A Latent Inpainting Transformer (LIT) then reconstructs the dropped positions, yielding an efficient inference pipeline with significant speedups over existing methods.
AI IMPACT Introduces a more efficient method for video tokenization, potentially improving compression and inference speed in video-processing AI.
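
To make the mask-then-inpaint idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the names `mask_redundant`, `drop_positions`, and `fill_from_past`, the L2 distance, and the fixed threshold are all hypothetical. In particular, `fill_from_past` only copies the last kept latent forward. It is a trivial stand-in for the learned Latent Inpainting Transformer, used here so that the example runs without a model.

```python
from typing import List, Optional

Vec = List[float]
Frame = List[Vec]  # one latent vector per spatial position


def l2(u: Vec, v: Vec) -> float:
    """Euclidean distance between two latent vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def mask_redundant(latents: List[Frame], threshold: float) -> List[List[bool]]:
    """Mark a position as kept only if it moved more than `threshold`
    away from the last latent kept at that position (assumed criterion)."""
    reference = [vec[:] for vec in latents[0]]
    keep = [[True] * len(latents[0])]  # the first frame is kept in full
    for frame in latents[1:]:
        row = []
        for p, vec in enumerate(frame):
            changed = l2(vec, reference[p]) > threshold
            if changed:
                reference[p] = vec[:]
            row.append(changed)
        keep.append(row)
    return keep


def drop_positions(latents: List[Frame], keep: List[List[bool]]) -> List[List[Optional[Vec]]]:
    """Replace dropped positions with None, leaving a sparse token grid."""
    return [[vec if k else None for vec, k in zip(frame, row)]
            for frame, row in zip(latents, keep)]


def fill_from_past(sparse: List[List[Optional[Vec]]]) -> List[Frame]:
    """Stand-in for the inpainting model: reuse the last kept latent."""
    last: List[Optional[Vec]] = [None] * len(sparse[0])
    filled = []
    for frame in sparse:
        out = []
        for p, vec in enumerate(frame):
            if vec is not None:
                last[p] = vec
            out.append(last[p])
        filled.append(out)
    return filled


# Toy clip: 3 frames, 2 spatial positions, 2 latent channels.
latents = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.1], [3.0, 1.0]],   # position 1 changes a lot, position 0 barely
    [[0.05, 0.1], [3.0, 1.1]],  # both positions are nearly static
]
keep = mask_redundant(latents, threshold=0.5)
kept_tokens = sum(k for row in keep for k in row)
filled = fill_from_past(drop_positions(latents, keep))
print(keep)         # [[True, True], [False, True], [False, False]]
print(kept_tokens)  # 3 of 6 positions survive
```

In this toy clip the static position costs one token across all three frames while the changing position costs two, which is the content-driven allocation the summary describes. A learned inpainting model would replace `fill_from_past` and could recover the small changes that simple copying discards.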