ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs
Researchers have developed ST-SimDiff, a framework designed to make multimodal large language models (MLLMs) more efficient at processing long videos. The method reduces the computational burden by accounting for both static redundancy and dynamic changes within video content. ST-SimDiff builds a spatiotemporal graph to model associations between visual tokens, then applies a dual-selection strategy: it keeps representative tokens for static information and key turning points for dynamic content. Experiments indicate that this approach significantly outperforms existing methods while reducing computational costs.
IMPACT: Enhances efficiency for MLLMs processing video, potentially enabling broader applications with longer video inputs.
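
The dual-selection idea in the summary can be illustrated with a minimal sketch. This is not the ST-SimDiff algorithm itself: the function name `select_tokens`, the threshold `sim_threshold`, and the use of cosine similarity along temporal edges only are all assumptions made for illustration. It keeps the first token of each run of near-identical tokens (similarity-based selection) and the last token before each sharp change (difference-based selection).

```python
import numpy as np

def select_tokens(tokens, sim_threshold=0.9):
    """Hypothetical sketch of similarity + difference token selection.

    tokens: array of shape (T, N, D), i.e. T frames, N tokens per frame,
            D-dimensional features.
    Returns a boolean mask of shape (T, N) marking the tokens to keep.
    """
    T, N, _ = tokens.shape
    # Normalize so that dot products are cosine similarities.
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)

    # Temporal edges only (an assumption): similarity between the token at
    # the same spatial position in consecutive frames. Shape (T-1, N).
    temporal_sim = np.sum(unit[1:] * unit[:-1], axis=-1)
    changed = temporal_sim < sim_threshold

    keep = np.zeros((T, N), dtype=bool)
    # Similarity-based: keep one representative (the first token) per
    # run of near-identical tokens.
    keep[0] = True
    keep[1:] |= changed
    # Difference-based: also keep the token just before each sharp change,
    # so both sides of a turning point survive.
    keep[:-1] |= changed
    return keep
```

For a single token position that stays constant for three frames and then switches to a different constant value for three more, this sketch keeps the first frame, the frame just before the switch, and the frame just after it, and drops the rest.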