Researchers have developed ST-Merge, a plug-and-play, training-free framework for accelerating inference in vision-language models (VLMs) and vision-language-action models (VLAs) used in robotics. During the encoding phase, it assigns visual tokens 3D spatiotemporal coordinates and fuses redundant tokens through a parallel matching and weighted aggregation mechanism. A post-merge positional correction mechanism then maintains spatial accuracy. Reported results include a 2x inference speedup on the Qwen2.5-VL model with minimal precision loss and an 8.3x speedup on a VLA policy at high resolution.
AI IMPACT Accelerates real-time control for robotic applications by reducing latency in vision-language models.
RANK_REASON The cluster describes a new technical framework for improving the performance of existing model types, detailed in an arXiv paper.
AI-generated summary · Google Gemini · from 1 source.
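The summary names three ingredients (3D spatiotemporal coordinates, parallel matching, weighted aggregation) without giving the algorithm itself. The following is a minimal sketch of how such a merge could look, not the ST-Merge implementation. It swaps in a generic bipartite matching scheme: tokens are split into sources and destinations, each source picks its best destination by cosine similarity minus a distance penalty in (t, y, x) space, and the top `r` matches are fused by size-weighted averaging. The names `make_coords`, `merge_tokens`, and `dist_weight` are hypothetical, and averaging the merged coordinates is only a simple stand-in for the positional correction step.

```python
import math
import random

def make_coords(frames, height, width):
    """Return one (t, y, x) coordinate per patch token, in raster order."""
    return [(float(t), float(y), float(x))
            for t in range(frames) for y in range(height) for x in range(width)]

def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def merge_tokens(feats, coords, r, dist_weight=0.1):
    """Merge up to r tokens; returns (feats, coords, sizes) of the survivors.

    Assumption: the even/odd split and the distance-penalized score are
    illustrative choices, not taken from the ST-Merge paper.
    """
    n = len(feats)
    src_ids = list(range(0, n, 2))   # candidates to be merged away
    dst_ids = list(range(1, n, 2))   # tokens that absorb them
    r = max(0, min(r, len(src_ids)))
    if r == 0 or not dst_ids:
        return [list(f) for f in feats], [tuple(c) for c in coords], [1.0] * n

    def score(s, d):
        # Feature similarity, penalized by spatiotemporal distance.
        return cosine(feats[s], feats[d]) - dist_weight * math.dist(coords[s], coords[d])

    # Matching: each source picks its best destination independently of the
    # others, so this loop could run in parallel on an accelerator.
    matches = []
    for s in src_ids:
        d = max(dst_ids, key=lambda j: score(s, j))
        matches.append((score(s, d), s, d))
    matches.sort(reverse=True)

    # Weighted aggregation: accumulate sums, then divide by merged-group size.
    feat_sum = [list(f) for f in feats]
    coord_sum = [list(c) for c in coords]
    sizes = [1.0] * n
    removed = set()
    for _, s, d in matches[:r]:
        feat_sum[d] = [a + b for a, b in zip(feat_sum[d], feats[s])]
        coord_sum[d] = [a + b for a, b in zip(coord_sum[d], coords[s])]
        sizes[d] += 1.0
        removed.add(s)

    keep = [i for i in range(n) if i not in removed]
    out_feats = [[v / sizes[i] for v in feat_sum[i]] for i in keep]
    out_coords = [tuple(v / sizes[i] for v in coord_sum[i]) for i in keep]
    return out_feats, out_coords, [sizes[i] for i in keep]

# Usage: 2 frames of 4x4 patches with random 8-dimensional features.
random.seed(0)
coords = make_coords(2, 4, 4)
feats = [[random.gauss(0, 1) for _ in range(8)] for _ in coords]
out_feats, out_coords, out_sizes = merge_tokens(feats, coords, r=10)
print(len(feats), "->", len(out_feats))
```

Tracking a size per surviving token keeps the average correctly weighted if the merge is applied repeatedly across encoder layers, and the averaged coordinates give each fused token a position between the patches it replaced.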