New ST-Merge framework boosts VLM/VLA inference speed for robotics

By PulseAugur Editorial · [1 source] · 2026-06-30 04:00

Researchers have introduced ST-Merge, a framework for accelerating inference in the vision-language models (VLMs) and vision-language action models (VLAs) used in robotics. The method is plug-and-play and training-free: during the encoding phase it merges redundant visual tokens by constructing 3D spatiotemporal coordinates for them, then applying a parallel matching and weighted aggregation mechanism. A post-merge positional correction step preserves spatial accuracy. The authors report a 2x inference speedup on the Qwen2.5-VL model with minimal precision loss, and an 8.3x speedup on a VLA policy at high resolution.
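To make the idea of token merging concrete, here is a minimal sketch in plain Python. It is not the ST-Merge algorithm from the paper: it assumes a generic bipartite matching scheme (alternating tokens split into source and destination sets, matched by cosine similarity), a size-weighted average for aggregation, and, as a stand-in for positional correction, the same average applied to each token's (t, y, x) coordinate. The names `cosine` and `merge_tokens` and the parameter `r` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_tokens(feats, coords, r):
    """Merge up to r source tokens into their most similar destination token.

    feats:  list of feature vectors, one per visual token
    coords: list of (t, y, x) spatiotemporal positions, one per token
    Returns (merged_feats, merged_coords, sizes).
    Assumption: an illustrative scheme, not the paper's method.
    """
    n = len(feats)
    src, dst = list(range(0, n, 2)), list(range(1, n, 2))
    if r <= 0 or not dst:
        return ([list(f) for f in feats],
                [tuple(float(x) for x in c) for c in coords],
                [1] * n)

    # Matching: each source token picks its best destination token.
    # Written as loops for clarity; a real system would batch this.
    best = []
    for i in src:
        score, j = max((cosine(feats[i], feats[j]), j) for j in dst)
        best.append((score, i, j))
    best.sort(reverse=True)
    merged = {i: j for _, i, j in best[:r]}  # source index -> destination index

    # Weighted aggregation: keep running sums and group sizes.
    fsum = {k: list(feats[k]) for k in range(n) if k not in merged}
    csum = {k: list(coords[k]) for k in fsum}
    size = {k: 1 for k in fsum}
    for i, j in merged.items():
        fsum[j] = [a + b for a, b in zip(fsum[j], feats[i])]
        csum[j] = [a + b for a, b in zip(csum[j], coords[i])]
        size[j] += 1

    # Averaging the coordinates places the merged token at its group's centroid.
    out_f = [[a / size[k] for a in fsum[k]] for k in fsum]
    out_c = [tuple(a / size[k] for a in csum[k]) for k in fsum]
    return out_f, out_c, [size[k] for k in fsum]

feats = [[1, 0], [1, 0], [0, 1], [0, 1]]
coords = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
merged_feats, merged_coords, sizes = merge_tokens(feats, coords, r=2)
```

In this toy input, four tokens collapse into two, each carrying the mean feature and mean position of its group. Later layers then process half as many tokens, which is where the latency saving in any token-merging scheme comes from.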

IMPACT Accelerates real-time control for robotic applications by reducing latency in vision-language models.

RANK_REASON The cluster describes a new technical framework for improving the performance of existing model types, detailed in an arXiv paper.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Junzhou Chen, Jindong Wang, Gang Zhou · 2026-06-30 04:00

Fast Enough to Act: Spatio-Temporal Visual Token Merging for Low-Latency Robotic VLMs and VLAs

arXiv:2606.29350v1 Announce Type: cross Abstract: Vision-language models and vision-language action models endow the robot with unprecedented capabilities. However, the input of video and high-resolution images yields a massive number of visual tokens, leading to extremely high i…
