Inference-time Policy Steering via Vision and Touch
Researchers have developed ViTaL, a framework for steering pre-trained generative robot policies at deployment time. It uses both visual and tactile information to refine candidate actions before execution, addressing the limitations of vision-only methods in contact-rich manipulation tasks. ViTaL formulates multimodal guidance as a bi-level optimization problem: visual sampling handles long-horizon mode selection, and tactile-guided diffusion editing handles short-horizon refinement. The framework incorporates a visuo-tactile latent world model and learned verifiers, including a text-conditioned tactile reward, to improve success rates in real-world manipulation tasks.
AI IMPACT: Enhances robot manipulation capabilities by integrating multimodal sensory feedback for improved action selection and refinement.
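
The two-level structure described in the summary can be pictured with a toy sketch. This is not ViTaL's implementation: every function name, both scoring functions, and all parameters below are hypothetical, random samples stand in for the pre-trained policy, and plain gradient steps stand in for tactile-guided diffusion editing. The outer level picks one of several sampled action chunks using a visual score; the inner level then nudges that chunk toward a higher tactile reward.

```python
import numpy as np

def sample_candidates(rng, k, horizon, dim):
    """Hypothetical stand-in for drawing k action chunks from a pre-trained policy."""
    return rng.normal(size=(k, horizon, dim))

def visual_score(chunk, goal):
    """Toy visual verifier: higher when the chunk's final action is near the goal."""
    return -float(np.sum((chunk[-1] - goal) ** 2))

def tactile_reward(chunk, target):
    """Toy tactile reward: higher when actions sit near a target contact value."""
    return -float(np.mean((chunk - target) ** 2))

def tactile_reward_grad(chunk, target):
    """Analytic gradient of tactile_reward with respect to the chunk."""
    return -2.0 * (chunk - target) / chunk.size

def steer(rng, goal, target, k=16, horizon=8, dim=2, steps=20, lr=0.5):
    # Outer level (assumed form): sample candidates, keep the best visual score.
    candidates = sample_candidates(rng, k, horizon, dim)
    scores = [visual_score(c, goal) for c in candidates]
    selected = candidates[int(np.argmax(scores))].copy()

    # Inner level (assumed form): gradient ascent on the tactile reward,
    # used here in place of diffusion editing.
    refined = selected.copy()
    for _ in range(steps):
        refined += lr * tactile_reward_grad(refined, target)
    return selected, refined

rng = np.random.default_rng(0)
selected, refined = steer(rng, goal=np.zeros(2), target=0.1)
print(tactile_reward(selected, 0.1), tactile_reward(refined, 0.1))
```

In this sketch the refinement ignores the visual score, so it can drift away from the selected mode; a real system would presumably keep the edit small or constrain it, which the summary does not detail.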