Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 11h

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

Researchers have developed Demo2Reward, a method that optimizes the language instructions given to Vision-Language Models (VLMs) used as reward models in reinforcement learning. It uses a small number of expert demonstrations to refine the prompt that defines the VLM's reward, aiming to reduce false positives without sacrificing true positives. Demo2Reward requires no additional training during policy learning, has shown superior performance across various simulated robotic tasks, and transfers to real-world robot learning.

IMPACT Improves reward model accuracy for reinforcement learning in robotics, potentially reducing the need for manual reward function engineering.

Reinforcement learning
Vision-Language Models
Robotics
Demo2Reward
Christian Gumbsch