Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Researchers have introduced Vector Policy Optimization (VPO), a reinforcement learning algorithm designed to increase the diversity of language model outputs. Unlike traditional methods that optimize a single scalar reward, VPO trains models to anticipate multiple vector-valued reward functions and to generate solutions tailored to them. The aim is to improve performance in complex search procedures by producing more varied responses, which matters for tasks such as code generation and evolving search strategies.
IMPACT Enhances LLM adaptability in complex search tasks by optimizing for diverse reward functions.
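
To make the scalar-versus-vector distinction concrete, here is a minimal toy sketch. It is not the VPO algorithm, whose details are not given in this summary. It assumes, purely for illustration, that each candidate response has a two-component reward vector and that a test-time search procedure keeps the best candidate under some weighting of those components. The names `scalarize`, `best_of_n`, `coverage`, and all the numbers are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def scalarize(reward: Vector, weights: Vector) -> float:
    """Collapse a reward vector into one number with a fixed weighting."""
    return sum(r * w for r, w in zip(reward, weights))


def best_of_n(rewards: List[Vector], weights: Vector) -> float:
    """Score of the best candidate under one weighting (a best-of-n search)."""
    return max(scalarize(r, weights) for r in rewards)


def coverage(rewards: List[Vector], weight_vectors: List[Vector]) -> float:
    """Average best-of-n score across several possible weightings."""
    total = sum(best_of_n(rewards, w) for w in weight_vectors)
    return total / len(weight_vectors)


# Hypothetical reward vectors for two candidate sets (made-up numbers).
homogeneous = [(0.6, 0.6), (0.6, 0.6)]  # two near-identical responses
diverse = [(1.0, 0.0), (0.0, 1.0)]      # each response excels on one objective

# A single averaged scalar reward prefers the homogeneous set...
average_weights = (0.5, 0.5)
scalar_homogeneous = best_of_n(homogeneous, average_weights)
scalar_diverse = best_of_n(diverse, average_weights)

# ...but if the search may later care about either objective alone,
# the diverse set gives it a better candidate to pick from.
possible_weights = [(1.0, 0.0), (0.0, 1.0)]
coverage_homogeneous = coverage(homogeneous, possible_weights)
coverage_diverse = coverage(diverse, possible_weights)

print(scalar_homogeneous, scalar_diverse)
print(coverage_homogeneous, coverage_diverse)
```

In this toy setting the averaged scalar reward ranks the homogeneous set higher, while the diverse set scores higher once the search is allowed to select under either objective. That is the intuition behind training for diversity to help test-time search, under the assumptions stated above.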