Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts
A new research paper introduces the Hypentropy Policy Gradient (HPG) algorithm for optimizing embedding model routing in recommendation systems. The paper formalizes this problem as an adversarial contextual linear bandit with low-rank experts, addressing challenges like adversarial queries and limited model observability. HPG is designed to adapt to unknown low-rank structures, achieving a policy regret of \tilde{\mathcal O}(s\sqrt{MT}) and offering an efficient, parameter-free implementation. AI