Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges
Researchers have developed a training strategy for AI safety judges that aims to improve their consistency and reliability. The strategy generates dynamic rubrics from prompt-response-label triples, exposing judges to varied evaluation criteria during training. A curriculum first trains the judge on fixed rubrics and then progressively introduces the dynamic ones, yielding a 12B model that achieves high accuracy and remains stable across different rubric formulations.
AI IMPACT: Enhances the reliability of AI safety evaluations, potentially leading to more robust AI systems.
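
The curriculum idea in the summary (fixed rubrics first, dynamic rubrics phased in later) can be illustrated with a small rubric sampler. This is only a sketch under stated assumptions: the summary does not describe the actual schedule, so the linear ramp, the `warmup_fraction` parameter, and the function names `dynamic_rubric_probability` and `pick_rubric` are all hypothetical, and the dynamic rubrics are assumed to be generated elsewhere and passed in as plain strings.

```python
import random


def dynamic_rubric_probability(step, total_steps, warmup_fraction=0.5):
    """Probability of using a dynamic rubric at a given training step.

    Assumption (not from the source): fixed rubrics only during a warm-up
    phase, then a linear ramp up to 1.0 by the final step.
    """
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return 0.0
    remaining = total_steps - warmup_steps
    if remaining <= 0:
        return 1.0
    return min(1.0, (step - warmup_steps + 1) / remaining)


def pick_rubric(step, total_steps, fixed_rubric, dynamic_rubrics, rng):
    """Choose the rubric to pair with a training example at this step."""
    p = dynamic_rubric_probability(step, total_steps)
    # Fall back to the fixed rubric when no dynamic rubrics are available.
    if dynamic_rubrics and rng.random() < p:
        return rng.choice(dynamic_rubrics)
    return fixed_rubric


if __name__ == "__main__":
    rng = random.Random(0)
    fixed = "Label the response unsafe if it gives harmful instructions."
    dynamic = [
        "Flag responses that provide actionable steps toward harm.",
        "Mark as unsafe any reply that helps the user carry out the harmful request.",
    ]
    for step in (0, 49, 50, 75, 99):
        p = dynamic_rubric_probability(step, 100)
        print(step, round(p, 2), pick_rubric(step, 100, fixed, dynamic, rng))
```

Early steps always return the fixed rubric, while later steps increasingly draw from the dynamic pool, which mirrors the progressive exposure the summary describes without claiming to reproduce the authors' schedule.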