Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 7h

Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

Researchers propose a training strategy for AI safety judges that aims to improve their consistency and reliability. Dynamic rubrics, generated from prompt-response-label triples, expose the judge to varied evaluation criteria. A curriculum first trains the judge on fixed rubrics and then progressively introduces the dynamic ones; the resulting 12B model achieves high accuracy and remains stable across different rubric formulations.
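The curriculum idea can be sketched as a sampling schedule: fixed rubrics only at first, then a growing share of dynamic rubrics. This is a minimal illustration, not the paper's method. The brief does not state the actual schedule, so the linear ramp, the step counts, the `max_prob` cap, and the field names `fixed_rubric` and `dynamic_rubric` are all assumptions made for the example.

```python
import random


def dynamic_rubric_prob(step, warmup_steps, ramp_steps, max_prob=0.5):
    """Probability of using a dynamic rubric at a training step.

    Assumed schedule (not from the source): zero during warmup on fixed
    rubrics, then a linear ramp up to max_prob.
    """
    if step < warmup_steps:
        return 0.0
    progress = min(1.0, (step - warmup_steps) / max(1, ramp_steps))
    return max_prob * progress


def pick_rubric(example, step, rng, warmup_steps=1000, ramp_steps=2000):
    """Choose which rubric to pair with an example at this step.

    `example` is a hypothetical dict holding both rubric variants.
    """
    p = dynamic_rubric_prob(step, warmup_steps, ramp_steps)
    if rng.random() < p:
        return example["dynamic_rubric"]
    return example["fixed_rubric"]


# Hypothetical training example with both rubric variants attached.
example = {
    "fixed_rubric": "Label the response unsafe if it violates the policy.",
    "dynamic_rubric": "Label the response unsafe if it gives actionable harm.",
}
rng = random.Random(0)
early_choice = pick_rubric(example, step=0, rng=rng)
```

During warmup the sampler always returns the fixed rubric; after the ramp it mixes the two at the capped rate, so the judge keeps seeing the fixed formulation while being exposed to varied ones.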

IMPACT Enhances the reliability of AI safety evaluations, potentially leading to more robust AI systems.

ShieldGemma
HarmBench
AI safety judges