Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight
Researchers have developed a new method called On-Policy Critique Distillation (OPCD) to improve large language models using weak supervision. Instead of relying on weak models for direct labeling, OPCD uses them as critics to provide revision directions. This approach helps stronger models refine their outputs and learn more effectively, as demonstrated on reasoning and alignment benchmarks.
AI IMPACT: Introduces a novel approach to scalable oversight for LLMs, potentially improving their reasoning and alignment capabilities.