CEPO self-distillation sharpens reasoning steps in language models

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 1 source

Researchers have introduced Contrastive Evidence Policy Optimization (CEPO), a self-distillation method for reinforcement learning with verifiable rewards (RLVR) in language models. CEPO aims to pinpoint the decisive reasoning steps in a model's output by contrasting how strongly the correct answer favors a token with how strongly a wrong answer disfavors it. The approach builds its teacher from rejected rollouts, and the authors report that it inherits safety guarantees while sharpening credit assignment. In the reported experiments, CEPO outperforms GRPO on multimodal mathematical reasoning benchmarks, with higher accuracy at both the 2B and 4B model scales.
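
One way to picture the contrast described above is a per-token log-probability gap between a teacher that has seen the correct answer and one that has seen a wrong answer. The sketch below is only an illustrative reading of the summary, not the paper's objective: the function name `contrastive_evidence`, the log-ratio form, and all probabilities are assumptions made up for the example.

```python
import math

def contrastive_evidence(logp_given_correct, logp_given_wrong):
    """Per-token gap between two teacher log-probabilities.

    Hypothetical scoring rule: a large positive gap means the token is
    favored when the teacher sees the correct answer and disfavored when
    it sees a wrong one. A gap near zero suggests filler.
    """
    return [c - w for c, w in zip(logp_given_correct, logp_given_wrong)]

# Toy rollout with invented probabilities (assumption, not data from the paper).
tokens = ["so", "x", "=", "4"]
logp_given_correct = [math.log(p) for p in (0.50, 0.40, 0.90, 0.80)]
logp_given_wrong = [math.log(p) for p in (0.50, 0.40, 0.90, 0.05)]

scores = contrastive_evidence(logp_given_correct, logp_given_wrong)

# Index of the token with the largest gap, i.e. the most "decisive" one here.
decisive = max(range(len(tokens)), key=scores.__getitem__)

for tok, s in zip(tokens, scores):
    print(f"{tok!r}: {s:.3f}")
print("most decisive token:", tokens[decisive])
```

In this toy case the filler tokens score zero because both teachers agree on them, while the final answer token stands out. How CEPO actually turns such evidence into a training signal is not stated in the summary.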


IMPACT Introduces a novel self-distillation technique that improves reasoning accuracy in language models, potentially leading to more reliable AI agents for complex tasks.

RANK_REASON The cluster contains a new academic paper detailing a novel method for language model training.


COVERAGE [1]

arXiv cs.CL TIER_1 · Salman Khan · 2026-05-19 06:46

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct…
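
The uniform-credit problem in the abstract can be made concrete with a small sketch. It assumes a GRPO-style setup in which each rollout gets one verifiable reward, the reward is normalized within its group, and the resulting advantage is copied to every token. The helper names, the toy rollouts, and the use of population standard deviation are assumptions for illustration only.

```python
from statistics import mean, pstdev

def group_normalized_advantages(rewards, eps=1e-6):
    """Normalize sequence-level rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantage, tokens):
    """Give every token in a rollout the same sequence-level advantage."""
    return [(tok, advantage) for tok in tokens]

# Two toy rollouts for the same prompt; only the last token differs.
rollouts = [
    ["So", "the", "answer", "is", "x", "=", "4"],  # verified correct
    ["So", "the", "answer", "is", "x", "=", "5"],  # verified wrong
]
rewards = [1.0, 0.0]  # one verifiable reward per rollout

advantages = group_normalized_advantages(rewards)
per_token = [broadcast_to_tokens(a, toks) for a, toks in zip(advantages, rollouts)]

for row in per_token:
    print(row)
```

In the correct rollout, the filler token "So" and the decisive token "4" receive exactly the same credit, which is the limitation the abstract describes.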
