PulseAugur

New SCOPE framework trains LLMs via self-play on open-ended tasks

Researchers have developed SCOPE, a data-free self-play framework that trains language models on open-ended tasks without external supervision. The framework co-evolves two policies: a Challenger that creates document-grounded tasks and a Solver that answers them. A frozen copy of the initial model acts as a self-judge, writing the rubrics and grading the Solver's responses against them. SCOPE is reported to improve performance on both targeted and held-out benchmarks for Qwen2.5, Qwen3, and OLMo-3 models, even surpassing models trained on curated prompts.
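The three roles described above can be sketched as a single self-play round. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Episode record, the keyword-matching grader, and the choice to show the Solver only the task are all hypothetical, and the policy-update step is left out because the summary does not describe it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    """One self-play round (hypothetical record, not from the paper)."""
    task: str
    rubric: List[str]
    answer: str
    score: float  # fraction of rubric items the answer satisfies, in [0, 1]


def self_play_round(
    document: str,
    challenger: Callable[[str], str],
    solver: Callable[[str], str],
    judge_rubric: Callable[[str, str], List[str]],
    judge_grade: Callable[[List[str], str], float],
) -> Episode:
    # Challenger writes a task grounded in the document.
    task = challenger(document)
    # Frozen self-judge writes a rubric for that task.
    rubric = judge_rubric(document, task)
    # Solver answers; assumption: it sees only the task, not the document.
    answer = solver(task)
    # Frozen self-judge grades the answer against its own rubric.
    score = judge_grade(rubric, answer)
    return Episode(task=task, rubric=rubric, answer=answer, score=score)


# Toy stand-ins for the model calls, so the sketch runs without any LLM.
def toy_challenger(document: str) -> str:
    return f"Summarise the key claim of: {document}"


def toy_solver(task: str) -> str:
    return "Self-play trains models without external supervision."


def toy_judge_rubric(document: str, task: str) -> List[str]:
    # Assumption: a rubric is a list of keywords the answer should mention.
    return ["self-play", "supervision", "benchmark"]


def toy_judge_grade(rubric: List[str], answer: str) -> float:
    hits = sum(1 for item in rubric if item.lower() in answer.lower())
    return hits / len(rubric)


episode = self_play_round(
    "Self-play can train language models without external supervision.",
    toy_challenger,
    toy_solver,
    toy_judge_rubric,
    toy_judge_grade,
)
print(episode.task, episode.score)
```

In a real system each callable would wrap a model call, and the resulting scores would feed whatever training signal the paper defines for the Challenger and the Solver.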

IMPACT This self-play framework could reduce reliance on curated datasets for training LLMs on complex, open-ended tasks.

RANK_REASON The cluster contains a research paper detailing a new framework for training language models.


AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.CL TIER_1 English(EN) · Wai-Chung Kwan, Aryo Pradipta Gema, Joshua Ong Jun Leang, Pasquale Minervini

    SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

    arXiv:2605.31433v1. Abstract: Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-f…

  2. arXiv cs.CL TIER_1 English(EN) · Pasquale Minervini

    SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

    Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks tha…

  3. Hugging Face Daily Papers TIER_1 English(EN)

    SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

    SCOPE is a self-play framework that trains language models on open-ended tasks through policy co-evolution, achieving superior performance on both targeted and held-out benchmarks without external supervision.