PulseAugur

New On-Policy Replay method combats LLM forgetting

Researchers have developed On-Policy Replay (OPR), a method for reducing catastrophic forgetting in large language models during continual supervised fine-tuning (SFT). OPR filters historical prompts by a task reward and replays the surviving prompt-response pairs as standard SFT examples, with no auxiliary losses or distillation. In experiments on three 7-8B instruction-tuned models (Qwen2.5-7B-Instruct, Qwen3-8B, and Llama3.1-8B-Instruct), OPR substantially reduced forgetting on the TRACE benchmark compared with tuned Vanilla Replay baselines.
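The filter-then-replay idea described above can be sketched roughly as follows. This is an illustrative reading of the summary, not the paper's implementation: the names `SFTExample`, `build_replay_set`, `generate`, `task_reward`, and `min_reward` are all hypothetical, and the sketch assumes that responses are sampled from the current model (which the "on-policy" name suggests but the summary does not spell out) and that filtering uses a simple reward threshold.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class SFTExample:
    """One ordinary supervised fine-tuning pair (hypothetical container)."""
    prompt: str
    response: str


def build_replay_set(
    historical_prompts: Iterable[str],
    generate: Callable[[str], str],            # assumed: samples from the current model
    task_reward: Callable[[str, str], float],  # assumed: scores a (prompt, response) pair
    min_reward: float = 1.0,                   # assumed: simple threshold filter
) -> List[SFTExample]:
    """Keep only historical prompts whose sampled response passes the reward filter."""
    replay: List[SFTExample] = []
    for prompt in historical_prompts:
        response = generate(prompt)
        if task_reward(prompt, response) >= min_reward:
            replay.append(SFTExample(prompt, response))
    return replay


def mix_with_new_task(
    new_task_examples: Iterable[SFTExample],
    replay_examples: Iterable[SFTExample],
) -> List[SFTExample]:
    """Replayed pairs are plain SFT examples, so they are simply added to the new data."""
    return list(new_task_examples) + list(replay_examples)
```

Because the surviving pairs are ordinary prompt-response examples, the combined list can go to an unmodified SFT training loop, which is consistent with the summary's point that no auxiliary loss or distillation term is involved. The mixing ratio and the reward definition are left open here because the summary does not state them.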

IMPACT This research offers a novel approach to mitigate catastrophic forgetting in LLMs, potentially improving their adaptability to new tasks without sacrificing prior knowledge.

RANK_REASON This is a research paper detailing a new method for improving LLM training.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. arXiv cs.LG TIER_1 English (EN) · Yan Chen, Taojie Zhu, Meng Zhang, Xin Chen, Jiaqi Huang, Dongyang Xu, Yizhi Wang

    On-Policy Replay for Continual Supervised Fine-Tuning

    arXiv:2605.29495v1. Abstract: Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks, but it suffers from catastrophic forgetting of earlier capabilities. Recent work shows that on-…