Researchers have developed new methods for improving the reasoning capabilities of large language models, particularly on long-context and multilingual tasks. One approach, OGLS-SD, uses outcome-guided logit steering to calibrate teacher-model responses during on-policy self-distillation, yielding more stable and effective reasoning. Another, dGRPO, combines on-policy optimization with distillation to strengthen long-context reasoning and introduces a new dataset, LongBlocks. A third, COPSD, targets low-resource languages by transferring reasoning behavior from high-resource languages via self-distillation, showing significant gains in multilingual mathematical reasoning.
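The idea behind outcome-guided logit steering can be illustrated with a minimal sketch. All names and the steering rule below are illustrative assumptions, not taken from the OGLS-SD paper: we assume the teacher's next-token logits are shifted toward a token from an outcome-verified trace, with strength scaled by an outcome reward, before the distillation target is computed.

```python
# Hypothetical sketch of outcome-guided logit steering for self-distillation.
# Names (steer_logits, alpha, reward) are illustrative assumptions, not the
# paper's actual API or formulation.

import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def steer_logits(teacher_logits, verified_token, reward, alpha=2.0):
    """Boost the logit of an outcome-verified token in proportion to reward.

    teacher_logits: one logit per vocabulary token
    verified_token: index of the token taken by a verified-correct trace
    reward:         scalar in [0, 1] from an outcome checker
    alpha:          steering strength (hyperparameter, assumed here)
    """
    steered = list(teacher_logits)
    steered[verified_token] += alpha * reward
    return steered

# The student would then be trained toward the steered teacher distribution:
teacher = [1.0, 0.5, -0.2]
target = softmax(steer_logits(teacher, verified_token=1, reward=1.0))
```

With reward 0 the target reduces to the plain teacher distribution, so steering only intervenes when the outcome checker confirms a correct trace.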
Summary written by gemini-2.5-flash-lite from 4 sources.
IMPACT These new techniques offer improved stability and effectiveness for LLM reasoning, particularly in challenging long-context and multilingual scenarios, potentially broadening their applicability.
RANK_REASON Multiple arXiv papers detailing new methods for improving LLM reasoning.