Brief

last 24h

[5/5] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · Anyscale blog English(EN) · 3d

Introducing the Anyscale Agent Skill for LLM Post

Anyscale has introduced a new Anyscale Agent Skill designed to simplify and automate the process of generating LLM post-training runs. This skill assists users in selecting the most appropriate post-training method, such as SFT, CPT, DPO, or RLVR, based on their model, dataset, and objectives. It then generates configuration files for popular frameworks like LLaMA-Factory and Ray Train, preparing them for deployment on Anyscale Jobs. AI

IMPACT Simplifies the complex process of LLM post-training, potentially accelerating adoption of advanced alignment and optimization techniques.
- ChatGPT
- LLM
- RLHF
- InstructGPT
- RLVR
- DeepSeek-R1
- SFT
- DAPO
- Anyscale
- GRPO
- Ray Train
- LLaMA-Factory
- Anyscale Jobs
- Anyscale Agent Skills
TOOL · arXiv cs.LG English(EN) · 5d

PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR

Researchers have developed PlexRL, a cluster-level runtime designed to improve the efficiency of training large language models (LLMs) for reinforcement learning with verifiable rewards (RLVR). RLVR training is often inefficient due to idle time caused by long-tailed rollouts and tool-induced stalls. PlexRL addresses this by multiplexing LLM services across multiple RLVR jobs, filling idle periods by time-slicing model execution without costly migrations. Evaluations show PlexRL can reduce GPU hour costs by up to 37.58% while maintaining algorithmic flexibility and adding minimal overhead. AI

IMPACT Optimizes LLM training infrastructure, potentially lowering costs and increasing throughput for RLVR applications.
- LLM
- RLVR
- PlexRL
TOOL · arXiv cs.AI English(EN) · 6d

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Researchers have developed a new framework called POW3R to improve reinforcement learning with verifiable rewards (RLVR). This method addresses the issue where static rubric rewards in RLVR may not effectively guide training by adapting criterion weights based on their current usefulness to the policy. POW3R uses rollout-level contrast to highlight criteria that differentiate policy outputs, making the reward signal more informative without altering the evaluation target. Experiments show POW3R significantly improves both mean rubric reward and strict completion rates across various tasks and datasets, often reaching optimal performance in fewer training steps. AI

IMPACT Enhances reinforcement learning by making reward signals more informative, potentially accelerating model training and improving performance on complex tasks.
- RLVR
- GRPO
- POW3R
RESEARCH · arXiv cs.CL English(EN) · 6d · [5 sources]

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Researchers have developed new self-distillation techniques for large language models to improve their performance without relying on external feedback. AVSD (Adaptive-View Self-Distillation) balances consensus signals across multiple privileged information views with view-specific residuals to enhance learning. Self-Policy Distillation (SPD) extracts a capability subspace from gradients to improve performance and generalizability, particularly in code generation and mathematical reasoning. CEPO (Contrastive Evidence Policy Optimization) sharpens credit assignment at decisive tokens by contrasting correct answers with incorrect ones, improving accuracy on multimodal mathematical reasoning benchmarks. AI

IMPACT These self-distillation techniques offer improved performance and generalizability for LLMs in complex reasoning tasks without external supervision.
RESEARCH · arXiv cs.LG English(EN) · 42mo · [113 sources]

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Researchers are developing new methods to evaluate and enhance Large Language Models (LLMs). Apple's research proposes a benchmark to test LLMs' understanding of context, finding that quantized models and pre-trained dense models struggle with nuanced contextual features. Meanwhile, a new technique called Retrieval-Augmented Linguistic Calibration (RALC) improves how LLMs express confidence in their answers, enhancing faithfulness and calibration. Other research explores LLMs for clinical action extraction, demonstrating comparable performance to supervised models but highlighting limitations in clinical reasoning, and introduces Listwise Policy Optimization for more stable and diverse LLM training. AI

IMPACT New benchmarks and calibration techniques aim to improve LLM reliability and reasoning, potentially impacting their application in critical domains like healthcare and scientific discovery.

Brief

Introducing the Anyscale Agent Skill for LLM Post

PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex