Brief

last 24h

[15/15] 221 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

RESEARCH · arXiv cs.LG English(EN) · 3d · [2 sources]

B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

Researchers have developed a new framework called Group Relative Tool Optimization (GRTO) to improve referring segmentation in computer vision. The method integrates reinforcement learning with differentiable tool use, so that segmentation decoders are optimized alongside the main policy. A pre-training technique, Bootstrapped-GRTO (B-GRTO), further speeds up convergence and improves performance. Experiments show B-GRTO significantly outperforms existing methods on challenging segmentation benchmarks.

IMPACT Introduces a novel method for integrating reinforcement learning with differentiable tool use, potentially improving performance in complex vision-language segmentation tasks.
TOOL · Anyscale blog English(EN) · 3d

Introducing the Anyscale Agent Skill for LLM Post

Anyscale has introduced a new Anyscale Agent Skill designed to simplify and automate the process of generating LLM post-training runs. This skill assists users in selecting the most appropriate post-training method, such as SFT, CPT, DPO, or RLVR, based on their model, dataset, and objectives. It then generates configuration files for popular frameworks like LLaMA-Factory and Ray Train, preparing them for deployment on Anyscale Jobs.

IMPACT Simplifies the complex process of LLM post-training, potentially accelerating adoption of advanced alignment and optimization techniques.
- ChatGPT
- LLM
- RLHF
- InstructGPT
- RLVR
- DeepSeek-R1
- SFT
- DAPO
- Anyscale
- GRPO
- Ray Train
- LLaMA-Factory
- Anyscale Jobs
- Anyscale Agent Skills
TOOL · Mastodon — fosstodon.org English(EN) · 4d

A multi-agent LLM where each agent learns when to defer to a human, trained with GRPO on a cost-aware reward. Each defer event becomes SFT data, so the model gr

Researchers have developed a multi-agent large language model that learns to defer to human input. The model is trained using GRPO on a reward system that accounts for costs, and each instance of deferral is used as supervised fine-tuning data. This allows the model to gradually incorporate human expertise, with a tunable cost parameter enabling a trade-off between accuracy and the budget for human intervention during deployment.

IMPACT Introduces a novel training methodology for multi-agent LLMs, enabling adaptive collaboration with human experts.
- LLM
- GRPO
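
The cost-aware reward and the reuse of deferrals as SFT data can be sketched roughly as follows. This is a hypothetical illustration, not the authors' formulation: it assumes the human always answers correctly and that each deferral costs a fixed `defer_cost`; the names `cost_aware_reward` and `collect_sft_example` are invented here.

```python
def cost_aware_reward(correct: bool, deferred: bool, defer_cost: float) -> float:
    """Hypothetical reward: deferring earns full credit minus a fixed cost."""
    if deferred:
        # Assumption: the human expert answers correctly, but asking is not free.
        return 1.0 - defer_cost
    return 1.0 if correct else 0.0


def collect_sft_example(prompt: str, human_answer: str, sft_data: list) -> None:
    """Store a deferral as a supervised (prompt, target) pair for later SFT."""
    sft_data.append({"prompt": prompt, "target": human_answer})


sft_data = []
collect_sft_example("What is 17 * 23?", "391", sft_data)
print(cost_aware_reward(correct=False, deferred=True, defer_cost=0.25))
```

Under these assumptions, deferring pays off only when the agent's expected accuracy is below `1 - defer_cost`, which is how a single cost parameter can trade accuracy against the human-intervention budget.
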
TOOL · arXiv cs.AI English(EN) · 1w

AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

Researchers have developed a new method called Asymmetric Meta-Reflective Self-Distillation (AMR-SD) to improve the alignment of Large Language Models (LLMs) for complex reasoning tasks. Traditional methods struggle with assigning credit for rewards across all tokens in a sequence, leading to training issues. AMR-SD addresses this by using a reflection bottleneck to compress diagnostic signals into concise hints and critiques, which then guide precise token-level advantage modulations, ultimately enhancing training stability and performance on challenging benchmarks.

IMPACT Enhances LLM reasoning capabilities by addressing credit assignment bottlenecks, potentially leading to more reliable complex task performance.
TOOL · arXiv cs.AI English(EN) · 3d

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

Researchers have introduced GROW, a novel reinforcement learning framework designed to enhance the capabilities of vision-language model (VLM) agents in open-world tasks. Unlike previous methods that relied heavily on supervised fine-tuning, GROW adapts the Group Relative Policy Optimization (GRPO) algorithm by decomposing trajectories into state-action samples. This approach mitigates issues with long contexts and noise inherent in standard GRPO, enabling more effective multi-turn learning. Experiments on over 800 Minecraft tasks demonstrated that GROW achieves state-of-the-art performance, showcasing its potential for advancing VLM agents.

IMPACT Enhances VLM agent performance in open-world tasks by improving reinforcement learning efficiency.
- Minecraft
- GRPO
- VLM agents
- Xiongbin Wu
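
To make the idea of decomposing trajectories into state-action samples concrete, here is a minimal sketch. The group-relative standardization follows the usual GRPO recipe; the assumption that every step simply inherits its trajectory's advantage, and the names `group_relative_advantages` and `decompose_into_samples`, are illustrative and not taken from the GROW paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-6):
    """Standardize each rollout's return against its own group (GRPO-style)."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


def decompose_into_samples(trajectories, returns):
    """Flatten multi-turn trajectories into (state, action, advantage) samples.

    Assumption: every step inherits the advantage of the trajectory it came from.
    """
    advantages = group_relative_advantages(returns)
    samples = []
    for trajectory, advantage in zip(trajectories, advantages):
        for state, action in trajectory:
            samples.append((state, action, advantage))
    return samples


# Two toy rollouts of the same task: one succeeds (return 1.0), one fails (0.0).
trajectories = [
    [("see tree", "chop"), ("have log", "craft planks")],
    [("see tree", "walk away")],
]
samples = decompose_into_samples(trajectories, returns=[1.0, 0.0])
```

Each resulting sample carries only one state and one action, so a policy update never needs the full multi-turn context in a single sequence.
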
TOOL · arXiv cs.LG English(EN) · 3d

Holder Policy Optimisation

Researchers have introduced HölderPO, a novel framework for optimizing large language models by unifying token-level probability aggregation through the Hölder mean. This approach offers continuous control over the trade-off between gradient concentration and variance, addressing limitations of fixed aggregation mechanisms that can lead to training collapse or suboptimal performance. A dynamic annealing algorithm is employed to schedule the Hölder mean parameter across the training lifecycle, demonstrating superior stability and convergence. Extensive evaluations show HölderPO achieving state-of-the-art accuracy on mathematical benchmarks and a high success rate on ALFWorld.

IMPACT Introduces a new optimization framework that improves LLM stability and performance on mathematical and reasoning tasks.
- ALFWorld
- GRPO
- Yuxiang Chen
- HölderPO
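
For readers unfamiliar with the Hölder (power) mean, the sketch below shows how a single parameter `p` moves the aggregate of token probabilities between the harmonic, geometric, arithmetic, and quadratic means. The mean itself is standard; the `linear_anneal` schedule is a hypothetical stand-in, since the summary does not specify the paper's annealing rule.

```python
import math


def holder_mean(values, p):
    """Hölder (power) mean M_p of positive values; p = 0 is the geometric-mean limit."""
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def linear_anneal(step, total_steps, p_start, p_end):
    """Hypothetical schedule: interpolate p linearly over training."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return p_start + fraction * (p_end - p_start)


token_probs = [0.2, 0.8]
for p in (-1, 0, 1, 2):
    print(p, round(holder_mean(token_probs, p), 4))
```

Lower values of `p` pull the aggregate toward the smallest token probabilities and higher values toward the largest, which is the continuous knob the summary refers to.
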
RESEARCH · arXiv cs.AI English(EN) · 4d · [2 sources]

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Researchers have introduced Vector Policy Optimization (VPO), a novel reinforcement learning algorithm designed to enhance the diversity of language model outputs. Unlike traditional methods that optimize for a single scalar reward, VPO trains models to anticipate and generate solutions tailored to multiple, vector-valued reward functions. This approach aims to improve performance in complex search procedures by producing more varied responses, which is crucial for tasks like code generation and evolving search strategies.

IMPACT Enhances LLM adaptability in complex search tasks by optimizing for diverse reward functions.
RESEARCH · arXiv cs.LG English(EN) · 4d · [2 sources]

F-TIS: Harnessing Diverse Models in Collaborative GRPO

Researchers have introduced Filtered Truncated Importance Sampling (F-TIS), a new training paradigm designed for Reinforcement Learning from Human Feedback (RLHF) methods like GRPO. F-TIS addresses the challenge of training with heterogeneous models, where different models collaborate on the same task, which typically leads to off-policy samples that can hinder convergence. The proposed framework allows diverse models to work together efficiently, maintaining communication and achieving convergence comparable to on-policy training. In some scenarios, F-TIS even demonstrated improved generalization on out-of-distribution tasks, boosting performance by up to 12%.

IMPACT Enables more flexible and efficient collaborative training of diverse LLMs, potentially improving generalization.
- GRPO
- LLM
- arXiv
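
A rough sketch of what filtering plus truncated importance sampling could look like when a learner consumes samples generated by a different model. Truncating the importance ratio at a cap is the standard technique; the specific filter rule (dropping samples whose ratio falls below `min_ratio`), the default thresholds, and the function name are assumptions made for illustration, not details from the paper.

```python
import math


def filtered_tis_weights(logps_learner, logps_sampler, cap=2.0, min_ratio=0.1):
    """Per-sample weights for off-policy data from another model.

    ratio = pi_learner(y) / pi_sampler(y), computed from log-probabilities.
    Truncation: ratios above `cap` are clipped to `cap` to bound variance.
    Filter (hypothetical rule): samples the learner finds very unlikely
    (ratio < min_ratio) get weight 0 and are effectively dropped.
    """
    weights = []
    for lp_learner, lp_sampler in zip(logps_learner, logps_sampler):
        ratio = math.exp(lp_learner - lp_sampler)
        if ratio < min_ratio:
            weights.append(0.0)
        else:
            weights.append(min(ratio, cap))
    return weights


weights = filtered_tis_weights(
    logps_learner=[math.log(0.5), math.log(0.9), math.log(0.001)],
    logps_sampler=[math.log(0.5), math.log(0.09), math.log(0.5)],
)
```

In this toy call the first sample is effectively on-policy, the second is over-represented by the learner and gets capped, and the third is filtered out.
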
TOOL · arXiv cs.AI English(EN) · 6d

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Researchers have developed a new framework called POW3R to improve reinforcement learning with verifiable rewards (RLVR). This method addresses the issue where static rubric rewards in RLVR may not effectively guide training by adapting criterion weights based on their current usefulness to the policy. POW3R uses rollout-level contrast to highlight criteria that differentiate policy outputs, making the reward signal more informative without altering the evaluation target. Experiments show POW3R significantly improves both mean rubric reward and strict completion rates across various tasks and datasets, often reaching optimal performance in fewer training steps.

IMPACT Enhances reinforcement learning by making reward signals more informative, potentially accelerating model training and improving performance on complex tasks.
- RLVR
- GRPO
- POW3R
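
One way to picture rollout-level contrast is to up-weight rubric criteria on which the current policy's rollouts disagree and down-weight those every rollout already passes or fails. The sketch below uses per-criterion variance as the contrast measure; that choice, the fallback rule, and the name `policy_aware_weights` are assumptions for illustration and may differ from POW3R's actual weighting.

```python
from statistics import pvariance


def policy_aware_weights(scores, base_weights, eps=1e-8):
    """Reweight rubric criteria by how much they separate the current rollouts.

    scores[i][k] is the score of rollout i on criterion k.
    Contrast (assumed here): population variance of a criterion across rollouts.
    If no criterion separates the rollouts, fall back to the static weights.
    """
    n_criteria = len(base_weights)
    contrast = [pvariance([row[k] for row in scores]) for k in range(n_criteria)]
    raw = [w * c for w, c in zip(base_weights, contrast)]
    total = sum(raw)
    if total <= eps:
        base_total = sum(base_weights)
        return [w / base_total for w in base_weights]
    return [r / total for r in raw]


# Both rollouts satisfy criterion 0; only criterion 1 tells them apart.
training_weights = policy_aware_weights(
    scores=[[1.0, 1.0], [1.0, 0.0]], base_weights=[0.5, 0.5]
)
```

The adapted weights would shape only the training reward; evaluation would still use the static rubric, consistent with the summary's point that the evaluation target is unchanged.
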
TOOL · arXiv cs.CL English(EN) · 6d

Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

Researchers have explored using reinforcement learning to train smaller language models for zero-shot Text-to-SPARQL generation, a task crucial for knowledge graph question answering. They applied Group Relative Policy Optimization (GRPO) to the Qwen3-1.7B model, using execution feedback and answer-level rewards instead of gold query annotations. The GRPO-trained models showed significant improvement over a zero-shot baseline, demonstrating the viability of outcome-based reinforcement learning for this task when full supervision is unavailable.

IMPACT Demonstrates a viable method for training smaller models on complex tasks without extensive labeled data, potentially lowering barriers to knowledge graph querying.
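
An answer-level reward of the kind described can be sketched as a set-overlap score between the results of the generated query and the gold answers, with no need for a gold query. The F1 formulation, the zero reward for a failed execution, and the name `answer_f1_reward` are assumptions for illustration; the paper's exact reward may differ.

```python
def answer_f1_reward(predicted, gold):
    """F1 overlap between executed query results and gold answers.

    `predicted` is None when the generated SPARQL failed to execute
    (assumed here to earn zero reward).
    """
    if predicted is None:
        return 0.0
    predicted_set, gold_set = set(predicted), set(gold)
    if not predicted_set and not gold_set:
        return 1.0  # both empty: the query correctly returned nothing
    overlap = len(predicted_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# The query returned one correct and one spurious answer.
reward = answer_f1_reward(predicted=["paper_a", "paper_b"], gold=["paper_a"])
```
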
RESEARCH · arXiv cs.AI English(EN) · 5d · [2 sources]

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

Researchers have developed G2D, a three-stage pipeline that combines GRPO and DPO for more efficient offline preference optimization in language models. This method involves a brief GRPO warm-up, followed by constructing a static preference dataset and then fine-tuning with DPO. Experiments on Qwen2.5-7B and Llama-3.1-8B models demonstrated that G2D can match or exceed the performance of full online GRPO with significantly reduced computational cost, by focusing on the informativeness of the preference data rather than just the quantity.

IMPACT Offers a compute-efficient alternative to online RL for language model training by improving data informativeness.
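
The middle stage, turning rollouts from the warmed-up policy into a static preference dataset, might look like the sketch below. Pairing the highest-reward and lowest-reward response per prompt and skipping prompts whose rollouts all score the same is a hypothetical selection rule chosen to illustrate "informative" pairs; it is not confirmed as G2D's procedure, and `build_preference_pairs` is an invented name.

```python
def build_preference_pairs(rollouts):
    """Build DPO-style (prompt, chosen, rejected) records from scored rollouts.

    rollouts maps each prompt to a list of (response, reward) tuples.
    Prompts whose rollouts all received the same reward carry no preference
    signal and are skipped (assumed selection rule).
    """
    pairs = []
    for prompt, samples in rollouts.items():
        best = max(samples, key=lambda s: s[1])
        worst = min(samples, key=lambda s: s[1])
        if best[1] > worst[1]:
            pairs.append(
                {"prompt": prompt, "chosen": best[0], "rejected": worst[0]}
            )
    return pairs


rollouts = {
    "2 + 2 = ?": [("4", 1.0), ("5", 0.0), ("4.0", 1.0)],
    "Capital of France?": [("Paris", 1.0), ("Paris.", 1.0)],  # no contrast
}
pairs = build_preference_pairs(rollouts)
```
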
RESEARCH · arXiv cs.CL English(EN) · 6d · [2 sources]

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

Researchers have introduced LamPO (Lambda Style Policy Optimization) and LambdaPO, novel methods for enhancing reasoning in language models. These approaches move beyond traditional group-relative objectives by using pairwise decomposed advantages, which better capture subtle differences in response quality. Experiments on various benchmarks with models like Qwen3 and Phi-4-mini show improved performance and training stability compared to existing methods.

IMPACT Introduces new techniques for more stable and efficient training of reasoning language models.
RESEARCH · arXiv cs.CL English(EN) · 6d · [5 sources]

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Researchers have developed new self-distillation techniques for large language models to improve their performance without relying on external feedback. AVSD (Adaptive-View Self-Distillation) balances consensus signals across multiple privileged information views with view-specific residuals to enhance learning. Self-Policy Distillation (SPD) extracts a capability subspace from gradients to improve performance and generalizability, particularly in code generation and mathematical reasoning. CEPO (Contrastive Evidence Policy Optimization) sharpens credit assignment at decisive tokens by contrasting correct answers with incorrect ones, improving accuracy on multimodal mathematical reasoning benchmarks.

IMPACT These self-distillation techniques offer improved performance and generalizability for LLMs in complex reasoning tasks without external supervision.
RESEARCH · arXiv cs.CL English(EN) · 4d · [2 sources]

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

A new study on arXiv explores how different training data curricula impact the performance of reinforcement learning (RL) agents designed to work with large language models (LLMs) and external memory banks. The research found that the composition of training data significantly influences an agent's specialization rather than uniformly boosting performance. A mixed curriculum combining different benchmarks yielded the best overall results, while training on a narrow out-of-domain set specifically improved temporal reasoning skills.

IMPACT Demonstrates that curriculum design is a key factor in tailoring AI agent capabilities for specific tasks.
RESEARCH · Hugging Face Daily Papers English(EN) · 1w · [9 sources]

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Researchers have developed new benchmarks to evaluate the spatial reasoning capabilities of vision-language models (VLMs). ArchSIBench focuses on architectural space understanding, while Flat-Pack Bench assesses spatio-temporal reasoning in tasks like furniture assembly. SpaceDG addresses robustness by evaluating models under visual degradation, finding that current VLMs struggle with these challenges. Additionally, a framework called SAGE aims to improve spatial reasoning by enforcing geometric logic consistency.

IMPACT These benchmarks and methods aim to push the boundaries of VLM capabilities in understanding complex spatial relationships and real-world visual conditions.