Multi-agent LLM learns to defer to humans using GRPO

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 1 source

Researchers have developed a multi-agent large language model in which each agent learns when to defer to a human. The agents are trained with GRPO on a cost-aware reward, and each deferral event is reused as supervised fine-tuning (SFT) data, so the model gradually absorbs the human's expertise. A tunable cost parameter trades accuracy against the budget for human calls at deployment.
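The source does not give the reward formula, so the following is only a minimal sketch of how a cost-aware reward could feed GRPO-style group-relative advantages. The `Rollout` class, the function names, the linear deferral penalty, and the `defer_cost` parameter are all hypothetical and chosen for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled agent response (hypothetical structure)."""
    deferred: bool  # did the agent hand the question to the human?
    correct: bool   # was the final answer (agent's or human's) correct?


def cost_aware_reward(rollout, defer_cost):
    """Assumed reward: 1 for a correct outcome, minus a penalty per human call."""
    reward = 1.0 if rollout.correct else 0.0
    if rollout.deferred:
        reward -= defer_cost
    return reward


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: score each rollout against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def collect_sft_examples(prompt, rollouts, human_answer):
    """Assumed logging step: each deferral yields one (prompt, human answer) pair."""
    return [(prompt, human_answer) for r in rollouts if r.deferred]


# A group of four rollouts sampled for the same prompt.
group = [
    Rollout(deferred=False, correct=True),
    Rollout(deferred=False, correct=False),
    Rollout(deferred=True, correct=True),
    Rollout(deferred=True, correct=True),
]
rewards = [cost_aware_reward(r, defer_cost=0.3) for r in group]
advantages = group_relative_advantages(rewards)
sft_data = collect_sft_examples("example prompt", group, "example human answer")
```

Under these assumptions, a correct answer obtained by deferring scores about 0.7, which sits between an unaided correct answer (1.0) and an unaided wrong one (0.0). Raising `defer_cost` makes deferral less attractive, which is one way a single cost parameter could trade accuracy against human-call budget.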


IMPACT Introduces a novel training methodology for multi-agent LLMs, enabling adaptive collaboration with human experts.

RANK_REASON The cluster describes a novel research paper detailing a new method for training multi-agent LLMs.


GRPO
LLM

COVERAGE [1]

Mastodon — fosstodon.org TIER_1 · 2026-05-21 07:01

A multi-agent LLM where each agent learns when to defer to a human, trained with GRPO on a cost-aware reward. Each defer event becomes SFT data, so the model gradually absorbs the human's expertise. Tunable cost knob trades accuracy against human-call budget at deployment, no ret…

LINKS benjaminhan.net/…/20260520-adaptive-colla…
