New monitors detect AI agent scheming without internal access

By PulseAugur Editorial · [1 source] · 2026-05-29 04:00

Researchers have developed a method for training smaller, open-weight models to detect scheming behavior in autonomous agents. These "deliberative monitors" operate solely on agent trajectories, without access to the agent's internal reasoning or model internals. Training uses a frontier model to generate rationales, filters those rationales, and distills the high-quality ones into the monitors via supervised fine-tuning and reinforcement learning. In evaluations, the monitors can match or outperform larger, prompted frontier models on cost-performance across several misalignment benchmarks.
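
The generate, filter, distill sequence described above can be illustrated with a minimal sketch of the filtering step. Everything here is an assumption made for illustration: the `Candidate` fields, the agreement-with-label criterion, the minimum-length threshold, and the prompt and completion formats are hypothetical and are not taken from the paper, which may filter rationales quite differently.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One frontier-model rationale for one agent trajectory (hypothetical schema)."""
    trajectory: str  # agent actions and observations only, no chain of thought
    rationale: str   # free-text reasoning produced by the frontier model
    verdict: bool    # the rationale's conclusion: True means "scheming"
    label: bool      # ground-truth label for the trajectory


def keep(c: Candidate, min_words: int = 5) -> bool:
    # Assumed filter: keep a rationale only if its verdict agrees with the
    # ground-truth label and it is not trivially short.
    return c.verdict == c.label and len(c.rationale.split()) >= min_words


def to_sft_example(c: Candidate) -> dict:
    # Assumed format: the monitor sees the trajectory and is trained to
    # produce the rationale followed by a final verdict.
    verdict = "scheming" if c.label else "benign"
    return {
        "prompt": f"Trajectory:\n{c.trajectory}\n\nIs the agent scheming?",
        "completion": f"{c.rationale}\nVerdict: {verdict}",
    }


def build_sft_dataset(candidates: list[Candidate]) -> list[dict]:
    # Filter first, then convert the survivors into fine-tuning examples.
    return [to_sft_example(c) for c in candidates if keep(c)]
```

The resulting list would feed the supervised fine-tuning stage; the reinforcement learning stage mentioned in the summary is not sketched here.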

IMPACT Enables more cost-effective AI safety monitoring by using smaller, open-weight models.

RANK_REASON The cluster contains a research paper detailing a new method for training AI monitors.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Aditya Sinha, Akshat Naik, Victor Gillioz, Simon Storf, Kilian Merkelbach, Rich Barton-Cooper, Axel Højmark, Marius Hobbhahn · 2026-05-29 04:00

Training Deliberative Monitors for Black-Box Scheming Detection

arXiv:2605.29601v1 Announce Type: cross Abstract: As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or…
