LLMs fail at planning and admitting ignorance, new papers show

By PulseAugur Editorial · [2 sources] · 2026-05-24 00:06

Two new papers evaluate the metacognitive abilities of large language models, specifically their capacity for planning and abstention. TRIAGE tests 20 frontier and open-source LLMs on planning which problems in a queue to attempt, in what order, and how much of a token budget to spend on each, all before any execution feedback. Most models plan worse than random, and reasoning-trained models underperform standard ones. AbstentionBench (NeurIPS '25) tests 20 frontier models across 20 unanswerable-question datasets and finds that current LLMs struggle to recognize unanswerable questions. Reasoning fine-tuning degrades abstention recall by about 24%: RLVR has no "abstain" action, so training provides no gradient toward "I don't know."
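To make the abstention finding concrete, here is a minimal sketch in Python. It assumes the standard definition of recall (the share of questions that warrant abstention on which the model actually abstained) and a simplified binary reward of 1 for a verified-correct answer and 0 otherwise. The function names and the reward scheme are illustrative assumptions, not taken from either paper.

```python
def abstention_recall(should_abstain, did_abstain):
    """Share of questions warranting abstention on which the model abstained.

    Both arguments are equal-length lists of booleans (hypothetical labels).
    """
    flagged = [did for should, did in zip(should_abstain, did_abstain) if should]
    if not flagged:
        raise ValueError("no questions require abstention")
    return sum(flagged) / len(flagged)


def expected_reward(p_correct, abstain, abstain_reward=0.0):
    """Expected reward under an assumed binary verifier.

    A verified-correct answer earns 1 and anything else earns 0.
    Abstaining earns abstain_reward, which is 0 when the reward
    scheme has no notion of an "abstain" action.
    """
    return abstain_reward if abstain else p_correct


# Three of four questions are unanswerable; the model abstains on two of them.
recall = abstention_recall([True, True, False, True], [True, False, False, True])

# Even a 5% chance of guessing right beats abstaining under this reward.
guess = expected_reward(0.05, abstain=False)
decline = expected_reward(0.05, abstain=True)
```

Under this assumed reward, guessing never scores lower than abstaining, which is one way to read the claim that training provides no gradient toward "I don't know."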

IMPACT Reveals significant limitations in current LLMs' planning and self-awareness, impacting agentic system development and reliability.

RANK_REASON Two academic papers present new benchmarks and findings on LLM capabilities.

paper
safety

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

Mastodon · mastodon.social · TIER_1 · English (EN) · 2026-05-24 00:06

Given a problem queue and a token budget, can an LLM plan which to attempt, in what order, and how much to spend on each — before any execution feedback? TRIAGE tests 20 frontier and open-source LLMs. Most plan worse than random. Reasoning-trained modes systematically lose to sta…

LINKS benjaminhan.net/…/20260523-triage-metacog…

Mastodon · mastodon.social · TIER_1 · English (EN) · 2026-05-24 00:06

Do current LLMs know when to say "I don't know"? AbstentionBench (NeurIPS '25) tests 20 frontier models across 20 unanswerable-question datasets. Reasoning fine-tuning degrades abstention recall by ~24% — RLVR has no "abstain" action, so there's no gradient toward "I don't know."…

LINKS benjaminhan.net/…/20260523-abstentionbenc…
