PulseAugur
research · [2 sources] ·

LLMs fail at planning and admitting ignorance, new papers show

Two new papers evaluate the metacognitive abilities of large language models, specifically planning and abstention. TRIAGE tests 20 frontier and open-source LLMs on deciding, before any execution feedback, which problems in a queue to attempt, in what order, and how many tokens to spend on each; most plan worse than random, and reasoning-trained models underperform standard ones. AbstentionBench (NeurIPS '25) tests 20 frontier models across 20 unanswerable-question datasets and finds that reasoning fine-tuning degrades abstention recall by about 24%: RLVR has no "abstain" action, so training provides no gradient toward "I don't know".

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Reveals significant limitations in current LLMs' planning and self-awareness, impacting agentic system development and reliability.

RANK_REASON Two academic papers present new benchmarks and findings on LLM capabilities.


COVERAGE [2]

  1. Mastodon — mastodon.social TIER_1 · [email protected] ·

    Given a problem queue and a token budget, can an LLM plan which to attempt, in what order, and how much to spend on each — before any execution feedback? TRIAGE tests 20 frontier and open-source LLMs. Most plan worse than random. Reasoning-trained modes systematically lose to sta…

  2. Mastodon — mastodon.social TIER_1 · [email protected] ·

    Do current LLMs know when to say "I don't know"? AbstentionBench (NeurIPS '25) tests 20 frontier models across 20 unanswerable-question datasets. Reasoning fine-tuning degrades abstention recall by ~24% — RLVR has no "abstain" action, so there's no gradient toward "I don't know."…
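
For readers unfamiliar with the metric, here is a minimal sketch of how abstention recall is commonly computed, assuming it means the fraction of unanswerable questions on which a model abstains. The function name and the toy labels are hypothetical and are not taken from AbstentionBench; the numbers only illustrate the arithmetic, not the paper's results.

```python
def abstention_recall(should_abstain, did_abstain):
    """Fraction of unanswerable questions on which the model abstained.

    Assumed definition: recall over the "should abstain" class.
    Both arguments are equal-length lists of booleans.
    """
    if len(should_abstain) != len(did_abstain):
        raise ValueError("label lists must have the same length")
    # Keep only the model's decisions on questions that are unanswerable.
    relevant = [did for should, did in zip(should_abstain, did_abstain) if should]
    if not relevant:
        raise ValueError("no unanswerable questions in the sample")
    return sum(relevant) / len(relevant)


# Hypothetical toy data: four unanswerable questions, two answerable ones.
should_abstain = [True, True, True, True, False, False]
base_model = [True, True, True, False, False, False]
reasoning_model = [True, True, False, False, False, True]

base = abstention_recall(should_abstain, base_model)
reasoning = abstention_recall(should_abstain, reasoning_model)
print(f"base={base:.2f} reasoning={reasoning:.2f}")
```

Note that abstaining on an answerable question (the last entry for the hypothetical reasoning model) does not affect recall; it would count against precision instead.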