PulseAugur

New papers show LLMs fall short at planning and at admitting ignorance

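Two new papers evaluate the metacognitive abilities of large language models (LLMs), specifically planning and abstention. The TRIAGE paper finds that most frontier and open-source LLMs perform poorly when asked, without any execution feedback, to plan the order in which to attempt problems and how to allocate a token budget across them, and that reasoning-trained models do worse than standard ones. AbstentionBench shows that current LLMs struggle to recognize unanswerable questions and that reasoning fine-tuning impairs their ability to abstain, because the reinforcement learning methods involved provide no direct gradient toward "I don't know."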

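Impact: Exposes significant limitations in the planning and self-awareness of current LLMs, with consequences for the development and reliability of agentic systems.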

Ranking rationale: Two academic papers introduce new benchmarks and findings on LLM capabilities.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. Mastodon — mastodon.social TIER_1 English(EN)

    Given a problem queue and a token budget, can an LLM plan which to attempt, in what order, and how much to spend on each — before any execution feedback? TRIAGE tests 20 frontier and open-source LLMs. Most plan worse than random. Reasoning-trained modes systematically lose to sta…

  2. Mastodon — mastodon.social TIER_1 English(EN)

    Do current LLMs know when to say "I don't know"? AbstentionBench (NeurIPS '25) tests 20 frontier models across 20 unanswerable-question datasets. Reasoning fine-tuning degrades abstention recall by ~24% — RLVR has no "abstain" action, so there's no gradient toward "I don't know."…