A new study examined the verbal confidence of seven instruction-tuned, open-weight large language models (LLMs) with 3-9 billion parameters. The researchers found that the models failed to meet minimal validity criteria for expressing uncertainty: all seven were deemed invalid under numeric confidence elicitation. Attempts to improve confidence reporting with categorical elicitation disrupted task performance in most models, pushing accuracy below 5%. The study suggests that current verbal confidence elicitation methods are insufficient for capturing internal uncertainty signals in models of this size.
Impact: Highlights limitations in current LLM confidence reporting and suggests that improved methods are needed before downstream use.
Ranking rationale: Academic paper detailing experimental findings on LLM confidence elicitation.
AI-generated summary · Google Gemini · from 2 sources.