PulseAugur

ExpertQA Benchmark Reveals Unreliable LLM Citations

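ExpertQA, a benchmark developed in 2024, evaluates large language models (LLMs) by having 484 experts write questions in their own specialties. The same experts then grade the LLM-generated answers claim by claim, judging whether each claim is supported and whether its source is reliable. The benchmark shows that even well-written answers often contain unsupported claims, and that in medicine experts rated roughly half of the cited sources as unreliable.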

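Impact: Highlights significant problems with the factual accuracy and citation reliability of large language models (LLMs), which affect trust and deployment in critical domains.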

Ranking rationale: This cluster describes a new academic benchmark for evaluating large language models (LLMs).


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

  1. Mastodon (fosstodon.org) · TIER_1 · English (EN)

    When experts grade LLM answers in their own field, how well do the citations hold up? ExpertQA, a 2024 benchmark, has 484 experts write questions in their specialty, then judge the answers claim by claim. Even fluent answers leave many claims unsupported, and in medicine about ha…