PulseAugur
EN
LIVE 21:54:12

ExpertQA benchmark reveals LLM citation unreliability

A new benchmark called ExpertQA, developed in 2024, evaluates Large Language Models by having 484 experts pose questions within their specialized fields. These experts then meticulously grade the LLM-generated answers, assessing each claim for support and reliability. The benchmark revealed that even well-written answers often contain unsupported claims, and in the medical domain, approximately half of the cited sources were deemed unreliable by experts. AI

IMPACT Highlights significant issues with LLM factual accuracy and citation reliability, impacting trust and deployment in critical domains.

RANK_REASON The cluster describes a new academic benchmark for evaluating LLMs. [lever_c_demoted from research: ic=1 ai=1.0]

Read on Mastodon — fosstodon.org →

AI-generated summary · Google Gemini · from 1 sources. How we write summaries →

COVERAGE [1]

  1. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    When experts grade LLM answers in their own field, how well do the citations hold up? ExpertQA, a 2024 benchmark, has 484 experts write questions in their speci

    When experts grade LLM answers in their own field, how well do the citations hold up? ExpertQA, a 2024 benchmark, has 484 experts write questions in their specialty, then judge the answers claim by claim. Even fluent answers leave many claims unsupported, and in medicine about ha…