ExpertQA benchmark reveals LLM citation unreliability

By PulseAugur Editorial · [1 source] · 2026-06-09 20:03

ExpertQA, a benchmark developed in 2024, evaluates large language models (LLMs) by having 484 experts pose questions within their specialized fields. The same experts then grade the LLM-generated answers claim by claim, assessing each claim for support and reliability. The benchmark found that even well-written answers often contain unsupported claims, and in the medical domain experts deemed approximately half of the cited sources unreliable.

IMPACT Highlights significant issues with LLM factual accuracy and citation reliability, impacting trust and deployment in critical domains.

RANK_REASON The cluster describes a new academic benchmark for evaluating LLMs.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Mastodon (fosstodon.org) · TIER_1 · English (EN) · 2026-06-09 20:03

When experts grade LLM answers in their own field, how well do the citations hold up? ExpertQA, a 2024 benchmark, has 484 experts write questions in their specialty, then judge the answers claim by claim. Even fluent answers leave many claims unsupported, and in medicine about ha…

LINKS benjaminhan.net/…/20260609-expertqa
