A new benchmark called ExpertQA, developed in 2024, evaluates large language models (LLMs) by having 484 experts pose questions within their own fields. The same experts then grade the LLM-generated answers claim by claim, assessing whether each claim is supported and reliable. The benchmark revealed that even well-written answers often contain unsupported claims, and that in the medical domain experts judged approximately half of the cited sources unreliable.
IMPACT: Highlights significant problems with LLM factual accuracy and citation reliability, which undermine trust and deployment in critical domains.
AI-generated summary · Google Gemini · from 1 source.