LLM judges outperform traditional metrics in extractive QA evaluations

By PulseAugur Editorial · [1 source] · 2026-06-01 04:00

Researchers evaluated large language models (LLMs) as judges for extractive question-answering tasks. The study found that LLM-as-a-judge methods correlate much more strongly with human evaluations than the traditional metrics Exact Match and F1-score, reaching a correlation of up to 0.85 using open-source models. The LLM judges handled numerical answers well but struggled with more complex answer types such as job titles. No self-preference bias was observed, even when the same model both answered and judged. Prompt phrasing had minimal impact, and zero-shot, context-free judging proved most effective.
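To make the contrast with the traditional metrics concrete, here is a minimal sketch of Exact Match and token-level F1 as they are commonly defined for extractive QA (SQuAD-style normalization). It is an illustration under stated assumptions, not the evaluation code from the paper: the function names and the example strings are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical examples (not taken from the paper's data).
print(exact_match("CEO", "Chief Executive Officer"))  # no string match
print(token_f1("CEO", "Chief Executive Officer"))     # no shared tokens
print(token_f1("12 people", "12"))                    # partial token overlap
```

In this sketch, an abbreviation such as "CEO" scores zero on both metrics against "Chief Executive Officer", although a human reader would likely accept it. String-overlap metrics miss this kind of surface mismatch, and judging by meaning, whether by a human or an LLM, is meant to catch it.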

IMPACT This research offers a more reliable method for evaluating QA models, potentially improving future model development and benchmarking.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Xanh Ho, Jiahao Huang, Florian Boudin, Akiko Aizawa · 2026-06-01 04:00

Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses

arXiv:2504.11972v3. Abstract: Extractive QA tasks are commonly evaluated using Exact Match (EM) and F1-score, but these metrics often fail to reflect true model performance. Recent studies have proposed using large language models (LLMs) as judges (LLM-as-a-…
