TOOL · arXiv cs.CL · 1w

Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses

Researchers evaluated how well large language models (LLMs) work as judges for extractive question-answering tasks. The study found that LLM-as-a-judge scores correlate much more strongly with human evaluations than the traditional Exact Match and F1-score metrics do, reaching a correlation of up to 0.85 using open-source models. The LLM judges performed well on numerical answers but struggled with more complex answer types such as job titles. No self-preference bias was observed, even when the same model both answered and judged. Prompt phrasing had minimal impact, and zero-shot, context-free judging proved most effective.
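
For context, the two traditional metrics named above can be sketched as follows. This is a minimal Python illustration that assumes the common SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the brief does not describe the paper's exact scoring script, and the function names here are illustrative only.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization, not necessarily the paper's.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With these definitions, a prediction of "Barack Obama" against a gold answer of "Obama" gets an Exact Match of 0 and a token F1 of about 0.67, even though a human would likely accept it. Surface-overlap penalties of this kind are what the LLM-as-a-judge approach is meant to avoid.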

IMPACT This research offers a more reliable method for evaluating QA models, potentially improving future model development and benchmarking.

LLM-as-a-judge
F1-score
Exact Match
Xanh Ho Thi