Researchers have evaluated the effectiveness of using large language models (LLMs) as judges for extractive question-answering tasks. Their study found that LLM-as-a-judge methods correlate much more strongly with human evaluations than traditional metrics like Exact Match and F1-score, achieving up to 0.85 correlation with open-source models. The LLM judges performed well on numerical answers but struggled with complex types like job titles, and notably, no self-preference bias was observed even when the same model answered and judged. Prompt phrasing had minimal impact, with zero-shot, context-free judging proving most effective.
AI IMPACT This research offers a more reliable method for evaluating QA models, potentially improving future model development and benchmarking.
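For readers unfamiliar with the traditional metrics named in the summary, the following is a minimal sketch of how Exact Match and token-level F1 are commonly computed for extractive QA, and how either score can be correlated with human judgments. Several things here are assumptions rather than details from the study: the SQuAD-style normalization, the choice of Pearson correlation (the summary does not say which coefficient was used), and the toy examples, which are invented purely for illustration and are not results from the paper.

```python
import math
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def pearson(xs, ys) -> float:
    """Pearson correlation; returns NaN if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (prediction, reference, human_label) triples, invented for
# illustration only. human_label is 1 if a person would accept the answer.
EXAMPLES = [
    ("42", "42", 1),
    ("Senior Software Engineer", "software engineer", 1),
    ("Paris, France", "Paris", 1),
    ("Berlin", "Paris", 0),
    ("the year 1999", "1998", 0),
]

if __name__ == "__main__":
    human = [label for _, _, label in EXAMPLES]
    em = [exact_match(p, r) for p, r, _ in EXAMPLES]
    f1 = [token_f1(p, r) for p, r, _ in EXAMPLES]
    print("EM vs human:", pearson(em, human))
    print("F1 vs human:", pearson(f1, human))
```

An LLM judge's verdicts could be scored against the same human labels by passing them to `pearson` in place of the EM or F1 list. The job-title example illustrates why string-overlap metrics can disagree with people: an answer a human would accept still fails Exact Match.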
RANK_REASON The cluster contains an academic paper detailing a new evaluation methodology for NLP tasks.
AI-generated summary · Google Gemini · from 1 source.