PulseAugur / Brief

Brief

last 24h
[1/1] 223 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses

    Researchers evaluated large language models (LLMs) as judges for extractive question-answering tasks. The study found that LLM-as-a-judge methods correlate much more strongly with human evaluations than traditional metrics such as Exact Match and F1, reaching a correlation of up to 0.85 when open-source models act as judges. The LLM judges performed well on numerical answers but struggled with more complex answer types such as job titles. No self-preference bias was observed, even when the same model both answered and judged. Prompt phrasing had minimal impact, and zero-shot, context-free judging proved most effective.

    IMPACT This research offers a more reliable method for evaluating QA models, potentially improving future model development and benchmarking.
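
For readers unfamiliar with the traditional metrics named in the summary, here is a minimal sketch of Exact Match and token-level F1 as they are commonly computed for extractive QA. The normalization steps (lowercasing, dropping punctuation and English articles) and the function names are assumptions for illustration, not the study's own code.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical answer pair: the prediction is arguably correct,
    # but it carries one extra token compared with the reference.
    print(exact_match("President Barack Obama", "Barack Obama"))
    print(token_f1("President Barack Obama", "Barack Obama"))
```

Both metrics compare surface tokens only, so a correct answer phrased differently from the reference is penalized. That is one plausible reason such metrics can diverge from human judgment.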