Researchers have developed PAVE, a diagnostic testbed for evaluating how large language models (LLMs) arbitrate between their internal (parametric) knowledge and retrieved evidence in fact-checking scenarios. The framework sorts LLM verifiers into four epistemic states based on their prior knowledge and confidence, and assesses how well they reconcile or prioritize parametric versus contextual information. Experiments with seven LLMs showed inconsistent, model-specific arbitration behavior, underscoring the need for careful verifier selection in fact-checking applications built on retrieval-augmented generation (RAG). To address this, the researchers proposed a test-time arbitration method that improves factual reliability across various LLM families without altering the models themselves.
IMPACT: Highlights critical arbitration flaws in LLM fact-checking, potentially guiding development of more reliable AI verification systems.
RANK_REASON: The cluster contains an academic paper detailing a new evaluation framework for LLMs.
AI-generated summary · Google Gemini · from 1 source.