New PAVE testbed diagnoses LLM fact-checking arbitration flaws

By PulseAugur Editorial · [1 source] · 2026-06-02 04:00

Researchers have developed PAVE, a diagnostic testbed that evaluates how Large Language Models (LLMs) arbitrate between their internal (parametric) knowledge and retrieved evidence in fact-checking scenarios. PAVE sorts LLM verifiers into four epistemic states based on their prior knowledge and confidence, then assesses whether they reconcile parametric and contextual information or prioritize one over the other. Experiments with seven LLMs showed inconsistent, model-specific arbitration behavior, underscoring the need for careful verifier selection in RAG-based fact-checking applications. To address this, the authors propose a test-time arbitration method that improves factual reliability across LLM families without altering the models themselves.

IMPACT Highlights critical arbitration flaws in LLM fact-checking, potentially guiding development of more reliable AI verification systems.

RANK_REASON The cluster contains an academic paper detailing a new evaluation framework for LLMs.


paper
safety

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Yuxi Sun, Wenbo Shang, Wei Gao, Xin Huang, Jing Ma · 2026-06-02 04:00

Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking

arXiv:2606.01120v1 Announce Type: new Abstract: In RAG-based fact-checking, LLMs are increasingly used as verifiers to check given claims against retrieved evidence. Their parametric knowledge can induce pre-evidence tendencies that may conflict with the retrieved context, yet ex…

