Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 23h · [2 sources]

Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

Researchers have introduced ParaEval, a framework for evaluating large language models more reliably. Current multiple-choice question-answering (MCQA) benchmarks are overly sensitive to the specific wording of answer options, so scores can reflect phrasing rather than a model's actual knowledge. ParaEval addresses this by querying models with multiple paraphrased versions of the answer options, yielding a more robust measure of underlying capability rather than familiarity with particular phrases.
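
The general idea of scoring a model across paraphrased answer options can be illustrated with a minimal Python sketch. This is not ParaEval's actual implementation: the brief does not describe how paraphrases are generated or how results are aggregated, so the function name paraphrase_accuracy, the toy keyword_model, and the choice to average correctness over option sets are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical signature: a "model" takes a question and a list of answer
# options and returns the index of the option it selects.
Model = Callable[[str, Sequence[str]], int]


def paraphrase_accuracy(
    model: Model,
    question: str,
    option_sets: Sequence[Sequence[str]],
    correct: int,
) -> float:
    """Fraction of paraphrased option sets on which the model picks `correct`.

    Assumption: every option set lists the same answers in the same order,
    only reworded, so `correct` is the right index for all of them.
    """
    if not option_sets:
        raise ValueError("need at least one option set")
    hits = sum(1 for options in option_sets if model(question, options) == correct)
    return hits / len(option_sets)


def keyword_model(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for an LLM: it only recognizes the literal word 'Paris'."""
    for i, option in enumerate(options):
        if "Paris" in option:
            return i
    return 0  # falls back to the first option when the keyword is absent


QUESTION = "What is the capital of France?"
OPTION_SETS = [
    ["Berlin", "Paris", "Rome"],                                        # original wording
    ["The German capital", "The city of Paris", "Italy's capital"],     # paraphrase 1
    ["Berlin, Germany", "The French capital on the Seine", "Rome, Italy"],  # paraphrase 2
]

single = paraphrase_accuracy(keyword_model, QUESTION, OPTION_SETS[:1], correct=1)
robust = paraphrase_accuracy(keyword_model, QUESTION, OPTION_SETS, correct=1)
print(f"original wording only: {single:.2f}")
print(f"across paraphrases:    {robust:.2f}")
```

In this toy case the keyword-matching model looks perfect on the original wording but misses the paraphrase that never uses the word "Paris", which is the kind of phrasing dependence a single fixed wording would hide.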

IMPACT Provides a more reliable method for assessing LLM knowledge, potentially leading to more accurate model development and comparison.

large language models
ParaEval