Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval
Researchers have developed ParaEval, a framework for evaluating large language models more reliably. Current multiple-choice question-answering (MCQA) benchmarks are overly sensitive to the exact wording of the answer options, so a model's score can reflect how an option is phrased rather than what the model knows. ParaEval addresses this by querying the model with multiple paraphrased versions of the answer options, giving a more robust measure of underlying capability than familiarity with specific phrases.
AI IMPACT: Provides a more reliable method for assessing LLM knowledge, potentially leading to more accurate model development and comparison.
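To make the idea of querying with paraphrased answer options concrete, here is a minimal sketch. It is not ParaEval's implementation: the summary does not say how the framework generates paraphrases or aggregates results, so the `Model` interface, the `paraphrase_accuracy` helper, the toy `keyword_model`, and the choice to average correctness over paraphrases are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical model interface (an assumption, not ParaEval's API):
# given a question and a list of options, return the index of the chosen option.
Model = Callable[[str, Sequence[str]], int]


def paraphrase_accuracy(
    model: Model,
    question: str,
    option_sets: Sequence[Sequence[str]],
    correct_index: int,
) -> float:
    """Fraction of paraphrased option sets on which the model picks the correct option.

    Each entry in option_sets is the same set of answers, reworded.
    Averaging over paraphrases is an assumed aggregation rule.
    """
    if not option_sets:
        raise ValueError("need at least one option set")
    hits = sum(
        1 for options in option_sets if model(question, options) == correct_index
    )
    return hits / len(option_sets)


def keyword_model(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for an LLM: picks the first option containing 'Paris', else option 0."""
    for i, option in enumerate(options):
        if "Paris" in option:
            return i
    return 0


# Three wordings of the same three answers; the correct answer is always index 1.
option_sets = [
    ["Berlin", "Paris", "Rome"],
    ["Germany's capital", "The City of Light", "The Eternal City"],
    ["Berlin, Germany", "Paris, France", "Rome, Italy"],
]

score = paraphrase_accuracy(
    keyword_model, "What is the capital of France?", option_sets, correct_index=1
)
single_phrasing = paraphrase_accuracy(
    keyword_model, "What is the capital of France?", option_sets[:1], correct_index=1
)
print(f"single phrasing: {single_phrasing:.2f}, across paraphrases: {score:.2f}")
```

The toy model answers correctly only when the option contains a familiar surface string, so it looks perfect on the original wording but drops once the options are paraphrased. That gap is the kind of phrasing sensitivity the summary describes.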