Your LLM-as-judge eval set is too small. Here is the math
A recent article argues that teams using LLMs as judges in AI model assessments need larger evaluation datasets. The author explains that the common practice of calibrating a judge on a small, ad-hoc dataset is insufficient for reliable calibration. For an LLM judge with moderate agreement (Cohen's kappa of 0.4-0.6), achieving a 95% confidence interval of 0.10 takes approximately 200-400 paired labels, significantly more than the 50 or so that many teams typically use. The article provides the mathematical reasoning and code examples for calculating these requirements and for performing statistical comparisons between judges.
AI IMPACT Larger calibration sets make LLM-as-judge evaluations more reliable and statistically sound, which supports better model development and deployment decisions.
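
To make the sample-size figures in the summary concrete, here is a minimal sketch of one way such numbers can be derived. It is not the article's code. It assumes that "a 95% confidence interval of 0.10" means a half-width of ±0.10, that labels are binary with chance agreement p_e = 0.5 (balanced classes), and that the simple large-sample approximation SE(kappa) ≈ sqrt(p_o(1 - p_o) / n) / (1 - p_e) is adequate. The function names `kappa_standard_error` and `required_pairs` are hypothetical.

```python
import math

# Two-sided 95% quantile of the standard normal distribution.
Z_95 = 1.959963984540054


def kappa_standard_error(kappa, n, p_e=0.5):
    """Approximate standard error of Cohen's kappa for n paired labels.

    Assumption: simple large-sample formula with chance agreement p_e
    treated as known (0.5 corresponds to balanced binary labels).
    """
    # Observed agreement implied by kappa = (p_o - p_e) / (1 - p_e).
    p_o = p_e + kappa * (1.0 - p_e)
    return math.sqrt(p_o * (1.0 - p_o) / n) / (1.0 - p_e)


def required_pairs(kappa, half_width=0.10, p_e=0.5, z=Z_95):
    """Smallest n whose approximate CI half-width is at most half_width."""
    p_o = p_e + kappa * (1.0 - p_e)
    n = (z / half_width) ** 2 * p_o * (1.0 - p_o) / (1.0 - p_e) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    for kappa in (0.4, 0.5, 0.6):
        print(f"kappa={kappa:.1f}: need about {required_pairs(kappa)} paired labels")
    # Half-width of the interval when only 50 pairs are labeled.
    hw_50 = Z_95 * kappa_standard_error(0.5, 50)
    print(f"n=50, kappa=0.5: 95% CI is roughly +/-{hw_50:.2f}")
```

Under these assumptions the required counts fall between roughly 250 and 325 pairs, inside the 200-400 range quoted above, while 50 pairs leave an interval of about ±0.24 around a kappa of 0.5. Skewed class balance or more than two label categories change p_e and therefore the result, so treat the output as an order-of-magnitude guide.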