The Necessity of Setting Temperature in LLM-as-a-Judge
A new study published on arXiv investigates how decoding temperature affects the performance of Large Language Models (LLMs) used as judges of other models' outputs. The research indicates that higher temperatures can reduce consistency and increase formatting errors, but can also reveal latent uncertainty that may be useful in complex evaluation scenarios. The findings suggest that temperature should be chosen per task, balancing reliability against exploration, rather than treated as a fixed hyperparameter.
AI IMPACT: Provides guidance on optimizing LLM-as-a-judge setups for more reliable and insightful model evaluations.
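To make the consistency trade-off concrete, here is a minimal sketch of how temperature reshapes a judge's verdict distribution. It is not the study's method: the logits, the verdict labels, and the helper names (softmax_with_temperature, sample_verdicts, self_consistency) are all hypothetical, and a toy three-way verdict stands in for a real LLM judge.

```python
import math
import random
from collections import Counter


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, dividing by temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_verdicts(logits, labels, temperature, n, seed=0):
    """Draw n verdicts from the temperature-scaled distribution."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(labels, weights=probs, k=n)


def self_consistency(verdicts):
    """Fraction of samples that agree with the most frequent verdict."""
    counts = Counter(verdicts)
    return counts.most_common(1)[0][1] / len(verdicts)


# Hypothetical judge scores for one comparison (assumed values).
LABELS = ["A wins", "tie", "B wins"]
LOGITS = [2.0, 1.0, 0.0]

if __name__ == "__main__":
    for t in (0.1, 0.7, 2.0):
        verdicts = sample_verdicts(LOGITS, LABELS, t, n=200)
        print(f"T={t}: consistency={self_consistency(verdicts):.2f}")
```

In this toy setup, a low temperature concentrates almost all probability on the top verdict, so repeated judgments agree. A high temperature flattens the distribution, so the spread of sampled verdicts exposes how close the underlying scores are.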