LLM judge temperature impacts consistency and exploration

By PulseAugur Editorial · [1 source] · 2026-06-08 04:00

A new study on arXiv examines how decoding temperature affects Large Language Models (LLMs) when they are used as judges of other models' outputs. The research indicates that higher temperatures can reduce consistency and increase formatting errors, but can also reveal latent uncertainty that may be useful in complex evaluation scenarios. The findings suggest that temperature should be chosen per task, balancing reliability against exploration, rather than treated as a fixed hyperparameter.
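The consistency effect described above can be illustrated with a toy simulation. The following is a minimal sketch, not the paper's method: it assumes a judge that picks one score from 1 to 5 by sampling from a temperature-scaled softmax, and the logits, the function names, and the majority-agreement metric are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_verdicts(logits, labels, temperature, n, rng):
    """Draw n verdicts from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(labels, weights=probs, k=n)

def majority_agreement(verdicts):
    """Fraction of samples that match the most common verdict."""
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)

# Hypothetical logits for a judge choosing among scores 1..5 (assumed values).
labels = [1, 2, 3, 4, 5]
logits = [0.0, 1.0, 2.5, 3.0, 1.5]

rng = random.Random(0)  # fixed seed so repeated runs are comparable
for t in (0.1, 0.7, 1.5):
    verdicts = sample_verdicts(logits, labels, t, 200, rng)
    print(f"temperature={t}: agreement={majority_agreement(verdicts):.2f}")
```

In this toy setting, agreement falls as temperature rises because probability mass spreads from the top score to its neighbors. That spread is also what would expose the judge's uncertainty between adjacent scores, which is the trade-off the summary describes.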

IMPACT Provides guidance on optimizing LLM-as-a-judge setups for more reliable and insightful model evaluations.

RANK_REASON Academic paper on LLM evaluation methodology.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu State · 2026-06-08 04:00

The Necessity of Setting Temperature in LLM-as-a-Judge

arXiv:2603.28304v2 · Abstract: Using large language models (LLMs) as judges for evaluating model outputs has emerged as an important paradigm for automated evaluation. However, the choice of decoding temperature in LLM-as-a-judge settings is still largely cho…
