A new study of the Thai bar examination finds that human examiners sometimes diverge when grading free-form essays because they interpret ambiguous rubrics differently, whereas large language models (LLMs) overwhelmingly converge on the majority human reading. None of the 26 LLMs tested reproduced the minority human grading perspective when faced with a correct answer that omitted a statutory citation. An anchor sub-panel of three LLMs reached much higher inter-rater agreement (alpha = 0.77) than the human panel (alpha = 0.36), underscoring the models' tendency to align with the dominant human interpretation rather than explore alternative valid readings.
AI IMPACT LLM judges align strongly with majority human interpretations, which may limit their usefulness for capturing nuanced or minority viewpoints in subjective evaluations.
AI-generated summary · Google Gemini · from 2 sources.