PulseAugur

LLMs Overwhelmingly Reproduce Majority Human Grading in Thai Bar Exam Study

A new study of the Thai bar examination finds that human examiners sometimes diverge when grading free-form essays because they interpret an ambiguous rubric differently, whereas Large Language Models (LLMs) overwhelmingly converge on the majority human reading. Of the 26 LLMs tested, none reproduced the minority human grading position when given a correct answer that lacked a statutory citation. An anchor sub-panel of three LLMs reached markedly higher agreement (alpha = 0.77) than the human panel (alpha = 0.36), which suggests that LLM judges align with the dominant human interpretation rather than exploring alternative valid readings.
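For readers unfamiliar with agreement coefficients, the sketch below shows how an alpha of this kind can be computed. It assumes the reported alpha is Krippendorff's alpha, which the summary does not state, and it uses a nominal (match or mismatch) distance. The function name `krippendorff_alpha` and the score lists are hypothetical illustrations, not the study's code or data.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha(units, delta=lambda a, b: float(a != b)):
    """Krippendorff's alpha for a list of units (one list of ratings per essay).

    Assumption: the summary's "alpha" is Krippendorff's alpha.
    `delta` is the distance between two ratings; the default is nominal.
    Ratings of None are treated as missing.
    """
    # Coincidence matrix: each ordered pair of ratings within a unit
    # contributes 1 / (m - 1), where m is the number of ratings in that unit.
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two ratings carries no pair information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per rating value, and the overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable ratings")

    observed = sum(w * delta(a, b) for (a, b), w in coincidences.items()) / n
    expected = sum(
        totals[a] * totals[b] * delta(a, b) for a in totals for b in totals
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical pass/fail grades (1 = pass, 0 = fail) for five essays.
    split_panel = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0]]
    converged_panel = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
    print("split panel alpha:", round(krippendorff_alpha(split_panel), 3))
    print("converged panel alpha:", round(krippendorff_alpha(converged_panel), 3))
```

For graded scores rather than categories, a squared-difference distance such as `delta=lambda a, b: (a - b) ** 2` can be passed instead. A high alpha only says that the raters agree with one another; it does not say which reading of the rubric they agree on.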

IMPACT LLM judges show a strong tendency to align with majority human interpretations, potentially limiting their utility in capturing nuanced or minority viewpoints in subjective evaluations.

RANK_REASON The cluster contains an academic paper detailing a study on LLM performance in a specific domain.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Pawitsapak Akarajaradwong, Wuttikrai Lertprasertphakorn, Chompakorn Chaksangchaichot, Sarana Nutanong ·

    A Two-Phase Stability Study of LLM Judges and Bar Council Examiners on Thai Bar-Exam Free-Form Essays

    arXiv:2605.25652v1. Abstract: Free-form legal essay evaluation in NLP treats expert inter-rater stability as a single ceiling number, and treats LLM-judge agreement with that ceiling as evidence of judge stability. We test both assumptions on the Thai bar examin…

  2. arXiv cs.CL TIER_1 English(EN) · Sarana Nutanong ·

    A Two-Phase Stability Study of LLM Judges and Bar Council Examiners on Thai Bar-Exam Free-Form Essays

    Free-form legal essay evaluation in NLP treats expert inter-rater stability as a single ceiling number, and treats LLM-judge agreement with that ceiling as evidence of judge stability. We test both assumptions on the Thai bar examination through an identical-inputs protocol: thre…