Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals

AI Predicts Human-Rater Disagreement in LLM-Generated Difficulty Ratings

By PulseAugur Editorial · [1 source] · 2026-05-12 17:16

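Researchers have developed a method for predicting when AI-generated difficulty ratings for educational materials are likely to disagree with human assessments. The method uses an independent embedding space (such as ModernBERT) to flag potential disagreement, without relying on generation-time probability signals, which are often hard to compare across AI models. In experiments on CEFR-based sentence difficulty assessment with GPT-OSS-120B and Qwen3-235B-A22B, this geometric-consistency approach predicted disagreement with human raters more accurately than probability-based baselines.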
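As a rough illustration of what a geometric-consistency signal could look like, the sketch below scores each sentence by how far its judge-assigned level sits from the levels of its nearest neighbours in an embedding space. This is an assumption about the general idea, not the paper's algorithm: the function name `neighbor_inconsistency`, the choice of cosine similarity with k nearest neighbours, the integer encoding of CEFR levels (A1=1 through C2=6), and the toy 2-D vectors standing in for ModernBERT embeddings are all hypothetical.

```python
import numpy as np


def neighbor_inconsistency(embeddings, labels, k=2):
    """Mean absolute gap between each item's label and the labels of its
    k nearest neighbours under cosine similarity (hypothetical scoring rule)."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise rows
    sim = X @ X.T                                      # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)                     # an item is not its own neighbour
    nn = np.argsort(-sim, axis=1)[:, :k]               # indices of the k most similar items
    return np.abs(y[nn] - y[:, None]).mean(axis=1)


# Toy stand-ins for sentence embeddings: two tight clusters in 2-D.
toy_embeddings = [
    [1.0, 0.0], [1.0, 0.1], [1.0, -0.1], [1.0, 0.05],  # cluster of "easy" sentences
    [0.0, 1.0], [0.1, 1.0], [-0.1, 1.0],               # cluster of "hard" sentences
]
# Judge-assigned CEFR levels as integers (assumed encoding A1=1 ... C2=6).
# Item 3 sits in the easy cluster but was rated B2 (4) by the judge.
judge_levels = [1, 1, 1, 4, 5, 5, 5]

scores = neighbor_inconsistency(toy_embeddings, judge_levels, k=2)
flagged = int(np.argmax(scores))  # item whose rating least matches its neighbourhood
```

In this toy setup the rating that clashes with its neighbourhood receives the highest score, so it would be the first candidate for human re-rating. Nothing here touches token probabilities, which is the property the summary emphasises; a real pipeline would replace the toy vectors with embeddings from an actual encoder.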

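Impact: Improves the reliability of difficulty assessment for AI-generated educational content and reduces the need for extensive human re-rating.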

Ranking rationale: Academic paper detailing a new method for evaluating AI-generated content.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL · Yo Ehara · 2026-05-12 17:16

Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals

Automatic generation of educational materials using large language models (LLMs) is becoming increasingly common, but assigning difficulty levels to such materials still requires substantial human effort. LLM-as-a-Judge has therefore attracted attention, yet disagreement with hum…

