Evaluating alignment of behavioral dispositions in LLMs

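New LLM evaluation methods tackle alignment and bias

By PulseAugur Editorial Team · [9 sources] · 2026-04-03 08:00

Researchers are developing new methods to evaluate and improve the alignment and interpretability of large language models (LLMs). Google Research has proposed a framework that adapts psychological assessment methods to quantify an LLM's behavioral dispositions and compare them with human consensus. Separately, a new method called BINEVAL decomposes evaluation criteria into binary questions, yielding scores that are more interpretable and easier to debug than those of conventional LLM judges. Other studies examine how to mitigate self-preference bias in LLM evaluators and how to improve confidence calibration by accounting for item difficulty.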
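The idea of breaking evaluation criteria into binary questions can be illustrated with a minimal Python sketch. This is not BINEVAL's actual implementation, which the sources here do not describe in detail. The names `BinaryQuestion`, `score_output`, `toy_judge`, and `QUESTIONS` are hypothetical, and the judge is a rule-based stand-in for what would in practice be an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BinaryQuestion:
    criterion: str  # rubric criterion this question belongs to (hypothetical)
    text: str       # yes/no question put to the judge


def score_output(
    output: str,
    questions: List[BinaryQuestion],
    ask: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Return the fraction of 'yes' answers per criterion, plus their mean."""
    answers: Dict[str, List[bool]] = {}
    for q in questions:
        answers.setdefault(q.criterion, []).append(ask(q.text, output))
    scores = {c: sum(a) / len(a) for c, a in answers.items()}
    # Unweighted mean over criteria; a real system might weight them.
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores


QUESTIONS = [
    BinaryQuestion("correctness", "Does the answer state a final number?"),
    BinaryQuestion("concision", "Is the answer under 20 words?"),
    BinaryQuestion("grounding", "Does the answer cite a source?"),
]


def toy_judge(question: str, output: str) -> bool:
    # Assumption: simple string rules stand in for an LLM judge.
    checks = {
        "Does the answer state a final number?": any(ch.isdigit() for ch in output),
        "Is the answer under 20 words?": len(output.split()) < 20,
        "Does the answer cite a source?": "http" in output,
    }
    return checks[question]


if __name__ == "__main__":
    print(score_output("The total is 42.", QUESTIONS, toy_judge))
```

Because each score is a tally of individual yes/no answers, a low score can be traced to the specific questions that failed, which is the debuggability benefit the summary attributes to this style of evaluation.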

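Impact: These advances in LLM evaluation and alignment could lead to AI systems that are more reliable, interpretable, and trustworthy.

Ranking rationale: Several research papers introduce novel methods for evaluating LLM behavior, alignment, and self-assessment.

Read at Google AI / Research →

AI-generated summary · Google Gemini · from 9 sources.

Sources [9]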

Google AI / Research TIER_1 English(EN) · 2026-04-03 08:00

Evaluating alignment of behavioral dispositions in LLMs

Generative AI

arXiv cs.AI TIER_1 English(EN) · Sangwoo Cho, Kushal Chawla, Pengshan Cai, Zefang Liu, Chenyang Zhu, Shi-Xiong Zhang, Sambit Sahu · 2026-06-26 04:00

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

arXiv:2606.27226v1. Abstract: Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores th…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-25 16:14

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a fram…

arXiv cs.AI TIER_1 English(EN) · Sambit Sahu · 2026-06-25 16:14

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

Evaluating LLM outputs remains a major bottleneck in NLP: human evaluation is expensive and slow, lexical metrics correlate poorly with human judgments on open-ended generation, and holistic LLM judges often produce opaque scores that are hard to debug. We propose BINEVAL, a fram…

arXiv cs.CL TIER_1 English(EN) · Yuzheng Xu, Tosho Hirasawa, Tadashi Kozuno, Yoshitaka Ushiku · 2026-06-25 04:00

Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge

arXiv:2602.02219v2. Abstract: Large language models are widely employed as evaluators, a paradigm commonly referred to as LLM-as-a-judge. Prior research has predominantly examined point-wise or pair-wise evaluation protocols; in contrast, our focus is on rub…

arXiv cs.LG TIER_1 English(EN) · Kai Qin, Jiaqi Wu, Jianxiang He, Haoyuan Sun, Yifei Zhao, Xu Wang, Bin Liang, Yongzhe Chang, Cheng Li, Tiantian Zhang, Houde Liu · 2026-06-25 04:00

Distribution Preference Optimization: A Fine-grained Perspective for LLM Unlearning

arXiv:2510.04773v2. Abstract: As Large Language Models (LLMs) demonstrate remarkable capabilities learned from vast corpora, concerns regarding data privacy and safety are receiving increasing attention. LLM unlearning, which aims to remove the influence of …

arXiv cs.AI TIER_1 English(EN) · Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Jou Barzdukas, Mackenzie Puig-Hall, Narmeen Oozeer · 2026-06-24 04:00

Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations

arXiv:2601.22548v4. Abstract: Recent research has shown that large language models (LLMs) favor their own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentan…

arXiv cs.AI TIER_1 English(EN) · Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Jou Barzdukas, Simon Fu, Narmeen Oozeer · 2026-06-24 04:00

Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

arXiv:2509.03647v2. Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from "self-preference bias": a tendency to favor their own outputs over those of other models. This bias undermines fairness and reli…

arXiv cs.CL TIER_1 English(EN) · Yihuang Kang · 2026-06-20 08:13

Latent Confidence Alignment for LLM Self-Evaluation

Confidence calibration in large language models (LLMs) is commonly evaluated by comparing predicted confidence with observed accuracy. However, such approaches do not model item difficulty, making it difficult to interpret discrepancies and to determine whether model confidence r…
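For context, the common baseline the abstract refers to, comparing predicted confidence with observed accuracy, is often summarized as an expected calibration error (ECE). The sketch below shows that baseline only, not the paper's method, and it does not model item difficulty. The function name, the equal-width binning, and the default of 10 bins are illustrative assumptions.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean gap between average confidence and accuracy per bin."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Assign each item to an equal-width confidence bin over [0, 1].
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append(i)

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Four answers at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9] * 4, [True, True, True, False]))
```

A gap measured this way cannot say whether the model was overconfident or the items were simply hard, which is the limitation the abstract raises.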
