Pretraining Exposure Explains Popularity Judgments in Large Language Models

Study finds: LLM popularity bias is driven by pretraining data exposure

By the PulseAugur editorial team · [1 source] · 2026-05-12 16:45

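Researchers analyzed how large language models (LLMs) come to favor well-known entities, a phenomenon usually associated with popularity bias. Using the open-source OLMo models and their complete Dolma pretraining corpus, they computed entity exposure across 7.4 trillion tokens. Their findings indicate that LLM popularity judgments track pretraining exposure more closely than external signals such as Wikipedia page views, especially for larger models and in the long tail of less popular entities. This suggests that data exposure during pretraining is the main driver of LLM popularity bias.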
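The comparison described above can be illustrated with a minimal Python sketch. It is not the paper's method: it assumes exposure is approximated by whole-phrase mention counts and that agreement is measured with Spearman rank correlation, neither of which the summary confirms. The function names, the three-sentence corpus, and the entity list are hypothetical placeholders, not data from the study.

```python
import re


def count_exposure(documents, entities):
    """Count case-insensitive whole-phrase mentions of each entity.

    Assumption: exposure is approximated by raw mention counts; the
    paper's actual entity-matching procedure may differ.
    """
    counts = {entity: 0 for entity in entities}
    for doc in documents:
        for entity in entities:
            pattern = r"\b" + re.escape(entity) + r"\b"
            counts[entity] += len(re.findall(pattern, doc, flags=re.IGNORECASE))
    return counts


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one input is constant
    return cov / (sx * sy)


# Hypothetical toy corpus and entities, for illustration only.
toy_corpus = [
    "Paris is the capital of France. Paris hosts many museums.",
    "Lyon and Paris are French cities.",
    "Annecy is a town near Lyon.",
]
toy_entities = ["Paris", "Lyon", "Annecy"]
toy_exposure = count_exposure(toy_corpus, toy_entities)
```

In use, one would compute `spearman` twice over the same entity list, once between model popularity scores and exposure counts and once between model scores and page views, and compare the two coefficients. A rank-based measure is a natural choice here because mention counts and page views are heavily skewed, but that choice is an assumption of this sketch.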

Impact: Shows that LLM bias stems primarily from training-data exposure rather than from external popularity metrics.

Ranking rationale: Academic paper that analyzes LLM behavior with a novel method and findings.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL · English (EN) · Adam Jatowt · 2026-05-12 16:45

Pretraining Exposure Explains Popularity Judgments in Large Language Models

Large language models (LLMs) exhibit systematic preferences for well-known entities, a phenomenon often attributed to popularity bias. However, the extent to which these preferences reflect real-world popularity versus statistical exposure during pretraining remains unclear, larg…
