Researchers have analyzed how large language models (LLMs) develop preferences for well-known entities, a phenomenon often linked to popularity bias. Using the open OLMo models and their complete Dolma pretraining corpus, they measured entity exposure across 7.4 trillion tokens. Their findings indicate that LLM popularity judgments align more closely with pretraining exposure than with external signals such as Wikipedia pageviews, especially for larger models and for entities in the long tail of lower popularity. This suggests that data exposure during pretraining is the primary driver of popularity bias in LLMs.
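The comparison described above can be sketched as a rank-correlation check: count how often each entity appears in the pretraining data, then ask whether the model's popularity judgments track those counts better than an external signal. The sketch below is illustrative only; the corpus, judgment scores, and pageview numbers are all hypothetical stand-ins, not data from the paper.

```python
from collections import Counter

# Toy stand-in for a pretraining corpus (the paper uses Dolma, 7.4T tokens).
corpus = (
    "Einstein Einstein Einstein Einstein Curie Curie Curie "
    "Turing Turing Noether"
)

# Step 1: pretraining exposure = mention count per entity.
exposure = Counter(corpus.split())

# Hypothetical popularity scores a model might assign (higher = more popular).
model_judgment = {"Einstein": 0.9, "Curie": 0.7, "Turing": 0.5, "Noether": 0.2}

# Hypothetical external signal (e.g. Wikipedia pageviews), deliberately
# mis-ordered relative to exposure for the long-tail entities.
pageviews = {"Einstein": 120_000, "Curie": 30_000, "Turing": 45_000, "Noether": 8_000}

def ranks(scores, keys):
    """Rank positions (1 = largest score); assumes no ties."""
    order = sorted(keys, key=lambda k: -scores[k])
    return {k: i + 1 for i, k in enumerate(order)}

def spearman(a, b, keys):
    """Spearman rho for tie-free rankings: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    ra, rb = ranks(a, keys), ranks(b, keys)
    n = len(keys)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in keys)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

entities = list(model_judgment)
rho_exposure = spearman(exposure, model_judgment, entities)
rho_pageviews = spearman(pageviews, model_judgment, entities)
print(rho_exposure, rho_pageviews)  # prints 1.0 0.8
```

In this toy setup the model's judgments correlate perfectly with exposure counts (rho = 1.0) but only partially with pageviews (rho = 0.8), mirroring the paper's qualitative finding that exposure is the stronger predictor.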
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Demonstrates that LLM biases stem primarily from training data exposure, not external popularity metrics.
RANK_REASON Academic paper analyzing LLM behavior with novel methodology and findings.