Researchers have analyzed how large language models (LLMs) develop preferences for well-known entities, a phenomenon often linked to popularity bias. Using the open OLMo models and their complete Dolma pretraining corpus, they measured entity exposure across 7.4 trillion tokens. Their findings indicate that LLM popularity judgments align more closely with pretraining exposure than with external signals such as Wikipedia pageviews, especially for larger models and for entities in the long tail of lower popularity. This suggests that data exposure during pretraining is the primary driver of popularity bias in LLMs.
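The comparison described above can be sketched as a rank-correlation check: count how often each entity appears in the pretraining data, then ask whether the model's popularity judgments track those counts better than an external signal. The sketch below is illustrative only; the corpus, judgment scores, and pageview numbers are all hypothetical stand-ins, not data from the paper.

```python
from collections import Counter

# Toy stand-in for a pretraining corpus (the paper uses Dolma, 7.4T tokens).
corpus = (
    "Einstein Einstein Einstein Einstein Curie Curie Curie "
    "Turing Turing Noether"
)

# Step 1: pretraining exposure = mention count per entity.
exposure = Counter(corpus.split())

# Hypothetical popularity scores a model might assign (higher = more popular).
model_judgment = {"Einstein": 0.9, "Curie": 0.7, "Turing": 0.5, "Noether": 0.2}

# Hypothetical external signal (e.g. Wikipedia pageviews), deliberately
# mis-ordered relative to exposure for the long-tail entities.
pageviews = {"Einstein": 120_000, "Curie": 30_000, "Turing": 45_000, "Noether": 8_000}

def ranks(scores, keys):
    """Rank positions (1 = largest score); assumes no ties."""
    order = sorted(keys, key=lambda k: -scores[k])
    return {k: i + 1 for i, k in enumerate(order)}

def spearman(a, b, keys):
    """Spearman rho for tie-free rankings: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    ra, rb = ranks(a, keys), ranks(b, keys)
    n = len(keys)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in keys)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

entities = list(model_judgment)
rho_exposure = spearman(exposure, model_judgment, entities)
rho_pageviews = spearman(pageviews, model_judgment, entities)
print(rho_exposure, rho_pageviews)  # prints 1.0 0.8
```

In this toy setup the model's judgments correlate perfectly with exposure counts (rho = 1.0) but only partially with pageviews (rho = 0.8), mirroring the paper's qualitative finding that exposure is the stronger predictor.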
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Demonstrates that LLM biases stem primarily from training data exposure, not external popularity metrics.
RANK_REASON Academic paper analyzing LLM behavior with novel methodology and findings.