LLM search evaluation improved with historical user data · arXiv

By PulseAugur Editorial · [1 sources] · 2026-07-01 15:05

Researchers have developed a new method for evaluating search engine results using Large Language Models (LLMs) that incorporates historical user interaction data. This "behavior-grounded" approach uses Query-Relevance-Impressions (QRI) cards to provide LLMs with empirical evidence, improving their ability to align relevance judgments with actual user preferences, especially for ambiguous or long-tail queries. In evaluations at Spotify, this method enhanced alignment with user preferences by approximately 5% and showed a 91% relative improvement in resolving disagreement cases. The approach also demonstrated stronger correlation with human judgments on multilingual datasets and showed higher alignment with live A/B test outcomes, suggesting its practical utility for real-world search systems.
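To make the idea of grounding a judge in behavior more concrete, here is a minimal Python sketch of how historical interaction logs might be condensed into a short evidence card and placed in an LLM judge prompt. This summary does not describe the paper's actual QRI card format, so everything here is an assumption for illustration: the `Interaction` schema, the use of click-through rate as the behavioral signal, the 0-3 rating scale, and the helper names `build_card` and `build_judge_prompt` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """One aggregated log row (hypothetical schema, not taken from the paper)."""
    query: str
    result_id: str
    impressions: int
    clicks: int


def build_card(query: str, logs: list[Interaction], top_k: int = 3) -> str:
    """Summarize past behavior for `query` as a short plain-text card."""
    # Keep only rows for this query that were actually shown to users.
    rows = [r for r in logs if r.query == query and r.impressions > 0]
    # Assumption: click-through rate stands in for user preference.
    rows.sort(key=lambda r: r.clicks / r.impressions, reverse=True)
    lines = [f"Historical behavior for query: {query!r}"]
    for r in rows[:top_k]:
        ctr = r.clicks / r.impressions
        lines.append(f"- {r.result_id}: {r.impressions} impressions, CTR {ctr:.2f}")
    if len(lines) == 1:
        lines.append("- no historical data available")
    return "\n".join(lines)


def build_judge_prompt(query: str, serp: list[str], card: str) -> str:
    """Assemble a judge prompt that includes the behavioral evidence card."""
    results = "\n".join(f"{i}. {rid}" for i, rid in enumerate(serp, start=1))
    return (
        "Rate how relevant each result is to the query on a 0-3 scale.\n"
        f"Query: {query}\n"
        f"Results:\n{results}\n\n"
        f"{card}\n"
        "Use the historical behavior as evidence when the intent is ambiguous."
    )
```

The resulting prompt string would then be sent to whichever LLM serves as the judge; that call is left out because the summary does not say which model or API the authors used.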

IMPACT Enhances LLM-based search evaluation reliability by grounding judgments in user behavior, improving relevance accuracy for real-world applications.

RANK_REASON The item is a research paper published on arXiv detailing a new method for LLM-based evaluation of search results.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Mounia Lalmas · 2026-07-01 15:05

As It Was: Aligning LLM Search Evaluation with Historical User Preferences

Large-scale search systems evolve faster than human quality assurance can scale, especially for long-tail intents and multilingual queries. LLM-as-a-judge approaches provide a scalable alternative for evaluating the relevance of search engine result pages (SERPs), but judgments b…
