Researchers have developed a method for evaluating search results with Large Language Models (LLMs) that incorporates historical user interaction data. This "behavior-grounded" approach uses Query-Relevance-Impressions (QRI) cards to give the LLM empirical evidence from past user behavior, so that its relevance judgments align more closely with actual user preferences, especially for ambiguous or long-tail queries. In evaluations at Spotify, the method improved alignment with user preferences by approximately 5% and showed a 91% relative improvement in resolving disagreement cases. It also correlated more strongly with human judgments on multilingual datasets and aligned more closely with live A/B test outcomes, suggesting practical utility for real-world search systems.
AI IMPACT Enhances the reliability of LLM-based search evaluation by grounding judgments in user behavior, improving relevance accuracy for real-world applications.
RANK_REASON The item is a research paper published on arXiv detailing a new method for LLM-based evaluation of search results.
Read on arXiv cs.IR (Information Retrieval) →
- A/B testing
- arXiv
- CatalyzeX
- DagsHub
- Gotit.pub
- Hugging Face
- Query-Relevance-Impressions (QRI) card
- ScienceCast
- Spearman's rank correlation coefficient
- Spotify
AI-generated summary · Google Gemini · from 1 source.