Researchers have developed LLMScholarBench, a new benchmark for auditing Large Language Models (LLMs) used for academic expert recommendation. The benchmark evaluates both the models' inherent capabilities and the impact of user interventions during the recommendation process. Experiments across 22 LLMs on physics expert recommendation showed that interventions such as temperature adjustments, diversity-focused prompting, and retrieval-augmented generation (RAG) each carry distinct trade-offs, affecting metrics such as factuality, diversity, and representation.
AI IMPACT Provides a framework for evaluating and improving the fairness and accuracy of LLM-driven academic discovery tools.
RANK_REASON The cluster contains an academic paper detailing a new benchmark for evaluating LLM performance.
AI-generated summary · Google Gemini · from 1 source.