Researchers have developed a new benchmark, LLM-S^3, to evaluate how well large language models can simulate human respondents in surveys. The benchmark comprises 11 real-world datasets spanning several sociological domains. Experiments with GPT-3.5/4 Turbo and LLaMA 3.0/3.1-8B showed consistent performance trends and indicated that prompt design affects simulation accuracy.
AI IMPACT Introduces a new benchmark for evaluating how well LLMs simulate survey respondents, which could improve data collection methods in the social sciences.
RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating LLMs in survey simulation.
AI-generated summary · Google Gemini · from 1 source.