Researchers have developed a new benchmark, LLM-S^3, to evaluate how well large language models can simulate human survey respondents. The benchmark includes 11 real-world datasets spanning several sociological domains. Experiments with GPT-3.5/4 Turbo and LLaMA 3.0/3.1-8B showed consistent performance trends and highlighted how prompt design affects simulation accuracy.
AI impact: Introduces a new benchmark for evaluating LLM simulation capabilities, potentially improving data collection methods in the social sciences.
Ranking rationale: The cluster contains an academic paper introducing a new benchmark for evaluating LLMs in survey simulation.
AI-generated summary · Google Gemini · from 1 source.