MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions

New MiroBench benchmark shows LLM agents fail to simulate real Reddit discussions

By the PulseAugur editorial team · [1 source] · 2026-06-16 04:00

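Researchers have introduced MiroBench, a new benchmark for evaluating how realistically LLM agents simulate real-world discussions, specifically Reddit threads. The benchmark measures how generated discussions differ from real ones along four dimensions: repetition and semantic uniformity, narrative content, toxicity and aggression, and structural complexity. Experiments with MiroBench across five models and five domains show that current simulators fail to reproduce the distributional patterns and interaction dynamics of actual Reddit conversations, and that prompt-based augmentation yields only marginal improvements.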
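To make the idea of comparing "distributional patterns" concrete, here is a minimal sketch of one way a structural gap between real and simulated threads could be measured. It is not MiroBench's actual metric: the thread representation (a dict mapping comment id to parent id), the choice of maximum reply depth as the structural feature, the total variation distance, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def max_depth(parents):
    """Depth of the deepest comment in a thread.

    `parents` maps comment id -> parent comment id, with None for
    top-level comments (a hypothetical representation).
    """
    def depth(cid):
        d = 1
        while parents[cid] is not None:
            cid = parents[cid]
            d += 1
        return d

    return max((depth(c) for c in parents), default=0)


def total_variation(xs, ys):
    """Total variation distance between two empirical distributions."""
    px, py = Counter(xs), Counter(ys)
    support = set(px) | set(py)
    return 0.5 * sum(abs(px[v] / len(xs) - py[v] / len(ys)) for v in support)


# Toy data, invented for illustration only.
real_threads = [
    {"a": None, "b": "a", "c": "b"},  # a <- b <- c, depth 3
    {"a": None, "b": "a"},            # depth 2
]
simulated_threads = [
    {"a": None, "b": None},           # two top-level comments, depth 1
    {"a": None, "b": "a"},            # depth 2
]

real_depths = [max_depth(t) for t in real_threads]
sim_depths = [max_depth(t) for t in simulated_threads]
gap = total_variation(real_depths, sim_depths)
print(f"depth-distribution gap: {gap:.2f}")
```

A gap of 0 would mean the two depth distributions are identical and 1 would mean they do not overlap at all. The same pattern could be applied to any per-thread feature, such as a repetition or toxicity score.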

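Impact: Highlights the gap between the simulation capabilities of current LLM agents and the complexity of real-world human interaction, offering guidance for future research on agent realism.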

Ranking rationale: This cluster contains an academic paper introducing a benchmark for evaluating the simulation capabilities of LLM agents.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Yaoning Yu, Ye Yu, Haojing Luo, Haohan Wang · 2026-06-16 04:00

MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions

arXiv:2606.14715v1 Announce Type: cross Abstract: LLM agents are increasingly used to simulate real world interactions, but it remains unclear whether simulated behaviors preserve the content patterns and interaction dynamics of real human behaviors. Existing evaluations remain f…

