Personalized Turn-Level User Conversation Satisfaction Benchmark
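New benchmarks and tools aim to improve the evaluation of conversational AI
By the PulseAugur editorial team · [11 sources]
Researchers are developing new benchmarks and tools to evaluate and improve the capabilities of conversational AI. Several recent arXiv papers introduce new evaluation suites and datasets focused on multi-turn interaction, emotional intelligence, and personalized user satisfaction. These efforts aim to address the limitations of existing methods, which often struggle with the nuances of human-like conversation, evolving model capabilities, and users' individual expectations. In addition, discussions on platforms such as Reddit highlight the practical challenges and ongoing development of local conversational AI solutions, as well as approaches to managing long conversation context.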
AI
arXiv:2604.08782v3 Announce Type: replace Abstract: Large language models (LLMs) suffer significant performance degradation when user instructions and context are distributed over multiple conversational turns, yet multi-turn (MT) interactions dominate chat interfaces. The routin…
arXiv:2603.23160v2 Announce Type: replace Abstract: Benchmarking large language models (LLMs) and agents in multi-turn interactive scenarios is essential for understanding their practical capabilities. However, existing evaluation protocols are highly heterogeneous, differing sig…
arXiv:2605.29711v1 Announce Type: cross Abstract: User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation meth…
arXiv cs.AI
Kate M. Lubrano, Faisal Sayed, Ankita Rathod, Akshansh, Craver Corbyn Thomas-Smith, Mark E. Whiting, Karina Nguyen
arXiv:2605.21739v2 Announce Type: replace Abstract: Emotional intelligence (EI), the ability to perceive, understand, and respond appropriately to others' emotional states, is central to human communication, and increasingly important to assess as LLMs assume conversational roles…
arXiv:2405.13003v2 Announce Type: replace-cross Abstract: Recent advancements in conversational systems have significantly enhanced human-machine interactions across various domains. However, training these systems is challenging due to the scarcity of specialized dialogue data. …
arXiv:2604.20443v2 Announce Type: replace-cross Abstract: We introduce DialToM, an annotated Theory of Mind (ToM) benchmark built from naturalistic human-human dialogues using a multiple-choice evaluation framework. Concurrent with recent work showing a gap between explicit menta…
arXiv:2605.28882v1 Announce Type: cross Abstract: With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, ye…
User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation methods mostly measure generic response quality, makin…
Behavioral analytics can surface issues customers may not explicitly report. A customer may abandon a checkout page repeatedly without ever contacting support.
I thought this was going to be easy. I searched Reddit, Google and even tried to find a solution with LLMs. I saw a few nice things: unmute.sh (http://unmute.sh) seems promising, there are WebGPU implementations that look impressive, …
I kept running into the same problem while coding with AI. After 100+ messages, the conversation contains important decisions, bug fixes, architecture choices, and unresolved issues. Starting a new chat loses all of that. Kee…