UXBench: Benchmarking User Experience in AI Assistants
Researchers have introduced UXBench, a benchmark for evaluating the user experience of AI assistants. It is the first benchmark to use real user feedback signals, and it comprises three tasks: UX Judge, UX Eval, and UX Recovery. The dataset contains 7,400 instances derived from more than 70,000 interaction logs of a Chinese AI assistant, covering diverse scenarios and failure patterns. Experiments with 26 language models indicate that predicting user feedback is a learnable capability, and they expose biases in current LLM-as-a-judge evaluation methods.
IMPACT: Establishes a new evaluation framework for AI assistants, pushing for user-centric optimization beyond raw capability.