Researchers have introduced iOSWorld, a new benchmark designed to evaluate the personalized intelligence of phone agents. This benchmark features a simulated iOS environment with 26 interconnected apps containing user-specific data like messages, transactions, and travel records. It includes 133 tasks categorized by difficulty, focusing on single-app operations, multi-app interactions, and memory-based personalization. Initial evaluations show that even advanced models struggle with multi-app tasks, though privileged access to system information significantly boosts performance.
AI IMPACT This benchmark could drive the development of more context-aware and personalized AI agents for mobile devices.
RANK_REASON The cluster contains a research paper introducing a new benchmark for evaluating AI agents.
AI-generated summary · Google Gemini · from 2 sources.