PulseAugur

New iOSWorld benchmark tests personalized phone agent intelligence

Researchers have introduced iOSWorld, a benchmark for evaluating the personalized intelligence of phone agents. It provides a simulated iOS environment with 26 interconnected apps populated with user-specific data such as messages, transactions, and travel records. Its 133 tasks are categorized by difficulty and cover single-app operations, multi-app interactions, and memory-based personalization. Initial evaluations show that even advanced models struggle with multi-app tasks, though privileged access to system information significantly boosts performance.

IMPACT This benchmark could drive the development of more context-aware and personalized AI agents for mobile devices.

RANK_REASON The cluster contains a research paper introducing a new benchmark for evaluating AI agents.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.LG TIER_1 English(EN) · Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov

    iOSWorld: A Benchmark for Personally Intelligent Phone Agents

    arXiv:2606.09764v1. Abstract: A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile …

  2. arXiv cs.CL TIER_1 English(EN) · Ruslan Salakhutdinov

    iOSWorld: A Benchmark for Personally Intelligent Phone Agents

    A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalizati…