PulseAugur
PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

New benchmark evaluates AI agents on mixed mobile-device interaction

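Researchers have introduced PhoneHarness, a new benchmark and execution framework for evaluating AI agents that interact with mobile devices. Unlike earlier approaches that focus only on GUI control, PhoneHarness supports mixed actions, letting agents use graphical user interfaces, command-line interfaces, and external tools. The framework is designed to assess whether agents can complete verifiable mobile workflows with observable side effects, rather than merely predict the next screen action. On the accompanying benchmark, PhoneHarness Bench, a pass rate of 75.0% is reported, 12.9 percentage points above existing setups, which underscores the importance of action-surface routing and verifiable execution for reliable phone automation.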
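The idea of routing each action to a GUI, CLI, or tool surface and then checking observable side effects can be illustrated with a minimal sketch. Everything here is hypothetical and is not taken from the paper: the `Action` type, the handler names, and the in-memory dictionary standing in for a phone are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Action:
    """One agent step (hypothetical schema, not PhoneHarness's own)."""
    surface: str   # "gui", "cli", or "tool"
    command: str
    argument: str


def make_device() -> dict:
    # Toy in-memory state standing in for a real phone.
    return {"taps": [], "files": set(), "alarms": []}


def run_gui(device: dict, action: Action) -> None:
    # A GUI action is recorded as a tap on a named element.
    device["taps"].append(action.argument)


def run_cli(device: dict, action: Action) -> None:
    # Only a toy "touch" command is modeled here.
    if action.command == "touch":
        device["files"].add(action.argument)


def run_tool(device: dict, action: Action) -> None:
    # Only a toy "set_alarm" tool is modeled here.
    if action.command == "set_alarm":
        device["alarms"].append(action.argument)


HANDLERS: Dict[str, Callable[[dict, Action], None]] = {
    "gui": run_gui,
    "cli": run_cli,
    "tool": run_tool,
}


def route(device: dict, action: Action) -> None:
    """Dispatch an action to the handler for its surface."""
    handler = HANDLERS.get(action.surface)
    if handler is None:
        raise ValueError(f"unknown action surface: {action.surface!r}")
    handler(device, action)


def run_workflow(device: dict, actions: List[Action],
                 verify: Callable[[dict], bool]) -> bool:
    """Run all actions, then judge success from the resulting device state."""
    for action in actions:
        route(device, action)
    return verify(device)


device = make_device()
actions = [
    Action("gui", "tap", "Clock"),
    Action("tool", "set_alarm", "07:30"),
    Action("cli", "touch", "/sdcard/notes.txt"),
]
passed = run_workflow(
    device,
    actions,
    verify=lambda d: "07:30" in d["alarms"] and "/sdcard/notes.txt" in d["files"],
)
```

In this sketch the workflow passes only if the final state shows the expected side effects, regardless of which surface produced them. That mirrors the summary's distinction between verifying outcomes and predicting the next screen action.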

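Impact: This new framework enables more robust evaluation of AI agents for mobile automation, pushing the field toward agents that can handle complex, real-world workflows.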

Ranking rationale: This cluster describes a new academic paper introducing a benchmark and execution framework for AI agents.

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.CL TIER_1 English(EN) · Chenxin Li, Zhengyao Fang, Zhengyang Tang, Pengyuan Lyu, Xingran Zhou, Xin Lai, Fei Tang, Liang Wu, Yiduo Guo, Weinong Wang, Junyi Li, Yi Zhang, Yang Ding, Huawen Shen, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Chengquan Zhang, Han … ·

    PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

    arXiv:2606.14832v1 Announce Type: new Abstract: Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers tha…

  2. Hugging Face Daily Papers TIER_1 English(EN) ·

    PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

    PhoneHarness presents a mixed-action benchmark and execution framework for evaluating phone-use agents on verifiable mobile workflows, demonstrating superior performance over existing approaches through deterministic action routing and auditable execution traces.
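One way to picture an "auditable execution trace" is an append-only log in which each entry commits to the one before it. The sketch below uses hash chaining, a technique chosen here purely for illustration. The source does not say how PhoneHarness records or audits its traces, and the function names and entry fields are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def _digest(body: Dict) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(trace: List[Dict], surface: str, command: str, result: str) -> None:
    """Append a step whose digest covers its content and the previous digest."""
    prev = trace[-1]["digest"] if trace else ""
    body = {
        "step": len(trace),
        "surface": surface,
        "command": command,
        "result": result,
        "prev": prev,
    }
    trace.append({**body, "digest": _digest(body)})


def verify_trace(trace: List[Dict]) -> bool:
    """Return True only if every entry is in order, linked, and unmodified."""
    prev = ""
    for index, entry in enumerate(trace):
        body = {k: v for k, v in entry.items() if k != "digest"}
        if body["step"] != index or body["prev"] != prev:
            return False
        if _digest(body) != entry["digest"]:
            return False
        prev = entry["digest"]
    return True


trace: List[Dict] = []
append_entry(trace, "gui", "tap Clock", "ok")
append_entry(trace, "tool", "set_alarm 07:30", "ok")
intact = verify_trace(trace)

# Editing a recorded result after the fact breaks verification.
trace[0]["result"] = "failed"
tampered_ok = verify_trace(trace)
```

Under these assumptions, any later edit to a recorded step changes its digest and is detected on replay, which is one plausible reading of "auditable".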