A new benchmark called PhoneSafety has been developed to better evaluate the safety of AI agents designed for phone use. Existing evaluations often fail to distinguish between an agent's deliberate safe action and its inability to act due to capability limitations. PhoneSafety comprises 700 safety-critical moments from real phone interactions, distinguishing between safe actions, unsafe actions, and failures to act. The research indicates that stronger general phone-use ability does not necessarily correlate with safer choices in risky situations, and that failures to act appear to be more a capability issue than a safety one.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a new evaluation framework to more accurately assess the safety of phone-based AI agents, which is crucial for user trust and responsible deployment.
RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating AI safety.