Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents
Researchers have developed a model called Teach VLM that extracts operational knowledge from mobile screen demonstrations. The model analyzes keyframes from demonstration videos to identify the actions taken, the UI elements involved, and the order of execution, then converts the visual state transitions into natural language descriptions. To overcome data scarcity, the researchers built a systematic data flywheel for scalable data acquisition and introduced a Chinese Mobile Screen Teach Benchmark for evaluation. The Teach-and-Repeat paradigm uses the extracted operational knowledge to guide screen-based execution agents, showing significant improvements in task success rates on Android World.
AI IMPACT: This research could enable more capable GUI agents by improving their ability to understand and replicate user actions on mobile devices.