VISTA: View-Consistent Self-Verified Training for GUI Grounding
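New methods improve VLM accuracy on GUI grounding tasks · 2 papers
By the PulseAugur editorial team · [6 sources]
Two new research papers introduce methods for improving the accuracy and reliability of vision-language models (VLMs) on GUI grounding tasks. The first, "Trust the Right Teacher," proposes a quality-aware self-distillation method that refines the teacher signal, using correctness-aware gating and probability scaling to handle unreliable coordinate-token predictions. The second, "VISTA," proposes a view-consistent self-verified training framework that uses multiple semantically equivalent views of a GUI to stabilize reinforcement learning and improve coordinate generation accuracy, and reports significant gains on Qwen backbones.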
AI
arXiv:2606.18101v1 (announce type: new). Abstract: Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-se…
Quality-aware self-distillation improves vision-language model performance for GUI grounding by enhancing coordinate-token teacher signals through correctness-aware gating and probability scaling.
arXiv:2606.14579v1 (announce type: new). Abstract: When applying Group Relative Policy Optimization (GRPO) for GUI Grounding, rollouts are sampled from a single screenshot view; groups often become either all failures on difficult instances or all successes on easy ones, yielding no useful relative advantage. We propose VISTA (Vi…
VISTA is a GRPO-based training framework for GUI grounding that uses multiple consistent views of the same GUI instance to improve training stability and accuracy.
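The failure mode described in the VISTA abstract can be illustrated with the standard group-relative normalization used in GRPO, where each rollout's reward is centered on the group mean and divided by the group standard deviation. The sketch below is illustrative only and is not taken from either paper: the function name, the binary hit/miss rewards, and the `eps` stabilizer are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return (r - group mean) / (group std + eps) for each rollout reward.

    Hypothetical helper: `eps` is an assumed stabilizer that avoids
    division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed binary rewards: 1.0 if the predicted click lands on the target, else 0.0.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])  # hard instance
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # easy instance
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])     # informative group

print(all_fail)  # every advantage is zero, so there is no learning signal
print(all_pass)  # every advantage is zero here as well
print(mixed)     # successes get positive advantages, failures negative
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient. Only a group with mixed outcomes separates good rollouts from bad ones, which is the situation the abstract says single-view sampling often fails to produce.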