PulseAugur

New methods improve the efficiency and geometric reasoning of vision Transformers

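Researchers have proposed two new methods for vision Transformers that work with geometry. The first, "Good Token Hunting", uses a two-stage framework that selects key tokens to cut the computational cost of visual geometry Transformers, reporting a speedup of more than 85% on scenes containing 500 images. The second, "GeoWeaver", grounds visual tokens with geometric evidence before scene reasoning, strengthening spatial reasoning by adaptively assigning geometric abstractions to individual tokens.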
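To illustrate why selecting tokens lowers cost, here is a minimal sketch based only on the fact quoted in the sources below: global attention cost grows quadratically with the input sequence length. It is not the Good Token Hunting algorithm. The function names, the per-token `scores`, and the `keep_ratio` parameter are all hypothetical, since the summary does not describe the paper's actual selection criterion.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in global self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def select_top_tokens(scores: list[float], keep_ratio: float) -> list[int]:
    """Return the indices of the highest-scoring tokens, in their original order.

    Hypothetical criterion: `scores` stands in for whatever importance
    measure a real method would compute; it is an assumption of this sketch.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by score, highest first, then keep the top `keep`.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def pair_reduction(num_tokens: int, keep_ratio: float) -> float:
    """Fraction of attention pairs removed when only keep_ratio of tokens remain."""
    kept = max(1, round(num_tokens * keep_ratio))
    return 1.0 - attention_pairs(kept) / attention_pairs(num_tokens)


if __name__ == "__main__":
    print(select_top_tokens([0.1, 0.9, 0.5, 0.7], keep_ratio=0.5))
    print(pair_reduction(1000, keep_ratio=0.5))
```

Because the pair count is quadratic, keeping half of the tokens removes three quarters of the attention pairs in this toy model. The measured speedup of a real system also depends on the non-attention layers and on the cost of the selection step itself.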

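Impact: These methods offer substantial speedups and improved reasoning for visual geometry Transformers, and could accelerate 3D reconstruction and spatial understanding tasks.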

Ranking rationale: Two academic papers detail novel methods for improving vision Transformer architectures.

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 5 sources. How we write summaries →

Sources [5]

  1. arXiv cs.AI TIER_1 English(EN) · Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski ·

    Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

    arXiv:2605.23892v1. Abstract: Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically …

  2. Hugging Face Daily Papers TIER_1 English(EN) ·

    Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

    Visual geometry transformers are accelerated through a two-stage token selection framework that reduces computational costs while maintaining performance.

  3. arXiv cs.CV TIER_1 English(EN) · Igor Gilitschenski ·

    Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

    Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global a…

  4. arXiv cs.CV TIER_1 English(EN) · Deshui Miao, Xingsen Huang, Yameng Gu, Xin Li, Haijun Zhang, Ming-Hsuan Yang ·

    GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

    arXiv:2605.22558v1. Abstract: Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structura…

  5. arXiv cs.CV TIER_1 English(EN) · Ming-Hsuan Yang ·

    GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

    Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stag…