New methods improve the efficiency and geometric reasoning of vision Transformers

By the PulseAugur editorial team · [5 sources] · 2026-05-21 14:40

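Researchers have developed two new methods for vision Transformers that work with scene geometry. The first, "Good Token Hunting", uses a two-stage framework that selects key tokens to reduce computational cost, achieving a speedup of more than 85% on scenes containing 500 images. The second, "GeoWeaver", focuses on grounding visual tokens with geometric evidence before scene reasoning, and strengthens spatial reasoning by adaptively assigning geometric abstractions to individual tokens.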
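To make the cost argument concrete: the Good Token Hunting abstract quoted in the sources below says that cost grows quadratically with input sequence length, so keeping only a fraction of the tokens shrinks the attention workload by the square of that fraction. The sketch below illustrates this with a plain top-k selection over per-token scores. It is not the paper's two-stage framework. The function names, the scores, and the keep ratio are all hypothetical, and the numbers are not meant to reproduce the reported speedup.

```python
from __future__ import annotations


def attention_pair_count(num_tokens: int) -> int:
    """Query-key pairs evaluated by one global self-attention layer."""
    return num_tokens * num_tokens


def select_top_k(scores: list[float], keep_ratio: float) -> list[int]:
    """Return indices of the highest-scoring tokens, in original order.

    The scores stand in for any importance measure; how the paper scores
    tokens is not described in this digest, so this is an assumption.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def pair_reduction(num_tokens: int, keep_ratio: float) -> float:
    """Fraction of query-key pairs avoided when only kept tokens attend."""
    kept = max(1, round(num_tokens * keep_ratio))
    return 1.0 - attention_pair_count(kept) / attention_pair_count(num_tokens)


if __name__ == "__main__":
    toy_scores = [0.1, 0.9, 0.4, 0.7]  # hypothetical importance scores
    print(select_top_k(toy_scores, keep_ratio=0.5))
    print(pair_reduction(num_tokens=1000, keep_ratio=0.5))
```

In this toy setting, halving the token count removes three quarters of the query-key pairs, which is why token selection pays off most on long multi-view sequences.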

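Impact: These methods offer significant speedups and improved reasoning for visual geometry Transformers, and could accelerate 3D reconstruction and spatial understanding tasks.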

Ranking rationale: Two academic papers detail novel methods for improving vision Transformer architectures.

AI-generated summary · Google Gemini · from 5 sources.

Sources [5]

arXiv cs.AI TIER_1 English(EN) · Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski · 2026-05-25 04:00

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

arXiv:2605.23892v1 Announce Type: cross Abstract: Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically …
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-22 00:00

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Visual geometry transformers are accelerated through a two-stage token selection framework that reduces computational costs while maintaining performance.
arXiv cs.CV TIER_1 English(EN) · Igor Gilitschenski · 2026-05-22 17:55

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global a…
arXiv cs.CV TIER_1 English(EN) · Deshui Miao, Xingsen Huang, Yameng Gu, Xin Li, Haijun Zhang, Ming-Hsuan Yang · 2026-05-22 04:00

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

arXiv:2605.22558v1 Announce Type: new Abstract: Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structura…
arXiv cs.CV TIER_1 English(EN) · Ming-Hsuan Yang · 2026-05-21 14:40

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stag…
