Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

New Vision Transformer Cuts Image Captioning Cost Through Clustering

By PulseAugur Editorial Team · [1 source] · 2026-06-16 04:00

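Researchers have developed a new vision Transformer architecture that significantly reduces the computational cost of image captioning. By replacing the standard self-attention mechanism with a clustering method based on Gaussian mixture models, the model groups similar image patches, reducing complexity from quadratic to linear. The method uses the expectation-maximization algorithm and a GPT-based decoder, and achieves competitive results on the Flickr 30K dataset.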
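To make the clustering idea concrete, here is a minimal NumPy sketch of soft-clustering patch embeddings with a Gaussian mixture fitted by expectation-maximization. It is an illustration under stated assumptions, not the paper's implementation: the function name `gmm_cluster_tokens`, the diagonal covariances, the random initialization, and the use of responsibility-weighted means as cluster tokens are all hypothetical choices made here. In this sketch each EM iteration costs on the order of N·K·D operations for N patches, K clusters, and D dimensions, so for a fixed K the cost grows linearly with the number of patches rather than quadratically as in pairwise self-attention.

```python
import numpy as np


def gmm_cluster_tokens(patches, num_clusters=4, num_iters=10, seed=0):
    """Soft-cluster patch embeddings with a diagonal-covariance GMM (EM).

    Hypothetical helper for illustration only.
    patches: array of shape (N, D).
    Returns (tokens, resp): tokens has shape (K, D), resp has shape (N, K).
    """
    rng = np.random.default_rng(seed)
    n, d = patches.shape
    k = num_clusters

    # Assumed initialization: means drawn from the patches, unit variances,
    # uniform mixture weights.
    means = patches[rng.choice(n, size=k, replace=False)].copy()
    variances = np.ones((k, d))
    weights = np.full(k, 1.0 / k)

    for _ in range(num_iters):
        # E-step: responsibilities from per-component log-densities, (N, K).
        diff = patches[:, None, :] - means[None, :, :]  # (N, K, D)
        log_prob = -0.5 * np.sum(
            diff**2 / variances + np.log(2.0 * np.pi * variances), axis=2
        )
        log_resp = log_prob + np.log(weights)
        log_resp -= log_resp.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-8  # effective cluster sizes, (K,)
        means = (resp.T @ patches) / nk[:, None]
        diff = patches[:, None, :] - means[None, :, :]
        variances = np.einsum("nk,nkd->kd", resp, diff**2) / nk[:, None] + 1e-6
        weights = nk / n

    # The cluster means are responsibility-weighted averages of the patches,
    # used here as K summary tokens.
    return means, resp
```

For example, calling `gmm_cluster_tokens(patches, num_clusters=8)` on a `(196, 32)` array of patch embeddings returns eight 32-dimensional tokens plus a `(196, 8)` responsibility matrix whose rows sum to one. How the actual model feeds such groupings to its GPT-based decoder is not described in the summary above.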

Impact: Lowers the computational overhead of image captioning models, which may enable faster, more efficient applications.

Ranking rationale: Academic paper on a new method for image captioning.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · English (EN) · Chiradeep Ghosh, Dakshina Ranjan Kisku · 2026-06-16 04:00

Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

arXiv:2606.14753v1 Announce Type: cross Abstract: Image captioning is a challenging and significant task that aims to generate coherent and semantically meaningful textual descriptions for given images. To accomplish this task, it requires a deep understanding of visual content a…