PulseAugur

New Methods Sharply Cut VLM Visual Tokens, Improving Efficiency

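Researchers have developed three new methods that substantially compress the visual tokens used by large vision-language models (VLMs), aiming to reduce computational overhead and speed up inference. InfoMerge uses temporal fingerprint differences and content-aware allocation, ETC applies task-aware visual information distillation, and EvoCut analyzes how tokens evolve across multiple layers. The methods achieve large reductions in token count, and some deliver notable speedups while retaining more than 98% of the original performance.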
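To make the general idea of visual token compression concrete, here is a minimal sketch of one generic, training-free approach: greedily merging consecutive tokens that are nearly identical. It is not the algorithm of InfoMerge, ETC, or EvoCut, whose details are not given in this summary. The function names, the cosine-similarity criterion, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_similar_tokens(tokens, threshold=0.9):
    """Greedily merge runs of consecutive, similar token vectors.

    Hypothetical illustration only: a token joins the current group when its
    cosine similarity to the group's running mean is at least `threshold`;
    otherwise the group is emitted as its mean and a new group starts.
    """
    merged = []
    group_sum = None  # element-wise sum of the tokens in the current group
    count = 0         # number of tokens in the current group
    for tok in tokens:
        if count and cosine([s / count for s in group_sum], tok) >= threshold:
            group_sum = [s + t for s, t in zip(group_sum, tok)]
            count += 1
        else:
            if count:
                merged.append([s / count for s in group_sum])
            group_sum = list(tok)
            count = 1
    if count:
        merged.append([s / count for s in group_sum])
    return merged


if __name__ == "__main__":
    # Five toy 2-D "visual tokens"; the first two and the middle two are near-duplicates.
    toy = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99], [1.0, 0.0]]
    compressed = merge_similar_tokens(toy)
    print(f"{len(toy)} tokens -> {len(compressed)} tokens")
```

Because self-attention cost grows with the number of tokens, passing fewer merged tokens to the language model is what yields the speedups this kind of method targets. Real systems would operate on high-dimensional embeddings and use more careful selection criteria than this toy threshold.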

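Impact: These techniques offer substantial efficiency gains for VLMs and could speed the deployment of AI applications that involve visual understanding while lowering operating costs.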

Ranking rationale: Three separate research papers propose novel methods for optimizing large vision-language models.


AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. arXiv cs.CL TIER_1 English(EN) · Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin, Lei Xie, Sanglu Lu

    InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models

    arXiv:2606.02161v1 Announce Type: cross Abstract: Video Large Language Models (Video-LLMs) achieve strong performance in video understanding, but their excessive visual tokens bring substantial computational overhead. Existing training-free compression methods improve inference e…

  2. arXiv cs.CL TIER_1 English(EN) · Sanglu Lu

    InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models

    Video Large Language Models (Video-LLMs) achieve strong performance in video understanding, but their excessive visual tokens bring substantial computational overhead. Existing training-free compression methods improve inference efficiency by reducing visual tokens, yet they ofte…

  3. arXiv cs.CV TIER_1 English(EN) · Yiling Gao, Hongchen Wei, Zhenzhong Chen

    ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs

    arXiv:2606.00543v1 Announce Type: new Abstract: In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, resulting in high computational costs and KV-cache overhead during inference. To address this problem, we propose an Extreme Token Com…

  4. arXiv cs.CV TIER_1 English(EN) · Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Pengfei Zhang, Yao Hu, Jiawei Li, Shikai Jiang

    EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models

    arXiv:2606.01756v1 Announce Type: new Abstract: Large vision-language models (LVLMs) achieve strong performance on image and video understanding tasks, but their inference efficiency is constrained by the large number of visual tokens produced by vision encoders. Most existing vi…