PulseAugur

VLMs in production: Fixed-patch ViTs still dominant?

A thread on Reddit's r/MachineLearning asks whether production Vision-Language Models (VLMs) still rely on fixed-patch Vision Transformers (ViTs) for visual processing. The original poster wonders whether major VLM developers have adopted more efficient, input-adaptive tokenization methods, and speculates on why fixed patches might persist: marginal gains from the alternatives, pipeline efficiencies, or underdeveloped scaling laws for dynamic patching.
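To make the distinction concrete, here is a minimal pure-Python sketch contrasting the two ideas. `extract_patches` shows fixed-patch tokenization, where the token count depends only on image size and patch size. `adaptive_patch_count` is a hypothetical quadtree-style stand-in for "input-adaptive" tokenization; the thread does not name a specific method, so the splitting rule, function names, and parameters here are all assumptions made for illustration.

```python
def extract_patches(image, patch_size):
    """Fixed-patch tokenization: cut a 2D grid (list of rows) into
    non-overlapping patch_size x patch_size patches, in row-major order.
    Each patch is returned flattened; the count ignores image content."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


def adaptive_patch_count(image, top, left, size, min_size, threshold):
    """Toy input-adaptive tokenization (an illustrative assumption, not a
    method from the thread): recursively split a square region into four
    quadrants while its pixel range exceeds `threshold`, stopping at
    `min_size`. Returns the number of resulting patches (tokens).
    Assumes `size` is `min_size` times a power of two."""
    values = [
        image[r][c]
        for r in range(top, top + size)
        for c in range(left, left + size)
    ]
    if size <= min_size or max(values) - min(values) <= threshold:
        return 1
    half = size // 2
    return sum(
        adaptive_patch_count(image, top + dt, left + dl, half, min_size, threshold)
        for dt in (0, half)
        for dl in (0, half)
    )


# A flat 8x8 image and one with a single bright pixel in the corner.
flat = [[0] * 8 for _ in range(8)]
spot = [row[:] for row in flat]
spot[0][0] = 255

fixed_flat = len(extract_patches(flat, 2))
fixed_spot = len(extract_patches(spot, 2))
adaptive_flat = adaptive_patch_count(flat, 0, 0, 8, 2, 0)
adaptive_spot = adaptive_patch_count(spot, 0, 0, 8, 2, 0)
```

In this toy setup the fixed grid yields the same number of tokens for both images, while the adaptive scheme spends fewer tokens on the uniform image and concentrates them near the bright pixel. Whether such savings hold up at production scale is exactly what the thread leaves open.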

IMPACT The thread draws attention to how production VLMs tokenize images, a design choice that could shape future development and how their capabilities are understood.

RANK_REASON This is a discussion thread on Reddit about a technical aspect of VLMs, not a primary source announcement or research paper.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/MachineLearning TIER_1 · /u/howtorewriteaname ·

    Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]

    The research community has provided (already for some time) seemingly more efficient and effective tokenizations for vision. Do we have any hint on whether non-fixed-patches tokenization is being applied on the big player models?

    I imagine…