← heapsort
RESEARCH↑ trending42

Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]

Reddit r/MachineLearningΒ·May 21, 2026

This discussion questions whether production Vision-Language Models (VLMs) still rely on fixed-patch Vision Transformers (ViTs) for their vision capabilities, despite the existence of more efficient tokenization methods. It explores potential reasons for this, such as marginal gains, pipeline limitations, or unclear scaling laws for adaptive patching.

Read original β†—