RESEARCH · arXiv CS.CL · 4/16/2026
Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling
This paper argues that the primary bottleneck in scaling multimodal large language models (MLLMs) is the knowledge density of the training data, not the task format. It demonstrates that task-specific supervision such as VQA adds little semantic information beyond what image captions already provide, and that increasing knowledge density leads to consistent performance improvements.