RESEARCH
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
arXiv CS.LG · April 17, 2026
MixAtlas introduces an uncertainty-aware method for optimizing data mixtures in multimodal LLM midtraining by decomposing corpora along image concepts and task supervision. Using proxy models and a Gaussian-process surrogate, it finds better-performing data recipes for improved sample efficiency and generalization.
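To make the surrogate idea concrete, here is a minimal sketch of uncertainty-aware mixture search in plain Python. It is not the paper's implementation. The three-source mixture, the RBF kernel, the upper-confidence-bound (UCB) acquisition rule, and the `proxy_score` function (a toy stand-in for training and evaluating a small proxy model) are all assumptions made for illustration.

```python
import math
import random

def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two mixture vectors (assumed kernel)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * length ** 2))

def cholesky(A):
    """Lower-triangular L with L @ L.T == A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Back substitution: solve L.T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def fit(X, y, noise=1e-3):
    """Fit a GP to observed (mixture, score) pairs; scores are mean-centered."""
    y_mean = sum(y) / len(y)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, [v - y_mean for v in y]))
    return X, L, alpha, y_mean

def predict(model, x_new):
    """Posterior mean and standard deviation at a candidate mixture."""
    X, L, alpha, y_mean = model
    k_star = [rbf(a, x_new) for a in X]
    mean = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new) - sum(t * t for t in v), 0.0)
    return mean, math.sqrt(var)

def sample_mixture(rng, k):
    """Uniform draw from the k-simplex via normalized exponential variates."""
    w = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]

def proxy_score(w):
    """Hypothetical stand-in for 'train a proxy model on mixture w, then evaluate'."""
    target = [0.5, 0.3, 0.2]  # made-up optimum, for illustration only
    return -sum((a - b) ** 2 for a, b in zip(w, target))

def optimize_mixture(rounds=15, n_init=5, n_candidates=200, beta=2.0, seed=0):
    """Repeatedly pick the candidate with the highest UCB and evaluate it."""
    rng = random.Random(seed)
    X = [sample_mixture(rng, 3) for _ in range(n_init)]
    y = [proxy_score(w) for w in X]
    for _ in range(rounds):
        model = fit(X, y)
        candidates = [sample_mixture(rng, 3) for _ in range(n_candidates)]

        def ucb(w):
            mean, std = predict(model, w)
            return mean + beta * std  # favor high predicted score or high uncertainty

        best = max(candidates, key=ucb)
        X.append(best)
        y.append(proxy_score(best))
    i = max(range(len(y)), key=y.__getitem__)
    return X[i], y[i]

if __name__ == "__main__":
    weights, score = optimize_mixture()
    print("best mixture:", [round(w, 3) for w in weights], "score:", round(score, 4))
```

The `beta` parameter controls how strongly the search favors mixtures the surrogate is unsure about. Setting it to zero reduces the rule to greedy selection on the predicted mean.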