MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
MixAtlas is an uncertainty-aware method for optimizing data mixtures in multimodal LLM midtraining. It decomposes the training corpus along two axes, image concepts and task supervision, and then uses proxy models together with a Gaussian-process surrogate to search for data recipes that improve sample efficiency and generalization.
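To make the surrogate idea concrete, here is a minimal sketch of Gaussian-process-guided mixture search. It is not the MixAtlas implementation. The three-way mixture, the `proxy_score` stand-in for training and evaluating a proxy model, the RBF kernel and its length scale, and the upper-confidence-bound selection rule are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    # Squared-exponential kernel between the rows of a and the rows of b.
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-3):
    # Plain GP regression with a constant mean equal to the training average.
    mean0 = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train - mean0))
    mean = mean0 + k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    var = 1.0 - (v**2).sum(axis=0)  # prior variance k(x, x) is 1 for this kernel
    return mean, np.sqrt(np.clip(var, 0.0, None))

def proxy_score(weights):
    # HYPOTHETICAL stand-in for "train a small proxy model on this mixture
    # and evaluate it". A real pipeline would replace this function.
    target = np.array([0.5, 0.3, 0.2])
    return 1.0 - ((weights - target) ** 2).sum()

def suggest_mixture(x_train, y_train, candidates, beta=2.0):
    # Upper confidence bound: favor mixtures with a high predicted score
    # or high predictive uncertainty (assumed acquisition rule).
    mean, std = gp_posterior(x_train, y_train, candidates)
    return candidates[np.argmax(mean + beta * std)], mean, std

rng = np.random.default_rng(0)
# Each row is a mixture over three hypothetical data domains (sums to 1).
x_train = rng.dirichlet(np.ones(3), size=8)
y_train = np.array([proxy_score(w) for w in x_train])
candidates = rng.dirichlet(np.ones(3), size=500)

next_mix, mean, std = suggest_mixture(x_train, y_train, candidates)
print("next mixture to evaluate:", np.round(next_mix, 3))
```

In a loop, the suggested mixture would be scored with a proxy run, appended to the training set, and the surrogate refit. The predictive standard deviation is what makes the search uncertainty-aware: regions of mixture space with few proxy runs keep a wide uncertainty band and so remain candidates for exploration.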
