Data Mixing for Large Language Models Pretraining: A Survey and Outlook
arXiv cs.CL · April 21, 2026
This paper surveys data mixing for Large Language Model (LLM) pretraining, a key factor in training efficiency and downstream generalization. It formalizes data mixture optimization as a bilevel problem and introduces a fine-grained taxonomy of existing methods.