RESEARCH27

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

arXiv CS.CL·April 21, 2026

This paper provides a comprehensive survey on data mixing for Large Language Model (LLM) pretraining, a crucial factor for training efficiency and downstream generalization. It formalizes data mixture optimization as a bilevel problem and introduces a fine-grained taxonomy for existing methods.

data optimization pretraining machine learning large language models AI Research

Read original ↗