RESEARCH
Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
arXiv cs.CL · May 27, 2026
This paper offers the first unified survey of Pretraining Data Exposure (PDE) in Large Language Models (LLMs), covering both data contamination and membership inference. It formalizes PDE, reviews attack and defense methods, and highlights open challenges in ensuring evaluation integrity and protecting privacy.