RESEARCH · arXiv CS.CL · 13d ago
Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
This paper offers the first unified survey of Pretraining Data Exposure (PDE) in Large Language Models (LLMs), covering both data contamination and membership inference. It formalizes PDE, reviews attack and defense methods, and highlights open challenges in ensuring evaluation integrity and protecting privacy.