RESEARCH29

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

arXiv CS.CL·May 27, 2026

This paper offers the first unified survey of Pretraining Data Exposure (PDE) in Large Language Models (LLMs), covering data contamination and membership inference. It formalizes PDE, reviews attack and defense methods, and highlights open challenges to ensure evaluation integrity and protect privacy.

LLMs membership inference data privacy security AI ethics

Read original ↗