RESEARCH · arXiv cs.LG · 20d ago
Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry
Geometry-Lite is a prompt-level probe for interpreting how safety evidence develops across the layers of a large language model. It applies several readouts to the layer-wise margin geometry to characterize how the safety decision boundary forms, and it improves safety detection over single-layer probes.
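To make the idea of layer-wise margins concrete, here is a minimal sketch in Python. It is not the paper's implementation: the summary does not say which probes or readouts Geometry-Lite uses, so the mean-difference probe, the four readouts (`final`, `mean`, `max`, `slope`), the function names, and the synthetic hidden states are all assumptions chosen for illustration.

```python
import numpy as np


def fit_mean_difference_probe(H, y):
    """Fit one layer's linear probe (assumed form, not from the paper).

    H: (n_prompts, d) hidden states, y: (n_prompts,) labels, 1 = unsafe.
    Returns a unit normal w and offset b; the boundary sits midway
    between the two class means.
    """
    mu_pos = H[y == 1].mean(axis=0)
    mu_neg = H[y == 0].mean(axis=0)
    w = mu_pos - mu_neg
    w = w / np.linalg.norm(w)
    b = -w @ (mu_pos + mu_neg) / 2.0
    return w, b


def layerwise_margins(hidden, probes):
    """Signed distance of every prompt to every layer's boundary.

    hidden: (n_prompts, n_layers, d); probes: list of (w, b) per layer.
    Returns an array of shape (n_prompts, n_layers).
    """
    cols = [hidden[:, l, :] @ w + b for l, (w, b) in enumerate(probes)]
    return np.stack(cols, axis=1)


def margin_readouts(margins):
    """Summarize each prompt's margin trajectory (hypothetical readouts)."""
    layers = np.arange(margins.shape[1], dtype=float)
    centered = layers - layers.mean()
    # Least-squares slope of margin against layer index.
    slope = margins @ centered / (centered @ centered)
    return {
        "final": margins[:, -1],       # what a last-layer probe alone sees
        "mean": margins.mean(axis=1),
        "max": margins.max(axis=1),
        "slope": slope,
    }


# Synthetic stand-in for real activations: the class signal grows with depth.
rng = np.random.default_rng(0)
n, n_layers, d = 200, 6, 16
y = np.arange(n) % 2                          # balanced labels, 1 = unsafe
direction = np.zeros(d)
direction[0] = 1.0
strength = np.linspace(0.0, 2.0, n_layers)    # no signal at layer 0
hidden = rng.normal(size=(n, n_layers, d))
hidden += strength[None, :, None] * (2 * y - 1)[:, None, None] * direction

train, test = slice(0, 100), slice(100, 200)
probes = [fit_mean_difference_probe(hidden[train, l, :], y[train])
          for l in range(n_layers)]
margins = layerwise_margins(hidden[test], probes)
readouts = margin_readouts(margins)

for name, value in readouts.items():
    accuracy = ((value > 0) == (y[test] == 1)).mean()
    print(f"{name:>5}: accuracy {accuracy:.2f}")
```

The `final` readout corresponds to a single-layer probe, while the other three use the whole trajectory. On real activations, `hidden` would come from the model's per-layer hidden states for each prompt rather than from a random generator.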