RESEARCH

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

arXiv cs.LG · May 15, 2026

This paper explores the mechanistic interpretability of EEG foundation models by applying TopK Sparse Autoencoders (SAEs) to extract sparse feature dictionaries from their embeddings. It benchmarks monosemanticity and entanglement across different EEG transformer architectures, grounds these features in a clinical taxonomy, and introduces concept steering to quantify selectivity and expose representational failures.
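The summary does not spell out the SAE architecture. As a rough illustration, a generic TopK SAE encodes an embedding x into n dictionary features, keeps only the k largest activations, and reconstructs x as a sparse sum of decoder directions. This pure-Python sketch is written under assumptions: the helper names, the weight shapes, and the convention of subtracting the decoder bias before encoding are hypothetical and are not taken from the paper's implementation.

```python
import random

def topk_sae_encode(x, W_enc, b_enc, b_dec, k):
    """Sparse code z for one embedding x (hypothetical helper).

    W_enc has one row per dictionary feature. Subtracting b_dec before
    encoding is a common TopK SAE convention, assumed here.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [sum(w * c for w, c in zip(row, centered)) + b
           for row, b in zip(W_enc, b_enc)]
    # Keep only the k largest pre-activations; every other feature is exactly zero.
    top = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    z = [0.0] * len(pre)
    for i in top:
        z[i] = max(pre[i], 0.0)  # ReLU applied to the surviving features
    return z

def topk_sae_decode(z, W_dec, b_dec):
    """Reconstruct x as b_dec plus a sparse sum of decoder rows (dictionary atoms)."""
    x_hat = list(b_dec)
    for zi, atom in zip(z, W_dec):
        if zi != 0.0:
            for j, a in enumerate(atom):
                x_hat[j] += zi * a
    return x_hat

# Toy usage with random, untrained weights (d_model = 8, 32 features, k = 4).
random.seed(0)
d_model, n_features, k = 8, 32, 4
W_enc = [[random.gauss(0, 0.3) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [row[:] for row in W_enc]  # tied initialization, only for this toy example
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model
x = [random.gauss(0, 1) for _ in range(d_model)]
z = topk_sae_encode(x, W_enc, b_enc, b_dec, k)
x_hat = topk_sae_decode(z, W_dec, b_dec)
print("active features:", [i for i, v in enumerate(z) if v > 0])
```

Fixing k sets the number of active features per input directly, rather than tuning an L1 penalty. In a trained SAE, each decoder row is a candidate feature direction whose activating inputs can then be inspected for interpretability.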

Clinical AI · AI interpretability · Foundation Models · Sparse autoencoders · EEG
