Sparse autoencoders

3 items

RESEARCH · arXiv CS.LG · 5/8/2026

Structural Instability of Feature Composition

This paper presents a geometric framework to analyze the instability of feature unions in Sparse Autoencoders (SAEs), particularly concerning compositional steering. It derives an asymptotic compositional-collapse threshold under a spherical dictionary model.

Feature Composition · Transformer architectures · Sparse autoencoders · AI Research

RESEARCH · arXiv CS.LG · 25d ago

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

This paper explores the mechanistic interpretability of EEG foundation models by applying TopK Sparse Autoencoders (SAEs) to extract sparse feature dictionaries from their embeddings. It benchmarks monosemanticity and entanglement across different EEG transformer architectures, grounds these features in a clinical taxonomy, and introduces concept steering to quantify selectivity and expose representational failures.

Clinical AI · AI interpretability · Foundation Models · Sparse autoencoders

RESEARCH · arXiv CS.CL · 4/7/2026

LangFIR: Discovering Sparse Language-Specific Features from Monolingual Data for Language Steering

This paper introduces LangFIR, a method that discovers sparse, language-specific features from monolingual data to steer the output of LLMs. It overcomes the limitation of existing approaches, which require costly multilingual data, by using sparse autoencoders and random token sequences.

model interpretability · Multilingual Models · LLMs · Monolingual Data