AI interpretability

4 items

RESEARCH · arXiv CS.AI · 4/23/2026

From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents

This paper presents a conformal interpretability framework for understanding how temporal concepts evolve inside LLM agents. It uses step-wise reward modeling and conformal prediction to statistically label internal representations and identify latent directions linked to success, failure, or reasoning drift.

LLM Agents · AI interpretability · Conformal Prediction

RESEARCH · arXiv CS.LG · 25d ago

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

This paper explores the mechanistic interpretability of EEG foundation models by applying TopK Sparse Autoencoders (SAEs) to extract sparse feature dictionaries from their embeddings. It benchmarks monosemanticity and entanglement across different EEG transformer architectures, grounds these features in a clinical taxonomy, and introduces concept steering to quantify selectivity and expose representational failures.

Clinical AI · AI interpretability · Foundation Models · Sparse autoencoders

RESEARCH · arXiv CS.LG · 14d ago

Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

This research introduces Verifiable Transformers, a framework for converting task-localized Transformer circuits into bounded, solver-checkable claims. It enables verification of properties like functional equivalence and robustness through direct or surrogate-mediated SMT encoding.

AI interpretability · Formal verification · Transformers

RESEARCH · arXiv CS.CL · 28d ago

How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits

This paper measures the consistency and specificity of language model circuits using edge attribution patching across multiple tasks and models. It finds that circuits are heavily reused within a task and that this reuse is necessary for performance, but that circuits also overlap significantly across tasks, indicating that they are not task-specific.

language models · Mechanistic Interpretability · AI interpretability · model circuits