interpretability

17 items

RESEARCH·arXiv CS.CL·20h ago

ABLE: Representing and Mapping LLMs via Attribution-Based Large-model Embedding

ABLE (Attribution-Based Large-model Embedding) is a framework for representing large language models as embeddings in an interpretability space. To support systematic model comparison, it aggregates gradient-based feature attributions to capture each model's input-sensitivity patterns.

LLMs model representation security model comparison

ARTICLE·DEV.to AI·4/18/2026

Mastering AI UX: How to Animate Confidence Scores and Probability Distributions with Swift 6

This article explores how animating AI confidence scores and probability distributions with Swift 6 can transform "black box" models into transparent systems. This approach enhances user trust, provides real-time feedback, and aids in debugging by visualizing the AI's "thought process."

swiftui interpretability AI UX

RESEARCH·arXiv CS.LG·19d ago

Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

Geometry-Lite is a novel prompt-level probe designed to interpret how safety evidence develops across layers in large language models. It analyzes layer-wise margin geometry using various readouts to understand boundary formation, improving safety detection over single-layer probes.

deep learning Probing interpretability large language models

RESEARCH·arXiv CS.CL·4d ago

From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment

This research proposes a framework for sentence-level interpretability in rubric-based scoring, combining Shapley-value attributions with rationales from large language models (LLMs). It compares fine-tuned pre-trained language models (PLMs) with prompted LLMs for teaching quality assessment and finds that the PLMs achieve better prediction accuracy despite label compression.

LLMs Automated Scoring Shapley Values interpretability

ARTICLE·DEV.to AI·4/8/2026

Announcing the OpenAI Safety Fellowship

The OpenAI Safety Fellowship is a research program focused on AI safety, addressing critical aspects such as robustness, interpretability, and alignment with human values. The text details its objectives and technical components, such as adversarial training and explainability techniques.

robustness OpenAI interpretability alignment

RESEARCH·arXiv CS.AI·4/20/2026

LLM Reasoning Is Latent, Not the Chain of Thought

This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than faithful surface chain-of-thought (CoT). It formalizes three competing hypotheses regarding the primary object of reasoning, impacting claims about faithfulness, interpretability, and benchmarks.

Chain-of-Thought interpretability AI Reasoning large language models

RESEARCH·arXiv CS.LG·26d ago

OceanCBM: A Concept Bottleneck Model for Mechanistic Interpretability in Ocean Forecasting

OceanCBM is the first concept bottleneck model (CBM) for spatiotemporal prediction and mechanistic interrogation of ocean dynamics. It predicts mixed layer heat content, a precursor to marine heatwaves, using mixed supervision and prescribed geophysical fluid dynamics concepts to ensure fidelity to ground-truth physics.

forecasting AI models Oceanography machine learning

DOC·DEV.to AI·4/21/2026

Mastering AI UI: Building a Reusable, Animated Confidence Bar with Swift 6 and SwiftUI

This guide explains the critical role of a confidence bar in AI applications for building user trust and enhancing transparency about model predictions. It details how to build a reusable, animated confidence bar using Swift 6 and SwiftUI.

swiftui user experience AI UI interpretability

RESEARCH·Anthropic (YouTube)·5/7/2026

Translating Claude’s thoughts into language

This video explores how the internal processes, or "thoughts," of an AI model such as Claude can be translated into understandable language, with the aim of better understanding the model's reasoning.

cognitive AI Natural Language Processing interpretability AI

NEWS·MIT Tech Review AI·4/30/2026

This startup’s new mechanistic interpretability tool lets you debug LLMs

The startup Goodfire has released Silico, a new mechanistic interpretability tool that allows researchers to debug and adjust LLM parameters during training. This provides model makers with more fine-grained control over AI development.

LLMs interpretability AI tools Debugging

RESEARCH·arXiv CS.AI·5/9/2026

Understanding Annotator Safety Policy with Interpretability

The paper examines the challenge of understanding annotator disagreement over AI safety policies, which can arise from operational failures, policy ambiguity, or value pluralism. It highlights how difficult it is to discern the root cause of a disagreement and how unreliable annotators' self-reported reasoning can be.

policy machine learning Data Annotation interpretability

RESEARCH·arXiv CS.LG·5/4/2026

What Physics do Data-Driven MoCap-to-Radar Models Learn?

This research introduces a physics-based interpretability framework to assess what physics data-driven MoCap-to-radar models learn. It finds that low reconstruction error doesn't guarantee physical consistency, and temporal attention is critical for transformer-based models to learn the underlying physics.

Physics Motion Capture machine learning interpretability

RESEARCH·arXiv CS.LG·17d ago

A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction

This study introduces yvsoucom-iterkit, a deterministic and log-driven automated machine learning framework for interpretable pipeline optimization in healthcare risk prediction. It enables reproducible analysis of pipeline components, revealing that performance is driven by a small subset of interacting elements like augmentation, model choice, and imbalance handling.

Healthcare machine learning interpretability AutoML

RESEARCH·arXiv CS.AI·28d ago

Belief or Circuitry? Causal Evidence for In-Context Graph Learning

This paper investigates how LLMs learn in-context, using a graph random-walk task to explore whether they pattern-match or infer latent structure. It reveals that neither account alone is sufficient, presenting evidence of simultaneous encoding of graph topologies and causal interventions.

LLMs learning interpretability graph learning

RESEARCH·arXiv CS.AI·4/9/2026

SymptomWise: A Deterministic Reasoning Layer for Reliable and Efficient AI Systems

SymptomWise is a framework that improves AI symptom analysis by separating language understanding from diagnostic reasoning to increase reliability and traceability. It uses expert medical knowledge and deterministic inference, employing LLMs only for symptom extraction and explanations, not for the diagnosis itself.

deterministic AI LLM applications interpretability AI reliability

RESEARCH·arXiv CS.LG·4/6/2026

Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning

The paper presents PRISM, a reinforcement learning framework that grounds agent decisions in discrete, causally validated concepts and uses them as an interface for zero-shot transfer. It demonstrates that these concepts directly drive agent behavior and that a concept's importance can be dissociated from its frequency of use.

Strategy Mapping reinforcement learning Transfer Learning interpretability

NEWS·Google DeepMind Blog·12/16/2025

Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior

Gemma Scope 2 has been released, providing open interpretability tools for the Gemma 3 family. It aims to help the AI safety community deepen its understanding of complex language model behavior.

language models Gemma interpretability AI safety