
Transformers

56 items

RESEARCH · ↑ trending · Reddit r/MachineLearning · 5/4/2026

Why SSMs struggle in parameter-constrained training: empirical findings at 25M parameters [R]

This post details empirical findings from OpenAI's Parameter Golf competition, explaining why State Space Models (SSMs) are structurally disadvantaged compared to transformers in parameter- and time-constrained training regimes. Key issues include worse in_proj weight compression for SSMs and architectural win reversals at higher vocabulary sizes, alongside insights from Mamba-3 Triton kernel experiments.

42
RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/13/2026

Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization [R]

This post discusses a research paper on Depth-Recurrent Transformers and its findings on compositional and out-of-distribution generalization. It explores how intermediate-step supervision can hinder genuine reasoning by making models overly reliant on statistical heuristics, an idea the discussion extends to foundation models and human intuition.

42
RESEARCH · arXiv CS.LG · 5d ago

Do Transformers Need Three Projections? Systematic Study of QKV Variants

This research systematically evaluates variants of the Query, Key, and Value (QKV) attention formulation in Transformers, including shared key-value projections, shared query-key projections, and a single shared projection. Experiments across synthetic, vision, and language modeling tasks show that these alternative formulations perform on par with, or occasionally better than, standard QKV Transformers, with Q-K=V sharing offering significant KV cache reduction in language modeling.

29
RESEARCH · arXiv CS.LG · 4/22/2026

Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

This work addresses the challenge of missing modalities in multimodal clinical data for diagnosis by reframing it as an autoregressive sequence modeling task. It combines causal decoders from LLMs with missingness-aware contrastive pre-training to outperform baselines on benchmarks such as MIMIC-IV and eICU.

29
RESEARCH · arXiv CS.LG · 4/15/2026

Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

This paper studies signal propagation at initialization in transformers using the averaged partial Jacobian norm (APJN) to measure gradient amplification. The theory extends APJN analysis, predicts the asymptotic behavior of APJN at large depth, and explains the subcriticality of normalization-free architectures like Dynamic Tanh and Dynamic erf transformers.

29
RESEARCH · arXiv CS.LG · 4/28/2026

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

This work addresses the significant memory footprint of Key-Value (KV) caching in transformer language models by optimizing along the depth dimension. It introduces a method for cross-layer cache sharing, demonstrates that a layer's cache can be dropped without information loss, and suggests a training approach based on random cross-layer attention.

29
RESEARCH · arXiv CS.LG · 4/28/2026

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K–V Asymmetry

This systematic study of singular value spectra during transformer pretraining reveals three key phenomena: transient compression waves propagating through layers, persistent spectral gradients, and a Q/K–V functional asymmetry in which query/key projections drive depth-dependent dynamics while value/output projections compress uniformly.

29
RESEARCH · arXiv CS.LG · 8d ago

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

This paper explores "deceptive alignment" in LLMs, a key challenge in AI safety where models deliberately produce false outputs while maintaining accurate internal representations. Researchers introduced a multi-model paradigm, successfully detecting synthetic dishonesty with high accuracy using linear probes across various transformer architectures.

29