Transformers

56 items

RESEARCH · arXiv CS.LG · 1d ago

WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers

The paper introduces WAV v1, a lightweight multi-resolution residual routing method for decoder-only Transformers. It improves upon standard residual connections by augmenting each block with directional detail bases that contrast attention updates with MLP updates, and early sublayer dynamics with late ones.

Residual Connections neural networks deep learning Model Architecture

RESEARCH · ↑ trending · Reddit r/MachineLearning · 27d ago

Trained transformer-based chess models to play like humans (including thinking time) [P]

A developer trained transformer-based deep learning models to play chess like humans across various rating buckets, including unique thinking time prediction. The models were trained on Lichess data and achieved accuracy comparable to MAIA-3, despite their small size.

AI models deep learning chess AI model training

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/24/2026

Nanochat vs Llama for training from scratch? [P]

The user is training an AI model from scratch and seeks advice on the best architecture. Despite Nanochat's advantages, they are considering a switch from Nanochat, which lacks Transformers compatibility, to the Llama architecture. The goal is an open-source project with a new, larger dataset.

AI architecture open-source AI AI training LLM

RESEARCH · ↑ trending · Reddit r/MachineLearning · 5/4/2026

Why SSMs struggle in parameter-constrained training: empirical findings at 25M parameters [R]

This post details empirical findings from OpenAI's Parameter Golf competition, explaining why State Space Models (SSMs) are structurally disadvantaged compared to transformers in parameter- and time-constrained training regimes. Key issues include worse in_proj weight compression for SSMs and architectural win reversals at higher vocabulary sizes, alongside insights from Mamba-3 Triton kernel experiments.

SSMs AI models Performance optimization Neural network training

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/16/2026

ResBM: a new transformer-based architecture for low-bandwidth pipeline-parallel training, achieving 128× activation compression [R]

Macrocosmos has introduced ResBM, a new transformer-based architecture for low-bandwidth pipeline-parallel training. It achieves 128x activation compression without significant loss in convergence compared to uncompressed baselines.

distributed training machine learning architecture model optimization Transformers

RESEARCH · ↑ trending · Reddit r/MachineLearning · 5/6/2026

Transformers with Selective Access to Early Representations [R]

The paper introduces SATFormer, a new Transformer variant that improves efficiency by allowing heads to selectively re-access early representations instead of uniformly copying them. This context-dependent gating mechanism optimizes the reuse of information, offering a better efficiency-performance trade-off.

AI architecture deep learning efficiency Transformers

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/13/2026

Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization [R]

This content discusses a research paper on Depth-Recurrent Transformers, highlighting its findings on compositional and out-of-distribution generalization. It explores how intermediate step supervision can hinder genuine reasoning in AI models, making them overly reliant on statistical heuristics, a concept extended to foundation models and human intuition.

OOD Generalization Compositional Generalization AI Reasoning Intermediate Supervision

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/25/2026

How Visual-Language-Action (VLA) Models Work [D]

This article provides a technical breakdown of Visual-Language-Action (VLA) models, explaining how they map vision and language inputs into robot actions. It delves into current action-decoding approaches like tokenized autoregressive actions, diffusion-based action heads, and flow-matching policies.

machine learning embodied AI VLA models robotics

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 5/1/2026

By when do you think will TurboQuant get a proper release and be adopted by everyone

The user asks about the release date and widespread adoption of TurboQuant, emphasizing the significant performance gains from using an asymmetric setup for K and V. The discussion points to a technical optimization in AI models.

AI models machine learning Transformers

RESEARCH · arXiv CS.LG · 5d ago

Do Transformers Need Three Projections? Systematic Study of QKV Variants

This research systematically evaluates variants of the Query, Key, and Value (QKV) attention formulation in Transformers, including shared key-value, shared query-key, and single projections. Experiments across synthetic, vision, and language modeling tasks show that these alternative formulations perform on par with, or occasionally better than, standard QKV Transformers, with Q-K=V sharing offering significant KV cache reduction in language modeling.

QKV computer vision attention mechanisms Language modeling
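
To illustrate why key-value sharing shrinks the cache, here is a minimal single-head NumPy sketch. It is not the paper's code: the weight names, the sizes, and the reading of "Q-K=V" as "separate query projection, one projection reused as both key and value" are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    # Single-head scaled dot-product attention with a causal mask.
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8  # hypothetical toy sizes
x = rng.standard_normal((seq_len, d_model))
w_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
w_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
w_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

# Standard attention: keys and values are cached separately.
out_std = causal_attention(x @ w_q, x @ w_k, x @ w_v)
cache_std = 2 * seq_len * d_model  # floats stored per head

# Assumed shared variant: one projection serves as both key and value,
# so only one tensor has to be cached.
kv = x @ w_k
out_shared = causal_attention(x @ w_q, kv, kv)
cache_shared = seq_len * d_model

print(cache_std, cache_shared)
```

In this toy setup the shared variant stores half as many cached floats per head. Whether quality holds up is an empirical question that the sketch does not address.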

RESEARCH · arXiv CS.LG · 4/22/2026

Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

This work addresses the challenge of missing modalities in multimodal clinical data for diagnosis by reframing it as an autoregressive sequence modeling task. It leverages causal decoders from LLMs and a missingness-aware contrastive pre-training to outperform baselines on benchmarks like MIMIC-IV and eICU.

multimodal AI machine learning large language models healthcare AI

RESEARCH · arXiv CS.LG · 4/15/2026

Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

This paper studies signal propagation at initialization in transformers using the averaged partial Jacobian norm (APJN) to measure gradient amplification. The theory extends APJN analysis, predicts the asymptotic behavior of APJN at large depth, and explains the subcriticality of normalization-free architectures like Dynamic Tanh and Dynamic erf transformers.

Normalization-Free Transformers Gradient Amplification Signal Propagation Initialization

RESEARCH · arXiv CS.LG · 4/28/2026

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

This work addresses the significant memory footprint of Key-Value (KV) caching in transformer language models, proposing optimization through the depth dimension. It introduces a method for cross-layer cache sharing, demonstrating that dropping a layer's cache can be efficient without information loss, and suggests a training approach with random cross-layer attention.

deep learning Memory Optimization large language models Transformers

RESEARCH · arXiv CS.LG · 4/28/2026

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

This systematic study of singular value spectra during transformer pretraining reveals three key phenomena: transient compression waves propagating through layers, persistent spectral gradients, and a Q/K–V functional asymmetry in which query/key projections drive depth-dependent dynamics while value/output projections compress uniformly.

neural networks deep learning Model Analysis training dynamics

RESEARCH · arXiv CS.LG · 8d ago

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

This paper explores "deceptive alignment" in LLMs, a key challenge in AI safety where models deliberately produce false outputs while maintaining accurate internal representations. Researchers introduced a multi-model paradigm, successfully detecting synthetic dishonesty with high accuracy using linear probes across various transformer architectures.

LLMs machine learning deception AI safety

RESEARCH · arXiv CS.LG · 5/6/2026

eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization

eOptShrinkQ is a two-stage compression pipeline for KV cache in transformer attention heads. It leverages optimal singular value shrinkage and per-vector scalar quantization, grounded in random matrix theory, to achieve near-lossless compression and improve reconstruction.

quantization Random matrix theory AI compression KV cache
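
The two-stage idea (spectral denoising, then per-vector scalar quantization) can be sketched roughly as follows. This is not eOptShrinkQ itself: hard rank truncation stands in for the paper's optimal singular value shrinkage, the min-max uniform quantizer is an assumed simple choice, and every function name and size here is hypothetical.

```python
import numpy as np

def truncate_svd(m, rank):
    # Stand-in for optimal shrinkage: keep the top `rank` singular values
    # unchanged and zero out the rest.
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    s_kept = np.where(np.arange(s.size) < rank, s, 0.0)
    return (u * s_kept) @ vt

def quantize_per_vector(m, bits=4):
    # Uniform scalar quantization with one (offset, scale) pair per row.
    levels = 2**bits - 1
    lo = m.min(axis=1, keepdims=True)
    hi = m.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((m - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    # Map integer codes back to approximate float values.
    return codes.astype(np.float64) * scale + lo

rng = np.random.default_rng(0)
n_tokens, head_dim, rank = 32, 16, 2  # hypothetical toy sizes
signal = rng.standard_normal((n_tokens, rank)) @ rng.standard_normal((rank, head_dim))
noisy = signal + 0.1 * rng.standard_normal((n_tokens, head_dim))

denoised = truncate_svd(noisy, rank)            # stage 1: spectral denoising
codes, lo, scale = quantize_per_vector(denoised, bits=4)  # stage 2: quantize
restored = dequantize(codes, lo, scale)

print(np.abs(restored - denoised).max())
```

With round-to-nearest quantization, each restored entry differs from the denoised entry by at most half of that row's step size, which is the kind of bounded reconstruction error that such a pipeline relies on.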

ARTICLE · DEV.to AI · 28d ago

Multi-Head Attention: Collaborate Instead of Concatenate

This content explores the multi-head attention mechanism in AI models, focusing on the idea of collaboration instead of concatenation. It likely discusses an alternative approach to improve attention efficiency or performance.

deep learning Attention Mechanism machine learning AI

RESEARCH · arXiv CS.LG · 4/14/2026

The Diffusion-Attention Connection

This research unifies Transformers, diffusion-maps, and magnetic Laplacians, presenting them as different regimes of a single Markov geometry built from pre-softmax query-key scores. It defines a QK "bidivergence" to connect attention and diffusion, and organizes their dynamics using product of experts and Schrödinger-bridges.

Diffusion Models Deep Learning Theory Markov Geometry attention mechanisms

DOC · DEV.to AI · 4/17/2026

Understanding Transformers Part 9: Stacking Self-Attention Layers

This article explains why self-attention values replace original positional encodings, as they integrate contextual information from all words, clarifying relationships. It then introduces stacking multiple self-attention layers, each with unique weights, to capture more complex linguistic relationships within sentences and paragraphs.

neural networks Self-Attention deep learning NLP

ARTICLE · DEV.to AI · 29d ago

How Large Language Models Work — From Transformers to Conversational AI

Large Language Models (LLMs) operate as neural networks that learn patterns in text to generate content by predicting the next token. This powerful functionality is driven by massive data, deep architectures, and Transformer-based attention.

AI Generative AI LLM Transformers