Transformers

56 items

RESEARCH·arXiv CS.LG·4/6/2026

Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models

This work explores model scheduling to speed up Masked Diffusion Language Models (MDLMs) by replacing the full model with a smaller one at certain denoising steps. The research shows that the early and late steps are more robust to this substitution, allowing a reduction of up to 17% in FLOPs with minimal degradation in generative perplexity.

Diffusion Models language models Computational Efficiency denoising
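
As a rough illustration of the scheduling idea summarized above, the sketch below assigns a smaller model to the first and last denoising steps and estimates the relative compute. The function names, the step count, the fraction of steps handed to the small model, and the cost ratio are all hypothetical values chosen for illustration; none of them come from the paper.

```python
def build_schedule(num_steps, small_fraction_each_end):
    """Return a list naming which model runs at each denoising step.

    Hypothetical policy: the small model handles the first and last
    `small_fraction_each_end` share of steps, the full model the rest.
    """
    k = int(num_steps * small_fraction_each_end)
    schedule = []
    for step in range(num_steps):
        if step < k or step >= num_steps - k:
            schedule.append("small")
        else:
            schedule.append("full")
    return schedule


def relative_cost(schedule, small_cost_ratio):
    """Compute cost relative to running the full model at every step."""
    total = sum(small_cost_ratio if m == "small" else 1.0 for m in schedule)
    return total / len(schedule)


# Illustrative numbers only (assumptions, not results from the paper).
schedule = build_schedule(num_steps=10, small_fraction_each_end=0.2)
cost = relative_cost(schedule, small_cost_ratio=0.5)
print(schedule)
print(cost)
```

The actual saving depends on how much cheaper the small model is and how many steps it can take over before quality degrades, which is what the paper measures.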

RESEARCH·arXiv CS.LG·7d ago

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

DAStatFormer is a hybrid multibranch Transformer proposed to overcome the challenges of high dimensionality and complex spatio-temporal patterns in Distributed Acoustic Sensing (DAS). It integrates compact statistical features from multiple domains, significantly reducing data size and enhancing event classification.

deep learning machine learning pattern recognition distributed acoustic sensing

RESEARCH·arXiv CS.LG·4/20/2026

The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

This research reveals that KV caching in autoregressive transformer inference, under standard FP16 precision, causes a systematic divergence in decoded token sequences due to different floating-point accumulation orders. Across LLaMA-2-7B, Mistral-7B, and Gemma-2-2B, a 100% token divergence rate was observed, with cache-ON often leading to higher accuracy.

AI models inference LLMs numerical precision
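
The accumulation-order effect mentioned in the summary above can be reproduced in miniature. The sketch below is not the paper's experiment: it only shows that adding the same three numbers in a different order gives a different FP16 result. It emulates half precision with the standard-library `struct` format `"e"`; the helper names `to_fp16` and `fp16_sum` are made up for this example.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_sum(values):
    """Accumulate left to right, rounding to FP16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


values = [2048.0, 1.0, 1.0]
forward = fp16_sum(values)                    # large value first
backward = fp16_sum(list(reversed(values)))   # small values first
print(forward, backward)
```

Near 2048 the spacing between adjacent FP16 values is 2, so adding 1 to 2048 rounds back to 2048, while adding the two small values first yields 2, which survives the final addition. A cached and an uncached attention computation can differ in exactly this way when they sum the same terms in different orders.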

RESEARCH·arXiv CS.LG·4/15/2026

How Transformers Learn to Plan via Multi-Token Prediction

This paper investigates how Multi-token prediction (MTP) enables Transformers to learn to plan, outperforming standard Next-token prediction (NTP). Empirically, MTP consistently improves performance on reasoning tasks, and theoretically, it induces a two-stage reverse reasoning process via gradient decoupling.

Next-token prediction Planning Multi-Token Prediction Reasoning

DOC·DEV.to AI·20d ago

92. BERT: The Model That Reads in Both Directions

BERT distinguishes itself from GPT through its bidirectional reading capability, predicting masked words rather than sequential ones. This comprehensive contextual understanding made it dominant in NLP benchmarks and a cornerstone for understanding tasks. The content details BERT's pre-training mechanisms and fine-tuning techniques.

BERT GPT machine learning NLP

RESEARCH·DEV.to AI·24d ago

Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model

This paper discusses efficient 8-bit quantization for Transformer neural machine language translation models. The goal is to optimize the performance and efficiency of these models by reducing memory consumption and latency.

AI models efficiency NLP quantization
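
For readers unfamiliar with 8-bit quantization, the sketch below shows one common scheme, symmetric per-tensor quantization to the range [-127, 127]. This is a generic illustration and an assumption on our part; the summary above does not say which quantization scheme the paper uses. The function names and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [v * scale for v in q]


# Made-up weights for illustration.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is at most half the scale. Small values such as 0.003 collapse to zero, which is the kind of accuracy cost that quantization work tries to keep small.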

DOC·DEV.to AI·24d ago

83. HuggingFace: Your Library for Every Pretrained Model

This article shows how HuggingFace makes practical NLP accessible through its libraries and Model Hub. It demonstrates how pretrained models can be used for tasks like sentiment analysis with minimal code.

learning machine learning NLP HuggingFace

RESEARCH·DEV.to AI·4/27/2026

An Attention Free Transformer

This content introduces the concept of an Attention Free Transformer, a novel architectural design aiming to achieve the capabilities of traditional Transformers without relying on the self-attention mechanism. It likely explores alternative mechanisms for contextual information processing in sequence-to-sequence tasks.

neural networks deep learning AI Architectures Transformers

RESEARCH·arXiv CS.LG·4/27/2026

Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

This research investigates the necessity of learned memory tokens as a computational scratchpad for Universal Transformers with Adaptive Computation Time (ACT) on a combinatorial reasoning benchmark, Sudoku-Extreme. It finds that memory tokens are empirically necessary for non-trivial performance, identifying a sharp lower threshold for optimal count and a common router initialization trap.

neural networks deep learning memory Reasoning

RESEARCH·arXiv CS.LG·4/16/2026

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

This research investigates the 'grokking' phenomenon in transformers, finding that the long delay to generalization in arithmetic models stems from a decoder bottleneck. The encoder acquires relevant structural knowledge early, but the decoder struggles to access it, a hypothesis supported by causal interventions like transplanting encoders.

grokking machine learning representation learning Transformers

RESEARCH·arXiv CS.LG·4/27/2026

LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

LayerBoost proposes an optimization for LLMs by selectively modifying the attention mechanism based on the sensitivity of individual transformer layers. This aims to reduce the quadratic complexity of softmax attention, a major bottleneck for efficient inference, without significant model quality degradation.

LLMs AI optimization attention mechanisms Transformers

RESEARCH·arXiv CS.LG·4/20/2026

The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason

This research paper discovers spectral phase transitions in large language models' hidden activation spaces during reasoning versus factual recall. A systematic spectral analysis across 11 models and 5 architecture families identifies seven core phenomena, including reasoning spectral compression and instruction tuning spectral reversal.

neural networks LLMs machine learning AI Research

RESEARCH·arXiv CS.LG·5/8/2026

Adaptive Computation Depth via Learned Token Routing in Transformers

This paper introduces Token-Selective Attention (TSA), a mechanism for Transformer architectures that enables adaptive computation depth per token. TSA learns to route tokens based on contextual difficulty, saving 14-23% of token-layer operations with minimal quality loss.

neural networks deep learning machine learning efficiency

RESEARCH·arXiv CS.LG·17d ago

Temporal Contrastive Transformer for Financial Crime Detection: Self-Supervised Sequence Embeddings via Predictive Contrastive Coding

The Temporal Contrastive Transformer (TCT) is a new representation learning framework designed for financial transaction sequences to detect fraud. It uses self-supervised contrastive learning to generate embeddings that capture temporal behavioral patterns, showing meaningful predictive performance, especially when combined with domain-engineered features.

Financial AI security machine learning fraud detection

RESEARCH·arXiv CS.LG·4/24/2026

Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

Transformers struggle with high computational costs and memory consumption for long sequences, while alternatives lose long-tail dependencies. Absorber LLM proposes a self-supervised causal synchronization to absorb historical contexts into parameters, ensuring a contextless model matches the original full-context one on future generations.

AI architecture Natural Language Processing Machine Learning Optimization large language models

RESEARCH·arXiv CS.LG·28d ago

TTCD: Transformer Integrated Temporal Causal Discovery from Non-Stationary Time Series Data

The Transformer Integrated Temporal Causal Discovery (TTCD) Framework is a novel end-to-end approach designed to learn contemporaneous and lagged causal relations from complex non-stationary time series data. This method addresses the limitations of existing techniques by integrating temporal and frequency-domain attention, providing a unified solution for challenging real-world scenarios.

Causal Discovery machine learning non-stationary data Time Series

RESEARCH·arXiv CS.AI·5/7/2026

The Scaling Properties of Implicit Deductive Reasoning in Transformers

This paper investigates the scaling properties of implicit deductive reasoning over Horn clauses in depth-bounded Transformers. Deep models with a bidirectional prefix mask approach explicit CoT performance, though CoT remains necessary for depth extrapolation.

neural networks scaling deductive reasoning AI Research

RESEARCH·arXiv CS.LG·21d ago

Forecasting Medium-Horizon Alzheimer's Disease Progression: Residual Gap-Aware Transformers for 24-Month CDR-SB Change from ADNI Clinical and Biomarker Histories

This paper introduces a residual gap-aware transformer for forecasting 24-month Alzheimer's disease progression using ADNI clinical and biomarker histories. The research analyzes changes in CDR-SB scores, anchoring samples at mild cognitive impairment visits.

Biomarkers machine learning Alzheimer's disease Medical Diagnosis

RESEARCH·arXiv CS.LG·28d ago

Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant

This research analyzes three KV cache quantization schemes (KV, KQV, QKQV) and their impact on inner product variance, especially how QJL on K inflates it, amplified by softmax. Empirical findings highlight KQV's superior performance at a budget of n=4, an unconditional K-V asymmetry where QKQV is consistently worse than KQV in KL divergence, and budget-dependent crossovers for geometric K reconstruction.

machine learning quantization AI statistical inference

RESEARCH·arXiv CS.LG·29d ago

Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

The Toeplitz MLP Mixer (TMM) is a new transformer-like architecture that replaces attention with triangular-masked Toeplitz matrix multiplication, significantly reducing computational complexity to O(dn log n) time and O(dn) space. TMMs demonstrate superior training efficiency and better input information retention compared to traditional transformers, despite their simpler design.

neural networks AI architecture Computational Efficiency sequence models
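
To make the "triangular-masked Toeplitz matrix multiplication" in the summary above concrete, the sketch below builds a lower-triangular Toeplitz matrix from a single kernel vector and checks that multiplying by it equals a causal convolution. This is a single-channel, naive O(n^2) illustration with made-up values and hypothetical function names; it is not the paper's implementation. The O(n log n) cost per channel quoted in the summary would come from computing the same convolution with an FFT, which is omitted here.

```python
def toeplitz_causal_matrix(kernel):
    """Lower-triangular Toeplitz matrix: entry (i, j) is kernel[i - j] for j <= i."""
    n = len(kernel)
    return [[kernel[i - j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def causal_conv(kernel, x):
    """Same result without forming the matrix: position i only sees x[0..i]."""
    return [sum(kernel[i - j] * x[j] for j in range(i + 1)) for i in range(len(x))]


# Made-up kernel and input for illustration.
kernel = [1.0, 0.5, 0.25, 0.125]
x = [1.0, 2.0, 3.0, 4.0]
dense = matvec(toeplitz_causal_matrix(kernel), x)
direct = causal_conv(kernel, x)
print(dense)
print(direct)
```

Because every diagonal of the matrix holds one repeated value, the whole n-by-n mixing matrix is described by n numbers, which is where the O(dn) space figure for d channels comes from.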