
Transformers


RESEARCH · arXiv CS.LG · 4/6/2026

Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models

This work explores model scheduling to speed up Masked Diffusion Language Models (MDLMs) by replacing the full model with a smaller one at certain denoising steps. The research shows that the early and late steps are more robust to this substitution, allowing up to a 17% reduction in FLOPs with minimal degradation in generative perplexity.
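The scheduling idea in this summary can be illustrated with a small FLOPs calculation. This is only a sketch under stated assumptions: the step count, the number of edge steps, the relative cost of the small model, and the function names `pick_model` and `flops_saving` are all hypothetical and are not taken from the paper.

```python
def pick_model(step, total_steps, edge_steps):
    """Return which model runs at a given denoising step (0-indexed)."""
    # Hypothetical rule: the small model handles the first and last
    # `edge_steps` steps; the full model handles everything in between.
    if step < edge_steps or step >= total_steps - edge_steps:
        return "small"
    return "full"


def flops_saving(total_steps, edge_steps, small_cost_ratio):
    """Fraction of FLOPs saved versus running the full model at every step.

    `small_cost_ratio` is the cost of one small-model step relative to one
    full-model step (an assumed value, not a measured one).
    """
    scheduled = 0.0
    for step in range(total_steps):
        if pick_model(step, total_steps, edge_steps) == "small":
            scheduled += small_cost_ratio
        else:
            scheduled += 1.0
    return 1.0 - scheduled / total_steps


# Illustrative numbers only: 100 steps, 10 edge steps on each side,
# and a small model that costs half as much per step.
saving = flops_saving(total_steps=100, edge_steps=10, small_cost_ratio=0.5)
print(f"FLOPs saved: {saving:.1%}")
```

The saving is simply the fraction of steps handed to the small model multiplied by the per-step cost difference, so any real schedule trades that saving against the quality loss at the replaced steps.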

28
RESEARCH · arXiv CS.LG · 7d ago

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

DAStatFormer is a hybrid multibranch Transformer proposed to overcome the challenges of high dimensionality and complex spatio-temporal patterns in Distributed Acoustic Sensing (DAS). It integrates compact statistical features from multiple domains, significantly reducing data size and enhancing event classification.

28
RESEARCH · arXiv CS.LG · 4/20/2026

The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

This research reveals that KV caching in autoregressive transformer inference, under standard FP16 precision, causes a systematic divergence in decoded token sequences due to different floating-point accumulation orders. Across LLaMA-2-7B, Mistral-7B, and Gemma-2-2B, a 100% token divergence rate was observed, with cache-ON often leading to higher accuracy.
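The underlying effect, that FP16 addition is not associative, can be shown in a few lines. This is a minimal sketch and not the paper's experiment: it rounds to half precision with Python's standard `struct` module, and the helper names `to_fp16` and `fp16_sum` are illustrative.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_sum(values):
    """Accumulate values left to right, rounding to FP16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


values = [2048.0, 1.0, 1.0]

# Same numbers, two accumulation orders.
forward = fp16_sum(values)             # 2048 first, then the two small terms
backward = fp16_sum(reversed(values))  # small terms first, then 2048

print(forward, backward, forward == backward)
```

FP16 values between 2048 and 4096 are spaced 2 apart, so adding 1 to 2048 rounds back to 2048, while adding the two 1s first yields an exactly representable 2. The same mechanism, applied to the many partial sums inside attention, is what lets different accumulation orders produce different logits.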

27
DOC · DEV.to AI · 20d ago

92. BERT: The Model That Reads in Both Directions

BERT distinguishes itself from GPT through its bidirectional reading capability, predicting masked words rather than sequential ones. This comprehensive contextual understanding made it dominant in NLP benchmarks and a cornerstone for understanding tasks. The content details BERT's pre-training mechanisms and fine-tuning techniques.

27
RESEARCH · DEV.to AI · 4/27/2026

An Attention Free Transformer

This post introduces the Attention Free Transformer, an architecture that aims to match the capabilities of standard Transformers without relying on self-attention. It likely explores alternative mechanisms for processing contextual information in sequence-to-sequence tasks.

27
RESEARCH · arXiv CS.LG · 4/27/2026

Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

This research investigates the necessity of learned memory tokens as a computational scratchpad for Universal Transformers with Adaptive Computation Time (ACT) on a combinatorial reasoning benchmark, Sudoku-Extreme. It finds that memory tokens are empirically necessary for non-trivial performance, identifying a sharp lower threshold for optimal count and a common router initialization trap.

27
RESEARCH · arXiv CS.LG · 4/16/2026

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

This research investigates the 'grokking' phenomenon in transformers, finding that the long delay to generalization in arithmetic models stems from a decoder bottleneck. The encoder acquires relevant structural knowledge early, but the decoder struggles to access it, a hypothesis supported by causal interventions like transplanting encoders.

27
RESEARCH · arXiv CS.LG · 4/20/2026

The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason

This research paper discovers spectral phase transitions in large language models' hidden activation spaces during reasoning versus factual recall. A systematic spectral analysis across 11 models and 5 architecture families identifies seven core phenomena, including reasoning spectral compression and instruction tuning spectral reversal.

27
RESEARCH · arXiv CS.LG · 17d ago

Temporal Contrastive Transformer for Financial Crime Detection: Self-Supervised Sequence Embeddings via Predictive Contrastive Coding

The Temporal Contrastive Transformer (TCT) is a new representation learning framework designed for financial transaction sequences to detect fraud. It uses self-supervised contrastive learning to generate embeddings that capture temporal behavioral patterns, showing meaningful predictive performance, especially when combined with domain-engineered features.

27
RESEARCH · arXiv CS.LG · 4/24/2026

Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

Transformers struggle with high computational costs and memory consumption for long sequences, while alternatives lose long-tail dependencies. Absorber LLM proposes a self-supervised causal synchronization to absorb historical contexts into parameters, ensuring a contextless model matches the original full-context one on future generations.

27
RESEARCH · arXiv CS.LG · 28d ago

TTCD: Transformer Integrated Temporal Causal Discovery from Non-Stationary Time Series Data

The Transformer Integrated Temporal Causal Discovery (TTCD) Framework is a novel end-to-end approach designed to learn contemporaneous and lagged causal relations from complex non-stationary time series data. This method addresses the limitations of existing techniques by integrating temporal and frequency-domain attention, providing a unified solution for challenging real-world scenarios.

27
RESEARCH · arXiv CS.LG · 21d ago

Forecasting Medium-Horizon Alzheimer's Disease Progression: Residual Gap-Aware Transformers for 24-Month CDR-SB Change from ADNI Clinical and Biomarker Histories

This paper introduces a residual gap-aware transformer for forecasting 24-month Alzheimer's disease progression using ADNI clinical and biomarker histories. The research analyzes changes in CDR-SB scores, anchoring samples at mild cognitive impairment visits.

27
RESEARCH · arXiv CS.LG · 28d ago

Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant

This research analyzes three KV cache quantization schemes (KV, KQV, QKQV) and their impact on inner product variance, especially how QJL on K inflates it, amplified by softmax. Empirical findings highlight KQV's superior performance at a budget of n=4, an unconditional K-V asymmetry where QKQV is consistently worse than KQV in KL divergence, and budget-dependent crossovers for geometric K reconstruction.

27
RESEARCH · arXiv CS.LG · 29d ago

Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

The Toeplitz MLP Mixer (TMM) is a new transformer-like architecture that replaces attention with triangular-masked Toeplitz matrix multiplication, significantly reducing computational complexity to O(dn log n) time and O(dn) space. TMMs demonstrate superior training efficiency and better input information retention compared to traditional transformers, despite their simpler design.
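A lower-triangular Toeplitz matrix applied to a sequence is a causal convolution, which is how an O(n log n) cost per channel can arise. The sketch below compares a naive matrix product with an FFT-based one using NumPy. It assumes the FFT route is how the stated complexity is reached; the summary does not confirm the paper's implementation, and the function names are illustrative.

```python
import numpy as np


def causal_toeplitz_matmul_naive(kernel, x):
    """Build the n x n lower-triangular Toeplitz matrix and multiply: O(n^2)."""
    n = len(x)
    T = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            # Entry depends only on the offset i - j; entries above the
            # diagonal stay zero (the triangular mask).
            T[i, j] = kernel[i - j]
    return T @ x


def causal_toeplitz_matmul_fft(kernel, x):
    """Same product via FFT-based linear convolution: O(n log n)."""
    n = len(x)
    size = 2 * n  # zero-pad so circular convolution does not wrap around
    spectrum = np.fft.rfft(kernel, size) * np.fft.rfft(x, size)
    return np.fft.irfft(spectrum, size)[:n]


rng = np.random.default_rng(0)
kernel = rng.standard_normal(16)  # one learned value per offset (assumed)
x = rng.standard_normal(16)       # a single channel of a length-16 sequence

naive = causal_toeplitz_matmul_naive(kernel, x)
fast = causal_toeplitz_matmul_fft(kernel, x)
print(np.allclose(naive, fast))
```

Repeating the FFT product independently for each of d channels gives O(dn log n) time, and storing only the length-n kernel per channel rather than the full matrix gives O(dn) space.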

27