attention mechanisms

28 items

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20

Moonshot AI has open-sourced FlashKDA, a set of CUTLASS C++ kernels for Kimi Delta Attention (KDA) that run up to 2.22x faster than the Triton baseline in benchmarks on H20 GPUs. The implementation integrates with flash-linear-attention and targets linear attention architectures such as KDA.

42
RESEARCH · arXiv CS.LG · 4/21/2026

UniMamba: A Unified Spatial-Temporal Modeling Framework with State-Space and Attention Integration

UniMamba is a new unified spatial-temporal forecasting framework that integrates efficient state-space dynamics with attention-based dependency learning to tackle multivariate time series challenges. It employs a Mamba Variate-Channel Encoding Layer and a Spatial Temporal Attention Layer to capture both global temporal dependencies and inter-variate correlations.

33
RESEARCH · arXiv CS.LG · 5d ago

Do Transformers Need Three Projections? Systematic Study of QKV Variants

This research systematically evaluates variants of the Query, Key, and Value (QKV) attention formulation in Transformers, including shared key-value projections, shared query-key projections, and a single projection. Experiments across synthetic, vision, and language modeling tasks show that these alternatives perform on par with, or occasionally better than, standard QKV Transformers, with Q-K=V sharing offering a significant KV cache reduction in language modeling.

29
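
To make the cache argument in the summary above concrete, here is a minimal NumPy sketch of single-head causal attention in which one projection serves as both key and value. It is an illustration under stated assumptions, not the paper's implementation: the function name `shared_kv_attention`, the weight names `w_q` and `w_kv`, and the toy dimensions are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_kv_attention(x, w_q, w_kv):
    """Single-head causal attention where K and V share one projection (assumed variant)."""
    q = x @ w_q      # queries, shape (seq_len, d_model)
    kv = x @ w_kv    # shared key/value tensor: the only thing that needs caching
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ kv, kv

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16  # toy sizes, chosen only for illustration
x = rng.standard_normal((seq_len, d_model))
w_q = rng.standard_normal((d_model, d_model))
w_kv = rng.standard_normal((d_model, d_model))

out, cache = shared_kv_attention(x, w_q, w_kv)

standard_cache_floats = 2 * seq_len * d_model  # separate K and V tensors
shared_cache_floats = cache.size               # one shared tensor
```

In this toy setup the shared variant caches half as many values per head as separate K and V tensors would. The reduction the paper actually reports may differ, since it depends on the architecture and on how heads are organized.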
RESEARCH · arXiv CS.CL · 4/7/2026

LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

This paper proposes LPC-SM, a hybrid autoregressive architecture for long-context language models that separates local attention, persistent memory, predictive correction, and run-time control. The 158M-parameter model is evaluated and shows improvements in LM loss and in stability on long sequences.

28
RESEARCH · arXiv CS.LG · 4/23/2026

Super Apriel: One Checkpoint, Many Speeds

Super Apriel, a 15B-parameter supernet, has been released, offering four trained mixer choices per decoder layer to enable multiple speed/quality presets from a single checkpoint. This allows for 2.9x to 10.7x decode throughput gains with 96% to 77% quality retention, and also facilitates speculative decoding without a separate draft model.

28
RESEARCH · arXiv CS.AI · 5/7/2026

ANDRE: An Attention-based Neuro-symbolic Differentiable Rule Extractor

This paper introduces ANDRE, an Attention-based Neuro-symbolic Differentiable Rule Extractor: an inductive logic programming (ILP) framework for learning first-order logic programs. It optimizes over a continuous rule space with fully differentiable, attention-driven logical operators, addressing scalability challenges in noisy and probabilistic settings.

27
RESEARCH · DEV.to AI · 5/8/2026

Tiny weight edits improve LLM safety

Tiny, targeted weight edits to specific attention heads in LLMs, as demonstrated by the ASGuard method, can drastically reduce the success rate of jailbreaks that rely on linguistic tricks. This surgical approach patches vulnerabilities by dampening activations in the relevant attention heads, preserving overall model competence while significantly improving safety.

27
RESEARCH · arXiv CS.CL · 4/27/2026

Where Should LoRA Go? Component-Type Placement in Hybrid Language Models

This research systematically investigates LoRA placement in hybrid language models, which combine attention and recurrent components. It finds that adapting the attention pathway consistently outperforms full-model adaptation with significantly fewer parameters, while the effect of adapting the recurrent backbone varies drastically depending on the hybrid architecture (sequential vs. parallel).

27
RESEARCH · arXiv CS.LG · 4/24/2026

Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention

This paper introduces Gist Sparse Attention (GSA), an end-to-end learnable method to scale large language models to long contexts without architectural modifications. GSA compresses context into 'gist tokens' for summary, then selectively restores relevant raw chunks for detailed attention, combining compact global representations with targeted fine-grained access.

27