Performance optimization

44 items

RESEARCH · arXiv CS.LG · 19h ago

Enabling KV Caching of Shared Prefix for Diffusion Language Models

The paper introduces "bicache", the first KV caching technique for shared prefixes in diffusion language models (DLMs), addressing cases where existing LLM caching methods fail because of DLMs' bidirectional attention. The approach aims to unlock high-throughput DLM serving by leveraging observations about the stability of shared-prefix KVs in shallow layers.

54
RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/10/2026

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090

A performance bug has been identified in cuBLAS for matrix multiplication on NVIDIA RTX GPUs such as the 5090, where it uses only 40% of the available capacity. The author demonstrated a custom kernel that outperforms cuBLAS by up to 70%, suggesting that optimization for these GPUs is poor compared with the Pro and H-series models.

44
RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 26d ago

A First Comprehensive Study of TurboQuant: Accuracy and Performance

A comprehensive study on TurboQuant compares its variants (k8v4, 4bit-nc, k3v4-nc, 3bit-nc) with FP8 for KV-cache quantization. FP8 is recommended as the default, offering 2x capacity with negligible accuracy loss and good performance. TurboQuant variants show limited advantages or significant degradation in accuracy and performance, with 4bit-nc being an option for memory-constrained scenarios.

43
NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20

Moonshot AI has open-sourced FlashKDA, a CUTLASS C++ kernel for Kimi Delta Attention, offering up to 2.22x performance improvement over the Triton baseline on H20 benchmarks. This new implementation integrates with flash-linear-attention and enhances linear attention architectures like KDA.

42
RESEARCH · ↑ trending · Reddit r/MachineLearning · 5/4/2026

Why SSMs struggle in parameter-constrained training: empirical findings at 25M parameters [R]

This post details empirical findings from OpenAI's Parameter Golf competition, explaining why State Space Models (SSMs) are structurally disadvantaged compared to transformers in parameter- and time-constrained training regimes. Key issues include worse in_proj weight compression for SSMs and architectural win reversals at higher vocabulary sizes, alongside insights from Mamba-3 Triton kernel experiments.

42
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 25d ago

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct

The author investigates why a specific Qwen3.6 27B INT8 Autoround quantization recipe outperforms others, observing that the model "thinks" less yet gives better outputs in benchmarks. They then replicate this behavior with a new GGUF quant and note that both consistently reach answers faster than UD Q8 K XL.

42
RESEARCH · arXiv CS.CL · 4/6/2026

Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models

Discrete diffusion language models (dLLMs) speed up text generation, but parallel decoding degrades quality by ignoring dependencies between tokens. DEMASK proposes a lightweight predictor that estimates conditional influences to guide simultaneous unmasking, demonstrably improving quality. The technique yields a 1.7x to 2.2x speedup while maintaining or exceeding baseline performance.

29
RESEARCH · arXiv CS.LG · 4/23/2026

Super Apriel: One Checkpoint, Many Speeds

Super Apriel, a 15B-parameter supernet, has been released, offering four trained mixer choices per decoder layer to enable multiple speed/quality presets from a single checkpoint. This allows for 2.9x to 10.7x decode throughput gains with 96% to 77% quality retention, and also facilitates speculative decoding without a separate draft model.

28