Performance optimization

44 items

RESEARCH · arXiv CS.LG · 19h ago

Enabling KV Caching of Shared Prefix for Diffusion Language Models

The paper introduces "bicache", the first KV caching technique for shared prefixes in diffusion language models (DLMs). Existing LLM caching methods fail for DLMs because of their bidirectional attention. The approach aims to unlock high-throughput DLM serving by exploiting the observation that shared-prefix KVs are stable in shallow layers.

Diffusion Models KV Caching Performance optimization High-throughput serving

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/10/2026

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090 [D]

A performance bug has been identified in cuBLAS matrix multiplication on NVIDIA RTX GPUs such as the 5090, where cuBLAS uses only 40% of the hardware's capacity. The author demonstrated a custom kernel that outperforms cuBLAS by up to 70%, suggesting that these GPUs are poorly optimized for compared with Pro and H-series models.

Matrix Multiplication RTX GPUs Performance optimization NVIDIA

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 26d ago

A First Comprehensive Study of TurboQuant: Accuracy and Performance

A comprehensive study on TurboQuant compares its variants (k8v4, 4bit-nc, k3v4-nc, 3bit-nc) with FP8 for KV-cache quantization. FP8 is recommended as the default, offering 2x capacity with negligible accuracy loss and good performance. TurboQuant variants show limited advantages or significant degradation in accuracy and performance, with 4bit-nc being an option for memory-constrained scenarios.

AI models TurboQuant Performance optimization FP8
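
The "2x capacity" figure for FP8 can be checked with back-of-the-envelope arithmetic. The sketch below assumes the comparison baseline is a 16-bit KV cache, ignores any per-block scale-factor overhead, and uses a hypothetical model shape (32 layers, 8 KV heads, head dimension 128) and memory budget that do not come from the study.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_element: float) -> float:
    """Bytes of KV cache one token occupies across all layers."""
    # Keys and values are both cached, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def tokens_that_fit(budget_bytes: int, bytes_per_token: float) -> int:
    """How many tokens of KV cache fit in a fixed memory budget."""
    return int(budget_bytes // bytes_per_token)


# Hypothetical model shape and budget, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 8 * 1024**3  # 8 GiB reserved for the KV cache (assumption)

fp16_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)  # 16-bit
fp8_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)   # 8-bit

fp16_tokens = tokens_that_fit(BUDGET, fp16_per_token)
fp8_tokens = tokens_that_fit(BUDGET, fp8_per_token)
capacity_ratio = fp8_tokens / fp16_tokens

print(fp16_tokens, fp8_tokens, capacity_ratio)
```

Halving the bytes per element doubles the number of cached tokens for the same budget, whatever the model shape; lower-bit variants would scale the same way before accounting for their metadata overhead.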

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20

Moonshot AI has open-sourced FlashKDA, a CUTLASS C++ kernel for Kimi Delta Attention, offering up to 2.22x performance improvement over the Triton baseline on H20 benchmarks. This new implementation integrates with flash-linear-attention and enhances linear attention architectures like KDA.

Open Source deep learning Performance optimization attention mechanisms

RESEARCH · ↑ trending · Reddit r/MachineLearning · 5/4/2026

Why SSMs struggle in parameter-constrained training: empirical findings at 25M parameters [R]

This post details empirical findings from OpenAI's Parameter Golf competition, explaining why State Space Models (SSMs) are structurally disadvantaged compared to transformers in parameter- and time-constrained training regimes. Key issues include worse in_proj weight compression for SSMs and architectural win reversals at higher vocabulary sizes, alongside insights from Mamba-3 Triton kernel experiments.

SSMs AI models Performance optimization Neural network training

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 25d ago

Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct

The author investigates why a specific Qwen3.6 27B INT8 Autoround quantization recipe outperforms others, observing that the model "thinks" less yet produces better outputs in benchmarks. They then replicated this result with a new GGUF quant, noting that both consistently reach answers faster than UD Q8 K XL.

AI models Qwen3.6 Performance optimization quantization

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/13/2026

TurboOCR: 270–1200 img/s OCR with Paddle + TensorRT (C++/CUDA, FP16) [P]

TurboOCR achieves 270–1200 img/s OCR by optimizing PaddleOCR with C++/CUDA, FP16 TensorRT, fused kernels, and batched processing, addressing the performance bottlenecks of VLM-based approaches. This solution drastically improves throughput for large-scale document processing and is suitable for real-time RAG applications.

CUDA Performance optimization TensorRT C++

RESEARCH · arXiv CS.LG · 4/20/2026

Dispatch-Aware Ragged Attention for Pruned Vision Transformers

This paper investigates the dispatch-overhead bottleneck that prevents token pruning from fully realizing latency reductions in Vision Transformers (ViTs). It proposes a lightweight Triton attention kernel with a lower dispatch floor, achieving up to 2.24x end-to-end throughput for pruned ViTs.

AI models deep learning Performance optimization attention mechanisms

CASE · DEV.to AI · 4/20/2026

Real Performance Wins with AI Pair Programming: Before/After Benchmarks

This article shows how AI pair programming with Claude can deliver significant application performance gains by efficiently identifying and fixing bottlenecks. It presents real before-and-after results, including cases where the AI detected complex N+1 queries that humans had overlooked.

AI assistant Software Development Performance optimization Benchmarking
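
For readers unfamiliar with the N+1 pattern mentioned in the summary, here is a minimal sketch using Python's built-in `sqlite3` module. The schema, the data, and the `CountingConnection` helper are hypothetical and are not taken from the article; the point is only to show one query per parent row versus a single JOIN.

```python
import sqlite3


class CountingConnection:
    """Hypothetical helper that counts how many queries are issued."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.queries = 0

    def run(self, sql: str, params: tuple = ()) -> list:
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def make_db() -> CountingConnection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'ana'), (2, 'ben'), (3, 'cy');
        INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1'), (4, 3, 'c1');
    """)
    return CountingConnection(conn)


def titles_n_plus_one(db: CountingConnection) -> dict:
    """One query for the authors, then one more query per author."""
    result = {}
    for author_id, name in db.run("SELECT id, name FROM authors ORDER BY id"):
        rows = db.run(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_single_join(db: CountingConnection) -> dict:
    """Same data in one round trip (every author here has at least one post)."""
    rows = db.run(
        "SELECT a.name, p.title FROM authors a "
        "JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id"
    )
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result


slow_db, fast_db = make_db(), make_db()
slow = titles_n_plus_one(slow_db)
fast = titles_single_join(fast_db)
print(slow_db.queries, fast_db.queries)
```

With three authors the first version issues four queries and the second issues one; the gap grows linearly with the number of parent rows, which is why the pattern is easy to miss on small test data.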

DOC · DEV.to AI · 4/22/2026

Hands-On Performance: Diagnosing and Fixing Databricks SQL Bottlenecks

This hands-on guide focuses on performance tuning in Databricks SQL, detailing how to diagnose and fix bottlenecks. It teaches methods like reducing data scans, optimizing joins, and leveraging caching to make queries faster and cheaper, thereby avoiding common mistakes that lead to high latency and wasted resources.

Databricks SQL data engineering Performance optimization

DOC · Amazon Web Services (YouTube) · 4d ago

How do I troubleshoot latency and optimize Amazon Bedrock Agents performance?

This video covers how to troubleshoot latency issues and optimize the performance of Amazon Bedrock Agents. It offers a practical guide to making AI agents more efficient and responsive.

Troubleshooting Performance optimization Amazon Bedrock latency

RESEARCH · arXiv CS.CL · 4/6/2026

Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models

Discrete diffusion language models (dLLMs) speed up text generation, but parallel decoding degrades quality by ignoring dependencies between tokens. DEMASK proposes a lightweight predictor that estimates conditional influences to guide simultaneous unmasking, provably improving quality. The technique yields a 1.7x to 2.2x speedup while maintaining or exceeding performance.

Dependency Prediction DEMASK Parallel Decoding machine learning

DOC · AWS Machine Learning Blog · 6d ago

Reducing container cold start times using SOCI index on DLAMI and DLC

This post demonstrates how to utilize the SOCI index on publicly available Deep Learning AMIs and Containers to reduce cold start times. It covers the various SOCI modes and provides guidance on efficiently implementing this tool in current workloads.

Containers SOCI deep learning cloud computing

ARTICLE · DEV.to AI · 4/23/2026

Your Customer Service Bot Is Slow Because It's Single-Threaded

This article argues that single-threaded customer service bots are slow because they make LLM calls sequentially, causing up to 12 seconds of latency. It proposes a parallel sub-agent pattern with LangGraph and LangSmith that runs research tasks concurrently, cutting response times to around 6.5 seconds.

LangGraph customer service AI Performance optimization AI agents
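
The sequential-versus-concurrent difference described in the summary can be sketched with plain `asyncio` rather than LangGraph. Everything here is an assumption for illustration: `fake_llm_call` stands in for a real model call, and the task names and 0.05-second delays are invented, not the article's measurements.

```python
import asyncio
import time


async def fake_llm_call(name: str, seconds: float) -> str:
    """Stand-in for a network-bound LLM call (hypothetical)."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


# Hypothetical independent research tasks a support bot might run.
TASKS = [("orders", 0.05), ("policies", 0.05), ("account", 0.05)]


async def run_sequential() -> list:
    # Each call waits for the previous one, so latencies add up.
    return [await fake_llm_call(name, secs) for name, secs in TASKS]


async def run_parallel() -> list:
    # gather schedules all calls at once; total time is close to the slowest call.
    return await asyncio.gather(*(fake_llm_call(name, secs) for name, secs in TASKS))


def timed(coro_fn):
    """Run an async function to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return list(result), time.perf_counter() - start


sequential_result, sequential_s = timed(run_sequential)
parallel_result, parallel_s = timed(run_parallel)
print(f"sequential: {sequential_s:.2f}s, parallel: {parallel_s:.2f}s")
```

This only helps when the sub-tasks are independent; a final step that needs all their outputs still has to wait for the slowest one, which is consistent with the summary's reduction being less than the number of tasks would suggest.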

ARTICLE · DEV.to AI · 4/22/2026

The Parallelization Trap: Why Running More Agents Simultaneously Often Makes Things Worse

The "Parallelization Trap" describes how increasing concurrent AI agents can paradoxically reduce overall system throughput due to coordination and coherence problems. This happens as agents contend for shared context, leading to stale or conflicting information.

AI architecture Performance optimization distributed systems concurrency

ARTICLE · DEV.to AI · 4/9/2026

Disarming the "Join Bomb": Re-Engineering Collaborative Filtering on Neo4j

The article covers the "Join Bomb", a performance problem in recommendation engines built on Neo4j, caused by unoptimized bidirectional traversals in dense graphs. It proposes a fix, comparing a "naive" Cypher query with an optimized APOC-based query that eliminates the bottleneck.

Graph Database Performance optimization recommendation-engine Neo4j

ARTICLE · DEV.to AI · 4/15/2026

How I Build AI Features Into Mobile Apps Without Killing Performance

This article discusses the challenges of integrating AI features into mobile apps without sacrificing performance, such as speed and battery life. It emphasizes that AI performance in mobile applications is a multifaceted problem involving product, architecture, API, and user experience.

mobile development user experience Performance optimization AI

RESEARCH · arXiv CS.LG · 4/23/2026

Super Apriel: One Checkpoint, Many Speeds

Super Apriel, a 15B-parameter supernet, has been released, offering four trained mixer choices per decoder layer to enable multiple speed/quality presets from a single checkpoint. This allows for 2.9x to 10.7x decode throughput gains with 96% to 77% quality retention, and also facilitates speculative decoding without a separate draft model.

neural network architecture Performance optimization attention mechanisms large language models

DOC · DEV.to AI · 4/23/2026

Cursor Rules for Django: The Complete Guide to AI-Assisted Django Development

This guide addresses common performance and stability pitfalls in Django development, such as inefficient queries and blocking operations. It highlights how AI assistants, specifically Cursor and Claude Code, can significantly aid in building more robust and efficient Django applications.

Software Development Performance optimization Django AI development tools

ARTICLE · DEV.to AI · 4/20/2026

How We Integrate AI Into Real Mobile and Web Apps

This article shares practical advice and lessons learned at Zartek from integrating AI into real mobile and web applications, emphasizing problem-first approaches, performance optimization, reliability, cost savings through caching, and robust observability. It highlights common pitfalls and AI features that work well.

AI integration web development Reliability Performance optimization