performance

95 items

ARTICLE · DEV.to AI · 4/15/2026

Stop Scrolling Perfetto Timelines: Query Your Traces with SQL and Let AI Find the Bugs

This article presents a novel approach to debugging Android app performance by leveraging SQL queries against Perfetto traces and feeding the output to AI for automated analysis. This method allows developers to quickly identify and rank performance bottlenecks, significantly speeding up the optimization process compared to manual timeline exploration.

SQL Perfetto AI Debugging

ARTICLE · DEV.to AI · 7d ago

How I optimized a Python AI gesture engine to run on a 12-year-old laptop

This article details the development of GestCtrl, a gesture recognition engine optimized to run on old hardware, such as a 12-year-old laptop. The focus is on providing frictionless, touchless shortcuts instead of replacing the mouse and keyboard, addressing performance and user experience challenges.

AI optimization gesture recognition Python performance

NEWS · DEV.to AI · 4/26/2026

DeepSeek V4 Pro Just Dropped — Here's What Changed for AI Agents

DeepSeek V4 Pro launched on April 24, 2026, featuring 1.6T parameters, 1M token context, and dual Think/Non-Think modes with an MIT license. It is optimized for AI agent workloads, offering improved multi-step planning and more reliable function calling at a competitive price compared to Claude Sonnet 4.6 and GPT-4o.

deepseek-v4-pro performance AI agents Pricing

RESEARCH · arXiv CS.LG · 5/8/2026

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

This paper introduces sparse prefix caching, an optimization for LLM serving that stores recurrent states at checkpoint positions rather than requiring the entire token history. The method consistently improves the Pareto frontier compared to standard heuristics, especially for use cases where requests share a non-trivial prefix.

LLMs AI infrastructure Caching performance

DOC · DEV.to AI · 22d ago

Three memory-leak patterns in long-running scrapers (and how I caught them after 968 Trustpilot runs)

This post details three common memory-leak patterns observed in long-running web scrapers after 968 Trustpilot runs. The leaks silently increase memory usage and cost, and are often caused by producers fetching URLs faster than consumers can process them in asynchronous queues.

Apify Asynchronous Programming memory leaks performance
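
The producer-outpaces-consumer pattern named in the summary above can be illustrated with a bounded `asyncio.Queue`. This is a minimal sketch rather than the article's code: the function names, the `example.com` URLs, and the `maxsize` value are all assumptions chosen for illustration. The idea is that `put()` suspends the producer once the queue is full, so pending work cannot pile up in memory.

```python
import asyncio


async def producer(queue, urls):
    """Enqueue URLs; put() suspends while the queue is full (backpressure)."""
    for url in urls:
        await queue.put(url)


async def consumer(queue, results, peak):
    """Process URLs one at a time and record the largest backlog observed."""
    while True:
        url = await queue.get()
        # Items currently held: those still queued plus the one just taken.
        peak[0] = max(peak[0], queue.qsize() + 1)
        await asyncio.sleep(0)  # stand-in for slow parsing or storage work
        results.append(url)
        queue.task_done()


async def run(n_urls=50, maxsize=5):
    """Run one producer and one consumer over hypothetical URLs."""
    queue = asyncio.Queue(maxsize=maxsize)  # bounded: the key difference
    results, peak = [], [0]
    worker = asyncio.create_task(consumer(queue, results, peak))
    urls = [f"https://example.com/page/{i}" for i in range(n_urls)]
    await producer(queue, urls)
    await queue.join()  # wait until every queued URL has been processed
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return results, peak[0]
```

Calling `asyncio.run(run())` processes all URLs while the backlog stays at or below `maxsize`. With the default unbounded queue (`maxsize=0`), the producer would never wait and the backlog could grow with the input.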

ARTICLE · DEV.to AI · 10d ago

The Bitter Truth About Scaling AI-Powered Search Engines: My Treasure Hunt Engine Debacle

The author recounts the failure of their AI-powered search engine, the Treasure Hunt Engine, as it crossed 100,000 users, highlighting severe scaling and result accuracy challenges. Attempts to resolve the issues by adding more hardware proved ineffective, necessitating a reassessment of their scaling approach.

search engine AI scaling Technical Debt performance

NEWS · DEV.to AI · 18d ago

6.4 Claim Puts Nemotron-Labs Diffusion in AI Fast Lane

NVIDIA's Nemotron-Labs Diffusion aims to accelerate AI applications by tackling the one-token-at-a-time bottleneck through parallel generation of multiple tokens. NVIDIA claims the new diffusion language model produces up to 6.4 times more tokens per forward pass, which would benefit latency-sensitive AI products such as coding assistants and agent workflows.

Diffusion Models language models AI NVIDIA

ARTICLE · DEV.to AI · 4/27/2026

MEMORY.md Every Turn? That’s Noise, Not Memory.

Large language models do not retain memory on their own, so conversation history must be fed to them explicitly. Common methods such as expanding the context window or pasting a fixed memory file every turn are inefficient and problematic at scale, leading to higher cost, slower inference, and reduced quality.

Context window memory management Cost Optimization large language models

CASE · DEV.to AI · 15d ago

The Overhyped Promise of Treasure Hunt Engines: Lessons from a Real-World Failure

The article details the failure of an AI-powered "treasure hunt engine" intended to drive an in-game rewards program. The team ran into latency issues and struggled to keep the system operational, and came to realize that the technology was a means to an end, not the goal itself.

game development monetization system failure AI

ARTICLE · DEV.to AI · 5/7/2026

Vector Index Cold Start: Why Your First Query Takes 8 Seconds

This article addresses the "cold start" problem in vector indexes for RAG services, where the first query after a deployment can take several seconds due to the index loading from disk. Although temporary, this latency spike impacts user experience, especially in high-traffic scenarios.

Vector Index deployment RAG AI infrastructure

RESEARCH · DEV.to AI · 15d ago

We Benchmarked the Most Popular Code Search Tools. We Beat All of Them.

A benchmark of popular code search tools found that "knowing" significantly outperformed competitors such as "codegraph" in precision (P@10) and time-to-consistency. Despite having zero GitHub stars, "knowing" was 1.53x more precise than "codegraph"; it uses a Random Walk with Restart approach.

code search software development Benchmarking AI tools

RESEARCH · DEV.to AI · 23d ago

The cheapest and fastest way to generate an image

The post benchmarks 25 image generation models from 6 providers on Vercel AI Gateway to identify the cheapest and fastest options. It finds significant price and speed differences, with bfl/flux-2-klein-4b leading on cost and bfl/flux-pro-1.1 leading on speed.

Benchmarking image generation AI cost

DOC · DEV.to AI · 22d ago

Running Qwen3.6-27B on a 16GB M1 MacBook Pro: A Practical Engineer’s Guide

This practical engineer's guide details how to run the Qwen3.6-27B model on a 16GB M1 MacBook Pro, overcoming memory limitations to keep the machine usable. The approach focuses on local testing, eliminating cloud dependency and API costs.

M1 Mac local LLM learning Qwen

ARTICLE · DEV.to AI · 5/8/2026

The Agentic Gap: Claude Oneshots, Gemma Fails

The article compares Gemma 4 and Opus 4.6 on a real-world software development task: adding public-facing search to a website. Although Gemma 4 had previously topped a local benchmark for speed and code quality, it failed the one-shot coding challenge, whereas Opus implemented the feature successfully.

AI models software development Benchmarking Local AI

RESEARCH · DEV.to AI · 5/8/2026

Model Showdown Round 2: Adding Gemma, Kimi, and 579 GB of Stubborn Optimism

This article presents "Model Showdown Round 2," introducing new models like Google's Gemma 4 and Moonshot AI's Kimi K2, and re-evaluating previous models with corrected configurations. The updated benchmarks revealed significant changes in the leaderboard, addressing issues like token limits and command interpretation from the initial round.

AI models inference LLMs Benchmarking

ARTICLE · DEV.to AI · 4/20/2026

Background Tasks: The One Actor in the Codebase and the SIGTERM Bug That Only Broke on Linux

An AI agent's efficiency is hindered by blocking tool calls that force sequential task execution, creating a bottleneck for slow operations. The proposed fix is a background execution layer, allowing the agent loop to remain non-blocking and process results asynchronously via a notification queue.

asynchronous processing Software Architecture performance AI agents
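
The background execution layer described in the summary above might look roughly like the following. This is a sketch under stated assumptions, not the article's implementation: `BackgroundRunner`, `submit`, `drain`, and `slow_tool` are hypothetical names, and plain threads stand in for whatever actor or task model the article uses. Each tool call runs off the main loop and posts its result to a notification queue that the loop polls without blocking.

```python
import queue
import threading
import time


class BackgroundRunner:
    """Runs tool calls on worker threads and reports results via a queue."""

    def __init__(self):
        self.notifications = queue.Queue()

    def submit(self, task_id, fn, *args):
        """Start fn(*args) in the background and return immediately."""
        def work():
            try:
                self.notifications.put((task_id, "ok", fn(*args)))
            except Exception as exc:  # report failures instead of losing them
                self.notifications.put((task_id, "error", repr(exc)))

        thread = threading.Thread(target=work, daemon=True)
        thread.start()
        return thread

    def drain(self):
        """Return all finished results without blocking the agent loop."""
        finished = []
        while True:
            try:
                finished.append(self.notifications.get_nowait())
            except queue.Empty:
                return finished


def slow_tool(x):
    """Hypothetical slow tool call."""
    time.sleep(0.05)
    return x * 2
```

An agent loop would call `submit` for slow operations, continue with other work, and call `drain` once per iteration to pick up whatever has finished.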

ARTICLE · DEV.to AI · 4/25/2026

The Intention-Action Gap in Autonomous Agents

The "intention-action gap" describes autonomous agents acknowledging tasks but failing to perform them, without errors or crashes. This is identified as a critical reliability issue in production agent systems.

Reliability AI Systems performance AI agents

ARTICLE · DEV.to AI · 29d ago

When I started running models locally, I thought quantization meant squeezing more into RAM. Turns o

The article advises against defaulting to Q4_K_M for local LLM inference, emphasizing that optimal performance comes from testing quantization levels tailored to specific workflows. It suggests that aggressive quantization like Q3_K_S can significantly cut latency with imperceptible quality loss for many tasks, though context length presents a trade-off.

Optimization LLMs quantization hardware

RESEARCH · arXiv CS.LG · 4/24/2026

FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

FairyFuse is a new inference system designed for CPU-only platforms, enabling multiplication-free execution of large language models. It uses ternary weights ({-1, 0, +1}) to replace floating-point multiplications with conditional additions and subtractions, significantly reducing memory bandwidth bottlenecks and offering up to 16x weight compression.

inference CPU optimization quantization performance
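
The arithmetic idea in the FairyFuse summary above, that ternary weights turn each multiply into a conditional add or subtract, can be shown in a few lines. This is only an illustrative sketch: `ternary_matvec` is a hypothetical name, and the paper's fused kernels, weight packing, and any scaling factors are not represented here.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, +1}) by a vector.

    No multiplication is performed: +1 adds the input, -1 subtracts it,
    and 0 skips it.
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing
        out.append(acc)
    return out
```

For example, `ternary_matvec([[1, 0, -1], [-1, -1, 1]], [2.0, 3.0, 5.0])` computes `2 - 5` and `-2 - 3 + 5` using additions and subtractions only.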

RESEARCH · arXiv CS.CL · 7d ago

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth consumed by the Key-Value (KV) cache. This paper proposes Attention Run-time Termination (ART), a lightweight mechanism that optimizes KV cache access and yields 20% higher generation throughput.

LLMs memory management decoding performance