performance

95 items

ARTICLE · ↑ trending · Hacker News (AI) · 1d ago

Show HN: Web Speed – A shared web-map registry for AI agents (MCP, open source)

The author introduces Web Speed, an open-source tool that parses HTML web pages into sitemaps that AI agents can read easily, making agents faster and cheaper to run. The project also maintains a global cache of sitemaps to speed agents up further; the cache is currently accessible only through the paid API version.

Open Source sitemaps performance web parsing

ARTICLE · DEV.to AI · 4/23/2026

Stop Using sleep() in Your Agent Loops: Event-Driven AI Agent Scheduling

This article criticizes the common use of `sleep()` in AI agent loops, pointing to its costs: wasted API budget, added latency, and masked failures. It advocates event-driven scheduling as a better way to control cost and performance at scale.

Optimization performance developer tools scheduling
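
The contrast the article draws can be illustrated with a minimal sketch. This is not code from the article: the names `worker` and `run_demo` and the event strings are hypothetical. The only assumption is that work items arrive on an `asyncio.Queue`. Instead of waking on a fixed `sleep()` interval to check for work, the worker suspends on `queue.get()` and resumes only when an event is actually delivered.

```python
import asyncio
from typing import List, Optional


async def worker(queue: "asyncio.Queue[Optional[str]]", results: List[str]) -> None:
    """Handle events as they arrive; there is no fixed-interval polling."""
    while True:
        # Suspends here until a producer puts an item on the queue.
        event = await queue.get()
        if event is None:  # sentinel value: shut down cleanly
            queue.task_done()
            break
        results.append(f"handled:{event}")
        queue.task_done()


async def run_demo() -> List[str]:
    """Push two hypothetical events through the worker and return what it handled."""
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    results: List[str] = []
    task = asyncio.create_task(worker(queue, results))
    for name in ("job_created", "tool_finished"):
        await queue.put(name)
    await queue.put(None)  # tell the worker to stop
    await task
    return results


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

In a polling loop the worst-case reaction time is the sleep interval, and every wake-up costs a check whether or not anything happened. Here the worker does no work between events.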

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 25d ago

I have (even faster) DeepSeek V4 Pro at home

The author reports running DeepSeek V4 Pro even faster than before on home hardware using ktransformers. The post details the hardware tweaks and presents benchmark results at increasing context depths.

DeepSeek Benchmarking hardware performance

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 25d ago

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)

The author tested the Qwen 3.6 35b MTP model locally and observed a 1.5x speedup. They also pushed the context window to 300k tokens and believe it can go higher.

LLMs Benchmarking Local AI Qwen

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/27/2026

Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090

Luce DFlash is a GGUF port of DFlash speculative decoding for Qwen3.6-27B that reaches nearly 2x throughput on a single RTX 3090. It is a standalone C++/CUDA stack, released as open source under the MIT license, aimed at faster LLM inference on consumer-grade hardware.

Open Source Optimization performance Speculative Decoding
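
For readers unfamiliar with the technique, the general idea behind speculative decoding can be sketched as follows. This is a toy greedy variant over integer tokens, not the Luce DFlash implementation or DFlash's drafting method. The function names and the toy `target` and `draft` models are hypothetical. A cheap draft model proposes `k` tokens, the expensive target model checks them, and the longest agreeing prefix is kept along with one token from the target.

```python
from typing import Callable, List, Tuple

# A "model" here maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each position. A real engine scores all k
        #    positions in one batched forward pass; here that is one "pass".
        passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # replace the first wrong token
                break
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new], passes


def target(prefix: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def draft(prefix: List[int]) -> int:
    """Toy draft model: agrees with the target except after tokens 3 and 7."""
    return 0 if prefix[-1] % 4 == 3 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    out, passes = speculative_decode(target, draft, [0], max_new=12)
    print(out == greedy_decode(target, [0], 12), passes)
```

The output is identical to plain greedy decoding with the target model. Any gain comes from needing fewer target passes, so the speedup depends on how often the draft agrees with the target.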

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/9/2026

Gemma 4 on Llama.cpp should be stable now

Fixes merged into Llama.cpp have resolved the known Gemma 4 issues, making it stable to use. The post offers tips for running it, such as using `--chat-template-file` and optimizing the cache, and warns against using CUDA 13.2.

Technical Tips Gemma 4 llama.cpp performance

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 5/1/2026

nvidia/Gemma-4-26B-A4B-NVFP4

The post reports results for the Gemma-4-26B-A4B-NVFP4 model on an NVIDIA 5090 GPU: 18.8GB of VRAM used and a 50k context. It also compares benchmark scores for the NVFP4 version against full precision on metrics including GPQA, AIME, and MMLU Pro.

AI models GPU Benchmarking NVIDIA

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/16/2026

Qwen 3.6 35B A3B, RTX 5090 32GB, 187t/s, Q5 K S, 120K Context Size, Thinking Mode Off, Temp 0.1

The post reports Qwen 3.6 35B A3B running at 187 tokens per second on an RTX 5090 32GB GPU. The run used a 120K context size, Q5 K S quantization, thinking mode off, and a temperature of 0.1.

inference AI hardware benchmark performance

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/19/2026

llama.cpp speculative checkpointing was merged

Speculative checkpointing has been merged into llama.cpp and may speed up certain prompts. Some workloads, such as coding with tuned parameters, see speed improvements of 0-50%, while others may not benefit because their draft acceptance streaks are short.

Open Source llama.cpp speculative-checkpointing AI inference

NEWS · ↑ trending · Reddit r/LocalLLaMA · 5/4/2026

Llama.cpp MTP support now in beta!

Llama.cpp's MTP support is now in beta, initially covering Qwen3.5 MTP, and may be merged soon. Together with maturing tensor-parallel support, it is expected to close performance gaps with vLLM, particularly in token generation speed.

AI models Qwen3.5 MTP llama.cpp

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/30/2026

Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix)

This update details running Qwen3.6-27B on a single RTX 3090, achieving ~218K context and stable tool calls at 50-66 TPS. A critical memory issue with long tool outputs was resolved by fixing an anchor drift in a Genesis patch (PN12) for vLLM.

Optimization hardware performance vLLM

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/23/2026

An Overnight Stack for Qwen3.6–27B: 85 TPS, 125K Context, Vision — on One RTX 3090 | by Wasif Basharat | Apr, 2026

According to its title, the write-up describes a setup that runs Qwen3.6–27B at 85 TPS with a 125K context and vision capabilities on a single RTX 3090, a notable result for efficient LLM deployment on one consumer GPU.

Optimization multimodal AI GPU large language models

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Dense vs. MoE gap is shrinking fast with the 3.6-27B release

Dense AI models currently outperform MoE overall, but MoE is rapidly catching up, particularly in coding benchmarks. For users with 24GB VRAM and a need for large context windows, MoE is becoming a more appealing option.

AI models LLMs Benchmarks MoE

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Is a high-end private local LLM setup worth it?

The user questions the worth of a high-end local LLM setup, citing high costs, setup difficulties, and perceived performance gaps compared to cloud services like Claude and GPT. They are willing to invest in powerful hardware but want to know if it can truly match the speed and intelligence of top commercial models.

local LLM private-ai cost hardware

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 5/7/2026

Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development

The user is seeking advice on choosing between an RTX 5090 and an M5 Max 128GB for agentic software development using Qwen3.6 27B locally. The RTX 5090 offers 3x speed, while the M5 Max provides 4x memory, presenting a trade-off between rapid code generation and larger context capacity.

LLMs GPU hardware performance

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/19/2026

QWEN3.6 + ik_llama is fast af

A user reports running Qwen3.6 with ik_llama at over 50 tokens/second with a 200k context window on a machine with 16GB of VRAM and 32GB of RAM.

Benchmarking hardware performance LLM

ARTICLE · DEV.to AI · 4/23/2026

Building a Bit-Accurate Fused QKV + RoPE Kernel for Qwen 2.5 in Triton

This article details the creation of a bit-accurate Triton kernel for Qwen 2.5 that fuses the QKV projection, RoPE, and the KV cache write into a single GPU launch. It achieves a 4.5-5x speedup over the equivalent sequence of separate PyTorch operations while producing exactly the same output. The post explains the kernel's design and how it was benchmarked.

GPU computing Transformer AI optimization Triton
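
As background for the RoPE step named above, here is a small unfused reference in plain Python. It is not the article's kernel and involves no Triton. It shows only the rotation a fused kernel would have to reproduce, for one head vector at one position. The half-split pairing (element `i` paired with `i + d/2`) and the default `base` of 10000 are assumptions for illustration, and the function name `rope` is hypothetical.

```python
import math
from typing import List


def rope(x: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Apply rotary position embedding to one head vector at position `pos`.

    Assumes the half-split pairing: element i is rotated together with
    element i + d/2, using angle pos * base**(-2i/d).
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s          # 2D rotation of the pair (a, b)
        out[i + half] = a * s + b * c
    return out


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product, used to check the relative-position property."""
    return sum(p * q for p, q in zip(u, v))


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0, 0.25]
    k = [1.5, 0.75, -0.5, 1.0]
    # Attention scores depend only on the position offset (here 2).
    print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 2), rope(k, 0)))
```

Because each pair is only rotated, vector norms are preserved and the query-key dot product depends only on the difference between the two positions. A simple reference like this is the kind of thing a fused kernel's output can be compared against.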

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/9/2026

backend-agnostic tensor parallelism has been merged into llama.cpp

Backend-agnostic tensor parallelism has been merged into llama.cpp, allowing AI models to run much faster on systems with multiple GPUs. This means the performance speedup no longer requires CUDA.

LLMs Optimization GPU AI

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/23/2026

Compared QWEN 3.6 35B with QWEN 3.6 27B for coding primitives

The post compares the Qwen 3.6 35B and 27B models on coding primitives: the 35B is faster (72 TPS) but less precise, while the 27B is slower (18 TPS) but produces more accurate results. It includes the prompt used for testing and asks readers for their experiences.

Benchmarking Qwen performance coding

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/12/2026

MiniMax m2.7 (mac only) 63gb: 88% and 89gb: 95%, MMLU 200q

The post presents Mac-only versions of the MiniMax M2.7 model in 63GB and 89GB sizes, which score 88% and 95% respectively on a 200-question MMLU run. It describes the results as promising and suggests the model approaches the level of models like Sonnet 4.5.

local inference MiniMax performance HuggingFace