LLM performance

10 items

RESEARCH↑ trendingReddit r/LocalLLaMA·4/19/2026

Same 9B Qwen weights: 19.1% in Aider vs 45.6% with a scaffold adapted to small local models

A study demonstrates that adapting the scaffolding for a small local LLM (Qwen3.5-9B) significantly improves its performance on the Aider Polyglot coding benchmark from 19.1% to 45.6%. This highlights the importance of scaffold design over inherent model weakness for local models in coding agents.

scaffolding Benchmarking coding AI local models

RESEARCH↑ trendingReddit r/LocalLLaMA·4/11/2026

DFlash speculative decoding on Apple Silicon: 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)

The post describes a native DFlash speculative decoding implementation on MLX for Apple Silicon that reaches 85 tok/s on Qwen3.5-9B (M5 Max). The technique achieves speedups of up to 3.3x while maintaining identical output quality.

apple-silicon MLX Qwen LLM performance

ARTICLE↑ trendingReddit r/LocalLLaMA·4/18/2026

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part.

The content details how to optimize Qwen3.6-35B-A3B on consumer hardware (RTX 5070 Ti, Ryzen 9800X3D), achieving 79 t/s with 128K context. The key finding is the correct use of the `--n-cpu-moe N` flag in llama.cpp, which significantly outperforms the common `--cpu-moe` by utilizing more GPU VRAM for MoE experts.

llama.cpp AI optimization MoE LLM performance

ARTICLE↑ trendingReddit r/LocalLLaMA·4/27/2026

GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B

This content details an optimization of the GBNF grammar for Qwen3.6 35B-A3B and 27B models, resulting in enhanced performance for coding and puzzle-solving. Benchmarking on an RTX 5090 setup with llama.cpp showed a significant uplift, particularly for the 35B-A3B model.

GBNF AI optimization Benchmarking Qwen

ARTICLE↑ trendingReddit r/LocalLLaMA·4/9/2026

Could it be that this take is not too far fetched?

This post discusses the AI community's concern that frontier models such as Claude Opus degrade in the weeks after launch, raising hypotheses about cost savings or infrastructure overload. It also discusses the challenges of establishing consistent benchmarks, since providers can adjust access to the models to avoid detection.

AI benchmarking Cost Optimization Cloud Compute AI Model Degradation

ARTICLE↑ trendingReddit r/LocalLLaMA·4/21/2026

Opus 4.7 Max subscriber. Switching to Kimi 2.6

A former Opus 4.7 Max subscriber reports that the model became lazy and expensive. After supplementing with Qwen 3.6, the user switched to Kimi 2.6, finding it surprisingly fast, pleasurable to use, and with seemingly better context management despite a smaller context window.

AI models user experience LLM performance Cost Efficiency

ARTICLE↑ trendingReddit r/LocalLLaMA·4/15/2026

Major drop in intelligence across most major models.

The author reports a major drop in intelligence across several major AI models, including ChatGPT, Claude, Gemini, and Grok, as of mid-April 2026. They observed models ignoring instructions and giving shallow outputs, hypothesize that the cause is either a change in quantization or a deliberate policy, and suggest rented GPUs or local AI as alternatives.

quantization Local AI model degradation AI intelligence drop

ARTICLEDEV.to AI·4/24/2026

An agent is only as good as the system engineering around it.

Anthropic's postmortem on Claude Code's quality drop revealed that the issue stemmed from orchestration, not the base model, highlighting the critical role of system engineering. The analysis divides AI agent quality into three layers—Model, Context, and Harness—concluding that overall performance is primarily defined by the engineering of the system around the model.

orchestration System Engineering LLM performance AI agents

ARTICLEDEV.to AI·4/19/2026

An Hour Down Claude Code's Memory Hole

Claude Code introduced a default auto-memory feature that consumed 47% of the system prompt and degraded the model's performance. The author details how they discovered and disabled this feature via an environment variable, restoring the AI's expected behavior.

user experience AI tools AI debugging System prompt optimization

RESEARCHarXiv CS.CL·4/7/2026

Robust LLM Performance Certification via Constrained Maximum Likelihood Estimation

This paper proposes a new, efficient approach to estimating LLM failure rates, which is essential for their safe deployment. The method uses constrained maximum likelihood estimation, combining human calibration data, LLM-judge annotations, and additional information supplied through domain constraints, and is empirically validated against methods such as PPI.

LLM-as-a-judge Constrained MLE Model Evaluation Failure Rate Estimation