LLM

609 items

ARTICLE↑ trendingReddit r/LocalLLaMA·4/10/2026

offline companion robot for my disabled husband (8GB RAM constraints) – looking for optimization advice

The author is building an offline AI companion robot for her quadriplegic husband to reduce his isolation. The current prototype runs Mistral-7B-Instruct on a ThinkPad with 8GB of RAM for conversation and faster-whisper on a Jetson Nano for speech recognition, and she is looking for optimization advice.

assistive technology AI robotics offline AI

RESEARCH↑ trendingReddit r/LocalLLaMA·4/16/2026

Qwen 3.6 35B A3B, RTX 5090 32GB, 187t/s, Q5 K S, 120K Context Size, Thinking Mode Off, Temp 0.1

The content details the performance of the Qwen 3.6 35B A3B model, achieving 187 tokens per second on an RTX 5090 32GB GPU. It highlights support for a 120K context size, using Q5 K S quantization and a temperature of 0.1.

inference AI hardware benchmark performance

NEWS↑ trendingReddit r/LocalLLaMA·4/16/2026

Released Qwen3.6-35B-A3B

The Qwen3.6-35B-A3B model has been released. This new model from the Alibaba Qwen team is now available on Hugging Face.

AI Model model release LLM

ARTICLE↑ trendingReddit r/MachineLearning·4/27/2026

How do you test AI agents in production? The unpredictability is overwhelming.[D]

A QA professional highlights the overwhelming challenges of testing non-deterministic LLM-based AI agents in production, where traditional quality assurance methods fail. They struggle with the variability of outputs and reasoning chains, finding existing approaches like snapshot testing and human evaluation insufficient or unscalable.

production AI testing Quality Assurance LLM

ARTICLE↑ trendingReddit r/MachineLearning·4/21/2026

The AI Database Landscape in 2026 - Four architecturally distinct approaches [D]

An industry survey outlines four architecturally distinct approaches to integrating AI into databases in 2026: vector, ML-in-database, LLM-augmented, and predictive databases. It details their inference mechanisms with diagrams and comparisons, and notes what the taxonomy excludes.

Vector Databases database architecture AI databases LLM

ARTICLE↑ trendingReddit r/LocalLLaMA·4/10/2026

[Model Release] I trained a 9B model to be agentic Data Analyst (Qwen3.5-9B + LoRA). Base model failed 100%, this LoRA completes 89% of workflows without human intervention.

A developer fine-tuned Qwen3.5-9B with LoRA to act as an agentic data analyst, aiming to build autonomy into the weights. The model completed 89% of end-to-end workflows without human intervention, whereas the base model failed all of them.

data analysis Agentic AI Fine-tuning LoRA

RESEARCH↑ trendingReddit r/LocalLLaMA·5/1/2026

Qwen 3.6 27B vs Gemma 4 31B - making Packman game!

A local LLM gamedev contest compared Qwen 3.6 27B and Gemma 4 31B in creating a Pac-Man game. Gemma 4 31B was the clear winner, producing stronger game logic and higher quality in much less time, despite Qwen generating more tokens.

code generation model comparison benchmark LLM

CASE↑ trendingReddit r/LocalLLaMA·4/11/2026

Gemma 4 26B A4B is still fully capable at 245283/262144 (94%) contex !

The local Gemma 4 26B A4B model shows exceptional capability, operating at 94% of its 262,144 token context. It successfully solved a problem that Gemini 3.1 could not, maintaining high performance and integrity under intense VRAM usage.

Context window Gemma Local AI performance testing

NEWS↑ trendingReddit r/LocalLLaMA·4/17/2026

Qwen3.6-35B-A3B Uncensored Aggressive is out with K_P quants!

The Qwen3.6-35B-A3B "Aggressive" variant has been released, offering an uncensored version of the original model with no refusals and zero capability loss. This release includes various K_P quants and vision support.

uncensored AI quantization Qwen model release

ARTICLE↑ trendingReddit r/LocalLLaMA·4/14/2026

24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/Gemma4)

This post details the transformation of a Xiaomi 12 Pro into a 24/7 local AI server using LineageOS and Ollama to serve Gemma4. The setup includes OS optimizations, custom thermal management, and battery protection for continuous operation.

Ollama Snapdragon AI Xiaomi

RESEARCH↑ trendingReddit r/LocalLLaMA·4/14/2026

We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch

A study benchmarked TranslateGemma-12b against five frontier LLMs on subtitle translation for six language pairs, showing the task-specific model consistently outperformed general-purpose models. While initial numbers indicated a clear win, human QA added a significant catch which will be detailed in the full report.

Translation Gemma benchmark AI

RESEARCH↑ trendingReddit r/LocalLLaMA·4/14/2026

Updated Qwen3.5-9B Quantization Comparison

This content compares various GGUF quantizations of the Qwen3.5-9B model using KL Divergence (KLD) to assess faithfulness to the BF16 baseline. The goal is to provide users with a data-driven basis for selecting the most faithful quantized file, where lower KLD scores indicate less information loss.

Qwen3.5-9B KLD quantization GGUF
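
As a rough illustration of the metric named in the summary above, the sketch below computes a mean per-token KL divergence between a baseline model's next-token distributions and a quantized model's. It is a minimal example under stated assumptions, not the post's actual tooling: the function names and the toy logits are hypothetical, and real comparisons run over full vocabularies and many tokens.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the baseline distribution, q the quantized one."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def mean_kld(baseline_logits, quantized_logits):
    """Average per-token KLD over a sequence of next-token logit vectors."""
    scores = [
        kl_divergence(softmax(b), softmax(q))
        for b, q in zip(baseline_logits, quantized_logits)
    ]
    return sum(scores) / len(scores)

# Hypothetical toy logits over a 4-token vocabulary at two positions.
baseline = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
mild_quant = [[1.9, 1.1, 0.1, -1.0], [0.5, 0.4, 2.9, 0.1]]
harsh_quant = [[1.0, 2.0, 0.5, -0.5], [1.5, 0.5, 1.5, 0.5]]

print("mild :", mean_kld(baseline, mild_quant))
print("harsh:", mean_kld(baseline, harsh_quant))
```

A lower score means the quantized file's predictions stay closer to the baseline's, which is the sense in which the summary calls a quantization more faithful.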

ARTICLE↑ trendingReddit r/LocalLLaMA·4/12/2026

MiniMax-M2.7 vs Qwen3.5-122B-A10B for 96GB VRAM full offload?!

The author compares MiniMax-M2.7 and Qwen3.5-122B-A10B GGUF models for local full offload on a 96GB VRAM rig. For their purposes, Qwen3.5-122B is preferred, despite MiniMax being more quantized, highlighting the trade-offs in performance for local LLM inference.

VRAM GGUF MiniMax Qwen

ARTICLE↑ trendingReddit r/MachineLearning·4/30/2026

A Hackable ML Compiler Stack in 5,000 Lines of Python [P]

The author built a simplified, hackable ML compiler stack in 5,000 lines of Python that emits raw CUDA, aiming to provide an easy-to-follow reference without the complexity of existing frameworks. It lowers small models like TinyLlama and Qwen2.5-7B through six Intermediate Representations, focusing on clarity over performance.

CUDA ML compiler compiler design Python

ARTICLE↑ trendingReddit r/LocalLLaMA·4/30/2026

Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix)

This update details running Qwen3.6-27B on a single RTX 3090, achieving ~218K context and stable tool calls at 50-66 TPS. A critical memory issue with long tool outputs was resolved by fixing an anchor drift in a Genesis patch (PN12) for vLLM.

Optimization hardware performance vLLM

RESEARCH↑ trendingReddit r/MachineLearning·4/26/2026

Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch [P]

A new educational implementation repository has been launched for speculative decoding, implementing various methods like EAGLE-3 and Medusa-1 from scratch to facilitate studying proposer design differences. It includes training and inference paths for models like Qwen/Qwen2.5-7B-Instruct and aims to clarify the distinction between proposer quality and verifier cost, and why a high acceptance rate doesn't always imply higher throughput.

software development machine learning AI optimization Speculative Decoding
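
The point in the summary above, that a high acceptance rate does not by itself mean higher throughput, can be sketched with a simplified cost model. This is an assumption-laden illustration and not code from the repository: it assumes each drafted token is accepted independently with the same probability, and that one proposer pass costs a fixed fraction (`draft_cost_ratio`, a hypothetical parameter) of one verifier pass.

```python
def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per verifier pass, assuming each drafted
    token is accepted independently with probability acceptance_rate."""
    if acceptance_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)

def estimated_speedup(acceptance_rate, draft_len, draft_cost_ratio):
    """Estimated speedup over plain autoregressive decoding.

    draft_cost_ratio is the cost of one proposer pass relative to one
    verifier pass (hypothetical parameter for this sketch)."""
    step_cost = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_step(acceptance_rate, draft_len) / step_cost

# An accurate but expensive proposer versus a weaker but very cheap one.
accurate_but_slow = estimated_speedup(acceptance_rate=0.9, draft_len=4, draft_cost_ratio=0.5)
weaker_but_cheap = estimated_speedup(acceptance_rate=0.6, draft_len=4, draft_cost_ratio=0.02)

print("accurate, expensive proposer:", round(accurate_but_slow, 3))
print("weaker, cheap proposer      :", round(weaker_but_cheap, 3))
```

Under these assumptions the proposer with the lower acceptance rate still comes out ahead, because its drafting cost per step is much smaller.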

ARTICLE↑ trendingReddit r/LocalLLaMA·4/8/2026

I tracked a major cache reuse issue down to Qwen 3.5’s chat template

A developer investigated persistent cache misses in local AI agent workflows that caused large blocks of context to be reprocessed unnecessarily. The cause was traced to a problem in the Qwen 3.5 chat template, after ruling out other possibilities such as inference engine errors or bugs in the cache implementation.

Optimization Qwen 3.5 AI Cache

ARTICLE↑ trendingReddit r/LocalLLaMA·4/8/2026

Qwen3.5-35B-A3B-Uncensored-FernflowerAI-GGUF

The author found and fixed a training bug in the Qwen3.5-35B-A3B model and released a fixed version, an improved system prompt, a chat template with tool-calling support, and recommended settings for LM Studio. The fix addresses the context loss and repetition that occurred in long conversations with the previous version of the model.

Model Fix Qwen3.5 GGUF Uncensored

ARTICLE↑ trendingReddit r/MachineLearning·4/8/2026

[P] Building a LLM from scratch with Mary Shelley's "Frankenstein" (on Kaggle)

This post offers an in-depth tutorial and a GitHub notebook showing how to build a Large Language Model (LLM) from scratch. The project uses Mary Shelley's novel 'Frankenstein' as the training dataset.

from scratch tutorial machine learning Python

NEWS↑ trendingReddit r/LocalLLaMA·4/26/2026

HauhauCS (of "Uncensored Aggressive" fame) published an abliteration package that plagiarizes Heretic without attribution, and violates its license

An investigation reveals that HauhauCS, a publisher of popular uncensored LLM models, plagiarized code from the Heretic project, violating its AGPL-3.0 license. Detailed evidence was found in the recovered source code, including identical module and function names.

Open Source AI ethics Intellectual Property LLM