LLM

609 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/13/2026

Experiment: Olmo 3 7B Instruct Q1_0

The author attempted to quantize OLMo-3 7B Instruct into a 1-bit format using quantization-aware distillation, training the model for 12 hours on 4x B200 GPUs. The resulting model can produce basic English but is generally unusable because of repetition loops and a lack of context tracking, which the author attributes to stopping training too early and choosing an unsuitable dataset.

OLMo-3 distillation quantization 1-bit model

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/23/2026

Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post

The author reports generation speeds of up to 68.35 tokens/s when running Qwen-3.6-27B with speculative decoding in llama.cpp, and shows the model generating and debugging code.

Benchmarking AI performance Speculative Decoding LLM

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Local manga translator with LLM build-in, written in Rust with llama.cpp integration

This project is a manga and image translator, built in Rust, that leverages object detection, visual LLM-based OCR, layout analysis, and fine-tuned inpainting models. It integrates llama.cpp to support local LLM inference with models like Gemma and Qwen, offering a performant and user-friendly pipeline.

Open Source Image processing Rust OCR

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/23/2026

We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R]

Researchers benchmarked 18 LLMs for OCR, finding that cheaper and older models often match or exceed the accuracy of flagship models at a fraction of the cost. They open-sourced their dataset and benchmarking framework.

Open Source Benchmarking OCR Cost Efficiency

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/10/2026

Stanford: Self improving Meta-Harness

Meta-Harness is a new system from Stanford that optimizes the "harness" around Large Language Models (LLMs), autonomously correcting errors to improve performance and reduce context usage. It shows notable improvements in text classification, outperforming existing systems while using 4 times fewer tokens.

self-improvement Meta-Harness AI harness

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/22/2026

Qwen 3.6 27B is out

The Qwen 3.6 27B model has been released. The announcement links to the model's official Hugging Face page for further details.

Qwen model release Large Language Model LLM

ARTICLE · ↑ trending · Hacker News (AI) · 6d ago

Show HN: Mnemo – local-first AI memory layer for any LLM (Rust, SQLite,petgraph)

Mnemo is a local-first AI memory layer designed for any Large Language Model, implemented using Rust and SQLite. It enables efficient storage and retrieval of contextual information for LLMs.

SQLite memory AI Rust

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/16/2026

Qwen3.6-35B-A3B released!

The Qwen3.6-35B-A3B model has been released and open-sourced, featuring a sparse MoE architecture with 35B total parameters and 3B active, under an Apache 2.0 license. It excels in agentic coding, multimodal perception, and reasoning, touted as efficient, powerful, and versatile.

multimodal AI open-source AI AI Model sparse MoE

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/13/2026

MiniMax-M2.7 NVFP4 on 2x RTX PRO 6000 Blackwell — bench numbers

The post presents benchmark results for the MiniMax-M2.7 LLM with NVFP4 quantization on a dual NVIDIA RTX PRO 6000 Blackwell GPU setup. It details decode throughput at various concurrency levels and prefill performance across different context sizes.

GPU Benchmarking NVIDIA Blackwell MiniMax M2.7

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/14/2026

You can decompose models into a graph database [N]

The post introduces LarQL, a project that decomposes a static LLM into a graph database and performs KNN walks that it describes as mathematically identical to matrix multiplication. According to the post, this makes it possible to update a model's factual knowledge without retraining, simply by inserting information into the graph database, and it uses less memory.

Graph Database Knowledge Update AI Model Decomposition

DOC · ↑ trending · Reddit r/MachineLearning · 4/22/2026

Need Info on quality benchmarks to run on DeepSeek V3.2 different quant levels [D]

A user is seeking advice on what quality benchmarks to run to measure the performance degradation when applying runtime quantization to the DeepSeek V3.2 large language model. The goal is to compare the quality loss against the non-quantized version.

Benchmarking quantization model optimization AI evaluation

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/20/2026

Open-source single-GPU reproductions of Cartridges and STILL for neural KV-cache compaction [P]

The author implemented and open-sourced reproductions of two recent ideas, Cartridges and STILL, for neural KV-cache compaction and long-context inference. The goal is to make these research ideas easy to inspect and run with benchmark code, also comparing them against existing methods.

neural networks Open Source research Memory Optimization

NEWS · ↑ trending · Reddit r/LocalLLaMA · 18d ago

New Release of ROCm based MLX LLM Engine - lemon-mlx-engine

The lemon-mlx-engine now integrates TheRock / ROCm 7.13, enabling users to try the latest ROCm with the MLX engine on their local hardware. This update also includes various bug and kernel fixes for Qwen3, 3.5, and 3.6 MoE and dense models.

ROCm Software release MLX AI development

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 25d ago

I have (even faster) DeepSeek V4 Pro at home

The author ran the DeepSeek V4 Pro model on their home hardware using ktransformers and reports that it now runs even faster. They detail hardware tweaks and present benchmark results at increasing context depth.

DeepSeek Benchmarking hardware performance

RESEARCH · ↑ trending · Reddit r/LocalLLaMA · 4/13/2026

Gemma 4 has a systemic attention failure. Here's the proof.

The author developed a diagnostic method for LLMs and says it reveals a systemic attention failure in Gemma 4 26B A4B. The method flagged 29 tensors with significant distribution drift, 21 of them in attention layers, which the author takes as evidence of a compromised attention mechanism.

Gemma 4 Attention Mechanism diagnostic method KL-drift

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/27/2026

Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090

Luce DFlash introduces a GGUF port of DFlash speculative decoding for Qwen3.6-27B, achieving nearly 2x throughput on a single RTX 3090. This standalone C++/CUDA stack, available as an MIT-licensed open-source project, significantly enhances LLM performance on consumer-grade hardware.

Open Source Optimization performance Speculative Decoding

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/24/2026

Nanochat vs Llama for training from scratch? [P]

The user is training an AI model from scratch and asks which architecture to use. They are considering a switch from Nanochat, which lacks Transformers compatibility, to the Llama architecture, even though Nanochat has its own advantages. The goal is an open-source project trained on a new, larger dataset.

AI architecture open-source AI AI training LLM

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/9/2026

Gemma 4 on Llama.cpp should be stable now

Fixes merged into llama.cpp have resolved the known Gemma 4 issues, making it stable to use. The post offers tips for running it, such as using `--chat-template-file` and optimizing the cache, and warns against using CUDA 13.2.

Technical Tips Gemma 4 llama.cpp performance

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/16/2026

Training Qwen2.5-0.5B-Instruct on Reddit posts summarization tasks with length constraint on my 3xMac Minis with GRPO - evals update [P]

The author trained Qwen2.5-0.5B-Instruct with GRPO for length-constrained Reddit post summarization using two reward strategies, and found that combining quality and length penalties gave significantly better results. Evaluation used LLM-as-a-judge and DeepEval tools for metrics like conscientiousness and clarity.

evaluation reinforcement learning AI training summarization

NEWS · ↑ trending · Reddit r/LocalLLaMA · 5/7/2026

Qwen3.6 27B uncensored heretic v2 Native MTP Preserved is Out Now With KLD 0.0021, 6/100 Refusals and the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs and NVFP4s formats.

The Qwen3.6 27B uncensored heretic v2 Native MTP Preserved language model has been released, with a reported KLD of 0.0021 and 6 refusals out of 100. It is available in Safetensors, GGUF, and NVFP4 formats, with all 15 MTPs preserved.

uncensored AI Hugging Face Qwen3.6 model release