llama.cpp

33 items

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/12/2026

huge improvement after moving from ollama to llama.cpp

The post describes a war-robots project in which a Qwen3 model generates the code that controls the robots. The author reports that the AI performed significantly better after the switch from Ollama to llama.cpp.

Ollama llama.cpp AI robotics

ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 4/10/2026

I no longer need a cloud LLM to do quick web research

The author shares a setup for fast web research and scraping with local LLMs, specifically Qwen3.5:27B-Q3_K_M on an RTX 4090 with llama.cpp. The post details the tools and process that make effective offline extraction of web content possible, and indicates that local models now meet the author's quality standards.

RTX 4090 Qwen3.5 local LLM llama.cpp

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/15/2026

What is the current status with Turbo Quant?

The poster asks about the current status of "Turbo Quant", pointing to the hype around it roughly two weeks earlier and to mentions of pull requests into llama.cpp, and seeks an update on its development and adoption.

Turbo Quant llama.cpp quantization AI development

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/10/2026

More Gemma4 fixes in the past 24 hours

This post reports recent fixes and new chat templates for Google's Gemma 4 models, aimed at improving reasoning budget and tool calling. It gives llama.cpp users instructions for applying the new templates so that the models work correctly.

updates AI models Gemma 4 llama.cpp

NEWS · ↑ trending · Reddit r/LocalLLaMA · 4/8/2026

It looks like we’ll need to download the new Gemma 4 GGUFs

This post announces Unsloth's updated Gemma 4 GGUF models, which incorporate several improvements and fixes from the llama.cpp project. The updates address technical aspects such as the KV cache, CUDA support, vocabulary handling, and Gemma 4-specific parsing.

unsloth Gemma 4 AI models llama.cpp

NEWS · DEV.to AI · 4/19/2026

llama.cpp Speculative Checkpointing, Ollama Multimodal Tool, MLX vs GGUF for Gemma 4

Today's top stories feature the merge of speculative checkpointing into llama.cpp to accelerate local LLM inference and a new Ollama multimodal tool for local audio/video analysis. The roundup also includes a detailed comparison of MLX and GGUF for deploying Gemma 4 on consumer hardware.

LLMs Ollama llama.cpp model inference

DOC · DEV.to AI · 16d ago

Local LLM Setup Guide (v16)

This guide details how to set up and run Large Language Models (LLMs) locally, specifying hardware prerequisites such as an NVIDIA GPU and sufficient RAM, and comparing frameworks like llama.cpp and Ollama. It provides step-by-step instructions for installing llama.cpp and running a model with GPU acceleration.

local setup GPU llama.cpp guide

DOC · DEV.to AI · 22d ago

Building llama.cpp from source on a Dell Precision T5820 with an RTX 3090 Ti (after seven power cycles)

This post details setting up a Dell Precision T5820 with an RTX 3090 Ti for AI inference using llama.cpp to run Qwen3.6-27B. The author shares the build recipe, PCIe troubleshooting, and long-context tricks, highlighting patience as a crucial fix.

Homelab GPU Troubleshooting llama.cpp

DOC · DEV.to AI · 23d ago

Building and Running Llama.cpp on an Air-Gapped Mac

This guide explains how to build and run llama.cpp on an air-gapped macOS device, specifically addressing Gatekeeper errors and new WebUI download dependencies that prevent offline compilation. It details issues encountered when `cmake` attempts to download assets from Hugging Face or npm without an internet connection.

air-gapped llama.cpp build guide offline compilation

DOC · DEV.to AI · 16d ago

Local LLM Setup Guide (v4)

This guide details setting up local LLMs on Linux systems, specifically Ubuntu 20.04+. It covers hardware requirements, compares frameworks like llama.cpp, Ollama, vLLM, and LocalAI, and provides a step-by-step installation process.

local LLM AI frameworks llama.cpp setup guide

RESEARCH · DEV.to AI · 22d ago

Three Months of Speed-Up Experiments on a 3090 Ti: Autoregressive DFlash MTP for Qwen3.6-27B

This post details a three-month experiment aimed at optimizing the decode performance of the Qwen3.6-27B model on an RTX 3090 Ti GPU. The project reports decode speeds of 39-49 tokens per second against a baseline of 43 tokens per second, using a new speculative decoding technique (MTP) within llama.cpp.

LLM optimization llama.cpp Qwen3.6-27B GPU performance

NEWS · DEV.to AI · 4/12/2026

llama.cpp Adds Gemma 4 Audio, Speculative Decoding & Ollama Agent Boost Local AI

llama.cpp now supports multimodal audio processing for Gemma 4 models, enhancing its versatility on consumer hardware. The roundup also covers performance gains from speculative decoding and a new Ollama agent for local coding.

Ollama Gemma 4 llama.cpp speculative decoding

NEWS · Hugging Face Blog · 2/20/2026

GGML and llama.cpp join HF to ensure the long-term progress of Local AI

GGML and llama.cpp have joined Hugging Face to secure the continued progress of local AI. The collaboration aims to strengthen the development of accessible and efficient AI solutions.

AI inference Local AI Hugging Face open-source AI