local inference

16 items

CASE↑ trendingReddit r/LocalLLaMA·4/23/2026

Qwen 3.6 27B is a BEAST

A user reports that Qwen 3.6 27B, run locally on a laptop, excels at data science tasks like tool calls and data transformation debugging. Its performance was so impressive that they are considering canceling cloud subscriptions, finding it perfect for pyspark/python work.

local inference Benchmarking data science LLM

ARTICLE↑ trendingReddit r/LocalLLaMA·4/22/2026

Qwen3 TTS is seriously underrated - I got it running locally in real-time and it's one of the most expressive open TTS models I've tried

The author revisited an old real-time, local ASR->LLM->TTS pipeline project and was pleasantly surprised by Qwen3 TTS. After significant experimentation, they managed to get Qwen3 TTS working reliably for local streaming, praising its expressiveness and suitable architecture.

Open Source Qwen3 TTS real-time local inference

Qwen3 TTS is seriously underrated - I got it running locally in real-time and it's one of the most expressive open TTS models I've tried

CASE↑ trendingReddit r/LocalLLaMA·4/23/2026

Qwen 3.6 is actually useful for vibe-coding, and way cheaper than Claude

The author successfully implemented Qwen 3.6 models (27B and 35B) locally for coding, demonstrating comparable performance to Claude Code. This local setup drastically reduced costs, from an estimated $142 in API calls to less than $4 in electricity over 8 hours.

GPU Claude local inference Cost Savings

Qwen 3.6 is actually useful for vibe-coding, and way cheaper than Claude

DOC↑ trendingReddit r/LocalLLaMA·4/15/2026

Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s

The author shares a successful optimization for running the Qwen3.5-35B-A3B-UD-Q4_K_L model on an RTX 4060 Ti 16GB using llama.cpp, achieving 40-60 tokens/s with 64k context. The post provides the detailed `models.ini` configuration and server start command to replicate this performance.

Hardware Acceleration AI Model Optimization llama.cpp local inference

CASE↑ trendingReddit r/LocalLLaMA·4/18/2026

qwen3.6 performance jump is real, just make sure you have it properly configured

A user reports that Qwen 3.6 demonstrates a significant performance leap, proving capable for workloads typically handled by Opus and Codex, though not yet at their level. The user highlights its usefulness and speed when properly configured with `preserve_thinking` on an M5 Max with specific settings.

LLMs AI hardware local inference AI performance

qwen3.6 performance jump is real, just make sure you have it properly configured

ARTICLE↑ trendingReddit r/LocalLLaMA·4/19/2026

Is anyone getting real coding work done with Qwen3.6-35B-A3B-UD-Q4_K_M on a 32GB Mac in opencode, claude code or similar?

A user is attempting to perform real coding tasks with Qwen3.6-35B on a 32GB M2 Macbook Pro, encountering memory exhaustion and context window management issues. Despite the model identifying the essence of a bug, it struggles with implementation as critical information is lost during context compaction.

LLMs open-source AI local inference code generation

ARTICLE↑ trendingReddit r/LocalLLaMA·4/15/2026

Gemma4 26b & E4B are crazy good, and replaced Qwen for me!

The user describes their previous AI setup before switching to Gemma4, detailing the hardware configuration (GPUs and RAM) and the specific Qwen models used for various tasks. They explain the roles of different Qwen versions (3.5 4B, 30b, 27b, 80B, 122b) for semantic routing, general chat, reasoning, code generation, and knowledge retrieval, based on their quantization and context needs.

local inference Gemma model comparison Qwen

NEWS↑ trendingReddit r/LocalLLaMA·4/12/2026

MiniMax m2.7 (mac only) 63gb: 88% and 89gb: 95%, MMLU 200q

The content announces the launch of the MiniMax M2.7 AI model, available in 63GB and 89GB versions, optimized for Mac. It highlights its promising performance, suggesting it approaches levels of models like Sonnet 4.5 and mentions the MMLU benchmark.

local inference MiniMax performance HuggingFace

NEWSDEV.to AI·4/19/2026

Gemini App Launches on Mac

Google has launched the Gemini App for macOS, representing its first major desktop expansion and a strategic shift towards local AI execution. This allows users to run Gemini models directly on their machines for faster local inference, reduced cloud dependency, and improved privacy and performance.

local inference Gemini Google AI application

DOCDEV.to AI·4/17/2026

How to Run LLMs Locally with Ollama — A Developer's Guide

This guide details how to run Large Language Models (LLMs) locally using Ollama, a free and private tool with an OpenAI-compatible API. It provides installation instructions for Linux, macOS, and Windows, along with commands to pull specific code-focused and general-purpose models.

LLMs Ollama local inference developer tools

ARTICLEDEV.to AI·5/8/2026

KIWI-CHAN GOES DARK: QWEN 35B TAKES THE HELM AND WE DON'T NEED THE CLOUD ANYMORE

Kiwi-chan has successfully migrated to an entirely on-premise AI inference system, eliminating cloud dependencies and API costs. Its reasoning engine now utilizes Qwen 35B with a custom quantized stack, currently in a phase of intensive learning and experimentation.

on-premise AI local inference AI automation machine learning

CASEDEV.to AI·4/16/2026

The Free Tier Wars 2026: Gemini vs Claude vs Ollama — Which One Actually Saves You Money?

The article details a 90-day experiment by Ultra Lab comparing the cost-performance of Google Gemini 2.5 Flash (free tier), Claude Opus 4.6 (Pro plan), and Ollama with ultralab:7b (local inference). It aims to reveal which LLM stack offers the best value for various production tasks, presenting real-world data.

local inference Performance Comparison Cost analysis LLM

DOCDEV.to AI·5/8/2026

Putting the GPU to Work: Running Local LLMs on a Home Lab

This content details installing Ollama and running local LLMs on a workstation using GPUs, emphasizing VRAM as a critical constraint. It describes integrating local models with Coder Agents for various coding tasks.

LLMs Ollama learning GPU

DOCDEV.to AI·4/21/2026

How to Install Ollama on Linux and Windows: Complete Setup Guide

This guide details how to install and configure Ollama on Linux and Windows systems, a tool that simplifies running and managing large language models (LLMs) locally. It covers system requirements, the step-by-step installation process, and how to run your first model, such as Llama3.

installation LLMs tutorials Ollama

ARTICLEDEV.to AI·4/14/2026

Best Open-Source Models for OpenClaw — Run Locally, No API Costs

This article recommends the best open-source AI models for local execution on OpenClaw in April 2026, highlighting Qwen3.5:27b as the best all-rounder, DeepSeek-R1-Distill-32B for coding, and Llama 4 Scout for multimodal tasks. It details VRAM requirements and benchmark performance for each model.

open source models LLMs GPU local inference

NEWSDEV.to AI·4/26/2026

DeepSeek-V4 Ported to MLX for Apple Silicon Inference

DeepSeek-V4 has been ported to Apple's MLX framework, enabling the large language model to run on Apple Silicon Macs. The functional port, a community effort by @Prince_Canuma, still requires optimization for improved performance.

apple-silicon local inference MLX large language models