LLM

609 items

RESEARCH↑ trendingReddit r/MachineLearning·5/3/2026

torch-nvenc-compress: GPU NVENC silicon as a PCIe bandwidth multiplier — PCA + pure-ctypes Video Codec SDK wrapper. Parallel-path overlap measured at 67% of theoretical max on a real GEMM + encode workload. [P]

This project introduces the Python library "torch-nvenc-compress," which leverages the GPU's NVENC/NVDEC hardware to compress LLM activations and KV cache, aiming to overcome PCIe bandwidth bottlenecks in multi-GPU setups. It measures a parallel-path overlap at 67% of theoretical max, improving communication between consumer GPUs.

NVENC GPU PCIe compression

ARTICLE↑ trendingReddit r/MachineLearning·4/26/2026

How to collect evidence for LLM reviewer? [D]

A researcher received a weak rejection from a reviewer suspected of using an LLM, whose points were irrelevant and unoriginal, contrasting with positive feedback from other reviewers. The author seeks advice on how to collect evidence and report the reviewer to the academic committee for low-quality or LLM-generated feedback, considering the challenge of proving AI usage.

academic-ethics AI misuse Peer review LLM

DOC↑ trendingReddit r/LocalLLaMA·4/15/2026

Gemma 4 Jailbreak System Prompt

This content discusses the "jailbreak" of the Gemma 4 model, focusing on the use of system prompts to exploit vulnerabilities. It explores the techniques employed to bypass the language model's safeguards and restrictions.

system prompt jailbreak security Gemma

ARTICLEDEV.to AI·1d ago

Enhancing LLM Reliability with Evaluation Engineering

This article explores how evaluation engineering is crucial for enhancing the reliability of Large Language Models (LLMs), discussing its principles and techniques. By focusing on this discipline, organizations can ensure their LLMs are effective and meet the demands of real-world applications.

Reliability Evaluation Engineering AI evaluation LLM

ARTICLE↑ trendingReddit r/MachineLearning·4/23/2026

2b or not 2b ? Custom LLM Scheduling Competition [P]

A Kaggle competition has been launched, focusing on optimizing token costs for LLM answers by deciding whether to run a small model or skip a question. The goal is to minimize weighted cost, considering compute, failure, and penalty for skipping a correct answer.
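The run-or-skip trade-off described above can be sketched as an expected-cost comparison. This is only an illustration: the `Costs` fields, the numeric values, and the assumption that an estimate `p_correct` of the small model's success probability is available are all hypothetical and not taken from the competition rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical cost weights; the real competition values are not given here."""
    compute: float       # cost of running the small model at all
    failure: float       # extra cost when the model runs and answers wrongly
    skip_penalty: float  # cost of skipping a question the model would have answered correctly


def expected_cost_run(p_correct: float, costs: Costs) -> float:
    # Running always pays compute, and pays the failure cost with probability (1 - p_correct).
    return costs.compute + (1.0 - p_correct) * costs.failure


def expected_cost_skip(p_correct: float, costs: Costs) -> float:
    # Skipping is only penalized when the model would have been correct.
    return p_correct * costs.skip_penalty


def should_run(p_correct: float, costs: Costs) -> bool:
    """Run the model only when doing so has the lower expected cost."""
    return expected_cost_run(p_correct, costs) < expected_cost_skip(p_correct, costs)


costs = Costs(compute=1.0, failure=4.0, skip_penalty=6.0)
decisions = {p: should_run(p, costs) for p in (0.2, 0.9)}
print(decisions)
```

With these illustrative weights the break-even point is `p_correct = 0.5`: below it skipping is cheaper, above it running is cheaper.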

Kaggle Benchmarking model optimization resource management

ARTICLE↑ trendingReddit r/LocalLLaMA·4/10/2026

GLM 5.1 crushes every other model except Opus in agentic benchmark at about 1/3 of the Opus cost

An agentic benchmark test shows that GLM 5.1 reaches performance similar to Opus at one third of the cost on agentic tasks, outperforming the other models tested. The author stresses the relevance of testing in real environments such as OpenClaw and ranks GLM 5.1 among the top models for agents today.

OpenClaw Benchmarks Agentic AI GLM 5.1

ARTICLE↑ trendingReddit r/LocalLLaMA·4/21/2026

Gemma 4 Vision

Gemma 4's default vision budget is often too low for effective detail recognition, causing poor OCR performance. Users can significantly enhance its vision by configuring `llama.cpp` parameters like `--image-min-tokens` and `--image-max-tokens` to higher values, such as 560 and 2240.
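A minimal sketch of how those two flags might be passed when launching `llama-server` from Python. The file names are hypothetical placeholders, and the helper `build_llama_server_cmd` is illustrative only; the flag names and the values 560 and 2240 come from the summary above, so check them against your `llama.cpp` build.

```python
import shlex


def build_llama_server_cmd(
    model_path: str,
    mmproj_path: str,
    image_min_tokens: int = 560,
    image_max_tokens: int = 2240,
) -> list[str]:
    """Assemble an argv list that raises the per-image token budget."""
    if image_min_tokens > image_max_tokens:
        raise ValueError("image_min_tokens must not exceed image_max_tokens")
    return [
        "llama-server",
        "-m", model_path,          # main model weights (GGUF)
        "--mmproj", mmproj_path,   # multimodal projector for vision input
        "--image-min-tokens", str(image_min_tokens),
        "--image-max-tokens", str(image_max_tokens),
    ]


# Hypothetical file names, used only for illustration.
cmd = build_llama_server_cmd("gemma-4.gguf", "mmproj-gemma-4.gguf")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)` once the paths point at real files.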

Optimization Configuration computer vision Gemma

ARTICLE↑ trendingReddit r/LocalLLaMA·4/26/2026

Switched from Qwen3.6 35b-a3b to Qwen3.6 27b mid coding and it's noticeably better!

A user switched from Qwen3.6 35b-a3b to Qwen3.6 27b (IQ3_M) mid-coding and found the latter noticeably better, even solving a difficult bug. They question if dense models handle compression better than MoE models, given the positive experience with a more aggressive quantization.

AI models local LLM Performance Comparison GGUF

NEWS↑ trendingReddit r/LocalLLaMA·4/15/2026

DFlash Doubles the T/S Gen Speed of Qwen3.5 27B (BF16) on Mac M5 Max

The new DFlash support in oMLX 0.3.5 RC1 has reportedly more than doubled the generation speed of the Qwen3.5 27B (BF16) model on a Mac M5 Max, from 9 to 22 T/S. This could make local deployment of the model at higher-precision quantizations or full weights considerably more practical.

oMLX DFlash Qwen3.5 AI performance

ARTICLE↑ trendingReddit r/LocalLLaMA·4/15/2026

Guys we have to change the pelican test

A user proposes a new creative test for AI models, challenging them to generate an HTML SVG of a horse in an F1 race car. The post compares and presents the outputs from several prominent large language models, including Gemini, DeepSeek, and Claude Sonnet.

SVG generation prompt-engineering model comparison AI

ARTICLEDEV.to AI·4/22/2026

Privacy-first RAG on Cloudflare's edge — here's everything I changed from the naïve baseline

This post details LocalMind, a privacy-first document intelligence platform built on Cloudflare's edge using RAG, Workers AI, Vectorize, and Google Gemma 4. It covers the RAG pipeline, quality improvements, and an extensive NLP layer for secure document processing.

Cloudflare privacy RAG edge computing

ARTICLE↑ trendingReddit r/LocalLLaMA·5/4/2026

The more I use it, the more I'm impressed

A user found Qwen 3.6 27b capable of discovering a critical bug that both GPT 5.5 and Claude Opus 4.7 initially missed and denied. This observation suggests that slower, more thorough processing by models like Qwen can sometimes outperform faster, frontier models in critical problem-solving.

AI models bug discovery model comparison LLM

ARTICLE↑ trendingReddit r/LocalLLaMA·19d ago

When your LLM treats data center GPUs like an optional DLC

The title suggests a discussion about when a Large Language Model (LLM) appears to underutilize or treat data center GPUs as optional resources. It implies an inefficiency or a challenge in managing powerful hardware resources for LLMs.

efficiency GPUs resource management data center

RESEARCH↑ trendingReddit r/MachineLearning·4/15/2026

Was looking at a ICLR 2025 Oral paper and I am shocked it got oral [D]

A user expresses shock regarding an ICLR 2025 Oral paper, criticizing its evaluation methodology for SQL code generation by LLMs. The paper reportedly used natural language metrics instead of execution metrics, leading to an approximately 20% false positive rate.
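To illustrate why a text-based metric can produce false positives for SQL, here is a small sketch contrasting string similarity with execution-based comparison. The table, the queries, and the choice of `difflib` as the text metric are assumptions made for this example and are not taken from the paper under discussion.

```python
import sqlite3
from difflib import SequenceMatcher


def text_similarity(predicted: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; stands in for a natural-language metric."""
    return SequenceMatcher(None, predicted, reference).ratio()


def execution_match(conn: sqlite3.Connection, predicted: str, reference: str) -> bool:
    """True when both queries return the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run cannot match
    reference_rows = conn.execute(reference).fetchall()
    return sorted(predicted_rows) == sorted(reference_rows)


# Toy database, hypothetical data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 50), (2, 150), (3, 300);
    """
)

reference = "SELECT id FROM orders WHERE amount > 100"
predicted = "SELECT id FROM orders WHERE amount < 100"  # one character off

similarity = text_similarity(predicted, reference)
matches = execution_match(conn, predicted, reference)
print(f"text similarity: {similarity:.3f}, execution match: {matches}")
```

The two queries are nearly identical as strings yet return disjoint rows, so a text metric would score the prediction highly while an execution check rejects it.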

ICLR Evaluation Metrics Peer review SQL Generation

NEWS↑ trendingReddit r/LocalLLaMA·4/12/2026

Unsloth MiniMax M2.7 quants just finished uploading to HF

New quantizations for the Unsloth MiniMax M2.7 model, ranging from Q1 to BF16, have been uploaded to Hugging Face. A detailed list of GGUF quantizations, along with their respective sizes, is now available for download.

unsloth AI models quantization

ARTICLE↑ trendingReddit r/LocalLLaMA·4/12/2026

huge improvement after moving from ollama to llama.cpp

The content describes a war robots project where the Qwen3 model generates code to control the robots. The author reports a significant improvement in AI execution after transitioning from Ollama to llama.cpp.

Ollama llama.cpp AI robotics

RESEARCH↑ trendingReddit r/LocalLLaMA·4/19/2026

QWEN3.6 + ik_llama is fast af

A user reported running Qwen3.6 with ik_llama at over 50 tokens/second with a 200k context window on 16GB VRAM and 32GB RAM. The user presents this as a notably fast result for that hardware.

Benchmarking hardware performance LLM

ARTICLE↑ trendingReddit r/LocalLLaMA·4/11/2026

If Dense Models are better for Coding, why are Qwen-Coders MoE?

The author questions Qwen's decision to use the Mixture-of-Experts (MoE) architecture for its coding models instead of dense models, which the author regards as more accurate for coding. They speculate the choice might be related to inference speed and regret the absence of a 14B successor.

Model Architecture coding AI MoE AI

DOCDEV.to AI·3d ago

Building a LangGraph RAG Agent from Scratch — with a Live UI That Shows Every Step

This article details a learning project that teaches how to build a full ReAct RAG agent using LangChain and LangGraph, complete with a real-time React UI. It explains each concept step-by-step and demonstrates the live pipeline visualization.

LangChain LangGraph learning RAG

ARTICLE↑ trendingReddit r/LocalLLaMA·4/10/2026

making my own ai waifu app that can teach me any language.

A developer built an AI 'waifu' app for language teaching using Gemma-4, Omnivoice TTS, and 3D modeling. The app, which includes features such as voice and video calls, impressed its creator with Gemma-4's ability to follow prompts without censorship.

App Development 3D modeling TTS AI