Speculative Decoding

18 items

RESEARCH·↑ trending·Reddit r/LocalLLaMA·4/11/2026

DFlash speculative decoding on Apple Silicon : 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)

A native MLX implementation of DFlash speculative decoding for Apple Silicon accelerates token generation for Qwen models, reaching 85 tok/s on Qwen3.5-9B on an M5 Max. Speedups reach up to 3.3x while output quality stays identical.

apple-silicon MLX Qwen LLM performance

DOC·↑ trending·Reddit r/LocalLLaMA·5/6/2026

Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR

The post details an implementation of Multi-Token Prediction (MTP) with quantized GGUFs for Qwen3.6-27B, using Unsloth's UD XL quantizations with Q8_0 MTP layers grafted on top, for a 2.5x throughput increase. The author shares the grafted GGUF files, the raw MTP layer source, and a conversion script, along with instructions for a custom llama.cpp build that pulls in speculative decoding support from an unmerged PR.

Multi-Token Prediction llama.cpp quantization large language models

ARTICLE·↑ trending·Reddit r/LocalLLaMA·4/23/2026

Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post

The post reports generation speeds of up to 68.35 tokens/s using speculative decoding with Qwen-3.6-27B in llama.cpp. The author also shows the model generating and debugging code efficiently.

Benchmarking AI performance Speculative Decoding LLM

ARTICLE·↑ trending·Reddit r/LocalLLaMA·5/7/2026

why llama.cpp can’t combine speculative decode methods?

A user asks why speculative decoding methods such as MTP and N-gram cannot be used together in llama.cpp, noting that N-gram speculation brings significant gains for agentic coding. They want to know whether the limitation is fundamental or an implementation detail, and find that others have already asked the same question.

Optimization LLMs llama.cpp Qwen3.6
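
For readers unfamiliar with N-gram speculation, the idea can be sketched in a few lines: when the last few tokens already appeared earlier in the context, the tokens that followed them are proposed as the draft. This is a generic illustration, not llama.cpp's implementation; the function name `ngram_draft` and its parameters are hypothetical. It hints at why the method suits agentic coding, where the model often re-emits code that is already in the context.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], n: int = 3, max_draft: int = 4) -> List[int]:
    """Propose draft tokens by finding the last n tokens earlier in the context.

    Hypothetical helper for illustration only. Returns the tokens that followed
    the most recent earlier occurrence of the trailing n-gram, or an empty list
    when there is no earlier occurrence.
    """
    if len(tokens) <= n:
        return []
    key = tuple(tokens[-n:])
    # Scan right to left so the most recent earlier match wins.
    # The range stops before the trailing n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[start:start + n]) == key:
            follow = start + n
            return list(tokens[follow:follow + max_draft])
    return []


# The trailing 3-gram (5, 6, 7) also opens the context, so the tokens that
# followed it there become the draft.
context = [5, 6, 7, 8, 9, 1, 2, 5, 6, 7]
print(ngram_draft(context, n=3, max_draft=4))  # [8, 9, 1, 2]
```

The draft costs no model call at all, so even a modest acceptance rate can pay off; the target model still has to verify every proposed token.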

NEWS·↑ trending·Reddit r/LocalLLaMA·4/27/2026

Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090

Luce DFlash is a GGUF port of DFlash speculative decoding for Qwen3.6-27B that reaches nearly 2x throughput on a single RTX 3090. The standalone C++/CUDA stack is open source under the MIT license and targets consumer-grade hardware.

Open Source Optimization performance Speculative Decoding

ARTICLE·↑ trending·Reddit r/LocalLLaMA·4/12/2026

Speculative Decoding works great for Gemma 4 31B with E2B draft (+29% avg, +50% on code)

Tests using Gemma 4 E2B as the draft model for Gemma 4 31B showed an average speedup of 29%, rising to 50% on code generation, on the hardware and software configuration given in the post.

Gemma 4 31B llama.cpp benchmark AI performance
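
The draft-and-verify loop behind a draft-model setup like this one can be sketched with toy models. The sketch below is a generic illustration of greedy speculative decoding, not the configuration from the post: `target_next`, `draft_next`, and `speculative_decode` are hypothetical stand-ins, and a real engine verifies all drafted positions in one batched forward pass rather than one call per position.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    # Toy "large" model: a deterministic function of the context.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    # Toy "small" model: matches the target except when the context
    # length is a multiple of 4, where it guesses wrong on purpose.
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1) Draft k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) Verify left to right against the target's own greedy choice.
        for tok in proposal:
            expected = target(out)
            if tok != expected:
                out.append(expected)  # first mismatch: keep the target's token
                break
            out.append(tok)           # match: the drafted token is accepted
        else:
            out.append(target(out))   # all k accepted: one bonus target token
    return out[:goal]


prompt = [3, 1, 4]
# Every emitted token is the target's greedy choice, so the outputs match.
assert speculative_decode(target_next, draft_next, prompt, 20) == \
    greedy_decode(target_next, prompt, 20)
```

Because a drafted token is kept only when it equals the target's choice, the draft model affects speed but not the output under greedy decoding.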

RESEARCH·↑ trending·Reddit r/MachineLearning·4/26/2026

Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch [P]

An educational repository implements several speculative decoding methods from scratch, including EAGLE-3, Medusa-1, PARD, draft models, N-gram, and suffix decoding, to make differences in proposer design easier to study. It includes training and inference paths for models such as Qwen/Qwen2.5-7B-Instruct and aims to clarify the distinction between proposer quality and verifier cost, and why a high acceptance rate does not always imply higher throughput.

software development machine learning AI optimization Speculative Decoding
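
The point about acceptance rate and throughput can be made concrete with the standard expected-speedup estimate for draft-model speculative decoding (Leviathan et al., 2023). The sketch below is not taken from the repository. It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one proposer call costs `cost_ratio` times one target call; the function names are hypothetical.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target pass: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Tokens per step divided by the step cost, in units of one target pass."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)


# An accurate but expensive proposer versus a weaker but nearly free one.
accurate = expected_speedup(alpha=0.9, gamma=5, cost_ratio=0.30)
cheap = expected_speedup(alpha=0.7, gamma=5, cost_ratio=0.02)
print(f"alpha=0.9, cost_ratio=0.30: {accurate:.2f}x")  # 1.87x
print(f"alpha=0.7, cost_ratio=0.02: {cheap:.2f}x")     # 2.67x
```

Under these assumptions the proposer with the lower acceptance rate still wins, because drafting costs it almost nothing.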

RESEARCH·arXiv CS.CL·7d ago

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

This paper proposes SENSE (Semantic Embedding Navigation with Soft-gated Evaluation) to enhance Retrieval-based Speculative Decoding (RSD) for LLMs. SENSE addresses RSD's rigid lexical dependencies by using robust semantic alignment and a soft-gated evaluation module to validate semantic equivalence.

LLMs NLP Inference Optimization Speculative Decoding

RESEARCH·arXiv CS.LG·4/23/2026

Accelerating PayPal's Commerce Agent with Speculative Decoding: An Empirical Study on EAGLE3 with Fine-Tuned Nemotron Models

This paper evaluates speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, which is powered by fine-tuned Nemotron models. The study reports a 22-49% throughput increase and an 18-33% latency reduction at zero additional hardware cost.

Performance benchmarking LLM optimization Inference acceleration large language models
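
As a quick sanity check on how the two reported ranges relate, the sketch below converts a throughput gain into the matching reduction in time per token. It assumes time per token is simply the inverse of throughput at a fixed load, which may not be how the paper measures latency; `latency_reduction` is a hypothetical helper.

```python
def latency_reduction(throughput_gain: float) -> float:
    """Fractional drop in time per token for a fractional throughput gain.

    Assumes time per token = 1 / throughput, so a gain g scales it by 1 / (1 + g).
    """
    return 1 - 1 / (1 + throughput_gain)


for gain in (0.22, 0.49):
    print(f"{gain:.0%} more throughput -> {latency_reduction(gain):.1%} less time per token")
# 22% more throughput -> 18.0% less time per token
# 49% more throughput -> 32.9% less time per token
```

Under that assumption the endpoints of the throughput range map onto the endpoints of the latency range.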

RESEARCH·arXiv CS.LG·4/23/2026

Super Apriel: One Checkpoint, Many Speeds

Super Apriel, a 15B-parameter supernet, has been released, offering four trained mixer choices per decoder layer to enable multiple speed/quality presets from a single checkpoint. This allows for 2.9x to 10.7x decode throughput gains with 96% to 77% quality retention, and also facilitates speculative decoding without a separate draft model.

neural network architecture Performance optimization attention mechanisms large language models

RESEARCH·DEV.to AI·22d ago

Three Months of Speed-Up Experiments on a 3090 Ti: Autoregressive DFlash MTP for Qwen3.6-27B

The post covers three months of experiments on decode performance for Qwen3.6-27B on an RTX 3090 Ti. It reports decode speed going from 43 to 39-49 tokens per second with MTP speculative decoding in llama.cpp.

LLM optimization llama.cpp Qwen3.6-27B GPU performance

RESEARCH·arXiv CS.CL·4/30/2026

SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding

SpecTr-GBV is a novel speculative decoding method that unifies multi-draft and greedy block verification to accelerate language model inference. It formulates the verification step as an optimal transport problem, improving both theoretical efficiency and empirical performance by achieving the optimal expected acceptance length.

large language models Inference Optimization Speculative Decoding AI Research

RESEARCH·arXiv CS.AI·5/7/2026

Parallel Prefix Verification for Speculative Generation

PARSE (PArallel pRefix Speculative Engine) is a new speculative generation framework that accelerates large language model (LLM) inference. It achieves this by parallelizing prefix verification on a semantic level, overcoming existing limitations by evaluating correctness across multiple prefixes in a single forward pass.

inference AI acceleration parallelization Speculative Decoding

RESEARCH·arXiv CS.CL·4/21/2026

Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM

This research evaluates cross-family speculative decoding for Polish LLMs on Apple Silicon, extending the MLX-LM framework with Universal Assisted Generation (UAG) for cross-tokenizer compatibility. Experiments show that context-aware token translation significantly improves acceptance rates for Bielik 11B on Polish language datasets.

apple-silicon Natural Language Processing Inference Optimization Speculative Decoding

RESEARCH·arXiv CS.CL·12d ago

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

EvoSpec introduces a framework for real-time evolution of draft models in speculative decoding for Large Language Models, addressing the bottleneck of large vocabulary sizes. It uses dynamic vocabulary and parameter adaptation, employing a context-aware mechanism and a lightweight online alignment strategy to improve acceptance rates and minimize distributional gaps.

Optimization machine learning large language models AI inference

RESEARCH·Together AI Blog·3/31/2026

Aurora

Aurora is an open-source RL framework designed to self-improve speculative decoding, learning from every served request. It achieves a 1.25x performance increase over well-trained static speculators.

Open Source AI Framework reinforcement learning Performance Improvement

NEWS·DEV.to AI·4/15/2026

AWS Speed Boosts, Agentic Limits, and Clinical AI Advances

AWS is optimizing LLM inference with speculative decoding on Trainium and vLLM, and the Spring AI SDK for Bedrock AgentCore is now generally available. New research also explores agentic system failures, CNN uncertainty quantification, and LLMs' role in clinical reasoning.

Clinical AI AWS LLM inference Agentic AI

ARTICLE·↑ trending·Reddit r/LocalLLaMA·4/19/2026

Speculative decoding question, 665% speed increase

A user asks about a 665% speed increase observed with speculative decoding, opening a technical discussion on model optimization.

deep learning AI performance model optimization speed improvement