
Speculative Decoding


DOC · ↑ trending · Reddit r/LocalLLaMA · 5/6/2026

Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR

The post describes running Multi-Token Prediction (MTP) with quantized GGUFs for Qwen3.6-27B: Q8_0 MTP layers are grafted onto Unsloth's UD XL quantizations, for a reported 2.5x throughput increase. The author shares the grafted GGUF files, the raw MTP layer source, a conversion script, and instructions for a custom llama.cpp build that adds speculative decoding support from an unmerged PR.

43
ARTICLE · ↑ trending · Reddit r/LocalLLaMA · 5/7/2026

why llama.cpp can’t combine speculative decode methods?

A user asks why speculative decoding methods such as MTP and N-gram cannot be used at the same time in llama.cpp, noting that N-gram drafting gives significant speedups for agentic coding. They want to know whether this is a fundamental limitation or an implementation one, and found that others had already asked the same question.
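For readers unfamiliar with N-gram drafting, here is a minimal sketch of the general idea: look up the most recent earlier occurrence of the last few tokens and propose whatever followed it. This is not llama.cpp's implementation; the function name `ngram_propose` and its parameters are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def ngram_propose(tokens: Sequence[int], n: int = 3, max_draft: int = 5) -> List[int]:
    """Propose draft tokens by copying what followed an earlier match.

    Hypothetical helper: finds the most recent earlier occurrence of the
    last `n` tokens and returns up to `max_draft` tokens that came after it.
    Returns an empty list when there is no earlier match.
    """
    if len(tokens) <= n:
        return []
    suffix = tuple(tokens[-n:])
    # Scan right to left so the most recent match wins; the start index
    # stops short of the suffix's own position.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[start:start + n]) == suffix:
            follow = start + n
            return list(tokens[follow:follow + max_draft])
    return []


# Repetitive context, as in code edits: the suffix (1, 2, 3) appeared before.
history = [1, 2, 3, 4, 5, 1, 2, 3]
draft = ngram_propose(history, n=3, max_draft=2)
```

The proposer needs no model at all, which is why it is cheap on repetitive text such as code being re-emitted during an edit. The target model still has to verify every drafted token.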

43
RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/26/2026

Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch [P]

A new educational repository implements several speculative decoding methods from scratch, including EAGLE-3 and Medusa-1, to make differences in proposer design easier to study. It includes training and inference paths for models such as Qwen/Qwen2.5-7B-Instruct and aims to clarify the distinction between proposer quality and verifier cost, and why a high acceptance rate does not always mean higher throughput.
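The point about acceptance rate versus throughput can be illustrated with the standard expected-speedup estimate for speculative decoding. The sketch below is not taken from the repository. It assumes each drafted token is accepted independently with probability `alpha`, that verifying `gamma` drafted tokens costs about one target forward pass, and that `draft_cost` is the cost of one draft step relative to one target pass. The example numbers are made up.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes i.i.d. per-token acceptance with probability `alpha` and
    `gamma` drafted tokens per step (plus one token from the verifier).
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    `draft_cost` is the cost of one draft step relative to one target
    forward pass (an assumed, illustrative parameter).
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost + 1.0)


# Illustrative numbers only: a cheap proposer with lower acceptance
# versus an expensive proposer with higher acceptance.
cheap = expected_speedup(alpha=0.6, gamma=4, draft_cost=0.05)
costly = expected_speedup(alpha=0.8, gamma=4, draft_cost=0.4)
print(f"cheap proposer:  {cheap:.2f}x")
print(f"costly proposer: {costly:.2f}x")
```

Under these assumptions the proposer with the higher acceptance rate ends up slower overall, because each of its draft steps costs a large fraction of a target pass.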

42
RESEARCH · arXiv CS.LG · 4/23/2026

Accelerating PayPal's Commerce Agent with Speculative Decoding: An Empirical Study on EAGLE3 with Fine-Tuned Nemotron Models

This paper evaluates speculative decoding with EAGLE3 as an inference-time optimization for PayPal's Commerce Agent, which is powered by fine-tuned Nemotron models. The study reports a 22-49% throughput increase and an 18-33% latency reduction at no additional hardware cost.

28
RESEARCH · arXiv CS.LG · 4/23/2026

Super Apriel: One Checkpoint, Many Speeds

Super Apriel is a newly released 15B-parameter supernet that offers four trained mixer choices per decoder layer, enabling multiple speed/quality presets from a single checkpoint. The presets range from 2.9x to 10.7x decode throughput gains with 96% down to 77% quality retention, and the design also supports speculative decoding without a separate draft model.

28
RESEARCH · arXiv CS.CL · 4/30/2026

SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding

SpecTr-GBV is a novel speculative decoding method that unifies multi-draft and greedy block verification to accelerate language model inference. It formulates the verification step as an optimal transport problem, improving both theoretical efficiency and empirical performance by achieving the optimal expected acceptance length.

27
RESEARCH · arXiv CS.CL · 4/21/2026

Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM

This research evaluates cross-family speculative decoding for Polish LLMs on Apple Silicon, extending the MLX-LM framework with Universal Assisted Generation (UAG) for cross-tokenizer compatibility. Experiments show that context-aware token translation significantly improves acceptance rates for Bielik 11B on Polish language datasets.

27
RESEARCH · arXiv CS.CL · 12d ago

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

EvoSpec introduces a framework for real-time evolution of draft models in speculative decoding for Large Language Models, addressing the bottleneck of large vocabulary sizes. It uses dynamic vocabulary and parameter adaptation, employing a context-aware mechanism and a lightweight online alignment strategy to improve acceptance rates and minimize distributional gaps.

27