← heapsort-ai

model optimization

26 items

ARTICLE↑ trendingReddit r/MachineLearning·4/23/2026

Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

The user is optimizing a Transformer model for size and inference speed, having plateaued after FP16 conversion and ONNX optimization, with pruning yielding limited gains. They are seeking advice on advanced techniques like low-rank factorization, aggressive quantization (INT8/INT4), knowledge distillation, or hardware-specific optimizations to achieve further real-world improvements.

50
ARTICLE↑ trendingReddit r/LocalLLaMA·4/22/2026

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent

The author demonstrates that pairing the Qwen3.6-35B model with the "little-coder" agent drastically improves its performance on the Polyglot benchmark to 78.7%, making it competitive with top cloud models. This finding suggests that a "harness mismatch" in testing setups might explain performance gaps between local and cloud AI models.

46
RESEARCH↑ trendingReddit r/LocalLLaMA·4/18/2026

Qwen3.6-35B-A3B-Uncensored-Wasserstein-GGUF

A user discovered and fixed a significant tensor drift issue in the `ssm_conv1d` layers of quantized Qwen3.6-35B GGUF models, proposing the Wasserstein metric as superior to Kullback Leibler for detecting numerical instability. The fix, which specifically targets recurrent state transition layers responsible for long-context memory, is now available in a shared model.

44
RESEARCHarXiv CS.LG·4/16/2026

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

This paper presents a necessary condition for intra-group learning algorithm design in Reinforcement Learning, requiring objectives to maintain gradient exchangeability across token updates to prevent reward-irrelevant drift. It proposes minimal transformations to restore this cancellation structure, which stabilizes training and improves sample efficiency.

29
RESEARCHarXiv CS.LG·4/20/2026

Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

This research introduces sequential KV compression, a novel two-layer architecture for transformer key-value caches that surpasses the per-vector Shannon limit. It leverages the sequential nature of KV cache tokens, using probabilistic prefix deduplication with language tries and predictive delta coding to achieve more efficient compression.

27
RESEARCHarXiv CS.CL·4/7/2026

SoLA: Leveraging Soft Activation Sparsity and Low-Rank Decomposition for Large Language Model Compression

SoLA é um novo método de compressão sem treinamento para LLMs, que utiliza esparsidade de ativação suave e decomposição de baixo-rank. Ele identifica componentes cruciais para a inferência e comprime a maioria, visando reduzir parâmetros de modelos de linguagem grandes de forma eficiente e acessível.

27
RESEARCHDEV.to AI·20d ago

AI/ML Research Digest — May 16, 2026

Recent AI/ML research breakthroughs significantly enhance model efficiency and inference speed across various applications. Techniques like knowledge distillation with low-rank adapters, improved on-policy distillation, the Pion optimizer, and prune-then-distill methods are reducing computational costs and enabling broader deployment of advanced AI models.

27
RESEARCHarXiv CS.CL·4/27/2026

An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation

This paper introduces a highly efficient Retrieval-Augmented Generation (RAG) system specifically for Ukrainian document question answering, which achieved 2nd place in the UNLP 2026 Shared Task. It features a custom hybrid search and a specialized Ukrainian language model, compressed for high-quality, verifiable local deployment on resource-constrained hardware.

27
RESEARCHarXiv CS.LG·27d ago

LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection

Diffusion Language Models (dLLMs) face scalability limits in parallelism due to overly conservative confidence thresholds that hinder their potential for highly parallel processing. This paper introduces LEAP, a training-free, plug-and-play method that improves dLLM parallelism by detecting early-converging tokens, thereby accelerating decoding.

27