inference

28 items

RESEARCH · arXiv CS.LG · 4/24/2026

FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

FairyFuse is a new inference system for CPU-only platforms that runs large language models without multiplications. Its ternary weights ({-1, 0, +1}) let it replace floating-point multiplications with conditional additions and subtractions, which eases the memory-bandwidth bottleneck and yields up to 16x weight compression.
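
To illustrate the idea in the summary above, here is a minimal sketch of a multiplication-free matrix-vector product with ternary weights, plus a 2-bit packing helper. It is not FairyFuse's actual kernel: the function names `ternary_matvec` and `pack_ternary` and the bit encoding (0 as `00`, +1 as `01`, -1 as `10`) are assumptions made for illustration only.

```python
def ternary_matvec(weights, x):
    """Compute y = W @ x for W with entries in {-1, 0, +1}, using only adds and subtracts."""
    y = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: contributes nothing, so it is skipped
        y.append(acc)
    return y


def pack_ternary(row):
    """Pack ternary weights at 2 bits each into one integer.

    Hypothetical encoding (an assumption, not taken from the paper):
    0 -> 0b00, +1 -> 0b01, -1 -> 0b10. Weight i occupies bits 2*i and 2*i + 1.
    """
    codes = {0: 0b00, 1: 0b01, -1: 0b10}
    packed = 0
    for i, w in enumerate(row):
        packed |= codes[w] << (2 * i)
    return packed


W = [[1, 0, -1],
     [-1, -1, 1]]
x = [0.5, 2.0, 4.0]

print(ternary_matvec(W, x))   # [-3.5, 1.5]
print(pack_ternary(W[0]))     # 33
```

At 2 bits per weight, a packed ternary weight takes one sixteenth of the space of a 32-bit float, which is one way to arrive at a 16x figure; whether the paper measures its compression against 32-bit weights is not stated in the summary.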

27
RESEARCH · arXiv CS.LG · 18d ago

Harnesses for Inference-Time Alignment over Execution Trajectories

This research investigates harness engineering as an inference-time technique for large language model (LLM) agents, focusing on improving long-term performance via task decomposition and guided execution. It quantifies how design elements like workflow granularity and guidance impact performance, revealing common failure modes such as over-decomposition and hallucinated execution.

27
NEWS · Together AI Blog · 3/17/2026

Mamba-3

Mamba-3 is a new open-source State Space Model (SSM) built for inference. It is described as faster than Transformers at decode and stronger than Mamba-2.

27