inference

28 items

RESEARCH · arXiv CS.LG · 4/24/2026

FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

FairyFuse is an inference system for CPU-only platforms that runs large language models without multiplications. It uses ternary weights ({-1, 0, +1}), replacing floating-point multiplications with conditional additions and subtractions, which eases memory-bandwidth bottlenecks and yields up to 16x weight compression.

inference CPU optimization quantization performance
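
The multiplication-free idea in the summary above can be illustrated with a minimal Python sketch. This is not FairyFuse's kernel: the function names `ternary_matvec` and `pack_ternary` and the 2-bit encoding are assumptions made for illustration. The packing step shows one way a 16x figure could arise, assuming 2 bits per ternary weight compared against 32-bit floats.

```python
from typing import List


def ternary_matvec(weights: List[List[int]], x: List[float]) -> List[float]:
    """Matrix-vector product where every weight is -1, 0, or +1.

    No multiplication is performed: each weight selects whether the
    input element is added, subtracted, or skipped.
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing, so it is skipped
        out.append(acc)
    return out


def pack_ternary(row: List[int]) -> bytes:
    """Pack ternary weights at 2 bits each, four weights per byte.

    Hypothetical encoding for this sketch: 0 -> 0b00, +1 -> 0b01, -1 -> 0b10.
    """
    codes = {0: 0, 1: 1, -1: 2}
    packed = bytearray((len(row) + 3) // 4)
    for i, w in enumerate(row):
        packed[i // 4] |= codes[w] << (2 * (i % 4))
    return bytes(packed)


if __name__ == "__main__":
    W = [[1, 0, -1], [-1, -1, 1]]
    x = [2.0, 3.0, 5.0]
    print(ternary_matvec(W, x))
    # Eight weights fit in two bytes, versus 32 bytes as 32-bit floats.
    print(len(pack_ternary([1, 0, -1, 1, 0, 0, 0, -1])))
```

A real kernel would operate on the packed bytes directly with vector instructions; the nested loops here only show the add/subtract/skip logic.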

RESEARCH · arXiv CS.LG · 18d ago

Harnesses for Inference-Time Alignment over Execution Trajectories

This research investigates harness engineering as an inference-time technique for large language model (LLM) agents, focusing on improving long-term performance via task decomposition and guided execution. It quantifies how design elements like workflow granularity and guidance impact performance, revealing common failure modes such as over-decomposition and hallucinated execution.

inference LLMs machine learning Task Decomposition

RESEARCH · DEV.to AI · 12d ago

Sleep Phase Cuts Transformer Costs by Consolidating Memory

A new research paper introduces a "sleep phase" for language models, consolidating context into fixed-size memory layers. This method significantly reduces quadratic inference costs and enhances performance on long-horizon tasks.

language models inference Transformer memory

DOC · DEV.to AI · 4/28/2026

How to Deploy Phi-3.5 Mini with vLLM on a $5/Month DigitalOcean Droplet: Lightweight Production Inference Under $60/Year

This article walks through deploying Microsoft's Phi-3.5 Mini LLM with vLLM on a $5/month DigitalOcean Droplet. The setup offers lightweight production inference for under $60 a year, aiming to cut costs sharply compared with commercial LLM APIs.

inference cloud computing Cost Optimization LLM deployment

DOC · Together AI Blog · 5/8/2026

Deploy and inference any model from HuggingFace

This session teaches how to deploy any Hugging Face model using Goose and Together's Dedicated Container Inference. It aims to simplify setup complexity, enabling models to run quickly in a production-grade GPU environment.

inference learning GPU AI deployment

ARTICLE · ML Mastery · 11d ago

Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient

This article explores how continuous batching improves LLM inference efficiency, addressing the issues of static batching. It details dynamic scheduling and ragged batching to process multiple requests simultaneously.

inference deep learning efficiency Batching
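
The contrast between static and continuous batching described above can be sketched as a toy simulation in Python. It is an illustration under stated assumptions, not the article's code: each request is reduced to the number of decode steps it needs (at least 1), one step advances every running request by one token, and the function names are hypothetical. Ragged batching and prefill costs are not modeled.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Dynamic scheduling: a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running: List[int] = []  # remaining decode steps per active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Advance every active request by one token; drop finished ones.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2]  # one long request and three short ones
    print(static_batching_steps(lengths, batch_size=2))      # 10
    print(continuous_batching_steps(lengths, batch_size=2))  # 8
```

In this example the short requests no longer wait for the long one to finish, so the same work completes in 8 steps instead of 10.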


NEWS · Together AI Blog · 3/17/2026

Mamba-3

Mamba-3 is a new open-source State Space Model (SSM) built for inference. It is presented as faster than Transformers at decode and stronger than Mamba-2.

Open Source inference Mamba-3 SSM

NEWS · DEV.to AI · 4/18/2026

AI Hub Phase 8: Adding DeepInfra and Liquid AI — Now at 33 Providers

AI Hub Phase 8 announces the addition of DeepInfra and Liquid AI, expanding its provider count to 33. DeepInfra is highlighted for its cost-effectiveness and OpenAI-compatible endpoint, while Liquid AI introduces a novel, non-transformer architecture for long-context tasks.

AI platforms DeepInfra inference LLMs