RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

arXiv cs.LG · April 30, 2026

RaMP is a routing-aware dispatch framework for Mixture-of-Experts (MoE) inference. It targets the significant throughput lost when kernel configurations are selected from batch size alone. RaMP combines a performance-region analysis with a four-parameter wave cost model to select kernel configurations, achieving up to 1.22x kernel speedup and 0.93% mean regret relative to exhaustive search.
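
The summary reports mean regret relative to exhaustive search without defining it. The following is a minimal sketch, assuming the common definition: the relative slowdown of the selected configuration against the fastest one found by exhaustive search, averaged over workloads. The function name `mean_regret`, the configuration names, and the latencies are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def mean_regret(measured: List[Dict[str, float]], chosen: List[str]) -> float:
    """Average relative slowdown of the chosen config versus the per-workload best.

    measured[i] maps a config name to its measured latency on workload i
    (the exhaustive-search result); chosen[i] is the config a selector picked.
    """
    if not measured or len(measured) != len(chosen):
        raise ValueError("need one chosen config per non-empty workload list")
    regrets = []
    for timings, pick in zip(measured, chosen):
        best = min(timings.values())  # latency of the exhaustive-search optimum
        regrets.append((timings[pick] - best) / best)
    return sum(regrets) / len(regrets)


# Hypothetical latencies (microseconds) for two workloads and two configs.
measured = [
    {"cfg_a": 100.0, "cfg_b": 110.0},  # cfg_a is fastest here
    {"cfg_a": 220.0, "cfg_b": 200.0},  # cfg_b is fastest here
]
chosen = ["cfg_a", "cfg_a"]  # a selector that always picks cfg_a

regret = mean_regret(measured, chosen)
print(f"mean regret: {regret:.2%}")
```

Under this definition, a mean regret of 0.93% would mean the selected configurations run, on average, less than one percent slower than the best configurations an exhaustive search would find.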

deep learning AI optimization performance
