
Dispatch-Aware Ragged Attention for Pruned Vision Transformers

arXiv CS.LG · April 20, 2026

This paper investigates the dispatch-overhead bottleneck that prevents token pruning from fully realizing latency reductions in Vision Transformers (ViTs). It proposes a lightweight Triton attention kernel with a lower dispatch floor, achieving up to 2.24x end-to-end throughput for pruned ViTs.
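To make the idea of ragged attention concrete: after token pruning, each image in a batch can keep a different number of tokens, so the batch is no longer a rectangular tensor. The sketch below is a plain-Python reference, not the paper's Triton kernel. It assumes a packed layout in which all surviving tokens are concatenated and a list of cumulative offsets marks where each image starts and ends. The function name `ragged_attention` and the parameter `cu_seqlens` are hypothetical names chosen for illustration.

```python
import math

def ragged_attention(q, k, v, cu_seqlens):
    """Reference softmax attention over packed, variable-length sequences.

    q, k, v: lists of token vectors, all sequences concatenated (assumed layout).
    cu_seqlens: cumulative offsets, e.g. [0, 3, 4] for sequences of length 3 and 1.
    Tokens attend only to tokens from the same sequence.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for i in range(start, end):
            # Scaled dot-product scores against keys of the same sequence only.
            scores = [
                scale * sum(a * b for a, b in zip(q[i], k[j]))
                for j in range(start, end)
            ]
            # Numerically stable softmax: subtract the row maximum first.
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            total = sum(weights)
            # Weighted average of the values in this sequence.
            row = [
                sum(w * v[start + j][d] for j, w in enumerate(weights)) / total
                for d in range(len(v[0]))
            ]
            out.append(row)
    return out

# Two images: one kept 3 tokens after pruning, the other kept 1.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = q
v = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [7.0, 9.0]]
out = ragged_attention(q, k, v, [0, 3, 4])
```

A Python loop like this does no GPU dispatch at all, so it says nothing about the dispatch overhead the paper measures. It only shows the computation a ragged kernel has to reproduce without padding every image to the same token count.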
